CHINA’S burgeoning AI development firm, DeepSeek, is making waves in the global tech and investment world, sparking debate within China about the factors behind its surprising success against international rivals such as OpenAI’s ChatGPT.
The AI startup has garnered praise for its performance, affordability, and open-source architecture. However, a growing consensus in online communities suggests that its incorporation of Chinese characters during pre-training may be a significant contributing factor.
The prevailing theory is that the higher information density of Chinese training data has enhanced DeepSeek’s logical capabilities, enabling it to process complex concepts more effectively. Proponents of this view argue that training on Chinese has sharpened DeepSeek’s language comprehension. Because Chinese characters are ideographic, a character can still convey its meaning even when written incorrectly, which aids reader comprehension.
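The information-density claim can be illustrated with a toy comparison. The sentence pair below is an assumption chosen for illustration, not material from DeepSeek’s training data; it simply shows that the same statement typically takes far fewer characters in Chinese than in English, one rough proxy for “density”:

```python
# Illustrative only: character counts for the same sentence in two languages.
english = "Artificial intelligence is transforming the technology industry."
chinese = "人工智能正在改变科技行业。"  # a rough Chinese rendering of the line above

print(len(english))  # character count of the English sentence
print(len(chinese))  # character count of the Chinese sentence
```

Character count is of course a crude proxy; what matters to a model is how much meaning each token carries, but the gap in raw length hints at why the argument is made.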
“Chinese characters achieve maximum information transmission with minimal cost. As an efficient information encoding, Chinese has greatly improved efficiency and reduced costs in the processing of artificial intelligence,” Xiang Ligang, a telecommunications industry analyst and public opinion leader, said on social media, as reported by the South China Morning Post.
Others suggest that Chinese characters’ close association with multifaceted information, such as images and audio, has played a role. Traditional Chinese poetry, often paired with paintings or music, may have provided DeepSeek with rich multimodal learning material.
Yang Zhuoran, assistant professor at Yale University, speaking to technology media portal DeepTech, highlighted the importance of data quality in training large models. Data quality, he explained, influences not only a model’s ability to acquire and express knowledge but also the style and accuracy of its generated content.
DeepSeek’s training data sources remain confidential, but speculation suggests they include classical literature, internet slang, academic papers, government documents, and regional dialects.
This speculation echoes concerns raised when ChatGPT first gained popularity. Critics feared that Chinese internet censorship could lead to a scarcity of Chinese-language data, potentially hindering China’s AI sector. However, some now argue that the abstract nature of internet language, influenced by keyword censorship, may have inadvertently benefited the model’s training.
Chinese internet users often employ homophones or indirect expressions to circumvent censorship, resulting in greater language complexities. A single character can have multiple meanings, initially posing a challenge for AI. However, as one user commented, with more training, the model learns to understand and generate these cryptic expressions, improving its capabilities.
DeepSeek’s proficiency in handling Chinese has impressed many. Users have employed it to write in classical Chinese, generate couplets, translate dialects, and even draft official documents, with several praising it for surpassing previous AI models.
The academic community, however, points out that training on Chinese-language data and sources is not new, suggesting that DeepSeek’s approach is not entirely original. They emphasise the importance of high-quality training data, training strategies, and extensive iterative optimisation.
Chinese tech blog Shi Yu Xing Kong notes that there is no inherent language barrier in AI’s understanding of human knowledge. Whether Chinese or English, AI learns the same information.
Intriguingly, users interacting with DeepSeek’s AI in English sometimes find Chinese text unexpectedly appearing mid-conversation, a phenomenon observed in both DeepSeek-R1 and OpenAI’s latest o3-mini.
According to the DeepSeek-R1 technical report, the training process comprises two stages. The first involves collecting a large amount of chain-of-thought (CoT) data to fine-tune the DeepSeek-V3 base model. In the second stage, reinforcement learning (RL), researchers design rewards for accuracy and formatting. By scoring each generated response, these reward signals guide the model’s optimisation.
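A minimal sketch of what such rule-based accuracy and formatting rewards might look like. The `<think>` tag convention and the 0/1 scoring values here are assumptions for illustration, not DeepSeek’s actual implementation:

```python
import re

def format_reward(response: str) -> float:
    # Reward responses that wrap reasoning in <think>...</think> tags
    # followed by a final answer (tag convention assumed for illustration).
    pattern = r"^<think>.+</think>.+$"
    return 1.0 if re.match(pattern, response, re.DOTALL) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    # Reward responses whose final answer matches a known reference.
    final = response.split("</think>")[-1].strip()
    return 1.0 if final == reference_answer else 0.0

def total_reward(response: str, reference_answer: str) -> float:
    # Combined scalar signal that would steer the RL optimisation step.
    return accuracy_reward(response, reference_answer) + format_reward(response)

response = "<think>2 + 2 equals 4.</think>4"
print(total_reward(response, "4"))  # 2.0
```

The appeal of rule-based rewards like these is that they need no learned reward model: a correct, well-formatted answer can be checked mechanically, making the feedback cheap to compute at scale.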
-BTS Media