Date: 2023-02-23

Limited data sets a hurdle as China plays catch-up to ChatGPT

Lack of high-quality Chinese texts on Internet a barrier to training AI models.



Ryan McMorrow, Nian Liu, Eleanor Olcott, and Madhumita Murgia, FT, Ars Technica (2/21/23)

Baidu struggled with its previous attempt at a chatbot, known as Plato, which analysts said could not even answer a simple question such as: “When is Alibaba co-founder Jack Ma’s birthday?”

Analysts point to the lack of high-quality Chinese-language text on the Internet and in other data sets as a barrier for training AI software.

GPT, the program underlying ChatGPT, sucked in hundreds of thousands of English academic papers, news articles, books, and social media posts to learn the patterns that form language. Meanwhile, Baidu’s Ernie has been trained primarily on Chinese-language data as well as English-language data from Wikipedia and Reddit.

China's ChatGPT Fever

Irene Zhang, ChinaTalk (2/21/23)

This piece by Xiao Fang 肖芳, a reporter for business news outlet Jiemian, made waves over the past week. It identifies two major challenges impeding Chinese AI development: paltry training materials and toxic competition in the technology industry.

Baidu’s Plato seems possessed by a low-class internet troll; there is truth to the popular online joke that it was trained on the Weibo comment section. Thanks to the burdened development of China's Internet content industry over the past decade, the quality of Chinese Internet content has deteriorated consistently. […] Moreover, each content platform has turned into a data silo in order to maximize user traffic, time, and business value. Even the contents of various contracts and documents have to be paywalled; how can you expect the Chinese version of ChatGPT to help you write emails? [Jordan [Schneider]: so much for China’s ‘data advantage’…] Last year, I heard a professor from Peking University share a set of data at a media event, confirming the current situation of Chinese Internet content quality: As of 2021, although the numbers of Simplified Chinese Internet users and English Internet users are comparable, English content accounts for 60.4% of the top 10 million websites in global rankings, while Chinese content accounts for only 1.4%. The poor quality of Chinese Internet content is the result of Chinese Internet companies, represented by Baidu and ByteDance, who rush to make quick profits. Instead of patiently transporting more books and literature into the Internet, these platforms judge the quality of content based on whether it kills time and drives revenue. After several years of precipitation, it is now difficult to search for high-quality information on the internet in Simplified Chinese, and it should not surprise us that these chatbots confuse themselves as soon as they are asked meaningful questions. [Jordan: I wonder to what extent the overall quality of discussion and thinking in a language will reflect how good its LLMs are. My hunch is that this is not how it will end up working out…]

The knowledge bases and data sets available for training ChatGPT rivals are severely limited in China for the simple reason that most of the content of the WWW is blocked by the Great Firewall. So long as the PRC is governed by the CCP, that is unlikely to change. Another liability for the development of ChatGPT rivals is that Chinese firms depend on the US for high-speed processors. That too will not change in the foreseeable future. Whether the linguistic foundations of data processing affect the volume of throughput one way or another remains to be demonstrated.

