Language Log

Vignettes of quality data impoverishment in the world of PRC AI

February 23, 2023 @ 8:00 am · Filed by Victor Mair under Artificial intelligence, Computational linguistics, Data bases

Some snippets:

Limited data sets a hurdle as China plays catch-up to ChatGPT

Lack of high-quality Chinese texts on Internet a barrier to training AI models.

Ryan McMorrow, Nian Liu, Eleanor Olcott, and Madhumita Murgia, FT, Ars Technica (2/21/23)

…

Baidu struggled with its previous attempt at a chatbot, known as Plato, which analysts said could not even answer a simple question such as: “When is Alibaba co-founder Jack Ma’s birthday?”

Analysts point to the lack of high-quality Chinese-language text on the Internet and in other data sets as a barrier for training AI software.

GPT, the program underlying ChatGPT, sucked in hundreds of thousands of English academic papers, news articles, books, and social media posts to learn the patterns that form language. Meanwhile, Baidu’s Ernie has been trained primarily on Chinese-language data as well as English-language data from Wikipedia and Reddit.

…

China's ChatGPT Fever

Irene Zhang, ChinaTalk (2/21/23)

This piece by Xiao Fang 肖芳, a reporter for business news outlet Jiemian, made waves over the past week. It identifies two major challenges impeding Chinese AI development: paltry training materials and toxic competition in the technology industry.

Baidu’s Plato seems possessed by a low-class internet troll; there is truth to the popular online joke that it was trained on the Weibo comment section. Thanks to the burdened development of China's Internet content industry over the past decade, the quality of Chinese Internet content has deteriorated consistently. […] Moreover, each content platform has turned into a data silo in order to maximize user traffic, time, and business value. Even the contents of various contracts and documents have to be paywalled; how can you expect the Chinese version of ChatGPT to help you write emails? [Jordan [Schneider]: so much for China’s ‘data advantage’…]

Last year, I heard a professor from Peking University share a set of data at a media event, confirming the current situation of Chinese Internet content quality:

As of 2021, although the numbers of Simplified Chinese Internet users and English Internet users are comparable, English content accounts for 60.4% of the top 10 million websites in global rankings, while Chinese content accounts for only 1.4%.

The poor quality of Chinese Internet content is the result of Chinese Internet companies, represented by Baidu and ByteDance, who rush to make quick profits. Instead of patiently transporting more books and literature into the Internet, these platforms judge the quality of content based on whether it kills time and drives revenue. After several years of precipitation, it is now difficult to search for high-quality information on the internet in Simplified Chinese, and it should not surprise us that these chatbots confuse themselves as soon as they are asked meaningful questions. [Jordan: I wonder to what extent the overall quality of discussion and thinking in a language will reflect how good its LLMs are. My hunch is that this is not how it will end up working out…]

Knowledge / data bases for training ChatGPT rivals are severely limited in China for the simple fact that most of the contents of the WWW are banned behind the Great Firewall. So long as the PRC is governed by the CCP, that is unlikely to change. Another liability for the development of ChatGPT rivals is that Chinese firms are dependent on the US for high speed processors. That too will not change in the foreseeable future. As to whether the linguistic foundations of data processing affect the volume of throughput one way or another, that remains to be demonstrated.

Selected readings

"Uh-oh! DeepL in the classroom; it's already here" (2/22/23)
"DeepL Translator" (2/16/23)
"GLM-130B: An Open Bilingual Pre-Trained Model" (1/25/2023)
"ChatGPT writes Haiku" (12/21/22)
"Translation and analysis" (9/13/04)
"Welcome to China" (3/10/14)
"Alexa down, ChatGPT up?" (12/8/22)
"Detecting LLM-created essays" (12/20/22)
"Artificial Intelligence in Language Education: with a note on GPT-3" (1/4/23)
"ChatGPT: Theme and Variations" (2/21/23)

[h.t. Michael Carr and Bill Benzon]

11 Comments

Callum said,

February 23, 2023 @ 11:18 am

And why should tech companies have unfettered access to free, high quality data without the producers' say or the end user's opinion on the eventual product? This is all an accident, of course, and better mechanisms could be devised for data protection, but this is a useful reminder that the internet could be different. It doesn't have to be the case that Silicon Valley can plunder a natural online resource without cost or consequence.
Nathan said,

February 23, 2023 @ 12:26 pm

There's no "plundering" going on. These companies have deprived no one of anything. All of this text is still freely available to us all, just as we all intended when we typed it. We can use it for corpus linguistics, AI training, seeding an RNG, testing data compression algorithms, whatever we want. Isn't that wonderful? No fair trying to alter the deal after the fact.
Taylor, Philip said,

February 23, 2023 @ 1:05 pm

"Another liability for the development of ChatGPT rivals is that Chinese firms are dependent on the US for high speed processors" — Is that true ? According to Reuters (March 23, 2022), "[c]urrently, Taiwan Semiconductor Manufacturing Co. (2330.TW) builds the bulk of Nvidia chips".
Dan Blum said,

February 23, 2023 @ 3:08 pm

There may be some issues with China purchasing chips from Taiwanese companies.
Peter Metcalfe said,

February 23, 2023 @ 4:52 pm

Link the Chinese AIs up with TikTok. What could go wrong?
AntC said,

February 23, 2023 @ 5:21 pm

@Philip "… Chinese firms are dependent on the US for high speed processors" — Is that true ?

The picture is complex, but broadly yes. In semiconductor manufacture — which has a tangled supply chain — you need to distinguish sharply between Mainland China vs Taiwan. From your press piece

Taiwan Semiconductor Manufacturing Co. (2330.TW) builds the bulk of Nvidia chips and Huang said "being a foundry at the caliber of a TSMC is not for the faint of heart,"

Building/commissioning a 'foundry' (manufacturing plant) is a huge outlay of capital and years in the planning. Most of TSMC's foundries are in the Mainland, but this is for mass-producing 'low-tech' = old tech chips (like for mobile handsets). The article you link to is nearly a year old, which might as well be the dark ages. Specifically it reflects the position from before PRC threw its weight behind Putin. The strategic challenge today for the hi-tech end is keeping leading-edge knowledge away from a potentially hostile power (PRC). Today's concerns.

It'll surprise no-one that as soon as Mainland foundries start manufacturing a new chipset, they'll try to reverse-engineer and steal the technology [**]. Over the past year, the U.S. has put its foot down and insisted nothing counting as 'leading edge' gets anywhere near the Mainland — not even auxiliary components, let alone the CPU. So TSMC is building an advanced tech plant in Taiwan (outside Kaohsiung).

Intel does retain significant high-end manufacturing in the U.S. But is not producing anything like the needed volumes. Typically U.S. chipset designers will prototype in the U.S. then subcontract bulk manufacture to TSMC or other Taiwan specialists, who commoditise the process, then mass-produce in the Mainland. This is the reason the U.S. is such a firm ally to Taiwan. The quid pro quo is the U.S. wants Taiwan to stop subcontracting any new tech to the Mainland, so rapidly PRC will fall behind. (Other players like Thailand/Vietnam are trying to get into the mass-production end; but again there is huge investment and a long lead time needed.)

Complicating all this is that some of the essential minerals are currently sourced only from the Mainland. There are other possible sources (not yet developed), but they're all in politically sensitive countries.

So it's a delicate dance of co-opetition from US hi-tech to Taiwan to PRC. I'd say U.S. administrations (of either colour) have gotten too relaxed for too long in treating China as if not friendly at least not hostile. PRC playing their usual waiting game goes very well for them given the long lead times getting a 'foundry' operational.

[**] PRC has also been making huge employment offers to Taiwanese engineers — who used to travel freely to the U.S. Taiwan has now blacklisted those people: if they return to Taiwan, they can't work in any semiconductor company for 5 years — by which time their knowledge will be useless.
Doctor Science said,

February 23, 2023 @ 9:23 pm

"Limited data sets a hurdle as China plays catch-up to ChatGPT"
— why would they *want* to? I have seen absolutely zero instances of ChatGPT being good for anything, and several of it being somewhere between bad and disastrously bad.

I lie, it is good for generating bullshit. But the world was not suffering some sort of Critical Bullshit Shortage–on the contrary.
AntC said,

February 24, 2023 @ 4:30 am

Can ChatGPT write a good melody?

Spoiler alert: no. And that's after the guy has spent a lot of time rejecting attempts/trying to coach it as to what he wants.

It's a position similar to what Victor points out re lack of training data for MSM: nobody's given ChatGPT enough music/in enough genres.
AntC said,

February 24, 2023 @ 6:53 am

@DrSci I have seen absolutely zero instances of ChatGPT being good for anything,

I think you're being unfair: ChatGPT has proven reasonably effective at writing computer programs (for a fairly narrow but oft-required collection of requirements). I guess I should say effective at a 'flying start' to get the bulk of the work done, with a human to check/refine the detail.

YMMV as to whether those are "good for anything", I suppose.
Doctor Science said,

February 24, 2023 @ 8:48 pm

@AntC: thank you, I had actually never heard that ChatGPT was being used to generate computer programs. I've heard of it only in the context of search engine results & essay/story writing, where it is *worse* than just "bad".
liuyao said,

March 20, 2023 @ 9:00 pm

If anyone could get around the Great Firewall of China, it'd be folks at the tech companies. Not to mention that they are on good enough terms with the government that they could get permission. It's that they didn't know it would have such a huge return on the capability of the language model. (Nor did folks at Google and Meta, for that matter; or if they did, they couldn't convince their higher-ups.)

What is more astonishing, in China and for folks here, is that allegedly only a fraction of the data used to train ChatGPT was in non-English languages, yet it still beats anything that has come before, hands down. Does that mean if there is a corpus of high-quality writing in any language, all other languages would benefit? Presumably the model was not "tuned" to do translation, yet it can do a rather good job at it after being shown only a few pairs (what they call "few-shot learning").
