OpenAI's Chinese problem


We have expressed concern over the quality of training and source materials for Chinese AI and LLMs.  Less than a week ago, we examined "AI based on Xi Jinping Thought" (5/21/24), which may be considered an attempt to "purify" what goes into Chinese AI.  It turns out that there really is a problem, and it is affecting not just China's own AI efforts but infecting ours as well.

OpenAI’s latest blunder shows the challenges facing Chinese AI models:
Finding high-quality data sets is tricky because of the way China’s internet functions.
By Zeyi Yang, MIT Technology Review (May 22, 2024)

As we shall soon see, pursuing this topic takes us into very sensitive, disquieting territory concerning the nature of China's internet.  It will be difficult for us to avoid assessing the quality of China's knowledge base and information resources overall.

Here are the opening paragraphs of the MIT Technology Review article by Zeyi Yang:

Last week’s release of GPT-4o, a new AI “omnimodel” that you can interact with using voice, text, or video, was supposed to be a big moment for OpenAI. But just days later, it feels as if the company is in big trouble. From the resignation of most of its safety team to Scarlett Johansson’s accusation that it replicated her voice for the model against her consent, it’s now in damage-control mode.

Add to that another thing OpenAI fumbled with GPT-4o: the data it used to train its tokenizer—a tool that helps the model parse and process text more efficiently—is polluted by Chinese spam websites. As a result, the model’s Chinese token library is full of phrases related to pornography and gambling. This could worsen some problems that are common with AI models: hallucinations, poor performance, and misuse. 

I wrote about it on Friday after several researchers and AI industry insiders flagged the problem. They took a look at GPT-4o’s public token library, which has been significantly updated with the new model to improve support of non-English languages, and saw that more than 90 of the 100 longest Chinese tokens in the model are from spam websites. These are phrases like “_free Japanese porn video to watch,” “Beijing race car betting,” and “China welfare lottery every day.”
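The check the researchers ran is straightforward to reproduce in spirit: take a token vocabulary (a mapping from token strings to ids), filter to one script, and sort by length. Here is a minimal, stdlib-only sketch using a made-up toy vocabulary; the real GPT-4o vocabulary (`o200k_base`) is far larger and would normally be loaded via the `tiktoken` package:

```python
# Sketch: find the longest tokens of a given script in a token vocabulary.
# The toy vocabulary below is purely illustrative; a real inspection would
# load the GPT-4o ("o200k_base") vocabulary, e.g. via tiktoken.

def is_cjk(s: str) -> bool:
    """True if every non-space character falls in the main CJK block."""
    return all('\u4e00' <= ch <= '\u9fff' for ch in s if not ch.isspace())

toy_vocab = {
    "日本": 1001,              # "Japan"
    "中国福利彩票天天": 1002,    # "China welfare lottery every day" (spam-like)
    "北京赛车": 1003,           # "Beijing race car (betting)"
    "你好": 1004,              # "hello"
    "hello": 1005,
}

def longest_tokens(vocab, script_filter, n=3):
    """Return the n longest token strings that pass the script filter."""
    toks = [t for t in vocab if script_filter(t)]
    return sorted(toks, key=len, reverse=True)[:n]

print(longest_tokens(toy_vocab, is_cjk))
```

Run over the real vocabulary, this kind of length-sorted listing is exactly what surfaced the spam phrases: long single tokens only exist because the underlying strings recurred verbatim, many times, in the tokenizer's training sample.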

Since such phrases account for more than 90 of the 100 longest Chinese tokens in the model, even Chinese observers find it alarming:

“It’s an embarrassing thing to see as a Chinese person. Is that just how the quality of the [Chinese] data is? Is it because of insufficient data cleaning or is the language just like that?” says Zhengyang Geng, a PhD student in computer science at Carnegie Mellon University.

On behalf of the Chinese people, I have been complaining for years that most of the world's internet is not available to them.  Even with VPNs (and not everybody can afford a VPN), there are so many websites that they just can't reach.  That's one of the main reasons why I personally do not like to spend much time in China — you just feel so cut off. 

If OpenAI and search engines such as Google want to include China in the worldwide flow of information, they have no choice but to mine what is available on the Chinese internet, and little is available there beyond what the paranoid Chinese government permits.  In any event, the products of neither OpenAI nor Google are accessible in China, nor are Wikipedia and so many other extremely valuable sources of information from outside the Great Firewall.


It could be tempting to draw a conclusion about a language or a culture from the tokens OpenAI chose for GPT-4o. After all, these are selected as commonly seen and significant phrases from the respective languages. There’s an interesting blog post by a Hong Kong–based researcher named Henry Luo, who queried the longest GPT-4o tokens in various different languages and found that they seem to have different themes. While the tokens in Russian reflect language about the government and public institutions, the tokens in Japanese have a lot of different ways to say “thank you.”

After I published the story [in the link mentioned a few paragraphs above], Victor Shih, a political science professor at the University of California, San Diego, commented on it on X: “When you try not [to] train on Chinese state media content, this is what you get.”

It’s half a joke, and half a serious point about the two biggest problems in training large language models to speak Chinese: the readily available data online reflects either the “official,” sanctioned way of talking about China or the omnipresent spam content that drowns out real conversations.

In fact, among the few long Chinese tokens in GPT-4o that aren’t either pornography or gambling nonsense, two are “socialism with Chinese characteristics” and “People’s Republic of China.” The presence of these phrases suggests that a significant part of the training data actually is from Chinese state media writings, where formal, long expressions are extremely common.

The lack of quality training data is a much bigger problem than the failure to filter out the porn and general nonsense in GPT-4o’s token-training data. If there isn’t an existing data set, AI companies have to put in significant work to identify, source, and curate their own data sets and filter out inappropriate or biased content. 

It doesn’t seem OpenAI did that, which in fairness makes some sense, given that people in China can’t use its AI models anyway. 

Still, there are many people living outside China who want to use AI services in Chinese. And they deserve a product that works properly as much as speakers of any other language do.

I would tend to disagree with the author's last point.  If China doesn't want its people to know about the world and doesn't want the world to know what's going on inside the Great Firewall, I don't think AI companies, Google, or any other outside entity should make costly, time-consuming efforts to compensate for the willful obscurantism of the CCP/PRC (except maybe foreign intelligence agencies).


Selected readings

[h.t. Mark Liberman]


  1. Tore said,

    May 26, 2024 @ 5:34 pm

    Reading the part about Japanese, the training data is clearly from 2channel, which is quite worrying when you think about the inherent bias of the content.

  2. Anna said,

    May 26, 2024 @ 6:45 pm

    The Italian list in the referenced article contains very few Italian words. A lot of English, a little bit of French and German, a bunch of words that look like misspelled Italian…

    I can't imagine where it comes from or make any sense of it.
    The foreign words are not in common use in Italian, and the misspellings are not typical native-speaker mistakes.

  3. Alex Shpilkin said,

    May 27, 2024 @ 7:31 am

    Re your last sentence—I don’t know. OpenAI’s goal is profit (for all that it claimed differently at first), and to some extent it’s silly to ask it to work for the benefit of people it won’t be able to sell to anyway.

    On the other hand, this amounts to equating the will of the Chinese population with that of the Chinese government, which is of course what every authoritarian government wants, and furthering the isolation of Chinese culture from the rest of the world, which is again furthering the CCP’s goals for it. I don’t know of a good answer, but the one you give here seems overly dismissive in that regard.

    (While I’m not Chinese you can guess from my name I also have first-hand experience with this question and this particular approach to it.)

  4. SusanC said,

    May 27, 2024 @ 10:06 am

    Tokenisers, specifically, are very vulnerable to random junk that just happened to be present in the sample of text that was used to train the tokenizer.

    "SolidGoldMagikarp" and other similar nonsense in OpenAI's tokeniser was just random stuff that happened to be in the text sample they used to train the tokeniser.

    There's probably some statistical theorem to the effect that a random sample of text of suitable length N will probably contain some tokens that are much more frequent in the sample than they are in the distribution the sample is taken from.
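SusanC's conjecture is easy to demonstrate empirically: even when every type in a vocabulary is equally likely, the most frequent type in a finite sample will be several times more common in the sample than in the underlying distribution. A quick stdlib-only simulation (the vocabulary and sample sizes here are arbitrary choices for illustration):

```python
import random
from collections import Counter

random.seed(0)   # fixed seed for reproducibility

V = 10_000       # vocabulary size; true frequency of each type is 1/V
N = 50_000       # sample size; expected count per type is N/V = 5

# Draw N items uniformly at random from the V types.
sample = [random.randrange(V) for _ in range(N)]
counts = Counter(sample)
top_type, top_count = counts.most_common(1)[0]

expected = N / V
print(f"most frequent type appears {top_count} times (expected ~{expected:.0f})")
# The maximum over 10,000 types sits far above the mean: pure sampling
# noise makes some "tokens" look much more frequent than they really are.
```

This is essentially the maximum of many independent Poisson-like counts, which is guaranteed to land well above the per-type expectation, just as SusanC suggests.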

  5. KWillets said,

    May 27, 2024 @ 10:54 am

    The Korean longest tokens have mostly mundane origins. Dozens of them are simply common verbs with formal tense endings, which add 2-3 syllables, and others reflect the syllable explosion of transliterating English words into Hangul (eg "site" to "sa-i-teu"). If a proper stemmer were used I suspect the list would change dramatically.

    In the same vein I don't interpret the predominance of spam phrases in the Chinese list to mean the same about the distribution of input documents — spammers simply repeat themselves a lot, and in longer phrases. The fact that many have been condensed to a single token reflects some success in minimizing their impact.

  6. Feanor said,

    May 27, 2024 @ 2:58 pm

    If Chinese texts from mainland China are unavailable in sufficient quantity, what about texts from Singapore, Taiwan and Hong Kong? Is that insufficient too?

  7. Victor Mair said,

    May 28, 2024 @ 7:03 am

    ChatGPT doesn't understand Chinese well. Is there hope?

    A Look into Text Corpus, the Training Data that Matters to AI's Accuracy, in the Chinese World
    Mu Chen, Baiguan
    Mar 30, 2023

  8. ktschwarz said,

    May 28, 2024 @ 7:48 pm

    Anna: "The Italian list in the referenced article contains very few Italian words…" Thanks for pointing that out. Henry Luo assigned tokens to languages using "methods like langid and langdetect" (those are Python packages), which are evidently bad at identifying Italian. (Not quite as obviously bad with the other European languages, but still, for example, understandably, understatement, and undergraduate were classified as German.) Luo just copied what the scripts spit out, not recognizing how much of it was wrong, and whipped up blather from them like "a community that values growth, individualization, and the exploration of potential outcomes." He sounds like a chatbot himself.

    Hilariously, adipiscing is in the Italian list — that is, of course, not Italian but lorem ipsum! Evidently there's enough lorem ipsum in the training data to make it worth tokenizing!

    There's a non-English word in the English list, too: recommandations. Yes, there's a lot of misspelled English and code-switching around the net, but there is no way recommandations is that much more common than recommendations in actual English text. In fact, recommandations is a correctly spelled French word. The tokenizer would have found it in actual French texts, but Luo's scripts incorrectly identified it as English.

  9. Dominik Lukes said,

    May 29, 2024 @ 3:14 am

    I just wanted to point out a fundamental potential misunderstanding here. The tokenizer training data set is not the same as the LLM training data set. It just chooses the tokens with which to segment the data in the most efficient manner. Most tokens are parts of words, not whole words. If the chosen tokens are representative of the tokenizer training set but not of the actual training set, this just means that those tokens will be 'undertrained' – i.e., not contributing much to the model's performance.

    It's even slightly misleading to call it 'tokenizer training' (even though that's the term of art). It uses a compression algorithm called BPE (Byte Pair Encoding) that simply finds the most frequent byte pair, replaces it with a single new symbol, and keeps going until a desired vocabulary size is reached. There is no perfect vocabulary size – the best LLMs range from 30k to 200k tokens with similar overall performance.

    The one reason to have more tokens is to make running LLMs in other languages less expensive because all prediction is done on tokens and the fewer tokens you have to generate, the cheaper it is. It also somewhat improves quality on some languages but the main reason is cost.

    In short, all those 'biased' tokens will not really contribute to the 'bias' of the LLM, they just make it less computationally efficient for those languages.
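The BPE procedure Lukes describes can be sketched in a few lines: repeatedly find the most frequent adjacent pair of symbols in the corpus and merge it into a new symbol, stopping after a target number of merges. Here is a minimal character-level sketch; production tokenizers such as GPT-4o's operate on bytes, with heavily optimized implementations:

```python
from collections import Counter

def bpe_train(corpus: str, num_merges: int):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    seq = list(corpus)            # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Replace every occurrence of the pair (a, b) with one symbol.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq

merges, seq = bpe_train("aaabdaaabac", 3)
print(merges)  # ['aa', 'aaa', 'aaab']
```

On the classic toy string `aaabdaaabac`, three merges produce the multi-character tokens `aa`, `aaa`, and `aaab`: frequently repeated material, like spam boilerplate, gets fused into ever-longer single tokens, which is exactly how whole spam phrases ended up as single entries in GPT-4o's vocabulary.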

  10. Dominik Lukes said,

    May 29, 2024 @ 3:34 am

    BTW: I've just read the MIT tech review article and it is just awful. It completely ignores the tokenizer-training vs. model-training distinction. But what's worse, it starts by correctly saying that 90 out of 100 of the longest Chinese tokens were from spam websites; this is not surprising, as lots of long tokens in English come from Reddit user names (like _GoldenKarp). But in the very next paragraph it says that 90% of the training data is from these websites, which is not true of either the tokenizer or the model training data set.

    The general lesson is, you cannot learn true things (good or bad) about LLMs from the mainstream press. Even when they report some correct facts, they distort them and cannot evaluate their true import.

  11. Rodger C said,

    May 29, 2024 @ 9:56 am

    The general lesson is, you cannot learn true things (good or bad) about LLMs from the mainstream press. Even when they report some correct facts, they distort them and cannot evaluate their true import.

    That's why you can't learn true things about anything from the mainstream press.

  12. Philip Taylor said,

    May 30, 2024 @ 11:55 am

    Oh, I don't know, Rodger —

  13. Benjamin E. Orsatti said,

    May 31, 2024 @ 8:04 am

    (1) Philip Taylor wins this thread;
    (2) Look how much bigger Hitler's head is than Chamberlain's — was that gallows typesetter humo(u)r?

  14. Philip Taylor said,

    June 3, 2024 @ 4:53 am

    More likely just the lack of Photoshop !

  15. Victor Mair said,

    June 4, 2024 @ 9:59 am

    As China’s Internet Disappears, ‘We Lose Parts of Our Collective Memory’

    The number of Chinese websites is shrinking and posts are being removed and censored, stoking fears about what happens when history is erased.

    By Li Yuan
    June 4, 2024
    Updated 9:16 a.m. ET

    Chinese people know their country’s internet is different. There is no Google, YouTube, Facebook or Twitter. They use euphemisms online to communicate the things they are not supposed to mention. When their posts and accounts are censored, they accept it with resignation.

    They live in a parallel online universe.
