The implications of Chinese for AI development


New article in EnterpriseAI (October 21, 2021):

"Language Model Training Gets Another Player: Inspur AI Research Unveils Yuan 1.0",  by Todd R. Weiss

From Pranav Mulgund:

This article introduces an interesting new advance in an artificial intelligence (AI) model for Chinese. As you probably know, Chinese has long been held to be one of the hardest languages for AI to crack. Baidu and Google have both been trying for a long time, but have had considerable difficulty, given the complexity of the language. But the company Inspur has just come out with a model called Yuan 1.0 that shows significant advances over previous companies' AIs.

Notable quotes from this article:
"…Yuan 1.0 scored almost 20 percent better on Chinese language benchmarks and took home the top spot in six categories, such as noun-pronoun relationships, natural language inference, and idiom reading comprehension…."

"In the process of creating Yuan 1.0, Inspur built the most comprehensive Chinese language corpus in the world, more than twice the size of the largest existing Chinese corpus, and used all 5TB of it to train this new model."

"There is a lot of interest in big models, but we should expect a series of similar announcements for a while, approaching 1 trillion parameters," said [Karl] Freund. "But soon, it will take a different hardware and software approach, something like Nvidia Grace or Cerebras MemoryX, to scale to 100 trillion parameters for brain-scale AI."

Ultimately, though, one must ask if there is a market for these innovations…. "We think so, but it is just emerging," said Freund. "The models to date are error-prone and can promote bias and misinformation. So, the use of these models remains a bit scary."

The refractory nature of Chinese poses a unique challenge to developers of language models.  On the other hand, some of the experience gained in attempting to cope with Chinese provides information and data that are valuable for improving models for other, less complex languages.  However, one thing that worries me about all of the models they're talking about is that they are BIG.  We have seen that problem of scale all along in the computerization and digitization of Chinese, from Unicode to the big AI models discussed in this article.  In the past, when I raised these issues, pollyannaish people always said to me, "Don't worry, memory is cheap."  But when the computing resources required for these huge projects become truly colossal, cost must surely enter as a significant factor.  That makes one wonder whether such models are practical for actual use, and financially realizable for anything other than experimental, theoretical purposes.
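The scale concern above can be made concrete with a rough back-of-envelope calculation (my own illustration, not from the article): at half precision, each parameter occupies two bytes, so the storage needed just to hold a model's weights grows linearly with parameter count, before counting optimizer state or activations.

```python
# Back-of-envelope: memory to store model weights alone, at a given
# numeric precision (2 bytes = fp16, 4 bytes = fp32). Training adds
# optimizer state and activations on top, multiplying the total.
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Gigabytes of storage for n_params weights at bytes_per_param each."""
    return n_params * bytes_per_param / 1e9

# From ~1 billion up to the "brain-scale" 100 trillion parameters
# mentioned in the quoted remarks by Karl Freund.
for n in (1e9, 100e9, 1e12, 100e12):
    print(f"{n:.0e} params -> {weight_memory_gb(n):,.0f} GB (fp16 weights)")
```

Even at these optimistic numbers, a 100-trillion-parameter model needs on the order of 200 terabytes for its weights alone, which is why Freund points to specialized hardware such as Nvidia Grace or Cerebras MemoryX for that regime.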


Selected readings


[h.t. Bill Benzon]


  1. John F. Carr said,

    October 26, 2021 @ 7:22 am

    On the AI pollyannas' "memory is cheap": Fast memory is expensive. The image recognition contests of the 2010s worked on 224×224 pixel images (1/20 of a megabyte) because that was a convenient size to fit in the tiny fast memory on a GPU card.

    Traditionally, training a neural network was much slower than using it. I assume this is still true, but I am no longer working in the field. It could pay off to build a custom supercomputer for training and farm the translation out to commodity hardware. Maybe you could ask the NSA what they do.

  2. Andy Stow said,

    October 26, 2021 @ 12:41 pm

    "Maybe you could ask the NSA what they do."

    I'm sure they'll pop in here now and answer if they want to.

  3. AntC said,

    October 26, 2021 @ 8:14 pm

    other, less complex languages

    One of the messages drummed in at my Linguistics Intro course was that all languages are complex — just complex in different ways. And myl's linked post on 'Information content' would appear to bear out "the differences … appear to be quite small".

    One of the reasons learning Chinese (Sinolects) is complex for (human) users of languages with alphabetic writing systems is the sheer number of glyphs to memorise. Surely that's not a limitation AI faces: we're not talking about volumes of memory to hold the characters — which would be comparable to the number of head-words in a dictionary for alphabetic languages. IOW it's not an "infinitude".

    Prof Mair has often enough opined that Putonghua was easy to learn phonetically — if you avoided the writing system. There is a difficulty getting from spoken to written form, because of the number of homophones. Is that more of a hazard than with (say) English spelling? How come?

    So what's the nature of the 'complexity' that makes the challenges for noun-pronoun relationships, natural language inference, and idiom reading comprehension, …?

    I can't help but feel we're not comparing apples with apples here. If myl (in the Lila Gleitman English speech-to-text piece) thinks that the "about normal" error rate produces "impressive" results, perhaps all that's going on is the Chinese researchers don't find that error rate acceptable — in which case I agree with them. And perhaps achieving even small reductions in the error rate for English would also require dramatic amounts of extra training data and computing power?

  4. Jaime Teddy said,

    October 26, 2021 @ 10:06 pm

    I think Chinese is not difficult to speak, only the writing is more complicated. My wife speaks Chinese very well, and I am still learning from her.

  5. Peter Dirix said,

    October 27, 2021 @ 3:58 am

    Working in speech recognition and natural language understanding, I can say that there are languages far more challenging than Mandarin. The hardest "big" language in any case is Arabic, but languages like Hindi, Hebrew, Japanese, and Thai are also considerably more difficult than others.
