Hype over AI and Classical Chinese / Literary Sinitic

From the get-go, I'm dubious about any claims that current AI can fully and accurately translate Classical Chinese / Literary Sinitic (CC/LS) into Modern Standard Mandarin (MSM), much less English or other languages, on a practical, functional basis.  Since the following article is from one of China's official propaganda "news" outlets (China Daily [CD]), the chances that we will get an accurate accounting of the true situation are next to nil anyway.

Language system translates ancient Chinese texts

By Li Wenfang in Guangzhou | China Daily | Updated: 2023-11-03 09:42

It starts out on a sour note:

If foreigners learning Chinese think the modern language is difficult to grasp, they should be glad they don't have to learn classical Chinese. Ancient texts are far more challenging, and not easy for even native Chinese speakers to decipher.

This is a cockamamie approach to the analysis of a written language in its ancient stages.  What is it about ancient classical Chinese texts that makes them so difficult?  How do they differ from modern Chinese texts?  What about their morphology, their grammar, their syntax, their phonology and prosody, their lexicon, their literary allusions…?

A fundamental, fatal flaw in the conceptualization of Sinitic on the part of conservative indigenous scholars is the assumption that there are no essential linguistic discrepancies between CC/LS and MSM, only stylistic disparities.

Anyway, for what it's worth, the CD article continues:

Thankfully, a team of researchers from the South China University of Technology has made such work easier. The team has developed a large, artificial intelligence-powered language machine that automatically translates ancient Chinese texts into modern language.

I would love to see a sample of such work — if it really exists.

The team won first prize in the international ancient Chinese machine translation contest during the Machine Translation Summit 2023 held in the Macao Special Administrative Region in September. The system could help enhance people's understanding of Chinese history and promote traditional Chinese culture, said Jin Lianwen, the professor who led the research team in their work at the university's deep learning and vision computing lab.

It could help.

It could also help with data mining and analysis, and intelligent development and application related to ancient texts and relics, he said.

It could also help.

The massive system requires powerful computing capacity, which was a huge hurdle the team had to overcome during its research, Jin said, adding that the team received multiple graphic processing unit servers from a cooperating company.

The machine translation is meant to provide classical Chinese enthusiasts with a convenient way to gain a general understanding of ancient texts, with Jin saying that authoritative publications should be the go-to sources for precise translations.

"massive system requires powerful computing capacity" — not for commercially viable applications, if it effectively works at all with enormous computer capacity

"multiple graphic processing unit servers" — the graphics are not the real problems for machine translation of ancient texts; what's needed are means for dealing with the starkly differing structures, semantics, grammars, etc. — i.e., the linguistic attributes — of the ancient language and the modern language (see below)

"general understanding" vs. "precise translations"

Where / what are these "authoritative publications" that can be used as "go-to sources"?

The major challenge in robotic ancient language translation was the lack of high-quality ancient language data, Li Bin, an associate professor in the linguistic technology department at Nanjing Normal University, told a seminar at the summit.

Since "high-quality ancient language data" is lacking, where are the researchers going to get it?  It does not appear that the machines are going to supply it.  Indeed, it would appear that the machines need this "high-quality ancient language data" to function properly.  The researchers at South China University of Technology and Nanjing Normal University might want to check out the Chinese Text Project, which includes hundreds of digitized ancient Chinese texts and for all intents and purposes was built by one man, Donald Sturgeon, beginning in 2006.

At present, such translation relies heavily on the professional knowledge of classical language experts.

Ah, the human brain and scholarly expertise!

Jin's team has also developed a system that recognizes and analyzes ancient Chinese text on pictures. 

Not sure what is meant by this.

The system automatically locates, extracts and arranges related texts in order. The texts can then be punctuated and translated into modern Chinese using the aforementioned language system.

The algorithm for the system has been optimized to address challenges such as analyzing text on wrinkled or creased photos or those with low resolution.

In cooperation with Shanghai University and the Intsig Information Co, Jin's team has also created a system for analyzing and recognizing texts written in the language of China's Yi ethnic group.

Classical text translation technology, when combined with text recognition technology, can facilitate ancient text digitalization and understanding.

"When joined by powerful AI technology such as ChatGPT, it can become an interactive system for understanding ancient texts."

Ah, ChatGPT!  There's the rub / nub!

The team will continue its research on ancient text understanding and protection, Jin said, adding that sufficient computing capacity will be necessary for further research.

Yes, BIGGER computers, but they also need much, much better programming.

The Chinese name of the lab is Shēndù xuéxí yǔ shìjué jìsuàn 深度学习与视觉计算 (Deep Learning and Visual Computing).  This is their website.  I looked through it quickly, but among their projects and patents, I didn't see anything that already achieved machine translation, particularly from ancient languages to modern languages.  They seem to be focusing on OCR and image recognition.

They don't really say how their system will work or what it can actually do with regard to the claims made in the CD article.  So far, it seems to be an impractical pipe dream.

As to why the Chinese felt compelled to rush into print with this before they really had a feasible product, judging from all the documentation provided on Language Log and elsewhere, ChatGPT and other LLMs have been making enormous strides toward solving the problem of how to get from ancient languages to modern languages, whereas similar advances have not been forthcoming for Sinitic.

Japanese researchers have gone part of the way, as described here:
"Literary Sinitic / Classical Chinese dependency parsing" (11/27/19)

The hypercomplex nature of the script and the minimalist grammar of the language indeed pose formidable (virtually insuperable) challenges to machine translation.  It's like the brick wall that even a seeming genius like Lin Yutang confronted when he tried to invent a simple, self-contained typewriter to compete with Christopher Latham Sholes.  No can do.

[Thanks to James Fanell]


  1. Avi Rappoport said,

    November 9, 2023 @ 8:18 pm

    I found the conference: https://www.ancientnlp.com/alt2023/ but no proceedings published.

    This looks similar to the Text Retrieval Conference put on by NIST for many years, which does provide a reality check for the claims of search and information retrieval researchers and vendors. In cases where the technology is in early stages, the winner may be very slightly better at a task than the other entrants, but none of them perform very well at all.

  2. Anselm Lingnau said,

    November 9, 2023 @ 8:33 pm

    The "multiple graphic processing unit servers" are presumably used to run the AI model(s), which involves parallel calculations that such servers are particularly good at. Nothing in particular to do with the graphics involved in classical Chinese.

  3. magni said,

    November 9, 2023 @ 9:07 pm

    This referenced article is reminiscent of the quality I would anticipate from China Daily: poorly written and sensationalized. Nevertheless, it's one of the few magazines/news outlets that English teachers in China can and may recommend to students for reading exercises.
    As far as I understand, OCR technology for processing literary Chinese texts is currently impractical. If researchers are making progress in that area, I commend their efforts.

  4. John Swindle said,

    November 10, 2023 @ 12:54 am

    The AI chatbot Google Bard seems able to concoct Literary Sinitic or early Mandarin text for existing works.

    A user on the social media site Reddit found an old vase inscribed 天何六年督製
    'manufacture supervised in the 6th Year of Tianhe'. Asked about this 6th Year of Tianhe, Bard said it was a fictional reign date from the novel "Water Margin." It was able to provide a substantial quotation, an English translation, and an explanation of context. Pressed for chapter and verse, it apologized and said it couldn't actually find what it had quoted in the original Chinese text and might have hallucinated it. "Ultimately," it said, "the meaning of the inscription is up to the interpretation of the individual."

  5. Andreas Johansson said,

    November 10, 2023 @ 2:04 am

    The power of nomenclature: that modern Chinese speakers find Classical Chinese difficult is noteworthy, but nobody would think it strange that modern Italian speakers find Classical Latin difficult.

  6. Hervé Guérin said,

    November 10, 2023 @ 4:55 am

    GPUs (Graphics Processing Units) are specialized hardware without which it is impossible to conduct LLM (Large Language Model) training and then generation. And they are expensive, hence the funding.

  7. Victor Mair said,

    November 10, 2023 @ 7:00 am

    @Andreas Johansson

    "that modern Chinese speakers find Classical Chinese difficult is noteworthy"

    Excellent observation!

  8. Jonathan Smith said,

    November 10, 2023 @ 8:55 am

    @Andreas Johansson
    and yet the "not easy for even native Chinese" meme is much bigger than nomenclature — from a Chinese POV, "outsider" expertise in literally anything Chinese language/writing related seems worthy of remark; the more arcane the area, the more remark-able. The underlying mindset seems to be that these cultural products are after all of/by/for "the Chinese" and shortcomings in familiarity relative to forners just violate the natural order of things.

  9. Michael Watts said,

    November 12, 2023 @ 1:01 am

    A user on the social media site Reddit found an old vase inscribed 天何六年督製
    'manufacture supervised in the 6th Year of Tianhe'. Asked about this 6th Year of Tianhe, Bard said it was a fictional reign date

    So what is going on with 天何? It doesn't appear to be a non-fictional era name. If you search for it, you get a bunch of results that assume you meant to type 天和. But it seems odd for a manufacturing seal to feature a typo?

  10. John Swindle said,

    November 13, 2023 @ 3:34 pm

    @Michael Watts: I was hoping someone would pick up on that! A web search for "天何六年" (with the quotation marks) does yield some text and image hits, not all of them to the Reddit post. It may or may not have something to do with a Qing Dynasty work called 初月楼闻见录 'Record of Hearing and Seeing in Chuyue Tower', but there seem to be actual physical objects with that date.
