Literary Sinitic / Classical Chinese dependency parsing


We are keenly aware that, while advances in machine translation of Vernacular Sinitic (VS) (Mandarin) are quite impressive and fundamentally serviceable, they cannot be applied directly to the translation of Literary Sinitic / Classical Chinese (LS/CC).  That would be like using an Italian translation program for Latin, a Hindi translation program for Sanskrit, or a Modern Greek translation program for Classical Greek; indeed, it would probably be even less useful than these parallel cases, because LS/CC and VS differ fundamentally from each other in structure and nature.

However, an LS/CC parsing program is now available that takes us a major step toward a functional system for the machine translation of the literary / classical written language (it is only a written / book language, not a spoken one).  It was developed by YASUOKA Koichi 安岡 孝一 of Kyoto University's Institute for Research in Humanities (Jinbun kagaku kenkyūjo 人文科学研究所) and is available here.

For a trial, you can click on "kaiseki 解析" ("analysis") to parse the sample that is already in the text entry field:

未有義而後其君者也

wèi yǒu yì ér hòu qí jūn zhě yě


The parser analyzes it thus:

In my estimation, the grammatical analysis provided by the parser is essentially correct, and provides a workable basis for understanding the meaning of the sentence.

This sentence, by the way, is from the Mencius, a Confucian text dating to the 4th c. BC and named after the philosopher Mencius (372-289 BC).  Here is the English translation of this line by James Legge (1815-1897):

"There never has been a righteous man who made his sovereign an after consideration."

Let's take another LS sentence and see how the parser does with it:

xiào wén shàng lín yuàn yù bài sè fū


The parser analyzes it thus:

Here is a brief critique of the parser's output by Diana Shuheng Zhang:

I think this parsing program needs more allusory input…. It is only suitable when very few or only very apparent proper nouns are involved. For instance, it fails at my first test:

"孝文上林苑欲拜嗇夫" [from "Wén zhì lùn 文質論" ("Discourse on Patterns and Substance") by Ruan Yu 阮瑀 (d. 212)]

"(Emperor) Xiaowen (of the Western Han) desired to appoint to office a bailiff from the Shanglin Imperial Park"

上林苑 is a proper place name for a Western Han royal park; 嗇夫 is an official title, "bailiff" in the Han Dynasty, a title very commonly found in both Qin and Han legal codes for people with lowly status in agencies who were responsible principally for maintaining supplies. (Hucker, A Dictionary of Official Titles in Imperial China, No. 4940, p. 404 [1985 Taiwan edition]) — However, this parsing program could not recognize either of them, and misinterpreted both.

In sum, although the linguistics of this program is quite good, Sinologically it needs more input from literary historians and specialists in translation.
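Zhang's criticism points to a standard remedy: supplying the parser with a gazetteer of proper names and official titles, so that multi-character items like 上林苑 and 嗇夫 are kept intact as single units before dependency analysis. Here is a minimal longest-match sketch in Python; the word list and the segmenter are my own illustration, not part of Yasuoka's system:

```python
# Toy gazetteer of Han-era proper names and official titles.
GAZETTEER = {"孝文", "上林苑", "嗇夫"}
MAX_LEN = max(len(w) for w in GAZETTEER)

def segment(text):
    """Greedy longest-match segmentation: prefer gazetteer entries,
    fall back to single characters otherwise."""
    out, i = [], 0
    while i < len(text):
        # Try the longest possible gazetteer match at position i.
        for n in range(min(MAX_LEN, len(text) - i), 1, -1):
            if text[i:i + n] in GAZETTEER:
                out.append(text[i:i + n])
                i += n
                break
        else:
            out.append(text[i])  # no multi-character match: emit one character
            i += 1
    return out
```

Running segment on Zhang's test sentence keeps 上林苑 and 嗇夫 together as single tokens, which is precisely the step at which the parser currently stumbles.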

Bǔchōng 補充 ("supplement"):  I made that test on purpose, using an allusion-heavy but grammatically simple sentence, only to see how well this parsing program is able to deal with allusions and proper names. Classical Chinese is a special literary language with an enclosed circle of textual producers and recipients, within which all members are presumed to share the tacit understanding of implications and connotations achieved by allusions — a rhetorical tool by which the coterie of Classical Chinese users have attained their mutually-binding and enclosing purposes for the past two millennia. It is therefore almost impossible to fully comprehend a Classical Chinese text without knowing allusions. You simply cannot be a member of it without showing this clique your attestation écrite de soumission ("written attestation of submission"; 投名狀).

For this reason, although it is not this parsing program's responsibility to explain the allusions, to my understanding it is nevertheless the program's job to recognize the allusions and proper names. Clearly, at this stage it still cannot recognize some, if not most, of them. (I'm glad that it got 孝文 right, though! Good first move!)

Here's a paper in Japanese that explains how the system works:

安岡孝一: 漢文の依存文法解析と返り点の関係について, 日本漢字学会第1回研究大会予稿集 (2018年12月1日), pp. 33-48.  [YASUOKA Koichi, "On the Relationship between Dependency Parsing of Classical Chinese and Kaeriten (Reading-Order Marks)," Proceedings of the 1st Research Meeting of the Japan Society for Kanji Studies (December 1, 2018), pp. 33-48.]

[Thanks to Molly C. Des Jardin and Chenfeng Wang]


  1. Jim Breen said,

    November 27, 2019 @ 4:56 pm

    Interesting to see a dependency analysis running for CC. From a quick glance at the paper, it seems they have developed a morpheme lexicon, then are applying fairly standard tools, including MeCab (my favourite morpheme segmenter) to create the dependency tree(s).

    Note that these days such dependency analysis, important and interesting as it is, does not usually play a significant role in machine translation systems. Modern ML has moved on from the older systems which required feeding with a lot of language-specific information. Instead, they are using systems in which neural networks are trained on large amounts of source and translated text.

  2. julie lee said,

    November 29, 2019 @ 5:12 am

    I remember that years ago, when I was a graduate student in Chinese, Professor Chang Chun-shu of the History Dept. at the U. of Michigan told me: "Classical Chinese is not difficult, it's the allusions that are difficult."

  3. Colin McLarty said,

    November 30, 2019 @ 9:47 am

    Jim Breen: When I last looked into this, machine translation between languages with vast digitized corpora of bilingual texts had indeed switched to statistical methods. But professional machine-aided translation, for corporations that needed actual material to hand to users and customers, in the many languages that do not have such corpora, was still largely rule-based. I think rule-based translation (whether into MSM or English) and analysis of LS/CC is still a valuable goal.

  4. liuyao said,

    December 1, 2019 @ 8:47 am

    If a passage is loaded with proper nouns (or dates, places, official titles), MARKUS may be a valuable tool that complements this.

    Hilde De Weerdt at Leiden University is leading this project.

  5. Philip Taylor said,

    December 3, 2019 @ 6:08 am

    Sadly "MARKUS only works with Google Chrome". Whatever happened to the idea of being browser-agnostic ? I can understand a web site requiring HTML 5, since the latter introduces a number of elements (video, audio, canvas, …) that do not exist in previous DOCTYPEs, but to insist on one and only one browser (and especially one that is notorious for covert data mining) is outrageous.
