## Why electronic machine translation services sometimes seem to fail

The inability of Google Translate, Microsoft Translator, Baidu Fanyi, and other translation services to correctly render jī nián dàjí 鸡年大吉 ("may the / your year of the chicken be greatly auspicious!") in various languages points up a vital distinction that I have long wanted to make, and now is as good a time as any. Namely, just as we could not expect these translation services to handle Cantonese, Shanghainese, Taiwanese, etc. (unless specifically and separately programmed to do so), we should not expect them to deal with Literary Sinitic / Classical Chinese (LS / CC).

These are all different languages, and electronic translation software, like human brains, cannot be programmed and trained in such a way that they can simultaneously translate material coming from different languages.

The only exception is when bits and pieces of these other languages have been embedded in Modern Standard Mandarin (MSM) and regularized there in such a way that they have for all intents and purposes been borrowed as part of MSM vocabulary, e.g., mǎidān 买单 / máidān 埋单 / màidān 卖单 ("pay the bill") from Cantonese maai4daan1 埋单 ("call for the bill / check").  Note that, even though MSM mǎidān 买单 / máidān 埋单 / màidān 卖单 ("pay the bill") is written in three different ways with three separate pronunciations, translation software can deal with all of them because they occur in MSM with sufficiently high frequency to be recognized as an integral, naturalized part of MSM vocabulary.

The same holds for vocabulary coming from LS / CC, e.g., qǐyǒucǐlǐ 岂有此理 ("ridiculous; outrageous; absurd") and sàiwēngshīmǎ 塞翁失马 ("blessing in disguise").  The translations offered for such expressions are not always felicitous and may vary widely, depending upon whether they are trying to convey the overall gist or the literal meaning, but at least they recognize these expressions as constituting lexical, syntactic units within MSM.  For this reason, I approve of Google Translate's pinyinization of such expressions as single units.

The same is true of countless other MSM lexical items from a wide variety of sources beyond Cantonese and LS / CC.

Similar criteria obtain for borrowings from diverse derivations in English, German, Japanese, and other languages.

Electronic translation software programs for Sinitic, so far, are for MSM.   They recognize and are generally capable of dealing with MSM vocabulary, grammar, and syntax quite effectively, and indeed often impressively so.  I do not consider that they have failed when somebody throws an auspicious chicken — whoops! a monkey wrench / spanner — into the ointment / works.

1. ### unekdoud said,

January 29, 2017 @ 11:25 am

"These are all different languages, and electronic translation software, like human brains, cannot be programmed and trained in such a way that they can simultaneously translate material coming from different languages."

I might need a little clarification here: how is this different from people code-switching between different languages, who are understood pretty well in conversation?

2. ### ouen said,

January 29, 2017 @ 11:54 am

actually baidu translate does give the option of classical chinese translation.
i tried using the 文言文 function to translate '鸡年大吉' but the result was still unsuccessful, it was translated as just 'chicken year'

3. ### Victor Mair said,

January 29, 2017 @ 1:11 pm

@unekdoud

Well, the people who are doing the code-switching have to know both languages pretty well, and they abide by certain principles regarding when and what they switch.

@ouen

Someday I'm gonna throw a whole bunch of LS / CC at that Baidu Translate wényánwén 文言文 function and we'll see how it does. As a matter of fact, I might just do that while I'm eating lunch in a few minutes. Meanwhile, their wényánwén 文言文 function did better with jī nián dàjí 鸡年大吉 than their MSM function. At the very least, it's encouraging that they treat MSM and LS / CC as two separate kinds of language.

4. ### Victor Mair said,

January 29, 2017 @ 2:07 pm

As promised in the previous comment, here's the beginning of the first chapter in the Confucian Analects:

Zǐ yuē: Xué ér shí xí zhī, bù yì yuè hū? Yǒu péng zì yuǎnfāng lái, bù yì lè hū? Rén bùzhī ér bù yùn, bù yì jūnzǐ hū?

子曰：「學而時習之，不亦說乎？有朋自遠方來，不亦樂乎？人不知而不慍，不亦君子乎？」

James Legge's translation:

The Master said, "Is it not pleasant to learn with a constant perseverance and application? Is it not delightful to have friends coming from distant quarters? Is he not a man of complete virtue, who feels no discomposure though men may take no note of him?"

Baidu's wényánwén 文言文 function

Confucius said: "to learn and practice, not a pleasure? To have friends from afar, not happy? People know not angry, not also a virtuous gentleman?"

Hey, not bad at all.

Now let's run the same very famous verse from the Analects through the Baidu MSM function:

Confucius said: "isn't it a pleasure to study, learn and? There are friends from afar awfully? People not resentful, not a gentleman?

Confucius said: "When learning and learning, do not say it? There are friends from afar, enjoying themselves? People do not know and not resentful, not a gentleman down?"

Not too bad either.

Microsoft Translator

Confucius said: "school time 習 it not say? Friends from far future, not a park? People do not know and do not 慍 it not a gentleman?

No comment.

Here are two translations into MSM picked at random (not entirely at random, but to show how different the many vernacular translations can be):

Kǒngzǐ shuō: Xuéxí xiūyǎng zìjǐ hé fú guó lì mín de xuéwèn, yòu nénggòu shìshí de shíxíng, qǐ bùshì hěn lìng rén xīnxǐ ma? Yǒu zhìtóngdàohé de péngyǒu cóng yuǎnfāng lái, qǐ bùshì hěn kuàilè ma? Dāng zìjǐ de dàodé xuéwèn yǒu chéngjiù shí, jíshǐ pángrén bù zhīdào, xīnlǐ yě méiyǒu sīháo yuànhèn, zhè bùzhèng shì yīgè jūnzǐ de fēngfàn ma?

孔子說：「學習修養自己和福國利民的學問，又能夠適時地實行，豈不是很令人欣喜嗎？有志同道合的朋友從遠方來，豈不是很快樂嗎？當自己的道德學問有成就時，即使旁人不知道，心里也沒有絲毫怨恨，這不正是一個君子的風范嗎？」

Kǒngzǐ shuō: “Jīngcháng xuéxí, bù yě xǐyuè ma? Yuǎnfāng láile péngyǒu, bù yě kuàilè ma? Dé bù dào lǐjiě ér bù yuànhèn, bù yěshì jūnzǐ ma?”

孔子說：“經常學習，不也喜悅嗎？遠方來了朋友，不也快樂嗎？得不到理解而不怨恨，不也是君子嗎？”

Here are the MSM translations rendered into English:

Baidu

Confucius said: "learning culture on behalf of the country and their knowledge, and timely implementation, is not very exciting? It is very happy to have like-minded friends from afar? When the moral knowledge of their achievements, even if others do not know, the heart is not the slightest resentment, it is not a gentleman demeanor?

Confucius said: "often learning, not happy? Friends from afar, not happy? Can not understand without resentment, not a gentleman?"

Confucius said: "learn and cultivate themselves and the benefits of learning, but also timely implementation, it is very gratifying it? Like-minded friends from afar, it is not very happy? When their own moral knowledge and achievements , Even if others do not know, and my heart is not the slightest resentment, this is not a gentleman's style?

Confucius said, "Do you always learn, do not you please?" Far from coming to a friend, not being happy, not being understood without resentment, and not a gentleman?

Microsoft

Confucius said: "Learning 養 school question himself and blessed country and people, and enough right with proper time line, any 豈 is not very happy? Like-minded friends to promote far future, 豈 is not fast any music? School question time when its own moral, even if others do not know, was also without hex NG resentment, this is not a real gentleman any Wind fan?

Confucius said: "the Bible is often Learning, not any Lin Yue? Far future friends, does not have any joy? Without understanding that it is not resentment, not any gentleman? ”

At least for me, this has been an interesting lunch experiment.

5. ### Silas S. Brown said,

January 30, 2017 @ 10:20 am

What worries me about machine translation is when people use it to translate their own language into a foreign language and make no attempt to check that the foreign-language result even makes sense. If doing this, at the very least it might be a good idea to make use of any option in the software to let you choose between alternative translations of specific words (which can be back-translated into your own language to help you make the choice), but many users don't realise such options exist. I've been asked to proof-read automatically translated Chinese and I invariably ask to see the English original because I'd rather translate it myself than try to pick apart the automatic attempt. I've heard of people in authority trying to use these things to communicate with immigrants while having no idea that there might be something wrong with the translation. These tools should have big warning lights all over them!

6. ### Andrew said,

January 30, 2017 @ 4:55 pm

Recently I was reading about some research out of Google's NMT team where they actually *do* train their models on multiple languages simultaneously (IIRC one example involved 4 or 5 European languages and another involved 3 Asian languages). Then they just provide it with a token at the beginning of the input to indicate the assumed language, and the model does the rest, handling "[spa]la pluma esta en la mesa" and "[fra]la plume est sur la table" with equal facility. Not only does it handle cases of mid-text (or mid-sentence) language switching better than existing systems, since the language data is all there in the same model — they also report that it does better on single language inputs as well, indicating that it's actually learning some sort of an interlingua.

None of which really detracts from your point, so don't think I'm trying to be argumentative — it's just interesting stuff!
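The token mechanism described above can be sketched in a few lines of Python. This is only an illustration of the input-tagging idea, not Google's actual pipeline: the bracketed three-letter format comes from the examples in this comment, and the function names (`add_language_tag`, `split_language_tag`, `build_mixed_batch`) are hypothetical. The real system may define the token differently, for instance by naming the desired output language rather than the assumed input language.

```python
import re

# Hypothetical tag format borrowed from the comment above: "[spa]", "[fra]", ...
TAG_PATTERN = re.compile(r"^\[([a-z]{3})\](.*)$", re.DOTALL)


def add_language_tag(text: str, lang: str) -> str:
    """Prepend a bracketed three-letter language token to the input text."""
    if not re.fullmatch(r"[a-z]{3}", lang):
        raise ValueError(f"expected a three-letter lowercase code, got {lang!r}")
    return f"[{lang}]{text}"


def split_language_tag(tagged: str) -> tuple[str, str]:
    """Recover (language code, text) from a tagged input string."""
    match = TAG_PATTERN.match(tagged)
    if match is None:
        raise ValueError("input does not start with a language token")
    return match.group(1), match.group(2)


def build_mixed_batch(sentences: list[tuple[str, str]]) -> list[str]:
    """Tag sentences from several languages so they can share one batch."""
    return [add_language_tag(text, lang) for lang, text in sentences]


# Sentences from two languages end up in the same list of model inputs.
batch = build_mixed_batch([
    ("spa", "la pluma esta en la mesa"),
    ("fra", "la plume est sur la table"),
])
```

The point of the design is that nothing else about the input changes: the same model sees every language, and the token is the only explicit signal telling them apart.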

7. ### Eidolon said,

January 30, 2017 @ 5:42 pm

It's interesting, because I don't think 大吉 is actually all that obscure to Standard Mandarin speakers, since it can be derived from basic vocabulary. That is, both 大 "big" and 吉 "auspicious" are fairly common and well-known morphemes in Standard Mandarin, and the 大X construction is so normal in Standard Mandarin that it cannot possibly be difficult to figure out that 大吉 = "very auspicious." Indeed, I'd say this explains why so many people sent 大吉 as a greeting in the first place: they could expect it to be understood. I don't think any knowledge of Literary Sinitic is necessary, or even previous awareness of 大吉 as an expression.

So I wouldn't say this is an example in which we *should* expect error, at least not from the human side. Machines, however, are not humans, so perhaps the training algorithm and the data set used combined in such a way as to produce this error. We take for granted much of the common sense that makes up human intelligence, without realizing just how hard it is to reproduce mechanically.

8. ### Eidolon said,

January 30, 2017 @ 6:09 pm

To follow up on a previous comment from the other thread, Baidu explains its translation as such:

"大吉[dà jí]

very lucky; highly auspicious;(used ironically in 关门大吉) close down."

This was noticed within the first three comments to the original thread as the likely source of the error.

Google doesn't provide any such explanation, but one might argue that it's the same error, just in a different form. What's fascinating about the nature of this parallel is that it indicates Google and Baidu are using similar algorithms and data sets, such that both software services managed to produce the same error on the same expression. Copyright lawsuits might be in order…

Microsoft translate, on the other hand, produces a distinctly different result. We might be more confident, then, in their approach being original.

9. ### KC said,

January 30, 2017 @ 8:12 pm

As pointed out by Andrew above, there have been some recent results on building a machine translation system that can handle intra-sentence code switching without any explicit mechanism by training one neural machine translation system with multiple languages. See, e.g.,

10. ### Victor Mair said,

January 30, 2017 @ 10:56 pm

Jí 吉 ("lucky; auspicious") is a bound morpheme, so people don't just walk around saying jí 吉 this or jí 吉 that in vernacular speech. Writing is a different matter altogether, where you can get away with grammatical murder, hence there are all sorts of bànwénbànbái 半文半白 ("semi-literary semi-vernacular") styles.

Here are a couple of ways to say jī nián dàjí 鸡年大吉 (lit., "chicken year big auspicious / propitious", i.e., "[may the / your] year of the chicken be greatly auspicious!") that are in accord with Mandarin usage; I call them "chún báihuà / kǒuyǔ 纯白話 / 口語" ("pure vernacular"):

=====

zhù nín jīnián shífēn jíxiáng 祝您鸡年十分吉祥 ("wishing you a very lucky chicken year")

xīwàng nǐ zài jīnián, měi jiàn shìqíng dōu hěn hǎo, hěn shùnlì! 希望你在鸡年，每件事情都很好、很顺利！("wishing that everything will be fine and smooth for you in this chicken year")
=====

Even when we get into literary styles that make concessions in the interest of oral intelligibility, we see semi-vernacularity creeping in:

=====

jīnián jíxiáng 鸡年吉祥 ("[may the / your] year of the chicken be auspicious!"), where jí 吉 ("lucky; auspicious") is bound with xiáng 祥 ("auspicious; lucky; propitious; felicitous") to form the disyllabic word jíxiáng 吉祥 ("auspicious; lucky; propitious")

jīnián dàjí dàlì 鸡年大吉大利 ("[may the / your] year of the chicken be auspicious!"), where the addition of dàlì 大利 ("great benefit") echoes dàjí 大吉, in essence splitting up the disyllabic word jílì 吉利 ("auspicious; lucky; propitious"); as one informant who noted the literary proclivity for such "fixed expressions" stated, this is "to avoid any possible ambiguity in lack-of-context oral situation".