Language Log

Are you in the book today?

March 19, 2020 @ 7:55 pm · Filed by Victor Mair under Artificial intelligence, Information technology, Parsing, Punctuation, Translation


[This is a guest post by Nathan Hopson, who sent along the two screen shots with which it begins.]

Another splendid example of why punctuation matters and why machine translation is dumb…

With an h/t to my Earlham kōhai (後輩 ["junior schoolmate"]) Becki Kanou, who's also a big LL fan, I'm attaching two screencaps that will have both prophets of AI-fueled utopia and Oxford-comma haters alike weeping.

The horizontally aligned one is my own, since I decided to test this myself before succumbing to embarrassing clickbait. Sadly, it's real.

Google Translate, as good as it is for the basics in many world language pairs, still sucks at CJK. I maintain that in addition to all the other obvious issues, this is in no small part because without spaces, word parsing is hard. Really hard. And worse, really intuitive and high context.

The Japanese is:

今日本にいますか。

What's frankly baffling is that while the Romanization is correctly parsed and transliterated, the English fails to live up to that promise:

"Are you in the book today?"

My guess is that the lack of a comma after 今 is the culprit. With the comma, there's no ambiguity. It's 100% "Are you in Japan?" Without it, artificial stupidity probably saw 今日 as a pair before it moved linearly to 日本 and prioritized that pair. Linear processing for nonlinear processes is not pretty.
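
The guess above can be made concrete with a minimal sketch of greedy, left-to-right longest-match segmentation. Everything in it is an assumption for illustration: the toy lexicon and the function name greedy_segment are hypothetical, and nothing here reflects how Google Translate actually works internally.

```python
# Toy lexicon: an assumption for illustration, not any real system's vocabulary.
LEXICON = {"今", "今日", "日本", "本", "に", "います", "か"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def greedy_segment(text):
    """Segment text left to right, always taking the longest lexicon match."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(MAX_WORD_LEN, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in LEXICON:
                break
        else:
            # Nothing matched (e.g. punctuation): emit the single character.
            candidate = text[pos]
        tokens.append(candidate)
        pos += len(candidate)
    return tokens


print(greedy_segment("今日本にいますか"))
print(greedy_segment("今、日本にいますか"))
```

With this toy lexicon, the unpunctuated sentence comes out as 今日 / 本 / に / います / か, because 今日 is grabbed before 日本 is ever considered. With a comma after 今, the same procedure yields 今 / 、 / 日本 / に / います / か.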

It's all very, Eats, Shoots and Leaves.

Selected reading

"Homophonophobia" (2/7/15)
"Homographobia" (9/27/10)
"An Eighteenth-Century Japanese Language Reformer" (4/23/15)
"Which is worse?" (1/21/16)
"Character amnesia and kanji attachment" (2/24/16)
"Japanese survey on forgetting how to write kanji" (9/24/12)
"The foreign carrot regime problem" (6/18/16)
"Google is scary good" (7/31/17)

18 Comments

DBMG said,

March 19, 2020 @ 8:17 pm

It looks like Google Translate's romaji conversion and translation are done by parallel systems that don't communicate with each other. Might chaining them produce a better result? I entered "Ima Nihon ni imasu ka?" into GT and got "Are you here now?"…
Stephen Jones said,

March 19, 2020 @ 8:27 pm

What is really odd is that the same sentence but without か at the end gives you "I'm in Japan right now". So why does Google think that the question version makes the other reading more likely?
Keith said,

March 19, 2020 @ 8:35 pm

I suspected, when I saw this, that it was a parsing problem: that the algorithm picked out "本" (hon) for book (one of the very few Japanese words that I know) and built the rest of the phrase as if this was the topic.

To test it, I took the original phrase "今日本にいますか。" and changed "日本" to "中国". This was correctly romanized as "Ima Chūgoku ni imasu ka." and translated as "Are you in China now?"
Twill said,

March 19, 2020 @ 9:37 pm

@Keith Not quite; the problem is the real morphological ambiguity in 今日本, which could be validly parsed as 今、日本 (now, Japan) or 今日、本 (today, book). The parsing comes down to a question of semantics.

I tried it with an equivalent sentence where the 今日、本 reading is the more probable one (今日本を借りました), and sure enough, the translation was "I borrowed a book today" while the transcription read "Ima Nihon o karimashita" ([I] borrowed Japan now).
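
The ambiguity described here can be shown with a small sketch that enumerates every way of splitting 今日本 into dictionary words. The lexicon and the function name all_segmentations are hypothetical, chosen only to illustrate the point; a real analyzer would use a far larger dictionary.

```python
def all_segmentations(text, lexicon):
    """Return every way to split text into a sequence of words from lexicon."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this prefix and segment the remainder recursively.
            for tail in all_segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Toy lexicon: an assumption for illustration only.
LEXICON = {"今", "今日", "日本", "本"}

for words in all_segmentations("今日本", LEXICON):
    print(" + ".join(words))
```

Both 今 + 日本 and 今日 + 本 come back as valid splits, so the dictionary alone cannot decide between them; something else (semantics, context, or learned statistics) has to.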
Bob said,

March 19, 2020 @ 10:23 pm

Microsoft’s translator (https://bing.com/translator) gets it right…
Marianna M. said,

March 20, 2020 @ 12:20 am

If there are spaces, it will use them (as in the translating from transliteration example or if you simply insert a space in the original text). Otherwise it's likely using SentencePiece ("not an official Google product," but… https://github.com/google/sentencepiece/blob/master/README.md ) for tokenization. Note that if you remove the question particle, it tokenizes better, so it's not simply a greedy right-to-left heuristic.
Victor Mair said,

March 20, 2020 @ 12:20 am

I think we all need to take to heart this paragraph from Nathan Hopson's o.p.:

=====

Google Translate, as good as it is for the basics in many world language pairs, still sucks at CJK. I maintain that in addition to all the other obvious issues, this is in no small part because without spaces, word parsing is hard. Really hard. And worse, really intuitive and high context.

=====

Is this not an argument for word spacing?

This is a topic that has come up countless times on Language Log, especially for CJK languages, but also for all languages whose customary orthographies do not separate words with spaces.
Christian Weisgerber said,

March 20, 2020 @ 2:28 am

DeepL, which for the languages it supports is widely considered to be superior to Google Translate, recently added Japanese and Chinese. Here we get:

今日本にいますか。 → Are you in Japan now?
unekdoud said,

March 20, 2020 @ 2:36 am

Google Translate can get very non-greedy in parsing, as a quick demo in elephant semifics will show.
Alyssa said,

March 20, 2020 @ 6:42 am

I wouldn't consider this a strong argument for word spacing unless you consider computers the primary users of language. Is there any evidence that humans have trouble understanding these sentences?

Also, written Japanese has other methods than word spacing to make word boundaries clear – specifically, particles and script changes. 今日/本 vs 今/日本 is notable because they don't apply here – all these words are typically written in kanji, and have no particle between them. I don't think it would be easy to come up with many other groups of four words which have this same kind of ambiguity.
Mark Metcalf said,

March 20, 2020 @ 8:49 am

fanyi.baidu.com gets 今日本にいますか wrong, but with a space or comma between 今 and 日 gets it right.
David Morris said,

March 20, 2020 @ 10:30 am

Total no-knowledge-of-Japanese question: How does 'Are you in Japan (now)' need a comma? I can only guess that the first character means 'now'.

Several people have mentioned CJK languages and word spacing. Korean is now usually written with word spaces.
Jim Breen said,

March 20, 2020 @ 11:21 am

GT doesn't actually parse the text any more. It's all driven by a neural-net translation system trained on masses of parallel texts. It's usually very good, and the changeover to the NN system led to a step up in quality, as Victor commented some time ago. This short passage obviously catches it out. It's learned too well that 今日 usually maps to "today".

If I put the passage into the MeCab/UniDic analyzer I use for Japanese NLP work it correctly segments it as "今+日本+に+い+ます+か".
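
As a rough sketch of why a dictionary-based analyzer can get this right, here is a toy minimum-cost segmentation over the whole sentence. The costs below are invented for illustration and the function name best_segmentation is hypothetical; MeCab's real costs come from its trained dictionary and also include connection costs between adjacent words, which this sketch leaves out.

```python
import math

# Hypothetical per-word costs (lower means more plausible). Invented numbers.
COSTS = {
    "今": 3.0, "今日": 2.5, "日本": 2.0, "本": 4.0,
    "に": 1.0, "い": 2.0, "ます": 1.0, "か": 1.0,
}


def best_segmentation(text, costs):
    """Return the lowest total-cost split of text into words found in costs."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: cheapest cost to cover text[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last word used
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if word in costs and best[start] + costs[word] < best[end]:
                best[end] = best[start] + costs[word]
                back[end] = start
    if math.isinf(best[n]):
        raise ValueError("text cannot be segmented with this lexicon")
    # Walk the back-pointers from the end to recover the words.
    tokens = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]


print("+".join(best_segmentation("今日本にいますか", COSTS)))
```

Note that 今日 is locally cheaper than 今 in this toy table, yet the overall cheapest path is 今+日本+に+い+ます+か, because 今日 forces the expensive 本 to follow it. Scoring the whole path rather than each word in isolation is what makes the difference.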

[I remember a possibly apocryphal story of English-French MT systems being trained on the bilingual Hansard material from the Canadian parliament, which led to "hear" being translated as "bravo".]
Not a naive speaker said,

March 20, 2020 @ 11:45 am

Somewhere in The Chicago Manual of Style, in the chapter about punctuation (because of ♛ I don't have access to it now), they write that punctuation should help the reader.

I read some of the posts on Language Log about CJK writing systems. I think I chose my parents well because I grew up in an alphabetic environment. My guess is these systems were "created" by a leisure class for a leisure class. Why do these systems make it so hard for the reader? The burden should be put on the writer, not the reader.
Don't blame Google Translate
Bathrobe said,

March 20, 2020 @ 1:24 pm

I had a look at DeepL and used it to translate a short passage from Chinese to Japanese. Errors in the translation made it clear that English was being used as the pivot language. Maybe it's alright for translating English to and from other languages, but not necessarily for direct translation between two other languages.
Bathrobe said,

March 20, 2020 @ 1:33 pm

"Don't blame Google Translate"

Point one: Word spacing and alphabetic systems are two different things. Latin was written without spacing for centuries, even though it used an alphabet.

Point two: Similar problems can be caused if you fail to use hyphens in attributive phrases in English (e.g., 'time-serving' vs 'time serving'). This is common in modern English prose and can in some cases lead to ambiguity (sorry, I don't have any concrete examples to hand but I've seen it happen).
Bathrobe said,

March 20, 2020 @ 1:38 pm

How does 'Are you in Japan (now)' need a comma?

今日本 can be read as '今+日本' now Japan or '今日+本' today book. The first is literally 'now Sun-origin'; the second is 'this-day book'. If you insert a comma at either of the plus signs, the alternative reading is ruled out.
Slumbery said,

March 20, 2020 @ 9:05 pm

You really do not need to use CJK languages to make Google Translate fail miserably when context is needed to tell homonyms apart. I can easily do that with very simple Hungarian sentences with correct punctuation.
