The state of the machine translation art


I don't know any Hebrew. So when I recently saw a comment in Hebrew on a Google Plus discussion page about Gaza tunnel-building, I clicked (with some foreboding) on the "Translate" link to see what it meant. What I got was this:

Some grazing has hurt they Stands citizens Susan Hammer year

This does not even offer enough of an inkling to permit me to guess at what the writer of the original Hebrew might have been saying. It might as well have said "Grill tree ecumenical the fox Shove sample Quentin Garage plastic."

Linguists tend to argue that machine translation based entirely on statistical properties of large parallel corpora, without any guidance from lexical or syntactic information, is not going to work. And you can see why we say that. Hitting the "Translate" button may get you mediocre literal translations with limited errors for perhaps 80% of simple sentences in languages like French: languages closely related to English for which huge amounts of parallel text are available. But a lot of the time, especially for minor languages less well represented on the web, it will get you just about nothing.
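To make the failure mode concrete, here is a toy sketch in Python (my own illustration, emphatically not the internals of Google Translate): a word-for-word "phrase table" harvested from parallel text does fine on vocabulary it has statistics for, and collapses on anything it has never seen.

```python
# Toy illustration of purely statistical translation (not Google's
# actual system): a word-for-word "phrase table" harvested from a
# hypothetical parallel corpus. Seen words translate; unseen or
# misspelled words carry no statistics at all, so the output collapses.

phrase_table = {
    "le": "the",
    "chat": "cat",
    "dort": "sleeps",
}

def translate(sentence):
    # Take the most probable target word for each source word.
    # With no lexical or syntactic knowledge, an out-of-vocabulary
    # word can only be flagged or passed through untranslated.
    return " ".join(phrase_table.get(w, "<" + w + "?>")
                    for w in sentence.lower().split())

print(translate("le chat dort"))    # -> the cat sleeps
print(translate("le chaat dortt"))  # -> the <chaat?> <dortt?>
```

The point of the toy: nothing in such a model knows what a word is, so a rare, misspelled, or unseen word leaves it with nothing to fall back on.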

However, to be fair to the statistical machine translation industry, we must allow for any defects in the quality of the input. And after the above paragraphs were posted, Daniel Sterman, an experienced editor with a thorough knowledge of Hebrew, gave me this very useful analysis, which makes a considerable difference:

The original Hebrew is riddled with spelling and grammatical errors, which is why machine translation didn't work. You mentioned in your post "with limited errors" – this sentence's errors go well beyond that, and far into the realm of "my translation software was never designed to handle this level of idiocy".

She was trying to say "Look how evil Hamas is, they put citizens between the hammer and the anvil."

She actually said (if I were to translate it as written), "How many shepherd Hamas has they stances the citizens son of the hammer their jaw".

The machine translation didn't actually do too badly; some of the words the machine translation chose are spelled the same as the words I chose, usually due to some similarity of meaning (shepherd::grazing, stances::stands, some::how many). But I confess to having no idea where the words "Susan" and "year" came from.

Another correspondent, Uri Granta, provides a somewhat different account of the correct translation from the botched Hebrew, and a possible defense of statistical machine translation:

What the post meant to say was: "How much evil has Hamas they put citizens between the hammer and Satan." What it actually said was: "How much shepherd has to hurt they stand the citizens son of the hammer jawfish." A statistical approach might actually aid in deciphering what was meant here, since it could use context to detect the likely typos: for example, guessing that a misspelled "Hamas" is more likely than an uncommon word for "to hurt", or that a misspelled "evil" is more likely in the context of the page and sentence than "shepherd".
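What Granta describes is essentially the classic noisy-channel approach to spelling correction. Here is a minimal sketch (with invented, transliterated vocabulary and made-up counts; no claim about what Google actually does): score each candidate by how frequent it is in context, minus a penalty for each edit separating it from what was typed.

```python
# Minimal noisy-channel typo correction (a sketch with invented,
# transliterated vocabulary and counts; not any real system's model).
# Choose the candidate w maximizing log P(w) minus an edit penalty,
# i.e. a language model combined with a crude error model.

from math import log

# Hypothetical corpus counts: "hamas" is frequent in this context,
# while the rare verbs glossed "to-hurt" and "to-graze" are not.
counts = {"hamas": 5000, "to-hurt": 60, "to-graze": 40}
total = sum(counts.values())

def edits(a, b):
    # Levenshtein edit distance by dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def correct(observed):
    # Higher corpus frequency helps; each edit costs a fixed penalty.
    return max(counts,
               key=lambda w: log(counts[w] / total) - 2.0 * edits(observed, w))

print(correct("hamaas"))  # -> hamas: the "misspelled Hamas" reading wins
```

The relative weight of the two terms is the whole game: a strong enough language model will happily read a rare word as a typo for a common one, which is exactly the guess Granta is recommending.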

I guess the bottom line is that the computer scientists' old adage "GIGO" still applies: put Garbage In, and you're gonna get Garbage Out.


