Green verdure tone desert liquidation
Researchers at Google have responded to current events in Iran by offering an alpha version of Persian-to-English machine translation. I'm a big fan of statistical MT, and for that matter of Google's MT team, and current events in Iran are gripping, so I thought I'd try it out.
[Update: Take a look at the automatic translation of Mir Hossein Mousavi's statement of 6/20/2009 (Human translation here). Some parts are better than others. A couple of fairly characteristic samples:
MT: In this period, and especially in our time of life Imam illuminati capitals huge life and property, and prestige in Mubarak left foot building will consolidate the achievements and was a valuable result.
HT: In those times, especially when our enlightened Imam [Khomeini] was alive, large amount of lives and matters were invested to legitimize this foundation and many valuable achievements were attained.
MT: This is clearly a need in the military and police presence in the street scene and will see some of them and sent them to hear the heart of each interested country's revolution and the pain brings, will not face.
HT: It is clear that in this case, there won’t be a need for security forces on the streets, and we won’t have to face pictures and hear news that break the heart of anyone who loves the country and the revolution.
]
I don't know Persian, so I have to rely on tests where I (think I) know what the outcome should be, or where I can judge the plausibility of the result. My first test was the refrain of Safar ("Traveling") by Nasrollah Moein. Each line below is followed by its Latin-alphabet transcription, the original human translation, and then the automatic translation:
تو رو دیدم تو بارون دل دریا تو بودی
To ro didam to baron, dele darya to bodi
I saw you in the rain, you were in the middle of the sea
I saw you you you sea Barron heart liquidation
تو موج سبز سبزه تن صحرا تو بودی
To moje sabz sabze, tane sahra to bodi
In the green wave of grass, you were the green cover for desert
You wave your green verdure tone desert liquidation
مگه میشه ندیدت تو مهتاب شبونه؟
Mage mishe nadidet to mahtab shabone
Is it possible not to see you in the moonlight?
Yea میشه Ndydt Mahtab night leave you?
مگه میشه نخوندت تو شعر عاشقونه؟
Mage mishe nakhondet to shere asheghone
Is it possible not to read you among the lovers poems?
Yea you میشه Nkhvndt Ashqvnh poetry?
This is not very promising, though since I don't know Persian, it's possible that the input has problems in this case.
And the system does a *lot* better on recent tweets from Tehran — even though I don't know the right answers, translations like these seem very plausible. "نامه سرگشاده مهدی کروبی به شورای نگهبان" = "Open letter to the Guardian Council Mehdi Karroubi"; "مهم و فوری: مهدی کروبی خواهان ابطال انتخابات و برگزاری مجدد انتخابات ریاست جمهوری شد" = "Important and urgent: Mehdi Karroubi voidance want elections and presidential elections were held again"; "کرباسچی در بی بی سی هست ببینید" = "The BBC is Karbaschi see". Andrew Sullivan's site, which has become the CNN of the past week's events, has started using Google translations in his "Live-tweeting the revolution" updates.
From the BBC Persian site, "پخش زنده برنامه های تلويزيونی" comes out as "Live broadcast TV programs", which makes sense. "اگر رسیور شما به طور خودکار کانال بی بی سی را پیدا نکرد می توانید خودتان مشخصات زیر را وارد کنید" = "If you automatically Satellite channel BBC did not you please enter your profile" is a bit less coherent (assuming that the garbled order is due to the translation and not to Firefox's cut-and-paste).
To make their system better, they'll need lots of parallel Persian/English text (not mainly poetry, of course, though that would be nice as well). I expect that they'll find it, and the system will improve, maybe rapidly.
[Update: instructions here on how to automate the twitter translation process, using Firefox and Greasemonkey.]
John Cowan said,
June 19, 2009 @ 12:58 pm
If any Language Log reader finds any seriously broken particular translations, I am very happy to pass them on to the translation team. Do remember that this particular translation pair is alpha-quality, and wouldn't normally have been released until it had been worked on a lot more. Disclaimer: I work for Google, but not on MT, and I don't know any material non-public information about it.
That said, poetry usually does badly: parallel corpora tend to be about things like law and business.
[(myl) This post at Page F30 says: "Note: Google is translating Basij and Basiji as 'mobilize' so when you see that word it actually means Basij / Basiji."
The cited post is titled "How to help now that Google Translate is available in Persian / Farsi", and you might suggest to the folks working on Persian that they post any appropriate requests for assistance in places like that (e.g. lists of particular problematic terms, or maybe just a volunteer effort to translate the large volume of Persian tweets now streaming in, as further training data).]
Rick S said,
June 19, 2009 @ 1:27 pm
@John Cowan: The "Contribute a better translation" function is probably a more effective tool for improving translations, but thanks for the offer.
Nevertheless, based on my experience with Spanish-English translations, I am strongly hesitant to recommend Google Translate–even post-Alpha languages–to anybody who knows nothing of the other language. For example, one of the big problems with Sp-En is the choice of a personal pronoun to insert. I've many times seen incorrect but plausible translations that would mislead a reader who doesn't know about omitted subjects in Spanish. I doubt that statistical MT alone can ever overcome that flaw.
But I like GT, really I do (and I regularly help train it). It often does a very good job with idioms and vocabulary I'm not familiar with. As long as you know something of the other language's syntax and cultural background, it can be quite helpful.
Andrew said,
June 19, 2009 @ 1:32 pm
Aside from the fact that poems are hard enough for people to understand and translate, much less machines, those lines you selected make heavy use of colloquial speech in written form, which I am sure the program isn't set up for.
For example: بارون. This is a written rendering of the way the word باران (rain) is said in everyday speech. I am sure the computer would have no trouble with باران
Another example: مگه میشه. The written form of this is مگر میشود, but here the author has used a spelling to match the colloquial pronunciation.
Hence, the program is much more successful with text that uses the "correct" spelling, such as the BBC Persian example.
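The spelling point above can be illustrated with a minimal sketch of a preprocessing step that rewrites colloquial spellings into their standard written forms before the text reaches a translation system. This is only an illustration: the table name, the function name, and the whole approach are assumptions made for the example, not a description of how Google's system works, and the table holds only the three word pairs mentioned in this comment.

```python
# Hypothetical lookup table: colloquial Tehrani spellings -> standard written forms.
# It contains only the pairs cited in the comment above; a real table would be far larger.
COLLOQUIAL_TO_STANDARD = {
    "بارون": "باران",   # "rain"
    "مگه": "مگر",
    "میشه": "میشود",
}


def normalize_colloquial(text: str) -> str:
    """Replace whitespace-separated tokens that appear in the lookup table.

    Tokens not in the table (including tokens with attached punctuation)
    are left unchanged.
    """
    tokens = text.split()
    return " ".join(COLLOQUIAL_TO_STANDARD.get(token, token) for token in tokens)


if __name__ == "__main__":
    # First phrase of the third line of the refrain quoted in the post.
    print(normalize_colloquial("مگه میشه"))
```

A plain dictionary lookup like this cannot handle homographs or words with punctuation attached, so it would only be a first pass.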
Cameron said,
June 19, 2009 @ 2:23 pm
As Andrew mentioned above, the spellings used in the original represent modern Tehrani pronunciation, not standard spellings.
And where'd the Latinizations come from? If that's Google, shouldn't they be using the Unipers standards?
[(myl) I took everything (except the Google translations) from the page I linked to. ]
Eskandar said,
June 19, 2009 @ 6:10 pm
The system may do better on text from Twitter because, as others have pointed out, many of the tweets are spelled in the proper written style and not the colloquial style of the song you used as a test. However, there are still a number of problems with its translation–though it's still in Alpha and I'm sure it will improve. When I put in سلام علیکم it translated it as "Noble nation of Pakistan," which I assume may be related to using texts like this as part of the corpus. I have been using the "Suggest a better translation" feature, and if there are other things that I can do to help the project, I'd like to hear about them.
Troy S. said,
June 20, 2009 @ 4:26 am
As Andrew said, part of the problem is that the text of the poem is written to colloquial pronunciation, and that shouldn't be a problem for news articles.
Another problem is the fact that Persian has a large number of homographs, and there needs to be some way for the machine to figure out whether تو is supposed to mean "you" or "in," or else we have translations like "I saw you you you."
Another problem is how it determines what the final ی should be translated as. It has many uses: the second person singular ending for verbs, an abstract-noun-making suffix, the indefinite article, and many others. The presence of "liquidation" is somewhat puzzling, but it's probably a rendering of نابودی, which is the negation of بودی, the last word in the second line. Here it means "you were," but the machine apparently picked it up as "existence." Don't know where the negation came from, but non-existence matches up with the sense of liquidation.