Google Translate Adds Languages

Google Translate has added ten languages to its repertoire: Bulgarian, Croatian, Czech, Danish, Finnish, Hindi, Norwegian, Polish, Romanian and Swedish. With the languages previously available (Arabic, Chinese (traditional and simplified writing), Dutch, English, French, German, Greek, Italian, Japanese, Korean, Portuguese, Russian, and Spanish), Google now handles 23 languages. These comprise less than one-half of one percent of the world's languages, but their speakers include more than half of the world's population.
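
As a rough check on those figures, here is a minimal Python sketch. The language lists come from the paragraph above; the world total of about 6,900 languages is an assumption (roughly the count Ethnologue reported around this time), not a number given in this post.

```python
# Languages listed in the post; Chinese is counted once although both
# traditional and simplified writing are supported.
previously_available = [
    "Arabic", "Chinese", "Dutch", "English", "French", "German", "Greek",
    "Italian", "Japanese", "Korean", "Portuguese", "Russian", "Spanish",
]
newly_added = [
    "Bulgarian", "Croatian", "Czech", "Danish", "Finnish",
    "Hindi", "Norwegian", "Polish", "Romanian", "Swedish",
]

# Assumption: roughly 6,900 living languages worldwide (not stated in the post).
ASSUMED_WORLD_LANGUAGES = 6900

total_supported = len(previously_available) + len(newly_added)
share_percent = 100 * total_supported / ASSUMED_WORLD_LANGUAGES

print(f"{total_supported} languages, about {share_percent:.2f}% of the assumed total")
```

With that assumed total, the 23 languages come to about a third of one percent, which is consistent with "less than one-half of one percent".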

The interesting thing about the languages added is that, for the most part, they are not the next most widely spoken languages. Hindi is the exception, with over 300 million speakers. The others have modest numbers of speakers. Moreover, in several cases the demand for translation into the language is probably small: most Scandinavians, for example, are comfortable in English. If Google were adding languages in descending order of number of speakers, they would have added such languages as Indonesian, Bengali, Punjabi, Telugu, Marathi and Vietnamese.

In all likelihood, the order in which Google adds languages is probably determined partly by economics (relatively small numbers of speakers of languages used in wealthy countries will provide more advertising revenue than large numbers of speakers of languages used in poor countries) and partly by the availability of sufficiently large amounts of bilingual text, which they use to train their statistical translator.



26 Comments

  1. Charles Belov said,

    May 18, 2008 @ 4:37 am

    It's interesting to take your comments and translate them from English into the various languages and translate them back again. The translator actually does relatively well, although there are definite oddities:

    Chinese:

    In all likelihood, in order to join Google in which the language may be decided by the economics of the (relatively small number of speakers of the language used in rich countries, will provide more advertising revenue than a lot of lectures Gentlemen, the languages used in poor countries) and in part by the provision of adequate large number of bilingual text, they use to train their statistical translation.

    Spanish:

    In all likelihood, the order in which Google adds language is probably partly determined by the economy (relatively small number of speakers of languages used in rich countries provide more advertising revenue that a large number of speakers of languages used in Poor Countries) and partly by the availability of sufficiently large quantities of bilingual text, which uses statistics to train their translator.

    [note the capitalization change]

    Czech:

    In all likelihood, the order in which Google adds language is probably intended partly economics (a relatively small number of speakers on the languages used in rich countries will be more advertising revenue than a large number of speakers on the languages used in poor countries) and partly in the availability of large enough of the bilingual lyrics, which use to train their statistical translator.

    Arabic:

    In all likelihood, which adds photo languages and perhaps partly determined by the economy (relatively small numbers of speakers of languages used in the rich countries will provide more revenue from advertising large numbers of speakers of languages used in poor countries) and partly from the availability of sufficient quantities large From bilingual text, which is used to train statistical translator.

    But translations do seem to be getting better since the time when Babelfish first jumped into our ears from the servers at Altavista.

  2. panne said,

    May 18, 2008 @ 4:43 am

    Huh. The Norwegian to English version has the same issue with names of places that have been discussed here previously. I just did some experimentation, and this is what I got:

    Jeg bor i oslo = I live in boston
    Jeg bor i bergen = I live in houston
    Jeg bor i trondheim = I live in chicago

    This only happens when I don't capitalize the city names, though; when I do, it comes out right. Also, it is only an issue (as far as I can see) with the three biggest cities. Stavanger, kristiansand and tromsø work no matter what.

    Curious.

  3. Bertilo Wennergren said,

    May 18, 2008 @ 5:39 am

    Here's a similar issue with Swedish:

    Jag bor i orsa. = I live in causes.

    This probably has something to do with the Swedish word "orsaker" that does mean "causes" (the plural of "orsak" = "cause"). But "orsa" (or rather "Orsa") is just a place name.

    However:

    Jag bor i Orsa. = I live in Orsa.

  4. Jeremy Hawker said,

    May 18, 2008 @ 5:44 am

    What about this one? If I write "I'm moving from bergen to oslo", I get:

    Jeg flytter fra bergen til oslo = I move from running to sf

  5. panne said,

    May 18, 2008 @ 6:55 am

    I have now experimented a bit more (yes, I am indeed procrastinating :-S), and found that names of small places yield pretty interesting results. Among the weirdest are:

    Jeg bor i Volda = I live in Slough.

    And this, which scared me a little as it is the name of my homeplace:

    Jeg bor i Kvinesdal = I live in Hell.

    Whoa, Google!

  6. Mark Liberman said,

    May 18, 2008 @ 7:04 am

    Some similar fun can be had in Czech.

    The names of the two largest cities seem generally to be properly translated, if capitalized:

    Kdy odlétá příští letadlo do Prahy? → When departs next plane to Prague?
    Kdy odlétá příští letadlo do Brna? → When an aircraft departs next to Brno?

    But without the initial capitalization, "Prague" becomes "threshold" and "Brno" is just transliterated in an inflected form:

    Kdy odlétá příští letadlo do prahy? → When an aircraft departs next to the threshold?
    Kdy odlétá příští letadlo do brna? → When an aircraft departs next to brna?

    There is some evidence of odd city-to-city transductions in some cases, e.g.

    Městská část Brno-Jehnice → City of Buenos Jehnice

    My favorite, though, is that "Veselí nad Moravou", which is actually "a town in the South Moravian Region", is translated as "Wagga Wagga".

  7. Bogdan Marinov said,

    May 18, 2008 @ 7:57 am

    I had a bit of fun a few days ago testing it on my blog (which is in Bulgarian, so I'm not posting a link here).

    Some sentences came out surprisingly well – but most of them didn't. It seems that Google Translate has a problem with omitting personal pronouns (which is quite common in Bulgarian). For example, "Ние обичаме да четем." is translated correctly as "We love to read.". But "Обичаме да четем." (which has the same meaning) gives "Love to read." This often makes complex sentences incomprehensible.

    And there are some really silly substitutions: "has searched for" becomes "atrocities", "the hell" (like in "What the hell…?") becomes "Shit" (always with a capital letter, even in the middle of a sentence). Sometimes homonyms are translated correctly – and sometimes are not, even when they are in another instance of the same phrase.

    It's a nice feature, though. At least, it will certainly add fresh material to my collection of amusing machine translation failures. :)

  8. Maria said,

    May 18, 2008 @ 8:38 am

    So instead of South Moravia, it points to New South Wales.

  9. Theodore said,

    May 18, 2008 @ 10:07 am

    Maybe the choice of new languages is as simple as "popular request."

  10. Ed Cormany said,

    May 18, 2008 @ 11:24 am

    another factor to keep in mind for language selection: number of speakers of a language does not correlate to number of internet users who speak that language, or number of web pages created in that language. for example, if i recall correctly, China recently passed the US in sheer number of internet users, but that still means that less than 10% of China is online. Scandinavia has the best internet infrastructure of anywhere in the world, so i'm sure their web presence is large in proportion to their language populations.

  11. Stephen said,

    May 18, 2008 @ 1:35 pm

    The new languages correspond somewhat with the most active Wikipedias.

  12. John Cowan said,

    May 18, 2008 @ 1:48 pm

    The appropriate people at Google have been made aware of this problem.

  13. Pekka Karjalainen said,

    May 18, 2008 @ 1:58 pm

    I'm rather fond of this pair you get translating from English to Finnish.

    Ojibwe in the Eisteddfod -> Mouk-Aria, Istunto

    Mouk-Aria is a language of Papua New Guinea, and istunto is Finnish for a session or a meeting (literally "a sitting"). I discovered these oddities separately, by the way. I don't think they speak a lot of Ojibwe in the Eisteddfod.

  14. Monica said,

    May 18, 2008 @ 2:38 pm

    I've been assuming (without data) that online translation would be driven more by *source* languages than *destination* languages — that, for example, there is more desire to translate Bulgarian for English-speaking readers than to translate stuff into Bulgarian.

    That said, though, I wouldn't be surprised if the biggest driver is simply available source material, as you said.

  15. bulbul said,

    May 18, 2008 @ 3:02 pm

    But without the initial capitalization, "Prague" becomes "threshold"
    The MT system apparently confuses the feminine city name "Praha" with the masculine "práh". "Práh" actually means "doorstep, threshold" and it may even be the origin of the name "Praha" (a dam in Vltava was referred to as "práh").

  16. Bill Poser said,

    May 18, 2008 @ 4:18 pm

    When you find errors, if you are so motivated, you can tell Google what a better translation would be and they will, I assume, add the new bit of parallel text to their training database.

    It is true that the choice of new languages may be influenced by desire for translation from that language rather than into it, but if so the distribution of languages is less than obvious to me. For example, without meaning any disrespect to our Bulgarian friends, is there really a large interest among Google users in translation from Bulgarian? Nothing has been going on there that I can think of that has attracted a lot of attention in the outside world.

  17. bulbul said,

    May 18, 2008 @ 5:42 pm

    Maybe the choice of languages depends on the availability of resources, in this case mostly huge bilingual corpora. Maybe the folks at Google had one or more available for Bulgarian and thought 'what the heck, why not?'.

  18. bulbul said,

    May 18, 2008 @ 5:49 pm

    … but you said that already. Sorry.

  19. Ryan Denzer-King said,

    May 18, 2008 @ 10:53 pm

    So when are they adding Dakelh? I'd like to see some good permutations involving North American languages. I bet there are A LOT of Czech people who are dying to translate things into Blackfoot.

  20. hjælmer said,

    May 19, 2008 @ 12:43 pm

    bulbul said,
    May 18, 2008 @ 5:42 pm

    Maybe the folks at Google had one or more available for Bulgarian and thought 'what the heck, why not?'.
    ***********
    Based on what Bogdan wrote above, what Google would have thought in Bulgarian would have been "What the poop, why not"?

  21. hjælmer said,

    May 19, 2008 @ 12:50 pm

    Sorry, I of course meant "What Poop, why not"?

  22. dr pepper said,

    May 19, 2008 @ 4:02 pm

    Otoh, the availability of these less represented languages may encourage more participation by their speakers.

  23. Janice Huth Byer said,

    May 19, 2008 @ 5:46 pm

    Panne – Having had the pleasure of touring the three largest of Norway's fine cities checked by you, plus having lived in each of their mistaken American match-ups, I say Google owes Norwegians a great apology. I haven't visited your homeplace, Kvinesdal, but its match-up with Hell recalls to me the American quip: If I owned Hell and Texas, I'd live in Hell and rent out Texas. In other words, Bergen appears to have suffered the worst insult. :)

  24. Bill Poser said,

    May 19, 2008 @ 11:03 pm

    "So when are they adding Dakelh"

    Well, I can't speak for Google, but one problem would be the lack of parallel text. Google uses a statistical translator, which requires lots of parallel text to train. The only really large piece of parallel text available is the New Testament, which is kind of skewed in its topical coverage. The next largest piece of text is very nice: the story of a difficult winter journey, told sixty years later by the woman who made it as a young woman and who spoke no English – but it is only about 8,000 words.

    For languages like Dakelh what might be a more appropriate goal would be translating the Google interface. They currently have the interface in 109 languages (I'm discounting things like "Elmer Fudd") and encourage people to sign up and help them with others.

  25. Richard said,

    May 20, 2008 @ 11:33 am

    I would guess that these languages were chosen as much for proximity reasons as economic. If you've got a good Russian translator, Bulgarian, Polish, Czech and Croatian are going to be pretty similar. Same with Scandinavian languages (minus Finnish) from Dutch and German.

  26. Amy Hemmeter said,

    May 21, 2008 @ 2:23 pm

    If you're interested in this translation error stuff, blahblahfish.com is pretty amusing. It uses Babelfish, I think, one of the worst translators out there. Included is a list of "favorites."
