Country list translation oddity


This is weird, and even slightly creepy — paste a list of countries like

Costa Rica, Argentina, Belgium, Bulgaria, Canada, Chile, Colombia, Dominican Republic, Ecuador, El Salvador, Ethiopia, France, Germany, England, Guatemala, Honduras, Italy, Israel, Mexico, New Zealand, Nicaragua, Peru, Puerto Rico, Scotland, Switzerland, Spain, Sweden, Uruguay, Venezuela, USA

into Google Translate English-to-Spanish, and a parallel-universe list emerges:

Translation into French, German, Italian, and Chinese seems to be fine, at least for that particular list.

Presumably some strange phrase-based list-mapping glitch is responsible. For earlier examples of country-name translation glitches (with a different apparent explanation), see "Made in USA == Made in Austria|France|Italy…?", 3/23/2008; "Austria == Ireland?", 3/24/2008; "Why Austria is Ireland", 3/24/2008.

Graham Katz, who sent in the example following a question from his sister, offered this explanation:

After determining that this indeed is what happens (the tipping point from item-for-item translation to "intrusive Estados Unidos" was something like 16 elements), it struck me what it might be. Googling "Estados Unidos de America, Federacion de Russia" clinched it. Long Spanish lists of countries (such as the list of members of the UN) almost always have this sequence in them. Clearly at about 16 elements Google starts to resort to some sort of phrase-based machine translation (or maybe it's the "song lyrics" sub-model? I found the easiest way to explain it to my sister was by talking about translating song lyrics or prayers).

Anyway, I thought a broader audience might be interested. (Below is the list of the countries, for your convenience should you choose to play with it. The crucial cut point is between Italy and Israel: a shorter list gets literal translation, a longer one starts to get the intrusion.)

Indeed — this list of 17 is fine:

but this list of 18 has the ghostly USA and Russian intrusion:

I leave it to others to determine what the function from inputs to outputs really is, whether the particular sequence of countries matters, whether there are lists that generate ghostly intrusions for other language pairs, and so on.
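For anyone who wants to take up that question, here is a minimal sketch of a bookkeeping helper. It does not call Google Translate (no API is assumed); you paste each output in by hand. The function names `split_list`, `prefixes`, and `diff_lists` are invented for this illustration, and the sample "actual" list is made up, not a recorded output.

```python
from collections import Counter


def split_list(text):
    """Split a comma-separated list into trimmed, non-empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


def prefixes(items, start=1):
    """Yield (length, text) for progressively longer prefixes of the list."""
    for n in range(start, len(items) + 1):
        yield n, ", ".join(items[:n])


def diff_lists(expected, actual):
    """Report how an output departs from an item-for-item translation."""
    exp, act = Counter(expected), Counter(actual)
    return {
        # expected items that never appear in the output
        "missing": sorted(k for k in exp if k not in act),
        # output items that were never expected (the "ghostly" ones)
        "intruded": sorted(k for k in act if k not in exp),
        # expected items that appear more often than they should
        "repeated": sorted(k for k in exp if act[k] > exp[k]),
    }


# Made-up illustration, not a recorded Google Translate output.
expected = split_list("Costa Rica, Argentina, El Salvador, República Dominicana")
actual = split_list(
    "Costa Rica, Argentina, El Salvador, El Salvador, Federación de Rusia"
)
report = diff_lists(expected, actual)
print(report)
```

Feeding each string from `prefixes(...)` to the translator in turn, and running `diff_lists` on what comes back, would locate the list length at which intrusions begin. Shuffling the input before taking prefixes would test whether the particular order of countries matters.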



22 Comments

  1. AntC said,

    April 10, 2017 @ 5:42 pm

    The input list is in alphabetical order (by English name), apart from the initial Costa Rica.

    The output list seems to be trying to sort into alphabetical order by translated name. So USA, last in the English list, moves to Estados Unidos …

    So is Google trying to be really clever? Has it spotted that's an alphabetical list, translated the names, then reshuffled them? But gone wonky in the attempt.

    Presumably in Google's corpus there are big gobbets of lists like that, from various governmental/corporate edicts.

    [(myl) Good point. But then Russia sneaks in…]

  2. Ellen Kozisek said,

    April 10, 2017 @ 6:18 pm

    The translation of the 17 item list is missing Dominican Republic, and repeats El Salvador.

    @AntC: The USA isn't moved in the full long list (posted first). The translation for USA is right there at the end where it should be: EE.UU. The translation for United States of America, which is not in the list (USA is, as noted), is an extra.

  3. D.O. said,

    April 10, 2017 @ 10:14 pm

    I blame soccer. That's about the only place where Israel, Puerto Rico, and Scotland can be part of the same list.

  4. John Burke said,

    April 10, 2017 @ 10:24 pm

    Surely there's a distant echo of the time when the list went "USA, USSR"?

  5. rosie said,

    April 11, 2017 @ 1:37 am

    Also note that the Spanish list omits some countries that are in the English list, and repeats Honduras and Guatemala.

    I also note an anomaly in the English list other than the Costa Rica noted by AntC: the position of England shows that it was substituted for Great Britain.

  6. Leon said,

    April 11, 2017 @ 1:42 am

    Google Translate recently switched to using deep learning methods rather than phrase-based translation for the most common language pairs, including English-to-Spanish.

    I'm afraid the behavior of these recurrent neural nets is rather less amenable to human analysis than phrase-based systems. Basically you feed a sentence in and you get a sentence out. (Notice how you used to be able to mouse over phrases in Google Translate and see the corresponding phrase highlighted in the other language, whereas now you can only do that at the sentence level.) How to understand what's going on in any more detail than that is an area of active research.

  7. Andrew Usher said,

    April 11, 2017 @ 5:50 am

    Google Translate has always given goofy results; this is no improvement. Since their methods are kept secret we can do nothing to help.

    The undesired repetitions show that it can't merely be regurgitating a list from someplace; in traditional software language this could only be called a 'bug'. But I'm sure Google believes they've moved beyond the traditional paradigm, like the damnable Amazon.

    k_over_hbarc at yahoo.com

  8. zoetrope said,

    April 11, 2017 @ 6:55 am

    re Russia sneaking in: it's actually "Federación de Rusia" that sneaks in, thus preserving the alphabetical order, at least up until Nueva Zelandia.

    Also wondering what is going on with all the repetitions of Guatemala and Honduras.

  9. Ralph Hickok said,

    April 11, 2017 @ 6:59 am

    I don't see Canada in the translation. Could it be that either "Estados Unidos de America" or "EE.UU" is meant to be Canada? It seems unlikely but not impossible.

  10. Eric Fahlgren said,

    April 11, 2017 @ 10:55 am

    Wow, you can really cause gtranslate to flip out by replacing the first and last commas in the given list with carriage returns, causing the output to become:

    Costa Rica
    Eslovaquia Eslovaquia Eslovenia España Eslovaquia Eslovenia España Eslovaquia Eslovenia España Eslovaquia Eslovenia España Eslovaquia Eslovenia España Eslovaquia Eslovenia España Eslovaquia Eslovenia Eslovenia Eslovenia Eslovenia Eslovenia Eslovenia Eslovenia Eslovaquia Suiza, España, Suecia, Uruguay, Venezuela
    Estados Unidos

  11. John Burke said,

    April 11, 2017 @ 1:26 pm

    @Eric Fahlgren: I think that list more or less demands the response "¡¡¡¡¡GO-O-O-O-OL!!!!"

  12. Robert Ayers said,

    April 11, 2017 @ 1:59 pm

    Follow-up to Fahlgren: I get his worse result (zillions of duplicated "E" countries) by just deleting the initial "Costa Rica".

  13. Chandra said,

    April 11, 2017 @ 2:59 pm

    Translation into French works fine for the original list, but when I remove "Costa Rica" (the only one out of order alphabetically), the French translation inserts a mysterious "Égypte" and deletes the Dominican Republic and Germany.

    [(myl) Some other deletions as well:

    ]

  14. Jamie said,

    April 11, 2017 @ 4:35 pm

    @Leon
    How to understand what's going on in any more detail than that is an area of active research.

    Maybe neural nets could help analyse what is going on …

    [(myl) Analyze, certainly. Illuminate? Not likely, at present.]

  15. John Chew said,

    April 11, 2017 @ 11:37 pm

    I remember way back, when Google used to translate "東京" (Tokyo, in Japanese) as "London" in English, I'm guessing by being just a little too clever in observing which strings corresponded to each other in similar documents.

  16. Ray said,

    April 11, 2017 @ 11:47 pm

    is google translate case sensitive? (I get a slightly different spanish list when "honduras" is "Honduras")

    also, google seems to be sensitive to whether or not the list ends in a period.

    also, for some strange reason, when the english list is not properly alphabetized, I get a "Puerto Rico ," in the spanish results, but when the english list is properly alphabetized, I get a "Nicaragua, ," in the spanish results

  17. Marja Erwin said,

    April 12, 2017 @ 4:23 pm

    Yes, it can be hard to figure out which cities in their translations correspond to which cities or sometimes institutions in the original text.

    Екатеринослав used to become "LOTS," possibly because correct translations were split between Yekaterinoslav, Ekaterinoslav, and Katerynoslav.

  18. Ray said,

    April 12, 2017 @ 9:43 pm

    I wonder if google translator is somehow using google maps in these renderings of country names. the odd comma spacings in the results (and shifts in clusters) suggest that perhaps some kind of ghostly "gps-ing" is going on — spatial translations based on geographic (latitude, longitude) coordinates, rather than simply straight-up one-on-one verbal translations. hmm…

  19. D.O. said,

    April 12, 2017 @ 9:52 pm

    Екатеринослав used to become "LOTS," possibly because correct translations were split between Yekaterinoslav, Ekaterinoslav, and Katerynoslav.

    Not to mention Dnepropetrovsk and now simply Dnepr in both Russian and Ukrainian.

  20. Jarkko Hietaniemi said,

    April 15, 2017 @ 4:55 am

    This is a result of using the neural net technology. What this basically stems from is that the nets are trained using masses of text, and they try to find correlations, both intralanguage and interlanguage. For example "I live in France" and "I live in Spain" are semantically very close, the only difference being the country. This is not unique to country names: Google Translate used to convert "Urho Kekkonen" (a Finnish president for decades) into "George Washington", but it seems that they have fixed that glitch.

  21. mg said,

    April 16, 2017 @ 1:53 am

    Still better than Facebook, which a year ago insisted on translating חַג שָׂמֵחַ (Hebrew for "happy holiday", the usual greeting for Jewish holidays) into "Merry Christmas"! Needless to say, those of us who were celebrating Chanukah were less than pleased by this – especially since all other sources, such as Bing and Google, got it right and there was no way to complain to FB about it. It was especially bad several months later when we were trying to give that greeting for Passover and it was still being translated as Merry Christmas!

    My guess is that it was partly due to all the "war on Christmas" nonsense, because their attempt at finally fixing it was to no longer offer to translate it rather than translating it correctly.

  22. Jarkko Hietaniemi said,

    April 16, 2017 @ 3:01 am

    This might be interesting reading:

    https://www.technologyreview.com/s/604087/the-dark-secret-at-the-heart-of-ai/
