A Japanese-French Google Translate mixup


From an anonymous correspondent:

An amusing translation glitch: Google translates the Japanese word "migaku 磨く" (to polish, to brush) to the French word "polonais" (Polish, as in "of Poland").

The full translation party: "migaku 磨く" → "polonais" → "kenma 研磨" (polishing, grinding) → "polissage" (proper French for polishing).

I'm not sure how the machine decided to equate "of Poland" and "to brush". Does it hop through English, where the homonymy of "polish" and "Polish" would explain the confusion?

The links for these stages are here, here, and here.






15 Comments

  1. Laura Morland said,

    July 13, 2020 @ 7:20 am

    "Does it hop through English, where the homonymy of "polish" and "Polish" would explain the confusion?"

    That's exactly what I've heard. The system isn't programmed to translate between any two "random" languages, and so many languages are processed through English as an intermediary.

    Exceptions would presumably be FrenchSpanish, SpanishPortuguese… surely one of your learned readers knows for certain.

    While we're on the topic, I personally find Google Translate less useful than I did 10 years ago, and for the following reason:

    I live in France, and I write emails (occasionally letters) in French several times a day. I long ago developed the habit of composing any important missive in French and then dropping it into Google Translate to render into English. Any errors or infelicities would be clearly evident, and then I'd go about fixing them.

    Now, unfortunately, Google Translate "guesses" many of my mistakes and silently corrects them. I understand that this feature helps more people than it hurts, but now I'm on the hunt for a clunkier translation machine.

  2. Laura Morland said,

    July 13, 2020 @ 7:24 am

    P.S. In the above comment, I'd typed 'lesser-than' + 'greater-than' symbols bracketing an equals sign (between French & Spanish; Spanish & Portuguese), but your system erased them! More nefarious "translation" at work….

  3. Terpomo said,

    July 13, 2020 @ 7:35 am

    Indeed it does mostly work via English, but there are a few language pairs that it translates directly, like Japanese and Korean. I think Catalan and Ukrainian are even always translated by way of Spanish and Russian, respectively, just because there's so much more bilingual data for those pairs.

  4. cameron said,

    July 13, 2020 @ 7:45 am

    @Laura Morland: the use of lesser-than ( < ) and greater-than ( > ) signs is always going to be tricky in a context like this, because the system supports some markup in the comments, and those symbols have special uses in the markup language.

  5. M. Paul Shore said,

    July 13, 2020 @ 10:58 am

    One point of terminology: The anonymous correspondent’s phrase “the homonymy of ‘polish’ and ‘Polish’” really ought to read “the homography or near-homography of ‘polish’ and ‘Polish’”. Those two words, while homographs to the extent we exclude the detail of capitalization, are not homonyms, since they’re not homophones: rather, they’re heterophones, and are therefore classified as heteronyms (i.e., differing in both pronunciation and meaning even though spelled the same, or sort of the same).

    It’s possible that, if the anonymous correspondent is not a native speaker of English, but rather is perhaps a native Francophone or Japanophone, he or she doesn’t realize that the two words are pronounced differently.

  6. Thomas Rees said,

    July 13, 2020 @ 12:04 pm

    There are lots of arrows available: ⇔ ; ↔︎ ; ⇄ ; ⇋ ; and so forth

  7. Michael Watts said,

    July 13, 2020 @ 8:39 pm

    the use of lesser-than ( < ) and greater-than ( > ) signs is always going to be tricky in a context like this, because the system supports some markup in the comments, and those symbols have special uses in the markup language.

    Well, to be more accurate, the system is set up to accept HTML markup which it then renders as HTML. This is a huge mistake that early programmers reliably avoided and modern programmers constantly make.

    For example, the escape character (something to indicate that you don't want to be interpreted literally) for C strings as understood by the C compiler is the backslash, "\". The string "\a" is one character long, not two, and \a is code for ringing the computer's internal bell, where "a" would just be the letter "a".

    The escape character for strings as understood by the C function printf is not "\" but "%". Thus, the string "%d" is two characters long to the C compiler, and those characters are a percentage sign and a d. But it is one entity long to printf, and that entity is a command to print a numeric value specified elsewhere.

    At some point, we lost track of this and made \ the escape character for everything. In Java, the escape character for strings is the traditional "\". The escape character for regular expressions is also "\". Regular expressions have to be defined from strings, so the regular expression \s* ("any amount of whitespace") is actually defined, in the code, by the string "\\s*". The backslash must be doubled because first, the compiler processes the sequence "\\" down to "\", and then the regular expression engine interprets "\s" as "any whitespace". If this went through a third layer, it would look like "\\\\s*". Exponential growth in the escape character would be completely avoided by just using different escape characters for different levels of interpretation.

    (And if you're wondering how to produce < and >, you need to type "&lt;" or "&gt;".)
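The entity trick in that last parenthesis can be sketched in a few lines. This is a minimal illustration in Python using the standard library's `html` module; the sample text is an assumption about what the earlier commenter typed, and nothing here describes how this site's comment software actually works.

```python
import html

# Assumed input: angle brackets around an equals sign, as described upthread.
typed = "French <=> Spanish"

# Escaping replaces the markup-significant characters with entities,
# so a browser displays them instead of trying to parse them as a tag.
escaped = html.escape(typed)
print(escaped)

# Unescaping recovers the text exactly as it was typed.
assert html.unescape(escaped) == typed
```

A system that escapes comment text this way before rendering would show the brackets; one that passes the text straight through as HTML leaves the browser to treat `<=>` as a malformed tag.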

  8. Michael Watts said,

    July 13, 2020 @ 8:43 pm

    Ah, I see that my comment about an exponential explosion in backslashes was mangled by the comment software interpreting doubled backslashes as single backslashes. To correct for display:

    Regular expressions have to be defined from strings, so the regular expression \s* ("any amount of whitespace") is actually defined, in the code, by the string "\\s*".

    If this went through a third layer [as it's doing right here, right now!], it would look like "\\\\s*". Exponential growth in the escape character would be completely avoided by just using different escape characters for different levels of interpretation.

    (To be very clear, the example in that second paragraph, with four backslashes displayed, had to be typed with eight of them.)
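The layering described in the two comments above can be checked directly. The commenter's examples are C and Java; the sketch below uses Python instead, on the assumption that its string literals, printf-style formatting, and `re` module behave the same way for these particular cases. The variable names are illustrative only.

```python
import re

# Layer 1: the string-literal parser uses the backslash as its escape character.
bell = "\a"                      # one character: the ASCII bell, code 7
assert len(bell) == 1 and ord(bell) == 7

# Layer 2: printf-style formatting uses the percent sign instead.
fmt = "%d"                       # two characters as far as layer 1 is concerned
assert len(fmt) == 2
assert fmt % 42 == "42"          # one formatting directive as far as layer 2 is concerned

# The regex engine reuses the backslash, so the pattern \s* is typed "\\s*".
layer_two = "\\s*"
assert len(layer_two) == 3       # backslash, s, asterisk
assert re.fullmatch(layer_two, " \t\n") is not None

# One more backslash-consuming layer doubles the count again.
layer_three = "\\\\s*"           # four typed, two stored
assert layer_three.count("\\") == 2

# A raw string literal switches layer 1 off, which avoids the doubling.
assert r"\s*" == layer_two
```

The raw-string form at the end is one common way around the problem: rather than giving each layer its own escape character, it lets the writer disable the outermost layer for a single literal.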

  9. AG said,

    July 14, 2020 @ 12:30 am

    I think I've found another one:

    "sow" [ambiguous] =>
    "truie" [female pig] =>
    "種をまく" [plant seeds, has nothing to do with the French]

  10. Bathrobe said,

    July 14, 2020 @ 12:40 am

    Google Translate had always used English as a pivot in translating between languages, which is widely known to result in problematic translations. The justification is that there is not a large enough corpus of existing translations between, say, Thai and Swahili, to train Google Translate on.

    I had assumed that the advent of AI-based translation would put an end to those bad old days, but apparently not. It appears that not even AI can be trained to translate via a pivot (that is, learn to make sense of translations of translations).

  11. Bathrobe said,

    July 14, 2020 @ 12:51 am

    As for the diminished usefulness of Google Translate, I'm afraid I agree with you. I now find that Google Translate actually drops whole clauses of the original when translating. It simply leaves them out as though they didn't exist, even though they may be essential to making sense of a passage. Totally unreliable.

  12. Michael Watts said,

    July 14, 2020 @ 2:29 am

    It appears that not even AI can be trained to translate via a pivot (that is, learn to make sense of translations of translations).

    In my view, this is related to the idea that translating should be done as an operation on text rather than an operation on meaning. You'll have trouble translating via a pivot when the pivot is a natural language, because natural languages are ambiguous. If the pivot is unambiguous — say, a representation of a state of the world, such as humans use to comprehend utterances — then the problem disappears.
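A toy sketch of the contrast drawn in the comment above, in Python. Every dictionary, sense label, and function name here is invented for illustration; it is not Google Translate's data or method, only a picture of why an ambiguous text pivot can go wrong where an unambiguous one cannot.

```python
# Hypothetical toy lexicons; invented for this example.
JA_TO_EN = {"磨く": "polish"}

# Once capitalization is ignored, the English string has two French candidates.
EN_TO_FR = {"polish": ["polonais", "polir"]}

# A meaning-based pivot keeps the two senses apart (labels are made up).
JA_TO_SENSE = {"磨く": "MAKE_SHINY"}
SENSE_TO_FR = {"MAKE_SHINY": "polir", "OF_POLAND": "polonais"}


def pivot_via_text(word):
    """Translate through an English string, blindly taking the first candidate."""
    english = JA_TO_EN[word].lower()
    return EN_TO_FR[english][0]


def pivot_via_sense(word):
    """Translate through an unambiguous sense label."""
    return SENSE_TO_FR[JA_TO_SENSE[word]]


print(pivot_via_text("磨く"))    # the text pivot lands on the wrong French word
print(pivot_via_sense("磨く"))   # the sense pivot keeps the intended meaning
```

The text pivot fails only because the toy picks the first candidate without context; a real system weighs candidates statistically, so this shows the shape of the problem rather than its actual mechanics.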

  13. M. Paul Shore said,

    July 14, 2020 @ 8:05 am

    Just to clarify poster AG’s somewhat unclear posting of early this morning, it concerns a French-to-Japanese Google Translate error he or she has discovered that presumably results from the use of English as an intermediate, and that’s analyzable as follows:

    Fr. “[la] truie” (female pig) =>
    Eng. “sow” (ambiguous in its written form) =>
    Jap. "種をまく" (plant seeds; has nothing to do with the French)

  14. AG said,

    July 14, 2020 @ 8:20 am

    M. –

    Yes, I was pressed for time earlier and didn't happen to know the French term for a sow, so I started with "sow" and Google Translate-d it from ENG=>FR=>JP and wrote all three down quickly in order to immortalize my earth-shattering discovery.

    I think your explanation describes what probably happened in a way that more closely fits the wording of the original post, thanks.

  15. Philip Taylor said,

    July 14, 2020 @ 9:41 am

    Michael, I am totally unclear why you (a) write "the system is set up to accept HTML markup which it then renders as HTML. This is a huge mistake that early programmers reliably avoided and modern programmers constantly make", and then (b) go on to talk about something completely unrelated, the use of backslash as an escape character.

    (a) Why do you believe that "a system [which] is set up to accept HTML markup which it then renders as HTML" is a huge mistake, and (b) what connection do you see between this and your digression concerning backslashes as escape characters?
