English as Afrikaans?


Language identification from digital text has been a solved problem for many years, so I was surprised yesterday to see Gmail offering to translate from Afrikaans an email written in perfectly idiomatic English, which started this way:

The rest of the email had a few acronyms and Israeli place-names, and it did mention "Hebrew", but those features don't help solve the mystery of why Gmail assigned it to the category of Afrikaans. My guess is that it's one of the (essentially inexplicable) vagaries of modern deep-learning technology, but who knows?

 



9 Comments »

  1. Crprod said,

    July 22, 2021 @ 12:46 pm

    This confusion of Afrikaans and English has happened to me.

  2. David Marjanović said,

    July 22, 2021 @ 12:48 pm

    taggers looks kinda Afrikaans…? Just guessing, I don't know.

  3. DMcCunney said,

    July 22, 2021 @ 12:57 pm

    If I had to guess, I'd say that behavior was prompted by where Gmail thought the email was sent *from*, and it would get that idea from the originating IP address in the email header. That would happen before any parsing routines tried to look at the message body. And that likely would not happen until the user asked for a translation.

    No deep learning needed. There is an assortment of websites where you can plug in an IP address and the site will tell you where the holder of the address is located. (I have a Firefox extension to do that.)

    (What I'm bemused by at the moment is the amount of Swedish porn email showing up correctly labeled Spam. Huh? How did I get on *that* spam list? I was previously getting a lot of Turkish messages labeled spam, which likely weren't if you were a Turk. They were invitations to various events concerned with business and economics. Again, why was I getting them?)

    The nice thing about Gmail is that I largely don't *care* about spam email. If one actual spam message that hasn't been correctly filtered appears in my Inbox every two weeks it's a lot. Click Report Spam, and I don't see mail from that source again. I delete the spam regularly, but check for occasional false positives, and save particularly entertaining 419 spams.

    [(myl) This particular email was sent via a Gmail account from an IP address in Philadelphia.]

  4. Marinus Ferreira said,

    July 22, 2021 @ 3:16 pm

    As a native Afrikaans speaker, I can say that this resembles Afrikaans even less than most English messages this length would. 'I' is not a word in Afrikaans except for naming the letter of the alphabet and should immediately tell an AI this is almost certainly English. 'th', 'ea' and 'wh' do not represent phonemes in Afrikaans, and would only occur where two morphemes are joined together. The same goes for 'ai' (which only occurs as part of 'aai'), and 'sch' can only occur in proper names from another language. It does not contain most of the most common words in Afrikaans, including most of those which also occur in English like 'is' (but does contain 'in'), nor other common words like 'ek', 'jy', 'was', 'sal', 'het', nor any past participle, all of which are formed with 'ge-'.

    I would have guessed, like DMcCunney, that it would be based on where the message is sent from. But, failing that, I have another hypothesis. I and other Afrikaans speakers routinely use the Google keyboard when typing Afrikaans, but it doesn't work for Afrikaans only: when you use it, it recommends words from both English and Afrikaans (and doesn't pretend to do anything else; it even says 'AF[rikaans] – EN[glish]' on it). If the AI uses that input as part of the training data, and I'd be surprised if it didn't, the training data for Afrikaans recognition may be so polluted with English (when people like me use the keyboard to input English even in AF – EN mode) that this may result. It would be a surprising mistake for the system to make, but an intelligible one.

  5. Alison said,

    July 22, 2021 @ 7:24 pm

    The boring explanation is that Afrikaans is probably just the first language in the list, and some bug caused the first one to get selected by default.

    [(myl) This is the most plausible theory so far.]

  6. Brett said,

    July 22, 2021 @ 8:35 pm

    Google misidentified English in my browser as Icelandic yesterday, but that was more understandable, since the text included passages from Beowulf.

  7. PeterL said,

    July 22, 2021 @ 10:42 pm

    If they're using n-gram statistics for language identification, a short piece of text (such as the referenced email) could easily get a wrong result. And improving this with Bayesian priors can be tricky, given how common English is. Even if the mail has RFC 3282 language codes, they're of marginal use because a lot of software simply sets the language to default values and doesn't provide a good way for the user to set the language code. Source-IP probably isn't useful, as few people run mail relays on their personal computers and many use web-based mail.

    [(myl) As I tried to indicate, the image is just the email's intro. It comprises 225 characters, but I'll bet a month's salary that no decent character n-gram language-ID method would classify even that fragment as Afrikaans. The whole email was 1905 characters, and I'll bet a year's salary against Afrikaans as the outcome in that case. In fact, even a naive Bayes 1-gram classifier would prefer English to Afrikaans; see here for a discussion of the method from a first-year ugrad seminar in 2014. …and here's a plot:

    ]
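For readers who want to see what a naive Bayes 1-gram (single-character) classifier looks like, here is a minimal sketch. It is not the method from the seminar mentioned above and certainly not whatever Gmail uses. The two training sentences are invented toy data, far too small for real use, and the names `train` and `classify` are hypothetical. The optional `priors` argument illustrates the point about Bayesian priors: when the text is very short, the prior decides the outcome.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase

# Toy training text invented for this sketch (an assumption, not real data).
TRAINING = {
    "en": "I was surprised to see the email, which was written in perfectly clear English.",
    "af": "Ek het gister die boek gelees en dit was baie goed. Jy sal vanaand saam met ons kom.",
}

def letters(text):
    """Keep only the letters a-z, lowercased; everything else is ignored."""
    return [c for c in text.lower() if c in ALPHABET]

def train(samples):
    """Estimate log P(letter | language) with add-one smoothing."""
    models = {}
    for lang, text in samples.items():
        counts = Counter(letters(text))
        total = sum(counts.values()) + len(ALPHABET)
        models[lang] = {c: math.log((counts[c] + 1) / total) for c in ALPHABET}
    return models

def classify(text, models, priors=None):
    """Return the language maximizing log prior + sum of letter log-likelihoods."""
    if priors is None:
        priors = {lang: 1 / len(models) for lang in models}
    scores = {
        lang: math.log(priors[lang]) + sum(model[c] for c in letters(text))
        for lang, model in models.items()
    }
    return max(scores, key=scores.get)

MODELS = train(TRAINING)
```

With a realistic amount of training text per language, the same scoring rule is applied unchanged; only the letter (or n-gram) statistics get better.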

    Of course there are better methods of language guessing than n-grams; but when you look at the volume of mail handled by Google or Microsoft or other large email providers, performance becomes a serious issue.

    And this assumes that the programmers are fully versed in the subtleties of dealing with multiple languages. This is definitely not the case: many programmers barely understand the distinction between language and country. For example, if I set my Android phone to "UK English" because I prefer map directions in that voice, I start getting British news when I use the Google News app. And when I add Japanese as an alternative language (to improve the quality of search queries), Street View starts showing me street names in katakana …

    Part of this is the confusion between "localization" and "locale". (Locale = language + region, e.g. "en-US".) If you want to know more, you can start with ISO 3166 and ISO 639 and quickly descend into madness.
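To make the "locale = language + region" point concrete, here is a small sketch that keeps the two parts separate. The `Locale` class and `parse_locale` function are hypothetical names for illustration, and the parsing is deliberately simplistic: it assumes a tag of the form language, optionally followed by region, and ignores script subtags and everything else a real language-tag parser has to handle.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Locale:
    language: str            # ISO 639 language code, e.g. "en"
    region: Optional[str]    # ISO 3166 region code, e.g. "GB", or None if absent

def parse_locale(tag: str) -> Locale:
    """Split a simple tag such as "en-US" or "en_GB" into language and region."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    region = parts[1].upper() if len(parts) > 1 else None
    return Locale(language, region)
```

Software that stores the two fields separately can, for example, pick a voice from `language` without also inferring which country's news to show from `region`.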

    Here's a simplistic guide to localization: https://upload.wikimedia.org/wikipedia/commons/7/7b/Falsehoods_programmers_believe_about_languages.pdf
    (There are similar guides to dealing with names, addresses, time, etc. – some of them are listed here: https://spaceninja.com/2015/12/08/falsehoods-programmers-believe/)

  8. Daniel Hershcovich said,

    July 23, 2021 @ 7:48 pm

    Like many other mysteries in NLP today, the answer to why a neural network learned something is that… that's what we taught it. The training data for Google's classifier is very likely to include at least some misidentified text, especially for languages which their engineers don't know or don't care enough about to thoroughly inspect. This is not always easy to solve, especially in the case of code switching where you'd have to cut up the text somehow or ignore many training examples. More data is always better, isn't it? Google certainly has more than enough data. It's just not always accurate.

  9. Michael Watts said,

    July 24, 2021 @ 5:03 pm

    This is not always easy to solve, especially in the case of code switching where you'd have to cut up the text somehow or ignore many training examples.

    In principle, it's not especially difficult to assign a language to each individual word of a text.
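As a rough illustration of per-word language assignment, here is a sketch based on dictionary lookup. The two word lists are tiny invented samples (an assumption; a real system would need full lexicons or a statistical model), and `tag_words` is a hypothetical name. Words found in both lists are reported as ambiguous rather than guessed at.

```python
import re

# Tiny illustrative lexicons invented for this sketch; not real resources.
AFRIKAANS = {"ek", "jy", "sal", "het", "die", "is", "in", "was"}
ENGLISH = {"i", "the", "which", "and", "email", "is", "in", "was"}

def tag_words(text):
    """Return (word, label) pairs; label is 'af', 'en', 'ambiguous' or 'unknown'."""
    tagged = []
    # Runs of letters only: digits, underscores and punctuation are skipped.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        in_af = word in AFRIKAANS
        in_en = word in ENGLISH
        if in_af and in_en:
            label = "ambiguous"
        elif in_af:
            label = "af"
        elif in_en:
            label = "en"
        else:
            label = "unknown"
        tagged.append((word, label))
    return tagged
```

The ambiguous and unknown labels are where such a scheme needs more than lookup, for instance the labels of neighbouring words.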
