English as Afrikaans?
2021-07-22
Language-identification from digital text has been a solved problem for many years, so I was surprised yesterday to see Gmail offering to translate from Afrikaans an email written in perfectly idiomatic English:
The body of the email had a few acronyms and Israeli place-names, and it did mention "Hebrew", but those features don't help solve the mystery of why Gmail assigned it to the category of Afrikaans. My guess is that it's one of the (essentially inexplicable) vagaries of modern deep-learning technology, but who knows?
Crprod said,
July 22, 2021 @ 12:46 pm
This confusion of Afrikaans and English has happened to me.
David Marjanović said,
July 22, 2021 @ 12:48 pm
taggers looks kinda Afrikaans…? Just guessing, I don't know.
DMcCunney said,
July 22, 2021 @ 12:57 pm
If I had to guess, I'd say that behavior was prompted by where Gmail thought the email was sent *from*, and it would get that idea from the originating IP address in the email header. That would happen before any parsing routines tried to look at the message body. And that likely would not happen until the user asked for a translation.
No deep learning needed. There is an assortment of websites where you can plug in an IP address and the site will tell you where the holder of the address is located. (I have a Firefox extension to do that.)
(What I'm bemused by at the moment is the amount of Swedish porn email showing up correctly labeled Spam. Huh? How did I get on *that* spam list? I was previously getting a lot of Turkish messages labeled spam, which likely weren't if you were a Turk. They were invitations to various events concerned with business and economics. Again, why was I getting them?)
The nice thing about Gmail is that I largely don't *care* about spam email. If one actual spam message that hasn't been correctly filtered appears in my Inbox every two weeks it's a lot. Click Report Spam, and I don't see mail from that source again. I delete the spam regularly, but check for occasional false positives, and save particularly entertaining 419 spams.
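For anyone who wants to test the originating-IP guess above on a message of their own, here is a minimal sketch in Python. It is only an illustration of reading the `Received` headers, not a description of anything Gmail is known to do. The function name `originating_ip`, the sample message, and its host names are all hypothetical, and the sketch assumes the sending hop records its address in square brackets, which many mail servers do but not all. Looking up a location for the address would need an external geolocation service and is left out.

```python
import ipaddress
import re
from email import message_from_string

# Matches a bracketed IPv4 address such as [8.8.8.8] inside a Received header.
IPV4_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def originating_ip(raw_message):
    """Return the earliest public IPv4 address found in the Received headers,
    or None if there is none. Hypothetical helper for illustration only."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    # Each relay prepends its own Received header, so the last header in the
    # message is the earliest hop. Walk from the earliest hop forward.
    for header in reversed(received):
        for candidate in IPV4_IN_BRACKETS.findall(header):
            try:
                addr = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # e.g. an octet above 255
            if addr.is_global:  # skip private and reserved ranges
                return str(addr)
    return None


# Made-up message; the host names and addresses are placeholders.
SAMPLE = (
    "Received: from mx.example.net (mx.example.net [10.0.0.5])\n"
    "\tby inbox.example.org with ESMTP; Thu, 22 Jul 2021 12:00:02 +0000\n"
    "Received: from sender.example.com (sender.example.com [8.8.8.8])\n"
    "\tby mx.example.net with ESMTP; Thu, 22 Jul 2021 12:00:01 +0000\n"
    "Subject: test\n"
    "\n"
    "Hello\n"
)

if __name__ == "__main__":
    print(originating_ip(SAMPLE))
```

Note that these headers can be forged or rewritten along the way, and mail sent through a webmail service often shows only the provider's servers, so an address recovered this way says little about where the writer actually was.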