"Notes to the financial statements"


From Jenny Chu:

You might be amused by this latest demonstration of Google Translate's ability to transform meaningless character sequences into spoken-word poetry, discovered by my young son.

It is all of the Vietnamese characters, in order of their appearance on the character map, with no spaces. Moreover, if you add all of the other non-diacritic characters on the keyboard, you get "The following is a brief description of each of the available options."

And arranged in the order of this table, the set of possible Vietnamese characters yields things like this:

The input speech synthesis version could be raw material for something interesting, especially layered with the output synthesis in its normal or repeated-slowly-for-better-understanding (or after-a-few-drinks) versions:

Varying the input length yields a considerable variety of outputs, e.g.

Or this:

Or this:

There were 361 distinct characters in my first input, so the number of different orders of different subsets of that input is very large — there are 2^361 = 4.697085e+108 elements in the power set (which is already more than the estimated number of baryons in the visible universe), and there is a variable but often large number of possible orders for each such set, e.g. the number of permutations of all 361 items is 361! = 1.437923258e+768.
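
For the record, those figures are easy to check in a few lines of Python (Decimal avoids the float overflow that 361! would otherwise cause):

```python
import math
from decimal import Decimal

n = 361  # distinct characters in the original input

subsets = Decimal(2 ** n)               # size of the power set
orderings = Decimal(math.factorial(n))  # orderings of all n characters

print(f"2^{n}  = {subsets:.6e}")    # 4.697085e+108
print(f"{n}!   = {orderings:.9e}")  # 1.437923258e+768
```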

Question: What is the subset of English-language strings that Google Translate can generate from all the permutations of all the subsets of the original input? Can we prove that any given English-language string is NOT the output for some permutation of one of those subsets?

These are not questions whose answers matter much in themselves, but the difficulty of answering them illustrates some of the conceptual/theoretical difficulties with modern machine-learning methods.

8 Comments

  1. John Roth said,

    November 19, 2017 @ 8:47 am

My first impression is that you found an Easter Egg, that is, a weird output for unusual input that's intended to be amusing. Alternatively, it's something to do with testing. Those look like canned outputs.

    [(myl) Both ideas are entirely wrong — see the large number of similar examples with very diverse inputs here, and the explanations here.]

  2. Jenny Chu said,

    November 19, 2017 @ 8:59 am

    I assume we are all familiar with https://www.youtube.com/watch?v=3-rfBsWmo0M ?

    [(myl) That's what started the whole thing off for me — "What a tangled web they weave", 3/15/2017]

  3. reader_not_academe said,

    November 19, 2017 @ 12:46 pm

    my hunch is that this is vaguely related to the "i don't know problem," observed in neural chatbots trained just like MT models, except source segments are prompts and targets are responses. these systems tend to fall back to a generally likely response, which is often just "i don't know." i'd theorize that when an NMT model gets input that is way off the mark from anything in its training corpus, it produces "generally" likely content based on the corpus's target distributions.

    it would not be surprising if a significant part of the vietnamese-english corpus behind GMT were made up of contracts, legal texts, standards and the like. or popular crime fiction and movie subtitles. that would explain some of the examples seen here.

    a friend and i ran into this a few months ago in an attempt to train a "crazy" NMT model by garbling the training corpus first: https://jealousmarkup.xyz/texts/neural-mt-frankenstein-chatbot/

  4. stedak said,

    November 20, 2017 @ 6:48 pm

    I tried Vietnamese>English on random vowel strings and found that if the input is plain Latin vowels with no diacritics, then the output is nothing interesting: you just get the same vowel string back, or maybe some stranded HTML tags. I had to put diacritics on at least about 1/3 of the vowels before it started producing English words in the output. This suggests that strings without diacritics are too far from anything in the Vietnamese corpus to be matched at all, even by the obviously loose matching it's using.
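
    A hypothetical reconstruction of that probe (the vowel pools and parameters below are illustrative choices of mine, not stedak's actual script):

    ```python
    import random

    # Build random vowel strings in which a given fraction of the
    # characters carry Vietnamese diacritics. The pools are samples,
    # not an exhaustive inventory of Vietnamese vowels.
    PLAIN = "aeiouy"
    MARKED = "áàảãạâấầẩẫậêếềểễệíìỉĩịóòỏõọôốồổỗộúùủũụýỳỷỹỵ"

    def vowel_string(length, marked_fraction, rng=random):
        chars = []
        for _ in range(length):
            pool = MARKED if rng.random() < marked_fraction else PLAIN
            chars.append(rng.choice(pool))
        return "".join(chars)

    # Per the observation above: below roughly 1/3 marked, the translator
    # just echoes the string back; above it, English starts to appear.
    for frac in (0.0, 0.2, 0.4, 0.6):
        print(f"{frac:.1f}  {vowel_string(60, frac)}")
    ```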

    @reader_not_academe: Google Translate sometimes turns random inputs into generic outputs, but the best ones are the hallucinations, where the result is not trivial and seems to come out of nowhere. I can't resist sharing this one:

    eêeeêạaaạyyyyỷỷyyỷyyỷyyiìiiìiiìiiìiiiìíìỉĩịỉỉỉỉỉiỉiôôôôôôôôôôôổuuụuụuụôăầẳữữữữữữữữữữỗỗỗỗỗỗộộộộộộộằằằằàààýỳỷ

    translated Vietnamese to English:

    For example, if you use the word "yo", you'll find that the word "sañños" comes from the Japanese word for the word "sausage".

    Speaking of sausage, food comes up a lot in the random translations. I would bet that restaurant menus are also a significant part of the training corpus.

  5. Chas Belov said,

    November 21, 2017 @ 1:26 am

    This is catnip for me.

@stedak: Thank you for that delightful character string. If I delete one character at a time from the end, I get some wonderfully wacky results, starting with:

    For example, if you use the word "chocolate", you can use the word "chocolate" to describe the sweetness of the chocolate, with the words "sweet", "sweet", "sweet", and "sweet".

    and often alternating with gibberish strings until around the 35-character mark when it starts just spitting back the string. Also, it starts detecting Kurdish in place of Vietnamese. You can also delete characters from the beginning, leading to things like

    yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurt yogurtster

    at the 148-character mark and

    You can use the word "smiley" in the text.

at 144. Oddly, going in this direction (front-chopping), it has stopped detecting a language and nevertheless is translating. Often I am getting closing strikeout HTML tags:

    y y .

    At 140 there's an encoding hiccup; this is clearly not Unicode:

    yoìi��ià ¢ â,¬â "¢ sa short-term à ¢ â,¬Å" à ¢ â,¬Å "à¢ à ¢ â € šÂ¬Ã ¢ â € šÂ¬ÃƒÂ ¢ à ¢ â € šÂ¬Ã ¢ â € šÂ¬Ã £ â,¬â,¬Ã £ â,¬â,¬Ã £ â,¬Å "

with moments of clarity, as at 135,

    You and your business partner are responsible for ensuring that you are getting the most out of your business.

    and at 132

    You will also enjoy the benefits of having a low-calorie, low-calorie, low-calorie, low-calorie diet.

Oddly, at 131, it claims to still be giving me English, but gives me a lot of Turkish characters:

    ioiii iiıııımışınızınızınızınızınızınızınızınız

    and, lest I exceed the character count for posts, will end at 90 with

    A carol with a mouthwash in the mouth of a robot.
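
    For anyone who wants to replay this sweep, a minimal harness might look like the sketch below. The translate() callable is a hypothetical stand-in, since no API is given here, and the seed is stedak's string from the earlier comment:

    ```python
    # Chop one character at a time from the back, then from the front,
    # logging the remaining length alongside each output.
    seed = "eêeeêạaaạyyyyỷỷyyỷyyỷyyiìiiìiiìiiìiiiìíìỉĩịỉỉỉỉỉiỉiôôôôôôôôôôôổuuụuụuụôăầẳữữữữữữữữữữỗỗỗỗỗỗộộộộộộộằằằằàààýỳỷ"

    def translate(text):
        return text  # stand-in: wire up a real Vietnamese->English MT call here

    for n in range(len(seed) - 1, 0, -1):
        print(n, "back-chopped ", translate(seed[:n]))
        print(n, "front-chopped", translate(seed[-n:]))
    ```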

  6. Chas Belov said,

    November 21, 2017 @ 1:38 am

Oops, the strikeouts didn't show. Basically a repeated "</s> " (less-than symbol, slash, s, greater-than symbol, space).

  7. sean said,

    November 21, 2017 @ 11:37 am

Text encoding side note: in Chas Belov's "encoding hiccup" example above, the text contains the character sequence "�", which is what you get when you take the UTF-8 encoding of the Unicode Replacement Character U+FFFD and interpret those bytes as latin1. The presence of the Replacement Character in the first place indicates that somewhere along the line the text was converted incorrectly and some of it was lost in a non-reversible way. So when you see �, it almost always means the text was mishandled at least twice before it got to you, and it's impossible to fully recover what was originally there.

I've done enough work with encodings that � is forever burned into my brain, along with the corresponding bytes "EF BF BD".
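
    The chain sean describes is easy to reproduce in Python:

    ```python
    # U+FFFD, encoded as UTF-8, then misread as latin1,
    # yields the familiar three-character "ï¿½".
    replacement = "\ufffd"                  # Unicode Replacement Character
    as_bytes = replacement.encode("utf-8")  # b'\xef\xbf\xbd'
    mojibake = as_bytes.decode("latin-1")

    print(as_bytes.hex(" ").upper())  # EF BF BD
    print(mojibake)                   # ï¿½
    ```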

  8. stedak said,

    November 21, 2017 @ 12:06 pm

@Chas Belov And of course you get similar results if you delete random letters out of the middle as well. I think this demonstrates that there isn't any substring in there that means "chocolate" or "yogurt". If there were, it would still be there when you deleted a single letter far away in the sentence. But in fact, changing a single letter — or putting a period at the end — often changes almost every word in the whole sentence.

    This is totally alien to human language understanding, according to Mark Liberman's explanation (in a different context, but still relevant):

    "language works because it's compositional — the (literal) meaning of larger messages is a predictable function of the meaning of their parts and the way that the parts are combined."

This is what's violated by hallucinatory translations. It makes me wonder if hallucinations could be identified algorithmically, by checking whether the parts of the output match anything in the input.
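
    One hypothetical way to operationalize that check, with the translation call left abstract: score how much the output changes under single-character deletions of the input, which is exactly the perturbation explored in the comments above.

    ```python
    import difflib

    def output_stability(translate, text):
        """Average similarity between the translation of `text` and the
        translations of its single-character deletions. Scores near 1.0
        suggest compositional behavior; scores near 0.0 suggest the kind
        of hallucination described above. `translate` is any str -> str
        callable; none is supplied here."""
        base = translate(text)
        scores = []
        for i in range(len(text)):
            perturbed = text[:i] + text[i + 1:]
            scores.append(
                difflib.SequenceMatcher(None, base, translate(perturbed)).ratio())
        return sum(scores) / len(scores)
    ```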
