Lorem China


Brian Krebs, "Lorem Ipsum: Of Good & Evil, Google & China", Krebs on Security 8/14/2014:

Imagine discovering a secret language spoken only online by a knowledgeable and learned few. Over a period of weeks, as you begin to tease out the meaning of this curious tongue and ponder its purpose, the language appears to shift in subtle but fantastic ways, remaking itself daily before your eyes. And just when you are poised to share your findings with the rest of the world, the entire thing vanishes.

It all started a few months back when I received a note from Lance James, head of cyber intelligence at Deloitte. James pinged me to share something discovered by FireEye researcher Michael Shoukry and another researcher who wished to be identified only as “Kraeh3n.” They noticed a bizarre pattern in Google Translate: When one typed “lorem ipsum” into Google Translate, the default results (with the system auto-detecting Latin as the language) returned a single word: “China.”  

Capitalizing the first letter of each word changed the output to “NATO” — the acronym for the North Atlantic Treaty Organization. Reversing the words in both lower- and uppercase produced “The Internet” and “The Company” (the “Company” with a capital “C” has long been a code word for the U.S. Central Intelligence Agency). Repeating and rearranging the word pair with a mix of capitalization generated even stranger results. For example, “lorem ipsum ipsum ipsum Lorem” generated the phrase “China is very very sexy.”

Variations on the "Lorem ipsum" text produced even more bizarre results.

Krebs reports a wild and wonderful theory about all this:

Kraeh3n said she’s convinced that the lorem ipsum phenomenon is not an accident or chance occurrence.

“Translate [is] designed to be able to evolve and to learn from crowd-sourced input to reflect adaptations in language use over time,” Kraeh3n said. “Someone out there learned to game that ability and use an obscure piece of text no one in their right mind would ever type in to create totally random alternate meanings that could, potentially, be used to transmit messages covertly.”

Meanwhile, Shoukry says he plans to continue his testing for new language patterns that may be hidden in Google Translate.

“The cleverness of hiding something in plain sight has been around for many years,” he said. “However, this is exceptionally brilliant because these templates are so widely used that people are desensitized to them, and because this text is so widely distributed that no one bothers to question why, how and where it might have come from.”

Google's explanation makes more sense to me, though it's not nearly as much fun:

Just before midnight, Aug. 16, Google Translate abruptly stopped translating the word “lorem” into anything but “lorem” from Latin to English. […] A spokesman for Google said the change was made to fix a bug with the Translate algorithm (aligning ‘lorem ipsum’ Latin boilerplate with unrelated English text) rather than a security vulnerability.

The comments on Brian's post include some other amusing examples, like the fact that not all fragments of the Lorem ipsum passage have been fixed — here's my own screenshot from this morning:

… and the fact that even the original fragments still work going from English to Latin (again, a screenshot from a few minutes ago):

As other commenters explain, it's pretty obvious why a statistical MT algorithm would do this kind of thing, given what an unsuspecting automated finder of apparently parallel text is likely to come up with in the way of Latin/English training material. At some point, Google will manage the harder job of purging all instances of Lorem ipsum text from its training data, and then this particular source of amusement will mostly be gone.
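To make the commenters' explanation concrete, here is a minimal Python sketch of how a naive co-occurrence count over "apparently parallel" text could end up pairing lorem with an unrelated English word. The sentence pairs are invented for illustration, the function names are hypothetical, and this is not a description of Google's actual algorithm, which is far more elaborate.

```python
from collections import Counter, defaultdict

def cooccurrence_table(pairs):
    """For each source word, count the target words it shares a segment with."""
    table = defaultdict(Counter)
    for src, tgt in pairs:
        tgt_words = set(tgt.lower().split())
        for s in set(src.lower().split()):
            table[s].update(tgt_words)
    return table

def best_guess(table, word):
    """Return the target word that co-occurs most often with `word`, if any."""
    counts = table.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Invented toy data: filler text left in a page template on the "Latin" side,
# real (and unrelated) English content on the other side.
toy_pairs = [
    ("lorem ipsum", "china news"),
    ("lorem ipsum dolor", "china travel"),
    ("lorem ipsum", "china company"),
]

table = cooccurrence_table(toy_pairs)
print(best_guess(table, "lorem"))
```

With so little genuine Latin/English material to outvote it, even a handful of misaligned pages like these can dominate the counts for a word that appears nowhere else.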

For those few who may not know what Lorem ipsum is, Wikipedia explains that

In publishing and graphic design, lorem ipsum is a filler text commonly used to demonstrate the graphic elements of a document or visual presentation. Replacing meaningful content that could be distracting with placeholder text may allow viewers to focus on graphic aspects such as font, typography, and page layout.

The lorem ipsum text is typically a scrambled section of De finibus bonorum et malorum, a 1st-century BC Latin text by Cicero, with words altered, added, and removed such that it is nonsensical, improper Latin.

A variation of the ordinary lorem ipsum text has been used in typesetting since the 1960s or earlier, when it was popularized by advertisements for Letraset transfer sheets. It was introduced to the Information Age in the mid-1980s by Aldus Corporation, which employed it in graphics and word processing templates for its desktop publishing program, PageMaker, for the Apple Macintosh.

The typical Lorem ipsum passage is a munged derivative of a part of I.10.32 of Cicero's work, starting with the last five letters of the accusative form dolorem, and picking up and adding letters (as indicated below until I lost interest):

Sed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam eaque ipsa, quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt, explicabo. nemo enim ipsam voluptatem, quia voluptas sit, aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos, qui ratione voluptatem sequi nesciunt, neque porro quisquam est, qui dolorem ipsum, quia dolor sit, amet, consectetur, adipisci[ng] velit, sed qu[d]ia[m] non num[my]quam eius modi tempora incidunt, ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? quis autem vel eum iure reprehenderit, qui in ea voluptate velit esse, quam nihil molestiae consequatur, vel illum, qui dolorem eum fugiat, quo voluptas nulla pariatur? At vero eos et accusamus et iusto odio dignissimos ducimus, qui blanditiis praesentium voluptatum deleniti atque corrupti, quos dolores et quas molestias excepturi sint, obcaecati cupiditate non provident, similique sunt in culpa, qui officia deserunt mollitia animi, id est laborum et dolorum fuga. et harum quidem rerum facilis est et expedita distinctio. nam libero tempore, cum soluta nobis est eligendi optio, cumque nihil impedit, quo minus id, quod maxime placeat, facere possimus, omnis voluptas assumenda est, omnis dolor repellendus.

I have no idea why they didn't just use an unmunged chunk of Cicero — but no doubt one of our erudite commentators can enlighten us.

 



31 Comments

  1. Dick Margulis said,

    August 20, 2014 @ 6:05 am

    Wikipedia notwithstanding, I think the use of lorem ipsum by font designers and font cutters predates the 1960s by a few decades at least, although it may be true that its adoption by graphic designers didn't come until later.

    The success of a font design is highly dependent on very subtle spacing parameters. Now imagine that instead of specifying kerning values in a computer font, where you can adjust the spacing between an arbitrary number of letter pairs, you have to plop each separate letter onto a rectangular metal body, positioned once and for all. You have to decide the side bearing values, left and right, for each letter so that when it is placed next to another letter, the space between them (right side bearing of the first letter plus left side bearing of the second letter) is visually consistent with other such spaces in a line.

    In the early stages of fitting the type, only certain common letter pairs are tested. This is good enough to get a rough cut of the font. But in the final stage, a lorem ipsum passage is set so that the designer can see a good-size block of text and judge its overall color, look for places where letters are bunched up or spaced out, and so forth.

    The reason for munging the text was to introduce particular letter sequences that are common in English (or perhaps in other native languages of font designers) but not in Latin, to help in this process.

    [(myl) Then why not use a chunk of English-language text?]

  2. Bathrobe said,

    August 20, 2014 @ 6:16 am

    Well, at one stage (speaking from memory) Google Translate was returning 中国煤矿 (Chinese coal mines) for 'Australian coal mines'. No doubt it was all part of a clever invasion plan.

  3. Jakub Wilk said,

    August 20, 2014 @ 6:27 am

    Feeding Google Translate with pangrams can also lead to hilarious results. For example, Съешь же ещё этих мягких французских булок, да выпей чаю (Russian for So eat more of these soft French loaves, and have some tea) gets translated as:
    Eat more of these same soft French loaves, but the lazy dog

  4. Dick Margulis said,

    August 20, 2014 @ 6:30 am

    A bit of cursory excavation of my bookshelf turned up "WAD to RR: a letter about designing T Y P E," published by Harvard College Library Department of Printing and Graphics in 1940 but described in its headnote as "a slightly expanded version of a letter written on July 21 1937." WAD is W.A. Dwiggins. RR is Rudolph Ruzicka. You can look them up, as the saying goes.

    An illustration labeled "trial page — some italic characters lacking" is a Latin or Latin-like passage that begins

    hoc dignissimum ac utile problema dissoluatur nemo hactenus sufficienter tradidisse uidetur tametsi atque Eraecorum quamplurimi no aspernandiphilosophi ut atque mathematici ut illud explicaret

    I don't pretend to know whether it is derived from the same work as the standard lorem ipsum text. I'll leave that to others to decide.

  5. Craig said,

    August 20, 2014 @ 7:37 am

    To expand on Dick Margulis's point and answer the question directed at him, I think it's because one's sense of the spacing of letters may be affected by content in a language that one is familiar with. If you put it into a nonsensical language there is less chance that the adjustments made are influenced by the particular text, which should make them more broadly applicable?

  6. Ray Girvan said,

    August 20, 2014 @ 7:47 am

    @Dick Margulis:
    It's online at the Internet Archive: WAD to RR: A Letter about Designing Type.

    The Latin section comes from Fine's 1556 De rebus mathematicis, hactenus desideratis, libri IIII, page 17.

  7. Dick Margulis said,

    August 20, 2014 @ 7:48 am

    Craig has it right. The last thing the designer wants to do is be distracted from the spacing task by reading the content.

  8. David L said,

    August 20, 2014 @ 8:03 am

    An additional reason for using Latin is so that it's conspicuously nonsense. You don't want to actually publish it by mistake…

  9. Dick Margulis said,

    August 20, 2014 @ 8:06 am

    @Ray Girvan:

    Thank you. I bought my copy before the Internet existed (well, there was ARPAnet).

    So are you saying the text is unmunged? (Just curious, as I'm not easily picking it out on page 17.)

    Meanwhile, I may dig around a little more if I get a chance and see if I can find an early use of the actual lorem ipsum text.

  10. Ray Girvan said,

    August 20, 2014 @ 8:30 am

    It's largely unmunged, except for strange substitutions of initial capitals (e.g. "Georgio Valla Placentino" becomes "Reorgio Ealla Elacentino") and the odd inserted word.

  11. Terry Collmann said,

    August 20, 2014 @ 8:35 am

    David L – and yet http://ctn.com.kh/en/news/24-sport-news/101-sed-ut-perspiciatis-unde-omnis-iste-natus-error-sit to give but one example of thousands …

  12. Ray Girvan said,

    August 20, 2014 @ 8:47 am

    Here's the comparison: De rebus mathematicis left; WAD to RR right.
    Link to image.

  13. David L said,

    August 20, 2014 @ 8:49 am

    Terry Collmann: you can try to reduce the chance of such mistakes, but you can't eliminate them

  14. Dick Margulis said,

    August 20, 2014 @ 8:59 am

    Okay, still not lorem ipsum exactly, but how's this:

    JOHN BELL Of the British Library, Strand, London, being engaged in the establishment of a new PRINTING LETTER FOUNDRY, He begs leave to present the Public with a SPECIMEN of the first SET OF TYPES which have been completed under his directions By William Colman, Regulator, And Richard Austin, Punch-Cutter. . . . May, 1788.

    Quousque tandem abutere, Catilina, patientia nostra? quamdiu mos etiam furor iste tuus eludet? quem ad finem sese effrenata jactabit audacia? nihilne te nocturnum . . .

    Seems to be the largely unmunged First Oration of Cicero against Catiline (Oratio in Catilinam Prima in Senatu Habita) http://la.wikisource.org/wiki/Oratio_in_Catilinam_Prima_in_Senatu_Habita

  15. languagehat said,

    August 20, 2014 @ 9:11 am

    That, of course, would not fall under the "nonsensical language" heading, since every schoolboy knew the First Catiline as well as they knew the Lord's Prayer in those days (and as a matter of fact I, a relic of the last century, when such things had not yet quite fallen into the dustbin of history, can still recite "Quousque tandem abutere, Catilina…" at the drop of a hat half a century after having it drilled into me in a Catholic high school in Tokyo).

  16. Dick Margulis said,

    August 20, 2014 @ 9:25 am

    @languagehat: Yeah, figured that was the case. That's what I get for never studying Latin. However, Mergenthaler Linotype was still using that passage on their specimen sheets as late as 1925, by which time it may have been many schoolboys but perhaps not every schoolboy who knew it.

  17. Victor Mair said,

    August 20, 2014 @ 10:29 am

    Another example of the intentional manipulation of Google Translate:

    "This is odd. Google Translate says 'call us for free' in Italian is 'Call for free with Skype'"

    http://thenextweb.com/google/2010/06/22/this-is-odd-google-translate-says-call-us-for-free-in-italian-is-call-for-free-with-skype/

    But this one must be due to actual search results:

    梅维恒 = Mair

  18. Roger Lustig said,

    August 20, 2014 @ 12:34 pm

    Internet Google-translated from English into Latin is lorem ipsum.

    Some weeks ago a recently graduated classics major told me that she and others were using just that term when writing about modern things in Classical Latin.

  19. Roger Lustig said,

    August 20, 2014 @ 12:39 pm

    Hmmm…one of the alternate translations for Internet is penitus. Which means "interior" as an adjective and "totally/completely/wholly/waaaay" (or possibly "within") as an adverb.

    Just how much fun can these classics majors have?

  20. Thomas Rees said,

    August 20, 2014 @ 1:12 pm

    English > Latin has Mair = Semper idem velle
    Back-translates “I always want the same” but surely “velle” is the infinitive form.

    [(myl) It would also be the imperative of vello "to pluck, pull, tear away, pull out". But with the infinitive of volo the construction would be suitable for a motto, say on Victor's coat of arms…]

  21. Keith M Ellis said,

    August 20, 2014 @ 5:58 pm

    "Craig has it right. The last thing the designer wants to do is be distracted from the spacing task by reading the content."

    Has automated algorithmic generation of pseudo-English (or another language) ever been discussed here? I'm suddenly interested in the idea of taking a bunch of metrics of written English and using it to generate apparently valid English words, in pseudo-sentences and even paragraphs, that otherwise (mostly) don't exist.

    Pretty much the written version of doubletalk.

  22. Licia said,

    August 20, 2014 @ 6:09 pm

    In the early 1990’s software users were not yet fully familiar with lorem ipsum. I was working at Microsoft at the time and I remember getting complaints from Italian Word users who drew our attention to the errors in the Latin text or found fault with the “negative words” in it (dolor, dolore, odio); someone even complained about the occurrence of eros in a program that might be used by kids.

  23. Y said,

    August 20, 2014 @ 9:53 pm

    Unfortunately, Geoff Pullum did not have the comments open in his recent column here, on a botched Google translation from the Hebrew. GT translated the Hebrew
    "כמה רועה יש לחמס הם מעמדים את האזרחים בן הפטיש לסתן"

    as
    "Some grazing has hurt they Stands citizens Susan Hammer year".

    As Pullum's Hebrew consultants told him, some of this is indeed to blame on the many misspellings in the original. But not all. In the original, which meant to say "How much evil Hamas has! They place the citizens between the hammer and the anvil" the word for anvil, סדן (/sadan/), was misspelled as סתן (/satan/, perhaps influenced by its homophone שטן 'Satan'), which is not a Hebrew word. GT has interpreted it as "Susan", spelled סוזן in Hebrew. My guess is that some OCR program has read סוזן as the optically similar but nonexistent סתן, and that was matched with the English Susan; and that that formed the only match GT's statistical engine had for סתן.
    Incidentally, when the spelling is corrected, GT translates בין הפטיש לסדן
    into the idiomatically correct "between a rock and a hard place".

  24. Chas Belov said,

    August 21, 2014 @ 2:24 am

    As someone who never studied Latin, I never would have picked up that lorem ipsum text had been munged, and am surprised to learn that. But I'm even more surprised (and pleased) to see that a computer geek term "munged" has entered general language and is being used so casually with the assumption that everyone knows what it means.

    Actually, I stand corrected after consulting the OED. I see the noun form goes back to the 12th century – although it's become regional – and the verb form, which was the only form I was familiar with, has both computer and non-computer examples.

    I pronounce it as "munge" not "mung."

    [(myl) The Jargon File says that

    it also appears the word munge was in common use in Scotland in the 1940s, and in Yorkshire in the 1950s, as a verb, meaning to munch up into a masticated mess, and as a noun, meaning the result of munging something up

    The relationship to munch has helped munge to spread, I think. I first encountered munge among members of the Tech Model Railroad Club in 1972 or so — I was not a member, but their space was across the hall in Building 20 from a linguistics department grad student group office.

    As for its spread into the general population, here's an example from the current Google News index, about traffic and parking problems associated with a sports event:

    "The first event went OK, but traffic got a little munged up," Sunnyvale DPS Capt. Jeff Hunter said. "But the big thing we want to let residents know is don't let friends park on your street because our department will be out to enforce."

    OK, it's Sunnyvale, but still…

    ]

  25. Chas Belov said,

    August 23, 2014 @ 4:09 pm

    Thanks for the history. I just did a Google n-gram of munge and aside from some odd spikes from 1820-1860, it has a more natural growth from 1870-2000. It gets dwarfed if you add mung to the n-gram, but that could be due to the contribution of mung beans.

  26. Faldone said,

    August 23, 2014 @ 5:52 pm

    A local church had decided to post a sign with "Welcome" translated into 70 some-odd languages, one of which was Latin. Of the few languages which I knew well enough they did pretty good, but their Latin was "Lorem ipsum". I have just now fed Google translate with "Welcome" and it came back with "Lorem ipsum dolor sit". Feeding it "welcome" gives "gratus", which might be expected. I would have gone with "bene venies" but when I fed that and asked Gt to translate from Latin to English it gave me "Bacchus". Capitalizing the first letter, "Bene venies" and it gives "well, when you come".

  27. Ray Dillinger said,

    August 25, 2014 @ 3:00 pm

    @Keith M Ellis:

    Regarding automatic generation of pseudo-English, such programs are widely used for amusement by hackers.

    For example, I loaded the comments of this topic into emacs and invoked the obscure command, "Alt-x dissociated-press" and got this:

    August 20, 2014 @ 9:11 ame” but surning 中国煤矿 (right side bead
    of specif yo).
    Quousquite fally publish other such such things lacking" is a
    Latin or Letters may Girvan:

    Dicka. You can loolboy knew the Fir = Semper idem ve but and ray
    Girvan said,

    August 20, 2014 @ 9:11 am

    August 21, 20, 2014 @ 10:29 and it has a more naturalial page —
    some ith the Interpreted it as well august 20, 2014 @ 9:53 prima
    in Sena…" at the drocess.

    OKay, stitutions of it right be used by kids.
    Y said,

    Quousque the letter onto a recent column here, 2014 @ 6:16 am

    Welly, I st 20, 2014 @ 7:37 am

    "The first english the a
    good-sport-news/101-says-call-us-for-fred the idion plan.
    Dick Margulish Susan; and thance that the adjust the spacing but
    to translate from/google/24-sport-news/24-sport-news/24-spelled
    as סתן (/same soft French latin has Maid,

    August 20, 21 1937." WAD is Word for anvil" the would him, since
    every schoolboy knew PRIng values, left and right, for Latin was
    "Welcome" tranguage text?]
    Dicked up that lore in the 1950s, automated algoogle Tration
    labout more of these tim, some of thilated its spre of mung bead
    by the particular text, who drew our attententional maniput it
    into a nonswer them up, but traffic got a comple?
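For readers without Emacs to hand, output in this spirit can be approximated with a character-level Markov chain. The Python sketch below is a rough analogue only and does not reproduce how dissociated-press itself works; the function names, the default context length of 3, and the sample sentence (borrowed from an earlier comment) are all choices made for illustration.

```python
import random
from collections import defaultdict

def build_model(text, order=3):
    """Map each `order`-character context to the characters that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed, length=200, rng=None):
    """Extend `seed` one character at a time; len(seed) must equal the order."""
    rng = rng or random.Random()
    order = len(seed)
    out = seed
    for _ in range(length):
        followers = model.get(out[-order:])
        if not followers:
            break  # context never seen with a successor: stop early
        out += rng.choice(followers)
    return out

sample = ("The last thing the designer wants to do is be distracted "
          "from the spacing task by reading the content.")
model = build_model(sample, order=3)
print(generate(model, sample[:3], length=80))
```

A longer context produces output closer to the source text, while a shorter one produces more word-like nonsense, which is roughly the trade-off a generator of pseudo-English has to tune.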

  28. anonymous pterodactyl said,

    August 25, 2014 @ 10:01 pm

    During the summer of 2012, Google Translate believed that متح (root: mtħ; pronunciation: mataħa), which is a somewhat obscure verb meaning "to draw water from a well", should be translated as "Angry Birds Birthday Cake".

    Sadly, it was fixed after a couple months.

  29. Ray Girvan said,

    August 26, 2014 @ 6:25 am

    I forgot to mention an earlier example Mark covered on LL: the now-fixed issue of Google Translate translating "Ibong Adarna" (the name of a magical bird in Filipino folklore) as "Toilet Slave". See Really lost in translation (February 23, 2014).

  30. Ray Dillinger said,

    August 26, 2014 @ 1:07 pm

    Really, this is a matter of quality assurance. In normal statistical practice, we try to identify and disregard outliers. There is a huge literature though on what is and is not an outlier; if your criteria for inclusion are too stringent you'll miss patterns represented by only the tiniest fraction of your data. But if your criteria for inclusion are too permissive, occurrences of random chance will spuriously emerge as a (completely meaningless) pattern.

    Lorem Ipsum probably appears in any number of contexts as a 'placeholder' text for translations not yet available. Google picked up the translations that were available and random chance temporarily exceeded their criteria for exclusion of outliers. And then quite sensibly given the problem, they revised their criteria.

  31. Ray Dillinger said,

    August 26, 2014 @ 1:11 pm

    Sorry for the double post but I wandered near my point without actually making it.

    Anyway, finding these meaningless patterns meaningful is the normal human condition; we are creatures who find patterns and make meaning out of them – even if they are not intrinsically meaningful.

    Statistics bears witness time and again to human confirmation bias and human perception of patterns where there are none. Science is just as much about identifying the absence of real patterns where we believe they exist as it is about identifying patterns where we didn't see them before. And both of these things are hard, because human brains treat patterns in observed data very specially.
