Stray Chinese characters in English language documents
Lawrence Evalyn wrote to me saying that he received the official communication below about a new student card that is being issued by his university. He was perplexed by all the Chinese characters that got inserted in the text. They seem to appear consistently in certain places and for certain letters. [N.B.: The communication has been anonymized for posting on Language Log.]
What is the [new student ID card]?
Offering the services of multiple cards in one, the [card] will be your student identification card. [card] is:
慍our [local bus pass] (if eligible)
慍our access to [the gym]
嵩ash and [this card] are the ONLY accepted forms of payment at the newly renovated [cafe] opening in September 2014
慍our on-campus debit card for tap-and-go payment at all University Food Services locations (including the newly renovated [cafe] in the University Centre)
慍our University Food Services Meal plan
In the near future, you will be able to use your [card] for:
慈ther on-campus vendors, including the Bookstore and the [university-affiliated] businesses
惹nacks and drinks from on-campus vending machines
微aundry
媲uilding access
[The card] is more streamlined, flexible and secure than the current [card] system, and your student number will not change.
Beat the September rush!
Once you are registered for courses in the 2014 Fall Term, visit the [card] office (prev. Photo ID Centre) in the University Centre lobby to get your new [card].
You will need:
慈ne piece of current government-issued photo identification that clearly shows your full, legal name and your date of birth
慍our [university] student number
嫂void the long lines and come in soon!
Lawrence wondered what sort of glitch would cause this kind of garbling.
Tom Bishop and Richard Cook, both of whom are Chinese specialists associated with the Unicode Consortium, commented thus:
Tom:
It looks to me like some kind of "bullet" character codes got reinterpreted as Hanzi, due to mistaken encoding identification (luanma). For example, something like "•Your" became "慍our". Probably at some point along the line of communication, a non-Unicode encoding (e.g., Latin1 or Big5) was used. Unfortunately, a lot of software still doesn't support Unicode properly.
Richard:
The text was written as one encoding and then read as another encoding. Without studying this example carefully, it seems that bullets at the line start are being misinterpreted. Anyway, if the text isn't too damaged, if you open it in Wenlin you can probably figure out what the original encoding is.
I don't have much to add to what Tom and Richard have said, except to note that 5-10 years ago this sort of thing used to happen a lot more than it does now. I used to be very annoyed when I would receive English language documents with random Chinese characters scattered throughout the text, not just at the beginnings of lines as here, and often I had to spend a considerable amount of time removing them individually. That almost never happens to me anymore.
Here's one documented case and its solution:
"Random Chinese Characters in Exchange 2010 SP1 Emails".
For those who wish to delve more deeply into the technical details of how this happened in the above quoted letter, here is an expert explanation from Silas S. Brown:
慍 encoded in Big5 is a byte B7 + a letter Y. B7 is mid-dot (·) in Windows-1252 / ISO-8859-1. (A 'real' bullet • is not available in that codepage.) Somebody typed a mid-dot immediately followed by a Y, encoded it into Windows-1252 or 8859-1 (Western European), and then some other program interpreted these 2 bytes as a Big5 code and converted it to Unicode 慍. Similarly for 嵩 (Big5 code = byte B7 + letter C). Hence 慍our is a luanma'd ·Your, and 嵩ash is a luanma'd ·Cash. Not sure why the original writer didn't put a space after the mid-dot, but still.
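Silas's account can be reproduced in a few lines of Python. This is only a sketch: it assumes that Python's built-in `latin-1` and `big5` codecs behave the way the software that mangled the letter did, and the helper names `garble` and `ungarble` are invented here for illustration.

```python
def garble(text: str) -> str:
    """Write the text as ISO-8859-1 bytes, then misread those bytes as Big5."""
    return text.encode("latin-1").decode("big5")


def ungarble(text: str) -> str:
    """Undo the mistake: turn the text back into Big5 bytes, then read them as ISO-8859-1."""
    return text.encode("big5").decode("latin-1")


# Mid-dot (byte B7) followed directly by a capital letter, as in the original letter.
print(garble("·Your"))
print(garble("·Cash"))

# Going the other way, each stray character from the letter should come back
# as a mid-dot plus the capital letter that went missing.
for ch in "嫂媲嵩微慈惹慍":
    print(ch, "->", ungarble(ch))
```

The mid-dot's byte B7 is a valid Big5 lead byte, so it swallows the letter that follows it; the rest of each word is plain ASCII and passes through unchanged. Incidentally, Windows-1252 (unlike ISO-8859-1) does assign a true bullet, at byte 95 hex, so under that code page the mid-dot was a choice rather than a necessity.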
Lawrence concludes:
I've also found myself amused by the many, many ways the university has been attempting to convince all of us that this card is somehow more convenient for the student body– at least for graduate students, all of these features were already offered by our student ID cards, and the cafes used to accept credit and debit– but there's no real linguistic innovation in corporate speak insisting that something is "more streamlined!" when they really mean "more streamlined for us."
Anyway, it does seem that each character is consistently used for a specific letter, but I have no idea how each letter came to be replaced by the characters in question! If there's some kind of logic behind it, I'd love to hear it.
If this ever happens to you, it's all right to get upset, but don't make the slightest effort to decipher these annoying graphs. Detached from a Chinese context and arbitrarily inserted in an English text, they convey no intelligible meaning or logic whatsoever.
[Thanks to John Rohsenow]
Keith said,
August 23, 2014 @ 2:46 am
I had almost exactly the same kind of problem when I was working in the technical documentation section of an engineering company.
Texts would come back from the translation agency with weird characters here and there, very often the first letter of a line.
I tracked it down to the same problem as described here: text composed in one encoding being read with a different encoding.
There was nobody else in-house with any experience of multi-language, multi-platform text processing, so whenever there was any suspicion that a bundle (an English text accompanied by its five to twenty translations) was corrupted it fell to me to give it a quick QC check.
I found a really useful text editor for Windows XP that can do all the encodings I needed (plus many others: I worked only in Latin, Greek and Cyrillic scripts in this job).
Additionally, it can colour and indent the syntax of markup languages (e.g. HTML, XML) and programming languages (Java, BASH, Perl, etc.).
Bob Ladd said,
August 23, 2014 @ 2:52 am
Anonymized, maybe, but (unless VHM has been exceptionally devious and modified stuff without flagging it with square brackets) clearly Canadian: there's nowhere else in the English-speaking world that has "Fall" beginning in September and also uses the spelling Centre.
Ken said,
August 23, 2014 @ 8:04 am
Microsoft Word once used (and maybe still does) a non-ISO character set. It was basically a one-byte encoding with ASCII in the bottom half but their own assignments in the top, for characters that are often used in word processing. You could always tell when someone had cut-and-pasted from Word into a web document, because those characters came out wrong.
Some browsers showed them with a "?" symbol, which created some unintentional humor with the smart quotes, trademarks, copyright symbols, and so forth:
djw said,
August 23, 2014 @ 8:49 am
As Lawrence observed, these letters must be kind of a world unto themselves, anyway. I got one this week from a bank that said something about how they were "better serving the needs of the customers" by doing something like raising the cost of using the card and limiting the number of times it can be used for certain transactions. If that's "serving the customers," it might as well have been in Chinese for me (and I don't read any Chinese at all).
KWillets said,
August 23, 2014 @ 11:41 am
This happens so often that some databases have an IS_UTF8 predicate for deciding if data has been misloaded somewhere along the way. There's usually some step where someone misidentified the encoding and caused it to be re-encoded like the Chinese characters in the example.
Nowadays people use UTF-8 or UCS-2 almost exclusively so they don't make that mistake as often.
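A check of the kind KWillets describes can be sketched in Python as follows. This is not any particular database's implementation; the function name `is_utf8` is simply borrowed from the comment, and the check relies on Python's own strict UTF-8 decoder.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A mid-dot stored as the single Latin-1 byte B7 is not valid UTF-8 ...
print(is_utf8(b"\xb7Your"))
# ... while the same text properly encoded as UTF-8 passes.
print(is_utf8("·Your".encode("utf-8")))
```

A test like this can only flag bytes that are certainly not UTF-8. Plain ASCII always passes, and text that was wrongly converted and then re-encoded as UTF-8 (like the Chinese characters in the letter) passes too, so a True result does not prove the data is intact.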
Florence Artur said,
August 23, 2014 @ 1:27 pm
This still happens often enough with French text, because with the wrong encoding, characters with accents can appear as Chinese characters. Usually it can be fixed by manually selecting the correct encoding, but sometimes even that doesn't work. Annoying, I agree. I guess it's not as frequent as it used to be though.
Dan T. said,
August 23, 2014 @ 4:38 pm
Even non-Chinese documents (including on the Web) can have character set confusion from time to time; you see it sometimes when text is copied-and-pasted into a different document, or a site or blog is moved from one platform to another, and text ends up going between Windows-1252 and UTF-8 (or some other combination of encodings) without being properly converted, and suddenly all the bullets or curly quotes or em dashes are something weird instead.
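The Windows-1252 and UTF-8 mix-up is easy to reproduce. The sketch below uses a hypothetical helper, `misread`, that encodes text one way and decodes it another; `errors="replace"` is an assumption about how forgiving the reading software is.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text in one encoding, then decode the bytes as if they were another."""
    return text.encode(written_as).decode(read_as, errors="replace")


# UTF-8 bytes read as Windows-1252: one curly apostrophe becomes three characters.
print(misread("It\u2019s", "utf-8", "cp1252"))

# Windows-1252 bytes read as UTF-8: each curly quote is an invalid byte,
# so the decoder substitutes the replacement character U+FFFD.
print(misread("\u201chi\u201d", "cp1252", "utf-8"))
```

Which symptom appears depends on the direction of the mistake: UTF-8 read as Windows-1252 yields runs of accented letters and symbols, while Windows-1252 read as UTF-8 yields replacement characters or, in older software, question marks.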
Ray Dillinger said,
August 25, 2014 @ 2:42 pm
Seen it, found it consistently on certain channels, even wrote ad hoc software to correct it.
Elizabeth Pyatt said,
August 26, 2014 @ 12:51 pm
This kind of "exotic character" insertion is very common when people copy text from Word (often Win-1252 encoding) into either email or a content management system.
There's lots of confusion for "smart quotes" (the ones that curve to the right & left, as in “ and ”), left curving apostrophes (’) and long dashes (e.g. –, —). They technically don't exist in either ASCII or the larger Latin-1 set and systems can get confused.
They do exist in Unicode (UTF-8), but not everything is set to work in Unicode.
Jim said,
August 26, 2014 @ 3:28 pm
Wow. Is no system safe from PLA hacking?