4-digit numbers versus 5-digit numbers
Phil H wrote these comments to "Uncommon words of anguish" (7/18/21):
The anguish is very real. My wife had a character in her name that most computers will not reproduce ([石羡]), despite it being relatively common in names in our part of the world, and has been refused bank accounts, credit cards, and a mortgage because of it. In the end she changed her name rather than continue to deal with the hassle. The character is in the standard, but it was too late for us.
…there have always been ways to get the character onto a computer, but any given piece of bank software might not recognise it, and any given bank functionary might be unfamiliar with them. We then had trouble when some organisations used the pinyin XIAN in place of the character, but that then made their documentation inconsistent with her national ID card (which had the right character on it) and so yet further bodies would not accept them… It was the standard "mild computer snafu + large inflexible bureaucracy = major headache" equation.
An anonymous correspondent, a computer scientist, sent in the following remarks:
Phil H is talking about a character which is in a "supplementary plane" in Unicode (and similarly in GB-18030). Unfortunately, an awful lot of software was only ever tested on Basic Multilingual Plane characters.
The way I explained it to a Chinese friend was: imagine you live in a small town where everybody's telephone number is 4 digits. Then they build a load of new houses on the edge of town, but when they come to connect them up, they find there aren't enough numbers left to give all the new people 4-digit telephone numbers, so they get 5 digits.
But none of the existing 4-digit residents want to change their numbers into 5-digit numbers, so they carry on having 4-digit numbers. And they know each other very well, and only ever call each other, and it's very rare indeed that any one of them wants to call up one of those new 5-digit people living on the edge of town, so some of them still go through their lives thinking that telephone numbers only ever have 4 digits.
Which unfortunately leads to some slightly-careless equipment designers testing their equipment only on 4-digit numbers, not checking if 5-digit numbers also exist. Which means, when you want to use a 5-digit number, you just might find that some piece of equipment which had claimed to be fully compatible with the phone system suddenly doesn't work.
(Oh, and there does exist a way to translate a 5-digit number into TWO of the 4-digit numbers, which is supposed to make it compatible with any old equipment that insists on 4-digit numbers. But some of that equipment doesn't support putting two telephone numbers in the space of one.)
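In Unicode terms, the "two 4-digit numbers" correspond to a UTF-16 surrogate pair. The following is a minimal sketch of that arithmetic, not code from the correspondent; the function names are made up for illustration, and U+20000 (the first code point of CJK Extension B) stands in for any supplementary-plane character.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    for cp in (0x20000, 0x1F600):
        high, low = to_surrogate_pair(cp)
        print(f"U+{cp:05X} -> {high:04X} {low:04X}")
```

Equipment that reads each 16-bit unit as a complete character sees two meaningless values here, which is the failure the parenthesis describes.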
If they'd insisted on giving EVERYBODY 5-digit numbers, then all the makers of equipment would have been forced to face the reality of 5 digits. But then everybody would have had to change, and that's too awkward.
At least a lot of Western designers are facing up to 5-digit numbers now. Why? Because a whole bunch of EMOJI characters are in that 5-digit area! So I think the way to get blog tools etc to support the extra characters is simply to say we want to use all the latest emoji: English developers can relate to that more than they do to rare characters in a language they don't understand. (Just hope they don't make some other assumption like "if it's a 5-digit number then the first digit will always be a 1 because that covers the emoji".)
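The "first digit will always be a 1" worry maps onto Unicode planes: most emoji sit in plane 1, while the CJK Extension B ideographs start in plane 2. A small sketch of how such an assumption would go wrong follows; `is_supplementary_naive` is a hypothetical buggy check written for illustration, not taken from any real library.

```python
def plane(ch: str) -> int:
    """Return the Unicode plane (0 to 16) of a single character."""
    return ord(ch) >> 16


def is_supplementary_naive(ch: str) -> bool:
    # Hypothetical buggy check: assumes every "5-digit" character is in plane 1.
    return plane(ch) == 1


def is_supplementary(ch: str) -> bool:
    # Anything above U+FFFF lies outside the Basic Multilingual Plane.
    return ord(ch) > 0xFFFF


EMOJI = "\U0001F600"      # GRINNING FACE, plane 1
CJK_EXT_B = "\U00020000"  # first code point of CJK Extension B, plane 2

if __name__ == "__main__":
    for ch in ("A", "\u4e2d", EMOJI, CJK_EXT_B):
        print(
            f"U+{ord(ch):04X} plane={plane(ch)} "
            f"naive={is_supplementary_naive(ch)} correct={is_supplementary(ch)}"
        )
```

Software tested only with emoji would pass the naive check and still mishandle a plane 2 character in a name.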
Meanwhile China is, apparently, building new towns and cities which have the 5-digit characters in their names. That'll make them more common, once people start moving in to those places and they want to write about where they live without having to use some cheap-looking homophone.
Does the anonymous correspondent's analogy work?
Selected readings
- "Sinographic inputting: 'it's nothing' — not" (2/22/21)
- "Stroke order inputting" (10/30/11)
- "Cantonese input methods" (1/20/15)
- "Google Translate Chinese inputting" (1/27/13)
- "Creeping Romanization in Chinese" (8/30/12)
- "Swype and Voice Recognition for mobile device inputting" (1/22/14) — esp. ¶¶ 3-5
- "Language notes from Macao and Hong Kong" (6/22/14) — search for "Starbucks"
- "Easy versus exact" (10/14/17)
- "The infinitude of Chinese characters" (9/9/20) — with a long list of dozens of relevant posts
- "Sinographs by the numbers" (2/22/19)
- "How many more Chinese characters are needed?" (10/25/16)
- "Chinese character inputting" (10/17/15)
- "Is there a practical limit to how much can fit in Unicode?" (10/27/17)
- "Character crises" (6/15/18)
- "Ask Language Log: Looking up hanzi for ignoramuses" (11/29/17)
- "Sinological suffering" (3/31/17)
- "Writing characters and writing letters" (11/17/18)
- "An immodest proposal: 'Boycott the Chinese Language'" (11/18/18)
- "The wrong way to write Chinese characters" (11/28/18)
- "The unpredictability of Chinese character formation and pronunciation" (2/6/12)
- "The unpredictability of Chinese character formation and pronunciation, pt. 2" (2/11/19)
- "Information content of text in English and Chinese" (10/9/17)
- "Sinitic languages without the Sinographic script" (3/5/19)
David Marjanović said,
July 27, 2021 @ 4:00 pm
Many web forms have told me my name is invalid. But fortunately I've never had the inverse problem, unlike Phil H's wife.
Alexander Quine said,
July 27, 2021 @ 5:09 pm
This is a pretty good analogy! And as a JavaScript developer I can assure you that the assumption of 5-digit numbers always starting with a 1 has already been made… https://en.wikipedia.org/wiki/Unicode_equivalence
Howard said,
July 27, 2021 @ 5:12 pm
This problem is analogous to the Y2K problem: a lack of imagination or foresight in designing technical systems when optimizing the use of technical resources is paramount. I don't think human language works like that https://advances.sciencemag.org/content/advances/5/9/eaaw2594.full.pdf
DBMG said,
July 27, 2021 @ 5:29 pm
The analogy would be sound if the character in her name was a modern coinage and the problem was one of systems struggling to keep up with novelties. But since Phil H says the character is locally common, it must be traditional and far predate things like mortgages.
Michael Watts said,
July 27, 2021 @ 6:07 pm
DBMG, the analogy is sound anyway. The character is not newly coined. But there is no comprehensive list of every character ever written. That character didn't make it into early standards; it wasn't common enough. So in the Unicode context, it is newly coined (well, early 2000s), even though in the historical context it isn't.
Barbara Phillips Long said,
July 28, 2021 @ 12:34 am
When I was an undergraduate in the early 1970s, I got married. I notified the registrar of my name change. The semester went well, but then my grades disappeared. During the spring semester, I would stop in at the registrar’s office to see if my problems had been sorted out.
Finally the registrar came out personally to ask me to use both my maiden and married names as my surname because the software as programmed had no provision for name changes and for some undisclosed reason, tracking all my records using my student ID number did not work if my maiden surname was not part of my name.
As I recall, using both names worked, but I continue to have problems with computer systems that can’t accept the space in my surname and need a hyphen in there, and other systems that can’t accept a hyphenated name. When the Affordable Care Act went into effect, I was buying my health insurance as an individual, so I went on healthcare.gov and set up an account. The catch-22 was that one system needed a hyphen and the other system would not accept a hyphen in the surname field. As a result, getting the tax forms I needed and renewing my insurance involved making phone calls to various offices to request forms be mailed to me. I could not download them, because neither computer system could recognize my name in the other system, so some forms were not automatically generated.
My profound sympathies to Phil H’s wife. My problems sound much less difficult than hers.
Phil H said,
July 28, 2021 @ 12:49 am
The analogy makes sense to me, though I don't know enough of the technical details to know how perfectly it fits the real underlying technologies.
For printed documents like ID cards and household registrations you can 造字 – make characters. We're not sure how they did it, but they managed! But for software systems it often proved impossible. The character still doesn't pop up as an option on the Microsoft IME that I use to type in Chinese (or the Apple one on my phone); I can copy and paste it from other sources, but there's no guarantee that any given piece of software will recognise it.
Phil H said,
July 28, 2021 @ 12:58 am
@Barbara – Yes, that's very much the same species of problem. And as with the systems you were dealing with, there's always a patch that will fix this problem for now… only after a while all the different patches start to conflict and cause more problems of their own.
Michael M said,
July 28, 2021 @ 1:35 am
As a programmer, I'd say the analogy is very good for how Unicode works, and I have definitely worked with software that doesn't support the supplementary planes. However, it should be said that most modern software uses a system that supports 5-digit numbers without most of the 4-digit people needing to change (called UTF-8), so I think a big part of the problem is that institutions like banks often use very old software. That said, some modern software still doesn't handle this correctly.
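The UTF-8 point can be made concrete with a short sketch: ASCII text keeps its one-byte form, while other characters take two, three, or four bytes. The helper name `utf8_length` and the sample characters are chosen here for illustration only.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


SAMPLES = {
    "A": "ASCII letter",
    "\u00e9": "Latin letter with accent",
    "\u4e2d": "common BMP ideograph",
    "\U00020000": "supplementary-plane ideograph",
}

if __name__ == "__main__":
    for ch, label in SAMPLES.items():
        print(f"U+{ord(ch):04X} {label}: {utf8_length(ch)} byte(s)")
```

Because the ASCII bytes are unchanged, existing ASCII data did not need converting. Software that assumes at most three bytes per character can still fail on the last sample.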
WGJ said,
July 28, 2021 @ 1:59 am
From Wikipedia:
https://en.m.wikipedia.org/wiki/GB_18030
In a move of historic significance for software supporting Unicode, the PRC decided to mandate support of certain code points outside the BMP. This means that software can no longer get away with treating characters as 16-bit fixed width entities (UCS-2). Therefore, they must either process the data in a variable width format (such as UTF-8 or UTF-16), which are the most common choices, or move to a larger fixed width format (such as UCS-4 or UTF-32). Microsoft made the change from UCS-2 to UTF-16 with Windows 2000.
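The difference between UCS-2 and UTF-16 described in that passage shows up as soon as software counts or truncates text in 16-bit units. This is a rough sketch, assuming a hypothetical fixed-width name field; `naive_truncate` imitates the UCS-2 habit of cutting at a unit boundary and is not drawn from any particular system.

```python
def utf16_units(text: str) -> int:
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def naive_truncate(text: str, max_units: int) -> str:
    """Cut after max_units 16-bit units, as UCS-2 era code might do."""
    raw = text.encode("utf-16-le")[: max_units * 2]
    # A pair split in half cannot be decoded; it becomes a replacement character.
    return raw.decode("utf-16-le", errors="replace")


if __name__ == "__main__":
    bmp_name = "\u4e2d\u6587"       # two BMP characters
    rare_name = "\U00020000\u4e2d"  # one supplementary character, one BMP
    for name in (bmp_name, rare_name):
        print(len(name), "characters,", utf16_units(name), "code units")
    print(repr(naive_truncate(rare_name, 1)))
```

The BMP-only name survives truncation intact, while the name with a supplementary-plane character is corrupted.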
All computer systems in China are now (and have been for many years) legally required to be able to process those rarer, but officially recognized characters. So if your bank tells you it can't do it, you should report the bank to the China Banking and Insurance Regulatory Commission, the State Administration for Market Regulation, and of course the Standardization Administration of China (in this order, given how much teeth each of them has in terms of enforcement) for non-compliance. (Theoretically, you could sue the bank, too, but everyone who's familiar with how things work in China knows a complaint with the oversight organ is much more effective.)
Hyman Rosen said,
July 28, 2021 @ 5:26 am
I personally fixed a failure in handling supplementary plane characters in one tiny piece of software when I was at Bloomberg. It was noticed, as the article says, in the context of handling emoji in messages. Bloomberg has made this library open source, so I can point at the actual code: https://github.com/bloomberg/bde/blob/5122c1c138845cca024ba5371dd8d916498b961b/groups/bal/baljsn/baljsn_parserutil.cpp#L227
Computer history is rife with unwise decisions based on saving space. The decision not to make Unicode code points all 32 bits right away is one of them.
Philip Taylor said,
July 28, 2021 @ 6:41 am
If "16-bits wrong, 32-bits right" is true today, then surely "32-bits wrong, 64-bits right" will be true tomorrow. In which case, Unicode (or its successor) should not assume that any finite number of bits will suffice, but should allow smaller units to be daisy-chained to whatever length necessary.
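One way to picture such daisy-chaining is a variable-length integer in which each byte carries seven bits of the value plus a flag saying whether another byte follows. The sketch below uses the LEB128-style scheme as an illustration of the idea only; it is not how Unicode encodes characters, although UTF-8 is itself a chained scheme of a related kind that is capped at four bytes.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a chain of 7-bit groups."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        chunk = value & 0x7F
        value >>= 7
        if value:
            out.append(chunk | 0x80)  # high bit set: more bytes follow
        else:
            out.append(chunk)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single chained integer from the start of data."""
    value = 0
    for position, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * position)
        if not byte & 0x80:
            return value
    raise ValueError("truncated input: final byte never arrived")


if __name__ == "__main__":
    for number in (0x41, 0x20000, 2**70):
        encoded = encode_varint(number)
        print(hex(number), "->", encoded.hex(), f"({len(encoded)} bytes)")
```

Small values stay short and there is no fixed upper limit. The cost is that every reader must walk the chain to find where one value ends.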
KWillets said,
July 28, 2021 @ 11:26 am
This sounds like the old SQL Server decision to use UCS-2 in order to keep characters at a constant length.
The Unicode standard is computationally painful and prone to shortcuts; for instance sorting is seldom fully supported.
Another problem is that variable-length encodings can be used to conceal malicious characters; the whole software stack has to be updated correctly to prevent vulnerabilities.
Twill said,
July 28, 2021 @ 11:51 pm
@Philip Taylor I find it rather incredible that Unicode could possibly bloat to billions of characters when we don't even have 150 thousand at the moment, and that is certainly not for want of trying. Maybe it could if Unicode moved from including whichever scribbles committees at giant tech companies spew up to selling spots for literally anyone to doodle in, but even then we would probably be orders of magnitude off (not to mention that much further from the intended purpose of the standard).
As KWillets said, it's not as black and white as mere laziness from programmers. Implementing Unicode is ridiculously complex, even for established companies, and even ready-made libraries can be patchy and not able to be used in many circumstances. Not to excuse giant banks, who have teams of programmers for a reason, but it's not just some crotchety old programmers trying to save a few bits on a machine with gigabytes of memory.
Philip Taylor said,
July 29, 2021 @ 3:56 am
Twill — "I find it rather incredible that Unicode could possibly bloat to billions of characters when we don't even have 150 thousand at the moment". All the while that Unicode restricted itself to characters (or their equivalent) from the world's languages, I would have been inclined to agree with you, although I would still have argued for extensibility to be inbuilt from the outset. But as soon as the Unicode Consortium, either in its infinite wisdom or in a moment of utter madness, decided to include emoji and emoticons, all bets went out of the window.
Peter CS said,
August 2, 2021 @ 4:24 pm
An old Y2K joke: While we're at it we might as well allow for 5-digit years or we're going to have to do all this again in 8000 years.