Unicode CJK Unified Ideographs Extension J and the nature of the sinographic writing system
Submitted by Charles Belov:
I've been browsing through the proposed Unicode 17 changes, currently undergoing a comment period, with interest. While I don't have the knowledge to intelligently comment on the proposals, it's good to see that they are actively improving language access.
I'm puzzled that some new characters have been added to the existing Unicode CJK Unified Ideographs Extension C (6 characters) and Unicode CJK Unified Ideographs Extension E (12 characters) rather than added to a new extension. But the most interesting is the apparently brand-new Unicode CJK Unified Ideographs Extension J, with over 4,000 added characters.
I found the following characters of special interest:
– 323B0 looks like the character 五 with the bottom stroke missing.
– 323B3 looks like an arrangement of three 三s – does it possibly mean the same as 九?
– 32501, while not as complex as the character for biang, is nevertheless quite a stroke pile: the 厂 radical enclosing a 3 by 3 array of the character 有.
– 3261E is the character 乙 in a circle, which doesn't look quite right to me as a legitimate Chinese character.
– 326FB seems sexist to me: three 男 over one 女.
– 33143, like 32501, is an enclosure: ⻌ enclosing a 3 by 3 array of the character 日.
Alas, macOS does not yet support the biang character, so I can't include it in this email. Hopefully someday.
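For readers who want to poke at the code points listed above, here is a minimal Python sketch. It only converts the hexadecimal values quoted in the email into code points and reports which Unicode plane each falls in and how many code units it takes in UTF-8 and UTF-16. The function name `describe` and the dictionary keys are illustrative choices, not anything defined by Unicode, and whether the glyphs actually display depends entirely on the fonts installed.

```python
# Code points quoted in the email above, as hexadecimal strings.
CODE_POINTS = ["323B0", "323B3", "32501", "3261E", "326FB", "33143"]


def describe(hex_cp: str) -> dict:
    """Return basic encoding facts for one code point (hypothetical helper)."""
    cp = int(hex_cp, 16)          # parse the hex string into an integer
    ch = chr(cp)                  # the one-character string for that code point
    return {
        "label": f"U+{cp:04X}",   # conventional U+XXXX notation
        "plane": cp >> 16,        # each plane holds 0x10000 code points
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_units": len(ch.encode("utf-16-be")) // 2,
    }


if __name__ == "__main__":
    for hex_cp in CODE_POINTS:
        print(describe(hex_cp))
```

All six values land in plane 3 (the Tertiary Ideographic Plane), so each needs four bytes in UTF-8 and a surrogate pair in UTF-16.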
Character additions
VHM:
Note that, as it has been since the beginning of Unicode, CJK gobbles up the vast majority of all code points (see Mair and Liu 1991).
What is this fact telling us about the Chinese writing system, particularly in comparison with other writing systems? How does one account for this disparity? What is the meaning of this gross disparity?
The average number of strokes in a Chinese character is roughly 12.
The average number of strokes in a letter of the English alphabet is 1.9.
The average number of syllables in an English word is 1.66 (and 5 letters).
The average number of syllables in a Chinese word is roughly 2 (and 24 strokes).
The average number of words in an English sentence is 15-20.
The average number of words in a Chinese sentence is 25 (ballpark figure; see here)
Chinese has more than 100,000 characters.
English has 26 letters.
Total number of English words: over 600,000 (Oxford English Dictionary)
Total number of Chinese words: a little over 370,000 (Hànyǔ dà cídiǎn 漢語大詞典 [Unabridged dictionary of Sinitic])
and so on
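To make the comparison concrete, the averages above can be multiplied out. This is only a back-of-the-envelope sketch: it uses the rough figures quoted in this post, the variable names are mine, and multiplying averages together ignores how word length and sentence length actually vary.

```python
# Rough averages quoted in the post (all approximate).
STROKES_PER_LETTER = 1.9
LETTERS_PER_ENGLISH_WORD = 5
STROKES_PER_CHINESE_WORD = 24          # about 2 characters of about 12 strokes
ENGLISH_WORDS_PER_SENTENCE = (15, 20)  # low and high ends of the quoted range
CHINESE_WORDS_PER_SENTENCE = 25        # ballpark figure

# Strokes needed to write one average word.
english_strokes_per_word = STROKES_PER_LETTER * LETTERS_PER_ENGLISH_WORD

# Strokes needed to write one average sentence.
english_strokes_per_sentence = tuple(
    n * english_strokes_per_word for n in ENGLISH_WORDS_PER_SENTENCE
)
chinese_strokes_per_sentence = CHINESE_WORDS_PER_SENTENCE * STROKES_PER_CHINESE_WORD

# How many times more strokes a Chinese word takes than an English one.
word_ratio = STROKES_PER_CHINESE_WORD / english_strokes_per_word

if __name__ == "__main__":
    print(f"English strokes per word: {english_strokes_per_word:.1f}")
    print(f"Chinese strokes per word: {STROKES_PER_CHINESE_WORD}")
    print(f"Chinese/English ratio per word: {word_ratio:.2f}")
    print(f"English strokes per sentence: {english_strokes_per_sentence}")
    print(f"Chinese strokes per sentence: {chinese_strokes_per_sentence}")
```

On these figures an English word takes about 9.5 strokes against 24 for a Chinese word, and a sentence takes roughly 140 to 190 strokes in English against about 600 in Chinese.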
Selected readings
- "Is there a practical limit to how much can fit in Unicode?" (10/27/17), with a lively debate in the comments; the post and its comments raise many interesting issues about what Unicode is for after all
- "How many more Chinese characters are needed?" (10/25/16)
- "The infinitude of Chinese characters" (10/9/20) — with an extremely lengthy bibliography
- "The economics of Chinese character usage" (9/2/11)
- "Chinese character inputting" (10/17/15)
- "Language is not script and script is not language" (1/23/22)
- Mair, Victor H., and Yongquan Liu, eds., Characters and Computers (Amsterdam, Oxford, Washington, Tokyo: IOS Press, 1991), including James T. Caldwell's "Unicode: A Standard International Character Code for Multilingual Information Processing", which was the first presentation of Unicode to the broader public after it was up and running. What struck me most powerfully about Caldwell's chapter was a graph showing the huge proportion of code points in Unicode that were taken up by Chinese characters. All the other writing systems and symbols in the world that were covered by Unicode occupied only a small fraction of the total. It made me feel as though Unicode had been devised primarily to accommodate the enormous number of Chinese characters in existence.
- "Cucurbits and junk characters" (3/30/24)
Tom Gewecke said,
June 16, 2025 @ 8:32 am
I think biang (U+30EDE) should display if you install the BabelStone Han font on a Mac, at least in some apps.
J.M.G.N. said,
June 16, 2025 @ 9:32 am
– I cannot see how such an ever-increasing number of Arabic honorific ligatures merits acceptance.
– "HAIRY CREATURE": seriously‽ What's next, "FLAT EARTH"?
– "Compound tone diacritics" for what language(s)?
What symbols do you still miss most from Unicode?
Magnus said,
June 16, 2025 @ 11:52 am
Characters added in Unicode 17 also include the ones mentioned in this blog post: https://medium.com/@peterburkimsher/hakka-news-adding-11-unicode-characters-320c78807988
I don't know any Taiwanese or Hakka, so I can't really comment, but I found the account of the inclusion process interesting.