Unicode CJK Unified Ideographs Extension J and the nature of the sinographic writing system


Submitted by Charles Belov:

I've been browsing through the proposed Unicode 17 changes, currently undergoing a comment period, with interest. While I don't have the knowledge to intelligently comment on the proposals, it's good to see that they are actively improving language access.

I'm puzzled that some new characters have been added to the existing Unicode CJK Unified Ideographs Extension C (6 characters) and Unicode CJK Unified Ideographs Extension E (12 characters) rather than added to a new extension. But the most interesting is the apparently brand-new Unicode CJK Unified Ideographs Extension J, with over 4,000 added characters.

I found the following characters of special interest:

– 323B0 looks like the character 五 with the bottom stroke missing.
– 323B3 looks like an arrangement of three 三s – does it possibly mean the same as 九?
– 32501, while not up to the character for biang for complexity, is nevertheless quite a stroke pile: the 厂 radical enclosing a 3 by 3 array of the character 有
– 3261E is the character 乙 in a circle, which doesn't look quite right to me as a legit Chinese character
– 326FB seems sexist to me: three 男 over one 女
– 33143, similarly to 32501, has ⻌ enclosing a 3 by 3 array of the character 日
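The code points listed above can be checked against the new block with a few lines of Python. This is only a sketch: the range U+323B0 to U+33479 for Extension J is an assumption not stated in this post (it is consistent with the count of 4298 in the table below), and `in_extension_j` is a hypothetical helper name.

```python
# Assumed assigned range for CJK Unified Ideographs Extension J.
# These bounds are NOT given in the post; treat them as an assumption.
EXT_J_FIRST = 0x323B0
EXT_J_LAST = 0x33479


def in_extension_j(code_point: int) -> bool:
    """Return True if the code point lies in the assumed Extension J range."""
    return EXT_J_FIRST <= code_point <= EXT_J_LAST


# The six code points singled out in the list above.
listed = ["323B0", "323B3", "32501", "3261E", "326FB", "33143"]

for hex_cp in listed:
    cp = int(hex_cp, 16)
    print(f"U+{hex_cp}: in Extension J = {in_extension_j(cp)}")

# Size of the assumed range, for comparison with the table's count.
range_size = EXT_J_LAST - EXT_J_FIRST + 1
print(range_size)  # 4298
```

Under that assumption all six code points fall inside the block, while biang (U+30EDE, mentioned in the comments) falls outside it.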

Alas, macOS does not yet support the biang character, so I can't include it in this email. Hopefully someday.

Character additions

Block Name                                New Characters
Arabic Extended-B                         1
Bengali                                   1
Oriya                                     2
Telugu                                    1
Kannada                                   1
Combining Diacritical Marks Extended      27
Currency Symbols                          1
Miscellaneous Symbols and Arrows          1
Latin Extended-D                          5
Arabic Presentation Forms-A               25
Sidetic                                   26
Arabic Extended-C                         14
Sharada Supplement                        8
Tolong Siki                               54
Chisoi                                    40
Beria Erfe                                50
Ideographic Symbols and Punctuation       5
Tangut                                    8
Tangut Supplement                         22
Tangut Components Supplement              115
Symbols for Legacy Computing Supplement   9
Miscellaneous Symbols Supplement          34
Tai Yo                                    55
Transport and Map Symbols                 1
Alchemical Symbols                        4
Supplemental Arrows-C                     9
Chess Symbols                             4
Symbols and Pictographs Extended-A        7
Symbols for Legacy Computing              1
CJK Unified Ideographs Extension C*       6
CJK Unified Ideographs Extension E*       12
CJK Unified Ideographs Extension J        4298
Total                                     4847
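
As a quick arithmetic check on the table, the counts can be summed and the CJK share of this release's additions computed. The sketch below only restates the table's figures; it says nothing about the share of CJK across all of Unicode, and the variable names are illustrative.

```python
# New-character counts per block, copied from the table above
# (footnote asterisks on Extension C and E omitted).
additions = {
    "Arabic Extended-B": 1,
    "Bengali": 1,
    "Oriya": 2,
    "Telugu": 1,
    "Kannada": 1,
    "Combining Diacritical Marks Extended": 27,
    "Currency Symbols": 1,
    "Miscellaneous Symbols and Arrows": 1,
    "Latin Extended-D": 5,
    "Arabic Presentation Forms-A": 25,
    "Sidetic": 26,
    "Arabic Extended-C": 14,
    "Sharada Supplement": 8,
    "Tolong Siki": 54,
    "Chisoi": 40,
    "Beria Erfe": 50,
    "Ideographic Symbols and Punctuation": 5,
    "Tangut": 8,
    "Tangut Supplement": 22,
    "Tangut Components Supplement": 115,
    "Symbols for Legacy Computing Supplement": 9,
    "Miscellaneous Symbols Supplement": 34,
    "Tai Yo": 55,
    "Transport and Map Symbols": 1,
    "Alchemical Symbols": 4,
    "Supplemental Arrows-C": 9,
    "Chess Symbols": 4,
    "Symbols and Pictographs Extended-A": 7,
    "Symbols for Legacy Computing": 1,
    "CJK Unified Ideographs Extension C": 6,
    "CJK Unified Ideographs Extension E": 12,
    "CJK Unified Ideographs Extension J": 4298,
}

total = sum(additions.values())
# Sum only the three CJK Unified Ideographs extension rows.
cjk = sum(n for name, n in additions.items()
          if name.startswith("CJK Unified Ideographs"))
share = cjk / total

print(total)            # 4847
print(cjk)              # 4316
print(f"{share:.1%}")   # 89.0%
```

So the three CJK rows account for roughly 89% of the characters listed for this release.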

VHM:

Note that, as it has been since the beginning of Unicode, CJK gobbles up the vast majority of all code points (see Mair and Liu 1991).

What is this fact telling us about the Chinese writing system, particularly in comparison with other writing systems?  How does one account for this disparity?  What is the meaning of this gross disparity?

The average number of strokes in a Chinese character is roughly 12.

The average number of strokes in a letter of the English alphabet is 1.9.

The average number of syllables in an English word is 1.66 (and 5 letters).

The average number of syllables in a Chinese word is roughly 2 (and 24 strokes).

The average number of words in an English sentence is 15-20.

The average number of words in a Chinese sentence is 25 (ballpark figure; see here)

Chinese has more than 100,000 characters.

English has 26 letters.

Total number of English words: over 600,000 (Oxford English Dictionary)

Total number of Chinese words: a little over 370,000 (Hànyǔ dà cídiǎn 漢語大詞典 [Unabridged dictionary of Sinitic])

und so weiter

 


32 Comments

  1. Tom Gewecke said,

    June 16, 2025 @ 8:32 am

    I think biang (U+30EDE) should display if you install the BabelStone Han font on a Mac, at least in some apps.

  2. J.M.G.N. said,

    June 16, 2025 @ 9:32 am

    – I cannot see how such an ever-increasing number of Arabic honorific ligatures merits acceptance.
    – "HAIRY CREATURE": seriously‽ What's next, "FLAT EARTH"?
    – "Compound tone diacritics" for what language(s) ?

    What symbols do you still miss most from Unicode?

  3. Magnus said,

    June 16, 2025 @ 11:52 am

    Characters added in Unicode 17 also include the ones mentioned in this blog post: https://medium.com/@peterburkimsher/hakka-news-adding-11-unicode-characters-320c78807988
    I don't know any Taiwanese or Hakka, so I can't really comment, but I found the account of the inclusion process interesting.

  4. Chris Button said,

    June 16, 2025 @ 2:59 pm

    And I'm still waiting for the phonetic (right side) component of 漢 to be included …

  5. J.M.G.N said,

    June 16, 2025 @ 3:42 pm

    @Chris Button

    How's that dict of yours coming along?
    Looking forward to it for years now.

  6. Chris Button said,

    June 16, 2025 @ 5:37 pm

    @ J.M.G.N

    I appreciate you asking! I've been looking forward to it too!

    Work and family continue to take up most of my time, but it plods along as I eke time out now and then.

    The good news is that I've ironed out the wrinkles. It basically feels like a massive jigsaw puzzle but with all the edge pieces now firmly in place.

  7. Jonathan Smith said,

    June 16, 2025 @ 6:55 pm

    Wrt Magnus’s post, those nine "Taiwanese" Ex. J forms are taken kinda randomly from bibles/hymnals, wherein

    1 [⿸疒粒] writes the 1st syl. of lia̍p-á 'sore, boil (n.)', cf. ("alternative"? "normative"? other?) 粒仔
    2 [⿱⿳⼇口⼍足] writes the word 'tall, long-limbed', cf. 躼/軂
    3 [⿰牛周] writes the word tiâu 'pen, sty [for cows, sheep etc.]', cf. 寮/椆/牢/稠/著
    4 [⿰口毋] writes '[negative particle]', cf. 不/毋/唔
    5 [⿱艹帕] writes the 2nd syl. of chhì-phè (TIL) 'thorn bush', cf. 柿/梂
    6 [⿱艹吐] writes puh 'sprout, bud (v.)', cf. 發/窋
    7 [⿰勿愛] writes mài 'don’t!', cf. 勿/莫
    8 [⿰氵都] writes 'soak, drown', cf. 注/駐
    9 [⿰牜公] writes kang 'male of animals', cf. 公

    The words under 1, 2, 3, 9 have incorrect tones in the linked proposal; corrected above. All are common with the possible exception of the one under 5 which IDK.

    I guess it is good that these exact 2 or 3 printed texts can now be digitized faithfully? But FWIW there is no theoretical endpoint here. NB: the different ways to write the same Tw. word reflected above are a function of the nature of "Hanzi" + lack of standardization (rumor has it that MOE throws things like this to a vote for the government-sanctioned online dictionary); characters used for e.g. modern standard Mandarin are not objectively less arbitrary from a historical POV.
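
The bracketed forms in the list above are Ideographic Description Sequences: an operator character such as ⿰ (left to right) or ⿱ (above to below) followed by components. As a rough sketch of how such a sequence breaks down, the hypothetical helper `describe_ids` below labels each code point as an operator or a component, relying only on Python's standard `unicodedata` names.

```python
import unicodedata

IDC_PREFIX = "IDEOGRAPHIC DESCRIPTION CHARACTER "


def describe_ids(ids: str) -> list[tuple[str, str]]:
    """Label each code point of an IDS string as 'operator' or 'component'."""
    parts = []
    for ch in ids:
        name = unicodedata.name(ch, "")
        if name.startswith(IDC_PREFIX):
            # Keep only the layout description, e.g. "LEFT TO RIGHT".
            parts.append(("operator", name[len(IDC_PREFIX):]))
        else:
            parts.append(("component", ch))
    return parts


# Item 3 in the list above: 牛 on the left, 周 on the right.
for kind, value in describe_ids("⿰牛周"):
    print(kind, value)
```

A description like this takes three code points, whereas the newly encoded character occupies a single one; the description only says how the parts are arranged, not which word the graph writes.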

  8. David Marjanović said,

    June 17, 2025 @ 11:32 am

    What symbols do you still miss most from Unicode?

    Superscript ʢ for epiglottalized sounds… assuming it hasn't been added in the last few years, which I don't know.

    I guess it is good that these exact 2 or 3 printed texts can now be digitized faithfully? But FWIW there is no theoretical endpoint here.

    There isn't supposed to be a theoretical endpoint. The goal is to make it possible to faithfully digitize everything that's ever been written under halfway serious circumstances.

  9. Jonathan Smith said,

    June 17, 2025 @ 12:29 pm

    @David Marjanović I miswrote/thought: no *practical* endpoint, "faithfully digitize everything" being the theoretical endpoint. Re: your example, good one, with Chinese writing so mind-bogglingly full of parallel examples that for good or ill (or just neutral), if inbuilt Western cultural biases are overcome, "Kanji" will occupy a larger and larger percentage of code points — asymptotically approaching all.

    Personally I look forward to a digitized version of Xu Bing's deeply serious magnum opus Tian Shu… but hundreds of equivalent projects await the intrepid should they care to look…

  10. Chris Button said,

    June 17, 2025 @ 7:23 pm

    @ David Marjanović

    Superscript ʢ for epiglottalized sounds…

    Out of curiosity, what are you wanting to transcribe that unequivocally warrants it?

  11. Victor Mair said,

    June 17, 2025 @ 7:46 pm

    Even Xu Bing did not know what his characters meant or what they sounded like. I met him several times at his studio in New York and at exhibitions in Hong Kong and Washington DC, so had ample opportunity to ask him about that. Nonetheless, at his first showing in Beijing, those who viewed the artistic work tried extremely hard to "read" the text, and some were disappointed that they could not make any sense of it. Some were upset and thought that Xu Bing was pulling their leg or making fun of the Chinese script.

    And there are indeed many other made-up sinographic scripts, some published in SPP.

  12. Thomas said,

    June 18, 2025 @ 12:23 am

    It is mind-boggling how the CJK part of Unicode is constantly flooded with new iterations of characters that presumably no one will be using anyways. Not even Language Log provides meanings and readings for these new additions, so I am really wondering why they keep doing this.

    Now it's been a while since I thought about this topic, but one thing I have been wondering is the following: What are the most used recently added CJK characters? Are any of the characters added in the last 15 years even widely used? Suppose we restrict the question to Putonghua or Guoyu, i.e. standard Mandarin Chinese simplified or traditional. Can someone maybe have a guess at what these top characters might be?

  13. Chas Belov said,

    June 19, 2025 @ 2:24 am

    Thank you for posting my comments. I see I have some follow-up (re-)reading to do with the related posts.

    Can anyone tell me what any of these new characters I listed mean? Or do they only appear in Japanese people's names? I'm not even sure what documentation is needed for a new character. I presume these 4,000-plus characters show up somewhere and have been documented for the Unicode committee.

    @J.M.G.N.: As for characters I miss, I do wish Unicode had "m" with a grave accent to go along with ń, ǹ, and ḿ, needed for Yale Romanization of Cantonese. I guess I could write them and ask them to add it. I've tried using combining characters and haven't been happy with the alignment they produce. Let's give it another go: m̀. Hmmm, well that wasn't so bad, at least on my computer.

    @Tom Gewecke: Thank you for the referral to Babelstone Han. I'm somewhat intimidated by their statement

    This font is under continuous development, as tens of thousands of additional CJK ideographs are scheduled for inclusion in Unicode over the next few years.

    And I have to wonder how much space I want to take up on my computer for characters that I don't actually have use for beyond enjoying not seeing tofu. Specifically with regard to the beloved biang, it does me no good to see it on my computer if I cannot share it in posts with others. But should I be moved to install it, thank you for the ref.

    So, this led me to dive into the Unicode website to see if I could glean any info about these characters and how they made it into Unicode.

    There's a CJK FAQ, which includes the statement:

    The Unicode Standard is designed to encode scripts and their characters, not their specific shapes, or glyphs. Even where there are substantial variations in the standard way of writing a character from region to region, if the fundamental identity of the character is not in question, then a single character is encoded in the standard.

    Which raises the question, why does biang get separate code points for traditional vs. simplified?

    There are some characters for which one can guess based on the source information in the Unihan Database whether they are traditional Chinese, simplified Chinese, Japanese, Korean, or even Vietnamese, but there are far too many exceptions to make this reliable.

    So presumably I could track down these characters in the Unihan Database to find out where those characters came from.

    A lack of reading data simply means that nobody supplied a reading, not that a reading doesn’t exist.

    So I might be able to learn how the character is pronounced, but that's not guaranteed.

    And, alas, the public Unihan database only covers through Unicode 16.0, so I will have to patiently wait until it is updated to include 17.0 to seek further knowledge.

    The Ideographic Research Group does provide working set documents for its work on 17.0, but I don't know how useful it will be. It doesn't include Unicode code points. By a stroke of fortune, the character 五 with the bottom stroke missing happens to be the very first character shown. But it only shares that the proposal for this character came from the Unicode Committee, the committee's sequence number for the character, and little more. So this document may not help my quest.

  14. Chas Belov said,

    June 19, 2025 @ 2:32 am

    I must apologize for the lack of links in the previous post. Past experience has shown that including a bunch of links in a post leads to delays. However, the following link is so delicious that I must share it.

    Regarding 323B0, which looks like the character 五 with the bottom stroke missing and which was the first character in the 17.0 work document, there is a link in that document to the character.

    The character comes from Tianwenge Qinpu (天闻阁琴谱), vol. 1, fol. 2, which is opaque to me, but I'm sure many of you will know what that document is. It is highlighted, along with several other characters, on the working group page for that character, which also gives other technical info about it.

  15. Chas Belov said,

    June 19, 2025 @ 2:35 am

    And here is the main chart page for the 17.0 working document.

  16. Chas Belov said,

    June 19, 2025 @ 3:31 am

    Okay, I may be getting the hang of this. You can search by radical or take the URL for the no-bottom-stroke 五 in my post above and change the 00003 to the relevant work character.

    – 323B3, work character number 00006, looks like an arrangement of three 三s – does it possibly mean the same as 九?

    Nope. Source 1: 路迪民:"亳州老君碑古字谱考释" in 《武当》(2007-06) pp. 20–21

    The character is noted as 周易卦名. If so ⿱三⿰三三 would be a variant of ䷋ (否卦). If it is 坤卦, the shape should be ䷁.

    – 32501, work character number 00411, while not up to the character for biang for complexity, is nevertheless quite a stroke pile: the 厂 radical enclosing a 3 by 3 array of the character 有

    詞林三知抄 (1604)

    It's apparently Japanese and comes in 6 有 and 9 有 varieties. There's quite a bit of discussion about it. One comment reads:

    It is a proper name character for 天橋立, a Japanese scenic site as an important poetic motif.

    – 3261E is the character 乙 in a circle, which doesn't look quite right to me as a legit Chinese character

    Couldn't find this one.

    – 326FB, work character number 01014, seems sexist to me: three 男 over one 女

    Source 1: 路迪民:"亳州老君碑古字谱考释" in 《武当》(2007-06) pp. 20–21

    – 33143, work character number 04008, has ⻌ enclosing a 3 by 3 array of the character 日

    雲歩色葉集 (1571)

    And that's enough for an evening's research. ¡Fascinating!

  17. JMGN said,

    June 19, 2025 @ 5:28 am

    @Chas Belov
    https://en.wiktionary.org/wiki/m%CC%80

  18. Magnus said,

    June 19, 2025 @ 5:51 am

    On the working group page, you can look up characters by the source references in the Unicode chart, below each character. For example, 乙 in a circle (3261E) has source reference SAT-01395, so you can enter that code in "Find by Source Ref" and arrive at this page:
    https://hc.jsecs.org/irg/ws2021/app/index.php?find=SAT-01395

  19. Chris Button said,

    June 19, 2025 @ 8:15 am

    I do wish Unicode had "m" with a grave accent to go along with ń, ǹ, and ḿ,

    I would like schwa with acute and grave accents.

  20. David Marjanović said,

    June 19, 2025 @ 9:12 am

    AFAIK, precombined letters + diacritics only exist for backwards compatibility with older encodings, and no new ones are supposed to be added because the whole idea is to encode the base characters and the diacritics separately. (…With lots of more or less arbitrary decisions of what counts as a separate base character vs. a character with a diacritic or two.) Issues like alignment are supposed to be a matter for font designers, not for Unicode.
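
The difference between precomposed letters and base-plus-diacritic sequences shows up under NFC normalization, which composes a sequence only when a precomposed code point already exists. A minimal sketch using Python's standard `unicodedata` module (the helper name `nfc_codepoints` is made up for illustration):

```python
import unicodedata


def nfc_codepoints(base: str, mark: str) -> list[str]:
    """NFC-normalize base + combining mark and list the resulting code points."""
    composed = unicodedata.normalize("NFC", base + mark)
    return [f"U+{ord(ch):04X}" for ch in composed]


GRAVE = "\u0300"  # COMBINING GRAVE ACCENT
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

print(nfc_codepoints("n", GRAVE))  # ['U+01F9']  precomposed ǹ exists
print(nfc_codepoints("m", ACUTE))  # ['U+1E3F']  precomposed ḿ exists
print(nfc_codepoints("m", GRAVE))  # ['U+006D', 'U+0300']  no precomposed form
print(nfc_codepoints("ə", ACUTE))  # ['U+0259', 'U+0301']  no precomposed form
```

The last two stay as two code points, so how well the accent sits on the letter is left to the font.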

    Out of curiosity, what are you wanting to transcribe that unequivocally warrants it?

    A bunch of underresearched matters. One example are the "strident vowels" of various "Khoisan" languages; they're said to be epiglottalized. Another is the intriguing idea that [æ] doesn't belong on the vowel chart – that it's epiglottalized [ɛ].

  21. David Marjanović said,

    June 19, 2025 @ 9:13 am

    backwards compatibility

    …not so much; I meant the opposite – the ability to convert digital text from older encodings to Unicode without losses.

  22. Chris Button said,

    June 19, 2025 @ 1:10 pm

    AFAIK, precombined letters + diacritics only exist for backwards compatibility with older encodings, and no new ones are supposed to be added because the whole idea is to encode the base characters and the diacritics separately.

    Ah, good to know.

    One example are the "strident vowels" of various "Khoisan" languages; they're said to be epiglottalized.

    Well, since ʢ is given a distinct symbol from ʕ, I suppose it also deserves its own superscript version regardless of whether ʢ vs ʕ merit their own distinct symbols purely in terms of epiglottal vs pharyngeal. Not an area I know much about.

    Another is the intriguing idea that [æ] doesn't belong on the vowel chart – that it's epiglottalized [ɛ]

    [e] vs [ɛ] as a tense vs lax distinction (like [i] vs [ɪ]) makes good sense to me. Outside of the extremities of the vowel space, it all gets a little arbitrary as to what gets a symbol and what it represents.

  23. Jonathan Smith said,

    June 19, 2025 @ 1:51 pm

    Hmm, my joke goes that only two kinds of people could be silly enough to consider there to be one single "Chinese" language: Chinese and non-Chinese (outstanding question being who started it)… a parallel description characterizes those who consider "Kanji" to have inherent "readings" and "meanings" as opposed to being symbols to write shit down with — SO perusal of some above links suggests that Unicode Deciders are a combination of native Kanji users and non-natives, who knew.

  24. Chas Belov said,

    June 19, 2025 @ 4:03 pm

    @JMGN: Thank you for the Wiktionary link. It is clear from the URL ("m" followed by two hex encodings) that it is using the combining character and is not a pre-composed m with grave accent.
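
The reading of that URL can be confirmed by percent-decoding its last path segment. This sketch uses only Python's standard library; the two escaped bytes CC 80 are the UTF-8 encoding of a single combining mark.

```python
import unicodedata
from urllib.parse import unquote

# Last path segment of the Wiktionary URL quoted above.
decoded = unquote("m%CC%80")  # unquote() decodes the bytes as UTF-8 by default

code_points = [f"U+{ord(ch):04X}" for ch in decoded]
names = [unicodedata.name(ch) for ch in decoded]

print(code_points)  # ['U+006D', 'U+0300']
print(names)        # ['LATIN SMALL LETTER M', 'COMBINING GRAVE ACCENT']
```

Two code points come out: the plain letter and a combining grave accent, not a single precomposed character.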

    @Chris Button:

    I would like schwa with acute and grave accents.

    ¿Was that humor? If so, my compliments. If not, ¿where would that be useful?

    @David Marjanović:

    AFAIK, precombined letters + diacritics only exist for backwards compatibility with older encodings, and no new ones are supposed to be added because the whole idea is to encode the base characters and the diacritics separately.

    Alas. Well, that saves me the trouble of suggesting it.

    @Magnus:

    On the working group page, you can look up characters by the source references in the Unicode chart, below each character.

    ¡Thank you! I had never noticed that.

    It does seem to be a bug that it's not showing up when I search the working group database on the 乙 radical. I'll try to report it.

    Anyway, thank you very much for the link to the beta character. The discussion is fascinating. Especially the posts that circle-enclosed-乙 needs to be distinguished from square-enclosed-乙 because the two characters have different meanings in Guangzhou and in Hakka.

  25. Chas Belov said,

    June 19, 2025 @ 4:10 pm

    My remaining original question, if anyone knows, is why they added some of the characters to earlier CJK extensions rather than all of them to the new extension.

  26. Chris Button said,

    June 19, 2025 @ 4:15 pm

    @ Chas Belov

    Haha. No it wasn't intended as humor, but I do see why you might have taken it that way!

    Edwin Pulleyblank used acute and grave accents to symbolize prominence on the vowel (grave accent) or coda (acute accent) of a syllable.

  27. Chas Belov said,

    June 19, 2025 @ 4:31 pm

    @Magnus: And I see I have been overlooking the source reference for some time. I went back to the CJK Extension A code chart and arbitrarily picked 㐥, U+3425, no translation found in Wiktionary, which has a source reference JMJ-067978. But when I try to plug JMJ-067978 into the search on the working group database, it gets a Not Found response. I wonder whether there are other databases for sources.

  28. KIRINPUTRA said,

    June 19, 2025 @ 10:42 pm

    If the idea of Unicode is to codify symbols that are or have been in use, then graphs like these …

    [⿸疒粒]
    [⿱⿳⼇口⼍足] (This might be a customary alt. graph.)
    [⿰牛周]
    [⿰口毋]
    [⿱艹帕]
    [⿱艹吐]
    [⿰勿愛]
    [⿰氵都]
    [⿰牜公]

    … would be a perversion of that, arising where one man's craving meets another man or woman's misunderstanding.

    AFAICS, it was never intended that Unicode codify proposed symbols / graphs, which is essentially what graphs like these are. (These proposals may have made it into a Bible, but that edition of that Bible is barely in use anywhere, and next to nobody can read those graphs unless guided by others reading, or romanisation, or other translations.)

    "Craving" refers to the yearning of some for union between Modern Chinese (more or less Mandarin) and certain (partly) subjugated tropical (or sub-) tongues — Taioanese, Hakka, Hokkien, etc. "Culturally" Chinese-nationalist Confucians have sought to create, since c. 1920, Mandarin-compatible neo-scripts with, esp. after c. 1970, sets of exotic extension graphs for the more "incompatible" words. (Of the above, ⿰氵都 most transparently reflects the depth of this obeisance, since even the sound component is Mandarin-based….)

    "Misunderstanding" refers to the conclusions outside observers arrive at on the basis of common sense, mixed with an exaggerated hands-off respect for the Native. Yes, new & deviant usage is also usage, but these neo-scripts are not that; they're fundamentally not used. These idiosyncratic make-believe scripts are driven by Neo-Chinese ideology, while they're "suppressed, for convenience" in favor of Mandarin in almost every situation where you'd think they'd be used — but the Native Confucian won't tell you (or each other) that. Fact-finding is deeply confounded.

    OTOH, Unicode has somehow done a good job of supporting the customary Hakka, Hokkien & Taioanese sinoscripts, as well as the Vietnamese, and others. In part this just reflects the fetish-free, not-exotic nature of many of the customary scripts.

    This graph, for example, is customary (and very consistent, apparently) for Hakka MÀNG ("not yet"):

  29. KIRINPUTRA said,

    June 19, 2025 @ 10:47 pm

    That didn't display, but you can see it here. It's like 日 lying on its side.

    https://zi.tools/zi/%F0%AB%A9%8F

  30. Chas Belov said,

    June 20, 2025 @ 2:50 am

    I finally located the discussion, such as it was, for biang, in the 2015 working set – apparently the search does not work across working sets.

    The traditional form for biang has one post, as does the simplified form (change the 00791 in the URL to 01312).

    791 reports the simplified form is submitted separately as 1312, but not a peep as to why they are separate.

  31. Jonathan Smith said,

    June 22, 2025 @ 12:53 pm

    Re: what is actually "in use", Wikipedia-Unicode quotes Becker at Xerox (1988) to the sensible effect that priority would be "characters published in the modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988)" with "others […] defined to be obsolete or rare; these are better candidates for private use registration […]" And re: "CJK ideographs" specifically, the historical account at unicode.org also refers to "modern characters in common use." Whereas at this point, 90%+ of "CJK", which is in turn 80% and rising of Unicode period, is completely useless by which I mean literally never used once by anyone period. FWIW.

    Incidentally re: the submission linked above by Chas Belov, the submitted "evidence" doesn't even vaguely match the accepted submission, fact of which UTC says it is "fully aware." Clearly the form actually shown on that dictionary page (if anyone can make it out) should be added immediately…

  32. Vampyricon said,

    July 1, 2025 @ 3:21 pm

    > Note that, as it has been since the beginning of Unicode, CJK gobbles up the vast majority of all code points (see Mair and Liu 1991).
    >
    >What is this fact telling us about the Chinese writing system, particularly in comparison with other writing systems? How does one account for this disparity? What is the meaning of this gross disparity?

    This tells us that Chinese characters were encoded prior to the decisions made by Unicode to implement combining characters, like with Egyptian hieroglyphs, and due to the requirement of backwards compatibility, Chinese cannot be redone, and thus the list of characters grows ever longer.
