Sinographs by the numbers


John asked:

Does the Unicode process restrict the language people use? For example, I haven't seen any requests to add any new English characters – I can write whatever I like without having to amend the base character set.

Good question and observation!

Our Roman alphabet has 26 × 2 = 52 letters, and you are right.  With them we can write any word that we can say.

It's quite a different matter with Chinese characters.  There are tens of thousands of them, only about 3,000 of which are in fairly common use (covering 99.18% of all occurrences), and about 6,000 of which are in infrequent use (covering about 99.98% of all occurrences).

Most great works of Chinese literature, both modern and premodern, are written with a repertoire of around 3,000 characters or fewer.

I've never met a single person, no matter how learned, who could actively produce more than 8,000 characters by handwriting, without the aid of any electronic device.  (I think that there must be an upper limit to the number of different characters the human brain can keep distinct for active production, which requires intricate neuro-muscular coordination.)

10,000 characters would cover 99.9999999% of all occurrences.

[VHM:  The coverage figures in the previous paragraphs are taken from "Modern Chinese Character Frequency List" by Jun Da.]
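
To see where such figures come from, here is a minimal sketch of the cumulative-coverage calculation.  It assumes a hypothetical tab-separated frequency list (one character and its raw count per line); the file name and layout are illustrative, not Jun Da's actual format.

    # Sketch: cumulative text coverage of the N most frequent characters.
    # Assumes a hypothetical file "char_freq.tsv" with lines like "的<TAB>7922684".

    def coverage(path, cutoffs=(3000, 6000, 10000)):
        counts = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                _char, count = line.rstrip("\n").split("\t")
                counts.append(int(count))
        counts.sort(reverse=True)              # most frequent first
        total = sum(counts)
        running = 0
        results = {}
        for rank, c in enumerate(counts, start=1):
            running += c
            if rank in cutoffs:
                results[rank] = running / total
        return results

    # For a list resembling Jun Da's, this yields roughly
    # {3000: 0.9918, 6000: 0.9998, 10000: ...}.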

Unicode has over 70,000 code points that are reserved just for Chinese characters, so you can see that the vast majority of them, more than 60,000, almost never occur; yet each one has to take up a code point simply because it "exists", though it is virtually never used in modern life.

Those 70,000+ Chinese characters occupy approximately 7/11ths of all currently assigned Unicode points.  But they are not enough, because any Chinese person can invent a character for his or her own name or for some other reason (whimsy, poesy, topolectal expression, etc.), and sometimes people actually do this (though governments may not recognize such characters for obvious reasons, e.g., their computer systems cannot handle them, other people do not know how to pronounce them, etc.).

Here's an article about a doctor in Taiwan, Lín Guóhóng 林圀宏, who has a character in his name that is so extremely rare that practically nobody who meets him knows how to pronounce it or what it means:  圀.  As a matter of fact, this character is an old variant of guó 國 / 国 ("country").  Together with a group of other special characters, it was invented by a female emperor (yes, that's what I said) named Wu Zetian (624-705) to celebrate the founding of her own dynasty, the Zhou (684-705).  After her dynasty collapsed, the special characters that Wu Zetian had created swiftly fell out of use.

So how did such an old, strange character turn up in the name of a doctor in Taiwan in the 21st century?  It turns out that Japanese monks who had travelled to China during the Tang Dynasty (618-907) picked up this character and took it back to Japan with them, where it was sustained for the next thirteen centuries and more.  When Japan made Taiwan a part of its empire (1895-1945), it brought 圀 with it to the island.  Taiwanese who went to school during the Japanese occupation became familiar with the character.  Some families who admired the Japanese system of education adopted the character for their given names and passed it down through the generations.

Preserving an extraordinarily rare character that is of extremely low frequency is vastly more difficult and demanding — in terms of mental and electronic resources — than preserving a rare, low frequency word with an alphabetic script.  Is it worth the cost and the effort?

Readings

[H.t. Chau Wu]



65 Comments

  1. Andreas Johansson said,

    January 22, 2019 @ 3:50 am

    This is perhaps obvious to a sinologist, but why is the character considered a separate character if it's in origin a variant of another, and apparently used identically?

  2. GH said,

    January 22, 2019 @ 3:53 am

    Is it worth the cost and the effort?

    That seems like a strange question to me. If people do it, then apparently it is, to them. For the most part, they themselves bear the cost and inconvenience.

    Is it "worth" preserving smaller local languages, or should all Chinese just speak Mandarin? After all, it would save a lot of cost and effort.

  3. ~flow said,

    January 22, 2019 @ 5:48 am

    Clearly, if we could only rid ourselves of those pesky uppercase letters, we'd be that much more efficient and, hence, more affluent.

    @Andreas Johansson—it is indeed not quite clear, from the outset, what to call 'a character', but technically, in a character set, you got to have one code point per distinguishable shape. That, at least, is the general idea, which is in practice often violated for a number of reasons. 圀 is a facultative variant of 國, so if you don't give it a code point, there's no way of indicating that it's this form you want to write. 国 is another variant of 國, and has another code point. You could say that 圀, 國, 国 are three ways to write the 'same character', and this would capture an important fact about these three shapes (namely, that they can be used interchangeably in a way that other shapes can not).

  4. Philip Taylor said,

    January 22, 2019 @ 5:52 am

    Victor, you say that you have "never met a single person, no matter how learned, who could actively produce more than 8,000 characters by handwriting, without the aid of any electronic device". Is there reliable historical evidence as to how many characters a candidate for the Imperial examinations might have been expected to know, and did this change over the period (around two millennia, I believe) that the examinations were in regular use ?

  5. Victor Mair said,

    January 22, 2019 @ 6:25 am

    For the most part, they themselves bear the cost and inconvenience.

    No, users of the whole system bear the cost and inconvenience.

    Is it "worth" preserving smaller local languages, or should all Chinese just speak Mandarin?

    You can find no greater advocate of the topolects than Victor H. Mair, so please do not distort what I have written.

  6. Victor Mair said,

    January 22, 2019 @ 6:39 am

    One of the best ways to protect, preserve, and empower the topolects is to permit, indeed encourage, them to be written alphabetically.

  7. Andreas Johansson said,

    January 22, 2019 @ 6:52 am

    @~flow:

    I understand the technical part. What I don't understand is why it was decided it's a separate character that needed to be represented in the first place. We don't assign Unicode code points to every variation of a Latin letter (tho we do so to quite a few, because they have specialized uses in some context other than normal writing).

  8. Ken said,

    January 22, 2019 @ 8:19 am

    @Andreas Johansson: My understanding is that a goal of Unicode is to represent all glyphs used in writing, including in scholarly articles (and blog posts) on obscure Chinese characters used for only a brief period in the seventh century.

    Whether that's a feasible goal is another matter.

  9. Theophylact said,

    January 22, 2019 @ 9:16 am

    53. You forget the apostrophe.

  10. Theophylact said,

    January 22, 2019 @ 9:18 am

    (I suppose you need the hyphen and the diaeresis as well.)

  11. Victor Mair said,

    January 22, 2019 @ 9:38 am

    Those are punctuation marks and diacriticals, not letters.

  12. ~flow said,

    January 22, 2019 @ 9:42 am

    @Andreas Johansson "We don't assign Unicode code points to every variation of a Latin letter"—well, not every variation, but indeed quite a few. Historically, the Unicode consortium used to be much more restrictive in the question what should and what should not get a code point of their own. For example, in the first editions, each Arabic letter only got a single code point, the argument being that font technology should care about the distinct shapes needed in conjoined writing; that was soon found to be inadequate, and nowadays we have what they call 'representation forms'.

    What's more, the superficially simple Latin script with its 26 letters (and digits, and punctuation, and diacritics, and so on) is represented in well over one thousand code points, which should be borne in mind when comparing scripts.

    The consortium used to be very optimistic about what CJK character forms could be 'unified'; as a holdover from that time, there are certain code points like 龜 and 门 whose shapes are highly dependent on the font and / or language settings, more so than the shapes of most other code points. It is conceivable that at some point in the future we will find it more convenient to encode CJK characters as structural formulas, so 圀 could become ⿴囗⿱八方, but nobody has come up with the necessary software to do that, yet. Something very similar already works for Korean Hangeul.
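
    (To make the Hangeul comparison concrete: a precomposed Hangul syllable's code point is computed arithmetically from the indices of its component jamo. A minimal sketch, with one illustrative syllable:)

        # Unicode's algorithmic composition of a Hangul syllable from jamo:
        #   syllable = 0xAC00 + (lead * 21 + vowel) * 28 + tail
        # lead index 0-18, vowel index 0-20, tail index 0-27 (0 = no final consonant)

        def compose_hangul(lead: int, vowel: int, tail: int = 0) -> str:
            return chr(0xAC00 + (lead * 21 + vowel) * 28 + tail)

        # ㅎ (lead 18) + ㅏ (vowel 0) + ㄴ (tail 4) -> 한 (U+D55C)
        assert compose_hangul(18, 0, 4) == "한"

    An ideographic description sequence such as ⿴囗⿱八方, by contrast, only names the components and their layout; no comparable arithmetic maps it to a single code point, which is part of what makes the CJK case harder.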

  13. WSM said,

    January 22, 2019 @ 9:50 am

    diacriticals at a minimum are fair to count, since without them it's not possible to "write any word that we can say", assuming "we" to include speakers of languages other than English. They also add significantly to electronic storage burdens, which, compared to the flat 3 bytes per UTF-8-encoded Chinese character, I'm not at all persuaded are that much smaller for alphabetic languages.
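
    (For concreteness, the per-character cost can be checked directly; the sample strings below are just illustrations:)

        # UTF-8 storage cost per character for a few scripts.
        for s in ["national", "naţional", "國", "圀", "🙂"]:
            print(f"{s!r}: {len(s)} characters, {len(s.encode('utf-8'))} bytes")
        # ASCII letters take 1 byte each, most accented Latin letters 2,
        # CJK characters 3, and supplementary-plane characters (emoji,
        # rare ideographs beyond U+FFFF) 4.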

  14. Doctor Science said,

    January 22, 2019 @ 10:04 am

    Just chiming in to boost @Philip Taylor's question, which immediately occurred to me, as well.

  15. Victor Mair said,

    January 22, 2019 @ 10:18 am

    I was thinking of English.

    Compare 70,000+ to 1,000 or so out of a total of around 110,000 currently assigned Unicode points. The burden of the former is far greater than that of the latter.

  16. Alex said,

    January 22, 2019 @ 10:28 am

    @flow

    "Clearly, if we could only rid ourselves of those pesky uppercase letters, we'd be that much more efficient and, hence, more affluent."

    If one were to be sarcastic, at least one could have taken the time to find a better example. Capital letters have an efficiency for proper nouns: an apple vs. buying an Apple. Yes, you can say the new rule would be to underline the first letter instead of capitalizing it, but more importantly, what if China decided to keep adding brand new characters? For example, I remember in an old post there was this artist who created a whole book of whimsical characters; if you picked one and showed it to the man on the street, the person would just think it's a character they forgot. I actually don't know when the last new character was created. I guess it's something I will have to google when I have time. I suppose it's when they started simplifying things the first and second times. As for effort, one can go back to the article on how hard it is to create fonts. Now that new words are being created by using character phonetics, for example the Chinese word for salad, 沙拉 (sand and pull), don't you think it's time to make life easier for millions of students and move to pinyin? If they aren't creating actual new characters, should we ask why?

    I'd imagine the answer would be the tremendous amount of effort involved in updating IT systems.

  17. WSM said,

    January 22, 2019 @ 10:54 am

    @Alex I do think it would be good for Chinese writing to introduce a "kana" system specifically for the purpose of rendering foreign words such as "salad", since the current system based on Chinese characters with vaguely similar Mandarin pronunciations is terribly imprecise. Such an introduction does not require replacing the entire current writing system by pinyin or bopomofo, either of which would be excellent candidates for such a kana syllabary.

    Adding new characters doesn't require any "updating IT systems", since the currently implemented Unicode standard provides for upwards of 1.1 million codepoints: far, far more than is currently being used. Per a recent LL post, even Japanese speakers seem keen on introducing new characters, despite the ready availability of several phonetic syllabaries; why Chinese speakers decided to go with "sha-la" instead of some new (polysyllabic?) character is anyone's guess. In any case, the putative burden associated with introducing new characters into the Unicode standard seems more bureaucratic than technical, and I suspect that at this point the bureaucratic burden of adding new Chinese characters to the Unicode space is dwarfed by the burden of adding new emoji, which seem to be added at a much faster rate.

  18. ~flow said,

    January 22, 2019 @ 11:20 am

    @Alex I chose my example because there is an actual discussion in Germany to at least partly abolish capital letters. This has been going on for centuries (see https://de.wikipedia.org/wiki/Kleinschreibung) and was last widely (and heatedly) discussed in the 1970s, the argument being that the German orthography (which mandates upper case for all nouns) is too difficult to learn and, hence, an undemocratic barrier to education and professional success for children from poor families. (Interestingly, only three decades before that, all of Germany used to use *two* bicameral alphabets in parallel). So this shows that there are advocates for simplification even in a writing system that is already orders of magnitude simpler than Chinese orthography.

    Does anyone know about a more or less objective metric for 'difficulty of a writing system' and 'difficulty of an orthography'? Or have a suggestion for what could be the world's simplest and / or most reduced writing system?

  19. Philip Taylor said,

    January 22, 2019 @ 11:26 am

    Sigh. It is (IMHO) only "too difficult" to learn that all nouns should have a leading cap. if one is not taught how to parse a sentence in the first place. "In my day" (60 years ago) every child in the land was taught how to parse; now, it would seem, parsing is a dying (if not already dead) art (as is spelling). We are (IMHO) slowly going back to the dark ages.

  20. Frédéric Grosshans said,

    January 22, 2019 @ 11:29 am

    @~flow I think you have the history wrong about Arabic in Unicode. Some pre-Unicode encodings had representation-form encodings, which go against the character-vs-glyph model of Unicode, so Unicode encoded them only for compatibility with earlier standards, and they are clearly labeled as compatibility characters in the standard, not to be used in new texts. The encoding model you describe as impractical for Arabic (one character with context-dependent shapes) is the current one, as you can read in the 1st paragraph of section 9.2 of the current standard (https://www.unicode.org/versions/latest/)

    And the unification of CJK characters (“Han Ideographs” in Unicode language) was understood very early on to be a complex matter, and the standard itself has several pages on this in its chapter 18. Beyond the principle, this task is a difficult one, as witnessed by the constant work of the IRG (Ideographic Rapporteur Group) of Unicode ( https://www.ogcio.gov.hk/en/our_work/business/tech_promotion/ccli/iso_10646/irg.htm ) over the last few decades.

    Korean Hangeul, despite its appearance, is a very different story: it’s an alphabet, and its encoding has nevertheless been complicated, with different versions in 1.0, 1.1 and 2.0 (see e.g. http://unicode.org/pipermail/unicode/2015-June/002038.html for some details).

  21. WSM said,

    January 22, 2019 @ 11:37 am

    I think the world's most reduced writing system is binary, or any representation you can choose for two states for "yes" and "no". "Simple" seems too subjective to be a useful criterion.

  22. Alex said,

    January 22, 2019 @ 12:49 pm

    WSM said,

    "Adding new characters doesn't require any "updating IT systems""

    I am not an expert on this area of IT but my gut tells me otherwise. In the enterprise world there is a myriad of different operating systems running on versions going back 15 years, which generally speaking requires planning when rolling out a patch. I am fairly certain that you couldn't type a new character without some kind of update. I'd also believe that if there were any comparisons going on, it would need to be matched against some database or LDAP entry. Even in my Microsoft Word, how would I select the pinyin input for that character if there were no update? Now I'm imagining banking and the name on an account, or credit-card input for payment and matching of address or name.
    What if that guy who has that obscure character actually had a brand new character that was created yesterday? Do you think he would have any IT difficulties opening an account when he fills out the form for a new account by writing it by hand on the paper form?

    As I said, I am not an expert on this area of IT, but from my experience with enterprise IT and roll-outs, nothing is easy.

  23. WSM said,

    January 22, 2019 @ 12:53 pm

    You would need to update fonts and IMEs, yes – but such updates occur all the time, and largely invisibly to everyone including the admins performing them.

  24. Kangrga said,

    January 22, 2019 @ 12:53 pm

    @Philip Taylor: While it might in principle be simple to learn to recognise nouns, German orthography has a huge number of arbitrary conventions about capitalisation. For example, there are two suffixes that can derive an adjective from a personal name, "-'sch" and "-isch". They are clearly just variants of each other. However, with the first, the resulting adjective is to be capitalised, with the second it isn't: "Luther'sch", but "lutherisch". The same is the case with "-er" and (again) "-isch", which derive adjectives from toponyms: "Hamburger" but "hamburgisch". "Was" is (among other things) a colloquial clipping of "etwas", meaning 'something'. However, the Orthographical Conference has decided that the two have different grammatical functions — one being classified as a pronoun and the other a quantifier — so that 'something red' is "etwas rotes" but "was Rotes". German High School students have to take tests that focus specifically on such complicated cases.

  25. WSM said,

    January 22, 2019 @ 12:56 pm

    It would require updating fonts and IMEs, yes – but such updates occur all the time anyway, and largely invisibly to the admins performing them.

  26. Chris Button said,

    January 22, 2019 @ 1:42 pm

    … only about 3,000 of which are in fairly common use (covering 99.18% of all occurrences), and about 6,000 of which are in infrequent use (covering about 99.98% of all occurrences).

    It's amazing how little benefit is gained from the upper echelons.

    For my "Derivational Dictionary of Chinese and Japanese Characters", I'm including the 6,500 characters that form Level 1 (3,500) and Level 2 (3,000) of the Tōngyòng Guīfàn Hànzìbiǎo in China (I'm excluding the 1,605 in Level 3), along with the 2,769 characters that form the Jōyō (2136) and Jinmeiyō (633 exclusive characters out of a total of 863 including variants of other Joyō/Jinmeiyō characters) lists in Japan. Since the vast majority of the Japanese characters overlap with the Chinese ones, the total number of entries (using traditional forms for the head characters with Chinese/Japanese simplified forms in the body) happily remains only slightly above 6,500.

  27. Antonio L. Banderas said,

    January 22, 2019 @ 2:27 pm

    @Chris Button

    Could you please elaborate a bit on your "Derivational Dictionary"? ETA? Thanks

  28. Lars said,

    January 22, 2019 @ 3:52 pm

    @~flow and @Philip Taylor: Capitalised Nouns were abolished in Danish in 1948; part of the reason given was that it was too difficult to learn. Myself, I suspect that post-war sentiments against all things German had something to do with it as well. :)

    Kangrga mentions one complication, and in general there are words that behave enough "like nouns" to be capitalised too, and making that determination is pretty involved. So learning what a noun is is not quite enough.

  29. Philip Taylor said,

    January 22, 2019 @ 4:25 pm

    Kangrga, Lars — Thank you for the additional information. I now understand that identification of words requiring a leading cap. is not as simple in German as it is in English.

  30. ohwilleke said,

    January 22, 2019 @ 6:21 pm

    As I understand it, dictionaries of Chinese characters are organized at the first order by the number of pen strokes necessary to write that character.

    All Chinese characters are composites of a fairly small number of strokes (often single digits or low double digits, and no more than about 62 strokes). And, there are only certain kinds of markings that can qualify as grammatically recognized strokes. You can't, for example, drop a middle finger emoji or a poop emoji into a grammatical Chinese character. Similarly, nobody would ever confuse Arabic script for Chinese characters, for example. You could teach a computer without undue difficulty to distinguish between grammatical Chinese characters that someone could legitimately invent and logograms that are incapable of being grammatical Chinese characters. For example, only certain stroke shapes, variations in length, orientations and positions of strokes relative to each other are permitted.

    The universe of potential strokes that could be included in a character, it would seem to me, must be much smaller than the universe of all characters that can be devised by combining those strokes using some system to identify its location relative to other strokes in the character and its orientation. If you wanted to be really clunky about it, the correct way to draw every Chinese character could be expressed entirely in words consisting of other Chinese characters that would rarely run longer than a long paragraph.

    Why is it not possible to have Unicode for Chinese characters that combine certain strokes, or perhaps in some cases certain sequences of strokes that recur frequently, rather than separately coding each character?

    Note that I'm not saying that the user interface would necessarily work this way. Somebody with a smart phone who wasn't a programmer wouldn't need to know the technical code for turning a Chinese character into a set of strokes and stroke locations and orientations, but the Unicode set could be profoundly smaller and a typical Chinese character would be a string of stroke codes. Essentially, stroke codes would be the visual analog to vowels and consonants and diphthongs and accent marks.
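
    (A purely hypothetical sketch of that idea follows; the stroke codes, their names, and the description of 木 are invented for illustration and do not correspond to any real encoding standard:)

        # Hypothetical: a character as a sequence of stroke codes rather than
        # one opaque code point. Inventory and example are illustrative only.
        STROKES = {
            "H": "horizontal (héng)",
            "S": "vertical (shù)",
            "P": "left-falling (piě)",
            "N": "right-falling (nà)",
        }

        MU = ["H", "S", "P", "N"]     # 木 "tree": its four strokes in writing order

        def describe(strokes):
            return " + ".join(STROKES[s] for s in strokes)

        print(describe(MU))           # horizontal (héng) + vertical (shù) + ...

    Even so, a bare sequence of stroke codes omits the relative position and proportion of the strokes, which a full compositional scheme would also have to encode; the Unicode FAQ quoted in a comment below gives the consortium's reasons for not going down this road.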

  31. Chris Button said,

    January 22, 2019 @ 9:41 pm

    @ Antonio L. Banderas

    If you're familiar with Akiyasu Todo's "Kanji Gogen Jiten", it's essentially a modern expanded version of that based on more sophisticated understandings of Old Chinese inscriptions and phonology over half a century later (I lean heavily on the works of Ken-ichi Takashima and Edwin Pulleyblank respectively in those regards). Furthermore, to avoid the inevitable speculativeness associated with Todo's work, any proposed etymological connections running through the word families represented by the characters are supported by typological comparisons with Proto-Indo-European (not suggesting any primordial connection between OC and PIE but rather based on semantic universals). While its academic foundation should appeal to specialists, it actually has a more fundamental aim of serving the needs of students of Chinese and Japanese writing by providing logic to the composition of the characters. Essentially, it's something I would have loved to have had when I was a student!

    As for when it will be completed, working full-time I would be easily finished in under 2 years from now. Unfortunately, I cannot afford to do that without financial support so it is taking significantly longer. Sorry I can't be more specific.

  32. Marnanel said,

    January 22, 2019 @ 11:22 pm

    Alex: the chap who wrote a book full of invented characters was Xu Bing. The book was called "Book from the Sky". It reminds me of the Codex Seraphinianus.

  33. Bob Ladd said,

    January 23, 2019 @ 1:57 am

    @ohwilleke: You might want to look at Richard Sproat's Computational Theory of Writing Systems (available on ResearchGate) to get an idea of how complicated what you propose might be.

  34. loonquawl said,

    January 23, 2019 @ 2:33 am

    Is there an upper bound for the information needed to store any possible Chinese character? I.e., I guess a 'stroke' would not simply be a line (characterized as a line from point A to B); it would be conceivable to have two Chinese characters that only differ in the way one of the strokes tapers, so you'd also have to store a taper for every stroke (thickness at point A, thickness at point B). Is there a recognized lower limit to the resolution with which those two measures (position and width) would have to be stored? Is there anything else that could set two Chinese characters apart? Otherwise it might be possible to store the very infrequent characters as their graphic representation (it would of course be a much bigger blob of information than the surrounding Unicode characters, but as they are infrequent, the resultant files would not be much bloated on average).

  35. Chas Belov said,

    January 23, 2019 @ 3:47 am

    I saw the exhibit of Xu Bing's work at the Asian Art Museum in San Francisco and, even with my dreadfully inadequate Chinese comprehension, I was in awe. Has there been any initiative to add Xu Bing's character set to Unicode?

    I seem to recall that the 1993 album Broadcast Drive Fans Murder 廣播道軟硬殺人事件 by Softhard created a character "lup," meaning "love and [condom] protection" for their song "Dimgaai Yui Dai Ga Lup?" I have the CD booklet filed away somewhere, alas not handy, and remember seeing the character but would not be able to reproduce it. It might have the "establish" phonetic and the heart radical, but I can't say for sure. Also, not sure if they were able to typeset it or if it only occurred in the awesome mini-poster that came with the album and had the album's entire lyrics in graffit style by an HK graffiti artist. In any case, my Chinese would not have been good enough at the time (and still isn't) and I believe I only learned about the new character from an article about the album in the South China Morning Post, which in 1993 I would have read at the Chinatown branch of San Francisco Public Library.

  36. Chas Belov said,

    January 23, 2019 @ 4:03 am

    Urk, typo, graffiti style.

    In the database that iTunes accesses to fill in song titles, someone has rendered the song title as 點解要大家笠, which I think would be the wrong radical for the last character based on the SCMP article, but I'm seeing it used on multiple search results so that's perhaps what they're actually using. And the English title of the song was Bring Your Own Bag.

    That said, I stand corrected. The graffiti art version of the back cover, which turns up in a Google images search for 廣播道軟硬殺人事件 does show the 笠 character "spray painted" in the results.

  37. stephen l said,

    January 23, 2019 @ 4:48 am

    I also found this explicit remark (one of several) against a composition-based encoding for CJK characters to be interesting

    https://www.unicode.org/faq/han_cjk.html

    `Finally, East Asian governments, while aware of the compositional nature of the script, do not wish to actively encourage the coining of new forms because of the practical problems they create. In particular, new coinages are rarely an aid to communication, since they have no obvious inherent meaning or pronunciation. They are little more than dingbats littering otherwise intelligible text.`

  38. stephen L said,

    January 23, 2019 @ 4:49 am

    I also found this explicit remark (one of several) against a composition-based encoding for CJK characters to be interesting

    https://www.unicode.org/faq/han_cjk.html

    "Finally, East Asian governments, while aware of the compositional nature of the script, do not wish to actively encourage the coining of new forms because of the practical problems they create. In particular, new coinages are rarely an aid to communication, since they have no obvious inherent meaning or pronunciation. They are little more than dingbats littering otherwise intelligible text."

  39. Victor Mair said,

    January 23, 2019 @ 6:27 am

    It wouldn't make any sense to assign Unicode points to Xu Bing's invented characters, because nobody knows how to pronounce them and nobody knows what they mean. They are completely imaginary, though they look like real characters because their components all exist in real characters. There are a seemingly infinite number of such characters, certainly enough to break the upper limit of possible Unicode points, as the system exists now (1,114,112 possible characters).
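
    (For reference, that ceiling follows directly from the design of the code space: seventeen planes of 2^16 code points each. A small sketch of the arithmetic:)

        # Where the 1,114,112 ceiling comes from.
        planes = 17                      # Basic Multilingual Plane + 16 supplementary planes
        code_points = planes * 2**16
        assert code_points == 1_114_112
        # 2,048 of these are reserved as UTF-16 surrogates, so the number
        # usable as actual Unicode scalar values is slightly smaller.
        assert code_points - 2_048 == 1_112_064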

  40. Antonio L. Banderas said,

    January 23, 2019 @ 7:51 am

    @loonquawl

    The following article might be clarifying:
    Stroke systems in Chinese characters: A systemic functional perspective on simplified regular script (by Xuanwei Peng)

  41. Theophylact said,

    January 23, 2019 @ 9:56 am

    Victor Mair: Go argue with Geoff Pullum over whether an apostrophe is a punctuation mark.

    The apostrophe is not a punctuation mark. It doesn’t punctuate. Punctuation marks are placed between units (sentences, clauses, phrases, words, morphemes) to signal structure, boundaries, or pauses. The apostrophe appears within words. It’s a 27th letter of the alphabet.

  42. Daniel N. said,

    January 23, 2019 @ 10:11 am

    I have a very simple, maybe stupid question. What is the official reference for Chinese characters? I mean, there should be a definite list of them somewhere so that when you find some strange character in a book, you have somewhere to look into. Is it simply a big dictionary?

  43. Chris Button said,

    January 23, 2019 @ 9:31 pm

    The apostrophe is not a punctuation mark. It doesn’t punctuate. Punctuation marks are placed between units (sentences, clauses, phrases, words, morphemes) to signal structure, boundaries, or pauses. The apostrophe appears within words. It’s a 27th letter of the alphabet.

    I agree up until the last sentence. In my opinion, a symbol denoting elision of a vowel sound does not qualify it to be considered as a "letter". By the same logic, could we not then argue the same case for the virama in many Brahmi derived orthographies whose function is to suppress the inherent vowel sound of a consonant? (Granted the virama is operating with the written word as its base, while the apostrophe is working with the spoken word as its base). Conversely, how about the tilde that originated as a superscript "n" that now denotes a nasal vowel in modern Portuguese, or the auk myit that originated as a subscript glottal consonant that now denotes the "creaky tone" in modern Burmese? By a similar logic, shouldn't these "diacritics" then be considered "letters"?

  44. Philip Spaelti said,

    January 24, 2019 @ 12:20 am

    @ohwilleke: The characters may look like they are composed of a small set of parts, but in fact, many are not, and there is no simple inventory of parts. This is especially true of simple characters with only a handful of strokes. These are often just random arrangements that defy decomposition.

    @Daniel N.: There is of course no definitive reference to ALL of the characters, and there has never been such a thing. This is the reason why there is a problem to begin with. Reference works for the characters are a fairly recent thing. (There was no way to order them, and as a result, no way to look them up.) In ancient times a scribe just had to know them all. When writing, if he couldn't remember a character, he would have been forced to think of a text that might contain the character and then look for it there. More likely he just wrote whatever he thought was right. This generates mistakes. But due to the veneration of script, the next person who reads this thinks the mistake is correct, and this just proliferates variants, leading to the mess we have today.

  45. Bathrobe said,

    January 24, 2019 @ 2:40 am

    Historically, the Unicode consortium used to be much more restrictive in the question what should and what should not get a code point of their own. For example, in the first editions, each Arabic letter only got a single code point, the argument being that font technology should care about the distinct shapes needed in conjoined writing; that was soon found to be inadequate

    The encoding model you describe as impractical for Arabic (one character with context dependent shapes) is the current one, as you can read in the 1st paragraph of section 9.2 of the current standard

    That is also the model adopted for the Traditional Mongolian Script. Unfortunately it's not always possible for font technology to predict the correct shape of letters in this script. Several special markers (separate code points which are not visible in the final text) have been introduced to ensure the correct form is chosen in an otherwise ambiguous environment. For foreign words, which feature special letter forms, it's not always possible to remember which special code will produce the desired result.

    What is more, producing the correct form of inflectional endings (which are written as idiosyncratic separate forms) also requires an act of interpretation — the user types them in using ordinary Unicode letters and certain conventions (such as the use of non-breaking space) are used to tell the 'font technology' to render them correctly. Apple in particular manages to break the representation of Traditional Mongolian script every time it updates its operating system and is very slow to fix it.

    It's a pity that Unicode didn't follow Menksoft in Inner Mongolia, which used separate proprietary code points for different letter combinations.

  46. Philip Taylor said,

    January 24, 2019 @ 6:35 am

    I often wonder whether, if the Unicode Consortium were to start all over again, would it still come up with the same underlying model or would it come up with something significantly different ? I cannot help but feel that Unicode carries so much legacy baggage that it really would be worth interested parties starting from scratch (but obviously with an awareness of the strengths and weaknesses of Unicode-as-we-know-it-today) and coming up with a proposal for (say) "Omnicode", fit-for-purpose for the 21st century.

  47. Antonio L. Banderas said,

    January 24, 2019 @ 7:58 am

    @Philip Taylor

    Any academic literature regarding such "Omnicode"?

  48. Philip Taylor said,

    January 24, 2019 @ 8:30 am

    No, it is just a name which might be used if one day a replacement for Unicode were proposed.

  49. Antonio L. Banderas said,

    January 24, 2019 @ 9:35 am

    @Philip Taylor

    I meant about such topic: its strengths and weaknesses as well as proposals for a new system

  50. Trogluddite said,

    January 24, 2019 @ 9:42 am

    @loonquawl
    There is no technical reason that the number of bytes stored per symbol could not be increased, besides the problem of keeping everyone's systems up to date with whatever the current standard is. However, Unicode is used for much more than just describing what symbols a reader should see. Unless we resort to far more complex computer algorithms than are currently typical, any operation which involves comparing one string of symbols with another requires that there is an agreed-upon pattern of bytes for each symbol which can unambiguously identify it (a bare minimum requirement is that the size of each 'blob' can be determined, so that it is possible to skip over an unrecognised symbol.)

    One of the aims of Unicode is to abstract a symbol's identity from its appearance, such that it's easy to display the same content in many different typefaces and to reliably make comparisons. It might be argued that it was misguided to expect this abstraction to be (easily) applicable to all writing systems. However, even if we disregard this factor, the need to decide upon a consistent byte pattern for each symbol is orthogonal to deciding which features of a symbol those bytes describe.
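
    (In UTF-8, for instance, that requirement is met by the lead byte alone, which announces how many bytes the symbol occupies, so software can step over a symbol it cannot display. A minimal sketch, assuming well-formed input; the sample string is just an illustration:)

        # The first byte of each UTF-8-encoded character tells you its length.
        def utf8_length(lead_byte: int) -> int:
            if lead_byte < 0x80:      # 0xxxxxxx -> 1-byte (ASCII)
                return 1
            if lead_byte >= 0xF0:     # 11110xxx -> 4-byte sequence
                return 4
            if lead_byte >= 0xE0:     # 1110xxxx -> 3-byte sequence
                return 3
            return 2                  # 110xxxxx -> 2-byte sequence

        data = "a圀🙂".encode("utf-8")
        i = 0
        while i < len(data):
            n = utf8_length(data[i])
            print(data[i:i + n].decode("utf-8"), n, "bytes")
            i += n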

  51. Philip Taylor said,

    January 24, 2019 @ 10:36 am

    Trogluddite ("One of the aims of Unicode is to abstract a symbol's identity from its appearance") — This is one aspect that concerns me (and forms part of my thinking about a possible "Omnicode"). If you accept that a symbol's identity can be abstracted from its appearance, how do you know that two symbols are, in fact, the same ? Take, for example, English "d" and Vietnamese "d" — their appearance is the same (regardless of font) but their sound is very different. Why, then, do we say they are the same character, if not because their appearance is identical ? What makes two characters "the same" ?

  52. Chris Button said,

    January 24, 2019 @ 12:04 pm

    … in general there are words that behave enough "like nouns" to be capitalised too, and making that determination is pretty involved. So learning what a noun is is not quite enough.

    I'd imagine that must be quite demanding for the writer. Even just the slight differences in capitalization rules between languages like English and French can be confusing at times. Having said that, are there any studies on whether it is relatively easier to read a text in German (with nominal capitalization) as a native speaker than in Dutch (usually without) as a native speaker? I've always found it interesting how English spelling (and indeed Chinese for that matter) makes it harder on the writer but easier on the reader (once proficient of course) than say something like Spanish where the inverse is the case.

  53. Bathrobe said,

    January 25, 2019 @ 1:55 am

    When did English abandon the practice of writing nouns with capital letters?

  54. Bathrobe said,

    January 25, 2019 @ 2:00 am

    An interesting thread about this at Stack Exchange:

    Capitalisation of nouns in English in the 17th and 18th centuries

  55. loonquawl said,

    January 25, 2019 @ 2:42 am

    @Antonio L. Banderas – the only online version of the suggested paper I can find is behind a 20€ paywall – do you have a less pricey alternative?

  56. Philip Taylor said,

    January 25, 2019 @ 4:30 am

    Loonquawl — Not complete, but the Google books version is better than just an abstract : https://books.google.co.uk/books?id=DPN_DQAAQBAJ&pg=PT75&lpg=PT75&dq=Stroke+systems+in+Chinese+characters:+A+systemic+functional+perspective+on+simplified+regular+script+%28by+Xuanwei+Peng%29&source=bl&ots=9MvjOGlQ4q&sig=ACfU3U2API1Qztq5YaMS0HRFjxMYwpIueQ&hl=en&sa=X&ved=2ahUKEwjijPSzz4jgAhVMRBUIHaHiA-oQ6AEwA3oECAUQAQ#v=onepage&q=Stroke%20systems%20in%20Chinese%20characters%3A%20A%20systemic%20functional%20perspective%20on%20simplified%20regular%20script%20%28by%20Xuanwei%20Peng%29&f=false

  57. Antonio L. Banderas said,

    January 25, 2019 @ 4:38 am

    @Loonquawl

    Try requesting its author a previous draft, here's their email:
    201610003@oamail.gdufs.edu.cn

  58. Chris Button said,

    January 25, 2019 @ 12:19 pm

    When did English abandon the practice of writing nouns with capital letters?

    I was actually just reading the translator's introduction to a self-published English version of Shirakawa Shizuka's** 常用字解 (not a small undertaking by any means). The translator is German and there is a brief note in it where he laments the lack of capitalization of all nouns in English.

    ** Incidentally, Shirakawa Shizuka was an interesting figure in Kanji studies. He seems to have become a national icon of sorts in his later years, but his approach was highly controversial (heavily based on pictographic religious interpretations!) and naturally criticized as a result. The translator's introduction talks a little about Shirakawa's clash with Tōdō Akiyasu (mentioned above regarding his brilliant, but inherently subjective, 漢字語源辞典), as recorded in their respective articles in the journal 文学 38.7 (1970) and 38.9 (1970). This clash between Shirakawa and Tōdō is very reminiscent of the debate between Herlee G. Creel and Peter A. Boodberg in the United States a few decades earlier. Nonetheless, in spite of Shirakawa's conclusions about what the characters originally represented often being very hard to take seriously, his deep understanding of the composition of the characters from their earliest forms through to their modern ones is incredibly useful and should not be taken lightly.

  59. Victor Mair said,

    January 25, 2019 @ 4:03 pm

    I own the books of Shirakawa and Tōdō and have used both to great advantage, though am cautious in accepting all the details of their explanations.

  60. Victor Mair said,

    January 25, 2019 @ 4:04 pm

    Category: Scripts not encoded in Unicode

    https://en.wikipedia.org/wiki/Category:Scripts_not_encoded_in_Unicode

    This category includes the seal script, which would require the addition of around 10,000 or more items.

    https://en.wikipedia.org/wiki/Seal_script

    And then there are bronze forms and oracle bone forms and other forms of the Sinographs….

    Many tens of thousands of unstandardized forms….

  61. Antonio L. Banderas said,

    January 25, 2019 @ 6:47 pm

    Should a script such as the "Rongorongo" be added to Unicode?
    http://scriptsource.org/cms/scripts/page.php?item_id=entry_detail&uid=zntg7a8uub

  62. January First-of-May said,

    January 25, 2019 @ 9:42 pm

    I actually don't know when the last new character was created. I guess its something I will have to google when I have time.

    Pretty recently (within the last few years) would be my guess – if nothing else then because each newly named chemical element gets a new character in Chinese, so whatever the Chinese character for "oganesson" is has to have been added pretty darn recently.

    (May 2017, apparently. And, weirdly enough, by now it had already entered Unicode – U+9FEB, 鿫. There might well have been non-element-related Chinese characters that were created even later than that, however.)
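
    (For anyone who wants to check such cases, a tiny sketch; it assumes a Python build whose bundled Unicode data already includes U+9FEB, since older builds will simply report it as unnamed:)

        import unicodedata

        print(unicodedata.unidata_version)   # Unicode version bundled with this Python
        ch = "\u9FEB"                        # 鿫, the character coined for oganesson
        print(f"U+{ord(ch):04X}", unicodedata.name(ch, "(not in this Unicode version)"))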

  63. shubert said,

    January 29, 2019 @ 12:14 pm

    @WSM
    "I think the world's most reduced writing system is binary,"
    Yes, generally speaking. But there is a 3rd, neutral one.

  64. GH said,

    January 31, 2019 @ 5:18 am

    @Victor Mair:

    No, users of the whole system bear the cost and inconvenience.

    I would contend that the cost to the "system" is not very significant (it doesn't very much matter if Unicode assigns 10,000 or 100,000 code points to Chinese), and it is anyway just one example of how the richness of culture and human nature tends to defy simple classification systems.

    The Unicode standard has a rule that says it "does not encode idiosyncratic, personal, novel, rarely exchanged, or private-use characters" (sometimes referred to as the "no-Prince rule"), but obviously a sharp distinction cannot be made in principle, and many emojis, for example, would seem to violate a number of these conditions.

    Is it "worth" preserving smaller local languages, or should all Chinese just speak Mandarin?

    You can find no greater advocate of the topolects than Victor H. Mair, so please do not distort what I have written.

    That was rather my point.

  65. Victor Mair said,

    February 2, 2019 @ 7:31 pm

    "I would contend that the cost to the 'system' is not very significant…".

    I was clearly talking about the cost and inconvenience to users of the system, i.e., human beings, not to the system itself.
