The unpredictability of Chinese character formation and pronunciation


Judging from many comments on this post, "Annals of airport Chinglish, part 3", there is both tremendous interest in and massive confusion about how Chinese characters are constructed.

Jeremy Goldkorn sent me this clever complaint about the characters from Weibo (China's imitation of Twitter) which is circulating widely on the web; it seems to be relevant to our present discussion:

终于会读了,泪奔 三个土念垚(yáo)三个牛念犇(bēn)三个手念掱(pá)三个田念畾(lěi)三个马念骉(biāo)三个羊念羴(shān)三个犬念猋(biāo)三个鹿念麤(cū)三个鱼念鱻(xiān)三个贝念赑(bì)三个毛念毳(cuì)三个车念轟(hōng)不会读的转!


I'll provide a rough translation:

With tears streaming down, I've finally learned how to read:

three tǔ 土 ("earth") are pronounced yáo 垚 ("high, lofty")

three niú 牛 ("bovine") are pronounced bēn 犇 ("rush")

three shǒu 手 ("hand") are pronounced pá 掱 ("pickpocket"); the same morpheme may also much more easily be written as 扒, although the latter character also has many other meanings under two different pronunciations, bā: "hold on to; cling to; rake; dig up; push lightly; strip / take off; peel", and pá "rake up; gather together; stew, braise"

three tián 田 ("field") are pronounced lěi 畾 ("fields divided by dikes")

three mǎ 马 ("horse") are pronounced biāo 骉 ("the aspect / appearance of a galloping herd of horses"); a Chinese woman who had this character as her surname was forbidden to use it because it was not found in standard fonts

three yáng 羊 ("sheep-goat; ovicaprid") are pronounced shān 羴 ("the rank smell of ovicaprids / mutton; a flock / herd of ovicaprids")

three quǎn 犬 ("canine, dog") are pronounced biāo 猋 ("appearance / aspect of dogs running; swift; whirlwind"); the last meaning is usually written with the "wind" radical on either the left or the right side, the wind radical may, of course, be either simplified 风 or traditional 風, and the three dogs may be replaced by three fires (huǒ 火) yet retain the same meaning of "whirlwind", and so forth and so on

three lù 鹿 ("deer") are pronounced cū 麤 ("coarse; crude"), and the same morpheme is written a number of different ways, e.g., 粗, 麄, 眯, etc.

three yú 鱼 ("fish") are pronounced xiān 鱻 ("fresh; new; delicious; rare, few"), another way of writing xiān 鲜 ("fresh", etc.)

three bèi 贝 ("cowry; shell[fish]; valuable; conch") are pronounced bì 赑 ("straining hard; a legendary animal like a tortoise [the bases of many heavy steles in premodern times were often carved in the presumed shape of a bì 赑]")

three máo 毛 ("hair; fur; feather; wool") are pronounced cuì 毳 ("fine animal hair or feathers")

three chē 车 ("cart; car; chariot; vehicle") are pronounced hōng 轟 ("boom; bang; rumble; noise of an explosion")

If there are characters you don't know how to read, forward them [on Weibo].

Most of these characters are of relatively low frequency and, except for a few of them, neither their meanings nor their pronunciations are known by persons of average literacy.

Many more such characters consisting of two, three, or four repetitions of the same character exist, and their sounds and meanings are in most cases equally or more opaque.

Since it is the year of the dragon, let us examine a couple of characters that consist solely of repetitions of the character for dragon, lóng (simplified 龙, traditional / complicated 龍).  To show the full complement of strokes, I will use only the traditional forms:

龍龍 [that is meant to be one character consisting of 32 strokes] ("the appearance of a flying dragon; a pair of dragons")

zhé

龍龍
龍龍 [that is meant to be one character consisting of 64 strokes] ("garrulous; verbose; talkative")

All of the characters referred to above are real (neither I nor anyone else now alive made them up).

The ultimate sendup of Chinese character formation is Xu Bing's famous Tiānshū 天书 (A Book from the Sky), which consists entirely of characters that look like real characters, but are in fact all fake.  When A Book from the Sky was first exhibited in Beijing in 1988, it caused enormous consternation, because those who came to view it felt that the characters were familiar, but no matter how hard they strained, they could not make sound or sense of a single character in the entire lot.  Sounds and meanings could arbitrarily or imaginatively be assigned to each and every one of Xu Bing's 4,000 characters from the sky.  All of the strokes and all of the components are "legal" in the sense that they occur in officially authorized characters, but they have been combined in "illegal" ways.  That is to say, they don't add up to any characters that occur in historical texts or dictionaries.  Once they realized that they had been "had", conservative viewers were outraged because they thought that Xu Bing was making fun / light of them and their revered writing system.  It wasn't long before the exhibition closed and Xu fled to the United States in the aftermath of the Tiananmen Square Massacre.

I have met Xu Bing several times, e.g., once in his studio in New York and once at a lecture in Hong Kong, and I've gone to three or four of his exhibitions in the United States and have read his autobiographical and theoretical / critical writings (I included his substantial "The Living Word" [translated by Ann L. Huss and Victor H. Mair] in the Hawai'i Reader in Traditional Chinese Culture).  Yet I have not been able to determine precisely what his intentions were in creating A Book from the Sky (though I certainly have my theories about what prompted him to spend so many years of exacting labor to produce such a monumental work of completely impenetrable "literary" art).  To tell the truth, I do not think that Xu Bing himself knows exactly why he felt driven to produce this mind-boggling / jarring multivolume book that makes no sense whatsoever.

Lest learners and lovers of the Chinese script feel as though they have been cast adrift after reading this blog, I want to reassure them that approximately 85% of all Chinese characters do give some hints about how they are to be pronounced and / or what they mean, but these are vague and imprecise hints only.  For instance, it is easy for me to think of two dozen characters that include fāng 方 ("place; region; square; regular; upright; honest; side, party; easy; rule; means; comparison; method, way; prescription; only when; then; just, still") as a phonophore having the following pronunciations:  fāng, fáng, fǎng, fàng, páng.  In most of these cases, the basic meaning of fāng 方 has no perceivable bearing on the meaning of the character, but is being used strictly for its sound, which — although spread across all four tones and a fifth related pronunciation — is actually more regular than many other phonophores.  I can also easily think of two dozen other characters in which fāng 方 is the radical (Kangxi no. 70).  In these cases, fāng 方 occasionally has vague semantic significance (though it is usually so hidden as to be essentially useless for figuring out the actual meaning of a character in which it appears), and often it is only considered the radical for the purpose of looking up the character by the shape of fāng 方, without regard to its meaning.  I can, moreover, identify nearly another two dozen characters in which fāng 方, as incorporated in the derived phonophore páng 旁 ("side"), serves as the secondary phonophore, where 旁 has the following pronunciations pāng, páng, pǎng, bǎng, bàng.  In a couple of these characters where páng 旁 is the phonophore, one may with effort detect the secondary semantic notion of "side", but the overall meaning is more often than not vaguely related to the various radicals under which these characters fall.

In the final analysis, one must still rely on brute memorization to master the sounds and the meanings of the characters, though in some cases the radical may provide a slightly useful jog to the memory in recalling roughly what the character means.  Similarly, probably in over half the cases the phonophore may provide a somewhat useful, yet often dim, hint about the pronunciation of the character.  To conclude, we may say that, if one studies very, very hard, one can master upwards of three thousand out of the 80,000+ total characters.  If one does not apply oneself extremely diligently, tears will stream from one's eyes when faced with trying to remember the sounds and the meanings of characters like those in the lament at the beginning of this post (and they are relatively easy compared with monsters like 龠龜 [that is intended to be one character consisting of 35 strokes] and 齒簿 [that is intended to be one character consisting of 34 strokes]).

[Thanks to Zhao Lu for help in figuring out the meaning of the final character in the Weibo doggerel at the beginning of this post and to Maiheng Dietrich for making some sense of the context]



28 Comments

  1. Nick Lamb said,

    February 6, 2012 @ 7:00 am

    Isn't it surprising that none of the 4000 characters Xu Bing created were real? It seems as though if I decided (as an art project) to make up 4000 English words I would very likely screw up and accidentally invent one which already existed.

    And indeed I have just tried a more modest experiment and the first word I came up with was "stanging" which isn't in the concise English dictionary I have to hand but is attested on the web and in larger dictionaries apparently.

    I don't have a feel for the space Xu Bing was exploring. Are there millions of plausible but currently non-existent Chinese characters? Billions? More?

  2. Adrian said,

    February 6, 2012 @ 7:01 am

    Is Chinese very different from English in this respect? What Xu Bing did in Chinese, you or I could do in English, i.e. make a list of non-words:
    belk
    vope
    libbon
    lisher
    all of which would evoke some quasi-understanding in the reader. (It is quite hard to do though, because the brain keeps alighting on real words, and even when it doesn't, it might still be an obscure word you hadn't heard of.) If you're Lewis Carroll you can go one stage further and assemble them into poems.

  3. jo said,

    February 6, 2012 @ 7:25 am

    @Adrian

    I think there is something of a difference. When we try to invent fake English words, what we are doing is looking for ways of combining sounds that are 'legal' in terms of English phonetics/phonology but not used for actual words. So, for example, we wouldn't admit a word like "prtskvna" (apparently a Georgian word) because it breaks English's rules. This means that in practice the number of 'legal' English syllables (which can then form fake or real words) is finite and definable (by someone, not me…). Similarly, the number of possible Chinese syllables must be finite and definable. The number of 'legal' Chinese characters on the other hand, I think must be quite a bit higher (certainly higher than the number of possible Chinese syllables), and maybe it's not so easy to define its limits, although there are rules about how different character elements combine. Even so, I agree with Nick Lamb: if all 4000 characters Xu Bing created are plausible but completely unattested that seems like quite a feat.

  4. Philip Spaelti said,

    February 6, 2012 @ 8:34 am

    http://www.xubing.com/images/uploads/artforthepeople1.jpg

  5. Philip Spaelti said,

    February 6, 2012 @ 8:38 am

    The link posted above should give an idea of the kind of thing that Xu Bing does. If you look at the banner it looks like four Chinese characters. If you know Chinese however you can't really make sense of them. But if you know English, and you stare at them long enough, you should be able to make out that they spell "Art for the people".

  6. William Ockham said,

    February 6, 2012 @ 10:05 am

    What an awesome work of art. Now someone needs to put all those Chinese "characters" into a tattoo book.

  7. Jim said,

    February 6, 2012 @ 11:50 am

    "What an awesome work of art. Now someone needs to put all those Chinese "characters" into a tattoo book."

    Okay, that's fiendish.

  8. Mr Punch said,

    February 6, 2012 @ 11:58 am

    "[A] Chinese woman who had this character [biāo 骉] as her surname was forbidden to use it because it was not found in standard fonts." Is this the same character as in the name of the late Marshal Lin Biāo? Was it found in standard fonts before 1971?

  9. Victor Mair said,

    February 6, 2012 @ 12:57 pm

    @Mr Punch Totally different character, and Lin was the Marshal's surname: 林彪.

  10. Matt said,

    February 6, 2012 @ 1:53 pm

    Sounds like Xu Bing's book twas bryllyg, causing the readers to gyre and gymble in ye wabe.

  11. Ran Ari-Gur said,

    February 6, 2012 @ 3:07 pm

    @Jim: I dunno, if I had to have a Hanzi tattoo, I think I'd much rather have one that's genuine nonsense, from a famous work of art, than one that has a real meaning that I don't have a good sense of.

  12. John Swindle said,

    February 6, 2012 @ 3:16 pm

    "[A] Chinese woman who had this character [biāo 骉] as her surname was forbidden to use it because it was not found in standard fonts." Maybe she could change her surname to 不 (biāo). Or not, as the case may be.

  13. John said,

    February 6, 2012 @ 5:12 pm

    You missed out http://en.wiktionary.org/wiki/龘, which actually appears in most fonts.

    There's also http://zh.wiktionary.org/wiki/%F0%A0%94%BB and http://en.wiktionary.org/wiki/䨻 (four thunder (lei3)s).

    I wonder how you would use these characters in a sentence.

  14. Ian F. said,

    February 6, 2012 @ 5:25 pm

    @VictorMair, @MrPunch, @JohnSwindle – there may be an error in Professor Mair's post. The recent notable case of the Beijing woman forced to change her name was, I believe, that of Mǎ Chēng 马

  15. Ian F. said,

    February 6, 2012 @ 5:27 pm

    @VictorMair, @MrPunch, @JohnSwindle – there may be an error in Professor Mair's post. The recent notable case of the Beijing woman forced to change her name was, I believe, that of Mǎ Chēng (character link here: http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=299E2&useutf8=true – inputting that in my first post seems to have broken the comment software) and is altogether different than in the triangular form as in biāo 骉. She was required to change her name to the character's more common form Chēng 骋.
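[Editorial aside: a minimal sketch of why a character such as U+299E2 can trip up software. The code point is the one cited in the comment above. That the comment software handled only the Basic Multilingual Plane (BMP) is an assumption; nothing in the thread confirms the actual cause. The sketch merely shows that U+299E2 lies outside the BMP, so it needs four bytes in UTF-8 and two code units in UTF-16, whereas biāo 骉 from the post fits in one UTF-16 unit. The helper name `describe` is made up for illustration.]

```python
# U+299E2 is the code point cited in the comment above.
astral = chr(0x299E2)
# biāo 骉 from the post, used here only as a comparison character.
bmp = "骉"


def describe(ch: str) -> dict:
    """Return basic encoding facts for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        # Code points above U+FFFF lie outside the Basic Multilingual Plane.
        "outside_bmp": ord(ch) > 0xFFFF,
        "utf8_bytes": len(ch.encode("utf-8")),
        # Each UTF-16 code unit is two bytes; two units form a surrogate pair.
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }


print(describe(astral))
print(describe(bmp))
```

Software that stores at most three bytes per character, or that counts UTF-16 units as characters, may truncate or reject text at such a code point; whether that happened here is a guess.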

  16. Alan Shaw said,

    February 6, 2012 @ 6:18 pm

    The English words arranged to resemble characters are part of a different Xu Bing project "Square Word." You can get a look at actual parts of Tianshu at http://www.hanshan.com/specials/xubingts.html and http://www.quaritch.com/NewsItem.asp?id=117 – characters composed of real character components according to real compositional methods, but characters that reveal neither pronunciation nor meaning. (Actually a few of Xu Bing's Tianshu characters have been found to have existed at some point in history.)

    At a Xu Bing symposium at Princeton in 2005, Professor Perry Link
    presented an amusing analogy: a poem in which all the syllables
    are perfectly legal according to Mandarin formation rules, but none
    actually exist in Mandarin. The aural effect is of the meaning being
    just out of reach, or perhaps in "dialect". Here it is in GR and in pinyin:

    dua puen cheei tianq mhong fi doei
    fay shiin tzwa garng shoeng tenn kin
    pwai byong juoh chuey biabia seeng
    kenq moa chyai tiue tseei renq lhin

    Duā pūn chěi tiàng mōng fī duǐ
    Fài xǐn zuá gáng shuěng tèn kīn
    Puái bióng zhuò chuì biābiā sěng
    Kèng muǎ qiái tüē cěi rèng līn

    "Biābiā sěng, of course, I must have heard that before…"

  17. Victor Mair said,

    February 6, 2012 @ 7:26 pm

    @Ian F.
    This is very interesting. I actually wrote a whole blog about this problem: see "A Limitation on Names in the PRC". (http://languagelog.ldc.upenn.edu/nll/?p=1355) In it, I also mentioned Xu Bing's "Book from the Sky" and brought up the matter of biāo 驫. This latter character, in comparison with the 馬馬馬 [U+299E2] of Ma Cheng's name, demonstrates that placement of the same three components side by side results in a completely different character, with different pronunciation and meaning, than placing them in a triangular arrangement.

    There is also a good account of Ma Cheng's name problem on Danwei: http://www.danwei.org/video/crazy_horse_name.php

    It would seem that the source of the story I cited above in this post is a garbled version of the Ma Cheng story, perhaps the result of someone looking for chěng 馬馬馬 and, not finding it, settling for biāo 驫, and then further garbling the story by stating that biāo 驫 was her surname.

    A final curious fact about the tale of Ma Cheng's name is that, in the form that she was forced by the government to adopt, namely 馬騁, it is identical with the name of the head of the company responsible for the signal system that apparently led to the disastrous high speed railway crash at Wenzhou in July of last year. The railway executive, Ma Cheng 馬騁, though only in his 50s, suddenly died of a heart attack in August, during investigations into the cause of the crash.

    http://www.youtube.com/watch?v=FugiW30n4MU
    http://www.chinadaily.com.cn/china/2011-12/29/content_14346798_2.htm

  18. Stefan said,

    February 6, 2012 @ 10:00 pm

    When I studied Chinese in the Army during Vietnam, we used to say one tree was "mu", two trees was "lin", and three trees was "junglu."

  19. Stefan said,

    February 6, 2012 @ 10:05 pm

    (Or maybe "junglu" was four trees, I forget.)

  20. Chaon said,

    February 7, 2012 @ 1:19 am

    Three trees is "sen". Four trees is "quagmire".

  21. Victor Mair said,

    February 7, 2012 @ 7:52 am

    For "junglu", I suspect that Stefan means "jungle", but since he repeats the "junglu" spelling in two successive comments, perhaps he has something else in mind.

    As for the characters mentioned by Stefan and Chaon, they are:

    mù 木 ("wood; timber; tree")

    lín 林 ("woods; grove; forest")

    sēn 森 ("full of trees; dark; dense; thick; gloomy; multitudinous")

    Let us imagine that the following two combinations constitute single characters:

    木木木


    I do not know whether such characters actually exist, and they probably do not (up to this point), but there is no reason why somebody could not assign sounds and meanings to them and choose such characters for his / her name or for a line of poetry or for use in an advertisement or to write a particular morpheme in some topolect, etc. If they do not yet exist, these are the types of characters that Xu Bing would have written into his A Book from the Sky, although he usually chose much more interesting and challenging combinations of components for his non-characters than merely to repeat the same character several times. You can imagine how much research Xu Bing had to undertake in order to determine that all 4,000 of his imaginary characters did not actually exist at some point in the past before he decided to put them in his A Book from the Sky. Apparently he made up a few characters that actually already existed but were of such extremely low frequency that he failed to locate them in his efforts to ensure that his imaginary characters truly were non-characters.

  22. Weisse said,

    February 7, 2012 @ 7:53 am

    @Prof. Mair, Philip Spaelti, Tianshu is certainly an interesting experiment, but I personally find it less than original art. The picture at the first link posted by Philip reminds me of the amulets used in Taoist exorcism, 符咒 fu2 zhou4 amulet/talisman, also the "鬼画符 gui3 hua4 fu2 lit. ghost-draw-amulet", also used to describe all illegible writings. (See the examples here:
    http://www.google.fr/search?hl=fr&gs_is=1&cp=2&gs_id=d&xhr=t&q=%E7%AC%A6%E5%92%92&safe=active&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&biw=1229&bih=790&wrapid=tljp132861776425504&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi&ei=KhkxT8iuK8Ov0AG__9iKCA )

    An intentional 鬼画符 at school is traditionally considered to be a mockery of teaching, so you can imagine what the Chinese felt back in the 80's. However, today, 鬼画符 is almost nothing compared to 火星文.

  23. Weisse said,

    February 7, 2012 @ 7:57 am

    Some characters can be arranged vertically or horizontally without changing the meaning or the pronunciation, as for 峰 feng1, mountain http://www.chineseetymology.org/CharacterEtymology.aspx?characterInput=%E5%B3%B0

  24. Robert King said,

    February 8, 2012 @ 7:00 pm

    @adrian – as Nick Lamb said above you, this is hard to do in English. From your list, the following comes from the OED

    belk, v. Obs. and dial. form of belch v.

  25. Sakura Maichiru said,

    February 8, 2012 @ 8:08 pm

    Am I the only one who thought of the Voynich Manuscript when I read about Tianshu (A Book from the Sky)? XD I wonder if, a long time from now, the book will be rediscovered and boggle future archeologists and linguists….

    Boggling future archeologists is fun.

  26. Lugubert said,

    August 30, 2012 @ 1:03 pm

    One of my first thoughts prompted by the "invented character" thing was Zhao Yuanren's version of Jabberwocky.

  27. jk said,

    December 6, 2012 @ 8:30 pm

    those are perfectly cromulent characters

  28. flow said,

    June 22, 2014 @ 7:52 pm

    (2nd try with astral characters removed)

    @Nick Lamb i've compiled a list of the 9980 most used characters in Mandarin, Putonghua, Japanese and Korean, all analyzed into sequences of between one and six basic elements; i've found approximately 700 different elements (although for systematic purposes, i always treat repetitive elements as elements in their own right, so e.g. 竹, 賏 and 贔 are considered to consist of a single element; there may be fifty or so of these). now a simple calculation (700^1 + 700^2 + … + 700^6) shows there are 117817310443490700 (more than 117e15 or 117 petacharacters) characters altogether. dividing this by 10.000 gives about 117e11, or nearly twelve thousand billion characters.

    to make it short, for each of the 10.000 most common characters, there are more than ten thousand billion characters that could be formed with the same elements, keeping roughly the same amount of complexity.

    the figures would have to be modified by the consideration that (1) there are considerably more than 10.000 signs out there that could rightly be called 'chinese characters'; Unicode alone currently recognizes 75.000 characters, and the number is known to be significantly bigger; (2) with those new characters, new components turn up (random example: Unicode encodes and hundreds more); (3) there is no non-arbitrary way to slice and dice characters; good schemes should be well justified, but no scheme is without alternatives (e.g. i treat 鬲 as a single unit, although it is clearly composed out of 一, 口, 冂, 丷, and 丅); (4) in the modern script, there are quite a few elements that have a more or less rigidly prescribed position within a character: 氵 must appear to the left, 亠 to the top of another component (however, these rules are sometimes violated by rare characters, e.g. 㳲, which is not a possible combination in modern characters). but i don't think these factors would significantly change the above calculation.

    so yes, when you arbitrarily combine a few chinese character building-blocks, you have likely created a figure that no man that has ever lived or will ever live has ever seen or will ever see.
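[Editorial aside: the back-of-envelope count in the comment above can be reproduced with a short sketch. The figures of 700 elements, sequences of one to six elements, and 10,000 common characters are taken from that comment; the names `count_sequences`, `total`, and `per_common` are made up for illustration. Like the comment, it counts every ordered sequence of elements and ignores positional constraints, so it is an upper-bound estimate, not a census of plausible characters.]

```python
# Figures from flow's comment; treat them as rough assumptions.
ELEMENTS = 700            # approximate number of distinct basic elements
MAX_LENGTH = 6            # characters analyzed into 1 to 6 elements
COMMON_CHARACTERS = 10_000


def count_sequences(elements: int, max_length: int) -> int:
    """Count all ordered sequences of 1..max_length elements."""
    return sum(elements ** k for k in range(1, max_length + 1))


total = count_sequences(ELEMENTS, MAX_LENGTH)
per_common = total // COMMON_CHARACTERS

print(total)                 # 117817310443490700
print(f"{per_common:.3e}")   # 1.178e+13
```

On these assumptions there are roughly 1.2e13 possible element sequences for every one of the 10,000 most common characters.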
