Sinographic inputting: "it's nothing" — not


Last week in our Dunhuangology seminar, a student wanted to type "wǔ 武" ("martial; military") into the chat box, but instead out popped "nián 年" ("year").  I immediately said to her, "I'll bet you were using a shape-based inputting system", which left her a bit surprised.

Ever since information technologists began to wrestle with the problem of inputting, ordering, and retrieving Chinese characters in computers during the 70s, I have been intensely interested in the theoretical and practical obstacles they faced.  To better understand the overall situation with regard to characters in computers, I organized an international conference at Penn in 1990 on the computerization of Chinese characters that resulted in Victor H. Mair and Yongquan Liu, eds., Characters and Computers (Amsterdam, Oxford, Washington, Tokyo:  IOS, 1991).

Indeed, one may say that during the 80s, 90s, 00s, and up to today, I have been preoccupied with Sinographic inputting, saturated to the point that I am familiar with the names and natures of countless methods that have been proposed.  So, when I asked the student what method she was using, and she said the Taiwanese word "Boshiamy", which means "it's nothing", at first I thought that maybe she was just joshing me.  But then I had to confess that, though I was vaguely familiar with the name, I didn't have a clear conception of how it actually worked.  I suspected though, with a name like that (implying "it's no big deal", "it's not hard / complicated"), that it must be based on strokes or components — all of which are really quite complex and difficult — but which their devisers try to convince themselves and their prospective users are simple and straightforward.  Not!

Chinese character inputting systems, which by now run into the thousands, can be generally broken down into the following types:

1. shape-based; configural

    a. components; elements

    b. strokes

2. mixed; hybrid — combining sound and shape

    a. predominantly shape based

    b. predominantly sound based

3. phonetic

    a. syllabic

    b. lexical

Diachronically, these systems have generally developed in the order listed above, with phonetic inputting systems gradually displacing configural systems, to the point that now more than 90% of Sinographic inputting is done via sound rather than shape.  Nonetheless, die-hard hanzi / kanji / hanja purists — not wanting their beloved characters to be contaminated by alphabets and other phonetic writing systems — continue to cling to shape-based systems and even invent new ("easier, simpler") ones from time to time.

To take a closer look at the Taiwanese expression "bô-siánn-mi̍h", it is customarily written in Sinographs as 无啥物 and the Mandarin Sinographic transcription of the Taiwanese sounds is fǔxiāmǐ 呒虾米, which looks like it means "stunned; stupefied; to not have; to be without" + "peeled, dried sea shrimp; small shrimp; dried, shelled shrimps", but it has nothing whatsoever to do with stunned, stupefied dried, peeled shrimp or not having dried, peeled shrimp.  That's just the Mandarin transcription of the Taiwanese sounds of the word — so it is claimed.

Here is a website introducing the pronunciation and meaning of the Taiwanese expression (at bottom left, with recording).

Although the Boshiamy method may sound obscure and arcane, it must be used by a considerable number of people in / from Taiwan, including the student in my Dunhuangology seminar.  There's even a Wikipedia article for the Boshiamy method, for which see here.  Click on the link and see the color coded chart at the top right to get a sense of how the method works.  Daunting.  "Boshiamy uses about 300 radicals represented by 26 letters to build characters. Radicals are mapped to letters by their shapes, sounds or meanings" — also "variants" and "others".  "It's nothing something."

By the way, the reason for my rapid response to the typing error reported in the first paragraph of this post is simple:  "wǔ 武" ("martial; military") and "nián 年" ("year") have minimal phonological resemblance but perceptible graphic resemblance.

 


[Thanks to Debby Chih-Yen Huang]

Update [3/12/21] from Kirinputra:  The customary way to write the phrase "bô siáⁿ-mih" (~BOSHIAMY) in Hanji is actually 無省乜 (or 无 or 旡); "無啥物" is a "creationist" creation roughly dating back to the 1970s or the 80s.



18 Comments

  1. Antonio L. Banderas said,

    February 22, 2021 @ 7:50 am

    I did some research years ago.

    Six-Digit Stroke-based Chinese Input Method https://mega.nz/file/YZhn0C5C#4whOar6rM2tRGCWJ-cyKmVnAY5JFRzJaCSqT7c-9_bw

    Chinese input for the blind https://mega.nz/file/NEgDUQiD#QGItiz3OnMWFLrbr1kJmdQNg2x3PQpDe6ScJVwEXKzs

    Stroke systems in Chinese characters https://mega.nz/file/cAoBkY6S#zMaY5vBm3HRhbCRl01eVaMXpK-IasTjE0XxeOU_k0d8

    YES STROKE ORDER https://mega.nz/file/oMo2XLgJ#eTqscaaRLNX8G_pZkQR3IcjH8xUpIY8EB_fCTLZKDv0

  2. AntC said,

    February 22, 2021 @ 5:08 pm

    Every smartphone in the Sinosphere seems to have stroke-based input. Are they all using the same system? (on Apple vs Samsung vs etc.) And people seem to use the systems by entering the first few strokes then hunting through a tree of options to find the character.

    Victor's many posts on character amnesia, and particularly stroke-order amnesia seem to apply here: I've seen people repeatedly try for a character with what seems to be different orders of the strokes.

    The two characters discussed don't look similar/easily confusable to my Latin-script eyes. Is it that the first few strokes in order are the same?

  3. ~flow said,

    February 23, 2021 @ 5:44 am

    re "die-hard hanzi / kanji / hanja purists — not wanting their beloved characters to be contaminated by alphabets and other phonetic writing systems — continue to cling to shape-based systems and even invent new ("easier, simpler") ones from time to time"

    I think that even without discussing the merits and demerits of sound- v. shape-based systems and their usefulness to a given purpose and demographic, and without doling out judgemental second-guesses, we can agree that shape-based search and input systems for sinographs are a valuable asset for anyone dealing with this script in electronic form. It is all too often that one sees a character and can clearly discern its shape, its components, its construction, yet is unable to tell what it means or what it sounds like and so on, be it because one has momentarily forgotten this specific grapheme (character amnesia, quite frequent) or because one has never learned or even encountered said character (a daily experience for learners, and a lifelong predicament for all practitioners). One does not have to be a die-hard purist to be in need of a shape-based input system, and I say this as someone who almost exclusively uses Pinyin and Romaji-based IMEs. What I totally agree with is that many popular shape-based IMEs such as Cangjie have a steep learning curve and require users to memorize very long lists of rules, exceptions, and arbitrary codes; in this respect they are probably somewhat better than the mechanical CJK typewriters and telegraph codes of yore, but not by very much: they do pay off when you specialize in the technique, but are almost unusable for the occasional user. Whether any significant number of skilled Cangjie or Boshiamy users can type faster than a majority of Pinyin IME users is doubtful (though totally possible).

    Given the present state of affairs, the comment that some people "*even* invent new ('easier, simpler') ones" (which appears to beg the reader to agree with the sentiment that since shape-based systems are naturally inferior, therefore, the invention of new ones is frivolous or maybe hilarious) is IMHO misleading. Rather, precisely because the available shape-based systems have so many unsatisfactory properties, therefore, innovation is needed.

    Having said this much, I'd be really interested in a deeper discussion of the inner workings of systems like Cangjie, Boshiamy and others. I am looking forward to reading the material kindly offered by Antonio L. Banderas and to reading more on the topic on LL!

  4. Victor Mair said,

    February 23, 2021 @ 6:08 am

    "I say this as someone who almost exclusively uses Pinyin and Romaji-based IMEs. What I totally agree with is that many popular shape-based IMEs such as Cangjie have a steep learning curve and require users to memorize very long lists of rules, exceptions, and arbitrary codes"

    Q.E.D.

    And there's a fundamental reason why that is so.

    You can be sure that Language Log will have many future discussions of shape-based vs. phonetic inputting systems for Sinographs.

  5. ~flow said,

    February 23, 2021 @ 10:23 am

    > And there's a fundamental reason why that is so.

    I guess that to be true, but let's not forget there's also handwriting input which is shape-based but in a totally different class than keyboard-based methods (not necessarily better or worse; though I have seen amazing stuff here (e.g. http://shapecatcher.com/ which covers a lot of Unicode minus CJK but is great for that elusive maths symbol you need)). I see the use case for shape-based methods more in cataloging (also on paper; no input here but of course you can use the codes to obtain ordered sections with ordered sequences of characters), systematic searches ("which characters contain this shape, the patterns" and so on), as a learning aid ("which characters look similar to this one") etc., besides of course being useful for those individuals who are ready to put in some serious training to learn thousands of (typically) four-stroke codes that they can hammer into the computer at a fast rate. Some people seem to be happy with that although it's not for me.

    Also worth an honorable mention is what I call the 札字五筆法 (not to be confused with the 五筆 IME) which puts each stroke into one of five classes (一丨丿丶乚), assigns them digits 1 to 5 and lets users search by the numbers (札 is 12345, hence the method's name; 光 is 243135, 人 is 34 etc). This is in no way an efficient *input* method for long texts but a very accessible way to locate otherwise difficult-to-find characters that requires very little prior technical knowledge of the user other than being reasonably good at handwriting the way you should have learned in school. I wish more dictionaries would provide auxiliary indexes using this or a similar method; all too often what you get is only huge lists of characters by stroke counts with no obvious internal order; even sub-grouping by the first stroke cuts you down from a list of a hundred characters to one of twenty.
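    The lookup described above can be sketched in a few lines of Python. This is only a minimal illustration, not any real dictionary's or IME's implementation: the names STROKE_DIGIT, INDEX, encode and lookup are invented for the example, the digit assignment is inferred from the three codes cited in the comment (札 = 12345, 光 = 243135, 人 = 34), and the index holds only those three characters.

```python
# Hypothetical sketch of a five-class stroke lookup.
# Digit assignment inferred from the cited example 札 = 12345:
# horizontal, vertical, left-falling, dot/right-falling, bend.
STROKE_DIGIT = {"一": "1", "丨": "2", "丿": "3", "丶": "4", "乚": "5"}


def encode(strokes):
    """Turn stroke-class symbols, given in writing order, into a digit code."""
    return "".join(STROKE_DIGIT[s] for s in strokes)


# Toy index containing only the characters and codes cited in the comment.
INDEX = {"札": "12345", "光": "243135", "人": "34"}


def lookup(prefix):
    """Return characters whose code starts with the typed digits,
    shortest code first, then in numeric-string order."""
    hits = [ch for ch, code in INDEX.items() if code.startswith(prefix)]
    return sorted(hits, key=lambda ch: (len(INDEX[ch]), INDEX[ch]))


if __name__ == "__main__":
    print(encode("丨丶丿一丿乚"))  # the six stroke classes of 光, in order
    print(lookup("24"))            # candidates after typing two digits
```

    Because matching is by prefix, a user who remembers only the first few strokes still gets a short candidate list, which is the "sub-grouping by the first stroke" benefit mentioned above.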

    So I see the future more in a multitude of methods both on paper and in the computer, methods that can be selected as one sees fit, ideally with some kind of cross-referencing. FWIW most of the time when I want to do a quick lookup I reach for the latest edition of the (Pinyin-based) 新華字典; with all its shortcomings it's still the fastest most of the time. I'm frequently on zdic.net and guoxuedashi.com, too.

    > future discussions of shape-based vs. phonetic inputting systems for Sinographs

    That's great to hear!

  6. Julian said,

    February 23, 2021 @ 3:24 pm

    'To assemble your IKEA wardrobe using the supplied Allen key just follow these 35 simple steps….'

  7. Victor Mair said,

    February 23, 2021 @ 4:51 pm

    From an eminent European Sinologist:

    Foreigners forgetting how to write difficult and rare Chinese characters by hand should not blame themselves. Ask any young Chinese or even university teacher, I guess, several of them shall not be able to write 韬光养晦 and perhaps will use Pinyin instead. With the modern media the 提笔忘字 (Sinographemic Amnesia) phenomenon has increased dramatically also among native speakers / writers. Ask any Japanese to write a difficult kanji-word, he will use Hiragana. That’s why since years I try to convince that Chinese language has only a chance to go abroad if “digraphia” 双文制 will be officially recognized and introduced. But today’s nationalists are far from accepting this!

  8. Noel Hunt said,

    February 24, 2021 @ 12:34 am

    `…that it must be based on strokes or components — all of which are really quite complex and difficult…' Isn't it surprising how the Chinese and Japanese have managed to cope with this unbelievably, staggeringly complex system, for at least a millenium in the case of the Japanese and much longer for the Chinese?

  9. Victor Mair said,

    February 24, 2021 @ 7:11 am

    "…have managed to cope with this unbelievably, staggeringly complex system…."

    Before the digital, electronic age of modern science and Global English, with consequent, rampant Sinographemic Amnesia….

  10. Chris Button said,

    February 24, 2021 @ 7:32 am

    Since there are 30-odd strokes in Chinese characters, I always wondered why keyboards weren't laid out with those on the keys. It's a good number for an "alphabet". I think that would be easier than those versions where multiple strokes are conflated as one of 5 kinds of strokes (as common on mobile phones). Predictive input can bring up options to speed up typing of characters with a high stroke count.

  11. ~flow said,

    February 24, 2021 @ 1:56 pm

    @Chris Button—"there are 30-odd strokes in Chinese characters, I always wondered why keyboards weren't laid out with those on the keys"

    Part of the answer seems to be that when you talk about 20 to 40 types of strokes, you're talking about things that only the academic calligrapher can discern with ease; the mass of script users is blissfully unaware of the finer distinctions. This is no different in Latin script; an analogous task would be to ask laymen about the different configuration of serifs (ugh? what are serifs?) you see in a lower case u compared to an n or a t or an l, or maybe ask a non-typographic person to name all the parts of printed type (cf. https://paigerayoblog.wordpress.com/2015/05/13/types-anatomy/). To the ordinary user, you have maybe horizontal, slant-left, slant-right, downward, slant-upward-right, dot, horizontal-vertical, vertical-horizontal and then some, that's it, far fewer than twenty, maybe a dozen.

    Even a dozen stroke types are somewhat hard to memorize in the absence of a unified system that is taught in schools and used in daily life. Personally I sometimes have to mumble letter sequences like "…P-Q-R-S-T…" in order to locate that R or S in a dictionary although the 26 letters' ABC was learnt ages ago and is steadily reinforced.

    And that's maybe the real miracle, the big question in Chinese character history: Sure, there have been texts (I think from Tang onwards, but I'm likely wrong) that enumerate all the strokes there are in Chinese writing. Problem is, all these lists differ from each other, and they do not distinguish between those stroke types that are 'universal' or 'fundamental' to sinographs and those that only apply to a certain style of writing. In that sense, the lists given by calligraphers have to be treated as open, i.e. there can always be more or fewer categories depending on how fine a distinction you want to allow between any two types (with 30 types, some of those distinctions are hair-splittingly fine even for the connoisseur) or what your style of writing is.

    The same problem, only worse, exists with character components: outside of Kangxi's 214 radicals, comparatively few components have been given unique names, or arranged in a standardized sequence. No definitive list of components has ever been fixed in the history of Chinese writing, and it is not an easy task to come up with a reasonable enumeration that is not incomplete or has some difficult corner cases: are 立, 日 components? Sure, although the first does contain 亠. 一 and 口? Sure, but these two combined give you 日. Is ⺶ a bona fide component? Well, most of the time it's just a variant of 羊. In the same vein, older dictionaries tended to treat 氵 as a mere variant of 水, 扌 as a variant of 手 and so on, whereas post-1960 radical systems from the PRC like to tease them apart.

    While we have archaeological and textual evidence that the Greek and Roman alphabets had (at least locally) definitive (i.e. closed list), fixed sequences already over 2000 years ago (see https://en.wikipedia.org/wiki/File:NAMA_Alphabet_grec.jpg and https://en.wikipedia.org/wiki/G#History), no such thing exists for Chinese writing.

    Maybe, if someone centuries ago had come up with a memorable enumeration of all the strokes there are in any Chinese character, if that enumeration had been made the foundation of elementary school writing education, maybe then we would be seeing Chinese keyboards with pictures of strokes (plus, presumably, a small number of high frequency composed forms like 口, 幺, 乂) printed on them. But that never happened, which I find to be the big, as yet unanswered question.

  12. ~flow said,

    February 24, 2021 @ 2:18 pm

    I should have added some proof to the claim that some valid calligraphic subtleties are hairsplittingly fine for the lay person, so here it comes: 「九成宮醴泉銘」の基本筆画 http://www.tonan.jp/moji/02hikkaku/index.html

    Admittedly, this system has 64(!) classes so is twice the proposed size of 30. But observe there are left-slanting strokes in the horizontals class (1st column from the left) as well as left-slanting strokes that form their own grouping (4th col.), while other systems throw all of those together.

    (PS this image appears on the cover of 漢字の骨組 by Ōkuma Hajime. Highly recommended, as is his upcoming work, 字体変遷字典, which you can watch taking shape over at http://tonan.seesaa.net/article/480097667.html)

  13. Victor Mair said,

    February 24, 2021 @ 2:38 pm

    Excellent discussion begun by Chris Button and followed up by ~flow.

    Now we're talking turkey, and it gets to the heart of the conceptual nature of the Sinographemic writing system and its relationship to the Sinitic languages.

  14. Antonio L. Banderas said,

    February 24, 2021 @ 3:22 pm

    Stroke order rules: https://i.imgur.com/g17Rv91.png

    The Writing Order of Modern Chinese Character Components https://mega.nz/file/UUgwGQwJ#MhsN5M-x0wUZPWzytyPUXYUjsBd8enAL6oANfssSfqs

    Encyclopedia of Chinese Language and Linguistics: Stroke Order https://mega.nz/file/Edx2XASY#z2172aWe7vO6Y6iI1E8qnbSAALKhQzQIi9qGhnVLycA

    Stroke order learning https://mega.nz/file/ZdxgkSCB#TbEllJO63OzfDPVJ6Hw92YY_wXHc1j59vVXY_F0D2cM

  15. AntC said,

    February 24, 2021 @ 4:43 pm

    @Noel Isn't it surprising how the Chinese and Japanese have managed to cope with this unbelievably, staggeringly complex system, for at least a millenium in the case of the Japanese and much longer for the Chinese?

    (To nit-pick: you mis-spelled millennium, so you're not 'coping' with the less-complex system of English orthography.)

    Wikipedia on 'Education in China': "In 1949 … illiteracy … was more than 80% of the population". So throughout the period of developing Chinese script, it was the preserve of a scholar class; hardly anyone 'managed to cope'.

    Japan historically seems to have had rather better levels of literacy, but perhaps that's because you can 'get by' without so much of a feat of memory.

  16. Laichar said,

    February 24, 2021 @ 7:45 pm

    As a native speaker of Cantonese and a Cangjie user for almost three years now, I find that the claim that using Cangjie would ease character amnesia is not really true, as Cangjie selects certain parts of a character to be inputted, often contrary to stroke order, although encountering arbitrary exceptions isn't as common. It is possible that one can only remember the significant parts of a character for input in Cangjie but not the other parts and how to connect them.

  17. ~flow said,

    February 25, 2021 @ 9:43 am

    > Sinographemic writing system and its relationship to the Sinitic languages

    The obvious follow-up question: which languages were or are successfully written with CJK characters? Mandarin of course, and then a continuum of Northern Guanhua … Literary Sinitic. Japanese has evolved into a very complex system which is alive and kicking whereas similar orthographies have since fallen out of use in Vietnam and Korea. Taiwanese / Hokkien comes to mind; what's the state of affairs in Old Zhuang / Sawndip? Cantonese is being written except for the stuff that cannot be readily written, I guess?

  18. Antonio L. Banderas said,

    March 3, 2021 @ 2:46 am

    Unihan Database Property “kStrange” https://unicode.org/notes/tn43/tn43-1.pdf
