A Sino-Japanese dictionary juxtaposed with the Four-Corner Method


[This is a guest post by Conal Boyce]

Here I’ll deal briefly with Halpern’s Kodansha Kanji Dictionary, then devote most of my attention to the Four‑Corner Method — not that I’m an advocate of the latter but its formal design and quirky byways (such as its Fifth‑Digit Kludge) require a good deal of time simply to be described, never mind assessed. An antiquarian pursuit? Given that translation apps now have a phone‑camera option for handling hànzì, and given that a Chinese Chip in the cranium seems imminent, a study of two dictionary look‑up methods might strike one as quaint. But there are lessons to be learned by studying such material as if from a System Analyst’s viewpoint. I hope this piece might have some appeal from that angle at least, if not from a nuts‑and‑bolts Chinese studies standpoint.

Preliminaries. First I must explain how it is that I’ve chosen to treat a kanji dictionary as if it were a hànzì dictionary. This works for the Kodansha specifically because Halpern provides both the Japanese and MSM pronunciation for each of his 5458 entry‑characters, plus variants of the characters themselves. Two examples: His entry for 体 tai has variant 體 on the Japanese side, and on the Chinese side we see simplified character 体 (again), now with its Chinese reading, tĭ, and secondary reading, tī. His entry for 乗 jō shows the variant 乘 on the Chinese side, accompanied by its primary reading, chéng, and secondary reading, shèng. (I mention 乘 as a kind of test case because it is discussed at some length in a Language Log post entitled “Sinological suffering” [3/31/17]. There it is noted that 乘’s secondary reading, shèng, which turns it into a noun, is often missing even from some large‑ish dictionaries.)

On cracking the cover of the Kodansha Kanji Dictionary, one is pleasantly surprised by a look‑up method that reveals nearly all of itself to the reader in literal seconds. Halpern calls it his System of Kanji Indexing by Patterns or SKIP. Consider the following two sentences, into which I’ve distilled the bare essence of SKIP:

If the character you seek has a left‑right structure like this 操, do a stroke‑count of its two ‘halves’ separately, as 3 and 13, then find it under the 3‑13 heading in the left‑right section of this dictionary. If the character you seek has a top‑bottom structure like this 翌, do a stroke‑count of its two ‘halves’ separately, as 6 and 5, then find it under the 6‑5 heading in the top‑bottom section.

For an investment of, say, 16 seconds to read the two sentences above, one has gained access to 82% of the dictionary’s contents. (Beyond left‑right and top‑bottom, the other two categories are enclosure, e.g., 症, accounting for 11%, and solid, e.g., 爾, accounting for the remaining 7% of the entries. These percentages I base on Halpern’s New Japanese‑English Character Dictionary [1993], predecessor to his Kodansha Kanji Dictionary [2013].)
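Seen from the systems side, the two-sentence rule above amounts to a keyed lookup: a structural pattern plus two stroke counts. What follows is a minimal sketch in Python, not Halpern's implementation. The name `SKIP_INDEX`, the function `skip_lookup` and the data layout are assumptions made for illustration, and the index holds only the two characters mentioned in the text.

```python
from collections import defaultdict

# Hypothetical miniature index (illustrative only, not Halpern's data):
# (pattern, strokes in first part, strokes in second part) -> characters.
SKIP_INDEX = defaultdict(list)
SKIP_INDEX[("left-right", 3, 13)].append("操")
SKIP_INDEX[("top-bottom", 6, 5)].append("翌")


def skip_lookup(pattern, first, second):
    """Return the characters filed under a pattern and a pair of stroke counts."""
    return list(SKIP_INDEX.get((pattern, first, second), []))


print(skip_lookup("left-right", 3, 13))  # ['操']
print(skip_lookup("top-bottom", 6, 5))   # ['翌']
```

In the printed dictionary a heading can hold many characters, so the list returned here would then be scanned by radical, as described in the text.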

Timing. For me, the look‑up duration is about 40 seconds for a character in the left‑right or top‑bottom section, and 24 seconds for a character in the enclosure section or solid section. Summing 16 seconds, to read the instructions, plus 40 seconds to locate a sample character, plus an additional 60 seconds to get the hang of the enclosure and solid sections, we have a learning curve that clocks at 1 minute 56 seconds, call it 2 minutes. That’s for access to the whole dictionary. (Someone younger might cut my times in half.) Can a system for dealing with hànzì really be this simple, logical and pleasant? Yes. With just one negative aspect that I’m aware of: Under some headings, we encounter ‘too many’ characters. For example, the ‘Left‑Right 7‑6’ section seems rather long, as it runs to 15 pages. However, all sections are internally ordered by radicals, and this helps one search even a longish section at a reasonable pace. On balance, the Kodansha is a dream come true, as it invites one to stop worrying about the sea of characters and simply swim in it.
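The learning-curve arithmetic can be checked in a few lines. This is only a restatement of the figures given above; the variable names are made up for the example.

```python
# Figures quoted in the text, in seconds.
read_instructions = 16   # reading the two-sentence rule
first_lookup = 40        # locating one left-right or top-bottom character
other_sections = 60      # getting the hang of enclosure and solid

# Access to the left-right and top-bottom sections (82% of entries).
partial = read_instructions + first_lookup

# Access to the whole dictionary.
total = partial + other_sections
minutes, seconds = divmod(total, 60)

print(partial)                                   # 56
print(f"{total} s = {minutes} min {seconds} s")  # 116 s = 1 min 56 s
```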

Now for the Four‑Corner Method. The Sìjiăo Hàomă Cházìfă 四角號碼查字法 (also called 四角號碼檢字法 with jiăn for chá) was developed in 1924‑1925 by Wáng Yúnwŭ 王雲五. It is often introduced in this fashion: “Digits 0123456789 are assigned to these ten shapes: 亠一丨丶十‡口⎾八小. The four digits of a given code are assigned to the character’s NW, NE, SW and SE corners in turn. Example: To find 檀 (tán), look under 4091 in the dictionary.” Is it that easy? For one who has actually been using the system at a reasonable level of proficiency for some time, yes, the system might feel roughly as simple as portrayed by the above vignette. But how one would arrive at such a level of proficiency and comfort is quite another matter.

Upon studying the system in earnest, one finds that the actual inventory of shapes for 0 through 9 looks more like this:
0 亠; 1 一 乚; 2 丨丿; 3 丶*永; 4 十 乂; 5 ‡; 6 口; 7 ⎾ ⏌ 乛; 8 八 人 丷; 9 小 个 忄.

Here I’ve expanded the list from 10 shapes to 20, just to give the flavor; but know that the list, when filled out with all relevant sub‑shapes, grows to 29 or 30 items. (*At the asterisk I’ve borrowed the canonical character 永 (yŏng), for the sake of its final stroke, called nà 捺.)
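As a data structure, the 20-shape inventory above is a one-to-many table from digits to shapes, and coding a character means reading that table backwards at each corner. The sketch below assumes the shapes have already been identified by eye, which is the hard part. The names `SHAPES`, `DIGIT_OF` and `code_from_corners` are invented for the example, and the corner shapes given for 檀 are one plausible reading of how its code 4091 decomposes, not a quotation from any dictionary.

```python
# Digit -> shapes, copied from the expanded 20-shape list in the text.
# "永" stands in for its final stroke (nà), as the text explains.
SHAPES = {
    0: ["亠"],
    1: ["一", "乚"],
    2: ["丨", "丿"],
    3: ["丶", "永"],
    4: ["十", "乂"],
    5: ["‡"],
    6: ["口"],
    7: ["⎾", "⏌", "乛"],
    8: ["八", "人", "丷"],
    9: ["小", "个", "忄"],
}

# Reverse table: shape -> digit.
DIGIT_OF = {shape: digit for digit, shapes in SHAPES.items() for shape in shapes}


def code_from_corners(nw, ne, sw, se):
    """Build a four-digit code from the shapes at the NW, NE, SW, SE corners.

    None means "no shape assigned to this corner" and yields the dummy digit 0.
    """
    return "".join(
        "0" if shape is None else str(DIGIT_OF[shape])
        for shape in (nw, ne, sw, se)
    )


print(len(DIGIT_OF))                               # 20
print(code_from_corners("口", None, None, None))    # 6000
# Assumed corner reading for 檀 (illustrative):
print(code_from_corners("十", "亠", "小", "一"))     # 4091
```

The table itself is trivial; everything difficult about the method lives in deciding which shape occupies which corner.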

More about 0: The majority of 0’s seen in a Four‑Corner dictionary are dummy values indicating “no digit is assigned to this corner,” exemplified by code 6000 for 口. This might cause us to wonder: In a Four‑Corner dictionary, can any of the ten digits occur at any of the four corners?

[Figure: the range 0-9 shown at each of the four corners, NW, NE, SW and SE.]

Yes. I’ve checked this, using the three such dictionaries that I own.
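With a dictionary's code list in machine-readable form, this check is mechanical. The sketch below shows one way it might be done. The function names are hypothetical, and the sample consists only of four-digit codes that happen to be quoted in this post, so it is far too small to confirm the claim; the full 0000 to 9999 range is included purely as a sanity check on the function.

```python
def digits_by_corner(codes):
    """For each of the four positions (NW, NE, SW, SE), collect the digits seen."""
    corners = [set(), set(), set(), set()]
    for code in codes:
        for position, digit in enumerate(code):
            corners[position].add(int(digit))
    return corners


def every_digit_at_every_corner(codes):
    """True if all ten digits occur at all four positions."""
    return all(seen == set(range(10)) for seen in digits_by_corner(codes))


# Four-digit codes quoted in this post (a tiny, unrepresentative sample).
sample = ["6000", "4091", "5409", "5320", "7721", "7722", "5080", "2836", "6507"]

print(sorted(digits_by_corner(sample)[0]))   # [2, 4, 5, 6, 7]
print(every_digit_at_every_corner(sample))   # False

# Sanity check: the complete code space trivially passes.
full_space = [f"{n:04d}" for n in range(10_000)]
print(every_digit_at_every_corner(full_space))  # True
```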

Then the next question has to be: Can any of the 30 sub‑shapes occur at any of the four corners? Again, yes, though with a caveat this time: There are a few sub‑shapes, of the basic 10, that never occur in certain corners. Having noted that nuance we discount it here so that we can work in round numbers and provide a formal description of the whole system as follows: “The code‑identification process for a given character involves 30*4 = 120 shape‑selection choices.” But we see that this is far removed from how a human experiences the system. Subjectively, the user feels s/he is dealing with a mere handful of shape‑selection choices per character, certainly not 120. How can the discrepancy be explained? The human brain is wily and practical. Presented with 捺 (nà), for example, it sees “choices J or K at the NW corner, choices L or M at the NE corner, choices N or O at the SW corner and choices P or Q at the SE corner” (something like that), whereby the 120 theoretical choices vanish before a mere handful of real‑world choices, as one determines soon enough that the code is probably 5409. (Note the implications here for primitive vs. advanced AI.)

By the way, what exactly is ‘a corner’ in this system that is all about corners? There are some subtleties to learn. In the character 戊 (wù), perhaps you see a NW corner suggesting 7 as the first digit? Wrong. One of the rules directs us to adopt this way of thinking: “A chā 插 shape (5 ‡) protrudes from the top of 戊 and must therefore be handled first. Moreover, one should ignore the actual NE corner of such a character and instead take note of the part that ‘comes next’ into one’s vision.” This gives us 5 3 _ _. Only now do we acknowledge the stroke that defines the left side of 戊, and complete the code as follows: 5320. But why not 5324? Because of the following additional rule: Once a line has been ‘handled’ by a digit, whether directly or indirectly, one must not return to that line to assign it any other digit, save the null marker, 0. In this case, since the SE corner was ‘handled already’ at the outset, as part of the (5 ‡) combo, one assigns 0 to the SE corner. Next, with this ‘handled already’ principle in mind, let’s look at 風 (fēng). To start, we assume that its NW and NE corners are coded as 7’s. Now, since the line that swoops down from NW to SW has already been ‘handled’ as part of the NW corner, shouldn’t the SW digit be 0? And shouldn’t the SE digit likewise be 0? Both wrong. Shapes that form corners are to be treated as if confined to the corners themselves, like this: 冖. So, understanding that the 几‑shape must be mentally hacked into four isolated pieces, we see that the code for 風 must be 7721.

Earlier I noted that some sections of a Halpern dictionary have “too many” characters. A Four‑Corner dictionary has its own version of this clumping problem. But to see how it comes about, we must start at the very top of the system, where the number of possible codes is 10,000, calculated as 10^4 since 10 digits are assigned to 4 positions. As it happens, a Four‑Corner dictionary occupies only about one‑quarter of that ‘phase space’, leaving roughly 7,400 codes untapped. For instance, The Foursquare Dictionary (edited by Herring, 1969) contains approximately 9,500 character‑entries, for which 2,659 codes were employed (per my estimate). Similarly, for the 8,000 character‑entries of the Xīnbiān Sìjiăohàomă Xuéshēng Cídiăn 新編四角號碼學生詞典 (edited by Wú Fènmiăn 吳奮勉, 1976), I estimate the number of codes employed to be 2,666. In both cases, about 7,400 of the 10,000 possible codes are absent from the picture. Why? Because the Four‑Corner system, ‘conspiring’ with its source material (the Chinese written language), has no choice but to box itself into the (approximate) 2,600 range indicated above. Given the vastness of the Chinese written language, this sounds odd at first, but recall that we’re dealing only with character‑shape types, so this juxtaposition of the numbers 2,600 and 10,000 is really no different than saying, “We need only a few letters to represent blood types, not the entire alphabet.”
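The occupancy figures can be worked through directly from the numbers quoted above. The sketch below uses only those numbers (entry counts and the estimated counts of codes in use); the function name and the dictionary keys are illustrative. It also derives the average clump size, a figure the text does not state.

```python
TOTAL_CODES = 10 ** 4  # 10 digits in each of 4 positions


def code_space_stats(entries, codes_used, total=TOTAL_CODES):
    """Summarize how much of the four-digit code space a dictionary occupies."""
    return {
        "occupancy": codes_used / total,         # fraction of codes in use
        "unused": total - codes_used,            # codes never assigned
        "mean_per_code": entries / codes_used,   # average characters per used code
    }


# (approximate entries, estimated codes used), as quoted in the text
dictionaries = {
    "Foursquare Dictionary (1969)": (9500, 2659),
    "Xinbian Sijiaohaoma Xuesheng Cidian (1976)": (8000, 2666),
}

for title, (entries, codes_used) in dictionaries.items():
    stats = code_space_stats(entries, codes_used)
    print(
        f"{title}: {stats['occupancy']:.1%} occupied, "
        f"{stats['unused']} unused, "
        f"{stats['mean_per_code']:.2f} characters per code on average"
    )
```

On these inputs the unused counts come out at 7,341 and 7,334, close to the "roughly 7,400" of the text, which rounds the codes in use down to 2,600. The averages of about 3.6 and 3.0 characters per code hide the wide spread in clump sizes discussed next.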

Back to the clumping problem. If we look at characters‑to‑code ratios in a Four‑Corner dictionary, we find that they vary widely, all the way from 1:1 to 38:1 (e.g., for code 7722). And since this clumping problem is endemic to the system itself (see previous paragraph), one seeks a way to ameliorate it internally. The traditional response has been to append fifth digits, in hopes that they will make the codes unique. (The subscripted digit is based on a shape on the interior of the character. For example, the five‑digit code for 捺 (nà) is 5409₁ where the subscript 1 refers to the horizontal line that tops the 9‑shape.) But does this Fifth‑Digit Kludge actually work? No.

Suppose we want to illustrate the scheme by presenting some trios of characters whose codes are made unique by their subscripted fifth digits. In paging through the dictionary, what we will likely encounter at first are cases where the fifth digit happens to be the same for all three, like this: 5080₆ 責 zé; 5080₆ 貴 guì; 5080₆ 賮 jìn. Useless. And only rarely will we encounter the sought‑after type of trio where the fifth digit actually creates unique codes, like this: 2836₁ 鰍 qiū; 2836₅ 鱔 shàn; 2836₆ 膾 kuài. And when it comes to ‘clumps’ larger than a trio of characters, where uniqueness is needed most, its chance of being supplied by the fifth digits drops to zero. Conversely, there are many cases where a fifth digit is appended, pro forma, even though the 4‑digit code happens already to be unique, e.g. ‘6507₇’ for 嘒 (huì). All the fifth‑digit scheme does is transform a supposedly four‑corner dictionary into a home for mindless five‑digit clutter. A Four‑Corner dictionary is obsessed with uniqueness yet achieves it only sporadically, 1% of the time; conversely, the Kodansha has no overt interest in unique codes but provides them automatically as character numbers 1 through 5458. Mercifully, with an on‑line Four‑Corner dictionary, it seems the fifth digit might be less likely to be implemented. At least at the cutely named 众果搜 Zhòngguŏsōu, one finds that a fifth digit is rejected by the UI.
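Whether a fifth digit resolves a clump is easy to test once the codes are tabulated: group the characters by code, with and without the fifth digit, and see which groups still hold more than one character. The sketch below does this for the two trios quoted above. The function names are hypothetical and the data are limited to those six characters, so it illustrates the test rather than measuring any real dictionary.

```python
from collections import defaultdict

# (character, four-digit code, fifth digit), taken from the examples in the text.
ENTRIES = [
    ("責", "5080", "6"),
    ("貴", "5080", "6"),
    ("賮", "5080", "6"),
    ("鰍", "2836", "1"),
    ("鱔", "2836", "5"),
    ("膾", "2836", "6"),
]


def clumps(entries, use_fifth_digit):
    """Group characters by their four-digit or five-digit code."""
    groups = defaultdict(list)
    for char, code, fifth in entries:
        key = code + fifth if use_fifth_digit else code
        groups[key].append(char)
    return dict(groups)


def unresolved(entries, use_fifth_digit):
    """Return only the codes still shared by more than one character."""
    grouped = clumps(entries, use_fifth_digit)
    return {code: chars for code, chars in grouped.items() if len(chars) > 1}


print(unresolved(ENTRIES, use_fifth_digit=False))
# {'5080': ['責', '貴', '賮'], '2836': ['鰍', '鱔', '膾']}
print(unresolved(ENTRIES, use_fifth_digit=True))
# {'50806': ['責', '貴', '賮']}
```

The second result shows the two cases side by side: the 2836 trio is separated by its fifth digits, while the 5080 trio remains exactly as clumped as before.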

Given the ungainly nature of the Four‑Corner system, one expects a learning curve that will be measured in hours or days, in contrast to the SKIP learning curve that requires only 56 seconds for 82% access to 5458 characters. Qualitatively, SKIP invites one to stop ‘worrying’ about the sea of characters and, instead, go swimming in it. Speaking of ‘worrying’, here is a look back from 2022 to 1961. Thirty‑odd years before SKIP existed, I suffered a typical neophyte’s crisis regarding the seemingly endless sea of Chinese characters. The 214 radicals and 888 Soothill phonetics left me cold; what I craved instead was an inventory of purely visual building blocks. By analyzing, from scratch, the contents of Fenn’s Five Thousand Dictionary, I arrived at 325 such graphical elements or ‘Chatoms’ as I called them, short for ‘Chinese atoms.’ But the sense of solace that I was hoping for didn’t arrive along with this enumeration of 325 shapes. Instead, concerned that the project had been too rushed and impulsive, I filed it away with other juvenilia. Still, the dream of navigating the hànzì sea using a purely visual approach stayed with me over the years. In 2000, when I came across the 1993 predecessor to the Kodansha dictionary in a book store, I felt a strong resonance with my abandoned scheme — not due to anything specific in SKIP, only its visual paradigm, and the sense of relief that it engenders. (Of course one also needs a cídiăn*, and for that we have the four‑volume Gwoyeu Tsyrdean 國語辞典, beautifully conceived and implemented. But that’s for another day.)

*VHM:  cídiăn is a "word dictionary", in contrast to a zìdiăn ("character dictionary").


Selected readings

  1. "Mount a chariot" (11/22)
  2. "Latinxua / Latinization — it worked in the 30s and 40s" (12/21/21)
  3. Victor H. Mair, "The Need for an Alphabetically Arranged General Usage Dictionary of Mandarin Chinese: A Review Article of Some Recent Dictionaries and Current Lexicographical Projects" (pdf), Sino-Platonic Papers, 1 (February, 1986), 1-31.


  1. Antonio L. Banderas said,

    November 7, 2022 @ 7:20 am

    It's a pity CJKI canceled all their other lexicographical resources (except for Esperanto). Yet, you can still take a look at some samples from almost a decade ago:

    CJKI汉英学习字典 http://www.kanji.org/dictionaries/hanying/index.htm

    القاموس العربي الإنجليزي للمتعلمين http://www.kanji.org/dictionaries/cald/index.htm


  2. Chris Button said,

    November 7, 2022 @ 8:30 am

    I love Jack Halpern's skip method for the Kodansha dictionary. It's the most useful arrangement I've ever seen for a Kanji dictionary.

    On a separate note, I've been wondering how to arrange my dictionary since it contains Japanese and Chinese characters and readings along with various reconstructed forms. Since I'm of the belief that the 22 heavenly stems and earthly branches represented an alphabet of sorts, I've decided to arrange it based on the oldest reconstructed forms for each headword (phonetic component) in accordance with which of the 22 onset categories each of them begins with. (I do plan of course to have multiple other indexes in the back so people can actually find stuff when just dipping in).

  3. Victor Mair said,

    November 7, 2022 @ 10:55 am

    @Chris Button

    I like what you're saying about the primary arrangement for your dictionary, also highly approve of having multiple indices for looking up entries by various and sundry means.

  4. Jim Breen said,

    November 7, 2022 @ 4:32 pm

    The CJKI resources are available at https://www.cjk.org/

  5. Chris Button said,

    November 7, 2022 @ 4:43 pm

    @ Victor Mair

    Thanks for the encouragement. It can be the first ever alphabetical dictionary arranged according to the actual 22 "letters" (i.e. ganzhi) of the script.

  6. John J Chew said,

    November 21, 2022 @ 9:34 pm

    "Clumping" is called "hash collision" in computer science.
