Who Has the Biggest Dictionary?


The East Asians have an ongoing contest propelled by dictionary size envy. Everybody wants to see who can produce a dictionary with the most entries. The Koreans at Dankook University have just pulled off the amazing feat of compiling a dictionary that has outstripped anything yet generated by the Japanese or the Chinese themselves. After 30 years of labor and investing more than 31,000,000,000 KRW (equal to more than 25 million USD), the South Koreans have just published the Chinese-Korean Unabridged Dictionary in 16 volumes. This humongous lexicon contains nearly half a million entries composed of 55,000 different characters. You can read more about the Dankook dictionary and its bested competitors here and here.

I should note that this is a *word* (CI2 詞 / 辭) dictionary as opposed to a *character* (ZI 字) dictionary. The fundamental, crucial difference between these two types of dictionaries is that entries in the former type are composed of both monosyllabic and polysyllabic terms, whereas entries in the latter type are composed only of single characters. Character dictionaries can contain far more than 55,000 entries. For example, the online Japanese repository of Chinese characters has more than 80,000 different forms, and the Zhonghua zihai 中華字海, which was published in China in 1994, has over 85,500 different characters.

It is essential to point out that there will never be an end to the compilation of ever larger single character dictionaries, since the Chinese writing system is essentially open-ended. People invent new characters for their own names; every time a new element is discovered, a new character is created for it (e.g., LAO2 鐒 for lawrencium); special graphs must be coined for topolect morphemes; etc. This vast proliferation of characters poses numerous challenges and problems, including the following:

1. how to order and locate them
2. how to identify each of them with a specific code designation (not even Unicode — which has assigned the vast majority of its code points to Chinese — can keep up)
3. the fact that many of these "different" characters are actually just variants of other characters, including forms that were popular at diverse moments in history, but then became obsolete

Above about 30,000 characters, we usually don't know the sound or the meaning (or both) of most characters. Furthermore, most of the characters in these mega-dictionaries can only be attested as having occurred once in history, and that often in lexicons of obscure characters! I liken the situation to junk DNA in the human body — there are an awful lot of junk characters out there clogging up the writing system. While it may be an understandable obsession for East Asian lexicographers to compete to produce the most comprehensive and, above all, BIGGEST dictionary of characters, I do not think that custodians of electronic codes should feel obliged to follow suit, especially when the frequency of characters over 20,000 (or, for that matter, over 10,000) is so infinitesimally small. Consider the following chart, which shows the coverage of increasing numbers of characters:

| Number of characters | Rate of coverage |
| --- | --- |
| 1,000 | 90% |
| 2,400 | 99% |
| 3,800 | 99.9% |
| 5,200 | 99.99% |
| 6,600 | 99.999% |

If the cumulative coverage continues to grow at the same rate (a factor of 10 per 1,400 characters after the first thousand), then the share of text left uncovered at rank 20,000 would be about 2.7×10⁻¹⁵, so the characters ranked around 20,000 would be expected to occur roughly once in 3.7×10¹⁴ characters of text (once per 370 trillion characters), and the characters ranked around 85,000 would be expected to occur once in 10⁶¹ characters of text. For comparison, Archimedes calculated that it would take 10⁶³ grains of sand to fill the universe as he thought it to be, a sphere about two light years in diameter: just 100 times more.
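The extrapolation can be checked with a few lines of Python. This is only a sketch of one reading of the chart, namely that the share of text left uncovered is 10% at rank 1,000 and shrinks tenfold with every further 1,400 characters. The function name `uncovered_share` is made up for illustration. Under that assumption the uncovered share comes out at roughly 2.7×10⁻¹⁵ at rank 20,000 and at 10⁻⁶¹ at rank 85,000.

```python
def uncovered_share(rank: int) -> float:
    """Share of running text NOT covered by the `rank` most frequent characters.

    Assumption (read off the chart, not an established law): 10% of text is
    uncovered at rank 1,000, and that share drops by a factor of 10 for
    every additional 1,400 characters.
    """
    if rank < 1000:
        raise ValueError("the chart gives no data below rank 1,000")
    return 10.0 ** -(1 + (rank - 1000) / 1400)


# The first five ranks reproduce the chart; the last two are the extrapolation.
for rank in (1000, 2400, 3800, 5200, 6600, 20000, 85000):
    share = uncovered_share(rank)
    print(f"{rank:>6,}  uncovered share {share:.1e}  about one character in {1 / share:.1e}")
```

The reciprocal of the uncovered share is the length of text in which one such rare character would be expected to turn up, which is the sense in which the figures above are quoted.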

The fact that the rarest characters in the biggest Chinese-character dictionaries have actually occurred in text, if only in someone's dictionary, implies that the actual frequencies do not continue to drop so rapidly. Still, the expected frequency of codes for characters ranked beyond 20,000 is surely very small.

With a nod to Josh Vittor for calling this to my attention and to Minkyung Ji for additional information.


