This is another one of those posts that I started writing long ago (in this case back in January of 2012), but then set aside for one reason or another. However, such drafts and research notes usually reemerge on my radar screen sooner or later, especially if they are of compelling interest and potential significance. Now that it is summer time and I have a little bit of leisure to do what I like, I'm happy to return to this topic and finish it up.
In this instance, I've long been intrigued by the fact that the number of basic morphemes in Sinitic is roughly comparable to the number of roots in Proto-Indo-European (PIE). I wondered whether this was purely a coincidence or a reflection of some fundamental feature of language and the human brain. So I started to look at other language families to see whether they too had a similar number of root morphemes.
As I gathered and examined data, they seemed to confirm my initial impression that the essential etyma of many languages amount to approximately 1,000-2,000, with most falling at around 1,200-1,500. Wanting to secure more precise and reliable evidence, I asked colleagues who are specialists in various fields to share their expertise. Some of the replies that I received are given below.
Since I am an ardent fan of the Appendix of Indo-European roots at the back of the American Heritage Dictionary of the English Language (AHD), I began by looking there. Published separately in 2000, the AHD list contains around 1,350 reconstructed IE roots.
I will return to IE later, but for the moment, I want to take a look at Semitic, inasmuch as AHD also has an appendix of roots for words in English that are derived from that family. The author of that appendix is John Huehnergard, who kindly prepared the following remarks, which he has generously agreed to let me quote (with the understanding that they were just the result of a couple of hours of poking around, not an in-depth study):
You're right that the list of roots in AHD is not at all exhaustive. I don't remember, and have not been able to find, a statement in the various tomes on Semitic grammar, and the many articles on Semitic root structure, about the total number of roots. There is a work in progress (since 1970), Dictionnaire des racines sémitiques ou attestées dans les langues sémitiques, which by my estimate is now about 1/3 finished; the published fascicles contain some 950 pages, and at a very, very rough estimate there are an average of perhaps 3 or 4 roots per page. That would yield some 8,000-10,000 roots. But that is too high for a set of Proto-Semitic roots, for several reasons: (a) variant roots are listed (e.g., roots in which one of the root consonants has become voiced in some dialect of Arabic and then been incorporated into the literary language alongside the root with the more common reflex of that root consonant); (b) the attestées part of the title is significant, in that roots extracted from borrowed words are listed (a favorite of mine — though not yet in the available parts of the Dictionnaire – is classical Ethiopic m-n-kw-s 'to become a monk' < Gk. monakhos). But although 8-10,000 roots is clearly too high, I'm not sure how to estimate how much too high.
There is an oft-cited paper by Joseph Greenberg, originally published in Word 6 1950, "The Patterning of Root Morphemes in Semitic." Greenberg counted the number of Arabic roots in two standard dictionaries of the classical literary language, and came up with 3,775. That number will also be higher than a list of "Proto-Arabic" roots, for the reason given above under (a), and also because Arabic borrowed a good number of words from Aramaic, so Arabic has sūq 'souk' < Aramaic (ultimately from Akkadian), and ḍīq 'narrowness', originally the same root.
Biblical Hebrew, according to one count I found, has 1,565 verbal roots. That would not include roots that do not occur in verbs, such as k-l-b in *kalb 'dog', and the 50–100 pronominal forms. And Biblical Hebrew is an incompletely attested language.
I did a rough "guestimate" of the number of Akkadian roots listed in one dictionary (an interesting Rückläufiges Wörterbuch; since so many Akkadian texts are broken, it helps to have a dictionary listing words from the end) — about 60 roots per page over 31 pages, so about 1,800 roots.
All of this gets us roughly in the ballpark of your 1,500 roots for PIE and Sinitic. Very interesting, indeed.
I also took a quick look at the number of individual signs in the earliest cuneiform; it is about 800, and then gets a bit more complex (but only another 100-200), and then simplified to anywhere from 150-400, depending on the period and region of Mesopotamia.
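Huehnergard's back-of-the-envelope figures are easy to reproduce. The Python sketch below does only that: the variable and function names are illustrative, and the inputs are the rough numbers quoted above (some 950 published pages at 3 or 4 roots per page, with the Dictionnaire about one third finished, and about 60 roots per page over 31 pages for Akkadian).

```python
from fractions import Fraction

# Rough inputs quoted in Huehnergard's remarks; the names are illustrative only.
PUBLISHED_PAGES = 950
ROOTS_PER_PAGE = (3, 4)            # low and high guesses
SHARE_FINISHED = Fraction(1, 3)    # "about 1/3 finished"


def extrapolate_total(pages, roots_per_page, share_finished):
    """Scale the roots counted so far up to the size of the completed work."""
    return int(pages * roots_per_page / share_finished)


# Dictionnaire des racines sémitiques: low and high extrapolations.
drs_low, drs_high = [
    extrapolate_total(PUBLISHED_PAGES, r, SHARE_FINISHED) for r in ROOTS_PER_PAGE
]

# Akkadian reverse dictionary: roots per page times pages.
akkadian_roots = 60 * 31

print(drs_low, drs_high)   # 8550 11400
print(akkadian_roots)      # 1860
```

The extrapolation comes out at roughly 8,550 to 11,400, a little above the "8,000-10,000" quoted, which is unsurprising given how rough the inputs are. The Akkadian product is 1,860, in line with "about 1,800".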
For Sumerian, Philip Jones states:
We have about 7 1/2 thousand entries in the PSD [Pennsylvania Sumerian Dictionary], but we do not indicate which are basic etyma. My impression is that it would be reasonably less than 2k, but I'm not sure how to give you a more calculated figure from our databases as they stand.
For Nostratic and PIE, Michael Witzel provides the following references:
Illich-Svitych, V.M. Opyt sravneniia nostraticheskikh iazykov (Moscow: 1971-1976) (2 vols.), has only 353 entries, but this was a very early, pioneering effort.
See now: Bomhard, Allan R. Reconstructing Proto-Nostratic: Comparative Phonology, Morphology, and Vocabulary (Leiden: 2008) [VHM: cf. the remarks by the author below]
For comparison: Ehret, Christopher. Reconstructing Proto-Afro-Asiatic (Proto-Afrasian): Vowels, Tone, Consonants, and Vocabulary (Berkeley and Los Angeles: 1995) has 1,024 roots.
Pokorny, Julius. Indogermanisches etymologisches Wörterbuch (Bern: 1959), has by my calculation: c. 3 entries per page x 1,183 pages = 3,549 words, but this also includes many nouns, not just roots.
The more recent LIV (Lexikon der indogermanischen Verben), by Helmut Rix, et al. (Wiesbaden 1998) has by my calculation: 2 entries per page x 640 = 1,280 roots, or according to their index only 1,150 verb roots (which, of course, excludes some nouns such as 'heart' that have no obvious verbal base).
So, your guess may be just a tiny bit high….
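Witzel's page-count estimates can be checked the same way and set against the 1,200-1,500 range proposed earlier in this post. This is only a sketch; the function names and the three-way "below / within / above" labels are assumptions made for illustration, not anything in his message.

```python
# Range proposed earlier in the post for the typical number of basic etyma.
LOW, HIGH = 1200, 1500


def estimate_entries(entries_per_page, pages):
    """Page-count estimate: average entries per page times number of pages."""
    return entries_per_page * pages


def position(count, low=LOW, high=HIGH):
    """Say where a count falls relative to the proposed range."""
    if count < low:
        return "below"
    if count > high:
        return "above"
    return "within"


pokorny = estimate_entries(3, 1183)   # includes many nouns, not just roots
liv = estimate_entries(2, 640)        # verb roots, estimated from page count
liv_index = 1150                      # verb roots according to the LIV index

for name, count in [("Pokorny", pokorny), ("LIV (pages)", liv), ("LIV (index)", liv_index)]:
    print(f"{name}: {count} ({position(count)} {LOW}-{HIGH})")
```

The products match the figures Witzel gives (3,549 and 1,280). Only the page-based LIV estimate falls inside the range; the index count sits just under it, and Pokorny's total is well above it because it counts more than roots.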
On Nostratic, Allan Bomhard comments:
Aharon Dolgopolsky lists some 3,300 Nostratic roots in his unpublished "Nostratic Dictionary", but that number is untenable. [VHM: Bomhard has written an extremely critical review of Dolgopolsky's work.]
Your estimate sounds close to me. In my 2008 book, I proposed 857 Nostratic roots, and, in my 2011 book, an additional 93.
John Colarusso provides guesstimates for Caucasian languages:
For NEC I would say 300.
For NWC I would say 200.
There has been less work on the last two [than for IE] and their time depth is great.
For SC, I would say 500 or so, but I'll have to check Fähnrich's dictionary of Proto-Kartvelian.
Why [are these numbers so low]?
A few solid cognates are enough to establish a link. More simply fleshes out the proto-culture.
[VHM: The figures you give for the Caucasic groups seem low.]
JC: Yes, but those are cognates. For basic vocabulary in NWC there are about 800 or so. This is because this family, like some of the Shan languages of SE Asia, makes up fundamental words out of smaller bits, such as /na-pe/ 'upper part of face', from /ne/ 'eye', /pe/ 'nose'.
I think your numbers of 1,200 – 1,500 sound good.
Don Ringe on PIE:
Well, it depends on how strict your etymological standards are, but that [VHM: 1,200-1,500] must be in the right ballpark. The lower number probably depends on more rigorous methods of reconstruction (rejecting questionable etymologies, etc.).
But there's another problem you need to think about. We're accustomed to labelling "PIE" just about any reconstruction from a reasonable number of cognates from non-contiguous branches of the family. But if Anatolian is really half the family–that is, all the other languages share a single intermediate ancestor younger than PIE–then strictly speaking a reconstruction can't be *real* PIE unless there's at least one Anatolian cognate and at least one non-Anatolian cognate. And if Tocharian is really half of the remaining half of the family (in the same way), the same reasoning applies again for the protolanguage of the non-Anatolian half. If we apply those standards, probably only a few hundred words can be proved to go back to PIE. If we apply looser standards, 1,500 or so is probably roughly correct. It's really a matter of definition.
From J. P. Mallory:
The following paper contains a carefully determined accounting of PIE roots, one that doesn't muck in regional variants separately:
2010 "Semantic field and cognate distribution in Indo-European." In: T. M. Nikolaeva (ed) Issledovanija po Lingvistike i Semiotike, 180-190. Moscow. (This is the festschrift for Ivanov).
Table 1 presents what Doug Adams and I scored as real PIE (the other tables add the regional cognate groups).
[VHM: There are 2,283 PIE roots according to Table 1. This is slightly on the high side, but well within the ball park for basic etyma that I was predicting on the basis of comparison with other language families and with cognitive science.]
Douglas Adams on PIE:
Actually, I think your higher estimate may be closer to the truth. Pokorny (1959) has 2,044 entries. Some of those are pretty dubious, of course, but dealing as he was largely with pre-war sources, Anatolian, Tocharian, and Albanian were not well represented, so in the last 50-60 years evidence for some additional roots has shown up. I certainly would not go below 1,500 in my estimate and I think the total is closer to 2,000 than to 1,500.
For Sinitic, we may arrive at a rough approximation of the number of proto-morphemes by counting the discrete, decipherable oracle bone graphs, which amount to around 1,500, and by tallying up the number of root phonophores in the Mandarin pronunciation of the script as it has existed for the last century or so, which is somewhat fewer than a thousand. For example, L. Wieger's venerable Chinese Characters: Their Origin, Etymology, History, Classification and Signification: a Thorough Study from Chinese Documents includes 858 of these phonetic components and "The Soothill Syllabary" has 895 (see John DeFrancis, The Chinese Language: Fact and Fantasy, pp. 97ff). Since the phonology of Middle Sinitic and Old Sinitic is more complex than that of modern Mandarin, the number of phonophores in earlier times would have been greater than it is now, when many syllables that have collapsed together to sound the same in Mandarin would formerly have been distinguished.
Naturally, the above figures and analysis constitute only a very preliminary and highly tentative approach to the hypothesis I am putting forward. Nonetheless, I think that the fact that the quantity of basic building blocks of various languages is roughly comparable is not merely coincidental, but may have something to do with the cognitive makeup of the brain. That is to say, at the bottom limit, for a language to become an organic, functioning entity, it needs to have a sufficient number of constituent, core etyma from which a working vocabulary may be derived. At the other end of the scale, there seems to be an upper limit to the number of primary conceptual categories that the mind is capable of processing.
It seems that, in general, there are roughly 1,200-1,500 root concepts from which all others are generated. This appears to hold for many language families. Inventories of core etyma whose size is much over 2,000 or much under 1,000 are probably the result of differing definitions of what constitutes a basic root and of how the computations are carried out.
Contributions of additional data from Language Log readers are warmly welcome.
[Thanks to Russell Gray and Sergei Nikolaev]