Ur-etyma: how many are there?

This is another one of those posts that I started writing long ago (in this case back in January of 2012), but then set aside for one reason or another.  However, such drafts and research notes usually reemerge on my radar screen sooner or later, especially if they are of compelling interest and potential significance.  Now that it is summer time and I have a little bit of leisure to do what I like, I'm happy to return to this topic and finish it up.

In this instance, I've long been intrigued by the fact that the number of basic morphemes in Sinitic is roughly comparable to the number of roots in Proto-Indo-European (PIE).  I wondered whether this was purely a coincidence or a reflection of some fundamental feature of language and the human brain.  So I started to look at other language families to see whether they too had a similar number of root morphemes.

As I gathered and examined data, they seemed to confirm my initial impression that the essential etyma of many languages amount to approximately 1,000-2,000, with most falling at around 1,200-1,500.  Wanting to secure more precise and reliable evidence, I asked colleagues who are specialists in various fields to share their expertise.  Some of the replies that I received are given below.

Since I am an ardent fan of the Appendix of Indo-European roots at the back of the American Heritage Dictionary of the English Language (AHD), I began by looking there.  Published separately in 2000, the AHD list contains around 1,350 reconstructed IE roots.

I will return to IE later, but for the moment, I want to take a look at Semitic, inasmuch as AHD also has an appendix of roots for words in English that are derived from that family.  The author of that appendix is John Huehnergard, who kindly prepared the following remarks, which he has generously agreed to let me quote (with the understanding that they were just the result of a couple of hours of poking around, not an in-depth study):

You're right that the list of roots in AHD is not at all exhaustive. I don't remember, and have not been able to find, a statement in the various tomes on Semitic grammar, and the many articles on Semitic root structure, about the total number of roots. There is a work in progress (since 1970), Dictionnaire des racines sémitiques ou attestées dans les langues sémitiques, which by my estimate is now about 1/3 finished; the published fascicles contain some 950 pages, and at a very, very rough estimate there are an average of perhaps 3 or 4 roots per page. That would yield some 8,000-10,000 roots. But that is too high for a set of Proto-Semitic roots, for several reasons: (a) variant roots are listed (e.g., roots in which one of the root consonants has become voiced in some dialect of Arabic and then been incorporated into the literary language alongside the root with the more common reflex of that root consonant); (b) the attestées part of the title is significant, in that roots extracted from borrowed words are listed (a favorite of mine — though not yet in the available parts of the Dictionnaire – is classical Ethiopic m-n-kw-s 'to become a monk' < Gk. monakhos). But although 8-10,000 roots is clearly too high, I'm not sure how to estimate how much too high.

There is an oft-cited paper by Joseph Greenberg, originally published in Word 6 1950, "The Patterning of Root Morphemes in Semitic." Greenberg counted the number of Arabic roots in two standard dictionaries of the classical literary language, and came up with 3,775. That number will also be higher than a list of "Proto-Arabic" roots, for the reason given above under (a), and also because Arabic borrowed a good number of words from Aramaic, so Arabic has sūq 'souk' < Aramaic (ultimately from Akkadian), and ḍīq 'narrowness', originally the same root.

Biblical Hebrew, according to one count I found, has 1,565 verbal roots. That would not include roots that do not occur in verbs, such as k-l-b in *kalb 'dog', and the 50–100 pronominal forms. And Biblical Hebrew is an incompletely attested language.

I did a rough "guestimate" of the number of Akkadian roots listed in one dictionary (an interesting Rückläufiges Wörterbuch; since so many Akkadian texts are broken, it helps to have a dictionary listing words from the end) — about 60 roots per page over 31 pages, so about 1,800 roots.

All of this gets us roughly in the ballpark of your 1,500 roots for PIE and Sinitic. Very interesting, indeed.

I also took a quick look at the number of individual signs in the earliest cuneiform; it is about 800, and then gets a bit more complex (but only another 100-200), and then simplified to anywhere from 150-400, depending on the period and region of Mesopotamia.
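The back-of-the-envelope figures in the remarks above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `extrapolate_roots` and its parameters are hypothetical, and the inputs are the rough values quoted above (950 pages at 3 or 4 roots per page with about one third of the work finished; 60 roots per page over 31 pages).

```python
def extrapolate_roots(pages, roots_per_page, fraction_complete=1.0):
    """Estimate total roots from a page count and a per-page average.

    fraction_complete is the share of the work already published
    (an assumption taken from the rough estimate in the text).
    """
    return pages * roots_per_page / fraction_complete


# Dictionnaire des racines sémitiques: 950 pages, 3-4 roots per page, ~1/3 done
drs_low = extrapolate_roots(950, 3, fraction_complete=1 / 3)
drs_high = extrapolate_roots(950, 4, fraction_complete=1 / 3)

# Akkadian reverse dictionary: about 60 roots per page over 31 pages
akkadian = extrapolate_roots(31, 60)

print(round(drs_low), round(drs_high), round(akkadian))
```

With these inputs the extrapolation for the Dictionnaire comes to roughly 8,550-11,400, a little above the "some 8,000-10,000" quoted, which is unsurprising given how rough the per-page average is. The Akkadian product is 1,860, i.e. "about 1,800".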

For Sumerian, Philip Jones states:

We have about 7 1/2 thousand entries in the PSD [Pennsylvania Sumerian Dictionary], but we do not indicate which are basic etyma. My impression is that it would be reasonably less than 2k, but I'm not sure how to give you a more calculated figure from our databases as they stand.

For Nostratic and PIE, Michael Witzel provides the following references:

Illich-Svitych, V.M.  Opyt sravneniia nostraticheskikh iazykov (Moscow: 1971-1976) (2 vols.), has only 353 entries, but this was a very early, pioneering effort.

See now :  Bomhard, Allan R.  Reconstructing Proto-Nostratic:  Comparative Phonology, Morphology, and Vocabulary (Leiden:  2008) [VHM:  cf. the remarks by the author below]

For comparison : Ehret, Christopher. Reconstructing Proto-Afro-Asiatic (Proto-Afrasian):  Vowels, Tone, Consonants, and Vocabulary (Berkeley and Los Angeles:  1995) has 1,024 roots.

Pokorny, Julius.  Indogermanisches etymologisches Wörterbuch (Bern: 1959),  has by my calculation:  c. 3 entries per page x 1,183 pages =  3,549 words, but this also includes many nouns, not just roots.

The more recent LIV (Lexikon der indogermanischen Verben), by Helmut Rix, et al. (Wiesbaden 1998) has by my calculation: 2 entries per page x 640 = 1,280 roots, or according to their index only 1,150 verb roots (which, of course, excludes some nouns such as  'heart' that have no obvious verbal base).

So, your guess may be just a tiny bit high….
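Witzel's two page-count estimates are simple products, and a short sketch makes it easy to set them beside the index count he mentions. The dictionary keys and variable names here are illustrative only; the figures are the ones quoted above.

```python
# Entries-per-page estimates quoted above: (entries per page, pages)
page_estimates = {
    "Pokorny 1959": (3, 1183),
    "LIV 1998": (2, 640),
}

# The product of the two numbers gives the estimated number of entries
estimated_entries = {
    work: per_page * pages for work, (per_page, pages) in page_estimates.items()
}

# LIV's own index, as reported above, lists only about 1,150 verb roots
liv_index_count = 1150
liv_overestimate = estimated_entries["LIV 1998"] - liv_index_count

for work, total in estimated_entries.items():
    print(f"{work}: about {total:,} entries")
print(f"LIV page estimate exceeds the index count by {liv_overestimate}")
```

The Pokorny product (3,549) is well above the 2,044 entries that Douglas Adams reports further on, a reminder of how rough per-page averages can be.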

On Nostratic, Allan Bomhard comments:

Aharon Dolgopolsky lists some 3,300 Nostratic roots in his unpublished "Nostratic Dictionary", but that number is untenable.  [VHM:  Bomhard has written an extremely critical review of Dolgopolsky's work.]

Your estimate sounds close to me.  In my 2008 book, I proposed 857 Nostratic roots, and, in my 2011 book, an additional 93.

John Colarusso provides guesstimates for Caucasian languages:

For NEC I would say 300.

For NWC  I would say 200.

There has been less work on the last two [than for IE] and their time depth is great.

For SC, I would say 500 or so, but I'll have to check Fähnrich's dictionary of Proto-Kartvelian.

Why [are these numbers so low]?

A few solid cognates are enough to establish a link.  More simply fleshes out the proto-culture.

[VHM:  The figures you give for the Caucasic groups seem low.]

JC:  Yes, but those are cognates.  For basic vocabulary in NWC there are about 800 or so.  This is because this family, like some of the Shan languages of SE Asia, makes up fundamental words out of smaller bits, such as /na-pe/ 'upper part of face', from /ne/ 'eye', /pe/ 'nose'.

I think your numbers of 1,200 – 1,500 sound good.

Don Ringe on PIE:

Well, it depends on how strict your etymological standards are, but that [VHM:  1,200-1,500] must be in the right ballpark.  The lower number probably depends on more rigorous methods of reconstruction (rejecting questionable etymologies, etc.).

But there's another problem you need to think about.  We're accustomed to labelling "PIE" just about any reconstruction from a reasonable number of cognates from non-contiguous branches of the family.  But if Anatolian is really half the family–that is, all the other languages share a single intermediate ancestor younger than PIE–then strictly speaking a reconstruction can't be *real* PIE unless there's at least one Anatolian cognate and at least one non-Anatolian cognate.  And if Tocharian is really half of the remaining half of the family (in the same way), the same reasoning applies again for the protolanguage of the non-Anatolian half.  If we apply those standards, probably only a few hundred words can be proved to go back to PIE.  If we apply looser standards, 1,500 or so is probably roughly correct.  It's really a matter of definition.

From J. P. Mallory:

The following paper contains a carefully determined accounting of PIE roots, one that doesn't muck in regional variants separately:

2010 "Semantic field and cognate distribution in Indo-European."  In: T. M. Nikolaeva (ed) Issledovanija po Lingvistike i Semiotike, 180-190. Moscow. (This is the festschrift for Ivanov).

Table 1 presents what Doug Adams and I scored as real PIE (the other tables add the regional cognate groups).

[VHM:  There are 2,283 PIE roots according to Table 1.  This is slightly on the high side, but well within the ball park for basic etyma that I was predicting on the basis of comparison with other language families and with cognitive science.]

Douglas Adams on PIE:

Actually, I think your higher estimate may be closer to the truth. Pokorny (1959) has 2,044 entries.  Some of those are pretty dubious, of course, but dealing as he was largely with pre-war sources, Anatolian, Tocharian, and Albanian were not well represented, so in the last 50-60 years evidence for some additional roots has shown up.  I certainly would not go below 1,500 in my estimate and I think the total is closer to 2,000 than to 1,500.

For Sinitic, we may arrive at a rough approximation of the number of proto-morphemes by counting the discrete, decipherable oracle bone graphs, which amount to around 1,500, and by tallying up the number of root phonophores in the Mandarin pronunciation of the script as it has existed for the last century or so, which is somewhat fewer than a thousand.  For example, L. Wieger's venerable Chinese Characters: Their Origin, Etymology, History, Classification and Signification: a Thorough Study from Chinese Documents includes 858 of these phonetic components and "The Soothill Syllabary" has 895 (see John DeFrancis, The Chinese Language:  Fact and Fantasy, pp. 97ff).  Since the phonology of Middle Sinitic and Old Sinitic is more complex than that of modern Mandarin, the number of phonophores in earlier times would have been greater than it is now, when many syllables that have collapsed together to sound the same in Mandarin would formerly have been distinguished.

Naturally, the above figures and analysis constitute only a very preliminary and highly tentative approach to the hypothesis I am putting forward.  Nonetheless, I think that the fact that the quantity of basic building blocks of various languages is roughly comparable is not merely coincidental, but may have something to do with the cognitive makeup of the brain.  That is to say, at the bottom limit, for a language to become an organic, functioning entity, it needs to have a sufficient amount of constituent, core etyma from which a working vocabulary may be derived.  At the other end of the scale, there seems to be an upper limit to the number of primary conceptual categories that the mind is capable of processing.

It seems that, in general, there are roughly 1,200-1,500 root concepts from which all others are generated.  This appears to hold for many language families.  Inventories of core etyma that run much over 2,000 or much under 1,000 probably result from differing definitions of what constitutes a basic root and from differences in how the computations are carried out.

Contributions of additional data from Language Log readers are warmly welcome.

[Thanks to Russell Gray and Sergei Nikolaev]



  1. Victor Mair said,

    July 6, 2014 @ 2:42 pm

    From Johanna Nichols:

    The figure of 300 roots for Nakh-Daghestanian ("NEC" in your post) is way too low. That's the right order of magnitude for the number of basic verb roots in most of the family, where simplex verbs are a closed class. Nouns are an open class in all the languages, as are the heavy (lexical) elements in light verb constructions and, in most branches, adjectives. A reliable reconstruction and etymological dictionary of the family haven't been compiled yet, so we don't know just how many elementary roots reconstruct, but a figure in the low thousands would probably be right.

    A few years ago I did a survey of a number of recent fieldwork-based dictionaries (as I recall a smattering of North American, Siberian, and Australian languages) and found that in most of them there were around 2000 elementary roots. Something like that seems to be on the right order for elementary roots in most languages. Reconstruction, as those you quote point out, is another matter. The size of the reconstructed word stock is a consequence of how many daughter languages survive, how well they are attested, how old the family is, how many person-years of scholarly effort have gone into the reconstruction, etc. These are matters of historical contingency rather than anything like language type.

  2. Jess Tauber said,

    July 6, 2014 @ 2:52 pm

    The number of morphemes may also reflect the morphosyntactic type, as has been noted before by others (for example Michael Fortescue). The more a language depends upon active synthesis the fewer basic forms it will need, whereas a language that is highly analytic will have many more basic forms, themselves the result of earlier states of the language plus following simplifications.

  3. Victor Mair said,

    July 6, 2014 @ 2:56 pm

    From Philip Jones:

    Very interesting.

    Just a few very quick remarks that hopefully don't miss the point completely:

    – If there is a psychological limit to the number of core etyma, should this theory also apply to modern English – if so, how many do we have?

    – Of those 7.5k Sumerian lemmata, a number are Semitic loanwords

    – the sign inventory of cuneiform is skewed towards the accounting requirements of the inventors of the script and probably doesn't have much to tell us about core etyma.

  4. Stephan Stiller said,

    July 6, 2014 @ 2:56 pm

    A rather large pronunciation dictionary of German which I recently held in my hands says that the active vocabulary of an "average" native speaker of German has about 12000-16000 "Basiswörter" (whatever that's supposed to mean – presumably something involving root and lexeme counts), and about 3500 of these are (according to the book) loanwords. Languages with shorter modern literary traditions will have less at present.

    Alexander Arguelles supposedly stated that languages have 5000 "words" for an uneducated native speaker, 10000 for one with higher education, and not more than 20000 for fiction. This matches the fact that vocabulary lists for beginning learners tend to have 3000+ lexemic entries; the better ones have perhaps 5000+, and the largest I've seen have around 15000+. My own list of English lexemic entities reflects the fact that English has plenty of Latinate and Greek loanwords and in general plenty of foreign words. Note that many languages of the world are not used in "higher education". For 1350 root words (the middle of VHM's guess of 1200-1500) and 5000 "words" (which means we will have more lexemic entities), we get an average multiplicity of 3.7 for each root. For an estimated 9000 "native" words of a language which is only partly used in "higher education" (or imports its words there from English) and a short literary tradition (since I'm interested in the vernaculars here, since we're talking about cognitive limitations/constraints/tendencies), we get a multiplicity of 6.7. Counting 20000, we get a multiplicity of 14.8. Let's remember those numbers.

    Note that Semitic with its supposedly very productive morphology based on tri- and quadriliteral roots has been given such high counts by people. I think this reflects the fact that derivational morphology is so often lexicalized. Cognitively, this whole 'triliteral roots' thing doesn't gain the learner all that much.
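The "multiplicity" figures in the comment above are ratios of vocabulary size to the number of roots. A minimal sketch of that division, assuming 1,350 roots as the divisor throughout (the helper name `multiplicity` is hypothetical):

```python
ROOTS = 1350  # midpoint of the 1,200-1,500 estimate used in the comment


def multiplicity(vocabulary_size, roots=ROOTS):
    """Average number of vocabulary items per root, rounded to one decimal."""
    return round(vocabulary_size / roots, 1)


for size in (5000, 9000, 20000):
    print(size, multiplicity(size))
```

The three vocabulary sizes give 3.7, 6.7 and 14.8 respectively, matching the figures in the comment.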

  5. Daniel Rocha said,

    July 7, 2014 @ 12:53 am

    It seems that the different estimates should take into consideration the size of the corpus of the language, or reconstructed language. Then, after this, fit a frequency distribution.

    This distribution is usually taken into consideration in Hanzi learning. There is nothing that would exclude the possibility of a language having an infinite number of roots. But there is a point where the use of roots becomes too cumbersome or redundant.

    So, for example, if each hanzi accounts for a root, 1500 hanzi correspond to ~95% of the vocabulary and 2500 to >99%. Some more information, here:

    So if you arrive at a number of roots, it probably means you have found a reasonable inventory that covers a large part of the vocabulary. But that doesn't mean it is all of it, so it is important to have an idea of the frequency estimates.
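The coverage idea in the comment above can be made concrete: given an occurrence count for each root (or character), sort the counts from most to least frequent and see how many items are needed before their running total reaches a target share of all occurrences. The sketch below uses invented toy counts purely for illustration; the function name `items_needed_for_coverage` is hypothetical, and no real corpus figures are implied.

```python
def items_needed_for_coverage(counts, target):
    """Return how many of the most frequent items cover `target` of all tokens.

    counts: iterable of occurrence counts, one per root or character.
    target: desired share of total occurrences, between 0 and 1.
    """
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    running = 0
    for n_items, count in enumerate(ordered, start=1):
        running += count
        if running >= target * total:
            return n_items
    return len(ordered)


# Toy counts (invented): a few very common items and a long tail of rare ones
toy_counts = [500, 250, 120, 60, 30, 15, 10, 5, 5, 5]
print(items_needed_for_coverage(toy_counts, 0.95))
```

Run against a real frequency list, the same function would show how quickly the count of required items grows as the target moves from 95% towards 100%, which is the long-tail effect the comment describes.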

  6. Daniel Rocha said,

    July 7, 2014 @ 1:06 am

    Some freq. lists for other languages:

  7. Lane said,

    July 7, 2014 @ 4:29 am

    Stephan Stiller, if you mean that "derivations from a Semitic root can come to firmly mean something so different from the root that the root doesn't always help that much," you're right. (I hope I understood you right.)

    Picking up my Hans Wehr Arabic dictionary and opening it to a random page and looking for a long entry, I find the root sh-h-d, well known for "shahada" (the 'testimony' that one is a Muslim, "there is no god but god" etc.) and "shaheed", "martyr."

    The core meaning is given as "witness". But its derivatives include

    witness, see, view, inspect, call, to utter the Moslem profession of faith, to call as a witness, to die as a martyr, honey, honeycomb, carbuncle, martyr, deposition, place of assembly, place where a hero dies, sight, written certification, death of a martyr, present, an oblong & upright tombstone, spectator…

    So this is a good example – the core meaning "witness" gets you a lot (including "sight" and "written certification") but won't help you with a whole nother stream of meaning that developed out of the "martyr" meaning. And where "honey" and "carbuncle" come from is anybody's guess. Loanwords?

    Interesting that "martyr" went from "witness" to "one who dies for a cause" in both Greek and Arabic. Direct influence, or some logical connection that influenced both developments?

  8. Brian Joseph said,

    July 7, 2014 @ 12:28 pm

    I think Johanna's points are important ones, but especially when she says "for elementary roots in most languages" — the American Heritage Dictionary root list is not just a compendium of roots for PIE but a set of the relevant roots for *English*, excluding non-IE loanwords. So the real question, perhaps, is what is normal in terms of numbers for the basic roots (a construct that perhaps still needs some careful defining) for natural languages, whether a contemporary spoken language or a corpus-based language like Sumerian or Hittite or a reconstructed proto-language for a language family. The existence of borrowings, if any are taken as "basic" in some sense, would complicate matters and presumably in languages like English where the loans have not always replaced native words would lead to higher numbers of basic roots.

  9. David Marjanović said,

    July 7, 2014 @ 7:20 pm

    Etymological dictionaries of language families have their quirks. For instance, in cases of doubt, Pokorny consistently erred on the side of inclusion, so his dictionary contains rock-solid reconstructions alongside ones that are based on poorly attested words and/or require appeals to irregular developments; the Moscow School has been following this tradition, and that includes Dolgopolsky's etymological dictionary of Nostratic. The LIV does something different, as far as I know, that may also distort the numbers: it contains projections into PIE of words that are only attested in one branch as long as there's no evidence that they're loans from outside of IE.

    "At the other end of the scale, there seems to be an upper limit to the number of primary conceptual categories that the mind is capable of processing."

    I rather think that the limiting factor is how often a concept comes up. If you have a basic word for something you only talk about once every 20 or 30 years, chances are good you'll simply forget it, and chances are good it won't be passed on to the next generation. When the topic does come up, it's easier to create a new word on the spot from derivation, compounding or metaphor than to remember an extremely rare one that already exists.

  10. Irene Fuerst said,

    July 8, 2014 @ 11:20 pm

    I am not a linguist, I am just a person with a lot of time on my hands.

    As I understand it, the question you are asking is not the number of root words, but the number of root concepts, which is different. This is probably a question that's been addressed in psychology or cognitive science. You might want to consider the evolution of new languages such as pidgins, creoles, or spontaneous sign languages.

  11. Victor Mair said,

    July 15, 2014 @ 6:20 pm

    Valuable discussion here, including, among other interesting topics, disyllabic morphemes in Sinitic:


