Language Log

Computational phylogeny of Indo-European

June 27, 2025 @ 6:24 am · Filed by Victor Mair under Classification, Computational linguistics, Evolution of language

Alexei S. Kassian and George Starostin, "Do 'language trees with sampled ancestors' really support a 'hybrid model' for the origin of Indo-European? Thoughts on the most recent attempt at yet another IE phylogeny". Humanities and Social Sciences Communications, 12, no. 682 (May 16, 2025).

Abstract

In this paper, we present a brief critical analysis of the data, methodology, and results of the most recent publication on the computational phylogeny of the Indo-European family (Heggarty et al. 2023), comparing them to previous efforts in this area carried out by (roughly) the same team of scholars (informally designated as the “New Zealand school”), as well as concurrent research by scholars belonging to the “Moscow school” of historical linguistics. We show that the general quality of the lexical data used as the basis for classification has significantly improved from earlier studies, reflecting a more careful curation process on the part of qualified historical linguists involved in the project; however, there remain serious issues when it comes to marking cognation between different characters, such as failure (in many cases) to distinguish between true cognacy and areal diffusion and the inability to take into account the influence of the so-called derivational drift (independent morphological formations from the same root in languages belonging to different branches). Considering that both the topological features of the resulting consensus tree and the established datings contradict historical evidence in several major aspects, these shortcomings may partially be responsible for the results. Our principal conclusion is that the correlation between the number of included languages and the size of the list may simply be insufficient for a guaranteed robust topology; either the list should be drastically expanded (not a realistic option for various practical reasons) or the number of compared taxa be reduced, possibly by means of using intermediate reconstructions for ancestral stages instead of multiple languages (the principle advocated by the Moscow school).

Discussion and conclusions

In the previous sections, we have tried to identify several factors that might have been responsible for the dubious topological and chronological results of Heggarty et al. 2023 experiment, not likely to be accepted by the majority of “mainstream” Indo-European linguists. Unfortunately, it is hard to give a definite answer without extensive tests, since, in many respects, the machine-processed Bayesian analysis remains a “black box”. We did, however, conclude at least that this time around, errors in input data are not a key shortcoming of the study (as was highly likely for such previous IE classifications as published by Gray and Atkinson, 2003; Bouckaert et al. 2012), although failure to identify a certain number of non-transparent areal borrowings and/or to distinguish between innovations shared through common ancestry and those arising independently of one another across different lineages (linguistic homoplasy) may have contributed to the skewed topography.

One additional hypothesis is that the number of characters (170 Swadesh concepts) is simply too low for the given number of taxa (161 lects). From the combinatorial and statistical point of view, it is a trivial consideration that more taxa require more characters for robust classification (see Rama and Wichmann, 2018 for attempts at estimation of optimal dataset size for reliable classification of language taxa). Previous IE classifications by Gray, Atkinson et al. involved fewer taxa and more characters (see Table 1 for the comparison).
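As a rough illustration of the combinatorial point (a back-of-the-envelope sketch, not a calculation taken from the paper), the snippet below uses two standard facts about fully resolved unrooted trees: a tree on n taxa has n - 3 internal branches, and there are (2n - 5)!! distinct topologies. The function names are hypothetical. Note also that analyses of this kind typically code each cognate set as a separate binary character, so the ratio of concepts to branches is only a crude proxy for how much signal is available.

```python
from math import log10


def internal_branches(n_taxa: int) -> int:
    """Internal branches in a fully resolved unrooted tree with n_taxa leaves."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return n_taxa - 3


def log10_unrooted_topologies(n_taxa: int) -> float:
    """log10 of (2n - 5)!!, the number of distinct unrooted binary topologies."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    # (2n - 5)!! = 3 * 5 * ... * (2n - 5); the empty product (n = 3) is 1.
    return sum(log10(k) for k in range(3, 2 * n_taxa - 4, 2))


def concepts_per_branch(n_concepts: int, n_taxa: int) -> float:
    """Crude ratio: wordlist concepts available per internal branch to resolve."""
    return n_concepts / internal_branches(n_taxa)


if __name__ == "__main__":
    # Figures quoted in the text: 170 Swadesh concepts, 161 lects.
    n_concepts, n_taxa = 170, 161
    print("internal branches:", internal_branches(n_taxa))
    print("concepts per internal branch:",
          round(concepts_per_branch(n_concepts, n_taxa), 2))
    print("log10(number of topologies):",
          round(log10_unrooted_topologies(n_taxa), 1))
```

With the figures quoted above, there is barely more than one concept per internal branch, while the space of candidate topologies grows super-exponentially with the number of lects. Either raising the concept count or lowering the taxon count improves the ratio, which is the trade-off the authors describe.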

Table 1 suggests that the approach maintained and expanded upon in Heggarty et al. 2023 project can actually be a dead-end in classifying large and diversified language families. In general, the more languages are involved in the procedure, the more characters (Swadesh concepts) are required to make the classification sufficiently robust. Such a task, in turn, requires a huge number of man-hours for wordlist compilation and is inevitably accompanied by various errors, partly due to poor lexicographic sources for some languages, and partly due to the human factor. Likewise, expanding the list of concepts would lead us to less and less stable concepts with vague semantic definitions.

Instead of such an “expansionist” approach, a “reductionist” perspective, such as the one adopted by Kassian, Zhivlov et al. (2021), may be preferable, which places more emphasis on preliminary elimination of the noise factor rather than its increase by manually producing intermediate ancestral state reconstructions (produced by means of a transparent and relatively objective procedure). Unfortunately, use of linguistic reconstructions as characters for modern phylogenetic classifications still seems to be frowned upon by many, if not most, scholars involved in such research — in our opinion, an unwarranted bias that hinders progress in this area.

Overall one could say that Heggarty et al. (2023) at the same time represents an important step forward (in its clearly improved attitude to selection and curation of input data) and, unfortunately, a surprising step back in that the resulting IE tree, in many respects, is even less plausible and less likely to find acceptance in mainstream historical linguistics than the trees previously published by Gray & Atkinson (2003) and by Bouckaert et al. (2012). Consequently, the paper enhances the already serious risk of discrediting the very idea of the usefulness of formal mathematical methods for the genealogical classification of languages; it is highly likely, for instance, that a “classically trained” historical linguist, knowledgeable in both the diachronic aspects of Indo-European languages and such adjacent disciplines as general history and archaeology, but not particularly well versed in computational methods of classification, will walk away from the paper in question with the overall impression that even the best possible linguistic data may yield radically different results depending on all sorts of “tampering” with the complex parameters of the selected methods — and that the authors have intentionally chosen that particular set of parameters which better suits their already existing pre-conceptions of the history and chronology of the spread of Indo-European languages. While we are not necessarily implying that this criticism is true, it at least seems obvious that in a situation of conflict between “classic” and “computational” models of historical linguistics, assuming that the results of the latter automatically override those of the former would be a pseudo-scientific approach; instead, such conflicts should be analyzed and resolved with much more diligence and much deeper analysis than the one presented in Heggarty et al. 2023 study.

Despite all the energetic discussions of our previous attempts, it appears that the question of IE phylogeny has not yet been put to bed.

Selected readings

Heggarty P, Anderson C, Scarborough M et al. (2023) Language trees with sampled ancestors support a hybrid model for the origin of Indo-European languages. Science 381(6656):eabg0818. https://doi.org/10.1126/science.abg0818
"New Indo-European genetic evidence" (2/6/25)
"Where did the PIEs come from; when was that?" (7/28/23)
"The Linguistic Diversity of Aboriginal Europe" (1/6/09) — classic post by Don Ringe
"Horse and wheel in the early history of Indo-European" (1/10/09)
"More on IE wheels and horses" (1/10/09)
"Inheritance versus lexical borrowing: a case with decisive sound-change evidence" (1/13/09)
"The linguistic history of horses, gods, and wheeled vehicles" (1/13/09)
"Some Wanderwörter in Indo-European languages" (1/16/09)
"Don Ringe ties up some loose ends" (2/20/09)
"The place and time of Proto-Indo-European: Another round" (8/24/12)
"Proto-Indo-European laks- > Modern English 'lox'" (12/26/20)

[Thanks to Ted McClure]

5 Comments

Victor Manfredi said,

June 28, 2025 @ 3:42 am

The moral of the story: Swadesh in, rubbish out — i.e. "showy but meaningless number games" (Horace Lunt, "Discussion of I. Dyen", 9th Int. Congress of Linguistics, 27-31 Aug. 1962, Mouton, The Hague, p. 37). Alternatively, if natural languages are more systematic objects than brief lists of translated English glosses — if diachrony is more than traditional "histoire des mots" — why not reconstruct phonology and syntax, controlling normally for universals (typology) and coincidence? Quantified results on these lines have been reported (e.g. Longobardi & al., "Towards a syntactic phylogeny of modern Indo-European languages", J. of Historical Linguistics 3 [2013], 122-52). As for borrowing, Meillet famously observed that Indo-European inflection patterns are the most resistant to horizontal transmission (La methode comparative en linguistique historique, Oslo 1925, p. 22). Otherwise, keep on torturing Swadesh stats in black boxes until they cry uncle…
Victor Mair said,

June 28, 2025 @ 7:16 am

@Victor Manfredi

Merci beaucoup / daalụ / grazie!

Victor Manfredi knows whereof he speaks.

=====

…studies the comparative grammar of the Benue‑Kwa languages of the Niger‑Congo family, and several indigenous cultures of the southern 9ja (or Nàìjá) area. Since 1980 he has taught about these subjects, mostly as a glorified temp (sounds nicer in Russian: kontráktnik), at a dozen universities in Nigeria, Europe and North America. At BU he has offered Ìgbo and Yorùbá (1993‑2000, 2006‑07), Syntax2 (1999‑2000) and Morphology (2006‑07) and served on six doctoral committees in syntax (defended in 1996, 1997, 2004, 2007, 2012, 2013) plus one each in musicology (2014 = 2021) and history (2020). He is currently a research affiliate in BU African Studies Center.

https://people.bu.edu/manfredi/

Also:

"Find that mystery linguist woman" (11/18/06)

Chris Button said,

June 28, 2025 @ 8:46 am

For a PTB take on this PIE question, I recommend Jim Matisoff's 1978 monograph Variational Semantics in Tibeto-Burman

My own personal experience is that you can often elicit a word that is nothing like you expected. If you then reconstruct a hypothetical word based on your (rudimentary) knowledge of the comparative historical phonology involved, you often uncover the word or something similar. However, it may have shifted semantically, become restricted to a specific context, or be an antiquated term that has been largely ousted.
Victor Manfredi said,

June 29, 2025 @ 10:56 am

pullum's account (18 november 2006, linked by VHM above) is overly dramatic. in feb. '77 i was seized along with my elderly host nkama (one of the first 6 police recruits in british eastern nigeria) for refusing to bribe a newby cop after three solicitations, my papeles were legit but hard to read and the interrogators at CID owere dumped ichie nkama on the roadside at 2am, 200km from home. at the hearing 3 days later, my volunteer lawyer mike ikenna ahamba didn't show so i was pro se and unclear on the charge. the case before mine had pinched a bottle of aspirin from a local clinic but when he failed to speak english to the magistrate (njemanze if i recall), he was tossed back into the paddy wagon so hard i heard his skull hit the floor. from this edifying example i deduced that nobody was going to be impressed with my igbo morphophonemics, so i responded to the false charge of sans-papiers in what fe.la called "big big oyibo" — i used the word "facticity" — at which point the learned judge peered at this dirty hippy over his spectacles, bade me approach the bench and released me to the custody of my mentor no.lue. emenanjo., who didn't bat an eye when i was delivered in a squad car to the college of education smelling like codfish — regrettably interrupting his class on ATR harmony. nkama forgave me when next i saw him. fieldwork is ever thus.
David Marjanović said,

July 11, 2025 @ 11:02 am

“Despite all the energetic discussions of our previous attempts, it appears that the question of IE phylogeny has not yet been put to bed.”

To be fair, Heggarty et al. (2023) didn't claim it had been put to bed; they only claimed theirs was the latest greatest attempt so far, and pointed out the areas of uncertainty they found.

Anyway, I agree with the criticism that the dataset is too small. I wrote what is in effect a long, detailed review spread out over several comments and several days starting here when the paper came out. One of my conclusions is the paper should have been titled “Flawed Program Produces Wrong Tree, So We Used Flawed Dataset Instead of Flawed Program”.
