Computational phylogeny of Indo-European
Alexei S. Kassian and George Starostin, "Do 'language trees with sampled ancestors' really support a 'hybrid model' for the origin of Indo-European? Thoughts on the most recent attempt at yet another IE phylogeny". Humanities and Social Sciences Communications, 12, no. 682 (May 16, 2025).
Abstract
In this paper, we present a brief critical analysis of the data, methodology, and results of the most recent publication on the computational phylogeny of the Indo-European family (Heggarty et al. 2023), comparing them to previous efforts in this area carried out by (roughly) the same team of scholars (informally designated as the “New Zealand school”), as well as concurrent research by scholars belonging to the “Moscow school” of historical linguistics. We show that the general quality of the lexical data used as the basis for classification has significantly improved from earlier studies, reflecting a more careful curation process on the part of qualified historical linguists involved in the project; however, there remain serious issues when it comes to marking cognation between different characters, such as failure (in many cases) to distinguish between true cognacy and areal diffusion and the inability to take into account the influence of the so-called derivational drift (independent morphological formations from the same root in languages belonging to different branches). Considering that both the topological features of the resulting consensus tree and the established datings contradict historical evidence in several major aspects, these shortcomings may partially be responsible for the results. Our principal conclusion is that the correlation between the number of included languages and the size of the list may simply be insufficient for a guaranteed robust topology; either the list should be drastically expanded (not a realistic option for various practical reasons) or the number of compared taxa be reduced, possibly by means of using intermediate reconstructions for ancestral stages instead of multiple languages (the principle advocated by the Moscow school).
Discussion and conclusions
In the previous sections, we have tried to identify several factors that might have been responsible for the dubious topological and chronological results of the Heggarty et al. 2023 experiment, which are not likely to be accepted by the majority of “mainstream” Indo-European linguists. Unfortunately, it is hard to give a definite answer without extensive tests, since, in many respects, the machine-processed Bayesian analysis remains a “black box”. We did, however, conclude at least that this time around, errors in input data are not a key shortcoming of the study (as was highly likely for such previous IE classifications as those published by Gray and Atkinson, 2003; Bouckaert et al. 2012), although failure to identify a certain number of non-transparent areal borrowings and/or to distinguish between innovations shared through common ancestry and those arising independently of one another across different lineages (linguistic homoplasy) may have contributed to the skewed topology.
One additional hypothesis is that the number of characters (170 Swadesh concepts) is simply too low for the given number of taxa (161 lects). From the combinatorial and statistical point of view, it is a trivial consideration that more taxa require more characters for robust classification (see Rama and Wichmann, 2018 for attempts at estimation of optimal dataset size for reliable classification of language taxa). Previous IE classifications by Gray, Atkinson et al. involved fewer taxa and more characters (see Table 1 for the comparison).
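The combinatorial point can be made concrete with a rough back-of-the-envelope sketch. It uses only the two figures quoted above (170 concepts, 161 lects) and two standard facts about unrooted binary trees: n labelled taxa admit (2n-5)!! distinct topologies, and each such tree has n-3 internal branches to resolve. The function names are hypothetical, and the concept count is only a crude proxy: analyses of this kind typically code each cognate set as its own binary character, so the effective number of characters is larger than the number of concepts.

```python
import math


def log10_unrooted_topologies(n_taxa: int) -> float:
    """Return log10 of (2n-5)!!, the number of unrooted binary
    tree topologies on n labelled taxa (1 topology for n <= 3)."""
    if n_taxa < 3:
        return 0.0
    # (2n-5)!! is the product of the odd numbers 1, 3, ..., 2n-5.
    return sum(math.log10(k) for k in range(1, 2 * n_taxa - 4, 2))


def internal_branches(n_taxa: int) -> int:
    """Number of internal branches (splits) in an unrooted binary tree."""
    return max(n_taxa - 3, 0)


def concepts_per_taxon(n_concepts: int, n_taxa: int) -> float:
    """Crude density measure: concepts available per language compared."""
    return n_concepts / n_taxa


if __name__ == "__main__":
    n_concepts, n_taxa = 170, 161  # figures quoted in the text above
    print(f"log10(number of topologies): {log10_unrooted_topologies(n_taxa):.1f}")
    print(f"internal branches to resolve: {internal_branches(n_taxa)}")
    print(f"concepts per taxon: {concepts_per_taxon(n_concepts, n_taxa):.3f}")
```

On these figures the number of concepts is barely larger than the number of internal branches, which is one informal way to read the authors' worry. Reducing the taxa, for instance by replacing groups of lects with intermediate reconstructions, shrinks the tree space much faster than it shrinks the data.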
Table 1 suggests that the approach maintained and expanded upon in the Heggarty et al. 2023 project may actually be a dead end for classifying large and diversified language families. In general, the more languages are involved in the procedure, the more characters (Swadesh concepts) are required to make the classification sufficiently robust. Such a task, in turn, requires a huge number of man-hours for wordlist compilation and is inevitably accompanied by various errors, partly due to poor lexicographic sources for some languages, and partly due to the human factor. Likewise, expanding the list of concepts would force the inclusion of less and less stable concepts with vague semantic definitions.
Instead of such an “expansionist” approach, a “reductionist” perspective, such as the one adopted by Kassian, Zhivlov et al. (2021), may be preferable: it places more emphasis on preliminary elimination of the noise factor, rather than its increase, by manually producing intermediate ancestral state reconstructions (by means of a transparent and relatively objective procedure). Unfortunately, use of linguistic reconstructions as characters for modern phylogenetic classifications still seems to be frowned upon by many, if not most, scholars involved in such research; in our opinion, this is an unwarranted bias that hinders progress in this area.
Overall, one could say that Heggarty et al. (2023) at the same time represents an important step forward (in its clearly improved attitude to selection and curation of input data) and, unfortunately, a surprising step back, in that the resulting IE tree, in many respects, is even less plausible and less likely to find acceptance in mainstream historical linguistics than the trees previously published by Gray & Atkinson (2003) and by Bouckaert et al. (2012). Consequently, the paper enhances the already serious risk of discrediting the very idea of the usefulness of formal mathematical methods for the genealogical classification of languages. It is highly likely, for instance, that a “classically trained” historical linguist, knowledgeable in both the diachronic aspects of Indo-European languages and such adjacent disciplines as general history and archaeology, but not particularly well versed in computational methods of classification, will walk away from the paper in question with the overall impression that even the best possible linguistic data may yield radically different results depending on all sorts of “tampering” with the complex parameters of the selected methods, and that the authors have intentionally chosen that particular set of parameters which better suits their already existing preconceptions of the history and chronology of the spread of Indo-European languages. While we are not necessarily implying that this criticism is true, it at least seems obvious that in a situation of conflict between “classic” and “computational” models of historical linguistics, assuming that the results of the latter automatically override those of the former would be a pseudo-scientific approach; instead, such conflicts should be analyzed and resolved with much more diligence and much deeper analysis than that presented in the Heggarty et al. 2023 study.
Despite all the energetic discussions of our previous attempts, it appears that the question of IE phylogeny has not yet been put to bed.
Selected readings
- Heggarty P, Anderson C, Scarborough M et al. (2023) Language trees with sampled ancestors support a hybrid model for the origin of Indo-European languages. Science 381(6656):eabg0818. https://doi.org/10.1126/science.abg0818
- "New Indo-European genetic evidence" (2/6/25)
- "Where did the PIEs come from; when was that?" (7/28/23)
- "The Linguistic Diversity of Aboriginal Europe" (1/6/09) — classic post by Don Ringe
- "Horse and wheel in the early history of Indo-European" (1/10/09)
- "More on IE wheels and horses" (1/10/09)
- "Inheritance versus lexical borrowing: a case with decisive sound-change evidence" (1/13/09)
- "The linguistic history of horses, gods, and wheeled vehicles" (1/13/09)
- "Some Wanderwörter in Indo-European languages" (1/16/09)
- "Don Ringe ties up some loose ends" (2/20/09)
- "The place and time of Proto-Indo-European: Another round" (8/24/12)
- "Proto-Indo-European laks- > Modern English 'lox'" (12/26/20)
[Thanks to Ted McClure]