Language Log

New results on Austronesian linguistic phylogeny

January 23, 2009 @ 9:55 am · Filed by Mark Liberman under Computational linguistics, Linguistic history

Published today: R. D. Gray, A. J. Drummond, and S. J. Greenhill, "Language Phylogenies Reveal Expansion Pulses and Pauses in Pacific Settlement", Science 323(5913):479-483, 23 January 2009. The abstract:

Debates about human prehistory often center on the role that population expansions play in shaping biological and cultural diversity. Hypotheses on the origin of the Austronesian settlers of the Pacific are divided between a recent "pulse-pause" expansion from Taiwan and an older "slow-boat" diffusion from Wallacea. We used lexical data and Bayesian phylogenetic methods to construct a phylogeny of 400 languages. In agreement with the pulse-pause scenario, the language trees place the Austronesian origin in Taiwan approximately 5230 years ago and reveal a series of settlement pauses and expansion pulses linked to technological and social innovations. These results are robust to assumptions about the rooting and calibration of the trees and demonstrate the combined power of linguistic scholarship, database technologies, and computational phylogenetic methods for resolving questions about human prehistory.

An unusually clear explanation of the project, along with a great deal of background information, is available on the web here.

This work follows up on a preliminary study published in 2000 (R.D. Gray and F.M. Jordan, "Language trees support the express-train sequence of Austronesian expansion", Nature 405:1052-1055), as well as earlier work on Indo-European (Russell Gray and Quentin Atkinson, "Language Tree Divergence Times Support the Anatolian Theory of Indo-European Origin", Nature 426: 435-439, 2003).

A complete list of publications on related matters from Gray's lab is here. An especially interesting recent paper discusses the web-accessible database that underlies today's Science paper: S.J. Greenhill, R. Blust, and R.D. Gray, "The Austronesian Basic Vocabulary Database: From Bioinformatics to Lexomics", Evolutionary Bioinformatics 4:271-283, 2008. The abstract:

Phylogenetic methods have revolutionised evolutionary biology and have recently been applied to studies of linguistic and cultural evolution. However, the basic comparative data on the languages of the world required for these analyses is often widely dispersed in hard to obtain sources. Here we outline how our Austronesian Basic Vocabulary Database (ABVD) helps remedy this situation by collating wordlists from over 500 languages into one web-accessible database. We describe the technology underlying the ABVD and discuss the benefits that an evolutionary bioinformatic approach can provide. These include facilitating computational comparative linguistic research, answering questions about human prehistory, enabling syntheses with genetic data, and safe-guarding fragile linguistic information.

LL discussions of the 2003 work include "Dating Indo-European" 12/10/2003; "Glottochronology revisted, very carefully" 4/25/2004; "More on Gray and Atkinson" 4/28/2004; "Gray and Atkinson – Use of binary characters".

More on all this later — meanwhile, there are plenty of links here to enjoy! I'm leaving comments open, but I'd like to urge people to read about the work before sounding off.

13 Comments

James C. said,

January 23, 2009 @ 3:13 pm

I was absolutely floored by Russ Gray’s presentation here at the University of Hawai‘i. I get the feeling that in a few years the use of methods from evolutionary bioinformatics will become a course in many graduate programs in linguistics. What really impressed me was the idea that models of horizontal gene transfer in bacteria and archaea might have applications in language too, and that the Stammbaumtheorie might eventually fall to more complex phylogenetic networks.

The funny thing is that the biologists started out by borrowing the comparative method and family tree models from us linguists, and now we’re coming back to borrow the updated version from them. Things have come full circle in the last century.
marie-lucie said,

January 23, 2009 @ 4:34 pm

In the present case the methods used work and complement each other because there is both a homogeneous population and a single language family (or superfamily), and the spread is relatively recent (very much so in the case of Eastern Polynesia). The results would not necessarily be as coherent in an area such as Eurasia which has a much older population and has seen a number of large-scale, successive migrations and language replacements: see for instance the misfit (?) between genetics and language in Finland and Hungary, alluded to in another thread.
Seadog Driftwood said,

January 23, 2009 @ 4:36 pm

Obviously, like all news, these "new results" should be taken with a grain of salt. There may well be truth in them, but the idea of Indo-European originating from Anatolia has some things working against it. If Early PIE (i.e. just before the Anatolian languages (Hittite, Palaic, Luwian, et al.) split off) was spoken circa 9000 years ago (i.e. c. 7000 B.C.), that means that there would have had to be a remarkably minuscule rate of linguistic change over several thousand years. Look at the past 2000 years in Anatolia, and the rate of language change – and replacement, for that matter – has been anything but small. Moreover, let's just consider Anatolia between 2000 and 1000 B.C. We've got the Hattic culture being usurped by the Hittites and the slow death of the Hattic language as everyday speech, and the rise of Hittite and then Luwian – and that's just in Central and East Anatolia. There's also the Kaska tribes and the people of Hayasa (if they constitute a different culture) in the north, and there are droves of languages, such as Lemnian, Minoan, Pelasgian, possibly residual Etruscan (if they came from the Aegean coast of Anatolia), Greek, and other lost languages in Western Anatolia. Then just before 1100 B.C., the "Sea Peoples" invade and redraw the map again.
Admittedly, new technologies came into play that facilitated changes, but even relatively isolated languages change over time (Icelandic, despite having a low rate of linguistic change, HAS changed a little over the past millennium). Anatolia is not what one would call isolated.
I'm not dismissing the Anatolian theory – it may turn out to be right, but I can't help but feel unsettled by the problems that arise. Then again, I have a habit of playing devil's advocate.
Basically, stay skeptical.
Yoram said,

January 23, 2009 @ 4:49 pm

Looking at the Polynesian part of the big family tree, I was surprised to see that quite a few nodes are wildly out of place (Hawaiian with Tahitic languages, Samoan with Tongan), which might not affect the overall conclusion, but needs explaining.

At the other end of the tree is Old Chinese, presumably owing to Sagart's Sino-Austronesian, which as far as I know is considered a promising direction but by no means universally accepted. That also requires a comment, but I haven't seen the paper yet.

The same issue has an article on the phylogeny of Helicobacter pylori in Pacific populations, which apparently reaches similar conclusions.
Yoram said,

January 23, 2009 @ 5:26 pm

And, to paraphrase Zippy, a few items up:

"Comparative linguistics?! Griffy! That is so 20th century! Not to mention 19th century & possibly 18th century!"

"Computers have their functions, Zippy… but I will continue establishing sound correspondences and constructing family trees using shared innovations!"

And finally,

"Bayesian! parsimony! phylogenetic! Bayesian! parsimony! phylogenetic! Bayesian! parsimony! phylogenetic!"

"Bayesian, parsimony and phylogenetic? Great examples of lexical and morphological borrowing from French, Latin and Greek! Run down to the 19th century and get me copies of Lewis and Short and of Liddell and Scott!"
David Marjanović said,

January 23, 2009 @ 6:46 pm

Concerning the post on the use of binary characters by Gray & Atkinson (2003), let me just confirm that they really should have used multistate characters for cognate sets with the same meaning. This is no less obvious in biology than in linguistics. It is possible to code a taxon/language as polymorphic, and the phylogenetics programs can deal with this.

By turning everything into binary characters, the data matrix by Gray & Atkinson drastically exaggerates the number of characters and therefore the number of differences between languages that have different cognate sets for the same meaning — most of the time that probably means it exaggerates the differences between the branches of Indo-European. And this should lead to an inflated age estimate.

This is completely independent of where PIE was spoken.
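
The coding point above can be made concrete with a small sketch. Everything in it is invented for illustration (three hypothetical languages, three meanings, arbitrary cognate-class labels); it is not Gray & Atkinson's data or their software. It shows only that recoding one multistate character per meaning into one presence/absence character per cognate class raises the character count and doubles the pairwise difference count; whether and how much this affects a date estimate depends on the model used and is not shown here.

```python
# Toy illustration only: languages, meanings and cognate-class labels are invented.
from itertools import combinations

# Multistate coding: one character per meaning; the value is the cognate class.
multistate = {
    "LangA": {"hand": "c1", "water": "c1", "fire": "c1"},
    "LangB": {"hand": "c1", "water": "c2", "fire": "c1"},
    "LangC": {"hand": "c2", "water": "c3", "fire": "c1"},
}

def to_binary(matrix):
    """Recode every (meaning, cognate class) pair as its own 0/1 character."""
    columns = sorted({(m, c) for row in matrix.values() for m, c in row.items()})
    return {
        lang: {col: int(row.get(col[0]) == col[1]) for col in columns}
        for lang, row in matrix.items()
    }

def differences(matrix, a, b):
    """Count the characters on which languages a and b differ."""
    return sum(matrix[a][k] != matrix[b][k] for k in matrix[a])

binary = to_binary(multistate)
n_multistate_chars = len(next(iter(multistate.values())))
n_binary_chars = len(next(iter(binary.values())))
print("characters:", n_multistate_chars, "multistate,", n_binary_chars, "binary")

for a, b in combinations(sorted(multistate), 2):
    print(a, b, "multistate:", differences(multistate, a, b),
          "binary:", differences(binary, a, b))
```

In this toy matrix every meaning on which two languages disagree contributes one difference under multistate coding and two under binary coding, because each language has a 1 where the other has a 0 in two separate columns.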

(And, BTW, Icelandic has done zany things to its pronunciation. Á is [au], ll is [tɬ], and so on…)

At the other end of the tree is Old Chinese, presumably owing to Sagart's Sino-Austronesian, which as far as I know is considered a promising direction but by no means universally accepted.

Sagart himself has dropped it. He now considers Sino-Tibetan and Austronesian to be sister-groups, as opposed to just Chinese alone and Austronesian (his previous view).

Laurent Sagart (2005): Sino-Tibetan-Austronesian: an updated and improved argument, pp. 161–176 in Laurent Sagart, Roger Blench & Alicia Sánchez-Mazas: The Peopling of East Asia: putting together archaeology, linguistics and genetics, Routledge Curzon.

or, more accessibly, footnote 1 in

Laurent Sagart (2006): [book review of James A. Matisoff (2003): Handbook of Proto-Tibeto-Burman: system and philosophy of Sino-Tibeto-Burman reconstruction, University of California Press], Diachronica 23(1), 206–223.

"Computers have their functions, Zippy… but I will continue establishing sound correspondences and constructing family trees using shared innovations!"

Way to embarrass yourself. You still need to establish sound correspondences to make a data matrix. For calculating the most parsimonious tree from the matrix, the programs only use shared innovations; that's half of the whole point of cladistics. The other half is that the principle of parsimony ( = basic science theory) is strictly applied; traditionally, historical linguists don't count the steps to find out if the hypothesis they get really is the most parsimonious one.
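
As a rough illustration of what "counting the steps" means, here is a minimal sketch of Fitch's small-parsimony count on a toy example. The four languages, the three binary characters and the two candidate trees are all made up, and Fitch's algorithm is just one standard way to count changes for unordered characters; nothing here is claimed to be what any particular program mentioned in this thread does.

```python
# Toy sketch: tree shapes, language names and character states are invented.

def fitch(tree, states):
    """Return (candidate ancestral states, minimum number of changes) for one character.

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping each leaf name to its observed state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra change needed here.
        return common, left_steps + right_steps
    # No shared state: at least one change must occur at this node.
    return left_set | right_set, left_steps + right_steps + 1

def tree_length(tree, characters):
    """Total number of changes the tree requires across all characters."""
    return sum(fitch(tree, character)[1] for character in characters)

# Each dict is one character (1 = innovation present, 0 = absent).
characters = [
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 0, "B": 1, "C": 1, "D": 0},
]

tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print("tree_1 length:", tree_length(tree_1, characters))
print("tree_2 length:", tree_length(tree_2, characters))
```

On these invented data the tree that groups A with B needs 4 changes and the one that groups A with C needs 6, so parsimony prefers the first. A real analysis would search over many possible trees rather than compare two by hand.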
Yoram said,

January 23, 2009 @ 7:13 pm

Sagart himself has dropped it. He now considers Sino-Tibetan and Austronesian to be sister-groups, as opposed to just Chinese alone and Austronesian (his previous view).

Sure, but in the Austronesian database Old Chinese is a proxy for whatever larger group Chinese is in, and the data and presumably the cognacy judgments were added by Sagart.

Way to embarrass yourself.

Huh? I was just making a funny about old-school historical linguistics not needing these newfangled computers and statistical methods.
David Marjanović said,

January 24, 2009 @ 8:13 am

The funny thing is that the biologists started out by borrowing the comparative method and family tree models from us linguists, and now we’re coming back to borrow the updated version from them. Things have come full circle in the last century.

Even funnier is that "glottochronology" was invented in the 1950s, using much too simple assumptions, and then pretty few improvements came for the rest of the century, but in the 1980s to 1990s, molecular dating was independently invented, used considerably more complex assumptions than all versions of glottochronology, became reasonably reliable around the end of the century, and then turned out to be fairly easily applicable to linguistic data… the most important difference seems to be that reasonably powerful computers were unavailable in the 50s.

(BTW, as far as I've understood, reconstructing a tree and dating its nodes are a single step in glottochronology. Molecular dating doesn't make trees, you have to feed it a tree, normally made by a cladistic method.)

Sure, but in the Austronesian database Old Chinese is a proxy for whatever larger group Chinese is in, and the data and presumably the cognacy judgments were added by Sagart.

Yes, but that still makes sense, because 1) OC is the oldest attested Sino-Tibetan language by far, and 2) Sinitic and Tibeto-Burman probably are sister-groups, with each possessing its own innovations (though more research has definitely yet to be done on that).

Sure, putting actual reconstructed Proto-Sino-Tibetan in that position would have been even better, but the reconstruction of PST doesn't seem to have progressed very far yet (several pretty different partial reconstructions exist).

I was just making a funny about old-school historical linguistics not needing these newfangled computers and statistical methods.

Sorry, I hadn't read the comic yet (I came here directly from Language Hat). My point is that computers simply allow the same old methods to handle way more data at once, in an at least as strict fashion. You should try it :-)
marie-lucie said,

January 24, 2009 @ 12:01 pm

It is true that with computers one can handle a lot more data, but first one has to be sure that the data are a) plentiful (such as lexical data on many living or at least well-documented languages) and b) accurately recorded and classified, and the more languages are included, the less likely it is that the person compiling them has enough personal acquaintance with them to use them effectively: consider for instance the fiasco around Greenberg's "Amerind" (OK, he was not using computers, but there was a staggering percentage of errors of many kinds in his compilation). There is also the problem of little-known extinct languages (e.g. many North American ones) where the amount of data is not really sufficient for computerized methods. And finally, those methods are much more difficult to use with grammatical information and irregularities, which are very important for language classification. The use of computers can facilitate some tasks, it does not replace the expertise of a human being.
David Marjanović said,

January 24, 2009 @ 9:13 pm

the amount of data is not really sufficient for computerized methods.

Oh no, that's not how it works. There are simulation studies (like… several papers in the June 2003 issue of the Journal of Vertebrate Paleontology), as well as empirical evidence, showing that missing data are much, much less of a problem than people used to think.

those methods are much more difficult to use with grammatical information and irregularities

Why?

The use of computers can facilitate some tasks, it does not replace the expertise of a human being.

Indeed not. Without that expertise to build the data matrix, you get "garbage in, garbage out".

————

Incidentally, precisely how much of a fiasco do you think Amerind was? I'm thinking of the 1st- and 2nd-person markers with /n/- and /m/- which (together) are very widespread in the Americas (even though they don't occur in all "Amerind" families) and outright rare elsewhere.
marie-lucie said,

January 25, 2009 @ 1:38 am

Actually, this set of pronouns, frequent but by no means prevalent in the Americas (they also occur individually with other pronouns, or have the opposite or other meanings, or do not exist in many languages, etc), is quite common in Austronesian.
Trond Engen said,

January 27, 2009 @ 9:52 am

The first pulse seems to coincide in time with the introduction of the dog to Australia and the spread of Pama-Nyungan (mentioned by chris here).
Florian Blaschke said,

January 1, 2010 @ 2:31 pm

The big problem with attempts to establish subgrouping with this kind of newfangled method is that they rely on the lexicon, which is a deadly sin in historical linguistics. Subgrouping is established on the grounds of shared innovations, preferably morphological (genetic mutations, or synapomorphies, if you need that kind of biological analogue), nothing else. By relying on lexical similarities, the classification is misled and marred far too easily by secondary convergences like Sprachbund phenomena or shared retention. That's where fallacies like Hawaiian as a Tahitian language and bizarrely unlikely datings for Proto-Indo-European (criticised in an earlier post on this blog – http://itre.cis.upenn.edu/~myl/languagelog/archives/000094.html) come from. Lexicostatistical analyses can never be used to trump traditional classifications or datings; if anything, they're a nice confirmation.

Greetings from a proud (if conservative and rather immune to fads) traditional historical linguist.
