Indo-European borrowing


The abstract of Shijulal Nelson-Sathi et al., "Networks uncover hidden lexical borrowing in Indo-European language evolution", Proc. Roy. Soc. B, published online 11/24/2010:

Language evolution is traditionally described in terms of family trees with ancestral languages splitting into descendent languages. However, it has long been recognized that language evolution also entails horizontal components, most commonly through lexical borrowing. For example, the English language was heavily influenced by Old Norse and Old French; eight per cent of its basic vocabulary is borrowed. Borrowing is a distinctly non-tree-like process—akin to horizontal gene transfer in genome evolution—that cannot be recovered by phylogenetic trees. Here, we infer the frequency of hidden borrowing among 2346 cognates (etymologically related words) of basic vocabulary distributed across 84 Indo-European languages. The dataset includes 124 (5%) known borrowings. Applying the uniformitarian principle to inventory dynamics in past and present basic vocabularies, we find that 1373 (61%) of the cognates have been affected by borrowing during their history. Our approach correctly identified 117 (94%) known borrowings. Reconstructed phylogenetic networks that capture both vertical and horizontal components of evolutionary history reveal that, on average, eight per cent of the words of basic vocabulary in each Indo-European language were involved in borrowing during evolution. Basic vocabulary is often assumed to be relatively resistant to borrowing. Our results indicate that the impact of borrowing is far more widespread than previously thought.

I haven't had time to understand the article's methods and their relationship to its conclusions.  But my initial reaction is that it's not at all surprising to conclude that "on average, eight per cent of the words of basic vocabulary in each Indo-European language were involved in borrowing during evolution" over the thousands of years since proto-IE unity.

My second reaction is that traditional methods of historical reconstruction — as well as some recent formalizations of these methods — don't rely on cognate percentages, nor do they limit themselves to basic vocabulary, precisely because of the possibility of unknown amounts of unknown layers of borrowing, convergent change, and so forth. Instead, such methods look for evidence of (things like) systematic sound changes that cut across the reconstructed vocabulary as a whole, whether "basic" or specialized and obscure.
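
As a rough and purely illustrative sketch of what "systematic" means here (the English/Latin word pairs and the crude initial-segment heuristic below are my own toy choices, not anything from the paper), one can tally which initial sounds repeatedly correspond across a set of suspected cognates:

    # A toy tally of initial-sound correspondences across suspected
    # English/Latin cognates. The pairs and the crude "initial segment"
    # extraction are illustrative simplifications, not a reconstruction method.
    from collections import Counter

    SUSPECTED_COGNATES = [
        ("father", "pater"), ("fish", "piscis"), ("foot", "pes"),
        ("three", "tres"), ("thin", "tenuis"),
        ("horn", "cornu"), ("hundred", "centum"), ("heart", "cor"),
    ]

    def initial_segment(word):
        # Very rough stand-in for phonological analysis: first letter,
        # except that "th" is treated as a single segment.
        return word[:2] if word.startswith("th") else word[:1]

    correspondences = Counter(
        (initial_segment(eng), initial_segment(lat))
        for eng, lat in SUSPECTED_COGNATES
    )

    for (eng_seg, lat_seg), count in correspondences.most_common():
        print("English %r ~ Latin %r: %d pair(s)" % (eng_seg, lat_seg, count))

    # Recurring correspondences (f~p, th~t, h~c) are the signature of regular
    # sound change (here, Grimm's law) and hence of inheritance; an isolated
    # look-alike with no such pattern is more likely chance or borrowing.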

On the other hand, I'm not an Indo-Europeanist. More on this at some later time…

[Update — see the comment from Tom D. for some better information.]



45 Comments

  1. GeorgeW said,

    December 10, 2010 @ 7:34 am

    I glanced the article over and realized that it is much, much too dense for any cursory review (at least for me).

    In any event, I am not clear on what new insights this provides. Is it the unappreciated degree of borrowing (vs. inherited) among European languages, the degree of borrowing of basic vocabulary among them or something else?

    [(myl) The argument between "tree" and "wave" theories of language evolution goes back to the middle of the 19th century, and it will not come as any surprise to historical linguists that borrowing commonly occurs, even in so-called basic vocabulary. (Of course, the "wave" theories don't assume that borrowing is only or typically an isolated random transfer across branches of the tree generated by "descent with modification"; rather they see the most important factor being innovations that spread systematically through the social networks of the time, independent of historical relationships.)

    There's a long (and mostly not very distinguished) history of attempts to use tables of apparent cognate relationships and other isolated traits to infer linguistic histories, without consideration of systematic sound changes and thus missing the most critical evidence for distinguishing inheritance from borrowing. (See here for some discussion and links.) I guess this article may be seen as a warning against such efforts, added to many others.

    Some (recent) algorithms for phylogenetic reconstruction (most extensively used in biology, not linguistics) do not include any way to model transfer across lineages (i.e. borrowing); others do include various ways to allow (at least in principle) for borrowing. Part of the interest of this article is presumably in its particular ways of modeling borrowing and estimating its frequency in a particular dataset — but I haven't read it carefully enough to figure out how novel these methods are.

    The methods used in this article to estimate amounts of borrowing can be compared to those used in Katerina Rexová et al., "Cladistic analysis of languages: Indo-European classification based on lexicostatistical data", Cladistics 19(2), 2003, to reach a very different conclusion. Their abstract:

    The phylogeny of the Indo-European (IE) language family is reconstructed by application of the cladistic methodology to the lexicostatistical dataset collected by Dyen (about 200 meanings, 84 speech varieties, the Hittite language used as a functional outgroup). Three different methods of character coding provide trees that show: (a) the presence of four groups, viz., Balto-Slavonic clade, Romano-Germano-Celtic clade, Armenian-Greek group, and Indo-Iranian group (the two last groups possibly paraphyletic); (b) the unstable position of the Albanian language; (c) the unstable pattern of the basalmost IE differentiation; but (d) the probable existence of the Balto-Slavonic–Indo-Iranian (“satem”) and the Romano-Germano-Celtic (+Albanian?) superclades. The results are compared with the phenetic approach to lexicostatistical data, the results of which are significantly less informative concerning the basal pattern. The results suggest a predominantly branching pattern of the basic vocabulary phylogeny and little borrowing of individual words. Different scenarios of IE differentiation based on archaeological and genetic information are discussed. (emphasis added)

    I have a generally high opinion of Russell Gray, who's one of the authors of the Royal Society paper; so my prior prejudice, in advance of careful reading, is to take its conclusions seriously.]

  2. Andrew (not the same one) said,

    December 10, 2010 @ 10:08 am

    What exactly is basic vocabulary? Would I be right in thinking that many of the words in this abstract (which strikes me as containing a notably high number of borrowed words – language, evolution, traditionally, etc.) are not basic?

    [(myl) In this context, "basic vocabulary" generally means one of the Swadesh lists.]

  3. EJP said,

    December 10, 2010 @ 10:43 am

    I am also not sure what to make of this article. Historical linguists are definitely aware of the role of borrowings in reconstructing proto languages. In fact the traditional comparative method is a good tool for separating borrowings from non-borrowings.

    I would agree that there is a need to improve analysis of the different facets of language change, but I feel like many attempts are trying to avoid the comparative method instead of building on what is already known.

    Maybe I am misreading the article, but at first glance it does look like a paper in which geneticists (or genetic analysis) are trying to save mathematically unsophisticated historical linguists from ourselves.

  4. Sally Thomason said,

    December 10, 2010 @ 10:57 am

    The Shijulal Nelson-Sathi et al. quotation contains at least one howler, though: it concludes, "the impact of borrowing is far more widespread than previously thought". Maybe the authors could locate some seriously out-of-touch scholars who think that 8% borrowed basic vocabulary is startling, but no modern knowledgeable historical linguist (and I bet few if any non-modern ones) would ever make such a claim. There are other potential problems with the authors' premises too.

    First, lexical borrowing is only the most common effect of contact when you're looking at language maintenance situations, where people typically borrow from a second language into their first language. Linguistic interference in situations of language shift differs sharply: there, vocabulary lags behind phonological and syntactic diffusion. (English is sometimes claimed as a counterexample to this generalization, because lexical borrowing from French into English was massive in spite of the fact that it was a shift situation, from French to English; but this was a case of superstrate shift, which is pretty rare. I don't think it's a real counterexample, for reasons I won't go into here.)

    Second, the authors' application of a uniformitarian principle, though probably necessary for their statistical approach, is not going to yield reliable results in all cases, which means that it can't safely be generalized. To take just one problem, some communities have what amounts to a cultural ban on lexical borrowing; new lexical items in those cultures are coined from native material, sometimes as calques (which of course won't show up as loanwords in the authors' statistics even though they're due to language contact). I don't know of any Indo-European cultures where this is true, but I bet the authors haven't checked out this possibility, or other cultural factors (like lack of extensive contacts!) that could skew their results.

  5. GeorgeW said,

    December 10, 2010 @ 11:13 am

    @Sally: " some communities have what amounts to a cultural ban on lexical borrowing . . ."

    Don't many have some degree of cultural constraint and the amount of borrowing would be influenced by the relative level of resistance?

  6. John Lawler said,

    December 10, 2010 @ 11:23 am

    @Sally: How about Gm. Hauptwort and Zeitwort, in preference to Latinate noun and verb? Or, less linguistically, Rundfunk and Fernsehapparat? I wouldn't call it a "cultural ban", exactly, but it's certainly a strong tendency that has affected basic vocabulary.

  7. Leo said,

    December 10, 2010 @ 11:39 am

    Icelandic, which is Indo-European, is a prototypical case of a language with a ban on lexical borrowing.

    [(myl) According to A. M. Hilmarsson-Dunn, "Protectionist Language Policies in the Face of the Forces of English. The Case of Iceland", 2006:

    The Icelandic Government has reacted to the threat of English by implementing a protectionist language policy, the two cornerstones of which are preservation and enhancement of the Icelandic language. This policy covers such areas as education, the media and information technology. Despite such efforts, however, English is on the rise.

    The author also explains that

    Language planning in Iceland probably began as the result of hostility to the Danish language. Although the Icelandic Language Council was established by act of parliament as recently as 1964, the pure language movement (hreinraektarstefnan) had been active in Iceland for 300 years in trying to rid the language of Danish words. This movement gained momentum in the fight for independence from Denmark.

    ]

    John Lawler – my impression is that in today's German, Hauptwort, Zeitwort etc are largely replaced by Latinate words: Substantiv (or Nomen)*, Verb (my dictionary also gives Verbum), Adjektiv.

    *"noun"

  8. marie-lucie said,

    December 10, 2010 @ 12:41 pm

    I agree with Sally that finding 8% borrowings, even in common words, is hardly unexpected in historical work. I will add that "200 meanings" do not necessarily reveal cognacy, because of frequent meaning shift: for instance, the meaning "dog" will not reveal English and German cognacy, because the English cognate of German Hund is hound, a specialized word which does not show up among the 200 commonest English words. In order to find the cognates, one has to rely on established correspondences based on the findings of the comparative method, not on a list of modern words with their current meanings (nobody would immediately link English head and Latin caput). And borrowing is not necessarily the answer if the language from which a word might be borrowed is unknown, or if a strange word in one language no longer has cognates in related languages.

  9. Will said,

    December 10, 2010 @ 1:12 pm

    I don't understand how the words on the Swadesh list were chosen. It has the number "five" but not a single emotion? How is "five" a more basic word than "happy"?

  10. Kaviani said,

    December 10, 2010 @ 1:16 pm

    The conclusion seems like common sense, but I suppose even common sense should be researched and analyzed.

    Humans do not all communicate in the same way all the time, even within a given dialect. Lower registers of any language are more likely to mutate in one form or another. If these researchers are only relying on the basic vocabulary per the frozen registers of these IE languages, they are ignoring an ungodly amount of data.

  11. GeorgeW said,

    December 10, 2010 @ 1:54 pm

    @Will: Every language is more likely to have a word for 'five' acquired young and used with high frequency than 'alligator.' Everyone has a mother and father and every language would be expected to have words for them. But, not every speech community has a parliament.
    So, it is more likely that 'parliament' would be borrowed than 'mother.'

  12. Aaron Toivo said,

    December 10, 2010 @ 2:16 pm

    @Will: I believe the list's chief purpose is to cover vocabulary items believed to be highly resistant to borrowing. The most basic vocab in a language will tend to be more resistant than less-basic vocab, but there are other factors too – e.g. I could see making a case for emotional vocabulary being more plastic, however basic. Consider the semantic evolution behind English words like "gay", for example, and there are many other such cases.

  13. J. W. Brewer said,

    December 10, 2010 @ 2:32 pm

    I love that someone has done a Swadesh list for Esperanto:
    http://en.wiktionary.org/wiki/Appendix:Esperanto_Swadesh_list. Try using that as input into your mechanism for distinguishing trees from waves and see what happens!

  14. Morgan said,

    December 10, 2010 @ 3:24 pm

    @ marie-lucie: "In order to find the cognates, one has to rely on established correspondences based on the findings of the comparative method, not on a list of modern words with their current meanings (nobody would immediately link English head and Latin caput)."

    This is a common statement from linguists, but I don't see why it should be the case. In fact, as a sweeping statement it is certainly false – cognates *can* be probabilistically identified based on measures of current semantic and phonetic similarity, e.g. see…

    http://webdocs.cs.ualberta.ca/~kondrak/papers/naacl01.pdf

    But even in a more limited sense – that only cognate pairs identified using the comparative method can be used for identification of relationships between languages – the statement doesn't seem to hold up.

    While the degree of certainty that any given pair of words identified as cognate really are cognate may be much higher for pairs developed using the comparative method than for those identified simply on the basis of distance metrics, there is no reason (in principle) that this elimination of false positives would cause it to perform better for purposes of identifying relatedness. A distance-based approach would more adeptly handle true cognates that were subject to irregular phonetic or semantic change over time, and that advantage might outweigh the elimination of false positives.

    And as a criticism of computational methods for performing comparisons across languages in general (which I won't ascribe to you, but which is also a common theme), the critique is even more misplaced. If the comparative method can be algorithmically described, it can be automated.
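
    For what it's worth, here is a minimal sketch of the kind of distance-based screening described above. It is not Kondrak's ALINE or any published method; the English/German word list, the plain Levenshtein similarity and the 0.5 cut-off are illustrative assumptions only:

    def edit_distance(a, b):
        # Plain Levenshtein distance via dynamic programming.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,               # deletion
                                curr[j - 1] + 1,           # insertion
                                prev[j - 1] + (ca != cb))) # substitution
            prev = curr
        return prev[-1]

    def similarity(a, b):
        # 1.0 for identical strings, 0.0 for maximally different ones.
        return 1.0 - edit_distance(a, b) / max(len(a), len(b))

    # Toy meaning slots with (English, German) word pairs.
    SLOTS = {
        "hand": ("hand", "hand"),
        "night": ("night", "nacht"),
        "water": ("water", "wasser"),
        "dog": ("dog", "hund"),
    }

    THRESHOLD = 0.5  # arbitrary cut-off for this illustration
    for meaning, (w1, w2) in SLOTS.items():
        s = similarity(w1, w2)
        verdict = "candidate cognate" if s >= THRESHOLD else "no surface match"
        print("%6s: %s / %s  similarity=%.2f  -> %s" % (meaning, w1, w2, s, verdict))

    # As the hound/Hund discussion above predicts, dog/hund scores near zero
    # even though the languages are related: surface similarity proposes
    # candidates, it does not establish cognacy.

    The point of such a screen is only that it produces a ranked list of candidate pairs with explicit scores; whether those scores track real cognacy is exactly the question raised above.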

  15. Morgan said,

    December 10, 2010 @ 4:04 pm

    @Sally Thomason:

    I believe the uniformitarian principle was invoked only in support of the assumption that the number of synonyms for a concept in proto-languages is consistent with what is observed in modern languages (on a quick read, if I'm wrong, please let me know). I'm sure that you (and virtually every other commenter here) would be more knowledgeable than I regarding the confidence that can be placed in that assumption.

    I believe that their estimate of the proportion of basic vocabulary borrowed is likely to be sensitive to violation of the assumption of similar synonymy, with less synonymy in the proto-language increasing estimates of borrowing versus the alternative explanation for apparent cognateness involving parallel descent of synonyms with differential loss.

  16. ~flow said,

    December 10, 2010 @ 4:06 pm

    i find the findings difficult to gauge. is 8% much? i don't know.

    chinese, as one example, is quite resistant to borrowing; in this language, most imports happen in the form of translation. where borrowings do occur, they're much harder to detect than in many other languages, owing to the chinese penchant for terseness and the relative scarcity of available syllables. originally i wanted to show some examples of how much more similar to the original the english renderings of chinese words are than vice versa, but then found this etymological gem over at http://thefreedictionary.com/typhoon:

    'The history of typhoon presents a perfect example of the long journey that many words made in coming to English. It traveled from Greece to Arabia to India, and also arose independently in China, before assuming its current form in our language. The Greek word tuphōn, used both as the name of the father of the winds and a common noun meaning "whirlwind, typhoon," was borrowed into Arabic during the Middle Ages, when Arabic learning both preserved and expanded the classical heritage and passed it on to Europe and other parts of the world. Ṭūfān, the Arabic version of the Greek word, passed into languages spoken in India, where Arabic-speaking Muslim invaders had settled in the 11th century. Thus the descendant of the Arabic word, passing into English (first recorded in 1588) through an Indian language and appearing in English in forms such as touffon and tufan, originally referred specifically to a severe storm in India. The modern form of typhoon was influenced by a borrowing from the Cantonese variety of Chinese, namely the word taaîfung, and respelled to make it look more like Greek. Taaîfung, meaning literally "great wind," was coincidentally similar to the Arabic borrowing and is first recorded in English guise as tuffoon in 1699. The various forms coalesced and finally became typhoon, a spelling that first appeared in 1819 in Shelley's Prometheus Unbound.'

    so this makes 'typhoon' a profoundly difficult item to discuss; however, it goes to show that at least some words are known to profoundly contradict purely phylogenetic patterns of scientific explication.

    next, i do share the view that small basic word lists are of doubtful value, simply because they're so short. the example of 'hound' vs german 'hund' has been hinted at above; likewise, english 'dog' one may think could be related to the german 'dogge' (a kind of dog). countless more examples exist; 'bottle' is 'flasche' in german, but english has 'flask', too; nevertheless, every schoolchild (in northern germany at least) can easily link 'bottle' to 'buddel', today much regarded as a slang word and somehow related to sailors. in all these examples, looking at short lists will only help to obscure the real level of common inheritance and exchange between the groups concerned. likewise, often a word gets borrowed into another language because of a specific physical item newly introduced; such usages therefore should have a tendency to be rather confined to narrower usages.

    all told, i am very skeptical of purely phylogenetic models of language evolution. i believe it is somehow due to the way people in the west, and especially in europe, in the 19th century got overly preoccupied with their separate conflicting nations and nationalities. let us not forget that an important impetus for modern linguistics came out of the newly founded imperial germany around 1870, when the junggrammatikers set out to prove that sound changes occur according to strict and deterministic rules. people at that time obsessed over finding out how many peoples there were in the world, and how many tribes there were in germany. they also thought they would find clear dialectal zones, much like clearly delineated sub-nations, but it was found that thousands of years of interaction had left only bizarrely gerrymandered and overlapping patterns on the maps.

    people to this day tend to view languages (and dialects) like they were separate biological species, like horses and lions. little do they know that the concept of species itself is an extremely difficult one. look up ring species on wikipedia to see what i mean.

  17. GeorgeW said,

    December 10, 2010 @ 4:45 pm

    One of my favorite types of borrowings (for which there should be a term – recycles?) is Arabic 'amiral' 'admiral.' This was borrowed from one of the European languages (I would guess French). But, it was originally borrowed from Arabic 'amir al-bahr 'commander of the sea.'

  18. J Lee said,

    December 10, 2010 @ 4:56 pm

    Speaking of Swadesh lists, why is it so difficult to find ones with IPA? Have they become more widely used as second-language learning materials than analytical devices?

  19. Aaron Toivo said,

    December 10, 2010 @ 5:43 pm

    @J Lee: that may have something to do with the fact that they were used mostly by glottochronologists, whose heyday was well before it was easy or even possible to use IPA on a computer. And now, of course, that field has passed its best-by date. So I don't know how much serious use Swadesh lists get anymore.

  20. Army1987 said,

    December 10, 2010 @ 5:48 pm

    I don't see why they'd be used as learning materials. "Louse" or "liver" aren't exactly among the 200 most useful words when you've just started learning a language.

  21. marie-lucie said,

    December 10, 2010 @ 5:51 pm

    morgan: cognates *can* be probabilistically identified based on measures of current semantic and phonetic similarity, e.g. see…

    http://webdocs.cs.ualberta.ca/~kondrak/papers/naacl01.pdf

    I had a brief look at the paper, and I am not familiar enough with the mathematics to understand the technicalities, but I have a couple of comments:

    – the languages used (all Algonquian) are definitely known to be closely related, and the phonetic correspondences have already been observed and systematized, so the computational methods can be refined in order to take these features into account;
    – the problems the author identifies (slight differences in spelling, morphology, semantics, etc) are problems for the computers, not for actual linguists using pre-computational methods ("measuring" semantic and phonetic similarity, etc). For a computational method to work, it has to be based on preliminary work by linguists who can code the various relationships between forms.

    The "probabilistic" evaluation of potential cognacy using the methods described here seems to be very similar to that achievable by properly trained living and breathing linguists, except that (with the right coding) it is probably faster over a large corpus. With whichever method, a person with a human brain has to go through the material, first to determine the parameters, and last to test the results.

    So the computational methods described are extensions from traditional methods, not completely different methods. This seems to be as true in the present context as in many other scientific contexts.

  22. marie-lucie said,

    December 10, 2010 @ 6:07 pm

    Army1987: "Louse" or "liver" aren't exactly among the 200 most useful words when you've just started learning a language.

    In the modern Western world, perhaps not. In medieval Europe, "louse" was certainly a very useful word. As for "liver", a hunting culture is very familiar with this part of an animal.

    The 200 words are not necessarily the most useful for the learner of a given language, only the more common ones for quasi-universal concepts over both geography and history, concepts which every language can be expected to have a word for and therefore not to need to borrow from another language. Swadesh's original lists (first 100, then 200 words) have been revised for different parts of the world to reflect both geography and culture – words for 'ice' are not too common in the tropics, for instance.

  23. Tom D said,

    December 10, 2010 @ 6:48 pm

    I'm a student at University of Hawaii, and I just so happened to have written a paper on this and some of the other stuff Russell Gray et al. and others have been doing for one of Robert Blust's classes.

    This explanation is culled from the relevant section of that paper, so I've left the citations and such in. Hopefully the examples and explanation will be of some help to y'all.

    If we know that there is basically always borrowing in language evolution, we might want to create and test models that differ in how cognates evolve down the tree. Nelson-Sathi et al. tested several models of cognate evolution both with and without borrowing on Indo-European (2010). We can assume that, since we are dealing with basic vocabulary, we would expect the size of the basic vocabulary to remain somewhat constant, as per the uniformitarian principle (Nelson-Sathi et al. 2010, 5). Because we have the “real” answer–the number of cognates in each language–we can compare how different models infer cognate distributions (Nelson-Sathi et al. 2010, 5).

    First, Nelson-Sathi et al. proposed a model in which cognates could only be lost (2010, 3). Thus, if a word exists in any daughter language, it must also have existed in the proto-language. To use our Japonic example again, if Japanese has the word taiyoo, if Okinawan has the unrelated word tiida, and if Old Japanese has another unrelated word pyi, all for 'Sun', then, under this loss-only model, we would have proto-Japonic having three words for 'Sun'. While, as I mentioned above, there is no a priori reason to rule this one instance out, if the uniformitarian principle holds, this will add up enough over the whole set of basic vocabulary to produce too many words in the basic vocabulary of the proto-language (Nelson-Sathi et al. 2010, 5).

    Next, we could instead assume that cognates only originate in the most recent common ancestor of all forms where they are present (Nelson-Sathi et al. 2010, 3). To continue with our Japanese example, we could observe that only languages in the Ryukyuan subgroup have a reflex of a form like kuuga for 'egg'. Proto-Ryukyuan, then, would be the most recent common ancestor in the tree that has a form like kuuga, but it wouldn't occur higher in the tree, at the level of Proto-Japonic. This would still, however, not fit our true tree. We know that modern Okinawan, for example, has the word tamagu, which is cognate with Japanese tamago, also meaning 'egg'. Then, at the level of proto-Japonic, because of this evidence for this cognate set in two primary branches, our model would give us tamago as the proto-Japonic word for 'egg'. This is, however, right but for the wrong reason. We know that Okinawan has borrowed this from modern Japanese.

    So we could then attempt to include borrowing. But we can do better than just including borrowing: we could specify a level of borrowing. For the simplest attempt, Nelson-Sathi et al. allowed words to have two origins: one genetic and one borrowing-based, where the words would behave like the single-origin and loss-only models above afterwards (2010, 3). They also allowed for more borrowing-based origins, up to a total of one genetic and fifteen borrowing-based origins for words (Nelson-Sathi et al. 2010, 3). To continue once again with our Japonic example, the model with one genetic origin and one borrowing would perfectly fit the data from modern Okinawan and Japanese for the cognate set including tamagu and tamago, respectively. The most recent genetic origin would be in the Japanese branch of Japonic, while the word would also be borrowed into Okinawan, thus assuming the origin at the correct place and also not inferring too many or too few cognates in proto-Japonic. With only one possible borrowing, this model would, however, not be able to handle the situation if another Ryukyuan language, say Amami, also borrowed the same cognate for 'egg'. It would force us to infer this cognate set as being genetically higher up the tree than this evidence points to. Of course, this would be technically correct; the Ryukyuan forms are lexically innovative. It would just be correct for the wrong reason, not something we want. So we would then want to use a model that allows for more borrowing.

    These borrowing models were compared with a probability distribution, inferred from the data, of the average basic vocabulary size of an Indo-European language to find out which model best fit the “reality” of the situation (Nelson-Sathi et al. 2010, 5-6). They found, with reservations, that one genetic origin and three borrowings best fit their Indo-European data (Nelson-Sathi et al. 2010, 6). It is probably not too much to speculate that a distribution of rates of borrowing, where some cognates are much more readily borrowed and others not, would be a logical next step and a statistically better fit to the data, but they have not yet taken this next step.
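
    To make the two baseline models above more concrete, here is a toy Python sketch, my own illustration, assuming a made-up four-node Japonic tree and hand-picked cognate sets (it is not the authors' algorithm or dataset):

    # Tiny tree: Proto-Japonic splits into Japanese and Proto-Ryukyuan;
    # Proto-Ryukyuan splits into Okinawan and Amami.
    TREE = {
        "Proto-Japonic": ["Japanese", "Proto-Ryukyuan"],
        "Proto-Ryukyuan": ["Okinawan", "Amami"],
    }
    LEAVES = ["Japanese", "Okinawan", "Amami"]

    # Cognate sets for two meaning slots: which leaves attest each set.
    COGNATE_SETS = {
        ("Sun", "taiyoo"): {"Japanese"},
        ("Sun", "tiida"): {"Okinawan", "Amami"},
        ("egg", "tamago"): {"Japanese", "Okinawan"},  # the Okinawan form is a loan
        ("egg", "kuuga"): {"Okinawan", "Amami"},
    }

    def leaves_under(node):
        kids = TREE.get(node)
        if not kids:
            return {node}
        return set().union(*(leaves_under(k) for k in kids))

    def mrca(attesting):
        # Smallest node whose set of leaves covers all attesting languages.
        candidates = [n for n in list(TREE) + LEAVES
                      if attesting <= leaves_under(n)]
        return min(candidates, key=lambda n: len(leaves_under(n)))

    # Loss-only model: every attested cognate must already exist at the root.
    print("Loss-only: Proto-Japonic needs", len(COGNATE_SETS),
          "words for 2 meanings (the proto-vocabulary inflates)")

    # Single-origin model: each cognate originates at the MRCA of its attesters.
    for (meaning, form), langs in COGNATE_SETS.items():
        print("Single-origin: %r (%s) placed at %s" % (form, meaning, mrca(langs)))

    # Note the tamago set: single-origin puts it at Proto-Japonic because both
    # primary branches attest it, even though the Okinawan form is a borrowing.
    # Allowing one or more extra borrowing-based origins per cognate set is the
    # refinement described above.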

  24. Matt McIrvin said,

    December 11, 2010 @ 1:14 am

    The story of "typhoon" reminds me of "genie"; my daughter noticed the similarity to "genius" and I wondered if there was a common etymology, given that "genie" seemed related to the Arabic "djinni"–and yet I also knew "genie" and "genius" were the same word in French.

    It turns out "genie" is from the French for an inspiring or tutelary spirit, used by French translators of the 1001 Nights and other stories for its coincidental similarity to the Arabic word, which influenced the meaning in English. So both origins are real.

  25. Andrew Garrett said,

    December 11, 2010 @ 1:28 am

    I haven't fully digested the paper, but I'm not sure I agree with the authors' general remarks about "patchy COG distributions" (p. 5). Two languages may share an apparent lexical innovation by chance (which N-S et al. rightly discount) or by cultural borrowing of the "television" or "beef" sort, but other scenarios are possible that they do not take seriously. For example, if an etymon has an original meaning "X" that is especially likely to evolve into "Y" (e.g. words for "cheek" are especially likely to come to mean "jaw"), then, even if members of that cognate set recur in the semantic slot Y, this does not necessarily point to borrowing — but neither is the recurrence due to chance or to what N-S et al. dismiss, namely proto-languages with "different, but redundant, words for the same basic concepts" (p. 5). This strikes me as a very common situation.

  26. Tom D said,

    December 11, 2010 @ 5:32 am

    @Andrew Garrett

    Take a look at the supplementary materials when you're done. They do mention that they're probably picking up a good bit of parallel evolution unintentionally.

    However, what you're talking about, as far as I understand the methods, would not be picked up. With the exception of borrowing, each semantic slot is assumed to evolve independently from every other. It's something they definitely take seriously; they talk a bit about it in Atkinson et al. 2005.

    Essentially, under the one model of lexical evolution they were proposing in Atkinson et al. 2005, it didn't seem to be that big of an issue when they ran tests with artificial data designed so that cognates did evolve dependently on one another.

    To talk about some stuff I didn't mention before, I do think there are basically two flaws with this paper. Usually, when Gray et al. do this sort of thing, they work with at least one historical linguist specializing in the area. This time, they didn't, and I think it's showing a bit.

    Second, despite being able to give out as much supplementary material as they want, or being able to host it elsewhere, they don't include the full results that would interest linguists, just some statistical stuff and what I assume is a complete list for English (in the supplementary materials).

  27. Army1987 said,

    December 11, 2010 @ 8:10 am

    @marie-lucie: I get the original intended usage of that list, and I think that it does a decent job for that. I was disputing their "second-language learning materials". Words such as "hello", "goodbye", "thanks", "sorry" or "fine" are far more useful, I think.

  28. Army1987 said,

    December 11, 2010 @ 8:11 am

    [insert "utility as" after "their" in the post above]

  29. Coby Lubliner said,

    December 11, 2010 @ 10:08 am

    It seems to me that one of the problems with identifying borrowings is that the meaning of the borrowed lexeme may be so different from that in the lending language that the borrowing may not be recognized. In relatively recent borrowings, the pronunciation may not be fully assimilated and the word may keep its original orthography, so that we know it's borrowed. We know that French words ending in –ing are borrowed from English, and English lexemes beginning in en (pronounced /ɑn/, whether or not attached to the following word, as in envelope or en masse) are borrowed from French. And so we know that French footing ('jogging') and BrE en suite ('having a private bathroom') are "borrowings." But in older, especially preliterate, stages of a language such markers are unlikely to exist.

    Then there are some curious cases of secondary borrowing. Italian borrowed the French rondeau and respelled it (as was common before 1800 or so) as rondò. German took the word from Italian and dropped the accent, making rondo seem like an Italian word.

  30. VMartin said,

    December 11, 2010 @ 10:18 am

    A Slovakian Slavist, Ondruš, using the method of semantic transposition, concluded that the words "king" and "der König" are of Old-Slovakian/Slavic origin, from "koning/kuning". These words were accepted into the Old Germanic languages, where they have no relation to the word for "first" (like the word "der Fürst"). Such semantic relations exist in Latin, Greek and Old-Slavic (archont – principal – koning) but not in German.
    The word "koning" transformed into knjez, knieža, and oddly enough, centuries later, Slavonic accepted the word "kráľ" from Charles the Great.

  31. marie-lucie said,

    December 11, 2010 @ 10:19 am

    Army1987: I see that I read your comment too fast and misunderstood your intent. The person talking about the usefulness of the Swadesh lists as learning materials was probably not familiar with the lists themselves and their purpose.

    Indeed, "hello" etc are usually found at the top of the list of words to be memorized by learners, but since the purpose of such words is not to impart information but to facilitate social interaction between speakers, they are extremely susceptible to borrowing (eg the use of "adios" or "ciao" in English by persons who do not understand or speak Spanish or Italian).

  32. Dragos said,

    December 12, 2010 @ 8:59 am

    @Morgan: In fact, as a sweeping statement it is certainly false – cognates *can* be probabilistically identified based on measures of current semantic and phonetic similarity

    I second Marie-Lucie: I don't think cognates can be determined automatically (yet). Of course, as pointed out above, recent loanwords still preserve some similarities, but for the Indo-European family we're talking about changes happening 5-6,000 years ago (or more, according to Gray & Atkinson).

    Linguists have some examples of almost perfect phonetic and semantic matches which are not related at all (not inherited, not borrowed): Modern Greek mati means 'eye', and so does the Malay mata. To judge whether these words are related or not, similarity is not enough; one has to know the history of the two languages (mati comes from ommation, which is a diminutive of omma).

    Another well-known example is English bad vs Persian (Farsi) bad, two unrelated words having the same form and the same meaning.

    Here's a nice essay about random lexical matches in languages:
    http://www.zompist.com/chance.htm

  33. Army1987 said,

    December 12, 2010 @ 6:40 pm

    What part of ‘probabilistically’ is unclear to you?

  34. J Lee said,

    December 12, 2010 @ 7:39 pm

    It is highly unlikely I could scour the web for lists that have IPA without even inadvertently reading about what they were designed for.

    I referred to using them merely as a means of comparing languages of the same family, whether to find sound correspondences or get an idea of dialectal differences or anything else you might want to do. Wikipedia's Indo-Iranian parallel lists, for example, are in individual orthographies and thus useless on their own for such purposes.

  35. Dragos said,

    December 12, 2010 @ 8:26 pm

    When a researcher presents some tests performed on four languages which allegedly indicate a method of discovering cognates on average, I must say I am unimpressed, to put it mildly.

    In that paper the cognates are "words in different languages that are similar in form and meaning, without making a distinction between borrowed and genetically related words". Most linguists will disagree with this definition. How likely is it for two words which are definitely related to be cognates instead of borrowings? The author of the paper doesn't even try to answer such a question.

    But how likely is it for two similar words to be related? How likely is it for two dissimilar words to be related (IIRC English eye and Latin oculus are cognates)?
    I don't know the exact answers to any of these questions, but I guess they depend on those languages and how they evolved, probably also on the selected word sets. For example I believe similar words are likelier to be related in Spanish and Portuguese than in Basque and Sumerian. But this is not what I've read in that paper.

  36. J Lee said,

    December 13, 2010 @ 7:34 am

    Surely the Persian and English words for 'bad' are cognates.

  37. Army1987 said,

    December 13, 2010 @ 8:10 pm

    How likely is it for two words which are definitely related to be cognates instead of borrowings?

    That's not such a black-and-white question as it sounds. Two related words in Irish and English could be a very recent borrowing from English to Irish (or vice versa), a 16th-century borrowing from Early Modern English to Early Modern Irish (or vice versa), an 11th-century borrowing from Middle English to Middle Irish, a borrowing into Middle English from Norman French which had in turn borrowed it centuries before from its Celtic substrate, a word borrowed by Old English from its Brythonic substrate, a borrowing from proto-Germanic to proto-Celtic or vice versa, …, …, or have been independently inherited from proto-Indo-European. Lumping all of these except the last in one category and keeping the last separate isn't necessarily the right thing (though it depends on what you're doing).

  38. Atmir Ilias said,

    December 15, 2010 @ 12:48 am

    I'm glad to read this kind of article, which made me feel good and has given me hope that linguistic intelligence is still alive. We are aware that much of language is very difficult to understand, and most of us probably assume we know only its surface.
    Linguists know very little about the word and the source of its creation. Our primary weakness is that we still interpret words as simple signs, in one dimension, and then we create single theories, single methods, single classifications, and single conclusions. We suppose that some words are determined by sounds of nature (sound-symbols), others by figures (picture-symbols), some by ideas (idea-symbols) and some by their combination, but none of us accepts those theories, and none of us is able to understand that all those ways might have worked together over time, and might have created a language from the first word to the last. The classification of languages requires knowledge of the sources that made possible the creation of the words, and, of course, there is more than one of them. Putting tomatoes together with apples only because they are red does not mean that they form a single group with just one name.

  39. Dragos said,

    December 16, 2010 @ 12:06 pm

    Surely the Persian and English words for 'bad' are cognates.

    Are they?

    How likely is it for two words which are definitely related to be cognates instead of borrowings?

    That's not such a black-and-white question as it sounds.

    I agree. But the "lexicostatistical problem" presented in the article by Shijulal Nelson-Sathi et al. is mostly focused on the cognate vs borrowing dichotomy.

    Phonetic and semantic similarity (as in Grzegorz Kondrak's studies – he also has a PhD thesis on this topic) cannot determine if a word is a cognate, nor how likely it is to be one. Been there, done that, it's called mass comparison and it was widely criticized.

    After a more careful reading, I don't think I agree with the method of N-S et al. If my understanding is correct, they use computer generated phylogenetic trees, and then search for cognate sets (COGs) distributed in a way which is incongruent with the tree-branching patterns. Their conclusion is that (many of) these problematic cognates are in fact borrowings which are currently undetected.

    Well, certainly there are borrowings which cannot be differentiated from true cognates. But I doubt the validity of the assumptions from this paper, and thus their conclusions are articulated on shaky grounds.

    The authors seem to take language trees for granted. But maybe some COG inconsistencies are because some branches are incorrectly placed, or because of the limitations of their tree model. If we use a tree representation, is there any serious demonstration that Romance languages split one by one, as pictured in these phylogenies? What if we model them as splitting all at once from Latin? Wouldn't those "patchy COGs" give a different picture?

    The authors are aware of parallel evolution but they say they "can assume that it is rather rare". Why?

    It seems the uniformitarian principle is applied in a rather abusive way by suggesting there's some sort of low and constant ratio of words / concept (though I'm not really sure what they mean by "far more"). Is it? Languages do have both partial and total synonyms, and it often happens that each word sources a different COG in the descending languages (see the Romance language family).
    I think this is also related to a dangerous equation between wordsets and actual languages, ignoring that for many of those concepts/meanings, some languages do have more words which are not taken into account (for example for 'belly' Romanian has 'burtă', 'pântec(e)', 'vintre', 'stomac' and maybe some others I am missing right now). If the authors choose to use wordsets with 1-2 words / concept, it doesn't mean the ancient, unattested languages had only 1-2 words / concept.

  40. Morgan said,

    December 16, 2010 @ 2:34 pm

    @marie-lucie:

    Sorry I'm so late to respond.

    "…the languages used (all Algonquian) are definitely known to be closely related, and the phonetic correspondences have already been observed and systematized, so the computational methods can be refined in order to take these features into account".

    Perhaps they could be, but I don't believe they were "tuned" to Algonquian. In fact, to go back to your "head" "caput" example, these might well be identified as highly likely to be cognate using a version of Kondrak's methodology – if (and I want to stress that this "if" bears some thought) multiple languages are examined at once. Include "hoofd", "kopf", "huvud" – the ALINE algorithm identifies these as being very similar to one another and also identifies some forms as being highly similar to "caput".

    You can play around with it here:

    http://webdocs.cs.ualberta.ca/~kondrak/cgi-bin/demo/aline/aline.html

    So maybe you'd get to likely cognacy even in a somewhat prototypical example of true cognates that have been subject to significant phonetic drift (at least on one side). In semantic/phonological space, these forms cluster tightly together, which is the basic probabilistic evidence of potential cognacy. What's more, this grouping can be made much tighter by a parsimonious translation through the space (initial "c/k" to "h"). That shift could certainly be identified in an automated fashion, given that someone thought to program the computer to attempt to identify it. Of course, this would merely replicate what has already been found by actual living linguists.

    "…the problems the author identifies (slight differences in spelling, morphology, semantics, etc) are problems for the computers, not for actual linguists using pre-computational methods ("measuring" semantic and phonetic similarity, etc). For a computational method to work, it has to be based on preliminary work by linguists who can code the various relationships between forms."

    I'm sure that these were problems for actual linguists when they first grappled with them. The need for "preliminary work by linguists" is fundamental, however. Whether that means stripping away "distracting" (to the computer) aspects of morphology, or doing the fundamental work required to define robust measures of phonetic and semantic distance, the computers are at the mercy of existing work. Even if the computer were clever enough to "strip away" morphology on its own, it would do so based on linguists' encoded understanding of how to do so (setting aside the possibility of machine learning approaches).

    "…The "probabilistic" evaluation of potential cognacy using the methods described here seems to be very similar to that achievable by properly trained living and breathing linguists, except that (with the right coding) it is probably faster over a large corpus. With whichever method, a person with a human brain has to go through the material, first to determine the parameters, and last to test the results."

    Agreed – with caveats, but not important ones for the present. The value of a computational method is that it allows (a.) speed of computation over a large corpus (b.) the ability to simultaneously handle a large number of languages, (c.) complete objectivity in the sense that there are no "judgment calls" regarding cognacy, (d.) flexibility in defining and updating phonetic and semantic distance metrics as better ones are developed, and (e.) (potentially at least) the ability to probabilistically account for large numbers of "low probability" cognates. This makes it likely to be valuable for identifying longer range relationships among languages than would otherwise be accessible, assuming of course they exist.

    I realize this sounds like Greenberg's method of mass comparison – unfortunately I can't speak to how similar to his method what I'm describing really is, because I've never found a detailed description of his methodology. My understanding is, however, that it was manual, and that the primary objections were to the subjective nature of cognacy determinations.

    I want to stress again the difference between an approach that minimizes false positives with regard to cognacy and one that makes maximal use of the information available to it. A simple distance metric is hardly the last word in terms of incorporating all available information – it wouldn't be even if relationships among languages were strictly tree-like – but it does allow for better handling of all the pairs that don't meet the stringent cognacy test.

    "So the computational methods described are extensions from traditional methods, not completely different methods. This seems to be as true in the present context as in many other scientific contexts."

    Again, agreed, but this time the caveat is more important. New tools often allow the accomplishment of things that would otherwise be impossible. It's pretty clear that some linguists have a strong dislike for the very idea of computational methods. Maybe this is due to a history of overpromising (or overclaiming), or just suspicion of shortcuts. But it makes sense to me to find ways to work with these methods rather than throwing them out altogether.

  41. Morgan said,

    December 16, 2010 @ 2:44 pm

    @J Lee:

    "Surely the Persian and English words for 'bad' are cognates."

    I believe they are agreed not to be cognate. A simple phonetic/semantic distance metric would think they were very likely to be cognate. A more sophisticated one (e.g. one that identified systematic sound shifts) might not.

    But in either case, any metric-based method would have to account for false positives like these before evaluating the likelihood of relatedness between languages – perhaps based on an expected distribution of distances given unrelatedness, which might be generated based on distances between randomly chosen pairs of words.
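
    A minimal sketch of that null-distribution idea (the word lists and the crude position-matching similarity below are illustrative placeholders, not a serious metric): score same-meaning pairs, then score deliberately shuffled pairings to estimate how similar words look when they are not supposed to be related.

    import random

    def similarity(a, b):
        # Crude similarity: fraction of aligned positions with matching letters.
        matches = sum(x == y for x, y in zip(a, b))
        return matches / max(len(a), len(b))

    english = ["hand", "night", "water", "three", "nose", "name"]
    german  = ["hand", "nacht", "wasser", "drei", "nase", "name"]

    # Observed: score the words that share a meaning slot.
    observed = [similarity(e, g) for e, g in zip(english, german)]

    # Null: score deliberately mismatched pairs, many times over.
    random.seed(0)
    null_scores = []
    for _ in range(1000):
        shuffled = german[:]
        random.shuffle(shuffled)
        for i, e in enumerate(english):
            if shuffled[i] != german[i]:  # skip accidental true pairings
                null_scores.append(similarity(e, shuffled[i]))

    mean_obs = sum(observed) / len(observed)
    mean_null = sum(null_scores) / len(null_scores)
    print("mean similarity, same-meaning pairs: %.2f" % mean_obs)
    print("mean similarity, random pairings:    %.2f" % mean_null)

    # If same-meaning pairs are not clearly above the random baseline, isolated
    # look-alikes such as English 'bad' / Persian 'bad' carry little weight.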

  42. Dragos said,

    December 16, 2010 @ 4:00 pm

    Perhaps they could be, but I don't believe they were "tuned" to Algonquian.

    They are. Each language has a different phonology and each phonology also varies in time. Comparing two languages, should the p match a p? A b? An f? Any of them? There's no universal answer. Other phonetic correspondences are not that obvious. See below.

    You can play around with it here
    http://webdocs.cs.ualberta.ca/~kondrak/cgi-bin/demo/aline/aline.html

    It gives a similarity score of 85 to English bad vs Farsi bad (not a cognate) and a score of 30 to Gothic hund vs modern Pan-Slavic sto (a cognate). Not only does it fail to identify the cognate, it also misaligns the two words by "matching" the initial 's' with the middle 'n', not with the initial 'h' as it should.

  43. Atmir Ilias said,

    December 16, 2010 @ 7:58 pm

    1. Sanskrit; 2.Albanian; 3.English
    /Nata/(1.sans) – nata(2.albanian)-night(3.engl),/çlath/(1)–çlith, zglith(2)-release(3),/varga/(1) – varg(2)-gamma(3),/bahra/(1)– barra(alb)-load(3), /giri/(1)– guri(2)-rock(3), /vartitum/ – vërtita(2)-revolves(3), /peja /(1)– pija(2)-drink(3), /trapa/(1) – trup(2)-body(3), /krimi/(1) – krimbi, krimi(2)-worm(3), /krija/(1) – kryeja(2)-krieja(Tzam 2)-top,head,leading(3), /lipsu/(1) – lipës(2)-beggar(3), /lap/(1) – llap(2)-prattle(3), /ratha/(1) – reth(2)-circle(3), /prer/(1) – prerë, prej, pres(2)-cut(adj,verb), /paka/(1) – pjek(2)-bake(3), /vrana/(1) – e vrame, e vrare(2)-killed, /val/(1) –valë(2)-wawe(3), /trut/(1) – tret(2)-thaw(3), /tiras/(1) – thërras(2)-summon(3), /tila/(1) – thela(2)-lobule(3), /vasu/(1) – vashë(2)-girl(3), /val/(1) – valle(2)-dance(3), /vas/(1) – vesh(2)-ear(3), /kleça/(1) – kleçka (2)-?(3), /suni/(1) – çuni(2)-boy(3), /nusa/(1) – nusja(2), nusa(Tzam 2)-bride(3), /ramja/(1) – i ramë( i rënë)(2)-dropped,fallen(3), /vasa/(1) – vise(2)-(3)?, /fal/(1) – fal(2)-forgive(3),/gata/(1) – gota(2)-cup(3), /tata/(1) – tata(2)-father(3), /gatita/(1) – gatita (2)- prepare (3), /bhuta/(1) – bota(2)-world(3), /anu/(1)–anë(2)-side(3).

  44. Dragos said,

    December 16, 2010 @ 8:21 pm

    @Morgan:
    I realize this sounds like Greenberg's method of mass comparison – unfortunately I can't speak to how similar to his method what I'm describing really is, because I've never found a detailed description of his methodology. My understanding is, however, that it was manual, and that the primary objections were to the subjective nature of cognacy determinations.

    It is like Greenberg's method of mass comparison, and the problem is that with such methods it is impossible to separate cognates from loanwords, chance resemblances, nursery formations, etc. The contribution above suggesting Albanian is strongly related to Sanskrit proves beautifully what is wrong with this method.

  45. Atmir Ilias said,

    December 17, 2010 @ 7:26 pm

    The primary weakness is the structure of the word. No one pays attention to it. Are other mini-words incorporated into words, so that their combinations produce the main meaning as a resultant of mini-meanings? If a word has inside it a small “sentence” with some primitive elements of “subject, verb, predicate..”, what is going to happen?
    For example, the verb “die” in the Urdu language is “maraikju”. If we compare it with the verb of the same meaning, “vdes, des, dek”, in the Albanian language, the big difference is clear; but if we compare “mar” and “ikju”, we can find the Albanian verb /mar/, which means “take”, and “iku”, the past tense, third person singular of the verb /ik/, which means “is gone”, and we can arrive at another conclusion.
    If we are going to develop this method, we will see that the Japanese language has those verbs, and it is interesting that they have the identical meaning to the Albanian ones. The past tense of the verb “mar” in Albanian is “mora”, which has the same signification as the Japanese verb “morau”, and it is the same with “iku”: its signification is “go” in both languages.
    We should answer the questions and not evade them. Instead of dealing with all the different words of all languages, wasting time and money, maybe it would be better to deal with the things they determine.
