Language Log

Where the language diversity is

December 28, 2014 @ 6:49 am · Filed by Geoffrey K. Pullum under Evolution of language, Language and travel, Language contact, Languages

In the articles-noted-but-not-yet-studied pile: an article on language diversity in a journal that (as reader Ted McClure points out to me) linguists might easily have missed (though at least some linguistics blogs covered it): in Proceedings of the Royal Society B: Biological Sciences (281, 20133029), earlier this year, Jacob Bock Axelsen and Susanna Manrubia published a paper entitled "River density and landscape roughness are universal determinants of linguistic diversity." The abstract says:

Global linguistic diversity (LD) displays highly heterogeneous distribution patterns. Though the origin of the latter is not yet fully understood, remarkable parallelisms with biodiversity distribution suggest that environmental variables should play an essential role in their emergence. In an effort to construct a broad framework to explain world LD and to systematize the available data, we have investigated the significance of 14 variables: landscape roughness, altitude, river density, distance to lakes, seasonal maximum, average and minimum temperature, precipitation and vegetation, and population density. Landscape roughness and river density are the only two variables that universally affect LD. Overall, the considered set accounts for up to 80% of African LD, a figure that decreases for the joint Asia, Australia and the Pacific (69%), Europe (56%) and the Americas (53%). Differences among those regions can be traced down to a few variables that permit an interpretation of their current states of LD. Our processed datasets can be applied to the analysis of correlations in other similar heterogeneous patterns with a broad spatial distribution, the clearest example being biological diversity. The statistical method we have used can be understood as a tool for cross-comparison among geographical regions, including the prediction of spatial diversity in alternative scenarios or in changing environments.

Intuitively not too surprising, I guess: diverse languages evolve where it is difficult to get around to where other folks live because routes to other areas are made difficult by mountains, cliffs, ridges, canyons, or rivers. You chat with people you see a lot, and you see more of people who live somewhere you can easily get to. It makes more sense than looking for a correlation with rainfall.

The entire paper seems to be available on an open access basis on this site.

22 Comments

Keith said,

December 28, 2014 @ 8:45 am

I suppose the best-known example of this is PNG, where (so I've read) people lived in small settlements in the uplands and where moving from one settlement to another was almost impossible if it required descending into the dense lowland jungle: movement was very, very slow and there was nothing for people to eat, so a group could starve before being able to climb back up to an altitude where food could be found. The only place where groups could mingle would be along the coast.

In Europe and elsewhere, we could expect that terrain and isolation in river valleys would keep linguistic communities apart until societies organised and technologies developed to a point where not only does movement between language communities allow interchange, but invasions also lead to one group imposing its language or adopting the language of the invaded.
ThomasH said,

December 28, 2014 @ 11:26 am

Does "river density" mean many river valleys (with difficult communications between them) or many rivers that facilitate travel and communication? And whichever, were the rivers coded independently of the knowledge of the degree of communication between them?
Geoffrey K. Pullum said,

December 28, 2014 @ 11:47 am

Check out the paper itself at http://rspb.royalsocietypublishing.org/content/281/1784/20133029.full.pdf+html.
ThomasH said,

December 28, 2014 @ 12:04 pm

The link did not lead me to the article, but my question was principally aimed at the ambiguity in the abstract about the sign of the variable in the model.
Michael Rank said,

December 28, 2014 @ 1:02 pm

I found the article here http://rspb.royalsocietypublishing.org/content/281/1784/20133029

Related article here (focuses on Papua New Guinea): "Spatial congruence in language and species richness but not threat in the world's top linguistic hotspot"
http://rspb.royalsocietypublishing.org/content/281/1796/20141644
Stephen said,

December 28, 2014 @ 1:20 pm

"Overall, the considered set accounts for up to 80% of African LD" (my emphasis).

Surely anyone with a knowledge of statistics will be saying 'correlation does not equal causation'!
KWillets said,

December 28, 2014 @ 1:26 pm

I've wondered about this question in Northern California, which seems to have a peak in diversity in a very rugged area.

On the river question, they spend a few paragraphs considering different aspects, and conclude that communication may have been a key factor:

Hence, regions of high river density have acted as social hubs, promoting the interaction among different linguistic groups. It is conceivable that rivers may have boosted LD through a process analogous to genetic recombination [30,31]. This scenario is indirectly supported by observations of how frequent contacts between speakers of different languages may result in new hybrid languages within a few generations. It has been put forward that rapidly emerging contact languages may then have played a significant role in language evolution [32].

There's a copy of the paper on the author's site.
Roger Shuy said,

December 28, 2014 @ 5:07 pm

Linguistic geographers have known (and talked about) this for decades. It isn’t much of a secret that early Americans took the easiest routes west that they could find. What else would explain Ohio’s Summit County (Akron) being settled later than the lower elevations of Cleveland 45 miles north and Columbus 75 miles south? Or take Illinois, for example. The earliest white settlers were Midland dialect speakers, mostly hunters and trappers who came north up the Mississippi River and then further north up the Illinois River, bringing their Kentucky Midland dialect with them. The Black Hawk War scared them off and they paddled back down the Illinois River to safety. By the time that conflict was over and it was safe to go north again, Robert Fulton had invented the steamboat and New England farmers with various Northern English dialects took water routes to Chicago. As the Midland dialect speakers, mostly hunters and trappers, returned up the Illinois River they ran into the more recently settled Northern dialect speakers, mostly farmers, who were cutting down the woodland that was precious for hunting and trapping and turning it into farmland. This created conflict over the terrain that caused the Midlanders and Northerners to cluster separately. Except for Chicago, the Northern and Midland dialects in Illinois remain fairly separate to this day, all based on the routes of settlement in that state. And those settlement patterns are reflected in the dialect diversity that was created by the landscape, attitudes about survival, and waterways.
Rod Johnson said,

December 28, 2014 @ 5:28 pm

"Surely anyone with a knowledge of statistics will be saying 'correlation does not equal causation'!"

It seems pretty mundane to say that some set of variables "accounts for" some portion of the variance in some other variable in a principal components analysis. What's the objection?
Stephen said,

December 28, 2014 @ 6:05 pm

@Rod Johnson
Well, "accounts for" implies a causative relationship, whereas what is in the abstract only shows a correlation.

Without a causative mechanism it is just plain wrong to say that A causes B. It may be that B causes A or both are caused by C or even just chance.

A while ago, maybe 30 years, a quite close correlation could be drawn between owning a TV, or more than one radio, and the likelihood of developing cancer. However there was no causative relationship between them. Rather both were symptoms of a more affluent lifestyle.

It is easy in this case to speculate upon a causative mechanism (hard to cross terrain keeps groups apart and their languages separate) but to be at all plausible that mechanism must be investigated and its impact measured. It may well be that only some of the correlation is caused by this mechanism and that the rest is caused by something else or is just chance.

Even if the causative mechanism is clear in a given case, implying causation when it is not proved is a terribly bad habit for a scientist.

At a quick skim the Wikipedia article at
https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation
seems to cover this topic reasonably well.
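
The TV-and-cancer point above can be illustrated with a small simulation. This is a minimal sketch on synthetic data only: the probabilities, the variable names (`affluent`, `owns_tv`, `ill`) and the helper functions are all made-up assumptions for illustration, not figures from any study. A hidden common cause drives both observed variables, so they correlate overall, yet the correlation roughly vanishes once the common cause is held fixed.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(n=20000, seed=0):
    """Synthetic population: affluence (hypothetical) raises both the chance
    of owning a TV and the chance of illness; the TV itself does nothing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        affluent = rng.random() < 0.5
        owns_tv = rng.random() < (0.9 if affluent else 0.2)  # invented rates
        ill = rng.random() < (0.3 if affluent else 0.1)      # invented rates
        rows.append((int(affluent), int(owns_tv), int(ill)))
    return rows


rows = simulate()

# Correlation between TV ownership and illness across everyone.
overall = pearson([r[1] for r in rows], [r[2] for r in rows])

# The same correlation computed separately within each affluence group.
within = {}
for level in (0, 1):
    stratum = [r for r in rows if r[0] == level]
    within[level] = pearson([r[1] for r in stratum], [r[2] for r in stratum])

print(f"overall correlation: {overall:.3f}")
print(f"within non-affluent: {within[0]:.3f}, within affluent: {within[1]:.3f}")
```

With these invented rates the overall correlation comes out clearly positive while the within-group correlations sit near zero, which is the pattern a shared cause produces.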
J.W. Brewer said,

December 28, 2014 @ 6:54 pm

Maybe this is covered in the paper in a jargony way that I missed on first skim, but . . . if you take the species-diversity analogy as a starting point, there are two different sorts of scenario. In scenario A, for a given stretch of rain forest (let's say one of their 222 km x 222 km "cells") you have ten different species of frogs all living intermingled right next to each other (perhaps exploiting slightly different niches so they aren't direct competitors for the same food sources etc). In scenario B, the same-size territory also has ten different species of frogs but that's because the territory is divided up into ten separate zones, each, due to roughness of terrain, its own distinctive micro-ecology, and each inhabited by only one distinctive species of frog. It seems to me that what is being described here languagewise is more analogous to B, whereas A is what people have in mind when they talk about "biodiversity" being an uncontroversially Good Thing. Scenario B by contrast is not necessarily diverse in a good way, because each teensy little isolated frog species with its restricted range may be highly vulnerable to slight changes of circumstances. The linguistic equivalent of scenario A is what you used to get in polyglot cities ruled by (typically illiberal and not infrequently brutal) multi-ethnic empires, like e.g. Ottoman-era Salonika, where speakers of Turkish, Greek, Albanian, Macedonian, Aromanian (or whatever you want to call "the language used by Vlachs when speaking to each other"), and Ladino were living cheek by jowl, with most people able to function in several tongues beyond their own ancestrally-determined L1. I don't think that's the sort of linguistic diversity this paper is measuring.
ThomasH said,

December 28, 2014 @ 9:30 pm

It is true that the study was not properly set up with a causal hypothesis to be tested (such as that lack of interaction among populations caused by living in different river valleys promotes language divergence), for which an observed correlation could be taken as evidence. As it is, it appears that the authors threw a lot of possible causative variables at the phenomenon to see which ones would stick. What you can get out of such a procedure is a possible hypothesis for future work, but not evidence of causation.
Ran Ari-Gur said,

December 28, 2014 @ 10:37 pm

@Stephen: As Rod Johnson says, "accounts for" is a very common phrasing in this technical context. It does not denote causation.
Stephen said,

December 29, 2014 @ 4:29 am

@ThomasH
"the authors threw a lot of possible causative variables at the phenomenon to see which ones would stick. What you can get out of such a procedure is a possible hypothesis for future work, but not evidence of causation"

I would have thought that, scientifically, you actually get nothing out of that approach, e.g.
http://xkcd.com/882/

@Ran Ari-Gur
" "accounts for" is a very common phrasing in this technical context"
If true then that is very poor practice indeed.

In general English usage 'accounts for' has the meaning of causation. If they don't mean that and they don't mean 'is correlated with' (hopefully that is what they would have said in that case) …

"It does not denote causation"
… what does it mean?
GH said,

December 29, 2014 @ 7:08 am

I disagree that "accounts for" means causation in everyday use. Rather, it tends to mean "explains," and while an explanation can be a claim of direct causation, it does not need to be. (Or it can mean "fully enumerates," with causation even less relevant.)

In technical jargon, "accounts for" is used to talk about the strength of correlation or covariance (i.e. the r-value, or equivalent measure), and as others have pointed out it does not imply causation.

Of course, the authors are in any case not positing a direct causation: that rivers and rough terrain make people speak differently from each other. They more sensibly hypothesize that the different distributions of language diversity have come about through different patterns of population contact, which in turn are partly an effect of communication difficulty, which is often a function of specific terrain features, for which river density and landscape roughness appear to be useful proxies. And while the findings don't prove the hypothesis, they do at least support it.
Lars said,

December 29, 2014 @ 7:56 am

@Stephen,

the researchers in XKCD had a bit of luck there — twenty uniformly distributed variables in [0;1] have a 35.8% chance of all being >= 0.05. (I'm sure that Randall Munroe knows this).

But a surefire recipe for success is this: Ask 58 sophomores to come up with a hypothesis each. If one or more test out at p < 0.05, good. Otherwise you can claim that "Sophomores only come up with bad hypotheses, p < 0.05".
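
The arithmetic in this comment is easy to check. Here is a minimal sketch, assuming every test is independent and every null hypothesis is true, so that each p-value is uniform on [0, 1]; the function names are made up for illustration. By this calculation the 20-test figure of 35.8% holds, while 58 tests give a probability of about 0.051, marginally above 0.05, so 59 would be the first count that dips below it.

```python
def prob_no_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that none of n independent true-null tests reaches p < alpha."""
    return (1.0 - alpha) ** n_tests


def smallest_n_below(threshold: float, alpha: float = 0.05) -> int:
    """Smallest number of tests for which the all-fail probability
    drops strictly below the given threshold."""
    n = 1
    while prob_no_false_positive(n, alpha) >= threshold:
        n += 1
    return n


print(round(prob_no_false_positive(20), 3))  # the twenty jelly-bean colours
print(round(prob_no_false_positive(58), 4))  # the 58 sophomores
print(smallest_n_below(0.05))                # first count below 0.05
```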
Eric Ringger said,

December 29, 2014 @ 9:18 am

Now the stage is set for a variety of generative probabilistic models of terrain geography interacting with people. What are the natural hidden variables in such models? Migration events? Linguistic speciation events? Some such models might model whole populations, but it is conceivable to model individuals and even their vocabularies. New hidden variables come to mind. Conversation events? Trade events? Conflict events? Validating such models sounds daunting but could lead to insights into causation of linguistic diversity.
Nathan Myers said,

December 29, 2014 @ 10:26 am

As I understood it, in Papua New Guinea the highland population don't live up on disjoint mountain slopes, but in a broad valley isolated from the coast by a high mountain ridge. What separates them linguistically from one another is not geography, but cultural intolerance: i.e., travel was suicidal.

Perhaps someone can correct me if I got it wrong.

I wonder how it is that the highlands are not just one big lake.
ThomasH said,

December 29, 2014 @ 3:19 pm

I think the "finding" of a correlation between river density and language diversity and a presumption of some sort of causal link could lead to formation of a hypothesis about the causative mechanism along the lines that Ringger suggests. So I'd say the procedure has some scientific value. Of course the suggested hypotheses could all be wrong and the correlation could have occurred by chance, but that does not invalidate the initial finding.
Stephen said,

December 30, 2014 @ 6:12 am

@GH
"I disagree that 'accounts for' means causation in everyday use. Rather, it tends to mean 'explains,'"
However I think that you are using 'explains' as sense 1.2 here
http://www.oxforddictionaries.com/definition/english/explain?q=explains which uses the word 'cause' in the definition.

"(Or it can mean "fully enumerates," with causation even less relevant.)"
I don't see how that is at all relevant here. Something cannot 80% fully enumerate.

"In technical jargon, "accounts for" is used to talk about the strength of correlation or covariance"

You don't specify what field you are referring to that this jargon comes from, so it is hard to comment on that.
However, if they mean 'has an 80% correlation with' then surely it is better to say that than to say that 'accounts for up to 80% of'. The former is only a few characters longer and is much clearer.
GH said,

December 31, 2014 @ 5:58 am

"(Or it can mean "fully enumerates," with causation even less relevant.)"
"I don't see how that is at all relevant here. Something cannot 80% fully enumerate."

Just commenting on how the everyday use of the phrase doesn't have the strong connotation of causality you claim it does. But of course, with the qualifier it would simply mean "enumerates 80% of," (for example, after a transportation disaster you might say that the survivors and the bodies recovered together "account for 80%" of the passengers) which starts to come close to the meaning here.

"You don't specify what field you are referring to that this jargon comes from, so it is hard to comment on that."

Statistics. http://en.wikipedia.org/wiki/Explained_variation

"However, if they mean 'has an 80% correlation with' then surely it is better to say that than to say that 'accounts for up to 80% of'. The former is only a few characters longer and is much clearer."

No. I find "has an 80% correlation with" clumsy and imprecise, depending on their mathematical model (for one thing, I'd always want a correlation expressed as a number between -1 and 1, not as a percentage). "Accounts for" is the standard way to express this observation in words, without technical symbols like R^2.

But beyond the disagreement about the phrasing, I think "correlation does not equal causation" is a fatuous platitude in this context. The effect of the variables might be spurious, but if there is a link, we can rule out a number of other possible causal relationships by common sense (it is not, for example, plausible that increased linguistic diversity causes terrain to become rougher). The mechanisms the authors suggest are not the only possibility – one alternative that comes to mind would be to speculate that rougher terrain is less desirable, and that a wide variety of small populations have been displaced there by expansionist cultures – but in any case we're led to consider ways in which terrain features (or something closely linked to them) have affected patterns of human behavior to lead ultimately to patterns of linguistic diversity, as @Eric Ringger outlines.
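
The distinction drawn above between a correlation coefficient and "accounts for" can be made concrete. Here is a minimal sketch on synthetic data: the names `roughness` and `diversity`, the coefficients and the noise level are invented for illustration, and the single-predictor line is a stand-in, not the multi-variable model used in the paper. It computes Pearson's r (a number between -1 and 1) and R^2 (the fraction of variance "accounted for"), and shows that with one predictor R^2 is simply r squared.

```python
import random
from math import sqrt


def fit_summary(x, y):
    """Return (pearson_r, r_squared) for a one-predictor least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)  # total sum of squares
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

    slope = sxy / sxx
    intercept = my - slope * mx

    # Residual sum of squares: variation the fitted line leaves unexplained.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

    pearson_r = sxy / sqrt(sxx * syy)
    r_squared = 1.0 - ss_res / syy
    return pearson_r, r_squared


# Synthetic stand-in data (hypothetical, not from the paper).
rng = random.Random(1)
roughness = [rng.uniform(0, 10) for _ in range(500)]
diversity = [3 + 0.8 * a + rng.gauss(0, 2) for a in roughness]

r, r2 = fit_summary(roughness, diversity)
print(f"r = {r:.3f}")
print(f"R^2 = {r2:.3f} (fraction of variance accounted for)")
print(f"r squared = {r * r:.3f}")
```

With several predictors, as in the paper's setting, R^2 is still defined as one minus the ratio of residual to total sum of squares, but it no longer corresponds to any single pairwise correlation.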
Stephen said,

December 31, 2014 @ 7:28 am

@GH
You did not comment on my statement "I think that you are using 'explains' as sense 1.2 here … which uses the word 'cause' in the definition."
Does that mean that you agree that in everyday speech, 'accounts for = explains' does imply causation?

"Just commenting on how the everyday use of the phrase doesn't have the strong connotation of causality you claim it does."

I'm sorry, I thought it was obvious that I meant in this context. A slightly fuller quote from the abstract is:

"only two variables that universally affect LD. Overall, the considered set accounts for up to 80% of African LD"

where the word 'affect' makes it seem, to me at least, that causality can reasonably be inferred.

Also, I just said it implies causation, I never said how strong that linkage was.

Thanks for the link. If that is the standard way that this is expressed, then I have to accept that. However, the way that we speak (& write) affects the way that we think. So I think that talking in a way that implies causation (or may imply it) when that causation has not been shown is a bad habit, as it will, to some extent, lead scientists to be less sceptical about their theories.

It is with that attitude in mind that I do not think that "correlation does not equal causation" is at all fatuous here. Common sense might give us an idea as to what areas to prioritise for further research but I do not think it is much of a touchstone for definitively ruling out ideas.

"it is not, for example, plausible that increased linguistic diversity causes terrain to become rougher"

Well, I can immediately think of two mechanisms for exactly that:
– Antagonism between two linguistically different groups leads to the development of a no-man's-land where trees grow better, leading to less soil erosion and something of a barrier.
– Alternatively, the river that both groups depend upon for water, is diverted by one group and over time cuts quite a deep gorge between them.

I am not saying that either of these is common or even likely, merely saying that dismissing possibilities out of hand is poor practice. Also, you yourself said "The effect of the variables might be spurious" and it was with just that (a chance correlation) in mind that I issued my initial 'warning'.
