Caleb Everett, "Evidence for Direct Geographic Influences on Linguistic Sounds: The Case of Ejectives", PLoS ONE, 2013:
We examined the geographic coordinates and elevations of 567 language locations represented in a worldwide phonetic database. Languages with phonemic ejective consonants were found to occur closer to inhabitable regions of high elevation, when contrasted to languages without this class of sounds. In addition, the mean and median elevations of the locations of languages with ejectives were found to be comparatively high.
Sean Roberts does some of the statistical checks that Everett should have done and didn't ("Altitude and Ejectives: Hypotheses up in the air", Replicated Typo 6/13/2013), and the connection between altitude and ejectives holds up fairly well. Thus only two linguistic variables (Order of Object and Verb, and the Relationship between the Order of Object and Verb and the Order of Adjective and Noun) have a stronger connection with altitude than the ejectives feature (red vertical line) does:
In comparison, here's Sean's check on Keith Chen's association of the future-tense variable with savings rates ("Whorfian economics reconsidered: Residuals and Causal Graphs", 2/28/2013):
Still, the (presumably) spurious correlations of the two word-order variables with altitude remind us of the possibility for false findings here. Some past posts on similar issues from the same source: "Most important paper on cultural evolution that includes acacia trees published", 1/17/2013; "Spurious correlation bonanza to mark Replicated Typo 2.0 reaching 100,000 hits", 11/30/2011; and a published paper, Sean Roberts and James Winters, "Social Structure and Language Structure: The New Nomothetic Approach", Psychology of Language and Communication 2012.
Whether or not the altitude/ejective correlation reveals a causal connection, we can expect the near future to bring us a large number of spurious correlational analyses, along with a few meaningful ones. There are three reasons for this:
(1) The existence of digital datasets makes it increasingly easy to perform quantitative checks on hypotheses about possible relationships between linguistic and non-linguistic variables;
(2) The astronomically large number of such possible relationships guarantees that many of them will exhibit a strong pairwise connection by chance, even if all of the distributions were statistically independent;
(3) The distributions are not statistically independent, due to factors such as cultural and geographical diffusion.
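Point (2) is easy to see in a small simulation. The sketch below is a toy, not an analysis of any real dataset: it draws one made-up "altitude" variable and 200 made-up "linguistic features" that are independent by construction, then counts how many of them clear a rough 5% significance cutoff anyway. The sample size of 567 is borrowed from Everett's abstract; the number of features, the Gaussian draws, and all the names are assumptions chosen for illustration.

```python
import random

N_LANGUAGES = 567   # sample size borrowed from Everett's abstract
N_FEATURES = 200    # hypothetical number of candidate variables


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def chance_correlations(n_languages, n_features, seed=0):
    """Return |r| between one random 'altitude' and many independent
    random 'features', sorted from strongest to weakest."""
    rng = random.Random(seed)
    altitude = [rng.gauss(0, 1) for _ in range(n_languages)]
    features = [
        [rng.gauss(0, 1) for _ in range(n_languages)]
        for _ in range(n_features)
    ]
    return sorted((abs(pearson(altitude, f)) for f in features), reverse=True)


rs = chance_correlations(N_LANGUAGES, N_FEATURES)

# Rough large-sample 5% cutoff for |r| under independence: about 1.96 / sqrt(n).
threshold = 1.96 / N_LANGUAGES ** 0.5
n_hits = sum(r > threshold for r in rs)

print(f"strongest chance |r|: {rs[0]:.3f}")
print(f"{n_hits} of {N_FEATURES} independent features pass the 5% cutoff")
```

With a 5% cutoff, about one in twenty independent variables should pass on average, so the count grows in proportion to the number of variables tried. Point (3) only makes things worse, since real typological variables are not independent draws.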
Note that the "file drawer effect" strongly undermines the often-made argument "But I/we made the hypothesis before we checked, we didn't just dredge for correlations and then try to explain them". The data-dredging (and the associated multiple comparisons) can (and do) occur across many unconnected investigations, with only the "significant" ones getting published.
As a result, responsible journal editors ought to insist on at least one simple thing: comparison of the asserted relationship with the full distribution of logically-possible relationships in the dataset(s) being analyzed. This check was absent from Everett's paper, but was supplied by Sean Roberts with a few minutes of work done while waiting for a plane in Singapore airport.
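In outline, a check of this kind might look like the sketch below. It is not a reconstruction of Sean Roberts's analysis, whose details aren't given here. It assumes a simple absolute Pearson correlation as the measure of association, and it runs on synthetic data in which one feature, here called `claimed_feature`, is deliberately built to track the outcome. The function name `rank_against_all` and all variable names are hypothetical.

```python
import random


def abs_corr(x, y):
    """Absolute Pearson correlation of two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy / (sxx * syy) ** 0.5)


def rank_against_all(outcome, features, claimed):
    """Score every feature against the outcome and report where the
    claimed feature falls in the full ordering (1 = strongest)."""
    strengths = {name: abs_corr(outcome, vals) for name, vals in features.items()}
    ordered = sorted(strengths, key=strengths.get, reverse=True)
    return ordered.index(claimed) + 1, len(ordered), strengths


# Synthetic stand-in data: 50 noise features plus one planted relationship.
rng = random.Random(1)
n = 300
altitude = [rng.gauss(0, 1) for _ in range(n)]
features = {
    f"feature_{i}": [rng.gauss(0, 1) for _ in range(n)] for i in range(50)
}
# Assumption for the demo: this feature is altitude plus independent noise.
features["claimed_feature"] = [a + rng.gauss(0, 1) for a in altitude]

rank, total, strengths = rank_against_all(altitude, features, "claimed_feature")
print(f"claimed_feature ranks {rank} of {total} "
      f"(|r| = {strengths['claimed_feature']:.2f})")
```

The useful output is the rank, not the p-value: a claimed relationship that sits in the middle of the pack deserves much less attention than one that sits in the extreme tail, whatever its nominal significance.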
But in every pair of datasets, for each variable in one of the datasets, we'll see a distribution like those shown in the plots above, showing an especially strong statistical connection to a few of the variables in the other dataset. Some of these connections won't make any sense (verb-object order and altitude, or velar nasals and savings rates, or lexical tone and acacia trees), while others will suggest a plausible causal story in one direction or the other (ejectives and altitude, future tense and savings rate, or lexical tone and haplogroups of ASPM and MCPH). So plots of this kind don't guarantee that the connections in the high-significance tails are directly meaningful ones. Still, it's a start.
A pioneering (and plausible) effort to correlate linguistic and geographical variables was John Fought et al., "Sonority and Climate in a World Sample of Language: Findings and Prospects", Cross-Cultural Research 2004:
In a world sample (N = 60), the indigenous languages of tropical and subtropical climates in contrast to the languages spoken in temperate and cold zones manifested high levels of sonority. High sonority in phonetic segments, as found for example in vowels (versus consonants), increases the carrying power of speech sounds and, hence, audibility at a distance. We assume that in the course of daily activities, the speakers in warm/hot climates (a) are often outdoors due to equable ambient temperatures, (b) thereby frequently transmit messages distally, and (c) transmit such messages relatively intelligibly due to the acoustic and functional advantages of high sonority.
But Fought et al. had to do a lot more work to test their hypothesis than they could have managed while waiting for a plane at Singapore Airport. They started by choosing a sample of 60 societies from the HRAF Probability Sample Files, geographically stratified and "chosen to represent the 60 macrocultural areas of the world". Then they had to devise and implement a method for coding these societies for "climate"; and they also had to obtain a 200-word vocabulary for each society's language, and to devise and implement a method for coding "sonority" in the pronunciations of those words. Both of these involved extensive expert re-interpretation of previous literature.
In contrast, a decade later, many relevant linguistic and non-linguistic datasets are now pre-compiled and available for easy download, and the software needed for fitting various sorts of statistical models can easily be run on your laptop. So if you have a bright idea (maybe alcohol consumption correlates with phonotactic complexity? really, it could), chances are that you can test a model within a few hours. If it doesn't work out, there are plenty more to try: maybe coffee consumption helps to preserve morphological inflection?
Some previous Language Log discussions of similar things: "Corpus-Wide Association Studies", 3/11/2012; "Cultural Diffusion and the Whorfian Hypothesis", 2/12/2012; "Typological progress", 5/11/2008; "Dediu and Ladd again", 5/30/2007.
Also, everyone should be aware of the path-breaking work on this topic by Mark Dingemanse, "The Hidbap language of PNG", 5/7/2008:
Mt. Iso in PNG, 12 miles southwest of Sumo, east of the Catalina River. Diuwe is spoken between sea level and the first isoline at 100m, Hidbap between the first and the second isolines.
This week, the language of the week at Anggarrgoon is DIY, also known as Diuwe. Claire Bowern, noting that the only comment in the Ethnologue entry of the language is the terse and rather mysterious ‘Below 100 meters’, claims that the phonology of DIY shows an effect of altitude on air stream mechanisms. I thought I would shed some light on this curious situation by profiling Hidbap, a language related to Diuwe.
Hidbap is Diuwe’s closest neighbour both geographically and phylogenetically. It is a language spoken above 100m but below 200m in the same area as Diuwe, that is, 12 miles southwest of Sumo, east of the Catalina River. Like Diuwe, it has exactly 100 speakers. The languages are quite closely related, though there is no mutual intelligibility due to the presence of a large bundle of isoglosses at the 100m isoline. Speakers of either language avoid crossing into each other’s territories at all cost (see below).