Chen […] thinks that if your language has clear grammatical future tense marking […], then you and your fellow native speakers have a dramatically increased likelihood of exhibiting high rates of obesity, smoking, drinking, debt, and poor pension provision. And conversely, if your language uses present-tense forms to express future time reference […], you and your fellow speakers are strikingly more likely to have good financial planning for retirement and sensible health habits. It is as if grammatical marking of the difference between the present and the future insulates you from seeing that the two are coterminous so you should plan ahead. Using present-tense forms for future time reference, on the other hand, encourages you to see that the future is just more of the present, and thus encourages you to put money in a 401(k).
Geoff notes that "Chen's evidence on the lifestyle indicators comes from massive amounts of hard data, and his mathematical analysis is serious". But in addition to expressing some qualms about the linguistic data, Geoff worries that the large number of linguistic traits and the large number of lifestyle and other cultural traits might give rise to spurious connections:
I also worry that it is too easy to find correlations of this kind, and we don't have any idea just how easy until a concerted effort has been made to show that the spurious ones are not supportable. For example, if we took "has (vs. does not have) pharyngeal consonants", or "uses (vs. does not use) close front rounded vowels", would we find correlations there too?
I have similar concerns; but I believe that I can explain and justify my worries without looking at any real data at all. There are two qualitative facts about the world that make it especially easy to fool ourselves about quantitative connections of this kind.
The first relevant fact is that cultural traits — whether language-related or lifestyle-related — tend to diffuse geographically. As a result, there are fewer degrees of freedom in our data than we might think.
Here's a simple demonstration. Suppose we assign uniform random values on the open interval (0,1) to points on a 40×40 grid:
Now we do the same thing, independently, 50 times over. When we're done, each of the 1600 cells has 50 numerical traits or features. We can check the correlations among the geographical distributions of these 50 traits or features, or look in various ways at the effectiveness of attempts to predict the distribution of one of them in terms of the distribution of another; and we'll find that our random number generator has done its job.
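A minimal Python sketch of this baseline check, under stated assumptions: the 40×40 grid is flattened to 1600 cells (geography plays no role until diffusion is introduced), and the names `random_features`, `standardize`, and `pairwise_correlations` are hypothetical helpers written for illustration, not taken from the Matlab code mentioned at the end of the post.

```python
import math
import random
from itertools import combinations

def random_features(n_cells=1600, n_features=50, seed=0):
    """Return n_features independent lists of uniform random values, one per grid cell."""
    rng = random.Random(seed)
    # rng.random() draws from [0, 1); hitting exactly 0 is vanishingly unlikely.
    return [[rng.random() for _ in range(n_cells)] for _ in range(n_features)]

def standardize(values):
    """Center a list and scale it to unit length, so that the dot product
    of two standardized lists is their Pearson correlation."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def pairwise_correlations(features):
    """Pearson correlation for every unordered pair of features."""
    z = [standardize(f) for f in features]
    return [sum(a * b for a, b in zip(za, zb)) for za, zb in combinations(z, 2)]

correlations = pairwise_correlations(random_features())
print(len(correlations))                  # 1225 unordered pairs of 50 features
print(max(abs(r) for r in correlations))  # largest correlation in absolute value
```

With 1600 independent cells, the sampling standard deviation of a correlation coefficient is roughly 1/√1600 = 0.025, so all 1225 pairs should cluster tightly around zero.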
But suppose we implement a simple model of cultural diffusion in which, at each of a series of time steps and for each feature, each cell independently influences its immediate neighbors and is influenced by them in turn. The details don't really matter much (what I did is explained at the end of the post). The result, by design, is that each feature diffuses into a pattern of random regional blotches:
And even though each feature evolved completely independently of all the others, this blotchiness means that (positive or negative) geographical correlations between features become more common. (The plot below shows the distribution of all (50*50 – 50)/2 = 1225 pairwise correlations…)
As the blotches spread, the distribution of correlations also spreads out, and significant-seeming correlations become more common. The emergent blotch-correlations of feature-distributions look like this:
Those pictures show the geographical distributions of the 16th feature and the 40th feature after 100 time steps on that particular run, and you can see that they are weakly negatively correlated (at around r = -0.25). If we apply logistic regression, as Chen did, we find that the beta parameter for predicting feature 16 in terms of feature 40 is -1.08, with an estimated standard error of 0.21, p<.0000001 (or corrected for multiple comparisons, p=.0011).
The plot below compares the predicted values (on the left) with the observed values (on the right):
But there's a second simple fact about the world that also tends to introduce artefacts into an analysis of this kind. Cultural traits not only spread geographically — they tend to spread in correlated ways.
In what we've done so far, each of the traits diffuses independently from all the others. There are no rivers or mountain ranges steering cultural diffusion along preferred pathways. There's no long-distance migration or colonization. And we took no account of the fact that even in simple cases of local borrowing, cultural traits tend to travel in packages: if culture X influences your language, there's a good chance that it influences other aspects of your culture as well. Introducing such co-diffusion of traits into the model would tend to increase the similarity of feature-blotch distributions, and therefore would increase the number of correlations for which we might postulate a causal explanation.
But all this is entirely independent of any genuine causal connections among the traits. For example, if you borrow your writing system from the Chinese, it's likely that you'll borrow some aspects of food culture and architecture as well. This is not because a logographic writing system tends to cause people to use chopsticks or build pagodas.
My point is not that Prof. Chen's analysis is wrong. There may very well be consequential connections of the kind that he explores. But the existence of statistically significant logistic regression coefficients among geographic distributions of cultural traits is not enough to convince me.
In the simulation above, "neighbors" are defined by modular arithmetic on array indices, so that the top row is next to the bottom row, and the rightmost column is next to the leftmost one. At each time step, each cell interacts with each of its neighbors (and itself) by flipping an unfair coin for each trait, with probabilities determined by its current value for that trait; the neighbor then updates the trait in question as a linear combination of its old value and the 0 or 1 derived from the coin flip.
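The update rule described above can be sketched in Python as follows. This is an illustration under assumptions, not a port of the Matlab code: the neighborhood (the cell itself plus its four nearest neighbors), the mixing weight of 0.1, the 20×20 grid, the 30 time steps, and the order in which updates are applied are all choices made here for the example, since the description does not fix them. The function names are hypothetical too.

```python
import math
import random

# Assumed neighborhood: the cell itself plus its four nearest neighbors.
OFFSETS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]

def diffuse_step(grid, rng, weight=0.1):
    """One time step for one trait on a square grid that wraps around at the edges.
    Each cell flips a coin biased by its own value at the start of the step, once
    per neighbor; the neighbor moves toward the 0/1 outcome by `weight` (assumed)."""
    n = len(grid)
    new = [row[:] for row in grid]
    for i in range(n):
        for j in range(n):
            p = grid[i][j]  # coin bias comes from the sender's old value
            for di, dj in OFFSETS:
                coin = 1.0 if rng.random() < p else 0.0
                ti, tj = (i + di) % n, (j + dj) % n  # modular indices wrap the grid
                new[ti][tj] = (1.0 - weight) * new[ti][tj] + weight * coin
    return new

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def neighbor_correlation(grid):
    """Correlation between each cell and its right and lower neighbors:
    a rough measure of how blotchy the map is."""
    n = len(grid)
    here, there = [], []
    for i in range(n):
        for j in range(n):
            for di, dj in ((0, 1), (1, 0)):
                here.append(grid[i][j])
                there.append(grid[(i + di) % n][(j + dj) % n])
    return pearson(here, there)

rng = random.Random(1)
size = 20  # smaller than the 40x40 grid in the post, to keep the run short
grid = [[rng.random() for _ in range(size)] for _ in range(size)]
before = neighbor_correlation(grid)
for _ in range(30):
    grid = diffuse_step(grid, rng)
after = neighbor_correlation(grid)
print(before, after)
```

Because every update is a weighted average of a value in [0, 1] and a 0 or 1, trait values never leave that range. The neighbor correlation should start near zero and turn clearly positive as blotches form, which is the loss of degrees of freedom that the argument turns on.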
Corresponding Matlab code is here.