Corpus-Wide Association Studies


I've spent the past couple of days at GURT 2012, and one of the interesting talks that I've heard was Julian Brooke and Sali Tagliamonte, "Hunting the linguistic variable: using computational techniques for data exploration and analysis". Their abstract (all that's available of the work so far) explains that:

The selection of an appropriate linguistic variable is typically the first step of a variationist analysis whose ultimate goal is to identify and explain social patterns. In this work, we invert the usual approach, starting with the sociolinguistic metadata associated with a large scale socially stratified corpus, and then testing the utility of computational tools for finding good variables to study. In particular, we use the 'information gain' metric included in data mining software to automatically filter a huge set of potential variables, and then apply our own corpus reader software to facilitate further human inspection. Finally, we subject a small set of particularly interesting features to a more traditional variationist analysis.

This type of data-mining for interesting patterns is likely to become a trend in sociolinguistics, as it is in other areas of the social and behavioral sciences, and so it's worth giving some thought to potential problems as well as opportunities.

In this case, the data being mined was from the Toronto English Project, which is available only to the people who collected it.  But an increasing number of increasingly-large collections of sociologically-relevant speech and text are being published or otherwise made available to researchers at large.  And there are many standard ways to search large datasets for features that are especially informative with respect to some classification or regression task.

In the text-classification domain, see e.g. George Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification", Journal of Machine Learning Research, 2003. Or, in an example more similar to what Brooke & Tagliamonte did, Jonathan Schler et al. ("Effects of Age and Gender on Blogging", AAAI 2005) used information gain to find and rank age-related and gender-related features — typical variables of interest to sociolinguists.
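Information gain itself is simple to compute: it's the reduction in the entropy of the class labels that we get from knowing a feature's value. A minimal sketch in Python, with invented toy data (not Brooke & Tagliamonte's actual code or corpus):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy from splitting the data on a feature."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy example: does use of some variant X predict age group (young/old)?
uses_x = [1, 1, 1, 0, 0, 0, 1, 0]
age    = ['y', 'y', 'y', 'o', 'o', 'o', 'y', 'o']
print(information_gain(uses_x, age))  # perfectly predictive here -> 1.0
```

Ranking a huge candidate set by this number, and inspecting the top of the list, is essentially the filtering step the abstract describes.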

In general, I feel that the application of such techniques is a wonderful opportunity for linguists of all kinds. But at the same time, there are some difficult problems that we need to learn to think about, many of which are covered by the dismissive term "data dredging".

First, unconstrained data-mining is likely to turn up many relationships that are accidental. The basic reason is that in the datasets we're talking about, the number of possible values of linguistic and sociological features or feature-combinations is very large, far larger than the number of training examples. Relevant sociological features include gender, age, geography, ethnicity, SES, etc. Relevant linguistic features include thousands of common words, many values of many pronunciation alternatives, many phrase structures and phrase-structural contexts, many interactional functions and contexts — and there are exponentially larger sets of combinations of such features. As a result, there are in effect millions or billions of "trials" in which relationships among features are evaluated. It's guaranteed that some numerically-impressive feature-combinations will occur entirely by accident, at least if our idea of numerically-impressive is just "unlikely to occur by chance".
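The scale of the problem is easy to simulate: test enough pure-noise features and a predictable fraction will clear any fixed "significance" bar by chance alone. A sketch with invented data (3.84 is the 5% critical value of chi-square with one degree of freedom):

```python
import random

random.seed(0)
n_speakers, n_features = 100, 1000
# A random binary "sociological" attribute, e.g. one of two groups
group = [random.random() < 0.5 for _ in range(n_speakers)]

def chi2_stat(feature, group):
    """Pearson chi-square statistic for a 2x2 table of feature use vs. group."""
    table = [[0, 0], [0, 0]]
    for f, g in zip(feature, group):
        table[int(f)][int(g)] += 1
    n = len(feature)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / n
            if expected:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Every "linguistic feature" here is pure noise, yet about 5% of them
# clear the nominal p < .05 bar anyway.
hits = sum(
    chi2_stat([random.random() < 0.5 for _ in range(n_speakers)], group) > 3.84
    for _ in range(n_features)
)
print(hits)  # on the order of 50 "significant" associations, all accidental
```

Scale the number of candidate features up to the millions or billions mentioned above, and the expected number of accidental "discoveries" scales with it.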

There are various ways to minimize the number of such artefactual discoveries. For example, we can evaluate the performance of selected features and model parameters on held-out test data, using a single train/test division of the data, or perhaps a more elaborate cross-validation scheme. This will help avoid over-fitting due to sampling error.
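A toy illustration of why held-out evaluation matters: select the single best-looking feature on one half of a pure-noise dataset, then score it on the other half (all data here is invented):

```python
import random

random.seed(1)
n, n_features = 200, 500
labels = [random.randrange(2) for _ in range(n)]
data = [[random.randrange(2) for _ in range(n_features)] for _ in range(n)]

def accuracy(j, rows, labs):
    """How well binary feature j predicts the label (allowing either polarity)."""
    correct = sum(r[j] == lab for r, lab in zip(rows, labs))
    return max(correct, len(labs) - correct) / len(labs)

train_rows, train_labs = data[:100], labels[:100]
test_rows, test_labs = data[100:], labels[100:]

# The best-looking of 500 pure-noise features looks impressive on training data...
best = max(range(n_features), key=lambda j: accuracy(j, train_rows, train_labs))
print(accuracy(best, train_rows, train_labs))  # well above chance
# ...but the held-out half exposes it as noise
print(accuracy(best, test_rows, test_labs))    # back near 0.5
```

The training-set score is inflated precisely because the feature was chosen for scoring well there; the held-out score is an honest estimate.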

But data-mining discoveries can be scientifically ephemeral for other reasons. For example, there may be essentially independent variables that happen to be reliably correlated in the particular circumstances of the data collection; and if the testing and training data come from the same source, then cross-validation will not catch non-sampling errors of this kind. In my experience, such things become more rather than less likely to happen as the size and complexity of the data collection increases.

For practical applications where the only goal is correct classification or prediction, and where context of use is similar to the context of training-data collection, this doesn't matter. But if the application environment is different in critical ways, or if the application depends on finding genuinely causal connections, or if we're looking for scientific understanding, then we've still got a problem.

The problems of sampling and non-sampling error have always been present in traditional sociolinguistic research — and it can fairly be argued, here as elsewhere, that published papers may present a few statistically-successful hypotheses against an unpublished background of many others that were considered and silently discarded. But when our computer programs are explicitly considering hundreds of thousands (or millions or billions) of hypotheses, the change of scale is likely to make the problems more serious as well as more obvious.
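One standard response to testing many hypotheses at once is to correct for multiple comparisons; the Benjamini–Hochberg procedure, for instance, controls the false discovery rate rather than the much stricter family-wise error rate of a Bonferroni correction. A minimal sketch (the p-values are invented for illustration, and note that BH's guarantee assumes independent or positively dependent tests, so heavily correlated features remain a complication):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Indices of hypotheses rejected while controlling the FDR at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears its BH threshold q * rank / m
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])  # reject the k smallest p-values

# Eight illustrative p-values: BH rejects five, while a Bonferroni cutoff
# of 0.05 / 8 = 0.00625 would reject only the first.
pvals = [0.001, 0.008, 0.012, 0.024, 0.03, 0.06, 0.3, 0.9]
print(benjamini_hochberg(pvals))  # -> [0, 1, 2, 3, 4]
```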

These issues are familiar ones in statistics, and in the applications of statistics in areas like survey methodology, epidemiology, and machine learning. One especially-relevant domain of research is genome-wide association studies (GWAS), where microarray methods are used to check large numbers of subjects (as many as 200,000) for correlations between large numbers of single-nucleotide polymorphisms (as many as 400,000) and disease states or other phenotypic traits.

What Brooke and Tagliamonte report is essentially a "corpus-wide association study", and many of the issues that have been debated in the case of GWAS are going to arise in CWAS as well. The analogies are far from direct — for instance, the sociolinguistic analogue of "population stratification" might be "context stratification" — but the comparison is still worth making.


  1. Tanja said,

    March 11, 2012 @ 8:04 am

    As we move more and more towards exploratory data analysis, I think correcting for multiple comparisons (e.g., via false discovery rate control) needs to become a standard part of corpus-linguistic methodology. Granted, it won't solve all of the problems mentioned above, but it will be a step in the right direction.

    [(myl) This is true in principle, but in practice, correcting for multiple comparisons can be hard to do. The Bonferroni correction is likely to be far too stringent, since the very large number of "trials" are far from independent; but it's not easy to figure out a reliable way to estimate and take account of the even larger number of covariances involved.

    And in the situation under discussion, we have a data-mining step used for feature selection, and then a statistical inference step associated with evaluation of hypotheses associated with the features. The hypotheses in question may be (or at least appear to be) different from the relationships used in feature selection, making it even harder to figure out what sort of significance-correction to use.]

  2. Eric P Smith said,

    March 11, 2012 @ 9:16 am

    The way to eliminate artefactual discoveries is to use statistics properly. The experimenter must first choose his hypothesis, and then test it statistically on a suitable corpus. If he uses a corpus to suggest a hypothesis, then he must test it on an independent corpus. If he chooses to test multiple hypotheses, then he must appropriately strengthen his test of significance for every one of them.

    [(myl) This is a bit naive with respect to traditional hypothesis-testing, since the choice of hypothesis typically emerges from (formal or informal) interaction with data that is either identical to or highly correlated with the "suitable corpus" on which testing is to be done. It's also a bit naive about how independent a "second" corpus is likely to be. And it leaves open the problem of how to "appropriately strengthen" significance tests for multiple hypotheses, given non-independent comparisons.

    Your prescription seems to outlaw exploratory data analysis, or at least to make it extremely expensive; but in fact the consequence, for people who follow the traditional hypothesis-testing ideal that you present, seems to be simply that EDA is driven underground and done without safeguards. It seems to me to be better to admit that EDA (and a fortiori data-mining) is a useful thing to do, while recognizing that there are many ways it can go wrong.

    And in that context, I'm personally skeptical that there's any always-valid formal or procedural safeguard against false inference that isn't so conservative as to bring investigation completely to a halt. It seems better to rely on the traditional quasi-adversarial process of scientific debate (including debate with one's self) to reach an appropriate balance.]

  3. bks said,

    March 11, 2012 @ 10:49 am

    Eric, that's the way I learned it at University, but in the biological realm the algorithm is: 1) Collect data, 2) Notice pattern, 3) Tweak data and statistics until P < .05, 4) Publish.

    Sad but true. There just are not statistical methods for dealing with data sets that have large quantities of missing data and for data which has natural variation as a feature, not a bug.


  4. D.O. said,

    March 11, 2012 @ 12:49 pm

    The best (though hard) safeguard would be to search for a meaningful interpretation. That is, if feature A correlates with feature B, but not feature C, one should come up with a testable hypothesis of why it might be so. Simply coming out with a result like "women of Midwestern origin in the 40-50 year cohort are more prone to stress the second syllable in Italian loan-words" is not good enough.

  5. Eric P Smith said,

    March 11, 2012 @ 2:42 pm

    Fascinating to be so roundly contradicted. Thank you myl and bks. I'll not argue. You'll gather I'm more familiar with theory than practice.

  6. Adrian Morgan said,

    March 11, 2012 @ 6:01 pm

    "Present and Future Applications of Data Mining in Linguistics" was the title of an assignment I wrote in an undergraduate data mining course around 2003. The topic was free choice (approval required), and the assignment was written in the form of an academic paper.

    I can make the file available on request, but it was written a decade ago by an undergraduate (and not even a linguistics undergraduate), so it's not much use to anyone seeking up-to-date quality information about data mining and linguistics. But it could still be of interest to someone who simply enjoys talking about this stuff. One self-criticism of my own work is that I was in places insanely optimistic about what might be possible with future technology.

    I got 8.5 out of 10 for it.

    [(myl) Congratulations! But in some sense, people have been doing text data mining since Kucera and Francis at Brown and Salton at Cornell in the 1960s; and people started calling it "data mining" in the 1990s — see this 1999 review and prospectus by Marti Hearst. So you definitely caught the wave in 2003, but the wave was there to be caught.]

  7. thomas said,

    March 11, 2012 @ 7:32 pm

    correlations between large numbers of single-nucleotide polymorphisms (as many as 400,000) and disease state

    I wish we only had 400,000. You can't buy a general-purpose genome-wide SNPchip that small any more. We've been measuring a million SNPs and expanding by imputation to 2.5 million for years, and Illumina will now sell you 5-million SNP chips.

    [(myl) Wow. My experience with such things goes back to the work of a grad student I helped to advise a decade ago, and clearly my facts are an order of magnitude out of date. I should have known that Moore's Law would continue to apply here as elsewhere.]

  8. Corpus-Wide Association Studies « Another Word For It said,

    March 11, 2012 @ 8:10 pm

    […] Corpus-Wide Association Studies by Mark Liberman. […]

  9. Adrian Morgan said,

    March 12, 2012 @ 6:13 am


    Looking at my 2003 paper, I see that I actually cited Hearst! (I'd totally forgotten, but just happened to look and notice.) The context, from page two:

    Most linguistic applications of data mining thus far established are in computational linguistics – specifically language technology research – where the goal is not to discover knowledge that is useful to a human researcher on theoretical grounds, but to facilitate a computer with the knowledge it needs to perform efficient and accurate language tasks (examples in Hearst 1999, Stevenson 2003). This division affects the optimal degree of compactness versus redundance, the relative importance of rules and exceptions (see Daelemans 1999), the relevance of whether something is obvious or established knowledge, and other considerations that affect the choice of mining strategy, the way it is carried out, and the presentation of the results.

    [(myl) I think that 8.5 out of 10 was low — you should appeal.]

  10. Jacob Eisenstein said,

    March 14, 2012 @ 7:59 pm

    Nice post, I like the analogy to GWAS, where a similar situation holds — lots of "predictors" and lots of "outputs," and only a few meaningful associations.

    In GWAS, there's been some cool recent work on estimating models with structured sparsity. The idea is to find just a few relevant predictors or outputs (or just a few predictor-output associations; this is the regular old "unstructured" sparsity), while setting all other associations to zero. (For example, see Kim and Xing, PLoS Genetics 2009).

    We applied this idea to sociolinguistic associations in an ACL paper last year. This enabled us to find a small set of words which correlate with a range of different (interrelated) demographic attributes. We then ran a traditional significance test on only these words, with a conservative correction for multiple comparisons.

    [(myl) Neat — I'm sorry to have missed this when it came out, and happy to cite it here: Jacob Eisenstein et al., "Discovering Sociolinguistic Associations with Structured Sparsity", ACL 2011. The data and code are also available on Jacob's web site.]
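The sparse-selection idea in that paper can be glimpsed in miniature with a plain lasso: an L1 penalty drives most coefficients exactly to zero, leaving a small candidate set for conventional testing. A toy sketch via coordinate descent (ordinary unstructured lasso on invented data, not Eisenstein et al.'s model, which additionally couples sparsity patterns across related demographic attributes):

```python
import random

def soft(z, t):
    """Soft-thresholding: the proximal operator of the L1 penalty."""
    return z - t if z > t else z + t if z < -t else 0.0

def lasso(X, y, lam, iters=300):
    """Coordinate-descent lasso: minimize (1/2n)||y - Xw||^2 + lam*||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(iters):
        for j in range(p):
            # correlation of feature j with the partial residual (j's own effect removed)
            rho = sum(X[i][j] * (y[i] - sum(w[k] * X[i][k] for k in range(p))
                                 + w[j] * X[i][j]) for i in range(n)) / n
            w[j] = soft(rho, lam) / col_sq[j] if col_sq[j] else 0.0
    return w

# Toy data: only feature 0 truly matters; the L1 penalty zeroes out the rest
random.seed(2)
n = 60
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(n)]
y = [2.0 * X[i][0] + random.gauss(0, 0.5) for i in range(n)]
w = lasso(X, y, lam=0.3)
print([round(v, 2) for v in w])  # w[0] large; the other weights at or near zero
```

The surviving nonzero coefficients are then few enough that a conservative multiple-comparisons correction no longer swamps the analysis.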

  11. Post-Easter link catch-up « The Outer Hoard said,

    April 22, 2012 @ 10:03 am

    […] the comments of a Language Log post a while back, I mentioned an old university assignment of mine. For the record, it's […]

  12. A 3rd Point of Entry | Linguistics 212 said,

    November 29, 2012 @ 1:31 pm

    […] read the article – which quotes the abstract of a paper by someone who is doing this – here. On the one hand, it would be nice if we could give a computer all of our 'Iraq' data, […]
