I've spent the past couple of days at GURT 2012, and one of the interesting talks that I heard was Julian Brooke and Sali Tagliamonte's "Hunting the linguistic variable: using computational techniques for data exploration and analysis". Their abstract (all that's available of the work so far) explains that:
The selection of an appropriate linguistic variable is typically the first step of a variationist analysis whose ultimate goal is to identify and explain social patterns. In this work, we invert the usual approach, starting with the sociolinguistic metadata associated with a large scale socially stratified corpus, and then testing the utility of computational tools for finding good variables to study. In particular, we use the 'information gain' metric included in data mining software to automatically filter a huge set of potential variables, and then apply our own corpus reader software to facilitate further human inspection. Finally, we subject a small set of particularly interesting features to a more traditional variationist analysis.
This type of data-mining for interesting patterns is likely to become a trend in sociolinguistics, as it is in other areas of the social and behavioral sciences, and so it's worth giving some thought to potential problems as well as opportunities.
In this case, the data being mined was from the Toronto English Project, which is available only to the people who collected it. But an increasing number of increasingly-large collections of sociologically-relevant speech and text are being published or otherwise made available to researchers at large. And there are many standard ways to search large datasets for features that are especially informative with respect to some classification or regression task.
In the text-classification domain, see e.g. George Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification", Journal of Machine Learning Research, 2003. Or, in an example more similar to what Brooke & Tagliamonte did, Jonathan Schler et al. ("Effects of Age and Gender on Blogging", AAAI Spring Symposium 2006) used information gain to find and rank age-related and gender-related features — typical variables of interest to sociolinguists.
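For readers who haven't met the metric: "information gain" is just the reduction in the entropy of the class labels that you get by splitting the data on a feature. Here's a minimal sketch in Python, with an invented toy example (the feature and the age groups are made up for illustration, not taken from the Toronto data):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a sequence of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy from splitting the data on a feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Invented toy data: does a speaker use quotative "like" (1/0),
# and are they under 30 ("young") or not ("old")?
uses_like = [1, 1, 1, 0, 0, 0, 1, 0]
age_group = ["young", "young", "young", "old", "old", "old", "young", "old"]
print(information_gain(uses_like, age_group))  # 1.0 bit: feature fully predicts age group
```

Ranking thousands of candidate features by this number, as the data-mining packages do, is then just a matter of computing it once per feature and sorting.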
In general, I feel that the application of such techniques is a wonderful opportunity for linguists of all kinds. But at the same time, there are some difficult problems that we need to learn to think about, many of which are covered by the dismissive term "data dredging".
First, unconstrained data-mining is likely to turn up many relationships that are accidental. The basic reason is that in the datasets we're talking about, the number of possible values of linguistic and sociological features or feature-combinations is very large, much larger than the number of training examples. Relevant sociological features include gender, age, geography, ethnicity, SES, etc. Relevant linguistic features include thousands of common words, many values of many pronunciation alternatives, many phrase structures and phrase-structural contexts, many interactional functions and contexts — and there are exponentially larger sets of combinations of such features. As a result, there are in effect millions or billions of "trials" in which relationships among features are evaluated. It's guaranteed that some numerically-impressive feature-combinations will occur entirely by accident, at least if our idea of numerically-impressive is just "unlikely to occur by chance".
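To see how easily "numerically impressive" patterns arise at this scale, here's a small synthetic simulation (all the numbers are invented for illustration): we test thousands of coin-flip "features" against a coin-flip "label", so every association is accidental by construction — and yet dozens of features look strongly associated:

```python
import random

random.seed(0)
n_speakers, n_features = 50, 10_000

# A "label" with no real structure: a coin flip per speaker.
labels = [random.randint(0, 1) for _ in range(n_speakers)]

def best_accuracy(feature, labels):
    """Accuracy of the better of the feature's two polarities as a predictor."""
    hits = sum(f == lab for f, lab in zip(feature, labels))
    return max(hits, len(labels) - hits) / len(labels)

# Each "linguistic feature" is an independent coin flip too.
impressive = 0
for _ in range(n_features):
    feature = [random.randint(0, 1) for _ in range(n_speakers)]
    if best_accuracy(feature, labels) >= 0.70:  # looks "numerically impressive"
        impressive += 1

print(impressive)  # typically dozens of purely accidental "discoveries"
```

With 50 speakers, the chance that any one random feature hits 70% accuracy is well under 1%; but multiplied over 10,000 candidate features, such "discoveries" are guaranteed.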
There are various ways to minimize the number of such artefactual discoveries. For example, we can evaluate the performance of selected features and model parameters on held-out test data, using a single train/test division of the data, or perhaps a more elaborate cross-validation scheme. This will avoid over-fitting due to sampling error.
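A minimal illustration of why the held-out test matters (again with synthetic data, so that every association is accidental by construction): if we both select a feature and score it on the same data, selection over thousands of candidates inflates the score; selecting on a training half and scoring the frozen predictor on a held-out half removes that inflation, at least for sampling error:

```python
import random

random.seed(1)
n, n_features = 100, 2_000
half = n // 2

labels = [random.randint(0, 1) for _ in range(n)]
features = [[random.randint(0, 1) for _ in range(n)] for _ in range(n_features)]

def fit_score(feature, labels):
    """Training accuracy of the better of the feature's two polarities."""
    hits = sum(f == y for f, y in zip(feature, labels))
    return max(hits, len(labels) - hits) / len(labels)

# Naive: select the best-looking feature using ALL of the data.
apparent = max(fit_score(f, labels) for f in features)
print("apparent accuracy:", apparent)  # inflated by selecting over 2,000 candidates

# Honest: select the feature AND fix its polarity on the training half only,
# then score that frozen predictor on the held-out half.
best = max(features, key=lambda f: fit_score(f[:half], labels[:half]))
train_hits = sum(f == y for f, y in zip(best[:half], labels[:half]))
polarity = 1 if train_hits >= half - train_hits else 0
heldout = sum((f if polarity else 1 - f) == y
              for f, y in zip(best[half:], labels[half:])) / half
print("held-out accuracy:", heldout)  # roughly chance-level, as it should be
```

A proper cross-validation scheme just repeats this train/test discipline over several rotations of the split; the crucial point is the same in either case — no statistic computed on the test portion may influence which features or parameters get chosen.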
But data-mining discoveries can be scientifically ephemeral for other reasons. For example, there may be essentially independent variables that happen to be reliably correlated in the particular circumstances of the data collection; and if the testing and training data come from the same source, then cross-validation will not catch non-sampling errors of this kind. In my experience, such things become more rather than less likely to happen as the size and complexity of the data collection increases.
For practical applications where the only goal is correct classification or prediction, and where context of use is similar to the context of training-data collection, this doesn't matter. But if the application environment is different in critical ways, or if the application depends on finding genuinely causal connections, or if we're looking for scientific understanding, then we've still got a problem.
The problems of sampling and non-sampling error have always been present in traditional sociolinguistic research — and it can fairly be argued, here as elsewhere, that published papers may present a few statistically-successful hypotheses against an unpublished background of many others that were considered and silently discarded. But when our computer programs are explicitly considering hundreds of thousands (or millions or billions) of hypotheses, the change of scale is likely to make the problems more serious as well as more obvious.
These issues are familiar ones in statistics, and in the applications of statistics in areas like survey methodology, epidemiology, and machine learning. One especially-relevant domain of research is genome-wide association studies (GWAS), where microarray methods are used to check large numbers of subjects (as many as 200,000) for correlations between large numbers of single-nucleotide polymorphisms (as many as 400,000) and disease states or other phenotypic traits.
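One standard GWAS defense against this flood of tests is a Bonferroni-style correction, which divides the tolerated family-wise error rate by the number of tests performed. For the (hypothetical, order-of-magnitude) numbers above, the arithmetic looks like this:

```python
# Hypothetical numbers at the GWAS scale mentioned above.
n_tests = 400_000        # SNPs tested for association with the phenotype
alpha_family = 0.05      # tolerated probability of ANY false positive
alpha_per_test = alpha_family / n_tests  # Bonferroni-corrected threshold
print(alpha_per_test)    # 1.25e-07
```

A per-test p-value threshold around 10^-7 is why GWAS results are reported only at what look, by social-science standards, like absurdly stringent significance levels — and a "corpus-wide association study" scanning comparable numbers of candidate variables would need comparably stringent standards.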
What Brooke and Tagliamonte report is essentially a "corpus-wide association study", and many of the issues that have been debated in the case of GWAS are going to arise in CWAS as well. The analogies are far from direct — for instance, the sociolinguistic analogue of "population stratification" might be "context stratification" — but the comparison is still worth making.