(Not) trusting data


Pete Warden, "Why you should never trust a data scientist", 7/18/2013:

The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries. […]

I’ve never ceased to be disturbed at how the inclusion of numbers and the mention of large data sets numbs criticism. The articles live in a strange purgatory between journalism, which most readers have a healthy skepticism towards, and science, where we sub-contract verification to other scientists and so trust the public output far more. If a sociologist tells you that people in Utah only have friends in Utah, you can follow a web of references and peer review to understand if she’s believable. If I, or somebody at a large tech company, tells you the same, there’s no way to check. The source data is proprietary, and in a lot of cases may not even exist any more in the same exact form as databases turn over, and users delete or update their information. Even other data scientists outside the team won’t be able to verify the results. The data scientists I know are honest people, but there’s no external checks in the system to keep them that way.

Amen. Except that Warden's trust in that "web of references and peer review" is naive and unfounded.  Most peer-reviewed scientific papers are based on unpublished (and typically unavailable) data, and under-documented (and often crucially errorful) methods. Journals are reluctant to publish negative results (for the plausible reason that there are lots of ways to screw up an experiment), and equally reluctant to publish failures to replicate positive ones. For these and other reasons, most peer-reviewed scientific papers are wrong, and the more prominent the journal, the less likely published results are to be replicable.

A serious effort is underway to ameliorate if not fix these problems, but there's a long way to go.

Warden's concluding advice is likely to make things worse:

What should you do? If you’re a social scientist, don’t let us run away with all the publicity, jump in and figure out how to work with all these new sources.

Successful scientific PR is not necessarily antithetical to valid science, but there's good evidence of a negative correlation.  A couple of random examples, out of dozens from LL coverage over the years: "Debasing the coinage of rational inquiry",  4/22/2009; "'Vampirical' hypotheses", 4/28/2011.

It's not helpful to urge scientists to grab hunks of "big data" and pursue publicity even more avidly. Regular LL readers know that I'm a strong proponent of empirical methods in studies of speech, language, and communication — but whatever the dataset sizes and analysis methods involved, the key methodological issue is reproducibility. This normally requires publication of all (raw) data and (implemented) methods.

Ironically, traditional "armchair" syntax and semantics is entirely reproducible: The explicandum is a pattern of judgments about specified examples. You can disagree about the judgments, or about the argument from the pattern of judgments to a conclusion, but all the cards are on the table. The same thing is true of traditional work in phonology and morphology, which makes assertions about patterns of documented lexical fact.

But traditional experimental research in phonetics, psycholinguistics, sociolinguistics, corpus linguistics, neurolinguistics etc. is generally not reproducible: the raw data is usually not available; detailed annotations or classifications of the data may be withheld along with documentation of the methods used to create them; the fine details of the statistical analysis may be unavailable (e.g. decisions about data inclusion and exclusion, specific methods used, possible algorithmic or coding errors).

Does this matter? Often, the lack of transparency in scientific publication hides over-interpretation, mistakes, and even outright fraud — see e.g. the priming controversy, the fall of Marc Hauser, the Duke biomarkers scandal, and so on.

This is unfortunately not an unusual situation — there are many examples within linguistics where false or misleading ideas have become widely accepted on the basis of flawed experimental evidence, and where access to the experimental data would probably have limited the damage.

Here's one example among many: A series of important papers from 1976 onwards argued for a categorical distinction between e.g.

le mappe di città [v:]ecchie  "the maps of old cities"
le mappe di città [v]ecchie   "the old maps of cities"

This conclusion was originally based on native-speaker intuitions, though none of the original authors spoke a relevant dialect of Italian. Intuition was later supported by a small phonetic experiment, which was crucially effective in countering native speakers who doubted the judgments. This "fact" was crucial evidence in favor of a widely-accepted hypothesis, namely that well-defined prosodic constituents exist, arranged in a "prosodic hierarchy", and that crisp formal rules define the relationship between syntactic structures and prosodic structures, which in turn govern the application of certain external sandhi rules, of which raddoppiamento (fono)sintattico became a paradigm example.

This argument was very influential throughout the 1980s and 1990s.

But in fact, the basic observation was completely wrong. Italian raddoppiamento sintattico works rather like English flapping and voicing — it can apply anywhere in connected speech. The cited phonetic measurements were apparently due to facultative disambiguation: insertion of overt silent pauses to disambiguate (somewhat unnatural) sentences presented as minimal pairs. For a detailed summary of the situation, see e.g. Matthew Absalom et al., "A Typology of Spreading, Insertion and Deletion or What You Weren't Told About Raddoppiamento Sintattico in Italian", ALS 2002.

The view of RS as an "anywhere" rule was strongly supported by corpus-based work, e.g. in Agostiniani, "Su alcuni aspetti del rafforzamento sintattico in Toscana e sulla loro importanza per la qualificazione del fenomeno in generale", Quaderni del Dipartimento di Linguistica, Università degli studi di Firenze (1992). And in 1997, one of the original authors admitted that "… notably in Tuscan and romanesco, raddoppiamento fonosintattico [RS] seems to apply throughout sentences without regard to their syntactic (and [derived] phonological) constituency" (Irene Vogel, "Prosodic phonology", in Maiden & Parry (eds), The Dialects of Italy).

How did this happen? I'm not asking about the sociological process whereby a 35-year-old false generalization, abandoned by its originators 15 years ago, continues to be treated by some as part of the foundations of the field. Rather, I want to discuss the natural processes that led to the wrong generalization in the first place.

1. "Facultative disambiguation" — The natural contrastive effect of considering a minimal pair in juxtaposition usually leads to an exaggeration of the natural distinctions, and sometimes to the deployment of unusual (phonetic or pragmatic) resources in order to create a clear separation.

2. "Selection bias" — It's natural to choose cases where a phenomenon of interest seems to be especially clear, and this often leads to the selection of examples from the ends of a continuum or from widely-separated regions of a more complex space; or perhaps examples where some additional associated characteristics reinforce the apparent differences.

3. "Confirmation bias" — As an apparent pattern begins to emerge, we (individually or as a field) tend to focus on evidence that confirms the pattern, and to put problematic or equivocal evidence into the background.

All of these things can easily happen with laboratory experiments as well as with intuitions: We choose experimental materials that seem to work especially well ("selection bias"); experimental subjects are likely to notice (near-) minimal pairs, and to exaggerate the contrasts that they imply ("facultative disambiguation"); and experiments often don't work for irrelevant reasons, and so it's tempting (and often correct) to put "failed" experiments aside in favor of "successful" ones.

Of course, all of these things — especially selection bias and confirmation bias — can also happen in corpus-based research. But both in laboratory experiments and corpus studies,  the best way to avoid or fix such mistakes is to make sure that all of the data and methods are available for others to check and extend.

Beyond possible problems with flawed, mistaken, or outright fraudulent studies, there are significant positive benefits to "reproducibility": it reduces barriers to entry, and speeds up extension as well as replication. The greatest benefits accrue to the original researchers themselves, who don't have to waste time trying to remember or recreate what they did to get some results from a few years (or even a few months) earlier.

So we can hope that some day, experimental research on speech and language will be as reproducible as armchair linguistics always has been.



  1. bks said,

    August 4, 2013 @ 9:39 am

    This normally requires publication of all (raw) data and (implemented) methods.

    But what is the raw data? A thousand recordings of karyotyped native speakers along with a GPS reading and meteorological circumstances under which each was taken? Does it include the name and coordinates of the graduate students who did the transcriptions?

    [(myl) The details vary with the kind of research, obviously. In the case of research in acoustic phonetics, for example, the "raw data" would normally include the audio recordings, any relevant scripts, instructions, or transcriptions; any segmentation, labelling, or other measurements that play a role in the analysis; and the basic demographic information about the speakers. As a general matter of intellectual hygiene in such experiments, it's a good idea to check inter-annotator agreement; and information about who (or what) did any transcriptions or other analyses would not be a bad piece of information to retain.]
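    [As a concrete illustration of the inter-annotator agreement check mentioned above — this sketch is not from the post, and the labels, data, and function name are all hypothetical — Cohen's kappa is one common statistic for two annotators labeling the same items, correcting raw agreement for the agreement expected by chance:]

    ```python
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa for two annotators who labeled the same items."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        # Observed agreement: fraction of items labeled identically.
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement, from each annotator's marginal label frequencies.
        freq_a = Counter(labels_a)
        freq_b = Counter(labels_b)
        p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    # Two hypothetical transcribers labeling the same 8 segments as V or C:
    a = ["V", "C", "V", "V", "C", "C", "V", "C"]
    b = ["V", "C", "V", "C", "C", "C", "V", "V"]
    print(cohens_kappa(a, b))  # 6/8 raw agreement, 0.5 chance -> kappa = 0.5
    ```

    [Here the transcribers agree on 6 of 8 items (0.75), but since each uses the two labels equally often, chance agreement is 0.5, so kappa is only 0.5 — which is why reporting kappa alongside raw agreement, and retaining the per-annotator labels themselves, is the kind of information worth keeping.]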

    I once requested the raw data underlying a linkage mapping (genomics) experiment. I received a spreadsheet replete with ten thousand numbers and no column headings.

    [(myl) Was this the result of stupidity, or was it a big "screw you"? Or is it possible that the authors had simply mislaid all the rest of the information?]


  2. Rodger C said,

    August 4, 2013 @ 11:50 am

    The articles live in a strange purgatory

    Sounds more like a limbo to me.

  3. Johan Rooryck said,

    August 4, 2013 @ 2:15 pm

    Dear Mark,
    I am not sure I understand your "Facultative disambiguation". The comparison of minimal pairs is a time-honoured methodology in linguistics, going back to structuralism (at least). Can you give some examples of the "(phonetic or pragmatic) resources in order to create a clear separation"? What do you mean by "the natural distinctions" that are exaggerated in such pairs?

    [(myl) People can usually find some way to communicate a distinction, if motivated to do so. For example, if they want to distinguish two words whose sounds are normally close, like pet and pat, they may just exaggerate the normal difference, pushing the performances apart phonetically. If they want to distinguish two words that they normally pronounce in the same way, but which are spelled differently, like latter vs. ladder or affect vs. effect, they may use a "spelling pronunciation". If they want to distinguish two bracketings, like [[new oil] prices] vs. [new [oil prices]], they may insert an exaggerated silent pause at the top-level juncture. What all of these things have in common is that there are two different underlying forms whose productions are overlapping or entirely merged in normal speech, but may be more clearly distinguished by exaggerated or artificial means when speakers are motivated to differentiate minimal pairs.

    It's perfectly fine to use such exaggerated or artificial performances as evidence for the existence of an underlying distinction. But as a way to characterize the patterns of ordinary speech, this kind of experiment is seriously flawed.]

  4. Michael Newman said,

    August 4, 2013 @ 3:21 pm


    The implication of your criticism is that there should be standard expectations regarding the form of data made available and the metadata to be attached. In fact, development of standards in this regard can help diminish the biases of IRBs in favor of data destruction, and provide guidance to new researchers.

    I want to mention two related obstacles to open use of corpora in sociolinguistics.
    One is the bias in favor of collecting ad hoc corpora for studies when existing corpora might work just as well. For example, I've heard criticisms of a dissertation based on the fact that the student hadn't collected her own speech samples. I don't think this bias exists in child language research, but it does in sociolinguistics.

    The other is the problem of researchers having a corpus but severely limiting access. I'm not sure if granting agencies have a policy on this, but they should.

    [(myl) I've certainly also observed both of these factors. The second one is something that needs to be taken into account at the planning stage, since ethical considerations (as well as IRB policies) dictate that participants should understand and agree to the plan for the data they help to create. That plan can include access contracts that protect confidentiality and privacy in various ways, as appropriate given the circumstances of the data collection.]

  5. Rubrick said,

    August 4, 2013 @ 11:08 pm

    I, for one, would be willing to endure the hardship of living without your Language Log posts if you put aside all your other responsibilities and spent a year writing a really good book-length manifesto on this topic. But I suspect your employer, collaborators, and students might feel otherwise. (I still can't believe you actually hold down a job in addition to all your superb LL writing.)

  6. Catanea said,

    August 5, 2013 @ 4:20 am

    I love it that "Rubrick" came out in red.
    And I second the motion.

  7. tk said,

    August 5, 2013 @ 5:36 am

    An interesting IRB-related dilemma has arisen in soc-cult anthro, in re whether or not to identify ‘consultants’, either in publication or after. I have been faced, in a public gathering at which I discussed some of my findings, with a person who demanded to know who I had talked to, saying, in effect, “What are their creds; I don’t think they knew what they were talking about.”

    In this case, I was able to divert some of the criticism by saying, “I can’t name them as they are deceased, and your cultural norms frown on speaking the names of the dead.” But more to the point of the value of identifying consultants came when a friend rose and said, “He spoke with my Daddy. If you have any problems with that, speak to me about it.”

    BTW, I have been able to trace the criticism, “Your consultants were just making things up,” to at least 1912, when Alice Fletcher and Francis La Flesche’s “The Omaha Tribe” criticized James Owen Dorsey’s “Omaha Sociology” (1885) on those grounds. Ironically, one of Dorsey’s consultants was La Flesche’s own father.

  8. Mark P said,

    August 6, 2013 @ 9:45 am

    I come originally from the atmospheric sciences, where I don't think this problem really exists. Most research in the atmospheric sciences that I am (used to be) familiar with involved data sets obtained from public sources (for example, NOAA or NASA). Other, less public sources would always be made available.

    In contrast, in the field in which I have spent most of my life (defense), data sets are actively hidden by security procedures. The data are shared within certain allowed groups, and the results are reported, but no outside agency has or can normally get access. A colleague who had attended a meeting at a high classification level said that one participant complained that classification was too often used to hide bad science. In other situations, apparently the whereabouts of the data are simply forgotten. A critique of the current missile defense program found that top-level executives of the agency apparently did not know where data from previous flight tests was. Or at least they weren't saying.

  9. J.W. Brewer said,

    August 6, 2013 @ 4:36 pm

    I wonder to what extent there's a tension at the planning stage between: a) the approach that will maximize usefulness of the dataset to other researchers later; and b) the approach that will lead to the least hassle in getting IRB approval. Since b) involves a near-term obstacle that must be gotten past before you even have the dataset that raises the question of whether you've optimized a), that could lead to some perverse incentives for researchers dealing with a prone-to-creating-hassles IRB that has difficulty distinguishing between recording people's speech and injecting them with experimental drugs with unknown side effects. Is there a better way to deal with this than by encouraging sources of funding to push so hard for a) up front that avoiding IRB hassles is not going to be the researchers' path of least resistance?

  10. Timothy Mills said,

    August 7, 2013 @ 9:53 pm

    I have never heard this term "facultative disambiguation" before. It is a very relevant concept for the theory section of a paper I am currently revising. A quick Google Scholar search gives me no hits for the term as a unit, and a broader Google search points me to several LL posts, but nothing else.

    Can you point me to a peer-reviewed (or at least conference) discussion of the term that I could cite?
