Language Log

Reliability

February 28, 2015 @ 9:43 am · Filed by Mark Liberman under Language of science

On Thursday and Friday, I participated in a workshop on "Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results" at the National Academy of Sciences in Washington DC.

Some of the presentations were even more horrifying than I expected — at one point, an audience member was moved to ask half-seriously whether ANY reproducible result has ever been published in biomedical research — but others described positive trends and plans.

There was a good deal of discussion about terminology. What do words like replicable and reproducible really mean? What is the range of relevant concepts across different disciplines and subdisciplines? How can we line up the concerns that apply to (say) observational research on large samples of human populations, with the issues that come up in experiments on lab animals, or the concepts relevant in empirical geophysics, or in computer modeling of protein folding or global warming?

The reason that we should care, of course, is that the cumulative development of scientific and technical knowledge depends on individual results being more or less reliable. If most published findings don't generalize — and one horrifying subfield was identified where the replication rate is apparently about 1% — then we might as well be doing philosophy or literary criticism, which are much cheaper and also more entertaining.

Some of the reasons for the problems are well known. There's the "file drawer effect", where you try many experiments and only publish the ones that produce the results you want. There's p-hacking, data-dredging, model-shopping, etc., where you torture the data until it yields a "statistically significant" result of an agreeable kind. There are mistakes in data analysis, often simple ones like using the wrong set of column labels. (And there are less innocent problems in data analysis, like those described in this article about cancer research, where some practices amount essentially to fraud, such as performing cross-validation while removing examples that don't fit the prediction.) There are uncontrolled covariates — at the workshop, we heard anecdotes about effects that depend on humidity, on the gender of experimenters, and on whether animal cages are lined with cedar or pine shavings. There's a famous case in psycholinguistics where the difference between egocentric and geocentric coordinate choice depends on whether the experimental environment has salient asymmetries in visual landmarks (Peggy Li and Lila Gleitman, "Turning the tables: language and spatial reasoning", Cognition 2002).

I understand that most of the presentation slides will eventually be available on the conference website. Meanwhile my presentation, on "Reproducible Computational Experiments", can be found at the link. I described, as I have in several other places, the history of the now-ubiquitous common task method.

The goal is to promote reliable improvement in techniques for algorithmic analysis of (digital recordings of) the natural world. The method starts by publishing two things: a recipe for quantitative evaluation of candidate techniques, and a large collection of training data. Suitable sets of similar examples are withheld for subsequent testing, which is often administered by a neutral third party like the National Institute of Standards and Technology.
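
As a rough illustration of the structure just described, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than anything taken from an actual evaluation campaign: the names `split_corpus`, `accuracy` and `Evaluator` are hypothetical, the "recipe for quantitative evaluation" is reduced to plain accuracy, and the "neutral third party" is reduced to an object that keeps the held-out labels private.

```python
import random


def split_corpus(examples, test_fraction=0.2, seed=0):
    """Split (input, label) pairs into published training data and withheld test data."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def accuracy(predictions, gold_labels):
    """The published evaluation recipe (here simply the fraction of correct answers)."""
    if not gold_labels or len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must be non-empty and equal in length")
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


class Evaluator:
    """Stand-in for the neutral third party: holds the test labels and only returns scores."""

    def __init__(self, held_out):
        self._held_out = list(held_out)

    def inputs(self):
        # Participants see the test inputs but never the labels.
        return [x for x, _ in self._held_out]

    def score(self, predictions):
        return accuracy(predictions, [y for _, y in self._held_out])
```

The point of the design is that every candidate technique is scored by the same fixed metric on the same withheld examples, so that reported improvements are comparable across sites and over time.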

This is now the standard way to organize research on dozens of problems, from speech recognition to image analysis — but it was developed about 30 years ago in response to a particular historical situation, in which the then-prevalent ways of evaluating such research were seen to be so unreliable as to be essentially worthless. And some of the recent positive developments in other areas of science and engineering, such as movement towards data sharing, suggest that some of the same lessons are being learned over again — though it is striking that the value of well-defined quantitative evaluation metrics still seems to have been missed. The lack of such metrics in data-sharing projects like the Alzheimer's Disease Neuroimaging Initiative is a good case in point.

As the workshop title ("Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results") suggests, there was a good deal of discussion about statistical methods. We were reminded, for example, that a p value just below 0.05 (even if obtained without implicit multiple comparisons or other procedural sins) actually means that there is only a 50% chance that the experiment will replicate at p<0.05, if performed in exactly the same way with a fresh sample from the same population. And we were also reminded that as datasets get large, sampling error goes to zero, so that conventional hypothesis-testing yields absurd p-values of 10^-10 or 10^-100, although very large non-sampling errors of many kinds probably remain.
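
Both reminders can be checked with a few lines of arithmetic. The Python sketch below is only an illustration under stated assumptions, not a reconstruction of anything presented at the workshop: it uses a normal (z-test) approximation, it assumes that the true effect is exactly equal to the effect observed in the original experiment, and it counts a replication as successful only if it is significant in the same direction. The function names are hypothetical.

```python
import math

Z_CRIT = 1.959964  # two-sided critical value for p = 0.05 under a normal approximation


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def two_sided_p(z):
    """Two-sided p-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def replication_probability(z_observed, z_crit=Z_CRIT):
    """Chance that an identical replication reaches z > z_crit.

    Assumption: the true standardized effect equals the observed one, so the
    replication's z statistic is distributed as Normal(z_observed, 1).
    """
    return 1.0 - normal_cdf(z_crit - z_observed)


def z_for_sample_size(effect_in_sd, n):
    """z statistic for a fixed mean shift (in standard deviations) with n observations."""
    return effect_in_sd * math.sqrt(n)
```

Under these assumptions, `replication_probability(Z_CRIT)` comes out at one half, because the replication's z statistic is centered exactly on the threshold. And a shift of just 0.01 standard deviations, which could easily be swamped by non-sampling error, gives z = 10 with a million observations, so `two_sided_p(z_for_sample_size(0.01, 10**6))` is already far below 10^-10.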

But in my opinion, the central issues in this area are not statistical but cultural. And what I learned at the workshop confirmed this belief.

9 Comments

Rob P. said,

February 28, 2015 @ 12:43 pm

Along the same lines. Difficulties in reproducibility/reliability of linguistic analysis – http://www.washingtonpost.com/lifestyle/magazine/should-texts-e-mail-tweets-and-facebook-posts-the-be-new-fingerprints-in-court/2015/02/19/a5ec2bf6-6f32-11e4-8808-afaa1e3a33ef_story.html

[(myl) Indeed. I originally planned this post as a comparison between the issues in the NRC workshop and in the WaPo article. But the NRC workshop part got too long, so I decided to split it into two, and continue the discussion of the forensic linguistics article in a separate post.

But as a preview, here are some of the links to past LLOG posts from the partial draft:

"Authors vs. speakers: A tale of two subfields", 7/27/2011
"Linguistic Deception Detection: Part 1", 12/6/2011
"Separated by a common problem", 12/12/2013
"'Voiceprints' again", 10/14/2014

]
Rubrick said,

February 28, 2015 @ 5:55 pm

Solving the cultural challenges seems damnably difficult. For example, how can we incentivize journals to publish papers with negative or inconclusive results? Such results are extremely important, but also unfortunately pretty dull. Who's going to read "Tweeting Habits Don't Correlate With Depression"?

Open-access journals and the publishing of raw datasets — two things I know you've pushed hard for — offer a glimmer of hope, but it's still always going to be more work to document negative results than to just leave them on the cutting room floor, as it were.

In essence, we need a way to reward failure as well as success, which comes close to being a contradiction.
John Coleman said,

February 28, 2015 @ 6:24 pm

Time to publish all our own results on our own sites? Who needs publishers any more?
the other Mark P said,

February 28, 2015 @ 11:34 pm

Who needs publishers any more?

Anyone who is paid or otherwise rewarded by a metric such as "impact".

You could be the world's greatest researcher and deepest thinker, yet with no official publications you are an academic nobody to a prospective Head of Department.

I suggest what is needed is meta-publications. That is "journals" that cite the most relevant research in an area, after giving it a thorough dust-over (currently called peer-review, when that is actually done and not merely a formality). They could refuse to publicize work that did not meet proper standards of data-sharing etc. Their costs would be trivial, as they would not actually publish anything.

(Also web-pages expire. Journals do not.)
Peter Taylor said,

March 1, 2015 @ 3:52 am

@Rubrick, there are some journals which specialise in negative results, although their problem seems to be getting people to submit papers. E.g. the Journal of Interesting Negative Results in Natural Language Processing and Machine Learning has only one article, from 2008.
bks said,

March 1, 2015 @ 10:52 am

There is little benefit in publishing all negative results because there are so many ways to screw up an experiment. I'm sure I could take the data from a single genome-wide association study (GWAS) and produce one thousand negative results in a fortnight. What's needed are serious attempts to reproduce surprising results. The de rigueur methods section often lacks sufficient detail. –bks
Ben Hemmens said,

March 1, 2015 @ 3:39 pm

There isn't a need to publish all negative results, but when someone attempts to reproduce a result and it doesn't work anything like the original, there should be a readiness on the part of the journal that published the original report to publish this information. The reality is that the original journal rejects it because there is no new positive discovery and the others think it's none of their business. End result is, or at least it was 15 years ago, that if you demolished a paper in a journal with impact factor x, you would be lucky to publish it in one with impact factor x/10, if at all.

There also exists the problem that within certain fields, certain kinds of measurement just become accepted and the possible problems such as interferences, checking whether your signal is really in the dynamic range etc., which of course can be a time-consuming pain to do and may spoil your fun considerably, just get left by the wayside in a kind of groupthink. People get all rigorous about some things and mutually ignore other things that they're being sloppy about. Human nature I suppose. If there's a corrective, I think it will have to involve explaining oneself to people from much more widely different fields than is customary at present.

Linguists checking out what biochemists are doing I think is an excellent idea.
Jeffrey Willson said,

March 2, 2015 @ 7:36 am

I think part of what makes the situation much better in the physical sciences is that the theoretical framework is highly detailed, so that if there is an experimental result published, competing labs can try to reproduce and refine the result: more precise measurements of a certain key mass or temperature are considered publishable, as are extensive charts mapping out how it varies with the chemical composition of the sample. Then, if subsequent work fails to confirm the big discovery, it will be highly embarrassing to the original experimenter.
Chad Nilep said,

March 4, 2015 @ 11:50 pm

I am a fan of (what I understand to be) the pre-print model of repositories such as ArXiv. Interesting results are "pre-published" without peer review, then read, criticized, and potentially endorsed by other scholars in the field. I suppose this works best for big, sexy results that draw other scholars to attempt reproduction. The cultural problem, of course, even given this model is how to reward or create incentives for others to make those attempts, especially given that the outcome of the effort is either a negative result or validation of someone else's earlier work.
