
Corpus-based Linguistic Research: From Phonetics to Pragmatics

Reproducible Research

David Donoho: "An article [...] in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship." According to this increasingly widespread view, responsible research documentation must include all of the basic data, along with any relevant annotation, and also the code that generates any numbers, tables, and figures used in the argument.

Traditional linguistics is reproducible. The explicandum is a pattern of judgments about specified examples. You can disagree about the judgments, or about the argument from the pattern of judgments to a conclusion, but all the cards are on the table.

Of course, people do often disagree about the judgments. In phonology, syntax and semantics, the intersubjective reliability of intuitive judgments tends to decrease as their relevance to basic theoretical questions increases. And so there is growing interest in various kinds of experimental methods.

But traditional research in phonetics, psycholinguistics, sociolinguistics, corpus linguistics, neurolinguistics, etc. is generally not reproducible: the raw data is often not available; annotations or classifications of the data, and documentation of the methods used to create them, may be missing; and the fine details of the statistical analysis may be unreported (e.g. decisions about data inclusion and exclusion, the specific methods used, possible algorithmic or coding errors).

Does this matter? Often, the lack of transparency in scientific publication hides over-interpretation, mistakes, and even outright fraud -- see e.g. the priming controversy, the fall of Marc Hauser, the Duke biomarkers scandal, and so on.

But beyond possible problems with flawed, mistaken, or outright fraudulent studies, there are significant positive benefits to "reproducibility": it reduces barriers to entry, and speeds up extension as well as replication. The greatest benefits accrue to the original researchers themselves, who don't have to waste time trying to remember or recreate what they did to get some results from a few years (or even a few months) earlier.

An Example from Sociolinguistics

[Discussion of t/d deletion -- to be provided]


Greg Guy, "Explanation in variable phonology: an exponential model of morphological constraints", Language Variation and Change, 1991.

Holger Mitterer & Mirjam Ernestus, "Listeners recover /t/s that speakers reduce: Evidence from /t/-lenition in Dutch", Journal of Phonetics, 2006.

Rosalind A.M. Temple, "(t,d): the Variable Status of a Variable Rule", Oxford Working Papers in Linguistics, Philology & Phonetics, 2009.

 

An Example from Computational Linguistics

This is a case where the omission of some apparently irrelevant details led the field seriously astray for 8-10 years...

Steve Abney et al., "Procedure for quantitatively comparing the syntactic coverage of English grammars", HLT 1991:

The problem of quantitatively comparing the performance of different broad-coverage grammars of English has to date resisted solution. Prima facie, known English grammars appear to disagree strongly with each other as to the elements of even the simplest sentences. [...]

Specific differences among grammars which contribute to this apparent disparateness of analysis include the treatment of punctuation as independent tokens or, on the other hand, as parasites on the words to which they attach in writing; the recursive attachment of auxiliary elements to the right of Verb Phrase nodes, versus their incorporation there en bloc; the grouping of pre-infinitival "to" either with the main verb alone or with the entire Verb Phrase that it introduces; and the employment or non-employment of "null nodes" as a device in the grammar; as well as other differences. Despite the seeming intractability of this problem, it appears to us that a solution to it is now at hand. We propose an evaluation procedure with these characteristics: it judges a parse based only on the constituent boundaries it stipulates (and not the names it assigns to these constituents); it compares the parse to a "hand-parse" of the same sentence from the University of Pennsylvania Treebank; and it yields two principal measures for each parse submitted.

The procedure has three steps. For each parse to be evaluated: (1) erase from the fully-parsed sentence all instances of: auxiliaries, "not", pre-infinitival "to", null categories, possessive endings ('s and '), and all word-external punctuation (e.g. " . , ; -); (2) recursively erase all parenthesis pairs enclosing either a single constituent or word, or nothing at all; (3) compute goodness scores (Crossing Parentheses, and Recall) for the input parse, by comparing it to a similarly-reduced version of the Penn Treebank parse of the same sentence.
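To make step (3) concrete, here is a minimal Python sketch of the bracket comparison, assuming the parses have already been reduced by steps (1) and (2) to sets of unlabeled constituent spans over word positions. The function names, the span representation, and the toy sentence are illustrative choices of mine, not part of the published procedure.

    def crosses(a, b):
        """True if spans a = (i, j) and b = (k, l) overlap without nesting."""
        i, j = a
        k, l = b
        return (i < k < j < l) or (k < i < l < j)

    def score_parse(candidate_spans, gold_spans):
        """Return (crossing_count, recall) for one sentence, where each argument
        is a set of (start, end) word-index pairs describing the unlabeled
        constituents that survive the erasure steps."""
        crossing = sum(1 for c in candidate_spans
                       if any(crosses(c, g) for g in gold_spans))
        recall = (len(candidate_spans & gold_spans) / len(gold_spans)
                  if gold_spans else 1.0)
        return crossing, recall

    # "the cat sat on the mat": hypothetical treebank spans vs. a parser's output.
    gold = {(0, 6), (0, 2), (2, 6), (3, 6), (4, 6)}
    candidate = {(0, 6), (0, 2), (2, 4), (4, 6)}   # (2, 4) crosses the gold (3, 6)
    print(score_parse(candidate, gold))            # -> (1, 0.6)

The point of the erasure steps above is precisely to make this span-level comparison insensitive to the grammar-specific decisions (punctuation, auxiliaries, null nodes) that the authors list.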

Jason Eisner, "Bilexical Grammars and a Cubic-Time Probabilistic Parser", 5th International Workshop on Parsing Technologies, 1997:

Computational linguistics has a long tradition of lexicalized grammars, in which each grammatical rule is specialized for some individual word. The earliest lexicalized rules were word-specific subcategorization frames. It is now common to find fully lexicalized versions of many grammatical formalisms, such as context-free and tree-adjoining grammars [Schabes et al. 1988]. Other formalisms, such as dependency grammar [Mel'cuk 1988] and head-driven phrase-structure grammar [Pollard & Sag 1994], are explicitly lexical from the start.

Lexicalized grammars have two well-known advantages. Where syntactic acceptability is sensitive to the quirks of individual words, lexicalized rules are necessary for linguistic description. Lexicalized rules are also computationally cheap for parsing written text: a parser may ignore those rules that do not mention any input words. More recently, a third advantage of lexicalized grammars has emerged. Even when syntactic acceptability is not sensitive to the particular words chosen, syntactic distribution may be [Resnik 1993]. Certain words may be able but highly unlikely to modify certain other words. Such facts can be captured by a probabilistic lexicalized grammar, where they may be used to resolve ambiguity in favor of the most probable analysis, and also to speed parsing by avoiding ("pruning") unlikely search paths. Accuracy and efficiency can therefore both benefit.

Recent work along these lines includes [Charniak 1995, Collins 1996, Eisner 1996b, Collins 1997], who reported state-of-the-art parsing accuracy. Related models are proposed without evaluation in [Lafferty et al. 1992, Alshawi 1996]. This recent flurry of probabilistic lexicalized parsers has focused on what one might call bilexical grammars, in which each grammatical rule is specialized for not one but two individual words. The central insight is that specific words subcategorize to some degree for other specific words: tax is a good object for the verb raise. Accordingly, these models estimate, for example, the probability that (a phrase headed by) word y modifies word x, for any two words x, y in the vocabulary V.
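The "central insight" in the last few sentences can be illustrated with a toy estimate. The Python fragment below computes a smoothed relative-frequency estimate of the probability that word y modifies word x from invented head-modifier counts; it is only a sketch of the bilexical idea, not of any of the cited models, which condition on much richer context, back off to part-of-speech tags, and use more careful smoothing.

    from collections import Counter

    # Invented head-modifier counts; a real model estimates these from a treebank.
    pair_counts = Counter({
        ("raise", "tax"): 50,        # "tax is a good object for the verb raise"
        ("raise", "question"): 30,
        ("raise", "child"): 20,
        ("lower", "tax"): 15,
    })
    head_counts = Counter()
    for (head, _), n in pair_counts.items():
        head_counts[head] += n

    def p_modifies(head, modifier, alpha=1.0, vocab_size=10_000):
        """Add-alpha estimate of P(modifier | head) over an assumed vocabulary."""
        return ((pair_counts[(head, modifier)] + alpha)
                / (head_counts[head] + alpha * vocab_size))

    print(p_modifies("raise", "tax"))      # relatively high
    print(p_modifies("raise", "banana"))   # close to zero

A parser equipped with such estimates can prefer the analysis in which "tax" attaches to "raise" over one in which it attaches to a less likely head, and can prune improbable attachments during search.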

Dan Bikel, "Intricacies of Collins' Parsing Model", Computational Linguistics, 2004:

This article documents a large set of heretofore unpublished details Collins used in his parser, such that, along with Collins' (1999) thesis, this article contains all information necessary to duplicate Collins' benchmark results. Indeed, these as-yet-unpublished details account for an 11% relative increase in error from an implementation including all details to a clean-room implementation of Collins' model. We also show a cleaner and equally well-performing method for the handling of punctuation and conjunction and reveal certain other probabilistic oddities about Collins' parser. We not only analyze the effect of the unpublished details, but also reanalyze the effect of certain well-known details, revealing that bilexical dependencies are barely used by the model and that head choice is not nearly as important to overall parsing performance as once thought. Finally, we perform experiments that show that the true discriminative power of lexicalization appears to lie in the fact that unlexicalized syntactic structures are generated conditioning on the headword and its part of speech.