The best part of Monday's post on the Facebook authorship-authentication controversy ("High-stakes forensic linguistics", 7/25/2011) was the contribution in the comments by Ron Butters, Larry Solan, and Carole Chaski. It's interesting to compare the situation they describe — and the frustration that they express about it — with the history of technologies for answering questions about the source of bits of speech rather than bits of text.
The U.S. National Institute of Standards and Technology (NIST) has been running a formal "Speaker Recognition Evaluation" (SRE) series since the mid-1990s. There were planned evaluations of a similar sort going back a decade before that — the 1987 King Speaker Verification corpus was used in such evaluations starting in the last half of the 1980s, and the 1991 Switchboard corpus, though it has been used for many things, was originally collected for speaker identification research and evaluation.
So we've had about 25 years of systematic development and open competitive evaluation of methods for determining or validating speaker identity.
Now, if you know the history of forensic speaker identification, you'll object that allegedly scientific methods in that area go back much further. You might cite, for example, Lawrence G. Kersta, "Voiceprint-Identification Infallibility", J. Acoust. Soc. Am. 34(12) 1962:
Previously reported work demonstrated that a high degree of speaker‐identification accuracy could be achieved by visually matching the Voiceprint (spectrogram) of an unknown speaker's utterance with a similar Voiceprint in a group of reference prints. Both reference and unknown utterances consisted of single cue words excerpted from contextual material. An extension of the experiments showed identification success in excess of 99% when the reference and unknown patterns were enlarged to 5‐ and 10‐word groups. Other experiments showed that professional ventriloquists and mimics cannot create voices or imitate others without revealing their own identities. Attempts to disguise the voice by use of various mechanical manipulations and devices were also unsuccessful. Experiments with a population expanded over the original 25 speakers showed that identification success does not decrease substantially with larger groups.
Kersta published a similar paper in the august journal Nature ("Voiceprint Identification", Nature 12/29/1962). But you might also recall the controversy that Kersta's claims caused. Thus Ralph Vanderslice and Peter Ladefoged, "The 'Voiceprint' Mystique", J. Acoust. Soc. Am. 42(5) 1967:
Proponents of the use of so‐called “voiceprints” for identifying criminals have succeeded in hoodwinking the press, the public, and the law with claims of infallibility that have never been supported by valid scientific tests. The reported experiments comprised matching from sample—subjects compared test “voiceprints” (segments from wideband spectrograms) with examples known to include the same speaker—whereas law enforcement cases entail absolute judgment of whether a known and unknown voice are the same or different. There is no evidence that anyone can do this. Our experiments, with only three subjects, show that: (1) the same speaker, saying the same words on different occasions without any attempt to disguise his voice, may produce spectrograms which differ greatly; (2) different speakers, whose voices do not even sound alike, may produce substantially identical spectrograms. As with phrenology a century earlier, “voiceprint” proponents dismiss all counter evidence as irrelevant because the experiments have not been done by initiates. But our data show that the use of “voiceprints” for criminal identification may lead to serious miscarriages of justice.
This 1960s-era controversy is described from a legal perspective in Rick Barnett, "Voiceprints: The End of the Yellow Brick Road", 8 U.S.F. L. Rev. 702 (1973-74), who concludes that
The weight which a jury might choose to attribute to voiceprint identifications may be the determining factor in a decision as to guilt or innocence. Although the courts have tended to treat general acceptance and reliability as one rather than two issues, the latter has provided a much stronger ground for denial of admissibility. Yet, adverse judicial rulings will not end the controversy as to the reliability of the voiceprint technique of speaker identification. Only further research will accomplish that end.
If you look over the documentation of the NIST Speaker and Language Recognition evaluations, you'll see a very different sort of field from the one that is evoked by the exchange between Kersta and Vanderslice — or by the comments from Butters, Solan, and Chaski. Thus NIST's "1997 Speaker Recognition Evaluation Plan", published well in advance of the 1997 evaluation, describes at length what background ("training") data is to be used in preparation for the test, how the tests will be scored (using withheld "test" data), in what format results should be submitted to NIST, and so on.
This is an instance of the "common task method" developed by Charles Wayne and others at DARPA in the late 1980s, and used up to the present day to make progress in areas from speech recognition and machine translation to text understanding and summarization. For people in the speech and language technology field, it's now become second nature to expect this kind of research environment. Younger researchers, who may not recognize that things were not always this way, would do well to read John Pierce, "Whither Speech Recognition?", J. Acoust. Soc. Am. 46(4B) 1969:
We all believe that a science of speech is possible, despite the scarcity in the field of people who behave like scientists and of results that look like science.
Most recognizers behave, not like scientists, but like mad inventors or untrustworthy engineers. The typical recognizer gets it into his head that he can solve "the problem." The basis for this is either individual inspiration (the "mad inventor" source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach). […]
The typical recognizer […] builds or programs an elaborate system that either does very little or flops in an obscure way. A lot of money and time are spent. No simple, clear, sure knowledge is gained. The work has been an experience, not an experiment.
In this context, it may be helpful to take a look at the summary of three years of progress in speaker recognition research described by Mark Przybocki, Alvin Martin, and Audrey Le, "NIST Speaker Recognition Evaluations Utilizing the Mixer Corpora: 2004, 2005, 2006", IEEE Trans. on Audio, Speech & Language Processing, 2007:
The Speech Group at the National Institute of Standards and Technology (NIST) has been coordinating yearly evaluations of text-independent speaker recognition technology since 1996. During the eleven years of NIST Speaker Recognition evaluations, the basic task of speaker detection, determining whether or not a specified target speaker is speaking in a given test speech segment, has been the primary evaluation focus. This task has been posed primarily utilizing various telephone speech corpora as the source of evaluation data.
By providing explicit evaluation plans, common test sets, standard measurements of error, and a forum for participants to openly discuss algorithm successes and failures, the NIST series of Speaker Recognition Evaluations (SREs) has provided a means for chronicling progress in text-independent speaker recognition technologies.
[…] Here, we concentrate on the evaluations of the past three years (2004–2006). These recent evaluations have been distinguished notably by the use of the Mixer Corpora of conversational telephone speech as the primary data source, and by offering a wide range of […] test conditions for the durations of the training and test data used for each trial. One test condition in the past two years has involved the use of the Mixer data simultaneously recorded over several microphone channels and a telephone line. In addition, the corpus and the recent evaluations have included some conversations involving bilingual speakers speaking a language other than English.
A summary of the 2006 state of the art can be seen in their Fig. 1, which presents "Detection Error Tradeoff" (DET) curves for the primary systems of the 36 organizations that entered the 2006 evaluation's "core" test. Thus the best-scoring system achieved about a 2% false-alarm probability at a 5% miss rate on this test:
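To make the miss/false-alarm trade-off concrete, here is a minimal sketch of how DET operating points are computed from a detection system's trial scores. This is an illustration of the general idea only — the function name and the toy scores are mine, not NIST's actual scoring software:

```python
import numpy as np

def det_points(target_scores, nontarget_scores):
    """Compute (false-alarm prob., miss prob.) pairs by sweeping a
    decision threshold over all observed trial scores."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    points = []
    for t in thresholds:
        # A trial is "accepted" (target speaker declared present) when score >= t.
        p_miss = float(np.mean(target_scores < t))    # true speaker rejected
        p_fa = float(np.mean(nontarget_scores >= t))  # impostor accepted
        points.append((p_fa, p_miss))
    return points

# Toy example: genuine-speaker trials score higher than impostor trials.
pts = det_points(np.array([2.0, 3.0, 4.0]), np.array([0.0, 1.0, 2.0]))
```

Raising the threshold trades false alarms for misses, which is why a single "accuracy" number is less informative than the full curve.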
That's not to say that this level of performance would be expected under all test conditions. In fact, the 2005-2006 SRE evaluations began the testing of "cross-channel" conditions, in which a variety of different microphones and recording conditions were compared; and as Figure 12 from the same paper shows, performance in cross-channel conditions was much worse:
The relatively unsmooth curves and large confidence boxes of the figure reflect the relatively small numbers of trials in the 2006 evaluation for the cross-channel condition. The key point to note is the far better performance of the telephone test segment data (solid black) than that of all the cross-channel (all training was over the telephone) microphone curves. (The microphone used in the broken black curve was apparently defective.)
Although the word "science" doesn't occur in Przybocki et al. 2007 — and indeed the perspective is very much that of an engineer rather than a scientist — I think that John Pierce would be satisfied that Przybocki et al. are reporting "experiments" rather than mere "experiences".
Is there anything comparable in the field of (text-based) authorship identification? Alas, no.
Things started out well with Frederick Mosteller and David Wallace, "Inference in an Authorship Problem", Journal of the American Statistical Association 1963 (discussed in my post "Biblical Scholarship at the ACL", 7/1/2011). Mosteller and Wallace used a generally-available source of texts — the Federalist papers — and documented their methods well enough that what they did can now be assigned as a problem set in an introductory course (and sometimes is). And there have been many research publications since then.
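As an illustration of why their work travels so well as a problem set, here is a toy sketch (mine, not theirs) of the core idea — scoring a disputed text by the rates of author-discriminating function words. The marker words and the add-one smoothing are simplifications standing in for Mosteller and Wallace's much more careful Bayesian analysis:

```python
import math
from collections import Counter

# Illustrative "marker" function words; Mosteller and Wallace selected
# theirs empirically from the Federalist data (e.g. "upon", "whilst").
MARKERS = ["upon", "whilst", "while", "by"]

def marker_counts(text):
    """Count marker-word occurrences and total word count."""
    words = text.lower().split()
    return Counter(w for w in words if w in MARKERS), len(words)

def log_likelihood_ratio(disputed, known_a, known_b):
    """Positive values favor author A, negative favor author B.
    Marker rates are estimated from known texts with add-one smoothing."""
    counts, _ = marker_counts(disputed)
    counts_a, n_a = marker_counts(known_a)
    counts_b, n_b = marker_counts(known_b)
    llr = 0.0
    for w in MARKERS:
        rate_a = (counts_a[w] + 1) / (n_a + len(MARKERS))
        rate_b = (counts_b[w] + 1) / (n_b + len(MARKERS))
        llr += counts[w] * math.log(rate_a / rate_b)
    return llr
```

A disputed passage full of "upon" thus gets pushed toward the known author who favors "upon" over "whilst" — the same logic, in miniature, that let Mosteller and Wallace assign the disputed Federalist papers to Madison.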
But what is so far still lacking, IMHO, is a systematic application of the "common task method". As far as I know, none of the aspects of "authorship attribution" have ever been addressed by any of the on-going or ad-hoc organizations that have sprung up to mediate the application of this method to text-analysis problems — not TREC, nor TAC, nor CoNLL, nor anybody else. As a result, the work is more scattered and diffuse than it should be, and harder to evaluate in reference to real-world questions. There is less research than there would be, if published training and testing collections lowered the barriers to entry, and if published results on well-defined benchmark tests gave researchers a "record" to try to beat. There is less of a research community than there should be, in the absence of the forum for exchanging ideas that a shared task creates. And authorship evaluation lacks the clear community standards — for technology and for performance evaluation — that emerge from the sort of scientific "civil society" represented by the NIST SRE evaluation series.
As far as I know the International Association of Forensic Linguists (IAFL) has never tried to fill this gap, although its home page observes that its "further aims" include "collecting a computer corpus of statements, confessions, suicide notes, police language, etc., which could be used in comparative analysis of disputed texts".
I'd love to learn that I'm wrong, and that I've managed to fail to notice a history of common-task evaluation in authorship attribution. Or, more probably, that there's a plan for future evaluations in this area that I don't know about. But if this field is as empty as it seems to me, isn't it time for someone to step forward to fill it? A tenth of a percent of the sums at stake in the Facebook case would cover the costs nicely.