One of the curious things about the uses of linguistics in the legal context is that the smallest units of language get the most public attention. Linguists analyze language in all its shapes and forms, from minute sounds to broad discourse structures, but the media's interest is in the smaller language units like letters, punctuation, and words, not the larger ones like syntax, discourse structure, and conversational strategies.

A case in point is the area of authorship identification, which typically focuses on small language units such as morphology, lexicon, or stylistic choices found in evidence documents. It's tempting to think that such language features can identify authors with as much validity and precision as DNA analysis offers law enforcement in identifying suspects. Personally, I have some reservations about what I see linguists doing as they try to help the police and the courts determine issues of innocence or guilt.

A recent article in BBC News describes the way authorship identification, based on a few small features, is being used these days, primarily in England but also in the US. I'm not exactly pleased with what I see being done. And it bothers me even more that authorship identification is thought to be synonymous with forensic linguistics. The territory of forensic linguistics is much larger than that.

Okay, I have to admit that in the past I too have tried to help law enforcement agencies narrow down their suspect lists by examining the evidence of letters, threat messages, emails, and other documents. Sometimes this produces a few language clues and insights for the police to use in further interrogations of their suspects, but I've never found enough of these to give me any degree of certainty about their authorship. There were too many unresolved problems for me to reach such conclusions. For one thing, there is usually only a very small amount of language to work with, making generalizations difficult if not impossible.

Second, even if there are a few language features that seem to stand out (like the use of "my" and "myself" mentioned in the BBC article), without a large corpus of data I can't tell whether the writers used these variably. And if they did use them variably, how consistently must a writer use a given feature before anyone can claim that it is characteristic of that writer? It's also not clear that anything like idiolect really exists in written language and, even if it does exist, we don't know the kinds and frequency of features needed to identify a text as the idiolect of a given writer.

Third, I don't know whether the suspicious-looking language features I find are actually diagnostic of one specific individual, as opposed to being shared by many different individuals. Until we have an established and relatively large corpus of a given writer's texts, we can't identify with certainty that individual's predictable usage. In fact, we don't even know how large such a corpus needs to be. The articles cited here mention that a British linguist has collected over 8,000 text messages and has analyzed them using "robust statistical methods," but his corpus appears to consist of the writings of many different individuals, not a single suspect. It might help determine the structure of a typical threat message, suicide note, or some other text genre, but it seems to have little to say about what a given individual might write.

With these qualms and qualifications hanging over my head, I, for one, would hate to try to give testimony in a courtroom about authorship identification. An effective cross-examination, built on these same points, could easily destroy my conclusions. But this is not to say that such analysis is totally useless as an investigative tool for law enforcement to use as they interview suspects or narrow down their suspect lists. It might even encourage a confession. But it still rests on the shaky grounds noted above and would not be appropriate to present in the courtroom as expert witness testimony.

