Nailing a suspect

Mark Liberman's post about the phone call that has led people to try to determine who was responsible for the Mumbai attacks highlights a problem in the current practice of forensic linguists who do authorship analysis. His post was about speaker identification (or the nationality/ethnicity of the speaker), so I'm stretching things a bit here, but whether the evidence is spoken or written, the process of narrowing down a list of suspects, let alone finding the right one, has many of the same problems.

As in all science, effective diagnosis begins with adequate comparative data. In authorship identification cases, the first problem is that the documents in question (threat messages, suicide notes, hate mail, etc.) are usually very brief and sometimes their writers make an effort to disguise any personal identity. Further, even if we can determine what "enough data" might mean, we need to have adequate samples of documents for comparison. And not just any old comparison samples will do. For example, it can be fruitless to compare the register of a threat message with the register of a business letter. Most of us have a range of styles available to us, so it's necessary to obtain comparison samples in the same or at least similar genres and registers.

If that isn't enough to discourage us from determining the writers' identity (or even their nationality, ethnicity, region, social status, age, or education level), we also have to be concerned about individual language variation. People tend not to stay put geographically or socially, making them subject to language influences from wherever they live or work and from the language used by the people or groups that they live among or admire. Sometimes they get these influences right. Sometimes they get them only partly right. Sometimes they get all or parts of them dead wrong. Sometimes they incorporate only parts of the new influences and hang on to their older patterns, creating a language olio that comes close to defying individual or group identification.

In addition, we need to worry about language variability itself. Sociolinguistic research has accomplished a lot, but we're only beginning to scratch the surface of all the kinds, rates, and comparative ratios of variation in English use, much less that of the many other languages of the world. If a person uses a form that seems to stand out, we have to be relatively certain whether this feature stands out in the same way when it's used by others. That is, we need to know whether this feature is a diagnostically significant identity marker at all.

Even if we have a good research base telling us that the feature in question is potentially diagnostic, we need to have a large enough corpus of the target's language from which we can determine that person's variable use of that particular feature. Two or three paragraphs are usually not enough. For example, consider the practice of some American English speakers to use the double negative. Research tells us that very few use it all the time. Some use it 60% of the time; others 20%. Some use it in speaking but less so or not at all in writing. If our data sample has only one or two instances of the double negative, what can we say about its diagnosticity in a hate mail message?
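To put rough numbers on the sample-size worry, here is a minimal sketch in Python. It assumes we can count "opportunities" (places in the text where a double negative could have occurred) and "hits" (places where it actually did), and it uses the standard Wilson score interval to estimate how wide the plausible range for the writer's true rate is. The function name and the counts are hypothetical, chosen only for illustration; nothing here comes from a real case.

```python
import math

def wilson_interval(hits, opportunities, z=1.96):
    """Approximate 95% Wilson score interval for a usage rate.

    hits: times the feature was used (hypothetical count)
    opportunities: contexts where it could have been used (hypothetical count)
    """
    if opportunities <= 0:
        raise ValueError("need at least one opportunity")
    p = hits / opportunities
    denom = 1 + z**2 / opportunities
    center = (p + z**2 / (2 * opportunities)) / denom
    half = z * math.sqrt(
        p * (1 - p) / opportunities + z**2 / (4 * opportunities**2)
    ) / denom
    return max(0.0, center - half), min(1.0, center + half)

# A short note: 2 double negatives in 5 opportunities.
short_low, short_high = wilson_interval(2, 5)

# A much larger corpus with the same observed rate: 40 in 100.
long_low, long_high = wilson_interval(40, 100)

print(f"short sample: {short_low:.2f} to {short_high:.2f}")
print(f"large sample: {long_low:.2f} to {long_high:.2f}")
```

With only five opportunities, the interval runs from roughly 12% to 77%, so the sample cannot tell a 20% user from a 60% user. With a hundred opportunities at the same observed rate, the interval narrows to roughly 31% to 50% and excludes both. This is only one way to frame the problem, and it ignores the register and genre differences discussed above.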

The comments cited in and after Mark's post point to various individual words, expressions, intonation, and pronunciations as evidence of the caller's true national or ethnic identity. They find a feature that they believe to be diagnostic, and then claim that it is clear proof of the speaker's language or ethnic identity. Unfortunately, it's just not that easy. I'm no more able than they are to evaluate the sociolinguistic accuracy or power of their claims, but I humbly suggest that settling the argument about the nationality or ethnicity of the speaker may require a great deal more evidence than is available in this phone message. These features, if accurate, may offer some clues, but they don't appear to be conclusive. Clues offer places to begin the analysis. They are not the evidence itself.

This situation reminds me of the way some linguists carry out authorship identification in the US today. It has some usefulness as an investigative tool, but it can be very troublesome to present such evidence in court. An alert and informed cross-examiner could ask:

1. Are you comparing the unknown language evidence in this case with comparable known language evidence? (i.e., comparable samples and sample size)
2. How can you prove that the features you use are diagnostic of the differences or similarities that you claim? (amount and quality of data compared with evidence based on research findings)
3. What are the known rates of variation in the features you use? (what does research tell us about variation in this feature of the language)
4. If you know this rate of variation, how do you know that the alleged user follows or doesn't follow it? (sample size and quality)
5. When you point to individual words or expressions as diagnostic, how do you know that other people also do or do not use them? (the old question about idiolect)

No doubt there is more that I haven't covered here. It's probably not a bad thing that in circumstances like the Mumbai tragedy, people are eager to find clues about who was responsible. The problem is in finding the most effective way to get beyond the clue stage.
