Language Log

Diamond, the New Yorker, and corpus linguistics

April 24, 2009 @ 4:51 pm · Filed by Roger Shuy under Language and the law

Forbes reports that the April 21, 2008 New Yorker article, “Vengeance is Ours,” by Jared Diamond, has recently generated a $10 million lawsuit brought by Daniel Wemp, a New Guinean who Diamond claimed was pursuing vengeance for his uncle’s death. Wemp’s efforts are said to have led to six years of warfare that claimed the lives of 47 people in New Guinea. Rhonda Roland Shearer’s very long blog post at StinkyJournalism.org provides more details. There’s a connection to Language Log because Shearer asked linguist Douglas Biber to assess whether the long, numerous, and allegedly direct quotations in Diamond’s article were actually spoken language or whether they were written language modified to look like direct quotes. Biber is an expert on measuring the differences between written and spoken language, so it was prudent for Shearer to seek his expertise in corpus linguistics to resolve the issue.

Biber’s first impression was that the quotations were academic writing rather than speech, but he wanted to analyze them further. He did so, basing his results on the findings of The Longman Grammar of Spoken and Written English, which draws on a very large corpus of spoken and written English. Excerpts from Biber’s report to Shearer follow.

Over the last 25 years, a research approach has been developed for the empirical analysis of such grammatical characteristics. Referred to as ‘corpus linguistics’, the approach is based on the analysis of very large collections of natural texts from thousands of individual speakers and writers. Computer programs aid the analyses, which result in descriptions of the grammatical features that are especially frequent, features that are typical, and features that rarely occur. In addition, by comparing corpora with different kinds of texts, it is possible to contrast the grammatical characteristics that are usually found in conversation to those usually found in academic writing (or any other spoken or written varieties).

The most comprehensive grammatical description of English undertaken from this perspective is the 1,200-page Longman Grammar of Spoken and Written English (LGSWE; Biber et al. 1999). The research for that project is based on analysis of a very large corpus that represents four major varieties: conversation, fiction writing, newspaper writing, and academic writing. For example, the sub-corpus for conversation includes c. 6.4 million words, produced by thousands of speakers. The sub-corpus for academic writing includes 5.3 million words from 408 different texts. Computational / quantitative analyses of these corpora allow us to make strong generalizations about the grammatical characteristics that are frequent or rare in conversation, contrasted with the features that are frequent/rare in academic writing.

Biber explained that he carried out a 3-way comparison:
1. a quantitative analysis of the grammatical characteristics of Diamond’s quotes [i.e., the quotations attributed to Wemp as his spoken words in the 4/21/08 New Yorker article]
2. a quantitative grammatical analysis of Wemp’s actual speech [verbatim transcripts of speech produced by Wemp collected by Shearer] and
3. the research findings from the LGSWE [large-scale corpus analysis of conversation and academic writing].

Biber then continues:

Taken together, the linguistic analyses indicate that it is extremely unlikely that The New Yorker quotations are accurate verbatim representations of language that originated in speech. To put it simply, normal people do not talk using the grammatical structures represented in these quotations. However, these quotations do include several grammatical structures found commonly in academic writing, suggesting that the quotations were produced in writing rather than being transcribed from speech.

He gives some examples:
Adjective and/but adjective (e.g., tall and handsome) was 100 times more frequent in Diamond quotes than in speech.
Preposition + Relative pronoun (e.g., each battle in which we succeeded in killing an Ombal) was 100 times more frequent in Diamond quotes than in speech.
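
For readers curious how a figure like “100 times more frequent” can be arrived at, here is a minimal Python sketch of the underlying arithmetic: count a grammatical pattern in two texts, normalize each count to a rate per million words, and divide one rate by the other. This is an illustration only, not Biber’s actual procedure. The word lists, the regular expression, and the function names are all assumptions made for the example, and a real corpus study would work from grammatically tagged text rather than a crude pattern match.

```python
import re

# Hypothetical, deliberately short word lists. A regular expression like this
# will miss some real cases and catch some false ones; it only illustrates
# the counting step.
PREPOSITIONS = ("in", "on", "at", "by", "for", "from", "to", "with", "of")
RELATIVE_PRONOUNS = ("which", "whom", "whose")

# Matches a preposition immediately followed by a relative pronoun,
# e.g. "in which" or "with whom".
PATTERN = re.compile(
    r"\b(?:%s)\s+(?:%s)\b"
    % ("|".join(PREPOSITIONS), "|".join(RELATIVE_PRONOUNS)),
    re.IGNORECASE,
)


def count_words(text):
    """Count word tokens using a simple letters-and-apostrophes rule."""
    return len(re.findall(r"[A-Za-z']+", text))


def rate_per_million(text):
    """Return occurrences of the pattern per million words of text."""
    words = count_words(text)
    if words == 0:
        return 0.0
    hits = len(PATTERN.findall(text))
    return hits * 1_000_000 / words


def frequency_ratio(sample, reference):
    """Return how many times more frequent the pattern is in sample."""
    reference_rate = rate_per_million(reference)
    if reference_rate == 0:
        raise ValueError("pattern does not occur in the reference text")
    return rate_per_million(sample) / reference_rate


if __name__ == "__main__":
    # The first string is the example quoted above; the second is an
    # invented stand-in for conversational text, not real corpus data.
    quoted = "each battle in which we succeeded in killing an Ombal"
    invented_talk = (
        "well I mean he's the guy with whom I went, you know, "
        "and we just sat there and talked and talked all night"
    )
    print(frequency_ratio(quoted, invented_talk))
```

Normalizing to a common base (here, per million words) is what makes the comparison fair when the texts differ greatly in length, as a handful of magazine quotations and a multi-million-word conversation corpus do.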

Biber concludes:

These comparisons show the magnitude of the discrepancies between the grammatical style of normal conversation contrasted with the grammatical style of the Diamond quotes. To find one of these grammatical features in a normal conversation is noteworthy. To find repeated use of this large constellation of features in actual spoken discourse, some of them occurring c. 100 times more often than in normal conversation, is extremely unlikely. In contrast, these are all features that are typical of academic writing, suggesting that they have their origin in writing rather than actual speech.

Other corpus studies (e.g., the book University Language; Biber 2006) have shown that these same features are rare and exceptional in even academic speech, including university lectures. In contrast, what we find in the Diamond quotes is the pervasive use of a suite of grammatical constructions, which are all rare in conversation but common in formal writing. This constellation of grammatical characteristics is also strikingly different from the grammatical style of the verbatim transcripts of speech produced by DW [Daniel Wemp]. In sum, the analysis strongly indicates that the Diamond quotes are much more like discourse that was produced in writing, reflecting the typical grammatical features of formal academic prose, rather than verbatim representations of language that was produced in speech.

It will be interesting to see what The New Yorker’s lawyers will be able to do with Biber’s analysis.

(Hat tip to Barry Hilton)
