Hoc est enim corpus linguistics

I'm at the AACL 2009 meeting in Edmonton — that's the meeting of the American Association for Corpus Linguistics, which is neither American nor an Association, as John Newman explained to me.  I'll report later on some of what I see and hear.

So far, the most notable thing has been the outside temperature of 20 F or so, experienced on a morning walk around campus — the conference itself hasn't started yet — but the program looks interesting.

It seems to me that no reputable naming consultant would have approved the choice of the word corpus — Latin for "body" — in corpus linguistics, which involves the study of "bodies" or "collections" of text.  There's an unfortunate resonance with corpse, which makes the whole enterprise sound faintly icky. (It isn't — the method is used on living languages as well as dead ones. Not that the dead ones are icky either…)

The OED gives citations back to the 18th century for corpus in the sense "A body or complete collection of writings or the like; the whole body of literature on any subject":

1727-51 CHAMBERS Cycl. s.v., Corpus is also used in matters of learning, for several works of the same nature, collected, and bound together..We have also a corpus of the Greek poets..The corpus of the civil law is composed of the digest, code, and institutes.

The more specialized sense "The body of written or spoken material upon which a linguistic analysis is based" is cited only back to the 1950s:

1956 W. S. ALLEN in Trans. Philol. Soc. 128 The analysis here presented is based on the speech of a single informant..and in particular upon a corpus of material, of which a large proportion was narrative, derived from approximately 100 hours of listening.

1963 Language XXXIX. 1 In the analysis of the data, the structural features of the corpora will first be described.

1964 E. PALMER tr. Martinet's Elem. General Linguistics ii. 40 The theoretical objection one may make against the ‘corpus’ method is that two investigators operating on the same language but starting from different ‘corpuses’, may arrive at different descriptions of the same language.

There's more to be said about the ideas involved — methodological issues can become quasi-religious for some people, as Geoff Pullum observed here a few years ago, and he was describing the mere residue of earlier battles that were much more bitter.  But as Moore's Law and the digitization of society have made it easier and easier to apply corpus-linguistics methods, the methodological arguments about whether, when and how to apply them have become much less violent.


  1. Benjamin Zimmer said,

    October 9, 2009 @ 9:55 am

    As I mentioned in the Bloggingheads discussion with my brother Carl, the word corpus does indeed have a P.R. problem — I've been specifically asked to avoid using the word in more than one media appearance about (corpus-driven) lexicography.

  2. Alex said,

    October 9, 2009 @ 9:57 am

    I wonder what it would be like if, alongside the American Physical Society and the National Geographic Society and the like, we had the American Association for Corporeal Linguistics. Makes it sound more substantial, doesn't it?

    [(myl) Would we have to choose sides between that and the American Association for Spiritual Linguistics?]

  3. Mark P said,

    October 9, 2009 @ 10:49 am

    Shouldn't it be the Incorporeal American Association for Linguistics? I think that association already doesn't exist.

  4. Spectre-7 said,

    October 9, 2009 @ 11:04 am

    (myl) Would we have to choose sides between that and the American Association for Spiritual Linguistics?

    I think I might prefer the American Association for Ætherial Linguistics. It sets a slightly different tone, although a question remains about how to initialize it. Is it proper to put an ash in initials (AAÆL) ?

  5. Dan T. said,

    October 9, 2009 @ 11:28 am

    There seems to be disagreement in the citations as to whether its plural is "corpora" or "corpuses".

  6. Acilius said,

    October 9, 2009 @ 1:02 pm

    As a Latinist, I vote for "corpuses."

    In the singular, the Latin "corpus" can mean "structure" or "collection," but as far as I can see from my reading and my ten-minute skimming of the Thesaurus Linguae Latinae, "corpora" always means "particles" or "grains." So to say that a usage study was based on "corpora" would suggest that it dissolved collections into single entries. Of course, that etymology might not be prominent in the minds of most corpus linguists. But if you don't want to evoke the etymology, why use the Latinate plural form?

  7. John Cowan said,

    October 9, 2009 @ 1:52 pm

    The Corpus Juris of Justinian that Chambers's Cyclopedia mentions has been known to English lawyers, at least by name, almost from the beginnings of English law. So the word-sense itself has been in the language, at least in that one context, for much longer than the OED says. (It's difficult to know what to do about the words of foreign-language tags: is c(a)etera a word of English? Probably not.)

    Dan T.: I suspect that E. Palmer knew something was wrong, or he wouldn't have put "corpuses" in horror quotes.

  8. Sili said,

    October 9, 2009 @ 3:45 pm

    Let me guess: The lecture theatre is emblazoned with "Terribilis est locus iste".

  9. Gary said,

    October 11, 2009 @ 9:38 am

    This brings up memories from the dawn of the computer age in the early 60s when I was an undergraduate majoring in linguistics.

    I remember one of the linguists at Brown (Twaddell? Francis? Kucera?) reminiscing fondly about going to New York to buy pornography for inclusion in their corpus of American English. The corpus was stored on computer tapes and a copy was available for us undergraduates to cut our linguistic and programming teeth on.

    He liked to fantasize about his suitcase accidentally breaking open on the train, resulting in the discovery of the haul and an arrest for being a pornography salesman.

  10. Tim said,

    October 11, 2009 @ 2:56 pm

    Acilius : So, what word would they have used to talk about more than one corpus in the "structure" or "collection" sense?

  11. Acilius said,

    October 12, 2009 @ 3:57 pm

    @Tim: I haven't found any examples of the Romans talking "about more than one corpus in the 'structure' or 'collection' sense." That metaphor seems to have been a stretch for the Romans, so that they avoided the plural.

  12. Atario said,

    October 16, 2009 @ 3:02 am

    > "corpora" or "corpuses"

    Why isn't it corpi? (Said the amateur with only enough Latin knowledge to get in trouble and make bad jokes.)

  13. John Cowan said,

    November 8, 2009 @ 6:15 am

    Atario: Because corpus is a neuter noun of the third declension, despite the masculine-looking ending that suggests the second declension, and all Indo-European neuter nouns originally ended in -a in the plural of the nominative and accusative. (Even in English, which has lost almost all its case endings, it's notable that "it" is both nominative and accusative, alone among the personal pronouns.) The s/r variation is a regular process in Latin that eliminated intervocalic s in favor of r.
