This is an illustrative Breakfast Experiment™ for my course at the LSA Institute (on "Corpus-Based Linguistic Research"). It starts from an earlier LL post, "When men were men, and verbs were passive", 8/4/2006, where I observed that Winston Churchill, often cited as a model of forceful eloquence, used the passive voice for 30-50% of his verbs in various passages from his 1899 memoir The River War — several times the rate noted in statistical usage studies from the 1960s and later.
So I thought I'd do a quick historical survey of passive-voice rates, as an example of what can be done with Mark Davies' COHA corpus.
The texts in COHA have been automatically tagged with the CLAWS tagset, and the COHA search interface lets us use the locution [vb*] to refer to all the forms of the verb be (including contractions and so on), so we should be able to search for a pattern meaning "form-of-be past-participle":
This won't find cases where an adverb or other adjunct intervenes between the be-form and the past participle (e.g. "… was previously noted by …"); and of course it relies on the CLAWS tagger being mostly right.
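To make that limitation concrete, here is a minimal Python sketch of what an adjacent "be-form, past participle" match does on a tagged sentence. The (word, tag) pair format and the function name are assumptions made for illustration; this is not how COHA stores or searches its texts.

```python
# Toy illustration only: sentences as (word, tag) pairs with lowercase
# CLAWS-style tags. The data format is hypothetical, not COHA's own.

def count_adjacent_passives(tagged):
    """Count be-forms (tags starting with 'vb', like the [vb*] wildcard)
    that are immediately followed by a past participle (tag 'vvn')."""
    return sum(
        1
        for (_, tag1), (_, tag2) in zip(tagged, tagged[1:])
        if tag1.startswith("vb") and tag2 == "vvn"
    )

adjacent = [("it", "pph1"), ("was", "vbdz"), ("noted", "vvn"), ("by", "ii")]
split = [("it", "pph1"), ("was", "vbdz"), ("previously", "rr"),
         ("noted", "vvn"), ("by", "ii")]

print(count_adjacent_passives(adjacent))  # the plain passive is counted
print(count_adjacent_passives(split))     # the intervening adverb hides it
```

So any count based on strict adjacency is a lower bound on the number of be-passives, even if the tagging is perfect.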
However, it doesn't work anyhow, because Mark Davies is (reasonably) worried about doing big joins, and so the search interface informs us that
Well, there are just 8 possible be-forms in the CLAWS tagset, so maybe we can search for each one separately:
were: [vbdr] [vvn]
was: [vbdz] [vvn]
being: [vbg] [vvn]
be: [vbi] [vvn]
am: [vbm] [vvn]
been: [vbn] [vvn]
are: [vbr] [vvn]
is: [vbz] [vvn]
And indeed these searches work.
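If you'd rather not type the eight queries by hand, a few lines of Python can generate them from the tag list above. The dictionary and function names here are made up for the example; only the tags and the query shape come from the list.

```python
# The eight be-form tags listed above, keyed by the word form they cover.
BE_FORMS = {
    "were": "vbdr",
    "was": "vbdz",
    "being": "vbg",
    "be": "vbi",
    "am": "vbm",
    "been": "vbn",
    "are": "vbr",
    "is": "vbz",
}

def passive_queries(participle_tag="vvn"):
    """Build one 'be-form + past participle' query string per be-form."""
    return {form: f"[{tag}] [{participle_tag}]" for form, tag in BE_FORMS.items()}

for form, query in passive_queries().items():
    print(f"{form}: {query}")
```

Each generated string still has to be pasted into the search box one at a time; this only saves typing.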
But there's still a hurdle to overcome. The COHA interface presents its results to us in the form of an HTML table (and furthermore, one that's embedded in an HTML frameset). Here's the start of such a table as displayed in a web browser:
And here's just a little bit of what the underlying HTML (for the relevant frame) looks like:
We could save the HTML as a local file, and write a program to pull out the numbers from the table. (It would be MUCH easier if Mark Davies would add a button to pop up a plain-text table, suitable for reading into R, as below…). [Update -- As Neal Goldfarb points out in the comments, MS Excel and perhaps other spreadsheet programs will accept cut-and-paste of html tables into the rows and columns of a spreadsheet, which may be how some people want to do subsequent analysis anyhow. And for those who prefer data analysis in R, it's possible to write out the spreadsheet as a .csv file, which R can then read via read.csv(). But in my experience, that's at least as much trouble as hacking a plain text file for read.table(). Also, in this case I just want to add up the TOTAL line from each of the 8 returned tables anyhow.]
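For anyone who does want to go the save-the-HTML route, here is a rough Python sketch using only the standard library. It assumes the totals sit in a table row whose first cell reads "TOTAL" and whose other cells are counts, possibly with thousands separators. That layout is a guess, and the sample markup is invented, so check both against the real frame source.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing closing tag
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None

def total_row(html):
    """Return the numbers from the first row whose first cell is TOTAL."""
    parser = TableRows()
    parser.feed(html)
    for row in parser.rows:
        if row and row[0].upper() == "TOTAL":
            return [int(cell.replace(",", "")) for cell in row[1:] if cell]
    raise ValueError("no TOTAL row found")

# Invented sample markup, just to exercise the function.
SAMPLE = (
    "<table><tr><th>CONTEXT</th><th>1810</th><th>1820</th></tr>"
    "<tr><td>WAS MADE</td><td>10</td><td>20</td></tr>"
    "<tr><td>TOTAL</td><td>1,234</td><td>5,678</td></tr></table>"
)
print(total_row(SAMPLE))
```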
But in this case, all that we care about is the bottom line — the total across all instantiations of the pattern:
So we can snarf the TOTAL row with the mouse or trackpad, and paste the numbers into a plain text file, with an initial value to remind us of what the numbers are. After a bit of fussing with tabs and spaces and so on, we'll have a file like this one, with 8 rows and 21 columns. (I used emacs to map all tab/space sequences to single spaces — but in fact, R's function read.table() is rather generous about treating all tab/space combinations as single field separators…)
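For comparison, here is how the same kind of file could be read in Python rather than R. It assumes each line is a label followed by integer counts, and the sample numbers are placeholders, not COHA results. Like read.table(), Python's str.split() with no argument treats any run of tabs and spaces as one separator, so the cleanup step is optional here too.

```python
def read_totals(lines):
    """Map each row label (e.g. 'was') to its list of per-decade counts."""
    table = {}
    for line in lines:
        fields = line.split()  # splits on any run of tabs and/or spaces
        if not fields:
            continue           # skip blank lines
        table[fields[0]] = [int(value) for value in fields[1:]]
    return table

# Placeholder data with messy separators; the real file would have
# 8 rows and 21 columns (a label plus 20 counts per row).
SAMPLE = "were\t10  20 30\n\nwas 40\t\t50 60\n"
totals = read_totals(SAMPLE.splitlines())
print(totals)
```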
Now all we need is a little bit of R scripting, like this, and we can plot the trend:
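The core of such a script is small: add up the eight rows decade by decade, then divide by the size of each decade's slice of the corpus. Here is a Python sketch of that step (not the R code behind the plot). It assumes you have the number of words per decade from the corpus documentation, which isn't given here, and every number in the example is a placeholder.

```python
def per_million(totals_by_form, words_per_decade):
    """Sum counts across be-forms for each decade, then scale each sum
    to a rate per million words of that decade's text."""
    summed = [sum(column) for column in zip(*totals_by_form.values())]
    if len(summed) != len(words_per_decade):
        raise ValueError("need exactly one corpus size per decade")
    return [1_000_000 * count / size
            for count, size in zip(summed, words_per_decade)]

# Placeholder numbers: two be-forms, two decades.
toy_totals = {"were": [10, 20], "was": [30, 20]}
toy_sizes = [10_000, 20_000]  # hypothetical words per decade
rates = per_million(toy_totals, toy_sizes)
print(rates)
```

Normalizing matters because the decades in a historical corpus are not all the same size, so raw totals would partly reflect how much text each decade contributes.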
So we see that Americans' (textual) passivity has been declining steadily for 200 years, and even more steeply since WWII.
The effect is not a small one — the estimated rate has fallen from 4,369 per million in the 1820s, to 1,951 per million in the 2000s.
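To put those two figures in proportion, here is the arithmetic as a quick Python check, using only the two rates quoted above:

```python
rate_1820s = 4369  # passives per million, 1820s (from the text)
rate_2000s = 1951  # passives per million, 2000s (from the text)

ratio = rate_1820s / rate_2000s
percent_decline = 100 * (rate_1820s - rate_2000s) / rate_1820s

print(f"ratio: {ratio:.2f}")
print(f"decline: {percent_decline:.1f}%")
```

That comes to a ratio of roughly 2.2 to 1, or a drop of a bit more than half.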
There are a few controls we should run to be sure that the effect is a real one, and that we're describing it correctly:
- Maybe the mix of sources in COHA has been changing, in ways that contribute to the effect or even explain it? (I don't think so.)
- Maybe the UCREL CLAWS tagger mis-classifies past participles at a high rate in a time-varying way? (I don't think so.)
- Maybe passive verbs are being used at the same relative rate as ever, but verbs as a class have gotten less frequent (relative to other classes of words) by a factor of two — e.g. American English is getting nounier. (I don't think so.)
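The third control boils down to changing the denominator: count passives per verb rather than per word. Here is a small Python sketch of that check, assuming you've also pulled a per-decade rate for all verbs (say, from a search on verb tags). The function name and all the numbers are placeholders, not COHA figures.

```python
def passive_share(passives_per_million, verbs_per_million):
    """Fraction of verb tokens that occur in be-passive patterns."""
    return passives_per_million / verbs_per_million

# Scenario A (placeholder numbers): verbs overall hold steady,
# so the passive share halves along with the per-word rate.
early_a = passive_share(4000, 160_000)
late_a = passive_share(2000, 160_000)

# Scenario B (placeholder numbers): verbs overall halve too, so the
# share is flat and the per-word decline is just "nouniness".
early_b = passive_share(4000, 160_000)
late_b = passive_share(2000, 80_000)

print(early_a, late_a)
print(early_b, late_b)
```

If the real data look like scenario A, the decline is a change in how verbs are used; if they look like scenario B, it's a change in how many verbs there are.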
And there are plenty of other patterns to look at:
- Has there also been a historical change in the relative proportions of different forms of be in passives?
- What about the distribution of lexical verbs in passives?
- Are there historical changes in the (absolute or relative) frequency of other aspect/tense/mood/voice combinations?
And then there are various possible explanations for the effect, some of which can be explored empirically:
- Vernacularization: Maybe American English prose has been getting closer to the norms of the spoken language, with increased use of get-passives, and decreased use of passives overall.
- Maybe American writers are increasingly likely to put adverbs between be-forms and past participles in passive-voice constructions.
- Maybe the "Avoid Passive" usage advice is taking its toll.
The vernacularization hypothesis makes some sense, but it doesn't seem adequate to me: I suspect that we can see a similar trend in the spoken language. And none of the other obvious explanations seems likely to stand up to testing. For example, the anti-passive animus among usage mavens seems to have originated in the first couple of decades of the 20th century, at a time when the decline in passive usage had apparently been underway for at least a hundred years.
No doubt readers will be able to think of other explanations. But I wonder whether maybe the passive is just going the way of the passival.