
Email yesterday from Bill Benzon:

Here's a blog post about a little bit of linguistic detail in a VERY interesting book: Matthew Jockers, Macroanalysis: Digital Methods & Literary History.

Do you have any thoughts on that detail?

The post in question is "Reading Macroanalysis 4: On the matter of 'the'", New Savanna 8/13/2014, and the "detail" in question is a cited difference in the frequency of the word the between a collection of 19th-century British novels and a comparable collection of 19th-century American novels:

Chapter 7, “Nationality” is pretty straightforward. I don’t have much to say about it except for a puzzle that Jockers presents at the beginning. He points out that, because British and American writers have different practices concerning the word the, that word is about 5 percent of the word tokens in his corpus of 19th Century British novels, while it is about 6 percent of the tokens in the American novels.

My first thought is to point Bill towards the work of Jamie Pennebaker and other social psychologists, who have consistently found surprisingly large effects of many factors on rates of function-word use.

Thus Michael Cohn, Matthias Mehl, and James Pennebaker, "Linguistic Markers of Psychological Change Surrounding September 11, 2001", Psychological Science 2004:

When people are writing with high psychological distance (compared with low psychological distance), they use longer words and more articles, and avoid present tense and first-person singular.

Or Matthias Mehl & James Pennebaker, "The Sounds of Social Life: A Psychometric Analysis of Students’ Daily Social Environments and Natural Conversations", Journal of Personality and Social Psychology 2003:

The natural conversations and social environments of 52 undergraduates were tracked across two 2-day periods separated by 4 weeks using a computerized tape recorder (the Electronically Activated Recorder [EAR]). The EAR was programmed to record 30-s snippets of ambient sounds approximately every 12 min during participants’ waking hours. Students’ social environments and use of language in their natural conversations were mapped in terms of base rates and temporal stability.


Consistent with previous research (Pennebaker & King, 1999), men used significantly more big words (words more than six letters long: 9.4% vs. 8.3%), more articles (4.4% vs. 3.5%), fewer first-person singular pronouns (6.2% vs. 7.5%), and fewer discrepancy words (2.0% vs. 2.5%) than women.

In a similar vein, there's e.g. Duyen Nguyen & Susan Fussell, "Lexical Cues of Interaction Involvement in Dyadic Instant Messaging Conversations", Discourse Processes 2014:

In Study 1, an experiment with 60 participants, we manipulated level of involvement in a conversation with a distraction task. We examined how participants' uses of verbal cues such as pronouns were associated with their involvement in text-only IM conversations. We found that use of personal pronouns, assent words, cognitive words, and definite articles were significant indicators of a participant's involvement.

Specifically, the proportion of definite articles (from their Table 4):

                      High Involvement Condition    Low Involvement Condition
Mean                  6.02                          4.70
Standard deviation    4.87                          3.55
95% C.I.              [5.14, 6.90]                  [4.05, 5.34]


We see similarly lawful patterns (though with different base rates) if we look at the influence of sex and age on the average rate of "the" usage in the Fisher conversational telephone speech transcripts:

So we might ask whether Jockers' collections of 19th-century British and American novels are balanced for the authors' sex and age, or for the percentage of dialogue, or for the amount of real-time narration vs. discussion of less immediate things (like the extended passages of natural history in Moby Dick). But then again, maybe there's just some geographical variation, along with everything else.

Note that the reasons for variation in THE frequency are surely various: use of definite descriptions as opposed to pronouns or names ("the parson" vs. "he" vs. "Mr. Samuels"); presence or absence of modifiers ("the door" vs. "the tavern door"); general phrasal choice ("the place she came from" vs. "her home town").
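To make the point concrete, here is a minimal Python sketch that computes the share of word tokens that are "the" in each of the example phrases above. The function name the_rate and the crude regex tokenizer are illustrative assumptions, not the method used by Jockers or in any of the studies cited here.

```python
import re

def the_rate(text: str) -> float:
    """Share of word tokens in `text` that are 'the' (case-insensitive).

    Tokenization is a deliberately crude assumption: runs of letters
    and apostrophes count as words, everything else is a separator.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(tok == "the" for tok in tokens) / len(tokens)

# The contrasting phrasings mentioned in the paragraph above.
examples = [
    "the parson", "he", "Mr. Samuels",
    "the door", "the tavern door",
    "the place she came from", "her home town",
]

for phrase in examples:
    print(f"{phrase!r}: {the_rate(phrase):.0%}")
```

Each choice of phrasing shifts both the count of "the" and the total token count, so a corpus-wide rate reflects many such small decisions at once.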

Thus part of the reason for the somewhat more frequent use of THE by male speakers in the Fisher transcripts is probably the somewhat more frequent use by female speakers of e.g. possessive pronouns:

          Males     Females
my        0.461%    0.650%
your      0.211%    0.215%
her       0.062%    0.113%
his       0.058%    0.070%
our       0.079%    0.105%
their     0.132%    0.145%
TOTAL     1.00%     1.30%
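The totals can be checked by summing the per-pronoun rates. A small Python sketch, with the numbers copied from the table above and variable names that are purely illustrative:

```python
# Rates as a percentage of all word tokens, copied from the table above.
# Each entry is (males, females).
rates = {
    "my":    (0.461, 0.650),
    "your":  (0.211, 0.215),
    "her":   (0.062, 0.113),
    "his":   (0.058, 0.070),
    "our":   (0.079, 0.105),
    "their": (0.132, 0.145),
}

male_total = sum(m for m, _ in rates.values())
female_total = sum(f for _, f in rates.values())
gap = female_total - male_total  # in percentage points

print(f"males:   {male_total:.2f}%")
print(f"females: {female_total:.2f}%")
print(f"gap:     {gap:.3f} percentage points")
```

The gap of roughly 0.3 percentage points is the amount of "room" that these possessive pronouns could take up in noun phrases that might otherwise have used THE.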




  1. Bill Benzon said,

    August 14, 2014 @ 7:05 am

    Thanks, Mark. Jocker’s does mention one difference between British and American usage, but I didn’t cite it in my post (p. 105). Where Americans would say, “I have to go to the hospital,” the British speaker would say “I have to go to hospital.”

    [(myl) Since the word "hospital" has an overall frequency of approximately 1 per 10,000 words in modern fiction (100 per million), and probably less in 19th-century fiction, this difference could account for approximately 1/100 of the cited effect, at best.]

    What he found puzzling is that, over the course of the century, when British use of the would go up or down, so did American use. But that didn’t happen with any of the other handful of words he tested. He didn’t actually list them, but noted that they were “the ten or so most frequent words” in the corpus, I presume.

  2. Bill Benzon said,

    August 14, 2014 @ 7:08 am

    Whoops! "Jockers" not "Jocker's". For some reason the substitution of "'s" for final "s" seems to be on autopilot in my keyboarding habits.

  3. D.O. said,

    August 14, 2014 @ 12:11 pm

    If the/possessive pronouns trade-off is a real thing it should be visible on the individual participant level. From comments on the previous post I remember that a conversation side in the Fisher corpus is about 1000 words. That gives ~25 the and ~10 possessive pronouns. Not very promising because of the random noise, but if the effect is strong…

  4. michael farris said,

    August 14, 2014 @ 2:34 pm

    There's also British "in future" vs US "in the future".

  5. Eric P Smith said,

    August 14, 2014 @ 3:22 pm

    @michael farris: I'm British, and I reckon that in British English "in future" and "in the future" have different meanings. "In future" = from now on; "In the future" = at some future time.

  6. Richard said,

    August 14, 2014 @ 10:02 pm

    @Eric P Smith

    And in U.S. English, "In the future" serves both meanings. We wouldn't say "In future."

    "And in the future, behave yourself, young man!"

  7. Michael Watts said,

    August 16, 2014 @ 3:51 am

    Speaking as someone born and raised entirely in the US, I say "in future". So does the rest of my family.

  8. Brett said,

    August 16, 2014 @ 8:30 am

    My wife has taken to saying only, "in future," but she started doing it shortly after we started watching a lot of British television.

  9. Graham Asher said,

    August 19, 2014 @ 4:27 pm

    American English has more homonyms than BrE because of loss of the writer/rider and (in some dialects) Mary/merry/marry distinctions. One of the most salient uses of 'the' is to add the redundancy needed to aid parsing. I suggest that extra uses of 'the' in AmE are partly influenced by a balancing need for extra information caused by the loss of other distinctions. Obviously there are other strategies, like 'horseback riding' for 'riding'.

  10. ajay said,

    August 22, 2014 @ 5:03 am

    Where Americans would say, “I have to go to the hospital,” the British speaker would say “I have to go to hospital.”

    Those are both used in British English, but they have subtly different meanings.
    Where are you? If I am in the hospital I might be visiting a sick friend, I might be there because I'm a nurse, I might be repairing their plumbing. But if I am in hospital then I'm ill. Same with "in school/in the school" – if you're in school then you're a pupil in class, or possibly a teacher. If you're in the school, all you're saying is that you're physically within a certain building.

  11. ajay said,

    August 22, 2014 @ 5:04 am

    And, for that matter, I could be in school without being in the school. If you ask a parent "is your daughter in school" then you are asking "is she attending school" – the parent will say "yes" even if the daughter's, at that moment, upstairs in bed.
