In an earlier post on this topic ("Why definiteness is decreasing, part 1"), I suggested that the decrease in definite-article frequency in published English text, over the course of the past century, might be connected with a decrease in formality. Roughly, this means that writing has been becoming more like speech (though speech has also been changing, and writing and speech remain very different).
In this post, I want to discuss two other socio-stylistic dimensions: age and sex. If the language is changing, then we expect to see "age grading", where younger people tend to exhibit the innovative pattern, while older people's usage is more old-fashioned. And because women are generally the leaders in language change, we expect to see women at every age being more linguistically innovative and men being more conservative. In other words, "young men talk like old women". And as the plot on the right illustrates, differences by age and sex in the frequency of "the" seem to confirm this hypothesis.
These numbers come from the Fisher corpus of conversational telephone speech, comprising nearly 12,000 10-minute conversations involving a similar number of callers. Here are the numbers in tabular form, giving the frequency of "the" as a percentage of all words produced by callers in the specified age range:
| Age <28 | Age 28-40 | Age >40 |
And trust me, the numbers are large enough that these differences are statistically significant.
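For readers who would rather check than trust, here is a minimal sketch of how such a percentage and a significance test might be computed. It is not the script behind the numbers above: the tokenizer is a crude stand-in, the function names are hypothetical, and the two-proportion z-test is just one reasonable choice of test, which treats tokens as independent and so will tend to overstate significance for real conversational data.

```python
import math
import re


def the_percentage(text):
    """Percentage of word tokens in `text` that are the word "the".

    Assumption: a token is any run of letters or apostrophes, lowercased.
    Real transcripts would need corpus-specific cleanup first.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 100.0 * sum(1 for t in tokens if t == "the") / len(tokens)


def two_proportion_z(hits_a, total_a, hits_b, total_b):
    """Two-sided two-proportion z-test; returns (z, p_value).

    hits_* are counts of "the", total_* are counts of all word tokens
    for two groups of speakers (for example, older and younger callers).
    """
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Invented counts for illustration only; these are NOT Fisher figures.
    z, p = two_proportion_z(3000, 100_000, 2500, 100_000)
    print(f"z = {z:.2f}, p = {p:.3g}")
```

With millions of tokens per group, even a difference of a few hundredths of a percentage point would come out as significant under this test, which is why a more careful analysis would model variation across speakers rather than pooling tokens.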
So one interpretation of yesterday's theory about decreasing formality in written language is apparently wrong. If this age grading in the spoken language reflects a change in progress, then we can reject the hypothesis that the changes in text are nothing but a gradual approximation to a fixed pattern in speech. Rather, the frequency of "the" is apparently decreasing in both text and speech; it's just that text is a lagging indicator.
Of course, this argument is an indirect one, since we don't have comparable speech samples from very widely separated time periods, and instead we're relying on age-grading to give us an "apparent time" picture. But as I noted yesterday, we do have two samples of conversational transcripts collected about a dozen years apart, which do show a change in the expected direction: the Switchboard corpus, collected in 1990-91, has an overall "the" frequency of 2.98%, while Fisher, collected in 2003, has 2.47%.
In any case, we've already seen age- and sex-linked variation in the frequency of "the", in a corpus of informal text. As I explained in "Sex, age, and pronouns on Facebook" (9/19/2014):
Andy Schwartz and others at the World Well-Being Project have worked with "Facebook posts from over 75,000 volunteers who also took the standard Interpersonal Personality Item Pool (IPIP) personality test to measure the 'Big Five' personality traits", looking for linguistic features that correlate with those aspects of personality measured by that test.
And in "More fun with Facebook: THE" (10/12/2014), I observed that
The script that I used to make that course assignment about Facebook pronouns ("Sex, age, and pronouns on Facebook", 9/19/2014; "More fun with Facebook pronouns", 9/27/2014) can trivially be focused on any other words — so here's "the":
(For comparison purposes, the y-axis frequencies of 20,000 to 30,000 per million in that graph are equivalent to 2.0-3.0 percent.)
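The unit conversion in that parenthetical is just a division by 10,000, since one percent is 10,000 per million. A minimal sketch, with hypothetical function names:

```python
def per_million_to_percent(rate_per_million):
    """Convert a frequency in words per million to a percentage."""
    return rate_per_million / 10_000


def percent_to_per_million(percent):
    """Convert a percentage to a frequency in words per million."""
    return percent * 10_000


# The y-axis range quoted above, expressed as percentages.
low, high = per_million_to_percent(20_000), per_million_to_percent(30_000)
print(low, high)
```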
It remains possible that the age- and sex-linked differences in the usage of "the" are a life cycle effect rather than evidence of overall linguistic change in progress: maybe older people just gradually get more formal (or at least more the-ful) both in speech and in writing. But I'm guessing that this is mostly a linguistic change in progress.
We still don't know what these changes really are, in detail. What is taking the place of those missing definite articles? In response to a comment on yesterday's post, I listed some possibilities:
[A] smaller number of non-pronominal noun phrases; a lower percentage of definite articles on a similar number of noun phrases; a larger number of one or more other parts of speech (e.g. adjectives, adverbs, discourse particles, etc.); and so on.
And whatever the distributional shifts may be, are they semantically and rhetorically neutral, or do they reflect some larger stylistic shifts? For an example of what such a shift might be like, consider what Pennebaker et al. have called the "categorical-dynamic index" (CDI), featured in a paper published just about a week ago: James Pennebaker, Cindy Chung, Joey Frazee, Gary Lavergne, and David Beaver, "When Small Words Foretell Academic Success: The Case of College Admissions Essays", PLoS ONE 12/31/2014. The abstract:
The smallest and most commonly used words in English are pronouns, articles, and other function words. Almost invisible to the reader or writer, function words can reveal ways people think and approach topics. A computerized text analysis of over 50,000 college admissions essays from more than 25,000 entering students found a coherent dimension of language use based on eight standard function word categories. The dimension, which reflected the degree students used categorical versus dynamic language, was analyzed to track college grades over students' four years of college. Higher grades were associated with greater article and preposition use, indicating categorical language (i.e., references to complexly organized objects and concepts). Lower grades were associated with greater use of auxiliary verbs, pronouns, adverbs, conjunctions, and negations, indicating more dynamic language (i.e., personal narratives). The links between the categorical-dynamic index (CDI) and academic performance hint at the cognitive styles rewarded by higher education institutions.