From John Coleman:
Inspired by your recent Language Log pieces, I tried an analysis of "er" vs "erm" in the Spoken BNC. These are the two main transcriptions for filled pauses labelled as "UNC" in the Claws-5 tagset and also "UNC" in the richer set of pos labels used in BNC. I.e. they are distinguished from items labelled as ITJ / INTERJ, in which the few tokens of "uh" and "um" are classified. These "uh"s are almost all in "uh huh" meaning "yes", and many of the "um"s and "mm"s are also in contexts where the "yes" sense is clear. So I disregarded the ITJs and restricted the analysis to UNC "er" and "erm", which are far more numerous in any case. As these are mostly nonrhotic dialects one can interpret "erm" as just schwa + nasality, with no implication of rhoticity; ditto for "er".
The British National Corpus is a balanced corpus of 100 million words, collected in the early 1990s. The spoken portion comprises ten million words; Jiahong Yuan and I collaborated with John Coleman and others, five years ago, on a project to help rescue the recordings and connect them appropriately with the transcripts.
Here are John's counts:
SEX  AGE    ER   ERM
f    0     315   499
f    1     669   813
f    2    1299  2121
f    3    1574  1398
f    4    2255  1674
f    5    3226  2071
m    0     602   518
m    1    1125   893
m    2    2530  2111
m    3    2631  1642
m    4    4614  3513
m    5    5648  1605
As John explains,
The age groups are 0 (1-14 years old), 1 (15-24), 2 (25-34), 3 (35-44), 4 (45-59), and 5 (60-95). We have 45,246 tokens from turns that are labelled with speaker age and sex; I left out the remaining 28,378.
The overall ERM/(ERM+ER) proportion for female speakers is 47.9%, while for male speakers it's 37.5%. Thus the direction of the sex effect is consistent with what we've seen in five American datasets and the Glaswegian dataset from the HCRC Map Task corpus.
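For anyone who wants to check the arithmetic, here is a minimal sketch in Python that recomputes the ERM/(ERM+ER) proportions from John's counts above. The names `COUNTS`, `erm_share`, `share_by_sex`, and `share_by_sex_and_age` are my own illustrative choices, not part of any BNC tooling, and this is not necessarily how John produced his numbers.

```python
# Counts transcribed from the table above: (sex, age group) -> (ER, ERM).
COUNTS = {
    ("f", 0): (315, 499),
    ("f", 1): (669, 813),
    ("f", 2): (1299, 2121),
    ("f", 3): (1574, 1398),
    ("f", 4): (2255, 1674),
    ("f", 5): (3226, 2071),
    ("m", 0): (602, 518),
    ("m", 1): (1125, 893),
    ("m", 2): (2530, 2111),
    ("m", 3): (2631, 1642),
    ("m", 4): (4614, 3513),
    ("m", 5): (5648, 1605),
}


def erm_share(pairs):
    """Return ERM / (ERM + ER) pooled over an iterable of (ER, ERM) pairs."""
    pairs = list(pairs)
    er = sum(p[0] for p in pairs)
    erm = sum(p[1] for p in pairs)
    return erm / (er + erm)


def share_by_sex(counts, sex):
    """Pool all age groups for one sex and return the ERM share."""
    return erm_share(v for (s, _age), v in counts.items() if s == sex)


def share_by_sex_and_age(counts):
    """Return the ERM share separately for each (sex, age group) cell."""
    return {key: erm_share([value]) for key, value in counts.items()}


if __name__ == "__main__":
    for sex in ("f", "m"):
        print(f"{sex}: {100 * share_by_sex(COUNTS, sex):.1f}%")
    for (sex, age), share in sorted(share_by_sex_and_age(COUNTS).items()):
        print(f"{sex} age group {age}: {100 * share:.1f}%")
```

The pooled figures come out at 47.9% for female speakers and 37.5% for male speakers, matching the numbers quoted above. The per-cell values are the quantity shown in the sex-by-age plot.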
A plot of the interaction between sex and age is here:
This also shows the same apparent-time change in the direction of greater ERM (==UM) usage found in Fisher, Switchboard, and the Philadelphia Neighborhood Corpus (PNC).
In the PNC, where we have data collected over a period of 40 years, this seems to be partly a life-cycle effect and partly a genuine change in progress.
It's surprising that there's such a widespread and robust marker of gender (and age) identity that (as far as I know) no one noticed before I stumbled on it in 2005 while looking for something else. I won't be surprised to learn that there are some earlier observations, but in any case, ordinary people don't seem to register these differences consciously at all.
Update: John supplied the total word counts for each age and sex combination, making it possible to calculate ER percentages by age and sex, which show the same age grading as we saw in the Fisher and Philadelphia Neighborhood datasets:
The ERM percentages by age and sex show a less clear pattern:
The overall filled pause (ERM+ER) percentages:
In response to my observation that "It's surprising that there's such a widespread and robust marker of gender (and age) identity that (as far as I know) no one noticed before I stumbled on it in 2005", John responded:
Ah, not really surprising? (a) probably nobody looked for it, and (b) as you have pointed out many times, it is only recently that the right combination of corpora, easy access to corpora, and widespread ownership of laptops etc permitting breakfast and other periprandial experiments to be carried out have come together. Earlier studies of filled pauses etc in e.g. the Conversation Analysis literature tended to be small-scale micrographic studies in which changes over time and/or differences in usage frequency would be invisible.
I agree, but what surprises me is that this rather large gender difference, in a rather common aspect of speech, wasn't a commonplace anecdotal reaction to people's experience of everyday life.
John is just as puzzled as I am about how this difference came to be as geographically and socially widespread as it clearly is:
It's also pretty mysterious as to why or how the change should be so widespread? Sure, there is always a certain amount of linguistic to-ing and fro-ing between US and UK English, on a fairly low level (catchwords etc), but otherwise the many dialects involved have retained their separate characteristics quite extensively, it seems to me. The UH/UM change precedes Facebook, universal internet, huge increase in US TV shows in Britain etc … I am baffled as to *cause*.
It's a plausible hypothesis that phonetic symbolism is involved somehow, but the only evidence for this idea is Sherlock Holmes' assertion that "when you have eliminated the impossible, whatever remains, however improbable, must be the truth".
The accumulating set of LLOG posts on UM vs. UH:
"Young men talk like old women", 11/6/2005
"Fillers: Autism, gender, age", 7/30/2014
"More on UM and UH", 8/3/2014
"UM UH 3", 8/4/2014
"Male and female word usage", 8/7/2014
"UM / UH Geography", 8/13/2014
"Educational UM / UH", 8/13/2014
"UM / UH: Life-cycle effects vs. language change", 8/15/2014
"Filled pauses in Glasgow", 8/17/2014