[Warning: More than usually wonkish and quantitative.]
In two recent and one older post, I've referred to apparent gender and age differences in the usage of the English filled pauses normally transcribed as "um" and "uh" ("More on UM and UH", 8/3/2014; "Fillers: Autism, gender, and age", 7/30/2014; "Young men talk like old women", 11/6/2005). In the hope of answering some of the many open questions, I decided to make a closer comparison between the Switchboard dataset (collected in 1990-91) and the Fisher dataset (collected in 2003).
I used the Mississippi State 1998 re-transcription of Switchboard, and the Fisher English Training Set Part 1 and Part 2 transcripts, as published by the LDC. For this comparison, I used only those Fisher conversation sides where the post-hoc call auditing agreed with the nominal speaker's registered demographic information.
The general pattern is consistent: on average, females use more UMs than males, while males use more UHs than females. But the base rates are quite different: the overall (UM+UH) rate was 2.80% in Switchboard versus 1.64% in Fisher.
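Operationally, these rates are just token counts over the transcripts. Here's a minimal sketch of the computation — the transcript line is invented, since real input would be a tokenized LDC conversation side:

```python
def filler_rate(tokens):
    """Fraction of tokens that are the filled pauses "um" or "uh".

    Exact whole-token matching, so backchannels like "uh-huh"
    are not counted as fillers.
    """
    fillers = sum(1 for t in tokens if t in ("um", "uh"))
    return fillers / len(tokens)

# Hypothetical conversation side:
side = "um i think uh the answer is um maybe yes".split()
print(f"{100 * filler_rate(side):.2f}%")  # 3 fillers / 10 tokens -> 30.00%
```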
The breakdown by UM vs. UH and male vs. female shows a similar and consistent difference in base rates, along with a consistent pattern of sex differences:
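The cross-classification behind such a breakdown can be sketched as follows — the per-side data here are invented placeholders, not values from either corpus:

```python
from collections import Counter

def rates_by_sex(sides):
    """sides: iterable of (sex, tokens) pairs.
    Returns {(sex, filler): rate as % of that sex's tokens}."""
    fillers = Counter()   # (sex, "um"/"uh") -> count
    words = Counter()     # sex -> total token count
    for sex, tokens in sides:
        words[sex] += len(tokens)
        for t in tokens:
            if t in ("um", "uh"):
                fillers[sex, t] += 1
    return {(sex, f): 100 * fillers[sex, f] / words[sex]
            for sex in words for f in ("um", "uh")}

# Toy example: one female side with more UMs, one male side with more UHs.
sides = [
    ("F", ["um", "well", "um", "yes", "uh"] + ["x"] * 95),
    ("M", ["uh", "well", "uh", "no", "um"] + ["x"] * 95),
]
print(rates_by_sex(sides))
```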
Why the large difference in overall filler-word rates?
Logically, the cause might be a change in linguistic norms over the course of 12 years; a difference in the conversational setting; a difference in the sample of speakers; or a difference in transcription practices.
The "age grading" of UM/UH ratios might suggest that there's some change in progress in relative UM vs. UH usage, but I think it's out of the question that overall rates of filler-word usage decreased by more than 40% in 12 years.
There are some differences in the demographics of the speaker samples for Switchboard (for which speakers were recruited at Texas Instruments in Dallas) and Fisher (for which speakers were recruited at Penn in Philadelphia), though in both cases an effort was made to get a broad demographic and socioeconomic sample. Since location and years-of-education information is available for speakers in both datasets, this issue can be explored further, at least to some extent.
The conversational setting was essentially identical for the two collections: a telephone conversation with a stranger on an assigned topic that both speakers had indicated they were willing to discuss. Many of the topics were even the same. One perhaps-relevant difference is that 99% of Fisher participants made 3 or fewer calls, whereas 46% of Switchboard participants made more than 10 calls, and only 22% made 3 or fewer. It's possible that participating in your fifth or tenth or fifteenth call puts you in a frame of mind where filler words come more easily.
And finally, the transcriptions were done at different times by different groups of people using different technological support. There's a general tendency for transcribers to edit out various sorts of disfluencies, including filled pauses: first because human verbal memory tends to filter out disfluencies even when a listener is trying to register them accurately, and second because transcribers often learn to omit disfluencies to improve readability, even when they hear and remember them. In both the Switchboard and Fisher transcripts, transcribers were instructed to record filled pauses accurately, but (especially in the case of Fisher) there was an emphasis on transcription speed as well. So it's plausible that a difference in transcription practices contributed to the cited difference in overall base rates of filler-word usage.
If we go on to divide things up by speaker age as well, the same sort of relationship applies — effects of sex and age in a consistent direction, along with a large overall difference between the two datasets:
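Adding age to the cross-classification just means binning ages and extending the grouping key — e.g. by decade. A sketch, again with invented placeholder counts rather than real corpus data:

```python
def decade(age):
    """Bin an age into its decade label, e.g. 37 -> "30s"."""
    return f"{10 * (age // 10)}s"

def rate_table(rows):
    """rows: (dataset, sex, age, um_count, uh_count, word_count) tuples.
    Returns {(dataset, sex, decade): (um_rate_%, uh_rate_%)}."""
    totals = {}
    for dataset, sex, age, um, uh, words in rows:
        key = (dataset, sex, decade(age))
        u, h, w = totals.get(key, (0, 0, 0))
        totals[key] = (u + um, h + uh, w + words)
    return {k: (100 * u / w, 100 * h / w) for k, (u, h, w) in totals.items()}

# Hypothetical per-side counts:
rows = [
    ("SWB", "F", 25, 30, 10, 1000),
    ("SWB", "F", 28, 20, 10, 1000),
    ("FSH", "M", 61, 5, 15, 1000),
]
print(rate_table(rows))
```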
In graphical form ("FF" is Fisher females, "FM" is Fisher males, "SF" is Switchboard females, "SM" is Switchboard males):
The overall UM/(UM+UH) proportions, and the UM/UH ratios by age and sex, again show similar trends in the two datasets, but quite different actual values:
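For reference, the two statistics are simple transforms of one another: if p = UM/(UM+UH), then the UM/UH ratio is p/(1-p). A sketch, with hypothetical counts:

```python
def um_proportion(um, uh):
    """UM share of all filled pauses: UM / (UM + UH)."""
    return um / (um + uh)

def um_uh_ratio(um, uh):
    """UM/UH ratio; equals p / (1 - p) where p is the UM proportion."""
    return um / uh

# E.g. hypothetical counts of 300 UMs and 100 UHs:
p = um_proportion(300, 100)   # 0.75
r = um_uh_ratio(300, 100)     # 3.0
assert abs(r - p / (1 - p)) < 1e-9
```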
It's possible that a more systematic statistical model, allowing for all available factors, would clear this up. But I suspect that more elaborate modeling, though a good thing to do, wouldn't succeed in explaining these differences between the datasets, unless dataset identity is one of the factors.
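One simple form such a model might take: regress per-speaker filler rates on sex, age, and dataset identity, and examine the dataset coefficient. This is only a sketch — the rows are invented per-speaker summaries, and a real analysis would use the full corpora, probably with a logistic model on token counts rather than least squares on rates:

```python
import numpy as np

# Design matrix columns: intercept, female (0/1), age minus 40, Switchboard (0/1).
# All values below are invented placeholders, not real corpus data.
X = np.array([
    [1, 1, -15, 1],
    [1, 0,  22, 1],
    [1, 1,  -9, 0],
    [1, 0,  15, 0],
    [1, 1,   5, 1],
    [1, 0, -20, 0],
], dtype=float)
y = np.array([3.1, 2.6, 1.8, 1.5, 2.9, 1.4])  # (UM+UH) rate in %

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[3] estimates the Switchboard-vs-Fisher gap after controlling for
# sex and age; with these toy numbers it comes out positive.
print(beta)
```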
Since the original recordings have also been published, we could re-transcribe a sample of the material and compare the results with the published transcripts, to evaluate the effect of transcription practices. For those interested in filler-word usage, this would be worth doing.
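Such a check could be as simple as comparing filler rates between the published transcript and a careful re-transcription of the same audio. The token lists below are invented stand-ins for illustration:

```python
def filler_rate(tokens):
    """Fraction of tokens that are the filled pauses "um" or "uh"."""
    return sum(t in ("um", "uh") for t in tokens) / len(tokens)

def transcription_gap(published, retranscribed):
    """Difference in filler rate between two transcriptions of one recording.
    A consistently positive gap would suggest that the published transcript
    under-reports filled pauses."""
    return filler_rate(retranscribed) - filler_rate(published)

# Invented example: the re-transcription restores one edited-out "uh".
published = ["i", "think", "um", "it", "was", "fine"]
retranscribed = ["i", "think", "um", "uh", "it", "was", "fine"]
print(transcription_gap(published, retranscribed))
```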