Below is a guest post by Martijn Wieling, following up on a series of LLOG postings over the years on the effects of sex, age, geography and other factors on the relative frequency of the filler words um and uh: "Young men talk like old women", 11/6/2005; "Fillers: Autism, gender, and age", 7/30/2014; "More on UM and UH", 8/3/2014; "UM UH 3", 8/4/2014; "Educational UM / UH", 8/13/2014; "UM / UH geography", 8/13/2014; "UM / UH: Life-cycle effects vs. language change", 8/15/2014; "Filled pauses in Glasgow", 8/17/2014.
I was surprised to see this effect in the first place; and more surprised to see it robustly replicated in a variety of American English datasets; and even more surprised to see the same pattern in Glasgow. The fact that the same pattern is also found in Dutch raises some interesting questions, about which more later.
After reading the various posts about the uh/um distinction and its relation to gender and age for English speakers, Gosse Bouma, a colleague at the University of Groningen, and I decided to look at this distribution in a series of spontaneous conversations extracted from a corpus of spoken Dutch (Corpus Gesproken Nederlands). While Dutch speakers also use ‘uh’ and ‘um’ as hesitation markers, they generally prefer the vocalic hesitation marker ‘uh’ over the vocalic-nasal hesitation marker ‘um’ (de Leeuw, 2007: “Hesitation markers in English, German, and Dutch”, Journal of Germanic Linguistics). No studies, however, have examined how this distribution relates to gender and age.
A logistic regression model predicting the relative frequency of ‘um’ showed that, while ‘uh’ is indeed the preferred marker, the relative frequency of ‘um’ is significantly (p < 0.0001) higher for women than for men and for younger than for older speakers. The table and figure below illustrate this relationship by showing the relative frequency of ‘um’ in four age groups (each containing approximately 25% of the speakers). (Note that the relative frequency of ‘uh’ can be obtained by subtracting these values from 1.)
In graphical form:
The corpus data we use contain speakers from two countries, the Netherlands (NL) and Belgium (FL; i.e. Flanders, where Dutch is the native language). The tables and figures below show that this factor plays an important role. Speakers from Flanders show a much higher relative frequency of ‘um’ than speakers from the Netherlands. In addition, the effects of both age and gender are significantly (p < 0.05) stronger for the speakers from Flanders than for those from the Netherlands. In both logistic regression models, however, the effects of both age (with younger speakers showing a greater relative frequency of ‘um’) and gender (with women showing a greater relative frequency of ‘um’) are highly significant (p < 0.0001).
| Relative frequency of ‘um’: NL | Male | Female |
|---|---|---|
| Born between 1914 and 1949 | 0.047 | 0.084 |
| Born between 1950 and 1963 | 0.062 | 0.075 |
| Born between 1964 and 1975 | 0.065 | 0.098 |
| Born between 1976 and 1987 | 0.078 | 0.103 |

| Relative frequency of ‘um’: FL | Male | Female |
|---|---|---|
| Born between 1914 and 1949 | 0.085 | 0.114 |
| Born between 1950 and 1963 | 0.081 | 0.154 |
| Born between 1964 and 1975 | 0.141 | 0.242 |
| Born between 1976 and 1987 | 0.208 | 0.306 |
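For readers who want to see the tabulated proportions on the log-odds scale that logistic regression works with, here is a minimal Python sketch. It only re-expresses the NL and FL values from the tables: it derives the relative frequency of ‘uh’ as 1 minus that of ‘um’, and computes a descriptive female-versus-male odds ratio per birth cohort. The names `UM_FREQ`, `log_odds`, `odds_ratio` and `summarize` are made up for this illustration, and the resulting ratios are computed from aggregated proportions, so they are not the coefficients of the regression models reported above.

```python
import math

# Relative frequency of 'um' (share of 'um' among 'um' + 'uh'),
# copied from the NL and FL tables. Layout (an assumption of this
# sketch): country -> list of (birth cohort, male, female).
UM_FREQ = {
    "NL": [
        ("1914-1949", 0.047, 0.084),
        ("1950-1963", 0.062, 0.075),
        ("1964-1975", 0.065, 0.098),
        ("1976-1987", 0.078, 0.103),
    ],
    "FL": [
        ("1914-1949", 0.085, 0.114),
        ("1950-1963", 0.081, 0.154),
        ("1964-1975", 0.141, 0.242),
        ("1976-1987", 0.208, 0.306),
    ],
}


def log_odds(p):
    """Logit transform: the scale on which logistic regression is linear."""
    return math.log(p / (1.0 - p))


def odds_ratio(p_a, p_b):
    """Odds of 'um' for group a divided by the odds of 'um' for group b."""
    return (p_a / (1.0 - p_a)) / (p_b / (1.0 - p_b))


def summarize(table):
    """Return one descriptive row per country and birth cohort."""
    rows = []
    for country, cohorts in table.items():
        for cohort, male, female in cohorts:
            rows.append({
                "country": country,
                "cohort": cohort,
                # 'uh' is the complement of 'um'.
                "uh_male": 1.0 - male,
                "uh_female": 1.0 - female,
                # Descriptive only: not a fitted model coefficient.
                "or_female_vs_male": odds_ratio(female, male),
                "logit_diff": log_odds(female) - log_odds(male),
            })
    return rows


if __name__ == "__main__":
    for row in summarize(UM_FREQ):
        print(
            f"{row['country']} {row['cohort']}: "
            f"uh(male)={row['uh_male']:.3f} "
            f"uh(female)={row['uh_female']:.3f} "
            f"OR(female vs male)={row['or_female_vs_male']:.2f}"
        )
```

An odds ratio above 1 means that women in that cohort use ‘um’ relatively more often than men; the difference in log-odds (`logit_diff`) is simply the logarithm of that ratio.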
Details of the data and analysis code can be found here.
Above is a guest post by Martijn Wieling.