UH/UM in Dutch

Martijn Wieling and Gosse Bouma (University of Groningen, the Netherlands)

Data and scripts for the Language Log guest post of Martijn Wieling

To run the complete analysis yourself, please refer to the bottom of this page.

The following lines load the required library, download the required files, and load the data.

# load required packages
library(lme4)
library(ggplot2)

# version information
R.version.string

## [1] "R version 3.1.1 (2014-07-10)"

packageVersion('lme4')

## [1] '1.1.7'

packageVersion('ggplot2')

## [1] '1.0.0'

# download required files and scripts
download.file('http://www.let.rug.nl/wieling/ll/multiplot.R', 'multiplot.R')
download.file('http://www.let.rug.nl/wieling/ll/CGN-UH-UM.txt', 'CGN-UH-UM.txt')

# load custom plotting function
source('multiplot.R') # custom plotting function

# read data
spk = read.table('CGN-UH-UM.txt',sep='\t',header=T)

Results when ignoring country information

The following code generates the table of the relative frequency of 'um'.

spk$RelFreqUM = spk$FreqUM / (spk$FreqUM + spk$FreqUH)
dat = aggregate(spk$RelFreqUM,by=list(spk$AgeGroup,spk$Gender),FUN=mean)
colnames(dat) = c('BirthYear','Gender','RelFreqUM')
dat

##   BirthYear Gender RelFreqUM
## 1 1914-1949 Female   0.09548
## 2 1950-1963 Female   0.11165
## 3 1964-1975 Female   0.16157
## 4 1976-1987 Female   0.18166
## 5 1914-1949   Male   0.05899
## 6 1950-1963   Male   0.07052
## 7 1964-1975   Male   0.09842
## 8 1976-1987   Male   0.13236

To generate the graph, we first have to calculate the 95% confidence intervals around the average values (per age group) we want to visualize. The 95% confidence interval lies within 1.96 times the standard error around the average of each group. The standard error is calculated by dividing the standard deviation by the square root of the number of observations in the group. The calculation thus can be done as follows:

# calculation of standard deviation
tmp = aggregate(spk$RelFreqUM,by=list(spk$AgeGroup,spk$Gender),FUN=sd)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.sd')
dat = merge(dat,tmp,by=c('BirthYear','Gender'))

# calculation of the number of observations per group
spk$ones = 1
tmp = aggregate(spk$ones,by=list(spk$AgeGroup,spk$Gender),FUN=sum)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.N')
dat = merge(dat,tmp,by=c('BirthYear','Gender'))

# storing the 95% confidence bands (1.96 standard deviations above and below the mean)
dat$RelFreqUM.lower = dat$RelFreqUM - (1.96 * (dat$RelFreqUM.sd / sqrt(dat$RelFreqUM.N)))
dat$RelFreqUM.upper = dat$RelFreqUM + (1.96 * (dat$RelFreqUM.sd / sqrt(dat$RelFreqUM.N)))

The following command visualizes the graph including the confidence bands:

ggplot(data = dat, aes(x = BirthYear, y = RelFreqUM, colour = Gender)) + 
geom_line(aes(group = Gender)) + theme_bw() + xlab('Year of birth') + ylab(" ") + 
geom_errorbar(aes(ymin=RelFreqUM.lower, ymax=RelFreqUM.upper, width=.1)) +
ggtitle("Relative frequency of 'um' (1658 speakers)")

plot of chunk graph

Clearly the plot shows that women and younger speakers show a greater relative frequency of the use of 'um'. This is confirmed by the following mixed-effects logistic regression model.

spk$BirthYear.z = scale(spk$BirthYear) # z-transformation is necessary to prevent a warning
model1 = glmer(cbind(FreqUM,FreqUH) ~ BirthYear.z + Gender + (1|Speaker),family='binomial',data=spk) 
summary(model1)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(FreqUM, FreqUH) ~ BirthYear.z + Gender + (1 | Speaker)
##    Data: spk
## 
##      AIC      BIC   logLik deviance df.resid 
##     9330     9352    -4661     9322     1654 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.5370 -0.4122 -0.0116  0.2018  2.8355 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  Speaker (Intercept) 1.19     1.09    
## Number of obs: 1658, groups:  Speaker, 1658
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.2785     0.0440   -51.7   <2e-16 ***
## BirthYear.z   0.3435     0.0318    10.8   <2e-16 ***
## GenderMale   -0.5744     0.0635    -9.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) BrthY.
## BirthYear.z -0.107       
## GenderMale  -0.676  0.081

The first line of the fixed effects description (the part which we are focusing on here) shows the intercept. The negative value simply indicates that the relative proporition of 'um' is much lower than that of 'uh' (i.e. a value of 0 would indicate that they are equally likely). The positive estimate for year of birth indicates that younger speakers (having a higher year of birth) are more likely to use 'um' as opposed to 'uh'. Similarly, the negative estimate for the male gender indicates that men are less likely to use 'um' as opposed to 'uh' than women.

The influence of country

The data contains speakers from the Netherlands and Flanders. The following table of relative frequencies shows that country plays an important role, with speakers from Flanders using 'um' relatively more frequent.

dat2 = aggregate(spk$RelFreqUM,by=list(spk$AgeGroup,spk$Gender,spk$Country),FUN=mean)
colnames(dat2) = c('BirthYear','Gender','Country','RelFreqUM')
dat2

##    BirthYear Gender     Country RelFreqUM
## 1  1914-1949 Female    Flanders   0.11353
## 2  1950-1963 Female    Flanders   0.15395
## 3  1964-1975 Female    Flanders   0.24234
## 4  1976-1987 Female    Flanders   0.30646
## 5  1914-1949   Male    Flanders   0.08474
## 6  1950-1963   Male    Flanders   0.08090
## 7  1964-1975   Male    Flanders   0.14068
## 8  1976-1987   Male    Flanders   0.20831
## 9  1914-1949 Female Netherlands   0.08350
## 10 1950-1963 Female Netherlands   0.07481
## 11 1964-1975 Female Netherlands   0.09795
## 12 1976-1987 Female Netherlands   0.10346
## 13 1914-1949   Male Netherlands   0.04704
## 14 1950-1963   Male Netherlands   0.06176
## 15 1964-1975   Male Netherlands   0.06547
## 16 1976-1987   Male Netherlands   0.07822

Similar as before, we can also visualize this (including 95% confidence bands). Before visulizing, we again need to calculate the 95% confidence bands. We first do this for the speakers from the Netherlands.

nl = dat2[dat2$Country == 'Netherlands',]

# standard deviation
tmp = aggregate(spk$RelFreqUM,by=list(spk$AgeGroup,spk$Gender),FUN=sd)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.sd')
nl = merge(nl,tmp,by=c('BirthYear','Gender'))

# number of observations
tmp = aggregate(spk$ones,by=list(spk$AgeGroup,spk$Gender),FUN=sum)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.N')
nl = merge(nl,tmp,by=c('BirthYear','Gender'))

# 95% confidence bands
nl$RelFreqUM.lower = nl$RelFreqUM - (1.96 * (nl$RelFreqUM.sd / sqrt(nl$RelFreqUM.N)))
nl$RelFreqUM.upper = nl$RelFreqUM + (1.96 * (nl$RelFreqUM.sd / sqrt(nl$RelFreqUM.N)))

And subsequently for the speakers from Flanders.

fl = dat2[dat2$Country == 'Flanders',]

# standard deviation
tmp = aggregate(spk$RelFreqUM,by=list(spk$AgeGroup,spk$Gender),FUN=sd)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.sd')
fl = merge(fl,tmp,by=c('BirthYear','Gender'))

# number of observations
tmp = aggregate(spk$ones,by=list(spk$AgeGroup,spk$Gender),FUN=sum)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.N')
fl = merge(fl,tmp,by=c('BirthYear','Gender'))

# 95% confidence bands
fl$RelFreqUM.lower = fl$RelFreqUM - (1.96 * (fl$RelFreqUM.sd / sqrt(fl$RelFreqUM.N)))
fl$RelFreqUM.upper = fl$RelFreqUM + (1.96 * (fl$RelFreqUM.sd / sqrt(fl$RelFreqUM.N)))

Finally, we plot the resulting graphs including 95% confidence bands.

pnl <- ggplot(data = nl, aes(x = BirthYear, y = RelFreqUM, colour = Gender)) + 
       geom_line(aes(group = Gender)) + theme_bw() + xlab('Year of birth') + ylab(" ") + 
       ggtitle("Relative frequency of 'um' (976 NL speakers)") + ylim(0,0.35) +
       geom_errorbar(aes(ymin=RelFreqUM.lower, ymax=RelFreqUM.upper, width=.1)) 

pfl <- ggplot(data = fl, aes(x = BirthYear, y = RelFreqUM, colour = Gender)) + 
       geom_line(aes(group = Gender)) + theme_bw() + xlab('Year of birth') + ylab(" ") + 
       ggtitle("Relative frequency of 'um' (682 FL speakers)")  + ylim(0,0.35) +
       geom_errorbar(aes(ymin=RelFreqUM.lower, ymax=RelFreqUM.upper, width=.1)) 

multiplot(pnl,pfl,cols=2) # show both plots next to each other

plot of chunk graph2

While in both cases women and younger speakers show a greater relative frequency of the use of 'um', this effect is much stronger for the speakers from Flanders. To see that the effects are significant for both groups, we can make two separate mixed-effects logistic regression models. The first model is for the speakers from Flanders:

spkfl = spk[spk$Country == 'Flanders',]
model2a = glmer(cbind(FreqUM,FreqUH) ~ BirthYear.z + Gender + (1|Speaker),family='binomial',data=spkfl) 
summary(model2a)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(FreqUM, FreqUH) ~ BirthYear.z + Gender + (1 | Speaker)
##    Data: spkfl
## 
##      AIC      BIC   logLik deviance df.resid 
##     3774     3792    -1883     3766      678 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.7256 -0.4458 -0.0124  0.2183  2.6830 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  Speaker (Intercept) 1.04     1.02    
## Number of obs: 682, groups:  Speaker, 682
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.6905     0.0656  -25.77  < 2e-16 ***
## BirthYear.z   0.5522     0.0507   10.89  < 2e-16 ***
## GenderMale   -0.7315     0.0937   -7.81  5.9e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) BrthY.
## BirthYear.z -0.130       
## GenderMale  -0.671  0.044

The results are similar as for the full model (above) and thus indicate that younger speakers and women are more likely to use 'um' compared to 'uh'. This pattern is less strong for the speakers from the Netherlands speakers, but the pattern is the same:

spknl = spk[spk$Country == 'Netherlands',]
model2b = glmer(cbind(FreqUM,FreqUH) ~ BirthYear.z + Gender + (1|Speaker),family='binomial',data=spknl) 
summary(model2b)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(FreqUM, FreqUH) ~ BirthYear.z + Gender + (1 | Speaker)
##    Data: spknl
## 
##      AIC      BIC   logLik deviance df.resid 
##     5320     5339    -2656     5312      972 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.3167 -0.4384 -0.0144  0.2260  2.8174 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  Speaker (Intercept) 0.84     0.917   
## Number of obs: 976, groups:  Speaker, 976
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.6284     0.0495   -53.1  < 2e-16 ***
## BirthYear.z   0.1956     0.0344     5.7  1.3e-08 ***
## GenderMale   -0.4618     0.0717    -6.4  1.2e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) BrthY.
## BirthYear.z -0.097       
## GenderMale  -0.671  0.104

Instead of creating two separate models (which is sub-optimal), we can also create a single model where we include country as a moderating factor.

model3 = glmer(cbind(FreqUM,FreqUH) ~ BirthYear.z:Country + Gender:Country +
               Country + (1|Speaker),family='binomial',data=spk)
summary(model3)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(FreqUM, FreqUH) ~ BirthYear.z:Country + Gender:Country +  
##     Country + (1 | Speaker)
##    Data: spk
## 
##      AIC      BIC   logLik deviance df.resid 
##     9096     9134    -4541     9082     1651 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8140 -0.4361 -0.0128  0.2232  2.7841 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  Speaker (Intercept) 0.916    0.957   
## Number of obs: 1658, groups:  Speaker, 1658
## 
## Fixed effects:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -1.6754     0.0618  -27.10  < 2e-16 ***
## CountryNetherlands              -0.9625     0.0797  -12.07  < 2e-16 ***
## BirthYear.z:CountryFlanders      0.5475     0.0481   11.37  < 2e-16 ***
## BirthYear.z:CountryNetherlands   0.1972     0.0356    5.54  3.0e-08 ***
## CountryFlanders:GenderMale      -0.7242     0.0889   -8.14  3.9e-16 ***
## CountryNetherlands:GenderMale   -0.4668     0.0742   -6.29  3.2e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) CntryN BY.:CF BY.:CN CnF:GM
## CntryNthrln -0.768                            
## BrthYr.z:CF -0.126  0.095                     
## BrthYr.z:CN -0.003 -0.058  0.001              
## CntryFln:GM -0.680  0.530  0.046 -0.001       
## CntryNth:GM  0.004 -0.437 -0.002  0.105  0.001

As the estimate for country being equal to the Netherlands is negative, this indicates that speakers from the Netherlands are less likely to use 'um' as opposed to 'uh'. The following four lines of the fixed effects show that the influence of gender and year of birth are in the same direction for both groups, but less strong (i.e. having a lower absolute value) for the speakers from the Netherlands. To see if these differences are significant, we can change the specification of the model slightly:

model4 = glmer(cbind(FreqUM,FreqUH) ~ BirthYear.z*Country + Gender*Country + 
               (1|Speaker),family='binomial',data=spk) 
summary(model4)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: 
## cbind(FreqUM, FreqUH) ~ BirthYear.z * Country + Gender * Country +  
##     (1 | Speaker)
##    Data: spk
## 
##      AIC      BIC   logLik deviance df.resid 
##     9096     9134    -4541     9082     1651 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8140 -0.4361 -0.0128  0.2232  2.7841 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  Speaker (Intercept) 0.916    0.957   
## Number of obs: 1658, groups:  Speaker, 1658
## 
## Fixed effects:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -1.6754     0.0618  -27.10  < 2e-16 ***
## BirthYear.z                      0.5475     0.0481   11.37  < 2e-16 ***
## CountryNetherlands              -0.9625     0.0797  -12.07  < 2e-16 ***
## GenderMale                      -0.7242     0.0889   -8.14  3.9e-16 ***
## BirthYear.z:CountryNetherlands  -0.3503     0.0598   -5.85  4.8e-09 ***
## CountryNetherlands:GenderMale    0.2574     0.1158    2.22    0.026 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) BrthY. CntryN GndrMl BY.:CN
## BirthYear.z -0.126                            
## CntryNthrln -0.768  0.095                     
## GenderMale  -0.680  0.046  0.530              
## BrthYr.z:CN  0.100 -0.804 -0.111 -0.037       
## CntryNth:GM  0.525 -0.036 -0.688 -0.767  0.069

In this case the final two lines of the fixed effects now show that the interaction with country is significant for both predictors. This means that the speakers from the Netherlands show significantly different estimates compared to the speakers from Flanders. Specifically, the estimate for the year of birth (which is about 0.55 for the speakers from Flanders, as Flanders is the reference level for country) has to be reduced by 0.35 (to about 0.2) to get to the estimate of the speakers from the Netherlands. Similarly, the estimate for the contrast between males and females for the speakers from Flanders (i.e. -0.72) has to be increased by 0.26 (to about -0.46) for the speakers from the Netherlands. The effect of these adjustments clearly reflect the estimates for the two groups as shown in model3.

Replication of the analysis

To replicate the analysis presented above, you can just copy the following lines to the most recent version of R. You need the packages 'lme4', 'ggplot2', 'knitr', and 'markdown'. If these are not installed (the library commands will throw an error), you can uncomment (i.e. remove the hashtag) the first four lines to install them.

#install.packages('lme4',repos='http://cran.us.r-project.org')
#install.packages('ggplot2',repos='http://cran.us.r-project.org')
#install.packages('knitr',repos='http://cran.us.r-project.org')
#install.packages('markdown',repos='http://cran.us.r-project.org')
download.file('http://www.let.rug.nl/wieling/ll/analysis.Rmw', 'analysis.Rmw')
library(knitr) 
library(markdown)
knit('analysis.Rmw', 'analysis.md') # generates markdown file
markdownToHTML('analysis.md', 'UH-UM-Dutch.html') # creates html file
browseURL(paste('file://', file.path(getwd(),'UH-UM-Dutch.html'), sep='')) # shows result