Martijn Wieling and Gosse Bouma (University of Groningen, the Netherlands)
Data and scripts for the Language Log guest post of Martijn Wieling
To run the complete analysis yourself, please refer to the bottom of this page.
The following lines load the required library, download the required files, and load the data.
# load required packages
library(lme4)
library(ggplot2)
# version information
R.version.string
## [1] "R version 3.1.1 (2014-07-10)"
packageVersion('lme4')
## [1] '1.1.7'
packageVersion('ggplot2')
## [1] '1.0.0'
# download required files and scripts
download.file('http://www.let.rug.nl/wieling/ll/multiplot.R', 'multiplot.R')
download.file('http://www.let.rug.nl/wieling/ll/CGN-UH-UM.txt', 'CGN-UH-UM.txt')
# load custom plotting function
source('multiplot.R') # custom plotting function
# read data
spk = read.table('CGN-UH-UM.txt',sep='\t',header=T)
Results when ignoring country information
The following code generates the table of the relative frequency of 'um'.
spk$RelFreqUM = spk$FreqUM / (spk$FreqUM + spk$FreqUH)
dat = aggregate(spk$RelFreqUM,by=list(spk$AgeGroup,spk$Gender),FUN=mean)
colnames(dat) = c('BirthYear','Gender','RelFreqUM')
dat
## BirthYear Gender RelFreqUM
## 1 1914-1949 Female 0.09548
## 2 1950-1963 Female 0.11165
## 3 1964-1975 Female 0.16157
## 4 1976-1987 Female 0.18166
## 5 1914-1949 Male 0.05899
## 6 1950-1963 Male 0.07052
## 7 1964-1975 Male 0.09842
## 8 1976-1987 Male 0.13236
To generate the graph, we first have to calculate the 95% confidence intervals around the average values (per age group) we want to visualize. The 95% confidence interval lies within 1.96 times the standard error around the average of each group. The standard error is calculated by dividing the standard deviation by the square root of the number of observations in the group. The calculation thus can be done as follows:
# calculation of standard deviation
tmp = aggregate(spk$RelFreqUM,by=list(spk$AgeGroup,spk$Gender),FUN=sd)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.sd')
dat = merge(dat,tmp,by=c('BirthYear','Gender'))
# calculation of the number of observations per group
spk$ones = 1
tmp = aggregate(spk$ones,by=list(spk$AgeGroup,spk$Gender),FUN=sum)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.N')
dat = merge(dat,tmp,by=c('BirthYear','Gender'))
# storing the 95% confidence bands (1.96 standard deviations above and below the mean)
dat$RelFreqUM.lower = dat$RelFreqUM - (1.96 * (dat$RelFreqUM.sd / sqrt(dat$RelFreqUM.N)))
dat$RelFreqUM.upper = dat$RelFreqUM + (1.96 * (dat$RelFreqUM.sd / sqrt(dat$RelFreqUM.N)))
The following command visualizes the graph including the confidence bands:
ggplot(data = dat, aes(x = BirthYear, y = RelFreqUM, colour = Gender)) +
geom_line(aes(group = Gender)) + theme_bw() + xlab('Year of birth') + ylab(" ") +
geom_errorbar(aes(ymin=RelFreqUM.lower, ymax=RelFreqUM.upper, width=.1)) +
ggtitle("Relative frequency of 'um' (1658 speakers)")
Clearly the plot shows that women and younger speakers show a greater relative frequency of the use of 'um'. This is confirmed by the following mixed-effects logistic regression model.
spk$BirthYear.z = scale(spk$BirthYear) # z-transformation is necessary to prevent a warning
model1 = glmer(cbind(FreqUM,FreqUH) ~ BirthYear.z + Gender + (1|Speaker),family='binomial',data=spk)
summary(model1)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: cbind(FreqUM, FreqUH) ~ BirthYear.z + Gender + (1 | Speaker)
## Data: spk
##
## AIC BIC logLik deviance df.resid
## 9330 9352 -4661 9322 1654
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.5370 -0.4122 -0.0116 0.2018 2.8355
##
## Random effects:
## Groups Name Variance Std.Dev.
## Speaker (Intercept) 1.19 1.09
## Number of obs: 1658, groups: Speaker, 1658
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2785 0.0440 -51.7 <2e-16 ***
## BirthYear.z 0.3435 0.0318 10.8 <2e-16 ***
## GenderMale -0.5744 0.0635 -9.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) BrthY.
## BirthYear.z -0.107
## GenderMale -0.676 0.081
The first line of the fixed effects description (the part which we are focusing on here) shows the intercept. The negative value simply indicates that the relative proporition of 'um' is much lower than that of 'uh' (i.e. a value of 0 would indicate that they are equally likely). The positive estimate for year of birth indicates that younger speakers (having a higher year of birth) are more likely to use 'um' as opposed to 'uh'. Similarly, the negative estimate for the male gender indicates that men are less likely to use 'um' as opposed to 'uh' than women.
The influence of country
The data contains speakers from the Netherlands and Flanders. The following table of relative frequencies shows that country plays an important role, with speakers from Flanders using 'um' relatively more frequent.
dat2 = aggregate(spk$RelFreqUM,by=list(spk$AgeGroup,spk$Gender,spk$Country),FUN=mean)
colnames(dat2) = c('BirthYear','Gender','Country','RelFreqUM')
dat2
## BirthYear Gender Country RelFreqUM
## 1 1914-1949 Female Flanders 0.11353
## 2 1950-1963 Female Flanders 0.15395
## 3 1964-1975 Female Flanders 0.24234
## 4 1976-1987 Female Flanders 0.30646
## 5 1914-1949 Male Flanders 0.08474
## 6 1950-1963 Male Flanders 0.08090
## 7 1964-1975 Male Flanders 0.14068
## 8 1976-1987 Male Flanders 0.20831
## 9 1914-1949 Female Netherlands 0.08350
## 10 1950-1963 Female Netherlands 0.07481
## 11 1964-1975 Female Netherlands 0.09795
## 12 1976-1987 Female Netherlands 0.10346
## 13 1914-1949 Male Netherlands 0.04704
## 14 1950-1963 Male Netherlands 0.06176
## 15 1964-1975 Male Netherlands 0.06547
## 16 1976-1987 Male Netherlands 0.07822
Similar as before, we can also visualize this (including 95% confidence bands). Before visulizing, we again need to calculate the 95% confidence bands. We first do this for the speakers from the Netherlands.
nl = dat2[dat2$Country == 'Netherlands',]
# standard deviation
tmp = aggregate(spk$RelFreqUM,by=list(spk$AgeGroup,spk$Gender),FUN=sd)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.sd')
nl = merge(nl,tmp,by=c('BirthYear','Gender'))
# number of observations
tmp = aggregate(spk$ones,by=list(spk$AgeGroup,spk$Gender),FUN=sum)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.N')
nl = merge(nl,tmp,by=c('BirthYear','Gender'))
# 95% confidence bands
nl$RelFreqUM.lower = nl$RelFreqUM - (1.96 * (nl$RelFreqUM.sd / sqrt(nl$RelFreqUM.N)))
nl$RelFreqUM.upper = nl$RelFreqUM + (1.96 * (nl$RelFreqUM.sd / sqrt(nl$RelFreqUM.N)))
And subsequently for the speakers from Flanders.
fl = dat2[dat2$Country == 'Flanders',]
# standard deviation
tmp = aggregate(spk$RelFreqUM,by=list(spk$AgeGroup,spk$Gender),FUN=sd)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.sd')
fl = merge(fl,tmp,by=c('BirthYear','Gender'))
# number of observations
tmp = aggregate(spk$ones,by=list(spk$AgeGroup,spk$Gender),FUN=sum)
colnames(tmp) = c('BirthYear','Gender','RelFreqUM.N')
fl = merge(fl,tmp,by=c('BirthYear','Gender'))
# 95% confidence bands
fl$RelFreqUM.lower = fl$RelFreqUM - (1.96 * (fl$RelFreqUM.sd / sqrt(fl$RelFreqUM.N)))
fl$RelFreqUM.upper = fl$RelFreqUM + (1.96 * (fl$RelFreqUM.sd / sqrt(fl$RelFreqUM.N)))
Finally, we plot the resulting graphs including 95% confidence bands.
pnl <- ggplot(data = nl, aes(x = BirthYear, y = RelFreqUM, colour = Gender)) +
geom_line(aes(group = Gender)) + theme_bw() + xlab('Year of birth') + ylab(" ") +
ggtitle("Relative frequency of 'um' (976 NL speakers)") + ylim(0,0.35) +
geom_errorbar(aes(ymin=RelFreqUM.lower, ymax=RelFreqUM.upper, width=.1))
pfl <- ggplot(data = fl, aes(x = BirthYear, y = RelFreqUM, colour = Gender)) +
geom_line(aes(group = Gender)) + theme_bw() + xlab('Year of birth') + ylab(" ") +
ggtitle("Relative frequency of 'um' (682 FL speakers)") + ylim(0,0.35) +
geom_errorbar(aes(ymin=RelFreqUM.lower, ymax=RelFreqUM.upper, width=.1))
multiplot(pnl,pfl,cols=2) # show both plots next to each other
While in both cases women and younger speakers show a greater relative frequency of the use of 'um', this effect is much stronger for the speakers from Flanders. To see that the effects are significant for both groups, we can make two separate mixed-effects logistic regression models. The first model is for the speakers from Flanders:
spkfl = spk[spk$Country == 'Flanders',]
model2a = glmer(cbind(FreqUM,FreqUH) ~ BirthYear.z + Gender + (1|Speaker),family='binomial',data=spkfl)
summary(model2a)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: cbind(FreqUM, FreqUH) ~ BirthYear.z + Gender + (1 | Speaker)
## Data: spkfl
##
## AIC BIC logLik deviance df.resid
## 3774 3792 -1883 3766 678
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.7256 -0.4458 -0.0124 0.2183 2.6830
##
## Random effects:
## Groups Name Variance Std.Dev.
## Speaker (Intercept) 1.04 1.02
## Number of obs: 682, groups: Speaker, 682
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6905 0.0656 -25.77 < 2e-16 ***
## BirthYear.z 0.5522 0.0507 10.89 < 2e-16 ***
## GenderMale -0.7315 0.0937 -7.81 5.9e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) BrthY.
## BirthYear.z -0.130
## GenderMale -0.671 0.044
The results are similar as for the full model (above) and thus indicate that younger speakers and women are more likely to use 'um' compared to 'uh'. This pattern is less strong for the speakers from the Netherlands speakers, but the pattern is the same:
spknl = spk[spk$Country == 'Netherlands',]
model2b = glmer(cbind(FreqUM,FreqUH) ~ BirthYear.z + Gender + (1|Speaker),family='binomial',data=spknl)
summary(model2b)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: cbind(FreqUM, FreqUH) ~ BirthYear.z + Gender + (1 | Speaker)
## Data: spknl
##
## AIC BIC logLik deviance df.resid
## 5320 5339 -2656 5312 972
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.3167 -0.4384 -0.0144 0.2260 2.8174
##
## Random effects:
## Groups Name Variance Std.Dev.
## Speaker (Intercept) 0.84 0.917
## Number of obs: 976, groups: Speaker, 976
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6284 0.0495 -53.1 < 2e-16 ***
## BirthYear.z 0.1956 0.0344 5.7 1.3e-08 ***
## GenderMale -0.4618 0.0717 -6.4 1.2e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) BrthY.
## BirthYear.z -0.097
## GenderMale -0.671 0.104
Instead of creating two separate models (which is sub-optimal), we can also create a single model where we include country as a moderating factor.
model3 = glmer(cbind(FreqUM,FreqUH) ~ BirthYear.z:Country + Gender:Country +
Country + (1|Speaker),family='binomial',data=spk)
summary(model3)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: cbind(FreqUM, FreqUH) ~ BirthYear.z:Country + Gender:Country +
## Country + (1 | Speaker)
## Data: spk
##
## AIC BIC logLik deviance df.resid
## 9096 9134 -4541 9082 1651
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8140 -0.4361 -0.0128 0.2232 2.7841
##
## Random effects:
## Groups Name Variance Std.Dev.
## Speaker (Intercept) 0.916 0.957
## Number of obs: 1658, groups: Speaker, 1658
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6754 0.0618 -27.10 < 2e-16 ***
## CountryNetherlands -0.9625 0.0797 -12.07 < 2e-16 ***
## BirthYear.z:CountryFlanders 0.5475 0.0481 11.37 < 2e-16 ***
## BirthYear.z:CountryNetherlands 0.1972 0.0356 5.54 3.0e-08 ***
## CountryFlanders:GenderMale -0.7242 0.0889 -8.14 3.9e-16 ***
## CountryNetherlands:GenderMale -0.4668 0.0742 -6.29 3.2e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) CntryN BY.:CF BY.:CN CnF:GM
## CntryNthrln -0.768
## BrthYr.z:CF -0.126 0.095
## BrthYr.z:CN -0.003 -0.058 0.001
## CntryFln:GM -0.680 0.530 0.046 -0.001
## CntryNth:GM 0.004 -0.437 -0.002 0.105 0.001
As the estimate for country being equal to the Netherlands is negative, this indicates that speakers from the Netherlands are less likely to use 'um' as opposed to 'uh'. The following four lines of the fixed effects show that the influence of gender and year of birth are in the same direction for both groups, but less strong (i.e. having a lower absolute value) for the speakers from the Netherlands. To see if these differences are significant, we can change the specification of the model slightly:
model4 = glmer(cbind(FreqUM,FreqUH) ~ BirthYear.z*Country + Gender*Country +
(1|Speaker),family='binomial',data=spk)
summary(model4)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula:
## cbind(FreqUM, FreqUH) ~ BirthYear.z * Country + Gender * Country +
## (1 | Speaker)
## Data: spk
##
## AIC BIC logLik deviance df.resid
## 9096 9134 -4541 9082 1651
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8140 -0.4361 -0.0128 0.2232 2.7841
##
## Random effects:
## Groups Name Variance Std.Dev.
## Speaker (Intercept) 0.916 0.957
## Number of obs: 1658, groups: Speaker, 1658
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6754 0.0618 -27.10 < 2e-16 ***
## BirthYear.z 0.5475 0.0481 11.37 < 2e-16 ***
## CountryNetherlands -0.9625 0.0797 -12.07 < 2e-16 ***
## GenderMale -0.7242 0.0889 -8.14 3.9e-16 ***
## BirthYear.z:CountryNetherlands -0.3503 0.0598 -5.85 4.8e-09 ***
## CountryNetherlands:GenderMale 0.2574 0.1158 2.22 0.026 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) BrthY. CntryN GndrMl BY.:CN
## BirthYear.z -0.126
## CntryNthrln -0.768 0.095
## GenderMale -0.680 0.046 0.530
## BrthYr.z:CN 0.100 -0.804 -0.111 -0.037
## CntryNth:GM 0.525 -0.036 -0.688 -0.767 0.069
In this case the final two lines of the fixed effects now show that the interaction with country is significant for both predictors. This means that the speakers from the Netherlands show significantly different estimates compared to the speakers from Flanders. Specifically, the estimate for the year of birth (which is about 0.55 for the speakers from Flanders, as Flanders is the reference level for country) has to be reduced by 0.35 (to about 0.2) to get to the estimate of the speakers from the Netherlands. Similarly, the estimate for the contrast between males and females for the speakers from Flanders (i.e. -0.72) has to be increased by 0.26 (to about -0.46) for the speakers from the Netherlands. The effect of these adjustments clearly reflect the estimates for the two groups as shown in model3.
Replication of the analysis
To replicate the analysis presented above, you can just copy the following lines to the most recent version of R. You need the packages 'lme4', 'ggplot2', 'knitr', and 'markdown'. If these are not installed (the library commands will throw an error), you can uncomment (i.e. remove the hashtag) the first four lines to install them.
#install.packages('lme4',repos='http://cran.us.r-project.org')
#install.packages('ggplot2',repos='http://cran.us.r-project.org')
#install.packages('knitr',repos='http://cran.us.r-project.org')
#install.packages('markdown',repos='http://cran.us.r-project.org')
download.file('http://www.let.rug.nl/wieling/ll/analysis.Rmw', 'analysis.Rmw')
library(knitr)
library(markdown)
knit('analysis.Rmw', 'analysis.md') # generates markdown file
markdownToHTML('analysis.md', 'UH-UM-Dutch.html') # creates html file
browseURL(paste('file://', file.path(getwd(),'UH-UM-Dutch.html'), sep='')) # shows result