Style or artefact or both?
In "Correlated lexicometrical decay", I commented on some unexpectedly strong correlations over time of the ratios of word and phrase frequencies in the Google Books English 1gram dataset:
I'm sure that these patterns mean something. But it seems a little weird that OF as a proportion of all prepositions should correlate r=0.953 with the proportion of instances of OF immediately followed by THE, and it seems weirder that OF as a proportion of all prepositions should correlate r=0.913 with the proportion of adjective-noun sequences immediately preceded by THE.
So let's hope that what these patterns mean is that the secular decay of THE has somehow seeped into some but not all of the other counts, or that some other hidden cause is governing all of the correlated decays. The alternative hypothesis is that there's a problem with the way the underlying data was collected and processed, which would be annoying.
And in a comment on a comment, I noted that the corresponding data from the Corpus of Historical American English, which is a balanced corpus collected from sources largely or entirely distinct from the Google Books dataset, shows similar unexpected correlations.
So today I'd like to point out that much simpler data — frequencies of a few of the commonest words — shows some equally strong correlations over time in these same datasets.
Here's the correlation matrix derived from the frequencies of 1grams from 1900 to 2000 in the Google 2012 English 1gram dataset, lower case only:
        the      of     and      to       a      in    that      be      he     you
the   1.000   0.983   0.791   0.878   0.767   0.740   0.769   0.947   0.919  -0.386
of    0.983   1.000   0.716   0.808   0.686   0.830   0.695   0.948   0.876  -0.517
and   0.791   0.716   1.000   0.915   0.686   0.230   0.896   0.674   0.876   0.154
to    0.878   0.808   0.915   1.000   0.849   0.382   0.962   0.749   0.960   0.049
a     0.767   0.686   0.686   0.849   1.000   0.369   0.816   0.672   0.843   0.077
in    0.740   0.830   0.230   0.382   0.369   1.000   0.239   0.786   0.516  -0.878
that  0.769   0.695   0.896   0.962   0.816   0.239   1.000   0.610   0.908   0.199
be    0.947   0.948   0.674   0.749   0.672   0.786   0.610   1.000   0.791  -0.501
he    0.919   0.876   0.876   0.960   0.843   0.516   0.908   0.791   1.000  -0.101
you  -0.386  -0.517   0.154   0.049   0.077  -0.878   0.199  -0.501  -0.101   1.000
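(A matrix like this is just the pairwise Pearson correlation of the per-year frequency series, one row and column per word. Here's a minimal Python sketch of the arithmetic — the trajectories below are invented for illustration, not the actual Google counts:)

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1900, 2001)

# Hypothetical smooth frequency trajectories: a linear trend plus noise.
freqs = {
    "the": 0.05 - 1e-5 * (years - 1900) + rng.normal(0, 1e-4, years.size),
    "of":  0.03 - 8e-6 * (years - 1900) + rng.normal(0, 1e-4, years.size),
    "you": 0.004 + 5e-6 * (years - 1900) + rng.normal(0, 1e-4, years.size),
}

words = list(freqs)
X = np.array([freqs[w] for w in words])  # shape (n_words, n_years)
R = np.corrcoef(X)                       # word-by-word correlation matrix

for i, w in enumerate(words):
    print(f"{w:5s}", " ".join(f"{r:6.3f}" for r in R[i]))
```

With real data you would substitute each word's yearly proportional frequency for the made-up trajectories.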
Here's the case insensitive version of the same thing:
        the      of     and      to       a      in    that      be      he     you
the   1.000   0.988   0.777   0.858   0.527   0.617   0.736   0.954   0.891  -0.478
of    0.988   1.000   0.733   0.812   0.450   0.687   0.686   0.946   0.861  -0.568
and   0.777   0.733   1.000   0.926   0.459   0.033   0.907   0.687   0.883   0.079
to    0.858   0.812   0.926   1.000   0.634   0.180   0.961   0.754   0.959  -0.017
a     0.527   0.450   0.459   0.634   1.000   0.119   0.645   0.478   0.659   0.142
in    0.617   0.687   0.033   0.180   0.119   1.000   0.018   0.657   0.302  -0.948
that  0.736   0.686   0.907   0.961   0.645   0.018   1.000   0.610   0.911   0.161
be    0.954   0.946   0.687   0.754   0.478   0.657   0.610   1.000   0.778  -0.542
he    0.891   0.861   0.883   0.959   0.659   0.302   0.911   0.778   1.000  -0.133
you  -0.478  -0.568   0.079  -0.017   0.142  -0.948   0.161  -0.542  -0.133   1.000
Why, for example, should "in" and "you" correlate r=-0.878 in the lowercase counts, or r=-0.948 in the case-insensitive counts? When we look at a plot of the proportional frequencies, it looks like "in" rose and "you" fell through 1966, at which point both trends steeply reversed:
So maybe the mix of books published, or anyhow the mix of books in the Google dataset, anticipated the Summer of Love by a year? Or maybe it took a couple of years for the Merry Pranksters' effects to start getting into print?
Analogous data from COHA, taken by decade from 1900 to 2000, also shows many large positive and negative correlations, but not always in the same places:
        the      of     and      to       a      in    that      be      he     you
the   1.000   0.987   0.837   0.933  -0.973   0.900   0.909   0.982   0.601  -0.004
of    0.987   1.000   0.870   0.942  -0.968   0.855   0.933   0.958   0.496   0.036
and   0.837   0.870   1.000   0.822  -0.893   0.529   0.776   0.782   0.181   0.269
to    0.933   0.942   0.822   1.000  -0.903   0.811   0.968   0.935   0.523   0.273
a    -0.973  -0.968  -0.893  -0.903   1.000  -0.821  -0.867  -0.935  -0.468  -0.019
in    0.900   0.855   0.529   0.811  -0.821   1.000   0.793   0.926   0.804  -0.171
that  0.909   0.933   0.776   0.968  -0.867   0.793   1.000   0.891   0.449   0.191
be    0.982   0.958   0.782   0.935  -0.935   0.926   0.891   1.000   0.708   0.014
he    0.601   0.496   0.181   0.523  -0.468   0.804   0.449   0.708   1.000  -0.191
you  -0.004   0.036   0.269   0.273  -0.019  -0.171   0.191   0.014  -0.191   1.000
For example,
            Google (nocase)   Google (locase)     COHA
in ~ you             -0.948            -0.878   -0.171
be ~ to               0.754             0.749    0.935
the ~ a               0.527             0.767   -0.973
that ~ to             0.961             0.962    0.968
A plausible hypothesis, in my opinion, is that these correlations give us a glimpse of some stylistic dimensions that play out in the distribution of common words, filtered through shifts over time in writing styles, in cultural norms, and in the distribution of sources in the Google Books and COHA collections.
And as Bill Labov and Jamie Pennebaker and Doug Biber and others have taught us, such "stylistic dimensions" correlate with personality and attitude and genre and register and so forth — but they do so in ways that drift over time…
Update — As D.O. points out in the comments,
The reasons for the high degree of correlations mentioned in the OP are very reasonable, but we should calibrate our "surprise factor" first. It is quite clear that the word counts are highly autocorrelated, which is simply to say that their change is primarily due to the long-term trends. It means we should compare the results to a typical distribution of correlation coefficients for highly autocorrelated signals.
Another way to put this is that "random" but smooth sequences have a relatively small number of degrees of freedom simply due to their smoothness, and as a result will tend to show correlations with relatively high absolute values.
This is related to the point I made in "Cultural diffusion and the Whorfian hypothesis" (2/12/2012), which has to do with the distribution of correlations among spatially smooth but formally independent 2-D patterns. So I don't have any excuse for failing to see the point.
I still suspect that some stylistic or cultural (or sampling-method) trends will emerge from appropriate dimensionality reduction on word-frequency time-series data (especially for words where the emergent dimensions are not main topic-related) — but D.O. is right, the degree of cross-correlation in the basic time-series data shouldn't surprise us.
Nathan Myers said,
January 12, 2016 @ 12:00 pm
Back in the days of Orkut (an early Google facebook-alike), I started two "groups": "I hate Java" and "I resent LISP", among others. Over the course of many months, membership count in one exactly tracked the other, but with a constant offset. Many more people went on record hating Java than resenting LISP, but it was a constant number. Over time, membership in both groups grew many times without breaking the pattern.
All I have been able to conclude from it is that spurious astonishingly high correlations are so common they are hardly worth wondering about unless you have a testable model in mind.
(People were invited to resent LISP, BTW, for failing to offer them a career path.)
[(myl) But these examples were not selected from a larger set — rather, these are all the correlations among some of the commonest words in the collection. (And the rest of the top of the list are more of the same — I skipped down the list a bit just to get a couple of pronouns and a couple of forms of "to be".) When just a few of a very large set of numbers are highly correlated, that might well be an accident. But when nearly all of a very large set of numbers are highly inter-correlated, it's generally an indication of some kind of latent structure. (Which might be an artefact of the data collection or data processing methods, but is not likely to be sampling error.)
In this case, it's clear that there are only a few real degrees of freedom in the patterns of word frequency variation, whether across documents or across time. Some of those dimensions are topic-related, but for the commonest words, they seem likely to be mainly stylistic.]
leoboiko said,
January 12, 2016 @ 2:25 pm
@Myers: Oh, you're the creator of "I Resent LISP"? I was a subscriber (and I resented Lisp for losing to Python, so that I couldn't use it for lack of libraries etc.) (sorry for off-topic comment).
D.O. said,
January 12, 2016 @ 10:41 pm
The reasons for the high degree of correlations mentioned in the OP are very reasonable, but we should calibrate our "surprise factor" first. It is quite clear that the word counts are highly autocorrelated, which is simply to say that their change is primarily due to the long-term trends. It means we should compare the results to a typical distribution of correlation coefficients for highly autocorrelated signals.
A scientific way to do it (I guess) would be to
1) pull out a large number of word count time series,
2) calculate their autocorrelation functions,
3) generate random time series with the same range of autocorrelation coefficients (but without any correlations among themselves!) and
4) observe the distribution of their correlation coefficients.
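(A rough sketch of this recipe in Python, with step 3 simplified: instead of matching each word's measured autocorrelation function, generate strongly autocorrelated AR(1) series that are independent of one another, and see how often their pairwise correlations come out large anyway. The phi=0.95 persistence parameter is an arbitrary choice for illustration:)

```python
import numpy as np

rng = np.random.default_rng(1)

def ar1_series(n, phi, rng):
    """One AR(1) series: x[t] = phi * x[t-1] + white noise."""
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

n_years, phi, n_pairs = 101, 0.95, 2000
big = 0
for _ in range(n_pairs):
    # Two series generated independently -- any correlation is spurious.
    a = ar1_series(n_years, phi, rng)
    b = ar1_series(n_years, phi, rng)
    if abs(np.corrcoef(a, b)[0, 1]) > 0.8:
        big += 1

print(f"fraction of independent pairs with |r| > 0.8: {big / n_pairs:.3f}")
```

Even though every pair is independent by construction, the smoothness alone produces a non-trivial fraction of large-magnitude correlations.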
I am not doing it myself, but instead tried a much easier example. I took Fourier series with amplitudes 1/n^2 + 0.1/n (for the n-th harmonic) and random phases, and got about 40% of correlation coefficients greater than 0.8 in absolute value.
Now, I am sure that's only part of the story. 1/n^2 for the first 10 harmonics may be a bit harsh, but some kind of discount for the autocorrelation ought to be made.
D.O. said,
January 12, 2016 @ 10:48 pm
Just in case anyone cares, my Matlab code:
Npoints = 100;
Ntrials = 100000;
crs = zeros(1, Ntrials);              % preallocate
for ii = 1:Ntrials
    y1 = rsc(Npoints);
    y2 = rsc(Npoints);
    a = corrcoef(y1, y2);
    crs(ii) = a(1,2);
end
disp(nnz(abs(crs) > 0.8) / Ntrials)   % fraction with |r| > 0.8
rsc function:
function signal = rsc(Npoints)
% Random smooth signal: Fourier series with amplitudes 1/n^2 + 0.1/n
% and uniformly random phases, evaluated on [-pi/2, pi/2].
x = linspace(-pi/2, pi/2, Npoints);
Nharmonics = floor(Npoints/2);
amp = 1./(1:Nharmonics).^2 + 0.1./(1:Nharmonics);
phase = 2*pi*rand(Nharmonics, 1);
signal = amp * sin((1:Nharmonics)' * x + phase * ones(1, Npoints));
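(For readers without Matlab, here is a rough NumPy translation of the same experiment; the ~40% figure is D.O.'s, and the exact number will vary with the random seed:)

```python
import numpy as np

rng = np.random.default_rng(2)

def rsc(npoints, rng):
    """Random smooth signal: Fourier series with amplitudes 1/n^2 + 0.1/n
    and uniformly random phases, evaluated on [-pi/2, pi/2]."""
    x = np.linspace(-np.pi / 2, np.pi / 2, npoints)
    n = np.arange(1, npoints // 2 + 1)        # harmonic numbers
    amp = 1.0 / n**2 + 0.1 / n
    phase = 2 * np.pi * rng.random(n.size)
    # Sum over harmonics of amp[k] * sin(n[k]*x + phase[k]).
    return amp @ np.sin(np.outer(n, x) + phase[:, None])

npoints, trials = 100, 5000
crs = np.empty(trials)
for i in range(trials):
    y1, y2 = rsc(npoints, rng), rsc(npoints, rng)
    crs[i] = np.corrcoef(y1, y2)[0, 1]

print(f"P(|r| > 0.8) = {np.mean(np.abs(crs) > 0.8):.2f}")
```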
D.O. said,
January 13, 2016 @ 12:18 am
I actually downloaded the 1900-2000 time series for the 10 most frequent words from Google ngrams and calculated their autocorrelation coefficients. As it happens, the first ~10 of their harmonics (simply |fft(n)|^2) fall off as ~1/n^2, which means that my simple guessed model was right on the money.
If so it might make sense to look at the deciles:
p(|correlation|<0.1)=5.5%, p(0.1<|correlation|<0.2)=5.7%, p(0.2..0.3)=5.9%, p(0.3..0.4)=6.2%, p(0.4..0.5)=6.8%, p(0.5..0.6)=8.0%, p(0.6..0.7)=9.4%, p(0.7..0.8) = 12.3%, p(0.8..0.9)=18.4%, p(0.9..1.0)=21.8%.
[(myl) Excellent point — I should have thought of that. It's actually related to the point made in "Cultural diffusion and the Whorfian hypothesis" (2/12/2012), which has to do with the distribution of correlations among spatially smooth but formally independent 2-D patterns. So I don't have any excuse for failing to see the point.
I still suspect that some stylistic or cultural (or sampling-method) trends will emerge from dimensionality reduction — but you're right, the degree of cross-correlation shouldn't surprise us.]
Yuval said,
January 13, 2016 @ 2:42 am
Could it be, then, that the sources of books Google used changed in certain years? Or some other prosaic reason: some publisher switched to a different font with different OCR issues; a few publishers merged (or split) and thus their editing standards coalesced (or diverged); an eminent journal is only available starting in a particular year; etc.
Graeme said,
January 13, 2016 @ 3:40 am
Somewhat off topic, but has anyone tested the fate of 'have', in its frequency in spoken or even blogged English? A long-standing impression of my old dad's was that 'got' as a broad indicator of possession was accelerating its supplanting of 'have'. And more recently there is the misuse of 'of' instead of 'have' after 'could/should/would'.
Matt said,
January 13, 2016 @ 7:35 am
& how about the near-disappearance of "that" as a phrase connector?
e.g. "Pentagon sources later said [that] they had entered Iranian waters after facing technical difficulties." Here it's only marginally useful, but helps with the parsing of these sorts of constructions.
As a specifier it's still doing nicely. "How much is that doggie in the window?"
[(myl) "Near disappearance" is an exaggeration:
The construction with "that" actually increased in relative frequency through the 1960s, and seems to have leveled off at 30-40% overall.]
Jon said,
January 13, 2016 @ 8:46 am
At least among older books, there are many examples of the same book being scanned several times, and also updated editions in the case of non-fiction. This is sometimes caused by Google scanning the whole contents of different libraries, but I have even seen two identical (apart from handling marks) scanned books from the same library. Could this contribute to the correlations seen?