Style or artefact or both?
In "Correlated lexicometrical decay", I commented on some unexpectedly strong correlations over time of the ratios of word and phrase frequencies in the Google Books English 1gram dataset:
I'm sure that these patterns mean something. But it seems a little weird that OF as a proportion of all prepositions should correlate r=0.953 with the proportion of instances of OF immediately followed by THE, and it seems weirder that OF as a proportion of all prepositions should correlate r=0.913 with the proportion of adjective-noun sequences immediately preceded by THE.
So let's hope that what these patterns mean is that the secular decay of THE has somehow seeped into some but not all of the other counts, or that some other hidden cause is governing all of the correlated decays. The alternative hypothesis is that there's a problem with the way the underlying data was collected and processed, which would be annoying.
And in a comment on a comment, I noted that the corresponding data from the Corpus of Historical American English, which is a balanced corpus collected from sources largely or entirely distinct from the Google Books dataset, shows similar unexpected correlations.
So today I'd like to point out that much simpler data — frequencies of a few of the commonest words — shows some equally strong correlations over time in these same datasets.
Here's the correlation matrix derived from the frequencies of 1grams from 1900 to 2000 in the Google 2012 English 1gram dataset, lower case only:
        the      of     and      to       a      in    that      be      he     you
the   1.000   0.983   0.791   0.878   0.767   0.740   0.769   0.947   0.919  -0.386
of    0.983   1.000   0.716   0.808   0.686   0.830   0.695   0.948   0.876  -0.517
and   0.791   0.716   1.000   0.915   0.686   0.230   0.896   0.674   0.876   0.154
to    0.878   0.808   0.915   1.000   0.849   0.382   0.962   0.749   0.960   0.049
a     0.767   0.686   0.686   0.849   1.000   0.369   0.816   0.672   0.843   0.077
in    0.740   0.830   0.230   0.382   0.369   1.000   0.239   0.786   0.516  -0.878
that  0.769   0.695   0.896   0.962   0.816   0.239   1.000   0.610   0.908   0.199
be    0.947   0.948   0.674   0.749   0.672   0.786   0.610   1.000   0.791  -0.501
he    0.919   0.876   0.876   0.960   0.843   0.516   0.908   0.791   1.000  -0.101
you  -0.386  -0.517   0.154   0.049   0.077  -0.878   0.199  -0.501  -0.101   1.000
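(A matrix like this is just the pairwise Pearson correlation of the per-year frequency series, one row and column per word. Here's a minimal Python sketch of the arithmetic — the trajectories below are invented for illustration, not the actual Google counts:)

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1900, 2001)

# Hypothetical smooth frequency trajectories: a linear trend plus noise.
freqs = {
    "the": 0.05 - 1e-5 * (years - 1900) + rng.normal(0, 1e-4, years.size),
    "of":  0.03 - 8e-6 * (years - 1900) + rng.normal(0, 1e-4, years.size),
    "you": 0.004 + 5e-6 * (years - 1900) + rng.normal(0, 1e-4, years.size),
}

words = list(freqs)
X = np.array([freqs[w] for w in words])  # shape (n_words, n_years)
R = np.corrcoef(X)                       # word-by-word correlation matrix

for i, w in enumerate(words):
    print(f"{w:5s}", " ".join(f"{r:6.3f}" for r in R[i]))
```

With real data you would substitute each word's yearly proportional frequency for the made-up trajectories.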
Here's the case insensitive version of the same thing:
        the      of     and      to       a      in    that      be      he     you
the   1.000   0.988   0.777   0.858   0.527   0.617   0.736   0.954   0.891  -0.478
of    0.988   1.000   0.733   0.812   0.450   0.687   0.686   0.946   0.861  -0.568
and   0.777   0.733   1.000   0.926   0.459   0.033   0.907   0.687   0.883   0.079
to    0.858   0.812   0.926   1.000   0.634   0.180   0.961   0.754   0.959  -0.017
a     0.527   0.450   0.459   0.634   1.000   0.119   0.645   0.478   0.659   0.142
in    0.617   0.687   0.033   0.180   0.119   1.000   0.018   0.657   0.302  -0.948
that  0.736   0.686   0.907   0.961   0.645   0.018   1.000   0.610   0.911   0.161
be    0.954   0.946   0.687   0.754   0.478   0.657   0.610   1.000   0.778  -0.542
he    0.891   0.861   0.883   0.959   0.659   0.302   0.911   0.778   1.000  -0.133
you  -0.478  -0.568   0.079  -0.017   0.142  -0.948   0.161  -0.542  -0.133   1.000
Why, for example, should "in" and "you" correlate r=-0.878 in the lowercase counts, or r=-0.948 in the case-insensitive counts? When we look at a plot of the proportional frequencies, it looks like "in" rose and "you" fell through 1966, at which point both trends steeply reversed:
So maybe the mix of books published, or anyhow the mix of books in the Google dataset, anticipated the Summer of Love by a year? Or maybe it took a couple of years for the Merry Pranksters' effects to start getting into print?
Analogous data from COHA, taken by decade from 1900 to 2000, also shows many large positive and negative correlations, but not always in the same places:
        the      of     and      to       a      in    that      be      he     you
the   1.000   0.987   0.837   0.933  -0.973   0.900   0.909   0.982   0.601  -0.004
of    0.987   1.000   0.870   0.942  -0.968   0.855   0.933   0.958   0.496   0.036
and   0.837   0.870   1.000   0.822  -0.893   0.529   0.776   0.782   0.181   0.269
to    0.933   0.942   0.822   1.000  -0.903   0.811   0.968   0.935   0.523   0.273
a    -0.973  -0.968  -0.893  -0.903   1.000  -0.821  -0.867  -0.935  -0.468  -0.019
in    0.900   0.855   0.529   0.811  -0.821   1.000   0.793   0.926   0.804  -0.171
that  0.909   0.933   0.776   0.968  -0.867   0.793   1.000   0.891   0.449   0.191
be    0.982   0.958   0.782   0.935  -0.935   0.926   0.891   1.000   0.708   0.014
he    0.601   0.496   0.181   0.523  -0.468   0.804   0.449   0.708   1.000  -0.191
you  -0.004   0.036   0.269   0.273  -0.019  -0.171   0.191   0.014  -0.191   1.000
For example,
            Google (nocase)   Google (locase)     COHA
in ~ you             -0.948            -0.878   -0.171
be ~ to               0.754             0.749    0.935
the ~ a               0.527             0.767   -0.973
that ~ to             0.961             0.962    0.968
A plausible hypothesis, in my opinion, is that these correlations give us a glimpse of some stylistic dimensions that play out in the distribution of common words, filtered through shifts over time in writing styles, in cultural norms, and in the distribution of sources in the Google Books and COHA collections.
And as Bill Labov and Jamie Pennebaker and Doug Biber and others have taught us, such "stylistic dimensions" correlate with personality and attitude and genre and register and so forth — but they do so in ways that drift over time…
Update — As D.O. points out in the comments,
The reasons for the high degree of correlations mentioned in the OP are very reasonable, but we should calibrate our "surprise factor" first. It is quite clear that the word counts are highly autocorrelated, which is simply to say that their change is primarily due to the long-term trends. It means we should compare the results to a typical distribution of correlation coefficients for highly autocorrelated signals.
Another way to put this is that "random" but smooth sequences have a relatively small number of degrees of freedom simply due to their smoothness, and as a result will tend to show correlations with relatively high absolute values.
This is related to the point I made in "Cultural diffusion and the Whorfian hypothesis" (2/12/2012), which has to do with the distribution of correlations among spatially smooth but formally independent 2-D patterns. So I don't have any excuse for failing to see the point.
I still suspect that some stylistic or cultural (or sampling-method) trends will emerge from appropriate dimensionality reduction on word-frequency time-series data (especially for words where the emergent dimensions are not main topic-related) — but D.O. is right, the degree of cross-correlation in the basic time-series data shouldn't surprise us.
Nathan Myers said,
January 12, 2016 @ 12:00 pm
Back in the days of Orkut (an early Google facebook-alike), I started two "groups": "I hate Java" and "I resent LISP", among others. Over the course of many months, membership count in one exactly tracked the other, but with a constant offset. Many more people went on record hating Java than resenting LISP, but it was a constant number. Over time, membership in both groups grew many times without breaking the pattern.
All I have been able to conclude from it is that spurious astonishingly high correlations are so common they are hardly worth wondering about unless you have a testable model in mind.
(People were invited to resent LISP, BTW, for failing to offer them a career path.)
[(myl) But these examples were not selected from a larger set — rather, these are all the correlations among some of the commonest words in the collection. (And the rest of the top of the list are more of the same — I skipped down the list a bit just to get a couple of pronouns and a couple of forms of "to be".) When just a few of a very large set of numbers are highly correlated, that might well be an accident. But when nearly all of a very large set of numbers are highly inter-correlated, it's generally an indication of some kind of latent structure. (Which might be an artefact of the data collection or data processing methods, but is not likely to be sampling error.)
In this case, it's clear that there are only a few real degrees of freedom in the patterns of word frequency variation, whether across documents or across time. Some of those dimensions are topic-related, but for the commonest words, they seem likely to be mainly stylistic.]
leoboiko said,
January 12, 2016 @ 2:25 pm
@Myers: Oh, you're the creator of "I Resent LISP"? I was a subscriber (and I resented Lisp for losing to Python, so that I couldn't use it for lack of libraries etc.) (sorry for off-topic comment).
D.O. said,
January 12, 2016 @ 10:41 pm
The reasons for the high degree of correlations mentioned in the OP are very reasonable, but we should calibrate our "surprise factor" first. It is quite clear that the word counts are highly autocorrelated, which is simply to say that their change is primarily due to the long-term trends. It means we should compare the results to a typical distribution of correlation coefficients for highly autocorrelated signals.
A scientific way to do it (I guess) would be to
1) pull out a large number of word count time series,
2) calculate their autocorrelation functions,
3) generate random time series with the same range of autocorrelation coefficients (but without any correlations among themselves!) and
4) observe the distribution of their correlation coefficients.
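(A rough sketch of this recipe in Python, with step 3 simplified: instead of matching each word's measured autocorrelation function, generate strongly autocorrelated AR(1) series that are independent of one another, and see how often their pairwise correlations come out large anyway. The phi=0.95 persistence parameter is an arbitrary choice for illustration:)

```python
import numpy as np

rng = np.random.default_rng(1)

def ar1_series(n, phi, rng):
    """One AR(1) series: x[t] = phi * x[t-1] + white noise."""
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

n_years, phi, n_pairs = 101, 0.95, 2000
big = 0
for _ in range(n_pairs):
    # Two series generated independently -- any correlation is spurious.
    a = ar1_series(n_years, phi, rng)
    b = ar1_series(n_years, phi, rng)
    if abs(np.corrcoef(a, b)[0, 1]) > 0.8:
        big += 1

print(f"fraction of independent pairs with |r| > 0.8: {big / n_pairs:.3f}")
```

Even though every pair is independent by construction, the smoothness alone produces a non-trivial fraction of large-magnitude correlations.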
I am not doing it myself, but instead tried a much easier example. I took Fourier series with amplitudes 1/n^2 + 0.1/n (for the n-th harmonic) and random phases, and got about 40% of correlation coefficients greater than 0.8 in absolute value.
Now, I am sure that's only part of the story. 1/n^2 for the first 10 harmonics may be a bit harsh, but some kind of discount for the autocorrelation ought to be made.
D.O. said,
January 12, 2016 @ 10:48 pm
Just in case anyone cares, my Matlab code:
Npoints = 100;
Ntrials = 100000;
crs = zeros(1, Ntrials);              % preallocate
for ii = 1:Ntrials
    y1 = rsc(Npoints);
    y2 = rsc(Npoints);
    a = corrcoef(y1, y2);
    crs(ii) = a(1,2);
end
disp(nnz(abs(crs) > 0.8) / Ntrials)   % fraction with |r| > 0.8
rsc function:
function signal = rsc(Npoints)
% Random smooth signal: Fourier series with amplitudes 1/n^2 + 0.1/n
% and uniformly random phases, evaluated on [-pi/2, pi/2].
x = linspace(-pi/2, pi/2, Npoints);
Nharmonics = floor(Npoints/2);
amp = 1./(1:Nharmonics).^2 + 0.1./(1:Nharmonics);
phase = 2*pi*rand(Nharmonics, 1);
signal = amp * sin((1:Nharmonics)' * x + phase * ones(1, Npoints));
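(For readers without Matlab, here is a rough NumPy translation of the same experiment; the ~40% figure is D.O.'s, and the exact number will vary with the random seed:)

```python
import numpy as np

rng = np.random.default_rng(2)

def rsc(npoints, rng):
    """Random smooth signal: Fourier series with amplitudes 1/n^2 + 0.1/n
    and uniformly random phases, evaluated on [-pi/2, pi/2]."""
    x = np.linspace(-np.pi / 2, np.pi / 2, npoints)
    n = np.arange(1, npoints // 2 + 1)        # harmonic numbers
    amp = 1.0 / n**2 + 0.1 / n
    phase = 2 * np.pi * rng.random(n.size)
    # Sum over harmonics of amp[k] * sin(n[k]*x + phase[k]).
    return amp @ np.sin(np.outer(n, x) + phase[:, None])

npoints, trials = 100, 5000
crs = np.empty(trials)
for i in range(trials):
    y1, y2 = rsc(npoints, rng), rsc(npoints, rng)
    crs[i] = np.corrcoef(y1, y2)[0, 1]

print(f"P(|r| > 0.8) = {np.mean(np.abs(crs) > 0.8):.2f}")
```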
D.O. said,
January 13, 2016 @ 12:18 am
I actually downloaded the 1900-2000 time series for the 10 most frequent words from Google ngrams and calculated their autocorrelation coefficients. As it happens, the first ~10 of their harmonics (simply |fft(n)|^2) fall off as ~1/n^2, which means that my simple guessed model was right on the money.
If so it might make sense to look at the deciles:
p(|correlation|<0.1)=5.5%, p(0.1<|correlation|<0.2)=5.7%, p(0.2..0.3)=5.9%, p(0.3..0.4)=6.2%, p(0.4..0.5)=6.8%, p(0.5..0.6)=8.0%, p(0.6..0.7)=9.4%, p(0.7..0.8) = 12.3%, p(0.8..0.9)=18.4%, p(0.9..1.0)=21.8%.
[(myl) Excellent point — I should have thought of that. It's actually related to the point made in "Cultural diffusion and the Whorfian hypothesis" (2/12/2012), which has to do with the distribution of correlations among spatially smooth but formally independent 2-D patterns. So I don't have any excuse for failing to see the point.
I still suspect that some stylistic or cultural (or sampling-method) trends will emerge from dimensionality reduction — but you're right, the degree of cross-correlation shouldn't surprise us.]
Yuval said,
January 13, 2016 @ 2:42 am
Could it be, then, that the sources of books Google used changed in certain years? Or some other prosaic reason: some publisher switched to a different font with different OCR issues; a few publishers merged (or split) and thus their editing standards coalesced (or diverged); an eminent journal is only available starting in a particular year; etc.
Graeme said,
January 13, 2016 @ 3:40 am
Somewhat off topic, but has anyone tested the fate of 'have', in its frequency in spoken or even blogged English? A long-standing impression of my old dad's was that 'got' as a broad indicator of possession was accelerating its supplanting of 'have'. And more recently there is the misuse of 'of' instead of 'have' after 'could/should/would'.
Matt said,
January 13, 2016 @ 7:35 am
& how about the near-disappearance of "that" as a phrase connector?
e.g. "Pentagon sources later said [that] they had entered Iranian waters after facing technical difficulties." Here it's only marginally useful, but helps with the parsing of these sorts of constructions.
As a specifier it's still doing nicely. "How much is that doggie in the window?"
[(myl) "Near disappearance" is an exaggeration:
The construction with "that" actually increased in relative frequency through the 1960s, and seems to have leveled off at 30-40% overall.]
Jon said,
January 13, 2016 @ 8:46 am
At least among older books, there are many examples of the same book being scanned several times, and also updated editions in the case of non-fiction. This is sometimes caused by Google scanning the whole contents of different libraries, but I have even seen two identical (apart from handling marks) scanned books from the same library. Could this contribute to the correlations seen?