e.g. "Pentagon sources later said [that] they had entered Iranian waters after facing technical difficulties." Here it's only marginally useful, but helps with the parsing of these sorts of constructions.

As a specifier it's still doing nicely. "How much is that doggie in the window?"

[(myl) "Near disappearance" is an exaggeration:

The construction with "that" actually increased in relative frequency through the 1960s, and seems to have leveled off at 30-40% overall.]

If so, it might make sense to look at the deciles:

p(0.0 ≤ |correlation| < 0.1) = 5.5%, p(0.1 ≤ |correlation| < 0.2) = 5.7%, p(0.2 ≤ |correlation| < 0.3) = 5.9%, p(0.3 ≤ |correlation| < 0.4) = 6.2%, p(0.4 ≤ |correlation| < 0.5) = 6.8%, p(0.5 ≤ |correlation| < 0.6) = 8.0%, p(0.6 ≤ |correlation| < 0.7) = 9.4%, p(0.7 ≤ |correlation| < 0.8) = 12.3%, p(0.8 ≤ |correlation| < 0.9) = 18.4%, p(0.9 ≤ |correlation| ≤ 1.0) = 21.8%.
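For what it's worth, that decile tabulation can be reproduced in miniature. The sketch below is a pure-Python analogue of the MATLAB simulation quoted further down; the function names, trial count, and series length are my own scaled-down stand-ins, not anything from the original:

```python
import math
import random

def rsc(npoints, nharm=None):
    """Random smooth curve: harmonics with amplitudes 1/n^2 + 0.1/n
    and uniformly random phases (an analogue of the MATLAB rsc below)."""
    if nharm is None:
        nharm = npoints // 2
    xs = [-math.pi / 2 + math.pi * i / (npoints - 1) for i in range(npoints)]
    phases = [2 * math.pi * random.random() for _ in range(nharm)]
    return [sum((1 / n**2 + 0.1 / n) * math.sin(n * x + phases[n - 1])
                for n in range(1, nharm + 1))
            for x in xs]

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def decile_fractions(ntrials=400, npoints=60):
    """Fraction of |correlation| values landing in each decile,
    for pairs of independent random smooth curves."""
    counts = [0] * 10
    for _ in range(ntrials):
        r = abs(pearson(rsc(npoints), rsc(npoints)))
        counts[min(int(r * 10), 9)] += 1  # clamp r = 1.0 into the top bin
    return [c / ntrials for c in counts]
```

With a few hundred trials the shape of the table above emerges: the top decile is several times more populated than the bottom one.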

[(myl) Excellent point — I should have thought of that. It's actually related to the point made in "Cultural diffusion and the Whorfian hypothesis" (2/12/2012), which has to do with the distribution of correlations among spatially smooth but formally independent 2-D patterns. So I don't have any excuse for failing to see the point.

I still suspect that some stylistic or cultural (or sampling-method) trends will emerge from dimensionality reduction — but you're right, the degree of cross-correlation shouldn't surprise us.]

Npoints = 100;
crs = zeros(1, 100000);   % preallocate for speed
for ii = 1:100000
    y1 = rsc(Npoints); y2 = rsc(Npoints);
    a = corrcoef(y1, y2); crs(ii) = a(1,2);
end
disp(nnz(crs > 0.8 | crs < -0.8) / 100000)   % fraction with |r| > 0.8

rsc function:

function signal = rsc(Npoints)
% Random smooth curve: harmonics with amplitudes 1/n^2 + 0.1/n and random phases
x = linspace(-pi/2, pi/2, Npoints);
Nharmonics = floor(Npoints/2);
amp = 1 ./ (1:Nharmonics).^2 + 0.1 ./ (1:Nharmonics);
phase = 2*pi*rand(Nharmonics, 1);
signal = amp * sin((1:Nharmonics)' * x + phase * ones(1, Npoints));
end

A scientific way to do it (I guess) would be to

1) pull out a large number of word count time series,

2) calculate their autocorrelation functions,

3) generate random time series with the same range of autocorrelation coefficients (but without any correlations among themselves!) and

4) observe the distribution of their correlation coefficients.
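Step 3 is the tricky part. One standard way to get surrogate series with a given series' autocorrelation is phase randomization: keep the Fourier amplitude spectrum (which fixes the autocorrelation, by the Wiener–Khinchin theorem) and scramble the phases. A pure-Python sketch, with a naive DFT that is only meant for short series; none of this is from the original comment:

```python
import cmath
import math
import random

def phase_randomize(series):
    """Surrogate with the same power spectrum (hence the same
    autocorrelation) as `series`, but random phases."""
    n = len(series)
    # naive DFT; fine for short series
    spec = [sum(series[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]
    # randomize phases, keeping conjugate symmetry so the result is real
    out_spec = [spec[0]] + [0] * (n - 1)
    for k in range(1, (n + 1) // 2):
        theta = 2 * math.pi * random.random()
        rotated = abs(spec[k]) * cmath.exp(1j * theta)
        out_spec[k] = rotated
        out_spec[n - k] = rotated.conjugate()
    if n % 2 == 0:
        out_spec[n // 2] = spec[n // 2]  # Nyquist bin must stay real
    # inverse DFT
    return [sum(out_spec[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]
```

Because only the phases change, the surrogate keeps the original's mean (DC bin) and total power (Parseval), so its autocorrelation function matches the original's while being otherwise independent of any other surrogate drawn the same way.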

I haven't done that myself; instead I tried a much easier example. I took Fourier series with amplitudes 1/n^2 + 0.1/n (for the n-th harmonic) and random phases, and got about 40% of correlation coefficients greater than 0.8 in absolute value.

Now, I am sure that's only part of the story. 1/n^2 for the first 10 harmonics may be a bit harsh, but some kind of discount for the autocorrelation ought to be made.

All I have been able to conclude from it is that spurious, astonishingly high correlations are so common that they are hardly worth wondering about unless you have a testable model in mind.

(People were invited to resent LISP, BTW, for failing to offer them a career path.)

[(myl) But these examples were not selected from a larger set — rather this is all the correlations among some of the commonest words in the collection. (And the rest of the top of the list are more of the same — I skipped down the list a bit just to get a couple of pronouns and a couple of forms of "to be".) When just a few of a very large set of numbers are highly correlated, that might well be an accident. But when nearly all of a very large set of numbers are highly inter-correlated, it's generally an indication of some kind of latent structure. (Which might be an artefact of the data collection or data processing methods, but is not likely to be sampling error.)

In this case, it's clear that there are only a few real degrees of freedom in the patterns of word frequency variation, whether across documents or across time. Some of those dimensions are topic-related, but for the commonest words, they seem likely to be mainly stylistic.]
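The latent-structure point is easy to illustrate with a toy example: if many series are just differently weighted copies of one shared trend plus independent noise, nearly every pair comes out highly correlated. A sketch with entirely made-up series (the sinusoidal trend, weight range, and noise level are all invented for illustration):

```python
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def shared_trend_demo(nseries=15, npoints=100, noise=0.1, seed=2):
    """Series = weight * shared trend + independent noise.
    Returns the fraction of series pairs with |correlation| > 0.8."""
    rng = random.Random(seed)
    trend = [math.sin(2 * math.pi * t / npoints) for t in range(npoints)]
    series = []
    for _ in range(nseries):
        w = rng.uniform(0.5, 1.5)  # each series gets its own weight
        series.append([w * trend[t] + noise * rng.gauss(0, 1)
                       for t in range(npoints)])
    high = total = 0
    for i in range(nseries):
        for j in range(i + 1, nseries):
            total += 1
            if abs(pearson(series[i], series[j])) > 0.8:
                high += 1
    return high / total
```

One shared degree of freedom is enough to make the whole correlation matrix look dramatic, which is the sense in which near-universal inter-correlation points to latent structure rather than sampling error.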
