A very crude way to do this is to look at how much the observed frequency of *the* for one side of a conversation differs from that speaker's average *the* frequency. Then subtract these differences (to eliminate changes in the same direction, for example from topic or difficulty level) and average them, taking the sign of the difference opposite to the sign of the difference in average frequencies. This gives us how many percentage points (on average) a person changes their rate of *the* to accommodate the other side. Switchboard puts this number at 0.16% (the overall *the* rate is 3.2%), which means 5% rate variation due to accommodation.
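The crude measure above can be sketched in a few lines of code. The data here are entirely made up for illustration; real input would be per-conversation *the* rates and each speaker's corpus-wide average rate.

```python
# Sketch of the crude accommodation measure: per-conversation deviations
# from each speaker's own baseline, differenced to cancel shared shifts
# (topic, difficulty), with the sign taken opposite to the sign of the
# gap between the two speakers' average frequencies.

def accommodation(conversations):
    """conversations: list of (rate_a, avg_a, rate_b, avg_b) tuples,
    where rate_x is that side's observed "the" rate in this conversation
    and avg_x is that speaker's average "the" rate over the corpus."""
    scores = []
    for rate_a, avg_a, rate_b, avg_b in conversations:
        dev_a = rate_a - avg_a          # side A's deviation from own baseline
        dev_b = rate_b - avg_b          # side B's deviation from own baseline
        diff = dev_a - dev_b            # subtract to cancel same-direction shifts
        if avg_a > avg_b:               # sign opposite to the baseline gap,
            diff = -diff                # so "moving toward each other" is positive
        scores.append(diff)
    return sum(scores) / len(scores)

# Hypothetical numbers: in both conversations the speakers drift toward
# each other, so the averaged score comes out positive.
convs = [
    (0.033, 0.035, 0.031, 0.030),
    (0.034, 0.032, 0.031, 0.033),
]
print(round(accommodation(convs), 4))  # → 0.0035
```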

Unless we seriously believe in dis-accommodation (that is, people changing their *the* frequencies to be more different from the other side), the standard deviation provides the noise level for this estimate, and it is about 1.3%, eight times larger than the effect. Obviously too crude.

That's what I've done, to some extent. For each conversation, I took the *the* rate for each side, subtracted that speaker's average *the* rate across the corpus, and averaged the products of the two deviations (weighted by conversation length). Then I normalized by the variance across the corpus.
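This product-of-deviations statistic is essentially a length-weighted correlation between the two sides' deviations. A minimal sketch, with hypothetical deviations and a made-up corpus variance:

```python
# Weighted average of products of per-side deviations, normalized by the
# corpus-wide variance of the "the" rate. A positive value means the two
# sides tend to deviate from their baselines in the same direction.

def normalized_accommodation(deviations, variance):
    """deviations: list of (dev_a, dev_b, length) tuples, where dev_x is
    that side's "the" rate minus the speaker's corpus average, and length
    is the conversation length used as a weight."""
    total_weight = sum(length for _, _, length in deviations)
    weighted_sum = sum(dev_a * dev_b * length
                       for dev_a, dev_b, length in deviations)
    return (weighted_sum / total_weight) / variance

# Hypothetical data: both conversations show same-direction deviations,
# and 0.0013 stands in for the corpus standard deviation of the rate.
devs = [(0.002, 0.001, 100), (-0.001, -0.002, 200)]
print(round(normalized_accommodation(devs, 0.0013 ** 2), 3))
```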

The next step would be to build some sort of Bayesian model where each speaker is characterized by her "natural level" of *the* use and her "elasticity", that is, her propensity to shift that natural level toward the *the* frequency of the other speaker. But, sorry, I am not doing it.

[(myl) Since there are quite a few cases where a given speaker is involved in multiple conversations with different other speakers, it would be interesting to analyze the variation across the whole sparse matrix of conversational connections…]

Is the wider range statistically significant? For example, if we fit a line, is a slope of one outside the 95% confidence interval? Also, it looks like the interviewee range is wider on a log-log plot (or expressed as a percentage of the average), which would probably be the more relevant comparison if we want to gauge the magnitude of the effect.

[(myl) I could run various analyses of "statistical significance" — or you could, on the basis of the raw data:

CarrieBrownstein  2924   84  1662  51
DanielTorday      4630  167  1258  63
GloriaSteinem     4069  116  1355  53
IlleanaDouglas    4529  136  1222  39
JillSoloway       4327  138  1516  50
JohnKander        2300   85  1544  85
LenaDunham        5740  149  1715  49
RichardFord       1356   49  1861  51
SarahSilverman    4304   96  1151  28
StephenKing       4750  201  1172  36
TanehisiCoates    6287  194  1229  40
ViencentDevita    3873  151   895  34
WillieNelson      3318   89  1482  49

[where the 2nd column is the number of words in the guest's transcript, the 3rd column is the number of instances of *the* in the guest's transcript, and the 4th and 5th columns are the same things for the host, who is Terry Gross in all cases.]

But the fact that there are only 13 interviews in this test pretty well guarantees that the estimate of the *the*-percentage range is not very reliable.]
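One way to see how (un)reliable the per-interview estimates are is to attach a binomial standard error to each rate. A quick sketch, using a few rows of the counts above (the full table would work the same way):

```python
# Per-side "the" rates with approximate 95% binomial confidence intervals,
# computed from the raw counts in the table (guest words, guest "the"s,
# host words, host "the"s; the host is Terry Gross in all cases).

data = {
    "CarrieBrownstein": (2924, 84, 1662, 51),
    "RichardFord":      (1356, 49, 1861, 51),
    "TanehisiCoates":   (6287, 194, 1229, 40),
}

rates = {}
for name, (gw, gt, hw, ht) in data.items():
    rates[name] = (gt / gw, ht / hw)
    for side, words, thes in (("guest", gw, gt), ("host", hw, ht)):
        p = thes / words
        se = (p * (1 - p) / words) ** 0.5   # binomial standard error of the rate
        print(f"{name:16s} {side:5s} rate={p:.2%} +/- {1.96 * se:.2%}")
```

With only one or two thousand words per side, the half-widths of these intervals are a substantial fraction of the rates themselves, which is the point about 13 interviews not pinning down the range very well.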

[(myl) The (conventionally cleaned-up and edited) transcripts are online, so you can explore this for yourself ad libitum.]
