Andrew Gelman, "Separated by a common blah blah blah", SMCISS 12/1/2013:
I love reading the kind of English that English people write. It’s the same language as American but just slightly different. I was thinking about this recently after coming across this footnote from “Yeah Yeah Yeah: The Story of Modern Pop,” by Bob Stanley:
Mantovani’s atmospheric arrangement on ‘Cara Mia’, I should add, is something else. Genuinely celestial. If anyone with a degree of subtlety was singing, it would be quite a record.
It’s hard for me to pin down exactly what makes this passage specifically English, but there’s something about it . . .
I wouldn't have had the same reaction to that specific passage, but I recognize that cues to style (including geographic style) are often very subtle. In this case, Andrew may be reacting to features like these:
| n-gram | COCA | BNC | Weight of Evidence |
|---|---|---|---|
| . if anyone | 1.25 | 2.14 | 0.538 |
| with a degree of | 0.29 | 1.03 | 1.267 |
The second and third columns are frequencies per million words in COCA and BNC, and the fourth column is the "Weight of Evidence" as per Alan Turing, defined as log(P(evidence|hypothesis1)/P(evidence|hypothesis2)), where hypothesis1 is "text is British" and hypothesis2 is "text is American". (For an item that occurs 762.82 times per million words in COCA, say, the maximum-likelihood estimate of the probability that a random word in American text is that item is 762.82/1000000; the analogous estimate for British text, based on the BNC, is 1119.98/1000000; so the ratio of likelihoods is just 1119.98/762.82 = 1.46821, and the "weight of evidence" is the natural log, log(1.46821) = 0.384044, etc.)
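The arithmetic in that parenthetical is easy to check. Here is a minimal Python sketch, not the original calculation: the function name and the dictionary layout are made up for illustration, the rates are the ones quoted above, and the label "worked example" stands in for the unnamed item from the parenthetical.

```python
import math

def weight_of_evidence(bnc_per_million, coca_per_million):
    """Natural-log likelihood ratio, British (BNC) over American (COCA)."""
    # Both rates are per million words, so the scaling cancels and the
    # ratio of rates equals the ratio of estimated probabilities.
    return math.log(bnc_per_million / coca_per_million)

# (COCA rate, BNC rate) per million words, as quoted in the text.
rates = {
    ". if anyone": (1.25, 2.14),
    "with a degree of": (0.29, 1.03),
    "worked example": (762.82, 1119.98),  # unnamed item from the parenthetical
}

weights = {
    item: weight_of_evidence(bnc, coca)
    for item, (coca, bnc) in rates.items()
}

for item, w in weights.items():
    print(f"{item!r}: {w:.3f}")
```

A positive weight favors "British" and a negative one favors "American"; an item equally common in both corpora contributes nothing.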
The sum of those log likelihood ratios is 4.051, which corresponds to odds of better than 50 to 1 in favor of a British origin (exp(4.051) = 57.45). This is a completely illegitimate calculation, since I've cherry-picked n-grams of different degrees that struck me as likely to be commoner in British as opposed to American writing (though I didn't need to withdraw any guesses). But still, this unsound calculation does suggest that an evaluation of the whole passage with proper n-gram language models for COCA and BNC might well yield similar results, confirming Andrew's origin-instinct if not his aesthetic reaction.
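Spelled out, adding the weights is the same as multiplying the likelihood ratios. The notation below is introduced only for this sketch: e_i is the i-th piece of evidence, H_B and H_A are the British and American hypotheses, and w_i is the corresponding weight. The product form treats the pieces of evidence as independent given each hypothesis, and reading the result as odds assumes even prior odds; both are assumptions made explicit here rather than claims from the text.

```latex
\sum_i w_i
  = \sum_i \log \frac{P(e_i \mid H_B)}{P(e_i \mid H_A)}
  = \log \prod_i \frac{P(e_i \mid H_B)}{P(e_i \mid H_A)},
\qquad
\exp\Bigl(\sum_i w_i\Bigr) = \exp(4.051) \approx 57.45 .
```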
For more on the background of Turing's idea, and its relationship to neuroscience and psychology, see Joshua Gold and Michael Shadlen, "Banburismus and the Brain", Neuron 2002, or Paul Cisek, "Neurobiology: The currency of guessing", Nature 2007. It's quite striking how effectively this simple idea can often be used to combine a large number of weak pieces of evidence into a strong conclusion. If you're the kind of person who is best persuaded by trying an idea out in practice, see here for a simple recipe in Matlab (or better, these days, Octave) for combining weak evidence about single-letter frequencies in English and Italian.
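For readers without Matlab or Octave, the same idea can be sketched in Python. This is not the linked recipe, only a toy version under loudly stated assumptions: the two "training corpora" are single famous sentences (Dickens and Dante, with the accent on "ché" dropped), the letter models use add-one smoothing over the 26 unaccented letters, and all function names are invented for the illustration.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase

def letters(text):
    """Keep only the 26 unaccented letters, lower-cased."""
    return [ch for ch in text.lower() if ch in ALPHABET]

def letter_model(sample):
    """Add-one-smoothed letter probabilities estimated from a sample text."""
    counts = Counter(letters(sample))
    total = sum(counts.values()) + len(ALPHABET)
    return {ch: (counts[ch] + 1) / total for ch in ALPHABET}

def weight_of_evidence(text, model_1, model_2):
    """Sum of per-letter log likelihood ratios; positive favors model_1."""
    return sum(math.log(model_1[ch] / model_2[ch]) for ch in letters(text))

# Toy training samples: far too small for real use, illustration only.
ENGLISH_SAMPLE = ("It was the best of times, it was the worst of times, "
                  "it was the age of wisdom, it was the age of foolishness")
ITALIAN_SAMPLE = ("Nel mezzo del cammin di nostra vita mi ritrovai per una "
                  "selva oscura, che la diritta via era smarrita")

english = letter_model(ENGLISH_SAMPLE)
italian = letter_model(ITALIAN_SAMPLE)

for text in (ENGLISH_SAMPLE, ITALIAN_SAMPLE):
    w = weight_of_evidence(text, english, italian)
    guess = "English" if w > 0 else "Italian"
    print(f"{w:+.2f} -> {guess}: {text[:30]}...")
```

Scoring the training sentences themselves is circular, of course; the point is only that each letter contributes a small positive or negative weight, and the running sum moves steadily toward one hypothesis. With letter models estimated from real corpora, the same loop can be applied to unseen text.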