Subtle differences


Andrew Gelman, "Separated by a common blah blah blah", SMCISS 12/1/2013:

I love reading the kind of English that English people write. It’s the same language as American but just slightly different. I was thinking about this recently after coming across this footnote from “Yeah Yeah Yeah: The Story of Modern Pop,” by Bob Stanley:

Mantovani’s atmospheric arrangement on ‘Care Mia’, I should add, is something else. Genuinely celestial. If anyone with a degree of subtlety was singing, it would be quite a record.

It’s hard for me to pin down exactly what makes this passage specifically English, but there’s something about it . . .

I wouldn't have had the same reaction to that specific passage, but I recognize that cues to style (including geographic style) are often very subtle. In this case, Andrew may be reacting to features like these:

N-gram              COCA      BNC        Weight of Evidence
should              762.82    1,119.98   0.384
genuinely           7.64      14.19      0.619
. if anyone         1.25      2.14       0.538
with a degree of    0.29      1.03       1.267
subtlety            1.72      2.62       0.421
quite a             27.07     61.56      0.822

The second and third columns are frequencies per million words in COCA and BNC, and the fourth column is the "Weight of Evidence" as per Alan Turing, defined as log(P(evidence|hypothesis1)/P(evidence|hypothesis2)), where hypothesis1 is "text is British" and hypothesis2 is "text is American". (The maximum-likelihood estimate of, say, the probability that a random word in American text is "should", based on the COCA corpus, is 762.82/1000000; the analogous estimate for British text is 1119.98/1000000; so the ratio of likelihoods is just 1119.98/762.82 = 1.46821, and the "weight of evidence" is the natural log of that ratio, log(1.46821) = 0.384044, etc.)
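For anyone who wants to check the fourth column, here is a minimal Python sketch of the per-row calculation. The frequencies are copied from the table; the function name `weight_of_evidence` and the dictionary layout are just illustrative choices, not part of any COCA or BNC tooling.

```python
import math

# Frequencies per million words, copied from the table: (COCA, BNC).
FREQS = {
    "should": (762.82, 1119.98),
    "genuinely": (7.64, 14.19),
    ". if anyone": (1.25, 2.14),
    "with a degree of": (0.29, 1.03),
    "subtlety": (1.72, 2.62),
    "quite a": (27.07, 61.56),
}


def weight_of_evidence(coca_per_million, bnc_per_million):
    """Natural log of P(item | British) / P(item | American).

    Both rates are per million words, so the 1,000,000 denominators
    cancel and only the ratio of the two rates matters.
    """
    return math.log(bnc_per_million / coca_per_million)


# Print each n-gram with its weight of evidence, rounded as in the table.
for item, (coca, bnc) in FREQS.items():
    print(f"{item:18s} {weight_of_evidence(coca, bnc):.3f}")
```

A positive value favors "text is British", a negative value favors "text is American", and zero means the item is equally frequent in both corpora.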

The sum of those log likelihood ratios is 4.051, which corresponds to odds of better than 50 to 1 in favor of a British origin (exp(4.051) = 57.45). This is a completely illegitimate calculation, since I've cherry-picked n-grams of different degrees that struck me as likely to be commoner in British as opposed to American writing (though I didn't need to withdraw any guesses). But still, this unsound calculation does suggest that an evaluation of the whole passage with proper n-gram language models for COCA and BNC might well yield similar results, confirming Andrew's origin-instinct if not his aesthetic reaction.
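The step from summed weights to odds can be sketched in a few lines of Python. The values are the rounded fourth-column entries from the table; the variable names are illustrative. Adding the logs amounts to multiplying the individual likelihood ratios, which treats the six items as independent pieces of evidence, and that is an assumption rather than something the data guarantee.

```python
import math

# Rounded weights of evidence from the fourth column of the table.
WEIGHTS = {
    "should": 0.384,
    "genuinely": 0.619,
    ". if anyone": 0.538,
    "with a degree of": 1.267,
    "subtlety": 0.421,
    "quite a": 0.822,
}

# Summing log likelihood ratios is the same as multiplying the ratios,
# which is only valid if the items are treated as independent.
total_weight = sum(WEIGHTS.values())

# Exponentiating the total turns it back into odds (British : American).
odds_british = math.exp(total_weight)

print(f"total weight of evidence: {total_weight:.3f}")
print(f"odds in favor of British origin: about {odds_british:.2f} to 1")
```

Because the inputs are rounded to three decimals, the result matches the figures in the text only to about the same precision.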

For more on the background of Turing's idea, and its relationship to neuroscience and psychology, see Joshua Gold and Michael Shadlen, "Banburismus and the Brain", Neuron 2002, or Paul Cisek, "Neurobiology: The currency of guessing", Nature 2007. It's quite striking how effectively this simple idea can often be used to combine a large number of weak pieces of evidence into a strong conclusion. If you're the kind of person who is best persuaded by trying an idea out in practice, see here for a simple recipe in Matlab (or better, these days, Octave) for combining weak evidence about single-letter frequencies in English and Italian.



29 Comments

  1. D.O. said,

    December 1, 2013 @ 5:34 pm

    On the other hand, "something else", if g-Ngram is to be believed, is twice as frequent in AmEng as in BrEng; "would be" 3 times and "arrangement" a little more than 1.5. I also checked "atmospheric", but they are approximately equal. If you go by the hunches you should at least go for the hunches in both directions. BTW AFAIK Andrew Gelman generally doesn't like Bayesian ratios.

    [(myl) BNC vs. COCA for "something else" is a squeaker for the Yanks, just 20.11 to 24.97 per million words (WoE = -0.256); and "arrangement" is actually a bigger win for the Brits, 32.86 to 17.05 (WoE = 0.656). I'd believe BNC vs. COCA over the Google ngrams.

    There are certainly problems with adding up log likelihood ratios, starting with independence assumptions. But the method works surprisingly well in many cases, all the same, including cases where it's not clear how to do something better founded. And there are people who claim that it provides a good model for (at least some cases of) human and other animal decision-making under uncertainty.]

  2. D.O. said,

    December 1, 2013 @ 5:44 pm

    Oops, sorry. "atmospheric" is 1.5 in AmEng over BrEng, it's "celestial" that is almost the same.

    [(myl) Using BNC and COCA, "atmospheric" is again a squeaker for the Yanks, 7.37/MW vs. 6.67 (WoE = -0.010). And again, I'd bet on that comparison against the Google Books ngrams.]

  3. D.O. said,

    December 1, 2013 @ 5:49 pm

    And, well, who would have guessed? A simple word "add" is 3 times as frequent in AmEng as in BrEng.

    [(myl) Another good reason not to believe the geographically-specialized Google ngram counts. Anyhow, a better method would be to look at the overall fit of language models trained on COCA and the BNC. Unfortunately, this is not possible, since Mark Davies keeps COCA (as a whole) private. And ditto for the Google Books collection…]

  4. Lazar said,

    December 1, 2013 @ 6:12 pm

    One thing is the use of "was" instead of the subjunctive "were". Both variants occur on both sides of the Atlantic, but "was" is more common and more accepted for literary use in Britain.

  5. Jerry Friedman said,

    December 1, 2013 @ 6:50 pm

    "if anyone was" per 100,000,000

    COCA: 40.67
    BNC: 47

    weight of evidence: 0.1447

    I was expecting more of a difference than that, since I've been told the same thing about the subjunctive. A difference that small could just reflect the fact that COCA has less informal conversation than the BNC. Also, some people at alt.usage.english say the subjunctive "were" is making a comeback in British English, possibly because of American influence.

    For some better statistics:

    if I|he|she|it|someone|somebody|anybody|this|that|anyone was
    (per 100,000,000)

    COCA: 5017
    BNC: 6080

    weight of evidence: 0.1922

  6. Jerry Friedman said,

    December 1, 2013 @ 7:02 pm

    COCA won't let me search for two-word adjective-noun fragments—all the "words" are too common.

    Isn't the song title "Cara Mia"? Yes, "Care" seems to be Andrew Gelman's typo.

  7. dw said,

    December 1, 2013 @ 7:27 pm

    This example is from a book. In edited prose, I think the "were" irrealis form is more steadfastly adhered to in AmE than in BrE.

  8. dw said,

    December 1, 2013 @ 7:43 pm

    From Google News (which I think is more reliable about the location of its sources than Google Books or regular Google search):

    source USA "if I was": 15100 hits

    source USA "if I were": 9600 hits

    source UK "if I was": 4220 hits

    source UK "if I were": 1420 hits

    US were/was ratio: 64%
    UK were/was ratio: 34%

    I chose "if I was|were" because that collocation, unlike e.g. "if he was|were", nearly always indicates a potential irrealis mood.

  9. Ted said,

    December 1, 2013 @ 9:41 pm

    Were I a betting man, I should not be surprised if "were I" as an alternative to "if I were" were more common in BrE than AmE.

  10. Ted said,

    December 1, 2013 @ 9:49 pm

    Were I a wolf, however, I should be careful speaking AmE lest I be overheard in Mayfair.

  11. Yuval said,

    December 2, 2013 @ 1:12 am

    Two math typos in the post-table paragraph: the parentheses after "log" don't end; the 100000 referring to BNC should have another zero.

  12. Vanya said,

    December 2, 2013 @ 3:30 am

    Ted, if you were a wolf presumably you wouldn't speak English at all. I suppose I'm missing the reference.

  13. David Morris said,

    December 2, 2013 @ 4:22 am

    I believe the reference is to this: http://www.youtube.com/watch?v=OllepPdoDb0

  14. Colin Fine said,

    December 2, 2013 @ 9:44 am

    I'm surprised that the irrealis subjunctive appears to be less common in the UK than the US. It's on the retreat in both places, but I expected it was just as alive and kicking in formal English as in formal American.
    Where I would expect to find a difference is in the jussive, after verbs like "request" and particularly "mandate", where British usage tends to favour either the indicative or the infinitive over the subjunctive.

  15. Vance Koven said,

    December 2, 2013 @ 10:06 am

    Nobody has mentioned that notable curiosity, "arrangement on," which I have never seen before, but if it's correctly transcribed, is surely not American. "Arrangement of" is nearly universal among musicians and writers on music this side of the pond.

  16. Morten Jonsson said,

    December 2, 2013 @ 10:34 am

    @Vance Koven

    I think "on" can be understood as referring to the recording of the song, not the song itself. The author is thinking of the arrangement as an element of the recording, like a vocal or an instrumental part, so he says "on," as in "the guitar solo on 'Red House'" or "the harmony on 'Ebony Eyes.'" That's idiomatic in the US as well as the UK. It's a kind of shorthand, if you like, for "Mantovani’s atmospheric arrangement of 'Cara Mia' on the recording of ‘Cara Mia.’"

  17. Rodger C said,

    December 2, 2013 @ 12:25 pm

    I think I've commented on this before, but I was positively shocked at age 17 when Tolkien kept saying "as if he was."

  18. J. W. Brewer said,

    December 2, 2013 @ 3:54 pm

    This is a particularly complicated situation, because the fellow is writing about popular music. Most British music journalists of the last X decades are buffs of various sorts of cultishly-obscure American music and are also familiar with the work of certain stylistically-influential US rock critics. Most American music journalists of the same period are buffs of various sorts of cultishly-obscure British music and are also familiar with the work of certain stylistically-influential UK rock critics. So one might expect the UK writers to consciously or unconsciously affect some Americanisms and the US writers to consciously or unconsciously affect some Britishisms.

  19. J. W. Brewer said,

    December 2, 2013 @ 4:31 pm

    I don't know anything about Stanley, but some of the chapter titles in his grand historical narrative are amusing, especially "1985: What the Fuck Is Going On?" (surely a Britishism?). The chapter did not quite live up to its title, but I expect he and I may have discordant factional/aesthetic views on some of the subject matter.

  20. Asa said,

    December 3, 2013 @ 12:36 am

    What about "1985: What the Fuck Is Going On?" is a Britishism? Possibly I'm missing a joke here…

  21. Larry Sheldon said,

    December 3, 2013 @ 3:58 am

    I think the clue is "with a degree of subtlety". I can't tell you the technicality of why I think that.

  22. Charly Baltimore said,

    December 3, 2013 @ 9:59 am

    @Ted

    Ahh oooooooo!!!
    See you at Trader Vic's ;)

  23. Red Scharlach said,

    December 3, 2013 @ 12:27 pm

    @ J.W. Brewer

    The chapter title "1985: What The Fuck Is Going On?" is actually a reference to the album title "1987 (What The Fuck Is Going On?)" by the British band The Justified Ancients of Mu Mu, who later became more famous (in the UK, at least) as the KLF.

  24. blahedo said,

    December 3, 2013 @ 2:48 pm

    Summing the WOE log-ratios is exactly equivalent to building a naïve Bayesian classifier, no? So it would carry all the strengths (easy, often useful in practice) and weaknesses (independence assumptions!) of that technique.

  25. un malpaso said,

    December 3, 2013 @ 4:50 pm

    I'm from the USA… and I can't pick up anything distinctive about the above passage at all. If he had wanted to find a better example of distinctively UK writing (to my eyes/ears), I could pick several from any issue of The Economist, but I imagine that's not exactly a scientific example.

  26. Victoria Simmons said,

    December 4, 2013 @ 4:58 am

    I had the same "This is British" reaction on reading "Fifty Shades of Grey," which has American characters, but was written by a woman who was raised in England. I knew for certain she was British before I read anything about her. I eventually came across a few howlers, such as the narrator referring to a temper tantrum as "tossing toys out of his pram," but much more than that, the book was just suffused with Britishness–and not just Britishness, but the particularly arch sort of Britishness that tends to show up in fan fiction. I like the demonstration here of how subtle these differences can be, far beyond those lists of all the Britishisms removed from the American editions of the Harry Potter books.

  27. Lane said,

    December 4, 2013 @ 8:31 am

    (American writer for The Economist here…) Are COCA and the BNC equivalent for this kind of search? I, like a lot of Americans, sense that British English is a little more "buttoned up" than American, more "shall" and "were I …" and "quite" and the like. Those are also probably all more likely to be found in formal written sources than less-formal ones. So if either of the two corpora were more relatively weighted to formal writing, it could screw up this kind of Breakfast Experiment.

  28. Tom said,

    December 4, 2013 @ 9:30 am

    Has anyone compared human judgment vs automatic judgment in identifying the source of written text?

    (& I would've said that the combination of the insult & compliment in the last sentence made it distinctively British)

  29. J. W. Brewer said,

    December 4, 2013 @ 11:19 am

    I appreciate Red Scharlach's explanation of the allusion, which was lost on me no doubt due to my own factional aesthetic preferences regarding the music of the period. (I would have recognized an allusion to that Spacemen 3 song with the lyric about how in "1987 all I wanna to do is get stoned"). Given that JAMM's/KLF were not only British but more commercially successful there than in the U.S., I suppose my "surely a Britishism" joke was unintentionally accurate.
