Richard Roeper, "Election prediction: Electoral votes will add up to Barack Obama victory", Chicago Sun-Times 11/4/2012:
Please understand, we’re not talking about my preference. This is all about the cold hard business of predicting. If you handed me a suitcase of money and sent me to a casino where they allowed wagering on elections and I had to put all of it on one candidate in this race, I wouldn’t hesitate to put that money on Obama.
Of course it’s not Romney’s fault that allies such as Hannity, Limbaugh, Trump and Giuliani seem increasingly shrill and desperate in their criticism of Obama in the last week or so. Of course Romney couldn’t do anything about a force of nature that allowed the president to be presidential while the Mittster was relegated to the sidelines, comparing the massive undertaking on the East Coast with the time when he and his chums had to clean up a high school football field after a big game. (“The field was covered with rubbish and paper goods from people who’d had a big celebration there at the game.” Rubbish?) [emphasis added]
Please understand, we're not talking about election prediction. This is all about the cold hard business of linguistic variation. If you handed me a suitcase of money and sent me to a casino where they allowed wagering on word usage and I had to put it all on one of two candidates for the national origin of someone who described the detritus left in the wake of a party as "rubbish", I wouldn't hesitate to put that money on Britain.*
I believe that this is what Roeper meant when he wrote "Rubbish?" after the (approximate) quotation from Romney — he was surprised to hear that word, in that context, coming out of the mouth of an American.
The facts seem to support his surprise. In the Corpus of Contemporary American English, trash is about 10 times as common as rubbish, whereas in the British National Corpus the relationship is reversed. The exact numbers (expressed as frequency per million words) are these:
If we assume that the speaker was either British or American, then what we learn about their national origin by encountering the word rubbish can be represented by the Bayes factor
K = Pr(rubbish|British)/Pr(rubbish|American)
This is the probability of reaching into the British urn of words and coming up with rubbish, divided by the probability of reaching into the American urn of words and coming up with rubbish. If we accept the corpus-derived estimates that there are 22.81 instances of rubbish per million words in the British case, and 1.76 per million words in the American case, then this becomes
K = (22.81/1000000)/(1.76/1000000) = 22.81/1.76 = 12.96023
If that's all the evidence we have, and we treat the prior probability of the two nationalities as equal, then the probability that the speaker is British can be estimated as
1/(1+1/12.96023) = 0.928368 ≈ 93%
We can recover the odds from this as
0.928368/(1-0.928368) = 12.96023
or about 13:1.
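For readers who want to check the arithmetic, here is a minimal Python sketch of the calculation so far. The function names are my own invention for illustration, and the equal-priors default simply mirrors the assumption made above.

```python
def bayes_factor(rate_british, rate_american):
    """K = Pr(word|British) / Pr(word|American).

    Both rates must be in the same units (here, per million words),
    so the units cancel in the ratio.
    """
    return rate_british / rate_american


def posterior_british(k, prior_odds=1.0):
    """Turn a Bayes factor into Pr(British | evidence).

    prior_odds=1.0 encodes the assumption that the two
    nationalities are equally likely beforehand.
    """
    odds = k * prior_odds
    return odds / (1.0 + odds)


k = bayes_factor(22.81, 1.76)   # "rubbish" in the BNC vs. COCA
p = posterior_british(k)

print(f"K = {k:.5f}")                    # about 12.96
print(f"Pr(British) = {p:.6f}")          # about 0.928
print(f"odds = {p / (1.0 - p):.5f} : 1") # recovers K, about 13:1
```

Note that odds/(1+odds) is algebraically the same as the 1/(1+1/K) form used above.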
Of course, Mitt Romney is really an American, so this one piece of evidence is misleading, and I'd lose the suitcase full of money if I were really foolish enough to bet it. And I would deserve to lose it, even if I really didn't know anything about the speaker beyond the word sequences in this one sentence, because there's even stronger evidence in the other direction within the same sentence. At least, there would be, if we had a reliable transcription of what he said:
I remember once we had a-
a football game at my high school and the
football field afterwards was covered with all sorts of
uh rubbish and- and uh
paper goods from people who'd had a big uh celebration there at the game and
there was a group of us that was assigned to clean it up
"My high school"? That phrase occurs 512 times out of 450 million words in COCA, for a rate of 1.14 per million words; it occurs once out of 100 million words in the BNC, for a rate of 0.01 per million words. If we accept these empirical estimates as valid, this gives us
K = (1/100)/(512/450) = 0.008789062
or about 114:1 against.
As Alan Turing noted in the context of the Enigma project during WWII, we can take the sum of the logs of such ratios as a measure of the "weight of evidence" for one hypothesis versus another. If we use log to the base 10, as the Bletchley Park codebreakers did, we get
log10(12.96023)+log10(0.008789062) = -0.9434448
for the case of someone who uses the word "rubbish" but also says "my high school". This gives us a rather different picture of the probability that the speaker was British:
1/(1+1/(10^-0.9434448)) = 0.10226
corresponding to odds of about 0.10226/(1-0.10226) or about 1:9.
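The same bookkeeping can be sketched in a few lines of Python. Again, the function names are hypothetical, and the code assumes (as the text does) equal prior odds and independent pieces of evidence.

```python
import math


def weight_of_evidence(bayes_factors):
    """Sum of base-10 logs of the individual Bayes factors."""
    return sum(math.log10(k) for k in bayes_factors)


def probability_from_weight(weight):
    """Convert a base-10 log odds value back into a probability."""
    odds = 10.0 ** weight
    return odds / (1.0 + odds)


k_rubbish = 22.81 / 1.76                     # favors British
k_my_high_school = (1 / 100) / (512 / 450)   # favors American

w = weight_of_evidence([k_rubbish, k_my_high_school])
p = probability_from_weight(w)

print(f"weight of evidence = {w:.7f}")   # about -0.943
print(f"Pr(British) = {p:.5f}")          # about 0.102
```

Working in logs turns the product of likelihood ratios into a sum, so each new piece of evidence just adds (or subtracts) its own weight.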
[Update -- As a small additional exercise in combining weights of evidence, let's add the evidence from the phrases "football game" and "football field", accepting the COCA and BNC frequencies as the relevant conditional probability estimates:
Summing the corresponding log likelihood ratios along with those for "rubbish" and "my high school" gives us
log10(22.81/1.76) + log10(0.01/1.14) + log10(0.29/2.70) + log10(0.36/1.81) = -2.614634
This yields an estimate of the probability that the speaker was British of
1/(1+1/(10^-2.614634)) = 0.002422772
or odds of about 1:412 against.]
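As a rough check on the four-way sum, here is a small Python sketch. The per-million rates are the ones that appear in the sum above; matching the last two pairs to "football game" and "football field" in that order is my assumption, based on the order in which the phrases are mentioned.

```python
import math

# (label, BNC rate, COCA rate), all per million words.
# The labels on the last two rows are assumed from the order of mention.
evidence = [
    ("rubbish",        22.81, 1.76),
    ("my high school",  0.01, 1.14),
    ("football game",   0.29, 2.70),
    ("football field",  0.36, 1.81),
]

total = 0.0
for label, bnc_rate, coca_rate in evidence:
    weight = math.log10(bnc_rate / coca_rate)
    total += weight
    print(f"{label:15s} {weight:+.4f}")

odds = 10.0 ** total
p = odds / (1.0 + odds)

print(f"total weight = {total:.6f}")      # about -2.6146
print(f"Pr(British)  = {p:.9f}")          # about 0.0024
print(f"odds against = {1.0 / odds:.0f} : 1")
```

Only "rubbish" contributes a positive weight; the other three phrases each pull the total toward the American hypothesis.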
A serious attempt to decide on a statistical basis whether a speaker was British vs. American would be more complicated in various ways — it would try to use all the evidence available in transcript and/or recording, it would consider any prior odds available from background information about the choice, it would do a more careful job of estimating probabilities from small counts — but it might well combine different pieces of evidence using a version of this method, which is provably optimal if the sources of evidence are independent and the probability estimates are reliable.
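To make two of those complications concrete, here is a hedged Python sketch. The additive smoothing with alpha = 0.5 is just one possible way to handle small counts, chosen by me for illustration and not something the calculation above uses; the prior odds of 1:10 in the last line are likewise a made-up number, and all the function names are hypothetical.

```python
import math


def smoothed_rate(count, total, alpha=0.5):
    """Additive smoothing for a rare-event proportion.

    alpha=0.5 is an illustrative choice, not a recommendation.
    """
    return (count + alpha) / (total + 2 * alpha)


def log10_bayes_factor(count_b, total_b, count_a, total_a, alpha=0.5):
    """Base-10 log of the smoothed British/American likelihood ratio."""
    rate_b = smoothed_rate(count_b, total_b, alpha)
    rate_a = smoothed_rate(count_a, total_a, alpha)
    return math.log10(rate_b / rate_a)


def posterior_british(weights, prior_odds=1.0):
    """Add the log prior odds to the summed weights of evidence."""
    log_odds = math.log10(prior_odds) + sum(weights)
    odds = 10.0 ** log_odds
    return odds / (1.0 + odds)


# "my high school": 1 hit in 100 million words (BNC), 512 in 450 million (COCA)
w_raw = math.log10((1 / 100e6) / (512 / 450e6))
w_smooth = log10_bayes_factor(1, 100_000_000, 512, 450_000_000)
w_rubbish = math.log10(22.81 / 1.76)

print(f"raw weight      = {w_raw:+.4f}")
print(f"smoothed weight = {w_smooth:+.4f}")   # less extreme than the raw value
print(f"equal priors    : {posterior_british([w_rubbish, w_smooth]):.4f}")
print(f"prior odds 1:10 : {posterior_british([w_rubbish, w_smooth], 0.1):.4f}")
```

The point of the smoothing is that a single BNC hit is a shaky basis for a rate estimate, so the smoothed weight is pulled somewhat toward zero.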
*In fact I would be foolish to place so much weight on one piece of evidence; please chalk this whole paragraph up to rhetorical over-reaching. Seriously, I was thinking about writing something on the great Chin-Stroking-vs.-Number-Crunching debate, but I got distracted by rubbish.