The most Trumpish (and Bushish) words
I they you Trump very great he China said me money going Mexico
Those are the top 13 words at Donald Trump's end of a vocabulary comparison with Jeb Bush. The top 13 words at Jeb Bush's end of the list are:
The state strategy government should create president American in growth of ISIS forces
Click the following links for the whole list, sorted by Trumpishness and sorted by Bushishness.
The method:
I created lists of word counts from each candidate's announcement speech, debate remarks, and (their side of) some press conferences or interviews (two for Trump, three for Bush). The total was 14,746 words for Trump, 14,429 words for Bush.
For all the words in the combined lists, I calculated the "weighted log-odds-ratio, informative Dirichlet prior", using the algorithm described on pp. 387-8 of Monroe, Colaresi & Quinn, "Fightin' Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict", Political Analysis 2009. For some earlier uses of this method, see "Obama's favored and disfavored SOTU words", 1/29/2014; and "Male and female word usage", 8/7/2014.
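For readers who want to try this, here is a minimal sketch of the weighted log-odds ratio as I understand it from Monroe et al. The post does not say which prior was used, so the prior below (the combined counts of both speakers, scaled by a hypothetical `prior_scale` parameter) is an assumption, as are the function and variable names. It should not be expected to reproduce the exact scores in the linked lists.

```python
import math
from collections import Counter


def weighted_log_odds(counts_a, counts_b, prior_scale=1.0):
    """Return {word: z-score}; positive means characteristic of speaker A.

    Assumption: the informative Dirichlet prior is the pooled word counts
    of both speakers times prior_scale. The post does not specify this.
    """
    vocab = set(counts_a) | set(counts_b)
    prior = {w: prior_scale * (counts_a.get(w, 0) + counts_b.get(w, 0))
             for w in vocab}
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    a0 = sum(prior.values())

    scores = {}
    for w in vocab:
        y_a = counts_a.get(w, 0)
        y_b = counts_b.get(w, 0)
        a_w = prior[w]
        # Log odds of the word in each sample, smoothed by the prior.
        log_odds_a = math.log((y_a + a_w) / (n_a + a0 - y_a - a_w))
        log_odds_b = math.log((y_b + a_w) / (n_b + a0 - y_b - a_w))
        delta = log_odds_a - log_odds_b
        # Approximate variance of the difference; it shrinks as counts grow,
        # which is why frequent words get weighted more strongly.
        variance = 1.0 / (y_a + a_w) + 1.0 / (y_b + a_w)
        scores[w] = delta / math.sqrt(variance)
    return scores
```

Called with two `Counter` objects of word counts, sorting the result by score in descending order gives one speaker's end of the list, and ascending order gives the other's.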
This method tells us something about the candidates' characteristic word choices. But there are lots of interesting things about how politicians (and their speechwriters) talk and write that it doesn't tell us.
I'm about to board a plane for a journey that will end at Interspeech 2015, if all goes well; so it'll be a couple of days before I can tell you more.
Victor Mair said,
September 5, 2015 @ 5:04 pm
These words bespeak a world of difference between the mentalities of the two men.
Rubrick said,
September 5, 2015 @ 5:35 pm
Wow. I was going to chide you for rearranging the top 13 in each list for humorous effect until I looked at the original lists. (I would have gone with the top 15 just so Jeb's would have ended with "for sure".)
Coby Lubliner said,
September 5, 2015 @ 5:47 pm
With some punctuation, the word sequences as they stand would make great nonsense sentences.
Yerushalmi said,
September 6, 2015 @ 1:56 am
I am fascinated by the fact that Bush uses the word "the" so much more often than Trump. What might explain such a discrepancy?
[(myl) It's not an enormous difference in proportional terms: Bush's rate is about 4.97%, while Trump's is 3.50%. This difference wins the Bushiness prize because "the" is so common that this ~40% difference gets weighted strongly. In comparison, Bush uses "should" about 4.3 times as often as Trump does (about 330% more), but the rates are only 0.32% compared with 0.07%, so the "weighted log-odds ratio" doesn't give "should" as high a Bushiness score. (A list of words with overall frequency greater than 1500 per million, sorted by Trump-to-Bush frequency ratios, is here.)
For some context on why there might be a difference in rates of "the" usage, see
"Decreasing definiteness" 1/8/2015
"Why definiteness is decreasing, part 1", 1/9/2015
"Why definiteness is decreasing, part 2", 1/10/2015
"Why definiteness is decreasing, part 3", 1/18/2015
In this case, I think it means that Trump's texts are less formal and more conversational — that might be partly an artefact of the choice of texts, but in general I think he extemporizes more often than Jeb! does.]
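To see why a modest difference in a very common word can outrank a large difference in a rarer one, here is a rough worked example. It uses the rounded percentages and sample sizes quoted in the post to back out approximate counts, and it uses a plain, unsmoothed log-odds z-score rather than the prior-weighted version, so the numbers are illustrative only. All names are hypothetical.

```python
import math

TRUMP_TOTAL = 14746  # words in Trump's sample, from the post
BUSH_TOTAL = 14429   # words in Bush's sample, from the post


def approx_count(rate_percent, total):
    """Back out an approximate count from a rounded percentage."""
    return round(rate_percent / 100 * total)


def unsmoothed_z(bush_count, trump_count):
    """Log-odds ratio (Bush vs. Trump) divided by its approximate std. error."""
    log_odds_bush = math.log(bush_count / (BUSH_TOTAL - bush_count))
    log_odds_trump = math.log(trump_count / (TRUMP_TOTAL - trump_count))
    std_error = math.sqrt(1 / bush_count + 1 / trump_count)
    return (log_odds_bush - log_odds_trump) / std_error


# Simple frequency ratios, from the rounded percentages.
the_ratio = 4.97 / 3.50
should_ratio = 0.32 / 0.07

# z-scores from the approximate counts.
the_z = unsmoothed_z(approx_count(4.97, BUSH_TOTAL),
                     approx_count(3.50, TRUMP_TOTAL))
should_z = unsmoothed_z(approx_count(0.32, BUSH_TOTAL),
                        approx_count(0.07, TRUMP_TOTAL))

print(f"the:    ratio {the_ratio:.2f}, z {the_z:.2f}")
print(f"should: ratio {should_ratio:.2f}, z {should_z:.2f}")
```

Under these assumptions "should" has the larger frequency ratio, but "the" has the larger z-score, because its much larger counts make the standard error much smaller.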
(Also: When will we start seeing articles about how Trump's overuse of "I", "me", and "Trump" means he's a narcissist?)
[(myl) George F. Will is on the case.]
bks said,
September 6, 2015 @ 10:31 am
There's something wrong with the linked Bush list: e.g. there is no entry for i at all, though there is an entry for ii and between made and alive is an entry for 1.3.
[(myl) If by "Bush list" you mean
http://languagelog.ldc.upenn.edu/myl/BushTrump.out
then
i 666 (45164.8) 356 (24672.5) 1022 (35030) 5.584
is the last line (line 3243), just as it's the first line in
http://languagelog.ldc.upenn.edu/myl/TrumpBush.out
The two lists have exactly the same content, just with the lines sorted in opposite orders. Each line has 8 columns:
1. word
2. count in Trump's sample
3. frequency per million in Trump's sample
4. count in Bush's sample
5. frequency per million in Bush's sample
6. overall count (in this case, sum of (2) and (4))
7. overall frequency per million
8. weighted log-odds ratio
]
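The column layout above can be read with a few lines of code. This is a sketch that assumes the fields are whitespace-separated and that the per-million values are wrapped in parentheses, as in the sample line quoted above; the `WordRow` and `parse_row` names are made up for illustration.

```python
from typing import NamedTuple


class WordRow(NamedTuple):
    word: str
    trump_count: int
    trump_per_million: float
    bush_count: int
    bush_per_million: float
    total_count: int
    total_per_million: float
    score: float  # weighted log-odds ratio


def parse_row(line: str) -> WordRow:
    """Parse one 8-column line; assumes whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}: {line!r}")
    word, t_n, t_f, b_n, b_f, all_n, all_f, score = fields
    return WordRow(
        word,
        int(t_n), float(t_f.strip("()")),
        int(b_n), float(b_f.strip("()")),
        int(all_n), float(all_f.strip("()")),
        float(score),
    )


row = parse_row("i 666 (45164.8) 356 (24672.5) 1022 (35030) 5.584")
# Column 6 should be the sum of columns 2 and 4.
assert row.total_count == row.trump_count + row.bush_count
```

As a sanity check, 666 occurrences in 14,746 words works out to about 45,165 per million, which matches column 3 of the sample line.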
bratschegirl said,
September 6, 2015 @ 2:50 pm
I was so hoping that Trump's candidacy would lead to multiple appearances of "YOOOOOGE!" Alas, it has not been so.
Peter Taylor said,
September 6, 2015 @ 4:07 pm
For example, by restricting it to English-language interviews it completely brushes under the carpet the different attitudes of the two candidates to the Spanish language.
D.O. said,
September 6, 2015 @ 10:41 pm
Because only two people are compared, it might make sense to use some external source for the base rates. Just saying.
[(myl) There's plenty of data out there, and the algorithms are easy to code. So go to it!]
D.O. said,
September 6, 2015 @ 11:13 pm
For some reason, the log-odds ratio seems to systematically increase with the frequency of words. This trend can be reined in with the ad hoc procedure of dividing the log-odds ratio by the square of the log of frequency (Don't ask me what this means, I am just trying to have some fun). After this rescaling here's the baker's dozen from Mr. Trump:
trump china they very money great nice mexico wanted said trade building understand
and Mr. Bush:
strategy growth state forces american create government florida power standards economic federal income
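A sketch of the rescaling described in this comment, for anyone who wants to play with it. The comment does not say which frequency or which log base was used, so the choices here (overall frequency per million, natural log) are assumptions, and the function name is made up.

```python
import math


def rescaled_score(score, freq_per_million):
    """Divide a weighted log-odds score by the squared log of word frequency.

    Assumptions: frequency is the overall rate per million words and the
    log is natural; the comment leaves both unspecified.
    """
    if freq_per_million <= 1:
        # log would be zero or negative, so the ratio is not meaningful.
        raise ValueError("frequency per million must be greater than 1")
    return score / math.log(freq_per_million) ** 2


# Example with the values from the "i" line quoted earlier in the thread.
print(rescaled_score(5.584, 35030))
```

Because the divisor grows with frequency, very common words like "i" are pulled down relative to mid-frequency content words, which is consistent with the two rescaled lists above.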
[(myl) The relationship of the weighted log-odds ratio to word frequency is not mysterious, but baked into the algorithm for a well-defined reason — see the paper.]
D.O. said,
September 7, 2015 @ 9:03 am
If the log-odds ratio is the focus of comparison then naturally very frequent words (like function words) will get a leg up. Even if their percentage differences are not large, the sheer number of them will make the difference very much "statistically significant". Which is OK and really is what we want if the application in mind is something like trying to decide whether a text we are reading is T's or J's by sampling a handful of words. But it seems to be missing the point if we want to know more about the content of their speeches or even "framing" of the issues. Relatively small percentage differences in function words might tell us something about the speaker's register, style, and demographics or even some speech idiosyncrasies, but probably very little about the content.
That's why, I think, the best measure would be based on percentage differences (or straight log-odds or something like that), but with infrequent words suppressed because infrequent words will dominate any such measure due to vagaries of statistics. The "suppression factor" should include something akin to a Bonferroni correction to take into account the increased odds that some relatively infrequent words will rise up to large differences simply because there are lots of them. I don't have any statistically sound measure in mind, but it would be interesting to devise one…
D.O. said,
September 7, 2015 @ 10:35 am
It seems that we are missing (at least I was missing) a key piece about Mr. Trump's and Mr. Bush's speaking style. Mr. Bush is using significantly more words than Mr. Trump. If the typical number of words is calculated by perplexity, then Mr. Trump used 314 words against Mr. Bush's 441. This should be included in estimates of relative prominence of different words, only I am not sure how…
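One common way to get a "typical number of words" from perplexity is to exponentiate the Shannon entropy of the unigram distribution. The comment does not spell out how its figures were computed, so the sketch below is an assumption about the method and is not claimed to reproduce the 314 and 441 quoted above.

```python
import math
from collections import Counter


def unigram_perplexity(counts):
    """exp(entropy) of the word distribution: an 'effective vocabulary size'.

    Assumption: perplexity here means the unigram perplexity of the
    speaker's own word-frequency distribution.
    """
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total)
                   for c in counts.values() if c > 0)
    return math.exp(entropy)


# Four equally likely words give a perplexity of 4; skewing the
# distribution toward one word lowers it.
uniform = Counter({"a": 5, "b": 5, "c": 5, "d": 5})
skewed = Counter({"a": 17, "b": 1, "c": 1, "d": 1})
print(unigram_perplexity(uniform), unigram_perplexity(skewed))
```

A speaker who leans heavily on a small set of words gets a lower value than one who spreads usage more evenly, even when both use the same number of distinct words.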
richardelguru said,
September 8, 2015 @ 6:31 am
@ Coby Lubliner
You might be right about the 'great nonsense sentences', but the Trumpal one seems to me to be the most cogent statement of his potential policies to date.
Swiss said,
September 16, 2015 @ 3:43 pm
Interesting! Where can I find Hillary Clinton's top 13 words?