Debate words
As I mentioned a few days ago ("More political text analytics", 4/15/2016), I've now got more-or-less cleaned-up text from the 21 debates held so far in the current U.S. presidential campaign.
[Update — with some help from Chris Culy, I've done additional clean-up on the debate texts, and therefore have revised the numbers in this post slightly, as of 4/23/2016. None of the numbers have changed a lot, and none of the qualitative implications have changed at all.]
If we focus on what the five remaining U.S. presidential candidates said in those 21 debates, we get 210,103 words in total (the sum of the per-candidate counts), divided up like this:
Clinton | 56,989 |
Sanders | 50,649 |
Trump | 41,039 |
Cruz | 32,654 |
Kasich | 28,772 |
This morning I'll add a few small examples of the kind of information that can be derived from a dataset of this type.
There are some obvious topical clues in the relative frequency of words and phrases.
In the table below, the first number in each cell is the raw count, and the number in parentheses after it is that count expressed as a frequency per million words produced by the candidate. Thus Bernie Sanders used the phrase "Wall Street" 127 times, and normalized by his 50,649 total words that's
1000000*127/50649 ~ 2507 per million
Candidate | Wall_Street | China | Mexico | health_care | Ohio |
Clinton | 49 (860) | 13 (228) | 1 (18) | 48 (842) | 2 (35) |
Cruz | 8 (245) | 12 (367) | 2 (61) | 11 (337) | 0 (0) |
Kasich | 2 (70) | 11 (382) | 2 (70) | 6 (209) | 80 (2780) |
Sanders | 127 (2507) | 25 (494) | 6 (118) | 62 (1224) | 1 (20) |
Trump | 4 (97) | 65 (1584) | 36 (877) | 6 (146) | 4 (97) |
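For anyone who wants to check the arithmetic, here is a minimal Python sketch of the per-million normalization, using the "Wall Street" counts and the per-candidate word totals given in this post. The function name `per_million` and the dictionary names are illustrative only, and rounding to the nearest integer is an assumption that happens to reproduce the figures in the table.

```python
def per_million(count: int, total_words: int) -> int:
    """Express a raw count as a frequency per million words (nearest integer)."""
    return round(1_000_000 * count / total_words)


# Total words per candidate, from the table near the top of the post.
TOTAL_WORDS = {
    "Clinton": 56_989,
    "Cruz": 32_654,
    "Kasich": 28_772,
    "Sanders": 50_649,
    "Trump": 41_039,
}

# Raw counts of the phrase "Wall Street", from the first column of the table.
WALL_STREET = {"Clinton": 49, "Cruz": 8, "Kasich": 2, "Sanders": 127, "Trump": 4}

for name, count in WALL_STREET.items():
    print(f"{name:8s} {count:4d} ({per_million(count, TOTAL_WORDS[name])})")
```

The same function applies to every other column; only the raw counts change.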
And here are some slightly less obvious stylistic features:
Candidate | try | note | secondly | issue | tremendous | very_very |
Clinton | 91 (1597) | 1 (18) | 6 (105) | 23 (404) | 1 (18) | 0 (0) |
Cruz | 11 (337) | 24 (735) | 2 (61) | 16 (490) | 0 (0) | 11 (337) |
Kasich | 9 (313) | 1 (35) | 21 (730) | 15 (521) | 1 (35) | 3 (104) |
Sanders | 12 (237) | 0 (0) | 0 (0) | 78 (1540) | 0 (0) | 7 (138) |
Trump | 4 (97) | 0 (0) | 0 (0) | 0 (0) | 46 (1121) | 42 (1023) |
And, of course, the ever-popular pronoun percentages:
Candidate | %1PS | %1PP | %2P | %3M | %3F | %3PP | %3PN |
CLINTON | 4.24 | 2.85 | 1.07 | 0.49 | 0.07 | 1.07 | 1.38 |
CRUZ | 2.73 | 2.06 | 2.01 | 1.03 | 0.22 | 0.77 | 1.32 |
KASICH | 3.50 | 4.08 | 2.03 | 0.24 | 0.05 | 1.75 | 1.48 |
SANDERS | 3.37 | 2.60 | 1.31 | 0.28 | 0.17 | 0.97 | 0.99 |
TRUMP | 5.16 | 2.95 | 2.17 | 1.01 | 0.11 | 1.86 | 2.06 |
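Percentages of this kind can be computed roughly as in the sketch below. Everything specific in it is an assumption rather than the method used for the table: the reading of the column labels (1PS as first person singular, 3PN as third person neuter, and so on), the word forms in each class, the regex tokenization, and the choice to count a contraction such as "we're" under the pronoun before the apostrophe.

```python
import re
from collections import Counter

# Hypothetical pronoun classes; the post does not list the forms it counted.
PRONOUN_CLASSES = {
    "1PS": {"i", "me", "my", "mine", "myself"},
    "1PP": {"we", "us", "our", "ours", "ourselves"},
    "2P": {"you", "your", "yours", "yourself", "yourselves"},
    "3M": {"he", "him", "his", "himself"},
    "3F": {"she", "her", "hers", "herself"},
    "3PP": {"they", "them", "their", "theirs", "themselves"},
    "3PN": {"it", "its", "itself"},
}


def tokenize(text):
    # Lowercase, then take alphabetic runs, allowing one internal apostrophe
    # (straight or curly) so that "we're" stays a single token.
    return re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())


def pronoun_percentages(text):
    tokens = tokenize(text)
    if not tokens:
        return {label: 0.0 for label in PRONOUN_CLASSES}
    # Count each token under the part before any apostrophe ("it's" -> "it").
    bases = Counter(re.split(r"['’]", tok)[0] for tok in tokens)
    return {
        label: round(100 * sum(bases[w] for w in words) / len(tokens), 2)
        for label, words in PRONOUN_CLASSES.items()
    }


print(pronoun_percentages("I said we're ready and you know it"))
```

A simple word list like this says nothing about reference: it cannot tell a "we" that means the speaker from a "we" that means the public.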
And the type-token plot based on the fixed-up transcripts:
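A type-token curve tracks how many distinct word forms (types) a speaker has used after each successive stretch of running words (tokens). A minimal sketch follows, assuming the same simple lowercase regex tokenization as a stand-in; the post does not say how tokens were defined, and the function name `type_token_curve` and its `step` parameter are illustrative only.

```python
import re


def type_token_curve(text, step=1000):
    """Return (tokens_seen, distinct_types) pairs, sampled every `step` tokens
    and once more at the end of the text."""
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


print(type_token_curve("a b a c a b d", step=2))
```

Plotting the resulting pairs for each candidate on shared axes gives one curve per speaker. Because the candidates produced different numbers of words, the curves are best compared at the same token count rather than at their endpoints.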
Where the data comes from —
12 Republican Debates:
08/06/15 | Cleveland OH |
09/16/15 | Simi Valley CA |
10/28/15 | Boulder CO |
11/10/15 | Milwaukee WI |
12/15/15 | Las Vegas NV |
01/14/16 | Charleston SC |
01/28/16 | Des Moines IA |
02/06/16 | Goffstown NH |
02/13/16 | Greenville SC |
02/25/16 | Houston TX |
03/03/16 | Detroit MI |
03/10/16 | Coral Gables FL |
9 Democratic Debates:
10/13/15 | Las Vegas NV |
11/14/15 | Des Moines IA |
12/19/15 | Goffstown NH |
01/17/16 | Charleston SC |
02/04/16 | Durham NH |
02/11/16 | Milwaukee WI |
03/06/16 | Flint MI |
03/09/16 | Miami FL |
04/14/16 | Brooklyn NY |
A. Riddell said,
April 19, 2016 @ 9:50 am
Thank you for assembling these.
Getting 404s for two debates: RepublicanDebate011416.txt and RepublicanDebate012816.txt
[(myl) I think all the links are correct now. I'm sure there are some errors in the text prep here and there, but (I hope) nothing major.]
Lane said,
April 19, 2016 @ 10:01 am
So what does it say that Trump uses pretty much every pronoun more than his rivals? I'm guessing it's his general preference for basic vocabulary.
[(myl) That's one way to look at it. But it's also his quasi-conversational style, e.g.
The fact is, there is a big overhang. There’s a big question mark on your head. And you can’t do that to the party. You really can’t. You can’t do that to the party. You have to have certainty. Even if it was a one percent chance, and it’s far greater than one percent because (inaudible)
or
Our healthcare is a horror show. Obamacare, we’re going to repeal it and replace it. We have no borders. Our vets are being treated horribly. Illegal immigration is beyond belief. Our country is being run by incompetent people. And yes, I am angry.
There are some non-basic words in there, but … ]
Charles Antaki said,
April 19, 2016 @ 4:55 pm
Just on the pronouns – they're all referentially tricky to some degree of course (bar "I", usually). But my guess is that Trump will be using "we" much more to refer to himself ("we are going to win" etc.) than, say, Sanders who might, I'm guessing again, use it more in the "we, the people" sense. But difficult to tag for a count, unless manually.
ricki said,
April 21, 2016 @ 8:34 am
Is there any chance the 'frank' finding is tangled up in Dodd-Frank? I'm assuming not, but was curious!
[(myl) Actually you're right — of Clinton's 29 instances of frank, 23 are Dodd-Frank and 4 are Barney Frank… So it was really a topical signifier, and I've replaced it with try.]
A. Riddell said,
April 22, 2016 @ 3:33 pm
Hitting a 404 for: http://languagelog.ldc.upenn.edu/myl/RepublicanDebate021316.txt
Thanks again for collecting these.