Debate words


As I mentioned a few days ago ("More political text analytics", 4/15/2016), I've now got more-or-less cleaned-up text from the 21 debates held so far in the current U.S. presidential campaign.

[Update — with some help from Chris Culy, I've done additional clean-up on the debate texts, and therefore have revised the numbers in this post slightly, as of 4/23/2016. None of the numbers have changed a lot, and none of the qualitative implications have changed at all.]

If we focus on the contributions to those 21 debates of the five remaining U.S. presidential candidates, we get 210,103 words in total, divided up like this:

Clinton 56,989
Sanders 50,649
Trump 41,039
Cruz 32,654
Kasich 28,772

This morning I'll add a few small examples of the kind of information that can be derived from a dataset of this type.

There are some obvious topical clues in the relative frequency of words and phrases.

In the table below, the first number in each cell is the raw count, and the number in parentheses after it is that count expressed as a frequency per million words produced by the candidate. Thus Bernie Sanders used the phrase "Wall Street" 127 times, and normalized by his 50,649 total words that's

1000000*127/50649 ~ 2507 per million

           Wall_Street    China     Mexico   health_care    Ohio
Clinton     49  (860)   13  (228)   1  (18)  48  (842)     2   (35)
Cruz         8  (245)   12  (367)   2  (61)  11  (337)     0    (0)
Kasich       2   (70)   11  (382)   2  (70)   6  (209)    80 (2780)
Sanders    127 (2507)   25  (494)   6 (118)  62 (1224)     1   (20)
Trump        4   (97)   65 (1584)  36 (877)   6  (146)     4   (97)
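
For anyone who wants to check the arithmetic, here is a minimal sketch of the per-million normalization, using the word totals and the "Wall Street" counts given above. The function name per_million and the dictionary names are just illustrative choices, and rounding to the nearest integer is an assumption that happens to reproduce the figures in the table.

```python
# Word totals per candidate, as listed earlier in the post.
TOTAL_WORDS = {
    "Clinton": 56989,
    "Cruz": 32654,
    "Kasich": 28772,
    "Sanders": 50649,
    "Trump": 41039,
}

# Raw counts of "Wall Street", taken from the first column of the table.
WALL_STREET = {"Clinton": 49, "Cruz": 8, "Kasich": 2, "Sanders": 127, "Trump": 4}


def per_million(count: int, total_words: int) -> int:
    """Express a raw count as a frequency per million words (rounded)."""
    return round(1_000_000 * count / total_words)


if __name__ == "__main__":
    for name, count in WALL_STREET.items():
        rate = per_million(count, TOTAL_WORDS[name])
        print(f"{name:<8} {count:>4} ({rate})")
```

Normalizing this way matters because the candidates' totals differ by a factor of about two, so raw counts alone would overstate the habits of whoever talked the most.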

And here are some slightly less obvious stylistic features:

            try       note    secondly    issue    tremendous  very_very
Clinton    91 (1597)  1  (18)   6 (105)  23  (404)   1   (18)   0    (0)
Cruz       11  (337) 24 (735)   2  (61)  16  (490)   0    (0)  11  (337)
Kasich      9  (313)  1  (35)  21 (730)  15  (521)   1   (35)   3  (104) 
Sanders    12  (237)  0   (0)   0   (0)  78 (1540)   0    (0)   7  (138)
Trump       4   (97)  0   (0)   0   (0)   0    (0)  46 (1121)  42 (1023)
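
The raw counts themselves can be approximated with a simple pattern match. The sketch below is one possible way to do it, not necessarily how the numbers above were produced: it assumes that an underscore in a column label stands for whitespace between words, that matching is case-insensitive, and that only the literal word form is counted (so "tried" would not count toward "try"). The function name count_phrase is hypothetical.

```python
import re


def count_phrase(text: str, phrase: str) -> int:
    """Count non-overlapping, case-insensitive occurrences of a phrase.

    The phrase is written with underscores between words, as in the
    table headers (e.g. "Wall_Street", "very_very").
    """
    words = [re.escape(w) for w in phrase.split("_")]
    # Whole words only, separated by any run of whitespace.
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


if __name__ == "__main__":
    sample = "Wall Street is very very powerful. I said it to wall street."
    print(count_phrase(sample, "Wall_Street"))
    print(count_phrase(sample, "very_very"))
```

One consequence of counting non-overlapping matches is that a run like "very very very" counts once rather than twice. A different convention would give slightly different totals.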

And, of course, the ever-popular pronoun percentages:

         %1PS %1PP %2P  %3M  %3F  %3PP %3PN
CLINTON  4.24 2.85 1.07 0.49 0.07 1.07 1.38
CRUZ     2.73 2.06 2.01 1.03 0.22 0.77 1.32
KASICH   3.50 4.08 2.03 0.24 0.05 1.75 1.48
SANDERS  3.37 2.60 1.31 0.28 0.17 0.97 0.99
TRUMP    5.16 2.95 2.17 1.01 0.11 1.86 2.06
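
A rough sketch of how percentages like these might be computed follows. Several things here are assumptions rather than a description of the actual procedure: the reading of the column codes (1PS as first person singular, 1PP as first person plural, 2P as second person, 3M and 3F as third person masculine and feminine, 3PP as third person plural, 3PN as third person neuter), the word lists for each category, and the very crude tokenizer, which splits on anything that isn't a letter.

```python
import re
from collections import Counter

# Assumed category word lists; the post does not spell these out.
PRONOUNS = {
    "1PS": {"i", "me", "my", "mine", "myself"},
    "1PP": {"we", "us", "our", "ours", "ourselves"},
    "2P": {"you", "your", "yours", "yourself", "yourselves"},
    "3M": {"he", "him", "his", "himself"},
    "3F": {"she", "her", "hers", "herself"},
    "3PP": {"they", "them", "their", "theirs", "themselves"},
    "3PN": {"it", "its", "itself"},
}


def pronoun_percentages(text: str) -> dict:
    """Return each pronoun category as a percentage of all tokens."""
    # Crude tokenizer: lowercase runs of letters, so "I'm" -> "i", "m".
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return {category: 0.0 for category in PRONOUNS}
    counts = Counter(tokens)
    return {
        category: 100.0 * sum(counts[w] for w in words) / len(tokens)
        for category, words in PRONOUNS.items()
    }


if __name__ == "__main__":
    sample = "I think we can do it. You know he told me."
    for category, pct in pronoun_percentages(sample).items():
        print(f"{category:<4} {pct:5.2f}")
```

Because contractions are split into extra tokens, the denominators from this sketch would run a bit higher than those from a tokenizer that keeps "I'm" together, so it should not be expected to reproduce the table exactly.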

And the type-token plot based on the fixed-up transcripts:
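
For readers who want to draw such a curve themselves, here is a minimal sketch of the underlying computation: the number of distinct word types seen so far, sampled as the running token count grows. The tokenizer (lowercased runs of letters) and the sampling step are assumptions, and the function name type_token_curve is hypothetical.

```python
import re


def type_token_curve(text: str, step: int = 1000) -> list:
    """Return (tokens_so_far, distinct_types_so_far) pairs.

    A point is recorded every `step` tokens and at the final token.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0 or n == len(tokens):
            curve.append((n, len(seen)))
    return curve


if __name__ == "__main__":
    sample = "we will win and we will win big and we will keep winning"
    for n_tokens, n_types in type_token_curve(sample, step=4):
        print(n_tokens, n_types)
```

Plotting one such curve per candidate on shared axes makes the comparison fair at any given token count, which a single type/token ratio does not, since the candidates' totals are so different.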

Where the data comes from:

12 Republican Debates:

08/06/15 Cleveland OH
09/16/15 Simi Valley CA
10/28/15 Boulder CO
11/10/15 Milwaukee WI
12/15/15 Las Vegas NV
01/14/16 Charleston SC
01/28/16 Des Moines IA
02/06/16 Goffstown NH
02/14/16 Greenville SC
02/25/16 Houston TX
03/03/16 Detroit MI
03/10/16 Coral Gables FL

9 Democratic Debates:

10/13/15 Las Vegas NV
11/14/15 Des Moines IA
12/19/15 Goffstown NH
01/17/16 Charleston SC
02/04/16 Durham NH
02/11/16 Milwaukee WI
03/06/16 Flint MI
03/09/16 Miami FL
04/14/16 Brooklyn NY

 



5 Comments

  1. A. Riddell said,

    April 19, 2016 @ 9:50 am

    Thank you for assembling these.

    Getting 404s for two debates: RepublicanDebate011416.txt and RepublicanDebate012816.txt

    [(myl) I think all the links are correct now. I'm sure there are some errors in the text prep here and there, but (I hope) nothing major.]

  2. Lane said,

    April 19, 2016 @ 10:01 am

    So what does it say that Trump uses pretty much every pronoun more than his rivals? I'm guessing it's his general preference for basic vocabulary.

    [(myl) That's one way to look at it. But it's also his quasi-conversational style, e.g.

    The fact is, there is a big overhang. There’s a big question mark on your head. And you can’t do that to the party. You really can’t. You can’t do that to the party. You have to have certainty. Even if it was a one percent chance, and it’s far greater than one percent because (inaudible)

    or

    Our healthcare is a horror show. Obamacare, we’re going to repeal it and replace it. We have no borders. Our vets are being treated horribly. Illegal immigration is beyond belief. Our country is being run by incompetent people. And yes, I am angry.

    There are some non-basic words in there, but … ]

  3. Charles Antaki said,

    April 19, 2016 @ 4:55 pm

    Just on the pronouns – they're all referentially tricky to some degree of course (bar "I", usually). But my guess is that Trump will be using "we" much more to refer to himself ("we are going to win" etc.) than, say, Sanders who might, I'm guessing again, use it more in the "we, the people" sense. But difficult to tag for a count, unless manually.

  4. ricki said,

    April 21, 2016 @ 8:34 am

    Is there any chance the 'frank' finding is tangled up in Dodd-Frank? I'm assuming not, but was curious!

    [(myl) Actually you're right — of Clinton's 29 instances of frank, 23 are Dodd-Frank and 4 are Barney Frank… So it was really a topical signifier, and I've replaced it with try.]

  5. A. Riddell said,

    April 22, 2016 @ 3:33 pm

    Hitting a 404 for: http://languagelog.ldc.upenn.edu/myl/RepublicanDebate021316.txt

    Thanks again for collecting these.
