## Debate words

As I mentioned a few days ago ("More political text analytics", 4/15/2016), I've now got more-or-less cleaned-up text from the 21 debates held so far in the current U.S. presidential campaign.

[Update — with some help from Chris Culy, I've done additional clean-up on the debate texts, and therefore have revised the numbers in this post slightly, as of 4/23/2016. None of the numbers have changed a lot, and none of the qualitative implications have changed at all.]

If we focus on the contributions to those 21 debates of the five remaining U.S. presidential candidates, we get 210,103 words in total, divided up like this:

Clinton  56,989
Sanders  50,649
Trump    41,039
Cruz     32,654
Kasich   28,772

This morning I'll add a few small examples of the kind of information that can be derived from a dataset of this type.

There are some obvious topical clues in the relative frequency of words and phrases.

In the table below, the first number in each column is the literal count, and the number in parentheses after it is that count expressed as a frequency per million words produced by the candidate. Thus Bernie Sanders used the phrase "Wall Street" 127 times, and normalized by his 50,649 total words that's

1000000*127/50649 ~ 2507 per million

           Wall_Street    China     Mexico   health_care    Ohio
Clinton     49  (860)   13  (228)   1  (18)  48  (842)     2   (35)
Cruz         8  (245)   12  (367)   2  (61)  11  (337)     0    (0)
Kasich       2   (70)   11  (382)   2  (70)   6  (209)    80 (2780)
Sanders    127 (2507)   25  (494)   6 (118)  62 (1224)     1   (20)
Trump        4   (97)   65 (1584)  36 (877)   6  (146)     4   (97)
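
The normalization in the table above is simple enough to sketch in a few lines of Python. The per-candidate word totals and the Wall_Street counts are taken from this post; the function names `per_million` and `count_phrase` are hypothetical, and the case-insensitive whole-phrase matching in `count_phrase` is an assumption, not necessarily how the counts above were produced.

```python
import re

# Total debate words per candidate, as listed earlier in the post.
TOTAL_WORDS = {
    "Clinton": 56989,
    "Cruz": 32654,
    "Kasich": 28772,
    "Sanders": 50649,
    "Trump": 41039,
}

# Raw counts of "Wall Street" from the first column of the table.
WALL_STREET_COUNTS = {
    "Clinton": 49,
    "Cruz": 8,
    "Kasich": 2,
    "Sanders": 127,
    "Trump": 4,
}


def per_million(count, total_words):
    """Express a raw count as a rounded frequency per million words."""
    return round(1_000_000 * count / total_words)


def count_phrase(text, phrase):
    """Count occurrences of a phrase in text.

    Assumption: matching is case-insensitive, anchored on word
    boundaries, and tolerant of any whitespace between the words.
    """
    words = [re.escape(w) for w in phrase.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Rebuild the Wall_Street column: raw count, then per-million rate.
for name, count in WALL_STREET_COUNTS.items():
    rate = per_million(count, TOTAL_WORDS[name])
    print(f"{name:<8} {count:>4} ({rate})")
```

Given a candidate's full debate text as a string, `per_million(count_phrase(text, "Wall Street"), total_words)` would produce one cell of the table, under the matching assumptions stated above.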


And here are some slightly less obvious stylistic features:

            try       note    secondly    issue    tremendous  very_very
Clinton    91 (1597)  1  (18)   6 (105)  23  (404)   1   (18)   0    (0)
Cruz       11  (337) 24 (735)   2  (61)  16  (490)   0    (0)  11  (337)
Kasich      9  (313)  1  (35)  21 (730)  15  (521)   1   (35)   3  (104)
Sanders    12  (237)  0   (0)   0   (0)  78 (1540)   0    (0)   7  (138)
Trump       4   (97)  0   (0)   0   (0)   0    (0)  46 (1121)  42 (1023)


And, of course, the ever-popular pronoun percentages:

         %1PS %1PP %2P  %3M  %3F  %3PP %3PN
CLINTON  4.24 2.85 1.07 0.49 0.07 1.07 1.38
CRUZ     2.73 2.06 2.01 1.03 0.22 0.77 1.32
KASICH   3.50 4.08 2.03 0.24 0.05 1.75 1.48
SANDERS  3.37 2.60 1.31 0.28 0.17 0.97 0.99
TRUMP    5.16 2.95 2.17 1.01 0.11 1.86 2.06
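
For anyone who wants to try this kind of count on the transcripts, here is a minimal sketch of how pronoun percentages might be computed. Everything in it is an assumption rather than the method behind the table above: the column labels are read as first person singular (1PS), first person plural (1PP), second person (2P), third person masculine (3M), feminine (3F), plural (3PP), and neuter (3PN); the word lists are my own guess at what each class contains; and contractions such as "I'm" or "it's" are counted under the pronoun before the apostrophe.

```python
import re
from collections import Counter

# Assumed pronoun classes; the post does not spell out its word lists.
PRONOUN_CLASSES = {
    "1PS": {"i", "me", "my", "mine", "myself"},
    "1PP": {"we", "us", "our", "ours", "ourselves"},
    "2P": {"you", "your", "yours", "yourself", "yourselves"},
    "3M": {"he", "him", "his", "himself"},
    "3F": {"she", "her", "hers", "herself"},
    "3PP": {"they", "them", "their", "theirs", "themselves"},
    "3PN": {"it", "its", "itself"},
}


def tokenize(text):
    """Lowercase the text and return word tokens.

    Assumption: a contraction is one token, reduced to the part
    before the apostrophe ("I'm" -> "i", "can't" -> "can").
    """
    text = text.lower().replace("\u2019", "'")
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    return [w.split("'")[0] for w in words]


def pronoun_percentages(text):
    """Return each pronoun class as a percentage of all word tokens."""
    tokens = tokenize(text)
    total = len(tokens)
    counts = Counter(tokens)
    result = {}
    for label, words in PRONOUN_CLASSES.items():
        hits = sum(counts[w] for w in words)
        result[label] = 100.0 * hits / total if total else 0.0
    return result
```

Note that a surface count like this cannot tell apart different referents of the same form, so a "we" meaning the speaker's campaign and a "we" meaning the country land in the same bucket.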


And the type-token plot based on the fixed-up transcripts:
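
A type-token curve of this kind tracks how many distinct word forms (types) have been seen as a function of how many running words (tokens) have been read. The sketch below is one way to compute the points for such a curve; the function name `type_token_curve`, the sampling `step`, and the simple lowercase tokenization are all assumptions for illustration, not a description of how the plot was actually made.

```python
import re


def type_token_curve(text, step=1000):
    """Return (tokens_read, types_seen) pairs sampled every `step` tokens.

    Assumption: tokens are lowercase runs of letters and apostrophes,
    and the final point is always included so the curve reaches the end.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0 or n == len(tokens):
            curve.append((n, len(seen)))
    return curve
```

Running this on each candidate's concatenated debate text and plotting the pairs gives one curve per candidate; a curve that rises more steeply reflects more new vocabulary per word spoken.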

Where the data comes from:

12 Republican Debates:

 08/06/15 Cleveland OH
 09/16/15 Simi Valley CA
 10/28/15 Boulder CO
 11/10/15 Milwaukee WI
 12/15/15 Las Vegas NV
 01/14/16 Charleston SC
 01/28/16 Des Moines IA
 02/06/16 Goffstown NH
 02/13/16 Greenville SC
 02/25/16 Houston TX
 03/03/16 Detroit MI
 03/10/16 Coral Gables FL

9 Democratic Debates:

 10/13/15 Las Vegas NV
 11/14/15 Des Moines IA
 12/19/15 Goffstown NH
 01/17/16 Charleston SC
 02/04/16 Durham NH
 02/11/16 Milwaukee WI
 03/06/16 Flint MI
 03/09/16 Miami FL
 04/14/16 Brooklyn NY

1. ### A. Riddell said,

April 19, 2016 @ 9:50 am

Thank you for assembling these.

Getting 404s for two debates: RepublicanDebate011416.txt and RepublicanDebate012816.txt

[(myl) I think all the links are correct now. I'm sure there are some errors in the text prep here and there, but (I hope) nothing major.]

2. ### Lane said,

April 19, 2016 @ 10:01 am

So what does it say that Trump uses pretty much every pronoun more than his rivals? I'm guessing it's his general preference for basic vocabulary.

[(myl) That's one way to look at it. But it's also his quasi-conversational style, e.g.

The fact is, there is a big overhang. There’s a big question mark on your head. And you can’t do that to the party. You really can’t. You can’t do that to the party. You have to have certainty. Even if it was a one percent chance, and it’s far greater than one percent because (inaudible)

or

Our healthcare is a horror show. Obamacare, we’re going to repeal it and replace it. We have no borders. Our vets are being treated horribly. Illegal immigration is beyond belief. Our country is being run by incompetent people. And yes, I am angry.

There are some non-basic words in there, but … ]

3. ### Charles Antaki said,

April 19, 2016 @ 4:55 pm

Just on the pronouns – they're all referentially tricky to some degree of course (bar "I", usually). But my guess is that Trump will be using "we" much more to refer to himself ("we are going to win" etc.) than, say, Sanders who might, I'm guessing again, use it more in the "we, the people" sense. But difficult to tag for a count, unless manually.

4. ### ricki said,

April 21, 2016 @ 8:34 am

Is there any chance the 'frank' finding is tangled up in Dodd-Frank? I'm assuming not, but was curious!

[(myl) Actually you're right — of Clinton's 29 instances of frank, 23 are Dodd-Frank and 4 are Barney Frank… So it was really a topical signifier, and I've replaced it with try.]

5. ### A. Riddell said,

April 22, 2016 @ 3:33 pm

Hitting a 404 for: http://languagelog.ldc.upenn.edu/myl/RepublicanDebate021316.txt

Thanks again for collecting these.