As I mentioned a few days ago ("More political text analytics", 4/15/2016), I've now got more-or-less cleaned-up text from the 21 debates held so far in the current U.S. presidential campaign.

[Update — with some help from Chris Culy, I've done additional clean-up on the debate texts, and therefore have revised the numbers in this post slightly, as of 4/23/2016. None of the numbers have changed a lot, and none of the qualitative implications have changed at all.]

If we focus on the contributions to those 21 debates of the five remaining U.S. presidential candidates, we get 210,103 words in total, divided up like this:

 Clinton   56,989
 Sanders   50,649
 Trump     41,039
 Cruz      32,654
 Kasich    28,772

This morning I'll add a few small examples of the kind of information that can be derived from a dataset of this type.

There are some obvious topical clues in the relative frequency of words and phrases.

In the table below, the first number in each column is the literal count, and the number in parentheses following it is the count expressed as a frequency per million words produced by the candidate. Thus Bernie Sanders used the phrase "Wall Street" 127 times, and normalized by his 50,649 total words that's

1000000*127/50649 ~ 2507 per million

           Wall_Street    China     Mexico   health_care    Ohio
Clinton     49  (860)   13  (228)   1  (18)  48  (842)     2   (35)
Cruz         8  (245)   12  (367)   2  (61)  11  (337)     0    (0)
Kasich       2   (70)   11  (382)   2  (70)   6  (209)    80 (2780)
Sanders    127 (2507)   25  (494)   6 (118)  62 (1224)     1   (20)
Trump        4   (97)   65 (1584)  36 (877)   6  (146)     4   (97)
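
The counting and normalization behind the table above can be sketched in a few lines of Python. This is only an illustration: the tokenizer and the function names (tokenize, phrase_count, per_million) are my own assumptions, not the scripts actually used to produce these numbers.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenization)."""
    return re.findall(r"[a-z']+", text.lower())

def phrase_count(tokens, phrase):
    """Count occurrences of a word or multi-word phrase in a token list."""
    target = tokenize(phrase)
    n = len(target)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)

def per_million(count, total_words):
    """Express a raw count as a rounded frequency per million words."""
    return round(1_000_000 * count / total_words)

# Sanders: 127 uses of "Wall Street" in 50,649 words
sanders_rate = per_million(127, 50649)
```

Applied to the figures quoted above, per_million(127, 50649) gives the 2507 shown in the Sanders row.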


And here are some slightly less obvious stylistic features:

            try       note    secondly    issue    tremendous  very_very
Clinton    91 (1597)  1  (18)   6 (105)  23  (404)   1   (18)   0    (0)
Cruz       11  (337) 24 (735)   2  (61)  16  (490)   0    (0)  11  (337)
Kasich      9  (313)  1  (35)  21 (730)  15  (521)   1   (35)   3  (104)
Sanders    12  (237)  0   (0)   0   (0)  78 (1540)   0    (0)   7  (138)
Trump       4   (97)  0   (0)   0   (0)   0    (0)  46 (1121)  42 (1023)


And, of course, the ever-popular pronoun percentages:

         %1PS %1PP %2P  %3M  %3F  %3PP %3PN
CLINTON  4.24 2.85 1.07 0.49 0.07 1.07 1.38
CRUZ     2.73 2.06 2.01 1.03 0.22 0.77 1.32
KASICH   3.50 4.08 2.03 0.24 0.05 1.75 1.48
SANDERS  3.37 2.60 1.31 0.28 0.17 0.97 0.99
TRUMP    5.16 2.95 2.17 1.01 0.11 1.86 2.06
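
For readers who want to try something similar, here is a rough sketch of how pronoun percentages like these could be computed. Everything specific in it is an assumption on my part: I am reading the column labels as first person singular (1PS), first person plural (1PP), second person (2P), third person masculine (3M), third person feminine (3F), third person plural (3PP), and third person neuter (3PN), and the word lists and tokenizer are hypothetical, so the output will not necessarily match the table.

```python
import re

# Hypothetical pronoun classes; the exact forms counted for the table are not given.
PRONOUN_CLASSES = {
    "1PS": {"i", "me", "my", "mine", "myself"},
    "1PP": {"we", "us", "our", "ours", "ourselves"},
    "2P": {"you", "your", "yours", "yourself", "yourselves"},
    "3M": {"he", "him", "his", "himself"},
    "3F": {"she", "her", "hers", "herself"},
    "3PP": {"they", "them", "their", "theirs", "themselves"},
    "3PN": {"it", "its", "itself"},
}

def pronoun_percentages(text):
    """Return each pronoun class as a percentage of all word tokens."""
    # Keep contractions such as "I'm" together as one token.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = {label: 0 for label in PRONOUN_CLASSES}
    for tok in tokens:
        base = tok.split("'")[0]  # "i'm" -> "i", "it's" -> "it"
        for label, forms in PRONOUN_CLASSES.items():
            if base in forms:
                counts[label] += 1
                break
    total = len(tokens)
    if total == 0:
        return {label: 0.0 for label in counts}
    return {label: 100.0 * c / total for label, c in counts.items()}
```

Note that a simple word list like this cannot tell possessive "her" from object "her", or "it's" as "it is" from "it has"; for percentages at this level of granularity that does not matter, since both readings fall in the same class.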


And the type-token plot based on the fixed-up transcripts:
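
A type-token plot tracks how many distinct word forms (types) a speaker has produced as a function of the running word count (tokens). A minimal sketch of the underlying computation, assuming the text has already been split into a list of tokens (the function name and the step parameter are my own, chosen for illustration):

```python
def type_token_curve(tokens, step=1000):
    """Return (tokens_seen, distinct_types) pairs, sampled every `step` tokens.

    The final point is always included, even if the token count
    is not a multiple of `step`.
    """
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve
```

Plotting the second element of each pair against the first, one line per candidate, gives a curve of this kind. Comparing speakers at the same token count matters, because the number of types keeps growing with sample size.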

Where the data comes from —
12 Republican Debates:

 08/06/15 Cleveland OH
 09/16/15 Simi Valley CA
 10/28/15 Boulder CO
 11/10/15 Milwaukee WI
 12/15/15 Las Vegas NV
 01/14/16 Charleston SC
 01/28/16 Des Moines IA
 02/06/16 Goffstown NH
 02/14/16 Greenville SC
 02/25/16 Houston TX
 03/03/16 Detroit MI
 03/10/16 Coral Gables FL

9 Democratic Debates:

 10/13/15 Las Vegas NV
 11/14/15 Des Moines IA
 12/19/15 Goffstown NH
 01/17/16 Charleston SC
 02/04/16 Durham NH
 02/11/16 Milwaukee WI
 03/06/16 Flint MI
 03/09/16 Miami FL
 04/14/16 Brooklyn NY

