Sentiment analysis disappointment


A Quinnipiac Poll released on May 10 asked respondents "What is the first word that comes to mind when you think of Donald Trump?"  46 words were used by 5 or more respondents. The full list, with the number of responses for each word, is here — the top 15 words were:

idiot         39
incompetent   31
liar          30
leader        25
unqualified   25
president     22
strong        21
businessman   18
ignorant      16
egotistical   15
asshole       13
stupid        13
arrogant      12
trying        12
bully         11

For other reasons, I've recently been gathering word-linked information about features like frequency, concreteness, positive vs. negative valence, etc. So I thought it would be interesting to look at the (obviously bimodal) distribution of positivity found in this list, and perhaps the distributions of some more subtle properties as well.

What I found was disappointing.

Some of the relevant lists omit a surprisingly high percentage of these words — for example, the excellent Hedonometer list compiled by Peter Dodds and others lacks any entry for 19 of the 46 words (41%):

incompetent, unqualified, egotistical, arrogant, bully, narcissist, dishonest, American, bigot, buffoon, con-man, despicable, dictator, blowhard, decisive, embarrassment, inexperienced, negotiator, patriotism
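One way to make the coverage worry concrete is to measure it both by word type and by response count, since a missing word that many respondents chose matters more than a rare one. The sketch below is only an illustration: it uses the top 15 counts and the 19 missing words quoted above, and the names `top15_counts`, `hedonometer_missing`, and `coverage` are invented here, not taken from any of the lexicon tools.

```python
# Response counts for the top 15 words, copied from the list above.
top15_counts = {
    "idiot": 39, "incompetent": 31, "liar": 30, "leader": 25,
    "unqualified": 25, "president": 22, "strong": 21, "businessman": 18,
    "ignorant": 16, "egotistical": 15, "asshole": 13, "stupid": 13,
    "arrogant": 12, "trying": 12, "bully": 11,
}

# The 19 words reported above as having no Hedonometer entry.
hedonometer_missing = {
    "incompetent", "unqualified", "egotistical", "arrogant", "bully",
    "narcissist", "dishonest", "American", "bigot", "buffoon", "con-man",
    "despicable", "dictator", "blowhard", "decisive", "embarrassment",
    "inexperienced", "negotiator", "patriotism",
}


def coverage(counts, missing):
    """Return (type coverage, token coverage) as fractions between 0 and 1."""
    covered = [word for word in counts if word not in missing]
    type_cov = len(covered) / len(counts)
    token_cov = sum(counts[word] for word in covered) / sum(counts.values())
    return type_cov, token_cov


type_cov, token_cov = coverage(top15_counts, hedonometer_missing)
print(f"top-15 word types covered: {type_cov:.1%}")
print(f"top-15 responses covered:  {token_cov:.1%}")
```

Five of the top 15 words are among the missing ones, so even the most frequent responses are only partly covered. Running the same function on all 46 words would need the full list, which is not reproduced here.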

If we ignore the missing words and plot the distribution of "positivity" from this list (on a 9-point scale from maximally negative to maximally positive) we get this:

Makes sense, I guess. But I'm still worried about the large number of missing words.

The information in more complete lists is often oddly inconsistent. Thus the SentiWordNet entry for arrogant gives it a "PosScore" of 0.5 and a "NegScore" of 0.375. The entry for egotistical has PosScore=0.25 and NegScore=0, and so does the entry for embarrassment. The entry for leader has 0 and 0.
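To see why these entries look odd, it helps to collapse each pair into a single number. The sketch below assumes the simplest possible reduction, PosScore minus NegScore; that is my choice for illustration, not something SentiWordNet prescribes, and the dictionary holds only the four pairs quoted above.

```python
# (PosScore, NegScore) pairs as quoted in the paragraph above.
sentiwordnet_scores = {
    "arrogant": (0.5, 0.375),
    "egotistical": (0.25, 0.0),
    "embarrassment": (0.25, 0.0),
    "leader": (0.0, 0.0),
}


def net_score(pos, neg):
    """Assumed reduction: positive score minus negative score."""
    return pos - neg


net = {word: net_score(pos, neg) for word, (pos, neg) in sentiwordnet_scores.items()}

for word, score in net.items():
    print(f"{word:<14} {score:+.3f}")
```

Under this reduction arrogant, egotistical, and embarrassment all come out mildly positive, while leader comes out exactly neutral, which is roughly the reverse of what the poll context suggests.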

Those are not the only lists I checked, nor the only disappointments I found. The experience leaves me with the feeling that the domains of "sentiment analysis", "opinion mining", etc., could use some more work.

But that's all I have time to report this morning.

Update — EMOLEX is missing 18 of the 46 words:

unqualified, strong, businessman, trying, business, narcissist, great, racist, American, smart, buffoon, con-man, different, rich, blowhard, decisive, mental, patriotism

But if I take the 10 categories for which EMOLEX words are marked (as 0 or 1)

anger anticipation disgust fear joy negative positive sadness surprise trust

and weight each by the counts of the words for which they're marked, we get:

anger  anticipation  disgust  fear  joy  negative  positive  sadness  surprise  trust
  132            36      175    64   36       258        88       64        37     74

Or graphically:

Again, I'd feel better about the result if 39% of the words weren't missing.
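The weighting behind the category totals above can be sketched as follows. The response counts come from the poll list, but the 0/1 marks in `toy_marks` are invented purely to show the arithmetic and are not the real EMOLEX entries; the function name `weighted_totals` is also hypothetical. Words with no lexicon entry, such as strong, are skipped and reported, which is exactly where the missing 39% drops out.

```python
CATEGORIES = [
    "anger", "anticipation", "disgust", "fear", "joy",
    "negative", "positive", "sadness", "surprise", "trust",
]

# Response counts copied from the poll list above.
counts = {"idiot": 39, "liar": 30, "leader": 25, "strong": 21}

# INVENTED marks for illustration only, not the actual EMOLEX values.
# Each word maps to the set of categories marked 1; all others are 0.
toy_marks = {
    "idiot": {"anger", "disgust", "negative"},
    "liar": {"anger", "disgust", "negative"},
    "leader": {"positive", "trust"},
}


def weighted_totals(counts, marks):
    """Sum response counts into each category a word is marked for."""
    totals = {category: 0 for category in CATEGORIES}
    skipped = []
    for word, n in counts.items():
        if word not in marks:
            skipped.append(word)  # no lexicon entry, so no contribution
            continue
        for category in marks[word]:
            totals[category] += n
    return totals, skipped


totals, skipped = weighted_totals(counts, toy_marks)
print(totals)
print("skipped:", skipped)
```

Reporting the skipped words alongside the totals makes it easier to judge how much of the response mass the totals actually reflect.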

1 Comment

  1. Joshua K. said,

    June 12, 2017 @ 8:05 pm

    For comparison, here's a link to a similar poll that Quinnipiac conducted in August 2015, asking not only for the "first word that comes to your mind" when you think of Donald Trump, but also for Hillary Clinton and Jeb Bush.
