Sentiment analysis disappointment
A Quinnipiac Poll released on May 10 asked respondents "What is the first word that comes to mind when you think of Donald Trump?" 46 words were used by 5 or more respondents. The full list, with the number of responses for each word, is here — the top 15 words were:
idiot 39, incompetent 31, liar 30, leader 25, unqualified 25, president 22, strong 21, businessman 18, ignorant 16, egotistical 15, asshole 13, stupid 13, arrogant 12, trying 12, bully 11
For other reasons, I've recently been gathering word-linked information about features like frequency, concreteness, positive vs. negative valence, etc. So I thought it would be interesting to look at the (obviously bimodal) distribution of positivity found in this list, and perhaps the distributions of some more subtle properties as well.
What I found was disappointing.
Some of the relevant lists omit a surprisingly high percentage of these words — for example, the excellent Hedonometer list compiled by Peter Dodds and others lacks any entry for 19 of the 46 words (41%):
incompetent, unqualified, egotistical, arrogant, bully, narcissist, dishonest, American, bigot, buffoon, con-man, despicable, dictator, blowhard, decisive, embarrassment, inexperienced, negotiator, patriotism
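A coverage check of this kind is easy to script. The following is a minimal sketch, not the code used for this post. It uses only the top 15 words and counts from the poll. The `lexicon_words` set is an assumption standing in for a real lexicon lookup: it holds just those top-15 words that do not appear in the missing list above. The name `coverage` is made up for illustration.

```python
# Top 15 poll responses and their counts, as listed in the post.
poll_counts = {
    "idiot": 39, "incompetent": 31, "liar": 30, "leader": 25,
    "unqualified": 25, "president": 22, "strong": 21, "businessman": 18,
    "ignorant": 16, "egotistical": 15, "asshole": 13, "stupid": 13,
    "arrogant": 12, "trying": 12, "bully": 11,
}

# Assumption: a stand-in for a real lexicon lookup. It contains only the
# top-15 words that are NOT in the missing list quoted above.
lexicon_words = {
    "idiot", "liar", "leader", "president", "strong", "businessman",
    "ignorant", "asshole", "stupid", "trying",
}


def coverage(counts, lexicon):
    """Return the missing words, the share of distinct words missing,
    and the share of responses that used a missing word."""
    missing = [word for word in counts if word not in lexicon]
    type_share = len(missing) / len(counts)
    token_share = sum(counts[w] for w in missing) / sum(counts.values())
    return missing, type_share, token_share


missing, type_share, token_share = coverage(poll_counts, lexicon_words)
print("missing:", ", ".join(missing))
print(f"{type_share:.0%} of words, {token_share:.0%} of responses")
```

Reporting the share of responses alongside the share of distinct words shows how much of the actual data a lexicon fails to cover, which can differ from the vocabulary gap.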
If we ignore the missing words and plot the distribution of "positivity" from this list (on a 9-point scale from maximally negative to maximally positive) we get this:
Makes sense, I guess. But I'm still worried about the large number of missing words.
The information in more complete lists is often oddly inconsistent. For example, the SentiWordNet entry for arrogant gives it a "PosScore" of 0.5 and a "NegScore" of 0.375. The entry for egotistical has PosScore=0.25 and NegScore=0, and so does the entry for embarrassment. The entry for leader has 0 for both scores.
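To make the inconsistency concrete, here is a small sketch that collapses each of the four quoted entries into a single net polarity, PosScore minus NegScore. The scores are the ones cited above. The `net_polarity` helper and the choice of simple subtraction are assumptions made for illustration, not something SentiWordNet prescribes.

```python
# (PosScore, NegScore) pairs exactly as quoted in the post.
scores = {
    "arrogant": (0.5, 0.375),
    "egotistical": (0.25, 0.0),
    "embarrassment": (0.25, 0.0),
    "leader": (0.0, 0.0),
}


def net_polarity(pos, neg):
    """Hypothetical helper: positive result means net positive sentiment."""
    return pos - neg


net = {word: net_polarity(pos, neg) for word, (pos, neg) in scores.items()}

for word, value in net.items():
    print(f"{word:15s} {value:+.3f}")
```

On this measure arrogant, egotistical, and embarrassment all come out net positive, while leader comes out exactly neutral.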
Those are not the only lists I checked, nor the only disappointments I found. The experience leaves me with the feeling that the domains of "sentiment analysis", "opinion mining", etc., could use some more work.
But that's all I have time to report this morning.
Update — EMOLEX is missing 18 of the 46 words:
unqualified, strong, businessman, trying, business, narcissist, great, racist, American, smart, buffoon, con-man, different, rich, blowhard, decisive, mental, patriotism
But if I take the 10 categories for which EMOLEX words are marked (as 0 or 1)
anger anticipation disgust fear joy negative positive sadness surprise trust
and weight each by the counts of the words for which they're marked, we get:
anger 132, anticipation 36, disgust 175, fear 64, joy 36, negative 258, positive 88, sadness 64, surprise 37, trust 74
Or graphically:
Again, I'd feel better about the result if 39% of the words weren't missing.
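The weighting step described in this update can be sketched as follows. This is an illustration under stated assumptions: the words, counts, and marks in `toy_counts` and `toy_marks` are invented placeholders, not EMOLEX entries or poll data, and `weighted_category_totals` is a hypothetical helper. Words absent from the lexicon contribute nothing, which is why the missing words matter.

```python
CATEGORIES = [
    "anger", "anticipation", "disgust", "fear", "joy",
    "negative", "positive", "sadness", "surprise", "trust",
]

# Invented placeholder data, NOT real poll counts or EMOLEX entries.
toy_counts = {"word_a": 10, "word_b": 4, "word_c": 7}
toy_marks = {
    "word_a": {"anger", "negative"},   # categories marked 1 for word_a
    "word_b": {"positive", "trust"},   # categories marked 1 for word_b
    # word_c is deliberately absent, like a word missing from the lexicon
}


def weighted_category_totals(counts, marks):
    """Add each word's response count to every category it is marked for."""
    totals = {category: 0 for category in CATEGORIES}
    for word, n in counts.items():
        for category in marks.get(word, ()):  # missing words are skipped
            totals[category] += n
    return totals


totals = weighted_category_totals(toy_counts, toy_marks)
for category in CATEGORIES:
    print(category, totals[category])
```

Because a word can be marked for several categories at once, the category totals need not sum to the number of responses.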
Joshua K. said,
June 12, 2017 @ 8:05 pm
For comparison, here's a link to a similar poll that Quinnipiac conducted in August 2015, asking for the "first word that comes to your mind" not only for Donald Trump but also for Hillary Clinton and Jeb Bush.
https://poll.qu.edu/national/release-detail?ReleaseID=2274