I just saw your piece on these curious QWERTY findings regarding positivity. I see you used the ANEW data set, which has prompted me to point you to our much larger data set (just over 10,000 words) for positivity/happiness (I've been meaning to write for a while). I've attached the scores; the relevant paper is here:
Dodds et al., "Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter", PLoS ONE 2011.
There's a lot in this paper, which focuses on measuring happiness in real time using Twitter. But a few quick key points:
1. Our scores correlate very well with those of the original ANEW study.
2. We compiled our list by merging the 5,000 most frequently used words from each of four corpora: Twitter, the NY Times (20 years), music lyrics (50+ years), and Google Books (100s of years).
3. Because of point 2, our coverage of texts is much more complete.
4. We found a natural way to select stop words.
5. The resulting instrument for measuring happiness is much improved over the previous one, which we based on the ANEW study. It feels very similar to building a better physical instrument (e.g., a microscope) and then being able to see more detail and structure.
A related shorter paper you might find of interest is here:
Kloumann et al., "Positivity of the English Language", PLoS ONE 2012.
Here we show that there is a positive bias to at least the 'atoms' of language, and that this bias is independent of frequency of usage (the Pollyanna Hypothesis).
I removed a few of the words in their list that contain non-alphabetic characters (e.g. "7th", "how's"), leaving 9912 word types (compared to 1032 in ANEW). A plot of the relationship between "right-side advantage" and "happiness" for those words is here:
The slope of the fitted regression line is -0.0008636 ± 0.0044066. This linear regression based on the (QWERTY) "right-side advantage" accounts for about four one-millionths of the variance in "happiness" scores. For what little it's worth, the p-value for the slope (the probability of doing this well by chance) is 0.845.
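The analysis above can be sketched as follows, assuming "right-side advantage" means the count of right-hand QWERTY letters minus left-hand letters in a word (a common definition in the QWERTY-effect literature; the word list and scores here are toy stand-ins, not the actual labMT data):

```python
import numpy as np

# Hand assignments on a standard QWERTY layout (one common convention).
RIGHT = set("yuiophjklnm")
LEFT = set("qwertasdfgzxcvb")

def right_side_advantage(word):
    """Right-hand QWERTY letters minus left-hand letters."""
    w = word.lower()
    return sum(c in RIGHT for c in w) - sum(c in LEFT for c in w)

# Toy (word, happiness score) pairs; keep only purely alphabetic word
# types, dropping entries like "7th" or "how's" as described above.
raw = [("laughter", 8.50), ("happy", 8.30), ("7th", 5.40),
       ("terrorist", 1.30), ("rainbow", 8.06)]
data = [(w, s) for w, s in raw if w.isalpha()]

x = np.array([right_side_advantage(w) for w, _ in data], dtype=float)
y = np.array([s for _, s in data])

# Ordinary least-squares line and the fraction of variance explained.
slope, intercept = np.polyfit(x, y, 1)
r = np.corrcoef(x, y)[0, 1]
print(f"slope={slope:.4f}, r^2={r**2:.6f}")
```

On the real labMT list the same computation gives the tiny slope and r^2 quoted above.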
Perhaps the regression should be weighted by word frequency, but I'm done for now.
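Should anyone want to try the weighted version, it is a one-argument change with numpy's polyfit (toy values again; polyfit minimizes the sum of squared weighted residuals, so passing sqrt(frequency) weights each squared residual by that word's frequency):

```python
import numpy as np

# Toy stand-ins for the labMT words (hypothetical values).
rsa = np.array([-2.0, 3.0, -5.0, -1.0])        # right-side advantage per word
happ = np.array([8.50, 8.30, 1.30, 8.06])      # happiness score per word
freq = np.array([2.0e6, 1.5e7, 4.0e5, 9.0e5])  # usage frequency per word

# Unweighted fit, then a frequency-weighted fit.
slope_u, icept_u = np.polyfit(rsa, happ, 1)
slope_w, icept_w = np.polyfit(rsa, happ, 1, w=np.sqrt(freq))
print(slope_u, slope_w)
```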
Update — more from Dodds' lab is here.