For the background of this discussion, see "The QWERTY effect", 3/8/2012; "QWERTY: Failure to replicate", 3/13/2012; and "Casasanto and Jasmin on the QWERTY effect", 3/17/2012. In their reply to me, C&J make three basic points:
- "We’re not concerned with Liberman’s subjective evaluation of the QWERTY effect’s size or of our study’s importance."
- "The QWERTY effect is reliable. Replication is the best prevention against false positives. In this paper, we demonstrated the QWERTY effect *six times*: in 5 corpora (one of which we divided into 2 parts, a priori), in 3 languages, and in a large corpus of nonce words."
- "There’s a reason why scientific results go through peer review, and why analyses are not simply self-published on blogs. If there were a review process for blog posts, or if Liberman had gone through legitimate scientific channels (e.g., contacting the authors for clarification, submitting a critique to the journal), we might have avoided this misleading attack on this paper and its authors; instead we might have had a fruitful scientific discussion."
I'll take these up one at a time.
1. The QWERTY effect's size. As far as I'm concerned, and as far as the general public is concerned, the size (and therefore the practical importance) of the QWERTY effect (if it exists) is the key question. This is not an entirely subjective matter — we can ask, as I did, what proportion of the variance in human judgments of the emotional valence of words is explained by the "right side advantage". The answer is "very little", or more precisely, around a tenth of a percent at best (at least in the modeling that I've done).
I focused on the effect-size question because the press release said the following (and the popular press took the hint):
Should parents stick to the positive side of their keyboards when picking baby names – Molly instead of Sara? Jimmy instead of Fred? According to the authors, “People responsible for naming new products, brands, and companies might do well to consider the potential advantages of consulting their keyboards and choosing the 'right' name."
So C&J may not be interested in my subjective evaluation of the effect size, but they promoted their own subjective evaluation by suggesting that the effect is important enough to matter to people choosing names. I felt (and feel) that this represents a serious exaggeration of the strength of the effect; and it seemed (and seems) appropriate to me to say so publicly.
2. The statistical reliability of the QWERTY effect. My first response to the article and the press release was to be skeptical of the size and practical importance of the effect. So I independently obtained the English (ANEW) data, calculated the "right side advantage" for each of the words, and fit a regression line in order to see how much of the variance was accounted for. As I observed in the original post, the answer was "very little". But the other thing that emerged from the regression was that the slope of the regression line was not statistically distinct from 0 (… in the simple linear regression that I performed — another kind of analysis might yield a different estimate of the uncertainty of the slope estimate …).
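For concreteness, here's a minimal sketch of the two steps just described: computing each word's "right side advantage" and fitting an ordinary least-squares line. It's in Python rather than the R I actually used, the toy words are just for illustration, and the hand assignments reflect my understanding of C&J's definition (RSA = count of right-hand QWERTY letters minus count of left-hand letters).

```python
# Right-hand and left-hand letters on a standard QWERTY keyboard,
# as I understand C&J's definition of the "right side advantage".
RIGHT = set("yuiophjklnm")
LEFT = set("qwertasdfgzxcvb")

def rsa(word):
    """Right-side advantage: right-hand letters minus left-hand letters."""
    w = word.lower()
    return sum(c in RIGHT for c in w) - sum(c in LEFT for c in w)

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y ~ x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# The press release's example names make the point nicely:
print([rsa(w) for w in ["molly", "sara", "jimmy", "fred"]])
# -> [5, -4, 5, -4]
```

With real valence norms in hand, `fit_line(rsa_values, valences)` gives the slope whose size (and uncertainty) is at issue throughout this post.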
I probably should have ignored this, since my main interest was in the strength of the relationship between RSA and emotional valence of words, not in the question of whether there's any real relationship at all. Rather than go into the statistical details, I emphasized the weakness of the effect by showing how comparatively easy it was to obtain a similar result by chance re-assignment of valences to RSA values. That argument was informal at best and misleading at worst — so let's try it again in a more careful and responsible way.
Since C&J quite properly point to replication as the key to effect reliability, I hunted down the Spanish (SPANEW) data from Redondo et al. 2007. Here's the scatter plot with the regression line.
(The Spanish data itself, taken from the file provided with Redondo et al. 2007, is here — the fields are word, RSA, mean valence, std valence. In order to account for the layout of Spanish keyboards, I've used the equivalent U.S.-keyboard letters, such as ';' for 'ñ', except that I've used underscore in place of single quote, because R read.table() doesn't like single quotes).
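In code, the keyboard-layout adjustment amounts to a small character map applied before the RSA calculation. This sketch includes only the two substitutions mentioned above; the helper name is hypothetical, and any other layout differences are left out.

```python
# Map Spanish-layout characters to their U.S.-keyboard equivalents
# before computing RSA. Only the two substitutions described in the
# text are shown here; anything else passes through unchanged.
SPANISH_TO_US = {
    "ñ": ";",   # 'ñ' occupies the U.S. ';' key position
    "'": "_",   # underscore in place of single quote (read.table() quirk)
}

def normalize(word):
    """Rewrite a word using U.S.-keyboard equivalent characters."""
    return "".join(SPANISH_TO_US.get(c, c) for c in word.lower())

print(normalize("niño"))  # -> ni;o
```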
True enough, the slope of the regression line (0.028) is positive. But again, the effect is on the (wrong side of the) borderline of significance. Here's what R says about the fit:
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  4.76936    0.07197  66.265   <2e-16 ***
    x            0.02777    0.02558   1.086    0.278
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 2.142 on 1032 degrees of freedom
    Multiple R-squared: 0.001141,  Adjusted R-squared: 0.0001728
    F-statistic: 1.178 on 1 and 1032 DF,  p-value: 0.2779
As another approach to significance testing, we could try looking at the distribution of slopes for random re-assignments of SPANEW valence estimates to RSA values. Rather than doing it three times, I did it 10,000 times. Here's the distribution of slopes in the 10,000 random re-assignments:
A slope as great as or greater than 0.028 occurs in 14.05% of these re-assignments (equivalent to a one-tailed test; in a two-tailed test the number would be roughly twice as great) — so this test also suggests that the effect might not be statistically significant (in a simple linear regression on the SPANEW data set).
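For readers who want to try this at home, here is a rough Python sketch of the permutation test: shuffle the valence values, refit the slope each time, and count how often the shuffled slope matches or exceeds the observed one. The function names are my own, and the actual run used the 1034-word SPANEW set, not toy data.

```python
import random

def slope(xs, ys):
    """Ordinary least-squares slope for y ~ x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def permutation_p(xs, ys, n_perm=10_000, seed=0):
    """One-tailed p-value: fraction of random re-assignments of ys to
    xs whose regression slope is as great or greater than the observed
    slope."""
    rng = random.Random(seed)
    observed = slope(xs, ys)
    ys = list(ys)  # work on a copy; leave the caller's data alone
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ys)
        if slope(xs, ys) >= observed:
            hits += 1
    return hits / n_perm
```

On the SPANEW data this procedure yields the 14.05% figure quoted above; on data with no real x-y relationship it hovers around 50%, and on strongly related data it approaches zero.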
I've now tried this on six data sets — (1) the overall ANEW data, (2) the ANEW data for male subjects, (3) the ANEW data for female subjects, (4) the DANEW data from the C&J paper, (5) the overall SPANEW data, (6) the English Hedonometric data from Peter Dodds' lab.
The first five cases were all basically like the Spanish data discussed above — a weakly positive regression slope, which (depending on what statistical test you use) is generally not statistically distinguishable from zero at the .05 level. However, I agree that it's suggestive that all five cases (and the others that C&J discuss) seem to show a positive (though very small) effect. And (according to their analysis) merging the English, Danish and Spanish data together does pass the statistical-significance bar.
In the sixth case — the Hedonometric values — the slope was weakly negative.
This is all certainly worth looking into more carefully, though the main point from my perspective is that any relationship is a very weak one. In exploring the nature and possible causes of these patterns, there are many possibilities to consider. One observation is that the positive slope of the relationship between RSA and valence seems to be driven to some extent by the large leverage of the small number of words with extremely low or extremely high RSA values — thus in the SPANEW data, if we look only at the 1023 (of 1034) words with RSA between -7 and 5, the slope goes down from 0.028 to 0.014, and the p value goes up to 0.596. We might also consider the idea that the arrow of causation might go in the other direction, e.g. because the inventor of the QWERTY layout was motivated to assign negative-valence letters (i.e. letters that are relatively common in negative-valence morphemes) to the left hand. But the interest of the exploration, in my opinion, is lessened by the weakness of the relationship.
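The leverage check just described is easy to sketch: drop the extreme-RSA words and refit. This toy Python version (names and data are illustrative, and the cutoffs echo the -7 to 5 range used above) shows how a handful of extreme x values can carry a small positive slope that mostly disappears when they are excluded.

```python
def slope(xs, ys):
    """Ordinary least-squares slope for y ~ x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def trimmed_slope(xs, ys, lo=-7, hi=5):
    """Slope of y ~ x using only the points with lo <= x <= hi."""
    pairs = [(x, y) for x, y in zip(xs, ys) if lo <= x <= hi]
    return slope([x for x, _ in pairs], [y for _, y in pairs])

# Toy data: the two extreme-x points supply all of the apparent trend.
xs = [-10, -1, 0, 1, 10]
ys = [0.0, 2.0, 2.0, 2.0, 4.0]
print(slope(xs, ys), trimmed_slope(xs, ys))
```

Here the full-data slope is positive while the trimmed slope is exactly zero; the SPANEW pattern (0.028 dropping to 0.014) is a milder version of the same phenomenon.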
3. The appropriateness of discussing scientific publications in a blog. Some serious scientific discussion has always taken place in informal and un-refereed media — hallway conversations at conventions, lab meetings, letters and email, colloquium presentations and the associated question periods, presentations at unrefereed or lightly-refereed conferences, working papers, and the like. In recent years, important discussion in several fields now takes place in exchanges of un-refereed preprints in the arXiv and other repositories. And yes, even in blogs — and in principle, I'd defend the value of carrying on such informal discussions in the light of blogospheric day.
But that's not what's happening here.
What happened here is that a scientific paper became — with a stiff push from its authors — a topic of interest to the general public, with reports in media outlets ranging from serious intellectual outfits like WBUR to tabloids like The Daily Mail. There's nothing wrong with that at all — I'm 110% in favor of promoting scientific research results in the public square. But when a piece of science (or engineering, or humanistic scholarship) becomes a matter of public interest and public discussion, it's odd to argue that it's a violation of professional etiquette for other scientists and engineers and scholars to join that discussion, and that instead they must submit their comments for evaluation and possible eventual publication in a peer-reviewed journal.
If C&J's QWERTY paper had been published in Psychonomic Bulletin and Review without any public fanfare, I wouldn't have written a word about it. But when someone sends me a link to something like the image below in the popular press, I'm curious enough to look into it and to report what I find.
It's possible that a productive exchange can result, as (for instance) it recently did with Keith Chen. But my initial motivation is to improve the quality of the public discussion. I continue to believe that I've done so.