It seems like an extremely simple way to check the validity of this theory would be to repeat the analysis, but with the letters grouped into two random subsets, rather than right-left subsets. In fact, I'd think the original authors should have done this as a control. If this new grouping yields a graph with any meaningful-looking trends whatsoever (or if multiple repeats of the analysis with different random subsets yield such trends a significant percentage of the time), it would pretty soundly deflate the idea that the original trends are the result of "right-hand favoritism".
The graphs don't all look the same, but they all look interesting, and several of them practically beckon the storyteller. There's something interesting about this general kind of data and "advantage function" analysis worth discovering, I think.
A bit later, Breffni countered that
[I]n fairness to these researchers, they didn't go from pictures to story. They put this QWERTY effect theory on record in previous publications, and adduced evidence in support of it (whatever you think of that evidence); the theory leads to a clear-cut hypothesis regarding trends in baby names; and the hypothesis is indeed supported.
And Steve responded by quoting the Casasanto et al. paper:
"In Experiment 5, we tested whether the first names that Americans give their children have changed over time, as QWERTY has become ubiquitous in people’s homes, and whether new names coined after the popularization of QWERTY are spelled using more right-side letters (i.e., have a greater RSA) than names coined earlier."
and arguing that
This is not a clear-cut hypothesis, at least not clear-cut in the sense that you could state it and there would be little doubt what mathematical proposition about data it hypothesized.
Obviously it's a well-established fact that "the first names that Americans give their children have changed over time", both before and after "QWERTY has become ubiquitous in people's homes", so a hypothesis framed in those terms is guaranteed to be supported by the data. Breffni took a more charitable interpretation, namely that the change should be in the direction of greater "right side advantage" over the past 20-odd years; and I agree that this seems to be what the authors had in mind, and indeed what they found.
Steve objects to the idea that looking for a slope significantly greater than 0 is a valid way to test such a hypothesis, at least in the sense that the "p value" has any meaning; and extrapolating from his five random-partition experiments, he suggests a "metahypothesis" like
Let M be any measurement on words that is computed as a sum over the letters of the word of a function on the letters A to Z that has value between -1 and 1. Then the mean value of M on the 802 baby names in significant use since 1960 has probably been trending in the same direction for at least 15 years.
In order to facilitate empirical exploration of such metahypotheses, I wrote a simple program to generate random partitions of the alphabet into 11-against-15 subsets, and to check the random-subset-advantage (averaged by name) against the SSA's baby-name data from 1960 to 2012.
(If you want to try the recipe at home, and have access to a unix or unix-like environment, my programs are linked and described at the bottom of this post.)
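For readers who would rather see the idea than run the linked programs, here is a minimal Python sketch of the two ingredients: drawing a random 11-against-15 split of the alphabet, and scoring a name against it. It is not the program used for the graphs in this post. The function names are hypothetical, and the sign convention (letters in the 11-letter subset count +1, all others count -1) is an assumption modeled on the "right side advantage". The `QWERTY_RIGHT` set is included only as a familiar special case.

```python
import random
import string

def random_partition(rng, size=11):
    """Split a-z into a random `size`-letter subset and its complement."""
    letters = list(string.ascii_lowercase)
    chosen = set(rng.sample(letters, size))
    rest = set(letters) - chosen
    return chosen, rest

def subset_advantage(name, chosen):
    """Assumed convention: +1 per letter in `chosen`, -1 per other letter.

    Characters outside a-z (hyphens, apostrophes) are ignored.
    """
    score = 0
    for ch in name.lower():
        if ch in string.ascii_lowercase:
            score += 1 if ch in chosen else -1
    return score

# The 11 letters typed with the right hand on a QWERTY keyboard,
# shown here only as one particular 11-against-15 partition.
QWERTY_RIGHT = set("yuiophjklnm")

if __name__ == "__main__":
    chosen, rest = random_partition(random.Random())
    print("".join(sorted(chosen)), "vs", "".join(sorted(rest)))
    print(subset_advantage("John", QWERTY_RIGHT))  # 4: all four letters are right-hand
```

Passing a seeded `random.Random(seed)` makes a given partition reproducible, which helps when matching a graph to the letter set that produced it.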
Here are ten runs:
And here are Steve's five examples again:
It's not exactly the meta-analysis that Steve asked for, but 8 of these 15 random trials have "statistically significant" positive slopes over the period 1990-2012. So I'd respond to Breffni's cogent observation by guessing that Casasanto et al. in fact had about a 50-50 chance of being correct — maybe a bit better since they could have adjusted their estimate of when "the popularization of QWERTY" occurred.
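To make "statistically significant positive slope" concrete, here is a rough Python sketch of one way such a check could be done: an ordinary least-squares fit of the yearly mean against the year, followed by a t-test on the slope. This is an assumption about the kind of test involved, not a record of the exact procedure behind the counts above. The function name and the 5% threshold are illustrative. The resulting "significance" is exactly the quantity whose meaning Steve questions.

```python
import math

# Two-sided 5% critical value of Student's t with 21 degrees of freedom,
# i.e. 23 yearly points (1990-2012) minus 2 fitted parameters.
T_CRIT_DF21 = 2.080

def slope_t_statistic(years, values):
    """Return (slope, t) for an OLS fit of values on years.

    Assumes at least 3 points and a fit that is not perfect
    (a zero residual sum would make the standard error zero).
    """
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, values))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err

if __name__ == "__main__":
    years = list(range(1990, 2013))
    # Made-up series for illustration only: a small trend plus alternating noise.
    values = [0.01 * (y - 1990) + (0.01 if y % 2 == 0 else -0.01) for y in years]
    slope, t = slope_t_statistic(years, values)
    print(slope, t, t > T_CRIT_DF21)
```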
And I wonder whether they would have reported this experiment if the results had been equivocal or negative. They could have found plenty of good reasons for concluding that baby-name changes are not a valid test of their ideas.
Mainly, I agree with Steve that there is an interesting mathematical point lurking in these graphs, not specifically connected to baby names or QWERTY handedness. More on that at some future time.
RECIPE: As discussed in the previous post, neither Steve nor I was able to reproduce Casasanto et al.'s count of 788 names that occurred 100 or more times in every year from 1960 to 2012. On different interpretations of the definitions, we came up with sets of 802 names and 791 names. In what follows, I'm using the list of 791.
The basic data from the SSA is here — when you unzip that archive, you'll find 134 csv files of the form "yobNNNN.txt", where NNNN is a year from 1880 to 2013. You can run this script to create corresponding files "100yobNNNN.txt" that are limited to the names that occurred 100 or more times in the years 1960-2012. (If this strikes you as an unnecessary restriction, you can adjust the following script appropriately.)
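The filtering step can be sketched in Python as follows. This is an illustration, not the linked script. It assumes each line of a yob file has the form `Name,Sex,Count`, and it sums the counts for a name across both sexes before applying the threshold. That is only one of the possible interpretations mentioned above, so it should not be expected to reproduce any particular one of the 788, 791, or 802 counts. The function names are hypothetical.

```python
from collections import defaultdict

def parse_yob_lines(lines):
    """Parse 'Name,Sex,Count' lines into {name: total count}, sexes summed."""
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, _sex, count = line.split(",")
        totals[name] += int(count)
    return dict(totals)

def names_in_steady_use(totals_by_year, first=1960, last=2012, threshold=100):
    """Names whose yearly total is >= threshold in every year first..last."""
    keep = None
    for year in range(first, last + 1):
        qualifying = {n for n, c in totals_by_year[year].items() if c >= threshold}
        keep = qualifying if keep is None else keep & qualifying
    return keep

def load_year(path):
    """Read one yobNNNN.txt file (path supplied by the caller)."""
    with open(path, encoding="utf-8") as handle:
        return parse_yob_lines(handle)
```

Applying the threshold separately to each sex, rather than to the summed total, would be a small change inside `parse_yob_lines` and is one way the different name counts could arise.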
Then each time you run this script, it will create a new random 11-to-15 partition, and run it over the years 1960 to 2012 in order to create an analogue of the "right side advantage" time function. You'll need this randperm (perl) function, and the file alphabet.
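As a rough picture of what this step computes, here is a self-contained Python sketch that turns per-year name counts and a letter subset into a yearly "advantage" series. It is not the linked script. The scoring convention (+1 for a letter in the subset, -1 otherwise) and the function names are assumptions. Note that an unweighted mean over names only changes from year to year if the set of names differs by year, so the sketch also offers a hypothetical `weight_by_count` option that averages over babies instead.

```python
import random
import string

ALPHABET = string.ascii_lowercase

def name_score(name, chosen):
    """Letters of `name` inside `chosen` minus letters outside it."""
    letters = [c for c in name.lower() if c in ALPHABET]
    inside = sum(c in chosen for c in letters)
    return 2 * inside - len(letters)

def advantage_series(counts_by_year, chosen, weight_by_count=False):
    """Map each year to the mean score of its names.

    counts_by_year: {year: {name: count}}
    weight_by_count: if True, weight each name by its count that year.
    """
    series = {}
    for year in sorted(counts_by_year):
        counts = counts_by_year[year]
        if weight_by_count:
            total = sum(counts.values())
            series[year] = sum(name_score(n, chosen) * c for n, c in counts.items()) / total
        else:
            series[year] = sum(name_score(n, chosen) for n in counts) / len(counts)
    return series

if __name__ == "__main__":
    rng = random.Random()  # pass a seed here to make a run repeatable
    chosen = set(rng.sample(list(ALPHABET), 11))
    toy = {1960: {"John": 300, "Sara": 100}, 1961: {"John": 100, "Sara": 300}}
    print("".join(sorted(chosen)), advantage_series(toy, chosen, weight_by_count=True))
```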
Finally, (a modified version of) this R script will create the graphs — note that you will have to change the list of random letter-sets to correspond to the results of your experiments.