People are endlessly fascinated by average sex differences in cognitive measures, despite the fact that the between-sex differences are generally so small, relative to within-sex variation, that they have no consequential effects outside of the ideological realm. Here's a striking example — Hannah Fairfield, "Girls Lead in Science Exam, but Not in the United States", NYT 2/4/2013:
For years — and especially since 2005, when Lawrence H. Summers, then president of Harvard, made his notorious comments about women’s aptitude — researchers have been searching for ways to explain why there are so many more men than women in the top ranks of science.
Now comes an intriguing clue, in the form of a test given in 65 developed countries by the Organization for Economic Cooperation and Development. It finds that among a representative sample of 15-year-olds around the world, girls generally outperform boys in science — but not in the United States.
The arresting graphic that accompanies the text:
What does this have to do with language?
Well, there's a common cognitive mistake that works like this:
We measure some property of a set of individuals, divided into subsets A and B. We accept that the average of the samples represents the average of the populations that were sampled — and if the samples are large and representative, this much is probably harmless, though whether the samples are representative is often questionable. We then use generic plurals to describe the relationship among the group averages, so that if the average value for property P in group A is larger than the average value in B, we say that "A's are ahead of B's in P" or something similar. (Here we have, for instance, "Girls outperformed boys in a science test given to 15-year-olds in 65 countries — but in the United States, boys led the girls".)
And then we forget about the distribution of individual values that lies behind the group averages, and act as though the group averages were properties of the individual group members.
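To make that slippage concrete, here's a minimal simulation (with made-up numbers, chosen only for illustration) of two groups whose averages differ slightly while their individual values overlap almost completely:

```python
# Hypothetical illustration: group A's average is slightly higher than
# group B's, but a random member of B still beats a random member of A
# nearly half the time. All numbers here are invented for illustration.
import random

random.seed(0)
a = [random.gauss(52, 20) for _ in range(10_000)]  # group A scores
b = [random.gauss(50, 20) for _ in range(10_000)]  # group B scores

mean_a = sum(a) / len(a)
mean_b = sum(b) / len(b)
print(mean_a > mean_b)  # the generic-plural claim: "A's outperform B's"

# Fraction of random B-vs-A pairings that the B member wins:
b_wins = sum(x > y for x, y in zip(b, a)) / len(a)
print(b_wins)
```

The generic-plural sentence is true of the averages, but it tells you almost nothing about a comparison between any particular A and any particular B.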
In the 2009 science test in the Program for International Student Assessment (PISA), the U.S. average for females was 495 (out of 1000), with a standard error of 3.7, and the average for males was 509, with a standard error of 4.2. (2009 is the last year for which this data has been published, as far as I can tell.) This sex difference, which is displayed on the NYT graphic as a difference of 2.7% in favor of the boys, was statistically significant — but it's a tiny effect.
How small is this sex effect? In the U.S., 5,233 students participated. Assuming that half were male and half female, this would be about N=2616 of each sex. Since the standard error is the standard deviation divided by the square root of N, we can estimate the standard deviations by multiplying the standard errors by sqrt(2616) ≈ 51, giving us standard deviations of about 3.7*51 ≈ 189 for females, and about 4.2*51 ≈ 214 for males. "Cohen's d" (a measure of the effect size) is then the difference in the means divided by the pooled standard deviation, which here is roughly
s = sqrt((2616*189^2 + 2616*214^2)/5233) ≈ 202
So the effect size is about 14/202 ≈ 0.069. And as Wikipedia tells us,
For Cohen's d an effect size of 0.2 to 0.3 might be a "small" effect, around 0.5 a "medium" effect and 0.8 to infinity, a "large" effect.
On this scale, an effect size of 0.07 qualifies as "tiny".
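For concreteness, the effect-size arithmetic can be written out as a few lines of Python, a sketch using the sample sizes, standard errors, and means quoted above (small discrepancies from the rounded figures in the text are expected):

```python
# Effect-size arithmetic for the PISA 2009 U.S. science scores.
from math import erf, sqrt

n_f = n_m = 2616            # roughly half of the 5,233 U.S. participants
se_f, se_m = 3.7, 4.2       # reported standard errors
mean_f, mean_m = 495, 509   # reported average scores

# SE = SD / sqrt(N), so SD = SE * sqrt(N):
sd_f = se_f * sqrt(n_f)
sd_m = se_m * sqrt(n_m)

# Pooled standard deviation, then Cohen's d:
sd_pooled = sqrt((n_f * sd_f**2 + n_m * sd_m**2) / (n_f + n_m))
d = (mean_m - mean_f) / sd_pooled

# Another way to see how small this is: the probability that a randomly
# chosen male outscores a randomly chosen female is Phi(d / sqrt(2)),
# barely better than a coin flip.
p_superiority = 0.5 * (1 + erf(d / 2))

print(sd_f, sd_m, sd_pooled, d, p_superiority)
```

The last quantity (sometimes called the "common-language effect size") comes out a bit under 0.52: pick one boy and one girl at random, and the boy's score is higher only about 52 times in 100.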
If the standard deviations for the two groups are set at the pooled value of 202, and if a normal distribution accurately predicts the overall results of this test, then the top-scoring half of the overall student population would be about 51% male. The top-scoring 1% would be about 55% male. Here's what those distributions would look like:
In fact, the larger standard deviation for male students matters more than the difference in means — it implies that a somewhat larger proportion of male students will have extreme (really low or really high) scores. This lifts the lower and upper tails of the male distribution enough to substantially increase the proportion of males in the extreme sets. Given the actually reported standard deviations (and again assuming that a normal distribution accurately predicts the behavior of the tails), the top half of U.S. students on this test would still be about 51% male, but the top 1% would be about 73% male.
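Those tail figures are easy to check with a normal-tail calculation. The sketch below bisects for the score threshold that cuts off the top proportion p of the combined population, then asks what fraction of the students above that threshold is male, using the means and standard deviations estimated above:

```python
# Check on the tail figures: with normal score distributions for male
# and female students (means and SDs estimated from the PISA data),
# what fraction of the top-scoring proportion p is male?
from math import erf, sqrt

def norm_sf(x, mu, sigma):
    """P(X > x) for a normal distribution with mean mu, SD sigma."""
    return 0.5 * (1 - erf((x - mu) / (sigma * sqrt(2))))

def male_share_of_top(p, mu_m=509, sd_m=214, mu_f=495, sd_f=189):
    """Fraction of the top-scoring proportion p that is male, assuming
    equal numbers of male and female test-takers."""
    # Bisect for the score threshold t where the combined upper tail is p.
    lo, hi = 0.0, 2000.0
    for _ in range(100):
        t = (lo + hi) / 2
        tail = 0.5 * norm_sf(t, mu_m, sd_m) + 0.5 * norm_sf(t, mu_f, sd_f)
        if tail > p:
            lo = t
        else:
            hi = t
    m = 0.5 * norm_sf(t, mu_m, sd_m)
    f = 0.5 * norm_sf(t, mu_f, sd_f)
    return m / (m + f)

# With both SDs set to the pooled 202: top half ~51% male, top 1% ~55%.
print(male_share_of_top(0.5, sd_m=202, sd_f=202))
print(male_share_of_top(0.01, sd_m=202, sd_f=202))
# With the unequal estimated SDs, the top 1% jumps to about 73% male:
print(male_share_of_top(0.01))
```

Note how little the mean difference moves the top half (51%), and how much the variance difference moves the extreme tail (55% → 73%).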
This last point is close to the one that Larry Summers actually (and fatally) made — he suggested that it might be greater male variance rather than higher male scores that is responsible for the over-representation of males in mathematics and the natural sciences. The point is a statistically valid one, but (in my opinion) not very persuasive, since (above a certain threshold) test scores at age 15 are not very effective in predicting career success. And Summers' remark was spectacularly inappropriate from a diplomatic point of view, since it's obvious that there are many other relevant factors that (unlike a putatively greater male variance in test-taking outcomes) are within the power of a University president to modify.
Still, the country-by-country PISA data do show males with generally greater standard errors (and, since the sampled groups are of roughly comparable size, this implies generally greater standard deviations as well): the male value is greater in 26 of the 34 comparisons available in the data I could find online, the female value is greater in 5 comparisons, and they're equal in 3:
Again, however, this is an intriguing psychometric puzzle that has little if anything to do with sex differences in career outcomes. At most, any such effect would be an indirect, psychological one, mediated by people misinterpreting the scores in the same sort of way that the NYT article does.
And we should note in passing that the vertical axis in the NYT plot, which shows overall country scores, spans a vastly greater range than the horizontal axis, which shows (percentage) sex differences. The lowest-scoring data point (for the Kyrgyz Republic) has average scores of 318 for males and 340 for females, while the highest-scoring data point (Shanghai, China) has average scores of 574 for males and 575 for females, about 75% higher overall. By contrast, the largest difference favoring females was in Jordan (where females scored about 9% higher than males), and the largest difference favoring males was in Colombia (where the difference was about 5%).
In terms of those country scores on the 2009 PISA science test, Canadian students (for example) outperformed U.S. students by 529 to 502, on average — a difference almost twice as great as the difference between U.S. males and females. Of course, breast-beating about country-level generic-plural test scores is also a popular rhetorical move.