## Sex in PISA

People are endlessly fascinated by average sex differences in cognitive measures, despite the fact that the between-sex differences are generally so small, relative to within-sex variation, that they have no consequential effects outside of the ideological realm. Here's a striking example — Hannah Fairfield, "Girls Lead in Science Exam, but Not in the United States", NYT 2/4/2013:

For years — and especially since 2005, when Lawrence H. Summers, then president of Harvard, made his notorious comments about women’s aptitude — researchers have been searching for ways to explain why there are so many more men than women in the top ranks of science.

Now comes an intriguing clue, in the form of a test given in 65 developed countries by the Organization for Economic Cooperation and Development. It finds that among a representative sample of 15-year-olds around the world, girls generally outperform boys in science — but not in the United States.

The arresting graphic that accompanies the text:

What does this have to do with language?

Well, there's a common cognitive mistake that works like this:

We measure some property of a set of individuals, divided into subsets A and B. We accept that the average of the samples represents the average of the populations that were sampled — and if the samples are large and representative, this much is probably harmless, though whether the samples are representative is often questionable. We then use generic plurals to describe the relationship among the group averages, so that if the average value for property P in group A is larger than the average value in B, we say that "A's are ahead of B's in P" or something similar. (Here we have, for instance, "Girls outperformed boys in a science test given to 15-year-olds in 65 countries — but in the United States, boys led the girls".)

And then we forget about the distribution of individual values that lies behind the group averages, and act as though the group averages were properties of the individual group members.

In the 2009 science test in the Program for International Student Assessment (PISA), the U.S. average for females was 495 (out of 1000), with a standard error of 3.7, and the average for males was 509, with a standard error of 4.2. (2009 is the last year for which this data has been published, as far as I can tell.) This sex difference, which is displayed on the NYT graphic as a difference of 2.7% in favor of the boys, was statistically significant — but it's a tiny effect.

How small is this sex effect? In the U.S., 5,233 students participated. Assuming that half were male and half female, this would be about N=2616 of each sex. Since the standard error is the standard deviation divided by the square root of N, we can estimate the standard deviations by multiplying the standard errors by sqrt(2616) ≈ 51, giving us standard deviations of about 3.7*51 ≈ 189 for females, and about 4.2*51 ≈ 214 for males. "Cohen's d" (a measure of the effect size) is then the difference in the means divided by the pooled standard deviation, which here is roughly

s = sqrt((2616*189^2 + 2616*214^2)/5233) ≈ 202

So the effect size is about 14/202 ≈ 0.069. And as Wikipedia tells us,

For Cohen's d an effect size of 0.2 to 0.3 might be a "small" effect, around 0.5 a "medium" effect and 0.8 to infinity, a "large" effect.

On this scale, an effect size of 0.07 qualifies as "tiny".
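For readers who want to check the arithmetic, here's a short sketch of the calculation (the numbers are the ones cited above, and the 50/50 sex split is the same assumption made in the text):

```python
from math import sqrt

# Reported 2009 PISA science values for the U.S.: means and standard errors
mean_f, se_f = 495, 3.7
mean_m, se_m = 509, 4.2
n_f = n_m = 2616  # roughly half of the 5,233 U.S. participants each

# Standard error = SD / sqrt(N), so SD = SE * sqrt(N)
sd_f = se_f * sqrt(n_f)  # about 189
sd_m = se_m * sqrt(n_m)  # about 215

# Pooled standard deviation, then Cohen's d
sd_pooled = sqrt((n_f * sd_f**2 + n_m * sd_m**2) / (n_f + n_m))
d = (mean_m - mean_f) / sd_pooled
print(round(sd_pooled), round(d, 3))  # about 202 and 0.069
```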

If the standard deviations for the two groups are set at the pooled value of 202, and if a normal distribution accurately predicts the overall results of this test, then the top-scoring half of the overall student population would be about 51% male. The top-scoring 1% would be about 55% male. Here's what those distributions would look like:

In fact, the larger standard deviation for male students makes a bigger difference — it means that a somewhat larger proportion of male students will have extreme (really low or really high) scores. This lifts the (lower and upper) tails of the male distribution enough to substantially increase the proportion of males in the extreme sets. Given the actually reported standard deviations (and again assuming that a normal distribution accurately predicts the behavior of the tails), the top half of U.S. students (in terms of this test) would still be about 51% males, but the top 1% of students would be about 73% male.
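Under the normality assumption, these tail proportions are easy to reproduce numerically. A minimal sketch (not from the original post; `pct_male_above` is my own illustrative helper, and a simple bisection on the score threshold stands in for an inverse-CDF lookup):

```python
from math import erf, sqrt

def norm_sf(x, mu, sd):
    """Upper-tail probability P(X > x) for a normal distribution."""
    return 0.5 * (1 - erf((x - mu) / (sd * sqrt(2))))

def pct_male_above(cutoff_frac, mu_m, sd_m, mu_f, sd_f):
    """Find the score threshold t such that cutoff_frac of the pooled
    (50/50 male/female) population scores above t, then return the
    male share of that top group."""
    lo, hi = 0.0, 2000.0
    for _ in range(100):  # bisection: frac above t falls as t rises
        t = (lo + hi) / 2
        frac_above = 0.5 * (norm_sf(t, mu_m, sd_m) + norm_sf(t, mu_f, sd_f))
        if frac_above > cutoff_frac:
            lo = t
        else:
            hi = t
    pm = norm_sf(t, mu_m, sd_m)
    pf = norm_sf(t, mu_f, sd_f)
    return pm / (pm + pf)

# Equal (pooled) standard deviations of 202:
print(pct_male_above(0.50, 509, 202, 495, 202))  # about 0.51
print(pct_male_above(0.01, 509, 202, 495, 202))  # about 0.55
# Reported standard deviations (larger male variance):
print(pct_male_above(0.01, 509, 214, 495, 189))  # about 0.73
```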

This last point is close to the one that Larry Summers actually (and fatally) made — he suggested that it might be greater male variance rather than higher male scores that is responsible for the over-representation of males in mathematics and the natural sciences. The point is a statistically valid one, but (in my opinion) not very persuasive, since (above a certain threshold) test scores at age 15 are not very effective in predicting career success. And Summers' remark was spectacularly inappropriate from a diplomatic point of view, since it's obvious that there are many other relevant factors that (unlike a putatively greater male variance in test-taking outcomes) are within the power of a University president to modify.

Still, the country-by-country PISA data do show males with generally greater standard errors — a proxy for greater standard deviations, given roughly equal sample sizes. In the data I could find online, the male value is greater in 26 of 34 available comparisons, the female value is greater in 5, and the two are equal in 3:

Again, however, this is an intriguing psychometric puzzle that has little if anything to do with sex differences in career outcomes. At most, any effect would be an indirect psychological one, arising when students (or those advising them) misinterpret the scores in the same way that the NYT article does.

And we should note in passing that the vertical axis in the NYT plot — which denotes country differences — has a vastly greater range than the horizontal axis, which denotes (percentage) sex differences. The lowest-scoring data point (for the Kyrgyz Republic) has average scores of 318 for males and 340 for females; while the highest-scoring data point (Shanghai, China) has average scores of 574 for males and 575 for females, about 75% higher overall. By contrast, the largest difference favoring females was Jordan (where females scored about 9% higher than males), and the largest difference favoring males was Columbia (where there was a difference of about 5%).

In terms of those country scores on the 2009 PISA science test, Canadian students (for example) outperformed U.S. students by 529 to 502, on average — a difference almost twice as great as the difference between U.S. males and females. Of course, breast-beating about country-level generic-plural test scores is also a popular rhetorical move.

1. ### BlueLoom said,

February 6, 2013 @ 8:13 am

I took physics in high school because "girls don't take physics." There were two girls in my physics class of about 35 kids. A (female) friend in high school signed up for trig & calculus in her senior year. She was called down to the counselor's office and told that "girls don't take trig and calculus." She took them anyhow.

These are events of a bit more than 50 years ago, but there is a ripple effect down the generations. If my generation of women was actively discouraged from the STEM courses, there were few role models in these fields for the next generations of women. I've never seen any hard data on this phenomenon, but I'm guessing that it takes 3-4 generations of school kids for the girls to work their way up to par with the boys. Judging from my own age and the ages of my progeny (kids, g'kids), we're only at the 3rd generation. My son has an engineering PhD; his daughter (first year high school) is a math genius, and has been well-supported by her schools (so far).

2. ### Mr Punch said,

February 6, 2013 @ 8:38 am

The Summers argument (broader distribution) was also offered by Malcolm Gladwell to explain (in part) why all the fastest runners are of (recent) African descent. I'm not sure, by the way, that any supposed lack of correlation of measured mathematical ability with career success amounts to much; Evariste Galois's career, for example ….

Richard Herrnstein's basic "bell curve" argument, I think, was the other one (parallel curves, offset mean) although this is obscured in the co-authored book.

3. ### Daniel Ezra Johnson said,

February 6, 2013 @ 8:40 am

As noted, the 2009 PISA science test showed US boys leading US girls by 14 points. However, the gender gap average for the 34 OECD countries was zero, and most non-OECD countries showed girls in the lead.

For mathematics, US boys led US girls by 21 points. The OECD countries' average was in the same direction, by 12 points. But a few countries showed girls in the lead.

For reading, though, ALL 65 COUNTRIES showed girls in the lead, by an average of a massive 39 points in the OECD countries. The United States was comparatively equitable, with girls leading boys by only 25 points.

If anything our focus should be on improving boys' reading skills. Worldwide, this is by far the greatest challenge revealed by the PISA data.

[(myl) "Massive" is a misleading word for the reading differences — they're larger than the science differences, but across the 34 countries for which I found detailed data, the median effect size was d = 0.241 (at most — this assumes the minimum allowed sample size), which is bigger than 0.07 but still counts as "small" by the standard interpretation of such numbers.]

4. ### Jonathan said,

February 6, 2013 @ 9:21 am

While test scores at age 15 are poor predictors of career success, they may be excellent predictors of career *choices.* If true, even if men and women who choose some particular career path are equally talented, the 3:1 ratio you cite will by itself create lots of the sorts of numbers we observe, no?

5. ### Narmitaj said,

February 6, 2013 @ 10:41 am

@ BlueLoom: Mildly relevant xkcd.

6. ### Matt McIrvin said,

February 6, 2013 @ 11:18 am

The usual unjustified assumption in these discussions, which tends to pass unnoticed while people talk about the differences between the group distributions, is the idea that only the extremes of the extreme upper tails matter: that if you give aptitude tests to a bunch of 15-year-olds, the highly unusual people who are several standard deviations out are the ones who will lead in the future, while everyone else is just marking time.

If you find some tiny difference in distributions that disproportionately affects upper tails, all you have to do is set your threshold for relevant genius high enough that that difference dominates. (In the graph above, it'd presumably be way off the right edge of the chart, in the realm the test can't even measure.)

Yet there's little justification for this, as far as I can tell. Great scientists, say, are not necessarily super-high scorers on aptitude tests, though they would likely do pretty well. Epochal achievements are more a matter of being interested in the right thing at the right time, and being tenacious about it.

7. ### Lane said,

February 6, 2013 @ 11:19 am

Colombia, right? Not the university, or the District?

I think Summers got mauled. His point, unless I badly misremember, was that the factors he mentioned compound one another, and as Jonathan points out, once divergence starts to happen, expectations and prejudice will accelerate them, especially in a society that is already sexist in every other way. Maybe it was impolitic of Summers to mention this, but if he suspected it would be relevant, it would be cowardly not to mention it *among other factors*. Steve Pinker and Elizabeth Spelke went on to have a perfectly civil and enlightening debate about it, even if it's not in their power to do anything about it either.

http://www.edge.org/3rd_culture/debate05/debate05_index.html

8. ### Lane said,

February 6, 2013 @ 11:33 am

A point that Pinker made is that we're not really looking at the top 1%. The point of controversy was the number of tenured scientists at top universities; this isn't the leading 1% of science-minded individuals, but maybe the .001%, if we assume 3,000 top scientists in a population of 300,000,000.

[(myl) But the ranking of "top scientists" is relative to measures of achievement that are not all well predicted by scores on standardized tests. Quite apart from the factors that such tests don't even try to measure (like creativity, grit, political acumen, or plain old good luck), there's the problem discussed at length in an article in the current NYT Magazine, which discusses the alleged effects of two variants of the COMT gene:

One variant builds enzymes that slowly remove dopamine. The other variant builds enzymes that rapidly clear dopamine. We all carry the genes for one or the other, or a combination of the two.

In lab experiments, people have been given a variety of cognitive tasks — computerized puzzles and games, portions of I.Q. tests — and researchers have consistently found that, under normal conditions, those with slow-acting enzymes have a cognitive advantage. They have superior executive function and all it entails: they can reason, solve problems, orchestrate complex thought and better foresee consequences. They can concentrate better. This advantage appears to increase with the number of years of education.

The brains of the people with the other variant, meanwhile, are comparatively lackadaisical. The fast-acting enzymes remove too much dopamine, so the overall level is too low. The prefrontal cortex simply doesn’t work as well. On that score alone, having slow-acting enzymes sounds better. There seems to be a trade-off, however, to these slow enzymes, one triggered by stress. In the absence of stress, there is a cognitive advantage. But when under stress, the advantage goes away and in fact reverses itself.

In a cited study of students taking the Taiwanese 9th-grade "Basic Competency Test", those with the slow-acting enzymes scored 8 percent lower. This offers a (doubtless partial) translation into genomics-ese and neuroscience-ese of a piece of basic common sense: some people whose test scores are off the charts never amount to much, while others with less impressive scores make important intellectual contributions.]

9. ### this week’s reads « Learning: Theory, Policy, Practice said,

February 6, 2013 @ 12:18 pm

[…] At Language Log, Mark Liberman discusses the "intriguing psychometric puzzle" of gender differentials on the 2009 PISA science test and the ways in which the New York […]

10. ### J.W. Brewer said,

February 6, 2013 @ 1:48 pm

Maybe not the top .001% on any test-measurable metric, but as a practical matter before you can separate yourself from the pack of other promising young research scientists via those other harder-to-measure qualities you first have to get into that pack, which these days requires credentialing, which typically means having gotten through a Ph.D. program associated with an entry requirement of very high GRE scores after an undergrad school associated with an entry requirement of at least reasonably high SAT scores. In other words, those preliminary hurdles probably mean that if you're not in the top <1% on some test-measurable metric, we'll never find out about those other qualities. (Of course, there are also those of us who tested very high on math/science aptitude as adolescents but then betrayed our early promise by taking the easy way out and ending up as linguistics majors and/or lawyers.)

[(myl) But in the pluralistic system we have in the U.S., test scores are only one of several ways for students to succeed, and there are lots of second and third chances for those who want them. Even if your SAT subject tests are not super strong in math and science, you can get into a decent college (or transfer into one later) on the strength of other characteristics and accomplishments; and in college, you have a chance to prove yourself in math and science courses, regardless of what your test scores were like; and in many places you can get involved in research projects as an intern or work-study student. If the other aspects of your record are strong enough, you can get into graduate school without off-the-charts GREs. And beyond that point, tests scores don't matter any more.

Obviously there are many financial, social, and psychological barriers to overcome. But if you survey a bunch of successful natural scientists and engineers, I think that you'll find plenty who didn't have even top-1% SAT or GRE scores. And you'd certainly find plenty of people with better test scores who ended up in other careers, either because they tried and failed or because they decided not to try.]

11. ### marie-lucie said,

February 6, 2013 @ 3:01 pm

Just because you do very well in some school subjects does not mean that you should want to spend your whole life pursuing them.

12. ### marie-lucie said,

February 6, 2013 @ 3:21 pm

About girls in some countries outperforming boys in science: In America the majority of children, both boys and girls, go to school K-12. In many countries this is not true, and boys are likely to stay in school longer than girls. I wonder if the girls' better performance often stems from the fact that only the most talented girls will remain in school, while both talented and less-talented boys are encouraged to stay in school, so the average grades for each sex are skewed by this discrepancy (in my own school, where girls were a minority, they tended to be near the top of their classes, since poorly performing girls were more likely to be expelled and sent to a less demanding school than poorly performing boys). Also, in countries where female roles are quite restricted and schools are segregated by sex, girls do not have many outlets for independence, while boys have much more freedom outside of school, so that talented girls spend more time than most boys applying themselves to their studies. A talented, well-educated girl is also more likely to be able to earn her own living after completing her studies and therefore less likely to be pressured into an early marriage than a less talented girl, especially if her own mother or a close female relative provided a role model.

13. ### Lane said,

February 6, 2013 @ 3:38 pm

Mark, your points are well-taken of course, but JWB says something like what I would have said in reply. Surely our elite universities' tenured scientists have aptitude levels if not at the .001% level, probably well north of the 1% threshold. At least, I wouldn't say that of every hundred people I pass walking down the street in New York (from the bankers to the homeless), one of them has the ability to make it as a Cal Tech physicist. One in a thousand would be more realistic, but I'd guess it's more like one in ten thousand. Even if it's one in a thousand, we're getting to levels where the (putative?) differences in sex-based variation would get significant quickly.

[(myl) But we're not talking about the CalTech physics department, we're talking about academic positions in natural science and engineering as a whole. And the extent to which other factors become relevant can be seen in the fact that such a large fraction of faculty in (many of) those areas are originally from outside the U.S. One high-quality U.S. physics department known to me recently failed to enroll a single U.S.-born graduate student in at least one recent year. The reasons for this are complicated, but a genetic bias in favor of Eastern Europeans, Arabs, Indians, and Chinese is probably not the key factor…]

Maybe innate sex-differences don't exist; maybe they're already baked in by sexist culture by the time we're taking high-school aptitude exams. Maybe something else entirely is going on. But *if* the differences *do* exist, they're going to be significant when we start selecting the rare birds whose grades and SATs and college grades and GREs start them on that path that JW Brewster describes, ending in a great scientific career. Aptitude at high-school age doesn't have to be all-determining; it just has to be a pretty strong filter.

[(myl) I don't have strong opinions on any of these questions. But I do strongly object to the view that sub-0.1 effect sizes in differences between group means are a good indication of basic group characteristics, and tell us something important about the distribution of extreme cases. If we care about the tail of the distribution, we should look at the tail of the distribution and try to figure out what causes it to look the way it does, not look at small differences in mean values and speculate about why they shade one way or another.]

14. ### Sven said,

February 6, 2013 @ 3:41 pm

What Lane said.

BTW, the argument that standardized tests are imperfect predictors of achievement is a straw man, because Summers never suggested that any specific test (standardized tests, IQ tests, whatever) actually measures the relevant ability. What he said was that we needed more research to understand what makes one an outstanding scientist and whether there is a natural statistical difference between the sexes. This is indeed necessary if we are ever to know how close we are to the ideal of non-discrimination. I still cannot understand how this can even be controversial.

15. ### J.W. Brewer said,

February 6, 2013 @ 4:45 pm

myl's points are well-taken and it is I think a very good feature of the American educational system that it has a lot of pluralism and second chances compared to those of some other countries. I suppose I was thinking of the original context of the Summers remarks, which was not "why aren't there more women doing interesting/important work in field X" but "why aren't there more women with tenure in Harvard's Department of X". You'd have to be either really naive or the President of Harvard to think those are the same question. My strong suspicion would be that the paths by which a few people can end up doing interesting/important work in the field despite not having passed through a very-high-test-score sieve early on do not tend to be the paths that result in tenure at Harvard, because the Harvard hiring/tenuring process is going to be at least somewhat distorted by the sort of credentials-snobbery that is a very good proxy for how well you did on standardized tests early on. Since *most* of the people doing very good work are still going to pass the credentials-snobbery sieve with flying colors, the cost to Harvard of being credentials snobs is not too high to bear, and is therefore (perhaps not even consciously) borne.

Also, I suspect in practice that the commendable tendency of elite American universities to allow people to switch fields of study post-matriculation leads net to more switching away from STEM than into it ("I was totally going to be a pre-med, but organic chemistry was mindnumbingly boring and then I fell in love with Heian-era Japanese poetry"), but I could be wrong about that.

16. ### Ayse said,

February 6, 2013 @ 5:06 pm

in general, these tests are not designed to measure biological or cognitive differences between the genders. the policymakers can use this information to remedy the barriers to access or change the cultural norms. the observed differences are mostly due to cultural attitudes. when i was growing up in turkey, girls needed to compete with boys to overcome cultural barriers and be independent, so we worked very hard to get in to the traditionally male professions. viewed as a nerd was a compliment not insult. so these test results mostly reflect these realities not the cognitive or biological differences. as with many media approach, this article also sensationalizes a public interest.

17. ### J.W. Brewer said,

February 6, 2013 @ 5:33 pm

In terms of studying the right tail, the interesting question is how to find good data that isn't already tainted by self-selection. I suppose the data is out there for the demographics of the 99th percentile on the math piece of the SAT/ACT and how that has varied over time, which is drawn from a pretty broad and not-grossly-unrepresentative sample of American high school kids. But with everything else you have the problem that e.g the people who take whatever variant of the GRE's you take if you're applying to grad school in mechanical engineering are self-selected and likely already have very different demographics than the general population their age.

There is some longitudinal data on an approximation of the right tail out there from the people at http://www.vanderbilt.edu/peabody/smpy/. (Disclaimer: I am a datapoint in their Cohort Two – compared to most of my cohort I have apparently underperformed as an adult by failing to better mankind via an impressive STEM career, although perhaps blame should fall on whoever had the cockamamie idea to combine accelerated math instruction in 8th grade with giving me a copy of the AHD with the fascinating Calvert Watkins appendix of PIE roots in the back.) But there's some self-selection there as well, since you had to have the dumb luck to have a seventh-grade teacher who pushed you to get tested and then be willing to follow up on the suggestion – there were no doubt lots and lots of other 12-year-olds in the Middle Atlantic states who would have tested just as well had they been tested, and who knows if they've had comparable subsequent life trajectories. On the other hand, I *think* one of their motives for finding their sample via testing at age 12 was the hope that the hypothesized social pressures that might have tended to push girls away from math wouldn't yet have kicked in as severely as they would as the teenage years progressed.

18. ### Lane said,

February 6, 2013 @ 5:37 pm

To your last comment, Mark, I couldn't agree more.

19. ### Ewout ter Haar said,

February 6, 2013 @ 7:31 pm

There must be a factor of 2 error in your estimate of the standard deviation of the PISA scores, since these are fixed to be about 100 (in the same way that IQ scores have a standard deviation of 15, by convention).

[(myl) Or perhaps there was an error in their standard error calculation? With an N of about 2616 per sex, the standard error for a standard deviation of 100 should be
100/sqrt(2616) = 1.96
But they cite standard errors of 3.7 for females and 4.2 for males…]

20. ### the other Mark P said,

February 7, 2013 @ 12:11 am

Yet there's little justification for this, as far as I can tell. Great scientists, say, are not necessarily super-high scorers on aptitude tests, though they would likely do pretty well. Epochal achievements are more a matter of being interested in the right thing at the right time, and being tenacious about it.

True, but what are the other factors that determine a good scientist, other than native intelligence?

Hard work. There's no (real) evidence that boys work harder than girls, so it's not likely to be that.

A mind that is mechanically and spatially aware. We're not allowed to say it, but boys tend to think better this way than girls. More boys like to tinker with things. Boys tend to think better in 3D. A prime reason they dominate in Physics and Maths, yet don't in Biology.

A mind that is able to focus exclusively on ridiculously specialised areas. Boys are much more prone to the sort of topic specific focus to the detriment of everything else. This makes them much more prone to obsess on abstract Maths or Physics (often at the expense of the big picture, of course).

Put these things together, and you see why some fields favour males. Once that happens social pressure will do the rest. My daughter is very good at Maths, but she's not interested in hanging out with a bunch of dorky males for five years at university.

Trying to change this is doomed to failure. Biology will win, every time.

21. ### Matt McIrvin said,

February 7, 2013 @ 12:57 am

A mind that is mechanically and spatially aware. We're not allowed to say it, but boys tend to think better this way than girls. More boys like to tinker with things. Boys tend to think better in 3D. A prime reason they dominate in Physics and Maths, yet don't in Biology.

That we're not allowed to say it is news to me, since there have been about a million papers on sex differences in the ability to mentally rotate objects in 3D. But how much of mathematics is mental object rotation?

And is the skill of no use in biology? It ought to be useful in visualizing anatomy, at the very least: living things tend to be three-dimensional. Molecular biology, certainly.

Put these things together, and you see why some fields favour males. Once that happens social pressure will do the rest. My daughter is very good at Maths, but she's not interested in hanging out with a bunch of dorky males for five years at university.

Trying to change this is doomed to failure. Biology will win, every time.

I have doubts that the social situation you're describing here is an expression of a universal biological human constant.

Physics Today did a study of gender differences in physics departments across different countries, back in the 1990s. What they found was interesting: while the field was male-dominated more often than not, the degree of male dominance varied hugely across countries. The United States was an outlier, with one of the highest degrees of gender imbalance. I don't recall the stats for other anglophone countries.

It wasn't just that the countries with a higher proportion of female physicists had lower standards: some of them, like Italy, were first-rank. Nor were they generally less sexist societies (again, see Italy).

They looked at a bunch of variables, and found that the countries that had the most gender imbalance tended to be places where people believed that success in the hard sciences depended on un-learnable, innate talent. Where people didn't believe this, and attributed success primarily to hard work, more women would be involved.

February 7, 2013 @ 4:47 am

I like the expression "no consequential effects outside of the ideological realm".

23. ### Pnin said,

February 7, 2013 @ 6:18 am

Liberman's calculation is erroneous. I think this is because the PISA standard errors are inflated due to some features of the sampling scheme. The actual standard deviation (SD) on the science scale is 98 in the US (Figure H3 here), which means that the male-female difference is about 0.14 SD, twice Liberman's 0.069 SD. This means that male overrepresentation at the right tail is larger than Liberman says.

[(myl) An effect size of 0.14 is still small, and in fact on the small end of the small range. And my calculation from the cited standard errors and sample size is correct — if the standard deviations are really around 100 rather than around 200, then PISA's calculations of standard errors must have been wrong, or else what they've called "standard error" is actually something else.]

The point is a statistically valid one, but (in my opinion) not very persuasive, since (above a certain threshold) test scores at age 15 are not very effective in predicting career success.

Not true. Check out the Study of Mathematically Precocious Youth, e.g., http://cdp.sagepub.com/content/19/6/346.abstract. Even if there's some self-selection in this study, as Brewer suggests above, it does not explain why there's a monotonic relationship between test scores and achievement within the top one percent.

[(myl) "Monotonic" here apparently means that students in the top quartile of scores on the SAT-M are somewhat more likely to have some achievement in relevant areas (e.g. about 18% of them have STEM doctorates) than students with lower scores.

I'd say that this supports my point, and certainly doesn't show that a 2.7% difference in mean scores means anything consequential.]

As to why the gender gaps differ between nations, here's one theory.

24. ### Pnin said,

February 7, 2013 @ 8:58 am

An effect size of 0.14 is still small, and in fact on the small end of the small range.

Whether any effect size is to be interpreted as small or large depends on what you're studying. A small mean gap can produce sizable differences at the tails of the distribution, for example. (I don't think the PISA data are suitable for studying differences at the right tail though; there are few difficult items in the PISA tests.)

BTW, the wage gap between sexes is a "small effect" (generally about 0.1-0.3 SD, IIRC), but it is certainly never regarded as insignificant for that reason!

And my calculation from the cited standard errors and sample size is correct — if the standard deviations are really around 100 rather than around 200, then PISA's calculations of standard errors must have been wrong, or else what they've called "standard error" is actually something else

The way you calculated the SDs is based on the assumption that you're dealing with a simple random sample. The PISA is based on a complex sampling design which inflates sampling variance. Additionally, each student answered only a subsample of all questions, and full test scores for individuals were estimated using item response theory, which increases standard errors, too. Quoting from the Highlights from PISA 2009 report:

The estimation of the standard errors that are required to undertake the tests of significance is complicated by the complex sample and assessment designs, both of which generate error variance. Together they mandate a set of statistically complex procedures for estimating the correct standard errors. As a consequence, the estimated standard errors contain a sampling variance component estimated by BRR [=the Fay method of Balanced Repeated Replication]. Where the assessments are concerned, there is an additional imputation variance component arising from the assessment design. Details on the BRR procedures used can be found in the PISA 2009 Technical Report

Liberman wrote: I'd say that this supports my point, and certainly doesn't show that a 2.7% difference in mean scores means anything consequential.

You said that "(above a certain threshold) test scores at age 15 are not very effective in predicting career success", but the data show both that males are more and more overrepresented the farther you go along the right tail of test scores, and that particularly for STEM outcomes being farther on the right tail has a big effect. For example, compared to the bottom quartile, the top quartile of the top one percent is more than 18 times more likely to attain a STEM doctorate and more than seven times more likely to have tenure at a top research university. I don't see how you can say that this is not consequential. Summers said that male overrepresentation at the right tail may be one of the reasons why males are more likely to succeed in STEM fields than females. I think the research is in accord with Summers's view.

25. ### CK said,

February 11, 2013 @ 9:26 pm

For analysis with respect to variance of international surveys in mathematics (TIMSS and PISA) and summaries of recently stated hypotheses for gender gaps and nongaps in scores, see Kane & Mertz, Notices of the American Mathematical Society, 2012, http://www.ams.org/notices/201201/rtx120100010p.pdf.

These hypotheses include what's known in some circles as the "greater male variability hypothesis," the conjecture that for measurements of an intellectual ability, the variance for males will be larger than the variance for females. This is not invariably the case–as already noted in the post.

Kane & Mertz also look at correlations of test scores and measures of equity and opportunity for women. They conclude that "gender equity and other sociocultural factors, not national income, school type, or religion per se, are the primary determinants of mathematics performance at all levels for both boys and girls."