Liberman on Sax on Liberman on Sax on hearing


According to a recently published and very influential book, scientists have recently discovered some amazing things about the differences between boys and girls. For example, girls' hearing is said to be an order of magnitude more sensitive than boys' hearing. And this is a difference with major consequences in public as well as private life:

The difference in how girls and boys hear also has major implications for how you should talk to your children. I can't count the number of times a father has told me, "My daughter says I yell at her. I've never yelled at her. I just speak to her in a normal tone of voice and she says I'm yelling." If a forty-three-year-old man speaks in what he thinks is a "normal tone of voice" to a seventeen-year-old girl, that girl is going to experience his voice as being about ten times louder than what the man is hearing. […]

The gender difference in hearing also suggests different strategies for the classroom. … [E]leven-year-old girls are distracted by noise levels about ten times softer than noise levels that boys find distracting. … If you're teaching girls, don't raise your voice …. [but] the rules are different when you're teaching boys.

That passage is from p. 18 of Why Gender Matters: What Teachers and Parents Need to Know About the Emerging Science of Single Sex Education, by Dr. Leonard Sax, a pediatrician who is the leading light of the National Association for Single Sex Public Education (NASSPE). Dr. Sax is a tireless advocate for the view that boys and girls are so different, in so many scientifically proven ways, that it makes no sense to try to educate them in the same classroom.

There's just one problem: the scientific foundations of this "emerging science of single sex education" are exaggerated, misunderstood, or misrepresented. At least, that's true in the cases where I've checked the original research that Dr. Sax cites, including especially his assertions about sex differences in hearing.

I posted about this two years ago; and a couple of months ago, Dr. Sax posted on the NASSPE web site a letter answering my objections. A few days ago, a reporter asked me for my reaction to his letter ("Sax Q & A", 5/17/2008). So here goes — I warn you that this may be a little tedious, unless you're specifically interested in some of the more obscure techniques for studying human hearing, or in the rhetoric of Dr. Sax's movement.

Let me start by explaining how I got into this. It's not because I have strong feelings about single-sex education. It's clear that there are fine and effective single-sex schools, and maybe there should be more — the locally-governed and pluralistic nature of American education is one of its strengths, in my opinion. Nor do I have strong feelings about the nature and extent of human sex differences. But I do have strong feelings about the use and misuse of science in public policy debates, and that's what set my bad-science spider sense to tingling when I read David Brooks' NYT column for 6/11/2006, "The gender gap at school".

I expressed my qualms in a weblog post ("David Brooks, cognitive neuroscientist", 6/12/2006), and put a bit of effort into tracking down and reading Brooks' sources. I soon learned that Brooks had gotten a lot of his ideas not from the original papers, but from Leonard Sax's re-presentation of them in Why Gender Matters, and so I documented, in excessive detail, four striking exaggerations and misrepresentations in Sax's presentation of one relevant paper that he features prominently ("Are men emotional children?", 6/24/2006). Since I had bought and read Sax's book, I followed up with several posts on other cases of the book's misunderstanding, exaggeration or misrepresentation of scientific results, including two on hearing ("Leonard Sax on hearing", 8/22/2006; "Girls and boys and classroom noise", 9/9/2006).

OK, fast forward to Dr. Sax's response, available as a .pdf here on the NASSPE web site, where he makes a point about the protocol of argument:

You are selective in the research you cite, using only those papers which support your position. While such selectivity may perhaps be justified when writing a popular book for a lay audience, with severe constraints on length, it is harder to justify such selectivity when writing a scholarly post online.

But I selected and examined papers that Dr. Sax cited, and argued that he misunderstood or misrepresented them. My posts were not a general review of sex differences in hearing, they were a specific critique of Dr. Sax's misuse of science in a political argument.

The main point of my first post ("Leonard Sax on hearing", 8/22/2006) was that Sax turned upside-down the results of Yvonne Sininger, Barbara Cone-Wesson, and Carolina Abdala, "Gender distinctions and lateral asymmetry in the low-level auditory brainstem response of the human neonate", Hearing Research 126:58-66, 1998. This paper actually found that male infants' thresholds were generally lower than those of female infants — in other words, they responded to softer sounds — but he presented this paper as evidence that girls' hearing is innately more sensitive.

The main point of my second post ("Girls and boys and classroom noise", 9/9/2006) was that Sax had cherry-picked numbers from John F. Corso, "Age and Sex Differences in Pure-Tone Thresholds", The Journal of the Acoustical Society of America, 31(4), pp. 498-507, 1959, and then misrepresented their meaning. Sax presents a difference in audiological thresholds of 23.2 dB as typical. In fact, this comes from comparing young women to old men at 3000 Hz; even for that comparison, it's unusually high (the average difference from 250 to 3000 Hz is 9.1 dB); and it's irrelevant to the comparison of boys and girls in school, for which we should compare young women to young men, where Corso found an average threshold difference over the same frequency range of 2.5 dB. And whatever the threshold difference, we need to map it to perceived loudness of sounds at various levels via the well-known loudness function, which predicts that even the exaggerated 23.2-dB difference in thresholds would only map to about 1.4-times greater perceived loudness at classroom levels, while a 2.5-dB difference in thresholds would have a negligible effect on perceived loudness.

Dr. Sax grants my point about Sininger et al., and to some extent the larger point about sex differences in hearing:

Your criticism regarding my discussion of one of the papers cited on that web page (Singinger, Cone-Wesson, and Abdala, 1998) is accurate, and the errors you point out have long since been corrected, but you have never updated your posts. Many of your most hostile comments, particularly about differences in hearing, would be difficult for a careful reader today to understand, because you are attacking comments which haven’t been on any web page of mine in a long time. I do plead guilty to the crime of posting material online without adequate fact-checking. I would ask that you focus your attention on my books and on the web postings I have made in the past two years, as I am now much more cautious about what I put on the web. I should have exercised more care when I first started posting comments on the web; you are quite right in this regard.

I'm glad that Dr. Sax has become more careful. I'm particularly happy that the NASSPE web site now includes statements like these:

This does NOT imply that "all girls learn one way and all boys learn another way" – that's not a true statement, and nobody associated with NASSPE believes it! We celebrate and cherish the variations AMONG girls and AMONG boys.

Precisely because girls are so diverse and boys are so diverse, single-sex schools offer unique educational opportunities for girls, and for boys.

It's fair to ask me to add links to his letter (and to this post) to those posts from 2006; and I'll do so. But the Corso paper and the Sininger et al. paper were also cited in Why Gender Matters, to the same end as on his web site; and I believe that my general point, about careless or misleading use of scientific arguments in public-policy debates, remains valid about the material in that book, and also with respect to the additional material Dr. Sax cites in his letter. To that end, I'll discuss two of the additional references in his letter.

He brings in a new topic, namely subjective assessment of loudness measured directly rather than inferred from thresholds:

You show no awareness of new research suggesting that young men and young women may indeed subjectively assess loudness quite differently. I have attached a recent paper making this point (Sagi, D’Allesandro, & Norwich, 2007; copy enclosed).

That's Elad Sagi et al., "Identification Variability as a Measure of Loudness: An Application to Gender Differences", Canadian Journal of Experimental Psychology 61(1): 64-70, 2007. Another version of the same work can be found as D'Allessandro et al., "Gender Differences in Power Functions (n-values) for Loudness", Proceedings of the Twenty First Annual Meeting of the International Society for Psychophysics, 2005.

My executive summary:

This study involves a very small number of subjects (8 females and 7 males), and it doesn't estimate loudness scaling parameters directly, but rather uses an indirect and inferential method, which doesn't allow us to make any quantitative conclusions about the magnitude of a sex difference in perceived loudness levels or in the experience of sounds at different intensity levels. Other researchers have measured such things more directly; and they find that the effects of sex on objectively defined psychoacoustic tasks (like thresholds, JNDs and ratio judgments) are generally small, while the effects of sex on softer, more judgmental tasks (like comfortable listening levels or sensitivity to noise) are strikingly inconsistent and variable across studies, subjects and contexts.

Sagi et al.'s idea — a clever one — is to measure loudness perception indirectly, by asking subjects to identify a particular sound intensity value within a given range. Thus as a subject, you hear a sound at an intensity that might be anywhere in the range (say) from 60 to 70 dB, or from 50 to 70 dB, or from 40 to 70 dB. On each trial, you have to identify the intensity to the nearest dB.

You'll almost always be wrong, but on each trial we measure how wrong you are, and (for a given intensity range) we summarize the distribution of your errors. Unsurprisingly, your range of errors tends to get wider as the range of intensities increases. And the standard deviation of the errors is roughly a linear function of the stimulus range.

(Of course, the variability of your errors also depends on how well you're paying attention, how hard you're trying, and so on. Keep this in mind — it'll come up again later.)

The authors present a subtle chain of reasoning connecting a — the slope of the function relating the standard deviation of errors in intensity estimation to the size of the intensity range — to the exponent n in what is called "Stevens' Power Law", which relates perceived loudness to stimulus intensity. The relationship is a proportional one,

k*n = (20*log10(e))*a

where k is an arbitrary constant.

Sagi et al. calculated the slope a for a small sample of subjects (8 females and 7 males, all probably college students). They found that a was significantly higher for the females than for the males. Across all frequencies studied:

         mean     s.d.
Females  0.3053   0.0561
Males    0.2218   0.0443

This gives us a female-male difference of 0.0835 — in proportional terms about 1.38 to 1 — and an effect size of around 1.65 standard deviations, which is quite large.
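For anyone who wants to check these figures, here is a minimal Python sketch that recomputes them from the summary statistics in the table. The pooled-standard-deviation formula is my assumption about how to form the effect size; the post does not say exactly how "around 1.65" was obtained, and averaging the two standard deviations instead gives a very similar number. The variable names are purely illustrative. Note also that because n is proportional to a, the unknown constant k cancels in the ratio, so the female-to-male ratio of a is also the implied ratio of n.

```python
import math

# Summary statistics for the slope a, copied from the table above.
n_f, mean_f, sd_f = 8, 0.3053, 0.0561   # females
n_m, mean_m, sd_m = 7, 0.2218, 0.0443   # males

diff = mean_f - mean_m      # female-male difference in mean slope
ratio = mean_f / mean_m     # equals the implied ratio of exponents n, since n is proportional to a

# Assumption: standard two-sample pooled standard deviation.
pooled_sd = math.sqrt(
    ((n_f - 1) * sd_f**2 + (n_m - 1) * sd_m**2) / (n_f + n_m - 2)
)
cohens_d = diff / pooled_sd  # effect size in standard-deviation units

print(f"difference = {diff:.4f}")
print(f"ratio      = {ratio:.2f}")
print(f"effect size (pooled SD) = {cohens_d:.2f}")
```

With only 15 subjects, the uncertainty around any such effect-size estimate is wide, which is the point of the caution that follows.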

But before taking this estimate seriously, we need to worry about the small sample size and about the inference from variability slope to loudness exponent. As the authors say, "a larger sample would be required to generalize firmly to the population of all males and females". This is partly because estimates derived from such a small sample are sure to be unstable, even if there is no non-sampling error. But there might very well be non-sampling errors — the authors assume that their male and female subjects didn't differ in any other relevant ways, such as the strategy that they used to perform the task, or how hard they tried; and with such a small number of subjects, it wouldn't take much to invalidate this assumption (say, if three of the seven male subjects happen to have been roommates who over-partied the night before).

And even if the difference is really a perceptual one rather than a difference in bias or attention to task, we need to try to figure out what such a difference might mean about loudness perception in more concrete terms. As Sagi et al. put it,

We obtain values of a experimentally, not values of n. However, we can demonstrate that n-values are equal to some constant, nonnegative multiple of a. Since in our data a-values are greater in females than in males, we may conclude that n-values also are greater in females than in males, at least under these conditions for these participants. We have not calculated the constant of proportionality, but irrespective of its value, the primary conclusion remains: In our study, the derived exponent n emerges as being greater in females than in males.

Fair enough, though remember that a is based on measured variability in judgment, which might well be affected by non-perceptual factors like response bias or degree of attention; and even under the most favorable interpretation, we can't conclude anything about the magnitude of the effect on loudness perception. They continue:

This result is in accord with the inferences of McGuinness (1974). In studying gender differences in loudness tolerance, she noted that females have a mean maximum comfort level 8 dB lower than that of males.

That's D. McGuinness, "Equating individual differences for auditory input", Psychophysiology 11: 113-120, 1974. But experiments of that kind give a wide variety of different answers, especially when sample sizes are small. Also, the details of a study's design can have very large and surprising results. One example worth discussing is Ventry, I. M., Woods, R. W., Rubin, M., & Hill, W., "Most Comfortable Loudness for Pure Tones, Noise, and Speech", Journal of the Acoustical Society of America, 49(6B), 1971. Their overall conclusion about the male/female effects on "most comfortable loudness" (MCL):

Males and females were included for study because of the paucity of data on the performance of males and females on a suprathreshold loudness tracking task such as MCL. What few data are available for speech MCL are contradictory (Kavanaugh, 1960; Kopra and Blosser, 1968). In addition, our personal experiences suggested that males would tend to track MCLs at higher levels than females. Our major finding here, however, was that there were no significant differences in MCLs tracked by females and males.

However, they add this: "There was one statistically significant interaction involving sex in the speech-MCL study." And it was a really surprising one. In the test of comfortable loudness levels for speech, they used two versions of the instructions, with and without the words in brackets and bold face:

The purpose of the next test is to find and maintain a loudness level at which speech is most comfortable to listen to [and where you can easily understand everything that is being said]. This switch enables you to make the speech louder or softer. Using this switch, your task is to make the speech louder or softer until you reach a level which you feel is your most comfortable listening level [at which you can easily understand everything that is being said]. When you reach this level continue to make speech louder or softer, maintaining the speech at your most comfortable listening level for as long as you hear the broadcast.

[Choose this level by presuming that you must listen to the broadcast to obtain information from it. It is important for you to find a level where it is both comfortable and where it is easy to understand everything that is being said. At the end of this session, you will be given a written test based on the information in this broadcast.]

Are there any questions? Remember, the purpose of this test is to find and maintain a level which is a comfortable level of loudness [at which you can easily understand everything that is being said].

In other words, they did or did not emphasize that the subjects should pay attention to the content of the news broadcast they were listening to. And the result of this change in instructions? Without the instruction to pay attention, the females' MCL was 9.2 dB lower than the males', but with the instruction, it was 9.7 dB higher:

        With instruction   Without instruction
Male    47.9 (10.1)        50.4 (13.1)
Female  57.6 (12.5)        41.2 (9.4)

(there were 16 subjects in each cell; 64 subjects overall in this part of the study)
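To make the size of the reversal concrete, here is a small Python sketch that computes the female-minus-male difference in each condition, and the interaction contrast (the difference between those differences), from the cell means in the table. The assignment of columns to conditions is my inference from the text, not something the table itself states: I assume the first column is the condition with the comprehension instruction. The dictionary layout and names are only illustrative.

```python
# Mean speech MCL in dB, copied from the table above.
# Assumption: first column = WITH the comprehension instruction,
# second column = WITHOUT it (inferred from the surrounding text).
mcl = {
    "male":   {"with": 47.9, "without": 50.4},
    "female": {"with": 57.6, "without": 41.2},
}

# Female-minus-male difference within each instruction condition.
diff_with = mcl["female"]["with"] - mcl["male"]["with"]
diff_without = mcl["female"]["without"] - mcl["male"]["without"]

# Interaction contrast: how much the sex difference shifts between conditions.
interaction = diff_with - diff_without

print(f"female - male, with instruction:    {diff_with:+.1f} dB")
print(f"female - male, without instruction: {diff_without:+.1f} dB")
print(f"shift in the sex difference:        {interaction:+.1f} dB")
```

The sign of the sex difference flips between the two conditions, which is what makes a purely sensory explanation hard to sustain.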

Here's all the authors can find to say about this:

These differences are not easy to explain. It may be that such factors as achievement motivation (Brown, 1965), life style, cognitive approach, and work attitudes all played a part in the result. It may also be that the type of subject used, primarily college and university students, had an important role to play, especially when the task involved not only listening, but being tested on the subject matter as well.

In other words, they're baffled; and they think that perhaps the females were more worried about getting the content right, when they were given reason to believe that this was expected of them. Whatever was going on, you can't explain it just in terms of the view that women have a larger exponent in their Stevens-law loudness scaling.

Let's turn to a more recent study that looks at loudness perception both more directly and also in a larger context: Ellermeier et al., "Psychoacoustic correlates of individual noise sensitivity", Journal of the Acoustical Society of America, 109(4): 1464-1473, 2001. The abstract:

In environmental noise surveys, self-reported noise sensitivity, a stable personality trait covering attitudes toward a wide range of environmental sounds, is a major predictor of individual noise-annoyance reactions. Its relationship to basic measures of auditory functioning, however, has not been systematically explored. Therefore, in the present investigation, a sample of 61 unselected listeners was subjected to a battery of psychoacoustic procedures ranging from threshold determinations to loudness scaling tasks. No significant differences in absolute thresholds, intensity discrimination, simple auditory reaction time, or power-function exponents for loudness emerged, when the sample was split along the median into two groups of "low" vs "high" noise sensitivity on the basis of scores obtained from a psychometrically evaluated questionnaire [Zimmer and Ellermeier, Diagnostica 44, 11–20 (1998)]. Small, but systematic differences were found in verbal loudness estimates, and in ratings of the unpleasantness of natural sounds, thus suggesting that self-reported noise sensitivity captures evaluative rather than sensory aspects of auditory processing.

They tested 61 (German) college-student volunteers, 33 female and 28 male. The authors measured "absolute and difference thresholds" directly, and found no sex difference; they measured "magnitude estimation of loudness" directly, and found no sex difference; they measured "loudness category scaling" directly, and found a small sex difference at low sound levels, but none at higher sound levels. (See their section on "Gender effects" for details.) They did find a sex difference in self-reported noise-sensitivity, but they observe that it appears to be a peculiarity of their sample, since no similar effect was found in many other studies, by them and by others:

An unusual finding related to the present sample is that we found a significant majority (roughly two-thirds) of women in the noise-sensitive group (see Sec. III A). This is atypical, both for research published by other investigators, who found no effects of sex on noise sensitivity (Moreira and Bryan, 1972; Weinstein, 1978; Taylor, 1984), and for a vast amount of data collected in our own laboratory. In four different samples, three of which were drawn from a similar student population, and all of which consisted of a far greater number of participants (ranging between 117 and 213), we never found a significant effect of gender (Zimmer and Ellermeier, 1997, 1998a, 1998b). Therefore, we tend to interpret the gender imbalance found in the present investigation as a peculiarity of that particular sample.

Later in his letter, Dr. Sax observes that

On your page attacking the way in which I presented sex differences in auditory processing (I am referring to, you neglect to mention that I cited THREE studies documenting these sex differences (Why Gender Matters, reference 13 for chapter 2, p. 279):

1. Cone-Wesson & Ramirez, 1997;
2. Sininger, Cone-Wesson, and Abdala, 1998
3. Chiarenza, D'Ambrosio, and Cazzullo, 1988

The findings in the first and third references suggest that girls have lower thresholds (as measured by ABR) than boys have, at higher frequencies. The second reference — Sininger, Cone-Wesson, and Abdala — does not show the same finding, but instead shows mostly an overlap in thresholds. I wrote to Barbara Cone-Wesson in 2004, while preparing chapter 2 of Why Gender Matters, asking her to explain why her paper with Ramirez showed a result so different from the result with Sininger, Cone-Wesson, and Abdala, but she declined to respond, saying that she was too busy. You discuss only Sininger, Cone-Wesson, and Abdala.

That's because life is short, and so I picked the most recent reference from his list. He enclosed a fourth, even more recent study on sex differences in infant hearing, Eric Berninger, "Characteristics of normal newborn transient-evoked otoacoustic emissions: Ear asymmetries and sex effects", International Journal of Audiology 46(11): 661-669, 2007. Rather than go over again the various results of the other two earlier studies, I'll discuss this one; a similar set of remarks will apply to the others, although of course the details vary.

The executive summary:

This paper doesn't help Dr. Sax's case for large sex differences in hearing. It shows that there is a very small difference, on average, between male and female newborns, in a measurement whose quantitative relevance to auditory acuity and perceived loudness is weak at best.

The difference between male and female means is less than 1/6 of the (within-sex) standard deviation of female values or male values. The correlation of the measured quantity with audiogram measures of hearing threshold is between -0.1 and -0.5, i.e. rather weak.

The details follow, if you care.

Berninger's paper deals with measurements of "transient-evoked otoacoustic emissions" (TEOAEs) in newborns. This is a simple, noninvasive test for hearing defects in infants. A click, a tone burst or other stimulus is played in the infant's ear; if the peripheral auditory system is working normally, the feedback to the cochlea via the outer hair cells causes a sound to be generated in response. This "otoacoustic emission" is measured. If there is no EOAE at all, then the cochlea is probably not functioning normally.

These emissions vary in amplitude from case to case. In the statistics collected in this large study, there was a significant difference between left and right ears: the emissions in the right ears were (on average) about a decibel higher than those in the left ears.

There was also a statistically significant sex difference: the emissions from girls' ears were on average about a decibel greater in amplitude than those from boys' ears.

A few brief comments:

1) These differences in averages are small in absolute terms. The JND (just noticeable difference) for sound intensity, over most of the frequency range, is about one decibel.

2) The distributions are highly overlapped, and the "effect size" (the difference in means expressed in terms of standard deviations) is fairly small. Thus measured at 1414 Hz, the mean TEOAE in the right ear of female infants was 7.1 dB SPL, and in the right ear of male infants it was 6.4 dB SPL.

Since there were N=12,849 male measurements, and N=12,348 female measurements in this case, and the "standard error" (the standard deviation divided by the square root of N) in each case was 0.06 dB, the standard deviation was about 0.06*sqrt(12,500) = 6.7.

Thus the difference in means (0.7 dB at this frequency; about 1 dB if we use the overall average) divided by the standard deviation of 6.7 dB gives an effect size of at most about 1/6.7 = 0.15. This number is known as "Cohen's d", and the standard scale for such things says that "0.2 is indicative of a small effect, 0.5 a medium and 0.8 a large effect size". (To see what overlapping distributions with an effect size of about 0.15 are like, take a look at "Gabby Guys: the effect size", 9/23/2006.)

[If you look at Berninger's Fig. 5, you'll see that over the range most relevant to speech (300 to 3000 Hz) the female-male difference ranges from 0 to about 1.3 dB. It's a bit larger (about 2.3 dB) at 4kHz. The effect sizes will be similar throughout this range.]

Berninger writes: "TEOAEs show large inter-individual variability … according to the large variability of TEOAEs, sex and ear should not be recommended as decision parameters in NHS-programmes." That is, the sex (and ear) differences are so small, relative to the variation across individuals, that screening programs should not bother to treat the sexes and the ears differently in deciding when to flag an infant as possibly hearing-impaired.

3) The relationship between the intensity of evoked otoacoustic emissions and perceived loudness is murky, to say the least. Thus according to Wagner and Plinkert, "The relationship between auditory threshold and evoked otoacoustic emissions", European Archives of Oto-Rhino-Laryngology, Volume 256, Number 4 / April, 1999:

"Correlation between TEOAE amplitudes and HL was in general rather low (r = -0.1 to -0.5)." ["HL" is "hearing loss", i.e. increased audiometric thresholds].

The intensity of TEOAE depends on the gain of the feedback loop between the inner hair cells (which communicate basilar membrane motion to the central nervous system) and the outer hair cells (which adjust the basilar membrane based on CNS feedback). This can vary for all kinds of reasons that need not have any connection to auditory acuity or perceived loudness.
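For readers who want to check the arithmetic in point 2) and the correlation range in point 3), here is a minimal Python sketch. It recovers the standard deviations from the reported standard error and sample sizes, forms the effect size for both the 0.7 dB difference at 1414 Hz and a round 1 dB difference, and squares the quoted correlations to get the share of variance they account for. Averaging the male and female standard deviations is a simplification of mine, not something taken from Berninger's paper, and the variable names are illustrative.

```python
import math

se = 0.06                            # reported standard error of each mean, in dB
n_male, n_female = 12849, 12348      # number of measurements
mean_female, mean_male = 7.1, 6.4    # dB SPL, right ear, 1414 Hz

# Standard deviation recovered from the standard error: SD = SE * sqrt(N).
sd_male = se * math.sqrt(n_male)
sd_female = se * math.sqrt(n_female)

# Assumption: use the simple average of the two SDs as the common scale.
sd_common = (sd_male + sd_female) / 2

d_at_1414 = (mean_female - mean_male) / sd_common   # effect size for the 0.7 dB difference
d_one_db = 1.0 / sd_common                          # effect size for a round 1 dB difference

# Share of variance accounted for by correlations of -0.1 and -0.5.
r_squared_range = [round(r**2, 2) for r in (-0.1, -0.5)]

print(f"SD (male)   = {sd_male:.2f} dB")
print(f"SD (female) = {sd_female:.2f} dB")
print(f"d at 1414 Hz = {d_at_1414:.2f}")
print(f"d for 1 dB   = {d_one_db:.2f}")
print(f"r^2 range    = {r_squared_range}")
```

Either way the effect size falls below the conventional threshold of 0.2 for a "small" effect.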

Dr. Sax again:

Perhaps, as you argue, sex differences in auditory acuity are too small to have any functional significance. In that case, we need some other hypothesis to account for the findings mentioned above, e.g. that 18-month-old girls have substantially larger vocabulary than 18-month-old boys, and that soft music has dramatic beneficial effects for premature baby girls but not for premature baby boys. Or, you could dispute those empirical findings. But merely rejecting the hypothesis of differential auditory acuity, while failing to offer any alternative hypothesis, leaves your argument incomplete.

But I wasn't making an argument about the effects of music on premature babies, or about vocabulary growth in the first few years of life. Rather, I was making an argument about Dr. Sax's rhetorical style, which (I argued) often involves exaggerated or misinterpreted presentation of scientific research in support of a public-policy recommendation.

I haven't read the research on music and premature babies; I don't see that it has a lot of impact on the question of single-sex schooling, but I'm willing to be persuaded. On the subject of vocabulary development, it's absolutely true that 18-month-old girls have (on average) more than double the vocabulary of 18-month-old boys, according to one of the studies that I surveyed in another 2006 post on misuse of scientific evidence ("The main job of the girl brain", 9/2/2006). Here's what I wrote about it then:

[A]ccording to Svetlana Lutchmaya, Simon Baron-Cohen and Peter Raggat, "Foetal testosterone and vocabulary size in 18- and 24-month infants", Infant Behavior and Development, 24:418-424 (2002), in a sample of 18-month-olds,

For boys, vocabulary size ranged from 0.0 to 222.0, M = 41.8 (SD = 50.1). For girls vocabulary size ranged from 2.0 to 318.0, M = 86.8 (SD = 83.2).

while at 24 months,

For boys, vocabulary size ranged from 0.0 to 414.0, M = 196.8 (SD = 126.8). For girls vocabulary size ranged from 15.0 to 415.0, M = 275.1 (SD = 121.6).

Thus at 18 months, the girls' average vocabulary was roughly double the boys' average. Given the cited variation, if we were to pick a girl and boy at random from their sample, we would expect the girl to have a larger vocabulary about 68 times out of a hundred, while the boy would have the larger vocabulary about 32 times. At 24 months, the girls' average vocabulary was only about 40% greater. The betting odds haven't changed much, though — a random girl would have a larger vocabulary than a random boy about 67 times out of a hundred.
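The betting odds in the paragraph above can be approximated with a short Python sketch. It assumes that vocabulary sizes in each group are independent and normally distributed with the reported means and standard deviations. That is a rough assumption on my part (the boys' standard deviation exceeds their mean, so the real distributions must be skewed), and nothing here states which method produced the original figures. The function name is hypothetical.

```python
import math

def prob_girl_larger(mean_g, sd_g, mean_b, sd_b):
    """P(random girl > random boy), assuming independent normal distributions."""
    # The difference of two independent normals is normal with
    # mean (mean_g - mean_b) and variance (sd_g**2 + sd_b**2).
    z = (mean_g - mean_b) / math.sqrt(sd_g**2 + sd_b**2)
    # Standard normal CDF evaluated at z, via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p18 = prob_girl_larger(86.8, 83.2, 41.8, 50.1)       # 18 months
p24 = prob_girl_larger(275.1, 121.6, 196.8, 126.8)   # 24 months

print(f"18 months: girl larger about {100 * p18:.0f} times in 100")
print(f"24 months: girl larger about {100 * p24:.0f} times in 100")
```

Under these assumptions the odds barely move between 18 and 24 months, even though the ratio of the means shrinks considerably.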

And point #5 (that boys catch up in vocabulary) is also true, as Table 6 from Hyde 1988 shows. [Table not reproduced here.]

But the standard hypotheses about this difference have to do with differences in CNS development, not in hearing acuity.

In my opinion, there's no doubt that there are group differences between the sexes in hearing. But there are two points in contention. The first question is whether these differences are so large and so consistent — especially in relation to variation among boys and among girls — as to create a significant problem in mixed-sex classrooms. My somewhat-informed opinion is that this is very unlikely to be true. In fact, I think that the word "preposterous" might be appropriate. The second question is whether the presentation of scientific evidence in Dr. Leonard Sax's book Why Gender Matters can be trusted. On the basis of a careful examination of an admittedly small sample of cases, my opinion is that the answer is "no".
