Replication Rumble
In other non-replication news lately: There's been a pretty kerfuffle this month in social psychology and science blogging corners over a recent failure to replicate a classic 1996 study of automatic priming by John Bargh, Mark Chen, and Lara Burrows. The non-replication drew the attention of science writer Ed Yong, who blogged about it over at Discover, and, naturally, of John Bargh, who elected to write a detailed and distinctly piqued rebuttal at Psychology Today.
The original paper reported three experiments; the one that's the target of controversy used a task in which subjects unscramble lists of words and isolate one word in the list that doesn't fit into the resulting sentence. The Bargh et al. study showed that when the experimental materials contained words that were associated with stereotypes of the elderly (e.g. Florida, bingo, gray, cautious), subjects walked more slowly down the hall upon leaving the lab compared to subjects who saw only neutral words. The result has been energetically cited, and has played no small role in spawning a swarm of experiments documenting various ways in which behavior can be impacted by situational or subliminal primes. The authors explained their findings by suggesting that when the concept of a social stereotype is activated (e.g. via word primes), this can prompt behaviors that are associated with that stereotype (e.g. slow walking).
But allegedly, despite scads of studies that have built on some of Bargh et al.'s conclusions, the slow-walker study has yet to be fully replicated, which motivated Stéphane Doyen and colleagues at the Université Libre de Bruxelles to undertake the job, reporting their attempts in a recently-published article in PLoS ONE. Their first experiment, which contained sober experimental precautions such as using automated timing systems and ensuring that experimenters were blind as to which conditions subjects were assigned to, failed to produce a priming effect. This led them to wonder in print whether the original priming results could have come from a failure to strictly implement double-blind experimental methods, which serves as the motivation for their second experiment.
The second study reported by Doyen et al. focused on whether an effect could be induced by specifically manipulating the experimenters' expectations of how the subjects would behave. Ten different experimenters were included; these experimenters were made aware of which of their subjects were assigned to the word prime condition, and which were assigned to the neutral word condition. However, half of the experimenters were led to expect that, when primed, their subjects would walk more slowly as a result of the experimental manipulation, and half were led to believe that their primed subjects would walk more quickly. (In reality, the primed subjects all received the same elderly-priming materials as in the first experiment, regardless of what the experimenters had been told.) The paper doesn't go into detail as to how these experimenter expectations were established, other than to report that all this took place during "a one hour briefing and persuasion session prior to the first participant's session." In addition, the experimenters had their expectations reinforced by the behavior of their very first study subject, who was a confederate in cahoots with the researchers and obligingly walked quickly or slowly, as expected.
Not surprisingly, when subjects' walking speed was measured by the experimenters themselves on a stopwatch, their pace aligned with expectations: subjects in the word prime condition were timed at faster speeds than those in the neutral word condition when the experimenters expected that priming would speed them up, and conversely, when experimenters expected priming to slow the subjects down, they timed them at slower speeds in the word prime condition relative to the neutral word condition. This wasn't the whole story, though—the subjects' actual speed was also timed by an automated motion-sensitive system. Objective measures of walking speed showed that when the experimenters expected priming to slow their subjects down, that's exactly what happened. But when they expected subjects to speed up as a result of the priming, there was no difference between the primed subjects and those in the neutral word condition.
This tells us that the actual walking speed of subjects isn't determined entirely by experimenters' expectations; if that were the case, subjects should have walked more quickly when expected to do so as a result of priming. But it does suggest that the priming effect can be either boosted or dampened by experimenter expectations, presumably because the experimenter is emitting subtle and possibly inadvertent cues that impact the subjects' behavior (it would have been interesting, for example, to measure the experimenters' speech rate).
The authors' take on all this is to conclude that:
although automatic behavioral priming seems well established in the social cognition literature, it seems important to consider its limitations. In line with our result it seems that these methods need to be taken as an object of research per se before using it can be considered as an established phenomenon.
I'm really not sure what the above statement actually means. But it certainly invites a first-blush response of the Ohmygosh-is-all-this-stuff-we-thought-we-knew-about-unconscious-behavioral-priming-wrong? variety. Still, it's worth waiting for that first flush to settle. Because in the end, the result in and of itself causes little trauma to the original Bargh et al. interpretation of their priming data, and none whatsoever to the more general issue of whether automatic behavioral priming exists.
First of all, the fact that experimenter expectations led to an effect on subjects' behavior doesn't mean that this accounts for the original Bargh et al. results. It just means that such expectations have a measurable impact on any priming effects that may or may not occur. To find otherwise would be rather surprising, especially given the rather heavy-handed way in which these expectations seem to have been induced. (Bargh has countered the paper by claiming that, in fact, their own study did implement double-blind methods; whether or not this was done rigorously enough, it certainly seems clear that the later Doyen et al. paper went to special lengths to create a salient experimenter bias above and beyond what would plausibly have existed in the earlier work.)
So what we're really left with is the issue of how to interpret the non-replication. There are a number of possible reasons for this, some of them really boring, some of them mildly interesting, but most of them unrelated to the important theoretical questions. For example:
1. The non-replication itself is an experimental failure. In experiments involving humans, all "replications" are at best approximate. Other unforeseen aspects of the experimental design and implementation may have obscured a priming effect or led to unusually noisy data. For example, maybe the Belgian experimenter was attractive to the point of distraction. Maybe more of the undergraduate subjects were tested in the morning while still sluggish. Maybe the experimenter was flaky and inconsistent in implementing the study. Obviously, if an effect is repeatedly vulnerable to these kinds of obliterations, that can speak to the fragility of the effect; but the point is that for any single failure to replicate, we can't tell for sure what the source of the non-replication is. Perfectly robust results can be and often are drowned in noise inadvertently introduced somewhere in the experimental procedure. We can simply document that the failure to replicate occurred, while noting (and further testing) any obvious discrepancies from the original implementation. (A toy simulation after this list illustrates how easily a genuine effect can be lost this way.)
2. The word primes may not have successfully triggered a stereotype for the elderly in the minds of the subjects, or the conceptual stereotype may not have had a strong association with slow walking movements. It's entirely conceivable that stereotypes would shift due to time or geographic location. A lot has happened demographically since 1991 when Bargh et al. first collected their data. Upon hearing about this study, for example, my own son remarked (referring to his alpine-skiing, Nepal-trekking grandmother): "Those subjects have obviously never met Nanny." In this case, there's no threat to Bargh's original theoretical contribution about the activation of social stereotypes as a driver of behavior; it's just that any given stereotype isn't going to be held by all populations.
3. There was nothing wrong with the stereotypes; the original result really was a statistical fluke, or an experimental artifact, or limited to a very narrow population or set of experimental circumstances. This eventuality is the most damaging to Bargh et al. But does it really threaten the more general conclusion that behavior can be unconsciously or automatically primed? No; it simply casts doubt on the more specific interpretation of the results as being due to the activation of social stereotypes. In fact, it's hard to interpret Doyen et al.'s second study, which manipulated experimenter expectations, without appealing to unconscious behavioral priming (as fairly pointed out by Ed Yong in his post). Unless the experimenters actually violated experimental ethics outright by instructing the subjects to walk more slowly, it seems likely that the subjects were picking up on cues (but which ones? Speech rate? Certain words?) unconsciously emitted by the experimenters. What's more, there are by now dozens, and quite possibly hundreds, of demonstrations of automatic priming effects using a variety of different experimental paradigms, some of which do involve the activation of stereotypes. (Some examples here and here.) Given that it's now 2012, not 1996, when the Bargh et al. paper first appeared, any non-replication of that original result is going to have to be interpreted within the context of that entire body of work.
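To make the first of these points a bit more concrete, here is a toy simulation, with entirely made-up numbers rather than anything taken from either paper: suppose the slowdown effect is perfectly real, on the order of a second, but walking times are measured noisily and each group contains only fifteen subjects.

    # Toy simulation with hypothetical numbers (not data from Bargh et al. or Doyen et al.):
    # a genuine 1-second slowdown, measured noisily on small groups.
    import random
    import statistics

    random.seed(1)

    def run_study(n_per_group, true_effect, noise_sd):
        """Return the observed difference in mean walking time (primed minus control), in seconds."""
        control = [random.gauss(7.0, noise_sd) for _ in range(n_per_group)]
        primed = [random.gauss(7.0 + true_effect, noise_sd) for _ in range(n_per_group)]
        return statistics.mean(primed) - statistics.mean(control)

    diffs = [run_study(n_per_group=15, true_effect=1.0, noise_sd=2.5) for _ in range(1000)]
    vanished = sum(d <= 0 for d in diffs)
    print(f"{100 * vanished / len(diffs):.0f}% of simulated replications show no slowdown at all")

Even with the effect stipulated to be real, a noticeable share of these simulated studies show the primed group walking no slower at all, and many more would fall short of statistical significance. This settles nothing about what happened in either lab, of course; it's just a reminder of how little a single small study, in either direction, can settle on its own.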
So. Hardly material to launch a full-scale kerfuffle. This is just science plodding its plodding way towards its plodding approximation of truth. Enough with the rubbernecking already—there are no bloody conclusions to be found here, at least not yet.
So why am I bothering to add my voice to the fray? Because I think that it's very important that we actually talk about replication, what it means and doesn't mean, and that we do so in a way that moves beyond thinking about it as a cagematch between scientists.
When I talk to non-scientists, I'm distressed by a general illiteracy in the understanding of non-replication. All too often, failures to replicate are treated as abrupt reversals of truth, as if any new result, especially a startling or counterintuitive one, were a declaration of truth rather than an opening gambit. New studies, whether they replicate the result or not, are simply the next moves that change the way the board is configured. Too often, a failure to replicate is portrayed as an instance of science "changing its mind" or as an indictment of the scientific method, when really, it's at the heart of the scientific method. When it comes down to it, the sound of non-replication isn't the sound of the puck being slapped into the opponent's net. It's the sound of a muttered "hmmm, what's going on here," the sound of science rolling up its sleeves with a sigh and settling in for a long night's work.
Joe said,
March 18, 2012 @ 4:18 am
I agree entirely. One problem is that in many fields it is still not standard practice to do a power analysis to determine the number of participants needed for a study. Failure to replicate might simply be due to the fact that the studies differed in power. It would be a good idea if effect sizes and confidence intervals, and not just p-values, were reported. (I'm not saying that this is the case here; it's more a general observation about the practice in some fields. Fortunately, it is changing, but the reporting of the studies hasn't caught up.)
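To put rough numbers on the power point above, here is a minimal sketch, using the textbook normal approximation for a two-sample comparison and Cohen's conventional effect-size benchmarks rather than anything estimated from these particular studies:

    # Approximate per-group sample size for a two-sample comparison,
    # n ≈ 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2   (normal approximation)
    from statistics import NormalDist

    def n_per_group(d, alpha=0.05, power=0.80):
        z = NormalDist().inv_cdf
        return 2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2

    for d in (0.2, 0.5, 0.8):  # Cohen's small / medium / large benchmarks
        print(f"d = {d}: roughly {n_per_group(d):.0f} subjects per group")

A study run with a dozen or so subjects per group is only well powered for quite large effects, so two studies of different sizes can easily come to different conclusions without either one being "wrong."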
Mark Liberman said,
March 18, 2012 @ 6:36 am
It's worth adding Ed Yong's follow-up at Discover: "A failed replication draws a scathing personal attack from a psychology professor", 3/10/2012; and an interesting discussion by Daniel Simons, "A primer for how not to respond when someone fails to replicate your work (with a discussion of why replication failures happen)", 3/8/2012.
In a comment on Simons' post, Art Markman makes a relevant and perhaps non-obvious point about the incentives associated with publishing replication failures.
There's other discussion by Alex Tabarrok at Marginal Revolution, Chris Shea at the WSJ's Ideas Market, Sanjay Srivastava at The Hardest Science, Robert Kurzban at Evolutionary Psychology, by Anonymous at Neurobonkers, …
An earlier (2008) failure to replicate the slow-walking findings (by Harold Pashler, Christine Harris, and Noriko Coburn; and apparently otherwise unpublished, for the usual reasons) can be found here, at psychfiledrawer.org ("Archive of Replication Attempts in Experimental Psychology").
See also Chris French, "Precognition studies and the curse of the failed replications", The Guardian 3/15/2012, which backs up Art Markman's comment with a detailed case study.
And, of course, there's the Barghinator, e.g. this.
Julie Sedivy said,
March 18, 2012 @ 3:20 pm
Yes, and more discussion here, by Matthew Lieberman, who makes some interesting remarks in his post.
I think it's hard to fully appreciate the hodgepodge of possible reasons for failed experiments unless you've been involved in the day-to-day grind of doing experimental work. I've never worked on these kinds of priming studies myself, but have certainly seen my share of failed psycholinguistics experiments, some of them precursors to stable results once de-bugged. What constitutes de-bugging? Sometimes it really can be as dull as the time of day subjects are tested; for example, in doing eyetracking experiments, we eventually developed a policy of never testing subjects before 11 am or late in the semester close to exams. Why? Subjects were often in a sleepy state, their eyelids would droop, and as a result, we'd suffer repeated, momentary losses of eye tracking data, and hence a failure to find statistically-reliable effects.
In self-paced reading studies, researchers have learned to avoid comparing reading times for critical phrases that occur at the ends of sentences. Why? Because reading times at the end of a sentence are especially long and variable; the resulting noise is apt to drown out any effects that would otherwise be easily observed. The point is that to successfully find clean evidence even for what eventually turn out to be robust, highly replicable effects, you have to have a good understanding of the specific measure you are using as a probe, and of the various factors that can push it around or make it less useful as a measure. This is just part of developing a decent set of lab skills. But it's a part of the scientific process that is not always made explicit (I'm not sure I've ever specifically reported in a paper, for instance, that we had certain "blackout" times on testing subjects).
That's not to say, of course, that non-replication doesn't deserve serious consideration. It does. It's just that finding the reasons for non-replication is often going to be even more slow and tedious than producing successful replications. In his post, Matthew Lieberman offers an interesting solution to this problem.
Jason Merchant said,
March 19, 2012 @ 3:53 pm
Speaking of trying to replicate a fairly well-known and much-discussed result in psycholinguistics, I recently decided that I wanted to use the stimuli that were used in an experiment whose results are described briefly in Boroditsky, Schmidt, and Phillips 2003 (published as a chapter in a book edited by Gentner and Goldin-Meadow, Language in Mind, MIT Press), and which Lera Boroditsky and others (including myself in class!) have cited in print and in presentations a lot since. The reported results in brief: when using English, German native speakers describe bridges with more "feminine" adjectives [as judged by a separate group of Eng-only speakers] than do Spanish native speakers (die Brücke vs el puente); vice versa for "key" (der Schlüssel vs la llave). Boroditsky et al 2003 use this as part of the evidence for a Whorfian interpretation of grammatical gender, etc., but that paper doesn't present the study itself (not one word of Spanish or German is even in it); instead, it cites a 2002 ms by the same authors (as "submitted for publication"). As far as I can tell, that ms was never published, and I have been unable to track it down (this includes two emails to Lera herself, both unanswered).
The stimuli themselves may have been taken from Konishi's 1993 paper in JPsycholingResearch, which includes two lists of 54 Ger/Span gender-mismatched word pairs, though you wouldn't know this from Boroditsky et al 2003 (where one reads only that "Boroditsky, Schmidt, and Phillips (2002) created a list of 24 object names that had opposite grammatical genders in Spanish and German…"; this is why I'd like to see this ms, where I assume more appropriate acknowledgment is made, if indeed the list is a proper subset of Konishi's word pairs). I myself have been working recently on gender features (in Greek and Spanish) and on German-Spanish bilingual grammatical effects (with Kay Gonzalez and Sergio Ramos of UIC), and so I'd really love to get the actual list of items used (and better yet, the experimental design, so I can replicate it, and fool with it for Greek). But so far, mum's the word: anyone know if this study indeed has been published, perhaps under a different title?
Jess Tauber said,
March 21, 2012 @ 7:22 pm
I'm getting too old to listen to such nonsense myself, but don't any of you slow down on my account. I repeat- don't slow down. I don't want to put the idea of slowing down into your head. Now go have a nice day.
JP de Ruiter said,
August 31, 2012 @ 7:52 am
I cannot find the rebuttal (of the Doyen et al. study) by Bargh anymore. All the different links to it appear to be broken (perhaps Psychology Today removed it?). Can anyone tell me where I can find this rebuttal? Thanks. JP