There's an ongoing argument about the interpretation of Katherine Baicker et al., "The Oregon Experiment — Effects of Medicaid on Clinical Outcomes", NEJM 5/2/2013, and one aspect of this debate has focused on the technical meaning of the word significant. Thus Kevin Drum, "A Small Rant About the Meaning of Significant vs. 'Significant'", Mother Jones 5/13/2013:
Many of the results of the Oregon study failed to meet the 95 percent standard, and I think it's wrong to describe this as showing that "Medicaid coverage generated no significant improvements in measured physical health outcomes in the first 2 years."
To be clear: it's fine for the authors of the study to describe it that way. They're writing for fellow professionals in an academic journal. But when you're writing for a lay audience, it's seriously misleading. Most lay readers will interpret "significant" in its ordinary English sense, not as a term of art used by statisticians, and therefore conclude that the study positively demonstrated that there were no results large enough to care about.
Many past LL posts have dealt with various aspects of the rhetoric of significance. Here are a few:
"The secret sins of academics", 9/16/2004
"The 'Happiness Gap' and the rhetoric of statistics", 9/26/2007
"Gender-role resentment and the Rorschach-blot news reports", 9/27/2007
"The 'Gender Happiness Gap': Statistical, practical and rhetorical significance", 10/4/2007
"Listening to Prozac, hearing effect sizes", 3/1/2008
"Localization of emotion perception in the brain of fish", 9/18/2009
"Bonferroni rules", 4/6/2011
"Response to Jasmin and Casasanto's response to me", 3/17/2012
"Texting and language skills", 8/2/2012
But Kevin Drum's rant led me to take another look at the lexicographic history of the word significant, and this in turn led me back to the Oregon Experiment — via a famous economist's work on the statistics of telepathy.
The OED's first sense for significant has citations back to 1566, and in this sense, being significant is a big deal: something that's significant is "Highly expressive or suggestive; loaded with meaning". This is the ordinary-language sense that makes the statistical usage so misleading, because a "statistically significant" result is often not really expressive or suggestive at all, much less "loaded with meaning".
The OED gives a second sense, almost as old, that is much weaker, and is probably the source of the later statistical usage: "That has or conveys a particular meaning; that signifies or indicates something". Not necessarily something important, mind you, just something — say, in the modern statistical sense, that a result shouldn't be attributed to sampling error. A couple of the OED's more general illustrative examples:
1608 E. Topsell Hist. Serpents 48 Their voyce was not a significant voyce, but a kinde of scrietching.
1936 A. J. Ayer Lang., Truth & Logic iii. 71 Two symbols are said to be of the same type when it is always possible to substitute one for the other without changing a significant sentence into a piece of nonsense.
And then there's an early mathematical sense (attested from 1614):
Math. Of a digit: giving meaningful information about the precision of the number in which it is contained, rather than simply filling vacant places at the beginning or end. Esp. in significant figure, significant digit. The more precisely a number is known, the more significant figures it has.
The OED gives a few additional senses that are not strikingly different from the first two: "Expressive or indicative of something"; "Sufficiently great or important to be worthy of attention; noteworthy; consequential, influential"; "In weakened sense: noticeable, substantial, considerable, large".
And then we get to (statistically) significant:
5. Statistics. Of an observed numerical result: having a low probability of occurrence if the null hypothesis is true; unlikely to have occurred by chance alone. More fully statistically significant. A result is said to be significant at a specified level of probability (typically five per cent) if it will be obtained or exceeded with not more than that probability when the null hypothesis is true.
In an earlier post, I characterized sense 5. as "[a]mong R.A. Fisher's several works of public-relations genius". However, I gave Sir Ronald too much credit, as the OED's list of citations shows:
1885 Jrnl. Statist. Soc. (Jubilee Vol.) 187 In order to determine whether the observed difference between the mean stature of 2,315 criminals and the mean stature of 8,585 British adult males belonging to the general population is significant [etc.].
1907 Biometrika 5 318 Relative local differences falling beyond + 2 and − 2 may be regarded as probably significant since the number of asylums is small (22).
1925 R. A. Fisher Statist. Methods iii. 47 Deviations exceeding twice the standard deviation are thus formally regarded as significant.
1931 L. H. C. Tippett Methods Statistics iii. 48 It is conventional to regard all deviations greater than those with probabilities of 0·05 as real, or statistically significant.
In 1885, Sir Ronald's birth was still five years in the future. So who came up with this miracle of mathematical marketing? It seems that the credit is due to Francis Ysidro Edgeworth, better known for his contributions to economics. The OED's 1885 citation for (statistically) significant is to his paper "Methods of Statistics", Journal of the Statistical Society of London , Jubilee Volume (Jun. 22 – 24, 1885), pp. 181-217:
The science of Means comprises two main problems: 1. To find how far the difference between any proposed Means is accidental or indicative of a law? 2. To find what is the best kind of Mean; whether for the purpose contemplated by the first problem, the elimination of chance, or other purposes? An example of the first problem is afforded by some recent experiments in so-called "psychical research." One person chooses a suit of cards. Another person makes a guess as to what the choice has been. Many hundred such choices and guesses having been recorded, it has been found that the proportion of successful guesses considerably exceeds the figure which would have been the most probable supposing chance to be the only agency at work, namely 1/4. E.g., in 1,833 trials the number of successful guesses exceeds 458, the quarter of the total number, by 52. The first problem investigates how far the difference between the average above stated and the results usually obtained in similar experience where pure chance reigns is a significant difference; indicative of the working of a law other than chance, or merely accidental.
So the first use in print of "(statistically) significant" was in reference to an argument for telepathy!
Edgeworth gives no detailed analysis of the "Psychical Research" data in this article, though he notes that
we have several experiments analogous to the one above described, all or many of them indicating some agency other than chance.
But he went over the issue in great detail in a paper published in the same year ("The calculus of probabilities applied to psychical research", Proceedings of the Society for Psychical Research, Vol. 3, pp. 190-199, 1885), which begins with an inspiring (though untranslated) quote from Laplace ("Théorie Analytique des probabilités", 1812):
"Nous sommee si éloignés de connaître tous les agents de la nature qu'il serait peu philosophique de nier l'existence de phenomènes, uniquement parcequ'ils sont inexplicables dans l'état actuel de nos connaissances. Seulement nous devons les examiner avec une attention d'autant plus scrupuleuse, qu'il parait plus difficile de les admettre; et c'est ici que l'analyse des probabilités devient indispensable, pour determiner jusqu'à quel point il faut multiplier les observations ou les expériences, pour avoir, en faveur de l'existence des agents qu'elles semblent indiquer, une probability supérieure à toutes les raisons que l'on peut avoir d'ailleurs, de la réjéter."
"We are so far from knowing all the agencies of nature that it would hardly be philosophical to deny the existence of phenomena merely because they are inexplicable in the current state of our knowledge. We must simply examine them with more careful attention, to the extent that it seems more difficult to explain them; and it's here that the analysis of probabilities becomes essential, in order to determine to what point we must multiply observations or experiments, in order to have, in favor of the existence of the agencies that they seem to indicate, a higher probability than all the reasons that one can otherwise have to reject them."
Edgeworth then undertakes to show that telepathy is real:
It is proposed here to appreciate by means of the calculus of probabilities the evidence in favour of some extraordinary agency which is afforded by experiences of the following type: One person chooses a suit of cards, or a letter of the alphabet. Another person makes a guess as to what the choice has been.
After considerable application of somewhat complex mathematical reasoning, e.g. the passage below, Edgeworth concludes that the probability of obtaining the cited results by chance — 510 correct card-suit guesses out of 1833 tries — is 0.00004.
Even if we grant his premises, Edgeworth's estimate seems to have been quantitively over-enthusiastic — R tells us that the probability of obtaining the cited result by chance, given his model, should be a bit greater than 0.003:
binom.test(458+52,1833,p=0.25,alternative="greater") Exact binomial test data: 458 + 52 and 1833 number of successes = 510, number of trials = 1833, p-value = 0.003106 alternative hypothesis: true probability of success is greater than 0.25
But the p-value calculated according to Edgeworth's model — whether it's .00004 or .003 — is not an accurate estimate of the probability of getting the cited number of correct guesses by chance, in an experiment of the cited type. That's because his model might well be wrong, and there are plausible alternatives in which successful guesses are much more likely.
Recall his description of the experiment:
One person chooses a suit of cards, or a letter of the alphabet. Another person makes a guess as to what the choice has been.
He assumes that the chooser picks among the four suits of cards with equal a priori probability; and that the guesser, if guessing by chance, must do the same. But suppose that they both prefer one of the four suits, say hearts? If the chooser always chooses hearts, and the guesser always guesses hearts, then perfect psychical communication will appear to have taken place.
More subtly, we only need to assume a slight shared bias for better-than-chance results to emerge. And the bias need not be shared in advance of the experiment — since the guesser learns the true choice after each guess, he or she has plenty of opportunity to estimate the chooser's bias, and to start to imitate it. In other words, this is really not a telepathy experiment, it's the world's first Probability Learning experiment! (See "Rats beat Yalies", 12/11/2005, for a description of this experimental paradigm.)
If the result is equivalent to a shared uneven distribution over the four suits, then things are very different. With shared probabilities of 0.34, 0.33, 0.17, 0.16, for example, the probability of getting at least 510 correct guesses in 1833 trials is about 55%. And I assert without demonstration that such an outcome could easily emerge from a probability-learning process, without any initial shared bias.
My point here is not to debunk psychical research, but to observe that here as elsewhere, it's important to pay attention to the the details of the model, the data, and the outcome, rather than just looking at the p value.
And this brings us back to the contested paper, Katherine Baicker et al., "The Oregon Experiment — Effects of Medicaid on Clinical Outcomes", NEJM 5/2/2013.
Since this post has already gone on too long, I'll try to make this fast (for the two of you who are still reading…) Here's the background:
In 2008, Oregon initiated a limited expansion of its Medicaid program for low-income adults through a lottery drawing of approximately 30,000 names from a waiting list of almost 90,000 persons. Selected adults won the opportunity to apply for Medicaid and to enroll if they met eligibility requirements. This lottery presented an opportunity to study the effects of Medicaid with the use of random assignment.
Our study population included 20,745 people: 10,405 selected in the lottery (the lottery winners) and 10,340 not selected (the control group).
They were able to interview 12,229 people, 6387 lottery winners and 5842 among the controls, about two years after the lottery.
But it's not quite as simple as that:
Adults randomly selected in the lottery were given the option to apply for Medicaid, but not all persons selected by the lottery enrolled in Medicaid (either because they did not apply or because they were deemed ineligible).
A more detailed picture of the lottery process is given in the paper's Supplementary Appendix:
In total, 35,169 individuals—representing 29,664 households—were selected by lottery. If individuals in a selected household submitted the appropriate paperwork within 45 days after the state mailed them an application and demonstrated that they met the eligibility requirements, they were enrolled in OHP Standard. About 30% of selected individuals successfully enrolled. There were two main sources of slippage: only about 60% of those selected sent back applications, and about half of those who sent back applications were deemed ineligible, primarily due to failure to meet the requirement of income in the last quarter corresponding to annual income below the poverty level, which in 2008 was $10,400 for a single person and $21,200 for a family of four.
In other words, we would expect that only about 30% of the lottery winners in this paper's sample were actually enrolled in the insurance program. Thus the lottery selection was random, but the enrollment step for lottery winners was not: and the various reasons for failing to get insurance are presumably not neutral with respect to health status and outcomes.
When I first began reading this paper, this seemed to me to constitute a huge source of non-sampling error, since the lottery winners who actually enrolled may constitute a very different group from the lottery participants as a whole.
However, it turns out that this doesn't matter. Although the paper's title announces itself as a study of the "Effects of Medicaid on Clinical Outcomes", the authors did not compares the outcomes of those who were actually enrolled against the outcomes of those who were not. Instead, they compared the outcomes of those who won the lottery—and thus were given the opportunity to try to enroll—against the outcomes of those who didn't win the lottery (even though some of these did have health insurance anyhow). So realistically, it's a study of the "Effects of an Extra Opportunity to Try to Enroll in Medicaid on Clinical Outcomes".
Of the 6,387 survey responders who were lottery winners, 1,903 (or 29.8%) actually enrolled in Medicaid at some point. A reasonable number of the control group had OHP or Medicaid insurance as well, and by the end of the period of the study, the differences between the two groups in enrolled in proportion enrolled in public insurance had nearly vanished (we're given no information about how many might have had employer-provided insurance, but presumably the proportion was small):
Of course, the study's authors set up their model to compensate for this situation:
The subgroup of lottery winners who ultimately enrolled in Medicaid was not comparable to the overall group of persons who did not win the lottery. We therefore used a standard instrumental-variable approach (in which lottery selection was the instrument for Medicaid coverage) to estimate the causal effect of enrollment in Medicaid. Intuitively, since the lottery increased the chance of being enrolled in Medicaid by about 25 percentage points, and we assumed that the lottery affected outcomes only by changing Medicaid enrollment, the effect of being enrolled in Medicaid was simply about 4 times (i.e., 1 divided by 0.25) as high as the effect of being able to apply for Medicaid. This yielded a causal estimate of the effect of insurance coverage.
And there was indeed a period of differential enrollment proportions between between the lottery winners and losers, as well a period of different enrollment-opportunity proportions, and so it's plausible to look for effects across the sets of winners and losers as a whole.
But the main physical health outcomes that the study examined (blood pressure, cholesterol, glycated hemoglobin, etc.) are age- and lifestyle-related measures that are not very likely to be seriously influenced by a short period of differential access to insurance. The things that were (both statistically and materially) influenced — self-reported health-related quality of life, out-of-pocket medical spending, rate of depression — are much more plausible candidates to show the impact of having a brief opportunity to get insurance access.
And as in F. Y. Edgeworth's analysis of card-suit guessing, the p values in the regression are not as important as the details of what the observations were, and what forces plausibly shaped them.
[Note: For more on the subsequent history of psychic statistics, see Jessica Utts, "Replication and Meta-Analysis in Parapsychology", Statistical Science 1991.)