When 90% is 32%


I've occasionally complained that when it comes to comparing sampled distributions, modern western intellectuals are mostly just as clueless as the members of the Pirahã tribe in the Amazon are said to be with respect to counting (see e.g. "The Pirahã and us", 10/6/2007). And it doesn't take high-falutin concepts like "variance" or "effect size" to engage this incapacity — simple percentages are often enough.

I discussed one example a few days ago: the coverage of news about a new blood test for Alzheimer's ("(Mis-) Interpreting medical tests", 3/10/2014). In that post, I cited an article by John Gever ("Is Blood Test for Alzheimer's Disease Oversold?", MedPage Today 3/10/2014). Today, Gever has a follow-up article on the question of how to evaluate medical tests:

Last week's much-publicized study of a blood test purported to identify healthy elderly people who would develop clear cognitive impairments highlighted the uncertain grasp that most journalists, and even some healthcare providers, have on measures of diagnostic test accuracy.  

In that study, you may recall, researchers from Georgetown University found that levels of 10 substances in blood differed in cognitively normal older individuals who developed mild amnestic impairment or Alzheimer's disease within 3 years, compared with a similar group of people whose cognition remained intact. The researchers reported that the 10-marker panel had both specificity and sensitivity of 90% for distinguishing the two groups.  

But they left out an important statistic for judging the usefulness of such a test, as it would be applied in the clinic — the positive predictive value (PPV) or the accuracy of positive results seen in the target population (in this case, cognitively healthy seniors).  

Contrary to what I later learned is popular belief, calculating a PPV is easy, requiring nothing more than fourth grade arithmetic.

And the positive predictive value in this case is 32%, based on the numbers available in the original blood-test report. As Gever observes, this is

terrible, given that there is currently no "gold standard" test that can confirm an individual patient's positive result, and there is also no treatment currently or likely in the near future. At this point, a positive result serves only to put the patient and his or her family on the alert for problems, which, for an older person, they probably already are.

It's true that the calculations required to derive the PPV involve nothing beyond adding, subtracting, multiplying and dividing easily-available modest-sized numbers. But in order to know what fourth-grade arithmetic to do, you also need to understand the concept of a "contingency table" or "confusion matrix" (though you don't need to know those terms) — and maybe kids aren't ready for that until the sixth grade.
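The arithmetic in question can be laid out explicitly. The sketch below builds the contingency table for a hypothetical group of 1,000 cognitively healthy seniors, taking the study's reported 90% sensitivity and 90% specificity at face value, and assuming a base rate of roughly 5% converting to impairment within 3 years (the pre-test probability figure mentioned later in the comments); the exact population size doesn't matter, since it cancels out of the final ratio.

```python
# Contingency-table ("fourth-grade") arithmetic for positive predictive value.
# All inputs are illustrative assumptions drawn from the discussion above.

population = 1000       # hypothetical cohort of cognitively healthy seniors
prevalence = 0.05       # assumed base rate: fraction who convert within 3 years
sensitivity = 0.90      # reported: true positives / all who convert
specificity = 0.90      # reported: true negatives / all who stay healthy

converters = population * prevalence           # 50 people who will convert
non_converters = population - converters       # 950 who will not

true_positives = sensitivity * converters                 # 45 correct alarms
false_positives = (1 - specificity) * non_converters      # 95 false alarms

# PPV: of everyone who tests positive, what fraction actually converts?
ppv = true_positives / (true_positives + false_positives)
print(f"PPV = {ppv:.0%}")   # PPV = 32%
```

So a positive result from a "90% accurate" test means only about a 32% chance of actually converting: the 95 false alarms from the large healthy majority swamp the 45 true positives from the small converting minority.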

I don't recall ever having encountered this concept in my own K-12 education, and I don't think I saw it in my children's homework assignments either.  Given the lack of similar calculations in the many other stories about that widely-covered blood test research, it seems that few journalists (even those who cover relevant technical areas) have learned to deploy fourth-grade arithmetic in this way.

And as a result, the average science writer is apparently just as clueless about the concept of "accuracy" as the average Pirahã villager is about the concept of "seven".




  1. Leon Derczynski said,

    March 18, 2014 @ 9:24 am

    PPV may be better known to some as "precision", in the sense of "precision and recall".

    [(myl) Indeed — that will be helpful for people versed in information retrieval or machine learning. For the general population, I think it's problematic to rely on terms of ordinary language with new specialized meanings, like "sensitivity", "specificity", "precision", "significance", etc.]

  2. Nathan said,

    March 18, 2014 @ 9:41 am

    Science stories are written as press releases. If the publicity people at the lab or university or whatever wanted the science "writers" to know the actual implications of the research, they would tell them. They know they can count on almost none of them doing the math.

  3. Jonathan Mayhew said,

    March 18, 2014 @ 10:36 am

    Exactly, what scientist is going to brag about a test that is 32% reliable? The math is not hard to do, true, but it is hard to get to the math through a layer of jargon.

  4. dw said,

    March 18, 2014 @ 10:57 am

    I tend to think that percentages should be banned in science journalism. All references should be to actual concrete numbers — either the actual numbers in the study, or to a hypothetical sample of the population.

  5. BZ said,

    March 18, 2014 @ 12:47 pm

    But then we'll have things that affect "hundreds of thousands of people all over the world" and not realize that we are talking about statistically insignificant occurrences.

    For example, I often look at a food label and calculate the percentage of sugar it has (sugars/serving size), something I wouldn't have to do had this information been spelled out. It may shock some people that their "healthy" glass of 100% apple juice is 90% sugar.

  6. Peter Taylor said,

    March 18, 2014 @ 1:18 pm

    @BZ, it would certainly shock me: even a 50% sugar solution is pretty viscous.

    My memory is that we covered contingency tables at school when I was about 15, but memory is unreliable.

  7. BZ said,

    March 18, 2014 @ 1:40 pm

    OK, it's actually 10% and I misremembered or my memory exaggerated. It's still basically sugar and water, and Apple Juice Concentrate is often used in "no added sugar" drinks.

  8. Mark F. said,

    March 18, 2014 @ 2:53 pm

    As simple as this particular calculation is, it's unreasonable to expect reporters on a deadline to do it themselves, at least in most cases. It's not unreasonable to expect them to call the author and ask what the PPV is.

    [(myl) I don't agree with this. The prevalence (i.e. the conversion rate) is clearly listed in the paper, and given that, the calculation takes maybe 30 seconds, which is a lot less time than it takes to communicate with the author(s) (which is also a good thing to do).]

  9. D.O. said,

    March 18, 2014 @ 6:07 pm

    I don't have first-hand experience with Alzheimer's disease, but I understand it's pretty terrible. And I, for one, would really like to know that my chances of getting symptoms in the next 3 years are 30% rather than 3%. So hooray for clueless journalists and hype-thirsty scientists. Where can I get my test?

    [(myl) But you'd probably object to being told that you have a 90% chance of developing Alzheimer's in the next three years, when the estimated probability before the test was 5%, and the estimated probability after the test was 30% (or less, if the concerns about post-hoc feature selection in this case turn out to be justified).

    As for getting access to the test, you'll have to take that up with the FDA — or get access to a well-equipped biochemistry lab.]

  10. Ben said,

    March 18, 2014 @ 9:15 pm

    You don't need to understand any math; the steps to calculate it are readily available. Hell, here's a PPV calculator online: http://www.medcalc.org/calc/diagnostic_test.php

    [(myl) I think this is slightly mis-stated. The calculator means that you don't need to do the arithmetic for yourself, but you still need to understand that "accuracy" is an ambiguous term, and you still need to decide in a given case which interpretation is the relevant one. I'd call that "understanding math", in the sense that matters in this context.]

    The problem is that it's too easy to plug some numbers in and think you have something meaningful. It's just like Excel's linear regression functions. If you can put together a scatter plot, it'll give you a linear regression, complete with an r-value, and not one shred of guidance as to what any of it means.

  11. Adam Funk said,

    March 19, 2014 @ 5:15 am

    The first thing I thought of was that PPV is the same as precision, but I see my colleague Leon beat me to it. We have a set of slides to explain and compare the various metrics used in clinical tests and IE/IR, which are useful for teaching biomedical or bioinformatics people to develop and evaluate natural language processing applications.
