(Mis-) Interpreting medical tests


Jon Hamilton, "Alzheimer's Blood Test Raises Ethical Questions", NPR Morning Edition 3/9/2014:

An experimental blood test can identify people in their 70s who are likely to develop Alzheimer's disease within two or three years. The test is accurate more than 90 percent of the time, scientists reported Sunday in Nature Medicine.

The finding could lead to a quick and easy way for seniors to assess their risk of Alzheimer's, says Dr. Howard Federoff, a professor of neurology at Georgetown University. And that would be a "game changer," he says, if researchers find a treatment that can slow down or stop the disease.  

But because there is still no way to halt Alzheimer's, Federoff says, people considering the test would have to decide whether they are prepared to get results that "could be life-altering." 

But having a prediction with no prospect of a cure is not, in my opinion, the biggest problem with tests of this kind.

As we can learn from the cited publication (Mark Mapstone et al., "Plasma phospholipids identify antecedent memory impairment in older adults", Nature Medicine 3/9/2014), the "more than 90 percent of the time" accuracy is defined as "a sensitivity of 90% and specificity of 90%" for identifying participants who had unimpaired memory at the beginning, but would begin exhibiting cognitive impairment during the study.

One small point is that the size of the study was not large enough to be very certain about these numbers:

We enrolled 525 community-dwelling participants, aged 70 and older and otherwise healthy, into this 5-year observational study. Over the course of the study, 74 participants met criteria for amnestic mild cognitive impairment (aMCI) or mild Alzheimer's disease (AD) (Online Methods); 46 were incidental cases at entry, and 28 phenoconverted (Converters) from nonimpaired memory status at entry (Converterpre).

The blood tests are distinguishing participants in the "Converterpre" category from the "Normal Controls" (NC) category, and 28 is not a very large number.

But the bigger problem lies in the meaning of "sensitivity" and "specificity", as explained by John Gever, "Researchers Claim Blood Test Predicts Alzheimer's", MedPage Today 3/9/2014:

If the study cohort's 5% rate of conversion from normal cognition to mild impairment or Alzheimer's disease is representative of a real-world screening population, then the test would have a positive predictive value of just 35%. That is, nearly two-thirds of positive screening results would be false.  In general, a positive predictive value of 90% is considered the minimum for any kind of screening test in normal-risk individuals.

Let's unpack this. We start with a 2-by-2 "contingency table", relating test predictions to true states or outcomes:

                      Reality is Positive (P)   Reality is Negative (N)
    Test is Positive  True Positive (TP)        False Positive (FP)
    Test is Negative  False Negative (FN)       True Negative (TN)

In this context, the "sensitivity" is the true positive rate: TP/P, the proportion of real positives that test positive.

The "specificity" is the true negative rate: TN/N = the proportion of real negatives that test negative.

And 90% sensitivity and specificity sounds pretty good.

But what doctors and patients really learn is only whether the test is positive or negative. So suppose the true prevalence of the condition is 5%, and consider what a positive result then means. Out of 1,000 patients, there will be 0.05*1000 = 50 who are truly going to get AD; and of these, 0.9*50 = 45 will have a positive test result. But there will be 0.95*1000 = 950 who are not going to get AD; and of these, 0.1*950 = 95 will have a positive test result.

So there will be a total of 45+95 = 140 positive test results, and of these, 45 will be true positives, or 45/140 = 32%.
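The arithmetic above is just Bayes' rule, and it generalizes to any sensitivity, specificity, and prevalence. A minimal sketch (the helper name `ppv` is mine, not from any of the cited sources):

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: the fraction of positive
    test results that are true positives."""
    true_pos = sensitivity * prevalence                # real positives who test positive
    false_pos = (1 - specificity) * (1 - prevalence)   # real negatives who test positive
    return true_pos / (true_pos + false_pos)

# 90% sensitivity, 90% specificity, 5% prevalence:
print(round(ppv(0.9, 0.9, 0.05), 2))  # 0.32
```

Note that at 50% prevalence the same test would give ppv(0.9, 0.9, 0.5) = 0.9, which is why quoting sensitivity and specificity alone sounds so reassuring.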

Thus the real problem with a positive test result, in this case, would not be learning that you're fated to get AD and can't do anything to prevent it. Rather, it would be believing that you're 90% likely to get AD when your actual chances are much lower.

In fact, I think that the numbers might be a bit better than Gever's article suggests. According to "2012 Alzheimer's disease facts and figures" from the Alzheimer's Association:

The estimated annual incidence (rate of developing disease in a one-year period) of Alzheimer’s disease appears to increase dramatically with age, from approximately 53 new cases per 1,000 people age 65 to 74, to 170 new cases per 1,000 people age 75 to 84, to 231 new cases per 1,000 people over age 85.

Even at a rate of 53 per 1,000, the chances of "converting" within three years would be (1 – (1-0.053)^3) = 0.151, so the positive predictive value of the test would be more like 62% than 32%. But 62% is still not 90%, and the general point is an important one.
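Spelling that calculation out (the incidence figure is from the Alzheimer's Association quote above; the variable names are mine):

```python
annual_incidence = 0.053                       # 53 new cases per 1,000, ages 65-74
prevalence = 1 - (1 - annual_incidence) ** 3   # chance of converting within 3 years

true_pos = 0.9 * prevalence                    # sensitivity * prevalence
false_pos = 0.1 * (1 - prevalence)             # (1 - specificity) * (1 - prevalence)
ppv = true_pos / (true_pos + false_pos)

print(round(prevalence, 3))  # 0.151
print(round(ppv, 3))         # 0.615, i.e. roughly 62%
```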

For more on the terminology involved, see the Wikipedia article on sensitivity and specificity.

  1. AndrewD said,

    March 10, 2014 @ 12:10 pm

    Derek Lowe at "In the Pipeline" has commented on the claim here

    [(myl) One of his commenters raises another concern:

    "144 lipids simultaneously by multiple reaction monitoring. The other metabolites are resolved on the UPLC and quantified using scheduled MRMs. The kit facilitates absolute quantitation of 21 amino acids, hexose, carnitine, 39 acylcarnitines, 15 sphingomyelins, 90 phosphatidylcholines and 19 biogenic amines…" to arrive at the final set which were used for what appears to be retrospective-based predictions.

    That is, we shouldn't rely too strongly on the fact that a combination of ten biomarkers, out of about 200 tested, can retro-dict outcomes for a relatively small and not very diverse set of patients. This raises a question about whether the 90% sensitivity and specificity would hold up in a larger and more diverse population.

    But the main point is that sensitivity and specificity can be misleading numbers.]

  2. D.O. said,

    March 10, 2014 @ 5:18 pm

    From age 65

    For Alzheimer’s, the estimated lifetime risk was nearly one in five (17.2%) for women compared with nearly one in 10 (9.1%) for men.

    which more or less shows that you cannot get much predictive improvement even considering the whole lifespan (older people seem to develop the risk of Alzheimer's faster than the risk of death, but the difference is not very pronounced).

  3. AntC said,

    March 10, 2014 @ 6:03 pm

    Thank you Mark. Upon reading the syndicated piece in my daily paper, this is the first place I came for a proper analysis.

  4. Michael Watts said,

    March 11, 2014 @ 4:31 am

    I've always had the impression that if you test positive for a rare disease, the first thing you do is test again. I mean, we're not going to make rare diseases less rare.

    The strategy produced by this rather simple wisdom has a total error rate of 1.9% given 90% sensitivity and specificity and a population with 5% prevalence of the condition, split half and half between 0.95% who think they're positive when they aren't, and 0.95% who think they aren't positive when they are. (Or, in the terms I worked the problem, if you test 10,000 people of whom 500 are positive, then after the process completes, 95 people will erroneously believe they're positive, 95 people will erroneously believe they're negative, and the other 9,810 will know correctly whether they're positive or negative.)
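His 1.9% figure checks out under his assumptions (independent errors, retesting only the first-round positives); a quick sketch, with the population size and variable names my own:

```python
pop = 10_000
positives = 500                        # 5% prevalence
sens, spec = 0.9, 0.9

# First test
tp1 = sens * positives                 # 450 true positives
fp1 = (1 - spec) * (pop - positives)   # 950 false positives
fn1 = positives - tp1                  # 50 real positives missed outright

# Retest everyone who tested positive, assuming independent errors
tp2 = sens * tp1                       # 405 confirmed true positives
fn2 = tp1 - tp2                        # 45 real positives lost on the retest
fp2 = (1 - spec) * fp1                 # 95 false positives surviving both tests

errors = fp2 + fn1 + fn2               # 95 wrongly positive + 95 wrongly negative
print(round(errors / pop, 3))  # 0.019
```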

    Sensitivity and specificity don't tell you the odds that your test result is correct, because that question is inextricably linked to the prevalence of the condition in your population. But we can't do better than reporting them, because they are the part of the question defined by the test. It's not the test's fault if you're using it to look for something rare.

    [(myl) But the results of a "positive" test for something dire may be very problematic indeed — unnecessary surgery in the case of unreliable cancer screening, or major life changes in the case of a test like this one.

    And we CAN do better than reporting sensitivity and specificity, as John Gever's article observes: we can tell people the "positive predictive value" (PPV), which is exactly the proportion of positive test results that are true positives, or (in the case of negative results) we can tell them the "negative predictive value".

    In terms of the 2×2 contingency table, PPV is simply true positives divided by the sum of true positives and false positives. This measure is known as "precision" in the information-retrieval and machine-learning literature.]

  5. Michael Watts said,

    March 11, 2014 @ 10:00 am

    Well, sure. If the question is "do I have cancer", we can be more, but uselessly, informative by saying "most people don't, so there's still a good chance you don't". If the question is "how good is this test", then sensitivity and specificity are all you have.

    But again, the normal thing to do is just test again. If the question is "do I have cancer", people don't really care about getting a numerically precise figure for the accuracy of the test, they care about whether or not they have cancer. The "test again" strategy constructively brings the PPV of our hypothetical test from 32% to 80% (it also cuts the NPV from 99.4% to 99%). The "inform and release" strategy tells people "you have a 32% chance of having cancer — hugely elevated compared to the population, but less than you'd want to base a major surgery on". What will they do then but retest?
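The retest figures can be reproduced under the same independence assumption (variable names are mine; the exact PPV comes out to 81%, which the comment rounds to 80%):

```python
pop = 10_000
positives = 500                        # 5% prevalence
sens, spec = 0.9, 0.9

tp1 = sens * positives                 # 450 true positives on the first test
fp1 = (1 - spec) * (pop - positives)   # 950 false positives on the first test
tp2 = sens * tp1                       # 405 positive on both tests
fp2 = (1 - spec) * fp1                 # 95 positive on both tests

told_positive = tp2 + fp2                           # 500 told "positive"
told_negative = pop - told_positive                 # 9,500 told "negative"
false_negatives = (positives - tp1) + (tp1 - tp2)   # 50 + 45 = 95

print(round(tp2 / told_positive, 2))                                # 0.81
print(round((told_negative - false_negatives) / told_negative, 2))  # 0.99
```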

    (In the particular case of cancer, you can look at other evidence like the presence and potentially growth rate of tangible lumps; in the case of future Alzheimer's, that seems more difficult.)

  6. Michael Watts said,

    March 11, 2014 @ 10:06 am

    I guess I'm uncomfortable because you refer to "the biggest problem with tests of this kind". Saying that they can't justify their sensitivity and specificity numbers is fair criticism. But the fact that the phenomenon it's looking for is rare can't be a problem with the test. Tests aren't going to reach perfect accuracy, and people won't stop being interested in rare phenomena, because rare phenomena won't stop being important. None of that is a problem with the test. If a serious problem is sufficiently rare that our best test for it isn't informative enough to justify action after only testing once, the solution isn't to give up on testing for it. It's still serious and still detectable.

  7. Jon said,

    March 11, 2014 @ 11:08 am

    @Michael Watts – you seem to be assuming that successive tests on the same person have independent false positive rates. By that assumption, you could increase the accuracy of any test to 99.9% just by repeating often enough. But in fact the false positive in one individual will often be because they genuinely have a high level of some marker, for some reason other than the presence of disease. You may get a bit more accuracy for some tests by repeating, but not always.

  8. Keith M Ellis said,

    March 11, 2014 @ 5:36 pm

    I assumed that Michael Watts was implying a followup with a more expensive, more reliable test, as is the convention with these things.

    But I think the relevant response to his argument is that people make generalized and policy decisions on the basis of misunderstanding what these numbers mean. When headlines and articles reduce this to "more than 90% accurate", the general public and (unfortunately) those responsible for policy make bad decisions about such tests.

  9. Michael Watts said,

    March 11, 2014 @ 11:12 pm

    Yes, the numbers I used assumed independent tests with identical characteristics. As Keith M Ellis points out, it makes more sense to retest with a different test, but then you need new numbers.

    I'll try phrasing myself a little differently. Imagine that a particular disease, such as lung cancer, is very rare within the population but for unknown reasons becomes much more common over time. During the same period, we continue to detect it with exactly the same tests we've always used, and no technological improvements. I don't think it makes sense to say that the quality of our tests has increased dramatically just because there's more lung cancer now than there used to be. The tests have all the same characteristics they've always had; a paper examining their sensitivity and specificity will be just as valuable after this happens as it was originally. But a paper presenting PPV and NPV will be worthless because those have changed. Conversely, if a disease such as measles is initially common but becomes rare, I don't think it makes sense to say that our tests for it have suddenly become crap.

    Look at it yet another way. Suppose I invent and sell a test for leukemia and market it as having a certain PPV, say 45%. Then, unbeknownst to me and beyond my control, the incidence of leukemia drops, so the test's PPV is actually 40%. Should people using my test be allowed to sue me?
