Speech-based quantification of Parkinson's Disease


Earlier this year, I discussed an interesting paper from a poster session at ICASSP 2010 ("Clinical applications of speech technology", 3/18/2010), which used an automated evaluation of dysphonia measures in short speech samples to match clinicians' evaluations of Parkinson's Disease severity.

That work, extended and improved, has been published as Athanasios Tsanas et al., "Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's disease symptom severity", J. Roy. Soc. Interface, 11/17/2010.

Their abstract:

The standard reference clinical score quantifying average Parkinson's disease (PD) symptom severity is the Unified Parkinson's Disease Rating Scale (UPDRS). At present, UPDRS is determined by the subjective clinical evaluation of the patient's ability to adequately cope with a range of tasks. In this study, we extend recent findings that UPDRS can be objectively assessed to clinically useful accuracy using simple, self-administered speech tests, without requiring the patient's physical presence in the clinic. We apply a wide range of known speech signal processing algorithms to a large database (approx. 6000 recordings from 42 PD patients, recruited to a six-month, multi-centre trial) and propose a number of novel, nonlinear signal processing algorithms which reveal pathological characteristics in PD more accurately than existing approaches. Robust feature selection algorithms select the optimal subset of these algorithms, which is fed into non-parametric regression and classification algorithms, mapping the signal processing algorithm outputs to UPDRS. We demonstrate rapid, accurate replication of the UPDRS assessment with clinically useful accuracy (about 2 UPDRS points difference from the clinicians' estimates, p < 0.001). This study supports the viability of frequent, remote, cost-effective, objective, accurate UPDRS telemonitoring based on self-administered speech tests. This technology could facilitate large-scale clinical trials into novel PD treatments.

A mean absolute deviation of 2 UPDRS points from clinicians' average estimates is markedly better than the 8-point difference reported at ICASSP, and is especially impressive since mean inter-rater differences for (human) clinicians are apparently about 1.7 to 5.4 UPDRS points. The clinicians' UPDRS ratings are based on a combination of many dimensions of interviewing and observation, of which evaluation of vocal symptoms is only one small part. Thus it's interesting — and somewhat surprising — that a clinically-useful approximation of the whole procedure can be accomplished by a simple automated evaluation of a few short vocal samples.
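
For readers who want a concrete sense of what such a pipeline involves, here is a minimal sketch in Python with scikit-learn. It is not the authors' implementation: the input file, column names, feature selector, and regressor are placeholders chosen for illustration, and the grouping by patient is just one sensible way to keep recordings from the same speaker out of both training and test folds.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GroupKFold, cross_val_predict
from sklearn.metrics import mean_absolute_error

# Hypothetical file: one row per recording, one column per dysphonia measure.
df = pd.read_csv("dysphonia_features.csv")
X = df.drop(columns=["subject_id", "total_UPDRS"]).values
y = df["total_UPDRS"].values

pipeline = Pipeline([
    # Feature selection is re-fit inside each training fold, so the
    # held-out recordings never influence which features are kept.
    ("select", SelectKBest(score_func=f_regression, k=10)),
    ("regress", RandomForestRegressor(n_estimators=500, random_state=0)),
])

# Group the folds by patient so recordings from the same speaker never
# appear in both the training and the test set.
cv = GroupKFold(n_splits=10)
pred = cross_val_predict(pipeline, X, y, cv=cv, groups=df["subject_id"])
print("MAE vs. clinicians' UPDRS: %.2f points" % mean_absolute_error(y, pred))
```

A random forest stands in here for the non-parametric regression stage; whatever learner is used, the shape of the computation is the same: extract dysphonia measures from each recording, select a subset, and regress onto UPDRS.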

I feel that there are many opportunities for applying speech and language technology in clinical diagnosis and monitoring, and I expect to see a lot of progress in such things in the near future. Progress is held back largely by the lack of the kind of databases for training and testing that these researchers were able to get access to. No such collections exist, as far as I know, for the large number of other conditions for which automated quantification from voice or text samples would probably work.

(This may also represent a trend that many people dislike, namely that in some diagnostic and monitoring tasks, automated evaluations using machine-learning techniques are now on average as good or better than trained human observers.)

[Update — Max Little, one of the authors of the cited study, responds below to some of the comments:

I'd just like to address your points here if I may. A subset of the features we used here has had successes in different clinical speech applications, including discriminating between healthy controls and PD patients. Yet we also know from other experiments that when the speech data is of very poor quality (so that the audio really can't be interpreted as informative by any stretch of the imagination), then things don't work. To me, this is good evidence that these findings can't just be explained away by "finite sample effects" or over-fitting. I myself don't have any particular reliance on cross-validation alone; I just think you can use it to get some estimate of the spread of your predictions.

Jan mentioned l-dopa, but I should first say that all our patients were unmedicated throughout. While I know that factor analysis gets a lot of use in this kind of medical literature, I think Glenn should not have used it on the UPDRS data, which is quite obviously non-Gaussian. Easier said than done, you might say: dimension reduction such as this for non-Gaussian data is, arguably, in its infancy as a subject. So it's natural to fall back on multivariate normality. But then we are in the opposite danger zone of under-fitting: with only a hammer, everything looks like a nail. What I would point out is that we show the motor and total UPDRS are non-trivially rank-correlated with the speech part of UPDRS.

Overall, I think the explanation for these results is actually quite simple. PD is, amongst other things, a movement disorder. So a very large part of activities of daily living are disrupted by motor symptoms. Speech is just one kind of movement, and this is affected too. Speech is also affected by cognitive decline and mood (e.g. depression). So I don't find it surprising that if you have PD, this will be reflected in your speech, one way or another, just as much as your movement and/or cognitive state is affected.

Interestingly, we found that we could also predict the tremor and bradykinesia parts of UPDRS using these features, but not as well as when you directly measure these two quantities using accelerometers.

Finally I should say that we don't claim anything about diagnosis. I think that's an altogether more difficult step to take, not least because you have to find susceptibles in the population and follow them for a very long time.

So, if we have been fooled, then we have been fooled in a particularly diabolical way. There is something in the speech signal and if you can find the right predictive model, you can get useful results. That's really as much as we claim here.

]



6 Comments

  1. D.O. said,

    November 22, 2010 @ 12:42 pm

    A mean absolute deviation of 2 UPDRS points from clinicians' average estimates … is especially impressive since mean inter-rater differences for (human) clinicians are apparently about 1.7 to 5.4 UPDRS points.

    This is a bit puzzling. It seems that UPDRS is not known to within 2 points of precision, when measured by the standard procedure, because of inter-rater differences. How can any measure agree with it better than that? I guess I need to go read the paper…

    [(myl) There are several different possible measures here. If we have a pool of raters, we can ask what the average absolute difference is between an individual rater and the group mean. Or we can ask what the average difference is between two raters chosen at random. And so on. If raters are noisy but (relatively) unbiased, then it's easy to imagine an automated technique that agrees with the mean rater better than a random rater does.]
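
    A toy simulation makes this concrete. The panel size and noise levels below are invented rather than taken from the study; the only point is that an unbiased automated estimate with modest noise ends up closer to the panel mean than an individual rater does.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_patients, n_raters = 1000, 4
    rater_sd, machine_sd = 4.0, 2.0      # hypothetical noise, in UPDRS points

    true_updrs = rng.uniform(10, 60, n_patients)
    raters = true_updrs[:, None] + rng.normal(0, rater_sd, (n_patients, n_raters))
    machine = true_updrs + rng.normal(0, machine_sd, n_patients)

    panel_mean = raters.mean(axis=1)
    print("machine vs. panel mean: %.1f points"
          % np.abs(machine - panel_mean).mean())
    print("one rater vs. panel mean: %.1f points"
          % np.abs(raters[:, 0] - panel_mean).mean())
    ```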

  2. Jan van Santen said,

    November 22, 2010 @ 12:49 pm

    Three things are striking about this result. First, PD is heterogeneous in terms of, e.g., symptoms (e.g., axial vs. non-axial) and response to levodopa. Second, speech problems are lessened by levodopa far less than other problems are. Third, the UPDRS is multifactorial: the list of items contains various explicit sub-sections, and factor analysis has revealed multiple factors; see Stebbins & Goetz (1998), "Factor Structure of the Unified Parkinson's Disease Rating Scale: Motor Examination Section", Movement Disorders, Vol. 13, No. 4, pp. 633–636. Moreover, even though the biggest factor indeed has heavy speech loadings, it captures less than 50% of the variance.
    It follows that either the automated method is unlikely to be uniformly accurate, or (inaudible) speech features are uniformly informative.

    [(myl) Even though they did careful cross-validation, I worry that their extensive exploration of a very large feature space might in fact be over-training on some characteristics of their data set that might not hold up in another sample. And there are also issues of microphone and recording-condition variation to worry about. Still, it seems to be a promising result.]
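
    One way to see why the size of the feature space matters: if candidate features are screened on the full data set before cross-validation, even pure noise can look predictive. The sketch below is entirely synthetic and has no connection to the Tsanas et al. data; it only illustrates the leakage mechanism, and shows that re-running the selection inside each fold removes the spurious skill.

    ```python
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 2000))   # 60 recordings, 2000 pure-noise features
    y = rng.normal(size=60)           # target unrelated to the features

    # Wrong: pick the 10 "best" features on all the data, then cross-validate.
    best = SelectKBest(f_regression, k=10).fit(X, y).get_support()
    leaky = cross_val_score(LinearRegression(), X[:, best], y, cv=5, scoring="r2")

    # Right: re-fit the selection inside each training fold.
    honest = cross_val_score(
        Pipeline([("select", SelectKBest(f_regression, k=10)),
                  ("regress", LinearRegression())]),
        X, y, cv=5, scoring="r2")

    print("leaky  R^2: %.2f" % leaky.mean())   # spuriously positive
    print("honest R^2: %.2f" % honest.mean())  # near zero or negative
    ```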

  3. Jan van Santen said,

    November 22, 2010 @ 3:52 pm

    The UPDRS has 294 items spread over 6 parts, only one of which is about motor function, and only a few items in that part are about speech; most items in the motor part are about gait, tremor, etc.

    I would love to believe that there is more Dx information in speech than what is generally assumed (e.g., by neurologists), but predicting the total UPDRS score by only looking at speech is just terribly implausible.

    Yes, and then there are the over-training issues.

  4. Mark P said,

    November 23, 2010 @ 9:18 am

    It's intriguing. I really have no idea what's involved with this, but I do know some things about computers and programming. It's possible for a computer program to detect patterns in data that are not obvious to even a trained observer. For example, there are plenty of programs that identify periodicity or patterns in data that look like pure noise. The hardest part of such a task is coming up with a valid algorithm and then translating it into a computer language. And then, of course, figuring out whether the pattern is really there. This may or may not work, but it certainly leads to some interesting possibilities for enlarging a physician's diagnostic toolkit.
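
    As a toy illustration of that point, the snippet below hides a weak periodic component in much larger noise, where it is invisible in the raw waveform, and recovers it from the power spectrum; the signal parameters are arbitrary.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    fs = 1000                                     # samples per second
    t = np.arange(0, 5, 1 / fs)
    signal = 0.2 * np.sin(2 * np.pi * 7.0 * t)    # weak 7 Hz component
    noisy = signal + rng.normal(0, 1.0, t.size)   # buried in much larger noise

    spectrum = np.abs(np.fft.rfft(noisy)) ** 2
    freqs = np.fft.rfftfreq(noisy.size, 1 / fs)
    peak = freqs[np.argmax(spectrum[1:]) + 1]     # skip the DC bin
    print("strongest periodic component near %.1f Hz" % peak)
    ```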

  5. Paul Zukowski said,

    November 23, 2010 @ 11:29 am

    A replacement for the polygraph?

  6. Kaviani said,

    November 24, 2010 @ 3:44 pm

    @Paul Zukowski: voice stress analysis software already exists, but it's dubious for the same reasons this software is (microphone/transmission quality, software imperfections, the fact that many PWP are on medications that mitigate vocal disturbances, emotional interference, etc.).

    What seems better, though way more expensive, would be to use this software to train people with perfect pitch (or similar) to analyze conversation face-to-face. Not very convenient for the patient, I know, but probably more accurate.
