Earlier this year, I discussed an interesting paper from a poster session at ICASSP 2010 ("Clinical applications of speech technology", 3/18/2010), which used an automated evaluation of dysphonia measures in short speech samples to match clinicians' evaluations of Parkinson's Disease severity.
That work, extended and improved, has been published as Athanasios Tsanas et al., "Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's disease symptom severity", J. Roy. Soc. Interface, 11/17/2010.
The abstract:

The standard reference clinical score quantifying average Parkinson's disease (PD) symptom severity is the Unified Parkinson's Disease Rating Scale (UPDRS). At present, UPDRS is determined by the subjective clinical evaluation of the patient's ability to adequately cope with a range of tasks. In this study, we extend recent findings that UPDRS can be objectively assessed to clinically useful accuracy using simple, self-administered speech tests, without requiring the patient's physical presence in the clinic. We apply a wide range of known speech signal processing algorithms to a large database (approx. 6000 recordings from 42 PD patients, recruited to a six-month, multi-centre trial) and propose a number of novel, nonlinear signal processing algorithms which reveal pathological characteristics in PD more accurately than existing approaches. Robust feature selection algorithms select the optimal subset of these algorithms, which is fed into non-parametric regression and classification algorithms, mapping the signal processing algorithm outputs to UPDRS. We demonstrate rapid, accurate replication of the UPDRS assessment with clinically useful accuracy (about 2 UPDRS points difference from the clinicians' estimates, p < 0.001). This study supports the viability of frequent, remote, cost-effective, objective, accurate UPDRS telemonitoring based on self-administered speech tests. This technology could facilitate large-scale clinical trials into novel PD treatments.
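The pipeline the abstract describes (dysphonia features, robust feature selection, non-parametric regression onto UPDRS) can be caricatured in a few lines of Python. The sketch below is not the authors' method: the data are synthetic, a crude correlation filter stands in for their robust feature selection, and k-nearest-neighbour regression stands in for their non-parametric regressors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 recordings, 30 candidate dysphonia features,
# of which only the first 5 actually carry information about the target.
n, d, k_inf = 200, 30, 5
X = rng.normal(size=(n, d))
w = np.zeros(d)
w[:k_inf] = 1.0
y = X @ w + rng.normal(scale=0.5, size=n)  # invented "UPDRS-like" target

# 1) Crude filter-style feature selection: keep the features with the
#    largest absolute correlation with the target.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(d)])
selected = np.argsort(corr)[-k_inf:]

# 2) Non-parametric regression: k-nearest-neighbour prediction on the
#    selected features, with a simple train/test split.
def knn_predict(X_tr, y_tr, X_te, k=10):
    preds = []
    for x in X_te:
        dist = np.linalg.norm(X_tr - x, axis=1)
        preds.append(y_tr[np.argsort(dist)[:k]].mean())
    return np.array(preds)

split = 150
Xs = X[:, selected]
pred = knn_predict(Xs[:split], y[:split], Xs[split:])
mad = np.mean(np.abs(pred - y[split:]))  # mean absolute deviation
print("selected features:", sorted(selected.tolist()))
print("test MAD: %.2f" % mad)
```

The point of the toy example is only the shape of the method: prediction quality is reported as mean absolute deviation from the target, the same metric the paper uses for its "about 2 UPDRS points" claim.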
A mean absolute deviation of 2 UPDRS points from clinicians' average estimates is markedly better than the 8-point difference reported at ICASSP, and is especially impressive given that mean inter-rater differences for (human) clinicians are apparently about 1.7 to 5.4 UPDRS points. The clinicians' UPDRS ratings are based on many dimensions of interviewing and observation, of which evaluation of vocal symptoms is only one small part. Thus it's interesting — and somewhat surprising — that a clinically useful approximation of the whole procedure can be accomplished by a simple automated evaluation of a few short vocal samples.
I feel that there are many opportunities for applying speech and language technology in clinical diagnosis and monitoring, and I expect to see a lot of progress in this area in the near future. Progress is held back largely by the scarcity of training and testing databases of the kind these researchers were able to access. No such collections exist, as far as I know, for the many other conditions for which automated quantification from voice or text samples would probably work.
(This may also represent a trend that many people dislike, namely that in some diagnostic and monitoring tasks, automated evaluations using machine-learning techniques are now on average as good as or better than trained human observers.)
[Update — Max Little, one of the authors of the cited study, responds below to some of the comments:
I'd just like to address your points here if I may. A number of the features we used here have had success in different clinical speech applications, including discriminating between healthy controls and PD patients. Yet we also know from other experiments that when the speech data are of very poor quality (so that the audio really can't be interpreted as informative by any stretch of the imagination), things don't work. To me, this is good evidence that these findings can't just be explained away by "finite sample effects" or over-fitting. I myself don't place any particular reliance on cross-validation alone; I just think you can use it to get some estimate of the spread of your predictions.
Jan mentioned L-dopa, but I should first say that all our patients were unmedicated throughout. While I know that factor analysis gets a lot of use in this kind of medical literature, I think Glenn should not have used it on the UPDRS data, which are quite obviously non-Gaussian. Easier said than done, you might say: dimension reduction for non-Gaussian data is, arguably, in its infancy as a subject, so it's natural to fall back on multivariate normality. But then we are in the opposite danger zone of under-fitting: with only a hammer, everything looks like a nail. What I would point out is that we show the motor and total UPDRS to be non-trivially rank-correlated with the speech part of UPDRS.
Overall, I think the explanation for these results is actually quite simple. PD is, amongst other things, a movement disorder. So a very large part of activities of daily living are disrupted by motor symptoms. Speech is just one kind of movement, and this is affected too. Speech is also affected by cognitive decline and mood (e.g. depression). So I don't find it surprising that if you have PD, this will be reflected in your speech, one way or another, just as much as your movement and/or cognitive state is affected.
Interestingly, we found that we could also predict the tremor and bradykinesia parts of UPDRS using these features, but not as well as when you directly measure these two quantities using accelerometers.
Finally I should say that we don't claim anything about diagnosis. I think that's an altogether more difficult step to take, not least because you have to find susceptibles in the population and follow them for a very long time.
So, if we have been fooled, then we have been fooled in a particularly diabolical way. There is something in the speech signal, and if you can find the right predictive model, you can get useful results. That's really as much as we claim here.]
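Little's preference for rank correlation over Gaussian methods on the non-Gaussian UPDRS data is easy to illustrate. The sketch below uses invented data (not the study's), and the hand-rolled spearman_rho is just Pearson correlation applied to ranks, with no tie handling.

```python
import numpy as np

rng = np.random.default_rng(1)

def spearman_rho(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks.
    No tie handling -- fine for continuous synthetic data."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Invented stand-in scores for 42 patients: a "speech item" score and a
# noisy "total UPDRS" that partially tracks it.
speech = rng.uniform(0, 4, size=42)
total = 10 * speech + rng.normal(scale=8, size=42)

print("Spearman rho: %.2f" % spearman_rho(speech, total))
```

Because the statistic depends only on ranks, it is unaffected by any monotone transformation of the scores, which is exactly why it is safer than correlation-based factor analysis when the data are far from Gaussian. In practice, scipy.stats.spearmanr also handles ties and reports a p-value.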