Using automatic speech-to-text in clinical applications

A colleague pointed me to Terje Holmlund et al., "Applying speech technologies to assess verbal memory in patients with serious mental illness", NPJ digital medicine 2020:

Verbal memory deficits are some of the most profound neurocognitive deficits associated with schizophrenia and serious mental illness in general. As yet, their measurement in clinical settings is limited to traditional tests that allow for limited administrations and require substantial resources to deploy and score. Therefore, we developed a digital ambulatory verbal memory test with automated scoring, and repeated self-administration via smart devices. One hundred and four adults participated, comprising 25 patients with serious mental illness and 79 healthy volunteers. The study design was successful with high quality speech recordings produced to 92% of prompts (Patients: 86%, Healthy: 96%). The story recalls were both transcribed and scored by humans, and scores generated using natural language processing on transcriptions were comparable to human ratings (R = 0.83, within the range of human-to-human correlations of R = 0.73–0.89). A fully automated approach that scored transcripts generated by automatic speech recognition produced comparable and accurate scores (R = 0.82), with very high correlation to scores derived from human transcripts (R = 0.99). This study demonstrates the viability of leveraging speech technologies to facilitate the frequent assessment of verbal memory for clinical monitoring purposes in psychiatry.

This is great work, but over-interpretation of such results is likely to be a problem. At this stage in the development of the technologies, experimenting with speech-to-text in such applications is a very good idea, but relying on it without accurate human-corrected transcripts is a very bad idea.

There are serious potential problems in four areas:

  1. Diarization — current state-of-the-art for "who spoke when" remains problematic.
  2. Disfluencies — all of the standard systems ignore them.
  3. Languages — performance across languages and varieties varies greatly, and in many cases is zero.
  4. Equity — performance is highly variable on different speakers of different dialects/varieties, different ages and backgrounds, different vocal characteristics, different recording contexts, etc.

In addition, the technology is changing rapidly, and the APIs are changing with it, so that the results in a few years, in general or even from the same recordings, will be radically different from the results today.

The consequence of all this is that reliance on ASR transcripts, aside from degraded performance in general, will produce results that include serious uncontrolled covariates for race, age, gender, and educational background, among other things. And these effects will be radically unstable over time as the technologies change.

A better approach, in my opinion, is to use ASR input to make human transcription faster — into the range of 1x to 3x transcriber time depending on the quality of the ASR — and then to use the human transcripts to improve the ASR by way of better-adapted language modeling. And this needs to happen in the context of very large-scale studies, covering hundreds of thousands of subjects.

This will also let us estimate the size and effect of the cited problems, as time goes on.

How does the Holmlund et al. study match up with my objections?

In the first place, they compared the results with ASR transcripts to those with human transcripts, which is the right thing to do.

And in their application, diarization doesn't matter much, since they aim to record only the subjects. But some readers may not recognize that this can become a serious problem in analyzing interviews or conversations — diarization failure is one of many problems that sank IBM's ambitious plans to apply (the speech aspects of) its Watson technologies in clinical applications.

Disfluencies don't enter the picture here, since (as far as I can tell) their analysis (of human as well as automatic transcriptions) ignored them — but if so, this was a mistake, since several experiments (including a forthcoming study of schizophrenic patients) show significant diagnostic value in the number, type, and distribution of disfluencies.

As for languages and varieties, they report:

The participant sample comprised 104 adults. Twenty-five patients were recruited from a group home facility in the Southeastern US (Mean age = 49.7 years; SD = 10.4 years, 52.2% female). The patients all met U.S. federal definitions of serious mental illness […] Two-thirds of the patients met the criteria for schizophrenia (N = 16), and the remaining major depressive disorder (N = 8) and bipolar disorder (N = 1). […] The other participants (N = 79) were undergraduate students at Louisiana State University presumed to be healthy (henceforth termed ‘healthy participants’; mean age = 21.7 years; SD = 1.4 years, 62% female).

So how do we know that the test is measuring "verbal memory deficits" and not age, life circumstances, and test-taking environment? Obviously, there's a large literature on story-recall tests suggesting that they measure clinically-relevant "verbal memory deficits" — as well as other things. But if tests of this kind are to be used in serious clinical screening or monitoring applications, the many other factors would need to be taken into account. (In fact, I do believe that such measures of "verbal memory" will differ for people with "severe mental illness" of various kinds — but they will also differ as a function of many clinically-irrelevant social, cultural, economic and contextual factors.)

And one of those factors is highlighted in their report:

Automatic speech recognition performed using the latest Google’s speech-to-text service produced an overall word error rate of 23.3%, with lower error rates in healthy participants (17.1%) compared to patients (43.7%; see Fig. 2, panel a). This high error rate is likely due to the fact that the Google language model was trained on general language rather than the language specific to our task. Even so, the predictions of a combined feature model based on transcriptions from the generic ASR procedure correlated surprisingly well with human ratings at R = 0.80 (range 0.74–0.88 across five folds). The robustness of such models in the context of high word error rates has been demonstrated in other domains and is attributable to errors being made mostly on non-essential words, with the arguably more important common type words generally being transcribed correctly.

Another conclusion might be that their classification methods worked well, in this case, because of the enormous differences between their patient population (mean age 50; institutionalized for "serious mental illness"; education from 7 to 16 years, mean 12) and their healthy controls (mean age 22; currently enrolled in college).

When we start testing the population at large using "digital biomarkers" of this general kind, then many important covariates are going to show up, marking age, gender, race, ethnicity, educational background, and so on. If these are not controlled for, the diagnostic evaluations for different subpopulations are going to be systematically different, in ways that will violate elementary considerations of fairness. And for the results of rapidly-evolving automatic speech-to-text algorithms, the effects will be different for different versions of different APIs or programs, and even more different across time.
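For readers who want to check this sort of thing on their own data, here is a minimal sketch of how word error rate can be computed separately for each subgroup, by comparing ASR output against human reference transcripts. It uses the standard definition of WER (word-level edit distance divided by the number of reference words). The function names, the group labels, and the toy sentences are all made up for illustration; this is not the scoring pipeline used by Holmlund et al., and a real evaluation would also need consistent text normalization (punctuation, numerals, disfluency notation), which this sketch reduces to lowercasing and whitespace splitting.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    # Assumption: lowercasing + whitespace splitting is adequate normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Pool errors and reference words per group, then divide.

    samples: iterable of (group_label, reference_text, asr_text).
    Returns a dict mapping each group label to its pooled WER.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


if __name__ == "__main__":
    # Toy, invented data; the group labels are hypothetical.
    toy = [
        ("group_a", "the dog ran to the park", "the dog ran to the park"),
        ("group_b", "the dog ran to the park", "the dog ran the bark"),
    ]
    for group, wer in sorted(wer_by_group(toy).items()):
        print(f"{group}: WER = {wer:.1%}")
```

Pooling errors and words within each group, rather than averaging per-recording rates, keeps short recordings from dominating the estimate. Running the same comparison again whenever the ASR service changes would give a rough picture of how stable the subgroup differences are over time.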

I should note that accurate human transcripts are not always easy to get, even in English — see Taylor Jones et al., "Testifying while black: An experimental study of court reporter accuracy in transcription of African American English", Language 2019. Even when speakers and transcribers are well matched, lack of standards in things like the notation of disfluencies is often a problem. Current ASR results in such cases are even more problematic, and the errors may affect downstream automated analysis even more than they affect court proceedings.

So let me repeat what I wrote earlier in this post:

Experimenting with speech-to-text in such applications is a very good idea, but relying on it without accurate human-corrected transcripts is a very bad idea. […]

A better approach, in my opinion, is to use ASR input to make human transcription faster […] and then to use the human transcripts to improve the ASR by way of better-adapted language modeling.

This will also let us estimate the size and effect of the cited problems, as time goes on.

And let me add that we need accurate psychometric norming of such tasks, automated or not, across large and varied populations.



  1. Cervantes said,

    February 27, 2021 @ 11:25 am

    I work with transcripts of medical encounters, and label speech acts and topic categories. Making transcripts is expensive, obviously. With good transcripts, we've been able to label speech act categories — e.g. open and closed question, representative, directive — with acceptable accuracy, kappas above .7, although that's with human parsing.

    Even with a substantial word error rate, and erratic handling of disfluencies, I wonder if a useful signal could still be extracted. The diarization problem is a killer, however. Does it work better if the speakers are, say, different sexes with very different vocal range?

    [(myl) Diarization is getting better — developers now mostly understand that it's a problem, which is a step forward. Results will certainly be better if (a) the number of speakers is known (e.g. = 2), (b) the speakers' voices are different (e.g. in pitch), and (c) the audio quality is good. But the error rate is probably still higher than you'd want, for the kind of recordings you're likely to be using.

    Making transcripts doesn't need to be excessively expensive, even without outsourcing to distressingly low-wage countries. Our experience is that with appropriate transcription software, we can get good-quality results with about 7x real time (i.e. 7 hours of work for 1 hour of audio). Experiments suggest that integrating speech-to-text and intelligent type-ahead should be able to reduce that to 1x (or near it). The clinician (and clinic) costs for your medical encounters must be high enough that an extra $15-20 per hour in transcription costs would not be a large fractional increase.

    And we can expect that full automation will come before long. My argument is just that we need to get there gradually and with careful consideration of the various issues that will arise.]

  2. Cervantes said,

    February 28, 2021 @ 9:28 am

    Well, I don't pay the clinical costs. This is observational research of real world encounters, and the insurance company is paying for them. The transcription cost is a substantial part of the overall budget, though not as much as the coding. The holy grail for me is really the machine being able to parse speech acts. If it can do that, it should be able to label them with useful accuracy.
