Shelties On Alki Story Forest
Last week I gave a talk at an Alzheimer's Association workshop on "Digital Biomarkers". Overall I told a hopeful story about the prospects for a future in which a few minutes of interaction each month, with an app on a smartphone or tablet, will give effective longitudinal tracking of neurocognitive health.
But I emphasized that we're not there yet, and that some serious research and development problems stand in the way. In particular, the current state of the art in speech recognition is not yet good enough for reliable automated evaluation of spoken responses.
Speech-based tasks have been part of standard neuropsychological test batteries for many decades, because speaking engages many psychological and neurological systems, offering many (sometimes subtle) clues about what might be going wrong. One of many such tasks is describing a picture, for which the usual target is the infamous Cookie Theft:
[the Cookie Theft picture]
It's past time to replace this image with pictures that are less dated and culture-bound — and in any case, we'll need multiple pictures for longitudinal tracking — but this is the one that clinical researchers have mostly been using. Whatever the source of the speech to be analyzed, many obvious measures — word counts, sentence structure, word frequency and concreteness, etc. — depend on a transcript, which at present is supplied by human labor.
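To make that concrete, here's a toy sketch of the simplest transcript-based measures. It's illustrative only, not an actual pipeline: real systems use proper tokenization and parsing, and the log_freq table here is a hypothetical stand-in for corpus-derived frequency norms.

```python
# Toy sketch of transcript-based measures (illustrative only).
# log_freq is a hypothetical stand-in for real frequency norms.
def lexical_measures(transcript, log_freq):
    words = transcript.lower().split()
    known = [w for w in words if w in log_freq]
    return {
        "word_count": len(words),
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "mean_log_freq": sum(log_freq[w] for w in known) / len(known) if known else None,
    }
```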
We've tried many speech-to-text solutions, including open-source packages and commercial APIs. And the technology is not quite there yet.
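For reference, here's roughly what sending a recording to one such commercial API looks like. This is a minimal sketch using the Google Cloud Speech-to-Text Python client, assuming a short 16 kHz mono LINEAR16 WAV file and application-default credentials:

```python
# Minimal sketch: transcribe a short WAV with Google Cloud Speech-to-Text.
# Assumes 16 kHz mono LINEAR16 audio and default credentials.
from google.cloud import speech

def transcribe(path):
    client = speech.SpeechClient()
    with open(path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)
```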
Sometimes the results are not bad:
Human transcription:
Meanwhile while she's doing that,
the- her kids are going for the cookie jar up in the cabinet.
((um)) and her son needs a uh high chair to- to take them down.
He's got one and he's- he's in trouble because the chair's going backwards.
He's going to fall down and hurt himself.
Meanwhile the little girl is there just taking the cookies.
Google Cloud Speech-to-Text:
meanwhile while she's doing that
the kids I'm going for the cookie jar up in the cabinet
and his son needs a high chair to the take him down
he's got one of these he's in trouble with the chairs going backwards
he's going to pull down and hurt himself
meme of the little girl is Dad just taking the cookies
But sometimes the output is just weird:
Human transcription:
They- they shouldn't be taking cookies ((from)) the top shelf.
He's on a- a high stool.
He's going to fall. It's dangerous. The girl is um watching it.
Google Cloud Speech-to-Text:
The shipping music hookers in talk Shelties On Alki Story Forest and just
the girls watching it
How can we fix this?
We can work on getting better sound quality in such recordings. But it's already a challenge to get clinicians to be better sound engineers. And recordings from web apps, on multiple devices in multiple contexts in the hands of demographically and clinically diverse users, are sure to be problematic. It may help to work on ways to analyze recordings and suggest remedies ("Please move away from the air conditioner" or "This will work better if you can turn off the TV or move to another room"). But the technology will still need to cope with less-than-optimal recordings.
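As an illustration of the kind of automated check that might drive such suggestions, here's a toy sketch that flags clipping and very quiet recordings before recognition is attempted. It assumes 16-bit mono WAV input, and the thresholds are illustrative assumptions, not calibrated values:

```python
# Toy recording-quality check: flag clipping and low signal level.
# Assumes 16-bit mono WAV; thresholds are illustrative assumptions.
import wave
import numpy as np

def quality_warnings(path):
    with wave.open(path, "rb") as w:
        raw = w.readframes(w.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    warnings = []
    if np.mean(np.abs(samples) > 0.99) > 0.001:  # >0.1% of samples near full scale
        warnings.append("Recording is clipping; please reduce input gain.")
    if np.sqrt(np.mean(samples ** 2)) < 0.01:    # very quiet overall
        warnings.append("Very quiet recording; please move closer to the microphone.")
    return warnings
```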
We can expect that general speech-to-text technology will continue to improve.
But the most important remedy is language models that are better adapted to specific tasks and speaker populations. A system's "language model" encodes its expectations about what's likely to be said. And current ASR systems are more dependent on their language models than they should be, compensating for the weakness of their acoustic analysis. The good news is that we know very well how to create and incorporate improved language models, if we have large enough amounts of good-quality transcriptions from sources similar to the target application.
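Here's a deliberately tiny sketch of the adaptation idea: a bigram model estimated from in-domain transcripts (say, a pile of Cookie Theft descriptions) re-ranks a recognizer's candidate hypotheses. Real systems use smoothed n-gram or neural language models combined with acoustic scores, but the principle is the same:

```python
# Toy "language model adaptation" sketch: estimate a bigram model from
# in-domain transcripts, then use it to re-rank candidate hypotheses.
# Add-one smoothing keeps unseen bigrams from zeroing out a hypothesis.
import math
from collections import Counter

def train_bigram_scorer(transcripts):
    unigrams, bigrams = Counter(), Counter()
    for t in transcripts:
        words = ["<s>"] + t.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    vocab = len(unigrams)

    def logprob(sentence):
        words = ["<s>"] + sentence.lower().split()
        return sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
            for a, b in zip(words, words[1:])
        )

    return logprob

# e.g. score = train_bigram_scorer(domain_transcripts)
#      best = max(candidate_hypotheses, key=score)
```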
We also need to know how various kinds of errors affect the diagnostic inputs we're interested in — "word error rate" is a useful general measure, but a given W.E.R. can correspond to a wide range of effects on diagnostically relevant features. (Or "biomarkers", to use the fancy word now in vogue.)
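For readers who haven't met it, word error rate is just word-level Levenshtein distance (substitutions, insertions, deletions) divided by the length of the reference transcript:

```python
# Word error rate: word-level Levenshtein distance divided by the
# reference length. Note that all error types count equally.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return float(bool(hyp))
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution/match
                d[i - 1][j] + 1,                               # deletion
                d[i][j - 1] + 1,                               # insertion
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

Because this counts a swapped function word the same as a mangled content word, identical W.E.R. values can have very different effects on the features that matter diagnostically.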
Some colleagues and I are starting a large-scale project to get speech data of this general kind: picture descriptions, "fluency" tests (e.g. "how many words starting with F can you think of in 60 seconds?"), and so on. The idea is to support research on analysis of such recordings, automated and otherwise, and to allow psychometric norming of both traditional and innovative measures, for both one-time and longitudinal administration, across a diverse population of subjects. We've got IRB approval to publish the recordings, the transcripts, and basic speaker metadata (age, gender, language background, years of education).
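As a small example on the scoring side, a phonemic fluency test reduces to counting distinct valid responses in the transcript. This sketch assumes a lexicon set is available and ignores the finer scoring rules (proper nouns, morphological variants) that real test protocols specify:

```python
# Toy scorer for a phonemic fluency test ("words starting with F"):
# count distinct valid responses. The lexicon set is a placeholder;
# real scoring rules also handle proper nouns and variant forms.
def fluency_score(responses, lexicon, letter="f"):
    valid = {
        w.lower() for w in responses
        if w.lower().startswith(letter) and w.lower() in lexicon
    }
    return len(valid)
```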
We've been testing the (browser-based) app across a variety of devices and users. When it's ready for prime time, this is one of many channels that we'll use to recruit participants — we're hoping for a few tens of thousands of volunteers.
Update 12/5/2019 — Just for grins, I tried the Mozilla DeepSpeech system on the same two files. The first one:
(1) Human transcription:
Meanwhile while she's doing that,
the- her kids are going for the cookie jar up in the cabinet.
((um)) and her son needs a uh high chair to- to take them down.
He's got one and he's- he's in trouble because the chair's going backwards.
He's going to fall down and hurt himself.
Meanwhile the little girl is there just taking the cookies.
(1) Google Cloud Speech-to-Text:
meanwhile while she's doing that
the kids I'm going for the cookie jar up in the cabinet
and his son needs a high chair to the take him down
he's got one of these he's in trouble with the chairs going backwards
he's going to pull down and hurt himself
meme of the little girl is Dad just taking the cookies
(1) Mozilla DeepSpeech:
meanwhile was his due that
the kids all go for the cooked up in the cabinet
in a sunday chair to the take em down
he's got one of these seas and trouble was the chairs and backwards
he's in for down i heard himself
be one the little girl is there just take the goods
The second one:
(2) Human transcription:
They- they shouldn't be taking cookies ((from)) the top shelf.
He's on a- a high stool.
He's going to fall. It's dangerous. The girl is um watching it.
(2) Google Cloud Speech-to-Text:
The shipping music hookers in talk Shelties On Alki Story Forest and just
the girls watching it
(2) Mozilla DeepSpeech:
i shouldn't say forestock shell
she's on a wall
cristofori ages
a girl is a thatching it
shubert said,
November 26, 2019 @ 9:54 am
One of the most useful posts in this forum in my humble opinion. Well, I can be a volunteer.
Barbara Phillips Long said,
November 26, 2019 @ 5:55 pm
Recently I used the U.S. Postal Service voice recognition system to set up a hold-mail order via phone. It took three tries, at least, to get the system to recognize the street address I was speaking. It got the house number wrong, incorporated the house number into the street name, and came up with other random errors, and I live on a street with a common name that is spelled and pronounced relatively uniformly.
Among other problems, I live in a zip code with two post offices. My mail is delivered more reliably if the post office closest to me is used in the address. The USPS voice recognition system would not let me use the name of that post office, which is also the name of my municipality. I wonder if there are programming problems in addition to voice recognition problems.
I had so many repeated problems that I had to hang up and restart. I was unimpressed by the quality of the software, particularly since responses can be compared to a known database. I can understand why many volunteers will be needed for more complex projects.