Mama Drama


So-called "verbal fluency" is one of the tasks we're using in the first iteration of the SpeechBiomarkers project (and please participate if you haven't done so!). Despite the test's name, it doesn't really measure verbal fluency, but rather asks subjects to name as many words as possible from some category in 60 seconds, like "animals" or "words starting with the letter F".

Here's the first ten seconds of one participant's "animals" response:

As you can hear, the audio quality is quite good, although it was recorded remotely using the participant's browser and their system's standard microphone. These days, standard hardware usually has pretty good built-in facilities for voice recording.

In order to automate the analysis, we need a speech-to-text system that can do a good enough job on data of this kind. As I've noted in earlier posts, we're not there yet for picture descriptions ("Shelties On Alki Story Forest", 11/26/2019; "The right boot of the warner of the baron", 12/6/2019). For "fluency" recordings, the error rate is worse — but maybe we're actually closer to a solution, as I'll explain below.

Here's a sample result, for the full 60-second recording whose beginning you just heard. In each pair of lines below, the first line is my transcription, and the second (capitalization and punctuation included) is the output of Amazon's online Transcribe system:

antelope ant aardvark bear barracuda    cat camel dog
antelope and art Bark Bear Barrack Cuda Cat Camel dog

elephant fish uh squid  octopus  shrimp oyster clam scallop
Elefant  Push A  squid! Octopus, shrimp Oyster clam scallop

llama dromedary uh goat sheep cow pig and ostrich     iguana  lizard
Mama  Drama Dairy  Goat She   cow Pig in our stretch. Iguana, lizard,

worm frog  toad  tadpole um   jackal  kangaroo  lion  tiger
worm Frog, Toad, tadpole hand Jackal, Kangaroo, lion, tiger!

Scoring this with NIST's sclite program yields an estimate of 71.1% word error rate:

# Words    Corr     Sub    Del    Ins   W.E.R.
     38   36.8%   60.5%   2.6%   7.9%   71.1%
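
If you want to reproduce this kind of scoring without installing sclite, a word-level minimum-edit-distance alignment yields the same counts. Here's a minimal sketch in Python (my own illustration, not part of our pipeline), which case-folds words on the assumption that Transcribe's capitalization carries no information:

    # Minimal sclite-style scorer: align the reference and hypothesis word
    # sequences by minimum edit distance, counting substitutions, deletions,
    # and insertions; WER = (S + D + I) / (number of reference words).

    def wer_counts(ref_words, hyp_words):
        """Return (subs, dels, inss) from a minimum-edit-distance alignment."""
        R, H = len(ref_words), len(hyp_words)
        # d[i][j] = (cost, subs, dels, inss) for ref_words[:i] vs hyp_words[:j]
        d = [[None] * (H + 1) for _ in range(R + 1)]
        d[0][0] = (0, 0, 0, 0)
        for i in range(1, R + 1):
            d[i][0] = (i, 0, i, 0)          # delete all reference words so far
        for j in range(1, H + 1):
            d[0][j] = (j, 0, 0, j)          # insert all hypothesis words so far
        for i in range(1, R + 1):
            for j in range(1, H + 1):
                if ref_words[i - 1].lower() == hyp_words[j - 1].lower():
                    match = d[i - 1][j - 1]                  # no cost for a match
                else:
                    c, s, dl, ins = d[i - 1][j - 1]
                    match = (c + 1, s + 1, dl, ins)          # substitution
                c, s, dl, ins = d[i - 1][j]
                delete = (c + 1, s, dl + 1, ins)             # deletion
                c, s, dl, ins = d[i][j - 1]
                insert = (c + 1, s, dl, ins + 1)             # insertion
                d[i][j] = min(match, delete, insert)
        return d[R][H][1:]

    ref = "antelope ant aardvark bear barracuda cat camel dog".split()
    hyp = "antelope and art Bark Bear Barrack Cuda Cat Camel dog".split()
    s, dl, ins = wer_counts(ref, hyp)
    print(f"WER = ({s}+{dl}+{ins})/{len(ref)} = {(s + dl + ins) / len(ref):.1%}")

Run on that opening fragment, it prints WER = (3+0+2)/8 = 62.5%; the 71.1% figure above is sclite's score over the full minute, and sclite adds conveniences this sketch omits, such as alignment printouts and significance tests.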

We see similar scores from other APIs on this and similar inputs.
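
For concreteness, here's roughly what driving Amazon Transcribe programmatically looks like, using the boto3 SDK. The bucket, file, and job names are hypothetical placeholders, and error handling is omitted; the sample above came from the service's online interface.

    # Sketch: transcribe one recording with Amazon Transcribe via boto3.
    # The bucket, key, and job name below are hypothetical placeholders.
    import time

    import boto3

    transcribe = boto3.client("transcribe")
    transcribe.start_transcription_job(
        TranscriptionJobName="fluency-animals-001",   # hypothetical job name
        Media={"MediaFileUri": "s3://my-bucket/fluency/animals-001.wav"},
        MediaFormat="wav",
        LanguageCode="en-US",
    )

    # Poll until the job finishes, then print the transcript's location.
    while True:
        job = transcribe.get_transcription_job(
            TranscriptionJobName="fluency-animals-001"
        )["TranscriptionJob"]
        if job["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
            break
        time.sleep(5)

    if job["TranscriptionJobStatus"] == "COMPLETED":
        print(job["Transcript"]["TranscriptFileUri"])

The returned TranscriptFileUri points to a JSON file with the transcript text plus per-word timestamps and confidence scores, which is what makes automated downstream analysis thinkable in the first place.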

What can we conclude from this?

Obviously the system didn't figure out that the recording was a list of animal names. That would be asking too much, in the current state of things. But it would be easy enough — at least in a traditional speech-to-text architecture — to build that constraint into the system.

By "a traditional speech-to-text architecture" I mean one that has a separate "language model". The output "Elefant", and the semi-random capitalization, are clues that this is a "sequence-to-sequence" model, which maps audio sequences to letter sequences through a complex "deep learning" network — though there are still ways to bias it in the direction of certain words. In a more traditional system, it should be easy enough to create a language model that does a good enough job of predicting animal-name sequences to massively improve performance on a task of this kind. It may be possible to do that in a sequence-to-sequence system as well, though the method is less clear to me.

5 Comments

  1. Philip Taylor said,

    May 27, 2020 @ 12:50 pm

'though a complex "deep learning" network' or 'through a complex "deep learning" network', Mark ?

    [(myl) Blame it on Apple's infernal butterfly keyboard, which results in 5-10% of all my keypresses being omitted or doubled or tripled or echoed later in the sequence. ]

  2. milu said,

    May 27, 2020 @ 4:04 pm

    I love how the AI seems especially excited about "squid" and "tiger" :D
    I… don't disagree.

  3. Michael Watts said,

    May 29, 2020 @ 12:59 am

    Obviously the system didn't figure out that the recording was a list of animal names. That would be asking too much, in the current state of things. But it would be easy enough — at least in a traditional speech-to-text architecture — to build that constraint into the system.

    Indeed, it would be straightforward — not necessarily easy, but conceptually easy — to restrict the model to recognizing only animal names.

    I'm not sure such a constrained model would actually be suited to this task, though. That would mean that the output consisted entirely of correctly-formed animal names. But you're testing for cognitive decline. If someone messes up an animal name, do you really want their error to be automatically corrected by the speech-to-text software? What if they actually do say "brick" [which isn't an animal at all] and the animals-only speech-to-text recognizes "barracuda"?

  4. Michael Watts said,

    May 29, 2020 @ 1:02 am

    As a followup remark, speech transcription software gets a lot of mileage out of the assumption that whatever the speech sounds like, it was well formed. Thus, the output must be well formed.

    It's easy to use this assumption to make the software look like an idiot. But my worry here is that you may be specifically trying to identify samples of speech that is not well formed. Using a tool that relies on the assumption that all speech is well formed seems dangerous.

  5. Michèle Sharik Pituley said,

    June 1, 2020 @ 12:08 pm

    Question: did she pronounce llama as lama or yama? Do the different pronunciations produce different results?

    [(myl) "Lama". I'm betting that the pronunciations will produce different results — though both quite possibly wrong — but you can check for yourself:
    https://aws.amazon.com/transcribe/
    ]
