So-called "verbal fluency" is one of the tasks we're using in the first iteration of the SpeechBiomarkers project (and please participate if you haven't done so!). Despite the test's name, it doesn't really measure verbal fluency, but rather asks subjects to name, in 60 seconds, as many words as possible from some category, like "animals" or "words starting with the letter F".

Here's the first ten seconds of one participant's "animals" response:

[Audio clip: the first ten seconds of the participant's "animals" response]

As you can hear, the audio quality is quite good, although it was recorded remotely using the participant's browser and their system's standard microphone. These days, standard hardware usually has pretty good built-in facilities for voice recording.

In order to automate the analysis, we need a speech-to-text system that can do a good enough job on data of this kind. As I've noted in earlier posts, we're not there yet for picture descriptions ("Shelties On Alki Story Forest", 11/26/2019; "The right boot of the warner of the baron", 12/6/2019). For "fluency" recordings, the error rate is worse — but maybe we're actually closer to a solution, as I'll explain below.



Here's a sample result for the full 60-second recording whose beginning you just heard. In what follows, each line labeled "Human" is my transcription, and the line labeled "Amazon" below it is the output of the online Amazon Speech Transcription system:

H: antelope ant aardvark bear barracuda cat camel dog
Amazon: antelope and art Bark Bear Barrack Cuda Cat Camel dog

H: elephant fish uh squid octopus shrimp oyster clam scallop
Amazon: Elefant Push A squid! Octopus, shrimp Oyster clam scallop

H: llama dromedary uh goat sheep cow pig and ostrich iguana lizard
Amazon: Mama Drama Dairy Goat She cow Pig in our stretch. Iguana, lizard,

H: worm frog toad tadpole um jackal kangaroo lion tiger
Amazon: worm Frog, Toad, tadpole hand Jackal, Kangaroo, lion, tiger!

Scoring this with NIST's sclite program yields an estimate of 71.1% word error rate:

# Words   Corr    Sub     Del    Ins    W.E.R.
38        36.8%   60.5%   2.6%   7.9%   71.1%
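For readers who haven't met the metric: word error rate is (substitutions + deletions + insertions) divided by the number of words in the reference transcription, which is why the Sub, Del and Ins rates above add up to the W.E.R. figure, give or take rounding. The following is a minimal sketch of that calculation in Python, not sclite itself. The function names are my own, the text normalization (lowercasing, no punctuation) is an assumption, and sclite may break ties between equally cheap alignments differently. The example uses only the first eight reference words and the corresponding ten words of system output.

```python
def align_counts(ref, hyp):
    """Count (substitutions, deletions, insertions) in a minimum-edit alignment."""
    # best[i][j] holds (edits, subs, dels, ins) for aligning ref[:i] with hyp[:j].
    best = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        best[i][0] = (i, 0, i, 0)  # delete all i reference words
    for j in range(len(hyp) + 1):
        best[0][j] = (j, 0, 0, j)  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = best[i - 1][j - 1]
            miss = int(ref[i - 1] != hyp[j - 1])  # 0 for a match, 1 for a substitution
            diagonal = (e + miss, s + miss, d, n)
            e, s, d, n = best[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = best[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            # Tuples compare on total edits first, so this keeps a cheapest alignment.
            best[i][j] = min(diagonal, deletion, insertion)
    return best[-1][-1][1:]


def wer(ref, hyp):
    """Word error rate: (S + D + I) / number of reference words."""
    subs, dels, ins = align_counts(ref, hyp)
    return (subs + dels + ins) / len(ref)


# Assumed normalization: lowercase, punctuation already stripped.
ref = "antelope ant aardvark bear barracuda cat camel dog".split()
hyp = "antelope and art bark bear barrack cuda cat camel dog".split()

subs, dels, ins = align_counts(ref, hyp)
print(f"S={subs} D={dels} I={ins} WER={wer(ref, hyp):.1%}")
# prints: S=3 D=0 I=2 WER=62.5%
```

Note that "Barrack Cuda" for "barracuda" costs two errors (one substitution and one insertion), which is part of why the overall score looks so bad.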

We see similar scores from other APIs on this and similar inputs.

What can we conclude from this?

Obviously the system didn't figure out that the recording was a list of animal names. That would be asking too much, in the current state of things. But it would be easy enough — at least in a traditional speech-to-text architecture — to build that constraint into the system.

By "a traditional speech-to-text architecture" I mean one that has a separate "language model". The output "Elefant", a spelling that a fixed English word list would be unlikely to contain, is a clue that this is a "sequence-to-sequence" model, which maps audio sequences to letter sequences through a complex "deep learning" network, although there are still ways to bias such a system in the direction of certain words. In a more traditional system, it should be easy enough to create a language model that does a good enough job of predicting animal-name sequences to massively improve performance on a task of this kind. It may be possible to do that in a sequence-to-sequence system as well, though the method is less clear to me.
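To make the language-model idea concrete, here is a toy sketch in Python, under loud assumptions: it is a uniform unigram model over a tiny closed vocabulary, not what any real recognizer uses, and every name in it (`ANIMALS`, `FILLERS`, `OOV_MASS`, the functions) is hypothetical. A real system would draw on a large list of animal names and would combine the language-model score with acoustic scores during decoding. The point is only that such a model strongly prefers the animal-name reading of the first line over the output we actually got.

```python
import math

# Hypothetical closed vocabulary. Here it is just the animals from the first
# line of the transcription; a real list would hold thousands of names.
ANIMALS = {"antelope", "ant", "aardvark", "bear", "barracuda", "cat", "camel", "dog"}
FILLERS = {"uh", "um", "and"}
VOCAB = ANIMALS | FILLERS

# Assumed probability mass reserved for any out-of-vocabulary word.
OOV_MASS = 1e-6


def word_log_prob(word):
    """Uniform probability over the vocabulary, with a small floor for other words."""
    if word in VOCAB:
        return math.log((1.0 - OOV_MASS) / len(VOCAB))
    return math.log(OOV_MASS)


def sequence_log_prob(words):
    """Unigram log-probability of a word sequence (case-insensitive)."""
    return sum(word_log_prob(w.lower()) for w in words)


def oov_words(words):
    """Words that the closed vocabulary does not license."""
    return [w for w in words if w.lower() not in VOCAB]


human = "antelope ant aardvark bear barracuda cat camel dog".split()
amazon = "antelope and art Bark Bear Barrack Cuda Cat Camel dog".split()

print("human :", round(sequence_log_prob(human), 1))
print("amazon:", round(sequence_log_prob(amazon), 1), "out of vocabulary:", oov_words(amazon))
```

Under this model "art", "Bark", "Barrack" and "Cuda" each pay the out-of-vocabulary penalty, so a decoder weighing it against the acoustic evidence would have a strong reason to prefer "aardvark" and "barracuda".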

