The right boot of the warner of the baron


Here at the UNESCO LT4All conference, I've noticed that many participants assert or imply that the problems of human language technology have been solved for a few major languages, especially English, so that the problem on the table is how to extend that success to thousands of other languages and varieties.

This is not totally wrong — HLT is a practical reality in many applications, and is being rapidly spread to others. And the problem of digitally underserved speech communities is real and acute.

But it's important to understand that the problems are not all solved, even for English, and that the remaining issues also represent barriers for extensions of the technology to other communities, in that the existing approximate solutions are far too hungry for data and far too short on practical understanding and common sense.

There are many ways to make this point. We could look at the Winograd Schema Challenge among many other text-understanding problems. On the speech side, we could look at the current state of algorithms for diarization and speaker change detection.

But the attitude that caught my attention at this conference was epitomized in a presentation by Kelly Davis, who introduced Mozilla's Common Voice and DeepSpeech projects. These are great projects, well designed to make it possible for (certain kinds of) new languages to be added with minimal new engineering, since they rely on internet-based collection and validation of read speech from recruited volunteers, and sequence-to-sequence training of a system based on the results of that collection. But Kelly's presentation suggested that these projects have solved the speech-to-text problem for English, so that all we need for each additional language is to recruit enough readers and validators to create an open dataset of 10,000 hours of read speech.

I'm strongly in favor of the Common Voice project, though it would be nice to have a way to add conversational or other forms of spontaneous speech, both because spontaneous speech is different, and because some language communities are primarily oral. Today, though, I want to make the point that this admirable approach is not the end of the story.

Here's the start of a Librivox reading of Jane Austen's Pride and Prejudice:

Original text:

It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.

However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.

Mozilla DeepSpeech transcript:

it is a true universally acknowledged that a single man in possession of a good fortune must be in want of a while however little known the feelings are views of such a man may be on his first entering a neighbourhood this tree is so well sick in the minds of the surrounding families that he is considered the rightful property of some one or other of their daughters

5 substitutions in 70 words = Word Error Rate (WER) of 7.1%

That's pretty good! Though sequences like "it is a true universally acknowledged" and "in want of a while" are not outputs that we should accept from an entity that knows English, especially given that the phonetics are pretty clear. Those errors are all too typical of the behavior of sequence-to-sequence algorithms like DeepSpeech, and represent a type of error that would probably not be made by a system with an architecture that knows about the secret entities called "words".

If we add a little reverb and clipping, the passage remains intelligible:

But DeepSpeech now gives us:

it is a true universal is now but a in the man in possession of a good fortune must be in one of the whale however little known the feelings are views in such a man may be on his first entering a neighbourhood this treatise well sick in the minds of the sempalys that he is considered the rightful property of some one or other their daughters

NIST sclite WER 27.1%

[Note: "Word Error Rate" is defined as the sum of word substitutions, deletions and insertions, divided by the number of words in the reference (true) transcript. But it's sometimes not obvious how to count what output as what kind of error, so I've followed the normal practice by relying on the choices made by the cited NIST software.]

And if we add some white noise, it still sounds the same, just with a hiss in the background — but DeepSpeech gets even more confused, to the point where I won't bother trying to put the errors in bold face:

it is universal in now that is in the land of the session of a good portion must in want o a while however little more deadhouse at such men may be on his recent neighbourhood sisters i sowed the mind of the transmitted the right boot of the warner of the baron

NIST sclite WER is 68.6%
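The white-noise condition is easy to reproduce. Here is one simple way to mix in white noise at a chosen signal-to-noise ratio (a NumPy sketch of my own; `add_white_noise` is an illustrative helper, not part of DeepSpeech or Common Voice):

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Mix white Gaussian noise into a float waveform at a target
    signal-to-noise ratio, given in dB."""
    rng = rng if rng is not None else np.random.default_rng()
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```

At, say, 20 dB SNR the hiss is clearly audible but the speech remains easy for a human listener to follow.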

I should note that we could retrain the system while adding lots of such noise and distortion to the original training set, and the performance would improve — on similar inputs with similar kinds of noise and distortion. But the world is full of different kinds of input, in particular conversational (and otherwise spontaneous) speech, and the world is also full of lots of different kinds of noise and distortion. For example, we could add a very little bit of babble noise, which hardly changes the human perception at all:

But DeepSpeech gets even more confused, and starts leaving out whole stretches:

it is universal in now that is in the man of possession of a good portion must in want o a i however what a mongoose is such men may be on his breunner hood

NIST sclite WER is 75.7%

As a random example of the kind of spontaneous speech that's Out There, here's a "Cookie Theft" description (see "Shelties On Alki Story Forest" for discussion of the genre), recently recorded, of relatively high audio quality:

Human Transcription:

A: Alright
B: Go ahead.
A: Okay, a girl is getting a cake out of the cupboard
A: and she's almost going to fall on the floor doing it
A: Her-
A: The girl has a cookie jar in her hand.
A: She has-
A: She's grabbing an other girl's hand, but looks like she might fall on the floor.
A: And this girl over here
A: is washing the dishes
A: and drying them.
A: And actually the sink looks like it's overflowing.
A: This girl puts- has an apron on. She looks like she's uh –x

We don't need to add any noise or distortion to cause problems here:

Mozilla DeepSpeech:

go head
and girls getting it kakakew
and she's almost gone on for doing it
or the girl had a cooky jar in her hand
she has
she grabbing another girl's hand but looked like she might fall on the floor
and this girl or here
is washing the dishes
and drying them
actually the sink looks i could see her folly
is girl but as a burnished like she is

NIST sclite WER 47.7%

"Getting it kakakew" indeed.



  1. Judith Klavans said,

    December 6, 2019 @ 9:47 am

    Bravo. #LT4ALL has been both an encouraging and discouraging eye-opener. The naivete of some of our most respected researchers is astounding. The claim that a robot can help teach autistic children in their native indigenous language is preposterous without another 20 or 30 years of development and, above all, evaluation. What evidence is there for saying this? I'm ashamed of our technical community for not taking the time to respect the other fields and approaches in the room. Plus the hyperbole is hard to take. In my talk, I emphasized how absolutely wondrous the latest transfer learning and other technical results have been. However, let's have some balance here – not a cure all and end all.

  2. Andrew JOSCELYNE said,

    December 7, 2019 @ 1:34 am

Just reading the English subtitles provided for #LT4ALL speakers was evidence enough that we obviously don't yet have truly smart speech tech for the planet's most L-technologized language. But it was a terrific test-bed for transcription engines working on a broad range of non-native M/F accents spouting specialized English content. For me the most interesting, if angst-inducing, takeaway at this event was someone's comment that indigenous communities may wish to "own" their languages, presumably pointing partly to their speech data collected in an effort to make their tongues sustainable. This notion would create yet another rift between "big" languages and the rest when it comes to sharing tech solutions to communication disabilities.

  3. Kelly Davis said,

    December 7, 2019 @ 1:56 am

    Nice analysis.

I agree that for controlled, clean audio Deep Speech does well. However, for distortions, as you've found, it does not perform as well. This is true of most production systems today.

    [(myl) Yes — and even without notable "distortions", the current state of the art sometimes goes off the rails badly. See "Shelties on Alki Story Forest" for how Google Cloud Speech-to-Text does on some Cookie Theft recordings.]

As for me stating or even suggesting that "these projects have solved the speech-to-text problem for English": I didn't say or mean to suggest any such thing.

    I made careful note to always say "production quality systems", which means what people deploy today. These "production quality systems", as anyone who has an accent or has used them in noisy situations will attest to, are imperfect.

    [(myl) Unfortunately, a general audience like the folks at LT4All understands "production quality systems" to mean "modern bigtime speech-to-text solutions", not "systems giving good results only for high-quality recordings of carefully-read standard-accent speech".]

  4. Odette Scharenborg said,

    December 7, 2019 @ 2:24 am

    I couldn't agree more. Remarks and sentiments like these are the death of speech tech. Thank you for writing this down so clearly! I'm spreading this blog post on social media as I believe this is a must read for everyone. Steps need to be taken to remove the misconception that 1. Speech technology is solved, and that 2. Existing language and speech technology can simply be "rolled out" to a new language.

  5. Daniel Barkalow said,

    December 9, 2019 @ 5:39 pm

    Remarkable that the transcription of the undistorted Librivox recording manages to get only a 7.3% error rate, yet produce output that's practically incomprehensible. Not that Austen is particularly amenable to a careless reading, but "wife"->"while" and "or"->"are" destroy both the semantics and syntax of this passage enough that it must be universally acknowledged that this tree is so well sick in the mind.

  6. Joseph Mariani said,

    December 10, 2019 @ 8:39 am

This was the reason for my question « Where are we now? » to the panelists of this session on the state of the art in Language Technologies at the LT4All conference. But time constraints only allowed us to get answers from Ahmed Ali on dialectal variants (we must now address finer-grained variants) and Aijun Li on prosody/tone languages (midway). Roger Moore told me he was going to say "Stone age" for robot spoken interaction. It would certainly have been interesting to also get feedback from the other panelists, including Sebastian Stüker on spoken language processing, and in particular on the very challenging task of simultaneous translation, which combines the problems of speech recognition and machine translation.

Interestingly, Sebastian very courageously agreed to provide the simultaneous translation system he developed at his company (kites) throughout the whole conference (including the times when the human interpreters left for lunch or went home after 17h30…).

I happened to check and analyze the data produced by the system during this session on the state of the art. Here are some nice examples of the problems encountered by the system at the beginning of the session, with Volker Steinbiss as moderator and Judith Klavans as the first speaker.

    Initial speech:

    Volker Steinbiss (Moderator): Yeah, we will start with Judith Klavans. She is a linguist and computer scientist, and has been active in academia, industry and government in furthering the development and application of computational approaches to the study of language. She’s a senior research scientist at the University of Maryland, USA.

    Judith Klavans: Thank you. (claps) I’m happy to be here for several reasons. First of all, thank you to the translators up there in the booths. (claps) We are acknowledging the work they are doing for us. I will try to speak slowly for you.

    Automatic transcription:

    Yeah, we will start with Judith Cravens, as she is linguists and computer scientists and has been active in academia industry and government in furthering the development and application of computational approaches to the study of language issues, the senior research scientist at the University of Maryland, Usa.

    Thank. I 'm happy to be here for several reasons. First of all. Thank you to the translators up there in the believes we are and knowledge. I will try to speak slowly for you.

We see the problems raised by people's names, plurals, punctuation, and noise (claps, which induce recognition errors). Short words are added ("as") or misrecognized ("She's a"/"issues the").

    Machine translation into French is then conducted from this transcription:

    Automatic translation into French:

    Nous commençons avec les maisons de Judith, comme elle est linguiste et informatique et elle a été active dans l'industrie et le gouvernement pour promouvoir le développement et l'application des approches computationnelles à l'étude des problèmes de langue, le scientifique de la recherche à la place de Maryland, aux États-Unis.

    Merci. Je suis heureux d'être ici pour plusieurs raisons. Tout d'abord. Merci aux traducteurs là-haut dans la croyance que nous sommes et que nous savons. Je vais essayer de vous parler lentement.

    Here is my translation of this automatic translation into French for those who don’t read French:

    Human translation in English of the automatic translation into French:

    We will start with Judith’s houses, as she is linguist and computer science and has been active in industry and government for furthering the development and application of computational approaches for the study of language problems, the scientist of research in Maryland place, in the USA.

    Thanks. I’m happy (masculine) to be here for several reasons. First. Thanks to the translators up there in the belief that we are and that we know. I will try to speak slowly to you.

The error on the name results in a complete misunderstanding when translated into French. The error on punctuation results in "academia" going missing, and the addition of "issues" also induces translation errors.

The most poetic part is the result of the misrecognition due to the claps: "Thanks to the translators up there in the belief that we are and that we know", which nicely summarizes the relationship between interpreters and speakers! ;-) Less important is the error "speak slowly to you (the attendees)" instead of "speak slowly for you (the interpreters)".

    This obviously shows that progress is still needed in this kind of challenging task, while some parts are perfectly transcribed and translated.

    But it is also interesting to have a look at what the professional interpreter did on the same speech:

    Human interpretation into French by professional interpreter:

    Volker Steinbiss (Modérateur): Nous allons donc commencer avec Madame Judith Klavans qui est donc linguiste et informaticienne, euh… qui a travaillé dans l’université, pour l’état, pour… la… travaillé donc sur l’approche computationnelle de l’étude des langues. Elle est euh… chercheuse et enseignante à l’Université de Maryland aux Etats-Unis.

Judith Klavans: Je suis heureuse d’être ici pour de nombreuses raisons. Tout d’abord, merci aux interprètes qui sont là-bas donc dans les cabines. Nous saluons leur travail. Et les interprètes vous remercient également. J’essayerai donc de m’exprimer lentement.

Here is my translation for those who don’t read French:

    Human translation in English of human interpretation in French by professional interpreter:

    We will therefore start with Madam Judith Klavans who is therefore linguist and computer scientist, hum… who worked in academia, for the government, for… the… therefore worked on the computational approach for studying languages. She is hum… researcher and teacher at the University of Maryland in the United States.

    I’m happy to be here for several reasons. First, thanks to the interpreters who are over there in the booths. We praise their work. And the interpreters also thank you. I will therefore try to speak slowly.

We see that the interpreter takes some liberty with the content: Judith's activity in "industry" is omitted, as is the fact that she is a "senior" researcher, while it is added that she is a teacher. The interpreter uses various mechanisms to keep the listener waiting for the translation (adding "donc"/"therefore", hums, stuttering, non-syntactic sentences) and adds a sentence of his own ("And the interpreters also thank you"), while even improving the initial content (adding "Madam", which lets the listener infer the gender of the person, and translating "translator" as "interpreter"). This shows that human interpretation is not only spoken language translation but includes several other cognitive tasks, while not conveying 100% of the initial content. Automatic spoken translation stays closer to the original speech, but becomes incomprehensible in case of recognition errors, such as those caused by noise (here, the claps).

  7. Judith Klavans said,

    December 10, 2019 @ 5:55 pm

    Joseph, I'm happy to know that I have "houses" according to the translation! I also agree that it was brave and exciting for Sebastian to use this very meeting as a testbed.

  8. Julian said,

    December 11, 2019 @ 4:59 pm

    Apologies if this comment is too naive or obvious to the experts: but what struck me most about this is that I understood all the Austen clips equally easily. How amazing and awesome are our human abilities to screen out the extra stuff that the computer finds so crippling.

  9. Paul Turpin said,

    December 12, 2019 @ 11:59 am

The recordings are already distortions of "Austenian" English, and are quite far from Received Pronunciation… Could Librivox be trained to recognize various accents..?
