The state of speech-to-text

…if you haven't noticed, is good. There are many applications, from conversing with Siri and Alexa and Google Assistant, to getting voicemail in textual form, to automatically generated subtitles, and so on. For linguists, one parochial (but important) application is accurate automatic transcription of speech corpora, and the example that motivates this post comes from that world.

We start with Bill Labov's invention of the sociolinguistic interview. In a just-published book, Conversations with Strangers, he describes the history:

This Element documents the evolution of a research program that began in the early 1960s with the author's first investigation of language change on Martha's Vineyard. It traces the development of what has become the basic framework for studying language variation and change. Interviews with strangers are the backbone of this research: the ten American English speakers appearing here were all strangers to the interviewer at the time. They were selected as among the most memorable, from thousands of interviews across six decades. The speakers express their ideas and concerns in the language of everyday life, dealing with their way of earning a living, getting along with neighbors, raising a family – all matters in which their language serves them well. These people speak for themselves. And you will hear their voices. What they have to say is a monument to the richness and variety of the American vernacular, offering a tour of the studies that have built the field of sociolinguistics.

…and over the years since then, Bill retained the tapes — not only of the interviews done for his personal research, but also of the interviews done over many decades by the students who took his course Ling560, "The Study of the Speech Community", documenting Philadelphia neighborhoods. (For a description, see Chapter 3 of Josef Fruehwald's 2012 dissertation.)

A few years ago, Penn's library underwrote an effort to digitize all of the 8,512 tapes stacked on shelves in Bill's lab. And a small group of us has been working to curate the results into publications that can be used in future research. We're starting with the recordings from Bill's work in Martha's Vineyard in 1961-62, documented in the 1963 Word paper "The social motivation of a sound change" — and we've started looking at the quality of automatic transcription of these recordings.

The traditional practice in quantitative sociolinguistics was not to transcribe whole interviews, instead listening and counting perceptual categories, or pulling out short segments for instrumental analysis; and so transcriptions don't exist for most of the material in the Labov Archive. Today there are good reasons for doing full transcriptions. Digital text analysis enormously improves the productivity and accuracy of research that depends on the statistics of word choice, grammatical structure, and anything else that can be calculated from an orthographic transcription. And methods inter-relating text and audio (like "forced alignment") make it possible to automate phonetic, morphophonemic and sociolinguistic analysis of transcribed recordings — see e.g. Michelle Minnick Fox, "Usage-based effects in Latin-American Spanish syllable-final /s/ lenition", 2006; or Yuan and Liberman, "Investigating /l/ Variation in English through Forced Alignment", InterSpeech 2009; or by the same authors, "Automatic Detection of 'g-dropping' in American English Using Forced Alignment", IEEE ASRU 2011.
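
To make the alignment idea concrete: a forced aligner takes a recording plus its transcript and returns word- and phone-level time intervals, typically as a Praat TextGrid, which can then be queried programmatically. Here is a minimal sketch of that downstream querying step, using the third-party Python textgrid package rather than the tools used in the studies cited above; the file name, tier name, and phone label are placeholders:

    import textgrid  # pip install textgrid

    # Read a TextGrid produced by a forced aligner
    # (with word- and phone-level interval tiers).
    tg = textgrid.TextGrid.fromFile("interview_aligned.TextGrid")
    phones = tg.getFirst("phones")

    # Example query in the spirit of the /l/-variation study:
    # list the time span and duration of every aligned /l/ token.
    for ph in phones:
        if ph.mark == "L":
            dur_ms = (ph.maxTime - ph.minTime) * 1000
            print(f"/l/ at {ph.minTime:.3f}-{ph.maxTime:.3f} s ({dur_ms:.0f} ms)")

Once intervals like these are in hand, measurements over thousands of tokens become a matter of scripting rather than hand-segmentation.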

But accurate transcription is time-consuming. If you choose software that's not well designed for transcription (like Praat), the job can take 20-40 hours of labor per audio hour. With well-designed transcription software (like WebTrans) and proper training, that can be reduced to 3 hours per audio hour or even less — but that's still expensive, at a scale of thousands or tens of thousands of hours of data.

And until recently, automatic speech-to-text software has not been very reliable when applied to real-world recordings — see e.g. here and here.

However, things have been getting better! Here's a brief Q&A sequence from one of Bill Labov's Martha's Vineyard recordings, an interview that took place almost exactly 62 years ago:

And here is the (automatic) speech-to-text result, as returned by rev.ai. For ease of reading, I've formatted the transcript as a table, removing the turn-level time marks:

Speaker 0: Now these questions that I'm going to ask you concern certain words and ideas that are very common, everyday expressions, uh, and questions. But what, strangely enough, uh, there is very little agreement about. The first question I wanna ask you is, uh, what is common sense? What does that mean to you?
Speaker 1: Well, I suppose a rationalization of, uh, the, um, uh, situation in which a person finds themselves in whatever it may be.
Speaker 0: Do you think most people,
Speaker 1: If I have a wooden leg, let's say, I must realize that that leg is not as good as the other one and, uh, govern my actions accordingly. And that is just an illustration.
Speaker 0: Do you think most people have common sense?
Speaker 1: Yes, I do. But, uh, it's not, uh, the same as it used to be because, uh, the aspect of the situation in which most people find themselves today has changed drastically in the last 50 years or less. And the same rules that applied, uh, to people of my age group when I was a young fella, do not fit the situation today at all. I used to, for example, uh, look at the harness on a horse to see if it was weak anywhere and, uh, had to be watched. Today they think of carburetors and tires, inflation and whatnot. It's a different proposition. It's, uh, more, uh, technical. It's, uh, less a thing that you can operate with your hands, for example, than uh, something you have to use tools and machinery to handle.

The transcription and diarization are almost perfect in this case. The punctuation is slightly odd — though the principles for punctuating accurately-transcribed spontaneous speech are variable and fluid at best. And the field is just starting to make progress in separating message structure from the effects of the message-composing process, and representing the associated forms of prosodic variation.

Overall, this level of transcription quality is clearly good enough to support various forms of downstream phonetic, phonological, lexical, and morpho-syntactic processing. Detailed timing information is also provided — a .json version of this transcript, pretty-printed via jq but otherwise as Rev supplies it, is here. This gives the estimated start and end time of each transcribed word.
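
For readers who want to work with that timing information, here is a minimal sketch of pulling word-level timestamps out of such a file. It assumes the monologues/elements structure of Rev's JSON transcripts, in which each speaker turn contains timed "text" elements and untimed "punct" elements; the file name is a placeholder:

    import json

    # Load a Rev.ai transcript like the pretty-printed .json linked above.
    with open("labov_interview.json") as f:
        transcript = json.load(f)

    # Each "monologue" is one speaker turn; its "text" elements carry
    # estimated start ("ts") and end ("end_ts") times in seconds, while
    # "punct" elements (spaces, commas, periods) carry no timing.
    for turn in transcript["monologues"]:
        for el in turn["elements"]:
            if el["type"] == "text":
                print(f'Speaker {turn["speaker"]}  '
                      f'{el["ts"]:8.2f} {el["end_ts"]:8.2f}  {el["value"]}')

A word list with timestamps like this is exactly the input that forced-alignment and other audio-text methods need.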

The original recording in this case is of remarkably good audio quality. Bill explains the history at the start of Conversations with Strangers:

In the early decades of the twentieth century, various methods of recording speech had been used by dialectologists, and the recordings had become a standard part of the records they kept. And by the early 1950s, American linguists had begun to use the tape recorder to record the Indigenous languages of North America (Voegelin and Harris 1951). Recording conversations, rather than elicited words or phrases, yielded data on speech in its natural context of language use, and my plan was to use this data to analyze the sound system. At that time, the analysis of vowel sounds was done with the Kay Sonograph. I was a student in a course taught at Columbia by Franklin Cooper of Haskins Laboratories in 1962. I asked him whether the Kay Sonograph could be adopted for the study of dialect differences in the field. He said no, the noise level in the field was too high.

Undaunted, I turned to the high-quality Nagra recorder recommended by my film maker friend Murray Lerner, and I found that my Martha’s Vineyard recordings did allow me to study the entire range of speech sounds in the spoken language.

Not all sociolinguistic interviews have this level of sound quality, and transcription performance will of course degrade at lower SNRs, which we will certainly encounter in other interview tapes. Still, I think the prognosis is good for using Rev's automatic transcriptions as a starting point, with the only human processing needed at the pre-publication stage being anonymization and perhaps some light correction.

Update — As Yuval points out in the comments, the quality of speech-to-text (and other linguistic technologies) is much better in some languages and varieties than others, with English at or near the top. And of course "English" comes in many varieties, with some Englishes better covered than others. There's a lot of current work on improving the coverage of "underdocumented" languages and varieties, both by collecting data and by trying to develop algorithms that require less training data or can generalize better. More on this later…

8 Comments

  1. Bill Idsardi said,

    August 11, 2023 @ 6:42 pm

    Thanks Mark, this was wonderful, and I've ordered Bill's new book.

  2. Jerry Packard said,

    August 11, 2023 @ 8:11 pm

    For those who might not be familiar with Labov’s work, I recommend taking a glance at the following URL

    https://www.ling.upenn.edu/phono_atlas/home.html

    which displays much of the content of _The Phonological Atlas of North America_, produced by Labov with Sharon Ash and Charles Boberg, based on telephone surveys representing all the urbanized areas of North America.

  3. Cynthia McLemore said,

    August 11, 2023 @ 8:28 pm

    The Labov Archive at the Linguistic Data Consortium that Mark mentions in this post was a project taking shape under the leadership of LDC Executive Director Chris Cieri, a former student of Bill's. What Mark didn't mention, but encouraged me to, is that his work on the archive has been undertaken in Chris' memory (in addition to all the other excellent reasons to do it). This one's for you too, Chris.

  4. Yuval said,

    August 12, 2023 @ 3:23 am

    …if you haven't noticed, is good, in English.

  5. Chas Belov said,

    August 12, 2023 @ 4:36 pm

    If one is posting to the public for ongoing access as opposed to live use, one still needs to review the text for errors. I regularly see major errors in auto-captions of video and I regularly see them referred to as "craptions" by members of the disability community.

  6. Grant Barrett said,

    August 13, 2023 @ 7:55 am

    I've been fooling around with various implementations of "Whisper," OpenAI's neural-net audio-to-text transcription system. (OpenAI is the same company behind ChatGPT.) The original announcement: https://openai.com/research/whisper/. GitHub: https://github.com/openai/whisper.
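
    If you want to skip the wrappers and call the model directly, the core Python package takes only a few lines. A minimal sketch (the model size and file name are placeholders; pick whatever suits your machine):

        import whisper  # pip install openai-whisper

        # Load one of the pretrained models ("tiny" through "large");
        # bigger models are slower but more accurate.
        model = whisper.load_model("base")

        # Transcribe a local audio file; Whisper resamples it internally.
        result = model.transcribe("labov_clip.mp3")
        print(result["text"])

        # Segment-level timestamps come back as well.
        for seg in result["segments"]:
            print(f'{seg["start"]:7.2f}-{seg["end"]:7.2f}  {seg["text"]}')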

    There are a lot of projects using it on GitHub https://github.com/search?q=whisper+speech&type=repositories but the one that may be most interesting for linguistic purposes is WhisperX https://github.com/m-bain/whisperX. I have not worked with it yet, though, as I am not looking for academic-level precision in my transcriptions.

    (Although this one that "inverts" Whisper is compelling from a tinkering point of view https://github.com/collabora/WhisperSpeech; imagine taking some of the oldest field recordings and generating new audio of old speechways.)

    The best packaged implementation of Whisper on macOS is MacWhisper (with a modest fee for the "pro" version) https://goodsnooze.gumroad.com/l/macwhisper. For my purposes — transcribing language-related radio show listener voicemails and show segments — it works very well. It handled the Labov clip in 14 seconds and generated the text below (one of 11 different export formats).

    You'll see that this implementation automatically removes "uh" and other disfluencies. I believe this can be tweaked in one's own implementation. Also, MacWhisper does not currently do diarization (recognizing each speaker as distinct and labeling them), but supposedly the feature is coming soon. I have also found that it sometimes skips words for some reason, so as with the Rev.ai solution, you still have to proof the output. Still, it's something like 98% less costly and 70% faster (including cleanup time) than previous traditional solutions.

    OUTPUT:

    Now these questions that I'm going to ask you concern certain words and ideas that are very common in everyday expressions and questions, but what, strangely enough, there is very little agreement about.

    And the first question I want to ask you is, what is common sense?

    What does that mean to you?

    Well, I suppose a rationalization of the situation in which a person finds themselves in, whatever it may be.

    If I have a wooden leg, let's say, I must realize that that leg is not as good as the other one and govern my actions accordingly.

    And that is just an illustration.

    Do you think most people have common sense?

    Yes, I do, but it's not the same as it used to be.

    Because the aspect of the situation in which most people find themselves today has changed drastically in the last fifty years or less.

    And the same rules that applied to people of my age group when I was a young fellow do not fit the situation today at all.

    I used to, for example, look at the harness on a horse to see if it was weak anywhere and had to be watched.

    Today they think of carburetors and tires inflation and whatnot.

    It's a different proposition.

    It's more technical.

    It's less a thing that you can operate with your hands, for example, than something you have to use tools and machinery to handle.

  7. Taylor, Philip said,

    August 14, 2023 @ 2:22 am

    Your sample output interests me, Grant, for two reasons :

    1) It presents both Speaker A's output and Speaker B's output as a series of disjoint sentences. And this made me think — do we speak solely in sentences, or do we also speak in paragraphs ? Or from the opposite perspective, is a paragraph purely a typographic convention that is only loosely, if at all, connected with what was said when it is used in the transcription of direct speech ?

    2) The other reason is purely trivial, but did Speaker B really say "tires inflation" or did he say "tire inflation" ? The former does not seem like a conventional phrase.

  8. Grant Barrett said,

    August 14, 2023 @ 8:27 am

    Philip, I just re-listened to the audio, and the informant does say "tires inflation."

    I also notice, as I mentioned in my notes, that the Whisper speech-to-text missed some audio that Mark's method rendered just fine. There's some cross talk here, where Speaker 0 and Speaker 1 overlap. Whisper did not include the part below by Speaker 0.

    Speaker 0: Do you think most people,
    Speaker 1: If I have a wooden leg, let's say, I must realize that…
