…if you haven't noticed, is good. There are many applications, from conversing with Siri and Alexa and Google Assistant, to getting voicemail in textual form, to automatically generated subtitles, and so on. For linguists, one parochial (but important) application is accurate automatic transcription of speech corpora, and the example that motivates this post comes from that world.

We start with Bill Labov's invention of the sociolinguistic interview. In a just-published book, Conversations with Strangers, he describes the history:

This Element documents the evolution of a research program that began in the early 1960s with the author's first investigation of language change on Martha's Vineyard. It traces the development of what has become the basic framework for studying language variation and change. Interviews with strangers are the backbone of this research: the ten American English speakers appearing here were all strangers to the interviewer at the time. They were selected as among the most memorable, from thousands of interviews across six decades. The speakers express their ideas and concerns in the language of everyday life, dealing with their way of earning a living, getting along with neighbors, raising a family – all matters in which their language serves them well. These people speak for themselves. And you will hear their voices. What they have to say is a monument to the richness and variety of the American vernacular, offering a tour of the studies that have built the field of sociolinguistics.

…and over the years since then, Bill retained the tapes — not only of the interviews done for his personal research, but also of the interviews done over many decades by the students who took his course Ling560, "The Study of the Speech Community", documenting Philadelphia neighborhoods. (For a description, see Chapter 3 of Josef Fruehwald's 2012 dissertation.)

A few years ago, Penn's library underwrote an effort to digitize all of the 8,512 tapes stacked on shelves in Bill's lab. And a small group of us has been working to curate the results into publications that can be used in future research. We're starting with the recordings from Bill's work on Martha's Vineyard in 1961-62, documented in the 1963 Word paper "The social motivation of a sound change", and we've started looking at the quality of automatic transcription of these recordings.

The traditional practice in quantitative sociolinguistics was not to transcribe whole interviews, but rather to listen and count perceptual categories, or to pull out short segments for instrumental analysis. As a result, transcriptions don't exist for most of the material in the Labov Archive. Today there are good reasons for doing full transcriptions. Digital text analysis enormously improves the productivity and accuracy of research that depends on the statistics of word choice, grammatical structure, and anything else that can be calculated from an orthographic transcription. And methods inter-relating text and audio (like "forced alignment") make it possible to automate phonetic, morphophonemic and sociolinguistic analysis of transcribed recordings: see e.g. Michelle Minnick Fox, "Usage-based effects in Latin-American Spanish syllable-final /s/ lenition", 2006; or Yuan and Liberman, "Investigating /l/ Variation in English through Forced Alignment", InterSpeech 2009; or by the same authors, "Automatic Detection of 'g-dropping' in American English Using Forced Alignment", IEEE ASRU 2011.

But accurate transcription is time-consuming. If you choose software that's not well designed for transcription (like Praat), the job can take 20-40 hours of labor per audio hour. With well-designed transcription software (like WebTrans) and proper training, that can be reduced to 3 hours per audio hour or even less — but that's still expensive, at a scale of thousands or tens of thousands of hours of data.
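To make the scale concrete, here is a minimal sketch of the arithmetic, using the per-audio-hour rates quoted above. The corpus size of 5,000 audio hours is a purely hypothetical figure chosen for illustration, not a measurement of any actual archive, and the function name `labor_hours` is my own.

```python
def labor_hours(audio_hours, rate_low, rate_high=None):
    """Return (low, high) estimated labor hours for transcribing a corpus.

    Rates are hours of human labor per hour of audio.
    """
    if rate_high is None:
        rate_high = rate_low
    return audio_hours * rate_low, audio_hours * rate_high


# Hypothetical corpus size, for illustration only.
AUDIO_HOURS = 5000

# Software not designed for transcription: 20-40 labor hours per audio hour.
poor_tool = labor_hours(AUDIO_HOURS, 20, 40)
# Well-designed software plus training: about 3 labor hours per audio hour.
good_tool = labor_hours(AUDIO_HOURS, 3)

print(f"Poorly suited software: {poor_tool[0]:,} to {poor_tool[1]:,} labor hours")
print(f"Well-designed software: about {good_tool[0]:,} labor hours")
```

Even at the better rate, this hypothetical corpus would need 15,000 hours of labor, or about seven and a half person-years at an assumed 2,000 working hours per year.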

And until recently, automatic speech-to-text software has not been very reliable when applied to real-world recordings — see e.g. here and here.

However, things have been getting better! Here's a brief Q&A sequence from one of Bill Labov's Martha's Vineyard recordings, an interview that took place almost exactly 62 years ago:

And here is the (automatic) speech-to-text result, as returned by rev.ai. For ease of reading, I've formatted the transcript as a table, removing the turn-level time marks:

Speaker 0: Now these questions that I'm going to ask you concern certain words and ideas that are very common, everyday expressions, uh, and questions. But what, strangely enough, uh, there is very little agreement about. The first question I wanna ask you is, uh, what is common sense? What does that mean to you?

Speaker 1: Well, I suppose a rationalization of, uh, the, um, uh, situation in which a person finds themselves in whatever it may be.

Speaker 0: Do you think most people,

Speaker 1: If I have a wooden leg, let's say, I must realize that that leg is not as good as the other one and, uh, govern my actions accordingly. And that is just an illustration.

Speaker 0: Do you think most people have common sense?

Speaker 1: Yes, I do. But, uh, it's not, uh, the same as it used to be because, uh, the aspect of the situation in which most people find themselves today has changed drastically in the last 50 years or less. And the same rules that applied, uh, to people of my age group when I was a young fella, do not fit the situation today at all. I used to, for example, uh, look at the harness on a horse to see if it was weak anywhere and, uh, had to be watched. Today they think of carburetors and tires, inflation and whatnot. It's a different proposition. It's, uh, more, uh, technical. It's, uh, less a thing that you can operate with your hands, for example, than uh, something you have to use tools and machinery to handle.

The transcription and diarization are almost perfect in this case. The punctuation is slightly odd — though the principles for punctuating accurately-transcribed spontaneous speech are variable and fluid at best. And the field is just starting to make progress in separating message structure from the effects of the message-composing process, and representing the associated forms of prosodic variation.

Overall, this level of transcription quality is clearly good enough to support various forms of downstream phonetic, phonological, lexical, and morpho-syntactic processing. Detailed timing information is also provided — a .json version of this transcript, pretty-printed via jq but otherwise as Rev supplies it, is here. This gives the estimated start and end time of each transcribed word.
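For readers who want to work with the word-level timing, here is a minimal sketch of how such a file might be read. The field names (`monologues`, `speaker`, `elements`, `type`, `value`, `ts`, `end_ts`) reflect my understanding of Rev's JSON layout and should be checked against the actual file. The sample below is a made-up stand-in: the three words come from the start of the transcript above, but the time values are invented.

```python
import json

# Tiny stand-in for a Rev-style transcript. The field names are assumptions
# about the JSON layout, and the time values are invented for illustration.
SAMPLE = """
{"monologues": [
  {"speaker": 0, "elements": [
    {"type": "text", "value": "Now", "ts": 0.50, "end_ts": 0.71},
    {"type": "punct", "value": " "},
    {"type": "text", "value": "these", "ts": 0.71, "end_ts": 0.95},
    {"type": "punct", "value": " "},
    {"type": "text", "value": "questions", "ts": 0.95, "end_ts": 1.52}
  ]}
]}
"""


def word_timings(transcript):
    """Yield (speaker, word, start, end) for each timed word."""
    for monologue in transcript.get("monologues", []):
        speaker = monologue.get("speaker")
        for element in monologue.get("elements", []):
            # Punctuation and spacing elements carry no timestamps; skip them.
            if element.get("type") != "text":
                continue
            yield speaker, element["value"], element["ts"], element["end_ts"]


def turn_text(monologue):
    """Rebuild the readable text of one speaker turn from its elements."""
    return "".join(e["value"] for e in monologue.get("elements", []))


transcript = json.loads(SAMPLE)
words = list(word_timings(transcript))

# One row per word: speaker, word, start time, end time, duration (seconds).
for speaker, word, start, end in words:
    print(f"Speaker {speaker}\t{word}\t{start:.2f}\t{end:.2f}\t{end - start:.2f}")

print(turn_text(transcript["monologues"][0]))
```

To run this on a real file, replace `json.loads(SAMPLE)` with `json.load(open(path))`, where `path` is wherever the .json transcript was saved. The resulting word-level rows are the kind of input that duration measurements or a later forced-alignment pass could start from.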

The original recording in this case is of remarkably good audio quality. Bill explains the history at the start of Conversations with Strangers:

In the early decades of the twentieth century, various methods of recording speech had been used by dialectologists, and the recordings had become a standard part of the records they kept. And by the early 1950s, American linguists had begun to use the tape recorder to record the Indigenous languages of North America (Voegelin and Harris 1951). Recording conversations, rather than elicited words or phrases, yielded data on speech in its natural context of language use, and my plan was to use this data to analyze the sound system. At that time, the analysis of vowel sounds was done with the Kay Sonograph. I was a student in a course taught at Columbia by Franklin Cooper of Haskins Laboratories in 1962. I asked him whether the Kay Sonograph could be adopted for the study of dialect differences in the field. He said no, the noise level in the field was too high.

Undaunted, I turned to the high-quality Nagra recorder recommended by my film maker friend Murray Lerner, and I found that my Martha’s Vineyard recordings did allow me to study the entire range of speech sounds in the spoken language.

Not all sociolinguistic interviews have this level of sound quality, and transcription performance will of course be somewhat worse at the lower signal-to-noise ratios (SNR) that we will certainly find in some other interview tapes. Still, I think the prognosis is good for using Rev's automatic transcriptions as a starting point, so that the only human processing needed at the pre-publication stage is anonymization and perhaps some light correction.

