I recently became interested in patterns of speech and silence. People divide their discourse into phrases for many reasons: syntax, meaning, rhetoric; thinking about what to say next; running out of breath. But for current purposes, we're ignoring the content of what's said, and we're also ignoring the process of saying it. We're even ignoring the language being spoken. All we're looking at is the partition of the stream of talk into speech segments and silence segments.
Well, suppose the following things were true:
- Accurate automatic speech/silence partition of audio recordings is possible.
- The distributions of the resulting segment durations, and of sequences of these segment durations, are lawful and are well characterized by simple models with few parameters.
- Factors such as fluency, speaking style, physiological state, etc., affect the rhythms of speech and silence in ways that affect these parameters.
- As a result, automatically-determined parameter estimates can be used to quantify these factors, at least under controlled conditions, and perhaps in combination with other sorts of measures.
In fact, I believe that all four of those things are indeed true. This morning, I'll provide some evidence bearing on points (1) and (2). [If you're not interested in speech production or speech technology, or not tolerant of modest doses of exploratory data analysis, you might want to turn to some of our other fine posts...]
"Speech activity detection" is by no means a new idea, but as in other areas of speech technology, the performance of speech activity detectors (SADs) has gradually improved. Neville Ryant has recently built a series of speech activity detectors that work very well indeed — his most recent effort seems to be a significant advance on the state of the art, in terms of noise immunity and robustness across recording conditions. (In due course, he'll describe it in a conference publication and release an implementation as open-source software.)
But for clean speech, more conventional approaches work quite well. For this morning's little experiment, I've used a system that Neville built a few months ago, which has been widely used for internal tasks at the Linguistic Data Consortium. This system uses a "hidden Markov model" for broad phonetic classes, based on a conventional set of acoustic features calculated 100 times a second; the results are merged into speech and nonspeech (here, silence) regions, subject to constraints on minimum region lengths. It was trained on a published corpus of English conversational speech for which hand segmentation is available.
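To make the "merged into regions, subject to minimum lengths" step concrete, here is a minimal sketch. It is not Neville's system: it assumes we already have one speech/nonspeech decision per frame, and it enforces the minimum duration by a simple post-hoc rule (absorb the shortest run into its neighbours), where the real system applies its constraints inside the HMM. The function name `frames_to_segments` and the toy input are made up for illustration.

```python
from itertools import groupby

FRAME_RATE = 100  # frames per second, matching the feature rate described above


def frames_to_segments(labels, min_dur=0.1, frame_rate=FRAME_RATE):
    """Turn per-frame labels (truthy = speech) into (start_s, end_s, is_speech) tuples.

    Runs shorter than min_dur seconds are absorbed into their neighbours,
    shortest first. This is a crude stand-in for HMM duration constraints.
    """
    min_frames = round(min_dur * frame_rate)
    # Run-length encode the frame labels as [is_speech, n_frames].
    runs = [[lab, len(list(grp))] for lab, grp in groupby(labels, key=bool)]

    while len(runs) > 1:
        i = min(range(len(runs)), key=lambda j: runs[j][1])
        if runs[i][1] >= min_frames:
            break  # every run is long enough
        # Flip the shortest run so it matches its neighbours, then re-merge.
        runs[i][0] = not runs[i][0]
        merged = [runs[0]]
        for lab, n in runs[1:]:
            if lab == merged[-1][0]:
                merged[-1][1] += n
            else:
                merged.append([lab, n])
        runs = merged

    # Convert frame counts to times in seconds.
    segments, t = [], 0
    for lab, n in runs:
        segments.append((t / frame_rate, (t + n) / frame_rate, lab))
        t += n
    return segments


# Toy input: a 30 ms gap inside a stretch of speech is too short to survive.
toy = [False] * 20 + [True] * 50 + [False] * 3 + [True] * 40 + [False] * 30
for seg in frames_to_segments(toy):
    print(seg)
```

With a 100 msec minimum, the three-frame gap in the toy input is absorbed into the surrounding speech, leaving one speech segment flanked by two silences.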
I've configured it for minimum (speech and nonspeech) segment durations of 100 msec, and applied it to a variety of collections of recorded speech, with excellent results. Here's a few seconds from the start of one such collection:
This happens to come from the start of the dedication and introduction to the Librivox reading of Amor de Perdição (1862) by Camilo Castelo Branco, which I've been looking at as part of an effort to learn something about the phonetics and phonology of Portuguese. If you want to know what it sounds like, a phrase-by-phrase presentation of the Dedicatória is here.
Today, we only care about the durations of the speech and silence segments in the Introduction and the first 12 chapters (all that's now available) — a total of about 3.25 hours of audio, comprising 5614 speech segments and (because I left out the leading and trailing silences in each recording) 5601 silence segments.
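The bookkeeping behind those counts can be sketched as follows. The sketch assumes each recording has already been reduced to a list of `(start_s, end_s, is_speech)` tuples; that format, the function name `collect_durations`, and the toy recordings are illustrative assumptions, not the actual pipeline. Once leading and trailing silences are dropped, every recording contributes exactly one more speech segment than silence segment. The difference of 5614 − 5601 = 13 is what we'd expect if the Introduction and the 12 chapters were 13 separate recordings.

```python
def collect_durations(recordings):
    """Pool segment durations across recordings.

    recordings: list of segment lists; each segment is (start_s, end_s, is_speech).
    Returns (speech_durations, silence_durations), skipping any silence at
    the very start or very end of a recording.
    """
    speech, silence = [], []
    for segs in recordings:
        speech_idx = [i for i, s in enumerate(segs) if s[2]]
        if not speech_idx:
            continue  # a recording with no speech contributes nothing
        first, last = speech_idx[0], speech_idx[-1]
        for start, end, is_speech in segs[first:last + 1]:
            (speech if is_speech else silence).append(end - start)
    return speech, silence


# Two made-up recordings, purely for illustration.
rec1 = [(0.0, 0.5, False), (0.5, 2.0, True), (2.0, 2.4, False),
        (2.4, 4.0, True), (4.0, 4.3, False)]
rec2 = [(0.0, 1.0, True), (1.0, 1.2, False), (1.2, 3.0, True),
        (3.0, 3.5, False), (3.5, 5.0, True)]

speech, silence = collect_durations([rec1, rec2])
print(len(speech), len(silence))
```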
Here's a histogram of the durations of the speech segments:
And a histogram of the durations of the silence segments:
How should we characterize these distributions? For distributions of durations, the obvious place to start is the gamma distribution, characterized by a shape parameter k and a scale parameter θ:
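For reference, the textbook form of the gamma density in the shape/scale parameterization is given below. This is the standard formula, written out here rather than recovered from the original figure:

```latex
f(x;\,k,\theta) \;=\; \frac{x^{\,k-1}\, e^{-x/\theta}}{\Gamma(k)\,\theta^{k}},
\qquad x > 0,\quad k > 0,\quad \theta > 0
```

Under this parameterization the mean is $k\theta$ and the variance is $k\theta^{2}$.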
Given a sample of numbers that might have come from a gamma distribution, we can estimate the shape and scale parameters, and plot the resulting approximation to the empirical distribution. Here's how it works (quite well) for the speech segments in this case:
And for the silence segments:
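As a rough illustration of how such estimates might be obtained, here is a sketch using the method of moments, which follows directly from mean = kθ and variance = kθ². This is not necessarily the estimator behind the fits above, since the post doesn't say which was used, and a maximum-likelihood fit would generally be preferable. The durations below are synthetic draws from a known gamma distribution, not the Librivox data; the function names are made up for this example.

```python
import math
import random
import statistics


def fit_gamma_moments(durations):
    """Method-of-moments estimates (k, theta) for a gamma distribution.

    Solves mean = k * theta and variance = k * theta**2 for k and theta.
    """
    mean = statistics.fmean(durations)
    var = statistics.variance(durations)  # sample variance
    theta = var / mean
    k = mean / theta
    return k, theta


def gamma_pdf(x, k, theta):
    """Gamma density at x > 0, e.g. for overlaying a curve on a histogram."""
    return x ** (k - 1) * math.exp(-x / theta) / (math.gamma(k) * theta ** k)


# Synthetic "durations" with known shape 2.0 and scale 0.5 (illustration only).
rng = random.Random(0)
fake_durations = [rng.gammavariate(2.0, 0.5) for _ in range(20000)]

k_hat, theta_hat = fit_gamma_moments(fake_durations)
print(f"k = {k_hat:.3f}, theta = {theta_hat:.3f}")
```

With a sample of this size the estimates should land close to the generating values. On real durations, comparing `gamma_pdf` against a density-normalized histogram gives the same kind of visual check as the plots here.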
To the extent that there are problems with fit, it's because we're actually looking at a mixture of different cases. Given a minimum segment duration of 0.1 seconds, some of the silent segments are actually within-phrase stop gaps and so forth, while the others are silent pauses between phrases — and similarly, some of the speech segments are cut up by such boundaries. We can see the signature of this in a histogram of silence-segment durations at a finer time scale:
The default settings for the SAD that I used require a minimum duration of 300 msec for nonspeech segments and 500 msec for speech segments — this gives good results when the goal is to divide the input into convenient breath-group-like phrases, e.g. for subsequent transcription. I set the thresholds lower because I wanted to see the shorter-duration end of the distributions as well.
For this data, the approximate dividing-point between the within-phrase and between-phrase silences is about 180 msec — but a better approach would be to divide silent-segment candidates into two categories based on a richer set of properties than mere duration.
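The duration-only version of that split is trivial to write down. The sketch below uses the 180 msec figure as a default threshold. The function name `split_silences` and the sample durations are invented for illustration, and the boundary value is specific to this data set.

```python
def split_silences(silence_durations, threshold=0.18):
    """Split silence durations (seconds) into within-phrase and between-phrase sets.

    Durations below the threshold are treated as within-phrase gaps
    (stop closures and the like); the rest as between-phrase pauses.
    """
    within = [d for d in silence_durations if d < threshold]
    between = [d for d in silence_durations if d >= threshold]
    return within, between


# Made-up durations in seconds, not taken from the recordings.
sample = [0.10, 0.12, 0.15, 0.31, 0.45, 0.62, 0.11, 0.88]
within, between = split_silences(sample)
print(len(within), "within-phrase;", len(between), "between-phrase")
```

A classifier using more than duration would replace the single comparison with a decision based on several properties of each candidate segment.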
Anyhow, I believe that I've supported the plausibility of points (1) and (2) above — leaving for another day the distribution of segment sequences, as well as the issues raised in points (3) and (4).