Audiobooks as birdsong


Wonkier but more accurate title: "Generating the distribution of audiobook speech segment durations".

In "Finch linguistics" 7/13/2011, I observed that the distribution of birdsong motif repetitions indicates that the underlying process is non-Markovian in a particularly simple way: the probability of adding another motif to a zebra-finch song is not constant, but rather is an exponentially decaying function of the number of previous motif repetitions.
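To make that claim concrete, here is a minimal generative sketch. The function names and the parameter values `p0` and `rate` are hypothetical, chosen only for illustration and not fitted to any data. With `rate=0` the model reduces to a constant continuation probability, which gives a geometric length distribution.

```python
import math
import random

def continuation_prob(k, p0=0.95, rate=0.1):
    """P(add another motif | k motifs produced so far).

    p0 and rate are illustrative assumptions, not fitted values.
    """
    return p0 * math.exp(-rate * k)

def sample_length(rng, p0=0.95, rate=0.1, max_len=10_000):
    """Draw one sequence length (in motifs) from the decaying-continuation model."""
    n = 1  # every song has at least one motif
    while n < max_len and rng.random() < continuation_prob(n, p0, rate):
        n += 1
    return n

rng = random.Random(0)  # fixed seed so the sketch is repeatable
lengths = [sample_length(rng) for _ in range(1000)]
```

Comparing a histogram of `lengths` for `rate=0.1` and `rate=0` shows how the decaying version cuts off the long tail that a constant-probability process would produce.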

And in "Modeling repetitive behavior" 5/15/2015 (and posts linked therein), I suggested that this is likely to be a shared property of several sorts of repetitive behavior, primate as well as avian.

A few days ago, as a result of a conversation with João Sedoc and Tianlin Liu, I decided to apply the same idea to the distribution of speech-segment durations in (a locally re-aligned version of) the LibriSpeech corpus.

LibriSpeech is a large speech corpus representing a small sample of the LibriVox open-access audiobook collection. Our local version has 5831 chapters from 2484 speakers, and a total duration of 1570:39:50.82 (hours:minutes:seconds).

After dividing the speech into pause groups by splitting on between-word silences found by the forced alignment process, I got 2,086,576 speech segments. I then divided each speech segment into 100 millisecond frames, and calculated the empirical probability of continuing from one frame to the next, within a given speech segment, as a function of the time since the start of the segment.
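The calculation just described can be sketched roughly as follows. This is an illustration under stated assumptions, not the program actually used: the function names are hypothetical, the input is assumed to be a plain list of segment durations in seconds, and a segment is assumed to occupy every 100 ms frame it touches (so a 0.25 s segment spans three frames).

```python
import math

FRAME_SEC = 0.1  # frame length from the text

def frames_in_segment(duration_sec, frame_sec=FRAME_SEC):
    """Number of frames a segment spans, rounding up (an assumed convention)."""
    # The small epsilon guards against floating-point noise such as 0.3/0.1 > 3.
    return max(1, math.ceil(duration_sec / frame_sec - 1e-9))

def survival_counts(durations_sec, frame_sec=FRAME_SEC):
    """counts[k] = number of segments that last at least k+1 frames."""
    n_frames = [frames_in_segment(d, frame_sec) for d in durations_sec]
    longest = max(n_frames)
    ending = [0] * (longest + 1)  # ending[n] = segments exactly n frames long
    for n in n_frames:
        ending[n] += 1
    counts = [0] * longest
    running = 0
    for k in range(longest - 1, -1, -1):  # reverse cumulative sum
        running += ending[k + 1]
        counts[k] = running
    return counts

def continuation_probabilities(counts):
    """p[k] = P(segment reaches frame k+1 | it reached frame k)."""
    return [counts[k + 1] / counts[k] for k in range(len(counts) - 1)]

# Toy input: two 0.25 s segments and two 0.5 s segments.
toy_counts = survival_counts([0.25, 0.25, 0.5, 0.5])
toy_probs = continuation_probabilities(toy_counts)
# toy_counts == [4, 4, 4, 2, 2]; toy_probs == [1.0, 1.0, 0.5, 1.0]
```

Building the counts from a histogram of segment lengths keeps the work proportional to the number of segments plus the number of frame positions, which matters with a couple of million segments.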

[Note: I chose 0.1 seconds rather than a smaller unit, and gently smoothed the vector of segment counts by frame position, in order to avoid artefacts of uneven or missing counts as segments get longer. I also started counting with the third unit, partly because speech segments shorter than 0.2 seconds are effectively impossible in this context, and partly because I used a smoothing window of ± 2 frames.]
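One way to read the smoothing in that note is as a centered moving average over five frames. The exact smoother is not specified, so the flat window below is an assumption, and `smooth_counts` is a hypothetical name. Returning values only where the full window fits means the first smoothed value belongs to the third frame, which matches starting the count at the third unit.

```python
def smooth_counts(counts, half_width=2):
    """Centered moving average with a window of +/- half_width frames.

    Only positions with a complete window are returned, so output[0]
    corresponds to input index half_width (the third frame when half_width=2).
    A flat window is an assumption; the original smoother is not specified.
    """
    width = 2 * half_width + 1
    return [
        sum(counts[i - half_width : i + half_width + 1]) / width
        for i in range(half_width, len(counts) - half_width)
    ]

smoothed = smooth_counts([10, 10, 10, 5, 5, 5, 5])
# smoothed == [8.0, 7.0, 6.0]
```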

Plotting the same data on log axes yields the straight line expected of exponential decay, or rather (apparently) two such lines, with an apparent transition in slope a couple of hundred milliseconds after my program started counting — perhaps the duration of a single syllable?
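A straight line on a log scale can be checked numerically with an ordinary least-squares fit of the logarithm of the plotted values against time. The sketch below assumes strictly positive values; `fit_log_linear` is a hypothetical helper, and the synthetic data are made up purely to show that the fit recovers a known decay rate. Fitting the two apparent regimes separately would mean calling it once on each side of the transition.

```python
import math

def fit_log_linear(times, values):
    """Least-squares fit of log(values) = intercept + slope * times.

    Returns (slope, intercept). All values must be strictly positive.
    """
    logs = [math.log(v) for v in values]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
    slope = sxy / sxx
    intercept = mean_l - slope * mean_t
    return slope, intercept

# Synthetic check: values decaying as 0.9 * exp(-0.05 * t) (made-up numbers).
times = [0.2 + 0.1 * k for k in range(20)]
values = [0.9 * math.exp(-0.05 * t) for t in times]
slope, intercept = fit_log_linear(times, values)
# slope is close to -0.05 and intercept is close to log(0.9)
```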

I suspect that a more serious model of the empirical dynamics of speaking (alone or in conversation) would show interesting differences across speaking styles, personalities, interactional roles and attitudes, etc., and might also be helpful in providing priors for speech activity detection, diarization, and other types of automatic analysis.

Update: Dean Foster points out that this is exactly (on a shorter time scale) the Gompertz Law of Mortality.


1 Comment

  1. Avi Rappoport said,

    June 10, 2018 @ 4:47 pm

    I think it's vital to note that Librivox only includes items in the public domain in the US, mainly from before 1923. And most of the works are literary or genre fiction. And the readers seem somewhat self-conscious. So this is a fairly specific set of data, and may not apply to current speech.

    [(myl) All the better — the point is not to find something that is true of all people, contexts, styles, cultures, etc., but rather to find lawful patterns that distinguish people, contexts, styles, cultures, etc.

    I suspect that read speech is mostly like this, though no doubt with varying parameters. Conversation is different, as we'll see.]
