
Listen to this sound, and describe it in the comments below:

[Audio clip]

You can learn what the sound is, and why I care how you hear it, after the fold.



I started with 15 seconds of sound, whose fundamental frequency oscillates up and down with a period of one second, in two versions an octave apart:

200-240 Hz: [audio clip]

100-120 Hz: [audio clip]
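For readers who want to try this themselves, here is a minimal sketch of one way such a pair of signals could be generated. It is not the code used for the clips in this post. The function name `wobbling_tone`, the 8 kHz sample rate, the 1/k overtone weighting, and the trick of halving the phase to get the lower octave are all assumptions made for illustration.

```python
import math

def wobbling_tone(f_lo, f_hi, dur=15.0, sr=8000, mod_period=1.0, octave_down=False):
    """Harmonic tone whose F0 swings sinusoidally between f_lo and f_hi (Hz).

    With octave_down=True the phase is halved, so the same contour is
    produced an octave lower (f_lo/2 to f_hi/2), with each long period
    spanning exactly two periods of the original.
    """
    mid, dev = (f_lo + f_hi) / 2.0, (f_hi - f_lo) / 2.0
    top = f_hi / 2.0 if octave_down else f_hi
    # Keep all overtones comfortably below the Nyquist frequency.
    n_harm = max(1, int(0.45 * sr / top))
    phase, out = 0.0, []
    for n in range(round(dur * sr)):
        f0 = mid + dev * math.sin(2.0 * math.pi * (n / sr) / mod_period)
        phi = phase / 2.0 if octave_down else phase
        # Overtone k gets amplitude 1/k (an assumed, sawtooth-like spectrum).
        out.append(sum(math.sin(k * phi) / k for k in range(1, n_harm + 1)))
        # Integrate the instantaneous frequency to get the running phase.
        phase += 2.0 * math.pi * f0 / sr
    return out
```

Under these assumptions, `wobbling_tone(200, 240)` would give the higher version and `wobbling_tone(200, 240, octave_down=True)` the lower one. Deriving both from the same running phase keeps the two versions period-aligned, which makes a later switch between them look like period-doubling rather than a jump to an unrelated source.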

I then divided the 15 seconds into random chunks, and chose either the higher-frequency version or the lower-frequency version for each chunk, with probability 0.8 for the higher-frequency version. The switch between versions was done smoothly, to avoid acoustic discontinuities.
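The chunking and smooth switching can be sketched as a per-sample mixing weight that is applied to the two versions. The details here are guesses rather than the recipe actually used: the exponentially distributed chunk lengths with a 0.5-second mean, the 10 ms raised-cosine ramp, and the names `switch_weights` and `crossfade` are all hypothetical.

```python
import math
import random

def switch_weights(n, sr, p_high=0.8, mean_chunk=0.5, ramp=0.01, seed=0):
    """Per-sample weight in [0, 1]: 1 = higher version, 0 = lower version."""
    rng = random.Random(seed)
    targets = []
    while len(targets) < n:
        # Assumed chunk-length distribution: exponential with the given mean (s).
        length = max(1, round(rng.expovariate(1.0 / mean_chunk) * sr))
        choice = 1.0 if rng.random() < p_high else 0.0
        targets.extend([choice] * length)
    targets = targets[:n]

    # Move a linear position u toward each target, then shape it with a
    # raised cosine so the weight has no abrupt jumps.
    step = 1.0 / max(1, round(ramp * sr))
    u, weights = targets[0], []
    for t in targets:
        if t > u:
            u = min(1.0, u + step)
        elif t < u:
            u = max(0.0, u - step)
        weights.append(0.5 - 0.5 * math.cos(math.pi * u))
    return weights

def crossfade(high, low, weights):
    """Mix two equal-length signals sample by sample in the output domain."""
    return [w * h + (1.0 - w) * l for h, l, w in zip(high, low, weights)]
```

Given two equal-length sample lists `high` and `low`, `crossfade(high, low, switch_weights(len(high), 8000))` would produce the switched signal. Keeping the weights separate from the audio also means the ground-truth chunk labels are available later for checking an analysis method.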

Here's what a sample looks like in the time and frequency domains — you can see the period-doubling in the waveform (the upper panel), and the corresponding frequency-halving in the spectrogram (the lower panel):

What's the point?

The phenomenon of "vocal creak" involves a similar abrupt period-doubling or halving — for fundamental reasons discussed here — which we mostly perceive as a change in voice quality rather than a factor-of-two change in pitch.

There are various reasons to want estimates of a speaker's amount (and type and location) of creak. And we also often want estimates of a speaker's "pitch range", although straightforward distribution-based measures will be strongly influenced by that speaker's amount of period-doubling and halving.
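To make the size of that effect concrete, here is a toy calculation on a made-up F0 track, not on real speech. The 5th-to-95th percentile span in semitones stands in for a "straightforward distribution-based measure", and the independent 20% per-frame halving is an assumption chosen only to mirror the numbers used in this post.

```python
import math
import random

def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between sorted values."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    i = int(pos)
    j = min(i + 1, len(s) - 1)
    return s[i] + (s[j] - s[i]) * (pos - i)

def range_semitones(f0, lo_q=5, hi_q=95):
    """Distance between two percentiles of an F0 track, in semitones."""
    return 12.0 * math.log2(percentile(f0, hi_q) / percentile(f0, lo_q))

rng = random.Random(0)
# Synthetic F0 track (Hz): 200-240 Hz, one modulation cycle per 100 frames.
clean = [220 + 20 * math.sin(2 * math.pi * n / 100) for n in range(1500)]
# Same track with about 20% of frames reported an octave too low.
creaky = [f / 2 if rng.random() < 0.2 else f for f in clean]

print(round(range_semitones(clean), 2), round(range_semitones(creaky), 2))
```

On this toy track the measured range grows from roughly three semitones to more than an octave, even though the underlying contour is unchanged.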

There are ideas Out There about how to calculate a "creak index", and how to estimate pitch range in a creak-resistant way — see "The great creak-off of 1969", 7/28/2015, for a simple example — but we don't really know how to validate such measures.

One approach is to start with fake data, like the signal at the top of this post, where we know exactly what the control parameters were. (And of course we could make such signals in various ways that sound more speech-like and/or more pleasant.)

If an analysis method can't recover the underlying parameters from such signals, it probably won't work on real speech. If a method works with a variety of synthetic signals, then we can be more interested in its performance on real-world data.
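A bare-bones version of that validation loop might look like the following. Both the estimator and the fake F0 tracks are illustrative assumptions. The estimator flags any frame more than half an octave below the track median as period-doubled, which is a deliberately naive rule and not a proposed creak index. The tracks halve frames independently rather than in chunks.

```python
import math
import random
import statistics

def estimate_high_fraction(f0):
    """Share of frames NOT flagged as octave-down relative to the track median."""
    # Cutoff half an octave below the median (assumes most frames are un-halved).
    cutoff = statistics.median(f0) / math.sqrt(2.0)
    return sum(f >= cutoff for f in f0) / len(f0)

def fake_track(p_high, n=3000, seed=0):
    """Synthetic F0 track (Hz) where each frame is halved with prob. 1 - p_high."""
    rng = random.Random(seed)
    base = [220 + 20 * math.sin(2 * math.pi * i / 100) for i in range(n)]
    return [f if rng.random() < p_high else f / 2 for f in base]

# Compare the known control parameter with what the estimator recovers.
for p in (0.9, 0.8, 0.6):
    print(p, round(estimate_high_fraction(fake_track(p)), 3))
```

This particular rule can only work when the un-halved frames are in the majority, since otherwise the median itself lands in the lower octave. That is the kind of failure that synthetic data with known parameters makes easy to expose.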

The particular fake data at the start of this post is not very speech-like: the overtone amplitudes are uniformly 1/F rather than modulated by vocal-tract resonances; the period-doubling happens at random times rather than preferentially in lower-pitched regions; the period-doubling transition is handled by mixing in the output domain rather than by modeling a voice-source generation process; the sinusoidal modulation of pitch creates a distribution of values that's very different from what we typically see in speech.

But to my ear, the result meets one basic condition: it doesn't sound like switching between two different sources (though that's what it is), but rather like a single source varying in timbre (or what we would call "voice quality" if it were a human voice). I'm curious to learn whether other listeners agree.

There's also an acoustic perception angle, connected to the Shepard-Risset glissando illusion — but that's a topic for another day.
