Language Log

Effects of vocal fry on pitch perception

March 5, 2015 @ 11:36 pm · Filed by Mark Liberman under Psychology of language

Earlier today, Jianjing Kuang pointed out to me something interesting and unexpected about the sounds in a LLOG post from last month, "Vocal creak and fry, exemplified", 2/7/2015.

To see what she heard, let's start with a 50 Hz buzz:

Some simple Octave code that generates a buzz of this type is here — it creates a series of impulses spaced 1/50 of a second apart:

Because the impulses are a bit farther apart than the human ear's "flutter fusion threshold", you can hear not only the low-pitched sound but also, to some extent, the sequence of individual impulses. This effect would be even more pronounced if the impulses were just, say, 1/20 of a second apart — you can experience a range of examples from 120 Hz down to 10 Hz in the earlier post.
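The linked Octave script is not reproduced here, so the following is a minimal Python sketch of the same idea rather than the author's code. The 22050 Hz sample rate comes from the sample arithmetic later in the post. The one-second duration, the single-sample unit click (the real buzz uses a short wavelet per impulse), and the name `impulse_train` are assumptions for illustration.

```python
SAMPLE_RATE = 22050  # samples per second, as in the post's arithmetic


def impulse_train(rate_hz=50, duration_s=1.0, sample_rate=SAMPLE_RATE):
    """Return a list of samples that is 0.0 everywhere except for a
    unit click every 1/rate_hz seconds (hypothetical helper)."""
    n_samples = int(round(duration_s * sample_rate))
    period = sample_rate / rate_hz          # 441 samples at 50 Hz
    signal = [0.0] * n_samples
    k = 0
    while round(k * period) < n_samples:
        signal[round(k * period)] = 1.0
        k += 1
    return signal


buzz = impulse_train()
positions = [i for i, s in enumerate(buzz) if s != 0.0]
print(len(positions), positions[:3])
```

With the defaults this places 50 clicks in one second, 441 samples apart.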

Now let's do the same thing, again spacing impulses 1/50 of a second apart, but now offsetting each impulse by a randomly selected value between -35% of a period and +35% of a period:

Some simple Octave code that generates a buzz of this type is here, and the result sounds like this:

This "jitter" in the length of adjacent periods is what is traditionally called "vocal fry", because it sounds like moist food frying in boiling oil, where bubbles of water vapor form and pop at random intervals.
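A minimal Python sketch of the jittered pulse positions, under the same caveat: this is not the linked Octave script. The seed, the count of 1,000 impulses, the one-period shift that keeps positions positive, and the name `jittered_positions` are assumptions for illustration.

```python
import random

SAMPLE_RATE = 22050          # samples per second, as in the post's arithmetic
PERIOD = SAMPLE_RATE // 50   # 441 samples = 1/50 second


def jittered_positions(n_impulses=1000, jitter=0.35, seed=0):
    """Nominal positions (k+1)*PERIOD, each moved by a uniform random
    offset in [-jitter*PERIOD, +jitter*PERIOD], rounded to whole samples.
    Starting at (k+1) keeps the first position positive (hypothetical helper)."""
    rng = random.Random(seed)
    positions = []
    for k in range(n_impulses):
        offset = rng.uniform(-jitter, jitter) * PERIOD
        positions.append(round((k + 1) * PERIOD + offset))
    return positions


pos = jittered_positions()
periods = [b - a for a, b in zip(pos, pos[1:])]
print(min(periods), max(periods))
```

Because the jitter is less than half a period, the impulses never swap order. Only the spacing between neighbours changes.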

What Jianjing noticed is that the buzz with random offsets — the buzz with the "fry" effect — sounds lower in pitch than the plain buzz does. Listen and see if you agree:

Plain buzz:	Buzz with random offsets

My first thought in a case like this is maybe I screwed up and there's a bug in the code.

But I checked. The random number generator seems to be working as instructed — there are about equally many positive and negative offsets, and they're approximately uniformly distributed in the range ±35% of 1/50 of a second, i.e. from -0.35*(1/50) = -0.007 to 0.35*(1/50) = 0.007:

And the resulting distribution of f0 offsets is as we expect:

That is, the shortest possible period should be (1/50)-2*(0.35*(1/50)) seconds = 6 msec., i.e. 166.6667 Hz, corresponding to the case where one impulse arrives 35% of a period too late and the next one arrives 35% of a period too early. That's about 12*log2(166.7/50) = 20.8 semitones higher.

And the longest possible period should be (1/50)+2*(0.35*(1/50)) seconds = 34 msec., i.e. 29.4118 Hz, which is about 12*log2(29.4/50) = -9.19 semitones lower.
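The arithmetic for the two extremes can be checked with a few lines of Python. This is only a worked restatement of the formulas in the text, and the helper name `semitones` is an assumption.

```python
import math

T = 1 / 50          # nominal period in seconds
JITTER = 0.35       # maximum offset as a fraction of T


def semitones(f_hz, ref_hz=50.0):
    """Distance of f_hz from ref_hz in equal-tempered semitones."""
    return 12 * math.log2(f_hz / ref_hz)


shortest = T - 2 * JITTER * T   # late impulse followed by an early one
longest = T + 2 * JITTER * T    # early impulse followed by a late one

for label, period in (("shortest", shortest), ("longest", longest)):
    f0 = 1 / period
    print(f"{label}: {period * 1000:.0f} ms, {f0:.4f} Hz, "
          f"{semitones(f0):+.2f} semitones")
```

This gives about 166.67 Hz (+20.84 semitones) for the 6 msec. period and about 29.41 Hz (-9.19 semitones) for the 34 msec. period.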

So we see exactly the distribution of f0 values that we expect. And the result is that the median f0 (as generated in one random run of 1,000 impulses) is 50.114 Hz, which is only about 0.04 semitones different from the 50 Hz we started with. And there are just about as many periods shorter than 1/50 of a second as longer than 1/50 of a second: again, in one run of 1,000 impulses, we have 493 longer periods, 502 shorter periods, and 4 periods exactly equal to (1/50)*22050 = 441 samples.
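The same sanity check can be sketched in Python. This is not the author's run: the seed is arbitrary, so the counts will differ from the 493/502/4 split reported above, but the median f0 should land close to 50 Hz and the longer and shorter periods should roughly balance.

```python
import random
import statistics

SAMPLE_RATE = 22050
PERIOD = SAMPLE_RATE // 50   # 441 samples

rng = random.Random(1)       # arbitrary seed, only for repeatability
pos = [round((k + 1) * PERIOD + rng.uniform(-0.35, 0.35) * PERIOD)
       for k in range(1000)]
periods = [b - a for a, b in zip(pos, pos[1:])]   # 999 periods

f0 = [SAMPLE_RATE / p for p in periods]           # per-period f0 in Hz
median_f0 = statistics.median(f0)
longer = sum(p > PERIOD for p in periods)
shorter = sum(p < PERIOD for p in periods)
equal = sum(p == PERIOD for p in periods)

print(f"median f0 {median_f0:.3f} Hz; "
      f"{longer} longer, {shorter} shorter, {equal} equal")
```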

But REAPER (David Talkin's new pitch tracker) agrees with Jianjing — analyzing the 50-Hz buzz with random offsets, it finds many more longer than shorter periods:

(Note that the largest possible deviation should be 2*0.35*(1/50) = 14 msec., but there are apparently many deviations longer than this…)

And equivalently, many more lower- than higher-pitched inter-pulse intervals:

Note the large number of measurements an octave down:

So what's going on in perception and in REAPER's analysis? In both cases, there's a bias towards continuity — a tendency to look (well, to listen) for sequential repetition of nearly-periodic intervals. And this bias is leading both algorithms to skip some impulses, so that a sequence of two periods (as generated) is treated as a single period in the perceptual (or computational) analysis.
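To see what skipping an impulse does to the numbers, the Python sketch below merges each pair of adjacent generated periods into a single interval. It is only arithmetic on generated pulse times, with an arbitrary seed. It is not a model of REAPER or of hearing, and it makes no claim about which impulses either one actually skips.

```python
import math
import random
import statistics

T = 1 / 50   # nominal period in seconds

rng = random.Random(2)   # arbitrary seed
times = [k * T + rng.uniform(-0.35, 0.35) * T for k in range(1000)]

# Interval from impulse k to impulse k+2, i.e. what would be measured
# if impulse k+1 were ignored.
merged = [c - a for a, c in zip(times, times[2:])]

median_merged = statistics.median(merged)
shift = 12 * math.log2((1 / median_merged) / 50)
print(f"median merged interval {median_merged * 1000:.1f} ms, "
      f"{shift:+.1f} semitones re 50 Hz")
print(f"range {min(merged) * 1000:.1f} to {max(merged) * 1000:.1f} ms")
```

Merged intervals fall between 26 and 54 msec. and cluster near 40 msec., about an octave below 50 Hz. That would be consistent with both the measurements an octave down and the deviations larger than 14 msec.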

It might be worthwhile to explore this effect in greater detail — how big is the shift in pitch perception, as caused by various amounts of jitter at various basic periods? How much does the wave shape (in time and frequency) matter? Is there a simple computational model that matches the perceptual effects?

But for now, let's just observe that this effect is yet another example of how interesting subjective pitch perception can be. And as Jianjing pointed out to me, it may also provide a perceptual motivation for the fact that low tones are often further differentiated by irregular phonation, independent of the articulatory tendency for vocal-fold oscillation to become chaotic at low frequencies.

22 Comments

Mark Mandel said,

March 6, 2015 @ 12:58 am

Strangely, my first reaction to the random-offset buzz was that it seemed to have an overall higher pitch.
Rubrick said,

March 6, 2015 @ 1:13 am

Very interesting. And I like that you unabashedly describe human pitch perception as an "algorithm". Long live the Church-Turing Thesis!

[(myl) The Wisdom of Crowds…]
Matt said,

March 6, 2015 @ 2:00 am

I guess this suggests that at least some uses of vocal fry could fruitfully be interpreted as an attempt (unconscious or otherwise) to be perceived as slightly lower in pitch, and therefore more authoritative/worthy of respect/etc., with a given vocal apparatus.
David L said,

March 6, 2015 @ 8:54 am

I recall from a lecture many years ago that the spiral-shaped thingummy* in the human ear performs, in effect, a Fourier transform on incoming sound — that is, vibrations at a certain frequency trigger a response at a certain place along the spiral.

So I wonder what the FT of your randomized buzz would be. Does it show a preponderance of lower frequency power?

[(myl) No. Spectral slice of the regular buzz:

Ditto of the buzz with random offsets:

The main difference is the expected one, namely that the overtone structure is disordered.]

*the cochlea, duh
D. Sky Onosson said,

March 6, 2015 @ 9:16 am

I actually had the opposite experience, with the randomized version sounding higher-pitched to me, if anything. I was only listening on my phone's speaker (it's an HTC One with very decent speakers for a phone, but still), so I'll have to repeat this later on better equipment.
Jon said,

March 6, 2015 @ 9:57 am

Like others above, it seemed to me that the 'fry' version was possibly higher pitched, definitely not lower.
And the artificial version doesn't sound like frying, but the crackle from a radiation monitor. Unlike with food frying, all of the blips from a monitor are identical, though randomly spaced. But that's just my background.
Mr Punch said,

March 6, 2015 @ 11:27 am

I definitely heard the "fry" version as lower. And I agree with Matt – it seems plausible to me that at least some vocal fry may be attributable to an effort to lower the voice, particularly among women and especially younger women. This would be the opposite of what I think of as "the Julia Child effect," in which a woman with a naturally deep voice (I'm making an assumption about Child, who was quite large) tried (mostly past tense, I think) to sound more feminine or even girlish.
Eric P Smith said,

March 6, 2015 @ 12:20 pm

I find it interesting that the ear hears any pitch at all, even on the regular 50Hz buzz. Each wavelet is neither predominantly above the x-axis nor predominantly below it: its integral over time is 0. Therefore the wavelets, even though they are regularly spaced at 50 per second, do not give rise to a frequency component of 50Hz on any linear analysis such as a Fourier analysis.

[(myl) Yes they do! Here's a close-up of the low-frequency end of the log amplitude spectrum of 18 periods of the buzz — it's got the expected overtones at 50 Hz, 100 Hz, etc.:

]

Each wavelet is centred on 1800Hz. The Fourier analysis of the overall waveform is constant below 1800Hz (in terms of energy per Hz) and it tails off rapidly above that.

That the ear hears a clear pitch shows that its algorithm is not linear, which has been known for a long time and is the source of difference and summation tones.

On the buzz with random offsets, I don't hear any pitch to speak of.
Haamu said,

March 6, 2015 @ 1:57 pm

If the intervals that are 0.013 seconds long are equal in number to the intervals that are 0.027 seconds long (etc.), then we're spending a lot more time during this clip listening to longer intervals than shorter ones. Couldn't that be a simple explanation for any bias toward hearing a lower pitch?

Also: If at 50 Hz we're flittering around the flutter fusion threshold, then some of the randomized intervals will be above that threshold and some below it, which suggests another hypothesis. So, is the "lower-bias" effect the same at 100 Hz?

Finally: I started out with a lower bias, but then I played the two clips simultaneously, and convinced myself I could hear both the higher and lower elements of the random clip, with the uniform buzz pitch in the middle. Since then, the random clip has become a sort of Necker cube for me, and I can (or am telling myself I can) hear it higher or lower or in the middle, depending on what I choose to focus on.

[(myl) Interesting!]
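Haamu's first question, whether longer intervals dominate simply because they last longer, can be put into numbers with a short Python sketch. It compares the ordinary mean period, where each interval counts once, with a time-weighted mean, where each interval counts in proportion to its duration. The seed and the choice of these two averages are assumptions for illustration, and nothing here shows that listeners actually weight intervals this way.

```python
import random

T = 1 / 50   # nominal period in seconds

rng = random.Random(3)   # arbitrary seed
times = [k * T + rng.uniform(-0.35, 0.35) * T for k in range(1000)]
periods = [b - a for a, b in zip(times, times[1:])]

# Each period counts once.
mean_by_count = sum(periods) / len(periods)

# Each period counts in proportion to how long it lasts.
mean_by_time = sum(p * p for p in periods) / sum(periods)

print(f"per-period mean: {mean_by_count * 1000:.2f} ms "
      f"({1 / mean_by_count:.1f} Hz)")
print(f"time-weighted mean: {mean_by_time * 1000:.2f} ms "
      f"({1 / mean_by_time:.1f} Hz)")
```

For offsets uniform within ±35% of a period, the time-weighted mean period is about 8% longer than the nominal one in expectation. That puts it near 46 Hz, a little over one semitone below 50 Hz.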
Walter Underwood said,

March 6, 2015 @ 5:28 pm

We hear pitch differences logarithmically, so varying an equal number of Hz up and down is a larger logarithmic difference downward than upward. Interpreted that way, the "center" of the tone has dropped.

Try it with equal multipliers up and down, rather than adding and subtracting the frequency.

[(myl) (1) I added and subtracted amounts to the pulse locations in the time domain, rather than working in the frequency domain; (2) the results actually favored shorter periods (= higher pitches) slightly:

1/(1.35*.02) = 37.03704 Hz # lowest pitch resulting from one offset
1/(0.65*.02) = 76.92308 Hz # highest pitch resulting from one offset
12*log2(37.03704/50) = -5.195512 semitones lower
12*log2(76.92308/50) = 7.457861 semitones higher

Still, the median period was very close to 0.02 seconds (frequency of 50 Hz), and the average period was actually on the short side (i.e. higher rather than lower frequency).
]
D.O. said,

March 6, 2015 @ 6:42 pm

There must be some correlation between pitch and perceived loudness, but my 2 minutes of googling were not enough to figure out how large the effect is.
Eric P Smith said,

March 6, 2015 @ 8:42 pm

@myl: You're right. Thank you for taking the time to research this and correct me. Contrary to my first impression, each wavelet is predominantly above the x-axis. When I have time (perhaps Monday) I'll construct a wav file consisting of a 50-per-second repetition of a short wavelet with a zero integral and have a listen to it.

[(myl) I'm not sure why you think a zero integral is relevant. A sine wave (which has an integral of zero, or a sum of zero in the discrete case, for integer numbers of periods), has the Fourier transform of an impulse at its frequency; and a complex wave made up of sums of harmonically-related sinusoids, which also will have an integral of zero for suitably chosen intervals, will have a Fourier transform representing the associated "overtone series", which is just a representation of the amplitude and phase of the sinusoids in its frequency-dimension representation. Thus this Octave code yields this (summing to zero) waveform:

and this amplitude spectrum:

]
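The point about zero-sum waveforms can be checked numerically. The linked Octave code is not shown here, so this is a separate Python sketch. The sample rate, the three harmonics with their amplitudes, and the helper `dft_magnitude` are all assumptions chosen to keep the example small.

```python
import cmath
import math

SR = 1000            # samples per second (chosen to keep the DFT small)
N = 200              # 0.2 s = exactly 10 periods of 50 Hz
HARMONICS = {50: 1.0, 100: 0.5, 150: 0.25}   # Hz -> amplitude (arbitrary)

wave = [sum(a * math.sin(2 * math.pi * f * n / SR)
            for f, a in HARMONICS.items())
        for n in range(N)]


def dft_magnitude(x, k):
    """Magnitude of DFT bin k, computed directly from the definition."""
    n_pts = len(x)
    return abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
                   for n in range(n_pts)))


bin_hz = SR / N      # 5 Hz per bin
peaks = [round(k * bin_hz) for k in range(N // 2)
         if dft_magnitude(wave, k) > 1.0]

print(f"sum of samples: {sum(wave):.2e}")
print("frequencies with energy (Hz):", peaks)
```

The samples sum to zero up to rounding error, yet the spectrum still has components at 50, 100 and 150 Hz.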
Eric P Smith said,

March 6, 2015 @ 9:30 pm

@myl: Thanks. I'm away Saturday and Sunday. I'll reply as soon as I get back.
D.O. said,

March 6, 2015 @ 11:11 pm

Ok. Now that I've spent more than 2 minutes on this, I've learned that there is something called equal-loudness contours. At low loudness (20 phon) 40Hz corresponds to 70dB and 60Hz to 60dB. Roughly we have -0.5dB/Hz. If random-offset signal creates roughly ±5 semitones distribution, it gives us about 30Hz difference. This should give effective 9Hz shift to lower frequencies, which is about -3.5 semitones. Is it far from the center of the reaper spectrum?
D.O. said,

March 7, 2015 @ 12:32 am

Oops. It seems that I've made some stupid arithmetic mistake. Let's try again. There are 7 semitones from 40 to 60 Hz. The drop of 10dB means that the rate is 10/7=1.4dB/semitone. If we convert it to natural base 1.4dB/semitone=1.4*0.05*ln(10)=0.16/semitone. Thus, for the width of ±5 semitones one has a perceived shift of the center of the distribution of about 5*0.16=0.8 semitone down.
D.O. said,

March 7, 2015 @ 1:44 am

My, it's embarrassing. Let's hope third time's a charm. Of course, the last calculation should be 0.16*5^2= 4 semitones down. Which seems to be a lot. Let's try high loudness level. At 100 phon 40 to 60Hz contour drops by about 6dB, which will translate to difference of 2.4 semitones.

Anyways, it's not a Gaussian distribution and ±5 semitones is just a guess and because it is squaring…
D.O. said,

March 7, 2015 @ 2:32 am

Now, I've generated pulses with Prof. Liberman's TestDeviations35.m and then

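% SR (sample rate) and Periods (inter-pulse intervals in samples) are
% taken to be the variables left in the workspace by TestDeviations35.m
f0 = SR./Periods;                       % per-period f0 in Hz
semitones = 12*log(f0/50)/log(2);       % offset from 50 Hz in semitones
edges = -9:20;                          % 1-semitone histogram bins
distribution = histc(semitones, edges)';
a = -0.1; % drop in perceived loudness (natural log of pressure) per semitone
PerceivedDistribution = distribution.*exp(a*edges);   % loudness-weighted counts
figure; bar(edges, PerceivedDistribution); % sorry, cannot post the figure
% loudness-weighted mean offset in semitones
disp(mean(PerceivedDistribution.*edges)/mean(PerceivedDistribution));
-2.0436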
kktkkr said,

March 7, 2015 @ 4:07 am

I had a variation on @Haamu's approach by playing the buzz with fry over itself (by clicking play on both embedded audio elements). The combined buzz sounds closer to the original 50Hz pitch. But then I went on and played 3 and 4 copies simultaneously by opening the page in another window. However, I do not hear any progression in pitch, but rather it just sounds like noisier and noisier versions of a 50Hz pitch. From consideration of the pulse intervals I would expect stronger overtones depending on the timing of the pulses, exactly as in the regular-spaced case. My guess is that in this case the louder overtones improve perception of the original frequency.
Peter said,

March 8, 2015 @ 6:36 am

Like others, I heard the buzz-with-offsets as a little higher than the regular buzz — specifically, about a minor third higher, or slightly more, so about 3–3.5 semitones. I have a musician’s aural training, not a linguist’s, in case that makes any difference.
David Talkin said,

March 8, 2015 @ 8:29 pm

REAPER has no underlying model of human auditory perception. IMHO, if it is not retrieving the F0 as generated, it is simply making a mistake! Now, I can whine about it not being designed to work on pulse trains, but I won't.
Eric P Smith said,

March 9, 2015 @ 9:20 pm

Sorry for the delay in clarifying what I said before.

The sound buzz50 consists of a short wavelet repeated at intervals of 1/50 second (20 milliseconds). As Mark pointed out, its frequency analysis shows prominent components at 50Hz and multiples of 50Hz.

We may suppose without loss of generality that both the wavelet and the overall sound buzz50 are centred on time 0. Then the component of buzz50 at frequency 50Hz is F(50) = ∫wavelet(t)·cos(2π·50·t)dt. Now wavelet's effective support (the time interval outside which it is effectively zero) is about 3ms, which is small compared with the 20ms buzz period. So cos(2π·50·t) hardly deviates from a value of cos(0) = 1 on wavelet's effective support. So, to a first approximation, F(50) = ∫wavelet(t)dt. To that approximation, the 50Hz component is zero if and only if the integral is zero.

I cannot illustrate that in this comment, because Language Log's comment system does not support the <audio> and <img> elements. So I have put an illustration on my own website here, to which you can navigate if you are interested. I do not guarantee that it will remain at the same web address for ever, but I will make sure it is there for at least a year.
Jake Nelson said,

March 10, 2015 @ 12:09 am

I definitely hear the "frying" sound as lower pitched, which is interesting (in most cases of perceptual illusion, my perception flips back and forth, sometimes rapidly. I see both the duck and the bunny, the dress was white and gold until it was blue and black, etc). I wonder how this relates to the fact that I can't perceive "vocal fry" at all- in many of the posts on it here, there's comparisons of "with" and "without", and I can't detect a difference. Like, you could post any two random clips and say one is "with" and the other is "without", and I'd have to take you at your word, because I have no way to tell. I just have to assume everyone commenting is actually perceiving some difference. I assume this is what being color-blind must be like.