That said, I have an uneasy feeling about your heuristic, which relies on ad hoc limits (-7/+10 semitones or such). I also wonder if it would be effective with languages which use creaky voice.

(Then there's the matter of Yma Sumac's singing. The lower end of her seven-octave range is exuberantly creaky, at least in what I have heard.)

[(myl) If not 40 Hz, for some males with low-pitched voices. But for females with higher-pitched voices, the low end might be 120 or 150 or 180. The fact is that you need a custom range for each speaker, and even for each speaker in each context and mode of interaction — at least if you want to characterize parameters of the modal distribution rather than some quasi-artifactual mixture. ]

[(myl) What f0 limits did you use in WaveSurfer? I believe that the default is 60 to 400 Hz — and 60 is pretty much the lower limit of Chomsky's modal range (which my method estimated as 64 to 172 Hz), excluding nearly all of the lower-octave "creak" regions in his case. 60 to 400 Hz is OK for some adult male voices (which is why it's the default), but won't help in general with female voices, whose lower-octave distributions are generally well above 60 Hz, e.g. speaker FDFB0 from TIMIT:

About 8.6% of her f0 values are in the lower-octave distribution, which is enough to significantly skew the bottom end of a quantile-based pitch-range calculation. Or TIMIT speaker FEDW0, whose f0 values are also all above 60 Hz, with 15.8% of them in the lower-octave distribution:

]

How do we decide, in principle, which frequency is a "base tone" and which one is a period-doubled modification of it? For the base frequency, the harmonics go in the 1, 2, 3, 4, … progression (so we don't actually need to observe 1; we can reconstruct it). If there is a period doubling, the harmonics will be 1/2, 1, 3/2, 2, 5/2, etc. So how can we tell them apart? The obvious answer is intensity. If the intensity in the integer harmonics is sufficiently larger than the intensity in the half-integer ones, then we are dealing with a period-doubling event; if not, the "doubled period" has become the dominant one and the true base frequency is 1/2. In that sense, true pitch depends on timbre.

So here is my suggestion. Is it possible to track intensities in the harmonics of the apparent f0 of Mr. Chomsky's speech and make a graph of the sum of even vs. odd (squared) intensities? It may tell us when we have an accidental undertone vs. when the tone truly shifts an octave down. Maybe it's not better than the hard cut-off suggested by Prof. Liberman, but it might be more interesting. And, of course, it might not work at all.
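For concreteness, here is a minimal Python sketch of the even-vs-odd comparison suggested above. It is only an illustration under stated assumptions: the frame is synthetic rather than real speech, the apparent f0 is taken as given, and the helper names (`harmonic_power`, `even_odd_energy`) are hypothetical, not part of any existing tool.

```python
import cmath
import math

def harmonic_power(frame, sample_rate, freq):
    """Squared amplitude of the component at `freq`, found by projecting
    the frame onto a complex exponential at that frequency."""
    n = len(frame)
    acc = sum(x * cmath.exp(-2j * math.pi * freq * i / sample_rate)
              for i, x in enumerate(frame))
    return (2 * abs(acc) / n) ** 2

def even_odd_energy(frame, sample_rate, apparent_f0, n_harmonics=6):
    """Sum squared amplitudes over even and odd multiples of apparent_f0."""
    even = odd = 0.0
    for k in range(1, n_harmonics + 1):
        power = harmonic_power(frame, sample_rate, k * apparent_f0)
        if k % 2 == 0:
            even += power
        else:
            odd += power
    return even, odd

# Synthetic frame (an assumption, not measured data): strong harmonics of
# 200 Hz plus weak subharmonic components at 100, 300, and 500 Hz.
SAMPLE_RATE = 8000
frame = []
for i in range(800):  # 0.1 s, a whole number of 100 Hz periods
    t = i / SAMPLE_RATE
    strong = sum(math.sin(2 * math.pi * f * t) for f in (200, 400, 600))
    weak = sum(0.1 * math.sin(2 * math.pi * f * t) for f in (100, 300, 500))
    frame.append(strong + weak)

# Treat 100 Hz as the apparent f0: its even multiples are the integer
# harmonics of 200 Hz, its odd multiples are the half-integer ones.
even, odd = even_odd_energy(frame, SAMPLE_RATE, apparent_f0=100)
ratio = even / odd
print(f"even/odd energy ratio: {ratio:.1f}")
```

A large ratio here would be read as "weak undertone on top of a 200 Hz voice"; a ratio near 1 would suggest the lower octave has taken over. Where to put the threshold is exactly the open question.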

[(myl) In a word, no. Go ahead and try it, and I think you'll find that such approaches can produce interesting results but don't even start to solve the problem. Period-doubling in the time domain is associated with frequency-halving in the frequency domain: in the time domain, we see alternate periods becoming gradually (or abruptly) more similar to each other than to the periods in between, while in the frequency domain, we see subharmonics gradually (or abruptly) appearing. See e.g. Hanspeter Herzel, "__Bifurcations and Chaos in Voice Signals__", 1993.

There are two general approaches to f0 estimation.

In a frequency-domain approach, we might try to identify the frequency of harmonics as you suggest (itself not a trivial matter) and then look for the greatest common divisor, or the relative amplitudes, or whatever. A problem with relative amplitudes is that from a perceptual point of view, missing fundamentals (and even cases where several of the first few harmonics are missing) are basically ignored, with the spacing of higher harmonics dominating the perception. A less fiddly frequency-domain method uses the cepstrum (the squared magnitude of the inverse Fourier transform of the logarithm of the squared magnitude of the Fourier transform of a signal), which was __first used for that purpose in 1964__.

The dominant time-domain approach — which is also the dominant approach to pitch determination these days — uses serial cross-correlation to identify the time lag at which the signal is most similar to itself. See __here__ for some simple discussion.]
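The two approaches described in the reply above can be sketched in a few lines of Python. This is a toy illustration, not any particular pitch tracker: it uses a naive O(n²) DFT to stay dependency-free, a synthetic damped pulse train instead of speech, and hypothetical function names. The lag search range corresponds to roughly 60 to 400 Hz at the assumed 8 kHz sample rate.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform; adequate for a short frame."""
    n = len(values)
    sign = 1 if inverse else -1
    roots = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]
    out = [sum(v * roots[(k * i) % n] for i, v in enumerate(values))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def cepstral_period(frame, min_lag, max_lag):
    """Lag (in samples) of the largest power-cepstrum value in the range."""
    # Small constant avoids log(0) at bins with no energy.
    log_power = [math.log(abs(c) ** 2 + 1e-12) for c in dft(frame)]
    cepstrum = [abs(c) ** 2 for c in dft(log_power, inverse=True)]
    return max(range(min_lag, max_lag + 1), key=lambda q: cepstrum[q])

def autocorr_period(frame, min_lag, max_lag):
    """Lag at which the frame is most similar to a shifted copy of itself,
    scored by normalized cross-correlation."""
    def score(lag):
        a, b = frame[:-lag], frame[lag:]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        return num / den if den else 0.0
    return max(range(min_lag, max_lag + 1), key=score)

# Synthetic test frame (an assumption): damped pulses every 80 samples,
# i.e. a 100 Hz "voice" at an 8 kHz sample rate, five periods long.
SAMPLE_RATE = 8000
PERIOD = 80
frame = [0.9 ** (i % PERIOD) for i in range(5 * PERIOD)]

min_lag = SAMPLE_RATE // 400  # shortest period considered (400 Hz)
max_lag = SAMPLE_RATE // 60   # longest period considered (60 Hz)
cep_lag = cepstral_period(frame, min_lag, max_lag)
acf_lag = autocorr_period(frame, min_lag, max_lag)
print(SAMPLE_RATE / cep_lag, SAMPLE_RATE / acf_lag)
```

Note that the answer depends on the lag range: widen `max_lag` to 160 samples and a lag of two periods fits this signal just as well, which is the octave ambiguity under discussion.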

[(myl) How would you recognize "creak" so as to exclude it? ]

[(myl) Can you suggest where to find a sample of the kind you recommend? In the collections I've looked at (for example, the SRI-FRTIV dataset), increases in pitch range due to increases in vocal effort generally scale approximately multiplicatively, like transposition to a higher musical range, so that the observed f0 span in semitones (which is a logarithmic scale) doesn't change very much. There are certainly circumstances that cause the modal distribution to broaden, but I haven't seen cases where the bottom of the modal distribution is more than 7 or 7.5 semitones below the mode, or the top of the modal distribution is more than 10 or 12 semitones above the mode.]
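To illustrate the arithmetic behind the semitone point, here is a small Python sketch. The frequencies and the scaling factor are made-up example values, not figures from any dataset.

```python
import math

def semitones(f_hz, ref_hz):
    """Interval from ref_hz up to f_hz in semitones (12 per octave)."""
    return 12 * math.log2(f_hz / ref_hz)

# Hypothetical f0 range, then the same range scaled up by a factor of 1.5
# (a multiplicative shift, like transposing to a higher key).
low, high = 80.0, 160.0
scale = 1.5
span_normal = semitones(high, low)
span_raised = semitones(high * scale, low * scale)
print(span_normal, span_raised)  # the span in semitones is unchanged

# Frequency ratios for limits expressed in semitones relative to a mode.
ratio_fifth_below = 2 ** (-7 / 12)   # about two thirds of the mode
ratio_octave_above = 2 ** (12 / 12)  # exactly twice the mode
```

In Hz the scaled range is 80 Hz wider (120 to 240 instead of 80 to 160), but in semitones both spans are one octave.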

Isn't vocal fry manifested mostly as period-doubling, not chaos?

[(myl) No. The whole "fry" metaphor is based on the quasi-random popping of water-vapor bubbles in hot oil. See "Vocal creak and fry, exemplified", 2/7/2015; "What does 'vocal fry' mean?", 8/20/2015.]

Regarding your algorithm, would some languages have a wider tonal range in normal speech than your algorithm would accommodate?

[(myl) I doubt it. In the first place, as far as I know, differences in "tonal range" are cross-linguistically individual, situational, and cultural, not a fact about one language versus another. And the purpose of this simple-minded method is to isolate the modal region of what are clearly multi-modal distributions created by period-doubling. If the parameters (a fifth below the modal value and an octave above it) turn out to be inadequate for some cases, they could be modified. Or more likely, a different method should be used. But across several challenging datasets, this method seems to do about the right thing.]
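As a rough illustration of the kind of mode-relative limits described in the reply above, here is a Python sketch. It is a guess at the general idea, not the actual implementation: the histogram-based mode estimate, the one-semitone bin width, the function name `modal_region`, and the sample f0 values are all assumptions made for the example.

```python
import math
from collections import Counter

def modal_region(f0_values, below_st=7.0, above_st=12.0, bin_st=1.0):
    """Keep f0 values between `below_st` semitones under the mode and
    `above_st` semitones over it. The mode is taken as the center of the
    fullest histogram bin on a semitone (log-frequency) scale."""
    bins = Counter(math.floor(12 * math.log2(f) / bin_st) for f in f0_values)
    mode_bin = max(bins, key=bins.get)
    mode_hz = 2 ** ((mode_bin + 0.5) * bin_st / 12)
    lo = mode_hz * 2 ** (-below_st / 12)
    hi = mode_hz * 2 ** (above_st / 12)
    return mode_hz, [f for f in f0_values if lo <= f <= hi]

# Made-up f0 track in Hz: a main cluster near 120 Hz plus a few
# period-doubled values near 60 Hz.
f0_track = [110, 115, 118, 120, 120, 121, 122, 125, 130, 140, 150,
            58, 60, 61, 62]
mode_hz, kept = modal_region(f0_track)
print(round(mode_hz, 1), len(kept), min(kept), max(kept))
```

With these defaults (a fifth below, an octave above), the lower-octave values fall outside the window while the main cluster is retained; both limits are plain parameters, so they can be adjusted for cases where they prove inadequate.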
