REAPER


A couple of days ago, I mentioned ("Sarah Koenig", 2/5/2015) that David Talkin was releasing a new pitch tracking program called REAPER (available from github at the link). After a few minor improvements in documentation, it's ready for the general public.

The reaper program uses the EpochTracker class to simultaneously estimate the location of voiced-speech "epochs" or glottal closure instants (GCI), voicing state (voiced or unvoiced) and fundamental frequency (F0 or "pitch"). We define the local (instantaneous) F0 as the inverse of the time between successive GCI.
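That definition is easy to make concrete: each local F0 estimate is just the reciprocal of the interval between adjacent glottal pulses. A minimal sketch (the GCI times here are invented for illustration):

```python
# Instantaneous F0 as the reciprocal of the interval between
# successive glottal closure instants (GCIs), per the definition above.
def f0_from_gci(gci_times):
    """Given sorted GCI times in seconds, return one local F0
    estimate (Hz) per inter-GCI interval."""
    return [1.0 / (t2 - t1) for t1, t2 in zip(gci_times, gci_times[1:])]

# Four pulses spaced 10 ms apart yield three estimates near 100 Hz.
estimates = f0_from_gci([0.000, 0.010, 0.020, 0.030])
print([round(f, 1) for f in estimates])  # -> [100.0, 100.0, 100.0]
```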

After trying it out, I can recommend it whole-heartedly — it's robust and accurate and fast. It's my new standard pitch tracker.

It's easy to download and build, at least on OS X and Linux systems. (I haven't tried it on Windows under Cygwin, because my Windows laptop is out on loan.) Its output is in the form of Edinburgh Speech Tools files, but the ASCII version of those files is easy to assimilate into other programs.
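For instance, a minimal Python reader for the ASCII F0 output might look like this. The header sentinel and the three-column time/voicing/F0 layout reflect my reading of the EST ASCII format, so check an actual output file before relying on it:

```python
# Minimal reader for REAPER's ASCII .f0 output, an Edinburgh Speech
# Tools (EST) file: a header terminated by "EST_Header_End", then one
# "time voicing f0" triple per line.  (This layout is my understanding
# of the format; verify against your own output files.)
def read_f0_file(path):
    times, f0s = [], []
    in_header = True
    with open(path) as fh:
        for line in fh:
            if in_header:
                if line.strip() == "EST_Header_End":
                    in_header = False
                continue
            t, voiced, f0 = line.split()
            if voiced == "1":            # keep only voiced frames
                times.append(float(t))
                f0s.append(float(f0))
    return times, f0s
```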

Here are the "Glottal Closure Instants" that it finds for a challenging stretch of a recent This American Life episode (Prologue to "If You Don't Have Anything Nice to Say, SAY IT IN ALL CAPS", 1/23/2015), where Ira Glass gets down to 27 Hz or so:

In order for that passage to be tracked accurately, I had to change the "minimum f0" to 20 Hz from the default 40 Hz — for speakers whose voices are less heroically creaky, the default settings work well.

As a quick demonstration, I tracked Ira Glass's voice through the whole of that prologue passage (until the music kicks in, about 51 seconds):

… and compared it to Kai Ryssdal's voice in two recent Marketplace segments:

"Coming soon: New York's first men's fashion week", 2/5/2015 (45 seconds, 10.1 seconds to analyze on my rather antique laptop):

"Goldman Sachs' reputation sinks even lower", 2/6/2015, (36 seconds, 6.5 seconds to analyze):

The distribution of REAPER's f0 estimates confirms that Ira Glass's voice is lower overall than Kai Ryssdal's (median 95.2 versus 146.8 Hz, more than a fifth lower), and that he is much more often in the "creaky" perceptual range below 70 Hz (17% of f0 estimates vs. 4%):
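Both numbers in that comparison (the median f0 and the fraction of estimates below 70 Hz) are trivial to compute once you have the voiced-frame estimates; a sketch with invented sample values:

```python
import statistics

def creak_stats(f0s, creak_ceiling=70.0):
    """Median F0, and the fraction of estimates below the 'creaky'
    perceptual ceiling (70 Hz, following the crude metric in the post)."""
    below = sum(1 for f in f0s if f < creak_ceiling)
    return statistics.median(f0s), below / len(f0s)

# Invented sample: mostly modal pitch with a few creaky stretches.
f0s = [95.0, 100.0, 110.0, 60.0, 55.0, 120.0, 98.0, 102.0, 65.0, 105.0]
median, frac = creak_stats(f0s)
print(median, frac)  # -> 99.0 0.3 (3 of 10 estimates fall below 70 Hz)
```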

Here's the same distribution on a semitone scale, which is probably more perceptually appropriate:
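The semitone conversion is just a log transform: each doubling of frequency adds 12 semitones relative to some reference frequency. A sketch (the 100 Hz reference is an arbitrary choice for illustration):

```python
import math

def hz_to_semitones(f0, ref=100.0):
    """Semitones relative to a reference frequency (100 Hz is an
    arbitrary choice here): each doubling of F0 adds 12 semitones."""
    return 12.0 * math.log2(f0 / ref)

print(hz_to_semitones(200.0))             # an octave above 100 Hz -> 12.0
print(round(hz_to_semitones(70.0), 2))    # the 70 Hz creak ceiling -> -6.17
```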


If we apply the same metric to the samples of Sarah Koenig's radio voice mentioned in the earlier post, we find that the samples from 2000 (TAL #151 and #162) have 5% of their f0 estimates below 70 Hz, while the sample from 2014 (TAL #537) has 16%.

The usual "creakiness" definitions are more complex, and involve a combination of human auditory perception and human evaluation of time-domain or frequency-domain evidence for period-doubling or irregular glottal oscillation. Thus Kristine Yu and Hiu Wai Lam, "The role of creaky voice in Cantonese tonal perception", Journal of the Acoustical Society of America 2014:

A token was defined to be creaky if it had the auditory percept of creaky voice, as determined by the authors and if: (1) there were alternating cycles of amplitude and/or frequency or irregular glottal pulses in the waveform or wide-band spectrogram, (2) missing values or discontinuities in the f0 track determined by Praat's autocorrelation algorithm with default settings, or (3) the appearance of strong subharmonics or lack of harmonic structure in the narrow-band spectrogram.

This (entirely appropriate) definition combines aspects of period-doubling and erratic phonation, with the presence of sounds that are simply low enough in pitch for listeners to start to hear individual glottal cycles.  A problem with such definitions, however, is that they involve a lot of human perceptual testing and human annotation. And this means that a meaningful attempt to evaluate the claims of a "vocal fry epidemic" among young women in America — that is, to investigate the distribution of "vocal fry" (by which people mostly mean "creak") across age, gender, and time — would be a daunting amount of work, because it requires analyzing natural speech samples from hundreds if not thousands of speakers.

We might (and should) try to automate such human annotations — but there may be a much simpler way.

As I noted yesterday in "Vocal creak and fry, exemplified", any sequence of buzz-like oscillations will sound "creaky" when its frequency gets low enough, even if the oscillations are perfectly periodic. The laryngeal and pulmonary gestures that produce these low fundamental frequencies in human speech generally do also tend to produce period-doubling and chaotic oscillation, but just the low fundamental frequency is enough to create the perception of creakiness.

So I hypothesize that given an accurate-enough pitch tracker, a simple metric based on the distribution of estimated f0 values will correlate quite well with human perceptions of the voice-quality characteristics commonly called "vocal fry". And it looks to me — based on these two admittedly limited tests — as if REAPER is accurate enough to support this research.

We need a better metric on f0 distributions than just my crude "percent below 70 Hz" attempt. And we should explore various automated measurements of jitter and/or period-doubling. But I like the idea that a simple quantification of f0 distributions might work well enough that we can finally test (aspects of) the widespread perception that young women are doing something different with their voices that includes increased amounts of vocal "creak" or "fry".
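As one example of such an automated measurement, a "local jitter" statistic in the style of Praat's jitter (local) measure (the mean absolute difference between consecutive periods, divided by the mean period) can be computed directly from a sequence of glottal pulse times; a sketch with invented pulse times:

```python
def local_jitter(gci_times):
    """Local jitter, roughly in the style of Praat's 'jitter (local)':
    mean absolute difference between consecutive periods, divided by
    the mean period.  Input is a sorted list of pulse times in seconds."""
    periods = [t2 - t1 for t1, t2 in zip(gci_times, gci_times[1:])]
    diffs = [abs(p2 - p1) for p1, p2 in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

# Perfectly periodic pulses give (near-)zero jitter; alternating
# long/short periods (period-doubling) give a large value.
regular = [i * 0.010 for i in range(10)]
doubled = [0.0, 0.008, 0.020, 0.028, 0.040, 0.048, 0.060]
print(round(local_jitter(regular), 3))   # -> 0.0
print(round(local_jitter(doubled), 3))   # -> 0.4
```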



5 Comments

  1. David Talkin said,

    February 8, 2015 @ 9:09 am

    Note that REAPER may be obtained at:

    https://github.com/google/REAPER/

[(myl) Thanks, David! The github repository was linked to the name, but some people may not have recognized the reference. As the directions in the README file explain, it's this simple:

    cd convenient_place_for_repository
    git clone https://github.com/google/REAPER.git
    cd REAPER
    mkdir build   # In the REAPER top-level directory
    cd build
    cmake ..
    make

    ]

  2. James said,

    February 8, 2015 @ 11:41 am

I'm wondering how a poem would register on this. Particularly Ezra Pound's known recordings (or Frost, Stevens, Plath even). The reason I bring poetry up is because I had read an essay about how Pound was involved in an experiment to record and graph the "cadence" of a poetry reading. This REAPER app seems like it is bringing it full circle, unless of course that's not how the program works.

    [(myl) I'm not sure whether "cadence" in that case refers to pitch or time or both. In any event, for some earlier experiments in this area, see:

    "Poem in the key of what", 10/9/2006
    "More on pitch and time intervals in speech", 10/15/2006
    "Bembé, Attis, Orpheus", 5/9/2009
    "The message", 8/27/2013

    David's new & improved pitch tracker will make it easier to do such investigations better, but his old (and still quite good) pitch tracker was generally up to the task as well. One of the key differences in the new program is that it does a good job of distinguishing low-pitched (and sometimes erratically-voiced) regions from voiceless regions, which is crucial for studying creak and fry, but matters less for work on the modal regions of a speaker's pitch range.]

  3. Josef Fruehwald said,

    February 8, 2015 @ 12:39 pm

I've had some success using the covarep repository for creaky voice detection (https://github.com/covarep/covarep). The creaky voice detector is based on a trained artificial neural net (requires Matlab).

    I remember it returning too many high probability regions of creaky voice within fricatives. Developing some kind of classification on the basis of identified glottal pulses would be pretty good.

  4. Chris said,

    February 8, 2015 @ 1:31 pm

    Wikipedia's vocal range article (http://en.wikipedia.org/wiki/Vocal_range) claims the bottom end of the Bass singing range is E2, about 82 Hz, so it surprises me to read (and hear, in that first snippet) that Ira Glass is routinely producing sound lower than that. Is it just that singing demands a longer duration, stability, loudness, or something else not required by speech? Or is he just a deep-voiced guy?

    [(myl) Speaking-voice pitches in the 60-80 Hz region are pretty common even in the modal range of low-pitched male voices, especially at low levels of vocal effort — but producing such pitches in a singing voice would be a different matter. The 20-40 Hz stuff that Ira Glass is producing is in a range where periods have doubled or re-doubled, and a sung version seems even harder to achieve.]

  5. guest said,

    February 8, 2015 @ 11:21 pm

    There already exists an excellent, rather well known pro audio program named Reaper.
