I spent the past couple of days at a workshop on lexical tone, organized by Kristine Yu at UMass. A topic that came up several times was the question of whether "segmental" influences on pitch — for instance, the fact that voiceless consonants are typically associated with a higher pitch at the start of a following vowel — might be diminished or even eliminated in languages with lexical tone. Several participants observed that the evidence for this is not very strong: the classic paper on the subject, for example, studied a small number of utterances from a single speaker of Thai.
So for this morning's Breakfast Experiment™, I wrote a little script that calculates and displays (one way of looking at) these effects in the TIMIT dataset, which includes 10 English sentences spoken by each of 630 speakers. (Specifically, there are two sentences spoken by all 630 speakers; 450 sentences spoken by 7 speakers each; and 1890 sentences spoken by a single speaker.)
I had to go to a meeting before I had a chance to write up the results, but the meeting ended early enough for me to find 15 minutes before lunch, so:
My script pitch-tracked all the sentences, and located all the places where one of the consonants "b", "d", "g", "k", "m", "n", "p", "t" was followed by one of the vowels "aa", "ae", "ah", "ay", "eh", "er", "ey", "ih", "ix", "iy" (in ARPABET). I pulled out the first 50 msec. of estimated F0 values from the designated vowels — 10 estimates at a 5-msec. frame advance. I expressed the F0 estimates in each 10-element vector as ratios to the mean value of that vector.
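The per-vowel normalization step can be sketched in a few lines of Python (the F0 values below are invented for illustration; this is not the actual script, just the idea — divide each of the 10 frame values by the mean of that 10-element vector):

```python
# Sketch of the normalization described above: take the first 50 msec of F0
# estimates from a vowel (10 frames at a 5-msec frame advance) and express
# each value as a ratio to the mean of that 10-element vector.

def f0_ratios(f0_frames):
    """Express each F0 estimate as a ratio to the vector's mean."""
    mean_f0 = sum(f0_frames) / len(f0_frames)
    return [f / mean_f0 for f in f0_frames]

# A made-up falling contour, like what we might see after a voiceless stop:
frames = [220.0, 215.0, 211.0, 208.0, 205.0, 203.0, 201.0, 200.0, 199.0, 198.0]
ratios = f0_ratios(frames)
```

Expressing the values as ratios rather than raw Hz means that speakers with very different pitch ranges can be pooled directly.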
A plot of the results:
This clearly shows the expected effects: /p/ /t/ /k/ show an average fall of about 10%, /b/ /d/ /g/ a fall of about 3%, and /m/ /n/ even less.
It's nice to see that such a crude technique produces such clean results. This is presumably due to the size of the dataset (small by today's speech-technology standards, but enormous by the standards of most phonetics research), and perhaps the dataset's balanced character (though I suspect that conversational or broadcast-news datasets will show similar effects, if they're large enough). The counts involved in this case (after automatically removing examples with period-doubling or other pitch-tracking errors):
No doubt there are effects of vowel, of stress, of syllabification, of word and phrase position, of speaker, etc. — which can be explored via regression in a dataset of this type, though another breakfast or two would be required.
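A regression of that sort might be sketched as follows — purely illustrative, with synthetic data and only one factor (consonant class, dummy-coded against a voiceless reference level), where a real analysis would use the measured ratios and add predictors for vowel, stress, position, speaker, and so on:

```python
import numpy as np

# Illustrative regression sketch: predict a per-token F0 fall from consonant
# class via ordinary least squares. The data are simulated; the "true"
# effects below are loosely modeled on the averages reported in the text.

rng = np.random.default_rng(0)
classes = ["voiceless", "voiced", "nasal"]
true_effect = {"voiceless": -0.10, "voiced": -0.03, "nasal": -0.01}

# Simulate 300 tokens: the F0 fall over the first 50 msec, plus noise.
labels = rng.choice(classes, size=300)
y = np.array([true_effect[c] for c in labels]) + rng.normal(0, 0.02, 300)

# Design matrix: intercept plus dummies for "voiced" and "nasal"
# ("voiceless" is the reference level, so its effect is the intercept).
X = np.column_stack([
    np.ones(len(labels)),
    (labels == "voiced").astype(float),
    (labels == "nasal").astype(float),
])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
# coefs[0] estimates the voiceless fall; coefs[1] and coefs[2] estimate how
# much smaller the fall is for voiced stops and nasals, respectively.
```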
Now if only we had TIMIT-like datasets for French, German, Chinese, Thai, Yoruba, Chinantec, Pashto, etc.!
Someday, I hope, we will…
Update — for the record, here are the scripts involved — it would be nice if there were an easier and more standard way of doing it. Again, someday…
This script does the pitch tracking:
#!/bin/sh
#
for f in `cat filelist`
do
  echo $f
  get_f0 -P /home/myl/bin/params60_650 new/$f.wav new/$f.f0
  fea_print /home/myl/bin/layout new/$f.f0 >new/$f.af0
done
where filelist is a list of pathnames to the 6300 TIMIT file prefixes, e.g.
TEST/DR1/FAKS0/SA1
TEST/DR1/FAKS0/SA2
TEST/DR1/FAKS0/SI1573
TEST/DR1/FAKS0/SI2203
TEST/DR1/FAKS0/SI943
and get_f0 and fea_print are part of the ESPS package. The params60_650 file is
float min_f0 = 60.0;
float max_f0 = 650.0;
float wind_dur = 0.01;
float frame_step = 0.005;
and the layout file is
layout=f0
F0 %.2f prob_voice %.1f rms %.2f ac_peak %.3f\n
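Given that layout, each frame in the resulting .af0 files should come out as a line of four whitespace-separated numbers. A minimal reader might look like this (the field names follow the layout file; the sample line is invented):

```python
# Minimal reader for the fea_print output implied by the layout file above:
# one frame per line, with fields F0, prob_voice, rms, and ac_peak.

FIELDS = ("F0", "prob_voice", "rms", "ac_peak")

def parse_af0_line(line):
    """Turn one .af0 frame line into a dict of named float values."""
    values = [float(v) for v in line.split()]
    return dict(zip(FIELDS, values))

# An invented frame, for illustration:
frame = parse_af0_line("212.35 1.0 1543.27 0.874")
```

From frames like these, the analysis script can keep only the voiced frames (prob_voice near 1) and filter out period-doubling errors by rejecting implausible frame-to-frame F0 jumps.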