The shape of a spoken phrase in Mandarin


A few years ago, with Jiahong Yuan and Chris Cieri, I took a look at variation in English word duration by phrasal position, using data from the Switchboard conversational-speech corpus ("The shape of a spoken phrase", LLOG 4/12/2006; Jiahong Yuan, Mark Liberman, and Chris Cieri, "Towards an Integrated Understanding of Speaking Rate in Conversation", InterSpeech 2006). As is often the case for simple-minded analysis of large speech datasets, this exercise showed a remarkably consistent pattern of variation — the plot below shows mean duration by position for phrases from 1 to 12 words long:

The Mandarin Broadcast News collection discussed in a recent post ("Consonant effects on F0 in Chinese", 6/12/2014) lends itself to a similar analysis of phrase-position effects on speech timing. So for this morning's Breakfast Experiment™, I ran a couple of scripts to take a first look.

As described in Jiahong Yuan, Neville Ryant, and Mark Liberman, "Automatic Phonetic Segmentation in Mandarin Chinese: Boundary Models, Glottal Features and Tone", ICASSP 2014, we started with a 16-year-old published dataset (1997 Mandarin Broadcast News Speech LDC98S73, 1997 Mandarin Broadcast News Transcripts LDC98T24), and processed it as follows:

We extracted the “utterances” (the between-pause units that are time-stamped in the transcripts) from the corpus and listened to all utterances to exclude those with background noise and music. Utterances from speakers whose names were not tagged in the corpus or from speakers with accented speech were also excluded. The final dataset consisted of 7,849 utterances from 20 speakers. We randomly selected 300 utterances from six speakers (50 utterances for each speaker), three male and three female, to compose a test set. The remaining 7,549 utterances were used for training.

For this morning's exercise, I further divided the training-set utterances wherever there was a significant silent pause, yielding 10,699 breath-group-like phrases comprising 96,697 syllables. The resulting dataset is significantly larger than the laboratory collections used in typical phonetics experiments. But it's small compared to Switchboard — about 10,000 phrases vs. about 250,000 phrases, in the versions used to produce this post's plots; and Switchboard in turn is small by the standards of modern speech-technology research.
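The post doesn't say which pause threshold or script was used for this division. As a rough sketch only, assuming the forced alignment yields `(label, start, end)` triples, that silences carry a hypothetical label `"sil"`, and that 0.2 seconds counts as a "significant" pause (all three are assumptions made for illustration), the splitting step might look like this:

```python
from typing import List, Tuple

# Each aligned unit is (label, start_sec, end_sec).
# The "sil" label and the 0.2 s threshold are illustrative assumptions,
# not values taken from the post.
Segment = Tuple[str, float, float]

def split_at_pauses(segments: List[Segment], min_pause: float = 0.2,
                    sil_label: str = "sil") -> List[List[Segment]]:
    """Split one utterance into phrases at silences lasting at least min_pause."""
    phrases: List[List[Segment]] = []
    current: List[Segment] = []
    for label, start, end in segments:
        if label == sil_label:
            # A long enough silence closes the current phrase;
            # shorter silences are dropped without ending it.
            if end - start >= min_pause and current:
                phrases.append(current)
                current = []
            continue
        current.append((label, start, end))
    if current:
        phrases.append(current)
    return phrases

# Toy utterance with one long pause and one short one (invented timings).
utterance = [("jin1", 0.00, 0.18), ("tian1", 0.18, 0.40), ("sil", 0.40, 0.75),
             ("xin1", 0.75, 0.95), ("sil", 0.95, 1.00), ("wen2", 1.00, 1.30)]
phrases = split_at_pauses(utterance)
print([[seg[0] for seg in phrase] for phrase in phrases])
```

With a different threshold the phrase counts would change, so the 10,699 figure should not be expected to fall out of this sketch.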

Still, a very consistent pattern of syllable duration by phrasal position emerges. This plot shows average duration by position for phrases between 7 and 16 syllables long:

The expected phrase-final lengthening emerges clearly, as it did in the case of English. But there are also some striking differences; here are side-by-side plots for comparison:

English words (Switchboard) Mandarin syllables (Broadcast News)

The Mandarin data shows a striking shortening effect in the antepenultimate syllable — and a smaller shortening effect in the syllable before that. The Mandarin measurements also show shortening of the phrase-initial syllable, and a tendency for phrase-medial syllables (after the first few, and before the final four) to be longer.
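The averages behind plots like these amount to grouping syllables by phrase length and by distance from the end of the phrase, then taking the mean duration in each group. The following is a minimal sketch of that bookkeeping, not the script actually used for the post; the function name and the toy durations are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def mean_duration_by_position(
        phrases: List[List[float]]) -> Dict[Tuple[int, int], float]:
    """Map (phrase_length, position_from_end) to mean duration in seconds.

    position_from_end is 0 for the final syllable, 1 for the penultimate,
    2 for the antepenultimate, and so on.
    """
    sums: Dict[Tuple[int, int], float] = defaultdict(float)
    counts: Dict[Tuple[int, int], int] = defaultdict(int)
    for durations in phrases:
        n = len(durations)
        for i, d in enumerate(durations):
            key = (n, n - 1 - i)  # use (n, i) instead to align from the start
            sums[key] += d
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Toy syllable durations in seconds (invented, not corpus measurements).
toy = [[0.20, 0.15, 0.30], [0.22, 0.17, 0.34], [0.18, 0.25]]
means = mean_duration_by_position(toy)
for (length, from_end), m in sorted(means.items()):
    print(f"length={length} END-{from_end}: {m:.3f} s")
```

Indexing from the end keeps the final syllables of phrases of different lengths lined up, which is what makes the phrase-final pattern easy to compare across lengths.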

These differences might be a language effect, that is, a difference between English and Mandarin phrasal speech timing. But they might also arise because we're looking at syllable durations rather than word durations, or because the speech is broadcast news reading rather than telephone conversations, or because of characteristic phrase-final syntactic or lexical patterns in Mandarin news broadcasts, or . . .

Despite this indeterminacy, the patterns are striking enough to be worth further investigation, I think. During some other breakfast period, I (or others) could look at syllable-wise patterns in Switchboard, or word-wise patterns in the Mandarin broadcast data. And with a bit of additional high-quality forced alignment, we can compare published English broadcast material, or Mandarin conversational material, or for that matter different sorts of speech in other languages — and use regression methods to try to untangle the causes and effects.

As I wrote a few years ago:

From the perspective of a linguist, today's vast archives of digital text and speech, along with new analysis techniques and inexpensive computation, look like a wonderful new scientific instrument, a modern equivalent of the 17th-century invention of the telescope and microscope.

We can now observe linguistic patterns in space, time, and cultural context, on a scale three to six orders of magnitude greater than in the past, and simultaneously in much greater detail than before.

When we focus our new instruments on a familiar object, we often see interesting and unexpected things — and that's exactly what happened here. Seeing such patterns is the starting-point of science, not the end. But generating and testing hypotheses in hours rather than months is still a big win.

In this case, the dataset is barely one or two decimal orders of magnitude larger than a typical laboratory phonetics experiment, but there are still enough exemplars at each phrase length to see the effect:

No. Sylls      1    2    3     4    5    6    7    8    9   10   11   12   13   14   15   16
No. Phrases  115  702  578  1061  849  934  855  836  718  624  570  487  385  356  298  218

(In comparison, the Switchboard dataset had between 12,124 and 151,995 exemplars for each phrasal word-count.)

And since the speech was collected and transcribed 17 years ago for another purpose, and automatically segmented recently as part of yet another series of experiments, the additional time expended in data collection and measurement is reduced to zero, or at least to the few minutes needed to write a couple of analysis scripts.

For completeness, here's the plot for phrase lengths from 1 to 6 syllables:

I left those out of the earlier plot because at those shorter phrase lengths, the phrase-final and phrase-initial effects are still apparently overlapping and interacting, so that the graphs are harder to read.

Here's the same data with the plots aligned from the start of the phrase rather than the end:

And the syllable-duration means by phrase length, in an R-accessible form, are here, so you can re-plot them in other ways if you prefer.

Update — Jiahong Yuan points out that

The shortening of the antepenultimate syllable is indeed a striking result. Have you checked tone distributions at different phrase positions? One possibility (although not very likely to me) would be that the third syllable from the end tends to be tone0 (了,的,着,etc.), followed by a disyllabic word. From another perspective, in classical Chinese poems, 5 and 7 syllables per line are most popular, and the rhythm is 2-3 or 2-2-3. If the rhythm is somewhat present in Modern Chinese, then shortening the third syllable from the end also makes sense.

Jiahong's first suggestion is indeed at least part of the story! Here are the counts for the last four syllables in phrases of lengths 7-16 syllables:

POS   Tone1 Tone2 Tone3 Tone4 Tone0
END-3  1190  1297  813   1782  265
END-2   909  1039  676   1704 1019
END-1  1315  1279  963   1729   61
END-0   918  1104  649   2256  420

And the proportions:

POS   Tone1  Tone2  Tone3  Tone4  Tone0
END-3 0.223  0.243  0.152  0.333  0.050
END-2 0.170  0.194  0.126  0.319  0.191
END-1 0.246  0.239  0.180  0.323  0.011
END-0 0.172  0.207  0.121  0.422  0.079
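The proportions are just the counts divided by each row's total. This short check uses the counts exactly as given in the table; only the variable names are invented.

```python
# Tone counts for the last four syllable positions, copied from the table.
tones = ["Tone1", "Tone2", "Tone3", "Tone4", "Tone0"]
counts = {
    "END-3": [1190, 1297, 813, 1782, 265],
    "END-2": [909, 1039, 676, 1704, 1019],
    "END-1": [1315, 1279, 963, 1729, 61],
    "END-0": [918, 1104, 649, 2256, 420],
}

# Normalize each row by its own total to get per-position tone proportions.
proportions = {}
for pos, row in counts.items():
    total = sum(row)
    proportions[pos] = [c / total for c in row]

print("POS   " + "  ".join(tones))
for pos, row in proportions.items():
    print(pos + " " + "  ".join(f"{p:.3f}" for p in row))
```

Each row sums to 5,347, which is also the total number of phrases of 7 to 16 syllables in the phrase-count table earlier in the post, so every qualifying phrase contributes one syllable per position.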

A more sophisticated statistical analysis will be needed to sort out the various effects. But we can get a glimpse just by plotting positional effects separately for the five tone classes:

This suggests that the larger number of neutral-tone (tone 0) syllables in antepenultimate position is not the whole story.

Update #2: I took a few extra minutes to do something that I should have done in the first place, namely calculated mean durations for the Mandarin data by word rather than by syllable. Here's the comparison done that way:

English words (Switchboard) Mandarin words (Broadcast News)
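Going from syllable durations to word durations means summing over the syllables of each word. The post doesn't say how word boundaries were obtained, so the sketch below simply assumes that each syllable already carries a word index (a hypothetical input format, for illustration only).

```python
from typing import List, Tuple

def word_durations(syllables: List[Tuple[int, float]]) -> List[float]:
    """Sum syllable durations within each word.

    syllables: (word_index, duration_sec) pairs in spoken order, where
    consecutive syllables of the same word share a word_index.
    """
    durations: List[float] = []
    last_index = None
    for word_index, dur in syllables:
        if word_index != last_index:
            durations.append(0.0)  # a new word starts here
            last_index = word_index
        durations[-1] += dur
    return durations

# Toy phrase: a disyllabic word, a monosyllable, then another disyllabic word.
phrase = [(0, 0.15), (0, 0.20), (1, 0.12), (2, 0.18), (2, 0.30)]
words = word_durations(phrase)
print([round(w, 2) for w in words])
```

The resulting per-word durations can then be averaged by position in the same way as the per-syllable ones.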



  1. D.O. said,

    June 21, 2014 @ 11:09 am

    It is, of course, a Breakfast Experiment (I will go out on a limb here and claim that you did not register it with the Trademark Office), but still some info about deviations from average would be useful. Or maybe deviations are a bit misleading, because the durations of fragments within one phrase are highly correlated (through average rate of speech), in which case the fragments should be normalized somehow…

    [(myl) A serious analysis will take into account phonological categories (tones as well as vowels and consonants), word and phrase structure, etc. … I think it's probably not very helpful to provide standard errors for these plots in the absence of some real modeling.

    But for what it's worth, here are the standard errors in seconds for the 9-syllable-long phrases, by position:

    0.00211 0.00219 0.00193 0.00209 0.00216 0.00211 0.00201 0.00182 0.00203

    Since e.g. the difference between the 9th and the 7th positions is about 70 msec., and the difference between the 4th and 7th positions is about 30 msec., standard errors of about 2 msec. leave us plenty of statistical-significance headroom, if that's what you're worrying about… ]
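To put numbers on the "headroom" remark, one can express each difference in units of its approximate standard error. This is a back-of-the-envelope sketch that treats the positional means as independent, which is an assumption: positions within a phrase share a speaking rate, and with positively correlated means the true standard error of a difference would be smaller than the one computed here.

```python
import math

# Standard errors (seconds) for positions 1..9 of the 9-syllable phrases,
# as quoted in the reply above.
se = [0.00211, 0.00219, 0.00193, 0.00209, 0.00216,
      0.00211, 0.00201, 0.00182, 0.00203]

def diff_in_se_units(diff_sec: float, pos_a: int, pos_b: int) -> float:
    """Difference between two positional means, divided by the standard
    error of that difference under an independence assumption."""
    se_diff = math.sqrt(se[pos_a - 1] ** 2 + se[pos_b - 1] ** 2)
    return diff_sec / se_diff

r_9_vs_7 = diff_in_se_units(0.070, 9, 7)  # "about 70 msec."
r_4_vs_7 = diff_in_se_units(0.030, 4, 7)  # "about 30 msec."
print(f"9th vs 7th: {r_9_vs_7:.1f} standard errors")
print(f"4th vs 7th: {r_4_vs_7:.1f} standard errors")
```

Under these assumptions the two differences come out at roughly 24 and 10 standard errors respectively.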

  2. D.O. said,

    June 21, 2014 @ 5:34 pm

    Prof. Liberman, thank you for posting standard errors. I was thinking more along the lines not of statistical significance, but real life significance. If there is a position-dependent change in the length of syllables (or words in English) it will be revealed with enough data even if it is very small. But I was thinking more along the lines of the width of the distribution, which I gather is standard error of the mean times square root of the number of fragments with given length. The overlap then is quite large. Because people hear one fragment at a time and not a statistical composite of a large number of them, it is interesting to figure out what happens within one fragment. The regular and robust change in the length of final syllables can indicate the end of the fragment for a speaker or listener and thus join the set of such indicators. But because of large overlap the absolute duration of a syllable is not a good indicator and one should at least correct for overall speed.
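The width estimate described in this comment can be worked out from numbers already in the post: the standard errors quoted for the 9-syllable phrases, and the count of 718 such phrases from the phrase-count table. The sketch below just applies SD = SE × √N; the variable names are invented.

```python
import math

n_phrases = 718  # number of 9-syllable phrases, from the post's table

# Standard errors (seconds) by position for the 9-syllable phrases,
# as quoted in the reply to the first comment.
se = [0.00211, 0.00219, 0.00193, 0.00209, 0.00216,
      0.00211, 0.00201, 0.00182, 0.00203]

# Implied standard deviation of individual syllable durations per position.
sd = [s * math.sqrt(n_phrases) for s in se]
for position, value in enumerate(sd, start=1):
    print(f"position {position}: SD of about {value * 1000:.0f} ms")
```

The implied standard deviations are roughly 50 to 60 ms, comparable in size to the 70 ms difference between the 9th and 7th positions, which is the overlap the comment is pointing to.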

    [(myl) Except for the final-lengthening effect, which is robust enough to emerge from pretty much any analysis, what we're seeing here is meant more to give a sense of what the overall phrasal shape might be, not what details of timing might convey which sorts of information in particular utterances. For that, we'll need a model that factors in the effects of differences in consonant and vowel intrinsic duration, the effects of tone class, local variations in speaking rate, constituent structure, word frequency and local conditional entropy, emphasis, etc. — plus the nature of their interactions.]

  3. Simon P said,

    June 23, 2014 @ 1:40 pm

    Late to the party, and maybe nobody will read this comment, but I can absolutely see how, if the final word is a two-syllable compound, the syllable before is often going to be very short, even in non-toneless syllables. These syllables will often be other grammatical words like 到 or 个 or short verbs like 去. Example phrases made up on the spur of the moment:


    In the first two examples, the antepenultimate character is likely to be pronounced quickly. It's not that important, and the information can readily be understood even if the listener doesn't hear it. The last phrase is something of a counterexample, where the penultimate word is disyllabic and the final one monosyllabic. Still, looking at words, the penultimate word is likely to be short (going on gut feeling as a second-language speaker here) and the final word long. I suspect if you divide by words instead of syllables and correct for the number of syllables in each word, the pattern might be even stronger.

    Also, if you took a colloquial sample, the effect would probably diminish because of sentence particles like 吧, which tend to be short in Mandarin (but long in Cantonese!).

    I have little evidence besides instinct for the above reasoning. Anyway, really interesting find!
