Syllable rhythm in English and Mandarin

I've always been skeptical of the distinction between "stress-timed" and "syllable-timed" languages, at least as a claim about the phonetic facts of speech timing as opposed to the psychological dimensions of speech production and perception. Syllable durations in all languages vary widely, due to differences in the intrinsic durations of different vowels and consonants, the effects of phrasal position and emphasis, and many other factors. As a result, inter-stress intervals in languages like English or German are not actually "isochronous", and neither are inter-syllable intervals in languages like French or Spanish. And it's not even true that speakers generally make such intervals closer to isochronous than the relevant timing factors would otherwise predict.

But in "Speech rhythms and brain rhythms", 12/2/2013, I showed a plot of the average syllable-scale power spectrum in the 6300 American-English sentences in the TIMIT dataset, which indicated a key periodicity at 2.4 Hz. I noted that "2.4 Hz corresponds to a period of 417 msec, which is too long for syllables in this material. In fact, the TIMIT dataset as a whole has 80363 syllables in 16918.1 seconds, for an average of 210.5 msec per syllable, so that 417 msec is within 1% of the average duration of two syllables. […] One hypothesis might be that this somehow reflects the organization of English speech rhythm into 'feet' or 'stress groups', typically consisting of a stressed syllable followed by one or more unstressed syllables."
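For readers unfamiliar with this kind of plot, here is a toy illustration of how a regular alternation in loudness shows up as a peak in a power spectrum. It is not the code behind the TIMIT analysis: the envelope is synthetic, the 2.5 Hz rate is chosen only because it falls exactly on a frequency bin, and `power_spectrum` is a hypothetical helper built on a naive DFT.

```python
import cmath
import math


def power_spectrum(signal, sample_rate):
    """Return (frequencies in Hz, power) from a naive DFT of the mean-removed signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC component
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        freqs.append(k * sample_rate / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


# Assumption: a made-up amplitude envelope that peaks every 0.4 s (2.5 Hz),
# sampled at 100 Hz for 4 seconds. Real speech envelopes are far noisier.
sample_rate = 100
n_samples = 400
envelope = [
    1.0 + math.cos(2 * math.pi * 2.5 * i / sample_rate)
    for i in range(n_samples)
]

freqs, power = power_spectrum(envelope, sample_rate)
peak_index = max(range(len(power)), key=power.__getitem__)
peak_hz = freqs[peak_index]
print(f"peak at {peak_hz:.2f} Hz, period {1000 / peak_hz:.0f} msec")
# prints: peak at 2.50 Hz, period 400 msec
```

Averaging such spectra over many sentences would smooth out sentence-specific detail and leave whatever periodicity the material shares, which is the sense in which a peak near 2.4 Hz points to a recurring unit of about 417 msec.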

I added that "Unfortunately there aren't any datasets comparable to TIMIT in other languages; but I'll see what I can come up with as a more-or-less parallel test in languages that are said to be 'syllable timed' rather than 'stress timed'." Almost ten years later, I've never delivered on that promise, though it would have been easy to do so. So for today's Breakfast Experiment™ I'll show the same analysis for the 6300 sentences in the recently-published Global TIMIT Mandarin Chinese dataset.

So here's the plot for American English TIMIT:

And the same thing for Mandarin Chinese Global TIMIT:

This time, the key periodicity seems to be at 4.59 Hz, which corresponds to a period of 218 msec. This is not much greater than the average syllable duration of about 196 msec in that dataset (88848 syllables in 17457 seconds = 0.1965 seconds per syllable). So maybe (this variety of) Chinese is (sort of) syllable-timed (on average) after all?
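The arithmetic behind the two comparisons can be checked in a few lines. This is only a sketch using the figures quoted in the post; `rhythm_summary` is a hypothetical helper name, not part of any TIMIT tooling.

```python
def rhythm_summary(peak_hz, n_syllables, total_seconds):
    """Return (period in msec, mean syllable duration in msec, period in syllables)."""
    period_ms = 1000.0 / peak_hz
    mean_syllable_ms = 1000.0 * total_seconds / n_syllables
    return period_ms, mean_syllable_ms, period_ms / mean_syllable_ms


# Peak frequencies, syllable counts and total durations as quoted in the post.
datasets = {
    "English TIMIT": (2.4, 80363, 16918.1),
    "Mandarin Global TIMIT": (4.59, 88848, 17457.0),
}

for name, args in datasets.items():
    period_ms, syll_ms, ratio = rhythm_summary(*args)
    print(
        f"{name}: period {period_ms:.0f} msec, "
        f"mean syllable {syll_ms:.1f} msec, "
        f"period = {ratio:.2f} syllables"
    )
# prints:
# English TIMIT: period 417 msec, mean syllable 210.5 msec, period = 1.98 syllables
# Mandarin Global TIMIT: period 218 msec, mean syllable 196.5 msec, period = 1.11 syllables
```

So the dominant period spans roughly two average syllables in the English material and roughly 1.1 in the Mandarin material.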

The "stress timing" hypothesis, as far as I know, originated in Daniel Jones, An Outline of English Phonetics, 1918 (p.106):

Vowel length also depends very largely on the rhythm of the sentence. There is a strong tendency in connected speech to make stressed syllables follow each other as far as possible at equal distances.

Jones (and subsequent British phoneticians) were so fixated on this idea that they totally ignored the most important contextual effect on duration, in English as in all other languages, which is pre-boundary lengthening. First documented (for French) by Jean-Pierre Rousselot several decades earlier, pre-boundary lengthening seems to have been ignored by British phoneticians until the second half of the 20th century. (If you know of exceptions, please send me the details…)

There have been many debunking attempts over the years: a few random examples are Pier Marco Bertinetto, "Reflections on the dichotomy ‘stress’ vs. ‘syllable-timing’" (1989); Richard Cauldwell, "Stress-timing: Observations, beliefs, and evidence" (1996); Antonio Pamies Bertrán, "Prosodic typology: on the dichotomy between stress-timed and syllable-timed languages" (1999); Amalia Arvaniti, "The usefulness of metrics in the quantification of speech rhythm" (2012). As Bertinetto wrote in 1989,

Perhaps no other phenomenon of phonology is so widely accepted, with so little supporting evidence.

But the stress-timing/syllable-timing idea remains a seductive one. As Arvaniti observes, people keep coming up with clever new metrics to show that if you just look at things the right way, some aspects of the relevant units are indeed deeply isochronous (or at least deeply isochronous-trending). And an even larger group just assumes the idea without discussion. So maybe I'm joining the isochronism chorus?

Not yet.

But it's easy to look at average syllable-scale spectra for (more diverse) collections of recorded speech in a wide variety of languages, including things like audiobooks, news broadcasts, narratives, and conversations. So we'll see…

  1. Chris Button said,

    February 28, 2023 @ 12:12 pm

    In my opinion, stress-timed versus syllable-timed is like consonant versus vowel. An unfortunate oversimplification.

    Perhaps a naive question, but wouldn't the distinction between pitch accents and nuclear tones (i.e. contour accents) in the approach of the British school go some way to automatically account for pre-boundary lengthening?

  2. Jerry Packard said,

    February 28, 2023 @ 1:09 pm

    I think these data require a little more unpacking for your audience (including me!). That said, the data seem to show that the occurrence of periodicity (positioning of ‘loudness’ that we associate with ‘stress’) in Mandarin seems to be about every 1.1 syllables, while it is about every 2 syllables in English (do I have that right?). This seems to imply (2-syllable) word stress in English and (single) syllable stress in Mandarin, right?

    This is surprising to me because I have always felt that the evolution of Mandarin was more toward 2-syllable ‘chunks’ (each with their own ‘stress’) in the spoken language.

    THAT having been said, my understanding had always been that of *Japanese* as the prototypical syllable-timed language. Cheers!

  3. David Marjanović said,

    February 28, 2023 @ 2:04 pm

    "Syllable-timed" is a misnomer – most "syllable-timed" languages are really mora-timed. Languages as different as Japanese and Romanian come to mind.

    I wouldn't expect Mandarin, specifically, to be a textbook example of a stress-, syllable- or mora-timed language. It has some amount of noticeable stress and some reduction of unstressed syllables, but not a lot. Maybe it's right in the middle.

  4. Chris Button said,

    February 28, 2023 @ 3:12 pm

    @ David M

    By replacing syllable-timed with mora-timed, aren’t you just perpetuating the artificial distinction from stress-timed?

    I’m no intonation expert, but for what it’s worth, I think any language can be described using a moraic analysis.

  5. Terry K. said,

    February 28, 2023 @ 4:42 pm

    In church recently, during one of the prayers the congregation recites together (and not something with a poetic rhythm), I noticed it had a particular rhythm to it, more regular, it seems to me, than a normal speech rhythm (one person, normal talking). Like, a way we learn to speak and keep a rhythm so that we can stay together, which we don't do with ordinary talking. That seems to interrelate to this. I suspect that this kind of reciting together comes closer to "stress-timed" than ordinary speech, but that's just an amateur guess, not a studied opinion.

  6. Taylor, Philip said,

    February 28, 2023 @ 5:54 pm

    Reciting, for example, the Lord's Prayer in church, one follows the rhythm that one has subconsciously acquired during repeated prior exposure. Were one to do otherwise, the result would be chaotic. Reciting, in this context at least, is surely no different from singing in that respect.

  7. JMGN said,

    February 28, 2023 @ 6:05 pm

    Could somebody please elaborate a bit on the notion of "pre-boundary lengthening" ?
    Maybe some concise reference too?

  8. AntC said,

    February 28, 2023 @ 6:06 pm

    Leaving aside what we name the phenomenon, and whether myl's fancy devices can measure it, _is_ there a phenomenon we can all agree we're hearing? And do we agree (say) English is at one extreme and French at the other? (I agree with @DavidM that Mandarin isn't so extreme as French — to my ears. And I'm not saying French is the most extreme extreme, merely that's an example I come across often.)

    I have a dear friend whose English vocabulary is excellent, but still I find them difficult to follow. Because their first language is French/they learnt English only at school/they moved to an English-speaking environment only in mid-life. To use musical terminology: the rhythm of their English is different vs native English speakers; and different in the same way as other French L1 speakers.

    Contrast I have another dear friend whose L1 is German; their accent is terrible; but I find them easier to follow.

  9. David Marjanović said,

    February 28, 2023 @ 6:34 pm

    By replacing syllable-timed with mora-timed, aren’t you just perpetuating the artificial distinction from stress-timed?

    I'm trying to say the distinction is not artificial; it's merely 1) commonly misleadingly mislabeled and 2) not a dichotomy, but a continuum – not all languages occupy an extreme.

    I’m no intonation expert, but for what it’s worth, I think any language can be described using a moraic analysis.

    If by that you mean that a mora is a consistent time unit, then no: that fails in English and German, for example. That's why it took me years to understand how (Latin) hexameters work, and why German poetry was entirely limited to doggerel between the late 14th century (when a sound shift made the existing poetic meters, which were all partially length-based, completely unworkable) and 1619 (when Martin Opitz discovered that it's possible to invent purely stress-based meters).

  10. AntC said,

    February 28, 2023 @ 7:22 pm

    An example in Te Reo Māori — with musical notation/timing.

    This is a particularly stylised/ceremonial chant, but does reflect the rhythm of that language being quite different to English.

  11. Christian Johnson said,

    February 28, 2023 @ 9:39 pm

    First, a disclaimer: I'm not a linguist, just a writer who enjoys language(s) and admires linguistics from a (safe) distance.

    I am, however, an American in Hong Kong who hears, but doesn't speak, a lot of both Cantonese and Mandarin/Putonghua.

    I'd be *super* interested to see a similar analysis swapping in Cantonese for Putonghua. From what I've read and heard, Cantonese is farther along the supposed "syllable-timed" spectrum than Mandarin, and in my experience native Cantonese speakers of English tend to sound even "choppier" than native Mandarin speakers. But I also understand that at the syllable level, Cantonese preserves several consonant finals that Mandarin has lost: for example, the "Bei" in Beijing becomes "pak" in Canto. Could that, rather than some poorly-evidenced concept of syllable timing, account for much of the perceived difference?

  12. Jonathan Smith said,

    February 28, 2023 @ 11:02 pm

    Impressionistically there is a strong and clear cline from the word-stressier Mandarins of the north to the syllable-ier Mandarins of the south inclusive of Taiwan (I suppose this is affect #1a in imitations of southern speech after s=sh etc.); presumably there is a connection here to the southern languages at several levels (indeed earlier "Mandarins" are/were southern-ier.) Based on one sentence :D, TIMIT Mandarin is self-conscious readings of news-type sentences (and was recorded in Shanghai), so may well be a good way down the cline in the aggregate. I would suggest TIMIT Guanzhong but maybe there won't be much difference given the nature of the recordings (and the state of actual Guanzhong Chinese…)

  13. Jerry Packard said,

    February 28, 2023 @ 11:50 pm

    Regarding the syllable- mora-timed distinction, it strikes me that mora can do anything syllable can do, but not vice-versa, so ‘mora’ gives a finer-grained distinction that not all languages use, while all languages use ‘syllable’. But I think this is what both David and Chris are saying, unless I’m mistaken.

  14. Bob Ladd said,

    March 1, 2023 @ 4:15 am

    I second Jerry Packard's request for a little elaboration on what the graphs in the OP are graphs of.

  15. Chris Button said,

    March 1, 2023 @ 6:49 am

    I think we’re straying a little from the pre-boundary lengthening question (although I’m still unclear why “nuclear” contour accents in the British school don’t cover that automatically, nor do I admittedly fully understand the point of the o.p.), but on the whole mora issue…

    My poorly informed take is that it is very unfortunate that Japanese is always rolled out as a “mora” timed language. Yes, haiku are mora timed, so English renditions using syllables for timing are always awkwardly wrong (much to the ignorance of many translators). But you can also mora-time English. A word like “brighter” is two syllables “brigh.ter”, but many would argue that that syllable division is incorrect and should be “” to reflect how we actually speak unless carefully breaking the word into syllables when sounding it out. There we have a -t at the end of “bright”, but take something like one-syllable “fire” and we get “fie.r”. What’s even more weird is how a similar word like “near” for a Brit is “nea.r” even in “” but for an American it becomes “” (I’m ignoring here any discussion on how long vowels or diphthongs should be handled in terms of moras).

    As for accent, schwa reductions on unaccented syllables are no different in terms of overarching intonation theory from tone reductions, devoicing of vowels, shortening etc. in languages traditionally classified as “syllable timed”.

    Just my two cents, as a non-specialist on the topic. The reason I care so much is that the Old Chinese distinction of syllables into two types (conventionally called type A and type B) has a lot to do with mora timing, but specialists can’t seem to get their heads around that so end up coming up with all these other theories on how it could have emerged.

  16. Jerry Packard said,

    March 1, 2023 @ 8:07 am

    My teacher (DRL) to the rescue!

  17. Bob Ladd said,

    March 1, 2023 @ 10:00 am

    @ JMGN, Chris Button: The notion of pre-boundary lengthening is fundamentally simple: Other things being equal, the same word or syllable will be longer at the end of a word/phrase/utterance than it is elsewhere. The details are extremely complicated (e.g. Turk and Shattuck-Hufnagel in Journal of Phonetics 2007), but the point is that it is about actual phonetic DURATION, not phonological length as reflected in poetic meter, etc., and not intonational structure (i.e. I don't understand how the British school notion of intonational tail is relevant). Similarly, the notion of stress-timing, mora-timing, etc., at least in its origins, was literally about duration and periodicity: syllables and moras and stress-groups (feet?) were supposedly all "isochronous", i.e. the same duration, and all kinds of adjustments and compensations were assumed to take place to bring about actual phonetic isochrony. There is very little evidence for adjustments of this sort.

  18. Chris Button said,

    March 1, 2023 @ 10:25 am

    @ Bob Ladd

    Thanks for the clarifications there.

    I was thinking that the contour tones on intonational tails automatically added length relative to the level pitch tones that precede them in the intonation phrase. I suppose it would be particularly salient if the nucleus (usually at or close to the end) was one syllable at the very end of the intonation phrase rather than one with the tail spreading out across trailing post-nuclear syllables.

  19. Chris Button said,

    March 1, 2023 @ 10:30 am

    Or to put it another way, a more traditional approach to intonation that does not focus solely on level pitch adjustments (as I tend to see nowadays) might be a more natural way to account for pre-boundary lengthening via longer contour tones, albeit perhaps more effectively in some cases than others.

  20. Sarah C said,

    March 1, 2023 @ 4:58 pm

    Seems weird to me that the low-frequency areas in the two plots are so different. Something about recording conditions, or actual linguistic differences?

  21. Natasha Warner said,

    March 3, 2023 @ 2:03 pm

    I could also use more explanation of the graphs (and I should be able to understand this), although I think maybe I figured it out after a while. As for what's really going on with stress/syllable/mora-"timed" languages, I still think Rebecca Dauer (1983, 1987) got it just right (as Anne Cutler put it to me one time). Dauer proposed that these are not some timing category that speakers are trying to use like a clock during speech production; instead, a bunch of phonological factors in combination lead to patterns that we have identified as stress rhythm, syllable rhythm, or mora rhythm. Having more or less alternating stress that affects amplitude, duration, and vowel quality, along with not having a vowel length distinction or having one cued by quality rather than purely duration, makes a language more like the ones we call stress rhythm. Having a huge duration-only vowel length distinction and consonant length distinction, like Japanese, along with not having any particularly prominent syllable (since Japanese pitch accent doesn't make alternating patterns of prominence/reduction) makes a language more like the ones we think of as mora rhythm. It's not that speakers are trying to make anything isochronous, or even trying to make a particular rhythm; it's that various phonological factors create a rhythm, and we as linguists have identified a few more or less relatively categorical types of those. Since we know that very young newborns detect differences among rhythm classes of languages (Nazzi et al's work), there must be something there; we're not just imagining it entirely. But I'd rather call it "rhythm" than "timing," given all the evidence against isochrony. I did have a lot of fun doing one project on this and trying to find some new ways to measure whether there was anything like Japanese speakers making an attempt to make moras more isochronous than they would have been otherwise (there wasn't)–apologies for citing my own paper.

    On the graphs in this post, how does this compare to Steven Greenberg and Takayuki Arai's work showing a surprisingly similar distribution of syllable durations in English and Japanese? It seems to be measured a very different way, and I guess since they used syllable duration, they had no way to find this result of more or less alternating strong syllables in English.

    I love the quote about this hypothesis being so widely accepted with so little evidence. It drives me nuts when my own students, not knowing I've worked on this topic, use the terms "stress/syllable/mora-timed" as if they were real. But on the other hand, I've had to advise students not to work on speech rhythm at all, because it seems like many reviewers are now so frustrated by it that papers on any aspect of speech rhythm often get harshly rejected, even if they are not in any way claiming "timing" or "isochrony." Since we know from the infant studies that some form of rhythm differences are there across languages (even if not categories), that seems like a shame.

    Arai, T., & Greenberg, S. (1997, September). The temporal properties of spoken Japanese are similar to those of English. In EUROSPEECH.
    Dauer, R. M. (1983). Stress-timing and syllable-timing reanalyzed. Journal of Phonetics, 11(1), 51-62.
    Dauer, R. M. (1987, August). Phonetic and phonological components of language rhythm. In Proceedings of the 11th International Congress of Phonetic Sciences (Vol. 5, pp. 447-450).
    Nazzi, T., Bertoncini, J., & Mehler, J. (1998). Language discrimination by newborns: toward an understanding of the role of rhythm. Journal of Experimental Psychology: Human Perception and Performance, 24(3), 756.
    Warner, N., & Arai, T. (2001). The role of the mora in the timing of spontaneous Japanese speech. The Journal of the Acoustical Society of America, 109(3), 1144-1156.

  22. Jerry Packard said,

    March 3, 2023 @ 5:28 pm

    Thank you for your nice post.

  23. Chris Button said,

    March 6, 2023 @ 10:56 am

    Yes, thanks for the post.

    Regarding isochrony, it’s interesting since I recall someone once telling me that Kuki-Chin Tibeto-Burman languages didn’t have underlying phonological distinctions like “laam” versus “lamm” in terms of length and that it was just “laam” with isochronic lengthening for “lamm” on the surface. The discomfort seemed to come from the idea that a consonantal coda could pattern like a nuclear vowel with distinctive length.

    The sad thing is that these observations were made in the 1960s for Kuki-Chin (and they are by no means the only languages worldwide showing this), and yet academics still tend to go down the consonant-vowel rather than obstruent-sonorant distinction. It seems to me to be largely the same reason why they all want triangular vowel systems in all proto-languages as well even when all the evidence suggests otherwise at the underlying phonological level. Sigh … plus ça change …

    Apologies for the random digression.

