[Warning: More than usually geeky...]
During the past decade or two, there's been a growing body of work arguing for a special connection between endogenous brain rhythms and timing patterns in speech. Thus Anne-Lise Giraud & David Poeppel, "Cortical oscillations and speech processing: emerging computational principles and operations", Nature Neuroscience 2012:
Neuronal oscillations are ubiquitous in the brain and may contribute to cognition in several ways: for example, by segregating information and organizing spike timing. Recent data show that delta, theta and gamma oscillations are specifically engaged by the multi-timescale, quasi-rhythmic properties of speech and can track its dynamics. We argue that they are foundational in speech and language processing, 'packaging' incoming information into units of the appropriate temporal granularity. Such stimulus-brain alignment arguably results from auditory and motor tuning throughout the evolution of speech and language and constitutes a natural model system allowing auditory research to make a unique contribution to the issue of how neural oscillatory activity affects human cognition.
Most of the attention focuses on the "theta band" at about 4-8 Hz, e.g. Huan Luo and David Poeppel, "Phase Patterns of Neuronal Responses Reliably Discriminate Speech in Human Auditory Cortex", Neuron 2007:
How natural speech is represented in the auditory cortex constitutes a major challenge for cognitive neuroscience. Although many single-unit and neuroimaging studies have yielded valuable insights about the processing of speech and matched complex sounds, the mechanisms underlying the analysis of speech dynamics in human auditory cortex remain largely unknown. Here, we show that the phase pattern of theta band (4–8 Hz) responses recorded from human auditory cortex with magnetoencephalography (MEG) reliably tracks and discriminates spoken sentences and that this discrimination ability is correlated with speech intelligibility. The findings suggest that an ∼200 ms temporal window (period of theta oscillation) segments the incoming speech signal, resetting and sliding to track speech dynamics. This hypothesized mechanism for cortical speech analysis is based on the stimulus-induced modulation of inherent cortical rhythms and provides further evidence implicating the syllable as a computational primitive for the representation of spoken language.
Or Uri Hasson et al., "Brain-to-Brain coupling: A mechanism for creating and sharing a social world", Trends in Cognitive Sciences, 2012:
During speech communication two brains are coupled through an oscillatory signal. Across all languages and contexts, the speech signal has its own amplitude modulation (i.e., it goes up and down in intensity), consisting of a rhythm that ranges between 3–8 Hz. This rhythm is roughly the timescale of the speaker’s syllable production (3 to 8 syllables per second). The brain, in particular the neocortex, also produces stereotypical rhythms or oscillations. Recent theories of speech perception point out that the amplitude modulations in speech closely match the structure of the 3–8 Hz theta oscillation. This suggests that the speech signal could be coupled and/or resonate (amplify) with ongoing oscillations in the auditory regions of a listener’s brain.
A possible weakness of Luo and Poeppel 2007 (a fascinating and deservedly influential study) was that the same phase analysis that they found to identify the brain responses to different sentences also worked in exactly the same way when applied to the amplitude envelope of the original audio. This suggests that simple modulation of auditory-cortex response by input signal amplitude might be the main mechanism, rather than any more elaborate process of phase-locking of endogenous brain rhythms.
I'm not yet convinced that there's a special role for endogenous rhythms in speech production and perception, beyond the necessary modulation of brain activity by the necessarily cyclic manipulation of speech articulators in production, and by the associated cyclic variation in acoustic amplitudes. All the same, this meme is associated with a range of interesting hypotheses, some of which may well turn out to be true and important.
But I've also noticed that the properties of the overall amplitude-modulation of the speech signal (caused by opening and closing the mouth, by turning voicing and noise-generation mechanisms on and off, and to some extent by varying subglottal pressure) are in some respects not quite as the focus on theta-scale rhythms predicts. To indicate what I'm curious about, I'll show a simple analysis of the TIMIT dataset (which Luo and Poeppel also used) in the 1-15 Hz range.
Here's the spectrogram, waveform, and amplitude envelope of a read sentence (one of the 6300 read sentences in TIMIT):
Here's the average spectrum of all 6300 sentences in the range of 1-15 Hz, calculated from the rectified waveforms of the speech files:
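For readers who want to try something similar, here is a minimal sketch of one way to get a 1-15 Hz spectrum from rectified waveforms. It is not necessarily the procedure behind the plot: the function names, the 0.1 Hz frequency grid, the mean removal, and the zero-padding to a common FFT length are all my assumptions, and the demo uses synthetic 4 Hz-modulated noise rather than TIMIT audio.

```python
import numpy as np

def envelope_spectrum(wave, sr, fmin=1.0, fmax=15.0, resolution=0.1):
    """Magnitude spectrum of the rectified waveform between fmin and fmax (Hz)."""
    env = np.abs(np.asarray(wave, dtype=float))  # full-wave rectification
    env = env - env.mean()                       # remove DC so it doesn't dominate
    # Assumption: zero-pad every file to the same FFT length so that spectra
    # share one frequency grid (files longer than 1/resolution s get truncated).
    nfft = int(round(sr / resolution))
    spec = np.abs(np.fft.rfft(env, n=nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / sr)
    keep = (freqs >= fmin) & (freqs <= fmax)
    return freqs[keep], spec[keep]

def average_spectrum(waves, sr, **kwargs):
    """Average the per-file envelope spectra over a list of waveforms."""
    specs = []
    for w in waves:
        freqs, spec = envelope_spectrum(w, sr, **kwargs)
        specs.append(spec)
    return freqs, np.mean(specs, axis=0)

# Synthetic stand-in for speech: white noise with a 4 Hz amplitude modulation.
sr = 16000
t = np.arange(3 * sr) / sr
rng = np.random.default_rng(0)
demo_waves = [
    (1.0 + 0.8 * np.sin(2 * np.pi * 4.0 * t)) * rng.standard_normal(t.size)
    for _ in range(5)
]
demo_freqs, demo_avg = average_spectrum(demo_waves, sr)
demo_peak_hz = demo_freqs[np.argmax(demo_avg)]
print(f"largest component near {demo_peak_hz:.1f} Hz")
```

On real data you would replace `demo_waves` with the samples read from each speech file; whether to average magnitudes, powers, or log spectra is a further choice that this sketch does not settle.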
Things don't look very different if we base the spectra on the RMS amplitude in similar 25-msec windows:
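A rough sketch of the RMS variant, again under stated assumptions: I use consecutive non-overlapping 25-msec windows (so the envelope is sampled at 40 Hz, enough to cover 1-15 Hz), which may not match the framing actually used for the plot. The function names and the 0.1 Hz grid are hypothetical, and the demo signal is synthetic.

```python
import numpy as np

def rms_envelope(wave, sr, win_s=0.025):
    """RMS amplitude in consecutive non-overlapping windows (assumed framing)."""
    wave = np.asarray(wave, dtype=float)
    win = int(round(win_s * sr))        # samples per window (400 at 16 kHz)
    n_frames = wave.size // win         # drop the ragged tail
    frames = wave[: n_frames * win].reshape(n_frames, win)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return rms, sr / win                # envelope and its frame rate in Hz

def rms_spectrum(wave, sr, fmin=1.0, fmax=15.0, resolution=0.1):
    """Magnitude spectrum of the RMS envelope between fmin and fmax (Hz)."""
    rms, frame_rate = rms_envelope(wave, sr)
    rms = rms - rms.mean()              # remove DC
    nfft = int(round(frame_rate / resolution))  # common grid via zero-padding
    spec = np.abs(np.fft.rfft(rms, n=nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / frame_rate)
    keep = (freqs >= fmin) & (freqs <= fmax)
    return freqs[keep], spec[keep]

# Synthetic check: noise modulated at 4 Hz should give a peak near 4 Hz.
rms_sr = 16000
rms_t = np.arange(3 * rms_sr) / rms_sr
rms_rng = np.random.default_rng(1)
rms_wave = (1.0 + 0.8 * np.sin(2 * np.pi * 4.0 * rms_t)) * rms_rng.standard_normal(rms_t.size)
rms_freqs, rms_spec = rms_spectrum(rms_wave, rms_sr)
rms_peak_hz = rms_freqs[np.argmax(rms_spec)]
print(f"largest component near {rms_peak_hz:.1f} Hz")
```

Overlapping windows with a shorter hop would give a smoother envelope at a higher frame rate; the non-overlapping version is just the simplest thing that reaches 15 Hz.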
2.4 Hz corresponds to a period of 417 msec, which is too long for syllables in this material. In fact, the TIMIT dataset as a whole has 80363 syllables in 16918.1 seconds, for an average of 210.5 msec per syllable, so that 417 msec is within 1% of the average duration of two syllables.
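The arithmetic is easy to check. The counts below are the ones quoted above; the variable names are mine.

```python
n_syllables = 80363          # syllables in TIMIT, as quoted above
total_seconds = 16918.1      # total duration, as quoted above

mean_syllable_ms = 1000.0 * total_seconds / n_syllables  # average syllable duration
period_ms = 1000.0 / 2.4                                 # period of a 2.4 Hz component
ratio = period_ms / (2.0 * mean_syllable_ms)             # period vs. two syllables

print(f"mean syllable: {mean_syllable_ms:.1f} msec")
print(f"2.4 Hz period: {period_ms:.0f} msec")
print(f"period / (2 syllables): {ratio:.3f}")
```

The ratio comes out at roughly 0.99, so a 2.4 Hz component has a period very close to the average duration of two syllables.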
So why is the spectrum roughly flat up to 2.4 Hz or so? And why does there seem to be a different slope between (roughly) 3 and 7.5 Hz, compared to 7.5 to 15 Hz?
One hypothesis might be that this somehow reflects the organization of English speech rhythm into "feet" or "stress groups", typically consisting of a stressed syllable followed by one or more unstressed syllables. But this would predict that similar analysis of material in other languages would show a different pattern — and I'm skeptical, mostly as a matter of principle but also based on the fact that human listeners trying to distinguish between two languages based on lowpass-filtered speech don't typically do very well (e.g. around 65%, where chance is 50%).
Unfortunately there aren't any datasets comparable to TIMIT in other languages; but I'll see what I can come up with as a more-or-less parallel test in languages that are said to be "syllable timed" rather than "stress timed".