Here's what you get if you align 11 million words of English-language audiobooks with the associated texts, divide it all into phrases by breaking at silent pauses greater than 150 milliseconds, and average the word durations by position in phrases of lengths from one word to fifteen words:
The audiobook sample in this case comes from LibriSpeech (see Vassil Panayotov et al., "Librispeech: An ASR corpus based on public domain audio books", IEEE ICASSP 2015). Neville Ryant and I have been collecting and analyzing a variety of large-scale speech datasets (see e.g. "Large-scale analysis of Spanish /s/-lenition using audiobooks", ICA 2016; "Automatic Analysis of Phonetic Speech Style Dimensions", Interspeech 2016), and as part of that process, we've refactored and realigned the LibriSpeech sample, resulting in 5,832 English-language audiobook chapters from 2,484 readers, comprising 11,152,378 words of text and about 1,571 hours of audio. (This is a small percentage of the English-language data available from LibriVox, which is somewhere north of 50,000 hours of English audiobook at present.)
As a check on this process, I wrote a little script to divide our LibriSpeech dataset into pause groups and take the average of word durations by pause-group position, resulting in the graph above. Aligned at phrase ends rather than phrase starts, the same data looks like this:
There are some interesting similarities and differences with the analogous patterns in some other collections that I've looked at over the years:
But one consistent and unsurprising feature in all cases is the lengthening of pre-pausal words — the spoken equivalent of ritardando al fine.
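For readers who want to try the pause-group averaging on their own aligned data, here is a minimal Python sketch of the idea. It is not the script used for the plots above. The input format (a time-ordered list of `(word, start, end)` tuples in seconds) and the function names `pause_groups` and `mean_duration_by_position` are assumptions made for illustration. Only the 150-millisecond threshold and the one-to-fifteen-word range come from the description above.

```python
from collections import defaultdict

PAUSE_THRESHOLD = 0.150  # seconds; the 150 ms criterion described in the text


def pause_groups(words, threshold=PAUSE_THRESHOLD):
    """Split a time-ordered list of (word, start, end) tuples into pause groups.

    A new group starts whenever the silent gap between the end of one word
    and the start of the next is greater than `threshold`.
    (Hypothetical input format, assumed for this sketch.)
    """
    groups, current = [], []
    for word in words:
        if current and word[1] - current[-1][2] > threshold:
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)
    return groups


def mean_duration_by_position(groups, max_len=15, from_end=False):
    """Return {(phrase_length, position): mean word duration in seconds}.

    Positions are 1-based, counted from the phrase start by default,
    or from the phrase end (1 = pre-pausal word) when `from_end` is True.
    Groups longer than `max_len` words are skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group in groups:
        n = len(group)
        if n > max_len:
            continue
        for i, (_, start, end) in enumerate(group):
            pos = n - i if from_end else i + 1
            sums[(n, pos)] += end - start
            counts[(n, pos)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy alignment (invented numbers): a 300 ms gap after "cat" splits the phrases.
words = [("the", 0.00, 0.10), ("cat", 0.10, 0.40),
         ("sat", 0.70, 1.10), ("down", 1.12, 1.60)]
groups = pause_groups(words)
by_start = mean_duration_by_position(groups)
by_end = mean_duration_by_position(groups, from_end=True)
```

Calling the second function with `from_end=True` gives the end-aligned view, in which position 1 is always the pre-pausal word.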
Because this final lengthening is (so to speak) amortized over different numbers of words or syllables in phrases of different lengths, average word durations depend systematically on phrase length. Here's the plot of mean word duration as a function of phrase length in words, from this same LibriSpeech exercise. This dataset is large enough that the hyperbolic relationship shows up very nicely:
One important consequence is that you should be careful about using duration (of words, syllables, or segments) as an experimental variable without taking into account its interaction with phrase length and phrase position. This is crucial if different subsets of your data have different distributions of phrase lengths, as is often the case.
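To see how this confound can arise, consider a deliberately oversimplified model, sketched below in Python. Every word lasts a fixed base duration, and the phrase-final word gets a fixed extra amount, so that the mean word duration in an n-word phrase is base + extra/n (a hyperbola in n). The numbers (`base=0.25`, `final_extra=0.15`) and the two lists of phrase lengths are invented for illustration and are not estimates from LibriSpeech or any other dataset.

```python
def mean_word_duration(n_words, base=0.25, final_extra=0.15):
    """Mean word duration (seconds) in an idealized n-word phrase.

    Assumption for this sketch: every word lasts `base` seconds and the
    phrase-final word lasts an additional `final_extra` seconds, so the
    mean works out to base + final_extra / n_words.
    """
    return (n_words * base + final_extra) / n_words


def overall_mean(phrase_lengths, **model):
    """Mean word duration pooled over all words in the given phrases."""
    total_words = sum(phrase_lengths)
    total_time = sum(n * mean_word_duration(n, **model) for n in phrase_lengths)
    return total_time / total_words


# Two hypothetical speaker groups with identical timing parameters
# but different phrase-length distributions.
short_phrases = [1, 2, 2, 3]
long_phrases = [6, 8, 10, 12]

mean_short = overall_mean(short_phrases)
mean_long = overall_mean(long_phrases)
print(f"short-phrase group: {mean_short:.3f} s per word")
print(f"long-phrase group:  {mean_long:.3f} s per word")
```

In this toy setup the group producing shorter phrases shows a longer pooled mean word duration, even though word timing is identical by construction. Real final lengthening is not confined to a single word, so this is only a caricature of the effect.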
Here's an example from a recent workshop paper on the analysis of ADOS ("Autism Diagnostic Observation Schedule") interview segments (Julia Parish-Morris et al., "Exploring Autism Spectrum Disorders Using HLT", CLPsych 2016):
There are overall differences in mean word duration between the ASD ("Autism Spectrum Disorders") and TD ("Typically Developing") groups, but unless phrase length is taken into account, such differences could be obscured (or exaggerated) by group differences in the distribution of phrase lengths.
Update — here are the syllable durations for the same dataset, calculated vowel-onset-to-vowel-onset for non-phrase-final syllables, and vowel-onset-to-silence-onset for phrase-final syllables:
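As a rough sketch of the measurement just described, the Python function below computes syllable durations for a single phrase. The input format (a list of vowel-onset times plus the onset time of the following silence, all in seconds) and the function name `syllable_durations` are assumptions for illustration, and the example times are invented.

```python
def syllable_durations(vowel_onsets, silence_onset):
    """Syllable durations (seconds) for one phrase.

    Non-final syllables run from one vowel onset to the next;
    the phrase-final syllable runs from its vowel onset to the
    onset of the following silence.

    vowel_onsets:  ascending vowel-onset times, one per syllable
    silence_onset: time at which the phrase-final pause begins
    """
    boundaries = list(vowel_onsets) + [silence_onset]
    return [later - earlier
            for earlier, later in zip(boundaries, boundaries[1:])]


# Invented times for a three-syllable phrase ending at 1.00 s.
durations = syllable_durations([0.05, 0.30, 0.62], 1.00)
```

Note that under this definition any material before the first vowel onset of a phrase is not assigned to any syllable.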