In Meg Wilson's post on marmoset vs. human conversational turn-taking, I learned about Tanya Stivers et al., "Universals and cultural variation in turn-taking in conversation", PNAS 2009, which compared response offsets to polar ("yes-no") questions in 10 languages. Here's their plot of the data for English:
Based on examination of a Dutch corpus, they argue that "the use of question–answer sequences is a reasonable proxy for turn-taking more generally"; and in their cross-language data, they found that "the response timings for each language, although slightly skewed to the right, have a unimodal distribution with a mode offset for each language between 0 and +200 ms, and an overall mode of 0 ms. The medians are also quite uniform, ranging from 0 ms (English, Japanese, Tzeltal, and Yélî-Dnye) to +300 ms (Danish, ‡Ākhoe Hai‖om, Lao) (overall cross-linguistic median +100 ms)."
So for today's Breakfast Experiment™, I decided to take a look at similar measurements from one of the standard speech-technology datasets, namely the January 29, 2003 release of the Mississippi State alignments of the Switchboard corpus. (For details on the corpus itself, see J.J. Godfrey et al., "SWITCHBOARD: telephone speech corpus for research and development", IEEE ICASSP 1992.) Here's a random selection from one of the conversations:
The (hand-checked) alignments indicate the start and end of all words, noises, and silences, for each speaker in each conversation. I counted all cases in which a speaker starts talking after the other speaker has been talking, either starting after the other speaker has stopped (yielding a positive offset equal to the silent gap), or before the other speaker has stopped (yielding a negative offset equal to the amount of overlap).
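For readers who want to try this at home, here is a minimal Python sketch of one way such offsets might be computed. It is not the script behind the plots in this post, and it rests on assumptions that the description above does not settle: each speaker's words have already been read into (start, end) pairs in seconds, with silences and noises dropped; words are merged into talk spurts using a hypothetical `max_pause` threshold; and a "speaker change" is taken to be two consecutive spurt onsets that come from different speakers.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def merge_words(words: List[Interval], max_pause: float = 0.5) -> List[Interval]:
    """Merge word intervals into talk spurts.

    Pauses of at most max_pause seconds inside one speaker's channel are
    bridged. The 0.5 s default is an illustrative assumption, not a value
    taken from the post.
    """
    spurts: List[Interval] = []
    for start, end in sorted(words):
        if spurts and start - spurts[-1][1] <= max_pause:
            # Extend the current spurt to cover this word.
            spurts[-1] = (spurts[-1][0], max(spurts[-1][1], end))
        else:
            spurts.append((start, end))
    return spurts


def speaker_change_offsets(a: List[Interval], b: List[Interval]) -> List[float]:
    """Return one offset per speaker change, in seconds.

    Positive values are silent gaps; negative values mean the new speaker
    started before the previous speaker's spurt had ended (overlap).
    """
    events = sorted([(s, e, "A") for s, e in a] + [(s, e, "B") for s, e in b])
    offsets = []
    for prev, cur in zip(events, events[1:]):
        if prev[2] != cur[2]:  # onset by a different speaker than the last onset
            offsets.append(cur[0] - prev[1])
    return offsets
```

Note that this simple onset-ordering rule skips some cases (for instance, a speaker who resumes after a long silence following their own back-channel), so a more careful definition of "speaker change" would give somewhat different counts.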
The result is a distribution in general agreement with Stivers et al. (although I'm looking at all speaker changes, not just answers to polar questions):
But the much larger dataset (about 2,000 times as many offset measurements) brings out some perhaps-interesting additional structure, especially an apparent increase in counts around -100 msec, 0 msec, and 100 msec (indicated by red vertical lines in the plot). This might be connected to the "periodic structure" postulated in Wilson & Zimmerman 1986 and Wilson & Wilson 2005, though they found conversation-specific differences in the time-structure suggesting that such effects might be washed out in a collective histogram of this sort.
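Whether bumps like these show up depends partly on how the offsets are binned. As a rough sketch, assuming the offsets are available as a list of millisecond values, fixed-width bin counts can be computed with the standard library alone. The function name `bin_offsets` and the 10 msec default bin width are illustrative choices, not the settings used for the plot.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def bin_offsets(offsets_ms: Iterable[float], bin_width_ms: int = 10) -> Dict[int, int]:
    """Count offsets in fixed-width bins, keyed by each bin's lower edge (msec).

    math.floor is used so that negative offsets fall into the bin below
    them rather than being truncated toward zero.
    """
    counts = Counter(
        math.floor(x / bin_width_ms) * bin_width_ms for x in offsets_ms
    )
    return dict(sorted(counts.items()))
```

Comparing the counts in the bins around -100, 0, and 100 msec against their neighbors, at a few different bin widths, is one simple way to check that an apparent peak is not just a binning artifact.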
Since there's some demographic data available for the speakers in the Switchboard corpus, we can look at possible differences according to sex, age, years of education, geographical region, and so on. For this morning, I'll just take a look at speaker sex, and in particular whether there's any difference in speaker-change offsets between female/female and male/male conversations:
It's clear from the plot that overall, speaker changes in the FF conversations have shorter offsets than those in the MM conversations. (FWIW, the median offset is 130 msec for the MM conversations vs. 30 msec for the FF conversations.) As usual, this raises more questions: Is this a difference across all types of interaction? Or are things different for "back-channel" responses vs. question-answer pairs vs. substantive comments? And what happens in mixed-sex conversations?
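A comparison of this kind comes down to grouping offsets by the sexes of the two speakers and taking a median per group. Here is a small sketch, assuming each measurement has been reduced to a hypothetical (sex_a, sex_b, offset_ms) record; the real corpus metadata would need to be joined to the offsets first, and that step is not shown.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, float]  # (sex of speaker A, sex of speaker B, offset in msec)


def median_offset_by_pairing(records: Iterable[Record]) -> Dict[str, float]:
    """Return the median offset for each conversation type: "FF", "FM", or "MM"."""
    groups = defaultdict(list)
    for sex_a, sex_b, offset_ms in records:
        # Sort the pair so that F/M and M/F conversations share one key.
        key = "".join(sorted((sex_a, sex_b)))
        groups[key].append(offset_ms)
    return {key: median(values) for key, values in groups.items()}
```

The same grouping would extend to the mixed-sex question by keeping track of which speaker is taking the turn, rather than pooling all offsets in an "FM" conversation together.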
It might also be interesting to look at speaker age effects, regional effects, and so on. I've run out of time this morning — but isn't it fun to be able to do an interesting empirical investigation in an hour or so? And isn't it too bad that there's not more communication between the disciplines centered on conversational analysis and the disciplines centered on speech technology?