This post follows up on Mark Dingemanse's guest post, "Some constructive-critical notes on the informal overlap study", which in turn comments on Kieran Snyder's guest post, "Men interrupt more than women".
As part of a project on the application of speech and language technology to meetings, almost 15 years ago, researchers at the International Computer Science Institute (ICSI) recorded, transcribed and analyzed a large number of their regular technical meetings. The results were published by the Linguistic Data Consortium as the ICSI Meeting speech and transcripts. As the publication's documentation explains:
75 meetings collected at the International Computer Science Institute in Berkeley during the years 2000-2002. The meetings included are "natural" meetings in the sense that they would have occurred anyway: they are generally regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. In recording meetings of this type, we hoped to capture meeting dynamics and speaking styles that are as natural as possible given that speakers are wearing close-talking microphones and are fully cognizant of the recording process. The speech files range in length from 17 to 103 minutes, but generally run just under an hour each.
There are a total of 53 unique speakers in the corpus. Meetings involved anywhere from three to 10 participants, averaging six. The corpus contains a significant proportion of non-native English speakers, varying in fluency from nearly-native to challenging-to-transcribe.
There's an extensive set of "dialogue act" annotations of this material, available from ICSI, and described in Elizabeth Shriberg et al., "The ICSI Meeting Recorder Dialog Act (MRDA) Corpus", HLT 2004.
I don't have a lot of time this morning. But, spurred by Mark Dingemanse's guest post, I thought I'd take a quick look at the gendered patterns of overlap in the ICSI Meeting Corpus.
The start and end times of each participant's contributions were registered in the transcription process, e.g.
<Segment StartTime="669.140" EndTime="670.895" Participant="fe016">
So <Emphasis> another </Emphasis> idea I w- t- had
<Segment StartTime="671.053" EndTime="673.050" Participant="fe016">
just now actually for the <Emphasis> demo </Emphasis> was
<Segment StartTime="673.700" EndTime="677.511" Participant="fe016">
whether it might be of interest to sh- to show some of the prosody uh <VocalSound Description="mouth"/>
<Segment StartTime="677.644" EndTime="679.466" Participant="fe016">
work that <Emphasis> Don's </Emphasis> been doing.
<Segment StartTime="679.476" EndTime="680.182" Participant="me013">
<Segment StartTime="680.520" EndTime="688.950" Participant="fe016">
Um actually show some of the <Emphasis> features </Emphasis> and then show for instance a task like finding sentence boundaries or finding turn boundaries.
<Segment StartTime="689.290" EndTime="689.707" Participant="fe016">
<Segment StartTime="690.110" EndTime="696.221" Participant="fe016">
you know, you can show that <Emphasis> graphically, </Emphasis> sort of what the features are doing. It, you know, it doesn't work <Emphasis> great </Emphasis> but it's definitely giving us <Emphasis> something. </Emphasis>
<Segment StartTime="696.660" EndTime="698.160" Participant="me013">
Well I think at –
<Segment StartTime="696.689" EndTime="698.692" Participant="fe016">
I don't know if that would be of interest or not.
<Segment StartTime="698.160" EndTime="705.848" Participant="me013">
at the very least we're gonna want something <Emphasis> illustrative </Emphasis> with that cuz I'm gonna want to <Emphasis> talk </Emphasis> about it and so i- if there's something that shows it <Emphasis> graphically </Emphasis>
<Segment StartTime="702.830" EndTime="703.643" Participant="fe016">
<Segment StartTime="706.390" EndTime="708.699" Participant="me013">
it's much better than me just having a bullet point
<Segment StartTime="709.050" EndTime="711.262" Participant="me013">
pointing at something I don't <Emphasis> know </Emphasis> much about, so.
<Segment StartTime="709.650" EndTime="713.373" Participant="fe016">
I mean, you're looking at this now – Are you looking at Waves or Matlab?
As a result, it's easy to write a script to detect overlaps of various types and amounts. So for this morning's Breakfast Experiment™, that's what I did.
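A minimal sketch of the first step such a script needs, pulling the timing and speaker out of each Segment tag. This is not the script used for the counts in this post. The `Segment` class, the `parse_segments` helper and the regular expression are illustrative names, and reading gender from the first letter of the participant ID ("fe016", "me013") is an assumption about the corpus's ID scheme.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    start: float   # seconds from the start of the meeting
    end: float
    speaker: str   # participant ID, e.g. "fe016"

    @property
    def gender(self) -> str:
        # Assumption: the first letter of the ID encodes gender ("f" or "m").
        return self.speaker[0]


# Matches the opening Segment tags as they appear in the excerpt above.
SEGMENT_TAG = re.compile(
    r'<Segment\s+StartTime="([\d.]+)"\s+EndTime="([\d.]+)"\s+Participant="(\w+)"'
)


def parse_segments(text: str) -> list[Segment]:
    """Return one Segment per tag found in the transcript text."""
    return [Segment(float(s), float(e), p) for s, e, p in SEGMENT_TAG.findall(text)]


sample = '''<Segment StartTime="696.660" EndTime="698.160" Participant="me013">
Well I think at
<Segment StartTime="696.689" EndTime="698.692" Participant="fe016">
I don't know if that would be of interest or not.'''

segments = parse_segments(sample)
for seg in segments:
    print(seg.speaker, seg.gender, round(seg.end - seg.start, 3))
```

A regular expression is enough here because only the tag attributes matter; a real XML parser would be the safer choice for the full transcript files.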
Contrary to what the documentation says, I found 59 (rather than 53) different participants identified across the 75 meetings — 43 male and 16 female. 43/(43+16) = 73% male.
Counting conversational segments (as divided in the transcripts) I found 101,739 segments by male speakers and 27,355 segments by female speakers — 101739/(101739+27355) = 79% of the segments were produced by male speakers.
Adding up the duration of the segments, I found 65,919.2 seconds in female-produced segments and 232,460 seconds in male-produced segments — 232460/(232460+65919.2) = 78% male speaking time.
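The three tallies above (speakers, segments, and speaking time by gender) could be computed along these lines. This is a sketch under the same assumption that the first letter of a participant ID gives the speaker's gender; `tally_by_gender` and `share` are hypothetical helper names, and the toy input reuses three segments from the excerpt rather than the whole corpus.

```python
from collections import defaultdict


def tally_by_gender(segments):
    """segments: iterable of (start, end, participant_id) tuples."""
    counts = defaultdict(int)      # number of segments per gender
    seconds = defaultdict(float)   # total segment duration per gender
    speakers = defaultdict(set)    # distinct participant IDs per gender
    for start, end, pid in segments:
        gender = pid[0]  # assumption: "f..." or "m..." prefix encodes gender
        counts[gender] += 1
        seconds[gender] += end - start
        speakers[gender].add(pid)
    return counts, seconds, speakers


def share(totals, key):
    """Fraction of the grand total contributed by one key."""
    return totals[key] / sum(totals.values())


toy = [
    (669.140, 670.895, "fe016"),
    (679.476, 680.182, "me013"),
    (698.160, 705.848, "me013"),
]
counts, seconds, speakers = tally_by_gender(toy)
print(dict(counts), {g: len(ids) for g, ids in speakers.items()})

# The same share() helper reproduces the percentages reported above.
print(round(share({"m": 43, "f": 16}, "m"), 2))
print(round(share({"m": 101739, "f": 27355}, "m"), 2))
print(round(share({"m": 232460, "f": 65919.2}, "m"), 2))
```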
This is quantitatively consistent with the males in these meetings being about 14% more talkative than the per-speaker average, and the females about 14% less: (1.14*43)/(1.14*43 + 0.86*16) = 0.781. That's a substantially larger difference than I found in looking at a large collection of telephone conversations ("Gabby guys: The effect size", 9/23/2006), where "in conversations between the sexes, the men used about 6% more words on average than the women did".
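The arithmetic behind that formula can be checked directly. In this sketch the multipliers 1.14 and 0.86 are read as per-speaker talk rates relative to an average of 1.0, which is one interpretation of the calculation above; `expected_male_share` is a hypothetical helper name.

```python
def expected_male_share(n_male, n_female, male_rate, female_rate):
    """Share of total talk produced by males, given head counts and
    per-speaker talk rates (1.0 = the overall per-speaker average)."""
    male_talk = male_rate * n_male
    female_talk = female_rate * n_female
    return male_talk / (male_talk + female_talk)


# If everyone talked equally, the male share would just be the head-count share.
equal = expected_male_share(43, 16, 1.0, 1.0)

# With the rates used in the formula above.
skewed = expected_male_share(43, 16, 1.14, 0.86)

print(round(equal, 3), round(skewed, 3))
```

The gap between the two values is what the talkativeness estimate is meant to account for: head counts alone predict about 73% male speaking time, against the observed 78%.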
What about overlaps? If we count only overlaps where someone started talking while someone else was talking, and continued after any talk segments that started earlier had finished, there were 8,299 cases where the overlapper was female, and 25,904 cases where the overlapper was male. Thus 25904/(25904+8299) = 75.7% of the overlappers were male. In this sense of "overlap", a larger percentage of female-produced segments overlapped earlier speech (8299/27355 = 30.3%) than male-produced segments (25904/101739 = 25.5%).
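One way to turn that definition into code, as a sketch rather than the script actually used: a segment counts as an overlap if at least one segment by a different speaker started earlier and is still in progress when it begins, and if it ends after every such segment. The name `find_overlaps` and the tuple layout are illustrative, and the gender lookup again assumes the "f"/"m" prefix on participant IDs.

```python
def find_overlaps(segments):
    """segments: list of (start, end, participant_id) from ONE meeting.

    Returns a list of (overlapper_segment, overlapped_segments) pairs.
    """
    ordered = sorted(segments)  # by start time, then end time
    results = []
    for i, (start, end, pid) in enumerate(ordered):
        # Segments by other speakers that began earlier and are still running.
        in_progress = [
            s for s in ordered[:i]
            if s[2] != pid and s[0] < start < s[1]
        ]
        # Count only if this segment outlasts all of them.
        if in_progress and all(end > s[1] for s in in_progress):
            results.append(((start, end, pid), in_progress))
    return results


# Timings taken from the end of the transcript excerpt above.
excerpt = [
    (696.660, 698.160, "me013"),
    (696.689, 698.692, "fe016"),
    (698.160, 705.848, "me013"),
    (702.830, 703.643, "fe016"),  # starts and ends inside me013's turn
    (706.390, 708.699, "me013"),
    (709.050, 711.262, "me013"),
    (709.650, 713.373, "fe016"),
]

overlaps = find_overlaps(excerpt)
overlappers = [seg[2] for seg, _ in overlaps]
overlappees = [[s[2] for s in hit] for _, hit in overlaps]
print(overlappers)
print(overlappees)
```

The scan is quadratic in the number of segments, which should be tolerable when run one meeting at a time; a sweep over sorted start and end times would be the usual way to speed it up. The second element of each pair gives the overlapped segments, so the same pass also supports counting by overlappee.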
If we ask about who got overlapped instead of who did the overlapping, we find that there were 8,316 cases where a female segment was overlapped, vs. 25,888 cases where a male segment was overlapped. Thus 25888/(25888+8316) = 75.7% of the overlappees were male. Again, a larger percentage of female-produced segments were overlapped (8316/27355 = 30.4%) than male-produced segments (25888/101739 = 25.4%).
All of this is approximately consistent with what we found in Yuan et al. 2007, where we looked at overlaps in telephone conversations: "In general, females make more speech overlaps of both types than males; and both males and females make more overlaps when talking to females than talking to males."
This is the crudest sort of beginning — to make real sense of what happened in these meetings, we'd need to look at aspects of the conversational dynamics that are not recorded in the transcripts, such as the roles of the speakers involved (e.g. meeting convener or principal speaker vs. other roles), differences among interrupting to support a point, interrupting to dispute a point, interrupting to change the subject, etc. We might also look at more elaborate models of overlap patterns, such as those described in Kornel Laskowski et al., "On the Dynamics of Overlap in Multi-Party Conversation", InterSpeech 2012.
There are also other meeting datasets to look at:
The ISL Meeting speech and transcripts (part 1): a first subset of the ISL Meeting Corpus (112 meetings). It contains 18 meetings collected at the Interactive Systems Laboratories at Carnegie Mellon University in Pittsburgh, PA during the years 2000-2001. The recorded meetings were either natural meetings where participants needed to meet in the real world, or artificial meetings, which were designed explicitly for the purposes of data collection but still had real topics and tasks. The duration of the meetings in this corpus ranges from eight to 64 minutes and averages at 34 minutes.
There are a total of 31 unique speakers in the corpus. Meetings involved anywhere from three to nine participants, averaging at five. The corpus contains a significant proportion of non-native English speakers, varying in fluency.
The AMI corpus, available from Edinburgh: 100 hours of meeting recordings. The recordings use a range of signals synchronized to a common timeline. These include close-talking and far-field microphones, individual and room-view video cameras, and output from a slide projector and an electronic whiteboard. During the meetings, the participants also have unsynchronized pens available to them that record what is written. The meetings were recorded in English using three different rooms with different acoustic properties, and include mostly non-native speakers. […]
[T]he AMI Meeting Corpus has a somewhat unusual design. Except for corpora set up to inform a spoken dialogue systems application by showing what the system needs to produce, dialogue corpus designers usually aim to capture completely natural, uncontrolled conversations. Around one-third of our data is like this; it consists of meetings from various groups that would have happened whether they were being recorded or not. However, the rest has been collected by having the participants play different roles in a fictitious design team that takes a new project from kick-off to completion over the course of a day. The day starts with training for the participants about what is involved in the roles they have been assigned (industrial designer, interface designer, marketing, or project manager) and then contains four meetings, plus individual work to prepare for them and to report on what happened. All of their work is embedded in a very mundane work environment that includes web pages, email, text processing, and slide presentations.
And there are probably others that I don't know about.
Because these recordings and transcripts have been published (and a fortiori because they were created in the first place), it's possible to contemplate the kind of analysis that Mark Dingemanse recommends. Short of undertaking the onerous task of classifying discourse dynamics by hand throughout the transcripts, we could classify a random sample. Or we could try to set up a plausible crowdsourcing method. But again, recording and publication make all things possible.
I agree with Mark that observation of behavior is simultaneously a study of the observer's perceptions and the experimental subjects' behaviors. But I would point out to him that merely shifting to observations of recordings rather than observations of unrecorded interactions just shifts this problem to a different domain. The shift can help, if the classificatory categories are well defined, and if the classification is done blind by annotators who are not aware of the experimental hypothesis, and if inter-annotator agreement is evaluated quantitatively, and if the recordings and annotations are published so that others can validate and extend them. But these conditions are all-too-rarely met — for example, in the extensive sociolinguistic literature on t/d deletion, the classificatory categories are not explicitly defined, inter-annotator agreement is not calculated, the coding is not done blind, and the datasets are not published.
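As an illustration of what evaluating inter-annotator agreement quantitatively can look like, here is a sketch of Cohen's kappa for two annotators labeling the same items. Cohen's kappa is one common choice, not something this post prescribes. The category labels and the two annotation lists are invented for the example, and `cohens_kappa` is a hypothetical helper rather than a library function.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # according to their own category frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical category
    return (observed - expected) / (1 - expected)


# Invented labels for six overlaps, coded independently by two annotators.
annotator_1 = ["support", "dispute", "support", "topic", "support", "dispute"]
annotator_2 = ["support", "dispute", "dispute", "topic", "support", "support"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Raw agreement in this toy case is 4 out of 6, but part of that is expected by chance given how often each annotator uses each category, which is why the chance-corrected figure is lower.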
There's no bigger target for the Open Data movement than this one, in my opinion.
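Update: as requested by Jon Lennox in a comment below, here's the full 2×2 table of male/female overlap counts:
Male overlapper: 19913 male overlappees, 5990 female overlappees
Female overlapper: 5973 male overlappees, 2326 female overlappees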
Thus male speakers' overlappings were 19913/(19913+5990) = 76.9% on male overlappees, and 5990/(19913+5990) = 23.1% on female overlappees. Since female speakers took up 65919.2/(232460+65919.2) = 22.1% of the speaking time overall, this suggests that male speakers overlapped female speakers at a slightly higher rate than if they had intervened at chance times.
Female speakers' overlappings were 5973/(5973+2326) = 72.0% on male overlappees, and 2326/(5973+2326) = 28.0% on female overlappees. Since female speakers took up 22.1% of the speaking time overall, this suggests that female speakers had a somewhat greater tendency to overlap other female speakers as opposed to male speakers.
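The row percentages and the chance baseline can be recomputed from the four counts. This sketch takes the counts from the text above and the speaking times (232,460 and 65,919.2 seconds) from earlier in the post; the `table` layout and the `target_shares` helper are illustrative.

```python
# Keys are (overlapper gender, overlappee gender).
table = {
    ("m", "m"): 19913,
    ("m", "f"): 5990,
    ("f", "m"): 5973,
    ("f", "f"): 2326,
}

# Baseline: share of all speaking time taken up by female speakers.
female_time_share = 65919.2 / (232460 + 65919.2)


def target_shares(overlapper):
    """For one overlapper gender, the fraction of overlaps landing on each gender."""
    row_total = sum(n for (who, _), n in table.items() if who == overlapper)
    return {
        target: table[(overlapper, target)] / row_total
        for target in ("m", "f")
    }


for overlapper in ("m", "f"):
    shares = target_shares(overlapper)
    print(
        overlapper,
        {target: round(p, 3) for target, p in shares.items()},
        "female-overlappee share minus baseline:",
        round(shares["f"] - female_time_share, 3),
    )
```

A positive difference in the last column means that group overlapped female speakers more often than speaking time alone would predict.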