My summer

.. or at least six weeks of it, will be spent at the 2017 Jelinek Summer Workshop on Speech and Language Technology (JSALT) at CMU in Pittsburgh. As the link explains, this

… is a continuation of the Johns Hopkins University CLSP summer workshop series from 1995-2016. It consists of a two-week summer school, followed by a six-week workshop. Notable researchers and students come together to collaborate on selected research topics. The Workshop is named after the late Fred Jelinek, its former director and head of the Center for Speech and Language Processing.

I took part in the first of these annual summer workshops, back in 1995, as a member of the team focused on "Language Modeling for Conversational Speech Recognition".

This summer, I'll be part of a group whose theme is described as "Enhancement and Analysis of Conversational Speech".

One of the group's goals is to do a better job of "diarization", i.e. keeping track of who spoke when in conversations. Existing systems do an especially bad job with overlapping speech, which can be extremely common.

Here's a graphical representation of (accurate) diarization in a (real) conversation between Red and Blue:

And the same thing continued for a while (though not to the end of the conversation):

As discussed here, turn-taking overlaps are often cooperative rather than competitive — and it would be good to be able to supplement robust diarization with a functional analysis of conversational flow.

As the workshop progresses, I'll post some updates.



  1. Cervantes said,

    June 22, 2017 @ 7:59 am

    How is this generated? Is it computer recognition of the speakers' voices, or human coding? Do you have references?

    [(myl) It's the start of a telephone call, where the two sides are well separated. I then ran a "speech activity detector" on each side of the call separately, and plotted the results with a simple R script.]

  2. Mark Gould said,

    June 22, 2017 @ 9:55 am

    This sounds (and looks) fascinating.

    Although not strictly conversational (I expect they are edited), Variety's series of video conversations between actors might be an interesting source of material.

    As an example, the conversation between Oprah Winfrey and Thandie Newton is full of interruptions that appear to be supportive or affirmative. It certainly feels very different from the exchanges between John Lithgow and Kevin Bacon, where the turn-taking appears to be more formal (although I confess I haven't diarised them).

  3. elikabergelson said,

    June 22, 2017 @ 10:18 am

    i'm so psyched you'll be working on this this summer (i'll be in pittsburgh for a week, with related meetings to this:)). But more to the point: the red and blue and purple, above, already hard enough to properly tag (manually or automatically), leave out a (perhaps even more difficult) part of the problem: people are rarely conversing in twosomes in an otherwise silent room (tv that's perpetually on in the background, the rustling of the dishwasher, cocktail party effects for various key words in otherwise backgrounded speakers…).
    Deciding what speakers and noises merit tracking (and more selfishly for my own work, which parts of the signal learners (e.g. children) can appreciate enough to learn something from) is hard for trained humans, and thus unsurprisingly non-trivial for machine algorithms as well. Given how widely home environments range, it's amazing we learn language (and the relevant normalization, generalization, phonology, etc. of our speech-sounds) at all!

  4. Cervantes said,

    June 22, 2017 @ 12:30 pm

    Aha. So you can't do this from a single recording, then.

    [(myl) Maybe by the end of the summer :-)…]

    Also, no reason to think that phone calls resemble face-to-face interactions in this respect.

    [(myl) "Resemble", more than you might think, but there are certainly serious differences. And there are more than two cases to be considered.]

  5. Philip Taylor said,

    June 22, 2017 @ 2:06 pm

    Almost certainly because my wife is on VOIP in Cornwall whilst I am on the PSTN in Kent, our telephone conversations are enforcedly half-duplex, and this in turn means that any resemblance between them and our normal (face-to-face) conversations is virtually non-existent.
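
For readers curious about the procedure described in the reply to the first comment (run a speech activity detector on each side of the call separately, then compare the results), here is a minimal Python sketch. It is not the script used for the plots above, which was written in R. The energy threshold, the frame length, and the function names `speech_segments` and `overlaps` are all assumptions made for illustration, and a real speech activity detector is considerably more robust than this toy energy gate.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def speech_segments(samples: Sequence[float], rate: int,
                    frame_len: int = 160,
                    threshold: float = 0.01) -> List[Segment]:
    """Toy activity detector for ONE channel of a two-channel call.

    A frame counts as speech when its mean squared amplitude exceeds
    `threshold` (an assumed value; tune it for real audio).
    `frame_len` is in samples (160 samples = 10 ms at 16 kHz).
    """
    segments: List[Segment] = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        t = i * frame_len / rate  # start time of this frame
        if energy > threshold and start is None:
            start = t                      # speech begins
        elif energy <= threshold and start is not None:
            segments.append((start, t))    # speech ends
            start = None
    if start is not None:                  # still talking at the end
        segments.append((start, n_frames * frame_len / rate))
    return segments


def overlaps(a: List[Segment], b: List[Segment]) -> List[Segment]:
    """Intervals where both sides are active at once.

    Assumes each list is sorted and non-overlapping within itself,
    which is what speech_segments produces.
    """
    out: List[Segment] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever segment finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Synthetic example: one "sample" per 0.1 s, frames of 1 s each.
red = [0.0] * 10 + [1.0] * 20 + [0.0] * 10
blue = [0.0] * 20 + [1.0] * 20
red_segs = speech_segments(red, rate=10, frame_len=10)
blue_segs = speech_segments(blue, rate=10, frame_len=10)
both = overlaps(red_segs, blue_segs)
```

Because each side of the call is processed on its own, overlapping speech falls out of a simple interval intersection. Doing the same from a single mixed recording is the hard part.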
