
It's here. Not the car battery, and not another one of the movies, but the First DIHARD Speech Diarization Challenge and the associated Interspeech 2018 special session.

As discussed in "My summer", 6/22/2017, I spent a couple of months last summer in Pittsburgh working with a couple of dozen other people on a workshop project with the title "Enhancement and analysis of conversational speech", whose primary focus was automatic diarization: determination of who spoke when.

The opening and closing presentations for this workshop are available here — see also "Too cool to care", 8/12/2017, and Bergelson et al., "Enhancement and analysis of conversational speech: JSALT 2017", ICASSP 2018.

In the pre-workshop project description, we promised to "document progress on overlap detection, on robust diarization, and on analysis of overlapping segments; and […] use our failures to characterize the remaining difficulties and to lay out a path for further research." And the conclusion to the ICASSP paper said:

During the JSALT 2017 summer workshop, we explored some new approaches to diarization, and made some improvements in standard methods. But as we expected, the general problem is by no means solved. So to encourage further progress, we plan to create a series of Diarization Challenges, the first of which will be submitted to InterSpeech 2018.

Our draft plan for the first Challenge envisages a collection of single-channel recordings involving various numbers of speakers, with several different samples of each of several different types of interaction. The interaction types will be things like clinical interviews, business meetings, broadcast interviews and discussions, conversations over meals, courtroom discussions, and child language recordings. None of the samples will have previously been published as part of a speech research dataset, or used in a previous evaluation campaign.

A gold-standard delimitation of speech activity start and end times will be provided for each sample. We foresee two tasks, in increasingly level of difficulty: (1) given the audio and the gold SAD, to split any speech segments containing overlaps into single speaker and multiple speaker subsegments, and assign all segments and subsegments to speakers, where the number of speakers is unknown; (2) given only the audio, perform SAD, identify any overlapped regions, and assign all segments and sub-segments to speakers.

Future Challenges might include such things as the use of multiple audio channels, analysis of conversational dynamics, grounded diarization where (a smaller or larger amount of) training material is provided for some speakers, evaluation of human-in-the-loop efficiency for alternative methods, and so on.

For practical reasons, this first challenge includes only English-language material. But future challenges will open up the languages involved as well as the types of recordings.
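
For readers who want a concrete picture of what "single speaker and multiple speaker subsegments" means in the task description quoted above, here is a minimal sketch. It is not part of the Challenge tooling and it does no diarization at all: it assumes the speaker turns are already known, as hypothetical (start, end, speaker) triples in seconds, and only shows how such turns can be cut into subsegments labeled with the set of speakers active in each. The function name `split_by_overlap` and the turn format are my own illustrative choices.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical turn format: (start_sec, end_sec, speaker_label).
Turn = Tuple[float, float, str]
Subsegment = Tuple[float, float, FrozenSet[str]]


def split_by_overlap(turns: List[Turn]) -> List[Subsegment]:
    """Cut the timeline at every turn boundary and label each piece
    with the set of speakers talking during it."""
    # Every start or end time is a potential subsegment boundary.
    bounds = sorted({t for start, end, _ in turns for t in (start, end)})
    pieces: List[Subsegment] = []
    for left, right in zip(bounds, bounds[1:]):
        # A speaker is active in (left, right) if their turn covers it.
        active = frozenset(
            spk for start, end, spk in turns if start < right and end > left
        )
        if not active:
            continue  # non-speech gap between turns
        # Merge with the previous piece when it is adjacent and has
        # exactly the same speakers (e.g. back-to-back turns by one person).
        if pieces and pieces[-1][1] == left and pieces[-1][2] == active:
            pieces[-1] = (pieces[-1][0], right, active)
        else:
            pieces.append((left, right, active))
    return pieces
```

Given the turns `[(0.0, 5.0, "A"), (3.0, 8.0, "B"), (10.0, 12.0, "A")]`, this yields a single-speaker piece for A from 0 to 3, an overlapped piece for A and B from 3 to 5, a single-speaker piece for B from 5 to 8, and a single-speaker piece for A from 10 to 12. A piece counts as overlapped whenever its speaker set has more than one member. The hard part of the actual tasks, of course, is recovering those turns from the audio in the first place.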


1 Comment

  1. Adam Christopher said,

    February 13, 2018 @ 7:05 pm

    I'm very interested in the last part about future challenges. Only English-language material? Do you think the English language can evolve? Like, we used to speak really posh, if that's the correct word, but now it's all changed. Do you think in the future it could change back? Great post by the way, happy to have read it.
