Speaker change detection


A couple of years ago ("Hearing interactions", 2/28/2018), I posted some anecdotal evidence that human perception of speaker change is accurate and usually also pretty fast. I noted that the performance of automatic systems at analogous tasks was distinctly underwhelming in comparison.

A recent paper measures human performance more systematically and compares it with that of a state-of-the-art program: Neeraj Sharma et al., "On the impact of language familiarity in talker change detection", ICASSP 2020:

The ability to detect talker changes when listening to conversational speech is fundamental to perception and understanding of multitalker speech. In this paper, we propose an experimental paradigm to provide insights on the impact of language familiarity on talker change detection. Two multi-talker speech stimulus sets, one in a language familiar to the listeners (English) and the other unfamiliar (Chinese), are created. A listening test is performed in which listeners indicate the number of talkers in the presented stimuli. Analysis of human performance shows statistically significant results for: (a) lower miss (and a higher false alarm) rate in familiar versus unfamiliar language, and (b) longer response time in familiar versus unfamiliar language. These results signify a link between perception of talker attributes and language proficiency. Subsequently, a machine system is designed to perform the same task. The system makes use of the current state-of-the-art diarization approach with x-vector embeddings. A performance comparison on the same stimulus set indicates that the machine system falls short of human performance by a huge margin, for both languages.



4 Comments

  1. Michael Watts said,

    April 26, 2020 @ 5:32 pm

    I find the item "talker" pretty jarring, and it's used heavily in that abstract. Is this (a) ordinary academic linguistics jargon [doubtful, given myl's "perception of speaker change" above]; (b) ordinary Indian English; (c) ordinary Indian academic jargon; (d) Sharma et al. attempting to use academic jargon but getting it wrong; or (e) other?

    [(myl) It's not uncommon, as these examples show.

    But a Google Scholar search for "talker identification" gets 1,760 hits, while "speaker identification" gets 72,900.]

  2. Philip Taylor said,

    April 27, 2020 @ 4:02 am

    I wonder whether the use of "talker" in the quoted article stems from the need to avoid the jarring effect that would otherwise have occurred had the article started with two mentions of "multi[-]speaker speech". I would certainly have tried to avoid "multi-speaker speech" in those circumstances, although I confess that I would probably have fallen back on "multi-participant speech".

    Apropos of "talker", when I worked in telegraphy some 55+ years ago we made considerable use of the "PK talker". This was nothing more than a simple telephone line linking Electra House (London) and Porthcurno (Cornwall), and I can only assume that its rather odd name came from the fact that it allowed both ends to talk whereas all other links required the use of Morse code or (less frequently) teleprinters using 5-unit code.

  3. Bob Ladd said,

    April 27, 2020 @ 4:14 am

    I also generally prefer speaker to talker, but as MYL says, talker is definitely not uncommon. My impression is that talker tends to be used more in contexts where the physics of speech are part of the research question – so, caricaturing roughly, you'd be more likely to find talker in the Journal of the Acoustical Society of America than in Language in Society or Linguistic Inquiry. A very crude Google search just now (e.g. searching for speaker or talker together with pharynx, as compared with those words together with predicate or socioeconomic) suggests that that's a valid generalization.

  4. Martin Barry said,

    April 27, 2020 @ 7:44 am

    It's an engineering thing I think – those working in speech technology are far more likely than phoneticians to use the term 'talker'. If dim and distant memory serves, it was once explained to me as a way of avoiding ambiguity over discussion of loudspeakers. 'Speaker design' and 'speaker recognition' are both (differently!) unambiguous, but 'speaker selection' might not be.
