Mining a year of speech

John Coleman was on the BBC Digital Planet program a couple of weeks ago, discussing a recently awarded grant from the (British/American/Canadian) "Digging into Data" challenge. The proposal was submitted under the title "Mining a Year of Speech"; it also involves the British Library Sound Archive and some researchers at Penn, including Jiahong Yuan, Chris Cieri, and me. An Oxford University press release is here.

Last week, John was in Philadelphia, discussing plans for who'll do what when.  On the U.K. side, the primary goal is to index the audio of the spoken part of the British National Corpus. On the U.S. side, we'll be indexing a variety of other spoken materials, and working with our British partners on issues of pronunciation modeling across dialects, integration of diverse metadata from different sources, and approaches to web-based search and retrieval for various types of researchers.

One of the things that I learned during John's visit is that during his time at Bell Labs, before he took the job at Oxford, he occupied the office that I had used during my last few years there. And as it happens, one of the other awards in the Digging into Data challenge was to a group involving Mats Rooth at Cornell — and Mats, I believe, occupied the same office during the interval between John's time there and mine.

For an example of what can be done with this sort of text/audio alignment, take a look at the presentation on the website of U.S. Supreme Court oral arguments (e.g. this one).  The techniques we'll be using on the Digging into Data project were developed (mainly by Jiahong Yuan) for the SCOTUS application, under an NSF grant that just ended this past year.


  1. Jarek Weckwerth said,

    January 19, 2010 @ 4:57 pm

    This is just too exciting. Could we please fast forward to when you're ready?

  2. Aleksandr said,

    January 19, 2010 @ 9:28 pm

    I'm curious about the amount and sophistication of the computer assistance that'll be employed:

    Does this project entail any transcription of (spoken) audio data? Can that be done in an automated way for so many different voices, or is human involvement necessary? If it can be transcribed without human intervention, what software will be used?

    In any case, this is fascinating stuff!

  3. Sili said,

    January 21, 2010 @ 10:00 am

    Is this similar to the English They Work for You effort to fit the Hansard reports from Parliament to the video of the debates? I seem to recall Ben Goldacre tweeting about a request to the public to help with the alignment. (More citizen science.)
