Blizzard Challenge 2012

Every year since 2005, speech synthesis researchers have organized a Blizzard Challenge, "[i]n order to better understand and compare research techniques in building corpus-based speech synthesizers". Part of the research effort involves the general public, who are invited to perform a series of evaluations of the results.

Participation takes about one hour in total — but your participation is registered, so that you can leave at any point, and then return and take the evaluation up again at the point where you left off. If you're willing, please follow this link to enroll and participate.

The technology being evaluated is the kind of speech synthesis where new texts are synthesized by re-combining bits and pieces of recordings of real human voices, or perhaps by concatenating elements drawn from a statistical analysis of the same recordings. There are many different approaches to every aspect of this process: how to choose the bits and pieces, how to take them apart and put them together, how to modify the resulting combination (if at all), and so on. The central idea of the Blizzard Challenge series is to give everyone the same collection of recordings to start with, and the same texts to synthesize, and the same limited amount of time to get it all done, and then to compare the results so that everyone can learn more about how to improve the technology.
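To make the "choose the bits and pieces" step concrete, here is a toy sketch of one way such a choice could be made. It is not the method of any Blizzard entrant: the `Unit` class, the single `pitch` feature, the clip identifiers, and the cost function are all invented for illustration. It assumes each recorded piece is tagged with the word it contains, and it picks one piece per word so that the pitch jumps at the joins are as small as possible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    label: str    # the word (or sound) this recorded piece contains
    pitch: float  # a single made-up acoustic feature, in Hz
    clip: str     # hypothetical identifier of the source recording


def select_units(targets, inventory):
    """Pick one unit per target label, minimizing total pitch jumps at the joins."""
    candidates = [[u for u in inventory if u.label == t] for t in targets]
    if not candidates or any(not c for c in candidates):
        raise ValueError("every target needs at least one recorded unit")

    # best[i][j] = (cheapest cost of a path ending in candidates[i][j], back-pointer)
    best = [[(0.0, None) for _ in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            cost, prev = min(
                (best[i - 1][k][0] + abs(unit.pitch - p.pitch), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost, prev))
        best.append(row)

    # Trace back from the cheapest final unit to recover the whole sequence.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total
```

Real systems weigh many more features than pitch, work with much smaller pieces than whole words, and often modify the audio after joining it; the sketch only shows the flavor of the search problem.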

The name "Blizzard" arose because the original speech collections came from the CMU ARCTIC databases. The ARCTIC name, in turn, was chosen partly because

… we chose to use out-of-copyright books from the Gutenberg Project. With most of these texts being at least 70 years old, we face the issue of language drift. The English language has changed considerably over the past centuries and we did not want to infuse in our prompt set archaic English sentences. Thus we have hand selected a set of short stories whose style is recognizably modern, if not completely contemporary. Partly for consistency and partly from personal preference, we selected stories largely from the early 20th century author Jack London. Many of these stories – famously “To Build a Fire” – depict the difficult living conditions of the Yukon. Other selected books also describe the far Canadian north, hence our moniker Arctic.

The 2005 challenge was based on the CMU ARCTIC databases. The voice-building data for the 2012 challenge is

[a]udiobook data, segmented into utterances and with transcriptions […], comprising around 50 hours of speech material, of which around 32 hours have high-confidence transcriptions, with the remainder having transcriptions of lower confidence.

The audiobook recordings consist of John Greenman reading four books by Mark Twain (A Tramp Abroad; Life on the Mississippi; The Adventures of Tom Sawyer; and The Man That Corrupted Hadleyburg, and Other Stories).


  1. David Donnell said,

    May 17, 2012 @ 2:13 am

    Would it be sticking my neck out to suggest that the Eskimos have fewer words for "blizzard challenge" than we do?

  2. Adrian Morgan said,

    May 17, 2012 @ 5:00 am

    I've completed section one. No guarantee that I'll complete the other sections, but I might.

    I always find it hard to know how to rate something out of a number. Especially when I don't know in advance the average quality, or range of quality, of the things I am asked to evaluate. It helps to write down as close to the outset as possible a clear definition of each rating, so that I can use it consistently.

    In this case, I rated the voices according to the following code, which others may find useful. It would also be interesting if anyone else's interpretation of the ratings differed from mine.

    1 = No resemblance.
    2 = Borderline resemblance.
    3 = Evident but limited resemblance.
    4 = Some indistinguishable phrases.
    5 = Indistinguishable (save differences attributable to recording environment).

  3. Ginger Yellow said,

    May 17, 2012 @ 5:32 am

    Given the timing, I assumed this was somehow going to be about Diablo 3. Not sure I can think of any interesting linguistic issues in the game, though.

  4. David said,

    May 17, 2012 @ 6:49 am

    Re Diablo 3, I'm finding the bizarrely wide range of accents among the residents of the very small town of Tristram interesting.

  5. Rosie said,

    May 17, 2012 @ 11:15 am

    I don't understand why now that we finally have native HTML audio in all browsers, the site requires a browser plugin to play the audio files. Browser plugins are a nuisance and a security risk. It would be so easy to replace all "embed" tags with "audio" tags.

  6. blahedo said,

    May 20, 2012 @ 8:45 pm

    I finally did this, and the reference speaker sounds eerily like Adam West.
