A synthetic singing president?


A couple of days ago, Gary Marcus told me about the Beatles Complete on Ukulele project, and introduced me to its creator, David Barratt.

Gary got involved because he's working on a book about "learning to become musical at the age of 40", and so he's joining a roster of performers that includes the Fort Greene Childrens Choir (Age 7 and Under Section), Samantha Fox, and many others (82 so far), recording voice-and-ukulele versions of all 185 songs in the Beatles catalog. Gary is of course singing With a Little Help from My Friends (because, he explains, "otherwise I couldn't carry a tune in a bucket"), and his contribution is scheduled to be released on July 19, 2011.

So how does Language Log come into this? Well, David wants to recruit Barack Obama to sing Let it Be, and Gary thought that I could help. In turn, I believe that YOU can help.

This is not because I have any pull at the White House, or because I expect you to have connections there either. I'll let Gary explain:

The coolest, and this is where you could come in, is going to be Let it Be. But it's the only one in which the singer isn't going to be working directly with the producer, on account of the singer's high workload and blissful lack of awareness of the project. I am speaking of none other than our President, Barack Obama.

Dave's idea is to cull the individual words from the considerable archives of Obama's speech. I think he can use autotune to get the pitch right, but he is finding it harder than he anticipated to find some of the individual words. Notwithstanding the famous insult of Obama being "articulate" relative to his ethnic background, the man is a bit of a co-articulator.

Dave wondered whether anyone would have clever ways of isolating particular words from archives of Obama's speech; I thought maybe you could help, if you found the project to be sufficiently amusing. If you are interested, I can pass along the list of words in the song and which ones have proven elusive. If you think it's not your cup of tea, or too much work, but had thoughts of someone else who could help, that would be terrific, too.

At this point, if you know even a little about modern speech synthesis techniques, you'll be thinking along a somewhat different track. And if you don't, here's a chance to learn something.

The past few decades of speech-synthesis research offer some lessons for a person faced with the sort of problem that David set himself. And even better, they offer specific algorithms and open-source programs that could be used to create a synthetic singing version of President Obama — or of anyone else for whom a suitable set of transcribed recordings exists.

These days, a speech synthesis system starts with a large corpus of speech from a single speaker, which has been transcribed and time-aligned in terms of a suitable set of phonetic units. In order to synthesize a novel utterance, the system selects a set of unit-sequences from this corpus that span the pronunciation of the target utterance, and glues the selected elements together, typically with some changes in pitch and time.
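
To make that concrete: given a corpus whose utterances come with time-aligned phone labels, the first step is pure bookkeeping: index every unit (say, every diphone) by where it occurs, so that candidates for any target pronunciation can be looked up quickly. A minimal Python sketch, with the label-file format invented for illustration:

    from collections import defaultdict
    from pathlib import Path

    def load_labels(label_path):
        """Read one utterance's phone alignment: lines like '0.00 0.12 dh'."""
        segments = []
        for line in Path(label_path).read_text().splitlines():
            start, end, phone = line.split()
            segments.append((float(start), float(end), phone))
        return segments

    def build_diphone_index(label_dir):
        """Map each diphone, e.g. ('l', 'eh'), to every place it occurs."""
        index = defaultdict(list)
        for lab in sorted(Path(label_dir).glob("*.lab")):
            segs = load_labels(lab)
            for (s1, e1, p1), (s2, e2, p2) in zip(segs, segs[1:]):
                # A diphone runs from the middle of one phone to the middle
                # of the next, keeping the transition intact inside the unit.
                index[(p1, p2)].append((lab.stem, (s1 + e1) / 2, (s2 + e2) / 2))
        return index

Diphones, which run from the middle of one phone to the middle of the next, are a popular unit choice precisely because phone middles are acoustically stable, while the co-articulated transitions stay intact inside the unit.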

The elements selected for concatenation need not be — and generally aren't — words.  Each of them might be part of a word, or a segment spanning the end of one word and the beginning of another, or a stretch of several words plus a word-fragment, or whatever.  Of course, the algorithm will try to choose segments that are minimally co-articulated with their surroundings, or are maximally compatible with the adjacent selections. The longer the selections, in general, the better the process will work — but of course you also want to match the speaking rate and the pitch contour and the voice quality and the co-articulative context and so on. Furthermore, selecting a longer stretch in one place may force you to use a shorter (or otherwise inferior) stretch somewhere else, so the unit-selection algorithm needs to perform a complex multi-dimensional optimization in an astronomically large search space.
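
The classic way to organize that optimization, along the lines of Hunt and Black's unit-selection formulation, is dynamic programming over a per-unit "target cost" (how well a candidate matches the desired phonetic and prosodic context) and a pairwise "join cost" (how smoothly two candidates splice together). Here's a toy version; the cost functions are left as parameters, since the real ones are elaborate acoustic measures:

    def select_units(targets, candidates_for, target_cost, join_cost):
        """Pick one corpus candidate per target so that the summed target
        and join costs are globally minimal (Viterbi dynamic programming)."""
        # layers[i]: candidate for target i -> (best path cost, backpointer)
        prev = {c: (target_cost(targets[0], c), None)
                for c in candidates_for(targets[0])}
        layers = [prev]
        for t in targets[1:]:
            cur = {}
            for c in candidates_for(t):
                # Cheapest way to reach c from the previous layer.
                back, cost = min(((p, pc + join_cost(p, c))
                                  for p, (pc, _) in prev.items()),
                                 key=lambda pair: pair[1])
                cur[c] = (cost + target_cost(t, c), back)
            layers.append(cur)
            prev = cur
        # Trace the cheapest path back from the final layer.
        best = min(prev, key=lambda c: prev[c][0])
        path = [best]
        for i in range(len(layers) - 1, 0, -1):
            path.append(layers[i][path[-1]][1])
        return list(reversed(path))

With T targets and C candidates apiece, this search costs O(T·C²) rather than the Cᵀ of brute-force enumeration, which is what makes that astronomically large space tractable.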

Differences among synthesis algorithms include the details of the phonetic units that are manipulated, the techniques for transcribing and segmenting the source corpus, the unit-selection criteria and search algorithm, the techniques for concatenating and modifying the selected material, and so on. There's also a whole set of issues having to do with how to map textual input into the phonetic descriptions that drive such an algorithm.

This general approach is motivated both by the nature of speech and by the state of our (in)ability to model the processes that create it — and I should note at this point that I'd be surprised to learn that President Obama is any more of a "co-articulator" than the rest of us. If you try to create an utterance by splicing together words taken randomly out of context, the results are rarely going to be good, no matter whose voice you're using.

For the past half-decade, Alan Black and a host of other researchers have been helping to improve the ensemble of corpus-based speech-synthesis technologies by organizing yearly runs of the Blizzard Challenge. As the kick-off paper (Alan Black and Keiichi Tokuda, "The Blizzard Challenge: Evaluating corpus-based speech synthesis on common datasets", InterSpeech 2005) explained,

In order to better understand different speech synthesis techniques on a common dataset, we devised a challenge that will help us better compare research techniques in building corpus-based speech synthesizers. In 2004, we released the first two 1200-utterance single-speaker databases from the CMU ARCTIC speech databases, and challenged current groups working in speech synthesis around the world to build their best voices from these databases. In January of 2005, we released two further databases and a set of 50 utterance texts from each of five genres and asked the participants to synthesize these utterances. Their resulting synthesized utterances were then presented to three groups of listeners: speech experts, volunteers, and US English-speaking undergraduates. This paper summarizes the purpose, design, and whole process of the challenge.

The 2009 edition of the Blizzard Challenge workshop was held last September in Edinburgh, with nearly 20 papers presented, and the 2010 edition will be held next month in Kyoto. The sorts of data and problems addressed in recent Blizzard Challenges were described in Vasilis Karaiskos et al., "The Blizzard Challenge 2008":

The Blizzard Challenge 2008 was the fourth annual Blizzard Challenge. This year, participants were asked to build two voices from a UK English corpus and one voice from a Mandarin Chinese corpus. This is the first time that a language other than English has been included and also the first time that a large UK English corpus has been available. In addition, the English corpus contained somewhat more expressive speech than that found in corpora used in previous Blizzard Challenges.

To assist participants with limited resources or limited experience in UK-accented English or Mandarin, unaligned labels were provided for both corpora and for the test sentences. Participants could use the provided labels or create their own. An accent-specific pronunciation dictionary was also available for the English speaker.

A set of test sentences was released to participants, who were given a limited time in which to synthesise them and submit the synthetic speech. An online listening test was conducted, to evaluate naturalness, intelligibility and degree of similarity to the original speaker.

How big are the source corpora? Well, the original 2005 challenge involved about an hour of speech. The materials for the 2008 (and 2009) Blizzard challenges were

… about 15 hours of recordings of a UK English male speaker with a fairly standard RP accent. […] For Mandarin, the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, released a 6.5 hour Mandarin Chinese database of a female speaker with a standard Beijing accent.

Yesterday, I was able to harvest 65 of President Obama's weekly radio addresses for which the White House website provides an mp3 file and a transcript. Together these comprise about 5.3 hours of audio and about 55,000 words of transcript. (There are 7 more for which either the mp3 file or the transcript seems to be missing, and one that's in Spanish.)

Nearly all of this is speech read in a formal style, well recorded in a consistent setting: in principle, it should be plenty for building a decent synthesis system. The fact that the audio is released in the form of 128 kb/s MP3 files may be an issue for some algorithms, but shouldn't be a problem for many approaches. And for many current synthesis systems, creating a new voice (to zeroth order) is basically just a matter of putting the corpus in the hopper and pressing the start button.
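
For what it's worth, the "hopper" for a Festvox-style build is roughly a directory of uniform wav files plus a prompt file pairing each utterance id with its text. Here's a hedged sketch of the packaging step, with the paths and naming conventions invented for illustration (ffmpeg does the format conversion):

    import subprocess
    from pathlib import Path

    def package_corpus(mp3_dir, txt_dir, out_dir):
        """Convert mp3s to mono 16 kHz wav and write a Festvox-style
        prompt file pairing each utterance id with its transcript."""
        wav_dir = Path(out_dir) / "wav"
        wav_dir.mkdir(parents=True, exist_ok=True)
        prompts = []
        for mp3 in sorted(Path(mp3_dir).glob("*.mp3")):
            utt = mp3.stem                     # e.g. "address_2010_08_07"
            text = (Path(txt_dir) / (utt + ".txt")).read_text().strip()
            subprocess.run(["ffmpeg", "-y", "-i", str(mp3), "-ar", "16000",
                            "-ac", "1", str(wav_dir / (utt + ".wav"))],
                           check=True)
            # Festvox prompt format; embedded double quotes must be escaped.
            prompts.append('( %s "%s" )' % (utt, text.replace('"', '\\"')))
        (Path(out_dir) / "txt.done.data").write_text("\n".join(prompts) + "\n")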

Also, in pursuit of the particular goal of getting Synthetic Obama to sing Let it Be, you have a big advantage: you can cheat. That is, if a syllable or phrase somewhere in the song sounds bad, you can fix bugs in the corpus alignment, tweak the selection algorithm, or otherwise intervene by hand to fix things up.
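
In terms of the unit-selection sketch above, the crudest form of cheating is a hand-maintained override table: where a target unit is listed, skip the search and use the corpus unit you've already verified by ear. Something like this (the entries are invented):

    # Hand overrides: target unit -> corpus unit chosen by ear.
    # (Entries invented; a real table would grow as bad joins are found.)
    PINNED = {
        ("l", "eh"): ("address_2009_11_21", 14.37, 14.52),
    }

    def candidates_with_overrides(target, candidates_for):
        """Restrict the search to the pinned unit where one exists;
        otherwise fall back to the normal candidate lookup."""
        if target in PINNED:
            return [PINNED[target]]
        return candidates_for(target)

Wrapping the candidate lookup this way before handing it to the earlier select_units function leaves the rest of the search untouched.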

So why do I say "you"? Why don't I just do it myself?

Well, it's partly because I'm having trouble finding time for the research projects that I'm already committed to.  And it's partly because I would hate to deprive you of the opportunity to achieve fame by helping our president contribute to the Beatles Complete on Ukulele project.

But mostly, there's a fair amount of work still to do before anyone can press that "start" button to transform the collection of transcribed recordings into a synthetic presidential voice. And the skills required are various, and often not especially technical.

For example, the White House website containing the weekly radio addresses has been through some changes, and the layout and format of the transcripts is not very consistent. I put an hour or so into writing scripts to extract from the html just the text corresponding to the read portions, but I'm sure there are some problems left. Also, the texts are usually the "prepared remarks", and the speaker may sometimes go off the script.  So someone needs to listen to the audio and edit the allegedly-corresponding transcripts.
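
For the curious, the extraction step is the sort of thing a few lines of BeautifulSoup will mostly handle. The catch is that the container names below are guesses, and the real pages use several layouts, which is exactly why hand-checking remains necessary:

    from bs4 import BeautifulSoup

    def extract_transcript(html):
        soup = BeautifulSoup(html, "html.parser")
        # Container names are guesses; the real pages use several layouts.
        body = (soup.find("div", class_="transcript")
                or soup.find("div", id="content")
                or soup)
        paragraphs = [p.get_text(" ", strip=True) for p in body.find_all("p")]
        # Drop stage directions like "(Applause.)" and all-caps bylines.
        keep = [p for p in paragraphs
                if p and not p.startswith("(") and not p.isupper()]
        return "\n\n".join(keep)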

Once that's done, there are some issues with creating the phonetically-aligned version.  For example, there are words in the text that will not be in the standard pronouncing dictionaries, and will therefore need to be added. And there are the usual dates, dollar amounts, percentages and so on whose actual pronunciation may need to be checked.
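
Both checks are easy to script. Here's a sketch that flags out-of-vocabulary words against a CMUdict-format lexicon, and lists numeric tokens whose spoken form should be verified against the audio (the patterns are deliberately minimal):

    import re

    def load_cmudict(path):
        """Read a CMUdict-format lexicon: one 'WORD  W ER1 D' entry per line."""
        lexicon = set()
        for line in open(path, encoding="latin-1"):
            if line.strip() and not line.startswith(";;;"):
                word = line.split()[0].split("(")[0]   # strip "(2)" variants
                lexicon.add(word.lower())
        return lexicon

    def oov_words(text, lexicon):
        """Words in the transcripts that the dictionary doesn't know."""
        words = re.findall(r"[a-zA-Z']+", text.lower())
        return sorted(set(w for w in words if w not in lexicon))

    def tokens_needing_normalization(text):
        """Dates, dollar amounts, percentages, etc., whose spoken form
        should be checked against the audio."""
        return re.findall(r"\$[\d,.]+|\d+%|\b\d[\d,./-]*\b", text)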

So for a project like this, work by a smallish crowd of volunteers would be efficient, effective and fun. If you're interested, let me know and I'll point you to some possible tasks.



11 Comments

  1. Ryan Denzer-King said,

    August 13, 2010 @ 10:59 am

    I know next to nothing about this kind of research, but I'd be game for trying some small part of it.

    [(myl) See the instructions here.]

  2. Sam said,

    August 13, 2010 @ 11:50 am

    A related work, from two years ago: http://www.youtube.com/watch?v=HioPyCID6RI

    They used Obama video/audio clips to match the lyrics of "Never Gonna Give You Up". They don't seem to use AutoTune, perhaps because it wasn't yet available (?).

    It's an incredible job of video fakery, though. Note the reflection on the stage around 1:40.

  3. Debbie said,

    August 13, 2010 @ 12:41 pm

    I don't know if further help could be found by the creators of ZoomText, but it is adaptive software with a voice component that is exceptionally well done. Just a thought!

  4. Clarissa at Talk to the Clouds said,

    August 13, 2010 @ 1:15 pm

    I see I'm the first person nerdy enough to be thinking "Oh, no, an Obama Vocaloid? This is going to the top of the Oricon charts!"

    http://en.wikipedia.org/wiki/Vocaloid

  5. Christian Cossio said,

    August 13, 2010 @ 2:19 pm

    Very funny thing.
    I'd like to collaborate on it.

  6. fs said,

    August 13, 2010 @ 3:23 pm

    Sam: autotune has been around since 1997.

    [(myl) And similar techniques have been around in the speech technology area since the 1970s.]

  7. George said,

    August 13, 2010 @ 4:19 pm

    Are there no legal or ethical issues in having a speaker say something (or sing it), that they never said or intended to say?

  8. Joff said,

    August 13, 2010 @ 4:28 pm

    There's also this set of examples of George Bush having his speeches turned into songs, IMO much better executed than the Obama "Never Gonna Give You Up" example Sam posted.

    Bush singing "Sunday Bloody Sunday"
    Bush singing "Imagine"

  9. Sham said,

    August 13, 2010 @ 5:24 pm

    Cereproc here in Edinburgh have an Obama voice built, with a demo available on their site: http://www.cereproc.com/products/voices

    Sadly, they don't have a fully interactive demo version available as they do for GWB, or it would simply be a matter of pasting in the requisite lyrics.

    [(myl) Thanks for the link! In general, though, changing pitch and time in a "standard" synthetic phrase will not work as well as if you choose the units conditional on the target prosody.]

  10. Nijma said,

    August 13, 2010 @ 10:31 pm

    When it comes to presidential boogie, this one is pretty hard to beat.

  11. peterj said,

    August 15, 2010 @ 12:57 pm

    Am I the only person reading this who finds this activity unethical (at least, doing so without having the speaker's prior permission)?
