Deep fake audio


Helen Rosner, "A Haunting New Documentary About Anthony Bourdain", The New Yorker 7/15/2021:

It’s been three years since Anthony Bourdain died, by suicide, in June of 2018, and the void he left is still a void. […]

In 2019, about a year after Bourdain’s death, the documentary filmmaker Morgan Neville began talking to people who had been close to Bourdain: his family, his friends, the producers and crew of his television series. “These were the hardest interviews I’ve ever done, hands down,” he told me. “I was the grief counsellor, who showed up to talk to everybody.” […]

There is a moment at the end of the film’s second act when the artist David Choe, a friend of Bourdain’s, is reading aloud an e-mail Bourdain had sent him: “Dude, this is a crazy thing to ask, but I’m curious” Choe begins reading, and then the voice fades into Bourdain’s own: “. . . and my life is sort of shit now. You are successful, and I am successful, and I’m wondering: Are you happy?” I asked Neville how on earth he’d found an audio recording of Bourdain reading his own e-mail. Throughout the film, Neville and his team used stitched-together clips of Bourdain’s narration pulled from TV, radio, podcasts, and audiobooks. “But there were three quotes there I wanted his voice for that there were no recordings of,” Neville explained. So he got in touch with a software company, gave it about a dozen hours of recordings, and, he said, “I created an A.I. model of his voice.” In a world of computer simulations and deepfakes, a dead man’s voice speaking his own words of despair is hardly the most dystopian application of the technology. But the seamlessness of the effect is eerie. “If you watch the film, other than that line you mentioned, you probably don’t know what the other lines are that were spoken by the A.I., and you’re not going to know,” Neville said. “We can have a documentary-ethics panel about it later.”

The idea of crafting a speech-synthesis system to imitate a single speaker's voice has been around for a long time. More than three decades ago, I was asked to create a synthetic vocal version of the woman who had recorded the prompts for an AT&T voicemail system (see "Celebrity voices", 3/26/2011). Of course the underlying technology has gotten much better, to the point where Morgan Neville can confidently assert that "if you watch the film … you probably don't know what the other lines are that were spoken by the A.I." (If and when the documentary's audio becomes available, it will be an interesting test case for vocal fraud detection…)

A (possible) deep-fake audio recording has recently been in the news in connection with events in Haiti, though I haven't seen any discussion of it in the English-language press. The recording in question is this July 10 tweet from the verified account of Martine Moïse, recorded after the assassination of her husband Jovenel Moïse, in which she was seriously wounded:

Shortly after that tweet appeared, an article in the AyiboPost claimed it was a fake ("La note vocale de Martine Moïse pourrait avoir été manipulée, selon deux groupes d’experts"):

La voix dans la note vocale diffusée hier d’abord par le média pro PHTK, Tripotay Lakay, puis par le gouvernement, y compris le compte officiel Twitter de la première dame, pourrait ne pas être celle de Martine Moïse. Ce sont les conclusions sorties des travaux de deux groupes d’experts en intelligence artificielle aux États-Unis et en France qui ont collaboré avec AyiboPost, depuis le 10 juillet.

Les deux groupes ont utilisé deux méthodes et techniques différentes. Ils sortent avec la même conclusion : la voix dans la note vocale diffère de celle de Martine Moïse, retrouvée dans multiples enregistrements publiés avant l’attaque à la résidence du président dans la matinée du 7 juillet 2021.

The voice in the vocal note, published yesterday first by the pro-PHTK media outlet Tripotay Lakay and then by the government, including the official Twitter account of the first lady, might not be that of Martine Moïse. That's the conclusion of work by two groups of artificial-intelligence experts in the United States and France who have collaborated with AyiboPost since July 10.

The two groups used two different methods and techniques. They come to the same conclusion: the voice in this vocal note differs from that of Martine Moïse as found in multiple recordings published before the attack at the president's residence on the morning of July 7, 2021.

The headline of the AyiboPost article originally suggested that the recording was synthetic, but it was changed the next day because the two groups of experts complained that this misrepresented their conclusions — see the follow-up article "Position d’AyiboPost sur la note vocale de Martine Moïse". They also published a Google Drive folder with the data and methodologies used.

When the article first appeared, Michel DeGraff wrote to ask for an evaluation of its conclusion, because the claim had gone viral in Haiti and was playing a role in the socio-political chaos there in the aftermath of the assassination. I enlisted some others who know more about the techniques involved. One of them did a deep dive into the topic — his evaluation was that the two research groups involved had used current standard techniques in a technically competent way:

  1. The work of the French group is based on techniques developed at Johns Hopkins.
    • They use an x-vector extractor—trained on external speaker-labeled speech data, such as VoxCeleb and LibriSpeech—to extract a 1024-dimensional embedding for the speaker from the 6 known recordings, and another vector embedding from the 1 questioned recording.
    • They then present the two embeddings to a PLDA-based similarity function—whose parameters are also trained on external speaker-labeled data—to compute a log-likelihood ratio (LLR) for the hypothesis that the two embeddings represent the same speaker.
    • The LLR decision threshold—theoretically 0.00, but in practice calibrated on some external speaker-labeled data—is set to 0.04.
  2. Based on the observed LLR, they reject the hypothesis that the questioned recording comes from the purported speaker.
    • I find their approach sound, and am inclined to accept their finding with some caveats.
  3. The work of the other group is based on techniques initially developed at the 2013 JHU summer workshop and later refined at Google.
    • They use a d-vector extractor—also trained on external speech data—to extract 256-dimensional embeddings from the 6 known and 1 questioned recordings.
    • A cosine similarity (I think) is then computed between various pairs/sets of embeddings.
    • The decision threshold—ideally 1.0, but in practice calibrated on data—appears to have been calibrated in a curious way: They
      • leave out one known recording that is closest in length to the questioned recording;
      • extract embeddings on the other 5 known recordings and average them;
      • find the cosine similarity between the held-out known recording and the average of the 5 known recordings;
      • set that similarity (0.84) as the decision threshold.
    • They then compute the cosine similarity of the 1 questioned recording against the 6 known recordings in a few different ways: They
      • compare the 1 questioned recording against the average of the same 5 known recordings
        • the questioned recording has similarity below threshold;
      • compare the 1 questioned recording against the average of all 6 known recordings
        • the questioned recording has similarity below threshold;
      • compare each of the 5 known recordings, and the 1 questioned recording, against 1 known recording
        • the questioned recording is the least similar to the 1 known recording;
        • the similarities of 4 of the 5 known recordings are above threshold;
        • the similarity of the 1 questioned recording is below threshold.
  4. Based on the observed similarities, they reject the hypothesis that the questioned recording comes from the purported speaker.
    • I am less familiar with these so-called end-to-end methods, but the approach is persuasive.
    • The fact that one of the 5 known recordings also fell below threshold is indicative of the stochastic nature of these methods.
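
To make the second group's procedure concrete, here is a sketch of the leave-one-out threshold calibration and cosine comparison described above, with random vectors standing in for actual d-vector embeddings (the speaker model and noise level are invented for illustration):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Stand-ins for 256-dimensional d-vectors -- NOT the actual data:
speaker_mean = rng.normal(size=256)
known = [speaker_mean + 0.3 * rng.normal(size=256) for _ in range(6)]
questioned = rng.normal(size=256)   # an unrelated vector, for illustration

# Calibration: hold out one known recording and score it against
# the average of the other five; that similarity becomes the threshold.
held_out, rest = known[0], known[1:]
centroid = np.mean(rest, axis=0)
threshold = cosine(held_out, centroid)

# Verification: score the questioned embedding against the same centroid.
score = cosine(questioned, centroid)
same_speaker = score >= threshold
```

Real d-vector systems extract the embeddings with a speaker-discriminative neural network; the comparison and thresholding logic, though, is essentially this simple.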

Overall, automatic speaker verification techniques suggest somewhat strongly that the voice in the questioned recording may not be that of the purported speaker.

Here are my caveats:

  1. The acoustic conditions in the known recordings are quite variable.  Some have background music mixed in, some are recorded outdoors or in other environments with background noise, the microphones are very different, etc.  Only one (or two?) has studio-quality audio.  While it is unlikely that the noise present will change the automatic assessment, especially given the amount of speech, it is something to bear in mind.
  2. The purported speaker in the questioned recordings is said to be recovering from multiple bullet wounds, and presumably under a lot of stress, both physiological and mental.  It is quite likely that this will have a significant effect on voice quality.  While the automatic methods used have been validated on speech from a speaker separated over long periods of time and in different recording environments, I am quite sure that there are no objective measurements about the accuracy of either method when the subject is under this much duress.
  3. Neither automatic method takes into account the content of the speech, not even phonetic or lexical idiosyncrasies of the speaker, which forensic voice examiners often pay considerable attention to.  Nor do the automatic methods address such crucial factors as chain of custody of the questioned (and known) recording.  Any conclusion about authenticity would have to take those factors into account.
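
For intuition about the LLR scoring in the first group's PLDA-based approach, here is a toy one-dimensional version of the two-covariance model that PLDA generalizes; the between-speaker and within-speaker variances (B and W) are made up for illustration:

```python
import numpy as np

# Assumed (invented) between-speaker and within-speaker variances:
B, W = 4.0, 1.0

# Joint covariance of two observations under each hypothesis:
SIGMA_SAME = np.array([[B + W, B], [B, B + W]])    # shared speaker factor
SIGMA_DIFF = np.array([[B + W, 0.0], [0.0, B + W]])

def log_gauss(x, cov):
    """Log density of a zero-mean multivariate Gaussian at x."""
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                   + x @ np.linalg.solve(cov, x))

def llr(x1, x2):
    """Log-likelihood ratio: same-speaker vs. different-speaker."""
    x = np.array([x1, x2])
    return log_gauss(x, SIGMA_SAME) - log_gauss(x, SIGMA_DIFF)
```

Close observations yield a positive LLR (same speaker more likely), distant ones a negative LLR; the decision threshold is theoretically 0, but in practice it is calibrated on held-out data, as in the French group's 0.04.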

Although I have less experience with the use of x-vector and d-vector methods, I'd go a little further, extending the expert's observation that "The purported speaker in the questioned recordings is said to be recovering from multiple bullet wounds, and presumably under a lot of stress, both physiological and mental.  It is quite likely that this will have a significant effect on voice quality." According to this article, "Moïse was suffering from gunshot wounds to her arms and thigh along with a severe injury to her hand and her abdomen." The speaker-verification algorithms used have never been tested on material recorded under conditions like that, as far as I know.

If you listen to the contested recording, and compare it to the other samples that were used in the comparison, you'll note that:

  1. The speech is breathier, lower in pitch, and slower than the comparison samples — which makes sense given the alleged context.
  2. There are many microphone-breath artefacts, which I don't see in the samples used in the technical comparisons.
  3. Two of the comparison samples have background music, and two have significant background noise — the contested recording has neither.
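
Impressions like "lower in pitch" can be quantified; a crude autocorrelation-based F0 estimator, sketched here on a synthetic tone rather than on the actual recordings, is one way to compare average pitch across samples:

```python
import numpy as np

def estimate_f0(signal, sr, fmin=60.0, fmax=400.0):
    """Crude autocorrelation-based pitch estimate for one frame."""
    signal = signal - signal.mean()
    # Autocorrelation at non-negative lags:
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # candidate period range
    lag = lo + int(np.argmax(ac[lo:hi]))      # best-matching period
    return sr / lag

sr = 16000
t = np.arange(4000) / sr                  # a quarter-second frame
tone = np.sin(2 * np.pi * 120.0 * t)      # 120 Hz stand-in for a voice
f0 = estimate_f0(tone, sr)                # close to 120 Hz
```

Applied frame by frame to the contested recording and the comparison samples, even a rough estimator like this would let the "lower pitch" impression be stated as a number rather than an impression.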

Some of the apparent breath artefacts are puzzling to me — in particular, there's something that happens in most /s/-to-vowel transitions, like this /sa/

or this /se/

That sort of thing seems very unlikely to be the result of a "deep fake" synthetic voice, unless it were trained on a lot of speech with similar artefacts (which don't seem to be present in any of Mme. Moïse's previous recordings).

The contested recording is fluent, as if it were read or memorized from a prepared text, which I would not have expected given the context. However, the earlier samples of her speech are similar, even when she seems clearly to be extemporizing, so perhaps fluent rhetoric is a skill that she retained although injured. The recording cuts off in the middle of a word, due to Twitter's maximum video length of 2:20. I would have thought that a fake (whether "deep" or human) would have been planned to fit.

So my overall conclusion is that the contested recording is almost certainly not a "deep fake"; and the results of the two speaker-verification evaluations are not convincing. The recording certainly might have been created by a voice double, but there's no real evidence of that, at least in the audio itself.



  1. Chas Belov said,

    July 18, 2021 @ 4:16 am

    There appears to be a word missing in:

    "There are many microphone-breath artefacts, which I don't in the samples used in the technical comparisons."

    Hope this helps.

    [(myl) Thanks — fixed now.]

  2. Kenneth Ward Church said,

    July 18, 2021 @ 9:16 am

    Very nice analysis. Michel asked me for help. I roped in a few others. We all came to the conclusion that the rumors about the technology were too good to be true. Technology is getting better and better, but it is hard to say anything with that much confidence, especially in this case.

  3. Bob Moore said,

    July 18, 2021 @ 6:48 pm

    I have been thinking for several years that this sort of speaker-specific TTS should be used to produce narrations of nature programs once the 95-year-old David Attenborough is no longer around.
