AI voice-over?

On 5/8/2024, the Defense Visual Information Distribution Service (DVIDS) offered a "Graphical representation of how the precision cutting charges will be used on key bridge section":

Several bits in the voice-over suggest that it was generated by a text-to-speech program — I'll note a couple of them below. And the failure to capitalize "Key Bridge" in the page's title might also be a symptom of AI-generation?

The first voice-over issue is the phrasing of the opening sentence:

To refloat the motor vessel Dalí
the section of steel structure draped over it
and pinning it down must be removed.

Since the conjunction "draped over it and pinning it down" is a reduced relative clause, it's odd to have a strong phrase break after "over it".

And a bit later, the voice seems to place main word stress on the final syllable of "analyzed":

First, salvage and demolition teams will have analyzed the structure,

Zeroing in a bit further:

Listen to the whole thing — what else do you hear?



11 Comments

  1. Y said,

    May 11, 2024 @ 2:44 pm

    "To be able to remove steel": "remove" has two stresses in it. "identified locations": again, "idéntifíed". Can this be a matter of regional accent, like the double-stressed (AAVE?) pronunciation of "police"?

    "Fireworks" sounds very weird, like different recordings were spliced in mid-word.

    There are a few audible breath intakes. I think a (not audible) breath intake is what causes the break after "over it".

  2. Mark Liberman said,

    May 11, 2024 @ 3:48 pm

    @Y: "There are a few audible breath intakes. I think a (not audible) breath intake is what causes the break after "over it"."

    Do you think that the speaker is actually breathing? Or are these just interpolated breath sounds — which engineers often splice into human recordings, FWIW…

  3. AntC said,

    May 11, 2024 @ 3:50 pm

    "Fireworks" sounds very weird,

    Yes, I noticed that. Almost pronounced as two words, with emphatic stress on the second: heating is ineffective, but fire _works_.

    But yeah, it's getting harder to tell if AI-generated.

  4. P Resnik said,

    May 11, 2024 @ 4:27 pm

    Odd lengths of breaks at various phrase boundaries. But what really struck me was the pacing and emphasis on the phrase “will have analyzed” at around 00:25. The three syllables in “analyzed” are too even and the real smoking gun is the stress on the third syllable, which is reminiscent of the classic “put the ACcent on the wrong syllAHble.”

  5. Y said,

    May 11, 2024 @ 4:32 pm

    The slight breathlessness and the hoarseness go together. There are also places where final plosives are barely released.

    All in all, on balance I think it's an inexperienced voice narrator, who was given a few tips on speaking clearly before being recorded. I think that is more likely than a devilishly clever simulation of natural imperfect speech. This is the Corps of Engineers, not Hollywood.

    Cf. other videos by the same producer:
    https://jakepope.com/work/video

  6. JPL said,

    May 11, 2024 @ 5:12 pm

    WRT the text, I would have preferred "precision cutting offers one of the most efficient and safest methods for enabling (or "allowing") the removal of steel …"; the "to" clause would be appropriate as a complement to a verbal element (like "used"), but here you have a nominal ("method"): "is used to enable" vs "method for enabling". Lack of unstressing on "one" in "offers one of the most efficient" as if it's occurring sentence-initially; likewise, "will have" in "will have analyzed" would normally be unstressed. "puffs" seems to have the vowel quality of "pull", rather than that of "but". "These" in "These teams" is unnecessarily stressed, since no contrast of any kind is to be expressed, even though the phrase is at the beginning of the sentence.
    Y: The stress patterns on "analyzed" and "identified" are similar to those of West African English, but there are no other indicators of WAE in the recording.

    Newsreaders and hosts on the TV often produce infelicities and misinterpretations of sentence structure, but the stress patterns here are weird, so I'm going to say, "this is AI all the way".

  7. AntC said,

    May 11, 2024 @ 5:19 pm

    Newsreaders and hosts on the TV often produce infelicities and misinterpretations of sentence structure, …

    Is there a specific Baltimore cadence/rhythm thing going on? This local newsreader (on the same topic) has some odd (to me) stress patterns. For example "tomorrow" at 0:30; "seafarers center" at 0:47. And in general a lot of nasalised vowels.

  8. Jarek Weckwerth said,

    May 11, 2024 @ 5:44 pm

    Hmmm. The sound quality is really quite low. I wouldn't expect anything like this from a typical serious modern TTS system. Also, unreleased final plosives are not something you typically get, I think. And the artefact in the middle of "fireworks" is just off the scale. Something really weird is going on. Maybe a low quality TTS system built into an animation app?

  9. Julian said,

    May 11, 2024 @ 5:58 pm

    [Before reading other comments]
    "EN-CASED" instead of "en-CASED"
    "MILL-i-METRES" instead of "MILL-imetres"
    "YOU-WOULD-SEE" instead of "you-would-SEE"
    "FI-er-WORKS" instead of "FI-er-works"

  10. AntC said,

    May 12, 2024 @ 3:54 am

    Synchronicity with this and the concurrent Indigo thread:

    Earth's Oceans Were Purple for Nearly 2 Billion Years — weird AI-generated sound quality with a narrow frequency range. But part of the difficulty in following along is the scriptwriter can't seem to write short sentences to save their life. Poor AI just can't turn it into natural cadences.

  11. Philip Taylor said,

    May 12, 2024 @ 9:02 am

    Julian — "MILL-i-METRES" instead of "MILL-imetres" — would you really expect "MILL-imetres" and not "MILLY-metres" ?
