Hyperbolic lots


For the past couple of years, Google has provided automatic captioning for all YouTube videos, using a speech-recognition system similar to the one that creates transcriptions for Google Voice messages. It's certainly a boon to the deaf and hearing-impaired. But as with Google's other ventures in natural language processing (notably Google Translate), this is imperfect technology that is gradually becoming less imperfect over time. In the meantime, however, the imperfections can be quite entertaining.

I gave the auto-captioning an admittedly unfair challenge: the multilingual trailer that Michael Erard put together for his latest book, Babel No More: The Search for the World's Most Extraordinary Language Learners. The trailer features a story from the book told by speakers of a variety of languages (including me), and Erard originally set it up as a contest to see who could identify the most languages. If you go to the original video on YouTube, you can enable the auto-captioning by clicking on the "CC" button and selecting "Transcribe Audio" from the menu.

The transcription does a decent job with Erard's English introduction, though I enjoyed the interpretation of "hyperpolyglots" — the subject of the book — as "hyperbolic lots." Hyperpolyglot (evidently coined by Dick Hudson) isn't a word you'll find in any dictionary, and it's not that frequent online, so it's highly unlikely the speech-to-text system could have figured it out. But the real fun begins with the speakers of other languages.

Perhaps some day Google will develop a system that automatically detects the language of a speaker, much as Google Translate can automatically detect a wide swath of written languages. That's a tall order, of course, given the amount of dialectal and idiolectal variation among speakers of a given language, not to mention the difficulties presented by L2 speakers. As it stands, though, the auto-captioning will always assume that the speaker is using English, so the transcription from a foreign language ends up looking something like the autour-du-mondegreens of Buffalax. (Arnold Zwicky calls them bilingual homophonic translations.)

My contribution, which begins at 1:05, is in Indonesian. Here are my lines:

Ketika dua orang itu bertemu
[When the two men met]
Mezzofanti dapat berbicara dengannya sangat lancar
[Mezzofanti was able to speak to him very fluently]
dalam bahasa Ukraina
[in Ukrainian]
dan mereka sempat mengobrol selama berjam-jam
[and they chatted for hours]

This gets auto-captioned as:

dictated to allow me to go to
that's about the that at the beach adding and seven atlanta
and the house of will grant
then the regressive but no perot so i'm not going down down

Interestingly, when auto-captioning is enabled on Erard's follow-up video with the languages identified, the transcription is slightly different for the second and fourth lines, even though the audio is the same as before:

dictated to allow me to go to
that's about the that at the beach and then seven atlanta
and the house of will grant
then the incident but no perot so not to jump down

I think I prefer the first version — "so i'm not going down down" is much more poetic.

(And if you enjoy this type of thing, you'll probably love the found poetry of Google Voice.)


  1. Rubrick said,

    May 2, 2012 @ 2:30 am

    "And the seven Atlanta and the house of Will Grant" could easily be an early REM lyric.

  2. Mark Liberman said,

    May 2, 2012 @ 9:08 am

    Perhaps some day Google will develop a system that automatically detects the language of a speaker, much as Google Translate can automatically detect a written language.

    Looking at the history of the NIST Language Recognition Evaluation suggests that the day is already here, at least with respect to the technical capabilities available in principle.

    Google was not among the entrants in the 2009 LRE evaluation, but the results from those that did enter were promising: roughly 5% false alarms at 5% misses for 10-second test samples in an open-set test; 2% false alarms with 1% misses in a 30-second closed-set test.

    No doubt Google's internal technology is by now substantially better than that, given the large amount of training material available to them.

    Of course, the problem of recognizing which language is being spoken where in a multi-language recording is a somewhat harder one.

  3. Hugo said,

    May 2, 2012 @ 11:11 am

    This reminds me of those awesome "misheard lyrics" videos. Most notably, metal band Nightwish's "Wishmaster" song. The Finnish singer sings in English, but with an accent and in an opera kind of voice, which allows for quirky interpretations such as "Hamster / A dentist / Hard porn / Steven Seagal", among many others. I wonder what Google's captions would be like, though. (I can't access YouTube right now.)

    [(bgz) The misheard Nightwish song is mentioned in Mark's post on autour-du-mondegreens linked to above. YouTube link is here.]

  4. Amy Reynaldo said,

    May 2, 2012 @ 11:14 am

    As a hard-of-hearing person, I have to say that Google's English-language translations have largely sucked. They're so broadly wrong, they distract from the audio. The auto-captions end up making my comprehension of the audio worse. Transcripts written by a human are still much more useful.

  5. GeorgeW said,

    May 2, 2012 @ 12:33 pm

    It would seem phonological tests could narrow the language down quickly: sound segments, syllable structure, etc.

  6. Jangari said,

    May 2, 2012 @ 9:06 pm

    Having just returned from a field trip with 27-odd hours of elicitation sessions to transcribe, this kind of technology is frustratingly far in the future, especially for underdescribed languages…

  7. mgh said,

    May 3, 2012 @ 11:10 am

    my favorite google voice transcription had a caller telling me she was "dialing my eyes and crossing my teeth" — upon listening to the voicemail, it turned out she was simply dotting her i's and crossing her t's, but I prefer the google version

  8. Chandra said,

    May 3, 2012 @ 1:13 pm

    Somewhat along the same lines is the endlessly entertaining Bad Lip Reading site: http://badlipreading.tumblr.com/

  9. Arlo James Barnes said,

    May 3, 2012 @ 4:51 pm

    It is interesting that different 'scans' of the same audio lead to slightly different transcriptions. It is possible the cause is some sort of noise, either in the audio file or added as an accidental artifact in the process of analysis. On the positive side, maybe this could be (or perhaps already is) used as a probabilistic checksum, as it were, to help better approximate the actual words used.
