What use electrolytic pickling?


Once you've written down your responses to the dozen audio clips in yesterday's perception experiment, you can check them against the truth, and also against the transcripts generated by Google's automatic captioning system, both given below.

No. | Truth | ASR
(1) | what is a liberal education | what use electrolytic pickling
(2) | the social fa- sciences far from producing | the social font-size is far from producing
(3) | who, it was assumed | who it was a sarong
(4) | a re- repository of learning | uh… crisp repulsive tory of learning
(5) | now the new knowledge required specialization | now the new knowledge required specialized asian
(6) | the distinction between a liberal and a professional education became ever more vague | the distinction between a liberal and a professional education became evermore phadke
(7) | is whether or not one has a college education | is whether or not one has ecology education
(8) | I would go so far as to say even nihilism | i would go so far sicily even nihilism
(9) | for instance the search for general universal knowledge | for instance the searched ford general universal now
(10) | like atoms, molecules, cells, and tissues | like adams molecule selves and tissues
(11) | to each new generation | tutsi you can write
(12) | when it no longer does so, its days are numbered | when it no longer dot cell hits days our number

These clips all come from the YouTube videos of Donald Kagan's farewell address ("Donald Kagan's farewell", 5/5/2013). I noticed the mistaken ASR segments when I used the text of the automated transcription to skim the content.

My point here is not at all to make fun of Google's speech recognition capabilities. I've long been a staunch defender of current ASR technology in general, and Google's implementation of it in particular. And in fact, the overall quality of the Kagan transcripts is very good — there are stretches where nearly all the words are correct.

Still, errors of the kind illustrated above indicate some of the… shall we say, areas for potential improvement.

There are some cases where the transcript is a plausible rendering of the pronunciation, but is not very plausible as English-language content, e.g. "the searched ford general …" in place of "the search for general …", or "like adams molecule selves and tissues" for "like atoms molecules cells and tissues". I'm surprised that the recognizer's n-gram language model, which is contemporary ASR's approximation to what makes sense, made these choices. And there are a few things that are just bizarre, where I can only imagine that some obscure bug has short-circuited Google's language model entirely:  "became evermore phadke" for "became ever more vague", or "tutsi you can write" in place of "to each new generation".
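To make the n-gram point concrete, here is a minimal sketch of a bigram model with add-one smoothing. The corpus is a handful of invented sentences, chosen only so that "adams" is more frequent than "atoms" overall; it is not Google's model or data, and the function name `bigram_logprob` is mine.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only. A real language model is
# trained on billions of words, not four short sentences.
corpus = (
    "john adams wrote to abigail adams . "
    "samuel adams met john adams . "
    "atoms molecules cells and tissues . "
    "atoms molecules and crystals ."
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)


def bigram_logprob(prev, word):
    """Add-one smoothed estimate of log P(word | prev)."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))


# "adams" is the more frequent word on its own ...
print(unigrams["adams"], unigrams["atoms"])
# ... but the bigram estimate still prefers "atoms molecules".
print(bigram_logprob("adams", "molecules"), bigram_logprob("atoms", "molecules"))
```

Even in this tiny setting, the word that wins on raw frequency loses once the following word is taken into account, which is the behavior one would expect a bigram model to show.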

But I'm most interested in cases where the language model is producing plausible sequences that are totally inconsistent with the sound. "What use electrolytic pickling?" is an exhilaratingly poetic substitution for "What is a liberal education?" Unfortunately, there's no way that anyone who knows English can hear Kagan's "a liberal education" as "electrolytic pickling":

[Audio clip]

Something similar has happened when the self-correction in "fa- sciences" leads it to be rendered as "font size is":

[Audio clip]

Or when "repository of learning" turns into "repulsive tory of learning":

[Audio clip]

The ASR system can hear "a liberal education" as "electrolytic pickling", or "fa- sciences" as "font size is", or "repository of learning" as "repulsive tory of learning", only because current ASR's acoustic models are extremely diffuse, extremely forgiving of deviations from the sounds they have been trained to expect. This is partly a virtue, because it allows systems to cope with speaker variation, pronunciation variation, and variation in recording conditions. But it's also a weakness, because it allows the language model to introduce this kind of phonetically impossible interpretation.
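A toy sketch of how this can happen, assuming the usual setup in which a decoder picks the hypothesis with the best sum of an acoustic log-score and a language-model log-score. Every number below is invented for illustration, and the names `decode`, `sharp`, and `diffuse` are hypothetical; nothing here is taken from Google's system.

```python
# Invented language-model log-scores: suppose the LM happens to like the
# wrong phrase a little better than the right one.
lm = {
    "a liberal education": -12.0,
    "electrolytic pickling": -9.0,
}

# Two hypothetical acoustic models scoring the same stretch of audio.
# The "sharp" one heavily penalizes the phonetically wrong hypothesis;
# the "diffuse" one barely distinguishes the two.
sharp = {
    "a liberal education": -20.0,
    "electrolytic pickling": -60.0,
}
diffuse = {
    "a liberal education": -20.0,
    "electrolytic pickling": -22.0,
}


def decode(acoustic, lm, lm_weight=1.0):
    """Return the hypothesis maximizing acoustic + lm_weight * LM log-score."""
    return max(lm, key=lambda hyp: acoustic[hyp] + lm_weight * lm[hyp])


print(decode(sharp, lm))    # acoustic evidence dominates
print(decode(diffuse, lm))  # small LM preference overrides the sound
```

With the sharp acoustic scores the decoder returns the phrase that matches the audio; with the diffuse ones, a three-point language-model preference is enough to flip the result.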

Of course, there's also a language-modeling issue here, because in the context of Kagan's overall talk, "electrolytic pickling" is a highly implausible bigram, and an adaptive language model should have noticed this. But I want to underline what these examples illustrate about the problems with the acoustic models in current ASR systems, as discussed in "High-entropy speech recognition, automatic and otherwise", 1/5/2013.

As promised in that post, we're going ahead with the idea of creating a high-entropy isolated-word test set. The idea is to define a list of about 60,000 English wordforms that most literate speakers would know (how to use, how to pronounce, how to recognize, how to spell); to record a set of speakers reading randomly-selected items from that list; to test the recognition accuracy of a set of listeners on those test sets; and to compare this to the recognition accuracy of ASR systems.

The goal is to evaluate the extent to which the fuzziness of current acoustic models is really a problem, and to provide a task where the quality of acoustic models can be tested in a high-entropy (~ 16 bits per item) setting.
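The "~ 16 bits" figure is just the base-2 logarithm of the list size, assuming every item is equally likely to be drawn. A quick check (the helper name `bits_per_item` is mine):

```python
import math


def bits_per_item(n_items):
    """Entropy in bits of a uniform random choice among n_items."""
    return math.log2(n_items)


print(round(bits_per_item(60_000), 2))  # 15.87
print(bits_per_item(65_536))            # 16.0
```

So a 60,000-item list gives a bit under 16 bits per item, and a 64k (65,536-item) list gives exactly 16. If items were drawn in proportion to their frequency in text instead of uniformly, the entropy would be lower.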

Creating the list of wordforms is not an entirely trivial problem. I started with the commonest words in the Google unigram corpus; but a random sample of the 64k commonest all-alphabetic words includes about 10-15% of things like glycosuria, tiefmeyer, politica, and azam. Similarly inappropriate things can be found even in the commonest 30k wordforms in that list, while there are plenty of appropriate choices at ranks lower than 64k.

So I made a list of about 80,000 wordforms by combining items from a number of different lists; and I've recruited some people to judge these for appropriateness, with the idea of cutting the list back to a more plausible subset. We're about 40% through this process — if you're a native speaker of American English, with good intuitions about what words ordinary literate Americans are likely to know, and you'd like to volunteer to help out, send me email.



37 Comments

  1. Jongseong said,

    May 8, 2013 @ 6:27 am

    For (10), I immediately heard "like Adams", and then automatically performed a mental shift to "like atoms" as soon as I heard that it was followed by "molecules". There is no grammatical issue with "like Adams, molecules, cells, and tissues"; I performed the substitution because, of the two choices "Adams" and "atoms", the latter fit better in the list. What kind of ASR system could make such choices based on semantic considerations? Is there a workable solution without resorting to explicitly adding "atoms, molecules" as a bigram in the language model? I'm curious because I don't know much about ASR.

    [(myl) For most speakers of American English, "atoms" and "Adams" are pronounced exactly the same way. But a statistical language model should prefer "atoms molecules" to "Adams molecules", especially in the broader context — by brute bigram statistics and perhaps by local topic-modeling as well.

    This is by no means a trivial case, though, since "Adams" is overall about twice as common as "atoms". However, the point of this post is not to explore the frontiers of language modeling, as interesting and important as those are, but to reinforce the plausibility of the idea that current acoustic models are far too forgiving. ]

  2. Miles said,

    May 8, 2013 @ 6:46 am

    I had the same reaction as Jongseong to (10).

    Also, at the end of (12), the final 'd' is barely there, so it could sound like "number"; although, of course, "days are numbered" is a familiar phrase.

  3. FM said,

    May 8, 2013 @ 6:50 am

    @Jongseong
    Even without reference to semantics, but using plain statistics, the n-gram model used by the recognizer should have "told" it that "Adams, molecules" has a far lower probability of occurring than "atoms, molecules".
    The Google Ngram Viewer does not even have any occurrence of "Adams molecules": http://books.google.com/ngrams/graph?content=atoms+molecules%2CAdams+molecules&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=

  4. Herman said,

    May 8, 2013 @ 7:48 am

    "crisp repulsive tory of learning": another reference to British Education Secretary Michael Gove?

  5. Adrian Morgan said,

    May 8, 2013 @ 8:00 am

    Leaving aside transcription policy decisions (such as whether or not to record speaker hesitations) and punctuation, my only inaccurate transcription was in #2, which I recorded as "science is" instead of "sciences".

  6. Ray Girvan said,

    May 8, 2013 @ 8:12 am

    It being US-sourced, I did list "Adam's molecules, cells and tissues", as an alternative to "atoms … etc", as not being out of the bounds of possibilities if it was a creationist speaking.

    "Electrolytic pickling" is presumably in the lexicon because it does exist in a metallurgical context.

  7. PaulB said,

    May 8, 2013 @ 8:36 am

    I'm English, and (10), with its rather nasal 'd' (er, I mean 't') and long 'm', sounds to me more like "animals" than "atoms". Because of the context I listened to it several times to check.

  8. Ari Goldberg said,

    May 8, 2013 @ 8:42 am

    With regard to "the search for general" -> "the search ford general", Justin Quillinan has noticed that Google appears to have a high rate of brand name insertion in its transcriptions. In an experiment using iterated synthesized speech -> Google transcription -> etc. over Kafka, he found that Google inserted brand names like Samsung and Red Bull (for red ball) into the text. My guess is that they've optimized their language model to ensure that YouTube's commercially-oriented content (and maybe even their ads) is transcribed correctly.

    http://www.replicatedtypo.com/iterated-learning-using-youtube-videos-and-speech-synthesis/6123.html

  9. Morten Jonsson said,

    May 8, 2013 @ 9:17 am

    In number 4, Google did pick up something the "true" transcription missed: the false start on "repository" has an "s" and maybe even a "p" (res- or resp-, not re-), as reflected in Google's "crisp."

  10. Ellen K. said,

    May 8, 2013 @ 9:32 am

    But a statistical language model should prefer "atoms molecules" to "Adams molecules"

    I think the slight pause (in speech) or comma (in writing) is significant. "Adams, molecules" is distinct from "Adam's molecules", with the latter seeming to me to be much more plausible as something someone might say; though, going from memory (I can't relisten at the moment) it's not a plausible fit for what was said in the sample here.

    On number 4, I heard it as "a res repository of learning", which gives us the s, at least, of "crisp".

    In 9, the pronunciation of nihilism was unfamiliar to me, but I was still able to transcribe it correctly. It helps that it seemed to be a context where the coining of "Nealism/Neilism" is quite unlikely.

  11. Ellen K. said,

    May 8, 2013 @ 9:32 am

    Oops… typo, 9 should be 8.

  12. Boris said,

    May 8, 2013 @ 9:47 am

    I got #9 wrong. I transcribed "to search" instead of "the search".

  13. FM said,

    May 8, 2013 @ 10:02 am

    (I tried to submit this comment earlier but something bugged, I hope it does not show up twice. My apologies if it does.)

    A look at the Longman Pronunciation Dictionary confirmed my intuition that, at least in articulatory terms, "Adams" and "atoms" are not "pronounced exactly the same way", even in General American, as Wells' transcription of the former does not involve the alveolar tap in the rendition of the "d": /ˈædəm/ vs. /ˈæt̬əm/.
    That, admittedly, is contradicted by this Wikipedia page, which makes them homophones, but not being a native speaker and lacking material evidence, I would tend to trust the first source.
    One has to admit, though, that to the hearer they sound extremely similar, so is it even possible that the acoustic model used by Google's speech recognizer might differentiate between the two phonemes?

    [(myl) It's unwise to rely on a British source for American pronunciations, at least in this case. I'm willing to wager a substantial sum of money that if we test a representative sample of Americans in a context that doesn't encourage them to use facultative pronunciations, they will not be able to distinguish better than chance between their own pronunciations of pairs like "Adams" and "atoms" or "ladder" and "latter". Some speakers have vowel-quality differences, e.g. in "rider" vs. "writer", but that doesn't generally happen in words whose stressed vowel is in the TRAP lexical set. This is an experiment that I've done with more than one phonetics class, and so I'd be very confident of winning your money. Interested in taking the bet?]

  14. Eric P Smith said,

    May 8, 2013 @ 10:12 am

    My only difficulty is that in #2 I hear the aborted word as a four-letter obscenity. I've listened dozens of times and I can't hear it any other way. From the prosody it doesn't sound, to my ears, like an aborted word. Sign of a dirty mind I guess!

    Like so many others, I hears "atoms" as "Adams" until the following word disambiguated it.

  15. Eric P Smith said,

    May 8, 2013 @ 10:14 am

    …I heard…

  16. Theophylact said,

    May 8, 2013 @ 10:41 am

    Electrolytic pickling seems to be a pretty good way of preparing stainless steel for, say, welding.

  17. is said,

    May 8, 2013 @ 11:24 am

    The one mistake that I made was also one that the ASR system made: I interpreted the beginning of 4 as "uh" instead of "a".

  18. Morten Jonsson said,

    May 8, 2013 @ 11:52 am

    @is

    Out of context, I don't think it's possible to know if that's an "uh" or an "a" in number 4, so I wouldn't call that a mistake.

  19. dw said,

    May 8, 2013 @ 12:56 pm

    As a native Brit who's lived in the US for 15 years, I transcribed everything perfectly, except that in #4 I had "(uh) (riss) repository of learning" rather than "a (re-) repository of learning".

    I had the same initial hesitation as others with "Adams" vs. "atoms" but the context quickly disambiguated.

    Incidentally, I guessed that this exercise might have something to do with yod-dropping/yod-coalescence/the do-dew merger. Nine out of the twelve sentences featured words containing historical /iu/: education (thrice), producing, assumed, new (twice), universal, tissues.

  20. Daniel Barkalow said,

    May 8, 2013 @ 1:18 pm

    On purely prosodic grounds, I was also certain that "Adams/atoms" was the first member of a four-member list, not a modifier of the list or its first member. I get "atoms" on semantic grounds, but the context could have made "Adams" or "Adam's (something unsaid)" okay. On the other hand, if he were saying "Adam's molecules, …" (and not misreading an unfamiliar document) he couldn't have used that tempo.

  21. fiddler said,

    May 8, 2013 @ 2:01 pm

    Well hey, I got them all right. Don't I get a prize or something?

  22. Morten Jonsson said,

    May 8, 2013 @ 2:07 pm

    Re "atom" vs. "Adam," one thinks of this oldie:

    http://www.youtube.com/watch?v=3v48Rp4tl0o

    Well, I'm gonna preach you a sermon 'bout Old Man Atom,
    I don't mean the Adam in the Bible datum.
    I don't mean the Adam that Mother Eve mated,
    I mean that thing that science liberated.

  23. Language Log “perception experiment” | Calvin Li said,

    May 8, 2013 @ 4:07 pm

    […] the promised discussion analyzes (the various failures of) the Google speech-to-text system used on YouTube. […]

  24. Alan Curry said,

    May 8, 2013 @ 5:14 pm

    Your wordlist based on frequency counts augmented by human judgement sounds a lot like SCOWL (http://wordlist.sourceforge.net/) with its 10 levels of commonness. The closest one to your 60,000 target is 57,344. Not good enough?

    [(myl) Not good at all, for this purpose.

    The file scowl-7.1/final/english-words.70 (with 33,681 entries) contains lots of things like aneling, molal, frore, saturniid, antennule. The file scowl-7.1/final/english-words.80 (138,890 entries) is much worse: it's mostly junk like misogamic, maffia's, thesmothetes, enisling, thesmothetes, ….

    Even scowl-7.1/final/english-words.60 (with only 12,736 entries) has got a surprising amount of junk in it.]

  25. David Morris said,

    May 8, 2013 @ 5:44 pm

    For more about Adam and atom: see http://www.youtube.com/watch?v=BjlJ2F646bA

    (I first heard this sung by a professional vocal group in Australia. This is the only version I can find of it online)

  26. Ray Girvan said,

    May 8, 2013 @ 8:20 pm

    @Eric P Smith: I hear the aborted word as a four-letter obscenity

    The trouble is, once you're told it's some kind of perceptual experiment, you start suspecting this kind of thing. I did wonder at that point whether it was a test of whether we'd pick up subliminal insertions of swear-words – "the social fk sciences" – but rejected that theory when there were no more examples.

  27. Jongseong said,

    May 8, 2013 @ 9:47 pm

    FM: A look at the Longman Pronunciation Dictionary confirmed my intuition that, at least in articulatory terms, "Adams" and "atoms" are not "pronounced exactly the same way", even in General American, as Wells' transcription of the former does not involve the alveolar tap in the rendition of the "d": /ˈædəm/ vs. /ˈæt̬əm/.

    You need to look at the explanation of symbols in the Longman Pronunciation Dictionary, which gives:

    /t̬/ : alveolar tap, usually voiced, AmE t in city

    The dictionary also has an expanded section on T-voicing, which says the following about /t̬/: "For many Americans, it is actually identical with their /d/ in the same environment, so that AmE shutter /ˈʃʌt̬ ər/ may sound just the same as shudder /ˈʃʌd ər/."

    So the Longman Pronunciation Dictionary clearly indicates that "Adams" and "atoms" are pronounced the same for many Americans. Note the choice of the symbol /t̬/ which indicates a voiced /t/, implying no distinction with /d/.

  28. FM said,

    May 9, 2013 @ 1:53 am

    @ myl & Jongseong,

    Thank you for your insights, there's always something to be learnt from exchanges on Language Log!

  29. Victoria Simmons said,

    May 9, 2013 @ 4:46 am

    I heard "suppository of learning," but maybe I was affected by having read the post on Kagan's painful prescriptions for fixing higher education.

    Reading Google's captioning, I find my mind wandering off to Sweden, with its løveli lakes, wonderful telephøne system, and mani interesting furry animals.

  30. Lazar said,

    May 9, 2013 @ 5:21 am

    @Jongseong: Indeed, I think the flapping of /d/ in American English is only slightly less important than that of /t/. If I try to pronounce a word like "shudder" with the true plosive [d] that you'd hear in British English, it sounds unnatural – the sort of thing I'd expect only in affected American speech.

  31. Adrian Morgan said,

    May 9, 2013 @ 5:55 am

    (Me suspects Victoria may be thinking of Norway. In Sweden I'd expect to find a wandöful tälephån system instead. Yeah, off-topic, which is why this is short and parenthesised.)

  32. Victoria Simmons said,

    May 9, 2013 @ 10:58 am

    Adrian: We're just hearing the subtitles differently. I use the Python rendering.

  33. Chris Waters said,

    May 9, 2013 @ 2:31 pm

    For the record, there were only two notable differences between my transcription and what MYL is calling "truth". The first, noted by a couple of other people, is the extra 's' in "res-repository", and the second is that I transcribed "f'r'instance" instead of "for instance"–you could claim that ASR did better than me there… :)

  34. Faldone said,

    May 9, 2013 @ 3:18 pm

    Regarding the for instance I would go so far as transcribing it "frinsance". But then I hear a lot of people talking about Present Obama and the infastructure.

  35. Lazar said,

    May 9, 2013 @ 6:03 pm

    @Faldone: I remember during the 2008 election, it seemed to me that John McCain was constantly referring to a Senro Bama and Preston Bush.

  36. Aaron said,

    May 11, 2013 @ 8:29 pm

    Does it make any difference that I immediately recognized the speaker as Donald Kagan? Perhaps if the ASR system had listened to him for hours, it would've done better…

  37. pranav said,

    May 12, 2013 @ 12:50 am

    How is a high entropy word list related to/different from a high frequency word list?
