Crazy captions


Watching this CNN story on YouTube, I noticed some really weird closed-captioning. You can try it for yourself: open the story on YouTube, turn CC on using the controls at the bottom right of the video panel, and see what you get.

In case it gets fixed, or your environment is different for some reason, I recorded a short sample:

I'm hoping that some reader who knows about the relevant technology can help me understand what happened here. Was it a speech-to-text system going off the deep end? A human transcriber hitting random keys on their machine? A coding or transmission glitch? Secret messages from Q? Inquiring minds want to know.


  1. Keith said,

    November 14, 2020 @ 5:30 am

    The closed captioning that I just saw on YouTube was just like in your recording.

    The University of Colorado at Boulder has an interesting page that indicates that the automatic captions are not at all good.

    The accuracy level of YouTube captions is typically below accepted standards for accuracy. If you intend to use the following approach, you must extensively review the auto-generated captions to ensure that there are no remaining errors.

  2. James said,

    November 14, 2020 @ 7:05 am

    @Keith: no, I'm pretty certain this is not YouTube automatic captioning. If it was, the option in the menu would read "English (auto-generated)". Instead the options are "English – DTVCC1" and "English – CC1" – maybe someone can explain what those mean, but it suggests they were added by the video creator.

    YouTube's automatic captioning seems to be much, much better than that. For an example, here is a "What you should know before going into linguistics" video: . (The very first sentence of the video seems the least well transcribed, but I also can't work out exactly what the speaker says there!)

    You'll see that at the word-by-word level, it's pretty accurate. While making my online lectures this term, I've found YouTube much more accurate than, for example, Panopto's automatic captioning. It even does a fairly decent job with technical mathematics and makes moderately reasonable guesses when dealing with spoken mathematical notation. If you're the creator of the video, you can download the automatically-generated .srt file and add it to your own video, to post wherever.

    What it seems the YouTube system doesn't try to do at all is to separate words into sentences, or to punctuate at all. One just gets a sort of stream-of-consciousness flow of words. Panopto does at least try (with mixed success) to insert appropriate punctuation.

    I'd be interested to hear more from someone who understands how the YouTube system is set up. In particular – it adds words one by one to the caption, synchronised appropriately with the speaker, rather than putting up whole phrases at once in a more conventional way. On the other hand, if you download the .srt file, it doesn't reflect this word-by-word flow – you get longer chunks of text.

    Also, anyone know if there are any free tools out there which combine YouTube's word-level accuracy with an attempt to punctuate? I know, it's maybe a bit much to ask for free…. It's already really great that YouTube makes this tool available for free.

    [(myl) Yes, I'm pretty sure that the captions in this particular case were embedded in the video rather than created by YouTube. See the Wikipedia articles on Closed Captioning and CEA-708 for some of the complex history and current situation.

    YouTube's automatically-generated transcripts are often pretty good. There's a long history of systems for automatic sentence-division and punctuation of such ASR wordstreams — it would no doubt be better to integrate that into the original audio analysis, but obviously it could also be applied after the fact.]
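
To illustrate James's point about downloaded .srt files, here is a minimal sketch of a reader for the basic SubRip layout (a cue number, a `start --> end` timing line, then the text). The function name `parse_srt` and the sample captions are made up for illustration and are not taken from any real video. Each cue carries a single start and end time for a whole chunk of words, which is why the file does not reflect the word-by-word display.

```python
import re

# Timing line of a basic SubRip cue, e.g. "00:00:02,500 --> 00:00:05,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h, m, s, ms):
    """Convert the four captured timestamp fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, caption_text) tuples."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                fields = match.groups()
                caption = " ".join(part.strip() for part in lines[i + 1:])
                cues.append((_seconds(*fields[:4]), _seconds(*fields[4:]), caption))
                break
    return cues


# Made-up sample: two cues, each timed as a whole chunk.
SAMPLE = """1
00:00:00,000 --> 00:00:02,500
this is a made up caption line

2
00:00:02,500 --> 00:00:05,000
with no punctuation and no sentence breaks
"""

if __name__ == "__main__":
    for start, end, caption in parse_srt(SAMPLE):
        print(f"{start:6.2f} -> {end:6.2f}  ({len(caption.split())} words)  {caption}")
```

A punctuation or sentence-splitting pass of the kind mentioned above could be run over the caption text of each cue after the fact, leaving the timings untouched.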

  3. William Berry said,

    November 14, 2020 @ 1:05 pm

    I’ve noticed that some sort of evidently sophisticated sound (“heavy thud”, etc.) and speech recognition software is being used these days for much (if not most) television programming (thinking mainly of streaming platforms here).

    It usually works well enough, but sometimes goes a little bonkers.

    I’m not an IT person (or even particularly computer literate) but it doesn’t seem like it would take much in the way of a bit of garbled code or the like, at a crucial point in the processing, to turn the content to gibberish.

  4. William Berry said,

    November 14, 2020 @ 1:09 pm

    “[b]eing used to generate subtitles” I should have said.

  5. William Berry said,

    November 14, 2020 @ 1:12 pm

    Ignore the above.

    The comment it was meant to correct was evidently eaten by the Electricity and isn’t worth the bother of reconstructing.

  6. Garrett Wollman said,

    November 14, 2020 @ 4:19 pm

    For what it's worth, MIT recently settled a lawsuit with disability rights advocates requiring all Institute-published video content to be captioned going forward. It specifically calls out automatic transcriptions like YouTube's as being unacceptable: a human-in-the-loop level of captioning accuracy is required to comply with the terms of the settlement.

  7. Sophie said,

    November 14, 2020 @ 5:47 pm

    I used to work as a live captioner for UK, Australian and Canadian news networks. There are three ways to produce captions, and only one of them fits this.
    1: using speech-to-text from a human respeaker who respeaks what is said on screen in a way that is easier for a computer to interpret. This uses words from a dictionary, so errors always have correct spelling but the wrong words, syntax or grammar. E.g. “Boris Johnson” becomes “bore his Johnson”.
    2: auto-generated speech-to-text, same as above but without the human interface. See any YouTube auto-generated captions for typical errors.
    3: live text input, usually by a stenographer on a special shorthand keyboard. This is the only method I know that would create typographical errors like those in the video (i.e. non-dictionary words appearing). Stenography keyboards are very different from QWERTY keyboards, and someone without training would at best produce garbled nonsense. My pet theory is that a stenographer was working from home and their cat covered their shift for them…

    Another possibility is some kind of transmission issue with the captions themselves, but that is more likely to display nothing at all.

  8. Max said,

    November 14, 2020 @ 6:08 pm

    I don't know if this is still the case, but closed captions for live TV were at one time transcribed using Stenotype machines, which are the keyboards also used for court transcription. They're very fast because you hit several keys at once to type an entire word, but they're prone to puzzling mistranscriptions because the software that parses it has to expand e.g. T H / EU S / A PB / KP A PL / P L into "This is an example."

    In this case, however, I think something just got garbled in the caption encoding.

    [(myl) For more on stenotype machines, see "Blame Miles Bartholomew, Ward Stone and IBM", 9/5/2004.]
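
The chord expansion Max describes can be sketched as a dictionary lookup. This is only an illustration under loud assumptions: the chord-to-word table below is hypothetical, built from the five strokes in his example and not from any real steno theory, and `expand_strokes` is an invented name. The sketch also shows one way non-dictionary text could end up on screen: a stroke with no dictionary entry is passed through as raw key letters.

```python
# Hypothetical chord dictionary. The entries are illustrative guesses based on
# the example strokes in the comment above, not a real stenotype dictionary.
STENO_DICT = {
    "TH": "this",
    "EUS": "is",
    "APB": "an",
    "KPAPL": "example",
    "PL": ".",
}


def expand_strokes(stroke_line, dictionary=STENO_DICT):
    """Expand a '/'-separated line of strokes into text.

    Strokes missing from the dictionary are emitted as their raw key
    letters, which is an assumption about how a fallback might behave.
    """
    text = ""
    for stroke in stroke_line.split("/"):
        chord = stroke.replace(" ", "")  # "KP A PL" -> "KPAPL"
        if not chord:
            continue
        word = dictionary.get(chord, chord)  # fall back to the raw keys
        if word == "." or not text:
            text += word  # no space before a full stop or at the start
        else:
            text += " " + word
    # Capitalise the first character of the sentence.
    return text[:1].upper() + text[1:]


if __name__ == "__main__":
    print(expand_strokes("T H / EU S / A PB / KP A PL / P L"))
    # An unknown stroke leaks through untranslated:
    print(expand_strokes("T H / EU S / SKWR / P L"))
```

With this table, Max's five strokes expand to "This is an example.", while the second line yields "This is SKWR.", a dictionary miss that surfaces as raw keys.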

  9. Jarek Weckwerth said,

    November 14, 2020 @ 6:35 pm

    In terms of punctuation of automatic captions, I've been impressed by how Microsoft Stream deals with my lecture recordings. Very often the punctuation is spot on, and where it goes haywire (it does sometimes), I can almost always see how my using non-read intonation leads it astray.

  10. KWillets said,

    November 16, 2020 @ 2:07 pm

    Last year I recorded screenshots of similar bizarre captions in YouTube Music on Korean content. Now I find that they've turned them off.

    It was a stream of semi-phonetic transcriptions mixed with random noise, in both Hangul and English. Entire passages were transcribed from instrumental music, and only a small number of words matched the lyrics.

    I enjoyed seeing a system that was even worse at segmenting Korean than I was, but it never approached any level of usability as captioning, and I'm still puzzled as to why it was provided at all.

  11. Andrew Usher said,

    November 18, 2020 @ 9:06 pm

    Yes, almost surely a stenotype. It's no surprise they're often used for captioning, and unfortunately a malfunction will look like this and they can't stop the show to get it put right, so they'll just continue the best they can.

    k_over_hbarc at
