Perils of topic modeling


Today's xkcd illustrates why topic modeling can be tricky, for people as well as for machines:

The mouseover title: "As the 'exotic animals in homemade aprons hosting baking shows' YouTube craze reached its peak in March 2020, Andrew Cuomo announced he was replacing the Statue of Liberty with a bronze pangolin in a chef's hat."

The strip is about trends in Google searches rather than in document content, but the point is similar: it's one thing to detect a new cluster of words and phrases, and something else to assign an interpretation.

In some cases, the discovery is just a new instance of a familiar type. And here, of course, the familiar type is epidemic or pandemic — but there are a few socio-cultural steps from that to sewing machines, webcams, and flour.
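To make the detection half of that contrast concrete, here is a minimal sketch, not anything from the comic or from Google Trends itself. The weekly numbers are invented for illustration, and the helper names `pearson` and `co_rising_pairs` and the 0.9 threshold are assumptions of mine. It flags search terms whose curves move together, and that is all it does.

```python
from itertools import combinations
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))


# Synthetic weekly search-interest values, invented for this example.
# They are NOT real Google Trends data.
series = {
    "sewing machine": [10, 11, 10, 12, 11, 30, 60, 80],
    "webcam":         [20, 19, 21, 20, 22, 45, 70, 90],
    "flour":          [15, 16, 15, 14, 16, 40, 75, 85],
    "umbrella":       [30, 12, 25, 18, 33, 15, 28, 20],
}


def co_rising_pairs(data, threshold=0.9):
    """Return pairs of terms whose series correlate at or above threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(data), 2)
        if pearson(data[a], data[b]) >= threshold
    ]


for a, b in co_rising_pairs(series):
    print(f"{a} <-> {b}")
```

On this toy data the three pandemic-era terms pair up with one another and "umbrella" is left out. Nothing in the output says why the cluster exists; that interpretive step still needs outside knowledge.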



11 Comments

  1. Philip Taylor said,

    May 5, 2020 @ 8:31 am

    Is this perhaps another example of metathesis — someone confusing causal relationships with casual ?

  2. Thomas Hutcheson said,

    May 5, 2020 @ 10:04 am

    I thought that the Ngram program was no longer being updated?

  3. mg said,

    May 5, 2020 @ 10:06 am

    @Philip Taylor – it's Randall Munroe doing his usual finding fun when it's bleak. He's very aware that correlation doesn't equal causation. The increases in all these search terms are definitely related to the pandemic, just in non-obvious ways.

    This comic beautifully illustrates the perils of trying to find the reasoning behind synchronicity of changes without prior knowledge of context. These increases in search frequency all have the same root cause, but it's definitely not one that would occur to an outside observer with no knowledge of the current situation.

  4. Garrett Wollman said,

    May 5, 2020 @ 10:35 am

    @Thomas, Google Trends is for Google search query strings; it's unrelated to Google Books Ngrams, which shows phrases retrieved from a fixed subset of the Google Books corpus.

  5. Kristian said,

    May 5, 2020 @ 11:59 am

    It's a good comic. For someone who knew what a pangolin was (I didn't) and who notices that the pangolin curve is the first to start rising there at the end, it wouldn't necessarily have been that hard to guess. Once you make the association between "exotic animal" and "disease", it's not that hard to fit in the others. (That is, given the major clue that the rise in these search terms is related.)

  6. John From Cincinnati said,

    May 5, 2020 @ 1:43 pm

    For those perhaps unfamiliar with Google Trends, the particular depicted graph is Thing (whose citation here I harvested from the ever-useful ExplainXKCD WIKI). Randall Munroe is not only Not Making This Up, but GOOMHR is an acronym defined at urbandictionary, for Get Out Of My Head Randall.

  7. John From Cincinnati said,

    May 5, 2020 @ 1:46 pm

    Sorry, missing quote.
    whose citation here

  8. Dave Cragin said,

    May 5, 2020 @ 4:37 pm

    For more fun, the best illustrations I've seen of issues that are superbly correlated, yet have no causal relationship are by Tylervigen.com

    He showed that "US spending on science, space & technology" correlated almost perfectly with "Suicides by hanging, strangulation, and suffocation" (correlation r = 0.99789).

    Or "Per capita cheese consumption" correlated very well with "Number of people who died by becoming tangled in their bedsheets." (r=0.947)

    Hence, he illustrates the point by mg, i.e., correlation doesn't equal causation.

    I didn't realize Ngrams was based only on books (and only a subset of them). Thanks.

  9. Andrew Usher said,

    May 5, 2020 @ 6:49 pm

    I assumed 'pangolin' was something unrelated to the currently-obvious connection between the others. Also I imagine he cheated a little by not putting them on exactly the same scale, even though they're real data.

    k_over_hbarc at yahoo dot com

  10. unekdoud said,

    May 7, 2020 @ 7:01 am

    "Pangolin" is explainable as a common automatic suggestion (by Google) or correction (by phones) for "pan-".

    Or maybe, as has been the case for decades, Google knows something we don't.

  11. Victor Mair said,

    May 7, 2020 @ 12:17 pm

    There are lots of reasons why "pangolin" is a trending term nowadays:

    1. Chinese love to eat them

    2. They are on the verge of extinction

    3. They were being sold in the same Wuhan wet market as the COVID-19 bats and implicated by some in the transmission of the bat virus to humans

    4. One of the most popular online China affairs discussion groups is called "Pangolin"

    etc.
