Tortured phrases, LLMs, and Goodhart's Law


A few years ago, I began to notice that the scientific and technical papers relentlessly spammed at me, by and similar outfits, were becoming increasingly surrealistic. And I soon learned that the source of such articles was systems for "article spinning" by "rogeting": automatic random substitution of (usually inappropriate) synonyms. Those techniques were originally developed many years ago for spamdexing, i.e. generating "link farms" of fake pages, in order to fool search engine ranking systems by evading simple forms of content similarity detection.
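To make the mechanism concrete, here is a minimal sketch of rogeting and of why it defeats simple similarity checks. Everything in it is an illustrative assumption: the tiny `SYNONYMS` table, the function names, and the choice of Jaccard similarity over word trigrams as a stand-in for "simple forms of content similarity detection". Real spinners and real detectors are more elaborate.

```python
import random

# Hypothetical toy thesaurus; real spinners draw on far larger synonym lists.
SYNONYMS = {
    "speech": ["talk", "discourse"],
    "recognition": ["affirmation", "acknowledgment"],
    "big": ["huge", "enormous"],
    "data": ["information"],
    "random": ["irregular", "arbitrary"],
    "forest": ["woodland", "timberland"],
}


def roget(text, rng):
    """Replace every word that has a thesaurus entry with a randomly chosen synonym."""
    out = []
    for word in text.split():
        choices = SYNONYMS.get(word.lower())
        out.append(rng.choice(choices) if choices else word)
    return " ".join(out)


def shingles(text, n=3):
    """Return the set of word n-grams (here trigrams) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of the two texts' shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


original = "we apply speech recognition and a random forest to big data"
spun = roget(original, random.Random(0))

print(spun)
print(jaccard(original, original))  # identical texts share every trigram
print(jaccard(original, spun))      # every trigram here contains a swapped word
```

In this toy sentence every trigram includes at least one substituted word, so the spun copy shares no trigrams with the original even though a human reader can still see the correspondence.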

And the same techniques also fool simple systems for plagiarism detection, though the incoherent results are not useful for student papers, at least in cases where instructors actually read the submissions. But the same time period saw the parallel growth of predatory publishing (and analogous developments among generally reputable publishers), and the use of mindless quantitative publication metrics to evaluate researchers, faculty and institutions. The result: an exponential explosion of "tortured phrases" in the scientific, technical, and scholarly literature: "talk affirmation" for "speech recognition", "straight expectation" for "linear prediction", "huge information" for "big data", "gullible Bayes" for "naive Bayes", "irregular woodland" for "random forest", "savvy home" for "smart home", and so on.

Starting half a century earlier, people explored various ways of generating random texts without such phrase-torturing: Dissociated Press in 1972, the Postmodernism Generator in 1996, the Chomskybot in 2002, SCIgen in 2005, the Snarxiv in 2010, etc. But of these, only SCIgen creates complete paper drafts plausible enough to have been widely used in creating papers for submission to conferences and journals.

The growth of Large Language Models has obviously offered new possibilities for link-farm creation, and parallel challenges for search engines — and of course the analogous options for individual authors. Educators are panicking (though occasionally enthusing) about students' use of these programs, in place of more traditional methods for getting others to do their assignments for them. And no doubt techniques based on LLMs will also take over the production of fake research papers, thus systematically eliminating (the main sources of) tortured phrases. So I expect my academic-spam email to become less amusing, though equally annoying.

The details are constantly changing, but the background of these struggles is consistent, namely Goodhart's Law playing out in various domains of evaluation:

When a measure becomes a target, it ceases to be a good measure.

Or in an older and more focused version, Campbell's Law:

The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.

Links for following the process include Retraction Watch and the Problematic Paper Screener. Note that the PPS site offers a useful .csv dump of 3693 "fingerprints" from widely-used rogeting systems.
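As a rough illustration of how such fingerprints can be used, the sketch below flags tortured phrases in a text by whole-phrase matching. The dictionary holds only the six examples quoted earlier in this post. Nothing is assumed about the layout of the actual PPS .csv file, and `find_tortured_phrases` is a hypothetical helper name, not part of any PPS tooling.

```python
import re

# Tortured phrase -> the established term it probably replaced.
# Only the six examples from this post; a real list would be far longer.
FINGERPRINTS = {
    "talk affirmation": "speech recognition",
    "straight expectation": "linear prediction",
    "huge information": "big data",
    "gullible bayes": "naive Bayes",
    "irregular woodland": "random forest",
    "savvy home": "smart home",
}


def find_tortured_phrases(text):
    """Return (tortured phrase, expected term) pairs in order of appearance."""
    # Lowercase and collapse whitespace so line breaks do not hide a phrase.
    normalized = re.sub(r"\s+", " ", text.lower())
    hits = []
    for phrase, expected in FINGERPRINTS.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, normalized):
            hits.append((match.start(), phrase, expected))
    return [(phrase, expected) for _, phrase, expected in sorted(hits)]


sample = (
    "We use an irregular woodland classifier for talk affirmation "
    "on huge information."
)
for phrase, expected in find_tortured_phrases(sample):
    print(f"{phrase!r} probably stands for {expected!r}")
```

Exact matching of this kind only catches phrases already on the list, which is why it works against rogeting but would not be expected to flag fluent LLM-generated text.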

And for a more general discussion of Goodhart's Law in scientific publishing, leaving out the role of disguised plagiarism, see Michael Fire and Carlos Guestrin, "Over-optimization of academic publishing metrics: observing Goodhart’s Law in action", GigaScience 2019.




  1. AntC said,

    June 20, 2023 @ 10:59 am

    "and similar outfits, were becoming increasingly surrealistic."

    Yeah my spam is getting increasingly shrill, like a toddler trying to fake a temper tantrum to get attention.

    Are you the firstname lastname mentioned in a paper on topic-I've-never-had-anything-to-do-with?

    That full name doesn't actually appear. That lastname doesn't appear for any firstname nor initials. That firstname doesn't appear.

    Academia fantasises I have extremely deep knowledge in almost every field. I wish.

  2. J.W. Brewer said,

    June 20, 2023 @ 1:39 pm

    But unlike Goodhart's Law, where a common dynamic is the observer becoming a participant in the situation and thus changing its dynamics without having actually intended to, the software that "rogeting" is aimed at evading is not intended for scientific data-collection but for more focused and practical ends like plagiarism detection. So the assumption should have always been an arms-race dynamic where cheaters learn how not to do the specific thing that cheater-identification software 1.0 can recognize, which in turn leads to the development of cheater-identification software 2.0, which lasts until another generation of cheaters learns how to avoid the thing that tips off that iteration of the software. And so on and so on over the horizon.

  3. David J. Littleboy said,

    June 23, 2023 @ 8:57 am

    Only vaguely related here, but the news today has it that the folks outsourcing new data for the LLMs have noticed that the folks they are hiring are using LLMs to generate text, which they then charge the LLM companies for as though those latter folks had actually written it.

    Meaning that LLM output is likely to get flakier over time.
