Tortured phrases, LLMs, and Goodhart's Law
A few years ago, I began to notice that the scientific and technical papers relentlessly spammed at me by academia.edu and similar outfits were becoming increasingly surrealistic. And I soon learned that the source of such articles was systems for "article spinning" by "rogeting": automatic random substitution of (usually inappropriate) synonyms. Those techniques were originally developed many years ago for spamdexing, i.e. generating "link farms" of fake pages in order to fool search engine ranking systems by evading simple forms of content similarity detection.
And the same techniques also fool simple systems for plagiarism detection — though the incoherent results are not useful for student papers, at least in cases where instructors actually read the submissions. But the same time period saw the parallel growth of predatory publishing (and analogous developments among generally reputable publishers), and the use of mindless quantitative publication metrics to evaluate researchers, faculty and institutions. The result: an exponential explosion of "tortured phrases" in the scientific, technical, and scholarly literature: "talk affirmation" for "speech recognition", "straight expectation" for "linear prediction", "huge information" for "big data", "gullible Bayes" for "naive Bayes", "irregular woodland" for "random forest", "savvy home" for "smart home", and so on.
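To make the mechanism concrete, here is a minimal sketch of rogeting and of why it defeats a simple similarity check. Everything in it is an illustrative assumption rather than any real spinner's or detector's method: the `SYNONYMS` table is hypothetical and built only from the word pairs quoted above, the substitution is deterministic rather than random so the result is reproducible, and Jaccard overlap of word bigrams stands in for "simple forms of content similarity detection".

```python
import re

# Hypothetical substitution table, assembled from the examples quoted above.
# A real spinner would draw (often at random) from a full thesaurus.
SYNONYMS = {
    "speech": "talk",
    "recognition": "affirmation",
    "linear": "straight",
    "prediction": "expectation",
    "big": "huge",
    "data": "information",
    "naive": "gullible",
    "random": "irregular",
    "forest": "woodland",
    "smart": "savvy",
}


def spin(text):
    """Swap every word found in the table for its synonym, ignoring context."""
    def swap(match):
        word = match.group(0)
        return SYNONYMS.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, text)


def bigrams(text):
    """Return the set of adjacent lowercase word pairs in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return set(zip(words, words[1:]))


def jaccard(a, b):
    """Jaccard overlap of word bigrams: 1.0 for identical texts, near 0 for unrelated ones."""
    x, y = bigrams(a), bigrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


original = "We apply naive Bayes and a random forest to big data from a smart home."
spun = spin(original)

print(spun)
print(round(jaccard(original, spun), 3))
```

The spun sentence comes out as "We apply gullible Bayes and a irregular woodland to huge information from a savvy home." A human reader sees at once that it is the same sentence made incoherent, but most of its word bigrams no longer match the original, so an overlap-based check scores the two texts as largely dissimilar.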