Tortured phrases: Degrading the flag to clamor proportion

Guillaume Cabanac, Cyril Labbé & Alexander Magazinov, "'Bosom peril' is not 'breast cancer': How weird computer-generated phrases help researchers find scientific publishing fraud", Bulletin of the Atomic Scientists, 1/13/2022:

In 2020, despite the COVID pandemic, scientists authored 6 million peer-reviewed publications, a 10 percent increase compared to 2019. At first glance this big number seems like a good thing, a positive indicator of science advancing and knowledge spreading. Among these millions of papers, however, are thousands of fabricated articles, many from academics who feel compelled by a publish-or-perish mentality to produce, even if it means cheating. […]

We have been able to spot fraudulent research thanks in large part to one key tell that an article has been artificially manipulated: The nonsensical “tortured phrases” that fraudsters use in place of standard terms to avoid anti-plagiarism software. Our computer system, which we named the Problematic Paper Screener, searches through published science and seeks out tortured phrases in order to find suspect work. While this method works, as AI technology improves, spotting these fakes will likely become harder, raising the risk that more fake science makes it into journals.

As of January 2022, we’ve found tortured phrases in 3,191 peer-reviewed articles published (and counting), including in reputable flagship publications.

See also (by the same authors) "Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals", arXiv.org 7/12/2021.

You can explore their results at the Problematic Paper Screener site — put a word or phrase into the search box and check out the hits.

For example, the most recent hit for "speech" is a book chapter published by Springer Nature in 2021, "Comparative Analysis of GUI-Based Prediction of Parkinson Disease by Speech Using Machine Learning Approach". Here are the 22 "tortured phrases" that their software discovered in that document:

* **Alzheimer's infection** instead of the established _‘Alzheimer's disease’_
* **Parkinson's infection** instead of the established _‘Parkinson's disease’_
* **Parkinson's sickness** instead of the established _‘Parkinson's disease’_
* **R2 esteem** instead of the established _‘R2 value’_
* **arbitrary backwoods** instead of the established _‘random forest’_
* **backing vector machine** instead of the established _‘support vector machine (SVM)’_
* **dimensionality decrease** instead of the established _‘dimensionality reduction’_
* **fake neural** instead of the established _‘artificial neural (network)’_
* **fluffy C implies** instead of the established _‘fuzzy C-means’_
* **gullible Bayes** instead of the established _‘naive Bayes’_
* **head part investigation** instead of the established _‘principal component analysis (PCA)’_
* **help vector machine** instead of the established _‘support vector machine (SVM)’_
* **inertial estimation unit** instead of the established _‘inertial measurement unit’_
* **invulnerable framework** instead of the established _‘immune system’_
* **man-made consciousness** instead of the established _‘artificial intelligence’_
* **mean outright mistake** instead of the established _‘absolute error’_
* **molecule swarm** instead of the established _‘particule swarm’_
* **profound neural organization** instead of the established _‘deep neural network’_
* **square mistake** instead of the established _‘(mean) squared error’_
* **squared blunder** instead of the established _‘(mean) squared error’_
* **fluffy AND fuzzy** instead of the established _‘fuzzy (logics)’_
* **irregular subspace** instead of the established _‘random subspace (caveat: ‘irregular subspace’ is a regular term in anomaly detection and its applications doi:10.1109/TPWRS.2012.2224144)’_
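
I don't know how the Problematic Paper Screener is implemented internally, but the basic idea of searching for known tortured phrases can be sketched in a few lines of Python. The `TORTURED` dictionary and the `find_tortured` function below are hypothetical names of my own; the entries are copied from the list above, and the matching rule (case-insensitive, tolerant of extra spaces or hyphens between words) is an assumption rather than anything the authors describe.

```python
import re

# Hypothetical mini-dictionary: a handful of entries copied from the list above.
TORTURED = {
    "arbitrary backwoods": "random forest",
    "gullible bayes": "naive Bayes",
    "help vector machine": "support vector machine",
    "head part investigation": "principal component analysis",
    "profound neural organization": "deep neural network",
}


def find_tortured(text):
    """Return (tortured phrase, established term) pairs in order of appearance."""
    hits = []
    for phrase, established in TORTURED.items():
        # Allow any run of whitespace or hyphens between the words of a phrase.
        pattern = r"\b" + r"[\s-]+".join(re.escape(w) for w in phrase.split()) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), phrase, established))
    # Sort by position in the text, then drop the position.
    return [(phrase, established) for _, phrase, established in sorted(hits)]


sample = "We compared Arbitrary Backwoods with gullible  Bayes classifiers."
for phrase, established in find_tortured(sample):
    print(f"{phrase!r} probably stands for {established!r}")
```

A lookup like this only catches what is already in the dictionary: run on a phrase such as "explicitness and affectability", it returns nothing.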

But there are plenty of other substitutions that are apparently not on their list yet. Here's the chapter's abstract:

Parkinson’s ailment is the most predominant neurodegenerative issue influencing in excess of 10 million individuals around the world. There is no single test that can be directed for diagnosing Parkinson’s illness. Due to these troubles, to research an AI way to deal with precisely analyze Parkinson’s, utilizing a given dataset. To forestall this issue in medicinal divisions need to foresee the sickness influenced or not by discovering exactness figuring utilizing AI systems. The point is to research AI-based systems for Parkinson ailment by expectation brings about the best precision with discovering arrangement reports. The examination of a dataset by regulated AI technique (SMLT) to catch a few data resembles variable recognizable proof, uni-variate investigation, bi-variate, and multi-variate examination, missing worth medications and break down the information approval, information cleaning/getting ready and information representation will be done on the whole given dataset. To propose, an AI-based strategy to precisely anticipate the illness by discourse side effect by forecast brings about the type of best exactness and also analyze the presentation of different AI calculations from the given medical clinic dataset with assessment arrangement report, distinguish the outcome shows that GUI with the best exactness with accuracy, Recall, F1 Score explicitness and affectability.

For example, the final phrase "explicitness and affectability" is a tortured substitution for the standard "specificity and sensitivity".

The whole paper is pretty much incomprehensible without diagnosing and inverting those substitutions. The very first sentence is

Human stride is the procedure of motion accomplished through facilitated appendage development and the controlled removal of the person’s focal point of mass.

I'm not sure what "facilitated appendage development" and "controlled removal" substitute for, but "focal point of mass" is clearly "center of mass". Later in that same paper, in addition to the 22 tortured phrases identified by the Problematic Paper Screener, we see "various sclerosis" for "multiple sclerosis", "blunder pace" for "error rate", "Bolster Vector Machines" for "Support Vector Machines", and on and on.

It's a shock to me that Springer Nature puts this stuff out — but a few seconds of search show that this esteemed (?) publisher is not an isolated case, with Wiley, IEEE, and Elsevier also among the guilty parties.

The tortured phrase "flag to clamor" (substituting for "signal to noise") gets 94 Google Scholar hits, for example this 2017 IEEE publication, whose abstract ends

Preliminary results shows a quite improvement in compression ratio, mean square blunder and the pinnacle flag to clamor proportion (PSNR).

For readers not from the relevant fields, "PSNR" would normally be "peak signal to noise ratio" rather than "pinnacle flag to clamor proportion". Note that the authors retain the correct initialism, which no longer matches the thesaurusized phrase. There are many similar examples Out There, such as "alluring resonation imaging (MRI)" from this IEEE publication (along with "friendly" tumor for "benign" tumor):

Tumors were put using ultrasound mixes of prepared tomography (CT channel), alluring resonation imaging (MRI), and possibly nuclear imaging. A biopsy is performed to make sense of on the off chance that it is a friendly or risk tumor.
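
The mismatch between a retained initialism and its thesaurusized expansion suggests a second, cruder check that needs no dictionary at all. What follows is only a sketch under my own assumptions, not anything the Problematic Paper Screener is known to do: the function name `acronym_mismatches`, the small `STOPWORDS` set, and the rule "compare the acronym with the initials of the content words just before the parenthesis" are all hypothetical, and a real text would produce plenty of false alarms (acronyms that were never meant to match their expansion letter for letter).

```python
import re

# Assumed list of small words that acronyms usually skip ("signal TO noise").
STOPWORDS = {"to", "of", "and", "the", "for", "in"}


def acronym_mismatches(text):
    """Return (preceding words, acronym) pairs whose initials do not spell the acronym."""
    mismatches = []
    # Only consider parenthesized all-caps acronyms such as "(PSNR)" or "(MRI)".
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        content = [w for w in words if w.lower() not in STOPWORDS]
        # Take as many content words as the acronym has letters.
        tail = content[-len(acronym):]
        initials = "".join(w[0].upper() for w in tail)
        if initials != acronym:
            mismatches.append((" ".join(tail), acronym))
    return mismatches


print(acronym_mismatches("the pinnacle flag to clamor proportion (PSNR)"))
print(acronym_mismatches("the peak signal to noise ratio (PSNR)"))
```

The first call flags "pinnacle flag clamor proportion" against PSNR; the second returns an empty list, because the initials of "peak signal noise ratio" do spell PSNR.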

Update — Many of the "tortured phrases" are not obvious except in context (and sometimes not even then). For example, the first sentence of this Springer Nature publication is

This examination keeps an eye on the issues of division of strange personality tissues and custom tissues, for instance, dim issue, white issue, and cerebrospinal fluid from alluring resonation pictures using feature extraction framework and support vector machine classifier.

I'm pretty sure that "strange personality tissues and custom tissues" are thesaurusized substitutions, but I'm not sure what the sources were. But in that context, "dim issue" is probably "grey matter", "white issue" is "white matter", and "alluring resonation" is "magnetic resonance" :-)…

Update #2 — In the comments, Alexander Magazinov finds the source of (the first paragraph of) the work I cited. The first paragraph of "Comparative Analysis of GUI-Based Prediction of Parkinson Disease by Speech Using Machine Learning Approach", Springer Nature 2021, is plagiarized (with gibberish-inducing thesaurusizing) from Alexander Turner and Stephen Hayes, "The Classification of Minor Gait Alterations Using Wearable Sensors and Deep Learning",  IEEE Transactions on Biomedical Engineering  66, 2019. I suspect that the rest of the chapter is stolen by similar means from that and (perhaps) other sources. This plagiarized gibberish is one of 69 chapters in Advances in Power Systems and Energy Management, Springer Nature 2021.

30 Comments

  1. anne said,

    March 22, 2022 @ 7:15 am

    I'm a little naive and uneducated in those matters, but how is that peer-reviewed?

    [(myl) A very good question, to which I don't really know the answer. Some of the examples from supposedly-reputable sources such as Springer Nature, IEEE, Elsevier, Wiley, etc., are derived from conference or workshop proceedings, where the publisher presumably cedes "peer review" control on a one-time basis to the editors, who must be either asleep at the wheel or culpable in the fraud.

    Such one-time collections presumably represent a somewhat different situation from publication in a journal, where the iterated nature of the source publication means that reputational damage is a bigger deal, both for the editors and for the publisher. (Though there are plenty of examples of "tortured phrases" in journals as well, and of course the editors also have control in that case as well.)

    Why don't the publishers note these transparent collections of gibberish? Presumably they never look, perhaps because they have a financial incentive not to. And so far, they're not suffering any (financially consequential) reputational damage as a result.]

  2. Jenny Chu said,

    March 22, 2022 @ 7:34 am

    It makes me rather sad that – even given the extraordinarily low standards exhibited by these publications – I have yet to publish a single "peer reviewed" paper in a reputable academic journal. I shall content myself, I suppose, with publishing pithy comments on academically-inclined blogs.

  3. Aaron said,

    March 22, 2022 @ 7:39 am

    Some of these remind me of Strange Planet, a webcomic all about phrasing things in unexpected ways. "Flag to clamor proportion" is exactly how SP's alien characters would talk about signal to noise ratio!

  4. unekdoud said,

    March 22, 2022 @ 10:11 am

    Peer-reviewed? I think you mean…
    *nervously fumbles with thesaurus*
    look-seen.

  5. David Marjanović said,

    March 22, 2022 @ 10:25 am

    It's a shock to me that Springer Nature puts this stuff out —

    Proofreading in scientific publishing is entirely left to the authors; there's even at least one publisher (PLOS) that doesn't make page proofs at all. Copyediting, when it exists, is done by people who barely speak English and introduce lots of mistakes that you're supposed to catch in proofreading and explain to the hapless copyeditors.

    It's an oligopoly. Almost all academic publishing belongs to Springer Nature, Elsevier, Informa or Wiley. They all have profit margins that sometimes surpass 40%.

  6. Rodger Cunningham said,

    March 22, 2022 @ 10:29 am

    thesaurusized

    Thesaurized? Me thesaurizete thesaurous epi ten gen …

  7. Rodger C said,

    March 22, 2022 @ 10:32 am

    More to the point, in my later teaching career I'd get undergrad papers (English department) with whole paragraphs reading like this. The students either never read the result of their software use or couldn't tell the difference.

  8. /df said,

    March 22, 2022 @ 11:04 am

    Fortunately:

    "Springer Nature offers a variety of licensing options for institutions of all sizes in the academic, government and corporate sector. "

    Is there a refund for junk papers?

    Decades ago, it was the social sciences that were satirised for this, albeit the junk had to be created by hand: https://en.wikipedia.org/wiki/Sokal_affair

    Then AI was deployed to finger low quality scientific journals and conferences: https://news.mit.edu/2015/how-three-mit-students-fooled-scientific-journals-0414

    Now we all can haz junk academia.

    As in the MIT link, the answer to automated junk papers is automated paper reviewing and it looks like the training set is quite extensive.

  9. Rodger C said,

    March 22, 2022 @ 11:09 am

    And of course I meant "epi tes ges." Good morning …

  10. Rodger C said,

    March 22, 2022 @ 11:10 am

    Aand, my previous comment has disappeared. *wanders off*

  11. Terpomo said,

    March 22, 2022 @ 1:51 pm

    Rodger, wasn't αυ likely pronounced /aβʷ/ by the time the New Testament was written? Why transliterate it as au, unless you're just transliterating mechanically by letters rather than sound?

  12. Philip Barnett said,

    March 22, 2022 @ 3:06 pm

    Of course most of the papers being discussed here are disguised plagiarism, but perhaps a small fraction of these questionable papers are genuine, but written by people who have a limited command of English, because English is not their native language. In one of my previous jobs, I had to look at papers written in foreign languages, but the publishers of these journals required English language abstracts. Often the English in these abstracts was so pathetic it was almost humorous.

  13. Rodger C said,

    March 22, 2022 @ 3:57 pm

    Terpomo, I'm not very up on the precise dating of the changes in Greek pronunciation between Pericles and Justinian, but anyhow I was trying to emphasize my not-very-serious suggestion that "thesaurusize" should be "thesaurize." Mi thisavrizete thisavrous epi tis yis!

  14. JPL said,

    March 22, 2022 @ 6:12 pm

    @Jenny Chu said:
    "I shall content myself, I suppose, with publishing pithy comments on academically-inclined blogs."

    I had to laugh at that one. Yeah, that's where it's at, so thank you! On the other hand, the producers of such artificial text-imitations are capable of neither pithy comments nor unique thoughts.

    It's funny that you have here what looks like natural texts, but without any intended message whatsoever; yet the reader is left to puzzle about the interpretations of the expressions, their "what is expressed". I think you could say that these (I'm talking about the whole so-called "articles", but not the originals) are examples of what John Lyons called "system-sentences", expressions purely generated by the syntactic rules, but without any reason for being in any actual speech situation. (Even Alexa's sentences can be interpreted as more or less effective responses to one's requests.)

    So it looks like "publish or perish" as a universal incentive is not a good idea, if it results in so many articles that nobody ever reads and that nobody even intended to be read. On the other hand, we recently had the example of Terrence Kaufman who didn't play this game, but who any linguistics department would absolutely cherish as a colleague.

  15. D.O. said,

    March 22, 2022 @ 7:00 pm

    My guesses: "facilitated appendage development" = "coordinated limb motion" and
    "controlled removal" = "controlled displacement". The whole sentence should probably read close to
    "A walk is the systematic movement accomplished by coordinated limb motion and the controlled displacement of the person’s center of gravity".

  16. Andreas Johansson said,

    March 23, 2022 @ 2:14 am

    Back in the equivalent of junior high, we often had homework assignments along the lines of "summarize ch. 5 in your own words". Some of my classmates would do this by taking the summary at the start or end of the chapter and rewriting it sentence by sentence, changing words for synonyms and the like. Always struck me as more work than doing it properly. The teachers never seemed to catch on, or perhaps just didn't care.

  17. Alexander Magazinov said,

    March 23, 2022 @ 4:54 am

    Hi,

    It wasn't an easy task at all to decipher the text about "facilitated appendage development," but I eventually succeeded.

    Vasudha Reddy, G., Deepika, G., Jesudoss, A.

    Human stride is the procedure of motion accomplished through facilitated appendage development and the controlled removal of the person’s focal point of mass. Step is a mind-boggling dynamic procedure comprising of numerous associating components over shifting time scales. Anomalies in stride are a phenotype predominant to numerous disarranges with causes going from neurological infection, mind harm, physical inabilities, or mixes thereof. The loss of walk capacity and its impact on portability can be of huge inconvenience to an individual’s personal satisfaction. The determination and treatment of such an issue are fundamental to save or improve a person’s portability. Unusual step work is frequently analyzed by expert clinicians utilizing a mix of past determinations, walk work perception, hereditary information, MRI, CT, and general wellbeing.

    And here you go – A. Turner and S. Hayes, "The Classification of Minor Gait Alterations Using Wearable Sensors and Deep Learning," https://core.ac.uk/download/346700129.pdf

    HUMAN gait is the process of locomotion achieved through coordinated limb movement and the controlled displacement of the individuals centre of mass. Gait is a complex dynamic process consisting of multiple interacting elements over varying time scales. Abnormalities in gait are a phenotype prevalent to multiple disorders with causes ranging from neurological disease, brain damage, physical disabilities or combinations thereof. The loss of gait function and its effect on mobility can be of significant detriment to a person’s quality of life. The diagnosis and treatment of such disorders is essential to preserve or improve an individual’s mobility. Abnormal gait function is often diagnosed by specialist clinicians using a combination of previous diagnoses, gait function observation, genetic data, MRI, CT and overall health.

  18. David Marjanović said,

    March 23, 2022 @ 8:36 am

    perhaps a small fraction of these questionable papers are genuine, but written by people who have a limited command of English, because English is not their native language.

    It's extremely unlikely that such people would use a thesaurus. I've reviewed manuscripts written in a very Chinese English, and they don't look like that at all.

    [(myl) Also, as Alexander Magazinov shows in the previous comment, the serious thesaurus-salad examples can be traced back to the source they were stolen from.]

  19. Alexander Magazinov said,

    March 23, 2022 @ 10:57 am

    @David Marjanović

    I know only one paper where tortured phrases found by Guillaume Cabanac's PPS are true positives (i.e., they are really incorrect replacements for well-established terms) and where they appeared because the authors are non-English speakers.

    Here it is, commented on PubPeer – the purpose of the comment was to notify the authors so that they could use correct terms in their future publications.

    https://pubpeer.com/publications/8E501D53D361F5266E02B72DF5E54A

    The English in that paper is overall quite coherent, very dissimilar to the paraphrased nonsense exemplified in this blog.

  20. Lance said,

    March 23, 2022 @ 12:50 pm

    So thrilled to see this happening in the sciences. (Sarcasm off.) I first encountered this phenomenon in web-scraped data in 2013, though I'm sure it was older than that. I think the source of it then was less "we're plagiarizing text in order to publish it as our own" than it was scam websites that would try to increase their Google ranking by making fake websites with text that linked to their site. More links = more relevant to people searching; but since Google would discount multiple copies of the same text, the sites would use this thesaurus-substitution to make the text they were copying look like a "different" text.

    Insofar as the data I happened to see this in was webpages about pregnant women, it meant that a solid chunk of the data involved things like the book What to Anticipate When You're Anticipating. But my absolute favorite was the substitution of "a single" for "one", which is really clever and not at all meaning-changing when you replace "one mother" with "a single mother".

  21. Philip Taylor said,

    March 23, 2022 @ 3:16 pm

    Tangential to the main thread, but inspired by Lance's comment (above): is it known why "In so far as" is frequently abbreviated to "Insofar as" but never (as far as I know) to "Insofaras" ? It seems odd (crazy !) to me that we should happily elide the first three words, for no apparent reason, yet flatly refuse to elide the fourth.

  22. Lance said,

    March 23, 2022 @ 3:24 pm

    is it known why "In so far as" is frequently abbreviated to "Insofar as" but never (as far as I know) to "Insofaras" ?

    I mean, I'm not abbreviating anything when I use it; I'm just using a word, "insofar", that Merriam-Webster tells me has existed since 1596. (Or perhaps the single lexical item "insofar as", which MW has a listing for; it does not have one for "in so far as".)

  23. Philip Taylor said,

    March 23, 2022 @ 3:43 pm

    So when you (Lance) write "insofar as", the first element is just a normal word to you ? I ask because it is not to me, as I perceive it as a clear elision of "in so far" but have no sense as to why it has been elided (or why the fourth element stubbornly refuses to elide).

    O.E.D. Suppl. (1976) says of the spelling in so far: ‘still conventionally written thus ( Hart's Rules for Compositors, ed. 37, 1967, p. 75) but also frequently as a single word or with hyphens’; however, solid spelling is now nearly twice as frequent.

    [(myl) "Elide" and "elision" are odd terms to use for the choice to write certain common word sequences "solid", i.e. omitting the internal spaces. "In so far" has certainly not been elided, though the phrase's internal spaces have been.

    In any case, the typographical choices here seem roughly the same as those behind "alongside of" but not "*alongsideof". ]

  24. John Swindle said,

    March 24, 2022 @ 5:24 am

    The software for putting text through the thesaurus seems to be called a "paraphrasing tool." Google finds many of them, some promising to improve your writing, others promoting search-engine optimization. At least one is free and can be used online, producing fairly conservative results. I haven't tried it on scientific jargon.

    "… It is or maybe for us, the living, we here be committed to the awesome errand remaining some time recently us that, from these honored dead we take expanded dedication to that cause for which they here, gave the final full degree of commitment that we here profoundly resolve these dead should not have kicked the bucket in unsuccessful; that the country, should have a modern birth of opportunity, which government of the individuals, by the individuals, for the individuals, might not die from the earth."

  25. David Marjanović said,

    March 24, 2022 @ 11:04 am

    German not only has insofern, als, but also the question for it: inwiefern "in what way, to what extent, how, what do you mean".

    Like in English, the construction is not productive: *in so or *in wie don't otherwise occur. That drives lexicalization.

  26. Peter Grubtal said,

    March 24, 2022 @ 12:55 pm

    Philip Taylor

    "Tangential to the main thread " – euphemism of the week?
    But seriously, I do appreciate your comments – I think we have a lot in common, but this thread deals with a pretty serious matter – I'd like to see it developed and not derailed.

  27. Philip Taylor said,

    March 24, 2022 @ 5:48 pm

    I have no desire to derail the thread, Peter, so I will say no more. If the powers-that-be deem "insofar as" worthy of further discussion, then I will make any further comments in a thread dedicated to that and analogous topics.

  28. John Swindle said,

    March 25, 2022 @ 12:57 am

    In so distant because it may be extraneous to the most string.

  29. amy said,

    March 27, 2022 @ 6:59 pm

    "Article spinning" is what this type of thesaurising is usually called.

    That said, some of these are actually amusing and catchy enough to become memes. I particularly enjoyed "gullible Bayes", "arbitrary backwoods", and "mean outright mistake".

  30. John Swindle said,

    March 29, 2022 @ 4:22 am

    And the software is an "article spinner." Thank you! There seem to be various synonyms, all of course in the spirit of the thing. And something called an "article rewriter," alluded to in the BAS article as "AI models like GPT-2." One source explains that their article rewriter "changes the actual meaning of the given text" and "[m]akes text unique and plagiarism-free." It's easy to see how useful that would be for the advancement of science.