Tortured phrases


Article by Holly Else in Nature (8/5/21):

"‘Tortured phrases’ give away fabricated research papers

Analysis reveals that strange turns of phrase may indicate foul play in science"

Here are the beginning and a few other selected portions of the article:

In April 2021, a series of strange phrases in journal articles piqued the interest of a group of computer scientists. The researchers could not understand why researchers would use the terms ‘counterfeit consciousness’, ‘profound neural organization’ and ‘colossal information’ in place of the more widely recognized terms ‘artificial intelligence’, ‘deep neural network’ and ‘big data’.

Further investigation revealed that these strange terms — which they dub “tortured phrases” — are probably the result of automated translation or software that attempts to disguise plagiarism. And they seem to be rife in computer-science papers.

Research-integrity sleuths say that Cabanac* and his colleagues have uncovered a new type of fabricated research paper, and that their work, posted in a preprint on arXiv on 12 July, might expose only the tip of the iceberg when it comes to the literature affected.

[*VHM:  Guillaume Cabanac, a computer scientist at the University of Toulouse, France]

Scientific term            Tortured phrase

Big data                   Colossal information
Artificial intelligence    Counterfeit consciousness
Deep neural network        Profound neural organization
Remaining energy           Leftover vitality
Cloud computing            Haze figuring
Signal to noise            Flag to commotion
Random value               Irregular esteem

Suspecting that the tortured phrases are the result of automated translation or software that rewrites existing text, Cabanac and colleagues ran a selection of abstracts from Microprocessors and Microsystems and other journals through a tool that can identify whether texts have been generated by the artificial-intelligence tool GPT. Of the Microprocessors and Microsystems papers flagged by the tool, manual checks revealed “critical flaws” in some of them, such as nonsensical text, as well as plagiarized text and images.

To dig deeper, the group downloaded all papers published in Microprocessors and Microsystems between 2018 and 2021, a time frame they chose because an upgraded version of GPT was released in 2019. They identified around 500 “questionable articles” based on various factors. Their analysis revealed that papers published after February 2021 had an acceptance time that was five times shorter, on average, than those published before that date. A high proportion of these papers came from authors in China. And a subset of papers had identical submission, revision and acceptance dates, the majority of which appeared in special issues of the journal. This is suspicious, the authors say. Unlike standard issues, overseen by the editor-in-chief, special issues are usually proposed and overseen by a guest editor, and focus on a specific area of research.

The sentence that I have highlighted ("A high proportion of these papers came from authors in China.") is the only mention of China, but I dare say, as a long-term investigator of Chinglish, that it is easy to recognize this technique of using awkward, machine-translated phraseology as a specialty of writers from China. Phrases such as "Flag to commotion" for "Signal to noise" and "Irregular esteem" for "Random value" just reek of Chinglish.

The article goes on for nine more paragraphs and offers much additional information about the modus operandi of these machine-assisted plagiarists and the means whereby they are investigated and identified.
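For readers who want to see how mechanical this kind of substitution can be, here is a minimal Python sketch built only from the seven phrase pairs quoted in the table above. It is an illustration, not the tool Cabanac and his colleagues used, nor any real paraphrasing product: the names TORTURED, spin and flag are hypothetical, and a real system would need a far larger phrase list. The first function swaps established terms for their tortured counterparts; the second scans a text for the tortured forms and reports the term each one probably replaced.

```python
import re

# Phrase pairs copied from the table quoted in the post
# (established term -> tortured phrase). Hypothetical, tiny lexicon.
TORTURED = {
    "big data": "colossal information",
    "artificial intelligence": "counterfeit consciousness",
    "deep neural network": "profound neural organization",
    "remaining energy": "leftover vitality",
    "cloud computing": "haze figuring",
    "signal to noise": "flag to commotion",
    "random value": "irregular esteem",
}

# Reverse lookup: tortured phrase -> established term.
REVERSE = {tortured: term for term, tortured in TORTURED.items()}


def _pattern(phrases):
    """Compile one case-insensitive regex matching any phrase on word boundaries."""
    ordered = sorted(phrases, key=len, reverse=True)  # prefer longer phrases
    alternatives = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


_SPIN_RE = _pattern(TORTURED)
_FLAG_RE = _pattern(REVERSE)


def spin(text):
    """Toy substitution: replace each established term with its tortured phrase."""
    return _SPIN_RE.sub(lambda m: TORTURED[m.group(1).lower()], text)


def flag(text):
    """Return (tortured phrase, likely original term) pairs in order of appearance."""
    return [
        (m.group(1).lower(), REVERSE[m.group(1).lower()])
        for m in _FLAG_RE.finditer(text)
    ]


if __name__ == "__main__":
    print(spin("We apply a deep neural network to big data."))
    print(flag("Haze figuring improves the flag to commotion ratio."))
```

A fixed phrase list of this kind can only catch substitutions that someone has already noticed and written down, so it says nothing about a text that contains none of them.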

 


28 Comments

  1. Dick Margulis said,

    August 14, 2021 @ 6:27 am

    Those phrases would seem to be the product of article spinning (https://en.wikipedia.org/wiki/Article_spinning), which is an English-to-English plagiarism technique. No translation required.

  2. Marnie said,

    August 14, 2021 @ 6:51 am

    I've seen these since long before GPT even existed. There are knockoff blogs and news portals, typically link farms with some "content" in this style for SEO, and it's usually not hard to find the original text that makes so much more sense. It's not GPT or any other artificial intelligence, and it's not Chinglish either. It's some very simplistic software that does synonym substitution.

  3. David Marjanović said,

    August 14, 2021 @ 7:28 am

    So rather than using a computer program to translate the papers, they used a program to throw a thesaurus at it.

    This reminds me of a tale from the good old days when every professor had a secretary, the simultaneously bad old days when professors didn't even know how to touch-type. Professors would write their manuscripts by hand and give them to their secretary to type. One such secretary didn't know that statistically significant was a fixed technical term. Finding it repeated over and over again, she introduced some "elegant variation" and replaced most instances by such things as numerically interesting.

  4. Bathrobe said,

    August 14, 2021 @ 8:25 am

    I came across a similar phenomenon when trying to buy Tess of the D'Urbervilles from Amazon (Kindle edition). There were a lot of relatively cheap editions available, but when I looked inside I found they were written in exactly the style described here: it wasn't Hardy; it was Hardy completely transformed into semi-intelligible text by substituting loose synonyms for the original vocabulary. The effect was peculiar in the extreme. I noticed that many of these counterfeit editions appeared to come from India. It concerned me so much that Amazon was listing and charging money for these shoddy fakes that I sent them a message informing them of the problem.

    The fakes now seem to have disappeared — thankfully, although I'm deprived of the pleasure of passing on to readers some of the barbarous English that was being fobbed off onto unsuspecting consumers.

  5. Bathrobe said,

    August 14, 2021 @ 8:56 am

    Fortunately I happened across another example of the same problem. This is the opening of Oliver Twist, as offered by Amazon Australia for $5.95 (https://www.amazon.com.au/Oliver-Twist-Charles-Dickens-ebook/dp/B09CKN5ZXN/ref=sr_1_16?dchild=1&keywords=Oliver+Twist&qid=1628947746&s=digital-text&sr=1-16&asin=B09CKN5ZXN&revisionId=40fee65&format=1&depth=1)

    Among other public buildings in a sure city, which for many reasons it is going to be prudent to refrain from citing, and to which I will assign no fictitious call, there's one anciently not unusual to most towns, awesome or small: to wit, a workhouse; and on this workhouse turned into born; on a day and date which I need not trouble myself to repeat, inasmuch as it is able to be of no feasible result to the reader, in this degree of the business in any respect events; the item of mortality whose call is prefixed to the top of this bankruptcy.

    For a long time after it become ushered into this global of sorrow and problem, by way of the parish health care provider, it remained a rely of enormous doubt whether the kid would live on to undergo any name at all; in which case it's miles truly more than likely that these memoirs would never have regarded; or, in the event that they had, that being comprised within multiple pages, they could have possessed the inestimable advantage of being the maximum concise and trustworthy specimen of biography, extant inside the literature of any ageoru.S.A..

    Although I am no longer disposed to maintain that the being born in a workhouse, is in itself the maximum fortunate and enviable situation which can probably befall a human being, I do imply to mention that in this specific example, it changed into the satisfactory issue for Oliver Twist that would through possibility have came about. The fact is, that there was giant issue in inducing Oliver to take upon himself the office of breathing,—a difficult practice, but one which custom has rendered important to our clean life; and for some time he lay gasping on a little flock mattress, rather unequally poised among this world and the subsequent: the stability being decidedly in favour of the latter. Now, if, during this short period, Oliver were surrounded via careful grandmothers, disturbing aunts, experienced nurses, and medical doctors of profound knowledge, he might most necessarily and unquestionably have been killed in no time. There being no person by way of, but, however a pauper old lady, who turned into rendered instead misty by way of an unwonted allowance of beer; and a parish doctor who did such matters by contract; Oliver and Nature fought out the factor among them.
    The end result become, that, after a few struggles, Oliver breathed, sneezed, and proceeded to market it to the inmates of the workhouse the reality of a brand new burden having been imposed upon the parish, via putting in as loud a cry as ought to fairly were expected from a male toddler who had not been possessed of that very useful appendage, a voice, for a much longer area of time than 3 minutes and a quarter.

    For the original Oliver Twist, see the next post.

  6. Bathrobe said,

    August 14, 2021 @ 9:02 am

    Among other public buildings in a certain town, which for many reasons it will be prudent to refrain from mentioning, and to which I will assign no fictitious name, there is one anciently common to most towns, great or small: to wit, a workhouse; and in this workhouse was born; on a day and date which I need not trouble myself to repeat, inasmuch as it can be of no possible consequence to the reader, in this stage of the business at all events; the item of mortality whose name is prefixed to the head of this chapter.

    For a long time after it was ushered into this world of sorrow and trouble, by the parish surgeon, it remained a matter of considerable doubt whether the child would survive to bear any name at all; in which case it is somewhat more than probable that these memoirs would never have appeared; or, if they had, that being comprised within a couple of pages, they would have possessed the inestimable merit of being the most concise and faithful specimen of biography, extant in the literature of any age or country.

    Although I am not disposed to maintain that the being born in a workhouse, is in itself the most fortunate and enviable circumstance that can possibly befall a human being, I do mean to say that in this particular instance, it was the best thing for Oliver Twist that could by possibility have occurred. The fact is, that there was considerable difficulty in inducing Oliver to take upon himself the office of respiration,—a troublesome practice, but one which custom has rendered necessary to our easy existence; and for some time he lay gasping on a little flock mattress, rather unequally poised between this world and the next: the balance being decidedly in favour of the latter. Now, if, during this brief period, Oliver had been surrounded by careful grandmothers, anxious aunts, experienced nurses, and doctors of profound wisdom, he would most inevitably and indubitably have been killed in no time. There being nobody by, however, but a pauper old woman, who was rendered rather misty by an unwonted allowance of beer; and a parish surgeon who did such matters by contract; Oliver and Nature fought out the point between them. The result was, that, after a few struggles, Oliver breathed, sneezed, and proceeded to advertise to the inmates of the workhouse the fact of a new burden having been imposed upon the parish, by setting up as loud a cry as could reasonably have been expected from a male infant who had not been possessed of that very useful appendage, a voice, for a much longer space of time than three minutes and a quarter.

  7. john burke said,

    August 14, 2021 @ 9:22 am

    "Bankruptcy" for "chapter" is nice. Seven? Thirteen? Poor Dickens failed to specify.

  8. Daniel Milton said,

    August 14, 2021 @ 9:32 am

    What’s the point? For classics in the public domain wouldn’t it be simpler and just as profitable to simply reproduce the text?

  9. Gregory Kusnick said,

    August 14, 2021 @ 10:38 am

    Just guessing here, but it seems possible that even if the original work is in the public domain, the digital edition may be entitled to some form of legal protection against piracy. So these synonymized editions are an attempt to get around that. The tradeoff then is between producing a new scan from public-domain source material, or using an automated tool to obfuscate a pirated text.

  10. Robert Coren said,

    August 14, 2021 @ 10:39 am

    @David Marjanović: This reminds me of a sentence appearing in one of Samuel Hopkins Adams's Grandfather Stories which I read (and reread, and reread) as a child, in which the titular grandfather (who disapproves of gambling) refers to "the reprehensible game of draw foxes". This puzzled me, as I had never heard of such a game; it was only on rereading it as an adult that I realized that "foxes" was surely a typist's misreading of Adams's handwritten "poker".

  11. jhh said,

    August 14, 2021 @ 10:43 am

    Following…

  12. J.W. Brewer said,

    August 14, 2021 @ 1:01 pm

    It would be grimly amusing if a hapless foreign academic trying to write a publishable article in English with fairly weak English fluency had been exposed to enough of these as to have gotten the misimpression that e.g. "colossal information" really was an accepted term of art in scholarly writing on the topic, and thus added "tells" of plagiarism to their draft without actually having engaged in plagiarism.

  13. Andrew Usher said,

    August 14, 2021 @ 2:11 pm

    This isn't a matter of just using those strange synonyms; it's a matter of being fake through and through, and in a manner that should be obvious to any human reader. The fact that it took so long to notice would seem to prove that no one really is reading these journals.

    The fact that the publishers can get away with this is the worst aspect of the whole matter – even if blatantly caught they can just issue a retraction, perhaps dismiss one editor for the sake of form, and just keep on going – they don't suffer the penalties for 'academic misconduct'.

  14. Brett said,

    August 14, 2021 @ 3:06 pm

    @J.W. Brewer: I recently refereed a paper which I am quite certain was not an automated plagiarist translation of some other work; yet it contained several instances of these kinds of erroneous translations of technical terms. Quoting from my review:

    At another point, the paper states: "This is commonly known as random degeneration"; whereas the correct term is "accidental degeneracy."

    I'm not sure where the (Ukrainian) authors got these expressions, but I would not rule out the possibility that they had gleaned them from other papers that had been obfuscated to avoid plagiarism detection.

  15. amy said,

    August 14, 2021 @ 6:10 pm

    I personally like the term "counterfeit consciousness" more than "artificial intelligence" and wish it'd become more common – because the former suggests an underlying malicious intent, which is probably true of a lot of the things that "artificial intelligence" is meant to do – further enslave the population under the control of Big Tech.

  16. AntC said,

    August 14, 2021 @ 6:11 pm

    @JWB, @Brett a hapless foreign academic … had been exposed to enough of these as to have gotten the misimpression that … was an accepted term of art … and thus added "tells" of plagiarism to their draft without actually having engaged in plagiarism.

    I think you'd have to question whether this academic had actually studied their topic in sufficient depth for their draft to be adding anything to the body of knowledge. Aren't there 'standard papers'/frequently cited that more or less defined the fixed phrases and their usage? Shouldn't the academic's supervisors be identifying those during undergraduate/PhD level?

  17. Bathrobe said,

    August 14, 2021 @ 6:28 pm

    It struck me that there must be software out there designed to disguise plagiarism, by substituting vocabulary thesaurus-style. This must be what the people who are flogging the fake versions of Tess of the D'Urbervilles and Oliver Twist used. No one would go through a book and do this substitution by hand; it wouldn't be worth the huge amount of effort required.

    Is it possible that the same kind of software is responsible for these fake scientific papers? Again, it's hard to fathom why anyone would do this, because the plagiarism should surely be obvious to anyone who checked. But vocabulary-substituting software might be an explanation for both phenomena…

  18. Julian said,

    August 14, 2021 @ 6:42 pm

    @Bathrobe:
    That's psychedelic, man! I want it!

  19. Bathrobe said,

    August 14, 2021 @ 7:00 pm

    It struck me that there must be software out there designed to disguise plagiarism, by substituting vocabulary thesaurus-style.

    By rights I should acknowledge that Dick Margulis and Marnie had already mentioned this in the first two comments at this post. My apologies.

  20. AG said,

    August 14, 2021 @ 7:27 pm

    While it has the telltale look of machine translation, I wonder if "haze figuring" might be too good to be true (or false, in this case) … the first couple of google results for the term were either about Cabanac's work, or seem (to my untrained eye) to be related to actual terms of art that are mutations of the cloud computing concept – there seem to be things really called fog, mist, edge, and haze computing? One example:

    https://www.hindawi.com/journals/wcmc/2021/5599907/

  21. Viseguy said,

    August 14, 2021 @ 7:35 pm

    It was the most optimal of times, it was the least optimal of times, …. It is a far, far more optimal thing that I do, than I have ever done; it is a far, far more optimal …. Wait, wait …. ahem. My name is Ishmael.

  22. J.W. Brewer said,

    August 14, 2021 @ 7:58 pm

    For some non-computer-science examples, a friend of mine who teaches high school history reports having received a student paper referencing the Big Melancholy (a/k/a Great Depression) and says a colleague got one referencing "burglar nobles" (a/k/a "robber barons").

  23. J.W. Brewer said,

    August 14, 2021 @ 8:12 pm

    Another possibility in Brett's situation is that the authors certainly know their own technical term (the Russian in talking-about-physics contexts seems to maybe be случайное вырождение, and maybe Ukrainian has just borrowed Russian for the talking-about-that-sort-of-physics register?) but aren't sufficiently immersed in English technical literature to get the English equivalent right versus some dictionary or MT-software attempt.

  24. KevinM said,

    August 15, 2021 @ 10:13 am

    One source of demand: Professional norms and standards that require journal publications, even from clinicians or others who have no particular interest in research. https://forbetterscience.com/2020/01/24/the-full-service-paper-mill-and-its-chinese-customers/ (Don't mean to pick on China, BTW. As the article points out, Westerners are very much implicated on both the demand and supply side.)

  25. Andrew Usher said,

    August 15, 2021 @ 8:12 pm

    While we can't change those Chinese laws, we could stop them affecting our journals, if the publishers cared at all.

  26. Andreas Johansson said,

    August 16, 2021 @ 1:02 am

    No one would go through a book and do this substitution by hand; it wouldn't be worth the huge amount of effort required.

    Not books, but back in secondary education, I had classmates who, confronted with homework of the type "write a short biography of Lord Byron", would manually rewrite the textbook section on him with this sort of substitutions.

    The teachers were apparently fine with this.

  27. Rod Johnson said,

    August 16, 2021 @ 10:41 am

    I'm getting a serious "Tlön, Uqbar, Orbis Tertius" vibe from the idea that researchers might start unwittingly using the "tortured phrases" as if they were the real thing. Maybe fifty years from now, when breathless articles in the popular press announce that "counterfeit consciousness" is only five years away, they'll think it was quaint that we once called it "artificial intelligence."

  28. Rodger C said,

    August 17, 2021 @ 9:15 am

    Not to be confused with false consciousness. Or has that happened too?
