Kyle Orland, "The telltale words that could identify generative AI text", ars technica 7/1/2024

In a pre-print paper posted earlier this month, four researchers from Germany's University of Tubingen and Northwestern University said they were inspired by studies that measured the impact of the COVID-19 pandemic by looking at excess deaths compared to the recent past. By taking a similar look at "excess word usage" after LLM writing tools became widely available in late 2022, the researchers found that "the appearance of LLMs led to an abrupt increase in the frequency of certain style words" that was "unprecedented in both quality and quantity."

To measure these vocabulary changes, the researchers analyzed 14 million paper abstracts published on PubMed between 2010 and 2024, tracking the relative frequency of each word as it appeared across each year. They then compared the expected frequency of those words (based on the pre-2023 trendline) to the actual frequency of those words in abstracts from 2023 and 2024, when LLMs were in widespread use.

The results found a number of words that were extremely uncommon in these scientific abstracts before 2023 that suddenly surged in popularity after LLMs were introduced. The word "delves," for instance, shows up in 25 times as many 2024 papers as the pre-LLM trend would expect; words like "showcasing" and "underscores" increased in usage by nine times as well. Other previously common words became notably more common in post-LLM abstracts: the frequency of "potential" increased 4.1 percentage points; "findings" by 2.7 percentage points; and "crucial" by 2.6 percentage points, for instance.

The cited paper is Dmitry Kobak et al., "Delving into ChatGPT usage in academic writing through excess vocabulary", arXiv.org 7/3/2024 (v.1 6/11/2024):

Recent large language models (LLMs) can generate and revise text with human-level performance, and have been widely commercialized in systems like ChatGPT. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists have been using them to assist their scholarly writing. How wide-spread is LLM usage in the academic literature currently? To answer this question, we use an unbiased, large-scale approach, free from any assumptions on academic LLM usage. We study vocabulary changes in 14 million PubMed abstracts from 2010-2024, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. Our analysis based on excess words usage suggests that at least 10% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, and was as high as 30% for some PubMed sub-corpora. We show that the appearance of LLM-based writing assistants has had an unprecedented impact in the scientific literature, surpassing the effect of major world events such as the Covid pandemic.

This claim may very well be right — I haven't evaluated their statistical model, in which what counts as "excess word usage" will very much depend on the assumed nature of the underlying stochastic processes — but there are some things about the argument that leave me skeptical.
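To make "excess word usage" concrete, here is a minimal sketch of one way such a calculation could go, assuming a straight-line fit to pre-2023 yearly frequencies. That assumption is mine, for illustration only, and is not a reproduction of the paper's model; the function names and the input frequencies are hypothetical.

```python
def linear_trend(years, freqs):
    """Ordinary least-squares line through (year, freq); returns (slope, intercept)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(freqs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, freqs))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def excess_usage(baseline, target_year, observed):
    """Compare an observed frequency with the value extrapolated from the baseline years.

    baseline: dict mapping year -> relative frequency (pre-LLM years only)
    Returns the expected frequency, the gap (observed - expected),
    and the ratio (observed / expected).
    """
    years = sorted(baseline)
    slope, intercept = linear_trend(years, [baseline[y] for y in years])
    expected = slope * target_year + intercept
    return {
        "expected": expected,
        "gap": observed - expected,
        "ratio": observed / expected,
    }


# HYPOTHETICAL frequencies, invented purely to exercise the functions.
baseline = {2019: 0.0010, 2020: 0.0011, 2021: 0.0012, 2022: 0.0013}
result = excess_usage(baseline, 2024, observed=0.0060)
print(result)
```

The point of the sketch is that both the gap and the ratio depend on the extrapolated baseline, so a different assumption about the pre-2023 trend (a curve rather than a line, or a different window of years) would change what counts as "excess".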

The first thing to note is that some of the cited increases are much more "abrupt" than others, with forms of the verb delve leading the list, as shown in their Figure 1:

And their Figure 2(a), captioned "Frequencies in 2024 and frequency ratios (r). Both axes are on log-scale":

(I believe that the paper's authors downloaded the PubMed data and did their own searches and counts, but if you do your own exploration on the PubMed site, keep in mind that PubMed apparently searches by lemma, so that a search for "delves" also hits on "delve", "delving", "delved" — which I'll signal in the usual way with square brackets, e.g. [delve]. And PubMed returns the number of citations (abstracts or available texts) per year containing the lemmas in question — for the relative frequency results, normalizing for the number of available citations per year, you can use Ed Sperr's github page. I haven't tried to separate the alternative forms, but that should not matter for the points I'm making below.)

My first observation is that decades-long trends in relative PubMed word usage are common, and not just because of real-world references like ebola, covid, and chatgpt. For example, [explore] has been gaining on [investigate] for 25 years or so, with acceleration over the past decade:

It's also worth noting that trends of similar size (and often similar direction) can be found in more general sources than PubMed, e.g. Corpus of Historical American English:

And the next thing to notice is that the changes in [delve], though extremely large in proportional terms, are small in terms of actual citation frequency, e.g. 5,526 in 2024 for [delve] compared to 108,616 for [explore]:

The proportional change for [delve] from 2022 to 2024 is indeed impressive (numbers below are from Ed Sperr's github page — the 2024 data is only for part of the year, obviously…)

citations 629 to 5,526, factor of 8.8; citations per 100k 35.37 to 591.33, factor of 16.7

But if every PubMed citation containing a form of [delve] were written (or edited) by ChatGPT, those 591.33 citations per 100k would amount to just 0.59% of the year's citations. That's a lot less than the "at least 10%" claimed by the article, which throws us back into the evaluation of their overall statistical model.
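The arithmetic behind that 0.59% figure can be checked with a few lines of Python. This is a sketch using only the [delve] numbers quoted above; the yearly totals are back-computed from the counts and rates, so they are implied values rather than figures taken from PubMed, and the function names are my own.

```python
# [delve] figures quoted above, for 2022 and 2024.
CITATIONS = {2022: 629, 2024: 5_526}
PER_100K = {2022: 35.37, 2024: 591.33}


def implied_total(citations, rate_per_100k):
    """Back out a year's total citation count from a hit count and its per-100k rate."""
    return citations / rate_per_100k * 100_000


def share_percent(rate_per_100k):
    """Convert a per-100k rate into a percentage of the year's citations."""
    return rate_per_100k / 100_000 * 100


totals = {year: implied_total(CITATIONS[year], PER_100K[year]) for year in CITATIONS}
upper_bound = share_percent(PER_100K[2024])

for year, total in totals.items():
    print(f"{year}: implied total of about {total:,.0f} citations")
print(f"[delve] share of 2024 citations: {upper_bound:.2f}%")
```

On these numbers the implied totals are roughly 1.78 million citations for 2022 and 0.93 million for the partial year 2024, which is why the raw count rises by a factor of 8.8 while the per-100k rate rises by 16.7.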

And the proportional changes from 2022 to 2024 for their other chosen words are substantially smaller, e.g.

[showcase]: citations 1,900 to 4,470, factor of 2.4; per 100k 106.85 to 478.74, factor of 4.48
[surpass]: citations 1,984 to 4,348, factor of 2.2; per 100k 111.57 to 465.67, factor of 4.17
[emphasize]: citations 17,945 to 22,151, factor of 1.2; per 100k 1009.13 to 2372.36, factor of 2.35
[potential]: citations 455,135 to 270,421, factor of 0.59; per 100k 25590.98 to 28908.95, factor of 1.13
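For anyone who wants to recompute or extend this comparison, here is a small sketch that derives the 2022-to-2024 factors from the counts and rates given above and ranks the lemmas by the change in rate. The data structure and names are my own; only the numbers come from the text.

```python
# lemma -> ((citations 2022, citations 2024), (per-100k 2022, per-100k 2024))
ROWS = {
    "delve": ((629, 5526), (35.37, 591.33)),
    "showcase": ((1900, 4470), (106.85, 478.74)),
    "surpass": ((1984, 4348), (111.57, 465.67)),
    "emphasize": ((17945, 22151), (1009.13, 2372.36)),
    "potential": ((455135, 270421), (25590.98, 28908.95)),
}


def factor(before, after):
    """Proportional change from the earlier value to the later one."""
    return after / before


# For each lemma: (factor in raw citations, factor in citations per 100k).
factors = {
    lemma: (factor(*counts), factor(*rates))
    for lemma, (counts, rates) in ROWS.items()
}

# Rank by the per-100k factor, largest first.
ranked = sorted(factors, key=lambda lemma: factors[lemma][1], reverse=True)

for lemma in ranked:
    count_factor, rate_factor = factors[lemma]
    print(f"[{lemma}]: citations x{count_factor:.2f}, per 100k x{rate_factor:.2f}")
```

The raw-count factor for [potential] comes out below 1 only because the 2024 data covers part of the year; the per-100k factor is the one that is comparable across years.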



And [delve] has been gaining in relative frequency on (for example) [explore] since 2009, long before ChatGPT was available to researchers, even though [explore] has been gaining in popularity during that time:

In the broader COHA collection, [delve] has been increasing in popularity since the 1940s:

None of this explains [delve]'s proportional change factor of 16.7 from 2022 to 2024 in citations per 100k, but it does show that there are cultural trends (even fads) in word usage, independent of recent LLM availability. And it's also not clear why ChatGPT should promote [delve] and not e.g. [probe] or [seek] or [sift] — though maybe it's because of who they hired for RLHF?

[See also "Bing gets weird — and (maybe) why" (2/16/2023); "Annals of AI bias" (9/23/2023) ]

