Christiaan H Vinkers et al., "Use of positive and negative words in scientific PubMed abstracts between 1974 and 2014: retrospective analysis", BMJ 2015:
Design Retrospective analysis of all scientific abstracts in PubMed between 1974 and 2014.
Methods The yearly frequencies of positive, negative, and neutral words (25 preselected words in each category), plus 100 randomly selected words were normalised for the total number of abstracts. […]
Results The absolute frequency of positive words increased from 2.0% (1974-80) to 17.5% (2014), a relative increase of 880% over four decades.
The "positive words" they used were
(Amazing OR Assuring OR Astonishing OR Bright OR Creative OR Encouraging OR Enormous OR Excellent OR Favorable OR Groundbreaking OR Hopeful OR Innovative OR Inspiring OR Inventive OR Novel OR Phenomenal OR Prominent OR Promising OR Reassuring OR Remarkable OR Robust OR Spectacular OR Supportive OR Unique OR Unprecedented)
expressed in terms of a query framed for the online PubMed interface to the MEDLINE collection of abstracts. That interface returns a count of the matching titles and abstracts, which is the method they used to get their numbers. (I presume that the capitalization, which MEDLINE ignores as far as I can tell, was added by a copy editor at BMJ?)
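For readers who want to try a count of this kind, here is a minimal sketch that builds a comparable query string and a request URL for NCBI's E-utilities `esearch` endpoint. It is not the authors' procedure (they used the online PubMed interface). The `[tiab]` and `[dp]` field tags, the one-year restriction, and the helper names `build_term` and `esearch_count_url` are assumptions made for illustration. The code only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

POSITIVE_WORDS = [
    "amazing", "assuring", "astonishing", "bright", "creative",
    "encouraging", "enormous", "excellent", "favorable", "groundbreaking",
    "hopeful", "innovative", "inspiring", "inventive", "novel",
    "phenomenal", "prominent", "promising", "reassuring", "remarkable",
    "robust", "spectacular", "supportive", "unique", "unprecedented",
]

def build_term(words, year):
    """OR the words together, limited to title/abstract and one publication year.

    The field tags are an assumption; the paper's exact query may differ.
    """
    ored = " OR ".join(f"{w}[tiab]" for w in words)
    return f"({ored}) AND {year}[dp]"

def esearch_count_url(term):
    """Build (but do not fetch) an esearch URL that asks only for a hit count."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": "pubmed", "term": term, "rettype": "count"})

term_2014 = build_term(POSITIVE_WORDS, 2014)
url_2014 = esearch_count_url(term_2014)
print(url_2014)
```

Dividing such a count by the count for the year alone would give the proportion of matching records for that year.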
One problem with this method is that the mean length of MEDLINE titles + abstracts has increased substantially over time — the graph below shows the mean length in words from 1974 to 2014:
This is partly because a larger fraction of older entries in MEDLINE have only titles and no abstracts (40.4% of article titles in 1975 come with an abstract; 85.8% of article titles in 2014 do), and it's partly because both titles and abstracts have gotten longer over time. And a simple probability calculation tells us that a larger proportion of longer texts will use a given set of words, even if the frequency of those words doesn't change.
I was able to calculate the numbers shown above because PubMed also licenses the MEDLINE data for bulk download. I obtained the "2016 MEDLINE/PubMed Baseline Database Distribution", and extracted the titles, publication dates, and abstracts for years 1974 through 2014.
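The probability calculation can be sketched as follows, under the simplifying assumption that each word in a text independently belongs to the target set with a fixed probability p. The chance that a text of n words contains at least one target word is then 1 - (1 - p)^n. The rate of 300 per million and the text lengths below are illustrative values, not measurements.

```python
def p_at_least_one(rate_per_million, n_words):
    """Probability that a text of n_words contains at least one target word.

    Assumes words are independent draws with a constant rate, which real
    text does not satisfy exactly; this only illustrates the length effect.
    """
    p = rate_per_million / 1_000_000
    return 1.0 - (1.0 - p) ** n_words

# Same word rate, different text lengths (hypothetical lengths in words).
for n in (20, 100, 250):
    print(n, round(p_at_least_one(300, n), 4))
```

With the per-word rate held constant, the share of texts containing a target word still climbs as the texts get longer, which is why a document-level proportion overstates the change.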
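An extraction step of this kind might look like the sketch below. It is an assumption-laden illustration, not the code actually used: the element names (`Article`, `ArticleTitle`, `Abstract/AbstractText`, `Journal/JournalIssue/PubDate/Year`) follow the MEDLINE/PubMed XML layout as I understand it, the sample record set is made up and heavily simplified, and `extract_records` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified sample; real baseline files carry many more fields.
SAMPLE = """<MedlineCitationSet>
  <MedlineCitation>
    <Article>
      <Journal><JournalIssue><PubDate><Year>2014</Year></PubDate></JournalIssue></Journal>
      <ArticleTitle>A made-up title</ArticleTitle>
      <Abstract><AbstractText>A made-up abstract.</AbstractText></Abstract>
    </Article>
  </MedlineCitation>
  <MedlineCitation>
    <Article>
      <Journal><JournalIssue><PubDate><Year>1975</Year></PubDate></JournalIssue></Journal>
      <ArticleTitle>A title with no abstract</ArticleTitle>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>"""

def text_of(element):
    """Flatten an element's text (including nested markup) to a string."""
    return "".join(element.itertext()).strip() if element is not None else ""

def extract_records(root, first_year=1974, last_year=2014):
    """Return (year, title, abstract) tuples; abstract is '' when absent."""
    records = []
    for art in root.iter("Article"):
        year_text = art.findtext("Journal/JournalIssue/PubDate/Year")
        if year_text is None or not year_text.strip().isdigit():
            continue  # records without a plain numeric Year are skipped here
        year = int(year_text)
        if not first_year <= year <= last_year:
            continue
        title = text_of(art.find("ArticleTitle"))
        abstract = " ".join(text_of(el) for el in art.iterfind("Abstract/AbstractText"))
        records.append((year, title, abstract))
    return records

records = extract_records(ET.fromstring(SAMPLE))
share_with_abstract = sum(1 for r in records if r[2]) / len(records)
print(records, share_with_abstract)
```

For the full distribution, a streaming parser such as `ET.iterparse` would be a better fit than loading each file whole, and records dated in other ways would need their own handling.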
Analysis of that dataset more or less replicates their calculation of the proportion of article titles-and-abstracts containing any of the 25 "positive" words — though I get 1.6% for the average of the years 1974-1980, and 16.1% for 2014, rather than the 2% and 17.5% that they report:
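The document-level measure (the share of titles-plus-abstracts containing at least one listed word) can be sketched like this. The tokenizer is a crude assumption (lowercased runs of letters), so its counts would not match PubMed's own indexing exactly. Two of the sample texts are taken from the examples quoted in this post, and the other two are made up.

```python
import re

POSITIVE = set(
    "amazing assuring astonishing bright creative encouraging enormous "
    "excellent favorable groundbreaking hopeful innovative inspiring "
    "inventive novel phenomenal prominent promising reassuring remarkable "
    "robust spectacular supportive unique unprecedented".split()
)

def tokenize(text):
    """Lowercase and split into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def proportion_with_any(docs, words):
    """Fraction of documents that contain at least one of the given words."""
    if not docs:
        return 0.0
    hits = sum(1 for doc in docs if words.intersection(tokenize(doc)))
    return hits / len(docs)

docs = [
    "The AgNCs displayed a bright red emission when excited at 545nm.",
    "Recovery occurred with supportive care.",
    "Rats were fed a standard diet.",            # made-up neutral text
    "Samples were stored at room temperature.",  # made-up neutral text
]
proportion = proportion_with_any(docs, POSITIVE)
print(proportion)
```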
The difference is probably because the material in the 2016 baseline distribution is a bit different from the material that was indexed online at the time of their search.
If we look instead at the statistically more reasonable metric of frequency relative to overall word count, the same trend remains quite strongly evident:
But the proportional increase now seems smaller: For their 25 "positive" words, I get a mean summed frequency of 292.2 per million in 1974-1980, vs. 1151.8 in 2014. That's a rise to 394% of the starting rate (a factor of about 3.9), not 880%.
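The word-count-based measure and the comparison of the two periods can be sketched as below. The per-million figures are the ones reported in this post; the tokenizer and the helper names are assumptions. The sketch prints both the ratio of the two rates and the corresponding percent increase, since the two are easy to confuse: a ratio of about 3.94 is 394% of the starting rate.

```python
import re

def tokenize(text):
    """Lowercase and split into alphabetic tokens (a crude assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def rate_per_million(docs, words):
    """Summed occurrences of `words` per million tokens across `docs`."""
    total = hits = 0
    for doc in docs:
        tokens = tokenize(doc)
        total += len(tokens)
        hits += sum(1 for t in tokens if t in words)
    return 1_000_000 * hits / total if total else 0.0

def ratio_and_increase(old_rate, new_rate):
    """Return (new/old, percent increase over old)."""
    ratio = new_rate / old_rate
    return ratio, 100 * (ratio - 1)

# Rates reported above: 292.2 per million (1974-1980) and 1151.8 (2014).
ratio, pct_increase = ratio_and_increase(292.2, 1151.8)
print(f"ratio {ratio:.2f}, percent increase {pct_increase:.0f}")
```

Because this rate is normalized by the total number of words, it is not inflated by the growth in the length of titles and abstracts.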
One other issue that should eventually be addressed is that not all of the "positive" words are actually used in a positive way. For example, uses of bright in MEDLINE hardly ever belong in the category of what the authors call "strikingly positive" words. Here are five examples chosen at random from 2014:
Colour attributes of cornflower honey were characterised by elevated values of L(∗) and particularly high values of b(∗) and h coordinates, which correspond to medium bright honey with intense yellow colour.
The AgNCs displayed a bright red emission when excited at 545nm.
In MIS group, the most common symptoms while reading were difficulty to move lines (85%), doubling (53%), and difficulty in bright condition (27%).
However, exposure to bright sunlight might make birds easier to detect by predators and may also cause visual glare that can reduce a bird's ability to monitor the environment.
Rgs9-/- mice spent less time than wild-type mice in both dim and bright light.
Several other words in the list, like assuring and supportive, often have a similar problem — their meaning is technically positive, but the work that's being reported in the abstract is not getting praised:
The waste management process is quite vulnerable, especially when it comes to assuring the right destination for the delivery of the hazardous waste.
Physicians should adapt their consultation style to the needs of adolescents by seeing the adolescent patient alone for some time and by assuring them of conditional confidentiality.
Better mechanisms should be put in place for assuring the safety of such infants.
Verification of ordered doses by pharmacists is critical in assuring appropriate use of this regimen.
For supportive therapy colestyramine and colestipol come into consideration as well as beta-sitosterol.
The observations imply that this type of sympathetic blockade may be of therapeutic value in some vascular disorders of the hand, and as supportive treatment in vascular surgery on the extremities
Pattern of intra- and extracellular disorders in skeletal dysplasias: comparative morphological studies on the supportive tissue.
Recovery occurred with supportive care.
In other cases, positive words are used in phrases that give them a neutral or explicitly negative spin:
In cases of idiopathic marrow failure, the situation is less hopeful.
Students who elected the seminar were initially less favorable toward the elderly than were their classmates.
Finely granular deposits of IgM and IgGl were found in most glomeruli with less prominent deposits of IgGa and IgA.
Even on the second postoperative day, the antiarrhythmic effects of these two beta-blockers were not remarkable, effective only in 2/4 animals in the case of Kö 1400-Cl and in 2/3 animals in the case of propranolol.
While not encouraging the use of physical restrains on mental patients, the author presents a statement of its continuing use and underscores the need for preparing students for a group of traditional procedures that in some ways are increasingly out of step with our times and aesthetically offensive.
The digital computer simulation suggested that this model is not unique, but will require further testing.
Eosinophils were never prominent.
There is some, although never excellent, agreement between real and simulated evolution.
This position is taken because many view the Prophet as not objecting to contraception but never encouraging mass population control.
The overall result demonstrates that despite enormous therapeutic effort, the infection in articular fractures is a serious complication which often leads to permanently functional deficiencies.
In our series, patients not receiving maximal standard local treatment often had relapse of local disease despite favorable responses to chemotherapy.
Failure of unilateral carotid angiography to opacify a saccular aneurysm of the anterior communicating artery was observed; this occurred despite excellent visualization of the parent artery with spontaneous cross-filling and adequate demonstration of the intracranial carotid system bilaterally.
Microscopically, the tumours were composed of small, round cells without remarkable structural features.
Thus, in the context of a psychological stressor, the activation of the amygdala CRH system can occur without robust activation of the hypothalamic CRH system.
Infarction of the RMCA territory may cause agitated confusion in patients without prominent localizing signs; the initial neurologic findings may suggest a metabolic encephalopathy.
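A first pass at sample-based checking for cases like these could flag each listed word that is closely preceded by a negating or concessive word. The sketch below is a rough heuristic under stated assumptions: the negator list (taken from the patterns in the examples above), the two-token window, and the function name `flag_hedged` are all choices made for illustration, and the heuristic will miss many cases and catch some false ones.

```python
import re

# Cue words drawn from the examples above; the list is an assumption.
NEGATORS = {"not", "never", "less", "without", "despite"}

def flag_hedged(sentence, words, window=2):
    """Return (preceding context, word) pairs where a listed word follows
    a negator within `window` tokens."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    flagged = []
    for i, tok in enumerate(tokens):
        if tok in words:
            context = tokens[max(0, i - window):i]
            if NEGATORS.intersection(context):
                flagged.append((" ".join(context), tok))
    return flagged

words = {"prominent", "hopeful", "supportive"}
print(flag_hedged("Eosinophils were never prominent.", words))
print(flag_hedged("Recovery occurred with supportive care.", words))
```

A check like this says nothing about literal uses such as bright light or supportive care, which would need a different test, for instance reading a random sample of hits for each word.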
The basic trend in lexical frequencies seems solid. But a somewhat more sophisticated analysis (or at least some sample-based checking) would be a good idea, before we follow the lead of Philip Ball in Nature News ("‘Novel, amazing, innovative’: positive words on the rise in science papers", 12/14/2015) in concluding that "Scientists have become more upbeat in describing their research".
It's also worth noting, again, that this trend, whatever its real extent and nature, is an example of a much more general phenomenon: Given large amounts of text data from different times, places, and authors, there are patterns everywhere.
It makes intuitive sense that the use of first-person plural pronouns in MEDLINE abstracts has risen steadily over time:
In fact, the rise for we, to 567% of its starting rate (from a mean rate of 685 per million words in 1974-1980 to a rate of 3886 per million in 2014), is larger than the rise to 394% for the 25 "positive" words. But again, the interpretation requires some caution: I suspect that this is a stylistic change, favoring authorial we over agentless passives, rather than an increase in collectivistic consciousness.
What are we to make of the fact that very has steadily decreased in popularity? Perhaps Prof. Strunk's mantra is taking effect — or perhaps authors are becoming more circumspect. Or, in the other direction, choosing more spectacular intensifiers…
But with respect to the secular increase in WH-words, I got nothin'. What happened in 1989 to cause how to increase in frequency, by a factor of 4 over 25 years?
The decline of which is plausibly yet another quantitative residue of which-hunting:
But what happened in the mid-90s to trigger a bear market in negations?
Overall, the MEDLINE data — like other longitudinal collections — shows secular trends in the frequencies of a large proportion of the words that are common enough to be trackable. What these trends mean, and even what they really are, is somewhat harder to determine.
Increasingly large amounts of text are increasingly available, with increasingly reliable information about time and place and genre. And the computational resources needed for simple analysis are increasingly fast, cheap, and accessible. So we're sure to see more and more work of this general kind, for better or for worse.
[I'm working with Martijn Wieling on a more serious analysis of the MEDLINE dataset than I've given in this Breakfast Experiment™ report. In order to try to minimize mistakes in processing the rather complex MEDLINE data, we're doing the same analyses in parallel with different programs in different languages. If this process turns up any errors in the numbers or plots above — which might well happen — I'll update appropriately.]