18th-century RNA research


As I was looking into the history of the term biomarker, Google Scholar reminded me that automatic information extraction from text remains imperfect:

Google Scholar's translation into APA format:

Crea, F., Watahiki, A., & Quagliata, L. (1769). Identification of a long non-coding RNA as a novel biomarker and potential therapeutic target for metastatic prostate cancer. Oncotarget 5, 764–774.


It shouldn't take AI to fix this, since Google Scholar has other information sources about this particular article. And more broadly, even a crude implementation of ACS ("automatic common sense") ought to raise a flag about 18th-century RNA research.

But the automatic translation of character strings into fielded records sometimes goes wrong, even when there's contrary information in the string itself:

Baldacci, F., Lista, S., O’Bryant, S. E., Ceravolo, R., Toschi, N., & Hampel, H. (1750). Alzheimer Precision Medicine Initiative (APMI), 2018. Blood-based biomarker screening with agnostic biological definitions for an accurate diagnosis within the dimensional spectrum of neurodegenerative diseases. Methods Mol. Biol, 139, e155.

And ACS remains a seriously underdeveloped field, if it's a field at all, though the idea of improving precision by exploiting redundancies in unreliable data is hardly new.
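For concreteness, here is a minimal sketch of the kind of crude check I have in mind. It is not anything Google Scholar is known to do: the function name `year_flags`, the range of "year-like" numbers, and the two checks themselves are all assumptions made for illustration. It flags an assigned year that is not a plausible four-digit publication year, and it flags a citation string that itself mentions a later year.

```python
import re
from datetime import date

# Matches standalone four-digit numbers from 1000 to 2099.
# This range is an assumption for illustration; page numbers such as
# "1863" would also match, so a hit is a flag, not a correction.
YEAR_LIKE = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")


def year_flags(assigned_year, citation_text):
    """Return reasons to distrust assigned_year; an empty list means no flag."""
    flags = []

    # Check 1: a publication year should have four digits and should
    # not lie beyond next year.
    if not 1000 <= assigned_year <= date.today().year + 1:
        flags.append(f"{assigned_year} is not a plausible publication year")

    # Check 2: the citation string itself mentions a later year,
    # which contradicts the year assigned to the record.
    mentioned = {int(y) for y in YEAR_LIKE.findall(citation_text)}
    later = sorted(y for y in mentioned if y > assigned_year)
    if later:
        flags.append(f"text mentions later year(s): {later}")

    return flags
```

Applied to the second citation above, with 1750 as the assigned year, this sketch flags the "2018" in the string. Applied to the first, with 1769, it returns nothing, because that string contains no contradicting year: catching it would take outside knowledge, such as the other records Google Scholar holds for the same article.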

Anyhow, for most users of a system like Google Scholar, this sort of thing doesn't matter, or doesn't matter very much. Which is presumably why there's been no serious attempt at a fix. In fact, users' implicit cost function presumably leans towards recall at the expense of precision.



9 Comments

  1. John Wilkins said,

    September 26, 2020 @ 7:17 am

    I've come across this for years, and it's a pain, because if you are seeking the origin of a novel term and the n-gram shows a bump in, say, 1950, then it can take hours trying to track it down until you decide it is a mistake.

  2. Rose Eneri said,

    September 26, 2020 @ 8:07 am

    I'd like to know what is meant by "agnostic biological definitions."

  3. Philip Taylor said,

    September 26, 2020 @ 9:32 am

    Your keyboard is playing up again, Mark — "researcch".

  4. David L said,

    September 26, 2020 @ 11:55 am

    It should be "agonistic" but that's not Google's fault — it's a typo in the title of the paper

  5. Stephen Hart said,

    September 26, 2020 @ 3:58 pm

    David L said,
    "It should be "agonistic" but that's not Google's fault — it's a typo in the title of the paper"

    I was about to write "And not the only one:

    Blood-Based Biomarker Screening with Agnostic Biological Definitions for an Accurate Diagnosis Within the Dimensional Spectrum of Neurodegenerative Diseases

    But then found:

    Recognizing molecular patterns by machine learning: An agnostic structural definition of the hydrogen bond

    Reimagining psychoses: an agnostic approach to diagnosis

    Agnostic classification of Markovian sequences

    Examining overlap and homogeneity in ASD, ADHD, and OCD: a data-driven, diagnosis-agnostic approach

    So, presumably meaning something like "without assumptions" or "evidence-based."

  6. David Morris said,

    September 26, 2020 @ 6:34 pm

    So if all this research is agnostic, it is unknown and unknowable?

  7. Peter Taylor said,

    September 27, 2020 @ 11:24 am

    @Stephen Hart, "agnostic" as meaning roughly "without assumptions" is certainly a term of art in machine learning (and thus doubles back to the topic of this blog post), but that would make "agnostic … definitions" oxymoronic.

  8. Chas Belov said,

    September 28, 2020 @ 2:37 pm

    Sorry, not seeing the error. What's the correct version?

  9. Lance said,

    September 30, 2020 @ 3:27 am

    What's really shocking to me is not even the fact that Google found 200+ articles from before 1900. I mean, yes, use some common sense, don't see "1863" as a page number and decide it's a year, but OK, at least 1863 is a year.

    What's shocking to me is that if you shorten the time span from "through 1900" to "through 100", it still turns up 26 articles from years like "13" and "19" and even "1". You don't need the common sense that people weren't writing about biomarkers in the first century of the Common Era to get this right; you just need the common sense that *a year on a scholarly article should have four digits*.
