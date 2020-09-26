

As I was looking into the history of the term "biomarker", Google Scholar reminded me that automatic information extraction from text remains imperfect:

Google Scholar's translation into APA format:

Crea, F., Watahiki, A., & Quagliata, L. (1769). Identification of a long non-coding RNA as a novel biomarker and potential therapeutic target for metastatic prostate cancer. Oncotarget 5, 764–774.



It shouldn't take AI to fix this, since Google Scholar has other information sources about this particular article. And more broadly, even a crude implementation of ACS ("automatic common sense") ought to raise a flag about 18th-century RNA research.
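To make "crude" concrete, here is a minimal sketch in Python of the kind of flag I have in mind. The function name `flag_anachronism` and the lookup table are hypothetical, and the cutoff years are deliberately loose assumptions for illustration, not researched dates of first use.

```python
import re

# Hypothetical lookup table: a loose lower bound on the year in which each
# term could plausibly appear in a title. The years are illustrative
# assumptions, not documented first-use dates.
EARLIEST_PLAUSIBLE_YEAR = {
    "rna": 1900,
    "biomarker": 1900,
}

def flag_anachronism(title, year):
    """Return the known terms in `title` for which `year` looks too early."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    return sorted(
        term
        for term, earliest in EARLIEST_PLAUSIBLE_YEAR.items()
        if term in words and year < earliest
    )

title = ("Identification of a long non-coding RNA as a novel biomarker and "
         "potential therapeutic target for metastatic prostate cancer")
print(flag_anachronism(title, 1769))  # ['biomarker', 'rna']
```

A non-empty result would not say what the right year is, only that the extracted one deserves a second look.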

But the automatic translation of character strings into fielded records sometimes goes wrong, even when there's contrary information in the string itself:

Baldacci, F., Lista, S., O’Bryant, S. E., Ceravolo, R., Toschi, N., & Hampel, H. (1750). Alzheimer Precision Medicine Initiative (APMI), 2018. Blood-based biomarker screening with agnostic biological definitions for an accurate diagnosis within the dimensional spectrum of neurodegenerative diseases. Methods Mol. Biol, 139, e155.

And ACS remains a seriously underdeveloped field, if it's a field at all, though the idea of improving precision by exploiting redundancies in unreliable data is hardly new.
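For the publication-year case, exploiting redundancy could be as simple as the following Python sketch. It pools the extracted year with any four-digit years found in the raw citation string (and, optionally, years from other sources), discards implausible candidates, and takes the most common survivor. The name `reconcile_year`, its parameters, and the plausibility window of 1900 to 2030 are all assumptions of mine, not a description of anything Google Scholar does.

```python
import re
from collections import Counter

def reconcile_year(parsed_year, raw_citation, other_years=(), lo=1900, hi=2030):
    """Pick the most common plausible year among redundant candidates.

    `lo` and `hi` bound what counts as plausible; both are assumed values.
    Returns None when no candidate falls inside that window.
    """
    candidates = [parsed_year, *other_years]
    # Any four-digit number that looks like a year inside the string itself.
    candidates += [int(y) for y in re.findall(r"\b(?:1[5-9]|20)\d\d\b", raw_citation)]
    plausible = [y for y in candidates if lo <= y <= hi]
    if not plausible:
        return None
    return Counter(plausible).most_common(1)[0][0]

raw = ("Baldacci, F., et al. (1750). Alzheimer Precision Medicine Initiative "
       "(APMI), 2018. Blood-based biomarker screening ...")
print(reconcile_year(1750, raw))  # 2018
```

When the string offers no alternative, as in the first example above, the function returns `None`, which is at least a signal to consult the other information sources.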

Anyhow, for most users of a system like Google Scholar, this sort of thing doesn't matter, or doesn't matter very much. Which is presumably why there's been no serious attempt at a fix. In fact, users' implicit cost function presumably leans towards recall at the expense of precision.

