2024-05-05

In a comment on yesterday's "Software testing day" post, ernie in berkeley offered a nice "QA Engineer walks into a bar" joke, and pointed us to its origin in an old xkcd comic "Exploits of a Mom":

…which in turn reminded me of an old problem, discussed in "Excel invents genes", 8/26/2016:

Mark Ziemann, Yotam Eren and Assam El-Osta, "Gene name errors are widespread in the scientific literature", Genome Biology 2016:

The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.

This was a problem a dozen years ago when I worked on information extraction from biomedical literature — it's amazing to me that it still goes on. The authors note that

Automatic conversion of gene symbols to dates and floating-point numbers is a problematic feature of Excel software. The description of this problem and workarounds were first highlighted over a decade ago [1]—nevertheless, we find that these errors continue to pervade supplementary files in the scientific literature. To date, there is no way to permanently deactivate automatic conversion to dates in MS Excel and other spreadsheet software such as LibreOffice Calc or Apache OpenOffice Calc. We note, however, that the spreadsheet program Google Sheets did not convert any gene names to dates or numbers when typed or pasted; notably, when these sheets were later reopened with Excel, LibreOffice Calc or OpenOffice Calc, gene symbols such as SEPT1 and MARCH1 were protected from date conversion.
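For a sense of what a "programmatic scan" for such damage might look like, here is a minimal Python sketch. It is not the authors' script: the regular expressions and the names `classify` and `suspicious_cells` are my own illustrative assumptions, based on the commonly cited examples of SEPT2 turning into "2-Sep" and a RIKEN identifier like 2310009E13 turning into "2.31E+13".

```python
import re

# Month abbreviations as they appear in a converted date such as "1-Sep".
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Day-month ("1-Sep") or month-day ("Sep-1") forms.
# Hypothetical patterns for illustration, not the ones used in the paper.
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{MONTHS})|(?:{MONTHS})-\d{{1,2}})$", re.IGNORECASE
)

# Scientific notation such as "2.31E+13".
FLOAT_LIKE = re.compile(r"^\d(?:\.\d+)?E\+?\d+$", re.IGNORECASE)


def classify(cell: str) -> str:
    """Label a cell from a gene-name column as 'date', 'float', or 'ok'."""
    value = cell.strip()
    if DATE_LIKE.match(value):
        return "date"
    if FLOAT_LIKE.match(value):
        return "float"
    return "ok"


def suspicious_cells(column):
    """Return (row index, cell text, label) for every cell that looks converted."""
    flagged = []
    for index, cell in enumerate(column):
        label = classify(cell)
        if label != "ok":
            flagged.append((index, cell, label))
    return flagged


if __name__ == "__main__":
    genes = ["SEPT1", "1-Mar", "TP53", "2.31E+13", "2310009E13"]
    for index, cell, label in suspicious_cells(genes):
        print(f"row {index}: {cell!r} looks like a converted {label}")
```

A check like this can only flag cells that have already been mangled; it cannot recover the original symbol, since "1-Mar" might once have been MARCH1 or MAR1.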

It's shocking that biologists ever relied on Excel as a database system, and even more shocking that they're still doing it.

And of course this is adjacent to the problem of wrong row or column numbers in data analysis, and the wider problem of Cupertinos and other autocorrect effects, and so on.

I don't have time this morning to check whether MS has finally fixed the issue of Excel inventing new gene names (and similar things in other research areas) — or at least provided and documented a setting to allow researchers to turn off such "helpful" re-interpretations.

But it occurs to me that the rise of LLM "AI" means that there will soon be (the opportunity for) many new types of dataset corruption, as legions of clueless (or at least context-agnostic) developers enlist the intervention of helpful AIs everywhere…

