New frontiers in dataset corruption
« previous post | next post »
In a comment on yesterday's "Software testing day" post, ernie in berkeley offered a nice "QA Engineer walks into a bar" joke, and pointed us to its origin in an old xkcd comic "Exploits of a Mom":
…which in turn reminded me of an old problem, discussed in "Excel invents genes", 8/26/2016:
Mark Ziemann, Yotam Eren and Assam El-Osta, "Gene name errors are widespread in the scientific literature", Genome Biology 2016:
The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.
This was a problem a dozen years ago when I worked on information extraction from biomedical literature — it's amazing to me that it still goes on. The authors note that
Automatic conversion of gene symbols to dates and floating-point numbers is a problematic feature of Excel software. The description of this problem and workarounds were first highlighted over a decade ago [1]—nevertheless, we find that these errors continue to pervade supplementary files in the scientific literature. To date, there is no way to permanently deactivate automatic conversion to dates in MS Excel and other spreadsheet software such as LibreOffice Calc or Apache OpenOffice Calc. We note, however, that the spreadsheet program Google Sheets did not convert any gene names to dates or numbers when typed or pasted; notably, when these sheets were later reopened with Excel, LibreOffice Calc or OpenOffice Calc, gene symbols such as SEPT1 and MARCH1 were protected from date conversion.
It's shocking that biologists ever relied on Excel as a database system, and even more shocking that they're still doing it.
And of course this is adjacent to the problem of wrong row or column numbers in data analysis, and the wider problem of Cupertinos and other autocorrect effects, and so on.
I don't have time this morning to check whether MS has finally fixed the issue of Excel inventing new gene names (and similar things in other research areas) — or at least provided and documented a setting to allow researchers to turn off such "helpful" re-interpretations.
But it occurs to me that the rise of LLM "AI" means that there will soon be (the opportunity for) many new types of dataset corruption, as legions of clueless (or at least context-agnostic) developers enlist the intervention of helpful AIs everywhere…
David Marjanović said,
May 5, 2024 @ 10:02 am
Why? Most biologists are not programmers and don't know more about computers than the office-job population at large.
However, I struggle to imagine why the conversion to dates would be automatic. In Excel you can change the format of a cell (or column or line) from "standard" to "text", and then anything in it will be interpreted as text and not as a date… right?
Sal said,
May 5, 2024 @ 10:14 am
In my experience it depends on how you format the columns and whether you give Excel any opportunity to reinterpret the format. For example if you save the file as .csv and then someone opens it in Excel, the format may be lost.
Mark Liberman said,
May 5, 2024 @ 10:35 am
@David Marjanović, @Sal:
First, Excel still does the cited automatic format correction, e.g.
Second, at least in the current MacOS version, it's not obvious how to change the "format" to avoid those corrections — I checked everything in the "Format" and "Tools" menus, without finding any way to do it.
I never use Excel except when forced to, and so I've resisted learning many aspects, including this one. But the cited 2016 article makes it clear that many biologists had the same problem then — and I'm betting that quite a few have experienced similar problems during the intervening years.
The point of this post, though, is that modern "AI" opens up the possibility of much broader and less obvious forms of "helpful" content changes.
Philip Taylor said,
May 5, 2024 @ 11:01 am
There was a time when I could foresee no possibility of ever needing a spreadsheet — now I cannot envisage life without one, since they play a vital rôle in virtually everything that I do, from typesetting books from Excel source to scheduling bowls matches for the coming season.
Being an avowed MS Windows user, however, I regret that I cannot offer any advice on how to persuade MS Excel in its Apple Macintosh incarnation to treat (e.g.,) "SEPT2" as a gene name rather than a date. Nonetheless, if I suggest how it can be accomplished in the MS Windows version (Excel 2010), it may be possible for Mark and others to identify the analogous route when using an Apple Macintosh. For MS Windows, the route is as follows :
1) Create a new, blank, workbook
2) Click on the column header for the column(s) of interest (e.g., click on the uppercase "A" above the first column)
3) Select "Format cells"
4) Under the first tab, headed "Number", select the option "Text"
5) Enter (e.g., "SEPT2" [without string quotes]) in cell A1
6) Hit <return> and check whether the cell displays "SEPT2" as intended or whether it displays instead "2-Sep".
7) If the cell displays "SEPT2", save the file in CSV format and check that the "SEPT2" remains unchanged therein.
For me, using MS Windows 7 (64-bit, Enterprise), MS Excel 2010 does the Right Thing — your mileage, under a different O/S or version of Excel, may vary.
Gregory Kusnick said,
May 5, 2024 @ 11:30 am
So apparently one-fifth of authors failed to adequately proofread their data files before publishing. Automated tools for detecting such errors are clearly possible (Ziemann et al. used one), so why are they not a standard part of publishing protocol in this field?
Philip Taylor said,
May 5, 2024 @ 12:52 pm
Sorry, for
please read
Terry K. said,
May 5, 2024 @ 2:48 pm
To what Philip Taylor says about how to get something like SEPT2 to not convert to a date in EXCEL on a Windows computer, I'll add there's also a method that doesn't require rightclicking.
Select a column by clicking on the header. Find "Format" on the "Home" toolbar and click it. Look for "Format Cells" (at the bottom for me) and select it. Proceed as Philip Taylor describes.
Mark Liberman said,
May 5, 2024 @ 3:04 pm
@Terry K., @Philip Taylor: [Complex instructions for protecting a column]
…which presumably needs to be done for every column of every "workbook"? And (according to Sal) may not survive saving as .csv and reading back into Excel?
Anyhow, the point is not to criticize Excel, but rather to observe that we've seen many past examples of dataset corruption due to context-agnostic developers being "helpful" to some users kin ways that are problematic for others — and we're likely to see much more of the same as a result of the trend to add helpful LLM AI in many ways to many apps…
Philip Taylor said,
May 5, 2024 @ 3:37 pm
« [Complex instructions for protecting a column] … which presumably needs to be done for every column of every "workbook"? »
Definitely not — any column (or even cell) that is required to contain (and correctly identify) anything other than text must be formatted accordingly. Formatting all columns as text is guaranteed to ensure that no formula [e.g., "=SUM(A1:A7)"] will be recognised as such.
« And (according to Sal) may not survive saving as .csv and reading back into Excel? ». Not only "may", will not survive. By definition CSVs are Comma-Separated VALUES, not Comma-Separated OBJECTS WITH PROPERTIES. When one opens a CSV file in Excel, a new workbook is created, and by definition that workbook will have no specific formatting associated with any or all columns. I believe that the default formatting is "General", so "SEPT2" would be interpreted as a date.
Francis Deblauwe said,
May 5, 2024 @ 3:46 pm
I tested this in MacOS native spreadsheet application Numbers and the gene names are safe: no substitution occurs.
AntC said,
May 5, 2024 @ 4:37 pm
@DM Why? [rely on Excel] Most biologists are not programmers and don't know more about computers than the office-job population at large.
Clearly (from the instructions above) using Excel requires the user (whatever discipline they think they're following) to act as programmer. And as myl note with saving to csv and reloading, to continue to think as a programmer. Using Excel for any sort of calculation/searching/sorting also needs thinking as a programmer.
Excel might or might not fit the mindset of an accountant with columns of figures; it's a very poor fit to the mindset of a database — even for mundane purposes like scheduling bowls matches. There used to be (especially on Mac) lightweight point-and-click database systems, in which you could start off entering data ad-hoc in a grid, with the tool learning that you were mostly putting text in this column: would you like me to fix this as a text format? Despite having the scary name 'database', they needed less programming than people end up with for Excel.
Sadly, most of those systems have died out because 'Professionals' (who really should be thinking more carefully about their tools) just opt for Excel.
Rick Rubenstein said,
May 5, 2024 @ 5:43 pm
I blame the removal of Clippy. He would have cleared this whole situation right up.
Noam said,
May 5, 2024 @ 7:52 pm
Eh, easier to just rename the genes https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates
But supposedly you can now disable some of the auto formatting https://www.techradar.com/pro/microsoft-is-finally-fixing-an-excel-data-bug-that-had-scientists-fuming
Circeus said,
May 5, 2024 @ 8:09 pm
Even the authoritative source on gene symbols fought it tooth and nail for years, they eventually changed the gene symbols just to get around that.
Also on the topic of "It's shocking that biologists ever relied on Excel as a database system, and even more shocking that they're still doing it."
My poor man, you would be livid with how few Excel documents in my job contain numbers at all.
Jon said,
May 6, 2024 @ 12:45 am
It's not just Excel, as the cartoon points out. Some time ago I saw a quote from the IT manager of Tesco, UK's largest supermarket chain. I don't recall the exact words, but the sense was "Anyone who doesn't think that their database is full of rubbish hasn't had a good look at it."
Before I retired, I used a database of half a million records, collected by others for other purposes over a couple of decades, to create maps. It took me weeks to clean up the data before it was usable. Some had to be discarded as uncleanable. Even when data is collected by professionals who know the problems, errors creep in. Epidemiologists include a step in their studies following data collection: data cleaning.
It occurred to me recently that there are two types of database: those whose owners care whether the data is accurate, and those whose owners don't care. Companies like Google, Facebook, etc don't care because they are in the advertising business. They collect detailed profiles of all of us, and much of the information is false or misleading. But it would take an enormous effort to clean it, and it is not worth the effort. Advertising based on data that is half right is far more effective than untargeted advertising.
Philip Taylor said,
May 6, 2024 @ 4:04 am
"using Excel requires the user (whatever discipline they think they're following) to act as programmer" — that does not seem an unreasonable requirement to me. Consider the analogy of (e.g.,) driving a car : when an evolutionary biologist drives a car, he is required to think as a driver, not as a biologist — if he were to think as a biologist, then he might well be inclined to mow pedestrians down at random, since by so doing he would be increasing the statistical probability of his genes being inherited rather than theirs. A computer (program) is a tool, and when using any tool one needs to be aware of its limitations.
AntC said,
May 6, 2024 @ 5:41 am
@PT Consider the analogy of (e.g.,) driving a car …
You might remember when driving a car needed double-declutching to change gear, or a manual choke, or a starting handle to turn over the engine, or when the starter motor would sometimes get jammed as the pinion engaged the flywheel — needing opening up the bonnet to give a gentle carefully-placed tap on its shaft.
I suspect car-driving biologists are rather glad to use electronic ignition and automatic gearboxes. The tinkering needed with Excel not so much.
bks said,
May 6, 2024 @ 5:59 am
Why “good PhD students are worth gold!” A grad student finds an error
…
https://retractionwatch.com/2021/02/08/why-good-phd-students-are-worth-gold-a-grad-student-finds-an-error/
KeithB said,
May 6, 2024 @ 7:57 am
Phillip Taylor:
Or a Classicist's way to drive a car. Later in life Tolkien was wealthy enough to afford his own car. His motto for dealing with pedestrians was "Charge them and they scatter!"
ANDRE CASTRO GURGEL said,
May 6, 2024 @ 8:36 am
Just surround gene codes with quotes. "MARCH11" will be left untouched.
This can be easily done en masse with a script run previously on the data to be imported.
Jonathan Smith said,
May 6, 2024 @ 8:58 am
speaking of which LMK if someone knows an encoding fix for Column A 'cities' within all_china.csv <a = href="https://zenodo.org/records/10697975"here (for me appearing æ€€å® / 潜山 / å®¿æ¾ / 太湖 / æ¡åŸŽ…)
Jonathan Smith said,
May 6, 2024 @ 8:59 am
oops
speaking of which LMK if someone knows an encoding fix for Column A 'cities' within all_china.csv here (for me appearing æ€€å® / 潜山 / å®¿æ¾ / 太湖 / æ¡åŸŽ…)
GH said,
May 6, 2024 @ 9:52 am
@AntC:
I think that's neither clear nor correct. The steps Philip Taylor described (which, BTW, are essentially identical on Mac) are merely the Excel UI structure for selecting the data format, and that has nothing to do with programming. It requires you to know where different options are in the Excel menu structure, is all.
I suppose you could argue that the very notion of different data types that behave differently are a programming concept, but it's really such a fundamental notion that it is unavoidable in any kind of data processing. (For example, when we get a bill, we understand that the prices of the individual items are added up, but we don't expect the names of the items to be similarly summed.) I would be very much surprised if distinguishing data types was not already required by other parts of the scientist's task.
If anything, it is Microsoft's attempt to save us from having to think about the data type by inferring it from the value that was causing the problem.
Philip Taylor said,
May 6, 2024 @ 9:58 am
Possible approaches to your problem here, Jonathan.
Jonathan Smith said,
May 6, 2024 @ 12:27 pm
Thanks Philip Taylor — this tool did the trick
(original text for the above: 怀宁 / 潜山 / 宿松 / 太湖 / 桐城…)
David Marjanović said,
May 6, 2024 @ 2:39 pm
And so did a fifth of the reviewers and a fifth of the editors. Unfortunately that's not remotely surprising.
Because nobody knows they exist.
No, they fail to opt against it. It comes as part of MS Office, which is installed on their office computers, and they're already using it for actual spreadsheet calculations…
Bingo, and that was our esteemed host's point about LLMs potentially making that much worse.
Terry K. said,
May 6, 2024 @ 3:03 pm
Regarding Excel formatting, you can select as many columns (or rows, or other group of cells) on a sheet as you want before changing the formatting. You can select the whole sheet by clicking in the little spot at the upper left of the sheet. So far as I know, if there's multiple sheets in a workbook you have to do each sheet separately.
Lasius said,
May 7, 2024 @ 3:24 am
@Philip Taylor
First of all, there aren't a lot of reproduction opportunities in prison. Secondly, what makes you think biologist think along these lines?
/df said,
May 7, 2024 @ 4:54 am
FWIW neither LibreOffice Calc (6.2 tested) nor Gnumeric corrupt text entered in a blank sheet A1 cell from
SEPT2
to a date, so researchers who replace the defective OEM software on their machines may be safer. However if their gene name were, say,2024-05-07
(depending on the locale settings), they'd have to force it to text with an initial single quote, or pre-format the cell/column to text, etc, as above.Philip Taylor said,
May 7, 2024 @ 6:46 am
"neither LibreOffice Calc (6.2 tested) nor Gnumeric corrupt text entered in a blank sheet A1 cell from SEPT2 to a date" — Excel does not "corrupt" such text, it interprets it as a date, which is in many cases exactly what the user would like it to do. Of course, there will be exceptions (e.g., the case of the SEPT2 gene/protein), but I would venture to suggest that the occasions on which Excel correctly identifies "SEPT2" as a date far outweigh the occasions on which it incorrectly interprets it.
~flow said,
May 7, 2024 @ 3:33 pm
So I have worked a fair amount with Excel and pride myself at having been quite good at it; at least I was good enough to impress *some* people with my skills (but then those were people that were easy to impress given their level of familiarity with the software): Like Philip Taylor, I also once typeset an entire book with Excel for which I had to write routines in Visual Basic to take the data and remote-control PageMaker (see how long ago that was; my present skills have devolved to 'somehow informed user' level by now) to draw hundreds of charts.
I only tell you that so you can appreciate that whenever I started working with a fresh copy of Excel (or Libre/OpenOffice Calc for that matter, a software that has spared no cost or effort to repeat every single damn bad design decision of Excel) I disabled all of the Automatismen like auto-caps or date recognition. I must say that there are *lots* of these autocorrect and auto-insert mechanisms, and usually I didn't catch them all in one go but had to work for a week or two with the occasional invisible hand mucking with my data (MY data, Excel, this is MY data) and me hunting down yet another 'helpful' setting. So yeah, not quite as easy and obvious as one would like to have it.
I'm somewhat on the fence as for a solution for the problem at hand. On one side I see the data worker who's not fully competent to use the software they're using and some kind of failure of oversight where either said data workers are provided with "low-surprise tools" (that could be a version of Excel with all the problematic settings turned off); on the other side I see fully grown MDs who are these days forced to struggle with font sizes in Word and autocorrect in Excel. There used to be specialists that did the typesetting and publishing stuff but those are mostly gone in a world that is dead-set on eliminating any and all jobs that can be eliminated (recent examples are Boeing and Tesla; cutting corners and firing people cost Boeing tens of billions of dollars for the plane crashes alone, but it's in principle the same everywhere).
When management insists on MDs doing the typesetting details for their papers, I guess society gets what they didn't pay for. I hate "digitalization" and the brain-dead manner that policy makers and management (i.e. the upper classes) are fever-dreaming and hallucinating a world where AI and robots do all the work for zero dollars. It's a shitshow already and can only become worse.
I can fully appreciate Jon's quote from Tesco, "Anyone who doesn't think that their database is full of rubbish hasn't had a good look at it."; my version of it used to be "every interesting data collection will contain faulty records".