Excel invents genes

« previous post | next post »

Mark Ziemann, Yotam Eren and Assam El-Osta, "Gene name errors are widespread in the scientific literature", Genome Biology 2016:

The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.

This was a problem a dozen years ago when I worked on information extraction from biomedical literature — it's amazing to me that it still goes on. The authors note that

Automatic conversion of gene symbols to dates and floating-point numbers is a problematic feature of Excel software. The description of this problem and workarounds were first highlighted over a decade ago [1]—nevertheless, we find that these errors continue to pervade supplementary files in the scientific literature. To date, there is no way to permanently deactivate automatic conversion to dates in MS Excel and other spreadsheet software such as LibreOffice Calc or Apache OpenOffice Calc. We note, however, that the spreadsheet program Google Sheets did not convert any gene names to dates or numbers when typed or pasted; notably, when these sheets were later reopened with Excel, LibreOffice Calc or OpenOffice Calc, gene symbols such as SEPT1 and MARCH1 were protected from date conversion.

It's shocking that biologists ever relied on Excel as a database system, and even more shocking that they're still doing it.

If you're unavoidably connected to people who do this, the authors have provided some scripts for assessing the damage:

Bash scripts, URLs and output data supporting the conclusions of this article are available in the SourceForge repository (https://sourceforge.net/projects/genenameerrorsscreen/).





  1. bks said,

    August 26, 2016 @ 9:00 am

    Just the tip of the iceberg on the horrors of Excel in Science. E.g. I'd venture that > 1% of the time, novice users will sort just one column when they mean to sort all rows in the table by that column.

  2. wtsparrow said,

    August 26, 2016 @ 9:09 am

    A simple start-up or new-document macro can designate arbitrary columns as text format, removing the automatic conversion.

  3. D.O. said,

    August 26, 2016 @ 9:39 am

    I guess, it is shocking in Casablanca manner.

  4. David L said,

    August 26, 2016 @ 9:46 am

    @bks: Funny you should say that. I recently had cause to start using Excel — not for science, I hasten to add — and I needed to do exactly what you mention. Doing what seemed obvious didn't produce the right result, and it took me forever to figure how to get what I wanted. Only now I don't remember how I did it, so if I need to do the same thing again I am back to square one.

  5. Sandwichman said,

    August 26, 2016 @ 10:45 am

    Excel supports two different date systems, using 1900 and 1904 as the reference years. By default the Windows version uses the 1900 system and Mac uses 1904. The 1900 system is compatible with other spreadsheets but incorrectly assumes that 1900 was a leap year and thus adds a February 29 that never happened.

    Cutting and pasting between files created on the different system can have disastrous results. I worked on a labor arbitration 15 years ago in which the other side's consultant had cut and pasted from one employer's Mac file into another employer's Excel file. This added a mysterious four years seniority to all of the employees from the Mac file. This of course rendered the consultant's analysis completely worthless.

  6. Charles in Toronto said,

    August 26, 2016 @ 10:52 am

    Working in a Chemistry lab, I often assembled documents and spreadsheets describing method development and validation for liquid chromatography and mass spectrometry (LC-MS). One of the most common solvents I used was acetonitrile, often abbreviated ACN.

    It was the bane of my existence that every time I typed ACN, it got "auto-corrected" to CAN.

  7. leoboiko said,

    August 26, 2016 @ 11:55 am

    I've read many badly-typeset Brazilian books where the word ter, “to have”, was randomly capitalized as Ter mid-sentence. I only understood the reason when I had the displeasure of typing a text in Microsoft Word, for some journal which wouldn't accept Latex or PDFs. I found out that, in older versions at least, Brazilian MS Word would by default auto-“correct” ter – one of the most frequent wordforms of the language, and an important auxiliary verb – because it treated the word as an abbreviation of terça-feira, “Tuesday”.

    To add insult to injury, Portuguese does not even capitalize weekday names – this was a creeping Anglicism. Even if the software could magically distinguish the (very common) verb from the (rare) abbreviated weekday, capitalizing it would be wrong. And this was default behavior!

  8. Theo said,

    August 26, 2016 @ 12:12 pm

    An ex-colleague told that they had discovered that a significant effect in a PhD thesis on suicide prevention was caused by a shift in columns. On a punch card. Data entry errors are really old.

    [(myl) Shifts in rows and columns are apparently quite common in modern biomedical research, sometimes with hair-raising consequences, as documented by Keith Baggerly in this talk from 2010 on "The Importance of Reproducible Research in High-Throughput Biology: Case Studies in Forensic Bioinformatics".

  9. Quodlibet said,

    August 26, 2016 @ 1:52 pm

    @Theo "Data entry errors are really old." I have a sneaking suspicion that they may be even older than punch cards. I truly hope that the error you mentioned affected only the thesis and not any practical suicide-prevention efforts.

  10. Rubrick said,

    August 26, 2016 @ 5:01 pm

    It's shocking that biologists ever relied on Excel as a database system.

    Echoing D.O. (after looking up the Casablanca reference): Surely not shocking to someone who's seen as much as you, Mark! Appalling, yes; depressing, surely. I, at least, am not even mildly surprised.

  11. maidhc said,

    August 26, 2016 @ 9:04 pm

    You can't even get the ISO standard date format 2016-08-28 if the language is set to American English. Canadian English is okay though.

    Obligatory: https://xkcd.com/1179/

  12. Ben Zimmer said,

    August 26, 2016 @ 9:40 pm

    And let's not forget the great Reinhart-Rogoff Excel scandal of 2013. Economists should know better too.

  13. Roger Lustig said,

    August 26, 2016 @ 10:56 pm

    @maidhc: you can, but you have to define it yourself under "custom."

  14. Guy said,

    August 26, 2016 @ 11:46 pm

    The use of SourceForge to host those scripts is rather comical – SourceForge is known to be incredibly sketchy and most developers have since moved to git(hub).

  15. Walter Underwood said,

    August 27, 2016 @ 11:04 am

    At my work, I generate a lot of data that includes ISBNs. I usually make CSV files from Python, then load them in Excel. Excel interprets the 13-digit ISBNs as floating point numbers and throws away the check digit when it is a zero. Sigh.

    I finally found a workaround, but it is vile.

  16. Jim McCusker said,

    August 28, 2016 @ 11:04 am

    I've dealt with the this problem directly, and it's not so much that excel is being used as a database system as that it's very useful to work with genomic data in CSV so that it can be browsed in Excel. This often leads to the data making its way through excel for final publication so that it can be "cleaned up" and nicely formatted (because many journals don't accept CSV files, which is a travesty). I've been fighting gainer this for years, and am an advocate of publishing everything as Linked Data RDF, so that the meaning of the data can be made 100% clear. But things move slowly in science, and even more so in Science.

  17. Dan T. said,

    August 29, 2016 @ 7:03 pm

    I have periodic (and futile) battles with people at work to try to stop them from loading CSV data into Excel and saving it out from there, which they often do even if they aren't changing anything in it because people these days don't seem to be capable of copying files from one subdirectory or drive to another, or downloading them from one website or email attachment and uploading them to another, without passing them through whatever program Windows wants to open them in.

    Among the things Excel screws up is to convert account numbers and postal smart barcode data to exponential notation, and strip leading zeroes from zip codes. (At least the latter is reversible; when I write programs that have to load in addresses from CSVs I make sure to check the length of zip code fields and restore leading zeroes to them as needed.)

  18. John Maline said,

    August 30, 2016 @ 5:20 pm

    MS office apps are overflowing with heuristics.

    When ever heuristics come up at work, my shorthand definition is "an algorithm that sometimes produces the wrong result"

  19. Ken said,

    August 30, 2016 @ 7:42 pm

    @Ben Zimmer: Some people think that Reinhart and Rogoff did know better…

    It's similar to Volkswagen's emission-testing scandal, which the corporation blamed on a design defect – a defect that allowed the car to detect whether it was being tested, and only turn on the emissions-control system during testing.

RSS feed for comments on this post