
Mistaken Identifiers: Gene name errors introduced inadvertently using Excel - youngerdryas
http://www.biomedcentral.com/1471-2105/5/80
======
tokenadult
Another comment asks,

 _why are we seeing so many articles about Excel?_

The main reason is that Excel was recently in the news because a paper about
economic policies for countries was found to have been based on data kept in
an Excel spreadsheet that was poorly documented enough that errors in the data
weren't found right away. (This is the highly condensed version of the story.)

These stories resonate here on HN because many, many, many of us have had
occasion to use Excel as a tool. Members of the general public who share
information with me (for example, contact lists for youth soccer teams) have
learned in a corporate environment to treat Excel as the "universal" data
exchange format. So I will be dealing with Excel spreadsheets for years even
if I never create another one.

One expects Excel to operate like a tool, a way to manipulate data in some
straightforward, well defined ways. I don't expect Excel to do what is
described in the article submitted here: treat any text data value with
certain embedded strings as special data types that the program can rewrite
without explicit user command. That turns Excel from being a tool in the
workplace to being a surly co-worker who habitually messes up other workers'
projects. I intentionally minimize my use of Excel because I don't like its
artificial intelligence turning into artificial stupidity while I try to get
my work done. To find out that Excel is actually actively impeding medical
research by rewriting data cells in spreadsheets is a dismaying example of why
I can't treat Excel as a useful tool, so of course I was glad to upvote this
informative submission.

AFTER EDIT:

I shared the link that opened the thread here among my Facebook friends, and
one friend commented,

"This is (luckily) old news and no bioinformatician worth his keyboard uses
Excel any more. It's just too much of a wild card."

He followed up after another friend's comment with

"Microsoft is squarely in the wrong here. The tool aggressively reformats
highly technical data fields and the behavior is remarkably hard to keep
turned off. I've been working in this specific field for 15 years, and I can
guarantee that the power users do know their tools. What they know these days
is to go use R or even one of the OSS applications like LibreCalc.
Unfortunately, more naive users new to bioinformatics analysis routinely get
tripped up by this and other overly assistive features of the Office suite."

~~~
maxerickson
Excel is not particularly rewriting data cells in spreadsheets. When importing
other formats, it has some automatic conversions that people apparently find
convenient. It's a little bit pedantic, but a text file with some sort of
structured format is not the same thing as a spreadsheet.

It probably doesn't make any sense to apply automatic conversions to formats
that are more or less defined by convention, but if you mark a column in an
Excel file as text, Excel won't apply magic to that column.

The bigger problem is that it is considered acceptable to not keep a record of
the changes being made to the data (a script can serve nicely as both a data
processor and a record of the processing being done).

~~~
lutusp
> if you mark a column in an Excel file as text, Excel won't apply magic to
> that column.

The problem is that the CSV file format has no reliable mechanism to mark a
column as text( * ). Known workarounds include inserting a single quote at the
beginning of a field, but this single quote may remain in the field forever,
polluting the data downstream from the import.

* The obvious approach of enclosing CSV text fields in quotes, and non-text fields without quotes, won't work -- too many Excel versions strip out all the quotes while importing, without considering the implication of their selective application. Also, in many Excel versions "DEC1" is converted into a date whether or not it's quoted.

~~~
niggler
I posted this the last time there was a discussion about CSV and excel:

If you are generating the CSV yourself, save yourself some agony and just wrap
the text in =""

    
    
        $ cat test.csv
        ="12.34567890124312341234123412341234",="1-5"
        $ open -a Microsoft\ Excel test.csv
    
    

<https://news.ycombinator.com/item?id=5514587>

Explanation:

It uses another auto trick: if the leading character is '=', the result is
treated as a formula.

Thus, you can actually write formulas in your CSV and excel will interpret
appropriately:

    
    
        $ cat test.csv
        =1+1,="1-5"
        $ open -a Microsoft\ Excel test.csv
    

You should see the value '2' (with content `=1+1`)

<https://news.ycombinator.com/item?id=5514704>

~~~
spikels
An alternative solution that works is some situations is to format the column
as "Text". However you must do this before pasting the data into the cells or
else it will already be converted. I use this technique for zip codes which
can have leading zeros.

------
carbocation
I love that they published a paper on this. For most of us, this is one of
those things that surprises you once and then convinces you not to use Excel
for genomic work in the future, since no magic should be applied without
explicit instructions from the scientist. For them, they realized that this
Excel/gene name error affected almost everyone at least once, probably
reasoning that journal referees would be quite sympathetic despite the
otherwise bland nature of the story.

The type of auto-conversion that's going in, as mentioned in the article, is
e.g., DEC1 (text) to 1-DEC (date), etc.

~~~
jforman
I just changed the format of the column to "text" from "auto" and kept using
it for my thesis work. It's too useful to bail on because of a quirk like
this.

~~~
lutusp
> I just changed the format of the column to "text" from "auto" and kept using
> it for my thesis work.

Yes, but for an existing spreadsheet, that won't work ex post facto. A
spreadsheet in which some gene names have selectively been converted to dates
won't be repaired by changing the column's data type. Such a remedy must be
applied in advance of the import.

------
a_p
This article was published in 2004. It was still causing problems in 2012 [1],
when a paper published in the prestigious journal _Cell_ had a few gene names
written as dates [2]. Even the GEO [3] had problems with this , according to
[1]. The culprit is automatic type conversion, which the user has to
understand in order to avoid it. Usually, the way around this problem is to
format the columns as text before the data is entered. But it is much harder
to work with someone else's munged data or spreadsheet, especially when it is
in the form of a .csv file. Even though there is an RFC standard for CSV
files, RFC 4180 [4], it is a description of how different editors handle CSV
files, and not a prescriptive definition.

[1] [https://dontuseexcel.wordpress.com/2013/02/07/dont-use-
excel...](https://dontuseexcel.wordpress.com/2013/02/07/dont-use-excel-for-
biological-data/)

[2] [https://dontuseexcel.wordpress.com/2013/02/07/dont-use-
excel...](https://dontuseexcel.wordpress.com/2013/02/07/dont-use-excel-for-
biological-data/)

[3] <http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL13667>

[4] <https://tools.ietf.org/html/rfc4180>

------
steven2012
I don't know why people are blaming Excel.

The problem is that these researchers, or even the economists in the paper on
Global Austerity, aren't properly trained in what is essentially a computer
programming task, and they don't do things like validate their data. These are
basically bugs in their spreadsheet that they didn't catch.

Sure, Excel may behave unexpectedly for their particular uses, but for the
vast majority of finance people, it works very predictability. If they had
spent time validating the data, they would have realized that the names had
been modified, and they could have corrected for it.

~~~
niggler
It's easy to blame the tools. It's harder to admit that you don't understand
the tools you are working with.

~~~
rbanffy
... or when the wrong tools become the de facto standard people are expected
to use.

------
lutusp
I am amazed by the diplomacy shown in the linked article. Nowhere does the
article suggest that Excel not be used for such important scientific work,
work that has a bearing on human health and well-being.

If the paper had been about drunk driving, or guns in the hands of children,
it could be expected to suggest obvious remedial steps along with the data.
But as to Excel, it's as though it's the only available tool for data
reduction and communication. In fact, it's one of the more expensive of the
alternatives, many of which are much better suited to the task being
described.

------
gus_massa
This article is from 2004, but I suspect that the problem is still present in
the last version.

A few month ago I had a similar problem with a list of usernames. One of them
was something like julio-90 (In Spanish, "julio" is the name of a person and
the name of a month.) and Excel changed it to the date jul-90 (i.e.
1990-07-01).

------
mjn
If you're importing from a CSV, there's a workaround that, unlike the
Microsoft recommendation linked in this article, doesn't require adding a
spurious apostrophe to the front of your data:
[http://stackoverflow.com/questions/165042/stop-excel-from-
au...](http://stackoverflow.com/questions/165042/stop-excel-from-
automatically-converting-certain-text-values-to-dates)

~~~
lutusp
You might have included the workaround that actually works from the linked
conversation's many suggestions that don't work -- add a space in front of
each field of the original CSV data that you don't want converted. This is
more effective and reliable if each CSV data field is quoted, and the space is
inside the quotes.

------
manojlds
What am I missing, why are we seeing so many articles about Excel???

~~~
ycitm
Because Reinhart and Rogoff was such a massive deal.

~~~
kenjackson
But the R&R error had almost nothing to do with Excel. The same mistake could
have been made with R or pen and paper. They were careless in a way I wouldn't
expect a high schooler to be careless. Not in some way that Excel did
something unexpected.

It would be like blaming C# because you did foreach over collection A, rather
than collection B.

~~~
ycitm
You're right, of course. The problem in the R&R case is not with Excel itself,
but that it's being used with a lack of rigour.

If it were being done in C# or R I would expect unit tests and so on. I'm not
saying that would make such a mistake _impossible_ , but programmers have
processes and tools for a reason.

In the OPs case the problem actually is with excel though.

------
moo
Another problem I've seen with dates and Excel is copying date information
from one workbook to another. The problem with this is that the workbooks may
be using different underlying date systems. Workbooks originating on a
Macintosh use the 1904 date system and Windows defaults to the 1900 date
system. Copying dates between workbooks can give you shifted dates.
<http://support.microsoft.com/kb/180162>

------
spikels
Anyone know why such a straightforward article has eight authors? What could
each of them possibly have contributed?

I have noticed in the past that bioscience articles tend to have lots of
authors but always thought it was due to their inherent complexity requiring
lots of different skills.

Perhaps it is really just a way to get more published papers for more people
to help their academic career. This may explain some of the super long CVs
these guys often have.

~~~
ams6110
It is a way to get more published papers. It's very common for a research lab
director or faculty member to be listed as an author on all papers produced by
their organizations. But the practice also recognizes people who may have been
involved in the production of important data in the paper, even if those
people didn't write any part of the paper directly or were acting in a
"service provider" capacity and are not part of the larger research project
that the paper describes.

~~~
spikels
Makes sense. I have read many more math, stats and econ papers than
bioscience. In there areas there tends to only be a few authors. Often there
is an acknowledgement of some helped received either in the paper or at th
bottom of the first page. In these areas paper are written by individuals
rather than organizations. Bioscience takes much more infrastructure and
coordination as well as being much more expensive.

Still seems crazy that eight "authors" should get credit for this article.

------
lutusp
I have to say this problem would go away in a flash if a database were used
instead of a spreadsheet program. The reason is that each database field has a
strict type, and "DEC1" in a database field identified as text would never be
converted, or in any way changed.

The problem is that Excel spreadsheet fields don't have any specific type --
they're defined on the fly by the data that's inserted into them. And worse,
different data are interpreted in different ways _in the same column_ , where
you would expect some consistency within the column.

Those accustomed to database work, using tables having strictly defined data
types, may be surprised to learn that, during an Excel import, successive
fields in the same column can be interpreted in a dozen different ways, based
on the data being read, not on the field's defined data type (which doesn't
exist in a spreadsheet).

What I find sad about these recent Excel stories is that few seem to be
willing to dump the program and choose an alternative.

------
Friedduck
I guess I'd just assumed that everyone was aware of Excel's penchant for
converting alphanumerics to scientific notation (any numeric sequence that
contains what looks like an exponent.) It also converts long numeric sequences
(say, a 20-digit ID number) to scientific notation. Any alpha sequence that
could be interpreted as a date ("02-01") of course will be converted thusly.

Most of these are avoidable _except_ the long ID number problem. Even with
careful formatting as text the last time I experienced the problem Excel was
still performing implicit conversions in ways that weren't immediately
apparent, and that rendered the whole experiment worthless.

The recent problems with bad formulas are easily solvable using the built-in
auditing features, or formula arrays, or just discipline. The shortcomings
reported for statistical functions ("Computational Statistics and Data
Analysis", June, 2008) are another issue altogether.

Jeff

------
bbgm
As noted this article is from 2004. Unfortunately such errors are still
around, as noted by Neil Saunders in a recent blog post[1].

1\. [http://nsaunders.wordpress.com/2012/10/22/gene-name-
errors-a...](http://nsaunders.wordpress.com/2012/10/22/gene-name-errors-and-
excel-lessons-not-learned/)

------
lunchladydoris
This article was published in 2004. I would hope that bioinformatics
researchers have ways of dealing with these issues now.

------
tosseraccount
As the paper points out, this an old problem. Excel is an excellent tool
unlike most Microsoft applications. You learn to live the quirks. When you see
the "September" genes, you just file it under "do a final cleanup and be sure
this doesn't wind up in the final submission of the paper".

