
Gene name errors are widespread in the scientific literature - campuscodi
http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
======
viraptor
I did some small tasks for people working in bioinformatics and what I've seen
is both amazing (the science part) and terrifying (the tools). Apparently I
saved hours of retyping for one person who was doing a manual JOIN on gene
names between two CSV files. As in ctrl+f name of gene from file A, copy,
paste into another window. For thousands of rows. I was trying to explain that
you can import CSV files without autoformatting as well, but they didn't
believe me... sigh... (at least they were aware that this is a bad idea with
default settings, because the formatting changes)
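
For what it's worth, the join itself is only a few lines once the files are read as plain text. A minimal sketch using nothing but Python's standard csv module, assuming both files share a column called "gene" (the column names and rows below are made up for illustration):

```python
import csv
import io

def read_rows(text):
    # csv.DictReader returns every field as a plain string,
    # so nothing is silently converted to a date or a number.
    return list(csv.DictReader(io.StringIO(text)))

def join_on_gene(rows_a, rows_b, key="gene"):
    """Inner-join two lists of dict rows on a shared gene-name column."""
    index_b = {}
    for row in rows_b:
        index_b.setdefault(row[key], []).append(row)
    joined = []
    for row_a in rows_a:
        for row_b in index_b.get(row_a[key], []):
            merged = dict(row_b)
            merged.update(row_a)
            joined.append(merged)
    return joined

# Made-up stand-ins for "file A" and "file B"
file_a = "gene,expression\nSEPT2,4.1\nMARCH1,0.7\nTP53,9.3\n"
file_b = "gene,pathway_id\nMARCH1,P01\nSEPT2,P02\nBRCA1,P03\n"

joined = join_on_gene(read_rows(file_a), read_rows(file_b))
for row in joined:
    print(row["gene"], row["expression"], row["pathway_id"])
```

For real files you would pass an opened file to csv.DictReader instead of the io.StringIO wrapper used here.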

I don't know if this can be fixed and how. Lots of people seem to have their
own process and they neither understand that the tools can be used more
efficiently / correctly, nor question the long, manual process they follow
currently. They basically require an "intermediate Excel" course which ends
with "if you're doing copy/paste 3 times in a row, you should look for a
better solution". I'm not even questioning the use of Excel at this point...

~~~
joshvm
This kind of problem is rife in academia.

Part of the problem, as a sibling mentioned, is that many people get into
these kinds of fields from non-programming or even computer-illiterate
backgrounds. At an undergraduate level, even in heavily numerical disciplines
like physics, there is relatively little coding. Even then, it's tedious crap
like F90 and bulletproof C. So people get into a Masters or a PhD in a field
they love and are suddenly confronted by data analysis, and they have no idea
what to do.

I've spoken a lot to my girlfriend (an astrophysicist) about this, as she's in
this position herself. It's not that she isn't smart, she just has little to
no experience with data wrangling and she's (in my opinion) been inadequately
trained. I've solved things with Python one-liners that she would have spent
literally days doing manually. We've had conversations along the lines of:

"But why does this file have 700 lines of input data filenames hardcoded? You
know you could write some code to grab them for you?"

"Yes, but in the time it would take me to write the code, I may as well do it
by hand." [In the end I wrote a 5 line regex to do it]
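
Not the actual regex, but a sketch of the kind of thing meant here, assuming the input files live in one directory and are named like run_001.dat (the naming scheme and the function name are hypothetical):

```python
import re
from pathlib import Path

# Hypothetical naming scheme: run_001.dat, run_002.dat, ...
INPUT_NAME = re.compile(r"run_\d+\.dat")

def find_input_files(directory, pattern=INPUT_NAME):
    """Return the sorted names of files in `directory` whose full name matches `pattern`."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and pattern.fullmatch(p.name)
    )
```

Calling find_input_files("data") would then replace the hardcoded list, and adding a new file to the directory needs no code change.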

So I can assure you, people question the manual process and they think you're
magical when you show them a quicker way of doing things. However, often there
isn't the motivation or confidence to try and be magical themselves.

~~~
th0waway
This is where I take the time to explain to everyone that "Yeah, while it may
save you time to do it manually, if you do it THIS way instead, you never have
to write it from scratch again, and your output goes waaay up".

Every job I've ever had, I always look at how my predecessors do the job, then
automate the crap that I don't want to deal with. :/

~~~
joshvm
I'm in the middle of writing a retrospective on my PhD - the stuff I wish I'd
done at the beginning. Perhaps unsurprisingly, it mostly boils down to
automate and pipeline _everything_. It almost always saves time because
inevitably you have to repeat work with parameters changed very slightly.

------
apathy
If only someone had warned them! Like, say, 12 years ago:

[https://bmcbioinformatics.biomedcentral.com/articles/10.1186...](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-80)

Bonus: the number of errors is positively correlated with Impact Factor (a
tool used by statistically illiterate administrative types to judge the
quality of research).

~~~
bioinformatics
When researchers stop using Excel as their main "database", this problem might
be solved.

~~~
Tomte
What's the alternative?

Remember, these are people who don't understand data types, or if they do they
are too lazy to declare them.

I don't see them formulating a correct SQL query. Or using any kind of
programming language that has strict typing.

It would have to be a custom-tailored system that knows about nomenclature in
the field. Sounds not very efficient.

~~~
Sacho
> I don't see them formulating a correct SQL query.

Well, neither can programmers, since SQL injection is still the most common
security vulnerability in software, so I agree, SQL is probably a bad tool for
the job :)

> It would have to be a custom-tailored system that knows about nomenclature
> in the field. Sounds not very efficient.

Is Excel a custom-tailored system that knows about nomenclature in the field?
The article seems to explicitly argue against that.

I don't think it's a failure to understand data types. It's a mismatch between
what you expect the software to do, and what it does by default.
Unfortunately, Microsoft has steadfastly refused to allow any way to change
the auto-formatting options (check out some people really pissed off for being
treated like children here -
[http://answers.microsoft.com/en-us/office/forum/office_2007-...](http://answers.microsoft.com/en-us/office/forum/office_2007-excel/stop-auto-correction-of-number-into-a-date/9968c54a-221b-4b18-a3d1-cfd3d312a8a6?auth=1)).

There are two useful prongs of attack here - one is to somehow force Excel to
conform to the expectations of researchers - perhaps an extension that works
to prevent the most egregious cases of auto-formatting gone wrong? The
alternative would be what you suggest - creating and marketing a custom
solution. The problem there is that you'd need either buy-in from a
significant number of researchers to spread it, or you'd need to replicate a
lot of Excel features to make the transition smooth for others.
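
Short of an actual Excel extension, a pre-flight check outside Excel is easy to sketch: flag the gene symbols that look like a month plus a day before anyone opens the file. The pattern below is only my assumption about which strings are at risk, not a description of Excel's real conversion rules, and the function name is made up:

```python
import re

# Assumption: a month name or abbreviation followed by 1-2 digits
# (optionally with a hyphen) is the risky shape, e.g. DEC1, SEPT2, MARCH1.
MONTHS = "JAN|FEB|MAR|MARCH|APR|APRIL|MAY|JUN|JUNE|JUL|JULY|AUG|SEP|SEPT|OCT|NOV|DEC"
DATE_LIKE = re.compile(rf"(?:{MONTHS})-?\d{{1,2}}", re.IGNORECASE)

def date_like_symbols(symbols):
    """Return the symbols whose whole text matches the date-like pattern."""
    return [s for s in symbols if DATE_LIKE.fullmatch(s.strip())]

print(date_like_symbols(["DEC1", "SEPT2", "MARCH1", "TP53", "BRCA1"]))
```

Anything it flags could then be imported as text explicitly, rather than relying on everyone along the pipeline remembering to format the cells.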

~~~
Tomte
> prevent the most egregious cases of auto-formatting gone wrong

Sure, but converting "DEC1" to "December, 1st" is not egregiously wrong, it's
a valuable feature and in most of the cases the expected thing.

~~~
Sacho
Not to a person who specifically _does not want_ that feature - and since
Excel does not provide a way to turn it off or customize it, my idea was that
an extension might be able to. Of course, I've never written an Excel
extension (macro?) so I have no idea.

~~~
Tomte
Of course you can turn it off. By formatting the cells as the correct type.

~~~
Sacho
That's a ridiculous argument and you should know it. Forcing users to do
manual work every time instead of having an option to disable or configure a
feature is just a UX fail. Doubly so because I would bet those formatted excel
files don't survive transition, and the data is actually transmitted in CSV or
whatever, so you'd have to reformat the data over and over every time you open
to edit it, and hope that someone along the way doesn't make a mistake. This is
a problem software is meant to solve, not create.

You could argue that all you can do is mitigation since CSV files don't offer
much ability to influence how Excel will load them, and all you would need is
one improperly-configured Excel along the pipeline to break the data. However,
this is a significant mitigation - Excel apps would be configured once, and
you would deal with a situation 1% of the time, and the solution would be
trivial (just configure it!). Instead now you're dealing with the problem every
time, and the solution (just mark the cells!) takes a lot more effort.

I don't think Microsoft necessarily has incentive to add this
configuration (the science community as a whole is probably a tiny blip on its
radar), but this is why we create modular and extensible software - so others
can tweak it to their liking.

~~~
Tomte
No, it's the correct argument.

Putting an option in to disable something that only a minuscule number of
users would ever want is not sensible.

Nota bene: I'm only talking about these DEC1-like conversions. I agree that
there are lots of conversions that are annoying.

------
rflrob
For a long time, I knew I was dropping this gene [1] from my analyses because
pandas automatically converted its name to a floating point. With the right
combination of flags I was able to get it to work right, but even real
programming languages are not immune.

[1] [http://flybase.org/reports/FBgn0036414.html](http://flybase.org/reports/FBgn0036414.html)
- short symbol is "nan"
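
One combination that avoids this, as a sketch and not necessarily the flags used above: tell pandas.read_csv to read the symbol column as strings and to skip its default list of NA markers. The column names and rows are made up for illustration:

```python
import io
import pandas as pd

# Made-up two-row file; the first symbol is the literal text "nan"
csv_text = "symbol,score\nnan,1.5\nsvp,2.0\n"

# Default behaviour: "nan" is on pandas' default list of NA strings,
# so the gene symbol is read as a missing value.
default = pd.read_csv(io.StringIO(csv_text))
print(default["symbol"].isna().tolist())

# Read the symbol column as text and disable the default NA markers.
fixed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"symbol": str},
    keep_default_na=False,
)
print(fixed["symbol"].tolist())
```

Note that keep_default_na=False applies to the whole file, so genuinely missing values in other columns then need to be declared explicitly through na_values.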

~~~
sevenless
"And I hope you've learned to sanitize your database inputs."

------
sevenless
Science is probably the least scary thing that an Excel bug can affect.
JPMorgan Chase's "London Whale" venture lost $2 billion in part because
spreadsheet modelers divided by a sum instead of an average to get a value at
risk.

[https://baselinescenario.com/2013/02/09/the-importance-of-ex...](https://baselinescenario.com/2013/02/09/the-importance-of-excel/)

Then there was the infamous Reinhart-Rogoff paper in economics, used to
justify harmful austerity policies worldwide post-2008, that came to false
conclusions using a row formula that wasn't updated.

[https://www.washingtonpost.com/news/wonk/wp/2013/04/16/is-th...](https://www.washingtonpost.com/news/wonk/wp/2013/04/16/is-the-best-evidence-for-austerity-based-on-an-excel-spreadsheet-error/)

~~~
woliveirajr
That's not an Excel bug.

I've seen co-workers getting wrong results due to a wrong copy-and-paste,
where no attention was paid. This is no bug: it's a misuse of the program.

Knowing your tools is always good advice.

~~~
sevenless
It is a bug. There is no way to stop Excel from automatically turning certain
strings in files that it reads in, into dates.

------
rdtsc
At the university I was working on a project to do parsing of gene
relationships from literature. And yeah, I remember the inconsistencies. Also,
genes have funny names: there is SHH (Sonic Hedgehog), DICER1 (which cuts
something, RNA or DNA, I forgot), and a bunch of other silly ones.

Ultimately though, coming from the world of algorithms and nicely organized
data, it was frustrating how disorganized the nomenclature seemed.

~~~
hyperion2010
Don't forget that the genomics folks also renamed a whole bunch of genes a few
years ago so now there are two different names for the same thing floating
around!

~~~
dekhn
For every gene, there are typically at least 3-4 names that reference the
gene. In some cases, two genes have the same name - for example, OCT1 and
Oct-1. The first is "organic cation transporter 1" and the second is "Octamer
binding protein 1". The second was "renamed" to (IIRC) POU2F1 but there are
still plenty of references to the old name even in new literature.

This is just one example. The entire gene naming area is a pile of bollocks.

~~~
apathy
And we haven't even touched on the glory of _Drosophila_ gene names.

------
niels_olson
Pathology resident working with big data and an undergrad in physics checking
in. I learned SQL. Met Stonebraker before the Turing Award. I taught myself
Python in part by working on Rosalind problems. It's not that I don't
understand. I don't have chunks of time large enough to quiet my mind, frame-
shift, and work on my research at the code level. I've got IRBs and all sorts
of oversight to deal with, budgets, etc. and the biology. A central success of
my project has been recruiting professional CS people early.

That's my single biggest suggestion for biologists: know that professional
computer scientists and programmers are desperately hungry for interesting
data and would love nothing more than to help you design the project up front
so they don't get sucked into the vortex of technical debt that _will_ swallow
your project if you don't set it up right early.

------
golergka
Excel is a fantastic tool, and widely used in bioinformatics — but to use any
tool properly, you have to learn it. I'm really thankful that it was taught
first in my bioinformatics class, before any specific tools or programming
languages, so we wouldn't commit stupid errors such as this one.

------
Snoooze
> Linear-regression estimates show gene name errors in supplementary files
> have increased at an annual rate of 15 % over the past five years, outpacing
> the increase in published papers (3.8 % per year).

This still doesn't actually tell us if the problem is getting worse. Or if it
does, it is badly worded. Even assuming this 3.8% is derived from their own
data, you need the number of papers published that contain gene lists (I would
imagine this has probably risen faster than the number of papers overall).

In other words, the authors should have plotted the error rate over time
rather than the number of errors over time.
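
To make the arithmetic concrete: if errors grow by a factor of (1 + e) per year and the relevant papers by (1 + p), then errors per paper grow by (1 + e) / (1 + p). A quick sketch; the 15% and 3.8% come from the quote above, while the 10% and 15% paper-growth figures are hypothetical:

```python
def error_rate_growth(error_growth, paper_growth):
    """Annual growth of errors-per-paper, given annual growth of errors and of papers."""
    return (1 + error_growth) / (1 + paper_growth) - 1

# Figures quoted from the article: errors +15 %/yr, all papers +3.8 %/yr
print(f"{error_rate_growth(0.15, 0.038):.1%}")

# Hypothetical growth of papers that actually contain gene lists
print(f"{error_rate_growth(0.15, 0.10):.1%}")
print(f"{error_rate_growth(0.15, 0.15):.1%}")
```

With the article's own figures the per-paper rate would be rising by roughly 10.8% a year, but if gene-list papers were themselves growing at 15% a year the rate would be flat.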

------
callesgg
Even worse, there is not even a proper definition of what a gene is.

The meaning changes depending on who you are talking to, what book you are
reading, and what part of the book you are reading.

~~~
roywiggins
Biology (the discipline) is messy because biology (the reality) is really
really messy.

------
nonbel
It would be interesting to a certain type of person to look back on this in
ten years and see if any of those papers were corrected, etc. Some links are
in this thread that would allow this to be done now as well. Also check out
the fMRI and microarray scandals that contaminated decades worth of
publications.

Did using correct data and analysis pipelines actually matter to the
conclusions these authors came to?

------
mtgx
Relevant:

[http://learn.filtered.com/blog/5-excel-bloopers-with-huge-co...](http://learn.filtered.com/blog/5-excel-bloopers-with-huge-concequences)

------
sevenless
Fucking Excel! They haven't fixed this utterly egregious bug in over a decade.

I'm going to make a sign and picket Microsoft when I'm next in the area.

~~~
garaetjjte
What bug?

------
gaur
I have long fantasized about writing a virus that forces Excel to quit if it
detects the user is attempting to use Excel to do scientific research.

~~~
viraptor
What's your proposed alternative that doesn't require hours to start with?
Serious question - I believe no programming languages can be applied here. (at
least not yet) The barrier to entry must be minimal.

~~~
sevenless
You seem to be asking why people shouldn't do data analysis without knowing
anything about data analysis.

~~~
viraptor
I'm saying - we've got lots of people who know the science and do not know or
want to know anything about programming. If we want good results, we need to
enable them to work better somehow. Proper programming is a good idea for
adding to university programs now, so that in ~10 years we can have this talk
again and point out that people know better tools and shouldn't be using
excel, or at least shouldn't be using it badly.

