
One in five genetics papers contains Excel errors (2016) - pjc50
http://www.sciencemag.org/news/2016/08/one-five-genetics-papers-contains-errors-thanks-microsoft-excel
======
hk__2
> The researchers, who published their analysis in Genome Biology, say the
> issue can be fixed by formatting Excel columns as text and remaining
> vigilant—or switching to Google Sheets, where gene names are stored exactly
> as they’re entered

I doubt that. I fight with Google Sheets every day because it thinks some of
my text values are dates.

~~~
nerdponx
I wonder if the world couls benefit from a SQLite-backed spreadsheet program.
All the data type BS would be basically solved.

~~~
ajross
Yes, imposing a schema on arbitrary spreadsheet data absolutely fixes typing
errors like this.

It also produces a tool that has the semantics of a formal database and not a
"spreadsheet", so none of the biologists in question are going to use it,
being biologists and not programmers. We can't fix this with tools. Like the
article said, proper scientific rigor is really the only solution.

In some fields (e.g. physics), computation is so fundamental to the research
that practitioners are basically computer scientists. We aren't there yet in
genetics, nor do we really want to be.

~~~
cosmojg
> We aren't there yet in genetics, nor do we really want to be.

Why not? Interdisciplinary, quantitative biology is the future of biology, and
computers make this kind of work so much easier for the same reason that they
make the work of physicists easier. More and more, it seems my colleagues
without at least rudimentary programming skills are being left in the dust.

------
skywhopper
This headline is quite unfair. If Excel didn’t try to normalize data that was
being input, how many more errors would there be in these and other datasets?
Anyway, no automated tool is going to make the right choices in every case.
The researchers and journals have a responsibility to take care to check and
clean their own datasets. They blame this class of errors on Excel. But if
they aren’t catching such obvious issues (like a gene name field that has a
date datatype and doesn’t match any gene names) what about all the errors that
aren’t Excel’s “fault” but are much more subtle?

~~~
windows_tips
Did the researchers license Excel for genetic work? This seems like a major
error to treat your gene data as dates.

~~~
SiVal
Excel doesn't treat gene data as dates, it treats text input in certain
formats as dates unless you tell it not to by explicitly specifying the data
type. Scientists have a whole constellation of tools to choose from--various
spreadsheets, Jupyter notebooks, R programs, custom apps written by
colleagues, etc.--and each one will have a certain feature set that will have
to be used properly to avoid errors. Excel is not "buggy" in this respect.

I will say, though, that I have been repeatedly surprised when creating large,
complex models for others in Excel at how easy it is to have a bug in your
"code" that isn't obvious. In this respect, it's a bit like JavaScript's
philosophy of "don't crash or sound an alarm, just make your best guess and
keep going." For simple things, this works fine, but for anything complex, I
really prefer more "bondage and discipline" from the platform, so my potential
errors are pointed out rather than forgiven.

~~~
windows_tips
I didn't expect Excel to understand gene data. I expected that scientists
would have used an appropriate tool for gene data.

>complex models for others in Excel

It's best not to do that except in cases of emergency, I think. If you are
going to get complex, SQL may be a much better solution, or something else.

------
jarfil
Spreadsheets are not databases... but everyone insists on using them as such,
so this happens.

~~~
hk__2
> Spreadsheets are not databases... but everyone insists on using them as
> such, so this happens.

They are, by Wikipedia’s definition: "A database is an organized collection of
data, stored and accessed electronically".

~~~
johnchristopher
So is a bunch of files in a folder then.

~~~
olooney
That's right. Many databases are literally just folders of files once you look
under the hood - for example, here is PostgreSQL's layout:

[https://www.postgresql.org/docs/9.4/static/storage-file-
layo...](https://www.postgresql.org/docs/9.4/static/storage-file-layout.html)

Note in particular the comment "Each table and index is stored in a separate
file" which shows that the file layout is not trivial.

SQLite is another database, but it also is "just files in a folder:" the main
difference is that there is a single file for all the tables and indexes but
it uses several other files in the same directory as well:

[https://www.sqlite.org/tempfiles.html](https://www.sqlite.org/tempfiles.html)

Note that MS SQL Server uses much the same strategy as SQLite, although you
have the option of splitting it across multiple OS files if you need more
control over physical storage:

[https://docs.microsoft.com/en-us/sql/relational-
databases/da...](https://docs.microsoft.com/en-us/sql/relational-
databases/databases/database-files-and-filegroups?view=sql-server-2017)

But if SQLite is a database, how is .xlsx or .ini or .json fundamentally any
different? I don't think there's anywhere you can draw a bright line that will
include all well known databases but exclude things that "merely" file
formats. If we draw the line at "transactions," for example, we'll find that
MongoDB and ElasticSearch no longer count. If we draw the line at "user access
control" then SQLite and redis are not databases.

Older RDBMSes like Oracle may eschew OS files in an attempt to gain more
control but they can't escape the fact that they too are are fundamentally
just "organized collection of data, stored and accessed electronically." And
then there are in-memory databases - which do not use the disk at all, but are
plainly still considered "databases" \- it's right there in the name! These
varied examples show that the word "database" is used very broadly and any
definition narrower than "A database is an organized collection of data,
stored and accessed electronically" will certainly fail to include many
instances of things which are commonly referred to as databases.

------
perl4ever
A lot of things contain errors because of Excel "helpfully" changing data.
Speaking as someone who's spent a lot of time dealing with databases and
spreadsheets provided by non-technical people to input.

~~~
mirimir
Same here. I frequently had to demand data in tab delimited format. Comma and
quote delimited didn't make the nut sometimes, when values included quotes.
Sometimes | delimited was the only workable solution.

------
tabtab
Ideally they should use a formal RDBMS, but that cranks up the cost of the
study. A data QA team also helps. The question is whether these measures are
"worth it". Are the Excel errors significant enough? Are 10 studies with a
conclusion flaw rate of 30% better or worse than 5 studies for the same cost
with say a 10% conclusion flaw rate? A cost/benefit analysis on this can get
tricky.

In almost any endeavor there is a quantity-versus-quality trade-off per given
unit of resources. In the IT field, we often say: A: cost, B: speed of
delivery, C: quality. You can only pick 2 of the 3, which are they? It's kind
of the same principle, but with delivery time more explicit.

------
ezoe
People doesn't realize that what they are doing in Excel is no other than the
programming. If the task requires programming, it should better be handled by
a programming language which support version controlling and debugging.

~~~
bjoerns
True. But isn't it a bit too easy to just tell people to stop using Excel
(which they won't do anyway) instead of giving them something to help them
solve their problems _within_ the Excel/spreadsheet world?

~~~
pas
Departments, journals and other institutions could crowdsource an Excel plugin
to solve this. (Or at least to lint the spreadsheets.)

------
Ma8ee
As someone who way too many times have had to try to wrangle data sent as
excel sheets into useful shape, this doesn’t surprise me even a little bit.

------
jacques_chester
I use occasions like these to point the website operated by the fine people of
the European Spreadsheet Risks Interest Group:
[http://www.eusprig.org/](http://www.eusprig.org/)

------
kmm
Does this affect the research itself, or only the presentation of it?

~~~
toufka
It _can_ affect the research - doing subsequent analysis on the resulting
data; searching for all genes with the name "J12" resulting in no hits,
because it's been renamed to "January 12th". This is especially true when
sifting through thousands of hits, where searches become more powerful than
just listing/sorting the top 10 hits.

There are apocrypha in the field about when such results actually impacted
results and grants. Not often, but not unheard of either.

------
rubatuga
Also, we should be using Entrez IDs or ncbi accession numbers instead of the
horrible gene symbol system

------
dsjoerg
Corrected headline: "One in five genetics researchers are careless with their
data"

See also "A poor workman blames his tools"

~~~
_Wintermute
It's an irritatingly common requirement for supplementary data submitted to a
journal to be in the form of an excel file, so I'd bet a number of those
errors were caused by researchers just going through all the hoops required
for publishing.

~~~
michaericalribo
Chiming in with the first big name journal I thought of—Nature, which lists
Excel as a more appropriate format for supplementary data than a table:
[https://www.nature.com/nature/for-authors/supp-
info](https://www.nature.com/nature/for-authors/supp-info)

It doesn’t mean Excel is the right tool, but it does speak to a fundamental
problem: it’s a non-starter to try to change the entire scientific community
in one move. We need to find ways to make the tools people use do a better job
at doing what we want them to, rather than shouting that they’re using the
wrong tool...

------
chevman
[2016] One in five genetics papers contains errors thanks to Microsoft Excel

~~~
dang
Thanks; we added the year above.

------
windows_tips
Thanks to the person whom ran their genetic data through their Excel
autoformatter?

Is this what really passes as science around here?

~~~
andbberger
Oh boy... I've got some bad news for you...

~~~
windows_tips
What is it?

~~~
rubatuga
That most geneticists aren’t as good at excel as you think

~~~
windows_tips
I don't know how good most geneticists are at Excel, it just seems to be the
wrong tool for the job.

~~~
michaericalribo
Unfortunately, in academia the only correct answer to the question, ‘what is
the right tool for the job?’ is the question, ‘what does your adviser use?’
Excel is not the optimal tool—I agree. But, it’s the tool (many) geneticists
use, and it’s important to find a way to make it work /better/ for the
purposes it is /being used for/, rather than taking up space complaining that
it’s /not the right tool/.

~~~
jhbadger
Unfortunately Microsoft Office is pretty much the software bench biologists
use. Excel for data files, Word for papers, and Powerpoint for presentations.
Bioinformaticans may use R for analysis and LaTeX for papers and
presentations, but these are seen as needlessly "nerdy" by bench types.

