
Gene name errors are widespread in the scientific literature (2016) [pdf] - lgeorget
https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-016-1044-7
======
nellaby
The amount of time I have wasted dealing with this in the past, and no doubt
in the future, is much greater than I would like to admit. It highlighted the
importance of plain text csv/tsv files and regular file backup/duplications.
In my opinion Excel should be avoided at all costs when doing the majority of
analyses, however for ease of use and data linkage sometimes it can't be
avoided.

However, I don't understand why Excel wont allow the user to set a standard
format for blank cells (i.e. plain text) and maintain that format until the
end of time, unless the user specifically resets it.

Select all, format cell before pasting data into a sheet should work as
expected. However, formatting is ignored as soon as anything is pasted into
the spreadsheet.

I can't help but find this ridiculous. Is that too much to ask for? Is this
behaviour more convoluted to implement or are there unforeseen consequences
that are beyond me?

------
askvictor
I've had the same problem in Google Sheets with student codes at schools I've
worked at. The standard code is first three letters of surname then a
sequential 4 digit number. So Jane Martin might be MAR0011 if she was the 11th
person with a surname starting with Mar. Now Google Sheets for some stupid
reason interprets MAR0011 as March 2011. Excel at least only falls for this if
the number is recognizable as a year (e.g. MAR2011) which is very unlikely in
most school for a few years yet. There are also plenty of surnames starting
with Jul and Jun.

I would have thought that Google (and possibly Excel) might have a saner
approach that considers the context of the rest of the column. A single date
value in a column of hundreds of text values should be probably treated as
text, or at the very least set of a big red flag in the UI.

~~~
thrower123
Trying to guess the format of arbitrary pasted values is nuts, and I don't
know how it has persisted for so long.

Another fun one is Excel truncating datetime values when you open a CSV. I
love those bug reports where somebody tells me a report is incorrect and
attaches the raw CSV, which is completely correct.

~~~
jerf
"Trying to guess the format of arbitrary pasted values is nuts, and I don't
know how it has persisted for so long."

The answer to this question, and a number of other people's similar questions
in this thread, is the sheer overwhelming quantity of people who would
complain if this behavior was changed, ranging from mumbling to serious
threats to not buy the software if it isn't changed back made to Microsoft's
CEO.

The problem is that the belief-set that says this is a crazy thing to do is a
very programmer-y one. Being a programmer myself, I fully agree. But there's a
lot of people who not only won't understand any proposed solution, but won't
even agree that what you're describing is a _problem_. To many people, it
sounds like a feature, and they will tell you they use it all the time, and
they will tell you it has never seriously bitten them. In the latter case, for
a rather large set of people, it'll be true... of course there will be a
decidedly non-trivial set of people as well whom it _has_ bitten, but they
just don't know it. But note those people will still be in the set of people
who will claim taking the "feature" away is something they would hate,
precisely because they don't know it has bitten them!

And even if you do convince them there's a problem, they'll probably still
disappoint you by continuing to resist the idea of forcing them to be clear
about the type of data they're entering, and "solve" the problem by adding yet
more layers of guessing, such as askvictor's suggestion.

It's really, really hard to create a tool that can successfully force users to
think more clearly than they already are, and if you do pull it off, users
probably won't like it and stop using and buying it.

------
bayesian_horse
There was once a story about a clinical trial being scrapped because it was
based on a one-off error in a list of genes.

------
jbj
I found the best solution is to quote everything from a tsv or csv file, in
case someone requests a dataset in xlsx format. but whenever I get an xlsx
file, I just learnt to always check for dates, it is sometimes possible to get
around this if there are other identifiers or genomic coordinates, but some
gene names will be converted to the same date, which leaves it impossible to
backtrack the original genename.

Unfortunately it does not look like something to be solved:

QUOTE: "Microsoft Excel is preprogrammed to make it easier to enter dates. For
example, 12/2 changes to 2-Dec. This is very frustrating when you enter
something that you don't want changed to a date. Unfortunately there is no way
to turn this off. But there are ways to get around it."

Source: [https://support.office.com/en-us/article/stop-
automatically-...](https://support.office.com/en-us/article/stop-
automatically-changing-numbers-to-dates-452bd2db-cc96-47d1-81e4-72cec11c4ed8)

------
swiley
I’ve seen wrong definitions for fractals (“non integer fractal dimension”) so
many times in what I thought were peer reviewed papers (usually they’re not
math papers though heh)

------
mnw21cam
The good news is that the HGNC is actively going around changing the really
terrible gene names into ones that don't cause trouble.

------
ssfak
The findings are well known, but I find it a bit ironic that the authors
provide supplementary material as an Excel file...

~~~
rflrob
In all likelihood, the authors are aware of this irony. Journals can be very
rigid organizations.

------
et2o
You basically can’t use HGNC names in bioinformatics. It’s much better to use
identifiers like ESENBML IDs or whatnot.

