
Scientists rename human genes to stop MS Excel from misreading them as dates - virde
https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates
======
LatteLazy
It's honestly amazing that Excel hasn't fixed this issue. It's pisses off an
enormous number of users especially in basically any non-US country (even if
01/02 is a date, it isn't the second of January in most of the world...)

~~~
owl57
I don't even know why anyone still uses this notation in today's connected
world. 2020-08-06 is ideal. Russian standard (06.08.2020) is uglier (digits
are not sorted by significance) but still unambiguous. But that slash horror?
Why would anyone continue inflicting this on themselves and their peers!?

~~~
reportgunner
yes, YYYY-MM-DD HH:MM:SS is the only proper format.

It is sorted from the greatest unit (year) to the smallest unit (second). If
you treat them as text and sort alphabetically they still get sorted from
oldest to newest.

Other formats don't sort properly.

writing dd.mm.yyyy is like writing time ss:mm:hh

writing mm/dd/yyyy is like writing time mm/ss/hh

If I really have to put YYYY at the end of the date I use the 'dd-mmm-yyyy'
format which excel translates based on client locale:

\- 13-mar-2020 in enUS

\- 13-bře-2020 in czech

~~~
arrrg
I agree with everything you are saying about sorting and I frequently use
YYYY-MM-DD, however I do want to make one point: DD.MM.YYYY does make some
human sense as a date format since the year is very rarely important (and if
the year is actually important often the day is so unimportant that it can be
left out), the month is sometimes important and the day is often the most
relevant piece of information.

So for any kind of practical planning that day first order makes some sense,
but I wouldn’t die on a hill for it.

~~~
reportgunner
I see where you are coming from, but I would argue that if you omit a part,
it's not a date anymore.

~~~
owl57
Yes! It's not the first time I posted a rant on the net and found myself
guilty of the sin the very next day. This time, a web app of mine has big
tables full of dates, and I didn't quickly figure out a way to show the full
date without too much clutter. So, I used "dd.mm" and postponed a finer design
till next year. Guess what, this UI element hasn't changed in the next three
years and today I finally stumbled on a bunch of records which really can't be
deduced from context.

------
Duller-Finite
Excel isn't the program of choice for most scientists and computational
biologists, who typically use R, python, or command line tools. However, we
often get data from other scientists or reanalyze data from other groups that
can have these errors. It's so frequent of a problem that there are scientific
papers about it [1].

[1]
[https://genomebiology.biomedcentral.com/articles/10.1186/s13...](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7)

~~~
aden1ne
This ignores the reality that one will ultimately have to interact with people
who have no understanding of any of these tools. Unless you happen to work in
a pure computational biology group, one _will_ have to interact with lab
workers, biologists with no training (or understanding of) in R or python,
doctors, etc. All these people will know excel.

~~~
Duller-Finite
That's why I'm in favor of this change in nomenclature.

~~~
otherme123
IMO the change is good, but is a case of detected vs undetected.

Recently I was working with some colleagues, being I the computer savvy and
them the lab people. I send them some data in CSV, that when opened in Excel
turned 123.456 into 123456 (it was a problem with locales, some people using
"," as decimal and some using "."). We noticed because the values should be
between 0 and 1000. But what if the column could be between 0 and 1000000? A
small quantity of numbers bumped up by a factor of 3 could fly under the
radar, and distort further measurements. And the error is undetectable forever
once published.

I like it better the programming language approach: look, this is how you
write a string, this is a char, this is a float and this an integer.
"2020-08-04" is a string until you ask me to turn it into a date. "SEPT1" is a
string, and you are going to do quite the gymnastics to make me understand it
as "date(2020, 9, 1)". Do you like "," or "." as thousands? Then we first turn
the number into a string and then format, but the original number is kept.

~~~
ImaCake
Excel technically has a type system where you can change the type of value a
cell has. In my experience it is difficult to convince excel to actually
change the cell's data type sometimes. Doing so can often change the
underlying data as well. Personally I avoid excel if I can, because it's
quirks are just too frustrating. But it certainly has its uses.

------
mcv
> _" Why, exactly, in a fight between Microsoft and the entire genetics
> community, was it the scientists who had to back down?"_

Back down? Or pick a better tool. If Excel proves to be an unreliable tool for
your job, use a better one. Alternatives exist, ranging from Google Docs, and
LibreOffice, to simpler light-weight spreadsheets. Or possibly more specialist
tools.

Why does everything always have to be put in Excel if Excel is such a poor
tool for so many things?

~~~
acid__
Excel may have failed in this specific task, but let’s not pretend like its
functionality doesn’t run circles around Google Docs and LibreOffice.

Excel is a “pretty darn good” tool for 95% of tasks. If your work has highly
varied workflows, then that flexibility more than makes up for its failures on
the last 5%.

If you have very specific workflows on the other hand, you may find value in
replacing Excel with a specialist tool. But let’s not pretend that specialist
tools don’t also have their own shortcomings; at best they’ll achieve 99.9%
coverage of tasks.

~~~
reportgunner
From my point of view Excel hits a sweet spot between 'very simple tasks' and
'very complex tasks'.

~~~
_emacsomancer_
The sweet spot being "too complicated for simple tasks" and "not sophisticated
enough for complex tasks"?

~~~
BbzzbB
>"too complicated for simple tasks"

How can Excel possibly be too complicated for simple tasks? It is pretty much
as straightforward as it goes when it comes to grid-file viewing and editing.
You can show it to anyone from a high-schooler to a 60 year old (with minimal
experience on computers) colleagues and they will figure it out rather easily,
good luck teaching Python/Pandas to the latter.

>"not sophisticated enough for complex tasks"

Not sure how that works either really. Between formulas and VBA macros, people
have and are making tools complex enough they have no business to be an Excel,
and yet they are even if it isn't the best tool for it. Once you go past that
point, Excel isn't even in the conversation nor does it pretend to be able to.

It has issues, and people playing or working with complex (or simple) data
would be better served to learn programmatical tools, but until they do Excel
will serve them well as long as they stay wary of basic quirks.

~~~
reportgunner
> _How can Excel possibly be too complicated for simple tasks?_

Excel is too complicated to send an e-mail for example. You can send e-mail
from excel but it's much simpler to do in Outlook.

Excel is too complicated for simple math = if you want to do 1024*436 it is
better to fire up a calculator rather than wait for excel to start up.

~~~
BbzzbB
That's irrelevant tho, Excel isn't meant to be Outlook so why should it also
be able to send email straightforwardly? The fact you're even able to send one
thru VBA to me shows it is not "too simple for complex tasks", albeit probably
ill-suited in this case (and many more, but that's besides the point).

And while the calculator might be marginally faster if you exclude start-time,
Excel is still perfectly fine for your use-case it only requires to type one
extra symbol than the calculator, namely "=1024*436[ENTER]". Hell, it's
probably faster to do it in Excel if you already have an active window. I
don't see how it could possibly fulfill the task in a simpler manner.

Is R a bad statistical tool because I can't use it as Photoshop?

------
hermitcrab
Excel is an amazing tool. But it also has some significant shortcomings:

* A well known tendency to mangle date and gene data under the guide of being 'helpful'. * Easy to making mistakes when cutting and pasting cells. * Difficult to see what is going on in a spreadsheet. * Poor handling of CSV files.

Some of these shortcoming are inherent to spreadsheets. Others are specific to
Excel, but hard to overcome due to the weight of backward compatibility.

I have written a product for transforming and analysing tabular data
([https://www.easydatatransform.com](https://www.easydatatransform.com)) that
tries to overcome these issues:

* Doesn't change your input data file. * Doesn't re-interpret your data, unless you ask it to. * See changes as a visual data flow. * Operations happen on a whole table or column. * Good handling of CSV files.

Also it doesn't try to do everything Excel does.

It is a fairly new tool. Would appreciate some feedback.

~~~
gverrilla
I'm a business guy, so I don't know much about it, but looks very useful and
easy! I have been off the game for 5 years now, but I used to run a magento
ecommerce, and I think your software might have helped managing products
listings and stuff like that at the time. You might wanna take a look at this
costumer segment, particularly small and medium businesses. They probably have
a process for this already, but your software might be a good replacement,
even though they won't be actively looking for it because they already have
something that works. Good luck!

~~~
hermitcrab
Thanks. I will see what I can find out.

------
leecarraher
Final Jeopardy : Trebeck: You know what, how about you just write down a
number, any number at all. Could be a 1 or a 2, perhaps 3... and excel you
answered; A smiley face emoji, simply stunning.

Excel is no longer motivated by the original intention of a spreadsheet, and
now caters to the lowest common denominator, a piece of graph paper. As such
MS has shifted focus from doing calculation to text and graphics layout tool.
white text copied from a terminal : white text white background you got it!
comma separated numbers : default a long string with commas in it want a plot
: it is in the insert menu for some reason, since plots and numbers are no
longer excels raison d'être

~~~
ubermonkey
MSFT did the same thing with Project when they introduced something called
"Manually Scheduled Tasks."

Project is fundamentally a critical path scheduling tool, or at least was.
Task A must finish before Task B, which must complete before Task C. If A is
delayed, then that delay pushes B and C out, too.

This is what it's FOR, more or less.

Manually scheduled tasks don't move. They're set with whatever dates you give
them, and do not move in response to delays or whatnot from predecessor tasks.

People _wanted_ this because some (dumb) people insisted that "well, that task
CANT move because it has to be done by then!"

This is akin to asking for the arithmetic engine to be turned off in Excel,
because by golly you really need 2 and 2 to sum to 17.5.

~~~
VonGallifrey
> People wanted this because some (dumb) people insisted that "well, that task
> CANT move because it has to be done by then!"

So? There are tasks like this that simply CANT move, why should a project
management tool not have the ability to model that? If you don't like the
feature you don't have to use it.

I get that you wouldn't want these fixed dates and if you can just plan and
execute your project by yourself, but that is not always the case.

As an example: I had a project where it was really important for us to inspect
the Aircraft Type (A321) into which we wanted to install our hardware and
software. We where given an opportunity to do so by an airline during the
maintenance of one of their A321. This was a fixed date. If we didn't finish
our preparations before that maintenance date then we simply would not get
this opportunity (or at the very least would have to wait a very very long
time for the next opportunity). The Aircraft would simply not wait for us.

Not every task can move and just because you think they should does not make
it so. Maybe there is more to Microsoft Project than just YOUR use-case?

~~~
ubermonkey
This is such a "peak HN" kind of response.

I've been working with scheduling and project management tools for 20 years --
not just MS Project (widely regarded as the idiot cousin of the market,
honestly) but also things like Primavera.

There's no reason to have a "manual" task in a scheduling tool.

First, if you have a task that can't move, you set it with a deadline and
watch your deadline (as well as watching to see if the task moves PAST the
deadline). Cementing the task in place doesn't help you; in fact, it actively
HURTS you because it hides the fact that your forecast path isn't valid
anymore.

Second, these tools ALSO include the idea of constraints. Scheduling tools
include the idea of constraints which limit the critical path motion according
to specific rules (based on the type of constraint in play).

Using ANY constraint, though, is frowned upon in serious scheduling circles
precisely BECAUSE they distort the predictive ability of a critical path
schedule. If Task C has a hard deadline of 1 Sept, then you watch you critical
path to see if that remains possible. As tasks slip, you stay on top of the
chain of tasks to seek opportunities to streamline or reduce scope ahead of
critical task so that the deadline can be met.

(Guess what? It's not always possible.)

And you do this because you SEE that the schedule shows Task C moving to the
right.

If you lock the task in place, odds are you won't notice that your critical
path is collapsing.

The tl;dr is that cementing a task in place in a critical path schedule is not
a good way to model deadlines. This is something any competent scheduler will
tell you. It's part of the PMBOK, it's built into DoD scheduling guidelines,
etc.

>Maybe there is more to Microsoft Project than just YOUR use-case?

Hilarious. I will say it's clear one of us doesn't quite understand the
problem domain as well as they might represent, but given my background I know
it's not me.

~~~
iterati
This feels more "peak HN" than the previous poster as it's dripping with self-
importance and condescension.

~~~
ubermonkey
Or, you know, actual knowledge about the problem domain. But you do you.

------
rkachowski
My mind is blown that Excel's usability is so bad that the representation of
__the human genome itself __has to adapt around it 's undesired behaviour.

As in, the history of genetics research is now irreversibly linked with the
shortcomings of this one software product, which just happens to be incapable
of describing the genetics of the organisms that created it.

~~~
mmcgaha
I hate to sound like a salty old IT guy, but here we go. It is not the fault
of Excel that people are using it wrong. They have the ability to import the
data as text but they skip that step all together. If the user does not say up
front what the column is, Excel has to guess. If Excel didn't try to guess,
someone would be making a comment on how bad usability is when an obvious date
field was getting interpreted as text.

~~~
WorldMaker
It's also behavior that goes back to the Ancient Times and predecessors such
as Visicalc and Lotus 1-2-3. Even ancient ones will tell you if you need to
enter a thing and it has to be text and only text precede it with a quote
mark, ie 'MARCH1, just as you would precede a formula with =. It's Spreadsheet
101 knowledge dating back many decades.

The clickbait headline is fun, but the real headline is more like "Scientists
find it easier to rename things than learn the basics of data entry in the
tools they use".

~~~
jackvalentine
> It's Spreadsheet 101 knowledge dating back many decades.

Astounding, I've literally never heard of this in the 20 years or so I've been
using spreadsheets. I'll be using it from now on!

~~~
WorldMaker
Welcome to the lucky 10,000 [1] club! It's a fascinating thing about cultural
knowledge that stuff that should be taught early in using a tool, people
discover anew every day. It's a useful reminder that I also wasn't disparaging
the scientists for finding renaming easier than retraining. I'm sure these
scientists were very familiar with the costs of learning/relearning. (ETA:
Which is why my real joke was about ancient ones and how easily knowledge of
the 70s/80s seems ancient and easily forgotten.)

[1] [https://xkcd.com/1053/](https://xkcd.com/1053/)

------
Cactus2018
In a previous version of MS Excel, after opening a CSV file, Excel would
silently write the 'interpreted' version to disk. So much trouble, simply from
previewing a CSV.

~~~
ImaCake
I have had excel change date formats on me several times because of this
"feature". I noticed it no longer does this sometime in the past year.

------
glofish
Infuriatingly,

the paper announcing the new guidelines of renaming genes, a work of
fundamental importance to all scientists in the world, cannot be read without
an expensive subscription to the journal.

[https://www.nature.com/articles/s41588-020-0669-3](https://www.nature.com/articles/s41588-020-0669-3)

Thanks science (sarcasm!)

~~~
rolph
copy paste the DOI into scihub

~~~
LordDragonfang
That's entirely missing the point.

~~~
dj_mc_merlin
it works

~~~
aaron695
Just like Excel

~~~
diarrhea
Until it doesn't...

------
jordigh
To people asking, "why do they use Excel?" that's like asking "why must we be
subjected to gravity?"

The whole world's data ultimately comes from or ends up in an Excel
spreadsheet. Sure, we might use other intermediate data storage methods, but
in the end it's going to go into some scientist's or some politician's
computer, and by golly it's gonna be in Excel. Trying to rally against Excel
is like trying to rally against fundamental forces of nature.

This is just an example of that fundamental force winning.

~~~
meow1032
I don't disagree completely with this, but just want to point out that it's
kind of a bad smell to have computational biologists who are - as someone in
the article puts it - computationally illiterate. I have met lots of these
types over the years, and usually their methods are kind of a gong show. If
you can't properly sanitize your data inputs on your column headers, why
should I trust that you've treated the rest of your data properly?

~~~
acidburnNSA
I have a strong feeling that, if people really put an effort into reading and
replicating more papers, we would find that a lot of what's being published is
simply meaningless.

In grad school I had a subletting roommate for a while who was writing code to
match some experimental data with a model. He showed me his model. It was
quite literally making random combinations of various trigonometric functions,
absolute value, logarithms, polynomials, exponents, etc. into equations that
were like a whole page long and just wiggling them around. He was convinced
that he was on a path to a revolution in understanding the functional form of
his (biological) data, and I believe his research PI was onboard.

I guess "overfitted" never made it into the curriculum.

~~~
gameswithgo
isn’t the saying, 80% of everything is garbage?

~~~
goatlover
Yeah, but one would hope that science has a higher standard. 80% garbage
results in science sounds catastrophic to our understanding of the world, and
in particular when it comes to making policies based on that science.

~~~
adamsea
There's the saying "science advances one funeral at at time."

'‘A new scientific truth does not triumph by convincing its opponents and
making them see the light, but rather because its opponents eventually die,
and a new generation grows up that is familiar with it.’ This principle was
famously laid out by German theoretical physicist Max Planck in 1950 and it
turns out that he was right, according to a new study.'

[https://www.chemistryworld.com/news/science-really-does-
adva...](https://www.chemistryworld.com/news/science-really-does-advance-one-
funeral-at-a-time-study-suggests/3010961.article)

Also the story of Ignaz Semmelweis who discovered that if doctors washed their
hands it reduced deaths during childbirth - but for a variety of reasons his
findings were resisted.

[https://www.npr.org/sections/health-
shots/2015/01/12/3756639...](https://www.npr.org/sections/health-
shots/2015/01/12/375663920/the-doctor-who-championed-hand-washing-and-saved-
women-s-lives) [https://www.npr.org/sections/health-
shots/2015/01/12/3756639...](https://www.npr.org/sections/health-
shots/2015/01/12/375663920/the-doctor-who-championed-hand-washing-and-saved-
women-s-lives)

Point being, as awesome as science is, it's still a human enterprise, and
humans are still, well, human.

------
lordnacho
The problem is that your average business user of Excel thinks type safety is
a bug, not a feature. If Excel enforced types instead of guessing on your
behalf, a lot of people would complain.

And it would be really hard to explain why it is sometimes useful to have
constraints to someone who normally sees Excel's flexibility as its main
strength.

I don't think I've ever seen a non coder use Excel in a sensible way:
maintainable, easy to change, consistent meanings of entities, simple to
understand. It's always a ball of spaghetti, even for pretty small projects.
Loads of VLOOKUPs, external DLLs, buttons everywhere, vba files galore. Plus
they lay out the cells haphazardly.

~~~
jrott
Yeah excel produces hairballs so easily. From a coder perspective I've only
seen excel workbooks that are bad or terrifying.

On the other hand a ton of people that don't write software for a living and
have no interest in code manage to produce things that help them do their job
and automate a ton of tedious stuff.

~~~
lordnacho
Most people don't write novels or poetry for a living either, but we still
need some grasp of basic writing skills like grammar and structuring. We're
already moving towards a future where a lot of people have coding as part of
their job, and we should expect them to know a few things.

~~~
jrott
Oh I totally agree. It’s just worth acknowledging where most bad excel
spreadsheets come from or at least it gives me more patience with them.

------
yummypaint
Why in the world is excel the application of choice? Last time I tried to use
it (for much less complicated things than genetics), it choked and became
unusable when the filesize exceeded about 6MB. I have yet to encounter a
spreadsheet oriented task that isn't better implemented in gnumeric. Maybe we
should also shorten all publications so adobe reader can display them without
crashing?

~~~
zapdrive
Maybe you should ditch your Pentium 1 and buy a newer computer?

~~~
mellow2020
If the same machine can do it without hanging with better coded software, the
hardware obviously isn't the problem.

------
c3534l
I've always stood by the principle that a computer should never "correct"
human input without asking and this is a great example of that. There should
be a prompt asking if you want to convert the data, and there should always
remain a way to undo "corrections," but it should _never_ happen without the
user knowing about it or wanting it to happen.

------
mnw21cam
The problem is it isn't just dates. If you load a list of genomic variants
(read: mutations) into Excel, then the standard way to describe whether a
person has the variant or not is to use "0/0" for no, "0/1" for yes
heterozygous, and "1/1" for yes homozygous. Guess what gets auto-converted
into the first of January.

------
alphanumeric0
Research software dev here. This came as a huge shock to me when I started at
my job. I work with very smart, dedicated people performing cancer research,
why would they put up with this affecting their productivity? Humans really
are adaptable creatures.

After a few months of working there my boss handed me 3 or 4 Excel
spreadsheets to compare to ensure a recent change I made hadn't affected our
data (we don't have much in the way of automated tests either). As a software
developer, this was a deeply troubling request.

One option was to load them in to database tables so that I could perform SQL
queries against the data (Postgres has COPY that works with CSVs), which isn't
hard and probably the path most people should take, but I didn't want to write
table definitions.

I ended up using
[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv) (I am
not affiliated with the project in any way). It's a command-line tool written
in Rust that performs queries/joins/manipulation/basic analysis against
CSV/TSV files. While not as analytically powerful as Excel or Postgres, I was
able to verify the data was good and pipe out results into another file
without writing any custom code, and without opening a single file.

~~~
MaxBarraclough
Were the files expected to be identical? If so, _diff_ would have done the
job.

Perhaps not directly relevant, but the lesser known GNU Recutils looks neat.
Perhaps some day I'll find an opportunity to try it out.

[https://www.gnu.org/software/recutils/manual/recutils.html#I...](https://www.gnu.org/software/recutils/manual/recutils.html#Introduction)

------
t-c-h
As a bioinformatician I'm not too fond of this shift. Software should be
sculpted around our needs, not the other way around. It's basically submitting
to the fact that we've stubbed our toes hundreds of times to the exact same
rock and never learned.

But then again, I use Excel rarely, usually only at the very end of some
analysis (even then I prefer R/Python libraries for visuals). So I do have
sympathy for wet-lab researchers who rely heavily on Excel.

------
foresto
Somewhat tangential:

If you use software in American English but dislike American date and time
formats, you might see if your OS respects the en_DK locale setting. It works
on most recent linux systems I've tried.

$ LC_TIME=en_US.UTF-8 date '+%x %X'

08/06/2020 01:00:00 PM

$ LC_TIME=en_DK.UTF-8 date '+%x %X'

2020-08-06 13:00:00

------
numpad0
Excel’s problem is it _destroys_ the original keystrokes.

Anyone know why? It makes little sense to me.

~~~
dkarl
Because usability is measured by the naive expectations of the least
sophisticated users.

~~~
numpad0
But haven’t you seen the least sophisticated users trying to type in the same
sequence until it sticks?

------
Brett_S
If the scientists had asked someone who knew Excel well, then they would have
been told to prevent autocorrect from running enter ’MARCH1 with the
apostrophe at the start.

~~~
mrunkel
This. Why does nobody know this?

------
zw123456
There is a super simple work around for this and it is all over the web if you
search for it. Open a blank sheet, go to data, click on From Text, that allows
you to import data identifying the type for each column. Yes, it is a couple
more clicks but for 99% of people the auto-formatting is probably nice. I have
shown this to people a million times at work and they are always amazed. But
honestly a quick google search provides the solution.

------
bronzeage
Honestly, Microsoft should fix Excel to stop corrupting data by default. This
is 100% Microsoft's fault that an international organisation resorts to
workaround renaming things because they needlessly parse and modify input in a
default configuration.

You can also say pretty much many of Microsoft security issues over the years
boil down to their programs needlessly overthinking and parsing perfectly
valid input, for obscure reasons.

------
csours
Excel also clobbers long numbers, which has caused all manner of confusion for
serial number audits.

To prevent this, set your whole sheet to Text before any other steps.

------
Shivetya
It is just so fun when you have @ as a lead character for fields. You can do
DATA-IMPORT from CSV and it is fine but if you just double click load a CSV
from explorer it tries to interpret the data as a formula and randomly loses
the @

I have not check myself the full list of special characters that cannot be
loaded CSV style from Explorer but one should expect consistency with DATA-
IMPORT functionality

~~~
Cactus2018
Two more examples.

In a CSV with zip codes, Excel drops the leading zero: Boston, 02114.

A CSV with text ranges: 1-10, 11-20, 21-30... becomes 10-Jan and 20-Nov!

~~~
qayxc
That's because CSV is an untyped data format.

Why don't people use the import options available to them? You can select the
precise data type of each column if you know the format anyway.

While the default choices that Excel makes are questionable at times, they're
both known and can easily be overridden.

------
alistairSH
Ugh. Why is "MARCH1" interpreted as a date at all? Does anybody use that as
shorthand for March 1?

------
Eyas
I'm surprised no one here mentioned backwards compatibility as the likely
ultimate reason Microsoft hasn't fixed this by default yet.

As others have mentioned, the "fix" of turning off auto-formatting is already
available. But many here are wondering my Microsoft hasn't fixed the default.

I assume it's for the same reason Microsoft Excel purposely claims the date
1900-02-29 is valid, to be compatible with Lotus-1-2-3. I assume many legacy
spreadsheets would break unpredictably if interpreting values as dates by
default changed.

------
njarboe
"There’s no easy fix, either. Excel doesn’t offer the option to turn off this
auto-formatting, and the only way to avoid it is to change the data type for
individual columns. Even then, a scientist might fix their own data, but as
soon as someone else opens the same spreadsheet in Excel without thinking,
errors will be introduced all over again."

These are easy fixes. I have this problem using excel. Just format your data
as text. If you do that and give the excel file to someone else the errors
will not be introduced again. The quote above is at best misleading and I
would say just wrong. The problem only happens when you import data in a non-
excel format like comma delimited. In that case you have to click on a button
when importing to excel to import it as "text" instead of automatic excel
formatting. It would be great if you could set this as the default (maybe you
can, but I don't know how), but it is something you get used to very quickly.

Think of all of the financial people that use excel that would have this same
problem. I'm sure there are some mess ups, but I don't think that industry has
the same endemic problem with this excel file formatting issue. There are
definitely problems with using excel, but I don't think this would have to be
one of them.

Edit: Of course it is not an easy fix to get everyone to use Excel correctly,
but these types of bad data errors come from all over the place if the person
using the data is not careful.

------
vikramkr
A lot of people are attacking excel in this thread, just remember to give a
fair share of the blame to people naming genes as well. These are meaningless
names and changing them is frankly easier than changing a feature in excel
(that the finance folk probably don't want changed). And its a good excuse to
clear up some of the egregiously silly disease related names as well so we
don't tell parents their kid is suffering from a debilitating mutation in
luke-Skywalker-like upside-down cantaloupe 12b or whatever.

~~~
rolph
no these are not meaningless names, there is undue confusication in the case
of the more contemporary names such as [Sonichedgehog]

these are not meaningless lables
[https://en.wikipedia.org/wiki/Sonic_hedgehog](https://en.wikipedia.org/wiki/Sonic_hedgehog)

mentioned eslewhere in this thread is a nomenclature that allows one to easily
find notes referring to the gene from your research library. It was a new
cadre of young upcoming scientists that decided to break with tradition and
use something familiar to lable genes according to game characters or pop
icons.

~~~
vikramkr
They're tags and you end up with multiple names for the same gene. They might
as well be replaced with an arbitrary string of numbers- those are equally
easy to do a quick Google scholar or pubmed search with. You don't lose
anything of value by changing the name. In chemistry amd organic chemistry,
the proper IUPAC name of the compound also tells you its structure. It tells
you what the compound is in a real physical sense. If you go to someone thats
never heard of a compound d and you give them the full IUPAC name, they can
draw it. You can't change those names without losing meaning.

You go to someone who's never hears of the hedgehog signaling pathway and ask
them what SHH is, they're not gonna be able to give you much.

~~~
rolph
this does not negate the fact that these tags or lables actually do have
meaning they are not arbitrarily made up.

if you do not have any understanding of rust and you read the source code it
seems like meaningless tags but if you understand the idea of syntax you can
see there is some underlying principle to the combinations of characters.

being unfamiliar with the scheme doesnt make it meaningless hash. the failure
is not representation it is interpretation.

if we want to talk about absolutes the universal way to identify a gene is
with its locus or loci depending on the nature of the gene, before we get
there we use "tags" lables indexes to allow quick access to a particular
character in a local store of data.

the whole thing about revising the standards of a genes identifying name is
about getting the greatest benefit quickly with the least effort so we can get
back to science and away from blaming the tools for the lack of craftsmanship.

~~~
vikramkr
Since when is the universal way to identify a gene its locus? The locus is
just the position of a given gene in a particular chromosome. The function of
a gene is not exclusively tied to chromosomal location. If I take the sequence
for SHH and move it to a different location in the genome under the
appropriate promoters, it will still express sonic hedgehog. If I go and shove
some other gene sequence into where SHH is otherwise located on the genome,
that new insert will not serve as a ligand in the hedgehog signalling pathway.
The entire field of synthetic biology is predicated on moving bits of DNA to
new locations.

The point is that there is no concrete, universal, IUPAC style scheme to
naming these genes. I don't know what the comparison to a program language is
supposed to imply. The syntax when programming has meaning beyond a
meaningless hash. The syntax in gene naming barely has more meaning that a
meaningless hash, because at least occasionally its related to a phenotype or
function or keeps the same silly scheme for multiple members of the (e.g.
hedgehog) pathway. That's not much.

~~~
rolph
do you know why HH is named such ?

>>the function of a gene is not exclusively tied to chromosomal location<<
...epigenetics, ultrastructure of expression systems

"Since when is the universal way to identify a gene its locus" since molecular
genetics...and gene mapping.

this whole thing started with the idea that gene lables are arbitrary
meaningless tags by extension not a problem to change labels to stop a
spreadsheet program from mangling data. so now that we both know that genetic
labels or tags or whatever you prefer to insert, are far from arbitrary thats
the end of that.

~~~
vikramkr
You may note I used the word exclusively. TADs are hardly the end all be all
of gene function. Epigenetics has more to do with gene regulation than
function, and TADs aren't remotely close to the dominant force in epigenetics
either.

Gene mapping through genetic linkage maps hasnt really been a major part of
identifying and classifying genes for a while - not since sequencing became
accessible. You could consider a genome sequence the ultimate physical (not
linkage) map, but that's a couple decades out of fashion.

Edit: and im still not sure whether you're defending or attacking the naming
of the hh family of genes, but yes, I'm intimately familiar with the hedgehog
signaling pathway, and just because the hh loss of function makes fruit fly
embryos look spiky doesn't mean that the name is justified or meaningful. Amd
a basic problem in ontology is figuring out what that unifying theme should be
to name all genes around, its not as easy as in chemistry. But until that's
solved, the world loses nothing by changing shh to something less embarrassing
so patients don't have to deal with being told their cancer has driver
mutations in a pathway involving sonic the hedgehog.

Is your background in biology/molecular biology? I'm happy to share some links
to resources if you'd like to read up on the state of the art now, including
on gene ontology (GO). GO is an attempt to unify how genes are represented to
give more meaning to the representations. I have my doubts about how
successful these initiatives will be, but they're still really fascinating
stuff. The field has come a long way since Morgan and Sturtevant developed
that first linkage map back in 1913. And in epigenetics, there's so much more
than TADs. There's the usual methylation/history acetylation etc, but also
things like alternative splicing (perhaps not strictly epigenetics) that
significantly change what genes do and what sequence even gets translated from
the same locus. There's a lot of amazing stuff in happy to share some
resources that I think are a good place to get caught up on the state of the
art.

------
lvturner
I think another question is that why isn't there a viable alternative to excel
for this particular use case?

Is there, perhaps, a market for niche spreadsheet applications that serve one
particular market?

~~~
hairofadog
I’ve been wishing for something like this lately in the apple ecosystem: sort
of a “plain text” tabular editor that can handle large data sets. Numbers
feels like it’s designed for presentations more than data crunching, and Excel
(or at least the price of excel) would be overkill for my needs (which largely
consists of data cleanup, spot-checking, and quick calculations). I’d happily
pay for an IA-Writerish spreadsheet app.

Edit: or maybe Sublime Text is more what I mean. In any case I’m hoping
someone will pop in and say, “well why aren’t you Snappets!?” and then I’ll go
buy a copy of Snappets.

~~~
Legogris
Not exactly what you're asking for, but what about libreoffice?

~~~
hairofadog
It’s been a while since I fooled around with that. I’ll give it another look.

------
jpeloquin
Most of the reactions seem fall into three categories:

1\. "You (the individual) should stop using Excel". Good advice. As the
article mentions, though, if you ever send a tabular data file to someone else
to edit, there's a fair chance they will open it in Excel and corrupt the
data. Excel's CSV import/export cycle has even more traps than using xlsx
consistently, so in some sense individual avoidance of Excel makes the problem
worse. You could send SQLite files or somesuch so your colleagues can't work
with your data, I suppose.

2\. "You (all scientists / whatever non-programmer group) should stop using
Excel for data analysis". True, but getting everyone to change is hard. Both
top-down (thou shalt not use Excel!) and market-driven (Airtable!
LibreOffice!) haven't worked so far. Hopefully there will be more progress on
this front.

3\. "Scientists should learn to use their tools". Even if a user knows all the
scenarios in which Excel corrupts data, it cannot in general be stopped from
doing so. If you set column types up front, type every value manually (no
copy-paste) & save and check on each edit, and then never re-save the file
again or let anyone else re-save it, you're safe. Probably. I sort of expect a
bunch of replies saying this isn't safe at all. But if the file is edited, you
need to check the whole thing. Eventually an error will slip through.

The only way I've found to safely work with Excel files, or files that might
be edited in Excel, is to _put the data files under version control_ and
always check the diffs. Often someone else's import/export cycle will change
the quoting style on the whole file. In this case, the work needs to be
repeated so the diffs are clean and the changes can be verified. This is a
necessary but very obnoxious part of this process. Version control targeted at
non-programmers and non-text files would help a lot.

There are also tools to to flag potential data corruption errors [1], but
error detection in the absence of the original data won't be perfect. And if
it is manually run, it won't be run consistently. Most people don't have
continuous integration pipelines for their data.

Excel is very powerful and has a lot of potential to improve non-programmer
productivity. It is unfortunate that avoiding data corruption is such a
minefield. The situation could be significantly improved by anyone with a
decent product and a great marketing strategy.

[1] [http://maplab.imppc.org/truke/](http://maplab.imppc.org/truke/)

------
engineer_22
Human language is more flexible than computer language. We send a few emails,
agree on a new word, bingo bango problem solved. Easier than everyone learning
a new computer program, easier than teaching everyone how to avoid the errors
in Excel, just send a memo to all involved, and those who miss the memo will
slowly catch on anyway. MARCH1 -> MARCHF1 seems like a good change and will
not cause any confusion.

------
dekhn
In grad school I studied a gene which at the time was called Oct1 ("octamer
binding protein 1"). My main problem was literature searches, which often
found "OCT-1" (organic cation transporter-1). Genomic naming is a total mess,
I found it easier to just mentally compute the md5sum of a name, then memorize
the first few digits (only need about 8-10 hex digits).

~~~
DMLoeffe
> just mentally compute the md5sum of a name

What are you taking about lol

~~~
aqme28
> mentally compute

I think he means "remember"

~~~
dekhn
md5 is easy- you only have to remember a few hundred constants (they can also
be computed on the fly) and do basic bit operations on a small output (16
bytes).

------
curiousllama
Excel datetime functions are garbage. you have to explicitly, manually tell it
not to format things as dates, the most destructive data type, but it doesn't
act that way for other formats (e.g., $ doesn't turn things into accounting
format).

That said: every datetime function I've ever written is also garbage so...
glass houses, I guess?

~~~
bronzeage
Which is exactly why feeding all your input into a trash input parsing
function by default is a horrible idea. If dates handling was just 1 extra
button you click, the problem wouldn't exist. Overzealous default behavior,
the #1 sin of Microsoft.

------
ocdtrekkie
Another reason why not to "wait for an update for Excel" is that one of the
perks of Excel is it's very long-lived data formats. People may be using Excel
2003 to work with you on a file you created in Excel 2019. You have to go all
the way back to Excel 2000 to lose the ability to work on modern documents.

------
danielecook
I'd love to see a better alternative to Excel geared more towards scientists.
While it is true that R and python should be used for analysis, its often hard
to escape a tool like Excel for data entry or browsing datasets. It would be
nice if there was something out there that was faster and could handle larger
datasets, and had features designed to help with data entry and browsing.

I am envisioning something with double-entry data checking, enforcement of
data types, and rules to check validity. Equally important would be the
features it would lack - no tools to color or format text, no variable font
sizes, no embedded images, etc.

------
nayuki
[https://www.youtube.com/watch?v=yb2zkxHDfUE&t=696](https://www.youtube.com/watch?v=yb2zkxHDfUE&t=696)
"When Spreadsheets Attack!" by Stand-up Maths

------
arkanciscan
I wish we could rename addresses with "drive" in them so that Instapaper and
Pocket stop reading the abbreviation as "Doctor".

"There was a shooting today on The 800 block of MLK Doctor"

------
superjan
Tangent: I once tried to help a friend (non-coder) with his slow excel sheet.
He was using string search in excel to match virus DNA fragments. As you’d
expect, it was way more complicated than neccesary, and likely buggy. I
searched around for something better. But if you cant code, its a really big
step to ditch your excel and learn a Programming language. But there is no
such thing. It’s understandable that people stick with what they know, but the
results are awful.

------
mattmar96
If this kind of thing amuses you, check out the book Humble Pi by Matt Parker.
Its a collection of stories about maths errors. Lo and behold, most of them
happen in Excel.

------
jbaber
I don't see as pitiful that these non-technical* people can only use Excel.

I see as glorious that I'm able to get something close to real database tables
out of non-technical people as long as they're in Excel.

I once populated a pretty sophisticated database by giving a bunch of computer
semi-literate people Excel tables with only headers and simple instructions.
They respected foreign key constraints, etc. that I could never have directly
explained.

*in the software engineering way

------
24gttghh
Just change the data type of the column in question??

In the article it even mentions this and links to how to change it:

>Excel doesn’t offer the option to turn off this auto-formatting, and the only
way to avoid it is to change the data type for individual columns.

[https://www.youtube.com/watch?v=SppKiKIdCkI&feature=youtu.be](https://www.youtube.com/watch?v=SppKiKIdCkI&feature=youtu.be)

------
DarkWiiPlayer
Reading this just makes me incredibly sad. This feels so dumb and backwards...
Why do people even still use excell at all?

~~~
jpindar
Because their manager wants the data in Excel, because their managers used
Excel, because THEIR managers used Excel.

It's managers all the way down.

------
xenonite
I suppose it is not even enough to check and change "JANUARY1": in Spanish it
is "ENNERO1". So one needs to check the word in every language for which Excel
exists. And in every language that Excel supports in the future.

------
stjohnswarts
Who actually uses a tool for critical work and doesn't understand it enough to
get around such annoyances? I have a bunch of gripes about various languages I
use but I know that they are standard issue and nothing I can do about it so I
work around them.

------
pkphilip
Couldn't they have just used a prefix. Eg: G_MARCH1 instead of renaming the
whole set?

------
racl101
Well, it's funny to know that even the people doing the most cutting edge
stuff are also getting fucked in the rear by that piece of shit software
Excel.

Seriously, I hardly trust the program anymore. I always need to get the truth
from Python and Pandas.

------
Florin_Andrei
> _Bruford’s theory is that it’s simply not worth the trouble to change_

Not worth it for the Microsoft power brokers. Very much worth it for a lot of
people.

And so the world is sometimes shaped by the narrow money interests of a very
small group.

------
snow_mac
Why Doesn't Excel allow you to change date formatting? or column formatting?

~~~
emteycz
It does.

------
jaclaz
Or perhaps prepending them when typing with either a single quote, apex or
double quote (left/center/right align AND consider as Text), like all the rest
of Excel users have done for the last 25+ years?

~~~
psychometry
You don't understand the problem. Read the article.

~~~
proactivesvcs
I read the article and parent is correct in part. However, equally the article
touches on the fact that many people may use a spreadsheet and not all will be
correctly trained to use their complex tools properly, in order to ensure data
integrity.

I consider this a failing not in software or Excel, but in education and
professional standards. Scientists can be expected to understand and use their
tools correctly, since not doing so will taint their data.

~~~
conductr
Is the simple renaming of 27 genes a good solve? I think it is. Any time you
expect people to be trained, you’re planning for failure. Even a trained
person can make a mistake or forget to do the manual steps. This eliminates
the possibility.

~~~
proactivesvcs
I think it's the pragmatic approach, yes. I'm disappointed that the article
(and so many of the comments here) are not addressing the root cause. The
worst part is that not all of the people using it need to be trained - data
validity enforcement ought to solve this outright.

------
rolph
when i was undergrad there was an academically priced package for about 500$
you get the entire suite of word excel access powerpoint bells whistles and
nice rugged carrier box for all the tomes. i dont groom well i used it until
taking an assistants position and we used ...lotus, symphony,and norton
utilities and some funky in house coded version of a Dbase, and IC4 [some
inventory control utility]

Excel was nowhere in sight, and that was thirty years ago.

If someone wants a data set i give them a .csv it is then thier fault if they
plug it into excel.

------
nesarkvechnep
[https://thedailywtf.com/articles/another-immovable-
spreadshe...](https://thedailywtf.com/articles/another-immovable-spreadsheet)

------
phendrenad2
If the genomics community is really so big, they should just make their own
file format (.gcvs or something) and make an Excel plugin that treats it as
literal text cells.

------
husamia
This was right decision to make. Gene names aren’t as important as being able
to handle them properly. Names should be easily handled by software.

------
iaw
My favorite excel feature is when it randomly flips a 0 to an 8 or vice-versa
in really long integers. Took a long time to figure out what was going on with
that one.

------
justinsaccount
No one knows how to use excel.

[https://www.youtube.com/watch?v=0nbkaYsR94c](https://www.youtube.com/watch?v=0nbkaYsR94c)

------
komali2
In excel, can you set the a column to not format to dates? I think you can do
that in google docs for example. Why don't they do that?

~~~
LatteLazy
It's actually worse. If you expressly set the format of a column to text, then
put 01/01 in it, you still get Jan-01 and a field value of 46000ish.

~~~
komali2
That's horrible UX - why on earth is it that way?

------
ezekiel68
I have heard of life imitating art -- but I must say I find it a little
eyeroll-worthy to contemplate that otherwise bright grad students and other
researchers were unable to push through the struggle to understand and adapt
to the quirks of MS Excel that many of the rest of us (in other fields) have
needed to. I would have sooner expected Microsof to have added a "gene name"
cell formatter option than for this to have occurred.

------
Gatsky
Ah, it’s a shame they are doing this. Finding garbled gene names is a quick
way to pick a poor quality paper when reviewing.

------
yters
Why not an excel plugin to fix the problem?

------
alkonaut
Automatic type conversion is the root of so much pain. It doesn’t just apply
to C# or javascript apparently.

------
Metacelsus
This happened to me! Someone sent me a spreadsheet where OCT4 was changed to
"October 4"

------
patgrdj
So all that noise for people that don't know to change the data type of a
column?

------
jtdev
Excel is the 2020s technological progress stunting equivalent of the fax
machine.

------
unnouinceput
Quote: "There’s no easy fix, either. Excel doesn’t offer the option to turn
off this auto-formatting..."

What kind of loopy Excel variant the article's author is using? It's right
there in settings, you can easy stop auto-formatting. Yes, default
installation has it on, but it can be turned off.

~~~
mplanchard
From the article:

> Even then, a scientist might fix their own data, but as soon as someone else
> opens the same spreadsheet in Excel without thinking, errors will be
> introduced all over again.

------
danso
The article links to a paywalled Nature article [0], titled "Guidelines for
human gene nomenclature". I googled that title to find a free and open
version, and came across what seems to be the official page for HGNC's
(extensive) naming guidelines [1], though what's currently published seems to
be an older standard, originally published in 2002 [2]

[0]
[https://www.nature.com/articles/s41588-020-0669-3](https://www.nature.com/articles/s41588-020-0669-3)

[1]
[https://www.genenames.org/about/guidelines/](https://www.genenames.org/about/guidelines/)

[2]
[https://pubmed.ncbi.nlm.nih.gov/11944974/](https://pubmed.ncbi.nlm.nih.gov/11944974/)

------
angel_j
There must be a net benefit to using this software, right? Otherwise somebody
in charge would make the decision to use something else, right? It’s not like
they are forced to use MS excel, right?

------
xchip
Why are they usin excel on the first place?

------
ynodir
Havent read the article, but come on. Tools should serve us, not the other way
around. It'd be enough just to change the affected cells' type.

~~~
jmkjaer
> Havent read the article

Please do. The article has a paragraph that addresses this.

------
lunchladydoris
Or perhaps stop using Excel?

~~~
DethNinja
What’s the alternative?

~~~
hutzlibu
Well, everybody should set up their own, custom script powered favourite DB of
course.

Ok, seriously, it would be nice to te able to recommend a alternative. I heard
LibreOffice Calc is not really equivalent?

~~~
welterde
python+pandas [1] is probably quite an easy choice for most common things
people would use excel for. And being in the python ecosystem offers lots of
libraries for statistics, machine learning, signal analysis, plotting etc.

[1]
[https://pandas.pydata.org/docs/user_guide/10min.html](https://pandas.pydata.org/docs/user_guide/10min.html)

~~~
vxNsr
People like real time feedback it's why word is more popular than tex, all
these tech folks pushing code solutions to wysiwyg software issues are missing
the point.

~~~
welterde
Ipython/jupyter notebooks fill that niche quite nicely.

------
vmchale
Load-bearing bug :p

------
takluyver
Programmers love to complain about Excel, but we happily use YAML, which has
essentially the same footgun: certain strings (like 'on') need quoting if you
don't want it to interpret them as something else.

~~~
fabian2k
We programmers also complain a lot about YAML, though maybe not enough and not
as much as about Excel. But some YAML footguns like the country code for
Norway being interpreted as a boolean are reasonably famous, and I think
widely regarded as a bad idea.

~~~
takluyver
To be fair, there are few things programmers don't complain about. But YAML
still seems to be very popular as a configuration format for new tools, e.g.
CI services.

~~~
3pt14159
I wish everyone would just switch to TOML. It's sane _and_ readable.

~~~
t-writescode
Or json!!! Why did we leave json???

~~~
RcouF1uZ4gsC
No comments in standard json.

~~~
3pt14159
And you have to choose between readability and whitespace. Doesn't matter for
most usecases, but you can't blindly write it out. You have to choose ahead of
time:

1\. Potentially shoot yourself in the foot with tons of extra whitespace. 2\.
Unreadable mess.

TOML is better. You can choose whitespace if you want, but you can always go
back to dot notation. Though I (sadly) agree JSON is better than YAML. One of
the few times I've changed my mind from A to B and then back to A in tech.

------
randompwd
So if someone stopped at the byline, they would have thought it was 'excel
dumb' issue:

> Sometimes it’s easier to rewrite genetics than update Excel

rather than y'know, scientists cant properly use a tool they're using.

bad journalist.

edit: it even seems some of the commenters here stopped at the byline. quelle
surprise

------
tomp
_> For example, HECA used to have the gene name ‘headcase homolog
(Drosophila),’ named after the equivalent gene in fruit fly, but we changed it
to ‘hdc homolog, cell cycle regulator’ to avoid potential offense.”_

Ugh... everything is politics now.

~~~
krastanov
As the sentence before the one you quoted explains, this is not done for
politics. It is done because clinicians have to be taken seriously by the
parents of the child having a mutation in that gene. Saying "your child has a
headcase mutation" ends up causing defensive reactions instead of discussing
treatment. Maybe that is an irrational reaction on the parents' part, but not
everyone is super rational the first moment they learn their child has an
illness.

