Hacker News
Scientists rename human genes to stop MS Excel from misreading them as dates (theverge.com)
621 points by virde 85 days ago | 506 comments

It's honestly amazing that Excel hasn't fixed this issue. It pisses off an enormous number of users, especially in basically any non-US country (even if 01/02 is a date, it isn't the second of January in most of the world...)

Excel is stuck. It can't change this behavior if it cares about backwards compatibility. There are uncountably many recorded macros and all sorts of scripts and data tools everywhere, that implicitly and inadvertently depend on behavior like this. Like Javascript's warts, you have to declare these as features and live with them.

Of course Microsoft has thought about this; it's silly to think they never have and we're smarter. Yes, this behavior is annoying and US-centric, and they know that. But they also know that breaking compatibility with all these scripts and macros would be the worse problem in the larger picture. That picture is huge: it would be on the order of the Y2K effort to modify every piece of code anywhere that's ever touched Excel dates.

Would you want Excel to introduce a "quirks mode" to handle this sort of thing?

imagine if they worked with the industry to build in a set of relatively agreed upon defaults, like turning autoformat off for dates that matched certain criteria. then geneticists could activate the geneticist profile and excel works as they'd expect. if they encountered a file that broke, they could turn that profile off.

that's a fairly sophisticated solution, but it's just one of many potential approaches to fixing the problem.

It would be way less effort to have the same exact program under a completely new name.

MS Excel

MS (names are harder than I thought / remembered) Matrix

In the back-end they'd share the same sane core and 99% of the code, but Excel would add in the cruft / warts / compatibility quirks.

And they could do even better technically and have "MS Excel" be implemented on top of MS Matrix using an extension framework. That'd really firewall the cruft and encourage the core program to be flexible.

Less effort and they can charge twice for the same product, genius!

Only everybody knows what excel is, and no one knows Matrix.

And unfortunately my guess is that most Excel users don't surf HN or other geek sites.

> It can't change this behavior

They could easily make it an optional behaviour.

They could even have it on by default, or rather, have “Excel NG/Matrix” be an optional no-quirk behaviour regimen.

They could have a scripting environment variable that is “quirkMode=ON” by default and would maintain backwards compatibility at the small expense of needing to specify sane behaviour as an exceptional circumstance: just another line of boilerplate.

There’s a lot of things they could do. They’ve no doubt considered all of them and then some, and yet they’ve decided to do almost nothing. I suppose that says something about the disconnect between how we and they perceive their incentives, but I’m not sure what it is.

Exactly, add a checkbox in the settings menu and the problem is solved for everyone.

> "Exactly, add a checkbox in the settings menu and the problem is solved for everyone."

A million checkboxes in the settings menu later, people will be griping that "Excel is terrible for having all these crazy settings to deal with!"

In other words, a quirks mode.

Excel doesn’t care about backwards compatibility. Random things break every single major update. If only the breaks were because they were fixing these extremely annoying “features”, it would be understandable, but no, the quirks persist.

I don't even know why anyone still uses this notation in today's connected world. 2020-08-06 is ideal. Russian standard (06.08.2020) is uglier (digits are not sorted by significance) but still unambiguous. But that slash horror? Why would anyone continue inflicting this on themselves and their peers!?

yes, YYYY-MM-DD HH:MM:SS is the only proper format.

It is sorted from the greatest unit (year) to the smallest unit (second). If you treat them as text and sort alphabetically they still get sorted from oldest to newest.

Other formats don't sort properly.

writing dd.mm.yyyy is like writing time ss:mm:hh

writing mm/dd/yyyy is like writing time mm/ss/hh
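
The sortability claim is easy to check; a quick Python sketch (dates chosen arbitrarily for illustration):

```python
# ISO-style strings sort chronologically even when treated as plain text.
dates_iso = ["2021-01-15", "2020-12-31", "2020-08-06"]
assert sorted(dates_iso) == ["2020-08-06", "2020-12-31", "2021-01-15"]

# dd.mm.yyyy strings do not: lexicographic order compares days first,
# so 15.01.2021 lands before 31.12.2020, which is chronologically wrong.
dates_dmy = ["15.01.2021", "31.12.2020", "06.08.2020"]
assert sorted(dates_dmy) == ["06.08.2020", "15.01.2021", "31.12.2020"]
```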

If I really have to put YYYY at the end of the date I use the 'dd-mmm-yyyy' format which excel translates based on client locale:

- 13-mar-2020 in enUS

- 13-bře-2020 in czech

I agree with everything you are saying about sorting and I frequently use YYYY-MM-DD, however I do want to make one point: DD.MM.YYYY does make some human sense as a date format since the year is very rarely important (and if the year is actually important often the day is so unimportant that it can be left out), the month is sometimes important and the day is often the most relevant piece of information.

So for any kind of practical planning that day first order makes some sense, but I wouldn’t die on a hill for it.

I see where you are coming from, but I would argue that if you omit a part, it's not a date anymore.

Yes! It's not the first time I posted a rant on the net and found myself guilty of the sin the very next day. This time, a web app of mine has big tables full of dates, and I didn't quickly figure out a way to show the full date without too much clutter. So, I used "dd.mm" and postponed a finer design till next year. Guess what: this UI element hasn't changed in three years, and today I finally stumbled on a bunch of records whose dates really can't be deduced from context.

It’s a cultural thing: if I’m not mistaken the traditional Chinese system is Y-M-D and always has been.

I suggest a literal T instead of a space, and don’t forget the timezone offset (or a literal Z).

ISO 8601 is great.

I agree with and independently also use the parent's modification to ISO 8601 in that respect, though I do frequently call the format string that produces that output something like iso8601mod. (The modifications: emit a space rather than a T, and (if included) use the email-like +/-OFFSET to indicate the current timezone offset rather than specify a timezone. No offset implies UTC for system times and local time for human-facing times.)

Well it is but the T is ugly :/

I would argue that its use goes beyond sorting and includes people making mental estimates about time intervals. Having some sort of sensible order helps here, which I'm guessing is YMD for left to right languages and DMY for right to left languages, since that puts the most significant digits first.

No, it does not take into account timezones.

We have ISO 8601, no need to reinvent the wheel as a square.

https://xkcd.com/1179/ (date only)

This XKCD is incorrect, 20130227 is also a valid ISO 8601 date.
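
For what it's worth, the basic (separator-free) form parses unambiguously because the field widths are fixed; a minimal Python check:

```python
from datetime import datetime

# 20130227 is ISO 8601 "basic format": same fields, separators omitted.
d = datetime.strptime("20130227", "%Y%m%d").date()
assert d.isoformat() == "2013-02-27"
```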

agreed fully

How is "06.08.2020" unambiguous?

There are no conventions that write a date as mm.dd.yyyy, so if you encounter a date in that format, it is safe to assume dd.mm.yyyy.

Contrast this with the slash or dash notation, where both mm/dd/yyyy and dd/mm/yyyy are prevalent in the world. But 2020-08-06 is unambiguous in the same way that 06.08.2020 is: the only convention in common use is yyyy-mm-dd.

> There are no conventions that write a date as mm.dd.yyyy

While slash is the most common, use of hyphens and dots as separators for US-style dates is not at all unheard of in the wild for manually formatted dates. The fact that it doesn't show up in lists of national “preferred formats” and doesn't tend to be commonly implemented as a prebaked format in software doesn't mean it's not a real thing people see and will interpret dates they see in light of.

How is that safe to assume??? You can't assume anything, why not ask the source of the data? The other way is to attempt to confirm which order by looking for first 2 digits > 12, but that info might not be available depending on the data.
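
The digits-greater-than-12 check mentioned here can be sketched as a small heuristic; note it is only a partial disambiguator, and the function name and behavior are illustrative, not from any standard library:

```python
def guess_order(datestr: str):
    """Guess whether 'a/b' is day-first or month-first.

    Returns 'DM', 'MD', or None when the digits alone can't tell.
    """
    a, b = (int(part) for part in datestr.split("/"))
    if a > 12 and b <= 12:
        return "DM"   # the first field can only be a day
    if b > 12 and a <= 12:
        return "MD"   # the second field can only be a day
    return None       # e.g. 01/02: genuinely ambiguous

assert guess_order("13/02") == "DM"
assert guess_order("02/13") == "MD"
assert guess_order("01/02") is None   # ask the data source instead
```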

Unfortunately, this isn’t true in Canada. MDY is very common, especially in the English parts. See:


MDY separated by dots? The link you gave does not seem to support that.

Separating dates with dots (DD.MM.YYYY) is what’s taught in school in Germany.

So, yeah, there is a convention to write dates exactly that way.

Read the comment again...

It's unambiguous for everybody except Americans with their weird switching.

Does anyone use mm.dd.yyyy?

Germany does. Probably the majority of Europe too

I think you misread: most of mainland Europe including Germany uses dd.mm.yyyy or something else that is reasonably sane even if it isn't sortable like yyyy-mm-dd. Or am I wrong on this? Norway at least uses dd.mm.yyyy and never mm/dd or mm.dd.yyyy.

Afaik Norway also uses the sane format: DD.MM.YYYY

See parent source: https://en.wikipedia.org/wiki/Date_format_by_country

Edit: of course. Too late to edit, but if you read my post, that was what I meant.

> The format dd.mm.yyyy using dots is the traditional German date format.

> Since 1996-05-01, the international format yyyy-mm-dd has become the official standard date format


Schools still teach DD.MM.YYYY. That’s what’s relevant because that’s what people actually widely use.

Shouldn’t Americans be very familiar with the idea that official norms (the metric system) play no role if the people don’t want to use them?

I understand - it seems I was too optimistic to read that the use of yyyy-mm-dd has become "the official standard" in Germany, without considering the popular usage and convention under discussion.

I wish more people complied with ISO 8601.

Even better, wish everyone would standardize on:

* YYYYMMDD for dates

* YYYY-MM for months

* YYYYMMDDTHHMM for time where T is the capital letter T. Two additional digits can be added for seconds and then as many additional digits as needed for precision.


[Edited for format]


Because it's meant for human consumption. Day and month is usually relevant, while year is less so, so let's put it last.

> 2020-08-06

I've just switched to 20200806. Let it be an int and be done with it.

A beautiful property of using ints to represent dates is that you can manipulate them using simple arithmetic. For example, if you want to refer to noon-time on 20200806, you can simply divide the day in two and write it as 10100403. An elegant, DRY solution to a perennial problem.

I don’t get this? What does 10100403 mean?

The poster is gently teasing the idea that integers like 20200806 are a reasonable way to represent dates. Dividing 20200806 by 2, we get 10100403, representing the half-way point of the day (noon-time). Of course it's nonsense.

I’ve been had

It’s beautiful. Just mod 7 to find the day of the week.

Excellent example. Honestly, I have no idea why the concept hasn't caught on yet.

(I'd love to discuss this further... I'll be back on HN later tonight, sometime around 16834005.)

Human readability sometimes matters.

The integer 20200806 is chosen solely for the purpose of human readability. If you actually wanted to store a date as an integer, you'd use the Julian day.

what human can't parse that? Sure, it might take half a second longer, but it's hardly confusing.

For those of us whose professional lives sometimes require staring at directory listings full of filenames like "output-200003021342"... for the love of God, please just put the dashes in.
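
Both forms are machine-parseable; the dashes cost the parser nothing and save the human a lot. A sketch with a made-up filename:

```python
from datetime import datetime

# Compact timestamp: fixed field widths make it parseable, not readable.
stamp = datetime.strptime("output-200003021342", "output-%Y%m%d%H%M")
assert stamp == datetime(2000, 3, 2, 13, 42)

# The dashed variant parses just as easily and is far kinder to the eye.
dashed = datetime.strptime("output-2000-03-02-1342", "output-%Y-%m-%d-%H%M")
assert dashed == stamp
```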

I have an infinite number of dashes, I'll send you a lifetime supply!


There are dozens of auto-formatting "features" that piss me off each and every day across all of Microsoft's products.

My pet hate:

- use the mouse to select precisely the part of the word/URL/string you want for Ctrl-C purposes

- it autocorrects your selection to the whole word including a CrLf if it is nearby.


Reading this comment caused me frustration.

The only workaround I've found to be effective is selecting a few letters on one end of your intended selection and using ctrl+arrows to precisely select.

You turn them off, forget about it, and then auto-update turns them all back on for you!

Or you're on a different machine that doesn't have it and you have to decide if digging through the menus to turn them off is worse than manually undoing all of the "fixes" for your message.

Have you ever posted a code snippet into a default install of Outlook?

usually you can ctrl-z it, but it gets annoying if you're trying to type a small code snippet in MS Word or similar.

Even the default text editor on Mac OSX attempts to "fix" your typing. It's not just Microsoft.

I personally dislike this behavior in Excel, as I'm accustomed to working with plaintext data and using pandas/strftime to get datetimes as I need them. But I also figure that my situation as a technical user is likely in the minority, compared to people who are using spreadsheets every day to do manual data entry.

That said, I completely agree with the tangential issue of U.S. dates being misleadingly different in format compared to non-U.S. Always an issue when teaching data/spreadsheets to a class with at least one non-American – but also a good reason to teach them the value of ISO8601 :)

It's not just about data entry. Entering dates properly in excel allows you to calculate with dates.

If you want to know what date and time it will be 476 days and 12 hours from today you can just do =TODAY()+476.5

This is very useful when the requirement is to 'happen every 10 days' or you're looking for '30 days in the past'
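
Under the hood this works because Excel stores date-times as a day count, so a fractional day is a number of hours. The equivalent arithmetic in Python (start date chosen arbitrarily as a stand-in for TODAY()):

```python
from datetime import datetime, timedelta

start = datetime(2020, 8, 6)              # stand-in for TODAY()
later = start + timedelta(days=476.5)     # 476 days and 12 hours
assert later == datetime(2021, 11, 25, 12, 0)
```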

Excel has a very low barrier of entry compared to pandas while boasting an immense amount of power and features. I think it was not an easy challenge to keep it going over the decades.

> US dates

I'm a non-US person and I fixed this simply by changing my locale to enUS everywhere. It has an added benefit of not having weird translations of everything in random tools, and Excel functions not being localized to their cringy versions in my native tongue.

For Germany this converts to the first of Feb, and this way of entering dates is just so convenient, as you can type the whole date just using the keys from the num block. We do actually support this format in our app for date entry, simply because it's so convenient.

The usual way to write the date Feb 1st in Germany (and I think anywhere but the US) is 01.02 or 1.2., but not 01/02. And the “.” is part of the num block as well.

Also, data being interpreted differently based on region/language settings is a sure way to end up with bugs, so I think it's a terrible thing that Excel does this.

I'd say that in general, we need to get rid of locales in computing. They're an unending source of bugs and annoyances.

It's not "techies know best" whining. We've been increasingly computerizing the economy for the past 50+ years; it's past time for societies to adapt to that reality, instead of wasting time and money on dealing with dozens of date formats, number separators, currency notations, etc.

We are going to get rid of locales as much as we are going to get rid of local languages, tradition and culture.

The US switched to the metric system in the 60s and there is a shitload of benefits in doing so. Has it worked? Not really. Still using the old system everywhere.

So the solution cannot be to get rid of locales, but to actually use them properly:

- Always use the right locale for the job (the OS or browser should be the oracle for the right locale to use)

- Read the data in the user defined locale

- Store the data in some canonical form (e.g. store numbers as number types instead of strings, use ISO-8601 for dates, ...)

- Write the data to the user defined locale

If you program in C++/Java and all your libraries handle locale correctly, it's doable. A bit more difficult in languages and libraries written by people who aren't even aware of the concept - of which there's plenty. Impossible if you're dealing with customers or managers who don't understand the concept of "data type". The latter is, for example, why the problem from TFA exists in the first place - i.e. Excel being stringly typed, and eagerly converting data into your system locale.
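
A minimal sketch of that read/store/write discipline, with a hypothetical two-entry separator table standing in for real locale data (a real program would get this from the OS or a library such as ICU):

```python
from datetime import date

# Hypothetical, hard-coded locale table for illustration only.
DECIMAL_SEP = {"en_US": ".", "de_DE": ","}

def read_number(text: str, loc: str) -> float:
    """Parse input written in the user's locale into a canonical float."""
    return float(text.replace(DECIMAL_SEP[loc], "."))

def write_number(value: float, loc: str) -> str:
    """Render the canonical value back in the user's locale."""
    return repr(value).replace(".", DECIMAL_SEP[loc])

def store_date(d: date) -> str:
    """Canonical storage form for dates: ISO 8601."""
    return d.isoformat()

# A German user types "123,456"; canonically it is the number 123.456.
n = read_number("123,456", "de_DE")
assert n == 123.456
assert write_number(n, "de_DE") == "123,456"
assert store_date(date(2020, 8, 6)) == "2020-08-06"
```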

And then, there's the problem of users - whatever locale they have set in their system were most likely not set by them, and are often misaligned with what they're naturally using.

There's a lot of bugs and issues happening to people every day that could be removed if major software vendors said, "sorry, the only allowed input format for date is ISO8601, and dot is the only valid decimal separator; take it or leave it".

The problem is that having a near system-wide locale setting is a bad idea; it necessarily causes friction because it is far less granular than it should be. Ideally, the locale should be a property of some piece of content, if that content necessitates formatting or interpreting some underlying data whose presentation is locale-dependent, otherwise absent. The locale should not be determined by the OS or the browser since it would be reasonable for a user to want to use different locales for different applications and web pages (or pieces of content within web pages).

Furthermore, I would argue that the very notion of user locale based localization by default is misguided, since it is fundamentally no different from automatic translation of content based on the users' UI language setting. It is a form of misrepresenting the content, though with things like dates you usually don't lose much information in the process.

What computer programmer can't handle locales? I mean that's 1st year comp programming 101 stuff. I've never had an issue. Sure I've had to fix other's work but it was usually old stuff that never knew it would be leaving the factory floor where it was being originally written, but still.

Almost all of them, judging by software I've been dealing with on both Windows and Linux systems, as well as the web. Even the big companies full of smart people get it wrong - as evidenced by Excel, and the problems like those described in TFA.

The problem doesn't even have a correct solution, because the locale settings in the OS and the browser aren't often set by end users (a regular person probably doesn't even know that they exist) and the defaults depend on many random factors (like which store you bought your computer with preinstalled Windows from).

a) You write "1/2" and Excel replaces it with "01. Feb" by default if using de-DE as Locale (just checked it) [This auto replacement is also the basis of the linked article]

b) There is no "." on the num-pad for German keyboard layouts, they have a "," the German decimal separator.

Damn, you’re right about the comma on German num blocks. I’ve avoided anything but US keyboards for over a decade. Local keyboards are just another pain, like OS-wide locales.

If you write 1/2 just put a ' in front of it.

Not if I want it to be converted to a date ;-)

Maybe American dates are only foisted on us English speakers? Yet another complexity to be unravelled...

As a non-American English speaker, Excel would also format that as 1 Feb for me. It has been pretty good at picking up your locale for a long time, and is also reasonably trivial to change.

>It's honestly amazing that Excel hasn't fixed this issue.

I'm surprised there hasn't been a dotfile option added yet

Maybe a registry edit?

Sure, open the gates to hell.

Why hasn’t the field standardized around another tool? Is there something in Excel’s plugin ecosystem that LibreOffice couldn’t replace, where they could just add the options they need? My guess is the field is mostly lacking people interested in working on tooling like this?

It mostly has; most serious work is done in R and Python, and there are a lot of supporting tools, documentation, and useful packages for both languages. It's not like the statistical analysis in your average paper was actually done in Excel.

However, Excel remains a nice tool for "I'll just look at this CSV with the final results from this analysis, sort it by correlation, and see if any of the usual suspects are up top". And if the next step is "yeah, that looks fine - I'll just copy the top 100 genes into this convenient GUI pathway analysis tool", you're suddenly exposed to whatever Excel did to your data.

And as for "why not libreoffice", most researchers I personally run into are strong molecular biologists who've learned a subset of R for their uses; they're not really likely to go out and find libreoffice on their own. Besides, the writing process for papers includes sending drafts and spreadsheets to doctors and pathologists and editors, who are probably on hospital computers with a short whitelist of programs ... and I don't really want to debug subtle compatibility issues in the sort of garbage fire those documents can turn into.

Because actual quality (usability of research, open access, reproducibility, data availability, sane exposition/introduction, proper methods, documentation and code) is not as important as producing a fancy manuscript.

Journals don't care, referees/reviewers might care, but they are unpaid and they usually don't want to rock the boat that much.

But you can select from dozens of formats in the cell setup? Not sure I understand the issue?

You can select any format you like. Then, if you put in 01/01, excel ignores your format choice, displays Jan-01 and puts 46000 in the actual cell. If you re-select the text format, you get "46000". The 01/01 you actually entered is gone forever.

Now imagine you load a file with 1000s of values in the form AB/CD, many are trashed. If you save the file, you've lost the original data.

All because it might save some data entry drone 5 seconds to expressly make something a date.

There are then also issues about whether "01/02", assuming it actually is a date, should be the second of Jan (US) or the first of Feb (EU, UK, North America ex US, Africa and Pacific regions). In many places, based on some arbitrary and well-hidden settings, you will get the wrong result.

I honestly think the only reason excel does this is to force you to use excel formats and make it harder to work with non-MS products...

Just type '01/02 or ="01/02" and get on with your life

Better to learn to properly import data via CSV. If you just double click, you get the Excel helpfulness. If you choose open, you can specify the column formats. I load dates and zip codes as text and then format the columns correctly AFTER everything loads. VBA macros are wonderful if you have the same format over and over again. If I had a nickel for every time Excel munged a date or stripped a leading zero from a zip code, I'd be rich beyond dreams.
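
The same keep-everything-as-text approach works outside Excel too; Python's csv module does no type guessing, so fragile values survive untouched (sample data invented):

```python
import csv
import io

raw = "gene,zip\nSEPT1,01234\nMARCH1,02139\n"
rows = list(csv.reader(io.StringIO(raw)))

# Every field is still a plain string: no date coercion, no lost zeros.
assert rows[1] == ["SEPT1", "01234"]
assert rows[2] == ["MARCH1", "02139"]
```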

I've always been surprised at how bad Excel is at importing CSV files. The solution feels like it was hacked together by an intern over the weekend even though this is an incredibly common use case.

I had to build a very basic CSV->XLS tool once because the built-in CSV import kept screwing up. Admittedly the CSV files were slightly mis-formatted in places, but that wasn't the only headache with the import.

It's really good these days compared to a few years back. You just need to use the import feature in the Data ribbon menu instead of just double clicking on the csv file. This way you get to use the "Power Query" tool that allows you to import a csv file a lot more precisely.

I think you've failed to grasp the issue; it's way more than just things I type, sadly. It prevents the use of CSV files for many purposes. That stops Excel users interacting with the wider ecosystem of file/data processing systems...

I think the reasoning is people type in "01/02" to actually mean Jan 2nd more often than they import CSVs into Excel. So Microsoft favored one over the other

I don't see a problem with that, as it's not undefined behavior – you know exactly how Excel will treat those values

The problem is they destroy the original data.

If they want rawdata="01/02" to display as "Jan-02" (or whatever), that's annoying but I can fix it. But they also delete the raw data and replace it with "43862". Reformatting cannot fix that and it is Excel that has chosen to actively break it.

They're not even self-consistent with this: if I carefully make sure the data is correct (use "'01/02"), then save as CSV and load the same file, it breaks. What sort of program can't save/load without losing data!?

That's without touching why they need to interfere, or whether the US standard is the correct one to use, or the fact that Excel is nowhere near this aggressive with any other data format.

(edited to correct feb to jan)
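
For reference, the serial numbers being discussed count days from Excel's epoch of 1899-12-30 (the odd epoch compensates for the historical 1900 leap-year bug, so the arithmetic holds for all modern dates). A quick sketch:

```python
from datetime import date, timedelta

EXCEL_EPOCH = date(1899, 12, 30)

def serial_to_date(serial: int) -> date:
    """Convert an Excel day serial to a calendar date (valid from 1900-03-01 on)."""
    return EXCEL_EPOCH + timedelta(days=serial)

# The 43862 mentioned above really is 1 February 2020.
assert serial_to_date(43862) == date(2020, 2, 1)
```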

"01/02" does not translate to Jan 2nd in most of the world, because DMY is much more popular than MDY.

Excel is biased for US users.

> "01/02" does not translate to Jan 2nd in most of the world, because DMY is much more popular than MDY.

This doesn't follow. There's no year in 01/02. MD is more popular than DM.

In which country do people use DMY, but also MD? This just sounds wrong, and my (admittedly limited) sampling shows that when DMY is abbreviated it becomes DM. 01/02 is still the 1st of February everywhere I went except in the US.

Who said someone used DMY combined with MD? DMY/MDY is not an exhaustive list of date formats, you know. It is not even an exhaustive list of "incredibly common date formats".

You said “MD is more popular than DM”. In most countries I’ve been to DM and DMY are used more or less interchangeably (with different separators). However, I have never seen anyone use MD, though I assume it happens in the US.

I made no allusion to DMY and MDY being the only possibilities, because that would be ridiculous.

> I made no allusion to DMY and MDY being the only possibilities, because that would be ridiculous.

You did indeed. How else could you interpret this exchange?

>>> "01/02" does not translate to Jan 2nd in most of the world, because DMY is much more popular than MDY.

>> There's no year in 01/02. MD is more popular than DM.

> In which country do people use DMY, but also MD?

The only way to have that question make any sense at all is to assume that MDY and DMY are the only options. That certainly is ridiculous, but I'm not the one who said it.

For fuck’s sake. It’s simple, really: DMY is much more popular than MDY in almost all of Europe, Africa, South America, and large parts of Asia; and DM is a common short form of DMY when the year is not ambiguous. So I highly doubt your assertion that MD is more popular than DM, and in fact I am certain that 1 February is a much more common way of parsing 01/02 than 2 January.

Note that none of these points require the absence of any other way of writing dates. You could indeed argue, preferably with examples and not hypotheticals, that some locales exist in which MD would be the natural way, and that they outweigh the others.

Now you can move the goalposts once more if you really need to. It really is tedious.

I think it's more common that people mean the first of February. Only the US is backwards.

Is that January 2, February 1, January 2002, or January 1902?

In Excel? It's January 2nd of the current year, which again is probably the most common use

In your copy of Excel it is January 2nd. In most other countries, it is February 1st.

But here's Excel's trick: you type in 01/02, Excel interprets that as January 2nd and switches it to the underlying OLE date format (some number in the 40000s). Ship that Excel file across the ocean, where they would write Jan 2 as "02/01", and it shows them the date as "02/01". Excel uses your local date format preferences.

This is one reason why it is important for Excel to convert the raw input to another format. I'd probably prefer that it didn't touch it for CSV files, but I get it for .xlsx

That's because it's not an issue at all.

It's people using a tool without knowing said tool. You can disable auto-formatting (or even better yet - set the column data type) with a simple click.

Even if you correctly format a column, Excel will ignore that if it sees something that looks like a date. That's part of the problem: this isn't just automatic formatting, it's very aggressive and hard to turn off. Plus you cannot revert the changes Excel has made: 01/02 is converted to an int in the 46000 range, so reformatting the cell doesn't get the original input back, it just lands you with a bunch of ints.

Back in the day I actually wrote a function that would undo this for some sheets that people kept breaking...

This is not auto-formatting, it's deeper than that. It is not actually possible to disable Excel's date recognition, although you can re-change the format of a cell after the fact.

Excel is from an era when programs still catered to power users. Tools were made to have learning curves, ideally not particularly steep curves, but curves nevertheless. It wasn't expected that users would hit the app running, intuiting everything there was to know about the program in their first minute of using it.

The result is a rich deep program that users can grow into, rather than a shallow trivial program that optimizes for the noob experience and leaves power users out in the cold.

Auto-formatting seems like more of a noob feature than a power user feature. Lacking control over when it happens is definitely not a power user feature.

Treating SEPT1 as a date is behavior that will be correct for the vast majority of users, be they noob or experienced, and only incorrect for a very tiny minority of users who are doing things related to genetics. This sort of auto-completion feature is orthogonal to the noob/power user axis, except insofar as the user is expected to know how to circumvent this behavior if/when they need to.

The power user behavior would be to only do the conversion on cells that are typed as DATE datatype. But that would catch the noobs who don't know to set the cell type.

Honestly though, dates make more sense month first: that way, if you are trying to discover which comes first, the 10th of February or the 10th of January, you only have to look at the first field (01/10 vs 02/10) instead of the last (10/01 vs 10/02). Logically, the broadest data item should go first.

Which is why most systems are yyyy/mm/dd when this matters.

When does it matter to have month and day, but not year? About the one time it does is for birthdays. The web is full of articles dated with only day and month, neglecting the year.

Excel isn't the program of choice for most scientists and computational biologists, who typically use R, python, or command line tools. However, we often get data from other scientists or reanalyze data from other groups that can have these errors. It's so frequent of a problem that there are scientific papers about it [1].

[1] https://genomebiology.biomedcentral.com/articles/10.1186/s13...

This ignores the reality that one will ultimately have to interact with people who have no understanding of any of these tools. Unless you happen to work in a pure computational biology group, one _will_ have to interact with lab workers, biologists with no training in (or understanding of) R or Python, doctors, etc. All these people will know Excel.

That's why I'm in favor of this change in nomenclature.

IMO the change is good, but is a case of detected vs undetected.

Recently I was working with some colleagues, me being the computer-savvy one and them the lab people. I sent them some data in CSV that, when opened in Excel, turned 123.456 into 123456 (it was a problem with locales, some people using "," as decimal separator and some using "."). We noticed because the values should be between 0 and 1000. But what if the column could be between 0 and 1000000? A small quantity of numbers bumped up by three orders of magnitude could fly under the radar and distort further measurements. And the error is undetectable forever once published.

I like the programming-language approach better: look, this is how you write a string, this is a char, this is a float, and this is an integer. "2020-08-04" is a string until you ask me to turn it into a date. "SEPT1" is a string, and you'll have to do quite the gymnastics to make me understand it as date(2020, 9, 1). Do you like "," or "." as the thousands separator? Then we first turn the number into a string and format it, but the original number is kept.
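In Python terms (a toy illustration of that opt-in behavior, not anyone's actual pipeline), it looks like this:

```python
from datetime import date, datetime

cell = "2020-08-04"                                   # stays a plain string...
parsed = datetime.strptime(cell, "%Y-%m-%d").date()   # ...until explicitly parsed

gene = "SEPT1"                                        # never silently becomes a date
try:
    datetime.strptime(gene, "%b%d")                   # even forcing it fails loudly
    gene_as_date = True
except ValueError:
    gene_as_date = False                              # parsing is opt-in and noisy
```

Conversion only happens when asked for, and an unparseable value raises an error rather than being quietly reinterpreted.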

Excel technically has a type system where you can change the type of value a cell holds. In my experience it is sometimes difficult to convince Excel to actually change a cell's data type, and doing so can often change the underlying data as well. Personally I avoid Excel when I can, because its quirks are just too frustrating. But it certainly has its uses.

I had to use Excel for 100% of my publications and posters in medical services research. Either the data is in Excel or Excel is a tidy place to put data dictionaries. While I'd love to use only R, most of my collaborators wouldn't be able to use it; it's niche, whereas Excel is the lingua franca of data analysis.

Those tools aren’t exclusive.

I use python almost exclusively for data analysis but still open files in excel to view them for lots of reasons.

Nothing in excel is reproducible, but it’s still on all my computers.

Did an analysis a few years ago of the programs run at my organization by job series, and scientists run a lot of Excel.

I spent 30 minutes today convincing a data scientist why she shouldn't use Excel to store her interim data, and that even an untyped format such as CSV or JSON would be a better medium than an Excel document.

I feel that Excel should still do the right thing here, which is the annoying part, but the change also helps people avoid confusing the names with dates, for example when the Excel sheet has actual dates in it as well.

> "Why, exactly, in a fight between Microsoft and the entire genetics community, was it the scientists who had to back down?"

Back down? Or pick a better tool. If Excel proves to be an unreliable tool for your job, use a better one. Alternatives exist, ranging from Google Docs and LibreOffice to simpler lightweight spreadsheets, or possibly more specialist tools.

Why does everything always have to be put in Excel if Excel is such a poor tool for so many things?

Excel may have failed in this specific task, but let’s not pretend like its functionality doesn’t run circles around Google Docs and LibreOffice.

Excel is a “pretty darn good” tool for 95% of tasks. If your work has highly varied workflows, then that flexibility more than makes up for its failures on the last 5%.

If you have very specific workflows on the other hand, you may find value in replacing Excel with a specialist tool. But let’s not pretend that specialist tools don’t also have their own shortcomings; at best they’ll achieve 99.9% coverage of tasks.

From my point of view Excel hits a sweet spot between 'very simple tasks' and 'very complex tasks'.

The sweet spot being "too complicated for simple tasks" and "not sophisticated enough for complex tasks"?

>"too complicated for simple tasks"

How can Excel possibly be too complicated for simple tasks? It is pretty much as straightforward as it gets when it comes to grid-file viewing and editing. You can show it to anyone from a high-schooler to a 60-year-old colleague with minimal computer experience and they will figure it out rather easily; good luck teaching Python/Pandas to the latter.

>"not sophisticated enough for complex tasks"

Not sure how that works either, really. Between formulas and VBA macros, people have made and are making tools complex enough that they have no business being in Excel, and yet there they are, even if it isn't the best tool for the job. Once you go past that point, Excel isn't even in the conversation, nor does it pretend to be.

It has issues, and people playing or working with complex (or simple) data would be better served by learning programmatic tools, but until they do, Excel will serve them well as long as they stay wary of its basic quirks.

> How can Excel possibly be too complicated for simple tasks?

Excel is too complicated to send an e-mail for example. You can send e-mail from excel but it's much simpler to do in Outlook.

Excel is too complicated for simple math: if you want to do 1024*436 it is better to fire up a calculator than to wait for Excel to start up.

That's irrelevant, though; Excel isn't meant to be Outlook, so why should it also be able to send email straightforwardly? The fact that you can even send one through VBA shows, to me, that it is not "not sophisticated enough for complex tasks," albeit probably ill-suited in this case (and many more, but that's beside the point).

And while the calculator might be marginally faster if you exclude start-up time, Excel is still perfectly fine for your use case; it only requires typing one extra symbol, namely "=1024*436[ENTER]". Hell, it's probably faster in Excel if you already have a window open. I don't see how it could possibly fulfill the task in a simpler manner.

Is R a bad statistical tool because I can't use it as Photoshop?

> if you want to do 1024*436 it is better to fire up a calculator

You lose your history this way, while you can easily keep it in Excel.

You could presumably come up with similar arguments for why Word (or equivalents) isn't "too complicated for simple tasks" and "not sophisticated enough for complex tasks", but, from experience, I'd argue that's also incorrect.

To be fair, I wasn't arguing in bad faith, and I would've welcomed arguments (should you care enough to participate) telling me why you're right. Instead you throw Word into the mix, for which those theories (perhaps ill-phrased, but I wouldn't call them arguments) can indeed be just as easily dismissed.

I think the arguments are stronger against Word than they are for Excel, but they're similar.

If you want to run some simple calculations on numbers, Excel is vastly overpowered and baroque. New version? Hope you've learned the new magic series of GUI clicks. Just like Word, Excel files are not inspectable (again, more baroque 'magic').

I can easily run calculations on plain texts tables in Emacs (along with the normal conveniences of being able to reorder by row, column; apply formulae across a range of cells etc.). And that's all displayed in plain text. And I can easily add new functions, keybindings, whatever, to suit my needs.

But the fact that you say the arguments against Word "can indeed be just as easily dismissed" makes it unlikely that this will be a productive dialogue.

For Word, it's way more complicated than most people need, and again, users are at the (non-existent) mercy of the application. It's not very user-configurable; you have to learn how Word expects you to do things. Again, its data files are non-inspectable monstrosities, with minor corruption rendering the whole file unreadable, and fragility between versions.

I manipulate text for a living. For simpler things, it's much easier and productive to use something like Markdown or (for more complex things) something like Org-mode. For complex things, I need LaTeX - Word simply lacks necessary functionality.

Well, I will see myself out, as this is indeed an unproductive conversation; it feels like I am speaking to a Linux zealot about why my grandmother has no need for Arch. But just to answer one last time: inspecting files is in no way something the average MS Office user ever needs to do; it has nothing to do with the conversation, really.

So, Excel is too complex for simple calculations, but you expect my mother and nearly retired colleagues who have never used non-GUI apps to learn Emacs to add two columns together? Is this a joke? Besides, the fact that you can do it so much more easily with Emacs, regardless of its veracity, is fallacious at best and says nothing about Excel.

Yeah, you need to learn how a program works; that's how programs always work. Are you telling me you didn't need to learn how Emacs and LaTeX work when you first tried them? I for one have never met someone who couldn't learn to use Word; it's pretty damn easy.

>I manipulate text for a living

Here we find the problem. You are a specialist, and you have seemingly lost touch with the way average people use Excel and Word. The only argument I see in these two comments is that Excel and Word are not suited for specialist needs, which is probably true, but I don't see a hint of why they are not suited for basic operations or their intended use cases. So I will see myself out, thanks.

LibreOffice AND Google Docs fall in the exact same spot as Excel here. Try it: type "SEPT-2" or "SEPT2" in either of them and see what happens.

LibreOffice Calc: "SEPT-2" is a date, "SEPT2" is not.

Do they change the data or only the representation? Several people here argued that LibreOffice preserved the original data, which would be a pretty fundamental difference in this context.

SEPT-2 is changed to a date (that is how all major spreadsheet software works when an input is accepted as a date); SEPT2 is kept as the text entered.

And actually it depends on the locale you work in whether SEPT is a valid month abbreviation or not.

Some biologist will write a gene as Sept-2. It's best to change the name so that it's not a date.

Characterizing this as a "fight between Microsoft and the entire genetics community" is... a choice. But I guess Tech is this season's "big bad", so every subplot has to tie back.

Better tools for managing genomic information are probably out there, but if everyone picks Excel (or workalikes that all do automatic date formatting too), then hoping people switch is tilting at windmills.

Unilaterally declaring you will not use names that confuse Excel will fix the problem (until we get new month names, anyway), and not require anyone outside their sphere of influence to comply.

I had the same situation at work. As an acquired company we had servers named devX; the new parent had servers named devX and pushed ssh config files to our laptops that made it impossible to connect to our servers. We asked them to fix it the first couple of times, but eventually renamed our servers to vedX to avoid the issue. It wasn't the right solution, but it was a stable solution and let us get on with our lives instead of fighting with IT.
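For the curious, the mechanism here is that ssh_config applies the first matching Host block for each option, so centrally pushed entries shadow anything below them. A sketch (hostnames invented):

```
# Pushed by parent IT at the top of the file -- matches first, so it wins:
Host dev1
    HostName dev1.parent.example.com

# The acquired company's pre-existing entry below is never consulted
# for "ssh dev1":
Host dev1
    HostName dev1.acquired.example.com
```

Renaming the servers to vedX sidesteps the first-match rule entirely, which is why it was a stable fix even if not an elegant one.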

The vast majority of people who use tabular data a lot in biology are already using Python or R primarily, and most do this work on a cluster. This article is what happens when you send your .csv to a PI who hasn't bothered to brush up on any computer literacy after somehow mastering punch-card programming, or to the undergrad whose breadth of understanding beyond the web browser starts with MS Office and ends with G Suite.

But they did pick a better tool: they changed the name. Nomenclature is also technology.

Excel is an amazing tool. But it also has some significant shortcomings:

* A well-known tendency to mangle date and gene data under the guise of being 'helpful'.
* Easy to make mistakes when cutting and pasting cells.
* Difficult to see what is going on in a spreadsheet.
* Poor handling of CSV files.

Some of these shortcoming are inherent to spreadsheets. Others are specific to Excel, but hard to overcome due to the weight of backward compatibility.

I have written a product for transforming and analysing tabular data (https://www.easydatatransform.com) that tries to overcome these issues:

* Doesn't change your input data file.
* Doesn't re-interpret your data, unless you ask it to.
* See changes as a visual data flow.
* Operations happen on a whole table or column.
* Good handling of CSV files.

Also it doesn't try to do everything Excel does.

It is a fairly new tool. Would appreciate some feedback.

I'm a business guy, so I don't know much about it, but it looks very useful and easy! I have been out of the game for 5 years now, but I used to run a Magento e-commerce site, and I think your software might have helped with managing product listings and things like that at the time. You might want to take a look at this customer segment, particularly small and medium businesses. They probably have a process for this already, but your software might be a good replacement, even though they won't be actively looking for it because they already have something that works. Good luck!

Thanks. I will see what I can find out.

Website looks great. Will try shortly and I'm more than willing to give you my money if it delivers. We need a product like this.

Edit: great call out to 7 days of non consecutive use

Thanks. I just checked the video. Looks neat! Excited to try it out.

Final Jeopardy! Trebek: "You know what, how about you just write down a number, any number at all. Could be a 1 or a 2, perhaps a 3... and Excel, you answered: a smiley face emoji. Simply stunning."

Excel is no longer motivated by the original intention of a spreadsheet, and now caters to the lowest common denominator: a piece of graph paper. As such, MS has shifted its focus from doing calculation to being a text and graphics layout tool. White text copied from a terminal? White text on a white background, you got it! Comma-separated numbers? By default, a long string with commas in it. Want a plot? It's in the Insert menu for some reason, since plots and numbers are no longer Excel's raison d'être.

The date parsing being discussed has been the behavior for at least 20 years, probably more like 30.

Your comment about graph paper echoes a comment from a former Excel PM: "The gridlines are the most important feature of Excel, not recalc."


Nice, well there you go.

MSFT did the same thing with Project when they introduced something called "Manually Scheduled Tasks."

Project is fundamentally a critical path scheduling tool, or at least was. Task A must finish before Task B, which must complete before Task C. If A is delayed, then that delay pushes B and C out, too.

This is what it's FOR, more or less.

Manually scheduled tasks don't move. They're set with whatever dates you give them, and do not move in response to delays or whatnot from predecessor tasks.

People wanted this because some (dumb) people insisted that "well, that task CANT move because it has to be done by then!"

This is akin to asking for the arithmetic engine to be turned off in Excel, because by golly you really need 2 and 2 to sum to 17.5.

> People wanted this because some (dumb) people insisted that "well, that task CANT move because it has to be done by then!"

So? There are tasks like this that simply CANT move, why should a project management tool not have the ability to model that? If you don't like the feature you don't have to use it.

I get that you wouldn't want these fixed dates and if you can just plan and execute your project by yourself, but that is not always the case.

As an example: I had a project where it was really important for us to inspect the aircraft type (A321) into which we wanted to install our hardware and software. We were given an opportunity to do so by an airline during maintenance of one of their A321s. This was a fixed date. If we didn't finish our preparations before that maintenance date, then we simply would not get this opportunity (or at the very least would have had to wait a very, very long time for the next one). The aircraft would simply not wait for us.

Not every task can move and just because you think they should does not make it so. Maybe there is more to Microsoft Project than just YOUR use-case?

This is such a "peak HN" kind of response.

I've been working with scheduling and project management tools for 20 years -- not just MS Project (widely regarded as the idiot cousin of the market, honestly) but also things like Primavera.

There's no reason to have a "manual" task in a scheduling tool.

First, if you have a task that can't move, you set it with a deadline and watch your deadline (as well as watching to see if the task moves PAST the deadline). Cementing the task in place doesn't help you; in fact, it actively HURTS you because it hides the fact that your forecast path isn't valid anymore.

Second, these tools ALSO include the idea of constraints, which limit critical path motion according to specific rules (based on the type of constraint in play).

Using ANY constraint, though, is frowned upon in serious scheduling circles precisely BECAUSE they distort the predictive ability of a critical path schedule. If Task C has a hard deadline of 1 Sept, then you watch your critical path to see if that remains possible. As tasks slip, you stay on top of the chain of tasks to seek opportunities to streamline or reduce scope ahead of the critical task so that the deadline can be met.

(Guess what? It's not always possible.)

And you do this because you SEE that the schedule shows Task C moving to the right.

If you lock the task in place, odds are you won't notice that your critical path is collapsing.

The tl;dr is that cementing a task in place in a critical path schedule is not a good way to model deadlines. This is something any competent scheduler will tell you. It's part of the PMBOK, it's built into DoD scheduling guidelines, etc.

>Maybe there is more to Microsoft Project than just YOUR use-case?

Hilarious. I will say it's clear one of us doesn't quite understand the problem domain as well as they might represent, but given my background I know it's not me.

This feels more "peak HN" than the previous poster as it's dripping with self-importance and condescension.

Or, you know, actual knowledge about the problem domain. But you do you.

Project scheduling only works if you have resources 100% dedicated to a project. Trying to accurately forecast with partial resources is an exercise in madness.

On the very large programs I'm typically involved in, it's unusual to have a perfect resource pool like you describe. It's MORE likely that the whole thing is scheduled with job codes, and the labor pool is treated as elastic with additional people brought on as needed to fill roles as Electrical Engineer II or whatever.

Generally speaking, you can schedule just fine with partial resources as long as your project doesn't need more of a given resource at any given time than can be allocated to it. Really big construction projects and really big government (defense) contracts work like this.

My mind is blown that Excel's usability is so bad that the representation of the human genome itself has to adapt around its undesired behaviour.

As in, the history of genetics research is now irreversibly linked with the shortcomings of this one software product, which just happens to be incapable of describing the genetics of the organisms that created it.

I hate to sound like a salty old IT guy, but here we go. It is not the fault of Excel that people are using it wrong. They have the ability to import the data as text but they skip that step altogether. If the user does not say up front what the column is, Excel has to guess. If Excel didn't try to guess, someone would be commenting on how bad the usability is when an obvious date field got interpreted as text.

It's also behavior that goes back to the Ancient Times and predecessors such as VisiCalc and Lotus 1-2-3. Even the ancient ones will tell you: if you need to enter a thing that has to be text and only text, precede it with a quote mark, i.e. 'MARCH1, just as you would precede a formula with =. It's Spreadsheet 101 knowledge dating back many decades.
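A related, widely circulated workaround when you're generating the CSV yourself (rather than typing into a cell) is to wrap risky names in the ="..." formula form, which Excel then imports as literal text. A quick Python sketch (the file name and gene list are just examples):

```python
import csv

genes = ["SEPT1", "MARCH1", "DEC1"]   # all prone to date conversion

with open("genes.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["gene"])
    for g in genes:
        # ="SEPT1" evaluates to the text SEPT1 on import; the leading-apostrophe
        # trick only works when typing directly into a cell, not inside a CSV.
        w.writerow([f'="{g}"'])
```

It's ugly, and it ties your "plain" data file to spreadsheet conventions, which is arguably the whole problem in miniature.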

The clickbait headline is fun, but the real headline is more like "Scientists find it easier to rename things than learn the basics of data entry in the tools they use".

> It's Spreadsheet 101 knowledge dating back many decades.

Astounding, I've literally never heard of this in the 20 years or so I've been using spreadsheets. I'll be using it from now on!

Welcome to the lucky 10,000 [1] club! It's a fascinating thing about cultural knowledge that stuff that should be taught early in using a tool, people discover anew every day. It's a useful reminder that I also wasn't disparaging the scientists for finding renaming easier than retraining. I'm sure these scientists were very familiar with the costs of learning/relearning. (ETA: Which is why my real joke was about ancient ones and how easily knowledge of the 70s/80s seems ancient and easily forgotten.)

[1] https://xkcd.com/1053/

Excels usability isn't bad though. Which is why everyone uses it. It falls down in this (and lots of other cases) but if you want to see what good usability looks like, look at what people use.

Everyone uses Excel because it's the tool everybody learned at university or even high school, nothing more; it's not because it's good at all.

It's a terrible app, but Microsoft managed to shove it everywhere in education. Same for Word.

Unfortunately, LibreOffice and co. tried to reproduce the terrible UX of both programs while coming up with inferior functionality, because they had to try to maximise compatibility with the MS tools; time not spent on providing actually useful stuff.

Access is a bit better and I wish more non programmers learned basic SQL instead of Excel.

> Everyone uses Excel because it's a tool everybody learned at University or even highschool, nothing more

My anecdote is the exact opposite.

I was recently introduced to using Excel and actually writing some VBA inside of it when I helped a non-programmer with a complicated data processing task on a sheet they were populating from a database.

This was the first time I really used Excel, even though I've had continuous access to it since Windows 3.0.

Aside from some small annoyances, it was a generally pleasant experience. I also can't think of any tool that comes remotely close to the power and flexibility of Excel, particularly for people who are not programmers.

I kind of got hooked on it, and now I frequently use it for little data processing tasks that I used to do in a REPL - and the bonus is I can share those with people who are not programmers.

You may enjoy "Pure Functional Programming in Excel" by Felienne Hermans presented at GOTO 2016 https://www.youtube.com/watch?v=0yKf8TrLUOw

> because they had to try to maximise compatibility with these MS tools,

Worst excuse ever. Excel had to maximize compatibility with Lotus 1-2-3 (in the 80s/early 90s, the dominant spreadsheet application by far) from Day 1. It wasn't just that the files were compatible, Excel included keystroke compatibility so 1-2-3 experts could continue navigating the UI with their old commands.

Unlike the LibreOffice team, Microsoft made a superior program in virtually every way, despite the burden of backwards compatibility with another program.

So what in your opinion is "the right tool" to deal with spreadsheet type of tasks while maintaining things like easy to move to other PC and so on?

It may be a terrible app, but it's better than the competition. Does that mean it's the best one?

My favorite is still that Excel can't handle dates before 1900

In a previous version of MS Excel, after opening a CSV file, Excel would silently write the 'interpreted' version to disk. So much trouble, simply from previewing a CSV.

I have had excel change date formats on me several times because of this "feature". I noticed it no longer does this sometime in the past year.


The paper announcing the new guidelines for renaming genes, a work of fundamental importance to all scientists in the world, cannot be read without an expensive subscription to the journal.


Thanks science (sarcasm!)

copy paste the DOI into scihub

That's entirely missing the point.

it works

Just like Excel

Until it doesn't...

To people asking, "why do they use Excel?" that's like asking "why must we be subjected to gravity?"

The whole world's data ultimately comes from or ends up in an Excel spreadsheet. Sure, we might use other intermediate data storage methods, but in the end it's going to go into some scientist's or some politician's computer, and by golly it's gonna be in Excel. Trying to rally against Excel is like trying to rally against fundamental forces of nature.

This is just an example of that fundamental force winning.

I don't disagree completely with this, but just want to point out that it's kind of a bad smell to have computational biologists who are - as someone in the article puts it - computationally illiterate. I have met lots of these types over the years, and usually their methods are kind of a gong show. If you can't properly sanitize your data inputs on your column headers, why should I trust that you've treated the rest of your data properly?

I have a strong feeling that, if people really put an effort into reading and replicating more papers, we would find that a lot of what's being published is simply meaningless.

In grad school I had a subletting roommate for a while who was writing code to match some experimental data with a model. He showed me his model. It was quite literally making random combinations of various trigonometric functions, absolute value, logarithms, polynomials, exponents, etc. into equations that were like a whole page long and just wiggling them around. He was convinced that he was on a path to a revolution in understanding the functional form of his (biological) data, and I believe his research PI was onboard.

I guess "overfitted" never made it into the curriculum.
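The failure mode is easy to reproduce. A toy NumPy sketch (degrees and noise level are arbitrary choices of mine): a higher-degree polynomial always fits the training points at least as well, which is exactly why "wiggling a page-long equation around" feels like progress even when it isn't.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(0, 0.1, x.size)   # the true process is just linear + noise

def fit_error(degree):
    """Mean squared error of a least-squares polynomial fit on the training data."""
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

simple, wiggly = fit_error(1), fit_error(9)
# The degree-9 "model" scores better on the data it was tuned to,
# which says nothing about whether it captures the underlying process.
```

Because the polynomial bases are nested, the degree-9 residual can never exceed the degree-1 residual on the same data; only held-out data can reveal that the extra wiggles are fitting noise.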

> It was quite literally making random combinations of various trigonometric functions, absolute value, logarithms, polynomials, exponents, etc. into equations that were like a whole page long and just wiggling them around.

Technically, we call that a "neural network". Or "AI".

It reminds me more of a genetic algorithm

Every time I think I'm getting somewhere with a GA, I eventually realise I've just created a guided Monte Carlo simulation.
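That's a fair description: strip a GA down and it's random sampling plus a selection filter. A toy sketch (the problem, target, and parameters are all invented) that "evolves" a number toward a target:

```python
import random

random.seed(1)

def fitness(x):
    return -abs(x - 3.14)   # peak fitness at the target value

def evolve(pop_size=30, generations=60, sigma=0.3):
    pop = [random.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the top fifth of the population.
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 5]
        # "Mutation" is literally just resampling near the survivors,
        # i.e. Monte Carlo, guided by the selection step.
        pop = [random.choice(parents) + random.gauss(0, sigma)
               for _ in range(pop_size)]
    return max(pop, key=fitness)

best = evolve()
```

Squint and the loop is exactly "sample randomly, keep what looks good, sample again near the keepers", hence the guided-Monte-Carlo feeling.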

This answers the question "will we ever have AI as smart as a human?"

Yes. It just turns out it's a particular human, whose analysis is very very dumb.

I work in computational materials science (where ML brings funding) and a funny paper of this kind is here: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.11... - they are literally trying out 100000s of possible combinations by brute force, to build a "physical model".

Then they go to conferences and brag about it, because they have to (otr they know it's bs). Datasets are so-so (have a look at QM9...), and for more specialized things people generally don't bother benchmarking or comparing their results against a common reference. It's just something new...

And on top of all that: even setting aside fancy statistical methods applied without much understanding, your theoretical computations might not make that much sense (at least not at the sheer volume being pumped out and published)...

> (otr they know it's bs).

Well, that's a new acronym for me. I wonder where it came from. Apparently it's "on the real". Sounds like AAVE?

I thought it was off the record like the old Pidgin plugin or https://en.wikipedia.org/wiki/Off-the-Record_Messaging.

It means off the record.

off-the-record (not written, not cited, but available in personal discussions).

Oh, I thought "on the real" fit the context better, meaning they knew in their heart of hearts it was bullshit, but "off the record" is about the same.

> Well, that's a new acronym for me. I wonder where it came from. Apparently it's "on the real". Sounds like AAVE?



typo for 'or'?

> I have a strong feeling that, if people really put an effort into reading and replicating more papers, we would find that a lot of what's being published is simply meaningless.

People have figured that out long ago [1] (I know the author of that paper lately turned somewhat controversial, but that doesn't change his findings). It's not very widely known in the general public. But if you understand some basic issues like p-hacking and publication bias and combine that with the knowledge that most scientific fields don't do anything about these issues, there can hardly be any doubt that a lot of research is rubbish.

[1] https://journals.plos.org/plosmedicine/article?id=10.1371/jo...

isn’t the saying, 80% of everything is garbage?

Sturgeon's Law https://en.wikipedia.org/wiki/Sturgeon%27s_law "ninety percent of everything is crap."

Yeah, but one would hope that science has a higher standard. 80% garbage results in science sounds catastrophic to our understanding of the world, and in particular when it comes to making policies based on that science.

There's the saying "science advances one funeral at a time."

'‘A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it.’ This principle was famously laid out by German theoretical physicist Max Planck in 1950 and it turns out that he was right, according to a new study.'


Also the story of Ignaz Semmelweis who discovered that if doctors washed their hands it reduced deaths during childbirth - but for a variety of reasons his findings were resisted.

https://www.npr.org/sections/health-shots/2015/01/12/3756639...

Point being, as awesome as science is, it's still a human enterprise, and humans are still, well, human.

You do realize we've been at this for a few thousand years, right? Catastrophic is putting it mildly.

Unfortunately, like bs, the power law is fractal.

We better work on making more stuff.

Better stuff?

If you have more stuff, and 10% of it is good, then more stuff means more of the good stuff.

Not my area at all but isn't this genetic algorithm type stuff?

In grad school I had a friend who was doing olfactory (smell) research on rats with tetrode drives (wires in their brains). He was looking at the neuronal response to smells that they gave the rats and had a few signals to match up: the signal from the Arduino running the scent gates, the amps that reported the weak neuronal currents, the nose lasers that gated the Arduino, etc. He was having a hard time getting through all the data in his MATLAB code, and I offered to help for some beer.

After the 11th nested 'if' statement, I upped the request to a case of beer. I'm not certain he ever got the code working.

To the larger point, scientists are not programmers. They got into their programs to do research. What keeps them going is not the joy of programming, but the thrill of discovery. Programming is nothing but a means to an end. One they will do the bare minimum to get working. Asking hyper stressed out grad students to also become expert coders isn't reasonable.

And yes, that means that the code is suspect at best. If you load the code on to another computer, make sure you can defenestrate that computer with ease, do not use your home device.

I keep seeing this sentiment when it comes to those in the natural sciences, but it makes no sense.

I could replace "programming" in your above little bit with "mathematics" and it would be just as weird.

Our modern world runs on computers and programs, just as our modern world and modern science built itself on mathematics and required many to use it. So too the new world of science may require everyone to know to program just as they know about the chemical composition of smells, or the particulars of differential equations, etc.

And I know your argument isn't "they shouldn't learn programming", but honestly since I keep seeing this same line of reasoning, I can't help but feel that is ultimately the real reasoning being espoused.

Science is getting harder, and its requirements to competently "find the exciting things" raises the bar each time. I don't see this as a bad thing. To the contrary, it means we are getting to more and more interesting and in-depth discoveries that require more than one discipline and specialty, which ultimately means more cross-functional science that has larger and deeper impacts.

Most scientists are not great at math either.

Again: these are tools that are means to an end. They only need to work well enough to get the researcher to that end.

A lot of what are considered essential practices by expert programmers are conventions centered around long-term productivity in programming. You can get a right answer out of a computer without following those conventions. Lots of people did back in the day before these conventions were created.

That's not to say that everybody with horrible code is getting the right answers out of it. I'm sure many people are screwing up! My point is just that ugly code does not automatically produce wrong answers just because it is ugly.

By analogy, I'm sure any carpenter would be horrified at how I built my kayak rack. But it's been holding up kayaks for 10 years and really, that's all it needs to do.

I will add that in general, statistical analysis of data is not by itself adequate for scientific theory--no matter how sophisticated the software is. You need explanatory causal mechanisms as well, which are discovered by humans through experimentation and analysis.

And you can do science very well with just the latter. Every grand scientific theory we have available to us today was created without good programming ability, or really the use of computers at all. Many were created using minimal math, for example evolution by natural selection, or plate tectonics. Even in physics, Einstein came up with relativity first, and only then went and learned the math to describe it.

Your point is maybe a little obtuse to me, because it sounds like you are arguing for "computers are tools that should be learned, but really no one does and who can blame them, they just want to science" and simultaneously arguing, "tools aren't science, and science can be done without them".

I feel like the latter is obvious: of course the tools aren't science, but if you want to do real work and real science, your tools are going to be crucial for establishing measurements, repeatability, and sharing how one maps their hypothesis onto real-world mechanics.

Likewise, the former is just the same commonly repeated thing I just argued against and my reply is the same: so what? You building a kayak is not science and is irrelevant.

Scientists can't reach a meaningful conclusion without proper use of tools. All they can do is hypothesize, which is certainly a portion of science (and many fields are in fact stuck in this exact stage, unable to get further and come to grounded conclusions), but it is not the end-all of science, and getting to the end in modern-day science means knowing how to program.

Of course there are exceptions and limitations and "good enough". No one is arguing that. The argument I am refuting is those who think "tools are just tools, who cares, I just want my science". That is the poor attitude that makes no sense to me.

> Scientists can't reach a meaningful conclusion without proper use of tools.

I'm just trying to make the point that "proper" is subjective. Software developers evaluate the quality of code according to how well it adheres to well-established coding practices, but those practices were established to address long-term issues like maintainability and security, not whether the software produces the right answer.

You can get the right answer out of software even if the code is ugly and hacky, and for a lot of scientific research, the answer is all that matters.

The usual reason programmers object to ugly, hacky code is that it's a lot harder to be justifiably confident that such code actually does produce the right answer -- "garbage in, garbage out" is just as true in function position as it is in argument position.

Tbh I think its a case for multidisciplinary research. You wouldn’t only hire one skill set to run a company, even a tech one, so why should research be any different? That’s probably where the deep insights are.

People who are just decent programmers can make at least twice (probably 3 or 4 times) as much money working for industry as for science in an academic environment. Most programmers who would work for less money because they are interested in science will be more interested in computer programming problems than in basic programming to support a scientist. NSF won't give you $250k to hire a really good programmer to support your gene analysis project. More like $100k if you are lucky.

So what you end up with is that the great scientists who are also decent programmers are the ones who can do the cutting-edge science at the moment.

That's a problem we should aim to solve.

Think of the flip side: Programmers are terrible biologists.

Sure, it would be great if we all had more time to learn how to code. Coding is important. But I'd say the onus should be on coders to build better tools and documentation so they are empowering people to do something other than code, rather than reduce everything to a coding exercise because making everything look like code means less boring documentation and UX work for coders.

I mean, biology is in fact a full on degree program and you pretty much need a PhD before you're defining an original research topic. It's not because biologists are dumber and learn slower. It's that biology is complicated and poorly understood, and it takes years to learn.

Contrast this to coding... you don't even need to go to college to launch a successful software product, and the average person can become proficient after a few years of dedicated study. However, those are a few years that biologists don't have, as their PhDs are already some of the longest, time-wise, to finish.

The decision to rename genes is totally consistent with the biologist's MO: if a cell won't grow in a given set of conditions, change the conditions. Sure, we can CRISPR-edit the genes to modify a cell to grow in a given set of conditions, but it's usually far easier to just change the temperature or growth media than to edit a cell's DNA.

My takeaway is that this is more a failure of programmers and/or of their managers to guide the programmers to make tools for biologists than a failure of biologists to learn programming. Sure, coders get paid more, but they aren't going to cure cancer or make a vaccine for covid-19 without a biologist somewhere in the equation. And I'm glad the biologists developing vaccines today are doing biology, and not held up in their degree programs learning how to code!

MATLAB has taken over bio specifically because it has great documentation and examples. If Python were pseudo-code that compiles, then MATLAB would be just speaking English. Even still, the spaghetti that researchers get into is just insane.

> To the larger point, scientists are not programmers. They got into their programs to do research.

I would say most research, to an ever growing degree, is so heavily dependent on software that it's tough to make that claim anymore. It makes no sense to me. It's like saying Zillow doesn't need software engineers because they are in the Real Estate business, not the software business.

I maybe misspoke. I meant that scientists do not go into science to program, they go into it to discover and do research (among many many other things). Sure, some do find joy in good programming, but that's not why they are there to begin with. Becoming a better programmer isn't their passion, and those skills remain underdeveloped as a result.

> To the larger point, scientists are not programmers.

I mean, sort of. Some research is essentially just programming; other research can get by with nothing but excel. Regardless, it's unreasonable to ask most scientists to be expert programmers -- most aren't building libraries that need to be maintained for years. If they do code, they're usually just writing one-shot programs to solve a single problem, and nobody else is likely to look at that code anyway.

It's not usually computational biologists who are using Excel.

What if you want to share data with a wetlab biologist who want to explore their favorite list of genes on their own?

There are lots of great computational biologists, but being a computational biologist doesn't necessitate being good with computers. Plenty of PI's rely pretty much exclusively on grad students and post-docs to run all their analyses.

Not that I'm saying using excel is bad either. I use excel plenty to look at data. But scientists need to know how to use the tools that they have.

If people are just looking at the spreadsheets then wouldn’t the cells interpreted as dates not be a problem? It seems like it would only be a problem if you’re doing computation on the cells.

Excel changes the displayed text when it interprets the text as date. You store "MARCH1", it displays "1-Mar".
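For what it's worth, the affected symbols are predictable enough to screen for before a spreadsheet ever sees them. Below is a rough, hypothetical check; the month list and pattern are my approximation of the problem, not Excel's actual parsing rules:

```python
import re

# Month abbreviations that tend to trigger date coercion in symbols like
# "MARCH1" or "SEPT2". Illustrative only - Excel's real parser differs by locale.
MONTHS = ("JAN", "FEB", "MAR", "MARCH", "APR", "APRIL", "MAY", "JUN", "JUNE",
          "JUL", "JULY", "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC")
DATE_LIKE = re.compile(r"^(%s)[-/ ]?\d{1,2}$" % "|".join(MONTHS), re.IGNORECASE)

def excel_may_mangle(symbol: str) -> bool:
    """Return True if a gene symbol looks like something Excel could read as a date."""
    return bool(DATE_LIKE.match(symbol))
```

With this, `excel_may_mangle("MARCH1")` and `excel_may_mangle("SEPT2")` are True while `excel_may_mangle("BRCA1")` is False, which is roughly the class of symbols the article says were renamed.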

It's also my experience of research in biological sciences that it is a widespread belief/fact that in order to get published in a top journal, the analysis methods must be "fancy", for example involving sophisticated statistical techniques. I worked on computational statistical methods so I'm not against that per se, but the problem is that if you have the training to contribute at the research front of an area of biology you rarely have the training to understand the statistics. Some would say that the collaborative publication model is the solution to that, but in many cases the end result isn't what one would hope for. I do think that less emphasis on "fancy" statistics, and more emphasis on simple data visualizations and common sense analyses would be a good thing.

Agreed. And to add to that: the fancier the statistics have to be, the less robust the results.

I'm an ex-computational biologist who did most of his work in Python but periodically had to interop with Excel.

The basic assumption I have is that when I input data into a system, it will not translate things, especially according to ad-hoc rules from another domain, unless I explicitly ask it to do so.

It's not clear what data input sanitization would mean in this case; date support like this in Excel is deeply embedded in the product and nobody reads the documentation of Excel to learn how it works.

it would be nice if everyone was expert at everything, but they cant be. it would be nice if they hired experts but money doesn’t grow on trees. we often insist on a degree of excellence we refuse to pay for

It's not about being an expert at everything or hiring more people. These aren't particularly hard problems; it's not difficult to find biologists who are incredibly adept at using Python, R or C. It's about thinking about how science gets funded and how it gets implemented. I've written here before about the difference between "grant work" and "grunt work", and how too much computer-touching tends to get looked down upon at a certain level.

If you're deciding who gets a large-scale computational biology grant, and you're choosing between a senior researcher with 5000 publications and a broad scope, and a more junior researcher with 500 publications and a more computationally focused scope, most committees choose the senior researcher. However, the senior researcher might not know anything about computers, or they may have been trained in the 70's or 80's when the problems of computing were fundamentally different.

So you get someone leading a multi-million dollar project who fundamentally knows nothing about the methods of that project. They don't know how to scope things, how to get past roadblocks, who to hire, etc.

What's your source on it not being difficult to find biologists who are adept at using python, R, or C? Most biologists operating in private industry or academia have many years of training in their fields and many have learned their computational tools as they've gone on, meaning they've never received proper training. It seems dubious to claim that there's this neverending source of well trained biologists who are also adept at programming.

I would say the number of biologists who actually understand programming is extremely small. I've been programming for fun for ~15 years, and I'm about to finish a PhD in chemical biology (e.g. I started programming in C far before I started learning biology).

You might occasionally run into someone who is passable - at best - with R or Python. But most of the code they might write is going to be extremely linear, and I doubt they understand software architecture or control flow at all.

I don't know any biologists who program for fun like me (currently writing a compiler in Rust).

To be fair, linear code is often totally sufficient for most types of data analysis. Biologists don't really need to understand design patterns or polymorphism, they just need to not make computational mistakes when transforming the data.

Absolutely. My point was more that you can't expect comp. biologists to actually be "good" programmers when compared to SWEs or even web devs.

Most of the code I write to do biological data analysis is fairly linear. However, I also generally use a static type system and modularity to help ensure correctness.

I've perused a lot of code written by scientists, and they could certainly learn to use functions, descriptively name variables, use type systems and just aspire to write better code. I just saw a paper published in Science had to issue a revision because they found a bug in their analysis code after publication that changed most of their downstream analysis.

It does get rather problematic when you have large quantities of... stuff. You can't run linear code in parallel, so you're bound to whatever single CPU core you have lying around.

I'd say that getting some basic data science computing skills should be more important than the silly SPSS courses they hand out. Once you have at least baseline Jupyter (or Databricks) skills you suddenly have the possibility to do actual high performance work instead of grinding for gruntwork. But at that point the question becomes: do the people involved even want that.

I write 'one-off' programs all the time. Most of what I write I throw away, and I program for a living. Those are usually fairly linear, which is fine. If I am writing something that will be re-used in 6 different ways and on a 5-person team, that is when you get out the programming methodologies. It is usually fairly obvious to me when you need to do it. Someone who does not do it all the time may not know "hey, stop, you have a code mess".

It's one of the reasons why people end up with spreadsheets. Most of their data is giant tables, and Excel does very well at that. It has a built-in programming language that is not great but not totally terrible either. Sometimes all you need is a graph of a particular type: paste the data in, highlight what you want, use the built-in graph tools. No real coding needed. It is also a tool that is easy to mismanage if you do not know the quirks of its math.

It doesn't take being an expert at Excel to understand how Excel autoformats. It takes a few days of actually working with data or an introductory class that's today taught in American primary schools.

Sorry for asking but are you familiar with how MS Excel aggressively converts data to dates? There's no way to "sanitize" it (without resorting to hacky solutions like including extra characters) and even if you fix the data, it will re-change them to dates the next time you open the file.

You're simply incorrect. If you set the column format to Text it will never convert data to dates, including when you open the file.

Great, how do you set a custom column format in a CSV file?

I'm only familiar with LibreOffice and not Excel myself, but: if you want to be sure a column is treated as text in a CSV file, you have LO quote text fields on save, and have it treat quoted fields as text on import. I assume Excel must have similar options.

LibreOffice, on opening of CSV file, always pops an import dialog where you can do this. To keep the column format permanently, save as ods.
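The quote-everything half of that workaround is easy to script. Here's a minimal Python sketch, with the caveat that Excel ignores quotes when opening a CSV directly; this only helps importers, such as LibreOffice's dialog, that can be told to treat quoted fields as text:

```python
import csv
import io

# Quote every field on save so an importer configured to treat quoted
# fields as text won't coerce values like "MARCH1" into dates.
rows = [["gene", "count"], ["MARCH1", "12"], ["SEPT2", "7"]]
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerows(rows)
print(buf.getvalue())  # every field emitted inside double quotes
```

Writing to a real file instead of `io.StringIO` works the same way; the key is `csv.QUOTE_ALL` rather than the default minimal quoting.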

For the most part we aren't talking about computational biologists but experimentalists using Excel. People at the bench need to collect their data somehow, and using Excel for tabular data and Word for text data is just what they know. Typically they then pass these files over to computational biologists for analysis. Yes, it would be nice if they would use more appropriate tools, but I know from experience that the typical result of trying to teach them better tools is the experimentalists just rolling their eyes and saying that they don't have time to learn some nerdy program because they have experiments to run.

Excel is a wonderful tool and a type of programming that is very accessible to many people. I use it all the time.

Considering how Perl was chosen as the computational biologists' lingua franca in the 1990s and 2000s because it was good at text manipulation (since genes are represented by text), I would say they don't have a history of making good choices.


> " To people calling, "why do they use Excel?" that's like asking "why must we be subjected to gravity?""

I respectfully disagree with this. Excel is fundamentally not suited to analysing *omics data. It's often the default program affiliated with a .csv filetype on people's computers, but trying to get an entire field of scientific research to rewrite itself based on its glorified bugs is...wrong, in my opinion.

If you see wrong things in the world, do you accept them as they are, or try -- however ineffectually -- to force change for the better? I for one bang the drum into the wind and try to get biochemists off it. I teach people to be very sceptical of excel in my stats courses, for example (aside from some showstopping bugs and downright dangerous defaults, its RNG is particularly crap).

Excel is not fundamentally suited, but it's fundamentally used.

I wish you luck, but I long ago declared defeat against Excel. I can't prevent everyone from using it. I have come to accept that whatever Excel does is what I have to live with, because someone, somewhere out there hasn't heard the anti-Excel gospel yet.

> I wish you luck, but I long ago declared defeat against Excel.

Totally relate to this.

I studied a degree because I wanted to rid the world of Excel.

Just like I wanted to rid the world of pie charts.

Turns out the rest of the world is quite happy using both of them, no matter whether I think it's right or not.

I've actually been in a lot of scenarios where Excel is banned except for some managers, C-level and board-type stuff.

If you produce, touch or directly act on data, Excel is no longer allowed. On the other levels the people that still use it (because they think they have the skill - which they often don't) can't do meaningful direct-action on those files anyway so that solves part of the problem of acting on bad data or bad data processing.

We do allow web-based spreadsheets, that has the benefit of limiting what you can do and drives people to either put a request in with the correct team or take internal courses on how to do it right.

Some people hate it, but then again, some people hate Excel. The difference is that the people that don't hate it now have an empowered skillset that they didn't have before. In theory you could do the same with Excel-oriented internal courses, but we didn't want to do Excel because a lot of us think it's stupid (often simply for emotional reasons but the push was based on hard data like process metrics and user input).

To add to this: it's probably because I find myself attracted to data-driven organisations where using Excel didn't help the workforce much anyway, and the positions I'm in allow for a certain degree of influence on the tools that we support and how much explicit allow/deny we apply on them.

> The difference is that the people that don't hate it now have an empowered skillset that they didn't have before.

Not knowing how to use the primary tool in your field because you only know how to use a web-based spreadsheet with 1/100th the functionality is not an "empowered skillset." It's a cruel trick to play on employees, who will have difficulty finding further jobs in that field because they don't have a basic understanding of the tool everyone else uses.

I was talking about Apache Spark via Databricks. ~220 of the people that were attached to Excel and PowerBI switched over for their calculations and data processing, 3 remained. One was let go, one didn't actually use Excel to write data, only to see it and search it, and the third wasn't able to learn the replacement so they got to keep Excel.

It's not a cruel trick and even if it was: web-based spreadsheets might not have the same functionality, but the people that were using the desktop-based spreadsheets weren't doing much more than basic computation on fields and search/replace. That works fine on the web.

It's not like we took away their toys and gave them dirt instead. We just made everyone take the startup course to get them acquainted and if that went well we would decide our next steps. It went well, and the next steps became removing Excel from the standard workflows.

Quantitative social science (which a lot of comp bio people might regard as a "softer" science) has really made huge strides in moving this kind of analysis to R. Of course one person can't change it, but those with high status in the field can make a difference by leading the way and setting standards.

At some point, you generate a dataframe with the data in R, save it as a csv, email it to someone, and they open it in.....Excel.

This is the exact sort of thing that indicates market-power being concentrated to a point requiring severe, acute regulatory action.

Not at all

There is tremendous competition in the spreadsheet business, and anyone is free to use Julia, R, whatever they like.

But they don't, and that's a problem. It's an education problem: everybody automatically defaults to Excel, even when it's not the proper tool. People need to learn about better tools.

You can't force people to use a different tool that they don't want to use. They [(computational) biologists] know other tools pretty well, they don't want to use them. If you want to help, work on better tools for them, based on their needs. Regulation is about the worst thing you could've thought of. If you said this to my gf's research group, they would yell at you, Excel is a loved piece of software among them, and other spreadsheet applications are not even close. They don't want to have a dozen different tools that they have to switch between fifty times an hour, converting data in the meantime, they want to know one or two very well and have the data all there. Excel is a data swiss army knife, which is exactly what they need, it has some warts but forcing them off Excel really is not the solution.

Where am I talking about forcing them? You're the only one bringing that up, and I strongly disagree with it. Education is not force, it's empowering.

The only reason they want to use a tool that doesn't really fit their use case, is because they're not aware of better tools. I find it very hard to believe that scientists actually want to use a tool that corrupts their data. If that is true, then that is absolutely a problem with their attitude towards science and data.

It's not that hard to imagine a tool that can do exactly what Excel can, but without corrupting your data. It might even exist already. LibreOffice got mentioned a lot; it can do almost(?) everything Excel can, but without corrupting your data. If there are problems with it that make it useless to scientists, there's a good chance the LibreOffice community can fix them.

I think better tools that preserve the integrity of their data are absolutely the solution here.

You're taking a minor, very unusual issue that happens only to a subset of a subset of a subset of users and resolving it with regulation, while forgetting all the cases where Excel works perfectly for them. LibreOffice is good, but it has many more warts than Excel does, and the presentation capabilities (a key functionality) simply are not as good.

Are you seriously opposed to education because you see it as regulation?

Also, I don't think it's a very minor issue if the tool you use to process your data, changes your data. It may only do that in limited cases, but the fact that it does it at all should alarm anyone who cares about data integrity.

> This is the exact sort of thing that indicates market-power being concentrated to a point requiring severe, acute regulatory action.

This is what I react to. Sorry I missed that you're not the original commenter, that's my bad.

I am not sure, though, what other tool the biologists should learn - they usually know R and Python, and they use Excel because they need what Excel offers; there isn't any other software that allows them to easily do whatever they need to do with the data, quickly iterating on ideas AND then presenting it nicely.

The solution here should be to fix Excel, but it really is a small issue (that has been worked around, too; and if you know how to use excel correctly, it does not happen to you) compared to making another Excel, which seems like a monstrous task.

Yeah, that wasn't me, and that comment was already countered by pointing out there was plenty of competition in the spreadsheet market. My reaction was that a lot of people may not be aware of all the options and simply default to Excel because that's what they know.

You're claiming that they do know the alternatives but the alternatives fail in worse ways than Excel does.

I don't know what the magic sauce is that makes Excel so much better than the alternatives, or why it would be such a monstrous task to replicate that. My impression is just that to many people, Excel is simply the default tool to enter data in, no matter what the actual problems with Excel are. Excel is like gravity; good or bad, you put up with it because it is what it is. And if that's the case, that'd be pretty bad and in need of change, because Excel is not at all like gravity; it's merely one tool of many. If there's a more appropriate tool, people should use that. If there isn't, one could be developed.

I stand by my point that a standard data collection tool that modifies the data you put into it is a really bad idea for any field where accuracy and data integrity is important, and I'd expect science to be one of those fields. You may be able to work around the limitations of the tool, but the risk is still there.

I just don't see it being a big deal. You can't just force people to use something they don't like.

Ah yes. Without acute regulatory action, Lotus 1-2-3 would have never been defeated, the PalmPilot would have reigned supreme as a smartphone device to this day, and Starbucks would have taken over all coffee production.

Having Excel be the default for .csv files has caused me problems in the past as well, with Excel automatically changing data when you open it. The unfortunate thing I have come across multiple times is a csv file that has some values with a lot of significant figures, e.g.

123456789876, 987654321234

After opening in Excel it will convert these to:

1.23E+11, 9.87E+11

Then resaving the file (as a csv) will store the values in scientific notation and lose all of the digits.
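The loss is easy to reproduce without Excel at all. Here's a small Python sketch simulating the reported round trip, where the `"%.2E"` display format is my stand-in for what the cell shows:

```python
# Simulate: a wide number is displayed in scientific notation, and the
# displayed form (not the stored value) is what ends up back in the CSV.
original = 123456789876
displayed = "%.2E" % original       # what the cell shows: '1.23E+11'
recovered = int(float(displayed))   # what a downstream reader parses back
print(displayed, recovered)
assert recovered != original        # trailing digits are gone for good
```

Here `recovered` comes back as 123000000000, so the last nine digits of the original value are unrecoverable from the saved file.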

This issue is common in the oil and gas industry. Wells are typically identified with a 14-digit number that Excel likes to convert to scientific notation and then truncate digits when saving to CSV.

That's the problem of having Excel open CSV files by default - if you use Data > From Text/CSV instead, you get a chance to change the import process and make those columns text rather than having Excel guess they are numbers.

Excel can't even properly support UTF-8 csv documents. But Libreoffice Calc does. Pure madness.

Same here, but with IMEI numbers. Excel silently truncates them, it’s frustrating. Why can’t the auto format detect and just store them in text format?

Why can't there be a flag (defaulted to off) to control auto formatting

I was thinking that perhaps the autoformat options could be explicitly shown in the column headers

    number date currency
    A      B    C
and you could click on the format to change it. Italics could show assumed formats, bold could show user set formats.

I'm newly returned to (very minor) Excel use, and assumed formats with no feedback seems wrong. My situation is dates _not_ being picked up as dates, and it not being obvious that they haven't been; I'm using ISO format (e.g. 2020-06-17). I guess you have to create a new date format or use US ordering before it will recognise dates.

I would be happy if Excel supported some way to specify the column type in the CSV itself. Ie first row headers, second row data type/hint, subsequent rows the actual data.

At the moment users need to remember, every time they import the CSV, to specify a data type for each incorrectly handled column. Suffice to say they don't always remember... and with large CSVs with hundreds of columns it's just a pain.

What might be better is what I'll call "CSV Schema": support for paired CSV files where one normal CSV has the data, and the other is a schema file whose headers match the data file's headers, with a standard extra first header prepended (call it "meta-name"), and whose rows each provide metadata for the corresponding column in the data file. The most common (and often the only) metadata row might be for type information, but the format could support additional metadata.

This would allow existing CSV tools that work with data files to continue to work without modification, while CSV Schema aware tools could make use of the schema.
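To make the idea concrete, here's a hypothetical reader for such a paired file. The "meta-name" column and the "type" row follow the proposal above; the type names and everything else are invented for illustration:

```python
import csv
import io

# Casts for the hypothetical "type" metadata row of a CSV Schema file.
CASTS = {"text": str, "int": int, "float": float}

def read_with_schema(data_csv: str, schema_csv: str):
    """Parse a data CSV, applying column types declared in its schema CSV."""
    schema_rows = list(csv.DictReader(io.StringIO(schema_csv)))
    types = next(r for r in schema_rows if r["meta-name"] == "type")
    out = []
    for row in csv.DictReader(io.StringIO(data_csv)):
        out.append({col: CASTS[types[col]](val) for col, val in row.items()})
    return out

data = "gene,count\nMARCH1,12\n"
schema = "meta-name,gene,count\ntype,text,int\n"
print(read_with_schema(data, schema))  # gene stays text, count becomes an int
```

Tools unaware of the schema could still read the data file as plain CSV, which is the point of keeping the metadata in a separate paired file.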

This is what happens on explicit import, but, formatting is per-cell, not per-column. "General" is the format for "idk, guess" - it stays General after a transform, it's not converted.

Because most people want to import numbers as numbers most of the time. Excel detects numbers and treats them as numbers. You do have the option to import them as text. So, Excel (as usual) does what's the best for most people.

The reasonable default should be not to change existing data. If the user wants to, by all means do it, but not before. If you want to be smart about it, offer a popup asking if a particular column or field should be interpreted as a number or date, but by default, keep everything a string.

Why does excel truncate a very long number silently? That is what happens with IMEI numbers. It cuts off the last 4-6 digits. It’s destroying numbers.

Excel should know any number that is longer than it can internally support should be stored as text. Or they need to introduce a data type that can store arbitrary precision numbers.
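For context on the truncation: Excel stores cell numbers as IEEE-754 doubles, which hold only about 15 significant decimal digits. A quick Python sketch of that limit, since Python floats are the same doubles:

```python
# Doubles carry ~15 significant decimal digits, so 15-digit IMEIs sit right
# at the edge and anything wider cannot survive a trip through a float cell.
wide_id = 123456789012345678  # 18 digits, purely illustrative
assert int(float(wide_id)) != wide_id             # not exactly representable
assert int(float(123456789876)) == 123456789876   # 12 digits round-trip fine
print("precision limit demonstrated")
```

So even if Excel kept the cell numeric instead of reformatting it, sufficiently long identifiers would still be corrupted; text is the only safe type for them.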

It's not that Excel can't support that long of a number in general. When you open up a csv in Excel, it is not being worked on in "Excel format" where it is displaying the number in scientific notation but it is still retaining the full number. If you click on the cell in the text window in the top it shows you the full number.

The asinine thing happens when you save it back as a csv. Excel doesn't think to itself "hey, I should write these full values I have stored in each cell to the csv file". It just writes whatever is currently displayed to the csv file instead.

Same here, but with part numbers in electronics manufacturing. Excel also removes leading zeros from package names such as 0402.

That also applies to phone numbers :-(

And zip codes.

Yes, and then compound this problem by transferring files with people that.... are somewhat technically illiterate and it turns into a nightmare.

You are bringing back nightmares; how I wish I had been Python-literate during my previous employment.

> Excel is fundamentally not suited to analysing *omics data

you are 100% correct. and yet it will continue to happen.

Yes, and? It should be a simple checklist item for any scientific journal to ask the author whether they used excel or not. If they say yes, reject the paper unless they can show their work hasn't been affected negatively by using excel.

On the practical side, how would people show that their work hasn't been affected negatively by using Excel? How would the journals evaluate that? I suspect this would just become another box to check - yes, we use Excel, yes, we checked the results, like the last 50 times.

Intellectually, it feels snobbish to single out Excel like this. I'm a software engineer in science, and I generally agree that scientists should learn some kind of coding. But you can make mistakes in Python or R as well - not to mention in physical experiments. We should check data and methods in general, not presume incompetence if people use one tool.

By providing a repo with raw data, and the code that runs on it that eventually produces the results that are in the manuscript. Anything else is just a bunch of handwaving.

It's not coincidental that big evil FDA/pharma requires people to sign - on paper - their lab notes (and to have lab notes, and to sign that they calibrated the equipment, and that they checked for common mistakes).

And yes, I know this costs a lot, and that this is sort of a pipe dream. And I'm not saying this is the best solution. Renaming genes might be a lot better, because it just allows the field to move on, and Excel will eventually die out, or at least researchers will move away from it - maybe to Jupyter, if we're lucky.

So, all in all, of course Excel is just a symptom, but society is already pretty stubborn when it comes to financing R&D and progress.

That doesn't solve the analysis problem downstream when a non-expert is getting started, doesn't know about Excel's anti-features, and starts doing analysis using the only tool that the world has ever told them is acceptable for tabular data.

Ideally Excel would change, but since we know it won't, and we want to work with lots of people with minimal problems, we must adapt.

Why should Excel change? Most people mean March 1st when they enter MARCH1.

Researchers should use other, more appropriate tools, or at the very least specify the column type when importing data into Excel. It's not that hard.

Do they mean March 1 when they import data from a text file?

It's one thing to change typed in user text in real time. That's not causing any problems. It's another to randomly mutate cells amongst tens of thousands of rows. I don't think that has ever helped anyone.

How are these non-experts getting started? If they are in academia, academia should start teaching these practical things too. (Yes, I know the problem is that many old school bigwig researchers are doing even worse things.)

There are people of all sorts: amateurs that want to play with the data but will not make study of it their primary field, trainees that will make use of tabular data extensively in their career, and then experts in more rarefied fields (e.g. immunology, clinicians, etc.)

If I can get a trained immunologist looking at my data, I'd much rather have 5 more minutes of their analytical skills than teaching them about common data exchange pitfalls.

yeah but then you'll just make them use SAS and nobody wants that. just try convincing anyone who graduated two decades ago to use something reasonable like R

But we didn't accept them as they are. We looked at two options, change excel or change gene names, and picked the one most likely to solve the problem. Gene names mean nothing anyway. We're going from random letters that mean nothing to humans and computers to still random letters that mean nothing but at least don't get confused for dates.

There was a time when a gene was named to facilitate search through a bookshelf full of laboratory notes. You could look at a gene's label and know exactly where to find references to it: the label would indicate the stack number, the shelf, the section, the volume, the page(s), and the paragraph. Labels such as Sonic hedgehog don't do that anymore.

I've been in the room when deciding what to call a gene that the lab had just characterized/identified as interesting for the publication they were working on, and let's just say that making it easy to catalog and find the gene wasn't really on their mind. They'd actually been using a deliberately obfuscated name in conferences to prevent people from scooping their work on the gene before, which is pretty common.

Conference is a different game, but that doesn't change what happens in the lab. If you look at the overall picture, you can see the potential for duplication of alphanumeric strings; this is why labels must be developed in conference, where you have a wider context than in your own lab. You can be scooped on research regardless of what label is given to a gene, and for that matter even if you publish first, as not everybody has the same ethical framework of honesty or due accreditation. In the time before computers were used reflexively, we would record data in cursive, in hardcover-bound notebooks. A putative gene has a name that allows it to be found; later, if things work nicely, that putative gene is confirmed, and if things work ideally, the locus can be established.


Around the late nineties the obsession with video games and netspeak took hold, and the popular names began to crop up.

Surely then the blame should go one step further back: CSV and tab-separated files are poor formats for *omics data, and Excel (and pandas, etc.) have no choice but to guess data types and metadata. Unfortunately, I'm just not sure there's any format that's sufficiently better to ever replace them.

> based on its glorified bugs

If someone types MARCH1, DEC1, OCT4, SEPT1 into a column, why should Excel not treat these as dates? Probably 99.999% of users would prefer this auto-understanding. To make users have to then select cell formatting and select date is a terrible UI decision.

It's odd to call such a useful feature a bug.

> Excel is fundamentally not suited to analysing *omics data

But it's excellent for a massive range of basic analysis and especially for a unified platform to share results, which is why everyone uses it. It's good for publication or proposal steps of creating graphs from data output from other more specialized systems, that usually lack the flexibility and widespread use that Excel has.

If my group uses some specialized, often in house, analysis software, my results cannot be spread to others unless I put them into a more common and accessible format. And nothing in the genomics and wider scientific community has anywhere near the widespread availability as Excel.

You might as well complain they also use Word docs instead of LaTeX or troff. Not everyone wants to spend days fighting software instead of doing work.

> If someone types MARCH1, DEC1, OCT4, SEPT1 into a column, why should Excel not treat these as dates?

It should treat these as dates, unless told otherwise. But Excel goes one step further, and throws away the text you originally typed in. That means, if you type "DEC1" into a spreadsheet, see it looks stupid, and change the data type of that cell to "Text", then it's too late, and Excel will show "37226". It has destroyed the original data and replaced it with the numeric equivalent of the date.

The data type of the spreadsheet cells should be a display format only. The text you typed in (or loaded from a CSV file) should be preserved, and the data type of the cell should only dictate how that is treated and displayed, not how it is stored.
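For anyone hitting this from the data side: a minimal sketch of how to keep gene symbols intact when working outside Excel, using pandas (the CSV content here is invented for illustration). Forcing every column to string dtype preserves the original text exactly:

```python
import io
import pandas as pd

# Simulated CSV with gene symbols that Excel would mangle into dates
csv_data = "gene,count\nSEPT1,12\nDEC1,7\nMARCH1,3\n"

# dtype=str keeps every field as the literal text, so SEPT1 stays
# "SEPT1" instead of being parsed into a date or a serial number
df = pd.read_csv(io.StringIO(csv_data), dtype=str)

print(df["gene"].tolist())  # ['SEPT1', 'DEC1', 'MARCH1']
```

The same idea applies when re-exporting: `df.to_csv(...)` then writes back the literal strings, so a round-trip is lossless.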

Excel does that, largely: cells have a datatype, a value, and a format (among other properties). Changing the format leaves the value unchanged.

However, when you input something into a cell, Excel has to parse it and decide what value and datatype to assign to it. That is just inevitable, and the same everywhere: in Python, `x = 3` assigns an integer, `x = 3.` a floating point, and `x='3'` a string. Similarly, hints (or explicit specifications of type) are possible in Excel.

However, given that, what you typed originally is not recoverable anywhere (Python doesn't record whether you wrote `x=3.` or `x = 3.00`, why should it?).

So, to the extent that it is reasonable, Excel does what you request, and to the extent it doesn't do it, it's not reasonable, I'd say: A cell need not maintain the original pre-parsing input.

> The text you typed in (or loaded from a CSV file) should be preserved

That’s even worse, since a user assumes the cell holds what it shows. Excel, for example, auto-completes entries, which is extremely useful for data entry and for accuracy. If it only held the few characters entered, it would have to repeat exactly the same autocompletes every time it opens, and would have to do so even if the data that originally caused the autocomplete had changed. All this is nearly algorithmically infeasible without turning the document into a history. And now CSV is useless for data transfer.

It’s vastly more sane to change to the data the user meant to enter, show it to the user for confirmation, and save that.

That the 99.99% use case bites the remaining 1 in 10,000 is a perfectly reasonable trade off, done throughout society in uncounted ways.

It is shortsighted to call this behavior a "useful feature". It was and is a grave and serious mistake.

Automatically converting SEPT1 to a date caused far more damage than the benefit of the fleeting convenience.

The damage that this "feature" caused and will keep causing goes far beyond the domain of bioinformatics.

Even in bioinformatics things have not been "solved", for the next decade the same gene will have two names when looking it up in previously published data.

> Automatically converting SEPT1 to a date caused far more damage than the benefit of the fleeting convenience.

Who are you to judge? For the novice, even for most users it'll be what they want. And for the expert (or even intermediate user) it is trivial to override.

How exactly would you trivially correct the massive number of published results in genomics data where SEPT1 was converted to a date?

20% of publications had errors introduced by Excel in them! The real number is probably much higher (Nature had a rate of 35%). What is the rationale in saying that the benefits must still be worth it?

See: Gene name errors are widespread in the scientific literature


The root of the problem is not how Excel displays information, or how Excel parses the data, but that the data is tacitly changed upon saving.

Open a CSV file that has SEPT1, don't take any action, save it right away. Boom, the data has been changed and you cannot recover the original information.
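A crude way to audit an already-saved file for this kind of damage is to flag values that look like Excel's date output in a column that should contain gene symbols. This is only a sketch - the regex patterns cover common Excel renderings ("1-Sep", ISO dates, bare five-digit serials) and the sample column is invented:

```python
import re

def suspicious_gene_values(values):
    """Flag entries that look like Excel date output rather than gene symbols."""
    # Excel typically renders converted genes as e.g. "1-Sep",
    # "2006-09-01", or a bare 5-digit date serial number
    date_like = re.compile(r"^\d+-[A-Za-z]{3}$|^\d{4}-\d{2}-\d{2}$|^\d{5}$")
    return [v for v in values if date_like.match(v)]

column = ["TP53", "1-Sep", "BRCA1", "38961", "2006-03-01"]
print(suspicious_gene_values(column))  # ['1-Sep', '38961', '2006-03-01']
```

It can't recover the original name (SEPT1 vs SEPT2 is gone), but it tells you which rows were hit.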

So we should not use computers to automate common tasks? In this case likely 99.99% of the time someone enters this they want it to be a date.

It's one thing if it were typed in, but this is not typically from typing, it's from opening a CSV file that has a field that's OCT4. (And honestly, if I want a date, I'll hit the date formatting button, Excel.)

It is super hard to figure out a way to import data into Excel without triggering auto-conversions. I knew a way a few years ago when I was dealing with lots more biologists, but I have forgotten now. If I had to do it now, I would write XLSX directly then manually check the file in Excel, because Excel is fundamentally untrustworthy.
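Writing the XLSX directly, as the parent suggests, can be sketched with openpyxl (the file name and data are made up; this assumes openpyxl is installed). Since the values land in the file as strings with an explicit Text format, Excel never gets a chance to reinterpret them:

```python
from openpyxl import Workbook

# Build the workbook in code so Excel never parses a CSV at all
wb = Workbook()
ws = wb.active

rows = [("gene", "count"), ("SEPT1", "12"), ("MARCH1", "7")]
for row in rows:
    ws.append(row)

# Mark the data cells explicitly as Text ("@" is the built-in Text format)
for row in ws.iter_rows(min_row=2):
    for cell in row:
        cell.number_format = "@"

wb.save("genes.xlsx")
```

Opening the result in Excel shows SEPT1 and MARCH1 verbatim, because the cells already carry a Text format.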

So much this. I work in educational software, and have a little write up about the troubles we’ve had with schools “fixing up” CSVs from their student information systems before sending them to us. The simple act of opening the CSV and hitting save loses data (Microsoft actually fixed some of the problem very recently). That said, they will use Excel, it’s not in my control.

I received dozens of comments “Just use Open Office / Libre Office”. I am but a man, I cannot change the world. I forgive that which I cannot control. We work with tens of thousands of schools. Getting all of their administrators to install Open Office, and thousands of administrative assistants to remember to use it rather than Excel is simply an impossibility.

- https://donatstudios.com/CSV-An-Encoding-Nightmare

I’ve shifted to trying to get people to use the Power BI data import function, at least. It offers the ability to import your data without mangling it.

If the gravity metaphor doesn't reach you, maybe this will:

The world speaks English. It's an illogical and absurdly spelled language. Esperanto is miles ahead on all logical measures. Yet we will never switch, both because the cost of switching is enormous, and because there is no way to coordinate/organize such a move.

> The world speaks English. It's an illogical and absurdly spelled language.

Man, the extent the aviation world has to bend itself to avoid the pitfalls of English....

"to" and "two" sound alike, and so do "for" and "four".

"Turn left t(w)o four zero" or "Climb t(w)o eight thousand"

To avoid that, the lingo is:

"Turn left heading two four zero", "Climb and maintain two eight thousand" (or, "Climb flight level two eight zero").

At any rate, switching to metric would be much better (If I'm 2000 metre high, and I have a glide ratio of 1:10, I can glide 20,000 metre or 20 km. If you're 5000 feet high, and you have a glide ratio of 1:10, how many nautical miles can you glide?), but for the reasons you mentioned, it won't happen anytime soon.
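The parent's glide arithmetic, as a throwaway sketch (the conversion factors are the standard definitions of the foot and the nautical mile):

```python
FEET_TO_METRES = 0.3048   # exact, by definition
METRES_PER_NM = 1852      # exact, by definition

# Metric: 2000 m altitude, 1:10 glide ratio -> doable in your head
print(2000 * 10 / 1000)  # 20.0 km

# Mixed units: 5000 ft altitude, same 1:10 ratio, answer wanted in NM
glide_m = 5000 * FEET_TO_METRES * 10
print(round(glide_m / METRES_PER_NM, 2))  # 8.23 NM
```

The point isn't that the conversion is hard for a computer - it's that nobody gliding an airplane wants to do it mentally.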

Don't they say "turn left heading two four zero" which disambiguates this? Not knowledgeable here, just that sticks in my brain.

I'm surprised people don't use other words for digits, as they do for letters (alpha, bravo, charlie...).

Most of the digits sound distinct enough that they don't get confused- except "nine", which can sound like "five" over a bad radio link, and is therefore pronounced "niner".

Also... Excel is great. What exactly is "ahead" of Excel? People on here really underestimate how great of a product Excel is because of its cultural affiliations.

Excel is great, and generally really useful. But not so much in microbiology. Sure, if you have several dozen genes/proteins with some quantitative values and some conditional formatting to visually distinguish certain properties, it's fine.

But in microbiology you usually work with very large datasets that undergo a lot of calculations through a lengthy pipeline. Larger sets and more compute-intensive ML approaches even need HPCs. Inserting Excel into this pipeline would be catastrophic.

Oh yeah, I'm totally aware. My day job is designing and building software to construct/maintain exactly those pipelines.

What I'm saying (which is not news to you) is that clearly Excel is excelling along some dimension that people really care about. Even if it is lacking along other dimensions, the goal ought to be figuring out what makes Excel so good, then figuring out how to reconcile that with the needs of e.g. a high scale genomics pipeline.

But to make even an inch of progress against that, we need to get rid of this "haha Excel is for plebes" attitude that many in the software community adopt.

Excel excels at being already installed on most computers. That's it.


Most computers also have a built-in calculator app, but I have never heard of those being difficult to pull from the hands of users, even when replacing them with purpose-built software that costs tens of millions of dollars per year.

But what do I know :)

LibreOffice Calc is free and doesn't corrupt data. It also has full support for UTF-8.

In the case mentioned in the article, LibreOffice is on par with Excel. Don't believe it? Open up LibreOffice, type "MAR-1" or "SEP-2" in any cell, and watch how easily your data is corrupted.

I have met some other glitches: identifiers in the form "1E123" get turned into scientific notation. The column was something like "1A123", "1B123", etc. These things are sneaky: you can have thousands of rows, and Excel/LO doesn't mind if only 0.1% of them match its rules for smartness. They just change without notice, leaving the others intact.

I'm a bioinformatician, and this kind of stuff is a daily issue. IMO, two problems collide:

1. People with very rudimentary knowledge of computers. I've met some people way smarter than me, with thousands of papers written, who cannot open a CSV with R with a gun to their head. They can cut you, put three robotic arms in your heart, remove a tumour, sew you up, and send you home in three days. But R is just too much.

2. People who need to collect and analyze data, and Excel is the easiest tool they know, but at the same time it's no lame toy: it sorts, it sums, it filters, it does stats, it graphs... You cannot ask these people to use SQL, especially if it involves foreign keys. They just use Excel for data collection, storage and analysis.

Excel (and LibreOffice) are to blame. This shit could be avoided if they had a "scientific mode" that isn't smart with your data, and which would be the default for any CSV/TSV file. Unless you explicitly turn the file into an XLSX/ODS or ask "turn this range of cells into dates for me" or "these cells are scientific numbers", they should not change a single dot.

> This shit could be avoided if they had a "scientific mode" to not be smart with your data, and be the default mode for any file in CSV/TSV mode.

There's a whole lot of business use of CSV/TSV with Excel that benefits from the existing defaults. Heck, I've seen federal government websites distributing code lists that are actually in CSV format with .XLS extensions and that are expected to use the default “smart” conversion.

Even if Excel should arguably have had different defaults, the impact on other uses of changing it now would be enormous.

Then we are doomed to the lowest common denominator. The bad actor here is the one who expects the software to magically change their data; they should expect an error, or at least a warning or a confirmation prompt.

The good actors are the people who carefully curate their data, only to have it silently corrupted by the program.

Let's say then: keep the default, but add at least some sort of "respect my data as it is, because this is a CSV with a CSV extension" mode.

> Lets say then, keep the default but add at least some sort of "respect my data as it is, because this is a CSV with CSV extension" mode.

CSV doesn't really have data types (or even a hard and fast standard), so that doesn't really work. OTOH, Excel lets you specify data handling by column when importing from untyped text-based formats like CSV, which you should probably do if you care.

Or you should write up a standard for a CSV-like format with explicit semantics around data types (it could even be CSV-compatible so that you get no worse results than normal CSV handling if using software that is aware of CSV but not the new format) and start using it where those defined semantics are useful, and try to get Microsoft to add specific support to Excel for it.

> CSV-like format with explicit semantics around data types

You could encode the column names in row 1 and column types in row 2. JavaScript type names would be convenient for most CRUD work, but SQL type names might be more appropriate for more complicated data.
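A minimal sketch of that idea - a hypothetical "typed CSV" with names in row 1 and types in row 2 (the type names and data here are invented; this is not an existing standard):

```python
import csv
import io

# Row 1 = column names, row 2 = column types, rest = data
raw = "id,count\ntext,int\nSEPT1,12\nDEC1,7\n"

CASTS = {"text": str, "int": int, "float": float}

reader = csv.reader(io.StringIO(raw))
names = next(reader)
types = [CASTS[t] for t in next(reader)]
records = [
    {name: cast(value) for name, cast, value in zip(names, types, row)}
    for row in reader
]

print(records)  # [{'id': 'SEPT1', 'count': 12}, {'id': 'DEC1', 'count': 7}]
```

A plain CSV reader that knows nothing about the convention just sees one extra header-ish row, which is the backward-compatibility property the parent describes.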

> if they had a "scientific mode" to not be smart with your data

That "scientific mode" is called text. Assign the column type Text during CSV import and nothing is converted to numeric or date.

LibreOffice also explodes (or at least did a few years ago) if you try to open a few-hundred-MB CSV file. It definitely can't do fast filtering on columns in large files.

Should those things be done in sql? Yes... could this data be sent as SQLite files? Yes... were they endemically sent as csv and nothing could change this? Yes...

Excel is by far the best of the "spreadsheet" apps I've seen at munging through multi-hundred-MB CSV files.

makes sense!

fwiw, i have not tried to use libreoffice for any heavy lifting, unfortunately (fortunately?). i tend to use bash utils for that, and i 100% do not think telling people to "just use grep / awk / perl and csvcut" is by any means a solution.

It also doesn't use the decimal separator I have specified in my user profile, so it doesn't handle me pasting data from another application...

Unfortunately, you are right :( But hope never dies completely.

English as the lingua franca of Western Europe is less than a century old. My mother studied French, because it was the dominant language of her time.

As a Spanish speaker, I think Interlingua is better.

This calls to mind a quote from Cormac McCarthy's Blood Meridian:

  It makes no difference what men think of war, said the judge. 
  War endures. As well ask men what they think of stone. War 
  was always here. Before man was, war waited for him. The 
  ultimate trade awaiting its ultimate practitioner. That is 
  the way it was and will be. That way and not some other way.
Just swap out war for Excel! "It makes no difference what men think of Excel, said the judge. Excel endures. As well ask men what they think of stone."

This accords with my experience.

When I worked in clinical trial design/analysis, no one did their analyses in excel, but for initial data formatting and clean-up? /everything/ went through excel, even if the final cleanup was a python script.

You’re not going to have the same ease of eyeballing your data in SPSS or SAS or R.

I'd wager that for formatting and clean-up, it's a lot faster to proceed via R or Python (notebooks), and it can be reproduced a lot more easily.

I understand that Excel allows for a quick glance, but I'm not sure how R does not offer the same visual cues? Assuming that the data originates from some csv/table structure, Excel requires some GUI clicking and R needs a read.csv() call.

I would say R is a lot more useful for a quick check on whether all columns have a single data type, contain NaN/NAs, are of equal length, etc.
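The same quick sanity checks take a couple of lines in pandas too (a sketch; the column names and values are invented):

```python
import io
import pandas as pd

csv_data = "gene,expr\nTP53,1.2\nBRCA1,\nSEPT1,0.7\n"
df = pd.read_csv(io.StringIO(csv_data))

# One glance gives you column dtypes, missing-value counts, and row count
print(df.dtypes)        # gene: object, expr: float64
print(df.isna().sum())  # expr has one missing value
print(len(df))          # 3 rows
```

`df.describe()` and `df.info()` give fuller summaries along the same lines.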

True in every sense. We shifted from Google Sheets to O365; there's just no way to get around Excel. It's efficient, and everyone from a novice to an expert can find their way around it.

Pray tell, why did you switch from Sheets?

The dataset was growing and Sheets was slowing down. Excel installed locally on a machine with good hardware is a pleasure to work with. As the team grew, more and more people were asking for Office licenses, and like I said, we have colleagues who've been using Excel for ages and like to have 2-3 files open on multiple screens, working between files using references. It's just not as fast as using Excel locally.

I'd like to add that we also had an issue with our clients using only Excel, and there were issues with calculations not working - not often, but with enough people making noise, our management decided to get the subscription. I've been using Sheets for years, even as a student when I couldn't afford an Office license.

> to people asking, "why do they use Excel?"

I often read comments on HN starting with “to people asking” or “to the people wondering”... and introducing a new discussion, usually on another similar topic, basically hijacking the whole discussion.

“Newton had a great time for a long time with his description [of gravity], and then at some point it was clear that that description was fraying at the edges, and then Einstein offered a more complete version.”

As for flat spreadsheets, they serve their purpose just like any other tool. I don't think I've heard people complain about spreadsheets for dealing with flat data, the same way no one complains about hammers for dealing with nails. If we're trying to deal with interconnected enterprise resources, we don't maintain those resources in Excel; we store them in Enterprise Resource Planning software or build relational databases (which are as old as gravity). Nothing is wrong with spreadsheets - they're the sweetest way to deal with quick data, until we use them for things they can't do.

Whenever people ask me about getting into the field and whether they should learn Python or R or Scala or whatever else, I always respond that you need to get really good with Excel/Google Sheets and SQL first; then worry about everything else. I can't tell you how many times I've done an analysis I'm really proud of in something like R Markdown, only to have the end user say "that's great, now can I get it in Excel?" I just deliver everything in Excel as a default now.

> good with excel/Google Sheets and SQL

Those aren't very good for cleaning and wrangling data, even if the output for analysis ends up being a spreadsheet.

Agreed that spreadsheets aren't great for that, but totally disagree about SQL. SQL is the best cleaning and wrangling data tool out there. There's a reason the trend in the industry is towards tools like DBT. There are certainly things for which you need to use Python or R, but for 95% of tasks some combination of SQL and a spreadsheet will get you where you need to go.

The hyperbole here is amazing. You're saying that a 40-year-old computer program that runs exclusively on Windows is as unchangeable as a _law of nature_.

The English comparison made in a peer comment is more apt, but wow.

The tech industry is so self-centered that we think scientists should change their fields because of our bugs.

Excel is older than many of us, so yeah, having it around for our whole lives does really make it seem like a law of nature. Sooner or later, all data ends up in Excel. I don't like it at all, but that's what happens.

As for this being an Excel bug: many, perhaps most people don't think it is. That's why it's so permanent: almost everyone likes it.

Also, on the list of dumb things that make me roll my eyes at Microsoft, interpreting the string "MARCH1" as a date and trying to normalize date formats isn't exactly the dumbest.

OK then. Use Excel. But at least know the basics of Excel. Such as simply right clicking the column header and picking a different format so that it will stop auto-interpreting the input.

Nah, I plan on having HDF5 or Apache Arrow be the portable table of data and then building WASM-based viewers - it's the Excel killer, but the war of attrition will be long!

Ah, just make the data file larger. No one is going to use a spreadsheet tool when there are millions of rows (I hope).

Excel is quick and easy, from startup to import to calculation. By the time it's done, you wouldn't even have imported a single line with Python or R.

And other fields are probably stuck with bad word processing papercuts.

Excel should belong to the commons at this stage
