Of course Microsoft has thought about this; it's silly to think they never have and we're smarter. Yes, this behavior is annoying and US-centric, and they know that. But they also know that breaking compatibility with all these scripts and macros would be the worse problem in the larger picture. That picture is huge: it would take an effort on the order of Y2K to modify all the code everywhere that has ever touched Excel dates.
Would you want Excel to introduce a "quirks mode" to handle this sort of thing?
that's a fairly sophisticated solution, but it's just one of many potential approaches to fixing the problem.
MS (names are harder than I thought / remembered) Matrix
In the back end they'd share the same sane core and 99% of the code, but Excel would add in the cruft / warts / compatibility quirks.
And unfortunately my guess is that most Excel users don't surf HN or other geek sites.
They could easily make it an optional behaviour.
They could have a scripting environment variable that is "quirksMode=ON" by default and would maintain backwards compatibility at the small expense of needing to specify sane behaviour as an exceptional circumstance: just another line of boilerplate.
There's a lot of things they could do. They've no doubt considered all of them and then some, and yet they've decided to do almost nothing. I suppose that says something about the disconnect between how we and they perceive their incentives, but I'm not sure exactly what.
A million checkboxes in the settings menu later, people will be griping that "Excel is terrible for having all these crazy settings to deal with!"
It is sorted from the greatest unit (year) to the smallest unit (second). If you treat them as text and sort alphabetically they still get sorted from oldest to newest.
Other formats don't sort properly.
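To make the point concrete, here's a quick sketch (with made-up sample dates) of why ISO 8601 strings sort chronologically even as plain text, while US-style slash dates don't:

```python
# ISO 8601 puts the biggest unit first, so plain alphabetical sort is chronological.
iso = ["2021-01-05", "2020-12-31", "2020-08-06"]
us = ["01/05/2021", "12/31/2020", "08/06/2020"]  # the same dates in mm/dd/yyyy

print(sorted(iso))  # ['2020-08-06', '2020-12-31', '2021-01-05'] – oldest to newest
print(sorted(us))   # ['01/05/2021', '08/06/2020', '12/31/2020'] – scrambled
```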
writing dd.mm.yyyy is like writing time ss:mm:hh
writing mm/dd/yyyy is like writing time mm/ss/hh
If I really have to put YYYY at the end of the date I use the 'dd-mmm-yyyy' format which excel translates based on client locale:
- 13-mar-2020 in enUS
- 13-bře-2020 in Czech
So for any kind of practical planning that day-first order makes some sense, but it's not a hill I'd die on.
ISO 8601 is great.
We have ISO 8601, no need to reinvent the wheel as a square.
https://xkcd.com/1179/ (date only)
Contrast this with the slash or dash notation, where both mm/dd/yyyy and dd/mm/yyyy are prevalent in the world. But 2020-08-06 is unambiguous in the same way that 06.08.2020 is: the only convention in common use is yyyy-mm-dd.
While slash is the most common, use of hyphens and dots as separators for US-style dates is not at all unheard of in the wild for manually formatted dates. The fact that it doesn't show up in lists of national “preferred formats” and doesn't tend to be commonly implemented as a prebaked format in software doesn't mean it's not a real thing people see and will interpret dates they see in light of.
So, yeah, there is a convention to write dates exactly that way.
See parent source: https://en.wikipedia.org/wiki/Date_format_by_country
> Since 1996-05-01, the international format yyyy-mm-dd has become the official standard date format
Shouldn’t Americans be very familiar with the idea that official norms (the metric system) play no role if the people don’t want to use them?
* YYYYMMDD for dates
* YYYY-MM for months
* YYYYMMDDTHHMM for time where T is the capital letter T. Two additional digits can be added for seconds and then as many additional digits as needed for precision.
[Edited for format]
(I'd love to discuss this further... I'll be back on HN later tonight, sometime around 16834005.)
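If it helps, the separator-free "basic" forms listed above parse fine with stock Python; a minimal sketch with invented timestamps:

```python
from datetime import datetime

print(datetime.strptime("20200806", "%Y%m%d").date())      # YYYYMMDD
print(datetime.strptime("2020-08", "%Y-%m").date())        # YYYY-MM (day defaults to 1)
print(datetime.strptime("20200806T1342", "%Y%m%dT%H%M"))   # YYYYMMDDTHHMM
```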
I have an infinite number of dashes, I'll send you a lifetime supply!
- use the mouse to select precisely the part of the word/URL/string you want for Ctrl-C purposes
- it autocorrects your selection to the whole word including a CrLf if it is nearby.
The only workaround I've found to be effective is selecting a few letters on one end of your intended selection and using ctrl+arrows to precisely select.
Have you ever posted a code snippet into a default install of Outlook?
That said, I completely agree with the tangential issue of U.S. dates being misleadingly different in format compared to non-U.S. Always an issue when teaching data/spreadsheets to a class with at least one non-American – but also a good reason to teach them the value of ISO8601 :)
If you want to know what date and time it will be 476 days and 12 hours from today you can just do =TODAY()+476.5
This is very useful when the requirement is to 'happen every 10 days' or you're looking for '30 days in the past'
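The same dates-are-just-numbers arithmetic carries over outside Excel too; a rough Python equivalent (a fractional day is 12 hours):

```python
from datetime import datetime, timedelta

today = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
print(today + timedelta(days=476.5))   # roughly =TODAY()+476.5
print(today - timedelta(days=30))      # '30 days in the past'
```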
Excel has a very low barrier of entry compared to pandas while boasting an immense amount of power and features. I think it was not an easy challenge to keep it going over the decades.
> US dates
I'm a non-US person and I fixed this simply by changing my locale to enUS everywhere. It has an added benefit of not having weird translations of everything in random tools and excel functions not being localized to their cringy versions in my native tongue.
Also, data being interpreted differently based on region/language settings is a sure way to end up with bugs, so I think it's a terrible thing that Excel does this.
It's not "techies know best" whining. We've been increasingly computerizing the economy for the past 50+ years; it's past time for societies to adapt to that reality, instead of wasting time and money on dealing with dozens of date formats, number separators, currency notations, etc.
The US officially committed to switching to the metric system decades ago, and there is a shitload of benefits in doing so. Has it worked? Not really. It's still using the old system everywhere.
So the solution cannot be to get rid of locales, but to actually use them properly:
- Always use the right locale for the job (the OS or browser should be the oracle for the right locale to use)
- Read the data in the user defined locale
- Store the data in some canonical form (e.g. store numbers as number types instead of strings, use ISO-8601 for dates, ...)
- Write the data to the user defined locale
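A minimal sketch of that read-locale / store-canonical / write-locale round trip, assuming a German locale is installed on the machine:

```python
import locale
from datetime import date

locale.setlocale(locale.LC_ALL, "de_DE.UTF-8")   # assumption: this locale is available

raw = "1.234,56"                                 # user input, German number formatting
value = locale.atof(raw)                         # 1. read using the user's locale
record = {"amount": value,                       # 2. store canonically: a plain float...
          "when": date.today().isoformat()}      #    ...and ISO 8601 for the date
print(locale.format_string("%.2f", record["amount"], grouping=True))  # 3. write localized
```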
And then, there's the problem of users - whatever locale they have set in their system was most likely not set by them, and is often misaligned with what they're naturally using.
There's a lot of bugs and issues happening to people every day that could be removed if major software vendors said, "sorry, the only allowed input format for date is ISO8601, and dot is the only valid decimal separator; take it or leave it".
Furthermore, I would argue that the very notion of user locale based localization by default is misguided, since it is fundamentally no different from automatic translation of content based on the users' UI language setting. It is a form of misrepresenting the content, though with things like dates you usually don't lose much information in the process.
The problem doesn't even have a correct solution, because the locale settings in the OS and the browser aren't often set by end users (a regular person probably doesn't even know that they exist) and the defaults depend on many random factors (like which store you bought your computer with preinstalled Windows from).
b) There is no "." on the num-pad for German keyboard layouts, they have a "," the German decimal separator.
I'm surprised there hasn't been a dotfile option added yet
However, Excel remains a nice tool for "I'll just look at this CSV with the final results from this analysis, sort it by correlation, and see if any of the usual suspects are up top". And if the next step is "yeah, that looks fine - I'll just copy the top 100 genes into this convenient GUI pathway analysis tool", you're suddenly exposed to whatever Excel did to your data.
And as for "why not libreoffice", most researchers I personally run into are strong molecular biologists who've learned a subset of R for their uses; they're not really likely to go out and find libreoffice on their own. Besides, the writing process for papers includes sending drafts and spreadsheets to doctors and pathologists and editors, who are probably on hospital computers with a short whitelist of programs ... and I don't really want to debug subtle compatibility issues in the sort of garbage fire those documents can turn into.
Journals don't care, referees/reviewers might care, but they are unpaid and they usually don't want to rock the boat that much.
Now imagine you load a file with 1000s of values in the form AB/CD, many are trashed. If you save the file, you've lost the original data.
All because it might save some data entry drone 5 seconds to expressly make something a date.
There are then also issues about whether "01/02", assuming it actually is a date, should be the second of Jan (US) or the first of Feb (EU, UK, North America ex US, Africa and Pacific regions). In many places, based on some arbitrary and well-hidden settings, you will get the wrong result.
I honestly think the only reason excel does this is to force you to use excel formats and make it harder to work with non-MS products...
I had to build a very basic CSV->XLS tool once because the built-in CSV import kept screwing up. Admittedly the CSV files were slightly mis-formatted in places, but that wasn't the only headache with the import.
I don't see a problem with that, as it's not undefined behavior – you know exactly how Excel will treat those values
If they want rawdata="01/02" to display as "Jan-02" (or whatever), that's annoying but I can fix it. But they also delete the raw data and replace it with "43862". Reformatting cannot fix that and it is Excel that has chosen to actively break it.
They're not even self-consistent with this: If I carefully make sure the data is correct (use "'01/02"), then save as CSV and load the same file, it breaks. What sort of program can't save/load without losing data!?
That's without touching why they need to interfere, or whether the US standard is the correct one to use, or the fact that Excel is nowhere near this aggressive with any other data format.
(edited to correct feb to jan)
Excel is biased for US users.
This doesn't follow. There's no year in 01/02. MD is more popular than DM.
I made no allusion to DMY and MDY being the only possibilities, because that would be ridiculous.
You did indeed. How else could you interpret this exchange?
>>> "01/02" does not translate to Jan 2nd in most of the world, because DMY is much more popular than MDY.
>> There's no year in 01/02. MD is more popular than DM.
> In which country do people use DMY, but also MD?
The only way to have that question make any sense at all is to assume that MDY and DMY are the only options. That certainly is ridiculous, but I'm not the one who said it.
Note that none of these points require the absence of any other way of writing dates. You could indeed argue, preferably with examples and not hypotheticals, that some locales exist in which MD would be the natural way, and that they outweigh the others.
Now you can move the goalposts once more if you really need to. It really is tedious.
But here's Excel's trick: you type in 01/02, Excel interprets that as January 2nd and switches it to the underlying OLE date format (some number in the 40000s). Ship that Excel file across the ocean, where they would write Jan 2 as "02/01", and it shows them the date as "02/01". Excel uses your local date format preferences.
This is one reason why it is important for Excel to convert the raw input to another format. I'd probably prefer that it didn't touch it for CSV files, but I get it for .xlsx
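For anyone curious what that "number in the 40000s" actually is, here's a small sketch of the serial-date scheme (using the 1899-12-30 epoch to sidestep Excel's historical 1900 leap-year quirk):

```python
from datetime import datetime, timedelta

EXCEL_EPOCH = datetime(1899, 12, 30)   # offsets the fictional 1900-02-29 in Excel's calendar

def to_serial(dt):
    return (dt - EXCEL_EPOCH).days

def from_serial(serial):
    return EXCEL_EPOCH + timedelta(days=serial)

print(to_serial(datetime(2020, 2, 1)))   # 43862 – the serial mentioned upthread
print(from_serial(43862).date())         # 2020-02-01 – then rendered per the viewer's locale
```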
It's people using a tool without knowing said tool. You can disable auto-formatting (or even better yet - set the column data type) with a simple click.
Back in the day I actually wrote a function that would undo this for some sheets that people kept breaking...
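A rough sketch of what such an "undo" helper can look like: map cells Excel turned into dates back to SEPT1/MARCH1-style symbols. Illustrative only; a real fix needs the official gene list rather than this tiny lookup table.

```python
from datetime import datetime

MONTH_TO_PREFIX = {3: "MARCH", 9: "SEPT", 10: "OCT", 12: "DEC"}  # the usual suspects

def date_to_gene(cell):
    """If a cell holds a date whose month matches a gene-name prefix, rebuild the symbol."""
    if isinstance(cell, datetime) and cell.month in MONTH_TO_PREFIX:
        return f"{MONTH_TO_PREFIX[cell.month]}{cell.day}"
    return cell

print(date_to_gene(datetime(2020, 9, 1)))   # 'SEPT1'
print(date_to_gene("OCT4"))                 # untouched cells pass straight through
```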
The result is a rich deep program that users can grow into, rather than a shallow trivial program that optimizes for the noob experience and leaves power users out in the cold.
Recently I was working with some colleagues, me being the computer-savvy one and them the lab people. I sent them some data in CSV that, when opened in Excel, turned 123.456 into 123456 (it was a problem with locales, some people using "," as the decimal separator and some using "."). We noticed because the values should be between 0 and 1000. But what if the column could be between 0 and 1000000? A small number of values bumped up by three orders of magnitude could fly under the radar and distort further measurements. And the error is undetectable forever once published.
I like the programming language approach better: look, this is how you write a string, this is a char, this is a float and this is an integer. "2020-08-04" is a string until you ask me to turn it into a date. "SEPT1" is a string, and you're going to have to do quite the gymnastics to make me understand it as "date(2020, 9, 1)". Do you like "," or "." as the thousands separator? Then we first turn the number into a string and then format it, but the original number is kept.
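A tiny sketch of that "everything stays a string until you ask" approach, with invented sample values:

```python
from datetime import datetime

raw_date = "2020-08-04"                                    # still just text
as_date = datetime.strptime(raw_date, "%Y-%m-%d").date()   # converted only on request

raw_number = "1234567.89"                                  # '.' is the decimal separator by declaration
as_float = float(raw_number)
pretty = f"{as_float:,.2f}"                                # '1,234,567.89' – formatting on a copy; raw is untouched
print(raw_date, as_date, raw_number, as_float, pretty)
```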
I use python almost exclusively for data analysis but still open files in excel to view them for lots of reasons.
Nothing in excel is reproducible, but it’s still on all my computers.
Did an analysis a few years ago of the programs run at my organization by job series, and scientists run a lot of Excel.
Back down? Or pick a better tool. If Excel proves to be an unreliable tool for your job, use a better one. Alternatives exist, ranging from Google Docs, and LibreOffice, to simpler light-weight spreadsheets. Or possibly more specialist tools.
Why does everything always have to be put in Excel if Excel is such a poor tool for so many things?
Excel is a “pretty darn good” tool for 95% of tasks. If your work has highly varied workflows, then that flexibility more than makes up for its failures on the last 5%.
If you have very specific workflows on the other hand, you may find value in replacing Excel with a specialist tool. But let’s not pretend that specialist tools don’t also have their own shortcomings; at best they’ll achieve 99.9% coverage of tasks.
How can Excel possibly be too complicated for simple tasks? It is pretty much as straightforward as it gets when it comes to grid-file viewing and editing. You can show it to anyone from a high-schooler to a 60-year-old colleague with minimal computer experience and they will figure it out rather easily; good luck teaching Python/Pandas to the latter.
>"not sophisticated enough for complex tasks"
Not sure how that works either, really. Between formulas and VBA macros, people have made and are making tools complex enough that they have no business being in Excel, and yet there they are, even if it isn't the best tool for the job. Once you go past that point, Excel isn't even in the conversation, nor does it pretend to be.
It has issues, and people playing or working with complex (or simple) data would be better served to learn programmatical tools, but until they do Excel will serve them well as long as they stay wary of basic quirks.
Excel is too complicated to send an e-mail for example. You can send e-mail from excel but it's much simpler to do in Outlook.
Excel is too complicated for simple math: if you want to do 1024*436 it is better to fire up a calculator rather than wait for Excel to start up.
And while the calculator might be marginally faster if you exclude start-up time, Excel is still perfectly fine for your use case; it only requires typing one extra symbol compared to the calculator, namely "=1024*436[ENTER]". Hell, it's probably faster to do it in Excel if you already have an active window. I don't see how it could possibly fulfill the task in a simpler manner.
Is R a bad statistical tool because I can't use it as Photoshop?
You lose your history this way, while you can easily keep it in Excel.
If you want to run some simple calculations on numbers, Excel is vastly overpowered and baroque. New version? Hope you've learned the new magic series of GUI clicks. Just like Word, Excel files are not inspectable (again, more baroque 'magic').
I can easily run calculations on plain texts tables in Emacs (along with the normal conveniences of being able to reorder by row, column; apply formulae across a range of cells etc.). And that's all displayed in plain text. And I can easily add new functions, keybindings, whatever, to suit my needs.
But the fact that you say the arguments against Word "can indeed be just as easily dismissed" makes it unlikely that this will be a productive dialogue.
For Word, it's way more complicated than most people need, and again, users are at the (non-existing) mercy of the application. It's not very user-configurable - you have to learn how Word expects you to do things. Again, its data files are non-inspectable monstrosities, with minor corruption causing complete corruption, and fragility between versions.
I manipulate text for a living. For simpler things, it's much easier and productive to use something like Markdown or (for more complex things) something like Org-mode. For complex things, I need LaTeX - Word simply lacks necessary functionality.
So, Excel is too complex for simple calculations, but you expect my mother and nearly retired colleagues who never used non-GUI apps to learn Emacs to add two columns together? Is this a joke? Besides, the claim that you can do it so much more easily with Emacs, regardless of its veracity, is fallacious at best and says nothing about Excel.
Yeah, you need to learn how the program works, that's how programs always work. Are you telling me you didn't need to learn how emacs and Latex work when you first tried it? I for one have never met someone who didn't learn how to use Word, it's pretty damn easy.
>I manipulate text for a living
Here, we find the problem. You are a specialist and you seemingly lost touch with the way average people use Excel and Word. The only argument I see in these two comments is that Excel and Word are not suited for specialist needs, which is probably true, but I don't see a hint of why they are not suited for basic operations or their intended use cases. So I will see myself out, thanks.
Unilaterally declaring you will not use names that confuse Excel will fix the problem (until we get new month names, anyway), and not require anyone outside their sphere of influence to comply.
I had the same situation at work. As an acquired company we had servers named devX; the new parent also had servers named devX and pushed ssh config files to our laptops that made it impossible to connect to our servers. We asked them to fix it the first couple of times, but eventually renamed our servers to vedX to avoid the issue. It wasn't the right solution, but it was a stable solution and let us get on with our lives instead of fighting with IT.
* A well-known tendency to mangle date and gene data under the guise of being 'helpful'.
* Easy to make mistakes when cutting and pasting cells.
* Difficult to see what is going on in a spreadsheet.
* Poor handling of CSV files.
Some of these shortcomings are inherent to spreadsheets. Others are specific to Excel, but hard to overcome due to the weight of backward compatibility.
I have written a product for transforming and analysing tabular data (https://www.easydatatransform.com) that tries to overcome these issues:
* Doesn't change your input data file.
* Doesn't re-interpret your data, unless you ask it to.
* See changes as a visual data flow.
* Operations happen on a whole table or column.
* Good handling of CSV files.
Also it doesn't try to do everything Excel does.
It is a fairly new tool. Would appreciate some feedback.
Edit: great call out to 7 days of non consecutive use
Excel is no longer motivated by the original intention of a spreadsheet, and now caters to the lowest common denominator: a piece of graph paper. As such, MS has shifted its focus from doing calculation to being a text and graphics layout tool.
white text copied from a terminal: white text on a white background, you got it!
comma-separated numbers: by default, one long string with commas in it
want a plot: it's in the Insert menu for some reason, since plots and numbers are no longer Excel's raison d'être
Your comment about graph paper echoes a comment from a former Excel PM: "The gridlines are the most important feature of Excel, not recalc."
Project is fundamentally a critical path scheduling tool, or at least was. Task A must finish before Task B, which must complete before Task C. If A is delayed, then that delay pushes B and C out, too.
This is what it's FOR, more or less.
Manually scheduled tasks don't move. They're set with whatever dates you give them, and do not move in response to delays or whatnot from predecessor tasks.
People wanted this because some (dumb) people insisted that "well, that task CANT move because it has to be done by then!"
This is akin to asking for the arithmetic engine to be turned off in Excel, because by golly you really need 2 and 2 to sum to 17.5.
So? There are tasks like this that simply CANT move, why should a project management tool not have the ability to model that? If you don't like the feature you don't have to use it.
I get that you wouldn't want these fixed dates and if you can just plan and execute your project by yourself, but that is not always the case.
As an example: I had a project where it was really important for us to inspect the Aircraft Type (A321) into which we wanted to install our hardware and software. We where given an opportunity to do so by an airline during the maintenance of one of their A321. This was a fixed date. If we didn't finish our preparations before that maintenance date then we simply would not get this opportunity (or at the very least would have to wait a very very long time for the next opportunity). The Aircraft would simply not wait for us.
Not every task can move and just because you think they should does not make it so. Maybe there is more to Microsoft Project than just YOUR use-case?
I've been working with scheduling and project management tools for 20 years -- not just MS Project (widely regarded as the idiot cousin of the market, honestly) but also things like Primavera.
There's no reason to have a "manual" task in a scheduling tool.
First, if you have a task that can't move, you set it with a deadline and watch your deadline (as well as watching to see if the task moves PAST the deadline). Cementing the task in place doesn't help you; in fact, it actively HURTS you because it hides the fact that your forecast path isn't valid anymore.
Second, these tools ALSO include the idea of constraints, which limit critical path motion according to specific rules (based on the type of constraint in play).
Using ANY constraint, though, is frowned upon in serious scheduling circles precisely BECAUSE constraints distort the predictive ability of a critical path schedule. If Task C has a hard deadline of 1 Sept, then you watch your critical path to see if that remains possible. As tasks slip, you stay on top of the chain of tasks to seek opportunities to streamline or reduce scope ahead of the critical task so that the deadline can be met.
(Guess what? It's not always possible.)
And you do this because you SEE that the schedule shows Task C moving to the right.
If you lock the task in place, odds are you won't notice that your critical path is collapsing.
The tl;dr is that cementing a task in place in a critical path schedule is not a good way to model deadlines. This is something any competent scheduler will tell you. It's part of the PMBOK, it's built into DoD scheduling guidelines, etc.
>Maybe there is more to Microsoft Project than just YOUR use-case?
Hilarious. I will say it's clear one of us doesn't quite understand the problem domain as well as they might represent, but given my background I know it's not me.
Generally speaking, you can schedule just fine with partial resources as long as your project doesn't need more of a given resource at any given time than can be allocated to it. Really big construction projects and really big government (defense) contracts work like this.
As in, the history of genetics research is now irreversibly linked with the shortcomings of this one software product, which just happens to be incapable of describing the genetics of the organisms that created it.
The clickbait headline is fun, but the real headline is more like "Scientists find it easier to rename things than learn the basics of data entry in the tools they use".
Astounding, I've literally never heard of this in the 20 years or so I've been using spreadsheets. I'll be using it from now on!
It's a terrible app but Microsoft managed to shove it everywhere in education. Same for Word.
Unfortunately LibreOffice and co tried to reproduce the terrible UX of both programs, while coming up with inferior functionality because they had to maximise compatibility with these MS tools - time not spent on providing actually useful stuff.
Access is a bit better and I wish more non programmers learned basic SQL instead of Excel.
My anecdote is the exact opposite.
I was recently introduced to using Excel and actually writing some VBA inside of it when I helped a non-programmer with a complicated data processing task on a sheet they were populating from a database.
This was the first time I really used Excel, even though I've had continuous access to it since Windows 3.0.
Aside from some small annoyances, it was a generally pleasant experience. I also can't think of any tool that comes remotely close to the power and flexibility of Excel, particularly for people who are not programmers.
I kind of got hooked on it, and now I frequently use it for little data processing tasks that I used to do in a REPL - and the bonus is I can share those with people who are not programmers.
Worst excuse ever. Excel had to maximize compatibility with Lotus 1-2-3 (in the 80s/early 90s, the dominant spreadsheet application by far) from Day 1. It wasn't just that the files were compatible, Excel included keystroke compatibility so 1-2-3 experts could continue navigating the UI with their old commands.
Unlike the LibreOffice team, Microsoft made a superior program in virtually every way, despite the burden of backwards compatibility with another program.
the paper announcing the new guidelines of renaming genes, a work of fundamental importance to all scientists in the world, cannot be read without an expensive subscription to the journal.
Thanks science (sarcasm!)
The whole world's data ultimately comes from or ends up in an Excel spreadsheet. Sure, we might use other intermediate data storage methods, but in the end it's going to go into some scientist's or some politician's computer, and by golly it's gonna be in Excel. Trying to rally against Excel is like trying to rally against fundamental forces of nature.
This is just an example of that fundamental force winning.
In grad school I had a subletting roommate for a while who was writing code to match some experimental data with a model. He showed me his model. It was quite literally making random combinations of various trigonometric functions, absolute value, logarithms, polynomials, exponents, etc. into equations that were like a whole page long and just wiggling them around. He was convinced that he was on a path to a revolution in understanding the functional form of his (biological) data, and I believe his research PI was onboard.
I guess "overfitted" never made it into the curriculum.
Technically, we call that a "neural network". Or "AI".
Yes. It just turns out it's a particular human, whose analysis is very very dumb.
Then they go on conferences and brag about it, because they have to (otr they know it's bs).
Datasets are so-so (you can have a look at QM9...) and for more specialized things, people generally don't bother trying to benchmark or compare their results on a common reference. It's just something new...
And with all that: even leaving aside fancy statistical methods applied without much understanding of them, your theoretical computations might not make that much sense (at least given the sheer number that gets pumped out and published)...
Well, that's a new acronym for me. I wonder where it came from. Apparently it's "on the real". Sounds like AAVE?
People have figured that out long ago  (I know the author of that paper lately turned somewhat controversial, but that doesn't change his findings). It's not very widely known in the general public. But if you understand some basic issues like p-hacking and publication bias and combine that with the knowledge that most scientific fields don't do anything about these issues, there can hardly be any doubt that a lot of research is rubbish.
'A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it.' This principle was famously laid out by German theoretical physicist Max Planck in 1950 and it turns out that he was right, according to a new study.
Also the story of Ignaz Semmelweis who discovered that if doctors washed their hands it reduced deaths during childbirth - but for a variety of reasons his findings were resisted.
Point being, as awesome as science is, it's still a human enterprise, and humans are still, well, human.
After the 11th nested 'if' statement, I upped the request to a case of beer. I'm not certain he ever got the code working.
To the larger point, scientists are not programmers. They got into their programs to do research. What keeps them going is not the joy of programming, but the thrill of discovery. Programming is nothing but a means to an end. One they will do the bare minimum to get working. Asking hyper stressed out grad students to also become expert coders isn't reasonable.
And yes, that means that the code is suspect at best. If you load the code on to another computer, make sure you can defenestrate that computer with ease, do not use your home device.
I could replace "programming" in your above little bit with "mathematics" and it would be just as weird.
Our modern world runs on computers and programs, just as our modern world and modern science built itself on mathematics and required many to use it. So too the new world of science may require everyone to know to program just as they know about the chemical composition of smells, or the particulars of differential equations, etc.
And I know your argument isn't "they shouldn't learn programming", but honestly since I keep seeing this same line of reasoning, I can't help but feel that is ultimately the real reasoning being espoused.
Science is getting harder, and its requirements to competently "find the exciting things" raises the bar each time. I don't see this as a bad thing. To the contrary, it means we are getting to more and more interesting and in-depth discoveries that require more than one discipline and specialty, which ultimately means more cross-functional science that has larger and deeper impacts.
Again: these are tools that are means to an end. They only need to work well enough to get the researcher to that end.
A lot of what are considered essential practices by expert programmers are conventions centered around long-term productivity in programming. You can get a right answer out of a computer without following those conventions. Lots of people did back in the day before these conventions were created.
That's not to say that everybody with horrible code is getting the right answers out of it. I'm sure many people are screwing up! My point is just that ugly code does not automatically produce wrong answers just because it is ugly.
By analogy, I'm sure any carpenter would be horrified at how I built my kayak rack. But it's been holding up kayaks for 10 years and really, that's all it needs to do.
I will add that in general, statistical analysis of data is not by itself adequate for scientific theory--no matter how sophisticated the software is. You need explanatory causal mechanisms as well, which are discovered by humans through experimentation and analysis.
And you can do science very well with just the latter. Every grand scientific theory we have available to us today was created without good programming ability, or really the use of computers at all. Many were created using minimal math, for example evolution by natural selection, or plate tectonics. Even in physics, Einstein came up with relativity first, and only then went and learned the math to describe it.
I feel like the latter is obvious: of course the tools aren't science, but if you want to do real work and real science, your tools are going to be crucial for establishing measurements, repeatability, and sharing how one maps their hypothesis onto real-world mechanics.
Likewise, the former is just the same commonly repeated thing I just argued against and my reply is the same: so what? You building a kayak is not science and is irrelevant.
Scientists can't reach a meaningful conclusion without proper use of tools. All they can do is hypothesize, which is certainly a portion of science (and many fields are in fact stuck in this exact stage, unable to get further and come to grounded conclusions), but it is not the end-all of science, and getting to the end in modern-day science means knowing how to program.
Of course there are exceptions and limitations and "good enough". No one is arguing that. The argument I am refuting is those who think "tools are just tools, who cares, I just want my science". That is the poor attitude that makes no sense to me.
I'm just trying to make the point that "proper" is subjective. Software developers evaluate the quality of code according to how well it adheres to well-established coding practices, but those practices were established to address long-term issues like maintainability and security, not whether the software produces the right answer.
You can get the right answer out of software even if the code is ugly and hacky, and for a lot of scientific research, the answer is all that matters.
So what you end up with is that the great scientists who are also decent programmers are the ones who can do the cutting-edge science at the moment.
Sure, it would be great if we all had more time to learn how to code. Coding is important. But I'd say the onus should be on coders to build better tools and documentation so they are empowering people to do something other than code, rather than reduce everything to a coding exercise because making everything look like code means less boring documentation and UX work for coders.
I mean, biology is in fact a full on degree program and you pretty much need a PhD before you're defining an original research topic. It's not because biologists are dumber and learn slower. It's that biology is complicated and poorly understood, and it takes years to learn.
Contrast this to coding... you don't even need to go to college to launch a successful software product, and the average person can become proficient after a few years of dedicated study. However, this is a few years that biologists don't have, as their PhDs are already some of the longest, time-wise, to finish.
The decision to rename genes is totally consistent with the biologist's MO: if a cell won't grow in a given set of conditions, change the conditions. Sure, we can CRISPR-edit the genes to modify a cell to grow in a set of conditions, but it's usually far easier to just change the temperature or growth media than to edit a cell's DNA.
My take away is that this is more a failure of programmers and/or a failure of their managers to guide the programmers to make tools for biologists, than of biologists to learn programming. Sure, coders get paid more, but they aren't going to cure cancer or make a vaccine for covid-19 without a biologist somewhere in the equation. And I'm glad the biologists developing vaccines today are doing biology, and not held up in their degree programs learning how to code!
I would say most research, to an ever growing degree, is so heavily dependent on software that it's tough to make that claim anymore. It makes no sense to me. It's like saying Zillow doesn't need software engineers because they are in the Real Estate business, not the software business.
I mean, sort of. Some research is essentially just programming; other research can get by with nothing but excel. Regardless, it's unreasonable to ask most scientists to be expert programmers -- most aren't building libraries that need to be maintained for years. If they do code, they're usually just writing one-shot programs to solve a single problem, and nobody else is likely to look at that code anyway.
What if you want to share data with a wetlab biologist who wants to explore their favorite list of genes on their own?
Not that I'm saying using excel is bad either. I use excel plenty to look at data. But scientists need to know how to use the tools that they have.
The basic assumption I have is that when I input data into a system, it will not translate things, especially according to ad-hoc rules from another domain, unless I explicitly ask it to do so.
It's not clear what data input sanitization would mean in this case; date support like this in Excel is deeply embedded in the product and nobody reads the documentation of Excel to learn how it works.
If you're deciding who gets a large-scale computational biology grant, and you're choosing between a senior researcher with 5000 publications with a broad scope, and a more junior researcher with 500 publications and a more computationally focused scope, most committees choose the senior researcher. However, the senior researcher might not know anything about computers, or they may have been trained in the 70's or 80's where the problems of computing were fundamentally different.
So you get someone leading a multi-million dollar project who fundamentally knows nothing about the methods of that project. They don't know how to scope things, how to get past roadblocks, who to hire, etc.
You might occasionally run into someone who is passable - at best - with R or Python. But most of the code they might write is going to be extremely linear, and I doubt they understand software architecture or control flow at all.
I don't know any biologists who program for fun like me (currently writing a compiler in Rust).
Most of the code I write to do biological data analysis is fairly linear. However, I also generally use a static type system and modularity to help ensure correctness.
I've perused a lot of code written by scientists, and they could certainly learn to use functions, descriptively name variables, use type systems and just aspire to write better code. I just saw a paper published in Science had to issue a revision because they found a bug in their analysis code after publication that changed most of their downstream analysis.
I'd say that getting some basic data science computing skills should be more important than the silly SPSS courses they hand out. Once you have at least baseline Jupyter (or Databricks) skills you suddenly have the possibility to do actual high performance work instead of grinding for gruntwork. But at that point the question becomes: do the people involved even want that.
It's one of the reasons why people end up with spreadsheets. Most of their data is giant tables. Excel does very well at that. It has a built-in programming language that is not great but not totally terrible either. Sometimes all you need is a graph of a particular type. Paste the data in, highlight what you want, use the built-in graph tools. No real coding needed. It is also a tool that is easy to mismanage if you do not know the quirks of its math.
I respectfully disagree with this. Excel is fundamentally not suited to analysing *omics data. It's often the default program affiliated with a .csv filetype on people's computers, but trying to get an entire field of scientific research to rewrite itself based on its glorified bugs is...wrong, in my opinion.
If you see wrong things in the world, do you accept them as they are, or try -- however ineffectually -- to force change for the better? I for one bang the drum into the wind and try to get biochemists off it. I teach people to be very sceptical of excel in my stats courses, for example (aside from some showstopping bugs and downright dangerous defaults, its RNG is particularly crap).
I wish you luck, but I long ago declared defeat against Excel. I can't prevent everyone from using it. I have come to accept that whatever Excel does is what I have to live with, because someone, somewhere out there hasn't heard the anti-Excel gospel yet.
Totally relate to this.
I studied a degree because I wanted to rid the world of Excel.
Just like I wanted to rid the world of pie charts.
Turns out the rest of the world is quite happy using both of them, no matter whether I think it's right or not.
If you produce, touch or directly act on data, Excel is no longer allowed. On the other levels the people that still use it (because they think they have the skill - which they often don't) can't do meaningful direct-action on those files anyway so that solves part of the problem of acting on bad data or bad data processing.
We do allow web-based spreadsheets, that has the benefit of limiting what you can do and drives people to either put a request in with the correct team or take internal courses on how to do it right.
Some people hate it, but then again, some people hate Excel. The difference is that the people that don't hate it now have an empowered skillset that they didn't have before. In theory you could do the same with Excel-oriented internal courses, but we didn't want to do Excel because a lot of us think it's stupid (often simply for emotional reasons but the push was based on hard data like process metrics and user input).
To add to this: it's probably because I find myself attracted to data-driven organisations where using Excel didn't help the workforce much anyway, and the positions I'm in allow for a certain degree of influence on the tools that we support and how much explicit allow/deny we apply on them.
Not knowing how to use the primary tool in your field because you only know how to use a web-based spreadsheet with 1/100th the functionality is not an "empowered skillset." It's a cruel trick to play on employees, who will have difficulty finding further jobs in that field because they don't have a basic understanding of the tool everyone else uses.
It's not a cruel trick and even if it was: web-based spreadsheets might not have the same functionality, but the people that were using the desktop-based spreadsheets weren't doing much more than basic computation on fields and search/replace. That works fine on the web.
It's not like we took away their toys and gave them dirt instead. We just made everyone take the startup course to get them acquainted and if that went well we would decide our next steps. It went well, and the next steps became removing Excel from the standard workflows.
There is tremendous competition in the spreadsheet business, and anyone is free to use Julia, R, whatever they like.
The only reason they want to use a tool that doesn't really fit their use case, is because they're not aware of better tools. I find it very hard to believe that scientists actually want to use a tool that corrupts their data. If that is true, then that is absolutely a problem with their attitude towards science and data.
It's not that hard to imagine a tool that can do exactly what Excel can, but without corrupting your data. It might even exist already. LibreOffice got mentioned a lot; it can do almost(?) everything Excel can, but without corrupting your data. If there are problems with it that make it useless to scientists, there's a good chance the LibreOffice community can fix them.
I think better tools that preserve the integrity of their data are absolutely the solution here.
Also, I don't think it's a very minor issue if the tool you use to process your data, changes your data. It may only do that in limited cases, but the fact that it does it at all should alarm anyone who cares about data integrity.
This is what I react to. Sorry I missed that you're not the original commenter, that's my bad.
I am not sure, though, what other tool the biologists should learn - they usually know R and Python, and they use Excel because they need what Excel offers; there isn't any other software that lets them easily do whatever they need to do with the data, quickly iterate on ideas AND then present it nicely.
The solution here should be to fix Excel, but it really is a small issue (that has been worked around, too; and if you know how to use excel correctly, it does not happen to you) compared to making another Excel, which seems like a monstrous task.
You're claiming that they do know the alternatives but the alternatives fail in worse ways than Excel does.
I don't know what the magic sauce is that makes Excel so much better than the alternatives, or why it would be such a monstrous task to replicate that. My impression is just that to many people, Excel is simply the default tool to enter data in, no matter what the actual problems with Excel are. Excel is like gravity; good or bad, you put up with it because it is what it is. And if that's the case, that'd be pretty bad and in need of change, because Excel is not at all like gravity; it's merely one tool of many. If there's a more appropriate tool, people should use that. If there isn't, one could be developed.
I stand by my point that a standard data collection tool that modifies the data you put into it is a really bad idea for any field where accuracy and data integrity is important, and I'd expect science to be one of those fields. You may be able to work around the limitations of the tool, but the risk is still there.
After opening in Excel it will convert these to:
Then resaving the file (as a csv) will store the values in scientific notation and lose all of the digits.
number date currency
A B C
I'm newly returned to (very minor) Excel use, and assumed formats with no feedback seem wrong. My situation is dates _not_ being picked up as dates and it not being obvious that they haven't been; I'm using ISO format (eg 2020-06-17). I guess you have to create a new date format or use USA ordering before it will recognise dates.
At the moment users need to remember every time they import the CSV to specify a data type for each incorrectly handled column. Suffice to say they don't always remember... and with large CSVs with hundreds of columns it's just a pain.
This would allow existing CSV tools that work with data files to continue to work without modification, while CSV Schema aware tools could make use of the schema.
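As a sketch of what a schema-aware loader could look like (the sidecar file name and schema layout are invented for illustration; this isn't an existing standard):

```python
import json
import pandas as pd

# genes.schema.json might say: {"gene": "text", "count": "number"}
with open("genes.schema.json") as f:
    schema = json.load(f)

dtypes = {col: (str if kind == "text" else float) for col, kind in schema.items()}
df = pd.read_csv("genes.csv", dtype=dtypes)   # gene symbols stay text, counts stay numeric
print(df.dtypes)
```

A plain-CSV tool would simply ignore the sidecar file and behave as it does today.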
Excel should know that any number longer than it can internally represent should be stored as text. Or they need to introduce a data type that can store arbitrary precision numbers.
The asinine thing happens when you save it back as a csv. Excel doesn't think to itself "hey, I should write these full values I have stored in each cell to the csv file". It just writes whatever is currently displayed to the csv file instead.
you are 100% correct. and yet it will continue to happen.
Intellectually, it feels snobbish to single out Excel like this. I'm a software engineer in science, and I generally agree that scientists should learn some kind of coding. But you can make mistakes in Python or R as well - not to mention in physical experiments. We should check data and methods in general, not presume incompetence if people use one tool.
It's not coincidental that big evil FDA/pharma requires people to sign - on paper - their lab notes (and to have lab notes, and to sign that they calibrated the equipment, and that they checked for common mistakes).
And yes, I know this costs a lot, and that this is sort of a pipe dream. And I'm not saying this is the best solution. Renaming genes might be a lot better, because it just allows the field to move on, and Excel will eventually die out, or at least researchers will move away from it - maybe to Jupyter, if we're lucky.
So, all in all, of course Excel is just a symptom, but society is already pretty stubborn when it comes to financing R&D and progress.
Ideally excel would change, but since we know it won't, and we want to work with lots of people with minimal problems, we must adapt.
Researchers should use other, more appropriate tools, or at the very least specify the column type when importing data into Excel. It's not that hard.
It's one thing to change typed in user text in real time. That's not causing any problems. It's another to randomly mutate cells amongst tens of thousands of rows. I don't think that has ever helped anyone.
If I can get a trained immunologist looking at my data, I'd much rather have 5 more minutes of their analytical skills than teaching them about common data exchange pitfalls.
around the late nineties the obsession with videogames and netspeak took hold, and the popular names began to crop up.
If someone types MARCH1, DEC1, OCT4, SEPT1 into a column, why should Excel not treat these as dates? Probably 99.999% of users would prefer this auto-understanding. To make users have to then select cell formatting and select date is a terrible UI decision.
It's odd to call such a useful feature a bug.
>Excel is fundamentally not suited to analysing *omics data
But it's excellent for a massive range of basic analysis and especially for a unified platform to share results, which is why everyone uses it. It's good for publication or proposal steps of creating graphs from data output from other more specialized systems, that usually lack the flexibility and widespread use that Excel has.
If my group uses some specialized, often in house, analysis software, my results cannot be spread to others unless I put them into a more common and accessible format. And nothing in the genomics and wider scientific community has anywhere near the widespread availability as Excel.
You might as well complain they also use Word docs instead of LaTeX or troff. Not everyone wants to spend days fighting software instead of doing work.
It should treat these as dates, unless told otherwise. But Excel goes one step further, and throws away the text you originally typed in. That means, if you type "DEC1" into a spreadsheet, see it looks stupid, and change the data type of that cell to "Text", then it's too late, and Excel will show "37226". It has destroyed the original data and replaced it with the numeric equivalent of the date.
The data type of the spreadsheet cells should be a display format only. The text you typed in (or loaded from a CSV file) should be preserved, and the data type of the cell should only dictate how that is treated and displayed, not how it is stored.
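A toy model of that "format is display-only" idea, just to make the distinction concrete:

```python
from dataclasses import dataclass

@dataclass
class Cell:
    raw: str                  # exactly what was typed or read from the CSV
    display_as: str = "text"  # "text", "number", "date", ... – affects rendering only

    def rendered(self):
        if self.display_as == "number":
            return float(self.raw)
        return self.raw        # the original text is never destroyed

c = Cell("SEPT1")
print(c.raw, c.rendered())     # SEPT1 SEPT1 – changing display_as later loses nothing
```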
However, when you input something into a cell, Excel has to parse it and decide what value and datatype to assign to it. That is just inevitable, and the same everywhere: in Python, `x = 3` assigns an integer, `x = 3.` a floating point, and `x='3'` a string. Similarly hints (or explicit specifications of type) are possible in Excel.
However, given that, what you typed originally is not recoverable anywhere (Python doesn't record whether you wrote `x=3.` or `x = 3.00`, why should it?).
So, to the extent that it is reasonable, Excel does what you request, and to the extent it doesn't do it, it's not reasonable, I'd say: A cell need not maintain the original pre-parsing input.
That's even worse, since a user assumes the cell holds what it shows. Excel, for example, autocompletes entries, which is extremely useful for data entry and for accuracy. If it only held the few characters entered, it would have to repeat exactly the same autocompletes every time it opens, and would have to do so even if the data that originally caused the autocomplete changes. All this is nearly algorithmically impossible without changing the document into a history. And now CSV is useless for data transfer.
It’s vastly more sane to change to the data the user meant to enter, show it to the user for confirmation, and save that.
That the 99.99% use case bites the remaining 1 in 10,000 is a perfectly reasonable trade off, done throughout society in uncounted ways.
Automatically converting SEPT1 to a date caused far more damage than the benefit of the fleeting convenience.
The damage that this "feature" caused and will keep causing goes far beyond the domain of bioinformatics.
Even in bioinformatics things have not been "solved", for the next decade the same gene will have two names when looking it up in previously published data.
Who are you to judge? For the novice, even for most users it'll be what they want. And for the expert (or even intermediate user) it is trivial to override.
20% of publications had errors introduced by Excel in them! The real number is probably much higher (Nature had a rate of 35%). What is the rationale in saying that the benefits must still be worth it?
See: Gene name errors are widespread in the scientific literature
The root of the problem is not how Excel displays information, or how Excel parses the data, but that the data is tacitly changed upon saving.
Open a CSV file that has SEPT1, don't take any action, save it right away. Boom, the data has been changed and you cannot recover the original information.
It is super hard to figure out a way to import data into excel without triggering auto-conversions. I knew a way a few years ago when I was dealing with lots more biologists, but I have forgotten now. If I had to do it now, I would write XLSX directly then manually check the file in Excel, because Excel is fundamentally untrustworthy.
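One way to do that "write XLSX directly" workaround, sketched with openpyxl and made-up column names and values:

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["gene", "expression"])
for gene, value in [("SEPT1", 1.3), ("MARCH1", 0.7), ("DEC1", 2.1)]:
    cell = ws.cell(row=ws.max_row + 1, column=1, value=gene)  # written as a string
    cell.number_format = "@"        # '@' marks the cell as Text so later edits stay text
    ws.cell(row=cell.row, column=2, value=value)
wb.save("genes.xlsx")               # Excel opens this without re-parsing the gene names
```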
I received dozens of comments “Just use Open Office / Libre Office”. I am but a man, I cannot change the world. I forgive that which I cannot control. We work with tens of thousands of schools. Getting all of their administrators to install Open Office, and thousands of administrative assistants to remember to use it rather than Excel is simply an impossibility.
The world speaks English. It's an illogical and absurdly spelled language. Esperanto is miles ahead on all logical measures. Yet we will never switch, both because the cost of switching is enormous, and because there is no way to coordinate/organize such a move.
Man, the extent the aviation world has to bend itself to avoid the pitfalls of English....
"to" and "two" sound alike, and so does "for" and "four".
"Turn left t(w)o four zero" or "Climb t(w)o eight thousand"
To avoid that, the lingo is:
"Turn left heading two four zero", "Climb and maintain two eight thousand" (or, "Climb flight level two eight zero").
At any rate, switching to metric would be much better (If I'm 2000 metre high, and I have a glide ratio of 1:10, I can glide 20,000 metre or 20 km. If you're 5000 feet high, and you have a glide ratio of 1:10, how many nautical miles can you glide?), but for the reasons you mentioned, it won't happen anytime soon.
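For comparison, the feet-to-nautical-miles version of that glide question needs an extra conversion step (1 nautical mile is about 6076 ft):

```python
altitude_ft = 5000
glide_ratio = 10
print(altitude_ft * glide_ratio / 6076.12)   # ≈ 8.2 nautical miles, vs. simply 20 km in metric
```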
But in microbiology you usually work with very large datasets that undergo a lot of calculations through a lengthy pipeline. Larger sets and more intense ML approaches even need HPCs. Inserting Excel into this pipeline would be catastrophic.
What I'm saying (which is not news to you) is that clearly Excel is excelling along some dimension that people really care about. Even if it is lacking along other dimensions, the goal ought to be figuring out what makes Excel so good, then figuring out how to reconcile that with the needs of e.g. a high scale genomics pipeline.
But to make even an inch of progress against that, we need to get rid of this "haha Excel is for plebes" attitude that many in the software community adopt.
Most computers also have a built in calculator app, but I have never heard of those being difficult to pull from the hands of users, even when replacing it with purpose built software that costs tens of millions of dollars per year.
But what do I know :)
I have met some other glitches: identifiers in the form of "1E123" get turned into scientific numbers. The column was something like "1A123", "1B123", etc. Those things are sneaky: you can have thousands of rows, and Excel/LO doesn't mind if only 0.1% matches its rules for smartness. They just change without notice, leaving the others intact.
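A quick demonstration of why those identifiers are so sneaky - any numeric parser happily eats the "1E123" ones while leaving the rest alone:

```python
print(float("1E123"))         # 1e+123 – silently becomes a huge number
try:
    float("1A123")
except ValueError:
    print("1A123 stays text")  # neighbouring IDs are untouched, so the damage hides
```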
I'm a bioinformatician, and this kind of stuff is a daily issue. IMO, two problems collide:
1. People with very rudimentary knowledge of computers. I've met some people way smarter than me, with thousands of papers written, who cannot open a CSV with R with a gun to their head. They can cut you, put three robotic arms in your heart, remove a tumour, sew you up and send you home in three days. But R is just too much.
2. People who need to collect and analyze data, and Excel is the easiest tool they know, but at the same time it's no lame toy: it sorts, it sums, it filters, it does stats, it graphs... You cannot ask these people to use SQL, especially if it involves foreign keys. They just use Excel for data collection, storage and analysis.
Excel (and LibreOffice) are to blame. This shit could be avoided if they had a "scientific mode" that doesn't get smart with your data, and made it the default mode for any file opened as CSV/TSV. Unless you explicitly turn the file into an XLS/ODS or ask "turn this range of cells into dates for me" or "these cells are scientific numbers", they should not change a single dot.
There's a whole lot of business use of CSV/TSV with Excel that benefits from the existing defaults. Heck, I've seen federal government websites distributing code lists that are actually in CSV format with .XLS extensions and that are expected to use the default “smart” conversion.
Even if Excel should arguably have had different defaults, the impact on other uses of changing it now would be enormous.
The good actors here are people who carefully care for and curate their data, only to have it corrupted silently by the program.
Let's say then: keep the default, but at least add some sort of "respect my data as it is, because this is a CSV with a CSV extension" mode.
CSV doesn't really have data types (or even a hard and fast standard), so that doesn't really work. OTOH, Excel lets you specify data handling by column when importing from untyped text-based formats like CSV, which you should probably do if you care.
Or you should write up a standard for a CSV-like format with explicit semantics around data types (it could even be CSV-compatible so that you get no worse results than normal CSV handling if using software that is aware of CSV but not the new format) and start using it where those defined semantics are useful, and try to get Microsoft to add specific support to Excel for it.
That "scientific mode" is called text. Assign the column type Text during CSV import and nothing is converted to numeric or date.
Should those things be done in sql? Yes... could this data be sent as SQLite files? Yes... were they endemically sent as csv and nothing could change this? Yes...
Excel is by far the best of “spreadsheet” apps I’ve seen at munging through multi-hundred-mb csv files.
fwiw, i have not tried to use libreoffice for any heavy lifting, unfortunately (fortunately?). i tend to use bash utils for that, and i 100% do not think telling people to "just use grep / awk / perl and csvcut" is by any means a solution.
It makes no difference what men think of war, said the judge.
War endures. As well ask men what they think of stone. War
was always here. Before man was, war waited for him. The
ultimate trade awaiting its ultimate practitioner. That is
the way it was and will be. That way and not some other way.
When I worked in clinical trial design/analysis, no one did their analyses in excel, but for initial data formatting and clean-up? /everything/ went through excel, even if the final cleanup was a python script.
You’re not going to have the same ease of eyeballing your data in SPSS or SAS or R.
I would say R is a lot more useful for a quick check on whether all columns have one data structure, contain NaN/NA's, are of equal length, etc.
I'd like to add that we also had an issue with our clients using only Excel, and there were problems with calculations not working - not that often, but with enough people making noise our management decided to get the subscription. I've been using Sheets for years, even as a student when I couldn't afford an Office license.
I often read comments on HN starting with “to people asking” or “to the people wondering”... and introducing a new discussion, usually on another similar topic, basically hijacking the whole discussion.
Those aren't very good for cleaning and wrangling data, even if the output for analysis ends up being a spreadsheet.
The english comparison made in a peer comment is more apt, but wow.
The tech industry is so self-centered we think that Scientists should change their fields because of our bugs.
As for this being an Excel bug: many, perhaps most people don't think it is. That's why it's so permanent: almost everyone likes it.