One in five genetics papers contains errors thanks to Excel (2016) (science.org)





Don't get me started on the automatic conversion of strings that it thinks are in scientific notation into numbers - which you can't switch off!

We have large data exports from systems that include things like unique location codes. You happen to notice that a block of these look weird, and it isn't just the display of them that has changed: the contents of the cell were changed by Excel automatically, without asking, and you cannot disable it.
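For the curious, here is roughly what that coercion does to identifier-like codes. This is just a Python illustration (the codes are made up), not Excel's actual logic:

    codes = ["7E3", "12B7", "05E1", "MAR1", "2E10"]

    for code in codes:
        try:
            # Anything float() accepts is the kind of string Excel silently
            # rewrites as a number (7E3 -> 7000.0, 2E10 -> 20000000000.0, ...).
            print(f"{code!r} -> {float(code)}")
        except ValueError:
            print(f"{code!r} -> left alone")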

Absolute BS after all these years. I hate that they won't fix these niggling issues that keep tripping people up over the years and just make excuses. Microsoft's usual response is: "We only work on things that affect a large number of customers". Yeah Microsoft, if you keep closing these bug reports, then each time someone reports it, you can just say that it only affects one person and close it again.

Or... you could show how amazing your company is by doing what most of us have to do: fix it, add more debugging for the next time it happens if you can't recreate it, or have a properly tracked reason to say, "only a very few people have asked for this but changing it might break these other areas/backwards compatibility" or something.


Yes, it's that last reason. Backwards bug compatibility. There are zillions of recorded macros and VBA scripts and other workflows by non-technical users that inadvertently depend on these behaviors in subtle ways. It's like all Javascript's weird warts, you can't change these behaviors without breaking old stuff somehow. It's true that Microsoft could say that more directly, but of course they'd rather just avoid that argument.

Is there a reason you don't specify the data type when importing, or, if it is a frequently used data file format, automate the import using VBA and specify the correct data types?

If you open a CSV file without going through a specific import process, you don't even get the option to specify a data type. And once it's open it's too late to fix it, the original data is already gone.

The specific import process isn't some sort of esoteric process. It is the data import wizard. Also, as I said, if you are regularly importing data from a file with the same format, writing some VBA to do so is a pretty quick and simple task.

Also your data isn't gone. It is still in the CSV file you imported it from. Re-import it.
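If you are on the programmatic side anyway, the same idea in Python is a one-liner (a sketch only; pandas is assumed, and "codes.csv" plus the column names are hypothetical): declare the types up front so identifier-like columns stay text.

    import pandas as pd

    # Force the identifier columns to stay strings instead of letting the
    # reader guess numeric/scientific-notation types.
    df = pd.read_csv("codes.csv", dtype={"location_code": str, "sample_id": str})
    print(df.dtypes)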


I would wager more people open a CSV by double-clicking on it rather than using the import data wizard. And even if you use the wizard it takes extra work to specify the type of each column, which most people won't bother with.

Writing some VBA is a simple process if you're a programmer. I wonder how many genetic researchers fit that description?

P.S. when I said "too late to fix it", I meant by some process within Excel. Of course you can re-import the original file, but maybe you only notice the problem after you've done a lot of work with it?


Expecting you to learn the basics about the tools you're using is not expecting too much. And if you are too lazy to spend a few seconds specifying data types then you get what you deserve.

> Expecting you to learn the basics about the tools you're using is not expecting too much.

Do you/have you worked in a corporate environment? You seem to have an idealistic view about how end users are expected to use Excel.


I worked in corporate environments for decades.

The original comment I responded to said they regularly imported large data sets, and in the case of the geneticists they are also regularly importing data into Excel. In other words, Excel is a regularly used and fundamental tool in their work. In this case I would expect someone to learn the basics of using it. Just as I would expect a developer to learn their editor, build system, version control system, etc.


Excel chews up CSVs that it opens. I know this because an accountant checked each file our code produced using Excel before trying to import it into another program. We proofread our code before realizing the problem was somewhere else. Shoulder-surfed the process, found the giant bug with a green X on it.

It's no better at exporting to CSV. I wrote a CSV parser a few years ago that had one set of logic for Excel CSVs and a completely different set for everything else.
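For what it's worth, even Python's standard library treats Excel's flavour as its own dialect; a quick way to see the assumptions that separate code path has to make (the "export.csv" filename is made up):

    import csv

    # csv.excel: comma-separated, CRLF line endings, minimal quoting.
    print(csv.excel.delimiter, repr(csv.excel.lineterminator),
          csv.excel.quoting == csv.QUOTE_MINIMAL)

    # Parsing a file Excel wrote:
    # with open("export.csv", newline="") as f:
    #     rows = list(csv.reader(f, dialect="excel"))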

Excel doesn't change CSV files when it imports them. If the imported file was being changed then the user was saving back to the same file they imported from.

This information doesn't help anyone.

The fact is the person was double-clicking a file in a list to view its contents and Excel was trampling it. Nobody in their right mind will waste time to open Excel first, use import feature, re-navigate to the file they were already looking at, and go through the import dialog just to see what's inside.


Trampling it to me implies that Excel was somehow modifying the contents of the file. Which it doesn't do by double clicking on the file and just viewing it. Do you mean that the data shown in Excel wasn't what was expected because of the auto data conversion?

Believe me, I was blown away just the same. And it's not like the accountant clicked a save button on an on-close dialog, no. Opening a CSV file was enough.

I first used Excel on Windows 3.1 and over the decades since have opened countless CSV files with it. I have never observed this behavior.

> Don't get me started on the automatic conversion of strings that it thinks are in scientific notation into numbers - which you can't switch off!

Every week it bites me once or twice. Drives me bananas.


Why still use MS Office when LibreOffice is freely available?

Because that is what your work gives you and what all of your coworkers use.

Does it fix the issues in question?

Previous popular thread, Aug 2020: https://news.ycombinator.com/item?id=24070385

> Scientists rename human genes to stop MS Excel from misreading them as dates (theverge.com)

Related details:

2023:

https://www.pcmag.com/news/microsoft-finally-fixes-excel-gli...

> Years after introducing Excel's automatic conversion features, Microsoft rolls out an update to prevent it from changing gene symbols to dates.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9325790/

> Gene Updater: a web tool that autocorrects and updates for Excel misidentified gene names


Know your tools. When I see such headlines, I feel sad about the state of this world and obstacles to progress, angry about the stupidity of designing all the tools for the most lazy of users, and at the same time I also laugh about the failures of proprietary software and its users, who could easily have invested a little bit of time to learn alternatives, but apparently chose not to.

Yes the tooling they use might be terrible. It is your responsibility to either deal with the terrible tooling yourself, learn better tooling, or get a capable computer person in the room, who can navigate the tooling landscape and get you the results.

And of course, that is not even addressing checking your results yet. It is a sad state of the research landscape: often financed by public money, and then throwing money at MS for using a proprietary tool and messing up.


Working in this field, I think a lot of the issue is publishers often require small datasets in xls format for paper submission. So someone in the chain (corresponding author, editor, journal staff member etc.) opens the file in excel and saves it again.

Bioinformaticians are not doing their analyses in excel.


Yes, tools are not made for users but for the tool-builders. Everyone else must adapt!

I think you’re reframing “when all you have is a hammer, every problem is a nail” into “when all you have is a screw and a hammer, it’s outrageous the hammer wasn’t designed to drive screws”.

Should a single tool work for every use case, or should the tool be picked based on the use case? Tool X doesn't work for use case Y. Regardless of where you shift the blame, "tool X doesn't work for use case Y" still holds.

Tools are designed for a particular use case/userbase. Excel was not designed for gene scientists. It wouldn't make sense for me, a home owner who uses a jigsaw to cut molding, to complain to a chainsaw maker that their chainsaw isn't right for my job.

Excel is made for spreadsheets, not databases. Microsoft already has a database application in Access.

An annoyance with spreadsheets that deterred me from ever using them in teaching is that they've perpetuated an arithmetic order-of-precedence bug. ("Bug" in the sense that it contradicts long-standing mathematical convention.) If you type

    -3^2
in a cell and press ENTER, the spreadsheet tells you it's "9". It should be "-9"; in math, exponentiation has precedence over unary minus, so you square 3, then negate the result. For instance, if you tell students to graph "y = -x^2", they should draw a parabola opening downward.

I don't have a recent copy of Excel to check this in, but this was the case in the '97 version. I just tried it in the current LibreOffice calc, and it returns "9". My guess is that one of the early spreadsheets messed up the order of precedence, and everybody after copied it for compatibility.

On the other hand, I just tried maxima and python and they both give "-9".

I wonder if this particular problem afflicts people who copy formulas from (say) math books into spreadsheets.
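Since Python came up: a quick check of the convention in a language that follows the math ordering (illustration only):

    print(-3**2)      # -9: parsed as -(3**2), exponentiation binds tighter
    print((-3)**2)    #  9: the reading Excel effectively applies
    print(0 - 3**2)   # -9: binary minus gives the same answer either way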


This issue seems like partly an artifact of the invented binary operator ^. In math, exponents are superscripts and there's no binary operator - it's part of the term. But for text on computers, binary operators ended up being fabricated for the things you couldn't represent directly. The caret is common and was the first one to appear, but doubled multiplication signs are another. Traditionally, binary operators have lower precedence than unary operators since the unary minus is considered part of the term, so the exponent turning into an operator mucks things up if it's still implemented to adhere to that logic.

You sort of see the same issue with division. The forward slash is a completely invented binary operator since the actual division symbol was often not present - and let's be real, nobody uses the binary division operator when writing formulae. It's supposed to represent the dividing line in a fraction, similar to how division is usually represented in a formula as a fraction of two other expressions. It's got lower precedence than anything in either term - but if you just replace the dividing line with a forward slash to input the formula into a computer, you'll get incorrect results, because it's replacing what is part of a complete term (the division line) with a new binary operator inserted between sets of terms, which is now subject to precedence rules.


I suspect it's more likely an artifact of how the number is lexed. `-3^2` probably gets lexed into Number(-3), Operator(^), Number(2), which results in the aforementioned precedence issues. The reason for parsing the operator with the number is that it makes it easier to handle the case where you just write a negative number as a literal value into a cell.

Source: I've written an Excel clone before. I don't believe it has the same bug, but if it does, that will be how it's crept in.

EDIT: looking at some of the descriptions of the bug, it seems like it happens when handling variables (i.e. cell references) as well, which makes it seem like a pure precedence issue and not a parsing issue. So I've got no idea, presumably someone simply messed up the precedence order.
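A toy sketch of the two token streams being described, in Python. This is not Excel's lexer, just the two shapes of the parse and the values they lead to:

    def folded_literal():
        # Number(-3), Operator(^), Number(2): the sign was lexed into the literal.
        return (-3) ** 2          # 9

    def separate_unary_minus():
        # Unary(-), Number(3), Operator(^), Number(2), with ^ binding tighter
        # than the unary minus, as in math convention.
        return -(3 ** 2)          # -9

    print(folded_literal(), separate_unary_minus())  # 9 -9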


It's almost certainly a precedence issue. It's much easier to consider that unary operators have precedence over binary operators, and just learn the precedence rules for each class, rather than a global precedence rule. Plus, there's no conflict with math notation, as there is no exponentiation operator in math.

In my country we teach the same rules in math class. Blaming it on “unary vs binary” was a stretch. Next you’re going to blame it on the lexer for producing the -3 as a single term instead of two. (Which WOULD explain something, but… fix it?!)

In my country we use a horizontal line with a dot above and below to indicate in-line division in lower grades. Exactly like the computer /.

It’s not like there was no precedent here.


I also learned the line with a dot above and below in my country, the USA. But that was a very long time ago, math teaching has changed immeasurably since my time.

In Unicode it's U+00F7: https://www.compart.com/en/unicode/U+00F7


In the C language, -a*b parses as (-a)*b, but c-a*b parses as c-(a*b).

Unary minus has a higher precedence than binary operators.

You don't notice because the semantics allows the sign to move around, unlike with exponentiation.

But when we throw in edge cases involving undefined behavior, oops!

  0 - INT_MIN/2 // fine: parses as 0 - (INT_MIN / 2)

  -INT_MIN/2    // not okay: parses as (-INT_MIN) / 2
The INT_MIN value need not have an additive inverse because of a quirk in two's complement.

> The INT_MIN value need not have an additive inverse because of a quirk in two's complement.

For comparison, in Java, these 3 expressions each yield Integer.MIN_VALUE (i.e. -2147483648):

    -Integer.MIN_VALUE
    Integer.MIN_VALUE * -1
    Integer.MIN_VALUE / -1
I have to admit I expected all 3 to throw.

edit On reflection I shouldn't have expected that, I recall reading John Regehr's blog post on the downsides of how Java defaults to wrapping behaviour: https://blog.regehr.org/archives/1401


This is probably the most overly pedantic, nitpicky reason for not using a program I've ever heard.

I'm aware that there are two different conventions on this issue, so I just use parentheses to get the behavior I want.

But, growing up, as the top math student in my class, it never occurred to me that somebody out there wants -3^2 to equal -9, I thought it was just a weird quirk in some calculators/programs. How would you read that expression aloud? I think of it as "negative three squared" so that's why (-3)^2 makes sense to me. Do you say "the negative of three squared"?

In 8th grade, I remember being instructed to type such an expression into the calculator to observe how it does something contrary to what we expect it to. From that moment on, I thought, "Huh, guess you have to use parentheses." It certainly wasn't cause enough to throw out my calculator, let alone tell others not to use it, just because I prefer a slightly different precedence convention.


> This is probably the most overly pedantic, nitpicky reason for not using a program I've ever heard.

If you found learning math easy, you're fortunate. But lots of people find learning math difficult and frustrating, and things which might not have bothered you can be big deals for those folks. If I used a program in teaching which has a convention about basic arithmetic operations that is the opposite of the convention that mathematicians use, it is one more source of confusion and frustration for people.

Student: "You said that -3² was -9, but Excel says it's 9."

Me: "Well, mathematicians use a different convention than spreadsheets."

Student: "So which one should I use on a test? Can we use both?"

Me: "Since this is a math class, you should use -9, not 9."

Student: "How am I supposed to remember that? This is why I hate math ..."

Everyone will weigh costs and benefits differently. There is plenty of good math software out there like Mathematica, R, Geogebra, or maxima. Spreadsheets didn't seem to offer much, and there was this arithmetic convention thing that I knew would be an issue.

I'm sorry if you find it pedantic and nitpicky. I always tried to minimize unnecessary causes for upset, because there were difficulties enough learning math without my adding to them. If you saw people getting extremely angry or in tears because they "didn't get it", I think you'd understand. Math is really hard for some people.


I think we should agree that standard notation is too ambigious and switch to reverse polish notation:

  3 2 ^ -     -9
  3 - 2 ^      9
No way to misinterpret that!

Except 3 - 2 ^ makes it hard to express 3 2 - ^ without accidentally subtracting, so in this case a unary negation sign needs to be a different symbol!

I dug out my HP 50g over the weekend to play with, and conveniently…

Sigh. I understand why we commonly enter math on basically a teletype-with-ASCII, and I don’t have an urge to go all APL, but for a while we were so close to a future where we could’ve had separate negation or multiplication or exponentiation symbols that might’ve removed so much room for error. I mean, that little calculator and its predecessors were popular and widely used by the same people who brought us things like Unicode and the space cadet keyboard. If only one of them had said, gee, it sure would be handy to have a +- key on the keyboard the person in the next cubicle is designing as I have on the calculator on my desk!

But nope, Everything Is ASCII won and here we are. At least programming languages are starting to support Unicode identifier names, which has its own issues but is excellent for non-Latin alphabet users who want to write code in their own tongue. It seems like a reasonably short hop from there to giving operators their own unambiguous names. I can imagine a near-distant future where a linter says “you typed -3. Did you mean ⁻3?”, to the chagrin of programmers still entering their code on teletypes.


It would be nice if OSs defaulted numeric keypad / * and - to Unicode ÷, ×, −. I never use them even when I do use the digits. That would solve the more glaring typewriter legacies. Then you'd just have the apostrophe/single-quote as the last remaining unification.

I'm not convinced. I was brought up with the middle dot for multiplication (and × reserved for cross products, I suppose?) and according to Wikipedia, the

> ISO 80000-2 standard for mathematical notation recommends only the solidus / or "fraction bar" for division, or the "colon" : for ratios; it says that the ÷ sign "should not be used" for division

I think these things are way less standardised even on paper than you believe.


That was kind of my point. We somewhat settled on an ASCII representation of many symbols, and that space is so small that lots of them have multiple meanings you have to infer. There was a brief window where we could’ve taken a different path and used different symbols for different things. Alas, we didn’t.

I don’t contend we should change things today. I do think if I were personally writing a new programming language from scratch today, I’d likely use different Unicode symbols for different operators, and the accompanying language server would nudge people to use them.


Everyone learnt -3^2=-9 in middle school… it is a very clear implementation mistake

"minus (pause) three squared"

I had a discussion on one of the Nim boards because Nim does the same, i.e. writing -2^2=4 but 0-2^2=-4, because Nim treats the first as the unary - which takes precedence over exponentiation. I realise that you can argue how it is technically correct (and interestingly several of the people I was talking with couldn't even get my point), but I still argue it's incredibly unintuitive for anyone coming from a regular engineering/science background who wants to do regular work. I would almost argue this alone will keep many of those people from using it.

You're right that violating established mathematical convention can be a deal-killer for some kinds of adoption.

A few years ago I was reading the docs for a new programming language, thinking it might be useful in teaching. The docs were well-written and in a beautifully produced book. I got to the chapter on trig functions and discovered that they'd decided to make angles increase clockwise. And there was a graph of the sine function, with the graph below the x-axis from 0 to 180 degrees. And I sadly put the well-written beautifully-produced book on the shelf and haven't looked at it since.

I can see why people would think "angles increase clockwise" is more natural than the existing convention ("angles increase counterclockwise") - it's the way clocks do it, right? Yeah, it's just convention, but when the established convention has been around for at least a century or two and there are libraries full of books and papers which use it, "more natural" still isn't good enough reason to break with it. And it really wasn't necessary to do that for their project.

I'm sure people who do software can think of lots of conventions which may even suck but will never be replaced.


The Godot game engine has its 2d graphics origin at the top-left of the screen, with the positive Y axis pointed down. Having clockwise trig functions is a natural consequence of that.

I think it's a fairly common setup for all 2D graphics software.
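A minimal illustration (Python, not Godot) of why that follows: with y pointing down, the standard counterclockwise rotation comes out looking clockwise on screen.

    import math

    def rotate(x, y, theta):
        # Standard "counterclockwise" rotation matrix.
        c, s = math.cos(theta), math.sin(theta)
        return (c * x - s * y, s * x + c * y)

    x, y = rotate(1.0, 0.0, math.pi / 2)   # rotate (1, 0) by +90 degrees
    print(round(x, 6), round(y, 6))        # (0.0, 1.0): y grows, which is *down* on screen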


I don't think I've ever seen a graphical system of any kind that didn't have 0,0 at the top-left corner of the monitor or viewport either, with positive x going right and positive y going down.

I actually didn't even think about it until now. Now it's going to bug me. God damnit. :V


Somewhat related, bitmap files (.bmp) store the image lines in bottom-to-top order, effectively putting (0,0) at the bottom-left.

It's the only image format I've ever seen that does that -- everyone else stores lines in top-to-bottom order, consistent with putting (0,0) at the top-left.


What do bitmap and excel have in common? :D

E.g. OpenGL normalized device coordinates have x=0,y=0 at screen center, with top left being x=-1,y=1. It's easier to reason with in many cases.

> I don't think I've ever seen a graphical system of any kind that didn't have 0,0 at the top-left corner of the monitor or viewport either, with positive x going right and positive y going down.

Is that because the electron beam in cathode ray tubes scanned from top left to bottom right?


OS/2's Presentation Manager had 0,0 at the bottom-left corner of the monitor.

I just checked and Adobe Illustrator does the same.

I think it’s interesting that they actually didn't change the rotation definition (from X+ toward Y+), but because it’s a visible change resulting from their inversion of the plane, people believe they did.


"more natural" is a good reason any time since otherwise you get to perpetuate these bad conventions for all the future generations to suffer.

What is "more natural" is a matter of opinion, and opinion will differ.

But even if there's a lot of agreement that an existing convention could stand improvement, that doesn't by itself make it "a good reason any time" for throwing out the existing convention.

What is a "convention"? It's something followed by a large "installed base". So changing a convention means a large cost will be incurred in changing up.

Who should decide whether the benefits of changing outweigh the costs? Someone has to pay for it, and simple fairness suggests that the people who will bear the costs of changing up should have the largest say.

The point is that just because someone thinks something new is better doesn't mean that old should be thrown out. And if you ignore that installed base, the change just doesn't happen.

We tried in the U.S. to switch over to metric years ago. Many of us think it would have made sense, but many more people didn't agree and it didn't happen.

It would be easier computationally if there were 100 degrees in a circle rather than 360. But the 360-installed-base is too large and the costs of changing are judged to be too great, so we're stuck with 360.

You're absolutely right, though, that suggestions for change should always get a fair hearing, and people who believe in them should go ahead and see if enough other people will sign on.


> I got to the chapter on trig functions and discovered that they'd decided to make angles increase clockwise.

That is an established mathematical convention, called "bearing". https://en.wikipedia.org/wiki/Bearing_(navigation)

> And there was a graph of the sine function, with the graph below the x-axis from 0 to 180 degrees.

But that definitely isn't a convention anywhere; bearing 0 has sine 1.

There isn't really one mathematical convention on "angles". There's a fairly strong one on angles that are named theta, but in a math class it's normal to orient phi in whatever way makes sense to you. As you trace a sphere, do you want phi to represent the angle between (1) the radius ending in your point and (2) the xy plane, as that angle varies from negative pi/2 to pi/2? Do you want it to represent the angle between (1) the radius ending in your point and (2) the positive z axis, as that angle varies from 0 to pi? That's your call. An increase in the angle just means it's getting wider; what direction that requires the angle to grow in depends on how you defined the angle and which of its sides is moving.


Regardless of relative directions around circles, the sine and cosine between 0° and 90° can be described unambiguously in terms of ratios between side lengths of right triangles. You could decide to define the functions differently, but then they'd no longer be the sine and cosine, they'd be something else. The whole point of having these two named functions is that they're a common ground, independent of whatever idiosyncratic angle measurements are useful for a given problem.

> Regardless of relative directions around circles, the sine and cosine between 0° and 90° can be described unambiguously in terms of ratios between side lengths of right triangles.

That's true, if there are no angles greater than 90° or less than 0°, as is the case in a non-pathological right triangle. In this case, as ratios of nonnegative lengths, all trig functions are always nonnegative.

If you want to include angles outside those bounds, then you care about what exactly occurs where, and while you can unambiguously define angles between 0 and 90 to have all positive trig functions, you can also unambiguously define them to have negative sines and tangents. Fundamentally what's happening is that you're defining certain line segments to have negative length instead of positive length. Which line segments should have negative length isn't a question about angles.

> You can decide to define the functions differently, but then they'd no longer be the sine and cosine, they'd be something else.

Only in a sense much stricter than what people generally use. Sine and cosine themselves are hard to distinguish - you can also call them sine (x) and sine (x - 270). Some people might argue that the sine of (x - 270) is still a sine.

> In general, the two functions can be described by their differential equations

If you do that, you'll completely lose the information about where sine is positive and where it's negative. You can apply any phase shift you want (as long as you apply it to both functions) and their differential equations will look exactly the same.


> If you want to include angles outside those bounds, then you care about what exactly occurs where, and while you can unambiguously define angles between 0 and 90 to have all positive trig functions, you can also unambiguously define them to have negative sines and tangents.

You could define trig functions differently, but then you'd need a separate pair of unnamed functions to express "the ratios of unsigned side lengths of a right triangle in terms of its unsigned interior angles". It's the same reason we don't count "-1 apple, -2 apples, -3 apples, ...". Or why horizontal and vertical lines usually fall on the x-axis and y-axis instead of the (1/√2,1/√2)-axis and (-1/√2,1/√2)-axis. We optimize for the common case.

> If you do that, you'll completely lose the information about where sine is positive and where it's negative. You can apply any phase shift you want (as long as you apply it to both functions) and their differential equations will look exactly the same.

What do you mean? "sin(0) = 0, cos(0) = 1, and for all x, sin'(x) = cos(x), cos'(x) = -sin(x)" is perfectly unambiguous. If you changed the initial conditions, you'd get another pair of functions, but then they'd no longer be the sine and cosine, they'd be some other linear combination. And for that, refer to what I said about the x-axis and y-axis: better to take the stupid simple (0,1) solution and build more complex ones from there.
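A crude numerical check of that claim, using nothing beyond the standard library (Euler steps, so only approximately right):

    import math

    # y' = z, z' = -y with y(0) = 0, z(0) = 1 should reproduce sin and cos.
    def integrate(t_end, n=200000):
        y, z = 0.0, 1.0
        h = t_end / n
        for _ in range(n):
            y, z = y + h * z, z - h * y
        return y, z

    y, z = integrate(1.0)
    print(y, math.sin(1.0))   # close to sin(1) ~ 0.8414
    print(z, math.cos(1.0))   # close to cos(1) ~ 0.5403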


> What do you mean? "sin(0) = 0, cos(0) = 1, and for all x, sin'(x) = cos(x), cos'(x) = -sin(x)" is perfectly unambiguous.

It's pretty straightforward. "sin(0) = 0" is not a differential equation. Any phase shift applied to sine and cosine will produce exactly the same set of differential equations that apply to sine and cosine; you can rename the shifted functions "sine" and "cosine" and you'll be fine.


> That is an established mathematical convention, called "bearing".

Bearing is a nautical convention not a mathematical one.

I have worked on boat computer systems and can assure you that all the angles were in radians going in the proper direction, while bearings were separate, always shown in degrees and clockwise.


As someone else noted, "bearing" isn't a mathematical convention. To be clear, when I say "mathematical convention" I mean what mathematicians do and teach. But I didn't know about "bearing" in navigation so thanks for mentioning that.

> There isn't really one mathematical convention on "angles".

There is for angles in the plane, which are the angles I was discussing. In every math course from trig where people first encounter angles in the plane they increase as you go counterclockwise. This is true in trig, precalc, calculus, ... You will not find a math textbook in which plane angles increase clockwise. I think that counts as a convention.

That convention determines the graph of the sine function, because sin theta is defined in trig courses as the y-coordinate of the point where the ray from the origin determining the angle intersects the unit circle. So (e.g.) if 45 degrees means 45 degrees clockwise, that ray is below the x-axis, and the y-coordinate of the intersection is negative -- and hence, sin 45 degrees would be negative.

If angles increase clockwise from the positive x-axis, then sin 45 degrees will be negative. And if the sine of 45 degrees is negative, then angles are increasing clockwise from the positive x-axis. And any mathematician would tell you that the sine of 45 degrees is 1/sqrt(2), not -1/sqrt(2).
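A quick numerical version of that point (Python, illustration only):

    import math

    # Counterclockwise convention: +45 degrees is above the x-axis, sine positive.
    # Measuring the same angle clockwise is -45 degrees, sine negative.
    print(math.sin(math.radians(45)))    #  0.7071...
    print(math.sin(math.radians(-45)))   # -0.7071...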

> ... in a math class it's normal to orient phi in whatever way makes sense to you.

You're correct that there are two prevailing conventions for the angle phi in spherical coordinates. Mathematicians measure phi downward from the positive z-axis, so it takes values from 0 to 180 degrees. (Actually, it's sort of like "bearing" that you mentioned.) Physicists measure phi upward from the x-y plane, so it can take values from -90 to 90 degrees. It does cause some confusion in teaching Calc 3, because students also taking a physics or astronomy course may be seeing two conventions for phi. However, in 3 dimensions (spherical coordinates) there's no natural "clockwise" or "counterclockwise".

But there is a convention for measuring phi in math classes -- it's the one I described above. Check any calculus book. Our colleagues in physics don't like it, but oh well. :-)


There's no regular engineering convention about the ^ exponentiation operator, because there is no exponentiation operator in regular or engineering math notation. The superscript used for exponentiation doesn't need a precedence level, it has a natural grouping behavior.

This is most visible in the fact that you never need parentheses around an exponent expression in math notation, but you need them a lot in programming notation. They are just different notations.

Consider in math notation:

   2+2
  3    + 5
Programming notation:

  3^(2+2)+5
Completely different notations in a much more fundamental way than how they treat unary minus.

It's insane to me that you'd throw away a whole language, toolchain, ecosystem, and community, just because you disagree with the caret operator not taking precedence over a unary minus. Do they not have parentheses on the keyboards where you're from?

The exponent operator is just short for the POWER function in Excel. So from that perspective it is the correct behaviour, I think, even if it is mathematically incorrect.

I recall reading that this was easier to implement in memory constrained machines running Excel back then, and now can't be changed for compatibility.

It has bitten me when I computed the pdf of a standard normal in Excel, invoking exp(-A1^2), say.

Someone made a website (in 2003, it's a bit out of date) tracking this issue:

http://www.macnauchtan.com/pub/precedence.html


But of course it can be changed, you can make it configurable even if you decide the default should continue to be this silly mistake "for compatibility"

There's no "mistake", the current way makes much more sense in programming notation. It's far better that -3^2 is consistent with -3+2. The caret notation for exponentiation is anyway completely different, and works very differently in terms of precedence from using super scripts in math.

That's a mistake in every statement: Excel isn't programming, so programming notation isn't even relevant

There is also no difference in the caret notation vs superscript, its upward-pointing form is literally meant to signify SUPERscript

It's far better that

0-3^2

-3^2

are consistent; consistency between exponent and addition makes little sense since by universal convention they have different priorities, so you wouldn't expect any "consistency" there. Also your -3+2 example is meaningless since its output is the same as

-((+3)+2)

so there is no inconsistency with

-9

And no, ^ doesn't universally work differently vs superscript, just in some poorly designed apps


Excel formulae are absolutely a form of computation, or programming.

While the caret is meant to symbolize superscript, it is nevertheless a completely different notation for exponentiation.

I don't see why 0-3^2 and -3^2 need to be consistent necessarily. Sign change and subtraction are different operations, so they can have different relationships with other operators.

If + worked like you want ^ to work, then -3+2 would equal -5, instead of the more common -1.

^ does work differently from superscript in all apps. The way you write "three to the power two plus two" is completely different.


Abacus is a form of computation, doesn't make it programming.

> While the caret is meant to symbolize superscript, it is nevertheless a completely different notation for exponentiation.

Wait, do you believe slash / to be a completely different notation with different rules for division?

Also, it's not completely different, I've already explained that its form points to the same participle - RAISing base to the power, exactly the same as superscript. It's just that input/typesetting on computers is very primitive, so you can't really use superscript conveniently, otherwise it's semantically the same, so having different rules for the same meaning makes no sense

> I don't see why 0-3^2 and -3^2 need to be consistent necessarily

ok, if you fail to see this basic similarity but somehow think -3+2 is identical, I don't have anything else to say here

> If + worked like you want ^ to work, then -3+2 would equal -5, instead of the more common -1.

Why would I ever want addition to work the same as exponentiation??? That's your weird wish for them to behave the same, I respect the math precedence of operators.

> ^ does work differently from superscript in all apps

that's not true, https://www.wolframalpha.com/input?i=-3%5E2

Many calculators / calculator apps also behave the same


In Wolfram alpha as well, ^ is a different notation than superscripts: https://www.wolframalpha.com/input?i=-3%5E2%2B2

And of course / is completely different from fractions too. Math notation is two-dimensional, and requires relatively few parentheses. Computer notation is uni-dimensional and requires parentheses all over the place.

This is what math notation looks like, try to write this in C/Excel/Wolfram Alpha without parens:

      2 + 2
    -3      + 4
   _____________ = 17
      2 + 3

> try to write this ... without parens:

Why? This extra condition doesn't help you, which is why your link shows nothing: it behaves exactly as I'd expect. ^ is identical to superscript; you're just making the implicit mistake of thinking +2 is somehow covered by ^ and would be part of the superscript, but it wouldn't be. That's a different source of ambiguity

What would help is an example where parens aren't needed, but nonetheless slash would mean something else vs horizontal line, like in the original example

That's how you show semantic "completely different"


The math notation doesn't need parens. The computer notation needs parens. This alone is simple obvious proof that they are different notations, at least in my understanding of what a notation is.

-3^2 in the math notation doesn't need parens only because of operator precedence.

The fact that the computer ^ requires parens in more cases, like -3^(2+2), is irrelevant here and doesn't let you justify different precedence rules (and your downgrading from "completely different" to "different" isn't a proof, just "tautology". Hey, they also look different, so they are different!)


I don't have a horse in this seemingly extremely important race, but Excel absolutely is programming

It is configurable - with brackets.

You confuse configuration with data entry

But it's 9!

Check it: https://www.mathplanet.com/education/pre-algebra/explore-and....

"You also have to pay attention to the signs when you multiply and divide. There are two simple rules to remember: When you multiply a negative number by a positive number then the product is always negative. When you multiply two negative numbers or two positive numbers then the product is always positive."

So basically you have -3x-3 and result is 9.


No - in a subtle way, you're assuming that the unary minus has precedence, when the point is that it doesn't.

You're right that "thing^2" means "thing times thing", but in "-3^2", what is it that is being squared? To write it, as you did, as "(-3) x (-3)", assumes that in "-3^2" the thing being squared is "-3". But that in turn assumes that the unary minus is done before the square. By the standard mathematical convention, in "-3^2" the thing being squared is "3". So you do "3 * 3", then you negate the result and get "-9".


Interesting. In our high school math class, when -3^2 was handwritten it was OK to interpret it as (-3)^2. But maybe that was because spacing was used as informal parentheses.

Leaving "-3²" to be "interpreted" according to context is not a good idea. Mathematical notation should be unambiguous and context-free as much as possible, not "do what I mean". A computation involving "-3²" will be right or wrong depending on what it means; that shouldn't depend on who is reading it or under what circumstances.

Not to unduly slight your teacher, but it could be they weren't sure about what "-3²" means. Everyone who teaches has gaps in their knowledge -- I sure did. :-)


Precedence rules are mostly a formalization of the underlying basic convention that the operations whose terms are written closer together are done first.

Mostly to make polynomials easy.

But if we’re doing math mostly on computers, we should adopt rules that make writing on computers easy — not pedantically insist typing code follow the rules of handwriting polynomials.


It was the same for me. The question seems to be whether "-3" here represents the operation "negate" or the number "negative 3". If it's the latter, then -3^2 is (-3)^2.

Thanks. I had no idea. Seems counterintuitive at first, but it makes sense, since ’-3’ is '0 - 3'.

This is a good way to think of it. It would be very confusing if “0 - 3^2” and “0 + (-3^2)” should mean different things.

But what about (0-3)^2?

You said that “-3” = “0-3”.

So we have “-3^2” is “(0-3)^2” is 9. Agreeing with -3^2 = 9.

You’re performing a sleight of hand when you define “-3” to be “0-3”, but move the parenthesis to get your second equation. You have to insert your definition as a single term inside parenthesis — you can’t simply remove them to change association (as you have done). That’s against the rules.

So if you think “-3” is “0-3”, then you should agree the answer is 9.


> (0-3)^2?

It’s entirely unambiguous due to the parenthesis.

I don’t think the rest of your argument actually makes sense.

There is no sleight of hand required. The original argument is entirely related to having unary minus and binary minus which are different operators conceptually have similar precedence as being less surprising.


My point is that you can’t define “-3” as “0-3” to make it work: you’re assuming exactly what’s being debated (via sleight of hand) when you insert the terms with brackets in the way you do rather than the way I do.

When you try to swap in the unary operator without that to make it “less surprising”, you get 9.

Precisely what you said was wrong about the unary operator (in Excel).


But no one is defining -3 as (0-3). You are entirely missing the point. I am going to quote myself again:

> The original argument is entirely related to having unary minus and binary minus which are different operators conceptually have similar precedence as being less surprising.

And no, you don’t get 9 when you swap the unary operator. That’s the whole point and why it’s surprising that Excel did reverse the precedence for implementation easiness.


> You said that “-3” = “0-3”.

No, I didn’t.


Isn't this treating 3 as something that can be analyzed out of -3? Like -3 is the union of 3 and the (-), and not something unto itself. Ought it to be the case that negative is more than just a sign? Like, I think 3 should be ontologically distinct from -3, even if the addition of 3 and -3 = 0. Idk, I don't have a reason for this, it feels right for some reason.

You are mistaking the notation for the object.

There are indeed two ontologically different elements, 3 and (-3), in Z. The question, however, is purely about the meaning of the representation -3^2, which is ambiguous without precedence rules.

Note that it gets more complicated quickly if you want to keep thinking about it that way: mathematicians often consider ontologically different but equivalent operations as the same when it's irrelevant to what they are doing or the results trivially extend to both cases. See for example 3-3 and 3+(-3).


I mean, all equations are true if all terms are set to equal zero, so the term itself must have some sort of ontological distinction.

-3^2 is not a standard mathematical convention. -3² is, but that is completely different in many other ways.

https://www.wolframalpha.com/input?i=-3%5E2

https://en.wikipedia.org/wiki/Order_of_operations

Parentheses, Exponentiation, Multiplication, Division, Addition, Subtraction

-3^2 would then be correctly parsed as -(3^2) which is -9.

Parsing it as (-3)^2 would require the addition of parentheses.

This gets to the special case of the unary minus sign... which the Wikipedia article specifically calls out.

    Special cases

    Unary minus sign

    There are differing conventions concerning the unary operation '−' (usually pronounced "minus"). In written or printed mathematics, the expression −3² is interpreted to mean −(3²) = −9.

    In some applications and programming languages, notably Microsoft Excel, PlanMaker (and other spreadsheet applications) and the programming language bc, unary operations have a higher priority than binary operations, that is, the unary minus has higher precedence than exponentiation, so in those languages −3² will be interpreted as (−3)² = 9. This does not apply to the binary minus operation '−'; for example in Microsoft Excel while the formulas =-2^2, =-(2)^2 and =0+-2^2 return 4, the formulas =0-2^2 and =-(2^2) return −4.
(edit)

Digging into this a little bit more...

https://www.gnu.org/software/bc/manual/html_mono/bc.html#TOC...

    The expression precedence is as follows: (lowest to highest)

    || operator, left associative
    && operator, left associative
    ! operator, nonassociative
    Relational operators, left associative
    Assignment operator, right associative
    + and - operators, left associative
    *, / and % operators, left associative
    ^ operator, right associative
    unary - operator, nonassociative
    ++ and -- operators, nonassociative

    This precedence was chosen so that POSIX compliant bc programs will run correctly. This will cause the use of the relational and logical operators to have some unusual behavior when used with assignment expressions. Consider the expression:
    ...
This brings us to the POSIX specification for bc https://pubs.opengroup.org/onlinepubs/9699919799.2018edition...

This also shows the unary - having higher precedence than ^.

https://github.com/gavinhoward/bc/blob/master/manuals/develo...

    This document is meant for the day when I (Gavin D. Howard) get hit by a bus. In other words, it's meant to make the bus factor a non-issue.

    This document is supposed to contain all of the knowledge necessary to develop bc and dc.

    In addition, this document is meant to add to the oral tradition of software engineering, as described by Bryan Cantrill.
... now, it would be interesting if gavinhoward could clarify some of the design thoughts there (and I absolutely love the oral tradition talk).

Precedence is a property of notation and operators, not a property of operations. The ^ notation for exponentiation is fundamentally different from the common superscript notation, and so it is only natural that it has different precedence rules. The biggest difference is in how exponent expressions are handled, of course: in math notation, the entire expression is written in superscript and no parentheses are needed; in computer notation, you always need parentheses around the exponent expression.

Gavin Howard here.

There really isn't a design choice to be made. POSIX requires unary negation to have higher precedence.

The only precedence change (I can remember) from GNU bc is that I changed the not operator to have the same precedence as negation. This was so all unary operators had the same precedence, which leads to more predictable parsing and behavior.


The question that I'm curious about is "why does POSIX require unary negation to have a higher precedence?"

Was it a "this is the way that bc worked in the 70s because it was easier to write a parser for it?" or was there some more underlying reason for the "this problem gets really icky if unary negation has lower precedence than the binary operators and makes for other expressions that become less reasonable?"

It's like the Logical XOR issue ( https://youtu.be/4PaWFYm0kEw?t=2236&si=Wi0gwV-XctLGN98I ) ... and I'm of the opinion that there's a real reason why this design choice was made.

(Aside: Some other historical "why things work that way" touching on dc's place in history: Ken Thompson interviewed by Brian Kernighan at VCF East 2019 https://youtu.be/EY6q5dv_B-o?si=YKr4j_FAEp-OihiX&t=1784 - it goes on to pipes and dc makes an appearance there again)


Unfortunately, I can't really say much about historical design.

Fortunately these rules are natural and easy to remember, especially the details of how different languages choose different orders, relative to the confusing and abhorrent prefix notation (- (^ 3 2)) or (^ (- 3) 2).

As a lover of prefix and postfix notation, there is an unambiguous parsing of each of those that does not depend on any order of operations. Neither lisp nor forth have the question at all - you can only write it exactly as you mean it.

    (- (expt 3 2))
is always -9 without needing to ask which has higher precedence.

    (expt -3 2)
is likewise always 9. There is no question if - is a binary or unary operator in prefix notation and what its order of operation should be.

Likewise, in dc

    3 _ 2 ^ p
    _3 2 ^ p
and

    3 2 ^ _ p
where '_' is the negation operator (it can be used for writing -3 directly as _3, but _ 3 is a parse error) return their results without any question of order of operations.

When you start touching infix, you get into https://en.wikipedia.org/wiki/Shunting_yard_algorithm which was not a fun part of my compiler class.

(And yes, I do recognize your credentials ... I still think that lisp and forth (above examples for dc) are better notational systems for working with computers even if it takes a bit of head wrapping for humans).


Notation is a tool of thought and of communication. The clearest example of infix botching both is probably perl style regex.

Postfix is interesting in forth. It makes the stack manipulations very easy to reason about, and the stack is very important there so this looks like a win. The cost is in coherently expressing complex functions, hence the advice to keep words simple. The forth programmers are doing register allocation interwoven with the domain logic.

Lisp makes semantics very easy to write down and obfuscates the memory management implied. No thought goes on register allocation but neither can you easily talk about it.

Discarding the lever of syntactic representation is helpful for communication and obstructive to cognition. See also macros.


Maybe something's wrong with my terminal. In dc:

    3 _ 2 ^ p   gives 0

    _3 2 ^ p    gives 9

    3 2 ^ _ p   gives 0

    5 _ p       gives 0

    _5 p        gives -5
You didn't intend that I should get those zeros, right?

     ~ % dc -v
     dc 6.5.0
     Copyright (c) 2018-2023 Gavin D. Howard and contributors
     Report bugs at: https://git.gavinhoward.com/gavin/bc

     This is free software with ABSOLUTELY NO WARRANTY.
     ~ % dc   
     3 _ 2 ^ p
     9
     _3 2 ^ p
     9
     3 2 ^ _ p
     -9
     5 _ p
     -5
     _5 p
     -5
     (control-D)
The version that I have appears to have _ parsed as an operator in addition to the negation of a numeric constant.

I am the author of that dc.

You are correct about its behavior.

See https://git.gavinhoward.com/gavin/bc/src/branch/master/manua... (scroll down to the underscore command).


I'm using the version from Debian testing:

    dc -V
    dc (GNU bc 1.07.1) 1.4.1
In fact, in my version "-v" as opposed to "-V" isn't recognized as a valid option.

    3 _ 2 ^ p
    0

I am Gavin Howard, the author of the other dc.

My dc does have a few differences from the GNU dc. I added the extension of using _ as a negative sign.

That is why you are both seeing behavior differences.


Thanks for looking all this up, particularly the stuff about spreadsheets. I wonder what real early spreadsheets like Lotus 1-2-3 or even Visicalc would do with "-3^2". I have to feel one of the early spreadsheet programmers just made a mistake, because I can't think of a practical reason for going against the mathematical convention if they were already aware of it. Or maybe it comes from bc? ... does bc predate modern spreadsheets?

(Well, I just tried "=-3^2" in an org-mode table and it gives "-9".)


You've got to go even earlier... which is where the bc link becomes interesting. It's from 1975.

Lotus 1-2-3 dates from 1983... I can't find a copy of it that is runnable.

VisiCalc would be a good one to look at at 1979. It also presents 9 https://archive.org/details/VisiCalc_1979_SoftwareArts

You've also got sc https://en.wikipedia.org/wiki/Sc_(spreadsheet_calculator) from 1981.

    docker run -it ubuntu:latest
    # apt-get update
    # apt-get install sc
    # sc
    = -3^2
And you'll see 9.00 (screen shots of those two https://imgur.com/a/L0ZvJlP and the one from VisiCalc )

This is the way it's worked for a long time.

---

(edit / further thoughts)

I believe that the underlying issue is that unary - (negation) and binary - (subtraction) use the same operator and you need the unary one to have a very high precedence to avoid other problems from happening.

Consider the expression: 2^-2

Is that 0.25 or a parse error?


Unary and binary - can be given different precedence, because there is never ambiguity as to whether you're in front of a unary minus or a binary minus. A binary operator is never encountered at the beginning, or after a parenthesis, or after another binary operator, therefore those are the cases where you'd have a unary - or +.

Rather, the problem is whether -2 is parsed as a numeric literal, or a unary minus followed by a numeric literal (which would only include positive numbers).
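The usual disambiguation rule is purely positional; a little Python sketch of it (the token names here are made up for illustration):

    UNARY_CONTEXT = {"+", "-", "*", "/", "^", "(", None}

    def classify_minus(prev_token):
        # '-' is unary at the start of an expression (prev_token is None),
        # after '(' or after another operator; otherwise it is subtraction.
        return "unary" if prev_token in UNARY_CONTEXT else "binary"

    print(classify_minus(None))   # start of expression -> unary
    print(classify_minus("^"))    # as in 2^-2          -> unary
    print(classify_minus("3"))    # as in 3-2           -> binary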


Why would the expression 2^-2 pose a problem? There is no ambiguity there that requires operator precedence to resolve.

Correct, but irrelevant to the question. The usual rules of math require it to be parsed like -(3²), so there are only positive numbers being multiplied.

How we write negative literals in expressions is the issue. Nobody would seriously think -x^2 = -4 has no real solutions. But if we swap out x for a literal number people start interpreting it ambiguously as maybe a negative number exponentiated, but fundamentally it's the same syntax as before.

The problem with Excel is that it tries to do almost everything in one software tool: front-end (editing, presentation), analysis, semantic modeling, data storage/database.

Almost all reasonable engineers see that there is something wrong with such an approach. But almost all everyday computer users think that this is the way computing has to be.

Sometimes I wonder why even I voluntarily open it for certain tasks - anyway, despite all the criticism, Excel has reached the Lindy[1] threshold for me and is here to stay.

[1] https://en.wikipedia.org/wiki/Lindy_effect


The biggest problem is that the development cost for a solution 'a bit better' than a complex Excel spreadsheet is often stratospheric.

Many tech start-ups are 'replace this thing people do in Excel with a purpose built tool'


Sometimes people need to be saved from themselves. E.g. spreadsheets can accumulate mistakes very quickly when people treat them as databases and start copying columns between spreadsheets, assuming that the primary keys (e.g. in one column) are identical while they happily add and move rows around.

To be fair, there are no good, accessible database tools around for your average non programmer user.

Access tried to be this a decade ago, until MS started to let it die. So now, your only option is basically Excel. There's a reason it's the main thing people gravitate into.


For really simple databases, I've had limited success getting people to use a UI in front of SQLite:

https://sqlitebrowser.org/


> Access tried to be this a decade ago

three decades ago :)


Claris FileMaker was that tool. Easy to understand and provided clear separation between data and the rest. Bundled with Mac OS so everyone used it. It is now an expensive enclave.

My problem with excel is that I love it too much, even though I know that I maybe shouldn’t. It hits some perfect point between structured and unstructured data that is conducive to just filling in cells and seeing what happens.

I say “I shouldn’t” because the off-ramp from a working solution to a proper productized code-based approach can be very painful.


Spreadsheet are wonderful and powerful, Excel is ass. Love the concept, hate the implementation.

Unfortunately, it’s also industry standard ass.

Excel can also send mail through VBA scripts, thus satisfying Zawinski's Law[1].

[1] http://www.catb.org/jargon/html/Z/Zawinskis-Law.html


Except that it's proprietary and will eventually be un-maintained and stop working. But spreadsheets, in general, fall under the Lindy effect and open source software will continue it for centuries to come.

Although excel spreadsheets have had more longevity than lots of other formats…

You might have had a point in the days of .xls, but Excel by default uses .xlsx now which is just XML packaged inside a .zip archive if I recall.

And yes, Excel still fully supports .xls too.

I fear whatever format LibreOffice uses will die first, case in point I don't even remember what it's called even though I should as a computer nerd.


MS-OOXML is barely an open format. Have you tried implementing it? ECMA-376 part 1 is over 5000 pages, and there are four parts to it! (Part 1 contains an extra bit about SpreadsheetML, but by that point we were two zips deep. I turned back ere I got lost.)

The OpenDocument formats, meanwhile, are older, simpler and better than their MS-OOXML equivalents. (The ODF spec is 1041 pages altogether – 215 pages of that are the spreadsheet formula language.) LibreOffice's implementation is a little janky, sure, but I can edit OpenDocument files by hand. Try doing that to a MS-OOXML file. (Good luck.)


It's easier for me to imagine a world in which AI makes spreadsheets invisible to the modern person than it is to imagine a world in which Excel isn't the de facto spreadsheet.

We have a saying in German: A bad worker always blames their tools.

Excel has many quirks, but I'm still very grateful that it exists, for quickly putting together some numbers and still being able to change the inputs to my formulas.


But at the same time, there is a saying in user experience design: don't blame the user.

At least at the time of the article, there was no way to disable the auto-conversion of certain strings (like "SEPT2") into dates. A setting to disable this would have prevented many of the errors made by researchers working late at night or rushing to meet a deadline.

It's true that there has to be some point where the users of the tool need to put in the effort to learn how to best use it. But effort poured in from the other end by the developers, too, can go a long way to prevent common errors and save users time.


A poor worker blames their tools, but no-one can write the Bible on a stamp with a carrot...

Again, this saying has a nuance people often miss.

You don’t blame your tools.

Not all tools, that is. Not tools forced on you by some archaic industry standard or habit. Not stupid tools you'd never use otherwise but have no choice about.

> Excel has many quirks, but I'm still very grateful that it exists, for quickly putting together some numbers and still being able to change the inputs to my formulas.

That's nice, but Excel didn't invent spreadsheets. It invented adding BS to them, and if it didn't exist, you'd still have some WhateverCalc successor available at the moment.


I think what you mean here is that you are grateful that spreadsheet applications exist.

https://simple.wikipedia.org/wiki/Spreadsheet


In Polish we have: a bad ballerina is hindered by the hem of her skirt

In Arabic, “the one who can’t dance says the floor is crooked”

A bad dancer finds even his ballsack annoying.

That's interesting. In English, the idiom is: "It's a poor artist who blames his tools."

As a native English speaker, I've actually only heard 'a bad workman blames his tools', not the artist form. Now that I google it, did you mean craftsman? I don't see any references to artist.

But why oh why can't you disable automatic date reading? Put something in settings with a checkbox so that I, the user, can tell Excel to F off with that nonsense.

Is there anybody who can argue the 'for' case for having this on all the time without recourse?


Introducing checkboxes somewhere in the bowels of settings adds more ways for things to go wrong.

What is the default? Do the defaults differ across versions? How do you keep it consistent across computers and installations? What if you actually need the function ad hoc?

This reminds me of CSV export. I haven't used Windows for a decade but I remember that if you wanted to change how decimal numbers were exported you had to change the locale and reboot the computer. To change a setting in Excel. That is insane. Sprinkling checkbox patches isn't too far from this.


> What is the default?

The old behavior

> Do the defaults differ across versions?

No

> How do you keep it consistent across computers and installations?

It should be per-cell, so it's document specific.

> What if you actually need the function ad hoc?

Every option in the formatting pane is ad-hoc, this wouldn't be any different.


Aside from setting all your cells to text, they have added checkboxes for this now.

https://insider.microsoft365.com/en-us/blog/control-data-con...


I just tried this out. I'm using Office 2021, and when I first opened Excel, this setting was there just like in the link, and it seemed to work when I entered a few strings. 'Fantastic!' I thought.

Then I closed it and thought of a few other date-like strings to try, and this time the option had disappeared! Every date-looking string was instantly turned into a date! I tried a few more times and the setting is gone. WTF is that about?


Have you opened an older file? Maybe it only works with files created with the feature. I’ve no idea why that would be the case, but it’s all I can think of.

Both times (and a few times after that), I was going into Excel's settings before opening any file or starting a new one. I've also tried after opening files, same thing. The setting is just not there like it was the first time.

I had a similar issue just two weeks ago at $DAYJOB. I was scraping a log file and I had written an awk script to convert the HH:MM:SS.uuuuuu timestamps to nanoseconds since epoch for ingestion by another tool. Little did I know that awk uses floating point for numbers internally, and so my conversion function was incorrectly rounding the result, which messed up my later analysis in arcane ways. To add insult to injury, I had the same problem again when paging through the data in Visidata, but there was an easy workaround I don't exactly remember.
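
The root cause is that a 64-bit float has only a 53-bit mantissa, while nanoseconds since the epoch are around 1.7e18 these days, so float arithmetic has to round them. A rough Python illustration (the timestamp is made up):

    ns = 1_700_000_000_123_456_789   # made-up nanoseconds-since-epoch value
    print(int(float(ns)) == ns)      # False: beyond 2**53, doubles lose the low digits
    print(int(float(ns)) - ns)       # the rounding error, in nanoseconds

    # Doing the conversion in integer arithmetic keeps it exact,
    # e.g. for a "13:45:07.123456" timestamp:
    h, m, s, us = 13, 45, 7, 123456
    print(((h * 3600 + m * 60 + s) * 1_000_000 + us) * 1_000)   # exact ns since midnight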

At work we have had problems multiple times with Excel changing the values of numerical entries due to different locale formatting, often causing hours of lost time until the culprit (Excel) was found. Other types of values that are completely inconvenient for Excel include social security numbers and phone numbers, which one would assume are more general purpose, and thus closer to Excel's standard use cases, than genes. Still, Excel messes these up. Opening a CSV file is also a non-trivial, overly complicated endeavour for Excel in the year 2024, when it should be a double click (somehow other spreadsheet programs manage to handle CSV files much more easily). The worst part is that Excel is usually used by non-technical people, who misunderstand these idiosyncrasies, and time is wasted over and over.

My work frustration with Excel is not so much Excel itself, but that people will destroy / delete / not understand formulas (or think they're improving them…) and then the end result of their edits is garbage. However, this is usually my fault for not using the protect / lock features well enough. I do know that notes and explanations get overlooked. I've mitigated this slightly by almost always keeping, in our shared folder, a blank starting copy that's just a backup to start from, alongside the actual one to be used.

Here's another interesting article on the same topic (2016) entitled "Gene name errors are widespread in the scientific literature" : https://genomebiology.biomedcentral.com/articles/10.1186/s13...

From the article:

"The problem of Excel software inadvertently converting gene symbols to dates and floating-point numbers was originally described in 2004. For example, gene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to ‘2-Sep’ and ‘1-Mar’, respectively."


I studied bioinformatics in university back in 2005, and the first subject we were trained on was not Python or Java, but using Excel. Our teachers explicitly told us about excel's conversions to dates and other pitfalls. I find it mind-boggling that some researchers still made these mistakes 10 years later.

Excel is a wonderful tool. But you need to learn your tools and find out about possible footguns.


Haranguing your users not to make mistakes is utterly worthless. The only way to move the needle is to design systems that are easier to use correctly.

Excel is not a genetics tool. It has millions of use cases, many more important than genetics. Excel's designers didn't have this one in mind.

It's like people complaining because sugar gets misused. Or that murderers stab people with knives. The solution isn't to "fix" knives.


It's not like you have to be in genetics for excel to bite you. I once had a “business” column in a CSV with one cell set to July 11. Why? Because the business was 7/11. (I assume in a different locale it would've become November 7?)

The simple solution is to do what every CSV → DataFrame library does, which is to ensure columns have a homogeneous type. In this case a single non-date entry in a column would be enough to treat the whole column as strings.
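
pandas, for example, types whole columns rather than individual cells, and won't guess at dates at all unless asked; a minimal sketch with made-up data:

    import io
    import pandas as pd

    csv = "business,revenue\n7/11,100\nAcme Corp,250\n"
    df = pd.read_csv(io.StringIO(csv))

    print(df["business"].dtype)      # object: the whole column stays text
    print(df["business"].tolist())   # ['7/11', 'Acme Corp'], nothing becomes a date
    print(df["revenue"].dtype)       # int64: a uniformly numeric column is typed as numbers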


Excel is a tool for general-purpose data processing. What the genetics people are doing is exactly that.

General purpose means you're trying to meet general needs, not every specialised purpose.

Also, banning knives in certain places, like clubs, is a somewhat workable solution, so people don't become drunk murderers in the first place.

This is an awful mindset, and I'm tired of people acting like tools only ever have a pre-defined purpose and you need a special-purpose tool to do anything. Also, that's a dumb analogy. Here's a better one: a knife that breaks 20% of the time when you try to cut pork, because you didn't buy a separate pork-cutting knife.

I'm not doing anything nearly as special and always have dates import as numbers for whatever reason. Thanks microsoft.


How about a knife that breaks 0.00001% of the time because it's used a billion times a day?

If excel broke 20% of the time, I’d agree. But it rarely breaks. It’s just widely used.

I’ve used Excel for decades. I just set the data types on my columns. The reason Excel does that is because the vast majority of people like it and rely on it. And changing it now will break millions of workflows.

People assume their workflow is super important and worthy of software making special exceptions just for them. There’s an easy solution that people can follow now. Let’s focus on that rather than introducing a “fix” that breaks it for other people.

The Excel team has thought about this, and there's no simple fix. Nobody is forced to use Excel.


What are you talking about? You responded to the wrong person because this has nothing to do with what I said.

> Also that's a dumb analogy. Here's a better one: A knife that breaks 20% of the time when you try to cut pork, because you didn't buy a separate pork-cutting knife.

Excel doesn’t break 20% of the time. It rarely breaks. I think you’re assuming that genetics is more worthwhile than the millions of other uses. Think about how widely it’s used and your analogy doesn’t work very well.


Perhaps the first time you start Excel there could be a short setup tutorial that sets the default parameters you want for every spreadsheet you create.

I don’t think it would have much of an impact. And I hate those forced wizards and tutorials as I use excel in lots of environments.

I remember the Excel team writing about why they didn't add advanced settings to turn it off. I don't remember the rationale, but I'd rather have some switches I can set for the situations where I don't want it.

Although I do want it on, and I just check my data types. And for the most part I solved this by opening and never editing in Excel. The one hack I've gotten coworkers to stick with is "don't click save" when opening large files in Excel.


This isn't Microsoft lecturing them.

It's entirely valid for the school to tell students to avoid easily avoidable pitfalls.


Of course they can lecture them — it just isn't going to work. The error rates will barely budge.

There is one effect: it allows smug gloating about how stupid, lazy, and irresponsible these users are.

Blaming the user is the last refuge of the incompetent.


> Blaming the user is the last refuge of the incompetent.

Why then is it the dominating mindset in software design today, and advertised as being the opposite to the mindset that gives you Excel?


> The only way to move the needle is to design systems that are easier to use correctly.

If you do it like most software vendors do, by simplifying and removing functionality, you're moving the needle in the wrong direction.


Depends I think.

Ideally, the UI shows only those things that are relevant in the current context.

And the context of beginners is very small, so they don't need to see advanced tools they will never use anyway. But it is certainly not right to also remove the tools that advanced users do need.

It is possible, though, to make UIs that can be customized.


Do you think the comment is written by Microsoft??? The sloppy researchers are not his users. Another way to move the needle is to place higher value on technical competence.


One wonders how much better the world would be if these scientists took a week to learn enough Python and pandas.

(Certainly possible, I teach "I don't want to be a programmer"-types all the time. Taught a class this week in fact.)


I work with this type of scientist, supporting them in an IT capacity. Many of them do try to learn Python, and I've been quite impressed by it, but realistically Excel isn't going away any time soon regardless. Sometimes it's the right tool for the job.

Missing checkbox. "Assume Everything Is A String"

We used to hit all kinds of Excel weirdness with inventory etc.

It was our fault that our part numbers could look like this:

00010190-95.020


That feature already exists. It's technically not a "checkbox", but you can set a cell (or all of them) as string type in like 3 clicks.

The problem was that you could not import the data without it being destroyed. By the time you could see it in Excel, the conversion had already happened, with no way to get the original text back.

Still, an easy footgun to stumble into if you weren't mindful when setting up your spreadsheet

I know that the "anything goes" nature of Excel is part of what makes it such a universally reached-for tool, but every time I see something like this, I think that there would be a clear use case for a more restrictive spreadsheet.

9/10 spreadsheets are tables, but because they're spreadsheets they inherit the behaviour of "no consistent behaviour in columns, everything is independent and different".


You can set the format of the column when making a table. Much more than that and you've ended up at Access. The problem here with Excel is its wonky default until you've made that explicit choice.

Even when this paper was published, errors due to Excel-mangled gene names had been well known for at least a decade.

Somewhat recently, some of the more error prone genes were renamed to accommodate Excel. (Ex: SEPT7 -> SEPTIN7)


Only one in five? Evidence of egregious overengineering by the Excel team.

This seems to be a problem quite similar to dynamic versus static typing.

The users expect that Excel will figure out what to do with the input correctly.

But you can manually tell Excel what to do with data in a column.


I use Excel a fair bit at work, mostly for looking at data, but also to fairly easily generate SQL statements based on such data, often after some formulas have been applied.

Though I think what I'd really want is some tool which has the same grid-like visualization, filtering and direct data entry, but is code-based under the hood and without magic conversions.

So, take Excel, and when you enter a formula in a cell, it actually writes a line of code for you, which you can inspect and edit. Including adding your own functions and such.

When importing delimited text data, you'd have to specify what the data is in each column. It should still save the original text data so you can change your mind, but yeah, no automagic stuff.


The Data Wrangler extension for VS Code might be of interest. As you apply operations to your dataset, it generates Pandas code.

https://code.visualstudio.com/docs/datascience/data-wrangler

https://www.youtube.com/watch?v=5tWJVLF6PuA
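
For a sense of the workflow, the generated code is ordinary pandas. This is only a sketch of the general shape such a tool emits, not Data Wrangler's actual output; the data and column names are made up.

    import io
    import pandas as pd

    raw = "sample_id,conc,notes\n0007,1.25,ok\n0042,0.80,check\n"

    # Load with an explicit string dtype so IDs keep their leading zeros,
    # then drop, rename and retype columns as separate, inspectable steps.
    df = pd.read_csv(io.StringIO(raw), dtype={"sample_id": str})
    df = df.drop(columns=["notes"])
    df = df.rename(columns={"conc": "concentration"})
    df["concentration"] = df["concentration"].astype(float)
    print(df)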


Yes, something like this has been on my mind for quite some years as well.

It would be limited to programmers, though.



Excel will convert many CUSIPs (strings) into numbers (scientific notation). For a relatively comprehensive list, see below.

https://www.empirasign.com/cusip-excel-rosetta/

Barring some types of corporate actions, CUSIP numbers cannot change, and I doubt the ABA is aware of this issue.
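
The failure mode is that an all-digit identifier with an embedded "E" parses cleanly as scientific notation, so any tool that eagerly converts numeric-looking text will silently rewrite it. A tiny illustration (the identifier below is made up, merely CUSIP-shaped):

    cusip_like = "123456E78"   # 9 characters, embedded "E", all the rest digits
    print(float(cusip_like))   # 1.23456e+83: the original identifier is gone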


So, the question is: Would there be fewer or more errors if something other than Excel was used? Has much rigorous research been done on this?

One should probably also ask how much research would not have gotten done, or what other issues would crop up, if less tech-savvy researchers didn't have a beginner-friendly tool like Excel available.

Unfortunately researchers can't fork the universe with a different set of defaults for the rigorous study you appear to want.

Fixing the headline...

One in five genetics papers have errors caused by mistakes in the use of Excel

As if longhand calculations never have errors?


The difference here is that the excel errors are systematic so should be more easily avoidable, whereas long hand calcs are more varied and harder to correct.

I use Excel a lot. These things don't trip me up in my field. Could it be that this field is a rather niche use case, and therefore the users should learn to watch out for them?

I always felt that there is a hierarchy for working with numbers that goes something like this

Ease of use:

1. Excel

2. SQL

3. Functional programming (e.g. Scala, Python to some measure e.g. Pandas)

4. Imperative programming (C/C++/Java)

But then there's another hierarchy that (roughly) goes in the other direction, which is about quality, repeatability and tooling.

If you are at 1 or 2, your responsibility will not be about writing tests and verifying your code using traditional engineering methods.

However! You are responsible for cross-checking your results against the input. This may be a manual process. But actually looking at the numbers from several different angles can give higher quality than writing contrived test cases (in 3 or 4).


2 s,SQL,j

https://www.jsoftware.com/indexno.html

Also:

https://www.jsoftware.com/help/dictionary/intro.htm

EDIT: The help section has 6 books. If you want, you can teach yourself advanced math stuff with very few lines. I suggest installing Gnuplot as a dependency, for plots.


Very good! I believe you.

Unfortunately I feel like it would be irresponsible to transition our stack to working with J because of available competence and relearning.

I can see it being used in research though.


At least we got rid of Access?

I've seen things created in MS Access that can't be unseen.


Let me reword this. "One in five genetics papers contains errors because the authors were careless when using Excel".

I dislike Excel for what it does (EUCs, nearly impossible to track changes, etc.) But on the other hand it is an amazing tool.


> One in five genetics papers contains errors because the authors were careless when using Excel

Or Excel is a remarkably easy tool to mishandle because of generally unexpected transformations it makes 'for you' automatically in an easy to miss way.


You think Excel should make special accommodations because genetics authors misuse it?

Excel has been acting this way since before bioinformatics existed. Authors need to use their tools properly.


Excel (and GSheets) are tools where I think a stripped down "trust my input, I know what I'm doing" mode that doesn't try to intuit user intent would be useful.

As I type this comment and my phone miscorrects "intuit" to "Intuit," I think also Google keyboard could benefit from such a mode that only handles spelling mistakes but doesn't replace uncommon words with common brands, etc.


But if you actually do know what you are doing, you know to set the cell type appropriately.

Yes, of course. Every cell as type "Text" so it doesn't screw up everything I paste...

The real problem is the behavior of the default "General" type, which actually means "guess at every value and mangle all my data."

I frequently have to paste in strings which consist of zero-prefixed numeric IDs. I know very well to make sure the column is text before pasting, but other users don't always remember and frequently get their data messed up by the behavior of "General", which assumes that what you wanted was an integer and thus "helpfully" strips all the leading zeros.
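
The same trap exists in any tool that infers numeric types by default; in pandas, for instance, you have to opt out explicitly (the IDs below are made up):

    import io
    import pandas as pd

    csv = "id\n00123\n00456\n"

    print(pd.read_csv(io.StringIO(csv))["id"].tolist())
    # [123, 456]: inferred as integers, leading zeros silently stripped

    print(pd.read_csv(io.StringIO(csv), dtype={"id": str})["id"].tolist())
    # ['00123', '00456']: forcing text, much like Excel's "Text" cell format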


I get all the points made (and the many more not included).

The point is that Excel works great for 99% of the people and for 99% of the use-cases. I am a heavy excel user, for my financial planning, work, etc. And it pisses me off when I see a column that should be "networkdays" (working days) becoming $ or getting decimals, but hey, you take the bad with the good.


They already did in 2023.

Excel did not have an option for turning automatic conversions off.

You can now, FINALLY, disable automatic conversion. Honestly, that "feature" has been the bane of my existence, and I don't work with genes.


I had no idea. Found https://insider.microsoft365.com/en-us/blog/control-data-con..., thanks.

I’d love to know the percentage of corrupted data across all Excel workbooks.


There are three options:

1. Make the software better (very hard on complex systems with GUIs)

2. Ensure everybody knows all footguns (impossible)

3. Don't care about those people (easy)

If we opt for #3, we might as well not even be in software as a profession/hobby. Having such low standards indicates that we don't really care.


> EUCs

Had to search for that one. It seems that EUC here means "end-user computing", and I was shocked to discover it's a pejorative term used by vendors to describe what they consider a problem that needs solving.


Careless as in they type in "MARCH1" or "SEPT2" in a cell and these get turned automagically into 1-Mar and 2-Sep dates after pressing Enter?

Oh, come on. The Excel user interface encourages mistakes, as surely as a bicycle with handlebars that steer backwards.

https://www.youtube.com/watch?v=MFzDaBzBlL0

It's not all on the users.


At least it's a bicycle.

Remember the whole "computer as bicycle for the mind" thing? That didn't happen, the world went in the opposite direction. Software like Excel are the last surviving remnants of the idea of empowering end users to improve their work and lives.


> Software like Excel are the last surviving remnants of the idea of empowering end users to improve their work and lives.

I agree. I love Excel. But this attitude only makes sense if we assume one can only fix easy-to-make mistakes by dumbing the software down, which is not true.


I had a different interpretation of the video you linked. It seems to me even a backwards bicycle is perfectly rideable as long as you adjust your mental model.

In that sense the Excel UI only fails to make sense to those not used to it. Which might be a nice analogy.


Reminds me of YAML's handling of booleans (`no` -> false) and git commit SHAs (treated as numbers in scientific notation)
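
The boolean half is easy to reproduce with PyYAML, which implements YAML 1.1 scalar resolution; whether a short commit SHA such as 1234e5 gets read as a float depends on the particular parser, so quoting the value is the safe fix either way.

    import yaml  # PyYAML, which implements YAML 1.1 scalar resolution

    # Bare no/yes/on/off (and NO, the classic "Norway problem") become booleans.
    doc = "deployed: no\ncountry: NO\n"
    print(yaml.safe_load(doc))   # {'deployed': False, 'country': False}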

Younger me would've said that WYSIWYG editors were a mistake and that researchers should've used LaTeX.

Now I think these errors are a small price to pay for convenience. One could waste a lifetime fighting small things like this and still lose. It's just the world we live in.


US ZIP codes are what I always have issue with during imports. They start with 0 in New England and the zero always gets removed.

> The researchers […] say the issue can be fixed by formatting Excel columns as text and remaining vigilant—or switching to Google Sheets, where gene names are stored exactly as they're entered.

I think the safest fix is to avoid spreadsheets altogether, as far as scientific research is concerned.


A HN thread similar to this one about how using Microsoft Excel corrupts data occurs about once a month. Nothing ever comes of it. Those emotionally invested in using Excel actively ignore criticism of it, or themselves for that matter.

Actually Microsoft have recently added a warning dialog about certain automatic csv conversions. Perhaps they finally read the thread. https://insider.microsoft365.com/en-us/blog/control-data-con...

Yes I'm aware. It would be good to have a group policy or some such registry setting to disable them all by default for every computer on the network.

Don't we ever learn. We've known since VisiCalc in the '80s not to trust spreadsheets for important data without crosschecking the results.

The question is why do we still use substandard tools for processing important data like this.


"The question is why do we still use substandard tools for processing important data like this."

Because there is no better alternative (yet)?

A better alternative needs to be really better to justify the effort of people relearning how to do things in it.


I think another factor is that the spreadsheet model has basically zero barrier to entry from a user's perspective. You can have elementary school students punch data from a classroom experiment into an Excel table, which is great!

Any replacement system which, for example, enforced a strong separation between operations, input reference data and output result data would require users to learn the model before attempting to use the software. This is a pretty big ask, especially since lots of small-scale users wouldn't see an immediate benefit. I think of it like the tradeoff between dynamic and static typing when programming - it's the same "upfront mental overhead versus long term maintainability" question IMO.


There isn't much relearning needed, since errors like the one described in the article illustrate that people don't know much about this tool either.

And Sheets isn't "really better", yet it has gained a noticeable share.


Does Excel have some reasonable way of unit testing of formulas?

No, as far as I know. And I have been thinking of ways to do that for the better part of the last decade. I think bringing modern programming practices into Excel could make a huge difference. But I don't know how to do it in an Excel-user-friendly way.
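
One stopgap I've sketched (not a real unit-testing framework for Excel, just an idea) is to read the workbook's cached results with openpyxl and assert them against an independent calculation in Python. The file name, sheet name and cell layout below are made-up assumptions, and the workbook must have been saved by Excel so the cached values exist.

    import openpyxl

    # data_only=True returns the values Excel last calculated and saved,
    # not the formula strings themselves.
    wb = openpyxl.load_workbook("model.xlsx", data_only=True)
    ws = wb["Summary"]

    inputs = [ws.cell(row=r, column=1).value for r in range(2, 12)]
    expected_total = sum(inputs)   # recompute independently in Python

    assert abs(ws["B1"].value - expected_total) < 1e-9, "B1 formula drifted"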

Kinda like automatic fallback or other automatic help/fix features.

They sound like something that would be helpful, but in practice they just end up being a massive violation of the principle of least surprise.


What do you guys recommend for viewing and scrolling around tabular data? Most of my excel usage is literally just double clicking to have a look at a file.

No, it is because incentives are misaligned and people care about getting cited. It doesn’t matter whether it’s correct or not.

Thank god we don't have a monopoly in office software and can easily switch to a competitor.

It's not really Excel, is it? But users not reviewing what's happening, not paying attention.

This attitude is at the root of why software is so much worse than it could be. "It's the users' fault" as a real root cause is borderline non-existent, because it can always be obviated in the software, even though the kind of talent that can do that for complex systems is exceedingly rare.

I disagree. The dominating attitude today is that users are idiots and too dumb to use anything they can't master in 30 seconds from first exposure, which has a nice side effect of cutting out a lot of functionality from the scope. The result is, for example, 20 startups going after any given idea, each spending years refining a slightly different take on a useless set of simple features.

(Actually, many of such products can be made better by replacing them with an Excel sheet, which is a big part of the reason why people who actually need to get shit done end up using Excel.)


I would say that the problem with Excel is that it makes two contradictory assumptions, which is a red flag for bad design: "it is the user's fault for not knowing how to use the tool" and "the user is a lazy idiot who won't set up the cell types themselves". It's a little bit of convenience not having to set cells as numbers in a program that is used mainly for crunching numbers, but basic text/number auto-detection would have covered 90% of uses; the remaining 10% is unique to each user, and trying to guess it is as likely to make things worse as to help.

This is users using software in a way the designers never expected, decades after it was designed. Those users then chose to ignore warnings about using it.

Newer programs like Google Sheets have better default behaviors. They have free access to it.


Yes, the problem is using Excel, not Excel by itself.

Quit using Excel in research!

Didn’t the same thing happen in economics because of a bug in Stata in the 90s?

This is 100% on the scientists who continue to use a tool that doesn't work for their use case!

Maybe they are locked into the tool for legacy reasons, and they don't have a developer on their team responsible for upgrading their tool / software workflow. This must be a non-trivial problem for a lot of workers all around the world.

This is why Statisticians do not use Excel for scientific work. They use SPSS, R, Minitab, MATLAB, SAS, etc.

Yeah I was about to say that the news has to be that Excel has extensive use in biomedical research.

Sounds like user-error, not a software error. Learn to use the software and its idiosyncrasies. RTFM.

God I hate programmers so much.

"Hurdurdur it's not my fault. The end users are wrong. We shouldn't have to spend time creating an interface and UX that actually works as expected"

Absolute fucking retards. We make tools for these people. If these tools do not work as end users would expect to that is our mistake. Stop coping about 'training end users'.


That's not programmers, that's business analysts, product owners, quality assurance; programmers only code what is asked for.

With the score set straight: not everything can be made to "just do the UX that actually works", because there are more users, and more versions of "what actually works", than you can implement.

Not everything can be "just simple". Excel, for instance, is a powerful beast, but it is powerful because it is complex and you can do really complex stuff with it. I could make simple spreadsheet software, but no one would use it because it wouldn't allow doing really complex stuff.


This. There is absolutely no excuse for Excel's behavior here. It's been DECADES and they still haven't added a goddamn dialog that says, upon importing data, "OH, it looks like your data contains stuff that looks like dates, are they dates, or are they NOT DATES? Thanks, have a nice day"

This does indeed exist in the latest version I have running. When opening a CSV, all automatic conversion can be disabled.

lol. What a solution. Maybe learn how to use a proper DB. Research science is like 14 years behind the times. I was amazed at the SciPy24 conference how almost no data scientists use DBs. Wild. Pandas is your friend here and can at least help you with json stuff. Using excel though, yeesh. Yeah the problem isn’t excel, it’s bad training on data handling from schools.
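
For what it's worth, flattening nested JSON into a table with pandas is a one-liner (the records below are made up):

    import pandas as pd

    records = [
        {"gene": "SEPTIN7", "stats": {"mean": 1.2, "n": 30}},
        {"gene": "MARCHF1", "stats": {"mean": 0.8, "n": 28}},
    ]

    df = pd.json_normalize(records)   # nested dicts become dotted columns
    print(df.columns.tolist())        # ['gene', 'stats.mean', 'stats.n']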

lol

The problem is scientists insisting on using tools like Excel and Python instead of a decent type-safe language to do their calculations in.

The problem with most software for complex tasks, including Excel and Python, is the hidden and/or non-intuitive mechanics. That's why an AI companion should accompany them for non-expert users (the 99%), one that can check what the user is doing for intention-versus-result errors and warn them.


