

On the accuracy of statistical procedures in Microsoft Excel 2007 [pdf] - adulau
http://www.pages.drexel.edu/~bdm25/excel2007.pdf

======
tangue
Best part: _"Gnumeric was originally such a good Excel clone that it even
reproduced many Excel errors. When informed that these were errors, the
handful of part-time programmers who maintain Gnumeric managed to fix them
correctly in a matter of months"_

~~~
0x0
And the linked paper for this claim:
<http://www.csdassn.org/software_reports/gnumeric.pdf>

------
christopheraden
I think that in jobs where statistics is the main function, you will often
not find people using Excel. The business intelligence folks at my
company use it for cursory analyses, but I often question the rigor involved
in their practices, and using Excel to do statistics is icing on the cake. I
would trust the results of SAS or R or, hell, even Minitab, over the results I
get out of Excel. Microsoft seems more content making their product shinier
than improving the underlying numerical accuracy of their methods.

The sooner people realize it's not designed for real statistics, the sooner
people can stop hearing statisticians rant when they receive
"Report_Analysis_Feb1-2012_Update_Final-v3.xlsx".

~~~
stdbrouw
If it's not designed for real statistics, what is it designed for then, fake
statistics? I see what you're saying, but I think the real distinction you're
drawing is between heavy-duty work and run-of-the-mill descriptive stats.
Even the latter need to be accurate.

~~~
christopheraden
Thanks for pointing out the distinction. You're right that this is what I was
going for, but the articles I've seen thus far about the inaccuracies in Excel
deal mainly with the poor capabilities of the optimization routines, RNG, and
distribution functions. If you're balancing a budget with numbers that involve
2 decimals or taking measures of central tendency on similarly-scaled values,
Excel isn't woefully inadequate.

~~~
stdbrouw
That makes sense. Apologies if I was being flippant.

------
gtani
Excel's main limitations (disclaimer: I haven't upgraded to 2013 yet):

\- you can't easily audit a spreadsheet unless each cell only refers to cells
immediately above or to the left. Otherwise it's GOTO and COMEBACK
programming. "referentially opaque probabilistic graph programming"

\- Control-] lets you see where a cell is referenced in other formulas, but
only if the formula is on the same tab. Control-[ does take you to other tabs.
Control-` shows you a wall of formulas.

\- there's no easy way to track significant digits and sources of floating
point error, e.g. adding numbers that are orders of magnitude different.

\- in the past, underflow has even been a security issue

[http://www.checkpoint.com/defense/advisories/public/2011/cpa...](http://www.checkpoint.com/defense/advisories/public/2011/cpai-11-June.html)

\- you can't report bugs, or see their tracker, that I know of

<http://connect.microsoft.com/> (I'll save you the time: Excel isn't on
either list)

\- you can't switch to extended precision, rationals, or arbitrary-precision
integers or floats when you need to.
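The "orders of magnitude" point above is easy to demonstrate outside any spreadsheet. A minimal sketch in Python (using the stdlib's `math.fsum`, a compensated summation, as the accurate reference):

```python
import math

# Adding numbers of very different magnitudes: in a naive
# left-to-right sum the small terms are absorbed by the large
# one and vanish; a compensated sum keeps them.
values = [1e16] + [1.0] * 100 + [-1e16]

naive = sum(values)              # each +1.0 is lost against 1e16
compensated = math.fsum(values)  # exact result: 100.0
```

Here `naive` comes out 0.0 because 1e16 + 1.0 rounds back to 1e16 at double precision, which is exactly the kind of error a spreadsheet gives you no way to track.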

------
xijuan
Ah. Andrew Gelman just complained about this a few days ago on his blog:

 _In response to the latest controversy, a statistics professor writes:

It’s somewhat surprising to see Very Serious Researchers (apologies to Paul
Krugman) using Excel. Some years ago, I was consulting on a trademark
infringement case and was trying (unsuccessfully) to replicate another
expert’s regression analysis. It wasn’t until I had the brainstorm to use
Excel that I was able to reproduce his results – it may be better now, but at
the time, Excel could propagate round-off error and catastrophically cancel
like no other software!

Microsoft has lots of top researchers so it’s hard for me to understand how
Excel can remain so crappy. I mean, sure, I understand in some general way
that they have a large user base, it’s hard to maintain backward
compatibility, there’s feature creep, and, besides all that, lots of people
have different preferences in data analysis than I do. But still, it’s such a
joke. Word has problems too, but I can see how these problems arise from its
desirable features. The disaster that is Excel seems like more of a mystery._

------
macmac
Analysis of Excel 2010
<http://homepages.ulb.ac.be/~gmelard/rech/gmelard_csda23.pdf>

~~~
Osiris
Apparently 2010 isn't much better.

------
olympus
One problem of particular concern for many budding statisticians is the
difference in the way Student's t-distribution and the Normal distribution are
handled. The Excel functions for the t-distribution split the probability into
two tails, but the functions for the Normal distribution did not. This made it
quite a pain when doing some calculations with both large and small sample
sizes. Newer versions of Excel have "fixed" this by adding a flag to the
t-distribution functions that lets you choose between single-tailed and
two-tailed probability. But this flag has not appeared on the functions for
the normal distribution, so the functions for these two similar distributions
are still inconsistent.
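For anyone bitten by the mismatch: for a symmetric distribution the two conventions differ only by a factor of two. A small sketch with Python's stdlib `statistics.NormalDist` (the stdlib has no t-distribution, so the normal stands in here):

```python
from statistics import NormalDist

# One-tailed vs two-tailed probability for a symmetric
# distribution: the two-tailed p-value is simply twice the
# upper-tail probability of |z|.
z = 1.96
one_tail = 1 - NormalDist().cdf(abs(z))  # upper tail only
two_tail = 2 * one_tail                  # both tails

# one_tail is about 0.025, two_tail about 0.05
```

Remembering which convention each Excel function uses is the whole problem; the arithmetic itself is trivial.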

Another annoying issue was the autocorrelation in the random number generator
(a very bad thing for a _random_ number generator). The main rand() function
is now corrected but the problem lives on in the randbetween() function, which
shows that these two related functions use separate RNGs instead of both
tapping into a single good RNG. It's very frustrating if you want to do any
real work in Excel because you can't trust the built-in functions.
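A quick way to check a generator for the kind of autocorrelation described above, sketched here against Python's stdlib `random` (Mersenne Twister), which passes easily:

```python
import random

def lag1_autocorr(xs):
    # Sample lag-1 autocorrelation: covariance of consecutive
    # pairs divided by the variance. Should be near zero for a
    # good uniform RNG.
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

random.seed(42)
sample = [random.random() for _ in range(100_000)]
r1 = lag1_autocorr(sample)  # close to 0 for a healthy generator
```

With 100,000 draws the standard error is about 1/sqrt(n) ≈ 0.003, so anything far outside that band is a red flag.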

------
soofy
This is a funny satirical take on the topic:

<http://www.bbc.co.uk/news/magazine-22213219>

They're saying that Excel, the established data tool, is in fact dead for big
data. And what else matters?

~~~
mbq
The problem is that most data is still not big; even worse, big data is often
either small data presented in a very sparse way (logs, for instance) or gets
subsetted or summarised to fit in Excel :(

------
doppenhe
In Excel 2010, a review of the statistical accuracy of functions and related
work was done. Here is a summary of the work: [http://office.microsoft.com/en-
us/excel-help/what-s-new-chan...](http://office.microsoft.com/en-us/excel-
help/what-s-new-changes-made-to-excel-functions-HA010355760.aspx)

It's also worth noting that Excel only guarantees a certain level of accuracy
(15 digits, if I'm not mistaken). In cases where this is highly important,
SPSS, MATLAB, etc. should be used since they offer a higher degree of
accuracy.
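That 15-digit figure comes straight from the IEEE 754 double-precision format that Excel (and most numerical software) uses. A quick stdlib-only check of where exactness runs out:

```python
import sys

# A double carries a 53-bit significand: 15 decimal digits are
# always safe, and the 16th-17th digits are where exact integer
# representation breaks down.
exact = float(10**15 + 1) == 10**15 + 1  # 16 digits: still exact
lost = float(10**16 + 1) == 10**16 + 1   # 17 digits: the +1 rounds away

digits = sys.float_info.dig  # 15 guaranteed decimal digits
```

So the "15 digits" guarantee is really a property of the number format, not something tools like SPSS escape; they just use better algorithms within it.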

------
1337biz
I don't even know if this matters that much within a corporate environment.
Those dealing with the numbers are usually aware that the numbers already have
inherent flaws and only approximate the reality out there. As long as the
results sound somewhat plausible, chances are slim that anyone is going to ask
questions. At the same time, those requesting the numbers are happy "to have
the facts on their side".

------
rrreese
Excel 2007 was released in late 2006. Six and a half years have passed and two
new versions have been released. It would be much more interesting to see the
accuracy of the current version than of this very old version.

~~~
mutagen
Microsoft is likely to preserve some errors in the name of backwards
compatibility. Spreadsheets that calculate different values in different
versions are going to cause them more headaches with their customers than
these reports of errors, as the people who care are already using other
methods.

~~~
jimhefferon
Sit down, have a drink, and think about what you are saying.

~~~
carbocation
And yet, in your heart of hearts, you know mutagen's point to be true.

------
chris_wot
Does anyone know how LibreOffice compares?

~~~
adulau
Calc (from OpenOffice) was analyzed, among others, in 2010: "On the numerical
accuracy of spreadsheets, Journal of Statistical Software, 34, 4, 1-29."

<http://www.jstatsoft.org/v34/i04>

" This paper discusses the numerical precision of five spreadsheets (Calc,
Excel, Gnumeric, NeoOffice and Oleo) running on two hardware platforms (i386
and amd64) and on three operating systems (Windows Vista, Ubuntu Intrepid and
Mac OS Leopard). The methodology consists of checking the number of correct
significant digits returned by each spreadsheet when computing the sample
mean, standard deviation, first-order autocorrelation, F statistic in ANOVA
tests, linear and nonlinear regression and distribution functions. A
discussion about the algorithms for pseudorandom number generation provided by
these platforms is also conducted. We conclude that there is no safe choice
among the spreadsheets here assessed: they all fail in nonlinear regression
and they are not suited for Monte Carlo experiments."
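The significant-digit tests the paper describes mostly catch one classic failure mode: computing the variance with the one-pass "textbook" formula. A sketch, assuming illustrative data with a large mean:

```python
# Data with a large mean and tiny spread: the worst case for
# the one-pass formula, where sum(x*x) and n*mean^2 agree in
# all their leading digits and the subtraction cancels.
data = [1000000000.0 + x for x in (1.0, 2.0, 3.0, 4.0, 5.0)]
n = len(data)
mean = sum(data) / n  # exact here: 1000000003.0

# "Textbook" one-pass sample variance: catastrophic cancellation.
one_pass = (sum(x * x for x in data) - n * mean * mean) / (n - 1)

# Two-pass sample variance: subtract the mean first, then square.
two_pass = sum((x - mean) ** 2 for x in data) / (n - 1)
# two_pass is exactly 2.5 for this data; one_pass is not.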

~~~
chris_wot
I'm curious as to why anyone would use a spreadsheet for statistical functions
if they cannot be trusted.

~~~
regularfry
Because they claim to support statistical functions, their users expect them
to be able to do it, their users already _have_ the spreadsheet software and
know how to use it, and knowledge that they can be dangerously broken is not
widespread.

------
mcx
How does SPSS compare? I've mostly used Excel and SPSS back in college.

