
Is Microsoft Excel an Adequate Statistics Package? - Jerry2
http://www.practicalstats.com/xlsstats/excelstats.html
======
avs733
A couple notes from a heavy Excel and heavy R (and heavy several other similar
products) user...

factories run on excel...terrifyingly so. I know semiconductor semiconductor
fabs that basically run on queries in a many sheeted workbooks. The reason is
because not everyone codes, especially not factory foreman, technicians, etc.
They don't need a continuously running app (well they do but they don't know
it) they just need formatted data.

That links to comment two. The issue that Excel solves over code goes back
(IMHO) to two issues that come up with coders/noncoders. First, most users of
excel are interested in the answer rather than the process. I get the value of
being able to audit your steps visually, but the transition from Excel to R is
a transition from the primary visual being data (i.e., results) to process.
Depending on your perspective that can be highly meaningful.

Second, and probably more importantly, Excel lets you visualize your data as
it progresses. For those with lower (or just different/less abstract) spatial
and visual reasoning ability, seeing the progress of data from column to
column can have a fairly profound effect. This extends to students who are
trying to learn by seeing the progression as steps of a processing algorithm
are applied. Doing so in code abstracts that process heavily. For some,
especially those not used to treating data in the abstract/blind via code.

~~~
SmokyBorbon
> factories run on excel...terrifyingly so

Chemical plants too. But that does not mean there is no code.

An Excel document can be programmed with VBA to make connections to network
resources, read/write files, send emails, etc.

I might not be the most effective tool for every task but it's everywhere and
allows workers to automate work without making a request to hire a developer
or buy new software. Someone can gradually automate business tasks on their
own initiative rather than go through multiple layers of corporate
bureaucracy.

~~~
avs733
absolutely agreed. I mean terrifying more in the sense of the lack of controls
that I have typically seen in place and the way 'pull request equivalents' are
handled and modifications are made.

------
kprybol
The other reason to not use Excel for stats? It's virtually impossible to
reproduce your work unless you documented every step in some other format.
Excel is easily one of the worst tools you can use if you ever need to refer
back to/redo your original work at any point in the future or if you need to
be able to validate your result with any confidence. Yea I know there are some
people who are freaking magicians with VBA who can "fix" many of the problems
listed, but that doesn't absolve excel of its statistical sins.

~~~
MlCROSOFT
I think that's where R stats package (R Project) really shines, I love the
command line console that displays every input / action so you have a track
record of what you did. Used it for T-tests, linear regression, ANOVA, MANOVA.
For those working in biological / environmental science fields, I would
definitely recommend it.

~~~
okket
There is a free R course on Codeschool if anyone wants to try it for an hour
or two. It's fun.

[https://www.codeschool.com/courses/try-r](https://www.codeschool.com/courses/try-r)

------
hexane360
I've spent a lot of time using Excel in ways it was never meant for.

If anyone is trapped using Excel to make a box plot, it is possible (disclamer
please for the love of god never do this): Make a stacked column chart. The
first series should be equal to Q1 minus the graph minimum. Format this to be
invisible. Series two should be median - Q1, formatted as a black outline on a
white bar. Series three should be Q3 - median, formatted like series two.
Error bars can be added to give the whiskers. Here's an example:
[https://i.ytimg.com/vi/ucWmfmXb1kk/maxresdefault.jpg](https://i.ytimg.com/vi/ucWmfmXb1kk/maxresdefault.jpg)

Again, don't do this.

For a histogram, just make a min and max for each bin, and use countifs(range,
">=" & min, range, "<" & max) to get the amount in each bin.

~~~
bereasonable
Excel 2016 supports box and whisker plots.

------
twunde
I disagree. Where I work, we have one person that does all the hedging, price
calculations and other financial modeling and he only uses Excel. The only
practical downsides to his use of Excel it's that he doesn't have direct
database access so we generally need to create the initial reports to give him
the data and secondly Excel can only hold a little over 2M rows.

~~~
throw_away_777
How do you verify that his calculations are correct?

~~~
shostack
Why do you think this can't be verified? You can see the formulas in cells, or
you can see the macro code. Further, you could auto import older data inputs
and outputs via data connections in Excel and set up some automated checks to
flag if something is off.

Is there a use case where you think this wouldn't be possible?

~~~
throw_away_777
You can verify it sure, but it is much harder to verify excel equations than
code. There is no github for Excel (besides saving with different names). You
generally don't see the equations when you look at Excel - this is one of the
reasons Excel is so error prone.

~~~
shostack
Fully agree. My point was that it can indeed be verified though in many
circumstances.

That said, I think there's always a balance between a quick down and dirty
business solution that gets the job done, vs. something fully engineered.
Additionally, it is much easier for business users to shoulder more of the
workload while letting people with programming knowledge focus on other tasks.

------
ramblenode
A point I have not seen mentioned is that Excel encourages bad practices for
data visualization. No other statistics or data analysis software I am aware
of gives you the option of 3D-ifying a 2D plot. This adds negative value to
the plot for anyone who is actually interested in data over eye candy. Simple
example is a pie chart. Give it depth and it becomes much more difficult to
reason about, and one's reasoning could easily change if the chart was
rotated.

~~~
learningman
Agreed for the most part, although there are rare valid uses for 3d graphs. I
think most contemporary data vis people would say that Excel makes it possible
to make pie charts at all means it may encourage bad practices.

~~~
infinite8s
What are those rare valid uses for 3D graphs?

~~~
learningman
I can think of one-- we had a 96-element datatable (12x8) and limited space in
our biochemical journal article. We were looking to graphically show
differences in orders of magnitude to help explain the assay. A 3d column
chart fit the bill because we could show lots of data in a small space, and
readers could quickly see the range of values.

------
ee8aq3g5c6
Let's not forget the time Excel's bad UI contributed to a bombshell paper's
faulty evidence in favor of fiscal austerity, probably prolonging the European
recession: [http://www.bloomberg.com/news/articles/2013-04-18/faq-
reinha...](http://www.bloomberg.com/news/articles/2013-04-18/faq-reinhart-
rogoff-and-the-excel-error-that-changed-history)

~~~
nolok
Yeah no, if people who were decided by this paper took it at face value
without checking the numbers made sense, it's not an excel ui's fault. As for
the others, they would have picked that policy choice with or without that
paper.

~~~
ee8aq3g5c6
Sure, we could argue over multiple possible sources of blame and maybe we will
conclude that Excel is not the most important. It seems nearly impossible to
have that debate in a rigorous way. In any case, "the evidence will be ignored
anyway" is not a good argument to use a tool that will produce bad evidence.

------
teamonkey
> "Changes in Excel 2010 have improved its use for statistics considerably.
> For earlier versions of Excel, however, the answer is generally ‘No’. The
> following refers to versions prior to 2010 (2011 on the Mac)."

------
tangue
Considering Microsoft recent involvment with R I have the crazy hope that
future versions of Excel will be shipped with R as a scripting language.

~~~
Mikeb85
IMO adding scripting to a spreadsheet is a waste of time when R can simply
import a spreadsheet, perform whatever operations needed, then spit out
another spreadsheet/database/whatever format you want.

~~~
actuallyalys
Plus, if you use a tool like RStudio or Jupyter, it's pretty easy to see the
data as you manipulate it, which is a commonly cited advantage of Excel, with
none of the awkwardness of trying to look at cell formulas.

------
godzillabrennus
Harper Reed gave a fireside chat at 1871 in Chicago a year back where he said
that he's yet to see any meaningful data opportunity that can't be addressed
by Excel. Most are just trying to build a solution where there isn't a
problem.

~~~
ramblenode
He has apparently never worked with a dataset of more than 1,048,576
observations, Excel's row limit [0].

[0] [https://support.office.com/en-us/article/Excel-
specification...](https://support.office.com/en-us/article/Excel-
specifications-and-limits-ca36e2dc-1f09-4620-b726-67c00b05040f)

------
lordnacho
Never, ever use Excel for anything other than prototyping, spot-checking data,
or as a makeshift GUI while you develop a proper one.

I've seen several financial shops where people were moving millions of dollars
around using Excel. IMHO, it's always an indicator of deficient processes and
lack of coding skill. Yes, I do know that some clever people use it. They are
productive in spite of excel, not because of it.

Your main problem with excel is not that some statistical function is missing,
or wrong, or misleading. Sure, that's an issue, but you can live with a few
things being wrong if you can detect them and fix them yourself. I'll come
back to stats later...

The problem with Excel is it's damn near undebuggable. There's simply nothing
in the way of someone making a beast of a calculation, with the flow going all
over several sheets. You can even make things circular if you want. The data
and the code are all together, mixed up. Was a number in a given cell written
there by the VB code, or was it an input? You can use the auditing functions,
but chances are you will see a spaghetti of arrows.

It's also non trivial to find differences between different versions.
Typically the dude who is using excel has also never heard of Git or SVN, so
you will see a load of sheets like "portfolio1" and "portfolio2_new_old" and
so on. I don't know if it's changed, but when I was using Excel, the files had
code files within them, rather than separate files, like we do with most other
languages.

Of course you aren't forced to write crappy spreadsheets, but there's simply a
tendency for people who don't code to be a bit messy. But Excel is positively
inviting trouble. It's so flexible that anyone under a little pressure will
hack in some extra bell or whistle, building up tech debt for future
generations. It basically lulls the novice into thinking they can build
anything.

Amazingly I've met several people in finance who pride themselves on
spreadsheets that stretch over thousands of lines. I remember a billion dollar
merger where the analyst in charge of the "modelling" showed me how they
reached the line limit (I think that's gone now, so good luck!). It's as if
making things complicated justified their salaries, so maybe that's why excel
is so popular.

Now, about statistics. If you're doing anything non-trivial, you absolutely do
not have space to look at all the individual numbers in your matrices. Just
like if you're solving equations on a piece of paper, you need your own
symbolism. You need to be able to give things names and see short statements
at the appropriate level of abstraction. You probably need to be able to
verify the pieces independently, too. So unit tests for various operations,
that aren't intruding into the business logic.

~~~
ZanyProgrammer
I could've sworn I've seen people who work in the financial and maybe data
science industries who've praised Excel to the heavens on HN. Like on the same
level as Python, R, etc.

I don't work in finance (or data science really) so I can't comment, but it
seems merely a tool to me.

~~~
lordnacho
No way, not data science. How are you going to load in millions, possibly
billions of individual numbers?

With finance there's a lot of things that seem like good fits, but only if you
keep them small. Something like a personal budget is fine, where everything
fits on a screen.

~~~
blahi
I have 2.4 billion data points loaded in Excel at the moment.

Not that it matters. "Data science" is not something that magically kicks in
after you go beyond some "big data" threshold.

Excel is heavily used in managerial science type of positions and I can assure
you those are rather heavy on the "data science" workflows.

~~~
ramblenode
The maximum number of rows in the most recent version is 1,048,576 [0]. I'm
deducing then that your data is in wide format. What would happen if you
wanted it in long format? It seems you would just be out of luck with Excel,
which is not a problem in any other major statistics software.

[0] [https://support.office.com/en-us/article/Excel-
specification...](https://support.office.com/en-us/article/Excel-
specifications-and-limits-ca36e2dc-1f09-4620-b726-67c00b05040f)

~~~
blahi
Except that there is Power Pivot which is xVelocity - the same columnar store
database use in SQL Server.

------
itisbiz
I've been recommending to all the people I work with to use Excel add-in Power
Query to connect to data and do all calculations and transformations. Then
they can just replace/update data sources and refresh queries. It's somewhat
more accessible for non technical people than Power Pivot/ DAX/MDX. it's easy
to make tabular dimensional record sets ready for consumption by pivot tables,
Tableau, etc. You can choose to keep data in model instead of showing in
worksheet to bypass row limits and connect model to pivot table. Best of all
it teaches people ETL, better data management, separation of data and report.

------
princeb
there are a few unique issues i face with excel beyond the stats package.

1\. if you shift cells (ctrl c, ctrl v) around, delete or insert rows, you may
mess up existing cell references in formulas without realizing it. your
vlookups, hlookups will not change your column numbers just because you did.
your vba code will not change your A1 cell references. things will blow up
here in spectacular fashion.

2\. if you have a massive spreadsheet with a lot of lookups, UDFs, non-static
cell values (like a Bloomberg real time feed), it's not so clear which UDFs in
which cells get calculated in what order. sometimes it results in #value
errors, which is a million times more desirable than if there was a iferr(...,
0) or iferr(..., "") and you can't tell if there was a failure.

3\. your macros will happily destroy your work if you let it by mistake
(writing over formulas in the wrong sheet etc). python will generally not
destroy the code it's running.

4\. AUTOFORMAT will destroy, without any honor or humanity, any data if it
just barely looks like it should be something else. I've had strings get
converted into dates, 0s get stripped off (I think geneticists also face that
same issue), all kinds of nonsense.

some of the problems I see raised here in HN (such as errors in formulas,
sanity checking) are also issues in other tools like scipy, matlab. common
errors in these languages are off-by-one matrix references, terminating loops
prematurely (esp for numerical solutions), formulas not written correctly,
brackets in the wrong place or + instead of -, typos in variable names, nan
versus 0 vs na, these are things that affect excel equally.

otherwise, excel is pretty good. it's quick to prototype, it gives passable
charts if all you need are passable charts. it's very good at displaying
intermediate results. it's pretty ok for WSYWIG presentation, formatting,
especially if you have custom reports to produce every week rather than
regular ones that tex can solve. the biggest thing about excel is that
everyone uses excel and if you try to send over results in a non-excel format
they'll (clients or whoever) ask you to send it back in xls.

oh also... mediocre workers can produce excel sheets of passable quality.
mediocre works may not even produce a single scipy script of any quality. I
have seen some horrific matlab code, written by people with engineering
background. i've seen one guy, in his desire to make a programming language
look just like excel, write a single line for each of the 50 charts he creates
and calls them Chart1, chart2, chart3, rather than use a for loop even though
it's just 2 lines. it's totally bizarre.

~~~
viraptor
> your vlookups, hlookups will not change your column numbers

If you're not using tables yet, you should. They solve that problem easily,
and make addressing more explicit (relative to the table name rather than the
sheet)

~~~
princeb
you can still use table names in vlookups, and it will still not solve your
problem because vlookups do not work by table headers but by column numbers

~~~
viraptor
You can use either index/match as avs733 suggested, or alternatively:

    
    
        =VLOOKUP(value, Sometable, MATCH("Column2", Sometable[#Headers]))

------
fithisux
Never, there are better alternatives and especially FOSS. For windows, PSPP is
there to make your life easier. If you are a spreadsheet guy Libreoffice will
make you never turn back to excel and can be programmed in StarBasic. R is
amazing, but if you do not want to invest Scilab and Octave are there. if you
are the freeware proprietary guy Google Sheets, WPS Office or FreeOffice are
better alternatives than Excel. And if you have a good PC, why not give the
FOSS giac/xcas a try, you may find your sanity.

------
ramblenode
Related discussion:
[https://news.ycombinator.com/item?id=12370605](https://news.ycombinator.com/item?id=12370605)

------
galkk
"statistics don't lie quote but liars use statistics"

Well, the similar can be said about authors of that article (don't know about
newer version of Excel though).

Basically they are saying "Of course, it all appears only to old versions of
Excel but there were sooo much problems with them".

So, what?

------
nimish
Excel can calculate whatever you need but the main issue is that it will
modify your data to be helpful like converting number strings to floats unless
you tell it not to and even then...

------
Fej
When an article title poses a question like that... the answer is probably no.

------
chris_wot
How does LibreOffice compare?

~~~
gaius
It's just as unsuitable. But there are plenty of open source packages that R
the right tool for the job...

~~~
nn3
The article disagrees with you

""" Solution #2: Alternatives to Excel Yalta (ref 1) states that p-values
[inverse probability distributions] reported by the free OpenOffice’s Calc
spreadsheet and the open-source Gnumeric spreadsheet do not have the same
numerical problems as does Excel - their programmers used accurate
algorithms."""

It is not surprising because with an open source program everyone who can
program can fix such issues, while with Microsoft you are at the mercy of the
likely overworked Excel team.

~~~
Delmania
> It is not surprising because with an open source program everyone who can
> program can fix such issues, while with Microsoft you are at the mercy of
> the likely overworked Excel team.

Yep, that's the theory behind open source applications. The reality is that in
a company, people will prefer Excel because Microsoft is a point of contact
that can work with, blame, or yell at to fix because you're paying them. With
OpenOffice or LibreOffice, sure, you could have your engineering department
fix it, or they could work on the software you need for your business.

~~~
chris_wot
In the rather large Australian company I work for currently my manager asked
himself out loud "how many tines did we ask Microsoft for support with
Office"?

The answer was - "never". They wouldn't have listened and so it was not only
pointless, but it was actively frowned upon!

You can purchase a rather less expensive support contract with Collabora. You
don't have to build or fix the software yourself.

------
0x145555
Excel is the de facto standard for my STAT 301 class.

