
Saving the World from Spreadsheets [video] - matt_d
https://www.youtube.com/watch?v=GyWKxFxyyrQ
======
bschne
Here's the corresponding papers:

CheckCell - [https://people.cs.umass.edu/~emery/pubs/CheckCell-
preprint-O...](https://people.cs.umass.edu/~emery/pubs/CheckCell-preprint-
OOPSLA2014.pdf)

ExceLint -
[https://arxiv.org/pdf/1901.11100.pdf](https://arxiv.org/pdf/1901.11100.pdf)

------
wwarner
Excelint [https://github.com/plasma-umass/ExceLint-
addin](https://github.com/plasma-umass/ExceLint-addin) CheckCel
[https://github.com/plasma-umass/DataDebug](https://github.com/plasma-
umass/DataDebug)

~~~
phonon
What's the difference between [https://github.com/plasma-umass/ExceLint-
addin](https://github.com/plasma-umass/ExceLint-addin) and
[https://github.com/ExceLint/ExceLint](https://github.com/ExceLint/ExceLint) ?

~~~
emeryberger
(ExceLint co-author and the speaker on this video here) The first one is the
latest version which works in all modern Excel versions (Windows, Mac,
Online). It’s a rewrite in TypeScript I did while on sabbatical this year at
Microsoft Research (also it has faster and improved algorithms and more
features to further improve its precision and usability). It is actively under
development. Installation instructions are in the README. We are hoping to
have it posted on the Microsoft store at some point. If you would like to see
something like this in Excel, say so!

~~~
phonon
Cool! Does it include all the capabilities of [https://github.com/plasma-
umass/DataDebug](https://github.com/plasma-umass/DataDebug) or should/could
both be installed?

~~~
emeryberger
Right now, ExceLint and CheckCell are not integrated. But this is something we
would like to do!

------
vogtb
I have a habit of taking notes on these types of talks when I stumble upon
them, so for anyone interested, here's a quick write up of the video. Some of
it is paraphrased,.

* "[Microsoft estimates that 750M users of Excel. (%7 of the world population).]"

* Spreadsheet errors basically ruined the economy of Greece.

* "State of the art" is manually double checking your formulas. This is what the experts suggest…

* There's apparently an article out there from Forbes titled "Sorry, Your Spreadsheet Has Errors (Almost 90% Do)".

* Thomas Herndon is the guy that did manual spreadsheet verification to prove that there were errors.

* Talks about CheckCell, ExceLint.

* Input errors are a huge problem: "Roughly 1% of characters people mistype."

* "[1 out of 20 cells manually typed probably has an error.]" (Woah.)

* "[Users often add a digit or remove a digit, changing the order of magnitude]"

* 1) Manual data entry is hard to do correctly, 2) Writing formulas/code/Excel that uses that data is also hard to do correctly.

* One take away: Like code, if you're not testing it manually, and no one is testing it for you manually, and you're not writing tests, and the results aren't "gut-checked" or the results aren't used, why would it be correct? If a tree falls…

* "[A lot of public posted Excel sheets are filled with errors, or fudging.]" Look at the grades one that he shows around 19:36 to see what I mean.

* "The Bootstrap" \- stats analysis using simulations. "[Resample samples]... random sample with replacement, repeatedly, to get distribution of output of calculation." Requires a homogenous range. Allows you to find "outliers" that drastically change the output. "What is the likelihood of observing one of the simulations under the null hypothesis, and if it's below [X] then we say it's unusual." Dude in audience at 30:29 describes it well.

* Formulas are easier to audit because they're usually named w/ column, etc. Data is hard.

* Goes through a long process of describing how they gather data, etc. Good stuff, but the short and the long of it is CheckCell is good.

* Loops back around to the global finance sheet that had a lot of errors: CheckCell worked on it.

* ExceLint - static analysis, ranks errors and their fixes. Can find off-by-one-like errors. Formulas using off-by-one ranges, etc.Excel has its own error finder, but it gives a lot of false positives and false negatives."Most errors are reference errors" \- wrong row, wrong column, too short a range, too long a range, etc. "Looks for disruptions in rectangular regions." Not just outliers. Looking for irregularity, where regularity is basically low entropy. "[Capture the relationship of cells/ranges and their relationship to one another.]" Looking for relationships that minimize entropy ("[Because users aren't insane and they're putting things in a rectangular grid.]" Looks for every rectangle (i.e. range) that when merged with a neighbor, would remain rectangular. That is considered a potential fix. Then you can simulate the fix as if you already did it, and check the entropy on that.

* A lot of the errors, and their origins have to do with basic Excel features. Some of these features were outlined as best practices in Joel's "You suck at Excel" talk, which is kinda funny. Great power, Uncle Ben, etc.

* Dropped this one: "SUM is [something like 45% of formulas]".

(edit: spacing out list)

~~~
freqshow
Thank you for posting this. I know that your taking a few moments to offer
this bit of help to the community is appreciated by many more people than will
ever reply to you.

------
Mvandenbergh
I think this stuff is great. I'm a big fan of the real world and meeting
people where they are. If your plan for saving the world from spreadsheets is
to convince people to use Pandas instead (and there are certainly people on HN
who think that way), you're not really serious about fixing things. Developing
these kinds of checking tools, which could realistically be rolled out to many
Excel users is a great step forward.

------
wwarner
I don't know of anything similar for Jupyter notebooks or sql either. I guess
testing with simulated data designed to yield expected results would be the
way I test my stuff.

~~~
jnbiche
> I don't know of anything similar for Jupyter notebooks or sql either

Exactly. It feels like we're falling into the exact same problem with Jupyter
notebooks as with spreadsheets: they become increasingly used by professionals
who code (but who aren't software devs) to create bug-ridden, unmaintainable,
large-scale software because they become familiar with the tool they have at
hand.

~~~
analog31
Yes, definitely. I'm guilty as charged to some extent. I'm a heavy user of
Jupyter, and introduced Python to my workplace, where it is now used by a
handful of scientists in R&D.

I'm hesitant to blame the tool. Instead, I think it's a matter of our
exuberance and interest in producing new results that causes us to get ahead
of our software engineering skills and build things that get out of hand.
Also, the professional developers are simply not available to help us improve
things. We're on our own.

------
baxtr
and then people just love AirTable... especially the dev community I feel

~~~
chopete
I tried using AirTable twice. It has a steep learning curve. The UI comes in
the way often.

If they really have day-to-day users - they must be from the top down approach
(somebody up the chain selected it) or forcefully committed ones or a have a
perfect use-case.

It is certainly not for regular/most excel users.

------
larrydag
I'm an R advocate. I call R the "Excel Buster". I think reproducible research
is very important and tools should follow. Excel has its place in fast
mathematics prototyping but for reproducible research it is quite lacking.

~~~
mycall
Vast majority of Excel users would implode learning R.

~~~
FromHoiPolloi
Can confirm. Source: am an excel user.

