You shouldn’t use a spreadsheet for important work (lemire.me)
245 points by ot on May 27, 2014 | 195 comments

Felienne Hermans has done some very interesting work examining how to refactor spreadsheets. (http://www.felienne.com/publications)

Her PhD dissertation on the subject "Analyzing and Visualizing Spreadsheets" (http://www.felienne.com/archives/2534) is particularly detailed.

Much though some people might think that spreadsheets are not for serious work, they're being used for that right now, and that is unlikely to change. Better to figure out how to improve the tooling, rather than undertake the quixotic quest to get business people to abandon their perfectly viable programming tool.

There is an excellent video with her on InfoQ, explaining her spreadsheet formula refactoring tool Bumblebee (and the F# code behind it): http://www.infoq.com/presentations/spreadsheet-refactoring

What's interesting to me is that software spreadsheets are more or less just an evolved version of what VisiCalc apparently got right on the first try. Sure, software spreadsheets are basically just parroting the physical ledgers that came before them, but turning them into software almost immediately opened up a kind of dynamic that didn't exist before. The latest Excel is basically just an organically evolved VisiCalc. There really hasn't been a huge paradigm shift since then.

It's not clear that there should be, either; spreadsheets work really well. Efforts to move them into databases or whatever be damned. A spreadsheet is != to a database table except maybe in the most abstract possible sense in some cases. They really are different things and I agree that issues with spreadsheets should be solved by improving the tools not trying to paradigm shift spreadsheet users into an inappropriate environment that also happens to have a very long and highly skilled environment setup requirement.

I'm not a computer scientist, but I think that spreadsheets were possibly ahead of their time in a couple ways:

1. They let novices build fairly effective user interfaces, just by entering headers for rows and columns.

2. If I understand what "data flow" programming is (based on using LabVIEW), then the formula cells in spreadsheets are such a thing. Each formula is recomputed whenever its inputs change. That creates incredibly quick testing cycles. Displaying intermediate results by necessity creates a built-in debugger.
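A minimal sketch of that dataflow behavior in Python (the Sheet class and the recompute strategy here are invented for illustration, not how any real spreadsheet engine works):

```python
# Toy model of spreadsheet dataflow: each formula cell recomputes
# whenever one of its inputs changes.

class Sheet:
    def __init__(self):
        self.values = {}      # cell name -> current value
        self.formulas = {}    # cell name -> (function, input cell names)

    def set_value(self, cell, value):
        self.values[cell] = value
        self._recompute()

    def set_formula(self, cell, func, inputs):
        self.formulas[cell] = (func, inputs)
        self._recompute()

    def _recompute(self):
        # Naive fixed-point passes: good enough when formulas are acyclic.
        for _ in range(len(self.formulas) + 1):
            for cell, (func, inputs) in self.formulas.items():
                args = [self.values.get(c, 0) for c in inputs]
                self.values[cell] = func(*args)

s = Sheet()
s.set_formula("B1", lambda a: a * 2, ["A1"])
s.set_formula("C1", lambda a, b: a + b, ["A1", "B1"])
s.set_value("A1", 10)
print(s.values["C1"])  # 30 -- every dependent cell updated automatically
```

Changing A1 again would ripple through B1 and C1 with no explicit "run" step, which is exactly the quick-feedback loop described above.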

In addition, the lack of "complete" programming features, such as loops, may make spreadsheets less forbidding for novices to create and debug.

If these things are, as I suspect, compelling enough on their own, then like you say, the rest of the stuff in the latest Excel would just be icing on the cake -- not that I'd give up any of it.

They have loads of theoretical advantages:

1) Spreadsheets are a purely functional programming language. When you point this out to management users (after explaining what that means), they say they wouldn't want to live without it.

So in reality the most widely used programming language in the world, is a purely functional one.

2) They are NOT Turing complete (assuming you stay away from VBScript)

Of course, they're worse than BASIC when it comes to naming things, which is what everyone here is complaining about. But these are major advantages, to be fair. The only thing that comes even vaguely close to how spreadsheets work are the IPython notebooks.

Not to be too pedantic, but...

Felienne Hermans actually implemented a Turing machine in Excel, without using scripting: http://www.felienne.com/archives/2974

There was an HN post about this in September 2013: https://news.ycombinator.com/item?id=6416631

It looks like it uses one row per step. Excel supports only a finite number of rows, and hence this does not prove that Excel is Turing complete. A program like while(true); will eventually run out of rows and terminate.

Re: naming things, in the original BASIC, variable names were also quite limited: A-Z and A0-Z9, which is actually an even more limited space than most spreadsheets, which allow for something like [A-Z]{1,3}[0-9]{1,7} cell references at least.
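A back-of-the-envelope count of the two name spaces (the BASIC pattern is from the comment above; the grid size uses Excel's documented limits of 16,384 columns by 1,048,576 rows):

```python
# Original BASIC: single letter, or letter followed by one digit.
basic_names = 26 + 26 * 10        # A-Z plus A0-Z9

# Modern Excel grid: columns A-XFD (16384) x rows 1-1048576.
excel_cells = 16384 * 1048576

print(basic_names)   # 286
print(excel_cells)   # 17179869184
```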

You could also argue that spreadsheets are kind of homoiconic, since code and data occupy the same structure. But that's a bit of a stretch.

Technically VBA != VBScript, although they are similar.


> Efforts to move them into databases or whatever be damned. A spreadsheet is != to a database table except maybe in the most abstract possible sense in some cases.

The problem is that very often, spreadsheets are used as databases because they are the tool the user is familiar with, not because the application is inherently more suited to a spreadsheet.

> They really are different things and I agree that issues with spreadsheets should be solved by improving the tools not trying to paradigm shift spreadsheet users into an inappropriate environment that also happens to have a very long and highly skilled environment setup requirement.

There is no reason a database using a relational data model (but not the multiuser/concurrency features that are also part of the relational model) needs to have a "long and highly skilled environment setup requirement", or, in fact, be any harder to setup than installing an app just like a spreadsheet app would require.

>There is no reason a database using a relational data model (but not the multiuser/concurrency features that are also part of the relational model) needs to have a "long and highly skilled environment setup requirement", or, in fact, be any harder to setup than installing an app just like a spreadsheet app would require.

This is a very salient point to me. A lot of people take the route of talking about why spreadsheets are so bad. In most programming language discussions, I see the more sane folks arguing that most languages aren't bad, but some have terrible idioms, or reams of ancient legacy code from before best practice was a best practice, or maybe they make it too easy to do the wrong thing, and inconvenient to do the right thing (I'm thinking of a lot of stuff Rich Hickey has said).

Instead, I like to think about how the spreadsheet environment or tooling could be altered to make it more natural to do the "right" thing, or to prevent certain classes of errors (maybe in similar ways to how type systems can). I'm not proposing a specific solution, but more a mindset for what I think is interesting (and probably not a terribly original mindset at that).

The excel database functions (DAVERAGE, DCOUNT, DGET, DMAX, DMIN, DPRODUCT, DSTDEV, DSUM, DVAR) give you about 90% of what you want in a database with a query "language" that's simpler than SQL.

> The excel database functions (DAVERAGE, DCOUNT, DGET, DMAX, DMIN, DPRODUCT, DSTDEV, DSUM, DVAR) give you about 90% of what you want in a database with a query "language" that's simpler than SQL.

As someone who has worked rather extensively with both, no, they give a lot less than 90%, and for even fairly simple uses they are often more complex to use than SQL.
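For comparison, the kind of aggregation DSUM expresses with a criteria range is a one-line WHERE clause in SQL. A sketch using Python's sqlite3, with invented table and column names:

```python
import sqlite3

# Hypothetical sales data of the sort the Excel D-functions operate on.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("East", 100.0), ("West", 250.0), ("East", 50.0)])

# Rough equivalent of DSUM(database, "amount", criteria) with region = "East":
(total,) = con.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'East'").fetchone()
print(total)  # 150.0
```

Whether this or a worksheet criteria range is "simpler" is exactly the disagreement in this subthread; the SQL version at least keeps the condition in one visible place.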

For you and the way you think. Don't forget the non-programmer mind.

I've seen non-programmers with years of Excel experience, including several formal classes, struggle to use the database functions to do things that non-programmers with similar levels of general technical proficiency breeze through in SQL after a single intro class lasting a couple of days. It's not a scientific study, but in my experience what I said is particularly true of the "non-programmer" mind.

Good to see you're alive. You had me worried there. Message me sometime.

Pivot Tables are even easier to construct and use than these functions. They calculate things like averages, counts and sums for lots of dimensions really quickly.

If you don't like the Pivot Table layout, you can easily construct formulae to pull the relevant values from a Pivot Table into whatever format you want. You still get the speed of calculation/refresh and, if you label your fields well, have formulae which you can copy-paste across a large area, making the sheet easy to inspect and reason about.

That's a very good point and a good reason why anyone who wants to "fix" this problem should build on spreadsheets.

> Much though some people might think that spreadsheets are not for serious work, they're being used for that right now, and that is unlikely to change. Better to figure out how to improve the tooling, rather than undertake the quixotic quest to get business people to abandon their perfectly viable programming tool.

It's largely used by business people because they (1) have inadequate IT support to have things developed using proper tooling, and (2) have been forbidden from using better tooling themselves by IT. Since both of these are direct products of IT policy, I don't think it's a quixotic quest to fix it -- however, trying to fix it by trying to convince them to just give it up is the wrong approach. The people who need to be convinced to change the situation are CIOs.

A lot of people who have access to Excel also have access to Access - and a lot of what they do would be better served by a database than a spreadsheet. Yet they don't use one - there must be a reason for that, and I don't think it's IT restrictions.

This little tidbit from Chris Granger seems relevant:

"Excel is inherently observable since it doesn't have any hidden state and all values are there for you to see and manipulate. It's also direct. You change values in the grid, drag drop things, do calculations on selections, and so on. And it manages to sidestep a lot of incidental complexity; spreadsheets are timeless, without setup, and don't even have a notion of being run."


Saying Excel doesn't have any hidden state is stretching things a bit--your formulas don't show up until you are directly on them, so mistakes (formula results that got turned into static data, for example) are hidden until you put your cursor directly onto them.

"formula results that got turned into static data"

This hints at one of the sources of problems with Excel. One good approach to using Excel is to always be in one of two modes: changing the structure (equivalent to coding) and entering values (equivalent to using software). Of course, at the early stages of designing something you will be entering lots of dummy values (e.g. ARPU) to check that your formulae work.

However, many people don't distinguish between collections (rows, columns or other contiguous areas) of cells which _should_ contain values, and those which should contain formulae. I, like many, prefer to indicate input cells with a yellow or orange background, so that I know everything else is a formula.

You can do even better:

- If you're done designing a spreadsheet, and expect to use it for a production process, then mark the input cells as unlocked, and protect all the sheets. Then the end user won't be able to mess up the formulae.

- Always write formulae in a way that they can be copied across or down whilst pointing to the right places. This can be achieved through use of one or more $ signs to fix a reference. I've seen a fair number of spreadsheets where there are a large number of similar formulae, but they have been entered/adjusted manually. This is fine only if you never make mistakes, and if no one else needs to change the formulae later or verify they are correct.

My point is that "formula results that got turned into static data" can be avoided with just a few easy rules. However, most people aren't taught these rules :(
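As a toy illustration of the $-reference rule above, here is a Python sketch of how a copied reference shifts unless it is pinned (single-letter columns only; the helper function is invented):

```python
import re

def shift_ref(ref, dcol, drow):
    """Shift a cell reference by a copy offset; $-fixed parts stay put."""
    m = re.fullmatch(r"(\$?)([A-Z])(\$?)(\d+)", ref)
    col_fixed, col, row_fixed, row = m.groups()
    if not col_fixed:
        col = chr(ord(col) + dcol)   # relative column shifts with the copy
    if not row_fixed:
        row = str(int(row) + drow)   # relative row shifts with the copy
    return f"{col_fixed}{col}{row_fixed}{row}"

# Copying =$A$1*B2 one column right and one row down:
print(shift_ref("$A$1", 1, 1))  # $A$1  (fully pinned, unchanged)
print(shift_ref("B2", 1, 1))    # C3    (relative, shifts)
```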

> Always write formulae in a way that they can be copied across or down whilst pointing to the right places. This can be achieved through use of one or more $ signs to fix a reference.

Except if you are using a modern version of Excel, don't generally use $ references for this; go one step further and use names (for fixed individual datapoints where you would use two $s) or named tables with named columns (for the most common use of single $ references.) This is more self-explanatory and less error-prone.

I never use it, but you can toggle between showing formulas and showing values with CTRL + ` (grave accent) (http://office.microsoft.com/en-001/excel-help/display-or-hid...)

> Excel is inherently observable since it doesn't have any hidden state

Except that it does -- even in terms of simple numerical values what you see is not necessarily the actual value, but instead the result of passing the value through a format string that can vary by cell.

Sadly, most people don't use this feature, and are content to show values using the default format, even if it results in values displayed as 12904819.23 (which might be better displayed as 12.9m or 12.90m).
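For illustration, the same stored value rendered through different display formats (a Python sketch of the idea; Excel's custom number format strings use a different syntax):

```python
# One underlying value, several displayed representations -- the
# displayed form can hide what is actually stored in the cell.
value = 12904819.23

print(f"{value}")             # 12904819.23 (what is stored)
print(f"{value:,.2f}")        # 12,904,819.23
print(f"{value / 1e6:.1f}m")  # 12.9m (the display suggested above)
```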

> don't even have a notion of being run

They actually do have such a notion - in Excel you can set the calculations to Manual and then you hit F5 to refresh everything in one go. Useful when you deal with current date and time.

It's actually F9. F5 opens the Go To dialog box.

This Rory Blyth cartoon has been ungoogleable for at least ten years now, but I still check every so often. Now here it is!


What I was going to say before unexpectedly finding that, is that the gap between Excel and Access is huge. I've made a living off of that gap at various points, and at some point I stopped thinking they were "doing it wrong" and started thinking, "whatever makes you happy."

> ...at some point I stopped thinking they were "doing it wrong" and started thinking, "whatever makes you happy."

The secret to success in contract software development.

SQL is scary.

I've talked to hundreds of customer prospects for Treasure Data, and by far the biggest surprise I encountered is how few people know their way around SQL. I am not talking about just "business" people but also programmers.

Contrast this dearth of SQL-proficient population with the massive, massive Excel user base: I jokingly tell my friends that Excel is by far the most popular programming language =p

Much of the ORM market and the NoSQL movement exist precisely because programmers don't understand SQL.

True that.

I use SQL extensively, and never run into all these problems other people run into.

I also don't use an ORM. There's so much you can do well and fast with SQL.

(I've also seen dreadful abominations in sql, like stored procedures that dynamically generate code for pivot tables... wait, that was me who did that)

That, and typing "select x from tableA, tableB where tableA.foreign_key = tableB.index and tableA.index=5" is a little long-winded and more prone to error, when you can just do x = ModelA.objects.get(pk=5).ModelB

SELECT B.x FROM tableA A JOIN tableB B on A.foreign_key = B.index WHERE A.index = 5

That's very simple, if you know SQL.
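A quick sanity check of that join in SQLite (via Python's sqlite3, with made-up data; note that "index" has to be quoted because it's an SQL keyword):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE tableA ("index" INTEGER, foreign_key INTEGER)')
con.execute('CREATE TABLE tableB ("index" INTEGER, x TEXT)')
con.execute("INSERT INTO tableA VALUES (5, 7)")
con.execute("INSERT INTO tableB VALUES (7, 'hello')")

row = con.execute(
    'SELECT B.x FROM tableA A JOIN tableB B '
    'ON A.foreign_key = B."index" WHERE A."index" = 5').fetchone()
print(row[0])  # hello
```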

I think

Access actually has (or had, it's been a long time) a very well put together visual query designer. As long as you weren't doing anything crazy you'd never need to touch SQL.

Not true. As soon as you go just a little step further, your query cannot be represented by the GUI stuff anymore.

Someone has to explain how a database works before you can use Access properly. Explain the relational model. Otherwise it just offers tables to the user that look a bit like an Excel sheet but with less functionality. In Excel they can form some sort of data model a lot more easily.

I find points 1 and 2 difficult to believe as the primary reason people rely on Excel. I think the primary reason is as seanstickle suggested: Excel is the only programming environment many "business people" know and are comfortable in. I personally know people who use Excel for data analysis, and it's because they don't know a general purpose programming language.

edit: Based on your other comments downthread, I realized that you meant the IT department themselves should develop the application using "proper tooling", not the business people. That means that the business people will have to hand over requirements to the IT people, who will write the software. I think turn-around time may be too much for business people who want to do their own data exploration.

> Based on your other comments downthread, I realized that you meant the IT department themselves should develop the application using "proper tooling", not the business people.

Not necessarily, though that really depends on the technical competencies in the organization and a number of other factors.

> That means that the business people will have to hand over requirements to the IT people, who will write the software.

Well, I kind of think that if you are going to have an organization that effectively uses technology, you need technology-aware process/system analysis embedded in operational organizations at a fairly low level facilitating process optimization including, as necessary, technology acquisition and software development. That's a pretty big culture change for most organizations that aren't technology organizations, I will admit.

> I think turn-around time may be too much for business people who want to do their own data exploration.

I don't think the problem area in Excel use is "data exploration", but, yes, I think that overly bureaucratic, insufficiently agile processes which separate software development from operations do create friction which leads to use of suboptimal tools -- including, but not limited to, use of Excel where it's not appropriate.

I disagree strongly with (1).

If you just need to sum a list of numbers, or quickly calculate payroll or something, Excel does a fine job. For certain types of engineering calculations (the horror), Excel does a fine job.

The overhead of having a full developer implement the tool is absurd, because many of these problems can and should be solved in less than 10 minutes of spreadsheet jockeying.

> If you just need to sum a list of numbers, or quickly calculate payroll or something, Excel does a fine job.

Sure, it's great for lots of one-time quick calculations. I was really referring to its use in ongoing operations with changing requirements where maintenance is necessary.

The problem comes in when something that is suitable as a one-off tool -- and perhaps, as such, might make a decent prototype for a proper ongoing, maintainable tool -- instead gets pressed into service as a production tool and becomes a sinkhole of technical debt.

Don't forget scientists. I am an engineer at a biotech company and they use Excel for everything. You may be amazed at what they can do with it (and horrorstruck by the systems/workflows they implement with it).

(3) It's ubiquitous, because virtually every serious company with financials uses Office, so (4) employees are expected to know how to use them because their senior members have used them for decades. I know plenty of financial analysts with different companies in different positions and every single one of them spends time every day nose-deep in Excel.

My mother was a senior accountant at one of the Big Four accounting firms, and spent so much time in Excel that she once built a Gantt chart in it.

I was watching a show the other day where they discussed their mapping of Benjamin Franklin's social network (who he communicated with, where they were in the world, etc). They showed some of the process and it appeared to involve some Excel workbooks with hundreds of sheets and terribly normalized data.

I can only imagine the amount of work involved in trying to extract any sort of useful information from the raw data in that format. I run into stuff like this all the time.

All I could think was that given a day and a Postgres+PostGIS server someone could have probably saved them entire man-months of their time...

If I had millions I would pay people like you to teach people like them how to "do it properly". Or even just implement it properly with lots of comments and docs and maybe a talky video.

I developed a set of utilities for Excel (http://www.breezetree.com/spreadspeed/), and I started adding some auditing features to it. After spending a few weeks on the auditing tools, I decided that I should do a little market research before investing more time on auditing features. Well, I'm glad I did because I found out that the market is saturated with spreadsheet auditing tools. Most are simple and inexpensive (like mine), but there are some fairly sophisticated tools out there. So my takeaway from this is not that there aren't sufficient auditing tools, but that the market needs a non-programmer, user-friendly way to build robust yet malleable models.

> Much though some people might think that spreadsheets are not for serious work, they're being used for that right now, and that is unlikely to change.

Indeed. Here's a great quote from 2005: "databases are rocks, spreadsheets are water": http://www.propylon.com/news/ctoarticles/051115_master_foo.h...

Very often Excel is the last hope of a "programmer in a hostile environment". By programmer I mean someone who works in sales, business analysis, finance, etc. but knows how to write programs.

In a typical non-IT company, someone who needs to automate something has to wait for the IT department to purchase software which would do the task. And very often it turns out that in the software purchase process the purpose of that software is lost, requirements are coming from the wrong person, and in the end the software is pretty useless.

It is not that someone can just download Python or Ruby or Java and start coding. No, no, company IT would never allow this, because they worry about security, patents, licenses, etc. (and it does not matter that these doubts are not justified; very often it is just forbidden and that's all).

But chances are that such a company has MS Office. Excel is just an interface to a better or worse programming language. That's still better than nothing.

JP Morgan's $6.2 billion "London Whale" trading loss:

  An unnamed overworked staffer in London copied and pasted the
  wrong figures into an Excel spreadsheet, throwing the firm's
  risk models out of whack.

Spreadsheet errors are reaching epidemic proportions globally. What we need is a transnational organization to contain the threat with research, best practices and conferences!


Love their compendium of horror stories. Did you know the US Securities and Exchange Commission has weak accounting because they rely on spreadsheets?


  As we have reported in prior audits, SEC's general ledger system
  and certain software applications and configurations are not
  designed to provide the accurate, complete, and timely
  transaction-level data needed to accumulate and readily report
  reliable financial information...

  Many of the agency’s financial reporting processes are still
  manual in nature and reliant on spreadsheets and databases to both
  initiate transactions and perform key control functions.

I used to work for Deutsche Bank. All financial modeling ran through Excel, though admittedly via a massive C++ library that is a big Excel macro plugin. These sheets get massive, and there are a few clusters of blades that do nothing but run Excel macros day and night.

But why is this madness? It works for them. The iBanking world is the madness. You can put in place effective change controls and auditing with any system. However, if people just want it done and don't care how, regardless of whether these calculations were in Python, Mathematica or Fortran, garbage in will still be garbage out. So, let's not wantonly scapegoat Excel for a process problem here.

Coincidentally, so did I.

I agree with you 100%. iBankers use Excel because it does what they want. One of the main reasons I left was because I got tired of trying to re-invent the excel wheel with every piece of software. There was always an inherent lack of trust that any UI was "doing it properly," so we always had to add the failsafe of export to CSV so people could do it themselves in Excel. Mind-boggling inefficiency.

The value of export is beyond just checking on UI.

If they can export it to Excel, then they can do whatever simple data analysis they just made up. No reason to wait for IT to implement whatever stats they are curious about, order it in some new way, or whatever. Doing those things in Excel is very fast and they get complete control over it.

Even if they only color cells and rows depending on some ad-hoc rule that will never be used again, Excel still adds a lot of value for them.

Plus, I would pay gold for users that actually test a system before relying on it. There is always a shortage of testers, and if the customer will do some of it, he deserves a discount.

I don't disagree. The problem is that people are error prone, and Excel doesn't value-check. We had to build large BPM-driven processes in order to account for this. Also, I don't want to imply that users got to test the system while it was being built. That would imply that there was a logical product development process in place, which for most projects there wasn't. Users got something that had been frankensteined over the course of the project, and muckity-mucks wondered why projects and applications went stagnant because users stopped using them and went back to Excel.

There is actually a business reason why this is the case. The general assumption in most investment banks is that either the market or the regulations will change so quickly that by the time you've written software the "right" way, either the market or the government has already moved on. Excel lets you move fast, and make changes quickly, even if it results in a bloated mess later on. Like you say, Excel isn't the issue, it's the nature of the business.

I would still expect that with their profit margins they would all have come up with custom solutions by now.

They all do in varying degrees - e.g. Goldman's Slang/SecDB - and are investing in well-engineered valuation/risk systems that work across various trading units due to regulatory pressures as well as to reduce manpower/maintenance costs.

You make Excel sound like the true agile way to do software.

I like how people use tools for their own benefit and then blame them solely for everything going wrong. The least people using spreadsheets should have done is run some form of independent check, outside the spreadsheet, to verify that their data is near the correct value.

Nobody ever caused a bug in a piece of custom software that caused an expensive problem?

Knight Capital Group?

Users make mistakes, users given the power to work with computers that can cause billions of $ to be traded can cause billions of $ in losses. The solution would seem to be to limit the exposure, to put a cap on the amount of $ that a user could trade without going through some form of verification.

Spreadsheets may not be ideal, but they're the poor man's gateway to programming, and I suspect they don't have a bug incidence that is much greater than regular software developed by programmers.

Of course the programmers would love to believe otherwise.

Yeah, I don't get all the Excel-hate. You can most certainly put in tests and other safety factors; back in the 90s I built spreadsheets for tasks ranging from electricity market pricing (for a national power generation company) to interest-rate shopping (for banks to allocate expense spending). There were tests and tests of tests because the sheets were shuffling around huge quantities of money. The object was not just to arrive at the best outcome every month but to generate a defensible audit trail that referenced the canonical steps of the manual procedure.

I get that programmers hate the formulae being distributed all over the place, but this is precisely the advantage for business people; they want to be able to 'walk the heap' and follow the provenance of individual values back to the source, and they want to see things in parallel at all times. CS people lean towards theoretical provability, business people are inclined towards using statistical sampling. So I would write unit tests that would lock most of a spreadsheet while running a batch of checks on the contents of individual sub-sheets, then lock the validated sheet, unlock the next sub-sheet and copy-by-value the data into that for the next round; but I would also load up random historical batches that were known-good and make sure all the final totals matched.

Sure, spreadsheets are inefficient, but a mistake in a very efficient process can easily lead to monstrous results. And from the domain specialist's point of view (i.e. the accountants/business managers), the inefficiency cost of doing things in spreadsheets is far preferable to the cost of auditing a black-box process at a later date if queries arise about its integrity.

I guess it's a good thing we don't have errors in software then, because that would totally eliminate these kinds of issues.

Oh wait, a user can enter a bad number in any tool, regardless of it being Excel or some custom cobbled together monstrosity that does less than the same work, but costs 100x as much.


Another website people interested in spreadsheet errors might like is Ray Panko's spreadsheet research


(Since the blog's website is not responding, I had to read the article from google's cache[1])

The author has well-intentioned advice about avoiding MS Excel but it's misguided. The criticism fails to accommodate the reason why MS Excel was used. MS Excel is the lingua franca of non-programmers. Thomas Piketty is a trained economist, not a programmer. It's not realistic to expect Mr. Piketty to set aside months (years) of his time to master C++/Python/Scala/Fortran/etc to avoid using Excel. It's more realistic for an economist to use MS Excel to back his thesis than for a Python programmer to write a bestselling book about economics.

If we then tweak the advice to be, "if Piketty is not a programmer, he should have hired a compsci graduate student as a programmer", well... you've only shifted the (human source of) errors somewhere else. Plenty of examples where software not written in Excel had fatal errors: Therac-25[2], Mars Climate Orbiter[3]

Lastly, some of Piketty's errors were transcription errors. In other words, GIGO (Garbage In Garbage Out). Therefore, using Python while there were GIGO issues isn't going to solve the data problem.




It strikes me as rather odd that a discipline in which a person purports to be on solid analytical ground would eschew learning how to write their research and modeling code in Fortran, C, C++, Scala, Python, etc. I can't think of a single colleague of mine, when I was in academia, who did not know how to write code in at least one of the languages you mentioned. That's because such knowledge is necessary to do rigorous, proper numerical computation in an analytical discipline like math, a variety of sciences and, yes, economics.

The "GIGO" issue you note is irrelevant, but even if it weren't, GIGO is compounded and amplified by the copy-and-paste paradigm of spreadsheet "programming", so if anything bringing up GIGO is even more damning of the use of spreadsheets for analytical work.

There are plenty of much better tools that many economists use. For example, Stata. Stata allows you to view your data at any point in time in a nice table, but also to manipulate it using a very easy to learn (and yet very expressive) language that you can write interactively as well as in .do files. The economist I worked for wanted the whole process, from cleaning the data, running the regressions, outputting the tables, etc., to be in a set of .do files so we could hand over the original data files and all the code for review and replication. While building the final analysis, though, you could also work interactively to figure out what you wanted the final product to be, though of course you could log everything you did interactively so there would always be a record.

Economists are expected to know how to program, and they do. They just have a tendency to reserve writing programs for complex models and prefer spreadsheets for what they consider "simple" data analysis.

On the point about transcription errors: Using python could have avoided some transcription errors. If the error was in copying a single constant from a source into the program, then in python it might be buried as an unlabeled magic number deep in the source tree, but it also might be labeled as a top level constant identifier. In the former case it would have the same problem as in a spreadsheet, but in the latter, a typo would likely be caught by someone reading over the code at some point. Or if some of the data came in a file format that excel couldn't automatically import, maybe he could have used a python library instead of manually transcribing 50 data points. It's hard to think of a situation where using python would have led to more transcription errors.

I don't know where this sentiment that economists (who are not trained as programmers) are expected to know computer programming languages comes from.

Both Harvard and Stanford economics degrees do not list Computer Science 101 as a core requirement to graduate.

Compare Stanford's requirements for Economics[1] with Electrical Engineering[2]. The engineering major has core classes including computing and programming. The economics major does not. (Whether or not economics studies should include it (and therefore Stanford is "wrong" in leaving it out) is a separate conversation.)

Since Thomas Piketty is from France, is there a tradition of European schools requiring computer programming language courses in their training?

As for the Python vs Excel comparisons, I say people are overestimating the ability of Python's syntax to avoid errors and underestimating the ability of Excel's visual spatial grid of cells & near-universal collaboration to also avoid certain classes of errors. For non-programmers, the rows-columns grid displayed visually at all times is almost a perfect 1-to-1 correspondence to their mental model. For Python loops, it takes programmer training to mentally "unroll" loops and see how data is munged. This visual "blindness" in programming languages ends up creating its own class of errors.
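The "unrolling" point can be made concrete: in a spreadsheet every intermediate value sits in a visible cell, while in a loop the running state is invisible unless you deliberately expose it. A minimal sketch (the growth figures are invented for illustration):

```python
# Hypothetical compounding calculation. In a spreadsheet, each year's
# value would occupy its own visible cell; in a loop, the state is
# hidden unless we keep every intermediate value around.
values = [100.0]
for year in range(3):
    values.append(values[-1] * 1.05)  # 5% annual growth

# The list plays the role of the spreadsheet column: all intermediate
# state is retained and inspectable, not overwritten in one variable.
print(values)
```

Keeping the whole history in a list, rather than a single accumulator, is the closest code analogue to the grid's always-visible intermediate state.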

I've done programming in MS Excel + Excel macros/VBA, MS Access, Oracle PL/SQL, MS Transact-SQL, C++ Qt, C# ADO.NET/LINQ, Python, etc., and I'm not convinced the syntax of programming languages leads to reduced errors. It's just a different class of errors.



"It's not realistic to expect Mr. Piketty to set aside months (years) of his time to master C++/Python/Scala/Fortran/etc to avoid using Excel."

I tend to disagree. Mr Piketty graduated in Mathematics (from a highly selective institution). It is realistic to expect him to be able to use those tools if needed.

Until you realize that there is more to mathematics than discrete mathematics. For someone doing, let's say, topology, computers are so useless you could well spend your entire postgraduate life without using anything more complicated than Word.

"Lastly, some of Piketty's errors were transcription errors. In other words, GIGO (Garbage In Garbage Out). Therefore, using Python while there were GIGO issues isn't going to solve the data problem."

More importantly, the people claiming there are errors in his work are making an even clearer error than Piketty in claiming that inequality hasn't risen. It has, and it is shown on dozens of measures. That they are trying to fight something as well established in mainstream economics as evolution is in biology leads me to believe that in this case it is ideological windmill-fighting more than honest critique.

Spreadsheets ARE software, anyone arguing differently has insufficient experience in at least one of the fields to have an informed opinion. I've worked professionally in both, and the argument in this article is bunk.

Spreadsheets are a different kind of software with advantages and disadvantages compared to compiled/scripted languages.

How do you treat a spreadsheet as real software:

* One operation per cell

* Inputs, processing, and display separated into distinct sheets

* Every formula checked twice (or three times)

* Write row and column check sums

* Treat row and column locking ($F4 or F$4) as the footguns they are.

* Name constants
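The row/column check-sum practice above can be sketched outside Excel as well: compute row totals and column totals independently and verify they agree on the grand total (a "cross-foot"). The table data here is made up for illustration:

```python
# Cross-footing check: row sums and column sums must agree on the
# grand total. A mismatch flags a formula or transcription error.
table = [
    [10, 20, 30],
    [ 5, 15, 25],
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

assert sum(row_totals) == sum(col_totals), "cross-foot mismatch"
print(sum(row_totals))  # grand total
```

In a spreadsheet, the same check is a SUM of row sums compared against a SUM of column sums, placed where a discrepancy is impossible to miss.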

Testing spreadsheets requires a different set of tools and behaviors, but to say that "quick and dirty" spreadsheets are less durable than "well engineered software" is really to say that "quick and dirty analysis" is less durable than "well engineered analysis."

Spreadsheets are remarkably easy to learn to use and to use correctly compared to scripted/compiled software. Not the least benefit being that they force abstraction in a way that is extremely intuitive to most people -- spatial context with the labels and state simultaneously displayed leads to a much, much lower context-loading cost (for the human brain) than keeping all the abstract variables in your head when writing code.

Mr. Lemire seems to treat the tool as a toy then claim it is only good as a toy. Flub and not Flub languages alike can be run quite easily without unit tests, in which case they are no more durable to errors than excel (and arguably less as they don't display intermediate state).

> Spreadsheets are remarkably easy to learn to use and to use correctly compared to scripted/compiled software.

I don't think this is true. IME, people who aren't primarily programmers/developers and have been using Excel for years tend to have worse practices developing Excel tools than the same people have with much less time learning, e.g., Python. Excel is much easier to learn to use to get some kind of result, perhaps, but not easier to learn to use correctly. (Part of it, I think, is because the cost of incorrect use is somewhat more deferred, which is also part of what makes it more attractive to use initially in many cases.)

I largely agree with you.

Perhaps it is a difference in semantics.

It is very easy to learn to use correctly, but if you aren't taught by someone who uses it correctly (either self taught or otherwise) you're not likely to discover the best practices.

Easy to learn correctly != obvious way is correct != commonly learned correctly.

… but I will not use Microsoft Excel to run a bank or to compute the trajectory of the space shuttle

Does the author realize that Excel is not used to run banks or to track the space shuttle? Sure, some traders might base their models on Excel. But, the actual exchanges, trade routing and account general ledgers are all on very tightly controlled systems, usually mainframes in the case of ledgers. To boot, NASA does not use a VBA Excel plugin to schedule burns and execute reentry. (Excel 2007 won't even run on an 8086 chip.)

Does the author realize that Excel is not used to run banks or to track the space shuttle?

I'm sure he does. He's exaggerating for effect. (Well, at least he is regarding the Space Shuttle.)

Billions are made in the finance industry from Excel spreadsheets. Whole funds have been run from a single spreadsheet making buy / sell calculations, with a whole IT infrastructure designed to automate the execution of that single spreadsheet file.

Sure, it may be stupid, but it's democratized programming. That's why spreadsheets work.

I remember working on a system at an investment bank in London before the GFC that valued exotic derivative portfolios on a grid of machines all running an instance of Excel; each trade was represented as a separate spreadsheet with well known locations for market data to be inserted.... madness!

Wow. That's...I'm not even mad. Excel as an actor framework?

Wrong end of the correlation/causation pattern.

Human beings make mistakes.

A lot of humans use Excel.

Hence a lot of errors in Excel.

Would all those humans program their logic in Python - we'd have a lot of wrong calculations in Python code and an article stating to not use Python.

I got a different message, the message was that mistakes are harder to find in Excel than in other programming languages, and correct operation is harder to validate. Had Piketty used Python it might be that the libraries would be more amenable to code review as reading the structure is easier than it is reading the formula contents of all the cells in a spreadsheet.

There are some excellent tools used by civil engineers where the formulas all have to be published when working on anything. If they miss a decimal point, bridges fall down. My brother-in-law is working on automating some of the drudge work, but essentially they produce a falsifiable sequence of claims so that anyone can assure themselves they got the correct answer, and if the formulae are wrong it is immediately apparent.

I am human, therefore there is no difference between coding in C and coding in Rust/Cyclone?

I don't think it works that way. The surface area where you are permitted to make mistakes is completely different in language X compared to Excel. In Excel, `="a"+4` is a cell error but a valid spreadsheet. In Python it is an invalid program, throwing a runtime error. In Haskell, it doesn't even compile.
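To make the comparison concrete: the same type confusion that Excel quietly confines to a single `#VALUE!` cell stops a Python run outright. A quick check:

```python
# In Excel, ="a"+4 yields a #VALUE! error in one cell while the rest
# of the sheet keeps recalculating. In Python, the equivalent
# expression raises a TypeError and halts the whole program unless
# explicitly caught.
try:
    result = "a" + 4
except TypeError as exc:
    print("caught:", exc)
```

Whether halting loudly or degrading one cell is preferable is exactly the trade-off this subthread is arguing about.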

It's an oversimplification to assume the hypothetical Python implementation would have fewer errors. We don't know that. Some types of math errors would overlap between Excel and Python. Some types of errors would be easier to stumble into in Python than Excel.

For example, an inexperienced programmer in Python might use floating point instead of Decimal data type to add currency amounts. His programming loop to sum the amounts would be incorrect. In MS Excel, adding currency in cells correctly down to the penny is a no brainer. Sure, the Python programmer can be taught "best practices" to avoid this type of error but the point is that while Python helps eliminates some errors, it also creates new ones.
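The currency pitfall described above is easy to demonstrate (the amounts are made up):

```python
from decimal import Decimal

# Summing ten 10-cent amounts: binary floating point accumulates a
# representation error, while Decimal stays exact to the penny.
float_total = sum([0.1] * 10)
decimal_total = sum([Decimal("0.10")] * 10)

print(float_total == 1.0)                 # False: accumulated float error
print(decimal_total == Decimal("1.00"))   # True: exact decimal arithmetic
```

Excel sidesteps this particular trap because its display rounding and currency formatting hide the issue for typical ledger work, which is the parent's point: each tool has its own failure modes.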

Another example is data munging. A programmer might write some Python to slurp a data file (exported from mainframe or whatever) and do some financial calculation on it. Load the values into an array some other memory structure. The problem is that the memory contents are "hidden" from sight unless the programmer uses a visual debugger or prints out all the values to inspect. The Python programmer may not notice that some values are misaligned or garbage. With Excel, the non-programmer imports the data file and he immediately scrolls through the worksheet as a sanity check. His eyeballs notice that the source data is dirty. Again, the Python "problem" can be corrected with best practices but the point still remains: different tools create different problems.
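The "eyeball the data" habit can be replicated in code, though it takes discipline: validate rows immediately after loading instead of trusting the import. A sketch with an invented inline dataset standing in for the exported file:

```python
import csv
import io

# Stand-in for an exported data file (contents invented for the example).
raw = """name,amount
alpha,100.50
beta,not_a_number
gamma,75.25
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# The programmatic equivalent of scrolling through the worksheet:
# flag any row whose amount doesn't parse as a number.
dirty = []
for row in rows:
    try:
        float(row["amount"])
    except ValueError:
        dirty.append(row["name"])

print(dirty)  # rows needing a closer look
```

The catch, as the parent notes, is that nothing forces the programmer to write this check, whereas the spreadsheet grid puts the dirty value in front of your eyes for free.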

Lastly, Excel spreadsheets are easily emailed among dozens of people. You can't send ???.py programs to everyone because you can't expect all the Windows users to have the Python runtime. Spreadsheets are even sent to iPhones and tablets, and you definitely can't expect easy Python deployment there. With fewer coworkers examining the ???.py file, it has the potential for more errors compared to a scenario where everybody can participate in questioning the spreadsheet's numbers.

Our intuition says that on balance, the hypothetical Python/Haskell solution should have fewer errors than MS Excel, but I can't confidently say I've seen any definitive proof of that.

>The problem is that the memory contents are "hidden" from sight unless the programmer uses a visual debugger or prints out all the values to inspect.

I think that's what most people do when they are programming.

the parent seems to be referring to survivorship bias.

It's not true that all tools are exactly the same.

I suppose a good example of this would be all the affordances built into a modern highway. Softer turns, curves prior to intersections (they increase visibility and encourage the stopping traffic to slow down), embankments, etc.

At least Python will ease the testing of the various functions you code. Everything won't be hidden inside cell all over the pages

Apparently the errors found were "transcription errors" not "programming errors." Using Excel or not then is irrelevant.

Excel (and any other spreadsheet) is a nice and convenient tool to get easy tabular inputs and some results fast. Spreadsheets were one of the first useful programs on PCs.

With spreadsheets, a lot of steps needlessly require human transcription (cut/paste, typing and so on). So there is a causal relation.

Steps require "human transcription"? I think you've got a wrong term, and if you mistype the data you have on the paper when making a spreadsheet I don't see why you wouldn't when doing the same typing using your text editor.

"Manual intervention" would have been better. The steps involved can be copying and pasting, selecting a range (very nasty if you miss and don't notice), and even typing data from another source. Compare that to something like R where you typically load the data (even from a URL) with command like "read.table()" and most operations are implicitly over the whole data frame (no copying of data from paper required).

Somebody has to type the data you loaded.

No, with good fortune the data is directly transcribed by a machine. Babbage included a printer in his designs to avoid human transcription error.

Genomics data is produced by machines. And R is quite popular with bioinformaticians that analyse it.

There have been a number of efforts to build a better spreadsheet. One that I was quite familiar with (I bought a license!) was Resolver One [1] that was trying to build a spreadsheet with support for Python at the cell-level. Unfortunately it didn't take off, and the team moved onto Python Anywhere [2]. Excel clearly has strong network effects!

I loved this article [3] from 2008, that argues that Excel and VBA were responsible for the credit crunch, that later possibly lead to Reinhart/Rogoff's flawed research using Excel... its spreadsheets all the way down!

[1] http://www.resolversystems.com/products/resolver-one/ (link appears to be dead) [2] https://www.pythonanywhere.com [3] http://www.theregister.co.uk/2008/01/21/vba_office_victory/

I stumbled across Pyspread the other day, a more recent attempt at a spreadsheet using Python at the cell-level. See http://manns.github.io/pyspread/.

  Pyspread expects Python expressions in its grid cells, which makes a spreadsheet
  specific language obsolete. Each cell returns a Python object that can be accessed
  from other cells. These objects can represent anything including lists or matrices.
I haven't used it yet, most of my personal spreadsheets are in Google Sheets and I'm not too keen to pull them out of Google Drive.


I can believe "flawed products" were constructed by spreadsheet, but that seems a kind of shallow analysis.

Clearly, the housing bubble involved a rush to buy whatever - spreadsheet might seem the "immediate cause" but "wishful thinking" would clearly be the larger cause with various inherent tendencies of the financial system behind that.

I loved this article [3] from 2008, that argues that Excel and VBA were responsible for the credit crunch

That's a rather aggressive phrasing of what the article says, that many of the calculations involved in the credit crunch were done in Excel/VBA.

Yes, I agree its a bit aggressive. Ultimately it was human nature... spreadsheets don't cause apocalyptic financial crises, overzealous capitalists with spreadsheets cause apocalyptic financial crises!

The author was (is?) a recruiter for quants and financial developers so he certainly had a unique perspective.

Is it just me, or does the financial system seem to be designed in precisely the opposite way of how we try to design good software? One huge intertwined web, where one small thing indirectly affects thousands of other things. Its seems to be the equivalent of a "big ball of mud" type piece of software.

You are making the mistake of thinking the financial system was designed in any meaningful way!

The beauty of Excel is its ability to do many things 'well enough' in a way that is accessible to a lot of people.

My dad put together his architectural plans for a new house in Excel because it is a tool with which he felt comfortable. Some people do calendars in Excel while others try and recreate a full General Ledger system. Because it works well enough, people don't see the need to invest the time in learning a new application and instead invest their time in pushing the limits of the tool.

As someone who builds financial models and who audits those built by others (and is a competent programmer by night), I think that the key source of risk in Excel models is that the tool has no knowledge of intentions. What I mean by this is that if I'm building a cash flow model or if I'm doing a pixel drawing, Excel doesn't care; as users, we are forced to create our own structure and build in our own checks and balances. If I make a balance sheet in Excel that doesn't balance, Clippy won't show up and let me know that things are broken.

I've often thought that it would be really amazing if a semantic layer could be built that uses Excel as the calculation backend. This sort of tool could understand the sorts of concepts of financial statements, projections, time-series and other concepts that often show up in financial models. It would have a built-in understanding of the domain-specific models that would let it leverage that understanding to reduce risk in the building of financial models. If I told it that I wanted to add a revenue stream that is tied to the output of production, the tool would connect the dots between the production schedule, any inflation and/or foreign exchange assumptions and would feed changes in working capital according to the associated collection terms, etc...

Before I get too carried away, the point is that this type of semantic layer would be much better at preventing and detecting anomalies and potential errors in the development of a high-risk financial model. Does anyone have experience with any such tools?

That's why I always thought it'd be a good idea for Microsoft to expose such a semantic layer via a .NET API. When you can program an Excel sheet using the .NET languages and have access to the broader .NET libraries then all kinds of interesting possibilities start to open up. The fact is business users love Excel and a great number of CRUD applications could be easily built on top of Excel if only we had a reasonable programming environment (keep the VBA, you could create a VBA implementation on top of .NET). That would be win/win for everyone!

I always got the feeling the Office team and .NET team never got along well, considering the complete lack of cohesion between _any_ of their respective products.

Microsoft VSTO can be used to build office addins and extensions. Meaning you can code in any CLR language. C#, VB, C++ presumably.

"Microsoft Visual Basic for Applications (VBA) uses unmanaged code that is tightly integrated with Office applications. Microsoft Office projects created by using Visual Studio enable you to take advantage of the .NET Framework and Visual Studio design tools."

- http://msdn.microsoft.com/library/vstudio/ss11825b.aspx - http://msdn.microsoft.com/library/vstudio/bb386107.aspx - http://msdn.microsoft.com/en-us/office/hh133430.aspx

Probably worth a look. I worked at an iBank previously and we were investigating when/how we could start using VSTO more..

I used to work on finance spreadsheets that used VBA to perform ftp, connect to https servers to download and parse xml docs, perform file renaming, etc. Horrid, but functional. If MS shipped Excel with C# as a VBA co-equal, the reliability of these kinds of "hacks" would increase ten-fold.

In my experience most civil engineering organisations use spreadsheets for the majority of design calculations. Some large organisations even still insist on engineers writing out calcs by hand. For example, a calculation that determines whether or not a retaining wall is of sufficient size more often than not will be completed in excel. Ditto with the calculation that checks the weight bearing capability of a column or beam in a large building.

Bespoke tools will be bought for tasks too complex or important for excel - finite element analysis, problems involving non-linear springs etc. Some time-consuming, repetitive tasks may also be deemed worthy of more automated tools, but on the whole the engineering industry is very much in the dark ages when it comes to modern software approaches. I often wonder what you'd end up with if you introduced a team of computer scientists into a civil/structural engineering company and told them to assist with analysis. I imagine you'd get some pretty innovative approaches to concept screening/design/cost-optimising etc.

This is what I'm hoping to do with a software background going to more traditional engineering.

It's just a little offputting when people talk about the tools they (have) to use -- When I first started, I would never have expected Excel to be so prominent, and Matlab so absent.

Because not every engineer is a programmer. Especially not every non-CS/EE engineer. Especially not every non-CS/EE engineer over ~30.

Yes, they've been exposed to programming at some level, but they are not programmers. Even as a late-20s mech engineer, I am far above most of my peers when it comes to slinging code and I'm a rank amateur. Many wouldn't even want to touch it.

You're mistakenly assuming that the path of least resistance is "teach the non-CS/EE engineer to program because they are technically minded and can easily pick it up" rather than "adapt the mathematical process to suit Excel".

Also, you shouldn't use an unstable-at-load web server for important writing.

Ironically, it's Spreadsheet Server 2014.

Yet spreadsheets do not allow testing.

This is an absurd claim. Any good spreadsheet will contain multiple self-tests, either in live formulas or via macros.

The next evolved-over-multiple years large and complex spreadsheet tool that I inherit that has even a rudimentary test suite will be the first.

More often than not, the fact that a spreadsheet was used means the business was doing an end-run around a lack of IT support: they wanted the quickest thing that appeared to work, without any concern for (or even knowledge of) software development process. And it's almost certain that the history of changes (even if IT was later forced to adopt the spreadsheet) has been to do the minimum required to meet each new requirement, without really understanding what went before (often because the one person who understood the original has left).

Which isn't really a problem with the tool, but with the social conditions which make the tool attractive to use in the first place.

"good spreadsheet"

I think the biggest problem is the small cross-section of these two terms.

there is a cross section who knew :-)

In the past, a company that I worked for went under, and one of the factors was some wacky spreadsheet in our finance workflow :-(

Honestly, I don't have a problem with the fact that Excel is used in so many cases - programming for non-programmers is important.

My problem is that Excel is terrible. Its formula system is painful and the formulas are invisible.

It's not that a light user DB/spreadsheet program is a bad idea, it's that Excel made a lot of terrible decisions 20 years ago and now they're married to them.

> Excel made a lot of terrible decisions 20 years ago and now they're married to them

I can't really think of any off the top of my head, would be interested if you could share one or two that they're permanently married to.

However, I could make a very long list of easy to implement fixes & features they could add that would offer different ways of doing things than the braindead way you have to do things now, while still maintaining backwards compatibility. My conspiratorial theory is that MS is very well aware of all these things, and they will be released only if a legitimate competitor appeared on the scene, or slowly over time to encourage upgrades.

Excel doesn't support dates before 1900-01-01, and it incorrectly treats the year 1900 as a leap year[1]. This bug-as-a-feature started in the first Excel for the sake of compatibility with Lotus files.

[1] http://support.microsoft.com/kb/214326/en-us

EDIT: Apparently Wikipedia has a list of "quirks". I haven't vetted any of the citations, but the page claims issues with statistical function accuracy, failures with modulo operations, confusing use of numeric precision, and an inability to open two identically-named files.
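The leap-year bug is easy to verify independently of Excel (standard date arithmetic below, not Excel's serial scheme):

```python
import calendar
from datetime import date, timedelta

# 1900 is divisible by 100 but not by 400, so it is NOT a leap year.
print(calendar.isleap(1900))  # False
print(calendar.isleap(2000))  # True

# Excel's date serials count 1900-02-29 as day 60 -- a day that never
# existed. Correct arithmetic lands day 60 on 1900-03-01 instead, so
# every Excel serial before that point is off by one from reality.
day_60 = date(1900, 1, 1) + timedelta(days=59)
print(day_60)  # 1900-03-01
```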

Little things like how terrible and confusing their formula language is. Anything beyond simple arithmetic requires two Googles per cell.

Now, replacing it with Python isn't better - any language that uses double-equals-signs is not fit for a layman-programming-platform.

Your confidence in the layman is inspiring. Why is == more difficult to understand than parentheses for function calls, for example?

Because high school math acquainted them with the notion that a function call looks like f(x), thanks to sin(x) and cos(x) and log(x). Some people will even have vague notions of functions with multiple variable inputs like f(x,y).

Basically, high school math introduced a lot of syntax that users will be familiar with - we're universally okay with the typical +-/* operators, for example, and ^ isn't a stretch, neither is f(). AND/OR/NOT are introductory Boolean algebra, and while they're not universally taught in high school, I find many scientists are acquainted with them (although languages using || && ! syntax are a good way to lose them). Excel's lack of boolean operators is disappointing (boolean operators are functions in Excel).

I've always found that SQL gives a good minimal arithmetic/boolean-logic toolkit for laymen. It's a good model to follow.

If you think it's normal to write sin(x), cos(x), or log(x), you've been away from math for too long. Those would nearly always be written much more conveniently as sin x, cos x, and log x. You see parentheses when you have complicated arguments, but using parens in sin(3x+4) doesn't really differ conceptually from using them in 5 · (3x+4), and nobody thinks you need parens for multiplication -- compare 5 · 3x or just a simple statement like 15 · 4 = 60.

f(x) is different; you're right about that one.

It's not that == is itself difficult to understand; it's that = doesn't mean equality in virtually all programming languages so they had to make up another symbol for it.

It's basically the first thing they have to teach you.

Presumably the layman has a high-school understanding of mathematics notation.

>My problem is that Excel is terrible ... the formulas are invisible.

Have you tried Ctrl-` ?

>"My problem is that Excel is terrible"

Do you know a better alternative? I have used LibreOffice and Numbers and I will take Excel any day. Saying that it is terrible is a very strong statement for a piece of software that arguably is what keep a lot of people in the Windows world.

Gnumeric isn't bad, I've been using that. It supposedly has 100% coverage of Excel functions.

But sure, it's a GTK app and so it looks more like Office 2003 than Office 2010.

Every 3rd-party approach is trying to replicate Excel's featureset and maintain compatibility with Excel. That means any of Excel's idiosyncrasies must be copied as well.

A re-think of the spreadsheet to be a little closer to a SQL database (but still layman-friendly) would be far more sensible. I wish something like Lotus Improv had won.

> A re-think of the spreadsheet to be a little closer to a SQL database (but still layman-friendly) would be far more sensible.

You could probably get there by cutting features out of Excel -- Excel has a lot of database-like features that can mitigate some of the problems of using it. The problem is that most of the people using it do what is easiest and most discoverable in the UI -- or what they learned as a power user 20 years ago, or learned cargo-cult fashion from (perhaps through intermediaries) someone who learned then -- so a lot of the features that are more clear and maintainable are rarely used.

i.e., the next time I encounter a spreadsheet in my work that I didn't design that uses column names in named tables for references rather than row/column references will be the first.

For all practical purposes, it's called MS Access. :)

Let me get this straight. It is hard to review, test and maintain code in Excel workbooks, so we should instead write custom software for these tasks?!

Custom software is, to a non-hacker, the definition of "difficult to review, test, and maintain." You may as well suggest they input their data into a mystery box with some pixies inside who will crunch the numbers for them. :)

Remember, Excel (and spreadsheets in general) ARE custom software, written for this exact need, and with decades of polish and bug-fixing so that even non-wizards can use them.

I agree that we should not fly the space shuttle from an Excel macro. But I feel that suggesting economists should write software instead of use spreadsheets is not only throwing out the baby with the bathwater, but also throwing out the bathtub and the whole bathroom too. Specialty tools exist for this very use case. Let's discuss how to improve them instead of indulging our hacker instinct and reinventing a hammer every time we need to drive a nail.

> It is hard to review, test and maintain code in Excel workbooks, so we should instead write custom software for these tasks?!

Well, maybe before we attempt to send men with guns to further deprive people of their property and liberty (which is, of course, the end goal of Piketty's work and advocacy of a global wealth tax) we should determine whether or not his results are, y'know, true?

I know, I know, truth is such an old-fashioned concept.

What a stunningly off-topic, irrelevant comment.

Ummm, my comment was on-topic: Piketty's arguments for wealth confiscation depend upon his research; if his research is wrong, then his conclusions are likewise unsupported.

And yes, when writing software we should use the appropriate tools for the task. If we're about to advocate the confiscation under force of a significant proportion of the world's wealth, then maybe--just maybe--we should use slower and more expensive but more rigorous software development methods than 'bung it into Excel and fiddle with the formulæ until they feel right.'

I'm pretty sure patio11 says on one of his blogs that anywhere anyone is using an excel file for important work is a SaaS waiting to be born.

edit: found it:

"My favorite symptom of an unmet need for software is any Excel spreadsheet which is ever updated by one employee, sent to a second employee, updated, and then sent back. Every time that happens a SaaS angel gets its wings."


It makes programmers' heads spin, but the business world is totally used to doing all its stuff in Excel.

If Excel code is hard to audit, that means someone could write a tool to show all the calculations being performed to get to a result. I'm giving this idea away for free because I'm pretty sure someone has already done it.

Writing a new program from scratch has issues, too.

Slate (https://www.useslate.com/) is a relatively new product that does just that.

Even the built in features like cell naming are little known and little used by many spreadsheet creators, leading to avoidable mistakes.

I used to work in business modelling and a part of that was creating and auditing spreadsheets so complex they fell into the "this should never be a spreadsheet" category. There are several proprietary excel addons which are designed to do exactly this. Unhelpfully I can't remember the name of the ones I used but if you Google "Excel Auditing Tools" you get quite a few.

A popular addon is the ACE toolkit. I'm sure that there are many more that may simply not be available for public consumption.

I do financial modelling and our firm has developed its own formula auditing (and Excel Swiss Army knife) tool. For example, we can analyze a block of cells and have a visual overlay of which cells contain the same formulas, which cells contain formulas of different types (external file, external sheet, embedded constant, constant, etc.).

This lets us really optimize our time while giving us more confidence that we have properly reviewed an entire spreadsheet.

I have. Working on open sourcing the code right now.

Biggest issue I've had with excel and spreadsheets in general has been "put this on the web". People spend hours/days/weeks building a spreadsheet tool that does all their calculations, then say "put this on the web" to me, and I can't do it. Not in their budget. To them, it should just be some sort of magic, but... I don't know of tools to just 'put' something like that on the web. They want people to be able to interact with it, but 'not download it'. They won't use Google spreadsheets or any office365 online stuff.

Maybe there are some tools that can take an xlsx file and make it secure, interactive and loggable and are also free (or nearly so?) Am I missing something?

Maybe the problem with Excel is that errors are more likely to stay under the radar. You can refer to the wrong cell, or extend formulas with a relative reference where it should be absolute (or the opposite), without noticing. When programming, logical errors will more often lead to fatal errors upon running the program.

If anything, it's more a matter of using Excel correctly. Tools like Slate for Excel make it easier to audit spreadsheets, especially other people's.

I'll buy that - the fact that the pretty essential VLOOKUP function defaults to guessing when it can't find a value, rather than defaulting to "Couldn't find the value you were looking for" - has been the source of endless errors.

Explicit really reduces the number of errors you run into - there's a lot of implicit going on in a spreadsheet.

I think you'll find that VLOOKUP is not used that much by people deeply invested in the modelling field. It is dangerous in the same way as referring to cell addresses in macros. (This is where named ranges are particularly useful)

As soon as the structure of your lookup table changes, your VLOOKUP formulas risk being invalidated. After being burned a bunch of times, I've switched to INDEX or an INDEX/MATCH combination to accomplish the same sort of thing. With INDEX/MATCH, you are required to be explicit in selecting both the range of the data and of the key.
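The difference between the two lookup modes is easy to demonstrate outside Excel. A Python sketch of the semantics (an illustration of the behaviour, not of Excel internals):

```python
import bisect

def vlookup_approx(keys, values, key):
    """Mimic VLOOKUP's default (approximate) mode: if the key is
    missing, silently fall back to the largest key <= the lookup key.
    Like Excel, this assumes keys are sorted."""
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        raise KeyError(key)
    return values[i]

def vlookup_exact(keys, values, key):
    """Mimic VLOOKUP(..., FALSE) / INDEX+MATCH exact matching:
    a missing key is an error, not a silent guess."""
    try:
        return values[keys.index(key)]
    except ValueError:
        raise KeyError(key)

keys = [10, 20, 30]
rates = [0.1, 0.2, 0.3]

vlookup_approx(keys, rates, 25)  # returns 0.2 -- silently "close enough"
```

With the exact variant, looking up 25 raises an error instead of returning a plausible-looking wrong number, which is exactly the failure mode the parent comment is complaining about.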

According to the Financial Times, at least, the errors were produced not by using Excel but by manipulation of the data.


Other people have praised the statistical work done on the book:

- http://www.nytimes.com/2014/03/24/opinion/krugman-wealth-ove... - http://www.telegraph.co.uk/culture/books/bookreviews/1081616...

So given the controversial nature of the subject, I wonder if we could point to an objective analysis of the work.

Error prone? I haven't found an IDE yet with syntax highlighting as nice as Excel's when I am punching in a formula that references a few cells.

People here don't seem to understand quite why people use spreadsheets.

I knew someone who used the spreadsheet supplied with MS Works (a truly hateful piece of software) to create fax cover sheets and notice signs and printed letters.

The spreadsheet was also used to create Bills of Material for electronic subcontracting. These would be printed out and then only the paper copy was relevant. (This worked better than you might think - that paperwork system had been refined over years and all the kinks ironed out. Everyone knew if a part had been ordered, arrived, booked in, kitted, issued to the shop floor, paid for by us and paid for as a final product by the customer. Moving from that paperwork system to Sage Line 100 was painful and produced a lot of confusion and kludges.)

So, people like spreadsheets because they can get stuff done. To you it might seem like they're hammering in nails with the handle of a screwdriver, but the answer is to make better documentation and better software.

(Also I see people using screwdrivers and other handtools and I wonder why noone ever told them how to do it properly.)

The problem with spreadsheets, to counter what Chris Granger is saying, is that they are NOT observable from a software maintenance perspective: the code is invisible, while the entire heap and the intermediate values are visible.

I cannot easily swap out one set of data for another, making it hard to confirm my models are running correctly.

How do I diff a spreadsheet?

How do I refactor a spreadsheet?
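A value-level diff is actually straightforward to sketch once both versions are exported to a grid of cells; the hard part in practice is diffing formulas and layout, which needs a reader that preserves them (openpyxl can load formulas, for instance). A minimal sketch:

```python
def diff_sheets(a, b):
    """Cell-by-cell diff of two sheets given as lists of rows.
    Returns (row, col, old, new) tuples; None marks a missing cell."""
    changes = []
    for r in range(max(len(a), len(b))):
        ra = a[r] if r < len(a) else []
        rb = b[r] if r < len(b) else []
        for c in range(max(len(ra), len(rb))):
            old = ra[c] if c < len(ra) else None
            new = rb[c] if c < len(rb) else None
            if old != new:
                changes.append((r, c, old, new))
    return changes

old = [["sku", "qty"], ["A-1", "3"]]
new = [["sku", "qty"], ["A-1", "4"], ["B-2", "7"]]
diff_sheets(old, new)  # one changed cell, plus the two cells of the new row
```

This only answers the "diff" question for values; a formula-aware diff, and anything like refactoring support, still needs real tooling of the kind Hermans' work describes.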

Just a default note that the whole "he got stuff wrong" claim itself might not stand up to scrutiny.


Edit: Also regarding the Rogoff data. Sure, those findings really were crap but surprise, surprise, it was crap that policy makers were oh so eager to hear. One might blame the falsity on the dumb use of spreadsheet but it seems more likely it was a case of "let's fudge data in a crap fashion till we come up with what people want to hear". Then when someone points out how it's crap, we always blame the spreadsheets.

> it seems more likely it was a case of "let's fudge data in a crap fashion till we come up with what people want to hear"

That is a pretty bold claim.

Rogoff has a pretty long history of publishing what well-heeled people apparently want to hear, it's amazing that he was able to hide behind the "this was just a spreadsheet error" point.

The problem is I've mostly observed his chameleon act in radio interviews pre- and post-crisis. In the pre-crisis interview, he claimed there was no problem; in the post-crisis interview, he talked about how angry he was at the folks who claimed there was no problem. So unfortunately I can't provide a link (the files probably wouldn't be online anymore even if I took the trouble to find them - smart shills do things by voice, it seems).

Having worked for Investment Banking clients, I can see what the author intended to address.

I have seen a lot of abuse/overuse of Excel/VBA.

I once had to write a CRUD front-end with Excel/VBA ActiveX components retrieving data from Sybase / SQL Server databases.

Which could have been easily engineered as a simple web-app. "But NO .. IT HAD TO BE A FUCKING SPREADSHEET"

Welcome to my world. We are using excel for everything here, because basically it means the user doesn't have to think. That means there are countless errors, and I get to be a data entry clerk rather than a software developer.

I wrote an answer to this on my blog http://www.felienne.com/archives/3355

and @seansickle Thanks for linking to my stuff!

I've observed disasters created in spreadsheets. I've also observed -- and occasionally created -- robust models and calculations.

One flipside is that, if you know what you're doing, it's very easy to created orthogonal calculations and comparisons in order to check your work. And, if something doesn't compare or look right, to track back step by step and through all the precedents to find the fault(s).

Databases can also create and hide problems, particularly if they are not understood and/or designed and set up properly. And some problems thusly created can be rather opaque, particularly to those without a good understanding of or access to the underlying design and the principles upon which it rests.

As with many things, ultimately I found that the problems lay not with spreadsheets, per se, but with the people using them and with the organizations that tasked those people without consideration for their capabilities (and limits thereof) and without adequate resources to do the job right.

Personally, there has been a time or three where -- in significant measure due to such resource constraints -- I would have been sunk without the flexibility that a spreadsheet -- often in combination with some programming and database work -- provided me. For example, I could quickly and programmatically deal with the 95% that was clean enough, and then manually go through and figure out and adjust the crap.

Ideal? Perhaps not. But then, "business" seldom is. (Nor "life", for that matter.)

Meh, use the right tool for the right job. Sometimes that tool is a spreadsheet. My biggest complaint about Excel is that it fails gracefully when I would prefer that it pitch a fit.

This is all fine, but he doesn't suggest an alternative. Which, of course, there aren't any. Which is the reason it's a problem.

Don't social scientists and economists spend a huge amount of time learning statistical software such as SPSS or Stata or Gauss? Why bother with Excel?

Not in my experience (economics). An econometrics class used Stata and I know a couple of statistics classes used SPSS, but beyond that it was use whatever you want, just get the job done. And for most people "whatever you want" was Excel, with a few people using Matlab. I never saw anyone using Stata when they didn't have to.

Admittedly one reason for this was that all student machines throughout the university had Excel installed and all the students had Excel on their laptops, while Stata was only available in a couple of computer labs. And it's pretty much the same once you get out to the working world. No matter where you go in the world or what office you walk into, if you sit down in front of a Windows machine, it will have Excel. Convincing your boss to buy you Stata or SPSS just so you can do a couple of quick statistical tests that you are perfectly capable of doing in Excel is just too much of a hassle most of the time.

"The fact that they use spreadsheets suggests that they spend very little time worrying about accuracy." I don't get it. Why does the fact that a person uses a particular piece of software suggest that they don't care about accuracy? Does this mean that just because I use R or Python I worry more about accuracy? Really? I think it has more to do with loss of control - real developers hate Excel because it gives business people something they can work with straight away. Every time there's a bug which could remotely be related to spreadsheets, people start another wave of Excel-hate. As if this was entirely impossible in a real programming language. Don't get me wrong, spreadsheets are not perfect by any means. But neither is anything else in this world. It purely comes down to choosing the most suited tool for a given problem. And in Piketty's case there's nothing wrong with choosing Excel.

Did you read the article? The author spends most of it attempting to justify that exact point. You may or may not agree with their arguments, but saying "I don't get it" without addressing any of those arguments isn't a useful response.

The problem with spreadsheets is not a simple issue of good vs. bad. Spreadsheets have an extremely flexible and well accepted if not intuitive interface. The problem is that they do not provide visibility into what processing is occurring, require a lot of error-prone manual manipulation, and are difficult to audit. As such, they are not really designed for testing and disciplined business processes that ensure accuracy and data integrity. You can't have complete flexibility as well as rigid controls.

One solution is to recognize when a given spreadsheet's usage has increased to a point where it - or some portion of its functionality - would be better embedded in an application. Another is to have some users adopt an environment/language like R which addresses the shortcomings listed above at the expense of being more complicated and less user-friendly. But there is no simple solution that is going to result in spreadsheets disappearing from use.

We use a spreadsheet to do gantt charts at work. Although Microsoft project is available, not everyone has installed it or knows how to use it. Everyone knows how to do basic Microsoft Excel. These schedules are only to a week granularity so we use one column for each week and each row becomes a task.

"… but I will not use Microsoft Excel to run a bank"

Obviously you never worked in a big bank. Most traders use Excel a lot, if not in every moment of their work. It depends on the bank of course, but in some banks even the trades are made from Excel spreadsheets.

Pretty much nobody will be getting right onto fixing this one, you can rest assured.

Those millions and billions of pounds and dollars and euros swishing back and forth in the financial centres? Yeah, Excel macros. BE AFRAID.

Same problem for any graphical programming environment which doesn't provide proper software tools, like diffs for updates, tests, coverage, debugging. But nevertheless they are used and recommended very successfully.

Matlab Simulink is a very similar environment used professionally to a much higher degree. There are extensions in some toolbox for graphical diffs, but who uses it? Who can review the diffs in a version control system?

And those systems are pretty hard to debug also. In a 10Khz loop you cannot step through or set breakpoints and hardly do printf debugging.

Data in arrays, exports to CSV, an easy API... what's not to love?

I remember one job where I wrote a lot of tools to load a database by reading an Excel spreadsheet from Perl[1]. It was amazingly fun.

1) Perl does have some good libraries to read Excel files without exporting

I have this as a "temporary" solution (now three years old) for getting data into my database. OK, I am using Python, not Perl. There is nothing fun about debugging this.
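For what it's worth, the "temporary" pattern can be kept quite small. A minimal sketch using only the standard library, with a CSV export standing in for the real Excel reader (in practice a library such as openpyxl or Perl's Spreadsheet::ParseExcel does that part); the table and column names here are just examples:

```python
import csv
import io
import sqlite3

def load_csv_into_db(conn, table, csv_text):
    """Load a CSV export of a spreadsheet into an SQLite table.
    The first row is treated as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f'"{h}" TEXT' for h in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
    conn.commit()

conn = sqlite3.connect(":memory:")
load_csv_into_db(conn, "parts", "sku,qty\nA-1,3\nB-2,7\n")
```

The painful part is never this loop; it's the debugging of inconsistent cell contents, which is the point of the comments below.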

The non-technical users can't tell the difference between 2, 2.0, and ' 2.0'. Usually Excel will display the same thing to them.
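The distinction matters as soon as the data leaves the spreadsheet. In Python terms, this is what a programmatic reader actually sees in those three cells:

```python
# Three values that a spreadsheet can render identically in a cell:
a = 2        # integer
b = 2.0      # float
c = " 2.0"   # text, with a leading space

a == b              # True: numerically equal
b == float(c)       # True: but only after explicit conversion
b == c              # False: a number never equals a string
c.strip() == "2.0"  # True: the stray space has to be cleaned off first
```

Joins, lookups, and database loads all silently fail on exactly this kind of mismatch.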

I recently replaced one of their Excel files with a web form. "Ughh, una mierda" (it's shit). So I asked what Excel did that the web form didn't. Basically a bit of JS to give some copy-paste functionality, and she is now a lot happier with that solution than she was with the Excel file.

Perl also has nice libraries for creating Excel files. Used that at my last job to generate reports from MySQL.

Yep, it is amazing how much b_tching about crappy reports ends when you just give the raw data to a customer and let them pivot table to their heart's desire.

So does Python.

Import and export. I get a lot of love from exporting to Excel from people who use what I make. I assume this feature makes it seem that they are doing a lot of work when they can quickly produce large spreadsheets for the boss. The boss? He usually doesn't know nor care. He knows he approved a check for some computer stuff some months ago.....

What's scary here is how much of Finance is based on Excel. Tons and tons of spreadsheets. Pretty much every bank's risk measurements go through a spreadsheet at some point. So do most trades. High frequency may get executed in C++, but somewhere along the way it gets aggregated in Excel.

The only thing worse than being able to see all these convoluted spreadsheets, is not being able to see the code behind non-spreadsheet systems.

I think tools like MATLAB could bridge the gap between spreadsheets and general-purpose programming/data analysis. It has the huge standard library Excel users want, a REPL for prototyping, visual data inspection, and the ability to very quickly create charts and figures. The scripts are also easy to share and it doesn't involve use of the command line unless you want it to.

Anyone want to teach "MATLAB for Bankers"?

Bahaha. I know of multi-million complex international 3-party deals in the oil and gas business that were worked out on a spreadsheet. Insurance actuaries I know of refuse to move away from their spreadsheets that do business-critical risk table calculations.

It's idiocy, but I've seen the poor spreadsheet program warped into even worse uses, but not for business-critical stuff.

While spreadsheets are sub-optimal, the work was OPEN so people could find those bugs. We need to stop beating scientists who share their data because currently there is little incentive for them to do so.

Most scientists do not make their code and resources easily available. I'd prefer an open excel sheet to a closed python analysis which is only slightly less likely to be wrong.

Some groups however do share their code [1] (I am not really aware of similar sites for fields other than astrophysics, although there is plenty of code out there).

[1] http://asterisk.apod.com/wp/?page_id=12

Excel can only handle a certain number of records. I've personally found it hard to deal with more than 100,000 rows in Excel.

The limit went up considerably with the introduction of 64-bit Excel. Also, there is a plugin called "Power Query" that allows the actual data to be stored in MSSQL. I have seen this work on tables with 100s of millions of records without a hitch.
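Once a dataset outgrows the grid, the usual move is to stop materialising it at all and stream the aggregation instead, whether in a database or in a script. A minimal Python sketch (the `amount` column name is just an example):

```python
import csv
import io

def sum_column_streaming(csv_file, column):
    """Aggregate a column without loading the whole file into memory --
    the standard workaround once data outgrows a spreadsheet's row limit."""
    reader = csv.DictReader(csv_file)
    total = 0.0
    for row in reader:  # one row at a time, constant memory
        total += float(row[column])
    return total

data = io.StringIO("amount\n1.5\n2.5\n4.0\n")
sum_column_streaming(data, "amount")  # 8.0
```

Tools like Power Query do the same thing behind the scenes: the spreadsheet becomes a front-end over data that lives elsewhere.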

I've done a bit of work building a complex Excel spreadsheet, because that was the job - and what I wanted afterwards was an automated way to turn it into a Python script. Collaborating in Excel is tricky - it was a large enough spreadsheet that 32-bit Excel would sometimes crash, and version tracking was entirely manual.

I've worked for many large financial institutions as a quant and am intimately familiar with use of spreadsheets in critical functions. I believe this is a multi-billion dollar opportunity. If someone is interested in start-up ventures in this space, I would be very interested in discussing further.

How would you overcome the very large inertial barrier that exists in that world to using something new and untested? The world of finance and investment banking is a world of networking and extreme resistance to change. There's a reason Bloomberg's suite of products sits as pretty as it does, and unless some great upheaval occurs will continue to do so for many years (it's not because the products are easy to use, developer friendly, or even good tools necessarily).

Also, you should put some contact info in your profile, or provide it in your comment, if you want to be contacted for further discussion.

> How would you overcome the very large inertial barrier that exists in that world to using something new and untested?

I've already successfully done this. The sales cycle can admittedly be long, but it's not impossible.

You have to create a product that addresses pain points for business users. It has to be a compelling product and it has to solve problems for business decision makers. Initially, you want to sell to the business, not to the IT department. It turns out that spreadsheets are mainly used by business users and subject matter experts, instead of IT.

The next issue is price. If you are trying to sell something for (say) $1 million to a big corporation, then it is quite a long sales cycle (but still achievable). My personal take on this is you do want to eventually be charging several million dollars per corporate client, but you also should have a low-cost "entry level" version of the product. The cost has to be low enough that a business power user could charge it to his Corporate American Express without being noticed (say $1,000) and download and install your application (he will need admin rights, which he might in fact have).

> The world of finance and investment banking is a world of networking and extreme resistance to change.

Yes there is a lot of networking that goes on. This can really work in your favor, because there is a sort of herd mentality. Once people start talking to each other about your product, customers will want to talk to you. If you close a large deal, news will quickly spread and other large deals can follow.

I would not agree that there is "extreme resistance to change." In my experience, banks, hedge funds, etc, are often early adopters and one of the largest users of high technology.

> There's a reason Bloomberg's suite of products sits as pretty as it does

I have Bloomberg in my cross-hairs, next to Excel.

> and unless some great upheaval occurs will continue to do so for many years

You are correct that change often requires a catalyst. But I think that great upheavals are happening everyday, but they are hidden behind closed doors, occurring deep inside of companies and departments. In my experience, it is precisely during those times of upheaval that you have the best chance of introducing a new, disruptive technology into an organization. There are many kinds of upheavals that happen. (mergers, divestitures, running into Excel's maximum row limit, etc) and they can all be catalysts for change.

I've put my email address in profile now, thanks for the tip.

The process NOT the tool.

Yes spreadsheets have errors, but the only method I know of that is able to get error rates down to near zero is due process. There are good processes that reduce error rates to near zero, but there are no tools that reduce error rates to near zero. The problem is the process, not the tool.

> The process NOT the tool.

But the tool constrains the process, which is why you should choose tools by appropriateness to the process they are going to be used in.

Although I agree with the author, using Piketty as the example seems counter to his argument: the FT found the errors in the spreadsheet in a few weeks. They were obviously not very difficult to find. Sounds like Piketty's problem was a lack of peer-review, not the tools he used.

Interesting that the article on wealth concentration (linked from the article), doesn't mention Piketty (yet). It needs some work.


A thought experiment: is it advisable to save mission critical, structured information as a loose collection of schemaless text files?

If it is not advisable, why is it accepted for software engineering professionals to manage their source code this way?

Well, a programming language with a specification has a pretty well-defined schema.

The E?-BNF provides a syntactical schema for the code and the rest of specification provides the semantic schema.

I look forward to the day when F# is a first class citizen in Excel. That will make my life much easier.

TLDR: Use the right tool for the right job.

We've done business with pen and paper for thousands of years.

I think we can handle spreadsheets. While I'm a programmer and often find it easy to whip up a program to do what many would do in a spreadsheet, I don't have the arrogance to tell people to not use spreadsheets for serious work.

On the contrary, I encourage it. It's a wonderful tool, and, in the end, superior to pen and paper.
