

Ask HN: Why do spreadsheets get out of control?  What did you do about it? - cyswest

Lots of businesses use spreadsheets, but a large number of people seem to run into "spreadsheet abuse" of some sort -- where spreadsheets continue to be used despite being a less-than-optimal solution for whatever reason.<p>Have any of you experienced this phenomenon?  What was the nature of the data, and why weren't spreadsheets working out?  How severe a problem was it, and what did you end up doing about it?
======
dmlorenzetti
In general, I think spreadsheets get "out of control" because they offer a
model by which non-programmers can program. As a result, individual files
become quite huge, even when they never get extended to inherently
unreasonable tasks.

Digging deeper, why are huge files a problem? I believe it's because
spreadsheets hide the manipulations they encode. This makes them hard to
verify, hard to test, and hard to re-use. The larger the file, the more
difficult it is to inspect and verify.

In short, spreadsheets present an opaque programming model, whose problems
compound as the file size increases.

My personal solution has been to use command-oriented data analysis tools,
mainly "R". Programs like these combat both huge file sizes and opacity by
storing vector-oriented commands, rather than vectors of cell-oriented
commands.

To give a practical example, suppose you have a spreadsheet listing 100
employees, their hours worked, and their hourly rate. You want to multiply out
the cost per employee, and add all those values up.

With a spreadsheet, you have 100 cells, each containing a product of two
numbers, then another cell containing the sum. To verify the calculations, you
would have to click through 101 cells, inspecting their contents.

With a tool like R, the identical operation would be something like:

  employeeCosts <- employeeHours * employeeHourlyRate
  totalCost <- sum(employeeCosts)
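Filled in with made-up numbers (the five-employee vectors below are illustrative, not from the original scenario), the whole calculation runs and can be read at a glance:

```r
# Illustrative data: five employees (100 would work identically)
employeeHours      <- c(40, 35, 42, 38, 40)
employeeHourlyRate <- c(20, 25, 18, 30, 22)

# One vectorized command replaces a whole column of per-cell formulas
employeeCosts <- employeeHours * employeeHourlyRate

# A second command replaces the final SUM() cell
totalCost <- sum(employeeCosts)

print(totalCost)
```

Two lines of code carry the entire logic, no matter how many rows the data has.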

Furthermore, with a spreadsheet you may not really care about that column of
{employeeCosts}. Still, you have to calculate them, and you have to look at
them as you attempt to make sense of your calculations. So you can either: (1)
leave the column along with all the other columns, making it harder to focus
on the results you care about; (2) put the column on a separate sheet, making
it harder to verify; or (3) do some visual trick like minimizing column
widths, making it harder for somebody else to explore and understand your
code.

A command-line tool, on the other hand, gives you lots of ways to avoid having
to look at unwanted intermediate values:

  (1) rm(employeeCosts)
  (2) totalCost <- sum( employeeHours * employeeHourlyRate )
  (3) findTotalCost <- function(timeSpent, rate) { return(sum(timeSpent*rate)) }
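The function version, for instance, runs as an ordinary script (the sample numbers here are made up purely for illustration):

```r
# Wrap the calculation so intermediate values never clutter the workspace
findTotalCost <- function(timeSpent, rate) {
  sum(timeSpent * rate)
}

# Hypothetical example data
hours <- c(40, 35, 42)
rates <- c(20, 25, 18)

totalCost <- findTotalCost(hours, rates)
```

The same function can then be called on next month's data, or anyone else's, without touching the logic.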

The last solution leads to re-usability, which is where spreadsheets really
start to fall down. Every spreadsheet I've ever been sent that links to
another spreadsheet has been a usability disaster.

Once you make the jump to a real data analysis tool, the benefits begin to
multiply. For example: (1) It won't take 5 minutes to attach your files to an
e-mail to send somebody. (2) You can version-control your work. (3) Your
graphics will almost certainly improve in quality. (4) You will find it easier
to maintain a "Single Point of Truth" -- for example, the table of employee
rates can "live" in a single file, which individual analysis scripts load as
needed, rather than getting duplicated in every spreadsheet that needs the
numbers.
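As a sketch of that last pattern (the file name and column names are hypothetical), the rate table lives in one CSV and each analysis script reads it, rather than carrying its own pasted copy:

```r
# In practice employee_rates.csv is maintained in one shared location;
# we write a small hypothetical copy here so the sketch is runnable.
writeLines("name,hourlyRate\nalice,20\nbob,25", "employee_rates.csv")

# Every analysis script loads the single shared table...
rates <- read.csv("employee_rates.csv", stringsAsFactors = FALSE)

# ...and joins its own data against it as needed
hoursWorked <- data.frame(name = c("alice", "bob"), hours = c(40, 35))
costs <- merge(hoursWorked, rates, by = "name")
costs$cost <- costs$hours * costs$hourlyRate
```

When a rate changes, you edit one file and re-run the scripts; nothing needs to be hunted down across a dozen spreadsheets.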

