Missing Covid-19 test data was caused by the ill-thought-out use of Excel (bbc.com)
1041 points by cjlm on Oct 5, 2020 | 801 comments



I work in healthcare and by far the biggest production issues we've ever run into are people who run Excel spreadsheets through our file processing where the leading zeros have been removed from patient identifiers because that's Excel's default behavior and you CANNOT TURN IT OFF!

EDIT: I have no idea who downvoted my post because what I said is 100% true. We have to tell customers to stop opening CSVs in Excel and then uploading them to us because what they upload could be missing critical data. Excel interprets a number and then formats it as a number, but in healthcare, 10 digit numbers are really strings. Unless your IT team has created an Excel extension and had it preloaded onto your local copy, Excel will remove leading zeros and there isn't a way to get them back if you save/overwrite.


Every single problem I've ever had with excel stems from excel trying to be too clever. Nearly all my problems with importing data into it could be solved if excel left everything as is until I asked it to format values a certain way.


>Every single problem I've ever had with excel stems from excel trying to be too clever. Nearly all my problems with importing data into it could be solved if excel left everything as is

We're biased to bash on Microsoft for being "too clever" but maybe we need a reality check by looking at the bigger picture.

Examples of other software not written by Microsoft that also drops the leading zeros and users asking questions on how to preserve them:

- Python Pandas import csv issue with leading zeros: https://stackoverflow.com/questions/13250046/how-to-keep-lea...

- R software import csv issue with leading zeros: https://stackoverflow.com/questions/31411119/r-reading-in-cs...

- Google Sheets issue with leading zeros: https://webapps.stackexchange.com/questions/120835/importdat...

Conclusion: For some compelling reason, we have a bunch of independent programmers who all want to remove leading zeros.


I'm certainly not singling out excel, this thread is just about excel so thought I'd share my frustration with that. I've had similar experiences with pandas certainly. I've not used the rest of your examples, but do have similar issues with libreoffice calc. Regarding pandas, it is at least easy to sort out the assumptions independent programmers make by including 'dtype=str'.
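For anyone hitting the same thing, a rough sketch of what I mean (file and column names are made up):

    import pandas as pd

    # Default behaviour: pandas guesses types, so "0123456789" becomes the
    # integer 123456789 and the leading zero is gone.
    df_guessed = pd.read_csv("patients.csv")

    # Forcing everything to string keeps the values exactly as written.
    df_text = pd.read_csv("patients.csv", dtype=str)

    # Or pin down just the identifier column and let the rest be inferred.
    df_mixed = pd.read_csv("patients.csv", dtype={"patient_id": str})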


The issue is that CSVs are untyped and software has to guess the type, sometimes unsuccessfully.


Does it? I think it should treat everything as a string since it can't be certain about the correct type. It could also suggest the correct type rather than applying it without user confirmation.


You want Excel to auto-interpret a table of numbers that came from a CSV as entirely text?

Looking forward to the first time anyone tries to use your excel on a table of numbers and then immediately has to multiply everything by *1 (in a separate table) just to get it back into numbers...


Oh yes please.

At least you would know what's happening and be in control of it

"Hey, is that a date? I bet that's a date!" - Aaaargh Noooo!


YES!! I seem to remember a few versions of Excel ago there was a way to import data and set the data type for each column in a preview window to see what it would look like. And to your point, it's very easy to select an existing column in Excel and change the data type anyway. So why not just make everything a string by default?


Maybe you recall the CSV import window in libreoffice


You are right on the cause of the issue. But the software does not HAVE to guess the type. If I paste something into Excel, I want what I pasted to be what's in Excel. Changing the type and inadvertently affecting the data isn't something any data software should do by default. And the fact that OP said they can't turn it off means that this is beyond a design mistake, it is a FEATURE. If Excel took itself seriously as a data platform, it would not make a change to any text by default.


But Excel has never been a data platform. It's a spreadsheet.

It's users' fault for using it in ways it was never designed for.

Excel has always been about sticking numbers in boxes and calculating with them.

If you want unmodified string input, input strings into a tool intended to handle them.

Project specifications can be hard. Using 1) .xls files after they were superseded, 2) ANY data transfer method without considering capacity or truncation issues, speaks of incompetence.


You can format the column as text before you paste, or you can load data from text and set the types for every column. Excel has features to handle all of these issues. I think its biggest problem is that it's just too easy to use. It will allow novices to do all sorts of things with their data without providing guardrails for common mistakes.


Agreed with this. Excel does have a data import feature that allows us to fine-tune the import process and prevent this kind of mistake.

People just double click the CSV and then complain that it didn't do it correctly. It is the same situation with scientific research data, where researchers don't bother to use an escape marker or blindly open the file without going through the proper import process. Then they blame Excel for that without understanding how Excel works.

Yes, Excel does have its quirks. But there are ways around those quirks, and there are thousands upon thousands of guides out there about Excel. There is no excuse for complaining that Excel didn't do what you wanted without looking up the information.


Except that literally every end-user I've ever had the displeasure of dealing with did not know how to properly import data into excel.

Double clicking the CSV should open the data import dialog.


Yeah it does; libreoffice has methods of recognizing numbers that are to be interpreted as strings as well. This is user error, and the programmer trying to blame the program rather than treating it as a process issue. The spreadsheet can't read your mind.


> If I paste something into Excel, I want what I pasted to be what's in Excel.

And that's what you have in Excel. What gets displayed is a separate issue.

And no, you don't want to see exactly what you typed in, not in the general case.

And no, I can't believe I am defending Excel!


I believe Excel considers itself a finance system, not a data system. Hence the clear bias towards numbers over strings.


> has to guess the type

I believe that's the point, it certainly does NOT need to.


Absolutely! I ran into the arff format when using Weka and it was a revelation:

https://www.cs.waikato.ac.nz/ml/weka/arff.html


Well Excel trying to be clever stems from people not trying to be clever. We truly reap what we sow.


As an intern I was once given a spreadsheet of manually entered daily data, stored by a human being over the previous couple of years. Obviously he'd entered the dates in whatever form he felt like at the time and Excel had parsed them in a variety of godawful ways not least of which was defaulting to American month/day assumptions before UK day/month.

I think it would have been far quicker to just manually write a new column interpreting the dates based on previous/next etc. Instead I spent God knows how long trying to be clever, failing, and being embarrassed that I could not solve this obviously trivial problem.


And the people who make excel will just tell you "excel is not a database" lol


Which sadly, is what I want to tell them too!


The right way to open CSVs in Excel is: Data -> From Text -> Browse to your csv -> Delimited -> Comma -> Next -> Select Text formatting for columns with leading zeroes -> Finish

You can either send xlsx with formatting or csv without. If this were disabled, we'd have another group of people complaining that their dates from CSV are not parsed.


and to enforce that, remove the file association which opens .CSV files in Excel by default. If every machine opened CSV files in notepad a huge percentage of 'Excel' problems would just go away. It could be worse though, try opening and saving a CSV in Word and see what that does.


I used to work with books, which obviously includes ISBNs. They are 13 digit numbers, helpfully turned into scientific notation by Excel.


I used to do migrations between library systems, and often they'd export from the old system as CSV. One rule I gave them was "never open this file in excel", because they'd always try to be helpful by tidying something and I'd end up with scientific notation ISBNs, dates that were in one form if they could be coerced into US-style infix, otherwise left in the original form, barcodes losing zeros, etc. etc.


The same problem happens with UK phone numbers, where the leading 0 gets chopped off. It's solvable if there's a space between the dialling code and the number, but when individuals enter their number, they don't always do this.


Don't ISBNs also contain an X sometimes?


The old ISBN10 standard could get X (meaning 10) as the last character (the last number is a check digit), but ISBN13 (which prefixes 978 or 979 to the old number) is all numbers. The check digit is calculated differently. One obvious advantage is that ISBN13 is barcode compatible.

I would guess that most modern actors in the book business have been primarily using ISBN13 for at least the last decade.


Thanks! The last time I read about ISBNs was probably more than a decade ago.


X can be a check digit. From memory the check digit is calculated using a modulo 11 calculation.
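Roughly this, if anyone's curious (the weights run 10 down to 2 over the first nine digits; a remainder of 10 becomes X):

    def isbn10_check_digit(first9: str) -> str:
        # Weighted sum: first digit x10, second x9, ..., ninth x2.
        total = sum(int(d) * w for d, w in zip(first9, range(10, 1, -1)))
        check = (11 - total % 11) % 11
        return "X" if check == 10 else str(check)

    print(isbn10_check_digit("043942089"))  # prints X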


Interesting to see just how many responses to this real-world problem are suggesting that people rewrite CSV files to work with Excel.

They clearly do not understand system integration and the use of CSV text files for data interchange between multiple systems and applications. Hey, JSON and JavaScript libraries are the answer to that, eh?

There are already enough potential issues with CSV interpretation on wrapping strings, escaping characters and so on, but changing the content when a delimiter is found should not be added to that list.

Your bold point is the most important: the default behaviour of Excel when opening a plain-text CSV file is to alter the content for display, applying magic and often-unwanted formatting rules. That should be optional.

It should be possible to open a text CSV file in Excel, view the contents in columns but in the same textual form, save the file in CSV format, open it in another viewer/editor and still see the same content as the original file.


The problem is Windows doesn't have a preloaded double-click viewer for CSV except Excel, if you have Office. The average user just double clicks. If Notepad or WordPad did any kind of formatting for viewing, it would be less of an issue. So as a dev you have to solve for Excel, knowing that's what 99% of people will use.


If you 'solved for Excel', it would be broken for the other applications that may want to use data from those Excel-specific CSV files, which is generally why CSV was chosen in the first place.


If you want it to be a string then put it in quotes. It's that simple. If the user is too dumb to slow down and look at the import options it is not excel's fault, it's a process issue. Excel can't read a cell and then read your mind and extract the correct option.


> there isn't a way to get them back if you save/overwrite.

Of course there is, in your case. If patient identifiers have a fixed, known length, then you can pad with leading zeros to recover them.

You only have a problem if 012345 and 12345 are distinct patient identifiers.

It is bone-headed in the first place to use numeric-looking identifiers (containing digits only) which are really strings, and then allow leading zeros. Identifiers which are really strings should start with a letter (which could be a common prefix). E.g. a patient ID could be P00123.

This is useful for more than just protecting the 00. Anywhere in the system, including on any printed form, if you see P00123, you have a clue that it's a patient identifier. An input dialog can reject an input that is supposed to be a patient identifier if it is missing the leading P, or else include a fixed P in the UI to remind the user to look for P-something in whatever window or piece of paper they are copying from.

The main point in my comment is that instead of shaking your fist at the behavior of other users in the system, such as those who choose Excel because it's the only data munging thing they know how to use, you can look for ways that your own conventions and procedures are contributing to the issue.

If people are going to "Excel" your data, and then loop it back to you, maybe your system should be "Excel proofed".
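A hypothetical sketch of what that convention buys you, assuming a made-up "P" prefix and a fixed five-digit width:

    import re

    # Canonical form: 'P' followed by a zero-padded number, e.g. P00123.
    ID_PATTERN = re.compile(r"^P\d{5}$")

    def to_patient_id(raw: str, width: int = 5) -> str:
        """Accept 'P00123', '00123' or '123' (a value Excel has already
        stripped) and normalise it back to the canonical form."""
        digits = raw.strip().upper().lstrip("P")
        if not digits.isdigit():
            raise ValueError(f"not a patient identifier: {raw!r}")
        return "P" + digits.zfill(width)

    assert to_patient_id("123") == "P00123"
    assert ID_PATTERN.match(to_patient_id("P00123"))

The letter costs nothing, but it means a stripped value can be repaired and a mangled one gets rejected instead of silently accepted.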


The lengths of the identifiers are not known, man. We are dealing with like 200 hospitals. They all have various kinds of Electronic Medical Record systems, like Epic or Cerner, or one of the 15 other third-party vendors that all have custom implementations. Healthcare is insane in our country.


First off, I “hate” Excel, but only because people misuse it. It, like so many other products of its kind, is for numbers! Yet I would be hard-pressed to find someone actually using it for numbers. People absolutely insist on using it as a generic table system for holding non-number data.

I recently had to show a 20+ year Excel-using fanatic how to import data from a CSV file so that they could select as type Text columns that contain leading zeros. The ability exists, but I have found scant few people who know how to actually use the product properly.

Oh, and I also work in healthcare.


Excel supports a variety of data types other than numbers and includes many built-in functions for dealing with non-numeric data. Whatever its history, it is not accurate to say that it is only for numbers.


Regardless of its _additional_ support for other types, those are only in support of its primary role as a spreadsheet. It is designed around the concept of plugging in _numbers_ and calculating with them. Everything else is to make those calculations well notated and human friendly.

Any other use of Excel is bending it into a role it wasn't intended for, and user beware.

And it is all too easy to just go there, since there are soooo many convenience features for those who don't want to learn how to do the tasks well.


If notation were the sole purpose there'd be little need for anything but row and column labels or simple string input. Instead there's a full set of string manipulation tools, embedding of other objects, charts, images, drawings etc.

You are taking the application as it existed 35 years ago and saying it must still be that thing, yet it has had 35 years to evolve far beyond that. Microsoft itself, when it talks about Excel, talks about using it to organize "data", not just numerics data. It has become a more general purpose tool.


As a lay-person, what would you use for non numbers tabular data?


People love to use Excel to look at (with their eyes, or a screen reader, i.e., non programmatically) all kinds of data. They use it to list exports from data stores, and then take notes on what they discovered, show and hide data using the filter option, etc. It is just far too easy to use (and misuse) that it ends up being used for any type of data that fits into a table.


Excel has lots of different uses. It can be used for inventory tracking, sales, catalogs, mileage, contact lists, and the list goes on. It is useful in many different fields and industries.

You have to remember, Excel is an extremely powerful beast. It has many specialized features to handle whatever data it encounters. I have used Excel for 15 years and I am still finding features that make the process quicker. Of course, Excel has its limits and I am well aware of that.


I mean I thought you said that they should not use excel for these scenarios, but something else. So I thought you had an alternative software in mind?


I use Splunk (our organization pays lots of money for it, and yet it just sits there because people would rather use Excel). There is also Tableau and other products like it. Most of these tools have a free version, as well. They take some learning, but it is well worth it. Of course, taking that same amount of time to actually learn how Excel works is worth the time, too. I guess the issue is that Excel is just too "easy" to use, and hence people using it for years and years and not actually knowing how to use it.

I am not trying to be an elitist about this. It is just that the misuse of Excel (because people do not know how to use it) causes massive issues on a daily basis.


I haven't used it much myself, but I think Access is meant to answer this question.


Excel is fine, but keep the data as xls and you won't have problems like dropping leading zeros. As soon as you convert to a simpler file format like csv, you will likely lose data unless you are very careful in both export and import.


I just upvoted you.

You are fully correct, I have seen plenty of stuff like that in life sciences projects.


> Excel interprets a number and then formats it as a number,

Unless you instruct it to interpret the field as a string.

> but in healthcare, 10 digit numbers are really strings.

I'm wondering, if you expect 10 digits and you get less than that, how difficult is it to add some padding zeroes?


With US ZIP (postal) codes, padding with zeroes can actually be ambiguous because we have "zip+4" which is often stored as two fields, but just as often concatenated into a single representation. If you have a zip+4 of "05555-0444" in two fields that each lose leading zeroes, and then concatenate to "5555444" you can't reconstruct it with certainty (00555-5444 or 05555-0444 or 55554-0044).

I had a package get seriously delayed one time because it kept being sent to a city whose 5 digit zip code was equal to [my 5 digit zip code, less its leading 0, and the first digit of my +4]. Fun times.
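A quick sketch of why the concatenated form is unrecoverable; it reproduces exactly the three readings above:

    def candidates(concatenated: str):
        """All (zip5, plus4) pairs whose zero-stripped concatenation matches."""
        out = []
        for cut in range(1, len(concatenated)):
            zip_part, plus_part = concatenated[:cut], concatenated[cut:]
            if len(zip_part) <= 5 and len(plus_part) <= 4:
                out.append((zip_part.zfill(5), plus_part.zfill(4)))
        return out

    print(candidates("5555444"))
    # [('00555', '5444'), ('05555', '0444'), ('55554', '0044')]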


> Unless you instruct it to interpret the field as a string.

Except double clicking a CSV does not give you the option to do this. At this point Excel already decided to corrupt your data. And guess how most users open CSV files? That's right, they double click.

> I'm wondering, if you expect 10 digits and you get less than that, how difficult is it to add some padding zeroes?

If you expect 10 digits and get less than that, you have corrupted input data. Trying to "fix" this is exactly the sin excel is committing. Don't do that.


This would assume people (many who have over a decade of experience using Excel) actually know how to do that. The largest problem with Excel is people.


Excel rules. I still use it daily despite having access to big boy tools (python, SQL, R, etc.) Few tools make it easier to answer quick, simple questions about a dataset. Plus it's still the best way to browse SSAS cubes. Blaming excel for these sorts of problems is like blaming a hammer for not being a screwdriver.

I also work in healthcare.


Their database system afterwards could do that.

Or prepend a letter when producing the CSV to avoid EXCEL doing what it does.


Maybe it can be solved by prepending "ID:" for exports, and removing it on import. It could be proposed as a convention for all HC software companies.


Introducing standards across many companies is incredibly difficult but your main problem is that a lot of healthcare IT systems are very old and not actively maintained, they're essentially on life support. So changing the format of a file, or requiring new import cleaning, is going to break integrations and data sharing in many places. Basically it'd create far more problems than it's attempting to solve.


That is understandable.

I guess it depends on which country and which specific part of healthcare you are active in. In the 'care' domain in the Netherlands I also see that some of our integrations are one-off, but the most important ones do have industry wide standardization that receives updates based on law.


Although make sure not to use that for the first column heading: a file that starts with an uppercase "ID" will make Excel think it's a SYLK file and then refuse to open it.


Good catch! The Wikipedia page confirms it.

Nothing in software development is as straightforward as we might hope.

https://en.wikipedia.org/wiki/Symbolic_Link_(SYLK)


IME numeric identifiers are just asking for trouble. Some systems are sensitive to leading zeros, some systems will zero-pad, <input type="number"> loses zero-padding, etc. Worked with a client once with references to the same entity scattered across two databases, one of which zero-padded the identifiers and the other didn't. Oy vey.


> IME numeric identifiers are just asking for trouble

That's a failure of the system if it can't be told to not interpret the data. However you're saying the world is as it is; can't argue.


Agree with the problem reported here - the fact that Excel will drop leading zeroes and will turn phone numbers into Exponential format causes many live issues where customers use Excel to work on 'CSV' files :(


The issue I see is you have people who don't know the difference between a string data type and a number. I keep seeing "Listen to the scientists" and "Listen to the doctors". There are many facets to solving problems. Maybe many disciplines may be better at solving it. Had you/"healthcare" contacted folks with software expertise rather than being your own experts, this would have been a non-problem.


Also long numeric ids are rounded by Excel.

And fucking with dates.


Does it strip leading zeroes if you format the numbers as strings with quotes like this?

blah,blah,"00001553",blah


It is sometimes possible to write code to alter your CSVs to retain the strings.

The first thing would be to write them as: entry1,"0123456789",entry2 rather than entry1,0123456789,entry2. This has worked for me in some instances in Excel where I have to escape certain things inside a string, but I would not be surprised if Excel still messes this up. For example, giving the triangle exclamation mark box and then helpfully suggesting to convert to number.

If you want to go further, you can write a routine that alters the CSV, such as entry1,hospitalString(0123456789),entry2. Sure, there are problems with this too, but Excel can break a lot of things and the above examples I do use in practice (in the first example I put the double quotes to escape single quotes in foreign-language unicode).

Another thing Excel can do is break your dates, by switching months (usually only for dates < 13th of the month, but often a partial conversion in your data for < 13th and >= 13th) or convert dates to integers.
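Roughly what that looks like when generating the file (the wrapper name is just my example from above; note that quoting alone doesn't stop Excel's plain File->Open from converting anyway):

    import csv

    rows = [["entry1", "0123456789", "entry2"]]

    # Quote every field on the way out; most consumers then keep the text as-is.
    with open("out.csv", "w", newline="") as f:
        csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)

    # The more drastic wrapper trick, to be stripped again on re-import.
    def wrap_id(value: str) -> str:
        return f"hospitalString({value})"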


Working internationally with CSV holding dates is a recipe for disasters.


Couldn't you validate the data on upload or if it's such a common problem just auto-fix it by left padding w/ zeros?


There are data sources that have a clear unambiguous specification of what information is in each field, and exactly how they are formatted, such that you could auto-parse the column. And even have no data entry errors, so that you can rely on that parsing and have clean error free imported data.

There are many more that don't have a clear spec, or even if they do, have every possible variation of data corruption / keying errors / user misunderstanding / total 'don't give a shit, I'll enter it how I want, that's what the computer is supposed to handle' problems in that source data, that make parsing an absolute nightmare.

I've had to manually review a list of 10k+ data readings monthly, from an automated recorder, because the guy who was supposed to copy the collected data files as is and upload them, instead opened each one and fixed what he (badly mistakenly) thought had been recorded wrong. Different changes in a dozen different ways based on no particular logic beyond "that doesn't look right". And un-fireable, of course.


auto-fix by padding assumes that the only way to get an ID that's too short is by Excel removing the leading zeros - presumably it could also be human error when entering etc. (in theory I suppose that would be validated in whatever form the number is entered into in the app that generates the CSV, but hard to say for sure). Or some other issue in the data processing pipeline erroneously truncating the ID number somehow.

Validating the data would at least prevent getting invalid data into database (and presumably this is happening already), but it doesn't actually "fix" the problem, you still then need the original provider of the data to fix what's missing.


These aren't fixed-width fields with a known length. The length of the field is unique to the person sending the file and they don't tell us how long it's supposed to be.


The parent post says: "Excel interprets a number and then formats it as a number, but in healthcare, 10 digit numbers are really strings." - so I was under impression that the length for that particular column is fixed - probably just misunderstood the sentence.

I wonder how human operators figure out if the value is correct, or whether Excel messed it up, or the input was invalid in the first place? If it's even possible to do it reliably then probably there is some set of patterns and methods that could be turned into an algorithm... just thinking out loud here, but it seems like an interesting problem to tackle...


Validating on upload is sensible and should be done anyway. However, left-padding is not always going to work; it depends on the data. Common problems I see are with phone numbers. UK mobiles begin 07nnn. Excel kindly drops the leading zero and then really helpfully treats the long number as an exponential-format floating point number, which cannot be turned back into the original number. In our systems, we require country codes in front of the numbers, so UK mobiles should begin 447nnn. That doesn't lose the leading zero but still gets the E-format mess when opened by Excel. Customers often miss out the 44 and use the national 07 format. If that leading zero is missing, it can be quite difficult for a system to reliably determine whether that number, beginning with a 7, is an international number for a destination in Russia, or a user error that should have been 447, from an 07 data value.
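The best you can do is validate conservatively and push ambiguous values back for manual review. A simplified sketch (the length and prefix rules here are deliberately rough, not a full numbering plan):

    def normalise_uk_mobile(raw: str) -> str:
        digits = "".join(ch for ch in raw if ch.isdigit())
        if digits.startswith("447") and len(digits) == 12:
            return "+" + digits              # already international format
        if digits.startswith("07") and len(digits) == 11:
            return "+44" + digits[1:]        # national format, drop the 0
        if digits.startswith("7"):
            # Could be 07... with the zero stripped, or a foreign number
            # missing its '+', so refuse to guess.
            raise ValueError(f"ambiguous number, needs review: {raw!r}")
        raise ValueError(f"unrecognised number: {raw!r}")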


One wonders why Excel doesn't store data as strings (or equivalent lossless format) and process it as numbers like SQLite does for instance.


Because Excel is a spreadsheet, for calculating with numbers.

Then there's additional code for dealing with all the inconvenient ways people format things, or want to add text labels, or do things closer to numerical/financial analysis, or all the other extras wrapped around the core "put numbers in boxes and do math".

That misunderstanding is at the core of Excel misuse.


Down-voted as it is factually incorrect. See comment by reportgunner below.


> I have no idea who downvoted my post

That would be the Excel devs working at Microsoft. They read HN. I can feel it.


Excel devs, do you hear me? Repent for your sins! Go back to Microsoft and fix those problems. The cost of Excel mishandling numbers and dates must be many orders of magnitude higher than your lifetime salary.

Thank you.


They can't. All those spreadsheets that depend on that behavior are out there. It makes spelling creat() with an e look easy.


Just format the identifier cells as text. I've also had this problem, this is how I solved it.


That's fine for an individual working on a specific set of data in a specific sheet, but this isn't just a problem for one individual or a small team that can solve it and move on. It's a systemic problem throughout many vast organisations, with continuous influxes of new personnel and constantly changing requirements. When you get an XLS sheet sent over to you from another team that already made this mistake, it's too late and this happens all the time.


> Just format the identifier cells as text

CSV is text. If you mean in Excel, if you opened it in Excel (rather than importing and choosing non-default options), you've already lost the data so formatting doesn't help you.


Yes, I mean Excel. We have CSV-to-XLS import scripts/forms that format identifier cells as text. The data format is standardised. Using templates to do the imports was the dumb part. Microsoft has a Power BI tool if one doesn't want to write or use scripts. Use that. I assume a government agency has the resources to pay for it and for data scientists.

https://powerbi.microsoft.com/en-us/

Thanks for bothering to respond instead of downvoting.


> I assume a government agency has the resporces to pay for it and for data scientists

First, the upthread commenter said "healthcare" not "government agency".

Second, as someone who has worked in public sector healthcare: HA HA HA!

I mean, sure, we have the resources to pay for data scientists (of which we have quite a few) and could probably afford to develop custom scripts for any CSV subformat that we decided we needed one for (though if it's a regular workflow, we're probably acquiring it and importing it into a database without nontechnical staff even touching it, and providing a reporting solution and/or native Excel exports for people who need it in Excel).

The problem is that when people who aren't technical staff or data scientists encounter and try to use CSVs (often without realizing that's what they are) and produce problems, it's usually well before the kind of analysis which would go into that. If it's a regular workflow that's been analyzed and planned for, we probably have either built specialized tools or at least the relevant unit has desk procedures. But the aggregate of the stuff outside of regularized workflows is...large.


Nearly all end-users open a CSV like this:

1. They see a file (they have file extensions turned off, which is the default, so they probably don't even know what a CSV is)

2. They double click it

Excel now corrupted the data. That is the problem. Good luck teaching all end-users how to use Excel properly.


> They have file extensions turned off, which is the default, so they probably don't even know what a CSV is

And if, also by default, Excel is set up with an association with CSVs, the CSV file will, in addition to not having an extension to identify it, have an icon which identifies it with Excel.


> I assume a government agency has the resources to pay for it and for data scientists.

Bold strategy there, let's see how that plays out.

Having been in and around military / DoD usages for a long time, I can tell you it's always an uphill battle to get processes to work well, instead of defaulting to whatever the original spec happened to get included as a result of some incompetent who wasn't even aware of good practice.


So what's the big deal? Just read back the numbers; if it's less than 10 digits, prepend zeros. Am I missing something? (Probably, which means you're leaving something out in the description of the issue.)


The 10 digits was just an example. There are like 100 different ID types with different formats/lengths in healthcare.


How on earth is this the top rated comment? I would downvote it if I could. As other people have noted, you can import CSVs values as strings instead of numbers.

Furthermore, losing preceding zeroes in number-typed values is not unique to excel; it is a common feature in all typed programming languages.


> you can import CSVs values as strings instead of numbers

Of course, but the problem isn't that the person who posted the comment doesn't know this - it's that many users of their systems don't know it. Most people are just going to accept whatever defaults Excel suggests and not know any better, causing problems down the line.


No, that's not correct. The parent poster specifically made this false claim: "that's Excel's default behavior and you CANNOT TURN IT OFF!" The false claim is the part that the poster wrote in all caps. He said that this behavior can not be turned off, when in fact it can be turned off.


It can't be turned off, if you're opening a CSV in Excel.

It can be avoided, if you go through the Data | Import tools. The complaint is that few Excel users know that the import engine is available, or use it, instead of just opening the file and getting all the default interpretations. Which can't be avoided in the usual Open code path.


How? Where is the setting to turn it off when initially loading a file?


>How? Where is the setting to turn it off when initially loading a file?

Confusion is happening because there are 2 different ideas of how Excel uses csv files:

- you saying "can't turn this off": File Explorer double-clicking a csv, or MS Excel "File->Open" on a csv.

- others saying "you can preserve leading zeros": click on the Excel 2019 Data tab and import via the "From Text/CSV" button on the ribbon menu; a dialog pops up that provides the option "Do not detect data types" (earlier versions of Excel have different verbiage for interpreting numbers as text)


Quote your strings. The problem is more with how you generate CSV (which is easy to break in the first place) than Excel’s fault. It’s like writing your identifiers as numbers in JSON instead of string then complain the leading zeros are removed.


Quoting doesn't work. Try opening this CSV with Excel:

    a,b
    "01",01
Excel interprets both as the same number: 1.
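For comparison, a plain CSV reader just hands back text and leaves interpretation to the caller; Python's csv module, for instance:

    import csv, io

    data = 'a,b\n"01",01\n'
    print(list(csv.reader(io.StringIO(data))))
    # [['a', 'b'], ['01', '01']]  -- both fields come back as the text "01"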


The correct way to generate a CSV cell with a leading 0 is

    ="01"
You can verify this with

    01,"01",="01"


Perhaps by "correct way" you meant "dodgy hack to make Excel happy and risk breaking more sensible implementations"?

Excel may predate the RFC but AFAIK MS didn't invent or coin the term CSV, so you can't just say whatever Excel does is correct. The RFC is loose because of nonsense like this, it doesn't mean it was ever a good idea.


How in the world is Excel supposed to know which fields you want to be numbers and which to be strings? CSV doesn't have that info built in unless you surround the number with quotes and you select the right process. Excel isn't just a CSV importer, it reads all sorts of files, and it needs some help if you expect it to work. What ever happened to process and a sense of responsibility and craft in your work?


Choosing based on whether it's in quotes seems like a better solution than using an equals sign. Not everything follows that convention admittedly, but not everything understands the '=' either. Or it could just treat everything as text until a user tells it otherwise.

But none of that is really the point. Because CSV files aren't just for importing into Excel. One of their main benefits is their portability. In other situations column types might be specified out of band, but even if not, putting equals signs before values is unconventional, so more likely to hurt than help. And in the cases it might help, i.e. when you only care about loading into Excel, then you have options other than CSV, rather than contorting CSV files for Excel's sake.

> What ever happened to process and a sense of responsibility and craft in your work?

I actually have no idea what you are on about. I'm talking about the "responsibility and craft" of not producing screwed up CSV files. Why do some people find that so offensive? Yes, it is not inconceivable that there could be some situation working with legacy systems where putting `="..."` in CSVs is, unfortunately, your best option. Sometimes you do have to put in a hack to get something done. But don't go around telling people (or yourself) that it is "the correct way".


[flagged]


I sure don't want to have to deal with people putting `="01"` in CSV files.


We have multiple legacy systems where I work, communicating via batch csv files, some of which are authored & curated by staff in Excel. I can confirm that doing this would be a very bad thing indeed and make our systems grind to a halt.


For CSV that isn’t solely targeting Excel, hacks around the way Excel works are useless.


The reason you use CSV is that you want to use it with software that ISN'T Excel. Otherwise you'd just use .xlsx. No other software uses this convention, and it's not correct CSV.


I guess the HN audience has voted. The result is: hacking around Excel stupidities is itself a stupid idea.

If you want Excel to understand your output, perhaps use a library which can write xlsx files.


“Correct” is a strong word. It isn’t defined in https://tools.ietf.org/html/rfc4180 so Excel should not try to be smart and add extensions only they support.


Excel predates RFC4180 by nearly 20 years (RFC4180 is October 2005, Excel 1.0 was September 1985) and this behavior was already cemented when the RFC was written.

As for the actual RFC, it's worth taking a read. Any sort of value interpretation is left up to the implementation, to the extent that Excel's behavior in interpreting formulae is 100% in compliance with the spec.


CSV predates Excel, and other CSV implementations don't have this behavior.


What spec?

Anyway the RFC doesn't mandate any value interpretation IIRC.


If CSV were being used just to exchange data with Excel, we probably wouldn't be using CSV. Many systems neither need nor know that ="01" should be treated as the string "01".

If Excel were the only intended consumer, .xlsx would be a preferable file format. At least it's mostly unambiguous.


I ran into this last week with a UK bank. I was offered a CSV file. What I got was a CSV file with excel formulae in it.

I actually wanted a CSV file – preferably without having to resort to sed to strip out excel formulae.


And now the csv parser (or downstream process) has to guess whether to interpret that as the raw string or as the eval'd value.


If you want to write a bespoke CSV generator for an application where you know for sure that the file is only ever going to go straight to an Excel instance, sure.

For all the other uses in the world, that's a breaking change.


I have a list of companies I'd like you to consult for. Coincidentally, they're companies I'd like to work for, and I've quite enjoyed building proper database solutions which replace incredibly hacky/terrible Excel (or Excel adjacent) solutions.


Not even Excel uses that syntax when exporting to CSV (at least by default).


Are you assuming creation of the CSV? 'Cause most Excel users won't be aware of why they should store in CSV instead of Excel's proprietary formats.


All previous major spreadsheet screwups were costing money (the largest being $6 billion in losses due to an Excel formula mishap), but this one is playing with lives :(

5 - London Olympics oversold a swimming event by 10,000 tickets.

4 - Banking powerhouse Barclays accidentally bought 179 more contracts than they intended in their purchase of Lehman Brothers assets in 2008. Someone hid cells containing the unwanted contracts instead of deleting them.

3 - Outsourcing specialists Mouchel had to endure a £4.3 million profits write-down due to a spreadsheet error in a pension fund deficit caused by an outside firm of actuaries.

2 - Canadian power generator TransAlta suffered losses of $24 million as the result of a simple clerical error which meant they bought US contracts at higher prices than they should have.

And the biggest one:

1 - Basic Excel flaws and incorrect testing led to JP Morgan Chase losing more than $6 billion in their London Whale disaster.

https://floatapp.com/us/blog/5-greatest-spreadsheet-errors-o...


They are costing money. No doubt.

At the same time, the business world runs on Excel. How much money is Excel making?

I've done my share of cursing at Excel at various jobs. At the same time, I am grateful for the quick and easy way it allows me and many others to manipulate data. It's unfair to just cite the costs of using Excel without acknowledging the benefits it brings.


The real question should be whether it would be cheaper to replace it with a proper solution (probably a custom-built application written by a software engineer).

Excel's ease of use is its downfall. It is the world's most popular database, despite not actually being a database. I have wasted countless hours dealing with Excel where something else should have been used. I built a database for a friend recently; I think 75% of the work was cleaning the existing data from the Excel to get it into the database.


People start using excel when the requirements are unknown, they use excel to understand the requirements while still being functional. To build a "proper solution", you frontload all the requirement discovery to build a system which only you or another developer will be able to update/change.

We blame excel, but excel is really just being used for prototyping and nobody takes a decision at a certain point to move on from that prototype.


I would say that people use Excel because they don't understand how to set up a relational database. It's usually non-technical users that set up Excel, then at a later stage I get asked to create a proper database from it.


>I would say that people use Excel because they don't understand how to set up a relational database. It's usually non-technical users that set up Excel,

I disagree that lack of db knowledge is the primary reason. I'm a programmer and I usually use MS Excel because it's easier than relational databases. I prefer Excel even though my skillset includes:

+ Oracle DBA certification and working as a real db administrator for 2 years

+ MySQL and MS SQL Server programming with raw "INSERT/UPDATE/DELETE" or with ORMs

+ SQLite and programming with its API in C/C++/C#

+ MS Access databases and writing VB for enterprises

The problem is none of the above databases (except for MSAccess) come with a GUI datagridview for easy inputting data, sorting columns, coloring cells, printing reports, etc.

Yes, there are some GUI tools such as SQLyog, Jetbrains DataGrip, Navicat, etc... but none of those have the flexibility and power of Excel.

Yes, a GUI frontend to interact with backend databases can be built and to that point, I also have in my skillset: Qt with C++ and Windows Forms with C#.

But my GUI programming skills also don't matter because for most data analysis tasks, I just use Excel if it's less than a million rows. Databases have a higher level of friction and all of my advanced skills don't really change that. Starting MS Excel with a blank worksheet and start typing immediately into cell A1 is always faster than spinning up a db instance and entering SQL "CREATE TABLE xyz (...);" commands.

Of course, if it's a mission-critical enterprise program, I'll recommend and code a "real" app with a relational database. However, the threshold for that has to be really high. This is why no corporate IT department can develop "real database apps" as fast as Excel users can create adhoc spreadsheets. (My previous comment about that phenomenon: https://news.ycombinator.com/item?id=15756400)


I find people often use Excel (or Access if they're brave) when they have a known problem and can visualise the solution in something they know, like Excel. The crucial factor is implementation time. They can open an Excel and start work immediately, or wait weeks or months going through a lengthy and complex IT procurement process that will take from their budget. So in most cases starting with Excel is a no-brainer. Same reason why a lot of stealth IT gets implemented.

It's often only after years of a business using what has become sacred & business critical Excels, that somebody suggests formalizing it into software. In a business with an IT function, or a consultancy looking for business, it should always be somebody's job to find these Excels and replace them with something more robust.


> Same reason why a lot of stealth IT gets implemented.

Honestly, I miss the days of writing VBA macros which save hours of work a week and being sneered at by the 'official IT'.

I worked in a team in a large commercial bank handling reconciliations with various funds, some of which had to be contacted by phone to confirm the current holdings. We had a system which would import our current positions and take imports in various formats from funds. Somewhere around 40% of the differences were due to trades which had been executed over the reconciliation date. I wrote a VBA script which pulled in all of the differences, identified trades which were open over the period, and automatically closed the discrepancy with a reference to the trade IDs.

Another time I wrote a VBA script which would take a case ID and look it up in a diary system (at the time the only way I found to do this was to use the Win32 APIs and manually parse the fields in the HTML from the system), it would then enter this at the top of a spreadsheet which had to be completed. People liked it so much I had to rewrite it so that it would work on a list of case IDs and automatically print out the checklist.

Much more fun than figuring out why Kubernetes is doing something weird for the 3rd time this week.


The problem with a database is that business users struggle to build them, they take longer, and sometimes can’t hold logic in the same way.

Plus in a database you can’t do the same sort of real-time analysis and also pass the document around for other non-technical folk to add to and modify.

In the real world in big companies, people often don’t want to talk to IT because they over-spec and quote what are perceived to be giant sums of money for something that can be created in an hour in excel.


To add to this point, the latest article says that they started building a replacement to the spreadsheet two months ago, so I think that reinforces the speed of a spreadsheet implementation vs anything else in the public sector!


You might not have realized this but they should have started work in January.


You might not have realized this, but a global pandemic was starting in January and they were probably collectively shitting bricks at that point trying to shoe-string together something to help capture reporting, while probably being in a change freeze, and also probably having to do a mass-migration to at home work with the inevitable struggles of that.


>It is the worlds most popular database, despite not actually being a database.

"A database is an organized collection of data, generally stored and accessed electronically from a computer system."[1]

Excel is an organized collection of data, stored and accessed electronically from a computer system. So I would call it a database.

[1] https://en.wikipedia.org/wiki/Database


Excel is not a database, it is a spreadsheet. They are different beasts.

This link will explain the difference between two of them. https://365datascience.com/explainer-video/database-vs-sprea...


What I was getting at is that it fulfills the (very) basic definition of a database, even though it's very bad for this purpose.


No, because the proper solution written by a software engineer will contain more bugs.


it will only be cheaper after the company loses money... no business person ever looked at something and said "it works, but it might not, some day... better spend $XXX,XXX on an engineer to fix the problem".


You have just identified the new market of Spreadsheet Insurance. I am thinking it could be quite lucrative if you get the right salespeople in on the ground floor. I'm willing to bet no funding source is unfamiliar with the demand for this product.


and one more:

"Scientists rename human genes to stop Microsoft Excel from misreading them as dates. Sometimes it’s easier to rewrite genetics than update Excel" https://www.theverge.com/2020/8/6/21355674/human-genes-renam...


Also the post-2008 financial crisis austerity programs based on Excel analysis.

https://mathbabe.org/2013/04/17/global-move-to-austerity-bas...


Austerity wasn't based on that study, and it didn't go away once that error was found.


That study was the intellectual justification for the policy.

To be fair, I'm not sure if any of the proponents actually believed (or had read) the study, but it was definitely wheeled out in debates against Keynesians.


Yeah, why there is no "scientific" mode to just turn off all the "smartness" has baffled me for years and will probably continue to baffle me for years to come.

Or even an algorithm that can detect that you are using gene names from the cells around MARCH1 and SEPT7.


I think that's a good way forward, Excel is a fantastically powerful, ubiquitous tool and isn't going away any time soon. If you're going to standardise on a technology, it makes sense to design your systems to work well with that technology. That includes making identifiers that you need to be treated like strings look like strings to your tools. A little bit of thought in advance and maybe some small compromises early on can save massive problems down the line.


A master crafter always adjusts their industry to match the currently popular tools. That's why we still use reins to steer automobiles and aircraft.


Well, not exactly like non-Excel systems are bulletproof either.

Entire Japanese stock market went down last week, and it's not like there was a flood of people on HN bemoaning that. At least with Reinhart and Rogoff et al you have a responsible party.

As opposed to 'nameless machine failed, and nameless backup machine also failed, and now it's in JIRA so don't worry about it'.

https://www.nytimes.com/2020/09/30/business/tokyo-stock-mark...

The glitch stemmed from a problem in the hardware that powers the exchange, said the Japan Exchange Group, the exchange’s operator, during a news conference. The system failed to switch to a backup in response to the problem, a representative said.


I think similar to “code smells” in software engineering, using excel for critical data pipeline work can be considered a “data smell”. It’s not bad in and of itself, but it’s a signal that something may have gone seriously wrong with the data engineering process.


On the other hand, it's more like "nobody ever got fired for using Excel".


I think in this case it’s more like “nobody who would know better than to use excel had the authority to make the decision not to use excel”. Given that this was a government project, my guess is that there are very few technical people involved at the decision making level.


I'm willing to make an exception in this case.


Unfortunately excel is the lingua franca of data exchange in SME


I am sure there is a solution to this problem that could make someone a fair bit of money, but I am not quite sure what it is yet.



That example is overblown as the system is supposed to be restarted more frequently than that for maintenance. A better example is cars have been frequently rebooting various internal computers while you're driving for decades; that seems bad, but it's similarly irrelevant.


And how about this one: an Excel error was the cause for much unnecessary austerity after the 2008 financial crisis, costing potentially tens (hundreds?) of billions in lost output.

https://www.nytimes.com/2013/04/19/opinion/krugman-the-excel...


Is it really that hard to spend 5 minutes setting up a decent database? Use sqlite if you really want something simple.

I guess you still have to understand locking (especially on distributed filesystems.) I've certainly seen people mess that up with spreadsheets.


Sure, I'll just tell my wife to learn SQL and some devops while she's at it. After that I'll coach her on how to get buy-in from management on this new system while she neglects her primary duties.

Throwing up a database and integrating it into a workflow/system isn't something anyone can just get up and do. I have to imagine you know that.


Yep! And then when your wife gets that promotion she wanted, she still has to spend time maintaining and making changes to the old database system because nobody else understands it or has the skills to change it.


Was it really a manual process? I read it as some system importing info automatically.

And if it was manual I am surprised that Excel did not complain about adding more than 65000 rows (or saving more than 65000 rows as XLS). If a user gets a warning about possible data loss they should investigate more.


The problem is people can't do this in an office. Nobody has admin rights on their computers. Everybody knows how to use Excel. We all know that if you have to make a ticket to your office IT for such things, you mostly have to wait a long, long time and it never, never works the first time. I guess that is one of the main reasons why the world runs on Excel (and MS Access). It just works, it's just there, no dealing with IT.


Sqlite is a bit of a special case, in my eyes. It's so ubiquitous that it's probably already installed somewhere on the average machine, just not as a top-level application.

Of course, there's a bit of a gap between "it's there on the machine" and "we can rely on it for useful work", but baby steps...


No it isn't, but in any big org the plebs are only allowed to use Office and anything else is IT captured.


Sure, Excel is instantly available. But also you get charting and formulas. Especially with formulas they've done something really right. You can transform a column in a few seconds while keeping the old one, tweaking the new one and having a graphical preview through the chart feature at the same time. Of course every Spreadsheet software can do this, but to me that's the ultimate killer feature


It takes a course to understand how to use a relational database.

MS Access used to come as standard with Office and is actually the perfect solution to many of the problems that businesses use Excel for. It's very rare that people actually used Access, as Excel was far more intuitive and good enough for many projects, especially in the early stages.


I can attest to this. I moved jobs and my new workplace used to manage everything in Excel. Some things still are. Realising Excel could not scale, and with no support from IT for a proper solution, the "tech" guy (good with computers but not a developer) was tasked with creating a "database". What we now have is essentially 3 Excel spreadsheet tables in Access that you can run SQL on and input data into using Access forms. No normalisation, no, well, anything really. We're now coming to the point where Access is creaking at the seams. I'm confident that with properly designed tables and queries we wouldn't be, and I'm no expert either, but it will have to fail before there's a concerted effort to redesign or, god forbid, IT do their jobs.


Things are never 5 minutes. And it is just as easy to make mistakes with sql as it is with excel.


I have definitely just spent 5 minutes importing csv tables into sqlite so I can run a query against them.
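Something like this really is only a few minutes' work, and TEXT columns mean nothing gets "helpfully" converted (table, file and column names here are made up):

    import csv, sqlite3

    conn = sqlite3.connect("scratch.db")
    conn.execute("CREATE TABLE IF NOT EXISTS patients (patient_id TEXT, name TEXT)")

    with open("patients.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        conn.executemany("INSERT INTO patients VALUES (?, ?)", reader)
    conn.commit()

    # Leading zeros survive the round trip.
    for row in conn.execute("SELECT * FROM patients WHERE patient_id LIKE '0%'"):
        print(row)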


Do note that anything other than Excel, UI/UX-wise, has failed.

When people want a "database" they fire up Excel, start punching in numbers, solar calculators next to the keyboard, and use their eyeballs to search for strings.


Every time I see a problem like this I can't stop thinking that computer education is terrible everywhere.

We are able to teach almost everyone how to use complex software like Word and Excel. Why can't we teach people how to use a terminal, SQLite, or how to create a very simple Python script?


Well maybe it has something to do with the Windows monopoly and that they only want you to use their proprietary tools. If you learn too much then you will want to be free!!


Whenever you try to do something properly, some person in your organization condescendingly says "perfect is the enemy of the good".

And because Excel is "good" for a vast number of use-cases then people use it for everything.


Hey come now... we both need each other /s


As a data engineer, I think blaming Excel for this is the wrong framing. The real problem here is with the entire process, and specifically, the lack of testing.

Excel is great for many use cases, especially if you need people to enter data somewhere. Its UI is unparalleled in terms of quickly giving something to users that they can understand, mess around with, and verify. It's a very common use case to then need to pull in data from a bunch of Excel files, into one main repository of data (a data warehouse). That can be stored in an Excel file, although more commonly would be stored in a database.

But there are always problems with this process! There can be missing data, there can be weird data conversions because the program/language you're using to parse the data and get it into the database reads things differently than how Excel intended, there can be weird database issues that causes data loss, etc.

Complex systems always, always have bugs.

It is the job of a data engineering team to, among other things, test the systems thoroughly, and put in place systems to test against data loss, etc. It is pretty common, for example, to count the data going into a pipeline, and the data that you end up with, and make sure nothing was lost on the way.

Anyone can make a mistake. Any team, no matter how good, especially when they're rushed, can use shortcuts, use the wrong technologies because it's expedient, or simply have bugs. It is the job of the team, and of project management in general, to do all the manual and automatic testing necessary to make sure that mistakes are caught.

The real lesson isn't "Excel is bad". It's not. It's an amazing tool. The real lesson is "Data Engineering is hard, requires a lot of resources", and "all systems have bugs - testing is mandatory for any critical system".


Disagree on the "users can verify" part. Excel is very hard to verify, because of corner cases like date conversion. Formulas and macros, maybe spread out over multiple sheets in hidden-away white-on-white cells, make things very opaque and impossible to verify for most users. Once a long-gone colleague has passed on some Excel file, it is treated as arcane and unchangeable wisdom from the past, never to be really understood or changed. Just maybe used as a copy-paste value import into the next generation.

Also, recognized good practices for software development (which any Excel sheet that does more than just a sum() effectively is) like commenting and versioning are quite hard to impossible in Excel. So even if you endeavour to do it "right" but with Excel, it is just the wrong tool for the job.


I vaguely remember a discussion about this and the problem boils down to Excel sheets being impossible to unit-test.

There might be, for example, a formula going through 20k lines in column F. But the one on line 1138 has a typo and the formula references an incorrect cell. No human will ever go through all the lines to check. Excel itself doesn't check stuff like that. And there are no tools for it either.


Excel actually checks that and gives you an 'Inconsistent formula' warning with a little green corner similar to when you comment a cell.

All you need to do is scroll through the worksheet you have just made.


All you need to do is scroll through 20,000 lines looking at each one


Excel has the ability to highlight cells that have invalid data with a red circle to make them pop out in the table. Just look for something red.


20,000 is not a lot of lines though, I used to scroll 300k sometimes.

Anything below 500k rows is a 'small table' still.


The point is that's a manual, tedious, process that will be performed with error or completely skipped.

We have computers to do that sort of work.


It is a manual and tedious process if you make it manual and tedious for yourself.

After all, it's people that make computers do this work in the first place.


You can easily create tests for cases like that in excel. For example, if you know that all cells in a column must conform to specific conditions, you can create a new cell with a formula that validates the other cells. I have a bookkeeping excel where the "validation cell" becomes red when it detects errors.


Excel probably excels at promoting file formats like xlsx and such. I would not be surprised if many mistakes, including the choice of xlsx when saving instead of storing data in CSV with the simplest CSV syntax, came from that. So in that regard, I think Excel is sometimes just bad.


I don't think data engineering is hard or needs many resources. It does need people who care and who will think about things.

Anyone who comes at a problem with the mindset 'this is going to be hard' probably lacks experience and will throw big-data frameworks at it, really screwing things up. The most significant, and valuable, resource needed is thought first, and knowledge+experience second.

All IMO anyway.


I always think that database design is pretty simple, but having worked on numerous databases designed by others, some people get it very badly wrong. "Wrong" maybe isn't the correct word, because you can use the database for what it is needed for, but it just requires extra joins and hacks at the application level to get around the poor design.


Depends on context, you're talking in general, but in particular as the GP post said, there can be pressure from management to rush things, different data formats, bad or missing data and so many other pitfalls. It's hard when it's hard.

You can write a simple test to check a function is working correctly, but how do you make sure your 100,000 item database doesn't have corrupted or missing data caused by the latest pipeline update, especially if the corrupted parts are rare?


Collecting and distributing the results of life-or-death diagnostic tests is specifically one case where you don't want end users to "mess around with" the data though... (making sure the right test results get associated with the right record is surprisingly difficult and rigid software is part of that).


Excel is not bad, the users and institutions which use it for any use case are to blame.

The only correct usages of Excel (and Google Sheets) are for user input to other processes, and visualization of data (which should definitely be stored in raw format in some database). And always assuming frequent backups of all sheets and, of course, manageable datasets. Anything else that includes external processes/scripts which append/overwrite data in sheets is terrible practice and very error-prone.


The main problem I have seen is that data in Excel is essentially not validated.

Sometimes you get a number, sometimes you get a string containing a number. Sometimes the string contains whitespace that you don't want. And then there are dates, which can have similar problems multiplied ten times.


While I am not in the habit of defending Microsoft, I think it should be pointed out that it wasn't using "Excel" that was the biggest problem, but using "13-years obsolete format for Excel". Not that there was any reason to be using Excel for this, but it's a bit unfair to blame Microsoft for something they never claimed XLS could do, and provided a much better capacity in XLSX 13 years ago.

Again, it's hard to cry too many tears for Microsoft, but it does seem a bit off-target to blame "Excel" for this...


The issue with Excel is that its failure mode is suboptimal. Programming languages will generally crash at runtime when an unrecoverable error occurs (index out of bounds, memory allocation failed, null pointer, etc). But Excel just throws away your data?


If it fails, it’s often silent. If not silent, hidden. If not hidden, misleading, if not misleading, opaque. If not opaque, trivial.

Excel is a wonderfully powerful tool that’s very bad at handling errors clearly.


It's the PHP of business software. Powers half the market, but boy do mistakes cost you.


Indeed, you have to manually inspect results to verify correctness. If you're lucky, maybe your error will result in one of those weird error values being shown instead, but way too often that is not the case.

Not to mention that complex formulae are still usually expressed as a bunch of gobbledygook in the cell value textbox, which is about as easy to parse as minified Javascript. And that's for technical users like ourselves.


I had Excel error out when I tried to open 10 million rows.


Excel tops out at a little over a million rows without specific handling of larger files. I'm not excusing it, just a heads up to all the junior analysts out there who haven't dealt with the problem (not implying you're a junior analyst, just wanted to point something out to those who aren't familiar with the issue).


No worries, Excel is nice for small/medium sized data and has the UI/Visualization which is nice.

The story I mentioned was because I wanted to look at the data before I started parsing it. I had full expectations to use either sqlite or Pandas.


> Programming languages will generally crash at runtime when an unrecoverable error occurs (index out of bounds, memory allocation failed, null pointer, etc).

JS would like to have a word with you.


In this case, Excel throws an error and says that not all rows have been loaded.

It seems that warning has just been ignored by the user.


Sure, and I think at least the headline ("ill-thought-out use of Excel") avoids blaming MS or Excel itself.

Ultimately tools are built for particular things, and if you choose to use a tool for something it's not built for, and it breaks catastrophically, that's on you.


No. Excel is definitely to blame. A simple error, "cannot import/export more than 65k rows", would have done the trick. Not silently throwing away rows.


The premise of excel is very much "don't bug me, just do it". The highly praised UI simplifies away corner cases and errors intentionally. Summing over an empty cell? Most probably zero, why throw an error and confuse the user? The simplified UI makes correctness hard to impossible, so excel is just unsuitable for tasks where correctness is desired.


Interpreting an empty cell as 0 is making an (often correct) assumption about what the user wanted.

Throwing away data without warning is almost certainly never what the user wanted.


You get CSV files as input and have to upload them into the database. You don't need Excel at all, neither in XLS nor in XLSX format.

My bet is the biggest problem here is subcontracting this work to the lowest bidder, presumably from some developing country.


> subcontracting this work to the lowest bidder

knowing a little about how things (don't) work in the UK, it's likely subcontracted, but to a company belonging to a mate of the director in charge of the whole thing


I'd take that bet ;-) From the article:

> > And it appears that Public Health England (PHE) was to blame, rather than a third-party contractor.


Over-13-years old Excel already opened delimited flat files natively, and it threw warnings when you opened a .csv file that exceeded the worksheet row limit. Someone in the distant past was just clever enough to hack together the script (with no error handling) without being curious enough to know it wasn't necessary in the first place.


If they had used the newer format, we'd be talking about this in a few weeks time when the larger limit was reached.

This is a problem of bad coding, and using the wrong tool for the job.

A defensive coding practice would have prevented this from going unseen. Using a database to store data would have prevented such arbitrary limits.


Yeah, something like "Missing Covid-19 test data was caused by using spreadsheet instead of database", would be more accurate.


Nobody is blaming Microsoft. From the article:

> And it appears that Public Health England (PHE) was to blame, rather than a third-party contractor.


Blaming PHE seems to have become a regular occurrence this year. I wonder if it's justified or not.


It’s fun to point and laugh, but is this really just the consequence of an unprecedented health emergency demanding a rush job?

As far as bugs go, it doesn’t sound that bad. They didn’t lose data - they just processed it late? And they spotted it within days/weeks, and have a workaround/correction already? And it’s only the reporting that was wrong, not the more important part where they inform people of results?

I’d rather have this system now than be waiting for the requirements analysis to conclude on the perfect system.


There's a test and trace system that is supposed to phone everyone with a positive test result and get details of all their close contacts (people they sat in an office with, restaurants they visited, etc), who are in turn contacted.

A delay in the test results means the contact tracing could not happen, and people will have been going around spreading the virus they caught off the original, known cases.

Also, the whole test and trace system has been in the news a lot recently here for various failings, and things like this will just further knock people's confidence in it.


> A delay in the test results means the contact tracing could not happen, and people will have been going around spreading the virus they caught off the original, known cases.

A recent piece in The Atlantic argues that we got it all wrong, we should test backwards (who infected the current patient) instead of forwards.

This peculiarity is caused by the dispersion factor k, a metric hidden (conflated) by R0, meaning that the same R0 would be handled differently depending on the k factor.

A related suggestion was that we should focus more on super-spreaders as COVID doesn't spread uniformly. That's why the probability of finding a cluster is higher looking back than forward - most people don't get to actually spread the virus.

https://www.theatlantic.com/health/archive/2020/09/k-overloo...


You’re right - but the biggest issue is clearly one of experience. Someone was put into a critical role they weren’t prepared for - a position in which their failures cost lives. They don’t have a redundant process, or error checking, or even someone familiar with spreadsheets.

That’s the scandal: it’s the type of basic error that it screams “they’re not handling this well.” You wouldn’t tolerate your new SWE colleague asking “What’s an array again?”

If publishing the whole set of data was delayed by excel crashing - fine. But silent failure because they ran out of columns? Cmon...


I would say it's error from a rushed (and understandably so) job, I've been involved in a number of Covid related projects at the US county and state level. The teams and departments are having to move very fast with difficult technical problems. Keep in mind, it's not like they were just sitting around with nothing to do pre-covid. Basically, when the pandemic hit, everyone got a second job to go along with their day job.


I think that's too generous. It would indeed have been excusable for perhaps the first 2 weeks. But at this point, they've had an effectively unlimited budget, likely access to engineers working for big consultancies and big tech who would have been quite happy to literally drop everything to help out and 6 MONTHS to build a system that worked. And they are still using an Excel spreadsheet.


Well, I don't see why developing an application for transforming csv files to xls files would go faster than developing an application that uploaded csv files to a database.

I could see that it could be a rush-job problem, but in this case they're not gaining any time.


Good points

Partly I think the Excel bit is news because Excel (and mistakes made with Excel) is easily relatable to people. But bugs always come up and most other bugs are just as stupid. If it had been an off-by-one loop error in some C code somewhere it would be just as dumb but you'd get none of the facepalm memes all over Twitter.


Off by one errors are all alike; every excel error is an error in its own way.


Not when it mixes up US and UK date formats when you import a CSV.

Not when it turns phone numbers into integers and strips off the leading zero.


This is the root of the problem isn't it? The wrong tool for the job.


Tolstoy keeps on giving ...


They had a budget of GBP 36 million. It's pretty bad.


Serco received 45.8 Million in March (not disclosed until April), and I'm pretty sure they've had more since.


What’s your source for that budget? The error was by PHE, not Serco.


You need to back that number up with real facts, as this article is regarding the PHE handling of reporting.


This is what it most likely is. Analogous to starting out on an easy to hack together tech stack, succeeding, and dealing with tech debt when something goes wrong.

I guess it's easier to understand since everyone uses Excel; however, it does end up giving a halo of blame to Excel, as opposed to human processes.


No. There is a difference between a little rush job and an incompetent rush job. I’m only speaking from my own experience, but when put under pressure most competent teams don’t end up throwing away such important data. I’ve been incompetent for most of my career, but given the seriousness of the task, this is inexcusable. I don’t know how things work elsewhere, but checking that the same number of unique records (by some composite key) are ingested as exported isn’t an advanced concept. Know the value of the data you’re handling. Marketing data <> health data. If health data doesn’t line up: you figure out why and solve it. If you don’t have anyone available to do so then the problem is managerial/structural.


That has been the COVID experience in Britain in the last couple of months: the government is spending a ton of money in training new people and creating new processes, which is causing a ton of slowdowns and bugs.

In the grand scheme of things, having problems because of inexperienced people is much better than not having enough people like during April and May.


I wonder if this will be added to the European Spreadsheet Risk Interest Group's (EuSpRiG) horror stories list:

http://www.eusprig.org/horror-stories.htm


Discussed here: https://news.ycombinator.com/item?id=22431500. (Submitted many times, but that's the only real thread, and only from 7 months ago.)


TIL there is something called "spreadsheet risk management."

Wow. Thank you for this gem of human culture.


The site's cert is invalid. Got a mirror?


Certificates are not applicable, the link scheme is HTTP.


It presents a self-signed certificate if using a browser/extension that automatically redirects to HTTPS (which everyone should be).


If the extension redirects to a broken cert not intended for public use, the extension is broken and should not be used. This is not making things safer, this is training users to click through warnings again. Very much "don't".


The extension is not at fault, all it does is rewrite HTTP links to HTTPS (which should be the default IMO, I agree with the parent commenter). The fault is with the site providing a self-signed certificate when accessed over HTTPS.


All it does is assume that a https site serves the same content and audience as the corresponding http site. That is a broken assumption. The consequences of such a broken assumption are very much the fault of the extension.


> That is a broken assumption

Why is that a broken assumption? Can you name a legitimate reason for HTTP and HTTPS sites to serve separate contents and audiences? I would rather not connect over HTTP to _anything_ nowadays.


CMS, serve the content over http and the admin page over https.

And for sites with noncritical static content, HTTPS ranges from superfluous to dangerous. ESNI isn't implemented yet, and IP addresses are still visible. Content sizes and timing are a dead giveaway for the things you are looking at. HTTPS for everything is just a simulation of privacy at best, and misleading and dangerous at worst, because there IS NO PRIVACY in the aforementioned cases.


So, what we know is:

« PHE had set up an automatic process to pull this data together into Excel templates [...] When [the row limit] was reached, further cases were simply left off. »

The terrible thing here is dropping data rather than reporting an error and refusing to run.

It isn't clear what piece of software was behind this "automatic process".

Clearly the responsible humans are to blame.

If the software that dropped data rather than failing has that as its default behaviour, or is easily configured to do that, then I think that software (and its authors) are also to blame.

Is there anything in Excel itself that behaves like that?


> Is there anything in Excel itself that behaves like that?

I'd imagine the default error-handling behaviour of 9/10 Excel macros is to throw away data


Oh, my God. "On Error Resume Next" is one of my earliest memories of learning to program for Windows. It never occurred to me that it might result in deaths.


That's an interesting ethical point in CS: sure, you didn't intend the software you hacked together in an afternoon to be used in anything with life-and-death consequences. But that doesn't mean someone isn't going to come along later and use it, without ever even looking at the source, for something very critical down the line.

I wonder how many bleeding edge master branches of GitHub repos, pulled in blindly by someone cobbling something together to meet a deadline, are running in places they probably shouldn't be.


People have died from using unsuitable construction materials or devices that were never designed to be used that way.

It's not the fault of the original designer if he was clearly targeting a different purpose.



Yes and "on error resume next" is what I usually see in VBA code


Curious, in what context are you all still using VBA?


I, personally, wouldn't use VBA at all, ever. I would only consider Excel and Word for very simple use cases. For example, I refuse to use Word for technical documentation. But Office and Excel are everywhere in large corporates; and simple tactical spreadsheets turn into monsters over time. One former employer had a spreadsheet "application" where triggering a sheet recalc (accidentally) would take out an entire grid of machines. I've never been in any position to stop this stuff from happening.


I had to import data from Excel documents into MS SQL Server. Since this was supposed to be a one-time thing I used the import wizard. Some values were quietly ignored because the cells happened to have different formats than the surrounding cells. They looked exactly the same, so it took me a little while to figure it out. And if I hadn't done some simple sanity checks, those errors would definitely not have been discovered before it was too late.


> It isn't clear what piece of software was behind this "automatic process".

They probably mean a semi-automated process within Excel, where each tab is a day's extract or something similar and they are using external references to other sheets. In any vaguely up-to-date version of Excel the way you would do this is via 'Get & Transform', which does not have any of these limitations (including the 1m record limit that the news article suggests).

The funny thing is that the latest versions of Excel are brilliant at aggregating and analyzing data and are more than suitable for this task if used correctly (i.e. using PowerQuery). It's just that way less than 1% of users are aware of this functionality - I would assume that even most Hacker News readers probably don't know about PowerQuery/PowerPivot, writing M in Excel, setting up data relationships, etc.


Each test result creating several rows of data seems like a problem too. In clean data, every observation is one row. It makes working with the dataset much easier. In this scenario, I would expect one observation to correspond to one test result. The multiple rows are then better off pivoted into columns.
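To illustrate (with entirely made-up column names, since we don't know the real schema), a small pandas sketch of that kind of pivot:

  import pandas as pd

  # hypothetical long format: several rows per test result
  long = pd.DataFrame({
      "test_id": ["T1", "T1", "T2", "T2"],
      "field":   ["result", "lab", "result", "lab"],
      "value":   ["positive", "Leeds", "negative", "Manchester"],
  })

  # one row per test result, one column per field
  wide = long.pivot(index="test_id", columns="field", values="value").reset_index()
  print(wide)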


Looking at it from an outside perspective I would agree, but in practice we don't actually know what this dataset is, so several rows per test may make sense in the context of this stage of the data processing.


It never ceases to amaze me how often we as developers are quick to point out other people's obviously incorrect decisions, only to defend our own Rube Goldberg implementations 30 minutes later by pointing out that critics just don't understand the design constraints.

Or at least how often my peers do that. Obviously all of my systems and code are perfect.


Nothing seems as easy as another engineer’s problem.

This also amazes me, especially the whining about other people's code from developers, truly believing that they would do it better. I've even seen inherited code posted to be ridiculed/shamed/bashed in some Slacks and subreddits.


I'd say it depends on the data.

If the schema is consistent between rows, and it turns out a test result is made up of several rows because the test is composed of several stages, I would leave it as is until reporting time.

If you pivot prematurely, you could end up dropping data because there are new stages that didn't exist when you implemented the pivot.


> If the schema is consistent between rows, and it turns out a test result is made up of several rows because the test is composed of several stages, I would leave it as is until reporting time.

In that scenario, indeed no action seems to be needed, because each row is one observation: every test stage is an observation. So it would seem to make sense.

One could then argue that each patient deserves their own table (observational unit).

But as other commenters pointed out, this is all speculation.


These are data entries from all over the UK, so it's most certainly not clean data.


The article states that it is not known which parts of the country were affected, but I think we can deduce that. Here are the daily cases by specimen date. Dotted line is Saturday's data, solid line is Sunday's. The difference between the two is partly the normal daily update, but around 2/3 of it is correcting the underreporting:

7-day moving average: http://danger.handley.org.uk/misc/rates-uk-recent.png

Raw data: http://danger.handley.org.uk/misc/rates-uk.png

Looks like all regions were affected, but by far the largest corrections are in NW, NE and Yorkshire regions. In particular, NE had looked like cases were declining, but we can now see this was incorrect, and they're still increasing rapidly

Edit: note the most recent 3 days are always incomplete, so any decline shown there is not a real effect.


Nit: presumably whatever the maximum time is before results is the period in which updates to past results should be expected. I think it's about 1 week at present?

Before the recent leap in cases my test took 6 days to come back (negative). My wife's test at the same time came back the next day. At the time they were saying that 72 hours was the expected return time for results. For me it has been a couple of days of steadily worsening coughing, and I gather people take about 3days-1week to show symptoms ordinarily.

So UK results are most likely reflecting infections from 1-2 weeks ago.


Always use checked pre- and post-conditions in a data pipeline.

This simple post-condition would have caught this issue: the sheet after the merge operation must have a number of rows equal to the sum of the number of rows of all merged sheets.

Assuming this is a merge of a standardized input, then another post-condition might be: The number of columns in output shall equal the number of columns in the input. Might want to check header names, and order as well.

Thinking in terms of universal properties, and putting the checks into production, is better than unit-testing.
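A rough sketch of that merge post-condition in Python (the file names are placeholders and the merge step itself is elided):

  import csv

  def count_data_rows(path):
      with open(path, newline="") as f:
          return sum(1 for _ in csv.reader(f)) - 1   # minus the header row

  inputs = ["lab_a.csv", "lab_b.csv", "lab_c.csv"]   # hypothetical input files
  expected = sum(count_data_rows(p) for p in inputs)

  # ... merge step producing merged.csv goes here ...

  actual = count_data_rows("merged.csv")
  if actual != expected:
      raise RuntimeError(f"merge dropped rows: expected {expected}, got {actual}")

Cheap to write, and it turns silent truncation into a loud failure.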


> Always use checked pre- and post-conditions in a data pipeline. [A] simple post-condition would have caught this issue

I'm not sure they teach that in medical school.


Which I feel might be indicative of a larger problem. If you are going to deal with data (and its processing), it only makes sense to teach those skills in school itself.


They spend so much time memorizing random trivia maybe a course in handling data wouldn't be a bad idea.


Give them a break. They don't have the luxury of a highly-structured field, purposefully constructed by humans (engineers and mathematicians on top of that) with only 50 years of legacy.

They have to deal with a field that was haphazardly constructed by nature over the course of 4 billion years.

Imagine having to retro-document a code base with 4 billion years of history, created solely by junior developers fresh out of college..


This quote from the article says this work was being done by a developer (or developers), and if that is the case they should have known better.

The problem is that PHE's own developers picked an old file format to do this - known as XLS.


It's unlikely this was done by a developer in reality, and it's more likely to be a data analyst that was called a developer in the press release.


Even just within my own sheet I always build in some sort of idiot-check, the assumption being that I’m the idiot.

Do a SUM and match it against a COUNTIF on another column, or something. It doesn’t really matter what it is but if the data is important at all, I always scatter little checks throughout.

The case described in this article sounds like someone who just doesn’t know how to use Excel. I mean why in the name of VLOOKUP would you override the default and choose to save as .xls? That was a conscious decision. Anyone worth their salt knows that’s stupid.

Excel is not to blame here.


I’d guess they wrote a script to merge the csv file and export in xls. Xls might have been the default of the library they used in a rush. I think this may be the case with some python libraries that generate excel files as xlsx is a more complicated format.


We did a bit of work for laboratories this year and CSV is not an uncommon exchange format between labs. In general almost all exchange formats are text based, with labs saying they will upgrade to „modern xml formats“ at some point in the future. So seen in that light, a CSV or an Excel file doesn‘t really surprise me.


> exchange formats are text based, with labs saying they will upgrade to „modern xml formats“

Worth noting that XML is also a text format. SGML can even treat CSVs as markup. There's nothing wrong with CSVs/TSVs anyway - it's a concise tabular format using only minimal special coding for record and field separators, as envisioned by ASCII and EDIFACT. The problem seems to be more that there was no error checking in place to capture file write errors, or more generally the use of non-reproducible, manual operating practices, which seems common in data processing.


> There's nothing wrong with CSVs/TSVs

I can see the theoretical purity of this statement, but based on my experience working with CSV files generated by actual non-technical users I have to disagree here.

There are a number of footguns here that are really subtle and the average non-technical user has no hope of spotting them.

Problems that I've seen in the wild, off the top of my head:

* Windows vs. Linux line terminators breaks some CSV libraries.

* Encoding can change depending on what program emitted the CSV file, and auto-detecting encoding is not perfect. For example, Excel for Mac uses Linux encoding by default, IIRC.

* Excel does wacky things when you export a "CSV" in the wrong format; real users use Excel to generate their CSVs, not Python. For example if you import the string "0123456789" in an Excel sheet, it infers "number" and strips the leading "0" when you export. Now your bank account/routing numbers are invalid!

* "What's a TSV?" -- if users use CSV, how do you handle commas in the data? It's nontrivial to train users to do their CSV upload as a TSV.

Etc.

In practice we needed to build a fairly beefy helpdesk article with accumulated wisdom on how to not break your CSV exports, and most users don't read/remember these steps until they experience the trauma first-hand.

I'd say the CSV format is deceptively simple -- it's quite easy to do the right thing as a developer where the source and sink are both code you control, but in the wild it gets messy really quickly.
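On the consuming side, the usual defence is to stop the parser guessing anything. A hedged pandas sketch (the file name and options are illustrative, and it's not a complete fix for every footgun above):

  import pandas as pd

  df = pd.read_csv(
      "upload.csv",            # hypothetical user-supplied file
      dtype=str,               # keep IDs, account numbers and zip codes as text
      encoding="utf-8-sig",    # tolerate the BOM that Excel's UTF-8 CSV export adds
      keep_default_na=False,   # don't silently turn "" or "NA" into NaN
  )
  assert not df.columns.duplicated().any(), "duplicate header columns"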


You have to navigate the same basic issues to convert each incoming CSV to XLS, as PHE did in the article. They could just as well have converted them to a consistent CSV format and stopped there.


Sorry, typo on the encoding point -- I meant to say that I've seen issues with UTF-8 vs. Latin-1, not "Linux encoding".


How to encode data with newlines into CSV files is always a fun morass to wander into.


Almost none [0] of the problems, such as line terminators, encodings, or escape characters, are inherent flaws in the idea of using a text-based file to represent data. They could be fixed by just a little bit of standardization. But unfortunately, I don't think that's possible; we're stuck where we are.

[0] Except type conversion, which is a real problem.


Agreed; I think the crux of the problem is that CSV is not a format that supports metadata. And even if you somehow created such a convention/standard, it's impossible to upgrade all the existing users with 10+ year-old Excel installs; that ship has already sailed.


> There's nothing wrong with CSVs/TSVs anyway

The first CSV file was created in 1983. The first CSV standard was created in 2005[1].

The two decades of CSV surviving as an informal standard means that it takes minutes to make a 95% complete CSV parser and an infinite amount of time to make a 99.99% complete CSV parser.

[1] https://en.wikipedia.org/wiki/Comma-separated_values#History


..this right here. No error checking, inadequate testing, no reconciliation process.

Excel is used extensively in many industries. Any file could be cut off in processing for any number of reasons, off-by-one errors for example.

So the solution is to "fix" the process by using the existing broken process and smaller files....


I feel like text-based is colloquial shorthand for "schemaless" or "dynamically typed".


CSV is not schemaless, and it does not have to be any more dynamically typed than XML either.


It doesn't have to be, but it often is.


Did it mention that it was in the context of csv or an excel file?

I only say that because, as someone who is painfully aware of the limitations and problems of those formats, I'm similarly aware of getting "that web guy" on a project who proclaims "let's put things in a modern XML format!", and lo and behold the process is now an order of magnitude slower and the XML format an order of magnitude larger than the simple delimited tabular format or stream.

I'm also painfully aware of the old systems (and how old health systems are) with fixed sized buffers and processes, so I can see how this would happen in the context of a lot of computing.

Edit: i see later on someone is mentioning that twitter suggests it had to do with excel file size limitations...


There is nothing wrong with CSV or text-based formats. You can even use AWS Athena to query CSV files stored in S3. It's a good format for data import/export that many systems can natively understand or have tools to parse, provided it's known how to interpret the data.

It was a data pipeline issue. Software has little to do with it. If they received data in json and tried to interpret it as CSV, the same could have happened. I believe Excel even warns when you open file that has too many rows.


I agree.

Tools exist, for analysts and engineers (MS Access comes to mind for the analyst, python for the engineer), that would rectify the problem. And I think it's a fair assumption to say that those tools would be readily available.

Kinda sounds like a management issue, as well. No one ever said "hey you know XLS doesn't support all of this data"?

What a mess.


CSV seems like a good choice for this kind of tabular, linear data.


Have you seen the kind of mess Excel can make with a CSV file? The names of several genes were recently changed so that Excel would stop mangling them.


CSV != Excel. I work with CSVs regularly and can't recall the last time I opened up Excel (intentionally).


Doesn't Excel by default capture the .csv extension so that it gets called automatically when you try to open the file?

Since Excel is one of the few standard pieces of software that knows how to open CSV, it gets used a lot of times when it shouldn't. There's another post I made comparing Excel to a swiss army knife, and there's a reason for that.


The problem with Excel is that it changes the data, silently.


I don't know many engineers - even in the data team - with an Office license.


I think our company has a company-wide license for Office. When IT sets up a machine you get it automatically. Microsoft works hard to get those kind of setups to be common-place.


Last time I worked in a big corporate you had to fill in a long form to request a license for anything you needed. If you didn't use it again within a fortnight or so it got yoinked away...

The startups I've worked at since have all been big on GSuite.


Sure it does, but it doesn't give a f*k - once you open it in Excel, it does its own formatting and will save that incorrect formatting even if you save the CSV as a CSV.


P.S. It doesn't matter if you never personally load a CSV into Excel. You need to ensure nobody upstream or downstream from you does either.


Excel removes leading zeros from numeric fields no matter what you do and you cannot turn it off. This is a huge problem in healthcare where many patient identifiers have leading zeros


Zip codes are a problem for the same reason.


CSV is terrible for anything containing human text though. It's a nightmare.

I use SQLite as files a lot for this reason.


As the article notes, the CSV part of the pipeline was fine; it only broke after they imported the data to XLS.

One lesson I’d draw from that is to favor simple human-readable text formats like CSV, where they’re suitable for the job at hand.


There's nothing wrong with CSV, especially if you don't have arbitrary-text data (if you do, just don't use CSV). Excel though adds an additional layer of services, which turns it into a nightmare if used as something it explicitly wasn't made to be - a database.


XML would fix the max rows issue, but open you up to OOM issues instead!


Sorry, but if you run into OOM issues by parsing an XML file, you're using the wrong API.

The DOM for a large XML document will of course take tons of space in memory. The key to parsing XML files quickly and with low memory consumption is to only keep in memory what's necessary, by streaming over the elements.

https://en.wikipedia.org/wiki/Simple_API_for_XML
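For instance, with Python's built-in SAX bindings you only ever hold the current element, not the whole document. A minimal sketch (the element name and file are invented):

  import xml.sax

  class RowCounter(xml.sax.ContentHandler):
      def __init__(self):
          super().__init__()
          self.rows = 0

      def startElement(self, name, attrs):
          if name == "row":        # hypothetical record element
              self.rows += 1       # handle one record, then forget it

  handler = RowCounter()
  xml.sax.parse("huge_export.xml", handler)   # streams the file; no DOM in memory
  print(handler.rows, "rows processed")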


Correct. And to add to this: apparently the lost data was due to the data that exceeded the 16k rows XLS supports, so the amount of data per file was apparently not huge to begin with. So even a shitty XML parser should do just fine here.


As an aside, this is a very clear explanation of the bug in non-technical language, much better than you normally see on general news sites. Excellent reporting!


Most people don't know about solve order and possible multiple solves for a given cell on any update.

From : http://www.decisionmodels.com/calcsecretsc.htm

When a cell in a spreadsheet refers to another cell it must be finally calculated after the cell it refers to. This is called a Dependency.

Excel recognizes dependencies by looking at each formula and seeing what cells are referred to. See Dependency Trees for more details of how Excel determines dependencies.

Understanding this is important for User Defined Functions because you need to make sure that all the cells the function uses are referred to in the function arguments. Otherwise Excel may not be able to correctly determine when the function needs to be calculated, and what its dependencies are, and you may get an unexpected answer. Specifying Application.Volatile or using Ctrl/Alt/F9 will often enable Excel to bypass this problem, but you still need to write your function to handle multiple executions per calculation cycle and uncalculated data.


So the limit for Excel is 65,000 rows for the old format, and one million-plus rows for the new format (from the article).

Then I wonder, is there any tool that mimics Excel but with SQLite as the backend? The limit of rows in SQLite is 2 raised to the power of 64 (18446744073709551616, or about 1.8e+19).

https://sqlite.org/limits.html


Mimicking a spreadsheet with an RDBMS would be a neat hack, but I think the better solution is to simply use the database like a database. If an amateur carpenter is trying to hammer screws into wood, you hand him a screwdriver, not try to invent some sort of hammer with a twisting mechanism that turns screws when you whack them. Use the right tool for the job, and use it like it's meant to be used. If the workers involved don't understand the right tool, then either train them or hire new workers who do.


> not try to invent some sort of hammer with a twisting mechanism that turns screws when you wack them.

That's a thing, it's a manual impact driver. And sometimes it is the right tool for the job (for example, when you have stuck screws and a regular screwdriver would just strip the head).


Sure, kind of. You hit the impact driver with a hammer, but I wouldn't call the driver a hammer.


apparently part of the problem here was simply the fact that they used the columns [1] (of which there are far fewer, even on newer versions of excel) instead of the rows

[1] https://twitter.com/standupmaths/status/1313055411285774336?...

edit: retracting this as the person who posted that tweet made a correction in one of the replies


about 16,384 columns (up to XFD) in the newer format, still far fewer than the rows


Something sort of like this already exists shipped with Excel in the form of Power Pivot, which stores data in an embedded database inside the Excel workbook, and is supposed to scale up to 2GB and hundreds of millions of rows.

https://support.microsoft.com/en-us/office/power-pivot-power...

As you might see from that link it's semi-integrated with the Excel interface, but the differences in the underlying model show through in some ways (like not being able to edit individual cells or use VBA)


You could open a DB as a table view in OO.org, so presumably you can in LibreOffice. I'm surprised Excel doesn't integrate with Sqlite DBs.

My naive view would expect tables <-> sheets; rows <-> tuples to be easy to do (for MS) and just don't touch the relational aspects??


Does it work with SQLite? Is it practical?


Microsoft actually has a reasonable migration path where you can use Excel backed by Access and then transition to Access backed by SQL Server. But you'd have to recognise that you had a problem first.


XLSX's limit is 2^20, or just over 1.04 million (as the article says).


The problem is that anyone who isn't a decent coder think excel is a sensible choice for this, because they've seen it in their school or work.

So when the proposals come in, you're going to see one guy who says it's all common sense and we use familiar old excel for everything, and another lunatic who says something called "pigsqueal" is actually the standard, connected to a "frontend" which for some reason is now separate to the "backend". This nutter thinks we have time to write some unit tests and also wants to add an authentication module so we know who uploaded what. And somehow he thinks we need to add logging so we can ensure errors can be tracked and debugged. Amazingly he seems to be suggesting that there's going to be errors in our process.


That would imply non-tech people having some surprisingly deep visibility into technical details. What they really see is Excel vs. some contractor doing "a system".

The reality is that Excel is available today and works, and scales up... well, until it doesn't. Still, you have data entry that everybody in the field understands, that's rock solid (so no need to unit test anything), and has a tried and true authentication module built in (a file sent from a government mail address). There's a risk of user errors, but the system will be working immediately[0]. Deployment can be done by anyone, as it's just clicking on File->New and starting to type data in.

Meanwhile, a "pigsqueal" solution will take half a year to design, produce and deploy (that's in an emergency, a year otherwise). And that's with competent IT people, and not someone who wants to profiteer off the crisis. Then you'll have to train the users, and hope the reality won't necessitate any changes, because they will take a while.

I think plenty of decent coders appreciate the fact that Excel is suitable for a surprisingly wide range of tasks, and that if you spot it widely used somewhere, it most likely means there isn't any comparable alternative available.

--

[0] - Note that the user error that finally happened did not happen in the place where you'd expect it to.


Excel never works for math, at any scale. It's a constant trashfire of "looks fine" until it shows a result someone doesn't want, and then an error is found that changes results and the cycle repeats.

Excel is nice for visualizing and browsing data you already have, and informally searching and sorting for hypothesis generation.


My 4th order Runge-Kutta ODE solver I did in an afternoon in Excel the other day, because I didn't want to pirate Matlab to do a homework assignment, begs to differ :). Not to mention, the entire financial industry and many (most?) small businesses.

Lots of people use Excel for various types of math, and it works fine, if you know what you're doing. Excel not being idiot-proof doesn't mean it's not usable.

And that's all without writing a single line of VBA.

(We can have a separate discussion about some of the newer Excel features, like automatically suggesting what kind of statistical analyses to perform on your data - I consider this to be a potential future source of serious fuckups, as it allows people to easily transform data using methods, whose assumptions and implications they do not understand.)

> "looks fine" until it shows a result someone doesn't want, and then an error is found that changes results and the cycle repeats

That's a feature, though. Excel is interactive, which means you get ample opportunity to do sanity checks on results (both partial and final) as you work on your sheet, as well as immediate feedback on corrections. In "properly written" IT systems, this is rarely the case - you end up discovering a problem further down the chain, and have to figure out which component did something wrong, and why.


> Not to mention, the entire financial industry

That IS the dumpster fire. Trust me, the reason I know is that I used to be that guy who thought Excel was a great tool.

I built derivatives spreadsheets, backoffice spreadsheets, trading systems with realtime data, all sorts of crap in Excel.

Really, it's Stockholm Syndrome. People who previously had zero computing power at their disposal think they've found the hammer that solves all problems when they're introduced to Excel, because now they can calculate lots of numbers.

They just get blinded by the revelation that they can now calculate "anything" and are happy to pay whatever the cost is in terms of future maintenance, ease of understanding, etc.

> Excel not being idiot-proof doesn't mean it's not usable.

This is absolutely true; I'd be able to do much better with Excel now than earlier. The problem is Dunning-Kruger. There are too many people who think they can code once they're able to get a bit of Excel going, and they don't know that they can't. Not trying to be condescending, I've been there myself. It's just that you get a lot of "coding is a thing I have to do in order to get to some target", and so people think that once they've finally bashed out their spreadsheet, they've figured it all out.


You bring up some very good points. But to those, I have a question: what is the alternative? Both now (for UK healthcare), and in general?

The "people who previously had zero computing power at their disposal" may be wrong in thinking "they've found the hammer that solves all problems" - but there's literally no other hammer available for them. They're not programmers, they won't write their own software (nor would they be allowed to). Any other option involves so much organizational overhead - both initial and ongoing - that it's a non-starter.

I agree that people routinely use Excel way beyond their own skills. But I haven't heard of any viable alternative.


One part is knowing how important correctness actually is in your case. A lot of people on HN are perfectionists, and there's a lot of rhetoric going around our industry about avoiding errors and not breaking things; in reality business processes have always had to be error tolerant (because until recently they were always done by humans) and a quick solution that breaks sometimes is often more valuable than a slow solution that doesn't break.

Another piece is pervasive auditability. Any result should come with an explanation of where it came from; "Bob did some calculations in his head and he reckons the answer is 7" would be acceptable for some kinds of business decisions, while for others it needs to be more like "Bob followed the procedure specified in the XZY institute handbook, page 456". Somehow we've let all that go out the window, partly because people who don't understand computing are managing organisations that deeply depend on it. But you don't even really need computer literacy; what you do need is the same kind of scepticism that you'd apply to any other piece of work.

Managers need to manage. Some of the problem is just people lacking the necessary skills (and a lot of that goes all the way to the top: the UK government doesn't have the wherewithal to hire skilled computer professionals because at every level the people on top don't have the skills to assess whether the people below them are any good), but a lot is a misplaced perception of computers as infallible.


Excel is an excellent hammer. A spreadsheet is a system to develop pure functional graphs and to interact with them in real time. I spent lots of time on finance spreadsheets myself and also thought it was Stockholm syndrome, but when I transitioned to working on software I came to see exactly why spreadsheets are entrenched the way they are, enjoy seeing them used creatively, and would never wish bespoke software on finance people. And good for them, they seem to know better, themselves!


The alternative is to stop using your house wrench to fix the pipes and call a real plumber.


Grovel to Google or MSFT or whatever big name tech firm to borrow their coders.

They already have the infrastructure.

They have loads of coders sitting at home, not completely utilized. (Dunno how true it is, but a lot of people here comment that.)

They've already written programs to do something similar.

They're well aware of side issues like data protection, security, cross-platform, etc.

They could use the goodwill.

Wasn't Google already involved somewhere? Surely they can figure out how to count some tests as well.

I suppose if you're clueless you'll think Accenture or Capita are the same as Google, so yeah maybe we are screwed.


The alternative is to pay professionals to do it for them.


> It's just that you get a lot of "coding is a thing I have to do in order to get to some target", and so people think that once they've finally bashed out their spreadsheet, they've figured it all out.

I have a friend who wrote a 641-line-long bash script to automate a web site. It doesn't use subroutines anywhere; the body of the program is a 550-line loop with multiple if statements and loops inside it.

He thinks the program is maintainable and easy to understand because he didn't have particular difficulty writing it.


Wha......

What??

What school or book did he read that made him think that was a good idea?


He attended the school of self-reliance. He should be applauded for tackling the challenge and doing the job himself, where otherwise it would likely have not been done at all.

Obviously, he's not a programmer. The script, for us, is bad. Perhaps if something important for him depends on this script working, he should pay a software developer to spend some time cleaning it up.


> He should be applauded for tackling the challenge and doing the job himself

Indeed. it was a hobby project of his.


Could be worse. Could be a 641 line perl script.


Could be worse. Could be a one line Perl script.


He's entirely self taught. He's not stupid or anything, it's just that he hasn't had the experience of maintaining large codebases, attempting to modify his unmaintainable code or attempting to modify other people's unmaintainable code.


Well why are these people wrong? You had a thing going but you didn’t complete the thought: if Excel works and they can now calculate “anything” without developing software and deploying containers, where is Stockholm Syndrome? Why are they blind? What should they use in finance? Jupyter notebooks?


Having worked in the financial industry, my first job was writing a system to cover for a $10m fuck up in excel where yahoo exchange rates stopped updating and the same spreadsheet was used for a few weeks until someone noticed "Shit, we're losing a lot of money here.".

Excel is the right solution in the same way playdoh is a valid building material.


Seems there were at least a couple of problems here. Why were financial professionals using Yahoo as a source of exchange rates?


You'd be surprised at how common that was. I've known two other financial orgs that scraped Yahoo as a critical part of their investment system...


I know individuals / hobbyists who do this, which is fine. Financial orgs should know better, especially if it is actually "critical." How did this get through any sort of review? How do you know Yahoo is authoritative?


That assumes there is a review to get through, rather than it being an Excel file with a few hundred macros...


Hobbyists are usually more diligent than professionals because they care. If anyone cared about the product it wouldn't have been a spreadsheet.


Because excel couldn't authenticate into our internal server and it was good enough.


Was it good enough? I would argue that if it enabled your firm to make $12M before being the proximate cause of losing $10M, and the alternative was "we don't win $12M nor lose $10M because we can't afford the project" then I agree it was good enough.


No, which is why I was hired to replace it.


Okay, so you write a script to take the data from your internal server and post it somewhere that Excel can access. You don't use Yahoo.


How about don't use excel in the first place?


Oh, I completely agree. That was one of the "couple of problems" in my initial reply!


Pretty common for finance in the 00s and 10s to be honest. It's slowly getting better, but until you fire everyone who isn't on board with this and replace them with people who can at least script - at every level of the business - you're going to have the same problem.


That would be good. But to be honest, Excel is scripting - in a 2D FRP REPL (something structurally more advanced than most programmers are using day to day). If we're changing hiring requirements to test for bash/Python/Powershell competency, why not actually test for Excel competency?

Maybe that's the problem? Excel is so easy to start with that people with no experience think they've mastered it, and the industry doesn't seem to have specified any best practices, much less testing the interviewees for their knowledge of them.


Programming language expressions represent arbitrary trees, which are strictly more powerful than finite dimensional tables.


Still a very long way from that. What tools should non-programmers use instead?


>>Excel is the right solution in the same way playdoh is a valid building material.

I sooo badly want that on a T-Shirt.....


I really agree with this. Excel/spreadsheets are great and should be used whenever possible instead of some bespoke solution that will always be worse unless millions and millions of dollars are sunk into it, and even then, will probably be worse, given the state of software development in 2020.

Like all tools, you have to know how to use Excel. If this is the only error that’s come up, well, so many other problems would have come up with a bespoke solution that Excel is still miles ahead in my book


If you don't already know, Octave is a very good free and open source Matlab compatible interpreter and solution. Was of great use in my time in college.


Back in the day, it wasn't "very good", at least not in terms of getting it to work. That RK4 in Excel happened because I got extremely annoyed at Octave.


Peering into my crystal ball: You were between 16 and 22, you were 'great with computers', you had never used an OS other than Windows.

There is a difference between a calculator and a type writer and there is a difference between an office os and one for programming.

Using Windows and blaming Octave for Windows sucking is a rite of passage for everyone in a BSc program. I hope you got better and switched to a Unix.


Holy shit, how much does it cost to rent out your crystal ball for a few hours? Because you got it 80% right.

I was 19 or 20 back then, and Windows was my main OS - though I did have some experience with Linux as well (running several distributions for desktop use, as well as working with Cygwin and SFU on Windows), and I was a relatively proficient C++ programmer, having spent ~6 years of pretty much all my after-school time coding game engines.

So it's not that I couldn't make it work - I eventually did. But it was so rough around the edges that I gave up in frustration twice.

And yeah, these days, Linux is my daily driver (well, technically Emacs - Linux distros are just various flavors of Emacs bootloaders for me).


>Holy shit, how much does it cost to rent out your crystal ball for a few hours?

$300 per hour, I bring my own crystal ball, runes, chicken bones or voodoo doll based on customer requirements.


Does your divination work over video link too? I might take you up on this offer.


Were you on Windows? Octave really, truly shines as a command line tool on a Unix OS. The GUI is ok on Linux tho.


> Excel never works for math, at any scale.

That is patently untrue. I've borrowed and lent money based on Excel spreadsheets, more than once. That's small-scale math with real-life consequences, and Excel is an excellent tool for the job.


I'd concur if s/excellent/adequate/.


Cough GnuCash Cough


You're right, but if you asked me to do something in GnuCash which I never used or do the equivalent in Excel with formulas I'm already familiar with, I'd go with Excel in a pinch.

This is why Excel is so well known, because once you get the hang of it, it becomes a great tool for something small.

It scales poorly though.


It works for math at the scale I use it.

Excel is great in my industry (slot machine game/math design). It is fantastic for doing game calculations and there are reasonable ways to do error handling/checking for correctness.

I could say the same about programming a web app without writing tests/spending some time on system architecture. It looks fine until it doesn’t work.

In both instances of development (I argue constructing an Excel workbook in my line of work is very similar to programming) there are ways to mitigate risks by doing things similar to “writing tests”.


I wrote a finite capacity planner for a pie factory in Excel v6. It made my Pentium 60 run a bit warm for a while. It took about 18 months - I was the only IT bod in the place. I also learned basic networking and installed a pair of hubs (yes, hubs) with a 10BASE2 "backbone" and 10BASE-T to the desktop. I learned the new Novell 4 stuff, NDS etc., and wrote login scripts. When I started, Planning were using Lotus 1-2-3 for DOS and some amazing feats of keyboard jockeying. When I finished with them, they cranked the machine and then used their experience to tweak the plan a bit. Run it past the Production meetings, take feedback and twiddle a bit, then send it out to the Team Leaders, Quality and the rest. If things started to look different, run the plan again, etc. I borrowed my Dad's logistics textbooks to learn about forecasting (v).

Getting from infinite capacity planner to the finite version took quite a while!

I don't think you can fault the tool. VBA in a spreadsheet gives you a lot of power but it needs discipline to wield correctly. I used to have a row at the bottom of all my tables with the word "End" in tiny text in each column, always formatted white on red. All my routines that ran down the table to look up and do something would always look for that signal that the end had been found. Nearly all formulae were entered by VBA. I had auditing routines that would test the various sheets for errors - I suppose I "discovered" unit tests. One of them looked for a row of text with specific formatting ... Another obvious check is having row and column sums cross checking each other. I (re)discovered loads of little things like that.
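
For anyone curious, the cross-check idea carries over to any tool. Here's a rough Python sketch of the same principle (the numbers are made up, and 'table' just stands in for a block of cells pulled off a sheet) rather than my original VBA:

  # Rough sketch of the row/column cross-check idea, outside of VBA.
  # 'table' stands in for a block of numbers read off a worksheet.
  table = [
      [10.0, 20.0, 30.0],
      [ 5.0,  5.0, 10.0],
  ]

  row_totals = [sum(row) for row in table]
  col_totals = [sum(col) for col in zip(*table)]

  # The grand total must agree whether you add up the row totals or the column totals.
  assert abs(sum(row_totals) - sum(col_totals)) < 1e-9, "table does not cross-check"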

With care a spreadsheet can be quite handy for all sorts of tasks but please don't equate the ill advised monstrosities you (and I) might have come across in the past with a fault in the tool itself.

Anyway, there is a lot more to this story than that, and back then I had an IBM System/36 running the show as well to worry about. Twinax is a right old laugh to deal with. I remember going to Eng and asking to borrow a spanner and a soldering iron - "but you're Planning, what do you need those for".

Sorry, started waffling 8)

(v) How to forecast demand from the multiples in the UK for pasties, sausage rolls etc., back in the day. There are two cycles: one is weekly and the other is roughly annual, with peaks and sometimes spikes at Easter and Christmas and some upticks at bank holidays. The weekly one literally looks like a sine wave; the annual one is a bit more involved. As a first go, take the last three orders by day of week for a product and calculate an exponentially smoothed forecast for next week. For example, take the last three Mondays to get next Monday's forecast. Bear in mind that you need to prep, make, bake, chill and wrap the product and ship it to depot with about seven to 11 days' shelf life, and it takes something like one to three days to do that. You are always making to forecast, which is quite tricky. This was about 25 years ago, but Asda, Nisa, Lidl etc. used to take our forecast and fax/EDI it back as an order without changes.


Not to mention the "system" solution will have to go through its own debugging and will undoubtedly have bugs in it for some time after initial release.

On top of that, there is STILL room for human error as a user is still the one actually inputting data and designing the experiment/data flow.

I'm going to get mauled on this forum given the audience, but parent comment reeks of the technical elitism on this forum and the tendency to immediately condemn anyone who is using excel.

Excel works great - I've personally built some very complicated models that work just fine. Goes through the same process of QA and line-by-line checking of requirements.

Not sure what the big deal is - mistakes happen in excel or otherwise.


The same way that engineers saying you shouldn't use dung to build houses are guilty of architectural elitism.

Excel is the wrong tool because it mixes presentation and data, to the point that geneticists had to change the names of genes so their spreadsheets would stop turning them silently into dates and then destructively changing the original data[0].

>Excel works great - I've personally built some very complicated models that work just fine.

I've personally cleaned up after people like you, which cost the business millions of dollars because they didn't realize how unqualified they were for doing their jobs.

>Goes through the same process of QA and line-by-line checking of requirements.

Somewhat difficult as excel is not line based. "Oh yes, these constants are on a different sheet, if you just click here, here, here and here you can see where we get the original data, we multiply it by 1.0 to make sure it's not a date over there and then we feed it back ..." if excel was sanely serializable you could easily version control it and see meaningful diffs between check ins. Double points if you could run tests on sheets and run them in batch mode easily (I said easily).

[0] https://interestingengineering.com/genes-renamed-to-stop-mic...
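
To illustrate the "sanely serializable" point, here is a rough sketch (using openpyxl; the file names are made up) of the kind of flat, diffable dump you'd want before putting a workbook anywhere near version control:

  # Rough sketch: flatten a workbook into a diff-friendly text file.
  # Requires openpyxl; "report.xlsx" and "report.dump.txt" are made-up names.
  from openpyxl import load_workbook

  wb = load_workbook("report.xlsx", data_only=False)  # keep formulas, not cached values
  with open("report.dump.txt", "w") as out:
      for ws in wb.worksheets:
          for row in ws.iter_rows():
              for cell in row:
                  if cell.value is not None:
                      out.write(f"{ws.title}!{cell.coordinate}: {cell.value!r}\n")

It's a poor man's serialization, but check that text file in next to the .xlsx and at least the diffs become readable.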


>>>I've personally cleaned up after people like you,

Yep...

The critical question that always gets blank stares is "How are the data and calculations validated?"

There is never an answer to that question from the ExcelMaster, it is always a variation of "it looks correct to me"

Wonderful....


Except you can do it if you know what you're doing.

No one is saying to manage millions of records etc.

But you can get more mileage out of Excel than people think - and people who half understand it are the worst, because they know just enough to think it doesn't work.

Custom solutions are great - but aren't always the answer and with proper process Excel will work just fine in many use cases.


How can you achieve anything equivalent to unit testing? How can you do any kind of change control (unless you treat every edit to the sheet as a change)? Excel does some things well but for anything that needs to work robustly as part of a process (as opposed to one-time exploratory data analysis) it lacks things that were basic in 1996. It's an "unsafe at any speed" situation: it may work as long as you do everything right, but it's missing basic safeguards as soon as you make a single mistake; I'd put it in the same category as using a memory-unsafe language.


>Except you can do it if you know what you're doing.

Ye gads, why didn't we think of that in any other programming language? If we know what we're doing we can do whatever we feel like, and if we get the wrong answers, it was obvious that we didn't know what we were doing which can be fixed by just knowing what we were doing!

Excel's only use case is for data that fits on one screen, or an exploratory poke at the data to see what's in which column and if there are any patterns you can eyeball. Then you put those hunches in a script and start doing the work for real.


This geneticist thing is a complete misunderstanding. Excel is tuned to auto-convert things that look like dates into dates. Best would be to be able to disable this, but Microsoft's research shows that more users need it.

Meanwhile, scientists use Excel for lots of stuff, since it works. It just has the not-so-nice "feature" of those conversions.

So the options were 1) throw away the tool you use every day, or 2) change the gene name.

Option 2 makes sense, although everyone would be happier if Microsoft gave an option to toggle that auto-conversion off [there is an option to properly load CSV files, but people don't use it since it takes more time...]


So I'm wrong because I'm right? Have you looked at Stockholm syndrome?


The generic tool for offices is so good at everything that geneticists also use it; they change the name of a gene to make their own lives easier?


And I've cleaned up after many, many shitty engineers, one example being engineers who want to use a sledge hammer to drive a finishing nail and again, that's just one example.

And if you want to get personal to that level, they all have the same know-it-all personality and think they write MUCH better code than they actually do.

Pretty sure they cost companies even more money.

Luckily for me, I have the practical sense to know when to use what tool and so far it's worked out just fine. I wouldn’t use excel where not appropriate, so kindly don’t make assumptions.

But more power to you, keep shitting on tools that are more sophisticated than you probably think.

People who half know Excel are the worst; they are the ones who know it just well enough to think it doesn't work at all.

Custom solutions are great - but aren't always the answer and with proper process Excel will work just fine in many use cases that you might not expect.

I could blab on about this forever, because, you know, there is a lot of gray area or whatever.


Imagine asking people to use the custom solution sledge hammer of python, git, unit tests and sqlite. Not one off the shelf piece of software there.

There comes a point where using a spork to drill holes in walls is not the right solution. That you're shitting on people pointing out the obvious tells me you're still stuck in the 90s windows developer mindset.


The problem is you only hear about botched excel jobs - you never hear about all the successful projects, so when you hear a horror story you just jump to the conclusion excel sucks.

Ok, ask a geneticist to build what they need with their “off the shelf” parts you mentioned vs 90% of the time being self sufficient with excel and allowing them to, you know, do genetics work. (The other 10 percent being where it makes sense to build something more custom).

Reminds me of the classic HN post when Dropbox had just launched where the user exclaimed “this is just a mounted blah blah using subversion blah anyone can do it” (paraphrased)

That old Dropbox post summarizes this mindset quite well.

And if being practical with your head not in the clouds making sane business AND technical decisions is what the 90s were like, man I missed out.

I guess at the end of the day we’re arguing over where you draw the line regarding when to stop using excel and graduate to something different, a blurry line that at the end of the day is a judgement call.

Oh and for the record I wasn’t even arguing against or for this particular use case in this post - speaking more in general terms.


I worked with a couple. They stopped being geneticists rather quickly, because the salary for someone who can use Python well is about 4 times that of someone working on genes. Keeping them away from coding is now an executive decision, since university departments can't pay competitive salaries.

The dropbox post was right. Knowing someone who wins a lottery is no reason to conclude spending all your money on lottery tickets is a good investment. Which is as much as I said in that thread when it happened. That they were solving a problem for idiots comes with the problem that idiots are too stupid to realize when a problem is solved, so you need to focus on looking like you've solved it instead. Funnily enough dropbox spent herculean amounts of effort on polishing the UI.

>And if being practical with my head not in the clouds making sane business AND technical decisions is what the 90s were like, man I missed out.

Spoken like someone who has done neither.


>> The dropbox post was right

Enough said.


> Not to mention the "system" solution will have to go through its own debugging and will undoubtedly have bugs in it for some time after initial release.

Doing code reviews in Excel is hard even if the developers are pathologically disciplined. It's impossible most of the time.

And so is debugging.

It's very unlikely any such system would face these limitations and would silently ignore data the same way Excel did.


> It's very unlikely any such system would face these limitations and would silently ignore data the same way Excel did.

I’ve seen plenty of production systems that ignore or hide errors. Sometimes they’re still logging them, but it just goes to some log store or file that the team doesn’t check until their customers or support team inform them that it’s broken.

Good practice? No. But there are plenty of ways to mess up a non-Excel system and get something that works worse.


I'm sure that a developer who's determined enough can mess up any system. And Excel is by no means the only system with that kind of dangerous default (I remember MySQL 4 had very similar behaviour, as did early versions of PHP). But the industry knows this is far from best practice.


Excel will silently corrupt data, not just drop it. It loves corrupting anything that resembles a date, and there's no way to say "stop it" globally.


Use templates, or use the data importer correctly. No problem.


I'm sorry but you seem to not understand what data integrity is if you think Excel is a perfectly acceptable solution here.


It's all about the use case, man.

Within its constraints excel is great. Depending on the nature of your data and exactly what you are trying to do it can accomplish quite a bit.

The line between when to use excel and when to build custom stuff is not as clear as it seems, is all I'm trying to say.


Yep, Excel is a solid choice for some systems. Business people know it, and it's a hard sell to tell them you want more time / money to use something they see as giving them less.


> Goes through the same process of QA and line-by-line checking of requirements.

That's simply not possible unless you have a whole system around Excel running a test harness.

You can maybe do it if you are super dedicated with locked cells and formal double-blind QA passes which maybe exist somewhere but not in the vast majority of operations.


> formal double-blind QA passes

I don’t know what you think I have in my unit tests, but it’s almost certainly not whatever you mentioned.

Using Excel as an alternative to CSV is fine. It’s not like anyone was doing anything complicated here. Just storing data.


>Using excel as an alternative to CSV is fine.

Clippy the Paperclip

Hi, it looks like this alphanumeric constant is the date format used by the Democratic Peoples Republic of Arstotzka for the year 127 BC. I have changed the encoding for the file to Arstotzka standard and overwritten the original.


Clippy doesn’t have to worry about COVID cases because Arstotzka probably executes them all


> that's rock solid (so no need to unit test anything)

That is crazy talk; the amount of user error that can happen in an Excel spreadsheet means you would need to do crazy macro work to maintain the integrity of the data.


> Still, you have data entry that everybody in the field understands, that's rock solid (so no need to unit test anything)

Hey @TeMPOraL, there's a guy with a withdrawn PhD thesis on line 1 who'd like a word with you...

http://blogs.nature.com/naturejobs/2017/02/27/escape-gene-na...


Couldn't you achieve exactly what was needed here with Airtable? Or, easier yet, Google Sheets and a Google Form?


Probably not while complying with privacy laws.

I would be much more upset to find out a government employee on a deadline used Google Sheets or Airtable to share my medical data instead of an Excel doc on a secure government file server.


> HIPAA

" For customers who are subject to the requirements of the Health Insurance Portability and Accountability Act (HIPAA), G Suite and Cloud Identity can also support HIPAA compliance"

https://support.google.com/a/answer/3407054?hl=en

> secure

"UHS says all U.S. facilities affected by apparent ransomware attack Computer systems at Pennsylvania-based Universal Health Services began to fail over the weekend, leading to a network shutdown at hospitals around the country."

https://www.healthcareitnews.com/news/uhs-says-all-us-facili...


Not that HIPAA applies in the UK (and at this point it's not even certain that Google can be GDPR compliant even if they wanted to), but that bit you quoted applies to G Suite customers who have to enable that thing in the account settings.

It is implausible that every hospital, clinic, lab and other medical organization in the UK would sign up to G Suite and deploy it to all of their employees; or that the British government would negotiate some kind of procurement contract with Google quickly enough for all that to be practical.

Absent those issues, Google Sheets has a limit of 18k columns and 5m total cells, in this case the issue was that they hit Excel's 16k column limit. Not much of an upgrade there.


Ah, HIPAA.

Having looked at the marketplace for HIPAA compliant solutions for educational therapy patient record keeping (patient CRM, whatever the industry term is), I'd feel so much safer having my data in the hands of Airtable than the absolute dumpster fire that was every offering I saw.


HIPAA is an American regulation, not a British one.


I don't really agree with much of Google any more, but I was at one point fairly involved in a project there. They do have some more secure systems for government use that aren't the crazy leaky ad cesspool that Gmail and Docs are.


Where are the ads in Docs?


The docs are used as a data source for the advertisement systems.

I've heard on the grapevine that several Fortune 500 companies are on O365 because Google wouldn't disassociate Docs from data gathering, even at the level of high-touch, high-paying customers, and they rightfully consider their internal documents to be part of their IP. Microsoft says "sure" and flips a secret "don't even gather stack traces on the server" flag for these customers.


Extremely low performance. Obscure "sharing" workflow instead of obvious "mail a file". Cloud dependency that needs to be vetted for security (EDIT: and data privacy, as 'sjansen pointed out), and policies regarding cloud accounts on government e-mails. And in case of Airtable, also for budget.

I'd be surprised if either of the above could handle 65k rows in a single table without becoming near-unusable.


For most cases where Excel is used where it "shouldn't be", any alternative that stores the data in the cloud is not possible.

Just getting the approval and contracts would take longer than the entire span of the project.


What about LibreOffice’s spreadsheets?


Look, you’re right that Excel shouldn’t be used for this. But even then, Excel definitely could be used for this.

We’re not talking big data here. We’re talking about a finite number of possible test results from a finite set of testing locations, grouped on a daily basis. Let’s say 1000 test locations, where each location has a daily number of tests performed divided into “positive”, “negative”, “inconclusive” and maybe a few other options. That’s ~5000 data points per day. Should not be any problem.

However, you need to properly plan ahead about how you design your spreadsheet, how to format the columns, how to keep it performant and how to not save it in some godawful 25 year old memory-dump dumpster fire of a file format...


Come on, this is a government department of a first world country. They have databases for all sorts of things. They're not using excel for any other NHS databases, or the terrorism watchlist. There are plenty of people capable of making a good decision here.

I think you let them off too easily by just assuming they're dumb. This a bad decision by people who definitely should have known better.


Yep. If the UK is anything like Australia, they will have a small army of contractors who specialise in building SQL databases for various important and pointless requirements for the government. These contractors are used to working on giant, sensitive data, databases with insufficient allocated time. It should have been pretty much trivial for the UK govt to solve this.


And yet, I have first-hand witnessed a scenario like the one described in an AU government department. A small system developed by business (one person who had subsequently moved on) using Excel and Word and macros to keep track of information and generate letters to "customers" had become an essential tool, but unsupported and undocumented. When we (IT) recommended redeveloping it in Java with an SQL database (and documentation and testing and backups), they balked at the cost. Partly because the original person did the work "in their spare time", so the business never saw the actual costs. And partly because there are zero infrastructure costs with Excel and Word. There was very much a feeling that IT massively inflate the costs and will build a Rolls Royce when a Ford will do. It was a house of straw vs house of bricks situation, but they were happy with their straw house, they thought their only problem was that the little pig who built it was no longer there.


Was that system working on the single most important problem facing the country at that time though?


Why are there infrastructure costs for a database server? Once you have one it can host all your small databases.


Paying your share of having a unix sysadmin and a DBA available 24/7.


Developer salaries are the main cost.


The problem with my private-sector Australian company is that everything IT-related is aggressively outsourced through various networks of contractors.

In the old days, when the relevant business application was written, hosted and supported in-house, there was a clear chain of responsibility: I could pick up the phone and I'd have a direct line to the person who "owned" the application.

Nowadays if there is a problem it's: pick up the phone, talk to the helpdesk, get assigned a ticket number, and get the buck passed between different teams. The database team will blame the server team, the server team will blame the networking team, the networking team will reply to the ticket with 'looks ok, no problem on my end', and the ticket will get closed without resolution.

From what I can tell there are a bunch of incentives in the support contract around how quickly support can close out tickets, so rather than trying to fix the problem support try to do everything they can to farm the ticket off to someone else so it won't impact their metrics. The whole thing feels maddening and has to be rather inefficient.


Victoria (Australia), until a few weeks ago, did contact tracing via hand-written notes and faxes. Sure, it's not going to overflow at 16,000-odd cases, and doesn't require a DB admin and test cases, and all of that. But it performs at a fraction of the speed, and is one of the major reasons the second wave here got out of hand. Just because that army of contractors exists doesn't mean the relevant departments know of its existence, or decide to use it.


I'm pretty sure the UK does have actual software with actual databases backing up test and trace, some of which was outsourced - in fact it looks like our opposition party and opposition media are trying to blame this failure on outsourcing, even though that has no foundation in reality, mostly because this plays well with their supporters for reasons too tedious and political to explain. Unfortunately, this happened at a kind of interface between different systems designed to take in results from both the Pillar 1 NHS testing and Pillar 2 Lighthouse testing for the general public, deduplicate them, and spit them out to the contact tracing system and the government dashboard. These are separate systems run by separate organisations for various not-so-interesting reasons to do with the structure of the healthcare system and the history of Covid testing in the UK.


The people building stuff like that generally need a 6-12 month lead time from project proposal to delivery.


> They're not using Excel for any other NHS databases

I assure you that you're wrong. The SaaS I work for replaces a suite of independently re-invented Excel files used in conjunction with other SaaS. NHS trusts are our main customers; I can only assume the NHS is full of Excel spreadsheets.

(Technically speaking PHE is not part of the NHS, it has more in common with the civil service in some ways.)


The dubious wisdom of choosing Excel for this aside, how many people working in (1.) computing and (2.) statistics saw a file with exactly 65536 rows, with nary an eyebrow raised? And I mean the first time they saw it, not the second, third or fourth day in a row.


Your point is incredibly good, and made me laugh too. It illustrates precisely why people, both technical and non-technical, frequently avoid choosing the massive scalable hammer factory system architecture that takes five people three weeks just to get rev 1, over the little tapper that gets the job done today with one junior developer.

From the technical standpoint of a decent (and probably only decent) coder, I think Excel is an amazingly sensible choice for technical projects for a bunch of reasons. I think decent coders reach for the database solution far too often, and waste ungodly amounts of time setting it up. Excel coming bundled with an amazing front end UI you don't have to write code for shouldn't be ignored. Sure it's error prone, sure it's a nightmare for data validation, sure it doesn't scale to Amazon sizes, but this little project will never get that big. (Yes, until, of course, it does.)

I used Excel as the frontend for artists to author and enter small databases for a game engine, for example, and it was leaps and bounds better, and cheaper, and easier to use, than the unmaintainable SQL nightmare, with a bunch of frontend and backend dev on top of it, that the decent coder who preceded me left behind. Excel even scaled better, up to the size of our studio (hundreds of people), compared to the other system that needed technical artists to know how to use it. No question whether it would crumble if we were talking thousands of people or more. Which is why it happens so often. :P


> another lunatic who says something called "pigsqueal"

And it doesn't help that there's a 25-year-old who says, "pigsqueal is outdated anyway, modern organizations use MonkeyDB and Hand Goop".


rofl


Your scenario makes it seem like these things are the result of incompetence or stupidity by people who aren't "decent coders". The reality is that for most use cases Excel plus manual processes is perfectly adequate and has little startup cost, while what you're suggesting is grossly over-engineered and requires an immense initial investment.

The real problem is that if/when scope begins to increase, these processes are sufficient until they aren't, and the tools aren't flexible enough to gracefully manage the transition.


>The real problem is if/when scope begins to increase these processes are sufficient until they aren't, and the tools aren't flexible enough to gracefully manage the transition.

That's a good point.

I'd also add that, at least in the US, we don't have any sort of national standard for such things.

Each public health department (and there are >1,000 of them in the US) has their own set of processes and procedures.

I'd expect that the CDC has a proper database system and that data from those 1000+ entities is collated into that system.

However, state and local laws/rules require specific handling of such data and can differ significantly from state to state, county to county or even town to town.

Attempting to stand up an integrated nationwide system isn't a bad idea. But doing so while trying to manage a pandemic isn't reasonable.

That many places used Excel spreadsheets for storing testing data is neither surprising or necessarily a bad thing. Until March, no one needed to have huge databases of test results -- now we do.

Complaining that places which had, at most, a few dozen cases of reportable infectious diseases should have implemented a database that can support thousands (tens of thousands?) of test results seems rather silly to me.


My apologies.

I didn't read the article posted until now, and I see it's about issues in the UK and not the US (I was confused, as we had issues with data reporting in the US as well).

My comment was focused on the decentralized US public health model and not on the UK's.

My apologies for injecting analysis of a different problem into this discussion. As such, my prior comment should probably be down-voted.


Add to this the understanding that getting this on contract will take a few months, when the boss was asking for these figures yesterday...


No need for a “front end” or “backend” or “pigsqueal”. MS provides a perfectly workable all-in-one tool, Access, for this kind of work. You can even use Excel as a front-end to Access data if you want.


I have spent plenty of time running Access programs for employers. You'd better have a frontend and backend if you want data integrity.


This isn't really true. The people making the decisions will have more or less zero insight into the data format being used to store data, they'll just know that data is stored.

IMO, a much more likely scenario: a brittle, haphazard system was set up with extremely short notice at the start of everything. One administrative snafu led to another, ran into "well, it's working so let's not touch it", and we got to where we are today.


A good part of the problem could be the timelines involved. With something like Excel, you have a large number of people who can put something together in a short amount of time and perform the needed analysis.

Contrast that to a database.

The first problem you run into is the limited number of people who actually have meaningful experience with databases. Databases ceased being consumer products about 20 years ago, so most people have only interacted with very limited front ends. They certainly haven't created a database, nor performed anything more than trivial queries. (Even then, they probably aren't thinking in terms of databases and queries.)

Okay then, we are reliant on a much smaller number of skilled database administrators and developers. Do their tools facilitate rapid development? Even if they do have access to suitable tools, the process is going to be slowed by communications between the people who need the analysis done and the people who can actually perform the analysis. Needless to say that Excel, with all of its limitations is starting to look like a good option.


Frankly I think you're being optimistic/pessimistic (depending on your point of view). A person can be a "decent coder" by very reasonable definitions of the term and still not realize that Excel can't handle this particular volume of data.


Excel can handle the volume of data just fine. The (very outdated) .xls file format with its 65,536 row limit was the problem here.


I understood (different source; not sure what the true picture is) that the problem was with storing each record in a column. There is a limit of 16,000-odd columns. This would, as you say, not be a problem with Excel as such, but with the user(s) not being aware of the limitations (which is still the case in your scenario). I wonder how much the general problem has to do with mistrust of 'experts' so, hey, I can do this myself in a spreadsheet. (Or maybe it was just a decision on-the-fly in a fast moving situation)


I believe that was misreported. Someone wrote "columns" when they meant "rows".


This is the test and trace system for the worst pandemic this country has ever faced, six months in. I’m sorry Excel spreadsheets won’t cut it.

This is a management failure. PHE has capable coders. And if not, they could hire some.

The NHSX team has managed to write the dang COVID-19 app twice. Once not using the Apple/Google contact API (because management) and a second time properly. It’s all open source and it looks pretty decent.


As an aside on the Covid-19 app - it is open sourced in the "occasionally throw some code over the wall" kind of way - e.g. it looks like the GitHub repo was last updated on 24 September vs the app, which has more recent updates. It's not clear they open sourced everything needed to actually build the app either, looking through the issues.

This is very very far from the only project doing that but disappointing nonetheless given the amount of public money which was spent on it.


Excel will also happily screw it up and not tell you, as happened here.


Exactly...

> "But you wouldn't use XLS. Nobody would start with that."

Actually almost every non-technical person I know that wants something "database like" ends up in google sheets or excel first.


I've been bitten by this exact issue (albeit on non-critical data): save a file as XLS and it will silently drop rows beyond 65k; save it as XLSX and it's all good.


The silent part there is the real surprising part to me. You'd think the exporter would die noisily if it still had data in the buffer and it ran out of file.


I'm betting the exporter isn't running out of file, but out of loop counter.


I mean, same difference. "I cannot generate a file representing state of data in RAM" is a 'fail noisy' error, not a 'fail silent' error.


(EDIT: numerous comments point out that Excel displays an explicit warning when truncating data above 256x65k. This comment was written under the assumption that it doesn't.)

Not if it was a bug in the code. 65k sounds suspiciously close to the limit of 16 bit unsigned int.

I'm guessing the exporter looked roughly like:

  int exportRow( ... ) {
    if( /* can't export or no more rows */ ) {
      return 0;
    }
    /* export row */
    return ++someInternalCounter;
  }

  void export( ... ) {
    unsigned short nextRow = 0;
    do {
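      /* the int return value is truncated to 16 bits on this assignment, */
      /* so once the counter hits 65,536 it wraps to 0 and the loop ends  */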
      nextRow = exportRow(...);
    } while(nextRow > 0);
  }
In the above example, the export would silently stop after 65k entries.

The way people write C in the wild, this wouldn't surprise me in the slightest. And with Microsoft being all about backwards compatibility, Excel probably defers to some ancient and long forgotten code when exporting to XLS.


Until you reach 1,048,576 and then XLSX silently drops rows.


Silent? Are you positive? Not doubting you but if so; WTF?!


I just tried to reproduce it and got a big warning message that says that data will be lost.

You can disable this message (on a per-file basis) if for some reason you don't want to see it.

As far as I remember this message existed since Excel 2007 (which introduced .xlsx format), so the guy is simply lying.


Certainly not lying but it was a few years ago. Not sure which version but not 365


Yep, I'm positive, but as said elsewhere this is a few years back. I recall sending the data out and getting some pretty stern feedback (this is why I recall it). Didn't make the mistake after that...


Good lord. Old XLS files. I'd love to have seen how this was concocted.


Perhaps some old integration code from 10+ years ago that was written to spit out XLS rather than XLSX

e.g. some old versions of Crystal Reports can only do XLS. There must be loads of old systems around that can only do XLS.


XLSX files have been around for ... nearly 13 years...

Ugh - I still deal with clients who are creating brand new XLS and DOC files and expecting to have all the latest features (co-authoring when hosted in 365/SharePoint Online/OneDrive/Teams)...


Worked on a world-class radiotherapy piece of software. If you like 20-year-old Delphi projects and pre-C89 ... C, then come over! XLS export is probably a recent addition; XLSX support will most likely come no sooner than when Excel stops supporting XLS.


Ten years ago people were still complaining about OOXML not being an open standard. Don't worry guys, I'm sure they'll make a reference implementation of the standard soon! I'm sure there were a few people waiting around for a "real" standard to come out before implementing XLSX support.


It was submitted for standardization in 2006: https://en.wikipedia.org/wiki/Office_Open_XML.

At that time there was less information about the 97-2003 file formats (OLE2-based), but it was fixed around 2008-2009: https://alexott.blogspot.com/search/label/file%20formats


Excel has never worked well with external data.

Even the current version of Excel still inexplicably auto-converts long numerical strings to Scientific Notation.


Even xlsx files have caused endless headaches at my current workplace when they're large.


I suppose someone had written a VBA macro to import all the CSV files and then save them in a format which could be imported into some dated upstream system. Or never changed how they worked. Could even be because Excel does something slightly more rigorous with XLSX files which made writing the VBA trickier.


I worked with teams integrating legacy systems from hospitals before. There seems to be a large number of carefully customised scripts for pulling data together. There was never a straightforward data ETL / export for anything other than what is routinely reported. A lot of information is derived and inferred, even from unrelated data (e.g. inferring chemo events from drug prescription codes if parts of the system are not accessible). Hence, if you really need high quality data, you need a lot of validation and cross-checks as well (either by script or manually). To me it sounds like the checks and validation did not happen enough, as time was very critical. I don't think Excel is the single thing to blame here. Hospital IT systems have a long history of data-related issues.


It's also entirely possible that whatever they're importing this data into only supports XLS and not XLSX. I don't think Excel is the ultimate endpoint - the data is getting imported into one or more databases somehow.


My guess: somewhere down the line someone who doesn't often use Excel typed ".xls" at the end of a string variable or selected it from a dropdown list and made a mental note to come back to it if it caused any issues in testing.


Old code is everywhere. I once worked at a health tech startup that was integrating with a very successful existing company. All their code was in Turbo Pascal. We were shocked. The funny thing is that their code worked great and even had a debugger. Our code was ultra modern typescript micro services and we debugged with prints and logs. Their product looked dated but was superior in every other way. My guess here is that XLS is a simpler format and the devs found an NPM package that could handle XLS but not the newer formats.


It may have been a user education and/or communication issue.

I had someone come to me recently asking for some data "in an XLS". I asked them if they specifically need XLS or if they just need something that can be read by Excel. It was the latter, but they didn't know that there was a difference. To some people, XLS == Excel. Sort of like how TLS is still being referred to as SSL - somewhere in the world there is a developer being asked to use SSL for a new project because SSL == secure.


“Let’s see if StackOverflow has something”

...

“Oh nice! Exactly what I need!”

<copy>, <paste>

...

“Woah it builds! Ship it!”


This is Excel. There is no "builds".


There are a surprising number of people out there that still save their office documents in the old formats for whatever reason they have convinced themselves of. Just a couple weeks ago I had a coworker that sent me a manual for a piece of software he wrote as a .doc file.


This is indeed user incompetence, but so far every news source quoted the official PR and said it was an "Excel Problem".

Microsoft PR was caught unprepared - I wonder how they'll re-spin it in the next few days (and for the first time that I can recall, a Microsoft product was wrongly blamed...)

Pay attention to how every time there's a Windows virus or worm, it's a "Computer Virus", but on the (extremely rare) occasions where Linux or MacOS is involved, it's attributed to that system. I don't think that's a coincidence, CMV.


I think Linux and MacOS viruses get a special callout because they're exceptions. Similar reporting conventions are used all the time, like a terrorist attack in the middle east vs a terrorist attack in France. It's not some nefarious PR scheme, the story is just more interesting because of how unusual it is.


It's not a user issue, it's a government incompetence issue. Who was the technical genius who thought 'right, we need to store thousands of lines of critical and important data, I know! Excel!'


The common case for stories like this is the answer is "nobody."

Time point 1: "We need a way to map the long-form epidemiology state. It's 2D data... Let's use Excel." "Will we hit limits?" "Pfft, no. Not unless we end up needing to track tens of thousands of patients, and that's not practical; we don't have enough people to do that tracking in the whole NHS."

Time point 2: "This pandemic is an URGENT problem. We need to hire more people than ever before to do tracing. And be sure to keep the epidemiology state mapper updated so it can drive the summary dashboards!"


Have you ever worked in a large enterprise?

I have personally tested an Excel-based credit rating tool, to be rolled out globally by a major financial institution.

The fact that one should not do this is by no means a reason not to do it.


Fun fact: Toshiba has software that generates entire CRUD app skeletons driven 100% by excel spreadsheets. You fill out object definitions in excel spreadsheets with property names and data types etc. It's like an extremely uncool long lost cousin of rails -g

Source: Got one of their engineers to show me after I heard about it and had to know if it was real.

Anyone who's ever worked with or in Japan knows just how far they are willing to torture Excel spreadsheets to get them to do anything and everything.


I've (mis)used Excel as a web-service testing platform once. The test data (input/expected output) was anyway in the Excel sheet, so why not. We did not have other software available at the time, so Excel+VBA it is.

It was one of my nicest testing gigs ever. A test session would take only minutes, all results documented to the t. And fiddling around with the test data was so easy. Would be interesting to know if this thing is still in use.

Not sure the customer liked it that much, I regression tested the 6 previous - still running - versions of the service, something no one had cared to do for years. We found bugs both in the Spec and in the Code for nearly all old versions...


>I've (mis)used Excel as a web-service testing platform once. The test data (input/expected output) was anyway in the Excel sheet, so why not. We did not have other software available at the time, so Excel+VBA it is.

That's an upgrade on how Toshiba did testing. They did basically the same thing except it was Excel+Humans. No joke. Never saw that one with my own eyes but a coworker did. He also said they had doctors there and each doctor was responsible for X number of staff and every so often made them fill out a questionnaire of which the final question was "have you thought about killing yourself lately?" and if you answered yes you got a day off. Apparently people would ask for transfers to different offices where the doctor responsible for them wasn't physically present, so they could avoid even being asked the question for fear of being made to take time off work. One of the more senior guys on the project kinda just disappeared too. Fun times.


To play devil's advocate, I had a desktop support job in college and worked with an executive assistant. She was _very_ attractive and so I was at her desk "working on the printer" a lot. She knew Excel better than anyone I've ever seen (before or since). It was like watching a magician.

In the right hands Excel is pretty amazing.


I once created a very modern, mobile, offline-capable web app for a large enterprise company that was powered by APIs backed by Excel. Luckily we didn't have to deal with those directly, but it was a bit of a pain for the poor guy doing the backend.


Would be interested in hearing more about that Excel tool.


It started very innocent: Some consulting company apparently was tasked to create a prototype of the credit rating algorithm. Quite appropriately, I would say, they did so in Excel. As you can easily tweak the logic, show input, intermediate results and output. And all can be manipulated by the client who does not need to know a thing about programming to do so. Great.

From there it went downhill: Apparently someone up the hierarchy thought "Wow, that's 80% of what we need, let's just add a little UI and ship it"

They added a UI, but not as a VBA UI as you might hope. No, they did the whole UI in different worksheets. Storing any intermediate data ... also on worksheets. Long story short, it was a mess, and a slow one at that.

Oh, did I say this was a multi-lingual application?

Production rollout was on Jan 2nd. On Dec 31st around 4pm I found a bug in the other language, on the one machine which had this language configured. I believe this was the only time I ever saw a programmer literally run down the office to that machine to debug an issue.


Come work at a bank. Any bank.


Can attest to this. Some of the things I've seen excel spreadsheets contorted to do horrify me.

I once got a spreadsheet dumped on me to debug because it wasn't working. A colleague used the spreadsheet to 'generate' interest rates that were then input into a mainframe. Dug into the VBA spaghetti mess; turns out this 15-year-old script that took 15 minutes to run, and originally hit a bunch of Oracle/external APIs and performed calculations, was now just copying the rates in from a CSV file on a shared drive.

The error was caused by an excel formula ticking over a new year causing it to look for a directory it did not need to access that didn't exist. I thought it was pretty funny until I heard that the entire asset finance business had been unable to write any loans for days because of this.


In my org excel might have been the start, the intent being to move over to a DB once someone got it stood up. Then nobody would stand up the DB - because not urgent, we have the excel for now. The DB would be completely forgotten, and 3 years down the road everybody would be wondering wtf we have this huge excel file for. Then we would have six weeks of meetings to discuss the file until something else comes up and we forget it for another 3 years.


That's why Excel wins. Because it's right there, infinitely adaptable to any change in workflow you may need to make. No need to send official requests through half a dozen layers of management to get the IT people to maybe, in half a year, add a column to the database. No need to involve 20 people for a year to create new software for a new workflow. You just agree with your co-workers on a change, swap a column, add some color, File->Save, done. It wins because it puts power in the hands of the people doing the work.


I mean, yeah, but pretend Excel is, idk, sword arms. You get a lot of work done by whirling around the office, spinning your swords, cutting off a variety of limbs and heads in the process. You get to the end, meet the few survivors, staring dismayed at the trail of blood and dismemberment behind you, and you yell, "BUT I GOT THE JOB DONE!"

You're not wrong, but...

I wish there were some alternative where we instead fixed our ossified, byzantine processes gradually over time, so that we didn't need to break out the sword arm tornado in the name of getting work done and then, in hindsight, say, "well, shame about the negative externalities, but there really just was no other way. Ah well, let's move on, everyone who isn't a pile of flesh and blood and bits on the floor tidy up. Gotta get everything clean for the next sword arm tornado."


But the negative externalities don't seem all that big, considering. For every problem like this, you get countless millions of person-hours and dollars saved in work that wouldn't have happened at all, or would have been delayed, because of the necessity and cost of creating complex IT systems to support it.

If Excel is a sword tornado, it's one happening in an environment where everyone knows to be super vigilant about sharp objects. The alternative then would be a central processing factory that takes several years and millions of dollars to build, and which has to be turned off for a month several times a year, to change the shape of the blades used by the automated cutters.


> For every problem like this, you get countless millions of person-hours and dollars saved in work that wouldn't have happened at all,

I've found errors in every Excel spreadsheet I've ever looked at, and I'm not some master Excel user; I usually find them because, if I care about the results, I rewrite them as a Python script, so I get to go through everything.

The fact that Excel is effectively not auditable is a huge problem. There are a lot of dollars saved, like you said, but also many wasted or embezzled without anyone noticing in time because past a very low bar of complexity, it's really impossible to figure out what's happening without much, much work.

> If Excel is a sword tornado, it's one happening in an environment where everyone knows to be super vigilant about sharp objects.

... rather, no one is really careful, everyone gets a small cut every now and then, to which they apply a bandage and continue like nothing happened. And occasionally, they lose an eye or a limb or a head - and it is only those cases you read about in the newspapers.

I do not have a better suggestion, I'm afraid, but Excel is causing damage everywhere - see e.g. [0], whose subtitle - which I'm afraid is not an exaggeration - is "Sometimes it’s easier to rewrite genetics than update Excel".

[0] https://www.theverge.com/2020/8/6/21355674/human-genes-renam...


I agree with what you wrote. It does cause a lot of small mess everywhere, and an occasional big mess. But there really isn't any better suite of tools available that hits all the non-database requirements.


Alternatively...

"We need to store some data for a virus" "Where does the data come from?" "Every hospital sends us an excel sheet" "Well, let's merge all of it into a bigger sheet and display it or export it"


"We should maybe start tracking some of this" "Where does the data come from?" "Dunno, a handful of hospitals... maybe?" "Right. Let's just use Excel, not gonna waste time on that now"

... six months later ...

"Hey um..."


From the species that brought you "The Game of Life" comes "The Game of Technical Debt"


>"We need to store some data for a virus" "Where does the data come from?" "Every hospital sends us an excel sheet" "Well, let's merge all of it into a bigger sheet and display it or export it"

you have no idea how right you are :)


Been there, done that, got the sqlite t-shirt...

Wasn't a bad choice, but once you load 20TB into it, you realize you done goofed.


I think you'd be shocked by how many of our financial and social institutions rely on emailing Excel spreadsheets to one another.


I am absolutely not shocked by that. Most people do not know how to insert into a db, let alone spin one up and add tables. Everyone can excel.


Don't forget healthcare! Excel and CSV for tons of data exchange.


I think the healthcare industry is probably the only thing keeping fax machines alive. Interestingly, fax machines are tagged as HIPAA compliant. Lots of spreadsheets get printed and then fed to the fax machine for transmission instead of email/whatever. On the other side, and I'm being serious, someone gets the fax and creates a new spreadsheet and enters all the rows manually. Fun fun.


CSV is somewhat okay-ish. Excel, not so much (thanks Excel for discarding the important leading zeros in this field, and for lossily parsing the other as a date... oh, this Excel file is so big it got the "XLSB" treatment which is hard to undo)
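
For what it's worth, this is why CSV is the okay-ish one: anything reading it as plain text keeps every field as a string until you explicitly convert it. A minimal Python sketch (the file name and "patient_id" column are made up):

  # Minimal sketch: read a CSV without guessing types, so "0012345678" stays a string.
  # "patients.csv" and the "patient_id" column are made-up names.
  import csv

  with open("patients.csv", newline="") as f:
      for row in csv.DictReader(f):
          patient_id = row["patient_id"]  # leading zeros intact
          # convert only the fields you actually know are numeric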


Who? Almost anyone who wanted to get something done without going through the red tape of working with IT. Even in a private corporation that red tape is enough to make many a thing end up in Excel (or, for the more tech savvy, Access). Government is surely even worse (better?) at having red tape, so the pressure was likely stronger.


Yes exactly. An endless battle: IT want to stamp out spreadsheets and centralise everything but it all becomes an ossified bottleneck of development projects. Meanwhile people need to get stuff done so they use Excel.


Yup. What also helps is that the data literally sits on your drive, or your network drive. No SaaS vendor is holding it hostage.


It doesn't feel like that. Even if I save the file (365 subscription and desktop apps) on my disk it still feels 'cloudy'.


Of course Excel. What else can non-technical users use? They simply don't have any other data tool.


Excel is perfectly fit for that and is nowhere near the actual issue.

It's lightweight, fast, portable, compatible, and just about everyone can use it regardless of (technical) skill.


Except it wasn’t the government. They outsourced the test and trace system to private companies.


Public Health England was responsible for collating the data, which came in the form of CSVs - PHE is a government agency.


It won't be a surprise to anyone who lives in the UK that the problem was blamed on Excel, rather than the UK government taking any responsibility for it.


Yea, I have a personal vendetta against XLSX due to having had to make tooling replacements for business logic being encoded in spreadsheets in the past, but the title of this article is incredibly slanted.

This was absolutely a user error and the title "Excel: Why using Microsoft's tool caused Covid-19 results to be lost" is really disappointingly click-baity.


Is your view based on data available to everyone else? If so, that would be nice to see. It's far easier for you to convince others with data. Why would anyone want to change your view?


Could also be a lack of budget funding. The newer version of Excel supports 1,048,576 rows.


The test and trace system for the UK was ~£12bn.


> This is indeed user incompetence, but so far every news source quoted the official PR and said it was an "Excel Problem".

It's been quoted/reported as "an ill-thought-out use of Excel", when in reality it's "poor use of an older file format".

There's also a lesson to be had here about validating your data conversions. Especially for critical things like this, it's always a good idea to perform the conversion from the old file format to the new one, and then to go back and extract the data from the new one to compare to the source data.

This can also find issues like Excel cluelessly assuming a field is a date, PHP automatically converting a string which happens to only contain numbers into an integer, and so on.

From what I can tell, that sort of thing, assuming proper monitoring (which is a huge assumption), would have detected this immediately (plus however long until someone notices the error e-mail, etc.).
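
A minimal sketch of that kind of round-trip check in Python (assuming openpyxl and made-up file names); the whole point is just to compare what went in with what came out before anything ships:

  # Sketch: convert a CSV to .xlsx, read it back, and verify nothing was dropped.
  # "results.csv" / "results.xlsx" are made-up names; requires openpyxl.
  import csv
  from openpyxl import Workbook, load_workbook

  with open("results.csv", newline="") as f:
      source_rows = list(csv.reader(f))

  wb = Workbook()
  ws = wb.active
  for row in source_rows:
      ws.append(row)
  wb.save("results.xlsx")

  check = load_workbook("results.xlsx", read_only=True).active
  if check.max_row != len(source_rows):
      raise RuntimeError(
          f"row count mismatch: wrote {len(source_rows)}, file has {check.max_row}"
      )

Even a crude row-count comparison like this would have turned a silent truncation into a loud failure.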


The way I heard it, it was an Excel problem: They had a CSV file that had more than 1 million lines, and Excel dropped the lines beyond 1 million.

I can imagine an automated process that appended to the CSV and then Excel saving back to that same file with the lines stripped.

I hope we can all agree that this is not the way Excel or any other software should behave, whether the user is incompetent or not.


If you read the article (perhaps it's been updated?), they were using the old XLS format and they hit the limit at about 65,000 lines

> PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system and made available to the NHS Test and Trace team as well as other government computer dashboards.

> The problem is that the PHE developers picked an old file format to do this - known as XLS.

> As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of.


Like I said, that's the way I heard it, the way it was narrated to me.

The Guardian reported my version:

https://www.theguardian.com/politics/2020/oct/05/how-excel-m...

CSV would to me make more sense than an outdated binary format like XLS, but either story is plausible.


Did they pick XLS because it's easier to parse or something? I can't imagine why someone would pick a format that was largely replaced 10 years ago.


Moving from XLS to XLSX will break your macros and would require all machines to be running Office 2007 or later.


Office 2007 was like... 13 years ago, what's the problem?


Compatibility. It's reasonable to predict that some of the participating entities are running software that is at least a decade old.


Indeed


If somebody finds it easier to do this process using Excel than with a scripting language, I wouldn't expect them to know the differences between file formats. I would be surprised if they understood the concepts behind file formats.

They may have chosen XLS because they used Excel back when that was default, and now they don't want to "risk" anything by switching horses mid-stream.


Inertia, most likely. They were using XLS 10 years ago and never saw a reason to change.


If the conversion was failing silently, it's a Microsoft software problem, but I bet you money the conversion was failing in a way that fell into a `catch(Exception e) { }` somewhere in their conversion script. Or even a `catch(Exception e) { log e; }`, and the log is stored to local storage on a machine that no human being ever logs into.


There's nothing to convert, Excel natively supports CSV. It just doesn't read beyond a million lines, it presumably won't warn you about that, and if you save the file again (as CSV), the lines will just be gone.


Interesting. We might see an increase in the amount of wrongly reported data as a result of the global lockdown. One can only wonder what the consequences of data underreporting might be.

This reminds me of the event below; you can read the full story at https://leveragethoughts.substack.com/p/making-investment-de...

In 1976, the UK government, led by James Callaghan of the Labour Party, borrowed the sum of $3.9 billion from the International Monetary Fund. The granting of the loan was based on the condition that the government fiscal deficit be slashed as a percentage of GDP.

A couple of years later, the chancellor of the exchequer at the time of the IMF loan said the following:

"If we had had the right figures, we would never have needed to go for the loan."

That's right!! The decision to borrow money from the IMF was based on wrong data. The public borrowing figures which prompted the UK government to seek the IMF loan in 1976 were subsequently revised downwards.


And of course they would never have needed to borrow at all if they had just let Sterling float and not taken a US dollar denominated loan in 1975.

The problem was a government still operating as though Bretton Woods was in place five years after Nixon ended it.


640k deaths ought to be enough for any country.


Excel is NOT a database. I don't understand the people who use it as such (and I've come across many who do).

Then they're surprised when it all goes tits-up.


If you're thinking regular office workers, they know of no alternative.

Excel works enough for small data, particularly when you don't do complex queries on it. The more you know of it, the better it works. The only other thing that offers similar benefits to Excel but works as a database is MS Access, but the mental model behind it is too complex for your average office worker who wasn't trained in it, and like most database systems, requires a lot of up-front work with figuring out the schema, and doesn't particularly like the schema being modified later on.

As far as I can tell, there's literally nothing else out there. No, random SaaS webapps du jour don't count, because they're universally slow, and also store the data in the cloud, instead of the local drive.


The mix of Excel & Access as a new application would be a killer app. Hasn’t someone already built that in the cloud?


Excel may not be a database solution, but it is the most commonly used database solution.


It's hard to say from here what would be the best thing for them to use, but simply using a modern spreadsheet format instead of the ancient xls seems fine for interchange.


There's not much to stop the devs from using open source tools to create a PHP web frontend and db backend for this data - and secure it as well.

Heck, if they still need to export, they could do that from the data too.

Sure - use Excel for POC, but get that DB backend up pronto.


I work in an office that relies way too much on Excel.

To consider your solution, first showstopper, it needs a server. We don't have a server, nor anyone who knows how to manage one. We'd need to ask IT, that will take months and they'll require a budget transfer, so we'd need to request it to management (which will need a business case to convince) and involve the finance guys. We can't just plug a RaspberryPi into the wall, not only that would get me fired, but also it wouldn't be able to connect to anything without the company's certificates for the proxy or whatever.

Second, we need people who can code in PHP (and their backups when they leave). Probably in practice we'd need IT to do that, so that's more months and budget required.

Obviously anything stored in the cloud is out of the question, just the authorizations and contracts to do that would take a year.

So in the end it ends up as a shared spreadsheet.


I'm sure, at a push, a PC could be co-opted to do the work while a server (even online) could then be brought up to speed and the data migrated - but that's your use case.

What we're talking about here is a government department who do have access to servers, but choose not to use them (or so it would seem).


How do you share the spreadsheet? Shared network drive? By email? With Excel?


Shared network drive usually


Wish we knew more details.

Maybe helpful for people that don't know... XLS has a 64K (65,536) row limit. The newer XLSX has a 1M (1,048,576) row limit.

Or maybe they were putting people in by columns and hit the 16,384 column limit?


Or maybe it wasn't excel at all.

One number the UK government doesn't release is how many people are tested. They have spent the summer championing "testing capacity", which we found out at the start of September was a lie. They release the number of tests done, and the number of people returning positive. They don't release the number of people tested -- if you're tested twice, that counts as two on the number of tests, but one on the number of people.


The trouble is, that entire narrative about the testing capacity being a lie because of secret leaked figures about the number of people tested was itself a massive lie.

For context, the Government publishes a stat for the number of people tested for the first time ever each week (used to be daily, but that was dropped in part due to people misusing it). Certain publications and politicians - starting I think with the Times - pushed the narrative that this was the real number of people being tested, and that the reason it was so far below the stated capacity was because the real capacity was far lower than the Government claims: https://archive.is/n3Yku

This was bullshit on multiple levels - there are several important uses of testing, like routine screening of NHS frontline staff and hospital patients being admitted and discharged, that result in people being retested who've been tested at some point previously, and those obviously use actual testing capacity on people who haven't been tested that day or week. (Testing two samples from the same person in short succession, on the other hand, is unusual and not general policy.)

Not only that, the people tested number is from pillar 1 and 2 testing in England only for the week ending the 2nd of September, whereas the capacity figure seems to be for all pillars in the entire UK at around the 12th or 13th. It makes absolutely no sense to compare them. And the cherry on the top is that I'm pretty sure this number is from before the problems with test shortages and testing delays that they're implying it explains.

The not-so-secret weekly report in question with the number of people tested for the first time is here: https://www.gov.uk/government/publications/nhs-test-and-trac...


We know capacity was a lie because people were unable to get tests despite the total number of tests being below claimed capacity.


(This comment was posted before we merged the threads from https://news.ycombinator.com/item?id=24685911. The current article confirms that it was Excel.)


When I use Excel to open a CSV with too many rows, I get an alert saying 'File not loaded completely'. I have to dismiss it by clicking 'OK' before I can interact with the spreadsheet.

So Excel isn't silently discarding data.


... Yeah, but if you have an automated script compiling the reports into one file, and you forget to add correct error handling .... you end up with this mess :-(

I suspect that wherever this foul up happened there wasn't someone sitting and clicking through sheets ignoring errors (I hope).
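
As an illustration (not what PHE actually ran), the usual fix is for the pipeline to check the format's limits up front and fail loudly, instead of letting a bare except swallow the problem. The record layout here is invented and the final line raises on purpose:

    XLS_MAX_ROWS = 65_536      # legacy .xls row limit
    XLSX_MAX_ROWS = 1_048_576  # modern .xlsx row limit

    def check_fits(row_count, fmt):
        """Raise before writing if the data can't fit in the target format."""
        limit = XLS_MAX_ROWS if fmt == "xls" else XLSX_MAX_ROWS
        if row_count > limit:
            raise ValueError(f"{row_count} rows won't fit in .{fmt} (limit is {limit})")

    # Example: 70,000 case records destined for an .xls template.
    records = [{"case_id": i} for i in range(70_000)]
    check_fits(len(records), "xls")   # raises instead of truncating silently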


The BBC reports the problem was saving spreadsheets with more than 65536 rows as XLS. Can you verify Excel also throws an error for that case?


Yes, you get a warning that explicitly says "Data beyond 256 (IV) columns by 65,536 rows will not be saved."


>As a consequence, each template could handle only about 65,000 rows of data

Oh? Like 65,535 or so? It seemed weird at first that, even as old as XLS is, only 16 bits were allocated to the maximum row count.

But then maybe not. Each row might get an ID, so that's 16 bits per row. I wonder if, when XLS was designed, they considered it very unlikely many people would have a 500MB database of empty rows and then would need more data added to each one?

Simplified of course; I bet even the original format had some explicit row identifier that could be shortened with optimizations and increment tags.


The xls format is pretty old. The newer formats (xlsx, xlsb, and so on) allow for larger spreadsheets.

The reason the spreadsheets in the old format couldn't grow was probably compatibility. The reason for the limit in the first place was probably to reduce the footprint on disk and in main memory.


XLS does in fact have a 16 bit row counter :/


That should be enough bits to do everything a computer will ever need to do.


You'd be surprised to hear how many big organisations still use Excel for their reporting needs. I used to work at a company with 35,000 employees. Their payroll reports were generated using an MS Access database and reported using Excel. It was a massive manual job every week, as they had weekly, fortnightly and monthly payroll cycles! It's very easy to lose files and/or make an error in the calculations. But nobody bothered to change it.


We had a telco vendor use excel to manage all their configuration.

It was a weird dual source of authority system, where the excel configuration was used at install / upgrade time, but you could change the config after the install at runtime. So you had to merge the active configuration into excel, upload the excel document to some server that customers didn't have access to that turned the excel file into a config file, then you could use that config file to upgrade the system.

It was as problematic as you could imagine, lost configuration, lots of macros, etc. Eventually they made some improvements, but I left the industry so don't know what the current status is.

Enterprise / Telecom solutions at their best.


I'm not surprised. I've seen all kind of shit working for enterprise clients. Only one of them adhered to very high standards and had processes for everything. They were a military contractor and most of the clients were ex-military.


See, they had far less than 65,536 employees, so Excel was fine for their workflow!


lol.. These days it can support up to a million employees! :D


WTF, they have a big Azure deal with Microsoft, why the hell would they be using Excel as the data source?

IT ineptitude of the highest order, although I'm not surprised having been involved with government IT previously.


So easy to pass judgement on critical software decisions from the outside, with absolutely no knowledge of how they work or their training, the pressure from Government, the previous conditions that they found themselves in when a pandemic hit.

Who are "they"? The Government? GDS? PHE? NHS England? NHS Digital? NHSx?


Yes, with direct previous experience and knowledge of large UK Government IT projects, how they were integrated and how staff were trained, I feel I'm well placed to comment harshly.

"They" in this case was the UK Government, and part of the goal was creating secure backend systems specifically for the NHS as a data store, Public Health England would absolutely have access to it.

Even the most basic of checks at the start of the project would have highlighted that Excel was not a proper solution for the application (Even if the developer(s?) had used the newer(?!?) XLSX format rather than XLS) - which highlights that there was no proper oversight as to how the system was constructed.


The most consequential Excel error was the 2010 Reinhart-Rogoff error in the paper "Growth in a Time of Debt" https://www.nber.org/papers/w15639

The paper claimed that average real economic growth falls to -0.1% when national debt rises to more than 90% of gross domestic product (GDP). When you correct the error, it shows average growth of 2.2% instead.

Paul Ryan used it in the US for the Republican budget proposal to cut spending, and it was also used in the EU to justify the austerity policies that hurt people.


I think you're overstating both the paper's impact, and the significance of the error.

Most such papers are used to support existing policy preferences, not drive them. In addition, the error didn't reverse or erase the correlation, it diminished it, and removed the inflection point from the curve.


> I think you're overstating [...] the paper's impact

Krugman [0] called it "surely the most influential economic analysis of recent years" and adds that it "quickly achieved almost sacred status among self-proclaimed guardians of fiscal responsibility; their tipping-point claim was treated not as a disputed hypothesis but as unquestioned fact."

However, indeed, he cautions that "the Reinhart-Rogoff fiasco needs to be seen in the broader context of austerity mania: the obviously intense desire of policy makers, politicians and pundits across the Western world to turn their backs on the unemployed and instead use the economic crisis as an excuse to slash social programs. What the Reinhart-Rogoff affair shows is the extent to which austerity has been sold on false pretenses."

I'd say the paper made it certainly easier for the Very Serious People[1] to push through their austerity agenda.

[0] https://www.nytimes.com/2013/04/19/opinion/krugman-the-excel...

[1] Krugman's derogatory term for duplicitous policy wonks (like Paul Ryan) that pretend to be thoughtful and serious and concerned about the nefarious long term effects of debt when discussing stimulus and spending and social programs, but then typically turn around and implement huge tax cuts without second thoughts...


Krugman's editorials are polemics, not economics. I don't think anyone should trust the content of an opinion piece which purports to describe the logic, motives, and rationales of ideological opponents, as was the case in your linked article.


> Krugman's editorials are polemics, not economics.

The discussion here is not about economics. The discussion is about the impact of the (flawed) Reinhart-Rogoff paper, which one commenter thought was overstated.

> I don't think anyone should trust the content of an opinion piece [...]

Krugman is of the opinion that said paper had 1. a big (and 2. a deleterious) impact. For the first contention (which is, again, the pertinent topic), you don't have to rely on Krugman's expert assessment, though, you can look at the evidence: the infamous 90% inflection point had been quoted all over the place (as for example in the WaPo editorial cited by Krugman).

Lastly, what makes an article describing "the logic, motives, and rationales of ideological opponents" intrinsically untrustworthy?


Krugman op-eds in the NYT are not academic papers.


The most salient point of the paper was the inflection point of 90% for sharply lower growth. It was and still is used by proponents of fiscal austerity to argue against fiscal stimulus.


Here is an excellent video about this subject (in French, but the auto-translate subtitles are understandable) https://www.youtube.com/watch?v=yeX_Zs7zztY

To summarize:

- The first two thirds of the video is a retelling of the story, the last third is an analysis of the data

- There are two problems in the original paper: a weird way of computing averages, and a mistake in their Excel file.

- There is a correlation, but it is weak (R2=0.04), and if there is an inflection point, it is around 30-40%, not 90%.

- That paper is likely to have been selected by politicians to support their policies instead of influencing them (confirmation bias).


It was literally cited by Britain's finance minister George Osborne as justification for UK government austerity:

> As Ken Rogoff himself puts it, "there's no question that the most significant vulnerability as we emerge from recession is the soaring government debt. It's very likely that will trigger the next crisis as governments have been stretched so wide."

https://web.archive.org/web/20100414205630/http://www.conser...


People often cite papers which support their positions, but correlation does not necessarily imply causation, and the paper is rarely decisive.


Not the paper on its own, maybe, but it bolsters other arguments. If someone already wants austerity, an authoritative paper saying that's better will make them want austerity more, and gives them a strong argument in discussions with opponents, and may sway people who are unsure.


Fun semi-related story. Way back when, while I was squishing some Y2K bugs for a Saudi bank that catered to high net-worth individuals I was asked to look into why the AS400==>>db==>>Excel spreadsheet was "showing something screwy every now and then".

Spoiler alert: it was a SMALLINT error that had been 'patched' by someone previously (legacy code) who worked around the bug by rounding account balances to the million (since it was a reporting function rather than an accounting function, it was a 'sort of OK' hack). The bug re-appeared as undefined behaviour when transaction and/or account balances went into the billions.


People keep putting the blame on Excel but it supports more columns than most implementations of SQL databases. MySQL for example supports only 4096 columns. SQLite defaults to 2000. That's way less than Excel's 16384.


True, but they don't need to relate 16384 types of data. It sounds like they were using one column per record, rather than one row per record. So if they had a sensible data model 2000 columns should be more than sufficient.
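
As a rough illustration of that reshaping (the field and case names here are invented, nothing from the actual PHE files), going from one-column-per-case to one-row-per-case is a single transpose in pandas:

    import pandas as pd

    # Wide layout: each case is a column, each field is a row.
    wide = pd.DataFrame(
        {"case_0001": ["2020-09-25", "positive"],
         "case_0002": ["2020-09-26", "negative"]},
        index=["test_date", "result"],
    )

    # Long layout: one row per case, one column per field.
    long_form = wide.T.rename_axis("case_id").reset_index()
    print(long_form)   # columns: case_id, test_date, result - one row per case

With that layout you only ever need as many columns as you have fields, and the row limit (a million-plus in XLSX, effectively unlimited in a real database) is the constraint that actually scales with case numbers.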


You're on to something. The number of tables in MySQL, SQLite, and Excel is virtually unlimited. It seems like that is the lowest common denominator ensuring compatibility and data integrity when case numbers grow beyond bounds.


Amazing comment, sorry to see you're getting downvoted. ALTER TABLE upvotes ADD COLUMN threatripper_24691123_upvote INTEGER DEFAULT 1 NOT NULL;


I have done a couple of gigs in life sciences, and Excel is one of the big reasons VB.NET is still relevant.

I have met several researchers using Excel; not R, Python, or Julia, nope, plain old Excel, eventually with some VBA macros.

The more savvy ones eventually ask IT for a VB installation when they've outgrown the VBA capabilities and carry on from there with either small Windows Forms based utilities or Office add-ins.

Any attempt to replace those sheets with proper applications has gotten plenty of pushback until we basically offered enough Excel-like features in the new applications.


This was me when our online meetings exploded and I had to record a ton of metrics manually without any prep or software. Now there is this disgusting, VBA-augmented Excel sheet with a million hacks being read by the C-levels to make very important changes. I can't vouch for a single cell being accurate at this point. Even the few automatic metrics from the meeting software are misleading from the get-go, but then they meet my black-box of doom.


While I agree with the general sentiment that Excel is normally misused, like in this case, it's also capable of handling huge amounts of data thanks to Power Query and Power Pivot (1,999,999,997 rows per table, 2 billion tables max, the docs say). You can't load the data in the grid; you need to use the data model. Using Power Query properly also solves most problems with Excel guessing types wrongly when importing CSV and other external data. I'd strongly recommend any Excel user to learn Power Query and Power Pivot.


There's a long history of spreadsheet errors and literature on same going back a few decades now.

In a business sim class in college a couple of decades back, I discovered that the Lotus spreadsheets (as I said: a couple of decades back) had a totalling error which double-counted individual row totals in the bottom line (everything was twice as profitable as the spreadsheet indicated).

At an early gig, one of the senior developers instituted a practice of code walkthroughs on projects (only a subset of them). One of these involved, you guessed it, a spreadsheet (we used a number of other development tools for much of our work), in this case Excel. Again, numerous errors which substantively changed the outcome of the analysis. One of the walkthrough leader's observations was that you could replace all of the in-cell coding with a VBA macro making debugging far easier (all the code and data are separated and in one place each).

The particular analyst whose project this was: he insisted to the very end that this "wasn't a program" and he "wasn't a programmer" and that the walkthrough didn't apply to his situation. Despite the errors found and corrections made.

At the time (mid 1990s) the walkthrough lead turned up a paper from a researcher in Hawaii on the topic. I'm not certain it was Raymond Panko, but his 2008 paper (a revise of a 1998 work) discusses the matter in depth:

https://web.archive.org/web/20070617041554/panko.shidler.haw...


I wouldn’t blame Excel. The cause was gross incompetence.


Was just reading the chapter about using Excel as a database in this book two days ago: https://www.goodreads.com/book/show/39074550-humble-pi

Apparently it's super common, which fills me with horror. But these guys managed to take it to 11 by abusing it in yet another novel way.


Definitely recommend 'Humble Pi'. Funny as well as informative.


I hate solving problems and not knowing how I did it, so this particular experience has stuck in my craw.

Trying to develop a budget to pay off debts, my partner made this elaborate Excel spreadsheet and the output was that basically she had no spending money and I had very little, until one or both of us got a raise. It was far more austere than either of us were willing to go. So I started over using a different equation for 'fairness', and a different layout because something about her tables was just confusing and messy. When I was done, we had $300 a month of extra spending money between the two of us, despite using the same targets for pay-down, savings and bills.

I spent about 90 minutes poking at her spreadsheet and mine looking for the error and never did find it. I don't know what the right solution is to this sort of problem, at least for non-developers. But if I was skeptical of Excel going in, I was doubly so after that. Especially for something that is going to be used to make decisions that will affect every day of your life.


You made two different spreadsheets, they came up with different results, and you're blaming Microsoft for the error? Really?

Excel, whether you like it or loathe it, is in such wide use around the world that $300 math errors would have been noticed a very long time ago. I could believe that there are still many lurking bugs with obscure corner cases, nasty floating point rounding minutiae and so on, but I would bet (checks spreadsheet) $300 that your mismatched results were solely due to your own failures.


I've re-read the parent comment a few times, and I really don't see how this is your takeaway. The author is complaining that they find Excel difficult to inspect and different spreadsheets difficult to compare. There's no claim that Excel did a calculation wrong.


[flagged]


Except programming languages state clearly what's going to happen. Excel... eh, not really. It's hidden.


But they aren't noticed. The classic blunder is dropping the last item off of a summation, and nobody notices until they add another item and things look janky.

Those happen all the time, we just only hear about it when the consequences are outsize.


You are writing about a problem that can be solved if you know what you are doing:

Store your data in a table ( https://www.contextures.com/xlExcelTable01.html ), then make a pivot table out of it.

You will never have problems with missing data.

In fact good practice is to add a check, just to see if your pivot table was refreshed and the data there matches the data in the source table. Just like you make tests in software.


I always store data in tables for this reason, but last week I ran into a problem because someone else had added new data to my sheet somehow without the table range being automatically adjusted. The table looked visually correct but when you sorted or filtered it, the rows towards the bottom were not affected. So sadly, this is not a silver bullet, but it does help a lot.


You could say the same thing about writing tests.

There's nothing wrong with Excel. Build an organized spreadsheet with clear separation of data and presentation, and write checks throughout and you will end up with a perfectly error-free workbook.


Nothing wrong? Excel autoconverts values if it thinks it knows the data type of a column, and Excel treats 1900 as a leap year. Genetic researchers changed the names of genes to prevent autoconversion errors.


And JavaScript autoconverts all your numbers to floats, introducing errors every time you touch them. Meanwhile most other popular languages will happily wrap your fixed-size integers around on overflow, or silently perform integer division with truncation when you don't append a decimal fraction part manually to any of the numbers.

Point being, all tools come with strange caveats that one needs to be familiar with. A nice feature of Excel is that the caveats are visible, because unlike writing code, Excel is reactive and interactive.

I mean, some of the stuff Excel does to your data is downright idiotic wrt. the common use cases, and probably exists only for the sake of backwards compatibility. But let's not pretend you don't need to pay attention if using a tool you're not proficient with.


Yes, and JS is considered a shit language by everyone who isn't a JS fanboy. What's your point?

Most other languages come with libraries for arbitrary precision arithmetic where people who know what they are doing^tm can get the right answers.

>A nice feature of Excel is that the caveats are visible, because unlike writing code, Excel is reactive and interactive.

Unlike in any programming language I know, in Excel I can change the results by changing the presentation of the data.


The typical users of JavaScript are programmers; they know about those errors. The typical users of Excel most of the time aren't aware of Excel's weaknesses. And good luck finding these errors in large data sets. There is a difference between visible and obvious, especially if you do not expect errors.


>>But let's not pretend you don't need to pay attention if using a tool you're not proficient with.

Decades of experience working in Office Environments tells me many many many many many people do not pay attention period, their experience with the tool is irrelevant.

My experience also shows that inexperienced people do not know WHAT to look out for, especially in excel so they are often more susceptible to mistakes.

This really comes into play when working in larger organizations where excel workbooks are passed around from person to person, often existing for years or decades at a time where people using the workbook are separated from the person that created the workbook.

Try figuring out an Excel spreadsheet created 10 years ago by people no longer with the company that has several data links importing data from all over the place.....


And a band saw will happily cut off your hand if you push it into the moving blade. That's why the best practices of operating one suggest that you never keep your hands anywhere near the cutting edge.

Excel's idiosyncrasies are very much on the level of the typical productive computing tool. They are less maddening than half the featureset of C++, three quarters of the featureset of Javascript, and 110% of the featureset of bash.

Despite that, people manage to get work done using bandsaws, C++, Javascript, and the occasional shell script.


Band saws, C++, and JavaScript have a different user group than Excel. Most of their users have been trained on the tool they use. If Excel were a band saw, there would be a lot more people missing fingers.


The gene in question was "Membrane Associated Ring-CH-Type Finger 1", or MARCH1. Excel converted this to March 1st. They renamed it to MARCHF1


They renamed 27 genes in the past years to avoid Excel errors https://www.engadget.com/scientists-rename-genes-due-to-exce...


You can find idiosyncrasies about any tool. Excel is a fine tool for 99% of use cases. Sure, it's quirky here and there, but the sheer amount of work that gets done globally in Excel should be testament enough to its value. Highlighting edge cases to bash on the tool is either myopic or intellectually dishonest


Let's consider three claims:

    1) X delivers tremendous value.
    2) X has some significant problems.
    3) X has no problems.
Obviously 1 and 3 are compatible. I claim 1 and 2 are also compatible (and, actually, not uncommon...).

You said that "[t]here's nothing wrong with Excel", and someone responded with an example of a problem that they considered significant. If we read "there's nothing wrong with Excel" as a strong claim of 3, then that's obviously a refutation of your claim. You could argue that the specific problems are not actually significant enough to rise to the level of notice. You could argue that they are not, in fact, problems at all. You could argue that you didn't, in fact, mean to make claim 3 in any strong sense (which I think is what was actually going on here - that's valid, English works that way).

You've instead interpreted it as a flawed refutation of claim 1 (and took the opportunity to demean your conversation partner). I don't think that's productive.


Excel is good at what it was developed for. Unfortunately, it is often used for other things; for example, in this case as a database. This means that departments build solutions that work at first sight but have errors in special cases that are difficult to detect and debug.


Yes because they couldn't use their tools correctly.


I'm not so sure he claimed that Excel can't add and subtract numbers. I think the argument is that even two people can make such a mess of a spreadsheet that nobody can really untangle it later.


That's about the size of it.

Tack on that developers have other, better options, but if you're not a developer I don't have a good solution for you, and that I don't like that state of affairs.


The OP said that using a spreadsheet program can be messy and prone to mistakes, and implied that there could exist alternative approaches. The OP never said a bug in Excel was responsible for the mistake or blamed Microsoft for corrupting the calculation.

In the same sense, one could say manual memory management in C can be messy and prone to mistakes; it doesn't mean that one believes C should never be used in any circumstance, and it doesn't mean one's trying to blame the C compiler, libc or Dennis Ritchie for corrupting one's program.

Hypothetically, if someone writes the following reply.

> The C programming language, whether you like it or loathe it, is in such wide use around the world that a NULL pointer dereference would have been noticed a very long time ago. I could believe that there are still many lurking libc bugs with obscure corner cases, but I would bet the memory corruptions were solely due to your own failures.

It would completely miss the point.


(Side note: there were sarcastic/affectionate calls of "curse you, Dennis Ritchie!" at a previous workplace.)


The only real way to solve issues like this is a step-by-step reconciliation between the old model and new. Each assumption and calculation must be tested to confirm that they replicate the same output. This is extremely difficult if the underlying structure/assumptions are different (e.g. monthly views vs. transactional views, etc.).


That rework based on suspicion is typically how it’s done. :) Add to that subject matter expert review. For example, I used to work on a sheet of scoop size calculations for industrial digging and nothing from it would reach sales staff or customers until a couple of the crusty old experts played with it and vouched for the calculations.

Imperfect measures, but they keep the work within the grasp of everyday/business users in a way a formal test suite wouldn’t necessarily.


I didn't rework it because I thought the numbers were bad. She and I agreed to ground rules, [but] I could not commit to the consequences even though the 1st order consequences came out in my favor. The second-order consequences would be that I'd tricked her, intentionally or otherwise, which is decidedly unhealthy in a partnership. I had contributed 2/3rds of the income but 2/5ths of the debts (the latter being the main sticking point) so while 'fair' was not going to be 50:50, I'm not some monster who's going to buy high end electronics while my partner is eating ramen.

I expected worse news for myself after re-doing the numbers, and in fact I ended up with a little bit more spending money with the new numbers.

I mention this only because I have come to expect policy people to stick to their initial narrative a little more enthusiastically than I could. Getting them to review data and policy unless there is a clear benefit for themselves is difficult. It's very common for people to get promoted by challenging this friction in ways that benefit everyone (because the deciders couldn't connect the dots). It's always newsworthy, and I wish it weren't.


Make a third spreadsheet from scratch and see if it agrees with either of the existing ones.

There's some saying about never going to sea with two spreadsheets.


the secret to using Excel well in large complex tables/spreadsheets is the same secret as coding with large complex programs.

You use structure, design away the complexity, implement constraints, error checking and tests.

Note it's true that an inexperienced person is liable to make spaghetti code either way, but there's nothing fundamental about a spreadsheet format that makes it inherently unusable for a lot of small/medium problems. Indeed it even has benefits in terms of interactivity/ turn around/accessibility.

Of course, there's also a lot of problems with Excel and reasons not to use it like a database or anything which fundamentally relies on maintaining data integrity, and it's liable to be the first tool reached for by the non-experienced, who will generally make a mess of things large and complex as a rule.


> You use structure, design away the complexity, implement constraints, error checking and tests.

I've never thought of Excel as an ideal tool for any of these things. I'm struggling to think of how it would have change-verifications/tests in the same way that software projects do.

(I'm certainly open to the possibility that I'm ignorant/unaware on this subject)


depending on your point of view, you don't do them in the "same" way (i.e. with separate unit tests or compiler-enforced safety), and explicit diffs are hard/impossible. And ideal is dependent on the task (it's ideal for me to get a working interactive graphic + visualisation to a user behind a corporate firewall via email in the same afternoon I get the request; I wouldn't actually choose to do 'programming' in Excel, but I consider a budget small and simple and programming is probably overkill)

What you do is much closer to old-school low-level programming: define the relationships between your tables and variables well, set up explicit corresponding arrays of 1s and 0s that are themselves error checks on the underlying structure/contents of your tables, calculate things in two different spots/ways and verify equivalence holds, and use simple red/green conditional formatting to draw attention to when things fall outside of expected state, etc.


This is precisely why I went with YNAB [0], started budgeting better and saved quite a bit of money.

Everyone who sees me paying for an app (I'm from India) asks me - "why can't you just use Excel and do the same thing for free?". Excel sure is powerful, but in real life, your mileage may vary.

[0] - https://www.youneedabudget.com/


If you're trying to figure out why the spreadsheets had different results, I'd start here:

> I started over using a different equation


Try as I might, I've never found a way, before or since, to push piles of stuff around and end up with a larger sum total.

I suppose it's possible I've discovered Dark Money but I doubt it.


I wrote this a decade ago and have been using this for all of my personal and business spending: https://zetabee.com/cashflow/

It's not automated, but it can help project future income/expenses/balance in an Excel-like UI.


[flagged]


Please don't break the site guidelines with name-calling and personal attacks, regardless of how bad another comment is or you feel it is. We're trying to avoid a culture where people pick on others like that.

https://news.ycombinator.com/newsguidelines.html


Umm, who is qualified to use Excel if both scientists and economists frequently have errors in their published works? https://www.economist.com/graphic-detail/2016/09/07/excel-er...


Umm, who is qualified to write programs in any programming language, if both programmers and developers frequently have errors in their code?

https://cve.mitre.org/cve/


That's exactly right: no one is. That's why no one uses C++ or JavaScript in safety-critical devices. Even in 'normal' industry it is a huge problem and we need safer languages, i.e. Rust (merits to be seen)


Err... C & C++ are the primary languages used in typical safety-critical systems. I've used them.

JS... of course no one uses that, mainly because the abstraction level is too high; the number of layers of non-safety-critical code massively outweighs the application code.

The critical thing about safety-critical code isn't really about the code... it's the whole development process. Implementation is just one small part of that process.


People went to the moon and came back long before Rust existed.

Lots of operating systems were written in C or C++ and they allowed to achieve many things.


> Thats why noone uses C++ or javascript in safety critical devices

Ooh, no that's fairly incorrect. Javascript for sure is less common because it's the web browser language, but C and C++ are used all the time in embedded safety critical designs.


There is of course a difference between complicated scientific models and someone budgeting some expenses.


This reminds me of developers talking about the git CLI. At some point you have to stop blaming the users.


I wish I could remember the movie (it appears to be a reference to Al Capone, of all people), but there's a scene that gave me the tingles. The leading man, about to exact some revenge, turns to the antagonist and says,

"You've mistaken my kindness for weakness."

If I worked as a mental health technician I'd be complaining about how bad the system is for trauma survivors and neuro-atypical individuals. But I'm not. I'm a developer who has focused on - or been forced to look at - people struggling with User- and Developer-Experience issues for much of my career. If I'm talking to you about your struggles, you get sympathy. If I'm talking about policy, or policy makers, I'm going to give you a taste of what my 'customers' have to go through every day. And I'm not going to feel a bit sorry about it. If you dismiss it, we may have a problem.


[flagged]


Random Karens don't create complex spreadsheets. They copy paste numbers into a spreadsheet someone else has created for them.


The person above was making a home budget (and failed) - which also is not a complex spreadsheet.

Also, you could probably find 10 different templates in google.


Spreadsheets like Excel or Google Docs, with the expandability offered by VB and JavaScript, allow for rapid development of applications that require data storage and data interpretation in one package. I’ve seen them used and abused for several functions over the course of my career.

Apple/Claris FileMaker has had this niche for a bit, but it's only been a niche.


I call that the Microsoft Office Syndrome.

Tools such as Excel or PowerPoint are almost "too good", in that they allowed users to come up with uses which aren't what the software was made for.

As others have said regarding Excel, it's not so much what it's capable of but what you do with it and how you do it.

I've seen people using Excel as a tool for trading, as an inventory database for a warehouse, as a shared database to track task progress, as a Gantt chart, as a calendar and so on.

PowerPoint is the same. Who hasn't been given a PPT file as a manual for something?

"Oh, it's all in the powerpoint".

Or been given the same kind of file to familiarize yourself with something at your (new) workplace. The issue is you rarely have any of the material the speaker used to present the whole thing, so you're left wondering about the meaning of it all while reading bullet point lists...

No, a powerpoint is not a proper doc...

So in a way, I have to applaud MS as they did great with these but almost too great...


"But you wouldn't use XLS. Nobody would start with that."

Oh. "Every" sales and business person probably would. Probably.


Well they'd at least start with XLSX unless they've got a very old copy of Office


The thing that puzzles me is, what on earth happened on the 28th? A really anomalously low number of cases were reported then, and it's only got more weird after this fix added zero unreported new cases on that day and much higher numbers on the days before and after it. Something definitely doesn't seem right here.


Reminds me of my days working in fiance. I was basically a broken record that just repeated: Excel is not a database


typo :)


I would hope that any decent data engineer would know not to do this. Are there avenues for fully-employed engineers to help governments build these systems properly? I'd honestly be happy to just donate some of my free time to help these efforts.


Interesting, I was on a conf call 3 hours ago that was an almost identical use case. Excel gets used because word comes to a director at a health department that says "we need to collect testing results from 500 different facilities in a way their, and our, existing systems do not support" and they're given maybe 3 days to get it done along with the rest of their job. Putting together a technical team would take a week minimum, let alone requirements gathering and analysis. So what do they do? They fire up Excel, add column headers that represent the data they need, and then send it to a Sr. Manager and tell them to make it happen.
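
For what it's worth, even that ad-hoc workflow can be aggregated and sanity-checked automatically once the filled-in templates come back. A hedged sketch, assuming the facilities drop their workbooks into one folder and that pandas is available (folder, file and column names are invented):

    from pathlib import Path
    import pandas as pd

    frames = []
    for path in sorted(Path("submissions").glob("*.xlsx")):
        df = pd.read_excel(path, dtype=str)   # keep identifiers as text
        df["source_file"] = path.name         # keep provenance for auditing
        frames.append(df)

    combined = pd.concat(frames, ignore_index=True)

    # Basic sanity checks before anything downstream consumes the data.
    assert combined["patient_id"].notna().all(), "missing patient IDs"
    print(f"{len(frames)} files received, {len(combined)} rows collected")
    combined.to_csv("combined.csv", index=False)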


I love how Excel has become a de facto database - even though it absolutely shouldn’t be.


The first rule of software is "If it works, it'll get used."

This is why people's email inboxes double as their TODO lists.


It doesn't even necessarily have to "work" (as in, work well); it just has to feel like it's the path of least resistance.

You can work in a department using Excel for this type of data collection and reporting, and you can continuously suggest to your superior that something more mature could be used. They would probably agree. As would the entire team!

But if the team of analysts are only properly trained in this system and it's all they know coupled with a huge backlog of cases and time pressures, then they're going to keep using the thing that causes least headaches in the short term. And that might be objectively worse to everyone involved but they just keep ploughing through.

Toxic culture, bad management, poor working practises and external pressures can force even the most sane of people to choose the worst technology on the basis that they perceive it as "saving time" in the short term; even when they know full well they're robbing Peter to pay Paul, they still do it.


In Norway there was a $14 billion calculation error based on Excel that resulted in the opening of new oil fields in the Barents Sea: https://translate.googleusercontent.com/translate_c?depth=1&...


Oh, Excel, how I hate to love you...

As is the case for many, Excel is one of my main tools. From financials to data gathering to electrical, mechanical or software engineering, Excel has always been there. It is fair to say I have made lots of money thanks to this tool.

And yet, every so often...

Many years ago a rounding error in a complex Excel tool we wrote to calculate coefficients for an FPGA-based polyphase FIR filter cost us a little over six months of debug time. I still remember the "eureka!" moment at two in the morning --while looking at the same data for the hundredth time in complete frustration-- when I realized we should have used "ROUNDUP()" rather than "ROUND()".
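
For anyone curious how a single rounding choice shifts fixed-point coefficients, a tiny illustration in Python (not the original Excel tool; the scale factor is a made-up example): Excel's ROUND picks the nearest integer while ROUNDUP always rounds away from zero, and once the coefficients are quantised that one-LSB difference per tap changes the filter response.

    import math

    SCALE = 2 ** 15  # hypothetical fixed-point scale for the coefficients

    def quantise_nearest(c):
        return int(round(c * SCALE))      # roughly Excel's ROUND (ties differ)

    def quantise_up(c):
        return int(math.ceil(c * SCALE))  # like Excel's ROUNDUP for positive values

    coeff = 0.123456
    print(quantise_nearest(coeff), quantise_up(coeff))   # 4045 4046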

Most recently, I was working with a client who chose to build a massive Excel sheet to gather a bunch of relevant data. The person doing the work seems to think they know what they are doing (conditional formatting and filtering don't make you an expert). This poor spreadsheet has every color in the rainbow and a mess of formulas. It's impossible for anyone but the guy who created it to touch it.

Here's a hint:

Do not mix data with presentation. Where have we heard that before?

This is one of my pet peeves with Excel. If you need to gather a bunch of data, do it. Treat Excel like a database (apply normalization if you can!) and keep it with as little formatting as you possibly can. Then do all the formatting and calculations on a "Presentation" sheet or sheets. Just don't pollute your database with formatting.

EDIT: Thinking about the UK problem, if they were working with a ".xlsx" file and accidentally saved it in ".xls" form, well, as they say, "There's a warning dialog for that".

What surprises me the most about these kinds of incidents is that people keep working on the same single document. In other words, no semblance at all of what the software business knows as version control.

Decades ago I adopted the idea that storage is cheap and always getting cheaper. If I am working on something critical, I never work more than one day without a backup. I make a copy and continue editing. In most cases I make a new copy every single day.
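
A sketch of that poor man's version control, for anyone who wants it automated (the file name is just an example):

    from pathlib import Path
    from datetime import date
    import shutil

    backups = Path("backups")
    backups.mkdir(exist_ok=True)

    # Snapshot today's working copy before editing any further,
    # e.g. budget.xlsx -> backups/budget-2020-10-05.xlsx
    shutil.copy2("budget.xlsx", backups / f"budget-{date.today():%Y-%m-%d}.xlsx")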


I'm wondering if there could be an opportunity for an open-source contact tracing system, similar to the allReady [1] application? Maybe something backed by a graph database such as Neo4J [2].

Looking at the allReady repo, maybe it isn't a great example since it hasn't been touched in years...

[1] https://github.com/HTBox/allReady

[2] https://neo4j.com/


This is clearly a case of someone not knowing the limits of a technology / how to select or use a technology. Yes, more modern versions of Excel support many more rows than 65k. As a "power" excel user in the early 2000s, it's tiring to see this same mistake made. And I might have blamed MSFT in 2008. But the "modern" world has taken an absurdly boorish response to this virus. This unfortunately looks like another example.


Forgot to mention: there's nothing inherently wrong with storing data in CSV files. One does have to think about consumption and processing. A Postgres database seems pretty reasonable to track information like this, but I wouldn't blame a CSV format for shortsightedness. We're not talking about actual big data here...


I've been teaching data curation for open data managers in governments around the world for the last decade. All the issues identified in this BBC article are covered in my syllabus.

If anyone's interested, here's the openly-licenced syllabus, in English and French: https://github.com/whythawk/data-wrangling-and-validation


I hope people finally understand (at least I hope Microsoft understands): their biggest asset isn't actually Windows, but Excel.

Some on HN will definitely come in with "(x)Office supports the MS Excel file format as well", etc.

There are trillions of dollars of revenue relying on Excel. In the best-case scenario, no one wants to work on, rework, or even touch that spreadsheet. Having 99% compatibility is not good enough.


Excel is great but I think at a certain level folks should start using Python and Pandas. The challenge is understanding when and getting the right resources to the right people.

https://training.talkpython.fm/courses/move-from-excel-to-py...
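
For a flavour of what that move looks like, here is a tiny pandas equivalent of a typical spreadsheet task (a SUMIF/pivot-style group total); the data is invented:

    import pandas as pd

    df = pd.DataFrame({
        "region": ["North", "North", "South", "South"],
        "cases":  [120, 80, 200, 150],
    })

    # Roughly what a pivot table or SUMIF does in a spreadsheet.
    totals = df.groupby("region")["cases"].sum()
    print(totals)   # North 200, South 350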


The conceptual leap from Excel to Python+Pandas is pretty huge. Especially if someone hasn't done programming before. That is part of the reason I created https://www.easydatatransform.com .


Many comments already talked at length on use-cases for Excel, so I won't go for a 'me too'.

Instead, I just want to share this video from Joel Spolsky, aptly titled "You Suck at Excel".

Turns out, most people do suck at Excel. Myself included.

https://www.youtube.com/watch?v=0nbkaYsR94c


> The problem is that PHE's own developers picked an old file format to do this - known as XLS.

> As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of.

Why would they even use XLS in the first place? CSV files have no such limitation, you can have CSV files that hold billions of records.
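
To that point, a minimal sketch of working through an arbitrarily large CSV without ever opening it in a spreadsheet, assuming pandas and invented file/column names:

    import pandas as pd

    total_rows = 0
    positives = 0

    # Stream the file in chunks so memory use stays flat regardless of size.
    for chunk in pd.read_csv("all_cases.csv", dtype=str, chunksize=100_000):
        total_rows += len(chunk)
        positives += (chunk["result"] == "positive").sum()

    print(f"{total_rows} rows processed, {positives} positives")

There is no 65,536 or 1,048,576 row cliff anywhere in that path; the only limits are disk space and patience.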


This is yet another, if subtle, example of how abstract models are disconnected from reality.

Another is the abysmal failure of all simulation-based models to be even close to useful.

https://karma-engineering.com/lab/blog/Solved


Excel has its place, and it is an excellent tool for some workflows. However, I have started to think that Microsoft should disable the option to save workbooks. That would kind of force people to use Excel as a numerical scratchpad with fast charting, which is where Excel excels.

(There is some sarcasm in this comment, but not that much)


I wrote an energy reporting system based on Excel. Based on that, I think Excel use falls into 2 categories

a, extremely simple reporting with few workbooks

b, serious misuse of technology

Billion dollar companies were born based on replacing Excel in workflows. The problem is not that they have an old version, the real problem is that they use such a system in the critical path.


It's unlikely this will have had much impact since most reported cases were only a few days late but come on guys, really?

Using an Excel sheet as a database? In week 1, that would be "not great" but could be accepted as being a fast solution that everyone could work with. This far into a pandemic I think we can expect a little more professionalism in data handling.


People use excel sheets as databases all the time because it doesn't require any specialist knowledge to build or use.

Sure, we techy types prefer a proper database with a web frontend and an API, but that requires significantly more skill to build than an excel file.

No database has the flexibility and flat learning curve that excel has.

Until a database manages to meet these goals, it remains a good tool for certain usecases.

We as engineers should be working towards making a real database that's just as easy to use as Excel, and only then can we complain that people are using Excel sheets for this kind of stuff.


> People use excel sheets as databases all the time because it doesn't require any specialist knowledge to build or use.

Test and Trace is a multi billion pound programme of work, involving huge companies (eg, Serco) interfacing with national governmental bodies (Public Health England).

We've had months to get something in place. One of the reasons organisations like Serco is used is their apparent expertise with IT.


The main reason Serco is used is their political connections; they really ought to have been banned from consideration after the fraud scandal: https://www.bbc.co.uk/news/uk-50806919

(my one-step programme for improving UK public contracting would be that after a significant failed contract the contracting company and its directors and any other companies they are directors of would be banned from tendering for a period; maybe a couple of years?)


Unfortunately the procurement rules currently in place prevent procurement personnel evaluating a tender from taking into account any prior performance. I'm unsure whether this is a result of EU legislation and common market rules, or whether it is just the UK.


Italy has legislation to forbid people who engaged in corruption/embezzlement from working with public administrations.

It also has a required certification for companies that work with the public sector that they weren't convicted of certain crimes (trafficking in illegal waste, extortion, organized crime, etc.).

So, it doesn't seem an EU/Single Market rule that you can't check this.


Please tell me that is not true. That is just...nuts! What is the rationale behind that beyond corruption?


One issue they have is that very few companies are willing to have the government as a customer. That's one reason why they keep going back to the same failed firms time and time again.

Reasons vary but the ones I've seen are generally:

1. Complex procurement regulations that take a lot of effort to comply with by contractors.

2. Low willingness to pay (both price-wise and time-wise).

3. Civil servants are nightmare customers who don't know what they want and are under no pressure to figure it out, so there's a lot of timewasting involved.

4. PR nightmare if/when things go wrong even if it's not your fault, as government is relatively open compared to other types of customers so easier to find out about problems, and lots of people automatically blame the private sector in preference to blaming government employees due to (misguided) assumptions of moral superiority of the public sector.


This is a very good point and a shame to see it being downvoted. I have been wishing there was an accessible database-ish software for doing my accounts, as the current mega-spreadsheet I use is pretty hairy and none of the specialised apps quite fit the bill. The closest thing seems to be Airtable, which while nice, is quite expensive.

Recently I was visiting hospital quite a bit (ante-natal) and witnessing the horror of the massive NHS form-filling software the nurses and midwives have to use. It's clunky and sprawling, but I think one of the main sins is its lack of adaptability. In the old days of pen and paper, a new form could be written, or adapted, and photocopied easily, and on-site. Now if a new field needs to be added or a process is changed, it needs to go (I imagine) to some centralised IT development office and be fed into a ticketing system where it might get changed in a few weeks or months. There's no room to quickly adapt at ground level, so you end up with things like this Excel sheet problem.


I'm surprised that MS Access didn't fill this kind of niche.

I know it's a pretty shitty program to use - Excel is much more user friendly. If only MS had made Excel with all the features of Access built in...


It depends which bits of functionality you want, but I think you are close to describing PowerQuery (Get & Transform) which is in Excel, which can ingest multiple data sources into effectively a relational database with hard datatypes and data-cleaning.

In PowerQuery & PowerPivot there are also no practical limits on the number of rows of data you can have (other than the impact on processing time, clearly hundreds of millions of records might start to be a problem). It's not quite access - it doesn't actually persist/store the data, just aggregates it from other sources which is probably what is required here.

Access is a little legacy and would have its own issues. I think in reality it depends on how you are ingesting data (which is probably hospitals submitting excel templates or similar?). Maybe these could be automatically aggregated and put onto an analytics platform? (e.g. PowerBi). Hospitals could maybe report it on a portal, but I'm not convinced that the leadership would have wanted to change the established process (because of operational focus, change reduction & risk).


For a very long time, it did.

Weaknesses of Access in no particular order: very poor multi-user/multi-device-access features. No real concept of servers. Buggy, had a nasty habit of corrupting its custom file format causing data loss (Excel never does this). Does expect you to understand SQL, data normalisation, etc. VBA just about serviceable for a spreadsheet but not good enough for a 'real' database-backed app.

Probably someone could try and make a better Access. There are surely lots of startups doing that already, along with hosting. All the "no code" tools that are out there, etc. Access had the advantage of coming with Office so lots of people had it already, whereas today's subscription based services that host your data have incremental cost.


LibreOffice Base is something of an attempt at a newer Access, but unfortunately it's far from ready for production use.

Base has very solid data storage foundations (including the ability to connect to real database servers) but the UI is very buggy and is painful to fight with. Note I said 'fight with' rather than 'use.'

Base seems to have a fair bit of potential, however, if someone were to pour some money and/or time into developing it further.


IMHO the biggest problem is the 2GB database size limit


Don't forget the formulas! Excel is essentially a functional reactive programming REPL. No RDBMS I know lets you easily define a table in which cells in a row depend on values from previous cells in a row, and keep this as an invariant.

People working on Excel replacements need to remember about this aspect of spreadsheets too.
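
For anyone who hasn't thought of spreadsheets that way, here's a toy sketch (plain Python, not real FRP machinery) of the formula-dependency idea Excel users rely on without naming it; the cell names are of course made up:

  # Cells hold either plain values or formulas over other cells; reading a
  # formula cell recomputes it from whatever the referenced cells hold now.
  cells = {"A1": 100, "A2": 250, "A3": lambda c: c("A1") + c("A2")}

  def get(ref):
      v = cells[ref]
      return v(get) if callable(v) else v

  print(get("A3"))   # 350
  cells["A1"] = 500  # edit one cell...
  print(get("A3"))   # ...and the dependent cell reflects it: 750

Real spreadsheets do this push-style over a dependency graph, but that implicit recalculation is the part most database front-ends simply don't offer.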


Sure. That's why I said it wouldn't bother me if this was in the first few weeks of the pandemic or for one-off ad-hoc analysis. I actually use Excel for that all the time, including for tasks where technically a db is the correct solution. Working is better than perfect and everyone knows (sort of) how to use Excel. To still be using it in production this many months in is not acceptable. Someone should have taken that Excel based workflow and set it up to run through a database, they had the time to do it.


There is also the fact that Excel has some rich editing tools. I have actually seen a system where most editing was done by exporting the relevant items to a spreadsheet, making your changes and reimporting.


> It's unlikely this will have had much impact since most reported cases were only a few days late

I'm not sure if you're being sarcastic or not, but in case you're not, this has meant c.16k people missed from having their contacts traced. That's time-critical. Delays of any number of days are very bad. People will likely die because of this error.

This is safety-critical software. It should be stress tested and have sanity checks. It should be impossible for this to happen.

For something like this system, part of a track and trace system costing billions, to fail because of a file size error is just unbelievable.


I was being serious. Distributionally, most of the cases were 1-2 days before the error was spotted. To be clear though, when I said "not much impact", I didn't mean that people might not die because of it, only that it wouldn't affect the large scale functioning (such as it is) of the tracing operation.


> Distributionally, most of the cases were 1-2 days before the error was spotted.

That’s always true though if the virus is spreading exponentially (dunno whether it currently does in the UK).


This is more true than it would be from pure exponential growth alone, especially the relatively slow exponential growth the UK as a whole currently seems to be seeing (doubling time maybe about a fortnight?) The issue was effectively dropping all results after a certain number from a particular location, so at the start relatively few results were dropped and it was spotted around the time that number started to become substantial. It's that hard cutoff where results suddenly start to be dropped that means most of them are in the last few days.
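
A toy simulation (all numbers made up) shows why a hard cap concentrates the dropped results in the most recent days: nothing is lost until the cumulative total crosses the cap, and after that every new day's growing batch is lost in full until someone notices:

  cap = 65_000
  daily_new = [int(4_000 * 1.07 ** d) for d in range(30)]  # ~7%/day growth
  total = 0
  for day, n in enumerate(daily_new):
      # Records from this day that fall past the cap and are silently dropped
      dropped = max(0, total + n - cap) - max(0, total - cap)
      total += n
      if dropped:
          print(f"day {day}: {dropped} of {n} results dropped")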


Sadly it takes about 24 hours to get tested and 24 hours for the lab result. That gives very little time to trace contacts before they start entering the asymptomatic-but-infectious-to-others stage. A one day delay is a problem.


The WHO estimated 750 million cumulative infections on Friday. This error is not unbelievable because the system is just a bit of political theater. There’s nothing more we can do against an endemic human coronavirus. It’s time to move on.


Yeah we should give up on controlling polio, measles etc as well. After all polio only kills 0.5% of cases as well. </sarcasm>


Hey, at least the UK has automated handling of data! Parts of Germany managed to not tell people they'd tested positive for a week or two because their manual, Excel-based process for handling test results couldn't cope with the increased load from testing people returning to the country - and they're meant to be the competent ones we look up to, according to our press. (I get the impression that German experts have a rather different view.)


Once you have a couple of thousand rows in Excel files, one file per day, possibly one per testing centre... possibly with inconsistent structure... non-standard date formats (even different in each row)... add to this the inconsistent namings ("Tested at" column with data: "Testing center 1, London", "TC1, London", "TC1 London", "TC1 LONDON",...), etc...

Yes, anything-SQL or whatever would be better, but deciding between putting all the data into SQL by hand, or "uhm... I'll think of something and do it tomorrow", me, being lazy, I'd pick the second option...

...until I'd get shamed on here. Maybe even a few days after.
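
For what it's worth, the normalisation that kind of mess calls for isn't much code, it's just tedious to get right. A rough sketch (the column names, centre aliases and day-first date assumption are all mine, not from any real feed):

  import re
  from datetime import datetime

  # Hypothetical rows as they might arrive from different centres
  rows = [
      {"Tested at": "Testing center 1, London", "Date": "2020-10-05"},
      {"Tested at": "TC1 LONDON",               "Date": "05/10/2020"},
      {"Tested at": "TC1, London",              "Date": "5 Oct 2020"},
  ]

  ALIASES = {"testing center 1 london": "TC1-LONDON", "tc1 london": "TC1-LONDON"}

  def normalise_centre(name):
      # Collapse case and punctuation so "TC1, London" and "TC1 LONDON" match
      key = re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()
      return ALIASES.get(key, key)

  def normalise_date(text):
      # Try a short list of known formats (day-first); refuse to guess beyond that
      for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y"):
          try:
              return datetime.strptime(text, fmt).date().isoformat()
          except ValueError:
              pass
      raise ValueError(f"unparseable date: {text!r}")

  for r in rows:
      print(normalise_centre(r["Tested at"]), normalise_date(r["Date"]))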


Cleaning data is a major part of data engineering; there are plenty of professionals capable of helping with this task.


> Using an Excel sheet as a database

How do you know it was Excel? I see no technical details in the posted article except "some files containing positive test results exceeded the maximum file size".


https://twitter.com/MaxCRoser/status/1313046638915706880

> The reason was apparently that the database is managed in Excel and the number of columns had reached the maximum.


Columns!?

You're meant to add additional records as rows! (Excel supports only 16,384 columns, but 1,048,576 rows.)

Actually, this was a pet peeve of mine: Especially in the early days of COVID, just about every official or unofficial data source was an absolute shitshow:

- Transposed data (new entries in columns)

- Pretty printed dates (instead of ISO 8601 or Excel format, or... anything even vaguely parseable by computer).

- US date formats mixed in with non-US date formats.

- Each day in a separate file, often with as little as 100 bytes per file. Thousands upon thousands of files.

- Random comments or annotations in numeric fields (preventing graphing the data).

We don't all need to be data scientists, or machine learning wizards, or quantum computing gods.

But come on. This is the most basic, data entry clerk level Excel 101 stuff. Trivial, basic stuff.

Put rows in rows.

Don't mix random notes into columns you might want to sum or graph.

Don't mix units.

Have headers.

Use a sane date format.

That is all.

</rant>
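
For contrast, the boring well-behaved version of such a file takes a handful of lines to produce (the column names here are purely illustrative):

  import csv
  from datetime import date

  # Rows in rows, ISO 8601 dates, a header row, no stray notes in numeric columns.
  with open("cases.csv", "w", newline="") as f:
      w = csv.writer(f)
      w.writerow(["date", "testing_centre", "positive_results"])
      w.writerow([date(2020, 10, 4).isoformat(), "TC1-LONDON", 128])
      w.writerow([date(2020, 10, 5).isoformat(), "TC1-LONDON", 154])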


Oh, I saw Excel files that contained screenshots of other Excel files over cells containing data. Comments were in text fields, if you were lucky, or again screenshots, from Word.

And then the same people wondered why things based on this file didn't work.


They seem to have hit the row limit, not the column limit. They were using Excel to collate the data coming in from different test centres: when one of the CSVs was loaded in, the row limit was exceeded and Excel cut off the extra rows without the operator realising. (I'm going off reporting by Alex Hern at The Guardian here, and he's normally pretty good.)
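
If that account is right, even a crude guard in the collation step would have caught it: count the rows in each incoming file before letting anything near the legacy sheet limit. A sketch, assuming the inputs really were plain CSVs as reported:

  import csv

  XLS_ROW_LIMIT = 65_536  # rows per sheet in the legacy .xls format

  def csv_row_count(path):
      with open(path, newline="") as f:
          return sum(1 for _ in csv.reader(f))

  def check_before_collating(path):
      # Fail loudly instead of letting rows be silently truncated downstream
      n = csv_row_count(path)
      if n > XLS_ROW_LIMIT:
          raise RuntimeError(
              f"{path}: {n} rows; an .xls sheet would silently drop "
              f"{n - XLS_ROW_LIMIT} of them")
      return n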


Let's assume that people at the local centres do save them in rows, which is the most logical thing to do. But perhaps the software they copy/paste them into requires the data in columns, so maybe they transposed the data without realising it had been truncated to the maximum size.

Otherwise it wouldn't explain why it took so many days to realise that they couldn't add more columns.


A question, was it possibly a csv?

ie a csv is equivalent to Excel at the Daily Mail, so they printed Excel as the issue due to it being more familiar to their readership.

(Can't think of what the technical issue would be in the case of a csv, also doubt the credibility of the source!)


The story I heard was that it was indeed a CSV and that Excel (silently) drops anything beyond 1 million rows.


More specifically: they were detouring through the old XLS file format, which has a 65k row limit.


Actually from The Daily Mail, and not clear where they are getting it from.


Worth noting the primary source for that claim is the Daily Mail.


I noticed this as well, and started worrying that the various news items citing Excel had simply been reading HN!

I still haven't seen an authoritative source, or one that predates that comment.

But it definitely sounds possible.


It is absolutely standard that everyone doing office work outside of high-tech jobs uses Excel for data handling.

It's not surprising that this reaches into government.

Even if your canonical database is done properly, there will be a clash here as soon as you share (or give anyone the ability to generate) CSV files that are too large for Excel.


I agree, but I'm genuinely glad they're not[1] using fax machines.

[1] they almost certainly are using fax somewhere.


It's probably Japan, but that would be the least of the egregiousness there.

[1] Japan's COVID-19 Reports - 140KBs of Unadulterated Incompetence https://stdio.sangwhan.com/wtf-japan-covid-19-report/

[2] The previous HN discussion https://news.ycombinator.com/item?id=22728674


I get a cert warning on https://stdio.sangwhan.com/wtf-japan-covid-19-report/ which really is a good start.


> professionalism in data handling

Nope. They appointed Dido Harding to cough manage it, and Matt Hancock oversees it. That should tell you all you need to know. I wonder if they were running Excel 2007?


I think the more interesting question is what happens to any analysis done with this incorrect data. There was a news article last week suggesting that the growth in COVID-19 cases was "levelling off", and I wonder whether, in retrospect, this incorrect data would account for such a position.

The article in question: https://www.bbc.co.uk/news/health-54387057


Excel always feels like a bit of a curse of the world for me, and one that makes me feel like I'm going mad. I work partly in partnership with the NHS within my university, and get to see it being applied, completely inappropriately, with no ability to comment.

For me, I always feel like I'm living in some sort of parallel universe. Imagine living in a world where _you_, and a select group of others, know that there are better methods to drive a screw into a wall than punching it with your fist. However, 99% of people in the world punch screws into walls by hand, because they know how to use the only tool they have to hand (i.e. their hand). Their friends and neighbours do too. Their colleagues show them neat little tricks about how to fold your hand better to get the screw in with less pain, or more efficiently. The "proper" option -- drilling a pilot hole, using a self-tapping screw, understanding the difference between nails, bolts and screws, etc, or the concept of wall-plugs, is "too technical", "too complex", or requires "a large amount of infrastructure to fix a simple problem" [buying a drill & screwdriver, etc]. Sometimes people buy new walls with screws pre-installed because the thought of driving in those screws seems like an overwhelming technical burden.

You, and a group of similarly-minded nerds, understand just how stupid this is at times. You personally then spend the rest of your life watching people occasionally have horrific hand injuries from punching screws into walls and refusing the offer of screwdrivers. Others profit from selling vertical hard surfaces with a whole array of convenient screws pre-installed; or get employed teaching others the _very_ basics of screw-driving, or sell products containing all sorts of klunky hacks to cope with the fact that screws manually installed are at odd angles and tend to fall over randomly. The supplier of a popular brand of screwdrivers happens to be the world's largest screw manufacturer, but never seems to promote screwdrivers to people who currently punch them in with their fists. Instead, they sell them a subscription to screws, with the option for a very expensive pre-screwed wall-delivery service, built to your specification (that they call Dynamics for some reason…).

You, and a subset of your nerdy and educated friends recognise just how batshit insane the widespread use of punching-as-a-method-to-install-screws is, and occasionally laugh about it, mostly to distract yourself from the grim reality of buildings badly held together by punched-in-screws. Every so often a skyscraper-sized building collapses, [1] and it's realised in retrospect that the manually-driven-in-screws were not the right tool for the job. Still, people carry on banging in screws with their fists, and you increasingly realise that you have to -- somehow -- accept that in order to sleep at night without going insane.

[1] https://www.businessinsider.com/excel-partly-to-blame-for-tr...


My dad (who is 79 but still working) proudly mentioned that he spent 11 hours putting together some charts in Excel for his client. Probably something that could be done in an hour or so with an online tool.


To be fair, there hasn't been any massive irretrievable data loss yet, so not too bad.

https://www.youtube.com/watch?v=K_FrQnQv0Vw (probably an accurate depiction of what's going on in Whitehall right now, though)


At this point I think I'd rather have Peter Capaldi acting as Malcolm Tucker running Whitehall over the shower of incompetent people we have now. At least he might get something done in between the flaying


I'd rather have David Brent, but we are where we are


> Using an Excel sheet as a database?

You wouldn't (want to) believe how incredibly common that is.


Excel has a limit of ~1M rows per sheet. Total cases in the UK until now are ~500K. So it seems they use 2 rows per case in the sheet.

Edit: oops, it seems they used columns for the cases and the ~16K column limit was hit when daily cases exceeded that.


For most companies, "big data" is when they can't fit the data in a single Excel sheet anymore.


Devops Borat: "big data is any data which is crash excel"


This is coincidentally also how I describe my job: I help when it doesn't fit in excel anymore.

Much easier for people to understand, and shorter, than "civil engineer turned software developer in an engineering firm where I do data management, automation and application development for (mostly internal) clients to streamline business and engineering processes".


This will not be Excel itself, but an external export library.

I doubt the current Excel, which MS spends billions on, would fail to open data just because the XLS is over the 'size' limit but otherwise in a valid format.

I guess MS hate is easier than thinking about software and logic and being a better programmer and stuff.


"Developers picked an old file format to do this - known as XLS.

As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of."

"Experience is the name everyone gives to their mistakes" --Oscar


A lot of trashing on Excel but I kind of think this particular instance isn't really the essence of the problems with Excel. Perfectly reasonable programmers get this wrong all the time (ever heard of buffer overflow anyone?), and the issue of small test data being unrepresentative of large test data is something everybody has to contend with. If there is a real problem with it I would say it is the presumably silent nature of the truncation which allowed the error to propagate.

The real horrors of Excel come from things like auto-conversion of column data (text, numbers to dates etc), off-by-one errors in copy and paste, overwriting formulas in cells, sorting of columns that doesn't capture all the rows, etc. These are all problems that are engineered into the user interface of Excel. It's like putting a tripwire at the top of your stairs and just expecting people to step over it day in and day out. It's basically inevitable somebody will fall down the stairs.


I use AWK to do floating point calculations and copy the results to Excel https://news.ycombinator.com/item?id=14018450


Why would you even use excel for something so important, rather than importing the data in a database? The problem isn't the version of Excel, but using a Spreadsheet in the first place!


Maybe everyone in school should be given real world problems like this to do in Excel so people can get a feel for doing it more correctly? Rows === records and columns === fields being the first lesson...


They've tried that already.

When I was in school (early 2000's) our GCSE computing lessons were, more or less, "Here's Microsoft Word, today we will learn how to format a letter!" or "Here's Microsoft Excel, today we will learn how to create a chart!".

The only upside to taking the class was that our school managed to identify the kids who were obviously computer literate and we got to go on a tour of Microsoft's UK headquarters...


If I had a dollar for every office worker that thinks Excel is a database...


Worked at an optics company and they were spoiling lenses until someone discovered that Excel was rounding values to two digits after the comma (financial rounding).


The actual title of the article is awfully clickbaity and misleading. As I understand it, the issue is not at all Microsoft's fault, but PHE using an inadequate way of storing the data.


I'm a bit of a novice when it comes to storing vast amounts of data, especially of this nature, but why would you use Excel rather than an actual database?


Development time and data accessibility. Everyone you might need to send data to has Excel, fewer have Access, and far fewer have a typical database software.

That's not saying Excel is the best way to store data, but it gets a lot of jobs done without the multiple-month delays that come with an IT project. Oftentimes, waiting two weeks because they're already busy and then spending another two weeks outlining requirements is categorically unacceptable for accomplishing the business goals.


Why does Excel still have such low row limits in 2020? When they increased it from 64K, did they decide to only go to a million to promote Azure for "Big Data"?


You seem to assume the government used a modern version of Excel


A DB system that doesn't verify that each submitted record has been included in the DB (and scream if it hasn't) ... is a DB cruising for a bruising.


It sounds like a data processing pipeline rather than a database itself. Systems A...E at hospitals push CSV files to a shared system F. System G pulls the data from F, passes it through an Excel workbook (failure point!) for data augmentation and cleanup tasks built by a public health SME, then drops the data in location H. System I picks the data up and finally loads it into a "real" database for additional reporting.

If that sounds insane, I assure you that is a simplified version of what happens in some systems.


Wow, what a mess. At the very least (which is about all that could be done) A could assign a new-record-ID to each record; I could tell F where to find it; F could verify its presence (or scream) and tell A.
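
A sketch of that verification step (hypothetical names; the A/F/I labels follow the pipeline described above): compare the IDs that left the source with the IDs that arrived, and scream about any difference rather than silently carrying on:

  def reconcile(sent_ids, loaded_ids):
      # Compare the manifest of records sent with the records actually loaded
      sent, loaded = set(sent_ids), set(loaded_ids)
      missing, unexpected = sent - loaded, loaded - sent
      if missing or unexpected:
          raise RuntimeError(
              f"{len(missing)} record(s) never arrived, "
              f"{len(unexpected)} arrived that were never sent")
      return len(sent)

  # e.g. reconcile(ids_from_lab_manifest, ids_in_reporting_db)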


I have seen this type of bug in production. After seeing it, I refuse to accept XLS for data. CSV all day; whoever needs xlsx can convert it into that afterwards.


Just dropping this here: In Apple Numbers you can work with tables that contain up to 1,000,000 rows (and 1,000 columns). :)


In these cases shouldn't Microsoft Excel raise a warning that data loss may occur if you're using the legacy file format?


I believe Office will warn you when saving as .xls (or .doc instead of .docx, or .ppt instead of .pptx for that matter) that some data loss may occur. Problem is that nobody reads those warnings and just hits the OK button.


> "But you wouldn't use XLS. Nobody would start with that."

Haha. This would be funny if it wasn’t so abjectly false.


Friends don't let friends use Excel.

Unfortunately Excel is the great swiss army knife of software and it's hard to avoid.


And what do they use instead of Excel? LibreOffice? Some expensive custom database software?


Yep, that's the problem. There's a gap in the space between the generality of Excel and the performance of a SQL DB with no UI that nobody wants to fill.

Microsoft Access sits there, but nobody uses it for some reason.


My belief is because Access sits on the wrong side of the gap you mention. It's an SQL DB with a UI, alright, but the database is (or at least has a reputation of being) kind of flakey, and the UI exposes you to all the complexity and inflexibility of relational databases. This creates two types of hurdle. One is educational - you have to teach people basics of relational data modelling if you want Access to make sense for them. The other is operational - you can't shuffle data or table structure around in Access like you could in Excel. You can't just copy pieces of table around, you can't make rows depend on rows before them, etc.

Worth remembering is that Excel is, fundamentally, a 2D functional reactive programming REPL. Most of its users don't understand that, but they internalize the behavior. They may not know what a DAG is, but they know that updating cells will recalculate cells that depended on them. So you can't replace Excel with just a database - because half of the utility of the program is in formulas.

There is a gap - a tool is missing that would offer the flexibility, the ergonomics, and FRP capabilities of Excel, while also providing tools for ensuring data consistency and relational queries, and at the same time also making them easy for users to wield. It's a tall order.


That's what Filemaker is supposed to be.


Ask a professional when the stakes are too high. Hire a developer.


Excel is just fine for doing one-off computations and visualizations. There's nothing out there that's as ubiquitous, capable, and easy to use.


Just saying in Apple Numbers you can work with tables that contain up to 1,000,000 rows and 1,000 columns. :)


> Missing Covid-19 test data was caused by the ill-thought-out use of Excel


Excel? Really? They are using that crap for nation wide reporting? Wow, mind blown. Epic fail.


A non-technical senior manager who was in charge of a project I worked on went on to manage England's contact tracing. Not sure if he was directly responsible for picking Excel, but everyone who worked under him was not the least bit surprised when this news came out.

The project failed spectacularly if you were wondering.


I read the article and I feel that this is much more a deflection of guilt than an Excel issue.


We need to stop accepting data in Excel. Full stop. One day it’s going to kill someone.


English get what they buy with cheap IT slaves working on work visas.


TL;DR [I found it hard to find which actual limit was hit.]

> The problems are believed to have arisen when labs sent in their results using CSV files, which have no limits on size. But PHE then imported the results into Excel, where documents have a limit of just over a million lines.

> The technical issue has now been resolved by splitting the Excel files into batches.


Further down (maybe the article has been updated):

PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system and made available to the NHS Test and Trace team as well as other government computer dashboards.

The problem is that the PHE developers picked an old file format to do this - known as XLS.

As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of.


Obligatory reminder: an Excel error was the cause for much unnecessary austerity after the 2008 financial crisis [0]

[0] https://www.nytimes.com/2013/04/19/opinion/krugman-the-excel...


Probably excel parsing everything as a date LOL


How much did the UK government pay for this?


Columns. They were using a new column for each new observation. The 16k number isn't a coincidence - it's the max # of columns in a spreadsheet.


Absolute clowns running the show.


Professionalism has been entirely absent from the top-level handling of the pandemic. It's not entirely clear from the article who's responsible for this, but I note that the so-called "NHS" test and trace is in fact an outsourced organisation: https://en.wikipedia.org/wiki/NHS_Test_and_Trace run by Tory placewoman Dido Harding.

> "She is a former chief executive of the TalkTalk Group where she faced calls for her to resign after a cyber attack revealed the details of 4 million customers. A member of the Conservative Party, Harding is married to Conservative Party Member of Parliament John Penrose and is a friend of former Prime Minister David Cameron. Harding was appointed as a Member of the House of Lords by Cameron in 2014. She holds a board position at the Jockey Club, which is responsible for several major horse-racing events including the Cheltenham Festival. "

> "In May 2020, Harding was appointed by Health Secretary Matt Hancock to head NHS Test and Trace, established to track and help prevent the spread of COVID-19 in England. In August 2020, after it was announced Public Health England was to be abolished, Harding was appointed interim chair of the new National Institute for Health Protection, an appointment that was criticised by health experts as she did not have a background in healthcare."


I mean let's call a cat a cat: this is cronyism, thus corruption. Unfortunately, voters never care about these immoral practices. When a social system/administration is dysfunctional, corruption is often the root cause of that dysfunction. The result of corruption is always sheer incompetence.


OT: are you French? AFAIK, calling a cat a cat is a French idiom. The English equivalent would be calling a spade a spade.


A spade is only a cat if you have tomatoes on your eyes.


Also Dutch


I think the voters do care, but can't keep track of the details and only know what the media choose to focus on. This is why Donald Trump's "drain the swamp" pitch was so effective, despite the reality of the far worse corruption in his campaign.

Voters want less corruption, but their own tribalism leads them to unhelpful positions of "everyone is corrupt", "none of my tribe are corrupt", and sometimes both of those at once.


I have not yet watched this programme from the BBC but it claims the NHS Test and Trace system is a chaotic mess. In other words, as shambolic as everything our Conservative government has done over the pandemic.

Panorama - Test and Track Exposed:

> "Panorama hears from whistleblowers working inside the government’s new coronavirus tracking system. They are so concerned about NHS Test and Trace that they are speaking out to reveal chaos, technical problems, confusion, wasted resources and a system that does not appear to them to be working. The programme also hears from local public health teams who say they have largely been ignored by the government in favour of the private companies hired to run the new centralised tracking system. As Panorama investigates, it has left some local authorities questioning whether local lockdowns could have been handled better or avoided altogether."

https://www.bbc.co.uk/iplayer/episode/m000n1xp/panorama-test...


>She holds a board position at the Jockey Club, which is responsible for several major horse-racing events including....

I'll point out that Matt Hancock is MP for Newmarket - a centre for horse racing and associated businesses.


For those not in the UK, Dido Harding was already on front pages a couple of weeks ago when she was grilled by a Parliamentary committee about the shambolic Covid testing organisation and her reply was "no-one could predict such a strong demand for Covid tests"...

As the NY Times said yesterday, sometimes it feels like "Britain is operating without adult supervision".


She was also CEO of ISP TalkTalk when they had a massive personal data breach and told an investigation that she didn't know if the data was encrypted or not.

https://www.msn.com/en-gb/money/other/people-want-to-know-wh...


In the anglosphere when you're in the managerial class, you fail upwards it seems.


We have a meritocracy apparently, so if you have got to the top you must be good and therefore your talents shouldn't be wasted.


Marketing director of one of the largest travel firms in the UK at 27! Such talent at such a young age (teeth grinding intensifies)


Which was the first of its kind and she was new to the role.


The whole NHS Test and Trace organisation seems very strange. Going by recent news reports, it's not just Harding who comes from a retail background, several of the other top people do, too. What they don't seem to have done is hire top people with a technical or management background in either public health or technology, which you'd think might be slightly important for an organisation with this remit.

In other news, I'm looking forward to hearing about how a system apparently hacked together from text files and decades-old Excel formats is complying with basic data protection principles, given that it's being used to process sensitive personal data and has profound implications for both public health and now (since violating an instruction from their people to self-isolate has just been made a criminal offence) individual liberty for huge numbers of people.


While that is true, I have a little difficulty imagining that Harding or anyone else senior was involved in the process of shoveling around spreadsheets. As the overall leadership she obviously has the accountability but just saying, "oh what do you expect when you put a crony in?" ignores the reality that a number of people on an operational level had been doing this for months and nobody seems to have stopped it.


The latest news is that apparently this was a problem with Public Health England's internal processes anyway, not NHS Test and Trace or any other kind of outsourced service: https://www.bbc.co.uk/news/uk-54422505 I don't think that even falls under Harding's responsibilities.


Day one on the job of CEO should be requesting input about systemic organizational and operational problems. Both from your direct reports and from the underlings who actually know what is broken.


We detached this subthread from https://news.ycombinator.com/item?id=24685998.


The BBC updated the article with technical details, see about halfway down. Here's the tech details:

(PHE = Public Health England)

Analysis by Leo Kelion, Technology desk editor:

The BBC has confirmed the missing Covid-19 test data was caused by the ill-thought-out use of Microsoft's Excel software. Furthermore, PHE was to blame, rather than a third-party contractor.

The issue was caused by the way the agency brought together logs produced by the commercial firms paid to carry out swab tests for the virus.

They filed their results in the form of text-based lists, without issue.

PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system and made available to the NHS Test and Trace team as well as other government computer dashboards.

The problem is that the PHE developers picked an old file format to do this - known as XLS.

As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of.

And since each test result created several rows of data, in practice it meant that each template was limited to about 1,400 cases. When that total was reached, further cases were simply left off.

Until last week, there were not enough test results being generated by private labs for this to have been a problem - PHE is confident that test results were not previously missed because of this issue.

And in its defence, the agency would note that it caught most of the cases within a day or two of the records slipping through its net.

To handle the problem, PHE is now breaking down the data into smaller batches to create a larger number of Excel templates in order to make sure none hit their cap.

But insiders acknowledge that their current clunky system needs to be replaced by something more advanced that does not involve Excel.

edit: also longer version by Leo here https://www.bbc.co.uk/news/technology-54423988


Everyone's saying not checking for errors was the terrible thing, but I think having Excel as a middle-man in such a sensitive system was the first issue.

It's such a British government thing to do too (I worked for Syntegra when I got out of uni).


OK, we'll change to that from https://www.bbc.com/news/uk-54422505. Thanks!

More explanation here: https://news.ycombinator.com/item?id=24690286


The real, visceral tragedy that is COVID-19 notwithstanding, I think it provides a marvelous example of why statistics, the importance of accuracy in data collection, and the effects of exponential growth are a must-have in today's high school curriculums.


Imho, exponential growth especially, and then clustering.

Some pretty animations in a lesson might be helpful.

Fire grows in a similar way.

I've seen people up in arms because of things like a workplace of 1000 people shut down because "only" 20 people tested positive one day.

What's the problem?

To people who don't understand exponential growth it seems like 1 in 50 people is a fuss about nothing.

To people who do understand, it's an early-warning signal which says 500 people will get it in weeks due to the confined workspace, if not stopped urgently. And they will infect 1000s in the surrounding community in the same timeframe.

(To anyone tempted to point out the published R isn't that high, local R is highly dependent on situation and who mixes with whom. In a confined workspace, especially with people moving around, it's higher than in the general population. This is why it's useful to teach clustering in addition to exponential growth.)


This government had a statistics screw-up just two months ago and the only negative reaction was about how they handled the fallout. As a nation we simply do not understand its value - or the cost of it being misused.


This. We need to drop calculus, even geometry if we have to, in order to make room in the curriculum for statistics. Statistical illiteracy kills.


When I did my A-level in maths the course was half pure and half applied maths.

The pure maths course was mandatory, but for the applied side you had the choice of either mechanics or statistics. At the time I was much more interested in physics, and was considering it as a degree course but, even if I'd done that, statistics would have been far more useful to me.

I certainly wouldn't choose to drop geometry or calculus, but I've had many occasions to regret my choice of mechanics over statistics during the past two decades.


There's never been a better time to pick up some stats than now.


I mean, statistics relies heavily on calculus. It might be best to try and fit both in. Something at least a little bit fun and educational could be achieved with some dice throwing or simple computer simulations. High school students play dice games; they will immediately appreciate how useful a bit of probability would be in that context.
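
The dice idea really is a one-screen exercise: a quick Monte Carlo estimate a class can check against the exact count of outcomes. A sketch:

  import random

  def p_sum_exceeds(target, trials=100_000):
      # Estimate P(sum of two dice > target) by simulation
      wins = sum(random.randint(1, 6) + random.randint(1, 6) > target
                 for _ in range(trials))
      return wins / trials

  print(p_sum_exceeds(9))  # ~0.167; the exact answer is 6/36 (sums of 10, 11, 12)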


Trigonometric calculus comprised at least half my high school calc assignments ~20 years ago. Surely that can be scaled back in favour of more lessons on statistics and probability.


> I mean, statistics relies heavily on calculus

No, it doesn't. Not the kind you'd teach in high school to help young people grow up and maintain a functioning democracy.


Change in deaths over change in statistical illiteracy is positive eh? How positive? Presumably the change in statistical illiteracy depends upon the magnitude of the deaths. It'd be an interesting relationship to study...


Where did you go to school that you didn't do statistics?


Just about anywhere in the US?

Remember that HN comments are an exercise in selection bias.


Then that is nuts, and you don't need to remove calculus to fix it.

Presumably there's something you're teaching that we (in the UK) aren't, or more depth somewhere, but I don't know what it is. Our systems are quite different in that mathematics becomes optional after GCSEs (15-16yo) here, but statistics is taught from a far younger age than that, right up to it, and beyond for those that take A level(s) in mathematics. (As I recall there are six statistics A level modules in total, S1-6; I think S1-2 are compulsory for a full A2 (vs. AS) mathematics qualification, which consists of six modules in total. In order to do all six statistics modules one would at least take the second A level, 'further mathematics', and probably (pun intended; unless statistics was a particular passion and the school allowed it) 'further additional'.)

NB I quite liked that structure - there are 18 'modules' total (arranged in 'core', 'further pure', 'decision' (algorithms), 'statistics', and 'mechanics'). Three A levels total available (six modules each) or fewer and an AS (three). Which ones you want to do are almost entirely up to you if the school's big/lenient enough. IIRC you could even decide for yourself how to allocate the modules' grades across the number of A levels you were eligible for, e.g. if AC would be more beneficial to you than BB.


> the effects of exponential growth

Oh god not this again.

The problem the world has right now is not a lack of understanding of what exponential growth means. Plenty of people understand that just fine. It's so easy to understand that there is even a simple ancient parable about it (of the Chinese Emperor and the chess board).

The problem is people who are obsessed with the concept of exponential growth even though "grows exponentially until everyone is infected" is not a real thing that happens with viruses, even though COVID-19 no more shows exponential growth than the sine wave does (sin roughly doubles at points), even though Farr's Law is all about how microbial diseases show S-curve type growth.

This leads to craziness like the UK's chief medical officers going on TV and presenting a graph in which the last few data points are in decline, but with sudden endless exponential growth projected into the future, along with claims that "this isn't a prediction, but clearly, we have to take extreme action now because of exponential growth".

Observing exponential growth for a few days in a row does NOT mean endless growth until the whole world is infected. Growth rates can themselves change over time, and do. That's the thing people don't seem to understand.


Actual title:

Covid: Test error 'should never have happened' - Hancock

Please note this has been discussed quite a lot here today.

Eg.

https://news.ycombinator.com/item?id=24685911

https://news.ycombinator.com/item?id=24685950


You're right. But this article contains significant new information (SNI) about Excel, and spreadsheet topics are perennially interesting to HN readers, so this is the rare case where a quick follow-up article, and also the title change, probably makes sense. Usually it goes the other way of course:

https://hn.algolia.com/?query=follow-up%20by%3Adang&dateRang...

https://hn.algolia.com/?query=%22significant%20new%20informa...


Fair enough, makes sense, thanks. When I posted the comment it was the first comment on the thread (not sure if it's been merged or I'd had the page open a long time before posting) and we'd already had other threads with that leeway. Cheers for the clarification, I'd thought that might be the reason for the earlier ones.

Edit: just seen your other comment re merging, thanks.


The BBC has expanded on the Excel-related part and broken it out into a seperate article: https://www.bbc.co.uk/news/technology-54423988 Maybe it'd make more sense to link that one, if you're keeping the title?


Ok, changed to that from https://www.bbc.com/news/uk-54422505. Thanks!


[flagged]


You're talking about US death certificates, I assume. It's the norm to list more than one thing that led to death: e.g.

Heart failure - Lung failure - Trauma - Automobile crash

So for covid-19

Respiratory failure - diabetes - COVID-19

You're saying that Respiratory failure or diabetes is the 'real cause of death', but COVID-19 is actually the cause of death. This has been common practice for years, stop believing what Mc. Nixon tweets out.


Compare to the flu... So many deaths of people with the flu are NOT flu deaths...


That happens if you only test ill people: https://www.worldometers.info/coronavirus/country/spain/


Overblown, IMO - only the useless absolute figure is affected, so it just creates another reason it's useless.

Unless positive result files are more likely to exceed the file size limit than negative ones, of course, which I haven't heard suggested is the case.

Edit: Perhaps I should clarify that I'm not suggesting it isn't a cockup. I just don't see it as a massive scandal that renders the data useless - it just decreased the sample size. What particularly wound me up was a Radio 4 presenter this morning objecting to the guest (I missed the start, I'm not sure exactly who - a female public or civil servant) calling it a 'glitch'. ('Really?! Really! 16000 missing cases is a glitch?!') Well, yes. Glitches can have minor, severe, catastrophic, or no consequences.


Care to explain why you think the absolute figure is useless?


It's not much good for comparison with March, when tests were only done in hospital, and I see a lot of people doing that.

Most people who had covid by August (which was at least 6%, or 4 million, in the UK according to antibody tests) caught it between start of February and end of April. That's at least 3.5 million over 90 days, or 40k a day on average. Peak was likely double, maybe even treble that, given lockdown on 23rd of March dramatically cut infections.

We're likely testing somewhere in the region of 50% of actual cases. The non-symptomatic cases are roughly equal in number to the symptomatic ones, so double the absolute number -- some with symptoms will refuse to be tested because they don't want to miss work, some without will be caught by contact tracing, and those two groups probably cancel each other out.

As such I'd expect 50k/day to be the "March equivalent" - with doubling every 10 days that means another 2 or 3 weeks.

It's not just about absolute cases though. Cases have been doubling roughly every 10 days, as have deaths. Deaths lag cases by 2-3 weeks, so I'd expect deaths in 20 days to be continuing to double even if we all stayed in an isolated booth from now.

Ultimately the concern is we're heading into flu season when hospitals are stretched, and cases, hospital admissions, and deaths are all increasing.

Hopefully flu season will be milder due to social distancing and due to more vulnerable people having been killed off, but either way we need to get a grip soon. Or just abandon any pretense of trying to stop its spread.
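
For what it's worth, the "2 or 3 weeks" figure is just doubling-time arithmetic. A back-of-envelope check (the ~12.5k/day starting point is my assumption, not stated above):

  import math

  current_per_day = 12_500   # assumed detected cases/day at the time of writing
  target_per_day = 50_000    # the "March equivalent" from the comment above
  doubling_days = 10

  days = doubling_days * math.log2(target_per_day / current_per_day)
  print(round(days))  # ~20 days, i.e. roughly three weeks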


What are you going to use it for?

Mostly it seems to be used for comparing differently populous countries.


Actually not only that. It affected contact tracing, and people exposed to infected patients were not informed to self isolate. Possibly leading to extra spreading of the virus.



