Hacker News
Missing Covid-19 test data was caused by the ill-thought-out use of Excel (bbc.com)
1041 points by cjlm 17 days ago | 801 comments

I work in healthcare, and by far the biggest production issues we've ever run into are people who run Excel spreadsheets through our file processing where the leading zeros have been removed from patient identifiers, because that's Excel's default behavior and you CANNOT TURN IT OFF!

EDIT: I have no idea who downvoted my post because what I said is 100% true. We have to tell customers to stop opening CSVs in Excel and then uploading them to us because what they upload could be missing critical data. Excel interprets a number and then formats it as a number, but in healthcare, 10 digit numbers are really strings. Unless your IT team has created an Excel extension and had it preloaded onto your local copy, Excel will remove leading zeros and there isn't a way to get them back if you save/overwrite.

Every single problem I've ever had with excel stems from excel trying to be too clever. Nearly all my problems with importing data into it could be solved if excel left everything as is until I asked it to format values a certain way.

>Every single problem I've ever had with excel stems from excel trying to be too clever. Nearly all my problems with importing data into it could be solved if excel left everything as is

We're biased to bash on Microsoft for being "too clever" but maybe we need a reality check by looking at the bigger picture.

Examples of other software not written by Microsoft that also drops the leading zeros and users asking questions on how to preserve them:

- Python Pandas import csv issue with leading zeros: https://stackoverflow.com/questions/13250046/how-to-keep-lea...

- R software import csv issue with leading zeros: https://stackoverflow.com/questions/31411119/r-reading-in-cs...

- Google Sheets issue with leading zeros: https://webapps.stackexchange.com/questions/120835/importdat...

Conclusion: For some compelling reason, we have a bunch of independent programmers who all want to remove leading zeros.

I'm certainly not singling out Excel; this thread is just about Excel, so I thought I'd share my frustration with it. I've certainly had similar experiences with pandas. I've not used the rest of your examples, but I do have similar issues with LibreOffice Calc. Regarding pandas, it is at least easy to sort out the assumptions its programmers made by including 'dtype=str'.
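For the pandas case, a minimal sketch of the dtype=str fix (the file contents and column names here are invented for illustration):

```python
import io

import pandas as pd

csv_data = "patient_id,age\n0012345678,42\n0000567890,67\n"

# Default type inference parses the IDs as integers and drops the zeros
bad = pd.read_csv(io.StringIO(csv_data))
print(bad["patient_id"].tolist())   # [12345678, 567890]

# dtype=str keeps every column exactly as written in the file
good = pd.read_csv(io.StringIO(csv_data), dtype=str)
print(good["patient_id"].tolist())  # ['0012345678', '0000567890']
```

dtype also accepts a per-column dict, e.g. dtype={"patient_id": str}, if the genuinely numeric columns should stay numeric.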

The issue is that CSVs are untyped and software has to guess the type, sometimes unsuccessfully.

Does it? I think it should treat everything as a string since it can't be certain about the correct type. It could also suggest the correct type rather than applying it without user confirmation.

You want Excel to auto-interpret a table of numbers, that come from a CSV, as entirely text?

Looking forward to the first time anyone tries to use your excel on a table of numbers and then immediately has to multiply everything by *1 (in a separate table) just to get it back into numbers...

Oh yes please.

At least you would know what's happening and be in control of it

"Hey, is that a date? I bet that's a date!" - Aaaargh Noooo!

YES!! I seem to remember a few versions of Excel ago there was a way to import data and set the data type for each column in a preview window to see what it would look like. And to your point, it's very easy to select an existing column in Excel and change the data type anyway. So why not just make everything a string by default?

Maybe you recall the CSV import window in libreoffice

You are right on the cause of the issue. But the software does not HAVE to guess the type. If I paste something into Excel, I want what I pasted to be what's in Excel. Changing the type and inadvertently affecting the data isn't something any data software should do by default. And the fact that OP said they can't turn it off means that this is beyond a design mistake, it is a FEATURE. If Excel took itself seriously as a data platform, it would not make a change to any text by default.

But Excel has never been a data platform. It's a spreadsheet.

It's users' fault for using it in ways that it was never designed for.

Excel has always been about sticking numbers in boxes and calculating with them.

If you want unmodified string input, input strings into a tool intended to handle them.

Project specifications can be hard. Using 1) .xls files after they were superseded, 2) ANY data transfer method without considering capacity or truncation issues, speaks of incompetence.

You can format the column as text before you paste, or you can load data from text and set the types for every column. Excel has features to handle all of these issues. I think its biggest problem is that it's just too easy to use. It will allow novices to do all sorts of things with their data without providing guardrails for common mistakes.

Agreed with this. Excel does have a data import feature that allows us to fine-tune the import process to prevent this kind of mistake.

People just double-click the CSV and then complain that it didn't do it correctly. It's the same situation with scientific research data: researchers don't bother to use escape markers, or blindly open the file without going through the proper import process. Then they blame Excel without understanding how Excel works.

Yes, Excel has its quirks. But there are ways around those quirks; there are thousands upon thousands of guides out there about Excel. There is no excuse for complaining that Excel didn't behave the way you wanted without looking up how it works.

Except that literally every end-user I've ever had the displeasure of dealing with did not know how to properly import data into excel.

Double clicking the CSV should open the data import dialog.

Yeah, it does; LibreOffice has methods of recognizing numbers that are to be interpreted as strings as well. This is user error, and the programmer trying to blame the program rather than admitting it's a process issue. The spreadsheet can't read your mind.

> If I paste something into Excel, I want what I pasted to be what's in Excel.

And that's what you have in Excel. What gets displayed is a separate issue.

And no, you don't want to see exactly what you typed in, not in the general case.

And no, I can't believe I am defending Excel!

I believe excel considered itself a finance system not a data system. Hence the clear bias towards numbers over strings.

> has to guess the type

I believe that's the point, it certainly does NOT need to.

Absolutely! I ran into the arff format when using Weka and it was a revelation:
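(For anyone who hasn't seen ARFF: it declares each column's type in a header before the data, so nothing has to be guessed. A rough sketch from memory, with invented column names:)

```
@relation patients
@attribute patient_id string
@attribute age numeric
@data
'0012345678',42
'0000567890',67
```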


Well Excel trying to be clever stems from people not trying to be clever. We truly reap what we sow.

As an intern I was once given a spreadsheet of manually entered daily data, stored by a human being over the previous couple of years. Obviously he'd entered the dates in whatever form he felt like at the time and Excel had parsed them in a variety of godawful ways not least of which was defaulting to American month/day assumptions before UK day/month.

I think it would have been far quicker to just manually write a new column interpreting the dates based on previous/next etc. Instead I spent God knows how long trying to be clever, failing, and being embarrassed that I could not solve this obviously trivial problem.

And the people who make excel will just tell you "excel is not a database" lol

Which sadly, is what I want to tell them too!

The right way to open CSVs in Excel is: Data -> From Text -> Browse to your csv -> Delimited -> Comma -> Next -> Select Text formatting for columns with leading zeroes -> Finish

You can either send xlsx with formats or csv without. If this were disabled, we'd have another group of people complaining that their dates from CSV are not parsed.

and to enforce that, remove the file association which opens .CSV files in Excel by default. If every machine opened CSV files in notepad a huge percentage of 'Excel' problems would just go away. It could be worse though, try opening and saving a CSV in Word and see what that does.

I used to work with books, which obviously includes ISBNs. They are 13 digit numbers, helpfully turned into scientific notation by Excel.

I used to do migrations between library systems, and often they'd export from the old system as CSV. One rule I gave them was "never open this file in excel", because they'd always try to be helpful by tidying something and I'd end up with scientific notation ISBNs, dates that were in one form if they could be coerced into US-style infix, otherwise left in the original form, barcodes losing zeros, etc. etc.

The same problem happens with UK phone numbers, where the leading 0 gets chopped off. Solvable if there's a space between the dialling code and the number, but when individuals enter their number, they don't always do this.

Don't ISBNs also contain an X sometimes?

The old ISBN10 standard could have X (meaning 10) as the last character (the last character is a check digit), but ISBN13 (which prefixes 978 or 979) is all numbers. The check digit is calculated differently. One obvious advantage is that ISBN13 is barcode compatible.

I would guess that most modern actors in the book business have been primarily using ISBN13 for at least the last decade.

Thanks! The last time I read about ISBNs was probably more than a decade ago.

X can be a check digit. From memory the check digit is calculated using a modulo 11 calculation.
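That matches the published ISBN-10 scheme: weight the nine digits from 10 down to 2, then pick the check digit that makes the total divisible by 11, with X standing in for 10. A sketch:

```python
def isbn10_check_digit(first9: str) -> str:
    """ISBN-10 check digit: weighted sum mod 11, with 'X' meaning 10."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first9))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

print(isbn10_check_digit("030640615"))  # '2'  -> ISBN 0-306-40615-2
print(isbn10_check_digit("097522980"))  # 'X'  -> ISBN 0-9752298-0-X
```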

> there isn't a way to get them back if you save/overwrite.

Of course there is, in your case. If patient identifiers have a fixed, known length, then you can pad with leading zeros to recover them.

You only have a problem if 012345 and 12345 are distinct patient identifiers.
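A sketch of that recovery, assuming every identifier really does have the same known width (10 here, per the upthread example):

```python
def restore_id(mangled, width=10):
    """Left-pad an Excel-mangled numeric ID back to a fixed width.
    Only safe when all IDs are guaranteed to be `width` digits long."""
    s = str(mangled)
    if len(s) > width:
        raise ValueError(f"{s!r} is longer than {width} digits")
    return s.zfill(width)

print(restore_id(12345678))  # '0012345678'
```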

It is bone-headed in the first place to use numeric-looking identifiers (i.e. containing digits only) which are really strings, and then allow leading zeros. Identifiers which are really strings should start with a letter (which could be a common prefix). E.g. a patient ID could be P00123.

This is useful for more than just protecting the 00. Anywhere in the system, including on any printed form, if you see P00123, you have a clue that it's a patient identifier. An input dialog can reject an input that is supposed to be a patient identifier if it is missing the leading P, or else include a fixed P in the UI to remind the user to look for P-something in whatever window or piece of paper they are copying from.

The main point in my comment is that instead of shaking your fist at the behavior of other users in the system, such as those who choose Excel because it's the only data munging thing they know how to use, you can look for ways that your own conventions and procedures are contributing to the issue.

If people are going to "Excel" your data, and then loop it back to you, maybe your system should be "Excel proofed".

The lengths of the identifiers are not known, man. We are dealing with like 200 hospitals. They all have various kinds of Electronic Medical Record system like Epic or Cerner, or one of the 15 other third party vendors that all have custom implementations. Healthcare is insane in our country.

Interesting to see just how many responses to this real-world problem are suggesting that people rewrite CSV files to work with Excel.

They clearly do not understand system integration and the use of CSV text files for data interchange between multiple systems and applications. Hey, JSON and JavaScript libraries are the answer to that, eh?

There are already enough potential issues with CSV interpretation on wrapping strings, escaping characters and so on, but changing the content when a delimiter is found should not be added to that list.

Your bold point is the most important: the default behaviour of Excel when opening a plain-text CSV file is to alter the content for display, applying magic and often-unwanted formatting rules. That should be optional.

It should be possible to open a text CSV file in Excel, view the contents in columns but the same textual form, save the file in CSV format and open it in another viewer/editor and still see the same content as the original file.

The problem is Windows doesn't have a preloaded double-click viewer for CSV except Excel, if you have Office. The average user just double-clicks. If Notepad or WordPad did any kind of formatting for viewing, it would be less of an issue. So you have to solve for Excel as a dev, knowing that's what 99% of people will use.

If you 'solved for Excel', it will be broken for the other applications that may want to use data from those Excel-specific CSV files, which is generally why CSV was chosen in the first place.

If you want it to be a string then put it in quotes. It's that simple. If the user is too dumb to slow down and look at the import options it is not excel's fault, it's a process issue. Excel can't read a cell and then read your mind and extract the correct option.

First off, I "hate" Excel, but only because people misuse it. It, like so many other products of its kind, is for numbers! Yet I would be hard-pressed to find someone actually using it for numbers. People absolutely insist on using it as a generic table system for holding non-number data.

I recently had to show a 20+ year Excel-using fanatic how to import data from a CSV file so that they could select as type Text columns that contain leading zeros. The ability exists, but I have found scant few people who know how to actually use the product properly.

Oh, and I also work in healthcare.

Excel supports a variety of data types other than numbers and includes many built-in functions for dealing with non-numeric data. Whatever its history, it is not accurate to say that it is only for numbers.

Regardless of its _additional_ support for other types, those are only in support of its primary role as a spreadsheet. It is designed around the concept of plugging in _numbers_ and calculating with them. Everything else is to make those calculations well notated and human friendly.

Any other use of Excel is bending it into a role it wasn't intended for, and user beware.

And it is all too easy to just go there, since there are soooo many convenience features for those who don't want to learn how to do the tasks well.

If notation were the sole purpose there'd be little need for anything but row and column labels or simple string input. Instead there's a full set of string manipulation tools, embedding of other objects, charts, images, drawings etc.

You are taking the application as it existed 35 years ago and saying it must still be that thing, yet it has had 35 years to evolve far beyond that. Microsoft itself, when it talks about Excel, talks about using it to organize "data", not just numerics data. It has become a more general purpose tool.

As a lay-person, what would you use for non numbers tabular data?

People love to use Excel to look at (with their eyes, or a screen reader, i.e., non programmatically) all kinds of data. They use it to list exports from data stores, and then take notes on what they discovered, show and hide data using the filter option, etc. It is just far too easy to use (and misuse) that it ends up being used for any type of data that fits into a table.

Excel has lots of different uses. It can be used for inventory tracking, sales, catalogs, mileage, contact lists, and the list goes on. It is useful in many different fields and industries.

You have to remember, Excel is an extremely powerful beast. It has many specialized features that will handle any data it encounters. I've used Excel for 15 years and I am still finding features that make the process quicker. Of course, Excel has its limits and I am well aware of that.

I mean I thought you said that they should not use excel for these scenarios, but something else. So I thought you had an alternative software in mind?

I use Splunk (our organization pays lots of money for it, and yet it just sits there because people would rather use Excel). There is also Tableau and other products like it. Most of these tools have a free version, as well. They take some learning, but it is well worth it. Of course, taking that same amount of time to actually learn how Excel works is worth the time, too. I guess the issue is that Excel is just too "easy" to use, and hence people using it for years and years and not actually knowing how to use it.

I am not trying to be an elitist about this. It is just that the misuse of Excel (because people do not know how to use it) causes massive issues on a daily basis.

I haven't used it much myself, but I think Access is meant to answer this question.

Excel is fine, but keep the data as xls and you won't have problems like dropped leading zeros. As soon as you convert to a simpler file format like csv, you will likely lose data unless you are very careful in both export and import.

I just upvoted you.

You are fully correct, I have seen plenty of stuff like that in life sciences projects.

> Excel interprets a number and then formats it as a number,

Unless you instruct it to interpret the field as a string.

> but in healthcare, 10 digit numbers are really strings.

I'm wondering, if you expect 10 digits and you get less than that, how difficult is it to add some padding zeroes?

With US ZIP (postal) codes, padding with zeroes can actually be ambiguous because we have "zip+4" which is often stored as two fields, but just as often concatenated into a single representation. If you have a zip+4 of "05555-0444" in two fields that each lose leading zeroes, and then concatenate to "5555444" you can't reconstruct it with certainty (00555-5444 or 05555-0444 or 55554-0044).
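That ambiguity is mechanical to demonstrate: every split of the mangled digits that fits 5+4 after re-padding is a candidate reconstruction.

```python
def possible_zip4s(mangled: str):
    """All ZIP+4 values that could have produced `mangled` after both
    fields lost their leading zeros and were concatenated."""
    candidates = []
    for i in range(1, len(mangled)):
        zip5, plus4 = mangled[:i], mangled[i:]
        if len(zip5) <= 5 and len(plus4) <= 4:
            candidates.append(f"{zip5.zfill(5)}-{plus4.zfill(4)}")
    return candidates

print(possible_zip4s("5555444"))
# ['00555-5444', '05555-0444', '55554-0044']
```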

I had a package get seriously delayed one time because it kept being sent to a city whose 5 digit zip code was equal to [my 5 digit zip code, less its leading 0, and the first digit of my +4]. Fun times.

> Unless you instruct it to interpret the field as a string.

Except double clicking a CSV does not give you the option to do this. At this point Excel already decided to corrupt your data. And guess how most users open CSV files? That's right, they double click.

> I'm wondering, if you expect 10 digits and you get less than that, how difficult is it to add some padding zeroes?

If you expect 10 digits and get less than that, you have corrupted input data. Trying to "fix" this is exactly the sin excel is committing. Don't do that.

This would assume people (many who have over a decade of experience using Excel) actually know how to do that. The largest problem with Excel is people.

Excel rules. I still use it daily despite having access to big boy tools (python, SQL, R, etc.) Few tools make it easier to answer quick, simple questions about a dataset. Plus it's still the best way to browse SSAS cubes. Blaming excel for these sorts of problems is like blaming a hammer for not being a screwdriver.

I also work in healthcare.

Their database system afterwards could do that.

Or prepend a letter when producing the CSV to avoid EXCEL doing what it does.

Maybe it can be solved by prepending "ID:" for exports, and removing it on import. It could be proposed as a convention for all HC software companies.

Introducing standards across many companies is incredibly difficult but your main problem is that a lot of healthcare IT systems are very old and not actively maintained, they're essentially on life support. So changing the format of a file, or requiring new import cleaning, is going to break integrations and data sharing in many places. Basically it'd create far more problems than it's attempting to solve.

That is understandable.

I guess it depends on which country and which specific part of healthcare you are active in. In the 'care' domain in the Netherlands I also see that some of our integrations are one-off, but the most important ones do have industry wide standardization that receives updates based on law.

Although make sure not to use that for the first column heading: a file whose first cell begins with the uppercase characters "ID" will make Excel think it's a SYLK file and then refuse to open it.

Good catch! The Wikipedia page confirms it.

Nothing in software development is as straightforward as we might hope.


IME numeric identifiers are just asking for trouble. Some systems are sensitive to leading zeros, some systems will zero-pad, <input type="number"> loses zero-padding, etc. Worked with a client once with references to the same entity scattered across two databases, one of which zero-padded the identifiers and the other didn't. Oy vey.

> IME numeric identifiers are just asking for trouble

That's a failure of the system if it can't be told to not interpret the data. However you're saying the world is as it is; can't argue.

Agree with the problem reported here - the fact that Excel will drop leading zeroes and will turn phone numbers into Exponential format causes many live issues where customers use Excel to work on 'CSV' files :(

The issue I see is that you have people who don't know the difference between a string data type and a number. I keep seeing "Listen to the scientists" and "Listen to the doctors". There are many facets to solving problems, and many disciplines may be better at solving them. Had you/"healthcare" contacted folks with software expertise rather than being your own experts, this would have been a non-problem.

Also long numeric ids are rounded by Excel.

And fucking with dates.

Does it strip leading zeroes if you format the numbers as strings with quotes like this?


Working internationally with CSV holding dates is a recipe for disasters.

Couldn't you validate the data on upload or if it's such a common problem just auto-fix it by left padding w/ zeros?

There are data sources that have a clear unambiguous specification of what information is in each field, and exactly how they are formatted, such that you could auto-parse the column. And even have no data entry errors, so that you can rely on that parsing and have clean error free imported data.

There are many more that don't have a clear spec, or even if they do, have every possible variation of data corruption / keying errors / user misunderstanding / total 'don't give a shit, I'll enter it how I want, that's what the computer is supposed to handle' problems in that source data, that make parsing an absolute nightmare.

I've had to manually review a list of 10k+ data readings monthly, from an automated recorder, because the guy who was supposed to copy the collected data files as is and upload them, instead opened each one and fixed what he (badly mistakenly) thought had been recorded wrong. Different changes in a dozen different ways based on no particular logic beyond "that doesn't look right". And un-fireable, of course.

auto-fix by padding assumes that the only way to get an ID that's too short is Excel removing the leading zeros; presumably it could also be human error when entering, etc. (in theory I suppose that would be validated in whatever form the number is entered into in the app that generates the CSV, but hard to say for sure). Or some other issue in the data processing pipeline erroneously truncating the ID number somehow.

Validating the data would at least prevent getting invalid data into database (and presumably this is happening already), but it doesn't actually "fix" the problem, you still then need the original provider of the data to fix what's missing.

These aren't fixed-width fields with a known length. The length of the field is unique to the person sending the file and they don't tell us how long it's supposed to be.

The parent post says: "Excel interprets a number and then formats it as a number, but in healthcare, 10 digit numbers are really strings." - so I was under impression that the length for that particular column is fixed - probably just misunderstood the sentence.

I wonder how human operators figure out whether the value is correct, or Excel messed it up, or the input was invalid in the first place? If it's even possible to do reliably, then probably there is some set of patterns and methods that could be turned into an algorithm... just thinking out loud here, but it seems like an interesting problem to tackle...

Validating on upload is sensible and should be done anyway. However, left-padding is not always going to work; it depends on the data. Common problems I see are with phone numbers. UK mobiles begin 07nnn. Excel kindly drops the leading zero and then really helpfully treats the long number as an exponential-format floating point number, which cannot be turned back into the original. In our systems we require country codes in front of the numbers, so UK mobiles should begin 447nnn. That doesn't lose the leading zero, but it still gets the E-format mess when opened by Excel. Customers often miss out the 44 and use the national 07 format. If that leading zero is missing, it can be quite difficult for a system to reliably determine whether a number beginning with a 7 is an international number for a destination in Russia, or a user error that should have been 447.
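A rough simulation of that E-format mangling (Excel's General display shows roughly six significant figures for long numbers; the exact width varies, so the format string here is an approximation):

```python
n = 447123456789  # a UK mobile with the country code, as digits

# Simulate Excel displaying the number in scientific notation
shown = f"{n:.5E}"
print(shown)      # 4.47123E+11

# If that display is saved back into the CSV, the tail digits are gone
recovered = int(float(shown))
print(recovered)  # 447123000000
```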

One wonders why Excel doesn't store data as strings (or equivalent lossless format) and process it as numbers like SQLite does for instance.

Because Excel is a spreadsheet, for calculating with numbers.

Then there's additional code for dealing with all the inconvenient ways people format things, or want to add text labels, or do things closer to numerical/financial analysis, or all the other extras wrapped around the core "put numbers in boxes and do math".

That misunderstanding is at the core of Excel misuse.

> I have no idea who downvoted my post

That would be the Excel devs working at Microsoft. They read HN. I can feel it.

Excel devs, do you hear me? Repent for your sins! Go back to Microsoft and fix those problems. The cost of Excel mishandling numbers and dates must be many orders of magnitude higher than your lifetime salary.

Thank you.

They can't. All those spreadsheets that depend on that behavior are out there. It makes spelling creat() with an e look easy.

Down-voted as it is factually incorrect. See comment by reportgunner below.

Just format the identifier cells as text. I've also had this problem, this is how I solved it.

That's fine for an individual working on a specific set of data in a specific sheet, but this isn't just a problem for one individual or a small team that can solve it and move on. It's a systemic problem throughout many vast organisations, with continuous influxes of new personnel and constantly changing requirements. When you get an XLS sheet sent over to you from another team that already made this mistake, it's too late and this happens all the time.

> Just format the identifier cells as text

CSV is text. If you mean in Excel, if you opened it in Excel (rather than importing and choosing non-default options), you've already lost the data so formatting doesn't help you.

Yes, I mean Excel. We have CSV to XLS import scripts/forms that format identifier cells as text. The data format is standardised. Using templates to do the imports was the dumb part. Microsoft has a Power BI tool if one doesn't want to write or use scripts. Use that. I assume a government agency has the resources to pay for it and for data scientists.
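A minimal sketch of such an import script, assuming the third-party openpyxl library; the file and column names are invented:

```python
import csv

from openpyxl import Workbook  # third-party: pip install openpyxl

def csv_to_xlsx(csv_path, xlsx_path, text_columns=("A",)):
    """Copy a CSV into .xlsx with the given columns forced to Text,
    so identifiers keep their leading zeros."""
    wb = Workbook()
    ws = wb.active
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            ws.append(row)  # strings are stored as strings, untouched
    for col in text_columns:
        for cell in ws[col]:
            cell.number_format = "@"  # '@' is Excel's Text format
    wb.save(xlsx_path)

# Invented demo data
with open("patients.csv", "w", newline="") as f:
    csv.writer(f).writerows([["patient_id", "age"], ["0012345678", "42"]])
csv_to_xlsx("patients.csv", "patients.xlsx")
```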


Thanks for bothering to respond instead of downvoting.

> I assume a government agency has the resporces to pay for it and for data scientists

First, the upthread commenter said "healthcare", not "government agency".

Second, as someone who has worked in public sector healthcare: HA HA HA!

I mean, sure, we have the resources to pay for data scientists (of which we have quite a few) and could probably afford to develop custom scripts for any CSV subformat that we decided we needed one for (though if it's a regular workflow, we're probably acquiring it and importing it into a database without nontechnical staff even touching it, and providing a reporting solution and/or native Excel exports for people who need it in Excel).

The problem is that when people who aren't technical staff or data scientists encounter and try to use CSVs (often without realizing that's what they are) and produce problems, it's usually well before the kind of analysis which would go into that. If it's a regular workflow that's been analyzed and planned for, we probably have either built specialized tools or at least the relevant unit has desk procedures. But the aggregate of the stuff outside of regularized workflows is... large.

Nearly all end-users open a CSV like this:

1. They see a file (they have file extensions turned off, which is the default, so they probably don't even know what a CSV is)

2. They double click it

Excel now corrupted the data. That is the problem. Good luck teaching all end-users how to use Excel properly.

> They have file extensions turned off, which is the default, so they probably don't even know what a CSV is

And if, also by default, Excel is set up with an association with CSVs, the CSV file will, in addition to not having an extension to identify it, have an icon which identifies it with Excel.

> I assume a government agency has the resources to pay for it and for data scientists.

Bold strategy there, let's see how that plays out.

Having been in and around military / DoD usages for a long time, I can tell you it's always an uphill battle to get processes to work well, instead of defaulting to whatever the original spec happened to get included as a result of some incompetent who wasn't even aware of good practice.

So what's the big deal? Just read back the numbers, if it's less than 10 digits prepend zeros. Am I missing something (probably, which means you're leaving something out in the description of the issue)

The 10 digits was just an example. There are like 100 different ID types with different formats/lengths in healthcare.

How on earth is this the top rated comment? I would downvote it if I could. As other people have noted, you can import CSVs values as strings instead of numbers.

Furthermore, losing preceding zeroes in number-typed values is not unique to excel; it is a common feature in all typed programming languages.
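Easy to reproduce anywhere; the moment a field is parsed into an integer type, the zeros are unrecoverable:

```python
raw = "0012345678"
n = int(raw)   # the type conversion itself, not Excel, eats the zeros
print(n)       # 12345678
print(str(n))  # 12345678 -- no way back without knowing the width
```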

> you can import CSVs values as strings instead of numbers

Of course, but the problem isn't that the person who posted the comment doesn't know this - it's that many users of their systems don't know it. Most people are just going to accept whatever defaults Excel suggests and not know any better, causing problems down the line.

No, that's not correct. The parent poster specifically made this false claim: "that's Excel's default behavior and you CANNOT TURN IT OFF!" The false claim is the part that the poster wrote in all caps. He said that this behavior can not be turned off, when in fact it can be turned off.

It can't be turned off, if you're opening a CSV in Excel.

It can be avoided if you go through the Data | Import tools. The complaint is that few Excel users know that the import engine is available, or use it, instead of just opening the file and getting all the default interpretations, which can't be avoided in the usual Open code path.

How? Where is the setting to turn it off when initially loading a file?

>How? Where is the setting to turn it off when initially loading a file?

Confusion is happening because 2 different ideas of Excel using csv files:

- you saying "can't turn this off" : File Explorer double-clicking a "csv" or MS Excel "File->Open" csv.

- others saying "you can preserve leading zeros" : click on Excel 2019 Data tab and import via "From Text/CSV" button on the ribbon menu and a dialog pops up that provides option "Do not detect data types" (Earlier version of Excel has different verbiage to interpret numbers as text)

Quote your strings. The problem is more with how you generate the CSV (which is easy to break in the first place) than Excel's fault. It's like writing your identifiers as numbers in JSON instead of strings and then complaining that the leading zeros are removed.

Quoting doesn't work. Try opening a CSV containing both an unquoted 01 and a quoted "01" with Excel:

Excel interprets both as the same number: 1.

The correct way to generate a CSV cell with a leading 0 is to write the value as a formula: ="01"

You can verify this by opening such a file in Excel; the cell shows 01 and is treated as text.
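A minimal sketch of emitting that Excel-targeted workaround from Python (hypothetical identifiers; the lines are written by hand because a conforming CSV writer would escape the embedded quotes). Note that any consumer other than Excel will see the literal text `="0123456789"` rather than a plain string, which is exactly the trade-off being argued about here.

```python
import io

ids = ["0123456789", "0098765432"]  # hypothetical identifiers

buf = io.StringIO()
buf.write("patient_id\n")
for pid in ids:
    # Written verbatim: Excel evaluates ="01..." as a formula whose
    # result is the text 0123456789, so the leading zeros survive.
    buf.write(f'="{pid}"\n')

print(buf.getvalue())
```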


Perhaps by "correct way" you meant "dodgy hack to make Excel happy and risk breaking more sensible implementations"?

Excel may predate the RFC but AFAIK MS didn't invent or coin the term CSV, so you can't just say whatever Excel does is correct. The RFC is loose because of nonsense like this, it doesn't mean it was ever a good idea.

How in the world is Excel supposed to know which fields you want to be numbers and which to be strings? CSV doesn't have that info built in unless you surround the number with quotes and you select the right process. Excel isn't just a CSV importer, it reads all sorts of files, and it needs some help if you expect it to work. What ever happened to process and a sense of responsibility and craft in your work?

Choosing based on whether it's in quotes seems like a better solution than using an equals sign. Not everything follows that convention admittedly, but not everything understands the '=' either. Or it could just treat everything as text until a user tells it otherwise.

But none of that is really the point. Because CSV files aren't just for importing into Excel. One of their main benefits is their portability. In other situations column types might be specified out of band, but even if not, putting equals signs before values is unconventional, so more likely to hurt than help. And in the cases it might help, i.e. when you only care about loading into Excel, then you have options other than CSV, rather than contorting CSV files for Excel's sake.

> What ever happened to process and a sense of responsibility and craft in your work?

I actually have no idea what you are on about. I'm talking about the "responsibility and craft" of not producing screwed up CSV files. Why do some people find that so offensive? Yes, it is not inconceivable that there could be some situation working with legacy systems where putting `="..."` in CSVs is, unfortunately, your best option. Sometimes you do have to put in a hack to get something done. But don't go around telling people (or yourself) that it is "the correct way".


I sure don't want to have to deal with people putting `="01"` in CSV files.

We have multiple legacy systems where I work, communicating via batch csv files, some of which are authored & curated by staff in Excel. I can confirm that doing this would be a very bad thing indeed and make our systems grind to a halt.

For CSV that isn’t solely targeting Excel, hacks around the way Excel works are useless.

The reason you use CSV is that you want to use it with software that ISN'T Excel. Otherwise you'd just use .xlsx. No other software uses this convention, and it's not correct CSV.

I guess the HN audience has voted. The result is: hacking around Excel stupidities is itself a stupid idea.

If you want Excel to understand your output, perhaps use a library which can write xlsx files.

“Correct” is a strong word. It isn’t defined in https://tools.ietf.org/html/rfc4180 so Excel should not try to be smart and add extensions only they support.

Excel predates RFC4180 by nearly 20 years (RFC4180 is October 2005, Excel 1.0 was September 1985) and this behavior was already cemented when the RFC was written.

As for the actual RFC, it's worth taking a read. Any sort of value interpretation is left up to the implementation, to the extent that Excel's behavior in interpreting formulae is 100% in compliance with the spec.

CSV predates Excel, and other CSV implementations don't have this behavior

What spec?

Anyway the RFC doesn't mandate any value interpretation IIRC.

If CSV were being used just to exchange data with Excel, we probably wouldn't be using CSV. Many systems neither need nor know that ="01" should be treated as the string "01".

If Excel were the only intended consumer, .xlsx would be a preferable file format. At least it's mostly unambiguous.

I ran into this last week with a UK bank. I was offered a CSV file. What I got was a CSV file with excel formulae in it.

I actually wanted a CSV file – preferably without having to resort to sed to strip out excel formulae.

And now the csv parser (or downstream process) has to guess whether to interpret that as the raw string or as the eval'd value.

If you want to write a bespoke CSV generator for an application where you know for sure that the file is only ever going to go straight to an Excel instance, sure.

For all the other uses in the world, that's a breaking change.

I have a list of companies I'd like you to consult for. Coincidentally, they're companies I'd like to work for, and I've quite enjoyed building proper database solutions which replace incredibly hacky/terrible Excel (or Excel adjacent) solutions.

Not even Excel uses that syntax when exporting to CSV (at least by default).

Are you assuming creation of the CSV? 'Cause most Excel users won't be aware of why they should store data in CSV instead of Excel's proprietary formats.

It is sometimes possible to write code to alter your CSVs to retain the strings.

The first thing would be to write them as: entry1,"0123456789",entry2 rather than entry1,0123456789,entry2. This has worked for me in some instances in Excel where I have to escape certain things inside a string, but I would not be surprised if Excel still messes this up; for example, showing the triangle-exclamation-mark box and then helpfully suggesting a conversion to a number.

If you want to go further, you can do something like write a routine that alters the CSV, such as entry1,hospitalString(0123456789),entry2. Sure, there are problems with this too, but Excel can break a lot of things, and the above examples I do use in practice (in the first example I put the double quotes in to escape single quotes in foreign-language Unicode).

Another thing Excel can do is break your dates, by switching months (usually only for dates < 13th of the month, but often a partial conversion in your data for < 13th and >= 13th) or by converting dates to integers.
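For what it's worth, a quick sketch of the quote-everything approach using Python's standard csv module (made-up data; as noted above, Excel's default open path may still coerce a quoted numeric anyway):

```python
import csv
import io

rows = [["entry1", "0123456789", "entry2"]]  # hypothetical row

buf = io.StringIO()
# QUOTE_ALL wraps every field in double quotes and doubles any
# embedded quote characters, which also protects fields that
# contain commas or quotes.
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerows(rows)

print(buf.getvalue())  # "entry1","0123456789","entry2"
```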

All previous major spreadsheet screwups cost money (the largest being $6 billion in losses due to an Excel formula mishap), but this one is playing with lives :(

5 – London Olympics oversells swimming event by 10,000 tickets.

4 – Banking powerhouse Barclays accidentally bought 179 more contracts than they intended in their purchase of Lehman Brothers assets in 2008. Someone hid the cells containing the unwanted contracts instead of deleting them.

3 – Outsourcing specialists Mouchel had to endure a £4.3 million profit write-down due to a spreadsheet error in a pension fund deficit calculated by an outside firm of actuaries.

2 – Canadian power generator TransAlta suffered losses of $24 million as the result of a simple clerical error which meant they bought US contracts at higher prices than they should have.

And the biggest one:

1 – Basic Excel flaws and incorrect testing led to JP Morgan Chase losing more than $6 billion in their London Whale disaster.


They are costing money. No doubt.

At the same time, the business world runs on Excel. How much money is Excel making?

I've done my share of cursing at Excel at various jobs. At the same time, I am grateful for the quick and easy way it allows me and many others to manipulate data. It's unfair to just cite the costs of using Excel without acknowledging the benefits it brings.

The real question should be whether it would be cheaper to replace it with a proper solution (probably a custom-built application written by a software engineer).

Excel's ease of use is its downfall. It is the world's most popular database, despite not actually being a database. I have wasted countless hours dealing with Excel where something else should have been used. I built a database for a friend recently; I think 75% of the work was cleaning the existing data from the Excel file to get it into the database.

People start using excel when the requirements are unknown, they use excel to understand the requirements while still being functional. To build a "proper solution", you frontload all the requirement discovery to build a system which only you or another developer will be able to update/change.

We blame excel, but excel is really just being used for prototyping and nobody takes a decision at a certain point to move on from that prototype.

I would say that people use Excel because they don't understand how to set up a relational database. It's usually non-technical users that set up Excel, then at a later stage I get asked to create a proper database from it.

>I would say that people use Excel because they don't understand how to set up a relational database. It's usually non-technical users that set up Excel,

I disagree that lack of db knowledge is the primary reason. I'm a programmer and I usually use MS Excel because it's easier than relational databases. I prefer Excel even though my skillset includes:

+ Oracle DBA certification and working as a real db administrator for 2 years

+ MySQL and MS SQL Server programming with raw "INSERT/UPDATE/DELETE" or with ORMs

+ SQLite and programming with its API in C/C++/C#

+ MS Access databases and writing VB for enterprises

The problem is none of the above databases (except for MSAccess) come with a GUI datagridview for easy inputting data, sorting columns, coloring cells, printing reports, etc.

Yes, there are some GUI tools such as SQLyog, Jetbrains DataGrip, Navicat, etc... but none of those have the flexibility and power of Excel.

Yes, a GUI frontend to interact with backend databases can be built and to that point, I also have in my skillset: Qt with C++ and Windows Forms with C#.

But my GUI programming skills also don't matter because for most data analysis tasks, I just use Excel if it's less than a million rows. Databases have a higher level of friction and all of my advanced skills don't really change that. Starting MS Excel with a blank worksheet and typing immediately into cell A1 is always faster than spinning up a db instance and entering SQL "CREATE TABLE xyz (...);" commands.

Of course, if it's a mission-critical enterprise program, I'll recommend and code a "real" app with a relational database. However, the threshold for that has to be really high. This is why no corporate IT department can develop "real database apps" as fast as Excel users can create adhoc spreadsheets. (My previous comment about that phenomenon: https://news.ycombinator.com/item?id=15756400)

I find people often use Excel (or Access if they're brave) when they have a known problem and can visualise the solution in something they know, like Excel. The crucial factor is implementation time. They can open an Excel and start work immediately, or wait weeks or months going through a lengthy and complex IT procurement process that will take from their budget. So in most cases starting with Excel is a no-brainer. Same reason why a lot of stealth IT gets implemented.

It's often only after years of a business using what has become sacred & business critical Excels, that somebody suggests formalizing it into software. In a business with an IT function, or a consultancy looking for business, it should always be somebody's job to find these Excels and replace them with something more robust.

> Same reason why a lot of stealth IT gets implemented.

Honestly, I miss the days of writing VBA macros which save hours of work a week and being sneered at by the 'official IT'.

I worked in a team in a large commercial bank handling reconciliations with various funds, some of which had to be contacted by phone to confirm the current holdings. We had a system which would import our current positions and take imports in various formats from funds. Somewhere around 40% of the differences were due to trades which had been executed over the reconciliation date. I wrote a VBA script which pulled in all of the differences and identified trades which were open over the period and automatically closed the discrepancy with a reference to the trade IDs.

Another time I wrote a VBA script which would take a case ID and look it up in a diary system (at the time the only way I found to do this was to use the Win32 APIs and manually parse the fields in the HTML from the system), it would then enter this at the top of a spreadsheet which had to be completed. People liked it so much I had to rewrite it so that it would work on a list of case IDs and automatically print out the checklist.

Much more fun than figuring out why Kubernetes is doing something weird for the 3rd time this week.

The problem with a database is that business users struggle to build them, they take longer, and sometimes can’t hold logic in the same way.

Plus in a database you can’t do the same sort of real-time analysis and also pass the document around for other non-technical folk to add to and modify.

In the real world in big companies, people often don’t want to talk to IT because they over-spec and quote what are perceived to be giant sums of money for something that can be created in an hour in excel.

To add to this point, the latest article says that they started building a replacement to the spreadsheet two months ago, so I think that reinforces the speed of a spreadsheet implementation vs anything else in the public sector!

You might not have realized this but they should have started work in January.

You might not have realized this, but a global pandemic was starting in January and they were probably collectively shitting bricks at that point trying to shoe-string together something to help capture reporting, while probably being in a change freeze, and also probably having to do a mass-migration to at home work with the inevitable struggles of that.

>It is the worlds most popular database, despite not actually being a database.

"A database is an organized collection of data, generally stored and accessed electronically from a computer system."[1]

Excel is an organized collection of data, stored and accessed electronically from a computer system. So I would call it a database.

[1] https://en.wikipedia.org/wiki/Database

Excel is not a database, it is a spreadsheet. They are different beasts.

This link will explain the difference between two of them. https://365datascience.com/explainer-video/database-vs-sprea...

What i was getting at is that it fulfills the (very) basic definition of a database, even though it's very bad for this purpose.

No, because the proper solution written by a software engineer will contain more bugs.

it will only be cheaper after the company loses money... no business person ever looked at something and said "it works, but it might not, some day... better spend $XXX,XXX on an engineer to fix the problem".

You have just identified the new market of Spreadsheet Insurance. I am thinking it could be quite lucrative if you get the right salespeople in on the ground floor. I'm willing to bet no funding source is unfamiliar with the demand for this product.

and one more:

"Scientists rename human genes to stop Microsoft Excel from misreading them as dates. Sometimes it’s easier to rewrite genetics than update Excel" https://www.theverge.com/2020/8/6/21355674/human-genes-renam...

Also the post-2008 financial crisis austerity programs based on Excel analysis.


Austerity wasn't based on that study, and it didn't go away once that error was found.

That study was the intellectual justification for the policy.

To be fair, I'm not sure if any of the proponents actually believed (or had read) the study, but it was definitely wheeled out in debates against Keynesians.

Yeah, why there is no "scientific" mode to just turn off all "smartness" has baffled me for years and will probably continue to baffle me for years to come.

Or even an algorithm that can detect that you are using gene names, from the cells around MARCH1 and SEPT7.

I think that's a good way forward, Excel is a fantastically powerful, ubiquitous tool and isn't going away any time soon. If you're going to standardise on a technology, it makes sense to design your systems to work well with that technology. That includes making identifiers that you need to be treated like strings look like strings to your tools. A little bit of thought in advance and maybe some small compromises early on can save massive problems down the line.

A master crafter always adjusts their industry to match the currently popular tools. That's why we still use reins to steer automobiles and aircraft.

Well, not exactly like non-Excel systems are bulletproof either.

The entire Japanese stock market went down last week, and it's not like there was a flood of people on HN bemoaning that. At least with Reinhart and Rogoff et al. you have a responsible party.

As opposed to 'nameless machine failed, and nameless backup machine also failed, and now it's in JIRA so don't worry about it'.


The glitch stemmed from a problem in the hardware that powers the exchange, said the Japan Exchange Group, the exchange’s operator, during a news conference. The system failed to switch to a backup in response to the problem, a representative said.

I think similar to “code smells” in software engineering, using excel for critical data pipeline work can be considered a “data smell”. It’s not bad in and of itself, but it’s a signal that something may have gone seriously wrong with the data engineering process.

On the other hand, its more like "nobody ever got fired for using excel"

I think in this case it’s more like “nobody who would know better than to use excel had the authority to make the decision not to use excel”. Given that this was a government project, my guess is that there are very few technical people involved at the decision making level.

I'm willing to make an exception in this case.

Unfortunately excel is the lingua franca of data exchange in SME

I am sure there is a solution to this problem that could make someone a fair bit of money, but i am not quite sure what it is yet.

That example is overblown, as the system is supposed to be restarted more frequently than that for maintenance. A better example: cars have been frequently rebooting various internal computers while you're driving for decades. That seems bad, but it's similarly irrelevant.

And how about this one: an Excel error was the cause for much unnecessary austerity after the 2008 financial crisis, costing potentially tens (hundreds?) of billions in lost output.


Is it really that hard to spend 5 minutes setting up a decent database? Use sqlite if you really want something simple.

I guess you still have to understand locking (especially on distributed filesystems.) I've certainly seen people mess that up with spreadsheets.

Sure, I'll just tell my wife to learn SQL and some devops while she's at it. After that I'll coach her on how to get buy-in from management on this new system while she neglects her primary duties.

Throwing up a database and integrating it into a workflow/system isn't something anyone can just get up and do. I have to imagine you know that.

Yep! And then when your wife gets that promotion she wanted, she still has to spend time maintaining and making changes to the old database system because nobody else understands it or has the skills to change it.

Was it really a manual process? I read it as some system importing info automatically.

And if it was manual I am surprised that Excel did not complain about adding more than 65000 rows (or saving more than 65000 rows as XLS). If a user gets a warning about possible data loss they should investigate more.

The problem is people can't do this in an office. Nobody has admin rights on their computers. Everybody knows how to use Excel. We all know that if you have to make a ticket to your office IT for such things, you mostly have to wait a long, long time and it never, never works the first time. I guess that is one of the main reasons why the world runs on Excel (and MS Access). It just works, it's just there, no dealing with IT.

Sqlite is a bit of a special case, in my eyes. It's so ubiquitous that it's probably already installed somewhere on the average machine, just not as a top-level application.

Of course, there's a bit of a gap between "it's there on the machine" and "we can rely on it for useful work", but baby steps...

No it isn't, but in any big org the plebs are only allowed to use Office and anything else is IT captured.

Sure, Excel is instantly available. But also you get charting and formulas. Especially with formulas they've done something really right. You can transform a column in a few seconds while keeping the old one, tweaking the new one and having a graphical preview through the chart feature at the same time. Of course every Spreadsheet software can do this, but to me that's the ultimate killer feature

It takes a course to understand how to use a relational database.

MS Access used to come as standard with Office and is actually the perfect solution to many of the problems that businesses use Excel for. It was very rare that people actually used Access, as Excel was far more intuitive and good enough for many projects, especially in the early stages.

I can attest to this. I moved jobs and my new workplace used to manage everything in Excel. Some things still are. Realising Excel could not scale, and with no support from IT for a proper solution, the "tech" guy (good with computers but not a developer) was tasked with creating a "database". What we now have is essentially 3 Excel spreadsheet tables in Access that you can run SQL on and input data into using Access forms. No normalisation, no... well, anything really. We're now coming to the point where Access is creaking at the seams. I'm confident that with properly designed tables and queries we wouldn't be, and I'm no expert either, but it will have to fail before there's a concerted effort to redesign, or, god forbid, IT do their jobs.

Things are never 5 minutes. And it is just as easy to make mistakes with sql as it is with excel.

I have definitely just spent 5 minutes importing csv tables into sqlite so I can run a query against them.
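For the sqlite route, a minimal sketch with Python's built-in sqlite3 module (made-up data; declaring the column TEXT is what keeps the leading zeros intact):

```python
import csv
import io
import sqlite3

# Hypothetical CSV export; in practice you'd open the real file.
raw = "patient_id,result\n0123456789,negative\n0000000042,positive\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (patient_id TEXT, result TEXT)")

reader = csv.DictReader(io.StringIO(raw))
conn.executemany(
    "INSERT INTO results VALUES (?, ?)",
    ((row["patient_id"], row["result"]) for row in reader),
)

# TEXT affinity means the identifiers come back exactly as stored.
print(conn.execute("SELECT patient_id FROM results").fetchall())
```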

Do note that everything other than Excel has failed, UI/UX-wise.

When people want a "database", they fire up Excel, start punching in numbers with solar calculators next to the keyboard, and use their eyeballs to search for strings.

Every time I see a problem like this I can't stop thinking that computer education is terrible everywhere.

We are able to teach almost everyone how to use complex software like Word and Excel. Why can't we teach people how to use a terminal, SQLite, or how to create a very simple Python script?

Well maybe it has something to do with the Windows monopoly and that they only want you to use their proprietary tools. If you learn too much then you will want to be free!!

Whenever you try to do something properly, some person in your organization condescendingly says "perfect is the enemy of the good".

And because Excel is "good" for a vast number of use-cases then people use it for everything.

Hey come now... we both need each other /s

As a data engineer, I think blaming Excel for this is the wrong framing. The real problem here is with the entire process, and specifically, the lack of testing.

Excel is great for many use cases, especially if you need people to enter data somewhere. Its UI is unparalleled in terms of quickly giving something to users that they can understand, mess around with, and verify. It's a very common use case to then need to pull in data from a bunch of Excel files, into one main repository of data (a data warehouse). That can be stored in an Excel file, although more commonly would be stored in a database.

But there are always problems with this process! There can be missing data, there can be weird data conversions because the program/language you're using to parse the data and get it into the database reads things differently than how Excel intended, there can be weird database issues that causes data loss, etc.

Complex systems always, always have bugs.

It is the job of a data engineering team to, among other things, test the systems thoroughly, and put in place systems to test against data loss, etc. It is pretty common, for example, to count the data going into a pipeline, and the data that you end up with, and make sure nothing was lost on the way.
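A toy illustration of that in/out reconciliation (the `check_row_counts` guard is a hypothetical name, and the 65,536 figure simulates the old XLS row limit from the article):

```python
def check_row_counts(input_rows, loaded_rows):
    """Raise if the load step dropped (or duplicated) records,
    instead of letting the loss pass silently."""
    n_in, n_out = len(input_rows), len(loaded_rows)
    if n_in != n_out:
        raise RuntimeError(f"row count mismatch: {n_in} read, {n_out} loaded")
    return n_out

# Simulate the failure mode from the article: 100,000 incoming
# records silently truncated at the old XLS limit of 65,536 rows.
incoming = list(range(100_000))
loaded = incoming[:65_536]

try:
    check_row_counts(incoming, loaded)
except RuntimeError as err:
    print(err)  # row count mismatch: 100000 read, 65536 loaded
```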

Anyone can make a mistake. Any team, no matter how good, especially when they're rushed, can use shortcuts, use the wrong technologies because it's expedient, or simply have bugs. It is the job of the team, and of project management in general, to do all the manual and automatic testing necessary to make sure that mistakes are caught.

The real lesson isn't "Excel is bad". It's not. It's an amazing tool. The real lesson is "Data Engineering is hard, requires a lot of resources", and "all systems have bugs - testing is mandatory for any critical system".

Disagree on the "users can verify" part. Excel is very hard to verify, because of corner cases like date conversion. Formulas and macros, maybe spread out over multiple sheets in hidden-away white-on-white cells, make things very opaque and impossible to verify for most users. After a long-gone colleague has passed on the usage of some Excel file, it is treated as arcane and unchangeable wisdom from the past, never to be really understood or changed; just maybe used as a copy-paste value import into the next generation.

Also, recognized good practices for software development (which any Excel sheet that does more than just a sum() will be) like commenting and versioning are quite hard to impossible in Excel. So even if you endeavour to do it "right" but with Excel, it is just the wrong tool for the job.

I vaguely remember a discussion about this and the problem boils down to Excel sheets being impossible to unit-test.

There might be, for example, a formula going through 20k lines in column F. But the one on line 1138 has a typo and the formula references an incorrect cell. No human will ever go through all the lines to check. Excel itself doesn't check stuff like that. And there are no tools for it either.

Excel actually checks that and gives you an 'Inconsistent formula' warning with a little green corner similar to when you comment a cell.

All you need to do is scroll through the worksheet you have just made.

All you need to do is scroll through 20,000 lines looking at each one

Excel has the ability to highlight cells that contain invalid data with a red circle to make them pop out in the table. Just look for something red.

20,000 is not a lot of lines though, I used to scroll 300k sometimes.

Anything below 500k rows is a 'small table' still.

The point is that's a manual, tedious, process that will be performed with error or completely skipped.

We have computers to do that sort of work.

It is a manual and tedious process if you make it manual and tedious for yourself.

After all, it's people that make computers do this work in the first place.

You can easily create tests for cases like that in excel. For example, if you know that all cells in a column must conform to specific conditions, you can create a new cell with a formula that validates the other cells. I have a bookkeeping excel where the "validation cell" becomes red when it detects errors.
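The same idea scripted outside Excel, as a sketch: assuming a made-up rule that identifiers must be exactly 10 digits, flag every cell that breaks it, the way a "validation cell" turns red.

```python
import re

# Hypothetical rule: patient identifiers must be exactly 10 digits.
ID_PATTERN = re.compile(r"^\d{10}$")

def invalid_cells(column):
    """Return (row_index, value) for every cell breaking the rule --
    the scripted equivalent of a validation cell turning red."""
    return [(i, v) for i, v in enumerate(column) if not ID_PATTERN.match(v)]

column = ["0123456789", "123456789", "0000000042"]  # middle one lost its zero
print(invalid_cells(column))  # [(1, '123456789')]
```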

Excel certainly excels at promoting file formats like XLSX. I would not be surprised if many mistakes came from that, including choosing an Excel format when saving instead of storing data as CSV with the simplest possible syntax. So in that regard, I think Excel is sometimes just bad.

I don't think data engineering is hard or needs much resources. It does need people who care and who will think about things.

Anyone who comes at a problem with the mindset 'this is going to he hard' probably lacks experience and will throw big-data frameworks at it, really screwing things up. The most significant, and valuable, resource needed is thought first, and knowledge+experience second.

All IMO anyway.

I always think that database design is pretty simple, but having worked on numerous databases designed by other, some people get it very badly wrong. "Wrong" maybe isn't the correct word, because you can use the database for what it is needed but it just requires extra joins and hacks at the application level to get around poor design.

Depends on context, you're talking in general, but in particular as the GP post said, there can be pressure from management to rush things, different data formats, bad or missing data and so many other pitfalls. It's hard when it's hard.

You can write a simple test to check a function is working correctly, but how do you make sure your 100,000 item database doesn't have corrupted or missing data caused by the latest pipeline update, especially if the corrupted parts are rare?

Collecting and distributing the results of life-or-death diagnostic tests is specifically one case where you don't want end users to "mess around with" the data though... (making sure the right test results get associated with the right record is surprisingly difficult and rigid software is part of that).

Excel is not bad; the users and institutions which use it for any and every use case are to blame.

The only correct usages of excel (and google sheets) are for user input to other processes, and visualization of data (which should be definitely stored in raw format in some database). And always assuming frequent backups of all sheets and of course manageable datasets. Anything else that includes external processes/scripts which append/ovewrite data in sheets is terrible practice and very error-prone.

The main problem I have seen is that data in Excel is essentially not validated.

Sometimes you get a number, sometimes you get a string containing a number. Sometimes the string contains whitespace that you don't want. And then there are dates, which can have similar problems multiplied ten times.
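A sketch of the kind of normalisation this forces on every consumer (the `normalise_id` helper and its fixed 10-digit width are assumptions for illustration; as noted elsewhere in the thread, if the width isn't fixed and known, the lost zeros are simply unrecoverable):

```python
def normalise_id(value, width=10):
    """Coerce a cell that may arrive as int, float, or padded string
    back into a fixed-width identifier string (hypothetical rule:
    identifiers are exactly `width` digits)."""
    s = str(value).strip()
    if s.endswith(".0"):   # Excel sometimes hands back 123456789.0
        s = s[:-2]
    return s.zfill(width)  # restore the stripped leading zeros

print(normalise_id(" 123456789 "))  # 0123456789
print(normalise_id(123456789.0))    # 0123456789
```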

While I am not in the habit of defending Microsoft, I think it should be pointed out that it wasn't using "Excel" that was the biggest problem, but using "13-years obsolete format for Excel". Not that there was any reason to be using Excel for this, but it's a bit unfair to blame Microsoft for something they never claimed XLS could do, and provided a much better capacity in XLSX 13 years ago.

Again, it's hard to cry too many tears for Microsoft, but it does seem a bit off-target to blame "Excel" for this...

The issue with Excel is that its failure mode is suboptimal. Programming languages will generally crash at runtime when an unrecoverable error occurs (index out of bounds, failed memory allocation, null pointer, etc.). But Excel just throws away your data?

If it fails, it’s often silent. If not silent, hidden. If not hidden, misleading, if not misleading, opaque. If not opaque, trivial.

Excel is a wonderfully powerful tool that’s very bad at handling errors clearly.

It's the PHP of business software. Powers half the market, but boy do mistakes cost you.

Indeed, you have to manually inspect results to verify correctness. If you're lucky maybe your error will result in one of those weird error values being shown instead, but way too often that is not the case

Not to mention that complex formulae are still usually expressed as a bunch of gobbledygook in the cell value textbox, which is about as easy to parse as minified Javascript. And that's to technical users like ourselves.

I had Excel error out when I tried to open 10 million rows.

Excel tops out at a little over a million rows (1,048,576) without specific handling of larger files. I'm not excusing it, just a heads-up to all the junior analysts out there who haven't dealt with the problem (not implying you're a junior analyst, just wanted to point something out to those who aren't familiar with the issue).

No worries, Excel is nice for small/medium sized data and has the UI/Visualization which is nice.

The story I mentioned was because I wanted to look at the data before I started parsing it. I had full expectations to use either sqlite or Pandas.
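For anyone hitting the same wall: you can peek at an oversized CSV without loading the whole thing into anything. A minimal sketch using only the standard library (the in-memory buffer stands in for a real file handle):

```python
import csv
import io

# Stand-in for open("results.csv", newline="") -- file name hypothetical
f = io.StringIO("id,result\n" + "".join(f"{i},positive\n" for i in range(5)))

reader = csv.reader(f)
header = next(reader)                # grab the column names
total_rows = sum(1 for _ in reader)  # stream the rest; no memory blow-up

print(header)       # ['id', 'result']
print(total_rows)   # 5
```

Knowing the true row count up front is exactly the check that would have caught the truncation in this story.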

> Programming languages will generally crash at runtime when an unrecoverable error occurs (index out of bounds, memory allocation failed, null pointer, etc).

JS would like to have a word with you.

In this case, Excel throws an error and says that not all rows have been loaded.

It seems that warning has just been ignored by the user.

Sure, and I think at least the headline ("ill-thought-out use of Excel") avoids blaming MS or Excel itself.

Ultimately tools are built for particular things, and if you choose to use a tool for something it's not built for, and it breaks catastrophically, that's on you.

No. Excel is definitely to blame. A simple error like "cannot import/export more than 65k rows" would have done the trick. Not silently throwing away rows.
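That loud failure is a one-line check. A sketch of what the conversion script could have done (the limit constant is the real row cap of the legacy .xls format; the function itself is hypothetical):

```python
XLS_MAX_ROWS = 65_536  # hard row limit of the legacy .xls worksheet format

def export_rows(rows):
    """Refuse to export rather than silently truncate."""
    if len(rows) > XLS_MAX_ROWS:
        raise ValueError(
            f"{len(rows)} rows exceeds the .xls limit of {XLS_MAX_ROWS}; "
            "refusing to drop data silently"
        )
    return rows

try:
    export_rows([{"case": i} for i in range(70_000)])
except ValueError as e:
    print(e)
```

A crashed nightly job gets noticed the next morning; a silently capped one gets noticed weeks later in the national statistics.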

The premise of excel is very much "don't bug me, just do it". The highly praised UI simplifies away corner cases and errors intentionally. Summing over an empty cell? Most probably zero, why throw an error and confuse the user? The simplified UI makes correctness hard to impossible, so excel is just unsuitable for tasks where correctness is desired.

Interpreting an empty cell as 0 is making an (often correct) assumption about what the user wanted.

Throwing away data without warning is almost certainly never what the user wanted.

You get CSV files as input and have to upload them into the database. You don't need Excel at all, neither in XLS nor in XLSX format.

My bet is the biggest problem here is subcontracting this work to the lowest bidder, presumably from some developing country.

> subcontracting this work to the lowest bidder

knowing a little how things (don't) work in the UK, it's likely subcontracted, but to a company belonging to a mate of the director in charge of the whole thing

I'd take that bet ;-) From the article:

> And it appears that Public Health England (PHE) was to blame, rather than a third-party contractor.

Excel versions from over 13 years ago already opened delimited flat files natively, and threw warnings when you opened a .csv file that exceeded the worksheet row limit. Someone in the distant past was just clever enough to hack together the script (with no error handling) without being curious enough to know it wasn't necessary in the first place.

If they had used the newer format, we'd be talking about this in a few weeks time when the larger limit was reached.

This is a problem of bad coding, and using the wrong tool for the job.

A defensive coding practice would have prevented this from going unseen. Using a database to store data would have prevented such arbitrary limits.
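For scale: SQLite ships in the Python standard library and has no row limit worth worrying about here. A minimal sketch of loading the same feed into a database instead of a worksheet (table and field names invented):

```python
import csv
import io
import sqlite3

# Stand-in for the daily CSV feed -- 100k rows, already past any .xls limit
feed = io.StringIO(
    "case_id,result\n" + "".join(f"C{i:07d},positive\n" for i in range(100_000))
)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cases (case_id TEXT, result TEXT)")  # TEXT keeps leading zeros

reader = csv.reader(feed)
next(reader)  # skip the header row
db.executemany("INSERT INTO cases VALUES (?, ?)", reader)

count = db.execute("SELECT COUNT(*) FROM cases").fetchone()[0]
print(count)  # 100000
```

No clever tooling required; this is roughly the same amount of code as a fragile Excel macro.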

Yeah, something like "Missing Covid-19 test data was caused by using spreadsheet instead of database", would be more accurate.

Nobody is blaming Microsoft. From the article:

> And it appears that Public Health England (PHE) was to blame, rather than a third-party contractor.

Blaming PHE seems to have become a regular occurrence this year. I wonder if it's justified or not.

It’s fun to point and laugh, but is this really just the consequence of an unprecedented health emergency demanding a rush job?

As far as bugs go, it doesn’t sound that bad. They didn’t lose data - they just processed it late? And they spotted it within days/weeks, and have a workaround/correction already? And it’s only the reporting that was wrong, not the more important part where they inform people of results?

I’d rather have this system now than be waiting for the requirements analysis to conclude on the perfect system.

There's a test and trace system that is supposed to phone everyone with a positive test result and get details of all their close contacts (people they sat in an office with, restaurants they visited, etc), who are in turn contacted.

A delay in the test results means the contact tracing could not happen, and people will have been going around spreading the virus they caught off the original, known cases.

Also, the whole test and trace system has been in the news a lot recently here for various failings, and things like this will just further knock people's confidence in it.

> A delay in the test results means the contact tracing could not happen, and people will have been going around spreading the virus they caught off the original, known cases.

A recent piece in The Atlantic argues that we got it all wrong, we should test backwards (who infected the current patient) instead of forwards.

This particularity is caused by the dispersion factor K, a metric that R0 conflates and hides, meaning that the same R0 should be handled differently depending on the K factor.

A related suggestion was that we should focus more on super-spreaders as COVID doesn't spread uniformly. That's why the probability of finding a cluster is higher looking back than forward - most people don't get to actually spread the virus.


You’re right - but the biggest issue is clearly one of experience. Someone was put into a critical role they weren’t prepared for - a position in which their failures cost lives. They don’t have a redundant process, or error checking, or even someone familiar with spreadsheets.

That’s the scandal: it’s the type of basic error that screams “they’re not handling this well.” You wouldn’t tolerate your new SWE colleague asking “What’s an array again?”

If publishing the whole set of data was delayed by Excel crashing - fine. But silent failure because they ran out of rows? C'mon...

I would say it's an error from a rushed (and understandably so) job. I've been involved in a number of Covid-related projects at the US county and state level. The teams and departments are having to move very fast on difficult technical problems. Keep in mind, it's not like they were just sitting around with nothing to do pre-Covid. Basically, when the pandemic hit, everyone got a second job to go along with their day job.

I think that's too generous. It would indeed have been excusable for perhaps the first 2 weeks. But at this point, they've had an effectively unlimited budget, likely access to engineers working for big consultancies and big tech who would have been quite happy to literally drop everything to help out and 6 MONTHS to build a system that worked. And they are still using an Excel spreadsheet.

Well, I don't see why developing an application that transforms CSV files into XLS files would be faster than developing an application that uploads CSV files to a database.

I could see that it could be a rush-job problem, but in this case they're not gaining any time.

Good points

Partly I think the Excel bit is news because Excel (and mistakes made with Excel) is easily relatable to people. But bugs always come up and most other bugs are just as stupid. If it had been an off-by-one loop error in some C code somewhere it would be just as dumb but you'd get none of the facepalm memes all over Twitter.

Off by one errors are all alike; every excel error is an error in its own way.

Not when it mixes up US and UK date formats when you import a CSV.

Not when it turns phone numbers into integers and strips off the leading zero.

This is the root of the problem isn't it? The wrong tool for the job.

Tolstoy keeps on giving ...

They had a budget of GBP 36 million. It's pretty bad.

Serco received £45.8 million in March (not disclosed until April), and I'm pretty sure they've had more since.

What’s your source for that budget? The error was by PHE, not Serco.

You need to back that number up with real facts, as this article is regarding the PHE handling of reporting.

This is what it most likely is. Analogous to starting out on an easy to hack together tech stack, succeeding, and dealing with tech debt when something goes wrong.

I guess it's easier to understand since everyone uses Excel; however, it does end up giving a halo of blame to Excel, as opposed to the human processes around it.

No. There is a difference between a little rush job and an incompetent rush job. I’m only speaking from my own experience, but when put under pressure most competent teams don’t end up throwing away such important data. I’ve been incompetent for most of my career, but for a task this serious it’s inexcusable. I don’t know how things work elsewhere, but checking that the same number of unique records (by some composite key) are ingested as exported isn’t an advanced concept. Know the value of the data you’re handling. Marketing data <> health data. If health data doesn’t line up: you figure out why and solve it. If you don’t have anyone available to do so, then the problem is managerial/structural.
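The reconciliation check described above really is a few lines: compare distinct composite keys on both sides of the pipeline. A sketch with invented field names:

```python
def reconcile(exported, ingested, key=("case_id", "test_date")):
    """Raise if the ingest lost any records, identified by a composite key."""
    exp = {tuple(r[k] for k in key) for r in exported}
    ing = {tuple(r[k] for k in key) for r in ingested}
    missing = exp - ing
    if missing:
        raise RuntimeError(
            f"{len(missing)} records lost in ingest, e.g. {sorted(missing)[:5]}"
        )

exported = [{"case_id": "C1", "test_date": "2020-10-01"},
            {"case_id": "C2", "test_date": "2020-10-01"}]
reconcile(exported, exported)          # identical sets: passes silently
try:
    reconcile(exported, exported[:1])  # one record dropped: fails loudly
except RuntimeError as e:
    print(e)
```

Run after every load, this turns "silently missing 15,841 cases" into an alert on day one.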

That has been the COVID experience in Britain in the last couple of months: the government is spending a ton of money in training new people and creating new processes, which is causing a ton of slowdowns and bugs.

In the grand scheme of things, having problems because of inexperienced people is much better than not having enough people like during April and May.

I wonder if this will be added to the European Spreadsheet Risk Interest Group's (EuSpRiG) horror stories list:


Discussed here: https://news.ycombinator.com/item?id=22431500. (Submitted many times, but that's the only real thread, and only from 7 months ago.)

TIL there is something called "spreadsheet risk management."

Wow. Thank you for this gem of human culture.

The site's cert is invalid. Got a mirror?

Certificates are not applicable, the link scheme is HTTP.

It presents a self-signed certificate if using a browser/extension that automatically redirects to HTTPS (which everyone should be).

If the extension redirects to a broken cert not intended for public use, the extension is broken and should not be used. This is not making things safer, this is training users to click through warnings again. Very much "don't".

The extension is not at fault, all it does is rewrite HTTP links to HTTPS (which should be the default IMO, I agree with the parent commenter). The fault is with the site providing a self-signed certificate when accessed over HTTPS.

All it does is assume that a https site serves the same content and audience as the corresponding http site. That is a broken assumption. The consequences of such a broken assumption are very much the fault of the extension.

> That is a broken assumption

Why is that a broken assumption? Can you name a legitimate reason for HTTP and HTTPS sites to serve separate contents and audiences? I would rather not connect over HTTP to _anything_ nowadays.

CMS, serve the content over http and the admin page over https.

And for sites with noncritical static content, HTTPS ranges from superfluous to dangerous. ESNI isn't implemented yet, and IP addresses are still visible. Content sizes and timing are a dead giveaway for what you are looking at. HTTPS for everything is a simulation of privacy at best, and misleading and dangerous at worst, because there IS NO PRIVACY in the aforementioned cases.

So, what we know is:

« PHE had set up an automatic process to pull this data together into Excel templates [...] When [the row limit] was reached, further cases were simply left off. »

The terrible thing here is dropping data rather than reporting an error and refusing to run.

It isn't clear what piece of software was behind this "automatic process".

Clearly the responsible humans are to blame.

If the software that dropped data rather than failing has that as its default behaviour, or is easily configured to do that, then I think that software (and its authors) are also to blame.

Is there anything in Excel itself that behaves like that?

> Is there anything in Excel itself that behaves like that?

I'd imagine the default error-handling behaviour of 9/10 Excel macros is to throw away data

Oh, my God. "On Error Resume Next" is one of my earliest memories of learning to program for Windows. It never occurred to me that it might result in deaths.
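The Python analogue of that pattern is a bare `except: pass`, and it loses data just as quietly. A small illustration (all names invented):

```python
def parse_ids_resume_next(raw_rows):
    """'On Error Resume Next' style: bad rows simply vanish, no trace."""
    out = []
    for row in raw_rows:
        try:
            out.append(int(row))
        except Exception:
            pass  # swallow the error and carry on
    return out

def parse_ids_loud(raw_rows):
    """Fail-fast style: one bad row stops the run and names itself."""
    return [int(row) for row in raw_rows]  # ValueError propagates

rows = ["101", "102", "oops", "104"]
print(parse_ids_resume_next(rows))  # [101, 102, 104] -- a record is just gone
```

Same input, same code volume; the only difference is whether anyone ever finds out a record was dropped.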

That's an interesting ethical point in CS: sure, you didn't intend the software you hacked together in an afternoon to be used in anything with life-and-death consequences. But that doesn't mean someone isn't going to come along later and use it, without ever even looking at the source, for something very critical down the line.

I wonder how many bleeding edge master branches of GitHub repos, pulled in blindly by someone cobbling something together to meet a deadline, are running in places they probably shouldn't be.

People have died from using unsuitable construction materials or devices that were never designed to be used that way.

It's not the fault of the original designer if he was clearly targeting a different purpose.

Yes and "on error resume next" is what I usually see in VBA code

Curious, in what context are you all still using VBA?

I, personally, wouldn't use VBA at all, ever. I would only consider Excel and Word for very simple use cases. For example, I refuse to use Word for technical documentation. But Office and Excel are everywhere in large corporates; and simple tactical spreadsheets turn into monsters over time. One former employer had a spreadsheet "application" where triggering a sheet recalc (accidentally) would take out an entire grid of machines. I've never been in any position to stop this stuff from happening.

I had to import data from Excel documents into MS SQL Server. Since this was supposed to be a one-time thing, I used the import wizard. Some values were quietly ignored because the cells happened to have different formats than the surrounding cells. They looked exactly the same, so it took me a little while to figure it out. And if I hadn't done some simple sanity checks, those errors would definitely not have been discovered before it was too late.

> It isn't clear what piece of software was behind this "automatic process".

They probably mean a semi-automated process within Excel, where each tab is a day's extract or something similar and they are using external references to other sheets. In any vaguely up-to-date version of Excel, the way you would do this is via 'Get & Transform', which does not have any of these limitations (including the 1m record limit that the news article suggests).

The funny thing is that the latest versions of Excel are brilliant at aggregating and analyzing data and are more than suitable for this task if used correctly (i.e. using PowerQuery). It's just that way less than 1% of users are aware of this functionality - I would assume that even most Hacker News readers don't know about PowerQuery/PowerPivot, writing M in Excel, setting up data relationships, etc.
