I work in healthcare and by far the biggest production issues we've ever run into are people who run Excel spreadsheets through our file processing where the leading zeros have been removed from patient identifiers because that's Excel's default behavior and you CANNOT TURN IT OFF!
EDIT: I have no idea who downvoted my post because what I said is 100% true. We have to tell customers to stop opening CSVs in Excel and then uploading them to us because what they upload could be missing critical data. Excel interprets a number and then formats it as a number, but in healthcare, 10 digit numbers are really strings. Unless your IT team has created an Excel extension and had it preloaded onto your local copy, Excel will remove leading zeros and there isn't a way to get them back if you save/overwrite.
Every single problem I've ever had with excel stems from excel trying to be too clever. Nearly all my problems with importing data into it could be solved if excel left everything as is until I asked it to format values a certain way.
>Every single problem I've ever had with excel stems from excel trying to be too clever. Nearly all my problems with importing data into it could be solved if excel left everything as is
We're biased to bash on Microsoft for being "too clever" but maybe we need a reality check by looking at the bigger picture.
Examples of other software not written by Microsoft that also drops the leading zeros and users asking questions on how to preserve them:
I'm certainly not singling out Excel; this thread is just about Excel, so I thought I'd share my frustration with that. I've certainly had similar experiences with pandas. I've not used the rest of your examples, but I do have similar issues with LibreOffice Calc. As regards pandas, it is certainly easier to sort out the assumptions independent programmers make, by including 'dtype=str'.
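For anyone hitting the same thing, a minimal sketch of that pandas approach (the file and column names are made up for illustration):

```python
import pandas as pd

# Force every column to be read as a string so leading zeros survive.
# "patients.csv" and "patient_id" are hypothetical names.
df = pd.read_csv("patients.csv", dtype=str)

# Or force only the identifier column and let pandas infer the rest:
df = pd.read_csv("patients.csv", dtype={"patient_id": str})
```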
Does it? I think it should treat everything as a string since it can't be certain about the correct type. It could also suggest the correct type rather than applying it without user confirmation.
You want Excel to auto-interpret a table of numbers, that come from a CSV, as entirely text?
Looking forward to the first time anyone tries to use your excel on a table of numbers and then immediately has to multiply everything by *1 (in a separate table) just to get it back into numbers...
YES!! I seem to remember a few versions of Excel ago there was a way to import data and set the data type for each column in a preview window to see what it would look like. And to your point, it's very easy to select an existing column in Excel and change the data type anyway. So why not just make everything a string by default?
You are right on the cause of the issue. But the software does not HAVE to guess the type. If I paste something into Excel, I want what I pasted to be what's in Excel. Changing the type and inadvertently affecting the data isn't something any data software should do by default. And the fact that OP said they can't turn it off means that this is beyond a design mistake, it is a FEATURE. If Excel took itself seriously as a data platform, it would not make a change to any text by default.
But Excel has never been a data platform. It's a spreadsheet.
It's users' fault for using it in ways it was never designed for.
Excel has always been about sticking numbers in boxes and calculating with them.
If you want unmodified string input, input strings into a tool intended to handle them.
Project specifications can be hard. Using 1) .xls files after they were superseded, 2) ANY data transfer method without considering capacity or truncation issues, speaks of incompetence.
You can format the column as text before you paste or you can load data from text and set the types for every column. Excel has features to handle all of these issues. I think its biggest problem is that it's just too easy to use. It will allow novices to do all sorts of things with their data without providing guardrails for common mistakes.
Agreed with this. Excel does have a data import feature that allows us to fine-tune the import process to prevent this kind of mistake.
People just double-click the CSV and then complain that it didn't come out correctly. It's the same situation with scientific research data: researchers don't bother to use escape markers, or blindly open the file without going through the proper import process. Then they blame Excel for that without understanding how Excel works.
Yes, Excel has its quirks. But there are ways around those quirks, and there are thousands upon thousands of guides out there about Excel. There is no excuse for complaining that Excel didn't do what you wanted without looking the information up.
Yeah it does; LibreOffice has ways of recognizing numbers that are meant to be interpreted as strings as well. This is user error, and a programmer trying to blame the program rather than treating it as a process issue. The spreadsheet can't read your mind.
As an intern I was once given a spreadsheet of manually entered daily data, stored by a human being over the previous couple of years. Obviously he'd entered the dates in whatever form he felt like at the time and Excel had parsed them in a variety of godawful ways not least of which was defaulting to American month/day assumptions before UK day/month.
I think it would have been far quicker to just manually write a new column interpreting the dates based on previous/next etc. Instead I spent God knows how long trying to be clever, failing, and being embarrassed that I could not solve this obviously trivial problem.
The right way to open CSVs in Excel is:
Data -> From Text -> Browse to your csv -> Delimited -> Comma -> Next -> Select Text formatting for columns with leading zeroes -> Finish
You can either send xlsx with formatting or csv without. If this were disabled then we'd have another group of people complaining that their dates from CSV are not parsed.
and to enforce that, remove the file association which opens .CSV files in Excel by default. If every machine opened CSV files in notepad a huge percentage of 'Excel' problems would just go away. It could be worse though, try opening and saving a CSV in Word and see what that does.
I used to do migrations between library systems, and often they'd export from the old system as CSV. One rule I gave them was "never open this file in Excel", because they'd always try to be helpful by tidying something and I'd end up with scientific notation ISBNs, dates that were in one form if they could be coerced into US-style month-first format and otherwise left in the original form, barcodes losing zeros, etc. etc.
The same problem happens with UK phone numbers, where the leading 0 gets chopped off
Solvable if there's a space between the dialling code and the number, but when individuals enter their number, they don't always do this
The old ISBN10 standard could have X (meaning 10) as the last character (the last character is a check digit), but ISBN13 (which prefixes 978 or 979 to the ISBN10) is all numbers. The check digit is calculated differently. One obvious advantage is that ISBN13 is barcode compatible.
I would guess that most modern actors in the book business have been primarily using ISBN13 for at least the last decade.
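For the curious, a rough sketch of how differently the two check digits are calculated (illustrative only, not production validation code):

```python
def isbn10_check_digit(first9: str) -> str:
    # ISBN-10: weights 10..2 over the first nine digits, modulo 11; a value of 10 is written 'X'.
    total = sum((10 - i) * int(d) for i, d in enumerate(first9))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

def isbn13_check_digit(first12: str) -> str:
    # ISBN-13: alternating weights 1 and 3 over the first twelve digits, modulo 10.
    total = sum((1 if i % 2 == 0 else 3) * int(d) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)

print(isbn10_check_digit("030640615"))     # -> 2  (ISBN 0-306-40615-2)
print(isbn13_check_digit("978030640615"))  # -> 7  (ISBN 978-0-306-40615-7)
```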
Interesting to see just how many responses to this real-world problem are suggesting that people rewrite CSV files to work with Excel.
They clearly do not understand system integration and the use of CSV text files for data interchange between multiple systems and applications. Hey, JSON and JavaScript libraries are the answer to that, eh?
There are already enough potential issues with CSV interpretation on wrapping strings, escaping characters and so on, but changing the content when a delimiter is found should not be added to that list.
Your bold point is the most important: the default behaviour of Excel when opening a plain-text CSV file is to alter the content for display, applying magic and often-unwanted formatting rules. That should be optional.
It should be possible to open a text CSV file in Excel, view the contents in columns but in the same textual form, save the file in CSV format, open it in another viewer/editor, and still see the same content as the original file.
The problem is that Windows doesn't have a preloaded double-click viewer for CSV other than Excel, if you have Office. The average user just double-clicks. If Notepad or WordPad did any kind of formatting for viewing, it would be less of an issue. So as a dev you have to solve for Excel, knowing that's what 99% of people will use.
If you 'solved for Excel', it would be broken for the other applications that may want to use data from those Excel-specific CSV files, which is generally why CSV was chosen in the first place.
If you want it to be a string then put it in quotes. It's that simple. If the user is too dumb to slow down and look at the import options it is not excel's fault, it's a process issue. Excel can't read a cell and then read your mind and extract the correct option.
> there isn't a way to get them back if you save/overwrite.
Of course there is, in your case. If patient identifiers have a fixed, known length, then you can pad with leading zeros to recover them.
You only have a problem if 012345 and 12345 are distinct patient identifiers.
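A minimal sketch of that recovery, assuming a fixed, known width of 10 digits (the width and values are illustrative):

```python
# Re-pad identifiers that Excel stripped, assuming every valid ID is exactly 10 digits.
def restore_id(value, width=10):
    s = str(value).strip()
    return s.zfill(width) if s.isdigit() else s

print(restore_id(123456789))     # -> "0123456789"
print(restore_id("0123456789"))  # unchanged -> "0123456789"
```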
It is bone-headed in the first place to use numeric-looking identifiers (ones containing only digits) which are really strings, and then allow leading zeros. Identifiers which are really strings should start with a letter (which could be a common prefix). E.g. a patient ID could be P00123.
This is useful for more than just protecting the 00. Anywhere in the system, including on any printed form, if you see P00123, you have a clue that it's a patient identifier. An input dialog can reject an input that is supposed to be a patient identifier if it is missing the leading P, or else include a fixed P in the UI to remind the user to look for P-something in whatever window or piece of paper they are copying from.
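A sketch of that kind of check, assuming a hypothetical format of a literal 'P' followed by five digits, as in the example above:

```python
import re

# Hypothetical format: 'P' followed by exactly five digits, e.g. P00123.
PATIENT_ID = re.compile(r"^P\d{5}$")

def is_valid_patient_id(value: str) -> bool:
    return bool(PATIENT_ID.match(value.strip()))

print(is_valid_patient_id("P00123"))  # True
print(is_valid_patient_id("00123"))   # False - the missing prefix is caught immediately
print(is_valid_patient_id("123"))     # False - so is a truncated, zero-stripped value
```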
The main point in my comment is that instead of shaking your fist at the behavior of other users in the system, such as those who choose Excel because it's the only data-munging thing they know how to use, you can look for ways that your own conventions and procedures are contributing to the issue.
If people are going to "Excel" your data, and then loop it back to you, maybe your system should be "Excel proofed".
The lengths of the identifiers are not known, man. We are dealing with like 200 hospitals. They all have various kinds of Electronic Medical Record systems like Epic or Cerner, or one of the 15 other third-party vendors that all have custom implementations. Healthcare is insane in our country.
First off, I “hate” Excel, but only because people misuse it. It, like so many other products of its kind, is for numbers! Yet I would be hard-pressed to find someone actually using it for numbers. People absolutely insist on using it as a generic table system for holding non-number data.
I recently had to show a 20+ year Excel-using fanatic how to import data from a CSV file so that they could mark the columns containing leading zeros as type Text. The ability exists, but I have found precious few people who know how to actually use the product properly.
Excel supports a variety of data types other than numbers and includes many built-in functions for dealing with non-numeric data. Whatever its history, it is not accurate to say that it is only for numbers.
Regardless of its _additional_ support for other types, those are only in support of its primary role as a spreadsheet. It is designed around the concept of plugging in _numbers_ and calculating with them. Everything else is to make those calculations well notated and human friendly.
Any other use of Excel is bending it into a role it wasn't intended for, and user beware.
And it is all too easy to just go there, since there are soooo many convenience features for those who don't want to learn how to do the tasks well.
If notation were the sole purpose there'd be little need for anything but row and column labels or simple string input. Instead there's a full set of string manipulation tools, embedding of other objects, charts, images, drawings etc.
You are taking the application as it existed 35 years ago and saying it must still be that thing, yet it has had 35 years to evolve far beyond that. Microsoft itself, when it talks about Excel, talks about using it to organize "data", not just numerics data. It has become a more general purpose tool.
People love to use Excel to look at (with their eyes, or a screen reader, i.e., non programmatically) all kinds of data. They use it to list exports from data stores, and then take notes on what they discovered, show and hide data using the filter option, etc. It is just far too easy to use (and misuse) that it ends up being used for any type of data that fits into a table.
Excel has lots of different uses. It can be used for inventory tracking, sales, catalogs, mileage, contact lists, and the list goes on. It is useful in many different fields and industries.
You have to remember, Excel is an extremely powerful beast. It has many specialized features that will handle any data it encounters. I have used Excel for 15 years and I am still finding features that make the process quicker. Of course, Excel has its limits and I am well aware of that.
I mean I thought you said that they should not use excel for these scenarios, but something else. So I thought you had an alternative software in mind?
I use Splunk (our organization pays lots of money for it, and yet it just sits there because people would rather use Excel). There is also Tableau and other products like it. Most of these tools have a free version, as well. They take some learning, but it is well worth it. Of course, taking that same amount of time to actually learn how Excel works is worth the time, too. I guess the issue is that Excel is just too "easy" to use, and hence people using it for years and years and not actually knowing how to use it.
I am not trying to be an elitist about this. It is just that the misuse of Excel (because people do not know how to use it) causes massive issues on a daily basis.
Excel is fine, but keep the data as xls and you won't have problems like dropping leading zeros. As soon as you convert to a simpler file format like csv, you will likely lose data unless you are very careful in both export and import.
With US ZIP (postal) codes, padding with zeroes can actually be ambiguous because we have "zip+4" which is often stored as two fields, but just as often concatenated into a single representation. If you have a zip+4 of "05555-0444" in two fields that each lose leading zeroes, and then concatenate to "5555444" you can't reconstruct it with certainty (00555-5444 or 05555-0444 or 55554-0044).
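A quick sketch demonstrating that collision (the values are taken from the example above):

```python
# If both the ZIP5 and the +4 lose their leading zeros and are then concatenated,
# several distinct originals collapse to the same string.
def mangle(zip5: str, plus4: str) -> str:
    # Simulate Excel-style leading-zero loss on each field, then concatenation.
    return str(int(zip5)) + str(int(plus4))

candidates = ["05555-0444", "00555-5444", "55554-0044"]
for c in candidates:
    zip5, plus4 = c.split("-")
    print(c, "->", mangle(zip5, plus4))
# All three print '5555444', so the original cannot be recovered with certainty.
```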
I had a package get seriously delayed one time because it kept being sent to a city whose 5 digit zip code was equal to [my 5 digit zip code, less its leading 0, and the first digit of my +4]. Fun times.
> Unless you instruct it to interpret the field as a string.
Except double clicking a CSV does not give you the option to do this. At this point Excel already decided to corrupt your data. And guess how most users open CSV files? That's right, they double click.
> I'm wondering, if you expect 10 digits and you get less than that, how difficult is it to add some padding zeroes?
If you expect 10 digits and get less than that, you have corrupted input data. Trying to "fix" this is exactly the sin excel is committing. Don't do that.
This would assume people (many of whom have over a decade of experience using Excel) actually know how to do that. The largest problem with Excel is people.
Excel rules. I still use it daily despite having access to big boy tools (python, SQL, R, etc.) Few tools make it easier to answer quick, simple questions about a dataset. Plus it's still the best way to browse SSAS cubes. Blaming excel for these sorts of problems is like blaming a hammer for not being a screwdriver.
Introducing standards across many companies is incredibly difficult but your main problem is that a lot of healthcare IT systems are very old and not actively maintained, they're essentially on life support. So changing the format of a file, or requiring new import cleaning, is going to break integrations and data sharing in many places. Basically it'd create far more problems than it's attempting to solve.
I guess it depends on which country and which specific part of healthcare you are active in.
In the 'care' domain in the Netherlands I also see that some of our integrations are one-off, but the most important ones do have industry wide standardization that receives updates based on law.
IME numeric identifiers are just asking for trouble. Some systems are sensitive to leading zeros, some systems will zero-pad, <input type="number"> loses zero-padding, etc. Worked with a client once with references to the same entity scattered across two databases, one of which zero-padded the identifiers and the other didn't. Oy vey.
Agree with the problem reported here - the fact that Excel will drop leading zeroes and will turn phone numbers into Exponential format causes many live issues where customers use Excel to work on 'CSV' files :(
The issue I see is that you have people who don't know the difference between a string data type and a number. I keep seeing "Listen to the scientists" and "Listen to the doctors". There are many facets to solving problems, and other disciplines may be better at solving this one. Had you/"healthcare" contacted folks with software expertise rather than being your own experts, this would have been a non-problem.
It is sometimes possible to write code to alter your CSVs to retain the strings.
The first thing would be to write them as: entry1,"0123456789",entry2 rather than entry1,0123456789,entry2. This has worked for me in some instances in Excel where I have to escape certain things inside a string, but I would not be surprised if Excel still messes this up, for example giving the triangle-exclamation-mark box and then helpfully suggesting a conversion to number.
If you want to go further, you can write a routine that alters the CSV, producing something like entry1,hospitalString(0123456789),entry2. Sure, there are problems with this too, but Excel can break a lot of things and I do use the above examples in practice (in the first example I put the double quotes in to escape single quotes in foreign-language Unicode).
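For the first approach, a sketch using Python's csv module (the file name and field values are illustrative); note that quoting helps some consumers, but Excel may still strip the zeros if the file is opened by double-clicking rather than imported:

```python
import csv

rows = [["entry1", "0123456789", "entry2"]]

# QUOTE_ALL wraps every field in double quotes on output, preserving the
# text form of the identifier in the CSV itself.
with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)
```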
Another thing Excel can do is break your dates, by switching months (usually only possible for dates before the 13th of the month, which often leaves a partial conversion in your data, with dates before the 13th swapped and the rest left alone) or by converting dates to integers.
There are data sources that have a clear unambiguous specification of what information is in each field, and exactly how they are formatted, such that you could auto-parse the column. And even have no data entry errors, so that you can rely on that parsing and have clean error free imported data.
There are many more that don't have a clear spec, or even if they do, have every possible variation of data corruption / keying errors / user misunderstanding / total 'don't give a shit, I'll enter it how I want, that's what the computer is supposed to handle' problems in that source data, that make parsing an absolute nightmare.
I've had to manually review a list of 10k+ data readings monthly, from an automated recorder, because the guy who was supposed to copy the collected data files as is and upload them, instead opened each one and fixed what he (badly mistakenly) thought had been recorded wrong. Different changes in a dozen different ways based on no particular logic beyond "that doesn't look right". And un-fireable, of course.
Auto-fixing by padding assumes that the only way to get an ID that's too short is Excel removing the leading zeros; presumably it could also be human error when entering, etc. (in theory I suppose that would be validated in whatever form the number is entered into in the app that generates the CSV, but it's hard to say for sure). Or some other issue in the data processing pipeline could be erroneously truncating the ID number somehow.
Validating the data would at least prevent getting invalid data into database (and presumably this is happening already), but it doesn't actually "fix" the problem, you still then need the original provider of the data to fix what's missing.
These aren't fixed-width fields with a known length. The length of the field is unique to the person sending the file and they don't tell us how long it's supposed to be.
The parent post says: "Excel interprets a number and then formats it as a number, but in healthcare, 10 digit numbers are really strings." - so I was under impression that the length for that particular column is fixed - probably just misunderstood the sentence.
I wonder how human operators figure out if the value is correct, or the Excel messed it up, or the input was invalid in the first place? If it's even possible to do it reliably then probably there is some set of patterns and methods that possibly could be turned into an algorithm... just thinking out loud here, but seems as an interesting problem to tackle...
Validating on upload is sensible and should be done anyway.
However, left-padding is not always going to work, it depends on the data.
Common problems I see are with phone numbers. UK mobiles begin 07nnn. Excel kindly drops the leading zero and then really helpfully treats the long number as being an exponential-format floating point number, which cannot be turned back to the original number.
In our systems, we require country codes in front of the numbers, so UK mobiles should begin 447nnn. That doesn't lose the leading zero but still gets the E-format mess when opened by Excel. Customers often miss out the 44 and use the national 07 format. If that leading zero is missing, it can be quite difficult for a system to reliably determine whether a number beginning with a 7 is an international number for a destination in Russia, or a user error that should have been 447, coming from an 07 data value.
Because Excel is a spreadsheet, for calculating with numbers.
Then there's additional code for dealing with all the inconvenient ways people format things, or want to add text labels, or do things closer to numerical/financial analysis, or all the other extras wrapped around the core "put numbers in boxes and do math".
That misunderstanding is at the core of Excel misuse.
Excel devs, do you hear me? Repent for your sins! Go back to Microsoft and fix those problems. The cost of Excel mishandling numbers and dates must be many orders of magnitude higher than your lifetime salary.
That's fine for an individual working on a specific set of data in a specific sheet, but this isn't just a problem for one individual or a small team that can solve it and move on. It's a systemic problem throughout many vast organisations, with continuous influxes of new personnel and constantly changing requirements. When you get an XLS sheet sent over to you from another team that already made this mistake, it's too late and this happens all the time.
CSV is text. If you mean in Excel, if you opened it in Excel (rather than importing and choosing non-default options), you've already lost the data so formatting doesn't help you.
Yes, I mean Excel. We have CSV-to-XLS import scripts/forms that format identifier cells as text. The data format is standardised. Using templates to do the imports was the dumb part. Microsoft has a Power BI tool if one doesn't want to write or use scripts. Use that. I assume a government agency has the resources to pay for it and for data scientists.
> I assume a government agency has the resporces to pay for it and for data scientists
First, the upthread commenter said "healthcare" not "government agency".
Second, as someone who has worked in public sector healthcare: HA HA HA!
I mean, sure, we have the resources to pay for data scientists (of which we have quite a few) and could probably afford to develop custom scripts for any CSV subformat that we decided we needed one for (though if it's a regular workflow, we're probably acquiring it and importing it into a database without nontechnical staff even touching it, and providing a reporting solution and/or native Excel exports for people who need it in Excel).
The problem is that when people who aren't technical staff or data scientists encounter and try to use CSVs (often without realizing that's what they are) and produce problems, it's usually well before the kind of analysis which would go into that. If it's a regular workflow that's been analyzed and planned for, we probably have either built specialized tools or at least the relevant unit has desk procedures. But the aggregate of the stuff outside of regularized workflows is...large.
> They have file extensions turned off, which is the default, so they probably don't even know what a CSV is
And if, also by default, Excel is set up with an association with CSVs, the CSV file will, in addition to not having an extension to identify it, have an icon which identifies it with Excel.
> I assume a government agency has the resources to pay for it and for data scientists.
Bold strategy there, let's see how that plays out.
Having been in and around military/DoD usage for a long time, I can tell you it's always an uphill battle to get processes to work well, instead of defaulting to whatever happened to get included in the original spec by some incompetent who wasn't even aware of good practice.
So what's the big deal? Just read back the numbers, if it's less than 10 digits prepend zeros. Am I missing something (probably, which means you're leaving something out in the description of the issue)
How on earth is this the top rated comment? I would downvote it if I could. As other people have noted, you can import CSVs values as strings instead of numbers.
Furthermore, losing preceding zeroes in number-typed values is not unique to excel; it is a common feature in all typed programming languages.
> you can import CSVs values as strings instead of numbers
Of course, but the problem isn't that the person who posted the comment doesn't know this - it's that many users of their systems don't know it. Most people are just going to accept whatever defaults Excel suggests and not know any better, causing problems down the line.
No, that's not correct. The parent poster specifically made this false claim: "that's Excel's default behavior and you CANNOT TURN IT OFF!" The false claim is the part that the poster wrote in all caps. He said that this behavior can not be turned off, when in fact it can be turned off.
It can't be turned off, if you're opening a CSV in Excel.
It can be avoided if you go through the Data | Import tools. The complaint is that few Excel users know that the import engine is available, or use it, instead of just opening the file and getting all the default interpretations. Which can't be avoided in the usual Open code path.
>How? Where is the setting to turn it off when initially loading a file?
Confusion is happening because 2 different ideas of Excel using csv files:
- you saying "can't turn this off" : File Explorer double-clicking a "csv" or MS Excel "File->Open" csv.
- others saying "you can preserve leading zeros" : click on Excel 2019 Data tab and import via "From Text/CSV" button on the ribbon menu and a dialog pops up that provides option "Do not detect data types" (Earlier version of Excel has different verbiage to interpret numbers as text)
Quote your strings. The problem is more with how you generate the CSV (which is easy to break in the first place) than with Excel. It's like writing your identifiers as numbers in JSON instead of strings and then complaining that the leading zeros are removed.
Perhaps by "correct way" you meant "dodgy hack to make Excel happy and risk breaking more sensible implementations"?
Excel may predate the RFC but AFAIK MS didn't invent or coin the term CSV, so you can't just say whatever Excel does is correct. The RFC is loose because of nonsense like this, it doesn't mean it was ever a good idea.
How in the world is Excel supposed to know which fields you want to be numbers and which to be strings? CSV doesn't have that info built in unless you surround the number with quotes and you select the right process. Excel isn't just a CSV importer, it reads all sorts of files, and it needs some help if you expect it to work. What ever happened to process and a sense of responsibility and craft in your work?
Choosing based on whether it's in quotes seems like a better solution than using an equals sign. Not everything follows that convention admittedly, but not everything understands the '=' either. Or it could just treat everything as text until a user tells it otherwise.
But none of that is really the point. Because CSV files aren't just for importing into Excel. One of their main benefits is their portability. In other situations column types might be specified out of band, but even if not, putting equals signs before values is unconventional, so more likely to hurt than help. And in the cases it might help, i.e. when you only care about loading into Excel, then you have options other than CSV, rather than contorting CSV files for Excel's sake.
> What ever happened to process and a sense of responsibility and craft in your work?
I actually have no idea what you are on about. I'm talking about the "responsibility and craft" of not producing screwed up CSV files. Why do some people find that so offensive? Yes, it is not inconceivable that there could be some situation working with legacy systems where putting `="..."` in CSVs is, unfortunately, your best option. Sometimes you do have to put in a hack to get something done. But don't go around telling people (or yourself) that it is "the correct way".
We have multiple legacy systems where I work, communicating via batch csv files, some of which are authored & curated by staff in Excel. I can confirm that doing this would be a very bad thing indeed and make our systems grind to a halt.
The reason you use CSV is that you want to use it with software that ISN'T Excel. Otherwise you'd just use .xlsx. No other software uses this convention, and it's not correct CSV.
“Correct” is a strong word. It isn’t defined in https://tools.ietf.org/html/rfc4180 so Excel should not try to be smart and add extensions only they support.
Excel predates RFC4180 by nearly 20 years (RFC4180 is October 2005, Excel 1.0 was September 1985) and this behavior was already cemented when the RFC was written.
As for the actual RFC, it's worth taking a read. Any sort of value interpretation is left up to the implementation, to the extent that Excel's behavior in interpreting formulae is 100% in compliance with the spec.
If CSV were being used just to exchange data with Excel, we probably wouldn't be using CSV. Many systems neither need nor know that ="01" should be treated as the string "01".
If Excel were the only intended consumer, .xlsx would be a preferable file format. At least it's mostly unambiguous.
If you want to write a bespoke CSV generator for an application where you know for sure that the file is only ever going to go straight to an Excel instance, sure.
For all the other uses in the world, that's a breaking change.
I have a list of companies I'd like you to consult for. Coincidentally, they're companies I'd like to work for, and I've quite enjoyed building proper database solutions which replace incredibly hacky/terrible Excel (or Excel adjacent) solutions.
All previous major spreadsheet screwups cost money (the largest being $6 billion in losses due to an Excel formula mishap), but this one is playing with lives :(
5 – London Olympics oversold a swimming event by 10,000 tickets.
4 – Banking powerhouse Barclays accidentally bought 179 more contracts than they intended in their purchase of Lehman Brothers assets in 2008. Someone hid cells containing the unwanted contracts instead of deleting them.
3 – Outsourcing specialists Mouchel had to endure a £4.3 million profits write-down due to a spreadsheet error in a pension fund deficit caused by an outside firm of actuaries.
2 – Canadian power generator TransAlta suffered losses of $24 million as the result of a simple clerical error which meant they bought US contracts at higher prices than they should have.
1 – And the biggest one: basic Excel flaws and incorrect testing led to JP Morgan Chase losing more than $6 billion in their London Whale disaster.
At the same time, the business world runs on Excel. How much money is Excel making?
I've done my share of cursing at Excel at various jobs. At the same time, I am grateful for the quick and easy way it allows me and many others to manipulate data. It's unfair to just cite the costs of using Excel without acknowledging the benefits it brings.
The real question should be whether it would be cheaper to replace it with a proper solution (probably some custom-built application written by a software engineer).
Excel's ease of use is its downfall. It is the world's most popular database, despite not actually being a database. I have wasted countless hours dealing with Excel where something else should have been used. I built a database for a friend recently; I think 75% of the work was cleaning the existing data from the Excel file to get it into the database.
People start using excel when the requirements are unknown, they use excel to understand the requirements while still being functional. To build a "proper solution", you frontload all the requirement discovery to build a system which only you or another developer will be able to update/change.
We blame excel, but excel is really just being used for prototyping and nobody takes a decision at a certain point to move on from that prototype.
I would say that people use Excel because they don't understand how to set up a relational database. It's usually non-technical users that set up Excel, then at a later stage I get asked to create a proper database from it.
>I would say that people use Excel because they don't understand how to set up a relational database. It's usually non-technical users that set up Excel,
I disagree that lack of db knowledge is the primary reason. I'm a programmer and I usually use MS Excel because it's easier than relational databases. I prefer Excel even though my skillset includes:
+ Oracle DBA certification and working as a real db administrator for 2 years
+ MySQL and MS SQL Server programming with raw "INSERT/UPDATE/DELETE" or with ORMs
+ SQLite and programming with its API in C/C++/C#
+ MS Access databases and writing VB for enterprises
The problem is none of the above databases (except for MSAccess) come with a GUI datagridview for easy inputting data, sorting columns, coloring cells, printing reports, etc.
Yes, there are some GUI tools such as SQLyog, Jetbrains DataGrip, Navicat, etc... but none of those have the flexibility and power of Excel.
Yes, a GUI frontend to interact with backend databases can be built and to that point, I also have in my skillset: Qt with C++ and Windows Forms with C#.
But my GUI programming skills also don't matter because for most data analysis tasks, I just use Excel if it's less than a million rows. Databases have a higher level of friction and all of my advanced skills don't really change that. Starting MS Excel with a blank worksheet and start typing immediately into cell A1 is always faster than spinning up a db instance and entering SQL "CREATE TABLE xyz (...);" commands.
Of course, if it's a mission-critical enterprise program, I'll recommend and code a "real" app with a relational database. However, the threshold for that has to be really high. This is why no corporate IT department can develop "real database apps" as fast as Excel users can create adhoc spreadsheets. (My previous comment about that phenomenon: https://news.ycombinator.com/item?id=15756400)
I find people often use Excel (or Access if they're brave) when they have a known problem and can visualise the solution in something they know, like Excel. The crucial factor is implementation time. They can open an Excel and start work immediately, or wait weeks or months going through a lengthy and complex IT procurement process that will take from their budget. So in most cases starting with Excel is a no-brainer. Same reason why a lot of stealth IT gets implemented.
It's often only after years of a business using what has become sacred & business critical Excels, that somebody suggests formalizing it into software. In a business with an IT function, or a consultancy looking for business, it should always be somebody's job to find these Excels and replace them with something more robust.
> Same reason why a lot of stealth IT gets implemented.
Honestly, I miss the days of writing VBA macros which save hours of work a week and being sneered at by the 'official IT'.
I worked in a team in a large commercial bank handling reconciliations with various funds. Some of which had to be contacted by phone to confirm the current holdings. We had a system which would import our current positions and take imports in various formats from funds. Somewhere around 40% of the differences where due to trades which had been executed over the reconciliation date. I wrote a VBA script which pulled in all of the differences and identified trades which where open over the period and automatically closed the discrepancy with a reference to the trade IDs.
Another time I wrote a VBA script which would take a case ID and look it up in a diary system (at the time the only way I found to do this was to use the Win32 APIs and manually parse the fields in the HTML from the system), it would then enter this at the top of a spreadsheet which had to be completed. People liked it so much I had to rewrite it so that it would work on a list of case IDs and automatically print out the checklist.
Much more fun than figuring out why Kubernetes is doing something weird for the 3rd time this week.
The problem with a database is that business users struggle to build them, they take longer, and sometimes can’t hold logic in the same way.
Plus in a database you can’t do the same sort of real-time analysis and also pass the document around for other non-technical folk to add to and modify.
In the real world in big companies, people often don’t want to talk to IT because they over-spec and quote what are perceived to be giant sums of money for something that can be created in an hour in excel.
To add to this point, the latest article says that they started building a replacement to the spreadsheet two months ago, so I think that reinforces the speed of a spreadsheet implementation vs anything else in the public sector!
You might not have realized this, but a global pandemic was starting in January and they were probably collectively shitting bricks at that point trying to shoe-string together something to help capture reporting, while probably being in a change freeze, and also probably having to do a mass-migration to at home work with the inevitable struggles of that.
it will only be cheaper after the company loses money... no business person ever looked at something and said "it works, but it might not, some day... better spend $XXX,XXX on an engineer to fix the problem".
You have just identified the new market of Spreadsheet Insurance. I am thinking it could be quite lucrative if you get the right salespeople in on the ground floor. I'm willing to bet no funding source is unfamiliar with the demand for this product.
That study was the intellectual justification for the policy.
To be fair, I'm not sure if any of the proponents actually believed (or had read) the study, but it was definitely wheeled out in debates against Keynesians.
Yeah, why there is no "scientific" mode to just turn off all the "smartness" has baffled me for years and will probably continue to baffle me for years to come.
Or even an algorithm that can detect, from the surrounding cells, that you are using gene names like MARCH1 and SEPT7.
I think that's a good way forward, Excel is a fantastically powerful, ubiquitous tool and isn't going away any time soon. If you're going to standardise on a technology, it makes sense to design your systems to work well with that technology. That includes making identifiers that you need to be treated like strings look like strings to your tools. A little bit of thought in advance and maybe some small compromises early on can save massive problems down the line.
Well, not exactly like non-Excel systems are bulletproof either.
Entire Japanese stock market went down last week, and it's not like there was a flood of people on HN bemoaning that. At least with Reinhart and Rogoff et al you have a responsible party.
As opposed to 'nameless machine failed, and nameless backup machine also failed, and now it's in JIRA so don't worry about it'.
The glitch stemmed from a problem in the hardware that powers the exchange, said the Japan Exchange Group, the exchange’s operator, during a news conference. The system failed to switch to a backup in response to the problem, a representative said.
I think similar to “code smells” in software engineering, using excel for critical data pipeline work can be considered a “data smell”. It’s not bad in and of itself, but it’s a signal that something may have gone seriously wrong with the data engineering process.
I think in this case it’s more like “nobody who would know better than to use excel had the authority to make the decision not to use excel”. Given that this was a government project, my guess is that there are very few technical people involved at the decision making level.
That example is overblown, as the system is supposed to be restarted more frequently than that for maintenance. A better example is that cars have been frequently rebooting various internal computers while you're driving for decades; that seems bad, but it's similarly irrelevant.
And how about this one: an Excel error was the cause for much unnecessary austerity after the 2008 financial crisis, costing potentially tens (hundreds?) of billions in lost output.
Sure, I'll just tell my wife to learn SQL and some devops while she's at it. After that I'll coach her on how to get buy-in from management on this new system while she neglects her primary duties.
Throwing up a database and integrating it into a workflow/system isn't something anyone can just get up and do. I have to imagine you know that.
Yep! And then when your wife gets that promotion she wanted, she still has to spend time maintaining and making changes to the old database system because nobody else understands it or has the skills to change it.
Was it really a manual process? I read it as some system importing info automatically.
And if it was manual I am surprised that Excel did not complain about adding more than 65000 rows (or saving more than 65000 rows as XLS). If a user gets a warning about possible data loss they should investigate more.
The problem is people can't do this in an office. Nobody has admin rights on their computers. Everybody knows how to use Excel. We all know that if you have to file a ticket with your office IT for such things, you mostly have to wait a long, long time and it never, never works the first time. I guess that is one of the main reasons why the world runs on Excel (and MS Access). It just works, it's just there, no dealing with IT.
Sqlite is a bit of a special case, in my eyes. It's so ubiquitous that it's probably already installed somewhere on the average machine, just not as a top-level application.
Of course, there's a bit of a gap between "it's there on the machine" and "we can rely on it for useful work", but baby steps...
Sure, Excel is instantly available. But also you get charting and formulas. Especially with formulas they've done something really right. You can transform a column in a few seconds while keeping the old one, tweaking the new one and having a graphical preview through the chart feature at the same time. Of course every Spreadsheet software can do this, but to me that's the ultimate killer feature
It takes a course to understand how to to use a relational database.
MS Access used to come as standard with Office and is actually the perfect solution to many of the problems that businesses use Excel for. It's very rarely that people actually used Access as Excel was far more intuitive and good enough for many projects especially in the early stages.
I can attest to this. I moved jobs and my new workplace used to manage everything in Excel. Some things still are. Realising Excel could not scale, and with no support from IT for a proper solution, the "tech" guy (good with computers but not a developer) was tasked with creating a "database". What we now have is essentially three Excel spreadsheet tables in Access that you can run SQL on and input data into using Access forms. No normalisation, no, well, anything really. We're now coming to the point where Access is creaking at the seams. I'm confident that with properly designed tables and queries we wouldn't be, and I'm no expert either, but it will have to fail before there's a concerted effort to redesign or, god forbid, IT do their jobs.
Do note that, UI/UX-wise, everything other than Excel has failed.
When people want a “database” they fire up Excel, start punching in numbers, solar calculators next to the keyboard, and use their eyeballs to search for strings.
Every time I see a problem like this I can't stop thinking that computer education is terrible everywhere.
We are able to teach almost everyone how to use complex software like Word and Excel. Why can't we teach people how to use a terminal, SQLite, or how to create a very simple Python script?
Well maybe it has something to do with the Windows monopoly and that they only want you to use their proprietary tools. If you learn too much then you will want to be free!!
As a data engineer, I think blaming Excel for this is the wrong framing. The real problem here is with the entire process, and specifically, the lack of testing.
Excel is great for many use cases, especially if you need people to enter data somewhere. Its UI is unparalleled in terms of quickly giving something to users that they can understand, mess around with, and verify. It's a very common use case to then need to pull in data from a bunch of Excel files, into one main repository of data (a data warehouse). That can be stored in an Excel file, although more commonly would be stored in a database.
But there are always problems with this process! There can be missing data, there can be weird data conversions because the program/language you're using to parse the data and get it into the database reads things differently than how Excel intended, there can be weird database issues that causes data loss, etc.
Complex systems always, always have bugs.
It is the job of a data engineering team to, among other things, test the systems thoroughly, and put in place systems to test against data loss, etc. It is pretty common, for example, to count the data going into a pipeline, and the data that you end up with, and make sure nothing was lost on the way.
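A toy sketch of that kind of reconciliation check (the counts and names here are hypothetical):

```python
# Compare the number of records fed into the pipeline with the number that
# actually landed in the warehouse, and fail loudly if anything went missing.
def check_counts(source_count: int, loaded_count: int) -> None:
    if loaded_count != source_count:
        raise RuntimeError(
            f"Data loss detected: {source_count} rows in, {loaded_count} rows loaded"
        )

check_counts(source_count=65_543, loaded_count=65_543)    # passes silently
# check_counts(source_count=65_543, loaded_count=65_000)  # would raise
```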
Anyone can make a mistake. Any team, no matter how good, especially when they're rushed, can use shortcuts, use the wrong technologies because it's expedient, or simply have bugs. It is the job of the team, and of project management in general, to do all the manual and automatic testing necessary to make sure that mistakes are caught.
The real lesson isn't "Excel is bad". It's not. It's an amazing tool. The real lesson is "Data Engineering is hard, requires a lot of resources", and "all systems have bugs - testing is mandatory for any critical system".
Disagree on the "users can verify" part. Excel is very hard to verify, because of corner cases like date conversion. Formulas and macros, maybe spread out all over multiple sheets in hidden-away white-on-white cells, make things very opaque and impossible to verify for most users. After a long-gone colleague has passed on the use of some Excel file, it is treated as arcane and unchangeable wisdom from the past, never to be really understood or changed. Just maybe used as a copy-paste value import into the next generation.
Also, recognized good practices for software development (which any Excel sheet that does more than just a sum() amounts to), like commenting and versioning, are hard to impossible in Excel. So even if you endeavour to do it "right" with Excel, it is just the wrong tool for the job.
I vaguely remember a discussion about this and the problem boils down to Excel sheets being impossible to unit-test.
There might be, for example, a formula going through 20k lines in column F. But the one on line 1138 has a typo and the formula references an incorrect cell. No human will ever go through all the lines to check. Excel itself doesn't check stuff like that. And there are no tools for it either.
You can easily create tests for cases like that in excel. For example, if you know that all cells in a column must conform to specific conditions, you can create a new cell with a formula that validates the other cells. I have a bookkeeping excel where the "validation cell" becomes red when it detects errors.
Excel probably excels at promoting file formats like xlsx and such. I would not be surprised if many mistakes, including the choice of xlsx when saving, came from that, instead of storing data in CSV with the simplest possible CSV syntax. So in that regard, I think Excel is sometimes just bad.
I don't think data engineering is hard or needs much resources. It does need people who care and who will think about things.
Anyone who comes at a problem with the mindset 'this is going to be hard' probably lacks experience and will throw big-data frameworks at it, really screwing things up. The most significant, and valuable, resource needed is thought first, and knowledge plus experience second.
I always think that database design is pretty simple, but having worked on numerous databases designed by others, some people get it very badly wrong. "Wrong" maybe isn't the correct word, because you can still use the database for what it is needed for, but it requires extra joins and hacks at the application level to get around the poor design.
Depends on context, you're talking in general, but in particular as the GP post said, there can be pressure from management to rush things, different data formats, bad or missing data and so many other pitfalls. It's hard when it's hard.
You can write a simple test to check a function is working correctly, but how do you make sure your 100,000 item database doesn't have corrupted or missing data caused by the latest pipeline update, especially if the corrupted parts are rare?
Collecting and distributing the results of life-or-death diagnostic tests is specifically one case where you don't want end users to "mess around with" the data though... (making sure the right test results get associated with the right record is surprisingly difficult and rigid software is part of that).
Excel is not bad, the users and institutions which use it for any use case are to blame.
The only correct usages of Excel (and Google Sheets) are for user input to other processes, and visualization of data (which should definitely be stored in raw format in some database). And always assuming frequent backups of all sheets and, of course, manageable datasets. Anything else that includes external processes/scripts which append/overwrite data in sheets is terrible practice and very error-prone.
The main problem I have seen is that data in Excel is essentially not validated.
Sometimes you get a number, sometimes you get a string containing a number. Sometimes the string contains whitespace that you don't want. And then there are dates, which can have similar problems multiplied ten times.
While I am not in the habit of defending Microsoft, I think it should be pointed out that it wasn't using "Excel" that was the biggest problem, but using "13-years obsolete format for Excel". Not that there was any reason to be using Excel for this, but it's a bit unfair to blame Microsoft for something they never claimed XLS could do, and provided a much better capacity in XLSX 13 years ago.
Again, it's hard to cry too many tears for Microsoft, but it does seem a bit off-target to blame "Excel" for this...
The issue with Excel is that its failure mode is suboptimal. Programming languages will generally crash at runtime when an unrecoverable error occurs (index out of bounds, memory allocation failed, null pointer, etc). But Excel just throws away your data?
Indeed, you have to manually inspect results to verify correctness. If you're lucky maybe your error will result in one of those weird error values being shown instead, but way too often that is not the case
Not to mention that complex formulae are still usually expressed as a bunch of gobbledygook in the cell value textbox, which is about as easy to parse as minified Javascript. And that's to technical users like ourselves.
Excel tops out at a little over a million rows without specific handling for larger files. I'm not excusing it, just a heads-up to all the junior analysts out there who haven't dealt with the problem (not implying you're a junior analyst, just wanted to point something out to those who aren't familiar with the issue).
> Programming languages will generally crash at runtime when an unrecoverable error occurs (index out of bounds, memory allocation failed, null pointer, etc).
Sure, and I think at least the headline ("ill-thought-out use of Excel") avoids blaming MS or Excel itself.
Ultimately tools are built for particular things, and if you choose to use a tool for something it's not built for, and it breaks catastrophically, that's on you.
The premise of excel is very much "don't bug me, just do it". The highly praised UI simplifies away corner cases and errors intentionally. Summing over an empty cell? Most probably zero, why throw an error and confuse the user? The simplified UI makes correctness hard to impossible, so excel is just unsuitable for tasks where correctness is desired.
knowing a little how things (don't) work in the UK, it's likely subcontracted, but to a company belonging to a mate of the director in charge of the whole thing
Over-13-years old Excel already opened delimited flat files natively, and it threw warnings when you opened a .csv file that exceeded the worksheet row limit. Someone in the distant past was just clever enough to hack together the script (with no error handling) without being curious enough to know it wasn't necessary in the first place.
It’s fun to point and laugh, but is this really just the consequence of an unprecedented health emergency demanding a rush job?
As far as bugs go, it doesn’t sound that bad. They didn’t lose data - they just processed it late? And they spotted it within days/weeks, and have a workaround/correction already? And it’s only the reporting that was wrong, not the more important part where they inform people of results?
I’d rather have this system now than be waiting for the requirements analysis to conclude on the perfect system.
There's a test and trace system that is supposed to phone everyone with a positive test result and get details of all their close contacts (people they sat in an office with, restaurants they visited, etc), who are in turn contacted.
A delay in the test results means the contact tracing could not happen, and people will have been going around spreading the virus they caught off the original, known cases.
Also, the whole test and trace system has been in the news a lot recently here for various failings, and things like this will just further knock people's confidence in it.
> A delay in the test results means the contact tracing could not happen, and people will have been going around spreading the virus they caught off the original, known cases.
A recent piece in The Atlantic argues that we got it all wrong, we should test backwards (who infected the current patient) instead of forwards.
This particularity is caused by the dispersion factor K which is a metric hidden (conflated) by the R0, meaning that the same R0 would be handled differently depending on the K factor.
A related suggestion was that we should focus more on super-spreaders as COVID doesn't spread uniformly. That's why the probability of finding a cluster is higher looking back than forward - most people don't get to actually spread the virus.
You’re right - but the biggest issue is clearly one of experience. Someone was put into a critical role they weren’t prepared for - a position in which their failures cost lives. They don’t have a redundant process, or error checking, or even someone familiar with spreadsheets.
That’s the scandal: it’s the type of basic error that screams “they’re not handling this well.”
You wouldn’t tolerate your new SWE colleague asking “What’s an array again?”
If publishing the whole set of data was delayed by excel crashing - fine. But silent failure because they ran out of columns? Cmon...
I would say it's error from a rushed (and understandably so) job, I've been involved in a number of Covid related projects at the US county and state level. The teams and departments are having to move very fast with difficult technical problems. Keep in mind, it's not like they were just sitting around with nothing to do pre-covid. Basically, when the pandemic hit, everyone got a second job to go along with their day job.
I think that's too generous. It would indeed have been excusable for perhaps the first 2 weeks. But at this point, they've had an effectively unlimited budget, likely access to engineers working for big consultancies and big tech who would have been quite happy to literally drop everything to help out and 6 MONTHS to build a system that worked. And they are still using an Excel spreadsheet.
Well, I don't see why developing an application for transforming csv files to xls files would go faster than developing an application that uploaded csv files to a database.
I could see that it could be a rush job problem, but in this case they're not gaining any time.
Partly I think the Excel bit is news because Excel (and mistakes made with Excel) is easily relatable to people. But bugs always come up and most other bugs are just as stupid. If it had been an off-by-one loop error in some C code somewhere it would be just as dumb but you'd get none of the facepalm memes all over Twitter.
This is what it most likely is. Analogous to starting out on an easy to hack together tech stack, succeeding, and dealing with tech debt when something goes wrong.
I guess it's easier to understand since everyone uses excel; however, it does end up giving a halo of blame to excel, as opposed to human processes.
No. There is a difference between a little rush job and an incompetent rush job. I’m only speaking from my own experience, but when put under pressure most competent teams don’t end up throwing away such important data. I’ve been incompetent for most of my career, but given the seriousness of this task it’s inexcusable. I don’t know how things work elsewhere, but checking that the same number of unique records (by some composite key) are ingested as exported isn’t an advanced concept. Know the value of the data you’re handling. Marketing data <> health data. If health data doesn’t line up, you figure out why and solve it. If you don’t have anyone available to do so, then the problem is managerial/structural.
That has been the COVID experience in Britain in the last couple of months: the government is spending a ton of money in training new people and creating new processes, which is causing a ton of slowdowns and bugs.
In the grand scheme of things, having problems because of inexperienced people is much better than not having enough people like during April and May.
If the extension redirects to a broken cert not intended for public use, the extension is broken and should not be used. This is not making things safer, this is training users to click through warnings again. Very much "don't".
The extension is not at fault, all it does is rewrite HTTP links to HTTPS (which should be the default IMO, I agree with the parent commenter).
The fault is with the site providing a self-signed certificate when accessed over HTTPS.
All it does is assume that a https site serves the same content and audience as the corresponding http site. That is a broken assumption. The consequences of such a broken assumption are very much the fault of the extension.
Why is that a broken assumption? Can you name a legitimate reason for HTTP and HTTPS sites to serve separate contents and audiences? I would rather not connect over HTTP to _anything_ nowadays.
CMS, serve the content over http and the admin page over https.
And for sites with noncritical static content, https ranges from superfluous to dangerous. ESNI isn't implemented yet, IP addresses are still visible to the eyes. And content sizes and timing are a dead giveaway for the things you are looking at. HTTPS for everything is just a simulation of privacy at best, and misleading and dangerous at worst, because there IS NO PRIVACY in the aforementioned cases.
« PHE had set up an automatic process to pull this data together into Excel templates [...] When [the row limit] was reached, further cases were simply left off. »
The terrible thing here is dropping data rather than reporting an error and refusing to run.
It isn't clear what piece of software was behind this "automatic process".
Clearly the responsible humans are to blame.
If the software that dropped data rather than failing has that as its default behaviour, or is easily configured to do that, then I think that software (and its authors) are also to blame.
Is there anything in Excel itself that behaves like that?
Oh, my God. "On Error Resume Next" is one of my earliest memories of learning to program for Windows. It never occurred to me that it might result in deaths.
That's an interesting ethical point in CS: sure, you didn't intend the software you hacked together in an afternoon to be used in anything with life-and-death consequences. But that doesn't mean someone isn't going to come along later and use it, without ever even looking at the source, for something very critical down the line.
I wonder how many bleeding edge master branches of GitHub repos, pulled in blindly by someone cobbling something together to meet a deadline, are running in places they probably shouldn't be.
I, personally, wouldn't use VBA at all, ever. I would only consider Excel and Word for very simple use cases. For example, I refuse to use Word for technical documentation. But Office and Excel are everywhere in large corporates; and simple tactical spreadsheets turn into monsters over time. One former employer had a spreadsheet "application" where triggering a sheet recalc (accidentally) would take out an entire grid of machines. I've never been in any position to stop this stuff from happening.
I had to import data from Excel documents into MS SQL Server. Since this was supposed to be a one time thing I used the import wizard. Some values were quietly ignored because the cells happened to have different formats than the surrounding cells. They looked exactly the same, so it took me a little while to figure it out. And if I hadn't done some simple sanity checks, those errors would definitely not have been discovered before it was too late.
> It isn't clear what piece of software was behind this "automatic process".
They probably mean a semi-automated process within excel, where each tab is a day's extract or something similar and they are using external references to other sheets. In any vaguely up-to-date version of excel the way you would do this is via 'get and transform', which does not have any of these limitations (including the 1m record limit that the news article suggests).
The funny thing is that the latest versions of Excel are brilliant at aggregating and analyzing data and are more than suitable for this task if used correctly (i.e. using PowerQuery). It's just that way less than 1% of users are aware of this functionality - I would assume that even most hacker news readers probably don't know about PowerQuery/PowerPivot, writing M in excel and setting up data relationships e.t.c.
Each test result creating several rows of data seems like a problem too. In clean data, every observation is one row. It makes working with the dataset much easier. In this scenario, I would expect one observation to correspond to one test result. The multiple rows are then better off pivoted into columns.
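To make the "pivot into columns" point concrete, here is a minimal pandas sketch (the column names and values are made up for illustration, since we don't know the real schema):

    import pandas as pd

    # Hypothetical "long" data: one row per (test, stage) pair.
    long_df = pd.DataFrame({
        "test_id": ["T001", "T001", "T002", "T002"],
        "stage":   ["swab", "pcr", "swab", "pcr"],
        "result":  ["ok", "positive", "ok", "negative"],
    })

    # One observation (test) per row, one column per stage.
    wide_df = long_df.pivot(index="test_id", columns="stage", values="result")
    print(wide_df)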
Looking from an outside perspective I would agree, but in practice we don't actually know what this dataset is, so several rows per test may make sense in the context of this stage of the data processing.
It never ceases to amaze me how often we as developers are quick to point out other people's obviously incorrect decisions, only to defend our own Rube Goldberg implementations 30 minutes later by pointing out that critics just don't understand the design constraints.
Or at least how often my peers do that. Obviously all of my systems and code are perfect.
Nothing seems as easy as another engineer’s problem.
This also amazes me, especially the whining about other people's code from developers. Truly believing that they would do it better. I've even seen inherited code posted to be ridiculed/shamed/bashed in some slacks and subreddits.
If the schema is consistent between rows, and it turns out a test result is made up of several rows because the test is composed of several stages, I would leave it as is until reporting time.
If you pivot prematurely, you could end up dropping data because there are new stages that didn't exist when you implemented the pivot.
> If the schema is consistent between rows, and it turns out a test result is made up of several rows because the test is composed of several stages, I would leave it as is until reporting time.
In that scenario, indeed no action seems to be needed, because each row is one observation: every test stage is an observation. So it would seem to make sense.
One could then argue that each patient deserves their own table (observational unit).
But as other commenters pointed out, this is all speculation.
The article states that it is not known which parts of the country were affected, but I think we can deduce that. Here are the daily cases by specimen date. Dotted line is Saturday's data, solid line is Sunday's. The difference between the two is partly the normal daily update, but around 2/3 of it is correcting the underreporting:
Looks like all regions were affected, but by far the largest corrections are in NW, NE and Yorkshire regions. In particular, NE had looked like cases were declining, but we can now see this was incorrect, and they're still increasing rapidly
Edit: note the most recent 3 days are always incomplete, so any decline shown there is not a real effect.
Nit: presumably the maximum turnaround time for results is the period in which updates to past results should be expected. I think it's about 1 week at present?
Before the recent leap in cases my test took 6 days to come back (negative). My wife's test at the same time came back the next day. At the time they were saying that 72 hours was the expected return time for results. For me it has been a couple of days of steadily worsening coughing, and I gather people take about 3 days to 1 week to show symptoms ordinarily.
So UK results are most likely reflecting infections from 1-2 weeks ago.
Always use checked pre- and post-conditions in a data pipeline.
This simple post-condition would have caught this issue:
The sheet after merge operation must have a number of rows equal to the sum of number of rows for all merged sheets.
Assuming this is a merge of a standardized input, then another post-condition might be:
The number of columns in output shall equal the number of columns in the input.
Might want to check header names, and order as well.
Thinking in terms of universal properties, and putting the checks into production, is better than unit-testing.
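As a sketch of what those checked post-conditions could look like in practice (assuming the pipeline concatenates a handful of per-lab CSV extracts; the file names and the use of pandas are assumptions, not what PHE actually ran):

    import pandas as pd

    input_files = ["lab_a.csv", "lab_b.csv", "lab_c.csv"]  # hypothetical inputs
    frames = [pd.read_csv(path, dtype=str) for path in input_files]

    merged = pd.concat(frames, ignore_index=True)

    # Post-condition 1: the merge must not silently drop rows.
    assert len(merged) == sum(len(f) for f in frames), "row count mismatch after merge"

    # Post-condition 2: output columns must match input columns (names and order).
    for f in frames:
        assert list(f.columns) == list(merged.columns), "column mismatch after merge"

    merged.to_csv("merged.csv", index=False)

Checks like these run on every production pass, not just in a test suite, which is exactly the universal-property point above.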
Which I feel might be indicative of a larger problem. If you are going to deal with data (and its processing), it only makes sense to teach those skills in school itself.
Give them a break. They don't have the luxury of a highly-structured field, purposefully constructed by humans (engineers & mathematicians on top of that) with only 50 years of legacy.
They have to deal with a field that was haphazardly constructed by nature over the course of 4 billion years.
Imagine having to retro-document a code base with 4 billion years of history, created solely by junior developers fresh out of college..
Even just within my own sheet I always build in some sort of idiot-check, the assumption being that I’m the idiot.
Do a SUM and match it against a COUNTIF on another column, or something. It doesn’t really matter what it is but if the data is important at all, I always scatter little checks throughout.
The case described in this article sounds like someone who just doesn’t know how to use Excel. I mean why in the name of VLOOKUP would you override the default and choose to save as .xls? That was a conscious decision. Anyone worth their salt knows that’s stupid.
I’d guess they wrote a script to merge the csv files and export to xls. Xls might have been the default of the library they used in a rush. I think this may be the case with some python libraries that generate excel files, as xlsx is a more complicated format.
We did a bit of work for laboratories this year and csv is not an uncommon exchange format between labs.
In general almost all exchange formats are text based, with labs saying they will upgrade to „modern xml formats“ at some point in the future.
So, seen in this context, a csv or an excel file doesn't really surprise me.
> exchange formats are text based, with labs saying they will upgrade to „modern xml formats“
Worth noting that XML is also a text format. SGML even can treat CSVs as markup. There's nothing wrong with CSVs/TSVs anyway - it's a concise tabular format using only minimal special coding for a record and a field separator, as envisioned by ASCII and EDIFACT. The problem seems more like that there was no error checking in place to capture file write errors, or more generally the use of non-reproducible, manual operating practices which seems common in data processing.
I can see the theoretical purity of this statement, but based on my experience working with CSV files generated by actual non-technical users I have to disagree here.
There are a number of footguns here that are really subtle and the average non-technical user has no hope of spotting them.
Problems that I've seen in the wild, off the top of my head:
* Windows vs. Linux line terminators breaks some CSV libraries.
* Encoding can change depending on what program emitted the CSV file, and auto-detecting encoding is not perfect. For example, Excel for Mac uses Linux encoding by default, IIRC.
* Excel does wacky things when you export a "CSV" in the wrong format; real users use Excel to generate their CSVs, not Python. For example if you import the string "0123456789" in an Excel sheet, it infers "number" and strips the leading "0" when you export. Now your bank account/routing numbers are invalid!
* "What's a TSV?" -- if users use CSV, how do you handle commas in the data? It's nontrivial to train users to do their CSV upload as a TSV.
Etc.
In practice we needed to build a fairly beefy helpdesk article with accumulated wisdom on how to not break your CSV exports, and most users don't read/remember these steps until they experience the trauma first-hand.
I'd say the CSV format is deceptively simple -- it's quite easy to do the right thing as a developer where the source and sink are both code you control, but in the wild it gets messy really quickly.
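For what it's worth, a lot of the footguns above can at least be contained on the reading side by refusing to guess. A minimal pandas sketch (the file name and encoding are assumptions about a typical Excel-produced upload):

    import pandas as pd

    df = pd.read_csv(
        "upload.csv",           # hypothetical user-supplied file
        dtype=str,              # keep "0123456789" as a string, not the number 123456789
        encoding="utf-8-sig",   # tolerate the BOM from Excel's "CSV UTF-8" export
        sep=",",                # be explicit rather than relying on sniffing
        keep_default_na=False,  # don't turn empty strings or "NA" into NaN
    )

None of that helps, of course, if the file was already mangled before it reached you.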
You have to navigate the same basic issues to convert each csv as it comes in to xls as PHE did in the article. They could just as well have converted them to a consistent csv format and stopped there.
Almost none [0] of the problems, such as line terminators, encodings, or escape characters, are inherent flaws in the idea of using a text-based file to represent data. They could be fixed by just a little bit of standardization. But unfortunately, I don't think that's possible; we're stuck here.
[0] Except type conversion, which is a real problem.
Agreed; I think the crux of the problem is that CSV is not a format that supports metadata. And even if you somehow created such a convention/standard, it's impossible to upgrade all the existing users with 10+ year-old Excel installs; that ship has already sailed.
The first CSV file was created in 1983. The first CSV standard was created in 2005[1].
The two decades of CSV surviving as an informal standard means that it takes minutes to make a 95% complete CSV parser and an infinite amount of time to make a 99.99% complete CSV parser.
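A toy illustration of that gap: the "95% parser" is a split on commas, which works until a quoted field contains one; Python's csv module gets it right.

    import csv
    import io

    line = '123,"Smith, John",positive\r\n'

    naive = line.rstrip("\r\n").split(",")
    proper = next(csv.reader(io.StringIO(line)))

    print(naive)   # ['123', '"Smith', ' John"', 'positive']  -- wrong field count
    print(proper)  # ['123', 'Smith, John', 'positive']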
Did it mention that it was in the context of csv or an excel file?
I only say that because, as someone who is painfully aware of the limitations and problems of those formats, I'm similarly aware of getting "that web-guy" on a project who proclaims "let's put things in a modern xml format!", and lo and behold the process is now an order of magnitude slower and the xml format an order of magnitude larger than the simple delimited tabular format or stream.
I'm also painfully aware of the old systems (and how old health systems are) with fixed sized buffers and processes, so I can see how this would happen in the context of a lot of computing.
Edit: I see later on someone mentioning that twitter suggests it had to do with excel file size limitations...
There is nothing wrong with CSV or text-based formats. You can even use AWS Athena to query CSV files stored in S3. It's a good format for data import/export that many systems can natively understand or have tools to parse, provided it's known how to interpret the data.
It was a data pipeline issue. Software has little to do with it. If they received data in json and tried to interpret it as CSV, the same could have happened. I believe Excel even warns when you open a file that has too many rows.
Tools exist, for analysts and engineers (MS Access comes to mind for the analyst, python for the engineer), that would rectify the problem. And I think it's a fair assumption to say that those tools would be readily available.
Kinda sounds like a management issue, as well. No one ever said "hey you know XLS doesn't support all of this data"?
Have you seen the kind of mess Excel can make with a CSV file? The names of several genes were recently changed so that Excel would stop mangling them.
Doesn't Excel by default capture the .csv extension so that it gets called automatically when you try to open the file?
Since Excel is one of the few standard pieces of software that knows how to open CSV, it gets used a lot of times when it shouldn't. There's another post I made comparing Excel to a swiss army knife, and there's a reason for that.
I think our company has a company-wide license for Office. When IT sets up a machine you get it automatically. Microsoft works hard to get those kind of setups to be common-place.
Last time I worked in a big corporate you had to fill in a long form to request a license for anything you needed. If you didn't use it again within a fortnight or so it got yoinked away...
The startups I've worked at since have all been big on GSuite.
Sure it does but it doesn't give a f*k - once you open it in excel, it does its own formatting and will save that incorrect formatting even if you save the CSV as a CSV
Excel removes leading zeros from numeric fields no matter what you do and you cannot turn it off. This is a huge problem in healthcare where many patient identifiers have leading zeros
There's nothing wrong with CSV, especially if you don't have arbitrary-text data (if you do, just don't do CSV). Excel though adds an additional layer of services, which turns it into a nightmare if used as something it explicitly wasn't made to be - a database.
Sorry, but if you run into OOM issues by parsing an XML file, you're using the wrong API.
The DOM for a large XML document will of course take tons of space in memory. The key to parsing XML files quickly and with low memory consumption is to only keep in memory what's necessary, by streaming over the elements.
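For example, with Python's standard library you can stream a large XML file and keep memory flat by discarding each element once it has been handled (the tag name "result" here is purely hypothetical):

    import xml.etree.ElementTree as ET

    count = 0
    for event, elem in ET.iterparse("results.xml", events=("end",)):
        if elem.tag == "result":
            count += 1      # ...process the element here...
            elem.clear()    # free the subtree so memory stays bounded
    print(count)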
Correct. And to add to this: apparently the lost data was whatever exceeded the 65,536-row limit of the legacy XLS format, so the amount of data per file was not huge to begin with. So even a shitty XML parser should do just fine here.
As an aside, this is a very clear explanation of the bug in non-technical language, much better than you normally see on general news sites. Excellent reporting!
When a cell in a spreadsheet refers to another cell it must be finally calculated after the cell it refers to. This is called a Dependency.
Excel recognizes dependencies by looking at each formula and seeing what cells are referred to.
See Dependency Trees for more details of how Excel determines dependencies.
Understanding this is important for User Defined Functions because you need to make sure that all the cells the function uses are referred to in the function arguments. Otherwise Excel may not be able to correctly determine when the function needs to be calculated, and what its dependencies are, and you may get an unexpected answer. Specifying Application.Volatile or using Ctrl/Alt/F9 will often enable Excel to bypass this problem, but you still need to write your function to handle multiple executions per calculation cycle and uncalculated data.
So the limit for Excel is 65,536 rows for the old format, and 1,048,576 rows for the new format (from the article).
Then I wonder, is there any tool that mimics Excel but with SQLite as the backend? The theoretical limit on rows in SQLite is 2 raised to the power of 64 (18446744073709551616, or about 1.8e+19).
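Not quite an Excel-with-SQLite hybrid, but for the row-limit problem alone, a few lines of Python get a CSV extract into SQLite (the file, table and column handling here are assumptions for illustration):

    import csv
    import sqlite3

    conn = sqlite3.connect("results.db")

    with open("daily_extract.csv", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = ", ".join(f'"{c}" TEXT' for c in header)  # keep everything as text
        conn.execute(f"CREATE TABLE IF NOT EXISTS results ({columns})")
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(f"INSERT INTO results VALUES ({placeholders})", reader)

    conn.commit()
    print(conn.execute("SELECT COUNT(*) FROM results").fetchone()[0])

You lose the spreadsheet UI, but you also lose the 65,536-row cliff.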
Mimicking a spreadsheet with an RDBMS would be a neat hack, but I think the better solution is to simply use the database like a database. If an amateur carpenter is trying to hammer screws into wood, you hand him a screw driver, not try to invent some sort of hammer with a twisting mechanism that turns screws when you wack them. Use the right tool for the job, and use it like it's meant to be used. If the workers involved don't understand the right tool, then either train them or hire new workers who do.
> not try to invent some sort of hammer with a twisting mechanism that turns screws when you wack them.
That's a thing, it's a manual impact driver. And sometimes it is the right tool for the job (for example, when you have stuck screws and a regular screwdriver would just strip the head).
apparently part of the problem here was simply the fact that they used the columns [1] (of which there are far fewer, even on newer versions of excel) instead of the rows
Something sort of like this already exists shipped with Excel in the form of Power Pivot, which stores data in an embedded database inside the Excel workbook, and is supposed to scale up to 2GB and hundreds of millions of rows.
As you might see from that link it's semi-integrated with the Excel interface, but the differences in the underlying model show through in some ways (like not being able to edit individual cells or use VBA)
Microsoft actually has a reasonable migration path where you can use Excel backed by Access and then transition to Access backed by SQL Server. But you'd have to recognise that you had a problem first.
The problem is that anyone who isn't a decent coder think excel is a sensible choice for this, because they've seen it in their school or work.
So when the proposals come in, you're going to see one guy who says it's all common sense and we use familiar old excel for everything, and another lunatic who says something called "pigsqueal" is actually the standard, connected to a "frontend" which for some reason is now separate to the "backend". This nutter thinks we have time to write some unit tests and also wants to add an authentication module so we know who uploaded what. And somehow he thinks we need to add logging so we can ensure errors can be tracked and debugged. Amazingly he seems to be suggesting that there's going to be errors in our process.
That would imply non-tech people having some surprisingly deep visibility into technical details. What they really see is Excel vs. some contractor doing "a system".
The reality is that Excel is available today and works, and scales up... well, until it doesn't. Still, you have data entry that everybody in the field understands, that's rock solid (so no need to unit test anything), and has a tried and true authentication module built in (a file sent from a government mail address). There's a risk of user errors, but the system will be working immediately[0]. Deployment can be done by anyone, as it's just clicking on File->New and starting to type data in.
Meanwhile, a "pigsqueal" solution will take half a year to design, produce and deploy (that's in an emergency, a year otherwise). And that's with competent IT people, and not someone who wants to profiteer off the crisis. Then you'll have to train the users, and hope the reality won't necessitate any changes, because they will take a while.
I think plenty of decent coders appreciate the fact that Excel is suitable for a surprisingly wide range of tasks, and that if you spot it widely used somewhere, it most likely means there isn't any comparable alternative available.
--
[0] - Note that the user error that finally happened did not happen in the place where you'd expect it to.
Excel never works for math, at any scale. It's a constant trashfire of "looks fine" until it shows a result someone doesn't want, and then an error is found that changes results and the cycle repeats.
Excel is nice for visualizing and browsing data you already have, and informally searching and sorting for hypothesis generation.
My 4th order Runge-Kutta ODE solver I did in an afternoon in Excel the other day, because I didn't want to pirate Matlab to do a homework assignment, begs to differ :). Not to mention, the entire financial industry and many (most?) small businesses.
Lots of people use Excel for various types of math, and it works fine, if you know what you're doing. Excel not being idiot-proof doesn't mean it's not usable.
And that's all without writing a single line of VBA.
(We can have a separate discussion about some of the newer Excel features, like automatically suggesting what kind of statistical analyses to perform on your data - I consider this to be a potential future source of serious fuckups, as it allows people to easily transform data using methods, whose assumptions and implications they do not understand.)
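For comparison, here is roughly what the RK4 solver mentioned above looks like as a Python sketch (the example ODE and step size are arbitrary):

    # Classic 4th-order Runge-Kutta step for dy/dt = f(t, y).
    def rk4_step(f, t, y, h):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        return y + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)

    # Example: dy/dt = -y, y(0) = 1, integrated to t = 1 (exact answer is e**-1, about 0.3679).
    y, t, h = 1.0, 0.0, 0.1
    for _ in range(10):
        y = rk4_step(lambda t, y: -y, t, y, h)
        t += h
    print(y)

The same few formulas dragged down a worksheet column do the job in Excel, which is rather the point.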
> "looks fine" until it shows a result someone doesn't want, and then an error is found that changes results and the cycle repeats
That's a feature, though. Excel is interactive, which means you get ample opportunity to do sanity checks on results (both partial and final) as you work on your sheet, as well as immediate feedback on corrections. In "properly written" IT systems, this is rarely the case - you end up discovering a problem further down the chain, and have to figure out which component did something wrong, and why.
That IS the dumpster fire. Trust me, the reason I know is that I used to be that guy who thought Excel was a great tool.
I built derivatives spreadsheets, backoffice spreadsheets, trading systems with realtime data, all sorts of crap in Excel.
Really, it's Stockholm Syndrome. People who previously had zero computing power at their disposal think they've found the hammer that solves all problems when they're introduced to Excel, because now they can calculate lots of numbers.
They just get blinded by the revelation that they can now calculate "anything" and are happy to pay whatever the cost is in terms of future maintenance, ease of understanding, etc.
> Excel not being idiot-proof doesn't mean it's not usable.
This is absolutely true, I'd be able to do much better with Excel now than earlier. The problem is Dunning-Kruger. There are too many people who think they can code once they're able to get a bit of Excel going, and they don't know that they can't. Not trying to be condescending, I've been there myself. It's just that you get a lot of "coding is a thing I have to do in order to get to some target", and so people think that once they've finally bashed out their spreadsheet, they've figured it all out.
You bring up some very good points. But to those, I have a question: what is the alternative? Both now (for UK healthcare), and in general?
The "people who previously had zero computing power at their disposal" may be wrong in thinking "they've found the hammer that solves all problems" - but there's literally no other hammer available for them. They're not programmers, they won't write their own software (nor would they be allowed to). Any other option involves so much organizational overhead - both initial and ongoing - that it's a non-starter.
I agree that people routinely use Excel way beyond their own skills. But I haven't heard of any viable alternative.
One part is knowing how important correctness actually is in your case. A lot of people on HN are perfectionists, and there's a lot of rhetoric going around our industry about avoiding errors and not breaking things; in reality business processes have always had to be error tolerant (because until recently they were always done by humans) and a quick solution that breaks sometimes is often more valuable than a slow solution that doesn't break.
Another piece is pervasive auditability. Any result should come with an explanation of where it came from; "Bob did some calculations in his head and he reckons the answer is 7" would be acceptable for some kinds of business decisions, while for others it needs to be more like "Bob followed the procedure specified in the XZY institute handbook, page 456". Somehow we've let all that go out the window, partly because people who don't understand computing are managing organisations that deeply depend on it. But you don't even really need computer literacy; what you do need is the same kind of scepticism that you'd apply to any other piece of work.
Managers need to manage. Some of the problem is just people lacking the necessary skills (and a lot of that goes all the way to the top: the UK government doesn't have the wherewithal to hire skilled computer professionals because at every level the people on top don't have the skills to assess whether the people below them are any good), but a lot is a misplaced perception of computers as infallible.
Excel is an excellent hammer. A spreadsheet is a system to develop pure functional graphs and to interact with them in real time. I spent lots of time on finance spreadsheets myself and also thought it was Stockholm syndrome, but when I transitioned to working on software I came to see exactly why spreadsheets are entrenched the way they are, enjoy seeing them used creatively, and would never wish bespoke software on finance people. And good for them, they seem to know better, themselves!
> It's just that you get a lot of "coding is a thing I have to do in order to get to some target", and so people think that once they've finally bashed out their spreadsheet, they've figured it all out.
I have a friend who wrote a 641-line long bash script to automate a web site. It doesn't use subroutines anywhere, the body for the program is a 550 line long loop with multiple if statements and loops inside it.
He thinks the program is maintainable and easy to understand because he didn't have particular difficulty writing it.
He attended the school of self-reliance. He should be applauded for tackling the challenge and doing the job himself, where otherwise it would likely have not been done at all.
Obviously, he's not a programmer. The script, for us, is bad. Perhaps if something important for him depends on this script working, he should pay a software developer to spend some time cleaning it up.
He's entirely self taught. He's not stupid or anything, it's just that he hasn't had the experience of maintaining large codebases, attempting to modify his unmaintainable code or attempting to modify other people's unmaintainable code.
Well why are these people wrong? You had a thing going but you didn’t complete the thought: if Excel works and they can now calculate “anything” without developing software and deploying containers, where is Stockholm Syndrome? Why are they blind? What should they use in finance? Jupyter notebooks?
Having worked in the financial industry, my first job was writing a system to cover for a $10m fuck up in excel where yahoo exchange rates stopped updating and the same spreadsheet was used for a few weeks until someone noticed "Shit, we're losing a lot of money here.".
Excel is the right solution in the same way playdoh is a valid building material.
I know individuals / hobbyists who do this, which is fine. Financial orgs should know better, especially if it is actually "critical." How did this get through any sort of review? How do you know Yahoo is authoritative?
Was it good enough? I would argue that if it enabled your firm to make $12M before being the proximate cause of losing $10M, and the alternative was "we don't win $12M nor lose $10M because we can't afford the project" then I agree it was good enough.
Pretty common for finance in the 00s and 10s to be honest. It's slowly getting better, but until you fire everyone who isn't on board with this and replace them with people who can at least script - at every level of the business - you're going to have the same problem.
That would be good. But to be honest, Excel is scripting - in a 2D FRP REPL (something structurally more advanced than most programmers are using day to day). If we're changing hiring requirements to test for bash/Python/Powershell competency, why not actually test for Excel competency?
Maybe that's the problem? Excel is so easy to start with that people with no experience think they've mastered it, and the industry doesn't seem to have specified any best practices, much less testing the interviewees for their knowledge of them.
I really agree with this. Excel/spreadsheets are great and should be used whenever possible instead of some bespoke solution that will always be worse unless millions and millions of dollars are sunk into it, and even then, will probably be worse, given the state of software development in 2020.
Like all tools, you have to know how to use Excel. If this is the only error that’s come up, well, so many other problems would have come up with a bespoke solution that Excel is still miles ahead in my book
If you don't already know, Octave is a very good free and open source Matlab compatible interpreter and solution. Was of great use in my time in college.
Back in the day, it wasn't "very good", at least not in terms of getting it to work. That RK4 in Excel happened because I got extremely annoyed at Octave.
Holy shit, how much does it cost to rent out your crystal ball for a few hours? Because you got it 80% right.
I was 19 or 20 back then, and Windows was my main OS - though I did have some experience with Linux as well (running several distributions for desktop use, as well as working with Cygwin and SFU on Windows), and I was a relatively proficient C++ programmer, having spent ~6 years of pretty much all my after-school time coding game engines.
So it's not that I couldn't make it work - I eventually did. But it was so rough around the edges that I gave up in frustration twice.
And yeah, these days, Linux is my daily driver (well, technically Emacs - Linux distros are just various flavors of Emacs bootloaders for me).
That is patently untrue. I've borrowed and lent money based on Excel spreadsheets, more than once. That's small-scale math with real-life consequences, and Excel is an excellent tool for the job.
You're right, but if you asked me to do something in GnuCash which I never used or do the equivalent in Excel with formulas I'm already familiar with, I'd go with Excel in a pinch.
This is why Excel is so well known, because once you get the hang of it, it becomes a great tool for something small.
Excel is great in my industry (slot machine game/math design). It is fantastic for doing game calculations and there are reasonable ways to do error handling/checking for correctness.
I could say the same about programming a web app without writing tests/spending some time on system architecture. It looks fine until it doesn’t work.
In both instances of development (I argue constructing an Excel workbook in my line of work is very similar to programming) there are ways to mitigate risks by doing things similar to “writing tests”.
I wrote a finite capacity planner for a pie factory in Excel v6. It made my Pentium 60 run a bit warm for a while. It took about 18 months - I was the only IT bod in the place. I also learned basic networking and installed a pair of hubs (yes hubs) with a 10BASE2 "backbone" and 10BASET to the desktop. I learned the new Novell 4 stuff, NDS etc and wrote login scripts etc. When I started, Planning were using Lotus 1-2-3 for DOS and some amazing feats of keyboard jockeying. When I finished with them, they cranked the machine and then used their experience to tweak the plan a bit. Run it past the Production meetings, feedback and twiddle a bit and send it out to the Team Leaders, Quality and the rest. If things started to look different, then run the plan again etc. I borrowed my Dad's logistic textbooks to learn about forecasting (v).
Getting from infinite capacity planner to the finite version took quite a while!
I don't think you can fault the tool. VBA in a spreadsheet gives you a lot of power but it needs discipline to wield correctly. I used to have a row at the bottom of all my tables with the word "End" in tiny text in each column, always formatted white on red. All my routines that ran down the table to look up and do something would always look for that signal that the end had been found. Nearly all formulae were entered by VBA. I had auditing routines that would test the various sheets for errors - I suppose I "discovered" unit tests. One of them looked for a row of text with specific formatting ... Another obvious check is having row and column sums cross checking each other. I (re)discovered loads of little things like that.
With care a spreadsheet can be quite handy for all sorts of tasks but please don't equate the ill advised monstrosities you (and I) might have come across in the past with a fault in the tool itself.
Anyway, there is a lot more to this story than that and back then I had a IBM System/36 running the show as well to worry about. Twinax is a right old laugh to deal with. I remember going to Eng and asking to borrow a spanner and a soldering iron - "but you're Planning, what do you need those for".
Sorry, started waffling 8)
(v) How to forecast demand from the multiples in the UK for pasties, sausage rolls etc, back in the day. There are two cycles one is weekly and the other is roughly annual, with peaks and sometimes spikes at Easter and Christmas and some upticks at bank holidays. The weekly one literally looks like a sine wave, the annual one is a bit more involved. As a first go, take the last three orders by day of week for a product and calculate an exponentially smoothed forecast for next week. For example take the last three Mondays to get next Mondays's forecast. Bear in mind that you need to prep, make, bake, chill and wrap the product and ship to depot with about seven to 11 days shelf life and it takes something like one to three days to do that. You are always making to forecast, which is quite tricky. This was about 25 years ago but Asda, Nisa, Lidl etc used to take our forecast and fax/EDI it back as an order without changes.
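As a sketch, the forecast described in (v) is only a few lines (the orders and smoothing factor below are made-up numbers):

    # Exponential smoothing over the last three same-weekday orders, oldest first.
    def smoothed_forecast(orders, alpha=0.5):
        forecast = orders[0]
        for actual in orders[1:]:
            forecast = alpha * actual + (1 - alpha) * forecast
        return forecast

    last_three_mondays = [1200, 1350, 1280]       # hypothetical pasty orders
    print(smoothed_forecast(last_three_mondays))  # forecast for next Monday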
Not to mention the "system" solution will have to go through its own debugging and will undoubtedly have bugs in it for some time after initial release.
On top of that, there is STILL room for human error as a user is still the one actually inputting data and designing the experiment/data flow.
I'm going to get mauled on this forum given the audience, but parent comment reeks of the technical elitism on this forum and the tendency to immediately condemn anyone who is using excel.
Excel works great - I've personally built some very complicated models that work just fine. Goes through the same process of QA and line-by-line checking of requirements.
Not sure what the big deal is - mistakes happen in excel or otherwise.
The same way that engineers saying you shouldn't use dung to build houses are guilty of architectural elitism.
Excel is the wrong tool because it mixes presentation and data, to the point that geneticists had to change the names of genes so their spreadsheets would stop turning them silently into dates and then destructively changing the original data[0].
>Excel works great - I've personally built some very complicated models that work just fine.
I've personally cleaned up after people like you, which cost the business millions of dollars because they didn't realize how unqualified they were for doing their jobs.
>Goes through the same process of QA and line-by-line checking of requirements.
Somewhat difficult as excel is not line based. "Oh yes, these constants are on a different sheet, if you just click here, here, here and here you can see where we get the original data, we multiply it by 1.0 to make sure it's not a date over there and then we feed it back ..." if excel was sanely serializable you could easily version control it and see meaningful diffs between check ins. Double points if you could run tests on sheets and run them in batch mode easily (I said easily).
Except you can do it if you know what you're doing.
No one is saying to manage millions of records etc.
But you can get more mileage out of excel than people think - and people who half understand it are the worst, because they know just enough to be convinced it doesn't work.
Custom solutions are great - but aren't always the answer and with proper process Excel will work just fine in many use cases.
How can you achieve anything equivalent to unit testing? How can you do any kind of change control (unless you treat every edit to the sheet as a change)? Excel does some things well but for anything that needs to work robustly as part of a process (as opposed to one-time exploratory data analysis) it lacks things that were basic in 1996. It's an "unsafe at any speed" situation: it may work as long as you do everything right, but it's missing basic safeguards as soon as you make a single mistake; I'd put it in the same category as using a memory-unsafe language.
>Except you can do it if you know what you're doing.
Ye gads, why didn't we think of that in any other programming language? If we know what we're doing we can do whatever we feel like, and if we get the wrong answers, it was obvious that we didn't know what we were doing which can be fixed by just knowing what we were doing!
Excel's only use case is for data that fits on one screen, or an exploratory poke at the data to see what's in which column and if there are any patterns you can eyeball. Then you put those hunches in a script and start doing the work for real.
This geneticist thing is a complete misunderstanding. Excel is tuned to auto-convert things that look like dates to dates. The best fix would be being able to disable this, but Microsoft's research shows that more users need it.
Meanwhile scientists use Excel for lots of stuff, since it works. It just has the not-so-nice "feature" of those conversions.
So, option 1: throw away the tool you use every day, or option 2: change the gene name.
Option 2 makes sense, although everyone would be happier if Microsoft gave an option to toggle that auto-conversion off [there is an option to properly load CSV files, but people don't use it, since it takes more time...]
And I've cleaned up after many, many shitty engineers, one example being engineers who want to use a sledge hammer to drive a finishing nail and again, that's just one example.
And if you want to get personal to that level, they all have the same know-it-all personality and think they write MUCH better code than they actually do.
Pretty sure they cost companies even more money.
Luckily for me, I have the practical sense to know when to use what tool and so far it's worked out just fine. I wouldn’t use excel where not appropriate, so kindly don’t make assumptions.
But more power to you, keep shitting on tools that are more sophisticated than you probably think.
People who half know excel are the worst; they are the ones who know it just well enough to think it doesn't work at all.
Custom solutions are great - but aren't always the answer and with proper process Excel will work just fine in many use cases that you might not expect.
I could blab on about this forever, because, you know, there is a lot of gray area or whatever.
Imagine asking people to use the custom solution sledge hammer of python, git, unit tests and sqlite. Not one off the shelf piece of software there.
There comes a point where using a spork to drill holes in walls is not the right solution. That you're shitting on people pointing out the obvious tells me you're still stuck in the 90s windows developer mindset.
The problem is you only hear about botched excel jobs - you never hear about all the successful projects, so when you hear a horror story you just jump to the conclusion excel sucks.
Ok, ask a geneticist to build what they need with their “off the shelf” parts you mentioned vs 90% of the time being self sufficient with excel and allowing them to, you know, do genetics work. (The other 10 percent being where it makes sense to build something more custom).
Reminds me of the classic HN post when Dropbox had just launched where the user exclaimed “this is just a mounted blah blah using subversion blah anyone can do it” (paraphrased)
That old Dropbox post summarizes this mindset quite well.
And if being practical with my head not in the clouds making sane business AND technical decisions is what the 90s were like, man I missed out.
I guess at the end of the day we’re arguing over where you draw the line regarding when to stop using excel and graduate to something different, a blurry line that at the end of the day is a judgement call.
Oh and for the record I wasn’t even arguing against or for this particular use case in this post - speaking more in general terms.
I worked with a couple. They stop being geneticists rather quickly because the salary for someone who can use python well is about 4 times that of someone working on genes. Keeping them away from coding is now an executive decision since university departments can't pay competitive salaries.
The dropbox post was right. Knowing someone who wins a lottery is no reason to conclude spending all your money on lottery tickets is a good investment. Which is as much as I said in that thread when it happened. That they were solving a problem for idiots comes with the problem that idiots are too stupid to realize when a problem is solved, so you need to focus on looking like you've solved it instead. Funnily enough dropbox spent herculean amounts of effort on polishing the UI.
>And if being practical with my head not in the clouds making sane business AND technical decisions is what the 90s were like, man I missed out.
> Not to mention the "system" solution will have to go through its own debugging and will undoubtedly have bugs in it for some time after initial release.
Doing code reviews in Excel is hard even if the developers are pathologically disciplined. Most of the time it's impossible.
And so is debugging.
It's very unlikely any such system would face these limitations and would silently ignore data the same way Excel did.
> It's very unlikely any such system would face these limitations and would silently ignore data the same way Excel did.
I’ve seen plenty of production systems that ignore or hide errors. Sometimes they’re still logging them, but it just goes to some log store or file that the team doesn’t check until their customers or support team inform them that it’s broken.
Good practice? No. But there are plenty of ways to mess up a non-Excel system and get something that works worse.
I'm sure that a developer who's determined enough can mess up any system. And Excel is by no means the only system with that kind of dangerous default (I remember MySQL 4 had very similar behaviour, as did early versions of PHP). But the industry knows this is far from best practice.
Yep Excel is a solid choice for some systems. Business people know it and it's a hard sell to tell them you want more time / money to use something they see giving them less.
> Goes through the same process of QA and line-by-line checking of requirements.
That's simply not possible unless you have a whole system around excel running a test harness.
You can maybe do it if you are super dedicated with locked cells and formal double-blind QA passes which maybe exist somewhere but not in the vast majority of operations.
Hi, it looks like this alphanumeric constant is the date format used by the Democratic People's Republic of Arstotzka for the year 127 BC. I have changed the encoding for the file to Arstotzka standard and overwritten the original.
> that's rock solid (so no need to unit test anything)
that is crazy talk, the amount of user error that can happen in an Excel spreadsheet means that you would need to do crazy macro work in order to maintain the integrity of the data
I would be much more upset to find out a government employee on a deadline used Google Sheets or Airtable to share my medical data instead of an Excel doc on a secure government file server.
" For customers who are subject to the requirements of the Health Insurance Portability and Accountability Act (HIPAA), G Suite and Cloud Identity can also support HIPAA compliance"
"UHS says all U.S. facilities affected by apparent ransomware attack
Computer systems at Pennsylvania-based Universal Health Services began to fail over the weekend, leading to a network shutdown at hospitals around the country."
Not that HIPAA applies in the UK (and at this point it's not even certain that Google can be GDPR compliant even if they wanted to), but that bit you quoted applies to G Suite customers who have to enable that thing in the account settings.
It is implausible that every hospital, clinic, lab and other medical organization in the UK would sign up to G Suite and deploy it to every one of their employees; or that the British government would negotiate some kind of procurement contract with Google quickly enough for all that to be practical.
Even absent those issues, Google Sheets has a limit of 18k columns and 5m total cells; in this case the issue was that they hit Excel's 16k column limit. Not much of an upgrade there.
Having looked at the marketplace for HIPAA compliant solutions for educational therapy patient record keeping (patient CRM, whatever the industry term is), I'd feel so much safer having my data in the hands of Airtable than the absolute dumpster fire that was every offering I saw.
I don't really agree with much of google any more but I was at one point fairly involved in a project there. They do have some more secure systems for government use that aren't the crazy leaky ad cesspool that gmail and docs are.
The docs are used as a data feed into the advertisement systems.
I've heard on the grapevine that several fortune 500 companies are on O365 because Google wouldn't disassociate docs from data gathering, even at the level of high touch high paying customers and they rightfully consider their internal documents to be part of their IP. Microsoft says "sure" and flips a secret "don't even gather stack traces on the server" flag for these customers.
Extremely low performance. Obscure "sharing" workflow instead of obvious "mail a file". Cloud dependency that needs to be vetted for security (EDIT: and data privacy, as 'sjansen pointed out), and policies regarding cloud accounts on government e-mails. And in case of Airtable, also for budget.
I'd be surprised if either of the above could handle 65k rows in a single table without becoming near-unusable.
Look, you’re right that Excel shouldn’t be used for this. But even then, Excel definitely could be used for this.
We’re not talking big data here. We’re talking about a finite number of possible test results from a finite set of testing locations grouped on a daily basis. Let’s say 1000 test locations, where each location has a daily number of tests performed divided into “positive”, “negative”, “inconclusive” and maybe a few other options. That’s ~5000 data points per day. Should not be any problem.
However, you need to properly plan ahead about how you design your spreadsheet, how to format the columns, how to keep it performant and how to not save it in some godawful 25 year old memory-dump dumpster fire of a file format...
Come on, this is a government department of a first world country. They have databases for all sorts of things. They're not using excel for any other NHS databases, or the terrorism watchlist. There are plenty of people capable of making a good decision here.
I think you let them off too easily by just assuming they're dumb. This a bad decision by people who definitely should have known better.
Yep. If the UK is anything like Australia, they will have a small army of contractors who specialise in building SQL databases for various important and pointless requirements for the government. These contractors are used to working on giant, sensitive data, databases with insufficient allocated time. It should have been pretty much trivial for the UK govt to solve this.
And yet, I have first-hand witnessed a scenario like the one described in an AU government department. A small system developed by business (one person who had subsequently moved on) using Excel and Word and macros to keep track of information and generate letters to "customers" had become an essential tool, but unsupported and undocumented. When we (IT) recommended redeveloping it in Java with an SQL database (and documentation and testing and backups), they balked at the cost. Partly because the original person did the work "in their spare time", so the business never saw the actual costs. And partly because there are zero infrastructure costs with Excel and Word. There was very much a feeling that IT massively inflate the costs and will build a Rolls Royce when a Ford will do. It was a house of straw vs house of bricks situation, but they were happy with their straw house, they thought their only problem was that the little pig who built it was no longer there.
The problem with my Private sector Australian company is everything IT related is aggressively outsourced through various networks of contractors.
In the old days when the relevant business application was written, hosted and supported in house there was a clear chain of responsibility I could pick up the phone and I'd have a direct line to the person who "owned" the application.
Nowadays if there is a problem it's pick up the phone talk to helpdesk get assigned a ticket number and get the buck passed between different teams, The database team will blame the server team, server team will blame the networking team, networking team will reply to ticket with 'looks ok no problem on my end' and the ticket will get closed without resolution.
From what I can tell there are a bunch of incentives in the support contract around how quickly support can close out tickets, so rather than trying to fix the problem support try to do everything they can to farm the ticket off to someone else so it won't impact their metrics. The whole thing feels maddening and has to be rather inefficient.
Victoria (Australia), until a few weeks ago, did contact tracing via hand-written notes and faxes. Sure, it's not going to overflow at 16,000-odd cases, and doesn't require a DB admin and test cases, and all of that. But it performs at a fraction of the speed, and is one of the major reasons the second wave here got out of hand. Just because that army of contractors exists doesn't mean the relevant departments know of its existence, or decide to use it.
I'm pretty sure the UK does have actual software with actual databases backing up test and trace, some of which was outsourced - in fact it looks like our opposition party and opposition media are trying to blame this failure on outsourcing, even though that has no foundation in reality, mostly because this plays well with their supporters for reasons too tedious and political to explain. Unfortunately, this happened at a kind of interface between different systems designed to take in results from both the Pillar 1 NHS testing and Pillar 2 Lighthouse testing for the general public, deduplicate them, and spit them out to the contact tracing system and the government dashboard. These are separate systems run by separate organisations for various not-so-interesting reasons to do with the structure of the healthcare system and the history of Covid testing in the UK.
> They're not using Excel for any other NHS databases
I assure you that you're wrong. The SaaS I work for replaces a suite of independently re-invented Excel files used in conjunction with other SaaS. NHS trusts are our main customers, I can only assume the NHS is full of excel spreadsheets.
(Technically speaking PHE is not part of the NHS, it has more in common with the civil service in some ways.)
The dubious wisdom of choosing Excel for this aside, how many people working in (1.) computing and (2.) statistics saw a file with exactly 65536 rows, with nary an eyebrow raised? And I mean the first time they saw it, not the second, third or fourth day in a row.
Your point is incredibly good, and made me laugh too. It illustrates precisely why people, both technical and non-technical, frequently pass over the massive scalable hammer-factory system architecture that takes five people three weeks just to get to rev 1, in favor of the little tapper that gets the job done today with one junior developer.
From the technical standpoint of a decent (and probably only decent) coder, I think Excel is an amazingly sensible choice for technical projects for a bunch of reasons. I think decent coders reach for the database solution far too often, and waste ungodly amounts of time setting it up. Excel coming bundled with an amazing front end UI you don't have to write code for shouldn't be ignored. Sure it's error prone, sure it's a nightmare for data validation, sure it doesn't scale to Amazon sizes, but this little project will never get that big. (Yes, until, of course, it does.)
I used Excel as the frontend for artists to author and enter small databases for a game engine, for example, and it was leaps and bounds better, and cheaper, and easier to use, than the unmaintainable SQL nightmare, with a bunch of frontend and backend dev on top of it, that the decent coder who preceded me left behind. Excel even scaled better, up to the size of our studio (hundreds of people), compared to the other system that needed technical artists to know how to use it. No question it would crumble if we were talking thousands of people or more. Which is why it happens so often. :P
Your scenario makes it seem like these things are the result of incompetence or stupidity by people who aren't "decent coders". The reality is that for most use cases Excel plus manual processes is perfectly adequate and has little startup cost, while what you're suggesting is grossly over-engineered and requires an immense initial investment.
The real problem is if/when scope begins to increase these processes are sufficient until they aren't, and the tools aren't flexible enough to gracefully manage the transition.
>The real problem is if/when scope begins to increase these processes are sufficient until they aren't, and the tools aren't flexible enough to gracefully manage the transition.
That's a good point.
I'd also add that, at least in the US, we don't have any sort of national standard for such things.
Each public health department (and there are >1,000 of them in the US) has their own set of processes and procedures.
I'd expect that the CDC has a proper database system and that data from those 1000+ entities is collated into that system.
However, state and local laws/rules require specific handling of such data and can differ significantly from state to state, county to county or even town to town.
Attempting to stand up an integrated nationwide system isn't a bad idea. But doing so while trying to manage a pandemic isn't reasonable.
That many places used Excel spreadsheets for storing testing data is neither surprising nor necessarily a bad thing. Until March, no one needed to have huge databases of test results -- now we do.
Complaining that places which had, at most, a few dozen cases of reportable infectious diseases should have implemented a database that can support thousands (tens of thousands?) of test results seems rather silly to me.
I didn't read the article posted until now, and I see it's about issues in the UK and not the US (I was confused, as we had issues with data reporting in the US as well).
My comment was focused on the decentralized US public health model and not on the UK's.
My apologies for injecting analysis of a different problem into this discussion. As such, my prior comment should probably be down-voted.
No need for a “front end” or “backend” or “pigsqueal”. MS provides a perfectly workable all-in-one tool, Access, for this kind of work. You can even use Excel as a front-end to Access data if you want.
This isn't really true. The people making the decisions will have more or less zero insight into the data format being used to store data, they'll just know that data is stored.
IMO, a much more likely scenario: a brittle, haphazard system was set up with extremely short notice at the start of everything. One administrative snafu led to another, ran into "well, it's working so let's not touch it", and we got to where we are today.
A good part of the problem could be the timelines involved. With something like Excel, you have a large number of people who can put something together in a short amount of time and perform the needed analysis.
Contrast that to a database.
The first problem you run into is the limited number of people who actually have meaningful experience with databases. Databases ceased being consumer products about 20 years ago, so most people have only interacted with very limited front ends. They certainly haven't created a database, nor performed anything more than trivial queries. (Even then, they probably aren't thinking in terms of databases and queries.)
Okay then, we are reliant on a much smaller number of skilled database administrators and developers. Do their tools facilitate rapid development? Even if they do have access to suitable tools, the process is going to be slowed by communication between the people who need the analysis done and the people who can actually perform it. Needless to say, Excel, with all of its limitations, is starting to look like a good option.
Frankly I think you're being optimistic/pessimistic (depending on your point of view). A person can be a "decent coder" by very reasonable definitions of the term and still not realize that Excel can't handle this particular volume of data.
I understood (different source; not sure what the true picture is) that the problem was with storing each record in a column. There is a limit of 16,000-odd columns. This would, as you say, not be a problem with Excel as such, but with the user(s) not being aware of the limitations (which is still the case in your scenario). I wonder how much the general problem has to do with mistrust of 'experts' so, hey, I can do this myself in a spreadsheet. (Or maybe it was just a decision on-the-fly in a fast moving situation)
This is the test and trace system for the worst pandemic this country has ever faced, six months in. I’m sorry Excel spreadsheets won’t cut it.
This is a management failure. PHE has capable coders. And if not, they could hire some.
The NHSX team has managed to write the dang COVID-19 app twice. Once not using the Apple/Google contact API (because management) and a second time properly. It’s all open source and it looks pretty decent.
As an aside on the Covid-19 app - it is open sourced in the "occasionally throw some code over the wall" kind of way - eg it looks like the github was last updated on 24 September vs the app which has more recent updates. It's not clear they open sourced everything needed to actually build the app either looking through the issues.
This is very very far from the only project doing that but disappointing nonetheless given the amount of public money which was spent on it.
I've been bitten by this exact issue (albeit on non-critical data): save a file as XLS and it will silently drop rows beyond 65k; save it as XLSX and it's all good...
The silent part there is the real surprising part to me. You'd think the exporter would die noisily if it still had data in the buffer and it ran out of file.
(EDIT: numerous comments point out that Excel displays an explicit warning when truncating data above 256x65k. This comment was written under the assumption that it doesn't.)
Not if it was a bug in the code. 65k sounds suspiciously close to the limit of 16 bit unsigned int.
I'm guessing the exporter looked roughly like:
/* Hypothetical sketch - not actual Excel source, just the shape of the bug. */
static unsigned int someInternalCounter = 0;

/* Exports one row; returns the running count of exported rows,
   or 0 once there is nothing left to export. */
int exportRow( ... ) {
    if( /* can't export or no more rows */ ) {
        return 0;
    }
    /* export row */
    return ++someInternalCounter;
}

void export( ... ) {
    unsigned short nextRow = 0; /* 16-bit counter */
    do {
        nextRow = exportRow(...);
        /* After the 65,536th row, the returned int (65,536) is truncated
           to 0 when assigned to the unsigned short, and the loop exits
           as if it had finished. */
    } while(nextRow > 0);
}
In the above example, the export would silently stop after 65k entries.
The way people write C in the wild, this wouldn't surprise me in the slightest. And with Microsoft being all about backwards compatibility, Excel probably defers to some ancient and long forgotten code when exporting to XLS.
Yep, I'm positive, but as said elsewhere this was a few years back. I recall sending the data out and getting some pretty stern feedback (which is why I recall it). Didn't make the mistake after that...
XLSX files have been around for ... nearly 13 years...
Ugh - I still deal with clients who are creating brand new XLS and DOC files and expecting to have all the latest features (co-authoring when hosted in 365/SharePoint Online/OneDrive/Teams)...
Worked on a world-class piece of radiotherapy software. If you like 20-year-old Delphi projects and pre-C89 ... C, then come over! XLS export is probably a recent addition; XLSX support will most likely come no sooner than when Excel stops supporting XLS.
Ten years ago people were still complaining about OOXML not being an open standard. Don't worry guys, I'm sure they'll make a reference implementation of the standard soon! I'm sure there were a few people waiting around for a "real" standard to come out before implementing XLSX support.
I suppose someone had written a VBA macro to import all the CSV files and then save them in a format which could be imported into some dated upstream system. Or never changed how they worked. Could even be because Excel does something slightly more rigorous with XLSX files which made writing the VBA trickier.
I worked with teams integrating legacy systems from hospitals before. There seems to be a large number of carefully customised scripts for pulling data together. There was never a straightforward ETL / export for anything other than what was routinely reported. A lot of information is derived and inferred, even from unrelated data (e.g. inferring chemo events from drug prescription codes if part of the system is not accessible). So if you really need high-quality data you need a lot of validation and cross-checks as well (either by script or manually). To me it sounds like the checks and validation did not happen enough because time was so critical. I don't think Excel is the single thing to blame here. Hospital IT systems have a long history of data-related issues.
It's also entirely possible that whatever they're importing this data into only supports XLS and not XLSX. I don't think Excel is the ultimate endpoint - the data is getting imported into one or more databases somehow.
My guess: somewhere down the line someone who doesn't often use Excel typed ".xls" at the end of a string variable or selected it from a dropdown list and made a mental note to come back to it if it caused any issues in testing.
Old code is everywhere. I once worked at a health tech startup that was integrating with a very successful existing company. All their code was in Turbo Pascal. We were shocked. The funny thing is that their code worked great and even had a debugger. Our code was ultra modern typescript micro services and we debugged with prints and logs. Their product looked dated but was superior in every other way. My guess here is that XLS is a simpler format and the devs found an NPM package that could handle XLS but not the newer formats.
It may have been a user education and/or communication issue.
I had someone come to me recently asking for some data "in an XLS". I asked them if they specifically need XLS or if they just need something that can be read by Excel. It was the latter, but they didn't know that there was a difference. To some people, XLS == Excel. Sort of like how TLS is still being referred to as SSL - somewhere in the world there is a developer being asked to use SSL for a new project because SSL == secure.
There are a surprising number of people out there that still save their office documents in the old formats for whatever reason they have convinced themselves of. Just a couple weeks ago I had a coworker that sent me a manual for a piece of software he wrote as a .doc file.
This is indeed user incompetence, but so far every news source quoted the official PR and said it was an "Excel Problem".
Microsoft PR was caught unprepared - I wonder how they'll re-spin it in the next few days (and for the first time that I can recall, a Microsoft product was wrongly blamed...)
Pay attention to how, every time there's a Windows virus or worm, it's a "Computer Virus", but on the (extremely rare) occasions where Linux or MacOS is involved, it's attributed to that system. I don't think that's a coincidence, CMV.
I think Linux and MacOS viruses get a special callout because they're exceptions. Similar reporting conventions are used all the time, like a terrorist attack in the middle east vs a terrorist attack in France. It's not some nefarious PR scheme, the story is just more interesting because of how unusual it is.
It's not a user issue, it's a government incompetence issue. Who was the technical genius who thought 'right, we need to store thousands of lines of critical and important data, I know! Excel!'
The common case for stories like this is the answer is "nobody."
Time point 1: "We need a way to map the long-form epidemiology state. It's 2D data... Let's use Excel." "Will we hit limits?" "Pfft, no. Not unless we end up needing to track tens of thousands of patients, and that's not practical; we don't have enough people to do that tracking in the whole NHS."
Time point 2: "This pandemic is an URGENT problem. We need to hire more people than ever before to do tracing. And be sure to keep the epidemiology state mapper updated so it can drive the summary dashboards!"
Fun fact: Toshiba has software that generates entire CRUD app skeletons driven 100% by excel spreadsheets. You fill out object definitions in excel spreadsheets with property names and data types etc. It's like an extremely uncool long lost cousin of rails -g
Source: Got one of their engineers to show me after I heard about it and had to know if it was real.
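I can't reproduce Toshiba's tool, but as a toy illustration of the general idea - read property names and types out of a sheet export, emit a class skeleton - something like this captures it. The column layout, type names and file name here are all invented, and real sheets would be read with an Excel library rather than a CSV export:

import csv

# Invented sheet layout: one row per property, with "name" and "type" columns,
# exported from the definition spreadsheet as object_defs.csv.
TYPE_MAP = {"string": "str", "integer": "int", "decimal": "float", "date": "str"}

def generate_class(class_name, csv_path):
    lines = [f"class {class_name}:"]
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            py_type = TYPE_MAP.get(row["type"].strip().lower(), "str")
            lines.append(f"    {row['name'].strip()}: {py_type} = None")
    return "\n".join(lines)

print(generate_class("Patient", "object_defs.csv"))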
Anyone who's ever worked with or in Japan knows just how far they are willing to torture Excel spreadsheets to get them to do anything and everything.
I've (mis)used Excel as a web-service testing platform once. The test data (input/expected output) was in the Excel sheet anyway, so why not. We did not have other software available at the time, so Excel+VBA it was.
It was one of my nicest testing gigs ever. A test session would take only minutes, with all results documented to a T. And fiddling around with the test data was so easy. Would be interesting to know if this thing is still in use.
Not sure the customer liked it that much, I regression tested the 6 previous - still running - versions of the service, something no one had cared to do for years. We found bugs both in the Spec and in the Code for nearly all old versions...
>I've (mis)used Excel as a web-service testing platform once. The test data (input/expected output) was in the Excel sheet anyway, so why not. We did not have other software available at the time, so Excel+VBA it was.
That's an upgrade on how Toshiba did testing. They did basically the same thing except it was Excel+Humans. No joke. Never saw that one with my own eyes but a coworker did. He also said they had doctors there and each doctor was responsible for X number of staff and every so often made them fill out a questionnaire of which the final question was "have you thought about killing yourself lately?" and if you answered yes you got a day off. Apparently people would ask for transfers to different offices where the doctor responsible for them wasn't physically present, so they could avoid even being asked the question for fear of being made to take time off work. One of the more senior guys on the project kinda just disappeared too. Fun times.
To play devil's advocate, I had a desktop support job in college and worked with an executive assistant. She was _very_ attractive, so I was at her desk "working on the printer" a lot. She knew Excel better than anyone I've ever seen (before or since). It was like watching a magician.
I once created a very modern mobile offline-capable web app for a large enterprise company that was powered by APIs backed by Excel. Luckily we didn't have to deal with those directly, but it was a bit of a pain for the poor guy doing the backend.
It started very innocently: some consulting company apparently was tasked with creating a prototype of the credit rating algorithm. Quite appropriately, I would say, they did so in Excel, as you can easily tweak the logic and show the input, intermediate results and output, and all of it can be manipulated by the client, who does not need to know a thing about programming to do so. Great.
From there it went downhill: Apparently someone up the hierarchy thought "Wow, that's 80% of what we need, let's just add a little UI and ship it"
They added a UI, but not as a VBA UI as you might hope. No, they did the whole UI in different worksheets. Storing any intermediate data ... also on worksheets. Long story short, it was a mess, and a slow one at that.
Oh, did I say this was a multi-lingual application?
Production rollout was on Jan 2nd. On Dec 31st around 4pm I found a bug in the other language, on the one machine which had that language configured. I believe this was the only time I ever saw a programmer literally run down the office to that machine to debug an issue.
Can attest to this. Some of the things I've seen excel spreadsheets contorted to do horrify me.
I once got a spreadsheet dumped on me to debug because it wasn't working. A colleague used the spreadsheet to 'generate' interest rates that were then input into a mainframe. I dug into the VBA spaghetti mess; it turned out this 15-year-old script, which took 15 minutes to run and originally hit a bunch of Oracle/external APIs and performed calculations, was now just copying the rates in from a CSV file on a shared drive.
The error was caused by an Excel formula ticking over to a new year, causing it to look for a directory that didn't exist and that it didn't even need to access. I thought it was pretty funny until I heard that the entire asset finance business had been unable to write any loans for days because of this.
In my org excel might have been the start, the intent being to move over to a DB once someone got it stood up. Then nobody would stand up the DB - because not urgent, we have the excel for now. The DB would be completely forgotten, and 3 years down the road everybody would be wondering wtf we have this huge excel file for. Then we would have six weeks of meetings to discuss the file until something else comes up and we forget it for another 3 years.
That's why Excel wins. Because it's right there, infinitely adaptable to any change in workflow you may need to make. No need to send official requests through half a dozen layers of management to get the IT people to maybe, in half a year, add a column to the database. No need to involve 20 people for a year to create new software for a new workflow. You just agree with your co-workers on a change, swap a column, add some color, File->Save, done. It wins because it puts power in the hands of the people doing the work.
I mean, yeah, but pretend Excel is, idk, sword arms. You get a lot of work done by whirling around the office, spinning your swords, cutting off a variety of limbs and heads in the process. You get to the end, meet the few survivors, staring dismayed at the trail of blood and dismemberment behind you, and you yell, "BUT I GOT THE JOB DONE!"
You're not wrong, but...
I wish there were some alternative where we instead fixed our ossified, byzantine processes gradually over time, so that we didn't need to break out the sword arm tornado in the name of getting work done and then, in hindsight, say, "well, shame about the negative externalities, but there really just was no other way. Ah well, let's move on, everyone who isn't a pile of flesh and blood and bits on the floor tidy up. Gotta get everything clean for the next sword arm tornado."
But the negative externalities don't seem all that big, considering. For every problem like this, you get countless millions of person-hours and dollars saved in work that wouldn't have happened at all, or would have been delayed, because of the necessity and cost of creating complex IT systems to support it.
If Excel is a sword tornado, it's one happening in an environment where everyone knows to be super vigilant about sharp objects. The alternative then would be a central processing factory that takes several years and millions of dollars to build, and which has to be turned off for a month several times a year to change the shape of the blades used by the automated cutters.
> For every problem like this, you get countless millions of person-hours and dollars saved in work that wouldn't have happened at all,
I've found errors in every Excel spreadsheet I've ever looked at, and I'm not some master Excel user; I usually find them because, if I care about the results, I rewrite the spreadsheet as a Python script, so I get to go through everything.
The fact that Excel is effectively not auditable is a huge problem. There are a lot of dollars saved, like you said, but also many wasted or embezzled without anyone noticing in time because past a very low bar of complexity, it's really impossible to figure out what's happening without much, much work.
> If Excel is a sword tornado, it's one happening in an environment where everyone knows to be super vigilant about sharp objects.
... rather, no one is really careful, everyone gets a small cut every now and then, to which they apply a bandage and continue like nothing happened. And occasionally, they lose an eye or a limb or a head - and it is only those cases you read about in the newspapers.
I do not have a better suggestion, I'm afraid, but Excel is causing damage everywhere - e.g. [0], whose subtitle - which I'm afraid is not an exaggeration - is "Sometimes it’s easier to rewrite genetics than update Excel"
I agree with what you wrote. It does cause a lot of small mess everywhere, and an occasional big mess. But there really isn't any better suite of tools available that hits all the non-database requirements.
"We need to store some data for a virus"
"Where does the data come from?"
"Every hospital sends us an excel sheet"
"Well, let's merge all of it into a bigger sheet and display it or export it"
"We should maybe start tracking some of this"
"Where does the data come from?"
"Dunno, a handful of hospitals... maybe?"
"Right. Let's just use Excel, not gonna waste time on that now"
>"We need to store some data for a virus" "Where does the data come from?" "Every hospital sends us an excel sheet" "Well, let's merge all of it into a bigger sheet and display it or export it"
I think the healthcare industry is probably the only thing keeping fax machines alive. Interestingly, fax machines are considered HIPAA compliant. Lots of spreadsheets get printed and then fed to the fax machine for transmission instead of email/whatever. On the other side, and I'm being serious, someone gets the fax and creates a new spreadsheet and enters all the rows manually. Fun fun.
CSV is somewhat okay-ish. Excel, not so much (thanks Excel for discarding the important leading zeros in this field, and for lossily parsing the other as a date... oh, this Excel file is so big it got the "XLSB" treatment which is hard to undo)
Who? Almost anyone who wanted to get something done without going through the red tape of working with IT. Even in a private corporation that red tape is enough to make many a thing end up in Excel (or, for the more tech savvy, Access). Government is surely even worse (better?) at having red tape, so the pressure was likely stronger.
Yes exactly. An endless battle: IT want to stamp out spreadsheets and centralise everything but it all becomes an ossified bottleneck of development projects. Meanwhile people need to get stuff done so they use Excel.
Yea, I have a personal vendetta against XLSX due to having had to make tooling replacements for business logic being encoded in spreadsheets in the past, but the title of this article is incredibly slanted.
This was absolutely a user error and the title "Excel: Why using Microsoft's tool caused Covid-19 results to be lost" is really disappointingly click-baity.
Is your view based on data available to everyone else? If so, that would be nice to see. Its far easier for you to convince others with data. Why would anyone want to change your view?
> This is indeed user incompetence, but so far every news source quoted the official PR and said it was an "Excel Problem".
It's been quoted/reported as "an ill-thought-out use of Excel", when in reality it's "poor use of an older file format".
There's also a lesson to be had here about validating your data conversions. Especially for critical things like this, it's always a good idea to perform the conversion from the old file format to the new one, and then to go back and extract the data from the new one to compare to the source data.
This can also find issues like Excel cluelessly assuming a field is a date, PHP automatically converting a string which happens to only contain numbers into an integer, and so on.
From what I can tell, that sort of thing, assuming proper monitoring (which is a huge assumption), would have detected this immediately (plus however long until someone notices the error e-mail, etc.).
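For what it's worth, that round-trip check is only a few lines in something like pandas. A minimal sketch, assuming the source arrives as CSV, the converted output is a .xlsx workbook, and an Excel engine such as openpyxl is installed; the file names are made up:

import pandas as pd

# Read everything back as text so nothing is silently reinterpreted along the way.
source = pd.read_csv("daily_results.csv", dtype=str)
converted = pd.read_excel("daily_results.xlsx", dtype=str)

if len(converted) != len(source):
    raise RuntimeError(
        f"Row count mismatch: source has {len(source)} rows, "
        f"converted file has {len(converted)} - possible truncation"
    )

A row-count comparison alone would have caught this particular failure; a stricter check could compare the actual cell contents as well.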
If you read the article (perhaps it's been updated?), they were using the old XLS format and they hit the limit at about 65,000 lines
PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system and made available to the NHS Test and Trace team as well as other government computer dashboards.
The problem is that the PHE developers picked an old file format to do this - known as XLS.
As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of.
If somebody finds it easier to do this process using Excel than with a scripting language, I wouldn't expect them to know the differences between file formats. I would be surprised if they understood the concepts behind file formats.
They may have chosen XLS because they used Excel back when that was default, and now they don't want to "risk" anything by switching horses mid-stream.
If the conversion was failing silently, it's a Microsoft software problem, but I bet you money the conversion was failing in a way that fell into a `catch(Exception e) { }` somewhere in their conversion script. Or even a `catch(Exception e) { log e; }`, and the log is stored to local storage on a machine that no human being ever logs into.
There's nothing to convert, Excel natively supports CSV. It just doesn't read beyond a million lines, it presumably won't warn you about that, and if you save the file again (as CSV), the lines will just be gone.
Interesting. We might see an increase in the amount of wrongly reported data as a result of the global lockdown. One can only wonder what the consequences of data underreporting might be?
In 1976, the UK government, led by James Callaghan of the Labour Party, borrowed $3.9 billion from the International Monetary Fund. The loan was granted on the condition that the government's fiscal deficit be slashed as a percentage of GDP.
A couple of years later, the Chancellor of the Exchequer at the time of the IMF loan said:
"If we had had the right figures, we would never have needed to go for the loan."
That's right: the decision to borrow money from the IMF was based on wrong data. The public borrowing figures which prompted the UK government to seek the IMF loan in 1976 were subsequently revised downwards.
If you're thinking regular office workers, they know of no alternative.
Excel works enough for small data, particularly when you don't do complex queries on it. The more you know of it, the better it works. The only other thing that offers similar benefits to Excel but works as a database is MS Access, but the mental model behind it is too complex for your average office worker who wasn't trained in it, and like most database systems, requires a lot of up-front work with figuring out the schema, and doesn't particularly like the schema being modified later on.
As far as I can tell, there's literally nothing else out there. No, random SaaS webapps du jour don't count, because they're universally slow, and also store the data in the cloud, instead of the local drive.
It's hard to say from here what would be the best thing for them to use, but simply using a modern spreadsheet format instead of the ancient xls seems fine for interchange.
I work in an office that relies way too much on Excel.
To consider your solution: first showstopper, it needs a server. We don't have a server, nor anyone who knows how to manage one. We'd need to ask IT, which will take months, and they'll require a budget transfer, so we'd need to request it from management (which will need a business case to be convinced) and involve the finance guys. We can't just plug a Raspberry Pi into the wall; not only would that get me fired, it also wouldn't be able to connect to anything without the company's certificates for the proxy or whatever.
Second, we need people who can code in PHP (and their backups when they leave). Probably in practice we'd need IT to do that, so that's more months and budget required.
Obviously anything stored in the cloud is out of the question, just the authorizations and contracts to do that would take a year.
I'm sure, at a push, a PC could be co-opted to do the work while a server (even online) could then be brought up to speed and the data migrated - but that's your use case.
What we're talking about here is a government department who do have access to servers, but choose not to use them (or so it would seem).
One number the UK government doesn't release is how many people are tested. They have spent the summer championing "testing capacity", which we found out at the start of September was a lie. They release the number of tests done, and the number of people returning positive. They don't release the number of people tested -- if you're tested twice, that counts as two in the number of tests, but one in the number of people.
The trouble is, that entire narrative about the testing capacity being a lie because of secret leaked figures about the number of people tested was itself a massive lie.
For context, the Government publishes a stat for the number of people tested for the first time ever each week (used to be daily, but that was dropped in part due to people misusing it). Certain publications and politicians - starting I think with the Times - pushed the narrative that this was the real number of people being tested, and that the reason it was so far below the stated capacity was because the real capacity was far lower than the Government claims: https://archive.is/n3Yku This was bullshit on multiple levels - there are several important uses of testing, like routine screening of NHS frontline staff and hospital patients being admitted and discharged, that result in people being retested who've been tested at some point previously, and those obviously use actual testing capacity on people who haven't been tested that day or week. (Testing two samples from the same person in short succession, on the other hand, is unusual and not general policy.) Not only that, the people tested number is from pillar 1 and 2 testing in England only for the week ending the 2nd of September, whereas the capacity figure seems to be for all pillars in the entire UK at around the 12th or 13th. It makes absolutely no sense to compare them. And the cherry on the top is that I'm pretty sure this number is from before the problems with test shortages and testing delays that they're implying it explains.
When I use Excel to open a CSV with too many rows, I get an alert saying 'File not loaded completely'. I have to dismiss it by clicking 'OK' before I can interact with the spreadsheet.
... Yeah, but if you have an automated script compiling the reports into one file, and you forget to add correct error handling .... you end up with this mess :-(
I suspect that wherever this foul up happened there wasn't someone sitting and clicking through sheets ignoring errors (I hope).
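Right - and the guard is cheap to write. A merge script only has to know the target format's ceiling and die noisily instead of truncating. A rough sketch (the glob pattern and file layout are invented; 65,536 is the legacy XLS worksheet row limit):

import csv
import glob

XLS_MAX_ROWS = 65_536  # legacy .xls worksheet row limit

rows = []
for path in sorted(glob.glob("lab_results_*.csv")):  # hypothetical input files
    with open(path, newline="") as f:
        rows.extend(csv.reader(f))

if len(rows) > XLS_MAX_ROWS:
    # Fail loudly rather than letting the exporter drop the overflow.
    raise SystemExit(
        f"{len(rows)} rows will not fit in a legacy XLS sheet "
        f"(limit {XLS_MAX_ROWS}); refusing to export"
    )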
>As a consequence, each template could handle only about 65,000 rows of data
Oh? Like 65,535 or so? It seemed weird at first that, even as old as XLS is, only 16 bits were allocated for the maximum row count.
But then maybe not. Each row might get an ID, so that's 16 bits times the number of rows you have. I wonder if, when XLS was designed, they considered it very unlikely that many people would have a 500MB database's worth of empty rows and then need more data added to each one?
Simplified, of course; I bet even the original format had some explicit row identifier that could be shortened with optimizations and increment tags.
The xls format is pretty old. The newer formats (xlsx, xlsb, and so on) allow for larger spreadsheets.
The reason the spreadsheets in the old format couldn’t grow was probably compatibility. The reason for the limit in the first place was probably about reducing memory on disk and in main memory.
You'd be surprised to hear how many big organisations still use Excel for their reporting needs. I used to work at a company with 35,000 employees. Their payroll reports were generated using an MS Access database and reported using Excel. It was a massive manual job every week, as they had weekly, fortnightly and monthly payroll cycles! It's very easy to lose files and/or make an error in the calculations. But nobody bothered to change it.
We had a telco vendor use excel to manage all their configuration.
It was a weird dual source of authority system, where the excel configuration was used at install / upgrade time, but you could change the config after the install at runtime. So you had to merge the active configuration into excel, upload the excel document to some server that customers didn't have access to that turned the excel file into a config file, then you could use that config file to upgrade the system.
It was as problematic as you could imagine, lost configuration, lots of macros, etc. Eventually they made some improvements, but I left the industry so don't know what the current status is.
I'm not surprised. I've seen all kinds of shit working for enterprise clients. Only one of them adhered to very high standards and had processes for everything. They were a military contractor and most of the clients were ex-military.
So easy to pass judgement on critical software decisions from the outside, with absolutely no knowledge of how they work or their training, the pressure from Government, the previous conditions that they found themselves in when a pandemic hit.
Who are "they"? The Government? GDS? PHE? NHS England? NHS Digital? NHSx?
Yes, with direct previous experience and knowledge of large UK Government IT projects, how they were integrated and how staff were trained, I feel I'm well placed to comment harshly.
"They" in this case was the UK Government, and part of the goal was creating secure backend systems specifically for the NHS as a data store, Public Health England would absolutely have access to it.
Even the most basic of checks at the start of the project would have shown that Excel was not a proper solution for the application (even if the developer(s?) had used the newer(?!?) XLSX format rather than XLS) - which highlights that there was no proper oversight of how the system was constructed.
The most consequential Excel error was the 2010 Reinhart-Rogoff error in the paper "Growth in a Time of Debt" https://www.nber.org/papers/w15639
The paper claimed that average real economic growth falls to -0.1% when national debt rises to more than 90% of gross domestic product (GDP). When you correct the error, it shows average growth of 2.2% for those countries instead.
Paul Ryan used it in the US for the Republican budget proposal to cut spending, and it was also used in the EU to push the austerity policies that hurt people.
I think you're overstating both the paper's impact, and the significance of the error.
Most such papers are used to support existing policy preferences, not drive them. In addition, the error didn't reverse or erase the correlation, it diminished it, and removed the inflection point from the curve.
> I think you're overstating [...] the paper's impact
Krugman [0] called it "surely the most influential economic analysis of recent years" and adds that it "quickly achieved almost sacred status among self-proclaimed guardians of fiscal responsibility; their tipping-point claim was treated not as a disputed hypothesis but as unquestioned fact."
However, indeed, he cautions that "the Reinhart-Rogoff fiasco needs to be seen in the broader context of austerity mania: the obviously intense desire of policy makers, politicians and pundits across the Western world to turn their backs on the unemployed and instead use the economic crisis as an excuse to slash social programs.
What the Reinhart-Rogoff affair shows is the extent to which austerity has been sold on false pretenses."
I'd say the paper made it certainly easier for the Very Serious People[1] to push through their austerity agenda.
[1] Krugman's derogatory term for duplicitous policy wonks (like Paul Ryan) that pretend to be thoughtful and serious and concerned about the nefarious long term effects of debt when discussing stimulus and spending and social programs, but then typically turn around and implement huge tax cuts without second thoughts...
Krugman's editorials are polemics, not economics. I don't think anyone should trust the content of an opinion piece which purports to describe the logic, motives, and rationales of ideological opponents, as was the case in your linked article.
> Krugman's editorials are polemics, not economics.
The discussion here is not about economics. The discussion is about the impact of the (flawed) Reinhart-Rogoff paper, which one commenter thought was overstated.
> I don't think anyone should trust the content of an opinion piece [...]
Krugman is of the opinion that said paper had 1. a big (and 2. a deleterious) impact. For the first contention (which is, again, the pertinent topic), you don't have to rely on Krugman's expert assessment, though, you can look at the evidence: the infamous 90% inflection point had been quoted all over the place (as for example in the WaPo editorial cited by Krugman).
Lastly, what makes an article describing "the logic, motives, and rationales of ideological opponents" intrinsically untrustworthy?
The most salient point of the paper was the inflection point of 90% for sharply lower growth. It was and still is used by proponents of fiscal austerity to argue against fiscal stimulus.
It was literally cited by Britain's finance minister George Osborne as justification for UK government austerity:
> As Ken Rogoff himself puts it, "there's no question that the most significant vulnerability as we emerge from recession is the soaring government debt. It's very likely that will trigger the next crisis as governments have been stretched so wide."
Not the paper on its own, maybe, but it bolsters other arguments. If someone already wants austerity, an authoritative paper saying that's better will make them want austerity more, and gives them a strong argument in discussions with opponents, and may sway people who are unsure.
Fun semi-related story. Way back when, while I was squishing some Y2K bugs for a Saudi bank that catered to high net-worth individuals I was asked to look into why the AS400==>>db==>>Excel spreadsheet was "showing something screwy every now and then".
Spoiler Alert: It was a SMALLINT error that had been 'patched' by someone previously (legacy code) who worked around the bug by rounding account balances to the million (since it was a reporting function rather than an accounting function, it was a 'sort of OK' hack). The bug re-appeared as undefined behaviour when transaction and/or account balances went into the billions.
People keep putting the blame on Excel but it supports more columns than most implementations of SQL databases. MySQL for example supports only 4096 columns. SQLite defaults to 2000. That's way less than Excel's 16384.
True, but they don't need to relate 16384 types of data. It sounds like they were using one column per record, rather than one row per record. So if they had a sensible data model 2000 columns should be more than sufficient.
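Exactly - one row per result is the boring, correct shape, and even an embedded database handles it comfortably. A minimal sketch with SQLite, just to illustrate the shape (the column names are invented, not anyone's real schema):

import sqlite3

conn = sqlite3.connect("test_results.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS results (
        specimen_id TEXT PRIMARY KEY,  -- identifiers stay text, leading zeros intact
        patient_id  TEXT NOT NULL,
        test_date   TEXT NOT NULL,
        outcome     TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT OR IGNORE INTO results VALUES (?, ?, ?, ?)",
    ("0000123456", "0009876543", "2020-10-02", "positive"),
)
conn.commit()
conn.close()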
You're on to something. The number of tables in MySQL, SQLite, and Excel is virtually unlimited. It seems like that is the lowest common denominator ensuring compatibility and data integrity when case numbers grow beyond bounds.
I have done a couple of gigs in life sciences, and Excel is one of the big reasons VB.NET is still relevant.
I have met several researchers using Excel - not R, Python or Julia, nope, plain old Excel - sometimes with some VBA macros.
The more savvy ones eventually ask IT for a VB installation when they've outgrown VBA's capabilities, and carry on from there with either small Windows Forms based utilities or Office AddIns.
Any attempt to replace those sheets with proper applications has gotten plenty of pushback until we basically offered enough Excel-like features in the new applications.
This was me when our online meetings exploded and I had to record a ton of metrics manually without any prep or software. Now there is this disgusting, VBA-augmented Excel sheet with a million hacks, which the C-levels read to make very important changes. I can't vouch for a single cell being accurate at this point. Even the few automatic metrics from the meeting software are misleading from the get-go, but then they meet my black box of doom.
While I agree with the general sentiment that Excel is normally misused, as in this case, it's also capable of handling huge amounts of data thanks to Power Query and Power Pivot (1,999,999,997 rows per table and 2 billion tables max, the docs say). You can't load that data in the grid; you need to use the data model. Using Power Query properly also solves most problems with Excel guessing types wrongly when importing CSV and other external data. I'd strongly recommend any Excel user learn Power Query and Power Pivot.
There's a long history of spreadsheet errors and literature on same going back a few decades now.
In a business sim class in college a couple of decades back, I discovered that the Lotus spreadsheets (as I said: a couple of decades back) had a totalling error which double-counted individual row totals in the bottom line (everything was twice as profitable as the spreadsheet indicated).
At an early gig, one of the senior developers instituted a practice of code walkthroughs on projects (only a subset of them). One of these involved, you guessed it, a spreadsheet (we used a number of other development tools for much of our work), in this case Excel. Again, numerous errors which substantively changed the outcome of the analysis. One of the walkthrough leader's observations was that you could replace all of the in-cell coding with a VBA macro making debugging far easier (all the code and data are separated and in one place each).
The particular analyst whose project this was: he insisted to the very end that this "wasn't a program" and he "wasn't a programmer" and that the walkthrough didn't apply to his situation. Despite the errors found and corrections made.
At the time (mid 1990s) the walkthrough lead turned up a paper from a researcher in Hawaii on the topic. I'm not certain it was Raymond Panko, but his 2008 paper (a revise of a 1998 work) discusses the matter in depth:
I hate solving problems and not knowing how I did it, so this particular experience has stuck in my craw.
Trying to develop a budget to pay off debts, my partner made this elaborate Excel spreadsheet and the output was that basically she had no spending money and I had very little, until one or both of us got a raise. It was far more austere than either of us were willing to go. So I started over using a different equation for 'fairness', and a different layout because something about her tables was just confusing and messy. When I was done, we had $300 a month of extra spending money between the two of us, despite using the same targets for pay-down, savings and bills.
I spent about 90 minutes poking at her spreadsheet and mine looking for the error and never did find it. I don't know what the right solution is to this sort of problem, at least for non-developers. But if I was skeptical of Excel going in, I was doubly so after that. Especially for something that is going to be used to make decisions that will affect every day of your life.
You made two different spreadsheets, they came up with different results, and you're blaming Microsoft for the error? Really?
Excel, whether you like it or loathe it, is in such wide use around the world that $300 math errors would have been noticed a very long time ago. I could believe that there are still many lurking bugs with obscure corner cases, nasty floating point rounding minutiae and so on, but I would bet (checks spreadsheet) $300 that your mismatched results were solely due to your own failures.
I've re-read the parent comment a few times, and I really don't see how this is your takeaway. The author is complaining that they find Excel difficult to inspect and different spreadsheets difficult to compare. There's no claim that Excel did a calculation wrong.
But they aren't noticed. The classic blunder is dropping the last item off of a summation, and nobody notices until they add another item and things look janky.
Those happen all the time, we just only hear about it when the consequences are outsize.
In fact good practice is to add a check, just to see if your pivot table was refreshed and the data there matches the data in the source table. Just like you make tests in software.
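You can even run that check from outside the workbook: recompute the summary from the raw sheet and compare it to what the pivot reports. A sketch with pandas, where the workbook, sheet and column names are placeholders rather than anything real:

import pandas as pd

# Hypothetical layout: raw rows on one sheet, a pivot-style summary on another.
raw = pd.read_excel("cases.xlsx", sheet_name="raw")
summary = pd.read_excel("cases.xlsx", sheet_name="summary")

recomputed = raw.groupby("region")["cases"].sum().sort_index()
reported = summary.set_index("region")["cases"].sort_index()

if not recomputed.astype(float).equals(reported.astype(float)):
    raise AssertionError("Summary sheet no longer matches the source data - refresh the pivot")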
I always store data in tables for this reason, but last week I ran into a problem because someone else had added new data to my sheet somehow without the table range being automatically adjusted. The table looked visually correct but when you sorted or filtered it, the rows towards the bottom were not affected. So sadly, this is not a silver bullet, but it does help a lot.
There's nothing wrong with Excel. Build an organized spreadsheet with clear separation of data and presentation, and write checks throughout and you will end up with a perfectly error-free workbook.
Nothing wrong?
Excel autoconverts values if it thinks it knows the data type of a column and Excel treats 1900 as a leap year.
Genetic researchers changed the names of genes to prevent autoconversion errors.
And JavaScript autoconverts all your numbers to floats, introducing errors every time you touch them. Meanwhile most other popular languages will happily wrap your fixed-size integers around on overflow, or silently perform integer division with truncation when you don't append a decimal fraction part manually to any of the numbers.
Point being, all tools come with strange caveats that one needs to be familiar with. A nice feature of Excel is that the caveats are visible, because unlike writing code, Excel is reactive and interactive.
I mean, some of the stuff Excel does to your data is downright idiotic wrt. the common use cases, and probably exists only for the sake of backwards compatibility. But let's not pretend you don't need to pay attention if using a tool you're not proficient with.
The typical users of JavaScript are programmers; they know about those errors.
The typical users of Excel most of the time aren't aware of Excel's weaknesses. And good luck finding these errors in large data sets.
There is a difference between visible and lucid, especially if you do not expect errors.
>>But let's not pretend you don't need to pay attention if using a tool you're not proficient with.
Decades of experience working in Office Environments tells me many many many many many people do not pay attention period, their experience with the tool is irrelevant.
My experience also shows that inexperienced people do not know WHAT to look out for, especially in excel so they are often more susceptible to mistakes.
This really comes into play when working in larger organizations where excel workbooks are passed around from person to person, often existing for years or decades at a time where people using the workbook are separated from the person that created the workbook.
Try figuring out an Excel spreadsheet created 10 years ago by people no longer with the company, with several data links importing data from all over the place...
And a band saw will happily cut off your hand if you push it into the moving blade. That's why the best practices of operating one suggest that you never keep your hands anywhere near the cutting edge.
Excel's idiosyncrasies are very much on the level of the typical productive computing tool. They are less maddening than half the featureset of C++, three quarters of the featureset of Javascript, and 110% of the featureset of bash.
Despite that, people manage to get work done using bandsaws, C++, Javascript, and the occasional shell script.
Band saws, C++, Javascript have a different user group than Excel. Most of them have been trained on the tool they use.
If Excel were a band saw, there would be a lot more people missing fingers.
You can find idiosyncrasies about any tool. Excel is a fine tool for 99% of use cases. Sure, it's quirky here and there, but the sheer amount of work that gets done globally in Excel should be testament enough to its value. Highlighting edge cases to bash on the tool is either myopic or intellectually dishonest
1) X delivers tremendous value.
2) X has some significant problems.
3) X has no problems.
Obviously 1 and 3 are compatible. I claim 1 and 2 are also compatible (and, actually, not uncommon...).
You said that "[t]here's nothing wrong with Excel", and someone responded with an example of a problem that they considered significant. If we read "there's nothing wrong with Excel" as a strong claim of 3, then that's obviously a refutation of your claim. You could argue that the specific problems are not actually significant enough to rise to the level of notice. You could argue that they are not, in fact, problems at all. You could argue that you didn't, in fact, mean to make claim 3 in any strong sense (which I think is what was actually going on here - that's valid, English works that way).
You've instead interpreted it as a flawed refutation of claim 1 (and took the opportunity to demean your conversation partner). I don't think that's productive.
Excel is good at what it was developed for. Unfortunately, it is often used for other things - in this case, as a database. The result is that departments build solutions that appear to work at first sight but have errors in edge cases that are difficult to detect and debug.
I'm not so sure he claimed that Excel can't add and subtract numbers. I think the argument is that even two people can make such a mess of a spreadsheet that nobody can really untangle it later.
Tack on that developers have other, better options, but if you're not a developer I don't have a good solution for you, and that I don't like that state of affairs.
The OP said that using a spreadsheet program can be messy and prone to mistakes, and implied that there could exist alternative approaches. The OP never said a bug in Excel was responsible for the mistake or blamed Microsoft for corrupting the calculation.
In the same sense, one could say manual memory management in C can be messy and prone to mistakes; it doesn't mean that one believes C should never be used in any circumstance, and it doesn't mean one's trying to blame the C compiler, libc or Dennis Ritchie for corrupting one's program.
Hypothetically, if someone writes the following reply.
> The C programming language, whether you like it or loathe it, is in such wide use around the world that a NULL pointer dereference would have been noticed a very long time ago. I could believe that there are still many lurking libc bugs with obscure corner cases, but I would bet the memory corruptions were solely due to your own failures.
The only real way to solve issues like this is a step-by-step reconciliation between the old model and new. Each assumption and calculation must be tested to confirm that they replicate the same output. This is extremely difficult if the underlying structure/assumptions are different (e.g. monthly views vs. transactional views, etc.).
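In practice that reconciliation can be automated as a straight diff of the two models' outputs over the same scenarios - something like this sketch, where the output file names, the key column and the tolerance are all assumptions:

import pandas as pd

old = pd.read_csv("old_model_output.csv").set_index("scenario_id")
new = pd.read_csv("new_model_output.csv").set_index("scenario_id")

# Compare every numeric column, allowing a small tolerance for rounding differences.
for col in old.select_dtypes("number").columns:
    diff = (old[col] - new[col]).abs()
    bad = diff[diff > 0.01]
    if not bad.empty:
        print(f"{col}: {len(bad)} scenarios disagree, worst difference {bad.max():.4f}")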
That rework based on suspicion is typically how it’s done. :) Add to that subject matter expert review. For example, I used to work on a sheet of scoop size calculations for industrial digging and nothing from it would reach sales staff or customers until a couple of the crusty old experts played with it and vouched for the calculations.
Imperfect measures, but they keep the work within the grasp of everyday/business users in a way a formal test suite wouldn’t necessarily.
I didn't rework it because I thought the numbers were bad. She and I agreed to ground rules, [but] I could not commit to the consequences even though the 1st order consequences came out in my favor. The second-order consequences would be that I'd tricked her, intentionally or otherwise, which is decidedly unhealthy in a partnership. I had contributed 2/3rds of the income but 2/5ths of the debts (the latter being the main sticking point) so while 'fair' was not going to be 50:50, I'm not some monster who's going to buy high end electronics while my partner is eating ramen.
I expected worse news for myself after re-doing the numbers, and in fact I ended up with a little bit more spending money with the new numbers.
I mention this only because I have come to expect policy people to stick to their initial narrative a little more enthusiastically than I could. Getting them to review data and policy unless there is a clear benefit for themselves is difficult. It's very common for people to get promoted by challenging this friction in ways that benefit everyone (because the deciders couldn't connect the dots). It's always newsworthy, and I wish it weren't.
The secret to using Excel well for large, complex tables/spreadsheets is the same as the secret to coding large, complex programs.
You use structure, design away the complexity, implement constraints, error checking and tests.
Note it's true that an inexperienced person is liable to make spaghetti code either way, but there's nothing fundamental about a spreadsheet format that makes it inherently unusable for a lot of small/medium problems. Indeed it even has benefits in terms of interactivity/ turn around/accessibility.
Of course, there's also a lot of problems with Excel and reasons not to use it like a database or anything which fundamentally relies on maintaining data integrity, and it's liable to be the first tool reached for by the non-experienced, who will generally make a mess of things large and complex as a rule.
> You use structure, design away the complexity, implement constraints, error checking and tests.
I've never thought of Excel as an ideal tool for any of these things. I'm struggling to think of how it would have change-verifications/tests in the same way that software projects do.
(I'm certainly open to the possibility that I'm ignorant/unaware on this subject)
Depending on your point of view, you don't do them in the "same" way (i.e. with separate unit tests or compiler-enforced safety), and explicit diffs are hard/impossible. And "ideal" depends on the task (it's ideal for me to get a working interactive graphic + visualisation to a user behind a corporate firewall via email the same afternoon I get the request; I wouldn't actually choose to do 'programming' in Excel, but I consider a budget small and simple, and programming is probably overkill for it).
What you do is much closer to old-school low-level programming: define the relationships between your tables and variables well, set up explicit corresponding arrays of 1s and 0s that are themselves error checks on the underlying structure/contents of your tables, calculate things in two different spots/ways and verify that equivalence holds, and use simple red/green conditional formatting to draw attention to when things fall outside the expected state, etc.
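And if you want that belt-and-braces style enforced from outside the sheet too, you can load the cached values and assert that the redundant calculations agree - a toy check with openpyxl, where the file name and cell addresses are invented:

from openpyxl import load_workbook

# data_only=True returns the last values Excel calculated, not the formulas.
wb = load_workbook("budget.xlsx", data_only=True)
ws = wb["Summary"]

total_down_the_rows = ws["B20"].value        # e.g. SUM over the detail rows
total_across_categories = ws["E20"].value    # e.g. SUM over the category subtotals

assert total_down_the_rows == total_across_categories, (
    f"Cross-check failed: {total_down_the_rows} != {total_across_categories}"
)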
This is precisely why I went with YNAB [0], started budgeting better and saved quite a bit of money.
Everyone who sees me paying for an app (I'm from India) asks me: "why can't you just use Excel and do the same thing for free?" Excel sure is powerful, but in real life, your mileage may vary.
Please don't break the site guidelines with name-calling and personal attacks, regardless of how bad another comment is or you feel it is. We're trying to avoid a culture where people pick on others like that.
That's exactly right, no one is.
That's why no one uses C++ or JavaScript in safety-critical devices.
Even in 'normal' industry it is a huge problem and we need safer languages, i.e. Rust (merits to be seen).
Err... C & C++ are the primary languages used in typical safety-critical systems. I've used them.
JS... of course no one uses that, mainly because the abstraction level is too high; the number of layers of non-safety-critical code massively outweighs the application code.
The critical thing about safety-critical code isn't really the code... it's the whole development process. Implementation is just one small part of that process.
> Thats why noone uses C++ or javascript in safety critical devices
Ooh, no that's fairly incorrect. Javascript for sure is less common because it's the web browser language, but C and C++ are used all the time in embedded safety critical designs.
I wish I could remember the movie (it appears to be a reference to Al Capone, of all people), but there's a scene that gave me the tingles. The leading man, about to exact some revenge, turns to the antagonist and says,
"You've mistaken my kindness for weakness."
If I worked as a mental health technician I'd be complaining about how bad the system is for trauma survivors and neuro-atypical individuals. But I'm not. I'm a developer who has focused on - or been forced to look at - people struggling with User- and Developer-Experience issues for much of my career. If I'm talking to you about your struggles, you get sympathy. If I'm talking about policy, or policy makers, I'm going to give you a taste of what my 'customers' have to go through every day. And I'm not going to feel a bit sorry about it. If you dismiss it, we may have a problem.
Spreadsheets like Excel or Google Docs, with the expandability offered by VB and JavaScript, allow for rapid development of applications that require data storage and data interpretation in one package. I’ve seen them used and abused for several functions over the course of my career.
Apple/Claris FileMaker has had this niche for a bit, but it's only been a niche.
Tools such as Excel or PowerPoint are almost "too good", in that they allowed users to come up with uses which aren't what the software was made for.
As others have said regarding Excel, it's not so much what it's capable of but what you do with it and how you do it.
I've seen people using Excel as a tool for trading, as an inventory database for a warehouse, as a shared database to track task progress, as a Gantt chart, as a calendar and so on.
PowerPoint is the same. Who hasn't been given a PPT file as a manual for something?
"Oh, it's all in the powerpoint".
Or been given the same kind of file to familiarize yourself with something at your (new) workplace. The issue is you rarely have any of the material the speaker used to present the whole thing, so you're left wondering about the meaning of it all while reading bullet point lists...
No, a powerpoint is not a proper doc...
So in a way, I have to applaud MS as they did great with these but almost too great...
The thing that puzzles me is, what on earth happened on the 28th? A really anomalously low number of cases were reported then, and it's only got more weird after this fix added zero unreported new cases on that day and much higher numbers on the days before and after it. Something definitely doesn't seem right here.
I would hope that any decent data engineer would know not to do this. Are there avenues for fully-employed engineers to help governments build these systems properly? I'd honestly be happy to just donate some of my free time to help these efforts.
Interesting, I was on a conf call 3 hours ago about an almost identical use case. Excel gets used because word comes to a director at a health department that says "we need to collect testing results from 500 different facilities in a way their, and our, existing systems do not support" and they're given maybe 3 days to get it done along with the rest of their job. Putting together a technical team would take a week minimum, let alone requirements gathering and analysis. So what do they do? They fire up Excel, add column headers that represent the data they need, and then send it to a Sr. Manager and tell them to make it happen.
It doesn't even necessarily have to "work" (as in, work well); it just has to feel like it's the path of least resistance.
You can work in a department using Excel for this type of data collection and reporting, and you can continuously suggest to your superior that something more mature could be used. They would probably agree. As would the entire team!
But if the team of analysts are only properly trained in this system and it's all they know coupled with a huge backlog of cases and time pressures, then they're going to keep using the thing that causes least headaches in the short term. And that might be objectively worse to everyone involved but they just keep ploughing through.
Toxic culture, bad management, poor working practices and external pressures can force even the most sane of people to choose the worst technology on the basis that they perceive it as "saving time" in the short term. Even when they know full well they're robbing Peter to pay Paul, they still do it.
As is the case for many, Excel is one of my main tools. From financials to data gathering to electrical, mechanical or software engineering, Excel has always been there. It is fair to say I have made lots of money thanks to this tool.
And yet, every so often...
Many years ago a rounding error in a complex Excel tool we wrote to calculate coefficients for an FPGA-based polyphase FIR filter cost us a little over six months of debug time. I still remember the "eureka!" moment at two in the morning --while looking at the same data for the hundredth time in complete frustration-- when I realized we should have used "ROUNDUP()" rather than "ROUND()".
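For anyone curious, the off-by-one is easy to reproduce outside Excel too; a tiny sketch (the coefficient and scale factor are made up, not the original filter) of how the two rounding choices diverge:

    import math

    # Illustrative only: one filter coefficient scaled for fixed-point hardware.
    coeff = 0.123456
    scale = 2 ** 10                    # hypothetical quantisation scale

    scaled = coeff * scale             # ~126.42

    nearest = round(scaled)            # ROUND-style: nearest integer -> 126
    ceiling = math.ceil(scaled)        # ROUNDUP-style: always away from zero -> 127

    print(nearest, ceiling)            # one LSB apart, enough to change the filter response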
Most recently, I was working with a client who chose to build a massive Excel sheet to gather a bunch of relevant data. The person doing the work seems to think they know what they are doing (conditional formatting and filtering don't make you an expert). This poor spreadsheet has every color in the rainbow and a mess of formulas. It's impossible for anyone but the guy who created it to touch it.
Here's a hint:
Do not mix data with presentation. Where have we heard that before?
This is one of my pet peeves with Excel. If you need to gather a bunch of data, do it. Treat Excel like a database (apply normalization if you can!) and keep it with as little formatting as you possibly can. Then do all the formatting and calculations on a "Presentation" sheet or sheets. Just don't pollute your database with formatting.
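The same separation is trivial to keep outside Excel as well; a minimal sketch (file and column names are hypothetical) where the raw data is never touched and the "presentation" is derived from it:

    import pandas as pd

    # The "database": one record per row, no colours, no merged cells.
    raw = pd.read_csv("measurements_raw.csv")          # hypothetical file

    # The "presentation": calculations and layout live here, never back in the raw data.
    summary = (raw.groupby("site")["value"]
                  .agg(["count", "mean"])
                  .rename(columns={"mean": "average_value"}))

    # Write the derived view out separately; the raw CSV is never overwritten.
    summary.to_excel("presentation.xlsx")              # needs openpyxl installed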
EDIT: Thinking about the UK problem, if they were working with a ".xlsx" file and accidentally saved it in ".xls" form, well, as they say, "There's a warning dialog for that".
What surprises me the most about these kinds of incidents is that people keep working on the same single document. In other words, no semblance at all of what the software business knows as version control.
Decades ago I adopted the idea that storage is cheap and always getting cheaper. If I am working on something critical, I never work more than one day without a backup. I make a copy and continue editing. In most cases I make a new copy every single day.
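In script form the habit is a couple of lines (the workbook name is just an example):

    import os
    import shutil
    from datetime import date

    src = "contact_tracing.xlsx"                        # hypothetical workbook name
    os.makedirs("backups", exist_ok=True)
    dst = f"backups/contact_tracing_{date.today():%Y-%m-%d}.xlsx"
    shutil.copy2(src, dst)                              # one dated copy per working day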
I'm wondering if there could be an opportunity for an open-source contact tracing system, similar to the allReady [1] application? Maybe something backed by a graph database such as Neo4J [2]
Looking at the allReady repo, maybe it isn't a great example since it hasn't been touched in years...
This is clearly a case of someone not knowing the limits of a technology / how to select or use a technology. Yes, more modern versions of Excel support many more rows than 65k. As a "power" excel user in the early 2000s, it's tiring to see this same mistake made. And I might have blamed MSFT in 2008. But the "modern" world has taken an absurdly boorish response to this virus. This unfortunately looks like another example.
Forgot to mention: there's nothing inherently wrong with storing data in CSV files. One does have to think about consumption and processing. A Postgres database seems pretty reasonable to track information like this, but I wouldn't blame a CSV format for shortsightedness. We're not talking about actual big data here...
I've been teaching data curation for open data managers in governments around the world for the last decade. All the issues identified in this BBC article are covered in my syllabus.
I hope people finally understand (at least I hope Microsoft understands) that their biggest asset isn't actually Windows, but Excel.
Some on HN will definitely come in with "(x)Office supports MS Excel files as well", etc.
There are trillions of dollars' worth of revenue relying on Excel. In the best-case scenario, no one wants to work with, rework, or even touch that spreadsheet. Having 99% compatibility is not good enough.
Excel is great but I think at a certain level folks should start using Python and Pandas. The challenge is understanding when and getting the right resources to the right people.
The conceptual leap from Excel to Python+Pandas is pretty huge. Especially if someone hasn't done programming before. That is part of the reason I created https://www.easydatatransform.com .
Excel has its place. And it is an excellent tool for some workflows. However, I have started to think that Microsoft should disable the option to save workbooks. That would kind of force people to use Excel as a numerical scratchpad with fast charting, which is where Excel excels.
(There is some sarcasm in this comment, but not that much)
I wrote an energy reporting system based on Excel. Based on that, I think Excel use falls into 2 categories
a, extremely simple reporting with few workbooks
b, serious misuse of technology
Billion dollar companies were born based on replacing Excel in workflows. The problem is not that they have an old version, the real problem is that they use such a system in the critical path.
It's unlikely this will have had much impact since most reported cases were only a few days late but come on guys, really?
Using an Excel sheet as a database? In week 1, that would be "not great" but could be accepted as being a fast solution that everyone could work with. This far into a pandemic I think we can expect a little more professionalism in data handling.
People use excel sheets as databases all the time because it doesn't require any specialist knowledge to build or use.
Sure, we techy types prefer a proper database with a web frontend and an API, but that requires significantly more skill to build than an excel file.
No database has the flexibility and flat learning curve that excel has.
Until a database manages to meet these goals, it remains a good tool for certain usecases.
We as engineers should be working towards making a real database that's just as easy to use as Excel, and only then can we complain that people are using Excel sheets for this kind of stuff.
> People use excel sheets as databases all the time because it doesn't require any specialist knowledge to build or use.
Test and Trace is a multi billion pound programme of work, involving huge companies (eg, Serco) interfacing with national governmental bodies (Public Health England).
We've had months to get something in place. One of the reasons organisations like Serco are used is their apparent expertise with IT.
The main reason Serco is used is their political connections; they really ought to have been banned from consideration after the fraud scandal: https://www.bbc.co.uk/news/uk-50806919
(my one-step programme for improving UK public contracting would be that after a significant failed contract the contracting company and its directors and any other companies they are directors of would be banned from tendering for a period; maybe a couple of years?)
Unfortunately the procurement rules currently in place prevent procurement personnel evaluating a tender from taking into account any prior performance. I'm unsure whether this is a result of EU legislation and common market rules, or whether it is just the UK.
Italy has legislation to forbid people who engaged in corruption/embezzlement from working with public administrations.
It also has a required certification for companies that work with the public sector that they haven't been convicted of certain crimes (trafficking in illegal waste, extortion, organized crime etc).
So it doesn't seem to be an EU/Single Market rule that you can't check this.
One issue they have is that very few companies are willing to have the government as a customer. That's one reason why they keep going back to the same failed firms time and time again.
Reasons vary but the ones I've seen are generally:
1. Complex procurement regulations that take a lot of effort to comply with by contractors.
2. Low willingness to pay (both price-wise and time-wise).
3. Civil servants are nightmare customers who don't know what they want and are under no pressure to figure it out, so there's a lot of timewasting involved.
4. PR nightmare if/when things go wrong even if it's not your fault, as government is relatively open compared to other types of customers so easier to find out about problems, and lots of people automatically blame the private sector in preference to blaming government employees due to (misguided) assumptions of moral superiority of the public sector.
This is a very good point and a shame to see it being downvoted. I have been wishing there was an accessible database-ish software for doing my accounts, as the current mega-spreadsheet I use is pretty hairy and none of the specialised apps quite fit the bill. The closest thing seems to be Airtable, which while nice, is quite expensive.
Recently I was visiting hospital quite a bit (ante-natal) and witnessing the horror of the massive NHS form-filling software the nurses and midwives have to use. It's clunky and sprawling, but one of the main sins I think is its lack of adaptability. In the old days of pen and paper, a new form could be written, or adapted, and photocopied easily, and on-site. Now if a new field needs to be added or a process is changed it needs to go (I imagine) to some centralised IT development office and be fed into a ticketing system, where it might get changed in a few weeks or months. There's no room to quickly adapt at ground level, so you end up with things like this Excel sheet problem.
It depends which bits of functionality you want, but I think you are close to describing PowerQuery (Get & Transform) which is in Excel, which can ingest multiple data sources into effectively a relational database with hard datatypes and data-cleaning.
In PowerQuery & PowerPivot there are also no practical limits on the number of rows of data you can have (other than the impact on processing time, clearly hundreds of millions of records might start to be a problem). It's not quite access - it doesn't actually persist/store the data, just aggregates it from other sources which is probably what is required here.
Access is a little legacy and would have its own issues. I think in reality it depends on how you are ingesting data (which is probably hospitals submitting Excel templates or similar?). Maybe these could be automatically aggregated and put onto an analytics platform? (e.g. PowerBI). Hospitals could maybe report it on a portal, but I'm not convinced that the leadership would have wanted to change the established process (because of operational focus, change reduction & risk).
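Not PowerQuery itself, but for the pandas-inclined, a rough analogue of the "ingest many templates with hard datatypes" idea (the folder, file pattern and column names are invented for the sketch):

    import glob
    import pandas as pd

    frames = []
    for path in glob.glob("incoming/*.csv"):                 # hypothetical drop folder
        frames.append(pd.read_csv(
            path,
            dtype={"patient_id": str, "test_result": str},   # hard types: no mangled IDs
            parse_dates=["specimen_date"],
        ))

    combined = pd.concat(frames, ignore_index=True)
    print(f"{len(combined)} rows ingested from {len(frames)} files")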
Weaknesses of Access in no particular order: very poor multi-user/multi-device-access features. No real concept of servers. Buggy, had a nasty habit of corrupting its custom file format causing data loss (Excel never does this). Does expect you to understand SQL, data normalisation, etc. VBA just about serviceable for a spreadsheet but not good enough for a 'real' database-backed app.
Probably someone could try and make a better Access. There are surely lots of startups doing that already, along with hosting. All the "no code" tools that are out there, etc. Access had the advantage of coming with Office so lots of people had it already, whereas today's subscription based services that host your data have incremental cost.
LibreOffice Base is something of an attempt at a newer Access, but unfortunately it's far from ready for production use.
Base has very solid data storage foundations (including the ability to connect to real database servers) but the UI is very buggy and is painful to fight with. Note I said 'fight with' rather than 'use.'
Base seems to have a fair bit of potential, however, if someone were to pour some money and/or time into developing it further.
Don't forget the formulas! Excel is essentially a functional reactive programming REPL. No RDBMS I know lets you easily define a table in which cells in a row depend on values from previous cells in a row, and keep this as an invariant.
People working on Excel replacements need to remember about this aspect of spreadsheets too.
Sure. That's why I said it wouldn't bother me if this was in the first few weeks of the pandemic or for one-off ad-hoc analysis. I actually use Excel for that all the time, including for tasks where technically a db is the correct solution. Working is better than perfect and everyone knows (sort of) how to use Excel. To still be using it in production this many months in is not acceptable. Someone should have taken that Excel based workflow and set it up to run through a database, they had the time to do it.
There is also the fact that Excel has some rich editing tools. I have actually seen a system where most editing was done by exporting the relevant items to a spreadsheet, making your changes and reimporting.
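That round trip is easy enough to script; a hedged sketch (file and column names are made up) of export-for-editing and reimport:

    import pandas as pd

    # Export only the slice that needs hand editing in Excel.
    subset = pd.read_csv("items.csv", dtype=str).query("status == 'needs_review'")
    subset.to_excel("to_edit.xlsx", index=False)        # needs openpyxl installed

    # ...someone edits to_edit.xlsx in Excel...

    # Reimport with explicit string dtypes so IDs survive the trip.
    edited = pd.read_excel("to_edit.xlsx", dtype={"item_id": str})
    print(edited.head())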
> It's unlikely this will have had much impact since most reported cases were only a few days late
I'm not sure if you're being sarcastic or not, but in case you're not, this has meant c.16k people missing out on having their contacts traced. That's time-critical. Delays of any number of days are very bad. People will likely die because of this error.
This is safety-critical software. It should be stress tested and have sanity checks. It should be impossible for this to happen.
For something like this system, part of a track and trace system costing billions, to fail because of a file size error is just unbelievable.
I was being serious. Distributionally, most of the cases were 1-2 days before the error was spotted. To be clear though, when I said "not much impact", I didn't mean that people might not die because of it, only that it wouldn't affect the large scale functioning (such as it is) of the tracing operation.
This is more true than it would be from pure exponential growth alone, especially the relatively slow exponential growth the UK as a whole currently seems to be seeing (doubling time maybe about a fortnight?) The issue was effectively dropping all results after a certain number from a particular location, so at the start relatively few results were dropped and it was spotted around the time that number started to become substantial. It's that hard cutoff where results suddenly start to be dropped that means most of them are in the last few days.
Sadly it takes about 24 hours to get tested and 24 hours for the lab result. That gives very little time to trace contacts before they start entering the asymptomatic-but-infectious-to-others stage. A one-day delay is a problem.
The WHO estimated 750 million cumulative infections on Friday. This error is not unbelievable because the system is just a bit of political theater. There’s nothing more we can do against an endemic human coronavirus. It’s time to move on.
Hey, at least the UK has automated handling of data! Parts of Germany managed to not tell people they'd tested positive for a week or two because their manual, Excel-based process for handling test results couldn't cope with the increased load from testing people returning to the country - and they're meant to be the competent ones we look up to, according to our press. (I get the impression that German experts have a rather different view.)
Once you have a couple of thousand rows in Excel files, one file per day, possibly one per testing centre... possibly with inconsistent structure... non-standard date formats (even different in each row)... add to this the inconsistent namings ("Tested at" column with data: "Testing center 1, London", "TC1, London", "TC1 London", "TC1 LONDON", ...), etc.
Yes, anything-SQL or whatever would be better, but deciding between putting all the data into SQL by hand, or "uhm... I'll think of something and do it tomorrow", me being lazy, I'd pick the second option...
...until I'd get shamed on here. Maybe even a few days after.
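For what it's worth, even the lazy option can be half-automated; a rough pandas sketch (the file name and the "Test date" column are hypothetical, the "Tested at" mess is from above) of normalising those values before anything touches a database:

    import pandas as pd

    df = pd.read_csv("centre_results_2020-10-02.csv", dtype=str)   # hypothetical daily file

    # Collapse the "TC1, London" / "TC1 LONDON" style variants: uppercase,
    # drop commas, squeeze whitespace.
    df["tested_at"] = (df["Tested at"]
                       .str.upper()
                       .str.replace(",", " ", regex=False)
                       .str.split()
                       .str.join(" "))

    # Best-effort parse of the inconsistent dates; anything unparseable becomes NaT.
    df["test_date"] = pd.to_datetime(df["Test date"], errors="coerce", dayfirst=True)

    print(df[df["test_date"].isna()])    # the rows that still need a human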
How do you know it was Excel? I see no technical details in the posted article except "some files containing positive test results exceeded the maximum file size".
You're meant to add additional records as rows! (Excel supports only 16,384 columns, but 1,048,576 rows.)
Actually, this was a pet peeve of mine: Especially in the early days of COVID, just about every official or unofficial data source was an absolute shitshow:
- Transposed data (new entries in columns)
- Pretty printed dates (instead of ISO 8601 or Excel format, or... anything even vaguely parseable by computer).
- US date formats mixed in with non-US date formats.
- Each day in a separate file, often with as little as 100 bytes per file. Thousands upon thousands of files.
- Random comments or annotations in numeric fields (preventing graphing the data).
We don't all need to be data scientists, or machine learning wizards, or quantum computing gods.
But come on. This is the most basic, data entry clerk level Excel 101 stuff. Trivial, basic stuff.
Put rows in rows.
Don't mix random notes into columns you might want to sum or graph.
Oh, I saw Excel files that contained screenshots of other Excel files over cells containing data. Comments were in text fields, if you were lucky, or again screenshots, from Word.
And then the same people wondered why things based on this file didn't work.
They seem to have hit the row limit, not the column limit. They were using Excel to collate the data coming in from different test centres: when one of the CSVs was loaded in, the row limit was exceeded and Excel cut off the extra rows without the operator realising (I'm going off reporting by Alex Hearn at The Guardian here, and he's normally pretty good)
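This is exactly the kind of silent truncation that a few-line guard would have caught before the data ever reached Excel; a minimal sketch (the file name is hypothetical):

    # Legacy .xls sheets cap out at 65,536 rows; modern .xlsx at 1,048,576.
    XLS_ROW_LIMIT = 65_536

    with open("collated_results.csv") as f:              # hypothetical collated CSV
        rows = sum(1 for _ in f) - 1                     # minus the header line

    if rows > XLS_ROW_LIMIT:
        raise ValueError(f"{rows} rows will not fit in a .xls sheet; refusing to truncate silently")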
Let's assume that the people at local centres do save them in rows, which is the most logical thing to do. But probably the software where they copy/paste the data requires it in columns, so maybe they transposed it without realizing that it had been truncated to the maximum size.
Otherwise it wouldn't explain why it took so many days to realize that they couldn't add more columns.
It is absolutely standard that everyone doing office work outside of high-tech jobs uses Excel for data handling.
It's not surprising that this reaches into government.
Even if your canonical database is done properly, there will be a clash here as soon as you share (or give anyone the ability to generate) CSV files that are too large for Excel.
Nope. They appointed Dido Harding to cough manage it, and Matt Hancock oversees it. That should tell you all you need to know. I wonder if they were running Excel 2007?
I think it's more interesting what happens to any analysis done with this incorrect data. There was a news article last week suggesting that the growth in COVID-19 cases was "levelling off", and I wonder, in retrospect, whether this incorrect data would account for such a position.
Excel always feels like a bit of a curse of the world for me, and one that makes me feel like I'm going mad. I work partly in partnership with the NHS within my university, and get to see it being applied, completely inappropriately, with no ability to comment.
For me, I always feel like I'm living in some sort of parallel universe. Imagine living in a world where _you_, and a select group of others, know that there are better methods to drive a screw into a wall than punching it with your fist. However, 99% of people in the world punch screws into walls by hand, because they know how to use the only tool they have to hand (i.e. their hand). Their friends and neighbours do too. Their colleagues show them neat little tricks about how to fold your hand better to get the screw in with less pain, or more efficiently. The "proper" option -- drilling a pilot hole, using a self-tapping screw, understanding the difference between nails, bolts and screws, etc, or the concept of wall-plugs, is "too technical", "too complex", or requires "a large amount of infrastructure to fix a simple problem" [buying a drill & screwdriver, etc]. Sometimes people buy new walls with screws pre-installed because the thought of driving in those screws seems like an overwhelming technical burden.
You, and a group of similarly-minded nerds, understand just how stupid this is at times. You personally then spend the rest of your life watching people occasionally have horrific hand injuries from punching screws into walls and refusing the offer of screwdrivers. Others profit from selling vertical hard surfaces with a whole array of convenient screws pre-installed; or get employed teaching others the _very_ basics of screw-driving, or sell products containing all sorts of klunky hacks to cope with the fact that screws manually installed are at odd angles and tend to fall over randomly. The supplier of a popular brand of screwdrivers happens to be the world's largest screw manufacturer, but never seems to promote screwdrivers to people who currently punch them in with their fists. Instead, they sell them a subscription to screws, with the option for a very expensive pre-screwed wall-delivery service, built to your specification (that they call Dynamics for some reason…).
You, and a subset of your nerdy and educated friends recognise just how batshit insane the widespread use of punching-as-a-method-to-install-screws is, and occasionally laugh about it, mostly to distract yourself from the grim reality of buildings badly held together by punched-in-screws. Every so often a skyscraper-sized building collapses, [1] and it's realised in retrospect that the manually-driven-in-screws were not the right tool for the job. Still, people carry on banging in screws with their fists, and you increasingly realise that you have to -- somehow -- accept that in order to sleep at night without going insane.
My dad (who is 79 but still working) proudly mentioned that he spent 11 hours putting together some charts in Excel for his client. Probably something that could be done in an hour or so with an online tool.
At this point I think I'd rather have Peter Capaldi acting as Malcolm Tucker running Whitehall over the shower of incompetent people we have now. At least he might get something done in between the flaying
This is coincidentally also how I describe my job: I help when it doesn't fit in excel anymore.
Much easier for people to understand, and shorter, than "civil engineer turned software developer in an engineering firm where I do data management, automation and application development for (mostly internal) clients to streamline business and engineering processes".
A lot of trashing on Excel but I kind of think this particular instance isn't really the essence of the problems with Excel. Perfectly reasonable programmers get this wrong all the time (ever heard of buffer overflow anyone?), and the issue of small test data being unrepresentative of large test data is something everybody has to contend with. If there is a real problem with it I would say it is the presumably silent nature of the truncation which allowed the error to propagate.
The real horrors of Excel come from things like auto-conversion of column data (text, numbers to dates etc), off-by-one errors in copy and paste, overwriting formulas in cells, sorting of columns that doesn't capture all the rows, etc. These are all problems that are engineered into the user interface of Excel. It's like putting a tripwire at the top of your stairs and just expecting people to step over it day in and day out. It's basically inevitable somebody will fall down the stairs.
Why would you even use Excel for something so important, rather than importing the data into a database? The problem isn't the version of Excel, but using a spreadsheet in the first place!
Maybe everyone in school should be given real world problems like this to do in Excel so people can get a feel for doing it more correctly? Rows === records and columns === fields being the first lesson...
When I was in school (early 2000's) our GCSE computing lessons were, more or less, "Here's Microsoft Word, today we will learn how to format a letter!" or "Here's Microsoft Excel, today we will learn how to create a chart!".
The only upside to taking the class was that our school managed to identify the kids who were obviously computer literate and we got to go on a tour of Microsoft's UK headquarters...
Worked at an optics company, and they were spoiling lenses until someone discovered that Excel was rounding values to two decimal places (financial rounding).
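Not their actual workflow, obviously, but the failure mode is easy to illustrate: feed a machine the two-decimal display value instead of the stored one and a tight tolerance goes out the window (all numbers below are invented):

    # Illustrative only: a lens parameter with a tolerance tighter than two decimals.
    nominal   = 12.3456      # hypothetical target value, in mm
    tolerance = 0.001        # hypothetical machining tolerance, in mm

    displayed = round(nominal, 2)            # what a two-decimal format hands over: 12.35
    error = abs(displayed - nominal)         # ~0.0044 mm

    print(f"error {error:.4f} mm, out of tolerance: {error > tolerance}")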
The actual title of the article is awfully clickbaity and misleading. As I understand it, the issue is not at all Microsoft's fault, but PHE using an inadequate way of storing the data.
I'm a bit of a novice when it comes to storing vast amounts of data, especially of this nature, but why would you use Excel rather than an actual database?
Development time and data accessibility. Everyone you might need to send data to has Excel, fewer have Access, and far fewer have a typical database software.
That's not saying Excel is the best way to store data, but it gets a lot of jobs done without the multi-month delays that come with an IT project. Oftentimes, waiting 2 weeks because they're already busy and then spending another 2 weeks outlining requirements is categorically unacceptable for accomplishing the business goals.
Why does Excel still have such low row limits in 2020? When they increased it from 64K, did they decide to only go to a million to promote Azure for "Big Data"?
It sounds like a data processing pipeline rather than a database itself. Systems A...E at hospitals push CSV files to a shared system F. System G pulls the data from F and passes it through an Excel workbook (failure point!) for data augmentation and cleanup tasks built by a public health SME, then drops the data in location H. System I picks it up and finally loads the data into a "real" database for additional reporting.
If that sounds insane, I assure you that is a simplified version of what happens in some systems.
Wow, what a mess. At the very least (which is about all that could be done), A could assign a new-record ID to each record; system I could tell F where to find it; F could verify its presence (or scream) and tell A.
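Even that minimal reconciliation is only a few lines; a sketch (the export files and the record_id column are hypothetical) of "scream if anything that left A never arrived":

    import pandas as pd

    source_ids = set(pd.read_csv("system_a_export.csv", dtype=str)["record_id"])
    loaded_ids = set(pd.read_csv("system_i_load_log.csv", dtype=str)["record_id"])

    missing = source_ids - loaded_ids
    if missing:
        # The "scream": records that went in at A but never made it to the far end.
        print(f"{len(missing)} records lost in the pipeline, e.g.", sorted(missing)[:10])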
I have seen this type of bug in production. After seeing it, I refuse to accept xls for data. CSV all day; whoever needs xlsx can transform it into that afterwards.
I believe Office will warn you when saving as .xls (or .doc instead of .docx, or .ppt instead of .pptx for that matter) that some data loss may occur. Problem is that nobody reads those warnings and just hits the OK button.
Yep, that's the problem. There's a gap in the space between the generality of Excel and the performance of a SQL DB with no UI that nobody wants to fill.
Microsoft Access sits there, but nobody uses it for some reason.
My belief is it's because Access sits on the wrong side of the gap you mention. It's an SQL DB with a UI, alright, but the database is (or at least has a reputation of being) kind of flaky, and the UI exposes you to all the complexity and inflexibility of relational databases. This creates two types of hurdle. One is educational - you have to teach people the basics of relational data modelling if you want Access to make sense for them. The other is operational - you can't shuffle data or table structure around in Access like you could in Excel. You can't just copy pieces of a table around, you can't make rows depend on rows before them, etc.
Worth remembering is that Excel is, fundamentally, a 2D functional reactive programming REPL. Most of its users don't understand that, but they internalize the behavior. They may not know what a DAG is, but they know that updating cells will recalculate cells that depended on them. So you can't replace Excel with just a database - because half of the utility of the program is in formulas.
There is a gap - a tool is missing that would offer the flexibility, the ergonomics, and FRP capabilities of Excel, while also providing tools for ensuring data consistency and relational queries, and at the same time also making them easy for users to wield. It's a tall order.
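To make the reactive-DAG point concrete, here's a toy pull-based sketch (nothing to do with Excel's actual engine) of cells whose formulas pick up changes to their inputs:

    # Toy spreadsheet: each "cell" is either a constant or a formula over other cells.
    cells = {
        "A1": 10,
        "A2": 32,
        "B1": lambda get: get("A1") + get("A2"),   # =A1+A2
        "B2": lambda get: get("B1") * 2,           # =B1*2
    }

    def get(name):
        value = cells[name]
        return value(get) if callable(value) else value

    print(get("B2"))    # 84
    cells["A1"] = 100   # change an input...
    print(get("B2"))    # ...and every dependent cell reflects it: 264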
A non-technical senior manager who was in charge of a project I worked on went on to manage England's contact tracing. Not sure if he was directly responsible for picking Excel, but everyone who worked under him was not the least bit surprised when this news came out.
The project failed spectacularly if you were wondering.
TL;DR [I found it hard to find which actual limit was hit.]
> The problems are believed to have arisen when labs sent in their results using CSV files, which have no limits on size. But PHE then imported the results into Excel, where documents have a limit of just over a million lines.
> The technical issue has now been resolved by splitting the Excel files into batches.
> PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system and made available to the NHS Test and Trace team as well as other government computer dashboards.
> The problem is that the PHE developers picked an old file format to do this - known as XLS.
> As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of.
Professionalism has been entirely absent from the top-level handling of the pandemic. It's not entirely clear from the article who's responsible for this, but I note that the so-called "NHS" test and trace is in fact an outsourced organisation: https://en.wikipedia.org/wiki/NHS_Test_and_Trace run by Tory placewoman Dido Harding.
> "She is a former chief executive of the TalkTalk Group where she faced calls for her to resign after a cyber attack revealed the details of 4 million customers. A member of the Conservative Party, Harding is married to Conservative Party Member of Parliament John Penrose and is a friend of former Prime Minister David Cameron. Harding was appointed as a Member of the House of Lords by Cameron in 2014. She holds a board position at the Jockey Club, which is responsible for several major horse-racing events including the Cheltenham Festival. "
> "In May 2020, Harding was appointed by Health Secretary Matt Hancock to head NHS Test and Trace, established to track and help prevent the spread of COVID-19 in England. In August 2020, after it was announced Public Health England was to be abolished, Harding was appointed interim chair of the new National Institute for Health Protection, an appointment that was criticised by health experts as she did not have a background in healthcare."
I mean, let's call a spade a spade: this is cronyism and thus corruption. Unfortunately, voters never care about these immoral practices. When a social system/administration is dysfunctional, corruption is often the first cause of these dysfunctions. The result of corruption is always sheer incompetence.
I think the voters do care, but can't keep track of the details and only know what the media choose to focus on. This is why Donald Trump's "drain the swamp" pitch was so effective, despite the reality of the far worse corruption in his campaign.
Voters want less corruption, but their own tribalism leads them to unhelpful positions of "everyone is corrupt", "none of my tribe are corrupt", and sometimes both of those at once.
I have not yet watched this programme from the BBC but it claims the NHS Test and Trace system is a chaotic mess. In other words, as shambolic as everything our Conservative government has done over the pandemic.
Panorama - Test and Track Exposed:
> "Panorama hears from whistleblowers working inside the government’s new coronavirus tracking system. They are so concerned about NHS Test and Trace that they are speaking out to reveal chaos, technical problems, confusion, wasted resources and a system that does not appear to them to be working. The programme also hears from local public health teams who say they have largely been ignored by the government in favour of the private companies hired to run the new centralised tracking system. As Panorama investigates, it has left some local authorities questioning whether local lockdowns could have been handled better or avoided altogether."
For those not in the UK, Dido Harding was already on front pages a couple of weeks ago when she was grilled by a Parliamentary committee about the shambolic Covid testing organisation and her reply was "no-one could predict such a strong demand for Covid tests"...
As the NY Times said yesterday, sometimes it feels like "Britain is operating without adult supervision".
She was also CEO of ISP TalkTalk when they had a massive personal data breach and told an investigation that she didn't know if the data was encrypted or not.
The whole NHS Test and Trace organisation seems very strange. Going by recent news reports, it's not just Harding who comes from a retail background, several of the other top people do, too. What they don't seem to have done is hire top people with a technical or management background in either public health or technology, which you'd think might be slightly important for an organisation with this remit.
In other news, I'm looking forward to hearing about how a system apparently hacked together from text files and decades-old Excel formats is complying with basic data protection principles, given that it's being used to process sensitive personal data and has profound implications for both public health and now (since violating an instruction from their people to self-isolate has just been made a criminal offence) individual liberty for huge numbers of people.
While that is true, I have a little difficulty imagining that Harding or anyone else senior was involved in the process of shoveling around spreadsheets. As the overall leadership she obviously has the accountability but just saying, "oh what do you expect when you put a crony in?" ignores the reality that a number of people on an operational level had been doing this for months and nobody seems to have stopped it.
The latest news is that apparently this was a problem with Public Health England's internal processes anyway, not NHS Test and Trace or any other kind of outsourced service: https://www.bbc.co.uk/news/uk-54422505 I don't think that even falls under Harding's responsibilties.
Day one on the job of CEO should be requesting input about systemic organizational and operational problems. Both from your direct reports and from the underlings who actually know what is broken.
The BBC updated the article with technical details, see about halfway down. Here's the tech details:
(PHE = Public Health England)
Analysis by Leo Kelion, Technology desk editor
-----------------------------------------------
The BBC has confirmed the missing Covid-19 test data was caused by the ill-thought-out use of Microsoft's Excel software. Furthermore, PHE was to blame, rather than a third-party contractor.
The issue was caused by the way the agency brought together logs produced by the commercial firms paid to carry out swab tests for the virus.
They filed their results in the form of text-based lists, without issue.
PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system and made available to the NHS Test and Trace team as well as other government computer dashboards.
The problem is that the PHE developers picked an old file format to do this - known as XLS.
As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of.
And since each test result created several rows of data, in practice it meant that each template was limited to about 1,400 cases. When that total was reached, further cases were simply left off.
Until last week, there were not enough test results being generated by private labs for this to have been a problem - PHE is confident that test results were not previously missed because of this issue.
And in its defence, the agency would note that it caught most of the cases within a day or two of the records slipping through its net.
To handle the problem, PHE is now breaking down the data into smaller batches to create a larger number of Excel templates in order to make sure none hit their cap.
But insiders acknowledge that their current clunky system needs to be replaced by something more advanced that does not involve Excel.
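The batching fix described there is, mechanically, not much more than this (the file names, the 60k margin, and the xlsx output are my assumptions, not PHE's actual script):

    import pandas as pd

    BATCH = 60_000    # stay comfortably under the 65,536-row .xls cap

    # Hypothetical combined results file, split into per-batch workbooks.
    for i, chunk in enumerate(pd.read_csv("all_results.csv", chunksize=BATCH)):
        chunk.to_excel(f"results_batch_{i:03d}.xlsx", index=False)   # needs openpyxl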
Everyone's saying not checking for errors was the terrible thing, but I think having Excel as a middle-man in such a sensitive system was the first issue.
It's such a British government thing to do too (I worked for Syntegra when I got out of uni).
The real, visceral tragedy that is COVID-19 notwithstanding, I think it provides a marvelous example of why statistics, the importance of accuracy in data collection, and the effects of exponential growth are must-haves in today's high school curriculums.
Imho, exponential growth especially, and then clustering.
Some pretty animations in a lesson might be helpful.
Fire grows in a similar way.
I've seen people up in arms because of things like a workplace of 1000 people shut down because "only" 20 people tested positive one day.
What's the problem?
To people who don't understand exponential growth it seems like 1 in 50 people is a fuss about nothing.
To people who do understand, it's an early-warning signal which says 500 people will get it in weeks due to the confined workspace, if not stopped urgently. And they will infect 1000s in the surrounding community in the same timeframe.
(To anyone tempted to point out the published R isn't that high, local R is highly dependent on situation and who mixes with whom. In a confined workspace, especially with people moving around, it's higher than in the general population. This is why it's useful to teach clustering in addition to exponential growth.)
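A back-of-the-envelope projection shows why the 20 cases matter (the 3-day doubling time is an illustrative assumption for a confined workplace, not a published figure):

    cases = 20
    doubling_days = 3          # assumed local doubling time in a confined workspace
    workforce = 1000

    day = 0
    while cases < workforce / 2:
        day += doubling_days
        cases *= 2

    print(f"~{cases} of {workforce} staff infected after about {day} days")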
This government had a statistics screw-up just two months ago and the only negative reaction was about how they handled the fallout. As a nation we simply do not understand its value - or the cost of it being misused.
When I did my A-level in maths the course was half pure and half applied maths.
The pure maths course was mandatory, but for the applied side you had the choice of either mechanics or statistics. At the time I was much more interested in physics, and was considering it as a degree course but, even if I'd done that, statistics would have been far more useful to me.
I certainly wouldn't choose to drop geometry or calculus, but I've had many occasions to regret my choice of mechanics over statistics during the past two decades.
I mean, statistics relies heavily on calculus. It might be best to try and fit both in. Something at least a little bit fun and educational could be achieved with some dice throwing or simple computer simulations. High school students play dice games; they will immediately appreciate how useful a bit of probability would be in that context.
Trigonometric calculus comprised at least half my high school calc assignments ~20 years ago. Surely that can be scaled back in favour of more lessons on statistics and probability.
Change in deaths over change in statistical illiteracy is positive eh? How positive? Presumably the change in statistical illiteracy depends upon the magnitude of the deaths. It'd be an interesting relationship to study...
Then that is nuts, and you don't need to remove calculus to fix it.
Presumably there's something you're teaching that we (in the UK) aren't, or more depth somewhere, but I don't know what it is. Our systems are quite different in that mathematics becomes optional after GCSEs (15-16yo) here, but statistics is taught from a far younger age than that, up to that point, and beyond for those that take A level(s) in mathematics. (As I recall there are six statistics A level modules in total, S1-6; I think S1-2 are compulsory for a full A2 (vs. AS) mathematics qualification, which consists of six modules total. In order to do all six statistics modules one would at least take the second A level, 'further mathematics', and probably (pun intended; unless statistics was a particular passion and the school allowed it) 'further additional'.)
NB I quite liked that structure - there are 18 'modules' total (arranged in 'core', 'further pure', 'decision' (algorithms), 'statistics', and 'mechanics'). Three A levels total available (six modules each) or fewer and an AS (three). Which ones you want to do are almost entirely up to you if the school's big/lenient enough. IIRC you could even decide for yourself how to allocate the modules' grades across the number of A levels you were eligible for, e.g. if AC would be more beneficial to you than BB.
The problem the world has right now is not a lack of understanding of what exponential growth means. Plenty of people understand that just fine. It's so easy to understand that there is even a simple ancient parable about it (of the Chinese Emperor and the chess board).
The problem is people who are obsessed with the concept of exponential growth even though "grows exponentially until everyone is infected" is not a real thing that happens with viruses, even though COVID-19 no more shows exponential growth than the sine wave does (sin roughly doubles at points), even though Farr's Law is all about how microbial diseases show S-curve type growth.
This leads to craziness like the UK's chief medical officers going on TV and presenting a graph in which the last few data points are in decline, but with sudden endless exponential growth projected into the future, along with claims that "this isn't a prediction, but clearly, we have to take extreme action now because of exponential growth".
Observing exponential growth for a few days in a row does NOT mean endless growth until the whole world is infected. Growth rates can themselves change over time, and do. That's the thing people don't seem to understand.
You're right. But this article contains significant new information (SNI) about Excel, and spreadsheet topics are perennially interesting to HN readers, so this is the rare case where a quick follow-up article, and also the title change, probably makes sense. Usually it goes the other way of course:
Fair enough, makes sense, thanks. When I posted the comment it was the first comment on the thread (not sure if it's been merged or I'd had the page open a long time before posting) and we'd already had other threads with that leeway. Cheers for the clarification, I'd thought that might be the reason for the earlier ones.
Edit: just seen your other comment re merging, thanks.
The BBC has expanded on the Excel-related part and broken it out into a separate article: https://www.bbc.co.uk/news/technology-54423988 Maybe it'd make more sense to link that one, if you're keeping the title?
You're saying that Respiratory failure or diabetes is the 'real cause of death', but COVID-19 is actually the cause of death. This has been common practice for years, stop believing what Mc. Nixon tweets out.
Overblown, IMO - only the useless absolute figure is affected, so it just creates another reason it's useless.
Unless positive result files are more likely to exceed the file size limit than negative ones, of course, which I haven't heard suggested is the case.
Edit: Perhaps I should clarify that I'm not suggesting it isn't a cockup. I just don't see it as a massive scandal that renders the data useless - it just decreased the sample size. What particularly wound me up was a Radio 4 presenter this morning objecting to the guest (I missed the start, I'm not sure exactly who - a female public or civil servant) calling it a 'glitch'. ('Really?! Really! 16000 missing cases is a glitch?!') Well, yes. Glitches can have minor, severe, catastrophic, or no consequences.
It's not much good for comparison with March, when tests were only done in hospital, and I see a lot of people doing that.
Most people who had covid by August (which was at least 6%, or 4 million, in the UK according to antibody tests) caught it between start of February and end of April. That's at least 3.5 million over 90 days, or 40k a day on average. Peak was likely double, maybe even treble that, given lockdown on 23rd of March dramatically cut infections.
We're likely testing somewhere in the region of 50% of actual cases. The non-symptomatic cases are roughly equal in number to the symptomatic cases, so double the absolute number -- some with symptoms will refuse to be tested because they don't want to miss work, some without will be caught by contact tracing, and those two groups probably cancel each other out.
As such I'd expect 50k/day to be the "March equivalent" - with doubling every 10 days that means another 2 or 3 weeks.
It's not just about absolute cases though. Cases have been doubling roughly every 10 days, as have deaths. Deaths lag cases by 2-3 weeks, so I'd expect deaths in 20 days to be continuing to double even if we all stayed in an isolated booth from now.
Ultimately the concern is we're heading into flu season when hospitals are stretched, and cases, hospital admissions, and deaths are all increasing.
Hopefully flu season will be milder due to social distancing and due to more vulnerable people having been killed off, but either way we need to get a grip soon. Or just abandon any pretense of trying to stop its spread.
Actually not only that. It affected contact tracing, and people exposed to infected patients were not informed to self isolate. Possibly leading to extra spreading of the virus.