I disagree with this list of "falsehoods programmers believe".
Most of those lists describe objectively wrong assumptions programmers have about some real-world phenomenon. But in this case I'd argue that a lot of these points are simply not correctly formatted CSV files. Just because someone handed you a file with the extension .csv does not mean it's a proper CSV file, and it certainly does not mean that you have to guess at what it intends to encode without assuming any of the things on the list.
For example, "All CSVs contains a single consistent encoding". If this is not the case I'd (rightfully) reject the file as not being a proper CSV file.
You can certainly reject the file, but that's not going to make your user happy. The (sad) reality is that you often have to deal with these edge cases in software.
The user does not care that the CSV is not proper. They just want to open it, and if it doesn't work, they will blame your program, not the source of the CSV.
And you only control the former. I have had to deal with a CSV export from a web service that was "not proper". I notified their support, but that problem was never fixed.
Luckily, the user in that case was just me. But if I were making software for someone else, blaming the right source of the problem would not have taken me anywhere.
There are different use cases. What you say is not really CSV-specific and is true of most data sets: you receive data sets that are not compliant with whatever spec you expect. Different projects must handle that in different ways, either making some effort to correct errors or not.
In some cases data users wield enormous power over data submitters (e.g., many for-record safety analyses or real-time systems). If any part of the data collection process is suspect the sample is likely to get tossed. And for spec compliance, if a data set does not pass perfectly through a validator the sender will likely get some sort of a "shame on you" nastygram.
The real problem is when the user complains "but it works fine in Excel." If it does work elsewhere then in the user's eyes it's your app's fault that it cannot read a working file.
Yes, it does work in Excel but it's because of 40 years of various edge case fixes that you just cannot replicate in their entirety in your app.
Really what this comes down to is that the CSV "standard" is in actuality an undocumented standard with one reference implementation - Excel. If it works in Excel, it's valid CSV, if it doesn't, then it's not. It's a sad state of affairs, and probably one reason to avoid working with CSV unless absolutely necessary, but there's no getting around that fact.
This is so wrong it's hard to figure out where to start. Excel doesn't make proper CSVs and it's not a reference for anything. I can change the extension of an HTML file to .csv and open it in Excel; it's still an HTML file.
Excel makes proper CSVs insofar as when you say that you allow people to import CSV files into your application, the vast vast majority will be using Excel to create / manipulate those CSV files (and if they don't, whatever tool they DO use will have made efforts to work consistently with Excel).
What any other spec actually says is irrelevant. For most non-technical users, anything Excel outputs as a CSV is a CSV file (and anything you output that doesn't work with Excel is not).
I use CSV (really tsv) because big data tools can open them extremely easily, because they're human readable, and because they're easy for my software stack to parse and use. I do most of my data processing for local consumption on a repl-like tool. We have a defined configuration for TSV we use internally that is lossless in terms of storing and extracting a number of complex data types. Embedded JSON with embedded separators in that JSON? No problem.
Originally the xls format was the only open format in the Office family, and it was actually much easier to create xls files than to deal with the stupidity of CSV.
I take a defined format like .xls over an unspecified de facto ‘standard’ like .csv anytime. In many locales .csv files don’t even contain comma separated values because the comma is the decimal symbol already. It’s a mess, half of the time Microsoft can’t make it work either even when you only use their software.
But every piece of software that accepts CSV files cannot be expected to include an editor/data-wrangler mode that will allow the user to resolve encoding errors, ambiguities and inconsistencies, nor can it be expected to resolve those problems automagically.
> If you put a phone number form and the user put letters inside, you will just say "invalid number". You won't try to parse it.
Wait, what? Phone numbers could always include letters. Many businesses get phone numbers that include or consist of words relevant to their business, to make them memorable. And I've known people whose personal number happens to contain a word or words, so they give the letter version out because it's easy to remember. If a phone number field doesn't include simple logic to map letters into the proper numbers, that's just lazy programming.
That's an American way of seeing things (and it is only a thing in very few countries), because all your letters are on the num pad. But 555-فودكا is not going to work.
And what next, do you accept exotic Unicode numbers? Letters that you casefold to something that can map to a number?
Again, you draw the line somewhere. There is no right place for the line, it depends on your business, but there is always a line of what you accept as user input.
Being offended won't help you on HN, and neither will calling people names.
The letters all correspond to numbers. I don't think most applications with phone numbers really need to support putting 1-800-FLOWERS instead of the real number.
There is a standard, universal mapping of letters onto numbers. This allows companies to advertise their phone number as something memorable like 1-800-CASH-4-AU and everyone knows how to dial it. People don't even need to remember the mapping of letters, because they're printed on every keypad.
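For what it's worth, that mapping (the letters printed on every keypad, standardized in ITU E.161) fits in a few lines of Python. A minimal sketch; the function name is mine, not from any library:

```python
# ITU E.161 keypad letter groups: 2=ABC, 3=DEF, ... 9=WXYZ
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {letter: digit for digit, letters in KEYPAD.items() for letter in letters}

def normalize_phone(number: str) -> str:
    """Map letters to their keypad digits, leaving digits and punctuation alone."""
    return "".join(LETTER_TO_DIGIT.get(ch.upper(), ch) for ch in number)

print(normalize_phone("1-800-FLOWERS"))  # prints 1-800-3569377
```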
In the US, phone numbers started with 2 letters well into the 1960s. They were usually the first 2 letters of the street name the nearest telephone exchange was on. My grandfather's address book was full of numbers still written that way.
The numbers all have three or four letters above them and people (mostly businesses, really) will sometimes get a number that spells out a memorable mnemonic. That's all he's talking about.
Well, there are the A, B, C and D DTMF tones. Not sure if those technically count as numbers; from memory they are outside the ITU spec for phone numbers.
You're right, but non-technical clients don't care. It's a little different to them.
With a ZIP code in the US, it consists of 5 digits that form a unique number for a geographical area, with 4 more digits that represent a sub-area within the 5-digit ZIP. But there are no instances where there can be a letter. That's a hard fact. If you put 4308A-65XK as the ZIP, the post office or delivery service will not know how to deliver the mail without trying to cross-reference the address against possible valid ZIP codes.
Invalid CSV is murky, because CSV is a serialization format. That is, invalid CSV may not play well with a program that accepts well-formed CSV, but a human may still be able to parse it with no problem. Returning "invalid ZIP Code" makes sense. The client would understand that letters do not belong in a ZIP Code.
If you've never had to deal with a nightmare client of that sort, hats off to you.
Perhaps the problem is with the client relationship.
Why not explain to the client that you will need to put in manual or custom work to clean up the CSV or batch of CSVs and charge for the extra time writing a script or cleanup?
Literally the only time this may not apply is if you are writing a general purpose program.
You can also have clients that don't have a working OS. But you can't fix that. You need to choose a limit after which you say "no, I won't do that, get your s* together".
How could you possibly handle a single file that mixes and matches encodings? I'd think opening that in any software is going to result in some garbled characters.
That non-compliant sort-of-CSV can be cleanly (i.e. without errors) decoded as several "dialects", producing multiple different results, of which only one (or none) will be correct.
I guess I would do it, because I could then be explicit about the differences between the two, and explicit about how I would decode something if it was indeterminate.
True, but if your job involves data transformations or migrations, then bad files, dirty data and such things are just part of the job. You deal with them. Rejecting incoming data because it isn't perfect is a path to bad customer service, and a poor reputation.
One challenge is that rejecting, or even detecting that something is wrong is often not a simple decision to make. Parsing a CSV in an invalid encoding will usually produce "valid" but nonsensical output. Determining that that has happened when your destination schema is somewhat freeform can be a challenge.
It depends on the context. If you're pulling in data from a source whose format the user can't control, then you have to deal with the broken CSV emitter. But if you're pulling in data from Excel, you're better off telling the user to save as xlsx because Excel makes such a hash of CSVs. In Python, I've had way fewer problems using xlrd and openpyxl than I have with any of the ways you can read CSVs.
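For illustration, reading an .xlsx directly with openpyxl is roughly this much code (the file name and sheet choice here are placeholders):

```python
from openpyxl import load_workbook

# read_only=True streams rows instead of loading the whole workbook into memory;
# data_only=True gives you the last computed value of formula cells.
wb = load_workbook("export.xlsx", read_only=True, data_only=True)
ws = wb.active  # or wb["Sheet1"] for a specific sheet

for row in ws.iter_rows(values_only=True):
    # Values come back typed (str, int, float, datetime, None) -
    # no guessing at encodings or delimiters.
    print(row)
```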
> For example, "All CSVs contains a single consistent encoding". If this is not the case I'd (rightfully) reject the file as not being a proper CSV file.
"CSV" only specifies the metacharacters (at best); unless it's explicitly stated that the CSV is in a single consistent encoding, it's no more improper than inconsistently encoded Linux paths. It's stupid and a pain in the ass to be sure, but it's not improper.
And when you don't control the entire pipeline, you will eventually hit this issue.
In fact the first falsehood programmers believe about CSVs is that it's a good serialisation, interchange (between separate systems) or human-editable format. While it's routinely used for all of those, it's also absolute garbage at all of them.
> "CSV" only specifies the metacharacters (at best); unless it's explicitly stated that the CSV is in a single consistent encoding, it's no more improper than inconsistently encoded Linux paths. It's stupid and a pain in the ass to be sure, but it's not improper.
I think it's generally agreed that a CSV file is a plain text file. And I'm not aware of any school of thought that says that valid text files may have multiple encodings. IMO, it's ridiculous to expect any software to detect and accommodate that.
I don't expect a CSV to be a plain text file, unless all the data it's encoding is plain text. CSV is a data interchange format, not text. If I open a CSV in a plain text editor and get junk out of it, I would be mildly surprised but I wouldn't immediately conclude the file is invalid or corrupt, I would at least entertain the notion that the data is correct but indisplayable.
Mixed encodings would make sense if the columns in a single CSV are coming from multiple data sources. Any data processing software that expects to work in the real world is expected to be able to handle inputs in different formats. Depending on the relationship between the parts of the processing pipeline, trying to reconcile formatting issues before/while creating the CSV may be an objectively incorrect design decision.
>I don't expect a CSV to be a plain text file, unless all the data it's encoding is plain text.
Really? So you might write a floating point value to a "CSV" file by writing the IEEE 754 bit pattern for that value? And you would call that a valid CSV file, and would expect generic CSV parsers to properly decode that value?
That's a new one on me. I have always been under the impression that CSV files are plain text. Perhaps you should edit the Wikipedia page so it doesn't continue to give unsuspecting people the wrong information:
> In computing, a comma-separated values (CSV) file is a delimited text file that uses a comma to separate values (many implementations of CSV import/export tools allow other separators to be used). It stores tabular data (numbers and text) in plain text. [1]
> In fact the first falsehood programmers believe about CSVs is that it's a good serialisation, interchange (between separate systems) or human-editable format.
I would argue that most programmers realize CSV sucks and would rather use something structured and sane. In my experience CSV support almost always comes about because someone from another business unit (e.g., marketing) needs self-service import facilities. Those folks live in Excel and CSV becomes the de facto import format.
CSV is simpler and therefore harder for Excel to screw up. Not for lack of trying: exponential values often truncated, Windows 1252 encoding, BOM, etc.
Then you make a wrapper program in a language that has an Excel library which e.g. takes your own data in some structured format (JSON, YAML, or whatever) and emits Excel.
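In Python that wrapper can be a handful of lines with openpyxl. A sketch; the JSON layout and file names are invented for illustration:

```python
import json
from openpyxl import Workbook

# Assume a JSON file containing a list of flat records, e.g.
# [{"name": "Alice", "amount": 12.5}, {"name": "Bob", "amount": 3.0}]
with open("data.json", encoding="utf-8") as f:
    records = json.load(f)

wb = Workbook()
ws = wb.active
ws.append(list(records[0].keys()))      # header row
for record in records:
    ws.append(list(record.values()))    # one row per record
wb.save("data.xlsx")                    # Excel opens this without any import dialog
```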
> For example, "All CSVs contains a single consistent encoding". If this is not the case I'd (rightfully) reject the file as not being a proper CSV file.
That's literally the only one on there I felt that way about. All the other ones can and should be accounted for in your application if it accepts arbitrary CSVs.
Encoding is a special case because it's impossible to do anything other than guess at a plaintext document's encoding. If you're getting byte code parsing errors, the code should try a few different encodings before rejecting the document.
This won't help if there's a mix of encodings in the document. If this is the case, nothing can help and you will have garbage going in.
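For the single-encoding case, the fallback chain can be as simple as the sketch below; the candidate list is an assumption you'd tune to your own data sources, and the function name is mine:

```python
def decode_best_effort(raw: bytes, candidates=("utf-8-sig", "cp1252", "latin-1")) -> str:
    # utf-8-sig strips a BOM if present; cp1252 covers typical Windows exports;
    # latin-1 accepts any byte sequence, so it acts as the last resort.
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            pass
    raise ValueError("none of the candidate encodings fit")
```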
The vast majority of entries are duplicates. Sorting those out, you're really not left with that much, and some of those are just invalid CSVs. Assuming I haven't miscounted:
There are fifteen points effectively addressing delimiters, which all sum up to "delimiters may be arbitrary byte sequences, and they should be ignored in quoted fields or if escaped".
There are nine points about encoding, three of which, if violated, just result in a flat-out invalid file.
There are four points that effectively claim that you must parse CSV files that adhere to no known formal or informal standard, which is just silly.
These "Falsehoods programmers believe about X" lists all revolve around the same thing: accepting arbitrary inputs. If you can constrain your input, then great, you don't need to read the article. Or read it anyway, just so you get a better idea of exactly how to constrain it.
But if you can't, and business needs force you to allow anyone and their dog Sue to pass input into your software system, then you need to be able to handle garbage input effectively, up to and including just correcting it for them if that's possible.
The biases listed are good to check your code against if that's indeed the case.
The idea of "accepting arbitrary input" is nonsense. Arbitrary input would imply that I can feed the CSV parser a JPEG and expect some non-error result.
A CSV parser will obviously only accept CSV. That list includes a few (5?) things that might end up being common bugs in hand-rolled CSV parsers. However, most of it is just bogus.
The only possible way to handle "garbage" input is sensible errors. Of course, if someone is presenting a valid CSV variant that you do not support, then it's not garbage, but just a variant that you may wish to implement support for.
A parser that tries to continue on error would be absolutely outrageous for a business to implement. How do you think the finance department would feel about their accounting being rendered invalid because of a parser trying to be smart?
> The idea of "accepting arbitrary input" is nonsense. Arbitrary input would imply that I can feed the CSV parser a JPEG and expect some non-error result.
This is completely wrong. The idea is to sift through the input and return appropriate error codes for garbage input, while remaining flexible enough to handle malformed input that's not quite garbage. As the maxim goes, be flexible in what you accept, strict in what you hand out.
Any input that your program accepts and that produces a sensible output is within the set of inputs for which your program is designed to be valid, which is a subset of all possible inputs. Arbitrary input means any possible input beyond the set for which your program is valid, including those that cause your program to segfault or set your cat on fire.
Expecting to handle "arbitrary input data" would mean to try to accept data outside of that set. In the case of a CSV parser, it would mean parsing /dev/urandom as a CSV and producing sensible output ("sensible" defined by the users, which in your example is the dog "Sue").
What the maxim says about being flexible in what you accept is not trying to tell you that everything should be considered valid, but instead that it doesn't hurt to allow certain rules to be broken if it cannot do any harm. This could mean supporting trailing commas in JSON arrays, which while illegal is a harmless "extension".
However, it is not trying to tell you that you should accept garbage and "silently recover", trying to do something with invalid input. Doing so does more harm than good. Be flexible, not stupid.
Not surprisingly, most lists of "falsehoods" are like that.
I believe it is naturally impossible or intractable to consider every item of those lists. Rather, they are convenient checklists for consideration when you are delving into the realm of those complicated subjects. Do you reject, say, 30% of those items? That's fine. Just make sure that you are explicitly making tradeoffs and will consult the lists again when your initial tradeoffs wear off.
The original statement was about coverage of the breadth of the specification: if you're consuming data you need to accept everything the specification allows, but if you're generating data you should be producing as small a subset of the specification's productions as you can.
This does not work when the format is mostly unspecified and the vast majority of what you'll find in the wild is a tire fire.
A colleague got stuck trying to parse a 1-million-line CSV for a while, then he found out that the last line was a PHPMyAdmin timeout error.
Maybe they should add "All CSVs contain no error messages" to the list :)
Lol "reject" the file. The author in this story could be me and the reason we both have so much experience working with that long list of fucked up CSV files is because we don't have the power to 'reject' anything. Our client gives us the data and tells us to fuck off until we solve their problem. Professional services != consumer services in servicer/servicee power disparity.
Yeah, almost by definition if you’re dealing with crappy CSV files it’s because you’re part of a pipeline crossing organization boundaries with manual processes and you rarely have the authority or means to reject input.
Yeah, that's the most commonly referenced standard.
Approaching it from a database perspective (e.g. sqlitebrowser.org), where we need to both import and export CSVs... it can be tricky.
For example, if an imported CSV file has two commas ",," that generally indicates an empty field. But, does that mean an empty string? Or does it mean we should use the default value for the field (as per database spec)?
That specific case can be handled by a user preference on the import/export dialog ("What to do with empty fields?").
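In code that preference boils down to one small branch per cell. A sketch with invented names:

```python
def normalize_cell(value: str, empty_as_null: bool = True):
    # ",," parses as the empty string. The import preference decides whether
    # that should be stored as NULL (None here) or kept as ''; to get the
    # column's DEFAULT instead, you'd omit the column from the INSERT entirely.
    if value == "":
        return None if empty_as_null else ""
    return value
```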
Others, less so. For example, database fields can contain data types (e.g. images, binary blobs) that have no representation defined in the RFC. Literally any decision about how to encode them is going to be non-standard. Ugh.
There really needs to be a follow up RFC that specifies useful, practical things to tighten up the gaps.
As the quote states, none of that would be JSON. Thus, if you have a document that contains that, it is no longer JSON. A parser may accept things that are not JSON, but that's rather irrelevant.
JSON does not have comments, trailing commas, single quotes, etc. etc. etc.
Where is the minefield? In accepting things that are not JSON, such as trailing commas and comments? That's silly.
Everything blue and yellow in that matrix is entirely OK and should be green ("undefined" means "anything goes", and the spec specifically allows accepting more than JSON, so "should have failed" is bogus).
That leaves a few parser bugs in the form of orange and red, but they're corner cases (like surrogate pairs or 10000 nested arrays) and fixable.
Something that's being lost here is that this article is in the same vein as the seminal "Falsehoods programmers believe about time" [0], which contains quite a few statements that describe user error that developers still have to deal with (e.g. "The system clock will always be set to the correct local time.")
> I honestly think this genre is horrible and counterproductive, even though the writer's intentions are good. It gives no examples, no explanations, no guidelines for proper implementations - just a list of condescending gotchas, showing off the superior intellect and perception of the author.
Like, OK, I shouldn't use sep. Good to know. What should I use instead? Why tell people that \ isn't the escape delimiter without explaining the way the quoting system works?
And frankly, the stuff about Excel is divorced from reality. More than 90% of the time, the reason you're making a CSV is because somebody wants to look at the data in Excel and you don't want to deal with xlsx. If your concern is something else CSV is probably the wrong choice. Thus, for most programmers, Excel is the reference implementation of CSV.
I like this genre because many developers (and non-developers who are imagining features) tend to trivialize things that are inherently complex. “Just add N seconds and you’ll have the target date!” “Just parse the user’s address and return the street name” “Just add a radio button for the user’s gender!” Resulting in underestimates and blown schedules, database designs that are not future proof, and at the end of the day demonstrably wrong software.
Sure, good examples of how to handle each edge case would be ideal, but merely pointing out all the bad assumptions to someone is a valuable first step.
I think there's more going on than condescending gotchas here. I see the whole genre as fairly tongue-in-cheek enumerations of pitfalls that, all too often, are baked into projects as unexamined assumptions.
When we read them, we have the opportunity to check our own assumptions -- about the subject at hand (today, CSVs) and also (hopefully!) about other subjects we may encounter later.
I just don't know what thought process I'm supposed to go through when someone tells me writing `sep=,` in a CSV is horrible but doesn't offer any alternative way of fixing the problem that solves.
I think a `.csv` is very useful for any kind of one- or two-dimensional numerical data, and even a lot of non-numerical data. It's very simple and human-readable. E.g. if you have an API, a `.csv` backend in addition to others (e.g. `.json` or `.mat` or similar) makes it much easier for a human to inspect the data.
I've had to deal with CSV data and Excel a lot in my career, and I learned one trick (sorry) a few years back that has made my life so much better:
Here's a scenario I bet many people have faced: Export an Excel sheet to CSV that has high-ASCII characters in it like accents. The export gets mangled when you load it into code or a text editor after. You eventually just upload it to Google Sheets and export it from there instead. It works but it's a pain.
Instead of exporting it as a CSV from Excel, export it as a UTF-16 TXT, which is basically a TSV file.
That will correctly preserve all the character encoding.
I can't promise this will work 100% of the time but it has resolved many many encoding issues going to/from Excel.
As of this writing, there exists a single usable CSV format that Microsoft Excel can both read and write safely across platforms. Tab delimited UTF-16LE with leading Byte Order Mark.
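For the record, consuming that kind of export from Python is straightforward. A sketch, with the file name as a placeholder:

```python
import csv

# Excel's "Unicode Text" export is UTF-16LE with a BOM, tab-delimited.
# Python's utf-16 codec consumes the BOM and picks the right byte order.
with open("export.txt", encoding="utf-16", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        print(row)
```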
I work with CSVs in R, but I don't work with Excel that much. Thanks for the useful tip.
My one trick there is to just not ask for CSV, and import Excel files directly. Most languages have libraries to read basic (non-formula) excel files, they work fine, and provide richer data models than CSV (though somewhat risky as you have to deal with formattings and finding e.g. numbers where you expected strings and the other way around).
This is unacceptable in large business settings today. A list of assets alone can be tens of millions of lines. A list of transactions? Oh boy time to break out Rust if you want your database to be finished this month.
> This is unacceptable in large business settings today. A list of assets alone can be tens of millions of lines. A list of transactions?
The original comment specifically mentioned Excel. Excel cannot work on "tens of millions of lines", and your objection is thus irrelevant.
If your data does not come from Excel, there is obviously no reason to make it go through Excel. There is also no reason to make it go through CSV; make it go through an actual serialisation format which is not mostly unspecified garbage, since you're programmatically serialising it for interchange and hopefully not expecting human editing of the intermediate megabytes of data dump.
Do those things start out as an Excel file? Because that was the suggestion, to avoid Excel->CSV headaches by importing Excel files directly. I don't think they meant to always use Excel for interchange.
Generally speaking though, the areas where the sorts of issues that the article talks about come up is from user-specified CSV mostly loaded from Excel. When you're dealing with enormous amounts of data coming from another (automated) system, you can often make assumptions that wouldn't be practical with user-supplied CSV, like assuming a particular encoding, delimiter, etc.
I've spent a lot of time thinking about a better format that is close enough to CSV to be practical, but has more precisely defined semantics and structure, also to support better usability (decreasing the need for manual integrity checks after parsing). I wanted at least a defined encoding and defined schema (table definitions with fixed number of typed columns). Optionally Unique Keys and Foreign Keys, but that quickly leads to a situation where there are more possible features with diminishing returns to consider.
I ended up with this [1] and a python implementation [2], and it turned out not too bad. I've also done a more pragmatic C implementation (couple hundred LOC) in a toy project [3] (wsl.c and *.wsl files), and it turned out quite usable.
I think what prevents adoption of such a thing is that it's very hard to standardize on primitive types and integrity features.
You could come up with the most wonderful format in the world, but unless it's transparently readable and writable by Excel then it will never replace CSV. How does Excel handle your whitespace-separated files?
Many of them would actually be handled just fine as CSV files (as fine as any random actual CSV file), but I use a different string syntax ([like this] instead of "like this"), so that would pose some problems.
It would not be hard to write an Excel plugin to read my format, while restricting to a fixed set of basic datatypes, and getting some usability gains there. The problem I see is that full-featured interoperability is hard to achieve because the set of supported features is hard to agree on.
I went the same route but ended up adopting feather from the makers of Python's Pandas and R's tidyverse, Wes McKinney and Hadley Wickham. https://blog.rstudio.com/2016/03/29/feather/
> SUMMARY: Feather's good performance is a side effect of its design, but the primary goal of the project is to have a common memory layout (Apache Arrow) and metadata (type information) for use in multiple programming languages. http://wesmckinney.com/blog/feather-its-the-metadata/
The approach is data frames on disk storage.
> data frames are lists of named, equal-length columns, which can be numeric, boolean, date-and-time, categorical (factors), or string. Every column can have missing values.
I don't think you can fix CSV. You could define a reasonable-by-today's-standards subset - I'd say, UTF-8, 0x20 delimiters and per-lexem choice of C-style identifiers or string literals.
But I need more structure. My format supports an arbitrary number of tables in a single file, some automated integrity checking, and direct conversion of text from/to runtime structures (e.g. C structs, possibly with automatically allocated backing buffers for variable-width datatypes) with only little boilerplate. So yes, that's a different format.
You can use any syntax for the lexems if you are willing to write the code. "[]" is not really "quoting". It's just the lexical syntax for the string datatype I use, and for consistency it's required even for strings that have no spaces or special characters. And lines starting with "%" hold enforced (even required) schema structure. Yes, I decided to require a single " " as delimiter, and that is not so easy to change with the current approach.
Many of the complaints stem from the fact that many CSV writers do not follow any spec and produce garbage data. Kind of like when the industry decided that reading badly formed HTML is beneficial.
For the sake of sanity I'd recommend splitting the logic of reading csv into two parts. One that takes garbage and produces a correctly formatted file (e.g. properly quoted values, equal number of columns, single encoding) and the second part that actually does CSV parsing adhering to strict rules.
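A rough sketch of that split in Python; the cleanup rules here are placeholders, since the real ones depend on whatever the upstream source mangles:

```python
import csv
import io

def normalize(raw: bytes) -> str:
    """Stage 1: turn whatever we were handed into well-formed CSV text."""
    text = raw.decode("utf-8", errors="replace")                   # force a single encoding
    lines = [line for line in text.splitlines() if line.strip()]   # drop blank/garbage lines
    return "\n".join(lines)

def parse(clean_text: str) -> list:
    """Stage 2: strict parsing only; no repairs happen here."""
    rows = list(csv.reader(io.StringIO(clean_text)))
    if rows and any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("ragged rows - the normalization stage should have fixed this")
    return rows

rows = parse(normalize(open("input.csv", "rb").read()))
```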
That's actually a very interesting idea/pattern for dealing with problems.
However, I'm not sure what it means practically speaking on a real project - most of the time, you're presenting the user with some kind of upload screen, then do processing behind the scenes, and display results. You can (and probably should) structure things such that you're internally converting the "bad" csv file to a "good" csv file, and parsing it normally afterwards, but this is all behind the scenes anyway.
Maybe if there was a professional organization of software engineers, it could settle these disputes and decide who is to blame for the wrong implementation.
The content of this article is great. But the title represents its own bad assumption and falsehood. Just because a CSV doesn't conform to the spec doesn't mean the person who produced the CSV file believes their CSV file to be correct or misunderstands that CSV is complicated. For most devs I've dealt with, the reason has been conscious and admitted laziness, trying to get something done faster knowing it's not 100% correct. Conforming to the spec is harder and more confusing than splitting on commas or tabs in a script, while splitting on commas or tabs works 80% of the time.
It's lame and lazy to write & read CSV with one-off code, but we already knew that, CSV is just deceptively simple looking so the temptation is strong. But what we most need isn't the list of things we're doing wrong. We already know we're doing it wrong. What we need to do is to use a library that does it right. What we need to have is a list of libraries and tools that are easy to use and fix all the items in the author's list. It might also be useful to suggest simple things a hacker can do in a couple lines of code in a one-off CSV reader/writer that covers as many cases as possible.
Garbage data has been with us since the beginning, and will always be with us. Whether it's bad CSV, HTML, XML or JSON, half the data integrator's job will be 'sanitizing' the data. Which means dealing with exceptions in an empirical manner.
In our first effort at absorbing a Wall Street financial data satellite stream (stock trades, bond rates and all the rest), we found that every day there would be new bad data. Trades out of order; missing decimal points; badly spelled security names; clumsy attempts to 'edit in' trades after close. The world's financial data should probably have been better managed, but it was all hand-generated back then (90's) and mostly viewed as a stream by brokers on a terminal screen, so the human filter could understand most of the BS. But a database, not so much.
As someone mentioned in the comments on the article, I think it is very common for people to use LibreOffice Calc to work with CSV because Excel does not handle UTF-8 all that well. In LibreOffice you can open an Excel workbook and export a CSV in UTF-8 and ask it to double-quote all of the fields too (which is a very good thing to do to CSV files).
While 33 and 34 are true (33. Excel is a good tool for working with CSVs 34. Excel is an OK tool for working with CSVs), there is one reality that makes them irrelevant: when working with data that is in any way or form touched manually during its lifespan (that includes 'looking at it'), dealing with Excel is inevitable.
I never understood why Excel doesn't work easily with Unicode CSV files. Well I know the workaround, but I still don't understand why they don't improve the handling of CSV files.
Author here, just waking up. I object to the title change in the strongest terms. It’s simply not “Problems with CSVs”. That’s not at all what the post is.
The list isn’t problems, and if you read it as a list of problems it’s nonsense.
Also why on earth did the fact that it is 14 months old need to be noted in the title - has anything changed in the last 14 months? Not that I am aware of.
> All records contain a single consistent encoding
> All fields contain a single consistent encoding
What? How are you expected to handle these cases?
Generally you have to guess the encoding of a plain text file like a CSV. I’m fairly sure the common case is the entire file will be a consistent encoding. If you were to guess per-record or per-field I suspect it’s more likely you’d guess some records/fields wrong than encountering a file with varying encoding.
I’d be interested to see some real world examples of CSVs with varying encoding and how existing software handles that.
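Guessing a whole-file encoding is usually done with something like the third-party chardet library, which is exactly that: a guess with a confidence score, not a guarantee. A sketch, with the file name invented:

```python
import chardet  # third-party: pip install chardet

raw = open("mystery.csv", "rb").read()
guess = chardet.detect(raw)   # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}
text = raw.decode(guess["encoding"] or "utf-8", errors="replace")
```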
A lot of broken encoding is the result of various copy pasting and OCR issues during data collection.
If there is one thing I could change about all spreadsheet software, it would be to paste without formatting by default. That would make everyone's life so much easier...
Dealing with the backscatter from CSV misunderstandings can be fairly challenging - for a lot of us, the customer experience is improved by being as accommodating as possible instead of correct. We at Intercom released a Ruby CSV parser that "is a ridiculously tolerant and liberal parser which aims to yield as much usable data as possible out of such real-world CSVs".
The biggest problem with CSV is that it looks easier than it actually is, so people go ahead and write their own CSV "printer"/parser, which is usually just a ",".join([1,2,3]) or "a,b,c".split(",").
In reality CSV has complexity similar to JSON. You have to consider the possibility of quoting, escaping, encoding, delimiters, etc. You should always use a library to generate and parse CSV to avoid these issues.
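The library version is barely longer than the naive split. A minimal sketch with Python's standard csv module, with the file name as a placeholder:

```python
import csv

row = ["I came, I saw, I conquered", "Julius Caesar", "47BC"]

# Writing: the writer quotes the first field because it contains commas.
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow(row)

# Reading: the reader undoes the quoting, so the commas survive the round trip.
with open("quotes.csv", newline="", encoding="utf-8") as f:
    assert next(csv.reader(f)) == row
```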
CSV is a great example of what happens when you do the opposite of "be strict in what you emit, be liberal in what you accept." It doesn't matter though, because it's just about good enough, and you almost never want a _general_ CSV solution, rather you need one specific to a particular problem or workflow (so you can handle your situational idiosyncrasies.)
It really depends on the purpose. For scientific computing, for instance, CSV is used as a "standard" for quick tables of numbers. Sufficiently "normal" files can be read by almost all scientific tools (speaking of libraries such as numpy, systems such as Mathematica, Maple, or ecosystems such as R, basically even C++ and Fortran allow this task to be implemented in a couple of lines).
However, also in this context CSV has major drawbacks, for instance not defining column headers, comment lines, number formatting. A proper drop-in replacement in my subject is HDF5 (https://www.hdfgroup.org/) which is mostly used for being a binary representation of n-dimensional tables with couplings to major programming systems.
However, I have never heard of HDF5 outside of science; therefore I mention it here as an example.
I usually reach for SQLite if CSV is creaking under the strain and it's a part of the pipe I control on both ends. I've heard good things about HDF5 for big flat tables, but only when I was in grad school. I haven't tried it.
If it's something you're streaming, using a binary messaging format is always faster/better. I tend to use Kx's format since a lot of my messages go to and from their systems, but Erlang's binary term format is good too -- even if you're not using Erlang.
In my experience the reason for a JSON object per line is because a tool can then split the entire file by newlines and have a list of objects to start parsing/processing in parallel, which avoids having to parse the entire file up front to get usable data, and lets the tool start processing things while some data is still being parsed.
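A sketch of that pattern in Python (the file name and chunk size are arbitrary):

```python
import json
from concurrent.futures import ProcessPoolExecutor

def parse_line(line: str) -> dict:
    return json.loads(line)  # each line is a complete JSON document

if __name__ == "__main__":
    with open("events.jsonl", encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]

    # Because records never span lines, the work can be handed out immediately,
    # with no need to parse the whole file before processing starts.
    with ProcessPoolExecutor() as pool:
        records = list(pool.map(parse_line, lines, chunksize=1000))
```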
My favorite is pipe-delimited, though that would depend on your data of course. Pipes are never used with the data I deal with, and would be a "typo" if they were. I do check for them on import and export, replacing them with %pipe; on any free-form text fields.
My data sets aren't that huge, but sometimes do get up to mid 10Ms though frequently is only in the 100Ks.
I know it's not a perfect solution but may work for you too.
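That convention is easy to keep symmetric on both ends. A sketch of the escape/unescape pair; the names are mine:

```python
def escape(field: str) -> str:
    # Free-form text should never contain a literal pipe; if it does,
    # stash it as an entity so the row still splits cleanly.
    return field.replace("|", "%pipe;")

def unescape(field: str) -> str:
    return field.replace("%pipe;", "|")

row = ["42", "note with a | in it"]
line = "|".join(escape(f) for f in row)
assert [unescape(f) for f in line.split("|")] == row
```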
Depends on the context. If it's data-exchange and the data is going to be edited by humans, just take excel files directly as input. That'll save you tons of trouble.
Excel is a horrible tool, in general. I remember that I exported a very long list of numbers to a .CSV. Excel then formatted the numbers in scientific notation (e.g. 53564566934 became something like 5.3565E+10).
If you then copy and paste, the numbers are actually converted to the rounded value (53565000000), so the trailing digits are lost. Result: I had to do a lot of work again. Which is partly my fault, and partly a horrific design.
Just saying, great timing. Because I just today had this discussion covering everything on the list. Different character encodings in file, fields which contain line feeds, fields which contain field delimiters without escaping or using quotes and so on. But that's very common. People use whatever source for data, copy paste it into Excel and then think it's good CSV after that. I usually handle CSV as required, every file can be and usually is different. You'll write I/O code according case specific requirements. In many cases, this means some manual fixing and custom parser. -> Job done.
Edit: Honestly, I don't even remember when I would have seen RFC4180 compliant file. That's just the truth out there.
* CSV files will always have a trailing EOL marker
* CSV files will never have a trailing EOL marker
* Any file with a name ending .csv is a CSV file (or something close to)
> Excel can losslessly save CSVs it opens
A particular problem we've had many times with some clients is Excel messing around with date/time formats: files with dates formatted YYYY-MM-DD being changed to American style, columns containing both date and time having one or the other chopped off, dates being converted to numeric representations, ...
I've occasionally wondered why we (programmers) don't take greater advantage of ASCII control codes[1] for structuring data -- many of the pain points of CSV/TSV could be addressed by using US (unit separator) or RS (record separator) bytes instead. You'd still have to handle binary values, but that's a problem shared with CSV/TSV.
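For concreteness, the whole "format" fits in a few lines. A sketch:

```python
US, RS = "\x1f", "\x1e"   # ASCII unit separator, record separator

records = [["I came, I saw, I conquered", "Julius Caesar"],
           ["Veni\nvidi\nvici", "also Caesar"]]

# Commas, tabs and even newlines in the data need no quoting or escaping,
# because the separators are control bytes that never appear in normal text.
encoded = RS.join(US.join(fields) for fields in records)
decoded = [chunk.split(US) for chunk in encoded.split(RS)]
assert decoded == records
```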
That's valid, but it's also tooling/expectation issue -- most users are probably interacting with CSV/TSV through a spreadsheet interface, and don't need to know anything about the internal representation of records.
The list seems like a decent set of thoughts if you are going to work extensively with CSVs, but the title is clickbait and honestly incorrect. I haven't worked much with CSVs, but I'd make very few actual assumptions about them, although in reality we would all ignore 99% of his list because it wouldn't be practical to deal with any of it.
He also almost seems to be blaming the receiving programmer for not being able to deal with a given CSV in an insane format.
I like these "falsehoods programmers believe" articles but they also indicate that it's really hard to find ground truth in this business. There is no agreed upon standard for what REST really means or object orientation or how to use XML or CSV so we just keep stumbling along implementing what we think it means and thinking that people who do it differently are clueless.
I guess that's the nature of technology that's moving so fast.
My boss at my last company used to say "I am morally opposed to CSV".
We did a lot of stuff with data in Google Sheets, and when we exported a sheet to disk so we could parse it in code, he would insist on using TSV instead of CSV.
The main reason he insisted on this was because you don't really have to worry about escapes and quoting with tab separators. And we were an NLP company, so the data would very often include complete sentences with commas.
FYI: I've started to collect ideas and initiatives on how to evolve the world's most popular format in the Awesome Comma-Separated Values (CSV) page - What's Next? - Frequently Asked Questions (F.A.Q.s). Read more @ https://github.com/csvalues/awesome-csv
> Excel will load valid CSV files correctly.
If a field contains more than 32KB of text, Excel will continue outputting that text into the next cell, and it won't be quoted, so the CSV file appears to be flooded with complete garbage from that point on.
On using CSV as an input format, but wanting a “better” format for storage/manipulation, [csvw][1] is an interesting choice, and can be converted to any RDF encoding.
Given this massive list of falsehoods I have to ask: what can programmers believe about CSVs?
How do you go about writing a good CSV parser given that you can't assume anything about your input data? Are there examples of safe, robust CSV parsers that deal with all these falsehoods?
Normally, I kinda think these sorts of articles are a bit defeatist because reality isn't as bad as they suggest. When you set out to put an address widget on your web page, you are not necessarily bound to create something that can correctly ship a package to some random location in Wakanda. Usually you're just shooting for a single country, which has a de facto accepted layout which even if it is technically "wrong" the local delivery services already deal with. So there isn't a great need to worry about 80% of the "falsehoods". Software gets pretty damned expensive if every single address field has to be free of all errors of that sort, and every name field has to handle all 7 billion people in the world with equal fluidity, etc.
But this is one case where I'd say the style conveys an accurate impression. The problem is basically that there is no such thing as "CSV". You can't build a parser that handles all the issues because a lot of the mistakes made during generation lead to fundamentally ambiguous file formats. In the simplest case, how do you parse:
Quote, Speaker, Time
I came, I saw, I conquered, Julius Caesar, 47BC
A generic library can't really help with that. You're going to have to do some heuristic munging. And bear in mind this is just a simple example; when you've got megabytes of vaguely CSV-inspired ASCII octets it gets bad fast. You'll yearn for CSV files generated by someone who thinks
"\n".join(",".join(str(y) for y in x) for x in [[1,2],[3,4]])
is all you need to write a CSV output function; the real world gets much more perverse than that.
As file sizes increase and the errors increase, while you get an increase in the number of issues that a library may be able to help with, you also get an increase in the number of errors in which the resulting CSV is simply fundamentally ambiguous.
The way I've dealt with it in my parser (which is influenced by how Python's standard library csv parser deals with it) is to simply not have any CSV parse errors at all. That is, you always return a parse of the data, even if it's wrong. As long as it's consistently wrong, it at least gives the user of your CSV parser a chance to read the data somehow.
Normally, this isn't something I'd do, since it feels icky. But it works well in practice, precisely because CSV data can be so messy. It's often much easier to hack your way around the CSV data with a sufficiently robust parser than to demand that the source of the CSV data clean up their act.
The CSV parser in Python actually makes at least one assumption that should hold true but doesn't: that the delimiter is always a single character.
I've had to deal with CSV files containing financial transactions that used "," as a decimal separator. Fine, that's perfectly normal, and the same standard used here in Denmark. But what to do with the delimiter then... it can't be a comma, so rather than making it a ";" or something similar, they made the delimiter ", ", a comma followed by a space.
That doesn't actually work for most CSV parsers; they just assume that the space is part of the data, and the value gets split into kroner/øre (dollars/cents). The result is that you gain an additional field and need to trim all data... Oh, and the headers don't match.
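If you're stuck with a file like that, one escape hatch is pandas, whose python engine accepts a multi-character separator (treated as a regex). A sketch, assuming no quoted fields complicate things further; the file and column names are invented:

```python
import pandas as pd

# engine="python" is required for separators longer than one character;
# the separator is treated as a regular expression, so ", " matches literally.
df = pd.read_csv("transactions.csv", sep=", ", engine="python", dtype=str)

# Decimal commas then still need converting by hand, e.g. "1.234,56" -> 1234.56
df["amount"] = (df["amount"]
                .str.replace(".", "", regex=False)   # drop thousands separators
                .str.replace(",", ".", regex=False)  # comma decimal -> dot decimal
                .astype(float))
```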
Sure. Given a choice between a CSV parser that supported multiple character delimiters and one that didn't, sure you'd choose the one with better support for your use case.
But given the choice between a CSV parser that gives you an invalid parse and a CSV parser that completely falls over when thrown a curveball (like a multi character delimiter), which would you pick? ;-)
1. They're an OK format if you avoid things like multiline strings, escape strings inside escape strings, data that needs complex character encoding, things that might get parsed as dates or times, numbers that could suffer from rounding problems, lines with different numbers of entries and so on.
There's actually quite a lot of stuff, in real business environments, that meets those criteria. Lots of stuff doesn't, obviously, but lots of stuff does.
2. If in doubt, most users will be happy enough if you do what Excel does.
3. A lot of this stuff is in business environments, where the end user doesn't make the purchasing decision, so if the data has to be manually formatted just so, users will often suck it up (and/or won't pay for improvements)
At some point I was led to the conclusion that CSV stands for Character-Separated Values. So as long as you pick one character for your delimiter and stick to it, it's a CSV.
This apparently isn't actually true, but the world makes more sense if you pretend it is.
That being said, the vast majority of CSV files I've dealt with are very simple and are plain ASCII separated by a comma (or whatever). But that doesn't mean I don't know there could be more to a CSV file.
Many points can be summarized as “CSV is CSV”. What about instead writing a summary of how to write proper CSV that is usable in all, or at least most, cases? Like best encoding, proper separation, headers, delimiters and escaping all the special chars.
Big fan of CSV. Let's make it even better! I've started the CSV v1.1 initiative. Let's evolve CSV and let's add comments, blank lines, double quotes, and more to make it easier to read and write. Learn more @ https://csvalues.github.io
Why did CSV ever happen? Using printable characters as separators is really silly. Using multiple characters (like CRLF) is silly. ASCII actually contains four control characters for this purpose: file, group, record and unit separator. I'm sure a lot of problems would be solved if people just used these.
File formats like CSV are frequently typed out by hand in an editor. Using a non-printable character would make that harder than necessary in most situations.
Exactly. The "proper" record separator (CRLF) is not available on the keyboard on UNIX systems (well, half of it is). If people are typing it out by hand (which I doubt, actually) there are many keys on standard keyboard which could be used for the various separators (like the F keys).
I don't know if CRLF is "the proper record separator". If any, that's probably RS (ASCII 30). Btw. in the Unix terminal you can type any of ASCII 1-26 by pressing Ctrl+[a-z], and for the remaining 0 and 27-31 there are other combinations. It's just not as convenient as pressing a single key.
For almost all uses, CSV is the wrong format for the job.
Protobufs are usually almost a drop-in replacement, and when you have them, you no longer have to worry about escaping, newline characters, adding columns, data types, nulls, or any of the other painful corner cases of CSVs.
> Protobufs are usually almost a drop in replacement [for csv]
I'm sorry, what? In what way is it a drop-in replacement? CSVs are editable in a text editor and natively imported and exported by a wide variety of tools.
How many levels of indirection from end users do you need to be, in order to consider installing an arcane linux tool and then launching `protoc --decode_raw < message.bin` from the command line an acceptable drop-in replacement for CSVs?
I'll take the generous interpretation and assume OP meant some very specific developer use-cases :)
The reason CSV files are used everywhere is not because of lack of alternatives, but because business people understand what CSV files are and feel safe using them, plus you're not relying on any technology or more complex protocol, it's really just the "natural form" to handle table data.
If you can successfully convince the business to first always use some command line tool to convert data before they can look at it in Excel, then you're in a very lucky position.
In fact I think it should be the top item of a list of "falsehood users believe about CSV files"!
I've had people submit an Excel file to a service expecting a CSV, and when being told to convert to CSV instead do so by just renaming the file...
> feel safe using them
This much is true, for certain definitions of "feel safe".
> plus you're not relying on any technology or more complex protocol
This is the key point: they usually have and regularly use software that can handle CSV files (however badly), usually Excel, so don't need to learn a new tool from scratch or, in controlled corporate environments, spend six months filing paperwork and chasing requests to get IT to install an appropriate version of a relevant tool to handle the format.
That's crazy, protobufs are a massive pain. I was doing a CTF recently using protobuf and it was incredibly annoying decoding and reversing it. If you lose the schema, you're shit out of luck.