Most of those lists describe objectively wrong assumptions programmers have about some real-world phenomenon. But in this case I'd argue that a lot of these points simply describe incorrectly formatted CSV files. Just because someone handed you a file with the extension .csv does not mean it's a CSV file proper, and it certainly does not mean that you have to guess at what it intends to encode without assuming any of the things on the list.
For example, "All CSVs contains a single consistent encoding". If this is not the case I'd (rightfully) reject the file as being a proper CSV file.
The user does not care that the CSV is not proper. They just want to open it, and if it doesn't work, they will blame your program, not the source of the CSV.
And you only control the former. I have had to deal with a CSV export from a web service that was "not proper". I notified their support, but that problem was never fixed.
Luckily, the user in that case was just me. But if I were making software for someone else, blaming the right source of the problem would not have taken me anywhere.
In some cases data users wield enormous power over data submitters (e.g., many for-record safety analyses or real-time systems). If any part of the data collection process is suspect the sample is likely to get tossed. And for spec compliance, if a data set does not pass perfectly through a validator the sender will likely get some sort of a "shame on you" nastygram.
Same for invalid CSV.
There is nothing wrong with this. You can't accept all the garbage from the user.
Yes, it does work in Excel but it's because of 40 years of various edge case fixes that you just cannot replicate in their entirety in your app.
What any other spec actually says is irrelevant. For most non-technical users, anything Excel outputs as a CSV is a CSV file (and anything you output that doesn't work with Excel is not)
So given it's impossible to compete totally with Excel, how do we decide which of these edge cases are worth our time to implement?
It's not as simple as invalid input or a bad copy/paste. The user likely cannot correct the CSV.
Wait, what? Phone numbers could always include letters. Many businesses get phone numbers that include or consist of words relevant to their business, to make them memorable. And I've known people whose personal number happens to contain a word or words, so they give the letter version out because it's easy to remember. If a phone number field doesn't include simple logic to map letters into the proper numbers, that's just lazy programming.
And what next, do you accept exotic Unicode digits? Letters that you case-fold to something that can map to a number?
Again, you draw the line somewhere. There is no right place for the line, it depends on your business, but there is always a line for what you accept as user input.
Being offended won't help you on HN, and neither will calling people names.
There is a standard, universal mapping of letters onto numbers. This allows companies to advertise their phone number as something memorable like 1-800-CASH-4-AU and everyone knows how to dial it. People don't even need to remember the mapping of letters, because they're printed on every keypad.
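As a rough illustration, here's that keypad mapping as a minimal Python sketch (the letter groups are the ones printed on standard keypads; the function name is made up):

    # Letter groups as printed on a standard telephone keypad.
    KEYPAD = {"ABC": "2", "DEF": "3", "GHI": "4", "JKL": "5",
              "MNO": "6", "PQRS": "7", "TUV": "8", "WXYZ": "9"}
    LETTER_TO_DIGIT = {ch: digit for letters, digit in KEYPAD.items() for ch in letters}

    def normalize_phone(raw):
        # Map letters to their keypad digits, leave digits and separators alone.
        return "".join(LETTER_TO_DIGIT.get(ch.upper(), ch) for ch in raw)

    print(normalize_phone("1-800-CASH-4-AU"))  # -> 1-800-2274-4-28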
A ZIP code in the US consists of 5 digits that form a unique number for a geographical area, plus 4 more digits that represent a sub-area within the 5-digit ZIP. But there are no instances where there can be a letter. That's a hard fact. If you put 4308A-65XK as the ZIP, the post office or delivery service will not know how to deliver the mail without trying to cross-reference the address with possible valid ZIP codes.
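A hedged sketch of that rule as a validator (5 digits, optionally a hyphen and 4 more; the names are just for illustration):

    import re

    ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")  # 12345 or 12345-6789

    def is_valid_us_zip(value):
        return bool(ZIP_RE.match(value.strip()))

    print(is_valid_us_zip("90210-1234"))  # True
    print(is_valid_us_zip("4308A-65XK"))  # False -- letters are never valid here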
Invalid CSV is murky, because CSV is a serialization format. That is, invalid CSV may not play well with a program that accepts well-formed CSV, but a human may still be able to parse it with no problem. Returning "invalid ZIP Code" makes sense. The client would understand that letters do not belong in a ZIP Code.
If you've never had to deal with a nightmare client of that sort, hats off to you.
Why not explain to the client that you will need to put in manual or custom work to clean up the CSV or batch of CSVs and charge for the extra time writing a script or cleanup?
Literally the only time this may not apply is if you are writing a general purpose program.
Well, it all depends on the money, doesn't it?
Doing so doesn't mean that it is CSV, it just means you can parse it.
Try standard parsing; if that fails, try exotic features; if that fails, try common hacks around common mistakes; if that still fails, just reject it as invalid data.
We all do that all the time for everything. Form input, video upload, user click in a MMO, etc.
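As a sketch of that cascade (all three parser functions here are hypothetical placeholders, named only to show the shape of the fallback chain):

    def parse_csv_leniently(path):
        attempts = [
            parse_strict_rfc4180,        # standard parsing
            parse_with_sniffed_dialect,  # legal-but-exotic variants (other delimiters, quoting)
            parse_with_common_fixups,    # hacks around common mistakes (BOMs, ragged rows)
        ]
        errors = []
        for attempt in attempts:
            try:
                return attempt(path)
            except ValueError as exc:
                errors.append(f"{attempt.__name__}: {exc}")
        raise ValueError("rejected as invalid data:\n" + "\n".join(errors))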
"CSV" only specify the metacharacters (at best), unless it's explicitly stated that the CSV is a single consistent encoding it's no more improper than inconsistently encoded Linux paths. It's stupid and a pain in the ass to be sure, but it's not improper.
And when you don't control the entire pipeline, you will eventually hit this issue.
In fact the first falsehood programmers believe about CSVs is that it's a good serialisation, interchange (between separate systems), or human-editing format. While it's routinely used for all of those, it's absolute garbage at all of them.
I think it's generally agreed that a CSV file is a plain text file. And I'm not aware of any school of thought that says that valid text files may have multiple encodings. IMO, it's ridiculous to expect any software to detect and accommodate that.
Mixed encodings would make sense if the columns in a single CSV are coming from multiple data sources. Any data processing software that expects to work in the real world is expected to be able to handle inputs in different formats. Depending on the relationship between the parts of the processing pipeline, trying to reconcile formatting issues before/while creating the CSV may be an objectively incorrect design decision.
Really? So you might write a floating point value to a "CSV" file by writing the IEEE 754 bit pattern for that value? And you would call that a valid CSV file, and would expect generic CSV parsers to properly decode that value?
That's a new one on me. I have always been under the impression that CSV files are plain text. Perhaps you should edit the Wikipedia page so it doesn't continue to give unsuspecting people the wrong information:
> In computing, a comma-separated values (CSV) file is a delimited text file that uses a comma to separate values (many implementations of CSV import/export tools allow other separators to be used). It stores tabular data (numbers and text) in plain text.
I would argue that most programmers realize CSV sucks and would rather use something structured and sane. In my experience CSV support almost always comes about because someone from another business unit (e.g., marketing) needs self-service import facilities. Those folks live in Excel and CSV becomes the de facto import format.
Which is an error, and you'd save yourself much pain if you just worked directly with Excel files. Most languages have libraries allowing that.
That's literally the only one on there I felt that way about. All the other ones can and should be accounted for in your application if it accepts arbitrary CSVs.
Encoding is a special case because it's impossible to do anything other than guess at a plaintext document's encoding. If you're getting decoding errors, the code should try a few different encodings before rejecting the document.
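A minimal sketch of that fallback, assuming you accept the first candidate encoding that decodes cleanly (the candidate list is just a guess at common cases):

    # Note: latin-1 decodes any byte sequence, so it acts as a catch-all last resort.
    CANDIDATE_ENCODINGS = ["utf-8-sig", "utf-16", "cp1252", "latin-1"]

    def read_text_guessing_encoding(path):
        raw = open(path, "rb").read()
        for encoding in CANDIDATE_ENCODINGS:
            try:
                return raw.decode(encoding), encoding
            except UnicodeDecodeError:
                continue
        raise ValueError("none of the candidate encodings decoded the file cleanly")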
This won't help if there's a mix of encodings in the document. If this is the case, nothing can help and you will have garbage going in.
There are fifteen points effectively addressing delimiters, which all sum up to "delimiters may be arbitrary byte sequences, and they should be ignored in quoted fields or if escaped".
There are nine points about encoding, three of which, if violated, just result in a flat-out invalid file.
There are four points that effectively claim that you must parse CSV files that adhere to no known formal or informal standard, which is just silly.
But if you can't, and business needs force you to allow anyone and their dog Sue to pass input into your software system, then you need to be able to handle garbage input effectively, up to and including just correcting it for them if that's possible.
The biases listed are good to check your code against if that's indeed the case.
A CSV parser will obviously only accept CSV. That list includes a few (5?) things that might end up being common bugs in hand-rolled CSV parsers. However, most of it is just bogus.
The only possible way to handle "garbage" input is sensible errors. Of course, if someone is presenting a valid CSV variant that you do not support, then it's not garbage, but just a variant that you may wish to implement support for.
A parser that tries to continue on error would be absolutely outrageous for a business to implement. How do you think the finance department would feel about their accounting being rendered invalid because of a parser trying to be smart?
This is completely wrong. The idea is to sift through the input and return appropriate error codes for garbage input, while remaining flexible enough to handle malformed input that's not quite garbage. As the maxim goes, be flexible in what you accept, strict in what you hand out.
Expecting to handle "arbitrary input data" would mean to try to accept data outside of that set. In the case of a CSV parser, it would mean parsing /dev/urandom as a CSV and producing sensible output ("sensible" defined by the users, which in your example is the dog "Sue").
What the maxim says about being flexible in what you accept is not trying to tell you that everything should be considered valid, but instead that it doesn't hurt to allow certain rules to be broken if it cannot do any harm. This could mean supporting trailing commas in JSON arrays, which while illegal is a harmless "extension".
However, it is not trying to tell you that you should accept garbage and "silently recover", trying to do something with invalid input. Doing so does more harm than good. Be flexible, not stupid.
I believe it is naturally impossible or intractable to consider every item of those lists. Rather, they are convenient checklists for consideration when you are delving into the realm of those complicated subjects. Do you reject, say, 30% of those items? That's fine. Just make sure that you are explicitly making tradeoffs and will consult the lists again when your initial tradeoffs wear off.
If you're producing CSV files, then you can be more strict.
This does not work when the format is mostly unspecified and the vast majority of what you'll find in the wild is a tire fire.
I mean if an API sends invalid JSON you can't parse it...
Saying that CSV is not is generous.
Approaching it from a database perspective (eg sqlitebrowser.org), where we need to both import and export CSVs... it can be tricky.
For example, if an imported CSV file has two commas ",," that generally indicates an empty field. But, does that mean an empty string? Or does it mean we should use the default value for the field (as per database spec)?
That specific case can be handled by a user preference on the import/export dialog ("What to do with empty fields?").
Others, less so. For example, database fields can contain data types (eg images, binary blobs) that have no representation defined in the RFC. Literally any decision about how to encode them is going to be non-standard. Ugh.
There really needs to be a follow up RFC that specifies useful, practical things to tighten up the gaps.
* Do you accept trailing commas?
* Do you accept multiline comments?
* Do you accept single line comments?
* Do you accept literal newlines in strings?
* Do you accept single quotes in place of double quotes?
* Do you accept integers over 8 bytes?
* Do you autodetect Unicode encoding?
* Do you accept numbers with leading zeros?
* Do you accept numbers with a leading plus sign?
* Do you accept NaN?
* Do you accept non-quoted object keys?
* Do you accept duplicate object keys?
In other words,
> A JSON parser MAY accept non-JSON forms or extensions.
JSON does not have comments, trailing commas, single quotes, etc. etc. etc.
(Couldn't help it.)
Everything blue and yellow in that matrix is entirely OK and should be green ("undefined" means "anything goes", and the spec specifically allows accepting more than JSON, so "should have failed" is bogus).
That leaves a few parser bugs in form of orange and red, but they're corner cases (like surrogate pairs or 10000 nested arrays) and fixable.
The equivalent page for CSV would not have an end.
You can't deal with every configuration of all the atoms in the universe; you need to draw a line.
If something just can't be reasonably parsed, it's fair to reject it as invalid CSV.
> I honestly think this genre is horrible and counterproductive, even though the writer's intentions are good. It gives no examples, no explanations, no guidelines for proper implementations - just a list of condescending gotchas, showing off the superior intellect and perception of the author.
Like, OK, I shouldn't use sep. Good to know. What should I use instead? Why tell people that \ isn't the escape delimiter without explaining the way the quoting system works?
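For the record, the RFC 4180 convention is that a field containing commas, quotes or newlines gets wrapped in double quotes, and embedded quotes are doubled rather than backslash-escaped. Python's stdlib csv module follows that convention, for example:

    import csv, io

    buf = io.StringIO()
    csv.writer(buf).writerow(['He said "I came, I saw"', 'Julius Caesar'])
    print(buf.getvalue())   # "He said ""I came, I saw""",Julius Caesar

    print(next(csv.reader(io.StringIO(buf.getvalue()))))
    # ['He said "I came, I saw"', 'Julius Caesar']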
And frankly, the stuff about Excel is divorced from reality. More than 90% of the time, the reason you're making a CSV is because somebody wants to look at the data in Excel and you don't want to deal with xlsx. If your concern is something else CSV is probably the wrong choice. Thus, for most programmers, Excel is the reference implementation of CSV.
Sure, good examples of how to handle each edge case would be ideal, but merely pointing out all the bad assumptions to someone is a valuable first step.
When we read them, we have the opportunity to check our own assumptions -- about the subject at hand (today, CSVs) and also (hopefully!) about other subjects we may encounter later.
I read these as "a list of things people have assumed about X that, on the surface, seem true but absolutely won't be 100% of the time."
IOW, examine your assumptions, and protect yourself from your unexamined biases and assumptions.
Besides that, anybody who is troubled by working with something like JSON is not going to look at raw CSV. They're going to use Excel.
Here's a scenario I bet many people have faced: Export an Excel sheet to CSV that has high-ASCII characters in it like accents. The export gets mangled when you load it into code or a text editor after. You eventually just upload it to Google Sheets and export it from there instead. It works but it's a pain.
Instead of exporting it as a CSV from Excel, export it as a UTF-16 TXT, which is basically a TSV file.
That will correctly preserve all the character encoding.
I can't promise this will work 100% of the time but it has resolved many many encoding issues going to/from Excel.
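If you then need to read that "Unicode Text" export programmatically, here's a sketch using the Python stdlib (the filename is a placeholder; the utf-16 codec consumes the BOM Excel writes):

    import csv

    with open("export.txt", encoding="utf-16", newline="") as f:  # BOM handled by the codec
        rows = list(csv.reader(f, delimiter="\t"))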
As of this writing, there exists a single usable CSV format that Microsoft Excel can both read and write safely across platforms. Tab delimited UTF-16LE with leading Byte Order Mark.
I work with CSVs in R, but I don't work with Excel that much. Thanks for the useful tip.
The original comment specifically mentioned Excel. Excel cannot work on "tens of millions of lines" and your objection is thus irrelevant.
If your data does not come from Excel, there is obviously no reason to make it go through Excel. There is also no reason to make it go through CSV; make it go through an actual serialisation format which is not mostly unspecified garbage, since you're programmatically serialising it for interchange and hopefully not expecting humans to edit the intermediate megabytes of data dump.
I ended up with this and a Python implementation, and it turned out not too bad. I've also done a more pragmatic C implementation (a couple hundred LOC) in a toy project (wsl.c and *.wsl files), and it turned out quite usable.
I think what prevents adoption of such a thing is that it's very hard to standardize on primitive types and integrity features.
It would not be hard to write an Excel plugin to read my format, while restricting to a fixed set of basic datatypes, and getting some usability gains there. The problem I see is that full-featured interoperability is hard to achieve because the set of supported features is hard to agree on.
> SUMMARY: Feather's good performance is a side effect of its design, but the primary goal of the project is to have a common memory layout (Apache Arrow) and metadata (type information) for use in multiple programming languages. http://wesmckinney.com/blog/feather-its-the-metadata/
The approach is data frames on disk storage.
> data frames are lists of named, equal-length columns, which can be numeric, boolean, and date-and-time, categorical (factors), or string. Every column can have missing values.
This fits the needs of all of my CSV usage.
But I need more structure. My format supports an arbitrary number of tables in a single file, some automated integrity checking, and direct conversion of text from/to runtime structures (e.g. C structs, possibly with automatically allocated backing buffers for variable-width datatypes) with only little boilerplate. So yes, that's a different format.
For the sake of sanity I'd recommend splitting the logic of reading csv into two parts. One that takes garbage and produces a correctly formatted file (e.g. properly quoted values, equal number of columns, single encoding) and the second part that actually does CSV parsing adhering to strict rules.
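A rough sketch of that two-pass idea (the cleanup rules here are only examples of what the first pass might normalise; don't take them as a complete list):

    import csv, io

    def normalize(raw_bytes):
        """Pass 1: turn whatever arrived into a consistently encoded, rectangular CSV string."""
        text = raw_bytes.decode("utf-8-sig", errors="replace")    # pick one encoding policy
        dialect = csv.Sniffer().sniff(text[:4096])                # guess delimiter/quoting
        rows = list(csv.reader(io.StringIO(text), dialect))
        width = max(len(r) for r in rows)
        out = io.StringIO()
        writer = csv.writer(out)                                  # rewrite with standard quoting
        for r in rows:
            writer.writerow(r + [""] * (width - len(r)))          # pad ragged rows
        return out.getvalue()

    def parse_strict(text):
        """Pass 2: parse under strict rules, rejecting anything still irregular."""
        rows = list(csv.reader(io.StringIO(text)))
        if len({len(r) for r in rows}) != 1:
            raise ValueError("inconsistent column counts")
        return rows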
However, I'm not sure what it means practically speaking on a real project - most of the time, you're presenting the user with some kind of upload screen, then do processing behind the scenes, and display results. You can (and probably should) structure things such that you're internally converting the "bad" csv file to a "good" csv file, and parsing it normally afterwards, but this is all behind the scenes anyway.
It's lame and lazy to write & read CSV with one-off code, but we already knew that; CSV is just deceptively simple-looking, so the temptation is strong. But what we most need isn't the list of things we're doing wrong. We already know we're doing it wrong. What we need to do is use a library that does it right. What we need to have is a list of libraries and tools that are easy to use and fix all the items in the author's list. It might also be useful to suggest simple things a hacker can do in a couple lines of code in a one-off CSV reader/writer that covers as many cases as possible.
In our first effort at absorbing the Wall Street financial data satellite stream (stock trades, bond rates and all the rest), we found that every day there would be new bad data: trades out of order, missing decimal points, badly spelled security names, clumsy attempts to 'edit in' trades after close. The world's financial data should probably have been better managed, but it was all hand-generated back then (the '90s) and mostly viewed as a stream by brokers on a terminal screen, so the human filter could understand most of the BS. But a database, not so much.
The list isn’t problems, and if you read it as a list of problems it’s nonsense.
Also why on earth did the fact that it is 14 months old need to be noted in the title - has anything changed in the last 14 months? Not that I am aware of.
48. CSVs are useful to make machine-2-machine interfaces
49. CSVs for importing/exporting data eliminates all those pesky programmers taking up too much time with the API design
50. CSVs are real time
51. CSVs are a robust mechanism to export/import data
> All records contain a single consistent encoding
> All fields contain a single consistent encoding
What? How are you expected to handle these cases?
Generally you have to guess the encoding of a plain text file like a CSV. I’m fairly sure the common case is the entire file will be a consistent encoding. If you were to guess per-record or per-field I suspect it’s more likely you’d guess some records/fields wrong than encountering a file with varying encoding.
I’d be interested to see some real world examples of CSVs with varying encoding and how existing software handles that.
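In my experience most tooling guesses once for the whole file, e.g. with the chardet library (a sketch; the filename is made up, and the guess shouldn't be trusted blindly):

    import chardet  # third-party: pip install chardet

    raw = open("mystery.csv", "rb").read()
    guess = chardet.detect(raw)   # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
    text = raw.decode(guess["encoding"] or "utf-8", errors="replace")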
If there is one thing I could change about all spreadsheet software, it would be to paste without formatting by default. Would make everyone's life so much easier...
In reality CSV has complexity akin to JSON's. You have to consider the possibility of quoting, escaping, encoding, delimiters, etc... You should always use a library to generate and parse CSV to avoid these issues.
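For instance, with Python's stdlib csv module a round trip through DictWriter/DictReader keeps embedded commas and quotes intact without any hand-rolled escaping (the filename and field values are made up):

    import csv

    rows = [{"name": "Acme, Inc.", "note": 'said "hi"', "qty": "42"}]

    with open("out.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "note", "qty"])
        writer.writeheader()
        writer.writerows(rows)

    with open("out.csv", newline="", encoding="utf-8") as f:
        print(list(csv.DictReader(f)))  # commas and quotes inside fields survive the round trip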
CSV is a great example of what happens when you do the opposite of "be strict in what you emit, be liberal in what you accept." It doesn't matter though, because it's just about good enough, and you almost never want a _general_ CSV solution, rather you need one specific to a particular problem or workflow (so you can handle your situational idiosyncrasies.)
However, also in this context CSV has major drawbacks, for instance not defining column headers, comment lines, or number formatting. A proper drop-in replacement in my field is HDF5 (https://www.hdfgroup.org/), which is mostly used as a binary representation of n-dimensional tables, with bindings to major programming systems.
However, I have never heard of HDF5 being used outside of science. Therefore I mention it as an example here.
If it's something you're streaming, using a binary messaging format is always faster/better. I tend to use Kx's format since a lot of my messages go to/from their systems, but Erlang's binary term format is good too -- even if you're not using Erlang.
I've seen the JSON-object-per-newline pattern a few times, but why not use a JSON array then? Then the whole file remains valid JSON.
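The usual argument for one-object-per-line (JSON Lines) is streaming and appendability: each line is an independent document, so you never have to hold, or even finish writing, the whole array. A sketch of the difference (filenames are placeholders):

    import json

    # JSON array: the whole document must be read and parsed before the first record is usable.
    records = json.loads(open("data.json").read())

    # JSON Lines: process one record at a time, and appending new records is trivial.
    with open("data.jsonl") as f:
        for line in f:
            record = json.loads(line)
            # ... process record ...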
My data sets aren't that huge, but sometimes do get up to mid 10Ms though frequently is only in the 100Ks.
I know it's not a perfect solution but may work for you too.
Depends on the context. If it's data-exchange and the data is going to be edited by humans, just take excel files directly as input. That'll save you tons of trouble.
If you then copy and paste, the numbers are actually converted to 52564000000. Result: I had to do a lot of work again. Which is partly my fault, and partly a horrific design.
Edit: Honestly, I don't even remember when I would have seen RFC4180 compliant file. That's just the truth out there.
* CSV files will have consistent EOL markers
* CSV files will always have a trailing EOL marker
* CSV files will never have a trailing EOL marker
* Any file with a name ending .csv is a CSV file (or something close to)
> Excel can losslessly save CSVs it opens
A particular problem we've had many times with some clients is Excel messing around with date/time formats: files with dates formatted YYYY-MM-DD being changed to American style, columns containing both date and time having one or the other chopped off, dates being converted to numeric representations, ...
He also almost seems to be blaming the receiving programmer for not being able to deal with a given CSV in an insane format.
I guess that's the nature of technology that's moving so fast.
We did a lot of stuff with data in Google Sheets, and when we exported a sheet to disk so we could parse it in code, he would insist on using TSV instead of CSV.
The main reason he insisted on this was because you don't really have to worry about escapes and quoting with tab separators. And we were an NLP company, so the data would very often include complete sentences with commas.
It's irritating to go through 50 million records and three of them have letters where there should be numbers. Or weird stuff like "$nul".
> Excel will load valid CSV files correctly.
If a field contains more than 32kb of text, Excel will continue outputting that text into the next cell, and it won't be quoted, so the CSV file appears to be flooded with complete garbage from that point on.
How do you go about writing a good CSV parser given that you can't assume anything about your input data? Are there examples of safe, robust CSV parsers that deal with all these falsehoods?
But this is one case where I'd say the style conveys an accurate impression. The problem is basically that there is no such thing as "CSV". You can't build a parser that handles all the issues because a lot of the mistakes made during generation lead to fundamentally ambiguous file formats. In the simplest case, how do you parse:
Quote, Speaker, Time
I came, I saw, I conquered, Julius Caesar, 47BC
"\n".join(",".join(str(y) for y in x) for x in [[1,2],[3,4]])
As file sizes increase and the errors increase, while you get an increase in the number of issues that a library may be able to help with, you also get an increase in the number of errors in which the resulting CSV is simply fundamentally ambiguous.
Normally, this isn't something I'd do, since it feels icky. But it works well in practice, precisely because CSV data can be so messy. It's often much easier to hack your way around the CSV data with a sufficiently robust parser than to demand that the source of the CSV data clean up their act.
I've had to deal with CSV files containing financial transactions; they used "," as a decimal separator. Fine, that's perfectly normal, and the same standard used here in Denmark. But what to do with the delimiter then... it can't be a comma, so rather than making it a ";" or something similar, they made the delimiter ", ", that is, a comma followed by a space.
That doesn't actually work for most CSV parsers; they just assume that the space is part of the data, and the value is split into kroner/øre (dollars/cents). The result is that you gain an additional field and need to trim all data... Oh, and the headers don't match.
But given the choice between a CSV parser that gives you an invalid parse and a CSV parser that completely falls over when thrown a curveball (like a multi character delimiter), which would you pick? ;-)
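For what it's worth, some parsers can be talked into handling it: pandas, for example, treats a multi-character sep as a regular expression (which forces its slower Python engine) and has a decimal option for the comma-as-decimal-separator part. A hedged sketch, with a made-up filename:

    import pandas as pd

    # sep longer than one character is interpreted as a regex by the python engine;
    # decimal="," handles 1.234,56-style numbers.
    df = pd.read_csv("transactions.csv", sep=", ", engine="python", decimal=",")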
There's actually quite a lot of stuff, in real business environments, that meets those criteria. Lots of stuff doesn't, obviously, but lots of stuff does.
2. If in doubt, most users will be happy enough if you do what Excel does.
3. A lot of this stuff is in business environments, where the end user doesn't make the purchasing decision, so if the data has to be manually formatted just so, users will often suck it up (and/or won't pay for improvements)
CSV is less a format and more an article of faith.
(ノಠ益ಠ)ノ彡┻━┻ <-- CSV
This apparently isn't actually true, but the world makes more sense if you pretend it is.
That being said, the vast majority of CSV files I've dealt with are very simple and are plain ASCII separated by a comma (or whatever). But that doesn't mean I don't know there could be more to a CSV file.
Protobufs are usually almost a drop-in replacement, and when you have them, you no longer have to worry about escaping, newline characters, adding columns, data types, nulls, or any of the other painful corner cases of CSVs.
Some might say "but you can't open them in a text editor", but to those people I give this link: https://stackoverflow.com/questions/34952811
I'm sorry, what? In what way is it a drop-in replacement? CSVs are editable in a text editor and natively imported and exported by a wide variety of tools.
How many levels of indirection from end users do you need to be, in order to consider installing an arcane linux tool and then launching `protoc --decode_raw < message.bin` from the command line an acceptable drop-in replacement for CSVs?
I'll take the generous interpretation and assume OP meant some very specific developer use-cases :)
Also, as an aside, sqlite would probably be a better container format for data, and one that would be easier to integrate into other applications.
If you can successfully convince the business to first always use some command line tool to convert data before they can look at it in Excel, then you're in a very lucky position.
I would very much argue against that supposition!
In fact I think it should be the top item of a list of "falsehood users believe about CSV files"!
I've had people submit an Excel file to a service expecting a CSV, and when being told to convert to CSV instead do so by just renaming the file...
> feel safe using them
This much is true, for certain definitions of "feel safe".
> plus you're not relying on any technology or more complex protocol
This is the key point: they usually have and regularly use software that can handle CSV files (however badly), usually Excel, so don't need to learn a new tool from scratch or, in controlled corporate environments, spend six months filing paperwork and chasing requests to get IT to install an appropriate version of a relevant tool to handle the format.
That's read only, and only works if you have the proto declaration. Very different from being editable in a text file editor.