Not too long ago I was struggling with one of these CSV issues and received some good advice from Hans Passant  on a Stack Overflow question pertaining to my problem (emphasis mine):
"It is pretty important that you don't try to fix it. That will make you responsible for bad data for a long time. Reject the file for being improperly formatted. If they hassle you about it then point out that it is not RFC-4180 compatible. There's another programmer somewhere that can easily fix this."
It makes perfect sense in hindsight. If you accept a malformed CSV file, people will expect you to accept any malformed data that has a CSV extension. You are taking on a lot of extra responsibility to cover for the lack of work by another programmer. Odds are they can make a change to fix the problem that takes a fraction of the time it would take you work around it. You just have to raise the issue.
I realize that rejecting bad files isn't really possible in every circumstance. But I have a feeling it is an option more times than you might initially think.
 - http://stackoverflow.com/users/17034/hans-passant
In my current job, the most common "invalid" CSV format we get is .xlsx files.
So I wrote an .xlsx parser (way, way faster than Apache POI).
Another interesting hiccup to consider is CSV inside individual fields - i.e. recursive CSV. There are various ways to handle this, but in my company's line of business the usual route is to duplicate that line once per CSV element found in the field.
Likely the next invalid format we'll have to parse is PDFs containing tables...
cough people doing e-invoicing with pdf's...
And after that you will have to parse PDFs containing scans (as images, not text) of pages containing tables...
I was pulling data from a medical system that I knew full well I would not be able to get changes into for YEARS (and I got to meet the vendor, who was working on a shiny new XML export system - I wonder if that has quoting issues too - it wasn't released by the time I finished working the project)
So I wound up writing perl that knew enough to fix all the common problems with the source data, and emailed me any odd lines it couldn't cope with, so I could go in and update the regular expressions. It kinda sucked, but the end result was better antibiotic coverage for a bunch of people. Worst case of a line it really couldn't handle was that person didn't get the benefit of an expert system checking that they didn't have doubled-up medicines, which is no worse than they would have had without this system.
Trying to compensate for bad CSV format will more likely cause headaches and confusion rather than clarity. It can also discourage the need for CSV writers to be rigorous about their output formatting.
We get "CSV" files from Klarna, an invoicing company, with the payments they've processed for us. Because we're Danish and they are Swedish, it's not really weird that they would use comma as the decimal separator. So to compensate for having used the comma, they for some reason picks ", " ( that's comma + space ) as the field separator. Most good csv parsers can handle the field separator to be any character you like, as long is it's just ONE character. By picking a two character separator they've just dictated that I write my own or resort to just splitting a line on ", ".
I have an function that I usually use in projects that counts , and ; on each line to determine which one is most likly beeing used in the file.
The most annoying thing I have found in csv files is the escape sign I would like it to be \" but very often I see """ as the escape for "
Don't do things that screw up the standard tools other developers depend on.
So they could have easily played the ball back into your field if they had know what they did...
And there it clearly works. I used this technique a few times with success. If you find a CSV file that has mixed field separator types, then you probably found a broken CSV file.
When it comes to parsing, the thing is that you usually have to make some assumptions about the document structure.
Yes, the assumption you have to make is called the grammar, and you better have a parser that always does what the grammar says, and global text replacement is a technique that is easy to get wrong, difficult to prove correct, and completely unnecessary at that.
What should happen? Since #COMMA is not #COMMA#, it gets not replaced, because it does not match.
Please keep in mind, that I replied to suni's very specific question and did not try to start a discussion about general parser theory. In practice, we find a lot of files that do not respect the grammar, but still need to find a way to make the data accessible.
The way to make the data accessible is to request the producer to be fixed, it's that simple. If that is completely impossible, you'll have to figure out the grammar of the data that you actually have and build a parser for that. Your suggested strategy does not work.
https://github.com/dbro/csvquote is a small and fast script that can replace ambiguous separators (commas and newlines, for example) inside quoted fields, so that other text tools can work with a simple grammar. After that work is done, the ambiguous commas inside quoted fields get restored. I wrote it to use unix shell tools like cut, awk, ... with CSV files containing millions of records.
Your csvquote is something completely different, though it seems like you yourself might be confused about what it actually is when you use the word "ambiguous". There is nothing ambiguous about commas and newlines in CSV fields. If it were, that would be a bug in the grammar. It just so happens that many unix shell tools cannot handle CSV files in any meaningful way, because that is not their input grammar. Now, what your csvquote actually does is that it translates between CSV and a format that is compatible with that input grammar on some level, in a reversible manner. The thing to recognize is that that format is _not_ CSV and that you are actually parsing the input according to CSV grammar, so that the translation is actually reversible. Such a conversion between formats is obviously perfectly fine - as long as you can prove that the conversion is reversible, that the round-trip is the identity function, that the processing you do on the converted data is actually isomorphic to what you conceptually want to do, and so on.
BTW, I suspect that that code would be quite a bit faster if you didn't use a function pointer in that way and/or made the functions static. I haven't tried what compilers do with it, but chances are they keep that pointer call in the inner loop, which would be terribly slow. Also, you might want to review your error checking, there are quite a few opportunities for errors to go undetected, thus silently corrupting data.
My approach never failed for me, so telling me that my strategy does not work is a strong claim, where it reliably did the job for me.
Your examples are all valid, but what you are describing are theoretical attacks on the method, while the method works in almost all cases in practice. We are talking about two different viewpoints: dealing with large amounts of messy data on one hand and parser theory in an ideal cosmos on the other hand.
As such, the two viewpoints really are: not really caring about the soundness of your results and solving the actual problem.
Now, maybe you really can show that the bugs in the methods you use only cause negligible noise in your results, in which case it might be perfectly fine to use those methods. But just ignoring errors in your deduction process because you don't feel like doing the work of actually solving the problem at hand is not pragmatism. You'll have to at least demonstrate that your approach does not invalidate the result.
As I wrote above, by making sure that I use a placeholder that does not appear in the data, I make sure that it does not cause the issues you describe. And if I was wrong with that assumption, I can at least minimize the effect by choosing a very unlikely sequence as placeholder.
I really see no issue here. How do you find valid grammars for fuzzy data in practice?
commas don't delimit field entries? CSV -> comma separated values.
new lines inside a field? i've never written a parser that would be foiled by this. could be an issue if you use a built-in tokeniser (e.g strtok, etc.). be aware.
variable number of fields? you’re probably writing this for something with an expected input form. throw errors if you see something you do not accept. make sure you catch them.
ascii/unicode? yea. it’s a fucking mess. everywhere.
just do it. handle failure gracefully. learn from your mistakes. don't be naive. consider a library if the (risk of failure):(time) ratio is skewed the wrong way. the only time i would absolutely insist that a 3rd party library be used is when crypto is involved. even then, be aware that they are not perfect.
absolutely ignore people who's argument is along the lines of "you are not smart enough to implement this standard. let someone else do it.”. fuck everything about that statement, and it’s false sense of superiority.
nothing comes for free. wether you use a library, or do your own thing, you’re going to run into problems.
In general it's not about being smart enough (although for some complicated standards maybe it's true), but rather biting off more than you realize. Everything sounds simple before you find the edge case implementation issues and have to rework and rethink a bunch of hard issues that a dozen people have already thought through. Doing it yourself is on the table, but rarely the most efficient decision.
Too many "enterprise" coworkers who don't know how to write a finite state machine. They do need to use a library.
None of these questions are particularly daunting. CSV means "comma separated values", so if you want to play games and use other delimiters, please fuck off. If it's not a comma, then guess what: it's not delimited. New line characters are well-known, and well-understood, across all platforms and easy to detect. If you manage to fuck that up in your file, then take a look in the mirror, because the problem is you. Enforcing the practice of enclosing the target data in quotation marks among users is a good idea. It's something that should be supported and encouraged, and ignored at one's own risk.
Additionally, employing an escape character (such as backslash) to allow for the use of a quotation mark within enclosing quotation marks is a nice feature to add in. After that, the concept of a CSV file has provided enough tools, to tolerate [an arbitrarily large percentage] of all use cases. If you need something more robust, XML is thataway.
It uses regex lookaheads to deal with quoting, so it's not 100% portable. But it's only about one page.
As for the other things mentioned by the OP (BOM, encoding), those should be handled by the stream, and are not the provenance of CSV per se.
Unnecessary string copy operations are what make the parser slow.
When I add a CSV import feature to a project I'm working on, I tell people "this works with MS Excel flavor of CSV." This covers most, if not all, real world cases because in my world the people who want to import data are non-programmer types who all use Excel.
I'll often include the basic rules in the screen that accepts the import. If I ever had to accept data from something that was _not_ Excel I'd probably include a combo box on the web form that lets you pick the dialect. So far I haven't had to do that.
The only thing I might not be totally covering is how Excel handles newlines, but in practice I've never had to deal with that.
Primarily, I didn't fully understand the date objects and functions available in the languages/libraries i was using so simple things like formatting a string date seemed difficult to me.
This was an awful idea. Dreadful.
I came up with all sort of delightful helper methods to cover common use cases like adding one month to the current date. I made this decision to represent dates internally with a timestamp, so adding a month is easy, right?! No. ...What's 1 month from January 31st? February 28th? Well then what's 1 month from February 28th? The list of edge cases goes on.
Most things in life are more complicated than they, at first, seem.
I can't but disagree when I read stuff like this. Why shouldn't I release a library if I think it's good enough for the community? Even the powerful and versatile Ruby library for CSV parsing started as a gem from a person who didn't give a s... about advise like "do not inflict another one into this world".
Because, you know, learning, having fun and stuff.
> Not kidding.
We'd ll be better off really, but that ship has sailed. Using CSV for data which is only ever read by a machine is a dumb decision. Use the RS (record separator) character and many of these ambiguities disappear.
Of course, like I said, that ship has sailed. If you want your data to be read nicely by other programs you're probably stuck with CSV, TSV, or something similar.
Overall I agree with the article, there's no point in reinventing the wheel if there are libraries out there. And CSV specifically is a horribly complex format to deal with. But sometimes rolling your own is the best and/or only choice you have, and you might come out the other end enjoying the experience, and having learned a lot.
As for what happened to my old CSV parser? It ended up being quite popular, but stuck in the dark ages as I'd mostly moved on from PHP years ago. But thanks to a contributor, we've recently put renewed effort into bringing the project in to modern times: https://github.com/parsecsv/parsecsv-for-php
If you look at lib/csv.rb  it's:
* 2325 Lines
* 2161 Non-blank lines
* 950 Lines of Code
It there are many "what ifs" like in the posted article. You probably need another format like JSON (preferably) or XML.
Of course you can embed comma-separated lists in XML, but with JSON it will parse them for you.
(And of course it's not as good as a protobuf, but not bad for a text format.)
However, that doesn't excuse sloppy CSV writers.
Only thing that I don't like is that many candidates will assume that they have to fix the code within the parser, given those instructions, even if they know that a battle-tested library is how they would actually do it. I hope you accept an off-hand comment such as, "ew, why is this hand-rolled" as a sufficient indicator in favor of your solution.
No alternate delimiters, no backslashes.
Now I have to put up with offshore staff trying to use apostrophes (') instead of quotes (") :-(
Barring alternate delimiters, and disallowing newlines* in fields, I can write the parser for 4180 in about 30 lines of perl, reading a char at a time and flipping between about 4 states. (avoids getting root access and days of paperwork to install from CPAN)
* disallowing newlines in the data is admittedly a big restriction, but it works for many use-case/applications, and allows the caller to pull in a line before calling the parse function.
For Java, the "Ostermiller" library is pretty good for CSV handling, and has a few options for dealing with freaky variants.
I was trying to get Perl tar libraries working, when my colleague asked why I don't just use backticks to do it in the shell. Basically because I don't know that much about tar. I can use it to untar file, or create a new archive. Someone else who has written a library probably has taken the time to read through the whole manual and make it work nicely. They know the errors and warnings, and have abstracted that to a sensible level hopefully. They have thought about these things, so hopefully I won't have to.
Very little data is actually true CSV.
The code isn't particularly long (~900 lines), it's Python (hence readable) and it's well commented:
I can only guess that since it's astronomy data and constellation coordinates have decimal places, it's best to avoid the comma character because some countries use it as a decimal separator.
http://p3rl.org/Text::CSV | http://p3rl.org/Text::CSV_XS
It would be awesome if someone made a table with CSV features in one dimension and application/library behaviour in the other.
CSV or Comma Separated Values are not only for RFC, but also for EVERYONE who wants to use this word or phrase. Pedantry sucks!
knowing my needs i could easily account for all possible muck ups and avoid the instances where ambiguity could play a part
i was then able to use the bits i pulled out of the ~4G file, now 16M, in the parser with all of its assurances
sure, edge cases justify using a tried and true library for generics, but there are also edge cases that justify mocking up your own naive implementation.. if only, like in my case, to make the dada usable in such a library
Of course you can come up with scenarios where it doesn't work, but anyone who considers themselves to be a competent programmer should be able to deal with these issues, use another data format, or just talk to whomever is giving you the data to correct their data issues.
Seriously, The overhwleming CSV_bashing in these comment really makes me worry that coders just can't handle the basics anymore.
1. Of course there are exceptions to the rules: perhaps the CSV is malformed or there are special considerations in the backend, but the general point stands.
I will solve your problem with only UNIX utilities. And I'm sure others will solve it other ways.
Usually I only need sed and tr. Sometimes lex or AWK.
Arguing about something without ever pointing to an example accomplishes nothing; it's just whining.
Post an example.
A few months ago, I was trying to get some bulk data into ebay's proprietary TurboLister program. Guess what, it can import csv but not JSON.
SQLite can import csv but not JSON.
Google's terabyte ngram dataset is csv (tsv) instead of JSON. I'm glad it's not JSON because it would have required extra disk space.
... plus tons of other real-world csv examples out in the wild.
Unfortunately, the csv format is very easy for programs to write but it's very difficult for programs to properly read because of the tricky parsing.
So no, if you control input and output, JSON is still easier to use than CSV, and just as performant. JSON stores straight arrays just fine. It's not the format's fault so many people choose to store hashes with it.
...and, the moment you declare you handle json, people will send non-string data ("that is a number, of course it isn't quoted"), attempt to include nested data, leave out the opening and closing  (because people will grep a file with one array per line to filter a json file; that is no way robust, but people will do it, anyways)
The main advantage of json vs csv is that there is only one json, but quite a few different variant of csv. That is a huge advantage, but csv is engrained; there are many, many tools that handle csv but not json.
And the "if you control input and output" case is not the interesting or the problematic one.
Sometimes, simplicity is better for everyone.
I've had to() parse json embedded in a field in a csv file. Unquoted of course.
Until I explained to the other developer just how stupid that was.
CSV is bullshit, it's not good for anything except scenarios where you control both the export process, and the parser (so you know what delimiter is used and so on).
We process csv files containing payment information, export csv as product feeds for perhaps 10 different partners. It works, it's faster than XML or JSON and is easy for non-technical people to inspect using tools like Excel. Most cvs data exports are not something that's made up on the fly every day, it batch jobs that create the same type of CSV file each and every time.
CSV is alright. XML is bullshit.
Obviously, you can make CSV files work for alternating record types, in the same way that the old mainframe files used to work with multiple record types in a file, with a type descriminator field in a know place, usually field 1, of each record. But it starts to get cluttered.