So You Want to Write Your Own CSV code (tburette.github.io)
158 points by Monkeyget on May 25, 2014 | 121 comments

CSVs are a headache. As the article says, RFC 4180 doesn't necessarily represent the real world. Sometimes, though, you just have to reject things that aren't to spec.

Not too long ago I was struggling with one of these CSV issues and received some good advice from Hans Passant [1] on a Stack Overflow question pertaining to my problem (emphasis mine):

"It is pretty important that you don't try to fix it. That will make you responsible for bad data for a long time. Reject the file for being improperly formatted. If they hassle you about it then point out that it is not RFC-4180 compatible. There's another programmer somewhere that can easily fix this."

It makes perfect sense in hindsight. If you accept a malformed CSV file, people will expect you to accept any malformed data that has a CSV extension. You are taking on a lot of extra responsibility to cover for the lack of work by another programmer. Odds are they can make a change to fix the problem in a fraction of the time it would take you to work around it. You just have to raise the issue.

I realize that rejecting bad files isn't really possible in every circumstance. But I have a feeling it is an option more times than you might initially think.

[1] - http://stackoverflow.com/users/17034/hans-passant

On the other hand, the ability to handle all kinds of input can be a chief selling point of your product.

In my current job, the most common "invalid" CSV format we get is .xlsx files.

So I wrote an .xlsx parser (way, way faster than Apache POI).

Another interesting hiccup to consider is CSV inside individual fields - i.e. recursive CSV. There are various ways to handle this, but in my company's line of business the usual route is to duplicate that line once per CSV element found in the field.
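A minimal sketch of that line-duplication approach (the function name, the two-column layout, and the sample data are all hypothetical; it assumes the embedded list uses bare commas with no quoting of its own):

```python
def explode(row, col):
    # Emit one copy of the row per comma-separated element found in the
    # given field, leaving every other field untouched.
    return [row[:col] + [part] + row[col + 1:] for part in row[col].split(",")]

explode(["order-1", "a,b,c"], 1)
# [['order-1', 'a'], ['order-1', 'b'], ['order-1', 'c']]
```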

Likely the next invalid format we'll have to parse is PDFs containing tables...

> Likely the next invalid format we'll have to parse is PDFs containing tables...

cough people doing e-invoicing with PDFs...

> Likely the next invalid format we'll have to parse is PDFs containing tables...

And after that you will have to parse PDFs containing scans (as images, not text) of pages containing tables...

Yeah, that's a great idea if you can.

I was pulling data from a medical system that I knew full well I would not be able to get changes into for YEARS (and I got to meet the vendor, who was working on a shiny new XML export system - I wonder if that has quoting issues too - it wasn't released by the time I finished working the project)

So I wound up writing perl that knew enough to fix all the common problems with the source data, and emailed me any odd lines it couldn't cope with, so I could go in and update the regular expressions. It kinda sucked, but the end result was better antibiotic coverage for a bunch of people. Worst case of a line it really couldn't handle was that person didn't get the benefit of an expert system checking that they didn't have doubled-up medicines, which is no worse than they would have had without this system.

This is good advice. It's the philosophy adopted by Papa Parse, http://papaparse.com - try to gracefully handle malformed CSV and report all errors so they're obvious and actually helpful, telling the user where the syntax error is.

Trying to compensate for a bad CSV format is more likely to cause headaches and confusion than clarity. It can also remove the incentive for CSV writers to be rigorous about their output formatting.

I've never really got my head around RFCs, but 4180 is only informational, not a standard. I have used exactly your argument before, though, and will again. I've also been on the other side and needed to convert horribly inconsistent data to fit it.

It's not a standard, but it very well should be, in my opinion. Whenever someone has to process CSV, I always point them to it to make them aware that a lot of subtle points, like escaping, have already been defined to be done in one way. There is no good reason NOT to follow RFC 4180 if you want to produce or consume CSV.

The most absurd structure I've seen in a CSV file relates to the question "What if the character separating fields is not a comma?".

We get "CSV" files from Klarna, an invoicing company, with the payments they've processed for us. Because we're Danish and they are Swedish, it's not really weird that they would use the comma as the decimal separator. So to compensate for having used the comma, they for some reason pick ", " (that's comma + space) as the field separator. Most good CSV parsers can handle any field separator you like, as long as it's just ONE character. By picking a two-character separator, they've dictated that I either write my own or resort to just splitting each line on ", ".
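For what it's worth, a two-character separator can still be handled without a full hand-rolled parser, as long as no field legitimately contains comma+space. A sketch with made-up Klarna-style data:

```python
# Hypothetical export: ", " separates fields, while a bare "," is the
# decimal separator inside amounts. Splitting on the two-character
# sequence only works if no field itself contains ", ".
raw = "1, 99,95, DKK\n2, 10,00, DKK"
rows = [line.split(", ") for line in raw.splitlines()]
# rows[0] == ['1', '99,95', 'DKK']
```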

In general, Swedish CSV files are separated with ;

I have a function that I usually use in projects that counts , and ; on each line to determine which one is most likely being used in the file.
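That counting heuristic is easy to sketch (Python's standard library also ships csv.Sniffer, which does something similar but more thoroughly):

```python
def guess_delimiter(sample_line, candidates=(",", ";")):
    # Pick whichever candidate character occurs most often on the line.
    # Crude, but it resolves the common comma-vs-semicolon ambiguity.
    return max(candidates, key=sample_line.count)

guess_delimiter("namn;belopp;datum")  # ';'
guess_delimiter("name,amount,date")   # ','
```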

The most annoying thing I have found in CSV files is the escape character. I would like it to be \" but very often I see """ as the escape for ".

Well the standard says to double the double quote to escape it. If you are seeing \" or even """ to escape a quote, then someone isn't following the standard.
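For reference, Python's csv module produces exactly the standard doubling (and only quotes fields that need it, per its default QUOTE_MINIMAL setting):

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['She said "hi"', "a,b"])
buf.getvalue()
# '"She said ""hi""","a,b"\r\n'  (quotes doubled; the comma-bearing field quoted)
```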

I believe that the Swedish Klarna files use ; but the Danish ones use comma + space. That only adds to the stupidity, though: why not have just one format?

Just be happy to live in the best country.

It can be irritating, but you can just as easily replace ", " with "|" or something by simple string replacement before parsing.

True, but in my mind picking ", " indicates that they don't care or don't know what they're doing. I often run into something similar with XML. I've had more than one partner call or write me saying that the elements in a file are not in the right order. Every single time, they've admitted to not actually using an XML parser.

Don't do things that screw up the standard tools other developers depend on.

Stuff not being in the correct order is a perfectly fine technical reason to reject an XML file if you have a DTD-based workflow. It is actually quite difficult to specify that the order of elements is irrelevant (it goes with n!, so allowing six elements exactly once, but in arbitrary order, in the contents of an element makes that part of the DTD 720 times longer than specifying a fixed order). So in that case, you might have been the developer that screwed up the standard tools other developers depend on.

So they could have easily played the ball back into your court if they had known what they were doing...

That is much more complex than the issues we've seen. My issue is with, say, a <person> with the subelements <name> and <email>: it should never make any difference whether email or name is first or second.

Think it through. What if there is free text in the field? "How are you, Sally?"

If your delimiter appears in a field, the field needs to be quoted.

You can replace all commas with a placeholder (e.g. "#COMMA#"), replace the delimiter with a comma, parse the document and then replace all placeholders in the data with ",".

That does not work, unless that first replacement magically ignores the commas that are part of field separators. If you know how to write the code that does that, your problem is solved.

I was referring to "What if the character separating fields is not a comma?".

And there it clearly works. I used this technique a few times with success. If you find a CSV file that has mixed field separator types, then you probably found a broken CSV file.

No, it doesn't. What if there is #COMMA# in one of the fields?

You just choose a placeholder that does not appear in the data. You could even implement it in a way that a placeholder is automatically selected upfront that does not appear in the data.

When it comes to parsing, the thing is that you usually have to make some assumptions about the document structure.

What if there is #COMMA, in one of the fields (but no #COMMA#)?

Yes, the assumption you have to make is called the grammar, and you'd better have a parser that always does what the grammar says. Global text replacement is a technique that is easy to get wrong, difficult to prove correct, and completely unnecessary at that.

> What if there is #COMMA, in one of the fields (but no #COMMA#)?

What should happen? Since #COMMA is not #COMMA#, it does not get replaced, because it does not match.

Please keep in mind that I replied to suni's very specific question and did not try to start a discussion about general parser theory. In practice, we find a lot of files that do not respect the grammar, but we still need to find a way to make the data accessible.

What would happen is that you first would replace #COMMA, with #COMMA#COMMA# and then later replace that with ,COMMA# , thus garbling the data.
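This failure is easy to reproduce with the #COMMA# placeholder from above:

```python
PLACEHOLDER = "#COMMA#"

def hide_commas(text):
    # Step one of the scheme: swap every comma for the placeholder.
    return text.replace(",", PLACEHOLDER)

def restore_commas(text):
    # Final step: swap placeholders back to commas.
    return text.replace(PLACEHOLDER, ",")

field = "#COMMA,"                   # contains "#COMMA" but not "#COMMA#"
restore_commas(hide_commas(field))  # ',COMMA#'  (not the original: data garbled)
```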

The way to make the data accessible is to request the producer to be fixed, it's that simple. If that is completely impossible, you'll have to figure out the grammar of the data that you actually have and build a parser for that. Your suggested strategy does not work.

Usually the person parsing the CSV data doesn't have control over the way the data gets written. If he did, he would probably prefer something like protocol buffers. CSV is the lowest common denominator, so it's a useful format for exchanging data between different organizations that are producing and consuming the data.

https://github.com/dbro/csvquote is a small and fast script that can replace ambiguous separators (commas and newlines, for example) inside quoted fields, so that other text tools can work with a simple grammar. After that work is done, the ambiguous commas inside quoted fields get restored. I wrote it to use unix shell tools like cut, awk, ... with CSV files containing millions of records.
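The core idea can be sketched in a few lines of Python (this is my own illustration of the technique, not csvquote itself, and it assumes the stand-in control characters never occur in the data; note the quote toggle also survives RFC 4180 doubled quotes, since each "" pair flips the state twice):

```python
def sanitize(text, comma="\x1f", newline="\x1e"):
    # While inside a quoted field, swap commas and newlines for
    # nonprinting stand-ins so line-oriented tools can split naively.
    out, in_quotes = [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        elif in_quotes and ch == ",":
            ch = comma
        elif in_quotes and ch == "\n":
            ch = newline
        out.append(ch)
    return "".join(out)

def restore(text, comma="\x1f", newline="\x1e"):
    # Reverse the mapping once the text tools are done.
    return text.replace(comma, ",").replace(newline, "\n")

line = 'a,"b,c",d'
sanitize(line)                   # 'a,"b\x1fc",d'
restore(sanitize(line)) == line  # True
```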

You tend to have more control over the way the data is produced than you think, and you should make use of it. It's idiotic to work around broken producers over and over and over again, each time with a high risk of introducing some bugs, instead of pushing back and getting the producer fixed once and for all. Often the problem is simply in the perception that somehow broken output is just "not quite right", and therefore nothing to make a fuss about. That's not how reliable data processing works. You have a formal grammar, and either your data conforms to it or it does not, and if it doesn't, good software should simply reject it.

Your csvquote is something completely different, though it seems like you yourself might be confused about what it actually is when you use the word "ambiguous". There is nothing ambiguous about commas and newlines in CSV fields; if there were, that would be a bug in the grammar. It just so happens that many unix shell tools cannot handle CSV files in any meaningful way, because that is not their input grammar.

Now, what your csvquote actually does is translate between CSV and a format that is compatible with that input grammar on some level, in a reversible manner. The thing to recognize is that that format is _not_ CSV, and that you are actually parsing the input according to the CSV grammar, which is what makes the translation reversible.

Such a conversion between formats is obviously perfectly fine, as long as you can prove that the conversion is reversible, that the round-trip is the identity function, that the processing you do on the converted data is actually isomorphic to what you conceptually want to do, and so on.

BTW, I suspect that that code would be quite a bit faster if you didn't use a function pointer in that way and/or made the functions static. I haven't tried what compilers do with it, but chances are they keep that pointer call in the inner loop, which would be terribly slow. Also, you might want to review your error checking, there are quite a few opportunities for errors to go undetected, thus silently corrupting data.

I used that strategy for parsing gigabytes of CSVs containing arbitrary natural language from the web - try to get these files fixed, or figure out a grammar for gigabytes of fuzzy data...

My approach never failed for me, so telling me that my strategy does not work is a strong claim when it reliably did the job for me.

Your examples are all valid, but what you are describing are theoretical attacks on the method, while the method works in almost all cases in practice. We are talking about two different viewpoints: dealing with large amounts of messy data on one hand and parser theory in an ideal cosmos on the other hand.

How do you know that the strategy worked reliably if you never compared the results to the results obtained using a reliable method (which you presumably didn't, because then you could just have used the reliable method)? The larger the data you have to deal with, the more likely it is that corner cases will occur in it, and the less likely that you will notice anomalies, thus the more important that you are very strict in your logic if you want to derive any meaningful results.

As such, the two viewpoints really are: not really caring about the soundness of your results and solving the actual problem.

Now, maybe you really can show that the bugs in the methods you use only cause negligible noise in your results, in which case it might be perfectly fine to use those methods. But just ignoring errors in your deduction process because you don't feel like doing the work of actually solving the problem at hand is not pragmatism. You'll have to at least demonstrate that your approach does not invalidate the result.

Nitpicking much?

As I wrote above, by making sure that I use a placeholder that does not appear in the data, I make sure that it does not cause the issues you describe. And if I was wrong with that assumption, I can at least minimize the effect by choosing a very unlikely sequence as placeholder.

I really see no issue here. How do you find valid grammars for fuzzy data in practice?

Garbage in? Garbage out. You give me a shitty file, you get shitty results. Tough shit.

None of these questions are particularly daunting. CSV means "comma separated values", so if you want to play games and use other delimiters, please fuck off. If it's not a comma, then guess what: it's not delimited. New line characters are well-known, and well-understood, across all platforms and easy to detect. If you manage to fuck that up in your file, then take a look in the mirror, because the problem is you. Enforcing the practice of enclosing the target data in quotation marks among users is a good idea. It's something that should be supported and encouraged, and ignored at one's own risk.

Additionally, employing an escape character (such as backslash) to allow for the use of a quotation mark within enclosing quotation marks is a nice feature to add. After that, the concept of a CSV file has provided enough tools to tolerate [an arbitrarily large percentage] of all use cases. If you need something more robust, XML is thataway.

as the article mentions, CSV is not well defined. libraries are.. well, different. you'd spend as much time becoming familiar with one as you would writing a basic parser.

commas don't delimit field entries? CSV -> comma separated values.

new lines inside a field? i've never written a parser that would be foiled by this. could be an issue if you use a built-in tokeniser (e.g strtok, etc.). be aware.

variable number of fields? you’re probably writing this for something with an expected input form. throw errors if you see something you do not accept. make sure you catch them.

ascii/unicode? yea. it’s a fucking mess. everywhere.

just do it. handle failure gracefully. learn from your mistakes. don't be naive. consider a library if the (risk of failure):(time) ratio is skewed the wrong way. the only time i would absolutely insist that a 3rd party library be used is when crypto is involved. even then, be aware that they are not perfect.

absolutely ignore people whose argument is along the lines of "you are not smart enough to implement this standard. let someone else do it." fuck everything about that statement, and its false sense of superiority.

nothing comes for free. whether you use a library, or do your own thing, you're going to run into problems.

> absolutely ignore people whose argument is along the lines of "you are not smart enough to implement this standard. let someone else do it." fuck everything about that statement, and its false sense of superiority.

In general it's not about being smart enough (although for some complicated standards maybe it's true), but rather biting off more than you realize. Everything sounds simple before you find the edge case implementation issues and have to rework and rethink a bunch of hard issues that a dozen people have already thought through. Doing it yourself is on the table, but rarely the most efficient decision.

On the one hand, I hear you. On the other hand...

Too many "enterprise" coworkers who don't know how to write a finite state machine. They do need to use a library.

My most popular stackoverflow answer [1] includes a CSV writer and reader. Yeah, I'd clean it up a little if I were doing it now (return enumerator instead of array, etc). But people keep using it.

It uses regex lookaheads to deal with quoting, so it's not 100% portable. But it's only about one page.

As for the other things mentioned by the OP (BOM, encoding), those should be handled by the stream, and are not the province of CSV per se.

[1] http://stackoverflow.com/a/769713/4525

Regex lookaheads are more efficient because you're copying everything between terminal chars at once as opposed to one char at a time.

Unnecessary string copy operations are what make the parser slow.
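The lookahead trick translates to Python as well; a sketch (it assumes well-formed quoting, splits without unquoting fields, and the lookahead rescans the tail of the line at every comma, so it can be slow on pathological input):

```python
import re

# Split on commas that are followed by an even number of double quotes,
# i.e. commas sitting outside any quoted field. Fields keep their
# surrounding quotes; unescaping is a separate step.
FIELD_SPLIT = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

FIELD_SPLIT.split('a,"b,c",d')  # ['a', '"b,c"', 'd']
```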

"CSV is not a well defined file-format. The RFC4180 does not represent reality. It seems as every program handles CSV in subtly different ways. Please do not inflict another one onto this world. Use a solid library."

I can't help but disagree when I read stuff like this. Why shouldn't I release a library if I think it's good enough for the community? Even the powerful and versatile Ruby library for CSV parsing started as a gem from a person who didn't give a s... about advice like "do not inflict another one onto this world".

IF your library is a solid library, then release it. What he is saying though, is don't roll your own if you can use a solid library. And if a good solid library exists, why bother writing your own?

> And if a good solid library exists, why bother writing your own?

Because, you know, learning, having fun and stuff.

This article makes it much more complicated than it needs to be. It tries to be all things to all people. In practice you're going to have to sacrifice some functionality for the sake of usability and your own sanity.

When I add a CSV import feature to a project I'm working on, I tell people "this works with MS Excel flavor of CSV." This covers most, if not all, real world cases because in my world the people who want to import data are non-programmer types who all use Excel.

I'll often include the basic rules in the screen that accepts the import. If I ever had to accept data from something that was _not_ Excel I'd probably include a combo box on the web form that lets you pick the dialect. So far I haven't had to do that.

The only thing I might not be totally covering is how Excel handles newlines, but in practice I've never had to deal with that.

I found out that Windows Excel and Mac OS Excel use different character encodings for CSV.

Does it work with Hungarian Ms Excel? It uses semicolons as delimiters.

If all you care about is Excel compatibility you can add "sep=," on the first line. You can also use the Text Import Wizard. Changing the extension to .txt should cause Excel to show the Wizard upon opening the file.
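The hint line looks like this (note that "sep=" is an Excel-specific extension, not part of RFC 4180, so any other parser will see it as an ordinary data row):

```python
import io

# Build the file contents in memory; a real export would write to disk.
buf = io.StringIO()
buf.write("sep=,\n")                     # Excel-only directive; must be line one
buf.write("name,amount\nwidget,9.95\n")  # ordinary CSV rows follow

buf.getvalue().splitlines()[0]  # 'sep=,'
```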

What is wrong with trying to be all things to all people? If you use a good solid library you don't need to tell people "this works with some versions of MS Excel" And that is the main point of the article.

Makes total sense to focus on the format the user actually uses. Still...why don't you use a library?

> Ruby CSV library is 2321 lines.

If you look at lib/csv.rb [1] it's:

* 2325 Lines

* 2161 Non-blank lines

* 950 Lines of Code

[1]: https://github.com/ruby/ruby/blob/trunk/lib/csv.rb

Early on in my career, just a year out of school, I, for some absurd reason, had the idea to build my own date library.

Primarily, I didn't fully understand the date objects and functions available in the languages/libraries I was using, so simple things like formatting a date string seemed difficult to me.

This was an awful idea. Dreadful.

I came up with all sort of delightful helper methods to cover common use cases like adding one month to the current date. I made this decision to represent dates internally with a timestamp, so adding a month is easy, right?! No. ...What's 1 month from January 31st? February 28th? Well then what's 1 month from February 28th? The list of edge cases goes on.
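The clamping behaviour that makes "add a month" non-invertible is easy to demonstrate; a sketch (my own illustration; real date libraries make an explicit policy choice here):

```python
import calendar
import datetime

def add_month(d):
    # Move forward one month, clamping the day to the target month's length.
    y, m = (d.year + 1, 1) if d.month == 12 else (d.year, d.month + 1)
    return datetime.date(y, m, min(d.day, calendar.monthrange(y, m)[1]))

jan31 = datetime.date(2014, 1, 31)
add_month(jan31)             # 2014-02-28
add_month(add_month(jan31))  # 2014-03-28, not 2014-03-31: information was lost
```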

Most things in life are more complicated than they, at first, seem.

Especially dates.

> What if the character separating fields is not a comma?

> Not kidding.

We'd all be better off, really, but that ship has sailed. Using CSV for data which is only ever read by a machine is a dumb decision. Use the RS (record separator) character and many of these ambiguities disappear.

Of course, like I said, that ship has sailed. If you want your data to be read nicely by other programs you're probably stuck with CSV, TSV, or something similar.

On the other hand, there's definitely some value to being able to directly inspect and alter your data in a text editor. It would be nice to not have to deal with unprintable characters.
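A sketch of what the control-character approach buys you (it assumes the data itself never contains these two characters, which is exactly the property that makes quoting and escaping unnecessary):

```python
RS, US = "\x1e", "\x1f"  # ASCII record separator / unit separator

def encode(rows):
    # Join fields with US and records with RS; no quoting rules needed.
    return RS.join(US.join(fields) for fields in rows)

def decode(blob):
    return [record.split(US) for record in blob.split(RS)]

rows = [["name", "note"], ["Sally", "How are you, Sally?"]]
decode(encode(rows)) == rows  # True, even with commas and quotes in fields
```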

Personally I know the pain of creating a CSV parser. In late 2006 I was working on a PHP project that required a CSV parser, and what was available at the time did not come close to cutting it. So I created my own parser/generator, which among many other things included automatic delimiter character detection. It was a rather painful project to create, but I learned a lot, and found the experience really fun.

Overall I agree with the article, there's no point in reinventing the wheel if there are libraries out there. And CSV specifically is a horribly complex format to deal with. But sometimes rolling your own is the best and/or only choice you have, and you might come out the other end enjoying the experience, and having learned a lot.

As for what happened to my old CSV parser? It ended up being quite popular, but stuck in the dark ages, as I'd mostly moved on from PHP years ago. But thanks to a contributor, we've recently put renewed effort into bringing the project into modern times: https://github.com/parsecsv/parsecsv-for-php

Can we stop being liberal in what we accept from others? It only leads to an unfixable mess.

CSV works for simple cases. It is trivial to parse, you shouldn't even need a library.

If there are many "what ifs", like in the posted article, you probably need another format like JSON (preferably) or XML.

Yet in the real world you often don't have that luxury. Oftentimes the inconsistent CSV file that you have to parse is not of your own creation. It comes from some other data source, or perhaps you have multiple data sources spitting out their own variants. You just need to get the job done, and splitting on ',' won't work.

Off topic, but why JSON over XML? What are the technical advantages for using JSON instead of XML (and don't say anything about "human readable"). If you're consuming the data with JavaScript, I'll grant you that JSON has quite an edge. But most every language has standard libs for XML. Both are easy to parse, but XML is easier to validate given a schema definition.

JSON is considerably more compact, especially if you use lists instead of maps. For a list of numbers, there is only one character of overhead per item. For a list of strings, it's three characters per item.

Of course you can embed comma-separated lists in XML, but with JSON it will parse them for you.

(And of course it's not as good as a protobuf, but not bad for a text format.)

Because of closing tags, XML is approximately twice as noisy as JSON encoding. If response sizes and the network traffic they entail are a concern, JSON is worth considering. If response sizes + validation are necessary, something like Protocol Buffers or Thrift may also fit, as they are widely supported as well.
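An illustrative back-of-the-envelope comparison (the XML shape here is just one possible encoding of the same data, chosen for the example):

```python
import json

data = {"ids": [1, 2, 3]}
json_text = json.dumps(data, separators=(",", ":"))
xml_text = "<r><ids><i>1</i><i>2</i><i>3</i></ids></r>"

len(json_text)  # 15 bytes: '{"ids":[1,2,3]}'
len(xml_text)   # 42 bytes, mostly closing tags
```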

If your data touches the network, conformant XML parsing is an attack vector (billion laughs, external entity exploits, ...) and non-conformant XML parsing ends up being a headache. Even more, the sheer absurd complexity of XML contains so much stuff that who knows how many more exploits-by-specification are in there.

Well, a lot of the time you are reading a CSV file given to you by someone else. You can't expect it to be "simple", and you need to parse it correctly.

JSON is arguably more complex than CSV. Though it is at least well-defined (mostly).

However, that doesn't excuse sloppy CSV writers.

This goes for most complex problems. The first step of any dev problem should be to make sure you understand the problem, the second to map out the main pieces and the third to make sure you are leveraging every (well maintained) library possible. There are, of course, issues with dependencies and tying yourself to code you didn't write, but what would you rather depend on--code that has had tens or hundreds of eyes on it, or code that you, and maybe one or two team members has reviewed?

One of my first open source projects was a JDBC driver that read CSV files. It started simply enough but once you started adding in support for all the quirks things became really complicated really quickly. Just check out all the "options" for the driver that have been added by the community over the last 14ish years http://csvjdbc.sourceforge.net/doc.html

Parsing CSV is easier than handling XML or JSON. I do integrations as my job, and the most common format used is CSV because it's handy, simple, and reliable compared to other formats. That is exactly the reason why ini and props files are also preferred over a database for data that isn't too volatile or big: anyone can open the data file and see what's stored and what's wrong.

Have a look at `jq`: https://stedolan.github.io/jq/ It makes working with JSON a breeze.

The best tool I've found for working with CSV files is csvkit[1]. I've run into some of the issues mentioned in the article and it's handled them all gracefully. It's basically a bunch of scripts mirroring sort, grep, cut, etc. but specifically for dealing with CSV files.

[1] http://csvkit.readthedocs.org/

Hey, this looks good. I've also used csvfix [1] to get me out of trouble before.

1: http://neilb.bitbucket.org/csvfix/

We actually use CSV-reading as an incidental part of a hiring exercise. We provide a really simple homemade CSV parser as part of a PHP project, with a "could you find and fix bugs in this?" instruction. The way to get full marks is to rip out the parser and replace it with the appropriate standard library function.

I like this.

Only thing that I don't like is that many candidates will assume that they have to fix the code within the parser, given those instructions, even if they know that a battle-tested library is how they would actually do it. I hope you accept an off-hand comment such as, "ew, why is this hand-rolled" as a sufficient indicator in favor of your solution.

Such a comment would be acceptable, yeah. So long as we can tell they looked at it and thought "wow, that might go incredibly wrong"...

CSVs were simpler back in the 80s, when there were a few products (e.g. - Lotus 123, xBASE) that all wrote RFC 4180 compliant text (and I'm pretty sure there was no RFC 4180 yet)

No alternate delimiters, no backslashes.

Now I have to put up with offshore staff trying to use apostrophes (') instead of quotes (") :-(

Barring alternate delimiters, and disallowing newlines* in fields, I can write the parser for 4180 in about 30 lines of perl, reading a char at a time and flipping between about 4 states. (avoids getting root access and days of paperwork to install from CPAN)

* disallowing newlines in the data is admittedly a big restriction, but it works for many use-case/applications, and allows the caller to pull in a line before calling the parse function.
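A hedged Python sketch of such a parser (my own illustration, not the Perl original): a handful of states, one character at a time, with newlines in fields disallowed since the caller feeds it one line:

```python
def parse_line(line):
    # States: "start" (at a field boundary), "plain" (unquoted field),
    # "quoted" (inside quotes), "quote_seen" (a quote inside a quoted
    # field: either an RFC 4180 "" escape or the closing quote).
    fields, buf, state = [], [], "start"
    for ch in line:
        if state == "start":
            if ch == '"':
                state = "quoted"
            elif ch == ",":
                fields.append("")          # empty field
            else:
                buf.append(ch)
                state = "plain"
        elif state == "plain":
            if ch == ",":
                fields.append("".join(buf))
                buf, state = [], "start"
            else:
                buf.append(ch)
        elif state == "quoted":
            if ch == '"':
                state = "quote_seen"
            else:
                buf.append(ch)
        else:  # quote_seen
            if ch == '"':
                buf.append('"')            # "" escape for a literal quote
                state = "quoted"
            elif ch == ",":
                fields.append("".join(buf))
                buf, state = [], "start"
            else:
                raise ValueError("stray quote in field")
    if state == "quoted":
        raise ValueError("unterminated quoted field")
    fields.append("".join(buf))
    return fields

parse_line('a,"b,""c""",d')  # ['a', 'b,"c"', 'd']
```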

For Java, the "Ostermiller" library is pretty good for CSV handling, and has a few options for dealing with freaky variants.

I think this example is relevant to many seemingly trivial problems, where the task seems simple but becomes complex once you think about the details a bit more.

I was trying to get Perl tar libraries working, when my colleague asked why I don't just use backticks to do it in the shell. Basically because I don't know that much about tar. I can use it to untar file, or create a new archive. Someone else who has written a library probably has taken the time to read through the whole manual and make it work nicely. They know the errors and warnings, and have abstracted that to a sensible level hopefully. They have thought about these things, so hopefully I won't have to.

Easy to write, hard to read. Perfect illustration of an emergent case of Postel's Law.

To me it's more the perfect illustration of the broken window theory.

I trust Numpy a lot for CSV handling. It deals with lots of edge cases including missing data, weird delimiters (pipes '|' are popular in astro for some reason) and massive files. If in doubt, whack it into Excel which has been doing this stuff for decades now. I prefer using Numpy to Python's CSV library which I find a bit clunky.

Very little data is actually true CSV.

The code isn't particularly long (~900 lines), it's Python (hence readable) and it's well commented:


>weird delimiters (pipes '|' are popular in astro for some reason)

I can only guess that since it's astronomy data and constellation coordinates have decimal places, it's best to avoid the comma character because some countries use it as a decimal separator.


A good and performant alternative to CSV is Google's protocol buffers: https://code.google.com/p/protobuf/

But it is still by far the most readable text data format out there. Which is the reason for its wide adoption. I'll be downvoted, but I really believe in this.

So, which library? CSV is a mess.

Here's some links which (will always) point to latest versions on MetaCPAN:

http://p3rl.org/Text::CSV | http://p3rl.org/Text::CSV_XS

I really have to second this. It's fast, it's smart, it handles almost anything, and it reliably gives good error messages. It's an important part of my data science toolbelt.

No idea. I've been bitten by a library that turned out, well into a project, to have pernicious flaws.

It would be awesome if someone made a table with CSV features in one dimension and application/library behaviour in the other.

The problem is that CSV is basically non-discoverable, as it has no metadata, e.g. about encoding or delimiters or locale. That's why Excel gives you a sample and lets you change delimiters, etc. You can guess, or you can write something that seems to work for a particular input source. But it is best avoided.

I would say use the parser included in your programming environment, but I spent way too long looking for the non-existent CSV parser in .NET.

Not a single language I use on a routine basis includes a CSV parser in its standard library.

I'm mostly using Python and trying to learn Go; both have great CSV parsers, and I used the one in PHP. Given the size and features present in .NET, I was really surprised not to find a CSV parser.

ocaml-csv can handle anything Excel can throw at it (and throw it back). Worked out well in the real world.


Have you tried Papa Parse?

Why should CSV be only about readability? I think the RFC is sometimes too pedantic; CSV can carry both plain text and binary data. A comma is not just a comma: it means different things in different environments. Why should the phrase "CSV" or "Comma Separated Values" belong only to the RFC?

CSV, or Comma Separated Values, is not just for the RFC, but for EVERYONE who wants to use this word or phrase. Pedantry sucks!

We switched from a CSV based delivery to Apache Avro files. These are binary files which have the record schema embedded in the file header. We're pretty happy with this solution for the time being and it seems to be an awesome alternative to CSV. I wonder if anyone else is doing something similar? Good article but I'd appreciate if the author gave some alternatives.

I usually take advantage of the fixed format of the individual exporting tool. Everyone does it a bit differently, so what? I have a PHP parser for it and adapt it for each of my clients. It's cheaper to have a small parser, adapted to the customer's needs, than one 10k-SLOC library to handle a boatload of files...

CSV is really slow to work with, because you have to check for well-formedness, like you do with XML. And in the end, I always end up making specific concessions for the files that my customers use (which must be patched again and again) or having to take a hard stance on what can and can't be in the "CSV" files.

Mike Bostock's DSV library handles pretty much all of the cases listed for encoding and decoding. Written in JavaScript, in 116 lines.


I recently needed to deal with a ~4 GB XML file. I tried a parser, but after waiting thirty minutes for it to load I decided to parse out the bits I needed manually with a bash script.

Knowing my needs, I could easily account for all possible muck-ups and avoid the instances where ambiguity could play a part.

I was then able to use the bits I pulled out of the ~4 GB file, now 16 MB, in the parser with all of its assurances.

Sure, edge cases justify using a tried-and-true library for the general case, but there are also edge cases that justify mocking up your own naive implementation, if only, as in my case, to make the data usable in such a library.

Most likely prompted by discussion on https://news.ycombinator.com/item?id=7794684

CSV: Where the only way to win is not to play . . .

This article represents all my feelings when my boss says, "Just write a CSV parser for this, it's just CSV, so it ain't that hard."

A non-standard standard is always a sure way to shoot yourself in the foot. Endianness also causes some confusion...

It's a flippin' CSV.

Of course you can come up with scenarios where it doesn't work, but anyone who considers themselves to be a competent programmer should be able to deal with these issues, use another data format, or just talk to whomever is giving you the data to correct their data issues.

Seriously, the overwhelming CSV-bashing in these comments really makes me worry that coders just can't handle the basics anymore.

It's not a question of can, it's a question of should. If any engineer on my team came to me and told me he was building a CSV reader/writer, I would seriously question his judgement as an engineer[1]. My thoughts would be that either he isn't capable of seeing obvious challenges in building a "simple" CSV feature or he isn't able to prioritize his time well, focusing on useless toys at the expense of getting important work done.

1. Of course there are exceptions to the rules: perhaps the CSV is malformed or there are special considerations in the backend, but the general point stands.

Never ever use CSV to export. Use tab-separated instead, since it takes real effort to type a tab into an Excel cell, so the delimiter rarely shows up inside field values.
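A quick sketch of why tab-separated output sidesteps the quoting problem (sample data invented): with `\t` as the delimiter, an embedded comma is just an ordinary character and the writer adds no quotes at all.

```python
import csv
import io

out = io.StringIO()
writer = csv.writer(out, delimiter='\t')
writer.writerow(['id', 'name'])
writer.writerow(['1', 'Smith, John'])  # embedded comma needs no quoting now

print(out.getvalue())  # no quote characters anywhere in the output
```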

Post a sample .csv file you believe is too difficult.

I will solve your problem with only UNIX utilities. And I'm sure others will solve it other ways.

Usually I only need sed and tr. Sometimes lex or AWK.

Arguing about something without ever pointing to an example accomplishes nothing; it's just whining.

Post an example.

Thank you.

It isn't that any particular file is difficult but that the variations that you haven't even thought about might catch you out. It is the deceptive simplicity of the samples that you have at hand that may catch out your code when it hits a different (also simple but different) example in the field.
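A concrete example of that deceptive simplicity (the record is made up): a perfectly legal RFC 4180 line defeats the "obvious" one-line parser while the real parser gets it right.

```python
import csv
import io

# A legal RFC 4180 record: the second field contains a comma.
line = '1,"Smith, John",NY'

naive = line.split(',')                       # the "obvious" approach
proper = next(csv.reader(io.StringIO(line)))  # a real CSV parser

print(naive)   # four fields, quote characters still attached -- wrong
print(proper)  # ['1', 'Smith, John', 'NY'] -- three fields, as intended
```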

Why are people using CSV when better (and less fuzzily defined) solutions exist, such as JSON?

In addition to aforementioned import/export data interop with MS Excel, there are tons of legacy systems (mainframes, etc) that import/export csv but not XML or JSON. The csv format is everywhere and will continue to be with us for decades. People will always look for a quality library (in whatever new programming language) that handles all tricky edge cases.

A few months ago, I was trying to get some bulk data into ebay's proprietary TurboLister[1] program. Guess what, it can import csv but not JSON.

SQLite[2] can import csv but not JSON.

Google's terabyte ngram dataset[3] is csv (tsv) instead of JSON. I'm glad it's not JSON because it would have required extra disk space.

... plus tons of other real-world csv examples out in the wild.

Unfortunately, the csv format is very easy for programs to write but it's very difficult for programs to properly read because of the tricky parsing.

[1] http://pages.ebay.com/sellerinformation/sellingresources/tur...

[2] http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles

[3] http://storage.googleapis.com/books/ngrams/books/datasetsv2....

If your data are rectangular and you care about performance, CSV is better than JSON just because it avoids repetitive key names everywhere. Then again, if your data are rectangular and you really care about performance, you would not use any of these (you might use HDF5, which has support in many programming languages and will destroy the others in terms of speed).

CSV is almost a subset of JSON, with the understanding that you have to wrap every line in [], wrap the document in [], and quote every field. (And JSON doesn't have built-in support for headings, so you need to write a little loop instead of the library building a hash for you.)

So no, if you control input and output, JSON is still easier to use than CSV, and just as performant. JSON stores straight arrays just fine. It's not the format's fault so many people choose to store hashes with it.
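That record-to-array mapping can be sketched in a few lines of stdlib Python (sample data invented): each CSV record becomes a JSON array, and JSON's escaping then covers embedded quotes and newlines uniformly.

```python
import csv
import io
import json

raw = 'a,b\n1,"x,y"\n2,z\n'

# Each CSV record becomes a JSON array; the file becomes an array of arrays.
rows = list(csv.reader(io.StringIO(raw)))
as_json = json.dumps(rows)
print(as_json)  # [["a", "b"], ["1", "x,y"], ["2", "z"]]

# The conversion round-trips losslessly.
assert json.loads(as_json) == rows
```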

...and quotes inside fields must be escaped with a backslash, newlines replaced by \n, etc.

...and, the moment you declare you handle JSON, people will send non-string data ("that is a number, of course it isn't quoted"), attempt to include nested data, and leave out the opening and closing [] (because people will grep a file with one array per line to filter a JSON file; that is in no way robust, but people will do it anyway).

The main advantage of JSON vs CSV is that there is only one JSON but quite a few different variants of CSV. That is a huge advantage, but CSV is ingrained; there are many, many tools that handle CSV but not JSON.

And the "if you control input and output" case is not the interesting or the problematic one.

The idea that JSON is the substitute made me chuckle. JSON is more verbose to boot. CSV is a poor format, but JSON is no panacea either; personally I'd never use it for anything that's not web (browser) related.

If you have simple data, why use something as complicated as JSON? For a recent project, I had a simple CSV file with an int and a float per row; using JSON would probably have doubled the data size. I used a simple string.split(",") for the JavaScript decoder, because I control the data and know it's safe. I don't need another JavaScript library (I'd probably do differently if I had a standard library rather than a hodge-podge of scripts).

Sometimes, simplicity is better for everyone.

Why not use both?

I've had to parse JSON embedded in a field in a CSV file. Unquoted, of course.

Until I explained to the other developer just how stupid that was.

Importing into excel is probably a big reason.

If Excel compatibility is the goal, one should use libraries that read and produce Excel files.

CSV is bullshit, it's not good for anything except scenarios where you control both the export process, and the parser (so you know what delimiter is used and so on).

CSV files are MUCH easier to search and inspect using tools like grep and less. It's the accounting people who want Excel, but as a developer, CSV is easier and more flexible.

We process CSV files containing payment information and export CSV as product feeds for perhaps 10 different partners. It works, it's faster than XML or JSON, and it's easy for non-technical people to inspect using tools like Excel. Most CSV data exports are not something made up on the fly every day; it's batch jobs that create the same type of CSV file each and every time.

CSV is alright. XML is bullshit.

CSV is much simpler when the records are all of one type. It gets debatable which (XML vs CSV) is simpler when you get multiple record types dumping a hierarchical data structure.

Obviously, you can make CSV files work for alternating record types, in the same way that the old mainframe files used to handle multiple record types in a file, with a type discriminator field in a known place, usually field 1, of each record. But it starts to get cluttered.
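The discriminator-field pattern looks roughly like this (record layout and values invented for illustration): field 1 tags each record as header, detail, or trailer, and the reader dispatches on it.

```python
import csv
import io

# A multi-record-type file, mainframe style: field 1 is the type discriminator.
raw = ('H,2014-05-25,ACME\n'   # header record
       'D,42,widget,9.99\n'    # detail records
       'D,43,gadget,1.50\n'
       'T,2\n')                # trailer carrying the detail-record count

header, details, trailer = None, [], None
for row in csv.reader(io.StringIO(raw)):
    if row[0] == 'H':
        header = row[1:]
    elif row[0] == 'D':
        details.append(row[1:])
    elif row[0] == 'T':
        trailer = row[1:]

# The trailer count lets the reader sanity-check the file.
assert int(trailer[0]) == len(details)
```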

The same could be said about exporting to JSON. (the JSON code on some of the big APIs does not parse intelligently and I have to spend a lot of time fixing it.)

CSV is far, far more ubiquitous and much more usable in non-web settings. (e.g. desktop data analysis programs)

true, and in those settings you largely don't see situations that trip up naive parsers such as newlines or delimiters inside fields.

Because Excel doesn't export JSON. Because if you have a table, offering comma delimited field is easy. But really, people you work with give you csv files, and you don't have a choice.
