CSV is a hazing joke we play on noob programmers to demonstrate to them the function that eats a string and a delimiter, and spits out that split string into an array. Then we demo it with the most innocuous plain vanilla csv formatted data we can imagine like '1,2,3' and then haze the new programmer by telling them that's all they really need to know to process CSV files, assuming they can read file lines into strings and manipulate arrays. ha ha noobs ha ha ha. Well its funny if you're not the one cleaning it up or getting the bug reports.
The other side is trolling noob DBAs into thinking they wrote babbys first fisher-price CSV exporter when they ran SELECT CONCAT(columnone, ',', columntwo, ',', columnthree) FROM tb_wtf; and then they get their first bug report about the column data containing a comma, perhaps as a full name.
Personally I think if we're going to haze noobs it would be more civilized to just spank them with a paddle while they repeat "thank you sir may I have another" (And that movie is I believe almost half a century old now) Or do the rituals the navy used to do upon first equator crossing. Tricking noobs into writing CSV processing code is just going to far.
It doesn't so much as it says here are some things that people have done.
> While there are various specifications and implementations for the
CSV format (for ex. [4], [5], [6] and [7]), there is no formal
specification in existence, which allows for a wide variety of
interpretations of CSV files. This section documents the format that
seems to be followed by most implementations:
RFC4180 misses some rather famous practical problems, such as until very recently Excel was limited to some 16 bit number of rows, 32767 or 65536, and the bigger and richer the corporation the less likely their software is updated, so you send them a 500K line CSV dump of engineering data, they use Excel as a viewer, and they insta-file a bug report that they are paying good money for 500K lines of data but they checked and we only sent them the first 65536 lines or whatever. Bonus fun if they save the truncated file on itself permanently deleting the excess data. Or if their corporate standard DBMS is Excel, as it often is at larger companies.
This has been fixed, but only in very recent versions of Excel, where I work this was a near daily issue until last year or two.
> This has been fixed, but only in very recent versions of Excel, where I work this was a near daily issue until last year or two.
This has been fixed since Excel 2007. I don't know whether 9 years counts as very recently.
I also agree with the fact that it has been an issue. Even at customers who have upgraded past Office 2007, a lot of corporate forms and standardized data exports are based on old work that was completed before the new Office file formats. Updates to these documents and exports continue to use the old file formats because people make changes to existing artifacts, which are handled in compatibility mode in Office, and no one changes the settings to export .xlsx files.
Absolutely. I am not denying enterprise inertia. But the parent of my original comment specifically said "recent versions of Excel". To me, this sounds like "a version of Excel released recently", not "a version of Excel that certain enterprise clients have only recently started migrating to".
What's more common, tabs in text content or commas in text content?
I'd suggest you'll see commas far more frequently, such as in addresses and freetext fields.
Can fields with commas be quoted to prevent them from being interpreted as delimiters? Yes they can, but that requires extra vigilance, and in my experience it's frequently overlooked.
As for TSV not allowing tabs... the same type of post-processing that goes into allowing for commas to be quoted can go into replacing tabs with multiple spaces.
> As for TSV not allowing tabs... the same type of post-processing that goes into allowing for commas to be quoted can go into replacing tabs with multiple spaces.
No, replacing tabs with spaces means you cannot distinguish between spaces and tabs.
CSV escaping is completely reversible.
I agree tabs aren't common, but it sure is nice to have your format allow arbitrary data, rather than just the data a particular person prefers.
The "Peace on Earth" feature is in pretty high demand too. Unfortunately, any issues that request this feature end up being closed with WONTFIX by the maintainer.
We recently received a CSV from a bank where they didn't escape properly, so one of the entries was like this:
"","C/O COMPANYNAME"SUBHEADLINE"","",(...)
So yeah, they probably wrote this themselves, and didn't expect a valid company name to contain these characters.
We ended up writing a custom parser around their horrible format, because their technical support and technical lead person we could reach couldn't understand our issues.
What you describe is exactly the problem with CSV: many less tech-savvy people decide to use it to provide some sort of very minimal API (even if it's periodical manual exports sent by mail; some people consider that automation!), don't account for the possible traps (like double quotes in the value of any field), then call it a day -- more like a lifetime.
Then there's inertia, there's a boss saying "we absolutely must work with these people", and... the rest is history for many of us.
NOTE: I've been writing custom CSV parsers at least 10 times during my career. That makes roughly 2 such parsers every 3 years. And that comes from a guy that's not very active in his career.
Not just with CSV. I remember a similar comment about broken XML.
It actually doesn't matter how compliant you are to the standard. Standards are certainly nice when both parties are willing to agree on things. But when one party is in position of strength and wants to minimize its costs or efforts, compliance to a standard won't help you.
That's the sad truth. Sometimes the thought of this makes me wanna go full Hitler on the internet -- if you can't expose a standards-compliant API, your service will be blocked by central tech authorities.
We can dream, right?
Still though, all giants eventually fall when they become phlegmatic. Sadly it's taking way too long.
I've had odd bugs show up in a CSV import process, where it was a numeric string (think account numbers) that had leading zeros. Turned out the imported CSV had been cleaned by moving records around in Excel and saving as CSV (which of course saw the account numbers as numerics and conveniently truncated the zeros)
Excel thinks anything vaguely scientific notation-ish is a float, which is a nightmare if you have, for example, four hexadecimal digits 2E15 which is 11797 in decimal but excel will helpfuly "fix" it to 2 million billion (depending on your locale of billion LOL)
That was a contrived example but I've seen Excel get very confused by foreign postal codes involving digits surrounding a letter "E" or apartment number columns for example "1E"
Excel enthusiastically and automatically types oh-so-many things. It'll take dates and assume they are in american MM/DD/YYYY in some fields and then treat dates in the same column like 13/01/1986 as strings.
That exact problem I had when working with a company who had a stocked item with the code "060E2". When people don't know scientific notation the result is bewildering!!!
This should be titled "What Vendors Should Account For When Publishing CSVs". If someone claims they're exporting a file in X format:
* X should be a known standard OR
* they should provide a client parser for X OR
* they should provide a grammar or list of rules for parsing X
This rant is focused in the wrong direction. It should go without saying that someone should deliver what they claim to. If CSV means different things to different people, that's fine too, but the burden should be on the exporter to define what it means to them.
I'll accept a retitling of "Bullshit Developers Shouldn't Have to Question About a Common File Format".
It's literally impossible for CSV to be a useful format if you cannot make any of these assumptions. Every field with a distinct encoding? Really? The focus needs to be on making CSVs about which you can reasonably assume a lot of these things so that you can actually write software that processes them.
Is there anyone out there who actually believes all CSVs are UTF-16? That's such a weird idea that I have no idea how somebody would arrive at it.
That's what always gets me about this genre of list — I often find myself scratching my head, trying to figure out if I've missed something or if they really listed that thing as a falsehood programmers believe.
Possibly its asymmetric political power at its finest but I am aware of at least one "business intelligence" application that outputs its CSV files consistently as UTF-16BE (not LE)
I'm not even sure what planet its from, much less the application, I just get stuck importing what they feel like sending me.
I don't know why everything that upstream sends me is in UTF-16BE format, its just what they do. Perhaps they hate me, or my employer. I donno. Perhaps its an intelligence test to cut down on support costs, you must be this smart or smarter to mess with our data or go away.
I question the title change from "Falsehoods Programmers Believe About CSVs"
It's not really a list of problems as much as incorrect things people think and simply reading that title alongside the list without the page title/description would make no sense at all to the reader.
Particularly problematic with this is that a human cannot necessarily determine with 100% confidence where the error is. If you have a CSV where every row has 5 entries, and you come upon this row:
> 100,1,200,300,10,100
You suspect that an in-field comma is causing a split where there shouldn't be, but who's to say which number got split up? 1200 into 1,200, or was it 200300 to 200,300, or 10100 to 10,100?
Having developed on a system that had to import and collate insurance claims data from multiple providers, I've come to dread CSVs for all of these reasons.
One thing to remember if your company offers a customer-facing "export to CSV" button:
The format of that CSV is part of your public API and changes to it can cause pain for your customers. Don't be caught off-guard if they've built a workflow around "download this CSV, put it into this excel spreadsheet and run this VBscript on it" or "download this CSV into this iPython notebook and feed it to FOO".
I remember at my last job, mentioning something about saving a Google Sheet in CSV format, and my boss immediately said "I'm morally opposed to CSV."
He then told me about TSV, and since then I've come to accept it as a superior format. With the kind of data I dealt with at my job, we never encountered tabs within a column, but commas within columns were very common, so TSV was perfect for us. A lot of what I was doing was ad-hoc processing of various sheets (nothing that was really production-grade), so a lot of time I'd just write code to read in a file and split each line on \t, and that was good enough. And maybe I'd throw out any lines that began with #, but that was about it for processing.
Actually, we did have a TSV parser in our codebase, but my boss told me that it has problems and he never liked it and that I should just roll my own inline with my code.
This has a lot of good stuff in it, but I think it's hurt by including stuff where you just have to make an assumption, E.g. I can interpret the following:
"Some\tText"
in at least three different non-ridiculous ways, even assuming the encoding is known. It's worth keeping this in mind, but intertwining it with other tractable assumptions feels like it hides some of the super-useful nuggets contained within. That said, I'll admit the dividing line is fairly indistinct, plus it's still a great list.
Any known web (JS) workarounds for parsing CSVs with inconsistent file encodings? Right now I am at the conclusion that you must know the encoding type before parsing, though I've come across some decent libraries that can "guess" the encoding type:
If it's a mix of UTF-8 and a single-byte encoding, you can guess, because UTF-8 is nice enough to be obvious when you guessed wrong.
The command-line version of ftfy [1] can handle a distinct encoding for each line with a near-zero rate of false positives. (No guarantees the output will be correct, especially if you read it as the wrong single-byte encoding initially, but it will be no worse than what you put in, because it won't replace correct text with incorrect text.)
There is no JS version of ftfy, and there won't be until JS can represent all Unicode characters without surrogates.
I think many of the problems he lists are because, as he properly notes, a great many people believe "ALL" or "NEVER" are appropriate words to use when considering CSV files (most of his points use one or the other).
But I think it is in all fairness to point out that parsing CSV files for value is a lot better than free form text, even if it's nowhere close to XML.
If the requirement is to read CSV files from a third party, then suggesting HDF5 isn't helpful.
If the requirement is to write CSV files to send to a third party, then suggesting HDF5 isn't helpful.
So this limits the usefulness of mentioning HDF5 to when you are both the producer and consumer of the data, but that's the easy case because then you can choose anything up to and including strongly typed binary formats.
Stop this horrible genre now! As discussed in the recent video programming post (https://news.ycombinator.com/item?id=13259686), this neither helps the beginner because it is unnecessarily overwhelming, nor the expert because he has already read https://tools.ietf.org/html/rfc4180 and is aware of these problems with bad CSV parsers/emitters. In the intermediate case, it could be useful for someone maintaining a CSV parser or emitter who is not aware of a few of these points and wants 100% compatibility with all CSV parsers/emitters in existence, but the list style is confusing grammatically and offers no counterexamples or suggestions for improvement.
With that said, if you are interested in using CSV for encoding and decoding in your own environment, I suggest a new file standard called μTSV.
1. All μTSV files are UTF-8.
2. All values are delimited by "\t", and all lines must end with "\n" (including the last one).
3. If you want to use a "\t" or "\n" character in your value, tough luck---use JSON.
I mostly just thought it was fun to write. Put my pain down on a page, get a little bit of catharsis. I actually had written out some "counterexamples or suggestions" but decided to keep it terse and with the general style of the other similar posts.
I was not aware of the scorn for the style, I just knew I enjoy reading them.
As opposed to your μTSV may I suggest DSV. The Art of Unix Programming makes a strong case for how it's superior to CSV [1]
The example program in the post lets you use either a "-c delimiter_char" or a "-n delimiter_ASCII_code" to specify the delimiter for your DSV data being processed by it.
and
[xtopdf] Publish DSV data (Delimiter-Separated Values) to PDF:
Bah. How am I wrong when I haven't even claimed DSV to be the best? Just quoting that section in ESR's book does not imply I endorse or agree with it. Logic ...
Plus, just providing that EDN link is hardly proof that it is the best ... And further, "best" in many cases is subjective. You want to claim EDN is the best, you lay out an objective proof.
I do intend to. Thank you for the comment though. I had starred EDN on github just now before replying to groovy2shoes, i.e., I took an initial brief look at it. Saw a commit by richhickey (Clojure creator). Since I'm into data / file formats, I will definitely check out EDN.
Sorry, my intent (beside bringing up EDN) was to be playful. Tone is often difficult to communicate in writing. I would have stuck in a `:p` if I hadn't been on mobile, but it slipped my mind. My apologies; I did not mean to insult or to otherwise be combative or critical.
I don't have objective proof that it's the best. It's a mix of personal preference and empirical "evidence" based on past success with both EDN and its granddaddy s-expressions for data exchange and serialization.
As for why I personally like EDN so much:
1. It offers the malleability, flexibility, and extensibility of XML, while being
2. even more concise than JSON,
3. straightforward to parse, and
4. precisely defined in a rigorous and language-independent specification.
Traits 1-3 it shares with the traditional s-expressions of the Lisp family, but in contrast to s-expressions it's specifically designed as a data exchange format rather than as a lexical/syntactic format for programming languages. The reason this matters is because the traditional Lisp reader is itself extensible, and extensions can't be guaranteed to carry over from dialect to dialect or even from implementation to implementation. Many Lisp readers go so far as to even allow the execution of arbitrary code at read time, which is desirable for source code as it enables metaprogramming, but it's not so desirable when you're parsing data from an arbitrary source, due to security concerns.
While EDN does have its roots in Clojure's reader, EDN is not exactly Clojure's external representation. Rather, it's a subset of Clojure's surface syntax in much the same way that JSON is a subset of JavaScript's. Like JSON, EDN works great outside of its "native language" (in fact, I've never used EDN from Clojure itself; I've only used it from Lua, Scheme, Python, C, C++, C#, Java, and JavaScript (not necessarily in that order)).
>Sorry, my intent (beside bringing up EDN) was to be playful. Tone is often difficult to communicate in writing.
I understand, and agree it can be difficult. Thanks for clearing that up, and no worries at all. Interesting info, BTW, about EDN. Will reply again in more detail later today with any comments/questions.
I'll need some time to check out EDN, so will message you with any comments later, since this thread might expire. Thanks again for letting me know about it.
Cool. Will try to comment before the thread expires, then (not sure what the period is), else will post the points here as a new thread under my id (vram22).
Which Python library did you use for EDN (if any)? I had quickly googled for that, and saw a few, including one by Swaroop C H. Got any preferences? And same question for C.
I was an early adopter of EDN, and wound up rolling my own parser in Python. EDN is simple enough that you can realistically roll a parser for it, complete with tests, in one or two working days. I find this to be a huge advantage—EDN doesn't really have any corner cases, so implementation is very straightforward.
I tried out https://github.com/swaroopch/edn_format some (in Python). Will read up more on EDN, try things a bit more, and then reply here with any comments or impressions.
Many of these problems could be avoided if we used the character codes for unit (31) and record (30) separator. This would avoid any problems with tabs or line feeds, as well as better matching the intended semantics of ASCII/UTF-8
That's suggested often, and the obvious rebuttal is that CSV is intended to be easily edited and relatively human-friendly. The unit/record characters are not on most(all?) keyboards, so editing the file in a text editor becomes pretty much impossible.
As a counterpoint, nobody seems to be editing CSV files by hand any more. It appears at this point to be an interchange format between different spreadsheets, row-oriented databases, etc.
I promise they're still used by hand all the time, usually as a form of supplying data to a program in cases where a database is just too much. CSV is still a very common base for data-driven unit tests.
If you're going to replace CSV then why not use the ASCII record and field delimiter characters and avoid using characters like tab and new line which are in band and actually useful?
And if you're using tabs as field delimiters then why do you need quotation marks in the mix?
In practice, I made heavy use of this format at my last company. I was dealing with lots and lots of Google Sheets where I knew there wouldn't be any tabs or returns in the columns, and my code was all ad-hoc stuff that would never see production, so I'd just read in files, line-by-line, split on "\t", and bam instant parser.
Yeah, the genre was good for non-standardized things like names and (to a lesser extent) time, but file formats have specs and reference implementations. Refer to those.
Such standard (Lets call it mCSV) can be incorporated into existing solutions. Moreover, existing files can be converted from/to a proprietary flavor of CSV to/from metainized CSV.
Self-describing data is generally a bad idea. If you can't get programs to agree on the original document format, what makes you think you'll be able to get them to agree on this meta format?
EDIT: The right solution is the simple one. Follow RFC4180 exactly. Reject everything that doesn't match it.
CSV is format for information interchange. General rule for II formats is: be strict in what you emit, be liberal in what you accept.
So, yes, follow RFC4180 strictly in your own code, but be able to read CSV files produced by others too. Single line of metainformation can greatly reduce confusion and is easy to throw out, when it is unnecessary.
This rule sounds nice, but you end up with a bunch of malformed documents out there because people emitting it never get feedback about what they are doing wrong.
I think I've said this in another thread, but I think rejecting malformed documents even though you could "figure out what they meant" is the equivalent of not telling people when they have food on their face.
I think the right pattern for something like this is to complain loudly when you are given a malformed document, then say, "If you really want me to parse this malformed document, use --force and I'll try to make some guesses, but really you should just ask for a well-formed document."
Well, I'm saying the attitude of "be liberal in what you accept" is to blame for the fact that malformed documents are being emitted in the first place. A huge number of people will just hack away at something until it works rather than implementing their emitters to spec, so accepting malformed documents is training people to emit malformed documents.
That is unrelated to well-formed documents in incompatible formats, though.
I mean, its a bit apples or oranges.. XML's well-formed rules are, in my experience, fairly universal, hence parsing isn't as much an issue as is poorly constructed XML schemas (and schma-less xml documents as well) I know there are edge cases, but overall XML doesn't have many of the problems specified.
XML doesn't have the problems in theory, but in practice it can be just as bad.
As a mildly egregious example, I had a company which wanted a data feed from us in XML format. I wrote a query to produce the relevant data out of MSSQL and sent it to them. They wrote back that I had to have a header line:
<?xml version="1.0" encoding="UTF-8">
MSSQL writes SQL in UCS-2 format; I therefore sent them back a feed with the encoding declared as such, and they rejected it. They were fine with the encoding being UCS-2, but they were absolutely insistent I had to declare it as UTF-8 or they wouldn't accept it. After a lot of back-and-forth and quoting of standards at them, I finally gave up and wrote a little script to change the declared encoding so they would accept it.
It's this sort of thing that makes me understand why Ode to a Shipping Label[1] exists.
A regional (to me) real estate web site that we had to work with had a data service that output XML. It was naively implemented and managed to output malformed XML. Mostly in the form of a CSV field crammed into an XML field with non-conformant XML markup.
XML has a specification. There are two well-defined categories of well-formed and ill-formed XML files, and there's exactly one correct way to parse the well-formed ones.