
Problems with CSVs - donatj
https://donatstudios.com/Falsehoods-Programmers-Believe-About-CSVs
======
astrodust
CSV is less a file format and more a hypothetical construct like "peace on
Earth".

~~~
ORioN63
That's not a bad thought. Maybe RFC 4180 shouldn't have defined CSV, but a
similar alternative instead.

~~~
ZenoArrow
CSV is inferior to TSV, there's really very little reason to keep using CSV
other than taking advantage of its ubiquity.

~~~
IanCal
It's a shame we don't use the ASCII delimiter characters. It would have solved
a lot of the quoting issues.

[https://en.m.wikipedia.org/wiki/C0_and_C1_control_codes#Fiel...](https://en.m.wikipedia.org/wiki/C0_and_C1_control_codes#Field_separators)

~~~
ZenoArrow
Agreed, that would be the best delimiter choice for plain text data files.

------
KC8ZKF
ASCII defines a unit separator and a record separator. I wish we would just
bite the bullet and start using this 50-year-old solution.

Modern editors/spreadsheets could insert the characters without troubling the
user too much, and show visible glyphs when that makes sense.

[https://news.ycombinator.com/item?id=7474600](https://news.ycombinator.com/item?id=7474600)
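
Such a format needs no quoting or escaping at all, since the separators can't
appear in ordinary text. A minimal sketch of the idea (Python, hypothetical
helper names):

```python
# ASCII control-code delimiters: no quoting rules needed, because these
# bytes never occur in ordinary text.
US = "\x1f"  # unit separator (between fields)
RS = "\x1e"  # record separator (between records)

def dump(rows):
    """Serialize rows (lists of strings) using US/RS delimiters."""
    return RS.join(US.join(row) for row in rows) + RS

def load(data):
    """Parse US/RS-delimited data back into lists of strings."""
    return [rec.split(US) for rec in data.split(RS) if rec]

rows = [["id", "name"], ["1", 'Acme, "Inc."']]  # commas and quotes are fine
assert load(dump(rows)) == rows
```

The catch, as ever, is that no mainstream editor or spreadsheet emits or
displays these characters by default.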

------
DonHopkins
Falsehood: You can safely name your first column "ID".

[https://support.microsoft.com/en-us/kb/323626](https://support.microsoft.com/en-us/kb/323626)

[https://news.ycombinator.com/item?id=12041210](https://news.ycombinator.com/item?id=12041210)

~~~
IanCal
Hah, I've spent years munging broken csvs and this one hit me on our first
public data release.

I still maintain my release was fine and I blame excel ;)

------
kawsper
We recently received a CSV from a bank where they didn't escape properly, so
one of the entries was like this:

    
    
        "","C/O COMPANYNAME"SUBHEADLINE"","",(...)
    

So yeah, they probably wrote this themselves, and didn't expect a valid
company name to contain these characters.
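
Since every field in that format was fully quoted, one recovery trick (a
hypothetical sketch, not necessarily the parser we wrote) is to split on the
three-character sequence `","`, which survives unescaped inner quotes:

```python
def split_fully_quoted(line):
    """Lenient split for rows where every field is wrapped in quotes but
    inner quotes are not escaped. Only works if no field contains the
    literal sequence '","'. A sketch, not a general CSV parser."""
    line = line.rstrip("\r\n")
    if not (line.startswith('"') and line.endswith('"')):
        raise ValueError("expected a fully quoted row")
    return line[1:-1].split('","')

row = '"","C/O COMPANYNAME"SUBHEADLINE"","ok"'
assert split_fully_quoted(row) == ["", 'C/O COMPANYNAME"SUBHEADLINE"', "ok"]
```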

We ended up writing a custom parser around their horrible format, because
their technical support and the technical lead we could reach couldn't
understand our issues.

:(

~~~
pdimitar
What you describe is _exactly_ the problem with CSV: many less tech-savvy
people decide to use it to provide some sort of very minimal API (even if it's
periodical manual exports sent by mail; some people consider that
automation!), don't account for the possible traps (like double quotes in the
value of any field), then call it a day -- more like a lifetime.

Then there's inertia, there's a boss saying "we absolutely must work with
these people", and... the rest is history for many of us.

NOTE: I've been writing custom CSV parsers at least 10 times during my career.
That makes roughly 2 such parsers every 3 years. And that comes from a guy
that's not very active in his career.

~~~
astrobe_
Not just with CSV. I remember a similar comment about broken XML.

It actually doesn't matter how compliant you are to the standard. Standards
are certainly nice when both parties are willing to agree on things. But when
one party is in position of strength and wants to minimize its costs or
efforts, compliance to a standard won't help you.

~~~
pdimitar
That's the sad truth. Sometimes the thought of this makes me wanna go full
Hitler on the internet -- if you can't expose a standards-compliant API, your
service will be blocked by central tech authorities.

We can dream, right?

Still though, all giants eventually fall when they become phlegmatic. Sadly
it's taking way too long.

------
nommm-nommm
>All CSVs are readable with Excel

>Excel is a good tool for working with CSVs

>Excel is an OK tool for working with CSVs

Also See: Excel: Error when CSV file starts with “I” and “D”
[https://support.microsoft.com/en-us/kb/323626](https://support.microsoft.com/en-us/kb/323626)

HN discussion:
[https://news.ycombinator.com/item?id=12041210](https://news.ycombinator.com/item?id=12041210)

------
justinsaccount
> Excel is a good tool for working with CSVs

> Excel is an OK tool for working with CSVs

> Excel can losslessly save CSVs it opens

Seriously. It's amazing how fast excel can corrupt a simple csv file.

~~~
bdcravens
I've had odd bugs show up in a CSV import process involving numeric strings
(think account numbers) with leading zeros. It turned out the imported CSV had
been cleaned by moving records around in Excel and saving as CSV (which of
course treated the account numbers as numerics and conveniently truncated the
zeros).
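
A defensive habit on the import side is to treat identifier-like columns as
opaque strings and flag values of the wrong width. A sketch (Python,
hypothetical column name and width):

```python
import csv
import io

def read_accounts(text, width=7):
    """Read rows, treating 'account' as an opaque string and flagging
    values whose leading zeros were apparently stripped.
    Column name and fixed width are made up for illustration."""
    rows, suspicious = [], []
    for row in csv.DictReader(io.StringIO(text)):
        if len(row["account"]) != width:
            suspicious.append(row["account"])  # likely a spreadsheet round-trip
        rows.append(row)
    return rows, suspicious

data = "account,amount\n0012345,10.50\n12345,3.25\n"
rows, suspicious = read_accounts(data)
assert suspicious == ["12345"]
```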

~~~
VLM
Excel thinks anything vaguely scientific-notation-ish is a float, which is a
nightmare if you have, for example, the four hexadecimal digits 2E15, which is
11797 in decimal but Excel will helpfully "fix" to 2 million billion
(depending on your locale's definition of billion, LOL).

That was a contrived example, but I've seen Excel get very confused by foreign
postal codes involving digits surrounding a letter "E", or apartment-number
columns with values like "1E".
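
The ambiguity is easy to demonstrate: the same four characters parse as a hex
integer or a float depending on who is reading them (Python):

```python
# "2E15" as four hexadecimal digits vs. as scientific notation:
assert int("2E15", 16) == 11797   # 2*16**3 + 14*16**2 + 1*16 + 5
assert float("2E15") == 2e15      # two million billion: Excel's reading
```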

~~~
tomlock
Excel enthusiastically and automatically types oh-so-many things. It'll take
dates, assume they are American MM/DD/YYYY in some fields, and then treat
dates like 13/01/1986 in the same column as strings.

I had that exact problem when working with a company that had a stocked item
with the code "060E2". When people don't know scientific notation, the result
is bewildering!

------
libria
This should be titled "What Vendors Should Account For When Publishing CSVs".
If someone claims they're exporting a file in X format:

* X should be a known standard OR

* they should provide a client parser for X OR

* they should provide a grammar or list of rules for parsing X

This rant is focused in the wrong direction. It should go without saying that
someone should deliver what they claim to. If CSV means different things to
different people, that's fine too, but the burden should be on the exporter to
define what it means to them.

I'll accept a retitling of "Bullshit Developers Shouldn't Have to Question
About a Common File Format".

------
TallGuyShort
It's literally impossible for CSV to be a useful format if you cannot make any
of these assumptions. Every field with a distinct encoding? Really? The focus
needs to be on making CSVs about which you can reasonably assume a lot of
these things so that you can actually write software that processes them.

------
chc
Is there anyone out there who actually believes all CSVs are UTF-16? That's
such a weird idea that I have no idea how somebody would arrive at it.

That's what always gets me about this genre of list — I often find myself
scratching my head, trying to figure out if I've missed something or if they
really listed that thing as a falsehood programmers believe.

~~~
VLM
Possibly it's asymmetric political power at its finest, but I am aware of at
least one "business intelligence" application that consistently outputs its
CSV files as UTF-16BE (not LE).

I'm not even sure what planet it's from, much less the application; I just get
stuck importing whatever they feel like sending me.

I don't know why everything that upstream sends me is in UTF-16BE format; it's
just what they do. Perhaps they hate me, or my employer. I dunno. Perhaps it's
an intelligence test to cut down on support costs: you must be this smart or
smarter to mess with our data, or go away.

~~~
leesalminen
> Perhaps it's an intelligence test to cut down on support costs: you must be
> this smart or smarter to mess with our data, or go away.

I like this design choice. I'll keep it in mind ;)

------
donatj
I question the title change from "Falsehoods Programmers Believe About CSVs".

It's not really a list of problems so much as a list of incorrect things
people believe, and reading the new title alongside the list, without the page
title/description, would make no sense to the reader.

I guess the HN gods know best though.

------
Walkman
One of the most common mistakes is a delimiter inside a field, e.g. a comma
used as a thousands separator...

~~~
daturkel
Particularly problematic is that a human cannot necessarily determine with
100% confidence where the error is. If you have a CSV where every row has 5
entries, and you come upon this row:

> 100,1,200,300,10,100

You suspect that an in-field comma is causing a split where there shouldn't be
one, but who's to say which number got split? Was 1200 split into 1,200, or
200300 into 200,300, or 10100 into 10,100?
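
The candidates can be enumerated but not resolved: every adjacent pair of
tokens is equally plausible (Python sketch):

```python
def candidate_rows(tokens, expected):
    """Given a row with exactly one field too many, return every way of
    re-joining one adjacent pair of tokens with a comma. All candidates
    are equally valid without outside knowledge."""
    assert len(tokens) == expected + 1
    out = []
    for i in range(len(tokens) - 1):
        merged = tokens[:i] + [tokens[i] + "," + tokens[i + 1]] + tokens[i + 2:]
        out.append(merged)
    return out

rows = candidate_rows(["100", "1", "200", "300", "10", "100"], 5)
assert ["100", "1,200", "300", "10", "100"] in rows   # 1,200?
assert ["100", "1", "200,300", "10", "100"] in rows   # 200,300?
assert ["100", "1", "200", "300", "10,100"] in rows   # 10,100?
```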

------
teilo
Having developed on a system that had to import and collate insurance claims
data from multiple providers, I've come to dread CSVs for all of these
reasons.

------
afarrell
One thing to remember if your company offers a customer-facing "export to CSV"
button:

The format of that CSV is part of your public API and changes to it can cause
pain for your customers. Don't be caught off-guard if they've built a workflow
around "download this CSV, put it into this excel spreadsheet and run this
VBscript on it" or "download this CSV into this iPython notebook and feed it
to FOO".

------
amyjess
I remember at my last job, mentioning something about saving a Google Sheet in
CSV format, and my boss immediately said "I'm morally opposed to CSV."

He then told me about TSV, and since then I've come to accept it as a superior
format. With the kind of data I dealt with at my job, we never encountered
tabs within a column, but commas within columns were very common, so TSV was
perfect for us. A lot of what I was doing was ad-hoc processing of various
sheets (nothing that was really production-grade), so a lot of time I'd just
write code to read in a file and split each line on \t, and that was good
enough. And maybe I'd throw out any lines that began with #, but that was
about it for processing.
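
That ad-hoc style amounts to roughly this (Python sketch, with the same
caveats: it assumes no tabs or newlines inside fields):

```python
def read_tsv(lines):
    """Minimal ad-hoc TSV reader: split each line on tabs and skip '#'
    comment lines. Assumes fields never contain tabs or newlines."""
    return [line.rstrip("\n").split("\t")
            for line in lines if not line.startswith("#")]

sample = ["# exported sheet\n", "name\tcity\n", "Acme, Inc.\tNew York\n"]
assert read_tsv(sample) == [["name", "city"], ["Acme, Inc.", "New York"]]
```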

Actually, we _did_ have a TSV parser in our codebase, but my boss told me that
it has problems and he never liked it and that I should just roll my own
inline with my code.

------
mnarayan01
This has a lot of good stuff in it, but I think it's hurt by including stuff
where you just have to make an assumption, E.g. I can interpret the following:

    
    
      "Some\tText"
    

in at least three different non-ridiculous ways, even assuming the encoding is
known. It's worth keeping this in mind, but intertwining it with other
_tractable_ assumptions feels like it hides some of the super-useful nuggets
contained within. That said, I'll admit the dividing line is fairly
indistinct, plus it's still a great list.

------
prodtorok
Any known web (JS) workarounds for parsing CSVs with inconsistent file
encodings? Right now my conclusion is that you must know the encoding before
parsing, though I've come across some decent libraries that can "guess" the
encoding:

[https://github.com/aadsm/jschardet](https://github.com/aadsm/jschardet)

[https://github.com/sonicdoe/detect-character-encoding](https://github.com/sonicdoe/detect-character-encoding)

~~~
tantalor
Not sure what you mean by "inconsistent", but in general you have to know the
character set before you read the file.

[https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)

Guessing will lead to tears.

~~~
rspeer
If it's a mix of UTF-8 and a single-byte encoding, you can guess, because
UTF-8 is nice enough to be obvious when you guessed wrong.

The command-line version of ftfy [1] can handle a distinct encoding for each
line with a near-zero rate of false positives. (No guarantees the output will
be _correct_ , especially if you read it as the wrong single-byte encoding
initially, but it will be _no worse_ than what you put in, because it won't
replace correct text with incorrect text.)

There is no JS version of ftfy, and there won't be until JS can represent all
Unicode characters without surrogates.

[1] [https://github.com/LuminosoInsight/python-ftfy](https://github.com/LuminosoInsight/python-ftfy)
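
The core trick, minus everything ftfy actually does, looks roughly like this
(Python sketch; ftfy itself is far more careful):

```python
def decode_lenient(raw: bytes, fallback="latin-1"):
    """Decode each line as UTF-8 where possible, falling back to a
    single-byte encoding otherwise. UTF-8 rarely validates by accident,
    so a successful decode is a strong signal. Sketch only."""
    out = []
    for line in raw.split(b"\n"):
        try:
            out.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            out.append(line.decode(fallback))
    return "\n".join(out)

# One UTF-8 line and one Latin-1 line in the same stream:
mixed = "café\n".encode("utf-8") + "café".encode("latin-1")
assert decode_lenient(mixed) == "café\ncafé"
```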

------
sbov
43\. This list matters for all applications that parse or create CSVs.

------
dwe3000
I think many of the problems he lists are because, as he properly notes, a
great many people believe "ALL" or "NEVER" are appropriate words to use when
considering CSV files (most of his points use one or the other).

But in all fairness, it's worth pointing out that parsing CSV files for values
is a lot better than parsing free-form text, even if it's nowhere close to
XML.

------
justinclift
It's probably time someone did a follow-up RFC, specifying further useful
things, e.g.:

* how to encode binary data

* how to encode NULLs (for transport between databases)

* how to specify the encoding

* (etc.)
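
For the first two, here's what such conventions could look like (a
hypothetical sketch borrowing `\N` from PostgreSQL's COPY format and base64
for binary; no standard actually specifies this for CSV):

```python
import base64

# Hypothetical field-level conventions such an RFC might pick:
#   NULL   -> the bare token \N
#   binary -> base64, prefixed with "b64:"
def encode_field(value):
    if value is None:
        return r"\N"
    if isinstance(value, bytes):
        return "b64:" + base64.b64encode(value).decode("ascii")
    return str(value)

def decode_field(token):
    if token == r"\N":
        return None
    if token.startswith("b64:"):
        return base64.b64decode(token[4:])
    return token

for v in [None, b"\x00\xff", "plain"]:
    assert decode_field(encode_field(v)) == v
```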

------
rch
Last time CSVs came up I was downvoted without comment for mentioning HDF5.

I know the CLI tools could use some improvement, and a vim plugin would be
nice, but are there other concrete objections?

~~~
dsp1234
If the requirement is to read CSV files from a third party, then suggesting
HDF5 isn't helpful.

If the requirement is to write CSV files to send to a third party, then
suggesting HDF5 isn't helpful.

So this limits the usefulness of mentioning HDF5 to when you are both the
producer and consumer of the data, but that's the easy case because then you
can choose anything up to and including strongly typed binary formats.

~~~
rch
Thanks - that's fair criticism.

------
kodfodrasz
I love the fact that if I want to work with CSV from C# I need to use a
VisualBasic namespace:

Microsoft.VisualBasic.FileIO.TextFieldParser

see: [http://stackoverflow.com/questions/3507498/reading-csv-files-using-c-sharp](http://stackoverflow.com/questions/3507498/reading-csv-files-using-c-sharp)

------
vortico
Stop this horrible genre now! As discussed in the recent video programming
post
([https://news.ycombinator.com/item?id=13259686](https://news.ycombinator.com/item?id=13259686)),
this neither helps the beginner because it is unnecessarily overwhelming, nor
the expert because he has already read
[https://tools.ietf.org/html/rfc4180](https://tools.ietf.org/html/rfc4180) and
is aware of these problems with bad CSV parsers/emitters. In the intermediate
case, it could be useful for someone maintaining a CSV parser or emitter who
is not aware of a few of these points and wants 100% compatibility with all
CSV parsers/emitters in existence, but the list style is confusing
grammatically and offers no counterexamples or suggestions for improvement.

With that said, if you are interested in using CSV for encoding and decoding
in your own environment, I suggest a new file standard called μTSV.

1\. All μTSV files are UTF-8.

2\. All values are delimited by "\t", and all lines must end with "\n"
(including the last one).

3\. If you want to use a "\t" or "\n" character in your value, tough luck---
use JSON.
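
As stated, the three rules fit in a few lines (Python sketch; the
reject-and-punt behavior is rule 3):

```python
def dump_utsv(rows):
    """Serialize rows per the three rules: UTF-8, tab-delimited, a
    trailing newline on every line, and no tabs or newlines in values."""
    for row in rows:
        for value in row:
            if "\t" in value or "\n" in value:
                raise ValueError("tough luck -- use JSON")  # rule 3
    return "".join("\t".join(row) + "\n" for row in rows).encode("utf-8")

def load_utsv(data):
    """Parse μTSV bytes back into lists of strings."""
    text = data.decode("utf-8")
    if text and not text.endswith("\n"):
        raise ValueError("last line must end with \\n")  # rule 2
    return [line.split("\t") for line in text.split("\n")[:-1]]

rows = [["a", "b"], ["c,d", "e"]]  # commas need no escaping
assert load_utsv(dump_utsv(rows)) == rows
```

Round-tripping is trivial precisely because rule 3 forbids the awkward
characters.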

~~~
donatj
I mostly just thought it was fun to write. Put my pain down on a page, get a
little bit of catharsis. I actually had written out some "counterexamples or
suggestions" but decided to keep it terse and with the general style of the
other similar posts.

I was not aware of the scorn for the style, I just knew I enjoy reading them.

As opposed to your μTSV, may I suggest DSV? The Art of Unix Programming makes
a strong case for how it's superior to CSV [1]

1\.
[http://www.catb.org/esr/writings/taoup/html/ch05s02.html#id2...](http://www.catb.org/esr/writings/taoup/html/ch05s02.html#id2901882)

~~~
vram22
Heh, by coincidence, I had recently written two posts about DSV, and had
mentioned the same TAOUP DSV link you gave above:

Processing DSV data (Delimiter-Separated Values) with Python:

[https://jugad2.blogspot.in/2016/11/processing-dsv-data-delimiter-separated.html](https://jugad2.blogspot.in/2016/11/processing-dsv-data-delimiter-separated.html)

The example program in the post lets you use either a "-c delimiter_char" or a
"-n delimiter_ASCII_code" to specify the delimiter for your DSV data being
processed by it.

and

[xtopdf] Publish DSV data (Delimiter-Separated Values) to PDF:

[https://jugad2.blogspot.in/2016/12/xtopdf-publish-dsv-data-delimiter.html](https://jugad2.blogspot.in/2016/12/xtopdf-publish-dsv-data-delimiter.html)

~~~
groovy2shoes
You're both wrong. EDN is the best.

[https://github.com/edn-format/edn](https://github.com/edn-format/edn)

~~~
vram22
>You're both wrong.

Bah. How am I wrong when I haven't even claimed DSV to be the best? Just
quoting that section in ESR's book does not imply I endorse or agree with it.
Logic ...

~~~
vram22
Plus, just providing that EDN link is hardly proof that it is the best ... And
further, "best" in many cases is subjective. You want to claim EDN is the
best, you lay out an objective proof.

~~~
zdkl
I get that the parent response about EDN isn't particularly enticing, but you
really should give it a look.

~~~
vram22
I do intend to. Thank you for the comment though. I had starred EDN on github
just now before replying to groovy2shoes, i.e., I took an initial brief look
at it. Saw a commit by richhickey (Clojure creator). Since I'm into data /
file formats, I will definitely check out EDN.

~~~
groovy2shoes
Sorry, my intent (beside bringing up EDN) was to be playful. Tone is often
difficult to communicate in writing. I would have stuck in a `:p` if I hadn't
been on mobile, but it slipped my mind. My apologies; I did not mean to insult
or to otherwise be combative or critical.

I don't have objective proof that it's the best. It's a mix of personal
preference and empirical "evidence" based on past success with both EDN and
its granddaddy s-expressions for data exchange and serialization.

As for why I personally like EDN so much:

1\. It offers the malleability, flexibility, and extensibility of XML, while
being

2\. even more concise than JSON,

3\. straightforward to parse, and

4\. precisely defined in a rigorous and language-independent specification.

Traits 1-3 it shares with the traditional s-expressions of the Lisp family,
but in contrast to s-expressions it's specifically designed as a data exchange
format rather than as a lexical/syntactic format for programming languages.
The reason this matters is because the traditional Lisp reader is itself
extensible, and extensions can't be guaranteed to carry over from dialect to
dialect or even from implementation to implementation. Many Lisp readers go so
far as to even allow the execution of arbitrary code at read time, which is
desirable for source code as it enables metaprogramming, but it's not so
desirable when you're parsing data from an arbitrary source, due to security
concerns.

While EDN does have its roots in Clojure's reader, EDN is _not_ exactly
Clojure's external representation. Rather, it's a subset of Clojure's surface
syntax in much the same way that JSON is a subset of JavaScript's. Like JSON,
EDN works great outside of its "native language" (in fact, I've never used EDN
from Clojure itself; I've only used it from Lua, Scheme, Python, C, C++, C#,
Java, and JavaScript (not necessarily in that order)).

~~~
vram22
>Sorry, my intent (beside bringing up EDN) was to be playful. Tone is often
difficult to communicate in writing.

I understand, and agree it can be difficult. Thanks for clearing that up, and
no worries at all. Interesting info, BTW, about EDN. Will reply again in more
detail later today with any comments/questions.

~~~
vram22
I'll need some time to check out EDN, so will message you with any comments
later, since this thread might expire. Thanks again for letting me know about
it.

~~~
zdkl
I'm also interested to hear your perspective

~~~
vram22
Cool. Will try to comment before the thread expires, then (not sure what the
period is), else will post the points here as a new thread under my id
(vram22).

------
SixSigma
See also: XML

~~~
rpd9803
I mean, it's a bit apples and oranges. XML's well-formedness rules are, in my
experience, fairly universal, so parsing isn't as much of an issue as poorly
constructed XML schemas (and schema-less XML documents). I know there are edge
cases, but overall XML doesn't have many of the problems specified.

~~~
duskwuff
XML has a _specification_. There are two well-defined categories of
well-formed and ill-formed XML files, and there's exactly one correct way to
parse the well-formed ones.

~~~
SixSigma
You still write a list like

1: it is well formed

2: the person who produces it will fix it when you tell them

3: the encoding is correct

etc.

Even ycombinator's rss feed is only well formed but not fully specified

[http://validator.w3.org/check?verbose=1&uri=https%3A%2F%2Fne...](http://validator.w3.org/check?verbose=1&uri=https%3A%2F%2Fnews.ycombinator.com%2Frss)

