
Problems with CSVs (2016) - charlieirish
https://donatstudios.com/Falsehoods-Programmers-Believe-About-CSVs
======
nightcracker
I disagree with this list of "falsehoods programmers believe".

Most of those lists describe objectively wrong assumptions programmers have
about some real-world phenomenon. But in this case I'd argue that a lot of
these points are simply not correctly formatted CSV files. Just because
someone handed you a file with the extension .csv does not mean it's a CSV
file proper, and it certainly does not mean that you have to guess at what it
intends to encode without assuming any of the things on the list.

For example, "All CSVs contains a single consistent encoding". If this is not
the case I'd (rightfully) reject the file as being a proper CSV file.

~~~
DeusExMachina
You can certainly reject the file, but that's not going to make your user
happy. The (sad) reality is that you often have to deal with these edge cases
in software.

The user does not care that the CSV is not proper. They just want to open it,
and if it doesn't work, they will blame your program, not the source of the
CSV.

And you only control the former. I have had to deal with a CSV export from a
web service that was "not proper". I notified their support, but that problem
was never fixed.

Luckily, the user in that case was just me. But if I was making software for
someone else, blaming the right source of the problem would not have taken me
anywhere.

~~~
sametmax
If you put up a phone number form and the user puts letters inside, you will
just say "invalid number". You won't try to parse it.

Same for invalid CSV.

There is nothing wrong with this. You can't accept all the garbage from the
user.

~~~
doodpants
> If you put up a phone number form and the user puts letters inside, you
> will just say "invalid number". You won't try to parse it.

Wait, what? Phone numbers could always include letters. Many businesses get
phone numbers that include or consist of words relevant to their business, to
make them memorable. And I've known people whose personal number happens to
contain a word or words, so they give the letter version out because it's easy
to remember. If a phone number field doesn't include simple logic to map
letters into the proper numbers, that's just lazy programming.

~~~
dronescanfly
Wait what? There are countries that put letters into phone NUMBERS?

~~~
lmkg
This is what the average phone pad looks like in the United States:

[http://www.pachd.com/free-images/technology-images/phone-dia...](http://www.pachd.com/free-images/technology-images/phone-dial-pad-01.jpg)

There is a standard, universal mapping of letters onto numbers. This allows
companies to advertise their phone number as something memorable like
1-800-CASH-4-AU and everyone knows how to dial it. People don't even need to
remember the mapping of letters, because they're printed on every keypad.
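Since the mapping is fixed, normalizing a vanity number before validation is only a few lines. A minimal Python sketch (the function name is mine):

```python
# Standard keypad letter groups (ITU E.161 / North American keypads).
KEYPAD = {"2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
          "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ"}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}

def normalize(number: str) -> str:
    """Map letters to their keypad digits; pass everything else through."""
    return "".join(LETTER_TO_DIGIT.get(c.upper(), c) for c in number)

print(normalize("1-800-CASH-4-AU"))  # 1-800-2274-4-28
```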

------
emodendroket
This post, on a different subject, captures my feelings:

[https://news.ycombinator.com/item?id=13260082](https://news.ycombinator.com/item?id=13260082)

> I honestly think this genre is horrible and counterproductive, even though
> the writer's intentions are good. It gives no examples, no explanations, no
> guidelines for proper implementations - just a list of condescending
> gotchas, showing off the superior intellect and perception of the author.

Like, OK, I shouldn't use sep. Good to know. What should I use instead? Why
tell people that \ isn't the escape delimiter without explaining the way the
quoting system works?

And frankly, the stuff about Excel is divorced from reality. More than 90% of
the time, the reason you're making a CSV is because somebody wants to look at
the data in Excel and you don't want to deal with xlsx. If your concern is
something else CSV is probably the wrong choice. Thus, for most programmers,
Excel is the reference implementation of CSV.

~~~
ubermonkey
I think there's more going on than condescending gotchas here. I see the whole
genre as fairly tongue-in-cheek enumerations of pitfalls that, all too often,
are baked into projects as unexamined assumptions.

When we read them, we have the opportunity to check our own assumptions --
about the subject at hand (today, CSVs) and also (hopefully!) about other
subjects we may encounter later.

~~~
emodendroket
I just don't know what thought process I'm supposed to go through when someone
tells me writing `sep=,` in a CSV is horrible but doesn't offer any
alternative way of fixing the problem that it solves.

~~~
ubermonkey
Maybe the title scheme is throwing you off.

I read these as "a list of things people have assumed about X that, on the
surface, seem true but absolutely won't be 100% of the time."

IOW, examine your assumptions, and protect yourself from your unexamined
biases and assumptions.

No?

~~~
emodendroket
At best, I'm now aware that there's some situation where I should avoid it,
but I have no idea why, what the situation is, or what options I have.

------
sbarre
I've had to deal with CSV data and Excel a lot in my career, and I learned one
trick (sorry) a few years back that has made my life so much better:

Here's a scenario I bet many people have faced: Export an Excel sheet to CSV
that has high-ASCII characters in it like accents. The export gets mangled
when you load it into code or a text editor after. You eventually just upload
it to Google Sheets and export it from there instead. It works but it's a
pain.

Instead of exporting it as a CSV from Excel, export it as a UTF-16 TXT, which
is basically a TSV file.

That will correctly preserve all the character encoding.

I can't promise this will work 100% of the time but it has resolved many many
encoding issues going to/from Excel.
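To make this concrete: Python's stdlib can read that UTF-16 export directly. A hedged sketch (the helper name is mine), assuming the file came from Excel's "Unicode Text" save-as option:

```python
import csv

def read_excel_unicode_txt(path):
    """Read Excel's 'Unicode Text (*.txt)' export: UTF-16 with a BOM,
    tab-delimited. encoding='utf-16' detects and strips the BOM for us."""
    with open(path, encoding="utf-16", newline="") as f:
        return list(csv.reader(f, delimiter="\t"))
```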

~~~
masklinn
My one trick there is to just not ask for CSV, and import Excel files
directly. Most languages have libraries to read basic (non-formula) Excel
files; they work fine and provide richer data models than CSV (though
somewhat risky, as you have to deal with formatting and with finding e.g.
numbers where you expected strings and the other way around).

~~~
purple-again
This is unacceptable in large business settings today. A list of assets alone
can be tens of millions of lines. A list of transactions? Oh boy time to break
out Rust if you want your database to be finished this month.

~~~
masklinn
> This is unacceptable in large business settings today. A list of assets
> alone can be tens of millions of lines. A list of transactions?

The original comment specifically mentioned Excel. Excel cannot work on "tens
of millions of lines", so your objection is irrelevant.

If your data does not come from Excel, there is obviously no reason to make it
go through Excel. There is also no reason to make it go through CSV; make it
go through an actual serialisation format which is not mostly unspecified
garbage, since you're programmatically serialising it for interchange and
hopefully not expecting human editing of the intermediate megabytes of data
dump.

------
jstimpfle
I've spent a lot of time thinking about a better format that is close enough
to CSV to be practical, but has more precisely defined semantics and
structure, also to support better usability (decreasing the need for manual
integrity checks after parsing). I wanted at least a defined encoding and
defined schema (table definitions with fixed number of typed columns).
Optionally Unique Keys and Foreign Keys, but that quickly leads to a situation
where there are more possible features with diminishing returns to consider.

I ended up with this [1] and a Python implementation [2], and it turned out
not too bad. I've also done a more pragmatic C implementation (a couple
hundred LOC) in a toy project [3] (wsl.c and *.wsl files), and it turned out
quite usable.

I think what prevents adoption of such a thing is that it's very hard to
standardize on primitive types and integrity features.

[1] [http://jstimpfle.de/projects/wsl/main.html](http://jstimpfle.de/projects/wsl/main.html)

[2] [https://github.com/jstimpfle/python-wsl/](https://github.com/jstimpfle/python-wsl/)

[3] [https://github.com/jstimpfle/learn-opengl](https://github.com/jstimpfle/learn-opengl)

~~~
jacquesm
It's the age old trap of trying to fix a format or a protocol, only to end up
with yet another format or protocol.

~~~
jstimpfle
I don't think you can fix CSV. You could define a
reasonable-by-today's-standards subset - I'd say, UTF-8, 0x20 delimiters and
per-lexeme choice of C-style identifiers or string literals.

But I need more structure. My format supports an arbitrary number of tables in
a single file, some automated integrity checking, and direct conversion of
text from/to runtime structures (e.g. C structs, possibly with automatically
allocated backing buffers for variable-width datatypes) with only little
boilerplate. So yes, that's a different format.

------
yoz-y
Many of the complaints stem from the fact that many CSV writers do not follow
any spec and produce garbage data. Kind of like when the industry decided that
reading badly formed HTML is beneficial.

For the sake of sanity I'd recommend splitting the logic of reading csv into
two parts. One that takes garbage and produces a correctly formatted file
(e.g. properly quoted values, equal number of columns, single encoding) and
the second part that actually does CSV parsing adhering to strict rules.
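A minimal Python sketch of that split (the cleanup rules here are placeholders; real ones depend on the garbage you actually receive):

```python
import csv
import io

def sanitize(raw: bytes) -> str:
    """Stage 1: produce a decodable, EOL-consistent text blob."""
    text = raw.decode("utf-8", errors="replace")           # force one encoding
    return text.replace("\r\n", "\n").replace("\r", "\n")  # one EOL convention

def parse_strict(text: str, columns: int):
    """Stage 2: strict parsing; reject rows with the wrong field count."""
    for i, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != columns:
            raise ValueError(f"row {i}: expected {columns} fields, got {len(row)}")
        yield row

rows = list(parse_strict(sanitize(b'a,b\r\n"c,d",e\n'), columns=2))
# rows == [['a', 'b'], ['c,d', 'e']]
```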

~~~
edanm
That's actually a very interesting idea/pattern for dealing with problems.

However, I'm not sure what it means practically speaking on a real project -
most of the time, you're presenting the user with some kind of upload screen,
then do processing behind the scenes, and display results. You can (and
probably _should_ ) structure things such that you're internally converting
the "bad" csv file to a "good" csv file, and parsing it normally afterwards,
but this is all behind the scenes anyway.

------
dahart
The content of this article is great. But the title represents its own bad
assumption & falsehood. Just because a CSV doesn't conform to the spec doesn't
mean the person who produced the CSV file believes their CSV file to be
correct or misunderstands that CSV is complicated. For most devs I've dealt
with, the reason has been conscious and admitted laziness, trying to get
something done faster knowing it's not 100% correct. Conforming to the spec is
harder and more confusing than splitting on commas or tabs in a script, while
splitting on commas or tabs works 80% of the time.

It's lame and lazy to write & read CSV with one-off code, but we already knew
that; CSV just looks deceptively simple, so the temptation is strong. But
what we most need isn't the list of things we're doing wrong. We already know
we're doing it wrong. What we need to do is to use a library that does it
right. What we need to have is a list of libraries and tools that are easy to
use and fix all the items in the author's list. It might also be useful to
suggest simple things a hacker can do in a couple lines of code in a one-off
CSV reader/writer that covers as many cases as possible.

------
JoeAltmaier
Garbage data has been with us since the beginning, and will always be with us.
Whether it's bad CSV, HTML, XML, or JSON, half the data integrator's job will
be 'sanitizing' the data, which means dealing with exceptions in an empirical
manner.

In our first effort at absorbing the Wall Street financial data satellite
stream (stock trades, bond rates and all the rest), we found that every day
there would be new bad data. Trades out of order; missing decimal points;
badly spelled security names; clumsy attempts to 'edit in' trades after close.
The world's financial data should probably have been better managed, but it
was all hand-generated back then (the '90s) and mostly viewed as a stream by
brokers on a terminal screen, so the human filter could understand most of the
BS. But a database, not so much.

------
outsideoflife
As someone mentioned in the comments on the article, I think it is very common
for people to use LibreOffice Calc to work with CSV because Excel does not
handle UTF-8 all that well. In LibreOffice you can open an Excel workbook and
export a CSV in UTF-8 and ask it to double-quote all of the fields too (which
is a very good thing to do to CSV files).

------
roel_v
While 33 and 34 are true (33. Excel is a good tool for working with CSVs 34.
Excel is an OK tool for working with CSVs), there is one reality that makes
them irrelevant: when working with data that is in any way or form touched
manually during its lifespan (that includes 'looking at it'), dealing with
Excel is inevitable.

~~~
jameshart
Right, the more important widely-held falsehood is "If you tell people not to
edit your CSV file in Excel, they won't."

------
donatj
Author here, just waking up. I object to the title change in the strongest
terms. It’s simply not “Problems with CSVs”. That’s not at all what the post
is.

The list isn’t problems, and if you read it as a list of problems it’s
nonsense.

Also why on earth did the fact that it is 14 months old need to be noted in
the title - has anything changed in the last 14 months? Not that I am aware
of.

------
LeonM
47. CSVs are an API

48. CSVs are useful to make machine-2-machine interfaces

49. CSVs for importing/exporting data eliminates all those pesky programmers
taking up too much time with the API design

50. CSVs are real time

51. CSVs are a robust mechanism to export/import data

~~~
endriju
52. All tools working with CSV follow RFC 4180[1]

[1] [https://tools.ietf.org/html/rfc4180](https://tools.ietf.org/html/rfc4180)

------
tlrobinson
> All CSVs contains a single consistent encoding

> All records contain a single consistent encoding

> All fields contain a single consistent encoding

What? How are you expected to handle these cases?

Generally you have to guess the encoding of a plain text file like a CSV. I’m
fairly sure the common case is that the entire file will be in a single
consistent encoding. If you were to guess per-record or per-field, I suspect
it’s more likely you’d guess some records/fields wrong than that you’d
encounter a file with varying encoding.

I’d be interested to see some real world examples of CSVs with varying
encoding and how existing software handles that.
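The pragmatic version of that whole-file guess, as a hedged Python sketch (encoding order is my assumption, not anything from the article):

```python
def decode_csv_bytes(raw: bytes) -> str:
    """Try encodings from strictest to most permissive. A successful UTF-8
    decode is a strong signal; cp1252 accepts almost anything, so it goes
    last among the guesses; latin-1 maps every byte, so it never fails."""
    for enc in ("utf-8-sig", "utf-8", "cp1252"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")

print(decode_csv_bytes("héllo".encode("cp1252")))  # héllo
print(decode_csv_bytes("héllo".encode("utf-8")))   # héllo
```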

------
sly010
A lot of broken encoding is the result of various copy pasting and OCR issues
during data collection.

If there is one thing I could change about all spreadsheet software, it would
be to paste without formatting by default. It would make everyone's life so
much easier...

------
bscanlan
Dealing with the backscatter from CSV misunderstandings can be fairly
challenging - for a lot of us, the customer experience is improved by being as
accommodating as possible instead of correct. We at Intercom released a Ruby
CSV parser that "is a ridiculously tolerant and liberal parser which aims to
yield as much usable data as possible out of such real-world CSVs".

[https://github.com/intercom/hippie_csv](https://github.com/intercom/hippie_csv)

------
imtringued
The biggest problem with CSV is that it looks easier than it actually is, so
people go ahead and write their own CSV "printer"/parser, which is usually
just a ",".join(["1", "2", "3"]) or "a,b,c".split(",").

In reality CSV has a complexity similar to JSON's. You have to consider the
possibility of quoting, escaping, encoding, delimiters, etc. You should
always use a library to generate and parse CSV to avoid these issues.
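To illustrate with Python's stdlib: naive splitting breaks as soon as a field contains the delimiter, while the csv module handles it.

```python
import csv
import io

line = '"Smith, Jane",42,"He said ""hi"""'

# Naive split breaks on the embedded comma and keeps the quote characters:
print(line.split(","))
# ['"Smith', ' Jane"', '42', '"He said ""hi"""']

# The stdlib csv module handles quoting and doubled quotes correctly:
print(next(csv.reader(io.StringIO(line))))
# ['Smith, Jane', '42', 'He said "hi"']
```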

------
scrumper
(Needs a 2016 in the title).

CSV is a great example of what happens when you do the opposite of "be strict
in what you emit, be liberal in what you accept." It doesn't matter though,
because it's just about good enough, and you almost never want a _general_ CSV
solution, rather you need one specific to a particular problem or workflow (so
you can handle your situational idiosyncrasies.)

------
ex3ndr
Any alternatives to CSVs? In our startup we are using one JSON object on each
line, but maybe there is a more established format for doing this?

~~~
dtech
XML, JSON, any specified format is a better machine-to-machine format than
CSV.

I've seen the JSON-object-per-newline pattern a few times, but why not use a
JSON array then? Then the whole file remains valid JSON.

~~~
dalailambda
In my experience, the reason for a JSON object per line is that a tool can
then split the entire file by newlines and have a list of objects to start
parsing/processing in parallel. This avoids having to parse the entire file
up front to get usable data, and lets the tool start processing things while
some data is still being parsed.
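A small sketch of the trade-off: the whole file is no longer a single JSON document, but each line stands alone.

```python
import json

records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

# JSON Lines: one object per line. Appendable and splittable by newline,
# since json.dumps never emits a raw newline inside a single object.
jsonl = "\n".join(json.dumps(r) for r in records)

# Each line parses independently -- no need to slurp the file first.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == records
```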

------
meuk
Excel is a horrible tool, in general. I remember that I exported a very long
list of numbers to a .CSV. Excel then formatted the numbers (like 53564566934
to the form 5.3564E+10).

If you then copy and paste, the numbers are actually converted to 53564000000.
Result: I had to do _a lot_ of work again. Which is partly my fault, and
partly horrific design.

------
Sami_Lehtinen
Just saying, great timing, because I just today had this discussion covering
everything on the list: different character encodings in one file, fields
which contain line feeds, fields which contain field delimiters without
escaping or quoting, and so on. But that's very common. People use whatever
source for data, copy-paste it into Excel, and then think it's good CSV after
that. I usually handle CSV as required; every file can be, and usually is,
different. You write I/O code according to case-specific requirements. In
many cases, this means some manual fixing and a custom parser. -> Job done.

Edit: Honestly, I don't even remember when I last saw an RFC 4180-compliant
file. That's just the truth out there.

------
dspillett
A couple of extras:

* CSV files will have consistent EOL markers

* CSV files will always have a trailing EOL marker

* CSV files will never have a trailing EOL marker

* Any file with a name ending .csv is a CSV file (or something close to)

 _> Excel can losslessly save CSVs it opens_

A particular problem we've had many times with some clients is Excel messing
around with date/time formats: files with dates formatted YYYY-MM-DD being
changed to American style, columns containing both date and time having one or
the other chopped off, dates being converted to numeric representations, ...

------
woodruffw
I've occasionally wondered why we (programmers) don't take greater advantage
of ASCII control codes[1] for structuring data -- many of the pain points of
CSV/TSV could be addressed by using US (unit separator) or RS (record
separator) bytes instead. You'd still have to handle binary values, but that's
a problem shared with CSV/TSV.

[1]: [https://ascii.cl/control-characters.htm](https://ascii.cl/control-characters.htm)
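A sketch of what US/RS buy you: commas, quotes, and newlines in the data need no escaping at all, so long as the data never contains the separator bytes themselves (the function names are mine):

```python
US, RS = "\x1f", "\x1e"  # ASCII unit separator, record separator

def dump(rows):
    """Join fields with US and records with RS; no quoting layer needed."""
    return RS.join(US.join(fields) for fields in rows)

def load(blob):
    return [record.split(US) for record in blob.split(RS)]

rows = [["a,b", 'say "hi"'], ["line\nbreak", "plain"]]
assert load(dump(rows)) == rows  # commas, quotes, newlines all round-trip
```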

~~~
rjbwork
Because most tools don't show them, so they're useless. Users don't know what
a character even is if they can't type it and see it.

~~~
woodruffw
That's valid, but it's also a tooling/expectation issue -- most users are
probably interacting with CSV/TSV through a spreadsheet interface, and don't
need to know anything about the internal representation of records.

------
nebulous1
The list seems like decent advice if you are going to work extensively with
CSVs, but the title is clickbait and honestly incorrect. I haven't worked much
with CSVs, but I'd make very few actual assumptions about them. In reality we
would all ignore 99% of his list anyway, because it wouldn't be practical to
deal with any of it.

He also almost seems to be blaming the receiving programmer for not being able
to deal with a given CSV in an insane format.

------
maxxxxx
I like these "falsehoods programmers believe" articles but they also indicate
that it's really hard to find ground truth in this business. There is no
agreed-upon standard for what REST really means, or object orientation, or
how to use XML or CSV, so we just keep stumbling along, implementing what we
think it means and thinking that people who do it differently are clueless.

I guess that's the nature of technology that's moving so fast.

------
amyjess
My boss at my last company used to say "I am morally opposed to CSV".

We did a lot of stuff with data in Google Sheets, and when we exported a sheet
to disk so we could parse it in code, he would insist on using TSV instead of
CSV.

The main reason he insisted on this was because you don't really have to worry
about escapes and quoting with tab separators. And we were an NLP company, so
the data would very often include complete sentences with commas.
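For what it's worth, Python's stdlib has a built-in dialect for exactly this; fields full of commas come out unquoted:

```python
import csv
import io

rows = [["sentence", "label"],
        ["Yes, commas, everywhere, here.", "positive"]]

buf = io.StringIO()
# "excel-tab" is the stdlib's tab-separated dialect; commas in fields
# need no quoting (tabs or newlines in the data still would, though).
csv.writer(buf, dialect="excel-tab").writerows(rows)
print(buf.getvalue())
```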

------
geraldbauer
FYI: I've started to collect ideas and initiatives on how to evolve the
world's most popular format in the Awesome Comma-Separated Values (CSV) page -
What's Next? - Frequently Asked Questions (F.A.Q.s). Read more @
[https://github.com/csvalues/awesome-csv](https://github.com/csvalues/awesome-csv)

------
forinti
That's just half the problem. You then have to put the data in a database and
you (usually) don't want to declare varchar(4000) for all your fields.

It's irritating to go through 50 million records and three of them have
letters where there should be numbers. Or weird stuff like "$nul".

------
sonofgod
An additional belief I have been disabused of:

> Excel will load valid CSV files correctly. If a field contains more than
> 32kb of text, Excel will continue outputting that text into the next cell,
> and it won't be quoted, so the CSV file appears to be flooded with complete
> garbage from that point on.

------
alexchamberlain
On using CSV as an input format, but wanting a “better” format for
storage/manipulation, [csvw][1] is an interesting choice, and can be converted
to any RDF encoding.

[1]: [https://www.w3.org/ns/csvw](https://www.w3.org/ns/csvw)

------
allengeorge
Given this massive list of falsehoods I have to ask: what _can_ programmers
believe about CSVs?

How do you go about writing a good CSV parser given that you can't assume
anything about your input data? Are there examples of safe, robust CSV parsers
that deal with all these falsehoods?

~~~
krapp
That they probably contain commas used to separate values. Maybe.

CSV is less a format and more an article of faith.

~~~
em500
Non-English-locale software often uses semicolons (since the comma is usually
the decimal separator). (Falsehoods 27-29)

~~~
krapp

        (ノಠ益ಠ)ノ彡┻━┻  <-- CSV
    

Never mind then.

------
JustSomeNobody
You shouldn't assume you know what I believe.

That being said, the vast majority of CSV files I've dealt with are very
simple and are plain ASCII separated by a comma (or whatever). But that
doesn't mean I don't _know_ there could be more to a CSV file.

------
edejong
This is exactly what is wrong with “be liberal in what you accept”: millions
of variants, like weeds in a garden, making computer parsing impossible.

------
tgb
What problem is #34 "You can safely name your first column "ID"" referring to?

~~~
astura
An Excel bug:

[https://news.ycombinator.com/item?id=12041210](https://news.ycombinator.com/item?id=12041210)

------
jaakl
Many points can be summarized as: “CSV is CSV”. What about instead writing a
summary of how to write proper CSV which is usable in all, or at least most,
cases? Like the best encoding, proper separation, headers, delimiters, and
escaping all the special chars.

~~~
empath75
He made two recommendations, if you read between the lines: 1) don’t use CSV;
2) follow the RFC.

------
geraldbauer
Big fan of CSV. Let's make it even better! I've started the CSV v1.1
initiative. Let's evolve CSV and let's add comments, blank lines, double
quotes, and more to make it easier to read and write. Learn more @
[https://csvalues.github.io](https://csvalues.github.io)

------
cup-of-tea
Why did CSV ever happen? Using printable characters as separators is really
silly. Using multiple characters (like CRLF) is silly. ASCII actually contains
four control characters for this purpose: file, group, record and unit
separator. I'm sure a lot of problems would be solved if people just used
these.

~~~
onion2k
File formats like CSV are frequently typed out by hand in an editor. Using a
non-printable character would make that harder than necessary in most
situations.

~~~
jstimpfle
To be pedantic, CR and LF are non-printable. It would be nice if FS and RS
also had easy-to-reach keys on the keyboard.

~~~
cup-of-tea
Exactly. The "proper" record separator (CRLF) is not available on the keyboard
on UNIX systems (well, half of it is). If people are typing it out by hand
(which I doubt, actually) there are many keys on standard keyboard which could
be used for the various separators (like the F keys).

~~~
jstimpfle
I don't know if CRLF is "the proper record separator". If anything, that's
probably RS (ASCII 30). Btw. in a Unix terminal you can type any of ASCII 1-26
by pressing Ctrl+[a-z], and for the remaining 0 and 27-31 there are other
combinations. It's just not as convenient as pressing a single key.

~~~
cup-of-tea
For CSV it is.

------
londons_explore
For almost all uses, CSV is the wrong format for the job.

Protobufs are usually almost a drop-in replacement, and when you use them,
you no longer have to worry about escaping, newline characters, adding
columns, data types, nulls, or any of the other painful corner cases of CSVs.

Some might say "but you can't open them in a text editor", but to those people
I give this link:
[https://stackoverflow.com/questions/34952811](https://stackoverflow.com/questions/34952811)

~~~
jimktrains2
> Protobufs are usually almost a drop-in replacement [for csv]

I'm sorry, what? In what way is it a drop-in replacement? CSVs are editable in
a text editor and natively imported and exported by a wide variety of tools.

~~~
Radim
Curious:

How many levels of indirection from end users do you need to be, in order to
consider installing an arcane linux tool and then launching `protoc
--decode_raw < message.bin` from the command line an acceptable drop-in
replacement for CSVs?

I'll take the generous interpretation and assume OP meant some very specific
developer use-cases :)

~~~
jimktrains2
GGP said "usually". Usually I would not expect users to have protoc installed.
I'd wager that even most developers don't.

Also, as an aside, sqlite would probably be a better container format for
data, and one that would be easier to integrate into other applications.

