
So you want to write your own CSV code (2014) - ColinWright
http://thomasburette.com/blog/2014/05/25/so-you-want-to-write-your-own-CSV-code/
======
ken
I wrote my own CSV code. That's only about 10 issues, and I understand them
all, and they're easy to handle, and to test (unlike, say, threading issues,
or OS bugs that don't happen on my computer).

That wasn't even close to being one of the more difficult parts of my program
to write. I've had to do zero work to maintain it. I don't seem to have ever
had any bugs filed against it. Are these supposed to be difficult? Sure, don't
write your own encryption algorithm, and avoid touching threads if at all
possible, but CSV?
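For what it's worth, the core of those issues (quoted fields, doubled quotes as escapes, commas and newlines inside quotes, CRLF endings) fits in a small state machine. A purely illustrative Python sketch (ken's code is C, and this is not it):

```python
# Minimal CSV parser sketch: handles quoted fields, doubled quotes as
# escapes, and commas/newlines inside quotes. Not ken's actual code.
def parse_csv(text):
    rows, row, field = [], [], []
    in_quotes = False
    i = 0
    while i < len(text):
        c = text[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(text) and text[i + 1] == '"':
                    field.append('"')   # doubled quote -> literal quote
                    i += 1
                else:
                    in_quotes = False   # closing quote
            else:
                field.append(c)         # commas/newlines kept verbatim
        elif c == '"':
            in_quotes = True
        elif c == ',':
            row.append(''.join(field)); field = []
        elif c == '\n':
            row.append(''.join(field)); field = []
            rows.append(row); row = []
        elif c != '\r':                 # tolerate CRLF line endings
            field.append(c)
        i += 1
    if field or row:                    # flush a final unterminated line
        row.append(''.join(field))
        rows.append(row)
    return rows

print(parse_csv('a,"b ""x"", c",d\ne,f\n'))
# → [['a', 'b "x", c', 'd'], ['e', 'f']]
```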

Sure, I wish I could have used a library. Unfortunately, once you start adding
requirements, the options start to vanish. It needs to be a C (or C-callable)
library. It needs to accept input as Unicode strings (lines), not just files
on disk, or bytes. It needs to return results line-at-a-time. It needs to have
a compatible license. It needs to support the same package manager as I'm
using (or be trivial to include, like a single file). It needs to be
reliable (essentially bug-free, or at the very least currently maintained). It
needs to be fairly efficient. What library supports all these?

> Writing CSV code that works with files out there in the real world is a
> difficult task. The rabbit hole goes deep. Ruby CSV library is 2321 lines.

I love Ruby, but that file looks a little nuts to me right now. The block
comment at the top of csv.rb is over 500 lines long. The Ruby module tries to
do everything. Just "def open(filename, mode="r", **options)" and its
docstring are 100 lines long! This class includes methods for auto-detecting
the encoding, and auto-detecting common date formats, and some kind of system
"for building Unix-like filters for CSV data".

I don't put all of that into my CSV parser, because I want my code to be
modular. I have features like data detectors and filters and open-file-by-
filename in my program, too, but they're fully independent, not tied to any
one file format. When I come up with a better way to detect dates in text, I
don't want to tweak a 6-line regex in every file format I've written.

~~~
mey
I wish the JVM was less of a pariah. I love being able to just reach for
[https://commons.apache.org/proper/commons-csv/user-
guide.htm...](https://commons.apache.org/proper/commons-csv/user-guide.html)

Here is the entire logic of the CSV parser (test code is in a separate path)
[https://gitbox.apache.org/repos/asf?p=commons-
csv.git;a=tree...](https://gitbox.apache.org/repos/asf?p=commons-
csv.git;a=tree;f=src/main/java/org/apache/commons/csv;h=83af61a0d67b70f708125b0ca9d41b09a9396178;hb=refs/heads/master)

~~~
ken
I'm not as anti-JVM as I once was, though the recent licensing changes have
left me confused and nervous. I read their licensing FAQ (again) and I still
can't tell if I'm allowed to include recent versions of their JVM in my
application.

Were I to use the JVM for my background processing needs (the thought has
occurred to me!), I'd definitely use Clojure -- 'data.csv' is less than 150
lines, including comments, for both the parser and formatter!

~~~
mey
I suggest looking very specifically into AdoptOpenJDK.

[https://adoptopenjdk.net/faq.html](https://adoptopenjdk.net/faq.html)

A lot of people have been digging into the general insanity of Java licensing
since Java 8's divergence. Here is a good overview
[https://medium.com/@javachampions/java-is-still-
free-2-0-0-6...](https://medium.com/@javachampions/java-is-still-
free-2-0-0-6b9aa8d6d244)

------
cesaref
One thing that has always puzzled me is why CSV exists, given that the ASCII
unit/record separators appear to do exactly what the comma and newline
characters are attempting to do, without using codes which could appear within
records, and have been well defined since some point in the '60s.

And here we are, 60 years later still struggling to work out where a record
ends...

~~~
atoav
Easy: CSVs are human readable and writeable, and humans tend to have a comma
key on their keyboard, while they don't have the ASCII record separator
visible (teaching them to use e.g. Alt + 30 will not work).

And if they had that key on their keyboard, then you'd have the comma problems
all over again: what if an ASCII record separator shows up in the field?

~~~
louis_pasteur
But humans rarely use notepad these days. They use code editors like Coda or
VSCode at the very least which have all kinds of advanced features. Surely,
those can include support for ASCII separators?

~~~
sfifs
Let's say we have a product master. Something looks fishy and we want a brand
person or product supply person to check... Push a CSV and it's trivially easy
for them to open in XL and check, make corrections and send.

In a business context, this happens far more often than you may expect. Sure
you can build a custom platform to validate and make people connect to a form
connecting to a database - but Excel is a great user interface and _everyone_
knows Excel.

A funny incident - we were struggling to build a complex workflow in KNIME
where at some points we need user input. Nothing out of the box was great -
tools either assume a dashboard paradigm or a data flow paradigm - nothing in
between.

One of our creative folks came up with the solution of writing to a CSV and
getting KNIME to open it in excel. The user would make changes and save, close
excel and continue the workflow in KNIME. Even completely non technical people
got it.

~~~
hnick
We had to ban our non-technical users from opening CSVs in Excel because it
would strip leading zeros and do other exciting things like convert large
numbers to 1.23123E+20 :(

It does this without any notification or warning when you save.

~~~
mywittyname
That behavior is technically correct because those are all valid
representations of that value, e.g., 007449938 = 7449938 = 7.449938E6. Pretty
much any plain text (or binary) data format will have this same issue, unless
it supports additional metadata for display purposes.

If you intended the data to be a string value, then it should have been
enclosed in quotes.

~~~
yiyus
The problem is that when you have a number like 123451234512345, it may get
rewritten as 1.234512E+14, and then you are losing information.
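A quick Python demonstration of that loss, using seven significant digits of scientific notation as in the example above:

```python
# A large ID gets displayed in scientific notation, saved that way,
# and then can't be recovered.
original = 123451234512345
displayed = f"{original:.6E}"           # what lands in the saved file
print(displayed)                        # 1.234512E+14
round_tripped = int(float(displayed))
print(round_tripped)                    # 123451200000000 -- data is gone
print(round_tripped == original)        # False
```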

------
QuadrupleA
This scary stuff only applies if you need to _read_ garbage CSVs from a
million different sources. Even then, like ken says here, that's only like 10
issues to figure out.

Exporting is likely the more common case and is pretty simple: quote fields if
they contain any of \r, \n, " or the comma, and double any embedded quotes.

I'd say relax, focus on your particular problem - if a simple 2-3 line export
loop fits your use case and is simpler than a csv library (e.g. in C/C++
perhaps where dependencies are a pain) then why not.
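That quoting rule sketched in Python (illustrative helper names, not a full RFC 4180 writer):

```python
# Quote a field only when it contains a delimiter, quote, CR, or LF,
# doubling any embedded quotes.
def csv_field(value):
    s = str(value)
    if any(c in s for c in ',"\r\n'):
        return '"' + s.replace('"', '""') + '"'
    return s

def csv_line(values):
    return ','.join(csv_field(v) for v in values)

print(csv_line(['plain', 'has,comma', 'has "quote"']))
# → plain,"has,comma","has ""quote"""
```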

This fear mongering about "you can't possibly do this, the library writers are
much smarter than you" can lead to its own ridiculousness of hundreds of
dependencies and ungodly waste of processor time, memory, build time,
containerization hassle, etc. to solve what should be a simple problem.

~~~
7thaccount
I agree. Sometimes the business case means reading .CSV files that have
some pretty bizarre properties, and then you need a massive library that takes
care of all of that.

In my case, though (I do a LOT of .CSV work), none of these problems exist. I
haven't reached for Python's .CSV library in years and neither have my
co-workers. We simply loop through the file, split the strings on commas, and
have a few if statements to parse on a situation by situation basis. Extremely
old school, but it works very well and is easier than dealing with the objects
that are generated from using the library. I realize this probably doesn't
work with all use cases on HN though, or even most.
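The old-school approach described above, sketched in Python. It is only safe when, as in this commenter's data, fields never contain commas or quotes; the skip rules shown stand in for the situational if statements:

```python
# Naive CSV reading: no csv module, just split on commas plus a few
# situational if statements (the skip rules here are made-up examples).
def load_rows(lines):
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):  # situational: skip blanks/comments
            continue
        rows.append(line.split(','))
    return rows

print(load_rows(['a,b,c', '', '# comment', '1,2,3']))
# → [['a', 'b', 'c'], ['1', '2', '3']]
```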

My ideal language has direct support for file I/O, dictionaries, sets,
whatever, all in core and easy to use. This keeps me from having to write my
own helper classes and modules or bring in 3rd party libraries. Other
languages like to have a community package for everything to where you have 9
options for everything. This seems to be common in JS and Perl and is
certainly valid, but I'm not a big fan. Python hits a sweet spot for me.

------
joshlk
The Pandas CSV reader (Python) is one of the best and most versatile
implementations out there. Also very fast.

[https://pandas.pydata.org/pandas-
docs/stable/reference/api/p...](https://pandas.pydata.org/pandas-
docs/stable/reference/api/pandas.DataFrame.to_csv.html)

~~~
nojito
Disagree.

Data.table’s fread is leagues ahead of pandas.

[https://www.rdocumentation.org/packages/data.table/versions/...](https://www.rdocumentation.org/packages/data.table/versions/1.12.8/topics/fread)

Fread has automatic footer detection and has automatic skip logic to help
parse out mangled headers in some csvs.

~~~
vharuck
fread can also accept a shell command which it will run and use the result as
input. I use that to grep the lines I want from files too big for RAM.

~~~
nojito
Same here!

I also use fwrite to send it back to the shell to continue my processing
pipeline.

[https://www.rdocumentation.org/packages/data.table/versions/...](https://www.rdocumentation.org/packages/data.table/versions/1.12.8/topics/fwrite)

It's mind boggling how little fanfare data.table has despite it being the best
way to handle tabular data in 2020.

------
redleggedfrog
Oh .csv, how I love to loathe thee.

I have a funny story about .csv. Back in the day I was working on an
integration with Fishbowl. Looking through their docs, the data format was
your typical XML type stuff, until you got to the import requests. And there,
lo and behold, was XML wrapping, you guessed it, .csv. I literally laughed out
loud.

[https://www.fishbowlinventory.com/wiki/Fishbowl_Legacy_API#I...](https://www.fishbowlinventory.com/wiki/Fishbowl_Legacy_API#Import)

Years later they released an updated API. Upon checking it I discovered it's
now _JSON_ wrapping .csv, so, you know, progress.

[https://www.fishbowlinventory.com/wiki/Fishbowl_API#Import](https://www.fishbowlinventory.com/wiki/Fishbowl_API#Import)

~~~
arethuza
Paths in SVG are another example where the designers couldn't bring themselves
to do _everything_ strictly in XML and decided to use (space) delimited
values:

    
    
       <path d="M 100 100 L 300 100 L 200 300 z"
            fill="red" stroke="blue" stroke-width="3" />

~~~
th0ma5
This is sort of a quasi-standard way of describing paths as drawing
operations. You'll find this when working with fonts as well, and with old
plotters, sort of.

------
jmull
> Do you really still want to roll your own code to handle CSV?

No one really does, but we get backed into it due to these various issues.

The general issue is that there is no real standard for csv, yet it's often
used as the touch-point for integration of disparate systems!

The integration goes something like this:

1. A and B decide to integrate! A will accept data from B.

2. A and B want to use a good interchange format, but after weeks of intense
negotiations B can only get her IT department to agree to deliver csv (and
feels lucky to get that much).

3. B uses excel to provide sample and test files, which are used by A for
development or even in early stages of the live integration. Things seem to be
going smoothly.

4. At some point the export process at B changes: a new csv generator is used
and now things break on A's side. Or something on the data input side changes
at B, so that new forms of data are now present that their csv generator
doesn't handle the way A expects, etc.

~~~
3pt14159
Eh—I had to do it recently. There was a file format that I was dealing with
that was basically a combination of CSV for data and Yaml for headers (and
some other edge case stuff), and the Python CSV reader couldn't handle some
edge cases we needed handled, mostly around reporting back to the file
provider where they had messed up one thing or another in the file.

It didn't really take me that long. Lots of easy to write tests and the thing
is processing a ton of data now for a well known stock exchange. Far easier
than, say, the machine learning stuff I've written.

------
xamuel
Funny story on this subject. When I was transitioning from academia to
industry, I was afraid (having browsed hackernews so much) that I would be
harshly judged for reinventing the wheel. So I removed the CSV library I had
written from Github. Shortly thereafter, I was emailed by a well-known
programmer (the creator of a fairly major language) asking why I had removed
my CSV library, which he was using in one of his projects!

Here's the library, btw:
[https://github.com/semitrivial/csv_parser](https://github.com/semitrivial/csv_parser)

------
DougBTX
I think the most fun I had parsing a CSV file was when I found out it switched
from UTF8 to UTF16 after several thousand lines, as differently encoded CSV
files had been concatenated together directly into a single file...

~~~
GolDDranks
I have a similar story: parsing a CSV file that had an unexpected 0x1A byte in
the middle. Turns out the file was concatenated together in multiple passes
from multiple files using the COPY DOS command, which by default adds ASCII
SUB to the end as an EOF marker:
[https://en.wikipedia.org/wiki/Substitute_character](https://en.wikipedia.org/wiki/Substitute_character)
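A blunt cleanup along those lines, sketched in Python. Note this strips every 0x1A in the data, which would also clobber a legitimate SUB byte inside a quoted field, so it is only safe for files known to be plain text:

```python
# Strip the DOS EOF markers (ASCII SUB, 0x1A) that COPY inserts when
# concatenating files, before handing the bytes to a CSV parser.
def strip_dos_eof(data: bytes) -> bytes:
    return data.replace(b'\x1a', b'')

blob = b'a,b\r\n\x1ac,d\r\n\x1a'   # two files joined by COPY
print(strip_dos_eof(blob))          # b'a,b\r\nc,d\r\n'
```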

~~~
SanchoPanda
I have done this to myself in the past. Those were....dark days.

~~~
GolDDranks
The sad thing is that this happened last year.

~~~
oefrha
You don't say, someone sent a CSV like that to me yesterday...

------
cjfd
Well, considerations like these are typical for more or less any kind of
software you will ever write. Because of this all but the very simplest
software needs automated tests.

But one may also note that your particular usage of CSV may be a bit simpler
than supporting all of the possible complications. Especially if one is not
reading CSVs that could come from anywhere.

It depends: having dependencies is always a risk and a nuisance as well. The
library you are using may have bugs, acquire bugs in the future, change in
incompatible ways, and so on.

------
polm23
At a former job we processed files, many CSV, from hundreds of different
sources and I was in charge of cleaning up the code that did this once. There
were a few tiny binaries without source called csv2tsv and csv2tsv2. No
documentation of any kind of course.

csv2tsv just handled quoted fields and I was able to replace it with a few
lines of Python without issue.

The csv2tsv2 program was used for CSV files from exactly one company. We
couldn't ask them for technical assistance - it's likely their systems had
been written years ago and running without updates since then - so I tried to
figure out what the binary was doing. The input file had some null characters
in it, but they seemed to be used inconsistently, and I was never able to
figure out what that binary was supposed to do.

I left that binary alone and kept using it for a few years before a new guy
joined and took over the system from me. I mentioned this weird old binary to
him and over the course of a week he poked at it now and again before figuring
out what it was doing. He used to work at a bank and realized it was using the
same quoting method that some old data format they'd used there did -
something about doubling characters and a few other tricks.

There's nothing simple someone won't make complicated.

~~~
SanchoPanda
You are thinking of these:
[https://www.w3.org/Tools/csvtotab-vv](https://www.w3.org/Tools/csvtotab-vv)

Agreed, they are excellent.

------
geraldbauer
CSV is awesome :-). Leon Bambrick curates an Awesome CSV page [1], and there is
another Awesome CSV page [2] by my humble self.

[1]:
[https://github.com/secretGeek/AwesomeCSV](https://github.com/secretGeek/AwesomeCSV)
[2]: [https://github.com/csvspecs/awesome-
csv](https://github.com/csvspecs/awesome-csv)

~~~
JoBrad
> When the final cockroach breathes her last breath, her dying act will be to
> scratch her date of death in a CSV file for posterity.

CSV is the Keith Richards of file formats.

------
dehesa
I can relate. I have been one of those who started writing a CSV library and
went down the rabbit hole of edge cases and configurability. Looking back (and
only thinking on productivity) my time would have been better spent doing
something else. I did, however, learn quite a lot along the way:

- how text is represented in a computer (and all its caveats),

- how to identify and tune slow code,

- how to vectorize code (thanks to Daniel Lemire's
[https://www.youtube.com/watch?v=wlvKAT7SZIQ](https://www.youtube.com/watch?v=wlvKAT7SZIQ)),

- how to encapsulate code better,

- how to use some great CSV tools out there (thanks to Leon's
[https://github.com/secretGeek/awesomeCSV](https://github.com/secretGeek/awesomeCSV)),

- how to better manage and communicate an open source project.

All in all, I am still not sure it was a mistake, especially since Swift didn't
really have, at the time, a good Codable interface to CSV. Next time I
encounter a similar problem I will probably just wrap a C library and create a
good interface, though.

In any case, if you are coding in Swift and find yourself in need of a CSV
encoder/decoder, you might want to check
[https://github.com/dehesa/CodableCSV](https://github.com/dehesa/CodableCSV)

------
asadjb
My initial comment was going to be that "The Python CSV library is only 448
lines (378 LOC) [1]".

Only later did I realize that the bulk of the work is done by the C library,
which is ~1500 LOC [2].

I guess parsing CSVs reliably is a reasonably difficult undertaking.

[1]
[https://github.com/python/cpython/blob/3.8/Lib/csv.py](https://github.com/python/cpython/blob/3.8/Lib/csv.py)
[2]
[https://github.com/python/cpython/blob/4a21e57fe55076c77b0ee...](https://github.com/python/cpython/blob/4a21e57fe55076c77b0ee454e1994ca544d09dc0/Modules/_csv.c)

~~~
JoBrad
I have a base “library” I’ve used for several years that wraps the csv module
(specifically DictReader/Writer) to handle encoding issues as well as data
parsing. Have run into minor issues, but think that module alone is one of
Python’s best use-cases.

------
jackfoxy
And this is just the tip of the iceberg for data importation.

Managers and customer facing folks want to say _Can't you just use
off-the-shelf software for this?_ If your data importation needs are at all
non-trivial, the problems tend to come in 2 categories.

1) Error reporting from off the shelf software is inadequate. If you are
lucky, it will consistently point you to the correct row number. It's usually
lacking any additional info.

2) If your application is at all non-trivial, you have specific importation
needs, not easily covered by 3rd party generic solutions. For instance you
want to import well formed tables with primary keys. Lots of solutions claim
to handle this case, but you usually run into problems under category (1).

A programmer can overcome these difficulties and muddle through 3rd party
solutions, but you probably don't want to hand this over to users. It just
won't be a good experience for them.

I completed an importation library tailored for our specific needs, handling
the real problems we have experienced over many years. The front-end
project to make this available to users started and stopped and changed hands
over months. We are finally nearing deployment, and so far the users we have
shown it to like it.

------
roelschroeven
An additional problem is that Excel, the tool many people use to open CSV
files, uses the computer's regional settings for the decimal separator and
thousands separator. CSV files are used as an interchange format, so the
format of the numbers in them should not depend on the region of that specific
computer.

You can tell Excel to always use '.' as a decimal separator, but then it also
uses it for presenting numbers to the user. It boggles my mind that software
like Excel doesn't understand the difference between reading/formatting
numbers for human use and reading/formatting numbers primarily meant for
talking to other software.

~~~
tlbsofware
Excel will also take a number with a leading zero and drop it while it
converts it from csv, e.g. 01234 -> 1234, which is SUPER bad if you are
dealing with any sort of IDs for invoices or work orders from outside the
company.

~~~
sk5t
That is a good occasion to curse Excel (how does it know the number isn't
octal, too?) and also curse the person or group that designed a textual coding
system consisting of digits with leading zeros. Looking at you, National Drug
Code.

------
bproctor
Well sure, if you're writing a generic CSV parser, it's very complicated. But
if you know what your input looks like, CSV can be one of the simplest formats
to parse.

------
barbegal
I think you need to be careful with choosing to use CSV format.

CSV works fine as long as you are not handling character strings. Character
strings get messy as soon as they might introduce commas or characters outside
of ASCII.

For tabular data which has character strings I prefer to use the ODF format
[http://opendocumentformat.org/developers/](http://opendocumentformat.org/developers/)
(.ods file extension) which has good import and export capabilities from
Excel, or Google sheets. If the user needs the data in CSV for entry into
another application then they can convert within Excel and handle any
conversion problems themselves.

~~~
guerby
Do you have a preferred library (for python, for example)?

Debian packages python3-odf and
[https://github.com/eea/odfpy](https://github.com/eea/odfpy) seems to still
have some activity (vs the other listed).

~~~
barbegal
I don't. In the past I have just used an XML and a zip library like this
article explains
[https://www.codeguru.com/csharp/csharp/cs_data/xml/article.p...](https://www.codeguru.com/csharp/csharp/cs_data/xml/article.php/c16311/How-
to-Read-and-Write-ODFODS-Files-OpenDocument-Spreadsheets.htm)

------
wdavidw
I am the lead developer behind the Node.js CSV parser [0]. The project started
in 2010, an hour before I left work for the weekend. It ended up being a
long, very long journey that isn't over yet. The project grew, incorporating
many features I would never have expected, fixing corner cases the several
hundred tests were not covering, and teaching me a lot about coding and project
handling. The bottom line: writing a correct CSV parser takes a little bit
more than implementing an advanced string splitting function.

[0] [https://csv.js.org/](https://csv.js.org/)

------
drej
There's this system which has become extremely popular for processing data
(Apache Spark), and its default configuration cannot parse CSVs properly and
can lead to pretty severe data loss. I wrote about this a while ago and it's
still an issue. (Yes, we did lose data this way; it's not a theoretical
exercise.)

[https://kokes.github.io/blog/2019/07/09/losing-data-
apache-s...](https://kokes.github.io/blog/2019/07/09/losing-data-apache-
spark.html#csvs-dont-have-a-standard-per-se-but-lets-be-reasonable)

------
mmm_grayons
I wrote my own CSV code rather than using a library because the input CSVs
were produced by guys who didn't use a library. The columns had a fixed "real
width" but also had a column that contained an indeterminate number of
comma-separated values. One can escape commas in CSV for just such a purpose, but
these guys didn't. So, to deal with this issue quickly, I had to basically
parse the line forwards until I hit the variable-length field, then backwards
until I hit it as well. Then I could parse the field to determine how many
values it had and extract them.
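The forwards-then-backwards trick can be sketched in Python. The column counts here are made up for illustration (two fixed leading columns, three fixed trailing ones):

```python
# Hypothetical layout: N_FRONT fixed columns, then a variable-length
# run of unescaped comma-separated values, then N_BACK fixed columns.
N_FRONT, N_BACK = 2, 3

def parse_line(line):
    parts = line.rstrip('\n').split(',')
    front = parts[:N_FRONT]              # parse forwards to the variable field
    back = parts[len(parts) - N_BACK:]   # then backwards from the end
    variable = parts[N_FRONT:len(parts) - N_BACK]
    return front, variable, back

print(parse_line('id1,Alice,tag1,tag2,tag3,x,y,z'))
# → (['id1', 'Alice'], ['tag1', 'tag2', 'tag3'], ['x', 'y', 'z'])
```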

------
beders
My CSV story from last night:

I had these 15MB+ excel sheets and was trying to open them with Apache POI. I
gave that code a generous 2GB memory: GC overhead limit reached.

Then I opened them in Excel, saved them as CSV, reached for a CSV-library and
was done within seconds.

Ok, well, the story is more about how parsing XML can ruin your day ;)

~~~
mike22
Are you using POI’s eventmodel APIs to handle such files? The SAX model isn’t
too bad for reading files.

------
Hitton
It was not emphasized in the article, but the quotes can be both single and
double.

That said, it's futile to try to write a universal CSV parser. If you need to
write your own, just make it possible to choose the delimiter and the type of
quotes and call it a day. LibreOffice Calc does the same.
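In Python's csv module, those two knobs are exactly the `delimiter` and `quotechar` parameters:

```python
import csv
import io

# Semicolon-delimited, single-quoted input handled by configuring the
# reader rather than writing a new parser.
data = io.StringIO("a;'b;c';d\n")
reader = csv.reader(data, delimiter=';', quotechar="'")
print(next(reader))   # ['a', 'b;c', 'd']
```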

------
franze
I coded a few robots.txt parsers i.e.: (these are just the ones i cared to
publish) [https://github.com/franzenzenhofer/simple-functional-
robots-...](https://github.com/franzenzenhofer/simple-functional-robots-txt-
parser)
[https://github.com/franzenzenhofer/robotstxt](https://github.com/franzenzenhofer/robotstxt)

and even though robots.txt seems like a very, very simple text based protocol
there are unanswered mysteries.

the biggest mystery is user agent groups and comments

i.e.:

    
    
      User-agent: googlebot
      User-agent: bing
      User-agent: yandex
      Disallow: /
    

so i am disallowing everything for google, bing, yandex; easy enough, but:

    
    
      User-agent: googlebot
    
      User-agent: yandex
      Disallow: / 
    

means that googlebot has no instructions, but yandex is disallowed everything.

if a whole line is commented out

    
    
      User-agent: googlebot
      #User-agent: bing
      User-agent: yandex
      Disallow: / 
    

is a commented out line a blank line or a non-existing line?

if it is interpreted as a blank line, the disallow only counts for yandex; if
the line is non-existent, it counts for googlebot and yandex.

i like simple things, but sometimes the complexity is between the lines.

~~~
iams

      User-agent: googlebot
      #User-agent: bing
      User-agent: yandex
      Disallow: /
    

why would it be interpreted as a blank line? If you remove everything after
the #, that includes the new line characters at the end of the line. Leaving:

    
    
      User-agent: googlebot
      User-agent: yandex
      Disallow: /

~~~
eyelidlessness
It's ambiguous, and this is why significant whitespace can be so frustrating.
Unless it's specified, some people will interpret a full-line comment as a
blank line with a comment ending it (`^#.*$`), and others will interpret it
as you have (`^#.*\n`). Neither is obviously correct (even if it's obvious
_to you_).

Edit: I don't know how to escape in HN formatting. Obviously there are italics
where literal asterisks should be.

~~~
traes

      *** You can just use three asterisks. ***
    

*** You can just use three asterisks. ***

    
    
       Unfortunately you need something after them though. ***
    

Unfortunately you need something after them though. ***

~~~
eyelidlessness
Thanks, that's good to know!

------
oefrha
> What if there is non ASCII data?

> Just use utf8 right? But wait…

> What if the program reading the CSV use an encoding depending on the locale?

> A program can’t magically know what encoding a file is using. Some will use
> an encoding depending on the locale of the machine.

Excel (for Mac at least) is a fucking pain in this regard. Just try this
minimal UTF-8 example:

    
    
      tmpfile=$(mktemp /tmp/XXXXX.csv); echo '“,”' > $tmpfile; open -a 'Microsoft Excel' $tmpfile
    

Hooray, you successfully opened

    
    
      ‚Äú,‚Äù
    

I'm not even sure in which encoding e280 9c2c e280 9d corresponds to that (not
the usual suspect cp1252, nor any code page in the cp1250 to cp1258 range;
easy to confirm with iconv).

One remedy is to add a BOM (U+FEFF) to the beginning of the file, but of
course no one other than Microsoft (at least in my experience) uses this weird
UTF-8 with BOM encoding (which the Unicode standard recommends against), so it
breaks other programs correctly decoding UTF-8.
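For what it's worth, Python's `utf-8-sig` codec handles exactly this trade-off: it writes the BOM Excel wants and silently strips it again on read:

```python
import os
import tempfile

# Write a tiny CSV with the BOM Excel expects, then read it back.
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
with open(path, 'w', encoding='utf-8-sig') as f:
    f.write('\u201c,\u201d\n')          # “,”

print(open(path, 'rb').read()[:3])      # b'\xef\xbb\xbf' -- the BOM
print(open(path, encoding='utf-8-sig').read())  # the BOM is stripped back off
```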

This means I can never share a non-ASCII CSV file with non-technical people.
Always have to convert to .xlsx although it's usually easier for me to
generate CSV. Then .xlsx opens me up to formatting problems, like phone
numbers being treated as natural numbers and automatically displayed in
scientific notation... Which means ssconvert or other naive conversion tools
aren't enough, I need to use a library like xlsxwriter.

I'm not sure why it's so hard to just fucking ask when you don't know which
encoding to use. (Plus it's not super hard to detect UTF-8. uchardet works
just fine. Plus my locale is en_US.UTF-8, maybe that's a hint.)

~~~
jfk13
> Hooray, you successfully opened

> ‚Äú,‚Äù

> I'm not even sure in which encoding e280 9c2c e280 9d corresponds to that

That'll be MacRoman (not very surprisingly).

[https://en.wikipedia.org/wiki/Mac_OS_Roman](https://en.wikipedia.org/wiki/Mac_OS_Roman)
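Indeed: decoding the UTF-8 bytes of “ and ” as MacRoman reproduces the mojibake from the parent comment exactly. A Python one-liner to confirm:

```python
# UTF-8 bytes e2 80 9c / e2 80 9d, misread as MacRoman:
mojibake = '\u201c,\u201d'.encode('utf-8').decode('mac_roman')
print(mojibake)   # ‚Äú,‚Äù
```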

~~~
oefrha
Ah, I thought it had to be something Microsoft.

We can date the last modernization effort on this piece of code, then.

------
papito
Implementing your own CSV parser is one of the first lessons a green software
engineer will learn in how NOT to reinvent the wheel.

Still, please don't import packages with just one goddamned 3-line function in
it.

~~~
qayxc
It's what happens when DRY becomes dogma instead of a guideline...

This leads to debacles like the npm left-pad & kik affairs and other scary
shit like [https://github.com/parro-it/awesome-micro-npm-
packages](https://github.com/parro-it/awesome-micro-npm-packages)

Yeah - have fun maintaining dependencies when every other module depends on
one-liners that can be pulled or broken at random...

------
leokennis
One small tip if you receive a CSV with a non-default separator (like ;):

* Open it in a text editor

* Add "sep=;" on the first line (without the quotes)

* Now at least Excel will open it as intended
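The same steps can be scripted; a small Python sketch (the file and its contents are made up for illustration):

```python
import os
import tempfile

# A hypothetical file using ';' as the separator:
path = os.path.join(tempfile.mkdtemp(), 'data.csv')
with open(path, 'w') as f:
    f.write('a;b;c\n')

# Prepend the "sep=;" hint so Excel picks the right delimiter:
with open(path) as f:
    body = f.read()
with open(path, 'w') as f:
    f.write('sep=;\n' + body)

print(open(path).read())
```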

~~~
BlueTemplar
IIRC Excel will ask for the separator anyway when you open the file?

~~~
brewmarche
It does not and actually assumes locale-dependent separators (usually
semicolons if your locale uses a comma as a decimal separator and comma
otherwise).

If you want to specify separators you have to go via Text Import.

------
SanchoPanda
To be fair, sometimes calling some giant java library or installing tons of
dependencies is just not realistic. Use the library for the mission critical
stuff or the automated reporting, without a doubt, but it's important to be
practical and think on your feet as well.

In the meantime, for quick analysis and testing, 90%+ can be accomplished with
one line of (g)awk.

    
    
      awk -v FPAT='"[^"]*"|[^,]*' -v OFS='\t' '{$1=$1; print $0}' "$filename" 
    

More on FPAT
[here]([https://www.gnu.org/software/gawk/manual/gawk.html#Splitting...](https://www.gnu.org/software/gawk/manual/gawk.html#Splitting-
By-Content))

~~~
nicoburns
You can also use xsv for this kind of analysis of CSV files: a unix-style
tool designed for CSVs, with a proper CSV parser.

It's written in Rust so it's one binary - no runtime dependencies, and will
happily chunk through multi-gigabyte files.

[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv)

------
csunbird
Reminds me of [https://infiniteundo.com/post/25326999628/falsehoods-
program...](https://infiniteundo.com/post/25326999628/falsehoods-programmers-
believe-about-time)

~~~
tonyedgecombe
It applies to everything, dates, floating point numbers, strings, etc. What we
would consider the basics always turn out to be much deeper than they appear
on the surface.

~~~
atulatul
Dating advice: if you habitually, unironically, and without good cause use any
dating format but that which is mandated by ISO 8601 we cannot be friends, let
alone more.

[https://twitter.com/jwiechers/status/1205515440543424513](https://twitter.com/jwiechers/status/1205515440543424513)
\+ the thread

------
freefriedrice
After working in the world of CAD tools for decades, I cannot tell you the
number of times I've seen a younger engineer try to write a CSV parser only to
fall into a circular hacking death spiral.

You can't just use regex/split to handle CSV unless you do significant
field cleaning BEFORE converting to CSV.

In reality you need lexical analysis and grammatical rules to parse any string
of symbols. This is almost always overlooked by naive implementations.

I also take issue with OP's claim that RFC 4180 is not well-defined: almost
all of the cases the OP listed are literally in the spec.

------
benbristow
Whilst we're talking about libraries, I've had a good experience with
CsvHelper for C#/.NET.

[https://joshclose.github.io/CsvHelper/](https://joshclose.github.io/CsvHelper/)

The documentation isn't 100%, but it's a good tool. I've been using it on a
production 15+-column SQL->CSV mapping job with tens of thousands of records,
and it's working great.

------
neilv
I wrote a CSV-reading library 16 years ago that I'm pleasantly surprised
handled most of the points of trickiness the article mentions.

[https://www.neilvandyke.org/racket/csv-reading/](https://www.neilvandyke.org/racket/csv-reading/)

The documentation section "Reader Specs" gives a good idea of the different
variations it intends to handle. I was working from a sense of variations I'd
seen myself, and extrapolating.

One thing I also did, which I didn't see in the article, was to support
comments.

Unfortunately, at the time I wrote it, there was no standard Unicode support
for Scheme, so I didn't get into that, but just used the character abstraction
(which might or might not involve parsing of a particular non-ASCII encoding).

Fortunately, AFAIK, it's worked for people, since fixing/extending it at this
point would mean relearning the code. :)

------
ungerik
Those are the simple things to worry about; it can get even more complicated
in the real world. So complicated that writing our own parser was the only way
to go.

Some of the more advanced challenges:

* Banking systems that treat CSV export as just some kind of line-based dump file with no regard for consistent formatting. If the banking backend was updated, the format might change within a file from one line to the next.

* A misconfigured data-dumping pipeline that parses CSV from another system the wrong way and re-emits it, escaped, in a different format, for instance putting a complete escaped line into a single field. Your parser has to detect that there is a CSV embedded within a CSV.

* A dumping pipeline that treats \r\n as all kinds of silly line breaks and re-embeds those lines with quotes around them so they look like real data.

* Completely new invented specs of what CSV could mean.

* Mainframe character sets from the last millennium.

------
mrweasel
> Some countries use a comma as decimal separator instead of a colon.

And some companies, like Klarna, fix this incorrectly. At least they did a few
years back.

We'd get a CSV file from them containing payment information, but they ran
into the issue that they really wanted to use the comma for field separation,
but also for krone/øre (decimal) separation. Their ingenious solution: fields
are now separated by ", ", that is, a comma and a space.

Pretty much no CSV parser will accept a multi-character field separator. So
many fields would just have a space prepended to their value, and you'd get an
extra field when the parser split the field containing the amount. So now you
have, say, 5 headers, but each row has 6 fields, because the money amount
became two fields.

Being on the receiving end, you have to choose: strip spaces and reconstruct
the amount field manually, or parse the CSV file yourself.
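
For what it's worth, if the fields containing literal commas (like the
amount) are quoted, Python's csv module can cope with a ", " separator via
its skipinitialspace option, which drops whitespace right after each
delimiter. A sketch with made-up sample data:

```python
import csv
import io

# Hypothetical sample in the ", "-separated style described above; the
# amount uses a decimal comma and is therefore quoted.
data = 'id, name, amount\r\n1, Widget, "49,95"\r\n'

# skipinitialspace makes the reader drop whitespace that immediately
# follows each delimiter, so ", " effectively acts as one separator.
rows = list(csv.reader(io.StringIO(data), skipinitialspace=True))
print(rows)  # [['id', 'name', 'amount'], ['1', 'Widget', '49,95']]
```

This only works because the quote after the skipped space is still
recognized as opening a quoted field; unquoted amounts would still split.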

~~~
tjalfi
If you use .NET then CsvHelper[0] supports multi-character separators.

[0]
[https://joshclose.github.io/CsvHelper/](https://joshclose.github.io/CsvHelper/)

------
OliverJones
I totally agree with this. Writing your own CSV code is the archetypal
example, in computer programming, of reinventing the flat tire.

I've found that every major language has at least one reusable component /
module / library / DLL to do this. They're maintained by people who have lots
more patience than I do.

------
geraldbauer
FYI: I've put together a collection of CSV specifications [1]. CSV Dialects /
Flavors include CSV v1.0, CSV v1.1, CSV Strict, CSV <3 Numerics, CSV<3 JSON,
CSV <3 YAML and some more.

[1]: [https://github.com/csvspecs](https://github.com/csvspecs)

------
mystickphoenix
Fantastic article!

It's missing:

\- "What if the quotes are smart quotes?" This applies to both quotes inside a
field and quotes wrapping a field.

\- "What if the file doesn't have a header?"

\- "What if a logical record is spread over multiple lines?" Yeah... so, line
23 has the majority of the data in the expected columns, line 24 has
additional data in columns 23, 25, and 28 that's supposed to be appended onto
the previous line's data in those same columns. Line 25, same thing, but line
26 starts a new logical record.

I've seen all three of the above in the wild, and Python's (pretty awesome)
CSV code didn't handle them. Cue the custom code that wraps Python's core CSV
parsing.

------
jtvjan
It would've been a lot nicer if they had used the record and unit separators
instead of commas and newlines. There are characters made specifically for
storing tables like this, and they instead chose to reuse characters which
might appear in regular text.

------
bloak
If you can't write a correct (according to RFC 4180) CSV parser in 15 lines of
Perl then you're not a Real Programmer!

(Yes, I understand that not all CSV files in the wild are correct according to
RFC 4180. Yes, I also understand that only grey-beard loons still use Perl.)

The strangest thing about CSV, I think, is that if there's only one field per
record and the file ends with a CRLF then you can't tell whether there's a
final record containing an empty field following that CRLF. It's probably best
to assume there isn't.
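
For what it's worth, Python's csv module takes that interpretation:

```python
import csv
import io

# "a\r\nb\r\n" could be read as two records, or as two records plus a
# final empty one after the last CRLF; the csv module assumes there is
# no trailing empty record and yields exactly two rows.
rows = list(csv.reader(io.StringIO("a\r\nb\r\n")))
print(rows)  # [['a'], ['b']]
```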

~~~
nmz
15? damn, that's 10 lines less than my awk csv parser. oh well...

------
nbevans
I prefer to use a SQLite database. And if a customer insists on CSV then I
still use a SQLite database to import and/or export the CSV without them
realising it. SQLite brings sanity to CSVs.

~~~
DarkWiiPlayer
You still have to import the CSV though, so the problems are still there, you
just don't have to deal with them after a successful import.

~~~
nbevans
You misunderstand how enterprise integrations work. It's easier to tell a
customer "your CSV is not compatible with standard off-the-shelf tooling,
indeed the most popular database engine in the world" than "your CSV doesn't
seem to work with our software/library/<proprietary ETL framework>". The
former makes it their problem; the latter will almost undoubtedly make it your
problem. In case you don't know: SQLite has a really well-designed and very
performant CSV import function. BTW did you really downvote me for that? Wow.
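
The CSV import mentioned here is a feature of the sqlite3 command-line shell
(`.import`); a rough stdlib-only sketch of the same CSV -> SQLite -> CSV
round-trip in Python, with a made-up table and sample data:

```python
import csv
import io
import sqlite3

# Load a small made-up CSV into an in-memory SQLite table.
data = "id,name\n1,alpha\n2,beta\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")

reader = csv.reader(io.StringIO(data))
header = next(reader)                      # consume the header row
conn.executemany("INSERT INTO items VALUES (?, ?)", reader)

# Any cleanup or querying now happens in SQL; export back to CSV at the end.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(conn.execute("SELECT id, name FROM items ORDER BY id"))
print(out.getvalue())
```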

------
larschdk
Just today I needed to write a .csv file in Python on Windows and got a file
with CR CR LF line endings by default (\r\r\n). That was using the standard
'import csv' in Python 3.

~~~
hprotagonist
yeah you gotta pass

      with open(..., newline='') as f:

~~~
JoBrad
I think the docs now include this.

~~~
hprotagonist
yes, they have for some time now i think. I wish it was better known, though.

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=7796268](https://news.ycombinator.com/item?id=7796268)

------
nly
Sometimes the best thing you can do is learn at least one parsing framework so
that you can write your own parsers. The advantage is that you know exactly
what the behavior will be.

Here's a CSV parser in ~40 lines of C++ using Boost Spirit that handles all
kinds of weird cases:

[https://gcc.godbolt.org/z/KGUHXo](https://gcc.godbolt.org/z/KGUHXo)

It's easy to customize, declarative, and compiles down to about ~15KB.

------
hermitcrab
I wrote my own CSV parser for
[https://www.easydatatransform.com](https://www.easydatatransform.com). I ran
into some cases where it wasn't clear what to do. For example where you had
some whitespace outside of quotes. So I checked what Excel and Numbers (the
Mac spreadsheet) do. They parse the same string differently...

------
tester34
It doesn't seem insanely difficult.

Don't be scared of coding, guys. Just differentiate between environments: do
it for your 4fun project, not prod.

~~~
yoz-y
The problem isn't really that it's difficult to solve; it's that it's
impossible. There is no real formal spec. The RFC does not specify edge cases
and nobody respects it anyway. A "generic" CSV parser is a pipe dream; don't
try to do it.

Now, if you define your own set of rules on what is and is not allowed then
it's fine.

~~~
Brian_K_White
Indeed, it's a perfect example to illustrate the difference between an
under-defined spec and a well-defined one, and why it matters.

Whatever you don't define will be undefined. But you still somehow have to do
something definite without a definition.

The customer expects predictable output at the end, and until we have not only
clairvoyance, but clairvoyance that can be built into a machine, you can't
have predictable output without either predictable input or a spec that
actually provides an answer for any input.

------
aww_dang
Even when using libraries to read CSV data the need for sanitization isn't
uncommon. Sure there's RFC4180, but variations are common. Mysterious records
that are out of sync are typical.

For my own sanity I like to verify that columns are of the right data type
where possible.

That said, I still prefer CSV over heavier formats (XML,JSON) where the data
conforms.

------
sharpercoder
Can we somehow start to use the ASCII 30 character, aka the Record Separator
(RS)? This would solve a lot of the problems!
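
A toy round-trip sketch of that idea in Python (the separator code points are
from ASCII; everything else here is made up):

```python
# ASCII unit separator (0x1F) between fields, record separator (0x1E)
# between records; no quoting or escaping is needed as long as the data
# itself never contains these control characters.
US, RS = "\x1f", "\x1e"

rows = [["a", "b,c", 'say "hi"'], ["1", "2", "3"]]
encoded = RS.join(US.join(fields) for fields in rows)
decoded = [record.split(US) for record in encoded.split(RS)]
print(decoded == rows)  # True
```

Note that commas, quotes, and newlines in the data all pass through untouched,
which is exactly the appeal over CSV.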

~~~
wombatpm
Not as fun as you might think. I had to parse and write inkjet control files
for commercial printing. It used record separators, group separators, and unit
separators.

The problem? Records began WITH the RS character and there were NO line
breaks.

Ended up writing my own library
[https://github.com/Wombatpm/VIPNT](https://github.com/Wombatpm/VIPNT)

------
arethuza
A good one I saw recently was a text file that, even though it had a .csv file
extension, actually used another character as a delimiter (in this case
colons), but because it contained large amounts of textual data it had enough
commas to be parsed as a CSV file.

VS Code and a suitable extension handled this rather better than Excel.

~~~
tigerstripe
Sounds like CSV in this case stood for Colon Separated Values

~~~
fomine3
Further generalization: Character Separated Values

------
rietta
I write CSV code all the time in different languages. It's a fun challenge
that incorporates several nuances. It has been a form of code kata for me over
the last 20 years. Biggest lesson, one does not simply split a string. You
have to scan through and tokenize following the state of a quoted field.
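
A minimal sketch of that scan-and-tokenize approach for a single physical
line, tracking only whether we are inside a quoted field and treating "" as
an escaped quote (embedded newlines, CRLF, and other real-world quirks are
deliberately left out):

```python
def parse_line(line):
    # Character-by-character scan with one piece of state: in_quotes.
    fields, buf, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    buf.append('"')   # "" inside quotes is an escaped quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                buf.append(ch)
        elif ch == '"':
            in_quotes = True           # opening quote
        elif ch == ',':
            fields.append(''.join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    fields.append(''.join(buf))
    return fields

print(parse_line('a,"b,""c""",d'))  # ['a', 'b,"c"', 'd']
```

Even this toy version handles the commas-inside-quotes case that a naive
split(',') gets wrong.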

------
nojvek
ndjson (newline-delimited JSON) is what we should use as a human-friendly and
machine-friendly alternative to CSV.

JSON is a universal format. It works in every major language. Its spec is a
state machine defining what to do with every possible encountered byte.
[https://www.json.org/json-en.html](https://www.json.org/json-en.html)

simdjson parses JSON at RAM-like speeds (gigabytes per second).

CSV should die. It's a terrible format. Just send arrays of ndjson instead.
You'll probably have a few more " and [, but hey, you can deeply nest arrays
inside arrays of objects with arrays.

[https://github.com/simdjson/simdjson](https://github.com/simdjson/simdjson)
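
Reading ndjson really is just one json.loads per line; a tiny sketch with
made-up sample data:

```python
import io
import json

# One JSON document per line; embedded commas and quotes are a non-issue
# because JSON's own string escaping handles them.
data = '{"a": 1, "b": "x,y"}\n{"a": 2, "b": "with \\"quotes\\""}\n'
records = [json.loads(line) for line in io.StringIO(data) if line.strip()]
print(records)  # [{'a': 1, 'b': 'x,y'}, {'a': 2, 'b': 'with "quotes"'}]
```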

~~~
tatersolid
JSON doesn’t handle int64 or explicit date-time data types, which makes it
nearly as painful as CSV in practice when used for data integration. Both of
those are explicit types present in nearly every real-world database schema.

So you end up with many of the same problems as CSV. You basically get an
application-specific JSON dialect to interpret.

Header row or not? Is a long series of digits in quotes an int64? What happens
when someone puts an ISO-formatted date string into a field that is supposed
to be a simple string?
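
Both points are easy to demonstrate with Python's stdlib json (the int64
precision loss itself happens on the JavaScript side, so here it is only
described, not shown):

```python
import json
from datetime import datetime

# JSON has no date-time type: the stdlib encoder refuses a datetime outright.
serializable = True
try:
    json.dumps({"ts": datetime(2020, 1, 1)})
except TypeError:
    serializable = False
print(serializable)  # False

# int64 values round-trip losslessly in Python (arbitrary-precision ints),
# but a JavaScript consumer parsing the same text silently loses precision
# beyond 2**53, which is why such IDs often travel as quoted strings instead.
big = 2**63 - 1
roundtrip = json.loads(json.dumps({"id": big}))["id"]
print(roundtrip == big)  # True
```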

------
hackandtrip
Any fast implementations in C++?

In particular, I've been trying in the last few weeks to work on an
OpenMP/MPI-driven CSV parser, but accounting for newlines inside quotes can
surely be a pain in the ass. I'm wondering if there's a go-to implementation
or model to implement that.

------
pinopinopino
I did this just for fun, to learn a parsing library I was using. It didn't
deal with UTF-8, though, but even that is manageable. The complexity with CSV
is that it is a family of formats with small changes. But if you make your
parser configurable, you are set.

~~~
pinopinopino
If you want to do something really hard, try to parse PDF, that is sheer
insanity.

------
coliveira
CSV is not a portable format across applications. That's it. You should use it
only in your own application or in connection with applications that you know.
The big problem is to think that you can handle a generic CSV file, because
there is no such thing.

------
lxe
Just because all kinds of inputs exist, doesn't mean you should support
handling them all.

------
t0astbread
What about implementing something that works for your environment and making
it fail-fast so that any unexpected input will just abort+log the current
operation? You could then expand on that on-demand.

~~~
Brian_K_White
This assumes you can recognize unexpected input.

~~~
t0astbread
I assume unexpected input results in either an error (or exception, whatever
your language calls it), a crash or a user complaint. I also assume it's
unlikely for an input parser to introduce a security issue into your app if
it's written in a safe language and your app has proper validation set up
(i.e. don't trust input as soon as it passes the parser).

------
xupybd
I don't want to but there is nothing that I've found in c# other than a helper
tool that will only translate objects to CSV rows.

------
kyberias
I'm currently implementing a REST API that returns CSV files.

------
Neil44
Me too. Everything was great until Mr O’Connor came along.

------
swiley
>Some countries use a comma as decimal separator instead of a colon.

I kind of wish we could all just pick _something_ and stick with it.

~~~
unwind
That sentence must have a typo in it, right?

Wikipedia [1] doesn't list _any_ country that _does_ use the colon (:) as the
decimal separator, so possibly the period/dot (.) as used in most English-
speaking countries, was meant?

1:
[https://en.wikipedia.org/wiki/Decimal_separator](https://en.wikipedia.org/wiki/Decimal_separator)

~~~
yoz-y
Yeah, I think what author meant "some countries don't use the period but a
comma, thus hilarity ensues" (especially when your code respects the locale
and the csv file is written in append mode)

------
the_dripper
i wanted to... but not anymore :/

------
ph4
import csv

------
GuB-42
I disagree with the article: I'd say writing your own CSV code is the
second-best option, behind not using CSV at all.

The reasons are:

\- CSV is really simple. A complete RFC 4180 parser shouldn't take more than
100 lines of readable code to implement, and the writer even less. So much so
that most of the code is likely to be interface code (file I/O, matching your
internal data structures, etc.) rather than parsing. And chances are that a
library won't even make your code shorter.

\- As mentioned in the article, CSV is not well defined. I've seen all sorts
of weirdness: semicolons vs colons, LF vs CRLF, single vs double quotes,
backslash escaping, comments, floating point numbers with commas instead of
points. What your library understands as CSV may not be the files you have to
work with. I'd rather have a simple piece of code that corresponds to what I
am given than a big library that uses AI techniques to try to guess the
intent of whatever wrote the file.
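
The writer half of the first point is indeed tiny: RFC 4180's quoting rule
(quote a field that contains a comma, a quote, or a line break, doubling any
embedded quotes) fits in a few lines. A sketch:

```python
def csv_escape(field):
    # Quote only when required, doubling embedded quotes per RFC 4180.
    if any(ch in field for ch in ',"\r\n'):
        return '"' + field.replace('"', '""') + '"'
    return field

def write_row(fields):
    # RFC 4180 records are joined by commas and terminated with CRLF.
    return ",".join(csv_escape(f) for f in fields) + "\r\n"

print(write_row(['a', 'b,c', 'say "hi"']))  # a,"b,c","say ""hi"""
```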

There are cases where it would be foolish not to use a library. I will never
write an XML parser, for instance; it is a well-defined and complex format.
JSON is borderline, but I still won't write my own parser. CSV is essentially
a proprietary format, so it is fitting to use a proprietary parser/writer.

