
CleverCSV: A Drop-In Replacement for Python's CSV Module - reedwolf
https://github.com/alan-turing-institute/CleverCSV
======
gregmac
> standardize Convert a CSV file to one that conforms to RFC-4180.

Very glad to see this here, hopefully it'll help with increasing adoption of
the RFC.

I find one of the biggest pains of working with CSV is that all too often,
someone gets Excel involved, and then Excel absolutely butchers the file. Off
the top of my head: It strips leading 0's (even if quoted), converts big
numbers to scientific notation, sometimes interprets non-date values as dates,
and when saving, keeps the mangled values, doesn't preserve quotes, and uses
regional settings -- to the point that, in some locales, it becomes a
"semi-colon-separated file"! Even just opening and then saving a .csv (without editing)
can totally mangle it to the point it's unusable. Sometimes Excel can't even
open a file _saved by Excel_ without mangling it even more.

[1] [https://tools.ietf.org/html/rfc4180](https://tools.ietf.org/html/rfc4180)
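
For reference, the conventions RFC 4180 asks for (comma delimiter, fields
containing commas, quotes, or newlines wrapped in double quotes, embedded
quotes doubled, CRLF line endings) are essentially what Python's stdlib csv
writer already emits with its default "excel" dialect. A minimal sketch, with
the file name and rows made up for illustration:

    import csv

    rows = [
        ["id", "note"],
        ["0042", 'He said "hi", then left'],  # keep the leading zero as text
    ]

    with open("out.csv", "w", newline="", encoding="utf-8") as f:
        # The default "excel" dialect already matches RFC 4180 conventions:
        # QUOTE_MINIMAL quoting, doubled quote characters, \r\n line endings.
        writer = csv.writer(f, dialect="excel")
        writer.writerows(rows)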

~~~
Twirrim
At a previous company, someone decided to experiment with putting a comma in a
text field submitted to an internal service that for some godforsaken reason
spat out CSV files that a lot of other teams consumed. I forget the exact
entry in the field but it was something like "I wonder, how does this handle
commas".

Turns out it didn't handle them well. It didn't escape them or anything. Out
came a line in the CSV with more columns than expected, and everything from
that field onwards was off by one. Most teams were lucky: their code barfed
when presented with a line with too many fields. Some teams were unlucky and
theirs didn't. How unlucky _they_ were varied based on how much data
validation they did.
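
For illustration, a minimal sketch of that failure mode (not the original
service's code): a writer that naively joins fields lets one embedded comma
shift every later column by one, while a quoting writer does not.

    import csv, io

    fields = ["row-1", "I wonder, how does this handle commas", "2021-01-01"]

    # Naive writer: no quoting or escaping, exactly the bug described above.
    naive_line = ",".join(fields)
    broken = next(csv.reader(io.StringIO(naive_line)))
    print(len(broken))   # 4 columns instead of 3; everything after is shifted

    # Quoting writer: the field containing a comma gets wrapped in quotes.
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    ok = next(csv.reader(io.StringIO(buf.getvalue())))
    print(len(ok))       # 3 columns, as expected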

That caused a big headache for a lot of people trying to recover from the
actions of the genius who thought it was a good idea to do that against the
production endpoint instead of the test endpoint.

That incident pretty much solidified my reputation in the team as some kind of
weird voodoo type of chaos monkey, because my shifts _always_ had a
significantly higher sev2 rate caused by things failing that I'd not touched
or had any influence over.

~~~
tmpz22
Better an issue raised by an engineer than that same issue caused by an end
user :shrug:

~~~
Twirrim
The engineer technically was an end user. It was a service that could get
called as an indirect result of customer action, but without any customer-
supplied information passed along, which is part of the reason it didn't do
any input clean-up (at least according to the post-incident report). They
assumed they could trust people not to screw things up.

To be fair, it had been running for a long time that way without anyone doing
that to them.

~~~
tmpz22
All good, I hope my comment didn't come off as snarky because I have 110% been
in similar environments!

------
etaioinshrdlu
This seems kind of philosophically similar to FTFY,
[https://github.com/LuminosoInsight/python-ftfy](https://github.com/LuminosoInsight/python-ftfy), except for CSV files
instead of botched text strings.

~~~
scrollaway
I knew of ftfy already but I would love some more examples of these kinds of
data cleanup libraries; it's always super useful.

~~~
cosmie
Not in any particular order, but here is a list of really useful text-
butchering utilities that I keep in the same grab-bag as ftfy: csvclean[1],
unidecode[2], transliterate[3], charset-normalizer[4], cchardet[5],
phonenumbers[6][7], iconv[8], enca/enconv[9].

[1]
[https://csvkit.readthedocs.io/en/1.0.3/scripts/csvclean.html](https://csvkit.readthedocs.io/en/1.0.3/scripts/csvclean.html)

[2] [https://pypi.org/project/Unidecode/](https://pypi.org/project/Unidecode/)

[3]
[https://pypi.org/project/transliterate/](https://pypi.org/project/transliterate/)

[4] [https://pypi.org/project/charset-normalizer/](https://pypi.org/project/charset-normalizer/)

[5] [https://pypi.org/project/cchardet/](https://pypi.org/project/cchardet/)

[6]
[https://pypi.org/project/phonenumbers/](https://pypi.org/project/phonenumbers/)

[7] which is a Python port of Google's libphonenumber:
[https://github.com/google/libphonenumber](https://github.com/google/libphonenumber)

[8] [https://linux.die.net/man/1/iconv](https://linux.die.net/man/1/iconv)

[9] [https://linux.die.net/man/1/enca](https://linux.die.net/man/1/enca)

------
emmanueloga_
I learnt a while ago about a way to provide a schema for CSV files. If you
publish CSV data, it may be useful to provide a schema for it [1].

It sounds like CleverCSV could determine and generate such a schema
automatically, given a CSV file.

1: [https://github.com/theodi/csvlint.rb#json-table-schema-suppo...](https://github.com/theodi/csvlint.rb#json-table-schema-support)
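
For a rough idea of what such a table schema looks like, here is an invented
example written as a Python dict and dumped to JSON; the column names and
constraints are made up, and the exact keys csvlint expects are best checked
against its documentation.

    import json

    # Invented columns and constraints, for illustration only.
    schema = {
        "fields": [
            {"name": "id", "constraints": {"required": True, "unique": True}},
            {"name": "email", "constraints": {"required": True, "pattern": ".+@.+"}},
            {"name": "note", "constraints": {"required": False}},
        ]
    }

    with open("schema.json", "w") as f:
        json.dump(schema, f, indent=2)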

------
th0ma5
The main problem with CSV, beyond the lack of standards, the character-encoding
issues, the lack of compression, the difficulty of reading or writing it
efficiently, and the trouble with multiline freeform text, is that the
structure is ultimately in-band with the data, which is what makes all of
these problems probable.

~~~
closed
> The main problem with CSV... is that the structure is ultimately in-band
> with the data

I wonder if for 90% of users this seems like a critical UI feature!

------
fake-name
The only issue I've ever had with Python's csv module is that it's _slow_. Is
this any faster?

~~~
gjjvdburg
Hi, I'm the author of CleverCSV. Although the package was initially written
mainly to improve the accuracy of automated dialect detection, we ended up
changing some aspects of the C parser as well (most notably increasing the
look-ahead to two characters). This has made the parsing somewhat faster than
the built-in csv package (around 10-20%).
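
A minimal sketch of the drop-in usage, assuming CleverCSV mirrors the stdlib
csv API (Sniffer and reader) as its "drop-in replacement" billing suggests;
the file name is made up.

    import clevercsv

    with open("messy.csv", "r", newline="") as fp:
        # Data-driven dialect detection instead of csv.Sniffer's heuristics.
        dialect = clevercsv.Sniffer().sniff(fp.read())
        fp.seek(0)
        rows = list(clevercsv.reader(fp, dialect))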

------
manor
TSV --> Problem Solved! Has anyone ever stumbled upon a cell entry that needed
to contain an actual tab character? No!!! The tab character's entire
raison d'être is to separate fields, whereas the comma separates words.
Technically the "C" in CSV denotes "Character", but with TSV the intent is 100%
clear. The MIME type registration is explicit: "Note that fields that contain
tabs are not allowable in this encoding." [1] This is a vanishingly small price
to pay for a parser as simple as split("\t")...

[1] [http://www.iana.org/assignments/media-types/text/tab-separat...](http://www.iana.org/assignments/media-types/text/tab-separated-values)
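
For what it's worth, a sketch of both the split-based parser and the stdlib
equivalent (file name made up); the registration's ban on embedded tabs is
exactly what makes the first one safe.

    import csv

    # Naive parser: fine only because fields may not contain tabs.
    with open("data.tsv") as f:
        rows = [line.rstrip("\n").split("\t") for line in f]

    # Stdlib equivalent, if you'd rather not hand-roll it.
    with open("data.tsv", newline="") as f:
        rows = list(csv.reader(f, delimiter="\t"))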

~~~
dmd
What problem is solved, exactly? How do you parse the messy CSV file you've
been handed, in order to convert it to TSV?

~~~
manor
My point is simply that TSVs are just as straightforward as CSVs but suffer
none of the issues. So why are we not making TSVs to begin with?

~~~
dmd
Nobody (or very few people) processing CSVs is doing it because they like
them. You do it because that's how someone else created the data - someone you
don't have any control over, like a government agency, financial institution,
piece of obscure software, whatever.

------
r0f1
How does this compare to Pandas?

~~~
donarb
Pandas uses the Python csv module under the hood. Most of the code in the
read_csv method is used to get the data from the csv file and load it into
data frames.

~~~
urschrei
Are you sure? Wes McKinney rewrote the Pandas CSV parser way back in late 2012
because the batteries-included version was slow and memory-hungry. Did Python
upstream his version later on, or replace it with something faster? (I'd be
surprised if they did either, but I'd be curious to know.)

~~~
evanpw
It's configurable. The default parser is written in C, but it's possible to
use a parser based on the csv module if you need things like separators
specified by a regex.
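
A hedged sketch of that switch (the file name and separator are invented):
pandas picks the C engine by default, and a regex separator is one of the
things that requires the slower Python engine, i.e. the csv-module-based
parser mentioned above.

    import pandas as pd

    df_fast = pd.read_csv("data.csv")  # default C engine

    # A regex separator needs the Python engine.
    df_flex = pd.read_csv("data.csv", sep=r",\s*", engine="python")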

