CleverCSV: A Drop-In Replacement for Python's CSV Module (github.com/alan-turing-institute)
154 points by reedwolf on Jan 13, 2020 | 42 comments

> standardize Convert a CSV file to one that conforms to RFC-4180.

Very glad to see this here, hopefully it'll help with increasing adoption of the RFC.

I find one of the biggest pains of working with CSV is that all too often, someone gets Excel involved, and then Excel absolutely butchers the file. Off the top of my head: it strips leading 0s (even if quoted), converts big numbers to scientific notation, sometimes interprets non-date values as dates, and when saving, keeps the mangled values, doesn't preserve quotes, and uses regional settings -- to the point that in some locales it becomes a "semicolon-separated file"! Even just opening and then saving a .csv (without editing) can totally mangle it to the point that it's unusable. Sometimes Excel can't even open a file saved by Excel without mangling it even more.

[1] https://tools.ietf.org/html/rfc4180

I cannot for the life of me find the blog post where I learned about this, but I learned long ago that the only reliable way to get Excel to open a CSV file with Unicode in it - regardless of whatever ancient version and OS they're using - is to use tab separators and encode as UTF-16LE with a leading BOM.
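In Python that recipe looks roughly like this (a sketch with made-up data; `codecs.BOM_UTF16_LE` is written explicitly so the byte order is deterministic rather than platform-dependent):

```python
import codecs
import csv
import io

# Made-up sample: a leading zero and a non-ASCII name, the two things
# Excel most reliably mangles.
rows = [["id", "name"], ["001", "Zoë"]]

# Build the file contents in memory; in practice you'd open a real file
# with open(path, "wb") instead.
buf = io.BytesIO()
buf.write(codecs.BOM_UTF16_LE)  # explicit BOM so Excel detects the encoding
text = io.TextIOWrapper(buf, encoding="utf-16-le", newline="")
writer = csv.writer(text, delimiter="\t")  # tab-separated, not comma
writer.writerows(rows)
text.flush()

data = buf.getvalue()  # bytes ready to be written to disk
```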

I wrote a Stack Overflow answer like this. Only Mac Excel had this issue, if I recall correctly.


To be fair, the latest version of Excel has improved handling of CSV files immensely.

It isn't just Unicode, Excel has been like this for every version I can remember, and I've used v5.0, possibly earlier versions (1992/93).

Whenever you have the option, use tab-delimited if it ever has to go into Excel reliably.

For ecommerce sites it is so common to want to go from Excel to UTF8 csv that a lot of people use Libre/Open Office to open and save the Excel file. Then you can do safe things like double-quoting all fields, and choosing your encoding. Perhaps more systems should have Excel upload rather than insisting on UTF-8 csvs that the most common spreadsheet software has never been able to produce?
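If you control the producing side, the stdlib can already do the "safe things" mentioned here; a minimal sketch (sample data made up) that double-quotes every field and pins the encoding explicitly:

```python
import csv
import io

# Made-up rows; "1,99" is a price using a comma as the decimal separator.
rows = [["sku", "price"], ["A-1", "1,99"]]

out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL)  # double-quote every field
writer.writerows(rows)
csv_text = out.getvalue()

# Encode explicitly instead of relying on the platform default.
csv_bytes = csv_text.encode("utf-8")
```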

Maybe the most common spreadsheet software should get its act together and support this simple standard?

Excel is mostly responsible for wrecking any attempt at a "standard" CSV, not that there ever was much of a chance.

If I remember correctly, non-Anglo CSVs using a semicolon as a delimiter are entirely its fault.

It's not just non-Anglo .. the semicolon is often used because data often contains commas, and it's just easier to swap your delimiter than to encapsulate your fields properly ... even then it's a simple matter to use the alternative delimiter, or to convert with a simple search and replace (once you've added encapsulation to the relevant fields, of course).

Though CSV directly implies commas I find it helpful to consider it a synonym for “delimited text”. The specifics of the delimiter don’t bother me so much so long as the format is consistent.

Not saying it's not useful, just that this arose from them not using a hard-coded comma, but referencing some default separator, that happens to be the semi-colon in some locales. I might be wrong and this could have an older precedent.

Now, whether it would've been better to stick with the comma and just properly quote all the values in locales where the comma is a decimal separator (and thus often appears in spreadsheet columns) is a moot point; the damage has been done.

Personally I just call it "character separated values" anyway. I quite like tab-separated values. Makes it okay to read quite often and I think it's the default output of Postgres' COPY command.

But wouldn't it help the non technical user more if Magento, Shopify etc supported the much more widely used standard of xlsx? Then people don't have to find out what UTF-8 even is. Library support for reading Excel docs is pretty good now, isn't it?

I wish Excel did support csv better, but we are in the .001% of Excel users who even know that it doesn't do a good job.

XLSX isn't actually a standard in a meaningful sense. So while support for whatever parts are standard is "pretty good" how does that help your users when some of their sheets don't work correctly? Who wants to explain what's wrong and that nobody can fix it but er... maybe better luck next time?

XLSX (Office Open XML) is full of safety valves that let Microsoft's existing products continue doing whatever poorly documented or undocumented stuff they were doing previously. Microsoft didn't want to have to go back to features which worked in Office already and either rip them out or re-implement them, and it didn't have documentation for those features that anybody else would be able to implement‡. So Office Open XML just says in those cases well here's a blob of data and good luck unless you're Microsoft Office.

This is tolerable for exporting to Excel. I can emit compliant Office Open XML that gets my numbers into XL reliably. So that's nice.

But when importing from Excel you're fighting that impedance mismatch. Rather than explain to users "Something about your document is incompatible and I swear it's Microsoft's fault" it's just better to say "Use CSV".

‡ e.g. suppose there's a line in Excel which defines a function FOO() by calling into some particular Windows DLL. Well that's not a useful thing to standardise. So do you call the relevant MS department and ask them to paste all their documentation for that DLL into your "spreadsheet" standard? No. You write "Implementation defined" and it becomes a black box.

XLSX is not remotely the same as CSV. CSV is way more widely used than you think under the hood of stuff and is a muuuuuch more slimmed-down format. It's literally just values separated by a delimiter. XLSX also isn't a standard. It'll keep changing.

That's one of the reasons Microsoft tried to shoehorn in their own trash open document format that would fit them and let them keep backwards compatibility with stone-age versions of their office suite. https://www.computerweekly.com/news/2240225262/Microsoft-att...

It's like saying everyone should have adapted to Internet Explorer rather than following proper web standards. Even if XLSX were some kind of standard, it's a bad idea, because at the end of the day MS shouldn't be trusted with those.

At a previous company, someone decided to experiment with putting a comma in a text field submitted to an internal service that for some godforsaken reason spat out CSV files that a lot of other teams consumed. I forget the exact entry in the field but it was something like "I wonder, how does this handle commas".

Turns out it didn't handle them well. It didn't escape them or anything. Out spits this line in the CSV that has more columns than expected, and everything from that field onwards was off by one. Most teams were lucky: their code barfed when presented with a line with too many fields. Some teams were unlucky and it didn't. How unlucky _they_ were varied somewhat based on how much data validation they did.
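For contrast, what proper escaping looks like: the stdlib csv module quotes any field containing the delimiter, so the round trip preserves the column count (sketch, hypothetical data):

```python
import csv
import io

# Hypothetical reconstruction of the offending field.
field = "I wonder, how does this handle commas"
row = ["user123", field, "2020-01-13"]

out = io.StringIO()
csv.writer(out).writerow(row)  # QUOTE_MINIMAL quotes the field with the comma
serialized = out.getvalue()

# Reading it back yields the original three columns, not four.
parsed = next(csv.reader(io.StringIO(serialized)))
```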

That caused a big headache for a lot of people trying to recover from the actions of the genius who thought to do that against the production endpoint rather than the test endpoint.

That incident pretty much solidified my reputation in the team as some kind of weird voodoo type of chaos monkey, because my shifts _always_ had a significantly higher sev2 rate caused by things failing that I'd not touched or had any influence over.

Better an issue raised by an engineer than that same issue caused by an end user :shrug:

The engineer technically was an end user. It was a service that could get called as an indirect result of customer action, but without any customer supplied information passed along, which is part of the reason it didn't do any input clean-up (at least according to the post-incident report). They assumed they could trust people not to screw things up.

To be fair, it had been running for a long time that way without anyone doing that to them.

All good, I hope my comment didn't come off as snarky because I have 110% been in similar environments!

This is actually frustratingly common with off-the-shelf ERP 'integrations'. Things like courier labelling software come to mind as culprits. And you are stuck between trying to get UPS to fix a system that they give away free, or training sales operators to write addresses properly.

I love the ones that present a REST API but still can't handle commas, because they have wrapped some ancient importer in a front end that simply writes the request data to a CSV! I had to write one once, but I double-quoted all fields and had to translate UTF-8 back to some other standard the old ERP used! It was junk, but I believe it's still chugging away years later...

Sounds like good system testing to me!

EDIT or absence thereof

That is hilarious, amazing, sad, and begging for a blog post of someone walking through the nightmare you described.

One in five genetics papers contains errors thanks to Microsoft Excel https://genomebiology.biomedcentral.com/articles/10.1186/s13...

Blame the calculator...'misuse of Microsoft Excel'

Nature has the highest percentage of affected papers. Huh.

Funny enough, this is the exact problem I had today at work. The easiest way to fix it is to add a tab character ("\t") somewhere within the value (most likely at the end) and Excel will not reformat it. For example: "value\t". Another way is ="value" (which in a CSV file is escaped as "=""value""").
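Both tricks can be generated programmatically; a sketch (the value is made up) showing how the stdlib csv writer serializes them -- note that the formula variant gets its inner quotes doubled automatically:

```python
import csv
import io

value = "00123"  # made-up value with leading zeros Excel would strip

out = io.StringIO()
writer = csv.writer(out)
writer.writerow([value + "\t"])     # trick 1: trailing tab keeps it as text
writer.writerow(['="%s"' % value])  # trick 2: a formula returning the literal text
serialized = out.getvalue()
```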

This seems kind of philosophically similar to FTFY, https://github.com/LuminosoInsight/python-ftfy, except for CSV files instead of botched text strings.

I knew of ftfy already but I would love some more examples of these kinds of data cleanup libraries; it's always super useful.

Not in any particular order, but here is a list of really useful text butchering utilities, that I keep in the same grab-bag as ftfy: csvclean[1], unidecode[2], transliterate[3], charset-normalizer[4], cchardet[5], phonenumbers[6][7], iconv[8], enca/enconv[9].

[1] https://csvkit.readthedocs.io/en/1.0.3/scripts/csvclean.html

[2] https://pypi.org/project/Unidecode/

[3] https://pypi.org/project/transliterate/

[4] https://pypi.org/project/charset-normalizer/

[5] https://pypi.org/project/cchardet/

[6] https://pypi.org/project/phonenumbers/

[7] which is a python port of Google's libphonenumber: https://github.com/google/libphonenumber

[8] https://linux.die.net/man/1/iconv

[9] https://linux.die.net/man/1/enca

This software (Trifacta Wrangler) was demoed at NeurIPS this year and looked very impressive. https://www.trifacta.com/products/wrangler-editions

There is also the non-maintained version that the company spun out from:


I learned a while ago about a way to provide a schema for CSV files. If you publish CSV data, it may be useful to provide a schema for it [1].

Sounds like CleverCSV could determine and generate a schema automatically given a CSV file.

1: https://github.com/theodi/csvlint.rb#json-table-schema-suppo...

The problem mostly with CSV, beyond the lack of standards, the character-encoding issues, the lack of compression, the inability to read or write it efficiently, and the trouble handling multiline free-form text, is that ultimately the structure is in-band with the data, which is what makes all of those problems possible.

> The problem mostly with CSV... is that ultimately the structure is inband with the data

I wonder if for 90% of users this seems like a critical UI feature!
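The in-band structure problem is easy to demonstrate: the same bytes yield different tables depending on which dialect the consumer guesses (hypothetical two-line file):

```python
import csv
import io

# Hypothetical file: is the delimiter ';' or ','? The bytes alone can't say.
data = "a;b,c\n1;2,3\n"

with_semicolon = list(csv.reader(io.StringIO(data), delimiter=";"))
with_comma = list(csv.reader(io.StringIO(data), delimiter=","))
# Same input, two different tables.
```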

The only issue I've ever had with python's CSV module is that it's slow. Is this any faster?

Hi, I'm the author of CleverCSV. Although the package was initially written to mainly improve the accuracy of automated dialect detection, we ended up changing some aspects of the C parser as well (most notably increasing the look-ahead to two characters). This has made the parsing somewhat faster than the builtin csv package (around 10-20%).
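For context, the stdlib's own dialect detection, which CleverCSV's detector aims to improve on, looks like this (sample data made up; this shows the baseline `csv.Sniffer`, not CleverCSV's algorithm):

```python
import csv

# Made-up semicolon-delimited sample.
sample = "name;age;city\nalice;30;london\nbob;25;paris\n"

dialect = csv.Sniffer().sniff(sample)  # frequency-based delimiter guess
rows = list(csv.reader(sample.splitlines(), dialect))
```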

TSV --> Problem solved! Has anyone ever stumbled upon a cell entry that needed to contain an actual tab character? No!!! The tab character's entire raison d'être is to separate fields, whereas the comma separates words. Some read the "C" in CSV as "Character" rather than "Comma", but with TSV the intent is 100% clear. The MIME type is explicit: "Note that fields that contain tabs are not allowable in this encoding." This is a vanishingly small price to pay for a parser as simple as split("\t")...

[1] http://www.iana.org/assignments/media-types/text/tab-separat...
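Under that MIME-registration assumption (no tabs inside fields), the parser really is a couple of lines (sketch):

```python
# Strict TSV under the text/tab-separated-values registration:
# fields never contain tabs, a newline terminates a record.
def parse_tsv(text):
    return [line.split("\t") for line in text.splitlines()]

table = parse_tsv("name\tage\nalice\t30\n")
```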

What problem is solved, exactly? How do you parse the messy CSV file you've been handed, in order to convert it to TSV?

My point is simply that TSVs are just as straightforward as CSVs but suffer none of the issues. So why are we not making TSVs to begin with?

Nobody (or very few people) processing CSVs is doing it because they like them. You do it because that's how someone else created the data - someone you don't have any control over, like a government agency, financial institution, piece of obscure software, whatever.

How does this compare to Pandas?

Pandas uses the Python csv module under the hood. Most of the code in the read_csv method is used to get the data from the csv file and load it into data frames.

Are you sure? Wes McKinney rewrote the Pandas CSV parser way back in late 2012 because the batteries-included version was slow and memory-hungry. Did Python upstream his version later on, or replace it with something faster? (I'd be surprised if they did either, but I'd be curious to know.)

It's configurable. The default parser is written in C, but it's possible to use a parser based on the csv module if you need things like separators specified by a regex.

This project appears to have a dependency on pandas.


That dependency is merely a wrapper to allow CleverCSV to load data directly into a Pandas data frame. It doesn't require pandas just for loading csv data.
