
CSVKit: CSV utilities that includes csvsql, csvgrep, csvstat, and more - ConceitedCode
http://csvkit.readthedocs.org/en/0.7.3/
======
kbenson
I like fsql[1], as it allows me to combine multiple CSV, TSV, LTSV, JSON, and
YAML files as SQL tables and query against them. Very useful.

[1]: [https://metacpan.org/pod/distribution/App-fsql/bin/fsql](https://metacpan.org/pod/distribution/App-fsql/bin/fsql)

------
rwmj
csvtool:

[https://forge.ocamlcore.org/plugins/scmgit/cgi-bin/gitweb.cgi?p=csv/csv.git;a=blob;f=examples/csvtool.ml;h=950e9aa6705c45f1315b285d055033e42b5d9ea4;hb=HEAD#l512](https://forge.ocamlcore.org/plugins/scmgit/cgi-bin/gitweb.cgi?p=csv/csv.git;a=blob;f=examples/csvtool.ml;h=950e9aa6705c45f1315b285d055033e42b5d9ea4;hb=HEAD#l512)

You will already have this program in your Linux distro since I wrote it about
a decade ago and it has been maintained by a small team ever since then. It
also handles all the quirks of Excel CSV files and is used extensively in
production.

~~~
zmmmmm
> You will already have this program in your Linux distro

Yum disagrees on CentOS 6.4.

~~~
jacktasia
It's actually in the package "ocaml-csv.x86_64" on CentOS 6.4, which does come
up if you "yum search csvtool" (for me anyway)

~~~
zmmmmm
Ah, apologies then. I actually saw that but completely discounted it because
it looked like an OCaml library rather than a command line tool!

------
smackfu
No, everyone must think they can start by splitting on commas and descending
slowly into madness. That is the way of the programmer.

------
denimboy
Also OpenRefine (formerly Google Refine)

    http://openrefine.org/

is like a GUI version of csvkit.

It can do external lookups and fuzzy matching, and supports scripting in
Jython and its own expression language, GREL.

~~~
buckie
I tried out OpenRefine previously on the large CSVs I have to deal with at
work, largely on the recommendation of HN comments. On small to medium-sized
stuff, it's pretty nice. But for larger sets (1GB+) it starts to slow down and
eventually fails to commit changes when the set is big enough.

The only answer I've found to consistently work quickly, with the ability to
explore the data, is the killer combo of iPython Notebook + pandas'
read_csv[0] + a _lot_ of RAM -- 10GB CSV on disk becomes ~20GB in memory
(don't know why yet). When I say quick, I mean 10GB CSV un-cached disk to RAM
in <5Min including fuzzy parsing on dates.

The nice part is, when you have things figured out, you can enable a chunked
reading to get back in-core on machines of lesser specs. Further, you can dump
the pandas DataFrame to HDF, thereafter having ludicrous-speed IO & 'where'
queries.
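The chunked-reading idea can be sketched with just the stdlib csv module (in
pandas itself this is the real `chunksize` parameter of `read_csv`, which
returns an iterator of DataFrames; the helper below is a hypothetical
stand-in for illustration):

```python
import csv
import io

def read_csv_in_chunks(fileobj, chunksize):
    """Yield lists of at most `chunksize` parsed rows, so only one
    chunk needs to be resident in memory at a time."""
    reader = csv.reader(fileobj)
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []
    if chunk:  # flush the final partial chunk
        yield chunk

data = io.StringIO("a,1\nb,2\nc,3\nd,4\ne,5\n")
chunks = list(read_csv_in_chunks(data, 2))  # three chunks: 2 + 2 + 1 rows
```

The same pattern is why the pandas version scales down to smaller machines:
you trade one big in-memory structure for a stream of bounded ones.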

Still though, OpenRefine is much more turn key and feature rich.

[0] [http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.io.parsers.read_csv.html](http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.io.parsers.read_csv.html)

~~~
codygman
[http://hackage.haskell.org/package/csv-conduit](http://hackage.haskell.org/package/csv-conduit)
[http://hackage.haskell.org/package/cassava](http://hackage.haskell.org/package/cassava)

These libraries should be able to work with data that large, though I can't
say whether they meet your requirements yet. I'm not sure exactly what "10GB
CSV un-cached disk to RAM in <5Min including fuzzy parsing on dates" entails.

Namely, I don't know what you mean by date fuzzy parsing or what your output
looks like after. Perhaps I need to open ipython notebook and import pandas ;)

~~~
buckie
I've been working to get Haskell approved at my place of employment for 1.5
years but getting the US Gov't to change is rather hard; I've had to sit the
"tech" people down and explain that javascript != java... with that baseline,
explaining 'Why Haskell' is non-trivial. Come September, after 1.5 years of
effort, I should have 'all the FOSS'. Until then I have to wait.

It's worth noting that iPython Notebook + pandas vs. cassava + conduits (even
with iHaskell Notebook) serve very different ends. If I need to explore how to
do something, I'd use Haskell. But I'm still in phase 2 (phase 1: collect
underpants) and I've yet to find anything as powerful and flexible as the
iPython Notebook + pandas + hdf5 stack that also just works. I can just move
faster with that stack than anything else I've ever seen. That being said, I'm
knowingly deferring bugs to the runtime -- 'tis the cost of python.

If you're unfamiliar with pandas, the "quick vignette" here[0] is decent
enough. The reason pandas is awesome, IMO, isn't actually because pandas is
awesome (which it is) but because it's embedded in a full language. Julia, R,
etc... can do the same stuff (maybe faster), but I wouldn't also want to
program, say, a production web-app in them (though I have high hopes for
Julia).

Fuzzy parsing on dates: pandas by default uses dateutil[1] which is both
awesome and slow.
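The basic idea behind that kind of fuzzy parsing, try plausible
interpretations until one fits, can be sketched in stdlib Python (the format
list here is an illustrative assumption; dateutil handles far more variation
than this):

```python
from datetime import datetime

# A much-simplified stand-in for dateutil's fuzzy parser: try a list
# of common formats until one succeeds.
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"]

def parse_date(text):
    text = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognized date: %r" % text)
```

Trying formats one by one is also part of why fuzzy parsing is slow: every
miss costs a raised-and-caught exception per candidate format.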

10GB CSV... : yeah... it's "fast for python" but pandas is admittedly doing a
lot in that time, namely putting it into a data structure that is very
friendly to time series analysis.

[0] [http://pandas.pydata.org/](http://pandas.pydata.org/)

[1] [https://labix.org/python-dateutil](https://labix.org/python-dateutil)

~~~
codygman
Wow! I wasn't expecting such a response! Awesome. I'll definitely have to
check this out. About dateutil... actually I used it at work today.

As for Haskell and fuzzy dates:

[http://hackage.haskell.org/package/dates](http://hackage.haskell.org/package/dates)

I'm curious what you mean by cassava + conduit (even with IHaskell Notebook)
serving different ends. For instance, where would you use it in place of
pandas/python/hdf5?

------
pessimizer
I'm going to continue to throw it out there when anybody mentions csvkit, but
I can't believe that more people haven't settled on
[http://csvfix.byethost5.com/csvfix15/csvfix.html](http://csvfix.byethost5.com/csvfix15/csvfix.html)

~~~
alaithea
Why? I know and love csvkit, but never heard of csvfix.

~~~
peterwwillis
It looks like csvfix has more features.

~~~
pessimizer
And no python dependency (although that hardly matters.)

edit: This is a better link
[http://neilb.bitbucket.org/csvfix/](http://neilb.bitbucket.org/csvfix/)

------
joncooper
Do any of y'all have an opinion on which of the tools mentioned here is the
most robust to junky input?

~~~
alaithea
That depends on what you want to do with your junky input. And if it's junky
enough, no tool is going to be able to divine what you want to do with the
data until you do some manual cleanup. However, just to put it out there,
csvkit has a utility, csvclean, that will split a file based on which lines
are valid csv and which are not, allowing you to more quickly dive into what's
wrong with the bad lines.
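The triage csvclean automates can be sketched with the stdlib csv module
(checking only the column count is a simplifying assumption; the real tool
does more, and the sample data is invented):

```python
import csv
import io

def split_good_bad(fileobj, expected_cols):
    """Partition CSV rows by whether they have the expected column
    count -- roughly the valid/invalid split csvclean performs."""
    good, bad = [], []
    for row in csv.reader(fileobj):
        (good if len(row) == expected_cols else bad).append(row)
    return good, bad

raw = io.StringIO('a,b,c\n1,2,3\n1,2\n"x,y",2,3\n')
good, bad = split_good_bad(raw, 3)  # the short row lands in `bad`
```

Note the quoted `"x,y"` row still counts as valid: a real CSV parser sees
three fields there, where a naive comma count would see four.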

~~~
joncooper
That alone would be helpful. Thanks!

------
Malarkey73
I'm not totally sure why csvgrep, csvsort, csvcut, and csvjoin are needed as
replacements for grep, sort, cut, or join?

The syntax doesn't seem noticeably clearer?

And - without testing - I presume csvkit in Python is a bit slower than the
GNU coreutils in C?

~~~
acdha
> The syntax doesn't seem noticeably clearer?

From the very first example:

    
    
        csvgrep -c 1 -m ILLINOIS 
    

using coreutils grep you'd need something like this:

    
    
        grep -E '^ILLINOIS,' 2009.csv 
    

Oh, wait, that doesn't work because the file was last saved by something which
quotes it.

    
    
        grep -E '^"?ILLINOIS"?,'
    

except that would produce unexpected results with 'ILLINOIS"' so it really
needs to be something like:

    
    
        grep -E '^(["]{0,1})ILLINOIS\1,'
    

Bear in mind that this is the _simplest_ possible case and doesn't even touch
on issues like quoting in the shell or needing to handle files which have
embedded separators as data values (imagine what happens when our erstwhile
grep data-miner needs to check the stats for `WASHINGTON, DISTRICT OF
COLUMBIA`…).
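The embedded-separator hazard is easy to demonstrate with Python's stdlib csv
module (the sample line below is invented for illustration):

```python
import csv
import io

line = '"WASHINGTON, DISTRICT OF COLUMBIA",2009,42'

# Naive comma splitting tears the quoted state name in half,
# yielding four fields instead of three:
naive = line.split(",")

# A real CSV parser honors the quoting and yields three fields:
parsed = next(csv.reader(io.StringIO(line)))
```

Every regex-based approach eventually reinvents (a buggy subset of) that
parser, which is the argument for purpose-built tools.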

In all but the most trivial cases it's safer and easier to use tools designed
for the job. csvkit also has the very nice property that it's callable as a
Python library so when you outgrow a simple shell processing pipeline you
could migrate your notes / scripts to a full Python program without having to
retest everything related to basic file I/O, unicode, etc. which you would
otherwise need to do when switching readers.

(Bear in mind that the author works in data journalism – the target user is
not a grizzled Unix expert but someone who has a bunch of CSV files and a full
time job which is not shell scripting)

> And - without testing - I presume csvkit in Python is a bit slower than the
> GNU coreutils in C?

Perhaps but it'd be unlikely for it to be noticeable for n less than millions
on a remotely modern system – the Python regular expression engine has quite
decent performance and if that became an issue, PyPy will even JIT them for
you. In the very few cases I've seen where there is a noticeable difference it
was always because the Python version was decoding Unicode and the shell tools
were running with LC_ALL=C, which meant that corrupt data made it further
before being caught and, in some cases, either failing to match all of the
records or subtly corrupting things by not extracting all of a combined
character, etc.

For the target use-case, however, all of this is likely to be many, many
orders of magnitude less than the time most people would spend debugging
regex patterns.

~~~
Malarkey73
I take your point, and those other well-made points about line continuations
below -- but you're over-egging it here, as

cut -d, -f1 2010.csv | grep ILLINOIS

works just fine for this data. I emphasise again - I take your overall point
though.

~~~
acdha
Yeah, I'm more sensitive to this class of errors now that I live in DC since
the "Washington, District of Columbia" format is common enough to show up
sporadically.

------
fsiefken
Would be nice if a tool works on querying markdown/org-mode tables...
correlations, averages and sums would be nice too.

------
dinedal
I'd love to add similar features to textql[1]. I'm currently working on a v2
that does a lot more, but right now it fills a gap CSVKit doesn't: the direct
execution of SQL on flat files.

[1]: [https://github.com/dinedal/textql](https://github.com/dinedal/textql)
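The core trick behind that kind of tool can be sketched with Python's stdlib
sqlite3 and csv modules (textql itself is written in Go, so this is not its
implementation; the column names and sample rows are made up for
illustration):

```python
import csv
import io
import sqlite3

# Invented sample data standing in for a flat file on disk.
raw = io.StringIO("state,year,total\nILLINOIS,2009,12\nOHIO,2009,7\n")
rows = list(csv.reader(raw))
header, body = rows[0], rows[1:]

# Load the file into a throwaway in-memory SQLite table,
# then run plain SQL against it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (%s)" % ", ".join(header))
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", body)

result = conn.execute(
    "SELECT state, total FROM t WHERE year = '2009' ORDER BY state"
).fetchall()
```

Everything lands as text unless you declare column types, which is why the
query compares against the string '2009'; real tools usually sniff types
first.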

