
Command-line tools for data science - robdoherty2
http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html
======
bcantrill
Glad to see that we're not the only ones who are emphatic believers in
applying the Unix philosophy to data science! Those interested in this may
also be interested in Mark Cavage and Dave Pacheco's presentation last week at
Surge on Manta[1], a new storage service that we have developed that draws
very direct inspiration from Unix -- and allows you to use the tools mentioned
by Jeroen across very large data sets. awk FTMFW!

[1] [http://us-east.manta.joyent.com/mark.cavage/public/surge2013...](http://us-east.manta.joyent.com/mark.cavage/public/surge2013_manta_final.pdf)

~~~
mathattack
An open question - why do you prefer doing this in the command line versus via
a scripting language like Python? I get the piping philosophy, but why one
versus the other?

~~~
joelhaasnoot
Python is slow - parsing 10GB of logging works best with awk, grep, etc.

~~~
kyzyl
I would say that perhaps parsing _logging_ works best when using awk, grep and
the like, because that's more or less what they were designed for. But not
everything is unix logs, and not everything is over 10GB. Having said that,
python can absolutely handle 10GB data sets. In fact with things like PySpark,
you can really go much bigger.

~~~
joelhaasnoot
You're right - in the end my solution for this particular project was using
grep and awk to parse the loglines into a CSV-ish format. That was then
interpreted by Python and matplotlib to create beautiful graphs.
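
A minimal sketch of that kind of pipeline (the log format and field positions
are made up for illustration): keep only the error lines, pull out a timestamp
and a response time with awk, and write CSV for the plotting script to read.

    grep ' ERROR ' app.log | awk -v OFS=',' '{ print $1, $2, $NF }' > errors.csv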

------
mjn
I like Google's crush-tools, which works on delimited data (e.g. tab-
delimited), a somewhat simpler and faster format than CSV. Lots of the built-
in Unix tools also work on delimited data (cut, join, sort, etc.), but crush-
tools fills in some gaps, like being able to do a 'uniq' on specific fields,
do outer joins, sum columns, etc.:

[https://code.google.com/p/crush-tools/](https://code.google.com/p/crush-tools/)
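
For a sense of those gaps, here is roughly what two of them look like with
only the standard tools on tab-delimited data (the field numbers are just for
illustration): dedup on field 2, and sum field 3.

    sort -t $'\t' -k2,2 -u data.tsv                       # 'uniq' on field 2 only
    awk -F '\t' '{ s += $3 } END { print s }' data.tsv    # sum column 3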

~~~
sturadnidge
They look interesting, thanks! Any experience using them over GB or TB
datasets?

~~~
mjn
I've used them on tens-of-GB datasets (not TB), and they're quite fast
_except_ for those implemented in Perl, which is kind of a hidden gotcha. For
example, calcfield is barely usable, because it loops a perl expression over
every record in the file. But things like funiq and cutfield are fast, at
least as fast as the traditional Unix tools. And if you have pre-sorted data,
aggregate2 is a nice aggregation tool for larger-than-memory datasets.

~~~
sliverstorm
That's interesting. I've written Perl tools that apply pattern matching to
several-GB flat files, and while they were horrifyingly slow at first, I was
able to get the runtime down to about a minute in the average case. Honestly,
I think I/O was a greater limiting factor than Perl's processing speed the
whole time.

------
aficionado
A very powerful command line tool for Machine Learning:
[http://bigmler.readthedocs.org/en/latest/](http://bigmler.readthedocs.org/en/latest/)

Creating a predictive model is as simple as: bigmler --train < data.csv

Or create a predictive model for Bitcoin volume, pulling online data from
Quandl and parsing it with jq, in just one line:

    curl --silent "http://www.quandl.com/api/v1/datasets/BITCOIN/BTCDEEUR.json" |
      jq -c ".data" |
      bigmler --train --field-attributes <(count=0; for i in `curl --silent "http://www.quandl.com/api/v1/datasets/BITCOIN/BTCDEEUR.json" | jq -c ".column_names[]"`; do echo "$count, $i"; count=$[$count+1]; done) --name bitcoin

More info here: [http://blog.bigml.com/2013/01/31/fly-your-ml-cloud-like-a-ki...](http://blog.bigml.com/2013/01/31/fly-your-ml-cloud-like-a-kite-with-bigmler-the-command-line-tool-for-machine-learning/)

~~~
diziet
Hmm -- the model did not seem to spot something a simple linear regression
did.

~~~
aficionado
In this particular case, I think you are right. But isn't it powerful to have
it at the command-line level? It's great to be able to quickly create models,
and you can also use them to make predictions at the command line.

------
harrytuttle
Nice tools.

You can actually do most of that with vanilla PowerShell and Excel, believe it
or not, but it's much fuglier and you spend most of your time working around
edge cases.

The only thing that scares me about this, though, is that JSON is a terrible
format for storing numbers in. There is no way to specify a decimal type, for
example, so aggregations and calculations have no implied precision beyond
floating-point values.
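
A quick illustration with jq (the exact digits may vary by version, but the
rounding is the point): the JSON number is parsed as an IEEE 754 double, so
even a trivial aggregation drifts.

    $ echo '{"price": 0.1}' | jq '.price * 3'
    0.30000000000000004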

------
th0ma5
The ability to script the R stuff with pipes and whatnot is very welcome to
me. I tried using R as a part of a system of normalizing package installs
across environments. It was sort of kludgey but ultimately worked okay.
Setting up the environment as the first part of any script you push into R is
necessary, and that makes sense. Python and Java seem to have more support for
this kind of thing (environment swapping) on the command line, or at least
perhaps I wasn't looking the right places perhaps for R. It has been a year
since I've messed with this though.

------
gerad
I'm surprised that the post didn't call out the unix built-ins cut, sort,
uniq, and cat. There's an amazing amount of data processing you can do with
just those commands.

~~~
sturadnidge
He linked to another post that covers those:
[http://www.gregreda.com/2013/07/15/unix-commands-for-data-sc...](http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/)

------
cgrubb
Nice tools. I see some stuff I might add to my own personal repo:
[https://github.com/clarkgrubb/data-tools](https://github.com/clarkgrubb/data-tools)

Would be nice if some of these tools became standard at the command line. I
don't know about "sample" though, since that can easily be implemented in awk:

    
    
        awk 'rand() < 0.2' /etc/passwd

~~~
jeroenjanssens
That's a nice collection of tools you have there. I'll add it to the post.

------
dergachev
These are great!

I wrote a quick Ruby script that converts one or more CSV files into a SQLite
db file, so you can easily query them.

[https://github.com/dergachev/csv2sqlite](https://github.com/dergachev/csv2sqlite)

~~~
aw3c2
No (real) need for a script; you can import in the sqlite3 command line.
First make the table, then set a .separator, then .import the file. I am not
sure how performant it is though, so handling the transactions yourself might
be better in some cases.
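
Roughly like this (the table, column, and file names are just placeholders):

    sqlite3 people.db 'CREATE TABLE people (name TEXT, address TEXT, city TEXT);'
    sqlite3 -separator ',' people.db '.import people.csv people'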

~~~
mkiwala
The sqlite3 .import command does not handle quoted csv values.

"Doe, John", 1234 Pine St, Springfield

That would be imported as 4 fields, not 3.

~~~
SQLite
Used to be true. But recent versions of SQLite fix this.

------
nonchalance
> Ideally, json2csv would have added the header.

You can raise an issue
([https://github.com/jehiah/json2csv](https://github.com/jehiah/json2csv)) or
just fork and add a flag to include a header line

~~~
daniel-levin
[https://github.com/daniel-levin/json2csv](https://github.com/daniel-levin/json2csv)

Obligatory here-I-did-it

~~~
daniel-levin
Update: now merged with the mainline repo

------
jeroenjanssens
Suggestions are more than welcome. I'm currently keeping the post up-to-date
with suggested tools and repositories.

~~~
susi22
[http://www.hdfgroup.org/HDF5/doc/RM/Tools.html](http://www.hdfgroup.org/HDF5/doc/RM/Tools.html)

[http://www.hdfgroup.org/tools5desc.html#1](http://www.hdfgroup.org/tools5desc.html#1)

The first thing I do when I deal with data that has more than, let's say,
10,000 rows is to put it in the HDF file format and work with that. It saves a
ton of time while developing a script. I had a Python script do a histogram
and it ran in ~15 sec for a file with 100k rows. After converting the file to
HDF first, it ran in ~0.5 sec. The import in Python is also much shorter (two
lines).

HDF is made for high-performance numerical I/O. It's great, and you can query
several structures and even take slices of arrays on the command line (with
the h5tools).

It's also widely used by Octave, Python, R, Matlab... And there's no drawback,
since you can just pipe it into existing command-line tools with h5dump.
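
For example, something along these lines (the dataset name is hypothetical,
and h5dump's text output needs a little massaging before it goes into awk and
friends):

    h5ls data.h5                      # list the datasets in the file
    h5dump -d /measurements data.h5   # dump one dataset as text to stdout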

HDF5:

[http://www.hdfgroup.org/HDF5/RD100-2002/HDF5_Performance.pdf](http://www.hdfgroup.org/HDF5/RD100-2002/HDF5_Performance.pdf)

[http://www.hdfgroup.org/HDF5/RD100-2002/HDF5_Overview.pdf](http://www.hdfgroup.org/HDF5/RD100-2002/HDF5_Overview.pdf)

~~~
twistedpair
Wow, those benchmarks are old. 195MHz test machine?

Still, 15K rows is not much. Just did that 100K rows (read/sum) bit in Octave.
Took about 1/4 second to extract and sum. I assume HDF rocks for much larger
sets.

------
revertts
You should check out RecordStream. It has a lot of convenient processing built
around JSON, and a lot of ways to get data to/from JSON.

[https://github.com/benbernard/RecordStream/tree/master/doc](https://github.com/benbernard/RecordStream/tree/master/doc)

[https://github.com/benbernard/RecordStream/blob/master/doc/R...](https://github.com/benbernard/RecordStream/blob/master/doc/RecordStreamStory.pod)

~~~
ffk
I second this suggestion. It doesn't just slurp and serialize data; it's also
capable of filtering, transforming, aggregating (+ stats), and pretty-printing
streams of JSON records.

------
gav
If you are working with XML, XMLStarlet is invaluable:
[http://xmlstar.sourceforge.net/](http://xmlstar.sourceforge.net/)

------
paulgb
Great list. Shameless self-plug: I wrote a tool similar to sample, available
on pip as subsample. It supports the approximate sampling algorithm that yours
does, as well as reservoir sampling and an exact two-pass algorithm that runs
in O(1) space.

Docs:
[https://github.com/paulgb/subsample](https://github.com/paulgb/subsample)
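
For anyone curious what reservoir sampling looks like in practice, here is a
minimal awk sketch of the classic one-pass version (algorithm R), keeping k
random lines in O(k) memory; it's only an illustration, not how subsample is
implemented.

    awk -v k=10 'BEGIN { srand() }
      NR <= k { r[NR] = $0; next }
      { i = int(rand() * NR) + 1; if (i <= k) r[i] = $0 }
      END { for (j = 1; j <= k; j++) print r[j] }' data.csv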

------
zmmmmm
Worth noting that the Rio tool does not seem to account for R's concurrency /
race conditions when running Rscript. Basically, if you run it a lot in
parallel you will get random failures. Although I am sure it is not exactly
meant for high-concurrency situations, it would be good if it accounted for
this.

------
helloTree
Maybe I should take a look at awk/sed, but I never found their languages
orthogonal. For simple scripts I mostly use Python or Haskell via something
like

ghc -e 'interact ((++"\n") . show . length . lines)' < file

EDIT:

I know that it's the same as "wc -l", but using that approach I can also solve
problems that may not be well suited for awk/sed. Or maybe I just have to see
some convincing one-liners.

------
dima55
Here's another that I use all the time:

[https://github.com/dkogan/feedgnuplot](https://github.com/dkogan/feedgnuplot)

It reads data on stdin, and makes plots. More or less anything that gnuplot
can do is supported, which is quite a lot. Realtime plotting of streaming data
is supported as well. This is also in Debian, and is apt-gettable.
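
A typical invocation looks something like this (flags quoted from memory, so
check the README; there is also a --stream mode for the realtime case):

    seq 1000 | awk '{ print $1, $1 * $1 }' | feedgnuplot --lines --domain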

Disclaimer: I am the author

------
vram22
Good thread. This article may be useful for those who want to learn to write
Linux or Unix command-line utilities in C:

Developing a Linux command-line utility:

[http://www.ibm.com/developerworks/library/l-clutil/](http://www.ibm.com/developerworks/library/l-clutil/)

------
peterwwillis
So weird... I guess I didn't know about csvkit (or maybe it wasn't available
for Cygwin?), so I wrote my own a year ago. It supports field names (in
addition to column numbers), so your data can move around and you'll still
reference the right logical data.

~~~
michaelmior
csvkit is awesome. I discovered it by chance a few months ago and I've found
it to be really well written. It correctly handles a lot of weird edge cases
(e.g. newlines in the middle of a record). For the record, csvkit supports
field names as well :)
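
For instance, something like this, selecting and sorting columns by name and
pretty-printing the result (the column names are made up for the example):

    csvcut -c city,population data.csv | csvsort -c population | csvlook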

------
mikeme123
SmallR: [http://github.com/religa/smallR](http://github.com/religa/smallR)
seems to provide quite a bit of flexibility with only a few keystrokes needed.

------
fluidcruft
There is a command-line optimizer. I always forget its name and can never
find it. It is very cool for doing simplex optimizations etc.

I will try yet again to find it and edit.

------
tomrod
This post and comment section are tool's gold. I've been wondering where to
start expanding my toolkit. Now I have a lot of great stuff to learn!

------
mrcactu5
I feel like a dumbass. The two great data science tools I learned for command
line are grep and sed :-)

------
lardissone
Awesome tools.

Also interesting color scheme! Can you share this?

------
kbenson
This approach reminded me quite a bit of what the Mojo::DOM and Mojo::DOM::CSS
modules give you, and the ojo command-line tool for Perl (invoked as perl
-Mojo).

Here's section 5 from that article rewritten:

perl -MList::MoreUtils=zip -Mojo -E 'say
  g("http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio")->dom->find("table.wikitable > tr:not(:first-child)")->pluck(sub{ j {
  country=>$_->find("td")->[1]->text, border=>$_->find("td")->[2]->text,
  surface=>$_->find("td")->[3]->text, ratio=>$_->find("td")->[4]->text } })' | head

    
    
      {"ratio":"7.2727273","surface":"0.44","country":"Vatican City","border":"3.2"}
      {"ratio":"2.2000000","surface":"2","country":"Monaco","border":"4.4"}
      {"ratio":"0.6393443","surface":"61","country":"San Marino","border":"39"}
      {"ratio":"0.4750000","surface":"160","country":"Liechtenstein","border":"76"}
      {"ratio":"0.3000000","surface":"34","country":"Sint Maarten (Netherlands)","border":"10.2"}
      {"ratio":"0.2570513","surface":"468","country":"Andorra","border":"120.3"}
      {"ratio":"0.2000000","surface":"6","country":"Gibraltar (United Kingdom)","border":"1.2"}
      {"ratio":"0.1888889","surface":"54","country":"Saint Martin (France)","border":"10.2"}
      {"ratio":"0.1388244","surface":"2586","country":"Luxembourg","border":"359"}
      {"ratio":"0.0749196","surface":"6220","country":"Palestinian territories","border":"466"}
    

And here's the original command:

curl -s 'http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio' |
  scrape -be 'table.wikitable > tr:not(:first-child)' | xml2json |
  jq -c '.html.body.tr[] | {country: .td[1][], border: .td[2][], surface: .td[3][], ratio: .td[4][]}' | head

Fairly comparable, but there's a whole world of Perl modules for me to pull in
and use with the first one.

------
the_cat_kittles
Some of these are cool, but I don't understand the preference for command-line
tools over writing, say, a short Python script in vim and then executing it. I
see the two methods achieving the same results with about the same amount of
text, except with a script, it's savable and much easier to type.

~~~
gnaritas
> except with a script, it's savable and much easier to type.

That's an odd thing to say, as Python scripts are no more savable than shell
scripts. Nor are Python scripts any easier to type than shell scripts. Also,
as shell scripts tend to pipe commands together, they are multi-process and
thus multi-core by default, unlike Python.

~~~
reustle
I could be wrong, but doesn't each piped command wait for the previous one to
finish (not multi-core)?

~~~
gizmo686
No, both commands run at the same time. What pipe does is pipe the stdout of
one command into the stdin of the other command. Occasionally buffering gives
the impression of waiting, as many commands will detect that they are being
piped (as opposed to outputting straight to a terminal), and not actually send
anything to stdout until either they are finished, or they fill the buffer.
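
A quick way to see the concurrency: cat prints each line as soon as it
arrives; it does not wait for the left-hand side to exit.

    (echo one; sleep 2; echo two) | cat
    # "one" appears immediately; "two" appears about two seconds later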

