Command-line tools for data science (jeroenjanssens.com)
346 points by robdoherty2 on Sept 19, 2013 | hide | past | favorite | 78 comments

Glad to see that we're not the only ones who are emphatic believers in applying the Unix philosophy to data science! Those interested in this may also be interested in Mark Cavage and Dave Pacheco's presentation last week at Surge on Manta[1], a new storage service that we have developed that draws very direct inspiration from Unix -- and allows you to use the tools mentioned by Jeroen across very large data sets. awk FTMFW!

[1] http://us-east.manta.joyent.com/mark.cavage/public/surge2013...

An open question - why do you prefer doing this in the command line versus via a scripting language like Python? I get the piping philosophy, but why one versus the other?

Well, if I can do something in one line (albeit perhaps a long line), awk is my preferred tool. But as soon as things get complicated, one is almost certainly better off in a richer environment -- bash, Python, node.js, etc. So I don't view it as one versus the other, but rather picking the right tool for the job -- which the Unix philosophy very much liberates us to do.
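For instance, a typical "one long line" awk job -- averaging the third column of some made-up whitespace-separated input:

```shell
# Sum the 3rd field and count records; print the mean at end of input.
printf 'a 1 10\nb 2 20\nc 3 30\n' |
  awk '{ sum += $3; n++ } END { if (n) print sum / n }'
# prints 20
```

Once the logic grows beyond a pattern-action pair or two, that's usually the cue to reach for the richer environment.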

Most of the well-known "unix-style" command-line tools, such as grep, sort, etc., actually have very high performance. Their relatively constrained use cases allow the authors to implement decent algorithms and optimizations (e.g. sort uses merge sort, and grep uses all kinds of optimizations: http://lists.freebsd.org/pipermail/freebsd-current/2010-Augu...).

In contrast, when you're building a custom pipeline in a high-level language, you're optimizing for simple solutions and are not likely to get better performance unless you hit an edge case where the standard tools do really poorly.

Actually sort is even better than normal (purely in-memory) merge sort. It looks at available memory and writes out sorted files to merge.


Interesting! So if I understand that article correctly, it's basically doing a multi-phase merge sort where each individual run is stored in a file?
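GNU sort exposes knobs for exactly that behavior: -S caps the in-memory buffer and -T sets the directory where the sorted runs are written before the final merge (the tiny input here is just for illustration):

```shell
# Force a 1 MB memory budget; runs that exceed it spill to /tmp and
# get merged at the end. Output is the fully sorted input.
printf '3\n1\n2\n' > /tmp/nums.txt
sort -S 1M -T /tmp /tmp/nums.txt
```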

This is really a matter of personal style. The point of this approach is that if you want to break out Python or whatever, it integrates with the rest of it just like any of the "built-in" programs.

It's usually quicker for me to iterate on building up a complex program using existing command-line tools -- up to a point. After that point, I switch to something like Node or Python.

One reason it's faster is that they're designed to be composable. They're flexible in just the right ways -- record separators, output formats, and key behaviors (like inverting the sense of a filter or whatever) -- to be able to perform a variety of tasks, but not so flexible that you need a lot of boilerplate, as with more general-purpose languages. They defer unrelated tasks (like sorting) to tools designed for that, keeping concerns separate.

Take an awk script that reads whitespace-separated fields as input and transforms that, adding a header at the top and a summary at the end. awk's got a really nice syntax for these common tasks, and at the end you're left with a program where nearly all of the code is part of the specific problem you're trying to solve, not junk around requiring modules, opening files, looping, parsing, and so on.
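A sketch of that shape, with made-up data -- the BEGIN/body/END structure carries the header, the per-record transform, and the summary, with essentially no boilerplate:

```shell
# Header at the top, each record echoed through, running total at the end.
printf 'alice 3\nbob 5\n' |
  awk 'BEGIN { print "name count" }
       { total += $2; print $1, $2 }
       END { print "total", total }'
```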

It's worth mentioning that a UNIX pipeline is highly parallel and naturally exploits multiple cores, while Python, in my experience, does not.
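Each stage of a pipeline like this (toy) one is a separate process, so the kernel can schedule them on different cores as data streams between them:

```shell
# Three processes run concurrently: uniq -c collapses the sorted stream
# into counts, and the final sort ranks by frequency.
printf 'a\na\nb\n' | sort | uniq -c | sort -rn
```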

Python is slow - parsing 10GB of logging works best with awk, grep, etc.

Python isn't going to beat grep, but it beats awk in a lot of cases. (Cases that awk isn't well suited to, to be fair. Python doesn't beat awk for 99% of what people use awk for.)

It's faster than people think it is. Especially when you add in libraries like pandas, it's fantastic for data analysis.

Of course, by the time you get to using pandas, you have to have everything in memory.

This isn't true of python in general, though. For simpler tasks, you can easily write generators to read from stdin and write to stdout.
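A minimal sketch of that streaming style, driven from the shell: Python reads stdin lazily, one line at a time, instead of slurping everything into memory (input made up here):

```shell
# Keep only even numbers; sys.stdin is itself a lazy line iterator.
printf '1\n2\n3\n4\n' |
  python3 -c 'import sys
for line in sys.stdin:
    if int(line) % 2 == 0:
        sys.stdout.write(line)'
# prints 2 and 4
```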

I'm not saying that it's better for things like log parsing, but for more complicated ASCII formats, I'd far rather use python than awk.

That having been said, people who don't learn and use awk are missing out. It's a fantastic tool.

I've just seen one too many unreadable, 1000-line awk programs to do something that's a dozen lines of python.

I would say that perhaps parsing logging works best when using awk, grep and the like, because that's more or less what they were designed for. But not everything is unix logs, and not everything is over 10GB. Having said that, python can absolutely handle 10GB data sets. In fact with things like PySpark, you can really go much bigger.

You're right - in the end my solution for this particular project was using grep and awk to parse the loglines into a CSV-ish format. That was then interpreted by Python and matplotlib to create beautiful graphs.

I hear you on this. I'm very interested in non-performance based reasons. Some of the python libraries are optimized for big data too, no?

I guess the reason I ask is much of the "manipulate and check" that I do happens before I get things to where a one liner will work. That could very well be a programmer competency issue on my part though. :-)

Use pypy. At 10 GB the bottleneck will probably be the storage.

For myself: specialized tools are useful, but the commandline utility set lends itself, as Kernighan and Pike noted in The UNIX Programming Environment over a quarter century ago, to very rapid prototyping and developing. You can accomplish a great deal in a few well-understood lines of shell (which is to say, UNIX utilities).

Yes, sometimes the full power of a programming or scripting language is what you need, and in cases it may execute faster (though you may well be surprised -- the shell utilities are often highly optimised), but if a one-liner, or even a few brief lines can accomplish the task, why bother with the heavier tool?

Command-line tools are just faster in a lot of cases and don't disrupt your flow as you work in the terminal.

However, that being said, I do notice a disturbing trend of command-line warriors trying to do absolutely everything on the command line, resulting in spending 10 minutes constructing a perfect one-liner when they could have just written a python/perl script in 2 minutes.

For anything beyond a chained set of greps and cuts, I'll use Perl for a one-liner. This has the benefit of allowing me to easily transfer to a script if it becomes unwieldy. It's just a case of pasting in the one-liner, adding newlines at semicolons, and fixing up a few characters at the top and bottom of the file.

I like Google's crush-tools, which works on delimited data (e.g. tab-delimited), a somewhat simpler and faster format than CSV. Lots of the built-in Unix tools also work on delimited data (cut, join, sort, etc.), but crush-tools fills in some gaps, like being able to do a 'uniq' on specific fields, do outer joins, sum columns, etc.:
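When crush-tools isn't installed, some of those gaps (per-field uniq, column sums) can be approximated with stock awk; the tab-delimited sample here is made up:

```shell
# Per-key sum over field 1 of tab-delimited input, sorted for stable output.
printf 'a\t1\na\t2\nb\t3\n' |
  awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' |
  sort
```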


They look interesting, thanks! Any experience using them over GB or TB datasets?

I've used them on tens-of-GB datasets (not TB), and they're quite fast except for those implemented in Perl, which is kind of a hidden gotcha. For example, calcfield is barely usable, because it loops a perl expression over every record in the file. But things like funiq and cutfield are fast, at least as fast as the traditional Unix tools. And if you have pre-sorted data, aggregate2 is a nice aggregation tool for larger-than-memory datasets.

That's interesting. I've written Perl tools that apply pattern-matching to several-GB flat files, and while it was horrifyingly slow at first, I was able to get the performance down to a minute in the average case. Honestly the whole time I think I/O was a greater limiting factor than Perl's processing speed.

A very powerful command line tool for Machine Learning: http://bigmler.readthedocs.org/en/latest/

Creating a predictive model is as simple as: bigmler --train < data.csv

or create a predictive model for Bitcoin volume taking online data from Quandl and parsing it with jq in just one line.

curl --silent "http://www.quandl.com/api/v1/datasets/BITCOIN/BTCDEEUR.json" | jq -c ".data" | bigmler --train --field-attributes <(count=0; for i in `curl --silent "http://www.quandl.com/api/v1/datasets/BITCOIN/BTCDEEUR.json" | jq -c ".column_names[]"`; do echo "$count, $i"; count=$((count+1)); done) --name bitcoin

More info here: http://blog.bigml.com/2013/01/31/fly-your-ml-cloud-like-a-ki...

Hmm -- the model did not seem to spot something a simple linear regression did.

In this particular case, I think that you are right. But isn't it powerful to have it at the command line level? It's great to quickly create models as you can also use them to make predictions at the command line level.

Impressive. I'll add it to the list.

Nice tools.

You can actually do most of that with vanilla PowerShell and Excel, believe it or not, but it's much fuglier and you spend most of your time working around edge cases.

The only thing that scares me about this though is that JSON is a terrible format for storing numbers in. There is no way of specifying a decimal type for example so aggregations and calculations have no implied precision past floating point values.

The ability to script the R stuff with pipes and whatnot is very welcome to me. I tried using R as part of a system for normalizing package installs across environments. It was sort of kludgey but ultimately worked okay. Setting up the environment as the first part of any script you push into R is necessary, and that makes sense. Python and Java seem to have more support for this kind of thing (environment swapping) on the command line, or at least perhaps I wasn't looking in the right places for R. It has been a year since I've messed with this, though.

I'm surprised that the post didn't call out the unix built-ins cut, sort, uniq, and cat. There's an amazing amount of data processing you can do with just those commands.
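For example, the classic frequency-count idiom built from just those built-ins (colon-delimited sample data made up here):

```shell
# Extract the first field, sort so duplicates are adjacent, collapse
# them with counts, then rank by count.
printf 'u1:x\nu2:y\nu1:x\n' |
  cut -d: -f1 | sort | uniq -c | sort -rn
```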

He linked another post that covers those http://www.gregreda.com/2013/07/15/unix-commands-for-data-sc...

Nice tools. I see some stuff I might add to my own personal repo: https://github.com/clarkgrubb/data-tools

Would be nice if some of these tools became standard at the command line. I don't know about "sample" though, since that can easily be implemented in awk:

    awk 'rand() < 0.2' /etc/passwd
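If an exact sample size is wanted rather than an approximate fraction, GNU shuf does reservoir sampling out of the box:

```shell
# Draw exactly 3 random lines from the file.
shuf -n 3 /etc/passwd
```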

That's a nice collection of tools you have there. I'll add it to the post.

These are great!

I wrote a quick ruby script that converts one or more CSV files into an SQLite db file, so you can easily query them.


No (real) need for a script; you can import from the sqlite3 command line. First make the table, then set a .separator, then .import the file. I am not sure how performant it is, though, so creating the transactions yourself might be better in some cases.

The sqlite3 .import command does not handle quoted csv values.

"Doe, John", 1234 Pine St, Springfield

That would be imported as 4 fields, not 3.

Used to be true. But recent versions of SQLite fix this.
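With a recent sqlite3, the whole import can be done from the shell: .mode csv makes .import honor quoting, and the table is created from the CSV header row if it doesn't already exist (file names here are made up):

```shell
# A CSV with a quoted, comma-containing field.
printf '"name","street"\n"Doe, John","1234 Pine St"\n' > /tmp/people.csv
rm -f /tmp/people.db
sqlite3 /tmp/people.db <<'EOF'
.mode csv
.import /tmp/people.csv people
SELECT name FROM people;
EOF
```

The SELECT comes back with "Doe, John" as a single field, not two.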

csv2sqlite actually generates the table definitions for you, so I think that's a big improvement when you just downloaded a CSV file with 30 columns in it and need to quickly verify it or extract some data from it.

Also supports parsing multiple CSV files at once, so you can easily do joins.

Depends on how often you have to do this... if it saves time, I'm in

> Ideally, json2csv would have added the header.

You can raise an issue (https://github.com/jehiah/json2csv) or just fork and add a flag to include a header line

Update: now merged with the mainline repo

Nice one!

Suggestions are more than welcome. I'm currently keeping the post up-to-date with suggested tools and repositories.



The first thing I do when I deal with data which has more than --let's say-- 10,000 rows is to put it in HDF file format and work with that. Saves a ton of time while developing a script. I had a python script do a histogram and it ran ~15sec for a file with 100k rows. With converting it first to HDF it ran in ~0.5sec. The import in python is also much shorter (two lines).

HDF is made for high performance numerical I/O. It's great and you can query several structures and even do slices of arrays on the command line (with h5tools).

It's also widely used by Octave, Python, R, Matlab... And you don't have a drawback since you can just pipe it into existing command line tools with a h5dump.




Wow, those benchmarks are old. 195MHz test machine?

Still, 15K rows is not much. Just did that 100K rows (read/sum) bit in Octave. Took about 1/4 second to extract and sum. I assume HDF rocks for much larger sets.

This is also really helpful (pity it's relatively unknown):



http://en.wikipedia.org/wiki/GNU_Core_Utilities (section "Text utilities")


Run "$ info coreutils"

Another shameless plug: http://gibrown.wordpress.com/2013/01/26/unix-bi-grams-tri-gr...

More about Natural Language Processing.

Nice post! I'd like to make a shameless plug for Drake, a 'make for data' workflow tool. https://github.com/Factual/drake

You should check out RecordStream. It has a lot of convenient processing built around JSON, and a lot of ways to get data to/from JSON.



I second this suggestion. It doesn't just slurp and serialize data; it's also capable of filtering, transforming, aggregating (+ stats), and pretty-printing streams of JSON records.

If you are working with XML, XMLStarlet is invaluable: http://xmlstar.sourceforge.net/

Great list. Shameless self-plug: I wrote a tool similar to sample available on pip as subsample. It supports the approximate sample algorithm that yours does as well as reservoir sampling and an exact two-pass algorithm that runs in O(1) space.

Docs: https://github.com/paulgb/subsample

Worth noting that the Rio tool does not seem to account for R's concurrency issues / race conditions when running Rscript. Basically, if you run it a lot in parallel you will get random failures. Although I am sure it is not exactly meant for high-concurrency situations, it would be good if it accounted for this.

Maybe I should take a look at awk/sed but I never found its language orthogonal. For simple scripts I mostly use Python or Haskell via something like

ghc -e 'interact ((++"\n") . show . length . lines)' < file


I know that it's the same as "wc -l", but using that approach I can also solve problems that may not be well suited to awk/sed. Or maybe I just have to see some convincing one-liners.
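A couple of awk one-liners in that spirit, on made-up input -- a column sum and a field-based filter:

```shell
printf 'a 5\nb 12\nc 7\n' | awk '{ s += $2 } END { print s }'  # prints 24
printf 'a 5\nb 12\nc 7\n' | awk '$2 > 6 { print $1 }'          # prints b, c
```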

Here's another that I use all the time:


It reads data on stdin, and makes plots. More or less anything that gnuplot can do is supported, which is quite a lot. Realtime plotting of streaming data is supported as well. This is also in Debian, and is apt-gettable.

Disclaimer: I am the author

Good thread. This article may be useful for those who want to learn to write Linux or Unix command-line utilities in C:

Developing a Linux command-line utility:


So weird... I didn't know about csvkit, I guess (or maybe it wasn't available for cygwin?), so I wrote my own a year ago. It supports field names (in addition to column numbers) so your data can move around and you'll still reference the right logical data.

csvkit is awesome. I discovered it by chance a few months ago and I've found it to be really well-written. It correctly handles a lot of weird edge cases (e.g. newlines in the middle of a record). For the record, csvkit supports field names as well :)

SmallR: http://github.com/religa/smallR seems to provide quite a bit of flexibility with only few keystrokes needed.

There is a command-line optimizer. I always forget its name and can never find it. It is very cool for doing simplex optimizations, etc.

I will try yet again to find it and edit.

This post and comment section are tool's gold. I've been wondering where to start expanding my toolkit. Now I have a lot of great stuff to learn!

I feel like a dumbass. The two great data science tools I learned for command line are grep and sed :-)

Awesome tools.

Also interesting color scheme! Can you share this?

This approach reminded me quite a bit of what Mojo::DOM and Mojo::DOM:CSS modules give you, and the ojo command line tool for perl (called as perl -Mojo).

Here's section 5 from that article rewritten:

perl -MList::MoreUtils=zip -Mojo -E 'say g("http://en.wikipedia.org/wiki/List_of_countries_and_territori... > tr:not(:first-child)")->pluck(sub{ j { country=>$_->find("td")->[1]->text, border=>$_->find("td")->[2]->text, surface=>$_->find("td")->[3]->text, ratio=>$_->find("td")->[4]->text } })' | head

  {"ratio":"7.2727273","surface":"0.44","country":"Vatican City","border":"3.2"}
  {"ratio":"0.6393443","surface":"61","country":"San Marino","border":"39"}
  {"ratio":"0.3000000","surface":"34","country":"Sint Maarten (Netherlands)","border":"10.2"}
  {"ratio":"0.2000000","surface":"6","country":"Gibraltar (United Kingdom)","border":"1.2"}
  {"ratio":"0.1888889","surface":"54","country":"Saint Martin (France)","border":"10.2"}
  {"ratio":"0.0749196","surface":"6220","country":"Palestinian territories","border":"466"}
And here's the original command:

curl -s 'http://en.wikipedia.org/wiki/List_of_countries_and_territori... | scrape -be 'table.wikitable > tr:not(:first-child)' | xml2json | jq -c '.html.body.tr[] | {country: .td[1][], border: .td[2][], surface: .td[3][], ratio: .td[4][]}' | head

Fairly comparable, but there's a whole world of Perl modules for me to pull in and use with the first one.

Some of these are cool, but I don't understand the preference for command-line tools over writing, say, a short python script in vim and then executing it. I see the two methods achieving the same results with about the same amount of text, except with a script, it's savable and much easier to type.

There are two reasons I prefer command-line tools over python/R scripts: 1) most of the time I don't need to save anything at all, as they are one-off data manipulations; 2) IO and control flow are already taken care of. No specifying that you want to read from stdin, no looping, no specifying how you want to print out. For a one-liner action, that is a ton of boilerplate.

> I see the two methods achieving the same results with about the same amount of text, except with a script, it's savable and much easier to type

If I wrote a script for every command I executed to manipulate data I would have about a million small files lying around full of one off commands. Frequently I am just debugging or exploring data and I never want to see the command again - but if I do, it will be in my bash history.

One of Python's annoyances (which are few) is that it doesn't put enough in the default namespace to be convenient from the command line.

Iterative development. It's so easy to press up, add another component to the pipeline, press enter, and see the results. Lather, rinse, repeat. LightTable seems an interesting development in that regard; it has the potential to bring a similar iterative process to "real" programming languages.

You can do that with vim easily; just map something to ! python % ... but yes, I understand what you are saying.

It's equivalent if you look at the command line as a (very popular) REPL, and at the tools as functions.

> except with a script, it's savable and much easier to type.

That's an odd thing to say, as Python scripts are no more savable than shell scripts. Nor are Python scripts any easier to type than shell scripts. Also, as shell scripts tend to pipe commands together, they are multi-process and thus multi-core by default, unlike Python.

I wasn't thinking of making them a shell script, I was thinking of using them directly from the prompt- that makes sense. Come to think of it, I should be writing more shell scripts instead of retyping stuff into the prompt.

Yeah, that's like saying Python sucks for scripting because all your work in the REPL won't be saved. Goes both ways.

That's why ipython is so nice :)

Or better yet, use make.

I could be wrong, but doesn't each piped command wait for the previous one to finish (not multi-core)?

No, both commands run at the same time. What pipe does is pipe the stdout of one command into the stdin of the other command. Occasionally buffering gives the impression of waiting, as many commands will detect that they are being piped (as opposed to outputting straight to a terminal), and not actually send anything to stdout until either they are finished, or they fill the buffer.

No, a pipe ties the output stream of one command to the input stream of another. That means each line of data travels through the pipe separately allowing all processes to run concurrently.

No, try this in a command line: sleep 2 | echo hi
