Useful Unix commands for data science (gregreda.com)
221 points by gjreda on July 15, 2013 | 106 comments

Everyone forgets the brilliant and sometimes crazy BSD ones:

  - column: create columns / tables from input data
  - tr: substitute / delete chars
  - join: like a database join, but for text files
  - comm: like diff, but you can use it programmatically to choose if a line is in one file, or another, or both.
  - paste: put file lines side-by-side
  - rs: reshape arrays
  - jot: generate random or sequence data
  - expand: replace tabs / spaces
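For anyone who hasn't used these, here is a rough sketch of comm, paste and tr on toy data. The file names and contents are made up for the example, and comm assumes both inputs are already sorted.

```shell
#!/bin/sh
# Hypothetical sample data; comm expects both inputs to be sorted.
tmp=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmp/a.txt"
printf '%s\n' banana cherry damson > "$tmp/b.txt"

# Lines only in a.txt (-2 drops "only in b", -3 drops "in both").
only_in_a=$(comm -23 "$tmp/a.txt" "$tmp/b.txt")

# Lines present in both files.
in_both=$(comm -12 "$tmp/a.txt" "$tmp/b.txt")

# paste puts line N of each file side by side, tab-separated by default.
side_by_side=$(paste "$tmp/a.txt" "$tmp/b.txt")

# tr substitutes or deletes characters; here it upper-cases a-z.
shouted=$(printf 'abc\n' | tr 'a-z' 'A-Z')

printf '%s\n' "$only_in_a" "$in_both" "$side_by_side" "$shouted"
rm -rf "$tmp"
```

The comm column flags are the part that makes it scriptable: pick the combination of -1, -2 and -3 that leaves only the column you care about.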

Looks like 6 of those 8 are in GNU coreutils as well (and therefore can be assumed present on just about any modern Unix). 'rs' and 'jot' are the two missing from most default Linux installs. On Debian you can install them via the packages 'rs' and 'athena-jot'.

'jot' is pretty sweet, especially for creating ranges for iteration and random numerical data for testing arguments and such.

Check out the man page for a few snippets: http://www.unix.com/man-page/FreeBSD/1/jot/

It is the older, more flexible uncle of gnu's 'seq' command: http://administratosphere.wordpress.com/2009/01/23/using-bsd...

And you can't mention jot and rs without lam: http://www.unix.com/man-page/FreeBSD/1/lam/

Join is really one of those awesome unknown commands.
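A minimal sketch of what join does, assuming two small space-separated files that are both sorted on their first field (the file names and data are invented for illustration):

```shell
#!/bin/sh
# Hypothetical data: both files are sorted on the join key (field 1).
tmp=$(mktemp -d)
printf '%s\n' '1 alice' '2 bob' '3 carol' > "$tmp/users.txt"
printf '%s\n' '1 admin' '3 guest' > "$tmp/roles.txt"

# Inner join on the first field of each file; unmatched keys are dropped.
joined=$(join "$tmp/users.txt" "$tmp/roles.txt")

# -a 1 also keeps lines from the first file that have no match.
left_joined=$(join -a 1 "$tmp/users.txt" "$tmp/roles.txt")

printf '%s\n' "$joined" '--' "$left_joined"
rm -rf "$tmp"
```

The usual gotcha is unsorted input: join expects both files ordered on the key, so real data typically goes through sort first.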

> cat data.csv | awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'

"Don't pipe a cat".

My test doesn't show a speed improvement, but there are fewer processes running, and less memory consumed.

  bch:~ bch$ jot 999999999 2 99 > data.dat

  bch:~ bch$ time cat data.dat  | awk '{sum +=$1} END {printf "sum: %d\n", sum}'
  sum: 50499999412

  real 6m21.111s
  user 6m15.506s
  sys  0m5.711s

  22342  awk          100.7 05:11.84 1/1   0    17     21     52K    212K   340K   17M    2378M  22341 22306 running  501  311       49       73          36
  22341  cat          1.1   00:03.94 1/1   0    17     21     272K   212K   548K   17M    2378M  22341 22306 running  501  268       51       73          36

  bch:~ bch$ time awk '{sum +=$1} END {printf "sum: %d\n", sum}' ./data.dat
  sum: 50499999412

  real 6m24.023s
  user 6m13.828s
  sys  0m2.774s

  22373  awk          100.0 00:30.16 1/1   0    17     21     276K   212K   624K   17M    2378M  22373 22306 running  501  256       46       73          36

Sometimes I like to start with cat so I can easily swap for zcat when changing to gzipped input.

Agree. Or actually I start with a 'head -100' so I don't handle too much data in my pipeline until it's ready.

I'm old fashioned, so use "sed 100q" instead of the newer "head -100". It saves a keystroke, too.
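A quick way to convince yourself the two spellings agree. The function names are made up for the comparison, and awk stands in for seq/jot so the sketch does not depend on either being installed:

```shell
#!/bin/sh
# first_n_sed and first_n_head are names made up for this comparison.
first_n_sed()  { sed "${1}q"; }     # print lines until line N, then quit
first_n_head() { head -n "$1"; }    # POSIX spelling of "head -N"

# Generate 1..1000 with awk so neither seq nor jot is required.
numbers() { awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }'; }

numbers | first_n_sed 3
numbers | first_n_head 3
```

Both print the first three lines and then stop reading, which is what makes either one a cheap way to cap the data flowing through a pipeline under construction.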

There are enough variations in ways to do things on Unix that I've sometimes wondered about how easy it would be to identify a user by seeing how they accomplish a common task.

For instance, I noticed at one place I worked that even though everyone used the same set of options when doing a "cpio -p", everyone had their own order they wrote them. Seeing one "cpio -p" command was sufficient to tell which of the half dozen of us had done the command.

I think I'm the only one where I work who uses "sed Nq" instead of "head -N", so that would fingerprint me.

I sorta had this happen to me once. I have used "lsl" as an alias for long directory listings for longer than I can remember. And just out of habit it was almost always the first command I typed when logging into any box anywhere.

So one day I telnetted into a Solaris machine and immediately typed "lsl" before doing anything else. A short while later a colleague came to my cube. He had been snooping the hme1 interface and saw me login. He didn't need to trace the IP because he knew it was me when he saw 3 telnet packets with "l" "s" "l" in them.

Stoll's "The Cuckoo's Egg" contains a bit of detective work based around the hacker's switch style.

Hmm… Now we need a zawk.

AWK is worth learning completely. It hits a real sweet spot in terms of minimizing the number of lines of code needed to write useful programs in the world of quasi-structured (not quite CSV but not completely free form) data. You can learn the whole language and become proficient in an afternoon.

I recommend "The AWK Programming Language" by Aho, Kernighan, and Weinberger, though it seems to be listed for a hilariously high price on Amazon at the moment. Maybe try to pick up a used copy.

>I recommend "The AWK Programming Language" by Aho, Kernighan, and Weinberger

I concur with this recommendation. "The AWK Programming Language", at little over 100 pages, is a classic of programming language instruction. The book jumps right into use cases, it does not waste one's time. This book should be required reading for anyone contemplating writing a handbook on any programming language; my CS bookshelf would be several feet thinner and several times more informative.

Sadly it seems very expensive now, $95 on Amazon...

Fortunately, a google search for "the awk programming language pdf" returns a link to this: http://books.cat-v.org/computer-science/awk-programming-lang...

It's the first result for me.

USD 8.99 used, with 3.99 shipping.

Here is Kernighan's personal help file on AWK:


It deals with things he forgets or needs to remind himself of.

If you're interested in his other personal tutorials, they are here:


Kernighan's personal help file is excellent. If you need more, you should switch to a more powerful language.

I advise not learning more than basic usage of awk, and spending the time on more versatile languages instead. You can do very neat tricks with sed and awk, but when the problems become more complex, it is a lot faster to use a smarter language. And if you know that language well, you will discover that it can also be very concise for relatively simple tasks.

When Perl was created, one of its advertised goals was to avoid all the time lost trying to work around the limitations of awk, sed and shell.

I recall there was a pointer to an old great AWK tutorial some time ago - smth along the lines 'how to approach awk language....' - anyone kept the link?

This is the first hit for awk tutorial and it's all you need.


I think Steve's Awk academy is a nice supplement to Grymoire : http://www.troubleshooters.com/codecorn/awk/

By the way: what people need to understand is that in order to use Awk efficiently, you'll either use associative arrays, or structure your script like a sed script, otherwise it will be slow. The interesting thing about both of those is the regex algorithm, Thompson NFA, which from what I hear is around 7 times faster than the PCRE used in Perl, PHP, Python and Ruby?

i wrote one a long time ago:


i still use a buttload of awk for data science type uses.

One of my favorite little tools that makes all these others better is pv -- PipeViewer.

Use it any place in a pipeline to see a progress meter on stderr. Very handy when grepping through a bunch of big log files looking for stuff. Here is a quick strawman example:

  pv /data/*.log.gz | zgrep -c 'hello world'
  241MiB 0:00:15 [15.8MiB/s] [==>       ]  2% ETA 0:12:12

BSD has a progress meter utility. It's called progress(1).

   progress -zf /data/*.log.gz grep -c 'hello world'

   progress -f /data/*.log.gz zgrep -c 'hello world'
The second form will show the progress of the decompression process.

You can also adjust buffer size, set the length for the time estimate (otherwise we have to fstat the input), and display progress to stderr instead of stdout.

Which BSD has that? I don't seem to find it in my FreeBSD installs.

It's actually in FreeBSD's base, but it's part of the ftp(1) program.


Not sure if you prefer binary installs or whether you compile your installs yourself... but I'm sure you could get this to compile on FreeBSD with a little work.

Shameless plug: here is a similar tool I wrote that prints not a progress bar but the contents flowing through the pipe, to help in debugging:


Actually useful data science tips for unix users.

  Make all your commands 3x faster:
   export LC_ALL=C

  Actually use the 32 CPUs you paid for:
   sort --parallel=32 ...
   xargs -P32 ...
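A hedged sketch of the xargs -P idea on a small scale. count_matches is a hypothetical helper, the limit of 4 processes is arbitrary, and -P, -0 and grep -H are GNU/BSD extensions rather than strict POSIX:

```shell
#!/bin/sh
# count_matches is a hypothetical helper: it counts lines matching a
# pattern in each file, running up to 4 grep processes at a time.
# Usage: count_matches PATTERN FILE...
count_matches() {
    pattern=$1
    shift
    # NUL-delimit the file names so spaces in paths survive xargs.
    printf '%s\0' "$@" | xargs -0 -n 1 -P 4 grep -c -H -- "$pattern"
}

# Toy data; output order depends on which grep finishes first, hence sort.
tmp=$(mktemp -d)
printf 'hello world\nbye\n' > "$tmp/a.log"
printf 'hello world\nhello world\n' > "$tmp/b.log"
count_matches 'hello world' "$tmp"/*.log | sort
rm -rf "$tmp"
```

Because the jobs run concurrently, their output can arrive in any order, so anything order-sensitive needs a sort (or a tool such as GNU parallel that can keep output grouped).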

Could you expand on why

   export LC_ALL=C
would "make all your commands 3x faster"?

If your text manipulation programs are locale-aware, they may be interpreting the input as a multibyte encoding, and need to do a lot more work in preprocessing to get semantically correct operation. For example, a Unicode-aware grep may understand more forms of equivalence, similarly for sorting. See e.g. http://en.wikipedia.org/wiki/Unicode_equivalence

With the C locale, text is more or less treated as plain bytes.
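One visible side effect of "plain bytes" is sort order. A small sketch with made-up input, setting the variable per command rather than exporting it:

```shell
#!/bin/sh
# In the C locale, sort compares raw bytes, so every upper-case ASCII
# letter sorts before every lower-case one.
c_sorted=$(printf '%s\n' apple Banana cherry | LC_ALL=C sort)
printf '%s\n' "$c_sorted"

# Setting it per command avoids changing the locale for the whole session.
c_count=$(printf '%s\n' 'hello world' 'goodbye' | LC_ALL=C grep -c 'hello')
printf '%s\n' "$c_count"
```

In a typical UTF-8 locale the same sort would usually put apple before Banana, so switching to C can change results as well as speed. That matters when sorted output feeds join or comm, which should then run under the same locale.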

Actually, it was more like 2000X[1] -- and I believe that it still stands as Brendan Gregg's biggest performance win.

[1] http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance...

According to the comments in that thread, this issue was fixed in GNU grep 2.7 (my system currently has grep 2.14 on it, so this must have been some time ago).

Gnu grep is or was very slow with the UTF-8 locale. Not sure about other commands, perhaps anything that processes text, awk and sed maybe?

That was mostly fixed. http://savannah.gnu.org/bugs/?14472

It's no longer quadratic in so many cases, but it's still true that UTF-8 string operations require, in the best case, several CPU cycles per character consumed, even when the input is an ASCII subset. LC_ALL=C pretty much guarantees one or fewer CPU cycles per input character. Basics like strlen and strchr and strstr are significantly faster in "C" locale.

Note that the standard Solaris versions of many commands are substantially faster than their GNU equivalents in the 'C' and multi-byte locales so this advice doesn't necessarily apply.

That's part of why Solaris continues to use them in favour of GNU alternatives (although the GNU alternatives are available easily in /usr/gnu/bin).

Also export MAKEOPTS=-j33

> Actually use the 32 CPUs you paid for:

gnu parallel FTW!

I like slicing and dicing with awk, grep and friends too.

One thing I find odd is that you have to drop to a full language (awk, perl etc) to sum a column of numbers. Am I missing a utility?

  echo "1\n2\n3\n" | sum # should print 6 with hypothetical sum command
I suppose more generally you could have a 'fold initial op' and:

  echo "1\n2\n3\n4\n" | fold 0 +   # should print 10
  echo "1\n2\n3\n4\n" | fold 1 \*  # should print 24
But I guess at that point you're close enough to using awk/perl/whatever anyway. Which probably answers my question.
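For what it's worth, the hypothetical 'fold initial op' is only a few lines as a shell function around awk. This is a sketch, not an existing utility: it is called foldl here because fold (line wrapping) and sum (checksums) are already taken by standard commands, and it only knows + and *.

```shell
#!/bin/sh
# foldl is a hypothetical name; "fold" and "sum" already exist as
# unrelated standard commands. It reads one number per line and folds
# them with the given operator, starting from an initial value.
# Usage: foldl INITIAL OP     (OP is + or *)
foldl() {
    case $2 in
        '+'|'*') ;;
        *) echo "foldl: unsupported operator: $2" >&2; return 2 ;;
    esac
    awk -v acc="$1" -v op="$2" '
        NF == 0   { next }          # skip blank lines
        op == "+" { acc += $1 }
        op == "*" { acc *= $1 }
        END       { print acc }
    '
}

printf '1\n2\n3\n4\n' | foldl 0 +      # prints 10
printf '1\n2\n3\n4\n' | foldl 1 '*'    # prints 24
```

The * has to be quoted or escaped on the command line so the shell does not expand it as a glob.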

Just for grins:

    $ alias sum="xargs | tr ' ' '+' | bc"
    $ echo -e "1\n2\n3\n" | sum

Here's one way to do this with the standard "dc" (RPN calculator) utility:

  echo "1\n2\n3\n+\n+\np\n" | dc -
Or, a little more legibly:

  > dc
[Edited to fix bug]

Not sure how to automate this to sum 1000 values without needing to explicitly insert 999 + signs, though. Haven't explored dc in depth myself yet. There's probably some way to do it with a macro or something, but it may not be pretty.
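One way around the 999 explicit + signs, sketched under a few assumptions: dc is installed, the input is one non-negative number per line with no blank lines, and dc_sum is a made-up name. The trick is to push 0 first and append + to every value, so each number is added to the running total as soon as it is pushed.

```shell
#!/bin/sh
# dc_sum is a hypothetical helper. It turns the input lines
#   1, 2, 3   into the dc program   0 1+ 2+ 3+ p
# so the stack never holds more than the running total and one value.
# Assumes non-negative numbers, one per line (dc writes negatives as _5).
dc_sum() {
    { echo 0; sed 's/$/+/'; echo p; } | dc
}

# Usage, if dc is available:
#   printf '%s\n' 1 2 3 | dc_sum
```

Since nothing accumulates on the stack, this works the same for three values or a thousand.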

If you consider the use of dc/bc as in the other solutions to be cheating, you can use unary-encoded integers...

   alias sum='xargs -I{} sh -c "head -c {} < /dev/zero" | wc -c'

Yeah I wrote my own sum utility in Python... the syntax is just sum 1 or sum 2 for the column, with a -d delimiter flag. In retrospect I guess it could have been a one line awk script. But yeah if you are doing this kind of data-processing, it makes sense to have a hg/git repo of aliases and tiny commands that you sync around from machine to machine. You shouldn't have to write the sum more than once.
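A sketch of what that awk version might look like. colsum is a made-up name (sum is already a checksum command), and the interface is only a guess at the Python tool described above: a column number plus an optional -d delimiter.

```shell
#!/bin/sh
# colsum is a hypothetical stand-in for the commenter's Python tool:
#   colsum [-d DELIM] COLUMN
# It sums one column of stdin; fields are whitespace-separated unless
# -d is given.
colsum() {
    delim=
    if [ "$1" = "-d" ]; then
        delim=$2
        shift 2
    fi
    if [ -n "$delim" ]; then
        awk -F "$delim" -v col="$1" '{ sum += $col } END { print sum + 0 }'
    else
        awk -v col="$1" '{ sum += $col } END { print sum + 0 }'
    fi
}

printf '1 10\n2 20\n' | colsum 2          # prints 30
printf 'a|1\nb|2\n' | colsum -d '|' 2     # prints 3
```

The "sum + 0" makes empty input print 0 instead of an empty line.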

Another useful one is "hist" which is sort | uniq -c | sort -n -r.

I've never aliased it, but yes I use your 'hist' a lot. Useful for things like "categorise log errors" etc.
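As a shell function it could look like this; the log lines are invented to show the "categorise log errors" use:

```shell
#!/bin/sh
# hist: count identical lines and list the most frequent first.
hist() { sort | uniq -c | sort -n -r; }

# Made-up log lines; field 2 is the error category.
printf '%s\n' \
    '10:01 timeout db1' \
    '10:02 refused db2' \
    '10:03 timeout db1' \
    '10:04 timeout db3' |
    awk '{ print $2 }' | hist
```

The first sort is needed because uniq only collapses adjacent duplicates.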

Does everyone else edit command history, stacking up 'grep -v xxxx' in the pipeline to remove noise?

If I'm working on a new pipeline, my normal workflow is something like:

  head file     # See some representative lines
  head file | grep goodstuff
  head file | grep goodstuff | grep -v badstuff
  head file | grep ... | grep ... | sed -e 's/cut out/bits/' -e 's/i dont/want/'
  head file | grep ... | grep ... | sed -e 's/cut out/bits/' -e 's/i dont/want/' | awk '{print $3}' # get a col
  head file | grep ... | grep ... | sed -e 's/cut out/bits/' -e 's/i dont/want/' | awk '{print $3}' | sort | uniq -c | sort -nr  # histogram as parent
Then I edit the 'head' into a 'cat' and handle the whole file. Basically all done with bash history editing (I'm a 'set -o vi' person for vi keybindings in bash, emacs is fine too :-)

Yeah, this is my quick-and-dirty way of looking at referers in Apache logs, built up from a few history edits. It excludes some bot-like stuff (many bots give a plus-prefixed URL in the user-agent string) and referer strings from my own domain, removes query strings, and cleans up trailing slashes:

   grep -v "+http" access_log | cut -d \" -f 4 | cut -d \? -f 1 | sed 's/\/$//' | grep -v kmjn.org | sort | uniq -c | sort -nr

> awk '{print $3}'

is the same as

> cut -f3 -d' '

cut is amazing for what it does. and most people know only the subset of awk that effectively _is_ cut anyway :D.

Not exactly, awk will consume all whitespace while cut will split on each individual space character, and not on newlines and tabs.
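The difference is easy to see on a line with a doubled space (the sample line is made up):

```shell
#!/bin/sh
line='a  b c'    # two spaces between "a" and "b"

# cut treats every single space as a separator, so field 2 is empty
# and field 3 is "b".
cut_field=$(printf '%s\n' "$line" | cut -d' ' -f3)

# awk's default splitting collapses runs of blanks, so field 3 is "c".
awk_field=$(printf '%s\n' "$line" | awk '{ print $3 }')

# Squeezing repeated spaces first makes cut agree with awk here.
squeezed_field=$(printf '%s\n' "$line" | tr -s ' ' | cut -d' ' -f3)

printf '%s\n' "cut: $cut_field" "awk: $awk_field" "tr -s + cut: $squeezed_field"
```

The tr -s workaround only covers repeated spaces; leading blanks and tabs still behave differently between the two tools.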

Bundling expressions into regular expressions can be handy, for example "grep -Ev '(thisbot|thatbot|bingbot|bongbot)'" instead of many single grep pipes.
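A small check that the bundled form filters the same lines as the chained one, using made-up user-agent names:

```shell
#!/bin/sh
agents() { printf '%s\n' thisbot human thatbot reader; }

# One grep process per excluded pattern.
chained=$(agents | grep -v thisbot | grep -v thatbot)

# One process, one extended regular expression with alternation.
combined=$(agents | grep -Ev '(thisbot|thatbot)')

printf '%s\n' "$chained" '--' "$combined"
```

Both leave only the human and reader lines; the bundled version is also a single line to edit when the exclusion list grows.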

    echo "1\n2\n3\n" | tr '\n' + | bc

Doesn't work. By default, echo doesn't translate \n into a newline, so you have to add the -e flag. Then, bc doesn't like the extra plusses at the end, so you have to either add the -n flag to echo and remove the last \n, or somehow trim the newlines from the end beforehand.
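Both problems can be seen without bc at all. In this sketch printf replaces echo (it interprets \n itself), sed trims the trailing plus, and shell arithmetic does the integer-only check; the variable names are arbitrary.

```shell
#!/bin/sh
# printf interprets \n itself, so no "echo -e" is needed.
raw=$(printf '1\n2\n3\n' | tr '\n' '+')
printf '%s\n' "$raw"          # the final newline became a trailing "+"

# Trim the trailing "+" before handing the expression to a calculator.
expr_str=$(printf '1\n2\n3\n' | tr '\n' '+' | sed 's/+$//')
printf '%s\n' "$expr_str"

# Integer-only check using shell arithmetic instead of bc.
total=$(( $expr_str ))
printf '%s\n' "$total"
```

For decimals the trimmed string would still need to go to bc or awk, since shell arithmetic only handles integers.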

Thanks for the detail. I was worried about the escapes in the echo, but didn't check.

You could do:

echo -e "1\n2\n3\n4" | paste -sd+ | bc

kind of cheating though! :-)

    paste -sd+|bc

I was hoping to see an article about some neat new utilities specifically tailored for doing advanced data analysis.

Instead this is a set of basic examples of bog-standard tools that every newbie *nix user should be already familiar with: cat, awk, head, tail, wc, grep, sed, sort, uniq

>"tools that every newbie *nix user should be already familiar with"

The key word is should... you might be surprised how many "not newbie" *nix users are not aware of those commands or of how to use them in this fashion. Especially awk.

I forgot all about wc, but thanks to this article I may remember it the next time I need it.


Indeed. I'm going to start calling myself a data scientist now, instead of a sysadmin. See if I can't get a raise.

Don't forget there are only ever going to be more unix newbies in the world. It's not like they're a dying breed. There are more people than ever who have never been exposed to unix tools who might benefit from them (myself included several years ago).

Lots of people are familiar with the "basics" of each of these commands but many of them (awk, sed) are very powerful utilities that can do much, much more than it appears at first glance.

A commenter on the article pointed out the "Useless use of cat".

What most users probably don't realize is that the redirection can be anywhere on the line, not just at the beginning. Putting an input redirection at the beginning of the command can make the data flow clearer: from the input file, through the command, to stdout:

    < data.csv awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'
(This only works for simple commands; you can't do `< file if blah; then foo; else bar; fi`)
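To make the equivalence concrete, here is a sketch that runs the same awk program four ways on a made-up two-line data.csv and gets the same total each time:

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' 'a|b|c|1.50' 'd|e|f|2.25' > "$tmp/data.csv"
prog='{ sum += $4 } END { printf "%.2f\n", sum }'

piped=$(cat "$tmp/data.csv" | awk -F '|' "$prog")      # extra cat process
as_arg=$(awk -F '|' "$prog" "$tmp/data.csv")           # file as argument
redir_first=$(< "$tmp/data.csv" awk -F '|' "$prog")    # redirection up front
redir_last=$(awk -F '|' "$prog" < "$tmp/data.csv")     # redirection at the end

printf '%s\n' "$piped" "$as_arg" "$redir_first" "$redir_last"
rm -rf "$tmp"
```

One practical difference: with the file passed as an argument awk knows its name (FILENAME), while with cat or a redirection it only sees standard input.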

"Useless use of cat" is one of those boring pedantic comments that makes me cringe. Who cares? It's usually much more straightforward to build a pipeline from left to right, particularly for people who are just learning this stuff.

> It's usually much more straightforward to build a pipeline from left to right, particularly for people who are just learning this stuff.

True, however, people pointing out UUOC are in fact pointing out that you should not be building a pipeline at all. If you want to apply an awk / sed / wc / whatever command to a file, then you should just do that instead of piping it through an extraneous command.

Sure, as people always mention, in your actual workflow you might have a cat or grep already, and are building a pipeline incrementally; there's no reason to remove previous stuff to be "pure" or whatever. But if you're giving a canonical example, there's no reason to add unneeded commands.

Please be very careful doing math with bash and awk...

  cat data.csv | awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'
From that command, it's unclear whether the sum will be accurate, it depends on the inputs and on the precision of awk. See (D.3 Floating-Point Number Caveats): http://www.delorie.com/gnu/docs/gawk/gawk_260.html

> Please be very careful doing math with bash and awk...

I don't see how that's more true for awk than it is for any other programming language. Awk uses double precision floating point for all numeric values, which isn't a horrible choice for a catch-all numeric type.
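The caveat is the ordinary binary floating-point one, and it is easy to reproduce. The second half is one possible workaround, not something from the article: sum_cents is a hypothetical helper that assumes non-negative amounts with at most two decimal places and adds them as integer cents.

```shell
#!/bin/sh
# Ten times 0.1 is not exactly 1 in double precision.
naive=$(awk 'BEGIN {
    for (i = 0; i < 10; i++) s += 0.1
    if (s == 1) print "equal"; else print "not equal"
}')
printf '%s\n' "$naive"

# sum_cents (hypothetical): round each amount to whole cents, add the
# integers, and only format as a decimal at the very end.
sum_cents() {
    awk '{ cents += int($1 * 100 + 0.5) }
         END { printf "%d.%02d\n", cents / 100, cents % 100 }'
}

printf '%s\n' 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 | sum_cents
```

Printing with "%.2f" hides small errors like this most of the time, which is why the original one-liner is usually fine for a quick look but worth double-checking for anything resembling accounting.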

Starts off with unnecessary use of cat, e.g., cat file | awk 'cmds'.

One can simply do awk 'cmds' file.

I know purists always complain about unnecessary cats, but I always find it useful to start with "head" or "tail" in the first position to figure out my pipeline, and then replace it with cat when it's all working.

And if the extra cat is actually making a measurable difference, maybe that's a good signal that it's time to rewrite it in C.

You can do that with simple I/O redirection. For example, the arbitrary pipeline

    $ cat data.txt | awk '{ print $2+$4,$0 }'|sort|sed '/^0/d'
can be written as

    $ <data.txt awk '{ print $2+$4,$0 }'|sort|sed '/^0/d'

Some people prefer the first, longer-winded way because it's more explicit. To some --- myself included --- it makes more sense because it explicitly breaks each function into a separate step; I'm explicitly telling the system to print the contents of data.txt rather than implicitly doing so. I'll happily type those five extra characters for that additional clarity.

If you need them, Windows also has most of those tools somehow replicated in Powershell. For instance, the initial example can be replicated with:

Get-Content .\data.csv | %{[int]$total+=$_.Split('|')[3]; } ; Write-Host "$total"

Or you can actually use the Linux commands by installing Cygwin. Pretty much my first conscious action when I wake up stranded on a desert Windows system.

Why Cygwin instead of MinGW? (I can't remember why I prefer MinGW, but at some point I had a reason).

I could care less which you choose so long as you're getting a proper Linux toolset.

Powershell is a skill I don't have yet which carries over to ... precisely one declining technical dinosaur (with a penchant for expiring its skillsets).

The Linux toolbox is a set of skills I embarked on learning over a quarter-century ago, most of which goes back another decade or further (the 'k' in 'awk' comes from Brian Kernighan, one of Unix's creators). And while some old utilities are retired and new ones replace them (telnet / rsh for ssh, sccs/rcs for git), much of the core has remained surprisingly stable over time.

The main difference between MinGW and Cygwin appears to be how Windows-native they are considered, which for my own purposes has been an entirely irrelevant distinction, though if you're building applications based on the tools it might matter to you.


Virtualbox ubuntu.

One trick I like to do is to feed two files into awk: /dev/stdin and some other file I'm interested in. Here's an example: look up the subject names of a list of serial numbers in an OpenSSL index.txt

    printf %s\\n "$@" | awk -F'\t' '
        FILENAME == "/dev/stdin" {
            needle[$0] = 1
            next
        }
        needle[$4] {
            print $NF
        }
    ' /dev/stdin /etc/pki/CA/index.txt
I find myself using this idiom (feeding data from a file to awk and selecting it with data from standard input) again and again. It's a great way to scale shell scripts to take multiple arguments while avoiding opening the same file N times, or doing clunky things with awk's -v flag.
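A self-contained variant of the same idea, for trying it out without a real CA directory. lookup_subjects is a made-up name, the index file is passed as an argument, and it uses the common NR == FNR test (true only while awk is reading its first input) with "-" for standard input instead of comparing file names.

```shell
#!/bin/sh
# lookup_subjects (hypothetical): print the last field of every line in a
# tab-separated index file whose 4th field is one of the given serials.
# Usage: lookup_subjects INDEX_FILE SERIAL...
lookup_subjects() {
    index_file=$1
    shift
    printf '%s\n' "$@" | awk -F '\t' '
        NR == FNR    { wanted[$0] = 1; next }   # first input: serials on stdin
        $4 in wanted { print $NF }              # second input: the index file
    ' - "$index_file"
}

# Toy index in the same column layout (serial in field 4, subject last).
tmp=$(mktemp -d)
printf 'V\t250101000000Z\t\t01\tunknown\t/CN=alpha\n' >  "$tmp/index.txt"
printf 'V\t250101000000Z\t\t02\tunknown\t/CN=beta\n'  >> "$tmp/index.txt"
lookup_subjects "$tmp/index.txt" 01
rm -rf "$tmp"
```

One caveat with NR == FNR: if the first input is completely empty, the test stays true for the second file as well, so the file-name comparison in the original is the safer choice when standard input might be empty.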

Worth noting that:

    grep -A n -B n
is more easily written:

    grep -C n
If you think "C" for "context" this is easier to remember too.
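A tiny check that the two spellings print the same lines (the input is made up; -A, -B and -C are GNU/BSD grep options rather than POSIX):

```shell
#!/bin/sh
numbers() { printf '%s\n' one two three four five; }

# One line of context after and one before the match...
explicit=$(numbers | grep -A 1 -B 1 three)

# ...is the same as one line of context on both sides.
context=$(numbers | grep -C 1 three)

printf '%s\n' "$explicit" '--' "$context"
```

-A and -B remain useful on their own when the amount of context before and after should differ.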

If you're interested in this general topic check out this Wikibook from John Rauser. He's a data scientist at Pinterest.


If this interests you, you should check out Joyent's new Manta service which lets you do this type of thing on your data via their infrastructure. It's really cool.


If I needed to do this type of thing on 10 TB of data, it would probably take me longer to get the data to them than it would to just run it on my own hardware.

Apparently there's a need for it, though, or it wouldn't exist.

Disclaimer: I work at Joyent, on Manta.

This entire HN thread is a perfect example of why we built Manta. Lots of engineers/scientists/sysadmins/... already know how to (elegantly) process data using Unix and augmenting with scripts. Manta isn't about always needing to work on a 10TB dataset (you can), but about it being always available, and stored ready to go. I know we can't live without it for running our own systems -- all logs in the entire Joyent fleet are rotated and archived in Manta, and we can perform both recurring/automated and ad-hoc analysis on the dataset, without worrying about storage shares, or ETL'ing from cold storage to compute, etc. And you can sample as little or as much as you want. At least to us (and I've run several large distributed systems in my career), that has tremendous value, and we believe it does for others as well. And that's just one use case (log processing).

Like I said, disclaimers/bias/etc.


Wow, this looks great. My ideal cloud-computing platform is basically something like xargs -P or GNU parallel, but with the illusion that I'm running it on a machine with infinite CPU cores and RAM (charged for usage, of course). I was spoiled early on by having once had something almost like that, via a very nice university compute cluster, where your data was always available on all nodes (via NFS), and you just prefixed your usual Unix commands with a job-submit command, which did the magic of transparently running stuff wherever it wanted to run it. Apart from the slight indirection of using the job-submit tool, it almost succeeded in giving the illusion of ssh-ing into a single gazillion-core big-iron machine, which is more or less the user experience I want. But I haven't found a commercial offering where I can get an account on a big Unix cluster and just get billed for some function of my (disk space, CPU usage, RAM usage) x time.

Cloud services are amazing in a lot of ways, but so far I've found them much more heavyweight for the use-case of running ad-hoc jobs from the Unix command line. You don't really want to write Hadoop code for exploratory data analysis, and even managing a little fleet of bashreduce+EC2 instances that get spun up and down on demand is error-prone and tedious, turning me more into the cluster administrator rather than a user, which is what I'd rather be. Admittedly it's possible that could be abstracted out better in the case where you don't mind latency: I often don't mind if my jobs queue up for a few minutes, which would mean a tool could spin up EC2 instances behind the scenes and then tear them down without me noticing. But I haven't found anything that does that transparently yet, and Manta looks like a more direct implementation of the "illusion of running on an N-core machine for arbitrary N" idea that seems in the same cost ballpark. Definitely going to do some experimentation here, to see if 2010s technology will enable me to keep using a 1970s-era data-processing workflow.

I know Manta has default software packaged, but is it possible to install your own, like ghci or julia? Or is that something that needs to be brought in as an asset? This isn't necessarily a feature request, just trying to figure out how it works. https://apidocs.joyent.com/manta/compute-instance-software.h...

An asset is currently the way to do that.

Mark, is there any info on how I can figure out my monthly billing cost easily? Do I just need to sum the /user/reports/summary data for an estimate?

Yeah that's why we generate ~/reports for you every hour - that's what our billing runs off of. I know there's an internal "turn that into daily $ script" somebody wrote -- we'll get that put out as a sample job.

A short and nice read is Unix for Poets by Kenneth Ward Church: http://www.stanford.edu/class/cs124/kwc-unix-for-poets.pdf

I actually was about to post this -- this guide is great. As an undergraduate, this was what was given to us to help demonstrate above-introductory command line tools/pipes.

Data science? Is that what we're calling DBA work nowadays?

I'm surprised there was no mention of cut -d. It's good for simple stuff where you don't need all of awk.

and paste - quick and simple for working with columns

The "Text Processing" category of this list of unix utilities is also helpful: http://en.wikipedia.org/wiki/List_of_Unix_programs

BashReduce is a pretty cool application of many of these utilities.

Nice overview. Some guys at my previous company used to show off stuff like this one line recommender (slide 6)


Wow, I have to say I use all these commands - these are also particularly useful while testing Hadoop streaming jobs since you can test locally on your shell using "cat | map | sort | reduce" (replace cat with head if you want) and then actually run it in Hadoop.
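A sketch of that local test loop as a word count. Here map and reduce are shell functions standing in for whatever mapper and reducer executables the real streaming job would use, and the tab-separated "key, value" lines are an assumption about the job's format:

```shell
#!/bin/sh
# Mapper stand-in: emit "word<TAB>1" for every word on stdin.
map() { tr -s '[:space:]' '\n' | awk 'NF { print tolower($0) "\t1" }'; }

# Reducer stand-in: input arrives sorted by key, so equal words are
# adjacent and can be totalled in a single pass.
reduce() {
    awk -F '\t' '
        $1 != prev { if (NR > 1) print prev "\t" count; prev = $1; count = 0 }
                   { count += $2 }
        END        { if (NR > 0) print prev "\t" count }
    '
}

# sort plays the role of the shuffle phase between map and reduce.
printf 'the cat\nthe hat\n' | map | LC_ALL=C sort | reduce
```

Swapping the printf for head on a real input file gives the quick local run described above before the job is submitted for real.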

This reminds me of page 213 (175 in the book's numbering) of the Unix Haters Handbook, found at http://web.mit.edu/~simsong/www/ugh.pdf

"Imagine you have a 4.2GB CSV file." "All you need... is the sum of all values in one particular column."

In that case, if speed was paramount, I'd use Kona or kdb. Unquestionably, k is the best tool for that particular job.

Really? In a recent test of a whole bunch of languages, scripting, compiled and JVM (but not Kona or kdb), our awk test was beaten only by C. awk was so far ahead, its run time beat others' compile + run time.

Yes, really. Nothing I know of beats the speed of k for this column-oriented type of task. Kernighan himself tested it against awk many years ago and if I recall it was generally faster even in that set of tests.

Where can I see your experimental design? I'd like to try to replicate your results.

I've been using Ubuntu for about a year now and although I feel comfortable doing a lot of things with the CL, I'm not sure if I really know enough about *nix. I wish there was a website with the 20-30 most useful unix commands and very clear language as to what they do, with examples. Although I've used all the tools in this post, I still enjoyed the use of examples.

Software Carpentry provides a good overview with examples: http://software-carpentry.org/4_0/shell/index.html

Looks pretty solid but a bit on the simple side. Thanks for the link.

>>Writing a script in python/ruby/perl/whatever would probably take a few minutes and then even more time for the script to actually complete.

Thankfully you can also write a Perl one-liner, which most of the time is far more powerful than awk.

sum the 0th field: perl -lane '$a += $F[0]; END{ print $a; }'

In some cases, (command line) Perl will actually process piped text faster than awk or even sed. I'm not sure about arithmetic though.

"data science"... seriously?
