
Useful Unix commands for data science - gjreda
http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/
======
makmanalp
Everyone forgets the brilliant and sometimes crazy BSD ones:

    
    
      - column: create columns / tables from input data
      - tr: substitute / delete chars
      - join: like a database join, but for text files
      - comm: like diff, but you can use it programmatically to choose if a line is in one file, or another, or both.
      - paste: put file lines side-by-side
      - rs: reshape arrays
      - jot: generate random or sequence data
      - expand: replace tabs / spaces
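
A quick sketch of a few of these in action (BSD-style invocations; exact flags
may differ on your system):

    
    
      jot 5 1                           # 1 2 3 4 5, one number per line
      jot -r 3 1 100                    # three random integers between 1 and 100
      paste a.txt b.txt                 # lines of a.txt and b.txt side by side
      printf 'a 1\nb 22\n' | column -t  # align whitespace-separated fields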

~~~
mjn
Looks like 6 of those 8 are in GNU coreutils as well (and therefore can be
assumed present on just about any modern Unix). 'rs' and 'jot' are the two
missing from most default Linux installs. On Debian you can install them via
the packages 'rs' and 'athena-jot'.

~~~
merlincorey
'jot' is pretty sweet, especially for creating ranges for iteration and random
numerical data for testing arguments and such.

Check out the man page for a few snippets: [http://www.unix.com/man-page/FreeBSD/1/jot/](http://www.unix.com/man-page/FreeBSD/1/jot/)

It is the older, more flexible uncle of gnu's 'seq' command:
[http://administratosphere.wordpress.com/2009/01/23/using-bsd...](http://administratosphere.wordpress.com/2009/01/23/using-bsd-jot/)
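
For instance, both of these print 1 through 5, but only jot can also produce
random data (a rough sketch; check your local man page):

    
    
      seq 1 5
      jot 5 1
      jot -r 10 0 1    # ten random integers, each 0 or 1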

~~~
emmelaich
And you can't mention jot and rs without lam: [http://www.unix.com/man-page/FreeBSD/1/lam/](http://www.unix.com/man-page/FreeBSD/1/lam/)

------
bch
> cat data.csv | awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'

"Don't pipe a cat".

My test doesn't show a speed improvement, but there are fewer processes
running and less memory consumed.

    
    
      bch:~ bch$ jot 999999999 2 99 > data.dat
    
    
      bch:~ bch$ time cat data.dat  | awk '{sum +=$1} END {printf "sum: %d\n", sum}'
      sum: 50499999412
    
      real 6m21.111s
      user 6m15.506s
      sys  0m5.711s
    
      PID    COMMAND      %CPU  TIME     #TH   #WQ  #PORTS #MREGS RPRVT  RSHRD  RSIZE  VPRVT  VSIZE  PGRP  PPID  STATE    UID  FAULTS    COW      MSGSENT     MSGRECV
      22342  awk          100.7 05:11.84 1/1   0    17     21     52K    212K   340K   17M    2378M  22341 22306 running  501  311       49       73          36
      22341  cat          1.1   00:03.94 1/1   0    17     21     272K   212K   548K   17M    2378M  22341 22306 running  501  268       51       73          36
    

==============

    
    
      bch:~ bch$ time awk '{sum +=$1} END {printf "sum: %d\n", sum}' ./data.dat
      sum: 50499999412
    
      real 6m24.023s
      user 6m13.828s
      sys  0m2.774s
    
      PID    COMMAND      %CPU  TIME     #TH   #WQ  #PORTS #MREGS RPRVT  RSHRD  RSIZE  VPRVT  VSIZE  PGRP  PPID  STATE    UID  FAULTS    COW      MSGSENT     MSGRECV
      22373  awk          100.0 00:30.16 1/1   0    17     21     276K   212K   624K   17M    2378M  22373 22306 running  501  256       46       73          36

~~~
aidos
Sometimes I like to start with cat so I can easily swap it for zcat when
changing to gzipped input.
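
Something like this, where only the first command in the pipeline changes (a
sketch with a hypothetical access.log):

    
    
      cat access.log     | awk '{print $1}' | sort | uniq -c
      zcat access.log.gz | awk '{print $1}' | sort | uniq -c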

~~~
jbert
Agree. Or actually I start with a 'head -100' so I don't handle too much data
in my pipeline until it's ready.

~~~
tzs
I'm old fashioned, so use "sed 100q" instead of the newer "head -100". It
saves a keystroke, too.
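
For the unfamiliar, the q command makes sed quit after the addressed line:

    
    
      sed 100q big.log    # print the first 100 lines then quit; same as: head -100 big.log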

There are enough variations in ways to do things on Unix that I've sometimes
wondered about how easy it would be to identify a user by seeing how they
accomplish a common task.

For instance, I noticed at one place I worked that even though everyone used
the same set of options when doing a "cpio -p", everyone had their own order
they wrote them. Seeing one "cpio -p" command was sufficient to tell which of
the half dozen of us had done the command.

I think I'm the only one where I work who uses "sed Nq" instead of "head -N",
so that would fingerprint me.

~~~
smutticus
I sorta had this happen to me once. I have used "lsl" as an alias for long
directory listings for longer than I can remember. And just out of habit it
was almost always the first command I typed when logging into any box
anywhere.

So one day I telnetted into a Solaris machine and immediately typed "lsl"
before doing anything else. A short while later a colleague came to my cube.
He had been snooping the hme1 interface and saw me login. He didn't need to
trace the IP because he knew it was me when he saw 3 telnet packets with "l"
"s" "l" in them.

------
minimax
AWK is worth learning completely. It hits a real sweet spot in terms of
minimizing the number of lines of code needed to write useful programs in the
world of quasi-structured (not quite CSV but not completely free form) data.
You can learn the whole language and become proficient in an afternoon.
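
For example, a per-key average over whitespace-separated key/value input takes
only a couple of lines (a sketch with a hypothetical data.txt):

    
    
      awk '{ sum[$1] += $2; n[$1]++ }
           END { for (k in sum) print k, sum[k] / n[k] }' data.txt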

I recommend "The AWK Programming Language" by Aho, Kernighan, and Weinberger,
though it seems to be listed for a hilariously high price on Amazon at the
moment. Maybe try to pick up a used copy.

~~~
daemon13
I recall there was a pointer to a great old AWK tutorial some time ago -
something along the lines of 'how to approach the awk language...' - has
anyone kept the link?

~~~
foobarbazqux
This is the first hit for awk tutorial and it's all you need.

[http://www.grymoire.com/Unix/Awk.html](http://www.grymoire.com/Unix/Awk.html)

~~~
McUsr
I think Steve's Awk academy is a nice supplement to Grymoire :
[http://www.troubleshooters.com/codecorn/awk/](http://www.troubleshooters.com/codecorn/awk/)

By the way: what people need to understand is that in order to use Awk
efficiently, you'll either use associative arrays or structure your script
like a sed script; otherwise it will be slow. The interesting thing about both
of those is the regex algorithm, the Thompson NFA, which from what I hear is
around 7 times faster than the PCRE used in Perl, PHP, Python and Ruby.

------
DEinspanjer
One of my favorite little tools that makes all these others better is pv --
PipeViewer.

Use it any place in a pipeline to see a progress meter on stderr. Very handy
when grepping through a bunch of big log files looking for stuff. Here is a
quick strawman example:

    
    
      pv /data/*.log.gz | zgrep -c 'hello world'
      241MiB 0:00:15 [15.8MiB/s] [==>       ]  2% ETA 0:12:12

~~~
gwu78
BSD has a progress meter utility. It's called progress(1).

    
    
       progress -zf /data/*.log.gz grep -c 'hello world'
    
       progress -f /data/*.log.gz zgrep -c 'hello world'
    

The second form will show the progress of the decompression process.

You can also adjust buffer size, set the length for the time estimate
(otherwise we have to fstat the input), and display progress to stderr instead
of stdout.

~~~
gaadd33
Which BSD has that? I don't seem to find it in my FreeBSD installs.

~~~
gwu78
It's actually in FreeBSD's base, but it's part of the ftp(1) program.

    
    
       http://svnweb.freebsd.org/base/vendor/tnftp/dist/src/progressbar.c
    
       http://ftp.netbsd.org/pub/NetBSD/NetBSD-release-6/src/usr.bin/{Makefile,progress.c}
    

Not sure if you prefer binary installs or whether you compile your installs
yourself... but I'm sure you could get this to compile on FreeBSD with a
little work.

------
thrownaway2424
Actually useful data science tips for unix users.

    
    
      Make all your commands 3x faster:
       export LC_ALL=C
    
      Actually use the 32 CPUs you paid for:
       sort --parallel=32 ...
       xargs -P32 ...
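
For example (GNU sort and xargs assumed; the buffer size and file names are
just illustrative):

    
    
      LC_ALL=C sort --parallel=32 -S 2G big.txt > sorted.txt
      find logs/ -name '*.log' | xargs -P32 -n1 grep -c ERROR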

~~~
gnosis
Could you expand on why

    
    
       export LC_ALL=C
    

would "make all your commands 3x faster"?

~~~
bcantrill
Actually, it was more like 2000X[1] -- and I believe that it still stands as
Brendan Gregg's biggest performance win.

[1]
[http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance...](http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance-win/)

~~~
gnosis
According to the comments in that thread, this issue was fixed in GNU grep 2.7
(my system currently has grep 2.14 on it, so this must have been some time
ago).

------
jbert
I like slicing and dicing with awk, grep and friends too.

One thing I find odd is that you have to drop to a full language (awk, perl,
etc.) to sum a column of numbers. Am I missing a utility?

    
    
      echo "1\n2\n3\n" | sum # should print 6 with hyphothetical sum command
    

I suppose more generally you could have a 'fold initial op' and:

    
    
      echo "1\n2\n3\n4\n" | fold 0 +   # should print 10
      echo "1\n2\n3\n4\n" | fold 1 \*  # should print 24
    

But I guess at that point you're close enough to using awk/perl/whatever
anyway. Which probably answers my question.

~~~
chubot
Yeah, I wrote my own sum utility in Python... the syntax is just "sum 1" or
"sum 2" for the column, with a -d delimiter flag. In retrospect I guess it
could have been a one-line awk script. But if you are doing this kind of data
processing, it makes sense to have a hg/git repo of aliases and tiny commands
that you sync around from machine to machine. You shouldn't have to write the
sum more than once.
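
A one-line awk version might look something like this (a hypothetical shell
function, roughly matching that interface):

    
    
      sum() { awk -F"${2:- }" -v c="$1" '{ s += $c } END { print s }'; }
      sum 4 '|' < data.csv    # sum column 4 of a pipe-delimited file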

Another useful one is "hist" which is sort | uniq -c | sort -n -r.
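
As a shell function that's just:

    
    
      hist() { sort | uniq -c | sort -rn; }
      awk '{print $9}' access.log | hist    # e.g. count response codes in a hypothetical log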

~~~
jbert
I've never aliased it, but yes I use your 'hist' a lot. Useful for things like
"categorise log errors" etc.

Does everyone else edit command history, stacking up 'grep -v xxxx' in the
pipeline to remove noise?

If I'm working on a new pipeline, my normal workflow is something like:

    
    
      head file     # See some representative lines
      head file | grep goodstuff
      head file | grep goodstuff | grep -v badstuff
      head file | grep ... | grep ... | sed -e 's/cut out/bits/' -e 's/i dont/want/'
      head file | grep ... | grep ... | sed -e 's/cut out/bits/' -e 's/i dont/want/' | awk '{print $3}' # get a col
      head file | grep ... | grep ... | sed -e 's/cut out/bits/' -e 's/i dont/want/' | awk '{print $3}' | sort | uniq -c | sort -nr  # histogram as parent
    

Then I edit the 'head' into a 'cat' and handle the whole file. Basically all
done with bash history editing (I'm a 'set -o vi' person for vi keybindings in
bash, emacs is fine too :-)

~~~
ibotty
> awk '{print $3}'

is the same as

> cut -f3 -d' '

cut is amazing for what it does. and most people know only the subset of awk
that effectively _is_ cut anyway :D.

~~~
toupeira
Not exactly: awk collapses runs of whitespace, while cut splits on each
individual space character and doesn't treat tabs as delimiters.
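
The difference is easy to see with a doubled space:

    
    
      printf 'a  b\n' | awk '{print $2}'    # prints "b"; runs of whitespace collapse
      printf 'a  b\n' | cut -f2 -d' '       # prints the empty field between the two spaces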

------
gnosis
I was hoping to see an article about some neat new utilities specifically
tailored for doing advanced data analysis.

Instead this is a set of basic examples of bog-standard tools that every
newbie *nix user should be already familiar with: cat, awk, head, tail, wc,
grep, sed, sort, uniq

~~~
pmelendez
>"tools that every newbie _nix user should be already familiar with "

The key word is _should _... you might be surprised how many "not newbie" _nix
users are not aware of those commands or how using them in this fashion.
Specially awk.

~~~
D9u
I forgot all about wc, but thanks to this article I may remember it the next
time I need it.

Thanks!

------
_kst_
A commenter on the article pointed out the "Useless use of cat".

What most users probably don't realize is that the redirection can be anywhere
on the line, not just at the end. Putting an input redirection at the
beginning of the command can make the data flow clearer: _from_ the input
file, _through_ the command, _to_ stdout:

    
    
        < data.csv awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'
    

(This only works for simple commands; you can't do `< file if blah; then foo;
else bar; fi`)

~~~
alayne
"Useless use of cat" is one of those boring pedantic comments that makes me
cringe. Who cares? It's usually much more straightforward to build a pipeline
from left to right, particularly for people who are just learning this stuff.

~~~
coolj
> It's usually much more straightforward to build a pipeline from left to
> right, particularly for people who are just learning this stuff.

True, however, people pointing out UUOC are in fact pointing out that you
should not be building a pipeline at all. If you want to apply an awk / sed /
wc / whatever command to a file, then you should just do that instead of
piping it through an extraneous command.

Sure, as people always mention, in your actual workflow you might have a cat
or grep already, and are building a pipeline incrementally; there's no reason
to remove previous stuff to be "pure" or whatever. But if you're giving a
canonical example, there's no reason to add unneeded commands.

------
codezero
Please be very careful doing math with bash and awk...

    
    
      cat data.csv | awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'
    

From that command, it's unclear whether the sum will be accurate, it depends
on the inputs and on the precision of awk. See (D.3 Floating-Point Number
Caveats):
[http://www.delorie.com/gnu/docs/gawk/gawk_260.html](http://www.delorie.com/gnu/docs/gawk/gawk_260.html)
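
The classic double-precision example (gawk output shown):

    
    
      $ echo '0.1 0.2' | awk '{ printf "%.17g\n", $1 + $2 }'
      0.30000000000000004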

~~~
minimax
_Please be very careful doing math with bash and awk..._

I don't see how that's more true for awk than it is for any other programming
language. Awk uses double precision floating point for all numeric values,
which isn't a horrible choice for a catch-all numeric type.

------
pstuart
Starts off with unnecessary use of cat, e.g., cat file | awk 'cmds'.

One can simply do awk 'cmds' file.

~~~
mds
I know purists always complain about unnecessary cats, but I always find it
useful to start with "head" or "tail" in the first position to figure out my
pipeline, and then replace it with cat when it's all working.

And if the extra cat is actually making a measurable difference, maybe that's
a good signal that it's time to rewrite it in C.

~~~
sleepydog
You can do this with simple IO redirection. For example, the arbitrary pipeline

    
    
        $ cat data.txt | awk '{ print $2+$4,$0 }'|sort|sed '/^0/d'
    

can be written as

    
    
        $ <data.txt awk '{ print $2+$4,$0 }'|sort|sed '/^0/d'

~~~
koralatov
Some people prefer the first, longer-winded way because it's more explicit. To
some --- myself included --- it makes more sense because it explicitly breaks
each function into a separate step; I'm explicitly telling the system to print
the contents of data.txt rather than implicitly doing so. I'll happily type
those five extra characters for that additional clarity.

------
zeidrich
If you need them, Windows also has most of those tools somehow replicated in
Powershell. For instance, the initial example can be replicated with:

    
    
      Get-Content .\data.csv | %{ [int]$total += $_.Split('|')[3] }; Write-Host "$total"

~~~
dredmorbius
Or you can actually use the Linux commands by installing Cygwin. Pretty much
my first conscious action when I wake up stranded on a desert Windows system.

~~~
pseut
Why Cygwin instead of MinGW? (I can't remember why I prefer MinGW, but at some
point I had a reason).

~~~
dredmorbius
I couldn't care less which you choose so long as you're getting a proper Linux
toolset.

Powershell is a skill I don't have yet which carries over to ... precisely one
declining technical dinosaur (with a penchant for expiring its skillsets).

The Linux toolbox is a set of skills I embarked on learning over a quarter-
century ago, most of which goes back another decade or further (the 'k' in
'awk' comes from Brian Kernighan, one of Unix's creators). And while some old
utilities are retired and new ones replace them (telnet / rsh for ssh,
sccs/rcs for git), much of the core has remained surprisingly stable over
time.

The main difference between MinGW and Cygwin appears to be how Windows-native
they are considered, which for my own purposes has been an entirely irrelevant
distinction, though if you're building applications based off of the tools
might matter to you.

[http://www.mingw.org/node/21](http://www.mingw.org/node/21)

------
sleepydog
One trick I like to do is to feed two files into awk: /dev/stdin and some
other file I'm interested in. Here's an example: look up the subject names for
a list of serial numbers in an OpenSSL index.txt.

    
    
        #!/bin/sh
        
        printf %s\\n "$@" | awk -F'\t' '
            FILE == "/dev/stdin" {
                needle[$0] = 1
                next
            }
    
            needle[$4] {
                print $NF
            }
        ' /dev/stdin /etc/pki/CA/index.txt
    

I find myself using this idiom (feeding data from a file to awk and selecting
it with data from standard input) again and again. It's a great way to scale
shell scripts to take multiple arguments while avoiding opening the same file
N times, or doing clunky things with awk's -v flag.
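
Invocation looks something like this (calling the script above
lookup-subjects.sh, with hypothetical serial numbers):

    
    
      $ ./lookup-subjects.sh 1A2B3C 4D5E6F    # prints the subject for each serial found in index.txt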

------
nfg
Worth noting that:

    
    
        grep -A n -B n
    

is more easily written:

    
    
        grep -C n
    

If you think "C" for "context" this is easier to remember too.

------
jonjenk
If you're interested in this general topic check out this Wikibook from John
Rauser. He's a data scientist at Pinterest.

[http://en.wikibooks.org/wiki/Ad_Hoc_Data_Analysis_From_The_U...](http://en.wikibooks.org/wiki/Ad_Hoc_Data_Analysis_From_The_Unix_Command_Line)

------
res0nat0r
If this interests you, you should check out Joyents new Manta service which
lets you do this type of thing on your data via their infrastructure. It's
really cool.

[http://www.joyent.com/products/manta](http://www.joyent.com/products/manta)

~~~
jlgaddis
If I needed to do this type of thing on 10 TB of data, it would probably take
me longer to get the data to them than it would to just run it on my own
hardware.

Apparently there's a need for it, though, or it wouldn't exist.

~~~
mcavage
Disclaimer: I work at Joyent, on Manta.

This entire HN thread is a perfect example of why we built Manta. Lots of
engineers/scientists/sysadmins/... already know how to (elegantly) process
data using Unix and augmenting with scripts. Manta isn't about always needing
to work on a 10TB dataset (you can), but about it being always available, and
stored ready to go. I know we can't live without it for running our own
systems -- _all_ logs in the entire Joyent fleet are rotated and archived in
Manta, and we can perform both recurring/automated and ad-hoc analysis on the
dataset, without worrying about storage shares, or ETL'ing from cold storage
to compute, etc. And you can sample as little or as much as you want.
At least to us (and I've run several large distributed systems in my career),
that has tremendous value, and we believe it does for others as well. And
that's just one use case (log processing).

Like I said, disclaimers/bias/etc.

m

~~~
jrn
I know Manta has default software packaged, but is it possible to install your
own, like ghci or julia? Or is that something that needs to be brought in as
an asset? This isn't necessarily a feature request, just trying to figure out
how it works. [https://apidocs.joyent.com/manta/compute-instance-software.h...](https://apidocs.joyent.com/manta/compute-instance-software.html)

~~~
dap
An asset is currently the way to do that.

------
reyan
A short and nice read is Unix for Poets by Kenneth Ward Church:
[http://www.stanford.edu/class/cs124/kwc-unix-for-poets.pdf](http://www.stanford.edu/class/cs124/kwc-unix-for-poets.pdf)

~~~
hafabnew
I actually was about to post this -- this guide is great. As an undergraduate,
this was what was given to us to help demonstrate above-introductory command
line tools/pipes.

------
snorkel
Data science? Is that what we're calling DBA work nowadays?

------
xntrk
I'm surprised there was no mention of cut -d. It's good for simple stuff where
you don't need all of awk.

~~~
throwwiffle
and paste - quick and simple for working with columns

------
trimbo
Crush tools.

[https://code.google.com/p/crush-tools/](https://code.google.com/p/crush-tools/)

------
westurner
The "Text Processing" category of this list of unix utilities is also helpful:
[http://en.wikipedia.org/wiki/List_of_Unix_programs](http://en.wikipedia.org/wiki/List_of_Unix_programs)

BashReduce is a pretty cool application of many of these utilities.

------
mountaineer
Nice overview. Some guys at my previous company used to show off stuff like
this one line recommender (slide 6)

[http://www.slideshare.net/strands/strands-presentation-at-re...](http://www.slideshare.net/strands/strands-presentation-at-recked-presentation/6)

------
samspenc
Wow, I have to say I use _all_ these commands - these are also particularly
useful while testing Hadoop streaming jobs since you can test locally on your
shell using "cat | map | sort | reduce" (replace cat with head if you want)
and then actually run it in Hadoop.

------
noloqy
This reminds me of page 213 (175 in the book's numbering) of the Unix Haters
Handbook, found at
[http://web.mit.edu/~simsong/www/ugh.pdf](http://web.mit.edu/~simsong/www/ugh.pdf)

------
gwu78
"Imagine you have a 4.2GB CSV file." "

All you need... is the sum of all values in one particular column."

In that case, if speed was paramount, I'd use Kona or kdb. Unquestionably, k
is the best tool for that particular job.

~~~
lotsofcows
Really? In a recent test of a whole bunch of languages (scripting, compiled,
and JVM, but not Kona or kdb), our awk test was beaten only by C. awk was so
far ahead, its run time beat others' compile + run time.

~~~
gwu78
Yes, really. Nothing I know of beats the speed of k for this column-oriented
type of task. Kernighan himself tested it against awk many years ago and if I
recall it was generally faster even in that set of tests.

Where can I see your experimental design? I'd like to try to replicate your
results.

------
wasd
I've been using Ubuntu for about a year now, and although I feel comfortable
doing a lot of things with the command line, I'm not sure if I really know
enough about *nix. I wish there was a website with the 20-30 most useful unix
commands and very clear language as to what they do, with examples. Although
I've used all the tools in this post, I still enjoyed the use of examples.

~~~
bendmorris
Software Carpentry provides a good overview with examples: [http://software-carpentry.org/4_0/shell/index.html](http://software-carpentry.org/4_0/shell/index.html)

~~~
wasd
Looks pretty solid but a bit on the simple side. Thanks for the link.

------
kamaal
>>Writing a script in python/ruby/perl/whatever would probably take a few
minutes and then even more time for the script to actually complete.

Thankfully you can also write a Perl one liner. Which most of the times is far
powerful than awk.

~~~
dpatru
Sum the 0th field ($F[0]):

    
    
      perl -lane '$a += $F[0]; END{ print $a; }'

------
dbbolton
In some cases, command-line Perl will actually process piped text faster than
awk or even sed. I'm not sure about arithmetic, though.

------
Rickasaurus
"data science"... seriously?

