
Useful Unix commands for exploring data - aks_c
http://datavu.blogspot.com/2014/08/useful-unix-commands-for-exploring-data.html
======
zo1
"While dealing with big genetic data sets I often got stuck with limitation of
programming languages in terms of reading big files."

Hate to sound like Steve Jobs here, but: "You're using it wrong."

Let me elaborate. If you're running into "too big" or "too long" limitations
in your language of choice, then you're just a few searches away from learning
both how to solve the task at hand and how your language actually works. Both
of these will keep you from being hindered next time around when you have to
do a similar big-data job.

Perhaps you are more comfortable using pre-defined lego-blocks to build your
logic. Perhaps you understand the unix commands better than you do your chosen
language. But understand that programming is the same, just in a different
conceptual/knowledge space. And remember, always use the right tool for the
job!

(I use Unix commands daily as they're quick/dirty in a jiffy, but for complex
tasks I am more productive solving the problem in a language I am comfortable
in instead of searching through man pages for obscure flags/functionality)

~~~
barrkel
Most scripting languages aren't multithreaded, and some aren't pipeline
oriented by default.

For example, working with file lines naively in Ruby means reading the whole
lot into a giant array and doing transformations an array at a time, rather
than in a streaming fashion.

The shell gives you fairly safe concurrency and streaming for free.

Personally, if it's a complex task, I generally write a tool such that it can
be put into a shell pipeline.

Knowing the command line well - so that you don't often have to look up man
pages for obscure flags / functionality - has its own rewards, as these
commands turn into something you use all the time in the terminal. Rather than
spending a few minutes developing a script in an editor, you can incrementally
build a pipeline over a few seconds. Doing your script in a REPL is a better
approximation, but it's a bit less immediate.

~~~
dpeck
Not the case with Ruby at all; if you're reading the whole file into memory,
there's a good chance you're doing it wrong.

Check out yield and blocks.

~~~
barrkel
The problem is that the most obvious way of doing it -
File.readlines('foo.txt').map { ... }.select { ... } etc. - is not
stream-oriented.

~~~
riffraff
arguably, it's trivial to make that stream oriented

    
    
        open('tmp.rb').each_line.lazy.map {...}.select {...}
    

the problem with processing big files with ruby (in my humble experience) is
usually that it's still slow enough that "preprocessing with grep&uniq" is
worthwhile.

~~~
barrkel

        > open('tmp.rb').each_line.lazy
        NoMethodError: undefined method `lazy' for #<Enumerator: #<File:Procfile>:each_line>
    

Not everybody is using Ruby 2.0.

------
etrain
Some more tips from someone who does this every day.

1) Be careful with CSV files and UNIX tools - most big CSV files with text
fields have some subset of fields that are text quoted and character-escaped.
This means that you might have "," in the middle of a string. Anything (like
cut or awk) that depends on comma as a delimiter will not handle this
situation well.

2) "cut" has shorter, easier to remember syntax than awk for selecting fields
from a delimited file.

3) Did you know that you can do a database-style join directly in UNIX with
common command line tools? See "join" - it assumes your input files are sorted
by the join key (see the sketch after this list).

4) As others have said - you almost inevitably want to run sort before you
run uniq, since uniq only works on adjacent records.

5) sed doesn't get enough love: sed '1d' to delete the first line of a file.
Useful for removing those pesky headers that interfere with later steps. Not
to mention regex replacing, etc.

6) By the time you're doing most of this, you should probably be using python
or R.
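
To make points 2, 3 and 5 concrete, here's a rough sketch (the file names and
column layout are made up for illustration, and the caveat from point 1 about
quoted commas still applies): strip the headers with sed, sort both files on
the join key, then join them and pull columns out with cut.

        # hypothetical inputs: movies.csv (id,title,...) and ratings.csv (id,rating,...)
        sed '1d' movies.csv  | sort -t, -k1,1 > movies.sorted
        sed '1d' ratings.csv | sort -t, -k1,1 > ratings.sorted
        join -t, movies.sorted ratings.sorted | cut -d, -f1,2,3 | head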

~~~
collyw
Actually I would say Perl is more appropriate. I went back to Perl after 4
years for this sort of task, as it has so many features built into the syntax.
Plus it can be run as a one liner.

~~~
etrain
I'm reminded of the old joke, "python is executable pseudocode, while perl is
executable line noise."

But seriously, I've got some battle scars from the perl days, and hope not to
revisit them. Honestly, there's very little I find I can do with perl and not
python, and it's just as easy to express (if not quite as concise) and _much_
simpler to maintain.

But, use the tool that works for you!

~~~
collyw
I use Python and Django most of the time, and it's true, you can do pretty much
the same thing in each language. But for quick hacky stuff manipulating the
filesystem a lot, Perl has many more features built into the language. Things
like regex syntax, globbing directories, backticks to execute Unix commands,
and the fact you can use it directly from the command line as a one-liner. You
can do all these (except the last one?) in Python, but Perl is quicker.

~~~
vram22
>But for quick hacky stuff manipulating the filesystem a lot, Perl has many
more features built into the language. Things like regex syntax, globbing
directories, backticks to execute Unix commands

All good points.

>you can use it directly from the command line as a one liner. You can do all
these (except the last one?) in Python

You can use Python from the command line too, but Perl has more features for
doing that, like the -n and -p flags. Then again, Python has the fileinput
module. Here's an example:

[http://jugad2.blogspot.in/2013/05/convert-multiple-text-
file...](http://jugad2.blogspot.in/2013/05/convert-multiple-text-files-to-pdf-
with.html)

------
CraigJPerry
>> If we don't want new file we can redirect the output to same file which
will overwrite original file

You need to be a little careful with that. If you do:

    
    
        uniq -u movies.csv > movies.csv
    

The shell will first open movies.csv for writing (the redirect part) then
launch the uniq command connecting stdout to the now emptied movies.csv.

Of course when uniq opens movies.csv for consumption, it'll already be empty.
There will be no work to do.

There are a couple of options to deal with this, but the temporary intermediate
file is my preference, provided there's sufficient space: it's easily
understood, and if someone else comes across the construct in your script,
they'll grok it.
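
For reference, a minimal sketch of that temp-file approach (and, if you happen
to have moreutils installed, sponge does the same dance for you):

        uniq -u movies.csv > movies.tmp && mv movies.tmp movies.csv
        # or, with moreutils:
        uniq -u movies.csv | sponge movies.csv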

~~~
aks_c
Thank you for inputs, how about this?

uniq -u movies.csv > temp.csv

temp.csv > movie.csv

rm temp.csv

~~~
icebraining

      $ temp.csv > movie.csv
      temp.csv: command not found

~~~
kyllo
He forgot his cat.

------
WestCoastJustin
My personal favorite is to use this pattern. You can do some extremely cool
counts and group-by operations at the command line [1]:

    
    
      grep '01/Jul/1995' NASA_access_log_Jul95 | 
        awk '{print $1}' | 
        sort | 
        uniq -c | 
        sort -h -r | 
        head -n 15
    

Turns this:

    
    
      199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
      unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
      199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
      burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
      199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179
    

Into this:

    
    
        623 piweba3y.prodigy.com
        547 piweba4y.prodigy.com
        536 alyssa.prodigy.com
        463 disarray.demon.co.uk
        456 piweba1y.prodigy.com
        417 www-b6.proxy.aol.com
        350 burger.letters.com
        300 poppy.hensa.ac.uk
        279 www-b5.proxy.aol.com
    

[1] [https://sysadmincasts.com/episodes/28-cli-monday-cat-grep-
aw...](https://sysadmincasts.com/episodes/28-cli-monday-cat-grep-awk-sort-and-
uniq)

~~~
vram22
I think there's something like that in the Kernighan and Pike book I referred
to elsewhere in this thread, and also, that code looks similar to this
technique:

[http://en.wikipedia.org/wiki/Decorate-sort-
undecorate](http://en.wikipedia.org/wiki/Decorate-sort-undecorate)

, i.e. Decorate-Sort-Undecorate (DSU), related to the Schwartzian transform.

------
CGamesPlay
For working with complex CSV files, I highly recommend checking out CSVKit
[https://csvkit.readthedocs.org/en/0.8.0/](https://csvkit.readthedocs.org/en/0.8.0/)

I've just started using it, and the only limitation I've so far encountered
has been that there's no equivalent to awk (i.e. I want a way to evaluate a
python expression on every line as part of a pipeline).

~~~
vdm
Get words starting with "and"

    
    
        $ cat /usr/share/dict/words | py -fx 're.match(r"and", x)' | head -5
        and
        andante
        andante's
        andantes
        andiron
    

[https://github.com/Russell91/pythonpy](https://github.com/Russell91/pythonpy)

~~~
CGamesPlay
Sorry, I meant: remove the characters "$" and "," from the 3rd column of a CSV
file. Obviously the CSV file is quoted, since it has commas in the 3rd column,
and so awk is no longer an acceptable solution.

------
hafabnew
Not to sound too much like an Amazon product page, but if you like this,
you'll probably quite like "Unix for Poets" -
[http://www.lsi.upc.edu/~padro/Unixforpoets.pdf](http://www.lsi.upc.edu/~padro/Unixforpoets.pdf).
It's my favourite 'intro' to text/data mangling using unix utils.

------
pessimizer
I'd like to repeat peterwwillis in saying that there are very Unixy tools that
are designed for this, and update his link to my favorite, csvfix:
[http://neilb.bitbucket.org/csvfix/](http://neilb.bitbucket.org/csvfix/)

Neat selling points: csvfix eval and csvfix exec

Also: the last commit to csvfix was 6 days ago; it's active, mature, and the
developer is very responsive. If you can think of a capability that it doesn't
have yet, tell him and you'll have it in no time. :)

------
tdicola
If you're on Windows, you owe it to yourself to check out a little known
Microsoft utility called logparser:
[http://mlichtenberg.wordpress.com/2011/02/03/log-parser-
rock...](http://mlichtenberg.wordpress.com/2011/02/03/log-parser-rocks-more-
than-50-examples/) It effectively lets you query a CSV (or many other log file
formats/sources) with a SQL-like language. Very useful tool that I wish was
available on Linux systems.

~~~
727374
LogParser is one of the few things I really miss from Windows. I think there
are unix equivalents, but I haven't had the time to invest in learning them.
Pretty much every example in this article boiled down to 'Take this CSV and
run a simple SQL query on it'. Yes, you can do that by piping through various
unix utilities, or you could just use a tool meant specifically for the task.
I'd like to see the article explore some more advanced cases, like rolling up
a column. I actually had to do this yesterday and ended up opening my data in
OpenOffice and using a pivot table.

------
zerop
7 command-line tools for data science
[http://jeroenjanssens.com/2013/09/19/seven-command-line-
tool...](http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-
data-science.html)

Useful Unix commands for data science
[http://www.gregreda.com/2013/07/15/unix-commands-for-data-
sc...](http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/)

~~~
jeroenjanssens
That first blog post was the inspiration for a book, which is almost finished:
[http://datascienceatthecommandline.com](http://datascienceatthecommandline.com)

~~~
a3n
O'Reilly is having a 50% sale on all ebooks through 9 September.

[http://oreilly.com/](http://oreilly.com/)

I just bought the early release of that exact book for $13.60, which was 60%
off, because you get 60% off if you order $100 worth of prediscount ebooks.

[http://shop.oreilly.com/product/0636920032823.do](http://shop.oreilly.com/product/0636920032823.do)

When the book is finished you get the final version. It's mostly already
finished.

"With Early Release ebooks, you get books in their earliest form — the
author's raw and unedited content as he or she writes — so you can take
advantage of these technologies long before the official release of these
titles. You'll also receive updates when significant changes are made, new
chapters as they're written, and the final ebook bundle."

------
peterwwillis
You can also find tools designed for your dataset, like csvkit [1], csvfix [2],
and other tools [3] (I even wrote my own CSV munging Unix tools in Perl back in
the day).

[1]
[http://csvkit.readthedocs.org/en/0.8.0/](http://csvkit.readthedocs.org/en/0.8.0/)
[2] [https://code.google.com/p/csvfix/](https://code.google.com/p/csvfix/) [3]
[https://unix.stackexchange.com/questions/7425/is-there-a-
rob...](https://unix.stackexchange.com/questions/7425/is-there-a-robust-
command-line-tool-for-processing-csv-files)

------
sheetjs
caveat: delimiter-based commands are not quote-aware. For example, this is a
CSV line with two fields:

    
    
        foo,"bar,baz"
    

However, the tools will treat it as 3 columns:

    
    
        $ echo 'foo,"bar,baz"' | awk -F, '{print NF}'
        3

~~~
napD
Is there any workaround?

~~~
mbreese
Don't use CSV files...

If I'm working with a datafile where I expect the delimiter to be in one of
the fields, there is something wrong.

This is one reason why I always work with tab delimited files. Having an
actual tab character isn't very common in free-text fields, at least in the
data that I work with. Commas on the other hand, are quite common. Why one
would select a field separator that was common in your data is beyond me (I
know it's historical).

Your data files might be different, in which case, maybe you should select a
different field separator.

Otherwise, no, there is no workaround. If you have to quote fields, then you
can't use the normal unix command line tools that tokenize fields.
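
One middle ground, if the data arrives as quoted CSV anyway: convert it to
tab-delimited once, then use the ordinary tools. csvkit (mentioned elsewhere in
this thread) ships a csvformat command for this; the -T flag below is from
memory, so check csvformat --help if it complains.

        csvformat -T quoted.csv > data.tsv
        cut -f3 data.tsv | head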

------
xtacy
I am surprised no one has mentioned datamash:
[http://www.gnu.org/software/datamash/](http://www.gnu.org/software/datamash/).
It is a fantastic tool for doing quick filtering, group-by, aggregations, etc.
Previous HN discussion:
[https://news.ycombinator.com/item?id=8130149](https://news.ycombinator.com/item?id=8130149)
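
A quick sketch of the sort of thing datamash does (the sales.csv file and
column numbers are invented): group by the first column of a comma-separated
file and sum the second, with -s sorting the input first since the grouping,
like uniq, wants adjacent keys.

        datamash -s -t, -g 1 sum 2 < sales.csv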

------
ngcazz
No one gives a shit about cut.

    
    
        $ man 1 cut

~~~
mprovost
I'm always surprised when people recommend awk for pulling delimited sections
of lines out of a file; cut is so much easier to work with.

~~~
ciupicri
That's because _cut_ sucks when fields can be separated by multiple space or
tab characters.

    
    
        # printf '1 2\t3' | cut -f 2
        3
        # printf '1 2\t3' | awk '{print $2}'
        2
        # printf '1 2\t\t3' | cut -f 2
        
        # printf '1 2\t\t3' | awk '{print $2}'
        2
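
One common workaround (not a fix for cut itself) is to squeeze the runs of
blanks down to a single separator first, e.g. with tr -s, so cut's field
numbering lines up with awk's:

        # printf '1 2\t\t3' | tr -s ' \t' ' ' | cut -d' ' -f2
        2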

------
nailer
I love Unix pipelines, but chances are your data is structured in such a way
that using regex-based tools will break that structure unless you're very
careful.

You know that thing about not parsing HTML with regexes? The same rule applies
to CSV, TSV, and XLSX. All of these can be created, manipulated and read using
Python, which is probably already on your system.

~~~
icantthinkofone
In other words, use Unix commands and pipes when you can. Don't use them when you can't.

------
jason_slack
The author states:

    
    
        uniq -u movies.csv > temp.csv 
        mv temp.csv movie.csv 
    
        Important thing to note here is uniq won't work if duplicate records are not adjacent. [Addition based on HN inputs]
    

Would the fix here be to sort the lines using the `sort` command first, then
`uniq`?

~~~
icebraining
Yes, but not first, rather instead. "sort -u" both sorts and hides duplicates.

~~~
mzs
except when you need uniq -c

------
letflow
To run Unix commands on terabytes of data, check out
[https://cloudbash.sh/](https://cloudbash.sh/). In addition to the standard
Unix commands, their join and group-by operations are amazing.

We're evaluating replacing our entire ETL with cloudbash!

------
LiveTheDream
I use this command very frequently to check how often an event occurs in a log
file over time (specifically in 10-minute buckets), assuming the file is
formatted like "INFO - [2014-08-27 16:16:29,578] Something something
something"

    
    
        cat /path/to/logfile | grep PATTERN | sed 's/.*\(2014-..-..\) \(..\):\(.\).*/\1 \2:\3x/' | uniq -c
    

results in:

    
    
        273 2014-08-27 14:5x
        222 2014-08-27 15:0x
        201 2014-08-27 15:1x
        171 2014-08-27 15:2x
        349 2014-08-27 15:3x
        230 2014-08-27 15:4x
        236 2014-08-27 15:5x
        339 2014-08-27 16:0x
        330 2014-08-27 16:1x
    

This can subsequently be visualized with a tool like gnuplot or Excel.

~~~
mprovost
Useless use of cat?

~~~
bch
"Don't pipe a cat" is how I'm used to describing what you're talking about --
it may have been a performance issue in days past, but these days I think it's
simply a matter of style. Not that style is not important.

~~~
mprovost
This was drilled into me back in the usenet days. If you see a cat command
with a single argument it's almost always replaceable by a shell redirection,
or in this case just by passing the filename as an argument to grep. If you're
processing lots of data like in the article there's no point in passing it
through a separate command and pipe first.

~~~
smorrow
I think people like reading cat thefile | grepsedawk -opts 'prog' from left to
right, and that they think the only alternative is grepsedawk -opts 'prog'
thefile.

But there's grep <thefile -opts 're'. I like that one best; it reads the same
way you'd tend to think it.

------
emeraldd
uniq also doesn't deal well with duplicate records that aren't adjacent. You
may need to do a sort before using it.

    
    
       sort | uniq
    

But that can screw with your header lines, so be careful there too.
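
One way around the header problem, sketched with a hypothetical file: print
the first line untouched, and only sort/dedupe the rest.

        (head -n 1 movies.csv; tail -n +2 movies.csv | sort | uniq) > deduped.csv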

~~~
sheetjs
You can do this without sorting:

    
    
        awk '!x[$0]++'

~~~
_delirium
That's usually faster where possible, but it may cause problems on large data
sets, since it loads the entire set of unique strings (and their counts) into
an in-memory hash table.

------
bmsherman
I may as well plug my little program, which takes numbers read line-by-line
from standard input and outputs a live-updating histogram (and some summary
statistics) in the console!

[https://github.com/bmsherman/LiveHistogram](https://github.com/bmsherman/LiveHistogram)

It's useful if you want to, say, get a quick feeling of the distribution of
numbers in some column of text.

------
forkandwait
"rs" for "reshape array". Found only on FreeBSD systems (yes, we are better...
_smile_ )

For example, transpose a text file:

    
    
        ~/ (j=0,r=1)$ cat foo.txt
        a b c
        d e f
        ~/ (j=0,r=0)$ cat foo.txt | rs -T
        a d
        b e
        c f
    

Honestly I have never used it in production, but I still think it is way cool.

Also, being forced to work in a non-Unix environment, I am always reminded how
much I wish everything were either text files, zipped text files, or a SQL
database. I know for really big data (bigger than our typical 10^7 row
dataset, like imagery or genetics), you have to expand into things like HDF5,
but part of my first data cleaning sequence is often to take something out of
Excel or whatever and make a text file from it and apply unix tools.

~~~
jingo
"Found only on FreeBSD..."

Also found on NetBSD, OpenBSD and DragonFlyBSD.

~~~
clarry
Utree says rs dates back to the early 80s...

[http://minnie.tuhs.org/cgi-
bin/utree.pl?file=4.2BSD/usr/src/...](http://minnie.tuhs.org/cgi-
bin/utree.pl?file=4.2BSD/usr/src/new/new/tools/man/rs.1)

------
aabaker99
You should mention this behavior of uniq (from the man page on my machine):

Note: ’uniq’ does not detect repeated lines unless they are adjacent. You may
want to sort the input first, or use ‘sort -u’ without ‘uniq’.

Your movies.csv file is already sorted, but you don't mention that sorting is
important for using uniq, which may be misleading.

    
    
        $ cat tmp.txt
        AAAA
        AAAA
        BBBB
        DDDD
        BBBB
        $ uniq -d tmp.txt
        AAAA
    

------
michaelmior
It's good to note that `uniq -u` does remove duplicates, but it doesn't output
any instances of a line which has been duplicated. This is probably not clear
to a lot of people reading this.

~~~
barrkel
`uniq` removes duplicates; `uniq -u` only shows unique lines.
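
A three-line demo of the difference, for anyone following along:

        $ printf 'a\na\nb\n' | uniq     # collapse adjacent duplicates -> a, b
        $ printf 'a\na\nb\n' | uniq -u  # only lines that never repeat -> b
        $ printf 'a\na\nb\n' | uniq -d  # only lines that do repeat    -> a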

~~~
michaelmior
Exactly. The point wasn't clear from reading the article.

------
platz
[http://www.gregreda.com/2013/07/15/unix-commands-for-data-
sc...](http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/)

------
zo1
Just one other thing I'd like to mention before everyone moves on to another
topic. Not all of the unix commands are equal, and some have features that
others don't.

E.g. I mainly work on AIX, and a lot of the commands are simply not the same
as what they are on more standard Linux flavors. From what I've heard, this
applies between different distros as well.

Not so much the case with standard programming languages that are portable,
e.g. Python. Unless you take into account Jython, etc.

------
billyhoffman
For last line, I always did

    
    
       tac [file] | head -n 1
    

Mainly because I can never remember basic sed commands

(Strange, OS X doesn't seem to have tac, but Cygwin does...)
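
For completeness: the sed spelling is short too, and plain tail sidesteps sed
entirely.

        sed -n '$p' file.txt   # print only the last line
        tail -n 1 file.txt     # same result, no sed required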

~~~
_delirium
The BSDish way of pronouncing 'tac' is 'tail -r'.

------
baldfat
Certain people might miss the point of why to use the command line.

1) I use this before using R or Python, and ONLY when it's something I
consistently need done all the time. It makes my R scripts shorter.

2) Some things just need a simple fix, and these commands are great for that.

Learn awk and sed and your data-munging toolkit gets much larger.

~~~
RBerenguel
Exactly! I had a longish period when I wanted to do everything with the same
tool. Now I try to pick the most efficient one (for me, not the machine) to do
it. Csvfix, awk, sed, jq and several other command line goodies make my life
easier; the heavy lifting goes to R, Gephi, or some ad-hoc Python, Go or C.

~~~
collyw
Fine if your only tool is Perl.

------
vesche
Using basic Unix commands in trivial ways, am I missing something here?

~~~
collyw
It's data science these days. Latest buzzword. Old is new.

------
known
[https://www.gnu.org/software/parallel/parallel_tutorial.html](https://www.gnu.org/software/parallel/parallel_tutorial.html)
is very handy
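
As a rough example of why it's handy here (the log file name is borrowed from
another comment in this thread): --pipe chops stdin into --block sized chunks
and runs one grep per chunk across your cores, with -k keeping the output in
the original order.

        cat NASA_access_log_Jul95 | parallel -k --pipe --block 50M grep '01/Jul/1995'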

------
jweslley
I built stats-tools for use in place of awk for basic statistics.

[https://github.com/jweslley/stats-tools](https://github.com/jweslley/stats-
tools)

------
dima55
Then you can make plots by piping to
[https://github.com/dkogan/feedgnuplot](https://github.com/dkogan/feedgnuplot)

------
known
sort -T your_tmp_dir is very useful for sorting large data
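
For example (the paths and sizes are made up): point -T at a disk with room for
the spill files, and -S gives sort a bigger in-memory buffer while you're at it.

        sort -T /mnt/scratch -S 2G huge.csv > huge.sorted.csv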

------
forkandwait
There is a command on FreeBSD for transposing text table rows to columns and
vice versa, but I can't remember or find it now. It is in core, fwiw.

------
OneOneOneOne
awk / gawk is super useful. For C/C++ programmers the language is very easy to
learn. Try running "info gawk" for a very good guide.

I've used gawk for many things, ranging from data analysis to generating
linker / loader code in an embedded build environment for a custom processor /
sequencer.
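
A small taste for the unconvinced (the file and column are hypothetical): sum
and average the third column of a comma-delimited file.

        awk -F, '{ sum += $3; n++ } END { print sum, sum / n }' data.csv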

(You can even find a version to run from the Windows command prompt if you
don't have Cygwin.)

------
squigs25
sort before you uniq!

------
CharlesMerriam2
Do you have a pastebin of the CSV file? Time to play...

------
geh4y806
just checking!

------
cyphunk
Really, HN? If you find yourself depending heavily on the recommendations in
this article, you are doing data analysis wrong. Shell-fu is relevant to data
analysis only as much as regex is. In the same light, depending on these
methods too much is digging a deep knowledge ditch that in the end is going to
limit and hinder you way more than the initial ingress time required to learn
more capable data analytics frameworks, or at least a scripting language.

Still, on international man page appreciation day this is a great reference.
The only thing it is missing is gnuplot ASCII graphs.

------
gesman
Use splunk.

'nuff said.

------
lutusp
Quote: "While dealing with big genetic data sets ..."

What a great start. Unless he's a biologist, the author means _generic_ , not
_genetic_.

The author goes on to show that he can use command-line utilities to
accomplish what database clients do much more easily.

~~~
kafkaesk
A few blog posts earlier, the author writes about "Network Analysis
application in Genetic Studies", so I am confident this isn't a typo.

And for a quick-and-dirty custom analysis of big data sets, the Unix tools
might be a lot more convenient than databases.

