
Seven Unix Commands Every Data Scientist Should Know - lerax
http://neowaylabs.github.io/programming/unix-shell-for-data-scientists/
======
cybersol
'sort' and 'uniq' should also be near the top of the list. And once you're doing
more on the command line, 'join' and 'comm' can help you merge data from
multiple files.
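
A quick sketch of what that looks like (the file and column names here are made
up):

    # frequency table of the second column
    cut -d, -f2 data.csv | sort | uniq -c | sort -rn
    
    # merge two files on their first field; join wants both inputs sorted on it
    join -t, <(sort -t, -k1,1 a.csv) <(sort -t, -k1,1 b.csv)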

~~~
rstuart4133
Amen.

I'm guessing regexes are beyond Data Scientists [0], but throw sed and vim into
the mix and there are very few one-off problems that can be managed by a single
CPU that you can't do, and what's more, do more efficiently than with any other
tool chain. The overhead of throwing it into a SQL database or whatever is so
big that these simple tools simply blow them away if you are doing it just
once.

[0] I'm guessing a "Data Scientist" is someone who knows a lot about the data
and the scientific domain that created it, and to whom a computer is just
another hammer you hit the data with. A hammer that someone deliberately made
insanely and unnecessarily complex for job security, or something.

------
aviraldg
I'd add jq ([https://stedolan.github.io/jq/](https://stedolan.github.io/jq/))
to the list. JSON data is so common, and jq makes working with it a breeze.
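
For example (a minimal sketch; the endpoint and field names here are made up):

    # pretty-print a response, or pull one field out of every array element
    curl -s https://api.example.com/items | jq '.'
    jq -r '.[].name' items.json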

~~~
adrianN
And xsv for CSV data.
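
For instance (subcommand names from memory, worth checking against `xsv --help`):

    xsv headers big.csv                              # list the column names
    xsv select city,price big.csv | xsv sample 50    # project columns, sample 50 rows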

------
foundart
Folks may want to have a look at
[https://www.gnu.org/software/datamash/manual/datamash.html](https://www.gnu.org/software/datamash/manual/datamash.html)
I suppose it violates the Unix philosophy of one tool doing one thing well but
it may nevertheless be useful. See also the examples page
[https://www.gnu.org/software/datamash/examples/](https://www.gnu.org/software/datamash/examples/)
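
A small sketch of the sort of thing it does (flags from memory, assuming a
comma-separated file with a header row):

    # mean and median of column 2, grouped by column 1
    datamash -t, -s -H -g 1 mean 2 median 2 < data.csv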

~~~
lerax
> I suppose it violates the Unix philosophy

It is GNU (GNU's Not Unix); it is on purpose.

Seems a nice tool.

------
fwip
I would be a little bit shocked if any of the data scientists at my day job
didn't know all seven of these, so, I guess that's an accurate title.

~~~
eesmith
I didn't know about shuf because it's a GNU specific utility, and not
installed on my Mac by default.

Homebrew installs the GNU Core Utilities with the g- prefix, and checking now
I see it's available as gshuf.

I see a couple of problems with how it's used in this essay:

    cat big_csv.csv | shuf | head -n 50 > sample_from_big_csv.csv

Since shuf's -n option is "output at most COUNT lines", there should be no
need for the head:

    cat big_csv.csv | shuf -n 50 > sample_from_big_csv.csv

In principle this should be faster because there's less I/O, and it should
take less memory if shuf implements the -n option using something like
reservoir sampling. (EDIT: it does -
[https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/shu...](https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/shuf.c#n175)
)

Also, since shuf takes a filename, it could be:

    shuf -n 50 < big_csv.csv > sample_from_big_csv.csv

However, in terms of data science, don't most CSV files contain a header? Most
of mine do, and the example 'data.csv' has the header "var_x".

Using "shuf" means there's a very high chance that the sampled CSV file either
won't include the header line, or will have it somewhere other than the first
line.

Doesn't that mean that most data scientists will rarely use 'shuf' for
sampling from CSV files?

~~~
sk5t
There is no standard for CSV per se, although experience suggests most files
have a header. Use head/tail as needed to preserve or skip it in pipelines.

The GNU utils are usually the way to go; the Mac/Darwin/BSD variants have some
weird quirks and usually aren't worth the bother to fight with. awk and sort
are notably deficient.

Ease off the "data science" a tad--this is just scripting, a perfectly
honorable pastime! Why else would GNU coreutils exist in the first place?

~~~
eesmith
There is a CSV standard, RFC 4180. There are also many variants. I've used
products which specifically say they follow that RFC.

It's awkward to preserve a header using shuf, even with head/tail. Here's what
I came up with:

    head -1 x.csv && (awk 'NR>1' x.csv | shuf -n 5)

My point is that "shuf" is GNU specific, while the author said it was a Unix
tool like the other ones. Linux != Unix. Also, there are no Mac/Darwin/BSD
variants of shuf.

"Ease off"? Why? The term "data scientist" appear in the title and each of the
first three paragraphs, and the comment I replied to was "I would be a little
bit shocked if any of the data scientists at my day job didn't know all seven
of these". I want to know why data scientists who work with CSV files should
be using shuf and not some more appropriate tool like csvkit.

That is, the entire topic is about scripting _as applied to data science_, not
"just scripting".

~~~
sk5t
RFC 4180 is really a well-intentioned attempt to document popular ways CSV is
used in the wild, but it does not set a standard. There is no such thing as
CSV well-formedness in the way that exists for XML. Can a CSV field value span
lines or not? How are linebreaks encoded within a field? Can backslash escape
quote (or other) characters, or is the double-doublequote approach
contemplated in the RFC proper? How is one to know definitively if the first
line is a header or a data row? What about comments?

My beef with "data science" is that this is not science--it's even less
science than political science--it's munging.

~~~
eesmith
_shrug_ Sure, yes, I pointed out there are many variants. But you said there
wasn't a standard when there is. You meant to say that there is no widely used
consistent CSV format, I think, which is a bit different.

RFC 4180 says a CSV field can span multiple lines. Section 2.6:

> Fields containing line breaks (CRLF), double quotes, and commas should be
> enclosed in double-quotes.

You asked "Can backslash escape quote (or other) characters". 4180 says "no",
and that double-double quoting is used. The grammar in 2.7 is:

> escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE

The RFC says that the MIME type should use header=present or header=absent to
indicate if the first line is a header. "Implementors choosing not to use this
parameter must make their own decisions as to whether the header line is
present or absent."

I have never seen a CSV file with comments.

Note that the Python csv module, which supports a large number of CSV
dialects, also does not support comments.

While you have a beef, that doesn't change that you wanted to drag me off
topic (from "data science" to "just scripting"), when my point was that that
example wasn't a good data science example in the first place. It also isn't a
good scripting example.

------
msravi
I cannot recommend this enough:

The Awk Programming Language - Aho, Kernighan, Weinberger

[https://ia802309.us.archive.org/25/items/pdfy-MgN0H1joIoDVoI...](https://ia802309.us.archive.org/25/items/pdfy-MgN0H1joIoDVoIC7/The_AWK_Programming_Language.pdf)

The book is amazingly well written, and is invaluable.

~~~
AkshayD08
For a quick understanding of the power of awk :

[https://gregable.com/2010/09/why-you-should-know-just-little...](https://gregable.com/2010/09/why-you-should-know-just-little-awk.html)

~~~
justin_g
great, thanks

------
ams6110
They left out _rm_, used to clean up all their files when they are done so
other users can work.

~~~
sgillen
They also left out ls and cd, might have been considered too basic.

~~~
laumars
However they did include ‘cat’ and only mentioned its most basic common use
rather than its primary reason for being.

Ultimately guides like the one submitted are just one person’s braindump of
stuff they think might be helpful for others. It’s probably best not to
overthink why something was added or omitted.

------
unhammer
[http://visidata.org/](http://visidata.org/) is a nice one for quickly getting
an overview of some tabular data – you can even just stick it at the end of
your pipe. If

bzcat foo.bz2|sort|uniq -c|sort -nr | awk -f munge.awk |blah

produces a tsv, then

bzcat foo.bz2|sort|uniq -c|sort -nr | awk -f munge.awk |blah|vd

makes that tsv an interactive (if you think ncurses is interactive)
spreadsheet with plotting and pivot tables and mouse support :)

You can also save your keypresses in vd to a file and then re-run them at a
later stage – I've got some scripts to re-run an analysis and then run vd on
it and immediately set all columns to floats and open the frequency table so I
can see if I managed to lower the median this time.

~~~
in9
HOLY CRAP I've been trying to find visdata for the past year or so. Saw it
once here on HN and was completely unable to find it again.

Thanks for mentioning it.

~~~
rabidrat
Author of VisiData here, how did you try to find it (what did you search for)?
I'd like to make it easier to find but I'm not sure how.

------
Anthony-G
If you have a lot of files that may be processed by a `find` command and speed
is important, it’s worth knowing about the plus-sign variation of the `-exec`
expression. The command in the original article

    find . -name setup.py -type f -exec grep -Hn boto3 {} \;

could be written as

    find . -name setup.py -type f -exec grep -Hn boto3 {} +

The difference is that the first version (the `-exec` expression is terminated
with a semi-colon) forks a new process to run the `grep` command for each
individual file “found” by the preceding expressions. So, if there were 50
such `setup.py` files, the `grep` command would be invoked 50 times. Sometimes
this is the desired behaviour, but in this case `grep` can accept multiple
pathnames as arguments.

With the second version (expression is terminated with a plus-sign), the
pathnames of the files are collected into sets so that the `grep` command is
only called once for each set (similar to how the `xargs` utility works to
avoid exceeding the limits on the number of arguments that can be passed to a
command). This is much more efficient because only 1 `grep` child process is
forked – instead of 50.
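
For comparison, a roughly equivalent xargs pipeline (using -print0/-0 so odd
filenames don't break it) would be something like:

    find . -name setup.py -type f -print0 | xargs -0 grep -Hn boto3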

This functionality was added to the POSIX specification [1] a number of years
ago and I’ve been using it for at least 10 years on GNU/Linux systems. I
imagine it should be available on other Unix-like environments (including BSD
[2]) that data scientists are likely to be using – though the last time I had
to work on a friend’s Mac the installed versions of the BSD utilities were
quite old.

[1]:
[http://pubs.opengroup.org/onlinepubs/9699919799/](http://pubs.opengroup.org/onlinepubs/9699919799/)

[2]: [https://man.openbsd.org/find.1](https://man.openbsd.org/find.1)

------
asicsp
I have an example-based tutorial for all these commands, plus other CLI text
processing commands:

[https://github.com/learnbyexample/Command-line-text-processi...](https://github.com/learnbyexample/Command-line-text-processing)

------
dredmorbius
Problem: Given a CSV file, we want to know the number of columns just by
analyzing its header.

    $ head -n 1 data.csv | awk -F ',' '{print NF}'

Or spare a process:

    awk -F ',' 'NR <= 1 {print NF; exit}' data.csv

One of numerous weak points in this article.

------
yakshaving_jgt
> Prints on the screen (or to the standard output) the contents of files.
> Simple like that.

While it's not exactly false, it's also not a good explanation for cat. If you
just want to operate on the contents of a single file, you should use
redirection. The cat utility is for concatenating files.
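
A minimal sketch of the distinction:

    # cat's actual job: concatenating files
    cat part1.csv part2.csv > combined.csv
    
    # single file: redirect instead of `cat file | ...`
    wc -l < data.csv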

------
6keZbCECT2uB
tldr: grep, cat, find, head / tail, wc, awk, shuf with bonuses of xargs, and
man.

I've never needed shuf, and awk is a bit out of place in the list, but head
and tail have saved me from many a large file. The interesting data is usually
in head, tail, or grep anyway.

~~~
vajrabum
awk '{print $2,$4;}' is useful and easy to remember. $NF refers to the last
field. FS is the variable to override the default field separator of white
space. Here's an example of its use. OFS is the output field separator.

awk 'BEGIN{FS=":"; OFS=":"} {print $1,$NF}' /etc/passwd

Works on CSV files too with the right field separator.

~~~
masklinn
> awk '{print $2,$4;}' is useful and easy to remember. $NF refers to the last
> field. FS is the variable to override the default field separator of white
> space. Here's an example of its use. OFS is the output field separator.

Yeah, also cut is just inconvenient; awk works much better when working with
the usual tabulated data.

> Works on CSV files too with the right field separator.

You're probably better off using xsv though.

------
sgillen
Is there a real advantage to using awk over python for most tasks? Or is it
just a little faster/more convenient if you already know it?

~~~
asicsp
If you need a CLI tool, then awk clearly wins, as Python isn't well suited to
a CLI pipeline.

Speed is a factor too, if you consider combining it with other CLI tools, for
example [https://adamdrake.com/command-line-tools-can-be-235x-faster-...](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)
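
As a rough sketch (hypothetical files and columns), an awk one-liner drops
straight into a pipeline where a Python script would need extra plumbing:

    # sum column 3 grouped by column 1, straight off compressed logs
    zcat logs/*.gz | awk -F'\t' '{ sum[$1] += $3 } END { for (k in sum) print k, sum[k] }' | sort -k2,2 -rn | head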

------
colechristensen
Is "data science" so undeveloped that pipes and grep need to be on an
everyone-should-know list?

~~~
closeparen
There is a lot of low-hanging fruit available in teaching people from research
backgrounds about what we would consider baseline computer use and software
development practices. It turns out there is a large population of brilliant
people who can accomplish things in Python that I can barely comprehend in
abstract, much less line for line, who have also never heard of Git, unit
testing, or modularity.

~~~
colechristensen
I guess I was under the impression that the label indicated some level of
expertise in the basic tools of the trade that would differentiate it from
statistics or applied math.

Of course there is nothing wrong with specialization or lacking experience
when it is not particularly relevant to your chosen field, but something like
'grep' would seem to be the bread and butter of data science.

~~~
closeparen
That would be R, Pandas, SQL, etc.

------
pumanoir
Any book recommendations to understand and master the use of UNIX commands?

~~~
dredmorbius
System man pages.

_UNIX Power Tools_, Peek, et al., is somewhat aged, but excellent.

_The UNIX Programming Environment_ is even older, but highlights basic
philosophy.

For more recent tools, StackExchange Linux and shell topics can be
illuminating.

~~~
AkshayD08
The Unix Programming Environment was eye opening for me!

------
stakhanov
Let's not forget cut for dealing with csv.

~~~
daveFNbuck
They described how to use awk like cut.

------
noobermin
s/Data Scientist/Unix User/

~~~
yread
Hey, sed is not even on the list

