Hacker News
Seven Unix Commands Every Data Scientist Should Know (neowaylabs.github.io)
109 points by lerax on Mar 7, 2019 | 64 comments

'sort' and 'uniq' should also be near the top of the list. And once you're doing more on the command line, 'join' and 'comm' can help you merge data from multiple files.
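As a small sketch of the sort + join combination (file names and contents are made up):

```shell
cd "$(mktemp -d)"
printf 'alice,10\nbob,20\n'  > users.csv
printf 'alice,NYC\nbob,LA\n' > cities.csv
# join requires both inputs to be sorted on the join field
sort -t, -k1,1 -o users.csv  users.csv
sort -t, -k1,1 -o cities.csv cities.csv
join -t, users.csv cities.csv   # merge rows sharing column 1
# alice,10,NYC
# bob,20,LA
```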


I'm guessing regexes are beyond Data Scientists [0], but throw sed and vim into the mix and there are very few one-off problems manageable by a single CPU that you can't solve, and what's more, solve more efficiently than with any other tool chain. The overhead of loading the data into a SQL database or whatever is so big that these simple tools simply blow it away if you are only doing the job once.

[0] I'm guessing a "Data Scientist" is someone who knows a lot about the data and the scientific domain that created it, and to whom a computer is just another hammer you hit the data with. A hammer that someone deliberately made insanely and unnecessarily complex for job security, or something.

I can't tell you how many times a combination of sort and join, with a bit of awk, has saved my bacon. Seems to be a rather rare skill to have among the various Unix admins I've worked with in the past.

One thing to note: set LANG=C before doing operations with sort and join. I'm not sure if it is a bug, or if it affects all versions, but if you have, for example, LANG=en_US.utf8, then sort will use one order ('_' comes before '-'), while join uses ASCII order. Note that you don't have to export LANG=C; just put it before the command you are launching to apply it to that one command:

  LANG=C sort ...
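A quick way to see the byte-order behavior, with two made-up lines:

```shell
printf 'a_b\na-c\n' | LANG=C sort
# a-c
# a_b   ('-' is byte 0x2D, '_' is 0x5F, so ASCII order puts '-' first)
```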

I know this is not a "official" unix tool, but for join/comm specifically for CSV I love this tool: https://github.com/BurntSushi/xsv/releases/tag/0.13.0

xsv help: https://i.imgur.com/yS8cen7.png

I'd add jq (https://stedolan.github.io/jq/) to the list. JSON data is so common, and jq makes working with it a breeze.
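For instance, extracting one field from a nested document (the JSON here is made up; jq must be installed):

```shell
echo '{"users":[{"name":"ada"},{"name":"bob"}]}' | jq -r '.users[].name'
# ada
# bob
```

The `-r` flag prints raw strings instead of JSON-quoted ones, which makes the output pipe-friendly.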

And xsv for CSV data.

It works particularly well in cases where the JSON is too large to be parsed with a scripting language. An invaluable tool for JSON parsing.

Folks may want to have a look at https://www.gnu.org/software/datamash/manual/datamash.html I suppose it violates the Unix philosophy of one tool doing one thing well but it may nevertheless be useful. See also the examples page https://www.gnu.org/software/datamash/examples/

> I suppose it violates the Unix philosophy

It is GNU (GNU Is Not Unix), it is on purpose.

Seems a nice tool.

I would be a little bit shocked if any of the data scientists at my day job didn't know all seven of these, so, I guess that's an accurate title.

I didn't know about shuf because it's a GNU specific utility, and not installed on my Mac by default.

Homebrew installs the GNU Core Utilities with the g- prefix, and checking now I see it's available as gshuf.

I see a couple of problems with how it's used in this essay:

  cat big_csv.csv | shuf | head -n 50 > sample_from_big_csv.csv
Since shuf's -n option is "output at most COUNT lines", there should be no need for the head:

  cat big_csv.csv | shuf -n 50 > sample_from_big_csv.csv
In principle this should be faster because there's less I/O, and it should take less memory if shuf implements the -n option using something like reservoir sampling. (EDIT: it does - https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/shu... )
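For illustration, the reservoir-sampling idea can be sketched in a few lines of Python (this is the general algorithm, not the coreutils implementation):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniformly sample k items from a stream of unknown length in O(k) memory."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Keep item i with probability k/(i+1), evicting a random slot.
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample((f"row{i}" for i in range(100_000)), 5))
```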

Also, since shuf takes a filename, it could be:

  shuf -n 50 < big_csv.csv > sample_from_big_csv.csv
However, in terms of data science, don't most CSV files contain a header? Most of mine do, and the example 'data.csv' has the header "var_x".

Using "shuf" means there's a very high chance that the sampled CSV file either won't include the header line, or will have it somewhere other than the first line.

Doesn't that mean that most data scientists will rarely use 'shuf' for sampling from CSV files?

FWIW that's "xsv sample", and it handles headers by default (with an option for headers-less CSV):

    Randomly samples CSV data uniformly using memory proportional to the size of
    the sample.

    When an index is present, this command will use random indexing if the sample
    size is less than 10% of the total number of records. This allows for efficient
    sampling such that the entire CSV file is not parsed.

    This command is intended to provide a means to sample from a CSV data set that
    is too big to fit into memory (for example, for use with commands like 'xsv
    frequency' or 'xsv stats'). It will however visit every CSV record exactly
    once, which is necessary to provide a uniform random sample. If you wish to
    limit the number of records visited, use the 'xsv slice' command to pipe into
    'xsv sample'.

        xsv sample [options] <sample-size> [<input>]
        xsv sample --help

    Common options:
        -h, --help             Display this message
        -o, --output <file>    Write output to <file> instead of stdout.
        -n, --no-headers       When set, the first row will be consider as part of
                               the population to sample from. (When not set, the
                               first row is the header row and will always appear
                               in the output.)
        -d, --delimiter <arg>  The field delimiter for reading CSV data.
                               Must be a single character. (default: ,)

There is no standard for CSV per se, although experience suggests most files have a header. Use head/tail as needed to preserve or avoid it in pipelines.

The GNU utils are usually the way to go; the Mac/Darwin/BSD variants have some weird quirks and usually aren't worth the bother to fight with. awk and sort are notably deficient.

Ease off the "data science" a tad--this is just scripting, a perfectly honorable pastime! Why else would GNU coreutils exist in the first place?

There is a CSV standard, RFC 4180. There are also many variants. I've used products which specifically say they follow that RFC.

It's awkward to preserve a header using shuf, even with head/tail. Here's what I came up with:

  head -1 x.csv && (awk 'NR>1' x.csv | shuf -n 5)
My point is that "shuf" is GNU specific, while the author said it was a Unix tool like the other ones. Linux != Unix. Also, there are no Mac/Darwin/BSD variants of shuf.

"Ease off"? Why? The term "data scientist" appears in the title and each of the first three paragraphs, and the comment I replied to was "I would be a little bit shocked if any of the data scientists at my day job didn't know all seven of these". I want to know why data scientists who work with CSV files should be using shuf and not some more appropriate tool like csvkit.

That is, the entire topic is about scripting as applied to data science, not "just scripting".

RFC 4180 is really a well-intentioned attempt to document popular ways CSV is used in the wild, but it does not set a standard. There is no such thing as CSV well-formedness in the way that exists for XML. Can a CSV field value span lines or not? How are linebreaks encoded within a field? Can backslash escape quote (or other) characters, or is the double-doublequote approach contemplated in the RFC proper? How is one to know definitively if the first line is a header or a data row? What about comments?

My beef with "data science" is that this is not science--it's even less science than political science--it's munging.

shrug Sure, yes, I pointed out there are many variants. But you said there wasn't a standard when there is. You meant to say that there is no widely used consistent CSV format, I think, which is a bit different.

RFC 4180 says a CSV field can span multiple lines. Section 2.6:

> Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

You asked "Can backslash escape quote (or other) characters". 4180 says "no", and that double-double quoting is used. The grammar in 2.7 is:


The RFC says that the MIME type should use header=present or header=absent to indicate if the first line is a header. "Implementors choosing not to use this parameter must make their own decisions as to whether the header line is present or absent."

I have never seen a CSV file with comments.

Note also that the Python csv module, which supports a large number of CSV dialects, also does not support comments.

While you have a beef, that doesn't change that you wanted to drag me off topic (from "data science" to "just scripting"), when my point was that that example wasn't a good data science example in the first place. It also isn't a good scripting example.

I do something like:

    < foo.csv (read -r head; echo "$head"; shuf -n 50)
If you do this a lot, you could combine the read/echo into a single shell function.
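A possible shape for such a function (the name csv_sample is made up):

```shell
# Emit the header line from stdin, then a random sample of the remaining rows.
csv_sample() {
    IFS= read -r header
    printf '%s\n' "$header"
    shuf -n "${1:-50}"
}

printf 'var_x\n1\n2\n3\n4\n5\n' | csv_sample 2
```

The shell builtin `read` consumes only the first line of the pipe, so `shuf` sees just the data rows.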

Or write a Python program. But a quick test shows that CPython (using a pure Python reservoir sampling) is 1/4th the speed of shuf.

> don't most CSV files contain a header?

  tail +2

D'oh! I forgot that one, and gave an awk description instead.

Personally, I'd use another tool to remove the header before sending it to shuf. As an example:

    cat myfile | tail -n +2 | shuf -n 50

I know most of those, but I've never used awk.

One command I didn't see mentioned that was surprising was cut[1], especially the -d flag. (Maybe it's less necessary if you know awk?)

It's really great for cleaning up the output of poorly structured, wordy scripts. It's also a quick and dirty way to generate csvs with the --output-delimiter flag.
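For example (note that --output-delimiter is a GNU extension):

```shell
echo 'name:x:1000:/bin/bash' | cut -d ':' -f 1,4
# name:/bin/bash
echo 'name:x:1000:/bin/bash' | cut -d ':' -f 1,4 --output-delimiter=','
# name,/bin/bash
```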


I'm a Windows .NET dev and I know 6 of them. It's quite useful to run these on Windows using Cygwin. Grep I use a lot; the others, not so much.

I assumed the bc command would be listed.

I cannot recommend this enough:

The Awk Programming Language - Aho, Kernighan, Weinberger


The book is amazingly well written, and is invaluable.

For a quick understanding of the power of awk :


great, thanks

perl has all the awk features in its commandline interface. I tend to use it in place of awk

They left out rm, used to clean up all their files when they are done so other users can work.

They also left out ls and cd, might have been considered too basic.

However they did include ‘cat’ and only mentioned its most basic common use rather than its primary reason for being.

Ultimately guides like the one submitted are just one person's braindump of stuff they think might be helpful for others. It's probably best not to overthink why something was added or omitted.

http://visidata.org/ is a nice one for quickly getting an overview of some tabular data – you can even just stick it at the end of your pipe. If

bzcat foo.bz2|sort|uniq -c|sort -nr | awk -f munge.awk |blah

produces a tsv, then

bzcat foo.bz2|sort|uniq -c|sort -nr | awk -f munge.awk |blah|vd

makes that tsv an interactive (if you think ncurses is interactive) spreadsheet with plotting and pivot tables and mouse support :)

You can also save your keypresses in vd to a file and then re-run them at a later stage – I've got some scripts to re-run an analysis and then run vd on it and immediately set all columns to floats and open the frequency table so I can see if I managed to lower the median this time.

HOLY CRAP I've been trying to find visdata for the past year or so. Saw it once here on HN and was completely unable to find it again.

Thanks for mentioning it.

Author of VisiData here, how did you try to find it (what did you search for)? I'd like to make it easier to find but I'm not sure how.

If you have a lot of files that may be processed by a `find` command and speed is important, it’s worth knowing about the plus-sign variation of the `-exec` expression. The command in the original article

    find . -name setup.py -type f -exec grep -Hn boto3 {} \;
could be written as

    find . -name setup.py -type f -exec grep -Hn boto3 {} +
The difference is that the first version (the `-exec` expression is terminated with a semi-colon) forks a new process to run the `grep` command for each individual file “found” by the preceding expressions. So, if there were 50 such `setup.py` files, the `grep` command would be invoked 50 times. Sometimes this is desired behaviour but in this case, `grep` can accept multiple pathnames as arguments.

With the second version (expression is terminated with a plus-sign), the pathnames of the files are collected into sets so that the `grep` command is only called once for each set (similar to how the `xargs` utility works to avoid exceeding the limits on the number of arguments that can be passed to a command). This is much more efficient because only 1 `grep` child process is forked – instead of 50.
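A throwaway demonstration of the batched form (the directory layout here is made up via mktemp):

```shell
d=$(mktemp -d)
mkdir -p "$d/a" "$d/b"
printf 'import boto3\n' > "$d/a/setup.py"
printf 'import os\n'    > "$d/b/setup.py"
# Both pathnames are handed to a single grep invocation.
find "$d" -name setup.py -type f -exec grep -Hn boto3 {} +
```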

This functionality was added to the POSIX specification [1] a number of years ago and I’ve been using it for at least 10 years on GNU/Linux systems. I imagine it should be available on other Unix-like environments (including BSD [2]) that data scientists are likely to be using – though the last time I had to work on a friend’s Mac the installed versions of the BSD utilities were quite old.

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/

[2]: https://man.openbsd.org/find.1

I have an example based tutorial for all these commands plus other cli text processing commands


Problem: Given a CSV file, we want to know the number of columns just by analyzing its header.

   $ head -n 1 data.csv | awk -F ',' '{print NF}'
Or spare a process:

    awk -F ',' 'NR == 1 {print NF; exit}' data.csv
One of numerous weak points of this article.

> Prints on the screen (or to the standard output) the contents of files. Simple like that.

While it's not exactly false, it's also not a good explanation for cat. If you just want to operate on the contents of a single file, you should use redirection. The cat utility is for concatenating files.

tldr: grep, cat, find, head / tail, wc, awk, shuf with bonuses of xargs, and man.

I've never needed shuf, and awk is a bit out of place in the list, but head and tail have saved me from many a large file. The interesting data is usually in head, tail, or grep anyway.

awk '{print $2,$4;}' is useful and easy to remember. $NF refers to the last field. FS is the variable to override the default field separator of whitespace. Here's an example of its use. OFS is the output field separator.

awk 'BEGIN{FS=":"; OFS=":"} {print $1,$NF}' /etc/passwd

Works on CSV files too with the right field separator.
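For example, on a small made-up CSV (naive comma splitting — quoted fields containing commas will break, which is where CSV-aware tools help):

```shell
printf 'name,age,city\nada,36,NYC\nbob,41,LA\n' \
  | awk -F ',' -v OFS=',' '{print $1, $NF}'
# name,city
# ada,NYC
# bob,LA
```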

> awk '{print $2,$4;}' is useful and easy to remember. $NF refers to the last field. FS is the variable to override the default field separator of whitespace. Here's an example of its use. OFS is the output field separator.

Yeah also cut is just inconvenient, awk works much better when working with the usual tabulated data.

> Works on CSV files too with the right field separator.

You're probably better off using xsv though.

I have used awk several times in different ways. I still don't find it easy to remember a command with as many punctuation characters as letters, like your first one.

And I'm not even sure what the second one is trying to do. Print the user name, followed by the number of fields separated by colons? What good is that? Perhaps you've forgotten that not everyone can read awk syntax right off the page.

Def the first five there. wc as more of a tie-in with the former. Never used shuf and still need to find the time to learn awk.

When it comes to learning awk, my suggestion is to read the man page of one of the simpler implementations: original-awk[0] or plan9's[1] awk. They lack the GNU additions (many of which admittedly are useful), but they decrease the learning curve a lot. I pushed off learning awk for years since I got overwhelmed every time I typed 'man awk', but after stumbling on the man page from plan9 I learned it in no time. Sure, I don't know all the GNU extensions in gawk, but now I can look them up when I need them.

[0] https://manpages.debian.org/stretch/original-awk/awk.1.en.ht... [1] https://www.unix.com/man-page/plan9/1/AWK/

Thanks for this! I appreciate the pointers.

shuf I do not know; I would add an extra random field and then sort.

Is there a real advantage to using awk over python for most tasks? Or is just a little faster/more convenient if you already know it?

If you need a CLI tool, then awk clearly wins, as Python isn't suited to a CLI pipeline.

speed is a factor too, if you consider combining with other cli features, for example https://adamdrake.com/command-line-tools-can-be-235x-faster-...

It's usable inline and is great for state machine stuff (https://two-wrongs.com/awk-state-machine-parser-pattern.html), both useful for text processing.
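As a tiny illustration of the state-machine style (the BEGIN/END markers here are made up):

```shell
# Print only the lines between BEGIN and END markers, flipping a
# state flag as each marker goes by.
printf 'x\nBEGIN\na\nb\nEND\ny\n' \
  | awk '/^BEGIN$/ {f = 1; next} /^END$/ {f = 0} f'
# a
# b
```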

Is "data science" so undeveloped that pipes and grep need to be on an everyone-should-know list?

There is a lot of low-hanging fruit available in teaching people from research backgrounds about what we would consider baseline computer use and software development practices. It turns out there is a large population of brilliant people who can accomplish things in Python that I can barely comprehend in abstract, much less line for line, who have also never heard of Git, unit testing, or modularity.

I guess I was under the impression that the label indicated some level of expertise in the basic tools of the trade that would differentiate it from statistics or applied math.

Of course there is nothing wrong with specialization or lacking experience when it is not particularly relevant to your chosen field, but something like 'grep' would seem to be the bread and butter of data science.

That would be R, Pandas, SQL, etc.

Everyone starts from somewhere, no matter how “developed” a field is. I think we should encourage learning no matter what level people are at, instead of being scornful of how “advanced” a field of study or the people within it are. There are many reasons people might not be familiar with these commands. One super simple one is if they come from a Windows background, where GUIs often take precedence over the command line.

Regardless of how "undeveloped" the field of data science is, the fact remains that every professional data scientist should know these tools.

Any book recommendations to understand and master the use of UNIX commands?

System man pages.

UNIX Power Tools, Peek, et al, is somewhat aged, but excellent.

The UNIX Programming Environment is even older, but highlights basic philosophy.

For more recent tools, StackExchange Linux and shell topics can be illuminating.

The Unix Programming Environment was eye opening for me!

Let's not forget cut for dealing with csv.

They described how to use awk like cut.

cut is very limited: you can only specify a single character as the separator.
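For instance, awk's field separator can be a multi-character regex, which cut cannot do:

```shell
# cut -d '::' fails ("the delimiter must be a single character"),
# but awk treats a multi-character FS as a regular expression:
echo 'a::b::c' | awk -F '::' '{print $2}'
# b
```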

s/Data Scientist/Unix User/

Hey, sed is not even on the list

