
An introduction to data processing on the Linux command line - robertelder
https://blog.robertelder.org/data-science-linux-command-line/
======
tjlav5
If you're interested in this space, a great resource can be found at
[https://www.datascienceatthecommandline.com/](https://www.datascienceatthecommandline.com/)
(a free guide that accompanies an O'Reilly book).

------
dima55
A plug of my tools:

To visualize data coming in from a pipe, can pipe it to

[https://github.com/dkogan/feedgnuplot](https://github.com/dkogan/feedgnuplot)

Very useful in conjunction with other tools to provide filtering and
manipulation. For instance (the first one is mine):

[https://github.com/dkogan/vnlog](https://github.com/dkogan/vnlog)

[https://www.gnu.org/software/datamash/](https://www.gnu.org/software/datamash/)

[https://csvkit.readthedocs.io/](https://csvkit.readthedocs.io/)

[https://github.com/johnkerl/miller](https://github.com/johnkerl/miller)

[https://github.com/eBay/tsv-utils-dlang](https://github.com/eBay/tsv-utils-dlang)

[http://harelba.github.io/q/](http://harelba.github.io/q/)

[https://github.com/BatchLabs/charlatan](https://github.com/BatchLabs/charlatan)

[https://github.com/dinedal/textql](https://github.com/dinedal/textql)

[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv)

[https://github.com/dbohdan/sqawk](https://github.com/dbohdan/sqawk)

[https://stedolan.github.io/jq/](https://stedolan.github.io/jq/)

[https://github.com/benbernard/RecordStream](https://github.com/benbernard/RecordStream)

------
haddr
Command-line tools are powerful beasts (e.g. awk), and they were always central
to data preprocessing. But do we need to call it "data science" now?

~~~
p0cc
Yeah this article is about processing text data and not any form of
statistics, modeling, etc. I'm guessing they added "data science" because it's
in vogue? In any case, the provided title does not reflect the article.

~~~
KasianFranks
NLU. It relates to extracting intelligence from human language, most of which
comes in the form of text.

------
fizixer
Regarding the multiple mentions of UUOC in this thread:

- The original award started in 1995. Even though the Pentium was already out, I
think it is safe to say that was the era of 486 PCs. In 2019, for day-to-day
shell work (meaning no GBs of file processing or anything like that), isn't
invoking UUOC and pointing out inefficiencies an example of premature
optimization [1]?

- Isn't readability a matter of subjectivity? For some folks, 'cat file' is
more readable than '<file' or a direct use of a processing command (like grep,
tail, head, etc.) [2]. (The whole Stack Overflow page is fairly illuminating
[3].)

[1]
[http://wiki.c2.com/?PrematureOptimization](http://wiki.c2.com/?PrematureOptimization)

[2] [https://chat.stackoverflow.com/rooms/182573/discussion-on-answer-by-jonathan-leffler-useless-use-of-cat](https://chat.stackoverflow.com/rooms/182573/discussion-on-answer-by-jonathan-leffler-useless-use-of-cat)

[3]
[https://stackoverflow.com/questions/11710552](https://stackoverflow.com/questions/11710552)
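
For concreteness, here are the three spellings at issue. On any POSIX shell they produce identical output (the filename is a throwaway example):

```shell
# create a small throwaway file to grep through
printf 'foo\nbar\nfoobar\n' > demo.txt

cat demo.txt | grep foo     # "useless" use of cat
grep foo < demo.txt         # input redirection
grep foo demo.txt           # filename as an argument
# all three print: foo, foobar
```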

------
mark_l_watson
Not really where the author is heading, but I like to configure a backend for
matplotlib to render graphics in the terminal, so when I'm SSHed into a remote
system I can get inline plots.

~~~
dvrx
Better solution: sixel-gnuplot

Shameless plug: [https://github.com/csdvrx/sixel-gnuplot](https://github.com/csdvrx/sixel-gnuplot)

~~~
mark_l_watson
Thanks, I will try that.

~~~
dvrx
If you like it, share your terminal configuration!

mlterm works.

mintty had a regression; 3.1.0 may have fixed it.

------
ibern
Here are some ways you could simplify some of the tasks in the article, saving
on typing:

    cat data.csv | sed 's/"//g'

can be simplified by doing this instead (note that tr needs the -d flag to
delete characters):

    cat data.csv | tr -d '"'

This awk command:

    cat sales.csv | awk -F',' '{print $1}' | sort | uniq

Can be replaced with a simpler (IMO) cut instead:

    cat sales.csv | cut -d , -f 1 | sort | uniq

When using head or tail like this:

    head -n 3

You don't need the -n:

    head -3

Also shout out to jq, xsv, and zsh (extended glob), all nice complements to
the typical command line utils.
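
A quick sanity check of the substitutions above, on made-up throwaway data (one caveat: tr needs the -d flag to delete characters; a bare `tr d '"'` would translate `d` into `"` instead):

```shell
# sed and tr -d agree on stripping double quotes
printf '"a","b"\n' | sed 's/"//g'    # -> a,b
printf '"a","b"\n' | tr -d '"'       # -> a,b

# awk '{print $1}' and cut -f1 agree on the first CSV field
printf 'x,1\ny,2\nx,3\n' > sales_demo.csv
awk -F',' '{print $1}' sales_demo.csv | sort | uniq   # -> x, y
cut -d , -f 1 sales_demo.csv | sort | uniq            # -> x, y
```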

~~~
gnufx
If you want to simplify things, don't employ "useless use of cat". Pass the
file as a command arg or redirect input. And sort has options, so the
third/fourth commands can be

    sort -u -t, sales.csv

However, those fail with quoted commas.

Also, head -3 is non-POSIX obsolete syntax.

Edit: I don't know why I didn't see other UUOC references initially.
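
One wrinkle worth noting (sample data made up for illustration): `sort -u` deduplicates whole lines and prints whole lines, and with `-t, -k1,1` it deduplicates by the first field but still prints whole records; only the cut-based pipeline prints the bare keys.

```shell
printf 'x,1\nx,1\ny,2\n' > s.csv
sort -u s.csv                  # unique whole lines:     x,1  y,2
sort -u -t, -k1,1 s.csv        # unique by first field:  x,1  y,2
cut -d, -f1 s.csv | sort -u    # unique first fields:    x    y
```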

------
arminiusreturns
When I was at a genetics lab, I was helping some researchers with something and
spent three days writing a Perl script, which kept failing. I emailed one of
the authors of the paper the research was based on, and he said: why not try
awk like this? With a little work, I turned three days of Perl into a one-line
awk program that was faster than anything else for the job at the time. For me,
that was an inspirational moment about the fundamental power of the Unix
philosophy and the core utilities in Linux.

Good introductory article here!

------
mjirv
This is a great list and well-written. As a data professional, I use these
commands all the time and my job would be much harder without them. I also
learned a few new things here (`tee` and `comm`).

I was lucky that my first job was as a support engineer at a data-centric tech
company, which is where I learned these. I've often thought about how to teach
them to data analysts coming from a non-engineering background. This is
comprehensive but clear and would be a perfect resource for training someone
like that. Thank you!
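
Since `tee` and `comm` came up, a minimal illustration (filenames are throwaway examples; note that comm requires sorted input):

```shell
printf 'alice\nbob\ncarol\n' > a.txt
printf 'bob\ncarol\ndave\n' > b.txt

comm -12 a.txt b.txt    # lines common to both files -> bob, carol
comm -23 a.txt b.txt    # lines only in a.txt        -> alice

# tee copies the stream to a file while passing it downstream
cut -c1 a.txt | tee first_letters.txt | sort -u
```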

------
fizixer
I'll just leave one of my past comments [1] here.

[1]
[https://news.ycombinator.com/item?id=17324222](https://news.ycombinator.com/item?id=17324222)

P.S.: Not essential, but it really becomes a joy when, as a touch typist, I
have turned on vi mode in the shell (e.g., with 'set -o vi'). My fingers never
have to leave the home row while I do my shell piping work from start to
finish. (no mouse, no arrow keys, etc.)

~~~
hackerm0nkey
Haha. That’s me. Once you go ‘set -o vi’, you can’t go back

------
pferde
Huh, so it turns out that I've been a 'data scientist' for over 20 years. Who
knew?

~~~
kylek
That was my first thought skimming through this too. Either every *nix admin
who is aware of a few text-processing tools is a data scientist, or “data
scientists” are just as full of it as I expected.

~~~
jdjdjjsjs
Just because a tool can be used for A, B, or C, your expertise at using that
tool for A does not make you an expert in B and C.

The whole point of this article is that a lot of common Linux tools can be
used for data-science-like work (a significant part of which includes
preprocessing structured and unstructured text).

------
rodrigo975
Why do people use Linux in place of *nix?

Even worse, most of the tools (cat, grep, awk) are Unix commands, redeveloped
by the GNU project in most GNU/Linux distros.

~~~
clarry
> Why do people use Linux in place of *nix?

I find it more irritating when people try to score greybeard points by saying
*nix (or Unix) when it's obvious that they're talking about a Linux-only
mechanism and quite possibly haven't ever used Unix (or a direct derivative).

~~~
umanwizard
Huh? What is specific to Linux in this post?

Also, the most popular Unix-like OS (far more than Linux) is macOS, basically
the least “leet greybeard affectation” thing I can imagine. Your irritation is
way off base.

~~~
clarry
> Huh? What is specific to Linux in this post?

Perhaps nothing? I was responding to the complaint in general terms.

> Your irritation is way off base.

Please allow me to feel irritated when people refer to obvious Linux things as
something that's supposedly got something to do with Unix. It happens often
enough.

~~~
umanwizard
Fair enough; I suppose it’d be annoying for people to talk about “the Unix
concept of cgroup namespaces” or something like that.

I had thought that you were directly responding to the original poster.

------
wolfhumble
Very nice video, and I like the way you combine it with text and examples! :-)
Looking forward to reading the other articles on your page as well!

------
hackerm0nkey
Very useful article. Learned a couple of new things here.

While reading, the thought jumped out at me: I know most of this, so would
that make me a data scientist?

But then I quickly recovered: surely knowing some of the tools someone could
use in a certain domain does not make you an expert at that domain.

Might just be a case of same ingredients, different recipes.

------
pedro84
This is a little more awk-ish:

    awk -F, '$2 == "F" {$0=(($1-32)*5/9)",C"} {print}'
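
A quick run on made-up input: rows whose second field is "F" are converted, everything else passes through untouched.

```shell
printf '212,F\n100,C\n32,F\n' | \
  awk -F, '$2 == "F" {$0=(($1-32)*5/9)",C"} {print}'
# -> 100,C
#    100,C
#    0,C
```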

~~~
dvrx
I love awk too, but most people don't know much awk. Better to use common
tools and keep awk for when you absolutely need it.

------
pnutjam
This is still useful information for data scientists who end up on Linux.

------
oburb
This is also useful:
[https://www.gnu.org/software/datamash/](https://www.gnu.org/software/datamash/)

------
mnaydin
I wouldn't use awk for simple things such as

    cat sales.csv | awk -F',' '{print $1}'

but I'd prefer

    cut -d, -f1 sales.csv

------
teddyh
Useless use of cat detected!

_Remember, nearly all cases where you have:_

    cat file | some_command and its args ...

_you can rewrite it as:_

    <file some_command and its args ...

_and in some cases, such as this one, you can move the filename to the arglist
as in:_

    some_command and its args ... file

— Randal L. Schwartz
([http://porkmail.org/era/unix/award.html#cat](http://porkmail.org/era/unix/award.html#cat))

~~~
robertelder
Hah, I knew someone would point that out (which is why I talked about it in
the article).

I actually prefer useless cat because when you're prototyping a pipeline it's
very awkward to use non-useless cat. You'll probably start off with something
like this to observe the content of the file:

    cat something.txt

Using this doesn't work in bash:

    <something.txt

Then, continuing with useless cat to build on it you do

    cat something.txt | grep stuff

Which you can recall easily by pressing 'up' in your terminal. But if you use
non-useless cat, you have to re-type the entire thing or move the cursor
around:

    grep stuff < something.txt

With useless cat, you can keep adding things and check the result:

    cat something.txt | grep stuff | sed 's/"//g'

Or if you need to insert another filter before the last stage, you can just
press "up" and insert it:

    cat something.txt | grep -v negmatch | grep stuff

I don't think there is any easily-typed equivalent workflow with non-useless
cat.

~~~
jpxw
If you’re working with a lot of data you probably want to pipe it into head
anyway, initially, so

    <file head -n50 | whatever

can be the starting command. When you no longer need the head there, just get
rid of “head |”.

Although I agree that the pointing out of “useless cat” is usually not
particularly useful or constructive.

~~~
robertelder
Using head when there's lots of data makes sense, but I really don't see any
advantage to avoiding useless cat. Useless cat is way faster to type and to
make additions to. I sort of get the feeling that 'useless cat' is really just
a fun copypasta, kinda like when people post "I'd just like to interject for a
moment. What you're referring to as Linux, is in fact, GNU/Linux, or as I've
recently taken to calling it..."

~~~
drran
It's still good to be aware of `useless cat`, to save some CPU and I/O when
converting one-liners into scripts.

------
lonelappde
Good intro to data _processing_.

tsort and comm were news to me.

------
c06n
Can somebody explain the advantage of doing it on the command line vs in
Python or R? What would a practical use case look like?

~~~
robertelder
The most significant use case for all things command-line IMHO is
_automation_. Also, I would change that from "command line vs in Python or R"
to "command line and Python or R". Build a pipeline like I've discussed in the
article, then pipe it into Python or R.
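
A minimal sketch of that hand-off (the data and the summing script are made-up examples): the shell does the cheap filtering and sorting, and Python reads the cleaned stream from stdin.

```shell
# build the pipeline in shell, hand the cleaned stream to Python
printf '3\n1\n2\n' | sort -n | \
  python3 -c 'import sys; print(sum(int(line) for line in sys.stdin))'
# -> 6
```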

~~~
criddell
> Build a pipeline like I've discussed in the article, then pipe it into
> Python or R.

Why not just do it all in Python or R? That way you also get something that
will probably work on non-unix platforms.

~~~
robertelder
Over the years I've found that I usually fall into a pattern of starting with
low-fidelity automation in languages like shell and slowly re-writing it over
time into higher-level languages, usually Python first, then Java. This way,
unimportant tasks can be automated in less than 5 minutes with one of these
shell commands. If it breaks or has errors, no big deal. Python works well for
figuring out the structure of the solution as an actual program, and then
finally a language with static type checking when it _really_ needs to run
without errors.

~~~
criddell
I go to Python first because it's nice to be able to single-step through the
script with a debugger and monitor exactly what's happening. I also know
Python a lot better than shell script so it saves me a lot of time as well.

------
robertelder
Hi, (I wrote the article). A few people commented noting that I included "Data
Science" in the title, but the content doesn't include any statistics or
machine learning, which are closer to the core definition of 'data science'. I
still think the title is appropriate, since any kind of low-fidelity data
science task you do on some ad-hoc data (log files, heaps of text, web pages)
is going to start with setting up a processing pipeline that involves these
commands. I could have named it "An intro to text processing" or "An intro to
data processing", but then the people who need to see this content wouldn't
associate the title with something they're interested in, so they'd never
benefit from it. The list of commands was chosen specifically with the
question "What Linux commands would someone answering data science/business
intelligence questions use?" in mind. These commands are also among those
usually already installed on every system.

~~~
bitminer
Much can be done just with awk.

My pet peeve is the "grep | awk" idiom. No, just use awk.

Awk does map/reduce, relational joins, associative memory, table lookup, and
so on. Just use awk arrays, BEGIN blocks, and END blocks.
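
For instance (sample data made up for illustration), `grep pattern file | awk '{print $2}'` collapses into a single awk program, and an associative array handles a group-by in one pass:

```shell
# filter and project in one awk instead of grep | awk
printf 'err 5\nok 1\nerr 2\n' | awk '/err/ {print $2}'   # -> 5, 2

# map/reduce with an associative array: sum field 2 per key in field 1
printf 'east,10\nwest,5\neast,3\n' | \
  awk -F, '{sum[$1] += $2} END {for (k in sum) print k "," sum[k]}' | sort
# -> east,13
#    west,5
```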

~~~
dvrx
If I'm going to maintain the script myself and never have to tweak anything in
the middle of the night, sure.

But most people don't know awk, and awk requires more awareness. I break my
awk when I fix things while tired.

------
netmonk
Ugly UUOC (Useless Use Of Cat). Damn, people, I appreciate your will to
share, but share good content and stop spreading bad shell patterns...

~~~
Hikikomori
Yes, it made the whole article useless.

~~~
robertelder
I assume you're just joking around, but to you and the parent comment: I'd be
happy to hear any good arguments for avoiding 'useless' cat. Note that I did
mention 'useless cat' in the article, and there is already a comment thread
here that contains my opinions on it.

~~~
netmonk
So why should we repeat it?

