
Explorations in Unix - telemachos
http://www.drbunsen.org/explorations-in-unix.html
======
frankc
I use unix in the same way and for the same purpose as described in the blog,
but I have come to the opinion that once you get into the describe and
visualize phase, it's much easier to just drop into R. Reading in the kind of
file being worked on here is often as simple as

foo <- read.csv("foo.csv")

Getting summary descriptive statistics, item counts, scatter plots and
histograms is often as easy as

summary(foo)

table(foo$col)

plot(foo$xcol, foo$ycol)

hist(foo$col)

I think that is a lot simpler than a 4- or 5-command pipeline that can be
mistake-prone to edit when you want to change column names or things like
that. I still do these kinds of things in the shell sometimes, and I don't
know if I can put my finger on exactly when I would drop into R vs. write out
a pipeline, but there IS a line somewhere...
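
For a concrete comparison, here is a rough sketch of the kind of pipeline I
mean for the table(foo$col) case (the delimiter and column number are made up
for a hypothetical foo.csv):

    # count the occurrences of each value in column 2 of a CSV
    cut -d, -f2 foo.csv | sort | uniq -c | sort -rn

Each of those four stages is something to re-check whenever the file layout
changes, which is exactly the editing overhead I mean.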

~~~
chimeracoder
I completely agree -- while I love using the command line for most tasks, R is
ideally suited to this... and rightly so, since that's the whole point of the
language!

R also lets you apply scalar operations to whole vectors (it automatically
maps the operation over the vector), along with other idioms that would be
liabilities in other programming languages but are hugely beneficial in this
one area.

I still haven't found a language that makes reading (and then processing) a
CSV file easier than R does.

Conveniently, R should be really easy to pick up for someone familiar with
working with a POSIX shell or Bash. For example, to see all variables defined
in the current namespace, just type ls().

R is basically the POSIX mentality applied specifically to data processing,
instead of general-purpose work.

~~~
myg204
One exception, though: piping unix commands this way allows handling of
_very_ large files (a streaming model), whereas R requires loading the whole
content into memory.

~~~
dredmorbius
That depends on the statistics you're calculating. Sums and means stream
trivially, and variance can too if you keep running sums; exact percentiles
and other order statistics need the set as a whole, which means either
re-reading the dataset or holding it in memory.

That said, I've got my own univariate statistics generator written in awk that
can handle sets with millions of values readily on desktop/laptop hardware.
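
Not that exact tool, but a minimal sketch of the one-pass approach, assuming
one numeric value per line in a file called data (the file name and output
format are placeholders):

    # single streaming pass: count, min, max, mean, (population) std deviation
    awk '{ n++; s += $1; ss += $1 * $1
           if (n == 1 || $1 < min) min = $1
           if (n == 1 || $1 > max) max = $1 }
         END { if (n == 0) exit
               m = s / n
               printf "n=%d min=%g max=%g mean=%g sd=%g\n",
                      n, min, max, m, sqrt(ss/n - m*m) }' data

Nothing beyond the running sums is kept in memory; it's the order statistics
(exact percentiles, medians) that force a second pass or an in-memory sort.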

------
lutusp
A quote: "... As if this wasn't enough, he [i.e. Tukey] also _invented_ what
is probably the most influential algorithm of all time." (emphasis added)

No, Tukey did not "invent" the FFT. He rediscovered it, as did a number of
others over the years since -- who else? -- Gauss first created it.

<http://en.wikipedia.org/wiki/Fast_Fourier_transform>

A quote: "This method (and the general idea of an FFT) was popularized by a
publication of J. W. Cooley and J. W. Tukey in 1965,[2] but it was later
discovered (Heideman & Burrus, 1984) that those two authors had independently
re-invented an algorithm known to Carl Friedrich Gauss around 1805 (and
subsequently rediscovered several times in limited forms)."

~~~
mpyne
It's true the writeup didn't mention Gauss, but the PDF that the author linked
_did_ mention that.

But either way, wouldn't the importance for the field of computing in this
case rest more on the _application_ of the algorithm than on who published
first? I.e. if no one knows about a computable algorithm that was used
sparingly 150 years earlier, then I don't think it's completely unfair to
give some credit to the one who later rediscovers and popularizes it.

According to the Wikipedia article, the Cooley-Tukey algorithm was an
independent rediscovery, so it's not as if Tukey had read Gauss and then
tried to steal the credit (it wasn't even noted until almost 20 years later
that Cooley-Tukey was a rediscovery of Gauss's algorithm).

It's almost unfair though... I think if we dove deeply into what
mathematicians like Gauss, Euler, Cauchy, etc. came up with, we might find
other "CS" algorithms that were invented hundreds of years before computers
were available to really popularize them. Every time I read about Euler and
Gauss especially, I end up even more impressed.

~~~
lutusp
> But either way wouldn't the importance for the field of computing in this
> case be more on the application of the algorithm and not who published
> first?

Separate issue. My objection was solely to correct the incorrect claim that
Tukey invented the FFT. And this is in no way meant to disparage what Tukey
did accomplish, only so that the history reads correctly.

> I don't think it's completely unfair to give some credit to one who later
> rediscovers and popularizes that algorithm.

Yes, but describing him as the inventor goes too far.

------
mpyne
I almost skipped this because I figured it would be another introductory
article on how to use bash and coreutils, but it was actually very good.

------
fcatalan
Hits close to home. I do a lot of data conversion, arrangement and
manipulation on the CLI. When some coworker inherits any of those tasks and I
explain how to do it, the answer tends to be "Aaaaallright, I'll use Excel".

------
piqufoh
Upvote for unix and "EDA is the lingua franca of data science". What you can
do and discard on the unix CLI takes many times longer on certain GUI-based
OSes.

------
nipunn1313
head -3 data* | cat has the same result as head -3 data*

A pipe sends stdout to the stdin of the next process, and cat copies its
stdin back to stdout. So piping to cat rarely changes anything (unless you
use a flag like cat -n).
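
A quick way to see both cases:

    seq 3            # prints 1, 2, 3 on separate lines
    seq 3 | cat      # same output: cat just copies stdin to stdout
    seq 3 | cat -n   # here the pipe matters: cat numbers each line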

~~~
pepve
Some tools adjust their output based on whether or not it is going to a
terminal. Try 'ls' versus 'ls | cat'.
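
Concretely (GNU and BSD ls both behave this way):

    ls          # stdout is a terminal: names in columns, colour if enabled
    ls | cat    # stdout is a pipe: one name per line, no colour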

~~~
mseebach
Indeed. I often use ps | cat to get the full command line for processes
(which is otherwise truncated). All hail Java command lines.

~~~
jcurbo
Try 'ps auxw' - the w enables wide output.

~~~
mlni
And for really wide commands (like java) you can add another w to get the
whole thing: ps auxww

------
ralph
He writes

        (head -5; tail -5) <data

but that's a bit misleading. These don't work:

        seq 20 | (head -5; tail -5)
        (head -5; tail -5) < <(seq 20)

Both give just the first five lines.

~~~
derekp7
Are you saying his example doesn't work, or that it can't be extended to your
two examples? Because when working on a file it works fine -- just not through
a pipe (which both your examples are).

        seq 1 20 >file.txt
        (head -5; tail -5) <file.txt

~~~
ralph
I'm complaining that his example doesn't work in the more general case and
that he should warn about this or give a better, more general solution. I'm
aware of why it doesn't work.
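
Roughly: the original form works on a regular file because head leaves the
shared file offset just past the lines it printed (it can seek a regular
file), so tail picks up from there; a pipe can't be repositioned, so tail
only sees whatever head's read-ahead left behind. A pipe-friendly sketch (not
from the article) keeps a small ring buffer in awk:

    # first five and last five lines of a stream (assumes >= 10 lines of input)
    seq 20 | awk '
        NR <= 5 { print }                 # pass the first five through
        { buf[NR % 5] = $0 }              # ring buffer of the most recent five
        END { for (i = (NR > 10 ? NR - 4 : 6); i <= NR; i++) print buf[i % 5] }'

This prints 1-5 and 16-20, and behaves the same whether the input is a file
or a pipe.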

------
keithpeter
rs and lam look interesting. Are these commands really only available on BSD
(i.e. 'proper' Unix derivatives)? Hoping for Linux-compilable code.
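
For anyone who hasn't met them, roughly what the BSD versions do (going by
the man pages rather than the article; the file names are placeholders):

    seq 6 | rs 2 3        # reshape the six input entries into 2 rows x 3 columns
    lam one.txt two.txt   # print corresponding lines of the files side by side

rs reshapes an array read from stdin; lam laminates files together, much like
paste.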

~~~
p4bl0
There is a "rs" package in Debian so it must be available on most Linux
distributions. Otherwise it should be quite straightforward to port the code
since it doesn't do anything complicated.

~~~
keithpeter
<http://svnweb.freebsd.org/base/release/9.1.0/usr.bin/lam/>

<http://svnweb.freebsd.org/base/release/9.1.0/usr.bin/rs/>

Not finding 'describe' at present (I shall persist). There seem to be few
dependencies in these source files...

~~~
unimpressive
<https://github.com/drbunsen/describe/>

It was a utility written by the author for his analysis work.

------
mturmon
"describe" is a nice idea. Just knowing the range, the mean, and the second
moment can be helpful.

------
baconhigh
I'd be interested in what Seth's setup/theme/OS of choice is... :)

