Hacker News
Explorations in Unix (drbunsen.org)
239 points by telemachos on Dec 3, 2012 | 33 comments


I use unix in the same way and for the same purposes described in the blog post, but I have come to the opinion that once you get into the describe-and-visualize phase, it's much easier to just drop into R. Reading in the kind of file being worked on here is often as simple as

    foo <- read.csv("foo.csv")

Getting summary descriptive statistics, item counts, scatter plots and histograms is often as easy as

    summary(foo)
    table(foo$col)
    plot(foo$xcol, foo$ycol)
    hist(foo$col)

I think that is a lot simpler than a 4- or 5-command pipeline that can be mistake-prone to edit when you want to change column names or things like that. I still do these kinds of things in the shell sometimes, and I don't know if I can put my finger on when exactly I would drop into R vs. write out a pipeline, but there IS a line somewhere...
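For comparison, the kind of pipeline I have in mind (the file name and column position here are just hypothetical) is something like:

    # count occurrences of values in the second CSV column, most frequent first
    cut -d, -f2 foo.csv | sort | uniq -c | sort -rn | head

Changing which column that operates on means editing the field number buried in the middle of the pipeline, which is exactly the mistake-prone part.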


I completely agree - while I love using the command-line for most tasks, R is ideally designed for this... and rightly so, since that's the whole point of the language!

R also supports the use of atomic operations on vectors (it automatically maps the operation over the vector), and other idioms that would be liabilities in other programming languages, but hugely beneficial in this one area.

I still haven't found a language that handles reading (and then processing) a CSV file more easily than R.

Conveniently, R should be really easy to pick up for someone familiar with working with a POSIX shell or Bash. For example, to see all variables defined in the current namespace, just type ls().

R is basically the POSIX mentality applied specifically to data processing, instead of general-purpose work.


One exception, though: piping unix commands this way allows handling of _very_ large files (stream model), whereas R requires loading the whole content into memory.


That depends on the statistics you're calculating. For sums, means, and even standard deviations, a single streaming pass is enough. For exact percentiles and other order statistics, you'll need to operate on the set as a whole, which either means re-reading the dataset or processing it in memory.

That said, I've got my own univariate statistics generator written in awk that can handle sets with millions of values readily on desktop/laptop hardware.
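A minimal single-pass sketch of the idea (count, min, max, mean, and a naive population standard deviation for a single column of numbers; not the actual script, and the file name is hypothetical) looks roughly like:

    # single streaming pass: no need to hold the whole dataset in memory
    awk '
        NR == 1 || $1 < min { min = $1 }
        NR == 1 || $1 > max { max = $1 }
        { sum += $1; sumsq += $1 * $1 }
        END {
            mean = sum / NR
            sd = sqrt(sumsq / NR - mean * mean)   # naive population std dev
            printf "n=%d min=%g max=%g mean=%g sd=%g\n", NR, min, max, mean, sd
        }' values.txt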


I'd disagree that R is trivial to pick up. I find a lot of impedance crossing tool boundaries, and much of R is written by (and for) advanced stats users, which makes trivial applications somewhat daunting.

Which isn't to say that it's not a phenomenally powerful and capable tool. I'm just referring to its accessibility.


Every developer is familiar with the shell and Unix tools; it's trivially simple to look up documentation and get something working really quickly with Unix utilities, even if you aren't familiar with the tool at the start. On the other hand, the idea of learning an entirely new programming language for quick-and-dirty trivial stuff is a little off-putting. That being said, I did not know how simple it is in R to play with data; I will have to check it out.


A quote: "... As if this wasn't enough, he [i.e. Tukey] also invented what is probably the most influential algorithm of all time." (emphasis added)

No, Tukey did not "invent" the FFT. He rediscovered it, as did a number of others over the years since -- who else? -- Gauss first created it.

http://en.wikipedia.org/wiki/Fast_Fourier_transform

A quote: "This method (and the general idea of an FFT) was popularized by a publication of J. W. Cooley and J. W. Tukey in 1965,[2] but it was later discovered (Heideman & Burrus, 1984) that those two authors had independently re-invented an algorithm known to Carl Friedrich Gauss around 1805 (and subsequently rediscovered several times in limited forms)."


It's true the writeup didn't mention Gauss, but the PDF that the author linked did mention that.

But either way wouldn't the importance for the field of computing in this case be more on the application of the algorithm and not who published first? I.e. if no one knows about a computable algorithm used sparingly 150 years earlier then I don't think it's completely unfair to give some credit to one who later rediscovers and popularizes that algorithm.

According to the Wikipedia article the Cooley-Tukey algorithm was an independent re-discovery so it's not as if Tukey had read Gauss and then tried to steal the credit (it wasn't even noted until almost 20 years later that Cooley-Tukey was a rediscovery of a Gaussian algorithm).

It's almost unfair though... I think if we dove deeply into what mathematicians like Gauss, Euler, Cauchy, etc. came up with, we'd find other "CS" algorithms that were invented hundreds of years before computers were available to really popularize them. Every time I read about Euler and Gauss especially, I end up even more impressed.


> But either way wouldn't the importance for the field of computing in this case be more on the application of the algorithm and not who published first?

Separate issue. My objection was solely to correct the incorrect claim that Tukey invented the FFT. And this is in no way meant to disparage what Tukey did accomplish, only so that the history reads correctly.

> I don't think it's completely unfair to give some credit to one who later rediscovers and popularizes that algorithm.

Yes, but describing him as the inventor goes too far.


I don't see how this correction adds to the discussion.

How are the actions of "invention" and "rediscovery" different? Is it less impressive that someone came up with a great idea simply because someone else did it first in a different context? Obviously Gauss should be celebrated too (and obviously is), but I don't see anything wrong with applauding Tukey either...


> How are the actions of "invention" and "rediscovery" different?

Simply put, the use of "invention" for a rediscovery deprives the idea's originator of any credit.

> I don't see anything wrong with applauding Tukey either...

My objection was only for misidentifying him as the inventor of the idea. Surely you are aware that in current practice only the originator of an idea gets credit for "inventing".


That's silly. The sentence clearly wasn't about "credit", it was about explaining what a cool guy Tukey was. I think his independent rediscovery of the factorization that allows the FFT to be computed (that is, the thing we commonly think of as "the FFT algorithm" -- for obvious reasons Gauss didn't identify the application correctly) absolutely qualifies as evidence to that effect.

If anything, your dismissal of the guy's work is the real crime here. What's your problem with Tukey, exactly?


This is an ad hominem attack that is senseless here. Tukey was a cool guy; "his independent rediscovery of the factorization that allows the FFT to be computed absolutely qualifies as evidence to that effect." Yes, you are absolutely right. But that's not what the article said; that's what lutusp said.


> If anything, your dismissal of the guy's work ...

Before going on, locate where I did any such thing.


Here's substantiation of this history:

http://www.cis.rit.edu/class/simg716/Gauss_History_FFT.pdf

Besides Gauss, many others including Runge (yes, that Runge) and Burkhardt (yes, the one on Einstein's committee) independently discovered the FFT well before the 1950s. Like so much of Gauss's work, his work on the FFT was unpublished during his lifetime.

Probably it was the conjunction of the algorithm and the emerging power of the digital computer that caused the Cooley-Tukey paper to take off at that historical moment.


I almost skipped this because I figured it would be another introductory article on how to use bash and coreutils, but this was actually very good.


Hits close to home. I do a lot of data conversion, arrangement and manipulation on the CLI. When some coworker inherits any of those tasks and I explain how to do it, the answer tends to be "Aaaaallright, I'll use Excel".


Upvote for unix and "EDA is the lingua franca of data science". What you can do and discard on the unix CLI takes many times longer on certain GUI-based OSes.


'head -3 data* | cat' has the same result as 'head -3 data*'

Pipe sends stdout to stdin of the next process. cat sends stdin back to stdout. Piping to cat is rarely eventful (unless you use a flag like cat -n).


Some tools adjust their output based on it going to a terminal or not. Try 'ls' versus 'ls | cat'.
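You can do the same check in your own scripts; a small sketch:

    # [ -t 1 ] is true only when stdout is a terminal, not a pipe or file
    if [ -t 1 ]; then
        echo "interactive: columns and colors make sense"
    else
        echo "piped: one plain entry per line is friendlier"
    fi

That's essentially the test ls performs (via isatty) before picking its default output format.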


Indeed. I often use ps | cat to get the full command line for processes (that's otherwise truncated). All hail Java command lines.


Try 'ps auxw' - the w enables wide output.


And for really wide commands (like java) you can add another w to get the whole thing: ps auxww


He writes

    (head -5; tail -5) <data
but that's a bit misleading. These don't work.

    seq 20 | (head -5; tail -5)
    (head -5; tail -5) < <(seq 20)
Both give just the first five lines.


For those who don't follow why this is: in the original example, the file data is opened as stdin and so remains seekable. This is important since head buffers reads (typically 8k at a time) for efficiency, then uses lseek to reposition to just after the requested number of lines. If stdin is a pipe, as in the two examples above, lseek() fails, and by the time tail runs, head has consumed all of the input.

If you use "seq 10000 | (head -5; tail -5)" you'll get the first and last lines as expected since head hasn't consumed too much of the file.

I don't think this invalidates his example, but it could mention this subtle caveat. :-)


Are you saying his example doesn't work, or that it can't be extended to your two examples? Because when working on a file it works fine -- just not through a pipe (which both your examples are).

    seq 1 20 >file.txt
    (head -5; tail -5) <file.txt


I'm pointing out that his example doesn't work in the more general case and that he should warn of this or give a better, more general solution. I'm aware of why it doesn't work.
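For the record, one pipe-safe way to get the same effect (a sketch that buffers only the last n lines in awk instead of relying on head being able to seek back):

    seq 20 | awk -v n=5 '
        NR <= n { print }                 # emit the first n lines as they arrive
        { buf[NR % n] = $0 }              # keep a ring buffer of the last n lines
        END {
            start = NR - n + 1
            if (start <= n) start = n + 1 # do not reprint lines already shown
            for (i = start; i <= NR; i++) print buf[i % n]
        }'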


rs and lam look interesting. Are these commands really only available on BSD (i.e. 'proper' Unix derivatives)? Hoping for Linux compilable code.


There is a "rs" package in Debian so it must be available on most Linux distributions. Otherwise it should be quite straightforward to port the code since it doesn't do anything complicated.


http://svnweb.freebsd.org/base/release/9.1.0/usr.bin/lam/

http://svnweb.freebsd.org/base/release/9.1.0/usr.bin/rs/

Not finding 'describe' at present (I shall persist). There seem to be few dependencies in these source files...


https://github.com/drbunsen/describe/

It was a utility written by the author for his analysis work.


"describe" is a nice idea. Just knowing the range, the mean, and the second moment can be helpful.


I'd be interested in what Seth's setup/theme/os of choice is .. :)



