

Unix Command Lines and Relations - ludsan
http://merrigrove.blogspot.com/2013/12/the-occultation-of-relations-and-logic_22.html

======
ivanhoe
Just for the record, the initially proposed solution is not really an optimal
way to do it, especially if the log files are huge. Using cat pushes the whole
log file through an extra process and pipe for no benefit:
`grep GET /var/log/nginx-access.log` produces exactly the same output, only
faster.
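
Side by side (both print the same lines; the second just skips a process and a
pipe):

    cat /var/log/nginx-access.log | grep GET    # useless use of cat
    grep GET /var/log/nginx-access.log          # same output, one process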

------
collyw
Interesting to see him describe everything in terms of a database. I sometimes
think I am too stuck in a relational mindset to understand the reasons for
things like MongoDB. Then other times I think the NoSQL movement is a huge step
backwards, and > 90% of people are using it for the wrong reasons.

------
agumonkey
A similar article about relational 'thinking' with bash scripts:
[http://matt.might.net/articles/sql-in-the-shell/](http://matt.might.net/articles/sql-in-the-shell/)

I like it when people shed a "new" light on seemingly simple tools.

------
dredmorbius
As much as I'd like to read this article, Blogspot's themes are so
aggressively reader-hostile that I simply cannot.

Please, people, if you're using Blogspot, stop. Or at the very least, avoid
the craptacular "dynamic" themes like the pox they are.

~~~
gaving
The broken back button behaviour is a nice touch, too.

------
AlexanderDhoore
I'm working on a set of tools to do functional programming in Bash (using
JSON). I have a pretty good idea of how it's going to work; I just need to
implement it. Unix pipes are function application, JSON provides the data
structures, and I'll have message-passing concurrency like Erlang.

If only I didn't have exams! (Been working on it 50% of the time though.)
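
To sketch the shape of the idea with an existing tool (jq standing in for my
tools here, and the data invented for the example): each pipeline stage is a
JSON-to-JSON function, and `|` is composition.

    # double every number, then sum the results
    echo '{"nums": [1, 2, 3]}' \
        | jq '.nums | map(. * 2)' \
        | jq 'add'
    # prints 12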

~~~
bulatb
Have you seen jq? It's supposed to be like sed for JSON.
[http://stedolan.github.io/jq/](http://stedolan.github.io/jq/)
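
For a taste of the "sed for JSON" feel (made-up data), a filter expression
pulls a value out of a JSON stream much as sed pulls a match out of text:

    echo '{"name": "stedolan", "project": "jq"}' | jq '.project'
    # "jq"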

~~~
AlexanderDhoore
I have. I got the idea in the first place because I was trying out jq. Small,
powerful tool that.

------
rosser
TFA should be nominated for a Useless Use of Cat Award [0], particularly since
the dissected command apparently "showed the power of true Unix mastery".

[0]
[http://partmaps.org/era/unix/award.html](http://partmaps.org/era/unix/award.html)

~~~
matdes
I agree, but I think the intent was for each piped command to perform a single
relational operation.

~~~
rosser
The shell command

    cat file | grep 'expression'

is the shell analogue of the SQL query

    select * from (select * from table) as q where column like '%expression%'

...except that a decent query optimizer will collapse the extraneous inner
query. Bash doesn't know how to do that.
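
Collapsed, the query would be just `select * from table where column like
'%expression%'`, and the shell analogue of that is simply:

    grep 'expression' file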

It's also more expensive:

    # wc -l /var/log/secure
    97845 /var/log/secure
    # time cat /var/log/secure | grep root > /dev/null

    real    0m1.600s
    user    0m1.517s
    sys     0m0.294s

    # time grep root /var/log/secure > /dev/null

    real    0m1.275s
    user    0m1.237s
    sys     0m0.036s

That's an extra 0.3-and-change seconds, or _over 25% longer_, on a 98k-line
file. Scale that up to a multi-million-line nginx log file, and I'd actually
say it's a _worse than useless_ use of cat.

~~~
NyxWulf
wc -l is faster; however, it only works on an uncompressed file. The pipeline
in the article was doing much more complicated work on the stream of output. I
will often separate my pipeline logic from my "read this data and feed it into
the pipeline" logic, since the logic around what to read and where to write it
can often change even when the same logical pipeline is reused.

As a simple example:

    cat somefiles.txt | pipeline_script > stored_output.txt

Other times I'm reading gzipped files or remote log files, and I don't want
that reading logic mixed in with the pipeline logic. If I want to move the
output files to a different directory on one set of servers, that change may
not impact my generalized pipeline.
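
For instance, only the producer end changes across sources (zcat and ssh are
standard tools; the filenames and host are made up, and pipeline_script is the
hypothetical filter from above):

    cat access.log           | pipeline_script > out.txt    # local file
    zcat access.log.gz       | pipeline_script > out.txt    # gzipped file
    ssh web01 cat access.log | pipeline_script > out.txt    # remote file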

I work with trillions of lines of log files, and there are many ways to scale
up pipelines. I wouldn't start optimizing the difference between 1.6s and
1.275s unless it made an economic impact on the problem. How often is it run?
Can you process more lines per economic unit of time with the faster version?
If this is a job that runs once an hour, or even once a minute, how many lines
are collected during that time?

Intermixing the grep or wc -l logic with the pipeline logic can have adverse
support and maintenance costs; it often saves machine time while spending
programmer time. Which one dominates in this scenario?

It's like the old saying...there is more than one way to skin a "cat". :)

