Unix Command Lines and Relations (merrigrove.blogspot.com)
57 points by ludsan on Jan 18, 2014 | hide | past | favorite | 16 comments


I'm working on a set of tools to do functional programming in Bash (using JSON). I have a pretty good idea of how it's going to be, just need to implement it. Unix pipes are the function application. Json is data structures. And I'll have message passing concurrency like Erlang.
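The pipes-as-function-application idea is easy to sketch (the function names below are made up for illustration):

```shell
# Each shell function is a stream transformer; piping composes them,
# so `double | inc` behaves like inc(double(x)).
double() { awk '{print $1 * 2}'; }
inc()    { awk '{print $1 + 1}'; }

echo 5 | double | inc   # prints 11
```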

If only I didn't have exams! (Been working on it 50% of the time though.)


Have you seen jq? It's supposed to be like sed for JSON. http://stedolan.github.io/jq/
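For the unfamiliar, a tiny taste of what it does (the sample JSON is invented):

```shell
# jq filters a stream of JSON values much as sed filters lines of text:
# the filter '.user' projects one field out of each input object,
# and -r prints the raw string instead of a quoted JSON value.
echo '{"user":"alice","age":30}' | jq -r '.user'   # prints: alice
```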


I have. I got the idea in the first place because I was trying out jq. Small, powerful tool that.


A similar article about relational 'thinking' with bash scripts http://matt.might.net/articles/sql-in-the-shell/

I like it when people shed a "new" light on seemingly simple tools.
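The rough correspondences that article draws can be sketched on a toy "table" (the file name and columns here are invented):

```shell
# A whitespace-delimited file treated as a relation: (name, age)
printf 'alice 30\nbob 25\nalice 41\n' > people.txt

awk '{print $1}' people.txt                   # SELECT name FROM people
grep '^alice ' people.txt                     # ...WHERE name = 'alice'
awk '{print $1}' people.txt | sort | uniq -c  # GROUP BY name with COUNT(*)
sort -n -k2 people.txt                        # ORDER BY age
```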


As much as I'd like to read this article, Blogspot's themes are so aggressively reader-hostile that I simply cannot.

Please, people, if you're using Blogspot, stop. Or at the very least, avoid the craptacular "dynamic" themes like the pox they are.


The broken back button behaviour is a nice touch, too.


Just for the record, the initially proposed solution is not really an optimal way to do it, especially if the log files are huge. Using cat pushes the whole log file to stdout first, which isn't needed: `grep GET /var/log/nginx-access.log` will do exactly the same thing, but way faster.


Interesting to see him describe everything in terms of a database. I sometimes think I am too stuck in a relational mindset to understand the reasons for things like MongoDB. Then other times I think the NoSQL movement is a huge step backwards, and > 90% of people are using it for the wrong reasons.


TFA should be nominated for a Useless Use of Cat Award [0], particularly since the dissected command apparently "showed the power of true Unix mastery".

[0] http://partmaps.org/era/unix/award.html


I agree, but I think it was intended to have each piped command perform a single relational operation.


The shell command

  cat file | grep 'expression'
is the shell analogue of the SQL query

  select * from (select * from table) as q where column like '%expression%'
...except that a decent query optimizer will collapse the extraneous inner query. Bash doesn't know how to do that.

It's also more expensive:

  # wc -l /var/log/secure
  97845 /var/log/secure
  # time cat /var/log/secure | grep root > /dev/null

  real    0m1.600s
  user    0m1.517s
  sys     0m0.294s  
  # time grep root /var/log/secure > /dev/null

  real    0m1.275s
  user    0m1.237s
  sys     0m0.036s
That's an extra .3-and-change seconds, or over 25% longer, on a 98k-line file. Scale that up to a multi-million-line nginx log file, and I'd actually say it's a worse-than-useless use of cat.


wc -l is faster; however, it only works on an uncompressed file. The pipeline in the article was doing much more complicated work on the stream of output. I will often separate my pipeline logic from my "read this data and feed it into the pipeline" logic, since the logic around what to read in and where to write it out can often change even when using the same logical pipeline.

As a simple example:

  cat somefiles.txt | pipeline_script > stored_output.txt

Other times I'm reading gzipped files or remote log files, and I don't want that reading logic mixed in with the pipeline logic. If I want to move the output files to a different directory on one set of servers, that needn't impact my generalized pipeline.
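A sketch of that separation (the pipeline body and file names here are invented; only the structure matters):

```shell
# Keep "how the data gets in" separate from "what the pipeline does".
# Swapping cat for zcat, or for a remote reader, touches one line only.
pipeline() { grep GET | awk '{print $1}' | sort | uniq -c | sort -rn; }

# Stand-in sample data so the example is runnable as-is:
printf '1.2.3.4 GET /a\n1.2.3.4 GET /b\n5.6.7.8 POST /c\n' > access.log

cat access.log            | pipeline    # local plain file
# zcat access.log.gz      | pipeline    # compressed file
# ssh web1 cat access.log | pipeline    # remote file
```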

I work with trillions of lines of log files and there are many ways to scale up pipelines. I wouldn't start optimizing the difference between 1.6s and 1.275s unless it made an economic impact on the problem. How often is it run? Can you process more lines in an economic unit of time with the faster version? If this is a job that runs once an hour or even once a minute how many lines are collected during that time?

Intermixing the grep or wc -l logic into the pipeline logic can have adverse support and maintenance costs and many times saves machine time while spending programmer time. Which one dominates in this scenario?

It's like the old saying...there is more than one way to skin a "cat". :)


over 25% longer

You're assuming a constant factor here. That might be the case. Or it could be one-time start-up overhead as the linker finds and opens library files.

Unless you profile the process over a range of input sizes, you don't know which you're observing. And you know what they say about premature optimization.
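One way to check, sketched here with generated input (the sizes and log line are made up; the times printed are the point, not asserted):

```shell
# Time both forms over a range of input sizes. If the gap grows with n,
# it's a per-line cost; if it stays flat, it's fixed start-up overhead.
f=$(mktemp)
for n in 10000 100000 1000000; do
  yes 'Jan 18 host sshd[123]: Failed password for root' | head -n "$n" > "$f"
  echo "== $n lines =="
  time cat "$f" | grep root > /dev/null
  time grep root "$f" > /dev/null
done
rm -f "$f"
```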


Is anyone using this particular blog's interpretation as production code?

No. And if you actually go to the link to the original question, you'll see that cat doesn't exist.

Last, I would disagree that `cat file` is equivalent to `select * from table`. His argument makes the comparison that `cat file` is equivalent to the `table` itself, or to `load data into table`, which needs to happen before any relation can be performed against it.


Yes, if piping cat into grep into awk into grep again is a Jedi-like trick, it's no wonder the Republic collapsed.

You could condense the first three stages:

  cat /var/log/nginx-access.log | grep "GET" | awk -F'"' '{print $6}'

down into:

  awk -F'"' '/GET/{print $6}' /var/log/nginx-access.log

Or with the fourth stage (cut -d" " -f1):

  awk -F'"' '/GET/{split($6,a," ");print a[1]}' /var/log/nginx-access.log

Add the fifth stage (grep -E "^[[:alnum:]]") by doing:

  awk -F'"' '/GET/{split($6,a," ");if(a[1]~/^[[:alnum:]]/){print a[1]}}' /var/log/nginx-access.log

And the 6th and 7th (sort | uniq -c):

  awk -F'"' '/GET/{split($6,a," ");if(a[1]~/^[[:alnum:]]/){b[a[1]]++}}END{for(i in b){print b[i] " " i}}' /var/log/nginx-access.log | sort -rn

Which is still just about a one-liner and a lot faster than the original. Actually, you could make it shorter, even faster, and more awky by tinkering with the field separator to get rid of the split() and the if:

  awk -F'[[:space:]]|"' '/GET/ && $17~/^[[:alnum:]]/{a[$17]++}END{for(i in a){print a[i] " " i}}' /var/log/nginx-access.log | sort -rn

And then you could replace the last sort with gawk's builtin asort() function, but that's left as an exercise for the student ;)

But why learn the basics when you can do Big Data and be buzzword-compliant instead.


Nice straw-man. My contention is simply that knowing better than to "cat | grep" is the basics.



