I'm working on a set of tools to do functional programming in Bash (using JSON). I have a pretty good idea of how it's going to work; I just need to implement it. Unix pipes are the function application, JSON is the data structures, and I'll have message-passing concurrency like Erlang.
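A minimal sketch of the shape of it, using jq as a stand-in for the (as yet unwritten) toolset's primitives:

```shell
# Pipes as function application over a stream of JSON values;
# jq here stands in for the planned toolset's map/filter primitives.
seq 1 5 |
  jq -c '{n: .}' |              # construct: wrap each number in an object
  jq -c 'select(.n % 2 == 1)' | # filter: keep odd values
  jq -c '.n * 10'               # map: project and transform
```

Each stage is a pure function from a JSON stream to a JSON stream, so stages compose with `|` just like function composition.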
If only I didn't have exams! (Been working on it 50% of the time though.)
Just for the record, the initially proposed solution is not really an optimal way to do it, especially if the log files are huge. Using cat pushes the whole log file to stdout first, which isn't needed, because `grep GET /var/log/nginx-access.log` will do exactly the same thing, but way faster.
Interesting to see him describe everything in terms of a database. I sometimes think I am too stuck in a relational mindset to understand the reasons for things like MongoDB. Then other times I think the NoSQL movement is a huge step backwards, and >90% of people are using it for the wrong reasons.
TFA should be nominated for a Useless Use of Cat Award [0], particularly since the dissected command apparently "showed the power of true Unix mastery".
`select * from (select * from table) as q where column like '%expression%'`
...except that a decent query optimizer will collapse the extraneous inner query. Bash doesn't know how to do that.
It's also more expensive:
# wc -l /var/log/secure
97845 /var/log/secure
# time cat /var/log/secure | grep root > /dev/null
real 0m1.600s
user 0m1.517s
sys 0m0.294s
# time grep root /var/log/secure > /dev/null
real 0m1.275s
user 0m1.237s
sys 0m0.036s
That's an extra 0.3-and-change seconds, or over 25% longer, on a 98k-line file. Scale that up to a multi-million-line nginx log file, and I'd actually say it's a worse-than-useless use of cat.
wc -l is faster; however, it only works on an uncompressed file. The pipeline in the article was doing much more complicated work on the stream of output. I will often separate my pipeline logic from my "read this data and feed it into the pipeline" logic, since the logic around what to read in and where to write it can often change even when the logical pipeline stays the same.
As a simple example:
cat somefiles.txt | pipeline_script > stored_output.txt
Other times I'm reading gzipped files or remote log files, and I don't want that logic mixed in with the pipeline logic. If I want to move the output files to a different directory on one set of servers, that shouldn't have to touch my generalized pipeline.
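For instance, with a stand-in pipeline (the real `pipeline_script` isn't shown here, so this one just counts GET requests), only the reader stage changes between plain and gzipped input:

```shell
# Hypothetical stand-in for pipeline_script: count GET requests on stdin.
pipeline_script() { grep -c '"GET'; }

# Made-up sample log, plus a gzipped copy of it.
printf '%s\n' '"GET /a"' '"POST /b"' '"GET /c"' > access.log
gzip -kf access.log

cat  access.log    | pipeline_script   # plain file -> 2
zcat access.log.gz | pipeline_script   # gzipped file, same pipeline -> 2
```

Swapping `cat` for `zcat` (or `ssh host cat ...`) changes nothing downstream.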
I work with trillions of lines of log files and there are many ways to scale up pipelines. I wouldn't start optimizing the difference between 1.6s and 1.275s unless it made an economic impact on the problem. How often is it run? Can you process more lines in an economic unit of time with the faster version? If this is a job that runs once an hour or even once a minute how many lines are collected during that time?
Intermixing the grep or wc -l logic into the pipeline logic can have adverse support and maintenance costs; it often saves machine time while spending programmer time. Which one dominates in this scenario?
It's like the old saying...there is more than one way to skin a "cat". :)
You're assuming a constant factor here. That might be the case. Or it could be one-time start-up overhead as the linker finds and opens library files.
Unless you profile the process over a range of input sizes, you don't know which you're observing. And you know what they say about premature optimization.
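One rough way to tell the two apart, sketched with synthetic files (the sizes and line contents are made up): time both variants over a range of input sizes. If the gap grows with the input, it's a per-line cost; if it stays flat, it's one-time startup overhead.

```shell
# Generate inputs of increasing size and time both variants on each.
for n in 1000 10000 100000; do
  seq "$n" | sed 's/$/ root login/' > "sample_$n.log"
  echo "== $n lines =="
  time grep root "sample_$n.log"       > /dev/null
  time cat "sample_$n.log" | grep root > /dev/null
done
```

(`time` here is the bash keyword, which times the whole pipeline, not just the first command.)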
Is anyone using this particular blog's interpretation as production code?
No. And if you actually go to the link to the original question, you'll see that cat doesn't exist.
Lastly, I would disagree that `cat file` is equivalent to `select * from table`. A closer comparison is that `cat file` is equivalent to the `table` itself, or to `load data into table`, which needs to happen before any relational operation can be performed against it.
Which is still just about a one-liner and a lot faster than the original.
Actually you could make it shorter, even faster, and more awk-y by tinkering with the field separator to get rid of the split() and the if:
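The article's exact command isn't quoted in this thread, so here's a generic illustration of the trick on a made-up access-log line: with default whitespace splitting you need split() plus a condition on the method, but folding "?" into the field separator makes the bare path a field of its own.

```shell
# Made-up nginx-style log lines for illustration.
printf '%s\n' \
  '1.2.3.4 - - [01/Jan/2024] "GET /index.html?x=1 HTTP/1.1" 200 512' \
  '1.2.3.4 - - [01/Jan/2024] "POST /submit HTTP/1.1" 200 64' > access.log

# Verbose: split the request path on "?" inside the program body.
awk '$5 == "\"GET" { split($6, a, "?"); print a[1] }' access.log

# awk-ier: make space-or-"?" the separator, so $6 is already the bare path.
awk -F'[ ?]' '$5 == "\"GET" { print $6 }' access.log
```

Both print `/index.html`; the second does it without split() or an explicit if body (the method test moves into the pattern).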