
Benchmarking shell pipelines and the Unix “tools” philosophy - weinzierl
https://blog.plover.com/Unix/tools.html
======
tuldia
Thanks for this!

Another nice thing about /usr/bin/time is the --verbose flag which gives:

    
    
      Command being timed: "ls"
      User time (seconds): 0.00
      System time (seconds): 0.00
      Percent of CPU this job got: 0%
      Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
      Average shared text size (kbytes): 0
      Average unshared data size (kbytes): 0
      Average stack size (kbytes): 0
      Average total size (kbytes): 0
      Maximum resident set size (kbytes): 1912
      Average resident set size (kbytes): 0
      Major (requiring I/O) page faults: 0
      Minor (reclaiming a frame) page faults: 112
      Voluntary context switches: 1
      Involuntary context switches: 1
      Swaps: 0
      File system inputs: 0
      File system outputs: 0
      Socket messages sent: 0
      Socket messages received: 0
      Signals delivered: 0
      Page size (bytes): 4096
      Exit status: 0
    
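
GNU time also takes a custom format string via -f if you only want a couple
of these fields; a small example (escapes per the GNU time manual: %e is
elapsed seconds, %M the maximum resident set size in kilobytes):

    
    
      /usr/bin/time -f 'elapsed: %e s, max RSS: %M KB' ls
    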

:)

~~~
boyter
Can anyone comment on why you can only use the verbose flag if you use the
full path of time?

    
    
        time -v ls
    

does not work but

    
    
        /usr/bin/time -v ls
    

does? I don't have enough knowledge of either Linux applications or bash to
know what's happening to cause this.

~~~
dtwwtd
This is very likely because, without the full path, your shell is using its
`time` builtin as opposed to the binary.

The shell's builtin `time` keyword is more limited than the full `time`
binary. This is true of a number of other common Unix commands as well,
e.g. `echo`. The manpage for your shell should describe its builtin
functions.
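
You can see both from the shell itself. In bash, for example (assuming the
binary is installed; on some distros it ships as a separate package):

    
    
      $ type -a time
      time is a shell keyword
      time is /usr/bin/time
      $ command time -v ls
    

Because `time` is a keyword rather than a true builtin, prefixing it with
`command` (or even quoting it as `\time`) makes bash skip the keyword and
run the binary from $PATH, no full path needed.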

------
Neil44
I got excited when I saw the 'f' and 'count' commands, but they're just
scripts he has on his system, like doing grep 'plover' blah.log | cut -d ' '
-f 11 | sort | uniq -c | sort -n. Personally I'd prefer to use ubiquitous
commands that work everywhere rather than rely on custom scripts on my
system, but they are nice.

~~~
juped
Most people who use Unix directly build up some stuff in ~/bin (often a
misnomer because it's shell scripts and not binaries, although mine is less of
a misnomer than most because so much is in C rather than shell). The trick is
to build them _out of_ the standard portable components that exist everywhere.
(This means, among other things, no #!/bin/bash.)

~~~
tuldia
sed 's| no | not only |'

~~~
tingletech
/bin/bash won't usually ship with a BSD-ish OS because of the license, so it is
not generally portable to use bash-isms. (I don't reckon HPUX, IRIX, SunOS,
Solaris, etc. would have had bash either.)

~~~
Galanwe
Not to mention once installed on a BSD it would most likely reside in
/usr/bin/bash (OpenBSD for instance)

~~~
cperciva
Or /usr/local/bin/bash (on FreeBSD).
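
One common workaround for the varying location is to resolve bash through
$PATH rather than hardcoding it:

    
    
      #!/usr/bin/env bash
    

though that still assumes bash is installed at all, which is the original
portability objection.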

------
skywhopper
"What if Unix had less compositionality but I could use it with less memorized
trivia? Would that be an improvement? I don't know."

The answer is "no" here, because the alternative doesn't exist. Could it be
created? Maybe in theory, but I suspect that the amount of stuff that you'd
need to memorize (or learn to look up) to use it effectively would be about
the same for any system that allowed a similar variety of work to be
accomplished. If you are willing to trade off functionality for simplicity,
then sure, it can be done. You can get it today by just not using all these
tools at all, I suppose.

~~~
wahern
There would be less trivia to memorize if the command behaviors and options
were more consistent. You may not be able to achieve that at the edges, where
new commands and options are added, but you can always go back and clean
things up.

For example, the cut(1) command is intended to do precisely what his f script
does. But it's inconvenient because, unlike many other commands, it (1) doesn't
obey $IFS and (2) its -d delimiter option only takes a single character. This
could and should be remedied with a new, simple option.
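
The difference shows up on runs of whitespace; a quick illustration:

    
    
      $ echo 'a  b' | cut -d ' ' -f 2    # two spaces: field 2 is empty
      
      $ echo 'a  b' | awk '{print $2}'   # awk collapses the run
      b
    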

I think the only thing preventing that change is that there's not enough
interest in moving POSIX forward faster; certainly not like JavaScript.

Another problem is GNU tools. They have many great features, but _OMG_ are
they a nightmare of inconsistency. BSD extensions tend to be much better
thought through, perhaps because GNU tools tend to be led by a single
developer while BSD tools tend to be more team-oriented.

So the way forward isn't to replace the organic evolution, it's to layer on
processes that refine the proven extensions. And we already have some of those
processes in place; we just need to imbue them with more authority, and that
starts by not rolling our eyes at standardization and portability.

~~~
vaingloriole
Authority is the problem, not standardization and portability. Everyone is
willing and able to tell you the best way to do your work if you use their
tools. Straitjacketing implementation in the name of order is a surefire way
to dissuade people from using your tools.

------
justinsaccount
'sort | uniq -c | sort -n' is an interesting pipeline. It will always work, and
it does a great job with high-cardinality data on low-memory systems.

However, if you have the RAM, or know the data set has low cardinality
(like HTTP status codes or filenames, as opposed to IP addresses), then
something that works in memory will be much more efficient.

I threw 144,000,000 lines of 'hello' and 'world' into a file:

    
    
      justin@box:~$ ls -lh words
      -rw-r--r-- 1 justin justin 824M Jan  7 15:21 words
      justin@box:~$ wc -l words
      144000000 words
    
    
      justin@box:~$ time (sort <words|uniq  -c)
      72000000 hello
      72000000 world
    
      real 0m22.831s
      user 0m32.999s
      sys 0m4.675s
    

Compared to doing it in memory with awk:

    
    
      justin@box:~$ time awk '{words[$1]++} END {for (w in words) printf("%s %d\n", w, words[w])}' < words
      hello 72000000
      world 72000000
    
      real 0m10.639s
      user 0m9.736s
      sys 0m0.876s
    

So: half the time and a third of the CPU.
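
If you do want to stay with the sort pipeline, GNU sort also takes a bigger
in-memory buffer via -S/--buffer-size, which avoids spilling to temp files
on inputs of this size; something like the following may narrow the gap,
though the hash table still does fundamentally less work:

    
    
      time (sort -S 2G <words | uniq -c)
    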

~~~
crystaldev
All of your examples work in memory.

~~~
justinsaccount
Not exactly. sort (at least GNU sort) will end up doing external merge sort on
temporary files if you give it more data than you have memory. Which, if you
give it 100GB of 5 different strings, ends up being a huge waste.

~~~
tuldia
Not only GNU sort, but also PostgreSQL, MySQL, and many more...

Please, "huge waste"? How do you sort something that does not fit in memory?

~~~
justinsaccount
Are you being difficult on purpose?

I posted a comment on how 'sort | uniq -c | sort -n' is an interesting and
very capable pipeline, but often misused and slower than other alternatives.

> you are comparing

Yes, I am comparing two methods of accomplishing the same thing. That is how
comparing things works.

> Please, "huge waste"? How do you sort something that does not fit in memory?

Note how the full sentence included "if you give it 100GB of 5 different
strings". If your input is 100GB of 5 different strings, then the hash table
will easily fit in memory, and sorting the entire data set only to pass it to
'uniq -c' is indeed a 'huge waste'.

There are tons of large data sets that only have a small number of unique
values in particular fields: protocols, ports, HTTP status codes, hour of the
day, etc. 'sort | uniq -c | sort -n' will work for all of them, just nowhere
near as efficiently as a hash table.

~~~
tuldia
> Are you being difficult on purpose?

Programming is about paying at least a _bare minimum_ of attention to the details.

> [...] two methods of accomplishing the same thing [...]

Absolutely not.

one prints:

    
    
      72000000 hello
      72000000 world
    

the other:

    
    
      hello 72000000
      world 72000000
    

Now try both examples against a file with more than one column to understand
what I'm talking about ;)
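
For completeness, the awk version can be made to match: key on the whole
line instead of the first field and mimic uniq -c's formatting. A sketch:

    
    
      awk '{c[$0]++} END {for (l in c) printf("%7d %s\n", c[l], l)}' <words | sort -n
    

Keying on $0 counts entire lines, which is exactly what 'sort | uniq -c'
does, so multi-column files come out the same.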

------
3xblah

             for i in $(seq 1 20); do
               (run once and emit the total CPU time)
             done |addup
    

"Here we don't actually care about the output (we never actually use $i) but
it's a convenient way to get the for loop to run twenty times."

This is slower than not running seq and just using builtins.

    
    
             n=1
             while [ "$n" -le 20 ]; do
               (run once and emit the total CPU time)
               n=$((n+1))
             done |addup

~~~
deathanatos
The author did say _convenient_ , not fast.

If you don't want the inefficiencies of seq, bash has:

    
    
      for (( expr1 ; expr2 ; expr3 )) ; do list ; done
    

which is a lot more idiomatic than constructing a for loop out of a while
loop.
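
For the article's twenty-iteration case that would be:

    
    
      for ((i = 1; i <= 20; i++)); do
        (run once and emit the total CPU time)
      done |addup
    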

~~~
pimlottc
You could also just use a bash brace expansion:

    
    
        for i in {1..20}
        do
          ...
        done
    

[https://wiki.bash-hackers.org/syntax/expansion/brace](https://wiki.bash-hackers.org/syntax/expansion/brace)
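
One caveat: brace expansion happens before parameter expansion, so the
bounds can't come from variables; that's where seq (or the C-style for
loop) still earns its keep:

    
    
        n=20
        for i in {1..$n}; do echo "$i"; done        # runs once, with the literal string {1..20}
        for i in $(seq 1 "$n"); do echo "$i"; done  # works as intended
    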

------
madacol
I hadn't been able to open this site for days (timeout error; I thought it was
down), until today, when I decided to use Tor.

------
perfunctory
> The appearance of the TIME=… assignment at the start of the shell command
> disabled the shell's special builtin treatment of the keyword time, so it
> really did use /usr/bin/time. This computer stuff is amazingly complicated.
> I don't know how anyone gets anything done.

Indeed.
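
For anyone who wants to reproduce the trick from the quoted passage: a
leading assignment makes the line an ordinary simple command, so bash no
longer recognizes its reserved word, and GNU time, which uses the TIME
environment variable as its format string when -f isn't given, runs
instead:

    
    
      $ time ls >/dev/null                    # bash keyword
      $ TIME='%e seconds' time ls >/dev/null  # /usr/bin/time, format from $TIME
    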

------
m463
the ACTUAL benchmark should be:

1) start timer

2) start deciding which commands to pipeline together

3) run the commands

4) stop timer

a lot of times the decision is the long pole.

in this author's case it included:

5) try a couple more variants of steps 2 and 3

6) write a blog post

:)

~~~
CGamesPlay
This is a good point about the second half of the article (compositionality),
but the author started the article by saying this was a command he "sometimes
runs", presumably indicating he has it saved somewhere.

~~~
m463
My commands of this type are usually retrieved using control-r.

