Pipes and Filters (petersobot.com)
223 points by rbc on Sept 16, 2014 | 47 comments

Some random thoughts:

• Though the first pipeline is didactic, it can be done entirely within awk:

    awk '
    BEGIN { l=0 }
    /purple/ {
        if(length($1) >= l) { word = $1; l = length($1) }
    }
    END { print word }' < /usr/share/dict/words
• Named pipes are neat, but you can also use subshells and additional FDs (I am in no way arguing this is more clear):

    (
      (
        (
          echo out
          echo err >&2
        ) | while read out; do echo "out: $out"; done >&3
      ) 2>&1 | while read err; do echo "err: $err"; done
    ) 3>&1

• Bash has "set -o pipefail" for cases where you want any process in the pipeline that exits non-zero to cause the entire pipeline to exit non-zero.

Error detection is much easier with pipefail:


"If pipefail is enabled, the pipeline’s return status is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands exit successfully"

Pipefail is awesome - I had no idea this even existed. Thanks!
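
To make the quoted rule concrete, here is a minimal sketch. The exit codes 3 and 5 are arbitrary values picked for illustration, and each pipeline runs inside `bash -c` so the option does not leak into the calling shell:

```shell
# Without pipefail, only the last command's exit status counts.
plain=$(bash -c '(exit 3) | (exit 5) | true; echo $?')

# With pipefail, the rightmost non-zero exit status becomes the pipeline's status.
strict=$(bash -c 'set -o pipefail; (exit 3) | (exit 5) | true; echo $?')

echo "plain=$plain strict=$strict"   # prints: plain=0 strict=5
```

The 5 wins over the 3 because it comes from the stage further to the right, not because it is larger.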

If you like pipes, you'll love Pipe Viewer:


> pv - Pipe Viewer - is a terminal-based tool for monitoring the progress of data through a pipeline. It can be inserted into any normal pipeline between two processes to give a visual indication of how quickly data is passing through, how long it has taken, how near to completion it is, and an estimate of how long it will be until completion.

Another fun trick: by piping through dd, you can add a buffer between processes.

Example: the Raspberry Pi has pretty slow SD performance and the USB bus can get hogged. If you record audio and want to encode + write it to SD you can easily get buffer overruns. Easily solved by a 10-second buffer between arecord and flac in my case.

There's also buffer(1). I found it because it never occurred to me to use dd for that! Nice tip.


So what would that look like? Makes sense, but dd has a lot of options, and I haven't fiddled with it much.

Recording at 48 kHz, raw, piping through a five-second buffer, encoding to FLAC, and dumping onto a USB stick:

  arecord -D hw:1,0 -v --fatal-errors --buffer-size=192000 -f dat -t raw | dd bs=480000 | flac --endian=little --channels=2 --bps=16 --sample-rate=48000 --sign=signed -o /mnt/usbstick/`date '+%s'`.flac -
Gotta love Linux :)

that's insanely cool. I would have probably spent a lot of time implementing the buffer in software.

Reminds me of the classic David Beazley course on coroutines: http://www.dabeaz.com/coroutines/

It highlights a similar pipeline-oriented architecture and eventually ends up being sort of mindblowing.

The nice thing about the Beazley talk is that it shows how to move structured records through the pipeline; with multiprocessing, one could use more cores.

I hate to be "that guy" but someone has to say something about the Useless Use of Cat.


I've never agreed with the UUOC concept when applied to pipelines. Using cat in a pipeline for a single file makes the flow clearly unidirectional, prevents certain types of errors like switching > and <, allows the command to be easily modified to handle multiple files or a glob, and frankly just seems easier to read.

Considering the tiny overhead of an additional cat process, UUOC these days feels like nitpicking.

Also, the cat makes for faster testing. I tend to start out with e.g.

    head bigfile.1.txt | grep | awk | stuff
and refine things, and when output looks right, it's a simple "Ctrl+A Meta+D cat RET" to run it on the full output. Or vice versa if I suddenly want to go back to testing part of bigfile (or exchange the cat for "grep something").

If I want to change that to "< bigfile.1.txt", I have to "Ctrl+A Meta+D < Meta+F Meta+F Meta+F Ctrl+D Ctrl+D" – the extra keypresses are to delete the first "|" symbol. And if I suddenly want to change it back to head or grep, I have to reinsert the | (also I often by habit do Meta+D instead of Ctrl+D at the beginning of the line, which doesn't work as intended if the first token is "<" instead of "cat").

Those useless cats are quite handy when doing a lot of shell work.

You can use <bigfile.txt head | grep | awk
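
A quick sketch of why that works: the shell accepts an input redirection anywhere in a simple command, including before the command name. The temporary file and its three lines are made up for the demonstration:

```shell
# Hypothetical sample file standing in for bigfile.txt.
tmp=$(mktemp)
printf '%s\n' alpha beta gamma > "$tmp"

with_cat=$(cat "$tmp" | head -n 2)   # extra cat process
trailing=$(head -n 2 < "$tmp")       # redirection after the command
leading=$(< "$tmp" head -n 2)        # redirection before the command

[ "$with_cat" = "$trailing" ] && [ "$trailing" = "$leading" ] && echo "same output"
rm -f "$tmp"
```

All three forms feed the same bytes to head; they differ only in how they read on the command line and in whether an extra process is spawned.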

So if I want to change that to the whole file, I "just" have to Ctrl+A Meta+F Meta+F Meta+D Ctrl+D Ctrl+D RET. That's not really an improvement – especially since it depends on how many dots or similar are in the filename.

Also, that's a Useless Use of head, since grep has the option "-m10"

grep -m means "Stop reading a file after NUM matching lines", while head -10 takes the first 10 lines and grep then searches only those. Different things.
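
A small sketch of the difference, using the numbers 1 to 100 as stand-in input (the pattern `5` is arbitrary, and `-m` is assumed to be available, as it is in GNU and BSD grep):

```shell
# head first: only lines 1-10 are searched, and just one of them contains a 5.
head_then_grep=$(seq 1 100 | head -n 10 | grep -c 5)

# grep -m 10: the stream is searched until ten lines have matched.
grep_m=$(seq 1 100 | grep -m 10 5 | wc -l | tr -d ' ')

echo "head|grep: $head_then_grep match, grep -m: $grep_m matches"
# prints: head|grep: 1 match, grep -m: 10 matches
```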

doh! you're right, I wasn't thinking :-)

But then the article has this

"(If we move grep to run immediately after cat, and before putting data into Redis, this operation runs more than 1,200 times faster.)"

Which does indicate that in some cases the UUOC is justified (though in this case the cat remains).

I believe in this case, based on the author clearly having some mastery of the art, it was used to extend the example for the sake of explanation.

There are lots of ways to tighten up the example, if needed.

Off topic: Today I learned http://dcurt.is/unkudo. Peter Sobot, I want my kudos back. (Not that I didn't really appreciate learning about ${PIPESTATUS[*]}.)

Yeah, I hate svbtle for exactly that reason. Despite kudos being a thoroughly meaningless number, the kudos widget makes me feel like I'm nine years old and I just fell for the "a sphincter says what?" trick.


...But, yeah, exactly. :)

Various searches on the subject revealed plenty of people noting how meaningless Internet points are, leading to an additional meta-rub: not only did I fall for the hover trick, I also was childish enough to google for a svbtle kudos undo. sigh

I wouldn’t say that kudos are totally meaningless, nor your quest to undo yours. By displaying kudos on their article, the author falsely purports to have more fans (implied by the name “kudos”) than they really do. By attempting to correct that number, you were fighting against (slight) false advertisement, which could cause other visitors to waste their time reading an article that you don’t actually recommend. Though bumping the number by 1 has a negligible effect, fixing all bumps that people were tricked into making would probably save many internet readers a small amount of time.

I have universally blocked Svbtle kudos using the custom Adblock Plus filter `##FIGURE.kudo`. The kudos circle no longer appears on the page; I didn’t even notice that this page was supposed to have one.

I hovered over the icon because I had read Peter’s article to the end and my mouse pointer was drawn to the icon, thinking “this looks like it might do something; there’s probably a tool-tip to let me know what” so I agree that it’s a sneaky – though attractive and elegant — UI element. Anyhow, I enjoyed the article so the kudo was legitimate.

I don't fully agree with kudos (and you can have yours back!) but I have noticed that they're a very interesting metric. My own very unofficial data seems to suggest that about 1 in 10 visitors to the post will give their kudos, making it a neat measure of how many people have read to the end and "agree" with a post.

I’ve used pipes for years and had a pretty good understanding of how the system calls of each process in the pipeline interact with its own `stdin` and `stdout` file descriptors, but this article puts it all together really nicely with some good examples.

I don’t mind the useless use of `cat` as it can enhance readability for some people. However, I would suggest replacing the Bash while loop with a for loop:

    ls *.flac |
    while read song
    do
        flac -d "$song" --stdout |
        lame -V2 - "$song".mp3
    done

    for song in *.flac
    do
        flac -d "$song" --stdout |
        lame -V2 - "$song".mp3
    done

Using Bash’s file globbing avoids problems with enumerating the output of `ls` [1]. It also avoids an unnecessary Bash sub-shell being spawned for the while loop that follows the first pipe. More importantly, I think it’s a lot more readable while still demonstrating how pipes can be efficiently used to process any number of FLAC files.

[1] http://mywiki.wooledge.org/ParsingLs
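
The sub-shell point is easy to see with a counter. This is only a sketch: the two empty `.flac` files are placeholders created in a temporary directory, and the behaviour shown assumes Bash with default settings (or another shell that runs every pipeline stage in a sub-shell):

```shell
tmp=$(mktemp -d)
touch "$tmp/one.flac" "$tmp/two.flac"

# The while loop is the last stage of a pipeline, so it runs in a sub-shell
# and its increments are lost when the pipeline ends.
piped=0
ls "$tmp"/*.flac | while read -r song; do piped=$((piped + 1)); done

# The for loop runs in the current shell, so the counter survives.
globbed=0
for song in "$tmp"/*.flac; do globbed=$((globbed + 1)); done

echo "piped=$piped globbed=$globbed"   # prints: piped=0 globbed=2
rm -r "$tmp"
```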

Neat. I have my own take on this concept[1] using Redis pub/sub instead of queues. Tradeoffs involve being able to lose data if endpoints aren't connected, but you do get the benefit of having multiple inputs and outputs on one stream, which was important for my use case.

[1] https://github.com/whee/rp

I recommend highland.js for node http://highlandjs.org

I recently came across a language called Factor that works very similarly, if not identically, to this.

Here's a video about it: https://www.youtube.com/watch?v=f_0QlhYlS8g

I have been using Factor for some time. It has some drawbacks related to readability, but it is a real-time saver.

Compared against what does Factor save time? Python, C, Java?

Also, I've been looking at Factor, but I have a hard time getting into the mindset of the paradigm, since I'm used to Python (although I'm very much used to using pipes in the terminal). Are there any types of programs that you prefer to write in Factor, and others you prefer to write in -- for example -- Python?

That hyphen placement imparts a pleasing ambiguity.

  < unimpurpled >
Part way through my second viewing of the article, I thought, "what is 'unimpurpled'". Wiktionary didn't know. Google doesn't return useful results for it, even. M-W, finally, clued me in: it's an obsolete term, with an "un" prefix, for the verb "empurple", which means to make purple[1].

[1] And a few similar things. https://en.wiktionary.org/wiki/empurple

I've been playing around with Julia[1] this week and discovered the inclusion of a pipe-like operator that removes a lot of the parentheses from functional programming; you can write,

    x |> a |> b |> c |> s -> d(s, y) |> e |> ...
in Julia instead of

    e(d(c(b(a(x))),y)) or (e (d (c (b (a x))) y))
...or whatever is your flavour. I reckon it is impossible to make a serious case against that readability gain.

[1] julialang.org

In Haskell there are two variations on this, "apply" and "compose":

          f (g (h x)) == f $ g $ h $ x
                      == f . g . h $ x
    \x -> f (g (h x)) == f . g . h
The "compose" combinator, (.), is especially pertinent for making pipelines. Idiomatic Haskell code uses it constantly---especially for its natural mechanism of eliminating "points" like that `x` above. These are usually better described by their type than any variable name given (especially since the variable name cannot be machine checked for meaningfulness, unlike the type) and so are best eliminated.

In many libraries there is also a reverse apply function defined, often as &

          f (g (h x)) == x & h & g & f
                      == x & f . g . h
which is more popular when using other operator chains to describe functions as in lens.

F# and OCaml use this forward pipe operator as well. It's very useful and makes the flow much more apparent; in other words, it increases readability a lot.

I'll add to the list of replies with languages with this feature: Hylang, the "Lisp-stick on a Python", where it is called the "threading macro".

There's a Python lib that implements it: https://github.com/JulienPalard/Pipe

Elixir has the same syntax for the same operator.

> I’m calling /usr/bin/time here to avoid using my shell’s built-in time command [...].

I prefer to use command in this scenario.

$ command time

Pipes are also related to the IO monad, similar in a way to the jQuery syntax, which is another hugely popular case. I am utterly amazed how they invented such a powerful concept of functional programming for the shell (well, not 100% pure functional, if there are side effects).

> utterly amazed how they invented such a powerful concept of functional programming for the shell

You might be interested in the classic McIlroy-Knuth dialogue:


The Unix "Software Tools" philosophy -- small kits composable into something big -- is deeply shared by functional programming.

I don't think one begot the other. It was just obviously the natural thing to do during that era.

>You might be interested in the classic McIlroy-Knuth dialogue:

Yes, that was a good article. The comments on the post were interesting too. Saw it a while ago, and for fun, wrote two quick solutions for the problem, in Python and shell:


A reader, Veky, wrote an interesting comment on my post too.

The first few parts make a pretty good pipeline preach. Two interesting points about pipelines are the concurrency and the difficulty of handling errors (at least without cluttering the syntax), which are often missed by newcomers.
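
Both points can be seen in a few lines. This is a rough sketch: the two-second sleeps and the exit code 7 are arbitrary, and `PIPESTATUS` is a Bash-specific array, so that part runs under `bash -c`:

```shell
# Concurrency: both sleeps start together, so the pipeline should take
# about 2 seconds rather than 4.
start=$(date +%s)
sleep 2 | sleep 2
elapsed=$(( $(date +%s) - start ))

# Errors: Bash records the exit status of every stage in PIPESTATUS,
# even though $? would only report the last one.
statuses=$(bash -c 'false | true | (exit 7); echo "${PIPESTATUS[*]}"')

echo "elapsed=${elapsed}s statuses=$statuses"
```

The `statuses` value comes out as `1 0 7`, one entry per stage from left to right.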

The author has earned him/herself a Useless Use of cat (UUOC) award for not realizing that grep can take a filename argument in the example pipeline.

Basically, the example can be shortened to the following:

    grep purple /usr/share/dict/words | # Find words containing 'purple' in the system's dictionary
    awk '{print length($1), $1}' |      # Count the letters in each word
    sort -n |                           # Sort lines ("${length} ${word}")
    tail -n 1 |                         # Take the last line of the input
    cut -d " " -f 2 |                   # Take the second part of each line
    cowsay -f tux                       # Put the resulting word into Tux's mouth
