The Mighty Named Pipe (vincebuffalo.com)
385 points by vsbuffalo on Mar 11, 2015 | 96 comments

Nice article. Really easy to follow introduction.

I only discovered process substitution a few months ago but it's already become a frequently used tool in my kit.

One thing that I find a little annoying about unix commands sometimes is how hard it can be to google for them. '<()', nope, "command as file argument to other command unix," nope. The first couple of times I tried to use it, I knew it existed but struggled to find any documentation. "Damnit, I know it's something like that, how does it work again?..."

Unless you know to look for "Process Substitution" it can be hard to find information on these things. And that's once you even know these things exist....

Anyone know a good resource I should be using when I find myself in a situation like that?

Be aware that process substitution (and named pipes) can bite you in the arse in some situations --- for example, if the program expects to be able to seek in the file. Pipes don't support this and the program will see it as an I/O error. This'd be fine if programs just errored out cleanly but they frequently don't check that seeking succeeds. unzip treats a named pipe as a corrupt zipfile, for example:

  $ unzip <(cat z)
  Archive:  /dev/fd/63
    End-of-central-directory signature not found.  Either this file is not
    a zipfile, or it constitutes one disk of a multi-part archive.  In the
    latter case the central directory and zipfile comment will be found on
    the last disk(s) of this archive.
  unzip:  cannot find zipfile directory in one of /dev/fd/63 or
        /dev/fd/63.zip, and cannot find /dev/fd/63.ZIP, period.

Useful. I'm assuming those are cases where normal pipes would fail too? So you can't do:

    cat z | unzip   # I know, uuoc, demo only
It's just with the process substitution you have more flexibility to shoot yourself in the foot?

Aside: If I recall correctly, with the zip file format, the index is at the end of the file. A (named) pipe works fine with, for instance, a bzipped tarball.

I wouldn't be surprised if the ZIP file format has its origins outside the Unix world, given its pipe-unfriendliness.

stdin and stdout are assumed to always be streams, so anything that reads or writes them won't seek on them.

However, if you give a program a filename on a command line, the program's likely to assume it's an actual file.

I often use SymbolHound in these cases - it's a search engine with support for special characters; for example: http://symbolhound.com/?q=%3C%28%29

Thanks for that. I've already made use of it since you pointed it out.

For this in particular, try the Advanced Bash Scripting doc, http://www.tldp.org/LDP/abs/html/

There's a bunch of interesting constructs there. Most of them also apply to improved shells such as zsh, though some are just pointless there.

I've seen the ABS guide criticized for being obsolete and recommending wrong or obsolete best practices. A recommended replacement is The Bash Hacker's Wiki: http://wiki.bash-hackers.org/doku.php

Which itself recommends as best (current) alternative: Greg's Bash Guide: http://mywiki.wooledge.org/BashGuide

man pages!

  $ man bash
Drops you right into the Process Substitution section.

For those wondering, man, which uses less as a pager, has vi-like key bindings.

"/<\(" starts a regex-based search for "<(" (you must escape the open paren).

This is the origin of the regex literal syntax in most programming languages that have them. It was first introduced by Ken Thompson in the "ed" text editor.

Ahhhh..haaa...ha....DOH! I've never even thought of looking at the manpage for bash before. Thanks, you've just made my life better.

I try to read it about once a year. I always find something new before my eyes glaze over and I'm done until next year.

If you're interested in stock POSIX shell rather than bashisms, the dash man page is a whole lot shorter and easier to follow; it makes a great concise reference.

Nitpick: should read "POSIX-compliant shell", there isn't a "stock POSIX shell"

That's weird, isn't it? You wanted to know how to use a feature of bash and didn't check the manual?

Do you expect `man python` to output the full reference for the Python programming language?

I would, especially considering that "The Python Language Reference" is probably shorter than bash's manpage.

I wish that it did. The man pages really are supposed to be the manual.

I agree, though it does tell you where to look for it.

A bit OT but I don't understand why Google doesn't supply a way to do strict searches where everything you input is interpreted literally.

They have. I have complained loudly about this[1] and never heard anything back (this is SOP, I understand), but I have seen improvements in the last year.

Double quotes around part of a query mean: make sure this part is actually matched in the index. (I think they still annoy me by including sites that are linked to using this phrase[2], but that is understandable.)

Then there is the "verbatim" setting that you can activate under search tools > "All results" dropdown.

[1]: And the reason they annoyed me was that they would still fuzz my queries despite my double-quoting and choosing verbatim.

[2]: To verify this you could open the cached version and on top of the page you'd see something along the lines of: "the following terms exist only in links pointing to this page."

They still ignore any special characters: https://www.google.com/search?q=%22%3C()%22&tbs=li:1

Because if you want to ignore punctuation and case in normal situations, you leave them out of the search index. And then you can't query the same search index for punctuation and/or case-sensitive queries.

So they'd have to create a second index for probably less than 0.01% of their queries, and that second index would be larger and harder to compress.

As much as I'd love to see a strict search, from a business perspective I don't think it makes sense to a provide one.

I wish they'd supply that too, but they do seem to have gotten better at interpreting literally when it makes sense in context. I've been learning C# and have found, for example, that searches with the term "C#" return the appropriate resources when in the past I'd have probably seen results for C.

Google handles some constructs with punctuation as atomic tokens as special cases. C# and C++ are examples. A# through G# also return appropriate results, for the musical notes. H# and onward through the alphabet do not.

.NET is another example. Google will ignore a prepended dot on most words, but .NET is handled specially as an atomic token. I would bet this is a product of human curation, not of algorithms that have somehow identified .NET as a semantic token.

Searching for punctuation in a general case is hard, though. You wouldn't want a search for Lisp to fail to match pages with (Lisp). We often forget that the pages are tokenized and indexed, that Google and the other search engines aren't a byte-for-byte scan across the entire web.

I was recently trying to understand the difference between the <%# and <%= server tags in ASP.NET. Google couldn't even interpret those as tokens to search for. It took me a long time to figure out the former's true name as the data-bind operator in order to search for that and find the MS docs.

Occasionally it's useful to spell out the names of the characters, both when searching and when writing documentation, blog posts, and SO Q&A. That way, searching for "asp.net less than percent hash" might tell you it's the data-bind operator.

#bash on freenode, and http://mywiki.wooledge.org/BashFAQ

books on UNIX and shells

Once you discover <() it's hard not to (ab)use it everywhere :-)

    # avoid temporary files when some program needs two inputs:
    join -e0 -o0,1.1,2.1 -a1 -a2 -j2 -t$'\t' \
      <(sort -k2,2 -t$'\t' freq/forms.${lang}) \
      <(sort -k2,2 -t$'\t' freq/lms.${lang})
    # gawk doesn't care if it's given a regular file or the output fd of some process:
    gawk -v dict=<(munge_dict) -f compound_translate.awk <in.txt
    # prepend a header:
    cat <(echo -e "${word}\t% ${lang}\tsum" | tr '[:lower:]' '[:upper:]') \
        <(coverage ${lang})

> # gawk doesn't care if it's given a regular file or the output fd of some process:

Something wonderful I found out the other day: Bash executes scripts as it parses them, so you can do all kinds of awful things. For starters,

    bash <(yes echo hello)
will have bash execute an infinite script that looks like

    echo hello
    echo hello
    echo hello
without trying to load the whole thing first.

After that, you can move onto having a script append to itself and whatever other dreadful things you can think of.

That's actually one of the things that I really dislike with bash, that it doesn't read the whole script before executing it. I've been bitten by it before, when I write some long-running script, then e.g. write a comment at the top of it as it's running, and then when bash looks for the next command, it's shifted a bit and I get (at best) a syntax error and have to re-run :-(

There are several ways to get Bash to read the whole thing before executing.

My preferred method is to write a main() function, and call main "$@" at the very end of the script.

Another trick, useful for shorter scripts, is to just wrap the body of the script in {}, which causes the script to be a giant compound command that is parsed before any of it is executed; instead of a list of commands that is executed as read.
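A minimal sketch of the {} trick; the exit at the end guarantees bash never tries to read past the block, even if the file changes underneath it:

```shell
#!/bin/bash
# The whole body is one compound command, parsed completely before
# any of it runs, so editing this file mid-run can't shift bash's
# parse position.
{
  echo "step 1"
  # ...imagine a long-running task here...
  echo "step 2"
  exit 0   # never fall off the end of the block
}
```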

Ah, thank you for that. I may just start using these tricks in all my scripts :-)

Ewww. That's... nasty and dangerous. Very dangerous.

Fun fact: That's how goto worked in the earliest versions of Unix, before the Bourne shell was invented: goto was an external command that would seek() in the shell-script until it found the label it was looking for, and when control returned to the shell it would just continue executing from the new location.

To this day, when the shell launches your program, you can find the shell-script it's executing as file-descriptor 255, just in case you want to play any flow-control shenanigans.
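On Linux you can see this from inside a script (an illustrative sketch; /proc is Linux-specific, and the descriptor number is a bash implementation detail rather than a documented guarantee):

```shell
#!/bin/bash
# Print the file bash is reading this script from; bash moves the
# script's file descriptor to a high number (255) to keep low fds
# free for the script's own use.
readlink /proc/$$/fd/255
```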

one of my favorites is using diff across output of two programs, diff thinks they are files:

diff -u <(zipinfo archive.zip.bak) <(zipinfo archive.zip)

Pipes are probably the original instantiation of dataflow processing (dating back to the 1960s). I gave a tech talk on some of the frameworks: https://www.youtube.com/watch?v=3oaelUXh7sE

And my company creates a cool dataflow platform - https://composableanalytics.com

http://doc.cat-v.org/unix/pipes/ . And there's a bit more about how pipes came to be in unix here: http://cm.bell-labs.com/who/dmr/hist.html

Vince Buffalo is author of the best book on bioinformatics: Bioinformatics Data Skills (O'Reilly). It's worth a read for learning unix/bash style data science of any flavour.

Or even if you think you know unix/bash and data there are new and unexpected snippets every few pages that surprise you.

In zsh, =(cmd) will create a temporary file, <(cmd) will create a named pipe, and $(cmd) performs command substitution. There are also fancy options that use MULTIOS. For example:

    paste <(cut -f1 file1) <(cut -f3 file2) | tee >(process1) >(process2) >/dev/null
can be re-written as:

    paste <(cut -f1 file1) <(cut -f3 file2) > >(process1) > >(process2)


If you like pipes, then you will love lazy evaluation. It is unfortunate, though, that Unix doesn't support that (operations can block when "writing" only, not when "nobody is reading").

If nobody is reading, you will eventually fill the pipe buffer (which is about 4k), and the writing will stop. It's a bigger queue than most of us would expect when compared to generator expressions, but it can and does create back pressure while making reads efficient.

*about 4k

64k on linux these days.

Lazy evaluation with pipes would be problematic because alerts/echos would be ignored by default as they are necessarily part of the stdin/stdout chain.

I.E.: This section of code

     let a = String::from("some_file_name");
     println!("Opening: {}", a);
     let path = std::path::Path::new(&a);
     let mut fd = std::fs::File::open(path);
would get optimized to

     let mut fd = std::fs::File::open(std::path::Path::new("some_file_name"));
with lazy evaluation. The user feedback is removed, which is a big part of shell scripting.

I guess you found another issue with Unix. The user does not care in general how something is performed, just that it is performed correctly and with good performance.

I guess an OS should be functional at its interface to the user, and only imperative deep down to keep things running efficiently.

However, note that this hypothetical functional layer on top also would ensure efficiency, as it enables lazy evaluation. This type of efficiency could under certain circumstances be even more valuable than the bare-metal performance of system programming languages.

>The user does not care in general how something is performed, just that it is performed correctly and with good performance.

This is the crux of the matter. With bash scripting the user does care how a task is performed, as that task may be system administration, involve sensitive system components, or touch sensitive data.

Lazy evaluation is great for binary/CPU-level optimization. But passing system administration tasks through the same process is scary, as you lose the 1:1 mapping you previously had.

Well, in any case, the problem can be resolved by adding a kernel-level api function that allows one to wait (block) until results are requested from the other end of the pipe.


The opposite is already true and has the same effect.

Each stage of the pipeline runs when it has data to process. So ultimately the main blocking event is IO (normally the first stage in a pipeline). Every other process is automatically marked as blocked until its stdin is populated by the output of the former. Once its task is complete it re-checks stdin, and if nothing is present, blocks itself.

So the execution of each task is controlled by the process whose data that task needs to operate on.

In your system why would you want to block the previous step? This would just interfere with the previous+1 step, and you'd have to populate that message further up the chain. This seems needlessly complicated. As you have to add extra IPC.

Why: consider the generation of a stream of random numbers; assume each random number requires a lot of CPU-intensive work; obviously, you don't want to put unnecessary load on the CPU, and hence it is better to not fill any buffer ahead of time (before the random numbers are being requested).

There are arguments for supply-driven processing, AND for demand-driven. It all comes down to latency and speed arguments.

I don't know, I love pipes and I'm on the fence regarding strict vs lazy.

BTW: when is nobody reading in pipes? There's always implicit

    &> stdout

EDIT: oh, right, named pipes.

If you write to a named pipe, the call to write(2) will block until somebody opens it for reading and begins to read(2) it.
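A minimal demonstration with mkfifo (the writer is backgrounded because its open blocks until the reader shows up; mktemp -u just generates a name, which is fine for a demo):

```shell
#!/bin/sh
# Create a FIFO, write to it in the background, then read it.
fifo=$(mktemp -u)      # -u: generate an unused name for the FIFO
mkfifo "$fifo"
echo hello > "$fifo" & # open-for-write blocks until a reader opens
cat "$fifo"            # unblocks the writer and prints "hello"
wait
rm -f "$fifo"
```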

AFAIK process substitution is a bash-ism (not part of POSIX spec for /bin/sh). I recently had to go with the slightly less wieldy named pipes in a dash environment and put the pipe setup, command execution and teardown in a script.

I've used *nix for ~15 years and never used a named pipe or process substitution before. Great to know about!

Beware though, process substitution is not POSIX and not supported in all shells. It isn't in pdksh or ash/dash for instance.

It's a ksh93 extension adopted by bash and zsh.

Named pipes have been rare for me, but simple process substitution is every day.

Very often I do something like this in quick succession. Command line editing makes this trivial.

  $ find . -name "*blarg*.cpp"
  # Some output that looks like what I'm looking for.
  # Run the same find again in a process, and grep for something.
  $ grep -i "blooey" $(find . -name "*blarg*.cpp")
  # Yep, those are the files I'm looking for, so dig in.
  # Note the additional -l in grep, and the nested processes.
  $ vim $(grep -il "blooey" $(find . -name "*blarg*.cpp"))

Same here, except I typically use $(!!) to re-run the previous command. I find it faster than command-line editing.

    $ find . -name "*blarg*.cpp"
    $ grep -i "blooey" $(!!)
    $ vim $(!! -l)
Granted, you can only append new arguments and using the other ! commands will often be less practical than editing. Still, it's amazing how frequently this is sufficient.

I've always thought it'd be nice if there was a `set` option or something similar that would make bash record command lines and cache output automatically in implicit variables, so that it doesn't re-run the commands. The semantics are definitely different and you wouldn't want this enabled at all times, but for certain kinds of sessions it would be very handy.

EDIT: lazyjones beat me to it.

Since "!!" is replaced when you hit the "up" arrow key (i.e. jump to previous command), you can go really wild with them:


That's actually command substitution, not process substitution :)

Thanks for the correction. The unix is large and I'm so very small.

And command substitution has nothing to do with pipes, named or unnamed.

> Command line editing makes this trivial.

I'm lazy, so I typically do this (2nd step) to avoid the extra key strokes necessary for editing:

  $ grep -i "blooey" $(!!)
Also very useful for avoiding editing in order to do something else with the same argument as in the previous command: !$, i.e.:

  $ foo myfile
  $ bla !$

You could just use a pipe here though, which would also make it more easy to read. e.g.:

    $ find . -name '*blarg*.cpp' | grep -li blooey | vi -

Your version searches for blooey in the filenames, not in the files themselves.

And, to try to be helpful: it seems it lacks an xargs (or similar construct).
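Folding both corrections together might look like this (a sketch; grep needs the filenames as arguments, hence xargs, and -l prints only the names of matching files):

```shell
# Search the files' contents (not their names), printing only
# matching filenames.
# (Use find -print0 | xargs -0 if filenames may contain whitespace.)
find . -name '*blarg*.cpp' | xargs grep -li blooey
# then e.g.: vim $(find . -name '*blarg*.cpp' | xargs grep -li blooey)
```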

I used to do something like that, but I find it personally easier to understand the way I described it, and evolved into that. I also like what I'm ultimately doing (vim/vi in this case) to be over on the left: edit whatever this mess on the right produces.

In fish shell the canonical example is this:

   diff (sort a.txt|psub) (sort b.txt|psub)
The psub command performs the process substitution.

It seems like fish shell's ">" process substitution equivalent is not working as well as bash's, though.


Does anyone have a working link to Gary Bernhardt's The Unix Chainsaw, as mentioned in the article?

I found a high quality copy I had downloaded: http://sikhnerd.com/downloaded_vids/02-gary-bernhardt.mp4 (703M)

That's the kind of video I might have downloaded. At least I hope so. Gonna check my backups.

update 1 : found it, time to upload.

Thank you kind sir. Waiting for a link

How does the > process substitution differ from simply piping the output with | ?

For example (from Wikipedia)

tee >(wc -l >&2) < bigfile | gzip > bigfile.gz


tee < bigfile | wc -l | gzip > bigfile.gz

Say that you have a program that splits its output into two files, each given by command line arguments. A normal run would be

    <input.txt munge-data-and-split -o1 out1.txt -o2 out2.txt
but since the output is huge and your disk is old and dying, you want to run xz on it before saving it to disk, so use >():

    <input.txt munge-data-and-split -o1 >(xz - > out1.txt) -o2 >(xz - > out2.txt)
If you want to do several things in there, I recommend defining a function for clarity:

    pp () { sort -k2,3 -t$'\t' | xz - ; }
    <input.txt munge-data-and-split -o1 >(pp > out1.txt) -o2 >(pp > out2.txt)

In the tee case the substitution is actually going somewhere different than standard out (that's what tee does).


    cmd1 | tee out.txt | cmd2
So tee is splitting the stream into two outputs, one that carries on out stdout (into cmd2) and the other one that is redirected into out.txt.

With process substitution you can do extra stuff on the way out, I guess (I've never seen it used for output before).

It looks like in the example given they're writing wc stuff to stderr while zipping the content (over stdout).

Nice to see that example, I hadn't even thought about the usefulness of process substitution for outputting like this!
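A tiny runnable illustration of >() on the output side (note the >() consumer runs asynchronously, so a careful script would wait for count.txt before relying on it):

```shell
#!/bin/bash
# Count lines "off to the side" while the data continues downstream:
seq 3 | tee >(wc -l > count.txt) | tr '\n' ' '
# stdout: "1 2 3 "; count.txt ends up containing 3
```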

When you connect two processes in a pipe such as ...

    a | b
you connect stdout (fd #1) of a to stdin (fd #0) of b. Technically, the shell process will create a pipe, which is two filedescriptors connected back to back. It then will fork two times (create two copies of itself) where it replaces standard output (filedescriptor 1) of the first copy by one end of the pipe and replaces standard input (filedescriptor 0) of the second copy by the other end of the pipe. Then the first copy will replace itself (exec) by a, the second copy will replace itself (exec) by b. Everything that a writes to stdout will appear on stdin of b.

But nothing prevents the shell from replacing any other filedescriptor by pipes. And when you create a subprocess by writing "<(c)" in your commandline, it's just one additional fork for the shell, and one additional filedescriptor pair to be created. One end of the pipe will replace stdout (fd #1) of "c"... and because the read end of this pipe doesn't correspond to any predefined filedescriptor of "a" (stdout is already taken by "|b"), the shell will somehow have to tell "a" what filedescriptor the pipe uses. Under Linux one can refer to opened filedescriptors as "/dev/fd/<FDNUM>" (symlink to /proc/self/fd/<FDNUM>, which itself is a symlink to /proc/<PID>/fd/<FDNUM>), so that's what's passed as a "name" to refer to the substituted process on "a"'s command line:

Try this:

    $ echo $$
    12345  # <--- PID of your shell
    $ tee >( sort ) >( sort ) >( sort ) otherfile | sort
and in a second terminal

    $ pstree 12345 # <--- PID of your shell

      ├─sort,3600 # <-- this one reads from the other end of the shell's fd #14
      ├─sort,3601 # <-- this one reads from the other end of the shell's fd #15
      ├─sort,3602 # <-- this one reads from the other end of the shell's fd #16
      ├─sort,3604 # <-- this one reads from stdout of tee
      └─tee,3603 /proc/self/fd/14 /proc/self/fd/15 /proc/self/fd/16 otherfile
If your system doesn't support the convenient /proc/self/fd/<NUM> shortcut, the shell might decide not to create a pipe, but rather create temporary fifos in /tmp and use those to connect the filedescriptors.



You can watch the syscalls as they are made:

    $ strace -fe fork,pipe,close,dup,dup2,execve bash -c 'tee <(sort) <(sort)'

It allows multiple, parallel pipes to each individual command, where the | allows just one.

Anybody know of a way to increase the buffer size of pipes? I've experienced cases where piping a really fast program to a slow one caused them both to go slower as the OS pauses first program writing when pipe buffer is full. This seemed to ruin the caching for the first program and caused them both to be slower even though normally pipes are faster as you're not touching disk.

Both mbuffer and pv by default contain fairly large in-memory buffers for pipe data, and accept parameters for particularly large buffers.

http://www.maier-komor.de/mbuffer.html http://www.ivarch.com/programs/pv.shtml

Thanks - hoping that there was a built in solution but a buffer program makes sense

Is this guy a bioinformatician? I think he's a bioinformatician.

Can't be sure if he is a bioinformatician because he never really mentions that he is a bioinformatician.

Seems entirely appropriate given his blog post, and others like it on his site as well as the book he wrote, are clearly aimed at people interested in learning bioinformatics.

The tone made me stop reading. It reads like a child's thoughts in a comic book.

moreutils [1] has some really cool programs for pipe handling.

pee: tee standard input to pipes
sponge: soak up standard input and write to a file
ts: timestamp standard input
vipe: insert a text editor into a pipe

[1] https://joeyh.name/code/moreutils/
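The reason sponge exists is worth a quick demo: with a plain redirect, the shell truncates the target before the reader ever opens it (sponge itself is from moreutils, so it appears only in the comment here):

```shell
# The classic trap sponge avoids:
printf 'keep\n# comment\n' > config.txt
grep -v '^#' config.txt > config.txt   # shell truncates first: data lost!
wc -c < config.txt                     # prints 0
# With moreutils: grep -v '^#' config.txt | sponge config.txt
```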

I heard somewhere that Go's interfaces follow the Unix pipe model.

Pipes are very cool and useful, but it's hard for me to understand this common worship of something like that. Yes, it's useful and elegant, but is it really the best thing since Jesus Christ?

Maybe it's not the best thing since Jesus, but it's worth celebrating its birthday http://blog.fugue.it/2013-10-07-pipeday.html

Wow. I guess that's what I get for not being totally enamoured of Unix.

No, that's not why you were down voted. You were down voted because you were condescending to the people who enjoy working with *nix.

It really is a question that I've been having for a long time. I didn't just say that to piss people off. I guess that's the risk you run of coming across that way when you try to insert yourself in a conversation where the other participants have already agreed on a set of shared opinions ("this is great") and you question that common assumption/opinion.

I have honestly been questioning my own understanding of pipe, since I've failed to see the significance before; first I thought it was just `a | b` as in "first do a, then b". So then it just seemed like a notational way of composing programs. Then I thought, uh, ok say what? Composing things is the oldest trick in the conceptual book. But then I read more about it and saw that it had this underlying wiring of standard input and output and forking processes that made me realize that I had underestimated it. So, given that, I was wondering if there is even more that I've been missing.

I have for that matter read about named pipes before and tried it out a bit. It's definitely a cool concept.

I don't think it's that you have a differing opinion. Those are great and most people would be OK with that. I really believe you got the down votes because what you said came off as condescending.

If you had said what you just said in your follow up, I think you'd actually have gotten some up votes.

Perhaps try questioning shared opinions without using the word worship?

I'll have to admit that that sounded unnecessarily judgy. :)
