For the Love of Pipes (jessfraz.com)
506 points by ingve 60 days ago | 303 comments



[Quote]

The Unix philosophy is documented by Doug McIlroy as:

    Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features”.

    Expect the output of every program to become the input to another, as yet unknown, program. Don’t clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don’t insist on interactive input.

    Design and build software, even operating systems, to be tried early, ideally within weeks. Don’t hesitate to throw away the clumsy parts and rebuild them.

    Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you’ve finished using them.

I really like the last two; if you can do them in development, then you have a great dev culture.


Reformatted to be readable:

> Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features”.

> Expect the output of every program to become the input to another, as yet unknown, program. Don’t clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don’t insist on interactive input.

> Design and build software, even operating systems, to be tried early, ideally within weeks. Don’t hesitate to throw away the clumsy parts and rebuild them.

> Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you’ve finished using them.
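The second precept is what lets classic one-liners compose so well; McIlroy's famous word-frequency pipeline is a neat illustration (a sketch using standard POSIX tools):

```shell
# Count word frequencies: split on non-letters, lowercase, sort,
# count duplicates, then sort by count descending.
printf 'the cat and the dog\n' \
  | tr -cs 'A-Za-z' '\n' \
  | tr 'A-Z' 'a-z' \
  | sort \
  | uniq -c \
  | sort -rn    # most frequent word first
```

Each stage does one thing, and each stage's output is plain lines the next stage can consume.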


If bots were not discouraged on news.yc, I would have implemented a bot for this long ago. Code-block quotes are so atrocious, esp. on mobile devices.


It seems "white-space: pre-wrap" on code blocks would solve most of the problem. There is also an additional "max-width" on the pre that I think is not needed.


That would break actual code snippets.

What would solve most of the problems is HN actually implementing markdown instead of the current half-assed crap.


I would hate to see the day HN allowed any way to bold sections of text.

It's way more restful perusing a page of uniformly restrained text.


> I would hate to see the day HN allowed any way to bold sections of text.

HN already has shitty italics (shitty in that it commonly matches and eats things you don't want to be italicised e.g. multiplications, pointers, … in part though not only because HN doesn't have inline code). "bold" can just be styled as italics, or it can be styled as a medium or semibold. It's not an issue, and even less worth it given how absolute garbage the current markup situation is.


For a site that's meant to target programmers, HN's handling of code blocks is pretty poor.

Just give me the triple-tilde code block syntax please!


> For a site that's meant to target programmers, HN's handling of code blocks is pretty poor.

Meh. It does literal code blocks, they work fine.

That's pretty much the only markup feature which does, which is impressively bad given HN only has two markup features: literal code blocks and emphasis.

It's not like they're going to add code coloration or anything.

And while fenced code blocks are slightly more convenient (no need to indent), pasting a snippet in a text editor and indenting it is hardly a difficult task.


How is that meaningfully different to italics in that regard?


Bold text stands out when visually scanning the page, italics don't.


What is it with so many products (HN, Discord, Slack) building half assed markdown implementations that aren't actually markdown?


What's Markdown?

- Slack product development

(that was a joke but is likely the answer to your question)


To be pedantic, there's debate about what is "actually markdown". No one would say it's the flavor HN implements, but the easiest way to win some games is to simply not play


That would break any existing comments that happened to be using markdown syntax as punctuation. Although I suppose you could have a flag day for the changeover and format differently based on comment creation time.

But I think the very limited formatting is just fine anyway. For the above comment as an example, I agree the code formatting looks awful, especially on mobile. But the version with >'s is ok, and I don't think proper bullet points or a quote bar would have improved it dramatically.


Conversations.im uses an interesting trick for rendering Markdown [0] - it leaves the syntax as is, so in the worst case you've got text with weird bold/italics, but the characters are 1:1 identical to what was sent.

[0]: Actually not Markdown but a subset, but that's not important.


I agree with you on the max-width. I can't see how whatever benefit it's supposed to provide outweighs the annoyance of having to scroll horizontally when there is a lot of empty space to the right that could be used to display more text.

I'm not too convinced on the wrapping of code, though.


white-space: pre-wrap on a code block could lead to confusion. Any change should be an optional setting in your user profile.

However you're definitely right about dropping the max-width property.


Why is this or OP's even necessary? The bullet points are copied directly from the article.


I'm sure I'm not the only one who opens the comments to skim them and quickly vet whether the article is worth reading or not


You are not. I made a Chrome plugin to find the HN discussions for an article, thinking I'd use it primarily after I'd read an article, but I find that I more often than not use it as a benchmark for whether I should spend the time to read it or not.


> The Unix philosophy is documented by Doug McIlroy as

TaoUP has a longer discussion[1] of the Unix philosophy, which includes Rob Pike's and Ken Thompson's comments on the philosophy.

[1] http://www.catb.org/esr/writings/taoup/html/ch01s06.html

"Those who don't understand Unix are condemned to reinvent it, poorly." (Henry Spencer)


Avoid TAOUP; it's really bad. Most of the lore it contains was stolen from places such as the LISP and VAX communities. ESR is as alien to Pike and Ken as X11 itself.


I'm not sure I understand - is TAOUP bad because it stole information, or because the information it has is wrong?

Because I'd consider the latter to be far worse than the former.


I enjoyed it when I read it many years ago, but maybe that was because I was inexperienced and naive.

Could you recommend some "original sources" to learn from, instead? Ideally in book form?


Not TAOUP specifically, but the Jargon File is where ESR got loads of things wrong or mislabelled as Unix-related. Also, in TAOUP you have Emacs, which is the Anti-UNIX by definition. https://www.dourish.com/goodies/jargon.html


TAOUP has a chapter, "a tale of 5 editors", discussing emacs, vi, and more, and it does point out that emacs is an outlier (and outsider) to many unix principles. It does quote Doug McIlroy speaking against it (but also against vi?). It attempts to generalize from discussing the "The Right Size for an Editor" question to discussing how to think about "The Right Size of Software".

I don't know if it's possible to have impartially "fair" discussion of editors. Skimming now, I can see how vi lovers would hate some characterizations there. But it does try to learn interesting lessons from them.

It does NOT simply equate "Emacs has UNIX nature" so you can't just prove something like "TAOUP mentions Emacs, Emacs is GNU, Gnu is Not Unix => TAOUP is not UNIX, QED" ;-)

http://www.catb.org/esr/writings/taoup/html/ch13s02.html

bias disclaimers: I learnt most of what I know of unix from within Emacs, which I still use ~20 years later. I learnt more from Info pages than man pages (AIX had pretty bad man pages). I suspect you have a different picture of unix than I. And I now know better than arguing which editor is better ;)

But I found TAOUP articulated ideas I only learnt through osmosis. I'm looking forward to reading a better articulation if you know one.


The last one is especially interesting to me these days. On a macro scale it sure sounds a whole lot like the robot revolution taking unskilled jobs.

But of course that's probably not the author's intended context.


Programmers have been destroying programmer jobs for as long as those jobs have existed. Up to now that has meant enough extra productivity to expand into more markets, but that will not last forever.


From 1978, and still applicable to microservices today.


That's my reaction too! Microservices with a dash of agile.


I’m surprised JessFraz, who is employed by Microsoft, doesn’t talk about powershell pipes at all.

Powershell pipes are an extension over Unix pipes. Rather than just being able to pipe a stream of bytes, powershell can pipe a stream of objects.

It makes working with pipes so much fun. In Unix you have to cut, awk and do all sorts of parsing to get some field out of `ls`. In powershell, ls outputs a stream of file objects, and you can select the field you want by piping to `Select-Object`, or sum the file sizes, or filter only directories. It’s very expressive once you’re manipulating streams of objects with properties.


> In Unix you have to cut, awk and do all sorts of parsing to get some field out of `ls`.

I'm guessing you've mentioned using `ls` as a simple, first-thing-that-comes-to-mind example, which is cool. I just wanted to point out that if a person is piping ls's output, there are probably other, far better alternatives, such as (per the link below) `find` and globs:

https://unix.stackexchange.com/a/247973

That has been my experience, at least.


If you're using cut and awk to get a field out of ls, maybe what you want is actually stat(1)?
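For instance, GNU coreutils `stat` can print inode fields directly with a format string (a sketch; BSD/macOS `stat` uses `-f` with different format codes instead):

```shell
# %s = size in bytes, %n = file name -- no ls parsing required.
printf 'hello' > demo_stat.txt
stat -c '%s %n' demo_stat.txt   # prints: 5 demo_stat.txt
```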


She is actually employed at GitHub now.

https://twitter.com/jessfraz


Still MSFT, in a way!


Which is owned by...


The downside of objects is performance, due to e.g. the increased overhead a file object carries. Plus, the exact same can be argued about everything being bytes/text on Unix: it makes everything simple but versatile.

> get some field out of `ls`

If you're parsing the output of ls, you're doing it wrong. Filenames can contain characters such as newlines, etc, and so you generally want to use globbing/other shell builtins, or utils like find with null as the separator instead of newline.
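A common sketch of the null-separator approach: `find -print0` terminates each name with NUL instead of newline, and `xargs -0` splits on NUL, so names with spaces (or even embedded newlines) survive intact:

```shell
# Create two files, one with a space in its name.
mkdir -p demo_find
: > 'demo_find/plain.txt'
: > 'demo_find/with space.txt'

# NUL-terminated names can't be corrupted by whitespace in filenames.
find demo_find -type f -name '*.txt' -print0 | xargs -0 -n1 echo got:
```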


I've been following her on twitter for a while. She went to MSFT only a couple years ago. From what I understand her expertise is in Linux.


> Powershell pipes are an extension over Unix pipes. Rather than just being able to pipe a stream of bytes, powershell can pipe a stream of objects.

Unix pipes can pipe objects, binary data, text, json encoded data, etc.

The problem is it adds a lot of complication and simply doesn't offer much in practice, so text is still widely used and binary data in a few specific cases.


What happens when there are upstream changes to the objects? Does everything downstream just need to change, and hence, the upstream objects returned by programs need to be changed with care? Or is it using something like protobuf, where fields are only additive, but never deleted, for backwards compatibility?

Or are the resulting chain of pipes so short lived, it doesn't matter?


What happens when you rely on non standard behaviour of unix tools, or just non-posix tools that change from beneath you?

I’m not saying that this makes powershell pipes better/worse, just that this problem isn’t unique. Microsoft tends to be reasonably committed to backwards compatibility but I don’t know the answer to the question


I think they were talking about objects at execution time, not version compatibility.


I don’t really understand what this means. The parent specifically writes “backwards compatibility.” The only other thing I think you might be referring to is whether, if the reading process mutates the objects, that somehow affects the writing end of the pipe, but I think that doesn’t make sense from a “sane way of doing inter-process communication” standpoint. Is there something else you are referring to? Could you elaborate please?


Powershell is a shell language, everything is dynamic, and it does not error when you access a non-existent property on an object.


Don’t you have to rewrite every single program to make it able to read those “objects”?


You don't have to rewrite them; just write them. Every new OS/language needs its tooling/libraries to be written.


Never parse ls.


> Never parse ls.

I have heard this several times, but either I do not understand it or I disagree. Do you mean parsing the output of the ls program? Parsing ls output is not wrong; the program produces a text stream that is easy and useful to parse. There's nothing to be ashamed of when doing it, even when you can do it in a different, even shorter way. I do grep over ls output daily, and I find it much more convenient than writing wildcards.


One can certainly do fine by grepping ls output in one-off instances, but I'd be really hesitant to put that in a script.

For given paths, the stat command essentially lets us directly access their inode structs, and invocations are nicely concise. The find util then lets us select files based on inode fields.

Both tools do take a bit of learning, but considerably less than grep and regexs. Anyway, I've personally found find and stat to be really nice and ergonomic after the initial learning period.


I'm probably nitpicking, but if you're using cat to pipe a single file into the stdin of another program, you most likely don't need the cat in the first place; you can just redirect the file to the process' stdin. Unless, of course, you're actually concatenating multiple files, or maybe a file and stdin together.

Disclaimer: I do cat-piping myself quite a bit out of habit, so I'm not trying to look down at the author or anything like that! :)
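The difference can be sketched like this (both forms produce the same output; the redirection just skips one process and one copy through a pipe):

```shell
printf 'one\ntwo\nthree\n' > demo_redir.txt
cat demo_redir.txt | wc -l   # "useless use of cat": extra process + copy
wc -l < demo_redir.txt       # shell opens the file as stdin directly
# each prints 3
```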


In fact, I don't like people optimizing shell scripts for performance. I mean, shell scripts are slow by design, and if you need something fast, you chose the wrong technology in the first place.

Instead, shell script should be optimized for readability and portability and I think it is much easier to understand something like 'read | change >write' than 'change <read >write'. So I like to write pipelines like this:

  cat foo.txt \
    | grep '^x' \
    | sed 's/a/b/g' \
    | awk '{print $2}' \
    | wc -l >bar.txt
It might not be the most efficient processing method, but I think it is quite readable.

For those who disagree with me: You might find the pure-bash-bible [1] valuable. While I admire their passion for shell scripts, I think they are optimizing to the wrong end. I would be more a fan of something along the lines of 'readable-POSIX-shell-bible' ;-)

[1]: https://github.com/dylanaraps/pure-bash-bible


IMHO, shell scripts are a minefield and if you want something readable and portable, this is also the wrong technology. They are convenient though. They are like the Excel macros of the UNIX world.

Now back to the topic of "cat", which is a great example of why shell scripts are minefields.

Replace "foo.txt" with a user supplied variable, let's call it "$F". It becomes cat $F | blah_blah... I mean cat "$F" | blah_blah, first trap, but everyone knows that.

Now, if F='-n', second trap. What you think is a file will be considered an option and cat will wait for user input, like when no file is given. Ok, so you need to do cat -- "$F" | blah_blah.

That should be OK in every case now, but remember that "cat" is just another executable, or maybe a builtin. For some reason, on your system "cat --" may not work, or some asshat may have added "." in your PATH and you may be in a directory with a file named "cat". Or maybe some alias that decides to add color.

There are other things to consider, like your locale, which may mess up your output with commas instead of decimal points, or with unicode characters. For that reason, you need to be very careful every time you call a command, and even more so if you pipe the output.

For that reason, I avoid using "cat" in scripts. It is an extra command call and all the associated headaches I can do without.
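The second trap can be demonstrated in a couple of lines (a sketch; `--` is the conventional end-of-options marker that most utilities honour):

```shell
# A user-supplied name that happens to look like an option:
F='-n'
printf 'hello\n' > "$F"   # redirection never does option parsing,
                          # so this creates a file literally named "-n"
cat -- "$F"               # prints hello; plain `cat "$F"` would sit
                          # waiting on stdin, treating -n as an option
rm -f -- "$F"
```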


> Now, if F='-n', second trap

You're not wrong, but I think it's worth pointing out that's a trap that comes up any time you exec another program, whether it's from shell or python. I can't reasonably expect `subprocess.run(["cat", somevar])` to work if `somevar = "-n"`.

(Now, obviously, I'm not going to "cat" from python, but I might "kubectl" or something else that requires care around the arguments)


> Replace "foo.txt" with a user supplied variable, let's call it "$F". It becomes cat $F | blah_blah... I mean cat "$F" | blah_blah, first trap, but everyone knows that.

I think that you forgot to edit the "I mean" to "echo $F" :)


I agree with the sentiment, but my critique applies so generally that it must be noted: if a command accepts a filename as a parameter, you should absolutely pass it as a parameter rather than `cat` it over stdin.

For example, you can write this pipeline as:

    grep '^x' foo.txt \
        | sed 's/a/b/g' \
        | awk '{print $2}' \
        | wc -l > bar.txt
This is by no means scientific, but I've got a LaTeX document open right now. A quick `time` says:

    $ time grep 'what' AoC.tex
    real    0m0.045s
    user    0m0.000s
    sys     0m0.000s

    $ time cat AoC.tex | grep what
    real    0m0.092s
    user    0m0.000s
    sys     0m0.047s
Anecdotally, I've witnessed small pipelines that absolutely make sense totally thrash a system because of inappropriate uses of `cat`. When you `cat` a file, the OS must (1) `fork` and `exec`, (2) copy the file to `cat`'s memory, (3) copy the contents of `cat`'s memory to the pipe, and (4) copy the contents of the pipe to `grep`'s memory. That's a whole lot of copying for large files -- especially when the first command in the sequence (grep here) usually performs some major kind of reduction on the input data!


In my opinion, it's perfectly fine either way unless you're worried about performance. I personally tend to try to use the more performant option when there's a choice, but a lot of times it just doesn't matter.

That said, I suspect the example would be much faster if you didn't use the pipeline, because a single tool could do it all (I'm leaving in the substitution and column print that are actually unused in the result):

    awk '/^x/{gsub("a","b"); print $2; count++} END{print count+0}' foo.txt
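A quick sanity check with sample data (a sketch; the awk version also echoes the transformed second fields before the final count, which the original pipeline discards into wc):

```shell
printf 'x a1 a2\ny b1 b2\nx c1 c2\n' > demo_awk.txt

# Pipeline version: count of lines starting with x.
grep '^x' demo_awk.txt | sed 's/a/b/g' | awk '{print $2}' | wc -l

# Single-awk version: same count on its last output line.
awk '/^x/{gsub("a","b"); print $2; count++} END{print count+0}' demo_awk.txt
```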


That syntax is very different from anything I've seen. I am also a fan of splitting pipelines with line breaks for readability; however, I put the pipe at the end of each line and omit the backslash. In Bash, a line that ends with a pipe always continues on the next line.

In any case, it's probably just a matter of personal taste.


That's actually very readable. I'm now regretting that I hadn't seen this about 3 months ago--I recently left a project that had a large number of shell scripts I had written or maintained for my team. This probably would've made it much easier for the rest of the team to figure out what the command was doing.


If the order is your concern, you can also put the <read at the beginning of the line: `<file grep x` works the same as `cat file | grep x`.
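This works because a redirection may appear anywhere in a simple command, including before the command name (a sketch):

```shell
printf 'x1\ny2\nx3\n' > demo_lt.txt
<demo_lt.txt grep x          # redirection first, then the command
cat demo_lt.txt | grep x     # same output, one extra process
```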


I've been using unix for 25 years and I did not know that.


I dunno. You are bringing 5 cores to bear, and there is no global interpreter lock, which is not a bad start.


I like 'collection pipeline' code written in this style regardless of language. If we took away the pipe symbols (or the dots) and just used indentation we'd have something that looked like asm but with flow between steps rather than common global state.

I periodically think it would be a good idea to organize a language around.


awk can do all of that except sed's part, and I am not even sure about that. No need for wc (NR in AWK, if I recall correctly), no need for grep; you have /match/ patterns, with regex too.


> except sed

Doesn't gsub(/a/, "b") do the same thing as s/a/b/g?


Yes, I recalled it a few hours ago.


I find something like this:

   grep '^x' < input | sed 's/foo/bar/g' 
to be very readable, as the flow is still visually apparent based on punctuation.


I don't like this style at all. If you're following the pipeline, it starts in the middle with "input", goes to the left for the grep, then to the right (skipping over the middle part) to sed.

     cat input | grep '^x' | sed 's/foo/bar/g'
Is far more readable, in my opinion. In addition, it makes it trivial to change the input from a file to any kind of process.

I'm STRONGLY in favor of using "cat" for input. That "useless use of cat" article is pretty dumb, IMHO.


Note that `<input grep x | foo` is also valid.


In this particular example, ‘unnecessary use of cat’ is accompanied by ‘unnecessary use of grep’.

    cat input | grep '^x' | sed 's/foo/bar/g'

    sed '/^x/s/foo/bar/g' <input


That's not the same thing. The sed output will still keep lines not starting with x (just not replacing foo with bar in those) where grep will filter those out.


Yeah, Muphry's law at work. Corrected version:

   sed -n '/^x/{s/foo/bar/g;p;}' <input
This may be an inadvertent argument for the ‘connect simpler tools’ philosophy.
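A quick check with sample data, including an ^x line without foo (a sketch; GNU sed syntax, where the block form prints every ^x line whether or not a substitution happened):

```shell
printf 'xfoo\nxbaz\nyfoo\n' > demo_sed.txt
grep '^x' demo_sed.txt | sed 's/foo/bar/g'     # xbar, xbaz
sed -n '/^x/{s/foo/bar/g;p;}' demo_sed.txt     # same two lines, one process
```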


You can just remove the <


You can if input is a file. It might be a program with no arguments or something else.


In your original command, how can 'input' be a program with no arguments?


Oh, damn. You're exactly right.

OK, to save some of my face, this will work:

    grep 'foo' <(input) | sed 's/baz/bar/g'
... at least in zsh and probably bash.


I don’t like that at all. That creates a subshell and is also less readable than

    input | grep foo | sed ...


That specific example is less readable, but I do like being able to do this:

    diff <(prog1) <(prog2)
and get a sensible result.
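A runnable sketch of that pattern (process substitution is a bash/zsh feature, so this invokes bash explicitly in case it's pasted into a plain POSIX sh; each `<(cmd)` becomes a readable pseudo-file such as /dev/fd/63):

```shell
# Compare two streams ignoring line order, no temp files needed.
bash -c 'diff <(printf "b\na\n" | sort) <(printf "a\nb\n" | sort) && echo same'
# prints: same
```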

And sometimes programs just refuse to read from stdin but do just fine with an unseekable file on the command line. True, you do have this:

    input | recalcitrant_program /dev/stdin
... but it's a bit of a tossup as to which one's more readable at this point. They're both relying on advanced shell functionality.


> That specific example is less readable, but I do like being able to do this:

> diff <(prog1) <(prog2)

> and get a sensible result.

That is called process substitution and is exactly the kind of use case that it's designed for. So yes, process substitution does make sense there.

> input | recalcitrant_program /dev/stdin

> ... but it's a bit of a tossup as to which one's more readable at this point. They're both relying on advanced shell functionality.

There's no tossup at all. Process substitution is easily more readable than your second example because you're honouring the normal syntax of that particular command's parameters rather than kludging around its lack of STDIN support.

Also I wouldn't say either example is using advanced shell functionality. Process substitution (your first example) is a pretty easy thing to learn, and your second example is just using regular anonymous pipes (/dev/stdin isn't a shell feature, it's a proper pseudo-device like /dev/random and /dev/null), so the only thing the shell is doing is the same pipe described in this thread's article (with UNIX / Linux then doing the clever stuff outside of the shell).


This is a very silly way of writing it though. grep|sed can almost always be replaced with a simple awk: awk '/^x/ { sub("a", "b"); print $2; }' foo.txt. This way, the whole command fits on one line. If it doesn't, put your awk script in a separate file and simply call it with "awk -f myawkscript foo.txt".


I would disagree that their way of writing it is silly.

It is instantly plainly obvious to me what each step of their shell script is doing.

While I can absolutely understand what your shell script does after parsing it, its meaning doesn't leap out at me in the same way.

I would describe the prior shell script as more quickly readable than the one that you've listed.

So, perhaps it's not a question of one being more silly than the other—perhaps the author just has different priorities from you?


I use awk in exactly this way personally, but, awk is not as commonly readable as grep and sed (in fact, that use of grep and sed should be pretty comprehensible to someone who just knows regular expressions from some programming languages and very briefly glances at the manpages, whereas it would be difficult to learn what that awk syntax means just from e.g. the GNU awk manpage). So, just as you could write a Perl one-liner but you shouldn't if you want other people to read the code, I'd probably advise against the awk one-liner too.


Not sure why you say grep and sed are more readable than awk! (not sure what 'commonly readable' means). Or that even that particular line in awk is harder to understand than the grep and sed man pages. The awk manpage even has examples, including print $2. The sed manpages must be the most impenetrable manpages known to 'man', if you don't already understand sed. (People might already know s///g because 99% of the time, that's all sed is used for.)


>sub("a", "b");

That should be gsub, shouldn't it? (sub only replaces the first occurrence)


Yes.


The "useless use of cat" was a repeated gripe on Usenet back in the day: http://porkmail.org/era/unix/award.html


I actually think that cat makes it more obvious what's happening in some cases.

I had recently built a set of tools used primarily via pipes: (tool-a | tool-b | tool-c) and it looks clearer when I mock (for testing) one command (cat results | tool-b | tool-c) instead of re-flowing it just to avoid cat and use direct files.


People use cat to look at the file first, then hit up arrow, add a pipe, etc.


Yes, this. Quite often I start writing out complex pipelines using head/tail to test with a small dataset and then switch it out for cat when I am done to run it on the full thing. And it's often not worth refactoring these things later unless you are really trying to squeeze performance out of them.


I think it's also a grammatical wart of shell syntax. Things going into a command are usually on the left, but piping in a file goes on the right.


   <file command | command | command
is perfectly fine.


The arrow now points backwards.


Of course, if any of your commands prompt for input, you'll be disappointed: that's not always as easy as it appears on the surface.

Does anyone have a better way to do this kind of thing?


The standard is expect [1]. There are also libraries for many programming languages which perform a similar task, such as pexpect [2].

[1] https://core.tcl.tk/expect/index [2] https://pexpect.readthedocs.io/en/stable/


The better solution is to change the command so it expects programmatic arguments / pass command-line parameters.

i.e.

prefer `apt-get install -y` over `yes | apt-get install foo`
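For context, `yes` simply repeats "y" (or its argument) forever; piping it is a blunt way of answering prompts, which is why a dedicated flag is preferable when the tool offers one (a sketch):

```shell
yes | head -n 3   # emits: y y y, one per line; head then closes the pipe
                  # and yes exits on the resulting SIGPIPE
```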


I can see how it's redundant. But I use cat-pipes because I once mistyped the redirection and nuked my carefully created input file :)

(Similarly, the first thing I used to do on Windows was set my prompt to [$p] because many years ago I also accidentally nuked a part of Visual Studio when I copied and pasted a command line that was prefixed with "C:\...>". Whoops.)


Not nitpicking. Useless Use of Cat is an old thing: http://catb.org/jargon/html/U/UUOC.html


For interactive use, I would like to point out that even better than this use of cat is less. If you pipe less into something, it forgets its interactive behaviour and works like cat on a single file. So:

  $ less foo | bar
Is similar to:

  $ bar < foo
Except that less is typically more clever than that and might be more like:

  $ zcat foo | bar
Depending on the file type of foo.


I would be remiss if I did not point out that calling said program cat is a misnomer. Instead of 'string together in a series' (the actual dictionary definition, which coincidentally, pipes actually do) it quickly became 'print whatever I type to the screen.'

Of course, the example @arendtio uses is correct, because they obviously care about such things.


Having separate commands for outputting the content of a single file and several files would, however, be an orthogonality violation. YMMV whether having a more descriptive name for the most common use of cat would be worth the drawback.


It would fit in the broader methodology of 'single purpose tools that do their job well' or 'small pieces, loosely joined', but yes, probably too annoying to bother with.


I usually replace cat with pv and get a nice progress bar and ETA :-)


I love the idea of simple things that can be connected in any way. I'm not so much a fan of "everything is a soup of bytes with unspecified encoding and unknown formatting".

It's an abstraction that has held up quite well, but it's starting to show its age.


I fully agree... and yet... everyone who has tried to "fix" this has failed, at least in the sense of "attaining anything like shell's size and reach". Many have succeeded in the sense of producing working code that fixes this in some sense.

Powershell's probably the closest to success, because it could be pushed out unilaterally. Without that I'm not sure it would have gotten very far, not because it's bad, but again because nobody else seems to have gotten very far....


100% agree. Having to extract information with regular expressions is a waste of time. If the structure of the data was available, you would have type safety / auto-completion. You could even have GUIs to compose programs.


Structured data flows in pipes too. Json can even be line-oriented. GUI programming fails when you get past a few hundred “lines” of complexity. What I’d love to see is a revolution of shells and terminals to more easily work with and pull from piped data.
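For instance, newline-delimited JSON already flows through ordinary line tools (a trivial sketch; a real pipeline would use jq for robust parsing rather than grepping raw JSON text):

```shell
# One JSON object per line means grep/sed/awk can still operate on records.
printf '{"name":"a","size":1}\n{"name":"b","size":2}\n' \
  | grep '"size":2'
# prints: {"name":"b","size":2}
```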


I hear what you're saying.

However, how can you ensure the output type of one program matches the input type of another?


Allow programs to specify the type of data they can consume and the type of the data they emit. This is how powershell does it (using the dotnet type system).


And the problem is how can you ensure the output type of one program matches the input type of another.

A program emits one type, and the other program accepts another.

Something will be needed to transform one type into another. Imagine doing that on the command line.


cat file1 | convert | dest


Having GUIs compose programs seems antithetical to the idea of shell scripts which are often thrown together quickly to get things done. Personally, I view shell scripting as a "good enough" and if you need more structure then you change your tools.


For an alternative view, don't forget to read the section on Pipes of The Unix-Haters Handbook: http://web.mit.edu/~simsong/www/ugh.pdf (page 198)


> When was the last time your Unix workstation was as useful as a Macintosh?

Some of that discussion has not aged well :)


The core critique - that everything is stringly typed - still holds pretty well though.

>The receiving and sending processes must use a stream of bytes. Any object more complex than a byte cannot be sent until the object is first transmuted into a string of bytes that the receiving end knows how to reassemble. This means that you can’t send an object and the code for the class definition necessary to implement the object. You can’t send pointers into another process’s address space. You can’t send file handles or tcp connections or permissions to access particular files or resources.


> You can’t send pointers into another process’s address space.

Thank goodness.


To be fair, the same criticism could be made of a socket. I think the issue is that some people want pipes to be something magical that connects their software, not a dumb connection between them.


I don't want all my pipes to be magical all the time, but occasionally I do want to write a utility that is "pipeline aware" in some sense. For example, I'd like to pipe mysql to jq and have one utility or the other realize that a conversion to JSON is needed in the middle for it to work.

I'm working on a library for this kind of intra-pipeline negotiation. It's all drawing-board stuff right now, but I cobbled together a proof of concept:

https://unix.stackexchange.com/a/495338/146169

Do you think this is a reasonable way to achieve the magic that some users want in their pipelines? Or are ancient Unix gods going to smite me for tampering with the functional consistency of tools by making their behavior different in different contexts?


This is interesting, yes. If the shell could infer the content type of data demanded or output by each command in a pipeline, then it could automatically insert type coercion commands or alter the options of commands to produce the desired content types.

You're right that it is in fact possible for a command to find the preceding and following commands using /proc, and figure out what content types they produce / want, and do something sensible. But there won't always be just one way to convert between content types...

Me? I don't care for this kind of magic, except as a challenge! But others might like it. You might need to make a library out of this because when you have something like curl(1) as a data source, you need to know what Content-Type it is producing, and when you can know explicitly rather than having to taste the data, that's a plus. Dealing with curl(1) as a sink and somehow telling it what the content type is would be nice as well.


My ultimate use case is a contrived environment where I have the luxury of ignoring otherwise blatant feature-gaps--such as compatibility with other tools (like curl). I've come to the same conclusions about why that might be tricky, so I'm calling it a version-two problem.

I notice that function composition notation; that is, the latter half of:

> f(g(x)) = (f o g)(x)

resembles bash pipeline syntax to a certain degree. The 'o' symbol can be taken to mean "following". If we introduce new notation where '|' means "followed by" then we can flip the whole thing around and get:

> f(g(x)) = (f o g)(x) = echo 'x' | g | f

I want to write some set of mathematically interesting functions so that they're incredibly friendly (like, they'll find and fix type mismatch errors where possible, and fail in very friendly ways when not). And then use the resulting environment to teach a course that would be a simultaneous intro into both category theory and UNIX.

All that to say--I agree about finding the magic a little distasteful, but if I play my cards right my students will only realize there was magic in play after they've taken the bait. At first it will all seem so easy...


The magic /proc thing is a very interesting challenge. Trust me, since I read your comments I've thought about how to implement it, though again, it's not the sort of thing I'd build for a production system, just a toy -- a damned interesting one. And as a tool for teaching how to find your way around an OS and get the information you need, it's very nice.

There are three parts to this: a) finding who's before and after the adapter in the pipe, b) figuring out how to use that information to derive content types, c) matching impedances.

(b) feels mundane: you'll have a table-driven approach to that. Maybe you'll "taste" the data when you don't find a match in the table?

(c) is not always obvious -- often the data is not structured. You might resort to using extended file attributes to store file content-type metadata (I've done this), and maybe you can find the stdin or other open files of the left-most command in a pipeline, then you might be able to guesstimate the content type in more cases. But obviously, a sed, awk, or cut is going to ruin everything. Even something like jq will: you can't assume the output and input will be JSON.
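Part (a) can be sketched on Linux by comparing pipe inodes in /proc. This is a toy sketch, assuming /proc is available; `pipe_peer_pids` is an invented name:

```python
# Toy sketch: find which processes' stdin is the same pipe object as our fd.
# Linux-only; relies on /proc/<pid>/fd/* symlinks reading as "pipe:[inode]".
import os

def pipe_peer_pids(fd=1):
    target = os.readlink(f"/proc/self/fd/{fd}")  # e.g. "pipe:[123456]"
    if not target.startswith("pipe:"):
        return []  # our fd isn't a pipe at all
    peers = []
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            # Both ends of one pipe show the same "pipe:[inode]" label.
            if os.readlink(f"/proc/{pid}/fd/0") == target:
                peers.append(int(pid))
        except OSError:
            continue  # process vanished, or we lack permission
    return peers
```

Run inside `mytool | jq`, `pipe_peer_pids(1)` would report jq's PID, which step (b) could then map to a content type via /proc/&lt;pid&gt;/cmdline.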

At some point you just want a Haskell shell (there is one). Or a jq shell (there is something like it too).

As to the pipe symbol as function composition: yes, that's quite right.


I wonder if something like HTTP’s content negotiation is a good model for this.


That sounds reasonable, I'll look into it--thanks.

I was imagining an algorithm where each pipeline-aware utility can derive port numbers to use to talk/listen to its neighbors. I may be able to use http content negotiation wholesale in that context.


I've been trying to solve the exact same problem with my shell too. Its pipes are typed, and all the builtin commands can then automatically decode those data types via shared libraries, so commands don't need to worry about how to decode and re-encode the data. This means that JSON, YAML, TOML, CSV, Apache log files, S-expressions and even tabulated data from `ps` (for example) can all be transparently handled the same way and converted from one to another without the tools ever needing to know how to marshal or unmarshal that data. For example: you could take a JSON array that hasn't been formatted with carriage returns and still grep through it item by item as if it were a multi-line string.

However the problem I face is how do you pass that data type information over a pipeline from tools that exist outside of my shell? It's all well and good having builtins that all follow that convention but what if someone else wants to write a tool?

My first thought was to use network sockets, but then you break piping over SSH, eg:

    local-command | ssh user@host "| remote-command"
My next thought was maybe this data should be in-lined - a bit like how ANSI escape sequences are in-lined and the terminals don't render them as printable characters. Maybe something like the following as a prefix to STDIN?

    <null>$SHELL<null>
But then you have the problem of tainting your data if any tools are sent that prefix in error.
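For what it's worth, the NUL-prefix idea is easy to prototype. This is only an illustration of the framing (the header format and function names are made up here), not a recommendation:

```python
# Toy framing for an in-line type header: b"\x00<content-type>\x00" + payload.
# A stream that doesn't start with NUL is treated as plain, untyped data.
import io

def write_typed(stream, content_type, payload):
    stream.write(b"\x00" + content_type.encode() + b"\x00" + payload)

def read_typed(stream):
    first = stream.read(1)
    if first != b"\x00":
        return None, first + stream.read()  # untyped: pass everything through
    header = bytearray()
    while (c := stream.read(1)) not in (b"\x00", b""):
        header += c
    return header.decode(), stream.read()

buf = io.BytesIO()
write_typed(buf, "application/json", b'{"a": 1}')
buf.seek(0)
print(read_typed(buf))  # ('application/json', b'{"a": 1}')
```

The tainting problem is visible even in this sketch: any payload that legitimately starts with a NUL byte would need escaping.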

I also wondered if setting environment variables might work, but that also wouldn't be reliable for SSH connections.

So as you can see, I'm yet to think up a robust way of achieving this goal. However in the case of builtin tools and shell scripts, I've got it working for the most part. A few bugs here and there but it's not a small project I've taken on.

If you fancy comparing notes on this further, I'm happy to oblige. I'm still hopeful we can find a suitable workaround to the problems described above.


> ...with my shell too...

I was hoping to stick with bash or zsh, and just write processes that somehow communicate out of band, but I think we're still up against the same problem.

One idea I had was that there's a service running elsewhere which maintains this directed graph (nodes = types, edges = programs which take the type of their "from" node and return the type of their "to" node). When a pipeline is executed, each stage pauses until type matches are confirmed--and if there is a mismatch then some path-finding algorithm is used to find the missing hops.

So the user can leave out otherwise necessary steps, and as long as there is only one path through the type graph which connects them, then the missing step can be "inserted". In the case of multiple paths, the error message can be quite friendly.

This means keeping your context small enough, and your types diverse enough, that the type graph isn't too heavily connected. (Maybe you'd have to swap out contexts to keep the noise down.) But if you have a layer that's modifying things before execution anyway, then perhaps you can have it notice the ssh call and modify it to set up a listener. Something like:

User Types:

    local-command | ssh user@host "remote-command"
Shell runs:

    local-command | ssh user@host "pull_metadata_from -r <caller's ip> | remote-command"
Where pull_metadata_from phones home to get the metadata, then passes along the data stream untouched.

Also, if you're writing the shell anyway, then you can have the pipeline run each process in a subshell where vars like TYPE_REGISTRY_IP and METADATA_INBOUND_PORT are defined. If they're using the network to type-negotiate locally, then why not also use the network to type-negotiate through an ssh tunnel?

This idea is, of course, over-engineered as hell. But then again this whole pursuit is.


> I was hoping to stick with bash or zsh, and just write processes that somehow communicate out of band, but I think we're still up against the same problem.

Yeah we have different starting points but very much similar problems.

tbh idea behind my shell wasn't originally to address typed pipelines, that was just something that evolved from it quite by accident.

Anyhow, your suggestion of overwriting / aliasing `ssh` is genius. Though I'm thinking rather than tunnelling a TCP connection, I could just spawn an instance of my shell on the remote server and then do everything through normal pipelines as I now control both ends of the pipe. It's arguably got less proverbial moving parts compared to a TCP listener (which might then require a central data type daemon et al) and I'd need my software running on the remote server for the data types to work anyway.

There is obviously a fair security concern some people might have about that, but if we're open and honest about it and offer an opt in/out (where opting out disables support for piped types over SSH), then I can't see people having an issue with it.

Coincidentally I used to do something similar in a previous job where I had a pretty feature rich .bashrc and no Puppet. So `ssh` was overwritten with a bash function to copy my .bashrc onto the remote box before starting the remote shell.

> This idea is, of course, over-engineered as hell. But then again this whole pursuit is.

Haha so true!

Thanks for your help. You may have just solved a problem I've been grappling with for over a year.


I was thinking something similar, buried in a library that everyone could link. It seems... awfully awkward to build, much less portably.

This reminds me of how busted Linux is for not having an SO_PEERCRED equivalent for TCP sockets. You can actually get that information by walking /proc/net/tcp or using AF_NETLINK sockets and inet_diag, but there is a race condition such that this isn't 100% reliable. An SO_PEERCRED-style option would [have to] be.


The problem with that is that each command in the pipeline would have to somehow be modified to convey content type metadata. Perhaps we could have a way to send ancillary metadata (a la Unix domain sockets SCM_*).


Yes. The compromise of just using an untyped byte stream in a single linear pipeline was a fair tradeoff in the 70s, but it is nearly 2020 and we can do better.


We have done better. The shell I'm writing is typed and I know I'm not the only person to do this (eg Powershell). The issue here is really more with POSIX compatibility but if you're willing to step away from that then you might find an alternative that better suits your needs.

Thankfully switching shells is as painless as switching text editors.


> Thankfully switching shells is as painless as switching text editors.

So, somewhere between, "That wasn't as bad as I feared," and, "Sweet Jesus, what fresh new hell have I found myself in"?


haha yes. I was thinking more about launching the shell but you're absolutely right that learning the syntax of a new shell is often non-trivial.


I'm not going to argue that UNIX got everything right because I don't believe that to be the case either but I don't agree with those specific points:

> This means that you can’t send an object and the code for the class definition necessary to implement the object.

To some degree you can, and I do just this with my own shell I've written. You just have to ensure that both ends of the pipe understand what is being sent (eg is it JSON, text, binary data, etc)? Even with typed shells (such as PowerShell), you still need both ends of the pipe to understand what to expect to some extent.

Having this whole thing happen automatically with a class definition is a little optimistic though. Not least of all because not every tool would be suited for every data format (eg a text processor wouldn't be able to do much with a GIF even if it has a class definition).

> You can’t send pointers into another process’s address space.

Good job too. That seems like a very easy path to exploit. Thankfully these days it's less of an issue because copying memory is comparatively quick and cheap compared to when that handbook was written.

> You can’t send file handles

Actually that's exactly how piping works, as technically the standard streams are just files. So you could launch a program with STDIN being a different file from the previous process's STDOUT.

> or tcp connections

You can if you pass it over a UNIX domain socket (where the network connection is exposed as a file descriptor).

> or permissions to access particular files or resources.

This is a little ambiguous. For example, you can pass strings that are credentials. However you cannot alter the running state of another program via its pipeline (aside from what files it has access to). To be honest I prefer the `sudo` type approach, but I don't know how much of that is because it's better and how much is because it's what I am used to.


>> You can’t send file handles

> Actually that's exactly how piping works

Also SCM_RIGHTS, which exists exactly for this purpose (see cmsg(3), unix(7) or https://blog.cloudflare.com/know-your-scm_rights/ for a gentler introduction and application).

That's been around since BSD 4.3, which predates the Hater's Handbook 1ed by 4 years or so.
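A minimal demonstration of SCM_RIGHTS from Python (3.9+ provides `socket.send_fds`/`recv_fds` as wrappers over the raw cmsg API). Both ends live in one process here for brevity; in practice they'd be separate processes connected by a Unix socket:

```python
# Pass a real file descriptor between two sockets with SCM_RIGHTS.
import os
import socket

parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

r, w = os.pipe()                          # an fd worth sending: a pipe's write end
socket.send_fds(parent, [b"take this"], [w])

msg, fds, flags, addr = socket.recv_fds(child, 1024, 1)
os.write(fds[0], b"hello via passed fd")  # write through the *received* duplicate
os.close(fds[0])

data = os.read(r, 100)
print(msg, data)  # b'take this' b'hello via passed fd'
```

The received descriptor is a true duplicate: writes through it land in the same pipe, exactly as if the original fd had been inherited.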


And that's how Unix is secretly a capability system


Yeah I had mentioned UNIX domain sockets. However your post does add a lot of good detail on them which I had left off.


Look at the alternatives though. Would you really want to use something like Spring in shell scripting?


No. I typically use Python as a drop-in replacement for shell scripts longer than ~10 lines of code.


macOS is layered on a UNIX-like OS. You can use pipes in its terminal windows.


This comment makes me feel really old.

macOS wasn't always layered on Unix, and the Unix-Haters Handbook predates the switch to the Unix-based Mac OS X.


Of course not, but the switch to BSD fixed a bunch of the underpinnings in the OS and was a sane base to work off of.

Not to put too fine a point on it, but they found religion. Unlike Classic (and early versions of Windows for that matter), there was more to be gained by ceding some control to the broader community. Microsoft has gotten better (PowerShell - adapting UNIX tools to Windows, and later WSL, where they went all in)

Still, for Apple it meant they had to serve two masters for a while - old school Classic enthusiasts and UNIX nerds. Reading the back catalog of John Siracusa's (one of my personal nerd heroes) old macOS reviews gives you some sense of just how weird this transition was.


The Unix Haters Handbook was published in 1994, when System 7 was decidedly not unix-like.


You can also drop the "-like". (-:

* https://unix.stackexchange.com/questions/1489/


... has it ? most people using macs never ever open a terminal.


The section on find after pipes has also not aged well. I can see why GNU and later GNU/Linux replaced most of the old Unices (I mean imagine having a find that doesn't follow symlinks!). If I may, a bit of code golf on the problem of "print all .el files without a matching .elc"

  find . -name '*.el' | while IFS= read -r el; do [ -f "${el}c" ] || echo "$el"; done
Of course this uses the dreaded pipes and doesn't support the extremely common filenames with a newline in them, so let's do it without them

  find . -name '*.el' -exec bash -c 'el=$0; elc="${el}c"; [ -f "$elc" ] || echo "$el"' '{}' ';'


So the dreaded space-in-filenames is a problem when you pass the '{}' to a script.

The following works very nicely for me:

  find . -name '*.el' -exec file {}c ';' 2>&1 | grep cannot


I should have said "works very nicely for me, including on file names with spaces"


Or p160 by internal numbering.


I twitched horribly at the final sentence, screaming inwardly "you don't pipe to /dev/null, you redirect to it". And now I feel like an arsehole.


Hmm well, the Unix shell seems to follow a plumbing metaphor.

You could direct, or redirect the flow to /dev/null. Or pipe to /dev/null. Or redirect the pipe to /dev/null?

So from a metaphor point of view either would fit.

Although of course you don't use the pipe construct to direct to a file. Which would suggest piping is wrong?

And then on the third hand, we all know what it means so what's the problem.

So I would say: there's war, famine and injustice in the world. Don't worry about POSIX shell semantics. :)


redirect your feelings to /dev/null, because a pipe will just give us a Permission denied


chmod +x /dev/null

(havent tried above, not sure I recommend that you do)


Well you can't read either from /dev/null, and I don't think that's just a question of permissions. I'm pretty sure it's impossible to get /dev/null to behave like an executable.


You can read from /dev/null—it just behaves as a zero-length file, immediately returning EOF.

This makes /dev/null a valid source file in many languages, C included.


/dev/null behaves like an empty file, which is (or used to be?) a valid executable.

Cf http://trillian.mit.edu/~jc/humor/ATT_Copyright_true.html or https://twitter.com/rob_pike/status/966896123548872705


In most contexts empty file is indeed a valid executable. Debian folks learned this the hard way recently:

https://bugs.debian.org/919341

However, executing /dev/null doesn't seem to work on Linux:

  $ sudo chmod 777 /dev/null 
  $ /dev/null
  bash: /dev/null: Permission denied



Interesting question.

You could write a executable that accepts piped input and throws it away.

When would it exit though? Would it exit successfully at the end of the input stream? That sounds sensible.

That would be behaving like an executable wouldn't it?


> When would it exit though? Would it exit successfully at the end of the input stream?

A process that writes to a pipe whose read end has been closed receives SIGPIPE. The default disposition for SIGPIPE is to terminate the program (similar to SIGTERM or SIGINT). The reading side is simpler: once the previous program in the pipeline closes its stdout (either explicitly, or implicitly by just exiting) and the pipe's buffer has been depleted, read() returns 0, i.e. EOF. So yeah, our /dev/null-like program would just read until EOF and then exit.

However, our program could also ignore the EOF and keep calling read() (it would just keep getting 0 back). In that case, it could run indefinitely. But at this point, you're way past normal behavior.
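These pipe semantics are easy to check from Python (a self-contained demo; POSIX-only because of SIGPIPE): a reader sees EOF once the writer closes, while a writer into a closed read end gets SIGPIPE/EPIPE.

```python
# Reader side: EOF (read() -> b"") once the writer closes; no signal involved.
# Writer side: SIGPIPE / EPIPE when the read end is gone.
import errno
import os
import signal

signal.signal(signal.SIGPIPE, signal.SIG_IGN)  # turn fatal SIGPIPE into EPIPE

r, w = os.pipe()
os.write(w, b"hi")
os.close(w)
assert os.read(r, 10) == b"hi"
assert os.read(r, 10) == b""        # writer gone + buffer drained => EOF
os.close(r)

r2, w2 = os.pipe()
os.close(r2)                        # reader goes away
try:
    os.write(w2, b"x")
    raise AssertionError("expected EPIPE")
except OSError as e:
    assert e.errno == errno.EPIPE   # would have been fatal SIGPIPE by default
os.close(w2)
```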


As long as it does not exit you can still catch it through /proc/


Pipes are awesome and infuriating.

Sometimes they work great -- being able to dump from MySQL into gzip sending across the wire via ssh into gunzip and into my local MySQL without ever touching a file feels nothing short of magic... although the command/incantation to do so took quite a while to finally get right.

But far too often they inexplicably fail. For example, I had an issue last year where piping curl to bunzip would just inexplicably stop after about 1GB, but it was at a different exact spot every time (between 1GB and 1.5GB). No error message, no exit, my network connection is fine, just an infinite timeout. (While curl by itself worked flawlessly every time.)

And I've got another 10 stories like this (I do a lot of data processing). Any given combination of pipe tools, there's a kind of random chance they'll actually work in the end or not. And even more frustrating, they'll often work on your local machine but not on your server, or vice-versa. And I'm just running basic commodity macOS locally and out-of-the-box Ubuntu on my servers.

I don't know why, but many times I've had to rewrite a piped command as streams in a Python script to get it to work reliably.


> Any given combination of pipe tools, there's a kind of random chance they'll actually work in the end or not.

While this may be your experience, the mechanism of FIFO pipes in Unix (which is file handles and buffers, basically) is an old one that is both elegant and robust; it doesn't "randomly" fail due to unreliability of the core algorithm or components. In 20 years, I never had an init script or bash command fail due to the pipe(2) call itself being unreliable.

If you misunderstand detailed behavior of the commands you are stitching together--or details of how you're transiting the network in case of an ssh remote command--then yes, things may go wrong. Especially if you are creating Hail Mary one-liners, which become unwieldy.


I've got to agree. I can’t recall a pipe ever failing due to unreliability.

One issue I did used to have (before I discovered ‘-o pipefail’[1]) was the annoyance that if an earlier command in a pipeline failed, all the other commands in the pipeline still ran albeit with no data or garbage data being piped to them.

[1] https://stackoverflow.com/questions/1550933/catching-error-c...


Perhaps your example was contrived, but why would you pipe into gzip instead of using transparent ssh compression?


Because it simply never occurred to me to check if ssh would have compression built-in.

Because why would it? If the UNIX philosophy is to separate out tools and pipe them, then the UNIX philosophy should be to pipe through gzip and gunzip, not for ssh to provide its own redundant compression option, right?


This is a good example of where that simple rule breaks down: piping it would only work when you are running a command and feeding its output to a different location whereas having it in SSH helps with everything so e.g. typing `ls` in a big directory, `cat`-ing a file, using `scp` on text, etc. benefits.


I've built a pipe which in very rare cases ran into a segfault. Never found out why.


I recently came across Ramda CLI's interactive mode [1]

It essentially hijacks the pipe's input and output into the browser, where you can play with the Ramda command. Then you just close the browser tab and Ramda CLI applies your changed code in the pipe, resuming its operation.

Now I'm thinking of all kinds of ways I use pipes that I could "tee" through a browser app. I can use the browser for interactive JSON manipulation, visualization and all-around playing. I'm now looking for ways to generalize Ramda CLI's approach. Pipes, Unix files and HTTP don't seem directly compatible, but the promise is there. The Unix tee command doesn't "pause" the pipe, but one could probably just introduce a pause/resume passthrough command into the pipe after it. Then a web server tool can send the tee'd file to the browser and catch the output from there.

[1] https://github.com/raine/ramda-cli#interactive-mode


You can just store the first pipeline results in a file, edit it, then use the file as an input for the second pipeline.


Well, yes, but that kind of defeats the transient nature of data moving through a pipe. Testing, debugging and operating on pipe-based processing benefits from a tight feedback loop. I'd rather keep that as much as possible.


Yes, pipes are awesome, and the concepts actually translate well to in-process usage with structured data.

https://github.com/mpw/MPWFoundation/blob/master/Documentati...

One aspect is that the coordinating entity hooks up the pipeline and then gets out of the way, the pieces communicate amongst themselves, unlike FP simulations, which tend to have to come back to the coordinator.

This is very useful in "scripted-components" settings where you use a flexible/dynamic/slow scripting language to orchestrate fixed/fast components, without the slowness of the scripting language getting in the way. See sh :-)

Another aspect is error handling. Since results are actively passed on to the next filter, the error case is simply to not pass anything. Therefore the "happy path" simply doesn't have to deal with error cases at all, and you can deal with errors separately.

In call/return architectures (so: mostly everything), you have to return something, even in the error case. So we have nil, Maybe, Either, tuples or exceptions to get us out of Dodge. None of these is particularly good.

And of course | is such a perfect combinator because it is so sparse. It is obvious what each end does, all the components are forced to be uniform and at least syntactically composable/compatible.

Yay pipes.
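The "error case is simply to not pass anything" idea maps neatly onto generator pipelines; a small illustration in Python (the stage names here are invented):

```python
# Each stage is a filter: it yields results downstream and, on error,
# simply yields nothing -- so the happy path never sees an error value.
def parse_ints(lines):
    for line in lines:
        try:
            yield int(line)
        except ValueError:
            pass  # drop bad input; error reporting could happen here, separately

def double(nums):
    for n in nums:
        yield n * 2

# Compose the stages like a pipeline: parse_ints | double
out = list(double(parse_ints(["1", "oops", "3"])))
print(out)  # [2, 6]
```

Note that `double` contains no error handling at all, yet the bad record never reaches it.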


pipe junkies might like to know about the following tools:

* vipe (part of https://joeyh.name/code/moreutils/ - lets you edit text part way through a complex series of piped commands)

* pv (http://www.ivarch.com/programs/pv.shtml - lets you visualise the flow of data through a pipe)


Yes! I love pv. Besides that and tee, can anyone else suggest some more general pipe tools?


http://joeyh.name/code/moreutils/ have a couple more:

* pee: tee standard input to pipes (`pee "some-command" "another-command"`)

* sponge: soak up standard input and write to a file

Though in zsh and bash you can recreate pee using tee: `tee >(some-command) >(another-command) >/dev/null`



Sanjay Ghemawat (the other less visible half of Jeff Dean) wrote a pipe library in Go, learnt quite a bit from it.

https://github.com/ghemawat/stream

Edit: Jeff Dean, not James Dean


Cool, the pipe command must be one of the most essential things in Unix/Linux based systems.

I would have loved to see some awesome pipe examples though.


Ok, here are some example pipelines:

A simple virus scanner in one line of pipe:

https://everythingsysadmin.com/2004/10/whos-infected.html

And a bunch of pipe tricks that are oh so wrong but oh so useful:

https://everythingsysadmin.com/2012/09/unorthodoxunix.html


Back when I first started using Linux you could pipe random data to /dev/dsp and the speakers would emit various beeps. It used to be a pretty cool trick; I think it stopped working when ALSA came out.


destroyallsoftware's screencasts have pretty good pipe usage / unix-fu.


Why isn't the pipe a construct that has caught on in 'proper' languages?


Clojure has something that you could call a pipe almost. `->` passes the output from one form to the next one.

This example has a nested hash map where we try to get the "You got me!" string.

We can either use `:a` (keyword) as a function to get the value. Then we have to nest the function calls a bit unnaturally.

Or we can use the thread-first macro `->`, which is basically a unix pipe.

   user=> (def res {:a {:b {:c "You got me!"}}})
   #'user/res
   
   user=> res
   {:a {:b {:c "You got me!"}}}
   
   user=> (:c (:b (:a res)))
   "You got me!"
   
   user=> (-> res :a :b :c)
   "You got me!"
Thinking about it, Clojure advocates having small functions (similar to unix's "small programs / do one thing well") that you compose together to build bigger things.


Clojure: The Pure Function Pipeline Data Flow

https://github.com/linpengcheng/PurefunctionPipelineDataflow


Why are you linking to the same github project over and over again?


Sorry, but the system does not support deletion now.


It has, in the form of function composition, as other replies show. However, the Unix pipe demonstrates a more interesting idea: composable programs on the level of the OS.

Nowadays, most of the user-facing desktop programs have GUIs, so the 'pipe' operator that composes programs is the user himself. Users compose programs by saving files from one program and opening them in another. The data being 'piped' through such program composition is sort-of typed, with the file types (PNG, TXT, etc) being the types and the loading modules of the programs being 'runtime typecheckers' that reject files with invalid format.

At first sight, GUIs prevent program composition by requiring the user to serve as the 'pipe'. However, if GUIs were reflections / manifestations of some rich typed data (expressible in some really powerful type system, such as that of Idris), one could imagine the possibility of directly composing the programs together, bypassing the GUI or file-saving stages.


maybe I'm being overly pedantic, but people seem to be confused about this:

the pipe in your typical functional language (`|>`) is not a form of function composition, like

    f >> g === x -> g(f(x))

but function application, like

    f x |> g === g(f(x))

    x |> f |> g // also works, has the same meaning

    f |> g // just doesn't work, sorry :(
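The application-versus-composition distinction can be mimicked in Python with a throwaway wrapper (the `Pipe` class here is invented purely for illustration):

```python
# `|` here plays the role of |>: it *applies* the right-hand function to the
# value carried on the left, rather than composing two functions.
class Pipe:
    def __init__(self, value):
        self.value = value
    def __or__(self, fn):
        return Pipe(fn(self.value))

inc = lambda n: n + 1
double = lambda n: n * 2

applied = (Pipe(3) | inc | double).value    # application: double(inc(3))
print(applied)  # 8

# Composition (the >> case) instead builds a new function before any value exists:
compose = lambda f, g: (lambda x: g(f(x)))
print(compose(inc, double)(3))  # 8
```

`Pipe(3) | inc` needs a value to start from, just as `x |> f` does; `compose(inc, double)` does not, just as `f >> g` does not.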


> the pipes in your typical functional language (`|>`) is not a form of function composition

What is a "typical functional language" in this case? I don't think I've come across this `|>` notation, or anything explicitly referred to as a "pipe", in the functional languages I tend to use (Haskell, Scheme, StandardML, Idris, Coq, Agda, ...); other than the Haskell "pipes" library, which I think is more elaborate than what you're talking about.


It is! It's the main way of programming in lazy functional languages like Haskell.

And many programming languages have libraries for something similar: iterators in Rust / C++, streams in Java / C#, things like Reactive.


Haskell was the only one I found.

Iterators don't really fully capture what a pipe is though? There's no parallelism.

And streams don't have the conceptual simplicity of a pipe?


> Iterators don't really fully capture what a pipe is though? Theres no parallelism.

Pipes are concurrent, not necessarily parallel. Iterators are concurrent, and can be parallel (https://docs.rs/rayon/0.6.0/rayon/par_iter/index.html).


Many functional languages have |> for piping, but chained method calls are also a lot like pipelines. Data goes from left to right. This javascript expression:

  [1, 2, 3].map(n => n + 1).join(',').length
Is basically like this shell command:

  seq 3 | awk '{ print $1 + 1 }' | tr '\n' , | wc -c
(the shell version gives 6 instead of 5 because of a trailing newline, but close enough)


But each successive 'command' is a method on what's constructed so far; not an entirely different command to which we delegate processing of what we have so far.

The Python:

   len(','.join(map(lambda n: str(n + 1), range(1, 4))))
is a bit closer, but the order's now reversed, and then jumbled by the map/lambda. (Though I suppose arguably awk does that too.)


That's true. It's far from being generally applicable. But it might be the most "mainstream" pipe-like processing notation around.

Nim has an interesting synthesis where a.f(b) is only another way to spell f(a, b), which (I think) matches the usual behavior of |> while still allowing familiar-looking method-style syntax. These are equivalent:

  [1, 2, 3].map(proc (n: int): int = n + 1).map(proc (n: int): string = $n).join(",").len
  
  len(join(map(map([1, 2, 3], proc (n: int): int = n + 1), proc (n: int): string = $n), ","))
The difference is purely cosmetic, but readability matters. It's easier to read from left to right than to have to jump around.


C# extension methods provide the same syntax, and it is used for all of its LINQ pipeline methods. It's amazing how effective syntactic sugar can be for readability.


In Julia, this would be:

1:3 |> x->x+1 |> x->join(x,",") |> length


Small correction (or it won't run on my system):

  1:3 |> x -> map(y -> y+1, x) |> x -> join(x, ",") |> length
All those anonymous functions seem a bit distracting, though https://github.com/JuliaLang/julia/pull/24990 could help with that.


Ok, in Julia 1.0+ you just have to use the dot operator:

1:3 |> x->x.+1 |> x->join(x,",") |> length

Note the dot in x.+1, that tells + to operate on each element of the array (x), and not the array itself.


Ok... not sure which version of Julia you're using, but I'm on 0.5 and it works there... Maybe it changed in a later version


Maybe a bit nicer

    (1:3 .|> x->x+1) |> x->join(x,",") |> length


Elixir also has a pipe construct:

    f |> g(1)
would be equivalent to

    g(f, 1)


Others have shown that many languages have some kind of pipe support, but not exactly like the shell.

The shell has the weird(?) behavior of ONE input and TWO outputs (stdout, stderr).

Also, it can redirect in both directions. I think for a language to be like pipes, it needs each function to be like:

    fun open(...) -> Result(Ok,Err)
and have the option of chaining not only the Ok side but the Err:

    open("file.txt") |> print !!> raise |> print
Does something like this exist???
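The idea can be sketched in Python. This is a hedged illustration, not a real library: the Result class and the then/catch method names are made up here to stand in for the |> and !!> operators above.

```python
# A sketch of a Result type whose chaining can route the Ok and Err
# channels separately, roughly like stdout vs stderr in a shell pipe.
class Result:
    def __init__(self, ok=None, err=None):
        self.ok, self.err = ok, err

    def then(self, f):
        # Chain on the Ok channel; an Err passes through untouched.
        return f(self.ok) if self.err is None else self

    def catch(self, f):
        # Chain on the Err channel; an Ok passes through untouched.
        return f(self.err) if self.err is not None else self

def parse_int(s):
    try:
        return Result(ok=int(s))
    except ValueError as e:
        return Result(err=str(e))

r = parse_int("42").then(lambda n: Result(ok=n + 1))
assert r.ok == 43

# The Err channel skips the Ok handlers and is caught at the end.
bad = parse_int("nope").then(lambda n: Result(ok=n + 1)).catch(lambda e: Result(ok=0))
assert bad.ok == 0
```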


OCaml has had the pipe operator |> since 4.01 [0]

It actually somewhat changes the way you write code, because it enables chaining of calls.

It's worth noting there's nothing preventing this being done before the pipe operator using function calls.

x |> f |> g is by definition the same as (g (f x)).

In non-performance-sensitive code, I've found that what would be quite a complicated monolithic function in an imperative language often ends up as a composition of more modular functions piped together. As others have mentioned, there are similarities with the method chaining style in OO languages.

Also, I believe Clojure has piping in the form of the -> thread-first macro.

[0] https://caml.inria.fr/pub/docs/manual-ocaml/libref/Pervasive...


IMO the really useful part of pipes is less the operator and more the lazy, streaming, concurrent processing model.

So lazy collections / iterators, and HoFs working on those.

The pipe operator itself is mostly a way to denote the composition in reading order (left to right instead of right to left / inside to outside), which is convenient for readability but not exactly world-breaking.
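The lazy, streaming part can be sketched with plain Python generators, where each stage pulls items from upstream one at a time instead of materializing the whole collection (the stage names here are just for illustration):

```python
# Each stage is a generator: it pulls items lazily from upstream,
# much like processes connected by a pipe.
def numbers(n):
    for i in range(1, n + 1):
        yield i

def increment(items):
    for x in items:
        yield x + 1

def evens_only(items):
    for x in items:
        if x % 2 == 0:
            yield x

# Compose the stages; nothing runs until results are pulled on demand.
pipeline = evens_only(increment(numbers(5)))
print(list(pipeline))  # [2, 4, 6]
```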


What you're describing sounds a lot like reactiveX - http://reactivex.io/

I only have experience using RxJS, but it's incredibly powerful.


To take any significant advantage of it you need a data-driven, transformational approach to solving problems. But the funny thing is, once you have that, it's not really a big deal even if you don't have a pipe operator.


Monads are effectively pipes; the monad controls how data flows through the functions you put into the monad, but the functions individually are like individual programs in a pipe.


I'm not sure I agree with this. Function composition is more directly comparable to pipes, whereas I tend to think of monads as collapsing structure (i.e. `join :: m (m a) -> m a`)


Some of the most common monads (especially lists and other collection-like things) feel very much like pipes.

    list.select { |x|  x.foo > 10 }.map { |x| x.bar }...
Etc.

I wouldn't make the same argument about the IO monad, which I think more in terms of a functional program which evaluates to an imperative program. But most monads are not like the IO monad, in my experience at least.


> list.select { |x| x.foo > 10 }.map { |x| x.bar }...

Forgive me if I'm misreading this syntax, but to me this looks like plain old function composition: a call to `select` (I assume that's like Haskell's `filter`?) composed with a call to `map`. No monad in sight.

As I mentioned, monads are more about collapsing structure. In the case of lists this could be done with `concat` (which is the list implementation of monad's `join`) or `concatMap` (which is the list implementation of monad's `bind` AKA `>>=`).


Nope, it's not. It's Ruby, and the list could be an eager iterator, an actual list, a lazy iterator, a Maybe (though it would be clumsy in Ruby), etc.

And monads are not "more about collapsing structure". They are just a design pattern that follows a handful of laws. It seems like you're mistaking their usefulness in Haskell for what they are. A lot of other languages have monads either baked in or an element of the design of libraries. Expand your mind out of the Haskell box :)


The things you listed may form a monad, but your list/iterator transformation doesn't use monadic composition.

> list.select { |x| x.foo > 10 }.map { |x| x.bar }

Where's `bind` or `join` in that example?


> It's Ruby

Thanks for clarifying; I've read a bunch of Ruby but never written it before ;)

From a quick Google I see that "select" and "map" do work as I thought:

https://ruby-doc.org/core-2.2.0/Array.html#method-i-select

https://ruby-doc.org/core-2.2.0/Array.html#method-i-map

So we have a value called "list", we're calling its "select" method/function and then calling the "map" method/function of that result. That's just function composition; no monads in sight!

To clarify, we can rewrite your example in the following way:

    list.select { |x|  x.foo > 10 }.map { |x| x.bar }

    # Define the anonymous functions/blocks elsewhere, for clarity
    list.select(checkFoo).map(getBar)

    # Turn methods into standalone functions
    map(select(list, checkFoo), getBar)

    # Swap argument positions
    map(getBar, select(checkFoo, list))

    # Curry "map" and "select"
    map(getBar)(select(checkFoo)(list))

    # Pull out definitions, for clarity
    mapper   = map(getBar)
    selector = select(checkFoo)
    mapper(selector(list))
This is function composition, which we could write:

    go = compose(mapper, selector)
    go(list)
The above argument is based solely on the structure of the code: it's function composition, regardless of whether we're using "map" and "select", or "plus" and "multiply", or any other functions.

To understand why "map" and "select" don't need monads, see below.

> the list could be an eager iterator, an actual list, a lazy iterator, a Maybe (though it would be clumsy in Ruby), etc.

Yes, that's because all of those things are functors (so we can "map" them) and collections (so we can "select" AKA filter them).

The interface for monad requires a "wrap" method (AKA "return"), which takes a single value and 'wraps it up' (e.g. for lists we return a single-element list). It also requires either a "bind" method ("concatMap" for lists) or, my preference, a "join" method ("concat" for lists).
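In Python terms, that interface for lists can be sketched like this (wrap/join/bind are the generic names used above; this is an illustration with plain lists, not a library):

```python
# The list monad's interface, sketched with plain Python lists.
def wrap(x):
    # "return": wrap a single value in a one-element list
    return [x]

def join(xss):
    # "concat": collapse one level of nesting
    return [x for xs in xss for x in xs]

def bind(xs, f):
    # "concatMap": map a list-producing function, then flatten
    return join([f(x) for x in xs])

# Right-identity law: bind(m, wrap) == m
m = [1, 2, 3]
assert bind(m, wrap) == m

# bind can change the shape of the result, which plain map cannot:
assert bind([1, 2], lambda x: [x, x * 10]) == [1, 10, 2, 20]
```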

I can show that your example doesn't involve any monads by defining another type which is not a monad, yet will still work with your example.

I'll call this type a "TaggedList", and it's a pair containing a single value of one type and a list of values of another type. We can implement "map" and "select" by applying them to the list; the single value just gets passed along unchanged. This obeys the functor laws (I encourage you to check this!), and whilst I don't know of any "select laws" I think we can say it behaves in a reasonable way.

In Haskell we'd write something like this (although Haskell uses different names, like "fmap" and "filter"):

    data TaggedList t1 t2 = T t1 [t2]

    instance Functor (TaggedList t1) where
      map f (T x ys) = T x (map f ys)

    instance Collection (TaggedList t1) where
      select f (T x ys) = T x (select f ys)
In Ruby we'd write something like:

    class TaggedList
      def initialize(x, ys)
        @x  = x
        @ys = ys
      end

      def map(f)
        TaggedList.new(@x, @ys.map(f))
      end

      def select(f)
        TaggedList.new(@x, @ys.select(f))
      end
    end
This type will work for your example, e.g. (in pseudo-Ruby, since I'm not so familiar with it):

    myTaggedList = TaggedList.new("hello", [{foo: 1, bar: true}, {foo: 20, bar: false}])
    result = myTaggedList.select { |x|  x.foo > 10 }.map { |x| x.bar }

    # This check will return true
    result == TaggedList.new("hello", [false])
Yet "TaggedList" cannot be a monad! The reason is simple: there's no way for the "wrap" function (AKA "return") to know which value to pick for "@x"!

We could write a function which took two arguments, used one for "@x" and wrapped the other in a list for "@ys", but that's not what the monad interface requires.

Since Ruby's dynamically typed (AKA "unityped") we could write a function which picked a default value for "@x", like "nil"; yet that would break the monad laws. Specifically:

    bind(m, wrap) == m
If "wrap" used a default value like "nil", then "bind(m, wrap)" would replace the "@x" value in "m" with "nil", and this would break the equation in almost all cases (i.e. except when "m" already contained "nil").


tiny clarification: Function composition is just function composition. Pipes are a different syntax for function application.

see https://news.ycombinator.com/item?id=18971451


More specifically, a pipe is a monad that abstracts your machine's state. It's basically equivalent to Haskell's IO.


Hmmm, well that's why I like Ruby and other languages with functional approaches. Method chaining and blocks are very similar to pipes to me.

    cat /etc/passwd | grep root | awk -F: '{print $3}'

    ruby -e 'puts File.read("/etc/passwd").lines.select { |line| line.match(/root/) }.first.split(":")[2]'
A little more verbose, but the idea is the same.

https://alvinalexander.com/scala/fp-book/how-functional-prog...


awk -F: '/root/ {print $3}' < /etc/passwd


As others have mentioned, the pipe construct is present in many languages (or can be added).

A small additional bit of information: this style is called "tacit" (or "point-free") programming. See https://en.wikipedia.org/wiki/Tacit_programming

(Unix pipes are even explicitly mentioned in the articles as an example)


Julia has a pipe operator, which applies a function to the preceding argument:

julia> 1:5 |> x->x.^2 |> x->2x

5-element Array{Int64,1}:

  2
  8
 18
 32
 50


There is a tradition in C++ of overloading operator| for pipelining range operations. Your mileage may vary.


It doesn't look the same, but Go's io.Reader and io.Writer are the interfaces you implement if you want the equivalent of "reading from stdin"/"writing to stdout". Once implemented, io.Copy is the actual piping operation.


It has. D's uniform function call syntax makes this as easy as auto foo = some_array.filter!(predicate).array.sort.uniq; for the unique elements of the sorted array that satisfy the predicate.


To add to the list, pipes are also heavily used in modern R


F# has a pipe operator


Haskell has the Conduit library. There's jq's notion of pipes (which isn't the same as Unix pipes, but still).


because a | b | c is equivalent to the very traditional c(b(a())), and in FP it was infixed as `.`
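That equivalence can be sketched with a tiny helper in Python (pipe here is a hypothetical name, not a standard function):

```python
from functools import reduce

# A tiny pipe: apply each function to the result of the previous one,
# so pipe(x, f, g, h) == h(g(f(x))), but read left to right.
def pipe(value, *funcs):
    return reduce(lambda acc, f: f(acc), funcs, value)

result = pipe(range(1, 4),
              lambda xs: [x + 1 for x in xs],     # [2, 3, 4]
              lambda xs: ",".join(map(str, xs)),  # "2,3,4"
              len)
print(result)  # 5
```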


awk, grep, sort, and pipe. I'm always amazed at how well thought out, simple, functional, and fast the unix tools are. I still prefer to sift through and validate data using these tools rather than use excel or any full-fledged language.

Edit: Also "column" to format your output into a table.


Although I probably use it multiple times everyday, I hate column. At least the implementation I use has issues with empty fields and a fixed maximum line length.

Edit: s/files/fields/


I have wondered for a long time why pipes are not used more often in production-grade applications.

I have seen plenty of pipe use in bash scripts, especially for build and ETL purposes. Other languages have ways to do pipes as well (e.g. Python https://docs.python.org/2/library/subprocess.html#replacing-...) but I have seen much less use of it.

It appears to me, that for more complex applications one rather opts for TCP-/UDP-/UNIX-Domain sockets for IPC.

- Has anyone here tried to plumb together large applications with pipes?

- Was it successful?

- Which problems did you run into?


The biggest issue is that pipes are unidirectional, while not all data flow is unidirectional.

Some functional programming styles are pipe-like in the sense that data-flow is unidirectional:

  Foo(Bar(Baz(Bif(x))))
is analogous to:

  cat x | Bif | Baz | Bar | Foo
Obviously the order of evaluation will depend on the semantics of the language used; most eager languages will fully evaluate each step before the next. (Actually this is one issue with Unix pipes; the flow-control semantics are tied to the concept of blocking I/O using a fixed-size buffer)
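The blocking-I/O flow control can be seen with real processes. Here is a sketch using Python's subprocess module to build the equivalent of a three-stage shell pipeline (printf | tr | wc -l), assuming those standard utilities are on the PATH:

```python
import subprocess

# Build the equivalent of: printf 'a\nb\nc\n' | tr a-z A-Z | wc -l
# Each process reads the previous one's stdout through a real pipe,
# so the stages run concurrently, with the kernel's fixed-size pipe
# buffers providing the flow control.
p1 = subprocess.Popen(["printf", "a\\nb\\nc\\n"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["tr", "a-z", "A-Z"], stdin=p1.stdout, stdout=subprocess.PIPE)
p3 = subprocess.Popen(["wc", "-l"], stdin=p2.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # let p1 receive SIGPIPE if p2 exits early
p2.stdout.close()  # same for p2 if p3 exits early
out, _ = p3.communicate()
print(out.decode().strip())  # 3
```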

The idea of dataflow programming[1] is closely related to pipes and has existed for a long time, but it has mostly remained a niche, at least outside of hardware-design languages

1: https://en.wikipedia.org/wiki/Dataflow_programming


Pipes are not unidirectional on FreeBSD, interestingly.


I built an entire prototype ML system/pipeline using shell scripts that glued together two python scripts that did some heavy lifting not easily reproduced.

I got the whole thing working from training to prediction in about 3 weeks. What I love about Unix shell commands is that you simply can't abstract beyond the input/output paradigm. You aren't going to create classes, type classes, tests, etc. It's not possible, or not worth it.

I'd like to see more devs use this approach, because it's a really nice way to get a project going in order to poke holes in it or see a general structure. I consider it a sketchpad of sorts.


My backup system at work is mostly bash scripts and some pipes.

If you write them cleanly they don’t suck and crucially for me, bash today works basically the same way as it did 10 years ago and likely in 10 years, that static nature is a big damn win.

I sometimes wish language vendors would just say ‘this language is complete, all future work will be bug fixes and libraries’ a static target for anything would be nice.

Elixir did say that recently except for one last major change which moved it straight up my list of things to look at in future.


- My IRC notification system is a shell script with entr, notify-send and dunst.

- My mail setup uses NMH, everything is automated. I can mime-encode a directory and send the resulting mail in a breeze.

- GF's photos from IG are being backed up with a python script and crontab. Non IG ones are geotagged too with a script. I just fire up some cli GPS tools if we hike some mountain route, and gpscorrelate runs on the GPX file.

- Music is almost all chiptunes; I lost interest in mainstream music around 2003-4. I mirror a site with wget and it's done. If they offered rsync...

- Hell, even my podcasts are fetched via cron(8).

- My setup is CWM/cli based, except for mpv, emulators, links+ and vimb for sites that need JS. Noice is my fm, or the pure shell. find(1) and mpg123/xmp generate my music playlist. Street View is maybe the only service I use in vimb...

The more you automate, the fewer tasks you need to do. I am starting to avoid even taskwarrior/timew, because I am almost task-free as I don't have to track a trivial <5m script, and spt https://github.com/pickfire/spt is everything I need.

Also, now I can't stand any classical desktop; I find bloat in everything.


The Pure Function Pipeline Data Flow

https://github.com/linpengcheng/PurefunctionPipelineDataflow

Using the input and output characteristics of pure functions, pure functions are used as pipelines. A dataflow is formed by a series of pure functions in series. A dataflow code block acts as a function, equivalent to an integrated circuit element (or board). A complete integrated system is formed by serial or parallel dataflow.

The data-flow is the current, a function is a chip, a thread macro (->>, ->, etc.) is a wire, and the entire system is an integrated circuit that is energized.


Yes, there is a strong connection between pipes and functional programming, they all transform what passes through them and should not retain state when implemented properly.


They also both rely on very generic data types as input and output. Which is why I was surprised to see the warning about avoiding columnar data. Tabular data is a basic way of expressing data and their relationships.


I'm not a Clojure person, but there are transducers, https://clojure.org/reference/transducers.


Which are in a way comparable, but I'd say that pipes are more like the arrow ->.

Transducers are composable algorithmic transformations that, in a way, generalize map, filter and friends, but they can be used anywhere you transform data. Transducers have to be invoked and handled in a way that pipes do not.

Anyone interested should check out Hickey's talks about them. They are generally a lot more efficient than chaining higher-order list processing functions, and since they don't build intermediate results they have much better GC performance.


Fully agree, pipes are awesome, only downside is the potential duplicate serialization/deserialization overhead.

Streams in most decent languages closely adhere to this idea.

I especially like how node does it; in my opinion it's one of the best things in node. You can simply create cli programs that have backpressure, the same as when you work with binary/file streams, while also supporting object streams.

    process.stdin.pipe(byline()).pipe(through2(transformFunction)).pipe(process.stdout)


Node streams are excellent, but unfortunately don't get as much fanfare as Promises/async+await. A number of times I have gotten asked "how come my node script runs out of memory" -- due to the dev using await and storing the entirety of what is essentially streaming data in memory in between processing steps.


Pipes have been a game-changer for me in R with the tidyverse suite of packages. Base R doesn't have pipes, requiring a bit more saving of objects or a compromise on code readability.

One criticism would be that ggplot2 uses the "+" to add more graph features, whereas the rest of tidyverse uses "%>%" as its pipe, when ideally ggplot2 would also use it. One of my most common errors with ggplot2 is not utilizing the + or the %>% in the right places.


I've always thought of ggplot2's process as building a plot object. Most steps only add to the input.

Of course, Hadley admitted it was because he wrote ggplot2 before adopting pipes into his packages.


Unix's philosophy of “do one thing well” and “expect the output of every program to become the input to another” lives on today in "microservices".


> “Perhaps surprisingly, in practice it turns out that the special case is the main use of the program.”

This is, in fact, a Useless Use of Cat [1]. POSIX shells have the < operator for directing a single file to stdin:

    figlet <file
[1] http://porkmail.org/era/unix/award.html


If you want to see what the endgame of this is when taking the reasoning to the maximum, look at visual dataflow languages such as Max/MSP, PureData, Reaktor, LabVIEW...

Like always, simple stuff will be simple (http://write.flossmanuals.net/pure-data/wireless-connections...) and complicated stuff will be complicated (https://ni.i.lithium.com/t5/image/serverpage/image-id/96294i...).

No silver bullet guys, sorry. If you take out the complexity of the actual blocks to have multiple small blocks then you just put that complexity at another layer in the system. Same for microservices, same for actor programming, same for JS callback hell...


> Like always, simple stuff will be simple (http://write.flossmanuals.net/pure-data/wireless-connections...)

That is not actually simple because the data is flowing across two completely different message passing paradigms. Many users of Max/MSP and Pd don't understand the rules for such dataflow, even though it is deterministic and laid out in the manual IIRC.

The "silver bullet" in Max/MSP would be to only use the DSP message passing paradigm. There, all objects are guaranteed to receive their input before they compute their output.

However, that would make a special case out of GUI building/looping/branching. For a visual language designed to accommodate non-programmers, the ease of handling larger amounts of complexity with impunity would not be worth the cost of a learning curve that excludes 99% of the userbase.

Instead, Pd and Max/MSP have the objects with thin line connections. They are essentially little Rube Goldberg machines that end up being about as readable. But they can be used to do branching/looping/recursion/GUI building. So users typically end up writing as little DSP as they can get away with, then use thin-line spaghetti to fill in the rest. That turns out to be much cheaper than paying a professional programmer to re-implement their prototype at scale.

But that's a design decision in the language, not some natural law that visual programming languages are doomed to generate spaghetti.

Edit: clarification


> The "silver bullet" in Max/MSP would be to only use the DSP message passing paradigm. There, all objects are guaranteed to receive their input before they compute their output.

on one hand, this simplifies the semantics (and it's the approach I've been using in my visual language (https://ossia.io)), but on the other it tanks performance if you have large numbers of nodes... I've worked on Max patches with thousands and thousands of objects - if they were all called in a synchronous way as is the case for the DSP objects you couldn't have as much; the message-oriented objects are very useful when you want to react to user input for instance because they will not have to execute nearly as often as the DSP objects, especially if you want low latency.


That is certainly true. My point is that this is a drawback to the implementation of one set of visual programming languages, not necessarily a drawback of visual programming languages.

I can't remember the name of it, but there's a Pd-based compiler that can take patches and compile them down to a binary that performs perhaps an order of magnitude faster. I can't remember if it was JIT or not. Regardless, there's no conceptual blocker to such a JIT-compiled design. In fact there's a version of [expr] that has such a JIT-compiler backing it-- the user takes a small latency hit at instantiation time, but after that there's a big performance increase.

The main blocker as you probably know is time and money. :)


I made a simulation and video which shows why the pipe is so powerful: https://www.youtube.com/watch?v=3Ea3pkTCYx4 I showed it to Doug McIlroy, and while he thought there was more to the story, he didn't disagree with it.


I love writing little tools and scripts that use pipes, I've accumulated a lot of them over the years and some are daily drivers.

It's also a great way to learn a new programming language, as the interfaces at the boundaries of the program are very simple.

For example I wrote this recently https://github.com/djhworld/zipit - I'm fully aware you could probably whip up some awk script to do the same, or chain some existing commands together, or someone else has written the same thing, but I've enjoyed the process of writing it and it's something to throw in the tool box - even if it's just for me!

