
Hints for writing Unix tools - mariusae
http://monkey.org/~marius/unix-tools-hints.html
======
hoggle
_“One thing well” misses the point: it should be “One thing well AND COMPOSES
WELL”_

If the implementation doesn't respect _The Rule of Composition_, it isn't
adhering to the Unix philosophy in the first place. The tweet refers to one of
the famous quotes of Doug McIlroy (one of the Unix founders and the inventor
of the Unix pipe):

 _" This is the Unix philosophy: Write programs that do one thing and do it
well. Write programs to work together. Write programs to handle text streams,
because that is a universal interface."_

Pure beauty, but it's almost too concise a definition if you haven't
_experienced the culture_ of Unix (many years of usage / reading code /
writing code / communication with other followers). ESR's exhaustive list of
Unix rules in plain English might be a better start for the uninitiated (among
which one will find the aforementioned _Rule of Composition_).

For all those seeking enlightenment, go forth and read _The Art of Unix
Programming_ :

[https://en.wikipedia.org/wiki/The_Art_of_Unix_Programming](https://en.wikipedia.org/wiki/The_Art_of_Unix_Programming)

17 Unix Rules:

[https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E...](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules)

------
jzwinck
Here's one more tip: did you ever notice that "ls" displays multiple columns,
but "ls | cat" prints only one filename per line? Or how "ps -f" truncates
long lines instead of wrapping, while "ps -f | cat" lets the long lines live?

You can do it too, and if you're serious about writing Unix-style filter
programs, you will someday need to. How do you know which format to write?
Call "isatty(STDOUT_FILENO)" in C or C++, "sys.stdout.isatty()" in Python,
etc. This returns true if stdout is a terminal, in which case you can provide
pretty output for humans and machine-readable output for programs,
automatically.
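The detection jzwinck describes can be sketched in a few lines of shell, where
"[ -t 1 ]" plays the role of isatty(STDOUT_FILENO). (A minimal illustration;
real tools usually also offer a flag like --color=always to override the
detection, and the strings here are made up.)

```shell
#!/bin/sh
# Pick an output style based on whether stdout is a terminal.
if [ -t 1 ]; then
    # Interactive: pretty output for humans (bold, columns, ...).
    printf '\033[1mfancy output for humans\033[0m\n'
else
    # Piped or redirected: plain one-record-per-line output for programs.
    printf 'plain output for programs\n'
fi
```

Run directly, this takes the first branch; piped through cat, it takes the
second, mirroring the ls behaviour above.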

~~~
dap
IMO, this is an anti-pattern. It violates the principle of least surprise.
(How come I see X when I run the command, but I can't grep for X in its
output? How come it works when I run it from my interactive shell, but it's
broken when I run it from a script? And things like that.)

~~~
burke
I think it depends on what sort of things you use it for. I often use it to
switch ANSI colourization on or off, which doesn't really violate the
principle of least surprise.

When used sparingly and thoughtfully, I've never personally had an issue with
it.

~~~
chriswarbo
You may not have an issue with the sort of things you use it for, but others
might.

For example, I run shells in Emacs and have had to tweak loads of shell
scripts written by colleagues to fix their poorly-implemented colourisation.
It's useful to know when a test has failed; it's not so useful to have the
whole terminal set to white text on a pale pink background.

One day I couldn't SSH into our servers from Emacs. It turned out somebody had
edited .bashrc for the admin user to make the bash prompt blue. Emacs' TRAMP
process was looking for a prompt ending in "$" or "#", not "$\\[\033[0m\\]",
so it didn't realise the connections were successful.

There are two ways of handling this: we can blame the source of the bug (the
person adding the colours incorrectly, or the assumption-loaded TRAMP regex),
but there will _always_ be more bugs in situations we'd never think of.
Alternatively, we can avoid being 'too clever', and instead aim for
consistency and least surprise.

~~~
philh
Are you suggesting that colored prompts violate the rules of consistency and
least surprise?

(Actually, if you are suggesting that, I'm not going to disagree. But I am
going to say that if so, those rules don't apply in the case of colored
prompts, because colored prompts are _useful_.)

~~~
chriswarbo
I suppose I'm suggesting that, aside from personal scripts, we shouldn't
assume too much about who our users are and what they're trying to do. The
principle of least power tells us to use the dumbest format that will work,
eg. plain text.

Anything we add on top of that, eg. ANSI colour codes, will be useful to some
but harmful to others. The tricky part is working out which of those
categories the current user is in.

~~~
philh
So is your proposed solution not to have colored prompts? (I vehemently
disagree.) Or not to put them in a .bashrc? (I still disagree, but only
strongly.) Or something else?

~~~
wtbob
Or, y'know, have coloured prompts but place the escape characters such that
they don't mess with prompt-detection regexps.
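As a sketch of what wtbob suggests, a bash prompt can keep its colour while
still ending in a plain "$ " (a hypothetical .bashrc fragment; the \[ \]
wrappers also keep bash's line-length accounting correct):

```shell
# Colour only the user@host:dir part; the trailing "\$ " stays plain,
# so prompt-detection regexps looking for "$ " or "# " still match.
PS1='\[\033[34m\]\u@\h:\w\[\033[0m\]\$ '
```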

~~~
philh
To be precise, what you're suggesting is that we have prompts which are
allowed to be colored _except for the $/# at the end_, because you can't
color those without following them by escape characters. And that prompts
_must_ have a $/# at the end.

I don't consider that an acceptable solution.

------
voltagex_
I'm not sure I agree with the "no JSON, please" remark. If I'm parsing normal
*nix output I'm going to have to use sed, grep, awk, cut or whatever and the
invocation is probably going to be different for each tool.

If it's JSON and I know what object I want, I just have to pipe to something
like jq [1].

PowerShell takes this further and uses the concept of passing objects around -
so I can do things like ls | $_.Name and extract a list of file names (or
paths, or extensions etc)

[1]: [http://stedolan.github.io/jq/](http://stedolan.github.io/jq/)
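To make the contrast concrete, here is a contrived example (the JSON shape is
invented, and jq is assumed to be installed): with line-oriented output you
must know the field's position and the delimiter, with JSON only its name.

```shell
# Line-oriented: the invocation depends on column position.
printf 'eth0 1500 up\n' | awk '{print $3}'
# -> up

# JSON: the invocation depends only on the key name.
printf '{"iface":"eth0","mtu":1500,"state":"up"}\n' | jq -r '.state'
# -> up
```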

~~~
ygra
I was also constantly thinking of PowerShell while reading that. A PowerShell-
specific list of such advice would actually be rather short, given that most
of the pitfalls are already avoided. I still firmly believe that PowerShell is
actually a much more consistent Unix shell in that several concepts that ought
to be separate are actually orthogonal. Let's see:

Input from stdin, output to stdout: Nicely side-stepped in that most cmdlets
allow binding pipeline input to a parameter (either byval or byname, if
needed). Filters are trivial to write, though.

Output should be free from headers: Side-stepped as well, in that decoration
comes from the Format-* cmdlets that should only ever be at the end of a
pipeline that's shown to the user.

Simple to parse and to compose: Well, objects. Can't beat parsing that you
don't need to do.

Output as API: Well, since output is either a collection of objects or nothing
(e.g. if an exception happened) there isn't the problem that you're getting
back something unexpected.

Diagnostics on stderr: Automatic with exceptions and Write-Error. As an added
bonus, warnings are on stream 3, verbose output on stream 4 and debug output
on stream 5. All nicely separable if needed.

Signal failures with an exit status. Automatic if needed ($?), but usually
exception handling is easier.

Portable output: That's about the only advice that would still hold and be
valuable. E.g. Select-String returns objects with a Filename property which is
not a FileInfo, but only a string; subject to the same restrictions that are
mentioned in the article.

Omit needless diagnostics: Since those would be on either the debug or the
verbose stream, they can be silenced easily and don't interfere with other
things you care about. Cmdlets have a switch for each, which means you only
get that output if you actually care about it.

Avoid interactivity: Can happen when using the shell interactively, e.g.

    
    
        Home:> Remove-Item
    
        cmdlet Remove-Item at command pipeline position 1
        Supply values for the following parameters:
        Path[0]: _
    

However, this only ever happens if you do not bind anything to a parameter,
which shouldn't happen in scripts. If you bind $null to a parameter, e.g.
because pipeline input is empty or a subexpression returned no result, then an
error is thrown instead, avoiding this problem.

Nitpick: You'd need ls | % Name or ls | % { $_.Name } there. Otherwise you'd
have an expression as a pipeline element, which isn't allowed.

~~~
dec0dedab0de
I have never used a computer that had access to Powershell, but in my new job
I may have to do some small stuff to tie some systems together. I'm terrified
of learning it because I don't want to be lured into some kind of lock-in
scenario.

~~~
emodendroket
Well, you can only use it on Windows. But I mean come on, "terrified?" Bash
scripts are not too useful in Windows either.

~~~
pessimizer
Spending a lot of time learning non-portable technologies isn't a decision to
be taken lightly.

~~~
emodendroket
Well, if it's part of your job I'd say it's not a decision at all...

~~~
pessimizer
If you can't accomplish the same goals in a portable way. Of course, if you
can put the knowledge to work as soon as you learn it, you're already starting
to recoup your investment.

~~~
GhotiFish
I can't believe you got downvoted for that.

------
osandov
A nitpicky tip: --help is normal execution, not an error, so the usage
information should be printed to stdout, not stderr (and it should exit with a
successful status). Nothing is more annoying than trying to use a convoluted
program with a million flags (which should have a man page in the first place)
and piping --help into less with no success.

~~~
foobarbaz1234
I am not so sure with that. Say, your program is used in a shell script and is
invoked badly - you might want to print its usage then. If you exit normally
your shell script might break weirdly but if you exit with error it's easier
to spot the reason of failure.

On the other hand you made me thinking and probably you should have three code
passes per default:

    
    
      [0] normal behaviour (exit 0)
      [1] bad arguments (exit EINVAL)
      [2] --usage (print to stdout but exit != 0)?
    

Anyway I am not sure if it makes sense to declare "usage" as normal behaviour.

~~~
Someone
In my book, there is a difference between explicitly asking for help/usage and
passing arguments that do not make sense, which triggers the output of
help/usage.

The former, I think, should write to stdout and return 0, the latter should
write to stderr and return something non-zero.

Giving help if the user asks for it is normal behaviour.
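The convention described in this subthread can be sketched as a small shell
wrapper (the tool name and the exit code 64 - EX_USAGE from sysexits.h - are
illustrative, not prescriptive):

```shell
#!/bin/sh
usage() { echo "usage: mytool [-v] FILE"; }

case "$1" in
    -h|--help)
        usage           # help was asked for: normal output, stdout
        exit 0 ;;
    -*)
        usage >&2       # bad arguments: usage is a diagnostic, stderr
        exit 64 ;;
esac
# ... normal operation ...
```

With this, "mytool --help | less" works, while a script that mis-invokes the
tool still sees a non-zero exit status.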

------
Animats
1978 called. It wants its pipes back.

That approach dates from the days when you got multi-column directory listings
with

    
    
      ls | mc
    

Putting multi-column output code in "ls" wasn't consistent with the UNIX
philosophy.

There's a property of UNIX program interconnection that almost nobody thinks
about. You can feed named environment variables into a program, but you can't
get them back out when the program exits. That's a gap: "exit()" should have
taken an optional list of name/value pairs as an argument, and the calling
program (probably a shell) should have been able to use them. With that,
calling programs would be more like calling subroutines.

PowerShell does something like that.

~~~
grosskur
You can simulate this with so-called "Bernstein chaining". Basically, each
program takes another program as an argument, and finishes by calling exec()
on it rather than exit(), which preserves the environment. See:

[http://www.catb.org/~esr/writings/taoup/html/ch06s06.html](http://www.catb.org/~esr/writings/taoup/html/ch06s06.html)

Or write environment variables to stdout in Bourne shell syntax so the caller
can run "eval" on it. Like ssh-agent, for example.
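The ssh-agent pattern looks roughly like this (the variable names are invented
for illustration):

```shell
# The child prints assignments in shell syntax on stdout...
emit_env() {
    printf "BUILD_ID=1234; export BUILD_ID;\n"
    printf "BUILD_DIR=/tmp/build.1234; export BUILD_DIR;\n"
}

# ...and the caller eval's them, pulling the "returned" variables into
# its own environment - the direction exit() can't manage.
eval "$(emit_env)"
echo "$BUILD_ID"
# -> 1234
```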

~~~
gohrt
Continuation Passing Style! [http://en.wikipedia.org/wiki/Continuation-
passing_style](http://en.wikipedia.org/wiki/Continuation-passing_style)

------
to3m
Additional tip: if writing a tool that prints a list of file names, provide a
-0 option that prints them separated by NUL ('\0') rather than whitespace.
Then the output can be piped through xargs -0 and it won't go wrong if there
are files with spaces in their paths.

I suggest -0 for symmetry with xargs. find calls it -print0, I think.

(In my view, this is poor design on xargs's part; it should be reading a
newline-separated list of unescaped file names, as produced by many versions
of ls (when stdout isn't a tty) and find -print, and doing the escaping itself
(or making up its own argv for the child process, or whatever it does). But
it's too late to fix now I suppose.)
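A quick demonstration of why the NUL convention matters (the files here are
created in a throwaway temp directory):

```shell
# A file name with a space survives -print0 | xargs -0, but would be
# split into two arguments by a plain whitespace-delimited pipeline
# such as "ls *.wav | xargs stat".
dir=$(mktemp -d)
touch "$dir/plain.wav" "$dir/with space.wav"

# Safe: NUL can never occur inside a path, so records are unambiguous.
find "$dir" -name '*.wav' -print0 | xargs -0 ls -l

rm -r "$dir"
```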

~~~
fragmede
> newline-separated list of unescaped file names

That breaks when you have newlines in filenames, no?

~~~
pstuart
> That breaks when you have newlines in filenames, no?

That seems like an extremely pathological case.

~~~
deathanatos
> That seems like an extremely pathological case.

When a human is creating files by hand, I almost certainly agree. When a
program is creating files, however, it's only a matter of time before weird
characters wind their way in there.

I really wish newlines had been disallowed. (There's UI implications, in
addition to the parsing ones — how do you do a list view with newlines in the
filename?; I also wish filenames had a reliable character set and weren't just
bytes.)

~~~
bmn_
I think dwheeler is trying to get this fixed/standardised in POSIX via the
Open Group.

~~~
oblio
That it's going to be an uphill battle is an understatement.

When he posted his proposal, someone replied on LWN that they had implemented
a sort of home-grown database using non-UTF-8 characters in file names.

Rube Goldberg, indeed!

------
acabal
Great article. The other thing I've always wished for in command-line tools is
some kind of consistency for flags and arguments. Kind of like a HIG for the
command line. I know some distros have something like this, and that it's not
practical to do as many common commands evolved decades ago and changing the
interface would break pretty much everything. But things like `grep -E,--
extended-regexp` vs `sed -r,--regexp-extended` and `dd if=/a/b/c` (no dashes)
drive me nuts.

In a magical dream world I'd start a distro where every command has its
interface rewritten to conform to a command line HIG. Single-letter flags
would always mean only one thing, common long flags would be consistent, and
no new tools would be added to the distro until they conformed. But at this
point everyone's used to (and more importantly, the entire system relies on)
the weird mismatches and historical leftovers from older commands. Too bad!

~~~
dTal
I've often thought this: the fact that xkcd.com/1168/ is funny is a terrible
embarrassment. I would also like to add that manpage syntax help should be
standardized and machine-parseable. I had an idea recently to auto-generate
GUIs for command line tools from the manpage syntax line, but it turned out
that while such lines _look_ precise but cryptic, they are often in fact
highly ambiguous, nonstandard, and still cryptic. This seems broken to me.

~~~
james2vegas
Blame man(7); have a look at mdoc(7), which has semantic markup for command
line utilities:

      Nm : start a SYNOPSIS block with the name of a utility
      Fl : command line options (flags) (>=0 arguments)
      Cm : command modifier (>0 arguments)
      Ar : command arguments (>=0 arguments)
      Op, Oo, Oc : optional syntax elements (enclosure)
      Ic : internal or interactive command (>0 arguments)
      Ev : environmental variable (>0 arguments)
      Pa : file system path (>=0 arguments)

------
dap
Lots of great points here, but as always, these can be taken too far. Header
lines are really useful for human-readable output, and can be easily skipped
with an optional flag. (-H is common for this).

The "portable output" thing is especially subjective. I buy that it probably
makes sense for compilers to print full paths. But it's nice that tools like
ls(1) and find(1) use paths in the same form you gave them on the command-line
(i.e., absolute pathnames in output if given absolute paths, but relative
pathnames if given relative paths). For one, it means that when you provide
instructions to someone (e.g., a command to run on a cloned git repo), and you
want to include sample output, the output matches exactly what they'd see.
Similarly, it makes it easier to write test suites that check for expected
stdout contents. And if you want absolute paths in the output, you can specify
the input that way.

~~~
zaptheimpaler
I also think headers should be included. It's really annoying to pore through
a man page just to see what the columns mean. You could use flags, or maybe
send headers to stderr.
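The stderr idea can be sketched in a couple of lines (a toy example; whether
this is less surprising than an explicit -H flag is debatable, as dap notes
above):

```shell
# Header goes to stderr, data to stdout: a human at a terminal sees
# both, but a pipeline like "mytool | sort -k2 -n" only sorts the data.
printf 'NAME\tSIZE\n' >&2
printf 'foo\t10\n'
printf 'bar\t3\n'
```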

------
peterwwillis
Not every program will be able to take input on stdin and write output to
stdout. If you have a --file (or -f) option, you'd do well to support a "-"
file argument, meaning either stdin or stdout depending on whether -f is read
from or written to. But you won't support "-" if the -f option requires
seeking backwards in a file. Neither will you be using stdin or stdout if
binary data is involved (because of tty drivers).

'One thing well' is often intended to make people's lives easier on the
console. Sometimes this means assuming sane defaults, and sometimes just a
simpler program that does/assumes less. Take these two examples and tell me
which you'd prefer to type:

    
    
      user@host~$ ls *.wav | xargs processAudio -e mu-law --endian swap -c 2 -r 16000
      user@host~$ find . -maxdepth 1 -type f -name '*.wav' -exec processAudio -e mu-law --endian swap -c 2 -r 16000 {} \;
    

Write concise technical documentation. Imagine it's your first day on a new
job and you need to learn how all your new team's tools work; do you want to
read every line of code they've written just to find out how it works, or do
you want to read a couple pages of technical docs to understand in general how
it works? (That's a rhetorical question)

Definitely provide a verbose mode. When your program doesn't work as expected,
the user should be able to figure it out without spending hours debugging it.

------
_pmf_
I have a strong bias against people who quote their own tweets in their own
blog posts. I find this to be highly narcissistic.

~~~
1amzave
I sympathize, but I have to say I find it far less annoying than the constant
exhortations to "follow me on Twitter!" that have become obnoxiously
ubiquitous in the last few years.

------
RexRollman
Wow, it's been a while since I've seen a monkey.org link. I thought the site
was dead. Nice to see I was wrong.

------
mseepgood
Another tip: don't do colored output. I don't want to deal with ANSI codes in
your output.

------
arh68
I think it's insane to restrict programs to just STDOUT & STDERR. Why 2? Why
not use another file descriptor, maybe STDFMT, to capture all the formatting
markup? This would avoid -0 options (newlines are markup sent to stdfmt, all
strings on stdout are 0-terminated), it would avoid -H options (headers go
straight to STDFMT), it would allow for less -R to still work, etc.

It's possible other descriptors would be useful, like stdlog for insecure
local logs, stddebug for sending gobs of information to a debugger. It's
certainly not in POSIX, so too bad, but honestly stdout is hard to keep
readable and pipe-able. Adding just one more file descriptor separates the
model from the view.

~~~
peterwwillis
I honestly have no idea what you are talking about. The whole point of
standard i/o streams is for them to be portable and composable by other
programs without those programs having to be designed to work with yours.
POSIX is here for a very good reason.

Obviously not every program will use just two file descriptors. Binary isn't
handled by stdin and stdout because they're typically used for tty
input/output. If you need to handle multiple files you'll take a list of file
arguments. Often a program takes no input at all that isn't a command-line
option.

And what 'formatting markup'? There is no 'markup' on a terminal, unless
you're dealing with colors or something, which you would disable if your fd
wasn't a tty. And why would you send 'headers' to a completely different file
descriptor anyway?

Oh, I think I get it now. You confused the MVC architecture with Unix
programs. Unix programs don't provide a user interface.

~~~
masklinn
> I honestly have no idea what you are talking about. The whole point of
> standard i/o streams is for them to be portable and composable by other
> programs without those programs having to be designed to work with yours.

His point is that two streams are not enough: you don't want to present the
same output stream to a human, a logfile, and another utility.

> And what 'formatting markup'? There is no 'markup' on a terminal, unless
> you're dealing with colors or something

Right, so there is markup on a terminal.

> which you would disable if your fd wasn't a tty.

Which would be much simpler to handle if there was a stream for human
consumption and one for piping

> And why would you send 'headers' to a completely different file descriptor
> anyway?

Because headers are useful to human users, or when capturing output in a file
to read later rather than feeding it to another utility?

~~~
peterwwillis
In practical terms, everything you mention should be done _by different
programs_ , not one giant monolithic subsystem that manages 10 completely
different tasks. Each component should be reusable, independent, and
interoperable. Not tied into one program.

In your program's design, the 'cat' program would handle all kinds of file
i/o, provide some kind of ncurses text GUI to select a file, a progress bar
for the progress of text flowing through it, sending errors to a logging
subsystem, storing header metadata in some object passed along its output
streams, etc. The Unix designers had dealt with this kind of crap before, and
were sick of it, and so they wrote a program which did _only one thing_.

What you describe is the systemd school of design: if I just make my program
more complex and technically superior, I'll have a better program. Who cares
that nobody wants to use it, or that it's burdensome, hard to extend,
difficult to understand, and incompatible with everything that exists today?
Who cares if we can already do all these things without all the downsides?
Technical superiority trumps practicality. Well, that's not Unix.

The Unix environment flourished not only because it was widely available, but
mainly because it was incredibly efficient. By removing all the things they
didn't need, they made the system better. There are four words that accurately
express all of this, and that should guide the development of any Unix tool:

Keep it simple, stupid.
[https://people.apache.org/~fhanik/kiss.html](https://people.apache.org/~fhanik/kiss.html)

------
chilicuil
I agree with what the article lays out, and I've actually added more details
on how to apply these principles to shell scripting:

[http://javier.io/blog/en/2014/10/21/hints-in-writing-unix-
to...](http://javier.io/blog/en/2014/10/21/hints-in-writing-unix-tools-with-
shell-scripting.html)

------
jwr
I would add to this list:

If you are intercepting UNIX signals (starting with SIGINT), go back to the
drawing board and think again. Don't do it. There is almost never a good
reason for doing it, and you will likely get it wrong and frustrate users.

~~~
edwintorok
How about cleaning up tempfiles on ^C?

~~~
renox
YMMV, but I prefer cleaning up the old tempfiles at start-up. That lets you
get at the contents of the tempfiles after the program has stopped, which is
very handy for debugging.

