
Bash Pitfalls - zanok
http://bash.cumulonim.biz/BashPitfalls.html
======
brandonbloom
Maybe some controversial advice: Go ahead, fall in these pits.

I write my fair share of shell scripts and I've hit practically every one of
these snags in the past. However, for the majority of tasks I perform with
bash, I genuinely don't care if I support spaces in filenames, or if I throw
away a little efficiency with a few extra sub-shells, or if I can't test
numbers vs strings or have a weird notion of booleans.

Your scripts are going to have bugs. The important question is: What happens
when they fail?

Are your scripts idempotent? Are they audit-able? Interruptible? Do you have
backups before performing destructive operations? How do you verify that they
did the right job?

For example, if your shell scripts operate only on files under version
control, you can simply run a diff before committing. Rather than spend a
bunch of time tracking down a word expansion bug, you can simply rename the
one file that failed so its name no longer includes a space.
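
A hypothetical sequence (the script name is made up) where the script itself
may well be buggy, but the damage is easy to see and easy to undo:

    ./rename-tracks.sh    # some little script that rewrites tracked files in place
    git diff              # audit exactly what it changed
    git checkout -- .     # roll every tracked file back if it did the wrong thing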

~~~
brennen
I live and die by the shell. I'm constantly composing little one-liners, and
keep an absurdly long Bash / zsh history to draw from. There are places where
the obvious answer is almost always "how about you just write a shell script?"

That said, I long ago reached a place where I realized that, while shell
scripting is _entertaining_ , I'd much rather write anything more than a
handful of lines in a general purpose programming language. Perl, Python,
Ruby, whatever - even _PHP_ involves far less syntactic suffering and general
impedance than Bash. It's not that I'm exceptionally worried about correctness
in stuff that no one besides me is ever going to use, it's just that once
you're past a certain very low threshold of complexity, the agony you spend
for a piece of reusable code is so much less. Even just stitching together
some standard utilities, there are plenty of times it'll take a tenth as long
and a thousandth as much swearing to just write some Perl that uses backticks
here and there or mangles STDIN as needed.

    
    
      > Are your scripts idempotent? Are they
      > audit-able? Interruptible? Do you have
      > backups before performing destructive
      > operations? How do you verify that they
      > did the right job?
    

Every single one of these questions is easier to answer if you're using a less
agonizing language than Bash and its relatives.

~~~
sigil
> Every single one of these questions is easier to answer if you're using a
> less agonizing language than Bash and its relatives.

I disagree. While the set of things that are "hard" to do is probably larger
in shell than the alternatives, the specific questions posed by the
grandparent are hard in any language. They all boil down to "how can I
correctly do something which has side effects (on external state)?"

Statefulness itself is a pain, and shell is in some sense the ultimate
language for simply and flexibly dealing with external state.

 _Simplicity_ : the filesystem is an extremely simple and powerful state
representation. Show me a language that interacts with the fs more concisely
than

    
    
        tr '[A-Z]' '[a-z]' < upper.txt > lower.txt
    

_Flexibility_ : if shell can't do it, just use another program in another
language that can, like `tr` in the above example. What other language enables
polyglot programming like this? Literally any program in any language can
become a part of a shell program.
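
For instance, a throwaway pipeline (purely illustrative) where every stage is a
program in a different language, glued together by the shell:

    ps aux |
        awk 'NR > 1 { print $1 }' |
        perl -ne 'print lc' |
        sort | uniq -c | sort -rn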

> it's just that once you're past a certain very low threshold of complexity,
> the agony you spend for a piece of reusable code is so much less.

Here's where I admit I was playing devil's advocate to an extent, because I
fully agree with you here. I write lots of shell scripts. I _never_ write big
shell scripts. Above some length they just get targeted for replacement in a
"real" language, or at the very least, portions of them get rewritten so they
can remain small.

Empirically, it also seems true that shell is harder for people to grasp,
harder to read, and harder for people to get right. These are real costs that
have to be figured in.

PS. Speaking of shell, brennen, we should be working on our weekend project. :)

~~~
comex
> tr '[A-Z]' '[a-z]' < upper.txt > lower.txt

That's the biggest problem: some things are very simple, but other things fall
off a cliff. For example, as a related task I ran into recently: how do you
replace FOO with the contents of foo.txt? The natural way would be expanding
it into a command line, but at least with sed that's no good even for nice
short text files, because / and \n are special. You can use a sed command that
reads a file (one I didn't know existed until I looked it up), but it
apparently has the delightful feature that "If file cannot be read for any
reason, it is silently ignored and no error condition is set." You can use
perl... you can use perl to easily do a lot of things that are really hard to
do otherwise (including things as simple as matching a regex and printing
capture groups), but at least to me it feels really awkward and wrong to mix
two different full-fledged languages. Maybe I should just get over that, but I
wish the whole thing were more coherent.
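
For the record, the two approaches look roughly like this (from memory; foo.txt
and template.txt are made-up names, and the sed version only handles a FOO that
sits on a line by itself):

    # sed's r command: splice foo.txt in wherever a line is exactly FOO
    sed -e '/^FOO$/{' -e 'r foo.txt' -e 'd' -e '}' template.txt

    # perl: slurp foo.txt once, then substitute it for every FOO, even mid-line
    perl -pe 'BEGIN { local $/; open my $f, "<", "foo.txt" or die; $t = <$f> } s/FOO/$t/g' template.txt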

~~~
e12e
Interesting problem. Some quick head-scratching and googling didn't turn up
anything useful on merging templates with awk and sed... then it hit me --- m4
is used for that:

    
    
       sed -r 's/FOO/include(foo.txt)/g' temp.txt | m4

~~~
comex
Interesting solution; I should learn to use m4 for various tasks. Probably
would have already if I didn't have such a negative visceral reaction to
autotools :)

------
comex
The Unix shell may be a highly powerful interactive programming environment,
but it's sure hard to think of anything that comes anywhere close to _sucking_
as badly. With the shell and the standard Unix commands, some things that are
hard in other languages are easy, and most of the things that are easy in
other languages are hard to impossible... I'd love to see a clean slate
replacement for the shell that still feels Unix-like and retains most of its
existing benefits.

(I suspect PowerShell would be a good environment to take design cues from or
even port, but I've never used it so I can't say for sure.)

~~~
thristian
Have you looked at "rc", the shell from Plan 9? It's very similar in spirit to
the Bourne shell, but it's fundamentally better thought-out.

[http://plan9.bell-labs.com/sys/doc/rc.html](http://plan9.bell-labs.com/sys/doc/rc.html)

------
mlacitation
I have a mirror of this site. wooledge.org runs off of greycat's (#bash on
freenode) home DSL connection:

[http://bash.cumulonim.biz/BashPitfalls.html](http://bash.cumulonim.biz/BashPitfalls.html)

~~~
wyclif
Thanks. I don't understand why people submit links with content that can't be
accessed by more than a few users simultaneously.

~~~
jamesbritt
How would one know this in advance?

~~~
wyclif
How would one know in advance that submitting to HN would result in traffic
consisting of more than a handful of users?

~~~
jamesbritt
How would one know in advance whether a site can handle high traffic?

And did I really have to spell this out?

------
Sprint
Apparently a mod changed the URL (to relieve a not-so-powerful host). The
original URL (and thus the one you should be bookmarking/remembering) was
[http://mywiki.wooledge.org/BashPitfalls](http://mywiki.wooledge.org/BashPitfalls)

[http://bash.cumulonim.biz/BashPitfalls.html](http://bash.cumulonim.biz/BashPitfalls.html)
is a mirror, see mlacitation's comment.

------
PhasmaFelis
So why do we put up with classic command line tools in general that are so
full of horrible, counterintuitive pitfalls? Is it just tradition? Backwards
compatibility?

The "Unix should be hard" crew has gotten a lot quieter in the last ten years
with the rise of Ubuntu and other relatively user-friendly distros, but I feel
like there's still an underlying current of elitism there; people are proud of
mastering these bizarre, arcane methods, and they're offended that someone
else might be able to accomplish just as much without doing half as much work.

~~~
mixmastamyk
There are alternatives. I use the fish shell for interactive work, and python
when a bash script surpasses a certain complexity.

------
secure
Here’s my personal favorite shell pitfall, which was the last straw that made
me start recommending _against_ using shell except for very, very narrow niche
use cases:
[https://plus.google.com/+MichaelStapelberg/posts/YLarC7WPVQB](https://plus.google.com/+MichaelStapelberg/posts/YLarC7WPVQB)

------
dj-wonk
Can anyone recommend an automatic bash style checker, i.e. a 'linter'? Perhaps
something along the lines of Chef's Foodcritic?
[http://acrmp.github.io/foodcritic/](http://acrmp.github.io/foodcritic/)

~~~
zwp
Shellcheck recently found a bug in one of my old scripts. You can use the web
interface or compile it for local use:

[http://www.shellcheck.net/](http://www.shellcheck.net/)

[https://github.com/koalaman/shellcheck](https://github.com/koalaman/shellcheck)
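
If you want a taste before pointing it at real code, feeding it a deliberately
sloppy two-liner (file name made up) is enough to see what it complains about:

    printf '%s\n' '#!/bin/sh' 'rm $1' > demo.sh
    shellcheck demo.sh    # warns that the unquoted $1 should be double-quoted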

------
joshbaptiste
Ah yes, the ultimate reference from Freenode #bash. After you learn from
Greycat's wiki you won't simply Google/DDG "bash tutorial" again; you'll just
head straight here.

------
ak217
I find it sad and amusing that we're writing a ton of mission-critical code in
this language that has an incredible number of obscure quirks. Yes, most of
these pitfalls are directly connected to the semantics of Unix, but I wish
someone would make a concerted effort to get rid of them in an otherwise
evolutionary way.

------
druiid
Bash is, for most things, one of the easier languages I have dealt with, even
more so than Python, etc. That is, though, for shell-type scripting.

There are many problems with it, but the only one I've run into that keeps it
from being more useful is that there are no built-in multi-dimensional arrays.
There are super hacky ways I have seen them implemented, but by default it's
something I can basically never turn to when scripting in bash, so I have to
turn to other languages, even when the particular task I was working on would
otherwise be simpler in bash.

That said, there are associative arrays in bash these days.
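
The usual workaround I've seen (and it's exactly the kind of hack I mean) is to
fake the extra dimension with composite keys in an associative array, roughly:

    # bash 4+ only: pack both indices into one string key
    declare -A grid
    grid[2,3]="hello"
    grid[5,1]="world"
    echo "${grid[2,3]} ${grid[5,1]}"    # hello world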

------
Aardwolf
The following does not work on files with spaces according to the article:

    
    
      for i in $(ls *.mp3); do
        some command $i
      done
    

So does that mean that "for" will do something per _word_ of the output of the
$(), rather than per _line_ of its output?

What should I do if I want to do something for every line? What if, for
example, I really want the output of ls, find (or any other command you can put
in the $()) and loop through it line by line, even if some output has spaces?

Thanks.

~~~
sigil
> So does that mean that "for" will do something per word of the output of the
> $(), rather than per line of its output?

Correct. The argument to "for" is a list of words.

> What should I do if I want to do something for every line?

Use a while loop.

    
    
        find /some/dir/ -type f |
        while read -r line; do
           : # do something with "$line"
        done
    

PS. You should almost always use `find` instead of `ls` in shell scripts.
Given a pattern, `ls` will exit non-zero if nothing matches it, and you should
be treating non-zero exits like you would exceptions in other languages.
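
And to tie it back to the original *.mp3 example, two whitespace-safe variants
(the second assumes bash, for read -d ''):

    # simplest: let the glob do the work, no ls at all
    for file in ./*.mp3; do
        some command "$file"
    done

    # recursive: NUL-delimit the names coming out of find
    find . -name '*.mp3' -print0 |
    while IFS= read -r -d '' file; do
        some command "$file"
    done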

~~~
joseph
One thing to be careful of when doing "while read..." is that a new shell is
started on each iteration, so you cannot for example set a variable within the
loop that you can use later in the script, as its value will be lost when the
shell process exits.

~~~
sigil
> a new shell is started on each iteration

This is not actually true.

    
    
        printf "\n\n\n" | while read i; do a="x$a"; echo "$a"; done
        x
        xx
        xxx
    

The accumulator value even carries over after the while loop:

    
    
        printf "\n\n\n" | ( while read i; do a="x$a"; echo "$a"; done ; echo "$a" )
        x
        xx
        xxx
        xxx
    

(Technically, whether or not the loop body is executed in a subshell may be
implementation dependent. Haven't looked at the POSIX shell spec in a while,
but I seem to remember an old ksh that actually used subshells. At any rate,
none of the modern sh's, bash included, forces a subshell.)

What _is_ true, however, is that a pipeline will execute in a subshell. Maybe
that's what you're getting at here, and it is an important caveat.

    
    
        a=y; printf "\n\n\n" | while read i; do a="x$a"; echo "$a"; done; echo "$a"
        xy
        xxy
        xxxy
        y
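
If you need the variable to survive the loop, the usual bash-specific fix is to
feed the loop from a process substitution instead of a pipe, so the while runs
in the current shell:

        a=y; while read i; do a="x$a"; done < <(printf "\n\n\n"); echo "$a"
        xxxy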

~~~
joseph
Ah, ok. True, when I have had this issue it was after doing something like
'grep "pattern" file | while read ...'. I did not realize it was the pipe that
caused this.

------
anon4
I know someone will sooner or later propose that we ban spaces and special
characters in names. Let me just put my two cents forward.

We should absolutely ban special characters from names. Specifically, all
whitespace, the colon, semicolon, forward slash, backward slash, question
mark, star, ampersand, and whatever else I'm missing that will confuse the
shell. Also files cannot start with a dash.

However, people should be able to name files with these characters. So I
propose that these characters in filenames be percent-encoded like they would
be in a URL. Specifically, the algorithm should be

1\. Take the file name and encode it as UTF-8. Enforce some sort of
normalization.

2\. Substitute each problematic byte with its equivalent percent-encoded form.
This does not touch bytes over 0x80 - they are assumed non-problematic.

3\. Write the file in the file system under that name.

4\. When displaying files, run the algorithm in reverse.

In the general case files like "01 - Don't Eat the Yellow Snow.mp3" would
simply become 01%20-%20Don't%20Eat%20the%20Yellow%20Snow.mp3 in the filesystem
and cause absolutely no further problems. To make it completely backwards-
compatible we should also add the following rule: If a filename includes a
problematic byte or a percent-encoded byte higher than 0x80, then it is
assumed to be raw and will not undergo percent decoding.
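
To make steps 1-2 concrete, a rough bash sketch (illustration only; a real
implementation would live in libc or the VFS, not in a shell function, and the
character list is just the one I gave above plus % itself so the encoding stays
reversible):

    encode_name() {
        local name=$1 out='' c i
        local bad=$' \t\n:;/\\?*&%'    # whitespace, the listed punctuation, and %
        for (( i = 0; i < ${#name}; i++ )); do
            c=${name:i:1}
            if [[ $bad == *"$c"* ]]; then
                printf -v out '%s%%%02X' "$out" "'$c"    # "'x" gives printf the char code
            else
                out+=$c                                  # bytes >= 0x80 fall through untouched
            fi
        done
        printf '%s\n' "$out"
    }

    encode_name "01 - Don't Eat the Yellow Snow.mp3"
    # 01%20-%20Don't%20Eat%20the%20Yellow%20Snow.mp3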

Basically, I propose that every program which receives free text input for a
file name percent-encode the filenames before writing them to the filesystem
and decode them for display. Everything else remains unchanged.

Why this will not work:

Requiring programmers to keep track of two filenames instead of just one is
rather a lot of work. File APIs will have to take both encoded and non-encoded
forms and encode the non-encoded form, creating problems when people
inadvertently use the wrong function with a name, either double-encoding it or
not encoding it and leading to "this file does not exist" errors.

It will be possible to create two files with different names on disk which are
nonetheless shown with the same name to the user.

Why it is ugly:

We're taping over a deficiency of an ancient language by inflicting pain on
programmers.

Double-encoded filenames? MADNESS.

Why I like it:

I'll be able to have ?, * and : in filenames in windows.

My shell scripts will be much simpler.

What do you guys think?

~~~
derefr
> Substitute each problematic byte with equivalent percent-encoded form. This
> does not touch bytes over 0x80 - they are assumed non-problematic.

You know what's crazy? Currently, in Unix, _control characters_ are allowed in
filenames. Like, \t and \n and \b and even \\[. _Those_ shouldn't be allowed,
percent-escaped or not. Everything else you said is sensible.
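
It's easy to see for yourself, too; bash's printf %q makes the damage visible:

    cd "$(mktemp -d)"
    touch "$(printf 'tab\there')" "$(printf 'newline\nhere')"
    printf '%q\n' *    # the embedded tab and newline show up as \t and \n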

~~~
ygra
Technically NTFS allows those too. The filesystem, being a very low-level
tool, hardly thinks of the upper layers and what pain it might inflict there.
Its purpose is to store blobs under a name and retrieve them upon request.
Since a char[] (or wchar_t[]) looks enough like a name, that's what it uses.

That being said, enforcing such restrictions in upper layers brings pain as
well, because suddenly you can have files that you cannot delete anymore
(happens sometimes on Windows).

~~~
derefr
True; there's no reason that the filesystem should be storing anything other
than char[]. The filesystem is a serialized domain, and char[] buffers are for
storage and retrieval of serialized data. But that also means that each
filesystem _should_ explicitly specify a _serialization format_ for what's
stored in that char[] -- hopefully UTF-8.

However, the filesystem should really be where that serialized representation
begins and ends. The filesystem should be interacting with the VFS layer using
_runes_ (Unicode codepoints), not octets.

And then, given that all filesystems route through the VFS, it can (and
should) be enforcing preconditions on those runes in _its_ API, expecting
users to pass it something like a printable_rune_t[]. (Or even, horror of
Pascalian horrors, a _struct containing a length-prefixed_
printable_rune_t[].)

And for the situation where there's now files floating around without a
printable_rune_t[] name -- this is why NTFS has been conceptually based around
GUIDs (really, NT object IDs) for a decade now, with all names for a file just
being indexed aliases. I wonder when Linux will get on that train...

~~~
ygra
Well, history sadly dictates that the interface to the upper layers is based
around code units, because those have always been fixed-length. Unicode came
too late to most operating systems to really be ingrained in their design, and
where it was (Windows springs to mind) it all took a turn for the worse with
the 16-to-21-bit shift in Unicode 2, leaving Unicode-by-default systems no
better off than 8-bit-by-default systems had been a decade earlier.

That NTFS uses GUIDs internally to reference streams is news to me, though.
But I think on Unix-like systems the equivalent would be inodes, I guess,
right?

------
Nick_C

        for arg
    

instead of

    
    
        for arg in "$@"
    

is gold. That is going straight to the pool room.
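
For anyone who hasn't seen it: with no `in` list, `for` loops over the
positional parameters, so the two forms are equivalent. A quick sanity check:

    args_demo() { for arg; do printf '<%s>\n' "$arg"; done; }
    args_demo "one two" three
    # <one two>
    # <three>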

------
Timmmmbob
1\. Using bash.

