How “Exit Traps” Can Make Bash Scripts More Robust and Reliable (redsymbol.net)
158 points by striking 37 days ago | 38 comments



Traps are neat but beware, they aren't completely reliable. You can't trap SIGKILL or SIGSTOP. In the article's examples, a SIGKILL would (1) leave temp directories around, (2) fail to restart a service, and (3) leave an expensive AWS resource running.
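For context, the cleanup idiom under discussion looks roughly like this (a condensed sketch of the article's first example):

    scratch=$(mktemp -d)
    trap 'rm -rf -- "$scratch"' EXIT   # fires on normal exit or a caught
                                       # signal -- but never on SIGKILL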

Remember that SIGKILL isn't always the result of a human typing `kill -9`, either: the Linux OOM killer sends it; all Unixes potentially send it during shutdown and runlevel switching; and programs like timeout(1) send it as a last resort.

Here are some other ways to approach the 3 examples:

1) Avoid temp files and directories if you can. Sometimes you can't, but anecdotally I come across LOTS of shell scripts that create a temp file when they could have used a pipe. Bonus: pipes are fast. (See the first sketch after this list.)

2) Ensuring a service comes back up after maintenance: use a process supervisor with automatic restarts, and have the service script grab a startup lockfile first thing. Use a blocking flock(1) or setlock(8) and discard the lock fd immediately afterwards. To bring the service down for maintenance, grab the startup lockfile, stop the service, then do your thing. Once your maintenance script exits -- through any means, including SIGKILL -- the kernel automatically releases the lock and the hitherto-blocked service continues starting up. (See the second sketch after this list.)

3) Capping expensive resources: if the EC2 instance truly is temporary, why not impose a timeout and police all such instances out-of-band, with an alarm? The article is right that omissions of this kind can be $$$. https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...
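To make (1) concrete (a toy sketch; `huge.log` is hypothetical):

    # temp-file version: needs a trap to stay tidy
    tmp=$(mktemp)
    trap 'rm -f -- "$tmp"' EXIT
    sort huge.log > "$tmp"
    uniq -c "$tmp"

    # pipe version: nothing to clean up, and no disk round-trip
    sort huge.log | uniq -c

And a sketch of (2)'s moving parts -- the service name, lockfile path, runit-style `sv term`, and `do_maintenance` are all stand-ins:

    # --- service run script: first thing, before starting the daemon ---
    exec 9>/run/myservice.startlock
    flock 9                    # blocks while a maintenance run holds the lock
    exec 9>&-                  # discard the lock fd...
    exec /usr/sbin/myserviced  # ...and start as usual

    # --- maintenance script ---
    exec 9>/run/myservice.startlock
    flock 9              # held until this script exits, however it exits
    sv term myservice    # the supervisor respawns the service, which then
                         # blocks in the flock above
    do_maintenance       # stand-in for the actual work
    # on ANY exit, even SIGKILL, the kernel drops the lock and the
    # blocked startup continues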


> Traps are neat but beware, they aren't completely reliable. You can't trap SIGKILL or SIGSTOP.

A long time ago, there actually was a sneaky way on some Unixes to trap SIGKILL. If a program was being run under ptrace then any signal would pause the program and alert the program that was doing the trace--even SIGKILL.

So I made a program that I named "sh" and carefully made to have the same memory size as /bin/sh, that just forked and exec'ed another program of mine under ptrace. The other program was named "superman". Whenever my fake sh received notification that "superman" had received a signal, it would write the number of the signal into a variable in superman's address space, and then make it so "superman" continued but with the signal changed to SIGINT. The SIGINT handler in "superman" would check that variable to see the real signal, and print an appropriate smart remark.

I started this running, then went to the head system admin/system programmer and told him something was wrong and I couldn't kill my program. After seeing that ^C and ^\ did nothing useful he logged into another terminal, became root, found "superman" with ps, and did a kill -9.

The look on his face was priceless when "superman" just printed something like "SIGKILL is not strong enough to harm a Kryptonian!" and continued running.

I was a little sad when later Unixes made SIGKILL kill processes being traced.


I love this story. Thanks for sharing!


I suppose these days you could just write a kernel module, either bundling a shell language in there or doing something rootkit-like to protect the process...


Of course, it's worth noting that if you lose power, or your computer is torn asunder by an act of God, your exit trap will also not execute. The point of mentioning that is to drive home that you can never fully rely on exit traps or mechanisms like them (i.e. Go's defer, Python's try/finally...), so you must always consider failure modes beyond that.

Mitigating failure modes beyond that can be challenging, but usually there's something you could do that will go further. For example with EC2 boxes you could find some mechanism to make the box 'temporary,' perhaps tag it as such, then have a program on another machine somewhere that reaps these boxes after 24 hours. (edit: Honestly, rereading the parent post, this is pretty much the same thing they recommended, but I didn't really realize it when I wrote this.)

It's still useful to use exit traps and friends, but as you mentioned, it's not a silver bullet, and you can make a lot more robust systems if you try to handle a layer or two more of failures.
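The reaper can be tiny, too. Something like this from cron on another box (an untested sketch; the `temporary` tag, GNU date, and a configured AWS CLI are all assumptions):

    # terminate tagged instances that have been up for more than 24 hours
    cutoff=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S)
    aws ec2 describe-instances \
        --filters "Name=tag:temporary,Values=true" \
                  "Name=instance-state-name,Values=running" \
        --query "Reservations[].Instances[?LaunchTime<='$cutoff'][].InstanceId" \
        --output text |
        xargs -r aws ec2 terminate-instances --instance-ids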


SIGKILL is never received by an application, and I believe SIGSTOP isn't either. Both are handled by the kernel as part of managing a process, and SIGKILL should be reserved for exceptional circumstances (an application won't quit normally...). SIGKILL gives no application a chance to clean itself up, shell script or no.

This is hardly a special feature of Bash, it's just how the operating system works.


> This is hardly a special feature of Bash, it's just how the operating system works.

I never claimed otherwise!


Yes, all regularly scheduled processes should be able to recover gracefully from a forcibly interrupted state. Programmers ought to be thinking about "what happens if the power goes off right NOW?"


Especially item 2. Wrapping a critical section in a lock is great. I prefer to have the default state be "running", so if the maintenance script ends for any reason the lock is discarded. If the maintenance process gets stuck you can see it's holding the fd on the lockfile, or you can put a watchdog on it (implementation depends on your supervisor).

I love these little bash gems; the only downside for me is that I feel like I'm stepping back into the 20th century whenever I do any heavy bash work... classic systems programming...


> I come across LOTS of shell scripts that create a temp file when they could have used a pipe.

Wouldn't a pipe live on in the file system just as well (unless you're talking about things that can be done simply with a "|")?


re: 1) any way to use a (named) pipe to replace a directory besides flattening out the dir?


I wouldn't say so: any pipe only allows strictly linear FIFO data transfer, but anything you could call "directory-like" surely allows random access - at least jumping to the start of any of the contained files.

I think, if you wanted a truly self-tidying temporary directory... you might be able to mkdtemp(), immediately chdir() into it, then rmdir() it, and thereafter work on the files within by relative paths only - as the directory's link count would be zero, it should get automatically cleaned up when the process exits. (Well, unless your program gets killed between the mkdtemp() and rmdir(), of course.)

... alternatively, you might also be able to mkdtemp(), opendir() then rmdir() it, then work on the files in it with openat() & friends. I can't see that being viable in shell scripts, though - too little support for dirfd-relative file operations.

Not sure if either idea would actually be allowed in practice, mind - I haven't tested them, either might fail at the rmdir() stage with EBUSY.


I actually tried this idea when I read their initial comment. Unfortunately (and this is obvious once you realize it) you can't `unlink` or `rmdir` a non-empty directory. Also, though `rmdir` can work while you still have an open reference to the directory, once you do a `rmdir` you are not allowed to add any new files to the directory, making it essentially useless.

I think the best way would just be to make sure you mount a `tmpfs` somewhere and make a directory on that (with something like your PID in the name). Then have some supervisor program periodically remove all directories with PIDs that are no longer running (along with having the programs themselves clean up if they have the opportunity). That's probably the best you can get; I don't think there's any way to make the directory get automatically cleaned up by the kernel.


> once you do a `rmdir` you are not allowed to add any new files to the directory

Ah, go figure. I had a feeling you'd run into something like that.

As for your tmpfs idea... you could do something with autofs to get automatic cleanup.


Yeah, it's unfortunate; it'd be pretty cool if you could pull something off like that. And conceptually it's not much different than what you can do with files already.

And autofs sounds like it could do it. I think the simplest implementation wouldn't be very complicated, though: you could probably do it with a bash script that cross-checks folder names against running processes every 5 minutes or so. Assuming most programs clean up after themselves, it only needs to handle programs that exit abnormally, like on SIGKILL. Of course, if that script has problems you'd be in for a bad time.
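Roughly (a sketch; the /scratch layout is hypothetical, and `kill -0` has to run as root or as the owning user to be reliable):

    # cron job: remove per-PID scratch dirs whose process is gone
    for dir in /scratch/[0-9]*/; do
        pid=$(basename "$dir")
        kill -0 "$pid" 2>/dev/null || rm -rf -- "$dir"
    done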


You can only rmdir an empty directory and then after that it's gone and can't have files created in it.

Provided you didn't need concurrent access, what you could do is create a temp file, unlink it, then treat the contents of the now-anonymous temp file as e.g. a zip file or a tar file or similar. But that strikes me as one of those "time to rethink the life choices that led you to this moment" situations rather than a remotely good idea.


Author of article here (but not submitter). I wrote this years ago, and it's fun to see it pop up on HN like this every year or two!

Here's another bash article people like:

http://redsymbol.net/articles/unofficial-bash-strict-mode/
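For reference, the "strict mode" it describes boils down to:

    #!/bin/bash
    set -euo pipefail
    IFS=$'\n\t'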


In the past few years, at whichever company I was working, I relentlessly evangelised the unofficial bash strict mode!


Bash strict mode forces you to write clean scripts, with each potential error handled. I've used it for more than a decade. I wrote a lot of bash scripts at Bazaarvoice.

BTW, I developed a small set of common functions for strict mode, called bash-modules. Can you look at it and provide some feedback? See https://github.com/vlisivka/bash-modules .


Thank you!


Both are great articles. Thanks!


Your process can always segfault, or the machine can lose power. A more robust way of handling this kind of thing is to always clean up the leftovers from previous runs at the start of the program. This also has the advantage that the state from the most recent run is always available for debugging/analysis.
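E.g. (a sketch; the fixed workdir is an assumption):

    # clean up the PREVIOUS run at startup; leave this run's state
    # behind afterwards for post-mortem debugging
    workdir=/var/tmp/myjob
    rm -rf -- "$workdir"
    mkdir -p -- "$workdir"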

Also, on Linux you can use unnamed temp files; see O_TMPFILE in http://man7.org/linux/man-pages/man2/open.2.html


How do you use O_TMPFILE from a shell script?


You can't use it directly from bash, but it's readily available in Perl, with the POSIX module.

Even in bash, you can unlink your temp files before you use them, and pass them around by file descriptor rather than by name.
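For example (a Linux-specific sketch; reading the data back relies on /proc letting you reopen a deleted file):

    tmp=$(mktemp)
    exec 3<>"$tmp"        # keep a read/write fd on the file
    rm -f -- "$tmp"       # unlink the name; the inode lives while fd 3 is open
    echo "scratch data" >&3
    cat /proc/self/fd/3   # reopens the (deleted) file from offset 0
    exec 3>&-             # closing the fd frees the storage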


> it's readily available in Perl, with the POSIX module

Oh? I don't see anything like this in the Perl codebase.

On the other hand, O_TMPFILE is available in Python since 3.4:

https://docs.python.org/3/whatsnew/3.4.html#os


I've used a variation of this that catches other signals as well:

  # On a signal: capture the status, disarm the EXIT trap so cleanup
  # doesn't run twice, clean up, and re-exit with the same status.
  trap 'rc=$?; trap "" EXIT; cleanup $rc; exit $rc' INT TERM QUIT HUP
  # On any other exit (normal or via `exit`): clean up once.
  trap 'cleanup; exit' EXIT


This is unnecessary. The EXIT trap fires on any exit from the shell, not just graceful exit.


The behavior varies with shell:

    $ cat test_trap 
    trap 'echo exiting' EXIT 
    kill $$
    
    $ bash test_trap
    exiting
    Terminated
    
    $ ksh test_trap
    exiting
    Terminated

    $ mksh test_trap
    exiting
    
    $ dash test_trap
    Terminated
    
    $ zsh test_trap
    Terminated


This article is specifically about Bash. I'm not surprised that `trap` behavior varies per shell, and I'd recommend reading the documentation on it.


Beware that you can only have one EXIT handler. I've used this in the past figuring it's analogous to Go's `defer`, but unfortunately it's not.

e.g. this script:

    #!/bin/bash
    trap 'echo Handler 1' EXIT
    trap 'echo Handler 2' EXIT
Will only call 'Handler 2' on exit.
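One workaround is to accumulate handlers yourself (a minimal sketch; `atexit` is a made-up name, not a builtin):

    EXIT_HANDLERS=''
    atexit() {
        # prepend, so handlers run LIFO like Go's defer
        EXIT_HANDLERS="$1"$'\n'"$EXIT_HANDLERS"
        trap 'eval "$EXIT_HANDLERS"' EXIT
    }
    atexit 'echo Handler 1'
    atexit 'echo Handler 2'
    # on exit, prints "Handler 2" then "Handler 1"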


I wrote `defer` (or rather `atexit`) for Bash: https://github.com/orivej/bash-traps/blob/master/atexit.sh


The author mentions that he's always discovering new use cases for this, and it does look handy. I wonder if there are any cookbook/recipe-style documents around that showcase some of those?


I get an SSL error for this page:

  redsymbol.net uses an invalid security certificate. The 
  certificate is only valid for the following names: 
    mobilewebup.com, www.mobilewebup.com 
  The certificate expired on 08 July 2011, 04:16. 
  The current time is 14 January 2018, 02:14.
  Error code: SSL_ERROR_BAD_CERT_DOMAIN


It's an http: link. Why are you trying to load it as https:?


Bad detection by HTTPS Everywhere or a similar browser extension is what I'd expect.


Oh wow. Thank you for that. I forgot I had set it to force HTTPS.


Korn Shell has trap as well.


Although many of bash's extensions were borrowed from ksh (occasionally even compatibly), 'trap' was in the original Bourne sh.



