
auto-restarting after a failure is a reasonable first-order workaround for transient environment issues (the network or an external dependency goes away for a bit) or for rare per-request issues that aren't handled correctly. arguably it'd be better to understand the issue and fix or mitigate it properly, but until you actually have time to do that, it's nice for the service to autonomously attempt to stay available.

...except when that causes even more problems! one example i remember from a hobby project was a worker service that would make requests to an external service's API. I'd implemented a throttle in the worker to limit the number of external requests per second made against the external service, where the throttle state was stored in process memory and not persisted anywhere -- seemed to be a pragmatic design tradeoff. you probably see where this is going.

there was an exciting interaction where an external API response caused the worker's API client to throw an unhandled exception, which took down the worker service. systemd would then diligently restart the worker service immediately, per the restart policy. the throttle's working state had been lost, so upon being restarted the worker service immediately fired another request at the external API, which returned the same response, which caused the worker service to fail in the same way, which systemd luckily noticed and immediately restarted, and so on.

combine that with a lack of monitoring + alerting and you get a mechanism where your worker service can make about 100,000x as many external API requests over a few days as it was meant to.
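
To make the failure mode concrete, here is a minimal sketch of a throttle whose state survives a restart. It is not the code from the project above; `PersistentThrottle`, the state file path, and the minimum-interval policy are all assumptions chosen for illustration. The idea is only that the timestamp of the last request lives on disk, so a freshly restarted process still has to wait.

```python
import json
import os
import tempfile
import time


class PersistentThrottle:
    """Minimum-interval throttle whose state survives process restarts.

    Hypothetical helper: the last request time is kept in a small JSON file
    instead of process memory.
    """

    def __init__(self, state_path, min_interval_s, clock=time.time, sleep=time.sleep):
        self.state_path = state_path
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep

    def _load_last_request(self):
        # A missing or corrupt state file is treated as "no previous request".
        try:
            with open(self.state_path) as f:
                return float(json.load(f)["last_request"])
        except (OSError, ValueError, KeyError, TypeError):
            return None

    def _store_last_request(self, ts):
        # Write to a temp file and rename it, so a crash mid-write cannot
        # leave a truncated state file behind.
        directory = os.path.dirname(os.path.abspath(self.state_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"last_request": ts}, f)
        os.replace(tmp_path, self.state_path)

    def acquire(self):
        """Block until a request is allowed; return the seconds waited."""
        last = self._load_last_request()
        now = self._clock()
        waited = 0.0
        if last is not None:
            waited = max(0.0, last + self.min_interval_s - now)
            if waited > 0:
                self._sleep(waited)
        # Record the request *before* it is made, so a crash while handling
        # the response still counts against the limit after a restart.
        self._store_last_request(now + waited)
        return waited
```

Calling `acquire()` before each external request would be enough for a single worker. The sketch uses wall-clock time, so a clock jump can shorten or lengthen the wait, and it does nothing to coordinate several workers sharing one file.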




I always set `RestartSec`.


That, plus restricting the number of restarts within an interval, is good.

You can then also set `OnFailure=` to start another unit when this one enters the failed state, e.g. to send a notification.

E.g.:

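    [Unit]
    ...
    OnFailure=notify-failure@%n.service
    # give up after 5 starts within 300 s; the unit then enters the failed state
    StartLimitBurst=5
    StartLimitIntervalSec=300

    [Service]
    Type=simple
    Restart=on-failure
    # wait 5 s before each automatic restart
    RestartSec=5
    ..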
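
As a rough way to reason about those three numbers, here is a simplified model in Python. It is an assumption-laden sketch, not systemd's actual logic: it assumes the service fails the instant it starts, that at most `burst` starts are allowed per window of `interval_sec`, and that the first refused start leaves the unit failed for good. `count_starts` and its parameters are hypothetical names.

```python
def count_starts(restart_sec, burst, interval_sec, horizon_sec):
    """Count starts of a service that fails instantly every time it starts.

    Simplified model (an assumption, not systemd's code): at most `burst`
    starts are allowed per window of `interval_sec`; the first refused start
    puts the unit in a failed state and restarting stops.

    Returns (number_of_starts, time_of_refusal_or_None).
    """
    starts = 0
    window_begin = None
    starts_in_window = 0
    t = 0.0
    while t <= horizon_sec:
        # Open a new rate-limit window once the previous one has elapsed.
        if window_begin is None or t - window_begin >= interval_sec:
            window_begin = t
            starts_in_window = 0
        if starts_in_window >= burst:
            return starts, t  # start refused at time t: unit marked failed
        starts_in_window += 1
        starts += 1
        t += restart_sec  # wait before the next automatic restart
    return starts, None


# Values from the unit above: the loop stops after 5 starts.
print(count_starts(restart_sec=5, burst=5, interval_sec=300, horizon_sec=86400))
# Same delay, but a window too short for 5 starts to fit: never stops.
print(count_starts(restart_sec=5, burst=5, interval_sec=10, horizon_sec=60))
```

Under this model the interval has to be comfortably longer than `burst` times the restart delay, otherwise the limit never trips and the crash loop continues, just more slowly.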


I mean, if you fail to code it correctly in the first place, then fail to set up a reasonable auto-restart policy, then fail to monitor it as well, yeah, shit will eventually break.



