
My experience:

- an uncaught exception should terminate the process... but

- cluster.js allows restarting failed processes easily

- and your whole cluster should be restarted if it dies (we use upstart's respawn)

- uncaught exceptions should come only from truly exceptional situations and bugs, never from application-level errors (letting one of those go uncaught would itself be a bug)

- have the cluster master send your team an email as soon as a process dies, to maintain awareness (we use Slack notification + email)

- kill bugs, aim for 0 failures. Invalid parameters should never kill your app. An unreachable DB could, because reconnecting can be difficult, and may not succeed
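To make the points above concrete, here is a minimal sketch, not our actual code. It assumes `clusterImpl` behaves like Node's `cluster` module (a `fork()` method and an 'exit' event). The names `installCrashHandler`, `superviseWorkers`, `log` and `notify` are hypothetical and stand in for whatever logging and email/Slack hook you use.

```javascript
// Worker side: an uncaught exception ends this one process, nothing more.
// `proc` is injected (normally the global `process`) so the sketch can be
// exercised without actually crashing anything.
function installCrashHandler(proc, log) {
  proc.on('uncaughtException', (err) => {
    log('uncaught exception, exiting: ' + (err && err.stack ? err.stack : err));
    proc.exit(1); // non-zero exit so the master sees it as a failure
  });
}

// Master side: fork the workers, replace any that die, and alert the team.
// Note: this sketch restarts a worker on every exit, including clean ones.
function superviseWorkers(clusterImpl, workerCount, notify) {
  for (let i = 0; i < workerCount; i++) {
    clusterImpl.fork();
  }
  clusterImpl.on('exit', (worker, code, signal) => {
    const pid = worker && worker.process ? worker.process.pid : 'unknown';
    notify('worker ' + pid + ' died (code=' + code + ', signal=' + signal + ')');
    clusterImpl.fork(); // restart the failed process
  });
}
```

In a real app you would call something like installCrashHandler(process, console.error) in each worker and superviseWorkers(require('cluster'), require('os').cpus().length, sendAlert) in the master, where sendAlert is your own notification function. Restarting the master itself is left to upstart or a similar tool.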

With this:

- when an unexpected failure happens (bugs mostly), you know it didn't kill your whole app (only a single process dies)

- if the failures repeat, your cluster may loop trying to restart the app, and that can be enough to restore it. If it isn't, you'll be flooded with emails and you'll know how serious the problem is. Networking problems are often solved this way in a few seconds

- you will kill bugs, because of the awareness factor of emails

- and you'll have an incentive to keep your app startup time low

That's how we do it.
