
Failsafe – failure handling with retries, circuit breakers and fallbacks - jodah
https://github.com/jhalterman/failsafe
======
dredmorbius
A note on the name: "fail-safe" in engineering doesn't mean that a system
_cannot_ fail, but rather, that when it does, it does so in the safest manner
possible.

The term originated with (or is strongly associated with) the Westinghouse
railroad brake system. These are the pressurised air brakes on trains, in
which air pressure holds the brake shoes _open_ against spring pressure.
Should integrity of the brakeline be lost, the brakes will fail in the
activated position, slowing and stopping the train (or keeping a stopped train
stopped).

[https://en.m.wikipedia.org/wiki/Railway_air_brake](https://en.m.wikipedia.org/wiki/Railway_air_brake)

Fail-safe designs and practices can lead to some counterintuitive concepts.
Aircraft landing on carrier decks, in which they are arrested by cables, apply
full engine power and afterburner on landing. The idea is that should the
arresting cable or hook fail, the aircraft can safely take off again.

[https://en.m.wikipedia.org/wiki/Fail-safe](https://en.m.wikipedia.org/wiki/Fail-safe)

Upshot: "fail safe" doesn't mean "test all your failure conditions
exhaustively". It may well mean to abort on any failure mode (see djb's
software for examples). The most important criterion is that whatever the
failure mode is, it is as safe as possible and, almost always, based on a very
simple and robust design, mechanism, logic, or system.

From the description of this project, it strikes me that it may well be
failing (unsafely?) to implement these concepts. Charles Perrow, scholar of
accidents and risks, notes that it's often safety and monitoring systems
themselves which play a key role in accidents and failures.

~~~
Animats
_"These are the pressurized air brakes on trains, in which air pressure holds
the brake shoes open against spring pressure."_ Air brakes don't really work
that way.[1] There's an air tank on each car to provide the pressure to apply
the brakes if the brake line loses pressure.

Fail-safe design comes from railroad signaling. It is a principle of classic
railroad signaling that any broken wire or relay that fails to pull in must
result in an indication not less safe than the correct one. "Vital" relays in
classic signaling systems fall open by gravity, and use silver-to-silver
contacts so as to avoid welding together on overloads. (Lightning strikes on
rails and on signal lines are considered a normal part of railroad operation.)

[1]
[https://en.wikipedia.org/wiki/Railway_air_brake#Straight_air...](https://en.wikipedia.org/wiki/Railway_air_brake#Straight_air_brake)

~~~
dredmorbius
From your linked source:

"Under the Westinghouse system, therefore, brakes are applied by reducing
train line pressure and released by increasing train line pressure. The
Westinghouse system is thus fail safe—any failure in the train line, including
a separation ("break-in-two") of the train, will cause a loss of train line
pressure, causing the brakes to be applied and bringing the train to a stop,
thus preventing a runaway train."

Without air pressure, whether from the line or the canister, the brakes fail
in the _activated_ mode.

I'm trying to find a source, but my understanding is that red/green for lit
signals as "stop/go" came about after an earlier scheme, in which a steady
_white_ light meant "go", proved problematic: the red disks fronting stop lamps
could fall out (or perhaps be broken), leaving ambiguity as to what "white"
meant.

Switching to red and green lamps meant that the failed-disk mode now clearly
indicated a signalling problem, where the signal could not be trusted.

~~~
Animats
No, train brakes need pressure from the car tank to be applied. This is what
the famous "triple valve" is for. High train line pressure releases the brakes
and charges up the car tank. Low pressure applies the brakes. This has the
annoying property that you can't leave a train parked on a grade for too long
without applying the manual brakes on the cars. US freight air brakes were
standardized in 1893, and haven't changed much since.[1]

Semitrailer parking brakes really are spring-loaded and released by air
pressure.

[1] [http://www.railway-technical.com/air-brakes.shtml](http://www.railway-technical.com/air-brakes.shtml)

~~~
mcpherrinm
The Lac-Megantic crude oil train derailment/fire disaster is a rather horrific
example of that "annoying" property, where an insufficient number of manual
brakes were applied, and an engine fire caused the engine providing air
pressure to be shut down.

[https://en.wikipedia.org/wiki/Lac-M%C3%A9gantic_rail_disaste...](https://en.wikipedia.org/wiki/Lac-M%C3%A9gantic_rail_disaster)

~~~
dredmorbius
I was aware of brake failure as a factor in the Lac-Megantic disaster, but not
that this was the specific cause.

------
nitrogen
Very cool. Consistent and clear retry, backoff, and failure behaviors are an
important part of designing robust systems, so it's disappointing how uncommon
they are. If I were starting a new Java project today I would almost certainly
want to use this library instead of the various threads and timers I had to
hack together years ago.

~~~
heisenbit
Indeed, this is conceptually hard stuff. The reason, I believe, is that the
problems one is solving are system-level problems and not local ones. Another
way to look at this: it is the other guy's problem. A lot of naive retry
strategies sort of work until one has a larger number of clients to deal
with. I still remember trying to get through to a base-station designer who
refused to acknowledge the need for exponential back-off and other mitigation
steps. We ran into interesting times shortly afterwards in the field on the
management system side. Personally, I would also put in a bit of randomness to
spread out requests when all clients were initially impacted at the same time
and were thus synchronized.
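
To make the back-off plus randomness idea concrete, here is a minimal Java
sketch of exponential back-off with "full jitter". It is not Failsafe's API:
the class `BackoffSketch`, its method names, and the 100 ms base and 5 s cap
are all assumptions picked for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper (not part of Failsafe): exponential back-off with full jitter. */
final class BackoffSketch {
    private BackoffSketch() {}

    /** Largest delay allowed before retry number {@code attempt} (0-based), capped at maxMillis. */
    static long ceilingMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 0 || baseMillis < 1 || maxMillis < 1) {
            throw new IllegalArgumentException("attempt must be >= 0, delays must be >= 1 ms");
        }
        long delay = Math.min(baseMillis, maxMillis);
        for (int i = 0; i < attempt; i++) {
            if (delay > maxMillis / 2) {
                return maxMillis; // doubling again would pass the cap (or overflow)
            }
            delay *= 2;
        }
        return delay;
    }

    /** Random delay in [0, ceiling), so clients that failed together do not retry together. */
    static long jitteredMillis(int attempt, long baseMillis, long maxMillis) {
        long ceiling = ceilingMillis(attempt, baseMillis, maxMillis);
        return ThreadLocalRandom.current().nextLong(ceiling);
    }

    public static void main(String[] args) {
        // Illustrative values only: 100 ms base delay, 5 s cap.
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt
                    + ": ceiling=" + ceilingMillis(attempt, 100, 5000) + " ms"
                    + ", jittered=" + jitteredMillis(attempt, 100, 5000) + " ms");
        }
    }
}
```

Each client sleeps for a random time below a ceiling that doubles per attempt,
so clients knocked over at the same instant drift apart instead of hammering
the server in lockstep.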

~~~
jodah
Good example of where random retry delays would be valuable. I filed this as a
feature to add for the next release:

[https://github.com/jhalterman/failsafe/issues/39](https://github.com/jhalterman/failsafe/issues/39)

------
SwellJoe
This title would be 100% better with "for Java" on the end.

~~~
_Codemonkeyism
... for JVM languages.

------
ckugblenu
Quite interesting. It shows potential to be used in numerous use cases. Anyone
know of similar projects in other languages like Python and JavaScript?

~~~
rdli
(Full disclosure: co-founder of Datawire)

We released a microservices development kit (MDK) last week that implements
similar semantics (e.g., circuit breakers, failover) in Python, JavaScript,
Java, and Ruby. The implementation is actually written in a DSL which we
transpile into language-native impls. We do this to ensure interop between
different languages. We're working on updating our compiler to support Go and
C#, adding richer semantics, and making the service discovery piece pluggable
(currently there's a dependency on our own service discovery).

[https://github.com/datawire/mdk](https://github.com/datawire/mdk)

------
cpitman
How is this distinct from Hystrix
([https://github.com/Netflix/Hystrix](https://github.com/Netflix/Hystrix))?
Why should I use one over the other?

~~~
jodah
Good question. Someone asked that recently on GitHub. Here's a quick
comparison:

[https://github.com/jhalterman/failsafe/wiki/Comparisons#fail...](https://github.com/jhalterman/failsafe/wiki/Comparisons#failsafe-vs-hystrix)

~~~
vikiomega9
Is there a more detailed comparison?

For example,

>Executable logic can be passed through Failsafe as simple lambda expressions
or method references. In Hystrix, your executable logic needs to be placed in
a HystrixCommand implementation

It's not apparent to me what the advantage of either interface is. In both
situations I have to define a "lambda" and hold state somewhere (either as an
object field or passed into the lambda). Unless I'm missing something here,
either seems acceptable.

~~~
jodah
> Is there a more detailed comparison?

There's nothing more detailed that I know of. Is there a particular feature
area/comparison you're curious about? I can add a bit more detail.

> It's not apparent to me what the advantage of either interface is. In both
> situations I have to define a "lambda"

What I meant by this bit is that the user experience is different. Failsafe
can be used with method references or lambda expressions [1], which are a
nice, concise way of wrapping executable logic with some failure handling
strategy. You cannot do this with Hystrix since all logic must be wrapped in a
HystrixCommand impl, which cannot be implemented as a lambda.

> either seems acceptable.

Like anything, it just depends on what you want. If you want retries and
general-purpose failure handling, consider Failsafe. If you want request
collapsing, thread pool management and monitoring, consider Hystrix.

[1]: [https://github.com/jhalterman/failsafe#synchronous-retries](https://github.com/jhalterman/failsafe#synchronous-retries)
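
A rough illustration of the user-experience difference, in plain Java. Nothing
here is Failsafe's or Hystrix's real API: `withRetries` and `Command` are
hypothetical stand-ins. The point is only that a functional interface can be
satisfied by a lambda or method reference, while an abstract class has to be
subclassed.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch: lambda-style vs. command-class-style failure handling. */
final class LambdaRetrySketch {
    private LambdaRetrySketch() {}

    /** Runs {@code work} up to maxAttempts times and rethrows the last failure. */
    static <T> T withRetries(int maxAttempts, Callable<T> work) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    /** Command style: an abstract class cannot be written as a lambda. */
    abstract static class Command<T> {
        abstract T run() throws Exception;

        T execute(int maxAttempts) throws Exception {
            return withRetries(maxAttempts, this::run);
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};

        // Lambda style: the logic is wrapped inline.
        String viaLambda = withRetries(3, () -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "ok after " + calls[0] + " calls";
        });

        // Command style: the same idea needs a subclass.
        String viaCommand = new Command<String>() {
            @Override
            String run() {
                return "pong";
            }
        }.execute(1);

        System.out.println(viaLambda);
        System.out.println(viaCommand);
    }
}
```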

~~~
vikiomega9
I'm curious about design in general. Circuit breaking and timeouts should be a
well-defined semantic of the caller, so my thoughts were more about how one
could compose their code to bolt on Failsafe, for example, but also quickly
switch to some other library.
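
One possible shape for that, sketched in plain Java: the caller depends only on
a small interface it owns, and each library gets its own adapter behind it.
Every name here (`FailurePolicy`, `NoPolicy`, `SimpleRetry`, `PriceClient`) is
hypothetical, and the retry loop stands in for whatever a real library would do.

```java
import java.util.function.Supplier;

/** Hypothetical sketch: keep the failure-handling library behind a caller-owned seam. */
final class SwappablePolicySketch {
    private SwappablePolicySketch() {}

    /** The only failure-handling type the calling code ever sees. */
    interface FailurePolicy {
        <T> T call(Supplier<T> work);
    }

    /** Pass-through policy, handy in tests. */
    static final class NoPolicy implements FailurePolicy {
        @Override
        public <T> T call(Supplier<T> work) {
            return work.get();
        }
    }

    /** Hand-rolled retry; a library-backed adapter would implement the same interface. */
    static final class SimpleRetry implements FailurePolicy {
        private final int maxAttempts;

        SimpleRetry(int maxAttempts) {
            if (maxAttempts < 1) {
                throw new IllegalArgumentException("maxAttempts must be >= 1");
            }
            this.maxAttempts = maxAttempts;
        }

        @Override
        public <T> T call(Supplier<T> work) {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return work.get();
                } catch (RuntimeException e) {
                    last = e; // keep the latest failure and retry
                }
            }
            throw last;
        }
    }

    /** Example caller: it knows about FailurePolicy, not about any library. */
    static final class PriceClient {
        private final FailurePolicy policy;

        PriceClient(FailurePolicy policy) {
            this.policy = policy;
        }

        int fetchPrice(Supplier<Integer> remoteCall) {
            return policy.call(remoteCall);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        PriceClient client = new PriceClient(new SimpleRetry(3));
        int price = client.fetchPrice(() -> {
            if (++calls[0] < 2) {
                throw new IllegalStateException("flaky backend");
            }
            return 7;
        });
        System.out.println("price=" + price + " after " + calls[0] + " calls");
    }
}
```

Swapping libraries then means writing one new `FailurePolicy` implementation
and changing what gets passed to the constructor.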

------
ap22213
It seems like a well-thought-out, fluent interface to what lots of Java
developers (especially Java 8 ones) inevitably have to write themselves.

------
mandeepj
Please find some of these patterns for the .NET/Azure/C# stack here:
[https://msdn.microsoft.com/en-us/library/dn568099.aspx](https://msdn.microsoft.com/en-us/library/dn568099.aspx)

------
fdsaaf
Beware of runaway retries:
[https://blogs.msdn.microsoft.com/oldnewthing/20051107-20/?p=...](https://blogs.msdn.microsoft.com/oldnewthing/20051107-20/?p=33433)

Personally, I'd rather systems fail quickly, with retries only at the highest
(application) and lowest (TCP) levels.
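
The arithmetic behind that preference, as a small Java sketch: if every layer
retries independently, the worst-case number of attempts reaching the bottom
is the product of the per-layer attempt counts. The class name and the layer
counts are made up for illustration.

```java
/** Hypothetical sketch: how independent retries at each layer multiply. */
final class RetryAmplificationSketch {
    private RetryAmplificationSketch() {}

    /** Worst-case attempts at the lowest layer, given the attempts made per layer. */
    static long worstCaseAttempts(int... attemptsPerLayer) {
        long total = 1;
        for (int attempts : attemptsPerLayer) {
            if (attempts < 1) {
                throw new IllegalArgumentException("each layer makes at least one attempt");
            }
            total = Math.multiplyExact(total, (long) attempts);
        }
        return total;
    }

    public static void main(String[] args) {
        // Five layers, each trying three times: 3^5 attempts in the worst case.
        System.out.println(worstCaseAttempts(3, 3, 3, 3, 3)); // prints 243

        // Retrying only at the top and bottom layers: 3 * 3 attempts.
        System.out.println(worstCaseAttempts(3, 1, 1, 1, 3)); // prints 9
    }
}
```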

