
Misconfigured Circuit Breakers - rbanffy
https://engineering.shopify.com/blogs/engineering/circuit-breaker-misconfigured
======
derefr
This might be my bias as someone who mostly writes in actor-based concurrent
languages, but: is there a reason to test the service’s back-to-life-ness by
passing a real request through, rather than just having every open circuit
spawn a periodic background “/health”-endpoint (or equivalent) poller actor
for that backend within your service? Doing the check within the client
request lifecycle seems like it would needlessly increase client-request
99th-percentile latency, with no real benefit except to the _single_ request
that _doesn’t_ time out, closes the circuit, and so gets served by the
service.

ETA: you could even—presuming your system does solely idempotent stuff—take
the exact request whose timeout caused the circuit to open, and pass _that
request_ (the particular query for SQL; the particular URI + POST/PUT data for
REST) to the poller-actor you’re spawning to use as its health-check. Once
_that request_ doesn’t time out any more, you know you’re back online.

I’d recommend a real separate /health endpoint for the service you’re trying
to contact, though, because 1. most applications do non-idempotent things, and
2. some queries time out because they touch edge-case bad/unscalable code
paths, not because the remote-service-as-a-whole is down. If your `SELECT *
FROM expensive_generated_report_view` SQL query times out, you really should
not be reusing that query as a health-check against your RDBMS. (But nor
should you just be doing `SELECT 1`!)
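
(A minimal sketch of the background-poller idea in Python; the `Circuit`
object with `is_open()`/`close()` is a hypothetical stand-in for whatever
breaker your client library exposes, not any real API:)

    import threading
    import time
    import requests

    def spawn_health_poller(circuit, health_url, interval=5.0, timeout=2.0):
        """When a circuit opens, poll the backend's /health endpoint in the
        background, and close the circuit only once a probe succeeds, so no
        client request ever pays the cost of testing a dead backend."""
        def poll():
            while circuit.is_open():
                try:
                    if requests.get(health_url, timeout=timeout).ok:
                        circuit.close()  # backend is back; resume real traffic
                        return
                except requests.RequestException:
                    pass                 # still down; keep polling
                time.sleep(interval)
        threading.Thread(target=poll, daemon=True).start()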

~~~
dtech
A /health endpoint or similar mock might be available and responding fast,
while a real request might fail or time out.

~~~
derefr
Well, that’s not a “true” /health endpoint, then. A service’s /health
endpoint should exercise its regular, non-trivial code paths, and its success
should depend on all of the downstream services that normal requests to that
service depend on. (Probably you’ll need to write it yourself, rather than
using one supplied by your application framework.)

For example, if you have a CRUD app fronting a database, your CRUD app’s
/health endpoint should attempt to make a simple database query. If you have
an ETL daemon that pulls from a third-party API and pushes to a message queue,
it should probe both the readiness of the API and the message-queue before
reporting its own readiness to work. (Of course, it is exactly when the
service has its own circuit-breaking logic, with back-up paths “around” these
dependencies, that it gets to say it’s healthy even when its dependencies
aren’t.)
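
(A rough sketch of the ETL-daemon case in Python; the URLs and the use of
requests/pika are illustrative assumptions, not anything from the comment:)

    import pika          # RabbitMQ client, standing in for "a message queue"
    import requests

    def etl_daemon_is_ready(api_url="https://api.example.com/health",
                            amqp_url="amqp://localhost"):
        """Report ready only if both the upstream API and the queue answer."""
        try:
            # Probe the third-party API we pull from...
            requests.get(api_url, timeout=2).raise_for_status()
            # ...and the message queue we push to.
            pika.BlockingConnection(pika.URLParameters(amqp_url)).close()
            return True
        except Exception:
            return False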

A test of the /health endpoint is, after all, what determines whether a
service is considered viable for routing by load-balancers; and, vice versa,
whether the service is considered “stuck” and in need of a restart by
platform auto-scaling logic. As such, your /health endpoints really should be
configured to err toward false _positives_ (reporting unhealthy when they’re
really healthy) rather than false negatives.

If you’ve got a pool of instances, better to have them paranoid of their own
readiness in a way where your system will be constantly draining traffic from
+ restarting them, than to have them lackadaisical about their readiness in a
way where they’re receiving traffic they can’t handle.

~~~
dtech
> Well, that’s not a “true” /health endpoint, then

You cannot make such a "true" health endpoint; it's super easy to make a
service where what such an endpoint should do is a paradox.

Take one service with two endpoints, A and B. A relies on an external
service and the database; B relies only on the database. What should the
health endpoint do if the external service is down, taking A down with it?
Either your health endpoint is useless, because A is down while it reports
the service as fine, or you've needlessly cascaded the downtime to B. The
same situation arises with a single endpoint whose dependencies vary by
request, etc.

Of course you can use a health endpoint for determining restarts, load
balancers, etc., but it's not a replacement for circuit breakers on your
calls.

~~~
derefr
> 1 service with 2 endpoints A and B. A relies on an external service and the
> database, A and B both rely on the database. What to do if the external
> service is down bringing A down?

Make one /health/a endpoint and one /health/b endpoint. Client-service A uses
/health/a to check if the service is "healthy in terms of A's ability to use
it." Client-service B likewise pings /health/b.

In a scenario with many different dependencies (a crawler that can hit
arbitrary third-party sites, say, or something like Hadoop that can load
arbitrary, not-necessarily-existent extensions per job), these endpoints
should be something clients can create/register themselves, a la SQL stored
procedures; or the service can offer a connection-oriented streaming
endpoint, where the client can hold open a connection and be told about
readiness-state-change events.

But to be clear, these are edge-case considerations: in most cases, a
service has only _critical-path dependencies_ (which it needs to bootstrap
itself, or to "do its job" in the most general SLA sense) and _optional
dependencies_ (which it doesn't strictly need, and whose unavailability it
can survive, via circuit-breaking, with degraded functionality.)

It's a rare—and IMHO not-well-factored—service that has dependencies that are
on the critical path for some use-cases but not others. Such a service should
probably be split into two or more services: a core service that all use-cases
depend on; and then one or more services that each _just_ do the things unique
to a particular use-case, with all their dependencies being on the critical
path for serving that specific use-case. Then each use-case-specific service
can be healthy or unhealthy independently.

An example of doing this factoring right: CouchDB. It has a core "database"
service, but also a higher-level "view-document querying" service, that can go
down if its critical-path dependency (a connection to a Javascript sandbox
process) isn't met. Both "services" are contained in one binary and one
process, but are presented as two separate collections of endpoints, each with
their own health endpoint.

An example of doing this factoring wrong: Wordpress. It's got image
thumbnailing! Comment spam filtering! CMS publication! All in one! And yet
it's either monolithically "healthy" or "unhealthy"; "ready" or "not ready" to
run. That is clearly ridiculous, right?

~~~
fatnoah
> Make one /health/a endpoint and one /health/b endpoint. Client-service A
> uses /health/a to check if the service is "healthy in terms of A's ability
> to use it." Client-service B likewise pings /health/b.

I've done exactly this, and it worked well in my case, where the number of
related services was pretty small. Each endpoint would return an HTTP status
code indicating overall health, with additional details stating exactly which
checks succeeded or failed.
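
(A sketch of these per-client health endpoints in Python/Flask; the probe
functions are hypothetical stand-ins for whatever each use-case's critical
path actually touches:)

    from flask import Flask, jsonify

    app = Flask(__name__)

    def check_database():
        # Stand-in: in a real app, run a simple query through the normal
        # connection pool, e.g. db.execute("SELECT 1 FROM users LIMIT 1").
        pass

    def check_external_api():
        # Stand-in: in a real app, probe the external dependency itself.
        pass

    def run_checks(checks):
        """Run the named probes; return per-check details plus an HTTP
        status code: 200 if everything passed, 503 otherwise."""
        results = {}
        for name, probe in checks.items():
            try:
                probe()
                results[name] = "ok"
            except Exception as exc:
                results[name] = "failed: %s" % exc
        ok = all(v == "ok" for v in results.values())
        return jsonify(results), (200 if ok else 503)

    @app.route("/health/a")
    def health_a():
        # A's critical path: the database *and* the external service.
        return run_checks({"database": check_database,
                           "external_api": check_external_api})

    @app.route("/health/b")
    def health_b():
        # B's critical path: the database only.
        return run_checks({"database": check_database})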

------
klodolph
Consider randomizing error_timeout, to avoid a possible case where all clients
retry at the same time.

~~~
smoyer
We use this quite a bit, but we actually have a constant timeout with random
"jitter" added to it. You almost never want the timeout completely
randomized, because there's a chance you'll get a zero.
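
(A sketch of the constant-base-plus-jitter approach in Python; parameter
names are illustrative, not from any particular library:)

    import random

    def next_error_timeout(base=30.0, jitter=5.0):
        """A constant timeout plus random jitter, so clients don't all retry
        in lockstep, and the result can never be zero."""
        return base + random.uniform(0.0, jitter)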

~~~
klodolph
The concept of “completely randomizing” something isn’t really well-defined,
it’s more of a casual phrase that people use without thinking about it. If you
have a constant timeout with random jitter added to it, you’ve randomized it.
It is now random.

If you think about it, a “perfectly random” die roll on an ordinary die will
never be 0. So, whether a value is random has no relationship with whether
that value can be 0.

------
freshbob
The author seems to think completely in terms of "wasted utilization" when it
comes to timeouts. I think they are missing the point of the timeouts and the
retry logic to begin with. The effort by the circuit breaker isn't wasted,
because it is exactly trying to establish whether a resource is responding or
not, taking into account occasional network hiccups. If every effort past the
initial timeouts were wasted, then why implement this logic to begin with? I
agree with derefr
([https://news.ycombinator.com/item?id=22546241](https://news.ycombinator.com/item?id=22546241))
in the sense that it seems illogical to increase latency for users simply to
check for availability of a timed-out resource.

IMHO the worst-case assumption of all service instances failing simultaneously
leads the author astray in their quest to reduce "wasted utilization".

Pretend the network switch rebooted and all services were unavailable for a
short period of time, but your website is in high demand, so the error
threshold of three errors per resource was quickly reached. Let's say the
network switch needed 5 seconds to reboot; 42 resources each failing 3 times
in that window equals 126 requests per 5 seconds, or 25.2 requests/second.
Now, instead of quickly recovering from that state after two seconds, the
author advises waiting 30 seconds, so that's 756 requests---because your site
is so popular---before the first resource is retried, then an additional 41
requests (~1.6 seconds) until all resources are marked available again. So
now you've made about a thousand people unhappy, if it's their browsing
session that keeps getting lost. Unless of course you were too optimistic
when setting half_open_resource_timeout, because then your services might be
blocked for multiples of error_timeout, e.g. for minutes with a high
error_timeout value of 30 seconds. That's a lot more than a thousand people
unable to log in.
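
(A back-of-the-envelope check of those numbers in Python; all figures come
from the hypothetical scenario above, not from measurements:)

    resources, errors_each, outage_s = 42, 3, 5
    failures = resources * errors_each   # 126 failed requests in 5 seconds
    rate = failures / outage_s           # 25.2 requests/second
    error_timeout_s = 30
    blocked = rate * error_timeout_s     # 756 requests before the first retry
    tail_s = (resources - 1) / rate      # ~1.6 s until the rest are retried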

IMHO setting the half_open_resource_timeout way lower than the regular
service_timeout value will just risk the services _never_ becoming available
again after an internal network outage in your data center. That seems like a
recipe for disaster.

------
ofek
Good read!

This sounds similar to the work I did [1] on the Datadog Agent, especially
regarding the concept of each resource having its own circuit breaker.

My implementation is a bit different, though: it's based on exponential
decay, like BGP Route Flap Damping [2]. It matches our use case a bit better
and is easier to reason about.

[1]: [https://github.com/DataDog/datadog-agent/pull/1458](https://github.com/DataDog/datadog-agent/pull/1458)

[2]:
[https://tools.ietf.org/html/rfc2439#section-2.3](https://tools.ietf.org/html/rfc2439#section-2.3)
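
(A sketch of the exponential-decay idea in Python, in the spirit of RFC
2439's "figure of merit", not the actual Datadog Agent code: each error adds
to a penalty that decays with a half-life, and the breaker trips while the
penalty exceeds a threshold:)

    import time

    class DecayingBreaker:
        def __init__(self, half_life=30.0, trip_at=3.0):
            self.half_life = half_life   # seconds for the penalty to halve
            self.trip_at = trip_at
            self.penalty = 0.0
            self.last = time.monotonic()

        def _decay(self):
            now = time.monotonic()
            self.penalty *= 0.5 ** ((now - self.last) / self.half_life)
            self.last = now

        def record_error(self):
            self._decay()
            self.penalty += 1.0

        def is_open(self):
            self._decay()
            return self.penalty >= self.trip_at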

------
hhsuey
> Utilization, the percentage of a worker’s maximum available working
> capacity, increases indefinitely as the request queue builds up, resulting
> in a utilization graph like this.

Keyword: _indefinitely_. Isn't this assuming the service worker doesn't have
a timeout itself?

~~~
damianpolan
Even with a timeout, the time spent waiting on the timeout is wasted
utilization. So if the request rate stays the same and each request wastes
enough capacity waiting, work arrives faster than it can be worked off, and
the queue keeps growing.
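
(A tiny worked example of that point, with made-up numbers, assuming every
request is currently hitting the timeout:)

    request_rate = 50   # incoming requests per second
    timeout_s = 5       # each doomed request pins a worker for the full timeout
    waiting = request_rate * timeout_s   # 250 workers doing nothing but waiting
    # With fewer than 250 workers available, the queue grows without bound.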

------
myself248
This is apparently a software function known as a circuit breaker, and has
nothing to do with electrical current flow.

Fuckin' HN headlines, I swear.

~~~
dtech
It's called an analogy. It functions similarly to an electrical circuit
breaker because it disables systems before things go fully haywire; it helps
devs understand the concept by using familiar things.

~~~
hinkley
It behaves a lot more like a GFCI than a circuit breaker, though; apparently
nobody involved ever watched This Old House, or any other home improvement
show.

I would like these throttling tools to behave in a different way, one that
more closely resembles a circuit breaker. But some rat bastard has already
used that name so now I don't know what to call it.

