
Well, that’s not a “true” /health endpoint, then. A service’s /health endpoint should run through its regular, non-trivial code paths, and its success should depend on all of the services that normal requests to that service depend on. (You’ll probably need to write it yourself, rather than using one supplied by your application framework.)

For example, if you have a CRUD app fronting a database, your CRUD app’s /health endpoint should attempt to make a simple database query. If you have an ETL daemon that pulls from a third-party API and pushes to a message queue, it should probe both the readiness of the API and the message-queue before reporting its own readiness to work. (Of course, it is exactly in the case where the service has its own circuit-breaking logic with back-up paths “around” these dependencies, that it gets to say it’s healthy when its dependencies aren’t.)
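To make that concrete, here’s a minimal sketch in Go of the CRUD case: a /health handler that runs a real query against the database instead of returning an unconditional 200. (The driver choice, connection string, and timeout are placeholder assumptions, not a prescription.)

    package main

    import (
        "context"
        "database/sql"
        "net/http"
        "time"

        _ "github.com/lib/pq" // driver choice here is just an assumption
    )

    func healthHandler(db *sql.DB) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
            defer cancel()

            // A trivial query that still exercises the connection pool,
            // the network path, and the database itself.
            var one int
            if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
                http.Error(w, "database unreachable: "+err.Error(), http.StatusServiceUnavailable)
                return
            }
            w.Write([]byte("ok"))
        }
    }

    func main() {
        // Connection string is a placeholder.
        db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
        if err != nil {
            panic(err)
        }
        http.HandleFunc("/health", healthHandler(db))
        http.ListenAndServe(":8080", nil)
    }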

A test of the /health endpoint is, after all, what determines whether a service is considered viable for routing by load-balancers; and, conversely, whether the service is considered “stuck” and in need of a restart by platform auto-scaling logic. As such, your /health endpoints really should be biased toward false positives (reporting unhealthy when they’re really healthy) rather than false negatives (reporting healthy when they’re really unhealthy).

If you’ve got a pool of instances, better to have them paranoid of their own readiness in a way where your system will be constantly draining traffic from + restarting them, than to have them lackadaisical about their readiness in a way where they’re receiving traffic they can’t handle.




> Well, that’s not a “true” /health endpoint, then

You cannot make such a "true" health endpoint; it's super easy to make a service that contains a paradox about what such an endpoint should do.

Take 1 service with 2 endpoints, A and B. A relies on an external service and the database; A and B both rely on the database. What should the health endpoint do if the external service is down, taking A down with it? Either the health endpoint is useless, because A is down while it reports the service as fine, or you've cascaded the downtime to B for no reason. The same problem arises with a single endpoint whose dependencies vary by request branch, and so on.

Of course you can use a health endpoint for determining restarts, load-balancer membership, etc., but it's not a replacement for circuit breakers on your calls.


> you cannot make such a "true" health endpoint

Well, you can make such an endpoint, you already have. It's called...

Your endpoint.

The answer to the top level question is, "because it's easier, more accurate, and more maintainable to call a real endpoint than to try to maintain an endpoint whose sole purpose is to predict whether your other endpoints are actually working."

Aka: Just Ask.


> 1 service with 2 endpoints A and B. A relies on an external service and the database, A and B both rely on the database. What to do if the external service is down bringing A down?

Make one /health/a endpoint and one /health/b endpoint. Client-service A uses /health/a to check if the service is "healthy in terms of A's ability to use it." Client-service B likewise pings /health/b.
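Roughly, as a sketch in Go (the check functions, URLs, timeouts, and route names here are illustrative assumptions, just to show the split):

    package main

    import (
        "context"
        "database/sql"
        "net/http"
        "time"

        _ "github.com/lib/pq" // driver choice is an assumption
    )

    type check func(ctx context.Context) error

    func checkDB(db *sql.DB) check {
        return func(ctx context.Context) error {
            var one int
            return db.QueryRowContext(ctx, "SELECT 1").Scan(&one)
        }
    }

    func checkExternalAPI(url string) check {
        return func(ctx context.Context) error {
            req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
            if err != nil {
                return err
            }
            resp, err := http.DefaultClient.Do(req)
            if err != nil {
                return err
            }
            resp.Body.Close()
            return nil
        }
    }

    // healthFor composes just the checks that matter to one class of client.
    func healthFor(checks ...check) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
            defer cancel()
            for _, c := range checks {
                if err := c(ctx); err != nil {
                    http.Error(w, err.Error(), http.StatusServiceUnavailable)
                    return
                }
            }
            w.Write([]byte("ok"))
        }
    }

    func main() {
        db, _ := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
        // A needs the external service and the database; B needs only the database.
        http.HandleFunc("/health/a", healthFor(checkDB(db), checkExternalAPI("https://partner.example.com/status")))
        http.HandleFunc("/health/b", healthFor(checkDB(db)))
        http.ListenAndServe(":8080", nil)
    }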

In a scenario with many different dependent services (a crawler that can hit arbitrary third-party sites, say, or something like Hadoop that can load arbitrary not-necessarily-existent extensions per job), these endpoints should be something clients can create/register themselves, à la SQL stored procedures; or the service can offer a connection-oriented health-state streaming endpoint, where the client holds open a connection and is notified of readiness-state-change events as they occur.

But to be clear, these are edge-case considerations: in most cases, a service has only critical-path dependencies (which it needs to bootstrap itself, or to "do its job" in the most general SLA sense) and optional dependencies (which it doesn't strictly need, and around which it can offer degraded functionality via circuit-breaking when they're unavailable).

It's a rare—and IMHO not-well-factored—service that has dependencies that are on the critical path for some use-cases but not others. Such a service should probably be split into two or more services: a core service that all use-cases depend on; and then one or more services that each just do the things unique to a particular use-case, with all their dependencies being on the critical path to achieve their functionality of serving that specific use-case. Then, those use-case-specific services can be healthy or unhealthy.

An example of doing this factoring right: CouchDB. It has a core "database" service, but also a higher-level "view-document querying" service, that can go down if its critical-path dependency (a connection to a Javascript sandbox process) isn't met. Both "services" are contained in one binary and one process, but are presented as two separate collections of endpoints, each with their own health endpoint.

An example of doing this factoring wrong: Wordpress. It's got image thumbnailing! Comment spam filtering! CMS publication! All in one! And yet it's either monolithically "healthy" or "unhealthy"; "ready" or "not ready" to run. That is clearly ridiculous, right?


>Make one /health/a endpoint and one /health/b endpoint. Client-service A uses /health/a to check if the service is "healthy in terms of A's ability to use it." Client-service B likewise pings /health/b.

I've done exactly this, and it worked well in my case, where the number of related services was pretty small. Each endpoint would return an HTTP status code indicating overall health, with additional details stating exactly which checks succeeded or failed.
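The shape of that response can be pretty simple. A sketch in Go of what I mean (field names and the "any failure means 503" aggregation rule are assumptions; the real format could differ):

    package health

    import (
        "encoding/json"
        "net/http"
    )

    // checkResult is one line item in the health report.
    type checkResult struct {
        Name string `json:"name"`
        OK   bool   `json:"ok"`
        Err  string `json:"error,omitempty"`
    }

    // writeHealth derives the overall HTTP status from the individual checks
    // and puts the per-check detail in the body.
    func writeHealth(w http.ResponseWriter, results []checkResult) {
        status := http.StatusOK
        for _, res := range results {
            if !res.OK {
                status = http.StatusServiceUnavailable
                break
            }
        }
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(status)
        json.NewEncoder(w).Encode(results)
    }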


> And yet it's either monolithically "healthy" or "unhealthy"; "ready" or "not ready" to run. That is clearly ridiculous, right?

That's my point. If your editor cannot create new articles, that's a problem that needs to be resolved.

I probably wouldn't want all my editors smashing the submit button until it worked, leading to additional overload problems. So I open the "circuit breaker" for submission by sending an email to all the editors asking them to please stop submitting while WordPress is broken.

But shutting down the complete website would make the problem much worse because readers wouldn't see anything.

Conclusion: It's neither 100% healthy nor 100% unhealthy. Healthiness cannot be captured by a single boolean.


Sometimes it's not a good idea for the health of a service to be determined by its connected parts (e.g. databases). For purely situational awareness this is fine, but if you use the health check to determine whether an instance of an application should be taken out of service, you risk cascading failures, turning 1 problem into 10. It's usually better for the application to throw an error if it can't connect to the database. That said, I do both approaches depending on the situation.


Two failure modes here: the Request of Death, and cascading overload. If a request kills a particular server, you should let the error flow upstream rather than retrying it elsewhere; otherwise it will just bounce from server to server until it has killed all of them.

For the latter, someone related a real-world example of this to me the other day. Say you have a bunch of people managing customers. Every employee has 4 customers, and those take up all of their time.

You get a new customer. Instead of hiring a new rep, you give someone a 5th customer to manage. They struggle, and eventually they quit. Now, all of your employees have 5 customers. Sooner or later one of those will also quit, and then it's a race to see who can get out the door fastest.

The moral of that story is that all the load balancing in the world is for naught if you haven't done your capacity planning properly. And once the system starts to buckle it may be too late to bring new capacity online (since startup usually consumes more resources).


I mean... that's what circuit breakers are for. If a component of a service is optional to its operation, then the service wraps calls to that component in a circuit-breaker and fails just the requests that need it. And if a component is not optional to the operation of the service, then the failure should cascade to the service's dependent clients, and their dependent clients, and so on, so that there's backpressure all the way back to the origin of the request.
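A minimal sketch of that wrapping (a hand-rolled breaker in Go; the threshold, cooldown, and names are illustrative assumptions, and real implementations usually add a half-open probe state):

    package breaker

    import (
        "errors"
        "sync"
        "time"
    )

    var ErrOpen = errors.New("circuit open: dependency unavailable")

    type Breaker struct {
        mu        sync.Mutex
        failures  int
        threshold int
        openUntil time.Time
        cooldown  time.Duration
    }

    func New(threshold int, cooldown time.Duration) *Breaker {
        return &Breaker{threshold: threshold, cooldown: cooldown}
    }

    // Call runs fn unless the breaker is open; after `threshold` consecutive
    // failures it opens for `cooldown`, failing fast instead of piling load
    // onto a dependency that is already down.
    func (b *Breaker) Call(fn func() error) error {
        b.mu.Lock()
        if time.Now().Before(b.openUntil) {
            b.mu.Unlock()
            return ErrOpen // only the requests that need this dependency fail
        }
        b.mu.Unlock()

        err := fn()

        b.mu.Lock()
        defer b.mu.Unlock()
        if err != nil {
            b.failures++
            if b.failures >= b.threshold {
                b.openUntil = time.Now().Add(b.cooldown)
                b.failures = 0
            }
            return err
        }
        b.failures = 0
        return nil
    }

A caller would then do something like err := apiBreaker.Call(func() error { return callOptionalDependency(ctx) }) and serve the degraded path when it gets ErrOpen.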


Health checks can have several purposes. They're used by the routing control plane to determine inclusion in the load balancer pool. This is already a kind of circuit breaker and is similar to what an application-level circuit breaker would poll. But they're also used by the scheduler to determine whether the instance needs to be restarted. You don't want your thing in a restart loop just because a dependency is down! In fact, if a very widely shared dependency went down, and everyone was checking it in their health check, the scheduler control plane could quickly have a backlog measured in days trying to move all those instances.

Our environment now supports distinct answers to those two questions but most service authors don't know about it.
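For what it's worth, answering the two questions separately tends to look something like this sketch in Go (the /live and /ready names and the specific checks are assumptions, not whatever your environment's actual mechanism is):

    package main

    import (
        "context"
        "database/sql"
        "net/http"
        "time"

        _ "github.com/lib/pq" // driver choice is an assumption
    )

    func registerProbes(mux *http.ServeMux, db *sql.DB) {
        // Liveness: never consults dependencies, so a database outage can't
        // put the instance into a restart loop.
        mux.HandleFunc("/live", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("alive"))
        })

        // Readiness: does consult dependencies, so the routing control plane
        // stops sending traffic here while the database is unreachable.
        mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
            ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
            defer cancel()
            if err := db.PingContext(ctx); err != nil {
                http.Error(w, "not ready: "+err.Error(), http.StatusServiceUnavailable)
                return
            }
            w.Write([]byte("ready"))
        })
    }

    func main() {
        db, _ := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
        mux := http.NewServeMux()
        registerProbes(mux, db)
        http.ListenAndServe(":8080", mux)
    }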



