However, looking at the design, this still has a long way to go. There are a lot of failure modes you guys haven't encountered yet, which will result in a few design tweaks. For example, what happens if your health checkers decide to start reporting garbage data (e.g. maybe they are too overloaded to properly perform health checks)? Or when you have a query of death being issued? Also, things like traffic sloshing can very quickly build resonant failures in a system like this.
(Source: many years working on Google infrastructure, including causing outages related to load balancing code)
Good point on garbage data reporting; we do basic validation in synapse, here: https://github.com/airbnb/synapse/blob/master/lib/synapse/se...
We could probably do more there to ensure valid names, IPs and ports (matching against a regex should do it). Also, because of haproxy's built-in health checking, the mere presence of an invalid name in the list of machines doesn't mean we'll start sending traffic there.
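To make the regex idea concrete, here's a rough sketch of the kind of sanity checks that could run before a registration ever reaches the haproxy config. The function name and shape are hypothetical illustrations, not Synapse's actual API:

```python
import re

# Illustrative validation of a discovered backend. Rejecting garbage here
# means a misbehaving reporter can't even get an entry into the config.
HOSTNAME_RE = re.compile(
    r'^[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?'
    r'(\.[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$')
IPV4_RE = re.compile(r'^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$')

def valid_backend(name, host, port):
    """Reject obviously-garbage registrations before they hit haproxy."""
    if not HOSTNAME_RE.match(name):
        return False
    if not isinstance(port, int) or not 1 <= port <= 65535:
        return False
    m = IPV4_RE.match(host)
    if m:  # host given as a dotted quad: every octet must be 0-255
        return all(0 <= int(octet) <= 255 for octet in m.groups())
    return bool(HOSTNAME_RE.match(host))  # otherwise must be a valid hostname
```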
On the health checking/garbage data front:
It's usually more of a problem when something misreports a bad backend, rather than misreporting a good backend. The latter is easy to catch (as you mention, haproxy does it). The former is hard because one misbehaved health-checker can suddenly unload all of your services.
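One way to bound that "one bad checker unloads everything" failure is a fail-safe threshold on the consuming side: if a single update would drop more than some fraction of the current pool, distrust the update rather than the pool. This guard is my own illustration, not something Synapse or Nerve does:

```python
def apply_health_update(current_up, reported_up, max_drop_fraction=0.5):
    """Accept a new health view unless it would unload too much capacity.

    current_up: set of backends we currently consider healthy
    reported_up: set of backends the latest check reports healthy
    If the update would drop more than max_drop_fraction of the current
    pool, assume the checker (not the pool) is broken and keep the old view.
    """
    if not current_up:
        return reported_up  # nothing to protect yet
    dropped = len(current_up - reported_up)
    if dropped / len(current_up) > max_drop_fraction:
        return current_up   # fail safe: distrust mass die-off reports
    return reported_up
```

The obvious trade-off is that a genuine mass outage now needs manual intervention, which is why the threshold has to be chosen carefully.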
We're also doing a Tech Talk on SmartStack today at Airbnb HQ; stop by if you're in SF: https://www.airbnb.com/meetups/33925h2sx-tech-talk-smartstac...
Boot-strapping would of course be a problem, so you would still need a centralized nervous system for that, but you could survive its failure.
I haven't read much beyond the splash page, but HashiCorp generally does great work.
I'm curious though, what does the local developer environment look like when you run an SOA of this complexity? Does everyone need to run a series of Vagrant VMs/Docker containers to have a fully functional local version of the application running?
Another approach is creating a set of shared services that developers can use rather than deploying their own instances. This article by a LinkedIn engineer describes their internal Quick Deploy system:
Personally, though, I'd go the route of creating a self-contained Vagrant setup if possible. This helps folks become familiar with the entire system and fix bugs anywhere in it, rather than drawing strict lines of ownership around specific services.
We actually usually avoid SmartStack in dev. The rule is: your service always listens on its SmartStack port. So, for instance, search listens on port 5678 on its backends; in prod, consumers of search will find it on localhost at port 5678 via SmartStack. In dev, consumers will just find it at localhost port 5678 natively.
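In code, the convention means a consumer looks exactly the same in both environments. A minimal sketch (port 5678 and the helper are illustrative, not SmartStack's API):

```python
# The consumer always dials localhost on the service's well-known port and
# never knows whether haproxy (prod) or the service itself (dev) answers.
SEARCH_PORT = 5678  # search's fixed SmartStack port, same in dev and prod

def service_url(port, path):
    """Every service is reached at localhost on its fixed SmartStack port."""
    return f"http://localhost:{port}{path}"
```

In prod, Synapse points the haproxy frontend on localhost:5678 at the real search backends; in dev, the search process simply binds localhost:5678 itself.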
Sounds like this could be a whole blog post on its own. Over-complicating the local developer story is a big reason that we, and many other firms, have punted on SOA in the first place.
If we like what we see on a developer's machine, we can push it to prod & feel confident it will work with the currently-deployed services.
Yet another reason to run in a VPC, which includes internal-only ELBs as a very useful feature.
This serves to solve most of the issues that this toolkit provides for, but likely would not be the best option for everyone. Just some info on at least one other way to deal with this stuff!
Also, if the response to this is then 'but it's still a central point of failure', they haven't really removed that in this solution. If the zookeeper cluster dies you lose everything.
Generally if a clustered load-balancer dies and another takes over, there's half a second to a couple of seconds of transition, but you're back up and running with a very simple architecture. If all of your load-balancers die, you have something much bigger to worry about.
Really all this does is place the failure mode into a higher-level service with a much greater potential for failure. Zookeeper even makes you specify the size of your cluster and in my experience it's difficult to live update this. I've read they're working on that, but still.
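For context on the cluster-sizing complaint: the ensemble membership is baked into every node's config file, which is what makes live resizing awkward (dynamic reconfiguration only landed later, in ZooKeeper 3.5). A sketch, with illustrative hostnames:

```
# conf/zoo.cfg - the full server list appears in every node's config,
# so growing or shrinking a live ensemble means touching them all.
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```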
Clustered load-balancers (using Pacemaker/Corosync, keepalived or similar) are very well understood these days. Pacemaker/Corosync can even run within EC2 now, since a couple years ago they added unicast support thus obviating the multicast issues present within EC2.
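For the keepalived variant, the whole active/passive setup fits in a few lines of VRRP config. A minimal sketch; the interface name and addresses are illustrative, and `unicast_peer` is the bit that makes this work inside EC2, where multicast is unavailable:

```
# /etc/keepalived/keepalived.conf on the primary load balancer
vrrp_instance haproxy_vip {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101            # the backup node uses a lower priority, e.g. 100
    unicast_src_ip 10.0.0.10
    unicast_peer {
        10.0.0.11           # the backup load balancer
    }
    virtual_ipaddress {
        10.0.0.100          # the VIP clients connect to
    }
}
```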
Additionally, if we want to talk about load, a well-configured haproxy/nginx load-balancer can handle hundreds of thousands of connections a second. If your installation needs more than that, you could add a layer to distribute the load-balancing across a set of them. Obviously another problem to introduce, but not one you'll hit until you have even more traffic than Airbnb gets.
"The achilles heel of SmartStack is Zookeeper. Currently, the failure of our Zookeeper cluster will take out our entire infrastructure. Also, because of edge cases in the zk library we use, we may not even be handling the failure of a single ZK node properly at the moment."
With this approach everything will still work.
Everyone is doing it, but I never see details about how the services are wired together.
Do people use ESBs, or directly wire up services to each other?
I'm pretty sure there has never been an ESB that doesn't support HTTP.
My question is if people are actually using them to wire system together (outside corporate environments, where considerations are different).
For service to service communication, they've deployed client-side load balancing & discovery using per-client HAProxy instances. Rather than have every client polling every possible service for health, Zookeeper is used as an endpoint status repository. The HAProxy configs are kept up to date using a tool called "Synapse" that queries ZK. ZK is kept up-to-date by their own health check service, "Nerve".
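To make the config-writing step concrete, here's a rough sketch: given the (name, host, port) tuples read out of a service's Zookeeper znodes, emit an haproxy backend section. The layout is a simplification of what a Synapse-style writer produces, not Synapse's actual output:

```python
def render_haproxy_backend(service, servers):
    """Render an haproxy 'backend' section from discovered servers.

    servers: list of (name, host, port) tuples, e.g. as read from the
    service's Zookeeper znodes.
    """
    lines = [f"backend {service}",
             "    mode http",
             "    option httpchk GET /health"]
    for name, host, port in servers:
        # 'check' makes haproxy health-check each server itself, so a
        # stale or bogus registration doesn't immediately receive traffic
        lines.append(f"    server {name} {host}:{port} check")
    return "\n".join(lines)
```

The real tool would render this into haproxy.cfg alongside the frontend stanzas and reload haproxy whenever the ZK view changes.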
I've developed similar myself, using LVS, for a private Australian CDN. Nice to see the model generalised, robustified and open-sourced. There are possible issues relating to work levelling and spike management, but if it's working for AirBNB, great.
If you needed to describe this in an enterprise context, I'd tell them it's a distributed SOA broker. That's a gross oversimplification but the buzzword bingo'll satisfy most project managers.
For a middle-aged IT manager I'd say "it's like the Oracle Parallel Server client reliability model, only for web services rather than databases". Again an oversimplification, but they'd get the idea.