Hacker News new | past | comments | ask | show | jobs | submit login
Introducing SmartStack: Service Discovery in the Cloud (airbnb.com)
94 points by _harry on Oct 23, 2013 | hide | past | favorite | 37 comments

I'm glad to see more cluster management software getting open sourced, and this is sort of on the right track.

However, looking at the design, this still has a long way to go. There are a lot of failure modes you guys haven't encountered yet, which will result in a few design tweaks. For example, what happens if your health checkers decide to start reporting garbage data (e.g. maybe they are too overloaded to properly perform health checks)? Or when you have a query of death being issued? Also, things like traffic sloshing can very quickly build resonant failures in a system like this.

(Source: many years working on Google infrastructure, including causing outages related to load balancing code)

What is traffic sloshing?

Good point on garbage data reporting; we do basic validation in synapse, here: https://github.com/airbnb/synapse/blob/master/lib/synapse/se...

We could probably do more there to ensure valid names, IPs and ports (matching against a regex should do it). Also, because of the built-in health checking in haproxy, just the presence of some invalid name in the list of machines doesn't mean that we're going to try to start sending traffic there.

Traffic sloshing (basic overview): Say you have a pool of machines for a service (traditionally this problem is multi-regional, though it technically can happen at any scale). For some reason (machine restart, query of death, reloading, etc) a subset of your backends become unhealthy. This gets automatically detected by your framework, and the traffic gets routed to different machines. Now, you may have under-provisioned your backends (or you have a query of death), so this concentration of traffic on a smaller number of machines causes them to choke. You get a seesaw effect of traffic going around to the different backends, taking them out like a concentrated firehose. These failures all get detected by your framework, which routes traffic away from the backends. What you really wanted was a steady stream to all backends. A lot of load balancing systems have this failure mode. The good ones can detect it and converge back to a good steady state. The naive ones just keep the firehose spinning. It is harder to fall into this trap with simple binary health checking. It becomes a lot easier when you do traffic allocation by latency, or have more complicated health criteria that is easier to fail.

On the health checking/garbage data front: It's usually more of a problem when something misreports a bad backend, rather than misreporting a good backend. The latter is easy to catch (as you mention, haproxy does it). The former is hard because one misbehaved health-checker can suddenly unload all of your services.

That happened at work a few weeks ago---basically, one of our components did a health check of of DNS when the health check DNS record was mistakenly deleted. Because of the way the health check code was written, that component thought the DNS servers were down and shut down. That in turn, shut down other components that were health checking that component. Boom! Instant virtual dominoes.

Hi guys! I'm one of the primary authors of SmartStack. Happy to answer any questions that aren't covered in the blog post.

We're also doing a Tech Talk on SmartStack today at Airbnb HQ; stop by if you're in SF: https://www.airbnb.com/meetups/33925h2sx-tech-talk-smartstac...

As a thought experiment, what do you think of trying to make this completely de-centralized? Each service has its own synapse, and the services form a nerve for that service. They communicate status directly to the app servers, which form their own nerve.

Boot-strapping would of course be a problem, so you would still need a centralized nervous system for that, but you could survive its failure.

i think that another project that aims to address this was announced same day as smartstack: http://www.serfdom.io/

i haven't read much beyond the splash page, but hashicorp generally does great work.

Great post. This is interesting due to similar discussions we're having at work about moving from a monolithic Rails app architecture to an SOA.

I'm curious though, what does the local developer environment look like when you run an SOA of this complexity? Does everyone needs to run a series of Vagrant VMs/Docker containers to have a fully functional local version of the application running?

One approach is stubbing out the services you don't need. This article by a Heroku engineer describes how to do this at the Rack level:


Another approach is creating a set of shared services that developers can use rather than deploying their own instances. This article by a LinkedIn engineer describes their internal Quick Deploy system:


Personally, though, I'd go the route of creating a self-contained Vagrant setup if possible. This helps folks become familiar with the entire system and fix bugs anywhere in it, rather than drawing strict lines of ownership around specific services.

yeah we've used stubs in the past and it's often awkward and creates silos within the team. I'm not a huge fan of this approach.

At Airbnb, we've moved to a single Vagrant vm for our dev environment. We configure it using the same Chef code we use to build production -- the cookbook that installs search in production also installs it in dev.

We actually usually avoid SmartStack in dev. The rule is, your service always listens on it's SmartStack port. So, for instance, search listens on port 5678 on it's backends; in prod, consumers of search will find it on localhost at port 5678 via SmartStack. In dev, consumers will just find it at localhost port 5678 natively.

That makes a lot of sense. I know Yammer does something similar, albeit with Puppet instead of Chef.

Sounds like this could be a whole blog post on its own. Over-complicating the local developer story is a big reason that us and many other firms have punted on SOA in the first place.

One of the neat tricks with using the localhost HAProxy is that developer mode can just be a single Vagrant VM with all services configured to run on the port that they would have in Synapse. With HAProxy all the production servers think they are talking to localhost, and in development they actually are. The Vagrant VM can then be configured/reconfigured using Chef and the production cookbooks (with some overrides in the roles or environments). It should also be possible (though not entirely trivial) to run some of the services in the cloud for development, allowing developers to switch in and out the components that they need to actively develop.

For our product, each dev runs his own copy of the service. We have shared instances of r/w services (databases etc.), and use prod instances for read-only services provided by other teams.

If we like what we see on a developer's machine, we can push it to prod & feel confident it will work with the currently-deployed services.

> On AWS, you might be tempted to use ELB, but ELBs are terrible at internal load balancing because they only have public IPs.

Yet another reason to run in a VPC, which includes internal-only ELBs as a very useful feature.

Cool stuff I think though that much of this can be handled with other ways of doing things (although obviously there is never one right way of doing these kinds of things). This application kit is one way of orchestrating service/server discovery. Another way, which I have implemented personally is to use a combination of mcollective and puppet (with puppet facts enabled). This allows you to defined roles for specific systems and run tasks against servers of that specific role type, keep track of which servers are that role type, connect them to a 'central' load-balancer and many other things.

This serves to solve most of the issues that this toolkit provides for, but likely would not be the good option for everyone. Just some info on at least one other way to deal with this stuff!

Having a central load balancer is going to turn into a nightmare once you start managing a reasonable number of servers. Hardware goes bad (especially in the cloud), and having a single point of failure leaves you at it's mercy.

A load balancer should never be a single point of failure. You should always have multiples.

Also, if the response to this is then 'but it's still a central point of failure', they haven't really removed that in this solution. If the zookeeper cluster dies you lose everything.

Generally if a clustered load-balancer dies and another takes over there's a half second or a couple seconds of transition, but you're back up and running with a very simple architecture. If all of your load-balancers die, something much bigger to worry about is going on.

Really all this does is place the failure mode into a higher-level service with a much greater potential for failure. Zookeeper even makes you specify the size of your cluster and in my experience it's difficult to live update this. I've read they're working on that, but still.

Clustered load-balancers (using Pacemaker/Corosync, keepalived or similar) are very well understood these days. Pacemaker/Corosync can even run within EC2 now, since a couple years ago they added unicast support thus obviating the multicast issues present within EC2.

Additionally, if we want to talk about load then a well configured haproxy/Nginx load-balancer can handle hundreds of thousands of connections a second. If your installation needs more than this then I'm certain you could get a layer to distribute the load-balancing between a set. Obviously another problem to introduce, but still not one you'll reach until you probably have even more traffic than airbnb gets.

For accuracy it's worth pointing out that if Zookeeper dies only registration/deregistration goes down. The local HAProxy processes will continue to run, and applications will continue to be able to communicate with services. Zookeeper isn't a single point of failure for communication; it's just a registration service.

by the way keepalived >= 1.2.8 also supports vrrp over unicast

Indeed it does! I use it for less complicated fail-over situations myself, but if you start to need more complicated topologies (or something which has good interaction with IPVS/LVS) then Corosync!

ELB will actually do internal load balancing in a VPC using your own custom security groups. Doesn't help if you're not in a VPC, but nowadays the default is for everything to go in a VPC.

Unfortunately, Airbnb is a pretty legacy AWS client; when we started using it, VPC was a much less prominent feature. Migrating to a VPC is a very non-trivial process, especially if you have RDS instances outside the VPC

I'm in the same boat with most of my AWS workload, was created before IGW was available and keep having the same discussion about migrating :-)

I don't know... sounds like they just shifted the 'single point of failure' to the Zookeeper cluster. Is it somehow sexier to have your SPOF be things running Zookeeper instead of things running loadbalancer or DNS software?:

"The achilles heel of SmartStack is Zookeeper. Currently, the failure of our Zookeeper cluster will take out our entire infrastructure. Also, because of edge cases in the zk library we use, we may not even be handling the failure of a single ZK node properly at the moment."

If your load balancer/DNS goes down then every service won't be able to find other services.

With this approach everything will still work.

According to the quote I provided from the article, failure of the Zookeeper cluster will bring down their entire architecture.

I've been hearing good stuff about this for a while - it's awesome to see it open sourced! Setting up a local haproxy to handle service failover / discovery is a really clever solution, and is an awesome approach to encapsulating a bunch of really messy logic.

So.. SOA.

Everyone is doing it, but I never see details about how the services are wired together.

Do people use ESBs, or directly wire up services to each other?

Services are generally using HTTP/REST these days.

Yes. Most (all?) ESBs support (and indeed encourage) REST.

I'm pretty sure there has never been an ESB that doesn't support HTTP.

My question is if people are actually using them to wire system together (outside corporate environments, where considerations are different).

Can someone explain to me what this really is? I don't follow and yes I read the intro.

As far as I can tell, here's my summary of the architecture I just read.

For service to service communication, they've deployed client-side load balancing & discovery using per-client HAProxy instances. Rather than have every client polling every possible service for health, Zookeeper is used as an endpoint status repository. The HAProxy configs are kept up to date using a tool called "Synapse" that queries ZK. ZK is kept up-to-date by their own health check service, "Nerve".

I've developed similar myself, using LVS, for a private Australian CDN. Nice to see the model generalised, robustified and open-sourced. There are possible issues relating to work levelling and spike management, but if it's working for AirBNB, great.

If you needed to describe this in an enterprise context, I'd tell them it's a distributed SOA broker. That's a gross oversimplification but the buzzword bingo'll satisfy most project managers.

For a middle-aged IT manager I'd say "it's like the Oracle Parallel Server client reliability model, only for web services rather than databases". Again an oversimplification, but they'd get the idea.

With a service-oriented architecture (SOA) you basically break your application into distinct services that run across different machines. But you also need a way to find and connect to those services (service discovery). This provides a way of doing that using ZooKeeper and local HAProxy processes.

So I register a new service (or process) to run on metal using Chef. Then Chef installs the service and updates the nerve config running on that metal device. Nerve then keeps a central Zookeeper updated with the status of its local services? Synapse checks for available services in zookeeper that your app may need to use, and then configures a local load balancer for any requests you make?

That's the idea! If you're using Chef, check out the cookbook for SmartStack; you keep a small hash of configuration information per service, and we take care of the rest. https://github.com/airbnb/smartstack-cookbook


Applications are open for YC Winter 2022

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact