

Show HN: Baker Street – A simple client-side load balancer for microservices - rdli
http://bakerstreet.io/

======
rdli
We made this because we discovered that _lots_ of companies using
microservices have independently converged on this type of architecture for
load balancing (client-side HAProxy + integrated service discovery component +
health checks), but there wasn't a simple, easy-to-set-up, end-to-end solution
out there. (Our favorite, for the record, is Airbnb's SmartStack.) We'd love
some feedback and/or PRs and/or GitHub stars!

~~~
philsnow
Airbnb recently merged some great changes into SmartStack [0, 1] contributed
by Yelp, among them a reduction in the number of connections to zookeeper [2].

[0]
[https://github.com/airbnb/nerve/pull/71](https://github.com/airbnb/nerve/pull/71)
[1]
[https://github.com/airbnb/synapse/pull/130](https://github.com/airbnb/synapse/pull/130)
[2]
[https://github.com/Yelp/synapse/commit/82775562a35a89d60084f...](https://github.com/Yelp/synapse/commit/82775562a35a89d60084febb42af135a6362b52c)

~~~
rschloming
The changes that Yelp has made are great for SmartStack users, but you still
need to set up zookeeper in order to get going. Yelp is really pushing these
changes for the multi-datacenter use case. I suspect this is one area where
the strong consistency model of zookeeper is an even worse fit for service
discovery than it is within a single datacenter.

~~~
jolynch
To be honest, my favorite part of SmartStack is that you are not tied to a
single discovery backend or mechanism. Both Synapse and Nerve support custom
backends using whatever system you want (zookeeper, etcd, DNS, etc.). At the
end of the day, both just expose basic configuration files, and we exploit
that at Yelp to do pretty cool stuff, like letting multiple systems (e.g.
Marathon or Puppet) inform nerve/synapse about services, and controlling
service latency with a DSL that compiles down to those configuration files.
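
To make that concrete, here's a rough sketch of the kind of registration file
a producer might write for nerve. The field names approximate nerve's
documented JSON format, and the service details are invented:

    import json

    # Approximate nerve-style registration (see the nerve README for
    # the exact format); host, port, and paths here are made up.
    config = {
        "instance_id": "search-1",
        "services": {
            "search": {
                "host": "10.0.1.5",
                "port": 9090,
                "reporter_type": "zookeeper",
                "zk_hosts": ["zk1:2181", "zk2:2181", "zk3:2181"],
                "zk_path": "/nerve/services/search",
                "check_interval": 2,
                "checks": [{"type": "http", "uri": "/health"}],
            }
        },
    }

    # Anything that can write JSON (Marathon, Puppet, a DSL compiler,
    # ...) can produce this file and thereby participate in discovery.
    with open("nerve.conf.json", "w") as f:
        json.dump(config, f, indent=2)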

Just to clear something up: we have not found it necessary to run zookeeper
at a cross-datacenter level to get multi-datacenter support. We're still
working on writing up the details, but the general gist is: run zk in every
datacenter, then cross-register from a single nerve instance into multiple
datacenters. That's why we had to remove fast-fail from nerve: by its nature,
cross-datacenter communication is flaky. This approach has some tradeoffs,
however, as all approaches do.
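
Purely as illustration of the cross-registration idea (hypothetical hosts and
shapes, not Yelp's actual config): the single nerve instance carries one
registration per datacenter's zookeeper ensemble, so the service shows up in
every DC:

    # One local service, cross-registered into each datacenter's own
    # zookeeper cluster (all hosts and names are hypothetical).
    service = {"name": "search", "host": "10.0.1.5", "port": 9090}

    zk_ensembles = {
        "dc1": ["zk1.dc1:2181", "zk2.dc1:2181"],
        "dc2": ["zk1.dc2:2181", "zk2.dc2:2181"],
    }

    registrations = [
        {
            "host": service["host"],
            "port": service["port"],
            "reporter_type": "zookeeper",
            "zk_hosts": hosts,  # this DC's ensemble
            "zk_path": "/nerve/services/%s" % service["name"],
        }
        for hosts in zk_ensembles.values()
    ]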

All that being said, this is an interesting system and I look forward to more
mindshare in the area of service discovery!

~~~
rdli
Awesome, great to know the details (we'd heard about what you guys were doing
secondhand from Igor). Looking forward to more details whenever you post!

------
asherkin
This looks really handy - thanks! A word of warning though: TfL _aggressively_
protects the roundel trademark.

~~~
goatforce5
Yes. Came here to say this. Expect a nastygram from TfL, e.g.:

[http://www.theregister.co.uk/2006/03/16/tube_map_madness/](http://www.theregister.co.uk/2006/03/16/tube_map_madness/)

------
dcosson
This looks great! Looking forward to playing around with it.

I loved the idea of Airbnb's Synapse, but it's tricky to configure (you
basically have to write an haproxy config from scratch, plus learn how
synapse config sections map to the haproxy config). It also seemed like the
non-zookeeper backends were pretty unstable; I had to fix a few things to get
it working with EC2 tags (and fwiw, at this point it's been over a month and
my PR to merge the changes back upstream hasn't even been commented on).

How does Baker Street handle restarting haproxy? Does it do anything like
this [0] automatically to get zero-downtime configuration reloads?

[0] [http://engineeringblog.yelp.com/2015/04/true-zero-downtime-h...](http://engineeringblog.yelp.com/2015/04/true-zero-downtime-haproxy-reloads.html)

~~~
rschloming
Currently we use the restart procedure described in the haproxy manual. We
would like to get to true zero downtime, though; we've been looking both at
the method described in the post you mention and at possibly using nginx in
place of haproxy to achieve this.
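
For reference, the manual's procedure is a soft reload: start a new haproxy
process and hand it the old process's PID via -sf, so the old one drains its
in-flight connections and then exits. A minimal sketch, assuming the
conventional config and pidfile paths:

    import subprocess

    CONFIG = "/etc/haproxy/haproxy.cfg"  # assumed path
    PIDFILE = "/var/run/haproxy.pid"     # assumed path

    def soft_reload():
        # PIDs of the currently running haproxy process(es)
        with open(PIDFILE) as f:
            old_pids = f.read().split()
        # -sf: the new process tells the old ones to stop accepting
        # new connections, finish existing ones, and exit
        subprocess.check_call(
            ["haproxy", "-f", CONFIG, "-p", PIDFILE, "-sf"] + old_pids)

The gap the Yelp post tackles is the brief window during that handoff in
which new connections can get dropped.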

------
revertts
If I'm reading it right, the directory service today is a single host. That
was very misleading after these statements (which suggested something closer
to Netflix Eureka):

"Zookeeper provides a strongly consistent model; the directory service focuses
on availability."

"Baker Street doesn't use Zookeeper or the other popular service discovery
frameworks because we wanted a simple, highly available service, not a
strongly consistent one."

Edit: Which is not to say that the project isn't interesting, just that some
of the copy felt like a bait and switch. :)

~~~
eropple
I had the same reaction, and I have severe reservations about
availability-focused service location. The potential of firing traffic at
the wrong nodes and having it dropped on the floor is a real red flag for
me. When a consistency-focused directory service fails, it fails closed,
which lets an application, if not trivially, then at least _reliably_ cache
requests to be pushed later, once the health of the overall architecture can
be established.

~~~
rschloming
In a distributed architecture it is very difficult to avoid the possibility
you mention, even with a strongly consistent store at the center of your
service discovery mechanism. The consistency the store provides doesn't
necessarily extend to the operational state of your system.

For example, your zookeeper nodes may all be consistent with each other, but
given that a server can fail at any time, that information, while consistent,
may still be stale. Likewise, if a client is caching connections outside of
zookeeper's consensus mechanism, then those connections will also become
stale in the face of changes.

Given these possibilities, there is always the potential for traffic to be
dropped on the floor no matter how consistent your store is, so ultimately
what matters is minimizing the probability of this occurring and making sure
your system can cope when it does.

------
dmourati
Thanks for sharing. Sent this to my team as we were talking about this problem
just this afternoon.

Points for sudo nano in the install guide.

------
Omie6541
"Datawire Directory" => Mycroft

------
felixgallo
Curious: why choose something like Watson over haproxy's pretty solid
built-in health checking mechanisms?

~~~
rdli
We're running one HAProxy per application instance. HAProxy's built-in
health checking is designed for the case where it front-ends more than one
app instance (i.e., where it serves as a central proxy).

Watson checks the health of your local application and propagates it to the
(global) service discovery framework, so when other microservices want to
connect to the service that Watson is monitoring, they know whether or not
that service is available.
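
As a sketch of that division of labor (the URLs, port, and reporting protocol
here are all invented, not Watson's actual API): a local watchdog polls the
app's health endpoint and heartbeats to the directory only while the app
looks healthy, so remote consumers learn about failures without probing the
instance themselves:

    import time
    import urllib.request

    HEALTH_URL = "http://127.0.0.1:8080/health"   # local app (hypothetical)
    DIRECTORY_URL = "http://directory:5672/ping"  # directory (hypothetical)
    PERIOD = 3  # seconds between health checks

    def healthy():
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=1) as resp:
                return resp.status == 200
        except OSError:
            return False

    while True:
        if healthy():
            try:
                # heartbeat; if these stop, the directory expires us
                urllib.request.urlopen(DIRECTORY_URL,
                                       data=b"service=search", timeout=1)
            except OSError:
                pass  # directory unreachable; retry next round
        time.sleep(PERIOD)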

~~~
felixgallo
It seems like you're trading away haproxy as a centralized SPOF for
brand-new custom code, which is itself a centralized SPOF, and also adding a
new watchdog daemon to do what the haproxy instance would have done. It
would be interesting to understand what problem forced that complex
arrangement, because it's not obvious to me at the moment.

~~~
rdli
Three reasons:

1. In the central LB setup, if the LB server dies, your service dies. Not in
this setup (HAProxy is deployed side-by-side with each instance).

2. Elasticity. Imagine you have a shopping cart microservice, a search
microservice, and a users microservice. Each of these requires its own
HAProxy instance. Every time you spin up or spin down an instance of one of
these microservices, you need to reconfigure the central HAProxy to pay
attention to it.

3. Health checks don't work well with DNS. In the centralized load balancer
setup, you end up relying on DNS so that your users microservice can talk to
the shopping cart microservice (for example). DNS requires client polling and
has propagation delays, so if one of your shopping cart load balancers dies,
it takes time for all the other microservices to figure out where to
connect.

A central LB works well if you have just a single microservice. But when you
have dozens of them, you're suddenly managing dozens of load balancers
dynamically, and it gets pretty unwieldy at that point.
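
To illustrate what the client-side arrangement boils down to (service names,
ports, and addresses invented for the example): each instance runs a watcher
that rewrites its local haproxy config from whatever the directory currently
reports, and the app only ever talks to localhost:

    # Rewrite the local haproxy config from the directory's current
    # view of a service; all names and ports are illustrative.
    def write_local_config(endpoints, path="haproxy.cfg"):
        lines = [
            "listen shopping_cart",
            "    bind 127.0.0.1:9000",
            "    mode http",
        ]
        for i, (host, port) in enumerate(endpoints):
            lines.append("    server node%d %s:%d check" % (i, host, port))
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")

    # e.g. the directory currently reports two healthy instances:
    write_local_config([("10.0.1.5", 9090), ("10.0.1.6", 9090)])

The app then just connects to 127.0.0.1:9000 and never needs DNS to find its
peers.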

~~~
mark242
Except most people deploy dual HAProxy servers connected via heartbeat, which
share a floating IP. There's no SPOF. Deploying new microservices is as simple
as managing the HAProxy config file via Chef or Puppet.

~~~
rdli
You can definitely do that. There is more programming involved, since you
need to figure out how to tie new instance deployments into Chef/Puppet/etc.
to update HAProxy, and how to get it to update quickly. Finally, you'll need
to figure out how to automatically deploy your dual-HAProxy-with-heartbeat
setup every time you deploy a new type of microservice, and update DNS
appropriately. It just means that instead of deploying Baker Street as part
of your microservice push, you're deploying a) your microservice, b) your
dual HAProxy setup and/or new HAProxy config, and c) DNS.

Lots of ways to solve this problem; our big focus here is on simplicity. If
you have the Chef-fu and time to do all the above, you could definitely make
it work.

------
curiousjorge
So if an HTTP request comes in, how does it communicate with existing HTTP
microservices and know whether they're available or not? Does it do this by
polling?

I might actually give this a go, since I need to route HTTP requests to
hundreds of flask servers, but if they're busy I don't want to keep hitting
them.

~~~
rschloming
The way this works is described here:

[http://bakerstreet.io/docs/architecture.html](http://bakerstreet.io/docs/architecture.html)

