
Service discovery at Stripe - edwinwee
https://stri.pe/blog/service-discovery-at-stripe
======
qohen
Related: a detailed 50-minute talk about using Consul at scale (for service-
discovery) at DataDog, by Darron Froese. (The talk was given at SREcon16, on
04/08/16.)

 _We had many VMs in AWS - were ingesting millions of metrics per second - and
were having pain around service discovery and quick configuration changes.
This is the story of how we integrated Consul into our environment, what it
helped us with, mistakes we made and some tips for successful implementation
in your own environment. 10 months later, our growing cluster was using Consul
to facilitate 60 second cluster-wide configuration changes and make service
discovery simpler and more flexible._

Video (and audio):

[https://www.usenix.org/conference/srecon16/program/presentat...](https://www.usenix.org/conference/srecon16/program/presentation/froese)

Slides and code:

[https://blog.froese.org/2016/04/08/srecon-running-consul-at-...](https://blog.froese.org/2016/04/08/srecon-running-consul-at-scale/)

~~~
josegonzalez
This was a great talk. I shared it with our other ops engineers at SeatGeek
and it mirrored our experience exactly. Hell, we even ended up with similar
projects for writing out shared config files from Consul to get around early
Consul perf issues.

------
aaronharnly
Is it just me, or did they use Consul to build a Synapse[1] workalike?

[1] [https://github.com/airbnb/synapse](https://github.com/airbnb/synapse)

~~~
ebroder
(I work at Stripe, and helped with a lot of our early Consul rollout)

We did consider Synapse (and Nerve[1]; together the pair is what they call
SmartStack) when we were building out our service discovery, but went with an
alternative strategy for a couple of reasons.

Even though it's not directly in the line of fire, we weren't super excited
about having to run a reliable ZooKeeper deployment, and we weren't excited
about using ZooKeeper clients from Ruby, since the Java clients seem to be far
and away the most battle-hardened. (IIRC Synapse only supported ZooKeeper-based
discovery when it was initially released.)

We wanted a more dynamic service registration system than Synapse seems to be
designed for. Changing the list of services with Synapse seems to require
pushing out a config file change to all of your instances, which we wanted to
avoid.
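To make the contrast concrete: with Consul, bringing a new service into the
registry is an API call from the instance itself rather than a config push to
every consumer. A minimal sketch (the service name, port, and health endpoint
here are made up) of a service definition that an instance would PUT to its
local agent at `/v1/agent/service/register`:

```json
{
  "Name": "web",
  "Port": 8080,
  "Check": {
    "HTTP": "http://localhost:8080/health",
    "Interval": "10s"
  }
}
```

Consumers then pick up the new service through the catalog without any file
being pushed to them.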

Our configuration also looked less like SmartStack when we first rolled it out:
we were primarily focused on direct service-to-service communication without
going through a load balancer, and we expected to be querying the Consul API
directly. The shape of Consul felt like a better fit for what we were trying to
do. We've diverged from that over time as we learned more about how Consul and
our software interacted in production, and what we have now has ended up
looking more similar to SmartStack than what we started with.

There aren't a _ton_ of different ways to think about service discovery (and
more broadly, service-to-service communication). One of my coworkers wrote[2]
about this some in the context of Envoy[3], which also looks a lot like
SmartStack. It's not terribly surprising to me that a lot of them converge
over time - e.g. the insight of trading consistency for availability is key.

[1] [https://github.com/airbnb/nerve](https://github.com/airbnb/nerve) [2]
[http://lethain.com/envoy-design/](http://lethain.com/envoy-design/) [3]
[https://lyft.github.io/envoy/](https://lyft.github.io/envoy/)

~~~
jolynch
(I help maintain SmartStack)

I think it's really interesting that "what we've already got set up" is such a
big driver in which systems we pick. For example, in 2013 Yelp already had
hardened ZooKeeper setups and Consul didn't exist ... and when it did exist,
Consul was the new "oh gosh they implemented their own consensus protocol" kid
on the block, so we opted for what we felt was the safer option. To be honest,
I was also pretty worried about the Ruby ZK library, but it's been relatively
well behaved, aside from the sched_yield bug [1] occasionally causing Nerves to
infinite-loop while shutting down. We fixed that with a heartbeat and a
watchdog, so not too bad. Which technologies are available at which times
really drives large technical choices like this.

Consul Template is undeniably useful, especially when you start integrating it
with other HashiCorp products like Vault for rolling the SSL creds on all your
distributed HAProxies in real time. The whole HashiCorp ecosystem is a really
powerful set of free, off-the-shelf tools that are really easy to get going
with. I do think, however, that Synapse has some important benefits,
specifically around managing the dynamic HAProxy configs that have to run on
every host in your infra. For example, Synapse can remove dead servers ASAP
through the HAProxy stats socket after getting realtime ZK push notifications,
rather than relying on healthchecks (in production that's <~10s across the
fleet, which is crucial because if HAProxy healthchecked every 2s we'd kill our
backend services with healthcheck storms ... because we've totally done that
...). Synapse can also remember old servers so that temporary flakes don't
result in HAProxy reloads, and it can spread and jitter HAProxy restarts so
that the healthcheck storms have less impact, all while staying flexible about
the registration backend (Synapse supports any service registry that can
implement the interface [2]). That said, there are some pretty cool alternative
proxies to HAProxy out there, and one area where Consul is really doing well is
supporting arbitrary manifestations of service registration data via Consul
Template; SmartStack is still playing catch-up there, supporting only HAProxy
and JSON files (with arbitrary outputs on their way in [3]).
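For readers who haven't used it, a Consul Template template that renders
HAProxy server lines out of the registry looks roughly like this (the service
name "web" is made up, and real templates carry far more HAProxy tuning):

```
backend web
  mode http{{ range service "web" }}
  server {{ .Node }} {{ .Address }}:{{ .Port }} check{{ end }}
```

consul-template re-renders the file and can reload HAProxy whenever the
service's membership changes.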

I enjoyed the article, and thank you to the Stripe engineers for taking the
time to share your production experiences! I'm excited to see folks talking
about these kinds of real world production issues that you have to deal with
to build reliable service discovery.

[1] [https://github.com/zk-ruby/zk/issues/50](https://github.com/zk-ruby/zk/issues/50) [2]
[https://github.com/airbnb/synapse/blob/master/lib/synapse/se...](https://github.com/airbnb/synapse/blob/master/lib/synapse/service_watcher/README.md)
[3]
[https://github.com/airbnb/synapse/pull/203](https://github.com/airbnb/synapse/pull/203)

~~~
jacquesm
> We fixed that with a heartbeat and a watchdog, so not too bad.

I disagree. That's a band-aid solution, good for a short time while you figure
out the root cause and solve it for real.

~~~
jolynch
I respectfully disagree. I'm all for root cause analysis and taking the time
to fix things upstream, but I also think that it's easy to say that and hard
to actually do it.

Yelp doesn't make more money and our infra isn't particularly more
maintainable when I invest a few weeks debugging Ruby interpreter/library
bugs, especially not when there are thousands of other higher priority bugs I
could be determining the root cause of and fixing.

For context, we spent a few days trying to get a reproducible test case for a
proper report upstream, but the issue was so infrequent and hard to reproduce
that we made the call not to pursue it further and just mitigate it. I do
believe that mitigate rather than root cause is sometimes the right
engineering tradeoff.

~~~
jacquesm
A bug like that is something you want to squash because the cause might have
other unintended consequences that you are currently unaware of. To _assume_
that there are no other consequences is the error, and the only way to make
sure there are none is to identify the cause. This sort of sweeping things
under the carpet is what comes back to bite you long afterwards, either with
corrupted data or some other consequence.

Now, given the context it doesn't matter whether or not the company or the
product dies, so I can see where you're coming from, but in any serious
enterprise that would not be tolerated. Then again, when your code base already
has 'thousands of other higher priority bugs' it's a lost cause, point taken.
But at some level you have to wonder whether you have 'thousands of higher
priority bugs' _because_ there is such a cavalier attitude to fixing them in
the first place.

~~~
jolynch
> in any serious enterprise that would not be tolerated

I think that's a bit of a 'no true Scotsman' fallacy. We use a lot of software
we didn't write, and a lot of it has bugs. The languages we write code in have
bugs (e.g. Python has a fun bug where the interpreter hangs forever on import
[1]; we've hit this in production many times). Software we write has bugs and
scalability issues as well. We try really hard to squash them. We have design
reviews, code reviews, and strive to have good unit, integration, and
acceptance tests. There are still bugs.

I'm glad that there are some pieces of software written to such a high
standard that bugs are extremely rare (I think HAProxy is a great example of
such a project), but I know of very few in the real world.

[1] [https://bugs.python.org/issue14903](https://bugs.python.org/issue14903)

------
po
If Stripe people are around... Consul has a DNS interface built in[1], but I
get the impression you're not using that... are the DNS servers in your setup
just mirrors for the Consul one?

[1]
[https://www.consul.io/docs/agent/dns.html](https://www.consul.io/docs/agent/dns.html)

~~~
vemv
As explained, they wanted to trade consistency for availability. Surely using
Consul DNS would be less available than an independent cluster of DNS servers?

~~~
techman9
Why is this a given? Because of the potential of Raft failovers blocking reads
from Consul?

~~~
zackelan
By default[0] a Raft failover will block DNS reads, since Consul forwards all
DNS queries to the leader, making it a single point of failure.

However, Consul can be configured[1] to serve "eventually consistent" DNS
queries, allowing follower nodes to respond with possibly outdated information
during the few seconds of a Raft failover.

Where I work, we've also set it up so that BIND acts as a front end, caching
responses from Consul for up to 60 seconds.
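For reference, the agent config stanza[1] that enables the eventually
consistent mode looks something like the following (the max_stale value here is
illustrative; tune to taste):

```json
{
  "dns_config": {
    "allow_stale": true,
    "max_stale": "5s"
  }
}
```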

0: This apparently changed in 0.7, released last month, to make eventually
consistent reads the default. I assume the version of Consul that Stripe is
using predates this, as does the version my company runs in production.

1:
[https://www.consul.io/docs/agent/options.html#dns_config](https://www.consul.io/docs/agent/options.html#dns_config)

------
jonathanoliver
I'm investigating Consul right now, along with HAProxy and consul-template,
for a similar use case. One takeaway from this article seems to be that Consul
isn't quite ready for prime time. The concluding message feels like: Consul
works great ... except when it doesn't, so don't rely on it too heavily or
you'll be sorry. Did anyone else get that impression after reading the post?

------
doublerebel
Sure, there are alternatives to Consul but I don't think anything else has the
dedication to being nearly zero-config.

Consul+Fabio self configures all routing and consul-template+Vault handle
traditional tool config and secrets.

I find many of the other tools (nginx, caddy, zookeeper, etc) do too much, and
replicate functionality that is better provided by somewhere else in my stack.

------
bharatkhatri14
What are the DNS servers being used for? The request flow says that the
request hits your load balancer (HAProxy) which is updated every 60s using
Consul templates.

They also say that DNS records are updated every 60s using Consul templates.
But when are the DNS servers used for service discovery if external requests
always hit the load balancer (HAProxy)?

~~~
rco8786
I got the impression the DNS bit was legacy and being phased out. Could be
wrong though.

------
chairmanwow
As a graduating college senior, I find articles like these a necessity. The
content is extremely relevant to working in industry; they assume the average
reader knows how things like this work, but they aren't afraid to remind you
of the particulars. Ty for a great read!

------
jvns
this talk "Consensus Systems for the Skeptical Architect" by Camille Fournier
is also really interesting!
[http://www.ustream.tv/recorded/61483409](http://www.ustream.tv/recorded/61483409)

------
user5994461
> This problem of tracking changes around which boxes are available is called
> service discovery. We use a tool called Consul from HashiCorp to do service
> discovery.

Nope, that's called health checks, and that's a job for load balancers :p

------
NKCSS
Lol, "One amazing thing about using AWS is that our instances can go down at
any time, so we need to be prepared..."

~~~
alexbilbie
I agree somewhat with the comment - AWS really do encourage you to think about
your instances as cattle instead of pets and they provide lots of tooling to
help you out (auto scaling groups, launch configurations, load balancers,
etc).

Once you've got into this mindset it's much easier to design scalable and
redundant systems.

~~~
latch
Scaling is about design, not about running on slow, unstable and expensive
servers. If you want scale, you need to design software for your own needs.
And, unless you're building your own datacenter, chances are infrastructure
should be an enabler, not a pain point.

A noisy neighbour doesn't make you think about asynchronous processing. A
single network adapter doesn't result in a denormalized model.

If I sold you a $200K car with no safety features, would you think "this thing
is great, it puts you in a mindset to drive very safely"?

~~~
whataretensors
This metaphor falls apart pretty quickly. If you leased me a car that was
unsafe, except that when it crashed a new car appeared exactly where the last
one was and no one was hurt, I'd think it was fine.

------
slezakattack
Just out of curiosity, what is the difference between using Consul and SNMP?

~~~
tptacek
SNMP is a way of schema-structuring information, and a protocol for
(primarily) fetching that information.

Consul is a distributed database used (primarily) for informing services of
the availability of other services.

Consul could have used SNMP as its data format and access protocol, but it
instead uses JSON and HTTP.
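For example, a catalog query like `GET /v1/catalog/service/web` returns a JSON
array of instances, roughly of this shape (abridged; the field values are made
up):

```json
[
  {
    "Node": "node-1",
    "Address": "10.0.0.1",
    "ServiceName": "web",
    "ServicePort": 8080
  }
]
```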

If you were a masochist, you could layer SNMP on top of Consul, to expose
Consul data to network management systems built on top of SNMP. Nobody does
that, though, because SNMP is fundamentally obsolete, and kept relevant only
by network equipment manufacturers.

------
nodesocket
Consul is great, but couldn't you simply use NGINX Plus (paid) and utilize the
upstream_conf[1], state[2] and health_check[3] premium features?

Sorry for the long code block below, but it is a full working example that has
automatic health checks (adds and removes backends automatically) and a
dynamic backend configuration that is persisted to a state file.

    
    
        http {
            proxy_next_upstream error timeout http_502 http_504;

            match statusok {
                status 200;
                body ~ "200 OK";
            }

            upstream backend {
                state /etc/nginx/backend-state/backends;
                zone backend 256k;
                keepalive 20;
            }

            # note: server blocks must live inside the http block
            server {
                # dynamic upstream
                location = /upstream_conf {
                    upstream_conf;
                    allow 127.0.0.1;
                    deny all;
                }

                location / {
                    proxy_connect_timeout 15s;
                    proxy_http_version 1.1;
                    proxy_set_header Connection "";
                    proxy_set_header Host $host;
                    proxy_set_header X-Real-IP $remote_addr;
                    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                    proxy_set_header X-Forwarded-Proto $scheme;

                    proxy_pass http://backend;
                }

                # health check
                location /internal-health-check {
                    internal;

                    proxy_connect_timeout 15s;
                    proxy_http_version 1.1;
                    proxy_set_header Connection "";
                    proxy_set_header X-Real-IP $remote_addr;
                    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                    proxy_set_header X-Forwarded-Proto $scheme;

                    proxy_pass http://backend;
                    health_check interval=3s fails=3 passes=5 uri=/health match=statusok;
                }
            }
        }
    

[1] -
[https://nginx.org/en/docs/http/ngx_http_upstream_conf_module...](https://nginx.org/en/docs/http/ngx_http_upstream_conf_module.html)

[2] -
[http://nginx.org/en/docs/http/ngx_http_upstream_module.html#...](http://nginx.org/en/docs/http/ngx_http_upstream_module.html#state)
[3] -
[http://nginx.org/en/docs/http/ngx_http_upstream_module.html#...](http://nginx.org/en/docs/http/ngx_http_upstream_module.html#health_check)

~~~
josegonzalez
You still need to populate those backends somehow, and a simple HTTP
healthcheck doesn't show the entire picture around whether or not you should
forward requests to a service.

~~~
nodesocket
You populate backends using the dynamic upstream_conf which saves state to a
state file (see [1]).

    
    
        # add a backend (URL quoted so the shell doesn't treat & as a job separator)
        curl "http://127.0.0.1/upstream_conf?add=&upstream=backend&server=10.0.0.1:8000"

        # remove a backend
        curl "http://127.0.0.1/upstream_conf?remove=&upstream=backend&id=1"
    

[1] - [https://www.nginx.com/blog/dynamic-reconfiguration-with-
ngin...](https://www.nginx.com/blog/dynamic-reconfiguration-with-nginx-plus/)

~~~
josegonzalez
So now you have another bit in your deploy pipeline that needs to construct
curl requests vs declaring your services against a registrar. Can you share
these registrations across servers? I'd hate to have to poll for all my web
servers to register a new API process, and I'm at the small scale (~200
servers).

~~~
nodesocket
Put your NGINX state file on a shared disk (NFS, or something like Amazon EFS)
between all load balancers.

~~~
josegonzalez
At my company, every server has its own internal load balancer (previously
haproxy, now nginx). While definitely possible, I don't think sharing a single
file across hundreds of boxes is a "best practice".

That said, this certainly would work for companies with a ton of money to burn
(nginx+ licenses aren't cheap) or who are at smaller scales. If it works, ship
it!

~~~
nodesocket
You really shouldn't need more than a handful of load balancers (maybe one or
two in a few different availability zones for HA).

You can even pay Amazon and NGINX hourly using the NGINX Plus AMI[1].

[1] -
[https://aws.amazon.com/marketplace/pp/B00A04GAG4/](https://aws.amazon.com/marketplace/pp/B00A04GAG4/)

~~~
closeparen
One way to handle "service discovery" is to assign a distinct port to every
service, run an haproxy on every host, and always make cross-service calls
through localhost:port. Health checks and config management keep the haproxies
up to date, and applications don't have to know which hosts to ring.
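A sketch of what one such per-service local listener might look like in
HAProxy config (the service name, port, and addresses are all made up; the
server lines are what config management or a discovery agent would keep up to
date):

```
# apps call http://localhost:4000 to reach the payments service
listen payments
    bind 127.0.0.1:4000
    mode http
    option httpchk GET /health
    server payments-1 10.0.0.11:8080 check
    server payments-2 10.0.0.12:8080 check
```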

Having only a handful of load balancers forces all your internal traffic
through a handful of bottlenecks and points of failure, and you still need
custom logic in your applications to try another load-balancer when the
primary is dead.

------
Annatar
_We’re not listing these to complain about Consul, but rather to emphasize
that when using new technology, it’s important to roll it out slowly and be
cautious._

Rolling out arbitrary changes directly into production, however slowly or
cautiously, is indicative of a lack of process. There should be a standard
process for all changes which dictates that any component must go through
product testing and then product acceptance (PTA) phases; once both phases are
successful, deployment to production must only occur within the approved time
window specified in the change request. All of that is just one part of a
comprehensive change management process.

For without a process, no technology can save the situation. Processes, and
their constant revision and optimization are key. CMM level four and above.

