Service discovery at Stripe

qohen · on Nov 1, 2016

Related: a detailed 50-minute talk about using Consul at scale (for service-discovery) at DataDog, by Darron Froese. (The talk was given at SREcon16, on 04/08/16.)

We had many VMs in AWS - were ingesting millions of metrics per second - and were having pain around service discovery and quick configuration changes. This is the story of how we integrated Consul into our environment, what it helped us with, mistakes we made and some tips for successful implementation in your own environment. 10 months later, our growing cluster was using Consul to facilitate 60 second cluster-wide configuration changes and make service discovery simpler and more flexible.

Video (and audio):

https://www.usenix.org/conference/srecon16/program/presentat...

Slides and code:

https://blog.froese.org/2016/04/08/srecon-running-consul-at-...

josegonzalez · on Nov 1, 2016

This was a great talk. I shared it with our other ops engineers at SeatGeek and it mirrored our experience exactly. Hell, we even ended up with similar projects for writing out shared config files to Consul to get around early Consul perf issues.

aaronharnly · on Oct 31, 2016

Is it just me, or did they use Consul to build a Synapse[1] workalike?

[1] https://github.com/airbnb/synapse

ebroder · on Nov 1, 2016

(I work at Stripe, and helped with a lot of our early Consul rollout)

We did consider Synapse (and Nerve[1], which they used to call SmartStack when used together) when we were building out our service discovery, and went with an alternative strategy for a couple of reasons.

Even though it's not directly in the line of fire, we weren't super excited about having to run a reliable ZooKeeper deployment; and we weren't excited about using ZooKeeper clients from Ruby, since it seems like the Java clients are far and away the most battle hardened. (IIRC Synapse only supported ZooKeeper-based discovery when it was initially released)

We wanted a more dynamic service registration system than Synapse seems to be designed for. Changing the list of services with Synapse seems to require pushing out a config file change to all of your instances, which we wanted to avoid.

Our configuration also looked less like SmartStack when we first rolled it out - we were primarily focused on direct service-to-service communication without going through a load balancer, and we expected to be querying the Consul API directly. The shape of Consul felt like it better fit what we were trying to do. We've ended up diverging from that over time as we learned more about how Consul and our software interacted in production, and what we have now ended up looking more similar to SmartStack than what we started with.

There aren't a _ton_ of different ways to think about service discovery (and more broadly, service-to-service communication). One of my coworkers wrote[2] about this some in the context of Envoy[3], which also looks a lot like SmartStack. It's not terribly surprising to me that a lot of them converge over time - e.g. the insight of trading consistency for availability is key.

[1] https://github.com/airbnb/nerve [2] http://lethain.com/envoy-design/ [3] https://lyft.github.io/envoy/
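
For readers wondering what "querying the Consul API directly" can look like, here is a minimal sketch, not Stripe's code. It assumes the JSON shape returned by Consul's `/v1/health/service/<name>?passing` endpoint; the function name `healthy_backends` and the sample payload are made up for illustration, and no network call is made.

```python
import json
from typing import List, Tuple

def healthy_backends(health_response: str) -> List[Tuple[str, int]]:
    """Extract (address, port) pairs from a Consul health-endpoint response."""
    backends = []
    for entry in json.loads(health_response):
        service = entry["Service"]
        # Service.Address is empty when the service did not register its own
        # address; fall back to the address of the node it runs on.
        address = service.get("Address") or entry["Node"]["Address"]
        backends.append((address, service["Port"]))
    return sorted(backends)

# Hypothetical payload, trimmed to the fields the sketch reads.
SAMPLE = json.dumps([
    {"Node": {"Node": "web-1", "Address": "10.0.0.1"},
     "Service": {"Service": "api", "Address": "", "Port": 8000}},
    {"Node": {"Node": "web-2", "Address": "10.0.0.2"},
     "Service": {"Service": "api", "Address": "10.1.0.2", "Port": 8001}},
])

if __name__ == "__main__":
    print(healthy_backends(SAMPLE))
```

A client doing direct service-to-service calls would pick one of the returned pairs itself, which is exactly the load-balancing logic that moving to a SmartStack-like proxy layer takes out of the application.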

rdli · on Nov 3, 2016

I love hearing about the evolution of systems, and how different people have converged on the problem from different directions.

[plug] I wrote a post summarizing all that I have figured out about service discovery and wish I had known from the beginning:

https://www.datawire.io/guide/service-discovery-microservice...

jolynch · on Nov 1, 2016

(I help maintain SmartStack)

I think it's really interesting that "what we've already got set up" is such a big driver in which systems we pick. For example, in 2013 Yelp already had hardened ZooKeeper setups and Consul didn't exist ... and when it did exist Consul was the new "oh gosh they implemented their own consensus protocol" kid on the block, so we opted for what we felt was the safer option. I do have to be honest that I was also pretty worried about the Ruby ZK library, but it's been relatively well behaved, aside from the whole sched_yield bug [1] occasionally causing Nerves to infinite loop shutting down. We fixed that with a heartbeat and a watchdog, so not too bad. Which technologies are available at which times really drives large technical choices like this.

Consul template is undeniably useful, especially when you start integrating it with other HashiCorp products like Vault for real time rolling your SSL creds on all your distributed HAProxies. And I think that the whole HashiCorp ecosystem together is a really powerful set of free off the shelf tools that are really easy to get going with. I do think, however, that Synapse does have some important benefits, specifically around managing dynamic HAProxy configs that have to run on every host in your infra. For example, Synapse can remove dead servers ASAP through the HAProxy stats socket after getting realtime ZK push notifications rather than relying on healthchecks (in production <~10s across the fleet, which is crucial because if HAProxy healthchecked every 2s we'd kill our backend services with healthcheck storms ... because we've totally done that ...); it can try to remember old servers so that temporary flakes don't result in HAProxy reloads; and it can try to spread and jitter HAProxy restarts so that the healthcheck storms have less impact, all while having flexibility in the registration backend (Synapse supports any service registry that can implement the interface [2]). However, there are some pretty cool alternative proxies to HAProxy out there and one area that Consul is really doing well on is supporting arbitrary manifestations of service registration data using Consul template; SmartStack is still playing catch up there, supporting only HAProxy and JSON files (with arbitrary outputs on their way in [3]).

I enjoyed the article, and thank you to the Stripe engineers for taking the time to share your production experiences! I'm excited to see folks talking about these kinds of real world production issues that you have to deal with to build reliable service discovery.

[1] https://github.com/zk-ruby/zk/issues/50 [2] https://github.com/airbnb/synapse/blob/master/lib/synapse/se... [3] https://github.com/airbnb/synapse/pull/203
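
The "remember old servers" and "jitter restarts" ideas above can be sketched roughly as follows. This is an illustration of the general technique, not Synapse's implementation; the function names, the 300-second retention window, and the delay values are all assumptions.

```python
import random
from typing import Dict, Set

def merge_backends(known: Dict[str, float], live: Set[str], now: float,
                   keep_seconds: float = 300.0) -> Dict[str, float]:
    """Return server -> last-seen time, keeping servers seen recently."""
    # Servers that vanished only briefly stay in the config, so a flapping
    # registration does not change the server list.
    merged = {s: seen for s, seen in known.items() if now - seen <= keep_seconds}
    for server in live:
        merged[server] = now
    return merged

def needs_reload(old: Dict[str, float], new: Dict[str, float]) -> bool:
    """A proxy reload is only needed when the configured server set changes."""
    return set(old) != set(new)

def jittered_delay(base_seconds: float, jitter_seconds: float,
                   rng: random.Random) -> float:
    """Spread reloads across the fleet so health checks do not all restart at once."""
    return base_seconds + rng.uniform(0.0, jitter_seconds)

if __name__ == "__main__":
    state = {"10.0.0.1:8000": 100.0, "10.0.0.2:8000": 100.0}
    after_flap = merge_backends(state, {"10.0.0.1:8000"}, now=130.0)
    print(needs_reload(state, after_flap))
    print(jittered_delay(2.0, 3.0, random.Random()))
```

The trade-off is that a server that is really gone lingers in the config until the window expires, so this only makes sense alongside a faster removal path such as the stats socket mentioned above.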

jacquesm · on Nov 1, 2016

> We fixed that with a heartbeat and a watchdog, so not too bad.

I disagree. That's a band-aid solution, good for a short time while you figure out the root cause and solve it for real.

jolynch · on Nov 1, 2016

I respectfully disagree. I'm all for root cause analysis and taking the time to fix things upstream, but I also think that it's easy to say that and hard to actually do it.

Yelp doesn't make more money and our infra isn't particularly more maintainable when I invest a few weeks debugging Ruby interpreter/library bugs, especially not when there are thousands of other higher priority bugs I could be determining the root cause of and fixing.

For context, we spent a few days trying to get a reproducible test case for a proper report upstream, but the issue was so infrequent and hard to reproduce that we made the call not to pursue it further and just mitigate it. I do believe that mitigate rather than root cause is sometimes the right engineering tradeoff.
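
For anyone unfamiliar with the heartbeat-plus-watchdog mitigation being argued about here, a minimal sketch follows. It is not Yelp's code: the class and function names are hypothetical, and `restart` stands in for whatever a real supervisor would do to a hung process.

```python
import time
from typing import Callable

class Heartbeat:
    """Records the last time the worker loop proved it was still making progress."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        # Called by the worker once per successful loop iteration.
        self._last_beat = self._clock()

    def age(self) -> float:
        return self._clock() - self._last_beat

def watchdog_check(heartbeat: Heartbeat, max_age: float,
                   restart: Callable[[], None]) -> bool:
    """Call restart() and return True if the heartbeat has gone stale."""
    if heartbeat.age() > max_age:
        restart()
        return True
    return False

if __name__ == "__main__":
    hb = Heartbeat()
    hb.beat()
    print(watchdog_check(hb, max_age=30.0, restart=lambda: print("restarting")))
```

The watchdog never learns why the worker stopped beating, which is precisely the objection raised above: it bounds the damage of the hang without explaining it.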

jacquesm · on Nov 1, 2016

A bug like that is something that you want to squash because the cause might have other unintended consequences that you are currently unaware of. To assume that there are no other consequences is the error, and the only way to make sure there are not is to identify the cause. This sort of sweeping things under the carpet is what comes back to bite you a long time after, either with corrupted data or some other consequence.

Now, given the context it doesn't matter whether or not the company or the product dies so I can see where you're coming from but in any serious enterprise that would not be tolerated, but when your code base already has 'thousands of other higher priority bugs' it's a lost cause, point taken. But at some level you have to wonder whether you have 'thousands of higher priority bugs' because there is such a cavalier attitude to fixing them in the first place.

jolynch · on Nov 1, 2016

> in any serious enterprise that would not be tolerated

I think that's a bit of a true scotsman fallacy. We use a lot of software we didn't write, and a lot of it has bugs. The languages that we write code in have bugs (e.g. Python has a fun bug where the interpreter hangs forever on import [1]; we've hit this in production many times). Software we write has bugs and scalability issues as well. We try really hard to squash them. We have design reviews, code reviews, and strive to have good unit, integration, and acceptance tests. There are still bugs.

I'm glad that there are some pieces of software that are written to such high standard that bugs are extremely rare (I think that HAProxy is a great example of such a project), but I know of very few in the real world.

[1] https://bugs.python.org/issue14903

mrud · on Nov 1, 2016

No, it seems that they were building the same thing, except that instead of directly controlling and managing HAProxy (via socket or template), it will only create a new template.

po · on Nov 1, 2016

If Stripe people are around... Consul has a DNS interface built in[1], but I get the impression you're not using that... are the DNS servers in your setup just mirrors for the Consul one?

[1] https://www.consul.io/docs/agent/dns.html

vemv · on Nov 1, 2016

As explained, they wanted to trade off consistency for availability. Surely using Consul DNS would be less available than an independent cluster of DNS servers?

techman9 · on Nov 1, 2016

Why is this a given? Because of the potential of Raft failovers blocking reads from Consul?

zackelan · on Nov 1, 2016

By default[0] a Raft failover will block DNS reads, since Consul forwards all DNS queries to the master, making it a single-point-of-failure.

However, Consul can be configured[1] to serve "eventually consistent" DNS queries, allowing follower nodes to respond with possibly outdated information, during the few seconds of a Raft failover.

Where I work, we've also set it up so that BIND acts as a front-end, caching responses from Consul for up to 60 seconds.

0: This apparently changed in 0.7, released last month, to make eventually consistent reads the default. I assume the version of Consul that Stripe is using predates this, as does the version my company runs in production.

1: https://www.consul.io/docs/agent/options.html#dns_config
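
The caching front-end idea can be illustrated with a small sketch. This is not how BIND works internally; `CachingResolver` and its `lookup` callback are hypothetical, the 60-second TTL mirrors the figure above, and falling back to an expired entry when the upstream lookup fails is an extra assumption of this sketch rather than something the comment claims.

```python
import time
from typing import Callable, Dict, List, Tuple

class CachingResolver:
    """Caches lookup results for a fixed TTL in front of a slower or flakier source."""

    def __init__(self, lookup: Callable[[str], List[str]],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, name: str) -> List[str]:
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # fresh enough, do not hit the upstream
        try:
            addresses = self._lookup(name)
        except Exception:
            # Upstream unavailable (for example during a leader election):
            # prefer a possibly outdated answer over no answer at all.
            if cached is not None:
                return cached[1]
            raise
        self._cache[name] = (now, addresses)
        return addresses

if __name__ == "__main__":
    resolver = CachingResolver(lambda name: ["10.0.0.1"])
    print(resolver.resolve("api.service.consul"))
```

Answers can be up to one TTL out of date (longer while the upstream is down), which is the consistency being traded for availability in this subthread.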

jonathanoliver · on Nov 1, 2016

I'm investigating Consul right now along with HAProxy and consul-templates for a similar use case. One takeaway from this article seems to be that Consul isn't quite ready for prime time. The concluding message feels like Consul works great...except when it doesn't so don't rely on it too heavily or you'll be sorry. Did anyone else get that after reading the post?

doublerebel · on Nov 1, 2016

Sure, there are alternatives to Consul but I don't think anything else has the dedication to being nearly zero-config.

Consul+Fabio self configures all routing and consul-template+Vault handle traditional tool config and secrets.

I find many of the other tools (nginx, caddy, zookeeper, etc) do too much, and replicate functionality that is better provided by somewhere else in my stack.

bharatkhatri14 · on Nov 1, 2016

What are the DNS servers being used for? The request flow says that the request hits your load balancer (HAProxy) which is updated every 60s using Consul templates.

They also say that DNS records are updated every 60s using Consul templates. But when are the DNS servers used for service discovery if external requests always hit the load balancer (HAProxy)?

rco8786 · on Nov 1, 2016

I got the impression the DNS bit was legacy and being phased out. Could be wrong though.

chairmanwow · on Nov 1, 2016

As a graduating college senior, articles like these are a necessity. The content is extremely relevant to working in industry. They assume the average reader knows how things like this work, yet they aren't afraid to remind you of the particulars. Ty for a great read!

jvns · on Nov 1, 2016

this talk "Consensus Systems for the Skeptical Architect" by Camille Fournier is also really interesting! http://www.ustream.tv/recorded/61483409

user5994461 · on Nov 1, 2016

> This problem of tracking changes around which boxes are available is called service discovery. We use a tool called Consul from HashiCorp to do service discovery.

Nope, that's called health checks, and that's a job for load balancers :p

NKCSS · on Nov 1, 2016

Lol, "One amazing thing about using AWS is that our instances can go down at any time, so we need to be prepared..."

alexbilbie · on Nov 1, 2016

I agree somewhat with the comment - AWS really do encourage you to think about your instances as cattle instead of pets and they provide lots of tooling to help you out (auto scaling groups, launch configurations, load balancers, etc).

Once you've got into this mindset it's much easier to design scalable and redundant systems.

latch · on Nov 1, 2016

Scaling is about design, not about running on slow, unstable and expensive servers. If you want scale, you need to design software for your own needs. And, unless you're building your own datacenter, chances are infrastructure should be an enabler, not a pain point.

A noisy neighbour doesn't make you think about asynchronous processing. A single network adapter doesn't result in a denormalized model.

If I sold you a $200K car with no safety features, will you think "this thing is great, it puts you in a mindset to drive very safely"?

whataretensors · on Nov 2, 2016

This metaphor falls apart pretty quickly. If you leased me a car that was unsafe, except that when it crashed a new car appeared exactly where the last one was and no one was hurt, I'd think it was fine.

ones_and_zeros · on Nov 1, 2016

I took that to mean they use spot instances. It's basically equivalent to running chaos monkey in your environments a la Netflix and forces you to design for fault tolerance. Plus it's way cheaper.

slezakattack · on Oct 31, 2016

Just out of curiosity, what is the difference between using Consul and SNMP?

tptacek · on Nov 1, 2016

SNMP is a way of schema-structuring information, and a protocol for (primarily) fetching that information.

Consul is a distributed database used (primarily) for informing services of the availability of other services.

Consul could have used SNMP as its data format and access protocol, but it instead uses JSON and HTTP.

If you were a masochist, you could layer SNMP on top of Consul, to expose Consul data to network management systems built on top of SNMP. Nobody does that, though, because SNMP is fundamentally obsolete, and kept relevant only by network equipment manufacturers.

sagichmal · on Nov 1, 2016

The two have absolutely nothing to do with each other.

nodesocket · on Nov 1, 2016

Consul is great, but couldn't you just simply use NGINX Plus (paid) and utilize the upstream_conf[1], state[2] and healthcheck[3] premium features?

Sorry for the long code block below, but it is a full working example that has automatic health checks (adds and removes backends automatically) and a dynamic backend configuration that is persisted to a state file.

    http {
        proxy_next_upstream error timeout http_502 http_504;

        # a backend passes only if it returns 200 and the body matches "200 OK"
        match statusok {
            status 200;
            body ~ "200 OK";
        }

        # no static servers: the list lives in the state file
        upstream backend {
            state /etc/nginx/backend-state/backends;
            zone backend 256k;
            keepalive 20;
        }

        # server blocks must be nested inside the http context
        server {
            # dynamic upstream, reachable from localhost only
            location = /upstream_conf {
                upstream_conf;
                allow 127.0.0.1;
                deny all;
            }

            location / {
                proxy_connect_timeout 15s;
                proxy_http_version 1.1;
                proxy_set_header Connection "";
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Proto $scheme;

                proxy_pass http://backend;
            }

            # health check
            location /internal-health-check {
                internal;

                proxy_connect_timeout 15s;
                proxy_http_version 1.1;
                proxy_set_header Connection "";
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Proto $scheme;

                proxy_pass http://backend;
                health_check interval=3s fails=3 passes=5 uri=/health match=statusok;
            }
        }
    }

[1] - https://nginx.org/en/docs/http/ngx_http_upstream_conf_module...

[2] - http://nginx.org/en/docs/http/ngx_http_upstream_module.html#... [3] - http://nginx.org/en/docs/http/ngx_http_upstream_module.html#...

josegonzalez · on Nov 1, 2016

You still need to populate those back ends somehow, and a simple HTTP health check doesn't show the entire picture around whether or not you should forward requests to a service.

nodesocket · on Nov 1, 2016

You populate backends using the dynamic upstream_conf which saves state to a state file (see [1]).

    # add a backend (quote the URL so the shell does not treat & as "run in background")
    curl 'http://127.0.0.1/upstream_conf?add=&upstream=backend&server=10.0.0.1:8000'

    # remove a backend
    curl 'http://127.0.0.1/upstream_conf?remove=&upstream=backend&id=1'

[1] - https://www.nginx.com/blog/dynamic-reconfiguration-with-ngin...
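
If a deploy script rather than a person issues these requests, the URLs could be built programmatically. A minimal sketch, assuming only the `add`, `remove`, `upstream`, `server` and `id` query parameters shown in the curl commands above; the helper name `upstream_conf_url` is made up and nothing is actually sent.

```python
from urllib.parse import urlencode

def upstream_conf_url(base: str, action: str, upstream: str, **params: object) -> str:
    """Build an upstream_conf URL such as the ones used with curl above."""
    # The action ("add" or "remove") is passed as a parameter with an empty value.
    query = {action: "", "upstream": upstream}
    query.update(params)
    # Keep ":" unescaped so host:port reads the same as in the curl examples.
    return base + "/upstream_conf?" + urlencode(query, safe=":")

if __name__ == "__main__":
    print(upstream_conf_url("http://127.0.0.1", "add", "backend", server="10.0.0.1:8000"))
    print(upstream_conf_url("http://127.0.0.1", "remove", "backend", id=1))
```

Passing the result straight to an HTTP client also sidesteps the shell-quoting problem with `&` entirely.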

josegonzalez · on Nov 1, 2016

So now you have another bit in your deploy pipeline that needs to construct curl requests vs declaring your services against a registrar. Can you share these registrations across servers? I'd hate to have to poll for all my web servers to register a new API process, and I'm at the small scale (~200 servers).

nodesocket · on Nov 1, 2016

Put your NGINX state file on a shared disk (NFS) between all load balancers (something like Amazon EFS).

geofft · on Nov 1, 2016

Have you done this? Does it work? Does nginx actually close the file and re-open it every time it does a write, or does it mmap the file as would be sensible on local disk?

Even if nginx gets this right, now you're relying on the consistency of your shared disk implementation. Popular options include:

1. A single UNIX machine. Now you have a single point of failure for all your traffic. If you're okay with that, you can just do that and skip NFS. If you're okay home-brewing failover solutions for your former single point of failure and its backup, you can just do that and skip NFS.

2. A fancy cluster of filers that attempts to promise you distributed close-to-open consistency, and gets it right with very high but not 100% probability.

3. A fancy cluster of filers that relaxes some of the guarantees on NFS consistency, or only lets one person successfully open a file at once, or something.

4. Something opaque from Amazon, which could be any of the above options and you have no idea which, or something else entirely. Also, a single NFS export from Amazon EFS only runs within a single availability zone. If you're okay having a single AZ as a SPOF, again, you can skip NFS.

(My employer runs NFS at very large scale in production; basically everything in the company touches NFS one way or another, and we have lots of infrastructure to ensure availability and geographic redundancy in NFS. Every time it fails, things get weird, because application software rarely expects files to have the same problems distributed systems have. It's no more magic than any other distributed system, and possibly quite a bit less magic.)

josegonzalez · on Nov 1, 2016

At my company, every server has its own internal load balancer (previously haproxy, now nginx). While definitely possible, I don't think sharing a single file across hundreds of boxes is a "best practice".

That said, this certainly would work for companies with a ton of money to burn (nginx+ licenses aren't cheap) or who are at smaller scales. If it works, ship it!

nodesocket · on Nov 1, 2016

You really shouldn't need more than a handful of load balancers (maybe one or two in a few different availability zones for HA).

You can even pay Amazon and NGINX hourly using the NGINX Plus AMI[1].

[1] - https://aws.amazon.com/marketplace/pp/B00A04GAG4/

closeparen · on Nov 1, 2016

One way to handle "service discovery" is to assign a distinct port to every service, run an haproxy on every host, and always make cross-service calls through localhost:port. Health checks and config management keep the haproxies up to date, and applications don't have to know which hosts to ring.

Having only a handful of load balancers forces all your internal traffic through a handful of bottlenecks and points of failure, and you still need custom logic in your applications to try another load-balancer when the primary is dead.
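
The port-per-service scheme can be sketched as follows. This is an illustration under assumptions, not anyone's production tooling: the `LOCAL_PORTS` table, the service names, and both helpers are hypothetical, and the rendered text is only a fragment of an HAProxy config (no `global` or `defaults` section).

```python
from typing import Dict, List, Tuple

# Hypothetical fleet-wide table: every service gets one well-known local port.
LOCAL_PORTS: Dict[str, int] = {"api": 9001, "search": 9002}

def service_url(service: str, path: str = "/") -> str:
    """What an application calls: always localhost, never a real backend host."""
    return f"http://localhost:{LOCAL_PORTS[service]}{path}"

def render_listen(service: str, local_port: int,
                  backends: List[Tuple[str, int]]) -> str:
    """Render one HAProxy listen section for the local proxy on each host."""
    lines = [f"listen {service}", f"    bind 127.0.0.1:{local_port}"]
    # Sort so the output is stable and unchanged backends cause no config diff.
    for index, (host, port) in enumerate(sorted(backends)):
        lines.append(f"    server {service}-{index} {host}:{port} check")
    return "\n".join(lines)

if __name__ == "__main__":
    print(service_url("search", "/health"))
    print(render_listen("api", LOCAL_PORTS["api"],
                        [("10.0.0.2", 8000), ("10.0.0.1", 8000)]))
```

Config management would regenerate the fragment whenever the backend list changes, so applications only ever need the service-to-port table.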

snug · on Nov 1, 2016

It would actually be just as useful to use Consul with the upstream resolve feature available in NGINX Plus (paid).

A demo of this can be found here[0]

0: https://github.com/nginxinc/NGINX-Demos/tree/master/consul-d...

Annatar · on Nov 1, 2016

> We’re not listing these to complain about Consul, but rather to emphasize that when using new technology, it’s important to roll it out slowly and be cautious.

Rolling out arbitrary changes directly into production, however slowly or cautiously, is indicative of lack of process. There should be a standard process for all changes which dictates that any component must go through product testing, and then through product acceptance phases (PTA), and once both phases are successful, a deployment to production must only occur within the approved time window, specified in the change request. That all is just one part of a comprehensive change management process.

For without a process, no technology can save the situation. Processes, and their constant revision and optimization are key. CMM level four and above.