
Consul, a new tool for service discovery and configuration
http://www.hashicorp.com/blog/consul.html
======
avitzurel
Those guys are machines.

Usually when people release open source software, the documentation is
lacking, there's no website etc... those guys absolutely nail it every single
time.

Kudos to them, really!

------
samstokes
_Registered services and nodes can be queried using both a DNS interface as
well as an HTTP interface._

This is very cool. Integrating with a name resolution protocol that every
existing programmer and stack knows how to use (often without even thinking
about it) should lead to some magical "just works" moments.

~~~
mikebabineau
See also SkyDNS, another service discovery system:
[http://blog.gopheracademy.com/skydns](http://blog.gopheracademy.com/skydns)

In common with Consul:

* DNS interface

* Operates as a distributed cluster

* Uses Raft for consensus

~~~
armon
SkyDNS is compared to Consul here:
[http://www.consul.io/intro/vs/skydns.html](http://www.consul.io/intro/vs/skydns.html)

------
stormbrew
So I'm mostly curious why this isn't just basically serf 2.0. Looking at serf
I never really felt like it had much use in the basic form it took, with no
ability to advertise extra details about the nodes in a dynamic fashion.
Consul seems to build onto serf the things that serf needed to become really
useful, so it seems more like a successor to serf than a parallel project.

It seems like the right thing to do here would be to apply the lessons of
building consul and turn serf into something more like a library on which to
build other things, rather than a service in its own right.

~~~
mitchellh
To add to what Armon said: we always had Consul in mind when building Serf.
But Serf was a necessary building block on the way to Consul: highly
available, lightweight membership management. Every distributed system has a
membership problem, and we didn't want to reinvent the wheel for every system
we built. Serf provided an incredibly stable and well-proven foundation for
us to build Consul on top of, as well as future products that are already in
the works...

And while you may not see Serf as having much use, we've personally helped
deploy, and seen, Serf clusters with many thousands of nodes. Serf is very
useful to these
organizations for its purpose. And while some of these orgs are now looking at
Consul, many don't need Consul in the same way (but may deploy it separately).

We're not stopping with Consul. We have something more on the way. But we now
have some great building blocks and experience building distributed systems to
keep doing it correctly without having to rebuild everything from scratch.

~~~
stormbrew
It would be pretty awesome if you could get some case studies out about these
deployments of serf. I would really like to hear more about them.

------
addisonj
Very impressed.

This coalesces a lot of different ideas together into what seems to be a
really tight package to solve hard problems. In looking around at what most
companies are doing, even startupy types, architectures are becoming more
distributed and a (hopefully) solid tool for discovery and configuration seems
like a big step in the right direction.

------
noelwelsh
Looks like a very cool tool -- could replace Zookeeper with saner admin
requirements -- but I'm more interested in the tech. AP systems (such as Serf,
on which Consul is built) have many advantages and I think we're only just
beginning to see their adoption. I believe CRDTs are the missing ingredient to
restore sanity to inconsistent data. Add that and I can see a lot more such
systems being deployed in the future (and particularly in mine :-)

------
MechanicalTwerk
Seriously, who does design for HashiCorp? Their site designs, though similar,
always kill it.

~~~
cheeseprocedure
Agreed! I also appreciate that they tend to drop with robust documentation
pages already in place.

~~~
MechanicalTwerk
Yep. I'm kind of amazed at the level of polish in their initial releases given
that HashiCorp is what, 3 people? And they're up to 4 products now with Serf
being released just 6 months ago and Packer not long before that.

------
hardwaresofton
This is really awesome, distributed system techniques in the real world. I'm
really jealous of what they've managed to build.

I was planning to make a tool like this (smaller scale, one machine), and this
will certainly serve as a good guide on how to do it right (or whether I
should even bother at all).

I can't find a trace of a standard/included slick web interface for managing
the clusters and agents -- are they leaving this up to a 3rd party (by just
providing the HTTP API and seeing what people will do with it)? Is that a good
idea?

~~~
armon
We didn't manage to finish it in time, but it should be released in the next
two weeks or so!

~~~
hardwaresofton
I can only imagine how awesome it's going to be, given the excellent design
(simple, readable, content-focused) of the other work you guys are doing.

If I may ask, it seems like the design of the consul site is one step
(iteration) away from the serf site (particularly the docs pages -- some
subtle changes made a large difference)... I agree with the others here: I
really dig the site, and big text definitely doesn't hurt deeply technical
descriptions. The architecture page was very readable for me.

~~~
armon
You are right, the Consul site is very heavily based on the Serf site. It's
actually almost entirely just a re-skin of the same site. The design is a bit
cleaner however, which does make it nicer to read.

------
igor47
i am constantly impressed at the hashicorp guys, who continue to release great
tools. they actually released serf on the same day as we released nerve and
synapse, which comprise airbnb's service registration and discovery platform,
smartstack. see
[https://github.com/airbnb/nerve](https://github.com/airbnb/nerve) and
[https://github.com/airbnb/synapse](https://github.com/airbnb/synapse)

that said, as i wrote in my blog post on service discovery (
[http://nerds.airbnb.com/smartstack-service-discovery-
cloud/](http://nerds.airbnb.com/smartstack-service-discovery-cloud/) ), dns
does not make for the greatest interface to service discovery because many
apps and libraries cache DNS lookups.

an http interface might be safer, but then you have to build a connector for
this into every one of your apps.

i still feel that smartstack is a better approach because it is transparent.
haproxy also provides us with great introspection for what's happening in the
infrastructure -- who is talking to whom. we can analyse this both in our logs
via logstash and in real-time using datadog's haproxy monitoring integration,
and it's been invaluable.

however, this definitely deserves a look if you're interested in, for
instance, load-balancing UDP traffic

~~~
armon
I think that Synapse already provides a plugin architecture. It should be
trivial to use the Consul HTTP API as a Synapse plugin. All such a plugin
would need to do is bridge the information Consul has and set up an HAProxy
instance locally. If you want to go that route, it should be pretty simple!
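
(Something along these lines; this is only a rough sketch, assuming the
default local agent address and a made-up "web" service -- a real plugin would
use blocking queries and reload HAProxy rather than polling and rewriting a
file:)

    import json
    import time
    from urllib.request import urlopen

    CONSUL = "http://127.0.0.1:8500"   # default local agent address (assumption)
    SERVICE = "web"                     # hypothetical service name

    def instances():
        # /v1/catalog/service/<name> lists the nodes providing the service
        with urlopen("%s/v1/catalog/service/%s" % (CONSUL, SERVICE)) as resp:
            return json.load(resp)

    def render(insts):
        lines = ["backend %s" % SERVICE]
        for i, inst in enumerate(insts):
            lines.append("    server s%d %s:%d check"
                         % (i, inst["Address"], inst["ServicePort"]))
        return "\n".join(lines) + "\n"

    while True:
        with open("/etc/haproxy/consul-%s.cfg" % SERVICE, "w") as f:
            f.write(render(instances()))
        time.sleep(10)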

------
sagichmal
The underlying Raft implementation is brand new, and looks much improved on
the goraft used by etcd. Very impressed.

------
allengeorge
This is really impressive - kudos! I'm jealous - these guys are implementing
extremely cool stuff in the distributed systems arena :) (serf -
[http://serfdom.io](http://serfdom.io) \- comes to mind)

How much time did it take to put this together?

------
Loic
Short question: Can I define the IP of the service in the service definition?

From the service definition[0] it looks like the IP is always the IP of the
node hosting `/etc/consul.d/*` files. I am thinking about it in a scenario
where each service (running in a container) is getting an IP address on a
private network which is not the IP of the node.

[0]:
[http://www.consul.io/docs/agent/services.html](http://www.consul.io/docs/agent/services.html)

Update: An external service is possible:
[http://www.consul.io/docs/guides/external.html](http://www.consul.io/docs/guides/external.html)
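
(For anyone curious, a rough sketch of what that registration might look like
through the HTTP API; the node name, address, and port below are made up, and
the agent address assumes the default 127.0.0.1:8500:)

    import json
    from urllib.request import Request, urlopen

    # Register a service whose address is NOT the local node's IP
    # (e.g. a container on a private network). All values are hypothetical.
    payload = {
        "Node": "container-42",
        "Address": "10.99.0.7",
        "Service": {"Service": "web", "Port": 8080},
    }
    req = Request(
        "http://127.0.0.1:8500/v1/catalog/register",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
    )
    urlopen(req).close()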

------
contingencies
This sounds very impressive, but at the risk of breaking the chorus of
awesome: _what problem does this actually solve?_

Discovery: The consul page alleges that it provides a DNS-compatible
alternative for peer discovery, but it is unclear what improvements it offers
other than 'health checks', with the documentation leaving failure resolution
processes unspecified (as far as I can see), thus mandating a hyper-
simplistic architecture strategy like _run lots of redundant instances in case
one fails_. That's not very efficient. (It might be interesting to note that
at the ethernet level, IP addresses also provide MAC address discovery. If you
are serious about latency, floating IP ownership is generally far faster than
other solutions.)

Configuration: We already have many configuration management systems, with
many problems[1]. This is just a key/value store, and as such is not as
immediately portable to arbitrary services as existing approaches such as
"bunch-of-files", instead requiring overhead for each service launched in
order to make it function with this configuration model.

The use of the newer raft consensus algorithm is interesting, but consensus
does not a high availability cluster make. You also need elements like formal
inter-service dependency definition in order to have any hope of automatically
managing cluster state transitions required to recover from failures in non-
trivial topologies. Corosync/Pacemaker has this, Consul doesn't. Then there
are the potential split-brain issues resulting from non-redundant
communications paths... raft doesn't tackle this, as it's an algorithm only.
Simply put: given five nodes, one of which fails normally, if the remaining
four split into equal halves, who is the legitimate ruler? Game of thrones.

As _peterwwillis_ pointed out, for web-oriented cases, the same degree of
architectural flexibility and failure detection proposed under consul can be
achieved with significantly reduced complexity using traditional means like a
frontend proxy. For other services or people wanting serious HA clustering, I
would suggest looking elsewhere for the moment.

[1] [http://stani.sh/walter/pfcts](http://stani.sh/walter/pfcts)

~~~
agentS
For the record, Raft does specify what happens in that split-brain scenario.
Electing a leader requires a quorum (a majority of the original five nodes,
i.e. three), so with the surviving four split two and two, neither side can
elect a leader and writes will halt.

~~~
contingencies
OK, thanks for clarifying. Still, the point being illustrated was that a drop-
in solution is rarely feasible... ie. the common HA cluster feature of
redundant link-layer communications paths does add significant protection
against availability loss such as the situation you describe. It's not just a
software thing.

------
nemothekid
I'm VERY impressed, even more impressed by the fact that it speaks DNS. I do
wish, however, that it came with a "driver" option rather than running a
consul client (or even just a SkyDNS-like http option, although I'm unsure how
you would manage membership). That way you could just "include" consul in your
python/ruby/go application, and not have to worry about adding another service
to your chef/puppet config and running yet another daemon.

------
justinfranks
Consul really solves a large problem for most SaaS companies who run or plan
to run a hybrid-cloud, multi-cloud, or multi-datacenter environment.
------
opendais
This is slightly off topic, but I'm curious why none of the service discovery
tools run off of something like Cassandra as the datastore?

~~~
hardwaresofton
So I think here the problem isn't really the datastore, it's more the high-
availability and discovery. The main bonus that Consul seems to be providing
is maintaining a logical topology of your network without you doing much.

They do this by using a gossip-based protocol and a Paxos-like consensus algorithm called
Raft. These two things work together to essentially have the servers that run
your various services (whether api or db or cache or whatever) know about EACH
OTHER.

The database they use is LMDB, but I think they chose that for lightness --
you could easily replace it with a local instance of cassandra, most likely.

Also, I'm assuming you don't mean switching to a centralized cassandra
instance -- why you don't want to do that should be obvious (central point of
failure).

~~~
opendais
Wouldn't a centralized Cassandra cluster be reliable enough to meet that need?

I've never had a cluster completely collapse on me unless things were already
screwed up enough that Service Discovery was ultimately useless since nothing
else would work.

It just seems to me that losing your datastore makes your services
unusable...at which point 'discovering them' isn't really the issue. Instead,
everyone wants to introduce yet another datastore you need to rely on, whose
loss means you can't find anyone -- even if your services themselves are still
functional.

~~~
armon
To reiterate a bit on what hardwaresofton has already said, yes, you could
use Cassandra. However, Cassandra, like ZooKeeper, is just a building block.
It doesn't actually provide service discovery, health checking, etc. You could
use it to build all of those, but it doesn't provide them out of the box.

We compare Consul to ZooKeeper here, but much of that applies to Cassandra as
well:
[http://www.consul.io/intro/vs/zookeeper.html](http://www.consul.io/intro/vs/zookeeper.html)

Internally, Consul could also use something like Cassandra to store the data.
However, we use LMDB, an embedded database, to avoid an expensive context
switch out of the process, which lets us serve requests with lower latency and
higher throughput.

~~~
simonw
I hadn't seen LMDB before. How does it compare to LevelDB? On first glance it
looks like the two have similar characteristics.

~~~
hardwaresofton
Check out LMDB's page:

[http://symas.com/mdb/](http://symas.com/mdb/)

I dunno if leveldb supports ACID now (or has recently come to), but that's a
big difference.

------
dantiberian
How does this differ from [http://www.serfdom.io/](http://www.serfdom.io/),
another HashiCorp product?

~~~
auganov
[http://www.consul.io/intro/vs/serf.html](http://www.consul.io/intro/vs/serf.html)

~~~
stormbrew
"Serf is a service discovery and orchestration tool..."

"... However, Serf does not provide any high-level features such as service
discovery..."

Hm...

~~~
armon
That can probably be worded better. Serf operates at a node level abstraction,
while Consul operates at a service level. In fact, Serf is used in Consul to
power the discovery of other nodes.

------
Axsuul
Would anyone care to provide some real-world examples? I'm having a hard time
wrapping my head around what exactly this does.

~~~
mikebabineau
Service Discovery is about helping services find one another.

If your application relies on memcached, you need to pass the memcached
location to your application somehow. For simple architectures, this may just
be a hardcoded localhost:11211.

As you scale, it becomes prudent to distribute services across different
servers. Your configuration could then become something like
"server1.mycompany.com:11211". But what if memcached moves from server1 to
server2? You'll need to reconfigure and restart your application.

More sophisticated apps will often use a dynamic approach: services are
registered with something like ZooKeeper or etcd. When serviceA needs to talk
to serviceB, serviceA looks up serviceB's address in the service registry (or
a local cache) and makes the request.

The good news is that these often include basic health check functionality, so
you get a bit of fault tolerance for free. Unfortunately, this requires
services to integrate directly with ZooKeeper or etcd, adding undesired
complexity.

Some architectures therefore choose to use DNS as their service registry. But
instead of hardcoding the DNS address of a single node (like
"server1.mycompany.com"), they hit an address associated with the service
(serviceB.mycompany.com). This usually means rolling your own system to keep
DNS up to date (adding/pruning entries based on health state).

Consul is a hybrid approach. It allows you to use DNS as a service registry,
but operates as its own, distributed DNS server cluster. Think of it like a
specialized ZooKeeper cluster that exposes service information via DNS (and
HTTP, if you prefer).

Back to the memcached case. With Consul, you'd point your app at
"memcached.consul:11211". If your memcached server fell over and was replaced,
Consul would pick up the change and return the new address. And without any
app config changes or restarts.

~~~
lobster_johnson
I am trying to figure out how Consul meshes with a configuration management
system such as Puppet. There is a lot of overlap.

From what I can tell, Consul supports two registration mechanisms: statically
defined services in /etc/consul.d, and dynamically defined services through
the HTTP API.
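
(The static case is just a small JSON file dropped into /etc/consul.d --
roughly like the following, going by the services docs; the name, port, and
check script here are of course made up:)

    {
        "service": {
            "name": "myapp",
            "port": 8080,
            "check": {
                "script": "/usr/local/bin/check_myapp.sh",
                "interval": "10s"
            }
        }
    }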

For the statically-defined case, for any given node, you have to create Puppet
(or Chef, or whatever) definitions that populate /etc/consul.d with the stuff
that's going to run on that node. For the actual configuration itself, you
still want Puppet to be the one to populate it. The question then is what you
gain by doing this; if that configuration goes into Puppet, then Puppet is
still the main truth where you want to centralize things, so then you have
this flow of data:

    
    
        client <- DNS <- Consul <- /etc/consul.d <- Puppet
    

...compared to the "old" way:

    
    
        client <- /srv/myapp/myapp.conf-or-whatever <- Puppet
    

In this case, Consul's benefit comes from the fact that it can know which
services are alive and which aren't, so that when myapp needs otherapp, it
doesn't need a load-balancer to figure that out.

The documentation makes a point about Puppet updates being slow and
unsynchronized, and it's true that you get into situations where, for example,
service A is configured with hosts that aren't up yet. With Consul you can
update the config "live"; but surely you'd still want to centralize config in
Puppet and populate Consul's K/V from Puppet. Then you get the single-point-
of-update synchronization missing from Puppet, but you still need to store the
truth in Puppet.

So I'm counting two good, but not altogether mind-blowing benefits from using
Consul with Puppet, over not using Consul at all. The overlap is looking a lot
like two systems vaguely competing for dominance.

I suspect the better use of Consul is in conjunction with something like
Docker, where you ditch Puppet altogether (except as a way to update the host
OS), and instead build images of apps and services that don't contain any
configuration at all, but simply point themselves at Consul. That means that
when you bring up a new Docker container, it can start its Consul agent,
register its services, and suddenly its contained services are dynamically
available to the whole cluster.

The container itself contains no config, no context, just general-purpose
application/service code; and Consul doesn't need to be populated through
Puppet because, used that way, Consul is (in conjunction with some container
provisioning system) the application world's Puppet.

That, to me, sounds pretty nice.

------
djb_hackernews
Should something be happening with the bar data payload in the HTTP kv
example? Or is the value encoded for some reason?

~~~
armon
The value is base64 encoded, since JSON doesn't play nicely with binary values
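
(So a client reading the key back just decodes that field. A quick sketch,
assuming the default agent address and a hypothetical key named "foo":)

    import base64
    import json
    from urllib.request import urlopen

    # GET /v1/kv/<key> returns a JSON list; the raw bytes are in "Value",
    # base64-encoded.
    with urlopen("http://127.0.0.1:8500/v1/kv/foo") as resp:
        entry = json.load(resp)[0]

    print(base64.b64decode(entry["Value"]))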

------
peterwwillis
I'll preface these comments by saying that Consul appears to be the first
distributed cluster management tool i've seen in years that gets pretty much
everything right (I can't tell exactly what their consistency guarantees are;
I suppose it depends on the use case?).

What I will say, in my usually derisive fashion, is I can't tell why the
majority of businesses would need decentralized network services like this. If
you own your network, and you own all the resources in your network, and you
control how they operate, I can't think of a good reason you would need
services like this, other than a generalized want for dynamic scaling of a
service provider (which doesn't really work without your application being
designed for it, or an intermediary/backend application designed for it).

Load balancing an increase of requests by incrementally adding resources is
what most people want when they say they want to scale. You don't need
decentralized services to provide this. What do decentralized services
provide, then? "Resilience". In the face of a random failure of a node or
service, another one can take its place. Which is also accomplished with
either network or application central load balancing. What you don't get
[inherently] from decentralized services is load balancing; sending new
requests to some poor additional peer simply swamps it. To distribute the load
amongst all the available nodes, now you need a DHT or similar, and take a
slight penalty from the efficiency of the algorithm's misses/hits.

All the features that tools like this provide - a replicated key/value store,
health checks, auto discovery, network event triggers, service discovery, etc
- can be found in tools based on centralized services, while
remaining scalable. I guess my point is, before you run off to your boss
waving an iPad with Consul's website on it demanding to implement this new
technology, try to see if you _need it_ , or if you just think it's really
cool.

It's also kind of scary that the ability of an entire network like Consul's to
function depends on minimum numbers of nodes, quorums, leaders, etc. If you
believe the claims that the distributed network is inherently more robust than
a centralized one, you might not build it with fault-tolerant hardware or
monitor them adequately, resulting in a wild goose chase where you try to
determine if your app failures are due to the app server, the network, or one
piece of hardware that the network is randomly hopping between. Could a bad
switch port cause a leader to provide false consensus in the network? Could
the writes on one node basically never propagate to its peers due to similar
issues? How could you tell where the failure was if no health checks show red
flags? And is there logging of the inconsistent data/states?

~~~
mitchellh
Thanks for prefacing the constructive criticism with the compliment. :)

I want to clarify: Of all the buzz words Consul has, one thing Consul ISN'T is
decentralized. You must run at least one Consul server in a cluster. If you
want a fully centralized approach, you can just run one server. No big deal.
Of course, if that server goes down, reads/writes are unavailable. If you want
high availability, you run multiple servers. They elect a leader to determine
who will handle the writes, but that is about it.

It is "decentralized" in that you can send read/writes to any server, but
those servers actually just forward the requests onto the leader.

~~~
peterwwillis
Oh, well i'm a giant ass then. I saw "completely distributed" and extrapolated
decentralized from that. Still, an automatically-[randomly?]-elected leader in
a distributed network is a form of decentralized network. It's effectively a
toroidal network topology, since your datacenters make it multi-dimensional
(though now that I look at hypercubes i'm not exactly sure which is a better
fit here)

Now that i've re-read your architecture page, let me see if I understand this:
the basic point behind using Consul is to have multiple servers agree on the
result of a request, and communicate that agreement to a single node to write
it, and then return it to the client. So really it's a fault-tolerant
messaging platform that includes features that take advantage of such a
network; do I have that right?

Also, your docs say there are between three and five servers, but here you're
saying you only need one?

~~~
mitchellh
We recommend three or five because that will get you a high level of fault
tolerance. Consul runs just fine with one server, but you'll have no fault
tolerance: if that one server becomes unavailable for any reason, then
reads/writes for your cluster will not be served.

The leader is elected using Raft as the consensus algorithm (it is linked in
the internals section of the docs).

You are correct though that if you run multiple servers, the leader election
will happen automatically. There is no way to manually override it.

