
Serf: A decentralized solution for service discovery and orchestration - coffeejunk
http://www.serfdom.io
======
mitchellh
I'm jumping on a plane right now (a couple of hours) but I'd be happy to answer
any questions related to Serf once I land. Just leave them here and I'll give
it my best shot! We've dreamt of something like Serf for quite a while and I'm
glad it is now a reality.

Some recommended URLs if you're curious what the point is:

"What is Serf?"
[http://www.serfdom.io/intro/index.html](http://www.serfdom.io/intro/index.html)

"Use Cases" [http://www.serfdom.io/intro/use-
cases.html](http://www.serfdom.io/intro/use-cases.html)

Comparison to Other Software: [http://www.serfdom.io/intro/vs-other-
sw.html](http://www.serfdom.io/intro/vs-other-sw.html)

For the CS nerds, the internals/protocols/papers behind Serf:
[http://www.serfdom.io/docs/internals/gossip.html](http://www.serfdom.io/docs/internals/gossip.html)

Also, I apologize that the site isn't very mobile friendly right now.
Unfortunately, while I can write a Lamport clock implementation, CSS is just
crazytown.

~~~
lifeisstillgood
A meta question if you will - I have often come across situations at work
where "if only we had that tool". Sometimes I have hacked something together,
other times taken it further, tidied it up, and released it. But this project
seems to have a high level of polish.

so ...

When did you realise the need for Serf?

Did you work on it as a main project at some point, or is it a side project?

When and how did you decide to commit to getting this done?

and the big one for me - tools are driven by a need, but often the need keeps
coming while the time to build them diminishes. What strategies did you use to
keep the plates spinning while building Serf?

I think we can all mostly answer the questions - I just want to know how
different your answers are from, say, mine, given that I haven't released two
major OSS projects and you have.

cheers

~~~
mitchellh
Great questions. I'll answer each in turn.

I want to mention the "polish": I personally don't believe in releasing an
open source project without polish. If it is missing docs, it's just not
complete. If it is ugly, it is not complete. The technical aspects of Serf
were done weeks ago. Getting the human side of things done took another few
weeks (contracting designers and such).

> When did you realise the need for serf?

The need for something like Serf has existed since I started doing ops. Every
time I hit something that makes me ask myself "why is this so hard/crappy", I
write it down in my notebook for a future date. I then just think on the idea
for a while, and eventually, when I feel like I have a significantly better
solution than what is out there already, I build it.

I decided to start building Serf when @armon started throwing gossip protocol
academic papers at me. I realized he had figured it out; this was clearly
significantly better, so we started working on it.

> Did you work on it as a main project at some point or is it a side project?

To get it out the door we focused on it for some period of time. Now that it
has shipped, it is still what I would consider a "main project", but time is
split between various projects.

> When and how did you decide to commit to getting this done?

A few weeks ago. It took about a month to build. Building it is easy. Figuring
out WHAT to build... took a long time. I have to say I've had "service
orchestration/membership" in my notebook for years.

> What strategies did you use to keep the plates spinning while building Serf?

No good answer here, we just prioritize some things over others. Serf was our
top priority this month.

~~~
lifeisstillgood
Thank you - that "building was easy, compared to knowing what to build" put a
lot into perspective. And reaching out to external people to build the polish
is a surprise, but obvious in retrospect.

I am afraid that for such a helpful and clear answer, you get a mere 1 karma
point from me - but thank you.

------
WestCoastJustin
Looks like this might be loosely related to Mitchell Hashimoto [1], who makes
Vagrant.

[1] [https://twitter.com/mitchellh](https://twitter.com/mitchellh)

~~~
moondowner
Both are HashiCorp projects.

------
lambda
The title needs to be improved. "A decentralized, highly available, fault
tolerant solution..." for what? The title should include "for service
discovery and orchestration".

~~~
taterbase
Agreed, these taglines come off as buzzword soup instead of being informative.
I would love a quick scenario describing what Serf can help prevent or enable.

~~~
plainOldText
You might find this useful: [http://www.serfdom.io/intro/use-
cases.html](http://www.serfdom.io/intro/use-cases.html)

~~~
linker3000
Yes, that page helped me, but the front page should be the hook, and it left
me none the wiser.

------
philips
It seems Serf and etcd are both trying to solve service discovery but are
attacking it with different approaches. Which is pretty cool!

Serf looks to be eventually consistent and event driven, so you can figure out
who is up and send events to members. This gives you a lot of utility for use
cases like propagating information to DNS, load balancers, etc.

But you couldn't use Serf for something like master election or locks; you
would need etcd or ZooKeeper for that. Serf and etcd aren't mutually exclusive
systems in any sense; they're just solving different problems in different
ways.

They have a nice write-up on the page here: [http://www.serfdom.io/intro/vs-
zookeeper.html](http://www.serfdom.io/intro/vs-zookeeper.html)
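As an illustration of that event-driven style: Serf hands membership changes
to handler scripts, which can then push updates out to DNS, load balancers,
and so on. The contract sketched below (event name in a `SERF_EVENT`
environment variable, one affected member per line on stdin) is my loose
reading of the docs, and the exact line format here is an assumption, not
Serf's documented wire format:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parseMember splits a whitespace-separated member line into a node name and
// address. The two-field format is assumed for illustration.
func parseMember(line string) (name, addr string, ok bool) {
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return "", "", false
	}
	return fields[0], fields[1], true
}

func main() {
	// Serf-style handlers learn the event type from the environment.
	event := os.Getenv("SERF_EVENT")

	// Each line on stdin describes one member affected by the event.
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		if name, addr, ok := parseMember(scanner.Text()); ok {
			// A real handler would update a DNS zone or a load
			// balancer pool here; we just log the change.
			fmt.Printf("%s: %s at %s\n", event, name, addr)
		}
	}
}
```

A handler like this would be wired up via Serf's `-event-handler` flag; the
point is that reacting to membership changes is just stream processing.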

------
kevinpet
This would really benefit from a "how does this relate to ZooKeeper" section.
I think this is an entirely new service, with different technical insides,
trying to provide a higher-level solution to what people usually cobble
together with ZK.

But I'd be interested in comments from someone knowledgeable.

Edit: I see this is addressed at [http://www.serfdom.io/intro/vs-
zookeeper.html](http://www.serfdom.io/intro/vs-zookeeper.html) but it would be
nice to have something more "just the facts" rather than arguing that Serf is
good.

~~~
armon
In writing that section, we tried to provide "just the facts". If there is
anything that seems wrong or misleading in any way, we'd like to know so that
the page can be corrected. It is not our intention to say "Serf is good,
ZooKeeper is bad". They are very different tools, and we are just trying to
highlight the differences. In fact, we believe that the strongest use cases
involve using those tools together.

~~~
kevinpet
I didn't mean to say that it sounded like a sales pitch. What I meant is that
it's talking about relative strengths and weaknesses, whereas what I really
need in order to understand Serf is more along the lines of how the API/model
differs from ZK's notion of writing to or waiting on locations in the
distributed space.

------
burntsushi
The package documentation for the `serf` library[1] looks really exciting.
I've been wanting to make a distributed file synchronization tool, and perhaps
this would be an excellent library to build it on.

Question: as a relative networking idiot, how does NAT traversal fit into all
of this?

[1] -
[http://godoc.org/github.com/hashicorp/serf/serf](http://godoc.org/github.com/hashicorp/serf/serf)

~~~
armon
We designed the `serf` library to be easily embeddable, so hopefully it can be
of some use. Unfortunately, Serf does not currently make use of any sort of
NAT traversal. We've open sourced the project hoping to get the community
involved, and NAT traversal is something we'd gladly work with the community
to implement.

~~~
philips
The first building block, STUN, is implemented over here:
[https://github.com/ccding/go-stun](https://github.com/ccding/go-stun)

------
mjohan
How does security work in a system like this? If it is used in a shared
hosting system, can a user inject false messages into Serf with, for example,
PHP?

~~~
mitchellh
Yes, they can. In the general case it's not an issue, because usually your
nodes are inaccessible to the public, but in a shared hosting environment this
is entirely possible.

We're addressing this in the next release by signing/encrypting gossiped
messages. See the roadmap:
[http://www.serfdom.io/docs/roadmap.html](http://www.serfdom.io/docs/roadmap.html)
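The roadmap doesn't spell out the scheme, but signing along these lines
(HMAC-SHA256 over the gossip payload with a shared cluster key; the function
names and key handling here are purely illustrative, not Serf's actual design)
would let nodes drop injected messages:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// sign computes an HMAC-SHA256 tag over a gossip payload with a shared key.
func sign(key, payload []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	return mac.Sum(nil)
}

// verify reports whether tag authenticates payload under key, using a
// constant-time comparison to avoid timing leaks.
func verify(key, payload, tag []byte) bool {
	return hmac.Equal(sign(key, payload), tag)
}

func main() {
	key := []byte("shared-cluster-key")
	msg := []byte(`{"event":"member-join","node":"web-01"}`)
	tag := sign(key, msg)

	fmt.Println(verify(key, msg, tag))              // genuine message: true
	fmt.Println(verify(key, []byte("forged"), tag)) // injected message: false
}
```

A node without the shared key (say, a PHP script on a neighboring tenant)
couldn't produce a valid tag, so its messages would be discarded on receipt.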

------
nemothekid
This is actually very cool and is something incredibly handy for managing
failover/membership.

What I unfortunately don't understand is why there doesn't seem to be a
library I can use to take advantage of this in my own application. If I have a
program (in Go), am I expected to spin up my own Serf process and then
communicate with it via a socket? Is there an option to have Serf live inside
my application?

~~~
armon
The Serf executable is actually just a wrapper around the `serf` library. That
library is designed to be embedded in Go applications. Documentation for the
library is available here:
[http://godoc.org/github.com/hashicorp/serf/serf](http://godoc.org/github.com/hashicorp/serf/serf)
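A minimal embedding sketch, based on that godoc (the field and function names
reflect the library as of this writing; check the godoc for the current
signatures, and note this needs a reachable peer to join anything):

```go
package main

import (
	"log"

	"github.com/hashicorp/serf/serf"
)

func main() {
	// Start from the library's defaults and wire up an event channel so
	// the application hears about joins, leaves, and user events.
	events := make(chan serf.Event, 16)
	conf := serf.DefaultConfig()
	conf.EventCh = events

	s, err := serf.Create(conf)
	if err != nil {
		log.Fatal(err)
	}
	defer s.Shutdown()

	// Join an existing cluster via any known member's address.
	if _, err := s.Join([]string{"10.0.0.5"}, false); err != nil {
		log.Printf("join failed: %v", err)
	}

	// React to cluster events in-process, no external agent needed.
	for e := range events {
		log.Printf("serf event: %s", e)
	}
}
```

This is the same machinery the `serf` executable wraps; embedding it just
moves the event loop into your own process.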

------
peterwwillis
What happens to your cluster when your network experiences intermittent packet
loss and your random UDP messages get lost? Do nodes just start going down and
up randomly? (For those of you going "So what, that's normal": this is not a
quality of an HA system.)

~~~
armon
I'd highly recommend taking a look at this page:
[http://www.serfdom.io/docs/internals/gossip.html](http://www.serfdom.io/docs/internals/gossip.html).
One of the great attributes of the gossip protocol is that it is very robust
to intermittent network failures. Under minimal packet loss conditions (<5%),
rate of false positives should be very low. This is due to a few techniques,
one of which is indirect probing, and another is a novel "suspicion"
mechanism. In the case of a network partition, the parts of the cluster can
run in isolation and will recover when the partition heals. If you are
interested, the paper referenced there ("SWIM: Scalable Weakly-consistent
Infection-style Process Group Membership Protocol"), is the foundation of
Serf. In the paper you can find more details about the behavior of the
cluster, false positive rates under packet loss, and partition handling.

tl;dr the system is in fact designed with network errors in mind, as opposed
to handling them being an afterthought.

~~~
peterwwillis
What you're saying is that it's designed with the knowledge that it will
produce false positives, and basically doesn't work well under anything more
than minimal packet loss. I think this is an important factor to note in the
description (and I still fail to see how this is considered highly available
or fault tolerant, as described in the Intro pages).

~~~
armon
I think we are maybe just working with different definitions. High
Availability for Serf means that it can continue to handle changes in topology
and deliver user events in the face of node failures and network problems.
However, it is inevitable that there will be a degradation in its performance
given network failures. If there are serious packet loss issues, Serf will
mark a node as failed.

I'm not saying it "won't work well". It works as it is designed to. It will be
available for operations, it will automatically heal when the partition
recovers, and the state will be resynchronized with the "failed" nodes. The
system will be in an eventually consistent state, which is expressly
documented and is its normal mode of operation.

If you consider 5% packet loss "minimal", I'm not sure what applications you
are running. TCP degrades at over 0.1% packet loss, and most UDP streaming
protocols have serious degradation over 5%.

~~~
peterwwillis
I'm still confused. You mention resynchronizing when the "partition"
"recovers". First, can you clarify what a partition is? Second, can you define
"recovery"? I'm not worried about performance degradation; I'm worried about
nodes being marked down when they aren't down.

Please correct me if I'm wrong, but it sounds like this software only works
reliably when you have two sets of nodes that suddenly can't communicate _at
all_ and are eventually reconnected. Sometimes that does happen on a real
network, but often the cause of failures is intermittent and undetermined for
hours, days, or weeks. In that case, how would this program behave? Would
nodes keep appearing and disappearing, triggering floods of handler scripts,
loading boxes, and keeping services unavailable?

Yes, TCP performance does degrade under packet loss. It also continues to
operate (at well over 50% loss) and automatically tunes itself to regain
performance once degradation ends. And it does not present false positives.

It maintains its own state (ordered delivery), checks its own integrity,
stands up to Byzantine events (hacking), and is supported by any platform or
application. Unfortunately, due to its highly-available nature, it will
eventually report a failure to an application if one exists. But if latency is
more of a priority than reliability, UDP-based protocols are more useful.

If you're designing a distributed, decentralized, peer-to-peer network, that's
cool! But I personally wouldn't use one to support highly available network
services (which covers three of the five suggested use cases for Serf).

------
dugmartin
They should add "from the folks who brought you Vagrant" to the top of the
homepage.

------
totoy
Can we see Serf as a kind of Riak Core, but written in Go?

~~~
armon
Riak Core provides a superset of Serf's features. Riak Core uses gossip to
manage membership, but it also provides quorums for coordination, and it is
based around the notion of a hash ring and virtual nodes. You could instead
use Serf to build Riak Core-like technology on top.
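For context, a toy hash ring with virtual nodes (purely illustrative; Riak
Core's real ring is partitioned quite differently) looks like this: each
physical node owns several points on a hash ring, and a key belongs to the
first point at or after its hash.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"sort"
)

// Ring is a toy consistent-hash ring with virtual nodes.
type Ring struct {
	points []uint32          // sorted virtual-node hashes
	owner  map[uint32]string // virtual-node hash -> physical node
}

// hash maps a string to a 32-bit position on the ring.
func hash(s string) uint32 {
	sum := sha1.Sum([]byte(s))
	return binary.BigEndian.Uint32(sum[:4])
}

// NewRing places vnodes virtual points per physical node on the ring.
func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			h := hash(fmt.Sprintf("%s#%d", n, i))
			r.points = append(r.points, h)
			r.owner[h] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Lookup returns the node responsible for key: the first virtual point at or
// after the key's hash, wrapping around the ring.
func (r *Ring) Lookup(key string) string {
	h := hash(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"node-a", "node-b", "node-c"}, 8)
	fmt.Println(ring.Lookup("user:42"))
}
```

Serf would supply the membership half of this picture (which nodes exist);
the ring and quorum logic are what Riak Core layers on top.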

------
smandou
Highly available?

~~~
armon
Yes, the availability of the system is not tied to any given node(s). Any node
(or group of nodes) can continue to operate in the face of failure.

------
AsymetricCom
This isn't much of a solution when it involves putting a binary on every host.
Clearly, the best solution is a service framework at the platform level, not a
separate, unmodifiable blob thrown on your hardware.

