Some recommended URLs if you're curious what the point is:
"What is Serf?" http://www.serfdom.io/intro/index.html
"Use Cases" http://www.serfdom.io/intro/use-cases.html
"Comparison to Other Software" http://www.serfdom.io/intro/vs-other-sw.html
For the CS nerds, the internals/protocols/papers behind Serf: http://www.serfdom.io/docs/internals/gossip.html
Also, I apologize the site isn't very mobile friendly right now. Unfortunately, while I can write a Lamport clock implementation, CSS is just crazytown.
When did you realise the need for Serf?
Did you work on it as a main project at some point, or is it a side project?
When and how did you decide to commit to getting this done?
And the big one for me: tools are driven by a need, but often the need keeps coming while the time to build the tool diminishes. What strategies did you use to keep the plates spinning while building Serf?
I think we can all mostly answer these questions; I just want to know how different your answers are from, say, mine, given that I haven't released two major OSS projects and you have.
I want to mention the "polish": I personally don't believe in releasing an open source project without polish. If it is missing docs, it's just not complete. If it is ugly, it is not complete. The technical aspects of Serf were done weeks ago. Getting the human side of things done took another few weeks (contracting designers and such).
> When did you realise the need for Serf?
The need for something like Serf has existed since I started doing ops. Every time I hit something that makes me say to myself "why is this so hard/crappy?", I write it down in my notebook for a future date. I then just think on the idea for a while, and eventually, when I feel I have a significantly better solution than what is out there already, I build it.
I decided to start building Serf when @armon started throwing gossip protocol academic papers at me. I realized he had figured it out and that this was clearly significantly better, so we started working on it.
> Did you work on it as a main project at some point or is it a side project?
To get it out the door we focused on it for some period of time. Now that it has shipped, it is still what I would consider a "main project", but time is split between various projects.
> When and how did you decide to commit to getting this done?
A few weeks ago. It took about a month to build. Building it is easy. Figuring out WHAT to build... took a long time. I have to say I've had "service orchestration/membership" in my notebook for years.
> What strategies did you use to keep the plates spinning while building Serf?
No good answer here, we just prioritize some things over others. Serf was our top priority this month.
I am afraid that for such a helpful and clear answer, you get a mere 1 karma point from me - but thank you.
For that I (think I)'d need a system that will:
* start off with X nodes live in the load balancer
* trigger redeploys on half (or so) of them
* when half of the nodes have been redeployed, block, perform the database migration, and trigger redeploys for the remaining servers
* after the migration completes, switch the load balancer over to the now-updated half of the servers
Is this something that you could orchestrate with Serf? Or am I looking in the wrong direction here?
1) Send a "pre-deploy" event. Handler scripts use a random number generator to decide which group they are in ("flip a coin", basically).
2) Half the nodes should transition to the "left" state, do the deploy and rejoin the cluster.
3) Once this is done, trigger the migration.
4) Flip the LB to the nodes that did the Join/Leave (you can potentially distinguish them using different role names, or by tracking who left and joined)
5) Send the "post-deploy" event. The other half of the nodes should now deploy.
6) Update the LB to include everybody as the nodes leave/join
This is of course a rough sketch; it is certainly possible, if tricky, to build something like this.
from github, so yeah
Serf looks to be eventually consistent and event driven. So you can figure out who is up and send events to members. This gives you a lot of utility for the use cases of propagating information to DNS, load balancers, etc.
But you couldn't use Serf for something like master election or locks; you would need etcd or ZooKeeper for that. Serf and etcd aren't mutually exclusive systems in any sense; they're just solving different problems in different ways.
They have a nice write-up on the page here: http://www.serfdom.io/intro/vs-zookeeper.html
But I'd be interested in comments from someone knowledgeable.
Edit: I see this is addressed at http://www.serfdom.io/intro/vs-zookeeper.html, but it would be nice to have something more "just the facts" rather than arguing that Serf is good.
Question: as a relative networking idiot, how does NAT traversal fit into all of this?
 - http://godoc.org/github.com/hashicorp/serf/serf
We're addressing this in the next release by signing/encrypting gossiped messages. See the roadmap: http://www.serfdom.io/docs/roadmap.html
What I unfortunately don't understand is that there doesn't seem to be a library I can use to take advantage of this in my own application. If I have a program (in Go), am I expected to spin up my own Serf process and then communicate with it via a socket? Is there an option for me to have Serf live inside my application?
tl;dr the system is in fact designed with network errors in mind, as opposed to handling them being an afterthought.
I'm not saying it "won't work well". It works as it is designed to. It will be available for operations, it will automatically heal when the partition recovers, and the state will be resynchronized with the "failed" nodes. The system will be in an eventually consistent state, which is expressly documented and is its normal mode of operation.
If you consider 5% packet loss "minimal", I'm not sure what applications you are running. TCP degrades at over 0.1% packet loss, and most UDP streaming protocols have serious degradation over 5%.
Please correct me if I'm wrong, but it sounds like this software only works reliably when you have two sets of nodes that suddenly can't communicate at all and are eventually reconnected. Sometimes that does happen on a real network, but often the cause of failure is intermittent and undetermined for hours, days, or weeks. In that case, how would this program behave? Would network nodes keep appearing and disappearing, triggering floods of handler scripts, loading up boxes and keeping services unavailable?
Yes, TCP performance does degrade under packet loss. It also continues to operate (at well over 50% loss) and automatically tunes itself to regain performance once the degradation ends. And it does not present false positives.
It maintains its own state (ordered delivery), checks its own integrity, stands up to Byzantine events (hacking), and is supported by any platform or application. Unfortunately, due to its highly-available nature, it will eventually report a failure to an application if one exists. But if latency is more of a priority than reliability, UDP-based protocols are more useful.
If you're designing a distributed, decentralized, peer-to-peer network, that's cool! But I personally wouldn't use one to support highly-available network services (which covers three of the five suggested use cases for Serf).