
Swarm vs. Fleet vs. Kubernetes vs. Mesos - amouat
http://radar.oreilly.com/2015/10/swarm-v-fleet-v-kubernetes-v-mesos.html
======
krenoten
This is a pretty balanced article. Absent is Nomad, which claims the world, but
when you dig into the code, large advertised chunks simply aren't there yet.
Nomad seems like a much more straightforward implementation of the Borg paper,
and it may become interesting one day once they write the rest of it. A nice
Kubernetes feature, similar to what you can do with fleet, is the “Daemon Set”,
which lets you run certain things on every node (a minimal manifest sketch
follows the list below). Some cool Mesos features that are pretty new and
haven't been talked about much yet:

* persistent volumes: let frameworks get their directories on agent machines back after an availability event, which is nice for replicated stateful services

* maintenance primitives: schedule machines to go offline at certain times, and ask frameworks whether it's safe to take nodes offline. This will soon be used for stateful services, so that they can vote on when it's safe to take out a replica and trigger proactive re-replication when maintenance is desired.

* oversubscription: if an agent has given away all of its resources but detects that some CPU is still unutilized, it can start "revocable tasks" to fill the slack, up until they start interfering with existing workloads.
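For the curious, a minimal Daemon Set manifest looks roughly like this (it currently lives in the extensions/v1beta1 API group; the names and image are placeholders), and it gives you one copy of the pod on every node:

```yaml
# Runs one log-agent pod on every node in the cluster.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: log-agent
spec:
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      containers:
      - name: log-agent
        image: example/log-agent:latest
```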

------
jacques_chester
Cloud Foundry has a scheduler called Diego[1], which is now in public beta on
PWS. Because it's built for Cloud Foundry, you get the whole PaaS as well. No
need to roll your own service injection, staging, logging, authentication and
so on: it's already built and integrated.

For me, the cleverest part about Diego is that it turns placement into a
distributed problem through an auction mechanism. Other attempts at this
focus on algorithmic approaches that assume perfect information. Diego instead
asks cells to bid on incoming tasks and processes. The auction closes after a
fixed time and the task or process is sent to the best bidder. This greatly
reduces the need for perfect central consistency when performing bin-packing
optimisation -- in a real environment, that turns out to matter a lot.
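To make the idea concrete, here's a toy sketch in Go of that auction shape: collect bids for a fixed window, then place the task on the best bidder. All names are invented; this is the general pattern, not Diego's actual protocol.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"time"
)

// Bid is a cell's claim on a task; a lower score wins (in practice the
// score might be derived from remaining memory, disk, and container slots).
type Bid struct {
	Cell  string
	Score float64
}

// runAuction collects bids until the window closes, then returns the
// best bidder. No central view of cluster capacity is needed: each
// cell reports only its own state via its bid.
func runAuction(bids <-chan Bid, window time.Duration) (Bid, error) {
	deadline := time.After(window)
	best := Bid{Score: math.Inf(1)}
	for {
		select {
		case b := <-bids:
			if b.Score < best.Score {
				best = b
			}
		case <-deadline:
			if math.IsInf(best.Score, 1) {
				return Bid{}, errors.New("no bids before the auction closed")
			}
			return best, nil
		}
	}
}

func main() {
	bids := make(chan Bid, 2)
	bids <- Bid{Cell: "cell-a", Score: 0.7} // busier cell
	bids <- Bid{Cell: "cell-b", Score: 0.2} // mostly idle cell
	winner, err := runAuction(bids, 100*time.Millisecond)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("placing task on %s\n", winner.Cell)
}
```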

Cloud Foundry is large and very featuresome, so for those who want a more
approachable way to play with Diego, try Lattice[2].

Disclaimer: I work in Pivotal Labs, a division of Pivotal. Pivotal is the
major donor of engineering effort to Cloud Foundry. I worked on the Buildpacks
team and I've deployed commercial apps to PWS, so I'm obviously a one-eyed
fan.

[1] [https://github.com/cloudfoundry-incubator/diego-design-notes](https://github.com/cloudfoundry-incubator/diego-design-notes)

[2] [http://lattice.cf/](http://lattice.cf/)

~~~
amouat
Could you compare Diego to Mesos? On the face of it, this sounds like a
similar system.

~~~
jacques_chester
They're arguably symmetric. Based on my very limited understanding:

Diego works by auctioning tasks/LRPs to cells, which bid based on free
resources.

Mesos works by auctioning free resources to tasks/LRPs, which accept or reject
based on their own policies.

The analogy I got from Onsi was that Diego is "demand-driven", Mesos is
"supply-driven".

They have different sweet spots. Diego is built to be the heart of a PaaS, so
it aims for fast success or failure when accepting tasks and processes.
Consumers can decide whether to build further scheduling intelligence on top,
but Diego doesn't make it explicit.

Mesos exposes some of the placement/scheduling logic back to the client
explicitly. For example, if you are OK with a gigantic batch job running
intermittently, Mesos can more or less do that without fuss. The job will
simply wait until enough free resources are available on the cluster. This is
one of the observations that drove Mesos. Diego, by contrast, will tell you
very quickly if that capacity is currently unavailable.

On the flipside, Diego requires no additional configuration. It also doesn't
require a central view of available resources to be maintained in order to
perform placement. That said, Diego _does_ maintain a central view of tasks
and LRPs that have been placed, in order to relaunch failed ones.

Of course, I have sometimes been wrong about my understanding of Diego and
Mesos, so YMMV.

~~~
amouat
Thanks, sounds accurate at least :)

~~~
jacques_chester
No worries. If you want to learn more, I can put you in touch with the Diego
or Lattice teams.

I should also note in passing that Diego can place and manage Docker
containers, Windows .NET apps and classic buildpack-staged apps, with the
architectural flexibility to extend to more kinds of managed resources (e.g.,
with the right backend, it could manage unikernel apps). That alone sets it
apart from the pack.

~~~
lifty
I think Nomad aims to be similarly flexible in the type of resources it can
manage.

~~~
jacques_chester
Yes, in a different way. Diego pushes the question of the underlying runtime
environment down to Garden, so if you add a backend, Diego can drive it.

------
vidarh
One more thing worth mentioning with fleet is that you can schedule anything
systemd can handle. This means you can have it schedule timers, for example,
and if the machine your timer runs on dies, fleet will re-schedule the timer
on another machine.

For fleet, a timer is just another systemd unit - there's no special support
for them - so you get a simple "cluster-wide cron" pretty much for free.
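As a concrete sketch (the unit names and the job are made up), a cluster-wide cron entry is just an ordinary service/timer pair, with an [X-Fleet] hint to keep the timer on the same machine as its service:

```ini
# cleanup.service - the job itself
[Unit]
Description=Nightly cleanup job

[Service]
Type=oneshot
ExecStart=/usr/bin/docker run --rm example/cleanup
```

```ini
# cleanup.timer - fires the service on a schedule
[Unit]
Description=Run cleanup.service at 03:00 every day

[Timer]
OnCalendar=*-*-* 03:00:00

[X-Fleet]
MachineOf=cleanup.service
```

Load the service and start the timer with fleetctl (`fleetctl load cleanup.service`, then `fleetctl start cleanup.timer`); if the host dies, fleet reschedules the pair on another machine.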

Fleet works fairly well, though it does have some minor maturity issues (I've
gotten it into weird states when machines have left the cluster abruptly and
rejoined, where e.g. it refuses to schedule some unit afterwards; the
workaround was to change the name of the unit). No idea how well it'll scale
in larger deployments.

------
kevinsimper
I have tried Swarm intensively and it is not ready at all, sadly! There is no
support for pulling from private registries, there is no rescheduling if one
of the nodes goes down, and there is no intelligent way to manage volumes.

You are actually better off managing each server individually than using Swarm.

(hard earned truth)

~~~
amouat
Yes, I should have mentioned in the article that Swarm is still heavily under
development.

I'm still wondering if and how they're going to address the idea of co-
scheduling groups of containers, like pods in Kubernetes. This is a common
need, but Docker don't seem very keen:
[https://github.com/docker/docker/issues/8781](https://github.com/docker/docker/issues/8781)

------
darren0
I just wanted to throw Rancher into the mix. Rancher is a more recent arrival
in the space and still in beta. Rancher focuses on simplicity, flexibility,
and pragmatic solutions for Docker-based infrastructure. Rancher is different
in that it can be deployed in conjunction with all of the systems mentioned
here, or run as a replacement. It includes a scheduler, load balancing,
health checks, service discovery, application templating based on compose
syntax, storage management, an application catalog, upgrade management,
GitHub/LDAP integration, a UI, an API, a CLI, etc.

Disclaimer: Co-founder of Rancher Labs and chief geeky guy behind Rancher

~~~
z3ugma
Rancher looks really good. I'm going to try it out - can you guys add a launch
guide for RancherOS on DigitalOcean?

Also, I want to make a plugin/service that's dependent on what you've built,
solely for the pun "Huevos RancherOS".

------
unethical_ban
HN: Has anyone here worked significantly with SmartOS and its associated
tools? I love the idea of a Zones/ZFS-backed container OS, but their
documentation looks very sloppy. Does anyone here have extensive experience
with it (other than Bryan Cantrill, whose BSDNow podcast interview is what got
me looking)?

~~~
lucd
Yes, Joyent's Triton may be the best way to run Docker containers, not only
because of ZFS but also because of illumos zones, network virtualization, etc.
You may be interested in this article by Casey Bisson about running Mesos on
Triton:

[https://www.joyent.com/blog/mesos-by-the-pound](https://www.joyent.com/blog/mesos-by-the-pound)

------
handimon
Nomad is another scheduler, released by HashiCorp. It probably wasn't
available when the article was written, but I am curious how it would compare
to the others.

[https://www.nomadproject.io](https://www.nomadproject.io)

~~~
dberg
Thanks for posting, Nomad looks really impressive. Have you run it in
production?

~~~
lobster_johnson
Nomad is not ready for production. It's unfinished. Let's call it pre-alpha.

------
tacotuesday
I'm leaning toward Kubernetes because that's what fabric8.io has chosen. The
others may be great, but I'm not really interested in writing a lot of glue
code to make them work well with Jenkins/Gerrit/Nexus/Slack/Docker/Chaos
Monkey/OpenShift the way Fabric8 has.

------
KirinDave
Only Kubernetes has facilities for hosting external services reliably. I'm not
sure why these other lower level tools are being compared to it. Especially
Mesos and Marathon, which are much lower level.

~~~
manojlds
Marathon and Kubernetes are at the same level; why do you call Marathon lower
level? Mesos, yes, and Kubernetes can run on Mesos too.

~~~
KirinDave
Marathon schedules long-running tasks, but doesn't offer any explicit
facilities for binding a homogeneous group into a reliable responder set.

At least, last time I checked. I can see components in the Marathon API to
help you build such a thing. But Kubernetes deals with a fundamentally more
abstract set of primitives: groups of containers (pods), replicated pods, and
automatically managed dispatch points to said groups called services.

The Marathon documentation literally recommends you hand-roll a periodically
updated haproxy service to accomplish that last bit. While K8S introduces a
fair amount of overhead to accomplish this feat reliably, that overhead pays
off in a very responsive and available system.
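For contrast, the Kubernetes side of that last bit is a single Service object (placeholder names here); every pod carrying the matching label is automatically part of the responder set behind a stable endpoint:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web          # route to any pod with this label
  ports:
  - port: 80          # stable port clients hit
    targetPort: 8080  # port the pods listen on
```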

So yeah. K8s is working one level up the stack, as I see it. I think you can
make something reasonable with only Marathon, and that thought is backed up by
the reality of people doing it. But my own experience suggests K8S does more
for you here. Hence my excitement about K8S-mesos.

~~~
ersoft
I managed to achieve load balancing for a fleet of microservices using vulcand
[0] and marathon callbacks [1].

The tool listens for Marathon callbacks (app create/update/scale, or a task
being killed/lost) and updates the backends in etcd, propagating the change to
all vulcand load balancers.

The tool is available on github [2]

[0] [https://vulcand.io/proxy.html#backends-and-servers](https://vulcand.io/proxy.html#backends-and-servers)

[1] [https://mesosphere.github.io/marathon/docs/event-bus.html](https://mesosphere.github.io/marathon/docs/event-bus.html)

[2] [https://github.com/kuende/kameni](https://github.com/kuende/kameni)
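The shape of such a listener is simple. Here's a stripped-down Go sketch (the field names follow Marathon's status_update_event payload, but the etcd write is only logged, and kameni itself surely differs in the details):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// marathonEvent keeps only the fields this sketch needs; real
// status_update_event payloads carry much more.
type marathonEvent struct {
	EventType  string `json:"eventType"`
	AppID      string `json:"appId"`
	Host       string `json:"host"`
	Ports      []int  `json:"ports"`
	TaskStatus string `json:"taskStatus"`
}

// handleEvent receives Marathon's HTTP callbacks and translates them
// into etcd keys that vulcand (or confd) watches. A real tool would
// call an etcd client here instead of logging.
func handleEvent(w http.ResponseWriter, r *http.Request) {
	var ev marathonEvent
	if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	if ev.EventType == "status_update_event" {
		for _, p := range ev.Ports {
			key := fmt.Sprintf("/vulcand/backends%s/servers/%s:%d",
				ev.AppID, ev.Host, p)
			log.Printf("task %s: would update etcd key %s", ev.TaskStatus, key)
		}
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	// Register this endpoint with Marathon's event bus, e.g.:
	//   POST /v2/eventSubscriptions?callbackUrl=http://this-host:4000/events
	http.HandleFunc("/events", handleEvent)
	log.Fatal(http.ListenAndServe(":4000", nil))
}
```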

~~~
KirinDave
This is a good thing. Perhaps you could get the Marathon folks to stop
recommending such an awkward solution and point to your work instead?

~~~
ersoft
Well, it depends. My solution is opinionated: it uses only one vulcand cluster
for load balancing, with multiple vulcand servers listening on one etcd
namespace.

For a small number of nodes (< 100) and requests (< 8000 req/s on each vulcand
server) this vulcand approach is OK. At large scale, HAProxy/Nginx supports a
lot more req/s than vulcand, and I think it can also be configured using confd
[0] with a similar approach: listen for Marathon events, update etcd keys,
then confd will watch for changes and reload HAProxy/Nginx.

[0] [http://ox86.tumblr.com/post/90554410668/easy-scaling-with-docker-haproxy-and-confd](http://ox86.tumblr.com/post/90554410668/easy-scaling-with-docker-haproxy-and-confd)

------
idlewords
What is the threshold number of computers past which using this stuff is worth
the tradeoff in complexity?

~~~
rconti
This is something we face in our environment. The OSes are pretty homogeneous,
but the application set is very diverse: 16 of these, 2 of those, 8 of the
others, and so on. It's made previous orchestration tools a bit more unwieldy,
but of course manual control is unwieldy as well!

In addition, of course, the task of learning, implementing, and evaluating the
options takes a large amount of time on top of the time we already spend
(mostly) manually maintaining infrastructure.

Articles like this are a great stepping stone.

------
NateDad
Would have liked to see Juju in this comparison. Maybe Juju is too flexible,
since it's not restricted to just deploying Docker containers?

(disclosure, I'm a Juju dev)

~~~
tacotuesday
Is there an easy way to try Juju on a laptop? Like a vagrant vm or something?
Thank you :)

~~~
NateDad
Sure, there's a local provider that uses LXC containers on your local machine
(Ubuntu only), and we're working on a way to use LXD (which is a much cleaner
local environment).

If you are running OS X, Windows, or a non-Ubuntu Linux, you'd need an Ubuntu
VM to run the local provider (we support CentOS, Windows, and OS X for the
Juju client in general, to control remote providers like Azure, Amazon, etc.,
just not for running the local provider).

[https://jujucharms.com/get-started](https://jujucharms.com/get-started)

------
coleca
Any thoughts on comparisons of these vs Amazon ECS? I know it's not open and
portable, but still interested in understanding the differences.

~~~
pkinney
We attempted to use ECS for a while before ultimately switching to Kubernetes.
While its tight, built-in integration with AWS's Elastic Load Balancers and
Auto-Scaling Groups made it fit well in the AWS ecosystem, we found that there
wasn't enough visibility into the system. Containers would be stopped without
notification or logging and not restarted.

We've found the Kubernetes primitives to be the easiest and most
straightforward to work with, while still providing a very powerful API around
which to wrap all sorts of custom tooling.

------
lifty
I am looking forward to the day when Mesos (or any other PaaS software) will
be able to use Ceph to save application (app/container/VM) state. In such a
setup, applications could move around the compute cluster without losing state.

~~~
jfindley
Mesos does now have early-stage support for persistent disks. See
[https://github.com/apache/mesos/blob/master/docs/persistent-volume.md](https://github.com/apache/mesos/blob/master/docs/persistent-volume.md)

There are also technologies like Flocker that will let you do this with Docker
volumes (which can then be run inside Mesos).

Both of these are at a pretty early stage of life, so there are some rough
edges, but if you really need the capability, it's there.

~~~
lifty
Indeed, you can cobble together such a solution, but baked-in integration at
the framework level would be more interesting. If state is exposed at the
scheduler level, then it's much easier for all scheduler frameworks to take
advantage of it.

------
weavie
So the article seems to suggest using Kubernetes for smaller clusters and
Mesos for larger ones.

Would people agree?

~~~
amouat
Author here. I would say it's a bit more subtle than that.

If you have a very large cluster (1000s of machines), Mesos may well be the
best fit, as you are likely to want Mesos's support for diverse workloads and
the extra assurance of the comparative maturity of the project.

The big difference with Kubernetes is that it enforces a certain application
style; you need to understand the various concepts (pods, labels, replication
controllers) and build your applications to work with those in mind. I haven't
seen any figures, but I would expect Kubernetes to scale well for the majority
of projects.
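To give a flavour of those concepts, here's a minimal replication controller (v1 API, placeholder names) tying together labels, a pod template, and a desired replica count:

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: web
spec:
  replicas: 3          # keep three copies of the pod running
  selector:
    app: web           # pods this controller manages
  template:            # pod template used to create replicas
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example/web:1.0
```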

~~~
DannoHung
The one thing I don't totally understand about the Kubernetes scaling issue:
you can ostensibly use Mesos as the scheduler for Kubernetes, which naively
seems like it would allow Kubernetes to scale to the same size as Mesos. Or is
the DNS and Service stuff the part that doesn't scale as well?

~~~
a1r
Absolutely! We're actively working on this here at Mesosphere, in partnership
with Google. You can check out the project at
[https://github.com/mesosphere/kubernetes-mesos](https://github.com/mesosphere/kubernetes-mesos)

You might also be interested in a talk I gave at the Kubernetes launch in
August about our work combining the two:
[https://youtu.be/aXcdHwQ5GgQ](https://youtu.be/aXcdHwQ5GgQ)

~~~
TheIronYuppie
Full disclosure, I work at Google on Kubernetes.

a1r is exactly right - Mesos does a great job of virtualizing your data
center, and Kubernetes is a framework on top of that.

------
Florin_Andrei
Anyone using Dkron? [http://dkron.io/](http://dkron.io/)

What I really need is a "distributed cron", first and foremost, with the
additional requirements that it be lightweight on resources and multiplatform.
Dkron seems to fit the description pretty well; I'm looking for any feedback
from users.

------
isoos
Are any of these platforms useful for a small company that wants to have a
basic self-hosted environment for the usual stuff?

They are looking for services like e-mail (SMTP, IMAP, maybe webmail), website
(static + maybe wordpress), source code hosting (Subversion) and reviews (?),
CI (Jenkins?), devops, and CRM...

~~~
cpitman
Kubernetes can be used to deploy anything that you can put into a Docker
container, including support for persistent volumes (i.e. "mode 1"
applications). I've been using it recently to host XMPP servers, GitLab,
Jenkins, etc.
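As a sketch of the persistent-volume piece (the claim size and image are placeholders, and the cluster needs a matching PersistentVolume to bind against), a stateful app like Jenkins just mounts a claim:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-home
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: jenkins
spec:
  containers:
  - name: jenkins
    image: jenkins
    volumeMounts:
    - name: home
      mountPath: /var/jenkins_home   # data survives pod restarts
  volumes:
  - name: home
    persistentVolumeClaim:
      claimName: jenkins-home
```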

I haven't installed Kubernetes directly, but I have set up and used OpenShift
v3, which adds a PaaS solution on top of Kubernetes. Setting it up is really
easy, and they've released an all-in-one VM to demo it:
[http://www.openshift.org/vm/](http://www.openshift.org/vm/)

The other option is to use a hosted solution. Google Container Engine
([https://cloud.google.com/container-engine/](https://cloud.google.com/container-engine/))
is essentially hosted Kubernetes.

PS I work for Red Hat. OpenShift V3 is our product, and we contribute a lot to
Kubernetes.

------
sciurus
Anything to say about Flynn?

[https://flynn.io/](https://flynn.io/)

~~~
Titanous
Flynn is a high-level platform that doesn't require you to think about low-
level components like schedulers, overlay networking, service discovery,
consensus, etc. Flynn does of course include a built-in scheduler and other
components that accomplish many of the goals mentioned in the article.

(disclosure: I am a cofounder of Flynn)

------
geggam
How many people actually need this level of scale, versus how many are
implementing it because it's $BUZZWORD?

~~~
amouat
You need orchestration even at small scale (say more than 2 nodes). Otherwise
you have to manually take care of container scheduling and failover etc.

~~~
dberg
Exactly. The biggest problem with running apps in containers is that you need
to ensure you don't have your entire cluster sitting on one physical node (or
virtual node, I guess), and that services can be restarted, auto-scaled, etc.
Even at a small scale, some orchestration and management is required.

~~~
blumkvist
>you need to ensure you dont have your entire cluster sitting on one physical
node

Why is this the case? My understanding of containers is that they solve
exactly this problem: restarting or rescaling only a fraction of the
application (an individual service).

~~~
cpitman
Bringing up new containers will always take a finite amount of time.

If all of a cluster is deployed to a single host, and that host dies, then the
recovery time is the time needed to reschedule the container elsewhere, load
it onto the machine (if not cached), and execute it. This is all downtime,
since every instance was lost.

If instances are on more than one host, then even in the case of a single
host's failure the application is still available while the failed instance is
rescheduled and started.

------
phildougherty
Also check out [https://containership.io](https://containership.io) as a
scheduler. The core of the system is open source and has built-in service
discovery, DNS, and load balancing.

