Swarm vs. Fleet vs. Kubernetes vs. Mesos (oreilly.com)
205 points by amouat on Oct 23, 2015 | hide | past | favorite | 83 comments

This is a pretty balanced article. Absent is Nomad, which claims the world, but when you dig into the code, pretty large advertised chunks are simply not there yet. Nomad seems like a much more straightforward implementation of the Borg paper, and may one day be interesting once they write the rest of it. A nice Kubernetes feature, similar to what you can do with fleet, is the “Daemon Set”, which lets you run certain things on every node. Some cool Mesos features that are pretty new and haven’t been talked about much yet:

* persistent volumes: let frameworks have directories on agent machines returned to them after an availability event happens, which is nice for replicated stateful services

* maintenance primitives: schedule machines to go offline at certain times, and ask frameworks if it’s safe to take nodes offline. This will soon start being used for stateful services so that they can vote on when it’s safe to take out a replica, and to trigger proactive rereplication when maintenance is desired.

* oversubscription: if you have an agent that has given away all of its resources, but the agent detects that there is still some unutilized CPU, it can start “revocable tasks” to fill up the slack until it starts interfering with existing workloads.
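The Kubernetes “Daemon Set” mentioned above is declared with a small manifest. As a sketch (the API group/version has shifted between releases, and the names and image below are illustrative):

```yaml
# Runs one copy of this pod on every node in the cluster.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: log-agent            # illustrative name
spec:
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      containers:
      - name: log-agent
        image: example/log-agent   # illustrative image
```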

Cloud Foundry has a scheduler called Diego[1], which is now in public beta on PWS. Because it's built for Cloud Foundry, you get the whole PaaS as well. No need to roll your own service injection, staging, logging, authentication and so on: it's already built and integrated.

For me, the cleverest part about Diego is that it turns placement into a distributed problem through an auctions mechanism. Other attempts at this focus on algorithmic approaches that assume perfect information. Diego instead asks cells to bid on incoming tasks and processes. The auction closes after a fixed time and the task or process is sent to the best bidder. This greatly reduces the need to have perfect central consistency in order to perform bin-packing optimisation -- in a real environment, that turns out to matter a lot.
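The auction idea can be illustrated with a toy sketch. This is not Diego's actual protocol or API, just the general shape of bid-based placement: each cell bids based on its own free resources, and the best bidder wins the task.

```python
# Toy sketch of auction-style placement; NOT Diego's real protocol.
# Cells bid on an incoming task; the auctioneer picks the best bidder.

class Cell:
    def __init__(self, name, free_mem):
        self.name = name
        self.free_mem = free_mem  # MB of unreserved memory

    def bid(self, task_mem):
        # A cell that cannot fit the task declines to bid.
        if task_mem > self.free_mem:
            return None
        # Bid the headroom that would remain after placement.
        return self.free_mem - task_mem

def run_auction(cells, task_mem):
    bids = [(c.bid(task_mem), c) for c in cells]
    bids = [(b, c) for b, c in bids if b is not None]
    if not bids:
        return None  # fast failure: no cell can take the task
    _, winner = max(bids, key=lambda bc: bc[0])
    winner.free_mem -= task_mem
    return winner.name

cells = [Cell("cell-a", 512), Cell("cell-b", 2048), Cell("cell-c", 1024)]
print(run_auction(cells, 256))  # cell-b wins with the most headroom
```

Note how "no bids" gives you the fast failure mentioned later in the thread, without any central view of cluster capacity.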

Cloud Foundry is large and very featuresome, so for those who want a more approachable way to play with Diego, try Lattice[2].

Disclaimer: I work at Pivotal Labs, a division of Pivotal. Pivotal is the major donor of engineering effort to Cloud Foundry. I worked on the Buildpacks team and I've deployed commercial apps to PWS, so I'm obviously a one-eyed fan.

[1] https://github.com/cloudfoundry-incubator/diego-design-notes

[2] http://lattice.cf/

Could you compare Diego to Mesos? On the face of it, this sounds like a similar system.

They're arguably symmetric. Based on my very limited understanding:

Diego works by auctioning tasks/LRPs to cells, which bid based on free resources.

Mesos works by auctioning free resources to tasks/LRPs, which accept or reject based on their own policies.

The analogy I got from Onsi was that Diego is "demand-driven", Mesos is "supply-driven".

They have different sweet spots. Diego is built to be the heart of a PaaS, so it aims for fast success or failure for accepting tasks and processes. Consumers can decide whether to build further scheduling intelligence, but Diego doesn't make it explicit.

Mesos exposes some of the placement/scheduling logic back to the client explicitly. For example, if you are OK with a gigantic batch job running intermittently, Mesos can more or less do that without fuss. The job will simply wait until enough free resources are available on the cluster. This is one of the observations that drove Mesos. Whereas Diego will tell you very quickly if that capability is unavailable currently.

On the flipside, Diego requires no additional configuration. It also doesn't require a central view of available resources to be maintained in order to perform placement. That said, Diego does maintain a central view of tasks and LRPs that have been placed, in order to relaunch failed ones.

That said, I have sometimes been wrong about my understanding of Diego and Mesos, so YMMV.

Thanks, sounds accurate at least :)

No worries. If you want to learn more, I can put you in touch with the Diego or Lattice teams.

I should also note in passing that Diego can place and manage Docker containers, Windows .NET apps, and classic buildpack-staged apps, with the architectural flexibility to extend to more kinds of managed resources (e.g., with the right backend, it could manage unikernel apps). That alone sets it apart from the pack.

I think Nomad aims to be similarly flexible in the type of resources it can manage.

Yes, in a different way. Diego pushes the question of the underlying runtime environment down to Garden, so if you add a backend, Diego can drive it.

Since it's too late to edit, turns out I'm out of date about the current way Diego works. There's an auction mechanism but it's been made more centralised due to a thundering herd problem. It more closely resembles Mesos and others, insofar as it collects resource reports and selects candidates based on standard bin-packing techniques; cells can then reject placements.

Would you say your knowledge of Cloud Foundry and Diego is directly proportional to how fancy you dress on Fridays? Because today is Friday and I don't see fancy dress.

It's probably proportional, given that I am at home because my back decided to throw itself another party.

How well does Cloud Foundry play with stateful containers?

It doesn't. It's intended to support 12-factor apps[1].

State is handled by services bound to your application. CF injects the connection details into the staging and runtime environments. Applications using buildpacks generally have that connection autoconfigured.

For example, Rails apps have database credentials autoconfigured if you have a database bound. Spring apps get a DataSource. And so on.
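As a sketch of what that injection looks like from the app's side: Cloud Foundry exposes bound services to the app as JSON in the VCAP_SERVICES environment variable. The service label and credential keys below are illustrative; real entries depend on the bound service broker.

```python
import json
import os

# Simulate what CF would inject (illustrative label and credentials).
os.environ["VCAP_SERVICES"] = json.dumps({
    "p-mysql": [{
        "name": "my-db",
        "credentials": {"uri": "mysql://user:pass@host:3306/db"}
    }]
})

def database_uri(service_label="p-mysql"):
    """Return the URI of the first bound instance of a service, if any."""
    services = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
    for instance in services.get(service_label, []):
        return instance["credentials"]["uri"]
    return None

print(database_uri())  # mysql://user:pass@host:3306/db
```

Buildpacks do essentially this on the app's behalf, which is how Rails and Spring apps get their connections autoconfigured.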

[1] http://12factor.net/

I'm familiar with 12-factor apps, but there are some use cases (usually involving existing / 3rd party software) where it's not really feasible. One thing I like about plain LXC is that you can keep state if you want to. Of course making it HA is a whole other story. :)

It looks like k8s has a way to handle volumes, so maybe I'll have to play with that.

Persistent volumes are on the horizon[1], now that Diego has landed.

But you'd be surprised how many apps can be ported to some semblance of cloud nativeness. Most of the stuff that Absolutely Must Be Stateful can usually be adapted to something else. Sticky sessions go to Redis or Memcache; local files can often be relocated into a shared filesystem/blobstore like Riak CS, etc.

The big thing about PaaSes is not that they make it a smooth, double-percent-ROI incremental improvement from the status quo. What happens is that unlocking true continuous deployment capability unblocks process backpressure throughout the entire organisation. The value you get from a complete shift in how you develop software swamps the cost of carrying legacy until it can be replaced.

Putting it another way: the sucky parts are tactical. The awesome parts are strategic.

[1] http://blog.pivotal.io/pivotal-cloud-foundry/case-studies/th...

Lattice looks pretty cool. Thanks for the reading material.

One more thing worth mentioning with fleet is that you can schedule anything systemd can handle. This means you can have it schedule timers, for example, and if the machine your timer runs on dies, fleet will re-schedule the timer on another machine.

For fleet a timer is just another systemd unit (there's no special support for them), so you get a simple "cluster-wide cron" pretty much for free.
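As a sketch, such a timer is plain systemd (unit names below are illustrative; you'd also submit a matching cleanup.service for the timer to fire):

```ini
# cleanup.timer - submit with "fleetctl start cleanup.timer"
[Unit]
Description=Run cleanup daily

[Timer]
OnCalendar=daily
Unit=cleanup.service

[Install]
WantedBy=timers.target
```

If the machine running the timer dies, fleet reschedules the unit elsewhere, which is what makes it cluster-wide.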

Fleet works fairly well, though it does have some minor maturity issues (I've gotten it into weird states when machines have left the cluster abruptly and rejoined, where e.g. it refuses to schedule some unit afterwards; solution: change the name of the unit). No idea how well it'll scale in larger deployments.

I have tried Swarm intensively and it is not ready at all, sadly! There is no support for pulling from private registries, there is no rescheduling if one of the nodes goes down, and there is no intelligent way to manage volumes.

You are actually better off managing each server individually than using Swarm.

(hard earned truth)

Yes, I should have mentioned in the article that Swarm is still heavily under development.

I'm still wondering if and how they're going to address the idea of co-scheduling groups of containers, like pods in Kubernetes. This is a common need, but Docker don't seem very keen: https://github.com/docker/docker/issues/8781

I just wanted to throw Rancher into the mix. Rancher is a more recent entrant into the space and still in beta. Rancher focuses on simplicity, flexibility, and pragmatic solutions for Docker-based infrastructure. Rancher is different in that it can be deployed in conjunction with all of the systems mentioned, or can be run as a replacement. It includes a scheduler, load balancing, health checks, service discovery, application templating based on compose syntax, storage management, an application catalog, upgrade management, GitHub/LDAP integration, a UI, API, CLI, etc.

Disclaimer: Co-founder of Rancher Labs and chief geeky guy behind Rancher

Rancher looks really good. I'm going to try it out - can you guys add a launch guide for RancherOS on DigitalOcean?

Also, I want to make a plugin/service that's dependent on what you've built, solely for the pun "Huevos RancherOS".

HN: Has anyone here worked significantly with SmartOS and its associated tools? I love the idea of a Zones/ZFS-backed container OS, but their documentation looks very sloppy. Does anyone here have extensive experience with it (other than Bryan Cantrill, whose BSDNow podcast interview is what got me looking)?

Yes, Joyent's Triton may be the best way to run Docker containers, not only for ZFS but also for Illumos Zones and network virtualization. You may be interested in this article from Casey Bisson about running Mesos on Triton:


I haven't, but note you can use ZFS with Docker today.

Nomad is another scheduler, released by HashiCorp. This probably wasn't available when the article was written, but I am curious how it would compare to the others.


Thanks for posting; Nomad looks really impressive. Have you run it in production?

Nomad is not ready for production. It's unfinished. Let's call it pre-alpha.

I'm leaning toward Kubernetes because that's what fabric8.io has chosen. The others may be great, but I'm not really interested in writing a lot of glue code to make them work well with Jenkins/Gerrit/Nexus/Slack/Docker/Chaos Monkey/OpenShift the way Fabric8 has.

Only Kubernetes has facilities for hosting external services reliably. I'm not sure why these other lower level tools are being compared to it. Especially Mesos and Marathon, which are much lower level.

They are being compared as they are the main options for clustering and scheduling containers. I'd agree Kubernetes is at a higher level to the other options, or at least comes with more features.

I'm not sure what you mean by "hosting external services reliably" - what's external and who is unreliable?

Hmm, I'll answer my own question - you mean has a load-balancer built in, don't you?

And that doesn't recommend a comically brittle solution (as in marathon) or provide no facilities at all (as in swarm).

There's nothing wrong with these decisions. It's just odd to compare this group. It makes me wonder how people are considering these options and in what context.

You cannot field a reliable and available service to the outside world without this component. And the instrumentation and operationalization of this component is tricky.

Yeah, I agree that the load-balancer is important and not adequately addressed. I guess most people rig up their own set-up based on haproxy currently.

I don't get why you think I can't compare that group though. What should I compare Kubernetes with? The group represents the various approaches to clustering and orchestration that most people will consider for container deployments. They are all slightly different, which is what makes for an interesting comparison.

K8S pretty much stands on its own in the Docker ecosystem right now. I'm not aware of good alternatives that aren't comically janky as you described.

Marathon and Kubernetes are at the same level, so why do you call it lower level? Mesos, yes; and Kubernetes can run on Mesos too.

Marathon schedules long-running tasks, but doesn't offer any explicit facilities for binding a homogeneous group into a reliable responder set.

At least, last time I checked. I can see components in the Marathon API to help you build such a thing. But Kubernetes deals with a fundamentally more abstract set of primitives: groups of containers (pods), replicated pods, and automatically managed dispatch points to said groups called services.

The Marathon documentation literally recommends you hand-roll a periodically updated haproxy service to accomplish that last bit. While K8S introduces a fair amount of overhead to accomplish this feat reliably, that overhead pays off by having a very responsive and available system.

So yeah. K8s is working one level up the stack, as I see it. I think you can make something reasonable with only Marathon, and that thought is backed up by the reality of people doing it. But my own experience suggests K8S does more for you here. Hence my excitement about K8S-mesos.
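For concreteness, the Kubernetes "service" primitive mentioned above is just a small manifest; Kubernetes then maintains a stable dispatch point in front of whatever pods match the selector (names and ports below are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web          # routes to every pod carrying this label
  ports:
  - port: 80          # port the service exposes
    targetPort: 8080  # port the pods listen on
```

This is the bit Marathon leaves to you (hence the hand-rolled haproxy recommendation).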

I managed to achieve load balancing for a fleet of microservices using vulcand [0] and marathon callbacks [1].

It listens for marathon callbacks (for app create/update/scale), or when a task is killed/lost and updates the backends in etcd, propagating to all vulcand load balancers.

The tool is available on github [2]

[0] https://vulcand.io/proxy.html#backends-and-servers

[1] https://mesosphere.github.io/marathon/docs/event-bus.html

[2] https://github.com/kuende/kameni

This is a good thing. Perhaps you could get the Marathon folks to stop recommending such an awkward solution and recommend your work instead?

Well, it depends. My solution is opinionated, using only one vulcand cluster for load balancing, multiple vulcand servers listening on one etcd namespace.

For a small number of nodes (< 100) and requests (< 8000 req/s on each vulcand server) this vulcand approach is OK. For large scale, HAProxy/Nginx supports a lot more req/s than vulcand, and I think it can also be configured using confd [0] with a similar approach: listen for marathon events, update etcd keys, then confd will listen for changes and reload HAProxy/Nginx.

[0] http://ox86.tumblr.com/post/90554410668/easy-scaling-with-do...
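As a sketch of the confd side (the template-resource format is confd's, but the keys and paths here are illustrative): a template resource watches an etcd prefix and re-renders the HAProxy config, then reloads HAProxy.

```toml
# /etc/confd/conf.d/haproxy.toml (illustrative paths and keys)
[template]
src        = "haproxy.cfg.tmpl"
dest       = "/etc/haproxy/haproxy.cfg"
keys       = ["/services/web"]
reload_cmd = "service haproxy reload"
```

The referenced haproxy.cfg.tmpl would iterate over the watched keys (e.g. with confd's `getvs` template function) to emit one `server` line per backend.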

Just to double down on that previous point, check out the code for haproxy-marathon-bridge.


What is the threshold number of computers past which using this stuff is worth the tradeoff in complexity?

This is something we face in our environment. The OSes are pretty homogenous, but the application set is very diverse; 16 of these, 2 of these, 8 of those, and so on. It's made previous orchestration tools a bit more unwieldy, but of course manual control is unwieldy as well!

In addition, of course, the task of learning, implementing, and evaluating the options take a large amount of time on top of the time we already spend (mostly) manually maintaining infrastructure.

Articles like this are a great stepping stone.

I deploy a pretty small app with like 2 instances per deploy and 3 deploys (beta and 2 production versions for different clients) and Kubernetes is 100% worth it for the rolling updates and the ability to build the image once and push it different places.

Would have liked to see Juju in this comparison. Maybe Juju is too flexible, since it's not restricted to just deploying Docker containers?

(disclosure, I'm a Juju dev)

Is there an easy way to try Juju on a laptop? Like a vagrant vm or something? Thank you :)

Sure, there's a local provider that uses LXC containers on your local machine (Ubuntu only), and we're working on a way to use LXD (which is a much cleaner local environment).

If you are running OSX, Windows, or a non-Ubuntu linux, you'd need an Ubuntu VM to run the local provider (we support Centos, Windows, and OSX for the juju client in general to control remote providers like Azure, Amazon, etc, just not for running the local provider).


Any thoughts on comparisons of these vs Amazon ECS? I know it's not open and portable, but still interested in understanding the differences.

We attempted to use ECS for a while before ultimately switching to Kubernetes. While its tight, built-in integration with AWS's Elastic Load Balancers and Auto-Scaling Groups made it fit well in the AWS ecosystem, we found that there wasn't enough visibility into the system. Containers would be stopped without notification or logging and not restarted.

We've found the Kubernetes primitives to be the easiest and most straight-forward to work with while still providing a very powerful API around which to wrap all sorts of custom tooling.

I am looking forward to the day when Mesos (or any other PaaS software) will be able to use Ceph to save application (app/container/VM) state. In such a setup applications could move around the compute cluster without losing state.

Full disclosure, I work at Google on Kubernetes.

Kubernetes offers this as a plug-in today.


Mesos does now have early-stage support for persistent disks. See https://github.com/apache/mesos/blob/master/docs/persistent-...

There are also technologies like Flocker that will allow you to do this with Docker volumes (which can then be run inside Mesos).

Both of these are in a pretty early stage of life, so there's some rough edges, but if you really need it, it's there.

Indeed you can cobble together such a solution, but baked-in integration at the framework level would be more interesting. If state is exposed at the scheduler level, then it's much easier for all scheduler frameworks to take advantage of it.

Containers and cluster storage bring you really close to what Google's infrastructure looks like, especially when the systems are fully fault-tolerant like Mesos and Ceph. One of the drawbacks of Ceph is that its filesystem is not ready for production yet, so you have to resort to block devices that can only be accessed from one host at a time.

At Quobyte (http://www.quobyte.com, disclaimer: I am one of the founders), we have built a fully fault-tolerant distributed file system. This allows concurrent scalable shared access to file systems from any number of hosts. Think of a /data that is accessible from any host and can be mapped in any container.

What I found pretty neat was that we could easily do a MySQL HA setup on Mesos: put MySQL in a container, use a directory on /quobyte for its data, and enable Quobyte's mandatory file locking. When you kill the container, or switch off/unplug its host, the container gets rescheduled and recovers from the shared file system.

We're already doing this with kubernetes + fleet.

Ceph itself seems really stable, though we did have an issue with Kubernetes not acquiring a lock on Ceph and thus having two nodes write in the event that one went unresponsive. This was patched in a later version, but make sure it's merged into the release you go with.

So the article seems to suggest that for smaller clusters use Kubernetes, for larger ones use Mesos.

Would people agree?

Author here. I would say it's a bit more subtle than that.

If you have a very large cluster (1000s of machines), Mesos may well be the best fit, as you are likely to want Mesos's support for diverse workloads and the extra assurance of the comparative maturity of the project.

The big difference with Kubernetes is that it enforces a certain application style; you need to understand the various concepts (pods, labels, replication controllers) and build your applications to work with those in mind. I haven't seen any figures, but I would expect Kubernetes to scale well for the majority of projects.

> The big difference with Kubernetes is that it enforces a certain application style

In our experience, it isn't so much that Kubernetes enforces a certain style as that it doesn't force services to understand the scheduling layer. A Twelve-Factor-style app will deploy very nicely and easily into a Kubernetes cluster.

The one thing I don't totally understand about the Kubernetes scaling issue is that you can ostensibly use Mesos as the scheduler for Kubernetes, which naively seems like it would allow Kubernetes to scale to the same size as Mesos. Or is the DNS and Service stuff the part that doesn't scale as well?

Absolutely! We're actively working on this here at Mesosphere, in partnership with Google. You can check out the project at https://github.com/mesosphere/kubernetes-mesos

You might also be interested in a talk I gave at the Kubernetes launch in August about our work combining the two: https://youtu.be/aXcdHwQ5GgQ

Full disclosure, I work at Google on Kubernetes.

a1r is exactly right: Mesos does a great job of virtualizing your data center, and Kubernetes is a framework on top of that.

There are a bunch of scale limits in Kubernetes itself that are being chipped away at; the bottleneck is not the scheduler, so plugging a new scheduler in won't make the system more performant.

Here's a blog discussing the current state of K8s scale: http://blog.kubernetes.io/2015/09/kubernetes-performance-mea...

Note that Bob Wise at Samsung has been driving some horizontal scale testing, and has got a 1000-node cluster up and running, so that's a current "best case" scale number.

Here's Bob's video of a 1000 machines running on CoreOS' Tectonic product: https://www.youtube.com/watch?v=1bHUsBhPL20

(work at CoreOS)

I think Kubernetes on top of Mesos may become a reasonably common set-up, largely because it allows diverse workloads that can drive better usage and efficiency.

I'm not sure what the trade-offs are with running Kubernetes on Mesos - that would make for an interesting article.

(Mesosphere co-founder here)

Running workload-specific schedulers like Kubernetes on Mesos is one of its fundamental ideas. From the paper:

"It seems clear that new cluster computing frameworks will continue to emerge, and that no framework will be optimal for all applications. Therefore, organizations will want to run multiple frameworks in the same cluster, picking the best one for each application."


If you're facing endpoints at the outside world and would like to do so easily and with pretty good reliability, Kubernetes on Mesos is probably the way to think about it.

Or should, when the K8S-Mesos stuff is a bit more mature.

Anyone using Dkron? http://dkron.io/

What I really need is a "distributed cron", first and foremost, with the additional requirements that it be lightweight on resources and multiplatform. Dkron seems to fit the description pretty well; I'm looking for any feedback from users.

Are any of these platforms useful for a small company that wants to have a basic self-hosted environment for the usual stuff?

They are looking for services like e-mail (SMTP, IMAP, maybe webmail), website (static + maybe wordpress), source code hosting (Subversion) and reviews (?), CI (Jenkins?), devops, and CRM...

Kubernetes can be used to deploy anything that you can put into a Docker container, including support for persistent volumes (i.e. "mode 1" applications). I've been using it recently to host XMPP servers, GitLab, Jenkins, etc.
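As a sketch, claiming a persistent volume is a small manifest of its own (name and size illustrative; a matching PersistentVolume or provisioner has to exist on the cluster side, and a pod then mounts the claim by name):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gitlab-data
spec:
  accessModes: ["ReadWriteOnce"]   # single-node read-write
  resources:
    requests:
      storage: 10Gi
```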

I haven't installed Kubernetes directly, but I have set up and used OpenShift v3, which adds a PaaS solution on top of Kubernetes. Setting it up is really easy, and they've released an all-in-one VM to demo it: http://www.openshift.org/vm/

The other option is to use a hosted solution. Google Container Engine (https://cloud.google.com/container-engine/) is essentially hosted Kubernetes.

PS I work for Red Hat. OpenShift V3 is our product, and we contribute a lot to Kubernetes.

Hard to say without more detail, but I would lean towards probably not.

For the environment conjured in my head by what you're describing, I would probably use something like Salt or Puppet to drive installation, config management/monitoring and upgrades, possibly on top of oVirt or similar.

I don't see an advantage for tiny shops containerizing everything at this point, at least until some bright shiny future where containerization is much more of the norm and platforms like this are much more mature and easy to manage. Your sysadmin almost certainly has better things to do than adding wrappers and indirection to a tiny environment.

I'd use a public PaaS. Heroku pioneered it. I like Pivotal Web Services because I work for the company which runs it, the same opensource platform (Cloud Foundry) powers BlueMix as well. There's also OpenShift from Red Hat.

Rolling your own is very time-consuming if your goal is to deliver simple apps.

Anything to say about Flynn?


Flynn is a high-level platform that doesn't require you to think about low-level components like schedulers, overlay networking, service discovery, consensus, etc. Flynn does of course include a built-in scheduler and other components that accomplish many of the goals mentioned in the article.

(disclosure: I am a cofounder of Flynn)

The article focused on systems for running straight Docker containers, so I didn't investigate Flynn.

How many people actually need this level of scale vs. how many people are implementing this because it's $BUZZWORD?

It's not about scale, it's about manageability.

Having a set of uniform nodes where you schedule containers is nicer, in my opinion, than managing an infrastructure where you've scripted applications to go places.

Sure, you could make Puppet manage containers on uniform nodes, but then we're having a massive convention-vs-configuration argument, which we shouldn't bother with because the distributed schedulers are giving us a lot of important stuff mostly for free.

"This level of scale" is very trivial for some of these. E.g. for fleet, the only dependencies are systemd and etcd.

If you have a reachable etcd cluster (which you can run on a single machine, though you shouldn't), your machines run systemd, and you have keyed SSH access between them, you can pretty much start typing "fleetctl start/stop/status" instead of "systemctl start/stop/status", feed fleet systemd units, and have your jobs spread over your servers with pretty much no effort.

For me it's an effortless way to get failover.

E.g. "fleetctl start someappserver@{1..3}" to start three instances of someappserver@.service. With the right constraint in the service file they'll be guaranteed to end up on different machines.
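The constraint mentioned lives in the unit's [X-Fleet] section; a sketch of such a template unit (contents illustrative) might look like:

```ini
# someappserver@.service
[Unit]
Description=App server instance %i

[Service]
ExecStart=/usr/bin/docker run --rm --name app-%i example/app

[X-Fleet]
# Never co-locate two instances of this template on one machine
Conflicts=someappserver@*.service
```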

You need orchestration even at small scale (say more than 2 nodes). Otherwise you have to manually take care of container scheduling and failover etc.

Exactly. The biggest problem with running apps in containers is that you need to ensure you don't have your entire cluster sitting on one physical node (or virtual node, I guess); services can be restarted, auto-scaled, etc. Even at a small scale some orchestration and management is required.

>you need to ensure you dont have your entire cluster sitting on one physical node

Why is this the case? My understanding of containers is that they solve exactly this problem: restarting and rescaling only a fraction of the application (only an individual service).

Bringing up new containers will always take a finite amount of time.

If all of a cluster is deployed to a single host, and that host dies, then the recovery time is the time needed to reschedule the container elsewhere, load it onto the machine (if not cached), and execute it. This is all downtime, since every instance was lost.

If instances are on more than one host, then even in the case of a single host's failure the application is still available while the failed instance is rescheduled and started.

You do if you're running an internal PaaS or DCOS, and a large number of bigger organisations (e.g. Google...) do, because bin-packing saves on hardware costs.

Many don't, of course...

Also check out https://containership.io as a scheduler. The core of the system is open source, and it has built-in service discovery, DNS, and load balancing.
