Swarm vs. Fleet vs. Kubernetes vs. Mesos (oreilly.com)
205 points by amouat on Oct 23, 2015 | hide | past | favorite | 83 comments

This is a pretty balanced article. Absent is Nomad, which claims the world, but when you dig into the code, pretty large advertised chunks are simply not there yet. Nomad seems like a much more straightforward implementation of the Borg paper, and may one day be interesting once they write the rest of it. A nice Kubernetes feature, similar to what you can do with fleet, is the “Daemon Set”, which lets you run certain things on every node. Some cool Mesos features that are pretty new and haven’t been talked about much yet:

* persistent volumes: let frameworks have directories on agent machines returned to them after an availability event happens, which is nice for replicated stateful services

* maintenance primitives: schedule machines to go offline at certain times, and ask frameworks if it’s safe to take nodes offline. This will soon start being used for stateful services so that they can vote on when it’s safe to take out a replica, and to trigger proactive rereplication when maintenance is desired.

* oversubscription: if you have an agent that has given away all of its resources, but the agent detects that there is still some unutilized CPU, it can start “revocable tasks” to fill up the slack until it starts interfering with existing workloads.
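The Kubernetes “Daemon Set” mentioned above is declared with a small manifest. As a sketch (the API group/version has shifted between releases, and the names and image below are illustrative):

```yaml
# Runs one copy of this pod on every node in the cluster.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: log-agent            # illustrative name
spec:
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      containers:
      - name: log-agent
        image: example/log-agent   # illustrative image
```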

Cloud Foundry has a scheduler called Diego[1], which is now in public beta on PWS. Because it's built for Cloud Foundry, you get the whole PaaS as well. No need to roll your own service injection, staging, logging, authentication and so on: it's already built and integrated.

For me, the cleverest part about Diego is that it turns placement into a distributed problem through an auctions mechanism. Other attempts at this focus on algorithmic approaches that assume perfect information. Diego instead asks cells to bid on incoming tasks and processes. The auction closes after a fixed time and the task or process is sent to the best bidder. This greatly reduces the need to have perfect central consistency in order to perform bin-packing optimisation -- in a real environment, that turns out to matter a lot.
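The auction idea can be illustrated with a toy sketch. This is not Diego's actual protocol or API, just the general shape of bid-based placement: each cell bids based on its own free resources, and the best bidder wins the task.

```python
# Toy sketch of auction-style placement; NOT Diego's real protocol.
# Cells bid on an incoming task; the auctioneer picks the best bidder.

class Cell:
    def __init__(self, name, free_mem):
        self.name = name
        self.free_mem = free_mem  # MB of unreserved memory

    def bid(self, task_mem):
        # A cell that cannot fit the task declines to bid.
        if task_mem > self.free_mem:
            return None
        # Bid the headroom that would remain after placement.
        return self.free_mem - task_mem

def run_auction(cells, task_mem):
    bids = [(c.bid(task_mem), c) for c in cells]
    bids = [(b, c) for b, c in bids if b is not None]
    if not bids:
        return None  # fast failure: no cell can take the task
    _, winner = max(bids, key=lambda bc: bc[0])
    winner.free_mem -= task_mem
    return winner.name

cells = [Cell("cell-a", 512), Cell("cell-b", 2048), Cell("cell-c", 1024)]
print(run_auction(cells, 256))  # cell-b wins with the most headroom
```

Note how "no bids" gives you the fast failure mentioned later in the thread, without any central view of cluster capacity.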

Cloud Foundry is large and very featuresome, so for those who want a more approachable way to play with Diego, try Lattice[2].

Disclaimer: I work at Pivotal Labs, a division of Pivotal. Pivotal is the major donor of engineering effort to Cloud Foundry. I worked on the Buildpacks team and I've deployed commercial apps to PWS, so I'm obviously a one-eyed fan.

[1] https://github.com/cloudfoundry-incubator/diego-design-notes

[2] http://lattice.cf/

Could you compare Diego to Mesos? On the face of it, this sounds like a similar system.

They're arguably symmetric. Based on my very limited understanding:

Diego works by auctioning tasks/LRPs to cells, which bid based on free resources.

Mesos works by auctioning free resources to tasks/LRPs, which accept or reject based on their own policies.

The analogy I got from Onsi was that Diego is "demand-driven", Mesos is "supply-driven".

They have different sweet spots. Diego is built to be the heart of a PaaS, so it aims for fast success or failure for accepting tasks and processes. Consumers can decide whether to build further scheduling intelligence, but Diego doesn't make it explicit.

Mesos exposes some of the placement/scheduling logic back to the client explicitly. For example, if you are OK with a gigantic batch job running intermittently, Mesos can more or less do that without fuss. The job will simply wait until enough free resources are available on the cluster. This is one of the observations that drove Mesos. Whereas Diego will tell you very quickly if that capability is unavailable currently.

On the flipside, Diego requires no additional configuration. It also doesn't require a central view of available resources to be maintained in order to perform placement. That said, Diego does maintain a central view of tasks and LRPs that have been placed, in order to relaunch failed ones.

That said, I have sometimes been wrong about my understanding of Diego and Mesos, so YMMV.

Thanks, sounds accurate at least :)

No worries. If you want to learn more, I can put you in touch with the Diego or Lattice teams.

I should also note in passing that Diego can place and manage Docker containers, Windows .NET apps, and classic buildpack-staged apps, with the architectural flexibility to extend to more kinds of managed resources (e.g., with the right backend, it could manage unikernel apps). That alone sets it apart from the pack.

I think Nomad aims to be similarly flexible in the type of resources it can manage.

Yes, in a different way. Diego pushes the question of the underlying runtime environment down to Garden, so if you add a backend, Diego can drive it.

Since it's too late to edit, turns out I'm out of date about the current way Diego works. There's an auction mechanism but it's been made more centralised due to a thundering herd problem. It more closely resembles Mesos and others, insofar as it collects resource reports and selects candidates based on standard bin-packing techniques; cells can then reject placements.

Would you say your knowledge of Cloud Foundry and Diego is directly proportional to how fancy you dress on Fridays? Because today is Friday and I don't see fancy dress.

It's probably proportional, given that I am at home because my back decided to throw itself another party.

How well does Cloud Foundry play with stateful containers?

It doesn't. It's intended to support 12-factor apps[1].

State is handled by services bound to your application. CF injects the connection details into the staging and runtime environments. Applications using buildpacks generally have that connection autoconfigured.

For example, Rails apps have database credentials autoconfigured if you have a database bound. Spring apps get a DataSource. And so on.
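As a sketch of what that injection looks like from the app's side: Cloud Foundry exposes bound services to the app as JSON in the VCAP_SERVICES environment variable. The service label and credential keys below are illustrative; real entries depend on the bound service broker.

```python
import json
import os

# Simulate what CF would inject (illustrative label and credentials).
os.environ["VCAP_SERVICES"] = json.dumps({
    "p-mysql": [{
        "name": "my-db",
        "credentials": {"uri": "mysql://user:pass@host:3306/db"}
    }]
})

def database_uri(service_label="p-mysql"):
    """Return the URI of the first bound instance of a service, if any."""
    services = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
    for instance in services.get(service_label, []):
        return instance["credentials"]["uri"]
    return None

print(database_uri())  # mysql://user:pass@host:3306/db
```

Buildpacks do essentially this on the app's behalf, which is how Rails and Spring apps get their connections autoconfigured.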

[1] http://12factor.net/

I'm familiar with 12-factor apps, but there are some use cases (usually involving existing / 3rd party software) where it's not really feasible. One thing I like about plain LXC is that you can keep state if you want to. Of course making it HA is a whole other story. :)

It looks like k8s has a way to handle volumes, so maybe I'll have to play with that.

Persistent volumes are on the horizon[1], now that Diego has landed.

But you'd be surprised how many apps can be ported to some semblance of cloud nativeness. Most of the stuff that Absolutely Must Be Stateful can usually be adapted to something else. Sticky sessions go to Redis or Memcache; local files can often be relocated into a shared filesystem/blobstore like Riak CS, etc.

The big thing about PaaSes is not that they make it a smooth, double-percent-ROI incremental improvement from the status quo. What happens is that unlocking true continuous deployment capability unblocks process backpressure throughout the entire organisation. The value you get from a complete shift in how you develop software swamps the cost of carrying legacy until it can be replaced.

Putting it another way: the sucky parts are tactical. The awesome parts are strategic.

[1] http://blog.pivotal.io/pivotal-cloud-foundry/case-studies/th...

Lattice looks pretty cool. Thanks for the reading material.

One more thing worth mentioning with fleet is that you can schedule anything systemd can handle. This means you can have it schedule timers, for example, and if the machine your timer runs on dies, fleet will re-schedule the timer on another machine.

For fleet a timer is just another systemd unit (there's no special support for them), so you get a simple "cluster-wide cron" pretty much for free.
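As a sketch, such a timer is plain systemd (unit names below are illustrative; you'd also submit a matching cleanup.service for the timer to fire):

```ini
# cleanup.timer - submit with "fleetctl start cleanup.timer"
[Unit]
Description=Run cleanup daily

[Timer]
OnCalendar=daily
Unit=cleanup.service

[Install]
WantedBy=timers.target
```

If the machine running the timer dies, fleet reschedules the unit elsewhere, which is what makes it cluster-wide.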

Fleet works fairly well, though it does have some minor maturity issues (I've gotten it into weird states when machines have left the cluster abruptly and rejoined, where e.g. it refuses to schedule some unit afterwards; solution: change the name of the unit). No idea how well it'll scale in larger deployments.

I have tried Swarm intensively and it is not ready at all, sadly! There is no support for pulling from private registries, there is no rescheduling if one of the nodes goes down, and there is no intelligent way to manage volumes.

You are actually better off managing each server individually than using Swarm.

(hard earned truth)

Yes, I should have mentioned in the article that Swarm is still heavily under development.

I'm still wondering if and how they're going to address the idea of co-scheduling groups of containers, like pods in Kubernetes. This is a common need, but Docker don't seem very keen: https://github.com/docker/docker/issues/8781

I just wanted to throw Rancher into the mix. Rancher is a more recent entrant into the space and still in beta. Rancher focuses on simplicity, flexibility, and pragmatic solutions for Docker-based infrastructure. Rancher is different in that it can be deployed in conjunction with all of the systems mentioned, or can be run as a replacement. It includes a scheduler, load balancing, health checks, service discovery, application templating based on compose syntax, storage management, an application catalog, upgrade management, GitHub/LDAP integration, a UI, API, CLI, etc.

Disclaimer: Co-founder of Rancher Labs and chief geeky guy behind Rancher

Rancher looks really good. I'm going to try it out - can you guys add a launch guide for RancherOS on DigitalOcean?

Also, I want to make a plugin/service that's dependent on what you've built, solely for the pun "Huevos RancherOS".

HN: Has anyone here worked significantly with SmartOS and its associated tools? I love the idea of a Zones/ZFS-backed container OS, but their documentation looks very sloppy. Does anyone here have extensive experience with it (other than Bryan Cantrill, whose BSDNow podcast interview is what got me looking)?

Yes, Joyent's Triton may be the best way to run Docker containers, not only for ZFS but also for Illumos Zones and network virtualization. You may be interested in this article from Casey Bisson about running Mesos on Triton:


I haven't, but note you can use ZFS with Docker today.

Nomad is another scheduler, released by HashiCorp. This probably wasn't available when the article was written, but I am curious how it would compare to the others.


Thanks for posting; Nomad looks really impressive. Have you run it in production?

Nomad is not ready for production. It's unfinished. Let's call it pre-alpha.

I'm leaning toward Kubernetes because that's what fabric8.io has chosen. The others may be great, but I'm not really interested in writing a lot of glue code to make them work well with Jenkins/Gerrit/Nexus/Slack/Docker/Chaos Monkey/OpenShift the way Fabric8 has.

Only Kubernetes has facilities for hosting external services reliably. I'm not sure why these other lower level tools are being compared to it. Especially Mesos and Marathon, which are much lower level.

They are being compared as they are the main options for clustering and scheduling containers. I'd agree Kubernetes is at a higher level to the other options, or at least comes with more features.

I'm not sure what you mean by "hosting external services reliably" - what's external and who is unreliable?

Hmm, I'll answer my own question - you mean has a load-balancer built in, don't you?

And that doesn't recommend a comically brittle solution (as in marathon) or provide no facilities at all (as in swarm).

There's nothing wrong with these decisions. It's just odd to compare this group. It makes me wonder how people are considering these options and in what context.

You cannot field a reliable and available service to the outside world without this component. And the instrumentation and operationalization of this component is tricky.

Yeah, I agree that the load-balancer is important and not adequately addressed. I guess most people rig up their own set-up based on haproxy currently.

I don't get why you think I can't compare that group though. What should I compare Kubernetes with? The group represents the various approaches to clustering and orchestration that most people will consider for container deployments. They are all slightly different, which is what makes for an interesting comparison.

K8S pretty much stands on its own in the Docker ecosystem right now. I'm not aware of good alternatives that aren't comically janky as you described.

Marathon and Kubernetes are at the same level, so why do you call it lower level? Mesos, yes; and Kubernetes can run on Mesos too.

Marathon schedules long-running tasks, but doesn't offer any explicit facilities for binding a homogeneous group into a reliable responder set.

At least, last time I checked. I can see components in the Marathon API to help you build such a thing. But Kubernetes deals with a fundamentally more abstract set of primitives: groups of containers (pods), replicated pods, and automatically managed dispatch points to said groups called services.

The Marathon documentation literally recommends you hand-roll a periodically updated haproxy service to accomplish that last bit. While K8S introduces a fair amount of overhead to accomplish this feat reliably, that overhead pays off by having a very responsive and available system.

So yeah. K8s is working one level up the stack, as I see it. I think you can make something reasonable with only Marathon, and that thought is backed up by the reality of people doing it. But my own experience suggests K8S does more for you here. Hence my excitement about K8S-mesos.
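For concreteness, the Kubernetes "service" primitive mentioned above is just a small manifest; Kubernetes then maintains a stable dispatch point in front of whatever pods match the selector (names and ports below are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web          # routes to every pod carrying this label
  ports:
  - port: 80          # port the service exposes
    targetPort: 8080  # port the pods listen on
```

This is the bit Marathon leaves to you (hence the hand-rolled haproxy recommendation).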

I managed to achieve load balancing for a fleet of microservices using vulcand [0] and marathon callbacks [1].

It listens for marathon callbacks (for app create/update/scale), or when a task is killed/lost and updates the backends in etcd, propagating to all vulcand load balancers.

The tool is available on github [2]

[0] https://vulcand.io/proxy.html#backends-and-servers

[1] https://mesosphere.github.io/marathon/docs/event-bus.html

[2] https://github.com/kuende/kameni

This is a good thing. Perhaps you could get the Marathon folks to stop recommending such an awkward solution and recommend your work instead?

Well, it depends. My solution is opinionated, using only one vulcand cluster for load balancing, multiple vulcand servers listening on one etcd namespace.

For a small number of nodes (< 100) and requests (< 8000 req/s on each vulcand server) this vulcand approach is OK. For large scale, HAProxy/Nginx supports a lot more req/s than vulcand, and I think it can also be configured using confd [0] with a similar approach: listen for marathon events, update etcd keys, then confd will listen for changes and reload HAProxy/Nginx.

[0] http://ox86.tumblr.com/post/90554410668/easy-scaling-with-do...
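As a sketch of the confd side (the template-resource format is confd's, but the keys and paths here are illustrative): a template resource watches an etcd prefix and re-renders the HAProxy config, then reloads HAProxy.

```toml
# /etc/confd/conf.d/haproxy.toml (illustrative paths and keys)
[template]
src        = "haproxy.cfg.tmpl"
dest       = "/etc/haproxy/haproxy.cfg"
keys       = ["/services/web"]
reload_cmd = "service haproxy reload"
```

The referenced haproxy.cfg.tmpl would iterate over the watched keys (e.g. with confd's `getvs` template function) to emit one `server` line per backend.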

Just to double down on that previous point, check out the code for haproxy-marathon-bridge.


What is the threshold number of computers past which using this stuff is worth the tradeoff in complexity?

This is something we face in our environment. The OSes are pretty homogenous, but the application set is very diverse; 16 of these, 2 of these, 8 of those, and so on. It's made previous orchestration tools a bit more unwieldy, but of course manual control is unwieldy as well!

In addition, of course, the task of learning, implementing, and evaluating the options take a large amount of time on top of the time we already spend (mostly) manually maintaining infrastructure.

Articles like this are a great stepping stone.

I deploy a pretty small app with like 2 instances per deploy and 3 deploys (beta and 2 production versions for different clients) and Kubernetes is 100% worth it for the rolling updates and the ability to build the image once and push it different places.

Would have liked to see Juju in this comparison. Maybe Juju is too flexible, since it's not restricted to just deploying Docker containers?

(disclosure, I'm a Juju dev)

Is there an easy way to try Juju on a laptop? Like a vagrant vm or something? Thank you :)

Sure, there's a local provider that uses LXC containers on your local machine (Ubuntu only), and we're working on a way to use LXD (which is a much cleaner local environment).

If you are running OSX, Windows, or a non-Ubuntu linux, you'd need an Ubuntu VM to run the local provider (we support Centos, Windows, and OSX for the juju client in general to control remote providers like Azure, Amazon, etc, just not for running the local provider).


Any thoughts on comparisons of these vs Amazon ECS? I know it's not open and portable, but still interested in understanding the differences.

We attempted to use ECS for a while before ultimately switching to Kubernetes. While its tight, built-in integration with AWS's Elastic Load Balancers and Auto-Scaling Groups made it fit well in the AWS ecosystem, we found that there wasn't enough visibility into the system. Containers would be stopped without notification or logging and not restarted.

We've found the Kubernetes primitives to be the easiest and most straight-forward to work with while still providing a very powerful API around which to wrap all sorts of custom tooling.

I am looking forward to the day when Mesos (or any other PaaS software) will be able to use Ceph to save application (app/container/VM) state. In such a setup applications could move around the compute cluster without losing state.

Full disclosure, I work at Google on Kubernetes.

Kubernetes offers this as a plug-in today.


Mesos does now have early-stage support for persistent disks. See https://github.com/apache/mesos/blob/master/docs/persistent-...

There are also technologies like Flocker that will allow you to do this with Docker volumes (which can then be run inside Mesos).

Both of these are in a pretty early stage of life, so there's some rough edges, but if you really need it, it's there.

Indeed you can cobble together such a solution, but baked-in integration at the framework level would be more interesting. If state is exposed at the scheduler level, then it's much easier for all scheduler frameworks to take advantage of it.

Containers and cluster storage bring you really close to what Google's infrastructure looks like, especially when the systems are fully fault-tolerant like Mesos and Ceph. One of the drawbacks of Ceph is that its filesystem is not ready for production yet, so you have to resort to block devices that can only be accessed from one host at a time.

At Quobyte (http://www.quobyte.com, disclaimer: I am one of the founders), we have built a fully fault-tolerant distributed file system. This allows concurrent scalable shared access to file systems from any number of hosts. Think of a /data that is accessible from any host and can be mapped in any container.

What I found pretty neat was that we could easily do a MySQL HA setup on Mesos: put MySQL in a container, use a directory on /quobyte for its data, and enable Quobyte's mandatory file locking. When you kill the container, or switch off/unplug its host, the container gets rescheduled and recovers from the shared file system.

We're already doing this with kubernetes + fleet.

Ceph itself seems really stable, though we did have an issue with Kubernetes not acquiring a lock on Ceph and thus having two nodes write in the event that one went unresponsive. This was patched in a later version, but make sure it's merged into the release you go with.

So the article seems to suggest that for smaller clusters use Kubernetes, for larger ones use Mesos.

Would people agree?

Author here. I would say it's a bit more subtle than that.

If you have a very large cluster (1000s of machines), Mesos may well be the best fit, as you are likely to want Mesos's support for diverse workloads and the extra assurance of the comparative maturity of the project.

The big difference with Kubernetes is that it enforces a certain application style; you need to understand the various concepts (pods, labels, replication controllers) and build your applications to work with those in mind. I haven't seen any figures, but I would expect Kubernetes to scale well for the majority of projects.

> The big difference with Kubernetes is that it enforces a certain application style

In our experience, it isn't so much that Kubernetes enforces a certain style as that it doesn't force services to understand the scheduling layer. A Twelve-Factor-style app will deploy very nicely and easily into a Kubernetes cluster.

The one thing I don't totally understand about the Kubernetes scaling issue is that you can ostensibly use Mesos as the scheduler for Kubernetes, which naively seems like it would allow Kubernetes to scale to the same size as Mesos. Or is the DNS and Service stuff the part that doesn't scale as well?

Absolutely! We're actively working on this here at Mesosphere, in partnership with Google. You can check out the project at https://github.com/mesosphere/kubernetes-mesos

You might also be interested in a talk I gave at the Kubernetes launch in August about our work combining the two: https://youtu.be/aXcdHwQ5GgQ

Full disclosure, I work at Google on Kubernetes.

a1r is exactly right: Mesos does a great job of virtualizing your data center, and Kubernetes is a framework on top of that.

There are a bunch of scale limits in Kubernetes itself that are being chipped away at; the bottleneck is not the scheduler, so plugging a new scheduler in won't make the system more performant.

Here's a blog discussing the current state of K8s scale: http://blog.kubernetes.io/2015/09/kubernetes-performance-mea...

Note that Bob Wise at Samsung has been driving some horizontal scale testing, and has got a 1000-node cluster up and running, so that's a current "best case" scale number.

Here's Bob's video of a 1000 machines running on CoreOS' Tectonic product: https://www.youtube.com/watch?v=1bHUsBhPL20

(work at CoreOS)

I think Kubernetes on top of Mesos may become a reasonably common set-up, largely because it allows diverse workloads that can drive better usage and efficiency.

I'm not sure what the trade-offs are with running Kubernetes on Mesos - that would make for an interesting article.

(Mesosphere co-founder here)

Running workload-specific schedulers like Kubernetes on Mesos is one of its fundamental ideas. From the paper:

"It seems clear that new cluster computing frameworks will continue to emerge, and that no framework will be optimal for all applications. Therefore, organizations will want to run multiple frameworks in the same cluster, picking the best one for each application."


If you're facing endpoints at the outside world and would like to do so easily and with pretty good reliability, Kubernetes on Mesos is probably the way to think about it.

Or should, when the K8S-Mesos stuff is a bit more mature.

Anyone using Dkron? http://dkron.io/

What I really need is a "distributed cron", first and foremost, with the additional requirements that it be lightweight on resources and multiplatform. Dkron seems to fit the description pretty well; I'm looking for any feedback from users.

Are any of these platforms useful for a small company that wants to have a basic self-hosted environment for the usual stuff?

They are looking for services like e-mail (SMTP, IMAP, maybe webmail), website (static + maybe wordpress), source code hosting (Subversion) and reviews (?), CI (Jenkins?), devops, and CRM...

Kubernetes can be used to deploy anything that you can put into a Docker container, including support for persistent volumes (i.e. "mode 1" applications). I've been using it recently to host XMPP servers, GitLab, Jenkins, etc.
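As a sketch, claiming a persistent volume is a small manifest of its own (name and size illustrative; a matching PersistentVolume or provisioner has to exist on the cluster side, and a pod then mounts the claim by name):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gitlab-data
spec:
  accessModes: ["ReadWriteOnce"]   # single-node read-write
  resources:
    requests:
      storage: 10Gi
```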

I haven't installed Kubernetes directly, but I have set up and used OpenShift v3, which adds a PaaS solution on top of Kubernetes. Setting it up is really easy, and they've released an all-in-one VM to demo it: http://www.openshift.org/vm/

The other option is to use a hosted solution. Google Container Engine (https://cloud.google.com/container-engine/) is essentially hosted Kubernetes.

PS I work for Red Hat. OpenShift V3 is our product, and we contribute a lot to Kubernetes.

Hard to say without more detail, but I would lean towards probably not.

For the environment conjured in my head by what you're describing, I would probably use something like Salt or Puppet to drive installation, config management/monitoring and upgrades, possibly on top of oVirt or similar.

I don't see an advantage for tiny shops containerizing everything at this point, at least until some bright shiny future where containerization is much more of the norm and platforms like this are much more mature and easy to manage. Your sysadmin almost certainly has better things to do than adding wrappers and indirection to a tiny environment.

I'd use a public PaaS. Heroku pioneered it. I like Pivotal Web Services because I work for the company which runs it, the same opensource platform (Cloud Foundry) powers BlueMix as well. There's also OpenShift from Red Hat.

Rolling your own is very time-consuming if your goal is to deliver simple apps.

Anything to say about Flynn?


Flynn is a high-level platform that doesn't require you to think about low-level components like schedulers, overlay networking, service discovery, consensus, etc. Flynn does of course include a built-in scheduler and other components that accomplish many of the goals mentioned in the article.

(disclosure: I am a cofounder of Flynn)

The article focused on systems for running straight Docker containers, so I didn't investigate Flynn.

How many people actually need this level of scale vs. how many people are implementing this because it's $BUZZWORD?

It's not about scale, it's about manageability.

Having a set of uniform nodes where you schedule containers is nicer, in my opinion, than managing an infrastructure where you've scripted applications to go places.

Sure, you could make Puppet manage containers on uniform nodes, but then we're having a massive convention-vs-configuration argument, which we shouldn't bother with because the distributed schedulers are giving us a lot of important stuff mostly for free.

"This level of scale" is very trivial for some of these. E.g. for fleet, the only dependencies are systemd and etcd.

If you have a reachable etcd cluster (which you can run on a single machine, though you shouldn't), your machines run systemd, and you have keyed SSH access between them, you can pretty much start typing "fleetctl start/stop/status" instead of "systemctl start/stop/status", feed fleet systemd units, and have your jobs spread over your servers with pretty much no effort.

For me it's an effortless way to get failover.

E.g. "fleetctl start someappserver@{1..3}" to start three instances of someappserver@.service. With the right constraint in the service file they'll be guaranteed to end up on different machines.
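The constraint mentioned lives in the unit's [X-Fleet] section; a sketch of such a template unit (contents illustrative) might look like:

```ini
# someappserver@.service
[Unit]
Description=App server instance %i

[Service]
ExecStart=/usr/bin/docker run --rm --name app-%i example/app

[X-Fleet]
# Never co-locate two instances of this template on one machine
Conflicts=someappserver@*.service
```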

You need orchestration even at small scale (say more than 2 nodes). Otherwise you have to manually take care of container scheduling and failover etc.

Exactly. The biggest problem with running apps in containers is that you need to ensure you don't have your entire cluster sitting on one physical node (or virtual node, I guess); services can be restarted, auto-scaled, etc. Even at a small scale some orchestration and management is required.

>you need to ensure you dont have your entire cluster sitting on one physical node

Why is this the case? My understanding of containers is that they solve exactly this problem: restarting and rescaling only a fraction of the application (only an individual service).

Bringing up new containers will always take a finite amount of time.

If all of a cluster is deployed to a single host, and that host dies, then the recovery time is the time needed to reschedule the container elsewhere, load it onto the machine (if not cached), and execute it. This is all downtime, since every instance was lost.

If instances are on more than one host, then even in the case of a single host's failure the application is still available while the failed instance is rescheduled and started.

You do if you're running an internal PaaS or DCOS, and a large number of bigger organisations (e.g. Google...) do, because bin-packing saves on hardware costs.

Many don't, of course...

Also check out https://containership.io as a scheduler. The core of the system is open source, and it has built-in service discovery, DNS, and load balancing.
