The Basics: Kubernetes, Mesosphere, and Docker Swarm (hpe.com)
301 points by CrankyBear on Feb 15, 2017 | 96 comments

I've since moved on to Kubernetes and SaltStack, but I got into production Docker containers with Deis[1] a year ago. It has full Heroku buildpack support, is Kubernetes-backed, and comes with great container tools. I feel like it's sad that this project doesn't get more love. It's a 20-minute deploy to GKE or AWS (I recommend GKE).

If you run OSX, TheNewNormal's solo[2] and cluster[3] Kube setups are a breeze and have the Deis Workflow as a one-command install, so you have full local parity with deployment.

When I started with straight Docker, figuring out the transition to the real world was really daunting. Getting into a real production environment that I could compost quickly as I learned the paradigms was invaluable. Articles like this seem to buzzword their way into making the path muddier than it actually is.

Another big help for me was any time I needed to set something up, first thing, I'd look at how the bleeding edge was being done with Alpine. Practically you might not always run that light, but that community has all the answers.

[1]https://deis.com/ [2]https://github.com/TheNewNormal/kube-solo-osx [3]https://github.com/TheNewNormal/kube-cluster-osx

When I first heard about Deis it was heavily compared to Flynn [1]. It's a little hard to tell from the outside if the two are still competitors or if Deis is focusing on a lower level of abstraction.


They both support Heroku-like buildpacks. Major difference IMO is that Flynn supports databases and other stateful services.

I evaluated Flynn, Deis and Dokku and wasn't entirely satisfied with any of them, opting for Heroku instead.

I've promised people a blogpost on this and I will do it once I get time to evaluate RancherOS too.

Can you evaluate Kontena as well while you are at it?

Flynn was much less polished than Deis. At least, that was the case when I tried it last year.

I also started with Deis and moved on to k8s later on.

>"I've since moved on to Kubernetes and SaltStack"

I'm curious how those relate to each other. Are you using SaltStack to bootstrap Kubernetes?

I'm actually right in the middle of putting that together as well. I quickly arrived at analysis paralysis when choosing how to bootstrap Kubernetes and run it in production given the tremendous flexibility. Tons of options[1], but nearly all of them are incompletely documented. Ended up walking through Kelsey Hightower's K8s the hard way[2] repo to peel it all apart.

End result will probably look something like Terraform -> SaltStack -> Kubernetes.

[1] https://kubernetes.io/docs/getting-started-guides/

[2] https://github.com/kelseyhightower/kubernetes-the-hard-way

That's exactly how we started: we manually bootstrapped a cluster through "Kubernetes the Hard Way", turned those steps, plus whatever else we found during discovery, into Ansible playbooks, and now we are contemplating kube-aws to set up the production cluster in AWS.

Not the OP, but we use Google Container Engine (hosted Kubernetes), with Salt for the non-GKE VMs.

This is needed because K8s is not mature enough to host all the things. In particular, stateful sets are still in beta. Not sure if I would trust K8s to run our production databases even with stateful sets. We've had K8s kill pods for unknown reasons, for example, and the volume handling has also been historically a bit flaky. Fine for completely redundant, stateless containers, less fine for stateful ones.

Sure, that makes sense. Stateful services, especially databases, would make me nervous as well. I am curious: what are the advantages of GKE over deploying K8s on AWS? Or were you already on GCE?

Our production setup is actually still on Digital Ocean; we're currently testing GCP (GKE + Salt-managed VMs) with a staging cluster to see how it works in practice.

Before GCP, we set up a staging cluster on AWS. It was okay. The biggest pain point is that AWS's VPC does not match Kubernetes' requirement for per-pod IPs, so anyone who installs on AWS ends up setting up their own overlay network (such as Flannel or Calico). That's a big downside, because VPC isn't fun to deal with.
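For anyone wondering what the overlay actually has to hand out: as I understand it, something like Flannel gives each node its own slice of a cluster-wide pod range, and pods then get IPs from their node's slice. Here is a rough sketch of just that address carving (not the packet encapsulation). The CIDR, node names and `carve_pod_subnets` helper are made up for illustration.

```python
import ipaddress


def carve_pod_subnets(cluster_cidr, node_names, node_prefix=24):
    """Assign each node its own pod subnet carved out of the cluster CIDR."""
    network = ipaddress.ip_network(cluster_cidr)
    subnets = network.subnets(new_prefix=node_prefix)  # generator of per-node blocks
    assignments = {}
    for name in node_names:
        try:
            assignments[name] = next(subnets)
        except StopIteration:
            raise ValueError("cluster CIDR too small for this many nodes") from None
    return assignments


# Hypothetical cluster range and node names.
assignments = carve_pod_subnets("10.244.0.0/16", ["node-a", "node-b", "node-c"])
for node, subnet in assignments.items():
    print(node, subnet)
```

Each node ends up with a non-overlapping /24, which is what lets every pod have a routable IP of its own inside the cluster.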

You don't really notice how antiquated AWS is until you move to GCP. Everything — networking, UI, CLI tools, etc. — feels more modern, sleeker and less creaky.

One area where GCP is particularly superior is networking. Google gives you a Layer 3 SDN (virtual network) that's more flexible than the rigid subnet-carving you need to do with AWS VPC. The tag-based firewall rules are also a breath of fresh air after AWS's weird, aging "security group" model.

It's not all innovation, of course. Some services are carbon-copy clones of AWS counterparts: Pub/Sub is essentially SQS, Cloud Storage is S3, and so on, with only minor improvements along the way. Cloud Storage, for example, doesn't fix S3's lack of queryability. I've also been distinctly unimpressed with the "StackDriver"-branded suite of services, which do things like logging and metrics. I don't know why anyone in this day and age would bother making something that doesn't compare favourably to Prometheus.

I should add that the security situation on GKE could be better:

* GKE's Docker containers run in privileged mode.

* There's still no role-based authentication.

* Containers end up getting access to a lot of privileged APIs because they inherit the same IAM role as the machine.

* You can't disable the automatic K8s service account mount [1].

Another scary thing, unrelated to GKE, is that the VMs run a daemon that automatically creates users with SSH keys. As a team member, you can SSH into any box with any user name. Not sure if it's a real security weakness, but I don't like it.

I love the CLI tools ("gcloud" and so on), much nicer than awscli etc, and hopefully much friendlier to junior devs.

[1] You can disable the mount by mounting an "emptyDir" on top, but that makes K8s think the pod has local data, which causes the autoscaler to refuse to tear down a node. Fortunately there's an option coming to disable the service account.

Thanks for the detailed response. You mentioned "GKE's Docker containers run in privileged mode"; do you know why this is?

Not sure. Speculating here, but it's possible that it's because they're still working on building out GKE itself.

GKE is basically a wizard that just runs a pre-built image for the master and node VMs, and comes with some support for upgrades. There are very few settings [1] aside from the machine type. So it's pretty rudimentary. You'd think that GKE would come with a flashy dashboard for pods, deployments, user management, autoscaling and so on, but you're actually stuck with kubectl + running the Dashboard app [2] as a pod, which, while nice enough, is not integrated into the Google Cloud Platform web UI at all. Kubernetes runs fine, but GKE itself feels unfinished.

Anyway, a lot of people were asking for privileged mode back in 2015 [3], and it kind of looks like they turned it on by default rather than developing a setting for it.

[1] http://i.imgur.com/6pGRzl9.png

[2] https://github.com/kubernetes/dashboard

[3] https://github.com/kubernetes/kubernetes/issues/12048

>"GKE is basically a wizard that just runs a pre-built image for the master and node VMs..."

Interesting. So the master and the VMs are deployed in KVM or Docker containers running inside of KVMs then?

> Pub/Sub is essentially SQS

This is just wrong

> Cloud Storage, for example, doesn't fix S3's lack of queryability.

And this is now fixed in S3

Pub/Sub is feature-equivalent to SQS. There are improvements (e.g. gRPC, better performance), but it's still a basic topic/subscription system with a very simple model; for example, just like SQS, it doesn't support fanout.

Is S3's queryability fixed? Are you referring to the Inventory service or Athena? Because the former is just a big CSV file generated daily, and the second is for querying content. You still can't query the inventory of objects (e.g. find all objects created yesterday whose key contains the letter "x").
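To show what that example query turns into when the store can't answer it: you end up scanning a listing yourself. A minimal sketch, assuming a CSV snapshot with `key` and `last_modified` columns. Those column names, the sample rows and the `find_objects` helper are hypothetical, not the real Inventory format.

```python
import csv
import io
from datetime import date, datetime, timedelta


def find_objects(inventory_csv, on_day, key_contains):
    """Return keys from a listing that were modified on `on_day` and contain a substring."""
    matches = []
    for row in csv.DictReader(inventory_csv):
        modified = datetime.fromisoformat(row["last_modified"]).date()
        if modified == on_day and key_contains in row["key"]:
            matches.append(row["key"])
    return matches


# Hypothetical inventory snapshot; column names and layout are assumptions.
SAMPLE = """key,last_modified
logs/xray.json,2017-02-14T08:00:00
logs/plain.json,2017-02-14T09:30:00
img/box.png,2017-02-13T23:59:59
"""

today = date(2017, 2, 15)
yesterday = today - timedelta(days=1)
result = find_objects(io.StringIO(SAMPLE), yesterday, "x")
print(result)  # ['logs/xray.json']
```

It works, but it is a full scan over a file that is already up to a day old, which is the complaint.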

Okay, with S3 I was referring to Athena.

SQS is a "Simple Queue", Pub/Sub is not. There are no topics in SQS, just distinct queues. There is no 1-to-many. There is no push. And so on...

SNS is Amazon's push notification service, and any number of queues can subscribe to an SNS topic for 1-to-many.
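The difference between a plain queue and a topic with queues subscribed to it fits in a few lines. This is a toy in-memory model only, not the SNS, SQS or Pub/Sub API; the `Queue` and `Topic` classes are invented for the illustration.

```python
from collections import deque


class Queue:
    """A single queue: each message is handed to exactly one receiver."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None


class Topic:
    """A topic: every subscribed queue gets its own copy of each message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, queue):
        self._subscribers.append(queue)

    def publish(self, message):
        for queue in self._subscribers:
            queue.send(message)


billing, audit = Queue(), Queue()
orders = Topic()
orders.subscribe(billing)
orders.subscribe(audit)
orders.publish("order-42")

print(billing.receive(), audit.receive())  # both queues got a copy
print(billing.receive())                   # None: a queue hands each message out once
```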

Is there a bug for the emptyDir autoscaler node tear down issue? That sounds really wrong by itself.

It's working as intended, as I understand it. A pod with a host directory is tied to the host. The problem isn't emptyDir, it's that GKE doesn't let you disable the service account mount.

Title really shouldn't include "Mesosphere." It should be "Marathon" or "DC/OS". Or, if you keep Mesosphere, it should be "Google, Mesosphere, and Docker."

Aside from that, it's not a bad overview. I particularly like this quote:

“Mesos and Kubernetes are largely aimed at solving similar problems of running clustered applications; they have different histories and different approaches to solving the problem. Mesos focuses its energy on very generic scheduling and plugging in multiple different schedulers.” In contrast, says Burns, “Kubernetes was designed from the ground up to be an environment for building distributed applications from containers. It includes primitives for replication and service discovery as core primitives, whereas such things are added via frameworks in Mesos. … Swarm is an effort by Docker to extend the existing Docker API to make a cluster of machines look like a single Docker API.”

Mesos is useful when you have requirements that don't fit in k8s' rather opinionated world view. If you can use k8s, it's great, but it's also changing rapidly, so be aware you're on the bleeding edge. Docker Swarm is Docker finding a business model. (I'm biased here)

Better bleeding edge than end-of-life. We're using Marathon, and suddenly the entire UI is end-of-life, and your options are "use Marathon back-end only without a UI" or "migrate to DC/OS" which is even more opinionated than K8s. I think the real tell is that Mesosphere don't have the cash/people to maintain a non-DC/OS UI for marathon, and there isn't a community able/willing to do it for them.

About a year ago there was a decision made in our project to "just use Marathon" (our technical lead has a bit of a tendency to declare things as simple, without thinking too hard about operational problems) - a year later (...don't ask...) and I very much have the feeling we've backed the wrong horse.

Conversely... I'm deeply suspicious of excited discussions of distributed systems these days because it doesn't seem like a lot of people are truly running them "at scale". There aren't a lot of write-ups about how people handle 300+ node clusters and, probably more importantly, how those clusters are used (how do their internal users use them).

> Swarm is an effort by Docker to extend the existing Docker API to make a cluster of machines look like a single Docker API

This isn't really true any more; it's a reference to the original version of Swarm. The version of Swarm that came with Docker 1.12 is completely rethought and not constrained in the same way.

We've been running Kubernetes in production for over a year. It has a steep learning curve and it takes a lot of time to fine-tune the underlying infrastructure (we use CoreOS) to make it production-ready. There still seem to be a lot of shortcomings in k8s that seem like they should have been addressed by now:

- it is impossible to trigger a rescheduling / rebalancing. When a new node comes in (via AutoScaling policy or whatever), kubernetes doesn't do anything. Thus, the new nodes can be sitting there doing nothing.

- once a pod has been scheduled onto a node, it never reschedules it anywhere else. The node may be experiencing problems, thus the pod is affected. k8s doesn't do anything to heal that -- it could simply delete the pod so it is rescheduled somewhere else.

- docker itself constantly ships with a lot of bugs. To this day (we are on docker 1.12.6), we constantly have problems with the docker daemon hanging or becoming unresponsive. I'm not sure if k8s can do much about this, but I feel like it should since we don't directly control docker.

- doesn't integrate more tightly with the OS / cloud provider. For example, it could perform health checks and decide if the node should be restarted, terminated, or idle.

All of our services are stateless, so it would be nice to have the option for all of the above, especially since k8s started out as the solution for stateless apps.

This is right on point with our experience. Kubernetes is great tech but it could be smarter with respect to scheduling decisions.

Docker is always packed full of bugs; we had to increase redundancy everywhere in our stack. It's consistently been the source for me getting paged at night, and now it's the first thing I look at when there is a new issue. Components that I feel could fail more easily have held up fine: our etcd and Consul clusters have been chugging along for more than a year, even though they solve a problem much more complex than Docker does. Docker has been "production-ready" for years now, but I would not recommend it. The developers always fix that critical, production-affecting bug in the next release, but it's been disappointing for some time. I look forward to rkt + Kubernetes getting more mature.

> - once a pod has been scheduled onto a node, it never reschedules it anywhere else. The node may be experiencing problems, thus the pod is affected. k8s doesn't do anything to heal that -- it could simply delete the pod so it is rescheduled somewhere else.

IIRC, if a node goes from NodeReady to NodeNotReady, pods are drained from it.

We've been using Docker as well and I can attest to the sheer number of bugs and constant regressions. I don't think a release has gone by without a bug that prevents us from updating. From slow pulls, to slow extractions, to issues with SELinux, to namespace incompatibilities, to the daemon just hanging repeatedly. It's frustrating because the technologies around Docker (Mesos, Kubernetes, Marathon, etc) seem to be getting better and more stable while Docker just continues to have issues.

Also, in our case it doesn't help when CoreOS auto-upgrades Docker versions :(

I'm really looking forward to rkt so that we can finally have a solid alternative to docker.

> - docker itself constantly ships with a lot of bugs. To this day (we are on docker 1.12.6), we constantly have problems with the docker daemon hanging or becoming unresponsive. I'm not sure if k8s can do much about this, but I feel like it should since we don't directly control docker.

To be fair, sometimes these problems are due to the kernel. Specifically, the infamous unregister_netdevice ref count issue (https://github.com/docker/docker/issues/5618) has been around for years. One of the comments from a kubernetes dev says they're bypassing the cause and don't see it in GKE production.

> - once a pod has been scheduled onto a node, it never reschedules it anywhere else. The node may be experiencing problems, thus the pod is affected. k8s doesn't do anything to heal that -- it could simply delete the pod so it is rescheduled somewhere else.

Can you please elaborate on this? When you are using replication controllers or deployments, don’t they drive the state to the desired/goal state, which is N replicas of a pod? So when the node is shut down, I guess it should be rescheduling those dead pods somewhere else to satisfy the goal state?

You may have misunderstood me. The case I'm talking about is when the node reports Ready, but the pod itself is not functioning properly.

One common issue we have is the pod gets stuck in a restart loop (for whatever reason, including starvation of resources). k8s just keeps restarting it for days on that node, instead of simply rescheduling it after X restarts or some other condition.
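The "reschedule after X restarts" rule could be approximated from outside the scheduler by a small watchdog. A minimal sketch of just the selection step, over hand-written data shaped like pod objects from the API (`status.containerStatuses[].restartCount`); the helper name and the threshold are assumptions, and actually deleting the pods is left out.

```python
def pods_stuck_restarting(pods, max_restarts=10):
    """Return names of pods where any container has restarted more than max_restarts times."""
    stuck = []
    for pod in pods:
        statuses = pod.get("status", {}).get("containerStatuses", [])
        if any(s.get("restartCount", 0) > max_restarts for s in statuses):
            stuck.append(pod["metadata"]["name"])
    return stuck


# Hand-written sample data shaped like pod objects (e.g. from `kubectl get pods -o json`).
pods = [
    {"metadata": {"name": "web-1"}, "status": {"containerStatuses": [{"restartCount": 2}]}},
    {"metadata": {"name": "web-2"}, "status": {"containerStatuses": [{"restartCount": 57}]}},
    {"metadata": {"name": "web-3"}, "status": {}},
]
print(pods_stuck_restarting(pods))  # ['web-2']
```

Deleting a pod that is owned by a replication controller or deployment gets a replacement created, though nothing guarantees the replacement lands on a different node.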


Isn't that what custom liveness probes are for?

No, a liveness probe just restarts the pod if the check fails; it doesn't kill it.

I think a failed liveness will restart the container, not the whole pod.

You're right. We always have one container per pod so I didn't fully think about this.

Isn't that because the pod's restart policy is set to restart? What if you set that to off? Does it fail the whole pod?

Isn't this basically eviction? https://github.com/kubernetes/community/blob/master/contribu...

Are you saying this doesn't work?

Eviction works fine when k8s detects there's a problem. I was referring to cases like this one: https://github.com/kubernetes/kubernetes/issues/13385

When I was solving this problem about a year ago for a previous company, I got around this by having a simple health checker that killed any node not properly responding to dns/docker/etc queries, and automatically replacing it with a new node.

Granted we were using mesos not k8s, but I suspect a similar approach could work here too.
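Roughly what such a checker looks like, as a sketch with the probes and the replacement step passed in as plain functions. Everything here (`unhealthy_nodes`, `sweep`, the stand-in checks, the node names) is hypothetical; real checks would query DNS through the node, ping its Docker daemon, and so on.

```python
def unhealthy_nodes(nodes, checks):
    """Return nodes that fail at least one check; a check that raises counts as a failure."""
    failed = []
    for node in nodes:
        for check in checks:
            try:
                ok = check(node)
            except Exception:
                ok = False
            if not ok:
                failed.append(node)
                break  # one failed check is enough to condemn the node
    return failed


def sweep(nodes, checks, replace_node):
    """Run all checks once and hand every failing node to replace_node."""
    bad = unhealthy_nodes(nodes, checks)
    for node in bad:
        replace_node(node)
    return bad


# Stand-in checks for the demo.
def dns_ok(node):
    return node != "node-2"


def docker_ok(node):
    if node == "node-3":
        raise TimeoutError("docker daemon not responding")
    return True


replaced = []
sweep(["node-1", "node-2", "node-3"], [dns_ok, docker_ok], replaced.append)
print(replaced)  # ['node-2', 'node-3']
```

In practice `replace_node` would terminate the instance and let the auto-scaling group (or equivalent) bring up a fresh one.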

With all of this talk about docker, Kubernetes and the rest I feel peer pressured into ditching my monolithic heroku rails app and switching to the distributed services heaven that docker seems to advertise.

Can anybody that has made the switch give me a convincing argument about why I should switch (or not)? My feeling is that docker is great if you are VP of Engineering at Netflix, but is probably not the best thing if you are starting a startup and just need to get things done.

Disclaimer: I'm not religious about this and I'm totally open to being convinced that I'm wrong.

> With all of this talk about docker, Kubernetes and the rest I feel peer pressured into ditching my monolithic heroku rails app and switching to the distributed services heaven that docker seems to advertise.

I successfully run lots of Docker microservices in production, and I strongly advise you to keep your app on Heroku as long as you can. :-)

Microservices make sense in two circumstances:

1. You have multiple teams of developers, and you want them to have different release cycles and loosely-coupled APIs. In this case, you can let each team have a microservice.

2. There's a module in your app which is self-contained and naturally isolated, with a stable API. You could always just make this a separate Heroku app with its own REST API.

But in general, microservices add complexity and make it harder to do certain kinds of refactorings. You can make them work, if you know what you're doing. For example, ECS + RDS + ALBs is halfway civilized, especially if you manage the configuration with Terraform, and set up a CI server to build Docker images and run tests. But it's still a lot more complex than a single, well-refactored app on Heroku.

I need to write my "journey through Docker" article. Now that I get it, I'm pretty much 100% convinced it's the best way to go, simply because everyone who needs to work on something gets a clean dev environment that just works. The bit about deployment/test/CI/production all being just about the same is wonderful.

The next thing I realise is that I'm very happy dealing with infrastructure now in a way I wasn't before; I've looked through Dockerfiles and know what they install and why, and if anything goes wrong it provides me with an immediate go-to, which is "let's add more servers" or "let's rebuild the environment and switch over" (should there be more users or more load than expected). Docker will buy you time here.

Docker removes the temptation to start editing stuff on servers in the event of issues.

In terms of doing a startup I think other people here are better placed to advise; if you are at the MVP stage, no, but for anything bigger than that I think it'll pay off.

Managing secrets is still an absolute pain though...

You state that the biggest benefit is consistency. You can get that without Docker. Try Ansible.

Ansible is not fashionable enough ;-)

Seriously though, knowing docker really well is more likely to improve my career and also having the ability to remove the devops issues associated with setting up dev environments is awesome. My Mac broke this week and I was able to switch to a different machine in 30 minutes because of that.

Does Ansible provide isolation of different dev environments? I think not.

>Seriously though, knowing docker really well is more likely to improve my career

Unfortunately true -- for now. However, your career would be even better served by gaining experience in non-fad technologies.

Also, "career development" is an offensive reason to deploy a technology for your employer. I recognize that it is common, but it's still improper to prioritize resume points over the employer's long-term stability and interests. Personally, when a candidate gives off that vibe to me, I pass on them every time.

>Does Ansible provide isolation of different dev environments? I think not.

I don't understand. Anything you can script in Docker, you can script in Ansible. They both allow the user to pass in arbitrary shell commands and execute anything they want on the target. How does this not accommodate "isolation of different dev environments"?

Maybe you mean that since you can execute a Docker container on your Mac, you don't need to set up a "local" env? Docker transparently uses a virtual machine to execute a Linux kernel on the Mac. You can execute an Ansible script on a normal VM the same way (optionally using something like Vagrant to give more simplistic, Docker-like (which is really Vagrant-like) CLI management).

It sounds like your current setup is working just fine, so I don't see a compelling reason to switch.

When you start to deploy many applications, and they need to talk to each other, and you need service discovery, automatic restart, rolling upgrades, horizontal scaling, etc - then Kubernetes brings a lot of value.

If you're a startup, then I'd look at the growth you're expecting. Containers scale well, and when you're big, maintaining multiple heroku apps eats into developer time that can be better spent somewhere else.

Of course, if you've just started, and are getting an MVP out the door, don't worry about docker just yet. And also don't listen to the microservices people. It'll be like putting the cart before the horse.

You can switch to Pivotal Web Services[0] (disclosure: I work for Pivotal on Cloud Foundry) and get the best of both worlds.

PWS is based on Cloud Foundry, which allows routing by path. So as an intermediate step towards decomposing your app into services, you can deploy different copies and have them respond to particular routes.

Cloud Foundry uses Heroku's buildpack code, with very minor changes, with additional testing. Your app will stage and run identically.

If you decide to switch to docker images, Cloud Foundry can run those too.

Cloud Foundry is a complete platform, rather than a collection of components. Test-driven, pair programmed, all that jazz. More mature and production-tested than any alternative that I'm aware of.

I think it's awesome, but I'm biased. Feel free to email me.

[0] https://run.pivotal.io/

> distributed services heaven

"The grass is always greener on the other side"

There are plenty of upsides to the distributed approach. But there are downsides to distributed too which don't get discussed as much. Things like communication between nodes, fault tolerance, monitoring and handling failure. Same case with having many microservices. Also this stuff becomes time consuming if you are a solo dev / small dev team.

IMO one approach isn't better than the other for all cases. Maybe I'm a bit of a laggard here, but I still like Heroku and believe in just doing enough infrastructure to support where your app is / is going in the near future.

The ecosystem around containerization is still emerging and is in a rapid state of flux. I really wouldn't recommend getting production anywhere near it for the next 2 years minimum, and realistically, you probably want to wait more like 5.

Both Docker and k8s change quickly and both lack functionality that most people would consider pretty basic. Google may have transcended into a plane where persistent storage and individually-addressable servers are a thing of the past, but the rest of us haven't. Many things that an admin takes for granted on a normal setup are missing, difficult, convoluted, or flat out impossible on k8s/Docker.

We're converting our 100+ "traditional" cloud servers into a Docker/k8s cluster now, and it's a nightmare. There's really no reason for it. The biggest benefit is a consistent image, but you can get that with much less ridiculous tooling, like Ansible.

My opinion on the long-term: containers will have a permanent role, but I don't think it will be nearly as big as many think. Kubernetes will be refined and become the de-facto "cluster definition language" for deployments and will take a much larger role than containers. It will learn to address all types of networked units (already underway) and cloud interfaces/APIs will likely just be layers on top of it.

The hugely embarrassing bugs and missing features in both k8s and Docker will get fleshed out and fixed over the next 2-3 years, and in 5 years, just as the sheen wears off of this containerland architecture and people start looking for the next premature fad to waste millions of dollars blindly pursuing, it will probably start to be reasonable to run some stable, production-level services in Docker/k8s. ;) It will never be appropriate for everything, despite protests to the contrary.

I think the long-term future for k8s is much brighter than the future for Docker. If Docker can survive under the weight of the VC investments they've taken, they'll probably become a repository management company (and a possible acquisition target for a megacorp that wants to control that) and the docker engine will fall out of use, mostly replaced by a combination of container runtimes: rkt, lxc, probably a forthcoming containerization implementation from Microsoft, and a smattering of smaller ones.

The important thing to remember about Docker and containers is that they're not really new. Containers used to be called jails, zones, etc. They didn't revolutionize infrastructure then and I don't think they will now. The hype is mostly because Docker has hundreds of millions of VC money to burn on looking cool.

If Docker has a killer feature, it's the image registry that makes it easy to "docker pull upstream/image", but the Dockerfile spec itself is too sloppy to really provide the simplicity that people think they're getting, the security practices are abysmal and there will be large-scale pwnage due to it sometime in the not-too-distant future, and the engine's many quirks, bugs, and stupid behaviors do no favors to either the runtime or the company.

If Docker can nurse the momentum from the registry, they may have a future, but the user base for Docker is pretty hard to lock in, so I dunno.

tl;dr Don't use either. Learn k8s slowly over the next 2 years as they work out the kinks, since it will play a larger role in the future. In 5 years, you may want to use some of this on an important project, but right now, it's all a joke, and, in general, the companies that are switching now are making a very bad decision.

I'm actually swinging hard against containers at the moment.

I've been playing around with the runv project in docker (definitely not production ready) and running containers in optimized virtual-machines...and it just seems like the better model? Which, following the logic through, really means I just want to make my VMs fast with a nice interface for users - and I can, with runv I can spin up KVM fast enough to not notice it.

Basically...I'd really rather just have VMs, and pour effort into optimizing hypervisors (and there's been some good effort along these lines - the DAX patches and memory dedupe with the linux kernel).

Excellent analysis, much in line with my own thoughts and experience. Thanks for taking the time to write this down.

Isn't the goal of software development to reduce complexity, both in real life and the software itself? I would argue that if your work is on Heroku and you know what you're doing, why chase after intermingled microservice hell?

It is cool for big teams where operations play a big role, but I've found myself switching back to Heroku for some small (as in team size) projects.

I haven't worked anywhere that's doing containerization since the rise of the container managers being discussed here, but I can comment on some of the benefits we saw from deploying Docker at a previous job in what was, at the time, a fairly standard Consul/Consul-Template/HAProxy configuration with Terraform to provision all the host instances.

1. We could run sanity tests, in production, prior to a release going live. Our deploy scripts would bring up the Docker container and, prior to registering as live with Consul, make a few different requests to ensure that the app had started up cleanly and was communicating with the database. Only after the new container was up and handling live traffic would the old Docker container be removed from Consul and stopped. This only caught one bad release, but that's a short bit of unpleasantness that we avoided for our customers.

2. Deployments were immutable, which made rollbacks a breeze. Just update Consul to indicate that a previous image ID is the current ID and re-deployment would be triggered automatically. We wrote a script to handle querying our private Docker registry based on a believed-good date and updating Consul with the proper ID. Thankfully, we only had to run the script a couple of times.

3. Deployments were faster and easier to debug. CI was responsible for running all tests, building the Docker image and updating Consul. That's it. Each running instance had an agent that was responsible for monitoring Consul for changes and the deploy process was a single tens-of-megabytes download and an almost-instantaneous Docker start. Our integration testing environment could also pull changes and deploy in parallel.

4. Setting up a separate testing environment was trivial. We'd just re-run our Terraform scripts to point to a different instance of Consul, and it would provision everything just as it would in production, albeit sized (number/size of instances) according to parameterized Terraform values.
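Points 1 and 2 above can be sketched in a few lines. This is an outline of the ordering only, not our actual scripts: the Consul and Docker calls are passed in as plain callables, and every name here (`deploy`, `pick_rollback_image`, the image IDs and dates) is invented for the illustration.

```python
from datetime import date


def deploy(new_id, old_id, start, smoke_test, register, deregister, stop):
    """Start the new container, verify it, and only then swap it in for the old one."""
    start(new_id)
    if not smoke_test(new_id):
        stop(new_id)        # never registered, so no live traffic reached it
        return False
    register(new_id)        # new container starts taking traffic
    deregister(old_id)      # old one is pulled only after the new one is live
    stop(old_id)
    return True


def pick_rollback_image(images, believed_good):
    """Return the ID of the newest image built on or before the believed-good date."""
    candidates = [(built, image_id) for image_id, built in images.items()
                  if built <= believed_good]
    if not candidates:
        raise LookupError("no image at or before that date")
    return max(candidates)[1]


# Demo with recording stand-ins instead of real Docker/Consul calls.
events = []


def log(name):
    return lambda image_id: events.append((name, image_id))


ok = deploy("img-b", "img-a", log("start"), lambda _id: True,
            log("register"), log("deregister"), log("stop"))
print(ok, events)

images = {"img-a": date(2017, 2, 1), "img-b": date(2017, 2, 10), "img-c": date(2017, 2, 14)}
print(pick_rollback_image(images, date(2017, 2, 12)))  # img-b
```

Because images are immutable, a rollback is just another deploy with an older ID.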

Docker also made a lot of things easier on the development side too. We made a development version of our database into a Docker image, so the instructions for setting up a completely isolated, offline capable dev environment were literally install Docker/Fig (this was before Docker had equivalent functionality), clone the repo and tell Fig to start (then wait for Docker to pull several gigs of images, but that was a one-time cost.)

As I see it, the main thing that Kubernetes and the rest of the container managers will buy you is better utilization of your hardware/instances. We had to provision instances that were dedicated to a specific function (i.e. web tier) or make the decision to re-use instances for two different purposes. But the mapping between docker container and instance/auto-scaling group was static. Container managers can dynamically shift workloads to ensure that as much of your compute capacity is used as possible. It was something we considered, but decided our AWS spend wasn't large enough to justify the dev time to replace our existing setup.
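A toy illustration of the utilization argument: instead of one instance per function, let something place containers wherever they fit. The sketch below uses first-fit decreasing bin packing, which is my stand-in and not how Kubernetes or any other scheduler actually decides; the `pack` helper, container names and CPU figures are made up.

```python
def pack(containers, node_capacity):
    """First-fit decreasing: put each container on the first node with room, adding nodes as needed."""
    nodes = []  # names of the containers placed on each node
    free = []   # remaining capacity per node
    by_size = sorted(containers.items(), key=lambda item: item[1], reverse=True)
    for name, cpu in by_size:
        if cpu > node_capacity:
            raise ValueError(f"{name} does not fit on any node")
        for i, remaining in enumerate(free):
            if cpu <= remaining:
                nodes[i].append(name)
                free[i] -= cpu
                break
        else:
            # No existing node had room, so bring up another one.
            nodes.append([name])
            free.append(node_capacity - cpu)
    return nodes


# Hypothetical CPU requests (in cores) against 4-core nodes.
containers = {"web": 2.0, "api": 1.5, "worker": 2.5, "cron": 0.5, "cache": 1.0}
placement = pack(containers, node_capacity=4.0)
print(len(placement), placement)
```

With these numbers, five workloads fit on two nodes, where a static one-function-per-instance mapping would have used five.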

Having not used Heroku, I can't say how much of this their tooling gives you, but I think it comes down to running your own math around your size and priorities to say whether Docker and/or any of the higher-level abstractions are worth it for your individual situation. Containers are established enough to make a pretty good estimate for how long it will take you to come up to speed on the technologies and design/implement your solution. For reference, it took 1 developer about a week for us to work up our Docker/Consul/Terraform solution. If you look at the problems you're solving, you should be able to make a pretty good swag at how much those problems are costing you (kinda the way that we did when we decided that Kubernetes wouldn't save us enough AWS spend to justify the dev time to modify our setup). Then compare that to the value of items on your roadmap and do whatever has the highest value. There's no universally correct answer.

>" I’ve spoken with several DevOps who argue that compared with running a Kubernetes or Mesos cluster, Docker Swarm is a snap."

I'm always leery of statements like that. My experience with cluster managers (Yarn, Mesos) and distributed systems in general is that they are almost never "a snap" to run once you move past trivial workloads and requirements.

Not many startups move past trivial workloads. And if you do, it's a good problem to have. If swarm can get your mvp up quickly, it might be a good choice.

Yep, that's my exact experience with Swarm (1.13). I want the higher-level abstractions like K8s Pods and Deployments, but I absolutely cannot wait around for Swarm to stabilize those kinds of things.

My experience with Docker Swarm has been great. I came from Kubernetes but found it way too complex for my needs.

The author is (mis)quoting me there. I meant the initial process of installing and getting Swarm running is much easier than k8s or Mesos, not that on-going maintenance was easier.

Also, since I wrote that, k8s has done a lot of work to simplify installation with kubeadm and other improvements.

Ah OK. Yeah doing a greenfield deployment is one thing, "operationalizing" it is often more involved.

For Mesos, if anyone is interested and you are an Ansible shop, you can look at Ansible Shipyard to simplify an install.


Kubeadm looks interesting. Thanks.

Quoting from the disadvantages section: "Second, Kubernetes excels at automatically fixing problems. But it’s so good at it that containers can crash and be restarted so fast you don’t notice your containers are crashing. "

A key part of Kubernetes is to bring the state of the system to that of the spec, through a series of watches, and in some places polling (config maps watching in kubectl as of 1.4.x). This may look like magic, but it's not fixing problems per se; it's that your desired state of the system (the spec) is different from what is observed (the status). This is not a disadvantage. Kubernetes is not for someone sitting at the terminal expecting a bell or a stop-the-world when problems happen, although I guess you could configure it that way and fight against the system.
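
For anyone who finds the spec-vs-status idea abstract, here's a minimal sketch of a reconcile pass. The `Cluster` class, the `reconcile` function, and the replica counts are all hypothetical stand-ins of mine, not Kubernetes code; real controllers watch the API server rather than mutating a dict.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for observed cluster state (the 'status')."""
    running: dict = field(default_factory=dict)  # app name -> replica count


def reconcile(spec: dict, cluster: Cluster) -> list:
    """Compare desired replicas (spec) with observed replicas (status)
    and return the actions taken to converge them."""
    actions = []
    for app, desired in spec.items():
        observed = cluster.running.get(app, 0)
        if observed < desired:
            actions.append(f"start {desired - observed} x {app}")
        elif observed > desired:
            actions.append(f"stop {observed - desired} x {app}")
        cluster.running[app] = desired
    # Anything running that is no longer in the spec gets removed.
    for app in [a for a in cluster.running if a not in spec]:
        actions.append(f"stop {cluster.running.pop(app)} x {app}")
    return actions


spec = {"web": 3, "worker": 2}
cluster = Cluster(running={"web": 3, "worker": 2})

# A container crashes: observed state drifts away from the spec.
cluster.running["web"] -= 1
first_pass = reconcile(spec, cluster)
second_pass = reconcile(spec, cluster)
print(first_pass)   # ['start 1 x web']
print(second_pass)  # []
```

The loop never "notices a crash" as such. It only sees that observed and desired differ and closes the gap, which is why a fast crash/restart cycle can go by quietly unless you look at events or restart counts.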

Great article! I just made a point to get caught up on Docker and related elements of the containerization world.

Mesosphere looks like it works similarly to Nanobox (http://nanobox.io)

Unfortunately, Mesosphere is no longer focusing on Marathon as an Apache Mesos framework and is instead integrating it into DC/OS, their commercial offering. Open-source Mesos+Marathon no longer appears to be an option.

Isn't DC/OS now opensource? (https://github.com/dcos/dcos)

So DC/OS is a "one solution for all the things" offering, like K8s. It's "magical" and "monolithic": if it happens to do what you want, out of the box, then you're in luck. If it doesn't, or more likely, if you can't figure out how to make it do what it says it does, then there is a small community that can help you. And that's if you use the "magical" installer. I wouldn't even consider building the i-dont-know-how-many-but-more-than-ten (it's 30) different "open-source" packages that make up DC/OS myself, let alone deploy them.

If you are searching for a product in this space, then K8s is the way to go. It has a huge community.

If, on the other hand, you are looking for a small, non-monolithic, resource scheduler + some basic frameworks (i.e. Apache Mesos + Marathon) then DC/OS is not that.

But since Mesosphere are no longer supporting work on Marathon outside of DC/OS, or indeed any framework outside of DC/OS, Mesos/Marathon is effectively dead, since, while Mesos is open source, the only major supporter of frameworks for it is Mesosphere, and you won't get them unless you use DC/OS.

I think it's a matter of scope.

For some relatively small projects, k8s is sufficient. But pretty quickly you end up needing more and more functionality and you end up in a larger stack like OpenShift.

That's the place that Mesos and DC/OS (IMHO) shine: when you're working with large clusters (1000+) running very different workloads. Because in that scenario, k8s is still pretty immature and you're going to inevitably need to solve all the other problems that OpenShift, DC/OS, Rancher or other stacks solve. Just like Linux distros, in theory, anyone can do it. In practice, it's a pain and you want to pick up a standard distro. That's what these stacks provide: a pre-configured suite of open source tools that get you way more than k8s or Mesos or Marathon on their own provide.

Sure, but DC/OS has yet to demonstrate that it can do any of that. Mesos has demonstrated that it can handle clusters at scale, but I don't know of anyone running the full smorgasbord of all 30 services on a cluster that big, and it's the other 29 things that are going to fail at scale.

And again, if you are going to be running DC/OS at that scale, then you are going to be running the enterprise version, because you aren't going to be running that much magical shit on million dollar hardware without someone being paid to troubleshoot.

That might be me in a few years, except that Mesosphere seems to be attempting to kill me in the short term by killing non-DC/OS mesos/marathon while crippling non-enterprise DC/OS. So looks like we'll be migrating to K8s and hoping that in a few years K8s scales. The "K8s doesn't scale" argument is losing ground with every new release btw.

> but DC/OS has yet to demonstrate that it can do any of that [...] it's the other 29 things that are going to fail at scale.

Of course you are right that in DC/OS there is a lot more besides Mesos that could fail at scale. That is why we are carefully scale-testing DC/OS in its entirety -- under lab conditions (internally) as well as under real-world conditions (large production environments). We will be more transparent about this in the future, but one example I can give is that we regularly run DC/OS on 10^3 nodes for testing purposes.

> if you are going to be running DC/OS at that scale, then you are going to be running the enterprise version

I am not convinced by that argument. Still, just to clarify, Mesosphere Enterprise DC/OS does not have better scaling characteristics than (open) DC/OS. We are trying to land all corresponding goodness in DC/OS.

> killing non-DC/OS mesos/marathon

I understand that brutal decisions have been made. But if you look at Mesos and Marathon as of today (btw., there is no such thing as "non-DC/OS mesos/marathon"), progress is being made at an impressive rate. I observe more "animating" than "killing".

> while crippling non-enterprise DC/OS

I only see additions, no removals. Also, if you were looking at the core technology stack, the delta between Enterprise DC/OS and DC/OS will probably appear to be surprisingly small to you.

> The "K8s doesn't scale" argument is losing ground with every new release btw.

I agree.

The "scale argument" is insignificant compared to other aspects in most of the cases anyway. As we all know :-).

All of Siri and Azure run on top of DC/OS; it has certainly demonstrated it can be used for massive data centers.

Siri doesn't run on DC/OS. They use Mesos (i.e. the open source Apache Foundation project that Marathon and later DC/OS are built on top of), but write their own framework(s) that run on top of it.

Source: I worked on the team, although we had talked about this at mesoscon in the past.

All of Azure ? :) DC/OS is a moderately popular workload on Azure, yes.

Disclaimer: I work for MS

What makes you feel that non-enterprise DC/OS is crippled in any way?

Disclaimer: I work at Mesosphere.


>But since Mesosphere are no longer supporting work on Marathon outside of DC/OS, or indeed any framework outside of DC/OS, Mesos/Marathon is effectively dead, since, while Mesos is open source, the only major supporter of frameworks for it is Mesosphere, and you won't get them unless you use DC/OS.

Marathon isn't the only scheduler - Twitter (and some others) have been using Aurora (http://aurora.apache.org/). Currently we are deployed on Mesos/Marathon, and originally I thought Aurora may become more popular because it was Apache "blessed".

Will any of the commercial only stuff (like the distributed firewall or ldap authentication / authorization bits) ever be open sourced, or will DC/OS forever be open core crippleware?

That's a little harsh. I don't think the intention is to be open core crippleware - there are certainly a good number of users using (open) DC/OS in production without the enterprise features that you mention. I can't speak to the specific roadmap since I am uninformed but generally I believe the trend is for most enterprise features to trickle down into open over time.

Care to comment on GP?

We're very much committed to the Marathon open source community and keeping them happy.

I'm afraid I don't know what the plans are for the project (not working on Marathon myself) but being the most popular open source project run and maintained by Mesosphere, I have no doubt that we will continue adding new features and functionality.

Mesosphere has had an open-source free version for nearly a year now. You can read about it here:


Are there package repositories for DC/OS somewhere? I could only find these install scripts but I guess they are meant for demo installations.

DC/OS has a unique packaging mechanism, and cannot be found as a traditional Linux package. (Personally, I'd love to see that happen at some point but for simplicity's sake, it's distributed as a single binary - this allows us to upgrade the bits on nodes near atomically).

There are three ways to install it:

1) Cloud provider specific templates (e.g. CloudFormation / Azure Resource Manager. GCP support is on the backlog)

2) The ssh installer (this has a UI and a CLI). I believe this doesn't work well for larger installations because of the one to many issue.

3) The advanced install method. This generates a binary that is copied to each node. You can integrate this with Puppet, Chef, Ansible and so on.

https://dcos.io/install/ and https://dcos.io/docs/ have more details on this.

Source for this?


>Dear Marathon Community,


>Because of our focus on integrating the experience of using Marathon and DC/OS, we aren’t planning on updating the old UI further.

And by "old UI" they mean the one that works without DC/OS.

So there is no Apache Mesos + Marathon. There's Mesos. With no actively developed frameworks (Chronos is dead too). And then there's DC/OS.

I don't think that's really fair.

Marathon is still being developed as an open source Mesos framework. The _UI_ is going to stall out a bit, but it's just the UI.

Chronos is still being developed, but Mesosphere is taking Metronome and folding it into Marathon, making it more like Aurora. Moreover there's Singularity, PaaSTA, and a whole lot more:


Yes, DC/OS is intended to be a (mostly) full stack, but even then, it's still just Mesos under the hood and you can run any Mesos framework on it. The open source edition of DC/OS is fairly full featured, while the enterprise version gives you better account security, networking and integrated secrets (vault) support.

Mesosphere official github repo: "we aren’t planning on updating the old UI further"

Jaaron: "The _UI_ is going to stall out a bit, but it's just the UI."

My options are:

a) running latest marathon backend without the user interface

b) running latest marathon backend and clone/maintain the user interface

c) migrate the whole thing to enterprise DC/OS (since I need authentication)

d) migrate to K8s

I view the lack of option "(e) keep using marathon-the-product as it is now", as a breach of trust, so (c) is off the table (whereas, before, it was my expected endpoint). YMMV.

Speaking as the lead of marathon, I'll say this:

We have to support the company first which has a more integrated solution that actually has to make money at the end of the day. We are also a pretty damn small team with a huge backlog to deliver, so it sucks that we had to abandon the UI outside of DCOS. We hope that the components of the DCOS UI for marathon can become the native UI for marathon, but again, it's a balance of priorities.

Marathon by itself has a lot more coming in the future. Some of it will be restricted to DCOS, but not everything; it's a balancing act. Given our history of changing course publicly (I wasn't involved in these decisions), I'm waiting to share our plans for 1.5 until I'm confident we're committing to them.

Just a quick two cents.

I think that's even more scary. You're confirming that you've halted development on the Marathon-UI-for-Mesos because you are resource constrained. I've got a working Mesos+Marathon cluster. The Marathon part is now a dead end, but more importantly, I now know that dead-ending a user is an option for Mesosphere: and it's either because Mesosphere doesn't care, or because it does care, but is so resource constrained that it has no choice. Either option is a huge red flag.

I wish you had forked it. If you're resource constrained and need to meet goals for LargeCorp, then fork the entire front-end/back-end. The approach of just ditching the UI half of the app while you iterate the same backend but with the DC/OS UI is really what's causing the grief here.

I built Jollyturns (https://jollyturns.com) which has a fairly large server-side component. Before starting the work on Jollyturns I worked at Google (I left in 2010) on a variety of infrastructure projects, where I got to use first hand Borg and all the other services running on it.

When I started more than 5 years ago, few of the options mentioned in the article were available, so I just used Xen running on bare metal. I just wanted to get things done as opposed to forever experimenting with infrastructure. Not to mention the load was nonexistent, so everything was easy to manage.

After I launched the mobile app, I decided to spend some time on the infrastructure. 2 years ago I experimented with Marathon, which was originally developed at Twitter, probably by a bunch of former Google employees. The reason is Marathon felt very much like Borg: you could see the jobs you launched in a pretty nice Web interface, including their log files, resource utilization and so on. Deploying it on bare metal machines, however, was an exercise in frustration since Marathon relied heavily on Apache Mesos. For a former Googler, Mesos had some weird terminology and it was truly difficult to understand what the heck was going on. Running Marathon on top of it had major challenges: when long-running services failed you could not figure out why things were not working. So after about 2 weeks I gave up on it, and went back to manually managed Xen VMs.

Around the same time I spent a few days with Docker Swarm. If you're familiar with Borg/Kubernetes, Swarm is very different from it since, at least when it started, you had to allocate services to machines by hand. I wrote it off quickly since allocating dockerized services on physical machines was not my idea of cluster management.

Last year I switched to Kubernetes (version 1.2) since it's the closest to what I expect from a cluster management system. The version I've been using in production has a lot of issues: high availability (HA) for its components is almost non-existent. I had to set up Kube in such a way that its components have some semblance of HA. In the default configuration, Kubernetes installs its control components on a single machine. If that machine fails or is rebooted your entire cluster disappears.

Even with all these issues, Kubernetes solves a lot of problems for you. The flannel networking infrastructure greatly simplifies the deployment of docker containers, since you don't need to worry about routing traffic between your containers.

Even now Kubernetes doesn't do HA:



Don't be fooled by the title of the bug report: the same component used in kubectl to implement HA could be used inside Kube's servers for the same reason. I guess these days the Google engineers working on Kubernetes have no real experience deploying large services on Borg inside Google. Such a pity!

FYI, Kubernetes can do HA: the issues you pointed to are both about client connectivity to multiple API servers, which is typically solved today either using a load-balancer, or by using DNS with multiple A records (which actually works surprisingly well with go clients). The issues you pointed to are about a potential third way, where you would configure a client with multiple server names/addresses, and the client would failover between them.
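
A rough sketch of what that "third way" (a client configured with several server addresses that fails over between them) could look like. This is purely illustrative: `call_with_failover`, `fake_request`, and the hostnames are hypothetical names of mine, and this is not how kubectl or the Go client is implemented.

```python
def call_with_failover(endpoints, request):
    """Try each endpoint in order; return the first successful response."""
    errors = []
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except ConnectionError as exc:
            errors.append(f"{endpoint}: {exc}")
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))


# Hypothetical stand-in for an HTTP call; api-1 is pretended to be down.
DOWN = {"https://api-1.example.internal:6443"}


def fake_request(endpoint):
    if endpoint in DOWN:
        raise ConnectionError("connection refused")
    return f"200 OK from {endpoint}"


endpoints = [
    "https://api-1.example.internal:6443",
    "https://api-2.example.internal:6443",
]
result = call_with_failover(endpoints, fake_request)
print(result)  # 200 OK from https://api-2.example.internal:6443
```

A load balancer or multiple DNS A records moves this same retry logic out of the client, which is why those work today without any client changes.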

As I said, it's not only kubectl that has the problem. None of the services implemented by Kubernetes are HA: kubelet, proxy, and the scheduler. For a robust deployment you need these replicated.

Using DNS might work for the K8s services, but at least in version 1.2, SkyDNS was an add-on to Kubernetes. This should really be part of the deployed K8s services. Hopefully newer versions fixed that, I didn't check.

Preferably, the base K8s services implement HA natively. Deploying a separate load balancer is just a workaround for the problem.

FYI Google's Borg internal services implement HA natively. Seems to me the Kubernetes team just wanted to build something quick, and never got around to doing the right thing. But I think it's about time they do it.

I think this was true around kubernetes 1.2, but is no longer the case. etcd is natively HA. kube-apiserver is effectively stateless by virtue of storing state in etcd, so you can run multiple copies for HA. kube-scheduler & kube-controller-manager have control loops that assume they are the sole controller, so they use leader-election backed by etcd: for HA you run multiple copies and they fail-over automatically. kubelet & kube-proxy run per-node so the required HA behaviour is simply that they connect to a different apiserver in the event of failure (via load-balancer or DNS, as you prefer).
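
If the leader-election part sounds hand-wavy, here's a toy version of the idea: several copies run, but only the one holding a lease in the shared store acts, and a standby takes over once the lease expires. `LeaseStore` is an in-memory stand-in I made up for a consistent store like etcd, and the TTL and timestamps are arbitrary; the real implementation has more moving parts.

```python
class LeaseStore:
    """In-memory stand-in for a consistent store such as etcd (assumption)."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, candidate, now, ttl):
        """Grant or renew the lease if it is free, expired, or already ours."""
        if self.holder in (None, candidate) or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + ttl
            return True
        return False


store = LeaseStore()
TTL = 15.0  # seconds; arbitrary illustration value

# t=0: both schedulers start; only the first to reach the store leads.
a_leads = store.try_acquire("scheduler-a", now=0.0, ttl=TTL)
b_leads = store.try_acquire("scheduler-b", now=0.0, ttl=TTL)

# t=10: the leader renews, pushing expiry to t=25; at t=20 the standby is
# still locked out.
a_renewed = store.try_acquire("scheduler-a", now=10.0, ttl=TTL)
b_still_out = store.try_acquire("scheduler-b", now=20.0, ttl=TTL)

# scheduler-a dies and stops renewing; after t=25 the standby takes over.
b_takes_over = store.try_acquire("scheduler-b", now=26.0, ttl=TTL)
print(a_leads, b_leads, a_renewed, b_still_out, b_takes_over)
# True False True False True
```

The control loops can keep assuming they are the sole controller, because at any moment only the lease holder is actually running them.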

kube-dns is an application on k8s, so it uses scale-out and k8s services for HA, like applications do. And I agree that it is important, I don't know of any installations that don't include it.

I think the right things have been built. We do need to do a better job documenting this though!

Great, thanks for the update! I'll update my deployment towards the end of spring, hopefully that's not going to be too painful.

etcd itself cannot be horizontally scaled because of the architecture. etcd's leader model does not allow you to go beyond a certain number of nodes in a cluster. The leader would be overloaded.
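
Some back-of-the-envelope arithmetic for the leader-based point, assuming plain Raft-style majority quorums where the leader replicates each write to every follower. The function name and the cluster sizes are just mine for illustration, and this ignores batching and everything else a real etcd deployment does.

```python
def raft_cluster_costs(members):
    """Quorum size and per-write leader fan-out for a leader-based cluster."""
    quorum = members // 2 + 1
    return {
        "members": members,
        "quorum": quorum,
        "tolerated_failures": members - quorum,
        "leader_sends_per_write": members - 1,
    }


rows = [raft_cluster_costs(n) for n in (3, 5, 7, 9)]
for row in rows:
    print(row)
```

Adding members buys fault tolerance, not write throughput: every extra member is one more copy the leader has to send for each write.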

I think federation allows you to scale horizontally beyond the limits of a single etcd cluster. OTOH, the fact that zk/etcd/consul are all leader-based is probably the reason Flynn "simply" uses postgres.
