It is not the fault of the author (and I commend them for taking the trouble), but Kubernetes is genuinely complex when it comes to getting traffic into the cluster. Here's a genuine appeal from someone who has suffered through this before.
In Kubernetes, you have multiple types of ingresses, and you invariably need a combination of them to get anything done correctly. For example, nginx-ingress will lose source-IP info and does not do L4 proxying very well. Which is why the unspoken point of all Kubernetes deployments is "use ELB or GLB with proxy protocol and call it a day".
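For what it's worth, the "ELB with proxy protocol" approach is usually just a Service annotation. A rough sketch (service and selector names are made up, but the annotation is the real AWS cloud-provider one):

```yaml
# Expose the ingress controller through an AWS ELB that speaks
# proxy protocol, so the client source IP survives the hop.
apiVersion: v1
kind: Service
metadata:
  name: ingress-lb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
spec:
  type: LoadBalancer
  selector:
    app: nginx-ingress
  ports:
  - port: 443
    targetPort: 443
```

The ingress controller behind it then has to be configured to accept the PROXY header, or you're back to losing the source IP.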
I get why - K8s is trying to solve the most complex problems first and trickle the solutions down. Which means that, for the time being, ingress is a very hard problem to solve in k8s.
The second place where everyone will get stuck is Calico vs Weave vs Flannel vs whatever. Most tutorials say "use any one of them and let's move on". Unfortunately, it's not that simple - choose your network plugin wisely.
After spending a lot of time in k8s, I set up a Docker Swarm cluster in about 3 hours. It does not support some of the more complex use cases, but it does three things super well - networking, ingress and secrets ... and for most beginners, that's all they need.
P.S. I have the exact same stack running on Kubernetes and Docker Swarm.
Docker Compose file format 3.1 is brilliant - other than my Dockerfiles, I need a 30-line yml file to deploy my whole stack across 3 machines. Kubernetes, on the other hand, needs at least 12 yml files ("deployments" and "services"). I'm not counting another 4 yml files for StatefulSets and PersistentVolumeClaims, since there is nothing equivalent in Docker Swarm.
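For readers who haven't seen it, the whole stack file really is that small. A sketch of what a format-3.1 file looks like (service names and images are illustrative):

```yaml
# docker-compose.yml for `docker stack deploy` -- the `deploy` and
# top-level `secrets` keys are what the 3.x format adds for Swarm.
version: "3.1"
services:
  web:
    image: myapp:latest
    ports:
      - "80:80"
    deploy:
      replicas: 3
      restart_policy:
        condition: on-failure
  redis:
    image: redis:alpine
secrets:
  db_password:
    file: ./db_password.txt
```

That one file covers scheduling, networking and secrets across the whole Swarm.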
You don't need to split your manifests into multiple files. You can use a list:
apiVersion: v1
kind: List
items:
- apiVersion: extensions/v1beta1
  ...
- apiVersion: v1
  ...
- apiVersion: v1
  ...
Re ingress, I agree that it's probably the weakest part of Kubernetes right now. It's particularly weak when it comes to internal load balancing. When you have lots of services that should only be available internally, you'll want an internal ingress; I don't know about AWS, but Google Cloud Platform's internal load balancer doesn't support pointing at Kubernetes. I haven't found a better option than to run Traefik as a DaemonSet and rely on round-robin DNS (aka poor man's HA).
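The DaemonSet approach mentioned above looks roughly like this (a sketch; image, labels and ports are illustrative, and the apiVersion matches the k8s release of that era):

```yaml
# Run one Traefik per node with a hostPort, then point round-robin
# DNS at every node's IP for "poor man's HA".
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: traefik-ingress
spec:
  template:
    metadata:
      labels:
        app: traefik-ingress
    spec:
      containers:
      - name: traefik
        image: traefik
        args: ["--kubernetes"]
        ports:
        - containerPort: 80
          hostPort: 80
```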
(Google now has an L3 internal load balancer, but not an L7 LB. It's fair to say that this will be made much easier this year.)
You could try linkerd. It is built with precisely this use case in mind.
Also, it was not just a question of the number of files. My docker-compose.yml is only about 30-40 lines.
Helm "charts" are templates that generate Kubernetes manifests. To install a chart, you provide values (for which there are defaults). For example, here is the chart for PostgreSQL. The default values are in values.yaml, and the templates for each manifest are in the templates folder.
Conceptually, this is even cleaner than Docker Compose, because there's zero data that isn't specific to your install: everything else falls back to a predefined default.
So in the same way that the official PostgreSQL Docker image is general-purpose, you can define a completely general-purpose chart that can be used by anyone. You customize it by providing overrides.
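Concretely, the override file only carries your deltas; something like this (key names may differ from the chart's actual values.yaml, so treat this as a sketch):

```yaml
# my-values.yaml -- only the settings that differ from the chart defaults
postgresUser: myapp
postgresDatabase: myapp_production
persistence:
  size: 50Gi
```

and you'd install it with something like `helm install stable/postgresql -f my-values.yaml`.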
The only downside is that you still end up with Kubernetes manifests. It's not an abstraction, it's an automation tool.
If you find any issues or have questions, please file GitHub issues.
I really like how you guys decided to do your own third-party manifests instead of using annotations.
I'm going to play more with this :-)
please please submit this to kubernetes-incubator
> The first thing I do when I look at kubernetes tutorials is look at ingress or load-balancing... and usually I find it absent.
Yup! I dedicate a few minutes to this in the grab lesson because of how important it is.
> Docker Compose file format 3.1 is brilliant - other than my Dockerfiles, I need a 30-line yml file to deploy my whole stack across 3 machines. Kubernetes, on the other hand, needs at least 12 yml files ("deployments" and "services"). I'm not counting another 4 yml files for StatefulSets and PersistentVolumeClaims, since there is nothing equivalent in Docker Swarm.
Interesting point. You may like helm (https://helm.sh) as a tool for deploying Kubernetes applications.
You should check out the #kompose channel on Slack.
What's nice about ECS is that it lets you deploy using Docker Compose files (though you often need to make minor modifications for ECS).
ecs-cli compose --file ecs-compose.yml create
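The "minor modifications" are usually things like a hard memory limit and the awslogs driver. A hypothetical ecs-compose.yml (all values illustrative, and the exact keys ecs-cli honours depend on its compose-format support):

```yaml
# ecs-compose.yml -- a plain compose file plus the ECS-flavoured bits
version: "2"
services:
  web:
    image: myapp:latest
    mem_limit: 268435456   # bytes; ECS wants an explicit memory limit
    ports:
      - "80:80"
    logging:
      driver: awslogs
      options:
        awslogs-group: myapp
        awslogs-region: us-east-1
```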
Either way, your client has to know what pod (group of Docker containers running your app) to send traffic to. The pods run on hosts and ports that aren't fixed in time. So if you want some kind of client-resolving load balancing without involving a central bottleneck on the server, you'd have to transport that inventory of pod-host-ports to the client, and then either provide a host-based proxy or open the ports on the nodes themselves. And of course you'd have to build the balancing/failover logic into the client (which would either have to be randomly distributed, or rely on resource utilization info fetched from Kubernetes).
That may be great for some narrow use cases, but for most applications, the load balancing pattern is much simpler.
Your load balancer definitely shouldn't be a bottleneck - you can deploy multiple LBs to handle the traffic and provide redundancy, though you need to use other methods to balance between them (DNSRR, anycast, etc).
Kubernetes pretty much mandates an LB, because it offloads the ingress job to a machine that can talk to the outside world and the VPC simultaneously. Otherwise you get into complex hostPort/nodePort/hostNetwork semantics... or use an ingress.
In the k8s world, there aren't many simpler alternatives.
I'm currently working towards a k8s-on-metal system using nginx for ingress, and was led to believe that I could use either X-Forwarded-For or proxy_protocol to preserve the originating IP address. Is the traffic bounced through an SNAT rule on its way to the ingress controller or similar?
It can read/pass-through proxy protocol headers. So if you have haproxy in front of nginx, it will work fine (or ELB/GLB).
Which is why there's another beta ingress - the "keepalived-vip" repo in k8s ingress. That said, not a lot of people seem to be using haproxy, primarily because haproxy cannot do a zero-downtime reload.
A couple of weeks back on k8s Slack, I was discussing using GitHub's multibinder to solve that one problem and move ahead with an haproxy ingress.
Protip: please join the #sig-onpremise channel on Slack. There's not a lot of momentum for bare-metal deploys otherwise.
For reference, the feature was added in nginx 1.9.2, and the ingress-nginx controller appears to use nginx 1.11.9: https://github.com/kubernetes/ingress/blob/0.9.1/images/ngin...
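To make the two sides concrete: receiving the PROXY header is done in the http context, and sending it onward (the part added in nginx 1.9.2) lives in the stream module. Rough fragments, not a full nginx.conf; addresses are made up:

```nginx
http {
    server {
        # Receiving side: accept the PROXY header from a trusted LB
        # and recover the real client address.
        listen 443 proxy_protocol;
        set_real_ip_from 10.0.0.0/8;      # the LB's address range
        real_ip_header  proxy_protocol;
    }
}

stream {
    # Sending side: prepend the PROXY header when chaining to a
    # downstream proxy such as haproxy.
    upstream backend_proxy {
        server 10.0.0.5:443;
    }
    server {
        listen 8443;
        proxy_pass backend_proxy;
        proxy_protocol on;
    }
}
```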
Regardless, would I be right in assuming that your issues are mostly related to the ingress of TCP streams? My use case is almost exclusively HTTP, so I might be missing the worst of it.
Thanks for the pointer to the #sig-onpremise channel, I'll be sure to check it out. Been meaning to look into the SIGs since reading https://coreos.com/blog/self-hosted-kubernetes.html
If you are using the nginx ingress, you need to move all of your existing config into their config file format. Since setting up full TLS internally in Kubernetes is buggy, I didn't want to terminate my traffic on the ingress. Also, I really did not want to move my (fairly complex) nginx configuration to the ingress.
Which is why I'm a big believer in haproxy ingress - it was designed to interface with nginx.
However, you are already one of the "enlightened" ones: you actually know what an ingress is. A significant amount of k8s Slack traffic is "how do I set up a load balancer / how do I make the cluster available to the outside world".
I still believe that there is NO better tool for using Kubernetes than kops. I now use the 100.64.0.0/10 subnet that you guys figured out ("carrier-grade NAT", seriously?) in Docker Swarm.
But genuine question - can you show me how? Because I honestly went crawling through the nginx codebase and could only find client-side handling of the proxy protocol.
I found NO place showing how it could be injected to chain reverse proxies together (which is what it was invented for in the first place).
P.S. just a quick google reveals this link which sort of confirms my suspicion - http://blog.haproxy.com/haproxy/proxy-protocol/
This commit should show the difference:
This is the nginx function AFAICT: https://github.com/nginx/nginx/blob/b580770f3afaeec48a15cb8c...
Looking at that though, maybe it only works with SSL passthrough... but that is the typical use-case for using proxy protocol instead of X-Forwarded-For
My understanding is that it makes the system deal automatically with failing containers, though docker-compose already does that for us. Load balancing is another strength, though that is currently not an issue for us. It might automate certain container-managing scripts, but that is less of an issue right now.
In other words, I currently don't see an actual benefit of k8s over even docker-compose (not docker-swarm, which I understand is the direct k8s alternative). Anyone able to elaborate?
* Namespaces, to group objects and prevent name clashes. Ideal when many devs run their own copy of the same distributed app from the same manifests, with the same names.
* ConfigMaps, to share common configs and project them as files into the containers (and keep them updated in the container, etc.), and to provide the configuration separately.
* Automatic (EBS, GCS, NFS...) volume attachment, i.e. to the node hosting your database, and re-attachment elsewhere if the node fails and the database is re-scheduled on another host. Automated database HA/failover is very nice.
* Ingress manifests, to version the HTTP routing with your app. Though this is still very young and incomplete.
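To illustrate the ConfigMap point above, a minimal sketch of a config projected into a container as a file (all names invented):

```yaml
# A ConfigMap plus a pod that mounts it at /etc/app/app.properties.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  app.properties: |
    log_level=info
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: myapp:latest
    volumeMounts:
    - name: config
      mountPath: /etc/app
  volumes:
  - name: config
    configMap:
      name: app-config
```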
But yes, k8s is much more complex than Compose (and Swarm), so if the latter works for you, changing is probably overkill.
In other words, change the terminology so that K8s appears to be a good choice even if you are just starting (this is the marketing part). Granted it will seem too complex for most apps, but how about a community effort, perhaps backed by a company, that essentially wraps most stacks (say Rails and PG, Django and Mysql, perhaps a caching layer, nginx, and so on) in a K8s module so that devs can start coding and be oblivious to changing loads as their apps grow? In this scenario, K8s will just feel natural, and will be something you just have to learn as part of your journey (Git was like this for a lot of developers, but I concede the analogy is not that great).
Even if most devs reject this as being too complex, the few who pick it will be an invaluable feedback source.
Do you currently have more than one server? It sounds like you don't.
What would you currently do if your celery workers couldn't keep up and you needed to add additional machines to handle the load?
1. If you can use a simple deployment system based on 'git push' (such as Heroku), you should just go ahead and do that until you outgrow it. This is easy to test, deploy and maintain.
2. If your project is tricky to build, go ahead and set up a Dockerfile with a correctly configured environment. You can still deploy this using Heroku, Google Container Engine, Amazon ECS or many other cloud providers. Choose something simple. This is harder than something like Heroku, but it's mostly a one-time learning cost.
3. If you need multiple containers, then set up a continuous integration system that automatically builds and tests your containers on every 'git push', and which allows you to deploy a service with a single click. You'll also want a separate staging environment. At this point, you're probably going to need to devote at least 10 hours/week of engineering time to keeping everything working. System-wide integration testing will get more complex.
Again, you can stay on a managed cloud provider like Amazon's ECS or Google Container Engine for a long time before you need to set up and manage your own Kubernetes cluster. But if you find that you're writing lots of scripts to manage the containers on your clusters, it's probably time to look at a standard solution like Kubernetes.
TL;DR: Stay with the easiest deployment solution, and only add complexity when you need it.
If you're simply bootstrapping a single app on a single vm, then fine, have at it.
* As soon as you go down the route of making that reproducible, it can just as easily be done with a Dockerfile as with a bash or Ansible (or any other configuration management) script.
* Then you want failover, or to have it active on another server? Okay, you can just run your Ansible against another box, maybe you put it in cloud-init and an autoscaling group, or maybe you have something like Kubernetes take care of that.
* You want health checking? Sure, have it configure nagios or something similar, or have an ELB check an endpoint, or have Kubernetes do it.
* Want some storage? Let me add a PVC, or I can play about with managing EBS volumes and other block storage myself.
Once you start digging a bit deeper, and realising that many of the things your apps will probably want you either need to build yourself or will lock you into AWS, going to Google and clicking the GKE button doesn't seem a terrible prospect. There are other ways to do things, but once you've learnt this way, you can reuse it almost anywhere.
Our industry is really faddish, so I don't blame you for being skeptical, but the second you start hosting 3-4 apps, I'd rather have just bootstrapped Kubernetes (or gone to a managed service provider) and have all the primitives available to me.
It's great that technology x lets you build something in 1 month, but if it takes 6 months to find someone willing to do it, then you could have written it faster in almost any other technology.
Ban for: Databases.
Certainly approach with caution, but there's no reason for a blanket dismissal of dbs in containers.
1. Know what it takes to run a database (including storage, backup, upgrade, lifecycle, failure modes)
2. Know how a containerized cluster manager manages process, storage, lifecycle, and failure modes
If you know both of those, running databases on Kubernetes (can't speak for swarm or mesos) is not hard or surprising, and you can get the benefits of both. If you don't know both of those in detail, you don't have any business running databases in containers.
The intersection of folks who know both is still small. And the risk of problems when you understand only one is still high.
Are you speaking from experience when you say it is not hard? Could you elaborate on which databases you are currently running on Kubernetes and how they are configured? Also, are these in production?
If I know number 1 and number 2, does that mean I automatically understand all of the potential failure modes I might experience from combining 1 and 2? I certainly wouldn't think so.
Your point about 1/2 is fair; I was trying to convey that Kube follows certain rules w.r.t. process termination, storage, and safety that can be relied on once you internalize them. What's lacking today is a single doc that walks people through the tradeoffs and is easily approachable (although the stateful set docs do a pretty good job of it). In addition, we've made an increasing effort to ensure that behavior is predictable (why StatefulSets exist, and the changes in 1.5 to ensure terminating pods remain even if the node goes down).
Storage continues to be the most important part of stateful apps in general. On AWS/GCE/Azure you get safe semantics for fencing storage (as long as you don't bend the rules). On metal you'll need a lot more care - the variety of NAS storage comes with lots of tradeoffs, and safe use assumes a level of sophistication that I wouldn't expect unless folks have made an investment in storage infrastructure. I expect that to continue to improve, with things like Ceph and Gluster direct integration, VMware storage, and NetApp / other serious NFS integration.
And it's always possible to treat nodes like pets on the cloud and leverage their local storage if you have good backups - at scale that can be fairly effective, but when doing one-off DBs using RDS and Aurora and others is hard to beat.
But containerizing a workload isn't the same thing as handing it off to a cluster scheduler to manage. Google hasn't been running databases via K8s for nearly a decade; who knows how Borg handles volume management internally at Google. I realize K8s has foundations in Borg, but it's still not apples to apples, I don't think.
GFS (now colossus) is not mounted as a legacy volume, but instead is accessed via a userspace library.
They use internal proprietary technology that doesn't have the same characteristics and flaws as Docker.
Seems to work pretty well. DCOS has lots of database options.
Kubernetes ensures that the container always has this volume mounted, and of course only one container at a time can claim the volume for itself.
What you should avoid doing is using a host mount and pinning a pod to a specific node, because then that pod can only run on that node, and you have no way of migrating without manually moving the mount and unpinning the pod. With Kubernetes, you really want to avoid thinking about nodes at all. State follows pods around; pods don't follow state around.
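The "state follows pods" pattern is just a claim plus a pod that mounts it; wherever the scheduler places the pod, the volume is attached there. A sketch (names, image and sizes are illustrative):

```yaml
# A PersistentVolumeClaim and a pod that uses it. No node names anywhere:
# the claim is bound to backing storage, and the storage follows the pod.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  containers:
  - name: postgres
    image: postgres:9.6
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: db-data
```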
Spanner uses cross-datacenter Paxos. Your data won't be lost even if an entire datacenter goes dark.
For Vitess (http://vitess.io), we use semi-sync replication that always ensures that at least one other machine has the data.
StatefulSets are more geared towards apps that manage their own redundancy, such as Cassandra or Aerospike, where adding another instance is a matter of just starting it. One of the things a StatefulSet provides is a stable network identity for each pod. For example, if you wanted to deploy Cassandra without StatefulSets, you'd deploy each instance as a separate Deployment + Service pair, called, say, cassandra-1, cassandra-2 and so on. You would not be able to use Kubernetes' toolset to scale the cluster. Each instance would use a persistent volume, so effectively it would be almost exactly like a StatefulSet, except Kubernetes would not be handling the pod replication.
In the case of something like Postgres, you'd probably not get any benefit from using a StatefulSet for the master (since only one instance can run), but you can use a StatefulSet to run read-only replicas.
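For reference, a minimal StatefulSet sketch for the Cassandra case described above (apiVersion matches the k8s 1.5 era; values are illustrative). The pods get stable names (cassandra-0, cassandra-1, ...), and the volumeClaimTemplates stanza stamps out one claim per replica:

```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: cassandra:3
        volumeMounts:
        - name: data
          mountPath: /var/lib/cassandra
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```

Scaling the cluster is then `kubectl scale` instead of hand-rolling another Deployment + Service pair.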
StatefulSets guarantee "at most one".
Just because you were lucky to not experience massive issues doesn't mean they aren't present.
In fact, running the Postgres instances in isolation has given us far more confidence than if they were run "natively". Backing up Docker instances is trivially easy in comparison to native instances, as you already know which data volumes you need to back up. All our instances use exactly the same backup and restoration script. All our instances get rolled into staging using the same script on a daily basis. No failures so far. Zero.
I would be interested in actual, up-to-date reasons, other than "Docker's engineering department is not dependable" - which, btw, I can empathize with if you were burned in the past.
The typical dev who thinks running on a beta version of Ubuntu is the norm and calls anything else "outdated".
Yes, docker may be up to your standards.
No, Docker is not up to the standards of real businesses, who use stable OSes and sometimes even pay for support.
There are a lot of moving parts in there, and some of the defaults of common install methods like kubeadm might be a bit of a surprise to people (e.g. the kubelet port being open by default and allowing someone to take complete control of the cluster without authentication: https://raesene.github.io/blog/2016/10/08/Kubernetes-From-Co...).
Ideally something which broke out the various components and had guidelines for possible security options would be a great addition, I think.
For example, to start:
* Where do you store the YAML manifests, and how do you maintain them? Do you put them in the same git repo as your app, and if so, how do you deal with the fact that the YAML files are going to be different for production, staging, testing/QA, etc.? (For example, ingresses will use other host names. Configs are likely to be rather different altogether.)
* Or if you centralize them in a single git repo, you have to make sure that you always pull the newest version, and that your workflow includes diffing against the currently deployed version, and so on.
* How do you protect configs/secrets, if they're in git and available to all?
* Dependency tracking: If app A needs app B, you want to encapsulate that dependency in your workflow.
* Continuous delivery: How do you take care of these concerns in relation to the CD system (e.g. Drone, Jenkins)?
* And of course, you'll want to be able to develop locally. If you run Minikube, how do you interact with it, YAML-wise?
There's Helm, but Helm is essentially just a "templates in a central repo" manager. It doesn't solve the configuration issue: You still need to provide values to the charts.
It doesn't sit right to have the YAML files in the project's git repo. Config follows environment, not project.
For our current, non-Kubernetes production setup, we have a tool called Monkey that allows people to do things like "monkey deploy staging:myapp -r mybranch" to deploy a branch, or "monkey sql prod:myapp" to get a PostgreSQL shell against the production database for that app, with lots of nice commands to work with the cluster. They also work directly with the Vagrant VM that developers run locally: "monkey deploy dev:myapp --dev" actually deploys to the VM, pointing the directory to the user's own project (on the host machine), so that the files are the same. Combined with a React/Redux app, they get hot code reloading and so on without having to do anything special. This shields developers from the complexities and realities of a Linux server in a very nice, convenient way. We want to provide the same on top of Kubernetes.
Right now I'm debating whether to just go for "simple and stupid" and have a central repo, and then build a small toolset to wrap kubectl for devs. By running Minikube locally, devs would be able to use the same tool locally. The command-line tool could do things like always pull the repo to ensure you're working with the newest files. The main challenge there is organizing the YAML files in a clean way. Do we use templates to reduce boilerplate? Or do we rely on just copy-pasting stuff? It's not quite clear to me.
Helm seems nice, but it's another moving part to add to the mix. I'm leaning towards a simpler approach. I've looked into Deis Workflow, but it's actually a full-blown PaaS, and seems a bit much.
> * Where do you store the YAML manifests, and how do you maintain them? Do you put them in the same git repo as your app, and if so, how do you deal with the fact that the YAML files are going to be different for production, staging, testing/QA, etc.? (For example, ingresses will use other host names. Configs are likely to be rather different altogether.)
We store the manifests in the same git repo as the app. Docker images and YAML files are all the same, what's different are the values, so we use helm and override the values per environment.
The values are split 4 ways:
- default sensible values (values.yaml). For example, feature flags.
- datacenter specific (gcp.yaml, aws.yaml). For example, REDIS_HOST and MYSQL_HOST.
- environment values (env-production.yaml or env-staging.yaml). For example, the ingress values and its certs, how many replicas for each service
- secrets.yaml (not stored in git, but generated on the fly, will explain more later). For example, MYSQL_PASSWORD
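For illustration, the environment layer in that split might look something like this (keys invented; the real ones depend on your templates):

```yaml
# env-production.yaml -- layered on top of values.yaml and gcp.yaml/aws.yaml
ingress:
  host: app.example.com
  tls:
    secretName: app-example-com-cert
replicas:
  web: 4
  worker: 8
```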
I mean, if you're doing your services right, they should all use the same Docker images across environments; only the config changes. Helm has pretty good templating and config helpers.
> * Or if you centralize them in a single git repo, you have to make sure that you always pull the newest version, and that your workflow includes diffing against the currently deployed version, and so on.
Helm takes care of this. When you upgrade a deployment, it looks like it does a diff. What's cool, though, is that it seems to diff everything except replicas, which is exactly what we needed.
> * How do you protect configs/secrets, if they're in git and available to all?
I don't mind having the config in git; I do when it comes to secrets. We store secrets as GitLab variables, and our CI dynamically creates a "secrets.yaml" before deploying. I'm still not really happy about this, though; I think a better way would be to use Vault, but it adds some complexities that I'm not really ready to deal with yet.
> * Dependency tracking: If app A needs app B, you want to encapsulate that dependency in your workflow.
I'm not too sure I understand what you mean by this one. We, thankfully, merged all our git repos into one monolithic repository. I remember at first reading about the big boys (FB, MS, Google) having a single repository and thinking they were crazy. Then we added more and more services, and suddenly I realized that having a monorepo makes things a lot easier for developers and ops. AppA~1.0 needs AppB~2.5: we used to have a crazy dependency graph like that. Not anymore. Now every service has the same version - we're simply using git hashes. AppA~4c20d needs AppB~4c20d. Every push to the repo rebuilds the Docker images and tags them like this: AppA:master-4c20d and AppB:master-4c20d. And yes, we build for all branches and all commits.
So for deploying, it's super easy. I just deploy everything at once, at all time. If I need to rollback, I can rollback to a specific version on all services. If I need to test in a real environment a specific version, all I do is add --set BRANCH=fix-bug-branch,COMMIT=4c20d
> * Continuous delivery: How do you take care of these concerns in relation to the CD system (e.g. Drone, Jenkins)?
There are no concerns here, since every code push starts our pipeline, and the pipeline rebuilds all the Docker images and runs the tests between them. It's pretty dope.
> * And of course, you'll want to be able to develop locally. If you run Minikube, how do interact with it, YAML-wise?
We have a local bare-metal server running single-node k8s, but to be honest, we've never had a dev needing to develop on k8s specifically. Each dev that builds a service is also tasked with writing the Dockerfile that goes with it and, with assistance, the YAML file that goes with it. It's pretty straightforward. Our GitLab auto-deploys to dev, staging and review environments, so if there's invalid YAML, the pipeline will catch it.
> There's Helm, but Helm is essentially just a "templates in a central repo" manager. It doesn't solve the configuration issue: You still need to provide values to the charts.
I think it does but not directly. It solves it because it allows you to use templates and override config in a sane way.
Here's an extract of our .gitlab-ci file for deploying to production
- echo "Deploying to Kubernetes Env=production Branch=$CI_BUILD_REF_NAME Version=$CI_BUILD_REF"
- "echo MYSQL_PASSWORD: $MYSQL_PASSWORD_PRODUCTION > tmp.yaml"
- "echo MYSQL_USERNAME: $MYSQL_USERNAME_PRODUCTION >> tmp.yaml"
- ... more secrets echo config here
- helm upgrade master ./manifests/App -f ./manifests/gcp.yaml -f ./manifests/env-production.yaml -f tmp.yaml --set BRANCH=master,COMMIT=$CI_BUILD_REF
> Right now I'm debating whether to just go for "simple and stupid" and have a central repo, and then build a small toolset to wrap kubectl for devs [...]
We started by building our own toolset to wrap kubectl. It worked. Then we found Helm. I remember at first looking at it and thinking it wasn't useful for us... now it looks pretty good. We heavily use templates to reduce boilerplate.
I'm curious about your tool called Monkey. Is it a frontend for Chef or Ansible? Also, I can pretty much ask all the same questions you've asked, but geared towards your tool (where do you store manifests, how do you protect configs/secrets, how do you handle dependencies, continuous delivery, etc.).
To address some of your responses:
Monorepo: I see the benefits, but there are also some big downsides to this approach. It's problematic to have to sync code in/out of a monorepo when you have lots of open source projects. You'll also end up with much more frequently needing to pull and deal with upstream changes completely unrelated to your stuff; and git commands like "git log" now need to be invoked with "git log ." to avoid getting a swathe of unrelated commits outside the app you're working on. And so on. Not about to go down that road.
Regarding dependencies, I'm talking about a development environment where you'd want to run a subset of the apps needed to work on the stack. (Running all of our stuff at once on one machine would make for a _really_ heavy VM.) We rely on each developer having a VM controlled with Vagrant. If you want to work on app A, which depends on app B, then a developer shouldn't need to read the readme to figure that out; they should be able to just deploy A, and have B be included automatically. (This is a lot more trivial than the other challenges, and can be solved with annotations.)
Lastly, about Monkey: It is simply a small tool that controls our VMs via SSH. The VMs are already statically configured with Puppet, so Monkey knows what app should be deployed where (it can ask PuppetDB), and it just runs commands to do things like "npm install" and "go build". Some of that stuff is read from Puppet, some is declared as part of Monkey's config. But it's entirely manual. This is the system we want to move away from, of course, so it's not a template of how things should be done.
But the point about Monkey is that it's a convenience tool that glosses over the gritty details of interacting with a cluster. A developer doesn't need to know what's going on behind the scenes of a deploy command. I want to achieve the same thing with Kubernetes.
We're not at the point where we want to do continuous delivery, but I'd like to use Drone to perform the Docker build, as it can also run tests at the same time. This is iffy, since a developer would need a way to wait for Drone to push the final Docker image — unless builds are themselves manual, which is another option. Yet another option is to reserve a specific branch (e.g. "prod") for testing.