The Horrors of Upgrading Etcd Beneath Kubernetes (gravitational.com)
179 points by twakefield 45 days ago | 76 comments

Hey thanks for the article. I know etcd upgrades can look complex but upgrading distributed databases live is always going to be quite non-trivial.

That said, for many people taking some downtime on their Kube API server isn't the end of the world. The system, by design, can work OK for some time in a degraded state: workloads keep running.

A few things that I do want to try to clarify:

1) The strict documented upgrade path for etcd is because testing matrixes just get too complicated. There aren't really technical limitations as much as wanting to ensure recommendations are made based on things that have been tested. The documentation is all here: https://github.com/coreos/etcd/tree/master/Documentation/upg...

2) Live etcd v2 API -> etcd v3 API migration for Kubernetes was never a priority for the etcd team at CoreOS because we never shipped a supported product that used Kube + etcd v2. Several community members volunteered to make it better but it never really came together. We feel bad about the mess but it is a consequence of not having that itch to scratch as they say.

3) Several contributors, notably Joe Betz of Google, have been working to keep older minor versions of etcd patched. For example 3.1.17 was released 30 days ago and the first release of 3.1 was 1.5 years ago. These longer lived branches intend to be bug fix only.

> upgrading distributed databases live is always going to be quite non-trivial.

I understand that the implementation of live upgrades for a distributed database will be complex but this post is about the user experience. Given enough resources, is there a reason that it can't be a single "upgrade now" command? Or maybe slightly more real-world, a 3 step process like: "stage update" -> "test update" -> "start update".

That is the process: https://github.com/coreos/etcd/blob/master/Documentation/upg...

  for i in members
    Replace etcd binary
    Restart etcd process
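A slightly fuller sketch of that loop, in Python. This is not etcd tooling: `replace_binary`, `restart`, and `is_healthy` are hypothetical hooks the operator would have to supply (say, wrappers around a package manager, a service manager, and a health probe). It only illustrates the one-member-at-a-time shape, with a health gate before moving on to the next member.

```python
import time
from typing import Callable, Iterable, List


def rolling_upgrade(
    members: Iterable[str],
    replace_binary: Callable[[str], None],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    retries: int = 5,
    delay_s: float = 0.0,
) -> List[str]:
    """Upgrade members one at a time; stop at the first one that stays unhealthy.

    The three callables are hypothetical hooks supplied by the caller.
    Returns the members that were upgraded successfully, in order.
    """
    upgraded: List[str] = []
    for member in members:
        replace_binary(member)
        restart(member)
        # Wait for this member to report healthy before touching the next one,
        # so this loop never takes down more than one member at a time.
        for _ in range(retries):
            if is_healthy(member):
                break
            time.sleep(delay_s)
        else:
            raise RuntimeError(
                f"{member} did not become healthy; upgraded so far: {upgraded}"
            )
        upgraded.append(member)
    return upgraded


if __name__ == "__main__":
    # Dry run with stub hooks and made-up member names.
    log = []
    done = rolling_upgrade(
        ["etcd-0", "etcd-1", "etcd-2"],
        replace_binary=lambda m: log.append(("replace", m)),
        restart=lambda m: log.append(("restart", m)),
        is_healthy=lambda m: True,
    )
    print(done)
```

Stopping on the first unhealthy member is a deliberate choice here: with one member down a three-node cluster still has quorum, so halting leaves room to investigate.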

I agree with you, tooling in the k8s world is still lacking a lot in UX.

Awesome, thanks for the clarifying items.

When going through this, I did test it and saw that the upgrade appeared to work until, I think, 3.3, where it would panic. But I didn't want to rely on the undefined / untested behaviour, even if it seemed to work in a lab. The interview was long, so I think this aspect got lost as we were cutting it down.

And thanks for the hard work on etcd.

I don't know about you, but my application is tested on a single platform/stack with a specific set of operations. When the operation of the thing I'm running on changes, my application has changed. It just can't be expected to run the same way. Upgrade means your app is going to work differently.

Not only is the app now different, but the upgrade itself is going to be dangerous. The idea that you can just "upgrade a running cluster" is a bit like saying you can "perform maintenance on a rolling car". It is physically possible. It is also a terrible idea.

You can do some maintenance while the car is running. Mainly things that are inside the car that don't affect its operation or safety. But if you want to make significant changes, you should probably stop the thing in order to make the change. If you're in the 1994 film Speed and you literally can't stop the vehicle, you do the next best thing: get another bus running alongside the first bus and move people over. Just, uh, be careful of flat tires. (https://www.youtube.com/watch?v=rxQI2vBCDHo)

Picking a dependency that does not have a live method of updating is a terrible idea for a software system.

Physical systems, like a car, have obvious limitations on what can be modified when. Similarly, software will have some limitations on what happens when you are updating. But accepting "upgrades can't be done easily" for software is putting many more limitations on the software than makes sense.

With the exception of like, binary patching of executables that use versioned symbols or some craziness like that, virtually all software cannot be upgraded while it is running and expected not to produce errors.

I mean, if you use a plug-in style system, you can program it to block operations while a module is reloaded or something. But most software is not designed this way. Especially with ancient monolithic models like Go programs.

Upgrades just can't be done easily in a complex system. You can do them without concern for their consequences, but that doesn't mean they're safe or reliable methods.

It really depends on what your scenario/usage is. Sometimes it's a terrible idea: you wouldn't do a live code update on a server handling your blog. Sometimes it's just what you're aiming for: you have expensive hardware plugged into physical lines, there's not enough capacity to migrate data flow uninterrupted, you have to update in place without clients having more than X ms delay. Real life is virtually always the first case... But if you really need it, tech like Java hotswap or Erlang hot code swap is there.

This may be an unpopular opinion, but I’m not a big fan of containers and K8S.

If your app needs a container to run properly, it’s already a mess.

While what K8s has done for containers is freaking impressive, to me it does not make a lot of sense unless you run your own bare metal servers. Even then, the complexity it adds may not be worth it. Did I mention that the tech is not mature enough to just run on autopilot and now instead of worrying about the “devops” for your app/service you are playing catch-up with upgrading your K8s cluster?

If you’re in the cloud, VMs + autoscaling or fully managed services (eg S3, lambda, etc) make more sense and allow you to focus on your app. Yes there is lock-in. Yes, if not properly architected it can be a mess.

I wish we lived in a world where people pick simple over complex and think long term vs chasing the latest hotness.

This is such nonsense, not to mention hilariously hypocritical, since you criticise "chasing the latest hotness" but then advocate for a Serverless architecture.

Anyway go and try and build a typical application with Lambda/SAM. It is a nightmare of complexity and all you are doing is moving your logic to AWS where you pay 100x the cost of just running it yourself in a container.

And the idea that Kubernetes isn't mature is pretty laughable. It's used every day by Netflix, eBay, Apple, Microsoft, IBM, Lyft, Uber, Square, Google, Pinterest, Stripe, Airbnb, Yahoo, Salesforce etc. And with AWS you have EKS which allows you to run containers in a HA and managed way.

If you are not married to AWS, try GKE.

EKS is still... early. I'd rather manage myself with Kops or Terraform in its current stage. And in fact, that's what we do at my company.

We tried using EKS. Unfortunately, HorizontalPodAutoscalers are configured to use metrics-server and it is not currently possible to run metrics-server on EKS. Had to replace it with a kops + terraform setup.

No project with 759 confirmed bugs on GitHub should be considered mature, in my opinion.

This is a nonsensical metric and you know it. How many of these are production impacting? For me, that just reflects a lot of activity. Take a look at how many get resolved.

I manage a few hundred worker nodes, they are heavily loaded at all times but happily chugging along. My pager is silent and has been for a while now. Last k8s issue we saw was actually our own fault, due to a misconfiguration.

I guess it is a little silly. But it remains a complicated project under heavy development.

I'm the kind of guy who enjoys Python 2.7 because it's deprecated, it's done changing :)

I try to remember the maxim: happiness is a boring stack.

Ha. EKS is a joke. I have written services powered by lambda. As a matter of fact I’ve used AWS/GCP/Azure. I mentioned lambda as an example of a fully managed service - it was not to “endorse” serverless.

Also if you look at a container long enough and squint you will see the words serverless emerge.

I also had the “pleasure” of working with K8S. If anything, this looks like a pretty good play from Google to get you to eventually run your workloads in GCP.

> If you’re in the cloud, VMs + autoscaling or fully managed services (eg S3, lambda, etc) make more sense and allow you to focus on your app. Yes there is lock-in. Yes, if not properly architected it can be a mess.

I've just rolled off a project (line of business web app + cluster of workers for background job processing), this more or less describes what we have running.

Some of the architecture is a bit wrong (it wasn't designed for AWS but shifted there after it had been running for a year or so) but the system works well enough to deliver value to the business.

As a developer I hate state, things that aren't properly isolated, ill-defined system boundaries -- but it's not obvious to me what the business case would be to containerise everything.

> it's not obvious to me what the business case would be to containerise everything

Containers allow you to move apps trivially between environments and guarantee that they will just work. It allows you to isolate dependencies between apps e.g. Python 2 versus Python 3. It allows you to move apps between cloud providers or between on premise and cloud. With platforms like Kubernetes it allows you to easily scale and self heal when nodes die.

And compared to rewriting your app in Lambda which is expensive and complex it is simple to build a container as almost every language has automated tooling.

> Containers allow you to move apps trivially between environments and guarantee that they will just work.

Assuming your staff can keep your cluster un-screwed, am I right?

Technologies like vmware (which also give you OS independence, not just linux flavors), if we're talking on premises, also allow you to move apps between environments. It's trivial to, gasp, push a machine out of vmware player to the cloud even. Pick your poison: vmware, virtualbox, azure, aws, gcp.

I'll run kubernetes as provided by GCP or AWS for my clients if it's warranted (ohhh, you wanted that in "webscale", got it. ;) ), but I really feel sorry for all of the on-prem "enterprise" shops that have taken the hype bait and are now paying the maintenance burden under all of the false pretenses that are flying around. "Horrors of Upgrading Etcd Beneath Kubernetes" with dozens of production application instances, with uptime SLAs and real customer business impact? Indeed. Fraught with peril, and without the right staff onboard, disastrous.

Folks be like, hey a team of consultants just finished building out our new kubernetes cluster, now we want to run our mission critical oracle/mssql servers on it. They said it should work "fine". Too bad they all got jobs at [insert mega capacity company here] right after we cut the invoice.

Y'all remember when everyone would line up to get the latest microsoft windows beta? Flashback. Yo yo yo, XML is all the rage! Not. Relational databases are dead! Um, no.

Maybe this needs some time to travel down and back up the hype cycle curve. Power to all of you beta testers! You're truly doing god's work :)

> Folks be like, hey a team of consultants just finished building out our new kubernetes cluster, now we want to run our mission critical oracle/mssql servers on it. They said it should work "fine". Too bad they all got jobs at [insert mega capacity company here] right after we cut the invoice.

Thanks shoo, that made me chuckle - maybe I'm feeling a bit too angsty this Friday evening.

I love all of the new development, we're going great places, we just need to prescribe the right medicine for the patients so to speak. Cocaine was great in coca-cola before folks realized what it was doing to people. :)

I really love how Dan McKinley frames this kind of thing in terms of each company having a budget of "innovation tokens".

edit: now it's a club! http://boringtechnology.club/

> Let's say every company gets about three innovation tokens. You can spend these however you want, but the supply is fixed for a long while.

> If you choose to write your website in NodeJS, you just spent one of your innovation tokens. If you choose to use MongoDB, you just spent one of your innovation tokens. If you choose to use service discovery tech that's existed for a year or less, you just spent one of your innovation tokens. If you choose to write your own database, oh god, you're in trouble.

> Any of those choices might be sensible if you're a javascript consultancy, or a database company. But you're probably not. You're probably working for a company that is at least ostensibly rethinking global commerce or reinventing payments on the web or pursuing some other suitably epic mission. In that context, devoting any of your limited attention to innovating ssh is an excellent way to fail.

> There is technology out there that is both boring and bad. You should not use any of that. But there are many choices of technology that are boring and good, or at least good enough. MySQL is boring. Postgres is boring. PHP is boring. Python is boring. Memcached is boring. Squid is boring. Cron is boring.

This +1000. :) People, remembering to focus on core-competencies, making the world a better place.

This is why I'm growing into a bit of a fan of the hashicorp suite at the moment. It allows for a gradual, problem-driven extension of an infrastructure, instead of requiring you to start over. At our place we did a couple of iterations:

- First off we went from manually managed servers to chef managing servers. That was good progress, because it allowed us to scale a growing application on a cloud provider due to a large new contract.

- Then we added vault in order to simplify secret generation, management and rotation in chef. It's cool, because now we have a secure secret storage. We can give our devs specific access to the secrets of clusters they manage but not other clusters. We can script a lot of stuff around vault.

- Then we added terraform to manage VMs more easily. We should have done that earlier, I suppose, but hindsight.

- And now our devs are having large issues with their docker-based test setups, so we can open up the consul cluster and deploy nomad for this use case. We'll probably migrate some other services into that nomad cluster so we can get them loadbalanced with little effort. We'll probably shuffle some annoying things in chef around and use consul-template there.

I like that approach, because it is problem-driven and converges to simplify existing problems. For example, we have an elastic stack, and we won't move the elasticsearch cluster or the influxdbs around it away from chef on bare metal. It's a solid and stable setup, why change it.

Have an upvote. Appreciate the pragmatism :)

Not sure I agree. If you need a container to do the things you mention you’re already in pretty bad shape.

Not to mention that a lot of people don’t understand what a container is.

> Not to mention that a lot of people don’t understand what a container is.

That's part of the appeal, the abstraction away of all those pesky, irrelevant details. Developers just want their app to run, as has been mentioned elsewhere in the thread. That desire is understandable.

So long as the abstraction isn't too leaky and nothing underneath breaks, there's no downside. It's all upside in terms of human productivity and time to market.

Even introduced inefficiency (if any) is unimportant if VC money is fueling the auto-scaling. There's a popular aphorism about premature optimization. Humans are also, in general, far more expensive than machines, especially at scale, and even a mature/traditional company would ignore this at their peril.

If problems do eventually crop up with containers or the toolsets around them, chances are, by then, their sheer popularity will ensure the availability of a cadre of experts who can troubleshoot. They may even understand what a container is, even if it's terribly unfashionable to admit, as is the case with operating systems today.

Every new technology/tool has some growing pains and is subject to what some consider misuse (due to ignorance). That doesn't necessarily mean it's best to reject it outright.

Big fan of playing with new tech. In a safe environment where failure is an option. Don’t go betting the farm on shiny things.

"Personally" (professionally), neither am I. This kind of conservatism was formed after decades of sysadmin/ops experience.

However, having acquired a strong affinity for startups, I also accept that risk (even complete, betting-the-farm risk) is totally OK, so long as that risk is taken in an informed manner. Startups, especially early ones, are a risky proposition from the get-go, and VC money tends to amplify that.

The market doesn't much punish a web startup that grew like crazy but lost some user-generated content by using some "NoSQL" database when it first came out. It does, however, punish the one that failed to grow by being too conservative.

That's obviously a false dichotomy, but I believe that's essentially the perception that's created, partly by characterizing new technologies as dangerous because they're new and "shiny".

Ultimately, I consider mine to be a service profession and an engineering one. As such, if I think a new tech is too undesirable, it's up to me to provide an alternative that actually addresses the original problems (without offloading the burden onto my users/customers).

> Not to mention that a lot of people don’t understand what a container is.

Well, I'll bite. Do you? What is a container?

Best definition: there is no such thing as a container. When people say container they usually mean a bunch of kernel features (cgroups, namespaces, apparmor, etc) + an overlay filesystem.

I played with stuff like this since the chroot jail days and knew about containers before they were cool (anyone remember lxc?).

> This may be an unpopular opinion, but I’m not a big fan of containers and K8S.

It is unpopular for a reason.

Disclaimer: if you can run solely on cloud managed services + serverless, please do that and do not even look at the rest of this message. This is a very nice approach, although there are some things you need to set up before calling victory (deployment pipeline is one). And, as you mentioned, there is vendor lock-in.

Now, containers. Look, no-one WANTS containers. Or VMs. Or anything else. We just want to run our stuff. It just so happens that containers are one of the most useful abstractions there are. Unless someone else comes up with a new abstraction, containers it is.

Because a container is, at the end of the day, a process. Are you against processes? Or against process isolation in general?

You cannot lift an existing service and run serverless, you need to modify it. In many cases, it is not practical. In other cases, it needs to be an actual server-like application and hold a connection. Lambda doesn't help there.

> While what K8s has done for containers is freaking impressive, to me it does not make a lot of sense unless you run your own bare metal servers.

One thing has nothing to do with the other. These are different levels of abstraction, there are challenges when running bare metal servers which are not present in cloud environments. Kubernetes can do so much more in a cloud environment (persistent volume claims, etc). Rolling out network attached storage on bare metal servers is a pain. It is also a pain with Openstack, but at least there is a standard interface there.

> If you’re in the cloud, VMs + autoscaling or fully managed services (eg S3, lambda, etc) make more sense and allow you to focus on your app.

Sorry, I respectfully disagree. I have spent the last two months implementing automation for deploying a cluster on AWS, with auto-scaling, auto-healing, the works, automatically deployed through Jenkins. It is NOT easy, it is not simple, and it is not focusing on my end application, unless you are ignoring all the technical debt you are incurring. And we DO have several k8s clusters, I will be moving that crap to k8s as soon as I can.

Let me make a quick list of what you need:

You can use a barebones (ubuntu|redhat|coreos|etc) VM. In which case, provisioning is not complete once the ASG brings the VM up; you still need to install the app. If you use an AMI, you now need to build automation to construct these AMIs. Note: if at this stage you are building AMIs by hand, this is technical debt, which you will have to pay. Alternatively, you can use something like cloud-init. If so, see below:

* Create the auto-scaling group

* Create the launch configuration

* If your AMI is not entirely complete, add user data (or equivalent) here

* Set the health checks

And you are done! Right? No.

* What about log rotation? Do you have centralized logging? No, tech debt. Go set it up.

* What about monitoring? Are you using cloudwatch? Prometheus? Go set that up.

* What about alerting? Not everything requires a VM to be destroyed, you need to set it up.

* What about upgrades? Are these cattle servers? Then you have to modify your AMI and launch config. Go automate this (tech debt if not).

* How are you controlling access? Do you have a team? Are they allowed to SSH? Where and how are you storing the keys? How do you invalidate if a key gets compromised?

I could go on, but let's keep it at this level because the point is to draw a comparison.

With K8s, here's what you do:

* Create a container image. Dockerfile, fancy Jenkins script, some other mechanism, I don't care. Create an image, put it in a registry somewhere.

* Create a YAML file describing your 'deployment'. It can be a few lines of code if you don't care about most of the stuff.

* If you need external access, you can create a service, which is another YAML.

* If you don't have an existing HTTPS load balancer, point to the k8s workers (trivial with something like ingress on GKE).
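To give a feel for how small the 'deployment' description can be, here is a minimal sketch in Python that builds one as a plain dict and prints it as JSON (kubectl accepts JSON manifests as well as YAML). The app name `my-app` and the image `registry.example.com/my-app:1.0` are placeholders, not anything from this thread, and a real manifest would usually also set things like resource requests and probes.

```python
import json


def minimal_deployment(name: str, image: str, replicas: int = 1) -> dict:
    """Build a minimal apps/v1 Deployment manifest as a plain dict."""
    # The selector must match the pod template labels, so both reuse one label set.
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # Placeholder name and image, for illustration only.
    manifest = minimal_deployment("my-app", "registry.example.com/my-app:1.0", replicas=3)
    print(json.dumps(manifest, indent=2))
```

The output can be saved to a file and handed to `kubectl apply -f`.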

And you are done. This automatically gets you:

• Self healing

• Scheduling among worker nodes. You can control it or let K8s decide

• Bin-packing

• Logging (centralized logging requires a one-time step, with fluentd or similar, may be handled by cloud providers)

• Similarly, monitoring and alerting require a one-time investment in deploying something like Prometheus. After that is done, getting Prometheus to scrape your pods is very easy to do, easier than deploying on a VM by VM basis

• Upgrades: deployments handle that for you. Even with replica sets before it, you just needed to apply a new YAML with an updated version

• There are no SSH keys to mess around with. K8s has certificate-based user access control, with an optional RBAC

• The SSH equivalent is kubectl exec

• Service discovery: you have DNS records for all local services created for you. The cluster will direct you to the correct node.

• Scaling is trivial, but most importantly, quick. kubectl scale deployment --replicas=X. It only takes whatever time is required to download the image and run it. You don't have to spin up a whole operating system

• Optional: you can have horizontal pod auto-scaling, so your services can scale up and down automatically.
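For the horizontal pod auto-scaling item, the core rule in the Kubernetes documentation is desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). A rough Python sketch of just that rule follows; the function name and the min/max defaults are made up for illustration, and the real controller also applies a tolerance band and other behaviour that is not modelled here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Core HPA scaling rule, clamped to [min_replicas, max_replicas].

    Sketch only: ignores the controller's tolerance and stabilization logic.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

With 4 replicas at 90% average CPU against a 60% target, this asks for 6 replicas.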

It is not perfect, but it can be a game changer. I cannot imagine how we would be running our operation without K8s. Actually I can: version 1.0 of the app was a bunch of VMs, one for each service. It was nightmarish. Now the push, company-wide, is to move everything to K8s. All VMs, all data stores, all of it. And it has absolutely nothing to do with hype, it has everything to do with proven advantages, compared to most other alternatives.

I guess you could also do Mesos. They have a similar concept, only it's not K8s.

Note that SOMETHING needs to run the K8s cluster itself. That something is precisely your auto-scaling groups and VM images. It is less painful with a container-optimized OS (like CoreOS or whatever Google uses)

This is so true. I'm glad my team embraced containerization of applications internally. Deployments before were a nightmare compared to being able to pull the new image and run it, or now just updating the deployment to the target version of the application. When we need to debug a specific version, we can get the exact image of the application and its entire environment, not just check out the revision that should be deployed.

There's something to be said though about turning entire productive development teams into chasers for cloud tech stuff. I've witnessed teams at startups who were quite focussed on their product to spend many sprints almost entirely on integration into k8s and other orchestration software. You could see minor k8s topics (like "minikube crashing on developer's box") popping up to slowly fill the entire backlog. I guess k8s just makes for an excellent excuse for procrastination. Then once you get your apps running, you need a whole staff of k8s experts to keep it running. It ends up being very, very expensive in terms of HR costs in my experience.

I get k8s's goal of becoming a cross-cloud orchestration framework, and I'm as much of a standardization fan as could be. I just doubt the overreaching goals for k8s are worth it in the majority of cases, and have seen better business value in Mesos (though Marathon isn't where it could be) because you can realistically run it on your own premises.

I guess k8s is Google's vision for a self-service cloud platform that offloads everything to configuration details on a uniform matrix of nodes, and in particular such that Google doesn't need to provide customer support. I just don't see the benefit for the customer, considering we've been running POSIX workloads for almost 50 years now.

Thank you! Every time k8s is mentioned we get comments like the OP: the misinformed "get off my lawn / hacky bash scripts / just give the infra people a zip file to do the needful, because containers are wrong and k8s is alpha at best" opinion. It hurts to see people be willfully ignorant while I'm without a physical keyboard, and it is also tiring to try and show them they're wrong. Thank you for this comment. You did good.

thank you for writing this up.

> I have spent the last two months implementing automation for deploying a cluster on AWS, with auto-scaling, auto-healing, the works, automatically deployed through Jenkins. It is NOT easy, it is not simple, and it is not focusing on my end application, unless you are ignoring all the technical debt you are incurring.

yes, i can appreciate that having a system automatically handle even some of this necessary plumbing in a reasonable and standardised way is attractive.

A container is not a process. The fact that you say this makes me wonder if you understand what K8s is and how it works.

K8s is not going to solve the issues you outline above (logs, proper monitoring, etc). Even worse, you’re gonna have a bad time migrating them to a proper solution.

While k8s does not in and of itself solve some of the issues pointed out above, it does centralize a lot of these problems. These centralized problems are then often easily addressed by the cloud providers actually running the k8s cluster. GKE automatically sends log data and metrics through fluentd to their Stackdriver platform including error alerting. If something even prints something that looks like a stack trace to stdout or stderr, Stackdriver sends alerts and creates an issue for us.

To the first degree, it is. The way people use containers is typically one binary starting up, which equates to one process in Linux. Sure, docker et al have overhead, but that's constant overhead, and much of the hard work is done in cgroups.

You just might be wrong here, friend.

> If your app needs a container to run properly, it’s already a mess.

> If you’re in the cloud, VMs + autoscaling or fully managed services (eg S3, lambda, etc) make more sense

So running with container isolation is a mess, but serverless isolation is admirable?

I'm old school. I look at containers as jails and all the work to isolate applications in containers as of indifferent value given a flat plane process scope with MAC and application resource controls in well designed applications.

That is I default to good design and testing rather than boilerplate orchestration and external control planes.

All containers have done (popularly) in my opinion is add complexity and insecurity to the OS environment and encouraged bad behavior in terms of software development and systems administration.

The clustering story for etcd is pretty lacking in general. The discovery mechanisms are not built for cattle type infrastructure or public clouds, i.e. it is difficult to bootstrap a cluster on a public cloud without first knowing the network interfaces your nodes will have; otherwise you need to already have an etcd cluster or use SRV records. From my experience etcd makes it hard to use auto scaling groups for healing and rolling updates.

From my experience consul seems to have a better clustering story but I'd be curious why etcd won out over other technologies as the k8s datastore of choice.

> From my experience consul seems to have a better clustering story but I'd be curious why etcd won out over other technologies as the k8s datastore of choice.

That'd be some interesting history. That choice had a big impact in making etcd relevant, I think. As far as I know, etcd was chosen before kubernetes ever went public, pre-2014? So it must have been really bleeding edge at the time. I don't think consul was even out then - it might have been they were just too late to the game. The only other reasonable option was probably ZooKeeper.

I was around at CoreOS before Kubernetes existed. I don't recall exactly when etcd was chosen as the data store, but the Google team valued focus for this very important part of the system.

etcd didn't have an embedded DNS server, etc. Of course, these things can be built on top of etcd easily. Upstream has taken advantage of this by swapping the DNS server used in Kubernetes twice, IIRC.

Contrast this with Consul which contains a DNS server and is now moving into service mesh territory. This isn't a fault of Consul at all, just a desire to be a full solution vs a building block.

My understanding is that Google valued the fact that etcd was willing to support gRPC and Consul wasn't -- i.e., raw performance/latency was the gating factor. etcd was historically far less stable and less well documented than Consul, even though Consul had more functionality. etcd may have caught up in the last couple years, though.

At the time gRPC was not part of etcd - that only arrived in etcd 3.x.

The design of etcd 3.x was heavily influenced by the Kube usecase, but the original value of etcd was that

A) you could actually do a reasonably cheap HA story (vs Singleton DBs)

B) the clustering fundamentals were sound (zookeeper at the time was not able to do dynamic reconfiguration, although in practice this hasn’t been a big issue)

C) consul came with a lot of baggage that we wanted to do differently - not to knock consul, it just overlapped with alternate design decisions (like a large local agent instead of a set of lightweight agents)

D) etcd was the simplest possible option that also supported efficient watch
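As an aside on the HA point in (A): etcd is built on Raft, so a write needs a majority of members to agree. A small worked calculation in Python shows what that buys for typical cluster sizes. The function names are just for illustration.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still available."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(n, quorum(n), tolerated_failures(n))
```

A single node tolerates no failures, 3 members tolerate 1, and 5 tolerate 2. Going from 3 to 4 members does not add any failure tolerance, which is why odd cluster sizes are the usual choice.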

While I wasn’t part of the pre open sourcing discussions, I agreed with the initial rationale and I don’t regret the choice.

The etcd2 - 3 migration was more painful than it could be, but most of the challenges I think were exacerbated by us not pulling the bandaid off early and forcing a 2-3 migration for all users right after 1.6.

My impression is that etcd works at a lower-level data store abstraction than Consul, which is exactly why it's not so feature-rich but is used as a building block. Consul packs more out of the box if that's what you need.

Both are still much better to operate than ZooKeeper.

There are several ways of bootstrapping ETCD. The one I use is the one you mention: since the nodes are brought up with Terraform, always on a brand new VPC, we can calculate what the IP addresses will be in Terraform itself and fill the initial node list that way. We can destroy an ETCD node if need be, and recreate it. Granted, it is nowhere near as convenient as an ASG.
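
As a rough illustration of the static approach described above, here is a minimal sketch that builds the value for etcd's `--initial-cluster` flag from addresses known ahead of time. The node names, the IPs, and the helper name `initial_cluster` are hypothetical; 2380 is etcd's default peer port, and the `http` scheme is an assumption (a real deployment would likely use TLS).

```python
def initial_cluster(nodes, scheme="http", peer_port=2380):
    """Build an etcd --initial-cluster value from a {name: ip} mapping.

    The result is a comma-separated list of name=scheme://ip:port entries.
    """
    return ",".join(
        f"{name}={scheme}://{ip}:{peer_port}" for name, ip in nodes.items()
    )


# Hypothetical node names and addresses, e.g. derived from a known subnet layout.
nodes = {"etcd-0": "10.0.1.10", "etcd-1": "10.0.2.10", "etcd-2": "10.0.3.10"}
print(initial_cluster(nodes))
```

Every member has to be started with the same list, which is why the addresses must be predictable before any node boots.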

The alternate method, and the method we used before, is to use an existing cluster, as you mention. If cattle self-healing is that important, perhaps you could afford a small cluster only for bootstrapping? Load will be very low unless you are bootstrapping a node somewhere. There are costs involved in keeping those instances 24/7, but they may be acceptable in your environment (and the instances can be small). Then the only thing you need is to store the discovery token and inject it with cloud init or some other mechanism.

That said, I just finished a task to automate our ELK clusters. For Elasticsearch I can just point to a load balancer which contains the masters and be done with it. I wish I could do the same for ETCD.

To sidestep upgrade issues, we're pursuing stateless immutable K8S clusters as much as possible. If we need new K8S, etcd, etc., we'll spin a new cluster and move the apps. Data at rest (prod DBs, prod Posix FS, prod object stores, etc.) is outside the clusters.

Where do you run your persistent apps?

Not the OP, but I have two kinds of persistent data.

1) Images / files / etc. It all lives in cloud storage ("s3"), outside of K8s

2) RDBMS data. You can just run as hosted sql (say CloudSQL) or a not-in-k8s VM. I have found no compelling reason to move my RDBMS into my k8s cluster.

That's a bit distressing. Most everywhere I've worked, the infrastructure that matters to the business has fallen into roughly two categories:

Category 1: stateless-"ish" workloads. More than 90% of hosts/containers used . . . less than 25% of operations headaches and time. Issues that happen here are solvable with narrow solutions: add caches, scale out, do very targeted, transparent fixes to poorly-performing application code.

Category 2: stateful workloads. Less than 10% of hosts/containers. 75% or more of operations headaches and time. Issues that happen here have less visibility, fewer short-term fixes ("just add an index and turn off the bad queries" only works so many times before you're out of low-hanging fruit), and require more expertise to solve in a way that doesn't require the application/clients to change.

If k8s and other next-gen technologies are only easing the first category, that makes me sad. It's like we have this sedan (off-the-shelf web technologies) that we have to take off-roading and it falls apart all the time. I don't want a better air conditioning system and more cushions in my seats; I want the vehicle to not break.

In k8s you can easily host stateful services. You have persistent disks that you can attach to containers, and you also have StatefulSets if you have a stateful service that you want to have automatically scaled (https://kubernetes.io/docs/concepts/workloads/controllers/st...). You can use both to run a database (postgres) for example.

I thought that this was the case too, but OP had a link (https://gravitational.com/blog/running-postgresql-on-kuberne...) to a previous post that got me worried.

From TFA

> Kubernetes is not aware of the deployment details of Postgres. A naive deployment could lead to complete data loss.

That sounds ominous, but is actually a tautology.

You have the exact same challenges anywhere else, but since K8s makes some operations so easy to do, you need to be careful. RDBMSs are especially tricky because most of them expect a single "master" which holds special status. And it so happens to hold all your data too (as do your replicas, provided they are up to date).

K8s and similar represent a current view on how systems should be designed and run from the ground up (largely based on the same observation you have made, that stateless workloads can be run with drastically fewer operational problems). If your architecture lines up with this approach (e.g., following the 12-factor approach), there really are a lot of advantages. But no, legacy architected applications are just not going to benefit in the same way. I assume that there must be consultants and "thought leaders" out there who are pushing k8s and containerization as silver bullets and that's unfortunate.

> require more expertise to solve in a way that doesn't require the application/clients to change.

Surprise! IT is hard and requires experts. Either hire one, or become one, but either way, the idea that "nobody in my enterprise/company/team needs to understand the details of how stuff works" is crazy. Imagine a car-mechanic shop where nobody knows how an engine works: "we plugged in the computer tool, and it said something is wrong with your engine. I guess you need a new one"

Yes, it is. But tooling can improve that. For example: having worked on DB2 and Postgres, I'd choose Postgres for any new work barring extremely unique constraints--not because it means "I don't have to understand the details of how stuff works", but because it exposes how stuff works in a more intelligible, powerful, and useful way. Tools improve. Sometimes they repeat past mistakes--I'd pick Postgres over Hadoop/HBase too, until considerable duress. But sometimes? Sometimes tools learn from past mistakes--use those tools.

Not everyone complaining about the deficiencies of current-gen tools is doing so out of ignorance or laziness about their stack.

k8s solves state pretty well lately, but that doesn't mean you need to move everything into your cluster.

I use k8s for bin packing and easy rolling deployments. Neither of which matters for my DB (i.e. there is no such thing as an easy rolling postgres deployment (maybe Citus?)), and I am not going to put anything else on my database server... so no bins to pack.

If you are doing one or the other, then sure worry about it from a k8s sense. But don't throw stuff in k8s for no reason. It makes stuff like rolling upgrades of your cluster way more scary.

> ... not-in-k8s VM

One of the cooler innovations we've seen, and I think we're going to see more of, is the ability to take a non-k8s VM and expose it to the cluster as if it were just another pod. This would let you schedule and expose RDBMSs and other specialized servers through kubernetes while keeping them on a standard VM.

I think that's the win/win approach to bridge the gap.


But I think that doesn't go far enough: I want to provide K8s with a YAML file and have Kubernetes itself go and create and provision a VM for me. I.e., I want to "kill" Terraform. I don't care about a 'fabric' network (although that could be convenient), just give me an IP that I can reach even if it is external to the cluster. VMs could be just one more resource that can be created or destroyed by k8s, just like network-based storage today.

I haven't found this exact use-case implemented yet. If no one else builds it in the coming months I'll probably start doing it.

We did something simple for GCP and CI jobs, when we needed to have VMs in a certain project to launch Kubernetes single node testing - https://github.com/openshift/ci-vm-operator/

The base pattern should be pretty easy to modify, although the use case here is very specific.

Have you looked at Virtlet or Kubevirt?

This is the approach we’ve taken with other similar systems too. Things can go so horribly wrong underneath your containers that the concept of having only one cluster in a prod scenario and maintaining it mid-air would be unthinkable.

Also agree on keeping state outside. Maybe the relevant tech will be mature sometime soon but we’ve seen orchestration bugs do nasty things to stateless containers that would have been a nightmare if state had been involved.

This article really hits home.

A K8s cluster can survive just about anything. Worker nodes destroyed, meh, scheduler will take care of bringing stuff up. Master nodes destroyed. Meh. It doesn't care.

ETCD issues though? Prepare for a whole lot of pain. They are very uncommon though. Upgrading is the most frequent operation.

I'll have to read this later this weekend, my home k8s cluster that broke did so because of etcd. Grrr

From my experience, running etcd in cluster mode simply creates too many problems. It can scale vertically very well and if you run etcd (and other Kubernetes control plane components) on top of Kubernetes you can get away with running only a single instance.

Etcd misbehaving during upgrades or when a VM was replaced was a massive source of bugs for Cloud Foundry.

There is no longer an etcd anywhere in Cloud Foundry.

Aren’t they introducing Kubernetes as part of Cloud Foundry? https://techcrunch.com/2018/04/20/kubernetes-and-cloud-found...

Sort of. Kinda. It's complicated. But insofar as Kubernetes becomes the container orchestrator, I imagine we will encounter some or all of those problems again.

Cloud Foundry's current orchestrator, Diego, is of similar vintage to Kubernetes. It now relies entirely on relational databases for tracking cluster state. Ditto other subsystems (eg, Loggregator). It scales just fine. MySQL, while not my personal favourite, has proved more reliable in practice than etcd. Some folks use PostgreSQL. Also more reliable in practice.

Paying customers care more about being more reliable in practice than being more reliable in theory.

I've ha-ha-only-seriously suggested we throw engineering support behind non-etcd cluster state. For example: https://github.com/rancher/k8s-sql

What problems exactly is etcd trying to solve?

Interviewer here -- "If you want to form a multi-node cluster, you pass it a little bit of configuration and off you go." .. "Raft [etcd] gets you past this old single-node database mantra, which isn’t really particularly old or even a wrong way to go. Raft gives you a new generation of systems where, if you have any type of hardware problem, any type of software instability problem, memory leaks, crashes, etc., another node is ready to take over that workload because you are reliably replicating your entire state to multiple servers."

Distributed storage + consensus, basically.
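
To make the consensus part concrete: a Raft cluster only makes progress while a majority of its members can talk to each other. The sketch below is just that arithmetic, not etcd code, and the function names are made up for illustration.

```python
def quorum(members):
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many nodes can be lost while a majority still remains."""
    return members - quorum(members)


# Print cluster size, quorum size, and tolerated failures for common sizes.
for n in (1, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
```

Note that going from 3 to 4 members raises the quorum without letting the cluster survive any additional failures, which is why odd cluster sizes are the usual recommendation.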

For the purposes of this thread, etcd is the underlying Kubernetes storage mechanism. For many practical purposes, that's all one needs to know, unless you are in charge of maintaining the ETCD cluster.

Lol. CoreOS and Hashicorp products often throw “cloud” and “discoverability” around but lack crucial features for ops supportability found in solutions that came before. Zookeeper, Cassandra, Couchbase didn’t evolve in a development vacuum chamber. New != better.
