Very misleading title, was hoping for a more substantive read. Kubernetes itself wasn't causing latency issues, it was some config in their auth service and AWS environment.
In the takeaways section, the author blames the issue on merging together complicated software systems. While absolutely true, this isn't specific to k8s at all. To specifically call out k8s as the reason for latency spiking is misleading.
Ok, we've replaced "Kubernetes" with a more accurate and representative phrase from the article. If someone suggests a better title, we can change it again.
I think the point is that this was a real problem that happened because of combining k8s and aws, which is a pretty common scenario. And it underscores that the bug was hard to find - I'm not sure how many people on my team would be comfortable looking deeply at both GC and wireshark. It required asking "Why" a few more levels deep than bugs usually require, and I think a lot of developers would get stumped after the first couple of levels. So it's another piece of data just counseling that a proper k8s integration is not as easy as people might expect.
I also get the sense that that team has a better-than-average allocation of resources. On some teams I've been on, this type of problem would be the responsibility of one person, expected to be solved within an afternoon, with impatient product people and managers checking for status after that.
This is exactly what happens when you abstract away anything, in this case, infrastructure. Most of the time people focus on the value-added by the abstraction. This time somebody had to face the additional burden that was introduced by it, making it harder to track down the bug.
I really like the title and I don't think it's link-bait-y.
Why?
Because too many engineers I've worked with would bump into this situation and this would be their answer. They wouldn't take the time to debug the situation deeply enough and they'd blame k8s, or blame the network, or blame...
In my experience, the most common issues with complex distributed systems are much more likely to be due to misconfiguration because of a limited understanding of the systems involved than such issues are to be caused by core, underlying bugs. And I believe that's why some engineers shy away from otherwise valuable frameworks and platforms: they have a natural and understandable bias to solve problems via engineering (writing code) than via messing with configuration parameters.
Hi, author here. That was exactly the intent of the title, reflecting the reaction we (almost always) get from developers: "k8s is at fault", while the result of most investigations is "not really". I try to make that evident in the conclusions, but I agree that without realizing that intent the title is misleading.
I don't know. k8s is pretty complicated. How many small/medium apps need more than this nginx/Terraform/Docker example? This would be a lot more difficult to set up in k8s (pods, ingress, etc.)
"Once this change was applied, requests started being served without involving the AWS Metadata service and returned to an even lower latency than in EC2."
Title should be: My configuration made my latency 10x higher.
We found EKS to be really disappointing in comparison to a self-hosted solution. Not only can you not tweak extremely important kubelet configuration, you also cannot run real HA. Most of AWS's implementation around EKS was simply terrible and outclassed by community-driven projects. For me personally, EKS is the same kind of failed service as Elasticsearch Service: good for low-to-medium-size workloads but terrible for anything first-class.
There are some cluster parameters you simply cannot change because the API refuses them or they aren't exposed at all (node- and cluster-dependent parameters as well). One example is the HPA downscale grace periods.
What grinds my gears is hard-coded magic timeout numbers. Somehow microservice people seem to think these are good (e.g. for circuit breakers) without realizing the unexpected consequences of composing them like this. Your timeout is not my timeout. So firstly, don't do it: time is an awful thing to build behaviour on. Secondly, if you ignore that, then at least make it a config parameter, so I've got a chance of finding it without wire-level debugging (if you document it).
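As a sketch of what I mean (assuming an Istio-style mesh; the names below are made up, not from the article), at least a route-level timeout lives in config where an operator can find and change it:

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: my-service            # hypothetical service name
    spec:
      hosts:
        - my-service
      http:
        - route:
            - destination:
                host: my-service
          # explicit, documented, discoverable, not buried in code
          timeout: 5s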
The timeouts this post is talking about are related to credential expiration and when to refresh, not request/connection timeouts like you'd see in microservices. In this case, not expiring credentials isn't a great option because you'd lose a useful security property: Reducing the time window when stolen credentials can be used.
For service behavior (e.g. request/connections), timeouts provide value for services and clients. For services, if you never time out then under failure conditions you either end up saturating your max concurrent request limits or growing your concurrent connections indefinitely until you hit a limit (connections, threads, RAM, CPU). Unless all of your clients are offline batch processes with no latency SLA there's a good chance that the work clogging up your service was abandoned by your clients long before it completes.
Timeouts also help clients decide if/when they should retry. Even if the service never times out, clients can't really tell if their request is just taking a long time or if something is wrong and the request will never succeed (e.g. network partitions, hardware failure). There's at least implicit latency SLAs on most things we do (1 second? how about an hour or week?). Given that there is a limit somewhere, it makes sense to use that limit to get benefits like resiliency in services.
>>>Your timeout is not my timeout.
Absolutely. Client deadlines are a great way to reduce wasted work across distributed systems. e.g. service has 60s timeout, but client has a 5s timeout for an online UI interaction. The client can tell the service to give up if it's not completed within the client's SLA.
And this is why I find DevOps work more interesting than programming. System integration is just endlessly challenging. I always enjoy reading a well-documented integration debugging session!
Okay. So based on 20+ years of experience, I can say that most developers have no interest in automating deployment, configuration, monitoring, performance, logging, etc... Who should do this work?
Yeah, operations-focused engineers will continue to have a niche carved out for them because too many devs black-box infrastructure.
Companies can either choose to have their devs take on ops responsibilities or continue having dedicated ops jobs.
In either case, whether or not dedicated ops jobs exist, ops responsibilities always will. I'll be there to pick up the slack because designing and maintaining systems is an interesting job that has to be done that a lot of people can't/won't do and it pays accordingly.
People overlook that there's a common systematic belief that looks like this:
1) Developers don't code on prod
2) Prod needs to be protected
3) Therefore, developers need to be restricted from prod
One of my teams went through the process of "let's do DevOps!" with the intent of giving developers the ability of pushing something all the way through to prod on AWS. Months later, this resulted in having a poorly-supported dev-only VPC with IAM/policy restrictions, and other "official" VPCs that devs are locked out of in various ways. Since then, devs had little incentive to learn and are again reliant on Ops for any deployment problems.
There's a common systematic belief of that because that's the sort of thing a lot of actual compliance regulations de facto require (i.e., they demand controls around software deploys, and putting the enforcement of those controls in the same hands as those wanting to deploy, i.e., devs, will fail an audit).
Source: My employer is currently undergoing SOX compliance
This might be achieved with software tooling, though we're a while away from having a great solution. Lots of stuff hasn't been automated yet, and we tend to be afraid of trying some of it, but I like to think we'll be able to automate that someday.
Interesting, I've observed many teams where the engineers seem to prefer the DevOps work to actually building features. To the point where I think sometimes we have too many engineers Opsing rather than Deving.
If you tackle the problem from an efficiency angle it makes sense. Just like how people advocate for keeping things DRY, why should every developer worry about deployment pipelines, logging/monitoring/alerting systems, etc.? It makes sense to have a dedicated team worry about those issues so others don't.
I think the best model is to have your "devops/sre guy" embedded right on the team.
I market myself in interviews as a software engineer (on the same team as the other SWEs) that just happens to focus on what actually getting this product into production and keeping it going looks like.
I got my start in this by working at a startup where we were responsible for everything just by the nature of the limited staff. I think that's how a lot of people in Devops get their start.
I didn't even know coding without a thought about deployment was a thing until I joined a huge company.
> I think the best model is to have your "devops/sre guy" embedded right on the team.
And here lies the problem.
We had "Ops" before. People would throw stuff over the wall from Dev to Ops, and now it became "Ops" problem. Operations would stonewall releases – because releases bring problems, and they want to keep things stable. Developers would push for releases, because that's what they are paid to do.
To solve this conflict, someone came with the idea of "DevOps". You don't have a split organization now, you have some developers in your team that can wear the "Ops" hat. That way, they experience production issues first hand, and that allows them to architect the application better – as well as ensuring it is thoroughly tested – noone wants to wake up at 2AM.
And, because they are developers at heart, they will do things to make their job easier and more predictable, like infrastructure as code. And heavy automation.
Somewhere along the way, people misread the whole idea, and now think that "devops" == Terraform or some other tool. Then they rebranded their Ops org as DevOps and hired a few automation folks and called it a day. Ops rebranded is still ops.
It was not supposed to be like that. If you have a "devops" team, you are not doing it correctly. Call it operations like it should be called. You may even have a dedicated team working on common tools or similar – which is fine – but you still need developers doing operations and ultimately responsible for their own shit, otherwise this is not devops.
> I got my start in this by working at a startup where we were responsible for everything just by the nature of the limited staff. I think that's how a lot of people in Devops get their start.
Yup. I got into that by trying (and ultimately failing) to launch my own startup. We didn't have dedicated operations folks, or time to manually tinker with servers, or to respond to repeated production issues. When you are sales, operations and engineering, you try your hardest to not have incidents (so that you don't lose customers), but you are trying to move fast to deploy features (so that you can sell).
> And, because they are developers at heart, they will do things to make their job easier and more predictable, like infrastructure as code. And heavy automation.
To be fair, most ops people I know worth their salt have been automating since forever. Doing operations at scale and still staying sane is impossible without automation.
The largest issue in ops is that reality in production is usually far messier than development. Some minor issue which can be easily fixed in dev by something simple (take a reboot of a service, for example) can be a major pain in production because of several factors, usually related to interdependencies or customer impact.
Also, production (especially on the networking/hardware side) is usually hard because of things like hardware failure, physics itself or simply human error. These gems for instance[1][2].
DevOps is more about culture and collaboration. The tooling becomes necessary to truly make the processes go faster and better, ie. automated testing and CI/CD.
So should developers be doing all their own ops work in a half-assed, ignorant way, or should we go back to a world of throw it over the wall systems where the ops team doesn't understand the code or have a working non-adversarial relationship with the developers?
Because I've lived with both, and to hell with both of them.
this is the old issue of dev and ops having different goals in terms of their jobs.
Ops wants stability because uptime and non-disruptive service are paramount; devs want velocity because they need to push new features to market.
It's a delicate balance to strike, especially considering having a product with a ton of features but lackluster stability doesn't get you anywhere, and neither does having a super stable product that lacks basic features.
"move fast and break things" doesn't work in a lot of sectors. It might work for an internet startup, but good luck applying that principle to fields like large infrastructures and datacenters.
We have hit similar issues with GKE. GKE has a soon to be deprecated feature called "metadata concealment"[1], it runs a proxy[2] that intercepts the GCE metadata calls. Some of Google's own libraries made metadata requests at such a high rate that the proxy would lock up and not service any requests. New pods couldn't start on nodes with locked up metadata proxies, because those same libraries that overloaded the proxy would hang if metadata wasn't available.
That was compounded by the metadata requests using DNS and the metadata IP, and until recently Kubernetes didn't have any built-in local DNS cache[3] (GKE still doesn't), which in turn overloaded kube-dns, making other DNS requests fail.
We worked around the issues by disabling metadata concealment, and added metadata to /etc/hosts using pod hostAliases:
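Roughly like this in the pod spec (a sketch from memory; 169.254.169.254 / metadata.google.internal is the standard GCE metadata endpoint, but double-check the values for your own setup):

    spec:
      hostAliases:
        - ip: "169.254.169.254"          # GCE metadata endpoint
          hostnames:
            - "metadata.google.internal"
            - "metadata"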
"We are blending complex systems that had never interacted together before with the expectation that they collaborate forming a single, larger system."
I suggest more people read the first few chapters of Specifying Systems by Lamport. Maybe the rest is good also, but that's as far as I got.
It works through a trivial system (display clock) and combines it with another trivial system (display weather).
Nothing Earth-shattering, but it really stuck with me. Thinking about it at that level gave me a new appreciation for what combining two systems means.
Exactly! On our OpenShift production cluster we ran into ndots problems with DNS and slow DNS resolution overall. This blog post was very helpful in understanding the issue and ways to fix it, https://pracucci.com/kubernetes-dns-resolution-ndots-options...
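For anyone else hitting this, the usual mitigation (just a sketch, and it assumes your apps resolve fully qualified names) is to lower ndots per pod via dnsConfig:

    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "1"    # default is 5, which triggers extra search-path lookups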
Yeah, Tim Hockin and I still regret not designing the DNS name search process in Kube better. If we had, we would have avoided the need for ndots=5 and could have kept 90% of the usability win of “name.namespace.svc” and “name” being resolvable to services without having to go to 5. And now we can’t change it by default without breaking existing apps.
Pardon my lack of Kubernetes knowledge, but any regrets supporting the hierarchical lookup where they don't have to qualify their dns requests (and maybe could have used some other way to find their "same namespace")?
Good question. I certainly use the “name” and “name.namespace.svc” forms extensively for both “apps in a single namespace” and “apps generic to a cluster”.
I know a small percentage of clusters make their service networks public with DNS resolution (so a.b.svc.cluster-a.myco is reachable from most places).
The “namespace” auto-injected file was created long after this was settled, so that wasn’t an option. I believe most of the input was “the auto env var injection that docker did we don’t like, so let’s just keep it simple and use DNS the way it was quasi intended”.
Certainly we intended many of the things istio does to be part of Kube natively (like cert injection, auto proxying of requests). We just wound up having too much else to do.
Isn't this a KIAM bug? The default configuration of any piece of software should not cause pathological cases in other pieces of software that are commonly used with it. Maybe I'm just a bleeding heart, but I think good software delights its users; the deployment and configuration story is a part of this.
this is good info but the title is misleading. This could have easily been "my latency increased because my local time was off" or something like that.
This had nothing to do with Kubernetes... it was a problem with their setup.
The title is a bit misleading, kub didn't cause the 10x latency - also latency was lower after they fixed their issues
TL;DR version – they migrated from EC2 to Kube; due to some default settings in Kiam and the AWS Java SDK, application latency increased; it was fixed after reconfiguration, and the Kube latency ended up lower than EC2.
It is relevant though, as k8s makes everything more complicated, so you have to deal with stuff like this. Also, if it was a brand new app they'd maybe not notice the problem in the first place.
>> It is relevant though as k8s makes everything more complicated
More complicated than what? Without a baseline for the comparison it's not that useful. In our case we transitioned over the last four years from running hundreds of VMs provisioned with puppet and jenkins to running K8S workloads (on a lot fewer nodes than we had VMs) provisioned with helm/kustomize and using gitlab ci/cd pipelines. In my opinion the current platform is much less complex to understand and manage than the old one was. Yeah, there are layers of complexity that didn't exist in the previous platform, i.e. the k8s control plane, the whole service/pod indirection, new kinds of code resources to manage, but it's all pretty consistent and works logically, and isn't really any harder to internalize than any other platform-specific complexity we've had to deal with in the past. And in terms of day-to-day development, deployment and operations k8s has been completely transformative for us.
This is only true if you don't need the features in the first place, and you haven't gone through the learning curve yet.
So I can deploy a single container. Works fine. You can tell docker to restart it. Works fine. If that's all you need, you don't need k8s.
But you may now want to deploy two containers. And let's say they are servers. Now, you need incoming connections from a single port going to them. So maybe you are going to deploy something like haproxy – which now you have to learn.
Also, you now need a way to tell if your service is down. So you add a health check. But then you want to kill the container if it goes bad. So now you add a cronjob or similar.
At some point, one machine is not enough. So do you replicate the exact workloads across machines? If you do, it is a bit easier - but potentially more expensive. But if you need to spread them across different machine types, it gets more difficult. And now you also have storage distributed which you need to keep track of. If you haven't automated your deployment, now it is the time.
Now you have this nice setup. It is time to upgrade. How do you do that? Probably some custom script again.
Instead, you can forget all of the above. You can spin up a k8s cluster with any of the currently available tools (or use one of the cloud providers). Once you have k8s, a lot is taken care of for you. Multiple copies of the service (with the networking taken care of for you)? Check. Want to increase? kubectl scale deploy --replicas=<n>. Want to upgrade? Apply the new YAML. Health checks and auto-restart? Add a couple of lines to your YAML. Want to have different workloads over different machine types? That's also an easy change.
Want to have storage (that follows your workload around)? Easy. You are in a cloud provider like GCP and want a new loadbalancer (with automated SSL cert provisioning!)? A couple of lines of YAML. Want granular access controls? It's built in. I can go on.
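As a rough illustration (a minimal sketch with made-up names, not a production-ready manifest), replicas, health checks and auto-restart really are just a few lines of a Deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-server                      # hypothetical
    spec:
      replicas: 3                          # kubectl scale changes this
      selector:
        matchLabels:
          app: my-server
      template:
        metadata:
          labels:
            app: my-server
        spec:
          containers:
            - name: my-server
              image: example/my-server:1.0 # hypothetical image
              ports:
                - containerPort: 8080
              livenessProbe:               # health check + auto-restart
                httpGet:
                  path: /healthz
                  port: 8080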
Of course, there's a learning curve. But the learning curve is also there if you're stitching together a bunch of solutions to replicate the same features. Once you get used to it, it's difficult to go back.
I might even go further - kiam is not a standard deployment supported by the kubernetes project like the api server or the scheduler or the autoscaler. See https://github.com/uswitch/kiam
That said it is a very common deployment strategy in ec2 to run kiam or kube2iam. I wish the kube core teams took over the development of an aws iam role service since issues like bad defaults would be solved much quicker. Your only other alternative is to use iam access keys and nobody likes that (security wise and it’s a pain to configure).
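For context, with kiam (and kube2iam uses the same annotation, as far as I remember) the role is granted per pod, roughly like this sketch; kiam also expects the namespace to whitelist which roles its pods may assume:

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-app                            # hypothetical
      annotations:
        iam.amazonaws.com/role: my-app-role   # IAM role the pod should assume
    spec:
      containers:
        - name: my-app
          image: example/my-app:1.0           # hypothetical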
Which isn't surprising. Kubernetes' default settings are meant to work in all environments using an overlay network, not to be optimized for performance (make it work, then make it fast).
In other words, switching to Kubernetes _did_ make their latency higher — otherwise they wouldn't have needed to reconfigure anything, would they? If you want to help k8s, try starting a pull-request to make the defaults better rather than playing spin-doctor telling people that well-documented problems don't exist.
KIAM is not part of the core Kubernetes system but it was a necessary component to avoid introducing a security regression as part of the switch.
Again, my point was that rather than trying to do PR damage control it would be better to work to improve things so this doesn't happen to others. Someone went to the trouble of posting a detailed examination of a real problem with a fix and some references to upstream improvements. That's a lot more useful than trying to draw a line between Kubernetes and one of the two tools commonly used to meet security requirements when operating it in one of the most popular cloud environments.
It’s not necessary to use kiam – as I mentioned multiple times – but you need something like it if you’re trying to maintain security coming from a good EC2/ECS deployment. Since it’s one of the more popular options, and Java is not uncommon, it seemed reasonable to consider this a likely pain point for many users.
I would not describe a service which is only ever used on Kubernetes workers and is only necessary for code running on Kubernetes as having nothing to do with Kubernetes. The fact that you and the OP are so emotionally driven to find a way to dismiss it is what makes it sound like PR — why not just acknowledge there's a real problem which is being fixed and be glad someone documented it well enough to save other people time? Nobody is saying that Kubernetes is bad, or that you shouldn't use it, only that there's an interaction which Java users should know about if they're running on AWS.
You're being strangely aggressive about this. Nobody cares about PR here, nor is anyone denying anything. We're just being accurate.
KIAM and the Java SDK have bad timeout overlap. That's the problem; it has nothing to do with Kubernetes, and it looks like it's well documented and now resolved.
This was a well-written post about a problem a lot of people would encounter with a common Kubernetes deployment. Instead of talking about that, most of the comments were people complaining about the title.