Kubernetes Security Assessment [pdf] (github.com/kubernetes)
255 points by Tomte on Aug 9, 2019 | 63 comments

This is good stuff.

A nit:

Users should not use AES-CBC or GCM for encryption. Secretbox should be the default mode of storing information and users should be encouraged to use KMS.

I see where this is coming from and agree in spirit, but GCM is actually idiomatic Go and implemented through the crypto/cipher AEAD interface, which does about as good a job as any library at being user-proof.

I too would probably prefer code that used NaCl primitives over Seal/Open, but I would probably not flag code that didn't.
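For readers who haven't used it, the Seal/Open pattern mentioned above looks roughly like this in Go. This is a minimal sketch using only the standard library (crypto/aes, crypto/cipher, crypto/rand). The helper names newGCM, seal, open and roundTrip are made up for illustration, and prefixing the nonce to the ciphertext is just one common convention, not something the audit or Kubernetes prescribes.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext under a fresh random nonce and returns
// nonce || ciphertext || tag. (Hypothetical helper, not a library API.)
func seal(aead cipher.AEAD, plaintext, aad []byte) ([]byte, error) {
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends to its first argument, so the nonce becomes a prefix.
	return aead.Seal(nonce, nonce, plaintext, aad), nil
}

// open splits off the nonce prefix, then authenticates and decrypts.
func open(aead cipher.AEAD, sealed, aad []byte) ([]byte, error) {
	if len(sealed) < aead.NonceSize() {
		return nil, errors.New("sealed message too short")
	}
	nonce, ciphertext := sealed[:aead.NonceSize()], sealed[aead.NonceSize():]
	return aead.Open(nil, nonce, ciphertext, aad)
}

// roundTrip returns the decrypted text and whether a tampered copy was rejected.
func roundTrip(msg string) (string, bool, error) {
	key := make([]byte, 32) // AES-256
	if _, err := rand.Read(key); err != nil {
		return "", false, err
	}
	aead, err := newGCM(key)
	if err != nil {
		return "", false, err
	}
	sealed, err := seal(aead, []byte(msg), nil)
	if err != nil {
		return "", false, err
	}
	plain, err := open(aead, sealed, nil)
	if err != nil {
		return "", false, err
	}
	// Flip one bit of the tag; Open should refuse to return plaintext.
	tampered := append([]byte(nil), sealed...)
	tampered[len(tampered)-1] ^= 0x01
	_, tamperErr := open(aead, tampered, nil)
	return string(plain), tamperErr != nil, nil
}

func main() {
	plain, rejected, err := roundTrip("hello")
	if err != nil {
		panic(err)
	}
	fmt.Println(plain, rejected)
}
```

The caller never touches block modes, padding or MACs directly, which is the "user-proof" ergonomics being discussed; the nonce is the one thing still left to get right.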

> I see where this is coming from and agree in spirit, but GCM is actually idiomatic Go and implemented through the crypto/cipher AEAD interface, which does about as good a job as any library at being user-proof.

Good point, and I appreciate the (updated) Kubernetes docs do a pretty good job of telling you what the implications of using aesgcm vs secretbox are.

However, I was surprised that XChaCha20-Poly1305 wasn't recommended. XChaCha appears to check all the boxes you mentioned and is nonce-misuse resistant.

It's "NMR" in the sense that the nonce is long enough to safely use random nonces, you mean? In practice, Kubernetes can use random GCM nonces safely too. Real NMR ciphers don't just have misuse-resistant ergonomics, but also better failure modes when the ergonomics fail: if you reuse a Chapoly nonce, it blows up. That doesn't happen with AEZ or SIV.
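To make "it blows up" concrete, here is a small sketch of what nonce reuse does to confidentiality. It uses AES-GCM because that is in Go's standard library (ChaCha20-Poly1305 lives in golang.org/x/crypto); both encrypt by XORing the plaintext with a keystream derived from the key and nonce, so the same failure applies to both. The function name reuseLeaksXOR is hypothetical, and the sketch only shows the keystream leak; reuse is also understood to undermine the authentication tag, which is not demonstrated here.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// reuseLeaksXOR encrypts p1 and p2 under the same key AND the same nonce,
// then reports whether c1 XOR c2 equals p1 XOR p2 over their common length.
// If it does, an eavesdropper learns the XOR of the two plaintexts
// without ever seeing the key.
func reuseLeaksXOR(p1, p2 []byte) (bool, error) {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		return false, err
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return false, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return false, err
	}

	// Deliberate misuse: one fixed (all-zero) nonce for two messages.
	nonce := make([]byte, aead.NonceSize())
	c1 := aead.Seal(nil, nonce, p1, nil)
	c2 := aead.Seal(nil, nonce, p2, nil)

	n := len(p1)
	if len(p2) < n {
		n = len(p2)
	}
	for i := 0; i < n; i++ {
		if c1[i]^c2[i] != p1[i]^p2[i] {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	leaked, err := reuseLeaksXOR([]byte("transfer 100 to alice"), []byte("transfer 999 to mallory"))
	if err != nil {
		panic(err)
	}
	fmt.Println("plaintext XOR recoverable from ciphertexts:", leaked)
}
```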

I agree that both can be used safely. And, yes to be clear, NMR here means "less likely to happen" not "better able to handle failure." Unfortunately, AES-GCM-SIV (or AEZ) aren't yet in Go's standard lib.
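The "less likely to happen" point is a birthday-bound argument, and the arithmetic is short enough to sketch. The function below is a hypothetical helper that uses the standard approximation p ~ n^2 / 2^(bits+1) for n uniformly random nonces, working in log2 to avoid overflow; the message count of 2^32 is an assumed workload chosen for illustration, not a figure from the thread.

```go
package main

import "fmt"

// collisionLog2 returns log2 of the approximate probability that at least two
// of 2^log2Messages uniformly random nonces of nonceBits bits collide,
// using the birthday approximation p ~ n^2 / 2^(bits+1).
// The approximation is only meaningful while the result is well below 0.
func collisionLog2(log2Messages float64, nonceBits int) float64 {
	return 2*log2Messages - float64(nonceBits) - 1
}

func main() {
	// 96 bits is the GCM nonce size; 192 bits is the XChaCha20 nonce size.
	for _, bits := range []int{96, 192} {
		fmt.Printf("%d-bit nonce, 2^32 messages: p ~ 2^%.0f\n", bits, collisionLog2(32, bits))
	}
}
```

Under that assumption a 96-bit random nonce gives a collision chance of about 2^-33 per key, and a 192-bit nonce about 2^-129, which is why the longer nonce is treated as safe to pick at random without counting messages.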

But, why not use XChaCha20-Poly1305 over AES-GCM in Go? Both are "implemented through the crypto/cipher AEAD interface" and -- to my eyes -- seem equally user-proof. Why not take the bigger nonce size?

AES-GCM or even CBC for that matter is not vulnerable/broken. Why did they recommend Secretbox? Is there an implementation error? I am not talking about the potential of making mistakes and using platform supported constructs.

Does it make sense to make this recommendation even if the dev did not choose a vulnerable algorithm and there aren't any issues with implementation?

First: it's not as simple as "broken" or "not broken". GCM in Go is provided through an AEAD abstraction that is in fact pretty close to secretbox, ergonomically. In Python, Fernet provides AES-CBC with HMAC-SHA2 with similar ergonomics. So you can't just look at the constructions in isolation.

Using CBC in a Go program would be bad indeed.

Second, while you can make CBC secure, it isn't secure by default. New designs should generally avoid CBC mode in favor of a mainstream AEAD. So while I'd happily recommend Fernet to people --- it also dates back to a time when AEAD ciphers were a little less mainstream than they've become --- I would see CBC as a design smell in a newer library.

In the document they say that AES-CBC is vulnerable to padding oracle attacks, and AES-GCM uses random nonces and requires key rotation after so many iterations.

CBC is vulnerable to error oracles if you don't encrypt-then-MAC it properly (without the MAC it's also malleable, which is a game-over flaw). GCM is vulnerable to a bunch of its own misuse issues; it doesn't "use" random nonces, it is conceivably (though not really realistically) unsafe to use random nonces, and if you screw up nonce handling it blows up worse than CBC does.
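The malleability half of that is easy to show. In CBC, the first plaintext block is the block-cipher decryption XORed with the IV, so flipping bits in the IV flips the same bits in the decrypted first block. The sketch below uses only the Go standard library; the message format and the function name forgeFirstBlock are invented for illustration, and there is intentionally no MAC.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// forgeFirstBlock encrypts a one-block message with AES-CBC and no MAC,
// then plays the attacker: it edits the IV (never touching the key) so that
// the receiver decrypts a different amount.
func forgeFirstBlock() (string, error) {
	key := make([]byte, 16)
	iv := make([]byte, aes.BlockSize)
	if _, err := rand.Read(key); err != nil {
		return "", err
	}
	if _, err := rand.Read(iv); err != nil {
		return "", err
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return "", err
	}

	plaintext := []byte("pay=0100 to=bob!") // exactly one 16-byte block
	ciphertext := make([]byte, len(plaintext))
	cipher.NewCBCEncrypter(block, iv).CryptBlocks(ciphertext, plaintext)

	// Attacker: turn the '0' at offset 4 into a '9' by XORing the IV
	// with the difference between the two characters.
	forgedIV := append([]byte(nil), iv...)
	forgedIV[4] ^= '0' ^ '9'

	// Receiver: decrypts with the forged IV and has no way to notice.
	out := make([]byte, len(ciphertext))
	cipher.NewCBCDecrypter(block, forgedIV).CryptBlocks(out, ciphertext)
	return string(out), nil
}

func main() {
	got, err := forgeFirstBlock()
	if err != nil {
		panic(err)
	}
	fmt.Println(got)
}
```

An encrypt-then-MAC construction that covers the IV and ciphertext, or an AEAD, rejects the modified message instead of decrypting it.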

My point is just, these things all have rough edges.

For me, the largest one mentioned is a known security issue with k8s architecture, which is the lack of support for certificate revocation.

So anywhere client certs are used for authN, if one is lost there's no way to revoke it, short of rolling the whole certificate authority.
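To illustrate what is missing, here is a toy sketch of the difference a revocation list makes. Everything in it is hypothetical: the clientCert type, the accepted function and the serial number are invented for the example, and this is not how Kubernetes or Go's crypto/x509 models certificates. Without a revocation list, the only things that stop a lost certificate from working are its expiry date or replacing the CA that signed it.

```go
package main

import (
	"fmt"
	"time"
)

// clientCert is a simplified stand-in for an X.509 client certificate.
type clientCert struct {
	Serial   string
	NotAfter time.Time
}

// accepted reports whether a certificate would be allowed to authenticate.
// A nil revoked map models a system with no revocation support at all.
func accepted(c clientCert, now time.Time, revoked map[string]bool) bool {
	if now.After(c.NotAfter) {
		return false // expired
	}
	return !revoked[c.Serial] // reading a nil map is safe and yields false
}

// demo checks one stolen, unexpired certificate with and without a list.
func demo() (withoutList, withList bool) {
	now := time.Date(2019, time.August, 9, 0, 0, 0, 0, time.UTC)
	stolen := clientCert{Serial: "0x2a", NotAfter: now.AddDate(1, 0, 0)}

	withoutList = accepted(stolen, now, nil)
	withList = accepted(stolen, now, map[string]bool{"0x2a": true})
	return withoutList, withList
}

func main() {
	withoutList, withList := demo()
	fmt.Println("accepted without revocation list:", withoutList)
	fmt.Println("accepted with revocation list:   ", withList)
}
```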

When you combine that with the 200k+ Internet exposed Kubernetes clusters, that's quite a large potential for attack.

The GH issue for this has been open since 2015 https://github.com/kubernetes/kubernetes/issues/18982

The Trail of Bits folks also open sourced the code behind their audit:

https://twitter.com/lojikil/status/1159190646478913536 https://github.com/trailofbits/audit-kubernetes

I know the Kubernetes Assessment was the one to make all the news, but the teams actually audited a bunch of CNCF projects. Here is the one for the Vitess project


Vitess is

> A database clustering system for horizontal scaling of MySQL

> Vitess combines many important MySQL features with the scalability of a NoSQL database. Its built-in sharding features let you grow your database without adding sharding logic to your application.

What a quirky project. Is this for folks who started out with MySQL then find themselves needing to scale out in "NoSQL" style?

> Vitess automatically rewrites queries that hurt database performance.

That sounds scary.

Vitess was created by Youtube.

But they're hardly the only ones scaling out MySQL. Facebook and Slack are two other prominent examples.

Slack actually uses Vitess to scale out its databases.

Facebook has taken MySQL scaling to extremes well beyond what Vitess offers.

Not sure if that's a good thing.

And to understand scaling and extremes: FB basically uses RocksDB and/or MySQL as a low level storage layer for whatever thing they want to. (And on top they build the clustering stuff, with the particular CAP choices they think are best for that particular service/purpose.)

It's part of the CNCF graduation criteria now, that any project which is going to "graduated" status has to have a 3rd party security review, so you should be able to get one for any of the projects in that category.

Cure53 did the Vitess audit. I think they've done others for the CNCF, too. The Kubernetes audit was done by Trail of Bits. It was a different team that did the assessment.

There is a GH issue tracking the findings from the report https://github.com/kubernetes/kubernetes/issues/81146

> Fix the hard-coded Docker daemon process name. The process name should be dockerd instead of docker.

Why is this a security issue? Also, beyond naming convention, why?

It's definitely a bug, not just a convention:

> The container manager used in kubelet checks for docker daemon process either via pidfile or process name. While the pidfile points to the docker daemon process PID, the dockerProcessName constant stores a docker cli name (docker) instead of docker daemon name (dockerd).

They're trying to look up the process by a name the process isn't using.
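A toy version of the bug, for anyone who wants to see why the wrong constant matters. This is not the kubelet's code; the process type, pidsByName and the sample process table are all hypothetical, and only the two names (docker and dockerd) come from the finding quoted above.

```go
package main

import "fmt"

// process is a simplified entry in a process table.
type process struct {
	PID  int
	Name string
}

// pidsByName returns the PIDs of all processes whose name matches exactly.
func pidsByName(procs []process, name string) []int {
	var pids []int
	for _, p := range procs {
		if p.Name == name {
			pids = append(pids, p.PID)
		}
	}
	return pids
}

// sampleProcs models a host where only the Docker daemon is running.
// The docker CLI is a short-lived client, so it is usually not in the table.
func sampleProcs() []process {
	return []process{
		{PID: 1, Name: "systemd"},
		{PID: 812, Name: "dockerd"},
		{PID: 901, Name: "containerd"},
	}
}

func main() {
	fmt.Println("lookup by CLI name:   ", pidsByName(sampleProcs(), "docker"))
	fmt.Println("lookup by daemon name:", pidsByName(sampleProcs(), "dockerd"))
}
```

With the CLI name the lookup comes back empty, so whatever was meant to be applied to the daemon process silently applies to nothing.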

I think the HTTP proxy based architecture is just weird and inherently insecure. Everything would be much simpler and easier to analyze in a normal end-to-end scenario.

I think that's because of the WebSockets support in kubectl so you can tunnel things, but it's been a long time since I read about it.

Given all of the security issues, I am curious who thinks this is production ready?

Not sure why this is being downvoted.

Kubernetes is 5 years old. This is very, very young for mission-critical infrastructure management software.

Having a certain level of doubt in young open source projects is responsible, in my opinion. I'm interested to hear other people's perspective on production-readiness of k8s for mission-critical applications.

If security got to be the number one concern for whether things were deployed or not, then sure we could likely take a more conservative view.

However realistically k8s is in heavy deployment in a wide variety of industries including public sector, financial services, retail, technology ... and it's clear that this kind of concern is not the primary consideration.

There were banks in the UK (Monzo) deploying k8s almost 3 years ago (https://monzo.com/blog/2016/09/19/building-a-modern-bank-bac...)

The tradeoffs Monzo made are not ones that apply to most businesses. For most businesses, you have a profitable and sustainable model and you want to mitigate the possibility that you sink the ship by screwing the pooch on security or availability.

Monzo, on the other hand, was default-dead, so betting the farm on a relatively unproven technology perhaps wasn't risking as much. Nobody talks about the startups that used unproven tech and sank.

I don't think Monzo had to adopt k8s to survive. It's an infrastructure technology, not something which provides a unique advantage from an application development perspective.

Also, k8s is far from only used in tech companies. The UK Home Office (not exactly a startup) was giving talks about its use of k8s in 2016 https://www.phpconference.co.uk/videos/2016/kubernetes-home-...

In most other industries, saying something is in "heavy development" is usually the same as "unstable". (Unstable is usually interpreted as Bad in software engineering -- but the dictionary definition of "unstable" only means "prone to change", which I think is an accurate characterization of k8s considering its degree of maturity.)

Whether or not something is a smart choice to use in mission-critical production applications doesn't depend on the number of big banks or big tech companies that use the technology.

At the end of the day, Kubernetes is a tool that will change very rapidly over the next 5 years. I could see k8s being a decent choice to use in a tech project that you expect to actively maintain and improve for the next 5+ years, AND if you (and your developers) are willing to invest time (potentially a lot of time) every year keeping up to speed with how k8s evolves through every version release. That's the primary risk in using something like k8s.

Sure, rapid development is likely to mean lots of change, but k8s is far from alone in that regard.

The last decade has been dominated by rapid adoption of technologies that were under heavy development at the time, from Ruby on Rails, to Node.js, to Golang, to Rust.

The simple reality of modern IT is that companies are unwilling to wait until a technology has stabilized before making use of it.

Personally I'd rather they did, but my opinion has little weight in that regard.

And it's clear from the number of data dumps of personal information online that this needs to change.

A good example is the recent offer for 147 million people to get 125 dollars each out of a pool of 31 million (Equifax).

As an industry, IT is simply shamefully shoddy.

Kubernetes has already seen far more production-hours of operation than most infrastructure management software will ever see. Age is no substitute for experience.

Which of these issues strikes you as being a meaningful risk to a realistic deployment of Kubernetes?

The better question is: when you deploy k8s in production, how do you ensure none of the risks are being exploited?

Given today's landscape of hardware and software exploits, adding a complex orchestration layer with identified issues seems like less than prudent behavior.

I currently work on Kubernetes in production and am migrating large clients onto these systems. I see a distinct lack of knowledge around securing systems, even more so once Kubernetes is added.

As with anything in security, this will depend on your threat model and the benefits of using k8s.

There is some good information about securing k8s around, although we could always do with more.

There's a free O'Reilly eBook on k8s security from Aqua (https://info.aquasec.com/kubernetes-security)

Also the CIS benchmark for k8s is reasonably up to date, although could use expanding, which should be happening for the next version.

On top of that there have been quite a few conf. talks now about k8s security https://www.youtube.com/playlist?list=PLKDRii1YwXnLmd8ngltnf... for some examples

I'm not running antagonistic workloads in k8s though, I'm just running my own junk, each component of which also has its own laundry list of security nightmares.

Are these more / worse security issues than anything else we call production-ready? The Linux kernel is full of security bugs, for instance.

So adding more exploits is better?

OpenBSD's security model is sound. Simplify and secure.

"Only two remote holes in the default install, in a heck of a long time!"

Due respect to smart acquaintances who work on OpenBSD, but to most people who secure application deployment environments, this is not the reassuring statement OpenBSD seems to think it is.

To be fair, hardly anyone uses OpenBSD compared to Kubernetes. And last I checked, most OpenBSD services are disabled by default, which makes it hard to break into, but unusable in its default state.

Who has a better track record?

The problem is that it isn't a track record.

What's funny about it is, if you're going to make up a benchmark (and theirs is contrived; it was "no remote vulnerabilities", as I recall, when I was involved with the project, then "no remote vulnerabilities in the default install", then "only one remote vulnerability in the default install"), make up one where your number is zero, not "just 2 in a heck of a long time".

But more substantively: the reason you run an operating system is to do stuff on it. It isn't 1996 any more and nobody gets public shell accounts on Linux systems or OpenBSD systems; similarly, remotely-exploitable vulnerabilities in other operating systems are also exceedingly rare, and so OpenBSD's benchmark excludes the LPEs that actually make up the meaningful attack surface of a modern OS.

What's a more important question is what features the operating system provides to harden the non-default programs that inevitably have to run on it. OpenBSD has historically lagged here, though they're upping their game recently.

Despite briefly being involved with the project during "The OpenBSD Security Audit" in the late 1990s, I have a longstanding bias against OpenBSD that I should be up front with: we shipped an appliance on OpenBSD at Arbor Networks, and I spent several days debugging a VMM problem that would zombify pages of memory and gradually suffocate our systems. When I presented evidence to Theo, he said (not a literal quote) "don't bother me about this, Chuck Cranor" --- I think it's Chuck Cranor but could be wrong --- "wrote this VMM as his graduate project and I've got nothing to do with it". For whatever that's worth, I've felt OpenBSD is an unserious option for deploying real systems other than near-stateless network middleboxes ever since.

If we have to count the exploits in every new thing against some grand total of allowable exploits then there will never be new things. The question was not whether k8s added to the universe of exploits, but whether the exploits make it unready for production. Personally I was more bothered by some of the code quality issues than the list of specific high severity exploits. It's a large project and issues like this will be found.

When the complexity of the attack surface gets to the degree of k8s I would say that is a problem.

The fact is that very few people, and I do mean very few, understand the low-level functions going on (like the multiple layers of NAT via iptables), and they are simply struggling to keep it running. It's pretty obvious they aren't qualified to run this in production.

I have been at Google HQ in Kubernetes discussions and it's frightening how little people know about the internals of it.

These aren't amateurs off the street either.

We already depend upon layer after layer of highly complex software. I'd argue that the complexity of k8s is not out of line with its scope. I don't want to get into a debate about specific things like netfilter. Yeah it's an odd setup and full of warts, but it's completely pluggable. On GKE for example you can now run in a mode where the pod networking is handled as a VPC subnet with load balancing directly to pods. And that's sort of the point: it's the maturing abstractions that are valuable, not the specific implementation of a part like networking.

As for struggling to run it, our experience has been different. Granted we're a small user. Our largest cluster has just over 100 nodes. Our highest volume service hits about 15k req/sec at peak. We're on GKE which is a well-managed implementation and that also makes it less risky. In two years of production the platform has been extremely reliable. Moreover we've been able to do things that would have been a lot harder before, such as autoscaling the service I mentioned above so that we're not paying for capacity we don't need off peak.

I agree: if you are using k8s, GKE is the place to be.

Let the experts who designed it run it for you, almost like it was planned that way :)

You keep saying that the attack surface is high, but is it higher than all other software we consider suitable for this purpose?

Does anyone understand the JVM and servlet containers? Does anyone understand OpenSSL's state machine? Does anyone understand hardware load balancers? Does anyone understand speculative execution? Does anyone understand the Postgres query planner? Does anyone understand all the same-origin policies? Does anyone understand their laptop's power supply?

I've seen a lot of people build a lot of successful systems on things they don't know every detail of, even when not knowing those details is quite dangerous. That Kubernetes is yet another one of these building blocks isn't an indictment of Kubernetes, it's an indictment of the compulsion to understand everything.

The problem is you don't replace all of the other vectors with k8s. You add more to them.

When the entirety is so complex that seasoned engineers shrug when you ask what is wrong with the stack, you have a problem.

Can you name one security vulnerability from this document that, in a functionally-similar architecture that used OpenBSD and didn't use Kubernetes, would have been prevented by OpenBSD's security model?

("Don't build the system you want to build, build the system I want you to build" isn't an answer.)

sure... all of them.

Thing is, everyone I have worked with uses k8s because it's the new cool toy. None of them have a requirement to create a large expensive platform which costs more than simple hardware so a company can bring products to market faster.

Everyone thinks they can save money with k8s. You won't. Especially in AWS.

It's production ready; folks have been running it in production and will keep running it in production. Sure, it has issues from inside the cluster. But if you secure it and it's not accessible from outside, it's good to go. Probably more secure than trying to run 500 boxes at once.

So you think Kubernetes eliminates the need to run hardware, or you think it reduces the number of machines?

I think it adds overhead but does allow you to maximize server density and usage, letting you use all 500 machines more effectively.

And it adds a plethora of attack vectors

Out of curiosity, which k8s attack vectors do you think are particularly concerning?

Yes, it significantly reduces the number of machines. That's the main benefit. You can binpack your pods by sizing them well and maxing out resources on each machine.
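The binpacking claim can be sketched with a classic heuristic. The code below is first-fit decreasing, a textbook bin-packing approximation, and is explicitly not the Kubernetes scheduler's algorithm; the function name nodesNeeded, the millicore CPU requests and the 1000m node capacity are all made-up numbers for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// nodesNeeded packs pod resource requests onto identical nodes using
// first-fit decreasing and returns how many nodes end up in use.
// It assumes every request is <= capacity.
func nodesNeeded(requests []int, capacity int) int {
	// Sort a copy from largest to smallest so big pods are placed first.
	sorted := append([]int(nil), requests...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))

	var free []int // remaining capacity on each node already in use
	for _, r := range sorted {
		placed := false
		for i := range free {
			if free[i] >= r {
				free[i] -= r
				placed = true
				break
			}
		}
		if !placed {
			free = append(free, capacity-r) // start a new node
		}
	}
	return len(free)
}

func main() {
	// Six hypothetical pods, CPU requests in millicores, on 1000m nodes.
	pods := []int{500, 700, 200, 300, 900, 400}
	fmt.Println("one pod per machine:", len(pods))
	fmt.Println("first-fit decreasing:", nodesNeeded(pods, 1000))
}
```

For these numbers the packing uses four nodes instead of six. Whether a real cluster sees that kind of saving depends on how accurate the requests are and how much overhead the platform itself adds, which is exactly what the surrounding comments disagree about.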

I completely disagree based on multiple deployments at scale. You use more hardware with k8s, not less.

We keep finding 0-days and security issues in yourfavoritesoftware. Who thought yourfavoritesoftware was production ready?

So you agree keeping things simple to reduce attack vectors is the best way to go?

Seems like a bunch of nonsense burgers. But then I realized they are all just low impact.

There's a nice table with five high security issues. The lack of authentication within the cluster is pretty damning.

Yeah, that one is kind of interesting; it really needs more detail. I think what they're talking about is that it's possible to configure insecure connections between the different components.

However, if that's the case, that's a distribution specific issue and not really anything intrinsic in k8s.

Edit - there's a GH issue here https://github.com/kubernetes/kubernetes/issues/81112

OK, I misread it and only found mediums. Sorry for being wrong; where are the high ones?

Yeah, I just don't buy it. The fact that you can use any TLS cert is obviously how it should work. It complains loudly.
