
GKE Sandbox: Independent operating system kernel for each container - alpb
https://cloud.google.com/blog/products/containers-kubernetes/gke-sandbox-bring-defense-in-depth-to-your-pods
======
WestCoastJustin
For anyone who hasn't seen this before: there is a pretty good gVisor
Architecture Guide that explains how this works via a few diagrams [1]. Lots
more info on these pages too [2, 3].

> _gVisor intercepts application system calls and acts as the guest kernel,
> without the need for translation through virtualized hardware. gVisor may be
> thought of as either a merged guest kernel and VMM, or as seccomp on
> steroids. This architecture allows it to provide a flexible resource
> footprint (i.e. one based on threads and memory mappings, not fixed guest
> physical resources) while also lowering the fixed costs of virtualization.
> However, this comes at the price of reduced application compatibility and
> higher per-system call overhead._

From what I understand, it's basically a user-space program that wraps your
container and intercepts all system calls. You can then allow/deny/re-wire
them (based on a config), so you have pretty much complete control over what
your apps can do.
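
To make the interception idea concrete, here's a toy tracer (my own sketch,
not gVisor code; gVisor's ptrace platform uses PTRACE_SYSEMU and is far more
involved). It only logs each syscall, but the hook point is where a sandbox
could deny or re-wire the call. Linux/amd64 only:

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"syscall"
)

func main() {
	// ptrace requests must all come from the tracing OS thread.
	runtime.LockOSThread()

	cmd := exec.Command("/bin/true")
	cmd.SysProcAttr = &syscall.SysProcAttr{Ptrace: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	pid := cmd.Process.Pid

	var ws syscall.WaitStatus
	syscall.Wait4(pid, &ws, 0, nil) // child is stopped at exec

	var regs syscall.PtraceRegs
	for {
		// Resume the child until its next syscall entry or exit.
		if err := syscall.PtraceSyscall(pid, 0); err != nil {
			break
		}
		if _, err := syscall.Wait4(pid, &ws, 0, nil); err != nil || ws.Exited() {
			break
		}
		if err := syscall.PtraceGetRegs(pid, &regs); err != nil {
			break
		}
		// Hook point: a sandbox could inspect, rewrite, or (as gVisor
		// does) fully service the call before the host kernel sees it.
		fmt.Printf("intercepted syscall %d\n", regs.Orig_rax)
	}
}
```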

This, for me, is the key takeaway from the blog post too: "_because we use
gVisor to increase the security of Google's own internal workloads, it
continuously benefits from our expertise and experience running containers at
scale in a security-first environment_". So Google is using this internally
for its own workloads too, which should be a pretty good sign it works in real
life.

[1]
[https://gvisor.dev/docs/architecture_guide/](https://gvisor.dev/docs/architecture_guide/)

[2] [https://github.com/google/gvisor](https://github.com/google/gvisor)

[3] [https://gvisor.dev/](https://gvisor.dev/)

~~~
prattmic
> From what I understand, basically a user-space program that wraps your
> container and intercepts all system calls. You can then allow/deny/re-wire
> them (based on a config).

gVisor actually intercepts and _implements_ the system calls in the user-space
kernel. Two specific goals of gVisor are that (1) system calls are never
simply allowed and passed through to the host kernel, and (2) you don't need
to write a policy configuration for your application; just put your
application inside gVisor and go. These are significant differences over
simply using something like seccomp on its own (what the architecture guide
calls "Rule-based execution").

Some of this is covered in our security model:
[https://gvisor.dev/docs/architecture_guide/security/#princip...](https://gvisor.dev/docs/architecture_guide/security/#principles-defense-in-depth)
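
A hypothetical sketch of the distinction (mine, not gVisor's actual code): in
a user-space kernel, every intercepted syscall number maps to an in-process
implementation, and there is no "pass through to the host" action at all:

```go
package main

import "fmt"

// handler is a syscall implemented entirely in the user-space kernel.
type handler func(args [6]uintptr) (uintptr, error)

// A made-up dispatch table. Note there is no default "forward to host"
// branch; an unknown syscall simply fails closed.
var table = map[uintptr]handler{
	39: func(args [6]uintptr) (uintptr, error) { // getpid(2) on amd64
		return 4242, nil // answered from the sandbox's own task state
	},
}

func dispatch(num uintptr, args [6]uintptr) (uintptr, error) {
	if h, ok := table[num]; ok {
		return h(args)
	}
	return 0, fmt.Errorf("ENOSYS: syscall %d not implemented", num)
}

func main() {
	pid, _ := dispatch(39, [6]uintptr{})
	fmt.Println("guest-visible pid:", pid)
}
```

Contrast this with seccomp, where the verdict for an allowed call is precisely
"let the host kernel handle it".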

~~~
saagarjha
Reimplementing system calls is non-trivial, especially ones that have complex
interactions with others (for example, the system calls related to process
management). How do you prevent errors when translating this, and how do you
implement features that ostensibly require calls to the OS anyways?

~~~
prattmic
For sure, implementing Linux is no easy task, and there is no magic bullet.
For compatibility testing, we have extensive system call unit tests [1] and
also run many open source test suites. Language runtime tests (e.g., Python,
Go, etc.) are particularly useful. We also perform continuous fuzzing with
Syzkaller [2].

> how do you implement features that ostensibly require calls to the OS
> anyways?

gVisor's kernel is a user-space program, so it can and does make system calls
to the host OS. Some examples:

* An application blocks trying to read(2) from a pipe. gVisor ultimately implements blocking by waiting on a Go channel. The Go runtime will ultimately implement this with a futex(2) call to the host OS.

* An application reads from a file that is ultimately backed by a file on the host (provided by the Gofer [3]). This will result in a pread(2) system call to the host.

The purpose here isn't to avoid the host completely (that's not possible), but
to limit exposure to the host. gVisor can implement all the parts of Linux it
does on a much smaller subset of host system calls. Anything we don't use is
blocked by a second-level seccomp sandbox around the kernel; e.g., the kernel
cannot make obscure system calls, or even open files or create sockets on the
host (those operations are controlled by an external agent).
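
For a feel of what that second layer looks like, here is a minimal sketch
using the libseccomp-golang bindings (as I understand it, gVisor generates its
own filters in-tree rather than using libseccomp, and the real allowlist is
much larger; the syscall list below is illustrative only):

```go
package main

// Requires libseccomp via cgo (github.com/seccomp/libseccomp-golang).
import (
	seccomp "github.com/seccomp/libseccomp-golang"
)

func installFilter() error {
	// Default action: fail any syscall not on the allowlist with EPERM (1)
	// instead of delivering it to the host kernel.
	filter, err := seccomp.NewFilter(seccomp.ActErrno.SetReturnCode(1))
	if err != nil {
		return err
	}
	// Illustrative allowlist: futex for blocking, pread64 for Gofer-backed
	// reads, plus a few basics any Go program needs.
	for _, name := range []string{"futex", "pread64", "read", "write", "mmap", "exit_group"} {
		sc, err := seccomp.GetSyscallFromName(name)
		if err != nil {
			return err
		}
		if err := filter.AddRule(sc, seccomp.ActAllow); err != nil {
			return err
		}
	}
	return filter.Load()
}

func main() {
	if err := installFilter(); err != nil {
		panic(err)
	}
	// From here on, anything outside the allowlist fails with EPERM.
}
```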

[1]
[https://github.com/google/gvisor/tree/master/test/syscalls/l...](https://github.com/google/gvisor/tree/master/test/syscalls/linux)

[2] [https://github.com/google/syzkaller](https://github.com/google/syzkaller)

[3]
[https://gvisor.dev/docs/architecture_guide/overview/](https://gvisor.dev/docs/architecture_guide/overview/)

~~~
mav3rick
How is this different from a nicer UI over a seccomp filter for your container?

------
raesene9
This is a really interesting add-on to GKE and I'm glad to see vendors
starting to offer a variety of container runtimes on their platforms.

That said, I'm really not a fan of the opening line where it references the
old trope that "containers don't contain".

The idea that it's trivial to break out of any Docker-style container just
doesn't reflect reality.

Have there been vulns that allow for container breakout? Sure there have, but
every piece of software (including gVisor) has had vulnerabilities in it.

What you can say about gVisor is that it likely presents a smaller attack
surface in its default configuration than a runc style Docker container.

However, of course, there's nothing to stop people tightening up on the
defaults and still using runc.

As an aside for anyone who thinks container breakouts are trivially easy, you
can go to [https://contained.af](https://contained.af) and win yourself some
money :)

~~~
amscanne
(I'm a co-author of the blog post)

I generally agree re: the trope, but it's useful because I'm not sure the core
idea is widely understood outside security circles. Many people assume that
containers provide a strong isolation boundary, and while a break-out is not
trivial, providing more isolation is important in some cases, as you allude
to.

While one option is certainly to provide a locked-down policy, monitor the
flow of kernel CVEs, and patch constantly, this may not be feasible for many
organizations if a) they lack the technical expertise or b) they don't know
the workloads they're running a priori and can't apply a fixed policy.

So different container runtimes are about providing additional tools for
defense-in-depth. (VMs are a fantastic tool for this, but it's also nice to
have tools that play well in containerized infrastructure other than custom
security policies.) None of these tools will be perfect, of course, but
hopefully they can make it easier to improve on the status quo.

Re: contained.af, this is a great example of the workloads problem. If you
have a known workload where you can essentially disable all capabilities and
access to system resources (e.g. no network), there are many options for
securing that workload. They aren't all generalizable.

~~~
raesene9
Oh, I'd agree, and gVisor (IMO) presents a smaller attack surface than a
default runc container.

With that said, both options, and indeed hypervisor-based isolation, are
generally one security flaw away from a breakout vulnerability, so the only
difference in that respect is the incidence of those flaws.

My experience of people's expectations of container isolation is perhaps
somewhat different to yours, which is what prompted my initial comment.

It's all too common (in my experience) to see container isolation dismissed
with that "containers don't contain" trope, which feels frustrating to me, as
the real picture is much more nuanced than that.

It's all about choosing the right isolation technology for a) a given workload
and b) a given threat model/attack surface.

There are tradeoffs (both in terms of performance, and in terms of
flexibility) in replacing the runc layer with a different container runtime.
Sometimes those will make sense, other times not so much :)

All that said, I'm very excited to see more options here, as it'll give
everyone the choice of what mechanism works for them for specific workloads.

------
muricula
GKE Sandbox/gVisor syscall performance is at least 100x worse than
virtualization [0], which is huge. Why shouldn't I just run everything in a
VM/LXC container instead? Is it worth proxying everything through your syscall
broker when I can just trust my hypervisor to be a security boundary instead?

[0]:
[https://gvisor.dev/docs/architecture_guide/performance/](https://gvisor.dev/docs/architecture_guide/performance/)

~~~
amscanne
(I am co-author of the post)

System calls are important, but only one factor. The linked doc is an attempt
to clarify and delineate the various costs. There are a number of platform
options (the platform is what does syscall interception), and I don't believe
any of them are 100x, so to say "at least" is a bit disingenuous. You may have
confused the "runsc-kvm" number with "using a VM". "runsc-kvm" is the system
call performance of gVisor using the KVM platform, which is not a full VM [1].
In general, the syscall cost in a VM depends entirely on the guest OS, since
there is no VMEXIT for this operation.

VMs are a valid choice depending on your workload; this is an additional tool
that provides an easy control for containerized infrastructure. You can use
what works for you. Native containers certainly work as well, but you'll
probably want to consider additional security controls of some form if you're
really running untrusted stuff in there.

[1]
[https://github.com/google/gvisor/tree/master/pkg/sentry/plat...](https://github.com/google/gvisor/tree/master/pkg/sentry/platform/kvm)

~~~
muricula
You're right, that's not the chart I wanted to see. I'm just dubious that
reimplementing lots of the Linux kernel in Go while paying the cost of ptrace
interception is worthwhile. It seems like you're just adding a lot of attack
surface (admittedly, managed code > native code) with a large perf impact. Do
you have any docs on how the kvm-runsc platform works? Skimming the files, I
don't see some of the bits necessary for a bluepill-style hypervisor, so I'm
not sure why parts are named bluepill in there. I also don't see a lot of the
Linux kernel paravirt vdev code I would expect, and you seem to imply that
you're not telling KVM to enable syscall trapping for the guest.

~~~
amscanne
I'm not sure what you mean by KVM syscall trapping for the guest. The bluepill
refers to the fact that the Sentry runs transparently in VMX non-root ring 0
and regular host ring 3.

I'm not sure what to provide re: docs -- the code is all there and reasonably
documented, and there are discussions on the public groups about how the KVM
platform works. I feel a bit like you're coming in with a specific set of
ideas and skimming files (e.g. the performance guide and the code itself) in
order to confirm an existing understanding, but it's just not working.

I'd love more precise criticisms re: adding to the attack surface, but
otherwise I'm not sure how I can help.

~~~
muricula
I'm very skeptical about the platform and don't have the time I would like to
devote to reading the codebase or having conversations. The TL;DR is that the
syscall interception technique seems expensive, and I wonder if you will write
all sorts of logic bugs in the Sentry broker. It seems like you folks care
about security and have some good ideas, but if you really care about hostile
multi-tenant containers, why not stick the container in a VM and call it a
day?

~~~
yoshiat
I replied in other comments, but our talk at Next '19 [1] includes a story by
one of our customers, which may help explain the use cases. In a nutshell, GKE
Sandbox should allow sharing the resources of GKE Nodes (VMs) among multiple
tenants.

[1]
[https://www.youtube.com/watch?v=TQfc8OlB2sg](https://www.youtube.com/watch?v=TQfc8OlB2sg)

------
whalesalad
Isn't this server-side React rendering? What are we doing?

We started with virtual machines and then thought, no, we can share a kernel
and do this without the overhead. Now we want each of our containers to have
its own kernel. This is full circle... why not just fire up a VM? Am I
missing something?

Firecracker doesn't have the product vision behind it to do this, but at some
point we will have a microvm technology with the ergonomics of containers and
then we'll be WAY closer to true portability and better security.

~~~
amscanne
(I'm a co-author of the blog post)

Many functions of the kernel are still effectively shared: memory management
(e.g. reclaim, swap), thread scheduling, etc. The application is simply
limited in its ability to interact with the shared kernel, and functionality
related to system APIs is isolated. I'd argue this is still closer to the
ergonomics of containers, but with compatibility and performance trade-offs.

------
nullwasamistake
The only advantage I see to containers over VMs is RAM sharing. Beyond that,
hardware VMs are better performing and much more secure.

gVisor is just another flavor of containers that replaces kernel interfaces
with a Go shim layer to reduce the attack surface in return for worse
performance.

If somebody could hack RAM sharing/overcommit into traditional VMs, all this
container nonsense could be dispensed with. Containers are a virtualization
layer, just like the old days when we used the JVM to run "safe" applets on
client machines. Like the JVM, the attack surface will always be huge and the
security issues nearly endless.

------
bogomipz
I have read that all containers at Google run inside of a VM, and indeed the
article mentions that gVisor is in use in things like App Engine and their
internal workloads.

So if containers on GKE were already being spun up inside lightweight VMs,
what does allowing customers to select the gVisor runtime offer beyond
whatever Google's existing lightweight VM already provides?

~~~
thesandlord
Other way around: everything at Google runs inside a container, including the
VMs.

gVisor lets you run multiple untrusted workloads on the same VM, in this case
a GKE node.

~~~
bogomipz
What would running a VM inside a container provide in terms of security and
isolation that just running a VM would not?

This ACM article from a few years ago, written by folks who worked on
Borg/Omega/Kubernetes, states:

>"The isolation is not perfect, though: containers cannot prevent interference
in resources that the operating-system kernel doesn't manage, such as level 3
processor caches and memory bandwidth, and containers need to be supported by
an additional security layer (such as virtual machines) to protect against the
kinds of malicious actors found in the cloud."[1]

Also see slide 13 of Joe Beda's talk from five years ago, which shows the
container running in a VM, not the other way around:

[https://speakerdeck.com/jbeda/containers-at-scale?slide=13](https://speakerdeck.com/jbeda/containers-at-scale?slide=13)

[1]
[https://queue.acm.org/detail.cfm?id=2898444](https://queue.acm.org/detail.cfm?id=2898444)

~~~
thesandlord
(I work for GCP)

It looks something like this:

your container -> Compute Engine VM (GKE Node) -> container -> Borg

The container on top of Borg is used for scheduling and management. Joe's talk
has a slide on this. As a GCP customer, you never have to worry about this, as
it is an implementation detail.

>"The isolation is not perfect, though: containers cannot prevent interference
in resources that the operating-system kernel doesn't manage, such as level 3
processor caches and memory bandwidth, and containers need to be supported by
an additional security layer (such as virtual machines) to protect against the
kinds of malicious actors found in the cloud."

As a GCP customer using GKE, your applications are separated from other GCP
customers' applications using VMs.

However, if you want to run your OWN untrusted workloads, then in the past you
would have to spin up a separate VM for untrusted workload A and another VM
for untrusted workload B.

This sucks in terms of resource utilization. It would be better in many cases
if you could run workload A and B on the same VM. That's where gVisor comes
into play.

your untrusted container -> gVisor -> Compute Engine VM (GKE Node) ->
container -> Borg
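
Roughly, opting a pod into the sandbox is one field on the pod spec. A sketch
using the Kubernetes Go types (this assumes the "gvisor" RuntimeClass that
GKE Sandbox sets up on sandbox-enabled node pools; the pod name and image are
made up):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	runtimeClass := "gvisor"
	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "untrusted-workload-a"},
		Spec: corev1.PodSpec{
			// The only sandbox-specific line; everything else is a
			// normal pod.
			RuntimeClassName: &runtimeClass,
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "example.com/untrusted:latest",
			}},
		},
	}
	out, err := yaml.Marshal(&pod)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```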

I hope this makes sense!

~~~
bogomipz
Thanks for the explanation, this makes sense yes.

>"The container on top of Borg is used for scheduling and management."

Is this the "open source node container manager" box on slide 13 then? I'm
guessing this is the Borg's version of the kubelet then?

[https://speakerdeck.com/jbeda/containers-at-scale?slide=13](https://speakerdeck.com/jbeda/containers-at-scale?slide=13)

~~~
yoshiat
That's a very old slide :) I "guess" the slide deck was talking about
[https://cloud.google.com/compute/docs/containers/deploying-c...](https://cloud.google.com/compute/docs/containers/deploying-containers)

~~~
bogomipz
I see. So is "the container on top of Borg is used for scheduling and
management" the Borg equivalent of the K8S kubelet then?

~~~
yoshiat
As you can see from the Borg paper [1] and the name, "borglet" is the closest
component to "kubelet".

[1]
[https://pdos.csail.mit.edu/6.824/papers/borg.pdf](https://pdos.csail.mit.edu/6.824/papers/borg.pdf)

------
metta2uall
I quite like this defense-in-depth approach, but it's disappointing that it
will only be available as part of the probably expensive GKE Advanced. I would
have thought safety features should be standard...

~~~
dilyevsky
I think the control plane is free now either way?

~~~
metta2uall
Well, gVisor doesn't use the control plane. The control plane is free, but I
wouldn't think gVisor has a high CPU or memory load, so Google would make a
lot of profit on the nodes.

~~~
dilyevsky
I know, but they may conceivably just charge a fixed fee for enabling that
option on the node pool.

> it has a high cpu or memory load, and Google would make a lot of profit on
> the nodes.

They currently solve that problem by having their node VMs melt down at like
50% utilization so you have to run everything with huge padding.

------
andrewstuart
Sandboxed containers with kernels - so what's the difference now between this
and a fully isolated virtual machine?

Another approach might be to make virtual machine technology more like
containers. Then the two shall meet.

~~~
edoo
I didn't dig into the implementation details, but the term para-HVM came to
mind: not quite paravirtual but not quite full HVM. Perhaps if security is a
real issue, then HVM is the only real choice.

------
conroy
For those more familiar with Kubernetes and gVisor, would this allow me to
build a CI/CD service that runs untrusted user code?

~~~
raesene9
Well, like all things in security, that kind of depends :)

What gVisor does is provide a smaller attack surface to a containerized
process, compared with a "traditional" Docker container using the standard
Docker setup (you can, of course, harden Docker containers considerably from
the baseline, if you are so inclined).

However, it doesn't affect anything outside of that interface. So, for
example, if your CI/CD process is running on a network that has other insecure
services on it, gVisor alone won't really help you if malicious code executed
inside a container allows an attacker to start probing the environment from
the perspective of that container.
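
One common complement (not a gVisor feature) is a default-deny egress policy
on the namespace the jobs run in. A sketch using the Kubernetes Go types; the
namespace and policy names here are made up for illustration:

```go
package main

import (
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Selects every pod in the namespace and declares an Egress policy
	// with no rules, which denies all egress: a compromised build job
	// can't probe neighbouring services.
	policy := networkingv1.NetworkPolicy{
		TypeMeta:   metav1.TypeMeta{APIVersion: "networking.k8s.io/v1", Kind: "NetworkPolicy"},
		ObjectMeta: metav1.ObjectMeta{Name: "deny-all-egress", Namespace: "ci-jobs"},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{}, // empty selector = all pods
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeEgress},
		},
	}
	out, err := yaml.Marshal(&policy)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```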

------
fulafel
gVisor is easy to run with Docker on your local dev laptop too. It's a nice
alternative to running Docker in a VM, if you prefer a security boundary
between random Docker containers you get off Docker Hub and your host machine.

After you've built or downloaded the gVisor single binary to /usr/local/bin/
or wherever, just put the snippet provided in the gVisor README in the Docker
settings file ("runtimes": {...}), and Bob's your uncle.
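
For reference, the snippet in question goes in /etc/docker/daemon.json and
looks roughly like this (check the gVisor README for the current form):

```json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc"
    }
  }
}
```

Then restart dockerd and run containers with `docker run --runtime=runsc ...`.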

------
ronsor
At this point, why not just use a virtual machine? We've come full circle!

~~~
jacques_chester
Mostly the performance characteristics. A virtual machine presenting as a
_machine_ needs an operating system to be useful. Most operating systems have
long-ingrained assumptions about the nature of the world, such as:

"There is a time when I go from power-off to power-on, and it is rare, so I
may perform expensive operations then to amortise their cost over running
time".

or

"While running, time does not skip and hardware does not change".

The practical upshot being that the OS needs to be booted from scratch in a
number of scenarios.

But it's not the OS that provides value. It's a means to an end, and that end
is to run software. Most software written to run on OSes _also_ has ingrained
assumptions, such as "I will come to be launched on a fully-booted system".

Containers move the virtualisation up from hardware to the OS API surface.
Because the cost of booting is now amortised over all containers running on
the system, the original assumptions of both OS designers and software
designers become, approximately, true again.

So you're right, we came full circle, but not to a point that means "use
fully-dressed VMs again".

