
Open-sourcing gVisor, a sandboxed container runtime - rbanffy
https://cloudplatform.googleblog.com/2018/05/Open-sourcing-gVisor-a-sandboxed-container-runtime.html
======
kyrra
There is an interesting tool in their GitHub repo called go_generics[0]. It
looks like it transforms a Go source file and writes out a new file, doing
name replacement and prefixing/suffixing for variable, method, and type names.

[0]
[https://github.com/google/gvisor/blob/master/tools/go_generi...](https://github.com/google/gvisor/blob/master/tools/go_generics/generics.go)
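The core transformation, instantiating a template by systematic renaming, can be sketched in a few lines of Python (the `Stack`/`T` names here are hypothetical, and the real tool parses Go source into an AST rather than doing regex substitution):

```python
import re

def instantiate(template: str, type_name: str, suffix: str) -> str:
    """Toy sketch of template instantiation by name substitution:
    replace the placeholder type 'T' with a concrete type, then add a
    suffix to the template's declared names so multiple instantiations
    can coexist in one package. Illustrative only."""
    out = re.sub(r'\bT\b', type_name, template)
    out = re.sub(r'\b(Stack)\b', r'\g<1>' + suffix, out)
    return out

template = """type Stack struct { items []T }
func (s *Stack) Push(v T) { s.items = append(s.items, v) }"""

print(instantiate(template, "int", "Int"))
```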

~~~
wjkohnen
There are several open source Go projects out there that implement generics by
templating.

[https://awesome-go.com/#generation-and-generics](https://awesome-go.com/#generation-and-generics)

------
catern
Since I see that some of the developers are in this thread, I'll post my
question to them here.

What are your plans to deal with the overhead and nastiness of ptrace? Beyond
the performance losses, there's also the annoyance that you can't ptrace a
single task twice, so no debuggers.

Are you familiar with the FlexSC paper? Have you considered using a FlexSC-
like RPC interface (a secure one, of course) to achieve your syscall
interception, instead of ptrace? That would allow you to not just match the
performance of native system calls, but even theoretically exceed their
performance, while still having the same level of control. (I have been
working on such an approach, so I was excited to see this gvisor project
posted - I hoped you might have already done this and saved me some work :))

Not sure how far this project can go if it sticks with ptrace...

~~~
amscanne
(As requested, I work on this project.)

To correct one misconception, the project is not bound to ptrace. There is a
generic platform interface, and the repository also includes a KVM-based
platform (wherein the Sentry acts as guest kernel and host VMM simultaneously)
described in the README. The default platform is ptrace so that it works out
of the box everywhere.

> What are your plans to deal with the overhead and nastiness of ptrace?
> Beyond the performance losses, there's also the annoyance that you can't
> ptrace a single task twice, so no debuggers.

It's true that you can't trace the sandbox itself, and that's annoying, but
you can still use ptrace inside the sandbox (ptrace is implemented by the
Sentry). Just wanted to make sure that was clear.

> Are you familiar with the FlexSC paper? Have you considered using a FlexSC-
> like RPC interface (a secure one, of course) to achieve your syscall
> interception, instead of ptrace? That would allow you to not just match the
> performance of native system calls, but even theoretically exceed their
> performance, while still having the same level of control. (I have been
> working on such an approach, so I was excited to see this gvisor project
> posted - I hoped you might have already done this and saved me some work :))

I am familiar with FlexSC. There are certainly opportunities for improvement,
including kernel-hooks, shared regions for async system calls, etc. Given the
pace that this space is evolving, our priority was to share these pieces so
that we can discuss things in the open. While I don't think we'll be able to
save you work (sorry!), we're aiming for collaboration and cross-
fertilization.

~~~
bradneuberg
Good to have a gVisor developer on here. At Dropbox one way we use secure
containers is to run machine learning models. Do you know if TensorFlow works
in a gVisor container? How is GPU support in the container? If running on a
CPU, are BLAS libraries supported to speed up matrix math in the container?
Finally, do you know if OpenCV currently runs in gVisor containers?

~~~
flx42_
nvidia-docker[1] maintainer here.

Curious to know, are you using docker today? If yes, is there anything missing
to satisfy your security requirements?

[1] [https://github.com/NVIDIA/nvidia-docker](https://github.com/NVIDIA/nvidia-docker)

~~~
bradneuberg
We aren't using Docker today; we have jailing infrastructure using Linux
cgroups, namespaces, seccomp, etc.

------
fulafel
Money quote from
[https://github.com/google/gvisor](https://github.com/google/gvisor) for Linux
folks who have been around for a while:

"gVisor's approach is similar to User Mode Linux (UML), although UML
virtualizes hardware internally and thus provides a fixed resource footprint."

~~~
wnevets
As a non-Linux folk, can you explain?

~~~
elliotf
User Mode Linux is an old mechanism for running the Linux kernel itself as a
userspace process, essentially a virtual machine guest without hardware
virtualization:
[https://en.m.wikipedia.org/wiki/User-mode_Linux](https://en.m.wikipedia.org/wiki/User-mode_Linux)

------
zbentley
This seems neat, but every time I read about something of this sort, I am left
wondering: what's wrong with the pledge/seccomp model? According to TFA:

> Kernel features like seccomp filters can provide better isolation between
> the application and host kernel, but they require the user to create a
> predefined whitelist of system calls.

Isn't that something you'd effectively have to do anyway if you want a
sandbox? Like, a sandbox isn't worth that much if you don't define what it can
and can't do, no?

I'm far from an expert in this area. It's an honest question, not a veiled
criticism.

~~~
cyphar
It also ignores that seccomp-bpf allows for far more fine-grained rules for
syscalls (like specifying that certain bits be cleared or certain arguments be
equal to a value). And they're adding more and more features over time to it.
I don't get why you would use ptrace (and if you don't use ptrace then you
don't need another layer -- just play with the existing OCI support for seccomp
and use runc directly).

~~~
geofft
seccomp-bpf doesn't let you follow pointers, so you can't even implement most
pledge() restrictions in it. For instance, pledge() always permits
open("/etc/localtime"), but at the point seccomp is run, all you know is
open(some pointer to userspace).
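Concretely, a seccomp-bpf program can only classify on struct seccomp_data, i.e. the syscall number, architecture, instruction pointer, and six raw u64 arguments. A toy model (the address value is made up):

```python
# What a seccomp-bpf filter actually gets to inspect: a snapshot of
# struct seccomp_data -- the syscall number plus six raw u64 argument
# values. For open(2), args[0] is just the numeric userspace address
# of the path string; BPF has no instruction to dereference it.
seccomp_data = {
    "nr": 2,                                   # SYS_open on x86-64
    "args": [0x7ffd1234a000, 0, 0, 0, 0, 0],   # (path ptr, flags, mode, ...)
}

path_arg = seccomp_data["args"][0]
print(hex(path_arg))  # an address, not "/etc/localtime"
```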

You could imagine combining seccomp-bpf with some other system that reads the
arguments after they've been copied to kernelspace, which is basically
Landlock's approach
[https://lwn.net/Articles/698226/](https://lwn.net/Articles/698226/). But I've
been personally waiting for something like this since 2011 or so when people
were saying seccomp mode 2 should use ftrace, and Landlock itself has been in
review (slash argument) for two years. An approach like gVisor works today.

~~~
cyphar
> seccomp-bpf doesn't let you follow pointers, so you can't even implement
> most pledge() restrictions in it. For instance, pledge() always permits
> open("/etc/localtime"), but at the point seccomp is run, all you know is
> open(some pointer to userspace).

This is something that is being worked on (separately from, but similarly to,
Landlock) in the form of seccomp syscall emulation (I don't remember the
actual name of the patchset at the moment, but it was proposed a month ago I
think). However, after talking to some seccomp folks I was told that in theory
eBPF maps could be used for this purpose (though I'm not really convinced, to
be honest).

The real downside of ptrace is that you cannot filter which syscalls you're
interested in -- so you pay the price of tracing for every syscall. seccomp
doesn't have this problem.

~~~
geofft
You can use SECCOMP_RET_TRACE to kick complicated cases back to the ptracer
but handle the easy cases without the slowdown. So you can write a seccomp
policy that does something like this pseudocode:

        if syscall == SYS_open:
            if flags == O_RDONLY:
                return (SECCOMP_RET_TRACE, 0)
            else:
                return (SECCOMP_RET_ERRNO, EPERM)
        else if syscall in (SYS_read, SYS_write, ...):
            return (SECCOMP_RET_ALLOW, 0)
        else:
            return (SECCOMP_RET_ERRNO, ENOSYS)

and it would be much much faster than tracing every system call, since most
programs call open() rarely and read() and write() very often.
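The decision logic of that pseudocode, modeled as a plain Python function (symbolic constants only; a real filter is compiled to BPF and installed via seccomp(2)/prctl(2), and would also have to mask open(2) flag bits such as O_CLOEXEC rather than test strict equality):

```python
import errno

# Symbolic stand-ins for the seccomp return actions and x86-64
# syscall numbers; this only models the decision logic, it does not
# install anything.
SECCOMP_RET_ALLOW = "allow"
SECCOMP_RET_TRACE = "trace"
SECCOMP_RET_ERRNO = "errno"
SYS_read, SYS_write, SYS_open = 0, 1, 2
O_RDONLY = 0

def policy(syscall, flags=0):
    if syscall == SYS_open:
        if flags == O_RDONLY:
            # Rare, complicated case: wake the ptracer to inspect the path.
            return (SECCOMP_RET_TRACE, 0)
        return (SECCOMP_RET_ERRNO, errno.EPERM)
    if syscall in (SYS_read, SYS_write):
        # Hot path: allowed entirely in-kernel, no tracer round-trip.
        return (SECCOMP_RET_ALLOW, 0)
    return (SECCOMP_RET_ERRNO, errno.ENOSYS)

print(policy(SYS_open, O_RDONLY))  # ('trace', 0)
print(policy(SYS_read))            # ('allow', 0)
```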

That said, the ptracer's job here is kind of hard, because the kernel _still_
gets an untrusted userspace pointer, and another thread, another process, etc.
can modify that memory in between the ptracer okaying it and the kernel
getting to it. (See "Argument races" in Tal Garfinkel's 2003 "Traps and
Pitfalls" paper.) So you either want the filtering to happen in the kernel
after it's been copied to kernelspace (which is Landlock's approach), or do
the open from a trusted process and send the fd over (which is I think
gVisor's approach).
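The "open from a trusted process and send the fd over" half can be sketched with SCM_RIGHTS fd-passing, here via Python 3.9+'s socket.send_fds/recv_fds (the broker/sandbox split below is illustrative, not gVisor's actual Gofer protocol, which speaks 9P):

```python
import os
import socket
import tempfile

# The broker validates the path and opens the file itself, then ships
# the already-open descriptor over a Unix socket -- so the kernel never
# re-reads a racy userspace path on behalf of the sandboxed side.
broker, sandbox = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"localtime data")
    path = f.name

# Broker side: check the path, open it, send the fd.
fd = os.open(path, os.O_RDONLY)
socket.send_fds(broker, [b"ok"], [fd])
os.close(fd)

# Sandbox side: receives a live descriptor, never a path.
msg, fds, _flags, _addr = socket.recv_fds(sandbox, 1024, 1)
data = os.read(fds[0], 1024)
print(data)

os.close(fds[0])
os.unlink(path)
broker.close()
sandbox.close()
```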

------
hacknat
I hate to rain on this interesting project’s parade, but if you need full
sandbox isolation then you should probably look to full VM isolation (à la
Kata Containers, formerly Clear Containers). User namespacing, seccomp, and
SELinux/AppArmor buy you about as much of a sandbox as you’ll need, with the
one caveat that a kernel exploit could still take you down (all other exploits
remain sandboxed).

If you need that final bit of kernel sandboxing, then a full VM is your only
guarantee. UML still sits on top of an exploitable kernel, and presumably this
project itself can be hacked. While it is certainly better than nothing, the
only thing it seems to buy you above Kata Containers is faster spin-up time
and the ability to dynamically resize the container. Maybe that trade-off is
worth it to someone, but the added performance overhead and the demi-kernel
isolation seem like a high price to pay for those features.

~~~
amscanne
Hi! I work on this project.

Nothing wrong with a full VM, but I don't think it's a panacea (and you
probably shouldn't guarantee security). You may have taken the UML comparison
to heart: did you look at the KVM platform? I'm not clear on the distinctions
you're making in that case -- a kernel escape would look a lot like a user-
space VMM code-execution vulnerability (which also sits on top of an
exploitable kernel).

~~~
hacknat
Well, I wasn’t arguing for a VM being a panacea, but in the context of not
being satisfied with the Linux primitives for sandboxing, I think it is the
next logical step up in security.

From my perspective this project seems like an intermediate step between Linux
containerization primitives and a full-blown VM, and I was wondering out loud
who fits that use case?

Finally, I didn’t mention KVM, but my understanding of KVM is that its
isolation primitive is the hardware virtualization instructions (or at least
could be, I’m not sure if it has a PV mode or not).

I guess my question for you would be:

In what context would I want to use this over something like Kata containers?

~~~
amscanne
KVM is the kernel interface for virtualization features, but the model created
(i.e. the emulated hardware, or lack thereof) is up to the user-space
component (normally QEMU). I think your understanding of KVM is tied to a
specific implementation.

FWIW, I don't disagree that it's an intermediate step in some regards. The use
cases follow (also from trade-offs discussed in the README). I can't speculate
on a stranger's needs, it's great if Kata works for yours. I also think that
approach is valuable (as an aside, I authored an experimental project with a
similar approach years ago [1]).

[1] [https://github.com/google/novm](https://github.com/google/novm)

~~~
tejasmanohar
Is there a comparison of Kata and gVisor based on how they act functionally
rather than how they are implemented under the hood? Like the OP, I'm curious
when you'd use this over Kata.

~~~
houseofzeus
Not a direct comparison of these projects specifically, but here is the
write-up that was presented in the context of the Kubernetes SIG Node
discussions about this topic:

[https://docs.google.com/document/d/1QQ5u1RBDLXWvC8K3pscTtTRT...](https://docs.google.com/document/d/1QQ5u1RBDLXWvC8K3pscTtTRThsOeBSts_imYEoRyw8A/edit#)

------
beagle3
Sounds like it should be straightforward to port this to run on BSD or
Windows, as it implements everything rather than just passing syscalls through.

I wonder how the performance of such a port would compare to WSL.

------
fgblanch
Does anybody know how this is related to ChromeOS's Crostini project? It seems
they are all related (GCP, Crostini, ...), but they are not linked yet?

~~~
jacksmith21006
I do not believe so. I have a Pixelbook and have been playing with the new
Linux capabilities; they are using a VM with LXC/LXD containers on top.

We did get an update today that adds a Linux apps option in the settings menu,
which when enabled opens a Linux shell.

------
zzzcpan
Is it possible to hook a custom filesystem implementation into this? So an app
won't touch the filesystem at all but will still be able to use some sort of
virtual files.

~~~
beagle3
This has been possible and quite easy with FUSE (and overlayfs / mergerfs /
mhddfs / aufs if you don't want to implement all the functionality yourself)
for the last 8 years at least -- I've used mhddfs and FUSE for exactly this on
Ubuntu 10.04.

~~~
zzzcpan
I'm not interested in FUSE here. gVisor has fsgofer, which is a proxy of some
kind to a filesystem; there is even ReadAt [1]. But it is light on details; I
was curious how and to what extent it proxies the filesystem API.

[1]
[https://github.com/google/gvisor/blob/0c7d73b31ae5e7ba506866...](https://github.com/google/gvisor/blob/0c7d73b31ae5e7ba5068669306aefe5a7b559f92/runsc/fsgofer/fsgofer.go#L720)

------
bonyt
What ever happened to user-mode-linux (UML)? It seems like gVisor takes a
similar approach, a "kernel" in userspace handling syscalls. I was playing
with UML the other day, a UML kernel is still in Debian's repo and appears
roughly up to date[1], but all the documentation I could find on it was many
years old and seemed out of date.

[1]: [https://packages.debian.org/buster/user-mode-linux](https://packages.debian.org/buster/user-mode-linux)

~~~
patrickg_zill
My experience with UML is very old, but I think that it had a significant
impact on performance -- about a 50% loss in certain cases vs. running the
same applications on the host directly.

------
jancurn
Amazing work guys! I'm only wondering, is this ready for production
environments?

~~~
bradfitz
[https://twitter.com/tallclair/status/991621542265180161](https://twitter.com/tallclair/status/991621542265180161)
\-- "Google has been relying on gVisor to sandbox production workloads for
years. I'm super excited that we've open sourced! Now we can talk about
#Kubernetes integrations :)"

------
loudmax
The Github page mentions that postgres, nginx and elasticsearch aren't
currently supported by the gVisor kernel (or pseudo-kernel?). Those are
significant, but they shouldn't be showstoppers for a lot of deployments. They
list each of these shortcomings as bugs, so hopefully there's a real effort to
support the capabilities required by those applications.

Also, I'd like to see what the performance impact is like. What types of
applications would suffer the worst performance running on gVisor? What
applications will be the least affected?

~~~
hacknat
It’s likely that web apps and servers will be least affected; I cannot imagine
the performance on databases being good.

~~~
loudmax
You're right, the Github page says as much:

> gVisor may provide poor performance for system call heavy workloads.

Incidentally, I find the description on Github much more interesting than the
one on the Google Blog:
[https://github.com/google/gvisor](https://github.com/google/gvisor)

------
comboy
So with security in mind, when I run this:

        docker run --runtime=runsc hello-world

Is there any way to remove the default runc Docker runtime, or to make sure
that it is actually running gVisor?

~~~
cpuguy83
At the daemon level, set "--default-runtime".
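For example (assuming runsc has been installed at /usr/local/bin/runsc; the path is illustrative), the runtime can be registered and made the default in /etc/docker/daemon.json:

```json
{
  "default-runtime": "runsc",
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc"
    }
  }
}
```

After restarting the daemon, docker info should list runsc as the default runtime, and a plain docker run will use it without the --runtime flag.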

------
cfontes
Well that looks very nice.

For the security experts among us, would containers using root as their user
be safe running on gVisor?

That would save a lot of pain because of some super annoying bugs like this:

[https://github.com/moby/moby/issues/6119](https://github.com/moby/moby/issues/6119)

[https://github.com/moby/moby/issues/5505](https://github.com/moby/moby/issues/5505)

~~~
ithkuil
If it's not safe then it wouldn't be a proper sandbox in the first place. The
goal is to intercept all system calls and reimplement them in a lightweight
kernel that talks with the host kernel only via a minimal 9p-based protocol.

I.e. there is never a direct syscall being served by the host kernel on behalf
of a process running inside the container.

From what I can read in their design docs, the user id running inside the
container seems completely irrelevant.

------
mwcampbell
Proponents of Illumos zones and FreeBSD jails claim that these solutions offer
better security than Linux containers while maintaining the performance of
running directly on a shared kernel. And now, both Illumos and FreeBSD have
Linux emulation for x86-64. Has Google tried these solutions and found them
wanting? Has anyone done research on whether Illumos zones or FreeBSD jails
really provide better security than Linux containers?

~~~
mirashii
> Has anyone done research on whether Illumos zones or FreeBSD jails really
> provide better security than Linux containers?

This is an ever changing property of both systems, and a little bit
subjective, so a study is both difficult and outdated as soon as it's done.

What we can do is look at the number of published vulnerabilities over a
timeframe, and compare the overall system designs and development
philosophies. I don't know of a comparison of the numbers of vulnerabilities,
but for a bit of history and why I would personally trust zones over
containers I previously wrote this comment.

[https://news.ycombinator.com/item?id=15179858](https://news.ycombinator.com/item?id=15179858)

~~~
Mister_Snuggles
Your comment really speaks to the different philosophy of Linux vs FreeBSD.

FreeBSD is very much a single system - the kernel and userland are designed
and built together. Linux, on the other hand, is a kernel which has multiple
different userlands made up of different pieces that distributions pick and
choose. Ubuntu, for example, is quite different from OpenSUSE, which is quite
different from CentOS, but they're all still Linux.

~~~
jacksmith21006
Linux being only a kernel, versus BSD being a whole OS, makes it easier to
leverage in Android, ChromeOS, etc. Google uses the Linux kernel everywhere,
from CC to Google Home and WiFi, and then also their cloud.

------
apstndb
Is it related to the sandbox of Google App Engine?

~~~
puzzle
There are commonalities...

[https://github.com/GoogleCloudPlatform/appengine-java-vm-runtime/blob/master/testwebapp/src/main/java/com/google/apphosting/tests/usercode/testservlets/TestInetAddressServlet.java#L167](https://github.com/GoogleCloudPlatform/appengine-java-vm-runtime/blob/master/testwebapp/src/main/java/com/google/apphosting/tests/usercode/testservlets/TestInetAddressServlet.java#L167)

~~~
apstndb
The appengine-java-vm-runtime is for the Flexible Environment. I want to know
about the sandbox of the Standard Environment. The Java 8 runtime seems to use
a new sandbox mechanism.

------
staticassertion
I'm really curious to see actual performance benchmarks, and to understand
where the majority of performance is lost.

------
zapita
Is this the same actual codebase that is running in production on Google's
core infrastructure? Or is it more like Kubernetes, that is to say: a distinct
open reimplementation based on the experience of running the in-house system?

~~~
dward
It's been used in production to sandbox specific workloads for years.

~~~
zapita
That makes this announcement even more exciting. Thanks.

------
peterwwillis
From a security perspective, I don't think there is a big difference between
process isolation and kernel isolation. Oh, you think you made some really
secure software? That's great, here is how I will use a side-channel to work
around it.

If it weren't for the fact that the vanilla Linux kernel is the security
equivalent of Swiss cheese, process isolation would be good enough for basic
"sandboxing". Add SELinux and some patches and it's good enough for the NSA.

So rather than waste time on piling on another layer of abstraction for what
is, in practical terms, no significant security advantage, just make userland
containers (don't run your container manager as root) and secure the OS and
stop reinventing the wheel with added complexity.

~~~
nickpsecurity
"Add SELinux and some patches and it's good enough for the NSA."

Exactly. They didn't accept that, though, since the underlying TCB was too
insecure. I'll elaborate for other readers.

SELinux was a prototype to add a tiny amount of functionality from "Trusted
Operating Systems" to Linux to see if its security could be improved. Those
assessing security at NSA rated the confidence of systems like that at
C2/EAL4+. Here's what that means as described by Shapiro when Windows 2000 got
the rating:

[https://www.rigacci.org/comp/freesoftware/trust-comp/win_eal4/NT-EAL4.html](https://www.rigacci.org/comp/freesoftware/trust-comp/win_eal4/NT-EAL4.html)

Highly-assured software they accept as close to secure is rated EAL6/7 (new
criteria) or B3/A1 (older criteria). One of my favorite papers illustrating
the kind of rigor that goes into that was VAX VMM Security Kernel for secure
virtualization back in early 1990's. It was designed for A1/EAL7. Look at
layering and Assurance sections for examples of techniques high-assurance
security still uses today albeit with different tooling.

[http://research.cs.wisc.edu/areas/os/Qual/papers/vmm-security-kernel.pdf](http://research.cs.wisc.edu/areas/os/Qual/papers/vmm-security-kernel.pdf)

One of the newer ones at EAL6+ is INTEGRITY-178B, whose page nicely
illustrates the kinds of features and evidence packages they had to use to
assure the separation kernel. There's more politics in the process these days,
where they try to play down any weaknesses. The features and analyses are
still good examples, though, of what would be in an openly-developed
alternative.

[https://www.ghs.com/products/safety_critical/integrity-do-178b.html](https://www.ghs.com/products/safety_critical/integrity-do-178b.html)

SELinux and common virtualization solutions don't begin to compare in the
confidence that attackers will find only minimal hacks. Instead, the endless
complexity demanded by the features people add ensures there will be plenty of
vulnerabilities to come, even in things that used to be safe. That's the
default for proprietary and FOSS software. Stuff making sacrifices for maximum
security is rare. SELinux isn't one of the latter. Its predecessor, LOCK, was,
though. Compare and contrast them too, esp. on the UNIX functionality set and
the what/why of the modifications.

[https://www.acsac.org/2002/papers/classic-lock.pdf](https://www.acsac.org/2002/papers/classic-lock.pdf)

------
amluto
This sounds like it will be dramatically slower than a VM on most workloads.

~~~
d1zzy
Why do you think so? For compute (non-syscall-bound) workloads it should be
native speed (just like a VM). For syscall-heavy operations it will depend on
the syscall, how it's implemented, and the backend processing.

For example, if you call lots of syscalls that are fully implemented in gVisor
(that do not rely on an external backend/service to complete) and that happen
to be implemented inefficiently in gVisor (because they aren't top priority
right now / not a lot of users rely on them being fast), vs the same syscalls
being already optimized in Linux, then obviously there will be a lot of
difference. But if you are calling syscalls which need an external service
(e.g. network/storage) to complete the request, then depending on the latency
of the external service, the processing-speed difference between gVisor and a
guest Linux kernel may not matter.

It really depends on the workload; there is indeed potential that some
workloads will be significantly slower with gVisor, all other things being
equal, but it doesn't seem to me to be a general thing.

------
rwmj
How is this different from Intel Clear Containers?

~~~
cpuguy83
This implements system calls in user space, intercepting those calls via
ptrace (though there is a KVM backend as well).

Honestly, I don't see an advantage to this over a stricter seccomp policy, and
it's most definitely slower.

~~~
cpuguy83
I retract the statement that I don't see the advantage.

See
[https://twitter.com/rauchg/status/991850924057350144](https://twitter.com/rauchg/status/991850924057350144)
for why.

Basically, run `mount` in a gVisor container and run `mount` in a runc
container and see the major differences there. Just one example, but as you
can see, Linux mount namespaces tend to leak lots of mount information. _Some_
of it could be cleaned up with additional unmounts after setting up the new
root for the container, but knowing what to unmount is not so simple (plus
it's just janky AF).

------
xtrapolate
> "Since gVisor is itself a user-space application, it will make some host
> system calls to support its operation, but much like a VMM, it will not
> allow the application to directly control the system calls it makes."
> [[https://github.com/google/gvisor](https://github.com/google/gvisor)]

TL;DR: This is a user-space process that hooks syscalls/ioctls made by your
"containerised" applications.

(1) This is hardly a strong security model. Proper security cannot be
guaranteed by simply hooking API calls in user-space alone.

(2) With this framework in mind, developers now need to worry about yet
another layer of indirection. Assume <application> was tested to work on
Ubuntu, that fact alone is not sufficient to assume it will keep running under
gVisor.

(3) I would personally like to see more documentation/benchmarks regarding the
performance impacts that come with using this.

(4) This is strongly coupled with internal Kernel implementations. It will not
be easy to port and maintain this across different Kernels.

> "but much like a VMM, it will not allow the application to directly control
> the system calls it makes."

From user-space? hold my beer.

~~~
geofft
> _(1) This is hardly a strong security model. Proper security cannot be
> guaranteed by simply hooking API calls in user-space alone._

The thing you're talking about is not a security model, it is a (reliable)
mechanism that can be used in the implementation of security models.

> _(2) With this framework in mind, developers now need to worry about yet
> another layer of indirection. Assume <application> was tested to work on
> Ubuntu, that fact alone is not sufficient to assume it will keep running
> under gVisor._

This is true of existing container technologies. An application running under
Ubuntu on bare hardware will potentially not run in an Ubuntu Docker image.
You'll need to test it extensively.

> _(4) This is strongly coupled with internal Kernel implementations. It will
> not be easy to port and maintain this across different Kernels._

I don't understand this—gVisor is a userspace application and is not itself
tied to kernel implementations the way a kernel module would be. The interface
gVisor exposes is the Linux syscall ABI, which is the thing Linux tries very
hard to hold stable. There are multiple production reimplementations of this
ABI (Windows Subsystem for Linux, FreeBSD's Linuxulator, Solaris's branded
zones). You'll need to add new features if you want them, of course, but
holding at a specific emulated kernel version is totally fine.

> _From user-space? hold my beer._

ptrace (with PTRACE_O_EXITKILL from kernel 3.8+) is designed to be reliable
for this.

Also, if you don't trust it, just set everything to SECCOMP_RET_TRACE, which
kills the process if there is no ptracer.

~~~
jagger11
_Also, if you don't trust it, just set everything to SECCOMP_RET_TRACE, which
kills the process if there is no ptracer_

A small correction: it causes the syscall not to be executed, returning with
errno == ENOSYS.

~~~
geofft
Oh, thanks. (It's still safe, because the inability to execute system calls
basically translates into an inability to do anything the process was not
previously authorized to do via... mmapped memory, and I think that's it.)

------
alberth
I wonder how this is similar or different to FreeBSD jail; which has existed
for the past 13+ years.

[https://www.freebsd.org/cgi/man.cgi?query=jail&apropos=0&sek...](https://www.freebsd.org/cgi/man.cgi?query=jail&apropos=0&sektion=0&manpath=FreeBSD+6.0-RELEASE+and+Ports&format=html)

~~~
cyphar
Very different. It's also very different to Solaris Zones. The design of Linux
containers is namespace-based while Jails and Zones are (basically) ID-based.

In addition, gVisor is basically a ptrace wrapper around your process that
applies restrictions and other things on top of containers. I don't really
understand what the benefit is over seccomp-bpf (which is slowly becoming as
powerful as ptrace, but without the overhead and without the security flaw of
your sandbox rules living entirely in userspace without any protections like
seccomp). They call it a kernel, likely because it is based on the idea of UML
(User-Mode Linux), but there's a reason that UML never took off as a
virtualisation tool -- its entire security was predicated on ptrace tricking
processes spawned in the "guest OS" into not being able to see the host. In
this configuration it looks like gVisor is using both namespaces and ptrace --
but then you have to worry about the massive overhead of ptrace (it affects
every syscall and signal event involving the process, and requires four
context switches and signal delivery to the tracing process in addition to the
normal syscall costs).

gVisor appears to be working on a KVM shim, but I'm not quite sure how you can
use KVM and still differentiate yourself from Kata Containers (the project
that came from Clear Containers and HyperHQ). Seems like duplicated effort to
me.

EDIT: I just re-read the article and it looks like they don't actually use
containers at all. Unless I'm mistaken this means that they are not taking
advantage of any of the sophisticated security primitives in the kernel that
ordinary containers use, and thus have the same (bad) security model as UML.

~~~
antoncohen
I recommend reading the GitHub README[1].

> I don't really understand what the benefit is over seccomp-bpf

With seccomp-bpf you are filtering what syscalls can be made, but those
syscalls still happen in the host kernel. If a kernel syscall has a
vulnerability, it could allow exploitation of the host and other containers in
a multi-tenancy environment. The gVisor kernel actually implements the
syscalls; it doesn't just pass them on.
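The distinction can be illustrated with a toy sketch (names hypothetical; the real Sentry is written in Go and implements a large subset of the Linux syscall surface): a userspace kernel that services read() entirely from its own state, so the application's syscall never reaches the host kernel:

```python
import errno

class ToySentry:
    """Hypothetical sketch of a userspace kernel that *implements* a
    syscall rather than passing it through: the sandboxed app's read()
    is serviced entirely from the sentry's own file table, so no host
    read(2) ever happens on its behalf."""

    def __init__(self):
        self.files = {3: b"hello from the sandboxed fs"}  # fd -> contents
        self.offsets = {3: 0}

    def sys_read(self, fd, count):
        if fd not in self.files:
            return -errno.EBADF       # errors are emulated too
        off = self.offsets[fd]
        data = self.files[fd][off:off + count]
        self.offsets[fd] = off + len(data)
        return data

sentry = ToySentry()
print(sentry.sys_read(3, 5))     # b'hello'
print(sentry.sys_read(3, 1024))  # b' from the sandboxed fs'
print(sentry.sys_read(7, 10))    # -errno.EBADF
```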

> gVisor appears to be working on a KVM shim, but I'm not quite sure how you
> can use KVM and still differentiate yourself from Kata Containers

Kata Containers virtualize hardware and run a regular Linux kernel on the
virtual hardware. gVisor doesn't virtualize hardware, it is a kernel running
in userspace, implementing syscalls.

> I just re-read the article and it looks like they don't actually use
> containers at all. Unless I'm mistaken this means that they are not taking
> advantage of any of the sophisticated security primitives in the kernel that
> ordinary containers use, and thus have the same (bad) security model as UML.

The gVisor kernel (Sentry) runs in an empty user namespace with seccomp
filters applied.

[1] [https://github.com/google/gvisor](https://github.com/google/gvisor)

~~~
cyphar
> With seccomp-bpf you are filtering what syscalls can be made, but those
> syscall still happen in the host kernel.

I'm not sure what the distinction you're making is. By the same token, because
PTRACE_SYSCALL/PTRACE_SYSEMU only signals the calling process _after_ the
syscall boundary has been crossed, then ptrace also doesn't help with the
problem you are describing (though I also don't really agree that it's a
problem in the first place -- user-kernel context switches are not security
vulnerabilities). In fact, the seccomp restrictions are applied _immediately_
after PTRACE_SYSCALL/PTRACE_SYSEMU -- there is only a few lines of code
separating the two cases in the syscall entry path[1]. _And_ in the case of
PTRACE_SYSEMU, seccomp rules are still executed even though the syscall is
never going to be executed.

> gVisor doesn't virtualize hardware, it is a kernel running in userspace,
> implementing syscalls.

I understand that, but that's the ptrace helper (which I spent the rest of my
comment talking about). In the README (which I did read before commenting)
they mention an experimental KVM driver, which is what my VM comments were
referring to.

Is there an explanation somewhere about how gVisor uses KVM to virtualize
syscalls? I'm not sure I understand how you could use KVM to do that. That's
why I mentioned Kata, because it's the only point of reference I have for
using KVM to "emulate syscalls" (though of course it emulates more than that).

> The gVisor kernel (Sentry) runs in an empty user namespace with seccomp
> filters applied.

Good to know, but I didn't see that mentioned anywhere? If you're affiliated
with the project it'd be great if you could add it somewhere in the README or
the blog post.

[1]:
[https://elixir.bootlin.com/linux/v4.16.6/source/arch/x86/ent...](https://elixir.bootlin.com/linux/v4.16.6/source/arch/x86/entry/common.c#L81)

~~~
ithkuil
I started by reading here
[https://github.com/google/gvisor/blob/master/pkg/sentry/plat...](https://github.com/google/gvisor/blob/master/pkg/sentry/platform/kvm/kvm.go)

Basically the Sentry binary works as both the VMM and the guest kernel. It
uses the KVM API to set up an address space and installs fault handlers
through which it regains control when the guest payload faults (on memory
access or soft interrupts/sysenter).

What I couldn't find on a quick skim is how much of that logic works on which
"side" of the wall, i.e. how much logic the Sentry can evaluate without
crossing the VM boundary.

I can imagine this can depend quite a lot on Go runtime internals. How do you
set up the environment inside the "guest" so that it can run the Go code?

