
Containers, Security, and Echo Chambers - ipm42
https://blog.jessfraz.com/post/containers-security-and-echo-chambers/
======
erulabs
Dropping privileges from Docker containers and container isolation are both
very interesting and important topics, ones I hear discussed constantly. This
author would serve himself much better by dropping the self-congratulation;
there is no reason at all for comments like “tech bros crying” and “container
isolation is a hard problem - unless you’re me!”. Sorry, you’re not the only
nerd smart enough to populate the capdrop table. Also, the default AppArmor
profile for Docker leaves -a lot- to be desired. I hear the term “Echo Chamber”
most often from people who deem themselves smarter than the rest, and always in
an accusatory way... fairly ironic if you ask me.

~~~
nixgeek
Walk a mile in her shoes before you cast stones.

[https://www.theregister.co.uk/2016/04/26/harassment_of_femal...](https://www.theregister.co.uk/2016/04/26/harassment_of_female_docker_staff/)

------
benmmurphy
linux namespaces can be aggressively locked down. for example sandstorm by
kenton varda has not been vulnerable to any of the recent linux kernel
vulnerabilities except for badiret and a TLB bug
([https://github.com/sandstorm-io/sandstorm/blob/master/docs/u...](https://github.com/sandstorm-io/sandstorm/blob/master/docs/using/security-non-events.md)). it certainly
wasn't vulnerable to dirty cow which is quite difficult to protect against.

however, i think docker may have been vulnerable to dirty cow.
[[https://github.com/gebl/dirtycow-docker-vdso](https://github.com/gebl/dirtycow-docker-vdso)] i'm not sure if this was
before or after jessie's work on securing docker.

also, i don't think gvisor would have been vulnerable to dirty cow. it looks
like it 'gates' all the mmap/munmap/madvise syscalls through a sentry process
which does some kind of emulation of virtual memory through some magic.
[[https://github.com/google/gvisor/blob/797cda301677abc8523d5a...](https://github.com/google/gvisor/blob/797cda301677abc8523d5a2a8d731312cc43bce4/pkg/sentry/mm/README.md)].
ultimately, i think mmap() system calls need to be executed in the
monitored process, but i think they are only issued by the sentry using ptrace.
if the sentry dies, i assume it is the root of the pid namespace, so
the process it is monitoring is killed as well.

~~~
kentonv
It's been a while, but IIRC Docker wasn't affected by Dirty COW because they
mount /proc read-only. (Sandstorm was unaffected because it doesn't mount
/proc at all.)
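
One way to check the /proc claim from inside a container is to read the mount table. The sketch below is only an illustration: the helper name `proc_is_read_only` and the sample mount lines are made up, and it looks only at the `/proc` mount point itself, not at any sub-paths that might be remounted separately.

```python
def proc_is_read_only(mounts_text):
    """Inspect text in /proc/mounts format.

    Returns True if the last mount on /proc carries the "ro" option,
    False if it does not, and None if /proc is not mounted at all.
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == "/proc":
            # Later entries shadow earlier ones, so keep the last match.
            result = "ro" in fields[3].split(",")
    return result


# Made-up sample data, not captured from a real container.
SAMPLE_RO = (
    "overlay / overlay rw,relatime 0 0\n"
    "proc /proc proc ro,nosuid,nodev,noexec,relatime 0 0\n"
)
SAMPLE_NO_PROC = "overlay / overlay rw,relatime 0 0\n"

print(proc_is_read_only(SAMPLE_RO))
print(proc_is_read_only(SAMPLE_NO_PROC))
```

On a live system the input would come from reading `/proc/mounts`, which of course only exists when /proc is mounted in the first place.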

FWIW I haven't kept the security non-events up to date over the past year or
two. There was at least one Linux kernel bug I can remember last year
(CVE-2017-5123) that allowed a breakout from _all_ container engines, because
waitid() is too important a syscall to block. However, the vulnerability was
newly-introduced and hadn't made its way into too many distros before it was
fixed.

------
mindhash
After reading this article I am more convinced about gvisor or something
similar.

The security of your systems is best left to experts.

~~~
lrvick
Speaking as someone who has been called a security "expert" by many for years:
I actually have very little idea what I am doing and neither do most of the
people I know that discover vulnerabilities. I find issues by just reading
code that clearly got minimal, if any, review, or by catching common design
flaws that would never have happened if someone had taken the time to think
about their threat profile and attack surface before implementation.

The attitude I encounter in most engineers, that "security is someone else's
problem", is the problem.

If you are writing systems other people rely on to be secure, then security is
your problem. You will do a much better job avoiding creating new security
holes if you take the time to learn some basics yourself instead of expecting
"experts" to do it for you.

Namespacing features and system call filtering tools like gvisor, seccomp,
selinux, apparmor etc should be your -last- line of defense and they are only
going to be useful tools for you if you invest the time to understand them and
tune them to your specific needs.
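
As a rough illustration of what "tuning to your specific needs" can look like, here is a minimal sketch that builds a deny-by-default seccomp profile in the JSON layout Docker accepts. The syscall list is a hypothetical example, not a recommendation; a real allowlist has to come from observing what your own workload actually calls.

```python
import json


def build_seccomp_profile(allowed_syscalls):
    """Return a Docker-style seccomp profile that denies everything
    except the named syscalls."""
    return {
        # Any syscall not listed below fails with an errno.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Hypothetical allowlist for illustration only; far too small for most
# real programs.
ALLOWED = ["read", "write", "close", "futex", "epoll_wait", "exit_group"]

profile = build_seccomp_profile(ALLOWED)
profile_json = json.dumps(profile, indent=2)
print(profile_json)
```

Saved to a file, a profile like this can be passed to `docker run --security-opt seccomp=<file>`. The point is that the filter is only as good as your understanding of what the workload needs.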

~~~
ithkuil
there is an interesting thin line dividing a "hardening feature" (e.g. a
system call filtering system) and something perceived as a genuine execution
environment category (e.g. a virtual machine).

Most people like to be able to reason about general security implications in
broad strokes, and concepts such as "virtual machine" convey a
given notion of isolation guarantees that makes them stand out as a primitive
you can build upon, rather than an "additional layer".

In order to achieve this standing, the "virtual machine" concept is rooted in
the general idea that most of the traditional OS abstraction is moved inside
the sandbox, leaving only a very small, easy to understand (and hence easy to
secure) channel to the underlying shared resource. This is traditionally
achieved by running a full kernel in the sandbox (guest OS) and having it
interact with the host through a hypervisor. The optional assistance provided
by hardware virtualisation features is often necessary to achieve good
performance, mostly because of the necessity to move a traditional OS in the
guest, which was designed to work as a primary OS in the first place.

The niche gVisor is trying to fill is the ability to approximate that
abstraction without requiring the traditional hypervisor mode, which has some
practical drawbacks that make it hard to deploy in some scenarios (think of
hardware support for nested virtualization which would be required to run your
own virtualization solutions inside of public cloud compute instances).

gVisor achieves this by fundamentally implementing a user-space kernel,
leveraging some of the aforementioned system call filtering tools as one of
the possible ways of implementing the sandbox mechanism and the guest/host
communication channel. So, while using the very same features that are
commonly thought of as providing additional hardening, gVisor fills a
different niche, more akin to what people usually call virtualization.

It can be argued that the amount of host features exercised by gVisor is too
high to be able to call it a virtualization feature proper (especially in its
seccomp mode), but it shares with classic virtualization one very powerful
property: when end-user workloads require a given OS feature (e.g. some new
cgroups feature), only the guest "OS" needs to implement it.

On the other hand, traditional seccomp/selinux/apparmor style hardening
requires the host OS to implement all the features needed by the guest
workload. Furthermore, it also often requires the rules (e.g. syscall
filters) to be updated to let the sandboxed workload use said features, and
the amended rules can often be applied incorrectly. Moreover, the filtering
rules need to be expressive enough in order to implement some scenarios in the
first place (e.g. seccomp-bpf cannot currently follow pointers).
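
To make the "cannot follow pointers" point concrete: a seccomp-bpf filter is handed only the kernel's `struct seccomp_data`, i.e. the syscall number, architecture, instruction pointer and six raw 64-bit arguments. The sketch below mirrors that layout with ctypes. `toy_filter` is a hypothetical stand-in for a BPF program, and the syscall number assumes x86-64.

```python
import ctypes


class SeccompData(ctypes.Structure):
    """Mirror of the kernel's struct seccomp_data (the filter's only input)."""
    _fields_ = [
        ("nr", ctypes.c_int),                      # syscall number
        ("arch", ctypes.c_uint32),                 # AUDIT_ARCH_* value
        ("instruction_pointer", ctypes.c_uint64),
        ("args", ctypes.c_uint64 * 6),             # raw register values
    ]


SYS_OPENAT = 257                # x86-64 syscall number for openat
AUDIT_ARCH_X86_64 = 0xC000003E
O_ACCMODE = 0x3                 # mask for the read/write access bits

# The path string lives in the caller's memory; the filter gets its address.
path = ctypes.create_string_buffer(b"/etc/shadow")

data = SeccompData()
data.nr = SYS_OPENAT
data.arch = AUDIT_ARCH_X86_64
data.args[1] = ctypes.addressof(path)  # pathname argument: just a number
data.args[2] = 0                       # flags argument: O_RDONLY


def toy_filter(d):
    """Hypothetical filter: may look only at the fields of d.

    It can distinguish a read-only open from a writable one via the flags,
    but it has no way to learn which file args[1] points to.
    """
    if d.nr == SYS_OPENAT and (d.args[2] & O_ACCMODE) != 0:
        return "ERRNO"
    return "ALLOW"


print(toy_filter(data), hex(data.args[1]))
```

Here the filter allows the call even though the path is sensitive, because all it sees for the path is an address. Path-based policy therefore has to live somewhere else, for example in an LSM or in a user-space kernel that interprets the call itself.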

