
Disabling Docker ICC still allows raw ethernet communication between containers - bthornbury
https://github.com/brthor/docker-layer2-icc
======
bthornbury
Note that this repros on a default unprivileged container, with the default
seccomp profile, and SELinux enabled.

Relevant filed issue:
[https://github.com/moby/moby/issues/36355](https://github.com/moby/moby/issues/36355)

There are three immediate workarounds (a sketch of the first two follows the list).

1. Dropping `CAP_SYS_RAW` in docker run will prevent the raw socket from
being opened (and is always a good idea).

2. Placing each container on its own bridge network blocks this
communication, but requires you to manually specify the subnets of those
networks to work around Docker's default limit of 31 networks.

3. Creating ebtables rules to filter below the IP layer (at layer 2); see the
repo for an example.
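
A minimal sketch of the first two (note: the capability that actually gates
raw sockets is CAP_NET_RAW, as clarified below; the image name and subnet are
just placeholders):

    # 1) drop CAP_NET_RAW so the container cannot open raw sockets
    docker run --cap-drop=NET_RAW alpine

    # 2) one bridge network per container, with an explicit subnet
    docker network create --subnet=172.30.1.0/24 isolated-1
    docker run --network=isolated-1 alpine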

~~~
waittrtwut
For #1 did you mean CAP_NET_RAW or CAP_SYS_RAWIO?

~~~
bthornbury
CAP_SYS_RAWIO looks even more terrifying, but CAP_NET_RAW is what gates your
ability to use raw sockets.

[https://linux.die.net/man/7/capabilities](https://linux.die.net/man/7/capabilities)
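
A quick hedged way to check what a container actually holds (the exact mask
value you see may differ):

    # effective capability mask of the process inside a default container
    docker run --rm alpine grep CapEff /proc/1/status
    # decode the mask on the host
    capsh --decode=00000000a80425fb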

~~~
contingencies
I raised this stuff in 2014.

[https://github.com/moby/moby/issues/5661](https://github.com/moby/moby/issues/5661)

------
sydney6
FreeBSD's jails haven't allowed raw socket access by default for as long as I
have been using them (~10 years), and probably much longer.

edit: Since 2004 to be precise.

[https://www.freebsd.org/releases/5.3R/relnotes-amd64.html](https://www.freebsd.org/releases/5.3R/relnotes-amd64.html)

So well documented. Gotta love it.
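
For the curious, enabling raw sockets is an explicit per-jail opt-in. A
hedged example using jail(8) parameter syntax (the jail name is a
placeholder):

    # allow raw sockets for one running jail (denied by default)
    jail -m name=myjail allow.raw_sockets=1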

~~~
bthornbury
How does the isolation compare with Docker in general?

I'm not familiar with FreeBSD jails.

~~~
boomboomsubban
I'm sure this isn't a total list of features, but Wikipedia's comparison shows
jails having all the features of Docker.
[https://en.wikipedia.org/wiki/Operating-system-level_virtualization#Implementations](https://en.wikipedia.org/wiki/Operating-system-level_virtualization#Implementations)

~~~
c0n5pir4cy
As someone who has used both extensively, I wouldn't say Docker and jails are
directly comparable; jails are a lot more comparable to something like LXC.

Jails don't have any of the deployment tooling: tools for creating and
building images, a registry for images, etc. Docker also automates a lot of
the networking away, whereas with jails I usually have to use something like
PF to set up port forwards; I also sometimes run into problems with the fact
that a jail shares the network stack with the host.

That being said, all the configuration is quite simple, and building exactly
what I need to run an application isn't too difficult. The big advantage
after that is that I understand how everything is wired together and can
debug more easily when something goes wrong. Doing it manually does, however,
take up much more time.

~~~
boomboomsubban
Terribly late reply, sorry, but the deployment tools are all in ZFS. Giving
each jail its own ZFS file system lets you have templates with a bunch of
setup steps already done, or mass-copy a complete jail. And with ZFS
send/receive, you can share them.
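
For example, something like this (dataset names are illustrative):

    # snapshot a prepared template, clone it into a new jail,
    # and ship the template to another host
    zfs snapshot zroot/jails/template@base
    zfs clone zroot/jails/template@base zroot/jails/newjail
    zfs send zroot/jails/template@base | ssh otherhost zfs receive zroot/jails/template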

------
geofft
Even if you used ebtables to filter out containers talking to each other's MAC
addresses, wouldn't they be able to send broadcast or multicast packets to
communicate with each other?

I guess it's not clear to me if the vulnerability/bug/whatever here is "two
conspiring containers can establish a covert channel" or "a malicious
container can send normal-looking traffic to a container, bypassing that
container's firewall rules."

It does seem like the right answer is unique bridge networks per container. On
physical networks, it's hard to prevent two untrusted devices on the same L2
domain from establishing a covert channel. (And it's hard to prevent two
networked untrusted, conspiring devices anywhere on the internet from
establishing a covert channel, if they're trying hard enough.)

~~~
bthornbury
> Even if you used ebtables to filter out containers talking to each other's
> MAC addresses, wouldn't they be able to send broadcast or multicast packets
> to communicate with each other?

I'm afraid my initiation to ebtables (and iptables) was yesterday, so feedback
such as this is highly appreciated. I will explore this and add a test to the
repo.
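
As a first hedged guess, the broadcast case might look something like this
ebtables rule (the interface name is a placeholder, and I haven't tested it
yet):

    # drop broadcast frames arriving on a container's host-side veth
    ebtables -A FORWARD -i vetha056126 -d Broadcast -j DROP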

> I guess it's not clear to me if the vulnerability/bug/whatever here is "two
> conspiring containers can establish a covert channel" or "a malicious
> container can send normal-looking traffic to a container, bypassing that
> container's firewall rules."

In the context of a sandbox, I'd argue the vulnerability is that a malicious
container can bypass the firewall. The covert channel is largely inevitable,
because ultimately the containers can modulate CPU usage to communicate via
CPU timing (I read this in a security paper somewhere; I'm not sure how an
actual implementation works).

The impact of bypassing the firewall can be serious if the non-malicious
container exposes any service without proper authentication (as is often done
behind VPCs in the cloud).

Given that a high number of official images contain known vulnerabilities
(according to Docker Hub), this is a potential issue for a large number of
Docker users.

> It does seem like the right answer is unique bridge networks per container.

My thinking is along the same lines, although I'm wondering whether a
container could still communicate with the host if L2 packets are able to
traverse the network bridges. If anyone has input here, it'd be greatly
appreciated.

~~~
paulfurtado
Re: bridge filtering

You can use the iptables physdev extension to match on a bridge port rather
than a MAC address.

So to filter between containers, you can do:

    iptables -I FORWARD -m physdev --physdev-in vetha056126 -j DROP

where vetha056126 is the name of the container's veth adapter's peer interface
on the host.
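
If you need to find that peer name, one approach (assuming a standard veth
pair) is to read the peer's interface index from inside the container and
look it up on the host:

    # inside the container: print eth0's peer interface index, e.g. 7
    cat /sys/class/net/eth0/iflink
    # on the host: find the interface with that index
    ip -o link | grep '^7:'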

~~~
bthornbury
Perhaps I don't understand this rule properly, but isn't this dropping all
traffic being forwarded from the interface?

~~~
paulfurtado
When the sysctl net.bridge.bridge-nf-call-iptables=1 is set, bridged packets
will traverse iptables rules. I believe this is the default on modern Linux
kernels, but may not be set by default on RHEL/CentOS systems.

The iptables filter table has three chains: INPUT, OUTPUT, FORWARD. In a
traditional non-bridge context:

\- INPUT means packets entering the host

\- OUTPUT means packets leaving the host

\- FORWARD means packets being forwarded through the host which are not
destined for the host itself.

In the context of a bridge, FORWARD means forwarding packets from one bridge
port to another. You can test the simplest case of this with:

    iptables -I FORWARD -j DROP

By adding that rule, two containers on the same host cannot communicate with
each other, despite the fact that their communication does not require any
traditional routing by the host. This is because the bridged traffic also
traverses iptables rules.
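
To check or flip the sysctl (the module name assumes a reasonably recent
kernel):

    # is bridged traffic traversing iptables? (1 = yes)
    sysctl net.bridge.bridge-nf-call-iptables
    # if the sysctl is missing, load the module first, then enable it
    modprobe br_netfilter
    sysctl -w net.bridge.bridge-nf-call-iptables=1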

------
blattimwind
> This behavior is highly unexpected, and in highly secure environments,
> likely to be an issue.

If your "highly secure" environment relies on docker for isolation (a software
neither meant nor designed for security), then it hardly qualifies as "highly
secure" in the first place.

~~~
bthornbury
> If your "highly secure" environment relies on docker for isolation (a
> software neither meant nor designed for security), then it hardly qualifies
> as "highly secure" in the first place.

This is just not true. Docker requires hardening by default, but container
implementations are fairly secure, and with kernel/OS hardening they are
quite secure.

Docker's security by default is actually quite good outside of a few issues
(like raw sockets), as long as you don't mount '/' or something. Seccomp is
enabled by default to reduce the attack surface on the kernel, and SELinux
can be enabled to further enforce access control.

Plenty of sandboxes now rely on containers, seemingly successfully.
Sandstorm.io is a good example (and where I picked up a lot of tips for
securing containers).

If you're convinced otherwise, feel free to share some research.

~~~
kentonv
As the author of Sandstorm, I'm going to say: "ehh... maybe."

There are still some big differences between Sandstorm and Docker's sandboxes
-- not as big as there used to be, but non-trivial. For example, Docker mounts
/proc, albeit read-only; Sandstorm doesn't mount it at all. Docker lets the
container talk to the network; Sandstorm gives it only a loopback device and
forces all external communication through one inherited socket speaking Cap'n
Proto. In general, Sandstorm is willing to break more things to reduce attack
surface; Docker puts more priority on compatibility.

So is it "secure"? That's not a yes/no question. Security is about risk
management; you can always reduce your risk further by jumping through more
hoops. There will certainly be more container breakout bugs found in the
future, including bugs affecting Docker and affecting Sandstorm. There will
probably be more that affect Docker, because its attack surface is larger.

There will also be VM breakouts. VM breakouts appear to be less common than
container breakouts, but not by as much as some people seem to think. Anyone
who makes a blanket statement that VMs are secure and Docker is not does not
know what they are talking about.

The only thing I'd feel comfortable saying is "totally secure" is a computer
that is physically unable to receive or transmit information (airgapped, no
microphone/speaker, etc.), but that's hardly useful.

In general, if you are going to run possibly-malicious code, the better way to
reduce your risk is not to use a better sandbox, but to use multiple layered
sandboxes. The effect is multiplicative: now the attacker must simultaneously
obtain zero-days in both layers. For example, Google Chrome employs the V8
sandbox as well as a secondary container-based sandbox.

In Sandstorm's case, the "second layer" is that application packages are
signed, with the UI presenting the signer's public identity (e.g. GitHub
profile) to you at install time, plus the fact that most people only let
relatively trustworthy users install apps on their Sandstorm servers. (For
Sandstorm Oasis, the cloud version that anyone can sign up for, application
code runs on physically separate machines from storage and other trusted
code, for a different kind of second layer.)

~~~
bthornbury
Thanks for commenting here!

I was inspired by Sandstorm's supervisor.c++ while looking into container
security, which eventually led me to this issue.

> For example, Docker mounts /proc, albeit read-only;

I can't find much information on this: does Docker mount /proc from the host
by default in each of the containers?

> In general, if you are going to run possibly-malicious code, the better way
> to reduce your risk is not to use a better sandbox, but to use multiple
> layered sandboxes. The effect is multiplicative: now the attacker must
> simultaneously obtain zero-days in both layers. For example, Google Chrome
> employs the V8 sandbox as well as a secondary container-based sandbox.

I think that sandboxing at the language level is pretty tough in the general
case, since supporting every language takes a lot of effort. In non-managed,
compiled languages it will be even tougher.

What is your opinion on tools like SELinux as a secondary layer?

~~~
paulfurtado
> I can't find much information on this: does Docker mount /proc from the
> host by default in each of the containers?

I'm curious what kentonv has to say, but on modern kernels docker can make use
of PID namespaces, so /proc only shows PIDs from the container's PID
namespace. That said, it does still provide several information leaks like
/proc/self/mountinfo showing which host directories are mounted where in the
container.

In addition to PID namespaces, another isolation gotcha is users: docker does
not enable user namespaces by default, since they're a relatively new kernel
feature and they break backwards compatibility (e.g., Kubernetes doesn't
support them yet). A good example of this in practice, which many people hit
issues with in the past, was ulimits: if UID 1000 in a container exceeds a
ulimit, it also affects UID 1000 in every other container. Docker solved this
by setting ulimits to unlimited on the docker daemon process, which are then
inherited by containers (this also happens to be good for performance). User
namespaces are one of the big recent improvements to container security.
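
For what it's worth, opting in today is a single daemon flag (subject to the
compatibility caveats above):

    # remap container root to an unprivileged range of host UIDs/GIDs
    dockerd --userns-remap=default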

------
yanslookup
Is the idea that an attacker would enumerate MAC addresses and scan for
connections? Or is there a way for an attacker to discover neighboring
containers' MAC addresses?

~~~
andrewguenther
I'm curious about this as well. It seems like this would be pretty difficult
to exploit in the real world.

~~~
bthornbury
I think some more research is needed here.

I'm wondering if containers could abuse ARP to find the neighboring addresses.

~~~
andrewstuart
    sudo arp-scan --interface=br0 --localnet --bandwidth=8192000 --numeric --retry=1

That should return MAC addresses plus the associated IP addresses, although
nmap would probably yield more interesting results.

------
jlgaddis
Since iptables only works on IPv4 packets (EtherType 0x0800), it doesn't
really seem unusual that you could still pass frames with other EtherTypes
back and forth.

There's a reason why filtering frames at layer 2 is a thing, but I suppose the
average developer/Docker user wouldn't really be aware of it. I'm a senior
network engineer at $work (an ISP), and there are several locations in the
network where I do this (most commonly at IXPs).

Layer 2 security is something that most people overlook.

------
fapjacks
Well, this is kind of unexpected, but "highly secure" environments surely
won't be running containers on the default bridge without any of the
capabilities configured.

~~~
bthornbury
This is true. I suppose the issue is people naively relying on the defaults
with ICC disabled.

------
eyeareque
Could it be possible to spoof Ethernet frames with UDP in them? Like, say,
DNS responses to the other containers' DNS queries? Or to spoof the same MAC
address as a peer container?

------
kuschku
The interesting question will be how this applies to Kubernetes using flannel,
considering that it creates a separate network bridge per container.

~~~
TheDong
Kubernetes using flannel creates a veth pair per container, but they're all
attached to the same bridge.

~~~
djsumdog
I'd be curious what would happen between nodes: could you get to the virtual
MAC address of a container on another node? It'd also be interesting to see
what happens with Flannel, Weave Net, and other Docker network
implementations.

------
iforgotpassword
Why not just put each docker container in its own VM? Problem solved without
any ugly hacks, clean and elegant.

~~~
bringtheaction
What do you need docker for if you put each container in a VM of its own?

~~~
chatmasta
One use case is if you're running a multi-tenant service where customers can
deploy their own Docker containers. In terms of developer usability, it's way
easier to have them push containers to a registry than to ask them to package
a custom ISO for upload into a VM. To your customers, it appears you are just
running Docker, but on the backend you run each container in its own VM for
OS-level isolation.

The Intel Clear Containers project [0] is working on this.

[0] [https://github.com/clearcontainers](https://github.com/clearcontainers)

