Disabling Docker ICC still allows raw ethernet communication between containers (github.com/brthor)
113 points by bthornbury on Feb 20, 2018 | 48 comments



Note that this repros on a default unprivileged container, with the default seccomp profile, and SELinux enabled.

Relevant filed issue: https://github.com/moby/moby/issues/36355

There are three immediate workarounds.

1. Dropping `CAP_SYS_RAW` in docker run will prevent the raw socket from being opened (and is always a good idea).

2. Placing each container on its own bridge network blocks this communication, but requires you to manually specify the subnets of those networks to work around Docker's default limit of 31 networks (sketched below).

3. Create ebtables rules to filter below the IP layer (at layer 2); see the repo for an example.
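
A rough sketch of workarounds 1 and 2 (network names and subnets here are just placeholders; the capability that actually gates raw sockets is CAP_NET_RAW, as clarified below):

  docker network create --subnet 172.25.1.0/24 isolated_net_1
  docker network create --subnet 172.25.2.0/24 isolated_net_2
  docker run --cap-drop NET_RAW --network isolated_net_1 some_image
  docker run --cap-drop NET_RAW --network isolated_net_2 other_image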


For #1, did you mean CAP_NET_RAW or CAP_SYS_RAWIO?


CAP_SYS_RAWIO looks even more terrifying, but CAP_NET_RAW gates your ability to use raw sockets.

https://linux.die.net/man/7/capabilities
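
A quick way to see the gate in action (hypothetical image and command; assumes a Python interpreter is available in the image):

  docker run --rm --cap-drop NET_RAW python:3 \
    python3 -c "import socket; socket.socket(socket.AF_PACKET, socket.SOCK_RAW)"
  # -> PermissionError: [Errno 1] Operation not permitted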


I raised this stuff in 2014.

https://github.com/moby/moby/issues/5661


FreeBSD's jails haven't allowed raw socket access by default for as long as I have been using them (~10 years), and probably much longer.

edit: Since 2004 to be precise.

https://www.freebsd.org/releases/5.3R/relnotes-amd64.html

So well documented. Gotta love it.


Does ping or traceroute work in jails by default?

It sounds like the Docker folks aren't thrilled with enabling raw sockets by default, but enough people expect outbound ping to work.

(Linux's network stack does have an unprivileged ping, controlled by setting sysctl net.ipv4.ping_group_range appropriately; recent ping binaries support using it but I think most distros don't enable the sysctl. I'm guessing Docker could enable it within containers if they weren't worried about older distros that predate this support in ping.)
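
For reference, that sysctl looks like this (the range here is just an example that covers all groups):

  sysctl -w net.ipv4.ping_group_range="0 2147483647"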


Not by default. I've run two kinds of jails:

a) I didn't need much separation, I just wanted to run multiple environments on one box to save hardware; in this case, I enabled raw sockets, because it was convenient to ssh into the virtual environment and ping things.

b) things that I wanted really separate; for these environments, I left raw sockets disabled and the jail only contains the one executable (statically compiled); additionally, I also set up ipfw rules to prevent IP traffic from the jail from getting in or out, other than the specific things it was intended to do (roughly sketched below).
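
For illustration, case (b) looks roughly like this in /etc/jail.conf (names, paths, and addresses are made up):

  minimal {
      path = /jails/minimal;
      ip4.addr = 192.0.2.10;
      allow.raw_sockets = 0;    # the default; raw sockets stay disabled
      exec.start = "/service/app";
  }

plus a handful of ipfw rules on the host restricting that address to the specific traffic the jail needs.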


Does disabling raw sockets still allow the various IPv6 ICMP messages that can be necessary for things like PMTU discovery to happen?


It shouldn't matter: the permission is for user-mode programs to access raw sockets, and user-mode programs aren't needed to generate and handle normal path-MTU ICMP messages (on either IPv4 or IPv6).


Supposedly, that happens a layer below the socket API so restricting the creation of raw sockets by userland shouldn't have an impact.


How does the isolation compare with docker in general?

Not familiar with FreeBSD jails.


I'm sure this isn't a total list of features, but Wikipedia's comparison shows jails having all the features of Docker. https://en.wikipedia.org/wiki/Operating-system-level_virtual...


As someone who has used both extensively, I wouldn't say Docker and Jails are directly comparable; Jails are a lot more comparable to something like LXC.

Jails don't have any of the deployment tooling: tools for creating and building images, a registry for images, etc. Docker also automates a lot of networking stuff away, whereas with jails I usually have to use something like PF to set up port forwards; I also sometimes run into problems with the fact that a jail shares the network stack with the host.

That being said, all the configuration is quite simple, and building exactly what I need for running an application isn't too difficult. The big advantage after that is that I understand how everything is wired together and can debug more easily when something goes wrong. Doing it manually, however, takes up much more time.


Terribly late reply, sorry, but the deployment tools are all in ZFS. Giving each jail its own ZFS filesystem lets you have templates with a bunch of steps done, or mass-copy a complete jail. And with ZFS send/receive, you can share them.
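
For example (dataset names made up):

  zfs snapshot zroot/jails/template@base
  zfs clone zroot/jails/template@base zroot/jails/web01        # new jail from the template
  zfs send zroot/jails/template@base | ssh otherhost zfs recv tank/jails/template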


Docker is a high-level container management tool, which uses container execution technologies underneath. Initially they used LXC, which is comparable to Jails, and in fact there's someone porting Docker to use FreeBSD Jails: https://github.com/kvasdopil/docker/blob/freebsd-compat/FREE...


Even if you used ebtables to filter out containers talking to each others' MAC addresses, wouldn't they be able to send broadcast or multicast packets to communicate with each other?

I guess it's not clear to me if the vulnerability/bug/whatever here is "two conspiring containers can establish a covert channel" or "a malicious container can send normal-looking traffic to a container, bypassing that container's firewall rules."

It does seem like the right answer is unique bridge networks per container. On physical networks, it's hard to prevent two untrusted devices on the same L2 domain from establishing a covert channel. (And it's hard to prevent two networked untrusted, conspiring devices anywhere on the internet from establishing a covert channel, if they're trying hard enough.)


> Even if you used ebtables to filter out containers talking to each others' MAC addresses, wouldn't they be able to send broadcast or multicast packets to communicate with each other?

I'm afraid my initiation to ebtables (and iptables) was yesterday, so feedback such as this is highly appreciated. I will explore this and add a test to the repo.

> I guess it's not clear to me if the vulnerability/bug/whatever here is "two conspiring containers can establish a covert channel" or "a malicious container can send normal-looking traffic to a container, bypassing that container's firewall rules."

In the context of a sandbox, I'd argue the vulnerability is that a malicious container can bypass the firewall. The covert channel is largely inevitable, because ultimately the containers can modulate CPU usage to communicate via CPU timing (I read this in a security paper somewhere; I'm not sure how an actual implementation works).

The impact of bypassing the firewall can be serious if the non-malicious container exposes any service without proper authentication (as is often done behind VPCs in the cloud).

Given that a high number of official images contain known vulnerabilities (according to Docker Hub), this is a potential issue for a large number of Docker users.

> It does seem like the right answer is unique bridge networks per container.

My thinking is along the same lines, although I'm wondering whether a container could still communicate with the host, or whether L2 packets might be able to traverse between bridges. If anyone has input here, it'd be greatly appreciated.


Re: bridge filtering

You can use the iptables physdev extension to match based on a bridge port rather than a MAC address.

So to filter between containers, you can do:

  iptables -I FORWARD -m physdev --physdev-in vetha056126 -j DROP
where vetha056126 is the name of the container's veth adapter's peer interface on the host.
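
Presumably you'd also want the mirror rule for traffic headed out that bridge port, if the goal is full isolation (same caveat about the interface name being an example):

  iptables -I FORWARD -m physdev --physdev-out vetha056126 -j DROP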


Perhaps I don't understand this rule properly, but isn't this dropping all traffic being forwarded from the interface?


When the sysctl net.bridge.bridge-nf-call-iptables=1 is set, bridged packets will traverse iptables rules. I believe this is the default on modern Linux kernels, but may not be set by default on RHEL/CentOS systems.
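
To check or flip it (assuming the br_netfilter module is loaded):

  sysctl net.bridge.bridge-nf-call-iptables
  sysctl -w net.bridge.bridge-nf-call-iptables=1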

The iptables filter table has three chains: INPUT, OUTPUT, FORWARD. In a traditional non-bridge context:

- INPUT means packets entering the host

- OUTPUT means packets leaving the host

- FORWARD means packets being forwarded through the host which are not destined for the host itself.

In the context of a bridge, FORWARD means forwarding packets from one bridge port to another. You can test the simplest case of this with:

  iptables -I FORWARD -j DROP
By adding that rule, two containers on the same host cannot communicate with each other, despite the fact that their communication does not require any traditional routing by the host. This is because the bridged traffic also traverses iptables rules.


> Even if you used ebtables to filter out containers talking to each others' MAC addresses, wouldn't they be able to send broadcast or multicast packets to communicate with each other?

ebtables supports «broadcast» as a destination to match ethernet frames on.

   ebtables … -d broadcast
which is equivalent to

  ebtables … -d ff:ff:ff:ff:ff:ff/ff:ff:ff:ff:ff:ff
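
So, for example, dropping all broadcasts coming in from one container's bridge port would look something like this (the interface name is just an example):

  ebtables -A FORWARD -i vetha056126 -d broadcast -j DROP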


I do not know iptables well (the syntax is very dumb and vile IMNSHO; PF is so much better), but I digress. The rule as outlined would likely allow broadcast and multicast, since the rule is specific to src/dst pairs, so there would be no match for src/broadcast.


ebtables ≠ iptables.

Both are part of the netfilter subsystem in the linux kernel.


edit: not part of the kernel, but userspace utilities that interact with the kernel API.


Is the syntax as hard to parse as iptables? If so, it still sucks :)

I have a friend I worked with back in the startup days at what is now a large company. He worked at AT&T and has his name on RFCs for things that make all this stuff work. We were doing the startup thing, i.e. the people who understand it own it, for the company. No IT, so you do it since you know it... firewall and VPN, etc. (he wrote routing code, I was a CSE), but startup, so... It made me laugh so hard when he said “iptables makes no sense. Black magic. Let’s just use pfSense.” That had always been my view, but hearing someone who wrote the stuff that makes this Internet magic work agree with me was joyful.


> This behavior is highly unexpected, and in highly secure environments, likely to be an issue.

If your "highly secure" environment relies on docker for isolation (a software neither meant nor designed for security), then it hardly qualifies as "highly secure" in the first place.


> If your "highly secure" environment relies on docker for isolation (a software neither meant nor designed for security), then it hardly qualifies as "highly secure" in the first place.

This is just not true. Docker by default requires hardening, but container implementations are fairly secure. With Kernel/OS hardening they are quite secure.

Docker security by default is actually quite good outside of a few issues (like raw sockets), and as long as you don't mount '/' or something. Seccomp is enabled by default to mitigate attack surface on the kernel, SELinux can be enabled to further enforce access control.

Plenty of sandboxes rely on containers now, seemingly successfully. Sandstorm.io is a good example (and where I picked up a lot of tips for securing containers).

If you're convinced otherwise, feel free to share some research.


As the author of Sandstorm, I'm going to say: "ehh... maybe."

There are still some big differences between Sandstorm's and Docker's sandboxes -- not as big as there used to be, but non-trivial. For example, Docker mounts /proc, albeit read-only; Sandstorm doesn't mount it at all. Docker lets the container talk to the network; Sandstorm gives it only a loopback device and forces all external communication through one inherited socket speaking Cap'n Proto. In general, Sandstorm is willing to break more things to reduce attack surface; Docker puts more priority on compatibility.

So is it "secure"? That's not a yes/no question. Security is about risk management; you can always reduce your risk further by jumping through more hoops. There will certainly be more container breakout bugs found in the future, including bugs affecting Docker and affecting Sandstorm. There will probably be more that affect Docker, because its attack surface is larger.

There will also be VM breakouts. VM breakouts appear to be less common than container breakouts, but not by as much as some people seem to think. Anyone who makes a blanket statement that VMs are secure and Docker is not does not know what they are talking about.

The only thing I'd feel comfortable saying is "totally secure" is a computer that is physically unable to receive or transmit information (airgapped, no microphone/speaker, etc.), but that's hardly useful.

In general, if you are going to run possibly-malicious code, the better way to reduce your risk is not to use a better sandbox, but to use multiple layered sandboxes. The effect is multiplicative: now the attacker must simultaneously obtain zero-days in both layers. For example, Google Chrome employs the V8 sandbox as well as a secondary container-based sandbox.

In Sandstorm's case, the "second layer" is that application packages are signed with the UI presenting to you the signer's public identity (e.g. Github profile) at install time, and the fact that most people only let relatively trustworthy users install apps on their Sandstorm servers. (For Sandstorm Oasis, the cloud version that anyone can sign up to, application code runs on physically separate machines from storage and other trusted code, for a different kind of second layer.)


Thanks for commenting here!

I was inspired by Sandstorm's supervisor.c++ while looking into container security, which eventually led me to this issue.

> For example, Docker mounts /proc, albeit read-only;

I can't find too much information on this, does docker mount /proc from the host by default in each of the containers?

> In general, if you are going to run possibly-malicious code, the better way to reduce your risk is not to use a better sandbox, but to use multiple layered sandboxes. The effect is multiplicative: now the attacker must simultaneously obtain zero-days in both layers. For example, Google Chrome employs the V8 sandbox as well as a secondary container-based sandbox.

I think that sandboxing the language is pretty tough in the general case, since every language takes a lot of effort. In non-managed, compiled languages, it will be even tougher.

What is your opinion on tools like SELinux as a secondary layer?


> I can't find too much information on this, does docker mount /proc from the host by default in each of the containers?

I'm curious what kentonv has to say, but on modern kernels docker can make use of PID namespaces, so /proc only shows PIDs from the container's PID namespace. That said, it does still provide several information leaks like /proc/self/mountinfo showing which host directories are mounted where in the container.

In addition to PID namespaces, another isolation gotcha is users, and Docker does not enable user namespaces by default since they're a relatively new kernel feature and they break backwards compatibility (e.g. Kubernetes doesn't yet support them). A good example of this in practice, which many people hit issues with in the past, was ulimits: if UID 1000 in a container exceeds a ulimit, it also affects UID 1000 in every other container. Docker solved this by setting ulimits to unlimited on the Docker daemon process, which are then inherited by containers (this also happens to be good for performance). User namespaces are one of the big recent improvements to container security.
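
For reference, user namespace remapping can be enabled daemon-wide (assuming a reasonably recent Docker):

  dockerd --userns-remap=default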


> I can't find too much information on this, does docker mount /proc from the host by default in each of the containers?

Sorry, I should have clarified: It's /proc for the specific PID namespace. So in theory it doesn't leak anything bad. The problem is that it's a huge attack surface; there have been bugs in /proc before.

> I think that sandboxing the language is pretty tough in the general case, since every language takes a lot of effort. In non-managed, compiled languages, it will be even tougher.

WebAssembly sandboxes non-managed compiled languages pretty well. :)

> What is your opinion on tools like SELinux as a secondary layers?

IMO it doesn't help very much, because many of the kinds of kernel bugs that allow you to escape a container tend to allow you to escape SELinux as well.


As someone who also used to hammer on Docker security: it has significantly improved in the last two years. I don't think claims like this have much merit anymore.


Except for this, I presume?


Is the idea that an attacker would enumerate MAC addresses and scan for connections? Or is there a way for an attacker to discover neighboring containers' MAC addresses?


I'm curious about this as well. It seems like this would be pretty difficult to exploit in the real world.


I think some more research is needed here.

I'm wondering if containers could abuse ARP to find the neighboring addresses.


sudo arp-scan --interface=br0 --localnet --bandwidth=8192000 --numeric --retry=1

Should return MAC addresses plus the associated IP addresses,

although nmap would probably yield more interesting results.
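
For instance, something like the following ping sweep (assuming the default docker0 subnet hasn't been changed):

  nmap -sn 172.17.0.0/16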


Since iptables only works on IPv4 packets (EtherType 0x0800), it doesn't really seem unusual that you could still pass frames with other EtherTypes back and forth.

There's a reason why filtering frames at layer 2 is a thing, but I suppose the average developer/Docker user wouldn't really be aware of it. I'm a senior network engineer at $work (an ISP), and there are several places in the network where I do this (the most common is probably at IXPs).

Layer 2 security is something that most people overlook.
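
As a minimal sketch, an L2 allowlist with ebtables that only lets IPv4 and ARP frames be bridged might look like this (protocol names come from /etc/ethertypes):

  ebtables -P FORWARD DROP
  ebtables -A FORWARD -p IPv4 -j ACCEPT
  ebtables -A FORWARD -p ARP -j ACCEPT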


Well, this is kind of unexpected, but "highly secure" environments surely won't be running containers on the default bridge without any of the capabilities configured.


This is true. I suppose the issue is if people naively rely on defaults with ICC disabled.


Would it be possible to spoof Ethernet frames with UDP in them? Like, say, DNS responses to another container's DNS queries? Or to spoof the same MAC address as a peer container?


The interesting question will be how this applies to Kubernetes using flannel, considering that creates a separate network bridge per container.


Kubernetes using flannel creates a veth pair per container, but they're all attached to the same bridge.


I'd be curious what would happen between nodes; could you get to the virtual MAC address of another container on another node? It'd also be interesting to see what happens with Flannel, WeaveNet, and other Docker network implementations.


Why not just put each docker container in its own VM? Problem solved without any ugly hacks, clean and elegant.


What do you need docker for if you put each container in a VM of its own?


One use case is if you're running a multi-tenant service where customers can deploy their own Docker containers. In terms of developer usability, it's way easier to have them push containers to a registry than it is to ask them to package a custom ISO for upload into a VM. To your customers, it appears you are just running Docker, but on the backend you isolate each container in a VM for OS-level security isolation.

The Intel Clear Containers project [0] is working on this.

[0] https://github.com/clearcontainers


I guess the same reason I get a better computer every few years.



