
Container Isolation Gone Wrong - knoxa2511
https://sysdig.com/blog/container-isolation-gone-wrong/
======
steven777400
I enjoy reading descriptions such as these that start with easily detectable
problems and go into a level of debugging far beyond my skill level. It helps
to illuminate some of those "unknown unknowns" that I had never even
considered before.

~~~
oneplane
I had the same feeling, but this also reinforced my view that Docker and
containerization in general (often used as a scapegoat to avoid doing proper
configuration management) 'for the masses' are more problematic than helpful.
In most cases they don't solve anything, but they do add problems that can be
hard to debug. The actual 'lack' of isolation wouldn't have happened with true
virtualisation, and the method of debugging here is something most people who
think they need containers won't have.

To me, debugging like this is something that should be far more important to
people than slinging words like Docker and NodeJS around all day (and then
mostly on Discord or, to them, the older Slack, but not IRC because that is
too hard for that crowd -- a totally unfounded opinion/rant).

~~~
dasil003
Docker didn't cause this problem; the point of the article is that Docker
doesn't _prevent_ all such problems. On the other hand, it does solve a lot of
packaging, dependency, and environment-parity problems that traditional
virtualization is too heavyweight to address.

I'm old enough to also be frustrated with buzzword-driven development, and
it's pretty annoying that so many believe Docker invented containerization,
but don't throw the baby out with the bathwater. Containerization is an
awesome tool and orthogonal to config management.

~~~
eropple
Traditional virtualization is "too heavy", now, for solved problems like
packaging? How and why?

~~~
omginternets
For the reasons mentioned in the article:

- slow (re)start times

- greater resource consumption

Granted, "too heavy" is relative, but starting a few hundred VMs on a single
host (assuming commodity hardware) is not going to work very well.

~~~
user5994461
That works just fine. It's typical in a VM farm to have hosts with a hundred
VMs.

~~~
omginternets
>It's typical in a VM farm to have hosts with a hundred VMs.

Several things:

1. We may have a different definition of "commodity hardware", but you're
missing the broader point.

2. The broader point is that VMs are significantly less resource-efficient.

3. 1 & 2 notwithstanding, you're conveniently ignoring the issue of (re)start
time.

4. It's fine to use VMs, but it's frankly _bizarre_ to fight tooth-and-nail
over the ridiculous notion that they should always be preferred over
containers.

~~~
user5994461
I am simply addressing the fact that it's perfectly fine and common to have
hosts with a hundred VMs and it works flawlessly.

VMs are memory intensive because they duplicate the operating system. The
starting point is around 500 MB per VM. That's the only meaningful difference
in resources compared to containers.
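
Back-of-envelope, using the 500 MB-per-VM figure above (the per-container
overhead is my own assumption, purely for illustration):

```python
# Rough memory-overhead comparison: each VM carries a full guest OS,
# while containers share the host kernel. Figures are illustrative only;
# the 500 MB baseline is from the comment above, the 5 MB is a guess.
GUESTS = 100
VM_OS_OVERHEAD_MB = 500
CONTAINER_OVERHEAD_MB = 5

vm_total = GUESTS * VM_OS_OVERHEAD_MB            # memory spent on OS copies
container_total = GUESTS * CONTAINER_OVERHEAD_MB  # runtime overhead only

print(vm_total, container_total)  # 50000 500
```

So on a 64 GB host, the OS duplication alone eats most of the memory before
any workload runs, under these assumed numbers.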

I am not disputing that they have different start and stop times.

~~~
omginternets
>I am simply addressing the fact that it's perfectly fine and common to have
hosts with a hundred VMs and it works flawlessly.

And that was never the point.

~~~
user5994461
Yet that was your conclusion.

------
gtirloni
Ah, large directories and dentries... the bane of any NAS operator. Having
seen hundreds of OpenSolaris appliances being abused in similar ways, I can
relate.

It doesn't seem like Kubernetes supports I/O resource limiting at this point
[0][1].

In any case, after a problem like this is identified, a cluster admin can use
pod affinity/anti-affinity to avoid both apps co-existing on the same node
[2].

EDIT: For a hypervisor-based container runtime, check Frakti
([https://github.com/kubernetes/frakti](https://github.com/kubernetes/frakti))

0 - [https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-types](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-types)

1 - [https://www.kernel.org/doc/Documentation/cgroup-v1/blkio-controller.txt](https://www.kernel.org/doc/Documentation/cgroup-v1/blkio-controller.txt)

2 - [http://blog.kubernetes.io/2017/03/advanced-scheduling-in-kubernetes.html](http://blog.kubernetes.io/2017/03/advanced-scheduling-in-kubernetes.html)
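
The anti-affinity approach in [2] boils down to a few fields on the pod spec;
a sketch of those fields as a Python dict (the `app: noisy-neighbor` label is
a made-up example, not from the article):

```python
# Fragment of a Kubernetes pod spec asking the scheduler not to place this
# pod on a node that already runs pods labeled app=noisy-neighbor.
# Field names follow the pod API; label values are hypothetical.
anti_affinity = {
    "affinity": {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    # match the pods we want to stay away from
                    "labelSelector": {
                        "matchLabels": {"app": "noisy-neighbor"}
                    },
                    # "one per node" granularity
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }
}

print(anti_affinity["affinity"]["podAntiAffinity"]
      ["requiredDuringSchedulingIgnoredDuringExecution"][0]["topologyKey"])
```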

~~~
smarterclayton
Regarding why block IO limiting isn't implemented yet in Kube: it's really
hard to make block IO sharing work well without killing performance (it's easy
for one workload to screw up another if it seeks at the wrong time, and SSDs
are fast enough that just having to check the IO limits may severely limit
your max throughput). If you read some of the proposals for IO, the end goal
is to make it easy to use multiple volumes per workload where possible, and to
have high-level limits in place for other things like inodes, total writes,
etc.

~~~
skyde
Isn't that what the BFQ scheduler was designed for?

~~~
smarterclayton
Yes, although even the 4.12 code can have substantial overhead vs NOOP
(Phoronix benchmarks, as an example). It's not a complete slam dunk: turning
it on for IO-indifferent workloads might make sense, but not necessarily on
all boxes.

[http://www.phoronix.com/scan.php?page=article&item=linux-412...](http://www.phoronix.com/scan.php?page=article&item=linux-412-io&num=2)
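
For what it's worth, the active scheduler for a block device shows up
bracketed in `/sys/block/<dev>/queue/scheduler`; a small parsing sketch, run
against a sample string here rather than a live system:

```python
def active_scheduler(scheduler_line: str) -> str:
    """Return the bracketed (active) entry from a sysfs scheduler line,
    e.g. "noop deadline [cfq]" -> "cfq"."""
    for token in scheduler_line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active scheduler marked")

# On a real box: open("/sys/block/sda/queue/scheduler").read()
sample = "noop deadline [cfq]"
print(active_scheduler(sample))  # cfq
```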

~~~
AstralStorm
Phoronix tested BFQ in low-latency mode on throughput, so the results are
obvious. There is a simple twiddle for that.

Of course they didn't test fairness or latency; that is too hard.

~~~
smarterclayton
Yeah, I'd love to see comprehensive benchmarks of competing workloads on
larger scale boxes. If workload isolation can be achieved with moderately low
overhead (15%?), there would be a lot of interest in pushing a default setup
with BFQ in Kubernetes once more stable kernel streams have it available.

------
jmull
Nice problem solving.

I'd classify the primary root cause as a kernel bug. It's good to make use of
otherwise unused memory for caches, but not to the extent that the caches grow
so large they slow things down.

Secondarily, there's probably something wrong in a system where you have to
constantly poll and attempt to access large numbers of files that don't exist.
(But probably 100% of systems that do anything useful have at least some weird
cruft like this somewhere in them at any given time, so I'm not judging.)
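
That access pattern, repeatedly probing paths that don't exist (on Linux, each
miss leaves a negative dentry in the kernel's cache), is easy to reproduce in
miniature:

```python
import os
import tempfile

# Probe many distinct nonexistent paths. On Linux, each failed lookup
# leaves a negative dentry behind; repeated across millions of unique
# names, this is what bloats the cache described in the article.
with tempfile.TemporaryDirectory() as d:
    misses = 0
    for i in range(1000):
        try:
            os.stat(os.path.join(d, f"missing-{i}.dat"))
        except FileNotFoundError:
            misses += 1

print(misses)  # 1000
```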

~~~
solatic
The author notes that future kernels did decide to introduce limits that
would've prevented the slowdowns from happening, but the customer was running
an outdated kernel.

That's what made the article disappointing for me. Do all this impressive in-
kernel debugging just to find out that you should've upgraded your systems
first. Sigh...

~~~
jdmichal
But the fix was to limit the cache based on process memory constraints.
jmull's point is that if a cache is permitted to blow out so big that lookups
are impacting performance, then that cache is not really serving its purpose
in the first place.

~~~
mcherm
Other fixes would have been possible. If the hash table could be resized when
necessary there would never have been a performance problem.
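
A toy model of that tradeoff: with chaining, expected lookup cost tracks the
load factor, so a bucket array fixed at boot degrades linearly with element
count, while a resized one stays flat. (Numbers are illustrative, not the
kernel's actual sizing.)

```python
# With separate chaining, expected lookup work is roughly the load factor:
# elements divided by buckets.
def expected_chain_length(n_items: int, n_buckets: int) -> float:
    return n_items / n_buckets

# Bucket array fixed at boot: chains grow with N, lookups degrade
# toward a linked-list scan.
print(expected_chain_length(1_000_000, 1024))     # 976.5625

# Array resized to keep load factor near 1 (the fix suggested above):
print(expected_chain_length(1_000_000, 1 << 20))  # ~0.95
```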

------
itaysk
Really enjoyed this, thanks! One lesson I learned from this, and correct me if
I'm misinterpreting the root cause, is that more memory is not always better.
This is, to me, by far the most powerful lesson I gained from this article.

------
Walf
What I want to know is why the 'trasher' was looking directly for so many
different files. Could it not have parsed the output of listing existing
files to find its targets?

------
everdayimhustln
Containers != a type 1 hypervisor with all-encompassing resource quotas,
reservations, and prioritization like VMware ESXi. The problem can be solved
either by using a hypervisor that deploys only one container per VM (with
suitable paravirt/dedupe), or by fixing the OS to offer much finer-grained
resource contention and allocation knobs for each and every limited resource.
The latter is superior when only a single OS is needed, because it reduces the
need for virtualization as a crutch for inadequacies of the OS.

~~~
AlphaSite
[https://www.vmware.com/products/vsphere/integrated-containers.html](https://www.vmware.com/products/vsphere/integrated-containers.html),
and I’ve developed a small Docker backend for Xen previously (which is
vaguely similar to VIC).

------
mathattack
Lately I've learned to panic whenever a job candidate starts dropping Docker
and Kubernetes in the interview.

~~~
striking
Maybe if they think it's a one-stop solution to hosting problems, yeah. But
penalizing candidates just for familiarity with new technology?

~~~
mathattack
The issue is more with people dropping buzzwords rather than explaining what
they're actually using them for.

~~~
striking
In my opinion, that's entirely the wrong mentality. If someone gave me a pile
of raw buzzwords, or told me they used entirely the wrong tool because of
buzzwords, then I'd penalize them, sure.

But say they made something wonderful, and it was cleaner and more efficient
because of their use of Docker/Kubernetes, and they had taken the time to
figure out the tradeoffs inherent to that approach. Is that worth penalizing,
from your point of view?

~~~
mathattack
Of course not. My issue is with leading with the tool, rather than the problem
it's solving.

------
luord
This was a great read. The conclusion, that you should always monitor no
matter what technologies you're using, should be obvious, but I've noticed
that it really isn't, unfortunately.

Even I fall into the trap. Sometimes I wish I knew about all this stuff, but,
alas, I prefer development.

------
rfraile
Closely related to this other story:
[https://blog.booking.com/troubleshooting-a-journey-into-the-unknown.html](https://blog.booking.com/troubleshooting-a-journey-into-the-unknown.html)

------
doomrobo
What's the solution here? Can you limit d_entry table size per-process? Do you
have to limit it globally? Is the answer to just not use containers?

~~~
gighi
OP here

The solution is very simple: as mentioned in the article, just use a newer
kernel and always set memory limits for containers. The blog post is based on
an older kernel (2.6.32) that quite a few people irresponsibly still use in
containerized environments, mostly because EL6 is so popular among
enterprises.

In newer kernels, allocations from object pools are tied to the limits of the
memory cgroups that requested them from userspace, if any, so you wouldn't run
into this specific issue; you would just, effectively, have a container unable
to use more than X MB of dcache entries (although there are probably other
minor issues, for example related to sharing global kernel mutexes and such).
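
For anyone who wants to watch this happen, system-wide dentry counts are
exposed in `/proc/sys/fs/dentry-state` (six integers; the later fields vary
by kernel version). A parsing sketch, fed a sample line here instead of the
live file:

```python
def parse_dentry_state(text: str) -> dict:
    """Parse /proc/sys/fs/dentry-state: six whitespace-separated integers.
    The first two are nr_dentry (total) and nr_unused; the remaining
    fields (age_limit, want_pages, ...) vary by kernel version."""
    fields = [int(x) for x in text.split()]
    return {"nr_dentry": fields[0], "nr_unused": fields[1]}

# Sample contents; on a real system:
#   open("/proc/sys/fs/dentry-state").read()
sample = "87950 60803 45 0 0 0"
print(parse_dentry_state(sample))  # {'nr_dentry': 87950, 'nr_unused': 60803}
```

Polling this while the offending workload runs would show nr_dentry climbing
without bound on an unpatched kernel.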

~~~
deepsun
I couldn't understand two things from the article:

1. If one of the two containers caused the issue, then why did you need both
of the containers to reproduce the issue? Why was running just the offending
one not enough?

My guess is that the "worker" container requested those non-existent files
from a volume mounted by the other container, is that right?

2. The kernel hash table implementation. The whole point of a hash table is
that its size is O(N), where N is the number of elements it holds.

Capping the hash table size at some constant and putting all the excess
elements into its linked lists makes it perform like a linked list, divided by
the constant, which is no surprise. So it sounds like there's a bug in the
dentry hash table implementation: it should either grow its size in line with
the element count, or stop accepting new entries/evict old ones.

~~~
gighi
> 1. If one of the two containers caused the issue, then why did you need
> both of the containers to reproduce the issue? Why was running just the
> offending one not enough?

Running just the offending one would clearly have been enough, since its
effects would have caused the same increased latency for every other process
in the system (including itself). However, using a second container to observe
the performance degradation proves the point that one container is able to
affect another one, which is sort of the gist of the article, since too many
people think containers provide much more isolation than they actually do.

> My guess is that the "worker" container requested those non-existent files
> from a volume mounted by the other container, is that right?

No, the containers didn't share any volume. The dentry cache is effectively a
singleton within the kernel, so even if the sets of volumes don't overlap, all
processes in the system will see a performance degradation, regardless of
where the files being accessed reside.

> 2. The kernel hash table implementation. The whole point of a hash table is
> that its size is O(N), where N is the number of elements it holds.

Your speculation is correct; however, there are sound reasons for doing such a
thing in the kernel (and for not allowing the main array of the hash table to
dynamically expand/shrink), so I wouldn't consider it a bug per se. I'll refer
you to this excellent comment:
[https://news.ycombinator.com/item?id=14660954](https://news.ycombinator.com/item?id=14660954)

~~~
deepsun
Thank you. Very good article, thank you for writing it!

------
cratermoon
I knew before I got into the meat of the article that it was going to be I/O
contention. The first two sections talked about memory and CPU limits on the
containers, but nothing about I/O rates. This was a Known Problem back in the
90s, when a variety of filesystems (DEC's AdvFS being one) were created to
address the issues around dentries and inodes. See also
[http://www.starcomsoftware.com/proj/usenet/doc/c-news.pdf](http://www.starcomsoftware.com/proj/usenet/doc/c-news.pdf)

~~~
RijilV
I think they're actually talking about blowing out the dentry cache, but sure,
IO contention is another shared resource in containers. Depending on what
you're doing you might run into issues in various networking limits
(somaxconns comes to mind, as does what's left of the route cache), blowing
out the page cache, or something eating up your memory bandwidth (maybe via
some unlucky NUMA placement).

I'm all for containers, but they don't solve the hard problems folks often
ascribe to them; really, it just shows that in most cases you don't need to
solve the hard problems. Most of the time, what containers are buying you is
an easy deployment method that leverages some nice features in the OS to make
believe you're on separate machines.

~~~
bogomipz
>"Depending on what you're doing you might run into issues in various
networking limits (somaxconns comes to mind, as does what's left of the route
cache)"

I'm curious what issue(s) you might be referring to here with the route cache?
Could you elaborate?

~~~
RijilV
Sure - the route cache was largely removed in (IIRC) 3.8, but there are still
entries that get stored[0]. There's a limit to how many entries Linux will
store, and like any LRU-esque data structure, rapidly cycling entries through
it isn't going to do anything wonderful for your performance, never mind if
you actually expected to use any of the cached data for a business performance
'feature'.

25g NIC is an awful lot of 60byte packets. I'm not saying this is going to be
a common concern, just that, like any other shared kernel resource, cgroups
and namespaces aren't going to help.

0: [https://www.systutorials.com/docs/linux/man/8-ip-tcp_metrics/](https://www.systutorials.com/docs/linux/man/8-ip-tcp_metrics/)
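
The LRU-thrashing point generalizes to any fixed-size cache; a quick sketch
using Python's `functools.lru_cache` as a stand-in for a bounded kernel cache:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def lookup(key: int) -> int:
    return key * 2  # stands in for an expensive route/metric computation

# Working set fits in the cache: after the first pass, every access hits.
for _ in range(10):
    for k in range(100):
        lookup(k)
small_hits = lookup.cache_info().hits  # 900 hits, 100 misses

lookup.cache_clear()

# Working set cycles through far more keys than the cache holds: every
# access evicts an entry that will be needed soon, so hits drop to zero.
for _ in range(10):
    for k in range(1000):
        lookup(k)
print(lookup.cache_info().hits, lookup.cache_info().misses)  # 0 10000
```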

~~~
bogomipz
Sure, that makes sense, and thank you for the link. I was curious about your
comment:

>"25g NIC is an awful lot of 60byte packets."

Where are you getting that 60 number from? A minimum IPv4 header is 20 bytes
and a minimum TCP header is 20 bytes. Also, how would a tiny TCP packet relate
to the route cache? Tiny TCP packets are certainly a problem with the PPS a
NIC is capable of, I understand that. Cheers.

------
digi_owl
Linux has all manner of oddities when it comes to IO, it seems.

