
Containers from first principles - setheron
https://fzakaria.com/2020/05/31/containers-from-first-principles.html
======
disqard
I recently discovered systemd-nspawn and was amazed at how lightweight a basic
container can be.

~~~
westurner
"Docker Without Docker" (2015) explains /sbin/init and systemd-nspawn. Systemd
did not exist when docker was first created.
[https://chimeracoder.github.io/docker-without-docker/](https://chimeracoder.github.io/docker-without-docker/)

~~~
oso2k
That's not what I remember. Wikipedia backs up that recollection as well [0],
marking systemd's initial release in 2010. Docker's initial release is listed
as 2013 [1]. Maybe that was true of the dotCloud internal releases, and
certainly, not all of the EL and other Linux distros had adopted systemd
during 2010 - 2013. Certainly after 2014 or 2015, systemd had spread to the
major Linux distros so Docker could have chosen to take a systemd-based
approach at that point.

[0]
[https://en.wikipedia.org/wiki/Systemd](https://en.wikipedia.org/wiki/Systemd)

[1]
[https://en.wikipedia.org/wiki/Docker_(software)](https://en.wikipedia.org/wiki/Docker_\(software\))

~~~
westurner
Are there other systemd + containers solutions?

"Chapter 4. Running containers as Systemd services with Podman"
[https://access.redhat.com/documentation/en-
us/red_hat_enterp...](https://access.redhat.com/documentation/en-
us/red_hat_enterprise_linux_atomic_host/7/html/managing_containers/running_containers_as_systemd_services_with_podman)

AFAIU, when running containers with systemd:

\- logs go to journald by default

\- there's no docker-compose for just the [name-prefixed] containers in the
docker-compose.yml,

\- you can use systemd unit template parametrization

\- it's not as easy to collect metrics on every container on the system
without a _read-only_ docker socket: how many containers are running, how much
RAM quota are they assigned and utilizing? What are the filesystem and port
mappings?

\- you can run containers as non-root

\- you can run containers in systemd timer units

\- you use runC to handle seccomp

... You can do cgroups and namespaces with just systemd; but keeping
chroots/images upgraded is outside the scope of systemd: where is the ideal
boundary between systemd and containers?

See this comment regarding _per-container_ MAC MCS labels:
[https://news.ycombinator.com/item?id=23430959](https://news.ycombinator.com/item?id=23430959)

There's much additional complexity that justifies k8s / OpenShift: when would
I want to manage containers with just systemd units?
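
To make the "systemd unit template parametrization" point concrete, here is a minimal sketch that generates the text of a template unit (for example `container@.service`). It is only an illustration under assumptions: the function name `podman_unit_template`, the `/usr/bin/podman` path, and the image name are all made up for the example, and a real deployment would likely need more options.

```python
def podman_unit_template(image, description="Container %i"):
    """Return the text of a systemd template unit for one container image.

    %i is systemd's specifier for the instance name, so starting
    container@web.service would run a container named "web".
    """
    lines = [
        "[Unit]",
        f"Description={description}",
        "After=network-online.target",
        "Wants=network-online.target",
        "",
        "[Service]",
        "# Run in the foreground so systemd supervises the container process",
        "# and its stdout/stderr end up in the journal.",
        f"ExecStart=/usr/bin/podman run --rm --name %i {image}",
        "ExecStop=/usr/bin/podman stop %i",
        "Restart=on-failure",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


# The image name below is a placeholder, not something from the thread.
print(podman_unit_template("docker.io/library/nginx:latest"))
```

Saved as `/etc/systemd/system/container@.service`, one file like this could back any number of instances (`container@web`, `container@api`, ...), which is roughly what a name prefix gives you in docker-compose.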

------
peterwwillis
> Each container should have at least the following isolated: network stack,
> filesystem, processes

I'm going to go out on a limb here and say that in 99% of cases, there is no
benefit to the network isolation. It adds unnecessary overhead and complexity,
all so we don't have to configure individual services with a unique listening
port. But the host that exposes the virtual networks still has to route
something to them, so you still need to assign an arbitrary port and then do
some port forwarding.

Going even further in the unnecessary abstractions category is process
isolation. In most cases we don't need that either. From a security
perspective, I trust the Linux kernel about as far as I can throw it, so I
don't care what guarantees there supposedly are; container breakout and local
privesc are (in my opinion) a near-certainty. So why are we forcing ourselves
to jump through tons of hoops just to heap dump or ptrace() an application?
The regular-old security mechanisms in Linux are enough for every other
process on the system, why not container processes?

Filesystem abstraction (copy-on-write overlays and chroots) has been the
killer feature of containers since day one. That is the one thing about
containers that makes them useful: a reproducible application snapshot without
dependency management hell. If we strip everything else about containers away,
this is the one thing we need to keep the useful purpose of a container.

Docker threw in a lot of incredibly useful extra features, such as the
Dockerfile (no more configuration management!) and overlays and build cache
and layers, etc. Nobody would be using containers if all these features
weren't present in one solution, and we all owe them a big debt and thanks.
But if we really strip the container down to its essential useful element,
it's basically just a wrapper around chroot().

~~~
Terretta
> _That is the one thing about containers that makes them useful: a
> reproducible application snapshot without dependency management hell._

Joe Stein, of Kafka renown, calls containers "21st century tarball".

// I realize I've mentioned this before, in 2017's "My VM is lighter and safer
than your container":
[https://news.ycombinator.com/item?id=15614777](https://news.ycombinator.com/item?id=15614777)

~~~
infogulch
My favorite is "static linking for millennials"

------
kcolford
Why do we need to use mount and pivot_root when we have chroot available? Am I
missing something here about why those can't be used?

~~~
setheron
I'm pretty sure you can escape a chroot easily with relative paths.

In Linux it never surprises me that there are X ways to do Y. It's a side
effect of the OSS ecosystem and of not wanting to break compatibility.

This guide was meant for newbies, so it doesn't broach these security
concerns.
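
For anyone wondering what the mount + pivot_root sequence looks like, here is a rough sketch. chroot() only changes where one process starts its path lookups, and the old root stays mounted and reachable by a privileged process, whereas pivot_root swaps the root mount itself so the old one can be unmounted. The code below only builds and prints a shell script, it does not run it. The function name and the `/tmp/rootfs` path are made up for the example, and it assumes the script is later run as root inside a private mount namespace (e.g. under `unshare --mount`) with a rootfs that contains `umount` and `rmdir`.

```python
import shlex


def pivot_root_script(new_root):
    """Build (but do not run) a shell script that swaps the root filesystem."""
    root = shlex.quote(new_root)
    steps = [
        "set -eu",
        # pivot_root needs new_root to be a mount point: bind-mount it onto itself.
        f"mount --bind {root} {root}",
        f"cd {root}",
        "mkdir -p old_root",
        # new_root becomes "/"; the previous root is moved under /old_root.
        "pivot_root . old_root",
        "cd /",
        # Lazily detach the old root so host paths are no longer reachable.
        "umount -l /old_root",
        "rmdir /old_root",
    ]
    return "\n".join(steps) + "\n"


# Dry run only: print the plan instead of executing it.
print(pivot_root_script("/tmp/rootfs"))
```

The important part is the ordering: bind mount first, pivot second, unmount of the old root last. After that last step there is no path back to the host filesystem, which is the property a plain chroot does not give you.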

------
mehrdadn
I was so confused why Fareed Zakaria would be talking about containers until I
Googled and read the name more carefully...

~~~
chrisweekly
Heh, apparently you're not the only one. Way down at the bottom of the linked
page:

"I'm a software engineer, father and wishful amateur surfer. If you've come
seeking my political views; you've found the wrong Fareed."

~~~
setheron
(I'm the author)

It's a long running joke that I've come to terms with.

------
chrisweekly
Favorited; thanks for this useful, well-written post!

~~~
setheron
Thank you (author). This was a written version of a live tutorial I gave my
peers.

It's always challenging to translate shell-focused teaching into prose but I'm
glad it struck a chord.

------
westurner
> _Many people might think the word “container” has a specific meaning within
> the Linux kernel; however the kernel has no notion of a “container”. The
> word has been synonymous with a variety of Linux tooling which when applied
> give the resemblance of what we expect a container to be._

Before LXC ( [https://LinuxContainers.org](https://LinuxContainers.org) ) and
CNCF ( [https://landscape.cncf.io/](https://landscape.cncf.io/) ) and OCI (
[https://opencontainers.org/](https://opencontainers.org/) ), for shared-
kernel VPS hosting ("virtual private server"; root on a shared box), there was
OpenVZ (which requires a patched kernel and AFAIU still has features, like
bursting, not present in cgroups).

Docker no longer has an LXC driver: libcontainer (opencontainers/runc) is the
story now. The LXC docs have a great list of utilized kernel features that's
also still true for docker-engine = runC + moby. The LXC docs:
[https://linuxcontainers.org/lxc/introduction/](https://linuxcontainers.org/lxc/introduction/)
:

> _Current LXC uses the following_ kernel features _to contain processes:_

> _## Kernel namespaces (ipc, uts, mount, pid, network and user)_

>> _Namespaces are a feature of the Linux kernel that partitions kernel
resources such that one set of processes sees one set of resources while
another set of processes sees a different set of resources._
[https://en.wikipedia.org/wiki/Linux_namespaces](https://en.wikipedia.org/wiki/Linux_namespaces)
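
A quick way to see namespaces from userspace: every process has symlinks under `/proc/<pid>/ns`, and two processes are in the same namespace when the link targets match. A minimal sketch (the helper names `parse_ns_link` and `namespaces_of` are made up here, and it simply returns an empty dict where `/proc` is not available):

```python
import os
import re

# Link targets look like "pid:[4026531836]": namespace type plus an inode number.
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a /proc/<pid>/ns link target into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Map each entry in /proc/<pid>/ns to its namespace inode number."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return result  # not Linux, no such pid, or no permission
    for entry in entries:
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        except (OSError, ValueError):
            continue
        result[entry] = inode
    return result


for name, inode in sorted(namespaces_of().items()):
    print(f"{name:20} {inode}")
```

Running this on the host and again inside a container should show different inode numbers for the namespaces the container runtime unshared (typically mnt, pid, net, ipc, uts).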

> _## Apparmor and SELinux profiles_
> [https://en.wikipedia.org/wiki/AppArmor](https://en.wikipedia.org/wiki/AppArmor)
> / [https://en.wikipedia.org/wiki/Security-
> Enhanced_Linux](https://en.wikipedia.org/wiki/Security-Enhanced_Linux)

udica is an interesting tool for creating SELinux policies for containers.

Is it possible for each container to run confined with a different SELinux
label?

> _## Seccomp policies_
> [https://en.wikipedia.org/wiki/Seccomp](https://en.wikipedia.org/wiki/Seccomp)

See below re: Seccomp.

> _## Chroots (using pivot_root)_
> [https://en.wikipedia.org/wiki/Chroot](https://en.wikipedia.org/wiki/Chroot)

Chroots and symlinks, Chroots and bind mounts, Chroots and _overlay
filesystems_ , Chroots and SELinux context labels.

FWIU, Chroots are a native feature of filesystem syscalls in Fuchsia.

> _## Kernel capabilities_

[https://wiki.archlinux.org/index.php/Capabilities](https://wiki.archlinux.org/index.php/Capabilities)
:

>> _"Capabilities (POSIX 1003.1e, capabilities(7)) provide fine-grained
control over superuser permissions, allowing use of the root user to be
avoided. Software developers are encouraged to replace uses of the powerful
setuid attribute in a system binary with a more minimal set of capabilities.
Many packages make use of capabilities, such as CAP_NET_RAW being used for the
ping binary provided by iputils. This enables e.g. ping to be run by a normal
user (as with the setuid method), while at the same time limiting the security
consequences of a potential vulnerability in ping."_
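
The capability sets of a process show up as hex bitmasks (`CapEff`, `CapPrm`, ...) in `/proc/<pid>/status`. Here is a small sketch that decodes such a mask. The table is deliberately partial, the helper names are made up, and the sample status text is invented for the example rather than captured from a real process.

```python
# Partial table of capability bit numbers from capabilities(7);
# anything not listed is reported as "cap_<bit>".
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    7: "CAP_SETUID",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(mask_hex):
    """Turn a hex capability mask into a list of capability names."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            names.append(CAP_NAMES.get(bit, f"cap_{bit}"))
        bit += 1
    return names


def cap_eff_from_status(status_text):
    """Extract the CapEff hex string from /proc/<pid>/status text, if present."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split(":", 1)[1].strip()
    return None


# Invented sample: a process holding only CAP_NET_RAW, like the ping example.
SAMPLE_STATUS = "Name:\tping\nCapEff:\t0000000000002000\n"
print(decode_caps(cap_eff_from_status(SAMPLE_STATUS)))  # ['CAP_NET_RAW']
```

Feeding it the real contents of `/proc/self/status` inside a container versus on the host is an easy way to see how much of root's power the runtime has dropped.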

> _## CGroups (control groups)_
> [https://en.wikipedia.org/wiki/Cgroups](https://en.wikipedia.org/wiki/Cgroups)

Control groups enable per-process (and thus per-container) resource quotas.
Other than limiting the impact of resource exhaustion, cgroups are not a
security feature of the Linux kernel.
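
To find out which cgroup a process belongs to, you can read `/proc/<pid>/cgroup`, where each line has the form `hierarchy-ID:controller-list:path`. A minimal parsing sketch (the function name is made up, and the sample lines are invented for illustration):

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup text into a list of dicts.

    cgroup v2 shows a single line with hierarchy 0 and an empty controller
    list ("0::/path"); cgroup v1 shows one line per mounted hierarchy.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append(
            {
                "hierarchy": int(hierarchy_id),
                "controllers": [c for c in controllers.split(",") if c],
                "path": path,
            }
        )
    return entries


# Invented samples: one cgroup v2 line and one cgroup v1 line.
print(parse_proc_cgroup("0::/system.slice/demo.service\n"))
print(parse_proc_cgroup("4:cpu,cpuacct:/docker/abc123\n"))
```

The path is relative to the cgroup filesystem mount (commonly `/sys/fs/cgroup`), and the quota files for that group live in the corresponding directory.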

Here's a helpful explainer of the differences between some of these kernel
features; which, combined, have become somewhat ubiquitous:

From "Formally add support for SELinux" (k3s #1372)
[https://github.com/rancher/k3s/issues/1372#issuecomment-5817...](https://github.com/rancher/k3s/issues/1372#issuecomment-581797716)
:

> _[https://blog.openshift.com/securing-kubernetes/](https://blog.openshift.com/securing-kubernetes/)_

>> _The main thing to understand about SELinux integration with OpenShift is
that, by default, OpenShift runs each container as a random uid and is
isolated with SELinux MCS labels. The easiest way of thinking about MCS labels
is they are a dynamic way of getting SELinux separation without having to
create policy files and run restorecon._

>> _If you are wondering why we need SELinux and namespaces at the same time,
the way I view it is namespaces provide the nice abstraction but are not
designed from a security first perspective. SELinux is the brick wall that’s
going to stop you if you manage to break out of (accidentally or on purpose)
from the namespace abstraction._

>> _CGroups is the remaining piece of the puzzle. Its primary purpose isn’t
security, but I list it because it regulates that different containers stay
within their allotted space for compute resources (cpu, memory, I/O). So
without cgroups, you can’t be confident your application won’t be stomped on
by another application on the same node._
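
The MCS idea from the quote can be sketched in a few lines. An SELinux context has the shape `user:role:type:sensitivity[:categories]`, and, roughly speaking, a process may touch an object only if its category set is a superset of the object's. Giving every container its own pair of categories therefore keeps containers apart even when they share the same type. This is a simplification under stated assumptions: it ignores type enforcement, sensitivity ranges, and category ranges such as `c0.c1023`, and the labels and helper names below are invented for the example.

```python
def parse_context(context):
    """Split an SELinux context string into its fields (simplified)."""
    parts = context.split(":", 4)
    if len(parts) < 4:
        raise ValueError(f"not an SELinux context: {context!r}")
    categories = parts[4] if len(parts) == 5 else ""
    return {
        "user": parts[0],
        "role": parts[1],
        "type": parts[2],
        "sensitivity": parts[3],
        "categories": frozenset(c for c in categories.split(",") if c),
    }


def mcs_dominates(subject, obj):
    """Simplified MCS check: same sensitivity, subject categories cover obj's."""
    s, o = parse_context(subject), parse_context(obj)
    return s["sensitivity"] == o["sensitivity"] and s["categories"] >= o["categories"]


# Invented labels for two containers and one of their files.
container_a = "system_u:system_r:container_t:s0:c1,c2"
container_b = "system_u:system_r:container_t:s0:c3,c4"
file_of_a = "system_u:object_r:container_file_t:s0:c1,c2"

print(mcs_dominates(container_a, file_of_a))  # True
print(mcs_dominates(container_b, file_of_a))  # False
```

Because the categories are just part of the label, the runtime can pick a fresh pair per container at start time, which is why no per-container policy files are needed.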

From Wikipedia:
[https://en.wikipedia.org/wiki/Seccomp](https://en.wikipedia.org/wiki/Seccomp)
:

> _seccomp (short for secure computing mode) is a computer security facility
> in the Linux kernel. seccomp allows a process to make a one-way transition
> into a "secure" state where it cannot make any system calls except exit(),
> sigreturn(), read() and write() to already-open file descriptors. Should it
> attempt any other system calls, the kernel will terminate the process with
> SIGKILL or SIGSYS.[1][2] In this sense, it does not virtualize the system's
> resources but isolates the process from them entirely._
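
Note that the quote describes seccomp's original strict mode. Container runtimes generally use the later filter mode (seccomp-bpf), where a BPF program decides per syscall. You can check which mode a process is in from the `Seccomp:` field of `/proc/<pid>/status`. A minimal sketch, with a made-up helper name and invented sample text:

```python
# Values of the "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode named in /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        # Match "Seccomp:" exactly so "Seccomp_filters:" is not picked up.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, "unknown")
    return None


# Invented sample resembling a process started under a seccomp profile.
SAMPLE_STATUS = "Name:\tcat\nSeccomp:\t2\nSeccomp_filters:\t1\n"
print(seccomp_mode(SAMPLE_STATUS))  # filter
```

Passing it the contents of `/proc/self/status` from inside a container is a quick way to confirm that a seccomp profile is actually applied.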

... SELinux is one implementation of MAC (Mandatory Access Controls) that is
built upon the LSM (Linux Security Modules) support in the Linux kernel. Some
distros include policy sets for Docker hosts and lots of other packages that
could be installed; see: "Formally add support for SELinux" (k3s #1372)
[https://github.com/rancher/k3s/issues/1372#issuecomment-5817...](https://github.com/rancher/k3s/issues/1372#issuecomment-581797716)

