
Some questions about Docker and rkt - deafcalculus
http://jvns.ca/blog/2016/09/15/whats-up-with-containers-docker-and-rkt/
======
justincormack
In terms of the daemon model of Docker, I guess it does look a bit
complicated, and is not explained very well.

In production you will do `docker run -d nginx`, not run it in the foreground,
so the client (docker) process is not really in the picture - if you run in
the foreground it is just there to stream the standard IO, so that you can
kill the process with ^C from the shell.

The docker daemon (dockerd) is there to listen for new requests, but since
1.11 it no longer runs containers. Since 1.12 you can restart it without
killing your containers (with the right config option), see
[https://docs.docker.com/engine/admin/live-restore/](https://docs.docker.com/engine/admin/live-restore/),
so you can e.g. do a daemon upgrade without downtime. It is still handling
some things, e.g. logs, so it is best if it does restart.
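
A minimal sketch of that config option (assuming the default /etc/docker/daemon.json path; one way to set it, not the only one):

    # enable live-restore so containers survive a dockerd restart
    echo '{ "live-restore": true }' > /etc/docker/daemon.json
    systemctl restart docker   # existing containers keep running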

The process that actually runs containers is containerd. This is a very simple
daemon with a gRPC socket interface. It uses runc (the OCI standard runner),
but runc does not stay running; only a small process called containerd-shim
does, which is there to act as a parent for the actual container process, so
that containerd can be restarted.

You can use containerd as a runtime, with runc containers, but runc is not
that user friendly. You can use
[https://github.com/jfrazelle/riddler](https://github.com/jfrazelle/riddler)
to generate a runc config from a Docker container. You could also run runc
from systemd if you want. However, runc doesn't do a lot of setup, e.g. the
layered filesystem handling is all part of how dockerd sets things up for
runc, so you would have to do that yourself if you don't want to waste a lot
of disk space.
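
To get a feel for how bare runc is, here is a rough sketch of running a container by hand (names illustrative; `docker export` is used just to obtain a root filesystem):

    mkdir -p mycontainer/rootfs
    # borrow a filesystem from an existing image
    docker export $(docker create nginx) | tar -C mycontainer/rootfs -xf -
    cd mycontainer
    runc spec                  # writes a default config.json next to rootfs
    sudo runc run mycontainer  # no layers, no port mapping, no log handling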

It does sound a bit complicated, but it is just separation of concerns and
breaking up the once monolithic docker binary into a client and a set of
servers that all do smaller tasks and which can be restarted independently.

~~~
inopinatus
Such structures can be hard to document with both clarity and brevity. Take a
look at how Wietse Venema describes the elements of Postfix, for a nuanced
masterclass in the art.
[http://www.postfix.org/OVERVIEW.html](http://www.postfix.org/OVERVIEW.html)

~~~
tptacek
Both the architecture and the approach to documenting it were pioneered by
Bernstein's qmail, which is the first Unix program of comparable ambition to
be structured in this way --- it's crazy to think that there was a time when
this implementation strategy was groundbreaking, but, it was.

[http://cr.yp.to/qmail/pictures.html](http://cr.yp.to/qmail/pictures.html)

(Fun fact: Venema and Bernstein had a long-running feud, and Postfix exists
pretty much entirely because Venema appreciated qmail's architecture but
couldn't stomach working with anything Bernstein produced.)

------
felixgallo
Julia writes: "I think "violates a lot of normal Unix assumptions about what
normally happens to normal processes" is basically the whole story about
containers."

This is a key point. Lots and lots of standard Unix invariants are violated in
the name of abstraction and simplification, the list of those violations is
not well publicized, and most of the current systems have different lists.

For example, in Kubernetes (my current love affair), the current idea of
PetSets (basically, containers that you want to be carefully pampered, like
paxos members, database masters, etc. -- stuff that needs care) /still/ has
the notion that a netsplit can cause the orchestrator to create
(1 .. #-of-nodes) exact doppelgangers of your container, all of which believe
they are the one true master. You can imagine what this means for database
masters and paxos members, and that is going to be, as the kids say,
surprising af to the first enterprise Oracle DB admin who encounters this
situation.

If you believe in containers, then one thing you really do have to accept is
that most of your existing apps should not be in them yet, and that if your
app is not (a) stateless, (b) strongly 12-factor, (c) designed for your
orchestrator, and (d) written not to do things like fork() or keep strong
references to IP addresses, then you should probably wait 3-4 years and use
VMs in the meantime.

~~~
iheartmemcache
Oracle has had multi-homed master-master RDBMS setups for > 10 years. I'm
pretty sure a half-competent Oracle administrator wouldn't really be
'surprised af' at functionality that's been in Oracle for at least a decade.

For things that need 'care', this has been a solved problem for decades.
Banks[0] housed in the WTC on Sept 11 kept on running because OpenVMS has had
NUMA clusters and multi-node replication since the DEC Alpha days. This is
with 100% transactional integrity maintained and DC failovers measured on the
order of 500 ms to 5 s. (Obviously not all banks run on VMS.)

Platforms like IBM z Systems let you live-upgrade z/OS in a test environment
hosted within the mainframe to see if anything breaks, in complete isolation
from production of course, revert snapshots, and do basically everything the
whole ESX suite does (from things like live migrations a la VMotion, to newer
stuff like growing RAID arrays transparently / virtual storage solutions
where you can add FC storage dynamically and transparently to the end user).
Their stock systems let you live-upgrade entire mainframes without a blip.
They're built to withstand total system failure (i.e. literally processors,
RAM, NICs, and PSUs could all fail on one z13 and you'd fail over to a hot
backup without losing any clients attached to the server). HP's NonStop, with
which I have no experience, offers a similarly comprehensive set of
solutions.

[0] On Sept 11, a bunch of servers went down with those buildings. *“Because
of the intense heat in our data center, all systems crashed except for our
AlphaServer GS160... OpenVMS wide-area clustering and volume-shadowing
technology kept our primary system running off the drives at our remote site
30 miles away.” --Werner Boensch, Executive Vice President, Commerzbank North
America*
[http://ttk.mirrors.pdp-11.ru/_vax/ftp.hp.com/openvms/integri...](http://ttk.mirrors.pdp-11.ru/_vax/ftp.hp.com/openvms/integrity/OPENVMS83PRESENTATION.PDF)

~~~
felixgallo
I'm saying that an arbitrary number of exact replicas of a master can
magically appear on the network believing they are the one true master,
identifying themselves as such, and expecting to act that way. Additionally,
an arbitrary number of database masters expecting to participate in the
cluster may show up or leave at any time. That is somewhat nontrivial for even
modern databases to deal with.

~~~
bmurphy1976
Why run your database inside Kubernetes, though? We've always white-gloved our
database (and a few other special services). You don't have to put 100% of
your infrastructure in Docker/Kubernetes.

~~~
finnh
That's felixgallo's point exactly.

------
drdaeman
It seems that isolation is frequently the cause. E.g.:

* Better developer environment. Actually, I'm not sure anymore. It totally makes sense for testing (all the CI/CD stuff), and - thanks to the packaging aspect - it's easy to set up external dependencies (like databases), but I just wasn't able to grasp how the actual development is better _with_ Docker. Developers tinker with stuff, containers and images are all about isolation and immutability, and those stand in one's way.

* PID1. Obviously, isolation is the cause for this. With `--pid=host` it's gone, but no one does that, probably because of the nearly complete lack of UID/GID management and thus the security drawbacks. I guess it has roots in the "all hosts are the same" idea, as UIDs/GIDs have to be a shared resource and they're harder to manage than just spawning things into a new PID namespace so processes won't mess with each other.

* Networking. Yes, as it was pointed out, it makes sense due to port conflicts, but usually it's an inferior, over-complicated version of moving port numbers into environment variables. Instead of binding your httpd to [::]:80 and setting up port mapping, bind it to [::]:${LISTEN_PORT:-80} (see the sketch after this list). All the same stuff, but - IMHO - much more straightforward. Sure, there are (somewhat unusual) cases where a separate network namespace is a necessity (or just a good thing), but I don't think they're at all common.
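
To illustrate the two styles (a rough sketch; port numbers and the binary name are illustrative):

    # Docker way: httpd binds :80 in its own namespace, Docker maps it
    docker run -d -p 8080:80 my-httpd

    # env-var way: same effect, no extra namespace or NAT involved
    LISTEN_PORT=8080 ./my-httpd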

So, I think, the question is also: is there a need (and why) for isolation
the way Docker does it? Doesn't it unnecessarily complicate things?

~~~
justincormack
Developers tinker with code, but most of the time you don't tinker with the
output of that code, like hot-patching your binaries or whatever. Same with
systems: you build a container from a Dockerfile and maybe a Makefile, you
don't then go and change a few things by hand, you change the source code. We
are just pushing the immutability boundaries further and getting more
reproducible environments as we do it.

~~~
drdaeman
It depends on the project, I guess. Sometimes, it's not that easy.

For scripting languages that don't have a compile step, the code is what gets
executed. So with Docker you either have to rebuild the container (extra
delays, and quite noticeable ones) or maintain a separate Dockerfile.dev and
bind-mount the code into the container, a la Vagrant.

Even for compiled stuff, it can be a nuisance with that "Sending build context
to Docker daemon" phase. Like when you have a fair chunk of artwork assets
next to the code. And the advantage of having the intermediate compiler
results is also either lost (adding extra build time) or requires extra
tricks to make things smooth and nice.

And either way, it also means extra work setting up your debugger toolset to
jump over the isolation boundaries so you can dig into a live process's guts.
One's probably going to abandon PID namespace isolation.

Those consequences are quite rarely mentioned when the immutability aspects of
Docker are advertised. It's usually pitched as "you'll have a reproducible
environment" (yay! great!) but never as "you may lose that heartwarming
experience of having a new build ready to be tested while you switch from the
editor to the terminal/browser/whatever window".

~~~
justincormack
You can debug from the host or from another container using
`--pid=container:id` which puts you in the process namespace of a running
container.
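
For example (a sketch; `myapp` stands for whatever your container is called):

    # join myapp's PID namespace with a throwaway debugging container
    docker run -it --rm --pid=container:myapp --cap-add SYS_PTRACE alpine sh
    # inside, `ps` now sees myapp's processes; install strace/gdb as needed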

Build time is important. If you can use build layer caching it helps a lot,
but how to structure it depends on your project. I don't myself use a
Dockerfile.dev, but I do sometimes mount the code into the container to build
and run it directly. I think it would definitely help to have more blogs and
examples of how to do these things, as there is a lot of room for improvement.
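
For instance, ordering a Dockerfile so the slow dependency install stays cached across code-only changes (a sketch for a hypothetical Python app):

    FROM python:3.5
    # rebuilt only when requirements.txt changes
    COPY requirements.txt /app/
    RUN pip install -r /app/requirements.txt
    # code changes invalidate the cache only from here on
    COPY . /app/
    CMD ["python", "/app/main.py"]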

------
gnufied
I will try and answer the networking question.

At scale a single host can be running maybe 20 containers, and port collisions
become a real problem. So imagine if a container opened a port directly on the
host - we would have to be careful that they don't step on each other's toes.

Even if all containers used some sort of contract about which port they are
going to use, there are all sorts of corner cases waiting to happen, such as
an ephemeral port (the port you bind to when you connect externally) taking
over a port needed by a real server app.
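
(You can see the range the kernel draws ephemeral ports from; the values below are typical defaults:)

    $ sysctl net.ipv4.ip_local_port_range
    net.ipv4.ip_local_port_range = 32768   60999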

I have seen two approaches being used to solve this problem:

1\. Using Smartstack
([http://nerds.airbnb.com/smartstack-service-discovery-cloud/](http://nerds.airbnb.com/smartstack-service-discovery-cloud/)),
the applications running inside a container can run on any port, but the port
on which they are externally available is decided by the orchestration
service. Typically, no one talks to the application inside the container
directly; they go through the haproxy configured on localhost. The advantage
is that Smartstack can remove a service if it is failing healthchecks, etc.

2\. The Kubernetes/OpenShift approach of software-defined networking
([https://github.com/coreos/flannel](https://github.com/coreos/flannel)).
Although they also integrate with load balancers, so that is not the _only_
way.

I know that if someone is just getting started with containers, it seems a bit
overwhelming to digest all this. But having worked in some large companies
which are using containers at scale, it kinda makes sense.

~~~
justincormack
The default setup of Docker does not make any assumptions about the host
setup, so it has to assume it might have only one IP address, and hence there
is only one set of ports.

It is perfectly OK, if you have lots of IPs, to put routed IPs on the
`docker0` bridge and never use port publishing at all, or to use some of the
other optional setups, such as the new macvlan and ipvlan drivers
[https://github.com/docker/docker/blob/master/experimental/vlan-networks.md](https://github.com/docker/docker/blob/master/experimental/vlan-networks.md),
which are the kind of production setups you may want if you run your own
networking. But Docker cannot assume anything about the network setup in the
default configuration, hence the use of published ports, which is kind of
inconvenient but always works in any environment.
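
For instance, a sketch of the macvlan setup (experimental at the time of writing; interface and subnet are illustrative):

    # give containers addresses directly on the host's L2 network
    docker network create -d macvlan \
      --subnet=192.168.1.0/24 --gateway=192.168.1.1 \
      -o parent=eth0 macnet
    docker run -d --net=macnet nginx   # no published ports needed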

------
dsr_
"Installing stuff on computers so you can run your program on them really
sucks. It's easy to get wrong! It's scary when you make changes! Even if you
use Puppet or Chef or something to install the stuff on the computers, it
sucks."

I think a lot of people feel this way. I think that fear is born of ignorance,
and we should fix that.

Let's say you are working on an application in NewPopularLanguage 2.3.1, using
CoolFramework version 3.3. Your Linux distro ships NPL 2.1.7 and CF2.8, which
don't support a really nifty feature that you would like to have.

Important questions to ask: what is the distro's support record? Do they have
a dedicated security team? Is there significant support for NPL and CF in the
distro, or just a single package maintainer?

If the distro's security and NPL packaging team are good, you might want to
use their versions even if it means giving up use of the really nifty feature
until sometime in the unknowable future. Making an explicit, considered
decision is worthwhile.

But if you really need the new versions, you should use a repeatable build
system that generates OS packages exactly the way you want them. You should
put them into a local repo so that when you install or upgrade a new machine,
you get the version you specify, not whatever has just hit trunk upstream. And
you may want your versions to be placed in a non-(system)-standard location,
so that your application has to specify the path -- but be guaranteed that you
can install several versions in parallel, and use the right one.

It feels like a lot of overhead, but it can save you lots of debugging and
deployment time. Once you have the infrastructure tools in place, using them
is not much of a burden, and pays for itself many times over.

~~~
radarsat1
> But if you really need the new versions, you should use a repeatable build
> system that generates OS packages exactly the way you want them. You should
> put them into a local repo so that when you install or upgrade a new
> machine, you get the version you specify, not whatever has just hit trunk
> upstream. And you may want your versions to be placed in a
> non-(system)-standard location, so that your application has to specify the
> path -- but be guaranteed that you can install several versions in parallel,
> and use the right one.

Exactly. You have to be fucking careful. Or you can just use a container.
That's his point.

> I think that fear is born of ignorance, and we should fix that.

Actually, I think it's born from having a lot of experience of installing
things and it being a total nightmare.

You're right, obviously we should stick to packaged versions of libraries
whenever possible, but as you say, it is not always possible.

~~~
acobster
> That's his point.

That's _her_ point. :)

------
lima
Another issue with Docker: it does not interact well with process supervision
(say systemd). The "docker run" process that you run with systemd is only a
proxy for the real container process, which is started by the Docker daemon -
so in reality, you have two init systems, Docker _and_ systemd. This means
that many supervision features won't work (signals, seccomp, cgroups...).

rkt fixes this by not having a global daemon.

The linked article puts it well:

[https://medium.com/@adriaandejonge/moving-from-docker-to-rkt-310dc9aec938#.vkdl46i11](https://medium.com/@adriaandejonge/moving-from-docker-to-rkt-310dc9aec938#.vkdl46i11)

~~~
justincormack
cgroups, seccomp, etc. are set by Docker, so they do work. I think it is weird
to view these as exclusively owned by the init process.

Docker works on systems without systemd (indeed, it runs on Windows), so
relying on features that systemd has (currently, many are only recent
additions) is not really an option.

~~~
andrewd18
Give it time. Once systemd runs out of Unix utilities to consume, it will
inevitably turn on WinInit.exe and the Service Control Manager.

------
a4dev
I think these are good questions and I am interested in the answers. At least
some of the answers are not obvious or not generally agreed on by the experts,
it seems.

------
shriphani
While this thread has visibility, I am curious about your typical security
model with docker.

From my experience, whoever is running Docker seems to be able to run root
commands on the host [1].

So, any best practices for running Docker?

[1] [http://reventlov.com/advisories/using-the-docker-command-to-root-the-host](http://reventlov.com/advisories/using-the-docker-command-to-root-the-host)

~~~
justincormack
You can use authorization plugins to control what commands are allowed.

However, generally you don't give people access to run arbitrary docker
commands in production; you have some system that lets them deploy containers
with predetermined settings, which don't include being able to set
`--privileged`, add capabilities, or change security policies.

~~~
shriphani
I am currently running docker with a systemd script (it seemed like a
reasonable idea at the time).

But docker = sudo without a password, essentially.

So I am curious if there is a recommended way to run a service with
`docker run`.

------
moondev
Kubernetes is the answer to all of your questions.

You shouldn't directly use "docker run" in production. At least not yet.

Think of the docker binary and daemon as development tools not a production
platform.

Develop your apps one process per container, microservice style. If you can't
do that you should probably use VMs.

When it comes time to deploy, kubernetes handles scheduling for you
automatically across your fleet.

Kubernetes secrets can be mounted inside the containers so you don't leak them
like you can with env vars.
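
For example (a sketch; names are illustrative):

    # store the secret in the cluster, not in the image or the env
    kubectl create secret generic db-pass --from-literal=password=s3cr3t
    # a pod that mounts it just reads a plain file, e.g.:
    cat /etc/secrets/password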

Kubernetes will eventually support other runtimes like rkt. But this is
abstracted away.

Kubernetes assumes a flat networking space, but this is taken care of with
stuff like flannel.

You should probably use Dockerfiles to create containers in your build
process. Packer can create them too, but I would only recommend that route if
you have other tooling that depends on it. Spinnaker can leverage that
bake-centric workflow very nicely.

~~~
skuzye
Kubernetes supports rkt as we speak:
[http://kubernetes.io/docs/getting-started-guides/rkt/](http://kubernetes.io/docs/getting-started-guides/rkt/)

------
kraftman
Maybe Docker networking gets more complicated later on, but for what I do with
it I find it pretty easy and useful. Docker Compose makes it pretty simple to
control which ports get exposed on the host and which are limited to the
Docker network.

------
artellectual
So I had mostly the same questions you did. I went on a journey and made
videos about it. Check it out here:
[https://www.codemy.net/channels/docker-for-developers](https://www.codemy.net/channels/docker-for-developers)

------
lliamander
> My coworker told me a very surprising thing about containers. If you run
> just one process in a container, then it apparently gets PID 1?

That's true for Docker (and possibly rkt, I don't know) but not for LXC.
Docker is intended to provide isolation for a single service/app, so having an
init process (arguably) doesn't make sense. For LXC, it's more like a separate
OS, so it does need an init process.

These two operating models are referred to as "application containers" and
"system containers". It seems that the former is more popular for service
deployment situations, but if you want a virtual dev environment / sandbox to
play in, I would think the latter is a better choice.

------
Gorgor
So how do you handle what she addressed under secrets? How do you share
passwords between containers? For quick and dirty stuff, I use environment
variables that are set in my docker-compose file, but I have no experience
running docker in production.

~~~
justincormack
environment variables are problematic, as they can be read by other processes
potentially. Vault or another secrets management tool is a better option. A
secrets management solution integrated into Docker is planned, as it is
difficult to get right without tooling support.

~~~
vkat
Is there a way to prevent other processes from reading environment variables?

~~~
bogomipz
On Linux kernels >= 3.2 you can mount /proc with the option hidepid=2.
However, this isn't a very elegant solution, in my opinion.
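
For the record, it looks like this (as root):

    # hide other users' /proc entries from unprivileged processes
    mount -o remount,hidepid=2 /proc
    # or persistently via /etc/fstab:
    # proc  /proc  proc  defaults,hidepid=2  0  0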

------
sigjuice
I never quite understood how Docker lets developers share the same development
environment. Most Dockerfiles that I have seen are a series of apt-get install
commands. If different people build images using the same Dockerfile at
different times, isn't there a chance that they will pick up different package
versions? What am I missing?

~~~
Matthias247
Create a Dockerfile that performs installation of all the tools that you
need, execute it once to create an image, and then share the image with
everyone else who needs it, possibly through a private registry.

We use that approach now to store some build environments for embedded
systems, where our prebuilt and shared images contain all 3rd-party
dependencies (which change only slowly). We then use those images to build
our own software. Depending on the use case we create new images from them,
or only spawn containers for compiling something, copy the artifacts out of
the container, and remove the containers again. Works really well for us.
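
A sketch of that flow (the registry name is illustrative):

    # build and version the environment image once
    docker build -t registry.example.com/team/build-env:1.0 .
    docker push registry.example.com/team/build-env:1.0

    # everyone else pulls identical bits instead of re-running apt-get
    docker pull registry.example.com/team/build-env:1.0
    docker run --rm -v "$PWD:/src" registry.example.com/team/build-env:1.0 make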

------
kozikow
IMO, in the long term the most sane approach for production is Kubernetes with
rkt, especially given the things happening in Docker 1.12.

You can even use Docker locally and rkt in prod, as rkt can run Docker images.

------
ggambetta
"rkt" is such a strange name. In my head I don't pronounce it as "rocket" but
as "rekt", with all the connotations that has.

~~~
ecnahc515
It used to be called Rocket very early on, but was changed to rkt for legal
reasons, also fairly early.

------
theptip
Here's my take on these questions:

1) packaging: this is the feature that's easiest to see benefits from. Having
a single artifact that can be run on your CI infrastructure, development
machine, and production environment is a massive win.

2) scheduling: there are big cost savings to be had by packing your
application processes more efficiently onto your infrastructure. This might
not be a big deal if you're a startup, and you haven't yet hit scale.

3) dev environment: It's powerful to be able to run exactly what's been
deployed to prod, on your local machine. I've not found developing in a
container to be great though; I still use the Django local dev server for fast
code-loading. (It's possible to mount your working directory into your built
container; this is just personal taste).

4) security: containers are not as robust a security boundary as hypervisors,
so they are less suitable for multi-tenant architectures. The most common use-
case is to run your containers in a VM, so this isn't necessarily a problem.
As an additional defense-in-depth perimeter, containers are great.

5) networking: think of network namespaces as a completely isolated network
stack for each container. You can run your containers in the host namespace
using `--net=host`, but this is insecure [1]. Using host networking can be
useful for development though. In general the port forwarding machinery allows
your orchestrator to deploy multiple copies of the same container next to each
other, without the deployed apps having to know about other container's port
allocations. This makes it easier to pack your containers densely. (More
concretely, your app just needs to listen on port 8000, even if Kubernetes is
remapping one copy of it to 33112, and another copy to 33111 on the host).

6) secrets: containers force you to be more rigorous with your handling of
secrets, but most of the best practices have been established for some time
[2]. The general paradigm is to mount your keys/secrets as files, and consume
them in the container; Kubernetes makes this easy with their "Secrets" API.
You can also map secret values into env variables if you prefer.

7) container images: the Dockerfile magic is a pretty big win for building
artifacts; the build process caches layers that haven't changed, which can
make builds very fast when you're just updating code (leaving OS deps
untouched). Having written and optimized a build pipeline that produced VMDK
images, and experienced the pain of cloning and caching those artifacts, I
can attest that this is a very nice 80/20 solution out of the box.

[1]: [https://github.com/docker/docker/issues/14767](https://github.com/docker/docker/issues/14767)
[2]: [https://12factor.net/](https://12factor.net/)

