
Why strace doesn't work in Docker - kiyanwang
https://jvns.ca/blog/2020/04/29/why-strace-doesnt-work-in-docker/
======
lmilcin
strace/ptrace gives you a huge amount of power.

I once had a discussion with admins at my company who wanted to convert
servers to use AD accounts in a naive way. The discussion could be shortened
to something like this:

Me: Please don't convert the servers to AD, because my password will be
available to everybody, as everybody has root on the test environment.

Admins: no, this is perfectly safe.

Me: Guys, once you have root on a machine you can do everything, including
accessing passwords as they are being typed.

Admins: No, not possible. Perfectly safe.

Me: Here, I ran a tool that uses ptrace from the root account, and here is a
list of all accounts and passwords of all users using the test env. Each one
is actually just a read() call at some point; you only need to be able to
filter it out of the strace output.

Admins: Wtf! You are going to get fired!

~~~
tw04
I'm not sure what AD has to do with the fact that, as root, you have a huge
amount of power on ANY box, and abusing it will get you fired. strace/ptrace
is just another tool in that toolbox that you can use maliciously.

Your company either runs in a model where it trusts the employees or it
doesn't. If it doesn't trust the employees they shouldn't be giving ANY of
them root on a shared system.

~~~
filleokus
In my case, I use a different, easy password for test servers that contain no
important production data and aren't in an especially privileged network
position. If they were connected to AD, I would be forced to present my "real"
password.

On production servers, on the other hand, the number of people with root is
probably much smaller, and there might be bigger wins from being able to
centrally enforce password policies or whatever the sysadmins want to do.

~~~
tw04
So create a second AD account for the test environment like most shops do?

~~~
ghjkldsf
But then you don't get the joy of throwing the baby out with the bath water.

90% of the time someone insists AD is too insecure for them, it's because
they're unwilling to maintain hygiene (logging into untrusted machines with
overscoped credentials 8 times a day), or unwilling to use the provided
security features (what do you mean use kerberized services? ldap binds work
just fine and give me an anti-ad security straw man!)

------
NikolaeVarius
I find for Docker, the documentation is extremely sparse/omits details that
have larger consequences.

I've gotten used to just going to read the source to see what an option does.

~~~
StreamBright
Imagine a situation where something is broken and you need to figure out how
to strace in production while a service is down. This is my main argument
against Docker: it is black box computing, and you are going to figure this
out the hard way. Of course you can prepare, yet I do not necessarily have the
bandwidth.

~~~
justincormack
As the article points out, in any recent version of Docker and up to date
kernel ptrace just works.
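
For anyone who wants to check the kernel half of that claim on a given box,
here is a minimal sketch. It assumes the 4.8 cutoff mentioned in the Docker
commit message quoted further down the thread, and it says nothing about
which Docker version is needed. The function names are made up for this
example.

```python
import platform
import re


def kernel_at_least(release, minimum=(4, 8)):
    """Return True if a kernel release string is at or above `minimum`.

    `release` looks like '5.4.0-42-generic'; only major.minor are compared.
    The (4, 8) default is an assumption taken from the Docker commit message.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def running_kernel_ok():
    """Apply the same check to the kernel this interpreter is running on."""
    return kernel_at_least(platform.release())
```

Containers share the host kernel, so calling `running_kernel_ok()` inside the
container reports on the host's kernel too.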

------
croo
> This hypothesis doesn’t make much sense but I hadn’t realized that the root
> user in a Docker container is the same as the root user on the host, so I
> thought that was interesting

Wait what? I didn't know that. This sounds terrible. Is the root in the
container the same user who runs the container, or the same root who is root
on the host machine?

~~~
dharmab
You can enable user namespace isolation, but it comes with a significant
number of tradeoffs.

[https://docs.docker.com/engine/security/userns-remap/](https://docs.docker.com/engine/security/userns-remap/)

Also, Kubernetes doesn't support it.

[https://github.com/kubernetes/enhancements/issues/127](https://github.com/kubernetes/enhancements/issues/127)
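
One way to see which situation you are in is to look at the text of
`/proc/self/uid_map` from inside the container. Each row is "first ID inside
the namespace, first ID outside, count". The sketch below parses that text
and answers "which outside UID does container root map to?". It assumes a
single level of user namespaces, and the helper names and the sample values
in the usage note are illustrative only.

```python
def parse_uid_map(text):
    """Parse uid_map text into (inside_start, outside_start, count) rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            rows.append((int(fields[0]), int(fields[1]), int(fields[2])))
    return rows


def host_uid_for(inside_uid, rows):
    """Translate a UID inside the namespace to the one outside, or None."""
    for inside_start, outside_start, count in rows:
        if inside_start <= inside_uid < inside_start + count:
            return outside_start + (inside_uid - inside_start)
    return None


def root_is_host_root(text):
    """True when UID 0 inside maps to UID 0 outside (no remapping)."""
    return host_uid_for(0, parse_uid_map(text)) == 0
```

Without remapping the file holds a single identity row (`0 0 4294967295`), so
`root_is_host_root` returns True. With a remapped range such as
`0 100000 65536` (a made-up example), container root is UID 100000 outside.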

~~~
jbverschoor
Docker, mongo, MySQL, JavaScript

Insane defaults, but easy to get started without knowledge

------
amelius
If you think that fixed it, try stracing the container AND running strace
inside the container at the same time.

------
mehrdadn
Wow this was way more interesting than I expected. Cool of her to share!

~~~
drvdevd
Yeah this was a great write up and an interesting example of how the reality
of the implementation of something can be quite mismatched to expectations. I
wouldn’t necessarily call it surprising that the implementation is doing more
than one would expect, but this is a good reminder of how to look around when
things get weird.

------
roblabla
This is interesting, but stops just short of explaining why ptrace wasn't
whitelisted, and what changed that allows it to be whitelisted on recent
kernel versions.

Here's what the linked commit to docker[0] says:

> 4.8+ kernels have fixed the ptrace security issues so we can allow ptrace(2)
> on the default seccomp profile if we do the kernel version check.

This commit itself links a commit in the linux kernel[1], which says:

> x86/ptrace: run seccomp after ptrace

> This moves seccomp after ptrace on x86 to that seccomp can catch changes
> made by ptrace. Emulation should skip the rest of processing too.

This doesn't give us much more information, and I'm not familiar enough with
this code to understand the changeset. Thankfully, by looking up the changeset
name, we can easily find the email that proposed the changeset[2], which
explains the issue more in-depth:

> There has been a long-standing (and documented) issue with seccomp where
> ptrace can be used to change a syscall out from under seccomp. This is a
> problem for containers and other wider seccomp filtered environments where
> ptrace needs to remain available, as it allows for an escape of the seccomp
> filter.

So the basic idea is to use ptrace to swap an allowed syscall with a forbidden
syscall. Seccomp used to run before ptrace, so it would see the allowed
syscall and allow the code to continue, but then ptrace would swap the syscall
to something forbidden! The details are a bit thin, however: they don't
explain _how_ we can do that. After a bit more digging, I found this pdf[3],
which contains a simple POC showing the issue and explains it.

The TL;DR is to use ptrace with `PTRACE_SYSCALL` to run the tracee until it
hits a syscall. When a syscall was hit, seccomp would first check that the
syscall is allowed, and then pass execution to the tracing code in the kernel,
which would stop the process for the tracer to inspect. From the tracer you
can then modify the registers with `PTRACE_SETREGS` to change the syscall
being called (the syscall number is passed in RAX on amd64, which the tracer
sees as `orig_rax`) and resume execution. The kernel would then happily
execute the modified, unchecked syscall!

So what changed is that seccomp now runs after ptrace, so whatever the tracer
writes into the registers is checked by the filter before the syscall runs.
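
To make the ordering concrete, here is a toy model in plain Python. It is not
kernel code and does not call ptrace; the allow-list, the `mount` target, and
every function name are invented for illustration. It only shows why running
the filter before or after the tracer's rewrite changes the outcome.

```python
# Hypothetical seccomp allow-list for the sandboxed process.
ALLOWED = {"getpid", "read", "write"}


def seccomp_check(syscall):
    """Stand-in for the seccomp filter: allow only listed syscalls."""
    return syscall in ALLOWED


def tracer_rewrite(syscall):
    """Stand-in for a tracer that swaps the syscall at the entry stop."""
    return "mount"


def enter_syscall(requested, seccomp_first):
    """Return the syscall that ends up executing, or None if blocked."""
    if seccomp_first:
        # Old ordering: the filter sees the original syscall only.
        if not seccomp_check(requested):
            return None
        return tracer_rewrite(requested)  # runs without being checked
    # New ordering: the tracer goes first, the filter sees the final value.
    rewritten = tracer_rewrite(requested)
    return rewritten if seccomp_check(rewritten) else None
```

With `seccomp_first=True`, an innocent `getpid` turns into an executed `mount`;
with `seccomp_first=False`, the rewritten call is blocked.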

[0]:
[https://github.com/moby/moby/commit/1124543ca8071074a537a15d...](https://github.com/moby/moby/commit/1124543ca8071074a537a15db251af46a5189907)

[1]:
[https://github.com/torvalds/linux/commit/93e35efb8de45393cf6...](https://github.com/torvalds/linux/commit/93e35efb8de45393cf61ed07f7b407629bf698ea)

[2]:
[https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/m...](https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg104378.html)

[3]:
[http://asm.rajiska.fr/misc/container_security.pdf](http://asm.rajiska.fr/misc/container_security.pdf)
page 9 for the explanation, page 16 for the POC code

~~~
JoshuaDavid
Huh. If I'm reading that correctly, that means it would be possible to build a
paranoid_sudo command, where you run a command and use PTRACE_SYSCALL on it,
and for each of the syscalls it makes, try running it first as a non-root
user, and if that fails, prompt the user for whether they want to allow the
base program to run whatever syscall requires the child process to be root.

I don't think the above would actually be all that useful, but still it's kind
of cool that it's possible to do at all.

~~~
roblabla
PTRACE_SYSCALL isn't what you want, since it will always run the syscall. What
you can do is use PTRACE_SYSEMU, which allows you to emulate a syscall in
userland. It will break on syscall enter, allow you to emulate your syscall
(e.g. with your scheme of "run it without root, ask user for elevated privs if
necessary"), and resume execution at the next instruction when you ptrace
again, never really executing the syscall instruction.

For this to work you'd also need a way to share various kernel resources (like
the file descriptor table), which is doable via the clone syscall.

There's a ton of cool tricks you can do with linux, it's a really powerful
kernel.

------
jsnk
I had a memory leak problem running a node service inside a docker container.
I tried to use gcore, but got the precise problem in the article.

    ptrace: Operation not permitted.
    You can't do that without a process to debug.
    The program is not being run.
    gcore: failed to create core.36

I found this SO answer and it worked out for me.

[https://stackoverflow.com/questions/42029834/gdb-in-docker-c...](https://stackoverflow.com/questions/42029834/gdb-in-docker-container-returns-ptrace-operation-not-permitted)
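
When chasing this kind of "Operation not permitted", one thing worth checking
is whether the process holds `CAP_SYS_PTRACE` at all. The sketch below reads
the `CapEff` bitmask out of the text of `/proc/<pid>/status` and tests bit 19
(the value of `CAP_SYS_PTRACE` in `<linux/capability.h>`). The helper names
are made up, and a missing capability is only one possible cause; the seccomp
profile discussed in the article is another.

```python
CAP_SYS_PTRACE = 19  # bit index defined in <linux/capability.h>


def effective_caps(status_text):
    """Extract the CapEff bitmask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap):
    """Test a single capability bit in a capability bitmask."""
    return bool((mask >> cap) & 1)


def has_sys_ptrace(status_text):
    """True if the effective set in this status text has CAP_SYS_PTRACE."""
    return has_cap(effective_caps(status_text), CAP_SYS_PTRACE)
```

Usage would be something like
`has_sys_ptrace(open("/proc/self/status").read())` run inside the container.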

------
carapace
That was fun but it pretty much reinforces my main reason for not using
containers: it's reinventing the OS. I already have the OS, so...

(Change my mind?)

~~~
techntoke
No need to change your mind. Containers improve orchestration of deploying
scalable services. The Kubernetes ecosystem will continue to improve and one
day your boss will tell you that you have to use containers to deploy your
apps or find a different job.

~~~
ctrlc-root
Not really. Let me prefix my reply with some context: I've used Puppet, Salt
Stack, Ansible, and PowerShell DSC to manage infrastructure based on Windows,
Linux, and macOS on top of bare metal servers, VMs, and containers. Any tool
that promises to "make the complex simple" is really just hiding the
complexity away from you behind default settings and their own implementation.
Which is probably fine for simple use cases. If your application can be
deployed to Heroku then deploying to a Kubernetes cluster you don't have to
manage is probably analogous in terms of overhead and how much complexity you
have to worry about. However this simply doesn't scale as your application
increases in complexity.

In fact I recently completed a migration away from Docker/Kubernetes to an AWS
stack that doesn't use anything higher level than autoscaling groups and it
has been a positive experience. The same application is now faster (fewer
layers necessary), easier to deploy (we own all the moving pieces), easier to
debug in production (I can use strace and tcpdump without trying to install
them in an already running Docker container), etc. All of the "complexity"
that Docker and Kubernetes were handling for us has been replaced with a ~ 500
line Python script that's mostly comments (you can see an early prototype of
this script in the repository below, named deploy_build.py, which omits some
error handling but weighs in at ~ 150 LoC). Furthermore, now that we have to
think about how to package and deploy our application to production, the
incentives are there to follow proper Python best practices around packaging
(i.e. use packages, don't make assumptions about paths, etc) which has been a
nice bonus.

Containers (and Docker, Kubernetes, etc) are all tools that have their use
cases and corresponding tradeoffs. Anyone who can't explain when or why you
wouldn't want to use a tool is not prepared to explain when or why you should
use it.

[https://github.com/CtrlC-Root/proto-aws-django](https://github.com/CtrlC-Root/proto-aws-django)

~~~
techntoke
First, containers don't necessarily make complex simple. What they do is take
an OS image, and allow you to run it on another OS (best performance is always
going to be Linux on Linux, since containers were first built natively into
the Linux kernel).

Obviously, Microsoft also saw your same vision of containers being so
unnecessary that they also decided to build the concept into their OS as well.

This concept can't be done on bare metal. Additionally, because it is built
into the kernel and containers share the same kernel, they generally start up
more than 10x faster than traditional VMs. There are solutions now of course to boot
directly into the kernel from a hypervisor, but these are generally paired
with container solutions as well due to the orchestration and ecosystem that
exists to make container orchestration easy.

Most container solutions use a unified image format that allows you to
rapidly iterate and test your changes in multiple environments. Doing this
on bare metal or VMs takes considerably more time, and more money for fixed
infrastructure costs.

With container solutions such as Kind and Helm, you can rapidly deploy a local
cluster to pre-test cloud rollouts on a single machine. The orchestration can
also autoscale clusters horizontally and vertically, and is robust enough to
directly compete with other bulkier VM orchestration solutions such as
OpenStack.

With automatic certificate rotation, automated service discovery, etc there is
no need to try to create tasks for every little thing you'd need to do with
Ansible or another IaC component. It is all baked into the Kubernetes
ecosystem.

Abstraction away from cloud-specific solutions can only be seen as a good
thing. People that are invested in AWS generally can't be trusted to create
holistic solutions.

------
enriquto
I love this writing style where the scientific method is laid bare so
explicitly.

~~~
blitmap
Agree! Wish I had the mental capacity to stay focused like this person :-)

~~~
lhuser123
Me too. Also, the ability to explain things this way. It’s remarkable.

------
asdf21
>strace actually does work in newer versions of Docker

Good to know, this should open up some troubleshooting

------
richardwhiuk
Previously:
[https://news.ycombinator.com/item?id=23021002](https://news.ycombinator.com/item?id=23021002)

------
tonetheman
A great writeup. And they end up looking at the source in the article. Good
stuff.

~~~
acqq
I also recommend:

[https://wizardzines.com/](https://wizardzines.com/)

by the same author. It looks "popular" but it's written much more thoughtfully
and competently than most of the content found on the web. Like this article,
it's the result of real personal research; it is never just a rewording of
some manual or already existing material. In short, really worth reading, and
also worth using inside companies.

~~~
codazoda
As a fan of zines, I've followed her for a while now. She does awesome work.

------
jeffbee
Tl;dr: because docker is doing a bunch of undocumented junk you didn’t ask
for.

------
aritmo
Why strace didn't work in Docker.

