Why strace doesn't work in Docker (jvns.ca)
329 points by kiyanwang on May 4, 2020 | 67 comments

strace/ptrace gives you a huge amount of power.

I once had a discussion with admins at my company who wanted to convert servers to use AD accounts in a naive way. The discussion could be shortened to something like this:

Me: Please don't convert the servers to AD, because my password will be available to everybody, as everybody has root on the test environment.

Admins: no, this is perfectly safe.

Me: Guys, once you have root on a machine you can do everything, including accessing passwords as they are being typed.

Admins: No, not possible. Perfectly safe.

Me: Here, I ran a tool that uses ptrace from the root account; here is a list of all accounts and passwords of all users using the test env. The password is actually just a read() call at some point; you only need to be able to filter it out of the strace output.

Admins: Wtf! You are going to get fired!

I'm not sure what AD has to do with the fact that, as root, you have a huge amount of power on ANY box, and abusing it will get you fired. strace/ptrace is just another tool in that toolbox that you can use maliciously.

Your company either runs in a model where it trusts the employees or it doesn't. If it doesn't trust the employees they shouldn't be giving ANY of them root on a shared system.

In my case, I'm using a different, easy password for test servers that contain no important production data and aren't in an especially privileged network position. If they were connected to AD, I would be forced to present my "real" password.

On production servers, on the other hand, the number of people with root is probably much smaller, and there might be bigger wins from being able to centrally enforce password policies or whatever the sysadmins want to do.

That was exactly the point. I had a separate password on each box (which I did not normally have to use, because I used it only to set up my SSH key).

So create a second AD account for the test environment like most shops do?

But then you don't get the joy of throwing the baby out with the bath water.

90% of the time someone insists AD is too insecure for them, it's because they're unwilling to maintain hygiene (logging into untrusted machines with overscoped credentials 8 times a day), or unwilling to use the provided security features (what do you mean use kerberized services? ldap binds work just fine and give me an anti-AD security straw man!)

Shared root on an unimportant system with disposable creds is different from shared root on a system that uses (reusable) org-wide creds.

Such is the curse of privileged access. Even if they hadn't enabled AD, a privileged admin could have stolen ssh keys, or passwords by launching nefarious sshd's or other daemons. You could make the argument "but that would have been detected", but then so could watching for "strace" or "ptrace" invocations. At some point, you have to trust the people with privileged access, and make sure they aren't abusing it.

I think the point is that using AD would mean the OP would be forced to use the same password to access the test system as the prod system. Otherwise, they would use different passwords and no matter how nefarious any admin on the test system was they wouldn't get access to the prod system through it.

It's entirely possible to run separate Kerberos realms for test and prod.

If using AD then why on earth not forgo passwords entirely and use Kerberos to access these systems...

SSH keys not so much, because you keep your private key on your local machine.

With poorly implemented AD (think a telnet/ssh login app asking for a password), on the other hand, your password gets transmitted to the other box. At some point some application performs a read() to actually receive it.

You could use GSSAPI auth with SSH. This is also a good argument for having a test AD domain to use in your test environment.

> including accessing passwords as they are being typed.

If you're running X you don't even need root :(

That's not necessarily true. You can run X as a regular user. You can run X locally and run the app on a remote server, using the local X server over an SSH tunnel with SSH keys.

I do this about as frequently as I need to install something on a remote server that has a graphical installer. Not sure why people thought this was a good idea, but a huge amount of commercial software in the 2000s had that kind of installation process.

Certain OSes, one of them being SGI IRIX 5.x, used to ship with X security disabled. This meant your display was completely open to the world, the equivalent of running "xhost +"

In the 90's, firewalls were uncommon, so literally anyone on the Internet could sniff your keyboard or pop up windows on your display. Or, better yet, xmelt it.

This resulted in LOTS of mayhem.

Many years ago, a young consultant was working at a client site on a SunOS box. Suddenly, the machine started playing the, let's call it the restaurant scene, from "When Harry met Sally." Yeah, it got awkward real fast...

I find for Docker, the documentation is extremely sparse/omits details that have larger consequences.

I've gotten used to just going to read the source to see what an option does.

Imagine a situation where something is broken and you need to figure out how to strace in production while a service is down. This is my main argument against Docker: it is black-box computing, and you are going to figure this out the hard way. Of course you can prepare, yet I do not necessarily have the bandwidth.

As the article points out, in any recent version of Docker and up to date kernel ptrace just works.

> This hypothesis doesn’t make much sense but I hadn’t realized that the root user in a Docker container is the same as the root user on the host, so I thought that was interesting

Wait what? I didn't know that. This sounds terrible. Is the root in the container the same user who runs the container, or the same root who is root on the host machine?

Without User Namespaces, if you run a container, user IDs in the container use the same pool as the host. So if your container runs as UID 0 / root, it’s the same UID 0 / root as your host.

This is one of the many reasons why giving non-root users access to the Docker daemon (so they can start containers) is dangerous: if, as a non-root user, I can start a container that’s running as UID 0, there’s a lot of possibility for misuse.

User Namespaces enable Docker to use separate UID pools for the containers, which enables a container to run as “UID 0 / root” but have the host actually map that as some arbitrary other UID, so the host can treat it differently than actual UID 0.
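
To make the remapping concrete, here is a minimal sketch of how a uid_map entry translates a container UID into a host UID. The three-column layout (first UID inside, first UID outside, range length) is the format the kernel exposes in /proc/<pid>/uid_map; the function names and the sample range starting at 100000 are assumptions for illustration, not what any particular Docker setup uses.

```python
from typing import List, Optional, Tuple

# One tuple per uid_map line: (first UID inside, first UID outside, length).
UidMap = List[Tuple[int, int, int]]


def parse_uid_map(text: str) -> UidMap:
    """Parse uid_map-style text into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            inside, outside, length = (int(f) for f in fields)
            entries.append((inside, outside, length))
    return entries


def to_host_uid(container_uid: int, uid_map: UidMap) -> Optional[int]:
    """Return the host UID for a container UID, or None if it is unmapped."""
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


# Without a user namespace the mapping is the identity: root is root.
IDENTITY = parse_uid_map("0 0 4294967295")
# Hypothetical remapped range: container UID 0 becomes host UID 100000.
REMAPPED = parse_uid_map("0 100000 65536")

if __name__ == "__main__":
    print(to_host_uid(0, IDENTITY))
    print(to_host_uid(0, REMAPPED))
```

With the identity map, container root and host root are the same UID; with the remapped range, "root" in the container is an unprivileged UID as far as the host is concerned.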

You can enable user namespace isolation, but it comes with a significant number of tradeoffs.


Also, Kubernetes doesn't support it.


Docker, mongo, MySQL, JavaScript

Insane defaults, but easy to get started without knowledge

And not only for running processes, but also for files on the disk. If you mount an image on the host and make a file owned by "root", then mount that image in a container, the container's "root" now owns it too.

Edit: Note that userids are the same, but usernames are not managed by the kernel (handled by /etc/passwd or LDAP or something), so userid 1001 inside the container is userid 1001 outside, but they might have different names if you "ls -l" from different places.
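
A small sketch of that point, assuming two made-up /etc/passwd files (the user names "alice" and "node" are hypothetical). The kernel only stores the number; the name you see in "ls -l" depends on whichever passwd database the tool happens to read.

```python
from typing import Dict


def names_by_uid(passwd_text: str) -> Dict[int, str]:
    """Map numeric UID -> user name from /etc/passwd-format text."""
    mapping: Dict[int, str] = {}
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) >= 3:
            # Keep the first entry seen for a given UID.
            mapping.setdefault(int(fields[2]), fields[0])
    return mapping


# Hypothetical passwd contents on the host and inside the container.
HOST_PASSWD = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "alice:x:1001:1001::/home/alice:/bin/bash\n"
)
CONTAINER_PASSWD = (
    "root:x:0:0:root:/root:/bin/sh\n"
    "node:x:1001:1001::/home/node:/bin/sh\n"
)

if __name__ == "__main__":
    uid = 1001  # the owner UID stored on disk is the same in both views
    print(names_by_uid(HOST_PASSWD).get(uid))
    print(names_by_uid(CONTAINER_PASSWD).get(uid))
```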

See also mount(8)'s nosuid/nodev options, because plugging in a USB stick with a setuid-root shell on it apparently used to work.

Or, years before, NFS: the NFS permissions model trusts the client and server implicitly. In many environments, that meant that getting root[1] on any system quickly cascaded across all of them if you could write to, say, someone's shell profile, SSH keys, or a shared binary in a common location, which was not uncommon at all when people were trying to conserve storage costs by only installing things in one place. nosuid, nodev, and the various options for preventing uid=0 access were all attempts to bandaid around the lack of a better authentication option until people started switching to Kerberos.

1. Or, if they didn't require trusted ports, any account at all using https://github.com/NetDirect/nfsshell

Docker doesn't use user namespaces by default. They have been working on adding support for a while, but it makes permissions on volumes difficult, and things kept changing in kernel land for years.

Root inside a non-userns container is the same as root on the Linux host, but it is constrained by security policies like seccomp and apparmor.
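
If you want to see those constraints from inside a process, a rough sketch is to read the Seccomp and CapEff fields of /proc/self/status. The field names, the seccomp mode numbers (0, 1, 2) and CAP_SYS_PTRACE being capability number 19 come from the kernel; the function name and the sample status snippets below are illustrative assumptions, not output captured from a real container.

```python
CAP_SYS_PTRACE = 19  # capability number from <linux/capability.h>
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def confinement_summary(status_text: str) -> dict:
    """Summarize seccomp mode and CAP_SYS_PTRACE from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    cap_eff = int(fields.get("CapEff", "0"), 16)  # hex bitmask of effective caps
    seccomp = int(fields.get("Seccomp", "0"))
    return {
        "has_cap_sys_ptrace": bool((cap_eff >> CAP_SYS_PTRACE) & 1),
        "seccomp": SECCOMP_MODES.get(seccomp, "unknown"),
    }


# Illustrative snippets: a restricted process and an unrestricted one.
RESTRICTED = "Name:\tsh\nCapEff:\t00000000a80425fb\nSeccomp:\t2\n"
UNRESTRICTED = "Name:\tsh\nCapEff:\t0000003fffffffff\nSeccomp:\t0\n"

if __name__ == "__main__":
    print(confinement_summary(RESTRICTED))
    print(confinement_summary(UNRESTRICTED))
```

On a live system you would pass it open("/proc/self/status").read() instead of the sample strings.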

> it makes permissions on volumes difficult

Yeah, much easier to run everything as root. Also, we don't need /etc/passwd and stuff for UID/GID resolution and credentials in the container anyway; we just use ad-hoc auth and crypto from a random third-party lib (that isn't vetted, is never updated, runs in the same address space as your app, and hasn't access to meaningful entropy since running in a container) and supply root credentials on the docker command line or the Dockerfile checked in to github, or both.

That's what we've been saying for years: Docker doesn't solve anything, it merely hides problems from you (and helps your cloud provider's bottom line). Good luck, and yes, PHBs should be worried for civil/criminal gross negligence if the shit hits the fan. Your cloud provider is happy to take your checks, but will shrug-away and point out they're just providing the infrastructure; it's up to you to competently configure your ever-changing 12-factor k8s.

> hasn't access to meaningful entropy since running in a container

getrandom(2)/getentropy(3) should be using kernel randomness generation, which isn't affected by containers I thought.
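
For what it's worth, a minimal sketch of pulling bytes straight from the kernel's generator, which works the same way inside or outside a container. os.getrandom is Linux-only in Python, so the fallback to os.urandom and the helper name are assumptions added for portability.

```python
import os


def kernel_random_bytes(n: int) -> bytes:
    """Return n bytes from the kernel CSPRNG via getrandom(2) where available."""
    if not hasattr(os, "getrandom"):
        # Non-Linux platforms: os.urandom also reads from the OS generator.
        return os.urandom(n)
    buf = b""
    while len(buf) < n:
        # getrandom(2) may return fewer bytes than requested, so loop.
        buf += os.getrandom(n - len(buf))
    return buf


if __name__ == "__main__":
    print(kernel_random_bytes(16).hex())
```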

That some things are more complicated doesn't lead to "doesn't solve anything." It solves many things. If they're not things you need solved then fine, but you can't deny that for many workflows it is actually useful.

It's the same user, but with locked down access (by default). As an example, can't ptrace by default.

A container is a process with some kernel isolation mechanisms set up around it. Unless one of those mechanisms is a user namespace with uids/gids mapped to an unused set of users, you get the same user.

> This sounds terrible.

Well it is. Most production web sites today are exploitable because of this.

It's even worse than it sounds.

> Most production web sites today are exploitable because of this.

How so? To exploit this, you need to already have RCE on a container. But generally you get that RCE by exploiting the site (the application code) in the first place.

In which scenario does an attacker have code execution privileges in a container, but needs this root privilege to exploit the site?

This is one of the reasons I run podman instead where I can and don't just give users access to the docker service. You can also run buildkit in rootless mode.

You can also run Docker in rootless mode.

If you think that fixed it, try stracing the container AND running strace inside the container at the same time.

Wow this was way more interesting than I expected. Cool of her to share!

Yeah this was a great write up and an interesting example of how the reality of the implementation of something can be quite mismatched to expectations. I wouldn’t necessarily call it surprising that the implementation is doing more than one would expect, but this is a good reminder of how to look around when things get weird.

This is interesting, but stops just short of explaining why ptrace wasn't whitelisted, and what changed that allows it to be whitelisted on recent kernel versions.

Here's what the linked commit to docker[0] says:

> 4.8+ kernels have fixed the ptrace security issues so we can allow ptrace(2) on the default seccomp profile if we do the kernel version check.
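
As a rough illustration of what a "kernel version check" like that amounts to, here is a sketch that compares a release string against 4.8. This is not Docker's actual code (which is written in Go); the function name and the sample release strings are hypothetical.

```python
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """True if a kernel release string such as '4.8.0-59-generic' is >= major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # Tuple comparison orders by major version first, then minor.
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


if __name__ == "__main__":
    import platform

    # On Linux, platform.release() returns the running kernel's release string.
    print(platform.release())
    for sample in ("3.10.0-1160.el7.x86_64", "4.4.0-210-generic", "4.8.0-59-generic"):
        print(sample, kernel_at_least(sample, 4, 8))
```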

This commit itself links a commit in the linux kernel[1], which says:

> x86/ptrace: run seccomp after ptrace

> This moves seccomp after ptrace on x86 to that seccomp can catch changes made by ptrace. Emulation should skip the rest of processing too.

This doesn't give us much more information, and I'm not familiar enough with this code to understand the changeset. Thankfully, by looking up the changeset name, we can easily find the email that proposed the changeset[2], which explains the issue more in-depth:

> There has been a long-standing (and documented) issue with seccomp where ptrace can be used to change a syscall out from under seccomp. This is a problem for containers and other wider seccomp filtered environments where ptrace needs to remain available, as it allows for an escape of the seccomp filter.

So the basic idea is to use ptrace to swap an allowed syscall with a forbidden syscall. Seccomp used to run before ptrace, so it would see the allowed syscall and allow the code to continue, but then ptrace would swap the syscall to something forbidden! The details are a bit thin, however: they don't explain how we can do that. After a bit more digging, I found this pdf[3], which contains a simple POC showing the issue and explaining it.

The TLDR is to use ptrace with `PTRACE_SYSCALL` to execute until a syscall is hit. When a syscall is hit, seccomp would first check the syscall is allowed, and then pass execution to the tracing system of the kernel, which would stop the process for ptrace to inspect. From ptrace you can then modify the registers with `PTRACE_SETREGS` to change the syscall being called (the syscall number is in register RAX on amd64, seen by the tracer as `orig_rax`) and resume execution. The kernel will then happily execute the modified, unchecked syscall!

So what changed is that seccomp will be run after ptrace now, so there isn't a way to modify the registers before the syscall is run anymore.
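
Here is a toy model of that ordering change, just to make the before/after visible. It is plain Python, not kernel code and not a real ptrace session; the allow-list, the "tracer" that swaps getpid for mount, and all function names are made up for illustration.

```python
ALLOWED = {"getpid", "read", "write"}  # hypothetical seccomp allow-list


def seccomp_allows(syscall: str) -> bool:
    return syscall in ALLOWED


def tracer_rewrite(syscall: str) -> str:
    """Stand-in for a tracer changing the syscall number at the entry stop."""
    return "mount" if syscall == "getpid" else syscall


def run_syscall(requested: str, seccomp_after_ptrace: bool) -> str:
    if seccomp_after_ptrace:
        # New order: the tracer runs first, seccomp checks what will really execute.
        final = tracer_rewrite(requested)
        if not seccomp_allows(final):
            return f"blocked: {final}"
    else:
        # Old order: seccomp checks the original syscall, then the tracer swaps it.
        if not seccomp_allows(requested):
            return f"blocked: {requested}"
        final = tracer_rewrite(requested)
    return f"executed: {final}"


if __name__ == "__main__":
    print(run_syscall("getpid", seccomp_after_ptrace=False))  # old ordering
    print(run_syscall("getpid", seccomp_after_ptrace=True))   # new ordering
```

In the old ordering the forbidden call slips through because the filter only ever saw getpid; in the new ordering the filter sees the rewritten call and rejects it.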

[0]: https://github.com/moby/moby/commit/1124543ca8071074a537a15d...

[1]: https://github.com/torvalds/linux/commit/93e35efb8de45393cf6...

[2]: https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/m...

[3]: http://asm.rajiska.fr/misc/container_security.pdf page 9 for the explanation, page 16 for the POC code

Huh. If I'm reading that correctly, that means it would be possible to build a paranoid_sudo command, where you run a command and use PTRACE_SYSCALL on it, and for each of the syscalls it makes, try running it first as a non-root user, and if that fails prompt the user for whether they want to allow the base program to run whatever syscall requires the child process to be root.

I don't think the above would actually be all that useful, but still it's kind of cool that it's possible to do at all.

PTRACE_SYSCALL isn't what you want, since it will always run the syscall. What you can do is use PTRACE_SYSEMU, which allows you to emulate a syscall in userland. It will break on syscall enter, allow you to emulate your syscall (e.g. with your scheme of "run it without root, ask user for elevated privs if necessary"), and resume execution at the next instruction when you ptrace again, never really executing the syscall instruction.

For this to work you'd also need a way to share various resources (like the file descriptor table), which is doable via the clone(2) syscall (for example, with CLONE_FILES).

There's a ton of cool tricks you can do with linux, it's a really powerful kernel.

I had a memory leak problem running a node service inside a docker container. I tried to use gcore, but got the precise problem in the article.

ptrace: Operation not permitted.
You can't do that without a process to debug.
The program is not being run.
gcore: failed to create core.36

I found this SO answer and it worked out for me.


That was fun but it pretty much reinforces my main reason for not using containers: it's reinventing the OS. I already have the OS, so...

(Change my mind?)

Yup, that's the basic idea. If you ever wished you could use the cp command on an entire os, that's basically what docker lets you do. If that's not a thing that's useful to you (for example because you already have a reproducible build, or because you want to be principled about understanding how your system components fit together instead of shrugging and copy-pasting the environment that works but nobody is quite sure how), you probably don't need docker.

I mean, they aren't reinventing the OS, they're basically just a sum of namespacing capabilities in the OS. They have the same basic purposes as chroots/jails/etc: isolation of components. That these are overused to avoid having a properly configurable userspace is not an indictment of the capabilities themselves.

No need to change your mind. Containers improve orchestration of deploying scalable services. The Kubernetes ecosystem will continue to improve and one day your boss will tell you that you have to use containers to deploy your apps or find a different job.

Not really. Let me prefix my reply with some context: I've used Puppet, Salt Stack, Ansible, and PowerShell DSC to manage infrastructure based on Windows, Linux, and macOS on top of bare metal servers, VMs, and containers. Any tool that promises to "make the complex simple" is really just hiding the complexity away from you behind default settings and their own implementation. Which is probably fine for simple use cases. If your application can be deployed to Heroku then deploying to a Kubernetes cluster you don't have to manage is probably analogous in terms of overhead and how much complexity you have to worry about. However this simply doesn't scale as your application increases in complexity.

In fact I recently completed a migration away from Docker/Kubernetes to an AWS stack that doesn't use anything higher level than autoscaling groups and it has been a positive experience. The same application is now faster (fewer layers necessary), easier to deploy (we own all the moving pieces), easier to debug in production (I can use strace and tcpdump without trying to install them in an already running Docker container), etc. All of the "complexity" that Docker and Kubernetes were handling for us has been replaced with a ~ 500 line Python script that's mostly comments (you can see an early prototype of this script in the repository below, named deploy_build.py, which omits some error handling but weighs in at ~ 150 LoC). Furthermore, now that we have to think about how to package and deploy our application to production, the incentives are there to follow proper Python best practices around packaging (i.e. use packages, don't make assumptions about paths, etc) which has been a nice bonus.

Containers (and Docker, Kubernetes, etc) are all tools that have their use cases and corresponding tradeoffs. Anyone who can't explain when or why you wouldn't want to use a tool is not prepared to explain when or why you should use it.


First, containers don't necessarily make the complex simple. What they do is take an OS image, and allow you to run it on another OS (best performance is always going to be Linux on Linux, since containers were first built natively into the Linux kernel).

Obviously, Microsoft also saw your same vision of containers being so unnecessary that they also decided to build the concept into their OS as well.

This concept can't be done on bare metal. Additionally, because it is built into the kernel and they share the same kernel, they generally start up more than 10x faster than traditional VMs. There are solutions now of course to boot directly into the kernel from a hypervisor, but these are generally paired with container solutions as well due to the orchestration and ecosystem that exists to make container orchestration easy.

Most container solutions use a unified image solution that allows you to rapidly iterate and test your changes in multiple environments. Doing this on bare metal or VMs takes considerably more time and money for fixed infrastructure costs.

With container solutions such as Kind and Helm, you can rapidly deploy a local cluster to pre-test cloud rollouts on a single machine. The orchestration can also autoscale clusters horizontally and vertically, and is robust enough to directly compete with other bulkier VM orchestration solutions such as OpenStack.

With automatic certificate rotation, automated service discovery, etc there is no need to try to create tasks for every little thing you'd need to do with Ansible or another IaC component. It is all baked into the Kubernetes ecosystem.

Abstraction away from cloud-specific solutions can only be seen as a good thing. People that are invested in AWS generally can't be trusted to create holistic solutions.

Never thought I would run into someone else that has used Powershell DSC...

Eh. They're the new shiny. And they do make certain things a bit nicer for some job roles, at the cost of shoving the complexity somewhere else.

In particular, it has been useful to move our engineers to developing using it. We'll probably experiment with using it in production for a few things at some point, just for consistency, but it buys us zero value there otherwise. (We will never run "in the cloud" because it would be cost-prohibitive, at least with current cloud revenue models.)

There will be another new shiny in a few years, at which point the True Believers will move on and tell you how you'll lose your job if you don't get on board that train, too.

That's so not what I meant and I think you know it. I don't mean to be rude but your response makes you sound like a systemd enthusiast.

Do you have a technical argument to put forward?

I mean, "systemd enthusiast" doesn't sound rude, unless you give it some tone, are part of a special group where it is, or prefix it with "i don't mean to be rude".

As for the difference between Docker and reinventing OSes, it depends on what you count as reinventing OSes. Docker provides some form of separation/containerization, a function that is also implemented by some OSes. But it's debatable whether having the functionality in user space or kernel space, or having a wrapper application with a consistent API across multiple operating systems, is worth it.

I had just read "Systemd, ten years later: a historical and technical retrospective" so it was on my mind. I know they didn't invent it but I associate that kind of attitude with that project. Shouldn't have mentioned it, really.



Absolutely, but it is hard to decipher if it is even worth arguing at all. Would you argue against the crowd opposing automobiles? I mean, there are plenty of valid arguments against them but at the same time what good does it do to fight about it?

> it is hard to decipher if it is even worth arguing at all

Then why? I mean, I put it in parentheses for a reason man.

I mean, you seem to be saying that I can't tell the difference between automobiles and, what? horses?

Surely if the difference is that great it should be easy to come up with lots of solid points to "change my mind"? Since you're taking the trouble to comment anyway?

In your sib comment https://news.ycombinator.com/item?id=23071602 you seem to be making some solid points, but from my POV it still sounds like "learn the thing" where "the thing" does what I can already do without it, or it enables me to do stuff that I'm not actually interested in doing.

Bottom line: if you want to argue please do it with facts and not dismissive condescension. (I've got plenty of that already without your help, thanks.)

I love this writing style where the scientific method is laid bare so explicitly.

Agree! Wish I had the mental capacity to stay focused like this person :-)

Me too. Also, the ability to explain things this way. It’s remarkable.

Don't sell yourself short. Julia is a lovely person but not someone I'd describe as "extremely focused all the time". If you make time for careful writing you can definitely do it.

>strace actually does work in newer versions of Docker

Good to know, this should open up some troubleshooting

A great writeup. And they end up looking at the source in the article. Good stuff.

I also recommend:


by the same author. It looks "popular" but it's written much more thoughtfully and competently than most of the content found on the web. Like this article, it's the result of real personal research, it is never just rewording of some manual or already existing material. In short, really worth reading, and also worth using inside of the companies.

As a fan of zines, I've followed her for a while now. She does awesome work.

Tl;dr: because docker is doing a bunch of undocumented junk you didn’t ask for.

Why strace didn't work in Docker.
