File Permissions: A painful side of Docker (2019) (gougousis.net)
110 points by zdw on May 31, 2021 | 67 comments



> The problem with this approach is that is not portable. What if I am developing using more than one computers where in each computer my user has different ID?

Make the build script use local $USERID and $GROUPID as args during the build process.

In docker-compose.yml (or, if using docker directly, using --build-arg):

    build:
      context: ./build
      args:
        USERID: ${USERID}
        GROUPID: ${GROUPID}

So you're passing the local uid and gid as variables to the build process.(1)

In build/Dockerfile:

  FROM image:tag
  WORKDIR "/application"
  ARG USERID
  ARG GROUPID

  RUN if [ ${USERID:-0} -ne 0 ] && [ ${GROUPID:-0} -ne 0 ]; then userdel -f www-data ;fi \
    && if getent group ${GROUPID} ; then groupdel www-data; fi \
    && groupadd -g ${GROUPID} www-data && useradd -m -l -u ${USERID} -g www-data www-data -s /bin/bash

(1) $USERID and $GROUPID might not be available as environment variables on your system. To set them, place this in your .bashrc:

  export USERID=$(id -u)
  export GROUPID=$(id -g)


But that doesn't solve the problem, just works around it:

1. Images are still pre-baked with a given UID/GID pair, so you can't distribute them as something universal and reusable.

2. This requires workarounds / extra steps on a local workstation, so it doesn't work for everyone unless they follow each project's particular setup quirks.

Shell/compose duct tape like this doesn't make for a great experience. This really should be solved by upstream projects themselves, as it's an extremely common issue when attempting to use Docker.


It's a feature for a multi-tenant deployment if you use user remaps. Maybe you only allow specific tenant containers with tenant specific uid/gid.


1. Nope, they are not pre-baked. They are built at runtime from env vars on each machine.

2. One step, setting up two vars. They can be set by a build script. Lots of things have build scripts way more complicated than this.

The only tedious thing is you have to adapt this for every image type you run.


> The only tedious thing is you have to adapt this for every image type you run.

The tedious thing is that this escalates into complexity whenever you have to deal with K developers using M projects developed by N teams each using a different way to handle this:

Do I need to set USERID for project foo, or UID? Does it default to 1000 or the author's UID? Oh, someone has a problem with our project, did they remember to set COMPANY_USERID in their bashrc? Oh, wait, they're using zsh, how do you do that there? Oh, but they followed this other project's readme and that set COMPANY_USERID but not COMPANY_GROUPID...

Docker is supposed to simplify this by unification and a limited API surface, and applying hacks like this on top kind of kills that whole premise.


> Do I need to set USERID for project foo, or UID? Does it default to 1000 or the author's UID? Oh, someone has a problem with our project, did they remember to set COMPANY_USERID in their bashrc? Oh, wait, they're using zsh, how do you do that there? Oh, but they followed this other project's readme and that set COMPANY_USERID but not COMPANY_GROUPID...

You set it to the output of id -u and id -g. It's two lines. There are definitely lots of things more complex when dealing with docker than this.

You provide the team with a script containing those two lines and a docker-compose wrapper and you're set.
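A minimal sketch of what such a wrapper could look like, assuming the compose file reads USERID and GROUPID as in the example upthread. The function name `compose` and the `COMPOSE_BIN` override are made up for illustration; neither is part of Docker.

```shell
#!/bin/sh
# Hypothetical wrapper: export the caller's uid/gid under the names the
# compose file expects, then hand every argument to docker-compose.
# COMPOSE_BIN is an illustrative override, not a Docker setting.
compose() (
    export USERID="$(id -u)"
    export GROUPID="$(id -g)"
    exec "${COMPOSE_BIN:-docker-compose}" "$@"
)
```

With this sourced into the shell, `compose up -d` would run docker-compose with both variables set, whichever shell the developer uses.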

Of course it would have been better not to have to care about these things, but hey, at least you're not installing and configuring 4-5 services to bootstrap an application.


If you have to build it on each machine, I would not consider that easily/universally distributable. One of the key points of Docker is you can build once (in your CI or someone else's) and run it on any machine. I think that was GP's point.


Sure, great, let me just rebuild all my docker images on every single machine they run on thereby completely defeating the point of having images in the first place.


You start from a base image of your choice. You only build the user replacement part.

You run docker-compose build ONCE and you're set. On my machine, it takes five seconds.

Heck, you can even run docker-compose build every time you start the application, it will use the cached build and take less than one second.

---

Correction: the docker-compose up -d takes care of the build process the first time it runs.

Literally, it takes longer to complain about the issue than to build the image ONCE.


And reproducibility goes straight out the window. And how do you even interop this with kubernetes?


The solution is for docker-compose (or plain docker).

I don't think the reproducibility is out. It's the same app, the same image, the same intended user, you just inject, once, the local user and group ids.


but that _requires_ you to build-at-runtime, which is sometimes not the best way to deploy a docker app. if you have one app that you want to run on many nodes, you'll want to set up a docker registry and have the nodes pull pre-built images.


Of course, but really only build once on every machine. The subsequent starts use the cached build, even after reboot.

In fact, docker-compose up -d takes care of the build thing by itself. It's a five second tradeoff for the lifetime of the application.


For anyone that uses immutable infrastructure, where a server's configuration is never changed once built and subsequent deployments result in replacement with entirely new VMs, building once per machine still happens every time there is a deployment. You don’t ever reboot these machines.

In environments where vulnerability scanning of docker images used is important, running anything in production that isn’t stored in a docker registry kind of breaks things.

This approach also won’t work with container orchestrators like Kubernetes, ECS, Lambda, CloudRun, etc.

Where I can see a docker build of a small layer that just sets file perms potentially being useful is for container-based dev environments to be run on laptops and workstations.


This has been a major Docker pain point, and not many people know about this trick. I didn't know you could have the variables in the Compose file directly, does that really work?

Our approach so far was to add yet another layer (a script to pass uid/gid to Compose), but if we don't need the script that would be fantastic.

EDIT: Ah, I just saw the bashrc wrinkle you mention. Yeah, that's why we had the script, and it's a damn shame Docker can't do this natively. It has been a major hassle.


> I didn't know you could have the variables in the Compose file directly, does that really work?

Yep, it's because the build args get read in from a .env file by default and then from there Docker Compose sends those build args to Docker when it builds the image.
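For the USERID/GROUPID variant discussed upthread, a rough sketch of generating that .env file could look like this. The function name is hypothetical, and it overwrites the target file, so merge by hand if your .env already holds other variables.

```shell
#!/bin/sh
# Write the current user's ids into an .env file, which Docker Compose reads
# from the project directory when substituting ${USERID} and ${GROUPID}.
# write_ids_env is an illustrative name; the default path is ./.env.
write_ids_env() {
    env_file="${1:-.env}"
    {
        printf 'USERID=%s\n' "$(id -u)"
        printf 'GROUPID=%s\n' "$(id -g)"
    } > "$env_file"
}
```

Running `write_ids_env` once per checkout would be enough, since the ids rarely change on a given machine.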

This was one of the topics from my talk at DockerCon last week (creating a production ready Docker Compose set up). The video and 6,000 word blog post for it will be coming out tomorrow. Both things will be added to the talk's reference links at https://github.com/nickjj/dockercon21-docker-best-practices.


That's interesting, thanks! My shell sets the USER variable (but no USERID or GROUPID), which might be good enough for all our developers, but probably not reliable enough for a general audience.


Honestly in practice everything tends to work fine without any hacks or extra scripts.

I run all of my containers as a non-root user and create the user in the image with its default values of 1000:1000 for the uid:gid. I haven't bothered to expose the uid:gid as build arguments because it's pretty much never an issue in development or production.

With a uid:gid of 1000:1000 built into the image any bind mounted files end up being correctly owned by the Docker host's user under the following conditions:

- Docker Desktop on macOS

- Docker Desktop on Windows using WSL 1

- Docker Desktop on Windows using WSL 2 and native Linux (as long as your dev box's user is set to 1000:1000)

IMO it's really rare that your dev box's user wouldn't be 1000:1000 on native Linux or WSL 2.
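If you want to check that assumption on a native Linux or WSL 2 box instead of trusting it, a small sketch like the following would do. The function name and messages are illustrative only.

```shell
#!/bin/sh
# Succeed only if the current host user matches the expected uid and gid
# (1000 and 1000 by default, the values baked into the image in this setup).
ids_match() {
    [ "$(id -u)" = "${1:-1000}" ] && [ "$(id -g)" = "${2:-1000}" ]
}

if ids_match; then
    echo "host user is 1000:1000; bind-mounted files should line up"
else
    echo "host user is $(id -u):$(id -g); expect ownership mismatches on bind mounts"
fi
```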

In production you also have full control over the uid:gid of your deploy user.

The only time where it kind of stinks is CI, but it's super easy to get around this by simply not using volumes in CI.

I have a bunch of examples of this pattern at:

    - https://github.com/nickjj/docker-flask-example
    - https://github.com/nickjj/docker-django-example
    - https://github.com/nickjj/docker-rails-example
    - https://github.com/nickjj/docker-phoenix-example
    - https://github.com/nickjj/docker-node-example
    - https://github.com/oleksandra-holovina/docker-play-example


> IMO it's really rare that your dev box's user wouldn't be 1000:1000 on native Linux or WSL 2.

Any company-wide (GNU/)Linux deployment that uses LDAP or some other centralized user directory will not have devs with UID/GID 1000:1000. Hope is not a strategy.


> Any company-wide (GNU/)Linux deployment that uses LDAP...

You can go the extra mile and turn the UID:GID into build args like the original parent and you're good to go. No hacks necessary, and since it's all self-contained in an .env file there's nothing extra you need to run, since you're likely using an .env file already for other vars.

Alternatively you could do this: https://news.ycombinator.com/item?id=27344491

In either case you can solve the problem without too much effort.


> You can go the extra mile and turn the UID:GID into build args like the original parent and you're good to go.

That doesn't help you if you're attempting to use pre-built/existing Docker images that are not built internally and make the assumption that “1000:1000 is good enough”. You then not only have to hack around Docker limitations, but also around someone else's broken assumption.


> That doesn't help you if you're attempting to use pre-built/existing Docker images that are not built internally

Most pre-built images that I've come across don't require bind mounts to function.

Images like PostgreSQL aren't affected by this because you can use a named volume, and most pre-built applications that are shipped as images tend to store their state in a database and don't require bind mounts to function.


> IMO it's really rare that your dev box's user wouldn't be 1000:1000 on native Linux or WSL 2.

Any major company using LDAP/AD or other forms of centralized user management won't be able to make that guarantee.

> In production you also have full control over the uid:gid of your deploy user.

If you're running in an un-managed environment, yes - managed hosting of any kind generally doesn't provide these guarantees.


Hm, you're right, I guess I've seen a non-1000 user very rarely. However, for a company of tens to hundreds of people where you want them to be able to develop locally, you might very well hit this issue, and if you hardcode 1000 it's going to be hard for them to work around it.

This method works well until it doesn't work at all, and I think I would prefer one that works slightly less well but also had an easier way to override it. Then again, I might try this and see if we ever hit an issue, thanks!


IIRC on Arch, unless you create your own group, you're part of the users group, with GID 100


maybe i did something weird last time i installed ubuntu, but my user is 1001:1002 and the default ubuntu user is 1000:1001


Within docker-compose.yml I use

  services:
    foo:
      image: foo/bar:6.9
      user: ${UID:-1000}:${UID:-1000}

On Linux with Bash it runs as your current user, and on most other platforms it runs with id 1000, which is set up as the default user in the Dockerfile. This is no problem on macOS or Windows because of the way Docker Desktop uses VMs.

Zsh and other shells don't necessarily set $UID, so if you're on Linux, your id isn't 1000, and you aren't running Bash, you might need a little .env file with `UID=1001` in it to make it work. And then the user is still nameless in the container. This is kind of rare and I only use it for dev containers where most relevant files (and permissions) are bind-mounted from the host, so it hasn't really been a problem in practice.

Remaps would be cleaner but I find it too much work to explain for normal developers just wanting to use a dev container.


From my experience, UID is not always available to docker-compose.yml because it isn't exported (at least in bash).

See more here: https://stackoverflow.com/a/50900530/15428104

  $ declare -p UID
  declare -ir UID="1000"

The -x option is missing.
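A quick way to see what child processes such as docker-compose actually receive is to look at the environment itself and not at the shell variable. This is only a sketch, and `is_exported` is a made-up helper name.

```shell
#!/bin/sh
# Report whether a variable is present in the environment passed to child
# processes, as opposed to merely being set inside the current shell.
is_exported() {
    env | grep -q "^$1="
}

if is_exported UID; then
    echo "UID is exported and visible to docker-compose"
else
    echo "UID is not exported; consider putting UID=$(id -u) in an .env file"
fi
```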


This is excellent, thank you.


Containers are ideally meant for a single service. The best way I've found is to just pass the `--user` flag to `docker run` and have the service run as whatever user it is that you want. The only challenge is that you need to make sure that the volume mounts are already created on the host with the correct permissions.
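Roughly, that could be scripted as below. This is a sketch with hypothetical helper names; the image name and paths in the usage comment are placeholders.

```shell
#!/bin/sh
# Build the --user argument from the invoking user's numeric ids.
user_flag() {
    printf '%s %s:%s\n' --user "$(id -u)" "$(id -g)"
}

# Create the host directory up front so it is owned by the invoking user.
# If the Docker daemon has to create a missing bind-mount source itself,
# the directory typically ends up owned by root.
prepare_mount() {
    mkdir -p "$1"
}

# Usage sketch (image name and paths are placeholders):
#   prepare_mount "$PWD/data"
#   docker run --rm $(user_flag) -v "$PWD/data:/data" image:tag
```

The `$(user_flag)` in the usage line is deliberately unquoted so that it splits into the two arguments `--user` and `uid:gid`.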


That runs the container as a given user, but doesn't prevent the container running some processes as a different internal user.


If you built the container or inspected it before running you should know what the container is doing. Again, containers like Docker aren't really "meant" to run multiple processes. They are meant to run a single process and your app should be able to run as whatever user you run the container with. If you want to run multiple processes or services inside a single container then ultimately you're better off with a different container solution.


> more than one computers where in each computer my user has different ID

Decades of network filesystem users have had many solutions to that.


I can think of basically two solutions:

1) pass user/group names around and resolve them at the destination to UID/GID; 2) ignore them entirely; assign ownership of all newly created files to the currently authenticated user (if authorized).

Are there other ones?


3) treat a machine-id/user-id pair as the “real userid” 4) add a remote->local userid mapping feature to your filesystem.


There is a new mount syscall in Linux 5.12, see "ID remapping in mounts" [1], that should help with all the permission madness, eventually.

It allows different mounts to expose the same content with different ownership, and in general to map permissions IDs between mounts in any way we like.

systemd-homed will use that to abstract over the uids and gids of portable home directories, for example.

[1]: https://lwn.net/Articles/837566/


This doesn't even really seem like a problem that docker introduced. All these problems have been encountered by anyone running an NFS server, or a dozen other ways you can have systems with disparate uid/gid mappings using a shared or removable file system


In Podman this is a solved problem: podman run --userns=keep-id


also `podman unshare` is really helpful


> If this user is the “root”, then these files will not be accessible from web server or the CGI server, except if the server is running as root

Wait, what? Why not install the immutable files as root and let them be readable to everyone?


No one mentions how podman solves that problem with user id mapping?


I mentioned this in another comment here.


Would you mind elaborating?


I don't have the time to write elaborate comments right now, but see here:

http://docs.podman.io/en/latest/markdown/podman-run.1.html

Especially the "userns" option with the "keep-id" value.


I blogged about this same problem a month ago.

"Docker and the host filesystem owner matching problem": https://www.joyfulbikeshedding.com/blog/2021-03-15-docker-an...

In my blog post I lay out 2 solution strategies, how one might go about implementing them, and caveats to watch out for.

1. Matching the container's UID/GID with the host's UID/GID.

2. Remounting the host path in the container using BindFS.


Using uid 0 in containers is asking for trouble. Any privileged resources (such as low ports) can be mapped in without messing with capabilities so there should be no need for it.


The port mapping is done by the container engine, not the container. Using low ports is allowed if the engine runs as root. Moreover I think it’s acceptable to use uid 0 inside a rootless container like podman since it’s by default only mapped to the user running it.


AWS Fargate won't let you remap ports. Whatever the container exposes, that's the port it's going to listen on. To work around this and other problems, I ended up making fat containers that start as root, and add entrypoints that can either run a process as root (to listen on low ports) or sudo to a user to drop perms before starting a process (to listen on high ports).

There's also weird junk you sometimes need to do in order to capture file handles depending on how a container engine is running the container, which you need to do before you fork or drop privs. But it took me years to finally run into that use case, most people will never need to do this.


Shameless plug: a boilerplate where I had to solve UID permissions, running as non-root user, publishing files to another container, mounting fs as read only, and hot reloading in dev environment.

It's still pretty much a proof of concept and it relies on docker compose but perhaps some of you may find it useful as a starting point: https://github.com/tacone/loki


Recently ran into this. So far I've landed on `setfacl`

- `--user` didn't work for me because there were root permissions in my image

- I didn't dig into why `userns-remap` didn't work

- I didn't give https://github.com/boxboat/fixuid a try yet

Some notes from my experience

  setfacl -dm "u:alexandros:rw" ~/alpine

should be

  setfacl -R -dm "u:alexandros:rwx" ~/alpine

In case:

- `-R`: There is existing content in `~/alpine` you want made available

- `x`: You want your container to be able to create directories

However, you can still run into problems if

- Your container copies data from outside your bind-mount to inside. It sort of worked, except somehow the mask was `r--`, so things lost write permission.

- Your container moves data from outside your bind-mount to inside. This fully preserves the permissions.

I ended up creating a `.keep` file in the bind mount and doing a `cp --attributes-only --preserve=mode,ownership,xattr .keep <target>`


For the local development scenario, I made an open source utility that uses the setuid bit to change the UID/GID of a particular user and any files that user owns inside of the container at runtime:

https://github.com/boxboat/fixuid


Hack to "solve" Rootless Docker permission issues:

  nsenter -U --preserve-credentials -n -m -t $(cat $XDG_RUNTIME_DIR/docker.pid) /usr/bin/chown -R root:root /home/user/workspace

This one-liner enters the namespace of Rootless Docker and chowns the files back to your normal user (root in there is your host user once you switch back).

Useful anytime you use a filesystem mount ... (Ex: storing database on disk so docker doesn't kill it every run).

You can now do backups, rebuild docker images, etc.

More information: https://github.com/jpetazzo/nsenter#how-do-i-use-nsenter


I think all those problems disappear when you run containers with proper orchestration tools, such as Kubernetes.

And not only that, I think that the examples given in the article ("Assume that your Apache/PHP container is mounting the host’s /home/alexandros/myapp/ application directory to the container’s /var/www/html directory.") are in fact anti-patterns. If your container depends on a specific file being available at a specific location on the host then you're doing it wrong. The only place where that makes sense is in a developer's local environment. In shared environments you want something like a Kubernetes ConfigMap to contain config files, and dedicated persistent volumes for everything else.


The orchestration tool does not provide any additional functionality to fix this problem, it's up to the container execution environment, and today's container execution environments have no way (that I am aware of) to natively map file permissions outside of the container.

It could be I just haven't dug enough into the kernel internals, maybe there is a transparent permissions remapping thing. But something would absolutely have to map permissions. Otherwise there is no way to use filesystem ownership between execution environments without them using conflicting UID/GIDs, to say nothing of changing the file perms.


It makes sense that mounting a volume requires understanding a user mapping tbh. I think the answer is twofold:

a) Many problems solvable with a volume can be solved with a bind-mount, cache-mount, etc [0].

b) In the event that you actually need to map in a user-file, wrap the docker command in a script that manages the logic. At this point you're writing a system tool that's doing things outside of the context of a container - it's not really docker's fault that it doesn't try to make this trivial.

[0] https://vsupalov.com/buildkit-cache-mount-dockerfile/


> First of all, security issues may rise in a production system. If a container is compromised and the container is executed as root (uid = 0), then the intruder has access to any file of the host filesystem that has been loaded to the container filesystem through a mount. The owner UID of files that belong to the host root will be 0 in the container. So, they will be accessible to the intruder.

Use supervisord to coordinate the processes inside your Docker container, as easy as that. Bonus point, you don't need to wrangle with properly handling "docker stop"/ctrl+c.


Isn't this a bit of an anti-pattern? There really are very few situations in which you should be mounting things in production. Apache/PHP/etc is definitely not one of those situations.


I would absolutely say it's a production antipattern to run a container with access to some already existing host files belonging to some other user.

However, this is something that's basically unavoidable if you're attempting to use OCI/Docker for dev where you access a developer's source code checkout from a container running a standardized language runtime. And that's what a lot of people use OCI/Docker for...


Couldn't you run into this issue when mounting device files? I believe doing that for accessing external hardware or sensors is not all that uncommon.


Sure, that's one of the cases when this might be needed in prod (although in the parent post I meant only access to honest-to-god data files, not things like bindmounting /dev).

In practice bindmount smell can also be somewhat alleviated by using things like k8s device plugins to request things at a higher level ('I want GPU access' vs. 'please bindmount /dev/drm... and use the proper modes'). It's still effectively a bindmount, but some extra security precautions can be made to ensure exclusive access and that no arbitrary mounts from the host are permitted. And things like k8s device plugins can also poke at file modes and other namespace magic at runtime so that the end user never has to worry about things like UID/GID and chardev modes. That IMO prevents the smell associated with random host bindmounts.


I wasn't aware of k8s device plugins, that seems like it would help with that, if k8s is an option. Thanks for the pointer!


You're welcome :).

They're also very easy to write, so if you ever happen to run k8s and need to give workloads access to some odd/custom host hardware, implementing a proper plugin for it is quite painless and gives much better guarantees than plain bindmounts.


Every additional mount can be considered a design failure in terms of security, or simply laziness. They all increase the attack surface. Even though containers were not designed primarily for isolation, every mount and volume is one step closer to breaking that isolation. Of course, the total risk depends on where you are mounting from.


Related to this post, a recent runc version included a change that inadvertently made a number of images built on the distroless base image difficult to use: https://danielmangum.com/posts/runc-chdir-to-cwd/


Personally I rely on boxboat/fixuid when this is an issue for me.

Would love a real solution from docker though.

https://github.com/boxboat/fixuid


This is something Charliecloud was built around for HPC and something podman can work around. User namespaces and fuse-overlayfs are the building blocks to fix this.


I've always solved this by just having a proxy script that creates a user with the right UID/GID when the container starts, then executes the given command.
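A rough sketch of that kind of proxy script, under several assumptions of mine: the ids are taken from the owner of a bind-mounted directory, the base image is Debian-like with `groupadd`/`useradd` available, and util-linux's `setpriv` is used to drop privileges. `APP_DIR` and the name `app` are placeholders.

```shell
#!/bin/sh
# Sketch of an entrypoint that starts as root, matches the owner of a
# bind-mounted directory, and then drops privileges before running the
# real command. APP_DIR, "app", and the use of setpriv are assumptions.

# Print "uid:gid" of the directory's owner (GNU/busybox stat syntax).
owner_ids() {
    stat -c '%u:%g' "$1"
}

main() {
    ids="$(owner_ids "${APP_DIR:-/application}")"
    uid="${ids%%:*}"
    gid="${ids##*:}"

    # Create a matching group and user unless those ids already exist.
    getent group "$gid" >/dev/null || groupadd -g "$gid" app
    getent passwd "$uid" >/dev/null || useradd -M -u "$uid" -g "$gid" app

    # Replace this shell with the requested command, running as uid:gid.
    exec setpriv --reuid="$uid" --regid="$gid" --clear-groups "$@"
}

# Run only when a command was passed, e.g. when used as the image's ENTRYPOINT.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

Reading the ids from the mounted directory means nothing has to be passed in from the host, at the cost of a root-owned startup step inside the container.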



