Alpine makes Python Docker builds slower, and images larger (pythonspeed.com)
306 points by itamarst on Jan 29, 2020 | 149 comments



I do a lot of Python and a lot of Docker. Mostly python in Docker. I've used both Alpine and Ubuntu. There is a fair amount right about this article, and a lot wrong.

First "the Dockerfiles in this article are not examples of best practices"

Well, that's a big mistake. Of course if you don't follow best practices you won't get the best results. In these examples the author doesn't even follow the basic recommendations from the Docker Alpine image page, e.g. use "apk add --no-cache PACKAGE". When you're caching apt & apk package indexes, of course the image is going to be a ton larger. On the flip side, he does basically exactly that to clean up Ubuntu's apt cache.
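
For reference, a rough sketch of what the recommended pattern looks like on each side (package names here are just placeholders for whatever you actually need):

  # Alpine: --no-cache skips writing the package index into the layer
  RUN apk add --no-cache gcc musl-dev

  # Debian/Ubuntu equivalent: clean the apt lists in the same layer
  RUN apt-get update && \
      apt-get install -y --no-install-recommends gcc && \
      rm -rf /var/lib/apt/lists/*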

The real article should have been "should you use alpine for every python/docker project?" and the answer is "No". If you're doing something complicated that requires a lot of system libs, like say machine learning or image manipulation - don't use Alpine. It's a pain. On the flip side, if all you need is a small flask app, Alpine is a great solution.

Also, build times and sizes don't matter too much in the grand scheme of things. Unless you're changing the Dockerfile regularly, it won't matter. Why? Because Docker caches each layer of the build. So if all you do is add your app code (which changes, and is added at the end of the Dockerfile) - sure, the initial build might be 10 min, but after that it'll be a few seconds. Docker pull caches just the same, so the initial pull might be large, but after that it's just the new layers.
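
A minimal sketch of that ordering (the paths and entrypoint are made up):

  FROM python:3.8-alpine

  # dependencies change rarely, so these layers stay cached
  COPY requirements.txt .
  RUN pip install --no-cache-dir -r requirements.txt

  # app code changes all the time, so it goes last
  COPY . /app
  CMD ["python", "/app/main.py"]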


> Also, build times and sizes don't matter too much in the grand scheme of things. Unless you're changing the Dockerfile regularly, it won't matter. Why? Because Docker caches each layer of the build.

It does if you practice continuous deployment, or even if you use Docker in your local dev setup and you want to use a sane workflow (like `docker-compose build && docker-compose up` or something). Unfortunately, the standard docker tools are really poorly thought out, beginning with the Dockerfile build system (assumes a linear dependency tree, no abstraction whatsoever, standard tools have no idea how to build the base images they depend on, etc). It's absolute madness. Never mind that Docker for Mac or whatever it's called these days will grind your $1500 MacBook Pro to a halt if you have a container idling in the background (with a volume mount?). Hopefully you don't also need to run Slack or any other Electron app at the same time.

As for the build cache, it often fails in surprising ways. This is probably something on our end (and for our CI issues, on CircleCI's end [as far as anyone can tell, their build cache is completely broken for us and their support engineers couldn't figure it out and eventually gave up]), but when this happens it's a big effort to figure out what the specific problem is.

This stuff is hard, but a few obvious things could be improved--Dockerfiles need to be able to express the full dependency graph (like Bazel or similar) and not assume linearity. Dockerfiles should also allow you to depend on or include another Dockerfile (note the differences between including another Dockerfile and including a base image). Coupled with build args, this would probably allow enough abstraction to be useful in the general case (albeit a real, expression-based configuration language is almost certainly the ideal state). Beyond that the standard tooling should understand how to build base images (maybe this is a byproduct of the include-other-Dockerfiles work above) so you can use a sane development workflow. And lastly, local dev performance issues should be addressed or at least allow for better debugging.


Docker's build system has been given a huge overhaul and does create a dependency tree. It is highly efficient and even does things like support mounts for caching package downloads, build artifacts, etc.

See https://github.com/moby/buildkit. You can enable it today with `DOCKER_BUILDKIT=1 docker build ...`

There is also buildx which is an experimental tool to replace `docker build` with a new CLI: https://github.com/docker/buildx


I don't see how buildkit could possibly build the correct dependency tree because the Dockerfile language doesn't let you express nonlinear dependencies.

If you have a command `do_foo` that depends on do_bar and do_baz (but do_bar and do_baz are independent) and you do something like:

    RUN do_bar # line 1
    RUN do_baz # line 2
    RUN do_foo # line 3
I'm guessing the buildkit dep graph will look like `line_3 -> line_2 -> line_1` (linear). Unless there is some new way of expressing to Docker that do_foo depends on do_bar and do_baz but that the latter two are independent.

EDIT: clarified example.


Dependencies are also tracked across multiple stages, e.g.:

So `COPY --from=<some stage>` and `FROM <other stage>`
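
For example, here's a sketch reusing the hypothetical do_bar/do_baz/do_foo commands from upthread (the /out paths are made up too); BuildKit treats the `bar` and `baz` stages as independent nodes and can build them in parallel:

  FROM alpine AS bar
  RUN do_bar

  FROM alpine AS baz
  RUN do_baz

  FROM alpine AS final
  # 'final' depends on both 'bar' and 'baz', but they don't depend on each other
  COPY --from=bar /out/bar /out/bar
  COPY --from=baz /out/baz /out/baz
  RUN do_foo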

Also, a Dockerfile is just a frontend for buildkit. The heart of buildkit is "LLB" (sort of like LLVM IR in usage) which is what the Dockerfile compiles into. Buildkit just executes LLB, doesn't have to be from a Dockerfile.

For that matter you can have a "Dockerfile" (in name only) that is not even really a Dockerfile, since the format lets you specify the frontend to use (which would be a container image reference) to process it.

There's even a buildkit frontend to build buildpacks: https://github.com/tonistiigi/buildkit-pack Works with any buildkit enabled builder, even `docker build`.


Yeah, there are a lot of workarounds when you leave the standard tooling. I like the idea of using something like Bazel to build Docker images (since modeling dependencies and caching build steps is its raison d'etre) and eschewing Dockerfiles altogether; however, I haven't tried it (and Bazel has its own costs). I'm not familiar with buildkit in particular, but it's cool that it has an internal representation. I'll have to dig around.


Nix also can build efficient docker images and computes layers in a way to make them reusable among multiple projects.


Yes! Check out https://nixery.dev/


Yeah, it is cool, but it feels more like a demo for nix.

If you want to create a docker image of your own app, you would probably use:

https://nixos.org/nixpkgs/manual/#sec-pkgs-dockerTools

This will produce an exported docker image as a tar file, which you can then import using either docker or a tool like skopeo[1] (which is also included in nixpkgs).

The nix-shell functionality is also quite nice, because it allows you to create a common development environment, with all the tooling one might need to work.

[1] https://github.com/containers/skopeo


> If you want to create a docker image of your own app, you would probably use [...]

Nixery can be pointed at your own package set, in fact I do this for deployments of my personal services[0].

This doesn't interfere with any of the local Nix functionality. I find it makes for a pleasing CI loop, where CI builds populate my Nix cache[1] and deployment manifests just need to be updated with the most recent git commit hash[2].

(I'm the author of Nixery)

[0]: https://git.tazj.in/tree/ops/infra/kubernetes/nixery [1]: https://git.tazj.in/tree/ops/sync-gcsr/manifest.yaml#n17 [2]: https://git.tazj.in/tree/ops/infra/kubernetes/tazblog/config...


I'm not sure I understand what you mean by workarounds and "leaving standard tooling".


I’m guessing he meant roughly ”docker build”


Correct (as far as I know), but:

A) Changing a Dockerfile is rare

B) Typically the lines that change (adding your code) are near the end of the Dockerfile, and the long part with installing libraries is at the beginning


(A) is true, but it's not the only way to bust the cache. Changes to the filesystem are much more common. You can try to account for this by selectively copying in files in some sort of dependency order and this can work okay so long as the dependency order resembles your filesystem hierarchy, but if you want to (for example) install all of your third party dependencies before adding your source code, you'll need to add each requirements.txt or package.json file individually so as to avoid copying in your source code (which changes more frequently). Doing this also tightly-couples your Dockerfile to your in-tree dependency structure, and keeping the Dockerfile in sync with your dependency tree is an exercise in vanity. Further, because you're forcing a tree-structure (dependency tree) into a linear structure, you are going to be rebuilding a bunch of stuff that doesn't need to be built (this gets worse the wider your dependency tree). Maybe you can hack around this by making an independent build stage per in-tree package, which might allow you to model your dependency tree in your Dockerfile, but you're still left keeping this in sync manually. No good options.
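
A sketch of the selective-copy pattern being described, for a repo with two hypothetical in-tree packages:

  # copy only the dependency manifests first, in dependency order...
  COPY services/api/requirements.txt services/api/
  COPY services/worker/requirements.txt services/worker/
  RUN pip install -r services/api/requirements.txt \
      -r services/worker/requirements.txt
  # ...and the actual source only afterwards, so code edits don't bust the install layer
  COPY services/ services/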


> Never mind that Docker for Mac or whatever it's called these days will grind your $1500 MacBook Pro to a halt if you have a container idling in the background (with a volume mount?). Hopefully you don't also need to run Slack or any other Electron app at the same time.

Just in case anyone is wondering, this is a great exaggeration. Idling containers are close to idling (~2% CPU currently), and Slack got pretty small last year. These work just fine; I wish the trope died already.


I very much experienced this with our Django application and traced the problem back to Django's dev server polling for fs changes at 1 Hz. Apparently this is enough to spin fans with Docker for Mac.

I solved the problem by installing inotify into the container, which Django will use if present, which reduced CPU from 140% to 10%. This was a couple of months ago.


Thank you very much! I have this same issue but did not discover the culprit.


This is because there is no native support on Mac for docker. Everything has to run in a virtualized environment that's basically a slightly more efficient version of VirtualBox. When people say docker is lightweight and they're running it on a Mac they don't quite understand what they're saying. Docker is lightweight on baremetal linux, it's not lightweight on other platforms because the necessary kernel features don't exist anywhere except linux.


Yeah, the slow-downs are killing me on my MacBook. Everything starts off fine but after a few hours (usually right when I'm in the zone) the whole system just grinds.

I've started experimenting with coding on a remote docker host using vscode's remote connection feature.

I'd be interested to know if anyone else had gone down this path?


We looked into this. The biggest tradeoff is that your whole team has to change their workflow because the repo has to stay inside of a container on the Linux VM (it can't be shared with the host or you'll trigger the performance issue) which means anyone who wants to cat/sed/edit/grep/etc a file will have to go through a Docker container to do it. It's also a bit more complex if you're running multiple services via Docker Compose, as we were. We couldn't see this workflow change working out well in the long term, and someone had already done the work to use native processes orchestrated with PM2 and that seemed to work reliably once it was set up.


https://blogs.vmware.com/teamfusion/2020/01/fusion-tp20h1-in...

Project Nautilus is a pretty interesting approach to running a container on Mac. In theory it should be more efficient than Docker for Mac.

Disclaimer: I work for VMware, but on a different team.


My setup is like this:

Mac Virtualbox runs linux with host only network vbox0.

Docker runs in vbox linux. (now it gets ugly)

Vbox linux brctl's docker0 (set to match vbox0 ip space) into vbox0.

Docker container is reachable by IP from mac host. All is fast and good.


> I'd be interested to know if anyone else had gone down this path?

this is at least one use case for 'docker-machine'


It's true that the overhead is larger on macOS. And you're right it doesn't run natively there. But it's not like an idle process in a hardware-virtualised guest is less idle than a native one. Sure, there may be extra events waking up the kernel itself. But let's not exaggerate the impact. There's no grinding to a halt from an idle container.


To be clear, the CPU on the container itself was negligible according to `docker stats`; however, the VM process was still using its max allotment. My money is on the volume manager, but we didn't see it being worthwhile to figure out how to debug deep into the bowels of the VM technology (we don't have VM or filesystem experts on staff) to sort it out. Note that we also tried docker-sync and a variety of other solutions, but the issue persisted. Eventually we gave up and just moved back to native processes managed by PM2. Our local dev environment sucks a fair bit more to set up, but it's cheaper than debugging Docker for Mac or dealing with the persistent performance problems.


Docker for Mac has always included some settings to tune the VM too. If your whole computer grinds to a halt because of Docker, it's because you probably allocated too many resources to the VM. I have half of my laptop CPU/RAM dedicated to the docker VM and while sometimes the fans go a little crazy I've never had the desktop lock-up or anything like that.


This is true, but it doesn't solve the problem. If you give Docker half of your overall resources, it's just going to take Docker twice as long (most likely longer) to finish running the test suite or whatever you're doing in Docker. The crux of the problem is that Docker for Mac has pathological cases, probably involving host-mounted volumes that are big or overlaid or something else that we were doing; the containers can be near idle and Docker for Mac consumes 70-100% of its CPU budget (presumably doing volume/filesystem things).

Note that a little Googling reveals that this is a pretty common problem.


If you give any VM all your cores and then your desktop locks up, you played yourself. That wasn't a good idea before docker and it's not a good idea now. I've personally had issues with using file-change-watchers on mounted volumes in some cases but because I limited my VM to half my resources, the underlying OSX was fine and I could still do whatever I needed to do (including killing those containers).


You’re being pedantic. Docker for Mac shouldn’t use the full VM allotment at idle, full stop. Nitpicking the parent for speaking in terms of the host cores instead of the VM cores is off topic and boring.


There's a lot of space between those extremes, nobody is claiming that idle containers are consuming entire CPU cores. But idle virtualised machines are interrupting your host OS a lot more than you might realise.


That's what the comment I was responding to claimed - "grind to a halt" was a quote.


Was it edited?

> This is because there is no native support on Mac for docker. Everything has to run in a virtualized environment that's basically a slightly more efficient version of VirtualBox. When people say docker is lightweight and they're running it on a Mac they don't quite understand what they're saying. Docker is lightweight on baremetal linux, it's not lightweight on other platforms because the necessary kernel features don't exist anywhere except linux.

"grind to a halt" and "is not lightweight" are not even close to being synonymous.



That's several layers up, and is true. There are many bugs in docker for mac, one of them is that vpnkit(?) leaks memory like a motherfucker. And the other is that volume mounts crunch I/O like madness, so your load factor spikes and your laptop perceptibly becomes slow due to IO latency.

So "grinds" is somewhat accurate, if you have long running containers doing very little, or you are constantly rebuilding, even if the machine does not look like it's consuming CPU.


On Windows it plugs into Windows containers and Hyper-V.


When people say Docker is lightweight and they’re running it on a Mac, they’re probably talking about their production environment and it’s rude to say that they don’t quite understand.


It's certainly not an exaggeration:

Docker with a few idle containers will burn 100% of CPU. https://stackoverflow.com/questions/58277794/diagnosing-high...

Here's the main bug on Docker for Mac consuming excessive CPU. https://github.com/docker/for-mac/issues/3499


My quote was our observed performance across our fleet of developer machines until ~December of 2019. Maybe our project was hitting some pathological edge case (more likely this is just the performance you can expect for non-toy projects), but there's no documented way to debug this as far as I (or anyone else in our organization) could tell. Note that this was the performance even if nothing was changing on the mounted file systems and with all watchers disabled. Bystanders can feel free to roll the dice, I guess.


npm watch on a host mounted volume is a pretty good way to kill performance though.


I'll add the same: idling does take a whole CPU in my situation too, and only on the mac, not on Linux.


I've built some docker build infrastructure which attempts to optimize build times and reduce cost of incremental builds. I was able to take a monolithic binary artifact which cost around 350mb per build down to less than 40mb by more intelligent layering. If you haven't found it already, `--cache-from` with the previously built stable image makes this relatively painless.
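
For anyone who hasn't used it, the basic shape is something like this (the image name is a placeholder):

  docker pull registry.example.com/app:stable || true
  docker build --cache-from registry.example.com/app:stable \
      -t registry.example.com/app:stable .
  docker push registry.example.com/app:stable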

I've been considering writing a merge tool to support fork/merge semantics, focussing more on development and debugging than build-time optimization.


Speaking of best practices for Dockerfiles and CI/CD, a lot of these issues can be highlighted at build time with a Docker linter like https://github.com/hadolint/hadolint.


I didn't know about hadolint, but I don't see how it (or any other linter) can address any of these issues (unless "these issues" is not referring to the issues I was mentioning in the post you responded to).


Hadolint will tell you about things like adding --no-cache to apk add. My point being that comments were made about not following best practices, and hadolint will help with that.


Yeah, you replied to the wrong post. You should have replied to the parent of the one you replied to.


Yep


Hadolint is nice. Hadolint addresses none of the major issues you raise.


> As for the build cache, it often fails in surprising ways. This is probably something on our end (and for our CI issues, on CircleCI's end

We have been using Google Cloud Build in production for over a year and Docker caching [1] works great. And Cloud Build is way cheaper than CircleCI.

I recommend it, and I'm not getting paid anything for it.

[1] https://cloud.google.com/cloud-build/docs/speeding-up-builds...


Or, and I am biased here, use Cloud Native Buildpacks and never think about this stuff again.


I'm not familiar, can you elaborate on how those solve these problems? I'm always looking for a better way of doing things.


Broadly, CNBs are designed to intelligently turn sourcecode into images. The idea of buildpacks isn't new, Heroku pioneered it and it was picked up in other places too. What's new is taking full advantage of the OCI image and registry standards. For example, a buildpack can determine that a layer doesn't need to be rebuilt and just skip it. Or it can replace a layer. It can do this without triggering rebuilds of other layers. Finally, when updating an image in a registry, buildpacks can perform "layer rebasing", creating new images by stamping out a definition with a changed layer. If all your images are built with buildpacks, you reduce hours of docker builds to a few seconds of API calls.

This is a bit of a word soup, so I'll point you to https://buildpacks.io/ for more.


Most Dockerfiles for python projects will have a line to install their python dependencies though.

  COPY requirements.txt ./
  RUN pip install -r requirements.txt
If you're building the image on a CI server, docker can't cache that step because the files won't match the cache due to timestamps/permissions/etc... The same is true for other developers' machines.

This is a problem if your requirements includes anything that uses C extensions, like mysql/postgresql libs or PIL.


You can achieve a similar caching improvement by either:

1. Using poetry which keeps a version lock file so all changes are reflected/cached, or

2. Doing a similar thing yourself by committing `pip freeze` and building images from that instead of requirements.txt.
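
For option 2, that looks roughly like this (requirements.lock is a hypothetical filename):

  # locally / in CI: commit the fully pinned dependency set
  pip freeze > requirements.lock

  # in the Dockerfile: install from the lock file instead of requirements.txt
  COPY requirements.lock .
  RUN pip install --no-cache-dir -r requirements.lock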


To be clear, the only file in question is requirements.txt; Docker has no idea what files `pip install ...` is pulling and doesn't factor them into any kind of cache check. Beyond that, I didn't realize that timestamps were factored into the hash, or at least if they were, I would expect git or similar to set them "correctly" such that Docker does the right thing (I still think Docker's build tooling is insane, but I'm surprised that it breaks in this case)?


I just tested if timestamps are factored in, and I was wrong. According to the documentation:

https://docs.docker.com/develop/develop-images/dockerfile_be...

> For the ADD and COPY instructions, the contents of the file(s) in the image are examined and a checksum is calculated for each file. The last-modified and last-accessed times of the file(s) are not considered in these checksums. During the cache lookup, the checksum is compared against the checksum in the existing images. If anything has changed in the file(s), such as the contents and metadata, then the cache is invalidated.


Not been an issue for me using Gitlab CI runners, at least..? Which may be because Gitlab CI keeps working copies of your repos.

If the CI system keeps the source tree the Dockerfile is being built from around rather than removing it all after every build, it caches stuff as normal.


> Also, build times and sizes don't matter too much in the grand scheme of things.

In which case... why do you bother with alpine in the first place?


> Too much

There are cases in which it does matter. Just like anything else, it strongly depends on your use case. If you're into the microservices thing and rebuild your containers, or change requirements frequently - maybe container size (as they'll be pulled a lot, by a lot of different hosts) matters. Maybe you're making something for public consumption and want to make sure it doesn't take up a huge amount of space. Maybe you're making an image for IoT/RPi type devices. You get the idea.

Personally, I like using Alpine where possible because it's got less stuff. Less software means less things that could potentially have a security issue needing fixing/patching/updating later.

However my default container for anything else is a "miniubuntu" build as it's got all the basics, it's 85mb in size, and I can install all the things I need for the more complicated projects.


I don't. Never got on board with it and just stick to Ubuntu so I don't have to think about the differences.

Never had a business driver come up for going with Alpine though.


You're also supposed to delete any dependencies you no longer need after compiling. This image might be unnecessarily large.

https://stackoverflow.com/questions/46221063/what-is-build-d...


Assuming it's even possible (there are cases where it isn't), unless you're doing the entire operation in a single RUN instruction (install dependencies, compile, remove dependencies), deleting dependencies isn't enough because they'll exist in an ancestor layer before you delete them. That leads to image bloat.

This is why multi-stage builds are a thing, which the author advocates against doing.


I've built openresty with tons of custom plugins in a single RUN call. It's possible. The image is tiny compared to:

- a Debian-based image for the same thing

- not deleting build-time dependencies

The author just doesn't know better. That's what happens when you never build things from source yourself.


> Ex, use "apk add --no-cache PACKAGE"

I have an Alpine docker image which was 185MB and after I added the above, it was 186MB. I was definitely hoping for more, given your strongly worded advice.


I disagree with the author on many counts, largely because I maintain two stable, multi-platform Python images:

- https://github.com/insightfulsystems/alpine-python

- https://github.com/rcarmo/ubuntu-python

The first uses stock Alpine packages, and the second builds Python from scratch (with some optimizations) atop Ubuntu LTS. They serve two different use cases, but maintaining both taught me a few things, and there are a few factual errors in the article.

For starters, yes, you can run manylinux wheels on Alpine. Here’s how to do it:

https://github.com/insightfulsystems/alpine-python/blob/mast...

So no, you don’t need to recompile every single package for an Alpine Python runtime.

As to runtime differences, yes, they exist, but are _extremely_ dependent on your use cases. I have not encountered any of the bugs - I did have a few crashes with GPU libraries and Tensorflow (which is why I also maintain the Ubuntu-based version), but the author points out third-party accounts, one of which seems to be attributable to locale settings (which you should always set anyway).

Performance differences are negligible on Intel (at least for web apps using Sanic and asyncio - I don’t do much Django these days), and the inclusion of a link to buy a “production-ready template” mid-article is just... iffy.

Give us data and working code - I’d like to see an objective benchmark, for instance, and might even set up one with my own images.


Thanks for sharing your manylinux wheels setup. Will the python packages just link the musl libc instead of glibc?

If that's all that is needed then it makes me wonder why pip is only downloading wheels on glibc distros currently.


In my experience the only things that broke were the CUDA/NVIDIA bindings that are “several turtles down” from Keras and the like, but, again, these images serve different purposes and I use the Alpine ones mostly for web services and vanilla sci-kit stuff.


I confirmed I was able to install lxml from a wheel and parse a web page with it.


The build issue is because musl wheels don't exist, so installing with pip takes forever.

If you just need pandas, then you can install it with apk along with many other packages and it'll be much quicker: https://pkgs.alpinelinux.org/packages?page=1&branch=edge&nam...
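
e.g. something like this (py3-pandas is the package name as listed in Alpine's community/edge repo at the time; the exact name and branch may differ):

  RUN apk add --no-cache py3-pandas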

Alpine are doing what they can, by maintaining common python packages on apk. It is pip that is tied to glibc


Unfortunately in many (most?) cases I've seen the prescribed deployment practice for Python software is to use virtualenv (or some variant of it). Python dependencies are to be handled exclusively via pip (requirements.txt, install_requires in setup.py, etc.) and using distro-packaged Python modules is frowned upon because they are considered to be outdated.

I'm not saying that's the best practice. Personally I like to use distro-packaged stuff as much as possible because it's less volatile and tends to go through some more review than things on PyPI. But the virtualenv/pip exclusive method does seem to prevail in the field in my experience, so the advice to just use Alpine packages isn't useful in many cases.


Absolutely agree. Which is why I don't use Alpine linux for python projects and wouldn't recommend anyone else does.

In the article, they are only using pandas and not using a requirements.txt file. So it's a pretty perfect case to just use apk. In cases where you don't need many packages, you can still use Alpine if you would like without most of the downsides mentioned in the article.


I write a decent amount of python on alpine, in general I stick to the standard library and occasionally something like pyserial, sqlite, or numpy for specific things. IMO this attitude of pulling in dozens of packages with pip is pretty dangerous, I don’t care which language you’re using.


What's the alternative?


Removing features, moving them into micro services, and sometimes writing a class or two yourself.


That just sounds like distributing the dependencies. You'll still install them all, just not in the same spot.


It is best practice, but you probably don’t need virtualenv if you’re deploying a single application in a docker container.


Messing with the system python without using some form of virtual environment is a recipe for disaster.


In the docker container you ARE the system python version. In fact many docker best practices would say that preferably only 1 application is running, and that would be your application.


It's still not safe. For example there was a conflict one day between awscli, urllib and some system tool (apt?) which led to a broken (apt fails) system if you just "pip install awscli" globally. Even in docker image, you still need virtualenv to be safe (or use only global distro packages).


Could you elaborate on why? I worked at a co that almost exclusively ran Alpine containers with a single Python app. No use of virtualenv. Never experienced any hangups with this.


Same here, and I'm curious to know why some think virtualenv is required in a container, as opposed to just starting with an image containing a clean python install and adding what you need.


I linked the example here https://news.ycombinator.com/item?id=22186046

But the tl;dr is: if you "apt install some-utility", then "pip install something else", you may have upgraded packages that some-utility relies on to versions it is no longer compatible with.


Pip and apt install to different folders already. In the very unlikely event there’s an issue in a container the solution is simple, modify the sys.path. The other 99% of the time avoiding the venv is a win.


Sure it's unlikely, but it happened at least once now, and that's exactly the situation venv is designed to avoid. Pip and apt install to different places (by default/convention), but you end up importing the pip version by default. You can change sys.path to work around it of course... but that's pretty much what virtualenv does anyway. Utilities by default do not change that path which leads to issues like the linked one.

I'm not saying never install globally, just let's keep in mind that this can lead to real issues which may be very surprising / hard to debug once in production. Unless you understand exactly what and how is delivered with every change, defaulting to venv is a safer option.


Complexity on top of complexity makes things harder to debug in my opinion. I’ve never had trouble debugging library issues and they’ve never happened to me in a container. Even easier is to not mix tools in a container anyway. Less is more.


It can be a tool used as a dependency of your app. And I feel like you can make the same claim about locks. They just introduce complexity and make debugging harder, race condition never happened to me anyway... It's about ensuring edge cases don't bite you when you don't expect it.

If you can prove it doesn't apply, then sure, why not install globally.


Looked up the issue and it was cloud-init, not apt that failed. Here are the details: https://github.com/aws/aws-cli/issues/3678

In one sentence: "Installing the latest aws-cli on an image is preventing new AMIs based on that image from booting due to the above issue with urllib3."

While this specific issue wouldn't affect docker images since you normally don't run cloud-init on them, it's just luck that it wasn't some other utility affected instead. Next time it can affect docker images too.


Why do you have cloud-init in a container?


It's a docker box, you're already inside some form of virtual environment


Not in docker. The entire point of docker is isolation, so when you use something like python:3.7-slim, it's fine to use the system python. That's the whole point of using the image.

Now you still want to run and install pip packages as a non-root user of course, but you don't need a virtualenv in docker.


Debian distros have a separate dist-packages directory for system libs. You can do whatever you want in site-packages. It is effectively like a default virtualenv when operating in a container.


For those like me who have no idea what "wheels" is in this context, I think it is referring to https://github.com/pypa/wheel

This looks like some sort of binary Python packaging thingy. I think @dkarp is saying that there exist pre-existing binary builds for glibc for many things, but not for musl libc, so they are forced to be built explicitly when using pip to manage Python dependencies.

I could not figure out where this repository of binary packages is or who maintains it. Any corrections would be appreciated.


You've got it. Wheels in Python are like gems in ruby. (Or eggs in [python minus a few years].) The official Python package index has binary images, but they're compiled against glibc, and therefore don't work with Alpine. So when you `pip install` a package with binary dependencies on Alpine, it needs to compile them. Which can lead to a bit of a trip down a rabbit hole.

As others have mentioned, Alpine's repository does include Alpine-compatible binary versions of many popular Python packages. So you can often save yourself a lot of trouble by using apk instead of pip to get those.


> The official Python package index has binary images, but they're compiled against glibc, and therefore don't work with Alpine.

That’s a false statement (depending on interpretation). Whether a PyPI package has wheels is entirely at the discretion of the package maintainer, who is responsible for compiling all source and binary distributions; PyPI does not compile anything and only accepts uploads, verbatim. If the maintainer doesn’t upload wheels or doesn’t upload wheels for your platform, then no wheels for you (not from PyPI). In general compiling statically linked wheels (when you have C extensions and external dependencies) is a fairly involved process.

At the moment there are six platform tags for Linux wheels supported by PyPI: manylinux{1,2010,2014}_{x86_64,i686}. Each is based on a CentOS release with very old glibc. See

https://github.com/pypa/manylinux

PEP 513 - manylinux1 https://www.python.org/dev/peps/pep-0513/

PEP 571 - manylinux2010 https://www.python.org/dev/peps/pep-0571/

PEP 599 - manylinux2014 https://www.python.org/dev/peps/pep-0599/


The wheels are in regular PyPI. They are uploaded by package maintainers. See for example here: https://pypi.org/project/pandas/#files


Pre-built binary distributions ("wheels") and source distributions (which sometimes need a compilation step) just live side by side on pypi.org, the public python package index.


> If you just need pandas, then you can install it with apk along with many other packages and it'll be much quicker: https://pkgs.alpinelinux.org/packages?page=1&branch=edge&nam....

And so again need to rely on something "non-standard".

> Alpine are doing what they can, by maintaining common python packages on apk. It is pip that is tied to glibc

The "manylinux" tag means glibc by necessity, because nobody has driven a PEP for tagging, detecting and managing musl-based wheels to completion[0]. The issue is not restricted to the choice of libc, or even linux alone, either. But again, that requires people actually put in the work to chip at and fix the issue[1].

[0] https://github.com/pypa/manylinux/issues/37

[1] https://github.com/pypa/packaging-problems/issues/69


Not everyone uses pip. Many data scientists use conda. Using your distribution's package manager is a legitimate solution.

The point is though, that this is a python packaging issue. Not an Alpine issue and not an issue that Alpine can do much about. As you linked, pypa have a big task to make packages work on every possible platform. Unfortunately, this comes as an unexpected surprise for anyone who wants to try out Alpine - maybe pip should show a warning when wheels aren't available because of musl.



This post is actually reversing cause and effect: installing Python packages is slower on anything that is not glibc or x86(-64), because that's what wheels are built for. There's nothing wrong on Alpine's side, or with anything that is not x86(-64); the problem is with the mechanism used by Python.

This is a problem I have with Python: everything is assumed to be a standard GNU/Linux/desktop environment, you are assumed to always have to have the latest versions of everything (or to have "up to" a certain version, which then breaks everything else that relies on the latest), and all becomes sub-optimal if you deviate a bit from the expectations.

Now this is fine for some people but you can't blame the ones that have different environments. This is why most people doing actual embedded cringe when someone suggests using Python, because the experience is not usually pleasant.


Lots of criticism of the dependency issues detailed here, but no one talking about the fact that musl looks to allocate roughly ~50% slower than glibc.

For many applications, particularly those that aren't written with low-level performance details in mind, this is likely to cause significant slowdowns (obviously not the full 50% as many things will be CPU bound, but still likely noticeable).

More generally, it seems that Alpine optimises for build/deployment, but glibc-based distros are going to be faster at runtime. While a fast build/deploy is nice, it's probably not worth it if it comes at a cost of significant runtime performance issues.


Like anything else, if you need to worry about speed - you should optimize. There are just as many people who will shit on Python for its speed/scale capabilities as there are people who will point out "Well, it works for Instagram".


There are levels to caring about speed, and this could easily cause issues at a level above what they should, being a fairly low level detail (memory allocations).

I work on a Python application where we don't care about memory allocations (we care about the number of database queries for example). Using musl may make memory allocations expensive enough (as they are 50% slower) that we may have to start paying attention to this in some areas of our code.


The reason musl is slower is largely because the project's goals are about being lightweight and correct first and foremost, and this can often come into conflict with performance optimization. Python isn't the issue here, but almost any Python program will be heavily affected by this because the Python interpreter does a lot of allocation.


I'd structure the Dockerfile in another way with the "python:3.8-alpine" approach.

E.g. something like:

  FROM python:3.8-alpine
  
  RUN apk add --no-cache --virtual build-dependencies gcc build-base freetype-dev libpng-dev openblas-dev && \
    pip install --no-cache-dir matplotlib pandas && \
    apk del build-dependencies
This way you won't keep the build dependencies in the layer and the final image. Maybe not all of the packages are build-dependencies, but at least there is no need to keep the gcc and build-base package around. Long story short, wrap your "pip install", "bundle install", "yarn install" around some "apk add .. && apk del" within one RUN.

Yes, it would still compile for quite some time, but the final image size is not "851MB" but "489MB". Still larger than the "363MB" of the "python-slim" version. Guess "pip install" will keep some build artifacts around?


Alternatively you can create a multi-stage build.

You also don't need to clean up the builder, since the unused files don't end up in the final image.

Also, to add a new file you can just append it to the builder (no rebuild of the previous installs required, since you don't care about the size of the builder image).

  FROM python:3.8-alpine as builder
  
  RUN apk add --virtual build-dependencies gcc build-base freetype-dev libpng-dev openblas-dev
  RUN pip install matplotlib pandas

  FROM python:3.8-alpine
  # a couple of copy commands to copy binaries and py files from builder to the final image. 
  COPY --from=builder /usr/bin /usr/bin
  COPY --from=builder /usr/local/lib/ /usr/local/lib/


Exactly. This is what I always do.

I would never want any build tools in the final Alpine image.


TIL about `--virtual`. This is very useful.

>-t, --virtual NAME

>Instead of adding all the packages to 'world', create a new virtual package with the listed dependencies and add that to 'world'; the actions of the command are easily reverted by deleting the virtual package


Your guess is correct. Pip does keep a cache of packages around. It's easy to fix by setting the `--no-cache-dir` switch when installing:

`pip install --no-cache-dir -r requirements.txt`


Wow--that actually solves a mystery which I have been trying to figure out for a few months now! A while back, I was trying to Dockerize the Twint app, and ran into exactly the same problem. I ended up writing a "lite" version of the Dockerfile that would strike references to Pandas in the Python code:

https://github.com/dmuth/splunk-twint/blob/master/Dockerfile...

To this day, I'm amazed it worked. But it took build time from 15 minutes down to 30 seconds.

I will have to revisit this, however, because I would prefer to have just one container that builds quickly instead of having to do a workaround like that.


> And the resulting image is 851MB.

If you include the dev dependencies and compiler toolchains in the final image. You can (and should) cut all that out of the final image.


The author sort of mentions this:

> The image size can be made smaller, for example with multi-stage builds, but that means even more work.

...which I really don't find compelling, honestly. If you want a smaller image, then yes, you need to do one of the major things that produces smaller images. Don't install a whole dev chain, leave it in the image, and then complain that your image is big.


I think "ship gcc and its dependencies to production to avoid writing 2 extra lines in your Dockerfile" is pretty awful advice. Less stuff in your containers means faster image pulls and less attack surface. Seems crazy not to do it. FROM clean-image, COPY --from=base-image ...

I am guessing with Python the difficulty is in enumerating what files you need to copy, because there's more than one. But I am sure that can be figured out exactly once and you can save yourself some time and money on unnecessary data transfer.


A couple more nitpicks for the author if he/she is reading this...

> Alpine doesn’t support wheels

You mean wheels do not support musl, or to be more precise they don't conform to POSIX 2008 or C11 standards.

> Alpine has a smaller default stack size for threads, which can lead to Python crashes.

You're linking to a Python bug. That's right, Python. This is a bug in Python. Not Alpine, nor musl.

The whole point of using Alpine-based Docker images is that musl provides a lean, efficient and standards-compliant implementation of libc which should in turn result in faster runtime execution of the software running in your container, and other secondary benefits such as smaller binaries and faster build times. If your software (in this case, Python) is glibc-dependent and is not standards-compliant, the bug is in your software, not musl. Don't expect it to work. The bottom line is that if you don't understand this, you shouldn't be using Alpine. Please don't trash-talk something you don't understand.


Not really Alpine's problem that there aren't upstream wheels for your dependencies. I think you're being realistic with your suggestion but you're barking up the wrong tree.


Multi-stage container builds are also not exactly unusual. You would generally not want an entire C build system sitting around in your container.


Isn’t this also a problem of requiring the compiler toolchain at install time?

I think we will eventually reach consensus that package managers requiring a compiler is a misfeature.

I think I’d be just fine having two base images, one for runtime, and a superset image for Continuous Integration that layers the compiler toolchain on top.

But Node has this same behavior and it drives me nuts.


>Isn’t this also a problem of requiring the compiler toolchain at install time?

This isn't a new problem if you use docker. To address it you have to either use multistage builds or more traditionally you install the dev tools, build and install your artifacts, then remove your dev tools all in a single step. It's annoying, but it's a common pattern.


> I think we will eventually reach consensus that package managers requiring a compiler is a misfeature

What's your alternative solution? For simple native libraries there's always libffi, but things that need numerical processing will want to actually plug into the underlying value representation directly to not sacrifice speed.


I agree, since this is the case for all distros using musl. This was a decision made by Python not to support musl in the manylinux wheel standard: https://github.com/pypa/manylinux/blob/46241e9debbaf4b045c88...

If you need more than a few packages installed using pip, then Alpine simply is not a good choice for python.


musl seems very niche. That’s an ecosystem problem; that is on Alpine for swimming upstream.


The criticisms about musl's memory allocator being slower than glibc may be true. However, this specific issue can be resolved by telling the Python interpreter to use e.g. jemalloc as its allocator instead of the system allocator.

Unfortunately, this requires building a custom binary linking jemalloc and calling into Python C APIs to configure the embedded interpreter, and that's probably prohibitive effort for many users, who view the Python interpreter as a pre-built black box and therefore see musl's allocator as an unavoidable constant. Fortunately, tools like PyOxidizer exist to make this easier. (PyOxidizer supports using jemalloc as the allocator with a 1 line change, and jemalloc can deliver even more performance than glibc's allocator.)


From the documentation:

> raw_allocator (string)

> Which memory allocator to use for the PYMEM_DOMAIN_RAW allocator. Values can be jemalloc, rust, or system.

> Important: the rust crate is not recommended because it introduces performance overhead.


What about https://github.com/sgerrand/alpine-pkg-glibc? It provides a glibc apk for Alpine. The miniconda Dockerfile makes use of this before continuing to install miniconda. https://github.com/ContinuumIO/docker-images/blob/master/min.... Would this solve all of the issues raised in the article?


Don't get scared by this article and not use Alpine. It's perfectly fine to run Python and it's what we do with Home Assistant too.

Home Assistant maintains Alpine wheels for a ton of packages for their 5 supported platforms: https://wheels.home-assistant.io/

Repository of our wheels builder can be found at https://github.com/home-assistant/hassio-wheels


From my own (limited) experience, a 25-minute build time seems pretty unusual for an Alpine image pulling in Python dependencies. The lack of wheel support on Alpine definitely seems like a compelling argument against using it for Python apps, but that build time indicates to me that some confounding factor might be at play here. Maybe matplotlib and/or pandas are just way bigger than the libraries I've pulled into Alpine images before? Curious to hear what others think.


Pandas and numpy (which all things data science depend on) require compilation of C/C++ code which comes in addition to the download/install steps. Wheel files can contain precompiled .so files.


Some of the libraries which can be very slow to install due to compilation that we've ended up building our own alpine-compatible wheels for:

Pillow, PyXB, cffi, cryptography, gevent, lxml, msgpack-python, psycopg2, pycrypto, uWSGI


> Most or perhaps all of these problems have already been fixed, but no doubt there are more problems to discover. Random breakage of this sort is just one more thing to worry about.

It's 2020, and yet we're still falling victim to textbook examples of FUD.


Definitely ran into "Alpine Linux can cause unexpected runtime bugs" -- with strftime, glibc and musl don't have 100% equivalence of the date/time formatting strings.


C library variations between UNIX systems are the eternal curse of this species apparently.


That happens as well on Debian if there are no wheels for your packages. It happens quite often when Python versions change.

I just run a multistage build where I first install all the packages, building wheels where necessary. Then I copy them to a new clean image from the first container. Another benefit is that I don't need to have a compiler installed in my final image. The Dockerfile is much cleaner than having one huge RUN which installs and then uninstalls build dependencies.
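
Roughly (a Debian-based sketch; which build packages you need depends on your requirements):

  FROM python:3.8-slim AS build
  RUN apt-get update && \
      apt-get install -y --no-install-recommends build-essential && \
      rm -rf /var/lib/apt/lists/*
  COPY requirements.txt .
  # build wheels for everything, including packages without prebuilt wheels
  RUN pip wheel --wheel-dir /wheels -r requirements.txt

  FROM python:3.8-slim
  # only the built wheels come along; no compiler in the final image
  COPY --from=build /wheels /wheels
  RUN pip install --no-cache-dir /wheels/*.whl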


This.

The author should use a multi-stage build where one stage is used for building the missing wheels. Those should then be copied over to the final stage.

No need to install any build tools in the final Alpine (or Debian) image. The Alpine image will definitely be much smaller.


Not sure I've seen this mentioned anywhere before, but what we tend to do is build all of our dependencies in CI by setting `_PYTHON_HOST_PLATFORM=alpine-x86_64` and uploading them to an internal index.

This way when we go to build our final app images, there is no need to compile any python packages or install a toolchain. Our builds are generally under a minute or two as its just downloading and installing packages via Poetry/Pip.
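
Roughly, the CI step looks something like this (the index URL and file layout are placeholders):

  # run on an Alpine-based CI image so the binaries actually link against musl
  export _PYTHON_HOST_PLATFORM=alpine-x86_64
  pip wheel --wheel-dir ./wheelhouse -r requirements.txt
  twine upload --repository-url https://pypi.internal.example/ ./wheelhouse/*.whl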


This article nicely summarizes the shortcomings of Python (depending on lots of C and C++ libs) and how it is tied to GCC. If Alpine was a tier 1 platform for Python usage I am pretty sure there would be musl-flavoured wheel packages. Comparing apples to oranges and concluding they are different.


The article makes you click on another article for the answer of "well, what Python docker build should you use?"

If you're not satisfied with Alpine and need slim builds, the answer is python:3.8-slim-buster (from the other article).

From empirical experience, slim-buster has been a good choice for our use case of ML services.


We're using an Alpine based image while prototyping an application at work right now. I really dislike APK, it is so limited compared to what comes with APT. I guess it's a little smaller than the limited Ubuntu image available, but APK is just not any fun to work with.

Just my two cents.


The issues in the article mostly apply to packages with (some) extensions.

Matplotlib? Pandas? Those are not typical except for the datascience crowd.

For your regular web stack? It works fine. And yes, installing the apk will save time in a lot of cases.


FWIW, my current project is a pretty typical web stack (Flask/SQLAlchemy) and the build time was bad enough that I switched back to debian. In a web stack, you're likely to have a lot of external imports, so I'd say you're generally better off with debian. If you're doing data science and only importing Pandas then you'll be fine as you can switch to the apk.


Glibc libm has been heavily tuned for both speed and correctness over many years. Not sure I'd trust musl to the same extent in that area.

(Assuming your data sciency workload uses those functions; if not, this doesn't apply)


That's weird, my guess is that it might be because of a DB driver or some other module that needs C

In my use cases the image download time eclipsed the build time for the most part


What exactly is pip erroring out on with the wheel install?

If it's looking for glibc's /lib/ld-$foo.so, you could symlink musl and see if it works; and if python touches some glibc-specific junk, just install gcompat - https://pkgs.alpinelinux.org/package/edge/community/x86/gcom...


On Alpine it doesn't even download the wheel at all, it just downloads the source code.


Could you RUN apk add --no-cache gcompat then see what pip does? If pip is checking os-release for determining the libc then it is utterly broken.


I understand the need for it, but at least at my amateur level I dislike it. I can afford a decent VPS - I don't care about the extra 100MB. I do care if the CLI command I want doesn't work because it got stripped out for the sake of efficiency. Now I'm spending time very inefficiently adding back the stuff that is there by default on ubuntu?

If I wanted maximum efficiency I'd be coding in ASM not deploying dockers.


The fact glibc is the "standard" does make musl users required to work a bit harder. I would say the proper solution regarding Alpine would be to simply package more python libraries in the official repos and install them from there (pandas is in edge at the moment). Another solution could be to build in one docker image and use that as a pip repo for the development one.


Besides image size and build time, runtime performance can be really bad with Alpine's libc implementation (musl). My colleague pinpointed that, for example, PowerDNS has less than 50% of the throughput compared to PowerDNS running with glibc. We should really write a blog post about it.


About the runtime errors - I confirm, I experienced periodic crashes on the same data that works fine in an Ubuntu-based docker image. I use OCaml, not Python, but the point is still valid.


There is so much misinformation in this article. It's hard to tell if it's deliberate or not, so I'll assume the author is uninformed.


I tend to agree that the author is at best making some very specific judgement calls (particularly, I think that refusing to do multi-stage builds disqualifies any size-based comments), but your comment isn't very constructive if you're not actually addressing anything that they say.



