Hacker News
Distributions vs. Releases: Why Python Packaging Is Hard (pydist.com)
131 points by BerislavLopac on April 30, 2019 | 76 comments

Most of my pain with python packaging comes from incompatible changes in the toolchain.

Some years ago, pip started distrusting HTTP mirrors, and while you could add some options to force it to use HTTP, those options weren't present in previous pip versions. Which meant that you now had to provide options depending on the pip version -- which is harder if you don't call pip directly (for example through dh-virtualenv).

We switched to HTTPS, but one with a TLS cert signed by a company-internal CA. Getting pip to trust another CA is non-obvious, and again depends on the pip version. So another version-dependent workaround was necessary.
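If it helps anyone in the same situation: on reasonably recent pips, both settings can at least live in a pip.conf instead of version-dependent command-line flags (the host and cert path below are illustrative):

```ini
# /etc/pip.conf (or ~/.config/pip/pip.conf); keys mirror the long options
[global]
cert = /etc/ssl/certs/internal-ca.pem
trusted-host = pypi.internal.example.com
```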

Then pip suddenly moved all of its implementation from pip.something to pip._internal.something, stating that it had been documented all along that you shouldn't import anything from the pip.* namespace. But none of the package names started with an underscore, so by Python convention they looked like perfectly fine public classes and methods.

Moreover, there simply isn't any public API for things you really don't want to reimplement yourself, like parsing `requirements.txt` files.
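For illustration, here is a rough pure-stdlib sketch of the easy 80% of that parsing; it deliberately ignores extras, environment markers, -r includes, and URLs, which is exactly why a real public API would be valuable:

```python
import re

def parse_requirements(text):
    """Naive requirements.txt parser: handles 'name<op>version' lines,
    comments, and blank lines only. The real format (extras, markers,
    -r includes, direct URLs, hashes) is far richer than this."""
    reqs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments
        if not line or line.startswith("-"):  # skip blanks and pip flags
            continue
        m = re.match(r"([A-Za-z0-9._-]+)\s*([<>=!~]=?.*)?$", line)
        if m:
            reqs.append((m.group(1), (m.group(2) or "").strip()))
    return reqs

print(parse_requirements("requests==2.21.0\n# comment\nflask>=1.0\n"))
```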

As soon as you want to do anything slightly non-standard with pip/distutils/setuptools, you find a solution pretty quickly, and that solution turns into a nightmare when you upgrade any of the components involved.

Also, finding a local pypi mirror that lets you inject your own packages and deals with wheels correctly and doesn't generally suck and is somewhat easy to use... not easy.

Agreed; for something that managed to become the de-facto solution for Python package installation, pip always felt surprisingly unreliable to me.

My solution at a small company, where we couldn't waste time with this crap, was to vendor everything: when a developer updated the requirements.txt, they would also install the packages in a project directory (using "pip install -r requirements.txt -t ./dependencies") and commit that. That way, the rest of the process (CI, packaging, etc.) could just check out the repo, set PYTHONPATH=./dependencies, and ignore pip completely.

Nix's usage of source hashes to pin every package is more and more prescient, as each language's custom-written package manager reinvents the wheel. (Pun indented!)

It's a bit unfair to accuse languages which predate Nix by decades of "reinventing" anything.

I don't think it is. First, a correction: Python predates Nix by 12 years, not decades. The first experimentation with what would later be named Nix was published in 2003 by Eelco Dolstra[1]; the 2004 paper[2] refers to it specifically as Nix. However, we aren't comparing Nix (the language) to Python; we're comparing Nix (the package manager) to easy_install and, more recently, pip, released in 2004 and 2011 respectively. It's 2019 now, and I don't think it's unreasonable any longer to expect what Nix is capable of from your package manager.

There are plenty of situations where the output of the build process is not determined by the source alone.

Nix hashes your development environment along with the sources to generate a version hash of the build process result.


Install Python from source with and without readline-devel in the ld library path.

Or any other piece of C software with macros that depend on the build environment. Same .c file in, different binary out.

Nix does not put libraries in a global dynamic-linker path; they are made available for linking by specifying the library's derivation as a dependency. There is no such thing as a global glibc or readline. For example:

  % ldd $(which nc) | grep ssl
  libtls.so.18 => /nix/store/s6j0yd68cnfb4mv76lyrb413qhhac57g-libressl-2.8.3/lib/libtls.so.18 (0x00007f8875f6d000)
  libssl.so.46 => /nix/store/s6j0yd68cnfb4mv76lyrb413qhhac57g-libressl-2.8.3/lib/libssl.so.46 (0x00007f8875f1a000)
  libcrypto.so.44 => /nix/store/s6j0yd68cnfb4mv76lyrb413qhhac57g-libressl-2.8.3/lib/libcrypto.so.44 (0x00007f8875d3b000)
A binary or library always links against a specific version of a library. This is why in Nix, you can have several versions of a library installed in parallel. The initial part of the path after /nix/store is the hash of the derivation used to build this version of libressl and all of its recursive dependencies. Furthermore, Nix can build packages in a sandbox (which is the default on NixOS) to avoid accidental dependencies.

Consequently, a Python derivation that depends on readline would have a different hash than one that does not depend on readline.
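For concreteness, a minimal Nix sketch of the readline example (attribute and package names from nixpkgs; the src path is a placeholder). The two attributes evaluate to different /nix/store hashes, because buildInputs are part of the derivation that gets hashed:

```nix
{ pkgs ? import <nixpkgs> {} }:
rec {
  # Same source, readline in buildInputs: one store hash...
  withReadline = pkgs.stdenv.mkDerivation {
    name = "myprog";
    src = ./.;                       # placeholder source path
    buildInputs = [ pkgs.readline ];
  };
  # ...and without readline, a different hash.
  withoutReadline = withReadline.overrideAttrs (_: { buildInputs = [ ]; });
}
```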

Homebrew has had to deal with this in its sometimes-painful transition over the years from being a build-from-source system to a binary package manager. Suddenly you can't just have some package require the boost package be built using the --with-python flag, you either need to have it on in the bottled version (but wait, then boost has a dependency on Python, is that okay?), or you have to ship a separate boost-python, like Debian (but wait, then it's no longer a 1:1 mapping of recipes to bottles, is that okay?).

Nix handles this just fine. You should check it out.

This sort of thing is IMO exactly why containers have merit. Let the software vendor decide what to distribute, or let others jump in and fill the gap. Power users can still read the Dockerfiles and decide what to do, or even build it themselves. Once built, the image is well defined, i.e. not depending on the host environment, including the filesystem except for volume mounts. When will we get a whole desktop distro like this?

Many of the pieces which go into a desktop have a strong assumption of being able to interact with other components in ways other than a network socket. Containerizing an application is one thing, but how do you containerize something that has plugins? What about desktop widgets? Do they go in their own containers, or get mixed in with the host's somehow? How is ABI compatibility enforced now that we're in container land and none of that matters any more?

What about client/server components where there's dbus in between, like the network manager or volume control?

These are solvable problems, but it's been enough of a challenge building a server OS from containers (RancherOS), and building a desktop OS that way is a significantly harder problem.

> containerize something that has plugins

I don't really see the problem there; I see plugins as essentially just data (i.e. stored in a mounted volume), for which updating and versioning is in the domain of the application itself or maybe some standardized library it uses.

Desktop widgets: essentially the same thing; it's a plugin to the desktop environment and can be stored as a volume mount on the DE container.
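That idea can be sketched in compose syntax (the image name and paths below are invented for illustration):

```yaml
services:
  desktop-env:
    image: example/desktop-env:latest   # hypothetical DE image
    volumes:
      # Plugins/widgets live on the host and are mounted in as data,
      # versioned by the application rather than by the container image.
      - ./plugins:/usr/lib/desktop-env/plugins:ro
```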

dbus is probably something that would require an evolution on the container side, or alternatively it would need to be abstracted into network interfaces. Another possible way to look at it is to have a layer between the kernel and the containerized userland that is responsible for manipulating all the physical host things in the traditional way; the examples you give are exactly that. Maybe this sort of thing should continue to be distributed tightly together with the kernel.

There's an entire Debian project working on making Debian builds deterministic and/or reproducible. The fact that they had to devote a whole team to it, and have poured who knows how many hours into it, kind of says that most software builds aren't reproducible (deterministic in theory, yes, because in many cases it's just different timestamps).

NixOS also has a reproducibility project: https://r13y.com/

It seems that things must have recently regressed; the page currently says 40% but it used to say 98%.

Java, the most enterprise-y of the enterprise platforms: practically no Java build system produces deterministic results. Gradle supposedly can, Maven does not, and I've read that the Java compiler itself can't be entirely trusted to produce exact builds, soo...

I only got interested in Python again after I became aware of tools like Poetry https://poetry.eustace.io/

Looks very interesting, but the part of the installation instructions where it says that its directory "will be in your $PATH environment variable" worries me – is the installer taking the liberty of messing with my configuration without asking first?

That looks really sharp.

If you take this further, and can compile any of the "C" code (open source), then you can link that existing code into one single binary, e.g. "python": basically Python with all the compiled code needed. Thus:

- You no longer need .dll, .so, etc. to go along; you can package the rest of the .py files into a .zip and slap it at the end of the now "fat" python binary, effectively having one executable. A bit of "go"-style releasing...

- This can even give smaller sizes, but careful with FFI: only the code that is exposed to be used from Python would get linked into the fat "python" binary. With FFI it's more tricky, as you have to force symbols to be kept.

- You can obviously do that with other runtimes: with Java, e.g., your whole "java" can be down to one executable really.

But to get there, you need a sane build system, in order to express these dependencies outside of python/java/whatever... And obviously the source code (although precompiled static libs should work in this scheme too)...

The goal, again: just to have one thing, and most importantly no .so, .dll, .dylib lying around.

This is what tools like py2exe and PyInstaller do.

You can do this with plain zip files, as Python can unpack and execute them in-memory. But you have to extract non-Python code such as C libraries at runtime somewhere so the OS can link them. I would be interested to know how the packagers you mentioned get around this, or if they too use temporary files and unpack.
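For the pure-Python part, the stdlib's zipapp module already does the zip-concatenation trick; here is a small sketch. C extensions are exactly the part this doesn't cover -- they would still have to be extracted (or memfd-loaded) before the OS can link them:

```python
import os
import subprocess
import sys
import tempfile
import zipapp

def build_and_run_pyz():
    """Build a single-file .pyz from a pure-Python source tree and run it."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "app")
        os.makedirs(src)
        with open(os.path.join(src, "__main__.py"), "w") as f:
            f.write('print("hello from zipapp")\n')
        pyz = os.path.join(tmp, "app.pyz")
        zipapp.create_archive(src, pyz)  # stdlib since Python 3.5
        result = subprocess.run([sys.executable, pyz],
                                capture_output=True, text=True)
        return result.stdout.strip()

print(build_and_run_pyz())
```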

This is exactly how I package my python scripts. The answer to your question: Memfd https://magisterquis.github.io/2018/03/31/in-memory-only-elf...

And zipimport pip package

Right, there are OS-specific approaches on all systems, but if you are able to compile Python itself along with all the open-source C/C++ libs your .py files might use, you end up with exactly one executable (you can possibly go further and package the .py files, compressed and bytecode-compiled, inside the same binary). Everything you need is there, and the approach fares better across more systems (it does not rely on advanced techniques, no need to unpack the .DLL into a temp folder, etc.). Also better for post-mortem debugging: your .PDB and .DWP (or whatever symbol files) are... well, just one single file, rather than several.

Same thing for java, dart, etc. This makes it much easier to deploy, use, download, as it's basically "zero"-config install, no need for scripts to tweak your PATH, etc.

(But it has limitations, clearly: if you use proprietary code, or you can't compile a given open-source library bundled with you, e.g. LGPL code that you have to dynamically link. Then again, if your project is open source, this is no longer much of a limit -- but I would not know that for sure.)

Author here, happy to answer any questions.

Do you have any internal guidelines for ensuring these sorts of public announcements have a neutral, compassionate tone? To me, it feels like it would be easy to read this article as being more empathetic to the tools than the users, and to interpret it as blaming the users for using the tools wrong: authors for adding a new distribution to a release post-hoc, instead of creating a new release; or consumers for not carefully vetting packages to ensure there’s a matching distribution for their platform & not using source distros.

I feel like, as a representative of the tool it’s easy for readers to default to understanding any criticism (even the slightest) of user behaviour & no discussion about tool behaviour as being defensive of the tools.

I know the Python package tools community value their users and would never dream of suggesting that they’re using the tools the wrong way. How important do you feel it is to recognise/appreciate those users in user-facing messaging?

Why isn't the obvious answer to do this change? https://github.com/pypa/packaging.python.org/issues/564#issu...

Because right now you can only upload individual distributions to PyPI. A release is implicitly created when you upload the first one. If PyPI implemented that change, you could only have one distribution per release.

The proper fix would be to make publishing a release a separate operation. But that breaks all existing tooling and workflows.

Make it a configuration option for the package on PyPI; then tooling can migrate slowly, and at some point it can become the default for new packages. If someone then uses old tooling, the downside would be that they might need to do a manual publishing step to actually publish their package, but at least you don't have the problem of publishing before you are ready.

What is the key advantage of PyDist over Artifactory/Bintray? That's the system most companies I know of use, and I was wondering what makes PyDist compelling.

PyDist is specifically for Python packages, so it's 1) a better fit in this niche (for example, it mirrors PyPI so you can install public and private packages through a single index, and you won't be broken by packages deleted from PyPI), and 2) it's much cheaper if you don't need everything else Artifactory offers.

I believe Artifactory proxies and caches PyPI as well via remote repos.

Ah, looks like you can set that up. But it's not as convenient, because 1) you have to provision the remote repository and 2) instead of transparently installing from PyPI, you have to specify the remote repository each time you install from it.

It is exactly as convenient (if not more). You set up a set of 3 repositories: local for your modules, remote to proxy PyPI, and virtual, which unifies them under a single URL (which solves the #2 you mentioned).

So, both points are incorrect.

Honest question, is pydist just a hosted version of devpi (https://github.com/devpi/devpi)? What features distinguish pydist from devpi?

Node with npm must have the same problem. How do they handle it, and why isn't Python doing the same? If it should, of course.

NPM doesn't have the same problem, because they only allow a single file per release (which also means there is no way to distribute platform-specific builds).

There is a way. It seems node-gyp and node-pre-gyp are used for that. Here is a project that has precompiled binaries: http://sharp.pixelplumbing.com/en/stable/install/ .

It looks like they don't include the binary in the package at all, and instead the package has custom install code that downloads it during install. Which works, but it completely bypasses the package manager and would certainly take me by surprise as a user.

That's my point: it works. What's wrong with this approach? I have not investigated either the Python or the Node way (while I use both), and as a user I don't care (maybe I should, I don't know).

Expanding on jep42's answer, there are a few problems:

- The package manager can't tell whether the package will be able to support your system or not, so if this is e.g. an optional dependency your install will probably break instead of the package being skipped.

- This results in an unexpected and hard-to-foresee dependency on some other non-npm server, which could disappear, be blocked by a firewall, etc.

- Standard tools will not understand what this package is doing and their assumptions may be broken, e.g. when trying to make builds repeatable.

Looked into this a little deeper. There are multiple packages in Node that can handle prebuilts, e.g. sharp uses https://github.com/prebuild/prebuild-install , so there are multiple solutions.

Now answering your notes:

- It is not very different from Python from a practical point of view. I understand that from a theoretical point of view it should be different, but it is not. E.g. when I was working on Windows I met situations where a package is supported but there is no binary package. In such situations it falls back to compilation from source, and for some packages you need to perform an "ancient Zulu dance" to get them compiled on Windows. Sometimes I was in the mood to do that, sometimes I was not. From the user's perspective, even if the package manager can tell that a package is supported, it does not help much. In practice, both Python and Node packages compile successfully quite frequently, and in the end the package is almost always supported.

Node supports optional dependencies, and failure to install a package (either prebuilt or built from source) will simply skip it and not fail the whole install.

- In practice, Node packages usually use GitHub and sometimes Amazon S3 for prebuilts (at least based on a quick analysis). There are still at least two systems, but GitHub and S3 both seem good enough.

- I don't see a problem with repeatable builds in Node's case. Could you elaborate more here?

I see one problem with Node's approach, however. In case you want to host your own prebuilts (e.g. on your own server), you end up with one big problem: you need to override each package's prebuilt location separately.

Overall I think Node did a good thing by not trying to do everything, allowing the community to figure out a prebuilt solution. The problem is that there are multiple solutions. Python's approach will most probably work out as the better solution in the long run. In the end, most probably both solutions will be equally good and will look quite similar from the user's perspective.

Jep. It can cause hard-to-trace errors and makes reproducible builds more difficult.


This seems like a reasonable suggestion.

Shouldn't pip then also look at the hashes (if provided) to determine which package to install? If it worked at one time, it should keep working even if there are "better" options available now. Consistency is important here.
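For reference, pip already accepts per-requirement hashes in requirements files and can enforce them with --require-hashes; the digest below is a placeholder, not a real one:

```
# requirements.txt (placeholder digest, not a real hash)
requests==2.21.0 \
    --hash=sha256:0000000000000000000000000000000000000000000000000000000000000000
```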

The article links to https://github.com/pypa/pip/issues/5874 which proposes exactly this.

In fact I'm hoping to get some time this weekend to make a PR implementing this! (I'm assuming the reason you/Donald Stufft haven't done so is time, rather than disagreement over whether this is a good solution or some unseen obstacle to implementing it.)

How does R avoid this then?

I believe that, for R, releases are synonymous with distributions. The code is compiled on the user's machine (I once got a Fortran error message while installing an R package).

If reproducibility matters, you can also use Microsoft R Open to get your packages from a frozen snapshot of CRAN: https://mran.microsoft.com/

R/CRAN provides precompiled distributions for macOS and Windows machines, but builds those distributions on its own infrastructure, so they’re available more or less at the same time as the source distribution. Additionally, R/CRAN does not try to solve the problem of binary releases on Linux distros, and instead defaults to source distributions or defers to the OS package manager.

https://xkcd.com/1987/ (edit: this is for anyone who mentions conda in here)

How so? Conda is strictly simpler than alternatives, by a solid order of magnitude at least.

It’s beyond disappointing that the article does not discuss conda and conda’s build variant capabilities [0].

Maybe 2 years ago there was still room for honest debate among Python package tooling. There just sincerely isn’t anymore.

[0]: https://conda.io/projects/conda-build/en/latest/resources/va...

"Maybe 2 years ago there was still room for honest debate among Python package tooling. There just sincerely isn’t anymore."

That is way out of line.

I'm not going to try to talk you out of conda, but for the benefit of anyone reading your comment who doesn't know better I have to point out that your opinion is far from universal.

It’s sincerely not out of line. You can still use pip to install within conda environments, yet conda packaging just supersedes other approaches. Of particular importance is the unambiguous failure of pipfile/pipenv and similar approaches.

I’ve been working day in and day out with packaging and environments in Python for 12 years, and nothing has come close to being as serious of a general solution as conda. I count my blessings every day I get to use it, remembering the bad old days with Python’s native packaging, wheels & eggs, and then also the pipfile / pipenv mess.

> unambiguous failure of pipfile/pipenv and similar approaches.

How have those approaches failed?

Here is but one discussion out of the numerous discussions refuting pipenv,

- https://chriswarrick.com/blog/2018/07/17/pipenv-promises-a-l...

"I haven’t tried it, but I consider Conda/Anaconda unnecessary for most use-cases. Most stuff that used to be difficult to compile/install should now be available as PyPI wheels for multiple platforms."


That is the author of that blog post, replying to a comment there.

So the author, who hasn’t tried conda, dismisses it. What is your point? How does this relate to the criticisms of pipenv?

So what about the "and similar approaches" part of your comment? The author seems to have only minor issues with Poetry (for example) and I agree with him, it fixes or provides simple workarounds for all the major pain points of Pipenv as far as I can tell.

As someone who aggressively uses conda, I do think one of its downsides is how heavy it is. I agree that it's a good one-stop-shop if you want to get an environment up and running with no issues. But if you're not doing any scientific computing for a given project, it might help to use something more lightweight.

I've had to 'debug' quite a few (ana)conda setups, mostly ones that were used for scientific work, where the whole python setup got so messed up that nothing was quite working anymore, both system and conda pythons. I advise against it.

> the whole python setup got so messed up that nothing was quite working anymore, both system and conda pythons

This seems like the biggest downside I've seen mentioned thus far. Can you (or anyone) offer more detail as to how using conda resulted in corrupting your Python installation? Is it similar to what can happen with game mods, or is there something else going on?

This sounds highly implausible, since conda’s number one mode of operation will fully isolate system dependencies into separate environments. It is extremely difficult to misuse conda in a way that would cause this problem, which makes me believe this comment is just trolling.

Sorry, no trolling here. The times I've tried to fix the setups, I've lost my patience with it and IIRC got things back working by deleting conda and installing packages manually through Homebrew (all Macs). But it wasn't on my machine, so I'm not sure how they got into that state. I think it was due to using a wide range of IDEs/environments that all tried to manage their own packages and load paths, combined with perhaps some manual setup.

Probably conda is fine if you just commit to it, but I think the users who got into these messes weren't knowledgeable enough to foresee the consequences of their actions.

So you could argue that the problem isn't with conda but with the users. However, it seems that for naive users the consequences of using conda aren't clear enough, which can get them into a mess.

Hi! I was the conda dev lead for 2.5 years. We work really hard to make sure conda does the Right Thing in unpredictable, almost hostile, environments. It’s a large surface area to guard against, but in general we’ve been incredibly successful.

I hate to hear that you had a bad experience. If it’s at all possible, please provide something on our issue tracker (github.com/conda/conda) we can replicate and write regression tests against. Help us improve on what we’ve missed.

The situation you describe would be actively difficult for new users to create in conda, based on some of the “idiot proof” defaults conda uses and the wide range of other tools (especially pip and native system packages) that the conda developers have made to “just work” in the easiest, out of the box ways.

I use it with good success managing non-scientific and non-Python packages. I think especially with conda-forge, it is just an all purpose package manager and package distributor.

Using the miniconda installer often helps keep it lightweight too, no need for Anaconda in many use cases.

What makes conda heavy?

Maybe Windows and Mac users use conda? In the Linux world, it's distro-provided Python plus pip & venv.

It's more skewed by scientific computing vs. not.

I use conda for everything now, Linux included.

In linux world, conda is the best choice. Here was an interesting take on it,

