ABI compatibility in Python: How hard could it be? (trailofbits.com)
96 points by lukastyrychtr on Nov 19, 2022 | 54 comments


My take on this, as a Python specialist who maintains a C++ library that exposes Python bindings (but not an expert in packaging), is that the Python packaging ecosystem, as it has organically grown, is a mess. I would like an authoritative (i.e., in the standard library) way to accomplish the major packaging tasks. I shouldn’t have to track down the current canonical way to do things from two or three sources.


https://www.pypa.io/ ?

"The PyPA publishes the Python Packaging User Guide [1], which is the authoritative resource on how to package, publish, and install Python projects using current tools."

[1]: https://packaging.python.org/


Last I checked, and I could be missing something, PyPA doesn’t say how to package extensions using the new pyproject.toml style. I happen to be using SWIG (which I inherited), which suggests the use of setup.py alone. I just haven’t had time to figure out the right path.


I genuinely don't understand what's wrong with setup.py and why people wouldn't just use setuptools, as it is even suggested by PyPA.

IMHO Python should improve what it already has instead of pushing the usage of third party tools for a core component such as packaging.

For all my Python projects I just use setuptools and the built-in venv module for isolation.
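As an aside, the venv module can also be driven from Python rather than the command line; a small sketch using a throwaway directory:

```python
# Create an isolated environment with the stdlib venv module, equivalent
# to `python -m venv <dir>`; with_pip=True would also bootstrap pip.
import os
import tempfile
import venv

env_dir = os.path.join(tempfile.mkdtemp(), ".venv")
venv.create(env_dir, with_pip=False)

# Every venv records the interpreter it was created from in pyvenv.cfg.
print(os.path.exists(os.path.join(env_dir, "pyvenv.cfg")))
```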

What I feel is really missing is an official way to have multiple Python versions installed.


> What I feel is really missing is an official way to have multiple Python versions installed.

Install them into different directories. There's nothing about python that makes this at all hard or even tricky.


Isn't installing different versions an OS concern, not Python's concern? I bet Debian and NixOS have very different takes on that, to say nothing of macOS and Windows.

I also suppose that pyenv and poetry exist and solve that to a reasonable extent.


One could say that providing Python modules is an OS concern as well.

Debian IMHO has the worst take on that (see the venv module).

Pyenv definitely solves it, but I was talking about an officially supported solution.


pyenv will solve your multiple python on one machine problem :)


You can migrate a lot out of your setup.py into the pyproject.toml nowadays (https://setuptools.pypa.io/en/latest/userguide/pyproject_con...).


I dunno about using it with SWIG but I've had a pretty good experience using Poetry [1] for pure Python stuff myself. That uses pyproject.toml to declare all your project's dependencies, but no idea if it can be used with SWIG or not. I'd have to research it…

[1]: https://python-poetry.org/

Edit: Some quick cursory research seems to indicate I'd have to do some custom scripting to integrate SWIG into the process.


Poetry is currently not useful for extensions: https://github.com/python-poetry/poetry/issues/2740


Wouldn't you make a wheel?


Indeed… That does seem to be the best way to go about this particular task.


My comment on that guide is that it expects everyone to run the latest version of Python. The last time I made a package following it, it turned out to be unusable in practice. I had to go back to setup.py, hunt down old guides, and copy-paste parts from known-good packages to get something working for everyone involved. I'm not even talking about Python 2 (which can still be found in the wild); something like Python 3.5 is still very common out there.


It always struck me as a bit ironic that such an opinionated language could utterly fail to have "one right way" to package things. As much as Rust's aversion to binaries can be a pain, building and packaging (even cross building) is generally painless.


That's because Rust is tiny and new.

Wait until you have Rust packages which include and wrap hundreds of thousands of lines of very old C and Fortran code, like numpy has.


Nah. You'd probably be surprised how many crates have dependencies here and there on C code – mostly it's seamless, even when cross-compiling. Ages ago I got a cross toolchain set up to build binaries for one of those MIPS routers, and the biggest pain point was being stuck on an antiquated Debian where cross-compiling was painful. There are popularity issues with Rust (e.g. crate namespacing), but non-Rust code hasn't been one of them for me.

So far Conda is pretty much the only problem I've had with native code mingling with Python (although pip was able to get numpy installed on my BSD box). Otherwise it's an issue of the tooling just being really archaic, which is amusing to me given how much hand-wringing there is over providing one right, idiomatic way to do things (e.g. the ternary operator). For me, digging into Stable Diffusion (and related bits and bobs) was a bit of an eye-opener as to just how much cargo-culting is going on. Installation instructions are often copy-pasted from other related projects, or amount to "just check out our repo and follow the instructions from some other project."

Honestly, my hope is that some of the effort spent obsessing about language details gets spent on coming up with a better packaging and dependency management story.


> That's because Rust is tiny and new.

Dozens of other languages that are tiny and new do not have a good packaging story. Rust does, because it cares about developer experience.

> Wait until you have Rust packages which include and wrap hundreds of thousands of lines of very old C and Fortran code like numpy has.

I don't know about Fortran, but C is basically legacy at this point: it doesn't change much, and even when it does, there's a some-c-lib wrapper package that takes care of all those issues. Rust apps rarely depend on a C dependency directly.


  Rust apps rarely depend on some C dependency directly.
Pretty much any Rust app that does encryption will be relying on a mix of C and assembly. It's one of the stumbling blocks in getting anything that does TLS or SSH ported to WASM or little endian MIPS.

The thing is, Rust makes this pretty much invisible to most users because it all works with so little teeth-gnashing.


We solved this by porting ring to rust using c2rust, and the resulting code seems to work well even on our bespoke architecture. It's possible to use the generated code and cross compile from any major platform to riscv32imac-unknown-xous-elf even without a C compiler, and it seems to be constant time still.

https://github.com/betrusted-io/ring-xous


Interesting. Is the rustified stuff still constant time?


Surprisingly, seems like it's reasonably good! There's a section in this blog post about characterizing it and comparing it to a hardware implementation: https://www.bunniestudios.com/blog/?p=6521

Since it's the same algorithm, most of the tricks appear to make it through the language conversion, so it's reasonably good even on a completely different architecture.


> Pretty much any Rust app that does encryption will be relying on a mix of C and assembly

Perhaps, but not directly. It will depend on some libssl-rs which depend on libssl-sys which handles C bindings, so that application authors don't think about it.


Ring still has a bunch of handcrafted stuff, which is what I was referring to. OpenSSL is a whole other bag of worms.


Rust will eventually have to sort out the binaries story, if the community cares about adoption in C and C++ domains, where shipping binaries is the only distribution model the industry accepts.


You've misquoted. It's "There should be one-- and preferably only one --obvious way to do it." Preferably only one obvious way, not one right way.


Agree. I had a build pipeline pulling down an internal package to build into a `.pex` [0] file for running a Python program in a release pipeline that did not have access to the internal package repository. I came up with this solution because the typical virtualenv that you might build locally does not appreciate being copied wholesale to another system (iirc it had some absolute paths).

One day the release pipeline broke because a dependency's authors published a new version containing a wheel for a newer ABI, though the older ABI was still supported. My build pipeline pulled the one for the later ABI because it was compatible, but the release pipeline environment, which could not be upgraded, did not have the newer version of gcc required. It was a nasty introduction to the fact that one Python package version can have multiple published wheels targeting different C++ compilers!

But the worst part was not that I learned something new! It was that I wrangled with both pex and pip for a few hours trying to figure out how to download a wheel for an older ABI, as both tools resolve the latest compatible wheel by default. The options to do so are ostensibly there, but they didn't have the intended effect. I definitely could have made some mistakes, but in my mind it shouldn't be _hard_ to get it right.
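For the record, the knobs do exist in pip (`--abi`, `--platform`, `--python-version`, `--implementation`, which in turn require `--only-binary` and/or `--no-deps`), even if getting them to do the right thing is another matter. A sketch of such a command, with a hypothetical package and tags, built as an argv list rather than executed:

```python
# Sketch of asking pip for a wheel built against an older ABI. The package
# name, version, and tags below are hypothetical; the flags are real pip
# options, and --platform/--abi require --only-binary (and usually --no-deps).
import sys

cmd = [
    sys.executable, "-m", "pip", "download",
    "somepkg==1.2.3",                     # hypothetical pinned package
    "--no-deps",
    "--only-binary", ":all:",
    "--abi", "cp38",                      # request the older ABI explicitly
    "--platform", "manylinux2014_x86_64",
    "--python-version", "3.8",
    "--dest", "wheels/",
]
print(" ".join(cmd))                      # e.g. run via subprocess.run(cmd, check=True)
```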

I ended up downloading the wheel for the internal package in the build and copying it directly into the artifact. The release pipeline resolved public dependencies and installed the internal package from the local file.

[0] - https://pex.readthedocs.io/en/latest/


The release pipeline broke because you didn't lock your dependencies.


Not sure what you mean. pex is the dependency lock, at least from a version perspective. To be clear, it broke after I intentionally upgraded the offending package to the later version.

I admit to being unaware of "versions inside versions" where a version may have multiple published ABIs that are not compatible across systems, but a nice packaging system would still make it easy for me to use platform X to build for platform Y.


Oh, from your "One day the release pipeline broke because a dependency's authors published a new version..." wording it seemed that this breakage occurred without you upgrading anything.


Sorry for the wording, I see how it can be interpreted that way. Seems I skipped a few details by accident.


> It was a nasty introduction to the fact that one Python package version can have multiple published wheels targeting different C++ compilers!

Erm, is this not the failure of C++ standards instead of Python?

It seems like Python is the failure point only because C++ doesn't have a stable ABI. Python is simply trying to paper over the brokenness of the C++ ecosystem.

Am I missing something?


  Python is simply trying to paper over the brokenness of the C++ ecosystem.
I'd be pretty comfortable saying that this is indeed a Python problem. C++ ABI interop has been an issue for decades. Instead of papering over it, Python should be exposing a host triplet (or whatever) so that end users can more easily identify what is and isn't compatible.
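To be fair, the interpreter does expose pieces of this through the stdlib sysconfig module; they're just not surfaced well by the packaging tools. For instance:

```python
# What the current interpreter reports about its own platform and ABI.
# Exact values vary by build; SOABI can be None on some platforms (e.g. Windows).
import sysconfig

print(sysconfig.get_platform())                # e.g. "linux-x86_64"
print(sysconfig.get_config_var("SOABI"))       # e.g. "cpython-311-x86_64-linux-gnu"
print(sysconfig.get_config_var("EXT_SUFFIX"))  # filename suffix for compiled extensions
```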

Pretty much every other ecosystem (except perhaps Javascript?) does this. Look at FreeBSD packages, the ABI is used as part of the identifier. GNU Autoconf, rvm, homebrew, rust, etc., etc. all use and expose an ABI identifier so that you aren't accidentally going to mix and match things.


Python packaging doesn't have good ways to depend on specific versions of <insert-other-language-here> packages. This isn't specific to C/C++ and is mostly historical: it's designed for an ecosystem of packages where everything is written in Python! But nowadays, more and more Python libraries are bindings to code written in many other languages, so you need a package management approach that includes this.


There's no way to depend on specific versions of C or C++. The actual binaries vary depending on architecture, compilation toolchain settings, compilation flags, link options, and all sorts of other things. There's really no way to address those built artifacts even if they were available on pypi or something.
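They are partially addressable, though: a wheel filename already encodes interpreter, ABI, and platform tags. A simplified parser (the real rules in the wheel spec also handle optional build tags and name escaping) shows what a resolver sees:

```python
# Pull the compatibility tags out of a wheel filename. Simplified sketch:
# the full format is name-version(-build)?-pytag-abitag-plattag.whl.
def parse_wheel_name(filename):
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    return {
        "name": parts[0],
        "version": parts[1],
        "python": parts[-3],
        "abi": parts[-2],
        "platform": parts[-1],
    }

info = parse_wheel_name("numpy-1.23.4-cp311-cp311-manylinux_2_17_x86_64.whl")
print(info["abi"], info["platform"])
```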


There are a couple of different ways.

1. Build the artifact that you depend on, publish it on "pypi or something" referenced by both the version of the library and a hash that uniquely identifies the build. This is the approach taken by conda, for better or worse.

2. Only allow one canonical build of a library you depend on so that the "version" becomes a unique identifier for the build. This is the approach taken by distribution package managers.

3. Create metapackages that describe ABI constraints that are required by packages and must be satisfied by the underlying system. For example, a "cxxabi" package could be provided by the underlying system, and the packaging tools could automatically add dependencies on "cxxabi" at build time, based on either an exact pin to the library built against, or in some cases, a relaxed dependency by inspection of versioned symbols used by the binary.

4. Statically link all your dependencies and/or vendor all your dependencies. These are used by quite a few pypi packages that depend on standalone C libraries to avoid most of the issues altogether.


Of course, all of these either have flaws [1] or are so detailed that they're distributed build caches with more steps. You can hash project source files, all build commands, all textually included headers, precise versions of toolchains, etc., into a Merkle tree, but this is not generally how python applications pin "versions" of dependencies.

[1] For instance, you cannot version a C or C++ library build independent from the versions of all of the transitive dependencies of that library (more or less). Of the options listed here, only distribution package managers can really account for this problem, and not every distribution package manager cares to.


My thoughts are that viewing binary package distribution/archival as really just a distributed build cache is the only real way to go, and that's exactly why pypi and associated tooling has so many flaws.


Well, there is: It's Nix's buildPythonApplication.

Trying to use Python in Nix is frequently a bother, but at least once it works, it works.


That’s a good point! However, there’s still things Python could do to help make the experience better.

The package I was using had multiple artifacts per version, including the raw zipped source code and wheels for each ABI they built for. It’s certainly convenient that the most applicable one for your current system is pulled automatically by pip. But in this case that’s not the behavior I wanted, and I could not successfully get pex or pip to download a “less optimal” artifact.

You can blame C++, but if Python is already papering over deficiencies, it's not unreasonable to expect improvements that aren't fundamentally impossible. It should definitely be possible to have one easily understood argument specifying which artifact pip should download. In my (potentially flawed) experience, figuring out which ABI value to pass was already not straightforward, and even then, the arguments for selecting an ABI, or for asking for the raw source instead of a wheel, didn't seem to have any effect.


I maintain a small open source Python module written in C. I can't use the limited API (I use stuff outside it), so I have to maintain a build matrix. If you want to see what a decent-sized Python build matrix looks like, take a gander:

https://github.com/jnwatson/py-lmdb/actions/runs/3457851983

It gets out of hand pretty fast.
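(For modules that can stay inside the limited API, which isn't the parent's situation, setuptools can tag a single abi3 wheel covering many CPython versions. A hedged sketch with a hypothetical module name:)

```python
# Build configuration for a stable-ABI (abi3) extension: one wheel can then
# cover CPython 3.8+ instead of a per-version build matrix. The C source
# must itself stick to the limited API for this to be sound.
from setuptools import Extension

ext = Extension(
    "mymod",                                           # hypothetical module name
    sources=["mymod.c"],
    define_macros=[("Py_LIMITED_API", "0x03080000")],  # opt in, targeting 3.8+
    py_limited_api=True,                               # tag the built wheel as abi3
)
```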


And you didn't even include crazy people using Alpine Linux with musl.


Have you considered WASM?


Python on wasm is still very much a work-in-progress. But even if it weren't, wasm wheels are only compatible with wasm Python, which is not what most users of your library out there are going to have.


PRs accepted.


Related: Nimpy[0] provides an easy way to write Python extensions in Nim, which manages the ABI side very well.

Python 2 is now gone, but until it was, Nimpy was an easy way to write Python extension modules that only needed to be compiled once, and would work with any of your installed Python 2 and Python 3. Magic.

[0] https://github.com/yglukhov/nimpy


The ABI discussion is one thing, but inside it there is another question - how to freeze the dependencies. The more important question is transparency. It is a bit like Docker, where you build your final image on top of a certain base image (binary C is the problem mentioned most). Can we have a way to document it all, or even optionally download it all for the production software, and then freeze it?

And a testing document, in a .sw_env_conf, to keep the changes vs. the base.

If it is a re-runnable and readable doc, we can trace it.

Sorry, this doesn't solve the ABI problem, but it tries to deal with the mess. Python is easy to learn and program in, but running it is really hard. I have to document at least 7 Python versions, and controlling which one to use is itself hard.


After many years of Python ABI pain we moved to python-cffi. Life became way easier.


One reading is that, despite declared mismatches between versions, there were no actual instances of incompatibility found?


Using symbols from a newer Python version while declaring an older one is an incompatibility, if run on an older Python install.


I'm so glad I don't have to worry about these things as a PyO3 user.


I don't like it. Once people have ABI guarantees, they start thinking they can skip shipping the source.


No worries, some people already do!

I've seen plenty of packages with no source release.


I wish PyPI required reproducible builds, including public source code.

https://reproducible-builds.org/



