Really cool to see all the hard work on Trusted Publishing and Sigstore pay off here. As a reminder, these tools were never meant to prevent attacks like this, only to make them easier to detect, harder to hide, and easier to recover from.
As a user of PyPI, what’s a best practice to protect against compromised libraries?
I fear that freezing the version number is inadequate because attackers (who, don’t forget, control the dependency) could change the git tag and redeploy a commonly used version with different code.
Is it really viable to use hashes to lock the requirements.txt?
Release files on PyPI are immutable: an attacker can’t overwrite a pre-existing file for a version. So if you pin to an exact version, you are (in principle) protected from downloading a new malicious one.
The main caveat to the above is that files are immutable on PyPI, but releases are not. So an attacker can’t overwrite an existing file (or delete and replace one), but they can always add a more specific distribution to a release if one doesn’t already exist. In practice, this means that a release that doesn’t have an arm64 wheel (for example) could have one uploaded to it.
TL;DR: pinning to a version is suitable for most settings; pinning to the exact set of hashes for that version’s files will prevent new files from being added to that version without you knowing.
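To make the hash-pinning point concrete, here is a minimal sketch of the check an installer performs in hash-checking mode (pip exposes this as `--require-hashes`, with `--hash=sha256:...` entries in requirements.txt). The byte strings and function names below are made up for illustration; only the idea is from the comment above: a file added to a release later cannot match a digest you recorded earlier.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def is_pinned_file(data: bytes, allowed_hashes: set[str]) -> bool:
    """True only if the digest is one we recorded when we locked."""
    return sha256_hex(data) in allowed_hashes


# Hypothetical file contents standing in for real distribution files.
original_wheel = b"contents of the wheel we reviewed"
late_added_wheel = b"contents of a wheel uploaded to the same release later"

# The set of digests recorded at lock time.
allowed = {sha256_hex(original_wheel)}

print(is_pinned_file(original_wheel, allowed))    # the reviewed file is accepted
print(is_pinned_file(late_added_wheel, allowed))  # the late addition is rejected
```

A version pin alone would accept both files, since they belong to the same release; the hash pin is what tells them apart.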
That seems like short-sighted advice. My company probably isn't paying me to write crypto, web frameworks, database drivers, etc. If it's not where I'm adding business value, I would generally try to use a third-party solution, assuming there's no stdlib equivalent. That likely means my code is an overwhelming minority of what gets executed.
If C dominates your codebase or you're squeezing out every inch of performance, then sure, you may well have written everything libc is missing. In Python, or another language that has a thriving ecosystem of third-party packages, it seems wasteful to write it all in-house.
They aren't paying you to integrate a bunch of third-party dependencies either, especially not when you could be using the time to generate actual business value.
The specific examples you listed are usually fine for generic SaaS companies (I'd usually object to a "full" web framework), but advice of the flavor "most code should be your own" is advocating for a transitive dependency list you can actually understand.
Anecdotally, by far the worst bugs I've ever had to triage were all in 3rd-party frameworks or in the mess created by adapting the code the business cares about into the shape a library demands (impedance mismatches). They're also the nastiest to fix since you don't own the code and are faced with a slow update schedule, forking, writing it yourself _anyway_ (and now probably in the impedance-mismatched API you used to talk to the last version instead of what your application actually wants), or adding an extra layer of hacks to insulate yourself from the problem.
That, combined with just how easy it is to write most software a business needs, pushes me to avoid most dependencies. It's really freeing to own enough of the code that when somebody asks for a new feature you can immediately put the right code in the right spot and generate business value instead of fighting with this or that framework.
"They aren't paying you to integrate a bunch of third-party dependencies either, especially not when you could be using the time to generate actual business value."
They might, but in my experience, it's bottom-of-the-barrel clients playing out of their league. For example, a single store that is using Shopify and wants to migrate to its own website because the fees are too high might pay $500-1000 for you to build something with WordPress and WooCommerce, or worse, a MySQL + React website.
You win most of the time, until you get log4jed or left-padded. Then my company survives you.
Also, I might win even without vulns. I don't write frameworks, I just write the service or website directly. And fewer abstractions and less 3rd-party code can mean higher quality.
It surprises me how much companies rely on that kind of project without 1) making a proper assessment and 2) cloning the project to ensure it isn't tampered with in the future.
Not only do they not clone projects or freeze their dependencies, but they are pressured to constantly update to the latest version to avoid vulnerabilities (while introducing the risk of new ones).
Download the libraries' real source repos, apply static analysis tools, audit the source code manually, then build wheels from source instead of using prebuilt stuff from PyPI. Repeat for every update of every library. Publish your audits using crev, so others can benefit from them. Push the Python community to think about Reproducible Builds and Bootstrappable Builds.
This is where tools like poetry and uv, with their lock files, shine. The lock files contain all transitive dependencies (like pip freeze), but they are generated automatically.
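As a rough illustration of what "fully locked" means, the sketch below scans requirements-style text and reports any entry that is not pinned to an exact version with a SHA-256 hash. It is not how poetry or uv work internally; the function name, package names, and hash value are placeholders, and the parsing is deliberately simplistic (no extras, markers, or URLs).

```python
def find_loose_requirements(requirements_text: str) -> list[str]:
    """Return requirement specifiers lacking an exact pin or a sha256 hash."""
    # Join backslash-continued lines so each requirement sits on one line.
    joined = requirements_text.replace("\\\n", " ")
    loose = []
    for raw in joined.splitlines():
        line = raw.split(" #")[0].strip()  # drop trailing comments
        if not line or line.startswith("#") or line.startswith("-"):
            continue  # skip blanks, comment lines, and option lines
        spec = line.split()[0]
        if "==" not in spec or "--hash=sha256:" not in line:
            loose.append(spec)
    return loose


# Hypothetical input: only the first entry is pinned and hashed.
example = """\
# hypothetical lock output
alpha==1.2.3 \\
    --hash=sha256:aaaa
beta>=2.0
gamma==0.1
"""

print(find_loose_requirements(example))
```

A check like this could run in CI so that an unpinned or unhashed entry fails the build instead of silently resolving to whatever is newest.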
Personally I'd move as much logic out of the YAML as possible into either pure shell scripts or scripts in other languages. Then use shellcheck or other appropriate linters for those scripts.
Maybe one day someone will write a proper linter for the shell-wrapped-in-yaml insanity that are these CI systems, but it seems unlikely.
The attacker sent a PR to the ultralytics repository that triggered GitHub CI. This resulted in:
1) the attacker triggering the publication of a new version from the CI itself
2) the attacker obtaining the secret token used to publish to PyPI
Sadly, popular open source projects are vulnerable to this vector. A popular package that is adopted by a large vendor (Red Hat/Microsoft) may see a PR from months or a year ago materialize in their product update pipeline. That is too easy to weaponize so that it doesn't manifest until needed or in a different environment.
We scan PyPI packages regularly for malware to provide a private registry of vetted packages.
The tech is open-sourced: Packj [1]. It uses static+dynamic code/behavioral analysis to scan for indicators of compromise (e.g., spawning of shell, use of SSH keys, network communication, use of decode+eval, etc). It also checks for several metadata attributes to detect impersonating packages (typo squatting).
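For readers unfamiliar with this kind of scanning, here is a toy sketch of the static half: walking a module's syntax tree and flagging a few of the indicators mentioned above (eval/exec, shell and network modules). This is not Packj's implementation; the indicator lists and function name are assumptions chosen for illustration, and a real scanner would need far more than this to resist obfuscation.

```python
import ast

# Illustrative indicator lists, not taken from any real tool.
SUSPICIOUS_CALLS = {"eval", "exec", "compile"}
SUSPICIOUS_MODULES = {"subprocess", "socket"}


def find_indicators(source: str) -> list[str]:
    """Parse (never execute) the source and list suspicious calls and imports."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in SUSPICIOUS_CALLS:
                findings.append(f"call:{node.func.id}")
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in SUSPICIOUS_MODULES:
                    findings.append(f"import:{alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in SUSPICIOUS_MODULES:
                findings.append(f"import:{node.module}")
    return sorted(findings)


# Hypothetical snippet resembling a decode+exec payload loader.
sample = "import base64, subprocess\nexec(base64.b64decode(payload))\n"

print(find_indicators(sample))
```

Static checks like this are easy to evade (for example by looking up `exec` through `getattr`), which is one reason to pair them with dynamic analysis.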
If the tech is open-sourced, then an attacker can keep trying in private until they find an exploit, and then use it.
Also, you only know if your security measures work if you test them. I'd feel much safer if there was regular pen-testing by security researchers. We're talking about potential threats from nation state actors here.
I'm just pointing out a huge downside of the approach and that more measures such as pen testing are really needed. I don't want to be right, I want a secure PyPI <3
I appreciate PyPI's transparency and the proactive measures to mitigate future risks. Are there plans to further educate developers on secure workflow practices to prevent similar incidents? This seems like a vital area for community collaboration and awareness.