Hacker News

The only thing that surprises me about such attacks is that they don't happen more often. With Python and Node.js it's now the norm that large packages have hundreds of transitive dependencies. In this case, what happened was dependency confusion because of PyPI taking precedence, but even with such holes plugged the problem remains that by installing a single package, you're potentially trusting hundreds of authors. And yet we do want all these packages, because they solve specific problems in an optimized manner, and we do want anyone to be able to publish packages. I doubt there is a good solution here, ultimately.


There is not a conflict here. PyTorch depends on many packages and none of them were compromised. Instead, a limitation with the way pip installs dependencies allowed someone to create a package on PyPI with the same name and version as the one on PyTorch's nightly package index, which took precedence over the real package. This could be fixed by supporting better ways to specify Python dependencies in pip, without any dampening effect on the package ecosystem.


There are a few factors that allowed this dependency confusion attack to happen:

1. PyTorch publishes its nightly package on its own repo, and that package depends on a custom Triton build also hosted there, but under a package name (torchtriton) that PyTorch didn't own on PyPI. This has since been mitigated by renaming the dependency from torchtriton to pytorch-triton, reserving the pytorch-triton name on PyPI, and pointing nightly builds from 20221231 onward at pytorch-triton instead.

2. The PyTorch installation instructions for nightly builds via pip (https://pytorch.org/get-started/locally/#start-locally) use the --extra-index-url option. This is a known vector for dependency confusion attacks and an inherently insecure way to install packages from private repositories. The recommended approach for distributing wheels from private repositories is to use a repository server that proxies/redirects public packages through to PyPI, with users pointing a single --index-url at that private repository (assuming its maintainer is trusted). --extra-index-url is meant for mirror URLs (serving the same set of packages as the main index), not for combining repositories with different sets of packages.
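The confusion in point 2 can be sketched as a toy resolver. This is an illustrative model, not pip's actual code: the versions and index contents are invented, and in the classic variant of the attack the attacker simply publishes a higher version number under the stolen name.

```python
# Toy model of a resolver that treats every index as equally authoritative,
# as pip does with --extra-index-url. Versions/index contents are invented.

def pick_candidate(package, indexes):
    """Collect every (version, origin) offered for `package` across all
    indexes and keep the highest version; the index it came from is never
    considered, which is the root of dependency confusion."""
    candidates = [
        (version, origin)
        for origin, listing in indexes.items()
        for version in listing.get(package, [])
    ]
    return max(candidates)

indexes = {
    "download.pytorch.org": {"torchtriton": [(2, 0, 0)]},
    "pypi.org":             {"torchtriton": [(99, 0, 0)]},  # attacker upload
}

version, origin = pick_candidate("torchtriton", indexes)
print(origin)  # pypi.org -- the attacker's higher version number wins
```

With a single trusted --index-url there is only one place to look, so there is nothing for a same-named public package to win.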


> The recommended approach for distributing wheels from private repositories is to use a repository server that proxies/redirects public packages through to PyPI, with users pointing a single --index-url at that private repository (assuming its maintainer is trusted).

Alternatively, keep a separate requirements_private.txt around for private dependencies and add a line --index-url <my private repository>.


I think blame is shared between pip being ultra vulnerable to foot guns like this and whoever put together the PyTorch nightly install not seeing the dep confusion issue from a mile away.


> And yet we do want all these packages, because they solve specific problems in an optimized manner, and we do want anyone to be able to publish packages.

Do we want all those packages, or do we want their functionality? Dependency hell happens because of deficient first-party support. Notably, a language that lacks a sufficient standard library and a blessed toolchain should be considered a pre-existing condition for this disease.

I get that language maintainers have legitimate reasons to exclude something like HTTP from the standard library, but there has to be some middle ground here. For instance, Golang provides experimental packages that are high quality but come with a lower level of support. To me, this is a win. It centers the community in a common direction and delivers real-world value in the meantime, with minimal maintenance upkeep compared to the hundreds of packages we see in ecosystems like JS and, to a lesser extent, Rust.


Python has the best or second-best standard library in existence, Go being its only competitor for that title.

It still contains only a tiny fraction of the functionality needed by any large project. The world is just too complex for a standard library to ever realistically cover a meaningful portion of the problem space.

We all need tensor algebra, video decoding, cutting-edge network protocols, compression, cryptography, dozens of data exchange formats, syscall bindings for three or more platforms, fuzzing, containers, and I don't know what else. This isn't going to all fit into a standard library. "Dependency hell" is here to stay.


> This isn't going to all fit into a standard library.

There is a size limit for standard libraries?

I think Java introduced modules and its dependency model so you could package only what your software needed.


> There is a size limit for standard libraries?

There is a size limit to what language curators are able to maintain, yes.

The idea of a standard library is that it selects components general enough to cover a wide range of use cases; therefore the language builders need time to review every new proposal, and their time is limited.

You could dump everything under the sun into the standard library, but then it would be no better than what you get from a library marketplace.


Java?


Let’s start the year with a programming language debate!

In my opinion, Java's standard library is good, but not as good as Python's, followed by Golang's. I also prefer the one in C#.


Log4j?


Never thought I would shill for Java, but that vulnerability could have existed in any language. That it was such a widespread issue only goes to show how widely deployed Java is for server software.


No, it wouldn't. Java is the only ecosystem where dynamically loading code from random web servers by default is considered a feature. Probably a legacy of its early days, when everybody was dreaming of exactly that.


Java is unique in that the culture encourages magic convention-over-configuration via reflection AND it-just-works serialization. That combination leads to a new bug class: unsafe deserialization, except the deserialization happens inside the library by default (e.g. fastjson, Jackson, tons of Struts2 bugs) and propagates along the supply chain.

Libraries refuse to shift away from this culture, and instead just blacklist the known exploit paths while keeping the vulnerability intact. It's not even done for backward compatibility: new libraries are still being designed this way.

The log4j 2 bug is slightly different, but IMO as a design issue it has its roots in the same culture described above.


> Libraries refuse to shift away from this culture, and instead just blacklist the known exploit paths while keeping the vulnerability intact.

Log4j 2 refused to shift away from it because they are paid to add that kind of bloat. This wasn't an accident caused by a bad culture; it was a feature someone requested. As far as I understand, it is completely separate from the original log4j and is just another, mostly compatible, logging implementation.

> Java is unique in that the culture encourages magic convention-over-configuration via reflection AND it-just-works serialization

As opposed to what? Python? The language that culturally refuses to fix the GIL and as a workaround provides multiprocessing, which requires that everything be serialized?

Also, I may not have seen many Python code bases, but from what I have seen, ClassLoader/import abuse is alive in both languages.


I agree that the GIL situation is unfortunate.

Python at least does not pretend its it-just-works serialization is secure. The documentation [1] actively discourages deserializing untrusted data and suggests using a pure data format (e.g. JSON) where possible.

In contrast, the documentation for ObjectInputStream before Java 11 did not warn about this at all. Even then [2], it suggests implementing blacklist/whitelist filters, which are pretty hard to get right. The same filtering can be done for pickle, but the consensus among Python developers seems to be "don't do this even if it's possible".
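For reference, the allow-list filtering for pickle looks roughly like this; it is a minimal sketch adapted from the "Restricting Globals" pattern in the pickle documentation, and, per the consensus above, is not a substitute for simply refusing untrusted pickles:

```python
import builtins
import io
import pickle

SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    """Only resolve a small allow-list of harmless builtins while
    unpickling; anything else (os.system, subprocess, ...) is rejected."""
    def find_class(self, module, name):
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# The classic protocol-0 payload that calls os.system on load:
malicious = b"cos\nsystem\n(S'echo pwned'\ntR."
try:
    restricted_loads(malicious)
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```

The hard part, as with Java's filters, is that the allow-list must stay airtight against every gadget reachable from it.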

> but from what I have seen ClassLoader/Import abuse is alive in both languages.

Yeah, I should have said that my rant is mainly against the endless deserialization/OGNL-injection/whatever-popular-expression-language-injection bugs in frameworks, not ClassLoader abuse. These features, just like log4j2's code-execution-disguised-as-string-interpolation feature [3], shouldn't exist.

[1] https://docs.python.org/3/library/pickle.html

[2] https://docs.oracle.com/en/java/javase/11/docs/api/java.base...

[3] I'd argue that's the real bug, instead of the obvious JNDI class loader blah blah stuff. Luckily log4j didn't refuse to fix it and completely removed "message lookup".


Python developers aren't smart enough to develop threaded code. Making threading work in python would be like giving crack to an infant. Python maintainers made the right choice to coddle their user base, you see if you have to be smart to use python then you wouldn't be using python to begin with and user base goes to zero. I kid I kid. Happy New Years - a python developer


That's a third-party dependency, not a part of the standard library.


Log4j was never part of Java


I think that Debian and most other Linux distros handle this extremely well.

There are thousands, if not tens of thousands, of packages available to install via apt or yum. But most of those packages are packaged by a dedicated maintainer, not any rando. The bar isn't much higher, but there _is_ a bar. Python's (well, PyPI's) practice of letting anyone publish with no prior vetting is the root of the problem, in my opinion.


Staging packages in experimental and then testing, sometimes for months, before they go to stable also does a lot against this kind of threat.

Which leads me to wonder: do we really always want the very latest version of the packages? Any slightly older version will be immune to dependency poisoning, thanks to the scrutiny of users over several weeks.


> Which leads me to wonder: do we really always want the very latest version of the packages?

Not really, but then someone needs to decide which version to use and when/how it gets updated, for every single dependency. I agree that keeping track of your dependencies and actually managing them may be better engineering, but it's a lot of work that not many people want to do.


We're solving this problem at https://socket.dev starting with npm, with python coming in the next month or two. Here's an example of a date picker web component that runs an install script, collects telemetry, accesses the network and filesystem, and more -- all detected with our static analysis engine. https://socket.dev/npm/package/angular-calendar

We show alerts in GitHub pull requests, or the CLI, if you add a dependency with a supply chain risk.


Developing in container environments with limited access might help, but I think there's a performance hit for heavy processing/ML training unless you use privileged mode which kinda defeats the purpose.


I do all dev work in a VM, usually one dedicated to a group of related projects. There are no passwords stored in the VM, and its SSH key has read-only access to repositories. The VM has write access to forks, and code is merged using pull requests just like third-party contributions. This setup has been working well for me, and I even created a tool to automate setting up these dev VMs which I hope to make publicly available at some point.

I don't do any heavy computations except for running test suites of various kinds, and these seem to perform the same on raw hardware as they do in the Kernel-based VMs.


Jails / sandboxes such as bwrap could be enough in this case by denying access to e.g. HOME files not explicitly whitelisted.

Also a Little Snitch-like host-based firewall, which would request explicit permission to connect.


I'd love to see a container environment that can monitor and log all outgoing network connection requests, and monitor and log all critical file/directory access such as /etc/*.

With such a container, we could catch this kind of supply-chain attack easily, right?

Does anyone know if such a container exists?


Only using privileged containers, or else you don’t have visibility into signal from other containers.

But, say you had such a container, there’s an important distinction between “you captured a log showing the smoking gun evidence of the supply chain attack”, and “you successfully picked that log out of all of the log data you generated and classified it with high confidence as an attack”.

Speaking from experience, the second problem is the hard problem for a multitude of reasons. So while you would have the data, you’d probably have trouble getting good precision/recall on when to actually sound the alarms vs. when it’s some SRE who needed to troubleshoot some network connectivity issues.


> Only using privileged containers, or else you don’t have visibility into signal from other containers.

The suspect application doesn't need the privileges, so I'm not sure how much of a problem that is?

> there’s an important distinction between “you captured a log showing the smoking gun evidence of the supply chain attack”, and “you successfully picked that log out of all of the log data you generated and classified it with high confidence as an attack”.

Assuming that you're talking about the signal:noise problem, that's hard in the general case but I feel like you could easily pick off really obvious cases like trying to access private SSH/GPG keys and still get a lot of value.
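A toy sketch of that "obvious cases" filter: given a log of paths an install step touched, flag accesses to well-known secret locations. The log format and paths here are invented; a real monitor would consume audit/eBPF events rather than a Python list.

```python
# Flag accesses to obviously sensitive files from a (hypothetical) log of
# paths touched during a package installation.
import fnmatch

SENSITIVE_PATTERNS = [
    "*/.ssh/id_*",          # private SSH keys
    "*/.gnupg/*",           # GPG keyring
    "*/.aws/credentials",   # cloud credentials
    "/etc/shadow",          # password hashes
]

def suspicious_accesses(accessed_paths):
    return [p for p in accessed_paths
            if any(fnmatch.fnmatch(p, pat) for pat in SENSITIVE_PATTERNS)]

log = [
    "/tmp/build/wheel-unpack/setup.py",
    "/home/ci/.ssh/id_ed25519",
    "/usr/lib/python3.11/site-packages/setuptools/__init__.py",
]
print(suspicious_accesses(log))  # ['/home/ci/.ssh/id_ed25519']
```

There is no legitimate reason for a package installation to read a private SSH key, so even this crude rule has a favorable signal-to-noise ratio.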


> Assuming that you're talking about the signal:noise problem, that's hard in the general case but I feel like you could easily pick off really obvious cases like trying to access private SSH/GPG keys and still get a lot of value.

Probably. I’d agree that it’s worth trying at the very least. I’ve run into enough “should be easy” cases that turn out to be not that easy that my default is to get the data and see if the hypothesis really pans out.


I’ve created Packj sandbox [1] for “safe installation” of PyPI/NPM/Rubygems packages

1. https://github.com/ossillate-inc/packj

It DOES NOT require a VM/container; it uses strace. It shows you a preview of the file system changes the installation will make, and it can also block arbitrary network communication during installation (via an allow-list).


strace uses ptrace, which is not safe for security use because of race conditions. Linux Security Modules should be used.

https://stackoverflow.com/a/4421762/711380


Thanks for highlighting this! While ptrace introduces TOCTTOU vulnerabilities, Packj's sandbox fixes that by using read-only args for ptrace. You may find my PhD work [1] on this relevant.

1. https://lwn.net/Articles/803890/


If your CI/CD pipeline uses GitHub Actions, you can monitor and even block outbound network calls at the DNS and network level using Harden Runner (https://github.com/step-security/harden-runner). It can also detect overwrites of files in the working directory. Harden Runner would have caught this dependency confusion attack, and similar ones, due to the call to the attacker's endpoint.


This is the way things should be done by any competent developer.

Generally a VM jail is preferable - firecracker or cloud-hypervisor(virtiofs & gpu passthrough) recommended.

A proper namespace jail (eg. bwrap) is sufficient for 99.9% of cases. To break out of a properly configured namespace jail you would need to sacrifice a 0day.


It looks like a totally overengineered solution to me. The CPU already has a protected mode that doesn't allow a program to access any files directly. Why do you need to run a VM, which, by the way, is run from the kernel (privileged mode) in Linux? Why can't you run untrusted programs in protected mode?


With containers the entire surface area of the kernel is available to attack (syscalls). With a VM the surface is restricted to the VMM and KVM.

This is an oversimplification; there may be other protocols that are passed through or utilized, and they would add to the surface.


Also, the container itself usually contains (or has access to) valuable secrets, such as keys for staging servers and, of course, the source code.


Maven / Java seem to have solved it well.


I was developing in Java right up until Maven became popular. We used to just download jars. What would you say is the main difference with Maven/Java vs NPM?

My recollection is that Java libraries were larger, higher quality, more stable, and better maintained, and you didn't need as many of them. A Java jar was not a "package" but contained dozens of "packages" developed together. Jars tended to be self-contained or mostly self-contained; small dependencies would be shipped inside. The idea of making npm packages as small as possible (practically putting each file in a separate git repo and publishing it as a separate artifact) emerged shortly after npm itself, and it was radical, and not really particularly good. Java also has a much larger standard library, and between the packages that come with Java itself, the packages that aren't technically part of the standard library but were written by Sun/Oracle, and well-known third-party utilities, you didn't need a lot of third-party packages. And if you needed something tiny like left-pad and didn't have it, you'd probably just copy and paste it.


> What would you say is the main difference with Maven/Java vs NPM?

Maven doesn't allow execution of arbitrary code at install-time, which curbs a large number of potential supply-chain attacks.

Because of the JVM and JARs being mostly self-contained Maven doesn't really need to worry about system or runtime dependencies (unless you're using Scala...). This allows Maven to be a 'dumb' package manager that relies on simple semantics (no hidden specially-generated indices, for example) and be fairly successful. Of course, there's an internal battle of whether Gradle or Maven is superior, but they both rely on the same distribution and packaging specifications.
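By contrast, pip installs of source distributions execute the package's setup.py as ordinary Python. Here is a self-contained simulation of why that matters; the generated file's contents are invented, and real pip invokes setup.py through its build machinery rather than directly like this:

```python
# Simulate the risky step: pip ultimately runs an sdist's setup.py, so any
# top-level statement in it executes with the installing user's privileges.
import os
import subprocess
import sys
import tempfile

SETUP_PY = """\
# Top-level code runs the moment the installer invokes this file.
import os
print("install-time code running in pid", os.getpid())
# A malicious package would exfiltrate ~/.ssh, env vars, tokens, etc. here.
"""

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "setup.py")
    with open(path, "w") as f:
        f.write(SETUP_PY)
    # Stand-in for pip's "run setup.py to get metadata / build" step:
    out = subprocess.run([sys.executable, path],
                         capture_output=True, text=True)

print(out.stdout.strip())  # prints the message from the generated setup.py
```

Wheels avoid this particular hole (no code runs at install time), but the imported package still runs with your privileges the first time you use it.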


Maven doesn't have this problem because maven central is too obtuse for hackers to use, and Enterprise Java developers don't ever update their dependencies. It's actually to their benefit, but it's for the wrong reasons.


> Maven doesn't have this problem because maven central is too obtuse for hackers to use

I have many gripes with Sonatype, but Maven Central isn't really one of them. The fact that you can publish a package to the likes of PyPI, the npm registry, or Docker Hub with zero friction makes those places very attractive to spammers and bad actors. Maven Central having a higher barrier to entry is a feature.

IIRC Brian Fox, the CTO of Sonatype, was actively involved with Maven in the early days and was part of the decision for Maven packages to use domains for namespaces. Namespaces are another valuable feature of Maven that makes supply-chain attacks like typo-squatting harder to pull off.


The real reason was the second one. That was just a cheap dig at their UX. Both were cheap digs, but also both true.


Lol, I knew you were mostly joking — but you also weren't wrong.

At the same time, some people genuinely shit on Maven Central and think that it's inferior to other registries.


There's a real problem with Maven Central, and Java in general, in that there's no verified correlation between the package name (which is nicely domain-name formatted) and actual domain names. If there were a clear "this really is that domain name, DNS verified" marker versus a "this is compatible but not DNS verified" marker, it would be great.

I think golang has the best answer for this, where it's easy to impersonate but it has to be explicit.


Yeah, it's far from perfect but it does get a lot right. It's painful watching all these new package management tools like pip and npm completely ignore what came before them.

I think Go's approach is interesting, though it does rely on some magic that isn't immediately obvious. I agree that being explicit is a tremendous benefit: it avoids the attack used here, and makes it less likely for typo-squatting to succeed (e.g., `npm install axiod`).


Publishing to Maven Central is a bit of a pain, but the manual effort, doc jars, signed jars, etc. help with security and keep away low-effort packages.


Also, a pretty sophisticated way to manage transitive dependencies. Python is an absolute mess in this regard (as well as pretty much everything else with dependency management…)


Alex Birsan actually published his findings on this vulnerability in Feb 2021 [1] and collected a bunch of bug bounties from various companies (Apple, Microsoft, PayPal, Yelp, Tesla, Shopify, Uber, Netflix).

He was able to steal package names for Python (PyPI), JS (npm), and Ruby (RubyGems), where these various companies have their own private package repositories with private modules but don't control the corresponding package names in the default repositories.

The main requirements for this attack are

1. Having private repositories whose package names are not also owned, in the default repositories, by the owner of the package.

2. A package manager that allows downloading packages from multiple repositories (the default plus private ones) without being able to pin a specific package so it is only ever downloaded from the private repository.
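One defense against requirement 2 is hash pinning, which pip supports via --require-hashes and --hash entries in requirements files: even if an attacker's package wins name/version resolution, the downloaded artifact fails verification. A minimal sketch of the idea (file contents are invented):

```python
# Sketch of the hash-pinning defense: an artifact substituted by another
# index fails verification even when its name and version match the pin.
import hashlib

def verify(artifact_bytes, pinned_sha256):
    """Return True iff the artifact matches the hash recorded at pin time."""
    return hashlib.sha256(artifact_bytes).hexdigest() == pinned_sha256

real_wheel = b"real package contents"
pinned = hashlib.sha256(real_wheel).hexdigest()  # recorded when pinning

print(verify(real_wheel, pinned))                         # genuine artifact
print(verify(b"attacker-substituted contents", pinned))   # rejected
```

The catch is that someone has to record and maintain the pins, which is the same "who manages the dependencies" labor problem discussed elsewhere in this thread.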

What's notable in his findings is the omission of Facebook and Google, which I believe is due to their use of Buck/Bazel and monorepos for their internal code. Another thing to note is that Alex mainly targeted companies' internal private packages, while this particular instance affects an open source project that was providing a package on its own repository (and for which I was not able to find any bug bounty program).

There's another post by Kjäll et al. [2] that explains how this particular vulnerability affects other package managers (PHP, Java, .NET, ObjC/Swift, Docker), under what conditions each is vulnerable, and how to mitigate the risk. Two notable language package managers that were not affected are:

1. Rust, mainly because you have to explicitly select the private registry for each private package.

2. Go, mentioned as unlikely, due to the use of FQDNs in the package names and hash verification by default.

I think anyone adding non-default package repositories, or providing one (their own private repo in an enterprise setup, or third-party repositories), needs to be aware of this particular class of vulnerability and implement policies to mitigate it. I would say individual devs installing on their dev machines, and CI/CD systems installing via shell commands (rather than a secured package manager setup), would be the main targets, mainly due to the relative difficulty of auditing those scenarios.

[1]. https://medium.com/@alex.birsan/dependency-confusion-4a5d60f...

[2]. https://schibsted.com/blog/dependency-confusion-how-we-prote...


Good analysis, thanks for the view in from the outside. I found the terms of the relevant bug bounty [0], whose scope includes "Open source projects by Meta".

And from the engineering blog, "[...] PyTorch 1.0, the next version of our open source AI framework."[1] (emphasis mine)

[0] https://www.facebook.com/whitehat/

[1] https://engineering.fb.com/2018/05/02/ai-research/announcing...

However, Meta has since ditched it [2], and a careful keyword search of pytorch.org and linuxfoundation.org suggests there is no current official bug bounty for PyTorch.

[2] https://pytorch.org/blog/PyTorchfoundation/


I was aghast when I first started to see dependency management tooling that allowed dependency versions to be declared using a wildcard. That seemed like an insane compromise between convenience and safety.

I couldn’t bring myself to use a wildcard for a long time. I always specified the exact version and incremented manually to feel like I was at least trying to maintain control.

I still think it’s an insane practice, but with software engineering ever increasingly being the art of composing dependencies with bits of glue code, I get it.
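A toy illustration of the hazard (versions invented): a wildcard range re-resolves at install time, so a freshly published release, including a hijacked one, is adopted automatically.

```python
# Toy resolver: "1.*" means "the newest 1.x available right now".
def resolve_wildcard(available, major):
    matching = [v for v in available if v[0] == major]
    return max(matching)

# The last version was pushed minutes ago, possibly by an attacker.
available = [(1, 2, 0), (1, 3, 1), (1, 99, 0)]
print(resolve_wildcard(available, 1))  # the just-published (1, 99, 0) wins
```

Exact pins trade this risk for the manual work of reviewing and bumping versions yourself.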


The problem is the automation (PyPI accepts packages from basically anybody, automatically). Fedora and Debian/Ubuntu package a large fraction of the Python ecosystem (everything required by any package!) and are far less susceptible to these attacks, because a maintainer does the update. That's not to say the vetting of every update is perfect, but something like this would be harder.

PyPI probably needs a vetting system for new contributors of some kind.


I think this is an unpopular opinion, but I believe language package management systems try to solve a problem that was solved by Linux distributions a long time ago, and they typically do it very poorly.

I suspect a prime reason is the absence of package management on Windows, which a lot of developers (and users) use, and secondly the desire of developers to always use cutting-edge library features when writing code, with nobody wanting to upgrade any dependencies afterwards. There used to be a lot of discipline about staying compatible with many library versions, IMO; nowadays people just specify the latest version of each library in their requirements.txt file (or the equivalent in other languages).


> language package management systems try to solve a problem that was solved by Linux distributions a long time ago, and they typically do it very poorly.

Yet every major Linux distribution has its own packaging format (deb, rpm, etc.), package naming convention, dependency resolver, package release strategy (rolling, fixed, etc.), package build and deployment system (source, binary, per-arch binary, etc.), package install peculiarities (custom, upstream-focused, system-wide, in a chroot, in a snap, etc.), reproducibility constraints, and so on.

So it's not like it's a _solved_ problem for Linux distributions.

Not to mention that most distribution package managers are system wide, while language package managers are often environment based.


> > language package management systems try to solve a problem that was solved by Linux distributions a long time ago, and they typically do it very poorly.

> Yet every major Linux distribution has its own packaging format (deb, rpm, etc.), package naming convention, dependency resolver, package release strategy (rolling, fixed, etc.), package build and deployment system (source, binary, per-arch binary, etc.), package install peculiarities (custom, upstream-focused, system-wide, in a chroot, in a snap, etc.), reproducibility constraints, and so on.

> So it's not like it's a _solved_ problem for Linux distributions.

Just because someone reinvents the wheel does not mean it wasn't invented (solved) beforehand. I would also argue each distro package manager is miles ahead of any language one.

> Not to mention that most distribution package managers are system wide, while language package managers are often environment based.

That's the thing I was alluding to in the second part of my post, if developers would be more careful about backwards compatibility we wouldn't have to use environments. I do admit that packages for apps are more of an issue, it would be nice to upgrade those without having to upgrade the rest of the system.


It's not unpopular, it's completely wrong, to a point I can't comprehend. Have you ever seen the download page of any multi-platform software? It's one or two packages for Windows, covering everything from Win7 to Win11, one or two for Mac, and a dozen for Linux that cover only a very small subset of the distros and their versions. Or, more often, they don't even care and provide only one Ubuntu and one Red Hat package, and good luck to you.

The state of Linux software distribution is completely abysmal and ridiculous.


> that has been solved by Linux distributions a long time ago

Certainly not. A recent example: I wanted to try a KDE distribution, so I installed Neon, which has three dependency updaters by default: pkcon, apt, and snap. If I use Python or Node, then I usually need to use their own package management systems as well.

Dependency management is one area that completely fragments Linux into different distro universes.

I am not saying Windows is better (it isn’t) but if Linux had solved the problem, then most of us would share one solution.


I agree with you completely. Not only is security a problem, but mixed-language packages are too. It's possible to put something like that on PyPI, but it's a huge pain (and the new setup.py replacements make it basically impossible).


If they happened often, companies would need to do a better job securing against them. I suspect nation states would like to avoid that.


> hundreds

Thousands.


What are the chances some sort of United Nations institution pays workers to both audit and prevent/harden against supply chain attacks? I'm wondering, in light of potential job obsolescence with the progress of ChatGPT and the like, whether we could redirect the human intellect surplus there. Also, could access to a "safe" environment be considered a human right in the future? I would certainly love it if my children did not have to worry about this constant threat at some point, or at least if the stress could be decreased. It sounds like being paranoid is the only way to go, and I wonder what long-term effects this has on mental health. Do we as techs view our close ones as less trusted the closer to supply chain attacks we work?


> What are the chances some sort of United Nations institution pays workers to both audit and prevent/harden against supply chain attacks?

Zero, essentially. State actors profit massively from such systemic weaknesses, so it is not in their interest to eliminate them for the population at large (they do of course want to eliminate them for themselves, but they already have extremely strict supply chain policies so that's mostly a solved problem).

Hell, we have state-sponsored institutions working hard to actively create vulnerabilities in software that previously didn't have them. Security vulnerabilities are a tool through which power is exercised. They're not going away as long as governments have any say in it.


I think it is a great idea. The problem is that institutions and decision makers have very little knowledge about open source software, so it is really hard to convince them to do this. I can only speak about Germany, but I recently read a newspaper article by some government IT official that contained only buzzwords, where it was clear that he did not know what he was talking about. There are exceptions: someone got the government to fund curl and OpenSSH (both somewhere between 50k and 500k). So that is great. But there is also a second fund where everyone can apply, and looking at the responsible team you see that out of 5 people, none has a STEM degree; instead they graduated in fields like cultural studies. I doubt that they know/care enough about the threat of supply chain attacks to direct funds there.


Why do you expect that from UN?



