PyPI new user and new project registrations temporarily suspended (python.org)
168 points by jonbaer on May 20, 2023 | 131 comments



Methods to deal with malicious actors in your system:

- Require toilsome identity verification. Things that are in short supply and are difficult to get and uniquely identify a person. Examples include a phone number, credit card, driver's license, mailed letter, etc.

- Require a referral, both for accounts and for new packages. Don't allow a signup unless the user has a referral code generated by another user with good karma. This isn't fool-proof, as a user that does get an account can then generate more accounts. But it makes it easier to revoke them en-masse, and forces users to be more scrupulous about who they refer, as you can block the referrer as well as the malicious user.

- Require community review, both of new users, and new packages. New users/packages are sent to a mailing list of moderators and somebody has to approve it, but someone who notices a problem can also reject it, even after it's been mistakenly approved. Slower, but there's more eyeballs to spot malicious stuff. This is more common among open source projects. (Growing your moderator list is important, as well as automating as much of the review process as possible; obviously PyPI currently has a shortage of moderators, but they should have a huge number of them by now!)

- Don't allow new users to do certain things until enough time has passed or they have accrued enough karma points. May require fixing bugs, answering questions, etc; work which benefits the community and most malicious actors wouldn't invest the time and effort in. Again not fool-proof, but definitely increases the time and difficulty for successful attacks.

- Captchas. These can obviously be worked around, but are a minimum-effort way to avoid bot signups.

For better defense-in-depth, combine multiple methods.


> Captchas. These can obviously be worked around...

Best to not consider this a barrier at all. I helped build a lot of systems to bypass these and the "guarantees" they offer are truly "de minimis". Phone numbers and credit cards are only slightly better.

The real answer is to determine how much value a successful compromise of your service can generate, and combine enough barriers to clearly exceed that value. Some criminals are very business-savvy, but many have poor judgement of value and will overspend on marginal cost-of-acquisition and try to "make up for it in volume", so it's important to be obviously not worth the expense.

Unfortunately for PyPI, there's a wide gamut of value gained from compromise from a wide variety of different actors. And for some of those actors, the potential value gained is existential, so they are willing to spend anything to compromise PyPI, and they have incredible resources. So then PyPI has to make itself more expensive than compromising Windows/SS7/Linux/iOS/Office365/AT&T/etc/etc. And that's a very difficult field to find yourself competing in.

Shameless plug: http://resolved.dominuslabs.com feel free to reach out for custom work - specializing in accessing gated sites for new search engines to do better web crawling, scraping public data, etc.


None of the barriers are fool-proof, but they are layers of defense. When adding defensive layers, you need a cost comparison of how much time and money it costs you, versus how much time and money it costs the attacker. If it costs you $0 recurring and 5 developer days, and it costs the attacker 1 human solving 1 captcha per 1 form submission, then it's probably worth doing, if only to stop randomly-scanning botnets and improve the signal-to-noise ratio of your detection capabilities. (I know AI is getting better at automated solving; dunno if they're all broken yet.)


> you need a cost comparison of how much time and money it costs you, versus how much time and money it costs the attacker.

Not quite! It's actually "how much money your countermeasures cost the attacker to beat, versus how much money it gains the attacker to beat your countermeasures".

The attacker doesn't care if you spent $1 or $1 billion on it, they only care what it costs them vs. gains them. If it gains a lot, there are usually ways to fund the effort, no matter how large. There's usually an enormous mismatch between your budget and the attackers' budgets (in either direction), as well as a massive adjustment for purchasing-power-parity.

This isn't a "near-peer" fight where you're trying to "outspend" the attackers using efficient techniques to exhaust their budget before your budget is exhausted. This is a situation where there are nomadic bandits, regional pirates, FAANGs, and entire governments where you want to just make it so they look at your system and go "nah skip that. we'll do this another way entirely by getting access to a different company's system instead".


Of course, it should also be noted that these will create burdens for legitimate users as well, in some cases disproportionately (e.g. new developers will have a harder time getting referrals).


If it prevents coding classes and bootcamps from pushing junk onto pypi/npm as an exercise of the class, I'm all for it.


They should be using test.pypi.org. Sheer laziness on the part of tutors.


New developers can also create genuinely new and useful things, some people (like me) learn by creating (hopefully) useful things as it’s more motivating


You can always leave it on github or some similar service, you don't need to "pollute the global namespace for your pet project"

And yes, you can install it with pip etc. from there


But aren’t people meant to be able to push useful things to pip? These developers might not even be new, they might just not talk about it online much, or they could be new but fast learners, so these projects could still be things that many other people want


If it's really useful (and it has users beside the author) yeah, but I wouldn't upload a 30min project just because you feel like it

Pip is already full of crap projects, I don't care if you're a 'fast learner', and you could be clobbering a name (or a similar name) to a more important package


I posted a variant of this comment below, but for here as well: the single greatest challenge with any additional method here is operational burden. New methods for ensuring the authenticity of users cannot substantially increase the load on PyPI's maintainers without compromising the other activities that keep PyPI well-maintained (like developing and reviewing features, administrating the running instance itself, and addressing baseline operational overhead from locked-out users, etc.).

A pernicious thing about this problem is that even the methods that appear to shift work away from the maintainers (like community review) quietly increase operational burden: they require effort to be allocated to supporting those systems, maintaining and administering the "trusted list," etc.


I think domain verified namespaces would make a difference. If I own example.com, it makes sense for my namespace to be @example.com (pypi.org/@example.com/project).

If all namespaces were domain based it would reduce impersonation and confusion while making it much easier to assess trustworthiness. If I can definitively associate a namespace across GitHub, PyPi, the internet, I can get a much better idea of whether or not I should trust someone.

I also think that could evolve into some type of attestations about money already being spent. I call it collateral attestation.

For example, if I use GitHub Sponsors to donate to a project, GitHub could attest to that. If I donate $50 to the PyPi project, they get an attestation that someone donated $50 to them and I get an attestation that I donated $50 to someone.

Even if it's only domain validated namespaces, at least that's something that costs a little bit of money every year and it scales without a lot of human intervention compared to most other ideas.


I assume ownership would follow the domain? If it gets transferred, then the ownership of packages gets transferred too?


No. I would use domain as a vanity pointer and a permanent account as the source of truth that clients use to download packages. You have to assume the domain can be dropped and re-registered by a bad actor, so a permanent account with proper package signing is really the only way to ensure you're not getting packages that are the result of an account takeover.

All a domain based namespace really does is help to get rid of impersonation and the confusion caused by squatting and mismatched namespaces everywhere. I think it helps in evaluating trustworthiness the first time you add a package to a project, but you still need package locking and signing to ensure you're only getting packages from the person or organization you've trusted.


> For better defense-in-depth, combine multiple methods.

It really starts to sound like we are reinventing Linux distributions. A strong vetting process and barriers to entry are why Linux distributions remain free of malware in their repositories. AFAIK no Linux distribution has ever been found distributing malware in an official repository (though I'd be interested if anyone knows of an exception).


No Linux distribution comes close to distributing the diversity of packages available from PyPI.

I had a package distributed by DebianScience. It was essentially impossible to get them to upgrade.

'malware' is a broad word. I know some Debian developers have characterized Debian-distributed software as having a "timebomb". Does that count?


It was Xscreensaver, written by JWZ. He was pissed that debian distributed old versions, so he put in a timebomb that would make it stop working on a certain date. I kinda forget the details, but still I wouldn't consider this malware, it's not something you would worry about trying to exfiltrate credentials, run a cryptominer, open a backdoor shell, etc. It just disables itself with a taunting message, harmless really.

Another famous case was when Debian patched OpenSSL to silence an uninitialized-memory compiler warning, back in 2006, and since the process ID was essentially the only entropy mixed in before generating keys, only about 32,767 different keys could be generated per key type on such systems, and it's pretty easy and fast to just check them all. The bug wasn't discovered until 2008, and until very recently OpenSSH shipped a blocklist to reject the weak keys generated in that window. Still, really nothing like malware. Just a huge security bug. Those happen occasionally (e.g. Heartbleed, not Debian specific ofc).

So like 2 notable cases of less-than-optimal software prepared for users by the debian project? But not malware.


> he put in a timebomb that would make it stop working on a certain date

I wrote 'some Debian developers have characterized [it] as having a "timebomb"'. because I do not agree with that assessment. It did not actually stop working on a certain date. It displayed a warning dialog box on the screen saver lock screen saying the package was out of date. Here is the bug report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=819703 Here is the relevant code: https://github.com/Zygo/xscreensaver/blob/379ea5d57c212863d2...

Nothing there about no longer working. (If "stop working on a certain date" makes something a time bomb, then expired certificates certainly put time bombs in Debian too, like https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=844631 linking to https://bugs.chromium.org/p/chromium/issues/detail?id=664177: "the [Certificate Transparency] information has a built-in build-time bomb of 10 weeks".)

Okay, "time bombs" are not malware. Got it.

My main point remains, Linux distributions handle far fewer packages than PyPI. "Debian comes with over 59000 packages" says https://www.debian.org/intro/about and PyPI has 455,474 projects says https://pypi.org/.

Can Debian developers handle 7x more packages? Almost certainly not. They couldn't even handle my software.

If a package maintainer already can't detect a "time bomb" with a huge warning message in the code, then why should I believe they are able to detect more subtle malware, especially with a 7x higher maintenance load?

I write specialist software with only a few thousand users world-wide. Do you really think each of the distros will have the people who are able to review and integrate my code into the distro at the same level as they do xscreensaver?

More realistically, they will reject my package as being out of scope, leaving me ... back at PyPI.


... or leaving you at GitHub (or GitLab, or SourceHut). You can pip install a git repo directly, and specify tag, or a hosted release tarball. If only the ~ 10k most used python packages were available from pypi, and the more niche ones had to be more directly referenced from their own repos or websites, I'd be fine with that. I'm accustomed to that for C libraries, distros provide the most common/core few thousand of them, the rest you can get yourself, it's doable. Minimize your dependencies, its better engineering anyway.
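For reference, pip already supports both forms; these commands are illustrative (the repo and tarball names are made up):

```shell
# Install straight from a git repo at a pinned tag...
pip install "git+https://github.com/example/frobnicate.git@v1.2.0"

# ...or from a hosted release tarball.
pip install "https://example.com/releases/frobnicate-1.2.0.tar.gz"
```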


Does the lack of comment about "time bomb" mean anything regarding your misapprehension that the xscreensaver code made the program stop working on a certain date?

I meant 'PyPI' as synecdoche, yes, for a large number of alternatives which cannot meaningfully be described as "reinventing Linux distributions".

Those three quoted words were bscphil's claim. My counter-claim is we are not, and further, the approach Linux distributions use is not able to handle what people already use PyPI for.

> and the more niche ones had to be more directly referenced from their own repos or websites

These are still not "reinventing Linux distributions".

> If only the ~ 10k most used python packages were available from pypi, and the more niche ones had to be more directly referenced from their own repos or websites

How do you determine the top ~10K if some 450K packages are distributed across a variety of platforms?

If a package leaves the top-10K, is it removed? How much breakage will that cause? Or, once it reaches the 10K level, does it stay on PyPI forever?

Why do you think this will result in reducing the amount of malware? Seems to me it opens the attack surface, and makes it harder for anti-malware scanners to get a good oversight of the large portion of what people use.

> distros provide the most common/core few thousand of them

Are you sure you aren't agreeing with me? The Linux distros are already unable to manage the packages in my field.

That's why "reinventing Linux distributions" will not work, if forced to scale up 7x and handle niche area.

That's why people use other methods, like PyPI and Anaconda, to handle what the distros cannot.


These are good suggestions. I also like @BozeWolf’s suggestion below about charging for new accounts.

You could even do a combo. Like $25 sign up. Free with referral.


Identity verification is the only reliable option for popular public package registries that want to avoid becoming a cesspool of malware.

The problem with identity verification is that it can be expensive for you, the registry (reliable implementations require things up to video call verification in case of doubt), and risky for your users (who will have to trust your or your verification partner’s infosec practices with their PII).

Ideally you bear only one of those costs, but if you screw up, both.

It’s also crucial to remember that identity verification is never a guarantee: you cannot dispense with any pre-existing measures, like manual package review and user reports, to recoup the costs.

All in all, it is inevitably an expensive undertaking and a sad example of the tragedy of the commons.


Or don’t have a system? What if you installed stuff via a git (not necessarily github) url/tag? Puts the onus on users to verify. Use checksums to avoid upstream changes you didn’t explicitly accept. No different than cloning a repo from a risk perspective (just as risky but no more so). This removes any implicit seal of approval. Companies like malwarebytes could monitor and maintain blacklists or whitelists. Sign up to multiple lists if paranoid.
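pip's hash-checking mode already provides the checksum part of this; a requirements file can pin every artifact to a digest (the package name is illustrative, and the digest is a placeholder you'd generate with `pip hash`):

```
frobnicate==1.2.0 \
    --hash=sha256:<64-hex digest of the expected sdist or wheel>
```

With any `--hash` present, pip refuses to install anything that doesn't match, so a silent upstream swap fails loudly instead of being accepted.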


That's how go does it, and it has some drawbacks:

Things break if the git url changes, which is more likely than the package name changing.

It complicates using a fork, or local version of a package. Having the URL once in a package manifest instead of repeated in every file that imports it, like Go does, would probably help.

It can make discovering the package you need more difficult.


Go has vanity URLs for the first, and go module files have replace statements to redirect imports to other names or local directories.


> Go has vanity URLs for the first

In theory, maybe. In practice, I haven't seen vanity urls used very much in go projects, and for it to completely solve the problem you need to use the vanity url from the very beginning and you can't ever change the vanity url. It also requires you to have your own domain.


What about some testnet that allows people to push things up as a test? That seems like a valid need but a source of lots of garbage at the same time.


This is what https://test.pypi.org/ is for.


What about some method of "moderate to get upload credit"?

Obviously not full moderation (as you don't trust these people) but you could have them fed a few basic tasks that support moderation.

They would be given a selection of tasks, some of which overlap with tasks done by trusted people. If the user answers consistently with the trusted person, they get some credit. When a few such users show consensus on the non-overlapping tasks, those answers get accepted.


Linux distribution maintainers already do all of those things and probably more. It's always interesting to see other projects rediscover these lessons multiple times.


I wonder if Apple’s trick would work for python packages. Pay a few (5? 10? 20?) bucks to become a Pypi developer/sponsor, it would also pay pypi’s operational bills. It raises the bar for malicious actors.

Additionally, if pypi provides keys to developers, pypi can also revoke certificates for developers making malicious packages.

It would need a system which checks package signatures on startup of a Python app, or maybe there is some other way to do that. pip --check or something which then runs in pipelines, specifically meant to check for malicious packages each day.

To decrease the barrier of entry on pypi, students could identify with their student number. Or pypi could work with a system where you have trusted users and packages and untrusted users. A bit like the blue checkmark, but without the negative connotation.


That seems like a way to kill pypi as a popular service. For the 3 or so packages I have, I'd probably just change the description to pull from another location rather than pypi. It's not that I can't afford it, it's just that I don't want to end up paying for another subscription if rules change.

There are also people who wouldn't be able to pay for legal reasons. It would also stop teenagers who don't need the hassle of getting parents to pay.


I already host my own packages on my own domain. It was not hard to make a PEP 503 "simple" index.

  python -m pip install chemfp -i https://chemfp.com/packages/
This is commercial/proprietary software for cheminformatics fingerprints.
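A PEP 503 project page really is just a list of anchors, so a minimal generator for the per-project page might look like this (filenames are hypothetical, and a real index also needs a root page listing every project):

```python
from html import escape

def project_page(files: list[str]) -> str:
    """Render a minimal PEP 503 per-project page: one anchor per file."""
    links = "\n".join(
        f'    <a href="{escape(name)}">{escape(name)}</a><br/>'
        for name in files
    )
    return f"<!DOCTYPE html>\n<html>\n  <body>\n{links}\n  </body>\n</html>\n"
```

Serve the output as static files under `/packages/<project>/` and pip's `-i` flag can consume it directly.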

I don't feel comfortable hosting it on PyPI as I suspect in the future they will want to charge to host non-FOSS licensed software. Which is definitely their prerogative! By hosting it myself already I end up with less lock-in.

Having done it both ways, hosting the packages myself means I also like having full access to the download logs, the ability to replace files when I made a mistake, and I can rsync using my usual SSH keys instead of using PyPI's API token system to upload wheels.

I have a very old version of my software on PyPI, distributed under a free software license. The major downside is that people sometimes download that instead of the most recent version. But since it doesn't work on Python 3.x, they either search and find my web site, with the download instructions, or email me.

I figure I also have keep something on PyPI, to prevent name squatting.


I think you misinterpreted the comment. The payment would be required for people wanting to PUSH to pypi. Developers and people PULLING from pypi would remain free. The problem is garbage and malicious submissions.


I don't think the comment you're replying to misunderstood the comment. I think the author was saying that they would update the readme of their projects (which they _push_ to a registry, and for which they do not want to pay) to instruct people to pull the package from somewhere else.


You can do that, but it certainly doesn't help you with project discovery. If you want to only distribute via GitHub, that's fine. But as a developer I'd need a compelling reason to use such a project vs one distributed via pypi.


That's ok. Many projects don't care about discovery. (As in being popular/brand name, rather than being possible to find if someone needs that specific thing) And even those that do can rely on people searching for "python package for frobnicating" and finding it somewhere else.

I'm publishing my code for the benefit of others, not mine, so if someone writes the same thing and pays to get it published, that's fine with me.

> I’d need a compelling reason to use such a project vs one distributed via pypi

Apart from trivial projects, there's not that many alternatives you can swap out. Your choices may be "use such project or write your own".


Tbh, changing PyPi urls to git urls at GitHub would almost be a win-win. At least GitHub has namespaces and many other indicators of trust, such as issue tracker, stars, git log and so on. You would lose semver however.

Usually checking the official GitHub project's readme for the exact name of the pip install command is what I normally do anyway. I would never ever use PyPi as a discovery mechanism, it has too many typosquatters and other lookalikes.


If they went with "let's make the default lookup work well with versions in other registries", I'd love that solution.


not to mention pip search doesn't work anymore...


Imagine having to pay to give open source software away for free. Sounds like a great way to ensure the death of PyPI.


How about assigning a trust level to each package. Level 0 - Free, unchecked package, Level 1 - reviewed by at least one moderator, Level 2 - developer bought a $100 license + reviewed, etc. Then it's up to package consumer to decide which package they trust.


FD: I’ve done some work on PyPI, but I am not an admin and everything below is an independent opinion/understanding.

PyPI’s operational costs are, to my understanding, mostly covered: hosting is graciously provided by a sponsor, and the PSF currently funds roles for its development, administration, and security. More funding is always good and I believe the PyPI admins are looking to enable payments through the newly released “Organizations” feature[1].

Edit: and to make it more clear: payments for Organizations would be principally aimed at corporations and other groups.

> Additionally, if pypi provides keys to developers, pypi can also revoke certificates for developers making malicious packages.

There are currently plans in progress to allow PyPI users to upload Sigstore[2] signatures for packages.

That won’t directly address the spam issue, however — signatures will be opt-in (by necessity, due to the size of the packaging ecosystem), and no codesigning scheme can prevent a spammer from simply assuming a new identity (especially when new identities are “free,” as they normally are.)

Separately, revocation itself is a nasty problem for packaging ecosystems to deal with: ecosystems with trillions of dollars of value behind them (like the Web PKI) struggle with it, so it’s not immediately clear that it would be anything other than an additional operational burden.

Similarly for reputational systems: they’re difficult to operationalize without additional maintainer burden. That’s not to say that they’re necessarily bad or impractical for PyPI’s purposes; only that I’m not aware of a successful use of them in an open source packaging ecosystem. Compare, for example, PGP’s WoT failures.

[1]: https://blog.pypi.org/posts/2023-04-23-introducing-pypi-orga...

[2]: https://www.sigstore.dev/


Web of Trust failed because it lacked motivating reasons to use it, and had a very weird idea about how trust is represented (i.e. the idea of partial trust keys meaning... What exactly?)

What was missing was a role/application based idea of trust chains. i.e. what I trust to do my banking is quite different to what I trust for software reputation.

Because ultimately we're operating a web of trust system now on the internet, it's just informally specified.


> Web of Trust failed because it lacked motivating reasons to use it, and had a very weird idea about how trust is represented (i.e. the idea of partial trust keys meaning... What exactly?)

This is true, but it's also just one among several reasons that the PGP WoT failed. Some are purely implementation and design failures that wouldn't apply to a scheme better than PGP, but others are still intractable in the general case (like timely revocations and "strong set" maintenance).

> Because ultimately we're operating a web of trust system now on the internet, it's just informally specified.

Could you elaborate on what you mean by this? The trust scheme that the modern web is built on is a well-specified hierarchical PKI, i.e. the exact opposite of a web of trust. But it's possible I'm misunderstanding what you mean.


I don't know that I agree with this framing. Pgp could support multiple keyrings, for one.

Bigger, though, is that the internet isn't a web of trust in the same vein? Take out the CA chain, and websites fail trust pretty hard. Even adding them, we wind up accepting certificate pinning on browsers, as nobody has a better option, yet.

And that is just building trust to web sites. The stick houses we have built to trust communication basically amount to "you always call us."

Edit: forgot to say that partial trust is pretty close to "believed, but I wouldn't bank with this identity."


Didn’t know sigstore. The idea looks promising. They should work on marketing though. “Their” technology would work on any package repository, right? Also for npm and the likes.


Not only would it work for NPM, but it already does[1] :-)

[1]: https://blog.sigstore.dev/npm-public-beta/


I maintain a few niche (electric power systems) packages, and I wouldn't mind a one-time or yearly fee, or a fee per project created. I say this as a Brazilian who lived in the middle of nowhere and managed to have a website in the 90's as a teen. If a monetary fee is not desirable, some other hurdle/challenge would probably work fine.

Recently I've seen someone on Reddit trying to automate the creation of PyPI projects through GitHub Actions. The person was complaining that the first deployment couldn't use an API key for that project since it didn't exist. So I'm not surprised some people are trying to do the same for malicious purposes.

The PyPI front page lists 455k projects. If you search for "test", you'll see there's a lot of throwaway projects (note that test.pypi.org is a thing). I'm mostly an EE researcher and I'm not sure students need a low barrier to entry to PyPI, since pip and other tools support installing from GitHub without too much hassle, and there are also other non-PyPI package indices. Student packages/projects tend to be abandoned soon after graduation. An archived repo (with a license...), on GitHub or somewhere else, sounds more reasonable and also has more visibility that could result in code reuse someday (through the service's own search and search engines in general). I'd love to understand why so many people repeat this meme that students and teens need trivial access to production infra like PyPI.

So, I'd say being too inclusive, allowing fully unrestricted trivial creation of projects is kinda foolish. There needs to be some extra step, be it a fee, identity confirmation, manual moderation/approval, or something else. I'm sure the PyPA devs/maintainers have ideas.


> There needs to be some extra step, be it a fee, identity confirmation, manual moderation/approval, or something else. I'm sure the PyPA devs/maintainers have ideas.

When I was younger, I always thought computing was so incredibly cool, because me, just some blind kid in Florida, could contribute and make things and share things and just ...participate. I would talk to friends trying to go in to other careers, and excitedly talk about what I was working on and be curious why they never did anything related to what they wanted to do when they grew up.

Now, I understand how this comes about, bit by bit, with the best of intentions.

And I hate it.

Please, just no. If you want to set up a corporate only, super-sekret clubhouse of a PyPi that only the authorized developers can push to, well, the source code for PyPi is right here[0]! And here's Stripe[1]! But please don't break even more of the open, free Internet that I grew up with, I'm pleading with you.

[0]: https://github.com/pypi/warehouse

[1]: https://dashboard.stripe.com/register


The ones who broke the commons are the ones who abused it by uploading malware and lazy student projects. You shouldn't blame the host for trying to maintain standards in light of this. When you have a tragedy of the commons scenario, the solution is to regulate the commons. Otherwise it will lose all value for everybody.


> Recently I've seen someone on Reddit trying to automate the creation of PyPI projects through GitHub Actions. The person was complaining that the first deployment couldn't use an API key for that project since it didn't exist. So I'm not surprised some people are trying to do the same for malicious purposes.

Sorry for the tangent, but: you can do this now! If you use trusted publishing, you can register a "pending publisher" for a project that doesn't exist yet. When the trusted publisher (like GitHub Actions) is used, it'll create the project[1].

All of this is supported transparently by the official publishing action for GitHub Actions[2].

[1]: https://docs.pypi.org/trusted-publishers/creating-a-project-...

[2]: https://github.com/pypa/gh-action-pypi-publish


Interesting, thanks for the links. By the way, the one I mentioned was in r/learnpython, which is probably not exactly the ideal audience for such a feature.


No problem. And I agree completely -- it'd be really nice if newcomers could copy a template (or even better, have a tool make one for them) that handles all of this behind the scenes.


You can always have a trusted community member waive the payment requirement, so anyone who demonstrates genuine effort can have it for free.


This. Also enterprises using specific (still untrusted) packages could pay it for the developer. There must be a way to do this. But in true spirit of python the system should be as open as possible.


Universities could host gitlab instances with pypi registries built-in.


Start asking maintainers for SBOMs (this is rhetorical; please don't) and you'll find out how disinterested they are in doing even more work for others. They understand that there's a problem, but don't have the will to deal with extra work when they're already short on time.

Open source projects tend to be maintained by a tiny number of people who are scratching their own itch, not signing up for more barriers.


As long as devs could be identified properly (ID selfie + ID closeup + small payment with card in same name) and students do the same with a smaller amount, I think that could work. Refund of protection deposit when you close your account. System of honor based on account age. The problem is it's incredibly hard to properly ID people. If fraudsters can get an ID+card they can make a convincing fake selfie ID photo from social media pictures. Plus accounts can just be stolen.


Hahaha no. You physically cannot do it. Read PEP-508. Python can install packages from anywhere. Even if you start with PyPI, the dependency can be specified as hosted on a random unrelated resource.

And, on principle, I wouldn't pay PyPI anything. I want them gone, I don't want my money to feed a bunch of incompetent people who make my life miserable. So, if they were to implement this idea, I'd be hosting my packages on GitHub or Gitlab etc. and have dependencies link to those sources.


You seem pleasant.


> The volume of malicious users and malicious projects being created on the index in the past week has outpaced our ability to respond to it in a timely fashion, especially with multiple PyPI administrators on leave.

People suck and that is why we cannot have nice things.


It's called the Tragedy of the Commons.

https://en.m.wikipedia.org/wiki/Tragedy_of_the_commons

One solution would be to have a minimum payment or a trusted sponsor in order to register a new package: either a minimum trust score vouched for by well-known community members, or spam reduced through payments.

As for the root cause of why people suck: different cultures and different people have different values. For many people, making money by spamming is not an issue in their own internal value system. Thus, it is better tackled by making spam unprofitable.


I disagree that this is a case of tragedy of the commons (totc). Totc is about unregulated depletion of resources, just like your link says. This pypi stuff is about maliciousness. Criminal or misanthropic behavior is not totc; it's just antisocial.

Also, https://aeon.co/essays/the-tragedy-of-the-commons-is-a-false...


The shame of it is that a tiny, tiny, tiny fraction of people really suck. The number of people involved is probably less than a thousand. (This is my totally uninformed speculation.)


Yes, people suck and I wish they didn't but the other side of the problem is that the centralized flat-namespace model of PyPI is especially vulnerable.


It's only a matter of time until it happens to other systems too. Take Rust, for example, where folks can install software via `cargo install <globally-named package>`. Top that off with the fact that during compilation an attacker has full access to the victim's computer[0].

Python just happens to have more popularity for now.

I think golang gets this right where you need the full package name to have it be usable with get/build/install commands.

---

0: https://doc.rust-lang.org/cargo/reference/build-scripts.html


Namespacing isn't the problem here: an ecosystem with two (or more) levels of namespacing has the exact same username and package stuffing issues. Namespacing arguably makes typosquatting more difficult, but that has nothing to do with build-time code execution (a common feature of packaging systems) or other package confusion techniques.

Go is able to avoid these problems because it (arguably wisely) sidesteps nearly all package index problems by punting them to the source host.


I think maven central has the right idea, but it could be generalized.

Require people submitting packages to prove they own a domain and use the domain as the namespace representing the user. We could then use SRV records or TXT records to specify where a domain’s packages are authoritatively hosted and then have conventions to find a signing key for a given domain that can be used to verify cached packages in non-authoritative hosts.
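A rough sketch of what the client-side discovery half of this could look like, assuming a hypothetical TXT record format (the field names and record layout here are invented for illustration, not any existing standard):

```python
# Hypothetical: suppose a domain published a TXT record like
#   "pkg-repo=https://packages.example.com; signing-key=https://example.com/key.pub"
# A client could parse it to learn where that namespace's packages are
# authoritatively hosted and where to fetch the signing key.

def parse_repo_record(txt_record: str) -> dict:
    """Split a 'key=value; key=value' TXT record into a dict."""
    fields = {}
    for part in txt_record.split(";"):
        key, _, value = part.strip().partition("=")
        if key and value:
            fields[key] = value
    return fields

record = "pkg-repo=https://packages.example.com; signing-key=https://example.com/key.pub"
info = parse_repo_record(record)
print(info["pkg-repo"])  # the host serving this domain's packages
```

The actual DNS lookup and signature verification are omitted; this only shows the record-parsing convention the comment proposes.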


Again: namespacing isn't the problem (or a totalizing solution) here.

I admire Maven's approach on a conceptual level, but it leaves a lot to be desired on a technical level: it effectively gatekeeps the entire packaging ecosystem on the ability to purchase and maintain a domain name, which (1) is unnecessarily exclusionary, and (2) assumes trust in a namespace (DNS) that was never really designed to be authenticated or a source of stable, permanent identifiers (domains change hands all the time, including for hostile reasons).

A modern version of Maven's scheme would bootstrap on well-known URIs over HTTPS instead, but even this would still represent a significant imposition on PyPI uploaders.


I think the goal is not to have "pypi uploaders" but instead have a discovery system designed to enable safe caching of artifacts and automatic discovery of where an artifact is hosted.

The exclusionary aspects can be mitigated via somewhere like GitHub offering package hosting for their users. $10/yr for a domain name also isn't all that much.


I don't know how that's panned out for the Java ecosystem, but distributed hosting is something that PyPI moved away from (with a great deal of pain and heartburn) years ago. Previous versions of PyPI were a "pure" index, with distribution hosting provided by individual projects. This made package installation exactly as reliable and secure as the least reliable and least secure host.


You can set up a system like DNS, where downstream systems can cache verifiable versions of the packages hosted at the authoritative source to help here. If the lock file for the project records signatures or similar identifiers for the project's dependencies, the exact source of the bits matters a lot less, reducing the reliance on package hosts remaining up indefinitely.


Maven supports multiple package repositories, and it's not unusual to see small packages in a self-hosted repo. That could be a self-hosted HTTP server, S3 or similar, or GitHub's package repo.

The hostname of the repository needn't match the namespace of the packages it hosts.


> Again: namespacing isn't the problem (or a totalizing solution) here.

The current system works better for impersonators and squatters than it does for me as a legitimate participant, so, in my opinion, that's at least part of the problem. Domain validated namespaces go a long way towards solving that.

> it effectively gatekeeps the entire packaging ecosystem on the ability to purchase and maintain a domain name

There's no reason there can't be a default namespace similar to what Bluesky did. By default you get @example.bsky.social and can optionally switch to a custom domain. The main thing that Bluesky got wrong is they recycle your original handle when you switch to a custom domain and I think it should be (optionally) kept as an alias to prevent impersonation.

> assumes trust in a namespace (DNS) that was never really designed to be authenticated

It's trusted enough for MS365, Google Workspace, ACME (protocol), and basically everything on the internet at this point. PyPI.org is using a domain validated SSL certificate and (AFAIK) distributes unsigned packages. I view that as about the same amount of risk as using domain validation for namespaces.

> or a source of stable, permanent identifiers (domains change hands all the times, including for hostile reasons).

Domains changing ownership can be handled reasonably well by treating the domain as a mutable pointer to an immutable namespace. The immutable namespace would be the source of truth that would be trusted by clients and would point back to the domain so they reference each other. If either pointer changes the client can warn the user and let them decide how to handle it. By using the immutable namespace after following the mutable (domain) pointer once, you eliminate a lot of the value of stealing a domain to take over a namespace. If you flag domains that have recently started pointing to a different immutable namespace (ie: a new domain or owner), it acts as a cautionary warning so users know they shouldn't be blindly trusting the namespace.
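A minimal sketch of that "mutable pointer to immutable namespace" check on the client side (the storage format and return values are invented for illustration):

```python
# A client remembers which immutable namespace ID a domain pointed to
# the first time it resolved it, and flags the binding if it later changes,
# which could indicate a domain sale or hijack.

known_bindings = {}  # domain -> immutable namespace ID seen on first resolve

def check_domain(domain: str, resolved_namespace_id: str) -> str:
    """Return 'first-seen', 'ok', or 'changed' for a domain lookup."""
    previous = known_bindings.get(domain)
    if previous is None:
        known_bindings[domain] = resolved_namespace_id
        return "first-seen"
    if previous != resolved_namespace_id:
        # Domain now points at a different namespace: warn before trusting.
        return "changed"
    return "ok"

print(check_domain("example.com", "ns-1111"))  # first-seen
print(check_domain("example.com", "ns-1111"))  # ok
print(check_domain("example.com", "ns-2222"))  # changed -> prompt the user
```

This is only the trust-on-first-use bookkeeping; how the immutable namespace ID is minted and verified is the harder design question.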

Dealing with banned domains would likely be a pain point, but I don't think banning an account changes much in terms of human factors involved. On the domain side of things I'd probably maintain a public list of permanently banned domains, at least to start with, and see what problems that creates before trying to find solutions. Non-negotiable, permanent banning might even be ok.

For me personally, domain validated namespaces would be a huge improvement. I'm sick of the gold rush and having to chase my handle on every platform. Even if I get it, I'm still indistinguishable from a bad actor on most platforms because I have very little activity and nothing popular.

At least with domain verified namespaces someone who doesn't know me can go from pypi.org/@example.com to github.com/@example.com or example.com and be nearly guaranteed it's me (ignoring domain hijacking). That makes it easier to make a judgment call on whether or not to trust me because most of my activity will show up at, or be linked to from, those places.

Plus, as a package becomes more popular, the recognizability of the owner's handle also grows. With a single, domain-validated namespace, that's only one definitive thing for people to have to recognize, which cuts down on the cognitive load of trying to figure out whether mismatched namespaces are legitimate or someone trying to engineer an attack.

I think the current system for namespaces is a big chunk of the problem because it's non-authoritative in the context of knowing who owns a namespace. By switching to domain validated namespaces a lot of that ambiguity disappears because you know that example.com, pypi.org/@example.com, github.com/@example.com, twitter.com/@example.com, etc. are all the same person or organization. That means that as soon as you know you can trust them on one of those platforms, you can trust them everywhere else because, as long as the UI flags new domain pointers to combat domain hijacking, you've basically removed impersonation and squatting as a potential threat.

So, maybe it's not perfect, but, in my opinion, it adds enough value that it's worth considering as an option. I think it's great on Bluesky because I find it so much easier to determine whether or not people are legitimately who they say they are and not some prankster or impersonator. For example, I'm following @tailscale.com on Bluesky and I know it's them.


> which (1) is unnecessarily exclusionary

meanwhile pypa spent months making it so the only trusted publisher is microsoft


This is misleading: the trusted publisher is GitHub's OIDC IdP, which happens to be the single most common publishing source for PyPI (as well as the first IdP to give us all the features we needed). It has absolutely nothing to do with Microsoft; I don't think Microsoft (as a discrete corporate entity) was even aware of the work.

The feature was intentionally built to be extensible and to support additional publishers. I know this because I'm the one who made it extensible, and who's working on supporting them[1][2].

[1]: https://github.com/pypi/warehouse/issues/13551

[2]: https://github.com/pypi/warehouse/issues/13575


Does Maven have these problems? I haven't seen it if so.


Maven doesn't have these problems, but not because of namespacing itself. The adjacent thread has the details on that.


Usually with something like this it's the whole system that makes it work—namespacing alone may not be enough to solve all our package manager woes, but can a package manager solve them without namespacing?


The Java ecosystem doesn't have these problems for several reasons, not just namespacing:

1. The Java ecosystem has no way to distribute executable programs. Maven is only for libraries. There is literally no equivalent of "npm install" or "pip install". This is not a strength! IMHO it's one of the things that pushes people away and towards scripting languages. Being able to distribute programs as well as libraries is a very useful feature. It does, however, mean that there's very little value to be had in pushing malware to Maven Central because the instructions for how to run anything from it would be more complicated than just distributing the JARs yourself. State actors might want to compromise widely used libraries, but that's a different kettle of fish.

2. Not only can you not distribute programs but Maven resolvers don't execute any code. Dependencies are statically declared and the "installation" process just involves walking the dependency tree, downloading the JARs to a file system cache and then computing a list of those JARs for the build system. There's no equivalent of setup.py.

You might be wondering, how then do you distribute libraries that require native code? Well, the Java ecosystem has no direct support for that either. You'll have to pre-compile your native libs for each OS and CPU arch that users might want, distribute the libraries as data resources in your JAR, extract them to somewhere in the user's home directory at runtime and load them from there. (There's also no standard mechanism for this so every library rolls its own).

3. Maven Central (but not the clients) requires that all uploads are PGP signed. So, to be able to upload bad releases of existing libraries requires you steal a signing key.

4. The JVM ecosystem is oriented around larger but fewer libraries. Although individuals do create and upload libraries of course, the huge standard library means you don't need to rely on them so often. There is no leftpad library for Java. Many dependencies will come from Google, Apache, JetBrains, etc. When you do rely on individuals they tend to have been around a long time, their libraries are relatively well known, have a lot of entries in the issue tracker, long lived mailing lists and other unforgeable evidence of real-ness.

5. All operating systems come with Python built in these days, even Windows. In contrast no OS comes with Java out of the box. So if you just want people to run malware Python is a better bet because it's just one command, users won't have to install Python itself.

I don't think having a central server is that big a difference in the grand scheme of things, though it does mean someone owns the problem of clearing bad packages were one to get in.


Points 1-4 all sound like substantial strengths to me.

1. Who decided that the same package manager needed to be in charge of downloading libraries for CI pipelines and finished executables? Combining the two muddies the water and makes it unclear if a given package is going to provide a library or execute code or both. Bad actors thrive in confused spaces.

2. This comes down to the same point: I don't see why the package manager for coordinating dependencies in a software project ought to be the same thing as the package manager for installing applications in an OS. They're different use cases, usually different sets of users, and, as you note, different sets of requirements. Nothing good comes from a library being able to execute code as it is downloaded, so Maven lacks that feature because that's not the use case it's targeting.

3. Sounds fabulous. When is this feature coming to PyPI?

4. This doesn't seem like an incidental difference, it's the natural result of the design of the system, and a big part of the reason for the big-package culture is Maven Central. If the Python and JavaScript ecosystems made package ownership more explicit, the culture of thousands of code dependencies would be more obviously a culture of thousands of human dependencies, and I think people would be much more freaked out with the status quo.

5. See points 1 and 2 above: the Python and JS ecosystems make a muddled mess of the distinction between development and distribution, and the result is a dangerous hybrid package manager that is frustrating for either use case.


1. Agree that having a clear separation at the UI level is of value, but a program is a set of modules that depend on each other, no different to a library. Where you start needing very different things is when distributing software to end users who aren't developers, but there's a lot of small utility programs, demos etc that would benefit from being distributed the same way as libraries. There's a reason that so many dev frameworks these days start by telling you to install a framework-specific CLI tool to help you with things.

2. I'm thinking here of cases where the users aren't really different. Apps for developers, for example. Should have made that clear, sorry!

3. PyPI already tried to use PGP signatures and gave up. It doesn't add much security over just having a good authentication system on a central server. Note that nothing checks the signatures except, I think, Maven Central itself, so it doesn't help in case of server compromise.

We're in agreement that the situation with Python/JS (and to a lesser extent Ruby?) is a real mess, but I don't see it as due to any fundamental design choices. More like a set of fuzzy accidents of history. If someone added a new command to mvn or gradle tomorrow, for example, (1) might stop being true.


The answer has been there all along, since at least Maven (but probably earlier). When specifying dependencies in Maven, you also need to provide checksums. It's really hard to screw that up accidentally.
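The same idea exists on the Python side via pip's hash-checking mode, where a requirements file pins `somepkg==1.0 --hash=sha256:<digest>`. A minimal sketch of the verification step itself:

```python
# Sketch of hash-pinned artifact verification: hash the downloaded file
# and compare against the digest recorded in the lock/requirements file.
import hashlib

def verify_artifact(path: str, expected_sha256: str) -> bool:
    """Return True if the file's SHA-256 digest matches the pinned one."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

With this in place, the exact mirror or cache serving the bits matters much less, which is the point the comment is making.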

But, the real problem is that any such system needs moderators. Verifying submissions is a tedious and difficult task with a lot of responsibility, if we are talking about something the size of PyPI. If they want to properly process the volume of submissions they have today, they need an army...

On the one hand, 99% of all stuff stored on PyPI is absolute junk. On the other hand, people are used to there being dozens of versions of every package, and packages declare dependencies very liberally.

This all leads to a situation where there's a "frontier" of packages that can be installed together, but anything that lags 3-4+ versions behind the frontier is just taking up space. Even if it's not malicious, it's not installable anymore because there's no more support for the platform it was written for.

But this is not the end of it. There are tons of simply broken packages, where some part of the archive is missing, or is malformed etc.

Proper moderation would have you submit your code, have it reviewed, and then publish it, possibly rejecting it in between if the code doesn't meet requirements. Most people who publish their stuff to PyPI don't understand what needs to be in the package. Even packages like NumPy are full of useless junk.

But this will never happen, because the "community" grew used to the circus PyPI is. Python community doesn't care about the quality of the code, safety or anything other than a very shortsighted "make it work now" kind of goal.

There won't be a revision that purges junk or makes adding more junk harder. There will be another "quick fix", that will be obviously bad, but will kick the can down the road for a while longer.


A massive amount of people signing up with malicious intent (the Sybil attack) is a thing afflicting every single online service with a signup. I don't blame PyPI, but I think it's genuinely a hard problem to solve. Big Tech often combats this by running a bunch of heuristics and suspending accounts they detect to be bad, but of course we know how easy it is for even tiny false positives to blow up. I think we might come to the end of the era of free online accounts; only those accounts requiring a method of payment will continue to be viable.


Modern package registries implicitly treat all package authors as equally trustworthy—you don't need to know which organization wrote the `zip` package, you just need to know that there's a package by that name and it opens your zip files. It's charmingly egalitarian and quite convenient, but the global namespace both enables typosquatting and trains developers to not think at all about the people who write the stuff in their supply chain. The biggest point of failure in any system—the humans—has no first-class expression in the typical modern package registry.

I can't help but wonder if this is something Maven got right. Namespacing packages according to domain names makes typosquatting vastly more difficult to pull off and makes authorship of dependencies a first-class concern for end users. If something is under org.apache, that means something about the contents of the package, and PyPI and company don't front that information nearly as well as Maven does.


I know folks hate the centralization of identity management to the big identity providers, but as AI gets better and better at defeating captcha it's going to get harder and harder for the small, independent ones to operate reliably and securely.


"Never use your real name on that internet thing" will soon be a quaint throwback.


We're moving closer to "when you post on the internet, it is signed and big brother knows you did it. Watch your words"


We are already there since many years ago.


I wrote a lot of the systems to break captchas. Currently the "small" (really medium), independent ones are doing way way way better than the big identity providers. Google reCAPTCHA was particularly easy for us to get past.

Shameless plug: http://resolved.dominuslabs.com feel free to reach out for custom work - specializing in accessing gated sites for new search engines to do better web crawling, scraping public data, building better API integrations on top of existing services, etc.


Open source captcha libraries are non-viable: you can train a NN in a few hours with a simple CNN architecture; there are papers that describe how to do it.


We didn't have a lot of luck with simple CNNs on modern CAPTCHAs. But more advanced techniques work great on pretty much all of them.



Tragedy of the commons: you only need a few bad actors to ruin it all for us. Almost all distributors face this problem, from Docker Hub to PyPI. This also reminded me of the official Postgres Docker image running a cryptominer in the background [1]

[1] - https://github.com/docker-library/postgres/issues/770


It doesn't seem like the linked Github issue really backs up the claim that malware was getting shipped in the official image? The OP kept claiming that (even titling it `[Confirmed]`), but only one other person was able to corroborate the claim and their image hashes didn't match any of the hosted ones.


Reading that thread it doesn't seem like the official image shipped with any cryptominer at any point, and that it's more likely that the container got compromised in other ways. (A compromised [superuser connection to Postgres can execute shell code](https://medium.com/r3d-buck3t/command-execution-with-postgre...), so that seems more likely than the image shipping with a miner)


Several other issues also mention this problem across versions, with one issue mentioning that changing the default password fixed the issue.

It looks like the miner was installed because the Postgres port got exposed with a weak password.


That is a very interesting read. The maintainers seem more sane than Antender. I'm even sympathetic to many of his points, but he can't be sure a maintainer put a cryptominer in the image. Shouldn't be so fast to point fingers.

Are there any aggregators of interesting github issues? Links with a bit of context about them make for some fascinating reads.


I think this is the wrong problem they are addressing. They should accept that malicious users are a possibility; what is lacking is package authentication/signing.

Sign all packages. New users get the worst grade, users with a lot of downloads over a long period get a slightly better grade, and then have different levels of trusted, third-party-reviewed grades.

For new users' packages you get a big red warning that needs to be confirmed and bypassed in multiple ways, with a message like "ERROR: THIS PACKAGE IS FROM AN ILL-REPUTED SOURCE; IT IS LIKELY MALICIOUS AND HASN'T UNDERGONE HUMAN REVIEW". Kind of scream at you a little.


Codesigning is something that's actively being worked on for PyPI.

It won't have that kind of "noob" reputation scheme, though: those kinds of schemes (1) effectively punish newcomers for daring to contribute to an ecosystem, and (2) are incredibly hard to build trust policies around (what's the trust gradation between "new package" and "very old package with a new maintainer"?).


The grade is for the maintainer, not the package. It does punish newcomers, but that's the point: newcomers cannot be trusted. People don't search for packages directly on PyPI; they find them elsewhere and then use PyPI to install them. If you are a new maintainer, you build reputation elsewhere and ask users to ignore the warning on PyPI. After, say, 1k unique downloads and a 90-day probation, you move to the next level with a less serious warning. To avoid all that, you can purchase a code signing certificate and verify your identity (with PyPI, not just the CA), and even as a newcomer there will be no warning for your packages.

After observing many such systems my view is that you have to make up your mind and have a rigid and reliable reputation system or have none at all. The middle is the most dangerous place for everyone but I fear open source people like to repeat mistakes and sacrifice the safety of their users for the sake of ideology.

Newcomer+ no known reputation+unverified = warn people
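An illustrative sketch of that graded-warning idea; the 90-day and 1k-download thresholds come from the comment above, and everything else (names, levels) is invented:

```python
# Toy model of a maintainer reputation ladder: identity-verified
# maintainers get no warning, established ones a mild warning, and
# unknown newcomers a severe one.
from dataclasses import dataclass

@dataclass
class Maintainer:
    account_age_days: int
    unique_downloads: int
    identity_verified: bool  # e.g. via a code signing certificate

def trust_warning(m: Maintainer) -> str:
    if m.identity_verified:
        return "none"
    if m.account_age_days >= 90 and m.unique_downloads >= 1000:
        return "mild"    # probation passed: less serious warning
    return "severe"      # newcomer + no reputation + unverified

print(trust_warning(Maintainer(5, 0, False)))       # severe
print(trust_warning(Maintainer(120, 5000, False)))  # mild
print(trust_warning(Maintainer(1, 0, True)))        # none
```

The hard part, as the sibling comment notes, is defining sensible transitions (e.g. an old package that changes hands), which a simple ladder like this doesn't capture.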


I’ve seen multiple comments stating that namespacing solves (or at least mitigates) these kinds of attacks. Could someone kindly explain how? I’m sure it’s straightforward, but I don’t see it.


It doesn't, at least not directly. Namespaces make typosquatting more difficult, but they don't stop the other main incentives for spamming an index with inauthentic users and packages, i.e.:

1. Sneaking an inauthentic dependency into a tree somewhere;

2. Convincing less experienced users to install your package directly.

My understanding is that (2), in particular, is an increasingly common issue in cryptocurrency and other communities: inexperienced users typically talk on Discord and other chats, and may not fully understand that `pip install foo` essentially means "allow a random person to run code on my machine."


Can "pip install" actually run code? My understanding was that it doesn't, and a package needs to be imported by a script before anything gets run.


> Can "pip install" actually run code?

Indeed it can, and in fact has to by necessity in many cases: `pip install` boils down to (a variant of) `python setup.py install` for many packages, where `setup.py` is package-controlled arbitrary Python code.

This is also true for subcommands like `pip download`, as well as `pip install --dry-run`.

(There's some subtlety here, since this only applies to source distributions and not wheels. But source distributions are still the "baseline" way to distribute Python packages.)

> and a package needs to be imported by a script before anything gets run.

Even with wheels, this part is unfortunately not true: a package can install a `.pth` file[1], which can be used to auto-load code on Python interpreter startup, even if the malicious package itself is never imported directly.

[1]: https://docs.python.org/3/library/site.html
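The `.pth` behavior is easy to demonstrate with nothing but the stdlib, relying on the documented `site` module rule that any `.pth` line beginning with `import ` is executed at startup (the file name and message here are made up):

```python
# Demonstration: code in a .pth file runs when a directory is processed
# as a site dir, without the "package" ever being imported explicitly.
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "sneaky.pth"), "w") as f:
        # site.py executes any .pth line that starts with "import ".
        f.write("import sys; sys.stdout.write('pth code ran before any explicit import\\n')\n")
    # Fresh interpreter: registering the directory as a site dir is enough.
    result = subprocess.run(
        [sys.executable, "-c", f"import site; site.addsitedir({d!r}); print('done')"],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
```

In a real attack the `.pth` file is dropped into `site-packages` by the wheel itself, so the payload runs on every interpreter startup.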


I thought it runs setup.py


There is a long-term goal to remove setup.py in favor of letting "a thousand flowers bloom".

In modern Python setups, the pyproject.toml contains an entry for the 'build-backend'. There is a fall-back to 'setuptools.build_meta:__legacy__' because a large number of projects still use setup.py.

PyPI recommends hatchling, as the example used in https://packaging.python.org/en/latest/tutorials/packaging-p... ("You can choose from a number of backends; this tutorial uses Hatchling by default, but it will work identically with setuptools, Flit, PDM, and others that support the [project] table for metadata.")

These do not use setup.py.
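For illustration, the minimal `pyproject.toml` in the style of that tutorial, using Hatchling (the project name and version are placeholders):

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "example-package"
version = "0.0.1"
```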

I use setup.py because the new build backends don't seem to support C extensions all that well. My package also has Cython code, and some Python-based code generation for the C code.

Another package I know, RDKit, by default automatically downloads missing packages if not available. (These packages are provided as source from the respective authors.)

I think distribution people really don't want the build step to be able to run arbitrary code, but they face a long, uphill battle.


Just as a note: build backends still run arbitrary code. You’re exchanging a `setup.py` for an arbitrary Python package that specifies the build backend, and packages can specify their own backend (including one embedded in their own source).

Python as an ecosystem will probably never be able to fully remove build-time code execution from sdists, since native extensions fundamentally rely on it. The best we can do is unify and streamline them so that as many people upload wheels as possible.


Thank you. I don't think I was clear enough that the desire really doesn't match reality, in part because I'm not clear about that myself. I don't have a deep understanding of the new way of doing things, having only read about them whilst trying to understand if/when I should migrate from setup.py.

> including one embedded in their own source

I didn't know about that! I thought it needed to be a pre-build dependency.


It's a lot easier to typo a package (e.g. requests -> request results in the wrong package) than it is to typo a namespace-package (@namespace/requests -> @namespace/request would result in an error).

Somewhat likewise, namespaces can build trust in a way that single packages can't.


With namespacing you still run the risk of @namespace/requests -> @namsepace/requests typos.

These may not be all that obvious. For example, the name of the popular Rust HTTP client "reqwest" is an intentional typo. tokio vs tokyo can also be a less than obvious typo.


On the other hand, this opens up two potential places to typo: both the name and the namespace.

This happens often to me on GitHub, you keep browsing some code, only to later realize you are in someone’s personal fork of the official project. Especially when the original author isn’t a big organization, you only have two equally arbitrary usernames as namespaces to compare against.


If the namespace is handled correctly, you won't accidentally download a malicious dependency if you get the namespace right. You'll get an error, but that's what you want for that kind of typo.

The cryptic username problem is indeed a bother, but often popular projects will use readable names for their Github accounts. There's no real fix for that if you depend on a much smaller project. When you think you may be running that risk, you should probably wonder if it's a good idea to depend on a project made by someone small enough not to be instantly recognizable within your specific programming niche.


Ah right! It was indeed straightforward — thank you!


I like the simple solution that arXiv uses (and it allows anyone to publish a pre-print, with orders of magnitude more content than PyPI has packages). If you want to publish for the first time, you have to get a referral from a well-established author on the platform.

And of course, things might be easier because you need to register using your real identity, and that is expected. This is different for PyPI, which until now has not expected people to have their real identities linked to their accounts.


I hope they get a handle on this and don't do anything else to alienate data portability folks like making Microsoft the only "Trusted Publisher".

https://news.ycombinator.com/item?id=35646436


there are no plans to limit trusted publishers. in fact there is another in the works: https://github.com/pypi/warehouse/issues/13551


I've been very cautious the last couple of years due to these bad actors when looking at packages that might suit my needs. If there is no online presence of the source code (git anything, zips/gzs, etc.), if multiple packages were submitted in a short time frame or in a greater-than-normal amount, and/or if it's a derivation/plugin of a popular package, it's usually a no-go.

For those I do tentatively trust, I then download the package (`pip download`) and review it. A quick regex for URLs or exec() calls helps, but I should probably use something like guarddog (https://github.com/DataDog/guarddog)
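That quick regex pass can be sketched in a few lines. This is just my own illustration of the idea (the function names and patterns are mine, and a real tool like guarddog does proper AST-based analysis instead):

```python
import re
from pathlib import Path

# Crude first-pass patterns for triaging an unpacked package.
SUSPICIOUS = [
    re.compile(r"https?://[^\s'\"]+"),      # embedded URLs
    re.compile(r"\b(?:exec|eval)\s*\("),    # dynamic code execution
    re.compile(r"base64\.b64decode\s*\("),  # common obfuscation step
]


def find_suspicious(source: str) -> list[str]:
    """Return every suspicious-looking match in the given source text."""
    hits = []
    for pattern in SUSPICIOUS:
        hits.extend(pattern.findall(source))
    return hits


def scan_tree(root: str) -> None:
    """Print suspicious hits for every .py file under an unpacked package."""
    for path in Path(root).rglob("*.py"):
        hits = find_suspicious(path.read_text(errors="ignore"))
        if hits:
            print(path, hits)
```

Obviously this only flags the lazy cases; anything obfuscated beyond a literal URL or exec() call sails right through.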


For what it's worth: `pip download` is capable of running arbitrary package-defined code[1], by design. You shouldn't use it as a security boundary.

If you're trying to statically analyze a distribution before doing anything else with it, you should download it directly from the PEP 503[2] simple index.

[1]: https://yossarian.net/res/pub/hushcon-west-2022.pdf

[2]: https://peps.python.org/pep-0503/
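To illustrate, a PEP 503 project page is just HTML anchors whose text is the filename and whose href points at the file. A minimal stdlib-only sketch of parsing one (the class name is mine):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SimpleIndexParser(HTMLParser):
    """Collect (filename, url) pairs from a PEP 503 'simple' project page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[tuple[str, str]] = []
        self._href = None  # href of the <a> tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # PEP 503 pages put the filename as the anchor text.
        if self._href and data.strip():
            self.links.append((data.strip(), urljoin(self.base_url, self._href)))
            self._href = None
```

Feed it the body of e.g. https://pypi.org/simple/some-project/ fetched with urllib.request.urlopen, pick the sdist you want, and download that file directly; nothing from the package ever executes on your machine.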


Good to know, I was not aware of this. Thank you!


> If there is no online presence of the source code

Could this be utilized the other way around? Any package published using an official GitHub action X would automatically be signed as having been built from source, with a link to that source at a given revision.

Of course that source could contain bad contents, or download even more bad contents, but at least it establishes some level of supply chain, where the first layer of source can be inspected.


> Could this be utilized the other way around? Any package that was published using official GitHub action X, would automatically be signed as being created from something with source and link to source at given revision.

That's exactly the idea behind Sigstore[1] -- GitHub Actions (or any other CI system with an OIDC credential) can be leveraged for "machine identities", which in turn are bound to ephemeral signing keys. The end result is a scheme where users never have to touch or maintain a signing key while still getting all the benefits of a normal PKI-esque codesigning scheme.

[1]: https://www.sigstore.dev/


I know `pip search` has been disabled on PyPI for a while now due to a constant deluge of requests overwhelming the backend. Knowing that, and seeing this update, makes me think that PyPI is subject to an exceptional volume of malicious activity.

Does anyone know if it's true that PyPI receives more abuse than similar projects and, if so, why?


As someone who straddles both the Python and Java worlds, I don't seem to hear about this kind of malicious-package problem with Maven (Central) the way I do with PyPI. What could be the difference? Perhaps the Java world is full of old farts like me doing mundane things with a handful of well-established libraries.


namespacing? what's that


PyPI could integrate with Google/Microsoft/Apple as an authentication system (OAuth?).

Almost everyone has one of these IDs and it's hard enough to register new ones.


> PyPI could integrate with Google/Microsoft/Apple as an authentication system (OAuth?).

PyPI supports "trusted publishing,"[1] which provides a variant of this: it doesn't replace a user identity, but instead allows a platform (currently just GitHub, but support for others is on the way) to mint API tokens on a project's behalf.

Binding PyPI identities to well-known IdPs would address some of the problems here, but also introduces new ones: it creates a new kind of account lockout state (users who lose access to their IdP service, for whatever reason), introduces regulatory and data collection concerns, may prove excessively restrictive to users in countries with filtered Internet access, etc.

[1]: https://blog.pypi.org/posts/2023-04-20-introducing-trusted-p...
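For concreteness: on the GitHub side, a trusted publisher is just an ordinary workflow that has been granted OIDC permissions and registered against the project on PyPI. A sketch along the lines of the announcement (names illustrative, not a complete release pipeline):

```yaml
# .github/workflows/publish.yml
# The project must first be configured on PyPI to trust this repo/workflow.
name: Publish to PyPI
on:
  release:
    types: [published]

jobs:
  pypi-publish:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # required for the OIDC token exchange with PyPI
    steps:
      - uses: actions/checkout@v3
      - run: python -m pip install build && python -m build
      # Exchanges the workflow's OIDC token for a short-lived PyPI API token.
      - uses: pypa/gh-action-pypi-publish@release/v1
```

No long-lived API token or password ever lives in the repository's secrets.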


And all of this is already a thing for NuGet?


All the spam I receive directly from gmail.com accounts kind of disproves your point.


That's what Rust does: you can only register with crates.io using a GitHub account.



