> Our current CDN "costs" are ~$1.5M/month and not getting smaller. This is generously supported by our CDN provider but is a liability for PyPI's long-term existence.
Conceptually, I love the idea of bittorrent distribution for binary packages like this. In practice, "oh no, my CI docker builds :(" Too much potential variability.
I just want a turnkey way to add multiple hosts/resolvers and use content-addressing. Like bittorrent but with passlists. Is that a thing?
Like IPFS, but for specific use-case domains. I'd be more than willing to stand up some sort of pip proxy for my company's domain, but I'd only want it hosting the packages we use internally.
If pip had native IPFS support, you could do that by just pinning the packages you use on the local node.
You could also put an IPFS gateway behind normal HTTP(S) access-control techniques and use a stock pip, deferring to the swarm for packages you haven't pinned. Stopping your node from sharing its data with other nodes would be unhealthy for the swarm, though, as that'd lose the torrent-style swarm effects that take load off the initial uploader (PyPI).
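A minimal sketch of that local-node setup, assuming packages were actually published to IPFS (the CID and the /ipns/ index path below are hypothetical):

    ipfs daemon &                          # local node; HTTP gateway listens on :8080 by default
    ipfs pin add QmExampleWheelCID         # pin a wheel you depend on so your node keeps it
    # point pip at the local gateway (the index path is made up for illustration):
    pip install --index-url http://127.0.0.1:8080/ipns/pypi.example/simple/ requests

Anything you've pinned is served locally; anything else falls through to the swarm.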
There should really be a transparent pip proxy; then I could point the config of all my CI machines at it, and `pip install foo` would do the right thing and install from that cache. I just want a single binary I can run with zero configuration that talks to PyPI and caches packages locally. It would save so much bandwidth.
This is one of the key things Artifactory can do (and it can do it for basically any repository type, not just PyPI). It's not so straightforward to set up, however.
In the past I've run this in a Docker container, set up some pip environment variables, and it's off to the races. It transparently caches stuff from PyPI and keeps things local.
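The environment-variable part is just pip's standard index override; something like the following, where the Artifactory host and repository name are hypothetical:

    # Artifactory exposes PyPI remote repositories under /api/pypi/<repo>/simple
    export PIP_INDEX_URL=https://artifactory.example.com/artifactory/api/pypi/pypi-remote/simple
    pip install requests    # first fetch populates the proxy's cache; later fetches stay local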
Most CI vendors offer cache directories that persist between builds; point the pip cache directory at one of those. Failing that, I made a proxy server that can work transparently if you can set default environment variables: check out proxpi.
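For the cache-directory route, pip honors `PIP_CACHE_DIR`; `$CI_CACHE_DIR` below is a stand-in for whatever persisted path your CI vendor provides:

    export PIP_CACHE_DIR="$CI_CACHE_DIR/pip"    # directory the CI vendor persists between builds
    pip install -r requirements.txt             # cache hits skip the download from PyPI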
proxpi is exactly what I was looking for, thank you! The straightforward Docker installation is especially great; I can have it running with one command.
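For reference, that one command plus the pip side is roughly the following (sketched from the proxpi README; check the project for the current image, port, and index path):

    docker run -d -p 5042:5042 epicwink/proxpi
    pip install --index-url http://127.0.0.1:5042/index/ requests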
Such things certainly exist; we had one at my previous employer, though I don't remember if it was something in-house, an Artifactory feature, or something open source.
IPFS sounds like the ideal solution here, in particular because it would mean you could set up a local node to act as a cache and pin any packages you rely on.
I'm eagerly waiting for IPFS-based package distribution channels. They would be great both for local caches and, potentially, for getting closer sources in countries without great mirrors.
Termux (Android app, giving a mostly-normal terminal without root) already uses IPFS.
Requiring IPFS for fetching packages/wheels over a certain size threshold from PyPI seems reasonable.
Just have PyPI offer collaborative clusters[0] for a few common situations, and maybe work with the IPFS devs to get collaborative clusters to support pinning only a subset by path (like, only pinning the packages your local CI depends on).
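A follower node joining such a cluster could be as simple as the line below; `ipfs-cluster-follow` is a real tool, but the cluster name and template URL are hypothetical, and the per-path subset pinning wished for above doesn't exist today:

    ipfs-cluster-follow pypi-mirror run --init https://pypi.example.org/cluster-template.json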
IPFS would be the ideal solution to so many things if it worked well. As it stands, I don't think it would be a solution to anything; it can hardly discover content on other peers in less than minutes...
I didn't downvote you but I can attempt to offer an explanation:
When reading the first part of your post, I read it as implying that the person you replied to was trying to push a secret agenda or something with IPFS. That's on me of course, and you seem to have no such intentions, but that's how I read it. Also, sarcasm can be hard to understand for some people, and even harder in text form.
I think including your "Joking tone aside, that actually seems like an ideal application for it." in your initial comment would have helped. It indicates that what came before was in part a joke, and informs people of your real position.
> includes continuous integration scripts that run quite frequently
This use case is something I believe they could charge for if they need to cover infra costs, the same as Docker Hub started doing: if someone fails to cache properly in their CI and wants to re-download things from the internet, they should pay for that.
> Continuous Integration automated build and testing services can help reduce the costs of hosting PyPI by running local mirrors and advising clients in regards to how to efficiently re-build software hundreds or thousands of times a month without re-downloading everything from PyPI every time.
[...]
> Request from and advisory for CI Services and CI Implementors:
> Dear CI Service,
> - Please consider running local package mirrors and enabling use of local package mirrors by default for clients’ CI builds.
> - Please advise clients regarding more efficient containerized software build and test strategies.
> Running local package mirrors will save PyPI (the Python Package Index, a service maintained by PyPA, a group within the non-profit Python Software Foundation) generously donated resources. (At present (March 2020), PyPI costs ~$800,000 USD a month to operate, even with generously donated resources.)
Looks like the current figure (~$1.5M/mo) is significantly higher than that $800K/mo.
How to persist ~/.cache/pip between builds with e.g. Docker, to minimize unnecessary re-downloads of huge GPU packages (the `pip install` commands and the `uid` below are illustrative):

    RUN --mount=type=cache,target=/root/.cache/pip \
        pip install -r requirements.txt
    RUN --mount=type=cache,target=/home/appuser/.cache/pip,uid=1000 \
        pip install -r requirements.txt
From an open-source developer's perspective, whether we hit PyPI (a third-party free service) or the cache provided by the CI service (a third-party free service) doesn't seem very different.
AFAIU `RUN --mount=type=cache` is specific to moby (Docker Engine) with `DOCKER_BUILDKIT=1`, though buildah does support a build-time volume mount option, and moby easily could as well.
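The buildah route would look something like this (the `--volume` flag is real; the paths and image tag are illustrative):

    # BuildKit: enable it and build with the cache-mount Dockerfile above
    DOCKER_BUILDKIT=1 docker build -t myimage .
    # buildah: bind a host cache directory into the build instead
    buildah bud --volume "$HOME/.cache/pip:/root/.cache/pip" -t myimage .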
This might be the CDN list price. When you get to higher levels of traffic, you can usually negotiate and commit to a certain level of spend over the next X years, and in exchange the per GB cost falls by at least an order of magnitude.
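For a rough sense of scale, assuming a list price on the order of $0.08/GB: $1.5M/month corresponds to roughly 19 PB of egress a month, and at a negotiated ~$0.008/GB the same traffic would bill closer to $150K/month.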
Then again, it might not be - the number of CI setups out there that download the entire universe every 5 minutes...
If it's 'generously covered by the CDN', then I suppose it must be list price; what else would they do? 'We'd probably agree to a 30% discount if you negotiated hard with this load, so that comes to...'?
$1.5M/month is roughly the fully loaded cost of 5-15 software engineers, depending on seniority. Given that this is one of the most popular software repositories for one of the most popular languages, it's not actually a lot of money. It isn't cheap, but it isn't expensive either.
Wow