Hacker News
What to do about GPU packages on PyPI? (python.org)
124 points by polm23 4 months ago | hide | past | favorite | 106 comments

Yeet them into the sun. I say that as someone whose job benefits greatly from the convenience of easily pip-installable GPU wheels, and who has spent many hours fighting to compile GPU libraries from scratch.

I think it's the wrong abstraction. It encourages ridiculous amounts of bloat and duplication across bandwidth, file systems, virtual environments, and Docker images. And it's not actually all that portable.

It encourages blindly installing unauditable giant binary blobs which run with all kinds of leaky abstractions. But boy howdy are they convenient.

I think "Allow external hosting" is the best case here. The first point of user expectation on PEP 470 is:

> Easily allow external hosting to "just work" when appropriately configured at the system, user or virtual environment level.

That's already out of the window once GPGPU is involved. The ecosystem is much better than it used to be, but there is still no plug-and-play GPGPU on Linux without hiccups.

Let PyPI be the namespace, registry and lookup, and let users pass an --allow-external again, or (less optimal imho) do something like Conda channels. Improve the user experience for integrating with external hosts.
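pip already has most of the plumbing for this today via extra indexes; a minimal sketch of what pointing at an external GPU-wheel host could look like (the host URL is hypothetical):

```ini
# ~/.config/pip/pip.conf (user level; also settable per virtualenv)
[global]
# PyPI remains the namespace and registry; the extra index hosts the
# large GPU wheels
extra-index-url = https://gpu-wheels.example.org/simple
```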

Pip is unlike NPM, Cargo, go get, etc because Python is much more intimate with linked libraries. It's unlike Brew because it's way more cross-platform. It's much closer to apt-get in this regard, and its challenges are different. Apt-get has sources lists, and I think that makes some sense for Python as well.

This is why Spack supports external packages. For any package, you can tell spack the prefix where it lives and build with that install prefix instead of a spack-built version:


People use this to tell Spack where the system CUDA libs or system MPI are.
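For reference, externals are declared in Spack's packages.yaml; a sketch with illustrative versions and prefixes:

```yaml
# packages.yaml: use the system CUDA instead of building one
packages:
  cuda:
    externals:
    - spec: cuda@11.4.2
      prefix: /usr/local/cuda-11.4
    buildable: false   # never build CUDA from source
```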

> Yeet them into the sun

Wow, gen Z lingo has come into HN faster than I expected. Can someone translate this to traditional English? The definition I find for 'yeet' ("an exclamation of excitement") isn't particularly enlightening...

Update: I see, thanks all!

As I am solidly a millennial, I guess I'll take it that I'm keeping hip to the lingo of kids these days :)

My definition of "Yeet" is to fling/throw/hurl something, typically with great vigor. Urban Dictionary defines it as "To discard an item at a high velocity." Derives from the Latin Iacio, From Proto-Italic jakjō‎, from Proto-Indo-European (H)yeh₁-‎ ("to throw, let go"). JK I just made that up, but it's actually plausible.

Never heard the "exclamation of excitement" definition.

From another millennial: the only detail I'd add is that "yeet" is less about pure force and more about the impulsive, imprecise nature of the action. Throwing a basketball into a hoop wouldn't be described as "yeeting", except perhaps if it were by complete accident.

Right, if precision is involved, it's "LeBron".

Yeet is for distance/power. Kobe is for accuracy/precision.

Also worth mentioning that the past tense of yeet is widely agreed to be yote or yoted as in:

"The first empty soda can was yote many moons ago"

"widely agreed upon" is debated.

According to Wiktionary:


"1. simple past tense and past participle of yeet" [1]

Worth noting there is a fair amount of discussion on whether the past tense is "yeeted" or "yote", or even "yaught": https://en.wiktionary.org/wiki/Talk:yeet

See also: https://linguistics.stackexchange.com/questions/28300/what-i...

[1]: https://en.wiktionary.org/wiki/yeeted#English

(I do think it would be pretty funny if the past tense ended up as "yote", though.)

The ablaut conjugation yeet/yote/yitten is hilarious and gets my vote but it seems like yeeted will win out.

I particularly enjoy “yote” as it sounds so formal/medieval.

Nahh it's definitely yeeted.

I've never seen a zoomer say yote. Yeeted on the other hand...

I've seen and heard yoted many times before, because double past tensening a word adds comedic effect.

> Derives from the Latin Iacio, From Proto-Italic jakjō‎, from Proto-Indo-European (H)yeh₁-‎ ("to throw, let go"). JK I just made that up, but it's actually plausible.

I think you're way overestimating the people who come up with new words :P

A derivation from iacio via jettison doesn't sound too implausible.

Hahaha I see, thanks. It reminds me of defenestration!

To yeet is to throw enthusiastically, happily and without care.

My 20-year-old girlfriend informs me it means to fling or throw but with a comical or exuberant effort.

FWIW, it already made it into the Oxford dictionary:


I believe the "Yeet!" Vine is the source of the usage.


Amazing. Thanks!

s/yeet/toss dramatically/

Unpopular opinion: wheels are a bad idea in general for the same reasons.

I'm in a devops role where we actually reroll the Tensorflow whl in-house (to get a few tweaks like specific AVX flags turned on), but because the rest of our deployment is apt/debs, we then turn around and wrap that whl in a deb using Spotify's excellent dh-virtualenv:


There's no expertise for Bazel in-house; when we run the build, it seems to fail all its cache hits and then spend 12-13h in total compiling, much of which appears to be recompiling a specific version of LLVM.

Every dependency is either vendored or pinned, including some critical things that have no ABI guarantees like Eigen, which is literally pinned to some random commit, so that causes chaos when other binaries try to link up with the underlying Tensorflow shared objects:


And when you go down a layer into CUDA, there are even more support matrices listing exact known sets of versions of things that work together:


Anyway, I'm mostly just venting here. But the whole thing is an absurd nightmare. I have no idea how a normal distro would even begin to approach the task of unvendoring this stuff and shipping a set of reasonable packages for it all.

The Tensorflow maintainers themselves even kind of admit the futility of it all when they propose that the easiest thing to do is just install your app on top of their pre-cooked Tensorflow GPU docker container:


FYI, I also rebuild TensorFlow in-house, hate it, and hate it a bit less now that I've realized that you can get 200+ vCPU machines in the cloud without difficulty these days. It takes about 12 hours on my workstation too, but it takes just under half an hour on an n2d-highcpu-224, for an effective cost of under $4 per build. I have a VM I power on whenever I need to rebuild TensorFlow and power off once I upload the resulting wheels. Getting my edit-compile-test-scream cycle down to under a day is definitely worth a couple of dollars to the business.

(The irony of paying Google to compile their software is not lost on me.)

> Every dependency is either vendored or pinned, including some critical things that have no ABI guarantees like Eigen, which is literally pinned to some random commit,

starts sweating from GDAL PTSD

I hadn't heard about dh-virtualenv. That looks super convenient, I'll definitely give that a shot.

Honestly I've been kinda leaning into Conda-in-Docker images (so you can isolate while you isolate, dawg), which actually isn't so bad with a custom SHELL directive so you can RUN in a conda env. So many of my customers use it and provide algos developed in conda that it's often just easier to grab their environment.yml and stick it all in a container for deployment.

The pre-cooked TF images are a godsend, but the instant you need two different versions of TF in the same pipeline, you're hosed. I've been playing with Argo Workflows to get around this.

ML software packaging is the worst.

> two different versions of TF in the same pipeline

Or if you just need to integrate with anything else whose deployment model is "lol here's a container."

Check out the tensorflow package in Spack. We patch Bazel slightly to allow external dependencies/library paths, and the dependencies are factored out into their own packages:


Spack can build with two levels of parallelism (within a build and among builds) on a single node or multiple nodes with a shared filesystem:


You can also generate GitLab pipelines of your build, which will break the jobs out as well:


You should really try out the Spack package manager: feed it a very simple description of what you want to install, and it will manage all these constraints between packages, variants, and dependencies for you: https://github.com/spack/spack

Looks a bit nixy, but without the whole input-addressed hash thing.

You might be interested in this talk from FOSDEM a few years ago comparing nix, spack, conda, and guix


And with dependency resolution (spack calls it “concretization”), which nix and guix do not have.

+1 for Spack and also EasyBuild.

Spack did have a lot of problems with TensorFlow for years... but yes, in general it works best nowadays, and with clingo coming up I guess it will eventually beat NixOS (because hopefully browser-like dependency trees will become fast then).

(And I still don't get the advantage of an advanced build system like Bazel, where I first have to build the right version of the build system itself (including all the patches that crop up with new libcs in vendored dependencies), then it takes ages to compile compared to any similar software, and I still have to fix the libc bugs anyway...)

Bazel is for the Google-scale companies of the world who must ship petabytes around the internet. Mortals do not use Bazel - a cluster of 80 core CPUs does, and the user just downloads the cached build results.

It is extremely powerful but also extremely tedious, like programming in MPI.

> Our current CDN "costs" are ~$1.5M/month and not getting smaller. This is generously supported by our CDN provider but is a liability for PyPI's long-term existence.


Maybe distribution via bittorrent is an alternative. It wouldn't be better performance than a good CDN but it distributes the costs better.

Conceptually, I love the idea of bittorrent distribution for binary packages like this. In practice, "oh no, my CI docker builds :(" Too much potential variability.

I just want a turnkey way to add multiple hosts/resolvers and use content-addressing. Like bittorrent but with passlists. Is that a thing?

Like, IPFS, but I want specific use-case domains. Like I'd be more than willing to stand up some sort of pip proxy for my company's domain but I'd only want it hosting packages used internally.

If pip had native IPFS support, you could do that by just pinning the packages you use on the local node.

You could also have an IPFS gateway with just normal HTTP(S) access control techniques and a normal pip, deferring to the swarm for packages that you haven't pinned. Stopping your node from sharing its data with other nodes would be unhealthy for the swarm, as you'd lose the torrent-style swarm effects that unload the initial uploader (PyPI).

There should really exist a transparent pip proxy, then I can set the config of all my CI machines to it and `pip install foo` would do the right thing and install from that cache. I just want a single binary that I can run with zero configuration and have it talk to PyPI and cache packages locally. It would save so much bandwidth.

This is one of the key things which Artifactory can do (and it can do it for basically any repository type, not just PyPI). It's not so straightforward to set up, however.

I have yet to discover a more-than-default-repository setup that is straightforward when using Artifactory.

You are in luck, such a tool exists:


In the past I've run this in a Docker container, set up some pip environment variables, and it's off to the races. It transparently caches stuff from PyPI and keeps things local.

Most CI vendors offer caching directories between builds, and then you set the pip cache directory. Failing that, I made a proxy server that can work transparently if you can set default environment variables: check out proxpi
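For anyone who wants to try this, a sketch based on proxpi's README (the image name and port here are assumptions taken from that README, not verified):

```shell
# Assumed from proxpi's README: the image listens on port 5000 and serves
# the simple index at /index/. The docker line needs a running Docker
# daemon, so it is commented out here.
# docker run -d --name proxpi -p 5000:5000 epicwink/proxpi

# Point pip at the proxy; cache misses fall through to PyPI.
export PIP_INDEX_URL="http://localhost:5000/index/"
echo "$PIP_INDEX_URL"
```

Setting `PIP_INDEX_URL` on every CI machine is usually the least-invasive way to make `pip install foo` transparently hit the cache.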

proxpi is exactly what I was looking for, thank you! The straightforward Docker installation is especially great, I can have it running with one command.

Such things certainly exist, we had one in my previous employer though I don't remember if it was something in-house, an artifactory feature, or something open source.

Hmm, if you were using Artifactory, it sounds like that's what was doing that. I'm not aware of a dead simple, plug-and-play thing, though.

Extra variability could be an incentive for CI builds to do their own caching.

Nexus? We do this with Nexus.

IPFS sounds like the ideal solution here. In particular because it would mean you could set up a local node to act as a cache, and pin any packages you rely on.

I'm eagerly waiting for ipfs-based package distribution channels. They would be great both for local caches and potentially getting closer sources in countries without great mirrors.

Termux (Android app, giving a mostly-normal terminal without root) already uses IPFS.

Requiring IPFS for packages/wheels over a size threshold on PyPI seems reasonable.

Just have PyPI offer collaborative clusters[0] for a few common situations, and maybe work with the IPFS devs to get collaborative clusters to support only pinning a subset by path (like, only pinning the packages you depend on for local CI).

[0]: https://cluster.ipfs.io/documentation/collaborative/

IPFS would be the ideal solution to so many things if it worked well. As it stands, I don't think it would be a solution to anything, it can hardly discover content on other peers in a timeframe less than minutes...

If it works for Ubuntu, why not?


"Hmm, how can I get people to start using IPFS..."

Edit: Not sure about the downvotes. Joking tone aside, that actually seems like an ideal application for it.

I didn't downvote you but I can attempt to offer an explanation:

When reading the first part of your post, I read it as implying that the person you replied to is trying to push a secret agenda or something with IPFS. That's on me of course, and you seem to have no such intentions, but that's how I read it. Also, sarcasm can be hard to understand for some people, and even harder in text form.

I think including your "Joking tone aside, that actually seems like an ideal application for it." in your initial comment would have helped. It indicates that what was before was in part a joke, and inform people of your real position.

Ah, got it. No, I was suggesting that IPFS may be a legitimate option here.

Bear in mind this isn't just end-users installing on their machines, it also includes continuous integration scripts that run quite frequently.

> includes continuous integration scripts that run quite frequently

This use case is something I believe they could charge for if they need to cover infra costs. Same as Docker hub started doing - if someone fails to cache properly in their CI and wants to redownload things from the internet, they should pay for that.

[Huge GPU] packages can be cached locally: persist ~/.cache/pip between builds with e.g. Docker, run a PyPI caching proxy,

"[Discussions on Python.org] [Packaging] Draft PEP: PyPI cost solutions: CI, mirrors, containers, and caching to scale" https://discuss.python.org/t/draft-pep-pypi-cost-solutions-c...

> Continuous Integration automated build and testing services can help reduce the costs of hosting PyPI by running local mirrors and advising clients in regards to how to efficiently re-build software hundreds or thousands of times a month without re-downloading everything from PyPI every time.


> Request from and advisory for CI Services and CI Implementors:

> Dear CI Service,

> - Please consider running local package mirrors and enabling use of local package mirrors by default for clients’ CI builds.

> - Please advise clients regarding more efficient containerized software build and test strategies.

> Running local package mirrors will save PyPI (the Python Package Index, a service maintained by PyPA, a group within the non-profit Python Software Foundation) generously donated resources. (At present (March 2020), PyPI costs ~ $800,000 USD a month to operate; even with generously donated resources).

Looks like the current figure is significantly higher than $800K/mo for science.

How to persist ~/.cache/pip between builds with e.g. Docker in order to minimize unnecessary GPU package re-downloads:

  RUN --mount=type=cache,target=/root/.cache/pip \
      pip install -r requirements.txt

  RUN --mount=type=cache,target=/home/appuser/.cache/pip \
      pip install -r requirements.txt

From an open-source developer's perspective, whether we hit PyPI (a third-party free service) or the cache provided by the CI service (a third-party free service) doesn't seem very different.

A 7TB local cache within the CI service is much cheaper to host, since it doesn't have CDN concerns, and bandwidth is much cheaper "within the infra".

Wow, that RUN trick is exactly what I've been looking for! I've spent hours and hours in Docker documentation and hadn't seen that functionality.

Looks like it might be buildkit-specific?

AFAIU `RUN --mount=type=cache` is specific to moby (Docker Engine) with `DOCKER_BUILDKIT=1`, though buildah does support a build-time volume mount option and moby easily could as well:

"build time only -v option" https://github.com/moby/moby/issues/14080

"Build images with BuildKit" https://docs.docker.com/develop/develop-images/build_enhance...

This might be the CDN list price. When you get to higher levels of traffic, you can usually negotiate and commit to a certain level of spend over the next X years, and in exchange the per GB cost falls by at least an order of magnitude.

Then again, it might not be - the number of CI setups out there that download the entire universe every 5 minutes...

If it's 'generously covered by the CDN' then I suppose it must be list price, what else would they do? 'We'd probably agree to a 30% discount if you negotiate hard with this load, so that comes to..'?

It sounds like that's a retail price. I wonder what the actual production cost to the CDN is.

$1.5M/month is about 5-15 software engineers, depending on seniority. Given that this is one of the most popular software repositories of one of the most popular languages it's not actually a lot of money. It isn't cheap but isn't expensive either.

$18mm/year for 5-15 software engineers? Seems just a bit excessive. (maybe you confused the monthly vs yearly rate)

I have confused that ...

The fundamental issue here is that Python users expect things to be install-and-go. That has actually never been the case, but it's got more common over time - before wheels came along, you ended up getting compilation errors all the time from missing packages with pip. Enthought Canopy and later Conda came along and tried to fix that, and then later pip implemented binary wheels, but wheels are often not a great solution because, as someone in this thread said, every package ends up including duplicate stuff, so you can end up with like 5 versions of OpenBLAS and whatever other C/C++/Fortran dependencies your packages depend on. Intel released an MKL package onto PyPI which has reduced a lot of that somewhat, but it is obviously a bit problematic, because lots of things now use this instead of OpenBLAS and fail to compile on other architectures (cough, PyTorch, cough).

The issue we now have is that scientific users almost always turn to Conda. Scientific authors have also increasingly turned to Conda, because it solves a lot of distribution problems for them - it's a lot easier to say to people 'conda install -c conda-forge XYZ' than 'install the CUDA toolkit >v10.1, then download dependency X and install it from PyPI.' Look, for example, at Software Carpentry, which aims to teach Python to researchers, or any undergrad course in Python outside of perhaps comp sci: these days they almost always tell people to use Conda. In addition, it's so much easier from a build perspective to add packages to conda-forge than it is to build the various types of wheels you need for PyPI. The manylinux instructions aren't even particularly up to date.

To me, a lot of this stems back from Guido basically saying to the scientific community 'it sounds like your problems are different to those of a lot of the Python ecosystem' but without seeing that the whole compiled-dependency thing affects everyone at some point. I'm not sure what the fix is - do we want people to be able to use PyPI to bundle compiled external libraries? It sounds like you'd just end up replicating Conda at the end of the day.

For me personally, this fragmentation causes a lot of issues because I help users install packages into our HPC cluster. I need to be able to install things from source, and PyPI usually offers that, but increasingly users bring packages with a chain of dependencies that have no source package on PyPI and a long list of Conda dependencies. I don't blame package authors at all - they are doing what is easiest for them and most of their users - but I do think it needs a lot of thought and work by the PyPA.

Most commonly rdkit, where I've never managed to build all the dependencies.

For what it's worth, I have multiple RDKit instances on my Mac, so I can test chemfp against different toolkit releases. It's a headache! Here's my notes from installation - I have to keep notes as hints for newer releases:

  /Users/dalke/local/bin/python3.9 -m venv ~/venvs/py39-2021-4
  source ~/venvs/py39-2021-4/bin/activate.csh

  pip install numpy scipy matplotlib pandas

  cd ~/ftps
  tar xf Release_2021_03_1.tar.gz
  cd rdkit-Release_2021_03_1
  mkdir build_py39-2021-4
  cd build_py39-2021-4/

  cmake .. \
   -DCMAKE_INSTALL_PREFIX=/Users/dalke/venvs/py39-2021-4 \
   -DCMAKE_INSTALL_RPATH="/Users/dalke/venvs/py39-2021-4/lib/;/Users/dalke/local/lib/" \

  make -j4
  make install
and then manually edit (!) activate.csh so the DYLD_LIBRARY_PATH is updated:

  alias deactivate 'test $?_OLD_VIRTUAL_PATH != 0 && setenv PATH "$_OLD_VIRTUAL_PATH" && unset _OLD_VIRTUAL_PATH && setenv DYLD_LIBRARY_PATH "$_OLD_DYLD_LIBRARY_PATH" && unset _OLD_DYLD_LIBRARY_PATH; rehash; test $?_OLD_VIRTUAL_PROMPT != 0 && set prompt="$_OLD_VIRTUAL_PROMPT" && unset _OLD_VIRTUAL_PROMPT; unsetenv VIRTUAL_ENV; test "\!:*" != "nondestructive" && unalias deactivate'


  if (`printenv DYLD_LIBRARY_PATH` == '') then
You'll note that I'm a tcsh user ('cause that was kewl in the 1990s when I started in Unix) and my directory structure uses "~/ftps" for things I've manually downloaded (ditto), and I'm probably using venv poorly.

Oh, but am I done? No! See the "/Users/dalke/local/lib/" in the CMAKE_INSTALL_RPATH?

That's because I installed my own Python, and Boost:

  cd ~/ftps
  curl -O 'https://www.python.org/ftp/python/3.9.1/Python-3.9.1.tar.xz'
  tar xf Python-3.9.1.tar.xz
  cd Python-3.9.1
  ./configure --prefix ~/local --with-openssl=/usr/local/opt/openssl
  make -j3
  make install

  cd ~/ftps
  tar xf boost_1_72_0.tar.gz
  cd boost_1_72_0
  ./bootstrap.sh --prefix=/Users/dalke/local --with-python=/Users/dalke/local/bin/python3.9 --with-toolset=clang --with-python-version=3.9
  ./b2  -j 4
  ./b2 install
And even then it's a house of cards that will likely collapse when I install python 3.10.

Oh, and I have multiple Open Babel installs from scratch, for the same reasons

Pity me. :)

I agree. So many times I have had to reinstall RDKit using Homebrew and link it again and again. Although you can now install RDKit via pip: pip install rdkit-pypi

GitHub Issue: https://github.com/rdkit/rdkit/issues/1812#issuecomment-8088...

Informatics pipeline software in general seems to take the attitude of "All that accepted practice about distributing software? Let's throw it out of the window."

Ok, I read this and had a question right at the beginning and it wasn't answered: Why are GPU packages so large?

Like, what's inherent about GPU functionality that makes them large compared to other things? (Usual python packages are in the 100 kilobytes size range, so no idea how you'd ever go up to several gigabytes.)

When you compile CUDA code, it has an architecture flag, and you can specify compiling for multiple generations of hardware. The results are to some degree compatible across generations (you could run an sm35 package on an sm70-capable GPU), but performance is much better if you compile appropriately for the GPU you have, because the compiler can take advantage of newer hardware features. There are now 7 or 8 GPU architectures widely in use, with only a few early ones unsupported by the compiler (though a few more are now deprecated). Just as an example: in my personal PC I have a 750 Ti, a card released in 2014, and the CUDA compiler will only drop support for it in the next version. This means that you end up with multiple libraries for every package.

On top of that, people want to install their package from PyPI and go - but they need the entire CUDA runtime libraries, which include cuFFT, cuBLAS, cuDNN, etc. So if every package includes 7 or 8 arch-specific versions of the compiled code plus the full CUDA RT distribution, it's easy to see how sizes get big.
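A rough back-of-the-envelope, with made-up but plausible numbers, shows how the multiplication bites:

```shell
# Illustrative numbers only (assumed, not measured): per-architecture
# compiled code, times the number of supported GPU architectures, plus
# the bundled CUDA runtime libraries.
per_arch_mb=60
num_arches=8
runtime_mb=800
total_mb=$(( per_arch_mb * num_arches + runtime_mb ))
echo "${total_mb} MB per wheel"   # 60*8 + 800 = 1280
```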

The interesting thing is that exactly the same problem occurs on CPUs anyway. We have various optimisation levels in compilers for each family of processor, but aside from MKL I don't know of any packages shipped as wheels that try to provide optimised versions of the code across CPU generations. This means anyone who wants good performance should try to build from source rather than install the wheel.

Even keeping around 10 or so copies of the same compiled code does not explain hundreds of MBs, GBs even, of executable code. Does it?

You could consider using https://github.com/spack/spack and building Python packages + binaries from sources optimized for your particular microarchitecture. Clearly there are no sources for CUDA itself, but the binaries are downloaded from the NVIDIA website or a mirror.

Also NVIDIA could consider breaking up their packages into smaller pieces. But then again, they're still doing better than Intel, which ships 15GB+ images for their OneAPI

The business about mapping from PyPI to system dependencies is an important one, and (having not read the entire thread) I do hope that gets some attention— it's particularly curious that it's been this long and it hasn't, given Python's often-role as a glue language.

Another example of an ecosystem maintaining mappings out to system packages is ROS and rosdep:


Now it's interesting because ROS is primarily concerned with supplying a sane build-from-source story, so much of what's in the rosdep "database" is the xxxx-dev packages, but in the case of wheels, it would be more about binary dependencies, and those are auto-discoverable with ldd, dpkg-shlibdeps, and the like. In Debian (and I assume other distros), the shared-library packages are literally named for the library soname + ABI version, so if you have ldd output, you have the list of exactly what to install.

Maybe one interesting "social" piece to including this kind of functionality would be what the behaviour of pip would become. Like, would it a) go back to needing to be invoked as root and call through to the system package manager, b) emit the command needed for the user to install the required system packages, or c) download the system packages itself, extract the libraries, and insert them into its non-root-owned virtualenv or whatever other environment?
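The auto-discovery piece really is mechanical; e.g. on Linux (using /bin/sh here as a stand-in for a compiled extension module):

```shell
# List the shared libraries an ELF binary links against. For a wheel you
# would run this on the bundled .so files rather than /bin/sh.
ldd /bin/sh

# Keep just the resolved library paths; on Debian, `dpkg -S <path>` then
# maps each path back to the package that owns it.
ldd /bin/sh | awk '/=>/ {print $3}'
```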

Julia did that for binary dependencies for a few years, with adapters for several linux distros, homebrew, and for cross-compiled RPMs for Windows. It worked, to a degree -- less well on Windows -- but the combinatorial complexity led to many hiccups and significant maintenance effort. Each Julia package had to account for the peculiarities of each dependency across a range of dependency versions and packaging practices (linkage policies, bundling policies, naming variations, distro versions) -- and this is easier in Julia than in (C)Python because shared libraries are accessed via locally-JIT'd FFI, so there is no need to eg compile extensions for 4 different CPython ABIs (Julia also has syntactic macros which can be helpful here).

To provide a better experience for both package authors and users, as well as reducing the maintenance burden, the community has developed and migrated to a unified system called BinaryBuilder (https://binarybuilder.org) over the past 2-3 years. BinaryBuilder allows targeting all supported platforms with a single build script and also "audits" build products for common compatibility and linkage snafus (similar to some of the conda-build tooling and auditwheel). One downside of "cross-compilation everywhere" is that it's not always well-supported by upstream libraries, so the initial effort to build a library may be higher if patches need to be developed, but that effort is arguably higher-leverage than a per-distro packaging approach (eg https://twitter.com/Blosc2/status/1395425736597585920).

All that to say: "make binaries a distro packager's problem" sounds like a simplifying step, but there are some big caveats. It has been tried before, both in other languages and in Python: the fact that conda and manylinux don't use system packages was not borne out of inexperience. One additional issue is that distro packages are only available with sysadmin consent in shared unix-like environments, which can be very limiting for end-users.

The deeper question (and surely an uncomfortable one) is why they are so large, and what can be done to shrink them.

As a longtime hater of dkms blobs, I can sympathize with the python folks.

Does anyone know how much bandwidth/transfer PyPI uses at the CDN level? Are these stats published anywhere?

I totally believe their CDN partner values their donated services at $1.5M/mo, but curious what that looks like in per-GB pricing. :)

Looking at their sponsors list, looks like Fastly. Commendable.

A quick back-of-the-napkin calculation shows that it is using roughly 17 petabytes (!) a month of bandwidth.

How does conda handle this problem? To me as an end user conda installation for things that use the GPU seems to work a lot better, and the only real downside is that it combines awkwardly with packages that are installed using pip.

Conda uses channels; basically you can think of each channel like a remote or mirror which hosts some packages. conda-forge is a specific channel, which I believe is a community-maintained automated build infrastructure which produces builds for many arches and distros.

It's not a bad way to do it actually.
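Concretely, channel configuration lives in ~/.condarc; a typical setup (illustrative):

```yaml
# ~/.condarc: search conda-forge before the defaults channel
channels:
  - conda-forge
  - defaults
channel_priority: strict   # resolve each package from the highest-priority channel that has it
```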

Sizes look really huge just for python glue/sdk code.

That makes no sense to me.

Do anyone know why they are so big? What in the package takes so much space? Do they embed all GPU drivers? Or full model assets?

They are GPU-accelerated Python packages, so they have to carry along the GPU bits. They don't bundle drivers, but they do have to carry binaries for the various GPU architectures.

If you only want to support one GPU model/generation, one CPU arch, and one OS, it isn't hard. But god forbid you want anything more than that: you end up with L x M x N variants and a bunch of compatibility/variant-selector code.

Central registries for many languages have a hard time keeping up with the increased use by automated processes.

In GitLab we have a mirror functionality for docker container images, I look forward to us extending it to other package types as well. This should reduce load in the central registry, speed up the download, and improve the overview of what packages are in use at the organization.

What's in there? Huge pre-trained models?

PyPI will not host large models, those have to be hosted externally. I had to deal with that before and wrote about that here.


As another reply mentioned, the main things PyPI hosts are compiled extension libs. Sometimes these result in giant 300MB binary files, but more often what happens is you have many versions of a ~20MB file due to the combinations of OS, Python version, and API version.

GPU compiler toolchains / libraries / things. Bare CUDA SDK is 3GB these days for example.

CUDA 11 seems to have made a big jump in size. I can't push CUDA 11 docker images to our gitlab registry anymore. Quite a pain point.

Does GitLab have a hard limit on Docker images? Or do you hit a timeout?

github too: actions runners were teetering for us on 10, and then fell over on 11

PyPI packages don't include the base CUDA install (not sure if you meant to imply that, but it can read that way).

But yes, this is it. If you have a compiled library based on CUDA that results in a pretty large binary, and you need one for each version of the CUDA API * Python version * OS, that adds up quickly.

No, I didn't want to dig into specifics, so it was just "see, things related to CUDA are huge on all sides".

Why does each Python package need its own Base CUDA SDK? Seems like that would just be one Python package and other packages would pull it in.

To quote Ralf Gommers (post 16 in the thread) [0]:

> The issue with CUDA 11 in particular is not just that CUDA 11 is huge, but that anyone who wants to depend on it needs to bundle it in, because there is no standalone cuda or cuda11 package. It’s also highly unlikely that there will be one in the near to medium future because (leaving aside practicalities like ABI issues and wheel tags), there’s a social/ownership issue. The only entities that would be in a position to package CUDA as a separate package are NVIDIA itself, or the PyPA/PSF. Both are quite unlikely to want to do this. For a Debian, Homebrew or conda-forge CUDA package it’s clear who would own this - there’s a team and a governance model for each. For PyPI it’s much less clear. And it took conda-forge 2 years to get permission from NVIDIA to redistribute CUDA, so it’s not like some individual can just step in and get this done.


I hear there is a full build of OpenJDK in the CUDA SDK to start with.

Probably to compile TensorFlow with Bazel?
