Show HN: Morgan – PyPI Mirror for Restricted/Offline Environments (github.com/ido50)
77 points by idop on Sept 23, 2022 | 27 comments
Mirroring PyPI packages for environments/networks that do not have access to the Internet is hard. It's actually hard even in environments that do have access to the Internet. Most solutions out there either:

1. Depend on pip to download and cache package distributions. This means those downloads will probably only work in a similar environment (same Python interpreter, same libc), because of the nature of binary package distributions and the fact that packages have optional dependencies for different environments.

2. Depend on other PyPI packages, meaning installing the mirror in a restricted environment in itself is too difficult.

3. Cannot resolve dependencies of dependencies, meaning mirroring PyPI partially is extremely difficult, and PyPI is huge.

Morgan works differently. It creates a mirror based on a configuration file that defines target environments (using Python's standard Environment Markers specification from PEP 345) and a list of package requirement strings (e.g. "requests>=2.24.0"). It downloads all files relevant to the target environments from PyPI (both source and binary distributions), and recursively resolves and downloads their dependencies, again based on the target environments. It then extracts a single-file server to the mirror directory that works with Python 3.7+, has no outside dependencies, and implements the standard Simple API. This directory can be copied to the restricted network, through whatever security policies are in place, and deployed easily with a simple `python server.py` command.

I should note that Morgan can find dependencies from various metadata sources inside package distributions, including standard METADATA/PKG-INFO/pyproject.toml files, and non-standard files such as setuptools' requires.txt.
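As a rough illustration (not Morgan's actual code), the standard METADATA file inside a wheel can be read with nothing but the stdlib; each `Requires-Dist` header carries one PEP 508 dependency string:

```python
import zipfile
from email.parser import Parser

def requires_dist(wheel_path: str) -> list[str]:
    """Extract Requires-Dist entries from a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as zf:
        # METADATA lives under <name>-<version>.dist-info/ inside the wheel
        meta_name = next(n for n in zf.namelist()
                         if n.endswith(".dist-info/METADATA"))
        # METADATA uses RFC 822-style headers, so the email parser handles it
        meta = Parser().parsestr(zf.read(meta_name).decode("utf-8"))
    # Each Requires-Dist header is a PEP 508 requirement string
    return meta.get_all("Requires-Dist") or []
```

A mirroring tool then recursively feeds each returned requirement string back into its resolver.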

There's more information in the Git repository. If this is interesting to you, I'll be happy to receive your feedback.

Thanks!




Hey,

We were running into the same problem (supercomputer with clusters of different architectures and no outgoing connections permitted), so we created "pypickup" [1,2]. Nice to see that we came up with similar solutions! I have some questions:

1. Is the directory of packages you create compatible with PEP 503? (So I can use the `--index-url file://PATH_TO_LOCAL_CACHE` flag with pip and it should work.)

2. Is there some filtering mechanism? E.g. we are not interested in non-release versions ("dev" versions, "rc" versions, "post" versions, ...).

3. I guess that the way Morgan resolves dependencies is by manually parsing files like "pyproject.toml" or "requirements.txt", and that it does not ask the build system for the dependencies. If so...

   - does "morgan" detect build-dependencies?

   - which build-systems are compatible?

   - is "morgan" capable of detecting more complex dependency specifications? E.g. "oldest-supported-numpy", which is used by "scipy", has dependency strings like the following: numpy==1.19.2; python_version=='3.8' and platform_machine=='aarch64' and platform_python_implementation != 'PyPy'

Kudos for the good work!

[1] https://pypi.org/project/pypickup/ [2] https://github.com/UB-Quantic/pypickup


Too bad your project didn't come up in any of my searches while researching this problem. Probably because it doesn't use the word "mirror" at all :)

As for your questions:

1. I don't see any mention of directory structures in PEP 503. The Morgan server does implement PEP 503, though. In any case, I just tried installing straight from the directory and it didn't work. Are you sure you meant PEP 503?

2. Where Morgan differs from pypickup, as far as I can see, is that it interprets requirement strings as per PEP 508 (e.g. "requests>=2.40.0; python_version < '3.8'") instead of providing a command such as `pypickup add requests`. For every requirement string, it looks for the latest version on PyPI that satisfies it and downloads that version. You can filter _in_ the requirement strings; other than that, Morgan doesn't have any specific handling of dev/rc/etc.
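For what it's worth, the "packaging" library already skips pre-releases when filtering candidate versions against a specifier, so a sketch of that "latest satisfying version" step might look like this (the version list is made up):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

spec = SpecifierSet(">=2.24.0")
candidates = [Version(v) for v in ["2.23.1", "2.24.0", "2.25.0rc1", "2.25.0"]]
# SpecifierSet.filter drops pre-releases like 2.25.0rc1 by default,
# unless the specifier itself mentions one
matching = list(spec.filter(candidates))
latest = max(matching)  # pick the latest version that satisfies the specifier
print(latest)
```

So dev/rc/post versions already get filtered out by default at this stage, without any extra handling.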

3. Morgan detects and downloads the build system based either on the [build-system] section of pyproject.toml or on the setup_requires.txt file (from setuptools); these are the only sources currently supported. It doesn't actually care what the build system is; it simply attempts to find where it is defined and downloads it as well.

As for complex dependency specifications: yes, they are supported and honored (Morgan relies on the "packaging" library to properly evaluate them). By the way, I recently moved from Poetry to Hatch for managing the Morgan project itself, specifically because I got fed up with Poetry not honoring those specifications and trying to download completely irrelevant packages.
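For anyone curious, evaluating a marker like the one quoted above against a target environment is a one-liner with "packaging" (the environment dict below is illustrative):

```python
from packaging.requirements import Requirement

req = Requirement(
    "numpy==1.19.2; python_version=='3.8' and platform_machine=='aarch64' "
    "and platform_python_implementation != 'PyPy'"
)
# A target environment is just a dict of PEP 508 marker variables;
# values given here override the interpreter's own defaults.
env = {
    "python_version": "3.8",
    "platform_machine": "aarch64",
    "platform_python_implementation": "CPython",
}
print(req.name, req.marker.evaluate(environment=env))  # numpy True
env["platform_machine"] = "x86_64"
print(req.marker.evaluate(environment=env))  # False
```

This is essentially how a mirror can decide per target environment which dependencies are relevant, without running on that environment.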


Well, we first named it "pypi-cache", but there is a package named "pypicache" from 2007 and we had to rename it. We always thought of it as a "cache" rather than a "mirror"... but yes, "mirror" is more appropriate. Btw we released it just 1 week ago, which may also be why you did not find it.

1. Well, the flag "--index-url" explicitly says that "... should point to a repository compliant with PEP 503 (the simple repository API) or a local directory laid out in the same format". PEP 503 defines the directory structure where there is a folder per package, an "index.html" on the root with a link to each package and *an "index.html" in each package folder that has a link per available file*.

URLs are not limited to "https"; they can also be relative paths. So the trick we do is to download the file into the package's folder and add an anchor to that file in the package's "index.html". For example:

If you go to https://pypi.org/simple/numpy, you will find links like the following: <a href="https://files.pythonhosted.org/packages/f6/d8/ab692a75f584d1..." data-requires-python=">=3.8">numpy-1.22.4.zip</a>

But we download it and write, <a href="./numpy-1.22.4.zip" data-requires-python=">=3.8">numpy-1.22.4.zip</a>

This is especially important for us because we cannot set up any kind of server.

2. Okay nice. Yep, we thought that parsing would be more difficult, and that relying on it would be problematic due to the different build systems and the fact that many packages still do not have a "pyproject.toml" file. We opted for a manual approach in which you run "pypickup add" until you have no more "dependency missing" errors. Your approach looks much better to me, but like you said, it is limited to "pyproject.toml" and "setuptools" right now.

Btw, does it also download extra dependencies?

3. Nice. I also stopped using Poetry for things like that, but now I manually write my "pyproject.toml" with "setuptools".

I like the idea of trying to parse the dependencies. I will probably try something, but since we download all files (filtering some of them), it would be more costly. Maybe in a few weeks when I'm more free.
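The relative-anchor layout from point 1 can be generated in a few lines of stdlib Python; this is an illustrative sketch (not pypickup's actual code), assuming one folder per package under the mirror root:

```python
from pathlib import Path

def write_simple_indexes(root: str) -> None:
    """Write a PEP 503-style root index.html, plus one index.html per
    package folder linking each distribution file with a relative anchor."""
    root_path = Path(root)
    links = []
    for pkg in sorted(p for p in root_path.iterdir() if p.is_dir()):
        files = sorted(f.name for f in pkg.iterdir()
                       if f.is_file() and f.name != "index.html")
        # Relative "./" hrefs make the mirror usable straight from disk
        body = "\n".join(f'<a href="./{name}">{name}</a><br/>' for name in files)
        (pkg / "index.html").write_text(f"<html><body>\n{body}\n</body></html>")
        links.append(f'<a href="{pkg.name}/">{pkg.name}</a><br/>')
    (root_path / "index.html").write_text(
        "<html><body>\n" + "\n".join(links) + "\n</body></html>")
```

With those files in place, pip's `--index-url file://PATH_TO_LOCAL_CACHE` should resolve packages straight from the directory, no server needed.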


Ahh, I get it, it needs index.html files. I can easily implement this, but I actually did want the server: I wanted the mirror to be easily accessible from multiple machines, I also wanted to implement the JSON API, and (in an upcoming version) I want to allow uploading private packages to the mirror.

As for extra dependencies, yes, they will be mirrored, but only if relevant, i.e. if they are included in a requirement string (be it a direct requirement or a dependency of a dependency).


Ahh ok. In our case all the machines have a shared network filesystem where we store the mirror.

Great about the extras.

Would you mind if we reference each other in the readmes?


Yeah sure, no problem.


Maybe I'm confused about what this offers, but I have been running private pypi repositories for a decade now, and it never required more than running an HTTP server with directory listing.

As for doing a partial mirror of PyPI with only what you are using: is that really a good idea anyway? It will break whenever you add or change any dependency.


The problem isn't really on the serving side, it's on the mirroring side. Trying to mirror PyPI - at its current 13.4 TB size[1] - and bringing all those terabytes into a restricted network with security policies and no access to the internet, is impossible. A partial mirror is the only way to go for such a use case, and given that Morgan automatically resolves and mirrors dependencies, adding a new dependency shouldn't break anything.

[1] https://pypi.org/stats/


Can't you resolve the dependencies by running `pip download` when you have internet, and later serve that directory with a local HTTP server as the parent suggested? `pip download` will resolve all the dependencies for you, the same way `pip install` would.


No, as I mention both in this post and in the README. Pip will download binary distributions (wheels) that were compiled for the system it is running on. If my mirror is meant to serve a different version of Python installed on a different OS with a different libc (or other such differences), then it won't work. I could try to match the target environment on the mirroring side, say with Docker, but this is either cumbersome or still not possible if you have legacy environments from years before.


By default it will download wheels for the system it is running on, but there are knobs to tweak that.

https://pip.pypa.io/en/stable/cli/pip_download/#cmdoption-pl...

https://peps.python.org/pep-0425/
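For example, a cross-platform download can be scripted like this (the target platform/version values are illustrative; pip requires `--only-binary :all:` or `--no-deps` whenever you override `--platform`/`--python-version`):

```python
import subprocess
import sys

def download_for_target(dest: str, requirement: str) -> None:
    """Download manylinux wheels for a CPython 3.8 / x86_64 Linux target,
    regardless of the host this runs on, using pip's cross-platform knobs."""
    subprocess.run(
        [
            sys.executable, "-m", "pip", "download",
            "--only-binary", ":all:",          # never fall back to sdists
            "--platform", "manylinux2014_x86_64",
            "--python-version", "3.8",
            "--implementation", "cp",
            "--dest", dest,
            requirement,
        ],
        check=True,
    )
```

E.g. `download_for_target("mirror/", "requests>=2.24.0")` on an internet-connected machine. The caveat raised above still applies: you need to know every target environment in advance and run one download pass per environment.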


Makes sense, and you're right. We did encounter issues when changing platforms back when we were using a self-rolled, janky version of this! Thanks


You can download source packages instead of wheels, but then you need to make sure you have all the requisite compilers and libraries. This isn't an issue for Python-only dependencies, but it can be difficult for dependencies with lots of native code, like numpy/pandas, where you need a C toolchain and a Fortran toolchain installed (and possibly other libs).

If you're using something like Docker/containers, you can download the dependencies inside the container and be reasonably sure you get the right wheels. This becomes trickier when you have different setups like developers on Windows and production on Linux.


Now I am curious.

- How big is it if you exclude packages that aren't Python 3 compatible?

- How big if you only wanted the latest version of everything?


Came here to say this. I run private PyPI repositories for this use case and it works fine. I've had to thumb-drive over all of our dependencies, wheels etc. A single bash script runs all the checks, downloads, and zips everything for the offline environment; then you use pip install like normal, with the login creds for your offline PyPI registry.


Out of curiosity, how do you run yours?


Thanks for posting this. I'm going to give setting up Morgan a shot when I've got some free cycles.

I'd hesitantly accepted the risk of serving a devpi server over vsock into my (personal) restricted VLAN. I did so because using a shared folder would have meant I'd need to have cached the module and any dependencies from my internet-connected VLAN first.

Combined with debmirror[0], vscodeoffline[1], and some nightly snatcher shell scripts, I think I have most of my needs covered.

[0] https://help.ubuntu.com/community/Debmirror

[1] https://github.com/LOLINTERNETZ/vscodeoffline


This is pretty cool. I created simpleindex[1] a while ago to solve a different problem, but since the solution is essentially also running a custom index server, it has several functionalities overlapping with Morgan's server script. I wonder if there's a common pattern that can be extracted out…

BTW, I also maintain resolvelib (mentioned in another comment); feel free to shoot any questions in the issue tracker or the PyPA Discord[2], or through any other means. The documentation is a bit sparse and there are not many resources on dependency resolution in general, but there are a few of us who help each other out on things.

[1]: https://github.com/uranusjr/simpleindex [2]: https://discord.com/invite/pypa


This looks similar to some Bazel rules I'm working on. I'm also using the approach of defining target environments up front [1], but the main difference is that I'm currently offloading the actual resolution process to Poetry or PDM, which both generate cross-platform lock files.

But Poetry and PDM don't add build dependencies to lock files - which I need - so I'm thinking of building a custom resolver.

Did you consider using resolvelib [2], which is what underlies both pip and PDM?

[1] https://github.com/jvolkman/rules_pycross/blob/main/examples...

[2] https://github.com/sarugaku/resolvelib


By the way, Poetry's dependency resolution isn't that great. It doesn't properly evaluate optional dependencies. For example, when I try to install pymongo on Linux, it will insist on installing pywin32 as well, even though it is completely irrelevant. It's given me a lot of headaches.


I didn't know about resolvelib, looks interesting, I'll have to give it a deeper look, thanks.


Thanks for creating it; looking forward to trying it out.

I have been looking for a similar solution, and whitelist-based approaches used to fail with other tools because they weren't resolving the dependencies.


When I worked at Microsoft, one team created a big solution for an e-commerce customer using Kubernetes, Helm charts, etc. Beautiful.

Then I had to take it to run in mainland China.

Nope.


Oh neat. Not only do I share a name with a project, it's a project I was seriously thinking of starting.


:) Naming projects is hard, so I tend to give it as little thought as possible. I was playing Red Dead Redemption 2 while writing the first version of this so I just named it after Arthur Morgan, the main protagonist.


I figured it was someone lamenting working at Morgan Stanley for not letting you pull in dependencies without a lot of red tape ;)


When I was at Morgan I saw three or four people create something like this. :)



