A PyTorch Approach to ML Infrastructure (run.house)
113 points by donnygreenberg 10 months ago | 22 comments



Very interesting. I just worked to implement a baby version of this kind of system at work. Similar to this project, our basic use case was allowing researchers to quickly/easily execute their arbitrary R&D code on cloud resources. It's difficult to know in advance what they might be doing, and we wanted to avoid a situation where they are pushing a docker container or submitting a file every time they change something. So we made it possible for them to "just" ship a single class/function without leaving their local interactive environment.

I see from looking at the source here that run.house is using the same approach of cloudpickling the function. That works, but one struggle we're having is that it's quite brittle. It's all gravy assuming everyone is operating in perfectly fresh environments that mirror the cluster, but this is rarely the case. Even subtle changes in the local execution environment can produce segfaults when the code runs on the server. Very hard to debug. The code here looks a lot more mature, so I'm assuming this is more robust than what we have, but I'd be curious whether the developers have run into similar challenges.
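
For context, the shape of our approach looks roughly like this (a generic sketch, not Runhouse's or our exact code):

    # Generic sketch of the "ship a function with cloudpickle" pattern --
    # assumes some worker process on the cluster exposes an HTTP endpoint.
    import cloudpickle
    import requests  # assumed transport; any RPC layer works

    def submit(fn, *args, worker_url="http://cluster:8080/run"):
        # Serialize the function object and its args so the worker can
        # reconstruct them without having the source file.
        payload = cloudpickle.dumps((fn, args))
        resp = requests.post(worker_url, data=payload)
        return cloudpickle.loads(resp.content)

    # Worker side -- this is where the brittleness shows up: if the local
    # and remote package versions drift, loads() can fail or even segfault
    # inside C extensions.
    def handle(payload: bytes) -> bytes:
        fn, args = cloudpickle.loads(payload)
        return cloudpickle.dumps(fn(*args))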


Hi! That's awesome to hear, and very aligned with the devx we're going for. How was your system received?

In fact we totally agree, and we don't cloudpickle the function precisely because of those package and minor-version issues. We sync the code over to the destination environment and the server imports it fresh, which is much more robust. The one piece of code that does cloudpickle functions is a trap door for certain weird situations, but frankly we haven't had to use it in months.
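
A rough sketch of that shape (illustrative only, not our actual internals):

    # "Sync code, import fresh": assumes the caller's repo has already
    # been rsynced to code_dir on the server.
    import importlib
    import sys

    def run_by_reference(module_name, fn_name, *args, code_dir="/opt/synced_code"):
        # Import the function from the synced source tree instead of
        # unpickling a function object, so the server's own environment
        # resolves all imports and versions.
        if code_dir not in sys.path:
            sys.path.insert(0, code_dir)
        module = importlib.import_module(module_name)
        return getattr(module, fn_name)(*args)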


Our system was super well received minus the stability issues. I think the UX of being able to "ship" code like this is a big leap over the alternatives - it actually gives me a lot of confidence in the approach seeing that someone else had a similar thought.

Very interesting about the implementation. I admittedly did not read that closely and clearly did not grok what the actual hot path was there; will check it out more. May have to borrow your approach or perhaps just adopt this wholesale :) Regardless, super cool project, will be following.


Excellent! Don't hesitate to reach out (donny at run dot house) if you want to chat about adopting our approach or using Runhouse.


> Just as PyTorch lets you send a model .to("cuda"), Runhouse enables hardware heterogeneity by letting you send your code (or dataset, environment, pipeline, etc) .to(“cloud_instance”, “on_prem”, “data_store”...), all from inside a Python notebook or script. There’s no need to manually move the code and data around, package into docker containers, or translate into a pipeline DAG.
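
For reference, the usage pattern described there looks roughly like this (illustrative; the exact Runhouse argument names and cluster spec may differ):

    import runhouse as rh

    def preprocess(texts):
        return [t.lower() for t in texts]

    gpu = rh.cluster(name="rh-a10x", instance_type="A10G:1")  # assumed spec
    remote_preprocess = rh.function(preprocess).to(gpu)       # ship the fn
    remote_preprocess(["Hello", "World"])                     # runs on the cluster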

From an SRE perspective, this sounds like a nightmare. Controlled releases are really important for reliability. I definitely don't want my devs doing manual rollouts from a notebook.


It's a great point. The funny thing is that the rest of the world just has dev, QA, and prod staging and canaries, while ML has "the 6 months it takes to translate from notebook to pipeline" or "uploading a new checkpoint/image to the platform." We can just stage properly and release through CD like everyone else does, but not spend 6 months flipping the switch. We built the ability to specify the package for a function as a git repo at a particular revision to enable this, and hopefully it means more people rely on version control as the source of truth in prod rather than the most recently uploaded model checkpoint in SageMaker. During experimentation, though, it really is frustrating that many systems only allow you to run what you've committed.
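
Concretely, pinning code to a revision looks something like this (an illustrative sketch using GitPython, not our exact API; the repo URL and SHA are placeholders):

    # Materialize a function's code from a pinned git revision at deploy
    # time, so prod runs exactly what was committed.
    import git  # pip install GitPython

    def fetch_pinned_code(repo_url, revision, dest="/opt/app"):
        repo = git.Repo.clone_from(repo_url, dest)
        repo.git.checkout(revision)  # detach at the exact audited commit
        return dest

    fetch_pinned_code("https://github.com/example/nlp-pipelines", "3f2c1ab0")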

We've also built a basic permissioning system to control who can actually overwrite the saved version of a resource, so there are no accidents. E.g. if the prod inference blob is saved at "mikes_pizza/nlp/bert/bert_prod", you can set it so only certain accounts can overwrite that metadata to point to a new model. Ideally we'll just inherit existing RBAC groups sometime soon.
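
As a toy illustration of the overwrite guard (not our actual permissioning code):

    # Only accounts on the ACL may repoint the prod resource.
    WRITE_ACL = {"mikes_pizza/nlp/bert/bert_prod": {"mike", "ci-bot"}}

    def save_resource(path, metadata, user, store):
        allowed = WRITE_ACL.get(path)
        if allowed is not None and user not in allowed:
            raise PermissionError(f"{user} may not overwrite {path}")
        store[path] = metadata  # e.g. pointer to a new model checkpoint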

Does that make sense? Curious if you had something else in mind as far as the danger.


Ah, I see. The ability to push to infra is more about the development loop than prod rollouts. Prod can (and should) use CD with a well understood source of truth.

Thanks, I was misunderstanding the purpose of the feature.


Since people are suggesting alternatives, I'd like to shoutout skypilot: https://github.com/skypilot-org/skypilot

EDIT: looks like this actually uses it under the hood: https://github.com/run-house/runhouse/blob/main/requirements...


Yes, we work pretty closely with them and they're lovely. Everyone should try SkyPilot.


This is a cool approach. I really like the notion of small, powerful components that compose well together. ML infra is sorely missing this piece. I wish you the best of luck!


Sounds similar to https://dstack.ai/docs/


> Please make sure the function does not rely on any local variables, including imports (which should be moved inside the function body)

This seems like a major limitation and pretty antithetical to the PyTorch approach.


Good catch - this actually only applies to functions defined inside interactive notebook or IPython environments, and we do have an option to bypass it (which serializes the function and its state), but you probably don't want that by default. Any function defined inside normal Python (even if you're using it within a notebook) doesn't need imports or variables in scope; you can ship it as-is to remote compute.

But notebooks are gnarly things, where you might have defined a variable and changed it many times, and finally used it inside a function which you want to run remotely as a service. You probably don't want that variable's state to be shipped over, and PyTorch doesn't do this either - it's why so much of stateful PyTorch is meant to be defined inside new nn.Module classes, to neatly package the state to send over to GPU memory. We offer more flexibility than that, but the more state you ship over, the more likely you are to hit version conflicts between the local and remote environments, which can be really annoying. We practically never run into those issues at Runhouse nowadays because we think we've found a sweet spot.
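
Concretely, the constraint just means writing notebook-defined functions like this (illustrative sketch):

    # Fragile in a notebook: relies on an import and a global from other
    # cells, neither of which exists on the remote side.
    #   import numpy as np
    #   scale = 2.0
    #   def bad_fn(x):
    #       return np.array(x) * scale

    # Shippable: everything the function needs travels with its body.
    def good_fn(x, scale=2.0):
        import numpy as np  # resolved in the remote environment
        return np.array(x) * scale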


How do you compare Runhouse with Ray which also simplifies distributed computing?


Good question. We actually use Ray to handle a bunch of the scheduling within the compute, but largely see our role as outside the compute. Meaning, Ray provides a powerful DSL for distributed compute, while we are aggressively DSL-free so users can ship Ray code, PyTorch Distributed, Accelerate, Horovod, etc. to their hardware through Runhouse. We're more focused on connecting disparate compute and storage and making them multiplayer (but largely see the cluster as an opaque unit of compute) while they're more focused on enabling distribution inside the cluster, if that makes sense.


Have you tried Hidet? https://pypi.org/project/hidet/


No, I know of CentML but don't deeply know the surface of hardware they compile for. I'm enthusiastic about projects like this and others which integrate with PyTorch 2.0. Flexible compilers make the value of being able to ship your code around to various hardware even more powerful.


Thanks for your comment! The current optimization focus is NVIDIA GPUs, but others are in the works. Hidet comes with Hidet.Script, which abstracts away some of the CUDA struggles and may make ML optimization efforts easier to implement. It is still evolving, so documentation is limited, but here are some examples: https://github.com/hidet-org/hidet/tree/main/python/hidet/gr...
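
A minimal example of the torch.compile integration (assumes an NVIDIA GPU and hidet installed; check the docs for the exact current usage):

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(128, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    ).cuda().eval()

    # Compile with the hidet backend instead of the default inductor.
    compiled = torch.compile(model, backend="hidet")

    x = torch.randn(8, 128, device="cuda")
    with torch.no_grad():
        y = compiled(x)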


How would you position this vs. the Modular/Mojo approach, which aims to relieve similar pain points?


That's a good question. I actually love the Mojo concept, but see it as very different. They're creating a portable acceleration option in Python proper, while we're trying to make it so you can easily ship around such code to different infra. You can see them or other DSLs like Ray as handling "inside the cluster" while we're focused on solving "outside the cluster." That's what I've picked up from their marketing but could be missing something.

In general making code itself more portable is great (which is the objective of many ML compilers) and will make Runhouse even more valuable, because the ability to take the same code and send it to different places shines when those different places can be different compute types.


Looks like Runhouse is FOSS (Apache 2.0) and you get to choose your own infrastructure. I will try out Runhouse. Mojo wants me to send them my info to get started.


What is the Mojo approach?




