donnygreenberg's comments

If you want a PyTorch-like experience on your own GPUs (either static or cloud), see https://github.com/run-house/runhouse


Would be nice if this came with scripts that could launch the examples on compatible GPUs on cloud providers (rather than having to guess which GPUs will work). Would anyone else be interested in that? Considering putting it together.


Neat idea and elegant execution. The vibe I'm getting from the README is that I'd want to reach for this as a Python dev (over Streamlit or Gradio) if I already have a FastAPI app and want a simple dashboard in front of it. Is that right or am I misreading?


Yes I think that's fair as the no-brainer use case.

I hope it can do more powerful things too.


Yes! It's natively supported in the cluster object. You can specify rh.cluster(name="my-a10", instance_type="A10:1", provider="aws"); gcp, azure, and lambda labs are also supported. If you leave the provider empty, it will pick the cheapest option across whichever clouds you have set up locally (e.g. a ~/.aws/credentials file).
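
Roughly, that looks like this (a minimal sketch; the no-provider variant assumes you have at least one cloud's credentials configured locally):

    import runhouse as rh

    # Pin to a specific cloud and instance type (an AWS A10 here);
    # gcp, azure, and lambda labs work the same way via `provider`.
    gpu = rh.cluster(name="my-a10", instance_type="A10:1", provider="aws")

    # Or leave `provider` out and let it pick the cheapest option across
    # whichever clouds you have credentials for locally (e.g. ~/.aws/credentials).
    gpu = rh.cluster(name="my-a10", instance_type="A10:1")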


Good catch - this is actually only for functions defined inside interactive notebook or IPython environments, and we do have an option to bypass it (which serializes the function and its state), but you probably don't want that by default. Any function defined in normal Python (even if you're using it within a notebook) doesn't need imports or variables in scope; you can ship it as-is to remote compute.

Notebooks are gnarlier things, where you might have defined a variable, changed it many times, and finally used it inside a function which you want to run remotely as a service. You probably don't want that variable's state to be shipped over, and PyTorch doesn't do this either. This is why so much of stateful PyTorch is meant to be defined inside new nn.Module classes: to neatly package the state to send over to GPU memory. We offer more flexibility than that, but the more state you ship over, the more likely you are to hit version conflicts between the local and remote environments, which can be really annoying. We practically never run into those issues at Runhouse nowadays because we think we've found a sweet spot.
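
To illustrate the nn.Module point with plain PyTorch (nothing Runhouse-specific here): the module carries its own parameters and buffers, so a single .to("cuda") moves exactly that state into GPU memory and nothing else from the surrounding scope.

    import torch
    import torch.nn as nn

    class TinyClassifier(nn.Module):
        # All the state this model needs lives on the module itself.
        def __init__(self, in_dim=128, n_classes=10):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
            self.head = nn.Linear(64, n_classes)

        def forward(self, x):
            return self.head(self.body(x))

    model = TinyClassifier()

    # Only the parameters/buffers registered on the module move to the GPU;
    # notebook globals, loop variables, etc. stay behind.
    if torch.cuda.is_available():
        model = model.to("cuda")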


Yes, we work pretty closely with them and they're lovely. Everyone should try SkyPilot.


Good question. We actually use Ray to handle a bunch of the scheduling within the compute, but largely see our role as outside the compute. Meaning, Ray provides a powerful DSL for distributed compute, while we are aggressively DSL-free so users can ship Ray code, PyTorch Distributed, Accelerate, Horovod, etc. to their hardware through Runhouse. We're more focused on connecting disparate compute and storage and making them multiplayer (but largely see the cluster as an opaque unit of compute) while they're more focused on enabling distribution inside the cluster, if that makes sense.


No, I know of CentML but don't deeply know the surface of hardware they compile for. I'm enthusiastic about projects like this and others which integrate with PyTorch 2.0. Flexible compilers make the value of being able to ship your code around to various hardware even more powerful.


Thanks for your comment! The current focus of optimizations is Nvidia GPUs, but others are in the works. Hidet comes with Hidet Script, which abstracts away some of the CUDA struggles and may make ML optimization efforts easier to implement. It is still evolving, so documentation is limited, but here are some examples: https://github.com/hidet-org/hidet/tree/main/python/hidet/gr...
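
For the PyTorch 2.0 route, a rough sketch of using hidet as a torch.compile backend (assumes a recent PyTorch, a CUDA GPU, and pip install hidet; exact usage may differ, see the docs):

    import torch
    import hidet  # assumption: importing hidet makes the 'hidet' backend available to torch.compile

    model = torch.nn.Linear(128, 64).cuda().eval()
    x = torch.randn(8, 128, device="cuda")

    # The first call triggers kernel compilation/tuning; subsequent calls reuse it.
    compiled = torch.compile(model, backend="hidet")
    with torch.no_grad():
        y = compiled(x)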


It's a great point. The funny thing is that the rest of the world just has dev, QA, and prod, with staging and canaries, while ML has "the 6 months it takes to translate from notebook to pipeline" or "uploading a new checkpoint/image to the platform." We can just stage properly and release through CD like everyone else does, rather than spending 6 months flipping the switch. We built the ability to specify the package for a function as a git repo at a particular revision to enable this, and hopefully it means more people rely on version control as the source of truth in prod, rather than the most recently uploaded model checkpoint in SageMaker. During experimentation, though, it really is frustrating that many systems only allow you to run what you've committed.
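
As a rough, hypothetical sketch of that git-pinning idea (the constructor name and arguments below are illustrative, not a confirmed API; check the Runhouse docs for the real call):

    import runhouse as rh

    # Hypothetical: pin a function's code dependency to a repo + revision so prod
    # runs exactly what version control says, not the latest uploaded artifact.
    pinned_repo = rh.GitPackage(
        git_url="https://github.com/my-org/my-model-repo.git",  # illustrative repo
        revision="v1.2.0",  # the tag/commit that CD promoted to prod
        install_method="pip",
    )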

We've also built a basic permissioning system to control who can actually overwrite the saved version of a resource, so there are no accidents. E.g., if the prod inference blob is saved at "mikes_pizza/nlp/bert/bert_prod", you can set it so only specific accounts can overwrite that metadata to point to a new model. Ideally we'll just inherit existing RBAC groups sometime soon.

Does that make sense? Curious if you had something else in mind as far as the danger.


Ah, I see. The ability to push to infra is more about the development loop than prod rollouts. Prod can (and should) use CD with a well understood source of truth.

Thanks, I was misunderstanding the purpose of the feature.


That's a good question. I actually love the Mojo concept, but see it as very different. They're creating a portable acceleration option in Python proper, while we're trying to make it so you can easily ship around such code to different infra. You can see them or other DSLs like Ray as handling "inside the cluster" while we're focused on solving "outside the cluster." That's what I've picked up from their marketing but could be missing something.

In general, making code itself more portable (the objective of many ML compilers) is great, and it makes Runhouse even more valuable: the ability to take the same code and send it to different places shines when those places can be different compute types.

