Hacker News
Ray: A Distributed Framework for Emerging AI Applications (micahlerner.com)
134 points by mlerner on July 4, 2021 | 34 comments



I used Ray to train a massive GPT model by putting each layer on a separate TPU. Ray was able to send all the gradients back and forth as needed.

It scaled fine up to 33 TPUs (i.e. 33 layers).

Ray is impressive as hell.

By the way, I didn't write the code to do any of that. kindiana, aka "the guy that wrote GPT-J", also happened to write this:

https://github.com/kingoflolz/swarm-jax

I just ran it and it worked. Which is extraordinarily unusual for TPUs, historically speaking.
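
If you've never seen the pattern, here's a toy sketch of the "one actor per layer" idea -- not the actual swarm-jax code, just made-up names and numpy in place of JAX to show the shape of it:

    import numpy as np
    import ray

    ray.init()

    @ray.remote  # in the real setup each actor would be pinned to its own TPU host
    class LayerWorker:
        def __init__(self, dim):
            self.w = np.random.randn(dim, dim).astype(np.float32) * 0.01

        def forward(self, x):
            return np.maximum(x @ self.w, 0.0)  # toy ReLU layer standing in for a transformer block

    # one actor per "layer"; Ray ships activations between them as object refs
    layers = [LayerWorker.remote(128) for _ in range(4)]

    ref = ray.put(np.random.randn(8, 128).astype(np.float32))
    for layer in layers:
        ref = layer.forward.remote(ref)  # chaining refs, so activations never bounce through the driver

    print(ray.get(ref).shape)  # (8, 128)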

I'm pushing my luck at this point, since it's crossing the line from enthusiasm to spam. But if you want to try out Ray on TPUs like I did here, I posted a (massive) amount of detail on how to get started, and why: https://news.ycombinator.com/item?id=27728225

That's the last I'll be mentioning it for some time, though.

Ray + JAX is such a killer combo.


As someone new to ML/DL, where can I get started? Would you recommend the Fast AI course, which uses PyTorch-based libraries, or something else that focuses on TensorFlow?


The key is to find something you think is fun, and play with it. It doesn’t matter what it’s written in. I don’t think it would be fun to take a course, so I never did. But it was a blast to get all this working: https://youtube.com/channel/UCqpwMaJbb-zj-MlMkRd28QQ

You can see my early videos were crude meme attempts, and eventually it morphed into 100-TPU training. You can’t really predict the things that you’ll like, so don’t try.

What’s fun to you? That’s the question to focus on. For me, it was language modeling and image generation, which I learned from https://gwern.net/GPT-2 and Peter Baylor’s stylegan notebook, respectively. Then I transitioned to audio memes https://youtu.be/koU3L7WBz_s and kept going from there.

Tensorflow is bullshit, pytorch is different bullshit, Jax is pretty friggin awesome but still has one or two flecks of bs. But none of it feels hard to deal with, because it’s all an enormous box of legos; as long as you chase your interests, you’ll never* feel like it’s annoying.

* you’ll be annoyed all the time, but at least you’ll keep going.


Can this be done with Dask?


I'm not sure. I thought it was nothing short of a miracle that it could be done at all. I tried, hard, in Tensorflow, to make it work. But there was no way to communicate directly from TPU to TPU; I had to serialize to GCS buckets as an intermediate step, which added massive complexity to the code.

The Ray solution here is so simple. It was a joy to use. But I don't know anything about Dask.

By the way, if anyone wants to see how the loss graphs turned out: https://twitter.com/theshawwn/status/1406171487988498433

(I wish I'd uploaded the logs to tensorboard.dev for posterity. But you can see quite clearly in the screenshots all the information you'd want to see anyway, with apologies to blind engineers. Oh lord, is there a single blind engineer working in ML? Suddenly it's an appealing idea to try to make tensorboard accessible... I wonder how blind people could interpret graphs. Averages, probably.)


I don't know about TPUs, but in GPU land, yeah, you can do fast GPU<>GPU transfers without much code, incl. for Dask kernels. More typically, the code is already automatically optimized enough that manual optimization isn't needed here, and at least for us, we end up spending our optimization time elsewhere.

I don't remember what's normal for direct GPU<>GPU, but for the cases we see, the occasions we've done it have been through a special pinned-memory mode via a staging area. That used to be hard, but nowadays the rapids.ai ecosystem (cupy / rmm / etc) gives you nice Python wrappers.

Dask is part of that ecosystem ("dask-cudf"), but helps more w/ automation around bigger-than-memory paging, multi-GPU dispatch, and multi-node dispatch. Underneath, it does some nice things for you, like setting CPU<>GPU affinities. However, doing custom peer-to-peer / NUMA stuff quickly gets you back to cupy/rmm, and thankfully, that's Python and integrates nicely with pandas/arrow/etc :)

EDIT: There's a wild world of fancy NVLink GPU<>GPU interconnects in more expensive multi-GPU boxes. We mostly deal with more end-to-end I/O issues, like network/SSD array -> PCI cards -> GPUs as the bottleneck, such as during bigger-than-memory and one-shot/cold use, so I can't speak to the p2p bits as much.
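
If it helps calibrate, the basic cross-device copy in cupy is about this much code (a toy sketch; whether it's true peer-to-peer underneath depends on your interconnect/driver):

    import cupy as cp

    with cp.cuda.Device(0):
        a = cp.arange(1 << 20, dtype=cp.float32)

    with cp.cuda.Device(1):
        b = cp.asarray(a)  # copies the array onto device 1 (p2p if the driver allows it)
        print(b.device, float(b.sum()))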


There is also ucx-py, which can be used with dask_cuda for rapid GPU-GPU communication:

https://github.com/rapidsai/ucx-py


I've not trained a model using Dask, but I've used it to explore distributed computing over a local network for some data science workloads. I found Dask much more stable than Ray or Modin for multi-architecture distributed computing, i.e. nodes with different CPU architectures (ARMv8, x86_64).

My goal was to explore the extent of distributed computing on local low-power compute nodes, where different architectures are common; it's not to be compared with professional work like the GP has detailed on homogeneous architectures.

But in case you'd like to indulge in similarly masochistic activities, I have a couple of gists, like installing Ray on ARM [1] and Apache Arrow on ARM [2].

[1] https://gist.github.com/heavyinfo/aa0bf2feb02aedb3b38eef203b...

[2] https://gist.github.com/heavyinfo/04e1326bb9bed9cecb19c2d603...


Although it's early days, José Valim and some other folks are working on adding AI-related capabilities to Elixir and the (Erlang) BEAM. See "Introducing Nx" (https://www.youtube.com/watch?v=fPKMmJpAGWc) for an intro.

Given that they already have a robust Actor model as a base to work from, it occurs to me that they may be able to use some of Ray's ideas as they go along...


I've used Ray for about a year (typically for thousands of ML tasks, spread across ~48-120 cores simultaneously) and it's a pleasure to use, at least via the basic API. Admittedly, I had problems when trying some of the more advanced approaches, but I didn't really need them, and I can definitely recommend it since the performance is great.
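
For anyone who hasn't tried it, the basic API really is about this small (toy example, not my actual workload):

    import ray

    ray.init()  # or ray.init(address="auto") to attach to an existing cluster

    @ray.remote
    def score(model_id, batch):
        # stand-in for a real inference call
        return model_id, sum(batch) / len(batch)

    # fan out a few thousand tasks; Ray spreads them across whatever cores are available
    futures = [score.remote(i, [1.0, 2.0, 3.0]) for i in range(2000)]
    results = ray.get(futures)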


Just out of curiosity, what kind of work requires thousands of ML tasks? (Assuming you're talking about training and not inference?)


The thousands of tasks are inference, but I also use Ray to train/update a double-digit number of models simultaneously (~1 per user).


Given how Ray "provides [...] exactly-once semantics" for its actors, you could draw similarities between it and workflow-as-code frameworks such as https://temporal.io. The way that Ray splits up actors and tasks looks similar to Temporal's Workflows + Activities split: Workflows (Ray actors) contain orchestration logic and have their method calls/results durably logged. Activities (Ray tasks) perform the expensive computations and any interaction with external systems and are not durably logged.

If you're in the .NET ecosystem or interested in distributed systems in general, you may like Orleans (https://github.com/dotnet/orleans), which I work on at Microsoft. Orleans contributes the Virtual Actor model which other modern actor frameworks are starting to adopt since it is well suited for the hectic, failure-prone environment of distributed systems (which those so-called Cloud Native Apps live in). The Ray paper linked from the article (https://www.usenix.org/system/files/osdi18-moritz.pdf) discusses some similarities. Slight correction on the paper: it states that "For message delivery, Orleans provides at-least-once [...] semantics". It's at-most-once. At-least-once messaging semantics (usually implemented via automatic retries) aren't ideal for these kinds of systems, in my opinion.


Neat - thanks for sharing! A friend mentioned Orleans; queueing up the paper (found this one: https://www.microsoft.com/en-us/research/publication/orleans...) for future reading.


I spent the past year and a half deploying a distributed backend for BERT-like models, and we ultimately chose a K8s architecture with "precise" affinity mapped out, which is still hard due to CPU-pinning issues. On the frontend API, Golang gives us the ability to distribute and split incoming requests (10-20M/day, with batch sizes averaging ~3K, which split into 50 due to model constraints). Embeddings are stored on those nodes, on local SSDs; those nodes are only a handful. Models run on 2 pools, one dedicated and one preemptible (most nodes here), which gives us cost optimization, and scheduling is simplified thanks to K8s. We have anywhere from 120-300 of these high-compute nodes.

Wondering if anyone has similar deployments and has migrated to Ray. We've evaluated it, but we can't afford a large migration at this point; we'd also need to test quite a bit and rebuild our whole automation for infra and apps.

Really interested, though, as the infrastructure isn't cheap and every time the model updates we are basically re-architecting it. Right now we are moving everything away from Python (gunicorn/flask, and MKL) to Golang, as we can get better efficiency with data serialization (numpy ops are the biggest time eaters right now ... model input vectors constructed from flatbuffers).


Have you considered Rust? It has a neat Python interconnect. My team is doing early experiments offloading tight loops/expensive computations in an existing Python inference app to Rust.


Are you running inference on CPU or GPU?


CPU; GPU doesn't work out well in our case. There is the data transfer cost as well as a memory constraint, and the model blows up in memory on every inference call.


2x Gunicorn workers, with MKL mapped to half the physical cores for each... for some reason the model (TensorFlow) performs better split across two half-sized workers than with 1 worker.


There was a recent talk at PyCon US 2021 on this :)

TALK / SangBin Cho / Data Processing on Ray [https://www.youtube.com/watch?v=DNLqvdov_J4]


At a glance, Ray is a re-invention (or rebranding) of the distributed object system plus agent model, which was popular around '90-'00. Things like Java RMI and CORBA (remember?) were part of the trend, until REST killed them all.

On top of the distributed object foundation, Ray added ML-oriented twists like efficient numerical data transfer with Apache Arrow, and shifted focus from (classic) agent systems to RL, or distributed ML training in general, accompanied by a Python-first approach - which simplifies a lot of things compared to traditional, often language-agnostic distributed objects.

I'm not claiming Ray isn't novel. Rather, my point is that all a dated idea may need to come back is a few relevant-today twists like these. I think Ray is a good demonstration of the possibilities of such old-new things.


Our team has used Ray for more than two years across several versions. It makes a lot of things easy that were not, and is especially adapted for our purposes, which include training and deploying a lot of reinforcement learning policies. The AnyScale team is very responsive on the support Slack, fwiw.


Great work and kudos to the Ray team! It's definitely a fresh look with a lot of lessons learned from previous generations (e.g. spark).

There are a few nice features I wish Ray would eventually get to.

On the user experience side, it would be nice to have task-level logs: often it's easier for users to reason at the task level, especially when the task is a facade that triggers other complicated library/subprocess calls.

For the scheduler, it would be nice to have more native support for sharded/bundled/partitioned tasks, along the lines of https://cloud.google.com/blog/products/gcp/no-shard-left-beh...


Hi all, I'm one of the authors of Ray, thanks for all the comments and discussion! To add to the discussion, I'll mention a few conceptual things that have changed since we wrote the paper.

*Emphasis on the library ecosystem*

A lot of our focus is on building an ecosystem of libraries on top of Ray (much, but not all, of the focus is on machine learning libraries). Some of these libraries are built natively on top of Ray such as Ray Tune for scaling hyperparameter search (http://tune.io), RLlib for scaling reinforcement learning (http://rllib.io), Ray Serve for scaling model serving (http://rayserve.org/), and RaySGD for scaling training (https://docs.ray.io/en/master/raysgd/raysgd.html).
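
To give a flavor, a minimal Tune example (toy objective; API as of the current 1.x docs) looks roughly like this:

    from ray import tune

    def objective(config):
        # toy "training loop" reporting a score back to Tune each step
        for step in range(10):
            tune.report(score=config["lr"] * step)

    # Tune fans the trials out across the Ray cluster
    tune.run(objective, config={"lr": tune.grid_search([0.001, 0.01, 0.1])})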

Some of the libraries are popular libraries on their own, which now integrate with Ray such as Horovod (https://eng.uber.com/horovod-ray/), XGBoost (https://xgboost.readthedocs.io/en/latest/tutorials/ray.html), and Dask for dataframes (https://docs.ray.io/en/master/dask-on-ray.html). While Dask itself has similarities to Ray (especially the task part of the Ray API), Dask also has libraries for scaling dataframes and arrays, which can be used as part of the Ray ecosystem (more details at https://www.anyscale.com/blog/analyzing-memory-management-an...).

Many Ray users start using Ray for one of the libraries (e.g., to scale training or hyperparameter search) as opposed to just for the core system.

*Emphasis on serverless*

Our goal with Ray is to make distributed computing as easy as possible. To do that, we think the serverless direction, which allows people to just focus on their code and not on infrastructure, is very important. Here, I don't mean serverless purely in the sense of functions as a service, but something that would allow people to run a wide variety of applications (training, data processing, inference, etc) elastically in the cloud without configuring or thinking about infrastructure. There's a lot of ongoing work here (e.g., to improve autoscaling up and down with heterogeneous resource types). More details on the topic https://www.anyscale.com/blog/the-ideal-foundation-for-a-gen....

If you're interested in this kind of stuff, consider joining us at Anyscale https://jobs.lever.co/anyscale.


> Our goal with Ray is to make distributed computing as easy as possible. To do that, we think the serverless direction, which allows people to just focus on their code and not on infrastructure, is very important.

I watched https://now.sh/ deteriorate from a simple, lovely CLI into a dystopian mess due to their push for serverless. They abandoned all other approaches and forced people to use it. Far from making things easier, it became a kafkaesque pipeline of dependencies and configuration settings just to get any small example deployed.

Things may be better now, but the experience was so jarring and off-putting that I haven't used now.sh for much of anything since. I used to use it for everything; somehow https://docs.ycombinator.lol/ is still running, a static site deployed back before their serverless stuff.

I don't know. You might be right. But just remember, the Ray library -- the actual python lib -- is your bread and butter. It's why everyone loves you. I urge you, never make the mistake of letting it deteriorate. It should be rock solid for everyone forever, with no need to interface with any of your serverless components. The day you try to monetize by trying to sneak in "value adds" by making the code "easy to integrate with your serverless stuff" is the day that you open yourself to bugs, and the temptation to ignore problems in other areas -- because after all, the serverless infra would be where you're making your money, so it makes sense to push everyone in that direction.

Ray is so excellent right now that it feels like a sports car. I hope it'll stay excellent for a decade to come. (All I want is the ability to recover from client failures in a way where, if there are tasks in flight, I can tell those tasks how to re-run once all the actors have reconnected. I'm sure there's already a way to do something like this; just haven't looked into the details quite yet.)

Best of luck, and thanks for the wonderful lib.

EDIT: https://www.anyscale.com/blog/the-ideal-foundation-for-a-gen... just gives me terrible feelings. Your best bet is to ignore me, because my gut is likely wrong here -- at 33, I'm starting to fall past the hill. But for example:

> Ray hides servers

Suppose a hacker wants to build an iOS app powered by Ray. They want to create a cluster of servers to process incoming tasks. The tasks are things like "Make memes with AI," artbreeder-style, and then send them back to the iOS client waiting for them. Then the user can enjoy their meme, you throw up a "if you like memes, give me a dollar and you can have all the memes you want," and a million people download your AI meme app and you become the Zuckerbezos of AI.

In that context, no one wants to hide servers. No one I know -- anywhere -- thinks it's a good idea. We don't want to rely on your magic solutions. We want to keep our servers running. Because my servers happen to be TPUs, and there's no way that TPUs are ever going to become serverless. But even before I was using TPU VMs, all I wanted to do was to just stick GPUs onto servers and send results around; the serverless stuff gave me the creeps. Perhaps this just means I was ineffective, though, and that everyone I know is also ineffective.

I know, I know... you're going to support your non-serverless offerings, and Ray will be wonderful forever, and it'll be roses and rainbows. I hope so. But just don't let the core library become priority #2. It should be priority #1 forever.


Thanks for the comments! A few quick notes

The term serverless is a bit overloaded. Here's what we want. (1) Ray users should be able to focus only on their application logic and not have to worry about configuring clusters at least in the common case (this is not saying that they can't configure clusters if they want to). (2) In addition, we want Ray applications to be portable and to run on different size clusters with different instance types on different cloud providers or k8s or your laptop or anywhere. The application should be decoupled from the cluster configuration. (3) Great support for autoscaling Ray clusters. When a Ray application needs more resources of a certain type (CPUs, GPUs, memory, etc), those resources should be added to the cluster. When they are no longer needed, the cluster should scale down.

This is quite different from FaaS, though seems in line with the spirit of serverless. And these "serverless" properties are not something we think of as separate from Ray, but rather as part of just making Ray work better.


Yeah, we’ll have to agree to disagree that that’s a good idea.

#3 is the crux of Ray’s power. I don’t understand why serverless is necessary. By default, I want to configure it to use my TPUs, which I control. I’m the one who creates or deletes TPUs — because I have to be! There’s no “TPU creation service” that can autoscale TPUs on demand.

Maybe there will be one day, but not today. And focusing on today will help you long term, because it’s a very real need that people actually have. It’s hard to predict what people will want, other than to cater to existing needs. It’s how every great startup got their footing: build something people want.

In that context, how could it be otherwise? When a TPU comes up, it connects to the ray cluster. That TPU cannot be shut down. I’m using it for training, for inference, for every reason I would be paying for. Interrupting those tasks isn’t something that I can “just be okay with.”

I agree that inference tasks should be able to die occasionally. But training tasks, all the time? It’s not an abstraction — you can’t pretend that the training task can run on an arbitrary TPU. It needs to run on a specific TPU, because that TPU has the model loaded.

What am I missing?


Just to add on here: it may be fixed now, but the most recent time I tried to use Ray on my on-prem cluster, it was quite difficult, as everything assumed I was in one of the clouds or on a single local laptop. I had used Ray pre-1.0, maybe 3 years ago, and it was trivial to set up this use case back then.

I don't remember the specifics right now, but my problem may have been related to having a local Kubernetes cluster that wasn't in a cloud. I'd be happy to dig through the details in my notes if it's worthwhile feedback; just reach out.


(Signal boosting this. I, too, found it incredibly difficult to set up Ray for my purposes because of all the cloud provider assumptions.)


Is there a plan to integrate more tightly with k8s, potentially in a multi-cluster/federated setting? It's a lot easier to get buy-in for Ray adoption from infra teams where k8s is the centralized compute substrate.


How does this compare to Dask?


I think they're decently well integrated now. Rather than starting a Dask cluster, you can run Dask on a Ray cluster:

https://docs.ray.io/en/master/dask-on-ray.html
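
Roughly, per that page (API as of the Ray 1.x docs; details may have moved since):

    import dask
    import dask.dataframe as dd
    import pandas as pd
    import ray
    from ray.util.dask import ray_dask_get

    ray.init()
    dask.config.set(scheduler=ray_dask_get)  # Dask task graphs now execute as Ray tasks

    ddf = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)
    print(ddf.x.mean().compute())  # 49.5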


Dask itself (and its commercial entity, Coiled) seems to be investing heavily in the PyData/ML library ecosystem. I feel like it's only a matter of time until they reach feature parity, which would be a huge win for existing users of the PyData stack.


Honestly looks really cool, I want to try using Ray.



