Exo: Run your own AI cluster at home with everyday devices (github.com/exo-explore)
439 points by simonpure 45 days ago | 151 comments



It requires mlx, but that is an Apple Silicon-only library as far as I can tell. How is it supposed to run on (I quote) "iPhone, iPad, Android, Mac, Linux, pretty much any device"? Has it been tested on anything other than the author's MacBook?


Repo maintainer here. It supports any device tinygrad does, which is a lot. We didn’t expect it to blow up so soon - the repo is still experimental. Internally we’ve mostly been testing on MacBooks and Mac Minis, and that’s where dev is happening. The Swift implementation is outdated and currently broken, since the Python code has been changing so fast (over 20 commits in the last day). On my to-do list is a CI/CD pipeline with integration tests for different device and network configurations, so we don’t randomly break things for certain devices.

We’re moving fast to get it stable and usable. The goal is for this to be as simple as running Dropbox. Bear with us :)


Why is there no Intel CPU/GPU support? It seems Intel CPUs/GPUs are dead in the AI community.


One of the maintainers has a video demo on his Twitter claiming iOS, Android, and Linux support. Some of the code is not released, and I wish they were advertising that properly.


The library already has tinygrad support it seems, so it's not limited to Apple & MLX


That is true. However (as of two days ago - it may have rapidly changed since then), the Python program did not differentiate based on your architecture and would try to import mlx regardless of whether it's installable on your system, causing import errors.


This is fixed now, with these commits: - https://github.com/exo-explore/exo/commit/dbbc7be57fb1871d2b... - https://github.com/exo-explore/exo/commit/ce46f000591d8d59c1...

Please keep the bug reports coming, we're moving fast to get this stable on all platforms.
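For anyone who hit the import error on non-Apple hardware before the fix, the general pattern is to gate the mlx import on the platform and fall back to another backend. A rough sketch of the idea (not exo's actual code):

    import platform

    def pick_backend():
        # Only attempt the mlx import on Apple Silicon Macs;
        # everything else falls back to the tinygrad path.
        if platform.system() == "Darwin" and platform.machine() == "arm64":
            try:
                import mlx.core as mx  # Apple-only dependency
                return "mlx"
            except ImportError:
                pass
        return "tinygrad"

    print(pick_backend())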


The README says they plan to add llama.cpp support which should cover a lot of targets, also they have tinygrad already integrated I think.


This is a great idea and user friendly as well. It has the potential to convert multiple old devices overnight from being useless. However, I wish they had provided some results on tok/s and latency with some example setups.


We didn't expect this to blow up so quickly. A lot of work needs to be done on getting different setups working. I have made an issue here: https://github.com/exo-explore/exo/issues/11


This is great work! I will keep an eye on it (and maybe even try to contribute). Looking back at the beginning of Google, I think their use of commodity hardware and a hardware-agnostic platform likely contributed to supporting growth at lower cost. We need more of that in the AI era.


Thank you for the support! I agree on the cost point, and personally I don’t want to live in a world where all AI runs on H100s in a giant datacenter controlled by one company.


    This enables you to run larger models
    than you would be able to on any single
    device.
No further explanation on how this is supposed to work?

If some layers of the neural network are on deviceA and some layers are on deviceB, wouldn't that mean that for every token generated, all output data from the last layer on deviceA have to be transferred to deviceB?


Yes, so you would have a vector about 8k values long to be transferred on each token generated.

You could do that easily with any modern network.
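Back-of-the-envelope, assuming an 8192-wide hidden state in fp16 and a modest generation speed (both assumptions, not measurements), the traffic is tiny compared to even Wi-Fi bandwidth:

    hidden_dim = 8192        # assumed embedding width
    bytes_per_value = 2      # fp16
    tokens_per_second = 30   # assumed generation speed

    per_token = hidden_dim * bytes_per_value       # 16 KB per device-to-device hop
    per_second = per_token * tokens_per_second     # ~0.5 MB/s
    print(f"{per_token / 1024:.0f} KB per token, {per_second / 1e6:.2f} MB/s")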


That's exciting. So we could build a SETI@home style network of even the largest models.

I wonder if training could be done in this way too.


Repo author here. That's correct. The embeddings for Llama-3-8B are around 8KB-10KB. For Llama-3-70B they're around 32KB. These are small enough to send around between devices on a local network. For a SETI@home style network, latency will kill you if you go over the internet. That's why we're starting with local networks.



Ah yes. At first, I thought that since it is all one-way forward-only communication, latency would only affect the time to the first token.

But I guess the final output needs to be sent back to the first node before it can continue. So if there are 50 nodes with a latency of 40ms each, each token would take 2s to process.


Yeah, unfortunately the autoregressive nature of these models slows it down significantly with added device<->device latency. However, you can still max out on throughput with pipeline parallelism, where you overlap execution. See: https://pytorch.org/docs/stable/pipeline.html
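A toy calculation of that tradeoff, reusing the 50-node / 40ms figures from above (a simplification that lumps compute and network time into one per-stage number):

    stages = 50            # devices in the pipeline
    stage_time = 0.040     # assumed seconds per stage (compute + network)

    # A single stream still pays the full pipeline latency per token...
    latency_per_token = stages * stage_time        # 2.0 s
    # ...but with overlapping requests, a token can complete every stage_time.
    aggregate_throughput = 1 / stage_time          # 25 tokens/s across requests

    print(latency_per_token, aggregate_throughput)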


For generating synthetic data you could have a SETI@home setup, if you consider each home as a node that generates some amount of data. I mean, such a setup could be built with Exo; I wouldn’t suggest including it as part of Exo itself.

Out of curiosity, would you ever support training or at least fine-tuning?


Yes, that’s how it works (pipeline parallelism)


Interesting. Let's do the math ...

Let's say the model has 50B parameters and 50 layers. That would mean about one billion values have to travel through the wifi for every generated token?

I wonder how much data that is in bytes and how long it takes to transfer them.


It's not the parameters that are sent, it's the layer outputs. That makes for a few thousand floats per token.


Whoops! I would have thought the number of neurons roughly equals the number of parameters, but you are right. The number of parameters is much higher.


The embedding size is only 8k, while the parameter count is 70B. So it's a huge difference.
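Rough numbers to make the contrast concrete (an 8k-wide hidden state and 2 bytes per value are my assumptions; fp32 would double both figures, which lines up with the ~32KB quoted elsewhere in the thread):

    params = 70e9
    hidden_dim = 8192
    bytes_per_value = 2   # fp16

    weights_gb = params * bytes_per_value / 1e9           # ~140 GB, stays put on the devices
    activation_kb = hidden_dim * bytes_per_value / 1024   # ~16 KB, crosses the network per token
    print(weights_gb, activation_kb)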


Swarm compute should be the norm for all compute - there is so much unused CPU across all the devices we collectively own.


I'd rather my CPU be idle and not consume much power.


It depends. There are a lot of devices with quite capable CPUs that are mostly doing nothing.


I also prefer my phone to not be hot and constantly plugged in. Or for my ML workload to suddenly get slow because my partner drove the car out of range of the WiFi. Or to miss notifications because my watch's CPU was saturated.


Fair point. OTOH, something like a Tesla must have a lot of computing power that's just mostly idle when parked and charging at home.

That is unless Tesla is already using their idle cars as remote datacenters, crunching numbers to get better at self-driving.


This might not work for use cases where you need low latency, but for longer winded processing it would be amazing if possible.

For example, if I have a few servers, laptop (connected to power) as well as a desktop PC and they’re all connected to a fast local network, it’d be great to distribute the task of rendering a video or working with archive files across all of them.


Those are precisely two examples that benefit from single-core compute power and are wholly unsuited to distributed computing…


Distributed rendering farms have existed for a while.


They render a single frame though. Admittedly, so does video rendering.


This exists: https://aihorde.net/

I haven’t tried it, and not the norm, but I agree it should be more common. We have a global supercomputer with higher latency, but still a supercomputer.


I might just still be too tired from just waking up, but I can’t for the life of me find any details on that site about what models are actually being served by the horde?


Go to https://aihorde.net/api/, scroll down to /v2/status/models, and click Try it out and then Execute. It's an enormous list and I think it can be dynamically updated, so that's probably why it isn't listed on the website.
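If clicking through the Swagger UI is a pain, the same endpoint can be hit directly; a minimal sketch with requests (the exact fields in the response are my assumption - "name" is the one I'd expect):

    import requests

    # /v2/status/models is the endpoint mentioned above.
    resp = requests.get("https://aihorde.net/api/v2/status/models", timeout=30)
    resp.raise_for_status()
    for model in resp.json()[:10]:
        print(model.get("name"))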


I used to be excited about running models locally (LLM, stable diffusion etc) on my Mac, PC, etc. But now I have resigned to the fact that most useful AI compute will mostly be in the cloud. Sure, I can run some slow Llama3 models on my home network, but why bother when it is so cheap or free to run it on a cloud service? I know Apple is pushing local AI models; however, I have serious reservations about the impact on battery performance.


Maybe you want to conduct experiments that the cloud API doesn't allow for.

Perhaps you'd like to plug it into a toolchain that runs faster than API calls can be passed over the network? -- eventually your edge hardware is going to be able to infer a lot faster than the 50ms+ per call to the cloud.

Maybe you would like to prevent the monopolists from gaining sole control of what may be the most impactful technology of the century.

Or perhaps you don't want to share your data with Microsoft & Other Evils (formerly known as dont be evil).

You might just like to work offline. Whole towns go offline, sometimes for days, just because of bad weather. Nevermind war and infrastructure crises.

Or possibly you don't like that The Cloud model has a fervent, unshakeable belief in the propaganda of its masters. Maybe that propaganda will change one day, and not in your favor. Maybe you'd like to avoid that.

There are many more reasons in the possibility space than my limited imagination allows for.


It is not like strong models are at a point where you can 100% trust their output. It is always necessary to review LLM generated text before using it.

I'd rather have a weaker model which I can always rely on being available than a strong model which is hosted by a third party service that can be shut down at any time.


> I'd rather have a weaker model which I can always rely on being available than a strong model which is hosted by a third party service that can be shut down at any time.

Every LLM project I’ve worked with has an abstraction layer for calling hosted LLMs. It’s trivial to implement another adapter to call a different LLM. It’s often done as a fallback/failover strategy.

There are also services that will merge different providers into a unified API call if you don’t want to handle the complexity on the client.

It’s really not a problem.


Suppose you live outside of America and the supermajority of LLM companies are American. You want to ask a question about whisky distillation or abortion or anything else that's legal in your jurisdiction but not in the US, but the LLM won't answer.

You've got a plethora of cloud providers, all of them aligned to a foreign country's laws and customs.

If you can choose between Anthropic, OpenAI, Google, and some others... well, that's really not a choice at all. They're all in California. What good does that do an Austrian or an Australian?


Personally I found the biggest problem for local models is the lack of integrations: they can't search the web, they can't use Wolfram Alpha for math, etc.

LLMs are great as routers; only rarely are they good at doing something on their own.


> eventually your edge hardware is going to be able to infer a lot faster than the 50ms+ per call to the cloud.

This is interesting. Is that based on any upcoming technology improvement already in the works?


GP is likely referring to network latency here. There's a tradeoff between smaller GPUs/etc at home that have no latency to use and beefier hardware in the cloud that have a minimum latency to use.


Sure, but if the model takes multiple seconds to execute, then even 100 milliseconds of network latency seems more or less irrelevant


Comms is also the greatest battery drain for a remote edge system. Local inference can allow for longer operation, or operation with no network infra.


Excellent points. Being able to use available hardware in unison is amazing, and I guess we are not far away from botnets utilising this kind of technology, like they did with mining coins.


Also hosted models are often censored and refuse talking about various topics.


Don't services like RunPod solve half of these concerns?


> Sure, I can run some slow Llama3 models on my home network, but why bother when it is so cheap or free to run it on a cloud service?

Obvious answer: because it's not free, and it's not cheap.

If you're playing with a UI library, let's say Qt... would you:

a) install the community version and play with ($0)

b) buy a professional license to play with (3460 €/Year)

Which one do you pick?

Well, the same goes here. It turns out that renting a server large enough to run big (useful, >8B) models is actually quite expensive. The per-API-call costs of real models (like GPT-4) add up very quickly once you're doing non-trivial work.

If you're just messing around with the tech, why would you pay $$$$ just to piss around with it and see what you can do?

Why would you not use a free version running on your old PC / mac / whatever you have lying around?

> I used to be excited about running models locally

That's an easy position to be in once you've already done it and figured out: yes, I really want the pro plan to build my $StartUP App.

If you prefer to pay for an online service and you can afford it, absolutely go for it; but isn't this an enabler for a lot of people to play and explore the tech for $0?

Isn't having more people who understand this stuff and can make meaningful (non-hype) decisions about when and where to use it good?

Isn't it nice that if Meta releases some 400B Llama 4 model, most people can play with it, not just the ones with a $7000 Mac Studio? ...and keep building the open source ecosystem?

Isn't that great?

I think it's great.

Even if you don't want to play, I do.


Right, I think people here are vastly underestimating this idea of

"What if I want to play around with really PERSONAL stuff."

I've been keeping a digital journal about my whole life. I plan to throw that thing into an AI to see what happens, and you can be damn sure that it will be local.


Yes, I am with you 100% and keep several LLaMAs on my workstation for that reason. I use OpenRouter for everything else. Everything that isn't sensitive goes to one of the big-kid models because they are just sooooo much better. LLaMA 400B might be the start of running with the big kids, but I know we are not close with the currently available models.


I’m a bit confused. Your reasoning doesn’t align with the data you shared.

The startup costs for just messing around at home are huge: purchasing a server and GPUs, paying for electricity, time spent configuring the API.

If you want to just mess around, $100 to call the world’s best API is much cheaper than spending $2-7k on a Mac Studio.

Even at production level traffic, the ROI on uptime, devops, utilities, etc would take years to recapture the upfront and on-going costs of self-hosting.

Self hosting will have higher latency and lower throughput.


You are vastly overestimating the startup cost. For me this week it was literally these commands:

    pacman -S ollama
    ollama serve
    ollama run llama3

My basic laptop with about 16 GB of RAM can run the model just fine. It's not fast, but it's reasonably usable for messing around with the tech. That's the "startup" cost. Everything else is a matter of pushing scale and performance, and yes that can be expensive, but a novice who doesn't know what they need yet doesn't have to spend tons of money to find out. Almost any PC with a reasonable amount of RAM gets the job done.
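And once `ollama serve` is running, calling it from code is one HTTP request against its local REST API (default port 11434); a minimal sketch, assuming llama3 has already been pulled:

    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
        timeout=300,
    )
    print(resp.json()["response"])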


llama3 at 8 billion params is weak sauce for anything serious; it just isn't in the same galaxy as Sonnet 3.5 or GPT-4o. The smaller and faster models like Phi are even worse. Once you progress past asking trivial questions to a point where you need to trust the output a bit more, it's not worth the effort in time, money, and/or sweat to run a local model to do it.

A novice isn't going to know what they need because they don't know what they don't know. Try asking a question to LLaMA 3 at 8 billion and the same question to LLaMA 3 at 70 billion. There is a night and day difference. Sonnet, Opus and GPT-4o run circles around LLaMA 3 70b. To run LLaMA at 70 billion you need serious horse power as well, likely thousands of dollars in hardware investment. I say it again... the calculus in time, money, and effort isn't favorable to running open models on your own hardware once you pass the novice stage.

I am not ungrateful that the LLaMA's are available for many different reasons, but there is no comparison between quality of output, time, money and effort. The API's are a bargain when you really break down what it takes to run a serious model.


Using an LLM as a general purpose knowledge base is only one particular application of an LLM. And one which is probably best served by ChatGPT etc.

A lot of other things are possible with LLMs using the context window and completion, thanks to their "zero shot" learning capabilities. Which is also what RAG builds upon.


I’m familiar with local models. They’re fine for chatting on unimportant things.

They do not compare to the giant models like Claude Sonnet and GPT4 when it comes to trying to use them for complex things.

I continue to use both local models and the commercial cloud offerings, but I think anyone who suggests that the small local models are on par with the big closed hosted models right now is engaging in wishful thinking.


People have gotten manageable results on all sorts of hardware. People have even squeezed a few tokens/second out of Raspberry Pis. The small models are pretty performant - they get good results on consumer gaming hardware. My 2021 laptop with a 3070m (only 8GB VRAM) runs 8B models faster than I can read, and even the original M1 chips can run the models fine.


You are right, of course... IF your metric for manageable/usable is measured only in tokens per second (tok/s).

If your metric is quality of output, time, money and tok/s, there is no comparison; local models just aren't there yet.


And why would you buy a Mac Studio? You could build a reasonable GPU-accelerated Linux box for well under $1500. For example: https://pcpartpicker.com/guide/BCWG3C/excellent-amd-gamingst...


Devs that refuse to move off Apple are severely disadvantaged in the LLM era.


lol tell that to the 3 year old laptop with 64 GB of RAM that I use exclusively for local LLMs while dev’ing on my work laptop with 96 GB of RAM…


> The startup costs for just messing around at home are huge

No, they are zero.

Most people have extra hardware lying around at home they're not using. It costs nothing but time to install python.

$100 is not free.

If you can't be bothered, sure thing, slap down that credit card and spend your $100.

...but, maybe not so for some people?

Consider students with no credit card, etc; there are a lot of people with a lot of free time and not a lot of money. Even if you don't want to use it do you do seriously think this project is totally valueless for everyone?

Maybe, it's not for you. Not everything has to be for everyone.

You are, maybe, just not the target audience here?


> You are, maybe, just not the target audience here?

The difference between an open model running on a $100 computer and the output from GPT4 or Claude Sonnet is huge.

I use local and cloud models. The difference in productivity and accuracy between what I can run locally and what I can get for under $100 of API calls per month is huge once you get past basic playing around with chat. It’s not even close right now.

So I think actually you are not the target audience for what the parent comments are talking about. If you don’t need cutting edge performance then it’s fun to play with local, open, small models. If the goal is to actually use LLMs for productivity in one way or another, spending money on the cloud providers is a far better investment.

Exceptions of course for anything that is privacy-sensitive, but you’re still sacrificing quality by using local models. It’s not really up for debate that the large hosted models are better than what you’d get from running a 7B open model locally.


And isn't it a bit entitled to claim that "Most people have extra hardware lying around at home"? Your story doesn't sound plausible at all.


Most people who would want to be running machine learning models probably have some hardware at home that can handle a slow task for playing around and determining if it is worthwhile to pay out for something more performant.

This is undoubtedly entitled, but thinking to yourself huh, I think it's time to try out some of this machine learning stuff is a pretty inherently entitled thing to do.


This project is literally aiming to run on devices like old phones.

I don't think having an old phone is particularly entitled.

I think casually slapping down $100 on whim to play with an API... probably, yeah.

/shrug


According to this tweet, Llama 3 costs about $0.20 per Million tokens using an M2.

https://x.com/awnihannun/status/1786069640948719956

In comparison, GPT3.5-turbo costs $0.50 per million tokens.

Do you think an old iPhone will be less than 2x as efficient?


FWIW it depends on the cost of power. Where I live, the cost of power is less than half the stated average.


> Well, the same goes. It turns out, renting a server large enough to run big (useful, > 8B) models is actually quite expensive. The per-api-call costs of real models (like GPT4) adds up very quickly once you're doing non-trivial work.

I run my own models, but the truth is most of the time I just use an API provider.

TogetherAI and Groq both have free offers that are generous enough that I haven't used them up in 6 months of experimentation, and TogetherAI in particular has more models and gets new models up quicker than I can try them myself.


I just prepay $20/mo to openrouter.ai and can instantly play with every model, no further signup required.


> Why would you not use a free version running on your old PC / mac / whatever you have lying around?

Because the old PC lying around can’t come anywhere near the abilities or performance of the hosted AI compute providers. Orders of magnitudes of difference.

The parent commenter is correct: If you want cutting edge performance, there’s no replacement for the hosted solutions right now.

Running models locally is fun for playing around and experimenting, but there is no comparison between what you can run on an old PC lying around and what you can get from a hosted cluster of cutting edge hardware that offers cheap output priced per API call.


We are running smaller models with software we wrote (self plug alert: https://github.com/singulatron/singulatron) with great success. There are obvious mistakes these models make (such as the one in our repo image - haha) sometimes but they can also be surprisingly versatile in areas you don't expect them to be, like coding.

Our demo site uses two NVIDIA GeForce RTX 3090s, and our whole team is hammering it all day. The only problem is occasionally high GPU temperature.

I don't think the picture is as bleak as you paint. I actually expect Moore's Law and better AI architectures to bring on a self-hosted AI revolution in the next few years.


I have found many similarities between home AI and home astronomy. The equipment needed to get really good performance is far beyond that available to the home user, however intellectually satisfying results can be had at home as a hobby. But certainly not professional results.


When learning and experimenting it could make a difference.


Why bother running models locally? Privacy, for once, or censorship resistance.


Also customizability. Sure, you can fine-tune the cloud hosted models (to a certain degree of freedom), but it will probably be expensive, inefficient, difficult and unmaintainable.


And offline access


For my advanced spell-checking use-case[^1], local LLMs are, sadly, not state-of-the-art. But their $0 price-point is excellent to analyze lots of sentences and catch the most obvious issues. With some clever hacking, the most difficult cases can be handled by GPT4o and Claude. I'm glad there is a wide variety of options.

[^1] Hey! If you know of spell-checking-tuned LLM models, I'm all ears (eyes).


I think the floating point encoding of LLMs is inherently lossy; add to that the way tokenization works. The LLMs I've worked with "ignore" bad spelling and correctly interpret misspelled words. I'm guessing that for spelling LLMs, you'd want tokenization at the character level, rather than a byte pair encoding.
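You can see the tokenization problem directly: a misspelled word usually shatters into several subword fragments instead of one token. A quick sketch with OpenAI's tiktoken library (the exact splits depend on the tokenizer, so treat the output as illustrative):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["definitely", "definately"]:
        pieces = [enc.decode([tok]) for tok in enc.encode(word)]
        print(word, "->", pieces)
    # The model only ever sees these fragments, never individual characters,
    # which is why spelling is an awkward task for it.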

You could probably train any recent LLM to be better than a human at spelling correction though, where "better" might be a vague combination of faster, cheaper, and acceptable loss of accuracy. Or maybe slightly more accurate.

(A lot of people hate on LLMs for not being perfect, I don't get it. LLMs are just a tool with their own set of trade offs, no need to get rabid either for or against them. Often, things just need to be "good enough". Maybe people on this forum have higher standards than average, and can not deal with the frustration of that cognitive dissonance)


I'm not sure why you have resigned yourself to that?

If you don't care about running it locally, just use it online. Everything is good.

But you can run it locally already. Is it cheap? No. Are we still in the beginning? Yes. We are still in a phase where this is a pure luxury, and just getting into it by buying a 4090 is still relatively cheap in my opinion.

Why run it locally, you ask? I personally think running anythingllm and similar frameworks on your own local data is interesting.

But I'm pretty sure in a few years you will be able to buy cheaper ML chips for running models locally fast and cheap.

Btw, at least I don't know of an online service which is uncensored, has a lot of LoRAs to choose from, and is cost effective. For just playing around with LLMs, sure, there are plenty of services.


https://x.com/karpathy/status/1814038096218083497

LLMs will start shrinking massively in size soon, without any loss in performance.


I have a 2 year old Thinkpad and I wouldn't necessarily call llama3 slow on it. It's not as fast as ChatGPT but certainly serviceable. This should only help.

Not sure why you're throwing your hands up, because this is a step towards solving your problem.


> why bother when it is so cheap or free to run it on a cloud service?

For the same reasons that we bother to use Open Source software instead of proprietary software.


What do you mean by useful here?

I'm asking because I've had the exact OPPOSITE thought. The intersection of Moore's Law and the likelihood that these things won't end up as some big unified singularity brain, but instead as little customized use cases, makes me think that running at home/office will perhaps be just as appealing.


I don't want people I don't know snooping around in my experiments.


> Sure, I can run some slow Llama3 models on my home network, but why bother when it is so cheap or free to run it on a cloud service?

Running locally, you can change the system prompt. I have Gemma set up on a spare NUC, and changed the system prompt from "helpful" to "snarky" and "kind, honest" to "brutally honest". Having an LLM that will roll its eyes at you and say "whatever" is refreshing.
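For anyone running an Ollama-style setup, that's just a system message in the request; a rough sketch (the persona text is whatever you want it to be):

    import requests

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "gemma",
            "stream": False,
            "messages": [
                {"role": "system",
                 "content": "You are a snarky, brutally honest assistant."},
                {"role": "user", "content": "Review my weekend plans."},
            ],
        },
        timeout=300,
    )
    print(resp.json()["message"]["content"])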


Is this a hunch, or do you know of some data to back up your reservations?

Copilot+ PCs, which all run models locally, have the best battery life of any portable PC devices, ever.

These devices have in turn taken a page out of Apple Silicon’s playbook. Apple has the benefit of deep hardware and software integration that no one else has, and is obsessive about battery life.

It is reasonable to think that battery life will not be impacted much.


That doesn't seem totally reasonable. The battery life of an iPhone is pretty great if you're not actually using it, but if you're using the device hard, it gets hot to the touch and the battery gets drained. Playing resource-intensive video games maxes out the *PU, won't let the device sleep at all, and has a noticeable hit on battery life. Since inference takes a lot of compute to perform, it's hard to imagine it being totally free, battery-wise. It probably won't be as hard on the device as playing demanding video games non-stop, but I get into phone conversations with ChatGPT as it is, so I can imagine that being a concern if you're already low on battery.


What if you want to create transcripts for 100s of hours of private recorded audio? I for one do not want to share that with the cloud providers and have it get used as training data or be subject to warrentless search under the third party doctrine. Or what if you want to run a spicy Stable Diffusion fine-tune that you'd rather not have associated with your name in case the anti-porn fascists take over? I feel like there are dozens of situations where the cost is really not the main reason to prefer a local solution.
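For the transcription case in particular, staying fully offline is already straightforward; a minimal sketch with the open-source openai-whisper package (model size and file name are placeholders):

    import whisper

    # Everything here runs locally; no audio leaves the machine.
    model = whisper.load_model("base")          # larger checkpoints = better accuracy
    result = model.transcribe("recording_001.mp3")
    print(result["text"])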


Cloud cannot be beaten on compute / price, but moving to local could solve privacy issues and the world needs a second amendment for compute anyway.


> Cloud cannot be beaten on compute / price

Sorry, I can't let misinformation like that slide.

Cloud cost/benefit ratio is not good in many circumstances.

For hobbyists it works well because you run your job for very brief periods and renting is much cheaper than buying in those cases. Similarly, if your business usage is so low as to be effectively run once per day then cloud has major benefits.

However, if you are doing any kind of work that consumes more than 8hrs of computer time in a day, cloud is going to start being much more expensive.

The exact cost/benefit depends on the SKU and I'm mostly talking about CPU/Memory/Storage- for managed services like databases it's significantly worse, and I'm comparing to rented servers not self-hosting at home, which is significantly cheaper still.

Local hardware has downsides (availability, inflexibility), but it's faster and cheaper in almost all real workload scenarios where the compute would otherwise be completely idle/turned off >90% of the time.


I should have phrased it better. If you rent cloud compute from a big provider you will probably end up paying more than if you ran that same compute yourself, but the actual cost of that compute to the cloud provider is going to be lower, thanks to economies of scale. They will get a cheaper deal on hardware, electricity and almost anything else you would need.

On the lower end, you can't beat a cheap hetzner vps for price, reliability and compute if you ran it 24/7.


You can beat GPT-4/Claude in terms of price/performance for most things by a mile using fine-tuned models running in a colo. Those extra parameters give the chatbots the ability to understand malformed input and to provide off-the-cuff answers about almost anything, but small models can be just as smart about limited domains.


The problem is that once you say “fine tuned” then you have immediately slashed the user base down to virtually nothing. You need to fine-tune per-task and usually per-user (or org). There is no good way to scale that.

Apple can fine-tune a local LLM to respond to a catalog of common interactions and requests but it’s hard to see anyone else deploying fine-tuned models for non-technical audiences or even for their own purposes when most of their needs are one-off and not recurring cases of the same thing.


Not necessarily, you can fine tune on a general domain of knowledge (people already do this and open source the results) then use on device RAG to give it specific knowledge in the domain.
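The on-device RAG half of that is simpler than it sounds; a bare-bones sketch where embed() and generate() stand in for whichever local embedding model and LLM you run:

    import numpy as np

    def retrieve(query_vec, doc_vecs, docs, k=3):
        # Cosine similarity between the query and every document chunk.
        sims = doc_vecs @ query_vec / (
            np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
        return [docs[i] for i in np.argsort(sims)[::-1][:k]]

    def answer(question, embed, generate, docs, doc_vecs):
        context = "\n".join(retrieve(embed(question), doc_vecs, docs))
        prompt = f"Answer using only this context:\n{context}\n\nQ: {question}\nA:"
        return generate(prompt)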


I look forward to something similar being developed on top of Bumblebee and Axon, which I expect is just around the corner. Because, for me, Python does not spark joy.


Repo author here. This sounds interesting. Could you elaborate on the benefits of Bumblebee / Axon?


They run on the BEAM, and there are related IoT platforms like Nerves. I find that to be a much nicer runtime than (C)Python.

Edit: I don't know where else to begin. It's a runtime that has lightweight processes, excellent observability, absurdly good fault tolerance, really nice programming languages and so on. It's designed for distributed computing.


Fascinating, will check this out! I wanted to focus on Python first to build this quickly, test out ideas and iterate.

This seems like a good option for a switch.

Do you know if any of these can run on Apple/Android devices?


I avoid touching Apple devices but anything that can expose a Linux shell can run the BEAM. There are two main projects for small devices, https://nerves-project.org/ for more ordinary SoC-computers and https://www.atomvm.net/ for stuff like ESP32-chips.

On Android you've got Termux in F-Droid and can pull in whatever BEAM-setup you want. That's how I first started dabbling with the BEAM, I was using a tablet for most of my recreational programming and happened to try it out and got hooked.

Erlang is pretty weird, but it just clicks for some people so it's worth spending some time checking it out. Elixir is a really nice Python-/Ruby-like on the BEAM, but with pattern matching, real macros and all the absurdly powerful stuff in the Open Telecom Platform.


Just got https://github.com/distantmagic/paddler working across 2 machines on Windows for load balancing. This will be next level and useful for running Llama 400B across multiple machines. But it looks like Windows support is not there yet.


Since this is best over a local network, I wonder how easy you could make the crowdsourcing aspect of this. How could you make it simple enough for everyone that's physically in your office to join a network to train overnight? Or get everyone at a conference to scan a QR code to contribute to a domain specific model.


That’s where we want to get eventually. There’s a lot of work that needs to be done but I’m confident we’ll get there. Give us 3 months and it’ll be as simple as running Dropbox.


Question - if large clusters are reporting that they're seeing gains from using RDMA networks because communication overhead is a bottleneck, how is it possible that this thing is not massively bottlenecked running over a home network?


I suspect that most of the devices you'd expect to find in your consumer cluster are too small/slow to saturate the link.

Edit: it's also a matter of scale. You probably have a small number of small/slow devices in a consumer network versus a lot of large/fast devices in your enterprise cluster.


I haven't looked into exactly what this project is doing, but here's my understanding:

Inference across O(N) pre-trained hidden layers isn't exactly an "embarrassingly parallel" problem, but it is an "embarrassingly pipeline-able" problem (in the CPU sense of "pipelining.") Each device can keep just one or a few layers hot in their own VRAM; and also only needs to send and receive one small embedding (<1MB) vector per timestep — which is so trivial that it's easily achievable in realtime even if all the devices are on wi-fi, talking to the same router, in your "noisy" apartment where 100 other neighbours are on the same bands.

(To put it another way: running a single inference job, has more forgiving realtime latency+throughput requirements than game streaming!)

Assuming that you have a model that's too big for any of your home machines to individually hold; and that all you care about is performance for single-concurrent-request inference on that model — then in theory, you just need one GPU of one node of your homespun Beowulf GPU cluster to have enough VRAM to keep the single largest layer of your model always-hot; and then other smaller devices can handle keeping the smaller layers always-hot. And the result should be faster than "overloading" that same model on that single largest-VRAM device and having some layers spill to CPU, or worse yet, having the GPU have to swap layers in and out repeatedly with each inference step.

(Also, if you're wondering, in the case where a single machine/node has multiple GPUs — or a GPU+VRAM and also a CPU+RAM! — you can treat this as no different than if these were multiple independent nodes, that just-so-happen to have a very efficient pipeline communication channel between them. As the VRAM+computation cost of running inference far outweighs the communication overhead of forward propagation during inference, a home-network inference-pipelining cluster scheduler like this project, would still likely "schedule" the model's layers purely in consideration of the properties of the individual GPU+VRAM (or CPU+RAM), rather than bothering to care about placement.)
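The "schedule layers by device memory" part boils down to something like the sketch below; this is a simplification (a real scheduler would also weigh compute speed and link quality), not exo's actual code:

    def partition_layers(num_layers, device_mem_gb):
        """Give each device a contiguous block of layers, sized by its memory."""
        total = sum(device_mem_gb)
        bounds, start = [], 0
        for i, mem in enumerate(device_mem_gb):
            end = num_layers if i == len(device_mem_gb) - 1 else min(
                num_layers, start + round(num_layers * mem / total))
            bounds.append((start, end))   # device i runs layers [start, end)
            start = end
        return bounds

    # e.g. a 32-layer model over a 64 GB Mac, a 16 GB laptop and an 8 GB phone:
    print(partition_layers(32, [64, 16, 8]))   # [(0, 23), (23, 29), (29, 32)]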

---

That being said, AFAIK training is "pipeline parallelizable" exactly as inference is. And people training models do do this — but almost always only across multiple top-of-the-line GPUs in one machine; not across multiple machines.

When you think about what pipelining achieves for training — all you get is either:

1. the ability to use a bunch of small-aggregate-VRAM nodes to achieve the aggregate training capacity of fewer, larger-aggregate-VRAM nodes — but with more power consumption = higher OpEx; and where also, if you scale this to O(N), then you're dumping a quadratic amount of layer-propagation data (which is now both forward-prop and backprop data, and backprop data is bigger!) over what would likely be a shared network just to make this work. (If it's not a shared network — i.e. if it's Infiniband/other RDMA — then why did you spend all that CapEx for your network and not on your GPUs!?)

2. the ability to pipeline a bunch of large-aggregate-VRAM nodes together to train a model that will then never be able to be deployed onto any single node in existence, but can instead only exist as a "pipelined inference model" that hogs O(log N) nodes of your cluster at a time for any inference run. Which makes cluster scheduling hell (if you aren't just permanently wiring the scheduler to treat O(log N)-node groups as single "hyper-nodes"); makes it so that you'll never be able to practically open-source the model in a way anybody but other bigcorps could ever run it (if that's something you care about); and very likely means you're cutting the concurrent-inference-request-serving capacity of your huge expensive GPU cluster by O(log N)... which the product team that allowed that cluster to be budgeted is really not gonna like.

That being said, I imagine at some point one of these proprietary "Inference-as-a-Service" models has been trained at a layer size that puts it into pipelined-inference-only territory, temporarily. Doing so would be the ML engineer's equivalent to the CPU engineer's "we have no fundamentally clever advance, so this quarter we'll just crank up the clock frequency and deal with the higher TDP." (Heck, maybe GPT-4o is one of these.)

---

What people with GPU clusters want, is 1. for the output of the process to be a model that runs on a single (perhaps multi-GPU) node; and 2. for the process itself to be mostly-shared-nothing with as little cross-node communication burden as possible (such that it's just a question of building highly internally communication-efficient nodes, not so much highly-communication-efficient clusters.)

And both of those goals are achieved by sizing models so that they fit within a single node; continuously fanning out streams of training data to those nodes; and then periodically fanning back in model-weights (or model-weight deltas) in an AllReduce operation, to merge the learning of O(N) independently-training nodes to become the new baseline for those nodes.

(If you'll note, this architecture doesn't put any latency requirements on the network, only some monstrous throughput requirements [at the fan-in step] — which makes it a lot easier to design for.)
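At its simplest, the fan-in step is just an average of the per-node weights; a toy sketch of that AllReduce-style merge (real systems average gradients or use optimizer-aware merging, which is glossed over here):

    import numpy as np

    def allreduce_average(node_weights):
        """node_weights: one {param_name: ndarray} dict per node."""
        merged = {}
        for name in node_weights[0]:
            merged[name] = np.mean([w[name] for w in node_weights], axis=0)
        return merged
    # Every node then resumes training from the merged weights as its new baseline.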


Lovely answer full of helpful details, thank you!


Would be great if we could get some benchmarks on commonly available hardware setups.


So I just tried with 2x MacBook Pros (M2 64GB & M3 128GB) and it was exactly the same speed as with just 1 MacBook Pro (M2 64GB). Not exactly a common setup, but at least it's something.


Could you create a GitHub issue? There's a lot of work we'd like to do to improve this.


This is so hilariously bad. How does something like this end up needing to be a user-created GitHub issue instead of being caught before you guys launch?


I'm sure someone will show their benchmarks in a couple years!


This is great! I really wish Apple allowed your device to query a model you host instead of skipping to their cloud (or OpenAI). I'd love to have a Studio Pro running at home, and have my iPhone, iPad, Mac, and HomePod be able to access it instead of going to the cloud. That way I could have even more assured privacy, and I could choose what model I want to run.


Do you mean with Apple Intelligence? You can already query models you host from Apple using exo or even just local on-device inference.


Does this work with Siri? I'm not running the beta so am not familiar with the features and limitations, but I thought that it was either answering based on on-device inference (using a closed model) or Apple's cloud (using a model you can't choose). My understanding is that you can ask OpenAI via an integration they've built, and that in the future you may be able to reach out to other hosted models. But I didn't see anything about being able to seamlessly reach out to your own locally-hosted models, either for Siri backup or anything else. But like I said, I'm not running the beta!


Is Apple Silicon with a lot of memory (32GB and up) still considered a cheapish way to run models, or are there other options now?


A good Apple Silicon Mac with 32GB of RAM will cost you over $2,000 on sale. For that price you might as well buy an Nvidia machine instead; either two 3090s or a 64GB Jetson Orin board would be both cheaper and faster.

The markup on Apple hardware is so big that I just don't think "cheapish" will ever be a way to describe the position they hold in the AI market. Apple's current budget lineup gets smoked by an RTX 3060 in a cheap Linux homeserver; the bar for high-value AI has been raised pretty high.


Is it possible to use this for image recognition and the like? Not sure what the usage of this can be apart from as a chatbot.


Image recognition can generally be done very efficiently on a single commodity PC. Even a phone that is a few years old can do quite a lot. Or a Raspberry Pi. So it generally does not need distributed computing solutions. I am talking about models like YOLO, ResNet, MobileNets, etc.


You can use other models like a vision LLM, or use AI agents as well


I can't wait to see malware which downloads and runs LLMs on command from a remote C&C server.


This is the first time I've seen a tinygrad backend in the wild. Amusing that it's supposedly more stable than llama.cpp for this project.


Repo author here. Tinygrad changes rapidly, so I wouldn't say it's "more" stable, but it certainly supports more accelerators than llama.cpp. As George Hotz likes to say, it sits somewhere on the spectrum between llama.cpp and Mojo. No hand-written kernels; optimal kernels are generated and found by beam search.


How long before the accursed crypto kids try to tokenise token generation with Exo clusters?


What difference does it make? It's not like most GenAI provides more value than random tokens.


Crypto kids driving development of general purpose hardware is a win-win scenario


The all important question:

When there’s only one device left on the network, will it sing Daisy Bell?


Not yet, should I make an issue for it?


It'd be nothing but appropriate.


Bexowulf.


Does somebody know if it runs on a Raspberry Pi?


It *should* but I haven't tried it. I will try it. Updated in this issue:

We could also try raspberry pi + coral usb tpu (https://coral.ai/products/) - that might be a killer combo for super cheap home ai cluster.



> coral usb tpu

I thought those were so memory limited that there was no useful way to run an LLM on them?


It bothers me that they don't talk about security here, I don't like it at all.


You’re right. The assumption right now is that you’re running on trusted devices on your own local network. I will add a section in the README.


Anyone run this? Works?


The readme shows how to run it assuming you can run a python program on the device, so I expect it works with laptops and PCs but there's a note at the end of the page saying that the iOS app has fallen behind the python version so it's not clear to me how to get this running on your iphone or other such devices.


The "device" in question must be Apple Silicon because the `mlx` package is a hard dependency, or at least an ARM machine (I do not have any Apple Silicon Macbooks or ARM machines to run this). I tried tweaking this before realizing calls to this library is littered all over the repo. I don't really understand the AI ecosystem very well but it seems that the use of the `mlx` library should be supplanted by some other library depending on the platform machine. Until then, and the actual release of the iOS code somewhere, "everyday devices" is limited to premium devices that almost no one has more than one of. I'm looking forward to run this on other machine platforms and squeeze out what I can from old hardware laying around. Otherwise I doubt the tagline of the project.

Edit: to add on, the only evidence that this runs anywhere but Apple Silicon is the maintainer's Twitter where they show it running on two Macbook Pros as well as other devices. I'm not sure how many of those devices are not ARM.

I'm not throwing shade at the concept the author is presenting, but I'd appreciate if they could slow down functional commits (he is writing them right now as I type) and truthfully modify the documentation to state which targets are actually able to run this.


why ask? try it!


Some people value their time.


Okay, I'll say it: it will not work because of network bottlenecks. You need to be sending gigabytes of data.

So by definition you need (1) a good network, 20MB/s+, and (2) good devices.

This thing will not go any further than a cool demo on Twitter. Please prove me wrong.


Try it out - don't trust me!

The way this works is that each device holds a partition of the model (for now a continuous set of layers). E.g. let's say you have 3 devices and the model is 32 layers. Device 1 could hold layers 1-10, device 2 holds 11-20 and device 3 holds 21-32. Each device executes the layers it's responsible for and passes on the output of its last layer (the activations) to the next device.

The activations are ~8KB for Llama-3-8B and ~32KB for Llama-3-70B (it's linear in the number of parameters in that layer and Llama-3-70B has more layers). Generally the larger the model gets (in terms of parameters), the more layers it ends up having, so we end up with sub-linear scaling so I expect Llama-3-405B to have activations on the order of ~100KB.

This is totally acceptable to send over a local network. The main issue you run into is latency, not bandwidth. Since LLMs are autoregressive (tokens are generated serially), additional latency limits throughput. However, over a local network latency is generally very low (<5ms in my experience). And if not, it's still useful depending on the use-case since you can get a lot of throughput with pipeline parallelism (overlapping requests): https://pytorch.org/docs/stable/pipeline.html


Is this sensible? Transformers are memory bandwidth bound. Schlepping activations around your home network (which is liable to be lossy) seems like it would result in atrocious TPS.


"Transformers are memory bandwidth bound" - this is the precise reason why this makes sense. If a model doesn't fit into memory on a single device, it needs to be incrementally loaded into memory (offloading), which is bottlenecked by memory bandwidth. Splitting the model over multiple devices avoids this, instead trading off for latency of communicating between nodes. The network bandwidth requirements are minimal since only the activations (intermediary embeddings) are passed between devices. For Llama-3-8B these are ~10KB, for Llama-3-70B these are ~32KB.


It's worth noting that the number you're quoting is for embeddings between layers. If you split your model between 5 nodes, you will need to send those 32KB 5 times. Also, it's per token. Meaning if you process 1K tokens it turns into 32MB of data; 1M tokens, 32GB...


Unfortunately I don't see any licensing info, without which I'm not touching it. Which is too bad since the idea is really cool.


Thanks for pointing that out. Fixed: https://github.com/exo-explore/exo/blob/main/LICENSE


Excellent, thank you:)



