INTELLECT–1: Launching the First Decentralized Training of a 10B Parameter Model

PoignardAzur · 2024-10-12T11:18:44.000000Z

A lot of comments are sneering at various aspects of this press release, and yeah, there's some cringeworthy stuff.

But the technical aspects are pretty cool:

- Fault-tolerant training where nodes can be added and removed mid-run without interrupting the other nodes.

- Sending quantized gradients during the synchronization phase.

- (In the OpenDiLoCo article) Async synchronization.

They're also mentioning potential trustless systems where everyone can contribute compute, which would make this a truly decentralized open platform. Overall it'll be pretty interesting to see where this goes!

londons_explore · 2024-10-12T11:54:51.000000Z

> Sending quantized gradients during the synchronization phase.

I did this 9 years ago, works pretty well. I don't understand why all ML isn't async and quantized like that now. This project quantizes to 1 bit per weight and it works so well I didn't even make it configurable.

https://github.com/Hello1024/shared-tensor

radarsat1 · 2024-10-12T15:44:06.000000Z

> 1 bit per weight

Does this basically correspond to moving each weight either up or down by a fixed amount? I'm a bit surprised you don't at least need a "stay same" bit, but I suppose it could balance out over multiple iterations.

Interesting that it works at all. Although, thinking on it, I could see it maybe even having a nice regularizing effect where every layer would end up having similar weight magnitudes. (like projecting onto the local n-ball as mentioned in a paper posted recently on HN)

londons_explore · 2024-10-12T20:40:45.000000Z

This is for keeping the weight vectors in sync between two machines.

The weight vectors themselves are regular floats. But the data exchanged between the machines is 1 bit. Basically, you keep track of changes to the weight vector which haven't yet been propagated to the other machine. You quantize this to 1 bit per weight (i.e. a sign bit) and send it, together with a single scale factor X, accumulating the quantization error for the next sync iteration.

You choose X to be the RMS or some similar metric of the accumulated error.
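A minimal sketch of the scheme described above, not the shared-tensor code itself. The function names and the toy vector are hypothetical. For X it uses the mean absolute value of the residual rather than the RMS, as one "similar metric" for which the squared norm of the residual never grows from one sync to the next.

```python
def encode_1bit(residual):
    """Turn the not-yet-propagated changes into sign bits plus one scale factor.

    Returns (signs, scale, new_residual); new_residual is the quantization
    error that is carried over to the next sync iteration.
    """
    # Scale factor X. Assumption: mean absolute value instead of RMS.
    scale = sum(abs(r) for r in residual) / len(residual)
    # One bit per weight: the sign of the pending change (zero counts as +).
    signs = [1 if r >= 0 else -1 for r in residual]
    # Error feedback: keep whatever the 1-bit message failed to convey.
    new_residual = [r - s * scale for r, s in zip(residual, signs)]
    return signs, scale, new_residual


def decode_1bit(signs, scale):
    """Rebuild the (coarse) update on the receiving machine."""
    return [s * scale for s in signs]


# The sender has made these local changes; the receiver has seen none of them.
target = [0.5, -0.25, 0.125, 0.0]
residual = list(target)
received = [0.0] * len(target)

for _ in range(200):
    signs, scale, residual = encode_1bit(residual)
    update = decode_1bit(signs, scale)
    received = [w + d for w, d in zip(received, update)]

# Largest remaining disagreement between the two machines.
max_error = max(abs(t - w) for t, w in zip(target, received))
```

Each message costs one bit per weight plus a single float, and the receiver's copy still approaches the sender's because no error is ever discarded, only delayed.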

f_devd · 2024-10-12T17:22:33.000000Z

It has been more formally studied in signSGD[0], and empirically it's comparable to Adam in terms of behavior.

[0]: https://arxiv.org/pdf/1802.04434
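As a rough illustration of the "move each weight up or down by a fixed amount" reading, here is a toy sign-based update on a quadratic loss. It is a sketch only: the helper names, learning rate, and target values are made up for the example and are not taken from the paper.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def sign_sgd(grad_fn, weights, lr, steps):
    """Update each weight by exactly +/- lr per step, using only gradient signs."""
    w = list(weights)
    for _ in range(steps):
        g = grad_fn(w)
        w = [wi - lr * sign(gi) for wi, gi in zip(w, g)]
    return w


# Toy problem: minimize sum((w_i - t_i)^2), whose gradient is 2 * (w_i - t_i).
target = [1.0, -2.0, 0.5]


def grad(w):
    return [2 * (wi - ti) for wi, ti in zip(w, target)]


final = sign_sgd(grad, [0.0, 0.0, 0.0], lr=0.01, steps=300)
worst = max(abs(wi - ti) for wi, ti in zip(final, target))
```

With a fixed step size, each weight walks to within one step of its optimum and then hovers around it, which is the "balance out over multiple iterations" effect.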

oefrha · 2024-10-12T07:30:37.000000Z

Well, I don’t have 8xH100s, but if I did, I’m probably not gonna donate them to a VC-funded company. Remember “Open”AI?

https://pitchbook.com/profiles/company/588977-92

jgalt212 · 2024-10-12T12:12:49.000000Z

Very true, but if something similar were run by BOINC, I'd make a stab at contributing.

https://boinc.berkeley.edu/

csomar · 2024-10-12T12:44:16.000000Z

I don't know the intricacies of their VC deal. But if the data is open and users put in some amount of compute and then get the model, where is the possible harm? The trade is done and dealt. You provided some compute and got the model back, right? Unless I am misunderstanding something about their distributed model or not reading the fine print.

ukuina · 2024-10-12T05:19:52.000000Z

> Decentralized training of INTELLECT-1 currently requires 8x H100 SXM5 GPUs.

So, your garden-variety $0.5M desktop PC, then.

Cool, cool.

[1] https://viperatech.com/shop/nvidia-dgx-h100-p4387-system-640...

DannyBee · 2024-10-12T11:24:03.000000Z

If you run it continuously for a month, it will use 13x the electricity of your average California house.

So they really are a 10x company.

Average house is 571 kWh/month; this is 10.2 kW max * 24 * 30 = 7344 kWh.

This will cost you, in California, about $3000 a month depending on your power plan :)
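The arithmetic can be checked in a few lines. All inputs are the figures quoted in the comment; the per-kWh rate is not stated there and is only what the $3000 estimate would imply.

```python
HOUSE_KWH_PER_MONTH = 571   # average house, as quoted above
NODE_MAX_KW = 10.2          # maximum draw of the 8x H100 system, as quoted above
HOURS_PER_MONTH = 24 * 30

# Energy used by the node running flat out for a 30-day month.
node_kwh = NODE_MAX_KW * HOURS_PER_MONTH

# How many average houses that corresponds to.
ratio = node_kwh / HOUSE_KWH_PER_MONTH

# Electricity price implied by the $3000/month estimate (derived, not quoted).
implied_rate = 3000 / node_kwh

print(f"{node_kwh:.0f} kWh, {ratio:.1f}x a house, ${implied_rate:.2f}/kWh implied")
```

So "13x" is the ratio rounded up from about 12.9, and the cost estimate assumes a rate of roughly $0.41 per kWh.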

01HNNWZ0MV43FF · 2024-10-13T04:44:07.000000Z

What if I run it for a year?

ikeashark · 2024-10-12T08:35:56.000000Z

me: Oh cool, a project like Folding@Home but for AI compute, maybe I'll contribute as we-

> Decentralized training of INTELLECT-1 currently requires 8x H100 SXM5 GPUs.

me: and for that reason, I'm out

Also, they state that later they will add the ability for you to contribute your own compute. But how will they solve the problem of having to back-propagate to all of the remote nodes contributing to the project without egregiously slow training times?

macrolime · 2024-10-12T12:47:51.000000Z

Not exactly what I would call decentralized training. More like distributed through multiple data centers.

Decentralized training would be when you can use consumer GPUs. That's not likely to work with backpropagation directly, but it might with one of the backpropagation-approximating algorithms.

dartos · 2024-10-12T13:26:57.000000Z

Didn’t BLOOM do this with their Petals tool?

m3kw9 · 2024-10-12T03:22:16.000000Z

But I can already train from 30 different vendors distributed across the US, so why do I need to use a “decentralized” training system? Decentralized inference makes more sense, as that is where things can be censored.

dmitrygr · 2024-10-11T22:15:05.000000Z

> solve decentralized training step-by-step to ensure AGI will be open-source, transparent, and accessible

One hell of an uncited leap from "we're multiplying a lot of numbers" to "AGI", as if it is a given

DannyBee · 2024-10-12T11:19:36.000000Z

Well, I mean, it's a group of people who are doing "open, decentralized" training that requires half a million dollars' worth of non-consumer hardware and $3000 a month in electricity. Would you expect anything less than Silicon Valley-level arrogance?

mountainriver · 2024-10-12T02:29:24.000000Z

This is cool work, I’ve been watching the slow evolution of this space for a couple years and it feels like a good way we can ensure AI is owned and accessible to everyone.

James_K · 2024-10-12T14:39:43.000000Z

My initial reaction was quite negative, but having thought it through, I can see the logic in this. Having open models is better than closed models. That said, this page seems like a joke. Someone drank a little too much AI-koolaid methinks.

openrisk · 2024-10-12T11:41:12.000000Z

For some purposes a decentrally trained, open source LLM could be just fine? E.g. you want a stochastic parrot that is trained on a large, general purpose corpus of genuine public domain / creative commons content. Having such a tool widely available is still a quantum leap versus Lorem Ipsum. Up to a point you can take your time. There is no manic race to capitalize on any hype. "slow open AI" instead of "fast closed AGI". Helpfully, the nature of the target corpus does not change every day. You can imagine, e.g., annual revisions, trained and rolled out leisurely. Both costs and benefits get widely distributed.

not_a_dane · 2024-10-12T10:26:51.000000Z

Decentralised but very high entry barrier.

nickpsecurity · 2024-10-12T14:13:39.000000Z

The main benefit of this type of decentralization seems to be minimizing the node cost. One can rent the cheapest nodes to use in the system. Even the temporary instances can be replaced with others. It’s also easy for system owners to donate time.

So, mostly cost reduction mixed with some cloud vendor diversity.

pizza · 2024-10-12T07:34:12.000000Z

So just spitballing here but this is likely a souped-up reverse engineered DisTrO [0] under the hood, right? Or could it be something else?

[0] https://www.youtube.com/watch?v=eLMJoCSjFbs

mt_ · 2024-10-12T11:14:53.000000Z

> We quantize the pseudo-gradients to int8, reducing communication requirements by 400x.

Can someone explain if it does reduce the model quality overall?

vessenes · 2024-10-12T15:37:58.000000Z

To give some intuition here, it’s not crazy to think that getting a bunch of different 8 bit precision information intended to be combined would get you roughly 32 bits of precision. Especially when it’s not always (often?) the case that for a particular weight you’ll need the edges of that mantissa.

PoignardAzur · 2024-10-12T11:19:21.000000Z

> In our experiments, we found that we are able to perform int8 quantization on the pseudo gradients without any impact on the loss curves.

Allegedly not?

empiko · 2024-10-12T11:46:26.000000Z

The gradients are noisy as they are, this additional noise probably does not hurt that much overall
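For a sense of how small that extra noise is, here is a sketch of symmetric int8 quantization with a single max-abs scale per tensor. This is an assumed scheme chosen for illustration; the thread does not say which int8 variant INTELLECT-1 uses, and the sample values are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floats on the receiving side."""
    return [x * scale for x in q]


# Hypothetical pseudo-gradient values.
pseudo_grad = [0.031, -0.002, 0.0005, -0.047, 0.012]

q, scale = quantize_int8(pseudo_grad)
restored = dequantize_int8(q, scale)

# Worst-case round-trip error; rounding bounds it by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(pseudo_grad, restored))
```

Each value is off by at most half a quantization step, i.e. about 0.4% of the largest magnitude in the tensor. Going from 32-bit floats to int8 shrinks the payload per value by 4x on its own; the thread does not detail where the rest of the quoted 400x comes from.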

monkeydust · 2024-10-12T09:17:44.000000Z

Yea, come back when you can do this on BOINC.

saulrh · 2024-10-12T03:13:33.000000Z

> Prime Intellect

Ah, yes, Prime Intellect, the AGI that went foom and genocided the universe because it was commanded to preserve human civilization without regard for human values. A strong contender for the least evil hostile superintelligence in fiction. What a wonderful thing to name your AI startup after. What's next, creating the Torment Nexus?

(my position on the book as a whole is more complex, but... really? Really?)

robertclaus · 2024-10-12T05:26:25.000000Z

You may as well just go with Roko's Basilisk.

cmrx64 · 2024-10-12T04:32:19.000000Z

Least evil… strong words.

saulrh · 2024-10-12T05:49:11.000000Z

It did host a successful and substantially-satisfying human civilization, at least until it let a couple of presumptuous self-important anarchoprimitivists kill it and genocide its subjects. Even if it was only a temporary and unstable illusion of alignment, that's one more values-satisfying civilization than the overwhelming majority of paperclippers manage. So yeah. Good? No. Least evil? Maybe.

rep_lodsb · 2024-10-12T08:30:28.000000Z

>until it let a couple of presumptuous self-important anarchoprimitivists kill it and genocide its subjects

That could have just been their private simulation. As far as I remember, it wouldn't even have outright lied to them, just let them believe they talked it into destroying itself.

gryfft · 2024-10-12T10:43:12.000000Z

GP did specify least evil hostile SI.

QuesnayJr · 2024-10-12T10:18:25.000000Z

After reading that Torment Nexus post you didn't have the urge to name an AI product Torment Nexus? Really?