Release of Fugaku-LLM – a large language model trained on supercomputer Fugaku (fujitsu.com)
106 points by gslin 15 days ago | 42 comments



> GPUs (7) are the common choice of hardware for training large language models. However, there is a global shortage of GPUs due to the large investment from many countries to train LLMs. Under such circumstances, it is important to show that large language models can be trained using Fugaku, which uses CPUs instead of GPUs. The CPUs used in Fugaku are Japanese CPUs manufactured by Fujitsu, and play an important role in terms of revitalizing Japanese semiconductor technology.


ARM CPUs, specifically. Fairly unusual in the Top500 list.


For those curious about Fugaku, it’s currently the 4th fastest supercomputer in the TOP500 list: https://www.top500.org/system/179807/


It's based on a Fujitsu custom ARM chip, with 32GB HBM2 and 512-bit SIMD. There have been a few discussions over the years. I am not sure how such chips could replace GPUs if there is a shortage of them, as they're highly customized and fairly expensive to get hold of.

https://www.fujitsu.com/global/about/resources/publications/...


Does anyone know how this model compares to GPT-4 for Japanese output? Taking a look at GPT-4o today, the Japanese example in the landing page feels unnatural and represents a huge regression from the quality I would expect of GPT-4.

With that said, both GPT-4 and GPT-4o seem to do a good job at understanding the semantics of prompts written in Japanese. I would like to see how this model compares, given that it seems like it's trained with more Japanese data (but that may not necessarily be useful if all they did was scrape affiliate blogs)


I’ve gotten very good results with Japanese output in GPT-4 but it takes a little work.


You haven’t tried the recent Japanese-specialized GPT-4 variant? Hope it’s updated for GPT-4o.


The Japanese specialized GPT-4 variant is not generally available, per https://openai.com/index/introducing-openai-japan/.


So every day we're going to get a post about a new x.y.z LLM?


These are bubbling to the top due to high interest. When the interest subsides so will the posts. Are we fascinated by the wrong things?


In addition to that, it's WWDC/Google I/O season, which might have an effect on the announcement schedules of others as well.


We are always fascinated by the wrong things but that only becomes clear in retrospect.


It's almost as if LLMs have become an integral part of the modern development workflow, whether you approve or not! There would be less friction without the disapproval.


Yes!

Impressive advancements, right?


This honestly feels kind of silly. Back-of-the-envelope, this training run (13B model on 380B tokens) could be done in two months on a single 8x H100 node using off-the-shelf software, at a cost of around $35K from a cloud compute provider. They don't seem to list training time, but this cluster appears to use ~3.5 MW of power, so it's going to burn something like $500/hr just in electricity costs.
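
To make the assumptions explicit, here's the same back-of-the-envelope in code. The 6 * params * tokens rule of thumb is standard for dense transformers; the utilization and price figures are my assumptions, not anything from the article:

    # rough cost/time estimate for training a 13B model on 380B tokens with 8x H100
    params = 13e9                      # Fugaku-LLM: 13B parameters
    tokens = 380e9                     # ~380B training tokens
    total_flops = 6 * params * tokens  # ~3.0e22 FLOPs (6*N*D rule of thumb)

    per_gpu = 1e15 * 0.7               # assumed sustained FLOP/s per H100 (~70% of BF16 peak)
    gpus = 8
    seconds = total_flops / (gpus * per_gpu)
    print(seconds / 86400)             # ~61 days, i.e. roughly two months

    gpu_hours = gpus * seconds / 3600
    print(gpu_hours * 3.0)             # ~$35k at an assumed $3/GPU-hour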


The system uses 30MW, but this job used a portion of it that would consume 2.6MW.

There isn't really a figure for how much compute time it takes to train this thing, but 8x H100s have 32 PF of AI compute among them. This job had 2,100 (half precision[1]) PF-in-fugaku / 158,956 nodes-in-fugaku * 13,824 nodes-in-job = 182 PF-in-job, implying it can get the job done about 5.7x faster, or a little over ten days at the most optimistic.

Electricity costs for these nodes for ten days looks fairly similar to the rental costs of 8xH100s for 60 days according to my research. Lambda labs seems to have very cheap instances for 8xH100, but AWS and its ilk are much higher. However, the comparison is a little weird, as Fugaku is also a few years old now, and the contemporary GPU at the time of its release was the A100 (1/13th of an H100). The next Fujitsu chip may well narrow the power/performance gap between itself and (say) Blackwell or whatever is current at the time.

[1] https://www.fujitsu.com/global/about/innovation/fugaku/speci...
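
For anyone who wants to check the arithmetic, here it is spelled out, taking the peak figures above at face value (peak, not sustained, hence "at the most optimistic"):

    # share of Fugaku used for this job vs. a single 8x H100 node
    fugaku_pf = 2100.0                 # half-precision PFLOPS for all of Fugaku [1]
    fugaku_nodes = 158_956
    job_nodes = 13_824
    job_pf = fugaku_pf / fugaku_nodes * job_nodes   # ~182 PF for this job

    h100_node_pf = 32.0                # 8x H100, "AI compute" peak
    speedup = job_pf / h100_node_pf    # ~5.7x
    print(60 / speedup)                # ~10.5 days vs. the two-month 8x H100 estimate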


But the whole point of this was to show that it could be done on CPUs during the worldwide shortage of GPUs.

Doing it on CPUs was the whole point.


It could be done using abacuses too. A global shortage of GPUs is probably better addressed by producing more GPUs and making techniques more efficient.


I would love to see humans with abacuses calculating gradients, like in 3 Body Problem: https://i.imgur.com/pdz7xGh.jpeg


Yes, best scene on TV ever.


You are right. I assume that you have a fab ready to go and are licensed to produce CUDA capable GPUs?


I’m certain they’re being built.


So it's being addressed, and in the meantime someone gets to have some fun.


Yeah man, I’m down with using supercomputer clusters for anything. That’s just awesome. But the results seem more like marketing literature for GPU manufacturers to help rationalize more fab spend.

I'm sure it's used for all sorts of industrial purposes, but Fugaku/K has seemed a bit of a solution in search of a problem since NEC backed out (~2009). NEC was supposed to supply a vector-processing cluster interconnected with the Fujitsu-supplied scalar cluster that became K, but cancelled it, stating that vector processing was an outdated concept not worth pursuing during the cash crunch the company was going through.

I can't seem to find many technical documents on the NEC SX processors online, but I could find outrageously priced PCIe accelerator cards (released 2021, $18k, 300W, 4.91 TF FP32, 48GB VRAM) that are supposed to be their descendants, while Fugaku sits there GPU-less. Maybe such is life.


For comparison, here's 8B [Nemotron](https://huggingface.co/nvidia):

> 1,024 A100s were used for 19 days to train the model.

> NVIDIA models are trained on a diverse set of public and proprietary datasets. This model was trained on a dataset containing 3.8 Trillion tokens of text.
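
For what it's worth, those numbers are roughly self-consistent with the usual 6 * params * tokens estimate (the utilization figure is my guess):

    # sanity check of the quoted Nemotron training figures
    total_flops = 6 * 8e9 * 3.8e12          # ~1.8e23 FLOPs
    a100_peak = 312e12                      # dense BF16 FLOP/s per A100
    utilization = 0.4                       # assumed sustained fraction
    seconds = total_flops / (1024 * a100_peak * utilization)
    print(seconds / 86400)                  # ~16.5 days, close to the quoted 19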


There is probably a simple answer to this question, but why isn't it possible to use a decentralized architecture like in crypto mining to train models?


It's not a task which benefits much from dividing it into lots of small work units that all get processed in parallel without much communication between the nodes. It's naturally almost the complete opposite: it wants very high bandwidth between all the compute units, because each iteration of the training computes gradients for, and then updates, all the weights of the network. Splitting it up only slows it down: even if you were to distribute training amongst 10x the compute nodes, each of which was 10x faster, if your bandwidth drops to even 1/2 you're going to lose out. This is why all the really big models need a lot of very tightly integrated hardware.
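
To put rough numbers on it (illustrative assumptions, ignoring gradient compression and other tricks that partially hide the transfer):

    # naive data parallelism moves one full set of gradients per worker per step
    params = 13e9
    grad_bytes = params * 2            # ~26 GB of fp16/bf16 gradients

    home_link = 100e6 / 8              # 100 Mbit/s home connection, in bytes/s
    nvlink = 450e9                     # ~450 GB/s NVLink-class bandwidth, order of magnitude

    print(grad_bytes / home_link)      # ~2,000 s (~35 min) of transfer per step over the internet
    print(grad_bytes / nvlink)         # ~0.06 s per step inside a tightly coupled node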


Just like brains.


It seems to me like we get a lot of stuff done splitting up the brain work.


Can you copy a neural network, train each copy on a different part of the dataset, and merge them back together somehow?


No. Training is an offset relative to a starting point. If you distribute it from the same point, you'll get a bunch of unrelated offsets. It has to be serial - the output state of one training run is the input state of the next.

If you could do it, we'd already have SETI-like networks for AI.


I haven't touched this in a while, but you can train NNs in a distributed fashion, and what GP described is roughly the most basic version of data parallelism: there is a copy of the model on each node, each node receives a batch of data, and the gradients get synchronized after each batch (so they again start from the same point, like you mention).

Most modern large models cannot be trained on one instance of anything (GPU, accelerators, whatever), so there's no alternative to distributed training. They also wouldn't even fit in the memory of one GPU/accelerator, so there are even more complex ways to split the model across instances.
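
For the curious, a minimal sketch of that basic data-parallel scheme in PyTorch (not anyone's production code; in practice you'd use DistributedDataParallel or a framework that also handles the model sharding mentioned above):

    import torch
    import torch.distributed as dist

    # launch with e.g. `torchrun --nproc_per_node=2 train.py` (filename is hypothetical)
    dist.init_process_group("gloo")    # "nccl" on GPUs
    world = dist.get_world_size()

    model = torch.nn.Linear(128, 1)    # stand-in for a real network
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)

    for step in range(100):
        x = torch.randn(32, 128)       # each rank sees its own shard of the data
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        # synchronize: average gradients across all workers so every copy
        # starts the next step from the same point
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
        opt.step()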


And their bottleneck is what? Data transfer. The state is gigantic and needs to be frequently synchronized. That's why it only works with sophisticated, ultra-high-bandwidth, specialized interconnects. They employ some tricks here and there, but those don't scale that well, e.g. with MoE you get a factor-of-8 scaling, and it comes at the cost of a lower overall number of parameters. They of course do parallelism as much as they can at the model/data/pipeline levels, but it's a struggle even in a setting with the fastest interconnects on the planet. Those techniques don't transfer onto the networks normal people are using, and calling both "distributed" conflates two settings with dramatically different properties. It's a bit like saying you could make the L1 or L2 CPU cache bigger by connecting multiple CPUs with a network cable. It doesn't work like that.

You can't scale averaging parallel runs much. You need to munch through evolutions/iterations fast.

You can't, for example, start from a random state, schedule parallel training, average it all out, and expect to end up with a well-trained network in one step.

Every next step invalidates the input state for everything, and the state is gigantic.

It's dominated by huge transfers at high frequency.

You can't, for example, have 2 GPUs connected with a network cable and expect a speedup. You need to put them on the same motherboard to get any gains.

SETI, for example, is unlike that - it can be easily distributed: a partial read-only snapshot, intense computation, and a thin result submission.


Not disputing all of that, but telling the GP flat out "no" is incorrect, especially when distributed training and inference are the only way to run modern massive models.

Inference - you can distribute much better than training. You don't need specialized interconnects for inference.

The question was:

> > There is probably a simple answer to this question, but why isn't it possible to use a decentralized architecture like in crypto mining to train models?

> Can you copy a neural network, train each copy on a different part of the dataset, and merge them back together somehow?

The answer is flat out no.

It doesn't mean parallel computation doesn't happen. Everything, including a single GPU, is massively parallel computation.

Does copying happen? Yes, but it's short-lived and it dominates, i.e. data transfer is the bottleneck and they go out of their way to avoid it.

Distributing training in a decentralized-architecture fashion is not possible.


As mentioned, this is difficult. AFAIK the main reason is that the power of neural nets comes from the non-linear functions applied at each node ("neuron"), and thus there's nothing like the superposition principle[1] that would let you easily combine training results.

The lack of superposition means you can't efficiently train one layer separately from the others either.

That being said, a popular non-linear function in modern neural nets is ReLU[2] which is piece-wise linear, so perhaps there's some cleverness one can do there.

[1]: https://en.wikipedia.org/wiki/Superposition_principle

[2]: https://en.wikipedia.org/wiki/Rectifier_(neural_networks)
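
A tiny illustration of the superposition point, in case it helps: ReLU isn't additive, so contributions from separately trained pieces don't simply sum the way they would in a purely linear system.

    def relu(x):
        return max(x, 0.0)

    a, b = 1.0, -2.0
    print(relu(a + b))         # relu(-1) = 0.0
    print(relu(a) + relu(b))   # 1.0 + 0.0 = 1.0, not the same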


There are a lot of issues with federated learning.

Really depends on your problem, but in practice, the answer is usually "no".


There are multiple ways to train in parallel, and that's one of them:

https://pytorch.org/tutorials//distributed/home.html


Wouldn't every single participant need a copy of the entire training set?


That's the next big problem. And there need to be mechanisms to ensure that the network is not poisoned with undesirable input.




