jwan584's comments

jwan584 · 2025-01-15T23:30:03 1736983803

A good talk on how Cerebras does power & cooling (8min) https://www.youtube.com/watch?v=wSptSOcO6Vw&ab_channel=Appli...

jwan584 · 2024-08-27T20:19:23 1724789963

batch size by Q4 will be solid double digits (cerebras employee)

moconnor · 2024-08-28T05:24:59 1724822699

Is that e.g. batch 16/32 for each operation e.g. 16-row matmuls in a pipeline? Or a pipeline of vector-math ops that has 16/32 stages? Is the pipeline also double digits deep?

jwan584 · on Dec 11, 2023

when you go from 1B to 175B, the model no longer fits in memory. so in practice you have to re-factor the model using tensor/pipeline parallelism. that's why it goes from 600 to 20K LOC.

nerpderp82 · on Dec 11, 2023

It doesn't look like Cerebras mentioned the most important part, by trading model complexity due to using a vastly more capable system, they could could refactor that 600 line model effortlessly and rerun.

They can watch different layers train and find out how to optimize training or quantization, etc.

It feels like they kinda missed the forest for the trees here. The article should have focused on model architecture optimization due to the small LoC and the system having ridiculous training capacity.

jwan584 · on Dec 11, 2023

Everyone knows Cerebras by their wafer scale chips. The less understood part is the 12TB of external memory. That's the real reason why large models fit by default and you don't have to chop it up in software ala megatron/deepspeed.

whimsicalism · on Dec 11, 2023

imo the benefits to chopping it up will always remain

jwan584 · on Sept 22, 2023

A helpful paper with the full recipe Cerebras uses to train LLMs and their process including: - Extensively deduplicated dataset (SlimPajama) - Hyperparameter search using muP - Variable sequence length training + ALiBi - Aggressive LR decay

jwan584 · on July 24, 2023

Meta announced a partnership with Qualcomm to bring LLMs to mobile. But 3B is a lot more compact than LLaMA's 7B.