
I don't think they rely on SRAM very much for training. https://cerebras.ai/blog/the-complete-guide-to-scale-out-on-... outlines the memory architecture, but it seems like they keep most of the storage off-wafer, which is how they scale to hundreds of GB of parameters with "only" tens of GB of SRAM.
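
For a sense of scale, here's a rough back-of-envelope sketch (my own numbers, not from the blog post) of why weights at that size can't live in on-wafer SRAM:

    # Rough arithmetic (my own illustration, not from the Cerebras blog):
    # weight footprints vs. tens of GB of on-wafer SRAM.

    def model_size_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
        """Weight footprint in GB at a given precision (2 bytes = FP16/BF16)."""
        return n_params_billion * 1e9 * bytes_per_param / 1e9

    sram_gb = 40  # assumed order of magnitude for on-wafer SRAM ("tens of GB")

    for params_b in (7, 70, 175):
        size = model_size_gb(params_b)
        fits = "fits" if size <= sram_gb else "does not fit"
        print(f"{params_b}B params -> {size:.0f} GB of weights ({fits} in ~{sram_gb} GB SRAM)")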

Seems like they support training on a bunch of industry-standard models. I think most of the customers in the training space are doing fine-tuning, right? The P and T in GPT stand for pre-trained - then you tune for your actual task. I don't think they will take over the insane computational effort of training Llama or GPT from scratch - those companies are using clusters that cost more than Cerebras' last valuation.

afaik they have the current SOTA language models for Arabic

MLPerf brings in exactly zero revenue. If they have sold every chip they can make for the next 2+ years, why would they be diverting resources to MLPerf benchmarking?

Artificial Analysis does good API-provider inference benchmarking and has evaluated Cerebras, Groq, SambaNova, the many Nvidia-based solutions, etc. IMO it makes way more sense to benchmark actual usable endpoints rather than submit closed and modified implementations to MLCommons. Graphcore had the fastest BERT submission at one point (when BERT was relevant lol) and it didn't really move the needle at all.


With Artificial Analysis I wonder if model tweaks are detectable. That's the benefit of a standardized benchmark: you're testing the hardware. If some inference vendor changes Llama under the hood, the changes are known. And of course, if you don't include precise reproduction instructions in your standardized benchmark, nobody can tell how much money you're losing (that is, how many chips are serving your requests).

Batched inference will increase your overall throughput, but each user will still see the original per-stream throughput number. It's not necessarily a memory-vs-compute issue in the same way training is; as far as I understand, it's more a function of the auto-regressive nature of transformer inference, which presents its own challenges.

If you have an H100 doing 100 tokens/sec and you batch 1,000 requests, you might be able to get to 100K tok/sec overall, but each user's request will still be outputting 100 tokens/sec, so the speed of the response stream stays the same. If your output stream is slow, batching might not improve the user experience, even if you get higher chip utilization / "overall" throughput.
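
To make the distinction concrete, a toy calculation (hypothetical numbers, not measurements of any real system):

    # Batching multiplies aggregate throughput, not the per-user stream speed.
    per_user_tok_per_s = 100   # assumed single-stream decode speed
    batch_size = 1000          # assumed number of concurrent requests
    output_tokens = 500        # assumed length of one response

    aggregate_tok_per_s = per_user_tok_per_s * batch_size        # what the chip delivers
    seconds_per_response = output_tokens / per_user_tok_per_s    # what each user waits

    print(f"aggregate throughput: {aggregate_tok_per_s:,} tok/s")       # 100,000 tok/s
    print(f"time to finish one response: {seconds_per_response:.0f} s") # still 5 s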


That was more of a WSE-1 problem, maybe? They switched to a new compute paradigm (details on their site if you look up "weight streaming") where they basically store the activations on the wafer instead of the whole model. For something very large (say, 32K context and 16K hidden dimension) this makes one layer's activations only 1-2 GB (at 16-bit or 32-bit). As I understand it, this was one of the key changes needed to go from single-system boxes to the supercomputing clusters they have been able to deploy.
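
The 1-2 GB figure checks out with simple arithmetic (my calculation, assuming one seq_len x hidden_dim activation tensor per layer):

    # seq_len x hidden_dim x bytes per value, for the sizes mentioned above.
    seq_len = 32 * 1024     # 32K context
    hidden_dim = 16 * 1024  # 16K hidden dimension

    for bytes_per_val, label in ((2, "16-bit"), (4, "32-bit")):
        gib = seq_len * hidden_dim * bytes_per_val / 2**30
        print(f"one layer's activations at {label}: {gib:.0f} GiB")
    # -> 1 GiB at 16-bit, 2 GiB at 32-bit, vs. hundreds of GB for the full weights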

The Nvidia bandwidth-to-compute ratio is more necessary because they are moving things around all the time. By keeping all the outputs on the wafer and only streaming the weights, you have a much more favorable bandwidth-to-compute requirement. And the number of layers becomes less impactful because what they store is the transient outputs.

This is probably one of the primary reasons they didn't need to increase SRAM for WSE-3. WSE-2 was developed under the old "fit the whole model on the chip" paradigm, but models eclipsed 1 TB, so the new solution is more scalable.
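
For anyone who hasn't read the weight-streaming material, here's a very loose sketch of the idea as I understand it (function names and shapes are mine, not Cerebras' actual API): activations stay resident while weights are fetched one layer at a time.

    import numpy as np

    def fetch_layer_weights(layer_idx: int, hidden: int) -> np.ndarray:
        """Stand-in for streaming one layer's weights in from external memory."""
        rng = np.random.default_rng(layer_idx)
        # scale so activations stay well-behaved in this toy example
        return (rng.standard_normal((hidden, hidden)) / np.sqrt(hidden)).astype(np.float16)

    def forward_weight_streaming(x: np.ndarray, n_layers: int) -> np.ndarray:
        hidden = x.shape[-1]
        for layer in range(n_layers):
            w = fetch_layer_weights(layer, hidden)  # only this layer's weights are resident
            x = np.maximum(x @ w, 0)                # activations stay put; weights are discarded
        return x

    acts = np.ones((8, 512), dtype=np.float16)  # toy activations, kept "on wafer" throughout
    print(forward_weight_streaming(acts, n_layers=4).shape)  # (8, 512)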

