As a former Groq engineer, I'd say the LPU branding is not really technically accurate. These are general-purpose compute chips with excellent software for transformer models. To the extent that the GroqChip is general-purpose, it is basically a highly parallel, high-performance, deterministic execution engine. It 'rivals' GPUs in the sense that both are obviously used for AI execution. The whitepapers cited are basically a perfect model of how the chips operate. As you can see, they are general-purpose tensor units.
That being said, yes the software is impressive and the engineering team at Groq is top-notch.
Unfortunately, the chips are mainly aimed at inference, and it seems like a lot of the large investments at the moment are being driven by training.
EDIT: In the spirit of full disclosure, I suppose I should point out that I own lots of Groq shares, so have every interest in their success.
Is there any quick summary of how exactly (architecturally) these LPUs differ from GPUs or TPUs? Intuitively, where does the speedup come from, and what tradeoffs are being made to achieve it?
There's a YouTube video somewhere. Intuitively, the speedup comes from the fact that the pipelining is controlled by software rather than hardware. There's no memory management unit. Memory operations and timing are controlled entirely by software. This would be terrible for CPUs, where register renaming, out-of-order execution, etc. really matter. For AI models, however, it doesn't matter much: tensor accesses are easily optimized by software, which knows exactly how the chip runs. Essentially, the chip can read memory, write memory, do vector operations, multiply matrices, and do permutations all in the same cycle, and in fact it can have several of these operations going on at the same time.
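To picture it, here's a toy sketch in Python (my own illustration, with made-up names like Bundle, W0, X0 — not the real Groq ISA or compiler output): each cycle is a bundle the compiler filled in ahead of time, and several functional units can be busy in the same bundle.

    # Toy model of software-scheduled execution (hypothetical; not the real ISA).
    # Each "cycle" is a bundle the compiler already filled in, so the hardware
    # just executes it -- there is nothing left to figure out at run time.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Bundle:
        cycle: int
        mem_read: Optional[str] = None    # tensor slice streamed in this cycle
        matmul: Optional[str] = None      # tile multiplied this cycle
        vector_op: Optional[str] = None   # e.g. an activation function
        mem_write: Optional[str] = None   # result streamed out this cycle

    # A compiler-produced schedule: work is pipelined so reads, multiplies,
    # vector ops and writes overlap in the same cycle.
    schedule = [
        Bundle(0, mem_read="W0"),
        Bundle(1, mem_read="X0"),
        Bundle(2, mem_read="W1", matmul="W0 @ X0 -> Y0"),
        Bundle(3, mem_read="X1", matmul="W1 @ X0 -> Y1", vector_op="relu(Y0)"),
        Bundle(4, matmul="W0 @ X1 -> Y2", vector_op="relu(Y1)", mem_write="relu(Y0)"),
    ]

    for b in schedule:
        busy = [u for u in ("mem_read", "matmul", "vector_op", "mem_write") if getattr(b, u)]
        print(f"cycle {b.cycle}: {len(busy)} unit(s) busy -> {busy}")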
The compiler obviously knows how to schedule all this to produce good timing. That's why the utilization is so high: every piece of the chip is being used. With a GPU, by contrast, you have to play a kind of 'code golf' and have a deep understanding of the architecture and the execution engine to figure out when a memory access is going to cause a pipeline stall.
If you understand the ISA of the GroqChip, you understand completely how the chip works. There is no pipeline stall. There is no waiting for memory. There is no memory hierarchy. Even the interconnects run at the same speed as the chip (or something like that... it's all in the paper, so public information), so when they network chips together, the latency between any two subunits of every chip is already known from the network topology and the chip timings.
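To make 'latency is known from the topology' concrete, here's a toy model (all numbers and names are invented by me, not Groq's actual figures): with fixed per-link costs and no arbitration, chip-to-chip latency is just a shortest-path sum you can compute before anything runs.

    # Toy deterministic interconnect (invented numbers; not any real hardware).
    # Fixed per-link cycle costs + no arbitration => latency is a compile-time constant.
    import heapq

    # adjacency: chip -> [(neighbor, fixed_link_cycles), ...]
    topology = {
        "chip0": [("chip1", 50), ("chip2", 50)],
        "chip1": [("chip0", 50), ("chip3", 50)],
        "chip2": [("chip0", 50), ("chip3", 50)],
        "chip3": [("chip1", 50), ("chip2", 50)],
    }

    def latency(src, dst):
        """Dijkstra over fixed link costs -> an exact cycle count, known statically."""
        dist = {src: 0}
        heap = [(0, src)]
        while heap:
            d, node = heapq.heappop(heap)
            if node == dst:
                return d
            if d > dist.get(node, float("inf")):
                continue
            for nbr, cost in topology[node]:
                nd = d + cost
                if nd < dist.get(nbr, float("inf")):
                    dist[nbr] = nd
                    heapq.heappush(heap, (nd, nbr))
        return None

    print(latency("chip0", "chip3"))   # 100 cycles in this made-up topology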
TL;DR: It's all very deterministic, and this works well for tensor operations. Parallelism is determined at compile time.
Source: https://www.youtube.com/watch?v=pb0PYhLk9r8 . This is basically exactly how the chip works. It's not some simplified architectural or marketing diagram. It's literally how it works.
>> There is no pipeline stall. There is no waiting for memory.
I mean, this is a bit of hyperbole. Even if things are deterministic, the compiler might not be able to schedule things optimally; there can be pipeline interlocks or memory bank conflicts (depending on the sizes involved) in some scenarios.
There are no implicit pipeline stalls. Any pipeline stall is detectable at compile time via trivial static analysis tools, letting engineers work around it or decide it's fine. Sorry if that was confusing.
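As a rough illustration of the kind of trivial static check I mean (my own sketch, not Groq's actual toolchain), a script can walk a compile-time schedule and flag any cycle where two operations hit the same memory bank:

    # Hypothetical static check over a compile-time schedule (illustration only).
    # Because every access is known ahead of time, conflicts show up by inspection;
    # no run-time detection hardware is needed.
    from collections import defaultdict

    # (cycle, unit, memory_bank) triples emitted by an imaginary compiler.
    schedule = [
        (0, "mem_read",  3),
        (1, "mem_read",  4),
        (1, "matmul",    None),   # no memory traffic from this op
        (2, "mem_read",  4),
        (2, "mem_write", 4),      # same bank, same cycle -> conflict
    ]

    accesses = defaultdict(list)
    for cycle, unit, bank in schedule:
        if bank is not None:
            accesses[(cycle, bank)].append(unit)

    for (cycle, bank), units in sorted(accesses.items()):
        if len(units) > 1:
            print(f"cycle {cycle}: bank {bank} hit by {units} -- reschedule or accept the stall")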
My mind was blown a night or two ago when Groq had end-to-end answers in ~0.25s. All my GPT-4 developer friends' minds were blown too. Then we got sad the next day, when it became a crapshoot of 10-20s end-to-end latency due to the hug of death: too many users, too few instances. I applied for their API but didn't hear anything back. I strongly believe that a zero-latency computer experience unlocks a lot of the natural human creative flow. Constantly having momentary pauses ruins my train of thought.
A perfect architecture for producing results that may or may not be correct...
For determining what goes into your faceplant feed it's perfect; for controlling your antilock brakes, not so much...
A lot of serious processing has nebulous results, like analyzing MRI images for signs of cancer, but not all computing falls into this category. Some things need to be deterministic.
Out of curiosity, does that mean that GPUs are nondeterministic only in their exact run-time execution behavior, or also in the actual output they produce? If the latter, is it something to do with the nondeterministic order of floating-point operations causing small output differences, or something completely different?
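To illustrate the second possibility I have in mind (generic Python, nothing GPU-specific; the data is just random numbers I made up): float addition isn't associative, so summing the same values in a different order can change the low bits of the result.

    # Floating-point addition is not associative, so reduction order matters.
    # If a parallel reduction doesn't pin down the order, results can differ
    # between runs even though every individual operation is computed correctly.
    import random

    random.seed(0)
    xs = [random.uniform(-1e6, 1e6) for _ in range(100_000)]

    seq = sum(xs)               # left-to-right
    rev = sum(reversed(xs))     # opposite order

    def tree_sum(v):
        """Pairwise (tree) reduction, roughly how a parallel reduce combines values."""
        while len(v) > 1:
            v = [v[i] + v[i + 1] for i in range(0, len(v) - 1, 2)] + ([v[-1]] if len(v) % 2 else [])
        return v[0]

    print(seq, rev, tree_sum(xs))   # usually agree only to ~15 significant digits
    print(seq == rev)               # often False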
You're being purposely terse and obtuse, so I have to disregard your comments as undecipherable. The reality is that on-chip memory has a lower cost-per-bit than off-chip memory: it's lithographed into the silicon on the same layers as everything else, making it essentially free if you have room on the die. That's why integrated GPUs and SoCs are so popular.