GPU's Rival? What Is Language Processing Unit (LPU) (turingpost.com)
22 points by kseniase 9 months ago | 23 comments



As a former Groq engineer, I'd say the LPU branding is not really technically accurate. These are general-purpose compute chips with excellent software for transformer models. The GroqChip is general purpose in the sense that it is basically a highly parallel, high-performance, deterministic execution engine. It 'rivals' GPUs in the sense that both are obviously used for AI execution. The whitepapers cited are basically a perfect model of how the chips operate, and as you can see there, these are general-purpose tensor units.

That being said, yes the software is impressive and the engineering team at Groq is top-notch.

Unfortunately, the chips are mainly aimed at inference, and it seems like a lot of the large investments at the moment are being driven by training.

EDIT: In the spirit of full disclosure, I suppose I should point out that I own lots of Groq shares, so have every interest in their success.


Is there any quick summary of how exactly (architecturally) these LPUs differ from GPUs or TPUs? Intuitively, where does the speedup come from, or what tradeoffs are being made in order to achieve it?


There's a YouTube video somewhere. Intuitively, the speedup comes from the fact that the pipelining is controlled by software rather than hardware. There's no memory management unit. Memory operations and timing are controlled entirely by software. This is terrible for CPUs, where it's really important to do register renaming, out-of-order execution, etc. For AI models, however, it doesn't really matter. Tensor accesses are easily optimized by software, which knows exactly how the chip runs. Essentially, the chip is able to read memory, write memory, do vector operations, multiply matrices, and do permutations all in the same cycle, and in fact it can have several of these operations going on at the same time.

The compiler obviously knows how to schedule all this to produce good timing. That's why the utilization is so high: every piece of the chip is being used. With a GPU, by contrast, you have to kind of play 'code golf' and have a deep understanding of the architecture and the execution engine to be able to determine when a memory access is going to cause a pipeline stall.
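To make that concrete, here's a toy Python sketch of a fully static, compile-time schedule. It's illustrative only: the functional unit names and the schedule itself are made up, not the actual GroqChip ISA. The point is that once the compiler has emitted the cycle-by-cycle issue table, utilization is a property of the schedule you can compute before the program ever runs.

    # Toy model of a statically scheduled chip (illustrative only, not the GroqChip ISA).
    # Each cycle lists which functional units issue; nothing is decided at run time,
    # so utilization is a compile-time property of the schedule itself.

    FUNCTIONAL_UNITS = {"mem_read", "mem_write", "vector_alu", "matmul", "permute"}

    # A compile-time schedule: cycle number -> set of units issuing in that cycle.
    schedule = {
        0: {"mem_read"},
        1: {"mem_read", "permute"},
        2: {"mem_read", "matmul", "permute"},
        3: {"matmul", "vector_alu", "mem_write"},
        4: {"matmul", "vector_alu", "mem_write"},
    }

    def utilization(schedule, units=FUNCTIONAL_UNITS):
        """Fraction of (cycle, unit) issue slots doing useful work."""
        cycles = len(schedule)
        busy = sum(len(ops & units) for ops in schedule.values())
        return busy / (cycles * len(units))

    print(f"static utilization: {utilization(schedule):.0%}")  # 48% for this toy schedule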

If you understand the ISA of the GroqChip, you understand completely how the chip works. There is no pipeline stall. There is no waiting for memory. There is no memory hierarchy. Even the interconnects work at the same speed as the chip (or something like that... it's all in the paper, so public information), so when they network chips together, the latency between any two subunits of any chip is already known from the network topology and the chip timings.
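As a rough illustration of what that buys you (the cycle counts below are placeholders I made up, not real GroqChip or interconnect numbers): with fixed per-hop and on-chip latencies, chip-to-chip timing is just a sum the compiler can do from the topology, with no run-time arbitration or queuing to account for.

    # Illustrative only: these latencies are placeholder values, not real GroqChip numbers.
    HOP_LATENCY_CYCLES = 40      # assumed fixed chip-to-chip link latency
    ON_CHIP_LATENCY_CYCLES = 5   # assumed fixed latency from a subunit to the chip edge

    def path_latency(num_hops: int) -> int:
        """Deterministic latency between subunits on two chips, in cycles."""
        return 2 * ON_CHIP_LATENCY_CYCLES + num_hops * HOP_LATENCY_CYCLES

    # e.g. two chips that are three hops apart in the network topology:
    print(path_latency(num_hops=3))  # 130 cycles, known entirely at compile time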

TL;DR: It's all very deterministic, and this works well for tensor operations. Parallelism is determined at compile time.

Source: https://www.youtube.com/watch?v=pb0PYhLk9r8. This is basically exactly how the chip works. It's not some simplified architectural or marketing diagram; it's literally how it works.


I re-posted your comment here: https://www.turingpost.com/p/fod41


>> There is no pipeline stall. There is no waiting for memory.

I mean, this is a bit of hyperbole. Even if things are deterministic, the compiler might not be able to schedule things optimally; there can be pipeline interlocks or memory bank conflicts (depending on the sizes involved) in some scenarios.


There are no implicit pipeline stalls. Any pipeline stalls are detectable at compile time via trivial static analysis tools, letting engineers work around them or decide they're fine. Sorry if that was confusing.
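For flavor, something like this Python sketch is what I mean by trivial static analysis. The bank-conflict rule and the two-cycle busy window are hypothetical, and this is not Groq's actual tooling; it just shows that with a fully static schedule, hazards are visible by scanning the instruction stream before it ever runs.

    # Hypothetical example, not Groq's actual compiler or ISA: with a static schedule,
    # a hazard like "read a bank too soon after writing it" is a simple scan.
    WRITE_BUSY_CYCLES = 2  # assumed: a bank is busy for 2 cycles after a write

    # (cycle, op, bank) triples emitted by a hypothetical compiler pass, in cycle order
    schedule = [
        (0, "write", 3),
        (1, "read", 3),   # reads bank 3 one cycle after the write -> conflict
        (4, "read", 3),   # far enough away -> fine
    ]

    def find_bank_conflicts(schedule, busy=WRITE_BUSY_CYCLES):
        conflicts = []
        last_write = {}  # bank -> cycle of the most recent write
        for cycle, op, bank in schedule:
            if op == "read" and bank in last_write and cycle - last_write[bank] < busy:
                conflicts.append((cycle, bank))
            if op == "write":
                last_write[bank] = cycle
        return conflicts

    print(find_bank_conflicts(schedule))  # [(1, 3)] -> flag it, then fix it or accept it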


Do you think the article explains well what an LPU is?


I think the article implies that there are special features on-chip for language. The chip is general purpose.


My mind was blown a night or two ago when Groq had end-to-end answers in ~0.25s. All my GPT-4 developer friends' minds were blown too. Then we got sad the next day, when it became a crapshoot of 10-20s end-to-end latency, due to the hug of death of too many users and too few instances. I applied for access to their API but didn't hear anything back. I strongly believe that a zero-latency computer experience unlocks a lot of the natural human creative flow. Constantly having momentary pauses ruins my train of thought.


I'm considering three possibilities here:

1) AMD, Intel, or ARM buy Groq.

2) They become like ARM and license the ISA to system integrators.

3) They license an LPU core for others to integrate into their chips.

What other outcomes are likely?


A perfect architecture for producing results that may or may not be correct...

For determining what goes into your faceplant feed it's perfect; for controlling your antilock brakes, not so much...

A lot of serious processing has nebulous results, like analyzing MRI images for signs of cancer, but not all computing falls into this category. Some things need to be deterministic.


What do you mean, may or may not be correct? We are the only architecture for AI that gets deterministic results (unlike a GPU).


Out of curiosity, does that mean that GPUs are nondeterministic only in their exact run-time execution behavior, or also in the actual output they produce? If the latter, is it something to do with the nondeterministic order of floating point operations causing small output differences, or something completely different?
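To illustrate the effect the second half of the question is getting at: floating-point addition is not associative, so summing the same values in a different order can change the result. If a GPU reduction's order varies from run to run (e.g., racing atomic adds or different thread scheduling), the output itself can differ slightly, not just the timing.

    # Float addition is not associative, so the same values summed in a different
    # order can give a different answer.
    a = [1e16, 1.0, -1e16]
    print(sum(a))                    # 0.0 -> 1e16 + 1.0 rounds back to 1e16, then cancels
    print(sum([1e16, -1e16, 1.0]))   # 1.0 -> the big terms cancel first, so the 1.0 survives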


With breakthroughs in inference and long-context understanding, we are officially entering a new era in LLMs.


How many new eras has it been this week? Three? More? I'm losing track…


You sound like an LLM yourself, to be honest. Prompt: Generate a vaguely-related aphorism, using solely the title of this article.


This isn't LinkedIn.


Relying solely on on-chip SRAM will never be economically viable for most use cases.


Could you elaborate on why? Isn't it more economically viable than purchasing separate memory?


Cost-per-bit


You're being purposely terse and obtuse, so I must disregard your comments as indecipherable. The reality is that on-chip memory has a lower cost-per-bit than off-chip memory. It's lithographed into the silicon on the same layers as everything else, making it essentially free if you have room on the die. That's why integrated GPUs and SoCs are so popular.


So all my CUDA code will run on an LPU?


this is a good thing (tm)



