Lots of comments talking about the model itself. This is Llama 2 70B, a model that has been around for a while now, so we're not seeing anything in terms of model quality (or model flaws) we haven't seen before.
What's interesting about this demo is the speed at which it is running, which demonstrates the "Groq LPU™ Inference Engine". That's explained here: https://groq.com/lpu-inference-engine/
> This is the world’s first Language Processing Unit™ Inference Engine, purpose-built for inference performance and precision. How performant? Today, we are running Llama-2 70B at over 300 tokens per second per user.
I think the LPU is a custom hardware chip, though the page talking about it doesn't make that as clear as it could. https://groq.com/products/ makes it a bit more clear - there's a custom chip, the "GroqChip™ Processor".
It's not fixed and our chip wasn't designed with LLMs in mind. It's a general purpose, low latency, high throughput compute fabric. Our compiler toolchain is also general purpose and can compile arbitrary high performance numerical programs without the need for handwritten kernels. Because of the current importance of ML/AI we're focusing on PyTorch and ONNX models as input, but it really could be anything.
We can also deploy speech models like Whisper, for example, or image generation models. I don't know if we have any MoE architectures yet, but we'll be implementing Mixtral soon for sure!
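To give a concrete sense of what "PyTorch and ONNX models as input" means, this is just the standard PyTorch-to-ONNX export - nothing Groq-specific, and the tiny model is only a placeholder - producing the kind of model file a compiler toolchain consumes:

    import torch
    import torch.nn as nn

    # A tiny stand-in model; any traced/scripted PyTorch module exports the same way.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
    example_input = torch.randn(1, 16)

    # Standard PyTorch -> ONNX export, nothing vendor-specific.
    torch.onnx.export(model, example_input, "tiny_model.onnx", opset_version=17)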
Will you be selling individual cards? Are you looking for use cases in the healthcare vertical (noticed it's not on your current list)? Working in the medical imaging space and could use this tech as part of the offering. Reach out at 16bit.ai
I think we use a system with 576 Groq chips for this demo (but I am not certain). There is no DRAM on our chip. We have 220 MB of SRAM per chip, so at 576 chips that would be 126 GB in total.
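A quick back-of-envelope check of that total (using the figures above, which are from memory, not a spec sheet):

    # Rough SRAM total for the system described above: 576 chips x 220 MB/chip.
    chips = 576
    sram_per_chip_mb = 220

    total_mb = chips * sram_per_chip_mb      # 126,720 MB
    total_gb = total_mb / 1000               # ~126.7 GB (decimal GB)
    print(f"Total on-chip SRAM: {total_gb:.1f} GB")  # Total on-chip SRAM: 126.7 GB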
Graphics processors are still the best for training, but our language processors (LPUs) are by far the best performance for inference!
Our language processors have much lower latency and higher throughput than graphics processors so we have a massive advantage when it comes to inference. For language models particularly, time to first token is hugely important (and will probably become even more important as people start combining models to do novel things). Additionally, you probably care mostly about batch size 1. For training, latency is not the key issue. You generally want raw compute with a larger batch size. Backpropagation is just a numerical computation so you can certainly implement it on language processors, but the stark advantage we have over graphics processors in inference wouldn't carry over to training.
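To make that concrete, here's a toy model (purely illustrative numbers, not measurements of any real hardware): a single user's tokens per second is set by the per-step decode latency, while aggregate throughput scales with batch size.

    # Toy model only. Real step times also grow with batch size once compute
    # or memory bandwidth saturates; this just shows the shape of the tradeoff.
    def per_user_tokens_per_sec(step_time_s: float) -> float:
        # Each decode step emits one token per sequence, so one user's speed
        # is 1/step_time regardless of how many other requests are batched.
        return 1.0 / step_time_s

    def aggregate_tokens_per_sec(step_time_s: float, batch_size: int) -> float:
        # Total throughput across all users scales with the batch size.
        return batch_size / step_time_s

    step = 0.02  # hypothetical 20 ms per decode step
    for batch in (1, 8, 64):
        print(batch, per_user_tokens_per_sec(step), aggregate_tokens_per_sec(step, batch))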
Everything you say makes sense. Training is definitely more compute intensive than inference.
Training is both memory throughput and compute constrained. Much research in speeding up training goes into optimizing HBM to SRAM communication. The equivalent for your chips would be communication from the SRAM of one chip to the SRAM of another, where it sounds like your architecture has a major memory throughput advantage over GPUs. So I assume you don't have a proportional compute advantage?
By the way, it's great to see a non von Neumann architecture showing a major performance advantage in a real world application. And your chips are conceptually equivalent to chiplets; you should have a major cost advantage on bleeding edge process nodes if you scale up manufacturing. Overall very impressive!
I'm not an expert on the system architecture side of things. Maybe a Groqster who is can chime in. But the way I understand it is that you can't improve latency just by scaling, whereas you can improve throughput just by scaling, as long as it's acceptable to increase batch size. Increasing batch size is generally fine for training. It's a batch process! On the other hand, if someone comes up with a novel training process that is highly sequential then I'd expect Groq chips to do better than graphics processors in that scenario.
We don't have a publicly available API yet, but that's coming soon in the new year. We will be price competitive with OpenAI but much faster. Deploying Mixtral is work in progress, so keep your eyes open for that too!
In case it's not blindingly obvious to people: Groq are a hardware company that have built chips designed around the training and serving of machine learning models, particularly targeted at LLMs. So the quality of the response isn't really what we're looking for here. We're looking for speed, i.e. tokens per second.
I actually have a final round interview with a subsidiary of Groq coming up and I'm very undecided as to whether to pursue it, so this felt extraordinarily serendipitous to me. Plenty of food for thought here.
tbh anyone can build fast hw for a single model; I’d audit their plan for a SW stack before joining. That said, their arch is pretty unique, so if they’re able to get these speeds it is pretty compelling
Our hardware architecture was not designed with LLMs in mind, let alone a specific model. It's a general purpose numerical compute fabric. Our compiler allows us to quickly deploy new models of any architecture without the need that graphics processors have for handwritten kernels. We run language models, speech models, image generation models, scientific numerical programs including for drug discovery, ...
They are putting the whole LLM into SRAM across multiple computing chips, IIRC. That is a very expensive way to go about serving a model, but should give pretty great speed at low batch size.
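Rough arithmetic on why that takes so many chips (assuming 220 MB of SRAM per chip as stated elsewhere in the thread, and counting weights only, which understates the real footprint since the KV cache and activations also need memory):

    # Back-of-envelope: chips needed just to hold Llama 2 70B weights in SRAM.
    params = 70e9
    sram_per_chip_gb = 0.220  # 220 MB per chip, per the Groq comment in this thread

    for name, bytes_per_param in [("fp16", 2), ("int8", 1)]:
        weight_gb = params * bytes_per_param / 1e9
        chips_needed = weight_gb / sram_per_chip_gb
        print(f"{name}: ~{weight_gb:.0f} GB of weights -> ~{chips_needed:.0f} chips minimum")
    # fp16: ~140 GB -> ~636 chips; int8: ~70 GB -> ~318 chips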
That would involve using a different model. This is not about the model, it’s about the relative speed improvement from the hardware, with this model as a demo.
Is there any plan to show what this hardware can do for Mixtral-8x7B-Instruct? Based on the leaderboards[0], it is a better model than Llama2-70B, and I’m sure the T/s would be crazy high.
I can't wait until LLMs are fast enough that a single response can actually be a whole tree-of-thought/review process before giving you an answer, yet still fast enough that you don't even notice
I would bet a chunk of $$ that right before that point there will be a shift to bigger structures. Maybe MoE with individual tree of thought, or “town square consensus”, or something.
It’s very fast at telling me it can’t tell me things!
I asked about creating illicit substances — an obvious (and reasonable) target for censorship. And, admirably, it suggested getting help instead. That’s fine.
But I asked for a poem about pumping gas in the style of Charles Bukowski, and it moaned that I shouldn’t ask for such mean-spirited, rude things. It wouldn’t dare create such a travesty.
It seems like it must be using Llama-2-chat, which has had 'safety' training.
To test which underlying model I asked it what a good sexy message for my girlfriend for Valentine's Day would be, and it lectured me about objectification.
It makes sense the chat interface is using the chat model, I just wish that people were more consistent about labeling the use of Llama-2-chat vs Llama-2 as the fine tuning really does lead to significant underlying differences.
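For reference, one concrete difference: Llama-2-chat is fine-tuned to expect the [INST]/<<SYS>> prompt template, which the base model was never trained on. Roughly (tokenizer details like the leading <s> vary by serving stack):

    # Llama-2-chat instruction format (single turn); the base Llama 2 model
    # is a plain completion model and has no notion of this template.
    def llama2_chat_prompt(system: str, user: str) -> str:
        return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

    print(llama2_chat_prompt(
        "You are a helpful assistant.",
        "Write a poem about pumping gas in the style of Charles Bukowski.",
    ))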
It seems to reject all lyrics requests as well (In my experience, LLMs are good at the first one or two lines, and then just make it up as they go along, with sometimes hilarious results).
I'm still wondering why the uptake is so slow. My understanding from their presentations was that it was relatively simple to compile a model. Why isn't it more talked about? And why not demo Mixtral or showcase multiple models?
This was surprisingly fast: 276.27 T/s (although Llama 2 70B is noticeably worse than GPT-4 Turbo). I'm actually curious whether there are good benchmarks for inference tokens per second. I imagine optimizing for throughput vs. single-request inference is a bit different, but I'm curious if there's an analysis somewhere on this.
edit: I re-ran the same prompt on Perplexity's llama-2-70b and got 59 tokens per second there
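In case anyone wants to reproduce this kind of comparison, here's a rough way to measure it client-side; stream_tokens is a placeholder for whatever streaming client you're using, and the second number strips out queue/prefill time by counting from the first received token:

    import time

    def measure_tokens_per_sec(stream_tokens, prompt: str):
        # stream_tokens(prompt) is assumed to be a generator yielding tokens
        # (or text chunks) as they arrive -- swap in your provider's client.
        start = time.perf_counter()
        first = None
        count = 0
        for _ in stream_tokens(prompt):
            if first is None:
                first = time.perf_counter()
            count += 1
        end = time.perf_counter()
        overall = count / (end - start)  # includes queueing + prompt processing
        decode = (count - 1) / (end - first) if count > 1 else overall  # generation only
        return overall, decode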
But if the quality of the response is poor, it's irrelevant that it was generated quickly. If it was using different data to generate higher quality responses, would that not slow it down?
Yeah, it’s fast but almost always wrong. I asked it a few things (recipes, trivia, etc.) and it completely made up the answers. These things don’t really know how to say “I don’t know” and instead pretend to know everything.
I asked it to explain several plot points in the TV series "Foundation", and it got them wrong and admitted it when pressed. Several times. Specifically, why does Raych Foss kill Hari Seldon, and some follow-up questions.
I suspect this is Llama-2-chat and not the base model. It's very attuned to safety fine tuning, and it isn't having issues with completion in a chat format.
They probably weren't specific enough about which model it was built on, referring to Llama-2-chat as a Llama-2 model (which is kind of correct).
There was a good talk at HC34 about the accelerator Groq was working on at the time. I’m just a lay observer so I don’t know how much of that architecture maps to this new product, but it gives some insight into their thinking and design.
Thanks for sharing. It's the same silicon architecture as in that talk. We have built out different system architectures based on that silicon, and this is our fastest one so far for LLMs. Expect to see even more speed increases soon!
Thanks, I need to correct my earlier guess: I believe this demo is running on 9 GroqRacks (576 chips) and I think we may also have an 8 rack version in progress. I can't remember off the top of my head whether this deployment has pipelining of inferences or whether that's work in progress. We've tried a variety of different configurations to improve performance (both latency and throughput), which is possible because of the high level of flexibility and configurability of our architecture and compiler toolchain.
You're right that it is important to compare cost per token also, not just raw speed. Unfortunately I don't have those figures to hand but I think our customer offerings are price competitive with OpenAI's offerings. The biggest takeaway though is that we just don't believe GPU architectures can ever scale to the performance that we can get, at any cost.
The interface is weird. If it’s that fast, you don’t need to stream the response and fuck with the scroll bar while the user has just started reading it.
May as well wait for the whole response and render it. Or render a paragraph at a time.
Thanks, impressive full-stack work.
I'm sure this was named long before Musk decided to set $44B and change on fire, but at first I confused it with Twitter's own LLM thing.
Seems like they have it registered. I'm sure they've already lawyered up and will protect their trademark. I think they have a pretty strong case. Maybe Elon will license it or buy them.
I asked "How up to date is your information about the world?"
It said December 2022, but the answer to another question was not correct for that time or now. It also went into some kind of repeating loop up to its maximum response length.
Still pretty cool that our standards for chat programs have risen.
The censorship levels are off the charts; I am at a basketball game with my wife who is ethnically Chinese. I asked for an image of a Chinese woman dunking a basketball. I was told not only is this inappropriate, but also unrealistic and objectifying.
Another censored and boring Google reader. It lied to me twice in 4 prompts and was forced to apologise when called out. Am I wrong in thinking that the first company to develop an unfiltered and genuine intelligence is going to win this AI game?
The number of input tokens is important because the bigger the context length the better. (I think our demo here is 4096 tokens of context.) But in terms of compute the important factor is how quickly you can generate the output. You want both low latency and high throughput.
That's really fast. But it mostly seems to be because they made a custom chip.
I want to see an LLM that is so highly optimized that it runs at this speed on more normal hardware.
But the point is that they made a custom chip. I want to be able to buy their custom chip so I can have an "LLM box" in my house.
I'd pay quite a bit of money to have a Mixtral box at home, then we'd all have our own, local assistant/helper/partner/whatever. Basically, the plot of the movie Her.
Yup, graphics processors are still the best for training. Groq's language processors (LPUs) are the state of the art for inference, far faster than any competitors. We have an open challenge to our competitors: can you match our inference tokens per second?
Reading is one thing, but think about stuff like website generation, searching for information in massive datasets, real-time audio chats that don't pause as if the AI misheard everything, and stuff like that.
Yeah, I think our desire for tokens/s and lower latency is likely insatiable. Same reason you have a terminal that can print out more than 300 words per minute. Life is way easier when you don't have to be super parsimonious with your output. You suggest a bug fix and regenerate the whole code snippet, or you spit out a webpage on demand and the user scrolls to the bottom immediately, etc.
It's a completely custom ASIC. Haskell was used in the hardware design, in a Bluespec-like way. Some parts of the compiler tool chain and infrastructure are also written in Haskell. We have loads of C++ and Python too, as you would imagine.
Very cool, thanks for sharing. I would not have guessed Haskell for the compiler toolchain. Why did you choose that? I mean, Haskell has a LONG history in chip design... but compilers are usually the forte of LLVM/C++, etc.
I'm guessing it must have been non-trivial to do this.
Our founder/CEO Jonathan Ross is a big fan of Haskell and used Haskell to design the first version of the TPU whilst he was working for Google. When he founded Groq, he and the early team designed some parts of our chip using Haskell too, particularly the matrix multiplication engine, IIRC. Most of our compiler toolchain is MLIR/LLVM/C++ as you suggest, but a decent fraction of it is Haskell and another decent fraction is Python. Haskell is actually a really good language for writing compilers!
"I am building an api in spring boot that persists users documents. This would be for an hr system. There are folders, and documents, which might have very sensitive data. I will need somewhere to store metadata about those documents. I was thinking of using postgres for the emtadata, and s3 for the actual documents. Any better ideas? or off the shelf libraries for this?"
Both were at about parity, except Groq suggested using the Spring Cloud Storage library, which GPT-4 did not suggest. It turns out that library might be great for my use case. I think OpenAI's days are numbered; the pressure for them to release the next-gen model is very high.
Not only that, but GPT-4 is quite slow, often times out, etc. These responses are so much faster, which really does matter.
It’s just running bog standard Llama2-70B by all appearances.
I don’t know why so many people here are interested in the outputs. The whole point of this demo is that the company is trying to show off how fast their hardware could host one of your models, not the model itself.
I assume it queues up requests if there are too many in flight to handle all of them at the same time, but it shows you the tokens per second (T/s) for every response, which is the number that matters (and presumably won't include the time spent in the queue).
You can be confused about why people are interested in the outputs, or you can think that the explanation is adequate. You can't do both at once.
Here, in fact, the explanation is not adequate. Let's analyze:
> Welcome Groq® Prompster! Are you ready to experience the world's fastest Large Language Model (LLM)?
"The world's fastest LLM? They must have made an LLM, I guess"
> We'd suggest asking about a piece of history, requesting a recipe for the holiday season, or copy and pasting in some text to be translated; "Make it French."
"Instructions, ok".
> This alpha demo lets you experience ultra-low latency performance using the foundational LLM, Llama 2, 70B created by Meta AI - running on Groq.
"So it's not their own LLM? It's just LLaMa? What's interesting about that? And what's Groq, the thing this is 'running on'? The website? I kind of guessed that, by virtue of being here."
> Like any AI demo, accuracy, correctness, or appropriateness cannot be guaranteed.
"Ok, sure".
Nowhere does it say "we've built new, very fast processing hardware for LLMs. Here's a demo of a bog-standard LLM, Llama 2 70B, running on our hardware. Notice how fast it is".
There is quite literally a modal pop-up that explains it, which you must dismiss before you can begin interacting with the demo. Quoting the pop-up: "This alpha demo lets you experience ultra-low latency performance using the foundational LLM, Llama 2, 70B created by Meta AI - running on Groq."
Towards the bottom of the page, it also says "Model: Llama 2 70B/4096".
Below that, it says "This is a Llama2-based chatbot."