[flagged] Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU (gopubby.com)
57 points by qiakai 72 days ago | 29 comments



From the article: Please note: it’s not designed for real-time interactive scenarios like chatting, more suitable for data processing and other offline asynchronous scenarios. Repo: https://github.com/lyogavin/Anima/tree/main/air_llm


What a misleading article. I thought they'd done some breakthrough in resource efficiency. This is just the old, slow method that tools like Ollama already use.


Do you know how much disk space this takes in total? When I ran it, it downloaded nearly 30 gigabytes of models and seemed to be on track to download 28 more 5-gigabyte chunks (for a total of 150 gigabytes of disk space, or maybe more). What will the total size be once it finishes?


70B parameters * 2 bytes each (fp16 or bf16) = 140GB

I wish model sizes were published in bytes.
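
To spell that arithmetic out at a few common precisions (decimal gigabytes, weights only, no runtime overhead), a quick Python sketch:

  # Rough weight-only storage for a 70B-parameter model at common precisions
  # (decimal GB; ignores KV cache and any runtime overhead).
  PARAMS = 70e9

  for name, bytes_per_weight in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
      print(f"{name:>10}: ~{PARAMS * bytes_per_weight / 1e9:.0f} GB")

  # prints ~140 GB, ~70 GB and ~35 GB respectively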


Thanks, I finished downloading it (which took many hours) onto an external hard drive (by setting an HF_HOME environment variable for where to store that cache). Its size was 262 GB.
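
For anyone doing the same: the Hugging Face cache location can also be redirected from Python, as long as HF_HOME is set before any HF library is imported. A minimal sketch (the path and model id are only examples, and the official Llama 3 repos are gated, so you need an authorized token):

  import os

  # Must be set before importing any Hugging Face library, or it won't be picked up.
  os.environ["HF_HOME"] = "/mnt/external/hf_cache"  # example path

  from huggingface_hub import snapshot_download

  # Downloads the whole repo into the redirected cache (gated repos need token=...).
  snapshot_download("meta-llama/Meta-Llama-3-70B-Instruct")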


What method is that? Layer offloading?


Yes, it's either that, or CPU inference. The article doesn't say.

It doesn't mention quantization either.
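
For context, layer offloading just means keeping the weights in CPU RAM (or on disk) and streaming one transformer layer at a time through the GPU; the per-layer weight copies are why it's so slow. A toy PyTorch sketch of the idea (not the project's actual code):

  import torch
  import torch.nn as nn

  device = "cuda" if torch.cuda.is_available() else "cpu"

  # Toy stand-in for a stack of transformer layers, all resident in CPU RAM.
  layers = [nn.Linear(4096, 4096) for _ in range(8)]

  @torch.no_grad()
  def offloaded_forward(x: torch.Tensor) -> torch.Tensor:
      x = x.to(device)
      for layer in layers:
          layer.to(device)   # copy this layer's weights to the GPU
          x = layer(x)       # run only this layer
          layer.to("cpu")    # evict it so the next layer fits in 4GB of VRAM
      return x.cpu()

  print(offloaded_forward(torch.randn(1, 4096)).shape)  # torch.Size([1, 4096])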


Any sense of speed? My assumption is that shuttling the weights in/out of the GPU is slow. Does GPU loading + processing beat a CPU-only solution? Doubly so if it's a huge model that cannot sit fully in RAM?


Depends on your CPU. I once tried 70B Llama on a 256-thread Epyc; it was around 1/10 the speed of an A100 (80GB).


how much disk space did it use?


I didn’t check, but iirc it was an fp16 model checkpoint which we converted to int8 for inference, so I assume 140GB?


Any chance that the new NPUs are going to significantly speed up running these locally?

Well, I'm definitely worried about Recall and all the Microsoft nonsense, but I really want to be able to run and train LLMs and other machine learning models locally.


You still need lots of fast memory.


Abysmal article. It doesn't explain anything about the claim in the title. Is there quantization? How much RAM do you need? How fast is the inference? None of these questions are addressed or even mentioned.

> Of course, it would be more reasonable to compare the similarly sized 400B models with GPT4 and Claude3 Opus

No. It's completely irrelevant to the topic of the article.

The article is mostly a press release for Llama 3. It also contains a few comments by the author; they aren't bad, but they don't save the clickbaity, buzzy, sensationalist core.


Llama isn’t open source because the license says you can only use it to improve itself, so the title is false


You could use it to earn money to spend on GPU to improve llama...


This is probably going to sound silly, but I wonder how it compares to TinyLlama and others.


As a cloud solution developer who has to build AI on Azure, I have been using this instead of Azure OpenAI. It has sped up my development workflow a lot, and for my purposes it's comparable enough. I'm using LM Studio to load these models.


Can you expand a bit? Is it because AOAI is so slow? What exactly helps you speed things up?


On my machine, I am able to create a prompt that suits my needs and chat with the model in real time. With 100% GPU offload, it replies within half a second. LM Studio provides an OpenAI-compatible API endpoint for my .NET software to use. This boosts my developer experience significantly. The Azure services are slow, and if you want to regenerate a series of responses (e.g. part of a conversation flow) it just takes too long. On my local machine I also do not worry about cloud costs.

As a bonus, I also use this for a personal project where I use prompts and Llama3 to control smart devices. JSON responses from the LLM are parsed and translated into smart-device commands sent from a Raspberry Pi. I control it using speech via my Apple Watch and Apple Shortcuts calling the Raspberry Pi's API. It all works magically and fast, way faster than pulling up the app on my phone. And yes, the LLM is smart enough to control groups of devices using simple conversational AI.

Edit: here's a demo https://www.youtube.com/watch?v=dCN1AnX8txM
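
For anyone wanting to replicate the setup: LM Studio's local server exposes an OpenAI-compatible chat-completions endpoint, so the standard openai client works against it from any language. A minimal Python sketch (the port is the usual LM Studio default and the model name is a placeholder; adjust both to whatever your server reports):

  from openai import OpenAI

  # LM Studio's local server speaks the OpenAI API; the key is ignored.
  client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

  resp = client.chat.completions.create(
      model="local-model",  # placeholder: LM Studio serves whichever model is loaded
      messages=[
          {"role": "system", "content": "Reply only with a JSON command for a smart-home hub."},
          {"role": "user", "content": "Turn off all the living room lights."},
      ],
      temperature=0.2,
  )
  print(resp.choices[0].message.content)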


is it possible to use this for audio transcription?


This sounds like a game changer. I wonder if they need to do a tonne of specific work per model? If this could be implemented in Ollama, I'd be over the moon.


Ollama can already run Llama-3 70B with a 4GB GPU, or no GPU at all; it'll just be slow.

Considering this says it's "not designed for real-time interactive scenarios", it's probably also really slow.


So how much GPU RAM does it need to get the 70B going fast(ish)?


A good rule of thumb is that models can be quantized to 6 to 8 bits per weight without significantly degrading quality. This makes the math convenient: roughly 70GB for the weights, plus some overhead for the attention (KV) caches of ongoing requests. That overhead depends on workload and context lengths, but expect about 30% more. So, around 100GB for a server under load.
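
Spelled out (rule-of-thumb numbers, not measurements):

  # ~8-bit quantization: 1 byte per weight, plus ~30% for KV cache under load.
  params = 70e9
  weights_gb = params * 1.0 / 1e9      # ~70 GB of weights
  total_gb = weights_gb * 1.3          # ~91 GB -> budget ~100 GB with headroom
  print(round(total_gb))               # 91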


llama3:70b using llama.cpp (used under the hood by Ollama) on an 11th Gen Intel i5-11400 @ 2.60GHz, no GPU, CPU inference only.

"Write a haiku about Hacker News mentioning AI in the title"

Here is a haiku:

  AI whispers secrets
  HN threads weave tangled debate
  Intelligence born

  eval time = 30363.04 ms / 23 runs ( 1320.13 ms per token, 0.76 tokens per second)
  total time = 34294.80 ms / 33 tokens
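
For anyone who wants to reproduce that kind of measurement: Ollama's local HTTP API returns token counts and eval durations (in nanoseconds) alongside the completion. A quick sketch, assuming a default install listening on port 11434:

  import requests

  r = requests.post(
      "http://localhost:11434/api/generate",
      json={
          "model": "llama3:70b",
          "prompt": "Write a haiku about Hacker News mentioning AI in the title",
          "stream": False,
      },
      timeout=600,
  )
  data = r.json()
  print(data["response"])
  # eval_duration is in nanoseconds; eval_count is the number of generated tokens.
  print(f'{data["eval_count"] / (data["eval_duration"] / 1e9):.2f} tokens/sec')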


That really doesn't seem bad. When people talk about responses from self-hosted LLMs without a beefy GPU being unusably slow, I always assumed they meant 15 minutes to hours. I do not mind waiting a few minutes if it will summarize the answer to a question that would take me many times longer to research.

