[flagged] Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU (gopubby.com)
57 points by qiakai 5 months ago | 29 comments



From the article: Please note: it’s not designed for real-time interactive scenarios like chatting, more suitable for data processing and other offline asynchronous scenarios. Repo: https://github.com/lyogavin/Anima/tree/main/air_llm
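
For reference, the README in that repo shows usage along these lines; I'm sketching it from memory, so the exact class and method names may differ from the current airllm release:

  # Sketch of airllm usage as I remember the repo's README; class names and
  # arguments may differ between releases, so treat this as approximate.
  from airllm import AutoModel

  model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

  input_tokens = model.tokenizer(
      ["What is the capital of the United States?"],
      return_tensors="pt",
      truncation=True,
      max_length=128,
  )

  output = model.generate(
      input_tokens["input_ids"].cuda(),
      max_new_tokens=20,
      use_cache=True,
      return_dict_in_generate=True,
  )
  print(model.tokenizer.decode(output.sequences[0]))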


What a misleading article. I thought they'd done some breakthrough in resource efficiency. This is just the old and slow method tools like Ollama used.


Do you know how much disk space this takes total? When I ran it, it downloaded nearly 30 gigabytes of models and seemed to be on track to download 28 more 5 gigabyte chunks (for a total of 150 gigabytes of disk space or maybe more). What is the total size before it finishes?


70B parameters * 2 bytes each (fp16 or bf16) = 140GB

I wish model sizes were published in bytes.
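
In bytes, then; a quick back-of-the-envelope for the weights alone (ignoring KV cache and runtime overhead):

  # Weight-only checkpoint sizes for a 70B-parameter model at common precisions
  # (KV cache and runtime overhead not included).
  params = 70e9
  for name, bytes_per_weight in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
      print(f"{name}: {params * bytes_per_weight / 1e9:.0f} GB")
  # fp16/bf16: 140 GB, int8: 70 GB, int4: 35 GB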


Thanks, I finished downloading it (which took many hours) onto an external hard drive (by adding an HF_HOME environment variable to choose where to store that cache). Its size was 262 GB.
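
For anyone else redirecting the download, HF_HOME just has to be set before the Hugging Face libraries load their cache config; a minimal sketch (the external-drive path is only an example, and Llama-3 is a gated repo so you also need to be logged in with a token):

  import os

  # Point the Hugging Face cache at an external drive BEFORE importing any
  # HF libraries; the path is just an example.
  os.environ["HF_HOME"] = "/mnt/external/huggingface"

  from huggingface_hub import snapshot_download

  # Pull every shard of the (gated) repo into the cache under HF_HOME;
  # you need to be logged in with an access token that has Llama-3 access.
  snapshot_download("meta-llama/Meta-Llama-3-70B-Instruct")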


What method is that? Layer offloading?


Yes, it's either that, or CPU inference. The article doesn't say.

It doesn't mention quantization either.
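
For the curious, layer offloading in a nutshell: keep the whole model in CPU RAM (or on disk), copy one layer at a time onto the GPU, push the activations through it, then evict it. A toy illustration of the idea, not the article's actual code:

  import torch
  import torch.nn as nn

  # Toy illustration of layer offloading: the whole stack of layers lives in
  # CPU RAM, and only one layer at a time is copied to the GPU, run, and
  # evicted again, so GPU memory stays tiny.
  hidden = 4096
  layers = [nn.Linear(hidden, hidden) for _ in range(8)]  # a real 70B model has ~80 blocks

  device = "cuda" if torch.cuda.is_available() else "cpu"
  x = torch.randn(1, 16, hidden, device=device)  # hidden states for a 16-token prompt

  with torch.no_grad():
      for layer in layers:
          layer.to(device)   # stream this layer's weights onto the GPU
          x = layer(x)       # run the activations through it
          layer.to("cpu")    # evict it so the next layer fits in 4GB

  print(x.shape)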


Any sense of speed? My assumption is that shuttling the weights in/out of the GPU is slow. Does GPU load + processing beat an entirely CPU solution? Doubly so if it is a huge model where the model cannot sit fully in RAM?
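
A back-of-the-envelope answer, using my own rough numbers rather than anything from the article: if every generated token has to stream the full set of fp16 weights over PCIe, the bus bandwidth alone puts a hard floor on latency.

  # Rough lower bound on per-token latency if all weights are re-streamed
  # over PCIe for every token (both numbers are ballpark assumptions).
  weights_gb = 140      # Llama-3 70B at fp16
  pcie_gb_per_s = 25    # roughly PCIe 4.0 x16 in practice
  print(weights_gb / pcie_gb_per_s, "seconds per token, at best")  # ~5.6 s/token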


Depends on your CPU. I once tried 70B Llama on a 256-thread Epyc; it was around 1/10 of A100 (80GB) speed.


how much disk space did it use?


I didn’t check, but iirc it was an fp16 model checkpoint which we converted to int8 for inference, so I assume 140GB?




Any chance the new NPUs are going to significantly speed up running these locally?

Well, I'm definitely worried about Recall and all the Microsoft nonsense, but I really want to be able to run and train LLMs and other machine learning models locally.


You still need lots of fast memory.


Abysmal article. It doesn't explain anything about the claim in the title. Is there quantization? How much RAM do you need? How fast is the inference? None of these questions are addressed or even mentioned.

> Of course, it would be more reasonable to compare the similarly sized 400B models with GPT4 and Claude3 Opus

No. It's completely irrelevant to the topic of the article.

The article is mostly a press release for llama 3. It also contains a few comments by the author, they aren't bad but don't save the clickbaity, buzzy, sensationalist core.


Llama isn’t open source because the license says you can only use it to improve itself, so the title is false


You could use it to earn money to spend on GPU to improve llama...


This is probably going to sound silly, but I wonder how it compares to TinyLlama and others.


As a cloud solution developer that has to build AI on Azure I have been using this instead of Azure OpenAI. It has sped up my development workflow a lot, and for my purposes it’s comparable enough. I’m using LM studio to load these models.


Can you expand a bit -- is it because AOAI is so slow? What exactly helps you speed things up?


On my machine, I am able to create a prompt that suits my needs and chat with the model in real time. With 100% GPU offload, it replies within half a second. LM Studio provides an OpenAI-compatible API endpoint for my .NET software to use. This boosts my developer experience significantly. The Azure services are slow, and if you want to regenerate a series of responses (e.g. as part of a conversation flow) it just takes too long. On my local machine I also don't worry about cloud costs.

As a bonus, I also use this for a personal project where I use prompts and Llama3 to control smart devices. JSON responses from the LLM are parsed and translated into smart device commands on a Raspberry Pi. I control it using speech via my Apple Watch and Apple Shortcuts that hit the Raspberry Pi's API. It all works magically and fast. Way faster than pulling up the app on my phone. And yes, the LLM is smart enough to control groups of devices using simple conversational AI.

edit; here's a demo https://www.youtube.com/watch?v=dCN1AnX8txM
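
For reference, LM Studio's local server speaks the OpenAI chat-completions protocol, so any OpenAI client can be pointed at it. A minimal sketch; the port (1234 is the usual default), API key, and model name all depend on your local setup, and the smart-home prompt is just illustrative:

  from openai import OpenAI

  # LM Studio exposes an OpenAI-compatible endpoint locally; the base_url,
  # api_key, and model name below are assumptions about a typical setup.
  client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

  resp = client.chat.completions.create(
      model="llama-3-8b-instruct",  # whatever model LM Studio has loaded
      messages=[
          {"role": "system", "content": "Reply only with a JSON smart-home command."},
          {"role": "user", "content": "Turn off the living room lights."},
      ],
  )
  print(resp.choices[0].message.content)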


is it possible to use this for audio transcription?


This sounds like a game changer. I wonder if they need to do a tonne of specific work per model? If this could be implemented in Ollama, I'd be over the moon.


Ollama can already run Llama-3 70B with a 4GB GPU, or no GPU at all; it'll just be slow.

Considering this says it's "not designed for real-time interactive scenarios", it's probably also really slow.
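
For completeness, a minimal sketch of hitting Ollama's local HTTP API directly, assuming its default port; with a 4GB GPU most of the 70B layers spill to CPU, so expect it to crawl:

  import requests

  # Ollama listens on localhost:11434 by default; with a 4GB GPU most of the
  # 70B layers run on the CPU, so expect well under a token per second.
  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={"model": "llama3:70b", "prompt": "Say hello in one sentence.", "stream": False},
      timeout=600,
  )
  print(resp.json()["response"])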


so how much GPU RAM do you need to get the 70B going fast(ish)?


A good rule of thumb is that models can be quantized to 6-8 bits per weight without significantly degrading quality. That makes the math easy: roughly 70GB for the weights, plus overhead for the attention/KV cache of in-flight requests. That overhead depends on workload and context lengths, but expect about 30% more, so around 100GB for a server under load.
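
That rule of thumb as a quick calculation (same assumptions as above):

  # The rule of thumb above, in numbers.
  params = 70e9
  bits_per_weight = 8   # top of the 6-8 bpw range
  overhead = 1.30       # ~30% extra for attention/KV state under load
  weights_gb = params * bits_per_weight / 8 / 1e9
  print(f"{weights_gb:.0f} GB weights, ~{weights_gb * overhead:.0f} GB under load")
  # 70 GB weights, ~91 GB under load -> call it ~100 GB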


llama3:70b using llama.cpp (used under the hood by Ollama) on an 11th Gen Intel i5-11400 @ 2.60GHz - no GPU, CPU inference only.

"Write a haiku about Hacker News mentioning AI in the title"

Here is a haiku:

  AI whispers secrets
  HN threads weave tangled debate
  Intelligence born

  eval time = 30363.04 ms / 23 runs ( 1320.13 ms per token, 0.76 tokens per second)
  total time = 34294.80 ms / 33 tokens
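
If anyone wants to reproduce that from Python instead of the llama.cpp CLI, the llama-cpp-python bindings look roughly like this; the GGUF path, quantization, and thread count are placeholders for your own setup:

  from llama_cpp import Llama

  # CPU-only inference via the llama-cpp-python bindings rather than the
  # llama.cpp CLI; the GGUF path, quantization, and thread count are placeholders.
  llm = Llama(
      model_path="models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",
      n_ctx=2048,
      n_threads=8,
  )

  out = llm("Write a haiku about Hacker News mentioning AI in the title", max_tokens=64)
  print(out["choices"][0]["text"])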


That really doesn't seem bad. When people talk about responses from self-hosted LLMs without a beefy GPU being unusably slow, I always assumed they meant 15 minutes to hours. I don't mind waiting a few minutes if it will summarize the answer to a question that would take me many times longer to research myself.


how much disk space did it use?



