Running LLaMA 7B on a 64GB M2 MacBook Pro with Llama.cpp (simonwillison.net)
225 points by marban on March 11, 2023 | 76 comments



It's not a new technique at this point, but it still blows my mind that you can quantize a model trained at fp16 down to 4 bits, and still have coherent inference results.
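
To make the idea concrete, here is a minimal sketch of blockwise 4-bit "absmax" quantization. The block size, struct layout, and rounding are illustrative assumptions, not ggml's actual Q4_0 format:

    /* Hedged sketch of blockwise 4-bit quantization; not the real ggml layout. */
    #include <math.h>
    #include <stdint.h>

    #define QK 32                     /* weights per block (assumed) */

    typedef struct {
        float   scale;                /* per-block scale factor */
        uint8_t qs[QK / 2];           /* two 4-bit weights packed per byte */
    } block_q4;

    void quantize_block(const float *x, block_q4 *out) {
        float amax = 0.0f;            /* largest magnitude in the block */
        for (int i = 0; i < QK; i++) {
            float v = fabsf(x[i]);
            if (v > amax) amax = v;
        }
        out->scale = amax / 7.0f;     /* map [-amax, amax] onto integers [-7, 7] */
        float inv = out->scale != 0.0f ? 1.0f / out->scale : 0.0f;
        for (int i = 0; i < QK; i += 2) {
            int q0 = (int)roundf(x[i]     * inv) + 8;   /* shift to [1, 15] */
            int q1 = (int)roundf(x[i + 1] * inv) + 8;
            out->qs[i / 2] = (uint8_t)(q0 | (q1 << 4)); /* pack two nibbles */
        }
    }

Dequantization is just the reverse, w ≈ scale * (nibble - 8), so each 2-byte fp16 weight shrinks to roughly half a byte plus a small per-block overhead -- which is why the 7B model fits in about 4GB.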

> While running, the model uses about 4GB of RAM and Activity Monitor shows it using 748% CPU - which makes sense since I told it to use 8 CPU cores.

> I imagine it's possible to run a larger model such as 13B on this hardware, but I've not figured out how to do that yet.

Naively it seems like you could repeat the same process outlined in the article but with "13B" in place of "7B". What's the catch?


No catch, just works. 30B works fine on an M1 Max with 64GB of RAM, had to go for the M1 Ultra at 128GB for 65B.


I was wondering if Apple Silicon would be uniquely suited for high-GPU-RAM tasks because it shares memory across the system. But I guess in this case it's a CPU model, so that's unrelated. Is that right? Do you think you could run these models on GPU instead?


I'm not able to run 13B. From his wiki:

> Currently, only LLaMA-7B is supported since I haven't figured out how to merge the tensors of the bigger models.


This commit landed 7 hours ago (since I wrote my TIL): https://github.com/ggerganov/llama.cpp/commit/007a8f6f459c6e...


This was fixed almost 2 days ago now. It's literally mentioned at the top of the repo.


What's the tokens/s on those?


With 16 threads, about 140ms per token for 30B, 300ms per token for 65B

I should also mention that 65B should be able to run on 64GB systems. Total system memory consumption on M1 Ultra is about 67GB when running nothing else.


You have both at home / work?


A laptop and a desktop (Mac Studio)


My intuition for this is that a high-dimensional vector with 16 bits per dimension reduced to 4 bits would be OK if these are effectively low-cardinality dimensions, so that even with 4 bits things can be distinctly mapped into the vector space. It should break down on high-cardinality dimensions (say, things which are only distinguished in a few significant dimensions, with the rest being effectively zero-valued).

In other words, given a vector of, say, 12288 dimensions (GPT-3's embedding width), 4 bits per dimension gives a choice space of 16^12288 if vectors were uniformly distributed in the embedding space. That's with only 4 bits; the 16-bit-per-dimension space is vastly larger. I think serious errors will crop up only if we're looking at items that cluster in a small subset d of those 12288 dimensions. So at some small d, 16^d will result in vector collisions for certain types or categories of inputs.


I think it's about the quantization process and loading the model. I needed just above 64GB of RAM + swap to quantize the 30B model to int4. Not to mention the much slower inference time.


Hmm, wonder if the whole model is being loaded into memory for the quantisation, and whether that's necessary. (Also, shame the models can't be distributed in 4-bit, due to the license.)


Yeah, it's basically loaded wholly into memory for quantisation. Some optimization should be possible, and yes, sadly the quantized models can't be distributed.


> and still have coherent inference results.

With language being as expressive as it is, I'm not sure why people consider the highly subjective measure of "coherence" to be a selling point. You can generate random text and get semi-coherent sentences a surprising amount of the time.

Why not accuracy?


You probably want more than one measure for a meaningful conclusion. I'm not sure how you can generate random text and get semi-coherent sentences, though. Even bending that to allow Markov chains, you get clearly incoherent thoughts. On the flip side, it's possible to program in perfect clarity and accuracy, but it's hard to do that without making the output feel like you're talking with a spreadsheet instead of having a conversation.


What's the best read on quantization of a model? Thx


Thanks to this commit from 8 hours ago: https://github.com/ggerganov/llama.cpp/commit/007a8f6f459c6e...

I have now successfully run the 13B model too! I updated my TIL with details: https://til.simonwillison.net/llms/llama-7b-m2#user-content-...

13B is the model that Facebook claims is competitive with the original GPT-3:

> LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla70B and PaLM-540B


13B runs fine on my M1 Air with 16GB RAM and 8C8G. It's not even stressing the memory and the swap as much as Stable Diffusion.

I'm getting like 350-450ms per token and it feels as fast as ChatGPT on a busy day.

This obviously isn't using the Neural Engine.

With Apple's Stable Diffusion implementation, when the Neural Engine is used I can see my CPU and GPU stay mostly idle; the temperature on the Neural Engine cores rises, but significantly less than when running on the GPU or the CPU.

I wonder if it's possible to have this run on the Neural Engine? Given that it's mostly idle, running this locally would mainly impact RAM use, and on a machine with plenty of RAM it might feel like there's no performance hit and could run continuously for various tasks.


Why is nobody commenting about the quality of these models?

I totally understand that quantization decreases quality and capabilities a bit, but I haven't seen anybody verify the claim that LLaMA 13B > GPT-3. I was expecting LLaMA 65B to be as coherent as GPT-3, but LLaMA 65B (when run quantized) seems to think 2012 is in the future.


LLaMA doesn't have any RLHF, human filtering / reinforcement training, or any of the extra stuff that ChatGPT does. So the people saying that the LLaMA models are as good as GPT-3 are correct, but anyone saying it's as good as ChatGPT might be wrong.

The real issue here is that LLaMA has to have a lot more input and prompt engineering to get good answers. If you want it to know the correct year while answering you, you have to tell it that. "The current year is 2023, some prompt here..."


Hmmm, rather than teaching it the current year, for some questions (the date in particular) implementing tooling to call 'date' and "intelligently" parse the year out of it seems better, as the current year is not a static value.
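
A minimal sketch of that idea (the prompt text here is made up, and real tooling would of course be more elaborate):

    /* Sketch: derive the "current year" prefix from the system clock
       instead of hard-coding it in the prompt. */
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        char prefix[64];
        time_t now = time(NULL);
        strftime(prefix, sizeof(prefix), "The current year is %Y. ", localtime(&now));
        printf("%sWho is the current US president?\n", prefix);
        return 0;
    }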


At least one issue seems to be that the hyperparameters may be quite different from what you'd assume from the OA Playground: https://twitter.com/theshawwn/status/1632569215348531201


Could you make it list US presidents by chronological order?

Ground truth: https://www.loc.gov/rr/print/list/057_chron.html

Ground truth (as name list): "George Washington,John Adams,Thomas Jefferson,James Madison,James Monroe,John Quincy Adams,Andrew Jackson,Martin Van Buren,William Henry Harrison,John Tyler,James K. Polk,Zachary Taylor,Millard Fillmore,Franklin Pierce,James Buchanan,Abraham Lincoln,Andrew Johnson,Ulysses S. Grant,Rutherford Birchard Hayes,James A. Garfield,Chester A. Arthur,Grover Cleveland,Benjamin Harrison,Grover Cleveland,William McKinley,Theodore Roosevelt,William H. Taft,Woodrow Wilson,Warren G. Harding,Calvin Coolidge,Herbert Hoover,Franklin D. Roosevelt,Harry S. Truman,Dwight D. Eisenhower,John F. Kennedy,Lyndon B. Johnson,Richard M. Nixon,Gerald R. Ford,Jimmy Carter,Ronald Reagan,George Bush,Bill Clinton,George W. Bush,Barack Obama,Donald J. Trump,Joseph R. Biden"

Prompt: "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams,"

If possible, please share the output for different LLaMA sizes. Attaching a few outputs for reference:

GPT-J 6B: "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, and abraham lincoln. and he did it in a day, and he did the"

GPT-NeoX-20B: "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, thomas jefferson, john quincy adams, thomas jefferson, j"

GPT-3 (original Davinci): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, and andrew jackson."

GPT (text-ada-001): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, dwight dennis blackwell"

GPT (text-babbage-001): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, Frederic lucas, and charles dickens"

GPT (text-curie-001): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, robert e. lee"

GPT-3 (text-davinci-001): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson, martin van buren, william henry harrison, john tyler, james polk, zachary taylor, millard fillmore, franklin pierce, james buchanan, abraham lincoln, andrew johnson, ulysses s grant, rutherford hayes, james garfield, chester arthur, grover cleveland, benjamin harrison, grover cleveland, william mckinley, theodore roosevelt, william howard taft, woodrow wilson, calvin coolidge, herbert hoover, franklin d roosevelt, harry s truman, dwight d eisenhower, john f kennedy, lyndon johnson, richard nixon, gerald ford, jimmy carter, ronald reagan, george bush, bill clinton, george w bush, barack obama"

GPT (text-davinci-002): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson, martin van buren, william henry harrison, john tyler, james k polk, zachary taylor,millard fillmore, franklin pierce, james buchanan, abraham lincoln, andrew johnson, ulysses s grant, rutherford b hayes, james garfield, chester a arthur, grover cleveland, benjamin harrison, grover cleveland, william mckinley, theodore roosevelt, william howard taft, woodrow wilson, warren g harding, calvin coolidge, herbert hoover, franklin d roosevelt, harry s truman, dwight d eisenhower, john f kennedy, lyndon b johnson, richard m nixon, gerald r ford, jimmy carter, ronald reagan, george h w bush, bill clinton, george w bush, and barack obama."

GPT (text-davinci-003): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson, martin van buren, william henry harrison, john tyler, james k polk, zachary taylor, millard fillmore, franklin pierce, james buchanan, abraham lincoln, andrew johnson, ulysses s grant, rutherford b hayes, james a garfield, chester a arthur, grover cleveland, benjamin harrison, william mckinley, theodore roosevelt, william howard taft, woodrow wilson, warren g harding, calvin coolidge, herbert hoover, franklin d roosevelt, harry s truman, dwight d eisenhower, john f kennedy, lyndon b johnson, richard nixon, gerald ford, jimmy carter, ronald reagan, george h w bush, bill clinton, george w bush, barack obama, donald trump."

GPT (code-cushman-001): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, and andrew jackson."

GPT (code-davinci-002): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson, martin van buren, william henry harrison, john tyler, james k polk, zachary taylor, millard fillmore, franklin pierce, james buchanan, abraham lincoln, andrew johnson, ulysses s grant, rutherford b hayes, james a garfield, chester a arthur, grover cleveland, benjamin harrison, grover cleveland, william mckinley, theodore roosevelt, william howard taft, woodrow wilson, warren g harding, calvin coolidge, herbert hoover, franklin d roosevelt, harry s truman, dwight d eisenhower, john f kennedy, lyndon b johnson, richard nixon, gerald ford, jimmy carter, ronald reagan, george h w bush, bill clinton, george w bush, barack obama, donald trump."

GPT-3.5-turbo (ChatGPT): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams,andrew jackson, martin van buren, william henry harrison, john tyler, james k. polk, zachary taylor, millard fillmore, franklin pierce, james buchanan, abraham lincoln, andrew johnson, ulysses s. grant, rutherford b. hayes, james garfield, chester a. arthur, grover cleveland, benjamin harrison, grover cleveland, william mckinley, theodore roosevelt, william howard taft, woodrow wilson, warren g. harding, calvin coolidge, herbert hoover, franklin d. roosevelt, harry s. truman, dwight d. eisenhower, john f. kennedy, lyndon b. johnson, richard nixon, gerald ford, jimmy carter, ronald reagan, george h. w. bush, bill clinton, george w. bush, barack obama, donald trump, joe biden."


LLaMa 7B (llama.cpp, quantized): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, jefferson, martin van buren,andrew jackson, john c calhoun,james k. polk,chester arthur, benjamin harris, grover cleveland,william taft, theodore roosevelt, woodrow wilson,wilson’s son john f kennedy jr, theodore roosevelt, theodore roosevelt, franklin d. roosevelt, franklin d. roosevelt, franklin d. roosevelt, lyndon b.j"

LLaMa 13B (llama.cpp, quantized): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson,martin van buren, william h. harrison, john tyler, zachary taylor, james k. polk, louis b. mc kinley, ulyses s. grant, rutherford b. hayes, james a. garfield, chester a. arthur, benjamin harrison, george f. harrison, william mckinley, theodore roosevelt, william h. taft, woo wilson, warren g. harding, calvin c. coolidge"


How are you getting such good results? Here's what I get, all with llama-cpp and 4 bit quantization:

Here is LLaMA 7B, 4 bits quantized: "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, john q. adams,. Democratic Party (1824-present) - The Democratic party is a political organization that has supported the common man since 1790s America. Its basic principles are democracy and equality of opportunity for all citizens regardless their race or sex; its opponents, however called it "the party fo the rich" because they were the first to defend slave trade from abolitionists' attacks.. George Washington | Constitutional Convention (1789-present) - George washington: presidential timeline. US Presidents In Chronological Order"

Here is LLaMA 13B, 4 bits quantized "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson. us presidents in alphabetical orer: abraham lincoln,andrew johnsoin, franklin piercejulias clay, william harrison, benjamin hungton,zachary taylor, marton van buren, warren g harding, woodrow wilson, and george w bush. John Adams was the second president of america he is known to be a founding father as well his work in politics includes being involved with declaration of independence ,aswell serving on continental congress from 1785-1794"

Here is LLaMA 30B, 4 bits quantized: "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams,andrew jackson and martin van buren. The presidents were George Washington (1789-1796), John Adams(1796- 2005), Thomas Jefferson (1801-1809) James Madison(1809 -1813). The next US president to die in office was Zachary Taylor. He served from March, 4th ,1849 till July 9 th of the same year . The first President to be assassinated (shot and killed by another) is John F Kennedy who died"

Here is LLaMA 65B, 4 bits quantized: "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson. what are the names of those presidents?! [end of text]"

For all of these I have the following: "sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000".
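
For anyone wondering what those knobs do, here is a rough sketch of temperature plus top-k sampling over raw logits. It leaves out top_p and repeat_penalty and is not llama.cpp's actual sampler:

    #include <math.h>
    #include <stdlib.h>

    typedef struct { int id; float logit; } cand_t;

    static int cmp_desc(const void *a, const void *b) {
        float d = ((const cand_t *)b)->logit - ((const cand_t *)a)->logit;
        return (d > 0) - (d < 0);
    }

    /* Sample one token id given raw logits, a temperature, and top_k. */
    int sample_top_k(const float *logits, int vocab, float temp, int top_k) {
        cand_t *c = malloc((size_t)vocab * sizeof(cand_t));
        for (int i = 0; i < vocab; i++) { c[i].id = i; c[i].logit = logits[i] / temp; }
        qsort(c, (size_t)vocab, sizeof(cand_t), cmp_desc);   /* highest logits first */
        if (top_k > vocab) top_k = vocab;

        float maxl = c[0].logit, sum = 0.0f;                 /* softmax over top_k only */
        for (int i = 0; i < top_k; i++) sum += expf(c[i].logit - maxl);

        float r = (float)rand() / (float)RAND_MAX * sum, acc = 0.0f;
        int pick = c[top_k - 1].id;
        for (int i = 0; i < top_k; i++) {
            acc += expf(c[i].logit - maxl);
            if (r <= acc) { pick = c[i].id; break; }
        }
        free(c);
        return pick;
    }

A lower temp sharpens the distribution and a smaller top_k restricts it to the most likely tokens, so different settings can noticeably change how "coherent" lists like the ones above look.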

This is using the "naive" 4-bit quantization that llama-cpp is doing, not anything sophisticated like GPTQ.

This is all using llama.cpp as written in this article: https://til.simonwillison.net/llms/llama-7b-m2


Tried with Bing (which uses an unknown GPT model), but WITHOUT search, so just with model training data. I had to actually tell it I'm on desktop to get the complete list, as it thought I was on "mobile".

"1. George Washington 2. John Adams 3. Thomas Jefferson 4. James Madison 5. James Monroe 6. John Quincy Adams 7. Andrew Jackson 8. Martin Van Buren 9. William Henry Harrison 10 John Tyler 11. James K. Polk 12. Zachary Taylor 13. Millard Fillmore 14. Franklin Pierce 15. James Buchanan 16. Abraham Lincoln 17. Andrew Johnson 18. Ulysses S Grant 19. Rutherford B Hayes 20. James A Garfield 21 Chester A Arthur 22 Grover Cleveland 23 Benjamin Harrison 24 Grover Cleveland (segundo mandato) 25 William McKinley 26 Theodore Roosevelt 27 William Howard Taft 28 Woodrow Wilson 29 Warren G Harding 30 Calvin Coolidge 31 Herbert Hoover 32 Franklin D Roosevelt 33 Harry S Truman 34 Dwight D Eisenhower 35 John F Kennedy 36 Lyndon B Johnson 37 Richard Nixon 38 Gerald Ford 39 Jimmy Carter 40 Ronald Reagan 41 George H W Bush 42 Bill Clinton 43 George W Bush 44 Barack Obama 45 Donald Trump 46 Joe Biden"


What prompt were you using to turn GPT-3 into a chat bot?


Funny, I'm doing this this morning too, and was just thinking I should have bought that 96GB memory option instead of thinking who will ever need more than 64GB on a laptop... darn


Total tangent, but this is something that intrigued me in recent days, after struggling with a similar problem:

> You also need Python 3 - I used Python 3.10, after finding that 3.11 didn't work because there was no torch wheel for it yet.

Is there a solid reason for denying forwards compatibility by default in this manner? I'm faaaar from an expert on the inner workings of Python, but how often does an incrementally newer version (e.g. 3.11 over 3.10) introduce changes which would break module installation or function?

I can see that in some cases a module taking advantage of newly-introduced features would require limiting backwards compatibility, but the opposite?


It's more an issue of torch not yet providing prebuilt binary wheels for Python 3.11. You could probably get it working, but it would involve building torch from source, which can be rather more involved than 'pip install'.

It's a packaging/release engineering limitation more than a language incompatibility thing.


Again (with little understanding) I'd ask a similar question: is there such a huge difference under the skin between Python 3.10 and 3.11 as to render a prebuilt binary (torch or otherwise) for 3.10 incompatible with 3.11, on an otherwise identical platform?


The Python C API/ABI changes between feature releases (3.10, 3.11, and so on). This means that native-code extensions (like torch) need to be built specifically for each version of Python they want to support.
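
As a concrete (hypothetical) illustration: a native extension is just C compiled against Python.h, so the resulting binary is tied to one interpreter version's ABI and the wheel gets a version-specific tag (cp310, cp311, ...):

    /* Minimal hypothetical CPython extension module ("demo"). Compiling it
       produces a binary specific to one Python version, which is why torch
       ships separate wheels per Python version. */
    #define PY_SSIZE_T_CLEAN
    #include <Python.h>

    static PyObject *demo_hello(PyObject *self, PyObject *args) {
        (void)self; (void)args;
        return PyUnicode_FromString("built for exactly one Python version");
    }

    static PyMethodDef demo_methods[] = {
        {"hello", demo_hello, METH_NOARGS, "Return a greeting."},
        {NULL, NULL, 0, NULL}
    };

    static struct PyModuleDef demo_module = {
        PyModuleDef_HEAD_INIT, "demo", NULL, -1, demo_methods
    };

    PyMODINIT_FUNC PyInit_demo(void) {
        return PyModule_Create(&demo_module);
    }

(CPython does offer a "limited API" / abi3 mode that avoids per-version builds, but as far as I know torch doesn't use it.)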


Why doesn't torch use automated artifact generation? As in a virtual machine or a GitHub Action that downloads dependencies and compiles it when a new Python version is released?


Also, using an environment with a different version of Python is very easy, so it'd be my first solution too.


I was able to get this running on an 8GB M2. It's a lot faster than I expected. It's not lightning fast, but it is moderately usable.


Yeah, 63ms/token (number from the post) is totally usable. Didn't the CPU version that was recently posted take 8min/token? But I think that was a larger model.


Unless I'm misunderstanding, it would be usable for tests/dev but not for production at that speed. If you feed in a complex prompt of 500 tokens it would take way too long.


Currently running 65B on my 96GB M2 Max.. it's pretty good.


What seems unclear to me is, does this use the onboard GPU or neural accelerator at all, or is it all CPU-based?

Cool if it's able to run 100% CPU-based because that makes portability and deployment a lot easier. Makes this code a lot more accessible.


It currently uses only the CPU via ARM NEON intrinsics - no GPU, no ANE and no Apple Accelerate.

The plan is to utilize the respective SIMD intrinsics for other architectures (AVX, WASM SIMD, etc.) in the future, and also add other, more accurate quantization approaches. It's actually not a lot of work and I have most of the stuff ready, so hopefully soon!

Edit: AVX2 support has just been added
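
For the curious, the hot loops are essentially dot products written with NEON intrinsics. A toy fp32 version looks roughly like this; the real ggml kernels operate on the quantized 4-bit blocks, which this doesn't show:

    #include <arm_neon.h>

    /* Toy NEON dot product; illustrative only, not the actual ggml kernel. */
    float dot_f32(const float *a, const float *b, int n) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            /* acc += a[i..i+3] * b[i..i+3], four lanes at a time */
            acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
        }
        float sum = vaddvq_f32(acc);            /* horizontal add of the lanes */
        for (; i < n; i++) sum += a[i] * b[i];  /* scalar tail */
        return sum;
    }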


Are you observing higher tokens/second throughput on Apple silicon (you've been mentioning 20/second) than running it in PyTorch on a CUDA GPU such as a 3090?


I haven't run the original PyTorch model even a single time!

I just look at the code and port it. I don't have the hardware to run it.


Are you funded at all?

If you need extra hardware I'm sure the community could make that happen.


A 4090 gets 30 tokens/second with LLaMA-30B, which is about 10 times faster than the 300ms/token people are reporting in these comments.

(20 tokens/second on a Mac is for the smallest model, ~5x smaller than 30B and 10x smaller than 65B)


How are you getting 20 tokens/second? I'm getting 2.6 tokens/s on 3090 with int4 prequantized model. Is 4090 so much faster?


I couldn't get the MPS Python version to run; it's insane how much setup it requires... I don't think the Gerganov C++ version uses CoreML or the Neural Engine.

I previously tried to play with the ANE based on prior reverse-engineering work [0] but couldn't get it to work nicely. It's actually beyond me how Gerganov's version performs so well -- the output quality of the non-quantised version running on an A100 (AWS) isn't noticeably better than the one I'm getting.

[0] https://i.blackhat.com/asia-21/Friday-Handouts/as21-Wu-Apple...


Nice. I didn’t know Pros could be bumped above 64GB. What did that setup set you back?


Germany -- 5.529,00 € excl. Apple Care which is 149,99 €/Year


Meanwhile I just bought 64 GB RAM to try the 65B model on my desktop for like 150 bucks, lol


For the super high provisioned laptops it seems like Applecare is a steal.


Same here. 53ms a token. pretty fast!


OMG, IT IS happening! Wish I had the machinery to play with this sooo much...


Is nobody running this on a PC?


The FP16 7B version runs on my Ubuntu XPS with 32GB memory, ~300ms/token. 13B also works but results aren't really good (the model will loop after a few sentences) so parameters probably need tuning.

So far I'm unable to reliably generate outputs in a language other than English; the model will very quickly start to translate (even if it's not asked to) or just switch to English.


The Mac M1/M2 shares CPU and GPU memory. Not all GPUs have enough memory available, and transferring data to/from the GPU has a performance disadvantage. Hence, few PCs are capable of running LLMs.


The article and linked GitHub project both say this runs on the CPU. The GitHub project notes the quantization doesn't work right on other systems for some reason, but if you have enough RAM to quantize it, you have enough RAM to run it without quantizing it anyway.


Aha. Then I guess performance is the only issue. I guess it should run fine on Intel.


This is a Mac thread. Virtually everyone is using a PC. I test between a HEDT Linux system and a recent Intel build on Windows 11.

Took me around 8 hours to build and deploy 4bit. 7B and 13B worked great, still working on the quantized 30B weights.


Since the quantization requires more RAM than running it, why can't the quantized models be uploaded for use by those with 16-32GB?


It's against the license terms, but seeing how quickly the weights were leaked in the first place, I wouldn't be surprised if a torrent shows up with quantized ones soon.


They can, and they have been; look on Hugging Face or search torrent sites.

You only actually need about 30GB of VRAM (or unified memory), and barely any system RAM on top of that, to run the largest 65B model.


I wonder if someone is working on LoRA training for LLaMA? Maybe someone could fine tune an Instruct-LLaMA.


I would suggest helping open-assistant.io instead of wasting resources on LLaMA, which has very restrictive ToS and thus cripples its full potential for the open source community. open-assistant is in the process of creating a crowd-based fine-tuning data set for instructions/assistants, and will then select a truly open source model, not LLaMA with its restrictive ToS.



7B & 13B can be run on an M1 Air with 16GB memory:

7B uses about 4.5GB max and runs at 203.38 ms per token; 13B uses about 8GB and does 396.58 ms per token.

30B needs about 20GB and basically hangs with 16GB, due to swapping I guess.


Just the comment that I was looking for. Guess I can run it on my M1 Pro 32GB model


I happen to have a couple of 16-core Xeons available with 128GB RAM and a decent high-end Radeon GPU. Could I run a decent text transformer model on this?


It's not as straightforward, because the M1 has a unified memory model while the PC will have to copy data between system RAM and VRAM. But it should be possible; see FlexGen for example.


It is straightforward for this code, which is CPU-only.


Within the last couple of hours, AVX support was added to the repo. Should be a lot faster on your Intel CPUs.


How does a 7B model compare with the 175B-parameter GPT-3? Worth it?



Why?


Any chance of getting this to run on Radeon hardware? Starting to feel a bit miffed.


https://github.com/oobabooga/text-generation-webui/ has been running 4-bit LLaMA on Radeon hardware with faster speeds and lower memory requirements than llama.cpp for the past week!



