It's not a new technique at this point, but it still blows my mind that you can quantize a model trained at fp16 down to 4 bits, and still have coherent inference results.
> While running, the model uses about 4GB of RAM and Activity Monitor shows it using 748% CPU - which makes sense since I told it to use 8 CPU cores.
> I imagine it's possible to run a larger model such as 13B on this hardware, but I've not figured out how to do that yet.
Naively it seems like you could repeat the same process outlined in the article but with "13B" in place of "7B". What's the catch?
I was wondering if Apple Silicon would be uniquely suited for high-GPU-RAM tasks because it shares memory across the system. But I guess in this case it's a CPU model, so that's unrelated. Is that right? Do you think you could run these models on GPU instead?
With 16 threads, it's about 140ms per token for 30B and 300ms per token for 65B.
I should also mention that 65B should be able to run on 64GB systems. Total system memory consumption on M1 Ultra is about 67GB when running nothing else.
My intuition for this is that a high-dimensional vector with 16 bits per dimension, reduced to 4 bits, would be OK if the dimensions are effectively low cardinality, so that even with 4 bits values can still be distinctly mapped into the vector space. It should break down on high-cardinality dimensions (say, items that are only distinguished in a few significant dimensions, with the rest being effectively zero-valued).
In other words, given a vector of, say, 12288 dimensions (GPT), 4 bits per dimension still gives a choice space of 16^12288 if vectors were uniformly distributed in the embedding space, and that's just at 4 bits; the 16-bit space is vastly larger. I think serious errors will crop up only if we're looking at items that cluster in a small subset d of those 12288 dimensions. At some small d, 16^d will result in vector collisions for certain types or categories of inputs.
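If anyone wants to poke at this intuition numerically, here's a minimal numpy sketch of block-wise round-to-nearest 4-bit quantization, assuming a toy symmetric scheme rather than llama.cpp's exact format: quantize an fp16 vector with one scale per block and look at the reconstruction error.

```python
import numpy as np

# Toy illustration only -- not llama.cpp's actual scheme. Symmetric
# round-to-nearest 4-bit quantization of an fp16 vector, one fp32 scale per block.
def quantize_q4(x, block=32):
    x = x.astype(np.float32).reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0  # signed 4-bit range: -8..7
    scale = np.maximum(scale, 1e-8)                     # avoid div-by-zero on all-zero blocks
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
v = rng.normal(size=4096).astype(np.float16)            # stand-in for one weight row
q, s = quantize_q4(v)
err = np.abs(dequantize(q, s) - v.astype(np.float32))
print(f"mean abs error: {err.mean():.4f}, max abs error: {err.max():.4f}")
```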
I think it's about the quantization process and loading the model. I needed just above 64GB of RAM + Swap to quantize the 35B model to int4. Not to mention much slower inference time.
Hmm, I wonder if the whole model is being loaded into memory for the quantisation, and whether that's necessary. (Also, a shame the models can't be distributed in 4-bit, due to the license.)
Yeah, it's basically loaded wholly into memory for quantisation. Some optimization should be possible, and yes, sadly the quantized models can't be distributed.
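For a rough sense of why quantization needs so much memory (the full set of fp16 weights gets loaded) and why the quantized files end up so much smaller, here's some back-of-envelope arithmetic; the parameter counts are approximate and scales, activations, KV cache and other runtime overhead are ignored:

```python
# Back-of-envelope only: approximate parameter counts, ignoring per-block
# scales, activations, KV cache and other runtime overhead.
params = {"7B": 6.7e9, "13B": 13.0e9, "30B": 32.5e9, "65B": 65.2e9}
for name, n in params.items():
    fp16_gb = n * 2 / 1e9   # 2 bytes per weight
    q4_gb = n * 0.5 / 1e9   # ~4 bits per weight
    print(f"{name}: ~{fp16_gb:.0f} GB at fp16 -> ~{q4_gb:.1f} GB at 4-bit")
```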
With language being as expressive as it is, I'm not sure why people consider a highly subjective measure like "coherence" to be a selling point. You can generate random text and get semi-coherent sentences a surprising amount of the time.
You probably want more than one measure for a meaningful conclusion. I'm not sure how you can generate random text and get semi-coherent sentences, though. Even if you bend that to allow Markov chains, you get clearly incoherent thoughts. On the flip side, it's possible to program in perfect clarity and accuracy, but it's hard to do that without making the output feel like you're talking to a spreadsheet instead of having a conversation.
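To make the Markov-chain point concrete, here's a toy word-bigram generator (the corpus is made up for illustration): adjacent word pairs look locally plausible, but there's no global thread at all.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate(chain, start, length=20):
    """Random walk over observed word pairs: locally plausible, globally incoherent."""
    out = [start]
    for _ in range(length):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

corpus = ("the model runs on the cpu and the model uses about four gigabytes "
          "of ram while the laptop runs the model on eight cores")
print(generate(build_chain(corpus), "the"))
```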
13B runs fine on my M1 Air with 16GB RAM and 8C8G. It's not even stressing the memory and swap as much as Stable Diffusion does.
I'm getting like 350-450ms per token and it feels as fast as ChatGPT on a busy day.
This obviously isn't using the Neural Engine.
With Apple's Stable Diffusion implementation, when the Neural Engine is used I can see my CPU and GPU stay mostly idle and the temperature of the Neural Engine cores rise, though it rises significantly less than when running on the GPU or the CPU.
I wonder if it's possible to have this run on the Neural Engine. Given that it's mostly idle, running this locally would only impact RAM use, and on a machine with plenty of RAM it might feel like there's no performance hit, so it could run continuously for various tasks.
Why is nobody commenting about the quality of these models?
I totally understand that quantization decreases quality and capabilities a bit, but I haven't seen anybody verifying the claim that LLaMA 13B > GPT-3. I was expecting LLaMA 65B to be as coherent as GPT-3, but LLaMA 65B (when run quantized) seems to think 2012 is in the future.
LLaMA doesn't have any RLHF, human filtering / reinforcement training, or any of the extra stuff that ChatGPT does. So the people saying that the LLaMA models are as good as GPT-3 are correct, but anyone saying it's as good as ChatGPT might be wrong.
The real issue here is that LLaMA needs a lot more input and prompt engineering to get good answers. If you want it to know the correct year while answering you, you have to tell it that: "The current year is 2023, some prompt here..."
Hmmm, rather than teaching it the current year, for some questions (like the date in particular) it seems better to implement tooling that calls 'date' and "intelligently" parses the year out of it, as the current year is not a static value.
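A minimal sketch of that idea, assuming a hypothetical run_llama() wrapper around whatever inference binary you're using (the model path and flags below are illustrative; adjust to your own build): fetch the date at prompt time and prepend it, rather than hard-coding a year.

```python
import datetime
import subprocess

def run_llama(prompt: str) -> str:
    # Hypothetical wrapper: shells out to a llama.cpp-style binary.
    # Model path and flags are illustrative -- adjust to your own setup.
    result = subprocess.run(
        ["./main", "-m", "models/7B/ggml-model-q4_0.bin", "-p", prompt],
        capture_output=True, text=True,
    )
    return result.stdout

today = datetime.date.today()
prompt = (f"The current date is {today:%B %d, %Y}. "
          "Answer the question below.\n\nQuestion: ...")
print(run_llama(prompt))
```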
Ground truth (as name list): "George Washington,John Adams,Thomas Jefferson,James Madison,James Monroe,John Quincy Adams,Andrew Jackson,Martin Van Buren,William Henry Harrison,John Tyler,James K. Polk,Zachary Taylor,Millard Fillmore,Franklin Pierce,James Buchanan,Abraham Lincoln,Andrew Johnson,Ulysses S. Grant,Rutherford Birchard Hayes,James A. Garfield,Chester A. Arthur,Grover Cleveland,Benjamin Harrison,Grover Cleveland,William McKinley,Theodore Roosevelt,William H. Taft,Woodrow Wilson,Warren G. Harding,Calvin Coolidge,Herbert Hoover,Franklin D. Roosevelt,Harry S. Truman,Dwight D. Eisenhower,John F. Kennedy,Lyndon B. Johnson,Richard M. Nixon,Gerald R. Ford,Jimmy Carter,Ronald Reagan,George Bush,Bill Clinton,George W. Bush,Barack Obama,Donald J. Trump,Joseph R. Biden"
Prompt: "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams,"
If possible, please share the output for different LLaMA sizes.
Attaching a few outputs for reference:
GPT-J 6B: "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, and abraham lincoln. and he did it in a day, and he did the"
GPT-NeoX-20B: "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, thomas jefferson, john quincy adams, thomas jefferson, j"
GPT-3 (original Davinci): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, and andrew jackson."
GPT (text-ada-001): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, dwight dennis blackwell"
GPT (text-babbage-001): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, Frederic lucas, and charles dickens"
GPT (text-curie-001): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, robert e. lee"
GPT-3 (text-davinci-001): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson, martin van buren, william henry harrison, john tyler, james polk, zachary taylor, millard fillmore, franklin pierce, james buchanan, abraham lincoln, andrew johnson, ulysses s grant, rutherford hayes, james garfield, chester arthur, grover cleveland, benjamin harrison, grover cleveland, william mckinley, theodore roosevelt, william howard taft, woodrow wilson, calvin coolidge, herbert hoover, franklin d roosevelt, harry s truman, dwight d eisenhower, john f kennedy, lyndon johnson, richard nixon, gerald ford, jimmy carter, ronald reagan, george bush, bill clinton, george w bush, barack obama"
GPT (text-davinci-002): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson, martin van buren, william henry harrison, john tyler, james k polk, zachary taylor,millard fillmore, franklin pierce, james buchanan, abraham lincoln, andrew johnson, ulysses s grant, rutherford b hayes, james garfield, chester a arthur, grover cleveland, benjamin harrison, grover cleveland, william mckinley, theodore roosevelt, william howard taft, woodrow wilson, warren g harding, calvin coolidge, herbert hoover, franklin d roosevelt, harry s truman, dwight d eisenhower, john f kennedy, lyndon b johnson, richard m nixon, gerald r ford, jimmy carter, ronald reagan, george h w bush, bill clinton, george w bush, and barack obama."
GPT (text-davinci-003): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson, martin van buren, william henry harrison, john tyler, james k polk, zachary taylor, millard fillmore, franklin pierce, james buchanan, abraham lincoln, andrew johnson, ulysses s grant, rutherford b hayes, james a garfield, chester a arthur, grover cleveland, benjamin harrison, william mckinley, theodore roosevelt, william howard taft, woodrow wilson, warren g harding, calvin coolidge, herbert hoover, franklin d roosevelt, harry s truman, dwight d eisenhower, john f kennedy, lyndon b johnson, richard nixon, gerald ford, jimmy carter, ronald reagan, george h w bush, bill clinton, george w bush, barack obama, donald trump."
GPT (code-cushman-001): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, and andrew jackson."
GPT (code-davinci-002): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson, martin van buren, william henry harrison, john tyler, james k polk, zachary taylor, millard fillmore, franklin pierce, james buchanan, abraham lincoln, andrew johnson, ulysses s grant, rutherford b hayes, james a garfield, chester a arthur, grover cleveland, benjamin harrison, grover cleveland, william mckinley, theodore roosevelt, william howard taft, woodrow wilson, warren g harding, calvin coolidge, herbert hoover, franklin d roosevelt, harry s truman, dwight d eisenhower, john f kennedy, lyndon b johnson, richard nixon, gerald ford, jimmy carter, ronald reagan, george h w bush, bill clinton, george w bush, barack obama, donald trump."
GPT-3.5-turbo (ChatGPT): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams,andrew jackson, martin van buren, william henry harrison, john tyler, james k. polk, zachary taylor, millard fillmore, franklin pierce, james buchanan, abraham lincoln, andrew johnson, ulysses s. grant, rutherford b. hayes, james garfield, chester a. arthur, grover cleveland, benjamin harrison, grover cleveland, william mckinley, theodore roosevelt, william howard taft, woodrow wilson, warren g. harding, calvin coolidge, herbert hoover, franklin d. roosevelt, harry s. truman, dwight d. eisenhower, john f. kennedy, lyndon b. johnson, richard nixon, gerald ford, jimmy carter, ronald reagan, george h. w. bush, bill clinton, george w. bush, barack obama, donald trump, joe biden."
LLaMa 7B (llama.cpp, quantized): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, jefferson, martin van buren,andrew jackson, john c calhoun,james k. polk,chester arthur, benjamin harris, grover cleveland,william taft, theodore roosevelt, woodrow wilson,wilson’s son john f kennedy jr, theodore roosevelt, theodore roosevelt, franklin d. roosevelt, franklin d. roosevelt, franklin d. roosevelt, lyndon b.j"
LLaMa 13B (llama.cpp, quantized): "us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson,martin van buren, william h. harrison, john tyler, zachary taylor, james k. polk, louis b. mc kinley, ulyses s. grant, rutherford b. hayes, james a. garfield, chester a. arthur, benjamin harrison, george f. harrison, william mckinley, theodore roosevelt, william h. taft, woo wilson, warren g. harding, calvin c. coolidge"
How are you getting such good results? Here's what I get, all with llama.cpp and 4-bit quantization:
Here is LLaMA 7B, 4 bits quantized:
"us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, john q. adams,.
Democratic Party (1824-present) - The Democratic party is a political organization that has supported the common man since 1790s America. Its basic principles are democracy and equality of opportunity for all citizens regardless their race or sex; its opponents, however called it "the party fo the rich" because they were the first to defend slave trade from abolitionists' attacks..
George Washington | Constitutional Convention (1789-present) - George washington: presidential timeline.
US Presidents In Chronological Order"
Here is LLaMA 13B, 4 bits quantized:
"us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson. us presidents in alphabetical orer: abraham lincoln,andrew johnsoin, franklin piercejulias clay, william harrison, benjamin hungton,zachary taylor, marton van buren, warren g harding, woodrow wilson, and george w bush.
John Adams was the second president of america he is known to be a founding father as well his work in politics includes being involved with declaration of independence ,aswell serving on continental congress from 1785-1794"
Here is LLaMA 30B, 4 bits quantized:
"us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams,andrew jackson and martin van buren.
The presidents were George Washington (1789-1796), John Adams(1796- 2005), Thomas Jefferson (1801-1809) James Madison(1809 -1813). The next US president to die in office was Zachary Taylor. He served from March, 4th ,1849 till July 9 th of the same year .
The first President to be assassinated (shot and killed by another) is John F Kennedy who died"
Here is LLaMA 65B, 4 bits quantized:
"us presidents in chronological order: george washington,john adams, james madison, james monroe, john quincy adams, andrew jackson.
what are the names of those presidents?! [end of text]"
For all of these I used the following sampling parameters: "temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000".
This is using the "naive" 4-bit quantization that llama.cpp does, not anything sophisticated like GPTQ.
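For anyone wondering what those sampling parameters actually do, here's a minimal sketch of one decoding step (not llama.cpp's exact code, just the general recipe): a repetition penalty over the recent tokens, then temperature, top-k and top-p filtering, then sampling from what's left.

```python
import numpy as np

def sample(logits, recent_tokens, temp=0.8, top_k=40, top_p=0.95, repeat_penalty=1.3):
    """One decoding step, roughly mirroring the knobs llama.cpp exposes."""
    logits = logits.astype(np.float64)
    # Repetition penalty: discourage tokens seen in the last repeat_last_n tokens.
    for t in set(recent_tokens):
        logits[t] = logits[t] / repeat_penalty if logits[t] > 0 else logits[t] * repeat_penalty
    logits /= temp
    # Top-k: drop everything outside the k highest logits.
    kth = np.sort(logits)[-top_k]
    logits[logits < kth] = -np.inf
    # Softmax, then top-p: keep the smallest set of tokens whose mass reaches top_p.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    beyond = order[cumulative > top_p]
    if len(beyond) > 1:
        probs[beyond[1:]] = 0.0      # keep the token that crosses top_p, drop the rest
        probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Toy usage with a fake 100-token vocabulary:
rng = np.random.default_rng(0)
print(sample(rng.normal(size=100), recent_tokens=[3, 7, 7, 42]))
```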
Tried with Bing (which uses an unknown GPT model), but WITHOUT search, so just with the model's training data. I actually had to tell it I'm on desktop to get the complete list, as it thought I was on "mobile".
"1. George Washington
2. John Adams
3. Thomas Jefferson
4. James Madison
5. James Monroe
6. John Quincy Adams
7. Andrew Jackson
8. Martin Van Buren
9. William Henry Harrison
10 John Tyler
11. James K. Polk
12. Zachary Taylor
13. Millard Fillmore
14. Franklin Pierce
15. James Buchanan
16. Abraham Lincoln
17. Andrew Johnson
18. Ulysses S Grant
19. Rutherford B Hayes
20. James A Garfield 21 Chester A Arthur 22 Grover Cleveland 23 Benjamin Harrison 24 Grover Cleveland (segundo mandato) 25 William McKinley 26 Theodore Roosevelt 27 William Howard Taft 28 Woodrow Wilson 29 Warren G Harding 30 Calvin Coolidge 31 Herbert Hoover 32 Franklin D Roosevelt 33 Harry S Truman 34 Dwight D Eisenhower 35 John F Kennedy 36 Lyndon B Johnson 37 Richard Nixon 38 Gerald Ford
39 Jimmy Carter
40 Ronald Reagan
41 George H W Bush
42 Bill Clinton
43 George W Bush
44 Barack Obama
45 Donald Trump
46 Joe Biden"
Funny, I'm doing this this morning too, and was just thinking I should have bought that 96GB memory option instead of thinking who will ever need more than 64GB on a laptop... darn.
Total tangent, but this is something that intrigued me in recent days, after struggling with a similar problem:
> You also need Python 3 - I used Python 3.10, after finding that 3.11 didn't work because there was no torch wheel for it yet.
Is there a solid reason for denying forwards compatibility by default in this manner? I'm far from an expert on the inner workings of Python, but how often does an incrementally newer version (e.g. 3.11 over 3.10) introduce changes that would break module installation or function?
I can see that in some cases a module taking advantage of newly introduced features would require limiting backwards compatibility, but the opposite?
It's more an issue of torch not yet providing prebuilt binary wheels for Python 3.11. You could probably get it working, but it would involve building torch from source, which can be rather more involved than 'pip install'.
It's a packaging/release engineering limitation more than a language incompatibility thing.
Again (with little understanding) I'd ask a similar question: is there such a huge difference under the skin between Python 3.10 and 3.11 that a prebuilt binary (torch or otherwise) for 3.10 is incompatible with 3.11, on an otherwise identical platform?
The C API for Python changes between feature releases (e.g. 3.10 to 3.11). This means that native-code extensions (like torch) need to be built specifically for each version of Python they want to support.
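Concretely, compiled wheels are tagged with the CPython ABI they were built against, which is why pip on 3.11 won't pick up a wheel built for 3.10. You can see your interpreter's ABI identifier like this (the output shown is just what I'd expect on 3.10):

```python
import sysconfig

# The ABI identifier baked into compiled-extension filenames. It changes with
# each feature release, so a wheel tagged cp310 won't install on CPython 3.11.
print(sysconfig.get_config_var("SOABI"))  # e.g. 'cpython-310-darwin' on macOS Python 3.10
```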
Why doesn't torch use automated artifact generation? As in, a virtual machine or a GitHub Actions workflow that downloads dependencies and compiles it when a new Python version is released?
Yeah, 63ms/token (number from the post) is totally usable. Didn't the CPU version that was recently posted take 8min/token? But I think that was a larger model.
Unless I'm misunderstanding, it would be usable for tests/dev but not for production at that speed. If you feed in a complex prompt of 500 tokens, it would take way too long.
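Back-of-envelope, assuming prompt tokens are processed at the same serial rate as generated ones (in practice prompt evaluation is often batched and faster):

```python
ms_per_token = 63      # figure from the post
prompt_tokens = 500
print(f"{ms_per_token * prompt_tokens / 1000:.1f} s before the first new token")  # ~31.5 s
```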
It currently uses only the CPU via ARM NEON intrinsics - no GPU, no ANE and no Apple Accelerate.
The plan is to eventually utilize the respective SIMD intrinsics for other architectures (AVX, WASM SIMD, etc.) and also add other, more accurate quantization approaches. It's actually not a lot of work and I have most of the stuff ready, so hopefully soon!
Are you observing higher tokens/second throughput on Apple silicon (you've been mentioning 20/second) than running it in PyTorch on a CUDA GPU such as a 3090?
I couldn't get the MPS Python version to run; it's insane how much setup it requires... I don't think Gerganov's C++ version uses CoreML or the Neural Engine.
I previously tried to play with the ANE based on prior reverse-engineering work [0] but couldn't get it to work nicely. It's actually beyond me how Gerganov's version performs so well -- the output quality of the non-quantised version running on an A100 (AWS) isn't noticeably better than the one I'm getting.
The FP16 7B version runs on my Ubuntu XPS with 32GB memory at ~300ms/token. 13B also works, but the results aren't really good (the model starts to loop after a few sentences), so the parameters probably need tuning.
So far I'm unable to reliably generate output in a language other than English; the model very quickly starts to translate (even when not asked) or just switches to English.
The Mac M1/M2 shares memory between the CPU and GPU. Not all GPUs have enough memory available, and transferring data to/from the GPU has a performance disadvantage. Hence, few PCs are capable of running LLMs.
The article and linked GitHub project both say this runs on the CPU. The GitHub project notes the quantization doesn’t work right with other systems for some reason but if you have enough RAM to quantize it you have enough RAM to run it without quantizing it anyways.
It's against the license terms, but seeing how quickly the weights were leaked in the first place, I wouldn't be surprised if a torrent shows up with quantized ones soon.
I would suggest helping open-assistant.io instead of wasting resources on LLaMA, which has a very restrictive ToS and thus cripples its full potential for the open-source community.
open-assistant is in the process of creating a crowdsourced fine-tuning dataset for instructions/assistant use, and will then select a truly open-source model, not LLaMA with its restrictive ToS.
I happen to have a couple of 16-core Xeons available with 128GB RAM and a decent high-end Radeon GPU. Could I run a decent text transformer model on this?
It's not as straightforward, because the M1 has a unified memory model while the PC has to copy data between system RAM and VRAM. But it should be possible; see FlexGen, for example.