
This is why I think we're seeing a Stable Diffusion moment for LLMs: https://simonwillison.net/2023/Mar/11/llama/

Look at the timeline:

24th February 2023: LLaMA is announced, starts being shared with academic partners: https://research.facebook.com/publications/llama-open-and-ef...

2nd March: Someone posts a PR with a BitTorrent link to the models: https://github.com/facebookresearch/llama/pull/73

10th March: First commit to llama.cpp by Georgi Gerganov: https://github.com/ggerganov/llama.cpp/commit/26c084662903dd...

11th March: llama.cpp now runs the 7B model on a 4GB Raspberry Pi: https://twitter.com/miolini/status/1634982361757790209

12th March: npx dalai llama: https://cocktailpeanut.github.io/dalai/

13th March (today): llama.cpp on a Pixel 6 phone: https://twitter.com/thiteanish/status/1635188333705043969

And now, Alpaca. It's not even lunchtime yet!

Turned this into a blog post: https://simonwillison.net/2023/Mar/13/alpaca/




Here is one question I have not seen answered yet:

All the magic of "7B LLaMA running on a potato" seems to involve lowering precision down to f16 and then further quantizing to int4.

Clearly this quantized model still outputs something resembling human language, at the very least.

But I haven't seen anyone show what effect this quantizing has on the quality of the output. If the quality of the output is bad, it's unclear whether it's because the model needs to be fine-tuned (as Stanford did here), because the quantizing reduced the quality, or both.

If this fine-tuned Stanford model still has excellent output after quantizing it to run on a Raspberry Pi 4GB, that would be awesome!
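For context, the int4 step being discussed is, in its naive form, just round-to-nearest with a per-block scale. A minimal pure-Python sketch of the idea (the block size and exact layout here are illustrative, not llama.cpp's actual format):

```python
def quantize_int4(weights, block_size=32):
    """Naive round-to-nearest int4 quantization with one scale per block.

    Each block of float weights maps to integers in [-7, 7] plus a float
    scale; dequantizing is just int * scale. This is the lossy step whose
    effect on output quality the comment above is asking about.
    """
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = max(abs(w) for w in block) / 7 or 1.0  # avoid div-by-zero
        ints = [max(-7, min(7, round(w / scale))) for w in block]
        blocks.append((scale, ints))
    return blocks

def dequantize_int4(blocks):
    # Reconstruct approximate floats from (scale, ints) pairs.
    return [i * scale for scale, ints in blocks for i in ints]

w = [0.12, -0.5, 0.33, 0.01]
restored = dequantize_int4(quantize_int4(w))
# restored is close to w but not equal; that gap is the quantization error
```

The quality question then becomes how much that rounding error, accumulated across billions of weights, degrades generations, which is exactly what is hard to eyeball from "it still outputs human language".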


For 10 billion+ parameter models, the effects of quantization are relatively small; for smaller models like LLaMA 7B the effect becomes more dramatic. But there is ongoing research on new quantization methods (like GPTQ) that preserve significant performance even at the lower end.

Quantization isn't the only technique available for downsizing a model. LLaMA itself is already the result of sizing the model and training data according to "Chinchilla optimality", a very recent (as in 2022) result that e.g. GPT-3 predates. The upshot is that LLaMA-13B performs similarly to GPT-3 175B on benchmarks despite the tremendous size difference. Separately, there are also a variety of pruning methods to further eliminate inactive weights in the trained model (I think this is also active research).

Finally, even on something like a Raspberry Pi, implementations for inference (like llama.cpp) are nowhere near mature yet. There are already a multitude of runtimes available for inference, making large tradeoffs between performance and flexibility (e.g. many models running on PyTorch vs. ONNX report 5-10x speedups under ONNX).

I think the really exciting part of Alpaca is the size and budget of the team: 5 students with $100 spent scraping OpenAI put this model together in a couple of hours of training. Any notion of a premium persisting in the AI space for much longer seems fantastic at best; for all intents and purposes it has already been commoditized. And that's scary considering the size of the dent ChatGPT has put in my Google traffic.


Llama is trained with _more_ data than is chinchilla optimal in order to make it better and cheaper at inference time, instead of just getting the highest quality of model that you can based on a given training budget. Llama has fewer parameters and was trained on more data specifically so that it would get high quality results on cheaper hardware and be easier and faster to run at inference time.


Curious about the google traffic comment. Are you saying people are visiting sites less because they can stay on Bing/OpenAI?


There is some very natural split regarding what I'll send to ChatGPT vs. what goes to Google. For example "six nations fixtures" obviously Google, but anything of depth or where recency is irrelevant goes the other direction. Asked it a few Linux questions today, how to interpret the title of a particular FRED chart, and a ton more sessions that Firefox history somehow didn't manage to correctly track the title for. I vastly prefer ChatGPT's interaction format compared to the equivalent random keyword spelunking session on Google.


Same, until I realized that about 60% of the information it gives me is either subtly wrong or 100% factually incorrect. Yet it's so, so confident.


And in that way it's actually more overall correct than the most knowledgeable person on earth. With Google you also get fed some very dangerously wrong info (a recent example: masks) but you think it's correct. With ChatGPT you have to actually use your critical thinking skills to get to the truth, which in my opinion is a huge advancement over Google.


Not really: with Google you get multiple sources at a glance. Sure, they can still be wrong, but some critical thinking + multiple sources = more likely to be correct than relying on a single unreliable source.


No, the crucial thing is that a good human will tell you if they don't know something, or if they are simply unsure.


It's adorable seeing this kind of critique in the context of HN, I wonder how many folk knew my heavily upvoted comment above ("For 10 billion+ ...") was from someone who has only been looking at this stuff for a few weeks. ChatGPT is no better or worse than any consultant I've ever met (including myself), or most of the commenters you find here every single day.


It's adorable that you think people assume HN comments are factually correct. I read everything here with extreme skepticism, because I know it's all coming from flawed humans. A computer system giving authoritative text and insisting it is 100% correct is a different story.


Every bit of text from a computer system also comes from flawed humans.


I don't know the data, but as an anecdote: for most searches that would have returned blogspam (e.g. "what's the best birthday gift for a groom") I'm relying more and more on ChatGPT.

I used to use it even more, but some of the recent changes reduced its ability at complex, creative tasks.


It's a nice business model, scrape the web and be the ultimate knowledge middle man


The difference is small, UNTIL you get to 4 bit quantization, where the model is noticeably dumber.

8 bits, imo, is the minimum.


WRONG. Research shows effectively imperceptible performance difference at 4-bit and even 3-bit with GPTQ quantization. You cannot tell the difference and if you think you do you're wrong, because it barely even registers on any benchmark.

(Note: llama.cpp's 4bit is naive, not GPTQ, and sucks but they are refactoring it to use GPTQ quantization)

References:

https://arxiv.org/abs/2210.17323 - GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers [Oct, 2022]

https://arxiv.org/abs/2212.09720 - The case for 4-bit precision: k-bit Inference Scaling Laws [Dec, 2022]

https://github.com/ggerganov/llama.cpp/issues/9 - llama.cpp: GPTQ Quantization (3-bit and 4-bit) #9

https://github.com/qwopqwop200/GPTQ-for-LLaMa/ - 4 bits quantization of LLaMa using GPTQ


Good points, though I would gently encourage not starting a post with "WRONG." in the middle of a nuanced discussion. I remember 'way back when' there was a .5-2% flat performance drop for UINT8 on some models when it was first introduced (depending on the modality).

Like, 4 bit quantization really is probably enough for a number of usecases and likely beats smaller models with precision enough to make it the equivalent number of bits, but this really is only presenting half of the story. "You cannot tell the difference and if you think you do you're wrong, because it barely even registers on any benchmark" can be regarded as antagonistic, and also really doesn't line up with reality in a number of usecases. Sure, maybe for some models, UINT4 quantization is good enough. But there's a very large space of model architectures and problems, even for language learning, many of which do have very demonstrable drops in performance. And at certain perplexity levels, every bit (heh) matters.

In any case, an argument for moderation, please.


Good points, I didn't mean to come off as abrasive, but I can see why I would. My intention was to get attention on a thread where my new comment would be buried under the 8 other replies, so I put a big attention grabber at the start.

But again, good points about the nuances of lower precision. For LLMs at least, 'The Case for 4-bit Precision' and 'GPTQ' seem fairly conclusive that over ~10B parameters even 3-bit precision has virtually undetectable loss with the right tricks. Losses which, if they even mattered, can easily be overcome with a little additional training.

Newer ongoing research on LLaMA specifically[0] shows we can reduce the model's size around 84% without any meaningful performance loss through a combination of GPTQ, binning, and 3-bit.

[0] https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-i...


Take that "WRONG!" as a reference to Two Stupid Dogs, and it may be a lot easier to stomach. :)


Some parameters would be more sensitive than others I suppose? So could you use 4 bits for most, and 8 bits, or even 16, for the remaining?


I know nothing about this so my opinion means little, but I imagine it's hard to know which parameters are important enough to use more bits for.

I do wonder if it would be possible to have the model determine during training how important each parameter is, while maybe rewarding it for having more small parameters?


That's exactly why bitsandbytes has a threshold parameter to control the quantization.
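To illustrate the idea behind that kind of threshold (this is a sketch of the general outlier/mixed-precision decomposition, not the actual bitsandbytes API): large-magnitude weights carry disproportionate signal, so they stay in high precision while everything below the threshold gets quantized cheaply.

```python
def split_by_threshold(weights, threshold=6.0):
    """Split weights into a cheap-to-quantize set and a high-precision
    outlier set. Values at or above the magnitude threshold are treated
    as outliers and kept at full precision; the rest can be quantized
    aggressively. The threshold value here is arbitrary for illustration.
    """
    low = {i: w for i, w in enumerate(weights) if abs(w) < threshold}
    outliers = {i: w for i, w in enumerate(weights) if abs(w) >= threshold}
    return low, outliers

low, outliers = split_by_threshold([0.1, -7.2, 0.3, 6.5, -0.4])
# indices 1 and 3 are outliers and would stay fp16; the rest get 4/8-bit
```

In real implementations the expensive part is doing the matmul in two pieces (quantized part plus a small dense fp16 part for the outlier columns) and summing the results.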


Nice, good to know, thanks!


So which is better, running 7B without quantization or running 13B with? They both require about the same amount of VRAM (10GB).


Empirically, 13B with quantization.

In fact the person who said 4bit is worse is empirically incorrect.

13B with quantization even down to 3 bits has very nearly the same performance as uncompressed 16-bit 13B, using GPTQ quantization and binning.

Source: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-i...


I looked at the numbers you posted, and am feeling concerned with how aggressively you're commenting towards a number of people on this website.

For starters, I started in this field a few years after the 2012 wave started. I've been with it for a while and have seen a lot of trends come and go. One thing that stays the same is that things are always changing. Very few things are set in stone, and due to a few other things it takes years and years before anything even begins to be finalized.

The numbers you are quoting are from various research groups, and are days to weeks old. You've antagonized a number of users in this forum, from calling them wrong directly to saying that another person is empirically incorrect based on numbers you haven't verified yourself, and that have not had time to settle in the field yet with respect to real-world usecases. I went to one of the methods you linked, GPTQ, and it indeed had a _good_ performance-to-size improvement, but it was not 'no difference'. This also doesn't account for the fact that 4-bit GPU support is still not well supported. On 13B, for 4-bit, a .1 perplexity difference is great, but I also believe that it is at least a noticeable drop. The .42 perplexity drop for 3-bit is massive, though still very information-efficient.

This completely ignores the conversation about (back on the GPU side of things) kernel-level support for these operators, which is very underdeveloped. Technical and unvalidated lab numbers do not represent the real world; it's like battery technologies: there are impressive tech demos and numbers out there, but the two are very different things. Like many things, in my experience at least, it comes down to a big 'it depends'. It'll all settle out in the wash and we'll see which methods end up reigning in the long run.

Again -- please stop attacking other HN users based on a partial -- if well-researched -- understanding of the subject matter. It seems you're very involved in this topic, and I agree that more people need to hear about it. I think you could do an excellent job in sharing that news to them. That is good, and I hope the evangelism efforts go well and wish you all the best on that front. However, it seems (and this may be an inappropriate judgement on my end) that you might have become personally entangled in what is generally a technical issue.

I am just a commenter on this website, though I have used hacker news for a very long time at this point. I requested previously that you tamp down flaming the other users a bit, and I'd like to ask you once more. A good litmus test to maybe ask yourself is "Am I including any information in this message that indicates that another person may be right or wrong, or that I might be right or wrong? How strongly do I feel that my perspective is reality vs their incorrect perspective?" If you trigger that line when writing out a comment -- even if there is a strong impulse to ignore it, it may be time to step back, breathe, and separate out what is a personal issue for you, and what is a technical issue that you are passionate about. You can have both at once.

Please just slow it down a bit. I want to see what you and everyone else can mutually bring to the table in this conversation. Thank you.


Many good points. I agree with essentially everything you've said, especially regarding relative perplexity.

I'm aware that I was aggressively overselling an unnuanced and overstated position on 4-bit and especially 3-bit performance. That was partially a rhetorical tactic to swing the pendulum the other way, as it were.

And partially it was simply frustration with the number of threads I've seen in the past week of LLaMA drama spreading misinformation about bit precision like "a 16bit 13B model surely outperforms a 4-bit 30B model" which could not be further from the truth. That frustration is my own responsibility to manage and I understand that.


Definitively, 13B with quantization will perform better. 4bits has been shown to be the optimal quantization for accuracy vs memory requirements.


Yeah, 7b vs 13b is basically no comparison in any situation, 16bit 7b is def worse than 4bit 13b. I'll be looking into 30B tomorrow. I may be able to do a full matrix of tests 4-16bit X 7-30b.


This is interesting. What sizes are you seeing this for?


I have heard that the human brain uses the equivalent of around 6 bits. I wonder if that is some kind of optimum reached by evolution.


> All the magic of "7B LLaMA running on a potato" seems to involve lowering precision down to f16

LLaMA weights are f16s to start out with; no lowering necessary to get there.

You can stream weights from RAM to the GPU pretty efficiently. If you have >= 32GB ram and >=2GB vram my code here should work for you: https://github.com/gmorenz/llama/tree/gpu_offload

There's probably a cleaner version of it somewhere else. Really you should only need >= 16GB of RAM, but the (Meta-provided) code to load the initial weights is, completely unnecessarily, making two copies of the weights in RAM simultaneously. You could also lower VRAM requirements a bit more with a bit more work (I just made the smallest change possible to make it work).
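The double-copy problem is generic: materializing a whole checkpoint dict and then copying it into the model holds two copies at peak. A hedged sketch of the fix, loading one tensor at a time and releasing it immediately (toy stand-ins, not the actual Meta loading code):

```python
def load_weights_inplace(model_params, checkpoint_loader):
    """Load weights one tensor at a time instead of materializing the
    whole checkpoint alongside the model, which doubles peak RAM.
    `checkpoint_loader` is assumed to yield (name, tensor) pairs lazily;
    each tensor is handed to the model and the loader's reference dropped.
    """
    for name, tensor in checkpoint_loader():
        model_params[name] = tensor  # any old placeholder is now freed
        del tensor  # drop our extra reference right away

# Toy demonstration with plain lists standing in for tensors:
def fake_loader():
    for name in ("wq", "wk"):
        yield name, [0.0] * 4  # pretend this is a multi-GB tensor

params = {}
load_weights_inplace(params, fake_loader)
```

With a real framework the same shape of fix applies (e.g. streaming tensors off disk lazily rather than deserializing the full state dict up front), which is what brings peak usage back down to roughly one copy of the weights.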


> the (meta provided) code to load the initial weights is completely unnecessarily making two copies of the weights in RAM simultaneously

This is the kind of thing that the Stable Diffusion community optimized the shit out of.


Decrease in accuracy is negligible and decreases as model size increases. That is, larger models quantize even better than smaller models.

https://arxiv.org/abs/2210.17323


Is this because averages are weighted less (less sensitive) as the total sample size increases?


Yes. In a dense everything to everything neural network layer, the number of 'inputs' to a node is proportional to the square root of the number of weights.

Therefore, assuming quantization noise is uncorrelated, as the number of weights doubles, the number of inputs goes up by sqrt(2), and the (normalized) noise goes down by a factor of 2*(sqrt(2)).

So, as a rule of thumb, you can remove 1 bit of precision of the weights for every 4x increase in the number of weights.

All this assumes weights and activations are uncorrelated random variables - which may not hold true.
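The core premise here, that uncorrelated noise cancels when many terms are averaged, can be checked numerically. A toy simulation (synthetic Gaussian noise standing in for quantization error, under the same uncorrelatedness assumption):

```python
import random

def rms_of_averaged_noise(n_terms, trials=500, seed=0):
    """RMS size of the average of n_terms i.i.d. unit-variance noise
    samples. If the noise really is uncorrelated, this shrinks like
    1/sqrt(n_terms)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        avg = sum(rng.gauss(0.0, 1.0) for _ in range(n_terms)) / n_terms
        total += avg * avg
    return (total / trials) ** 0.5

# Quadrupling the number of averaged terms roughly halves the residual
# noise, i.e. buys back about one bit of effective precision.
r100 = rms_of_averaged_noise(100)
r400 = rms_of_averaged_noise(400)
```

This only validates the cancellation step; whether the full "1 bit per 4x weights" rule of thumb follows depends on the other assumptions in the comment (how inputs scale with weights, and the correlation caveat at the end).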


Something is wrong with this math... by your logic I could scale the network up big enough that I could quantize the weights down to zero bits...


Having fewer than 1 bit per weight is not absurd. E.g. you can use 2 bits to represent 3 'weights' if you insist that at most one of the weights is allowed to be nonzero. If you try to order nodes so that adjacent nodes are uncorrelated the performance loss might be manageable.

People are already doing stuff like this (see sparsification) so it is conceivable to me that this is just what networks will look like in a few years.
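The 2-bits-for-3-weights trick, sketched concretely: a group of 3 weights with at most one nonzero has only 4 possible support patterns, so 2 bits pick the pattern (this ignores storing the surviving weight's value, which would need its own bits):

```python
# 4 states for a group of 3 weights where at most one is nonzero:
# state 0 = all zero, state k = only weight k-1 is nonzero.
def encode_group(group):
    nonzero = [i for i, w in enumerate(group) if w != 0]
    assert len(nonzero) <= 1, "at most one nonzero weight per group"
    return (nonzero[0] + 1) if nonzero else 0  # fits in 2 bits

def decode_group(state, value=1.0):
    group = [0.0, 0.0, 0.0]
    if state:
        group[state - 1] = value
    return group

state = encode_group([0.0, 0.7, 0.0])  # binary 10: only slot 1 survives
```

That's 2/3 of a bit per weight just for the sparsity pattern, which is the sense in which sub-1-bit storage is conceivable.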


> If you try to order nodes so that adjacent nodes are uncorrelated the performance loss might be manageable.

shower thought

In graphics we use barycentric coordinates to encode the position within an arbitrary triangle using two coordinates (u,v), with the third being constrained to be 1-u-v. If you order nodes to be correlated, could you use a similar trick to encode three weights for the price of two?
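The analogy would look something like this: if three weights are constrained to sum to 1 (a correlation you'd have to impose or learn, which is the hypothetical part), storing two of them is enough:

```python
def compress_triple(w):
    """Store only (u, v); the third weight is implied by the constraint
    u + v + w3 == 1, exactly like barycentric coordinates in a triangle."""
    u, v, w3 = w
    assert abs(u + v + w3 - 1.0) < 1e-9, "constraint must hold to drop w3"
    return (u, v)

def decompress_triple(uv):
    u, v = uv
    return (u, v, 1.0 - u - v)

# Three weights stored as two values: a 1/3 saving from the constraint.
roundtrip = decompress_triple(compress_triple((0.25, 0.25, 0.5)))
```

The catch, as with the sparsity trick above it, is that unconstrained trained weights won't satisfy such a relation, so you'd pay for it in model quality or in training-time regularization.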


Yes, it's the same thing.


Rules of thumb typically are just first order approximations which by definition are not guaranteed to hold far from their point of interest (or point of tangency).


See: https://arxiv.org/abs/2210.17323

Q: Doesn't 4-bit have worse output performance than 8-bit or 16-bit?

A: GPTQ doesn't quantize linearly. While RTN 8-bit does reduce output quality, GPTQ 4-bit has effectively little output-quality loss compared to the baseline uncompressed fp16.

https://i.imgur.com/xmaNNDd.png


This is really interesting, thank you for the reference!

Having worked more with image-based NNs than language models before, I wonder: are LLMs inherently more suited to aggressive quantization, due to their very large size? I see people here suggesting 4-bit is pretty good, and 3-bit should be the target.

I remember ResNets etc. can of course also be quantized, and down to 8-6 bits you get pretty good results with very little effort and low-ish degradation in performance. Going down to 4 bits is more challenging, though this paper claims that with quantization-aware training 4 bits is indeed possible; but that means a lot of dedicated training compute to get to 4 bits (not just post-training fine-tuning): https://arxiv.org/abs/2105.03536


Might I suggest looking at the story between the 2nd and 10th of March? I've noticed Hacker News hasn't been following certain areas of the effort. A lot of great work has happened, and continues to happen, in close conjunction with text-generation-webui (seriously, most of the cutting edge with 4-bit GPTQ etc. has been closely tied to the project).

>https://github.com/oobabooga/text-generation-webui/


Wow, yeah that's a VERY active project: https://github.com/oobabooga/text-generation-webui/graphs/co... - only started Dec 18, 2022 and already 22 contributors and 806 commits!


I'm excited to see what the OpenAssistant crowd does with these models, they seem to have gathered the dataset to finetune them.

Lots of people use these models as talk therapy. We really need 1) standalone options, 2) reproducible weights with crowd sourced datasets to reduce biases (or at least know who you're talking to).


Question: what percentage of the hype and momentum for this is so people can run sex chatbots on their local machine?


A lower portion than the equivalent number for Stable Diffusion, but still significant.


Feature-length AI-generated pornos don't seem that far off the horizon.


Or really just any text generation that chatGPT dislikes. It's nice not to be judged by a program (and perhaps logged somewhere that you asked for something "inappropriate").


Also today: ChatGLM released by Tsinghua University. I've made a separate submission for it: https://news.ycombinator.com/item?id=35150190

The GitHub page is https://github.com/THUDM/ChatGLM-6B. The GitHub description is all in Chinese, but the model itself handles English queries well on a single consumer GPU. Considering its size, I'd say the quality of its responses is outstanding.


llama.cpp with 65B parameters runs on a MacBook M1 Max with 64GB of RAM. See https://gist.github.com/zitterbewegung/4787e42617aa0be6019c3...


That is still a $4000 computer. You can get 2 used RTX 3090s for ~$1000 and run 65B much faster.

I have a Discord server up serving almost 500 users with 65B.

https://twitter.com/ortegaalfredo/status/1635402627327590400

For some things it is better than GPT-3; for others, even Alpaca is better.


How do you make it load on two GPUs, or does llama.cpp do it automatically? I have a setup with a Threadripper, an RTX 3090, and a Titan RTX. I haven't had the time to set it up, so that's why I have been using my Mac.


llama.cpp doesn't use the GPU at all. The genius *.cpp projects (whisper.cpp, llama.cpp) are specifically intended to optimize/democratize otherwise GPU-only models to run on the CPU, without GPU stacks (CUDA, ROCm). Technically speaking, the released models are capable of running on CPU via standard framework support (PyTorch, TensorFlow), but in practice, without a lot of optimization, they are incredibly slow to the point of uselessness; hence *.cpp.

You want something along these lines (warning: unnecessarily potentially offensive):

https://rentry.org/llama-tard-v2


llama.cpp takes advantage of the fact that LLaMA 7B is a tiny, very optimized model. It will run on anything, and very fast. I really doubt you can run the 30B or 65B models at acceptable speed on a CPU, at least for a couple of years. (I'm ready to eat my words in a couple of weeks.)


Okay, my Threadripper can handle it because it has 128GB of RAM.


What's the correlation between parameter count and RAM usage? Will LLaMA-13B fit on my MacBook Air with 8 GB of RAM or am I stuck with 7B?


13B uses about 9GB on my MacBook Air. If you have another (x86) machine with enough RAM to convert the original LLaMA representation to GGML, you can give it a try. But the quantization step must be done on the MacBook.

Maybe it is more feasible for you to use 7B with a larger context. For some "autocompletion" experiments with Python code I had to extend the context to 2048 tokens (+1-1.5GB).
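The rule of thumb behind these numbers: RAM is roughly parameter count times bytes per weight, plus overhead for activations and the context. A back-of-the-envelope sketch (the 1GB overhead constant is a guess, and real usage varies by implementation and context length):

```python
def model_ram_gb(n_params_billion, bits_per_weight, overhead_gb=1.0):
    """Rough RAM estimate for inference: the weights dominate, so it's
    mostly params * bits / 8, plus some slack for activations and the
    KV cache (the `overhead_gb` value here is an illustrative guess)."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

# 13B at 4-bit: ~6.5GB of weights plus overhead, which matches the
# "about 9GB" observed above once real-world slack is included; 7B at
# 4-bit is ~3.5GB of weights, much friendlier for an 8GB machine.
```

This is why 13B at 4-bit is borderline on an 8GB MacBook Air (it runs, but leans on swap), while 7B leaves headroom for a larger context.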


I have also seen it working on a Mac Studio with 64GB of RAM. It is quite slow, though not unbearably so.


A lot of them aren't very good, though, at the same VRAM level as Stable Diffusion, unfortunately (and we've had large, non-consumer-GPU LLMs open-sourced for a while, e.g. GPT-J).


That is likely because "good" is a higher bar in language than images, because people don't mind or notice the longer range artifacts in image models as much.


A lot of people are running Llama using the CPU/system memory.



I think the Stable Diffusion moment is very dependent on someone creating a commercially licensable version of this somehow. The prospect of never being able to put your creations in a product is too inhibiting for the hypergrowth Stable Diffusion saw.


I know, this is crazy!!

I can't fathom how development has suddenly seemed to accelerate.


The timing of the Facebook leak seems suspect.


What do you mean?


I mean ChatGPT had a lot of attention, so a leak of a competing architecture would shift the attention away from ChatGPT. Which Meta's LLaMA did. And we see it swinging in the other direction with OpenAI announcing GPT-4.


Do you mean Meta’s publishing of Llama?



