How Is LLaMa.cpp Possible? (finbarr.ca)
685 points by birriel on Aug 15, 2023 | 227 comments



In case anyone is wondering, yes, there is a cost when a model is quantized.

https://oobabooga.github.io/blog/posts/perplexities/

Essentially, you lose some accuracy: there might be some weird answers, and the model is probably more likely to go off the rails and hallucinate. But the quality loss is smaller the more parameters you have, so for very large model sizes the differences might be negligible. Also, this is the cost of inference only. Training is a whole other beast and requires much more power.

Still, we are looking at GPT-3 level of performance on one server rack. That says something when less than a year ago, such AI was literally magic and only ran in a massive datacenter. Bandwidth and memory size are probably, to my ignorant mind, easier to increase than raw compute, so maybe we will soon actually have "smart" devices.


I was hoping that link would answer the question that's been bugging me for months: what are the penalties that you pay for using a quantized model?

Sadly it didn't. It talked about "perplexities" and showed some floating point numbers.

I want to see examples like "here's a prompt against a model and the same prompt against a quantized version of that model, see how they differ."


I have several sets of quant comparisons posted on my HF spaces, the caveat is my prompts are all "English to code": https://huggingface.co/spaces/mike-ravkine/can-ai-code-compa...

The dropdown at the top selects which comparison: Falcon compares GGML, Vicuna compares bits and bytes. I have some more comparisons planned, feel free to open an issue if you'd like to see something specific: https://github.com/the-crypt-keeper/can-ai-code


  I want to see examples like "here's a prompt against a model and the same prompt against a quantized version of that model, see how they differ."
We suck at evaluating and comparing models imo. There are metrics and evaluation tasks, but it's still very subjective.

The closer we get to assessing human like performance, the tougher it is, because it becomes more subjective and less deterministic by the nature of the task. I don't know the answer, but I know that for the metrics we have it's not so easy to translate them into any idea about the kind of performance on some specific thing you might want to do with the model.


Not mathematically, at the very least. Perplexity is a translation of the best measure we have for how a model is doing empirically over a test dataset (both pre- and post-quantization). It is usable enough to be, at the very least, the final word on how different quantization methods perform.

Subjective ratings are different, but for compression things are quite well defined.


> some specific thing you might want to do with the model.

I think this right here is the answer to measuring and comparing model performance.

Instead of trying to compare models holistically, we should be comparing them for specific problem sets and use cases... the same as we compare humans against one another.

Using people as an example, a hiring manager doesn't compare 2 people holistically, they compare 2 people based on how well they're expected to perform a certain task or set of tasks.

We should be measuring and comparing models discriminately rather than holistically.


You could have two models answer 100 questions the same way, and differ on the 101st. They’re unpredictable by nature - if we could accurately predict them we’d just use the predictions instead.


(Stupid question) are models still non-deterministic if you set the temperature to zero?

Would setting the temperature to zero degrade the quality of response?


Even at T=0 and run deterministically, the answers still have "randomness" with respect to the exact prompt used. Change wording slightly and you've introduced randomness again even if the meaning doesn't change. It would be the same for a person.

For an llm, a trivial change in wording could produce a big change in answer, same as running it again with a new random seed. "Prompt engineering" is basically overfitting if not approached methodically. For example, it would be interesting to try deliberate permutations of an input that don't change the meaning and see how the answer changes as part of an evaluation.
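A sketch of that kind of perturbation check; `generate` here is just a stand-in for whatever backend you actually call, and the paraphrases are hand-written for illustration:

    # hand-written paraphrases of "the same" question
    prompts = [
        "List three causes of the French Revolution.",
        "Name three causes of the French Revolution.",
        "What were three causes of the French Revolution?",
    ]

    def generate(prompt: str) -> str:
        # stand-in: replace with a real call to your model/backend
        return "The Estates-General, food shortages, and royal debt."

    answers = [generate(p) for p in prompts]
    # crude stability check: how much do the answers overlap word-wise?
    sets = [set(a.lower().split()) for a in answers]
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            jaccard = len(sets[i] & sets[j]) / len(sets[i] | sets[j])
            print(f"prompt {i} vs {j}: word overlap {jaccard:.2f}")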


But if T=0 and you use the exact same input (not a single word or position changes) do you get the same output? Reading your response it implies that the randomness is related to even slight changes.


As a sibling comment mentioned, threading on a GPU is not automatically deterministic, so you could get randomness from there, although I can't think of anything in the forward pass of a normal LLM that would depend on execution order. So yes, you should get the same output; it's basically just matrix multiplication. There may be some implementation details I don't know about that would add other variability though.

Look at this minimal implementation (Karpathy's) of LLaMA, the only randomness is in the "sample" function that comes in at non-zero temperature, otherwise its easy to see everything is deterministic: https://github.com/karpathy/llama2.c/blob/master/run.c
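The shape of that sample step, in a few lines of Python, just to show where the randomness enters (the logits are made up):

    import math, random

    def next_token(logits, temperature):
        if temperature == 0.0:
            # greedy: always the argmax, so repeated runs give identical output
            return max(range(len(logits)), key=lambda i: logits[i])
        # otherwise scale, softmax, and sample -- this is the only random step
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        return random.choices(range(len(probs)), weights=probs)[0]

    logits = [1.2, 3.4, 0.7, 2.9]      # toy values
    print(next_token(logits, 0.0))     # always 1
    print(next_token(logits, 0.8))     # varies run to run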

Otoh, with MoE like GPT-4 has, it can still vary at zero temperature.


Some GPU operations give different results depending on the order they are done. This happens because floating point numbers are approximations and lose associativity. Requiring a strict order causes a big slowdown.
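You can see the associativity issue without a GPU at all; summing the same numbers in a different order can give a different float (contrived values):

    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0 -- the 1.0 is lost when added to -1e16 first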


Well, the same is true for people, and yet hiring managers still evaluate for specific tasks.


It makes the model dumber.

That seems simplistic, but it's really as simple as that. Naive 3-bit quantization will turn llama 7B into blubbering nonsense.

But llama.cpp quantization is good! I recommend checking out the graphs ikawrakow made for their K-quants implementation:

https://github.com/ggerganov/llama.cpp/pull/1684

Basically, the more you quantize with K-quant, the dumber the model gets. 2 bit llama 13B quant, for instance, is about as dumb as 7B F16, but the dropoff is not nearly as severe from 3-6 bits.


FWIW here's why perplexity is useful: it's a measure of uncertainty that can easily be compared between different sources. Perplexity k is like the uncertainty of a roll of a k-sided die. Here I think perplexity is per-token, and is measuring the likelihood of re-generating the strings in the test set.

e.g. take a look at these two rows:

    llama-65b.ggmlv3.q4_K_M.bin 4.90639 llama.cpp
    llama-65b.ggmlv3.q3_K_M.bin 5.01299 llama.cpp
So for the reduction in size given by (q4 -> q3), you get a 2% increase in the uncertainty. Now, that doesn't tell you which specific capabilities get worsened (or even if that's really considered a huge or tiny change), but it is a succinct description of general performance decreases.

If you want more fine-grained explanations of how generation of certain types of texts get clobbered, you would probably need to prepare datasets comprised of that type of string, and measure the perplexity delta on that subset. i.e.

    dperplexity/dquantization(typed_inputs).
I think it might be more difficult to get a comprehensive sense of the qualitative differences in the other direction, e.g.

    dtype/dquantization(all_outputs).
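For the first of those, a sketch of what per-category perplexity deltas could look like, assuming you can pull a per-token log-probability of the same reference text out of both the full-precision and quantized models (the numbers below are invented):

    import math

    # invented per-token log-probs of the same reference text under two models,
    # grouped by the kind of text ("code", "prose", ...)
    samples = {
        "code":  {"fp16": [-1.2, -0.8, -1.5], "q3": [-1.6, -1.1, -1.9]},
        "prose": {"fp16": [-2.0, -1.7, -2.2], "q3": [-2.1, -1.8, -2.3]},
    }

    def perplexity(logprobs):
        return math.exp(-sum(logprobs) / len(logprobs))

    for kind, lp in samples.items():
        full, quant = perplexity(lp["fp16"]), perplexity(lp["q3"])
        print(f"{kind}: {full:.2f} -> {quant:.2f} ({100*(quant/full-1):+.1f}%)")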


The problem is that it's not consistent enough for a good demo. Not even two different models, but even two different fine tunes of the same base model may be wildly differently affected by quantization. It can range from making hardly a difference to complete garbage output.


I have been using nat.dev to compare quantized models and it works great.


Just the other day someone published ARC comparison results for different quants as well as the code for the harness that they used to easily run lm-eval against quants to your heart's content: https://www.reddit.com/r/LocalLLaMA/comments/15rh3op/effects...


It will be different for every use case; the only way to find out is spinning one up.


Would an answer of "there aren't many significant penalties" suffice?


>Still, we are looking at GPT-3 level of performance on one server rack. That says something when less than a year ago, such AI was literally magic and only ran in a massive datacenter.

I'm not sure what you mean by this. You've always been able to run GPT3 on a single server (your typical 8xA100).


8xA100 is technically a single server, but I think OP is talking about affordable and plentiful CPU hosts, or even relatively modest single GPU instances.

DGX boxes do not grow on trees, especially these days


Am I missing something or how do you know this? Also I think the OP was talking about a single card not multiple but that was just my reading.


Because 175B parameters (350GB for the weights in FP16; let's say a bit over 400GB for actual inference) fit very comfortably on 8xA100 (640GB VRAM total).

And basically all servers will have 8xA100 (maybe 4xA100). Nobody bothers with a single A100 (of course in a VM you might have access to only one)
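The back-of-the-envelope version (the inference overhead number is a rough allowance, not a measurement):

    params = 175e9                           # GPT-3 scale
    weights_fp16_gb = params * 2 / 1e9       # 2 bytes per parameter -> 350 GB
    inference_gb = weights_fp16_gb + 60      # rough allowance for KV cache/activations
    print(weights_fp16_gb, inference_gb)     # 350.0, ~410 GB
    print(8 * 80)                            # 640 GB of VRAM across 8xA100-80GB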


> And basically all servers will have 8xA100

for those wondering: no this is not the norm. My lab at CMU doesn't own any A100s (we have A6000s).


The servers the commenter is talking about are DGX machines from NVIDIA.

It doesn’t really make sense to BTO. What you gain economically you lose in the science you can do.

But nobody could have anticipated this.


you could also get HGX from any of the vendors.


Wtf does HGX mean? God enough with the acronyms people.

Please take an extra ten seconds to speak in proper human language!

You could save on the world's carbon footprint by reducing the number of times humans have to search for "what is NVIDIA HGX?" or "what is AMD HGX" and then subsequently visit the websites to see if that's right or not.


What does Wtf mean? God enough with the acronyms people. /s


You got me there hahaha

However, there’s a difference between an acronym known to the broader public versus some single shot, context-specific one!


Whose norm? I assure you it's the norm. :)


> And basically all servers will have 8xA100 (maybe 4xA100). Nobody bothers with a single A100 (of course in a VM you might have access to only one)

Wishing or guessing something without actual experience doesn't make it true.


The effect is lesser than you think. 5 bit quantization has negligible performance loss compared to 16 bits: https://github.com/ggerganov/llama.cpp/pull/1684


This paper from last month has a method for acceptable 3-bit quantization and a start at 2-bit.

https://arxiv.org/abs/2307.13304


Yes, there is a logarithmically-bound (or exponential if you're viewing it from another angle) falloff in the information lost in quantization. This comes from the non-uniform "value" of different weights. We can try to get around them with different methods, but at the end of the day, some parameters just hurt more to squeeze.

What is insane though is how far we've taken it. I remember when INT8 from NVIDIA seemed like a nigh-pipedream!


Good blog post, shame the site has no RSS feed!


Could this be why people recently say they see more weird results in ChatGPT? Maybe OpenAI is trying out different quantization methods for the GPT4 model(s) to reduce resource usage of ChatGPT.


I'd be more inclined to believe that they're dropping down to gpt-3.5-turbo based on some heuristic, and that's why sometimes it gives you "dumber" responses. If you can serve 5/10 requests with 3.5 by swapping only the "easy" messages out, you've just cut your costs by nearly half (3.5 is like 5% of the cost of 4).
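The arithmetic behind "nearly half", treating 3.5 as ~5% of the per-request cost of 4 and assuming half the traffic can be swapped:

    cost_gpt4, cost_gpt35 = 1.00, 0.05       # relative per-request cost
    blended = 0.5 * cost_gpt4 + 0.5 * cost_gpt35
    print(blended)                            # 0.525, i.e. roughly 47% cheaper than all-GPT-4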


Serving me ChatGPT 3.5 when I explicitly request ChatGPT 4 sounds like a very bad move? They're not marketing it like "ChatGPT Basic" and "ChatGPT Pro".


Thank you! Is there a sweet spot with quantization? How much can you quantize for a given model type and size and still be useful?


Tim Dettmers recently (https://www.manifold1.com/episodes/ai-on-your-phone-tim-dett...):

"But what we found with these neural networks is, if you use 32 bits, they're just fine. And then you use 16 bits, and they're just fine. And then with eight bits, you need to use a couple of tricks and then it's just fine.

And now we find if you can go to four bits, and for some networks, that's much easier. For some networks, it's much more difficult, but then you need a couple more tricks. And so it seems they're much more robust."


> And now we find if you can go to four bits

That will be really interesting for FPGAs, because the current ones are basically oceans of 4-bit computers.

Yes, you can gang together a pair of 4LUTs to make a 5LUT, and a pair of 5LUTs to make a 6LUT, but you halve your parallelism each time you do that. OTOH you can't turn a 4LUT into a pair of 3LUTs on any currently-manufactured FPGA. It's simply the "quantum unit" of currently-available hardware -- and it's been that way for at least 15 years (Altera had 3LUTs back in the 2000s). There's no fundamental reason for the number 4 -- but it is a very, very deep local minimum for the current (non-AI) customers of FPGA vendors.


This is not generally true, sometimes quantisation can improve accuracy. I haven't seen that with LLMs yet though.


Interesting, how would that work? Are there any well-known examples?

Is it: the weights all happen to be where float is sparse, so quantization ends up increasing fidelity? Or is it more of a “worse is better” dropout-type situation?


I suspect it works as regularisation of the network. It usually happens when you train with quantisation instead of post-training quantisation, and I haven't seen that done with LLMs yet.


For image recognition it can sometimes be like that. My gut feeling is that lowering from fp32 to fp16 can get rid of some kind of overfitting or so.


Any use case for using the 7B model over the 13B, quantized?


Inference speed. Sometimes 7B is good enough for the task at hand, and using 13B just makes you wait longer.


SBC


Wtf does SBC mean? God enough with the acronyms people.


In my experience, it usually means Small Block Chevy, but in certain communities it means Single Board Computer, an older way of referring to devices like the Raspberry Pi.

I would elaborate and say, anywhere that your computer is resource constrained ( ram, processing power ) but you still want to make up articles for your Amazon Affiliate blog


Single board computer makes sense. I wish folks would type things out.


In this context I'd assume SBC means Single Board Computer, such as a Raspberry Pi or one of the many imitators. The article itself mentions running LLaMa on a Pi 4.

The interesting implication about running an LLM on a single board computer is that it's a proof of concept for an LLM on a smartphone. If you have a model that can produce useful results on a Raspberry Pi, you have something that could potentially run on hundreds of millions of smartphones. I'm not sure what the use case is for running an LLM on your phone instead of the cloud, but it opens some interesting possibilities. It depends just how useful such a small LLM could be.


This leaves a ton of stuff out.

- Token generation is serial and bandwidth bound, but prompt ingestion is not and runs in batches of 512+. Short tests are fast on pure CPU llama.cpp, but long prompting (such as with ongoing conversation) is extremely slow compared to other backends.

- Llama.cpp now has very good ~4 bit quantization that doesn't affect perplexity much. Q6_K almost has the same perplexity as FP16, but is still massively smaller.

- Batching is a big thing to ignore: outside of personal deployments it matters a lot.

- The real magic of llama.cpp is model splitting. A small discrete GPU can completely offload prompt ingestion and part of the model inference. And it doesn't have to be an Nvidia GPU! There is no other backend that will do that so efficiently in the generative AI space.

- Hence the GPU backends (OpenCL, Metal, CUDA, soon ROCm and Vulkan) are the de facto way to run llama.cpp these days. Without them, I couldn't even run 70B on my desktop, or 33B on my (16GB RAM) laptop.


ROCm works now! I just set it up tonight on a 6900xt with 16gb vram running wayland at the same time. The trick was using the opencl-amd package (somehow rocm packages don't depend on opencl, but llama does, idk).

I'm astonished at the results I can get from the q6_K models.


Can you please share more info on this? I have a 6900xt "gathering dust" in a proxmox server - would like to try to do a passthrough to a vm and use it. Thank you in advance!


Sure thing. There's a bunch of ways to do it, but here's some quick notes on what I did.

* arch linux has tons of `rocm` packages. I installed pretty much all of them: https://archlinux.org/packages/?sort=&q=rocm&maintainer=&fla...

* you also need this one package from AUR: https://aur.archlinux.org/packages/opencl-amd

* llama.cpp now has GPU support including "CLBlast", which is what we need for this, so compile with LLAMA_CLBLAST=ON

* now you can run any model llama.cpp supports, so grab some ggml models that fit on the card from https://huggingface.co/TheBloke.

* Test it out with: ./main -t 30 -ngl 128 -m huginnv1.2.ggmlv3.q6_K.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"

* You should see `BLAS = 1` in the llama.cpp output and you should get maybe 5 tokens per second on a 13b 6bit quantized ggml model.

* You can compile llama-cpp-python with the same arguments and get text-generation-ui to work also, but there's a bit of dependency fighting to do it.

* koboldcpp might be better, I just haven't tried it yet

Hope that helps!

Edit: just tried https://github.com/LostRuins/koboldcpp and it also works great. I should have started here probably.

Compile with `make LLAMA_CLBLAST=1` run with ` python koboldcpp.py --useclblast 0 0 --gpulayers 128 huginnv1.2.ggmlv3.q6_K.bin`


Text gen ui is nice. Some specific niceties include pre-formatted instruct templates for popular models, good prompt caching from llama-cpp-python, and integration of a vector db.

But it's also finicky, kind of unstable, and the dependencies are tricky.

Koboldcpp has other niceties, like some different generation parameters to tweak and some upstream features pulled in from PRs before the official llama.cpp release has them. The UI is nice, predating llama v1. It's standalone, dead simple to compile, and has integration with AI Horde, which is (IMO) a huge essential feature.


Same boat.

I’d love to host something local but have been so overwhelmed by the rapid progress and every time I start looking I find a guide that inevitably has a “then plug in your OpenAI api key…” step which is a hard NOPE for me.

I have a few decent gpus but I’ve got no idea where to start…


Path of least resistance:

- Download koboldcpp: https://github.com/LostRuins/koboldcpp

- Download your 70B ggml model of choice, for instance airoboros 70B Q3_K_L: https://huggingface.co/models?sort=modified&search=70b+ggml

- Run Koboldcpp with opencl (or rocm) with as many layers as you can manage on the GPU. If you use rocm, you need to install the rocm package from your linux distro (or direct from AMD on Windows).

- Access the UI over http. Switch to instruct mode and copy in the correct prompt formatting from the model download page.

- If you are feeling extra nice, get an AI Horde API key and contribute your idle time to the network, and try out other models on from other hosts: https://lite.koboldai.net/#


What’s perplexity?


Perplexity is a measure of how certain the model is of the next token. It's calculated by looking at the probabilities that the model calculates for the next token in a stream. If there are several choices for the next token with similar probabilities, that's telling you that the model is having a hard time telling what the right answer should be: the model is more perplexed, perplexity is higher. If there's a single option with a much higher implied probability than any other, that means the model is more certain, and perplexity is lower.

Note that this has nothing to do with whether the answer is objectively correct. It's just measuring how confident it is.


Reading this could make people believe it is computed from the probability distribution of the model alone.

To be clearer, it is the exponent of the average negative log probability that the model gives to the real tokens of a sample text[0]. Roughly, it relates to how strongly the model can predict the sample text. A perfect model would have perplexity one; a random model has a perplexity equal to the number of possible tokens; the worst model has infinite perplexity.

[0]: https://github.com/pytorch/torcheval/blob/3faf19c060b8a7c074...
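To make those reference points concrete, a tiny sketch using that definition (logprobs are for the true next token at each position; 32000 is LLaMA's vocabulary size):

    import math

    def ppl(logprobs):
        return math.exp(-sum(logprobs) / len(logprobs))

    vocab = 32000
    print(ppl([math.log(1.0)] * 10))        # perfect model: 1.0
    print(ppl([math.log(1 / vocab)] * 10))  # uniform guessing: 32000.0
    print(ppl([math.log(0.2)] * 10))        # gives the true token p=0.2: 5.0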


I think what you describe ("confidence about the next token") is the entropy of the model's output. A model can be very certain about the next token (its output has low entropy) but if it is usually wrong on the text you measure it against, it will have high perplexity. (For example when the model was trained only on children's books and you measure it on Wikipedia.)


Thank you!


Perplexity is measured by testing a language model on a known text. The model's output is a probability for every possible next word/token. The model is highly perplexed if it gave a low probability to the actual next token.


What's prompt ingestion?


It's the phase where llama.cpp processes the input, as opposed to the phase where it's generating the response.

Speed there is particularly important. The response can be streamed word by word, so generation doesn't have to be particularly fast, but slow input processing leads to very noticeable latency.


This project's been a blast to work with. While it's written in C++, it provides a C interface to compile against which makes it especially easy to extend with Go, Python and other runtimes.

A few folks and I have been building a tool with it in Go for pulling & running multiple models, and serving them on a REST API: https://github.com/jmorganca/ollama

In a similar light, if you haven't checked it out, llama.cpp also has a pretty extensive "server" tool (in its examples directory in the repo) with a web UI and support for grammars (e.g. forcing the output to be JSON)


> A few folks and I have been building a tool with it in Go for pulling & running multiple models, and serving them on a REST API: https://github.com/jmorganca/ollama

Why this tool was created? What was the reason for creating such tool? Why it was named this way?

Explanation for wishing to flag/block/bully basic simple questions: Questions mean what they ask and nothing else. Any imagination of aggression or anything else related to emotions is misplaced.

As someone unfamiliar with the field I’ve read GitHub page and didn’t find answers to things I was curious about regarding this project. So out of curiosity I wrote simple basic questions:

“Why this tool was created? What was the reason for creating such tool? Why it was named this way?“

Suggesting that if I didn’t figured it out by myself perhaps some one else who also wish to join the field and looking for entry point to try something in practice would also be confused and thus answering those simple basic questions would really help to the beginners.

With experience of entering into many new fields I know that usually there is a lot of confusion in the beginning about basic stuff .

I also suggested that those are obvious reasons why people ask questions and didn’t expect to be ‘bullied’ for basic simple questions. There were no one word of aggression in text. Perhaps seeing aggression there requires certain level of misplaced imagination.

Usually there is a fun story about the name for example GNU is Not Unix etc … This is opportunity to tell it if there is. Sometimes name holds some clue about functionality and then it’s easier to remember. Sharing it can help too to beginners.


I've also been experimenting with the C# implementation

https://github.com/trrahul/llama2.cs


Why Parallel.For() instead of SIMD types?

I would expect a much better performance.


Thank you for the link to your project! I'll be playing around with it.


Just wanted to say thanks homie for making something so hairy so accessible. Love it!!!


[flagged]


Because.

On a more serious note, your questions seem very... aggressive.

Especially the one about the name, is the name offensive or what? To me it sounds quite benign.


As someone unfamiliar with the field I’ve read github page and didn’t find answers to things I was curious about regarding this project.

So out of curiosity I wrote simple basic questions:

“Why this tool was created? What was the reason for creating such tool? Why it was named this way?“

Suggesting that if I didn’t figured it out by myself perhaps some one else who also wish to join the field and looking for entry point to try something in practice would also be confused and thus answering those simple basic questions would really help to the beginners.

With experience of entering into many new fields I know that usually there is a lot of confusion in the beginning about basic stuff .

I also suggested that those are obvious reasons why people ask questions and didn’t expect to be ‘bullied’ for basic simple questions. There were no one word of aggression in text. Perhaps seeing aggression there requires certain level of misplaced imagination.

> Especially the one about the name, is the name offensive or what? To me it sounds quite benign.

So why the simple question doesn’t sound benign to you? Usually there is a fun story about the name for example GNU is Not Unix etc … and sometimes it holds some clue about functionality and then it’s easier to remember.

> Because

well… it’s still unclear


Your comment has been flagged and my comment has received a bunch of upvotes. Multiple people consider your wording or tone aggressive.

In case you're a person with social issues, asking so many questions in a succession is considered quite rude.

Space them out, in writing add newlines. Or add more context to your questions.

Your comment history suggests that you do this often and you have a ton of downvoted comments.

Friendly advice :-)


Usually friendly advice is given after friendly answering question, which was not demonstrated here.

We are perfectly aware of how to deal with extra-sensitive people and kind approach is the way to go. However on this site I expect certain level of technical training and intelligence in answering direct questions without being offended by search for a deeper understanding of some topic and without inventing emotions that never been there in the first place. I was also under impression that guidelines of this site directly where encouraging answering I expect by declaring : “be kind” “don’t be rude” “suggest a good faith” and unfortunately none of those were demonstrated by your answers even when you have learned the fact that you were wrong and your accusations were completely ungrounded . The fact that such behaviour was encouraged by (as you claim it) “multiple people” including moderation of this site tells a very sad story about this site regarding following own declared principles.

And this example:

“Which crypto? Where can I pay with it? What if I don't want to have a hardware wallet everywhere with me? Can I cancel transactions and get my money back? Does it use more energy per year than Argentina?” (https://news.ycombinator.com/item?id=36656667)

tells another sad story about hypocrisy which means it’s hard to take “friendly“ advice seriously. The next time you wish to give a truly friendly advice try to be kind in the first place, do not suggest a bad faith and do not insist on false ungrounded accusations encouraged by confirmation biases especially when clarification presented.

You could have just apologise for own real aggression contrary to imaginative aggression from my questions and simply give answer while keeping “friendly” advice to yourself.

This way people would learn something about the project rather than learning something about cancelling “culture” which is under the flag of “being nice” demonstrates much worse behaviour in practice than one it claim to oppose. You actually among others should be very aware and disgusted by those tactics after you’ve seen effects of “soviet methods” in Romania.


I’ve been working through that repo and managed the 13B dataset on a single Pi4 8gig

I’ve also replicated the work in OpenMPI ( from a thread on the llama.cpp GitHub repo ) and today I managed to get the 65B dataset operational on three pi4 nodes.

I’m not saying this as any achievement of mine, but as a comment on the current reality of reproducible LLMs at home on anything you’ve got.

It really feels like this technique has arrived.

https://github.com/cameronbunce/ClusterConfig


> I’ve also replicated the work in OpenMPI...

Oh cool! How did it perform?

I wonder if this would be an exciting test for Amazon's SRD protocol which appears to be built for HPC. I'm looking for an excuse to play with it...


The objective performance I'm getting is flat poor, mostly because of the network I'm using. On the other hand, simply being able to do it at all with one node on wireless until I can pull another drop, and the rest being on 100 Mbit ... I'm really running a bargain basement cluster.

I don't know about SRD, but llama.cpp has MPI configurations built-in. I didn't have to engineer anything or rewrite anything ( I made an optimization patch, but I didn't even make that one up myself ) I just compiled it with flags set.

As far as performance on 65B, I'm still waiting for it to finish to get the timings :)


I eagerly await your numbers! Maybe I'll post some of my own if I can get far enough ahead at work.

I was thinking it's time to upgrade from 1GigE anyway, 10GigE is cheap and at work we're ripping that out in favor of 25 and 50...

I'll look at the code, depending on how well the authors used MPI it could be exciting times! It's not that hard (or expensive) to get a bunch of power hungry used servers off ebay and string em together with a cheap 10GigE switch. It would be loud and power hungry but I wonder if I could have a 65B local model in the privacy of my own home, for fractions of the cost of buying a A100....

Edit: Oh, and SRD is a ... network protocol designed to work hand in hand with EFA which can substantially improve the performance of HPC MPI workloads, running on EC2, during network bound phases.


Nice write up man. Thanks for sharing your research


How many tokens a second?


The other way around - seconds per token - is whole-number math. I added the 3-node output from the 13B model to GitHub; the timings are below. The 3-node 65B job hasn't finished yet.

    llama_print_timings:        load time =  17766.29 ms
    llama_print_timings:      sample time =    264.42 ms /  128 runs   (   2.07 ms per token,  484.07 tokens per second)
    llama_print_timings: prompt eval time =  10146.71 ms /    8 tokens (1268.34 ms per token,    0.79 tokens per second)
    llama_print_timings:        eval time = 287157.12 ms /  127 runs   (2261.08 ms per token,    0.44 tokens per second)
    llama_print_timings:       total time = 297598.22 ms


This is very interesting and actually in the usable realm, for some use cases


My networking setup is not optimal, but it was quite surprising how easy it was to get it all to work.


In my best estimation, Finbarr makes pretty great content, he and I have had a number of positive interactions on Twitter. I tend to have a pretty grumpy disposition towards a lot of modern ML and such as I feel it's shovelware, but whenever Finbarr puts out work, I tend to set aside some time to give it a good gander, as I feel like it's generally pretty "meaty" (which I honestly find pretty hard to do past a certain pace). Well worth the subscribe if you have not done so already (I'm not affiliated with him, I just really like his work!).


It is worth mentioning that running inference on modern CPUs that have AVX2 is not that bad. Sure it is slower than on the GPU, but you get the benefit of having a single long continuous region of RAM.

But there is one huge problem why this is not that popular on x86_64. Having to run in fp32. As far as I know our most common ml libraries (pytorch, tf, onnx etc) do not have an option to quantize to 4 bits and they don't have an option to run inference at anything other than fp32 on the x86_64 cpus.

It is a huge shame. There is openvino which supports int8, but if you can't easily quantize large models without a gpu, what use is it? (For small models I suppose).

So if anyone figured out a way to quantize a transformer model to 4/8 bit and run it on the x86_64 cpu platform I'm very interested in hearing about it.


Subjective experience: AVX512 helps a lot. I would have liked to read more about this. It seems that AVX512 supports fp16 in hardware and allows 32 fused multiplication-add operations per core. So I imagine on a Ryzen 9 with 12 cores you can have 384 simultaneous fused multiplication-add operations. I am not sure whether my estimation is off. Anyone know more than me?



Sorry, but the topic of this post, llama.cpp, runs quantized 4/8 bit models just fine on x86_64 with AVX2, or am I missing some requirement you have?
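For what it's worth, a minimal sketch of running a pre-quantized 4-bit ggml file on a plain x86_64 CPU via the llama-cpp-python bindings mentioned elsewhere in the thread (the filename is just an example of the usual naming from TheBloke's uploads):

    # pip install llama-cpp-python   (builds llama.cpp with AVX2 on x86_64 by default)
    from llama_cpp import Llama

    llm = Llama(model_path="llama-2-7b.ggmlv3.q4_K_M.bin")   # any 4/8-bit ggml file
    out = llm("Q: What does quantization trade away? A:", max_tokens=48)
    print(out["choices"][0]["text"])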


Wait, I wasn't aware llama.cpp even runs on x86_64. I thought it was ARM hardware only. If what you say is correct that indeed is very interesting. Especially if I can extend it to other models like Falcon.


It doesn't support Falcon right now, but there's a fork that does (https://github.com/cmp-nct/ggllm.cpp/).


Bah. We still haven't equaled the rude and hateful AI achieved in a microcomputer in 1981. <https://scp-wiki.wikidot.com/scp-079>


We can keep reaching for that rainbow.


A good analogy: as you approach, the rainbow moves off. Others see you in it, but you can confirm it's somewhere else. It's an effect, a side effect; it's pretty and we value it. There's no pot of gold in the literal sense; its value is ephemeral, realized in other products.


I enjoyed this article, but it seems to me that the latency numbers should have units of nanoseconds or maybe CPU cycles. I feel like the article was a bit sloppy with units.

Another question that occurs to me is: why do chipmakers even bother putting so many functional units on the chip if almost all workloads are memory bound? Based on the calculations in this article, you could decrease the number of teraflops a modern GPU can perform by a factor of 2 and not notice any appreciable difference in ML performance.


1. I think nanosecond-scale latency numbers on operations taking dozens to hundreds of ms are probably overkill?

2. Inference is only one aspect of what GPUs are used for. Many other workloads are compute-bound. That being said, given the recent rise of these kinds of open-source, pre-trained large language models, I wouldn't be surprised if future Nvidia product launches offered variants with significantly more VRAM. There would probably be a lot of interest in "3080-equivalent compute, but 48GB VRAM" these days — certainly I would take one over a 4090 with 24GB VRAM. (Then again, that's basically an A6000, and those go for nearly $7k...)


Yeah, Nvidia won't do big VRAM consumer cards until AMD forces them to. They're running flat out just trying to keep up with demand for H100s at forty thousand USD each.


Or Intel! They're not making any high-end cards currently, but the A770 16GB is pretty decent if you're on a budget. The software support isn't really amazing yet, but their GPUs have some quite decent matmul acceleration.

Unfortunately support for their GPUs is not upstream in Tensorflow or Pytorch, and I don't think they're well-supported by Llama.cpp either, but Intel does look quite promising. I also believe that Intel oneMKL works on Arc GPUs these days, which in CPU land is amongst the fastest BLAS libraries out there. Also their matrix hardware is accessible using OpenCL extensions, which means that rolling a custom kernel for things related to quantization should be quite possible.

(For Tensorflow and PyTorch, you need to install a custom package called Intel Extension for $FRAMEWORK. The PyTorch one got updated to PyTorch 2.0 recently, which is promising.)

Currently rumors seem to indicate that their second generation GPUs will go from 32 to 64 Xe cores for the top model, and keep the 256 bit bus. If Intel were to double the VRAM to 32 GB as well (at least as an option, just like the A770 comes in both 8GB and 16GB variants), it'd immediately make them the crown of consumer VRAM size, which I'd wager would drive a lot of interest from the ML community.


Compromise on microseconds


The primary use case for GPUs in devices, graphics, is not as often memory bound. Even for other general data compute purposes that may not always be the case. It's specifically neural nets that have extremely wide batches of extremely simple operations occurring across extremely large chunks of memory where being memory bound is the default case.


Will be interesting to see what people can do with local models, particularly for open source programming tools and PCG models for video games.


They'll probably write fanfic and some Harry Potter ships.


Great article. Don't see content like this anywhere else outside of HN.


You can find content like this on Twitter if you follow the right people. In fact I read this article before it was even posted here because @karpathy tweeted about it.


[flagged]


I’ve never been a regular Twitter user, and don't really enjoy the platform, but this comment of yours is either abusing the word “shameful” or betraying a major lack of understanding that you can’t expect other people to care deeply about the things you care deeply about.

There’s a lot of shameful things in this world, GP using twitter isn’t one of them. Not even a “little.”


It's betraying the word shameful in that it's an utter understatement.

If people don't care about supporting companies that enable and spread far-right content and groups, it's they who are a problem.

To say nothing of the pure disregard of the human right to privacy (and with AI now IP as well) that is forced on the rest of the world by the dominance of the US market.


Apologies, but I genuinely find the intensity of this comment delusional. (1) Doesn't this perspective apply to any social media platform, for example Facebook Groups? And (2) RE: pure disregard of the human right to privacy: doesn't this implicate pretty much all popular digital platforms?



There are a lot of oil producers I consider shameful, but I'm still going to buy gasoline every few weeks.


It's both you and right wing people who are the problem.

You're both puppets in the hands of the powerful who wants us divided and weak.

Don't trust authority. Don't trust anyone. In the past left wing people cared and fought for freedom; today it is the right wing fighting for freedom.

It's all irrelevant anyway, governments keep growing stronger and stronger during right or left governments. And soon there is not going to be anywhere to run to.


Shameful? Just because you dislike it doesn’t mean everyone has to.


Says you. I'd prefer to take advantage of the resources available there. Your silly moralism is easy to ignore


Mastodon is a good way to get at this content in unfiltered form. Take a look at the raw feed of the sigmoid.social instance or follow the #LLM hashtag.

If you want it more predigested and summarized I can recommend the AI Explained channel on youtube.


Does anyone know what the next breakthroughs will be and their rough timelines regarding locally run models? For instance will anything like chat gpt 4 be runnable on an M1 Mac within the next year?


"Breakthroughs" are inherently hard to predict. This field is advancing at a very rapid pace. A lot of the improvements are incremental, but even these are coming very fast, and moving in different directions at once.

I don't think there's any likelihood of replicating GPT-4 on your M1 in the next twelve months, especially not if you're expecting responsive performance. What we could see are a plethora of models dedicated to doing particular tasks. Say, specialists in aspects of programming or accounting or law or medicine. Or general knowledge models with access to a local cache of Wikipedia. Individually, none of these models would have to come close to GPT-4's overall level of power and flexibility. But collectively, they could reach that level of utility.


Memory bound token generation is a limitation of transformer decoders.

In the past, hardware has motivated algorithm innovations.

I’m curious how long it will take until we see more hardware friendly models.


The RWKV family of models qualifies, since it computes like a recurrent network at inference time.


Given the massive imbalance in the memory bandwidth bottleneck, I wonder why specialized hardware is the way it is. Is there some use case in which processing is the bottleneck, or at least it's more even? Are we expecting some software paradigm shift which will change the balance? Why couldn't they just make a cheaper, more rounded card which isn't heavily underutilized because of a large bottleneck?


See: https://en.m.wikipedia.org/wiki/Random-access_memory#Memory_...

But llama.cpp, and llms in general, are very atypically memory bound, even for AI workloads. Other models/backends will typically make more use of compute and cache.

For llms specifically, cloud hosts will batch requests to make better use of GPUs.

But you are not far off. There is a proposition to pipe chips with very fast memory (aka no external memory at all) together: https://www.nextplatform.com/2023/07/12/microsofts-chiplet-c...


IMHO the current bottleneck is driven by the partial overlap between the gaming GPU market and machine learning needs. The expensive things that are needed by both (i.e. parallel calculations) are included in all cards, but expensive things that aren't needed by games (i.e. memory bandwidth) become a bottleneck unless you use expensive ML-specific niche hardware instead of consumer (gamer) GPUs.


I'm shocked - before I saw it working myself, I didn't believe an LLM this large (and smart) could run on a desktop CPU.

- Core i7, 4 cores, 3.3GHz, 64GB DDR4-2400: 70B model, ~0.5 tokens/s; 30B model, ~1 token/s.
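Those numbers are roughly what the memory-bandwidth argument in the article predicts. A rough sanity check (dual-channel DDR4-2400 assumed; the bytes-per-parameter figure is a rough guess for ~4-bit quants):

    bandwidth_gbs = 2 * 8 * 2400e6 / 1e9   # dual channel, 8 bytes wide: ~38.4 GB/s
    gb_70b = 70e9 * 0.55 / 1e9             # ~4-bit quant incl. overhead: ~38 GB
    gb_30b = 30e9 * 0.55 / 1e9             # ~16.5 GB
    # each generated token streams the whole model through the CPU once,
    # so memory bandwidth sets an upper bound on tokens/s
    print(bandwidth_gbs / gb_70b)          # ~1.0 token/s ceiling
    print(bandwidth_gbs / gb_30b)          # ~2.3 tokens/s ceiling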


What I find more stunning is what this implies going forward. If tech advances as it tends to do then having a 200bn model fit into consumer hardware isn't that far away.

Might not be AGI but I think cliched as it is that would "change everything". If not at 200 then 400 or whatever. Doesn't matter - the direction of travel seems certain.


Basically Ray Kurzweil's argument; for decades now he's been saying that $1000 worth of compute will be able to match human performance around 2029.


First, there has to be something capable of matching human performance at a much higher cost. This is still just spicy autocomplete.


Humans just do spicy autocomplete too.


Really? I can guess at what spicy autocomplete might actually mean but I doubt a LLM ... OK ChatGPT did a pretty good job of it (I've just asked it), whilst sidestepping the definition of spicy in this context. It is after all a very good next word guesser, given a context, and its not ... me! I am capable of hallucinating but it was 30 odd years ago since I hunted out certain fungi on Dartmoor, or smoked hemp.

To be fair, we humans do often interrupt each other to second guess a sentence completion. Done correctly it is a brief satisfying collaboration. Done wrong ... I've been married for 18 years and know when to bite my tongue, but I still get it wrong from time to time - sometimes deliberately. Despite that, me and the wiff can autocomplete each other's sentences with uncanny accuracy and end up with perfect harmony or a cough slight disagreement as a result.

We are getting some phenomenal slide rules these days but the Daleks are not going to be flying up the stairwell just yet, nor will SkyNet be taking over tomorrow.

That said, you just know that some noddy is trying to sell a nuclear "deterrent" LLM AI thingie somewhere. Thankfully, production military equipment takes quite a while to get to deployment. There is a good chance that we will get to grips with all this stuff before SkyNet is let loose for real 8)


No, a human isn't born with a set of knowledge like a freshly trained LLM, keeping the model fixed and responding to input. The analog to the model changes based on the human's experience. Just making bigger and bigger LLMs won't give you this.


Humans are born with millions of years of evolutionary training embedded in their dna and brain. We are not born with nothing.


Yeah but chimpanzees and cats and mice have all the same million-year-old stuff that we do.

The million-year-old stuff is not what makes humans interesting.


I have been thinking about this lately. Things that were developed recently, like language skills, are easiest to replicate by machines. But things that took a long time to develop, like walking and grabbing, are still something that machines struggle with.


It's what makes humans possible. "Necessary but not sufficient" is the phrase that pays.


Ah, but I believe you forget the implicit biases of genetic programming. Instincts in my experience are the skeleton, and in a sense the default basis functions for the structure of how we live, see, do, and learn.


No, I don't forget that. There's obviously a starting point, behaviors and abilities that newborns already have. The point is that the model is not static.


So a human is different because it keeps training its neural network?


Whoa, imagine you get a good base LLM model and save all conversations with it. Run a batch process every night to fine-tune a LoRA on the convo dataset. If I ever came across such a chat bot I'd probably freak out as to why it remembers things outside of the context window, without summarisation.
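A rough sketch of that idea with the Hugging Face stack (peft + transformers); the checkpoint name, file name and hyperparameters are just placeholders:

    # nightly LoRA fine-tune on saved conversations (sketch, not a tuned recipe)
    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                              TrainingArguments, DataCollatorForLanguageModeling)
    from peft import LoraConfig, get_peft_model, TaskType

    base = "meta-llama/Llama-2-7b-hf"        # any causal LM checkpoint works here
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # wrap the base model with a small LoRA adapter instead of touching its weights
    model = get_peft_model(model, LoraConfig(
        task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05))

    # yesterday's conversations, one exchange per line (made-up log file)
    lines = [l for l in open("convo_log.txt") if l.strip()]
    ds = Dataset.from_dict({"text": lines}).map(
        lambda x: tok(x["text"], truncation=True, max_length=512), batched=True)

    Trainer(
        model=model,
        args=TrainingArguments("nightly-lora", num_train_epochs=1,
                               per_device_train_batch_size=1, learning_rate=1e-4),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()
    model.save_pretrained("nightly-lora-adapter")  # load on top of the base model tomorrow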


That's a pretty neat idea, I would be surprised if no one is already working on that.


I did it years ago on a lark with a seq2seq model in a matrix chat room.


How did it perform? Was it well received by members?


Poorly! It was a small seq2seq and was gibberish to start with. Although it did tell my friend that it loved him which was nice.


one reason why a human is different: just based on word count alone, most LLM's are trained on 3-5 orders of magnitude more input.

could be a difference that makes no difference, or ...


Bit of an unfair comparison when humans also have a bunch of senses that LLMs don't have. They might be trained on orders of magnitude more words, but more data? Doubtful.


That's the key. I'm reminded of the Helen Keller story. She was completely blind and deaf. Her teacher spent a very long time signing into her hand. It took a very long time before she realized that the sign for "water" designated the thing she could feel flowing onto her hand; before that breakthrough the signs were meaningless to her. An LLM only knows the structure of language. It doesn't know that there is an external physical world that the language refers to. It only can predict what words follow which other words, and which output is preferred. Without any senses (and the huge bandwidth of information provided by them) an LLM is very crippled.


Crippled, yes, but I would disagree that it is fundamentally limited, or that an input stream of human language is inadequate to bootstrap "meaning", or in some way philosophically inferior to native biological senses.

It's very interesting you bring up Helen Keller because she's generally regarded as possessing the same level of sentience - and indeed intelligence - as anyone else, despite the extreme narrowness of her sensory input. It took her much longer to get going, but it's not as if she only understood concepts that directly related to touch. The experience with "water" taught her the concept of a symbol, and from there she could bootstrap everything else. LLMs already work with symbols - that is their sense.

In fact we're all a bit like Helen Keller, in the sense that if sensory input is the basis for our entire world model, then it is a very small foundation supporting an incredibly vast and intricate edifice. There is a considerable abstraction gap between concepts like "capitalism" and any direct sensory input. We all of us, all the time, manipulate concepts without thinking through what they "mean" all the way to something we can see and touch.


No they don't. You're "just" doing what everyone else in the past has done with the brain/human intelligence and using the latest technology as a metaphor without realizing it.


We want to think we’re exceptional but all we can do is say “human consciousness is special” without having any way of measuring it or disproving the assertion that we’re just really fancy pattern matchers.

Take any metaphor you want, it’s the same outcome: we may all be philosophical zombies.


We may "just" be neural networks that run on meat instead of silicon, but that does not mean that we're LLMs.


Why doesn’t it?


It's a formal logical error. One does not follow from the other without affirming the consequent.


Because not all neural networks are LLMs.

A GAN is a neural network, does that make it an LLM?


We have inputs other than words, for a start


I’m conscious, maybe you’re not? Not that I really believe that. I think you probably see colors and hear sounds, even in your dreams! But engineering types tend to be persuaded by a particular view of the world, failing to understand that it is a view and not nature itself.


Maybe you do. Luckily, not everyone is quite so simple.


monkeys make monkeys accidentally.

monkeys make meseeks on purpose.

there is a difference, but will it be fun?


I am 100% invested in how much ridiculous fun this era is going to be. Right up until the moment when it becomes a horror.


The irony in your statement is immense. Yes, Kurzweil has been saying this for decades. No, it doesn't mean AGI is close. These LLMs do nothing to advance AGI. There is no theoretical basis for the belief in emergent intelligence from statistical language models, and the answers are amazingly good, highly unreliable, and parrot meaning at best. There is no induction, no introspection, and no understanding of the deep semantic meaning of the language presented. There's no intelligence.


The lack of a concept of "knowledge" is a big one for me - if that's an emergent thing it hasn't even shown hints of it yet. This to me seems a pretty hard line right now, as it limits their capability at things even inexperienced humans can do - namely deciding whether they actually know something, and identifying when they don't know something and attempting to fix that - e.g. asking for clarification on vague inputs, or deciding whether something is actually truth or fiction.

That then ties into another limitation right now - after training, the model is pretty static, so it cannot learn and has no state outside its context buffer. This could just be another point where a few orders of magnitude more computing power can "fix" it, doing whole training steps between each input to actually incorporate new information and corrections into itself instead of relying on a fixed-size small context.

But I'm not deep enough into things to say if they're fundamental issues, or current techniques will start displaying those effects as emergent characteristics as the complexity and training increases. There's been a few other examples when "known" techniques start to show unexpected characteristics down the line as they are scaled up, so can't really say for sure they'll /never/ be shown, just that the current examples don't seem to show even the beginnings of that sort of thing.


Why do you say they do nothing to advance AGI? Do you know what it takes to advance AGI? It's hard to state that without knowing how AGI would work yourself.

LLMs would have been considered magic just a couple years ago. Sure, they're not AGI, but they behave just like one for certain workloads. I find it hard to believe we're not a bit closer now - or maybe even a lot closer.


AGI should have morals, opinions, self-reflection, learn continuously from sensor data, reason, realize when they’re proven wrong and update their model of the world, and be creative. So far LLMs exhibit none of those. But LLMs exhibit a digestible distillation of a very large body of data which may be a component of an AGI.

But you can have an AGI that doesn’t have encyclopedic knowledge but it’s still highly intelligent, so I don’t think LLMs have to be an intrinsic component.


That is not what AGI means. AGI = Artificial General Intelligence.

1. Artificial = we made it

2. General = it can solve problems in any field

3. Intelligence = the ability to solve problems

A chess engine is a very strong Artificial Intelligence. But it’s not very General, it can only evaluate chess positions.

GPT-4 is very General, you can ask it about any question and get a somewhat reasonable answer. But it’s not very intelligent, often the answer is wrong.

You’re talking about an Artificial Human. That’s a different problem. Intelligence is not species dependent. Dolphins are intelligent (a bit), aliens can be intelligent and have zero emotions or conception of self. There’s certainly plenty of amoral intelligent serial killers.


> That’s a different problem.

That's news to me. AGI (or strong AI) is typically defined as "human-level intelligence", or "perform any task that a human or animal can." Humans and animals often perform tasks that are critically reliant on being conscious, emoting, reading body language, reasoning, etc.

Not only that but prominent thinkers who have carved out the notion of AGI (or Strong AI) tend to have consciousness, mental states, and emotions at the core of it.

I think what you're talking about is a multi-task AI, not an AGI.


We don't have a good computer model of dolphin intelligence, and the LLMs are not even remotely close to consciousness, or to dolphins, dogs, or parrots on the intelligence front.


Consciousness ≠ intelligence.

Consciousness: being “awake” and perceiving the world.

Intelligence: solving problems, finding the truth.

Consciousness is perceiving the world, whereas intelligence is understanding it.


GPT-4 solves only 1 problem: what is the most likely stream of tokens to follow what we already have.

It is remarkably good at this. But there's absolutely zero reason to believe it can solve any other problem at all.


How do humans write if not by intuiting what word comes after another?

Intelligence is the ability of that next word decision procedure to determine a next word that is aligned with our human intuition and model of truth.

I believe what you’re getting at is modality, that GPT-4 only provides responses in text. You can’t ask it to drive a car, or paint like Dall-e. And that’s a fair criticism, but it’s mostly just because it would make the models too large and slow, not because we don’t know how to do it. The thing we don’t know how to do is make a model reason as well as a human, and it makes sense to try to solve that in the text domain first rather than making highly multimodal models that reason poorly in all domains.


> How do humans write if not by intuiting what word comes after another?

We don't know. It may turn out that we use mechanisms similar to LLMs, or it might be something entirely different.

As for the rest: nobody knows how to make ChatGPT butter a piece of toast, let alone drive a car.

ChatGPT does not reason about text, either.


>nobody knows how to make ChatGPT butter a piece of toast

There is plenty of research on LLMs successfully piloting robots.

It's by no means a solved problem but "nobody knows how" is a stretch.

https://tidybot.cs.princeton.edu/
https://innermonologue.github.io/

>ChatGPT does not reason about text, either.

It does and there's plenty of output to demonstrate that.



That's an interface, not an implementation.

("The most likely" out of what distribution? The model's distribution. So that just means "what the model thinks the answer to your question is".)


> the answers are amazingly good, highly unreliable and parrot meaning at best. There is no inductance, and no inteospection and no understanding of the deep semantic meaning of the language presented. There's no intelligence.

Can say the same about a half of population, tbh


So what you’re saying is.... we’re going to have human-level AI, and it’s going to be incredibly stupid


BAAGI (Below Average Artificial General Intelligence).


Or more simply not AGI but AGS


His prediction was that one human brain's worth of computing power could be acquired for $1000 by 2029. That still seems reasonable.

That's not the same as AGI or the singularity.


You have a metric for a human brain's worth of computing power which hasn't already been exceeded? I can't do infinite-precision arithmetic or the RSA algorithm in my head, or sort a billion strings into lexical order.

But I am human, I am conscious, and no VLSI work or algorithmic model in sight will lead to AGI or human-equivalent computing power by 2029, let alone for $1000.


Well, you could just be hallucinating your own consciousness. By 2029 it seems not unreasonable to expect that the most sophisticated models will carry out visual and auditory interactions that could fool even the most sophisticated viewer. At that point, what does consciousness really mean? If the robot insisted to me that it was conscious, how could I really say no?


> it seems not unreasonable

This is where we differ.


By most measures of intelligence you could think of, language models are improving, so I don't see why you think this wouldn't lead to something at least almost human-level if you scaled it up enough.

Of course there could be some wall somewhere, but I don't see why there would be.


That's "we need a larger cowbell" thinking. It's not a theory of mind; it's wishful thinking that it will... emerge. Absent a theory, I don't think moar will make it emerge, no.


If you want theory there’s this: https://arxiv.org/abs/2001.08361 (I haven’t actually read this but I know roughly what it’s about)

It says that, so far, the abilities of LLMs have scaled with parameter count and training data size. Of course there's no way to be sure without actually training larger models, but I don't see why the point where scaling stops would be just past our current best LLMs. Many capabilities have already emerged from making models bigger, so I don't see why this would be the exception.
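For what it's worth, the headline result of that paper is that test loss falls off roughly as a power law in parameter count (and similarly in data and compute). Schematically, with constants invented here just to show the shape of the claim, not the paper's fitted values:

  # Schematic power-law scaling of loss with parameter count, in the spirit
  # of Kaplan et al. 2020. The constants are placeholders for illustration.
  def loss(n_params, n_c=1e13, alpha=0.08):
      return (n_c / n_params) ** alpha

  for n in (1e9, 1e10, 1e11, 1e12):
      print(f"{n:.0e} params -> loss ~ {loss(n):.3f}")  # steadily decreasing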


Because you need more training data for better results and they are running out of new training data.


I don't think so.

It may be true that new data is coming in at a trickle these days, with things like Discord, Slack, et al. locking conversations and context away, and the daily volume of chatter being small relative to what is already out there.

The fact is that training data can be used in many different ways, and I bet we'll see the products of that fairly quickly, as people who see this the same way I do reach the point where they want to show, tell, and test.


>The fact is that training data can be used in many different ways, and I bet we'll see the products of that fairly quickly, as people who see this the same way I do reach the point where they want to show, tell, and test.

Sounds like wishful thinking to overcome the limitations of LLMs.

At the same time, more and more of the text out there is generated by LLMs, so it gets harder to find genuinely human-written text.


That’s true for LLMs but not necessarily for reinforcement learning


A 200B 4-bit quantized model could potentially fit into 128 GB of RAM. Inference would just be really slow.

I.e. you could technically run something like that today.

I don't think more VRAM on GPUs is necessarily a technical limitation either. GPU manufacturers could add a lot more VRAM to their cards if they wanted to; the question is whether it would be worth the price increase.
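Rough numbers, treating 4 bits as exactly half a byte per weight and ignoring the small per-block scale overhead real quantization formats add; the overhead figure below is an assumption, not a measurement:

  # Back-of-the-envelope memory estimate for a 4-bit quantized 200B model.
  params = 200e9           # 200B parameters
  bits_per_weight = 4
  overhead_gb = 10         # rough allowance for KV cache, activations, etc. (assumed)

  weights_gb = params * bits_per_weight / 8 / 1e9
  print(f"weights:       {weights_gb:.0f} GB")                # ~100 GB
  print(f"with overhead: {weights_gb + overhead_gb:.0f} GB")  # ~110 GB, fits in 128 GB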


> I.e. you could technically run something like that today.

Yep, on higher-end machines it should already be feasible. I can do 2.5-3 tok/sec on a 70B model quantized at 4 bits today with my MacBook Pro M2 Max w/96GB. It's a little slower than a 30B, but the difference is less than I would have guessed. That's not super fast, but it's usable.

And that's on a machine that isn't designed for this workload. Over the next few years things should improve quite a bit. 200B does not seem like a reach.


About the RAM: I doubt they want to do that, since a GPU's basic function is to render a frame in as few milliseconds as possible. Currently VRAM is latency-optimized on consumer GPUs, and all the memory chips sit within an inch of the GPU; light only travels so far in the gigahertz realm. That's why they started mounting VRAM chips on both sides of the board, because there was no room left on the first side.

Just checked: light travels 30 cm in one nanosecond. So if the GPU is running at 4 GHz, light covers only 7.5 cm per clock cycle.


VRAM is not latency-optimized; it has worse latency than your CPU RAM. The reason it's mounted closer is signal integrity at higher frequencies, not latency.


Interesting. Where can I read more about that?


Sorry, I can't provide any resources right now. If you search a bit I'm sure you'll find some latency comparisons between DDR and GDDR.

But basically, GPU memory (GDDR5/6/6X/etc.) is optimized for bandwidth (GPUs need to move a lot of data and have few branches, few unknown data dependencies, and high spatial locality), while CPU memory is optimized more for latency (because of branchy code).


IMO the direction we're going seems more like having a few small models in a MoE that together are equivalent to a current 200B model.


And then things like neural implants and BCIs -- seems like your dog could have language capabilities sooner than you'd think ;)


Thank you very much for writing this out!


You’re welcome :)


>> Memory bandwidth is the limiting factor in almost everything to do with sampling from transformers.

So how about using an APU, i.e. a CPU with a GPU built in? The GPU shares the CPU's memory, so if you want you can have 128 GB of RAM and allocate 100 GB of it to the GPU.

Sure, the GPU is not fast, but if memory is what matters...


You can, right now, with the OpenCL backend.

And for the moment, it's slower than pure CPU. Optimizing for IGPs is not trivial.

MLC's Vulkan backend is actually quite good on my AMD APU, but unfortunately it won't split the model to a dGPU.


Most CPU RAM is much slower than GPU RAM. GPUs typically pack RAM 2 generations ahead with a wider bus than anything you'd find on a consumer motherboard.


For reference, DDR4-3200 in quad channel is ~100 GB/s while a 3090's VRAM is 960 GB/s. Of course, most consumers only have dual channel.

M1 Pro is 200 and M1 Max is 400. Which is slow for GPU memory, but incredible for main memory -- although I'm not sure how much of that a single core can actually pull.


The AnandTech article profiled this. IIRC the CPU cores only get access to half that bandwidth, which is still quite a lot.


The AMD Ryzen™ 9 7940HS uses DDR5-5600, which I understand works out to about 89.6 GB/s in a dual-channel setup.
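For anyone who wants to sanity-check these figures, the usual rule of thumb is transfer rate (MT/s) times 8 bytes per channel times the number of channels:

  # Peak theoretical DRAM bandwidth: MT/s * 8 bytes per transfer * channels.
  def peak_gb_per_s(mt_per_s, channels):
      return mt_per_s * 8 * channels / 1000

  print(peak_gb_per_s(3200, 4))  # DDR4-3200 quad channel: 102.4 GB/s
  print(peak_gb_per_s(3200, 2))  # DDR4-3200 dual channel: 51.2 GB/s
  print(peak_gb_per_s(5600, 2))  # DDR5-5600 dual channel: 89.6 GB/s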


You're thinking about the wrong bandwidth here. The article is talking about going from the GPU's RAM <-> GPU cores (i.e. through load/store instructions in a cuda kernel), not from CPU's RAM <-> GPU's RAM. That kind of bandwidth is still important but usually not the bottleneck on most ML workloads.


I'm still confused.

The author (correctly) made that distinction, like you said, but at the end, when talking about the Raspberry Pi 4, they use a number (~4 GB/s of memory bandwidth) from an article [1] which I can only assume is NOT about graphics memory or its bandwidth (do Raspberry Pis even have dedicated graphics memory?).

And how exactly is the bandwidth counted if I use an integrated GPU (like the one in an i5-13600K)? Or a pure CPU?

[1] https://forums.raspberrypi.com/viewtopic.php?t=281183


CPUs and GPUs both interface with their own memory, and those memories have a certain bandwidth. Generally, CPU memory has relatively little bandwidth, but relatively good latency. (For example, an i9-13900K supports memory up to around 100 GB/s, while even my previous GPU, a midrange Radeon HD 7850 from 2012, has over 150 GB/s of bandwidth).

An integrated GPU shares memory with the CPU, so at best it gets the same amount of bandwidth assuming the CPU is not using any (which is rather unlikely).

A dedicated GPU has its own private high-bandwidth memory (an RTX 4090 has a memory bandwidth of over 1000 GB/s), but to get anything in there it needs to be loaded over the PCIe bus (which has a measly 32 GB/s bandwidth for PCIe 4.0 x16).

That said, CPU memory does have one big advantage: it tends to be much larger. You can pair up to 192 GB with a regular Ryzen 7000 CPU, while consumer GPUs don't go above 24 GB of memory (RTX 4090, RX 7900 XTX). (There are bigger GPUs out there, but those are generally intended for datacenters, and if you go that route, an Epyc or Xeon CPU can also support much more memory than a plain desktop Ryzen, although you can also slot multiple GPUs into a single server.)

I believe that for LLM performance, memory bandwidth is key, because all the neural network layers need to be streamed from memory, and very little work is done with it each time, although I guess batching operations could help if you're working at scale, since each weight would be applied multiple times then.
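A rough way to see why: for each generated token, essentially every weight has to be read from memory once, so bandwidth divided by model size gives a crude upper bound on single-stream tokens per second. The numbers below are illustrative:

  # Crude upper bound on single-stream decode speed: every weight is read
  # once per token, so tok/s <= memory bandwidth / model size in bytes.
  # Real speeds come in lower (KV cache traffic, imperfect overlap, etc.).
  def max_tok_per_s(bandwidth_gb_per_s, model_gb):
      return bandwidth_gb_per_s / model_gb

  model_gb = 35  # ~70B parameters at 4 bits per weight
  print(max_tok_per_s(100, model_gb))   # ~100 GB/s desktop DDR5: ~2.9 tok/s
  print(max_tok_per_s(400, model_gb))   # M2 Max unified memory: ~11 tok/s
  print(max_tok_per_s(1000, model_gb))  # RTX 4090-class VRAM: ~29 tok/s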


Raspberry Pis do have a usable GPU, but using it for computation is not a particularly well-traveled path, which I think is a shame. The pre-4 models have a different Broadcom graphics core than the 4, and it looks like you can get useful work out of both, but they are different enough that it's a rebuild between generations.


This is basically the appeal of the Apple chips in this domain. Apple have fuck-you money, so they have a bunch of high-bandwidth, decent-latency memory soldered onto the chip.


It's LPDDR memory with wires going between the chips; the difference from other LPDDR applications is just that the wires run inside the same package.


This article would probably be useful for a lot more people if it spent just a couple of sentences introducing the various parameters, rather than just throwing variable names at the reader. Interestingly, a whole paragraph is spent on explaining n_bytes.


80/20 rule. Approximations have smaller, less accurate approximations that will do in many contexts. If your goal is to reach America, a crude compass and reckoning from your speed will work. If you want to target an ICBM, you need much better positional accuracy.


Fantastic article - I would love to see an analysis comparing larger quantized models to smaller unquantized models, e.g. is a 14B quantized model better than a 7B unquantized model?


Basically, even the most aggressive quants of a larger model are always better than the unquantized version of a smaller model: https://github.com/ggerganov/llama.cpp/pull/1684

Tim Dettmers did a bunch of research on this last year as well: https://arxiv.org/abs/2212.09720
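If it helps to see what 4-bit quantization actually does to the weights, here's a toy round-to-nearest, per-block scheme. It's not llama.cpp's exact q4 format (those use specific block sizes and several variants), just the general idea:

  # Toy symmetric 4-bit round-to-nearest quantization, one scale per block.
  import numpy as np

  def quantize_q4(weights, block_size=32):
      w = weights.reshape(-1, block_size)
      scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map each block to roughly [-7, 7]
      q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
      return q, scale

  def dequantize(q, scale):
      return (q.astype(np.float32) * scale).reshape(-1)

  w = np.random.randn(1024).astype(np.float32)
  q, s = quantize_q4(w)
  err = np.abs(w - dequantize(q, s)).mean()
  print(f"mean abs error: {err:.4f}")  # small but nonzero: the accuracy cost of quantizing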


Are there benchmark figures for CPU-based setups somewhere? E.g. a 192-core / 24-channel 2P Zen 4 box. Or is the CPU side too untuned to be interesting?



And BTW this analysis holds true for many workloads. Memory bandwidth is often at the core of performance.


Is this single-thread? Or are they putting all available CPUs on the problem?


It's complicated.

If you don't have a GPU, prompt ingestion is fully multithreaded. The more cores the better.

For generating tokens, more cores help up to a point, but then:

- You start saturating the memory bus, and performance plateaus.

- There is some overhead from the threading implementation, and too many threads hurts performance.

The ideal number of threads varies per CPU. For instance, using hyperthreaded cores or Apple/Intel E-cores typically hurts performance... but not always. You just have to test and see.
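If you want to find the sweet spot empirically, a sweep like the one below works. It assumes llama.cpp's ./main binary from around this period (-m, -t, -n, -p flags) and a hypothetical GGML model path; adjust for your build:

  # Sweep llama.cpp thread counts to find the fastest setting on a given CPU.
  import subprocess, time

  MODEL = "llama-2-13b.ggmlv3.q4_0.bin"  # hypothetical path, substitute your own
  for threads in (4, 6, 8, 12, 16):
      start = time.time()
      subprocess.run(
          ["./main", "-m", MODEL, "-t", str(threads), "-n", "64", "-p", "Hello"],
          stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True,
      )
      print(f"{threads:>2} threads: {time.time() - start:.1f} s for 64 tokens")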


The problem is memory bandwidth rather than CPU cores: "Memory bandwidth is the limiting factor in almost everything to do with sampling from transformers. Anything that reduces the memory requirements for these models makes them much easier to serve"


It's multithreaded.


Is there a Llama2 65b quantized version for Mac M2?


I'd look for "llama2 thebloke 70b GGML" on Hugging Face.

Update, this might work: https://huggingface.co/TheBloke/Llama-2-70B-GGML

Example command mentions the 4bit variant:

./main -m llama-2-70b.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "Llamas are"


I also found this link ( https://blog.lastmileai.dev/run-llama-2-locally-in-7-lines-a...) but haven't verified whether it helps. Thanks for your reply, I'm taking a look now.


Just checked on an M1 with 32 GB of RAM: 7B and 13B work like a charm. 70B doesn't fit, even the 2-bit version.


Did anyone notice that, apparently, an M2 MacBook Pro is only 16x faster than a Pixel 5? Not sure that makes sense.


I'm not an expert in this, but my "does that feel wrong" sense isn't going off.

The Pixel 5 is 2-3 years old. CPUs aren't doubling in speed every 2 years, but let's very generously say we expect current designs to be 4x faster than a 2-3 year old equivalent.

Apple silicon is faster than other ARM chips, so if we imagine that's another 2x, we're up to 8x.

Common wisdom is that "real computers" are faster than phones, but the difference between the A16 and M2 is less than 2x for multi-core, and much, much less for single-core benchmarks. Rounding up, another 2x brings us to 16x.

Maybe there are characteristics of the two devices which make this more surprising, I’d be interested to learn more.


The phone slows down as it heats up, and then slows down infinitely when the battery dies.

(With some workloads and chargers, it's possible running on a phone would drain the battery faster than it can recharge even if you plugged it in.)


[flagged]


I'm feeling that a sense of very wry irony will approach us soon...



