Model card for base: https://huggingface.co/databricks/dbrx-base

> The model requires ~264GB of RAM

I'm wondering when everyone will transition from tracking parameter count vs evaluation metric to (total GPU RAM + total CPU RAM) vs evaluation metric.

For example, a 7B parameter model using float32s will almost certainly outperform a 7B model using float4s.

Additionally, all the examples of quantizing recently released superior models to fit on one GPU don't mean the quantized model is a "win." The quantized model is a different model; you need to rerun the metrics.
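To put rough numbers on that (back-of-the-envelope only: the 132B figure is DBRX's stated total parameter count, the rest is just arithmetic):

    # Weight memory alone; ignores KV cache, activations, and runtime overhead.
    def weight_memory_gb(n_params, bits_per_param):
        return n_params * bits_per_param / 8 / 1e9

    print(weight_memory_gb(7e9, 32))    # 7B at float32 -> 28.0 GB
    print(weight_memory_gb(7e9, 4))     # 7B at 4-bit   -> 3.5 GB
    print(weight_memory_gb(132e9, 16))  # 132B at bf16  -> 264.0 GB (the model card figure)
    print(weight_memory_gb(132e9, 4))   # 132B at 4-bit -> 66.0 GB, before overhead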




Looks like someone has got DBRX running on an M2 Ultra already: https://x.com/awnihannun/status/1773024954667184196?s=20


I find calling 500 tokens "running" a stretch.

Cool to play with for a few tests, but I can't imagine using it for anything.


I can run a certain 120B on my M3 Max with 128GB of memory. However, I found that while Q5 "fits", it was extremely slow. The story was different with Q4, though, which ran just fine at around ~3.5-4 t/s.

Now, this model is ~134B, right? It could be bog slow, but on the other hand it's a MoE, so there's a chance it could have satisfactory results.


From the article, it should have the speed of a ~36B model.


And it appears to fit in ~80 GB of RAM via quantisation.


So that would be runnable on an MBP with an M2 Max, but the context window must be quite small; I don't really find anything under about 4096 that useful.


Can't wait to try this on my MacBook. I'm also just amazed at how wasteful Grok appears to be!


That's a tricky number. Does it run on an 80GB GPU? Does it auto-shave some parameters to fit in 79.99GB, like any artificially "intelligent" piece of code would do, or does it give up like an unintelligent piece of code?


Are you aware of how Macs present memory? Their 'unified' memory approach means you could run an 80GB model on a 128GB machine.

There's no concept of 'dedicated GPU memory' as on conventional amd64 machines.


What?

Are you asking if the framework automatically quantizes/prunes the model on the fly?

Or are you suggesting the LLM itself should realize it's too big to run, and prune/quantize itself? Your references to "intelligent" almost lead me to the conclusion that you think the LLM should prune itself. Not only is this a chicken-and-egg problem, but LLMs are statistical models; they aren't inherently self-bootstrapping.


I realize that, but I do think it's doable to bootstrap it on a cluster and have it teach itself to self-prune, and I'm surprised nobody is actively working on this.

I hate software that complains (about dependencies, resources) when you try to run it, and I think that should be one of the first use cases for LLMs: getting to L5 autonomous software installation and execution.


Make your dreams a reality!


Worst is software that doesn't complain but fails silently.


The LLM itself should realize it's too big and only put the important parts on the GPU. If you're asking questions about literature, there's no need to have all the params on the GPU; just tell it to put only the ones for literature on there.


That's great, but it did not really write the program that the human asked it to do. :)


That's because it's the base model, not the instruct tuned one.


> a 7B parameter model using float32s will almost certainly outperform a 7B model using float4s

Q5 quantization performs almost on par with base models. Obviously there's some loss there, but this indicates that there's still a lot of compression that we're missing.


I'm still amazed that quantization works at all, coming out as a mild degradation in quality rather than radical dysfunction. Not that I've thought it through that much. Does quantization work with most neural networks?


> Does quantization work with most neural networks?

Yes. It works pretty well for CNN-based vision models. Or rather, I'd claim it works even better: with post-training quantization you can make most models work with minimal precision loss entirely in int8 (fixed point), that is, computation is over int8/int32 with no floating point at all, instead of the weight-only approach discussed here.

If you do QAT, something down to 2-bit weights and 4-bit activations would work.

People weren't interested in weight-only quantization back then because CNNs are in general "denser", i.e. the bottleneck was compute, not memory.
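For anyone curious what that workflow looks like, here's a minimal sketch of eager-mode post-training static quantization in PyTorch; the tiny CNN and the random calibration batch are just stand-ins:

    import torch
    import torch.nn as nn
    from torch.ao.quantization import (QuantStub, DeQuantStub,
                                       get_default_qconfig, prepare, convert)

    class TinyCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = QuantStub()      # float -> int8 at the model boundary
            self.conv = nn.Conv2d(3, 8, 3)
            self.relu = nn.ReLU()
            self.dequant = DeQuantStub()  # int8 -> float on the way out

        def forward(self, x):
            return self.dequant(self.relu(self.conv(self.quant(x))))

    model = TinyCNN().eval()
    model.qconfig = get_default_qconfig("fbgemm")  # x86 backend
    prepared = prepare(model)                      # insert observers

    # Calibrate with a few representative batches so the observers can pick scales.
    with torch.no_grad():
        prepared(torch.randn(8, 3, 32, 32))

    int8_model = convert(prepared)  # int8 weights/activations, int32 accumulators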


thanks!


Intuitively, the output space is much smaller than the latent space. So during training you need the higher precision so that the latent space converges, but during inference you just need enough precision to land on the right point in the much smaller output space.
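A toy way to see that (made-up numbers, not any real model): coarsely round the weights of a random 10-way classifier head and check whether the argmax, i.e. the actual output, survives even though the individual values drift.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(10, 512)).astype(np.float32)   # "full precision" head
    x = rng.normal(size=512).astype(np.float32)

    # Crude per-row 16-level (4-bit-ish) quantization of the weights.
    w_min = W.min(axis=1, keepdims=True)
    scale = (W.max(axis=1, keepdims=True) - w_min) / 15
    W_q = np.round((W - w_min) / scale) * scale + w_min

    logits, logits_q = W @ x, W_q @ x
    print(np.abs(logits - logits_q).max())            # the latent values move noticeably...
    print(np.argmax(logits) == np.argmax(logits_q))   # ...yet the argmax usually survives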


> The model requires ~264GB of RAM

This feels as crazy as Grok. Was there a generation of models recently where we decided to just crank up the parameter count?


Cranking up the parameter count is literally how the current LLM craze got started. Hence the "large" in "large language model".


If you read their blog post, they mention it was pretrained on 12 trillion tokens of text. That is ~5x the amount of the Llama 2 training runs.

From that, it seems somewhat likely we've hit the wall on improving <X B parameter LLMs by simply scaling up the training data, which basically forces everyone to continue scaling up the parameter count if they want to keep up with SOTA.


Not recently. GPT-3 from 2020 required even more RAM; the open-source BLOOM from 2022 did too.

In my view, the main value of larger models is distillation (which we see, for instance, in how Claude Haiku matches release-day GPT-4 despite costing less than a tenth as much). Hopefully the distilled models will be easier to run.


Isn’t that pretty much the last 12 months?


I thought float4 sacrificed a negligible amount of evaluation quality for an 8x reduction in RAM?


For smaller models, the quality drop is meaningful. For larger ones like this one, the quality drop is negligible.


A free lunch? Wouldn't that be nice! Sometimes the quantization process improves the accuracy a little (probably by implicit regularization) but a model that's at or near capacity (as it should be) is necessarily hurt by throwing away most of the information. Language models often quantize well to small fixed-point types like int4, but it's not a magic wand.


I didn't suggest a free lunch, just that the 8x reduction in RAM (+ faster processing) does not result in an 8x growth in the error. Thus a quantized model will outperform a non-quantized one on an evaluation/RAM metric.


That's not a good metric.


Many applications don't want to host inference in the cloud and would ideally run things locally. Hardware constraints are clearly important.

I'd actually say it's the most important metric for most open models now: since the price per performance of closed cloud models is so competitive with open cloud models, edge inference that is competitive is a clear value add.


It's not that memory usage isn't important, it's that dividing error by memory gives you a useless number. The benefit from incremental error decrease is highly nonlinear, as with memory. Improving error by 1% matters a lot more starting from 10% error than 80%. Also a model that used no memory and got everything wrong would have the best score.


I see, and I agree with you. But I would imagine the useful metric to be "error rate below X GB of memory". We really just need memory and/or compute reported when these evaluations are performed in order to compile that. People do it for training reports, since compute and memory are implicit in the training time (people saturate the hardware and report what they're using). But for inference, no such details :\
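A minimal sketch of what that query could look like once memory were reported alongside the scores (all numbers below are made up, purely to show the shape of the metric):

    # Hypothetical (model, error rate, inference memory in GB) entries.
    results = [
        ("model-a-fp16", 0.18, 140.0),
        ("model-a-q4",   0.20,  40.0),
        ("model-b-fp16", 0.22,  14.0),
        ("model-b-q4",   0.24,   4.0),
    ]

    def best_under_budget(results, budget_gb):
        # Lowest error rate among the models that fit in the given memory budget.
        fitting = [r for r in results if r[2] <= budget_gb]
        return min(fitting, key=lambda r: r[1]) if fitting else None

    print(best_under_budget(results, 48))  # -> ('model-a-q4', 0.2, 40.0)
    print(best_under_budget(results, 16))  # -> ('model-b-fp16', 0.22, 14.0)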


But using an 8x smaller model doesn't result in an 8x growth in error either.


I find that Q6 and Q5+ are subjectively as good as the raw tensor files. The 4-bit quality reduction is very detectable, though. Of course there must be a loss of information, but perhaps there is a noise floor or something like that.


At what parameter count? It's been established that quantization has less of an effect on larger models. By the time you're at 70B, quantization to 4 bits is basically negligible.


Source? I’ve seen this anecdotally and heard it, but is there a paper you’re referencing?


I work mostly with mixtral and mistral 7b these days, but I did work with some 70b models before mistral came out, and I was not impressed with the 4 bit Llama-2 70b.


This paper finds partially disagreeing evidence: https://arxiv.org/abs/2403.17887


Good reference. I actually work on this stuff day-to-day, which is why I feel qualified to comment on it, though mostly on images rather than natural language. I'll say in my defense that work like this is why I put in a little disclaimer. It's well-known that plenty of popular models quantize/prune/sparsify well for some tasks. The authors' proposal that "current pretraining methods are not properly leveraging the parameters in the deeper layers of the network" is what I was referring to when I said the networks weren't "at capacity".


I'm more wondering when we'll have algorithms that will "do their best" given the resources they detect.

That would be what I call artificial intelligence.

Giving up because "out of memory" is not intelligence.


I suppose you could simulate dementia by loading as many of the weights as space permits and then just stopping. Then, during inference, replace the missing weights with calls to random(). I'd actually be interested in seeing the results.
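A rough sketch of what that experiment could look like in PyTorch, purely as a toy (the keep fraction and the random fill are arbitrary choices, not an established technique):

    import torch

    def load_with_amnesia(model, state_dict, keep_fraction=0.6):
        # Copy only the first keep_fraction of checkpoint tensors;
        # fill the remainder with random noise (the "missing" weights).
        params = dict(model.named_parameters())
        names = [n for n in state_dict if n in params]
        cutoff = int(len(names) * keep_fraction)
        with torch.no_grad():
            for i, name in enumerate(names):
                if i < cutoff:
                    params[name].copy_(state_dict[name])
                else:
                    params[name].copy_(torch.randn_like(params[name]))
        return model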


No, but some model-serving tools like llama.cpp do their best. It's just a matter of choosing the right serving tools. And I'm not sure LLMs couldn't optimize their own memory layout. Why not? Just let them play with this and learn. You can do pretty amazing things with evolutionary methods where the LLMs are the mutation operator: you evolve a population of solutions. (https://arxiv.org/abs/2206.08896)
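In the spirit of the linked paper, a heavily simplified evolutionary loop with an LLM as the mutation operator might look like this (llm_mutate and score are placeholders for whatever model call and fitness function you'd actually use):

    import random

    def evolve(seed_programs, llm_mutate, score, generations=10, population=20):
        # Toy loop: the LLM rewrites candidates, a fitness function keeps the best.
        pool = list(seed_programs)
        for _ in range(generations):
            parents = sorted(pool, key=score, reverse=True)[:population // 2]
            children = [llm_mutate(random.choice(parents))
                        for _ in range(population - len(parents))]
            pool = parents + children
        return max(pool, key=score)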


>Giving up because "out of memory" is not intelligence.

When people can't remember the facts/theory/formulas needed to answer some test question, or can't memorize some complicated information because it's too much, they usually give up too.

So, giving up because of "out of memory" sure sounds like intelligence to me.



