This paper details the model that's been in the wild for approximately a month now. Mixtral 8x7B is very, very good. It uses roughly 13B active parameters per token, and it ranks much, much higher than comparably sized models in, e.g., https://www.reddit.com/r/LocalLLaMA/comments/1916896/llm_com.... Ravenwolf notes that the model performs slightly better than its benchmark results would suggest, and that matches my experience. It's surprisingly good for a model of its size, and a very capable daily driver on a Mac for chat, code and other uses.
Something that has come to light since the release of the weights, and is not mentioned in this paper, is that it looks fairly likely the 8 experts were all seeded from Mistral 7B and subsequently diverged. This has generated a lot of experimentation in the local LLM community with cloning models as a way to cheaply generate experts.
It was generally assumed that training an 8x7B network would be roughly as much work as training eight separate 7B networks, but this seems not to have been true for Mistral, which is super interesting.
There's still a lot of rapid innovation happening in this space. With papers like Calm from DeepMind this week, and a lot of ad hoc experimental layer combining happening in the wild (see, e.g., Goliath-120b), I think we're likely to see some pretty interesting architectural improvements in the LLM space this year.
Calm seems to point the way to a next step after MoE, and models like Goliath seem to indicate that even a really, really lazy version of Calm (no linear layer combination, just literally alternating layers at full weights) can be very impactful. Overall I think we will see really, really strong models that are performant on consumer hardware in 2024, likely in the first half of the year.
I've had excellent results with Mixtral too - it's genuinely impressive. Only problem is that it's a relatively big model that's difficult to run with full GPU inference on consumer hardware (vs the 7b/13b models people typically use).
So far, the main consumer platforms capable of running it without 'ruining' the quality of its output (through high levels of quantization) are the newer Apple Silicon Macs with unified memory - generally >=48GB. It can apparently be done on 32 or 36GB, but there's not much headroom.
Edit: As coder543 points out, yes - you can run it without more lossy levels of quantization on multi-GPU setups providing those have enough combined vram.
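For rough intuition about why ~32GB is tight, here's a back-of-the-envelope estimate (a sketch assuming Mixtral's ~46.7B total parameters and roughly 4.5 bits per weight for a Q4_K_M-style quant; both figures are approximate):

total_params = 46.7e9        # Mixtral 8x7B total parameters (approx.)
bits_per_weight = 4.5        # Q4_K_M-style quant, roughly (assumption)
size_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.0f} GB")  # ~26 GB, before KV cache and everything else on the machine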
Mixtral works great at 3-bit quantization. It fits onto a single RTX 3090 and runs at about 50 tokens/s. The output quality is not "ruined" at all.
For the amount of money you're talking about, you could also buy two 3090s (~$750 each on eBay) and have 48GB of VRAM to run with less quantization at full speed.
M-series Macs are surprisingly flexible platforms, but they're not "the only" consumer platform that can do Mixtral.
That was my experience as well - 3-bit version is pretty good.
I also tried 2-bit version, which was disappointing.
However, there is a new 2-bit approach in the works[1] (merged yesterday) which performs surprisingly well for Mixtral 8x7B Instruct with 2.10 bits per weight (12.3 GB model size).
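As a quick sanity check, the quoted 12.3 GB is consistent with 2.10 bits per weight over Mixtral's ~46.7B total parameters:

size_gb = 46.7e9 * 2.10 / 8 / 1e9   # bits -> bytes -> GB
print(f"~{size_gb:.1f} GB")         # ~12.3 GB, matching the reported model size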
I could only run the 2-bit q2 mode on my 32GB M2 Pro. I was a little disappointed, but I look forward to trying the new approach you linked. I just use Mistral's API and a third-party hosting service for now.
After trying the various options for running locally, I have settled on just using Ollama - really convenient and easy, and the serve APIs let me use various LLMs in several different (mostly Lisp) programming languages.
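For anyone curious what "using the serve APIs" amounts to, here's a minimal sketch against Ollama's local HTTP endpoint (this assumes the default port 11434 and that you've already pulled a mixtral tag; it's Python here, but the same request works from Lisp or anything else that can POST JSON):

import json, urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "mixtral",                       # whatever tag you pulled with `ollama pull`
        "prompt": "Explain mixture-of-experts in two sentences.",
        "stream": False,                          # single JSON response instead of a stream
    }).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["response"])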
With excellent resources from Hugging Face, tool providers, etc., I hope that the user-facing interface for running LLMs is simplified even further: enter your hardware specs and get the available models filtered by what runs on your setup. Really, we are close to being there.
Off topic: I hope I don’t sound too lazy, but I am retired (in the last 12 years before retirement I managed a deep learning team at Capital One, worked for a while at Google and three other AI companies) and I only allocate about 2 hours a day to experiment with LLMs so I like to be efficient with my time.
Ollama[1] + Ollama WebUI[2] is a killer combination for offline/fully local LLMs. Takes all the pain out of getting LLMs going. Both projects are rapidly adding functionality including recent addition of multimodal support.
You should be able to run Q3 and maybe even Q4 quants with 32GB, even on the GPU, since you can raise the max RAM allocation with:
'sudo sysctl iogpu.wired_limit_mb=12345'
That is a very interesting discussion. Weird to me that the quantization code wasn’t required to be in the same PR. Ika is also already talking about a slightly higher 2.31bpw quantization, apparently.
So you don't see significantly worse performance on 3-bit quantized models compared to 4-bit? Every 7B/13B model I tried quantized gave much worse responses at 3-bit and below, whereas the difference from 4-bit to 6- or even 8-bit is more subtle.
Mixtral is a larger model, so maybe that makes it more tolerant of that level of quantization? I’ve been impressed with 3-bit Mixtral, but I haven’t done a ton of side by sides against 4-bit because I haven’t felt the need.
Fair enough. I did put 'ruining' in quotes for a reason - I haven't compared output between Q3 and the Q4_K_M that I use, but you do generally sacrifice output quality at more aggressive quantization levels.
And you're right, you can run it on a multi-GPU setup if you're so inclined.
You can also choose to run at 4-bit quantization, offloading ~27 out of 33 layers to the GPU, and that runs at about 25 tokens/s for me. I think that's about the same speed as you get out of an M1 Max running at 4 bits? Although I'm not sure about the newer M2 or M3 Max chips. Googling around, I didn't immediately see clear benchmarks for those.
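For anyone wanting to reproduce that kind of partial offload, it's a single knob in the llama.cpp Python bindings (a sketch; the model filename and layer count here are illustrative):

from llama_cpp import Llama

# offload ~27 of 33 layers to the GPU; the remainder runs on the CPU
llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # illustrative filename
    n_gpu_layers=27,
    n_ctx=4096,
)
out = llm("Q: What is a mixture of experts?\nA:", max_tokens=128)
print(out["choices"][0]["text"])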
Just as another data point, a CPU-only setup with Q5_K_M would give you roughly 4 tokens per second on a Ryzen laptop (Dell Inspiron 7415 upgraded to 64 GB of RAM).
Nice - that's still pretty solid, although on a more typical 3060 or 3070 with less VRAM available, I probably wouldn't expect numbers quite that good.
My 14" M1 Max does around 30t/s on Mixtral Q4_K_M.
Not to my knowledge. But because the unified memory doubles as VRAM for the onboard GPU, normal GPU acceleration can access the entire model even if it's 50+ GB. That's why ASi Macs are currently the holy grail for at-home inferencing, and also why projects like llama.cpp focus so much on ASi above all else, and why so many UIs release for macOS first before other operating systems. Certain Mac models offer up to 192GB of unified memory.
Yes it has, actually: https://github.com/ml-explore/mlx-examples. It's right in the main repo. NB, I haven't tried this, I'm using llama.cpp with a non-K-quant quantization on my MBP.
I have and don't consider MLX to be production ready. I've tested it on M1 Max and M1 Ultra (128GB) machines. It's completely non-deterministic in its resource consumption: sometimes it uses the GPU fully, sometimes it gets seemingly stuck while processing, sometimes the GPU throttles.
However, there's one curious thing: llama.cpp _always_ leads to GPU throttling on Apple Silicon (e.g. the M1 Max GPU will go from 1200MHz to around 700MHz), and then fully saturates it. In the rare cases where I could get MLX to stay on the GPU, it was able to keep it at the maximum clock rate. However, the unpredictable pauses and seemingly unoptimized prompt processing make it hard to pick a winner in end-to-end tokens/s.
Here's the direct link, and I can confirm that Mixtral-8x7B-v0.1 works on an M2 Ultra 128GB via MLX and is easy to set up (the longest part is just downloading the weights):
We'll have a tutorial soon (next week) on combining/composing with Reexpress to add uncertainty quantification (and to use it for semantic search). A link will be here: Tutorial 6: https://re.express/guide.html
I liked "Using Llamafile's OpenAI API endpoint" described there, using Justine Tunney's llamafiles for Mixtral, but the article's link is out of date, as the models have been replaced with newer ones: https://huggingface.co/jartine
Mixtral is good but those Ravenwolf benchmarks are meaningless. It’s like some random dude trying to reinvent MMLU without any rigor or consistency and in German. Dataset contamination is a problem, but not one that’s solved by folkloric evaluation of LLMs by people asking for tips on a subreddit.
I don't think they're meaningless; they have a few benefits:
1) He doesn't have an ax to grind / an LLM to pimp out, so he's relatively even-handed
2) He uses the same (secret) test data for each model, so his testing is resistant to cherry-picking/finetuning on tests
3) He likes weirdo role-play prompting, so he has a very good sense of the edges of refusal and alignment tuning
4) He picks up stuff well before it hits the only other fair testing I know of, the chat arena
5) I think asking stuff in German is at worst neutral, and at best useful for testing capacity in edge cases.
Practically speaking, his 'preferred' non-giant models, Nous-Capybara-34B and Mixtral, are both excellent in comparison with some of the others he looks at, and are good recommendations.
That said, I'd like to see a test suite that GPT-4 fails at, or at least struggles with. And it would save him a lot of time if he could get something automated together; it's clearly a lot of effort to hand-test all those models.
Any tests that are unfalsifiable and can't be reproduced are meaningless when it comes to gauging the performance of LLMs (and most other things). I could also post on a subreddit and say I have a secret set of German tests that may or may not exist and that I like these models, but that does nothing to advance the science of evaluating these things. If you want to evaluate human preferences, you can use chatbot arena, which can be gamed, but at least reflects more than what one guy says to be true. And this is with me agreeing that Nous-Capybara is also a good model. But don't take my word for it, because it's not worth very much!
I think we agree on almost everything about this - which is to say, probably both you and I think the Wolfram Ravenwolf tests are just about useful enough to indicate whether or not a model might be worth downloading, but certainly not enough to justify spending money, say, or planning around. So, yes, I'm with you.
I agree that better ways to evaluate models would be super, super useful, and benchmarks like MMLU and whatever's next will continue to be helpful ("real" science). And it seems like there may even be some benefit to models training to 'ace the test' more broadly, which is interesting and ties into some educational theories about teaching humans.
However, one area where open tests can't excel is this "fair" evaluation arena -- and I do think that private tests have some value there, to the extent that they can show utility and maintain trust. I don't make any claims about German sex role-play being a good or bad start for these, though.
I think there’s room for private tests, but those should probably be done by some kind of independent standards body. LLMs are an incredible advancement of machine learning research, but we diminish that when we let these kind of wholly unscientific evaluations predominate and especially in open source, where so much great work is otherwise being done to combat the monopoly tactics of big tech.
I find it baffling that anyone would take these benchmarks seriously. The methodology is not transparent, and some of the tests are completely unfair, like those in German or about specific niche German topics. The author readily acknowledges that these tests reflect his personal interests, which is totally fair. But that they would rise to the top of that subreddit and now HN as a general measure of quality of any kind is indicative of the lack of reliable benchmarks out there.
I'm looking forward to all the hardware announcements. It certainly looks like purpose-built, on-device acceleration of LLMs for consumers is coming.
For its output performance (quality and tokens/second), Mixtral 8x7B requires relatively little VRAM. But it is still hard to make it fit, even with a lot of quantization, within the VRAM of most discrete consumer GPUs. Perhaps a smaller base model like Phi-2 could bring the VRAM requirements down, while the MoE structure brings the output quality up relative to Phi-2 alone.
I'd like to note that this model's active parameter count is low enough (~13B) to run smoothly at high quality on a 3090, while beating GPT-3.5 on HumanEval and sporting a 32k context.
3090s are consumer grade and common in gaming rigs. I'm hoping game devs start experimenting with locally deployed Mixtral in their games, e.g. something like Civ but with each leader powered by an LLM.
You can also run Mixtral, at a decent token rate, on a post-2020 Apple MacBook Pro (M1/M2/M3) with 32GB+ of RAM. 16GB of RAM also works, sort of, which I suspect requires the same quantization a 3090 would use, but I do notice a difference from the quantization. On my M2 Pro, the token rate and intelligence feel like GPT-3.5 Turbo. This is the first model I've started actually using (vs. playing around with for the love of the tech) instead of GPT-3.5.
An Apple M2 Pro with 32GB of RAM is in the same price range as a gaming PC with a 3090, but it's another example of normal people with moderately high-performance systems "accidentally" being able to run a GPT-3.5-comparable model.
If you have an Apple machine meeting these specs and want to play around, LM Studio makes it really easy to get started: https://lmstudio.ai/
I hope to see a LOT more hobby hacking as a result of Mixtral and successors.
I have so far run it on my M1 MacBook using llamafile [1] and found it to be great.
Is there any speed/performance/quality/context size/etc. advantage to using LM Studio or any of the other *llama tools that require more setup than downloading and running a single llamafile executable?
Google tells me that the RTX 3090 is priced between US$1,480 and $1,680.
You can buy a whole PC for that. I refuse to believe that a GPU priced that highly is "consumer grade" and "common".
Are there any GPUs that are good for LLMs or other genAI that aren't absurdly priced? Or ones specifically designed for AI rather than gaming graphics?
I recently purchased a 3090 on Reddit’s hardwareswap community for $550. New GPUs are pricey right now because of shortages, but if you look around a bit it can be affordable.
Gamers and LLM/AI/ML GPU users do not find that absurdly priced. Absurdly priced in our world is $15,000, so your perception is off by about an order of magnitude.
Then again, the Apple II cost around $6.5k in today's dollars. [0] My hunch is that people caring less about tricking out their computers for gaming comes down to people not being all that interested in having top-of-the-line graphics settings enabled when playing AAA games. But I think the history of PCs and gaming very much proves that even normal consumers are willing to spend big bucks on technology when it enables something truly new.
I would go even further - anytime I look at the hardware survey, I am surprised by the anemic and dated hardware people are running. Most people who game are not building a custom box anymore.
To be fair, I got a card second hand to play with, and it was only £700 (~$900) and came with a manufacturer warranty. It was a bit of a gamble, but the 24GB of VRAM has been a godsend for experimenting with LLMs. And playing video games at 4K!
tl;dr, no, especially since AMD is lagging behind.
Apple is the one doing the best in terms of making consumer-friendly hardware that can perform AI/ML tasks...but that involves a different problem regarding video games.
It's worth noting that the 4-bit quants can run on CPU at roughly reading speed, which should unlock many use cases - especially if we can precompute some of the results asynchronously.
Resource constraints would be a concern, yeah, so if you were developing a game featuring LLMs (which would, at this point in their development and maturity, be a gimmick) you would keep that in mind and keep other demands on resources low.
True, yet many games are effectively single-threaded anyways.
The bigger problem is memory capacity and bandwidth, but I suspect folks will eventually figure out some sort of QoS setup to let the system crunch LLMs using otherwise unused/idle resources.
I've been working with local models as agents, and anyone interested in trying this needs to know about llama.cpp's "grammars" feature. You can force the model's output to conform to a specific structure, which is useful not only for ensuring that you receive e.g. valid JSON output, but also for more specific constraints like "if you choose to do x, you must also provide y", which can be great for influencing its thinking. For example, an actor who's planning ahead might be required to respond with three of the five W's (its choice which three) but then be free-form inside the JSON string values, which can be used as context for a following selection from a restricted set of actions; or a model might have the option of asking for more time to think at the end of its response, but if it doesn't, then it needs to specify its next action. This doesn't impact generation speed AFAICT and can be used in very creative ways, but results can still need re-generating if they're truncated, and I had to write a function to stop immediately when the valid JSON object is closed (i.e. at the end) or when more than about five newlines are generated in a row. This will vary by model.
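For anyone who hasn't seen the grammar feature, here's a minimal sketch of the idea via the llama-cpp-python bindings (the GBNF below and the model path are illustrative; the planning scheme described above would need a more elaborate grammar):

from llama_cpp import Llama, LlamaGrammar

# GBNF that forces a small JSON object with fixed keys; the string values stay free-form
gbnf = r'''
root ::= "{" ws "\"action\":" ws string "," ws "\"reason\":" ws string ws "}"
string ::= "\"" [^"]* "\""
ws ::= [ \t\n]*
'''

llm = Llama(model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf")  # illustrative path
out = llm(
    "You are an NPC deciding what to do next. Respond as JSON.",
    grammar=LlamaGrammar.from_string(gbnf),
    max_tokens=128,
)
print(out["choices"][0]["text"])  # guaranteed to match the structure above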
> Running LLMs locally to create custom dialogue for games is still years away
I've had to think about this. You need a small LLM for a game; you don't need it to know about movies, music bands, history, coding and all the text on the internet. I'm thinking we need a small model similar to Phi-2, trained only on basic stuff, and then you would train it on the game world's lore. The game would also use "simpler" graphics: we had good-looking game graphics decades ago, so I don't think you'd be limited to text adventures or 2D, you just need simpler, optimized graphics. (It's always interesting when someone shows an Unreal demo that uses more RAM/VRAM for a single demo level than a giant game like GTA5.)
VR isn't pragmatically accessible to the average gamer due to hardware requirements and the necessity of setting up the right physical environment but there are still VR games.
How did you arrive at "almost a decade"? There's been VR stuff since at least the late 1980s. On the flip side you could say it wasn't "a thing" until just a few years ago. Or that it isn't "a thing" yet.
You cannot currently run Mixtral with a 32k context on a 3090 - unless I'm wrong? I think the largest context I was able to get was around 1500 with a 2- or 3-bit quant; I would have to look at my notes.
> it's sort of a weakness inherent to LLM's... next word prediction isn't really supposed to be good at math.
FWIW I don't agree with this in a theoretical sense. The reason LLMs can do as much as they can is because next token prediction attempts to infer a world model for the processes that generated each next-token in the training set. I don't see a reason that this would preclude learning arithmetic in order to better predict next tokens that require arithmetic.
I'd guess that arithmetic will become suddenly reliable with one of the next significant (e.g. 2-5x) jumps in parameter count.
I disagree; 1. the parameters are there to make the model better at learning textual patterns, not arithmetic patterns. 2. Next token prediction is a terrible way to perform arithmetic. 3. Perhaps most importantly, the loss function does not incentivise being good at arithmetic at all.
Any perceived arithmetic ability is just a textual coincidence.
I agree that a sufficiently intelligent LLM, like 300+ IQ, would have an excellent model of how multiplication works. It may even assist in finding new theorems, but a calculator will always be better at 926*725.
> 1. the parameters are there to make the model better at learning textual patterns, not arithmetic patterns.
There's nothing "textual" about the tokens. They are arbitrary identifiers. There's nothing "textual" about transformers! The fact that e.g. GPT-4 can accept images as input, and that its textual performance improved as a result, and that transformers are also being used for text-to-speech models should have already communicated this.
> 2. Next token prediction is a terrible way to perform arithmetic.
This is just attempting to resolve our disagreement with pure assertion. It's certainly less efficient to use an artificial intelligence to do arithmetic. But whether it's efficient is a different question than how likely it is to be possible.
> 3. Perhaps most importantly, the loss function does not incentivise being good at arithmetic at all.
This is blatantly untrue. The same argument would suggest that LLMs can't do anything that wasn't exactly in their training set already. But they can.
1. this is interesting. tokens are arbitrary identifiers... of text. It's called a 'text encoder' for a reason
2. It's a strong assertion, but it is true; it's kind of inherent in the name: next word prediction. Why would you want a calculator to be making predictions? I agree that it's worth understanding whether it's possible, but your original point was that 'arithmetic will become suddenly reliable' - this is what I am disagreeing with.
3. This links to the above point - you are right: LLMs can already perform arithmetic, as could easily be shown by loading up GPT-2. But again, your original point was that LLMs will 'figure out' arithmetic - I don't believe they will. The loss function does not incentivize being good at arithmetic; it incentivizes making good predictions that are close to the true value. The loss function will not penalize an LLM that predicts 'good' as the next word when it should have been 'great'. And while it might penalize '2+2=3', since that sequence would be strongly represented in the training set, it's not going to penalize the model for getting '1234*5678' one digit off - which is the problem.
LLMs are compression functions. An LLM that internalizes the rules of arithmetic will compress better than one that doesn't. That improvement will be measurable in the loss function, as it correctly predicts more of its training data that depends on arithmetic answers to predict the next word.
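For concreteness, the objective being argued about is just per-token cross-entropy (the standard next-token training loss, nothing Mixtral-specific):

\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})

Every digit token of a correct arithmetic result that appears in the training data contributes to that sum, so a model that has internalized the rules does achieve lower loss on that data than one that hasn't.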
> Why would you want a calculator to be making predictions?
If the predictions are correct, why wouldn't I? You are objecting to the entire concept of LLMs at this point, there's nothing specific to arithmetic here.
You don't need multimodal models to access tools such as a calculator. Check out PandasAI or the LangChain agent workflow. Rather than working out 451 * 995 itself, for example, the LLM constructs the pandas query, runs it, and returns the result to the user. Works pretty well.
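A bare-bones sketch of that pattern without any framework (the calculator "tool" here is a hypothetical stand-in for what PandasAI/LangChain provide; the LLM's only job is to emit the expression):

import ast, operator

# tiny, safe arithmetic "tool": evaluates +, -, *, / expressions only
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str) -> float:
    def ev(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

expression = "451 * 995"   # imagine the LLM produced this instead of an answer
print(calc(expression))    # 448745 -- exact, no next-token guessing involved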
Sure, LLMs don't do discrete math or logic directly. On the other hand, they write surprisingly good code.
I'm guessing we'll see LLMs that do input > program(s) > run > summarize > output.
I'm not really disagreeing - I just think LLMs will do "more of the work" themselves by way of writing and running Prolog programs, doing symbolic math (Julia, etc.) and running theorem provers.
Annoyingly, Bard with Gemini will apparently use coding to answer every logical thinking question and get them all wrong. If you end with "Do not use code", it will get them all right.
I haven't read a lot of LLM papers, but I believe this is a rather weak paper, low on details (note: I mean the paper itself, not the results achieved by the LLM). If it had landed on my desk for review, I probably would have sent it back just based on that.
For example, they never really say how they trained the experts or which dataset they used.
It's becoming pretty common, yeah. The two things you mentioned - training particulars and dataset mixture - are also basically the only competitive advantage these companies have. Since the code/architecture is trivial to reproduce, anyone with enough money can make a competing model "easily".
OpenAI started this trend and cemented it with GPT-4's "technical report", which didn't even specify the number of parameters in the model. They've been historically vague about their datasets for far longer than that, though.
Exactly, same thought. Actually, I would expect that they trained each expert separately and then trained them together, since you need to train the router network as well. I'm far from an expert in LLMs, but this would be interesting to know, especially how different training setups influence performance.
I'm curious when we'll start to see open access multimodal models being released.
The advancement in text only models has been amazing, but a lot of the 'emergent' behavior in GPT-4 may be because of multimodal training and not just MoE or parameter sizes.
I'll be curious to see if multimodal smaller models see similar leaps.
I've heard that Google actually got the jump on OpenAI in this regard (just from people in a FAANG), and they're playing a bit of catch-up. OpenAI still has a distinct advantage on the language side, though. This is all hearsay, of course.
Is this a model that can be run using Simon Willison's LLM tool? I cannot find any mention of Mixtral in the issues nor in the discussions. Is there an easy way to play with this model from the command line other than that?
Mistral Medium is available via their API but not available for download, for example, so I find that confusing if you're claiming their CEO said the plan is to be open for all models.
It looks like the experts are used interchangeably, with no clear pattern. And earlier they say, "Surprisingly, we do not observe obvious patterns in the assignment of experts based on the topic."
So then, what is the point of the "expert"?
Could this extra performance just be through the 8 expert architectural design, and not be based on the underlying training material? For example, if all 8 experts were ArXiv papers, would the performance be different?
Not sure if it answers your question, but the expertise of each expert arises from the training process and isn't assigned by a human. Thus, it wouldn't necessarily be discernible to a human. The choice of which expert to use is made, I think, by something called a "gating network". That's also trained, to favor the most appropriate expert for a given input.
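For intuition, the routing described in the paper boils down to something like this per token (a numpy sketch of top-2 gating; names and shapes are illustrative, not the actual Mixtral code):

import numpy as np

def moe_layer(x, gate_w, experts):
    """x: (d_model,) hidden state for one token; gate_w: (n_experts, d_model);
    experts: list of callables, each standing in for an expert feed-forward block."""
    logits = gate_w @ x                      # one gating score per expert
    top2 = np.argsort(logits)[-2:]           # select the two best-scoring experts
    w = np.exp(logits[top2])
    w /= w.sum()                             # softmax over just the selected two
    # only the chosen experts run; their outputs are blended by the gate weights
    return sum(wi * experts[i](x) for wi, i in zip(w, top2))

# toy usage: 8 "experts" that are just random linear maps
d, n = 16, 8
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(n)]
print(moe_layer(rng.normal(size=d), rng.normal(size=(n, d)), experts).shape)  # (16,)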
~13B-class models should work well, with plenty of room left for other applications. Lately, I've heard good things about Solar10B, but new models arrive by the dozen every day, so that may already have changed.
TL;DR: It looks like a convenient and explainable starting point for the proof-of-concept.
Manually combining specialist variants is a known technique. This paper automates it with a router component that mixes two sub-models at any given time. Training eight slight variants of a base model seems safe and configurable compared to n > 16 specialists, where the parts seem more likely to interact unpredictably.
Also, the memory usage seems predictable: it follows 2^m memory conventions by mixing two models at a time, so roughly 2x the memory of a single expert is actively used at any moment. I'm not up to date on the hardware implications, so it might not mean anything yet. It might one day, if this approach works well enough to design around.