
General overview below, as the pages don't seem to be working well

  Llama 4 Models:
  - Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each.
  - They are natively multimodal: text + image input, text-only output.
  - Key achievements include industry-leading context lengths, strong coding/reasoning performance, and improved multilingual capabilities.
  - Knowledge cutoff: August 2024.

  Llama 4 Scout:
  - 17B active parameters, 16 experts, 109B total.
  - Fits on a single H100 GPU (INT4-quantized).
  - 10M token context window.
  - Outperforms previous Llama releases on multimodal tasks while being more resource-friendly.
  - Employs iRoPE architecture for efficient long-context attention.
  - Tested with up to 8 images per prompt.

  Llama 4 Maverick:
  - 17B active parameters, 128 experts, 400B total.
  - 1M token context window.
  - Not single-GPU; runs on one H100 DGX host or can be distributed for greater efficiency.
  - Outperforms GPT-4o and Gemini 2.0 Flash on coding, reasoning, and multilingual tests at a competitive cost.
  - Maintains strong image understanding and grounded reasoning ability.

  Llama 4 Behemoth (Preview):
  - 288B active parameters, 16 experts, nearly 2T total.
  - Still in training; not yet released.
  - Exceeds GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks (e.g., MATH-500, GPQA Diamond).
  - Serves as the “teacher” model for Scout and Maverick via co-distillation.

  Misc:
  - MoE Architecture: Only 17B parameters activated per token, reducing inference cost.
  - Native Multimodality: Unified text + vision encoder, pre-trained on large-scale unlabeled data.



For a super ignorant person:

Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each

Are those experts LLMs trained on specific tasks, or what?


This was an idea that sounded somewhat silly until it was shown it worked. The idea is that you encourage through training a bunch of “experts” to diversify and “get good” at different things. These experts are say 1/10 to 1/100 of your model size if it were a dense model. So you pack them all up into one model, and you add a layer or a few layers that have the job of picking which small expert model is best for your given token input, route it to that small expert, and voila — you’ve turned a full run through the dense parameters into a quick run through a router and then a 1/10 as long run through a little model. How do you get a “picker” that’s good? Well, it’s differentiable, and all we have in ML is a hammer — so, just do gradient descent on the decider while training the experts!
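
For the curious, here is a minimal sketch of that "router + experts" structure in PyTorch. The sizes, expert count, and top_k are made up for illustration, and the load-balancing tricks real models rely on are left out; this is not Llama 4's actual implementation.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class TinyMoELayer(nn.Module):
      # Illustrative sizes only; real MoE layers are far larger.
      def __init__(self, d_model=512, d_hidden=1024, n_experts=16, top_k=2):
          super().__init__()
          self.top_k = top_k
          # The "picker": a plain linear layer scoring each expert per token.
          self.router = nn.Linear(d_model, n_experts)
          # Each "expert" is just a small feed-forward block.
          self.experts = nn.ModuleList([
              nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                            nn.Linear(d_hidden, d_model))
              for _ in range(n_experts)
          ])

      def forward(self, x):                      # x: (n_tokens, d_model)
          scores = self.router(x)                # (n_tokens, n_experts)
          weights, idx = scores.topk(self.top_k, dim=-1)
          weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
          out = torch.zeros_like(x)
          # Each token only passes through its top-k experts. Everything is
          # differentiable, so the router is trained by ordinary backprop.
          for k in range(self.top_k):
              for e, expert in enumerate(self.experts):
                  mask = idx[:, k] == e
                  if mask.any():
                      out[mask] += weights[mask, k:k+1] * expert(x[mask])
          return out

  # Usage: TinyMoELayer()(torch.randn(8, 512)) runs 8 tokens through 2 of 16 experts each.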

This generally works well, although there are lots and lots of caveats. But it is (mostly) a free lunch, or at least a discounted lunch. I haven’t seen a ton of analysis on what different experts end up doing, but I believe it’s widely agreed that they tend to specialize. Those specializations (especially if you have a small number of experts) may be pretty esoteric / dense in their own right.

Anthropic’s interpretability team would be the ones to give a really high quality look, but I don’t think any of Anthropic’s current models are MoE.

Anecdotally, I feel MoE models sometimes exhibit slightly less “deep” thinking, but I might just be biased towards more weights. And they are undeniably faster and better per unit of clock time, GPU time, memory, or bandwidth usage — on all of these — than dense models with similar training regimes.


The only thing about this which may be unintuitive from the name is an "Expert" is not something like a sub-llm that's good at math and gets called when you ask a math question. Models like this have layers of networks they run tokens through and each layer is composed of 256 sub-networks, any of which can be selected (or multiple selected and merged in some way) for each layer independently.

So the net result is the same: sets of parameters in the model are specialized and selected for certain inputs. It's just done a bit deeper in the model than one may assume.


the most unintuitive part is that, from my understanding, individual tokens are routed to different experts. this is hard to comprehend with "experts", as it means you can have different experts for two sequential tokens, right?

I think where MoE is misleading is that the experts aren't what we would call "experts" in the normal world but rather they are experts for a specific token. that concept feels difficult to grasp.


It's not even per token. The routing happens once per layer, with the same token bouncing between layers.

It's more of a performance optimization than anything else, improving memory liquidity. Except it's not an optimization for running the model locally (where you only run a single query at a time, and it would be nice to keep the weights on the disk until they are relevant).

It's a performance optimization for large deployments with thousands of GPUs answering tens of thousands of queries per second. They put thousands of queries into a single batch and run them in parallel. After each layer, the queries are re-routed to the GPU holding the correct subset of weights. Individual queries will bounce across dozens of GPUs per token, distributing load.

Even though the name "expert" implies they should be experts in a given topic, it's really not true. During training, they optimize for making the load distribute evenly, nothing else.
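
A hedged sketch of what "optimize for making the load distribute evenly" can look like in practice: an auxiliary loss in the style of the Switch Transformer, which penalizes the router when token traffic and router probability pile up on a few experts. Whether Llama 4 uses exactly this term isn't stated here; it's just the standard recipe.

  import torch
  import torch.nn.functional as F

  def load_balancing_loss(router_logits, top1_idx, n_experts):
      # router_logits: (n_tokens, n_experts); top1_idx: (n_tokens,) chosen expert per token
      probs = F.softmax(router_logits, dim=-1)
      # f: fraction of tokens actually dispatched to each expert
      f = torch.bincount(top1_idx, minlength=n_experts).float() / top1_idx.numel()
      # p: mean router probability assigned to each expert
      p = probs.mean(dim=0)
      # Minimized (value near 1.0) when both distributions are uniform; added to
      # the main loss with a small coefficient during training.
      return n_experts * torch.sum(f * p)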


BTW, I'd love to see a large model designed from scratch for efficient local inference on low-memory devices.

While current MoE implementations are tuned for load-balancing over large pools of GPUs, there is nothing stopping you tuning them to only switch expert once or twice per token, and ideally keep the same weights across multiple tokens.

Well, nothing stopping you, but there is the question of if it will actually produce a worthwhile model.


Intuitively it feels like there ought to be significant similarities between expert layers because there are fundamentals about processing the stream of tokens that must be shared just from the geometry of the problem. If that's true, then identifying a common abstract base "expert" then specialising the individuals as low-rank adaptations on top of that base would mean you could save a lot of VRAM and expert-swapping. But it might mean you need to train from the start with that structure, rather than it being something you can distil to.
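
A rough sketch of that idea, assuming a single shared dense "base" with a cheap low-rank per-expert correction on top. The class, rank, and structure are hypothetical illustrations of the suggestion above, not how Llama 4 or DeepSeek actually factor their experts.

  import torch
  import torch.nn as nn

  class LowRankExpert(nn.Module):
      def __init__(self, shared_base: nn.Linear, rank=16):
          super().__init__()
          self.base = shared_base               # one copy, shared by all experts
          d_out, d_in = shared_base.weight.shape
          # Expert-specific low-rank delta: rank*(d_in+d_out) params instead of d_in*d_out
          self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
          self.B = nn.Parameter(torch.zeros(d_out, rank))

      def forward(self, x):
          # Shared dense computation plus a small expert-specific correction.
          return self.base(x) + x @ self.A.T @ self.B.T

  shared = nn.Linear(512, 512)                  # hypothetical shared base layer
  experts = [LowRankExpert(shared, rank=16) for _ in range(8)]

Only the small A/B matrices differ between experts, so swapping an expert in or out means moving far less data than a full weight matrix.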


Yes, Deepseek introduced this optimisation of a common base "expert" that's always loaded. Llama 4 uses it too.


I had a sneaking suspicion that I wouldn't be the first to think of it.


DeepSeek introduced a novel expert-training technique which increased expert specialization. For a given domain, their implementation tends to activate the same experts across different tokens, which is kinda what you’re asking for!


I think Gemma 3 is marketed for single GPU setups https://blog.google/technology/developers/gemma-3/


> It's not even per token. The routing happens once per layer, with the same token bouncing between layers.

They don't really "bounce around" though do they (during inference)? That implies the token could bounce back from eg. layer 4 -> layer 3 -> back to layer 4.


So a more correct term would be "Distributed Loading" instead of MoE.


> making the load distribute evenly, nothing else.

so you mean a "load balancer" for neural nets … well, why don't they call it that then?


Some load balancers are also routers (if they route based on service capability and not just instantaneous availability) or vice versa, but this kind isn't always, to my understanding: The experts aren't necessarily "idle" or "busy" at any given time (they're just functions to be invoked, i.e. generally data, not computing resources), but rather more or less likely to answer correctly.

Even in the single GPU case, this still saves compute over the non-MoE case.

I believe it's also possible to split experts across regions of heterogeneous memory, in which case this task really would be something like load balancing (but still based on "expertise", not instantaneous expert availability, so "router" still seems more correct in that regard.)


Also note that MoE is a decades old term, predating deep learning. It's not supposed to be interpreted literally.


> individual tokens are routed to different experts

that was AFAIK (not an expert! lol) the traditional approach

but judging by the chart on LLaMa4 blog post, now they're interleaving MoE models and dense Attention layers; so I guess this means that even a single token could be routed through different experts at every single MoE layer!


ML folks tend to invent fanciful metaphorical terms for things. Another example is “attention”. I’m expecting to see a paper “consciousness is all you need” where “consciousness” turns out to just be a Laplace transform or something.


So really it's just utilizing sparse subnetworks - more like the human brain.


The idea has also been around for at least 15 years; "ensemble learning" was a topic in my "Data Mining" textbook from around then.

Meta calls these individually smaller/weaker models "experts" but I've also heard them referred to as "bozos", because each is not particularly good at anything and it's only together that they are useful. Also bozos has better alliteration with boosting and bagging, two terms that are commonly used in ensemble learning.


MOE as an idea specific to neural networks has been around since 1991[1] . OP is probably aware, but adding for others following along, while MoE has roots in ensembling, there are some important differences: Traditional ensembles run all models in parallel and combine their outputs, whereas MoE uses a gating mechanism to activate only a subset of experts per input. This enables efficient scaling via conditional computation and expert specialization, rather than redundancy.

[1]:https://ieeexplore.ieee.org/document/6797059
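
To make that distinction concrete, a toy sketch (the expert and gate callables are hypothetical stand-ins): an ensemble evaluates every member and combines their outputs, while an MoE gate picks a sparse subset per input, so only those experts run at all.

  import torch

  def ensemble(experts, x):
      # Every expert runs on every input; outputs are averaged.
      return torch.stack([e(x) for e in experts]).mean(dim=0)

  def mixture_of_experts(experts, gate, x, top_k=2):
      # Only the top-k gated experts run; the rest cost nothing for this input.
      weights, idx = torch.softmax(gate(x), dim=-1).topk(top_k)
      return sum(w * experts[int(i)](x) for w, i in zip(weights, idx))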


If I have 5000 documents about A, and 5000 documents about B, do we know whether it's better to train one large model on all 10,000 documents, or to train 2 different specialist models and then combine them as you describe?


Well, you don't, but the power of gradient descent, if properly managed, will split them up for you. You might get more mileage out of like 200 specialist models, though.


It probably depends on how much A and B overlap. If it's, say, English sci-fi and Chinese poetry, two different models may be better.


> Anecdotally, I feel MoE models sometimes exhibit slightly less “deep” thinking

Makes sense to compare apples with apples. Same compute amount, right? Otherwise you are giving less time to the MoE model and then feeling like it underperforms. Shouldn't be surprising...

> These experts are say 1/10 to 1/100 of your model size if it were a dense model

Just to be correct, each layer (attention + fully connected) has its own router and experts. There are usually 30+ layers. It can't be 1/10 per expert when there are literally hundreds of them.


Cool. Does that mean I could just run the query through the router and then load only the required experts? That is, could I feasibly run this on my MacBook?


I've been calling for this approach for a while. It's kinda similar to how the human brain has areas that are good at specific tasks


It's already used a lot — the paper I believe is from 1991, and GPT4 among many others is MoE


yes, and it's on a per-layer basis, I think!

So if the model has 16 transformer layers to go through on a forward pass, and at each layer it gets to pick between 16 different experts, that's 16^16 possible expert combinations!


So this is kind of an ensemble sort of thing in ML like random forest and GBT?


The "Experts" in MoE is less like a panel of doctors and more like having different brain regions with interlinked yet specialized functions.

The models get trained largely the same way as non-MoE models, except with specific parts of the model silo'd apart past a certain layer. The shared part of the model, prior to the splitting, is the "router". The router learns how to route as an AI would, so it's basically a black-box in terms of whatever internal structure emerges from this.


No, it's more like sharding of parameters. There's no understandable distinction between the experts.


I understand they're only optimizing for load distribution, but have people been trying to disentangle what the various experts learn?


Mixture of experts involves a trained router component which routes to specific experts depending on the input, but without any terms enforcing load distribution this tends to collapse during training, with most inputs getting routed to just one or two experts.


Keep in mind that the "experts" are selected per layer, so it's not even a single expert selection you can correlate with a token, but an interplay of abstract features across many experts at many layers.


I believe Mixture-of-Experts is a way for a neural network to group certain knowledge into smaller subsets. AFAIK there isn't a specific grouping goal; the network just figures out what goes where on its own, and then when an inference request is made it determines which "expert" would have that knowledge and routes it there. This makes the inference process much more efficient.



Llama 4 Scout, Maximum context length: 10M tokens.

This is a nice development.


Is the recall and reasoning equally good across the entirety of the 10M token window? Cause from what I've seen many of those window claims equate to more like a functional 1/10th or less context length.


It’s going to take a while to see how good this window is for real use; they’ve used a couple new ideas to get to 10M token context. Right now the only really good long token model out there is Gemini Pro - and its effectiveness does start dropping maybe in the 200k token range. I imagine insiders at GOOG have access to more than the published 1M token range there.

It will be fun to see what we get here, but I have no doubt the extra tokens will be useful - lots of use cases can do almost as well with summary-level accuracy memory.


I read somewhere that it has been trained on 256k tokens and then expanded with RoPE on top of that, not starting from 16k like everyone does IIRC, so even if it isn't really flawless at 10M, I'd expect it to be much stronger than its competitors up to those 256k.


I very much agree. I've been using Gemini 2.5 Pro for coding and I've always given it a simple instruction: never write comments. It will stop writing them for a time, but it's nowhere near the 1M context window.

Now maybe this is more a lack of instruction following than context length but the fact that it works at first and then starts going downhill quickly makes me wary about how much it will pay attention to other details further back in the context.


the needle in a haystack benchmark looks good but at this point I think we need new benchmarks to test actual understanding of content in such a large window.


I think the problem is with positional encoding. If the model cannot clearly separate tokens in the context window, they overlap, which leads to a mess. The encoding matters, not the actual position.


I assume they're getting these massive windows via RAG trickery, vectorization, and other tricks behind the curtain, because I've noticed the same as you: things start dipping in quality pretty quickly.

Does anyone know if I am correct in my assumption?


There's no "RAG trickery" or vector search. They changed the way they encode positions such that in theory they're less sensitive to where the token appears in the string.

That's similar to how previous long-context models worked as well, although the earlier iterations didn't work particularly well, as most have noticed; technically the model "worked" with longer contexts, but it would definitely get dumber. Still too early to tell how this newer variant works, although I'd assume it's at least somewhat better.


the large context windows generally involve RoPE [0], a trick that allows the training window to be smaller while expanding to longer contexts during inference. it seems like they have a new "iRoPE" which might have better performance?

[0]https://arxiv.org/pdf/2104.09864
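
For reference, a minimal NumPy sketch of plain RoPE and the simplest context-extension trick (position interpolation). The "iRoPE" variant Meta mentions isn't spelled out in the post, so treat this as background on the general family rather than Llama 4's exact method.

  import numpy as np

  def rope(x, positions, base=10000.0):
      # x: (seq_len, d) with d even; positions: (seq_len,)
      d = x.shape[-1]
      inv_freq = base ** (-np.arange(0, d, 2) / d)     # one frequency per dim pair
      angles = np.outer(positions, inv_freq)           # (seq_len, d/2)
      cos, sin = np.cos(angles), np.sin(angles)
      x1, x2 = x[:, 0::2], x[:, 1::2]
      out = np.empty_like(x)
      out[:, 0::2] = x1 * cos - x2 * sin               # rotate each 2-D pair by
      out[:, 1::2] = x1 * sin + x2 * cos               # an angle set by position
      return out

  # Position interpolation: squeeze positions beyond the trained window back into
  # the range seen during training (the 256k/10M figures are from this thread).
  def rope_interpolated(x, positions, trained_len=256_000, target_len=10_000_000):
      return rope(x, positions * (trained_len / target_len))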


I don't think RAG will survive this time


4.8b words on English Wikipedia. Knowledge cutoff of 6 months. A valid use case is to search across Wikipedia and ground your answers. Trivially proves that RAG is still needed.


RAG still has lots of benefits for anyone paying per input token (e.g. over APIs).


Not to mention latency


And grounding for the model. Smaller models tend to hallucinate a little less (anecdotally).


This is only for the small model. The medium model is still at 1M (like Gemini 2.5)

Even if we could get the mid-sized models to 10M, that's still a medium-sized repo at best. Repo size growth will also accelerate as LLMs generate more code. There's no way to catch up.


RAG gets bigger as everyone else gets bigger. Flooding prompts with garbage is not a sound strategy...


How did they achieve such a long window and what are the memory requirements to utilize it?


According to [0], it's partly due to a key change they introduced: interleaving layers that use standard RoPE positional encodings with layers using what's called NoPE [1], which doesn't encode positions at all and lets the model figure them out on its own. (This only works because the LLMs are autoregressive: the model can recognize an input token as being the very first one by there not yet being any other tokens to attend to, and can recursively derive the positions of subsequent tokens from that base case.)

[0] https://ai.meta.com/blog/llama-4-multimodal-intelligence/ [1] https://arxiv.org/abs/2305.19466
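
A toy illustration of that interleaving, assuming a 3:1 RoPE-to-NoPE pattern purely for the sake of the example (the blog post doesn't give the exact layout):

  # Most attention layers apply RoPE; every Nth layer applies no positional
  # encoding at all and relies on causal masking to recover token order.
  N_LAYERS = 48
  layer_kinds = ["nope" if (i + 1) % 4 == 0 else "rope" for i in range(N_LAYERS)]
  print(layer_kinds[:8])          # ['rope', 'rope', 'rope', 'nope', 'rope', 'rope', 'rope', 'nope']
  print(layer_kinds.count("nope"))  # 12 layers with no positional encoding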


> Knowledge cutoff: August 2024.

Could this mean training time is generally around 6 months, with 2 months of Q/A?


I wish my knowledge cutoff was August 2024.


This made me LOL louder than I have for a long time! Agree.


Couldn’t you gradually include more recent documents as you train?


You can do that but the amount of incremental data will be negligible compared to the rest of the data. Think of the knowledge cutoff more like a soft value.


That makes it harder to analyze the results of training and draw conclusions for the next round.


It scales depending on the dataset you want exposure on and the compute you have available, so any specific time box is kind of meaningless if you don’t know the rest of the inputs that went into it. The llama 3 paper went into a lot of this and how these decisions were made (see section 3 and onward): https://ai.meta.com/research/publications/the-llama-3-herd-o...

tl;dr: llama 3 was 54 days, but it’s more complicated than that.


Thanks for sharing this here. At first I loved the simple Apache-style directory listing, very classic and utilitarian way to navigate new information. Then I tried clicking the FAQ and it wouldn't load anything until I allowed two different sources of JavaScript.


I have a gut feeling the next step will be 2 or more levels of MoE, further reducing the memory bandwidth and compute requirements. So a top-level MoE router decides which sub-MoE to route to.


The solution to all problems in computer science is add a new level of indirection (or abstraction).


Except when the solution is to collapse abstraction in the name of efficiency.


17B puts it beyond the reach of a 4090 ... anybody do 4 bit quant on it yet?


Oh, it'll never run on a 4090. 17B is the active parameter count, not the total param count (and "active" doesn't mean you can slice just those params out and put them on the GPU — which parameters are active constantly changes, even per-token. "Active" just means you get tokens faster than a dense model). It's 109B total parameters, so you'd need at least 54.5GB VRAM just for the weights alone.

A Framework Desktop, Mac Studio, or Nvidia DGX Spark should be able to handle the Scout model locally though... Maybe even at FP8, depending on how much context you need.
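
A quick back-of-envelope check on those numbers (weights only; KV cache, activations, and runtime overhead come on top):

  def weight_gb(n_params, bits_per_param):
      return n_params * bits_per_param / 8 / 1e9

  scout_total = 109e9                     # Llama 4 Scout total parameter count
  for bits in (16, 8, 4):
      print(f"{bits:>2}-bit: {weight_gb(scout_total, bits):6.1f} GB")
  # 16-bit:  218.0 GB
  #  8-bit:  109.0 GB
  #  4-bit:   54.5 GB  <- the INT4 single-H100 figure quoted for Scout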


Well, Scout should run on the rumored 96GB 4090, since it runs on a single 80GB H100. But, yeah, it'd have to be at sub-2bit quantization to run on a standard 24GB.


Sounds runnable on 2x5090 presumably for $4k if back in stock.


True! A Framework Desktop or mid-tier Mac Studio would also work and would be cheaper — and you could even run Scout at FP8. A maxed-out Mac Studio could even handle Maverick at FP8, albeit at pretty high cost ($10k).

It's still runnable locally. Just not on a 4090.


You can swap experts in and out of VRAM, it just increases inference time substantially.

Depending on the routing function you can figure out all the active experts ahead of the forward pass for a single token and pipeline the expert loading.


The chosen expert (at each layer) depends on the output of the previous layer. Not sure how you can preload the experts before the forward pass.


Unless something’s changed you will need the whole model on the HPU anyway, no? So way beyond a 4090 regardless.


You can still offload most of the model to RAM and use the GPU for compute, but it's obviously much slower than what it would be if everything was on the GPU memory.

see ktransformers: https://www.reddit.com/r/LocalLLaMA/comments/1jpi0n9/ktransf...


I'm certainly not the brightest person in this thread but has there been effort to maybe bucket the computational cost of the model so that more expensive parts are on the gpu and less expensive parts are on the cpu?



A Habana just for inference? Are you sure?

Also I see the 4-bit quants put it at an H100, which is fine ... I've got those at work. Maybe there will be distilled versions for running at home.


If their knowledge cutoff is 8 months ago, then how on earth does Grok know things that happened yesterday?

I would really love to know that.


RAG?


At that scale? Is that even possible?


Nice release. I see that everyone is playing the differentiation game now: https://medium.com/thoughts-on-machine-learning/llama-4-and-...



