I'm really glad that these HNet-inspired approaches are getting traction, I'm a big fan of that paper.
Though I wonder how much of the gains in this case are actually due to the 75% extra parameters compared to the baseline, even if the inference FLOPs are matched.
Can't help but see this as just a different twist on the parameter-sparsity idea leveraged by MoE models, which also gain performance at constant forward-pass FLOPs because of the extra parameters.
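For what it's worth, here's a minimal toy sketch of that top-k routing idea (plain PyTorch, my own illustration, not anything from the paper): total parameter count scales with the number of experts, but each token only passes through k of them, so per-token FLOPs stay roughly constant.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoE(nn.Module):
        def __init__(self, d_model=256, d_ff=1024, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):  # x: (tokens, d_model)
            gate = F.softmax(self.router(x), dim=-1)      # (tokens, n_experts)
            weights, idx = gate.topk(self.k, dim=-1)      # pick k experts per token
            out = torch.zeros_like(x)
            for slot in range(self.k):                    # only k experts actually run per token
                for e in idx[:, slot].unique():
                    mask = idx[:, slot] == e
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
            return out

    moe = TinyMoE()
    # ~8 dense FFNs' worth of parameters live in the model,
    # but each token only pays for 2 of them per forward pass.
    print(sum(p.numel() for p in moe.parameters()))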
Correct me if I'm misinterpreting something in your argument, but as I see it, Matryoshka embeddings just sort the vector bases of the output space roughly by order of their importance for the task, PCA-style. So when you truncate your 4096-dimensional embedding down to, say, 256 dimensions, those are the exact same 256 vector bases doing the core job of encoding the important information for every sample. You're back to dense retrieval on 256-dimensional vectors; all the minor miscellaneous slack useful for a very low fraction of queries has simply been trimmed away.
True sparsity would imply keeping different important vector bases for different documents, but MRL doesn't magically shuffle vector bases around depending on what your document contains; were that the case, cosine similarity between the resulting document embeddings would simply make no sense as a similarity measure.
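A toy numpy illustration of that point (shapes and data made up): truncation keeps the same leading dimensions for every document, so the shortened retrieval is still plain dense cosine similarity in one shared subspace.

    import numpy as np

    rng = np.random.default_rng(0)
    docs = rng.normal(size=(1000, 4096))   # stand-in for MRL-trained document embeddings
    query = rng.normal(size=(4096,))

    def cosine(a, b):
        return (a @ b.T) / (np.linalg.norm(a, axis=-1, keepdims=True) * np.linalg.norm(b, axis=-1))

    d = 256                                # truncate: same leading 256 bases for every sample
    scores_full = cosine(docs, query[None, :]).ravel()
    scores_256 = cosine(docs[:, :d], query[None, :d]).ravel()

    # One fixed subspace shared by all documents -- no per-document choice of bases.
    print(np.corrcoef(scores_full, scores_256)[0, 1])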
The ARC Prize Foundation ran extensive ablations on HRM for their slew of reasoning tasks and noted that the "hierarchical" part of the architecture doesn't add much over a vanilla transformer of the same size with no extra hyperparameter tuning.
As far as I understand the "chunking" of input bytes is learned completely end to end, so it's basically up to the model to figure out how to most efficiently delineate and aggregate the information from the inputs according to the patterns provided to it during training.
Since it's end to end, this lets them apply the process not only to raw byte encodings but to representations at basically any level, for example by stacking two stages of aggregation one after another.
So in principle they could either let the model do its thing on the raw bytes of an image, or alternatively cut it up into tiny patches ViT-style and feed those to their H-Net.
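Roughly, as I understand the dynamic chunking idea (this is my paraphrase, not the paper's exact routing module), the boundary prediction could be sketched like this: a learned projection scores how different each byte-level state is from its predecessor, and positions that score high become chunk boundaries, with everything trainable end to end except the final hard cut.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BoundaryScorer(nn.Module):
        def __init__(self, d_model=512):
            super().__init__()
            self.q = nn.Linear(d_model, d_model, bias=False)
            self.k = nn.Linear(d_model, d_model, bias=False)

        def forward(self, h):                      # h: (seq, d_model) byte-level states
            sim = F.cosine_similarity(self.q(h[1:]), self.k(h[:-1]), dim=-1)
            p_boundary = (1 - sim) / 2             # dissimilar neighbors -> likely chunk boundary
            return torch.cat([torch.ones(1), p_boundary])  # first position is always a boundary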
I wonder how hard it would be to adapt chunking to work in 2D, and what that would even look like.
Some other notes on how multimodal inputs could be handled with this architecture are mentioned in Albert Gu's (one of the authors') blog, although only briefly; there's still much to figure out, it would seem: https://goombalab.github.io/blog/2025/hnet-future/#alternati...
According to [0], it's partly due to a key change they introduced: interleaving layers that use standard RoPE positional encodings with layers using what's called NoPE [1], which doesn't encode positions at all and lets the model figure them out on its own. (This only works because the LLMs are autoregressive, so the model can recognize an input token as being the very first one by there not yet being any other tokens to attend to, and can recursively derive the positions of subsequent tokens from that base case.)
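Just to illustrate the interleaving (the 3:1 ratio below is made up for the example, not taken from the paper): some attention layers apply RoPE, others apply no positional encoding at all and rely on the causal mask to let the model infer positions.

    def build_layer_config(n_layers: int, nope_every: int = 4) -> list[str]:
        """Return 'rope' or 'nope' for each layer index (illustrative ratio only)."""
        return ["nope" if (i + 1) % nope_every == 0 else "rope" for i in range(n_layers)]

    print(build_layer_config(12))
    # ['rope', 'rope', 'rope', 'nope', 'rope', 'rope', 'rope', 'nope', ...]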
I'm not sure I understand what you're trying to say here. Information between tokens is propagated through self-attention, and there's an attention block inside every transformer block in the model. That's a whole lot of internal state stored in (mostly) inscrutable key and value vectors: hundreds of dimensions per attention head, a few dozen heads per attention block, and a few dozen blocks per model.
Yes, but all that internal state only survives until the end of the computation chain that predicts the next token - it doesn't survive across the entire sequence as it would in a recurrent network.
There is literally no difference between a model predicting the tokens "<thought> I think the second choice looks best </thought>" and a user putting those tokens into the prompt: The input for the next round would be exactly the same.
So the tokens kind of act like a bottleneck (or more precisely the sampling of exactly one next token at the end of each prediction round does). During prediction of one token, the model can go crazy with hidden state, but not across several tokens. That forces the model to do "long form" reasoning through the tokens and not through hidden state.
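In sketch form (`model` and `sample` here are hypothetical stand-ins, not a real API): whatever rich hidden state gets built up while predicting one token is discarded afterwards, and the only thing carried into the next round is the single sampled token id.

    def generate(model, prompt_ids, n_steps):
        ids = list(prompt_ids)
        for _ in range(n_steps):
            hidden_states, logits = model(ids)   # arbitrarily rich internal state...
            next_id = sample(logits[-1])         # ...collapsed into one discrete token
            del hidden_states                    # nothing else survives to the next round
            ids.append(next_id)
        return ids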
The key and value vectors are cached; that's kind of the whole point of autoregressive transformer models. The "state" not only survives within the KV cache but, in some sense, grows continuously with each token added, and is reused for every subsequent token.
Hmm, maybe I misunderstood that part, but so far I thought the KV cache was really just that - a cache. Because all the previous tokens of the sequence stay the same, it makes no sense to compute the same K and V vectors again in each round.
But that doesn't change the fact that the only inputs to the Q, K and V calculations are the tokens (or, in later layers, information derived from the tokens), and that each vector in the cache maps directly to an input token.
So I think you could disable the cache and recompute everything in each round and you'd still get the same result, just a lot slower.
That's absolutely correct: the KV cache is just an optimization trick; you could run the model without it, which is essentially what encoder-only transformers do.
I guess what I'm trying to convey is that the latent representations within a transformer are conditioned on all previous latents through attention. So, at least in principle, while the old cache of course doesn't change, it does grow with new tokens, which means the "state" can be brought up to date by being incorporated, in updated form, into the latents of subsequent tokens.
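An easy way to convince yourself of the "cache is only an optimization" point with a standard Hugging Face causal LM (with greedy decoding the token outputs should come out identical, cache or no cache; gpt2 here is just a small example checkpoint):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    inputs = tok("The KV cache is", return_tensors="pt")
    with torch.no_grad():
        with_cache = model.generate(**inputs, max_new_tokens=20, do_sample=False, use_cache=True)
        without_cache = model.generate(**inputs, max_new_tokens=20, do_sample=False, use_cache=False)

    # The cached run only avoids recomputing K/V for the unchanged prefix.
    print(torch.equal(with_cache, without_cache))  # expected: True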
Chain of thought isn't exactly transparent either; you shouldn't fall into the pitfall of believing that the final sequence of tokens reasoning about the task is the only processing the model actually performs during CoT.
There might be a lot of other hidden computation happening within the model's latents which may not immediately influence the predicted tokens but may still matter for the model's internal processing. And even disregarding that, the model is under no formal obligation to stick to the chain of thought it produced when making its final decisions.
AFAIK retrieving documents that merely look like the query is more commonly avoided by using a bi-encoder explicitly trained for retrieval; those are generally conditioned to align query embeddings with the embeddings of relevant documents, with each side getting a dedicated marker, something like [QUERY] and [DOC], to make the distinction clear.
HyDE's strong suit seems to be settings where the documents and queries you're working with are too niche to be properly understood by a generic retrieval model and you don't have enough concrete retrieval data to fine-tune a specialized one.
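For example, with an off-the-shelf bi-encoder (E5-style models use plain "query: " / "passage: " text prefixes rather than special tokens, but the asymmetric encoding of the two sides is the same idea; the passage strings are made up):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("intfloat/e5-base-v2")

    q_emb = model.encode("query: how do I rotate a pdf page?", normalize_embeddings=True)
    d_emb = model.encode(
        ["passage: You can rotate PDF pages in place with most PDF libraries.",
         "passage: How do I rotate a pdf page? (forum post with no answer)"],
        normalize_embeddings=True,
    )
    # Trained to rank the relevant document above the query lookalike.
    print(util.cos_sim(q_emb, d_emb))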
In section 2 they briefly mention studies such as [1] that point out that the token outputs of a chain of thought aren't always entirely faithful to the models' actual responses.
I'm not sure whether it wouldn't be more reliable to let the model run on latents and try to train a separate latent-reading explainer module that has at least some approximation of what we want as an explicit optimization objective.
Assuming it actually is, or has the potential to be, better than CoT: from what I gathered from the paper, the current results are mostly just more efficient token-wise.
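The simplest version of that explainer idea would probably be a probe trained on the frozen model's latents with an explicit objective, something like the sketch below (the layer choice, probe capacity and supervision signal are all open questions, and the shapes are placeholders):

    import torch
    import torch.nn as nn

    class LatentProbe(nn.Module):
        def __init__(self, d_model: int, n_labels: int):
            super().__init__()
            self.head = nn.Linear(d_model, n_labels)

        def forward(self, hidden):          # hidden: (batch, d_model) latents from a frozen model
            return self.head(hidden)

    probe = LatentProbe(d_model=4096, n_labels=2)
    opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # one training step on (latents, labels) pairs collected from the frozen base model
    latents, labels = torch.randn(32, 4096), torch.randint(0, 2, (32,))
    loss = loss_fn(probe(latents), labels)
    loss.backward(); opt.step(); opt.zero_grad()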
I mean, it's no free lunch; you still need to spend significantly more compute on the QLoRA training than on any usual PTQ method, be it SpinQuant or a more conventional quantization approach.
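Roughly where the extra compute goes, in standard transformers/peft terms (the checkpoint name and hyperparameters are just placeholders): QLoRA still runs a full fine-tuning loop over the quantized base model, whereas PTQ pipelines mostly pay for a calibration pass.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B",
                                                 quantization_config=bnb)
    model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
    model.print_trainable_parameters()
    # ...followed by a full fine-tuning run over a training corpus, which is the bulk
    # of the extra compute relative to a typical post-training quantization pipeline.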