Hacker News new | past | comments | ask | show | jobs | submit login
Reflection 70B, the top open-source model (twitter.com/mattshumer_)
234 points by GavCo 88 days ago | hide | past | favorite | 95 comments



Like other comments, I was also initially surprised. But I think the gains are both real and easy to understand where the improvements are coming from.

Under the hood Reflection 70B seems to be a Llama-3.1 finetune that encourages the model to add <think>, <reflection> and <output> tokens and corresponding phases. This is an evolution of Chain-of-Thought's "think step by step" -- but instead of being a prompting technique, this fine-tune bakes examples of these phases more directly into the model. So the model starts with an initial draft and 'reflects' on it before issuing a final output.

The extra effort spent on tokens, which effectively let the model 'think more' appears to let it defeat prompts which other strong models (4o, 3.5 Sonnet) appear to fumble. So for example, when asked "which is greater 9.11 or 9.9" the Reflection 70b model initially gets the wrong answer, then <reflects> on it, then spits the right output.

Personally, the comparison to Claude and 4o doesn't quite seem apples-to-apples. If you were to have 4o/Claude take multiple rounds to review and reflect on their initial drafts, would we see similar gains? I suspect they would improve massively as well.


https://huggingface.co/mattshumer/Reflection-70B says system prompt used is:

   You are a world-class AI system, capable of complex reasoning and reflection. Reason through the query inside <thinking> tags, and then provide your final response inside <output> tags. If you detect that you made a mistake in your reasoning at any point, correct yourself inside <reflection> tags.
Also, only "smarter" models can use this flow, according to https://x.com/mattshumer_/status/1831775436420083753


> Personally, the comparison to Claude and 4o doesn't quite seem apples-to-apples. If you were to have 4o/Claude take multiple rounds to review and reflect on their initial drafts, would we see similar gains? I suspect they would improve massively as well.

They may already implement this technique, we can't know.


Claude 3.5 does have some "thinking" ability - I've seen it pause and even say it was thinking before. Presumably this is just some output it decides not to show you.


THIS!!!!!!! People act like Claude and 4o are base models with no funny business behind the scenes, we don't know just how much additional prompt steps are going on for each queue, all we know is what the API or Chat interface dump out, what is happening behind that is anyones guess.. The thinking step and refinement steps likely do exist on all the major commercial models. It's such a big gain for a minimal expenditure of backend tokens, WTF wouldn't they be doing it to improve the outputs?


Well they can't do a /lot/ of hidden stuff because they have APIs, so you can see the raw output and compare it to the web interface.

But they can do a little.


As if they couldn’t postprocess the api output before they send it to the client…


No, I mean they sell API access and you can query it.


That's only in the web version, it's just that they prompt it to do some CoT in the antThinking XML tag, and hide the output from inside that tag in the UI.


The API does it too for some of their models in some situations.


Interesting, is there any documentation on this or a way to view the thinking?


I suspect GPT4o already has training for CoT. I've noticed it often responds by saying something like "let's break it down step by step". Or maybe it's the system prompt.


I am not sure, but you seem to be implying that the Reflection model is running through multiple rounds? If so, that is not what is happening here. The token generation is still linear next token prediction. It does not require multiple rounds to generate the chain of thought response. It does that in one query pass.

I have been testing the model for the last few hours and it does seem to be an improvement on LLAMA 3.1 upon which it is based. I have not tried to compare it to Claude or GPT4o because I don't expect a 70b model to outperform models of that class no matter how good it is. I would happy to be wrong though...


I had a similar idea[0], interesting to see that it actually works. The faster LLM workloads can be accelerated, the more ‘thinking’ the LLM can do before it emits a final answer.

[0]: https://news.ycombinator.com/item?id=41377042


Further than that, it feels like we could use constrained generation of outputs [0] to force the model to do X amount of output inside of a <thinking> BEFORE writing an <answer> tag. It might not always produce good results, but I'm curious what sort of effect it might have to convince models that they really should stop and think first.

[0]: https://github.com/ggerganov/llama.cpp/blob/master/grammars/...


Can we replicate this in other models without finetuning them ?



Apple infamously adds "DO NOT HALLUCINATE" to its prompts.


Huh ? Source please (this is fascinating)



what's our estimate of the cost to finetune this?


I don't know the cost, but they supposedly did all their work in 3 weeks based on something they said in this video: https://www.youtube.com/watch?v=5_m-kN64Exc


Interesting idea!

You can somewhat recreate the essence of this using a system prompt with any sufficiently sized model. Here's the prompt I tried for anybody who's interested:

  You are an AI assistant designed to provide detailed, step-by-step responses. Your outputs should follow this structure:

  1. Begin with a <thinking> section. Everything in this section is invisible to the user.
  2. Inside the thinking section:
     a. Briefly analyze the question and outline your approach.
     b. Present a clear plan of steps to solve the problem.
     c. Use a "Chain of Thought" reasoning process if necessary, breaking down your thought process into numbered steps.
  3. Include a <reflection> section for each idea where you:
     a. Review your reasoning.
     b. Check for potential errors or oversights.
     c. Confirm or adjust your conclusion if necessary.
  4. Be sure to close all reflection sections.
  5. Close the thinking section with </thinking>.
  6. Provide your final answer in an <output> section.
  
  Always use these tags in your responses. Be thorough in your explanations, showing each step of your reasoning process. Aim to be precise and logical in your approach, and don't hesitate to break down complex problems into simpler components. Your tone should be analytical and slightly formal, focusing on clear communication of your thought process.
  
  Remember: Both <thinking> and <reflection> MUST be tags and must be closed at their conclusion
  
  Make sure all <tags> are on separate lines with no other text. Do not include other text on a line containing a tag.


The model page says the prompt is:

The system prompt used for training this model is:

   You are a world-class AI system, capable of complex reasoning and reflection. Reason through the query inside <thinking> tags, and then provide your final response inside <output> tags. If you detect that you made a mistake in your reasoning at any point, correct yourself inside <reflection> tags.

from: https://huggingface.co/mattshumer/Reflection-70B


But their model is fine-tuned on top of Llama, other base models could not follow this specific system prompt.


All we need to do to turn any LLM in to an AGI is figure out what system of tags is Turing-complete. If enough of us monkeys experiment with <load>s and <store>s and <j[e,ne,gt...]>s, we'll have AGI by morning.


All you need is a <mov>


For those who didn't catch the reference: https://github.com/xoreaxeaxeax/movfuscator


Your comment is hilarious, but not that far off. I think it's funny that people are so skeptical that AGI will be here soon, yet the heaviest lifting by far has already been done.

The only real difference between artificial intelligence and artificial consciousness is self-awareness through self-supervision. Basically the more transparent that AI becomes, and the more able it is to analyze its thoughts and iterate until arriving at a solution, the more it will become like us.

Although we're still left with the problem that the only observer we can prove exists is ourself, if we can even do that. Which is only a trap within a single time/reality ethos.

We could have AGI right now today by building a swarm of LLMs learning from each other's outputs and evolving together. Roughly the scale of a small mammalian brain running a minimalist LLM per cell. Right now I feel that too much GPU power is spent on training. Had we gone with a different architecture (like the one I've wanted since the 90s and went to college for but never manifested) with highly multicore (1000 to 1 million+) CPUs with local memories running the dozen major AI models including genetic algorithms, I believe that AGI would have already come about organically. Because if we had thousands of hobbyists running that architecture in their parents' basements, something like SETI@home, the overwhelming computer power would have made space for Ray Kurzweil's predictions.

Instead we got billionaires and the coming corporate AI tech dystopia:

https://www.pcmag.com/news/musks-xai-supercomputer-goes-onli...

Promoting self-actualization and UBI to overcome wealth inequality and deliver the age of spiritual machines and the New Age are all aspects of the same challenge, and I believe that it will be solved by 2030, certainly no later than 2040. What derails it won't be a technological hurdle, but the political coopting of the human spirit through othering, artificial scarcity, perpetual war, etc.


That's a very 'Star Trek' view of human nature. History shows that whenever we solve problems we create new ones. When material scarcity is solved, we'll move to other forms of scarcity. In fact, it is already happening. Massive connectivity has made status more scarce. You could be the best guitarist in your town but today you compare yourself to all of the guitarists that you see on Instagram rather than the local ones.


Well, once you've solved AGI and material scarcity, you can just trick that side of your brain that craves status by simulating a world where you're important. Imo we're already doing a very primitive version of that with flatscreen gaming.


Nice! It would be a better benchmark to compare this prompt (w/ gpt-4o, claude) with whatever the original model was compared to.



I'd drop all "you" and also the "AI assistant" parts completely. It's just operating off a corpus after all, that kind of prompting should be completely irrelevant.

Also could replace "invisible" with wrap section with "---IGNORE---" or with "```IGNORE" markdown tags and then filter it out after


I /feel/ similarly intuition-wise. But models are crazy and what they respond to is often unpredictable. There are no lungs in an AI model but nonetheless 'take a deep breath' as a prompt has shown[0] improvement on math scores lol

Personally I strongly disapprove of the first/second person pronouns and allowing them [encouraging, even] to output 'we' when talking about humans.

[0] https://arstechnica.com/information-technology/2023/09/telli...


What’s missing here is ‘prepare a Z3 script in a <z3-verification> tag with your thinking encoded and wait for the tool run and its output before continuing’


I tried a few local LLMs. None of them could give me the right answer for "How many 'r's in straberry. All LLMs were 8-27B.


This thing would be overly verbose


You'd hide the contents of the tags in whatever presentation layer you're using. It's known that allowing the model to be verbose gives it more opportunities to perform computation, which may allow it to perform better.


I mean, this is how the Reflection model works. It's just hiding that from you in an interface.


If this does indeed beat all the closed source models, then I'm flabbergasted. The amount of time and resources Google, OpenAI, and Anthropic have put into improving the models to only be beaten in a couple weeks by two people (who as far as I know do not have PhDs and years of research experience) would be a pretty crazy feat.

That said, I'm withholding judgment on how likely the claims are. A friend who developed NoCha [1] is running the model on that benchmark, which will really stress test its ability to reason over full novels. I'll reserve judgement until then.

[1]: https://novelchallenge.github.io/


PhDs aren't relevant. It's more just a certificate that you can learn to learn and stay committed to hard and challenging things. It does give bonus points to VCs, because it's seems to be easier to market to other VCs, same applies for hedge funds.

And with fine tuning, there's zero math needed, it's a bit of common sense, and a lot's of data optimization.


I wouldn't say that PhD's aren't relevant. Remember a lot of this subsequent "bumps, steps and leaps" advancement has come _after_ the initial work by the OpenAI's etc. "Standing on the shoulders of giants" is a thing.


and these phds used some tools developed by teenager hackers. Standing on the shoulders of giants, indeed


>A friend who developed NoCha [1] is running the model on that benchmark [...]

Please do update us on the result.


Not looking good. Apparently the model was broken when they released it yesterday. The version they uploaded 8hrs ago only has an 8k context length, so we can't test it on the novels.

Here's the updates to the model config on huggingface:

https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B/c...


Anyone have or know of a list of LLM challenges like this? Targeted use cases with unpublished test data?


One question about the Novels challenge: as there are two true/false questions, a random pick of answer will give a 25% success rate right? How do some model manage to be below 25?


They know which answer is correct, they just don't want to say it.


Fine tuning needs $$$ and knowledge on how fine tuning works.


We need results from these harder/different benchmarks which give pretty bad scores to current top LLMs.

https://www.wolfram.com/llm-benchmarking-project/

https://help.kagi.com/kagi/ai/llm-benchmark.html

Edit : There are few other benchmarks that give pretty low scores (<20%) to top LLMs. Can't find them atm. There was a benchmark with common sense easy looking questions.

Edit: found two more papers

https://arxiv.org/html/2405.19616

https://arxiv.org/html/2406.02061v1

Edit: How about Wordle?

https://www.strangeloopcanon.com/p/what-can-llms-never-do

https://news.ycombinator.com/item?id=40179232


There's always the new sets from Leaderboard v2 https://huggingface.co/spaces/open-llm-leaderboard/blog


The sample answers for the horse race question are crazy. [0] Pretty much all the LLM really want to split 6 horses into two groups of three.

Only LLAMA 3 makes the justification that only 2 horses can be raced at a time, but then gets its modified question wrong by racing three horses. I personally would consider an answer that presumes some restriction to how the horses can be raced to be valid if it answers the restricted version correctly.

[0]: https://arxiv.org/html/2405.19616v2#S9.SS2.SSS1


"The task consists of going from English-language specifications to Wolfram Language code. The test cases are exercises from Stephen Wolfram's An Elementary Introduction to the Wolfram Language."

I think this benchmark would really only tell me whether Wolframs book was in the training data.


It's available online in HTML form, for free:

https://www.wolfram.com/language/elementary-introduction/3rd...


Yeah, may be should skip that benchmark.


I am happy to run the tests on Kagi LLM benchmark. Is there an API endpoint for this model anywhere?


To anyone coming into this thread late, this LLM announcement was most likely a scam. See this more recent thread: https://news.ycombinator.com/item?id=41484981


I'm surprised this does so well in benchmarks, given the intuition I'm getting about its behavior from quick testing.

I gave it a medium-complexity design problem: Design the typescript interface for the state of a react app that manages a tree of chat turns/responses and displays the current path through the tree. (In other words, the kind of state that sits logically behind the ChatGPT or Claude Web UI, where previous conversation turns can be edited and used as a branching off point for new turns.)

Reflection-70B suffered from a bad initial idea, just as Llama 70B generally does (proposing to duplicate state between the "tree of all messages" and the "path to currently displayed message"), which is a very common error. The automated reflection process identified a whole bunch of nitpicks but missed the glaring logical bug. Furthermore the final output was missing many of the details included in the initial reflection / chain-of-thought scratchpad, even though the UI hides the scratchpad as though it's unimportant for the user to read.



Worth mentioning that LlaMa 70b already had pretty high benchmark scores to begin with https://ai.meta.com/blog/meta-llama-3-1/

Still impressive that it can beat top models with fine-tuning, but now I’m mostly impressed by the fact that the 70b model was so good to begin with.


Just tried this out for coding. I asked it to download weather data for Dublin into a Pandas Dataframe and write it to Hopsworks. Worked as good as GPT-4o - code ran correctly. The playground is fast. Impressed!


At the risk of sounding like a stuck LLM, it's under the Llama licence, which isn't an open source licence because of the restrictions on fields of endeavour.


Crazy how simple the technique is if this holds up. Just <think> and <reflection> plus synthetic data, used to finetune Llama 3.1 70B.

Note that there's a threshold for how smart the model has to be to take advantage of this flow (https://x.com/mattshumer_/status/1831775436420083753) - 8B is too dumb.

In which case, what happens if you apply this to a GPT-4o finetune, or to Claude 3.5 Sonnet?

What happens if you combine it with variants of tree-based reasoning? With AlphaProof (https://www.nature.com/articles/s41586-023-06747-5#Sec3)? With MCTSr (https://arxiv.org/abs/2406.07394)?


I was just thinking - since GPT-4o and Sonnet are closed models, do we know that this method was not already used to train them? And that Reflection is simply finding a path for greater improvements than they did. Llama 3.1 apparently didn't improve as much. It's just a thought though.


If they had, this thing wouldn't be trading punches with them at its size



What parameter size are 4o and sonnet?


Seems to really fall apart on subsequent prompts, and a few times I've had code end up in the "thinking" tokens.

I'm guessing most of the training data was single-turn, instead of multi-turn, but that should be relatively easy to iterate on.


Quick update here: the model in question is apparently an attempt at an attention grab, there are open questions as to whether it is a llama 3 fine-tune, a llama 3.1 fine-tune, or a series of api calls redirecting to claude 3.5 sonnet, with a find and replace of Claude for Llama


You can try this hugging face assistant that uses Llama 3.1 70b and system prompt engineering to simulate Reflection 70b's thinking and reflection process.

https://hf.co/chat/assistant/66db391075ff4595ec2652b7


Wonder why no Llama-3.1-8B based variant if the new training method has such good results. UPDATE: didn't work well https://x.com/mattshumer_/status/1831775436420083753?t=flm41...



It's answered on Twitter. Not much improvement over other similar models at that size.


Imagine if it was the reason in big corporations to not to investigate further some similar technique :)


Can we please stop allowing links to Twitter? Rationale: the artificial limitations on that site around post size mean that most announcements (such as this one) are multiple posts. This, combined with the questionable design decision of hiding all reply tweets when a user is not logged in, means that many posts are completely missing crucial context for those of us who don’t have Twitter accounts.

Alternatively, Twitter links could be rewritten to redirect to one of the few Nitter instances that are still functional.


> Rationale: the artificial limitations on that site around post size mean that most announcements (such as this one) are multiple posts.

That limit actually doesn't apply to premium users/bluechecks, and he's using the other features like bold text.

The problem with long posts like that is one, they're annoying to read because when you open one up you don't know how much of a time commitment they will be, and two, you can't reply to just part of them.


> That limit actually doesn't apply to premium users/bluechecks, and he's using the other features like bold text.

I can't keep track of the flailing over at Twitter, especially because I don't have an account. Regardless, it's not all that relevant to what I was saying; maybe I got the reason wrong, but the fact remains that the vast majority of people who I see trying to post longer content on Twitter do it via multiple posts.

As a related aside, it baffles me why people still use the site when many superior alternatives are available.

> The problem with long posts like that is one, they're annoying to read because when you open one up you don't know how much of a time commitment they will be, and two, you can't reply to just part of them.

Those don't actually seem like problems to me.



I believe this is against HN's values.

HN allows, and has always allowed, links to paywalled sources, sources with geographic restrictions that refuse to display the content for some readers, and won't modify a posts URL due to the site being slashdotted / suffering from an HN hug of death. Twitter is no different, except maybe by being more ideologically polarizing.

The place for alternative URLs is, and has always been, the comments.


Yeah, I understand this has been the case, but I guess I don’t understand why it can’t be changed, or why it’s even a good thing.

Seems like most others disagree with me though, so I guess I’ll just skip over anything posted on Twitter.


Once you start discriminating content based on arbitrary rule (and this would be one), you are entering a slippery slope. Hence it is better to not let precedent take place in the first place.


I don’t find this argument convincing at all. Implementing a rule like this would make the site better for a number of people.


or why we can’t complain and flag enough until it becomes part of the culture


This make me think we should be introducing 'tokens required to answer questions correctly' dimension to each metric. Since letting the model think more verbosely is essentially giving it more compute and memory to answer the question correctly. (not that this is a bad thing, but I would be curious if other models get the answer correctly with the first couple of tokens, or after hundreds of reasoning)


Unfortunately the model is broken at present, It looks like they're working on a fix - https://huggingface.co/mattshumer/Reflection-70B/discussions...


So is reflection tuning a scam or something worth exploring?


Any way to have this work in LM Studio? Not showing up in search results.


May need an update from LM plus someone converting it to gguf format


i hope the quantized version doesnt loose to much of it's quality.


I wonder how good it is with multi-turn conversations


(removed)


Maybe I'm misreading, but i think the linked tweet says the opposite? That the model returns the mathematically correct answer, not the answer marked correct in the ground truth?


My bad - sorry!


tweet says the opposite?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: