Like other comments, I was also initially surprised. But I think the gains are both real and easy to understand where the improvements are coming from.
Under the hood Reflection 70B seems to be a Llama-3.1 finetune that encourages the model to add <think>, <reflection> and <output> tokens and corresponding phases. This is an evolution of Chain-of-Thought's "think step by step" -- but instead of being a prompting technique, this fine-tune bakes examples of these phases more directly into the model. So the model starts with an initial draft and 'reflects' on it before issuing a final output.
The extra effort spent on tokens, which effectively let the model 'think more' appears to let it defeat prompts which other strong models (4o, 3.5 Sonnet) appear to fumble. So for example, when asked "which is greater 9.11 or 9.9" the Reflection 70b model initially gets the wrong answer, then <reflects> on it, then spits the right output.
Personally, the comparison to Claude and 4o doesn't quite seem apples-to-apples. If you were to have 4o/Claude take multiple rounds to review and reflect on their initial drafts, would we see similar gains? I suspect they would improve massively as well.
You are a world-class AI system, capable of complex reasoning and reflection. Reason through the query inside <thinking> tags, and then provide your final response inside <output> tags. If you detect that you made a mistake in your reasoning at any point, correct yourself inside <reflection> tags.
> Personally, the comparison to Claude and 4o doesn't quite seem apples-to-apples. If you were to have 4o/Claude take multiple rounds to review and reflect on their initial drafts, would we see similar gains? I suspect they would improve massively as well.
They may already implement this technique, we can't know.
Claude 3.5 does have some "thinking" ability - I've seen it pause and even say it was thinking before. Presumably this is just some output it decides not to show you.
THIS!!!!!!! People act like Claude and 4o are base models with no funny business behind the scenes, we don't know just how much additional prompt steps are going on for each queue, all we know is what the API or Chat interface dump out, what is happening behind that is anyones guess.. The thinking step and refinement steps likely do exist on all the major commercial models. It's such a big gain for a minimal expenditure of backend tokens, WTF wouldn't they be doing it to improve the outputs?
That's only in the web version, it's just that they prompt it to do some CoT in the antThinking XML tag, and hide the output from inside that tag in the UI.
I suspect GPT4o already has training for CoT. I've noticed it often responds by saying something like "let's break it down step by step". Or maybe it's the system prompt.
I am not sure, but you seem to be implying that the Reflection model is running through multiple rounds? If so, that is not what is happening here. The token generation is still linear next token prediction. It does not require multiple rounds to generate the chain of thought response. It does that in one query pass.
I have been testing the model for the last few hours and it does seem to be an improvement on LLAMA 3.1 upon which it is based. I have not tried to compare it to Claude or GPT4o because I don't expect a 70b model to outperform models of that class no matter how good it is. I would happy to be wrong though...
I had a similar idea[0], interesting to see that it actually works. The faster LLM workloads can be accelerated, the more ‘thinking’ the LLM can do before it emits a final answer.
Further than that, it feels like we could use constrained generation of outputs [0] to force the model to do X amount of output inside of a <thinking> BEFORE writing an <answer> tag. It might not always produce good results, but I'm curious what sort of effect it might have to convince models that they really should stop and think first.
You can somewhat recreate the essence of this using a system prompt with any sufficiently sized model. Here's the prompt I tried for anybody who's interested:
You are an AI assistant designed to provide detailed, step-by-step responses. Your outputs should follow this structure:
1. Begin with a <thinking> section. Everything in this section is invisible to the user.
2. Inside the thinking section:
a. Briefly analyze the question and outline your approach.
b. Present a clear plan of steps to solve the problem.
c. Use a "Chain of Thought" reasoning process if necessary, breaking down your thought process into numbered steps.
3. Include a <reflection> section for each idea where you:
a. Review your reasoning.
b. Check for potential errors or oversights.
c. Confirm or adjust your conclusion if necessary.
4. Be sure to close all reflection sections.
5. Close the thinking section with </thinking>.
6. Provide your final answer in an <output> section.
Always use these tags in your responses. Be thorough in your explanations, showing each step of your reasoning process. Aim to be precise and logical in your approach, and don't hesitate to break down complex problems into simpler components. Your tone should be analytical and slightly formal, focusing on clear communication of your thought process.
Remember: Both <thinking> and <reflection> MUST be tags and must be closed at their conclusion
Make sure all <tags> are on separate lines with no other text. Do not include other text on a line containing a tag.
The system prompt used for training this model is:
You are a world-class AI system, capable of complex reasoning and reflection. Reason through the query inside <thinking> tags, and then provide your final response inside <output> tags. If you detect that you made a mistake in your reasoning at any point, correct yourself inside <reflection> tags.
All we need to do to turn any LLM in to an AGI is figure out what system of tags is Turing-complete. If enough of us monkeys experiment with <load>s and <store>s and <j[e,ne,gt...]>s, we'll have AGI by morning.
Your comment is hilarious, but not that far off. I think it's funny that people are so skeptical that AGI will be here soon, yet the heaviest lifting by far has already been done.
The only real difference between artificial intelligence and artificial consciousness is self-awareness through self-supervision. Basically the more transparent that AI becomes, and the more able it is to analyze its thoughts and iterate until arriving at a solution, the more it will become like us.
Although we're still left with the problem that the only observer we can prove exists is ourself, if we can even do that. Which is only a trap within a single time/reality ethos.
We could have AGI right now today by building a swarm of LLMs learning from each other's outputs and evolving together. Roughly the scale of a small mammalian brain running a minimalist LLM per cell. Right now I feel that too much GPU power is spent on training. Had we gone with a different architecture (like the one I've wanted since the 90s and went to college for but never manifested) with highly multicore (1000 to 1 million+) CPUs with local memories running the dozen major AI models including genetic algorithms, I believe that AGI would have already come about organically. Because if we had thousands of hobbyists running that architecture in their parents' basements, something like SETI@home, the overwhelming computer power would have made space for Ray Kurzweil's predictions.
Instead we got billionaires and the coming corporate AI tech dystopia:
Promoting self-actualization and UBI to overcome wealth inequality and deliver the age of spiritual machines and the New Age are all aspects of the same challenge, and I believe that it will be solved by 2030, certainly no later than 2040. What derails it won't be a technological hurdle, but the political coopting of the human spirit through othering, artificial scarcity, perpetual war, etc.
That's a very 'Star Trek' view of human nature. History shows that whenever we solve problems we create new ones. When material scarcity is solved, we'll move to other forms of scarcity. In fact, it is already happening. Massive connectivity has made status more scarce. You could be the best guitarist in your town but today you compare yourself to all of the guitarists that you see on Instagram rather than the local ones.
Well, once you've solved AGI and material scarcity, you can just trick that side of your brain that craves status by simulating a world where you're important. Imo we're already doing a very primitive version of that with flatscreen gaming.
I'd drop all "you" and also the "AI assistant" parts completely. It's just operating off a corpus after all, that kind of prompting should be completely irrelevant.
Also could replace "invisible" with wrap section with "---IGNORE---" or with "```IGNORE" markdown tags and then filter it out after
I /feel/ similarly intuition-wise. But models are crazy and what they respond to is often unpredictable. There are no lungs in an AI model but nonetheless 'take a deep breath' as a prompt has shown[0] improvement on math scores lol
Personally I strongly disapprove of the first/second person pronouns and allowing them [encouraging, even] to output 'we' when talking about humans.
What’s missing here is ‘prepare a Z3 script in a <z3-verification> tag with your thinking encoded and wait for the tool run and its output before continuing’
You'd hide the contents of the tags in whatever presentation layer you're using. It's known that allowing the model to be verbose gives it more opportunities to perform computation, which may allow it to perform better.
If this does indeed beat all the closed source models, then I'm flabbergasted. The amount of time and resources Google, OpenAI, and Anthropic have put into improving the models to only be beaten in a couple weeks by two people (who as far as I know do not have PhDs and years of research experience) would be a pretty crazy feat.
That said, I'm withholding judgment on how likely the claims are. A friend who developed NoCha [1] is running the model on that benchmark, which will really stress test its ability to reason over full novels. I'll reserve judgement until then.
PhDs aren't relevant. It's more just a certificate that you can learn to learn and stay committed to hard and challenging things. It does give bonus points to VCs, because it's seems to be easier to market to other VCs, same applies for hedge funds.
And with fine tuning, there's zero math needed, it's a bit of common sense, and a lot's of data optimization.
I wouldn't say that PhD's aren't relevant. Remember a lot of this subsequent "bumps, steps and leaps" advancement has come _after_ the initial work by the OpenAI's etc. "Standing on the shoulders of giants" is a thing.
Not looking good. Apparently the model was broken when they released it yesterday. The version they uploaded 8hrs ago only has an 8k context length, so we can't test it on the novels.
Here's the updates to the model config on huggingface:
One question about the Novels challenge: as there are two true/false questions, a random pick of answer will give a 25% success rate right?
How do some model manage to be below 25?
Edit : There are few other benchmarks that give pretty low scores (<20%) to top LLMs. Can't find them atm. There was a benchmark with common sense easy looking questions.
The sample answers for the horse race question are crazy. [0] Pretty much all the LLM really want to split 6 horses into two groups of three.
Only LLAMA 3 makes the justification that only 2 horses can be raced at a time, but then gets its modified question wrong by racing three horses. I personally would consider an answer that presumes some restriction to how the horses can be raced to be valid if it answers the restricted version correctly.
"The task consists of going from English-language specifications to Wolfram Language code. The test cases are exercises from Stephen Wolfram's An Elementary Introduction to the Wolfram Language."
I think this benchmark would really only tell me whether Wolframs book was in the training data.
I'm surprised this does so well in benchmarks, given the intuition I'm getting about its behavior from quick testing.
I gave it a medium-complexity design problem: Design the typescript interface for the state of a react app that manages a tree of chat turns/responses and displays the current path through the tree. (In other words, the kind of state that sits logically behind the ChatGPT or Claude Web UI, where previous conversation turns can be edited and used as a branching off point for new turns.)
Reflection-70B suffered from a bad initial idea, just as Llama 70B generally does (proposing to duplicate state between the "tree of all messages" and the "path to currently displayed message"), which is a very common error. The automated reflection process identified a whole bunch of nitpicks but missed the glaring logical bug. Furthermore the final output was missing many of the details included in the initial reflection / chain-of-thought scratchpad, even though the UI hides the scratchpad as though it's unimportant for the user to read.
Just tried this out for coding. I asked it to download weather data for Dublin into a Pandas Dataframe and write it to Hopsworks. Worked as good as GPT-4o - code ran correctly. The playground is fast. Impressed!
At the risk of sounding like a stuck LLM, it's under the Llama licence, which isn't an open source licence because of the restrictions on fields of endeavour.
I was just thinking - since GPT-4o and Sonnet are closed models, do we know that this method was not already used to train them? And that Reflection is simply finding a path for greater improvements than they did. Llama 3.1 apparently didn't improve as much. It's just a thought though.
Quick update here: the model in question is apparently an attempt at an attention grab, there are open questions as to whether it is a llama 3 fine-tune, a llama 3.1 fine-tune, or a series of api calls redirecting to claude 3.5 sonnet, with a find and replace of Claude for Llama
You can try this hugging face assistant that uses Llama 3.1 70b and system prompt engineering to simulate Reflection 70b's thinking and reflection process.
Can we please stop allowing links to Twitter? Rationale: the artificial limitations on that site around post size mean that most announcements (such as this one) are multiple posts. This, combined with the questionable design decision of hiding all reply tweets when a user is not logged in, means that many posts are completely missing crucial context for those of us who don’t have Twitter accounts.
Alternatively, Twitter links could be rewritten to redirect to one of the few Nitter instances that are still functional.
> Rationale: the artificial limitations on that site around post size mean that most announcements (such as this one) are multiple posts.
That limit actually doesn't apply to premium users/bluechecks, and he's using the other features like bold text.
The problem with long posts like that is one, they're annoying to read because when you open one up you don't know how much of a time commitment they will be, and two, you can't reply to just part of them.
> That limit actually doesn't apply to premium users/bluechecks, and he's using the other features like bold text.
I can't keep track of the flailing over at Twitter, especially because I don't have an account. Regardless, it's not all that relevant to what I was saying; maybe I got the reason wrong, but the fact remains that the vast majority of people who I see trying to post longer content on Twitter do it via multiple posts.
As a related aside, it baffles me why people still use the site when many superior alternatives are available.
> The problem with long posts like that is one, they're annoying to read because when you open one up you don't know how much of a time commitment they will be, and two, you can't reply to just part of them.
HN allows, and has always allowed, links to paywalled sources, sources with geographic restrictions that refuse to display the content for some readers, and won't modify a posts URL due to the site being slashdotted / suffering from an HN hug of death. Twitter is no different, except maybe by being more ideologically polarizing.
The place for alternative URLs is, and has always been, the comments.
Once you start discriminating content based on arbitrary rule (and this would be one), you are entering a slippery slope. Hence it is better to not let precedent take place in the first place.
This make me think we should be introducing 'tokens required to answer questions correctly' dimension to each metric. Since letting the model think more verbosely is essentially giving it more compute and memory to answer the question correctly.
(not that this is a bad thing, but I would be curious if other models get the answer correctly with the first couple of tokens, or after hundreds of reasoning)
Maybe I'm misreading, but i think the linked tweet says the opposite? That the model returns the mathematically correct answer, not the answer marked correct in the ground truth?
Under the hood Reflection 70B seems to be a Llama-3.1 finetune that encourages the model to add <think>, <reflection> and <output> tokens and corresponding phases. This is an evolution of Chain-of-Thought's "think step by step" -- but instead of being a prompting technique, this fine-tune bakes examples of these phases more directly into the model. So the model starts with an initial draft and 'reflects' on it before issuing a final output.
The extra effort spent on tokens, which effectively let the model 'think more' appears to let it defeat prompts which other strong models (4o, 3.5 Sonnet) appear to fumble. So for example, when asked "which is greater 9.11 or 9.9" the Reflection 70b model initially gets the wrong answer, then <reflects> on it, then spits the right output.
Personally, the comparison to Claude and 4o doesn't quite seem apples-to-apples. If you were to have 4o/Claude take multiple rounds to review and reflect on their initial drafts, would we see similar gains? I suspect they would improve massively as well.