Orca 2: Teaching Small Language Models How to Reason (arxiv.org)
310 points by fgfm on Nov 21, 2023 | hide | past | favorite | 80 comments



> Progressive Learning: We start with LLaMA-2-7B or LLaMA-2-13B checkpoint and finetune it on the train split of FLAN-v2 dataset for one epoch. Note that FLAN-v2 dataset contains both zero-shot and few-shot problems. We then train on 5 million ChatGPT data from Orca 1 for 3 epochs. Then we train on the combination of 1 million GPT-4 data from Orca 1 and Orca 2’s 817K data for 4 epochs.
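
For readers skimming the quote, here is a minimal schematic of that three-stage schedule. The dataset names and the finetune() stub are placeholders for illustration, not the authors' code:

  # Schematic of the progressive-learning recipe quoted above.
  STAGES = [
      ("flan_v2_train_split", 1),           # zero-shot and few-shot problems, 1 epoch
      ("orca1_chatgpt_5m", 3),              # 5M ChatGPT responses from Orca 1, 3 epochs
      ("orca1_gpt4_1m_plus_orca2_817k", 4), # 1M GPT-4 + 817K Orca 2 examples, 4 epochs
  ]

  def finetune(checkpoint: str, dataset: str, epochs: int) -> str:
      """Stand-in for a real SFT run; returns the name of the new checkpoint."""
      print(f"finetuning {checkpoint} on {dataset} for {epochs} epoch(s)")
      return f"{checkpoint}+{dataset}"

  checkpoint = "llama-2-13b"  # or llama-2-7b
  for dataset, epochs in STAGES:
      checkpoint = finetune(checkpoint, dataset, epochs)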

I think people are missing why they are comparing against Llama-2 13B/70B. They improved Llama-2 7B/13B and reach the level of a 5-10x larger model of the same base.

This is huge. Models on HF.

https://huggingface.co/papers/2311.11045


...and quantized ones from the usual suspect:

https://huggingface.co/TheBloke/Orca-2-7B-GGUF

https://huggingface.co/TheBloke/Orca-2-13B-GGUF

The 7B Q5_K_M one is small enough to run on an 8GB consumer GPU.
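
For anyone who wants to try that locally, a minimal sketch with the llama-cpp-python bindings (the GGUF filename follows TheBloke's naming convention and may differ; the size and VRAM notes are rough estimates, not from the thread):

  # pip install llama-cpp-python (built with GPU support)
  from llama_cpp import Llama

  llm = Llama(
      model_path="orca-2-7b.Q5_K_M.gguf",  # ~4.8 GB of quantized weights
      n_gpu_layers=-1,                     # offload all layers to the GPU
      n_ctx=4096,
  )

  out = llm("Explain step by step why the sky is blue.", max_tokens=256)
  print(out["choices"][0]["text"])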


All the 13B files seem to be quantized.


Yeah, the 13b model outperforms the 70b Llama 2. Goes to show how much potential there is on the software optimization front as opposed to just scaling in size


It isn't.

Compared to the original Orca model and method, which spawned many of the current SotA OSS models, the Orca 2 models seem to underwhelm, scoring below outdated 13b models and below Mistral 7b base models (e.g. [1]; didn't test myself yet, ymmv).

[1] https://twitter.com/abacaj/status/1727004543668625618?t=R_vV...


For smaller models, I'm impressed by Mistral-7b or fine-tuned variants like Zephyr. I use it regularly in Neovim[1] for mundane tasks (grammar correction, summaries, ...). I'm curious how Orca 2 performs, downloading it right now.

[1]: with https://github.com/David-Kunz/gen.nvim


Also the OpenChat-3.5 model (it has 7B parameters; I think it is also a Mistral finetune), demo: https://openchat.team/


Nice, it passes the weather test. I always ask open source models what the weather is like and see whether they hallucinate my location and a forecast. A few months ago, without exception, all models I tried (even larger ones) would just make up a temperature. Now it replies as it should. Cool!

> what's the weather like today?

> I'm sorry, but I can't provide real-time weather information. However, I can help you with general information about weather conditions and forecasting.


Oh wow, this model is kind of amazing: it passes my "creative" tests that only ChatGPT 3.5 did decently well on. I've recently been disillusioned that open source has been moving the wrong way due to the focus on benchmarks, but this model seems to hit the spot in usefulness for more whacky prompts ("write X in the style of Y" kinds of prompts).


Always surprised how poorly these models do relative to the benchmarks they claim to do well on. OpenChat has a benchmark radar diagram[1] but often fails on actual samples.

[1] https://github.com/imoneoi/openchat


I'd love to see some demo of that!


A demo video is in the README (I used Mistral-7b in there).


Amazing, thank you!


Haven't seen this neovim plugin before! I'm setting this up right now.


A really important nuance here is that they are building on top of Llama-2, the pretrained model, and not Llama-2-chat.

I really think the entire field is doing a degree of damage with the chat fine tuning, beyond what might be expected, because that chat instruction regularly includes an emphasis on identification as an LLM.

The problem with this is that nearly all of the training data it's performing next token prediction on is text generated by humans.

So there's an inherent narrowing of the model scope with most of the fine tuning I've seen. While pretrained models are harder to use, I regularly prefer them over chat models when both are available: even at similar temperatures, the quality and variety of language is much better in the pretrained model than in the chat model.

This fine tuning only introduced a bias towards logical step-by-step analysis and problem solving techniques, and the results are great. But I'm willing to bet that an identical fine tuning on top of the chat model would have been much worse on the evaluations: not just the compounding of a typical fine tuning loss of a few percent, but more like a double-digit relative difference.

It's quite frustrating that the anxiety over model safety is likely throwing out tens of millions of dollars worth of data in the pretrained model when only chat models are available for the SotA, and I hope in the future a lighter touch is taken on fine tuning the pretrained model and instead of focusing on safety inherent to the model it is just set behind a safety oriented discriminator or 'editor' which filters or modifies responses accordingly.

I'd happily take a 2-3x increased API cost for a much more broadly capable and performant model with similar safety characteristics but without the handicaps that come with it.

So while a lot of the gains here might be due to the fine tuning, I expect at least part is shrugging off the baggage of the chat/safety fine tuning as well. Even in the first detailed example, we can see that while Llama-2 goes off rambling later on, its statement of John's relative knowledge is much clearer and better connected between initial conditions and result than Llama-2-chat's, particularly regarding theory of mind (i.e. "he assumed" vs the latter's "it must be in").


Adding to this - it's really interesting the safety stuff that *is* in this paper. Such as:

> We probe some of the categories where we see a larger difference (e.g., violent) and observe that Orca 2 tends to counter the harmful positions more often (which is penalized by the metric), while models that have gone through RLHF safety training tend to decline to respond more often (which is rewarded by the metric).

Or the fact Orca 2 is less likely to extend hate speech than Llama-2-chat which theoretically went through safety fine tuning even though Orca 2 did not have any explicit safety fine tuning.

Research over the past year has really demonstrated (a) just how impactful fine tuning can be - to the point of transmitting capabilities from larger models to smaller, and (b) that we're still clumsily wading through that process with only partial clarity on best practices as the foundational pretrained models get better and better at astounding rates.


I really really want this to work.

However, at this point benchmark success is about as meaningful as results from someone who has been "taught to the test".

If say… Merck wanted to use this same model to reason out a logistics issue, or apply it to some business problem at scale - you’d have to deal with hallucinations all over the place.

The best analogy I have right now is that improved results on benchmarks are like better acting from Hugh Laurie as House.

If you want to watch a show - great (generative work)

If you want to get a prescription - then not so much.


I'm not a real AI doctor, I just play one on chat.openai.com.


> Merck wanted to use this same model to reason out a logistics issue, or apply it to some business problem at scale - you’d have to deal with hallucinations all over the place.

I wouldn't think Merck would leave it all to the model? There will still be humans in the loop ensuring that the output is valid for their use case? I don't think we are there yet where we can completely productionize these models without any human involvement whatsoever.


At the moment I read "how to reason" in the headline my bullshit detector started to go off.

LLMs do not reason, they do not think, they are not AGI. They generate by regurgitating.


I haven’t heard a definition of “reasoning” or “thinking” that proves humans aren’t doing exactly that same probabilistic regurgitation.

I don’t think it’s possible to prove; feels like a philosophical question.


I won't define reasoning, just call out one aspect.

We have the ability to follow a chain of reasoning, say "that didn't work out", backtrack, and consider another. ChatGPT seems to get tangled up when its first (very good) attempt goes south.

This is definitely a barrier that can be crossed by computers. AlphaZero is better than we are at it. But it is a thing we do which we clearly don't simply do with the probabilistic regurgitation method that ChatGPT uses.

That said, the human brain combines a bunch of different areas that seem to work in different ways. Our ability to engage in this kind of reason, for example, is known to mostly happen in the left frontal cortex. So it seems likely that AGI will also need to combine different modules that work in different ways.

On that note, when you add tools to ChatGPT, it suddenly can do a lot more than it did before. If those tools include the right feedback loops, the ability to store/restore context, and so on, what could it then do? This isn't just a question of putting the right capabilities in a box. They have to work together for a goal. But I'm sure that we haven't achieved the limit of what can be achieved.


These are things we can teach children to do when they don't do it at first. I don't see why we can't teach this behavior to AI. Maybe we should teach LLMs to play games or something, or do those proof thingys that they teach in US high school geometry, so they learn some formal structure within which they can think about the world.


Instead of going back, you can construct a tree of different reasonings with an LLM, then take a vote or synthesise; see Tree of Thoughts prompting.


It feels like humans do do a similar regurgitation as part of a reasoning process, but if you play around with LLMs and ask them mathematical questions beyond the absolute basics it doesn’t take long before they trip up and reveal a total lack of ‘understanding’ as we would usually understand it. I think we’re easily fooled by the fact that these models have mastered the art of talking like an expert. Within any domain you choose, they’ve mastered the form. But it only takes a small amount of real expertise (or even basic knowledge) to immediately spot that it’s all gobbledygook and I strongly suspect that when it isn’t it’s just down to luck (and the fact that almost any question you can ask has been asked before and is in the training data). Given the amount of data being swallowed, it’s hard to believe that the probabilistic regurgitation you describe is ever going to lead to anything like ‘reasoning’ purely through scaling. You’re right that asking what reasoning is may be a philosophical question, but you don’t need to go very far to empirically verify that these models absolutely do not have it.


On the other hand, it seems rather intuitive that we have a logic-based component? It's the underpinning of science. We have to be taught when we've stumbled upon something that needs to be tested. But we can be taught that. And then once we learn to recognize it, we intuitively do so in action. ChatGPT can do this in a rudimentary way as well. It says a program should work a certain way. Then it writes it. Then it runs it. Then, when the answer doesn't come out as expected (at this point, probably just error cases), it goes back and changes it.

It seems similar to what we do, if on a more basic level. At any rate, it seems like a fairly straightforward 1-2 punch that, even if not truly intelligent, would let it break through its current barriers.


LLMs can be trained on all the math books in the world, starting from the easiest to the most advanced, they can regurgitate them almost perfectly, yet they won't apply the concepts in those books to their actions. I'd count the ability to learn new concepts and methods, then being able to use them as "reasoning".


Aren't there quite a few examples of LLMs giving out-of-distribution answers to stated problems? I think there are two issues with LLMs and reasoning:

1. They are single-pass and static - you "fake" short-term memory by re-feeding the question along with its answer.

2. They have no real goal to achieve - one that they would split into sub-goals, plan to achieve, estimate the returns of each, etc.

As for 2, I think this is the main point of e.g. LeCun: LLMs in themselves are simply single-modality world models, and they lack the other components needed to make them true agents capable of reasoning.


It's those kinds of examples that make it hard to arrive at a clean measurement of success.

Based on those kinds of results an LLM should, in theory, be able to plan, analyze and suggest improvements, without the need for human intervention.

You will see rudimentary success for this as well - however, when you push the tool further, it will stop being... "logical".

I'd refine the point to saying that you will get some low hanging fruit in terms of syntactic prediction and semantic analysis.

But when you lean ON semantic ability, the model is no longer leaning on its syntactic data set, and it fails to generalize.


It’s possible to prove.

Use an LLM to do a real world task that you should be able to achieve by reasoning.


> Use an LLM to do a real world task that you should be able to achieve by reasoning.

Such as explaining the logical fallacies in this argument and the one above?


Take anything, see how far you get before you have to really grapple with hallucination.

Once that happens, your mitigation strategy will end up being the proof.


I mean I know you're joking but yes, it would be able to do that.


Just yesterday I saw an example of a person asking GPT what "fluftable" means. The word was invented by their little daughter and they didn't know what it meant. GPT reasoned it was a portmanteau of "fluffy" and "comfortable", and it made sense because it was used in reference to a pillow. If it's just regurgitation, I'd like to know how it's able to understand novel words not found in the training data...


I would read Francois Chollet's explanation of this. It's very good: https://fchollet.substack.com/p/how-i-think-about-llm-prompt...

For words that are not in the model's vocabulary, like 'fluftable', the model uses a subword tokenization strategy. It breaks down the word into smaller known subunits (subwords or characters) and represents each subunit with its own vector. By understanding the context in which 'fluftable' appears and comparing it to known words with similar subunits, the model can infer a plausible meaning for the word. This is done by analyzing the vector space in which these representations exist, observing how the vectors align or differ from those of known words.

'As always, the most important principle for understanding LLMs is that you should resist the temptation of anthropomorphizing them.'
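
To see the subword splitting described above concretely, here is a quick sketch with tiktoken (OpenAI's open-source tokenizer library); the exact pieces depend on the vocabulary, so the split in the comment is illustrative only:

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")   # vocabulary used by GPT-3.5/GPT-4 models
  ids = enc.encode("fluftable")
  print([enc.decode([i]) for i in ids])        # a few known subword pieces, not one token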


I'm sorry, but that's absurd. Being able to explain the precise mechanism behind reasoning would make anything sound like it's not reasoning, because of our prior experiences. If we understood human reasoning well enough to explain exactly what happens in our brain, you would conclude that we're not really reasoning because you can provide an explanation of how we're reasoning about novel, out of distribution data. This is "God of the gaps" for thought.


What you've written does nothing to disabuse any reasonable person of the notion that LLMs cannot reason; if anything you've explained how LLM's reason, not that they cannot do it.


isn't 'infer' another word for reason?


vector math in a 1536-dimensional space?


Because you're not understanding what it's regurgitating. It's not a fact machine that regurgitates knowledge; in fact it's not really so good at that. It regurgitates plausible patterns of language, and combining words like this is hardly a rare pattern.


Which is also within the realm of House MD vs. a real doctor, potentially even more so.

LLMs are trained on reams of text, so good performance here is not unexpected.

To put it another way: would you hire ChatGPT?

For work, you need to have more than text skills.


With only the information we had in 2020, the two theories “language models don’t reason, they regurgitate” and “as language models scale, they begin to think and reason” made predictions, and the people who invested time and money based on the predictions of the latter theory have done well for themselves.


The people who bet on generative tasks are getting mileage out of it.

People who bet on reasoning tasks, not so much.


If you're trying to tell me there's a sucker born every minute, I knew that.


AGI doesn't reason either. No one defines AGI as "AI, but with reasoning". It's usually "AI that outperforms humans at all disciplines, by any degree". Maybe you confused it with ASI, but even then reasoning isn't a requirement afaik.


Reasoning is a learnt concept that involves retrieving memories and running them through an algorithm, also retrieved from memory, and then you loop the process until a classifier deems the result adequate for the given goal.


I asked GPT-4 and it had some counterpoints:

Reasoning blends learned skills and natural cognition. It integrates new information, not just past memories. Reasoning is adaptable, not rigidly algorithmic. Emotions and context also shape reasoning.

which seemed to make sense.


I hope this will be found in history books, and some students will point out the irony that people are relying on GPT-4's arguments about reasoning in a thread where it's proclaimed that said model can't reason.


In fact it is not absurd or weird. The model does not need to be capable of x/reasoning to produce knowledge about x/reasoning. A book with a chapter on x/reasoning doesn't reason either.


Did you only read the title? Because the abstract gives you a pretty good idea of what they mean when they say reason. It's pretty easy to understand. No need to immediately call bullshit just because of a minor semantic disagreement.

>ThEY DON'T tHiNk. They'rE JuSt STochAStiC pARrotS. It'S not ReAL AGi.

It doesn't even matter if these claims are true or not. They're missing the point of the conversation and the paper. Reason is a perfectly valid word to use. So is think. If you ask it a question and then follow up with 'think carefully' or 'explain carefully', you'll get the same response.

inb4 AcTUALLy LlMS Can'T do aNYtHIng CaRefUlly BECaUse pRogRAms ARen'T caRefUl


You are simply incorrect. They can reason.


And how can you tell they reason rather than parrot some text from the training data?

There are papers about trying LLMs on generated reasoning problems, and they usually fail.


>Usually

That implies - sometimes not. Which would prove at least some reasoning capabilities.


In this case I used 'usually' because I don't remember all the details and didn't want to generalize by saying 'always', but also the training/benchmarking protocol can be flawed; for example, an LLM can still solve a shallow reasoning problem by memorizing a pattern.


Orca 2-13B consistently beat Llama 2-70B on most benchmarks in 0-shot. Hopefully, research papers will start to include Mistral/Zephyr 7B & Openchat 3.5. Even though they're smaller, they're getting competitive against much larger models and they're much cheaper to orchestrate.


It fails other benchmarks vs Mistral-7b. https://twitter.com/Teknium1/status/1726846755344634020

(There are some doubts about the validity of the comparison in the comments.)


Also, worth mentioning the next tweet:

  Update, I benchmarked 13b Orca 2, its still not surpassing gpt4all score of
  Base Mistral or OpenHermes 2.5 7B:

  Hermes 2.5 7B Mistral score: 73.12%
  Mistral Base 7B score: 71.16%
  Orca 13B GPT4All score: 70.58%

https://twitter.com/Teknium1/status/1726833004117635414


Are we beginning to see "specialized SLMs"? We've already seen some pretend-agent based solutions (where the same model is given several different roles and made to act as e.g. CEO / architect / dev / sales in a startup).

I wonder if the way forward is to train smaller models with different sets of "skills" or "neural affinities". One for reasoning, one for summarization, one for math, one for code, etc - then combining them into full-fledged solutions. Perhaps smaller models can be "better" at their specific domains/tasks than the giant generalist models can be at any of them.


Yes, I think that is the general trend. Have one model tuned for reasoning that decides a plan, based on which you invoke other models as tools (see e.g. the ReWOO paper[0]). If I had to guess, an approach like this is what powers the recent Custom GPT/Assistant API products (based on the lag between tool invocations I would guess that they also re-prompt for plan adjustments between every set of tool calls).

Do that with a small model and hot-swap LoRAs (see the sketch below), and it should be possible to build a quite powerful local assistant on consumer hardware.

[0]: https://arxiv.org/abs/2305.18323
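
A minimal sketch of the hot-swapping idea using the Hugging Face peft API; the adapter repo names below are hypothetical placeholders, not real checkpoints:

  from transformers import AutoModelForCausalLM
  from peft import PeftModel

  base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
  model = PeftModel.from_pretrained(base, "your-org/planner-lora", adapter_name="planner")
  model.load_adapter("your-org/summarizer-lora", adapter_name="summarizer")

  model.set_adapter("planner")     # run the planning prompt with the planner adapter
  # ... generate a plan, call tools ...
  model.set_adapter("summarizer")  # swap adapters without reloading the 7B base weights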


Yes, this is the trend. OAI's marketplace of GPTs is a confirmation of this. BabyAGI, AutoGen, and AutoGPT are all multiple-LLM/SLM architectures under the hood. While we don't have access to proprietary data or the ability to run bigger models, the natural direction is to combine them with specialized tasks like you just described. The issue is then the interface: making it good, with seamless communication between models, the roles they play, and the architecture they operate in. The last point is up to your imagination.


Specialized LLMs, and likely SLMs too, are really the future. I use them mostly to aid programming and really just stopped paying for GPT-4. Phind and others are really on par now in my coding needs.


Isn't this the whole idea behind the Mixture of Experts approach that GPT-4 is using?


Isn't MoE with switch transformers massively inefficient compared to being able to customize which LLMs you are using?

I've seen a lot of agent-swarm concepts in the smaller LLM space that seem to suggest this is a viable avenue of research.


Is GPT-4's MOE based on combining specialized models?


This is why imho Microsoft is way cooler than Apple. They have tons of published research. In Apple, even speaking about your research with a friend may result in severe punishment.


Apple publishes too (search for it), just much less.


Much, much, less. They are definitely not in the same league.


I'm not sure if I'm missing something from the paper, but are multi-billion parameter models getting called "small" language models now? And when did this paradigm shift happen?


All the Llama models, including the 70B one, can run on consumer hardware. You might be able to fit GPT-3 (175B) at Q4 or Q3 on a Mac Studio, but that's probably the limit for consumer hardware. At 4-bit, a 7B model requires some 4GB of RAM, so it should be possible to run on a phone, just not very fast.
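
The sizes above follow from a simple weights-only back-of-envelope calculation (ignoring the KV cache and runtime overhead, so real memory use is somewhat higher):

  def weight_gb(params_billion: float, bits_per_weight: float) -> float:
      # (params_billion * 1e9 * bits / 8) bytes, expressed in GB
      return params_billion * bits_per_weight / 8

  print(weight_gb(7, 4))     # ~3.5 GB  -> the "some 4GB" for a 7B model at 4-bit
  print(weight_gb(70, 4))    # ~35 GB   -> why 70B is still within consumer reach
  print(weight_gb(175, 4))   # ~87.5 GB -> roughly what 175B at Q4 would need on a Mac Studio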


GPT-3.5 Turbo is 20B.


I doubt that. What's your source?


There was a paper published by Microsoft that seemed to leak this detail. I'm on mobile right now and don't have a link but it should be searchable


The paper was https://arxiv.org/abs/2310.17680

It has been withdrawn with this note:

> Contains inappropriately sourced conjecture of OpenAI's ChatGPT parameter count from this http URL, a citation which was omitted. The authors do not have direct knowledge or verification of this information, and relied solely on this article, which may lead to public confusion

(The noted URL is just a Forbes blogger with no special qualifications that would make what he claimed particularly credible.)


Nowadays, small essentially means realistically useable on prosumer hardware.


When 175B, 300B, 1.8T models are considered large, 7B is considered small.


Relative term. In the world of LLMs, 7b is small.



Released under the MS Research License, so not OSI and non-commercial, for the curious.

https://huggingface.co/microsoft/Orca-2-13b/blob/main/LICENS...


This sounds quite exciting! Like Mistral all over again, only more transparent and open, and with major backing, probably because Microsoft is looking to significantly reduce costs now that it's expanding AI across its platforms. The approach truly feels like a next step in LLM design.


Official Orca-2 demo is available on huggingface Spaces now - https://huggingface.co/spaces/ari9dam/Orca-2-13B



