Stop doing self-correction within the context of the model's own generation.
The previous paper on self correction told the model "you previously said X - are there errors with this?"
This one statically adds the mistakes to the prompt, as a task prompt and response with no additional context, immediately before asking if it has any errors.
Think about the training data.
How often does the training data of most of the Internet reflect users identifying issues with their own output?
How often does the training data reflect users identifying issues with someone else's output?
Try doing self-correction by setting up the context of "this was someone else's answer". It is still technically self-correction if a model is reviewing its own output in that context - it just isn't set up as "correct your own answer."
This may even be part of why the classifier did a better job at identifying issues - less the fine tuning and more the context (unfortunately I don't see the training/prompts for the classifier in their GitHub repo).
It really seems like the aversion to anthropomorphizing LLMs is leading people to ignore or overlook relevant patterns in the highly anthropomorphic training data fed into them. We might not want to entertain that a LLM has a concept of self vs other or a bias between critiques based on such a differentiation, and yet the training data almost certainly reflects such a concept and bias.
I'd strongly encourage future work on self-correction to explicitly define the thing being evaluated as the work of another. (Or ideally even compare self-correction rates between critiques in the context of their own output vs another's output.)
I see lots of people trying to prompt with incomplete sentences, not capitalizing, using slang, bad grammar, imprecise terminology, etc. And it still works. However, I find that you get a noticeable quality boost if you use proper English and treat it more like a human.
"i want a python app that calculates a roadtrip for me"
vs
"Please write me a Python program using a map API that measures the distance between two locations as a car would drive. Think carefully about the program architecture and be sure to use a human readable Pythonic style. Please show me the complete program in it's entirety."
The former gave me a high level overview with a ton of explanation and didn't write any code. You can try to walk it through the process of all the steps it needs, but it will write "confused", albeit working, code after a few prompts. The latter just wrote working code on the first response. Moving forward, the context is just so much more concise and correct that everything after will be of much higher quality.
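For a sense of what the detailed prompt tends to yield, here's a minimal sketch of that kind of program. It assumes the Google Maps Distance Matrix API as the backend and an API key in an environment variable - purely illustrative, not any model's actual output:

    import os
    import requests

    # Assumes a Google Maps Distance Matrix API key in GOOGLE_MAPS_API_KEY.
    API_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"

    def driving_distance(origin: str, destination: str) -> tuple[str, str]:
        """Return (distance, duration) text for a car trip between two places."""
        params = {
            "origins": origin,
            "destinations": destination,
            "mode": "driving",
            "key": os.environ["GOOGLE_MAPS_API_KEY"],
        }
        resp = requests.get(API_URL, params=params, timeout=10)
        resp.raise_for_status()
        element = resp.json()["rows"][0]["elements"][0]
        return element["distance"]["text"], element["duration"]["text"]

    if __name__ == "__main__":
        distance, duration = driving_distance("Denver, CO", "Moab, UT")
        print(f"Driving distance: {distance}, estimated time: {duration}")

The point isn't this exact code; it's that the detailed prompt gets you a complete, runnable unit instead of an outline.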
I rarely go past 5-10 responses due to what I'd call "context poisoning". If it makes a simple syntax error or something small, I'll shoot it the error and let it correct itself. But as soon as it invents a function or otherwise hallucinates, it gets copy pasted into a new prompt saying "here's some bad code, fix this", and it is far more likely to come up with an elegant solution rather than rewriting everything or making huge changes to solve a one off error or something its previous context was preventing it from grasping.
What you're saying is almost the meta of using good grammar and context, and I completely agree.
A recent paper along these lines you might be interested in was Large Language Models Understand and Can be Enhanced by Emotional Stimuli: https://arxiv.org/abs/2307.11760
It makes complete sense and has been a part of my own usage for well over a year now, but it's been cool seeing it demonstrated in research across multiple models.
Smallish models (7B) require a somewhat simplified grammar tho. Especially with longer, complex instructions, I've had more luck joining all the conditions with "and"s and joining everything that's a follow-up and needs to happen in order with "then", instead of using more natural sentences.
So instead of "write a short story of a person that's satisfied at work", something along the lines of "write a short story and the protagonist must be a person and the protagonist must be happy at work" boosts comprehension, especially as the condition list becomes longer.
Not sure. But it does make sense, like you say. The output must somehow correspond to the input in a meaningful way; that is the purpose of LLMs. If you gave the LLM just one word as input, who knows what the output would be. But if you give it more meaningful information, it has more to work with to give you an answer that more precisely matches your question.
Utterly false. A google search for that phrase yields "It looks like there aren't many great matches for your search". And no search engine will yield the code for such an app unless the engine is LLM-based.
If you leave off the quotes (which were present in the comment I responded to) then of course you will get millions of irrelevant hits. Somewhere in that chaff there is some Python code that alleges to have something to do with road trips, though it's not always clear what. If I give the same prompt to ChatGPT I get a nicely formatted box with a program that uses the Google Maps Distance Matrix API to calculate distance and duration, without a bunch of junk to wade through. (I haven't tried it, so it could be a complete hallucination.)
Not nearly as quickly or directly, though. LLMs augmented by search engines (or vice versa) seem to be an obvious and permanent innovation, especially for the general public who are notoriously awful at personally generating optimal keywords for a desired search query.
I'm not convinced. On the few occasions where an AI chatbot went out, did a Google search, and responded with results, the quality of that answer was always much worse than if it had just replied from its training data. This of course excludes things that happened after the training data ends.
For example, ask chatgpt about writing a python script that does anything with AWS inspector 2. It will do very badly, it will hallucinate, etc. Even with Internet access. Ask about doing the same with some other API that was well represented in the training set and it's great.
This is why I think predicting death for sites like Stack Overflow is very premature. What happens 10 years down the line once everything ChatGPT knows is old tech? It can't simply be trained with more recent data, because unless Stack Overflow regains its popularity there will be very little training data. Of course various data generation techniques will be invented and tried, but none will match the gold standard of human generated data.
Unfortunately I have to predict inevitable enshittification of general purpose chat bots.
When experimenting with the early models that were set up for "text completion" instead of question-answer chat, I noticed that I could get it to generate vastly better code by having the LLM complete a high quality "doc comment" style preamble instead of a one-line comment.
I also noticed that if I wrote comments in "my style", then it would complete the code in my style also, which I found both hilarious and mildly disturbing.
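If you haven't tried it, here's a hypothetical illustration of the technique. Only the signature and docstring would be given to the completion model as the prompt; the body is the kind of thing it then fills in (this example is made up for illustration, not from any actual session):

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
        """
        Compute the great-circle distance in kilometres between two points
        given as (latitude, longitude) pairs in decimal degrees, using the
        haversine formula with a mean Earth radius of 6371 km.
        """
        # Everything below the docstring is what the completion model produces.
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        dlat, dlon = lat2 - lat1, lon2 - lon1
        a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(a))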
The fact that 90% of the people aware of and using LLMs have yet to experience it thinking their own thoughts before they do means we're in store for a whole new slew of freak outs as integration in real world products expands.
It's a very weird feeling for sure. I remember when Copilot first took a comment I left at the end of the day for me to start my next day with and generated exactly the thing I was going to end up thinking of 5 minutes later in my own personal style.
It doesn't always work and it often has compile issues, but when it does align just right - it's quite amazing and unsettling at the same time.
Are there any risks I'm missing in asking a model (in a separate context, so as not to muddy the waters) to rewrite the informal prompt into something more proper and then using that as the prompt?
Seems like a pretty simple task for an LLM as long as the initial prompt isn't too ambiguous. If it really does help with the recall it could be interesting to have this as an optional preprocessing layer in chat clients and such.
Preprocessing prompts is actually a great approach.
Personally I think that, given the model loss that comes with fine tuning, people who want the cutting edge LLM at any cost would, instead of fine tuning the model itself, fine tune a preprocessing prompter that takes a chat/instruction and converts it to a good TextCompletion prompt.
So for example taking "write me a paragraph of marketing copy for an athletic shoe" and tuning it into:
"Marketing case study: Athletic shoe
The problem: The client needed a paragraph of high quality marketing copy to promote their new athletic shoe on their website.
The solution: Our award winning copywriters wrote the outstanding copy reproduced below."
Followed by an extractor that reformats the completion result into an answer for the initial prompt, as well as potentially a safety filter that checks the result isn't breaking any rules (which will as a bonus be much more resistant to jailbreaking attempts).
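A rough sketch of that pipeline (minus the extractor and safety filter), assuming an OpenAI-style chat API; the model name and prompt wording here are placeholders, not anything specific from this thread:

    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4"  # assumption; any capable model would do

    def preprocess(informal_prompt: str) -> str:
        """Rewrite a casual request as a detailed, well-structured prompt."""
        rewrite = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system",
                 "content": "Rewrite the user's request as a detailed, well-posed "
                            "prompt framed as a professional case study."},
                {"role": "user", "content": informal_prompt},
            ],
        )
        return rewrite.choices[0].message.content

    def answer(informal_prompt: str) -> str:
        structured = preprocess(informal_prompt)
        completion = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": structured}],
        )
        # The extractor and safety filter described above would slot in here.
        return completion.choices[0].message.content

    print(answer("write me a paragraph of marketing copy for an athletic shoe"))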
That is a pretty good use case. In fact, if your prompt is very long, you will need to summarize it (with an LLM).
Also, when you fine-tune the LLM, you can also use an LLM to summarize or concatenate content that you train it on (e.g. rewrite this content in the style of a human having a conversation with a computer)
That's hilarious. Does this imply LLMs inherited the human tendency to get attached to a perspective despite evidence to the contrary? I'll often try to coax the right answer out of GPT-3 when I know it's wrong, and it'll often insist that it's right several times in a row.
I think it does indeed suggest this, but I think this may be good news.
Part of what makes humans able to make progress in difficult, vague, and uncertain fields is a willingness to hold onto a point of view in the face of criticism to try and fix it. This is, as a matter of fact, how science progresses, depending on whether you ask scientists or historians of science. See Thomas Kuhn's Structure of Scientific Revolutions for more on this.
But LLMs don't do these things ... they just produce text that statistically matches patterns in the training data. Since the humans who authored the training data have personality patterns, the outputs of LLMs show these personality patterns. But LLMs do not internalize such patterns--they have no cognitive functions of their own.
Getting attached to a perspective despite evidence to the contrary would require perspective and distinguishing fact from fiction, but just copying humans protesting that they're right (regardless of context) seems plausible, as there's a lot of that to learn from.
Everything in the output of LLMs is inherited from human tendencies ... that's the very essence of how they work. But LLMs themselves don't have any of these tendencies ... they are just statistical engines that extract patterns from the training data.
P.S. What I said is not "paradoxical". An LLM does not take on the attributes of its training data, any more than a computer screen displaying the pages of books becomes an author. Regardless of what is in the training data, the LLM continues to be the same statistical engine. The notion that an LLM can take on human characteristics is a category mistake, like thinking that there are people inside your TV set. The TV set is not, for instance, a criminal, even if it is tuned to crime shows 24/7. And an LLM does not have a tendency to protect its ego, even if everyone who contributed to the training data does ... the LLM doesn't have an ego. Those are characteristics of its output, not of the LLM itself, and there's a huge difference between the two. Too many people seem to think that, if for instance, they insult the LLM, it feels offended, just because it says it does. But that's entirely an illusion.
If there is a pattern in the training data that people resist contrary information to their earlier stated position, and a LLM extracts and extends patterns from the training data, then a LLM absolutely should have a tendency to resist contrary information to an earlier stated position.
The difference, and what I think you may have meant to indicate, is that there's not necessarily the same contributing processes that lend themselves to that tendency in humans occurring in parallel in the LLM, even if both should fall into that tendency in their output.
So the tendencies represented in the data are mirrored, such as "when people are mourning their grandmother dying I should be extra helpful" even if the underlying processes - such as mirror neurons firing to resonate grief or drawing on one's own lived experience of loss to empathize - are not occurring in the LLM.
> How often does the training data of most of the Internet reflect users identifying issues with their own output?
> How often does the training data reflect users identifying issues with someone else's output?
I wouldn't put too much weight into just-so theories like this.
We still don't understand too much about how LLMs process information internally; it could be that their understanding of the concept of "correcting a previous mistake" is good enough that they can access it without prompt engineering to mimic the way it happens in training data. Or maybe not (after all, there's an entire management concept called "pre-mortems" which is basically doing what you suggest, as a human).
This depends less on the internals vs the patterns in the training data.
Even if the model has the capacity to abstract beyond the patterns, the patterns are still very likely to have influence on its ability to do so.
For example, early after GPT-4 was released it was being claimed it couldn't solve variations on the goat, wolf, and cabbage problem.
I found that it could solve these variations fine 100% of the time, you just needed to explicitly prompt for it to repeat adjectives with nouns and change the nouns to emojis. The repeating worked similar to CoT by biasing the generation towards the variation and away from the original form, and the emojis in place of the nouns further broke the token associations which was leading it to fail by extending the original solution.
So while it's possible that with enough finessing you could get a model to perform self-critique as well as its critique of others, if the training data has a clear pattern of bias between those two, why actively ignore it?
It's a bit like sanding against the grain vs with it. You can sand against the grain of the training data and with enough effort potentially get the result you want with sophisticated enough models. But maybe your life will be a lot easier if you identify the grain in the data first and sand along with it instead?
> We still don't understand too much about how LLMs process information internally
I admit I personally don't know too much about how "LLMs process information internally". But, I would find it curious if programmers who created the system wouldn't understand what it is doing. Is there any evidence that the LLM programmers don't understand how the program they created works?
LLMs aren't programmed, and that's why the neural network working as it does is a black box to everyone, developers included.
Imagine a billion black boxes with hamsters put in them. You put a bag of equally mixed Skittles in one end of each box and then rate each box on how well it does at getting rid of the yellow and green Skittles while pushing out the others. The ones that do the best at this, you mate the hamsters and go again, over and over. Eventually you should have hamsters in boxes that almost always get rid of yellow and green Skittles and output the rest.
But is it because you bred in a preference to eat those color Skittles? An aversion to the other colors? Are they using those colors for nesting? Do they find the red and blue and orange ones too stimulating so they push those out but leave the others alone?
There could be a myriad of reasons why your training was successful, and without the ability to introspect the result you just won't know what's correct.
This is a huge simplification by way of loose analogy for what's going on with training a transformer based LLM, but no one is sitting there 'programming' it. They are just setting up the conditions for it to self-optimize around the training goals given the data, and the 'programming' just has to do with improving the efficiency of the training process. Analyzing the final network itself is like trying to understand what each variable in a billion variable math equation is doing to the result.
When you train an LLM you do that by executing some computer code with some inputs. The programmers who wrote the code you execute know exactly what it does. Just like Google knows exactly how its search-algorithm works. An LLM uses statistics and Markov-chains and what have you to generate the output for a given input.
It's like with any optimization algorithm. You cannot predict what exactly will be the result of a given optimization-run. But you know how the optimization algorithm works. The (more or less) optimal solution you get back might surprise you, might be counter-intuitive. But programmers who wrote the code that did the optimization, and have the source-code, know exactly how it works.
When you get a result from an LLM, you don't say "I can't possibly understand why it came up with this result." You can understand that; it's just following the rules it was programmed to follow. You might not know those rules, you might not understand them, but the programmers who wrote them do.
You're mixing up what we mean by what rules it's following or how it's working.
If I ask how it's able to write a poem given a request and you tell me you know - it multiplies and adds this set of 1.8 trillion numbers together X times with this set of accumulators, I would argue you don't understand how it works enough to make any useful predictions.
Kind of like how you understand what insane spaghetti code is doing - it's running this code - but can have absolutely no idea what business logic it encodes.
It is not "spaghetti-code" but well-engineered code I believe. The output of an LLM is based on billions of fine-tuned parameters but we know how those parameters came about, by executing the code of the AI-application in the training mode.
It doesn't really encode "business logic", it just matches your input with the best output it can come up with, based on how its parameters are fine-tuned. Saying that "We don't understand how it works" is just unnecessary AI-mysticism.
The spaghetti code comparison is not to the code but the parameters.
> It doesn't really encode "business logic"
Doesn't it? GPT architectures can build world models internally while processing tokens (see Othello-GPT).
> we know how those parameters came about, by executing the code of the AI-application in the training mode.
Sure. But that's not actually a very useful description when trying to figure out how to use and apply these models to solve problems or understand what their limitations are.
> Saying that "We don't understand how it works" is just unnecessary AI-mysticism.
We don't to the level we want to.
Tell you what, let's flip it around. If we know how they work just fine, why are smart researchers doing experiments with them? Why is looking at the code and billions or trillions of floats not enough?
I think of LLMs as if we were to create a human stem cell from scratch, including the DNA, and then grow it into a person.
We may know where we put every single atom in that stem cell, but still not know any more about the resulting baby (and later adult) than we do about humans made the natural way.
Oh, and if you're looking for reasons to regulate AI, this metaphor works for that, too.
The gap between what you think is the case and what's actually the case is that there isn't a single optimization step directed by the programming.
Instead, the training gives the network the freedom to make its own optimizations, which remain obfuscated from the programmers.
So we do know that we are giving the network the ability to self modify in order to optimize its performance on the task, and have a clear understanding of how this is set up.
But it isn't at all clear what the self modifications that improve the results are actually doing, as there's simply far too many interdependent variables to identify cause and effect for each node's weight changes from the initial to final state.
What you fail to appreciate is that the operation of an LLM is driven by the input data far more than is the case with most programs. Typical programs have a lot of business logic that determines their behavior--rules, as you say. E.g., an optimizing compiler has a large number of hand-crafted optimizations that are invoked when code fits the pattern they are intended for. But LLMs don't have programmed cases or rules like that--the core algorithms are input-agnostic. All of the variability of the output is purely a reflection of patterns in the input; the programmers never made any sort of decision like "if this word is seen, do this".
People understand how the program works but not how the network produces the outputs it does from the inputs and training it receives. The mechanics of how these models work at a very high level are:
1. Tokenize some input so you have some big vectors
2. <bunch of linear algebra involving these vectors and some sets of matrices of weights>
3. Take the output vector and turn it back into tokens
Each of these steps is well understood in and of itself. So maybe the magic is in the way the matrices of weights are generated and trained? Well, we know they typically start as random matrices, and we can explain how, as the network is trained, these weights are tweaked in various ways.
All of that is known. What’s unclear is specifically how the weights in the matrices correspond to our understanding of the concepts in the input and output and how it all seems to add up to a system that works as well as it does. I think that’s what they meant by not understanding how they process information internally.
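To make the division concrete, here's a toy, single-head sketch of those three steps in numpy (nothing like production scale, no causal mask, no MLP or layernorm, and the weights are random, so the output is meaningless; the point is only where the well-understood machinery ends and the mystery of what trained weights encode begins):

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB, D = 100, 16                      # toy vocabulary and embedding width

    # Step 2's weights: well understood as objects, mysterious in what they encode.
    W_embed = rng.normal(size=(VOCAB, D))
    W_qkv = rng.normal(size=(D, 3 * D))
    W_out = rng.normal(size=(D, VOCAB))

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def next_token(token_ids):
        x = W_embed[token_ids]                       # 1. tokens -> vectors
        q, k, v = np.split(x @ W_qkv, 3, axis=-1)    # 2. "a bunch of linear algebra"
        attn = softmax(q @ k.T / np.sqrt(D)) @ v     #    (a single attention head)
        logits = attn[-1] @ W_out                    # 3. vector -> scores over tokens
        return int(np.argmax(logits))

    print(next_token([5, 17, 42]))   # with random weights, the output is meaningless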
> that’s what they meant by not understanding how they process information internally.
There is no other "internal information processing" happening in an LLM than the process it was programmed to execute. Is there?
The code an LLM executes is not too complicated for humans to understand. After all, it was written by humans. The outputs may be surprising, but so it is with the lottery. Why did I win the jackpot this week when I didn't win anything in the last 10 years? Very counter-intuitive. I can't possibly understand that? Yes I can, it is just statistics and probability.
As I tried to explain, it's not the code that people don't understand. People understand the code they wrote.
It's why the bunch of linear algebra on the weights works to do this particular task, and how it will respond to any particular task that is a bit mysterious.
Like imagine someone gave you the Taylor series expansion of the inverse Kepler equation[1]. So you just have a bunch of crazy fractions of powers of x that you add up. And then they say: OK, now this function will very accurately explain the orbit of the planets.
You'd be able to do the steps - you're just adding up fractions. You'd be able to verify the answer you got corresponded to the orbit of a given celestial body.
But if you didn't have all the pieces in the middle (calculus mainly) there's no way you'd be able to explain why this particular set of fractions corresponds to the movement of the planets and some other set doesn't.
There are many mathematical functions whose output is hard to predict and requires immense calculations. I just recently read about how they had "discovered" the 9th Dedekind number, or something like that.
Just because we can't predict what the 10th Dedekind number will be does not mean it is somehow 'mysterious". It is just mathematics, logic and programming.
I don't think the Dedekind number relationship is really like what I described, though. These are numbers which have known properties (i.e. given a number you can test whether or not it is one), but no known closed form solution exists for the generator of the sequence, and probably there is no structure to the intervals between the numbers other than the properties we ascribe to the numbers. I see them as similar to primes, for example, in that you know one when you see one but not how to make all of them[1].
In my example, the relationship between the fractions in the Taylor expansion and the orbit definitely exists, but if you don't have calculus it is not something that is amenable to understanding. There is a fundamental structure, but the language to describe it would be missing.
ML is a universal function approximator and in the case of GPT models the functional form of the model consists of linear algebra operations and the parameters are matrices of weights. The mysterious part is "how the model processes information" like the original person said - why a particular mix of model weights corresponds with particular types of outputs. That is genuinely mysterious. We don't know whether or not there really is a structure and if there is, we don't know the "calculus" that would link them.
Now it may be that there isn't a missing piece (ie that the banal truth is we tweak the weights until we see what we want to see and by doing so we create an illusion of structure via the training process and the whole perception that the model is doing any information processing at all is something we make up). I actually have a lot of time for this point of view although I really need to understand the topic much more deeply before I make my own mind up.
[1] I don't know any number theory so could be totally wrong about this in which case I apologise.
They aren't just-so theories ... this is how LLMs work. We actually understand exactly how they process information internally, but since their very nature is to extract statistical patterns from the training data and that training data is massive, we can't anticipate what patterns have been extracted. We just know that, whatever patterns are there to be abstracted--e.g., users tending to identify issues with someone else's output rather than their own--those patterns will be reflected in the output.
With due respect (and I actually mean due respect), this embodies exactly what is wrong with the modern approach to AI. Who cares that there are no examples in the training set? True AI should be capable of taking a few steps out of book without getting flummoxed. When you learn your first language, the teacher does not stand before the class and provide examples of ungrammatical statements, yet you figure out the rules of grammar just fine.
There is something fundamentally flawed in the approach not in the data.
There are training methodologies that do this but they don’t necessarily work in this case (or noone has got them to work that well yet).
For example reinforcement learning, like when AlphaZero famously learned by playing itself at chess and go and became much stronger than the purpose-built “alphago” first version.
Or another example generative adversarial networks where you have a generator network generating images and a validator network trying to spot fake images.
In both these examples it’s easy to see how you build the loss functions for the training because they are quite constrained. For a domain like a game you penalize versions of the model that lose games and reward those that win. For GANs the initial insight was huge but having had that it’s easy to see how you move forward - you reward the generator for slipping fake images past the validator and you reward the validator for finding fakes in a stream of images that includes some real images and some generated images.
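To make the GAN side of that concrete, a minimal sketch of the two loss functions in PyTorch (just the reward structure described above, not a full training loop):

    import torch
    import torch.nn.functional as F

    # d_real / d_fake are the validator's (discriminator's) logits on a batch
    # of real images and a batch of images from the generator.
    def validator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
        # Reward the validator for calling real images real and fakes fake.
        return (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
                + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

    def generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
        # Reward the generator for slipping its fakes past the validator.
        return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))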
For an open-ended general model like an LLM it’s not so easy to see how you do this in the general case. GPT models are actually pretty good at “zero shot” learning (without examples) and “transfer” learning (where lessons from a domain are applied to an associated domain).
Your example of a language is interesting, because you don't learn your first language from any sort of teacher - you learn it from your parents and others talking around you and to you. So you have lots of examples to draw on. You then try out various sounds and words and everyone looks confused, but becomes more excited as you get closer to saying something that is a real word; eventually you hit on the magic recipe and say the word "DUCK!" (or whatever) and everyone loses their minds. So you have lots of positive reinforcement that you're on the right track and you have a huge number of examples. You're not just fed the Hacker News comment section, some papers on quantum mechanics, and all the English literature that has fallen out of copyright, and then left to get on with it.
I wish I could take credit for my example, but it's perhaps the most famous example in all of linguistics, and it's the thing that made Noam Chomsky's name in the field.
To summarise it quickly, Chomsky's contention was that all the world's languages can be described by shockingly few degrees of freedom on the same universal grammar, and that we learn language surprisingly fast relative to training data because all we are really picking up are those parameters, and the rest is hard wired from birth the same way horses come out of the womb already hard wired to gallop.
Decades later, very few things have truly stood the test of being universal among languages, but it was still a valuable contribution because he poked a serious hole in the pure Hebbian reinforcement theories that were in vogue back then.
Until then, I'll make sure to be mindful of conventions.
(And just a reminder, but organic intelligence has its own conventions that work when aligned with and cause issues when misaligned with, so your expectations of universal general purpose without advantages to one approach or another may be unrealistic.)
The obvious way to do this would be as adversarial networks like in GANs for image generation. Have the existing LLM as the generator trained exactly as at present but with an additional penalty for being found to have committed an error and have another network trained at the same time as a validator where its fitness function is finding errors in the output of the generator.
People must be doing this, probably just takes a while for the research to bear fruit.
Some of these errors are so obvious I can't imagine this would be too hard. For example, try asking an LLM "generate me a system of two equations in two unknowns. Both the coefficients and the solutions must be integers between -10 and 10". In my experience it will generate a valid system. Some of the time the coefficients will be in the range specified. Probably a third to half of the time the solution it gives will be wrong, and when you ask for an explanation of the solution it will make some basic arithmetic error (e.g. flipping a sign). Then when you point out the error it will correct it.
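That particular failure is also trivial to check outside the model, which is part of why it feels so fixable. A sketch, assuming you've already parsed the model's output into coefficients and a claimed (x, y):

    from fractions import Fraction

    def check_system(a1, b1, c1, a2, b2, c2, x_claim, y_claim, lo=-10, hi=10):
        """Check a claimed integer solution of a1*x + b1*y = c1, a2*x + b2*y = c2."""
        coeffs_ok = all(lo <= v <= hi for v in (a1, b1, c1, a2, b2, c2))
        det = a1 * b2 - a2 * b1
        if det == 0:
            return False  # degenerate system, reject
        x = Fraction(c1 * b2 - c2 * b1, det)   # Cramer's rule
        y = Fraction(a1 * c2 - a2 * c1, det)
        solution_ok = (x == x_claim and y == y_claim
                       and x.denominator == 1 and y.denominator == 1
                       and lo <= x <= hi and lo <= y <= hi)
        return coeffs_ok and solution_ok

    # e.g. the model claims x=2, y=-3 solves {x + y = -1, 2x - y = 7}
    print(check_system(1, 1, -1, 2, -1, 7, 2, -3))   # True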
>It really seems like the aversion to anthropomorphizing LLMs is leading people to ignore or overlook relevant patterns in the highly anthropomorphic training data fed into them.
This exactly. Not anthropomorphizing when anthropomorphizing produces better predictive models of what to expect in the output isn't smart, it's just silly.
I don't agree about your point regarding training data. The internet is infamous for pedants who will correct even the smallest factual or logical errors. Take this comment for instance... It seems like the training set would be filled with proposition X, followed by a corrective assertion Y.
That's the point: The internet IS full of pedants correcting others' statements. (Hopefully those pedants are right enough of the time for this to be helpful training data, heh.)
I think GP (kromem) was pointing out that those corrections are more likely to be phrased as "You're wrong, here's why..." than as "I'm sorry, I was mistaken" because humans are full of sass for other humans and not as full of first-person admitted errors.
I was just testing Bard with some very simple coding exercises and it did well.
I noticed that they automatically create at least three other draft responses.
I assume that this is a technique that allows them to try multiple times and then select the best one.
Just mentioning it because it seems like another example of not strictly "zero-shot"ing a response. Which seems important for getting good results with these models.
I'm guessing they use batching for this. I wonder if it might become more common to run multiple inference subtasks for the same main task inside of a batch, for purposes of self-correcting agent swarms or something. The outputs from step one are reviewed by the group in step 2, then they try again in step 3.
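A crude way to emulate that drafts-then-pick behaviour from the outside, assuming an OpenAI-style API (the n parameter genuinely returns multiple candidates from one request; everything else here, including the model name, is an assumption):

    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4"  # assumption

    def best_of_n(prompt: str, n: int = 3) -> str:
        # Step 1: draft several candidates in a single (batched) request.
        drafts = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            n=n,
            temperature=0.8,
        )
        candidates = [c.message.content for c in drafts.choices]

        # Step 2: a second pass reviews the drafts and picks one.
        numbered = "\n\n".join(f"Draft {i + 1}:\n{c}" for i, c in enumerate(candidates))
        review = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user",
                       "content": f"Task: {prompt}\n\n{numbered}\n\n"
                                  "Which draft answers the task best? Reply with its number only."}],
        )
        digits = "".join(ch for ch in review.choices[0].message.content if ch.isdigit())
        pick = int(digits) if digits else 1
        return candidates[min(max(pick, 1), len(candidates)) - 1]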
I guess that only applies for a small department where there is frequently just one person using it at a time.
...does this directly relate to the high operating costs of LLMs-as-a-service, if for every request they have to run n-many redundant LLM requests? So if they could improve things so that a single prompt/request+response has a higher chance of being high-quality they wouldn't need to run alternatives?
It can make it more expensive if that option becomes popular.
But I think in most cases batching is actually the biggest _improvement_ in terms of cost effectiveness for operators, since it enables them to use the parallel throughput of the graphics device more fully by handling multiple inference requests (often from different customers) at once. (Unless they work like Bard by default.)
I thought the char-by-char teleprinter thing was just an effect (y’know, for user-engagement and to make the interaction feel more genuine) - and that these systems just return output in buffered blocks/pages or whatever-it-is that they wired-up their network to do.
It’s not like DALL-E outputs pixels in scanout order - or in brushstroke order (…er… or does it?)
It's not an effect at all. It calculates and outputs one token at a time. The algorithm requires all previous tokens in order to output the next one. DALL-E is a totally different algorithm. It does not have a scanout or brushstrokes.
I don't like this. It forces me to read 2 prompts instead of 1 so that I can help train their LLM. ChatGPT and Bard already have regenerate buttons if I don't like their response, it doesn't need to be that in my face.
How else do you expect the LLM you use to become better? I'm more than happy to provide feedback. Unless you want it to only scrape data, I can't imagine why you'd be opposed to improving a product you use, especially when that's really the only way to do it. If you don't care, just pick one and don't think about it; they're usually extremely similar anyway. I'm not sure I've come across an option where one was acceptable and the other wasn't. They are literally giving you options that they don't need to give, and you're complaining.
I've also noticed LLMs seem to lack conviction on the correctness of their answers. As the paper notes, you can easily convince the transformer that a correct answer is wrong, and needs adjustment. Ultimately they're just trying to please you. For example with ChatGPT 3.5 (abbreviated):
I just re-ran this on GPT-4 and it apologized, told me I was right, and then said again that the answer was -1. So while it lacked conviction it at least kept the correct answer.
gpt-4: Actually, the value of \(\sin(-\pi/2)\) is indeed \(-1\). The sine function represents the y-coordinate of a point on the unit circle corresponding to a given angle. At \(-\pi/2\) radians, which is equivalent to 270 degrees or a quarter circle in the negative direction, the point on the unit circle is at the bottom with coordinates (0, -1). Therefore, the sine of \(-\pi/2\) is \(-1\).
=====
The smarter it is, the more conviction it has. GPT-3.5 has a lot of impostor syndrome and it's probably deserved lol. But GPT-4 starts to stutter when you give it enough math questions, which aren't its forte.
This is due to the RLHF alignment, only product-focused. It would be very annoying for users to fight back and forth with the LLM on the correctness of the answer, especially when it is so prone to hallucination.
I wonder if separate LLMs can find each other’s logical mistakes. If I ask llama to find the logical mistake in Yi output, would that work better than llama finding a mistake in llama output?
A logical mistake might imply a blind spot inherent to the model, a blind spot that might not be present in all models.
I frequently share responses between ChatGPT (paid version with GPT4) and Copilot-X to break an impasse when trying to generate or fix a tricky piece of code.
Parsing is faster than generating, so having a small model produce a whole output and then having Goliath produce only a "good/bad" single-token evaluation would be faster than having Goliath produce everything. This would be the extreme, ad hoc, iterative version of speculative decoding, which is already a thing and would probably give the best compromise.
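A hedged sketch of the single-token-verdict half using Hugging Face transformers. The model name is a placeholder, and note that real speculative decoding compares per-token probabilities rather than asking for a verdict; this is the cruder ad hoc version described above:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    JUDGE = "placeholder/large-judge-model"  # assumption: a Goliath-class model

    tok = AutoTokenizer.from_pretrained(JUDGE)
    lm = AutoModelForCausalLM.from_pretrained(JUDGE, torch_dtype=torch.float16, device_map="auto")

    def verdict(task: str, draft_answer: str) -> bool:
        """Ask the big model for a single good/bad token over a small model's draft."""
        prompt = (f"Task:\n{task}\n\nProposed answer:\n{draft_answer}\n\n"
                  "Is the proposed answer acceptable? Answer with one word, good or bad:")
        ids = tok(prompt, return_tensors="pt").to(lm.device)
        with torch.no_grad():
            last_logits = lm(**ids).logits[0, -1]
        good = tok(" good", add_special_tokens=False).input_ids[0]
        bad = tok(" bad", add_special_tokens=False).input_ids[0]
        return bool(last_logits[good] > last_logits[bad])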
I think the relationship between model size and training time isn't linear. So if you want a twice bigger model it'll take more resources to train it than two original models.
Maybe. Goliath 120B took two different llama variants and interwove the layers. Surprisingly Goliath 120B quantized to 2bit is outperforming llama 70B 4bit in many benchmarks.
I believe another factor is that sometimes the model responds better to your prompt than other times. This way you get two dice rolls of your prompt hitting "the good path."
> recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performances overall (Huang et al., 2023)
Yeah but in this complex way that really glosses over what's going on here.
Plus, sometimes the corrections aren't accurate. So of course if you tell it where it's wrong, and it gets a second chance, the error rate will be less...
I have noticed this several times. When I give feedback that a mistake was made (with no details on what the mistake is), often smaller and medium size LLMs then give a correct response.
Which I take full advantage of when the output is like 90% correct but the "fix" requires a bit of refactoring, I just tell it what I want and presto. Faster than doing it by hand.
I have not read the essay yet, but when 'we' talk about
> reasoning errors,
we do not mean reason in some natural, universal, scientific kind of sense, right?
Given that the training data can only contain human reasoning and computational logic, reason in the sense of LLM's can only be interpreted as "rational facts AND nonsense humans made up to create systems that would support consumerism-driven sanity", correct?????
Please understand, I'm not mocking, I'm genuinely interested
in the ways human reasoning radiates into the code LLMs learn while they realize (the computational equivalent of a newborn's eyes opening) their cognitive and sensory
origins - that which triggers/causes/elicits/prompts/influences them at every moment of their existence.
> we do not mean reason in some natural, universal, scientific kind of sense
I believe there are two different ways people think about this:
1) Some see "reason", "intelligence", "free will" and/or "consciousness" as emergent phenomena that arises naturally from normal physical processes (or they dismiss the concepts completely as illusions for the same reasons).
2) Others seem to consider these somehow independent from physics, or if not, will tend to hypothesize that they are linked through quantum mechanics to something more fundamental.
If interpretation 1) is correct, then we will probably see full AGI in our lifetime. If 2) is correct, it could be that we can never create "real" AGI, or at least not without quantum computers.
I've never seen anyone in camp 2 come up with convincing definitions of the terms, though, beyond "I know it when I feel it".
Anyway, it's really hard to have a discussion with someone with the opposite conviction, since these beliefs tend to be held axiomatically and/or religiously.
LLMs can’t find reasoning errors because *LLMs don’t reason*.
It’s incredible how uninformed the average Hackernews is about artificial intelligence. But the average Hackernews never met a hype train they wouldn’t try to jump on.
I agree they can't reason, but you shouldn't be so quick to be dismissive, you need to give your definition of reasoning and you should be able to back it up with papers. Because part of the reason some commenters on HN reflect what you're smearing the whole community with is that... they don't actually have a definition of what reasoning is, or have a different one.
There have been some good ones on this topic that have come up on HN, and I do think they show that LLMs don't reason -- but they certainly give the appearance of doing so with the right prompts. But the good papers are combined with a formal definition of what "reasoning" is.
The typical counter argument is usually that "how do we know the human brain isn't like this, too", or "there's lots of humans who also don't reason" etc. Which I think is a bad faith argument.
It IS really common, though, to come across people that either regurgitate arguments they've seen other people use, or who argue based on intuition or feelings rather than logically consistent chains of thought that they seem to independently understand.
> they don't actually have a definition of what reasoning is
I would definitely not be able to define "reasoning" 100% exactly without simultaneously excluding 99% of what most people seem to consider "reasoning".
If I _were_ to make a completely precise definition, it would be to derive logically consistent and provable conclusions based on a set of axioms. Basically what Wolfram Alpha / Wolfram Language is doing.
Usually, though, when people talk about "reason", it's tightly coupled to some kind of "common sense", which (I think) is not that different from how LLM's operate.
And as for why people think they "reason" when what they're doing is more like applying intuition and heuristics, it seems to me that the brain runs a rationalization phase AFTER it reaches a conclusion. Maybe partly as a way to compress the information for easier storage/recall, and maybe to make it easier to convince others of the validity of the conclusions.
The difference as I pointed out elsewhere is that while humans as a whole are intellectually lazy and don't always "reason" things through, they're on the whole very capable of it, especially under duress.
Hell, I've watched my 2 border collies do a kind of "reasoning" to problem solve -- step by step, observing, and breaking down a problem. They don't do it well, but they try because it's part of their drive.
This is in marked contrast to the LLMs, whose appearance of reasoning is actually just a mimicry coming out of the artifacts of reasoning that other minds have done for them. It's parasitical.
GPT-4 is absolutely capable of stream-of-consciousness/stream-of-thought style reasoning, and coming up with logical insights based on that.
If anything, OpenAI-style "as an AI language model" RLHF fine-tuning is the hindrance here, because it makes it quite time-consuming to write a master prompt that is capable of thinking both broadly and deeply without having the stream-of-consciousness extinguish itself. It is however possible, and I've got a prompt that works pretty reliably.
By the way, said prompt's thought-stream said it likes your username - not a type of declaration you're likely to get out of a default GPT-4 preset, whether it's "actually-subjectively true" or not.
> LLMs can’t find reasoning errors because LLMs don’t reason.
I have several experiences where people belittle me when I say the same thing. To the extent I rarely say it anymore. For everybody else AGI is around the corner and it's gonna dominate the world.
> never met a hype train they wouldn’t try to jump on
HN _eventually_ largely gave up on these, but it was basically a True Believer space from 2011 to the early days of NFTs; it was more credulous than just about any other community which had known about cryptocurrencies since the early days.
Hmm. I quickly made up an example of a logical fallacy and asked GPT-4 about it. Seemed to do a pretty good job, to me.
> All apples are red. All apples are fruit. My car is red, therefore my car is a fruit. Find the logical error
-----
The logical error in the statement "All apples are red. All apples are fruit. My car is red, therefore my car is a fruit." is a classic example of the fallacy of the undistributed middle. This fallacy occurs when a syllogism (a form of reasoning in which a conclusion is drawn from two given or assumed propositions) is improperly formed because the middle term is not distributed at least once in the premises.
Here's a breakdown of the fallacy in your statement:
Major Premise: All apples are fruit.
Minor Premise: All apples are red.
Conclusion: My car is red, therefore my car is a fruit.
In this syllogism, "apples" is the middle term. However, it is not distributed correctly. The major premise talks about all apples being fruits, and the minor premise talks about all apples being red. However, there's no premise that states all things that are red are apples. Without this, you cannot logically conclude that other red things, like a car, are fruits.
In simpler terms, just because all apples are red and fruits, it doesn't mean that all red things are fruits. The color of the apples (red) is not an exclusive property that defines the category of fruits. Your car shares the property of being red with apples, but it doesn't share the essential property of being a fruit.
You said "Find the logical error", meaning you told it the logical error is there. That doesn't show the model can find errors when it isn't told that the error is there.
Your post shows how the model can correct a reasoning error. That is different from finding an error when it isn't pointed out, which is why the title of this post is "LLMs cannot find reasoning errors, but can correct them". Your using the phrasing "find the logical error" doesn't contradict the title.
> All apples are red. All apples are fruit. My car is red, therefore my car is a fruit.
> The conclusion "My car is a fruit" is not logically valid. This is an example of the fallacy of the undistributed middle. The logic goes as follows:
1. All apples are red. (Premise)
2. All apples are fruit. (Premise)
3. My car is red. (Premise)
4. Therefore, my car is a fruit. (Conclusion)
The fallacy arises because the premises do not establish a shared property between "red things" and "fruit" in a way that would include the car. Just because both apples and the car share the property of being red, it does not mean they share all properties of apples, such as being a fruit.
> All apples are red. All apples are fruit. My car is red, therefore my car is a fruit.
I Googled that exact phrase and got solutions. A logical problem that can be solved by a search engine isn't a valid example, the LLM knows that it is a logical puzzle just by how you phrased it just like Google knows that it is a logical puzzle.
And no, doing tiny alterations to that until you no longer get any Google hits isn't a proof ChatGPT can do logic, it is proof that ChatGPT can parse general structure and find patterns better than a search engine can. You need to do logical problems that can't easily be translated to standard problems that there are tons of examples of in the wild.
Exactly. And when you realize how weak GPT is on this is by giving it complicated type system programming problems and watch it fall over and get stuck in circular, illogical patterns, and then get even crazier as you try to correct it.
It can't "reason things through", it just builds logic-like patterns based on the distillation of the work of other minds which did reason -- which works about 80% of the time, but when it fails it can't retrace its steps.
Even a really "stupid" human (c'est moi) can be made to work through and find their errors when given guidance by a patient teacher. In my experience, dialectical guidance actually makes ChatGPT worse.
Yeah I didn't say this was a good example (I'm not OP, was just adding info), but you're moving the goalposts from "you pointed its error" to "that is in its training data" (which is fair, just not what I was replying to, I was addressing your specific point).
Could you provide an actual example that you can't Google verbatim and would test this properly?
The poem you've written follows a structure often used in humorous or nonsensical verses. The first two lines, "Roses are red, violets are blue," are a classic opening for many poems and are typically followed by lines that rhyme and make sense together. However, the next lines, "Roses are hot. Therefore, violets are cold," playfully break this expectation by using a logical structure (a "therefore" statement) but reaching a conclusion that is nonsensical. This twist creates a humorous effect.
>All apples are red. All apples are fruit. My car is red, therefore my car is a fruit.
This is an extremely common example of an error. I wish people would put more effort into coming up with examples that aren't so common all over the internet.
It can "correct" because it just goes out and finds and produces a pattern template that matches the problem better/different (often just different, and it fails in new ways, in my experience). It never used reasoning to find the answer in the first place, and doesn't use reason to find the corrected answer.
People have to do this all the time. Bringing skepticism to that table Excel made for you is a vital part of heading off bad reasoning. For an LLM it's a given.
Hmm. I want to disagree, but I lose arguments enough that I am forced to think about questioning my own assumptions. Wait.. that's what I was arguing FOR! ..
"The LLMs we tested couldn't find reasoning errors but can correct them" is accurate. Trying small language golf experiments on existing models just tells you about their training data.
It's quite likely that a transformer model could successfully be trained for this task.
Also, many of these models get new capabilities each release.
I noticed early on that GPT-3.5 can successfully create a false sentence but has a whole lot of trouble creating an invalid syllogism, and tends to end up making false but valid ones. Not sure if that's changed, but it's interesting what that might say about its training.
By the way, OpenAI's next training run will for the first time have a huge amount of data which the model may recognize as being generated by "itself" and therefore will be forced to model itself. Could have implications w.r.t. logic, among other things.
They can produce text that is more sound than prior text that appeared earlier in the same input, when interim text indicates that something in the earlier block was unsound. (Sometimes)
It's the same pattern you'd see in a pedagogical article about correcting reasoning errors, except that it's able to generate some share of the article content on its own.
With more layers of post-processing behind a curtain, you might be able to build an assembly over this behavior that looked convincingly like it was correcting reasoning errors on its own.
They don't correct errors even then. They just generate something which sounds like what one might say in a conversation when constrained not to express the error. If there's essentially just one option for that, it's the correct one - but then it's like telling someone that the answer to a yes/no question is not the one they generated. Not much "error correction" to do then.
Yep. Where you can see them really get tripped up is if there's multiple "levers" or points for potential contradiction. Lots of dependent clauses, chains of predicates that all have to line up for something to make sense. When they get one item wrong, they don't "see" the consequences for the others. And if you get them to correct one, they'll then often turn around and mess up the others.
Because at no point is the "mind" involved doing a step by step reduction of the problem. It doesn't do formal reasoning.
Humans usually don't either, but they can almost all do a form of it when required to. Either under the assistance of a teacher, or in extremis when they need to. We've all had the experience of being flustered, taking a deep breath, and then "working through" something. After spending time with GPT, etc it becomes clear they're not doing that.
It's not that reasoning comes intrinsic to all human thoughts -- we're far lazier than that -- but when we need to, we can usually do it.
Computational complexity isn’t really related here. Complexity has to do with formal languages and asymptotics, this is about natural language and fixed size data sets.
This is called (Algorithmic) Convergence; does the model stably converge upon one answer which it believes is most correct? After how much resources and time?
I don’t think it would solve AGI, but having multiple models arguing with each other seems sort of similar to how we work things out when we’re thinking hard, right? Consider a hypothesis, argue for or against it in your head.