Improving mathematical reasoning with process supervision (openai.com)
157 points by davidbarker on May 31, 2023 | 58 comments



A (good) Socratic tutor for mathematics is something I would pay for without hesitation.


Hopefully one day multimodal models can learn directly from YouTube lectures put up by John Baez [0] and be able to take students from ELI5 to ELI-PhD.

[0] https://www.youtube.com/watch?v=IdXVIWgxyDw


Try the VoxScript plugin and ask “can you explain $URL like I’m 5? Then explain it like I’m a PhD.”


One can dream :)


You can extract the transcripts and have it automatically provide context on terms and supplemental reading to support the material.


You could definitely do that now with a range of techniques, from something that costs under $50 up to building your own RLHF pipeline for the Socratic method. If this is something you are interested in and want, I would highly suggest pursuing it. I use it for exploring and learning subjects all day long. With a couple of external tools, it would be 10x better and would enable less skilled users to get value out of it.

You don't know what you don't know after all.


Honestly I don't think that GPT-4 is smart enough to be a useful tutor for anything beyond high school maths, no matter how much finetuning/RLHF you do.

I can see it being great for humanities though if you can get the hallucination rate down.


Funny, nobody thinks GPT-4 is quite there in their own field, but can readily agree it might be good enough in other fields.


That is kinda true: it isn't an expert in any one field, it is something of an intermediate in all fields. It will offer the most assistance outside of your domain of expertise. In your domain of expertise, it can provide a sanity check and another set of eyes.


I don't think that's true. I'm a software engineer and I am willing to admit that GPT-4 is useful for generating/refactoring code. If only I was allowed to use it for work :(

It is not useful for working with maths proofs, except maybe if you are bad at LaTeX.


I'm excited for the interactive, educational applications of this technology. Especially for things like maths.


YouTube has a real opportunity there - stop the video at any point and ask a question!


One thing I noticed is that they have a table in the appendix where they show that some of the datasets used for finetuning were already included in the pretraining.

Is this the first time they have published that there was synthetic data in the GPT-4 pretraining dataset? Maybe that's part of the secret sauce behind why no one can catch GPT-4? I don't recall anyone else mentioning synthetic data.

It certainly explains why there are so many people on the data team at OpenAI.


In the future I expect synthetic data will take the lion's share because it is easier to scale. To produce it you need a good way to filter out the junk; this can be done in many ways, some cheap and others expensive.


I guess, but surely if you run the 20 million books on Z-Library (I guess there is some duplication) through 4 epochs you get way more than enough data to train current models? You get like 100 trillion tokens, admittedly less after you deduplicate.


It's always felt weird that there's this whole concept with a formal name ("chain-of-thought prompting") and papers written about it, etc, and basically it boils down to putting the phrase "Let's think step-by-step" at the end of the prompt. The obvious conclusion was that these models _should_ be trained to think step-by-step without having to prompt them to do so each time.
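
For anyone who hasn't tried it, here is a minimal sketch of what that looks like in practice. The function below is only an illustration of prompt construction; how you send the prompt to a model is up to whatever LLM client you use:

    # Minimal chain-of-thought prompting sketch (illustrative only).
    def with_chain_of_thought(question: str) -> str:
        # Appending the phrase nudges the model to emit intermediate steps
        # before the final answer instead of guessing it in one shot.
        return f"{question}\n\nLet's think step by step."

    prompt = with_chain_of_thought(
        "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
        "than the ball. How much does the ball cost?"
    )
    print(prompt)  # send this to the model instead of the bare question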

The other interesting thing is the pedagogical parallels of this technique. A lot of times, when teaching/learning, we reward people purely based on outcome (did you get the right answer), with mild attempts to do better (like "show your work for partial credit"). Ideally this is just baked into teaching/learning all the time (the process is rewarded as well as the outcome).
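
That pedagogical parallel is essentially the outcome-supervised vs. process-supervised distinction the paper draws: score only the final answer, or score every step. A rough sketch of the difference at scoring time; the reward-model callables are hypothetical placeholders, and multiplying per-step scores is just one plausible way to aggregate them:

    from typing import Callable, List

    def outcome_score(solution: List[str], orm: Callable[[str], float]) -> float:
        # Outcome supervision: only the final answer is judged.
        return orm(solution[-1])

    def process_score(solution: List[str], prm: Callable[[str], float]) -> float:
        # Process supervision: every intermediate step gets its own score,
        # and the solution-level score aggregates them (here, as a product).
        score = 1.0
        for step in solution:
            score *= prm(step)
        return score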


Also remember how new this technology is. We're in the phase where things that are "obvious" are still being worked out. That's why I think the naysayers are overly cynical about this technology. I think we're still on the exponential part of the curve.


Thinking step by step is not useful in all scenarios. Creative writing for example.


The models should be trained to use different “ways of thinking” when they are more useful and effective. Or to even discover their own better ways of thinking.


Many extremely simple ideas are extremely difficult to validate with empirical evidence. We’re fortunate to have an opportunity to make many such ideas much easier to A/B test using LLMs.


Really excited to see if this approach can help reduce hallucinations overall. Process supervision combined with the browsing models could drastically improve how logically sound anything that's generated is. Creating consistently valid logic is the harder part so this is awesome.


Is there some formal/objective definition of what exactly constitutes a 'hallucination' as opposed to other types of errors? At a high-level it only seems relevant with respect to questions for which there is some objectively true answer, or where 'facts' are included in answering a question that are false.


Good point. You can create a formal definition of hallucinations in formal language, like math and logic, but probably not for natural language. There are bound to be edge cases where natural language defies the rules of formal logic without being incoherent. But, while you might not be able to have a simple formal definition, maybe you could create a model that's trained to recognize these edge cases, and the model would be a sort of approximation of a formal definition.


The paper states that hallucinations are logical reasoning errors... that confused me. For me a hallucination is the conjuring up of facts, things, references, characters etc. as is convenient for the given context.


Semantics aside: I once gave it a math problem, and its hallucinations really did present as logical errors. I was a TA for several years, and it was reminiscent of the mistakes students make. Something that sounds fairly plausible, but still incorrect, with logical errors or unjustified steps.


The trouble (probably not for basic math or even low-level analysis) is that what we refer to as "logic" or a logical reasoning error isn't uniform across different axiomatic systems, and the specific interpretation is a fairly human, social activity which cannot in many cases be readily "checked" for correctness against some baseline notion of "logic". The symbol φ, for instance, has a variety of significations in different contexts (the family of sets of all functions, probably something in physics, idk), which our interpreto-bot (GPT) might not be able to both logically integrate and loosely interpret at the same time. Humans have the capacity to apprehend both consistent and complete logical systems: they interpret at the level of the text (at the level of the weave of signification), and any generally intelligent AI would have to mimic that behavior of constant, on-the-fly dynamic changes to its network at the appearance of every new signifier, in the same way a human does.


It's really hard to formally define hallucination, for me it's a "know it when you see it" situation.

In my view, saying "10 + 4 = 15" is not a hallucination but inventing a citation that doesn't exist is one. There is a big grey area between these, like what if it cites a real paper but gets the year wrong? Is that a hallucination (because the paper as cited is fake), or just an incorrect statement (because the year is just wrong)?


It’s a sliding scale that depends on the use case and different uses will want levels of hallucination that other uses would find unacceptable.

In summarization, it’s literally anything that’s not in the context. If it’s used for writing historical fiction or scifi, very little would count as a hallucination as long as it’s following the prompt.

In the recent case of the Texas lawyer, the citations follow a specific format and can be checked against a database, so hallucinations are easy to define w.r.t. citations.
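
For that narrow citation case the check really is almost mechanical. A toy sketch; the regex is only a rough approximation of US reporter citations, and known_citations stands in for a query against a real case-law database:

    import re

    # Rough pattern for citations like "575 U.S. 320" or "925 F.3d 1291".
    CITATION_RE = re.compile(r"\b\d{1,4}\s+(?:U\.S\.|F\.\d[a-z]{0,2}|S\. Ct\.)\s+\d{1,4}\b")

    known_citations = {"575 U.S. 320"}  # stand-in for a real legal database lookup

    def hallucinated_citations(text: str) -> list:
        # Anything that looks like a citation but isn't in the database gets flagged.
        return [c for c in CITATION_RE.findall(text) if c not in known_citations]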


LLMs don’t have errors. What you perceive as an erroneous output is actually just one that failed to please you.



Might mathematically adept Large Language Models be able to access the platonic “World of Forms?” (νοητός κόσμος - Noētos Kosmos) If so, there is an interesting basis for how and why AI might truly understand, even without consciousness.

https://aixd.substack.com/p/ai-might-not-need-experience-to-...


How the hell is an AI going to access the noetic realm without a nous? We have a hard enough time debating whether or not AI is conscious. We have to debate whether or not it has a nous too?


Cf. Roger Penrose’s three-worlds theory of reality, which distinguishes the Platonic mathematical world from consciousness and from the material world. Navigating the Platonic world of forms might not require one to “have a nous” but to be constructed from it.

I like the question though — and of course, these are new ideas. Who knows? Do you think it is a halfway reasonable justification for AI understanding?


> Do you think it is a halfway reasonable justification for AI understanding?

Personally, I think AI is revealing that consciousness is a social construction, and one that, as a social label, can have multiple groundings. That doesn't mean it's not necessarily real, just that it has to be operationalized in order to quantify it. Operationalization requires some assumptions and it's important to be transparent about what they are. You'll see people often invoke truth, reality, and fact, in order to justify their assumptions as "that's just the way it is." What a load of BS. Invoking reality to win an argument in a game about metaphysics displays a profound lack of understanding.

"Consciousness" especially w.r.t AI is a contentious topic almost guaranteed to elicit emotions and trigger people to fall back on these assumptions. If nous has any utility, it's to de-escalate these discussions and avoid identity-laden emotional responses. It may even do that just because people don't have any idea what a nous is. Not too many people are well versed in (neo-)Platonism. If the nous was good enough to explain the mind to people back then, why shouldn't it be good enough to explain AI?


Solid.

What’s weird (and to your point) is that many scholars claim there wasn’t really a word for consciousness in Ancient Greek. I mean, maybe pathos or aesthesis or something, but it is oddly missing. And nous didn’t seem to include core aspects of experience like emotions. Psyche/soul was contested then and now, but I don’t think scholars would consider it a good mapping to consciousness/qualia.


I've had that discussion here too. The responses to the idea that our modern concept of consciousness is relatively[1] recent go like this:

> So it seems like the argument is that consciousness didn’t exist because we didn’t have a word for it.

I wish we as a community could actually have a conversation about this history of the idea of consciousness, but it's been so well-reified over the last 380+ years that it's a tough rock to loosen and is a load-bearing idea in the public's core belief structure - to our collective detriment.

1. post-enlightenment


I'm not sure Roger Penrose really understands Platonic philosophy: as Plato says in the Republic (this has got to be my most used quote now), the City is a macrocosm of the Psyche: the order of the forms is neither individual to the perceiver nor grand and unifying in the perceived, but exists at the level of social life and exchange, as does the material world, understood as being a part of the "economy" of the social realm. In any case, AI is just one machine among many in this grand social network we've woven, and as a symbol-manipulating machine it isn't really very different from any other.


I don’t think Penrose is at all trying to recapitulate Plato but to add to him. The Greeks didn’t separate consciousness from nous or the material world. I think that’s quite useful.

I like investigating Platonic ideas with modern concepts like Darwinism and entropy. He was obviously brilliant; rather than throwing out his ideas, we can dialogue with them to challenge our own.

Now, what’s different about LLMs is that they can work with concepts, rather than just compute through them. Our conceptual facility lets us navigate the noetic realm. Otherwise, we are just subject to it (like a planet is subject to sphericalness—vs being able to think about the concept of spheres and their qualities). Ouch, this stuff is hard to put into words. But I do think there is something to it.


Convolutions on 2D arrays are mysteriously close to typing an adjacency matrix of a diagram.


The training data used [0] is written using "leaps of logic" which provide context for how the solution was derived. There isn't sufficient detail in the training data to formally check each step using a Computer Algebra System [1]. Therefore, whether or not any output from the OpenAI software is actually correct is left as an exercise for the user. Answering "Was the hallucination correct?" involves manually checking each step.

This advancement is good, but it's limited by the precision of the training data.

[0] https://artofproblemsolving.com/wiki/index.php/2015_AIME_II_...

[1] https://en.wikipedia.org/wiki/Computer_algebra_system
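
To make the step-checking point above concrete, here's a hedged sketch of what verifying individual algebraic steps with a CAS could look like, using SymPy. It only covers purely symbolic rewrites; the "leaps of logic" in AoPS-style solutions are exactly what such a check cannot see:

    import sympy as sp

    x = sp.symbols('x')

    # Each step claims that the left expression can be rewritten as the right one.
    steps = [
        ((x + 1)**2,       x**2 + 2*x + 1),   # valid expansion
        (x**2 + 2*x + 1,   (x + 2)*x),        # invalid: drops the constant term
    ]

    for lhs, rhs in steps:
        ok = sp.simplify(lhs - rhs) == 0      # CAS check: are the two sides identical?
        print(f"{lhs} -> {rhs}: {'OK' if ok else 'NOT justified'}")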


Interesting. Last I poked around with it, GPT-4 performed abysmally at basic logic. For example, it routinely confused implication with equivalence (e.g., treating "if x > 2 then x^2 > 4" as if it also licensed concluding x > 2 from x^2 > 4, which fails for x = -3). Definitely cool to see progress.


Is machine learning model "alignment" a serious academic concept? I've only seen this mentioned in pseudo-philosophical Twitter posts before.


Depends on your definition I guess. Some academics make careers out of pseudo-philosophical twitter posts.


It is. I'd argue that alignment is the difference between GPT-3, which was a neat toy for nerds, and ChatGPT.


Yes. It’s very important. Think of it as quantifying whether a model understood the objective. This is, as you’d expect, especially important for auto-regressive models.


Its meaning has shifted to now mean ESG (https://en.wikipedia.org/wiki/Environmental,_social,_and_cor...) and DEI (https://en.wikipedia.org/wiki/Diversity,_equity,_and_inclusi...) for corporations that make algorithms that have learned natural language well enough that people are talking with them and taking them seriously. The AI alignment guys don't like that, and they have started calling their old AI alignment meaning 'not-kill-everyoneism', with mixed results (https://www.lesswrong.com/posts/jMzBhCRrr7otmqcvK/notkilleve...).

EDIT: Btw, I don't have anything against ESG or DEI. I'm not a culture warrior complaining about 'woke AI'. I'm also not a LessWrong guy. I'm answering the OP's question "Is machine learning model "alignment" a serious academic concept? I've only seen this mentioned in pseudo-philosophical Twitter posts before." by saying what people mean now by 'AI alignment'.


I don't think anyone uses alignment to refer specifically to DEI bullshit.

There are people using the term generally to refer to any kind of finetuning of an LLM, so they would consider what OpenAssistant did to be alignment even though there was no attempt to convince it to not kill humanity or to be politically correct.


Yeah, but like every other term in AI, no two people agree on what the word means so discussion around it is challenging.


Academic concepts are only considered serious by convention


yes


I feel like the relevant parts of this are already baked into the advances in GPTs 3.5 and 4.


So, the goal of the paper in

https://cdn.openai.com/improving-mathematical-reasoning-with...

is

"Improving Mathematical Reasoning with Process Supervision".

Since I hold a Ph.D. in pure/applied math from a famous US research university, I'm at least curious, maybe impressed, by the goal of "improving mathematical reasoning".

In particular, one goal in the paper is

"To train more reliable models ..."

One of the techniques reported in the paper is

"active learning" ... "used to train our best reward model."

to improve the "reward" system to reduce "hallucinations", apparently also to obtain

"more reliable models".

And apparently part of the work is to be "step by step"?

Okay: From my background in writing proofs in math the work can be seen as "step by step". E.g., the common high school special format for writing proofs in plane geometry looks "step by step".

Then, for "step by step" for a goal of

"more reliable models"

there are at least hundreds of examples nearly 100% "reliable" in well known math books by P. Halmos, W. Rudin, R. Buck, J. Neveu, and many more, apparently back at least to Euclid.

So, it appears that the effort in

"Improving Mathematical Reasoning with Process Supervision".

is in a sense to solve again a problem already solved with a nearly perfectly "reliable" solution back at least to Euclid.

In addition, in my experience as a math student and teacher, commonly already in just the first few weeks of high school plane geometry, students get good at writing proofs that are very "reliable", and how those students learn does not look much like what is described in

"Improving Mathematical Reasoning with Process Supervision".

So, my guess would be, for the broad goal of having AI do "reliable" math, we need a quite different approach.

Math that is not "reliable"? From my experience applying math, e.g., to some important problems in US business and national security, in our society math that is not "reliable" stands to be less welcome than week old fish.

Indeed, at this point, for the unique crown jewel of STEM, I nominate nearly 100% reliable mathematical proof as in Halmos, ..., Euclid.


AI provers (math models) are an entirely different class of model from LLMs. This paper and method are about getting language models to be better at math without changing their architecture.

Math models already do math much better than language models.


> This paper and method is about getting language models to be better at math without changing their architecture.

My understanding/memory is that already decades ago there was some software that was good at doing calculus manipulations. The software was Mac... something or other. Just now a search gave the name Macsyma and a PDF Reference Manual at

https://people.eecs.berkeley.edu/~fateman/macsyma/docs/refma...

So, right, long ago there was some software good at some of "Mathematical Reasoning" and, I'd believe, also "reliable".

For "mathematical reasoning" in an LLM (large language model), when ChatGPT 4 entered the news I gave ChatGPT 4 two exercises, one from high school plane geometry and one from first calculus. The results for "mathematical reasoning" were no progress at all.

From what I know about the LLM approaches and about mathematical reasoning, I don't see any promise for "reliable" results. Then I repeat, math results that are not "reliable" will be about as welcome as week old fish.

But just now there is a lot of interest in LLMs as a case of AI. As I glanced at some of the news, apparently the value of Nvidia is now $1 trillion. Sooo, a lot of interest.

And we can regard the work on LLMs as research and expect in a few years either (1) reliable mathematical reasoning or (2) not and, thus, evidence that LLMs are not a promising path to such reasoning and AI. We can spend years concluding (2).

Heck, I participated in an earlier popular approach to AI. I wrote software, gave talks at Wharton and Stanford, published papers. But I kept looking for anything worthwhile in that work and never found any. Now that work is no longer "popular" and apparently nearly no one believes that there is anything worthwhile there.

I'd like to see some progress in software for automation of significant cases of reliable mathematical reasoning, but, sorry, again from what I know about such reasoning and LLMs, I just don't see LLMs as a promising path to such reasoning.

So, it is likely the case that for LLMs doing significant cases of reliable mathematical reasoning, people can believe me now or spend some months/years and believe me later.

I'd like to see some promising paths to the desired reasoning, with/without the LLM "architecture".


Can you summarize what you are saying?


Summary:

I'd like to see some progress in software for automation of significant cases of reliable mathematical reasoning, but, sorry, from what I know about such reasoning and LLMs, I just don't see LLMs as a promising path to such reasoning.

I'd like to see some promising paths to the desired reasoning, with/without the LLM "architecture".

Maybe that's a good summary.


Thanks! I wonder if it would work better on proofs expressed in a formal proof-assistant language. Like, if we took something with GPT's training but also specifically exposed it to large amounts of Lean/Agda/Coq/whatever, might it learn to construct proofs in the formal language better than it can in English/LaTeX?
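
For readers who haven't seen a proof assistant: the appeal is that a formal proof either type-checks or it doesn't, so "reliable" becomes mechanical. A deliberately trivial Lean 4 example of the kind of artifact such a model would have to emit:

    -- The kernel either accepts these or rejects them; there is no
    -- "plausible but wrong" middle ground for the model to hide in.
    theorem two_add_two : 2 + 2 = 4 := rfl

    -- A step up: discharged by an existing library lemma.
    example (a b : Nat) : a + b = b + a := Nat.add_comm a b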


Commonly in high school, students learn to do proofs in a few days/weeks from a few dozen examples.

Sorry, best I can guess, the basic technology of ChatGPT can't learn an equivalent way of doing proofs even from millions of proofs. It would be really interesting if the ChatGPT learning could learn proof writing, but my guess is that with ChatGPT too much is missing.

Here is a point: Earth is awash in animals with brains that are built a lot like human brains, but I can't think of even one such animal that has a chance of keeping up with a standard course in plane geometry. Why? Something is missing, humans have it and only humans.

Yes, your suggestion of adding structure to the data has a chance. Maybe with enough structure proof writing can reduce to looking for a path through a network, roughly what, say, a chess program has to do.
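
That "path through a network" framing is, loosely, what tactic-based provers already do: start from a statement, apply inference rules, and search the resulting tree. A minimal sketch of proof search as plain graph search; the rules here are toy rewrites, not a real logic:

    from collections import deque

    # Toy "inference rules": each maps a statement to statements derivable from it.
    rules = {
        "A": ["B", "C"],
        "B": ["D"],
        "C": ["E"],
        "E": ["D"],
    }

    def find_proof(start, goal):
        # Breadth-first search over derivations; a "proof" is just a path from
        # the premises to the goal, much like a line in a chess engine's tree.
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in rules.get(path[-1], []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None

    print(find_proof("A", "D"))  # e.g. ['A', 'B', 'D']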



