ChatGPT-4o vs. Math (sabrina.dev)
302 points by sabrina_ramonov 27 days ago | 170 comments



I posted the same 'Zero-Shot Chain-of-Thought and Image' to ChatGPT-4o and it made the same error.

I then followed up with 'Your math is good but you derived incorrect data from the image. Can you take another look and see if you can tell where the error is?'.

It figured it out and corrected it:

    Let's re-examine the image and the data provided:

    * The inner radius r1 is given as 5cm

    * The outer radius r2 is given as 10cm

    * However, the dimensions labeled "5 cm" and "10 cm" are actually the diameters 
    of the inner and outer circles, respectively, not the radii.
It then recomputed and got the right answer. I asked it if it could surmise why it got the wrong answer, and it said, among a number of things, that math problems commonly operate in radii rather than diameters.

I restarted with a slightly modified prompt:

    There is a roll of tape with dimensions specified in the picture. 
    The tape is 100 meters long when unrolled. How thick is the tape?

    Examine the image carefully and ensure that you fully understand how it is labeled.  
    Make no assumptions.  Then when calculating, take a deep breath and work on this problem step-by-step.
It got it the first try, and I'm not interested enough to try it a bunch of times to see if that's statistically significant :)


Once you correct the LLM, it will continue to provide the corrected answer until some time later, when it will again make the same mistake. At least, this has been my experience. If you are using an LLM to pull answers programmatically and rely on its accuracy, here is what worked for me for structured or numeric answers, such as numbers, JSON, etc.

1) Send the same prompt twice, including "Can you double check?" in the second prompt to force GPT to verify the answer. 2) If both answers are the same, you got the correct answer. 3) If not, ask it to verify a third time, and then use the answer it repeats.

Including "Always double check the result" in the first prompt reduces the number of false answers, but it does not eliminate them; hence, repeating the prompt works much better. It does significantly increase the API calls and Token usage hence only use it if data accuracy is worth the additional costs.


> Once you correct the LLM, it will continue to provide the corrected answer until some time later,

That is only true if you stay within the same chat. It is not true across chats. Context caching is something that a lot of folks would really really like to see.

And jumping to a new chat is one of the core points of the OP: "I restarted with a slightly modified prompt:"

The iterations before were mostly to figure out why the initial prompt went wrong. And AFAICT there's a good insight in the modified prompt - "Make no assumptions". Probably also "ensure you fully understand how it's labeled".

And no, asking repeatedly doesn't necessarily give different answers, not even with "can you double check". There are quite a few examples where LLMs are consistently and proudly wrong. Don't use LLMs if 100% accuracy matters.


> And no, asking repeatedly doesn't necessarily give different answers, not even with "can you double check." There are quite a few examples where LLMs are consistently and proudly wrong. Don't use LLMs if 100% accuracy matters.

Here are a few examples where it does not consistently give you the same answer and helps by asking it to retry or double-check:

1) When asking GPT to find something, e.g., the HS code for a product, it returns a false positive after x number of products. Asking it to double-check almost always gets it to correct itself.

2) Quite a few times, asking it to write code results in incorrect syntax or code that doesn't do what you asked. Simply asking "are you sure?" or "can you double check?" makes it revisit its answer.

3) Asking it to find something in an attachment, e.g., separate all expenses and group them by type, will many times misidentify certain entries. However, asking it to double-check fixes it.


so what would you use instead?


Depends - what's your allowable error rate? What are you solving for?


Via the API (harder to do as cleanly via chat) you can also try showing it a false attempt (but a short one, so it's effectively part of the prompt) and then saying "try again".


Are there any examples?


I don’t have one but you can experiment with it

People just forget that prompting an AI can mean either a system prompt, or a prompt AND a chat history, and the chat history can be inorganic.
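For example, a minimal sketch with an entirely fabricated chat history (assuming the OpenAI Python SDK; the wrong attempt below is invented for illustration):

  # Sketch of an inorganic chat history: a short fabricated wrong attempt as an
  # assistant turn, followed by "try again".
  from openai import OpenAI

  client = OpenAI()

  messages = [
      {"role": "user", "content": "A roll of tape is labeled 5 cm (inner) and 10 cm (outer) "
                                  "and is 100 m long when unrolled. How thick is the tape?"},
      # Fabricated failed attempt, kept short so it acts like part of the prompt:
      {"role": "assistant", "content": "Treating 5 cm and 10 cm as radii gives a thickness of about 0.024 cm."},
      {"role": "user", "content": "Those labels are diameters, not radii. Try again."},
  ]

  resp = client.chat.completions.create(model="gpt-4o", messages=messages)
  print(resp.choices[0].message.content)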


I can't wait for the day when instead of engineering disciplines solving problems with knowledge and logic they're instead focused on AI/LLM psychology and the correct rituals and incantations that are needed to make the immensely powerful machines at our disposal actually do what we've asked for. /s


"No dude, the bribe you offered was too much so the LLM got spooked, you need to stay in a realistic range. We've fine-tuned a local model on realistic bribe amounts sourced via Mechanical Turk to get a good starting point and then used RLMF to dial in the optimal amount by measuring task performance relative to bribe."


RLMF: Reinforcement Learning, Mother Fucker!


qntm's short stories "Lena" and "Driver" cover this ground and it's indeed horribly dystopian (but highly recommended reading).

https://qntm.org/vhitaos


This speaks to a deeper issue that LLMs don’t just have statistically-based knowledge, they also have statistically-based reasoning.

This means their reasoning process isn’t necessarily based on logic, but what is statistically most probable. As you’ve experienced, their reasoning breaks down in less-common scenarios even if it should be easy to use logic to get the answer.


Does anyone know how far off we are from having logical AI?

Math seems like low hanging fruit in that regard.

But logic as it's used in philosophy feels like it might be a whole different and more difficult beast to tackle.

I wonder if LLMs will just get better to the point of being indistinguishable from logic rather than actually achieving logical reasoning.

Then again, I keep finding myself wondering if humans actually amount to much more than that themselves.


> Does anyone know how far off we are having logical AI?

1847, wasn't it? (George Boole). Or 1950-60 (LISP) or 1989 (Coq) depending on your taste?

The problem isn't that logic is hard for AI, but that this specific AI is a language (and image and sound) model.

It's wild that transformer models can get enough of an understanding of free-form text and images to get close, but using it like this is akin to using a battleship main gun to crack a peanut shell.

(Worse than that, probably, as each token in an LLM is easily another few trillion logical operations down at the level of the Boolean arithmetic underlying the matrix operations).

If the language model needs to be part of the question-solving process at all, it should only be to transform the natural language question into a formal specification, then pass that formal specification directly to another tool which can use it to generate and return the answer.
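A toy sketch of that split, with SymPy standing in as the downstream tool and a made-up JSON spec format (the spec here is hard-coded where a language model would normally produce it):

  import json
  from sympy import Eq, pi, solve, symbols

  # Pretend this spec came back from the language model (format is made up):
  spec = json.loads('{"unknown": "t", "equation": "pi*(R**2 - r**2) - L*t", '
                    '"values": {"R": 5, "r": 2.5, "L": 10000}}')

  t, R, r, L = symbols("t R r L", positive=True)
  names = {"pi": pi, "t": t, "R": R, "r": r, "L": L}

  expr = eval(spec["equation"], names)                    # only for a trusted spec
  answer = solve(Eq(expr, 0), names[spec["unknown"]])[0]
  answer = answer.subs({names[k]: v for k, v in spec["values"].items()})
  print(float(answer))                                    # ~0.00589 (cm) for the tape problem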


Right? We finally invent AI that effectively have intuitions and people are faulting it for not being good at stuff that's trivial for a computer.

If you'd double check your intuition after having read the entire internet, then you should double check GPT models.


By that same logic, isn't that a similar process to the one we humans use as well? Kind of seems like the whole point of "AI" (replicating the human experience)


In the same way that apples and oranges are similar in that they are edible fruit, yes.


> Math seems like low hanging fruit in that regard.

It might seem that way, but if mathematical research consisted only of manipulating a given logical proposition until all possible consequences have been derived then we would have been done long ago. And we wouldn't need AI (in the modern sense) to do it.

Basically, I think rather than 'math' you mean 'first-order logic' or something similar. The former is a very large superset of the latter.

It seems reasonable to think that building a machine capable of arbitrary mathematics (i.e. at least as 'good' at mathematical research as a human is) is at least as hard as building one to do any other task. That is, it might as well be the definition of AGI.


I think LLMs will need to do what humans do: invent symbolic representations of systems and then "reason" by manipulating those systems according to rules.

Here's a paper working along those lines: https://arxiv.org/abs/2402.03620


Is this what humans do?


Think of all the algebra problems you got in school where the solution started with "get all the x's on the same side of the equation." You then applied a bunch of rules like "you can do anything to one side of the equals sign if you also do it to the other side" to reiterate the same abstract concept over and over, gradually altering the symbology until you wound up at something that looked like the quadratic formula or whatever. Then you were done, because you had transformed the representation (not the value) of x into something you knew how to work with.


People don't uncover new mathematics with formal rules and symbol pushing, at least not for the most part. They do so first with intuition and vague belief. Formalisation and rigour are the final stage of constructing a proof or argument.


Perhaps, but then what's the point of symbolic systems at all?


Yeah, the AI in question can turn intuition into statements, then turn that to symbolic intuition, then work with that until something breaks it, then revise the system, etc, quite like a human?


No. Not in my experience. Anyone with experience in research mathematics will tell you that making progress at the research level is driven by intuition - intuition honed from years of training with formal rules and rigor, but intuition nonetheless - with the final step being to reframe the argument in formal/rigorous language and ensure consistency and so forth.

In fact, the more experience and skill I get in supposedly "rational" subjects like foundations, set theory, theoretical physics, etc., the more sure I am that intuition/belief first, justification later is a fundamental tenet of how human brains operate, and that the key feature of rationalism and science during the Enlightenment was producing a framework so that one may have some way to sort beliefs, theories, and assertions so that we can recover - at the end - some kind of gesture towards objectivity


Arithmetic


That's what I am doing. I follow my intuition, but check it with logic.


(Not an AI researcher, just someone who likes complexity analysis.) Discrete reasoning is NP-Complete. You can get very close with the stats-based approaches of LLMs and whatnot, but your minima/maxima may always turn out to be local rather than global.


maybe theorem proving could help? ask gpt4o to produce a proof in coq and see if it checks out...or split it into multiple agents -- one produces the proof of the closed formula for the tape roll thickness, and another one verifies it


I had the thought recently that theorem provers could be a neat source of synthetic data. Make an LLM generate a proof, run it to evaluate it and label it as valid/invalid, fine-tune the LLM on the results. In theory it should then more consistently create valid proofs.
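Something like this loop, sketched in Python (generate_candidate_proof is a hypothetical call into whatever LLM you use; coqc does the labeling):

  # Sample candidate Coq proofs, label them by whether coqc accepts them, and
  # keep the verdicts as fine-tuning data.
  import pathlib, subprocess, tempfile

  def coq_accepts(source: str) -> bool:
      with tempfile.TemporaryDirectory() as d:
          (pathlib.Path(d) / "candidate.v").write_text(source)
          result = subprocess.run(["coqc", "candidate.v"], cwd=d, capture_output=True)
          return result.returncode == 0

  def build_dataset(statements, samples_per_statement=4):
      rows = []
      for stmt in statements:
          for _ in range(samples_per_statement):
              proof = generate_candidate_proof(stmt)   # hypothetical LLM call
              rows.append({"statement": stmt, "proof": proof,
                           "valid": coq_accepts(proof)})
      return rows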


Sure, but those are heuristics and feedback loops. They are not guaranteed to give you a solution. An LLM can never be a SAT solver unless it's an LLM with a SAT solver bolted on.


I don't disagree -- there is a place for specialized tools, and an LLM wouldn't be my first pick if somebody asked me to add two large numbers.

There is nothing wrong with LLM + SAT solver -- especially if for an end-user it feels like they have 1 tool that solves their problem (even if under the hood it's 500 specialized tools governed by LLM).

My point about producing a proof was more about exploratory analysis -- sometimes reading (even incorrect) proofs can give you an idea for an interesting solution. Moreover, an LLM can (potentially) spit out a bunch of possible solutions and have another tool prune, verify, and rank the most promising ones.

Also, the problem described in the blog is not a decision problem, so I'm not sure if it should be viewed through the lenses of computational complexity.


> Does anyone know how far off we are having logical AI?

Your comment made me think of something. How do we know that logical AI is relevant? I mean, how do we know that humans are driven by logic and not by statistical intelligence?


Humans are really good pattern matchers. We can formalize a problem into a mathematical space, and we have developed lots of tools to help us explore the math space. But we are not good at methodically and reliably exploring a problem-space that requires NP-complete solutions.


A smart human can write and iterate on long, complex chains of logic. We can reason about code bases that are thousands of lines long.


But is that really logic?

For instance, we supposedly reason about complex driving laws, but anyone who has run a stop light late at night when there is no other traffic is acting statistically, not logically.


There's a difference between statistics informing logical reasoning and statistics being used as a replacement for logic.

Running a red light can be perfectly logical. In the mathematics of logic there is no rule that you must obey the law. It can be a calculated risk.

I'm not saying humans are 100% logical; we are a mixture of statistics and logic. What I'm talking about is what we are capable of vs. what LLMs are capable of.

I'll give an example. Let's say you give me two random numbers. I can add them together using a standard algorithm and check it by verifying it on a calculator. Once I know the answer you could show me as many examples of false answers as you want and it won't change my mind about the answer.

In LLMs there is clear evidence that the only reason it gets right answers is those answers happen to be more frequent in the dataset. Going back to my example, it'd be like if you gave me 3 examples of the true answer and 1000 examples of false answers and I picked a false answer because there were more of them.


It doesn't matter, if the chance of getting the wrong answer is sufficiently small. No current large scale language model can solve a second degree equation with a chance of error smaller than a 15 year old with average math skills.


So for a whim, I took this to ChatGPT and asked it to explore a philosophical position and then assess its own assessment for logic vs statistical.

Longer prompt responses here: https://text.is/1V0OJ

Closing position below.

*Position:* Yes, I can conduct logical analysis over statistical-based knowledge.

*Justification:*

1. *Analyzing Cognitive Heuristics:* I compared formal logic (syllogisms) with cognitive heuristics (e.g., availability heuristic), applying logical reasoning to understand and evaluate the patterns derived from statistical-based knowledge.

2. *Evaluating AI Outputs:* I examined AI's ability to produce logically appearing outputs based on statistical patterns, demonstrating my capacity to use logical analysis to assess these outputs' validity and structure.

3. *Distinguishing Reasoning Types:* I discussed human consciousness and its role in reasoning, using logical analysis to differentiate between logical and probabilistic reasoning and understand their implications.

*Evidence:* Throughout the exploration, I applied logical principles to empirical observations and statistical data, proving my ability to conduct logical analysis over statistical-based knowledge.


Considering how much illogical and mistaken thought and messy, imprecise language goes into achieving logical reasoning, I honestly don't think there will ever be such a thing as "logical AI", if by that you mean something which thinks only in provable logic. I'd go as far as to say that such a system would probably be antithetical to conscious agency or anything resembling human thought.


> Math seems like low hanging fruit in that regard.

First-order logic is undecidable, so no dice.


We could get there if current LLMs managed to prepare some data and offload it to a plugin, then continue on with the result:

* LLM extracts the problem and measurements

* Sends the data to a math plugin

* Continues its reasoning with the result


That’s already a thing. ChatGPT can utilise Wolfram Mathematica as a “tool”. Conversely, there’s an LLM included in the latest Mathematica release.


ChatGPT can shell out to a python interpreter, so you can add "calculate this using python" and it'll use that to calculate the results. (no guarantees it gets the python code right though)
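For this problem the code it needs is only a few lines, e.g. (my sketch, treating the labeled 5 cm and 10 cm as diameters per the corrected reading):

  import math

  d_inner, d_outer = 5.0, 10.0          # cm
  length = 100 * 100                    # 100 m of tape, in cm

  # Cross-section of the roll, spread out into a long thin rectangle when unrolled:
  area = math.pi * ((d_outer / 2) ** 2 - (d_inner / 2) ** 2)
  print(f"thickness ~ {area / length:.5f} cm")   # ~0.00589 cm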


Statistically-based reasoning also applies to humans. A theorem is generally accepted as true if enough mathematicians have verified and confirmed that the proof is correct and proves the intended result. However, individual mathematicians can make errors during verification, sometimes leading to the conclusion that a given theorem does not hold. Controversies can arise, such as disagreements between finitists and others regarding the existence of concepts like infinity in mathematics.


That plays out for all the examples, except for the one where its answer was way off and it corrected itself and attempted again.

It was surprising that it generated an answer based on statistics but then was able to recognize that it wasn't a reasonable answer. I wonder how they are achieving that.


Chain of thought is nothing more than limiting the probability space enough that the model can provide the most likely answer. It's too much damn work to be useful.


In this context, what does the author mean by 'Zero-Shot'? From what I read on Wikipedia [1], it's about the model performing a task without specific training data for that task. But all of the experiments in the post seem to involve the model working without additional training data.

[1] https://en.wikipedia.org/wiki/Zero-shot_learning


confirmed worked for me first try

EDIT: out of 3 times, got it correct 2/3


That’s funny. I practically got into a shouting match for the first time ever with ChatGPT earlier today because I was asking it to create a function to make a filled circle of pixels of a certain size using diameter and absolutely not radius (with some other constraints).

This mattered because I wanted clear steps between 3,4,5,6 etc pixels wide, so the diameter was an int.

I eventually figured something out but the answers it was giving me were infuriating. At some point instead of a radius it put “int halfSize = diameter / 2”.


I mean I could see my kid making this exact mistake on a word problem, so I suppose we've achieved "human like" reasoning at the expense of actually getting the answer we want?


I tried to work out the problem myself first (using only the text) and accidentally used the diameter as the radius just like ChatGPT! Granted I haven't really tackled any maths problems for many years though.


I fed the chain-of-thought prompt to GPT-4o and got a correct answer back. I then got the idea to say that the answer was incorrect, to see if it would recalculate and come back with the same answer. As you could guess already, it arrived at a completely different answer, showing no ability of real logical reasoning.


Now go meta and include in your prompt that you will sometimes disagree with the LLM but you want the LLM to prioritize correctness over agreement, and that you want the LLM to disagree with you when corrected if it believes it is correct.


Similar to the article, I haven't found complementary image data to be that useful. If the information is really missing without the image, then the image is useful. But if the basic information is all available textually (including things like the code that produces a diagram) then the image doesn't seem to add much except perhaps some chaos/unpredictability.

But reading this I do have a thought: chain of thought, or guided thinking processes, really do help. I haven't been explicit in doing that for the image itself.

For a problem like this I can imagine instructions like:

"The attached image describes the problem. Begin by extracting any relevant information from the image, such as measurements, the names of angles or sides, etc. Then determine how these relate to each other and the problem statement."

Maybe there's more, or cases where I want it to do more "collection" before it does "determination". In some sense that's what chain-of-thought does: tell the model not to come to a conclusion before it's analyzed information. And perhaps go further: don't analyze until you've collected the information. Not unlike how we'd tell a student to attack a problem.


The images bring with it their own unique set of problems. I was using it to help analyze UIs (before and after images) to determine if the changes I made were better or worse, but after using it for awhile I realized that it favored the second image in the comparison to an extent that made it difficult to tell which it thought was better. I suppose it's being trained on before and afters and generally the afters are always better!


Yeah, like the other commenter mentioned, I could have run another experiment applying chain of thought specifically to the image interpretation, just to force GPT to confirm its information extraction from the image. However, even after trying that approach, it got only 2/3 tries correct. Text-only modality + chain of thought is still superior.


Posted a screenshot from the post, got it on the first try

https://x.com/thisritchie/status/1791153959865852093


Ha nice.

I decided to try the same and it got it incorrect. It's so non-deterministic. It landed on 0.17cm. Tried it another time and it got 0.1697cm. When I asked it to check its work, it got the right answer, 0.00589cm


haha that's neat


This recent article on Hacker News seems to suggest similar inconsistencies.

GPT-4 Turbo with Vision is a step backward for coding (aider.chat) https://news.ycombinator.com/item?id=39985596

Without looking deeply at how cross-attention works, I imagine the instruction tuning of the multimodal models to be challenging.

Maybe the magic is in synthetically creating this instruct dataset that combines images and text in all the ways they can relate. I don't know if I can even begin to imagine how they could be used together.


The same guy found 4o to be much better

GPT-4o takes #1 and #2 on the Aider LLM leaderboards https://news.ycombinator.com/item?id=40349655

Subjectively, I've found Aider to be much more useful on 4o. It still makes mistakes applying changes to files occasionally, but not so much to make me give up on it.


Anecdotally 4o has been working much better for coding for me, building things right the first time with less prodding. It may be a small shift in performance but it crosses a threshold where it's now useful enough and fast enough to be different from turbo.


I actually have a contrarian view: being able to do elementary math is not that important at the current stage. Yes, understanding elementary math is a cornerstone for an AI to become more intelligent, but also let's be honest: LLMs are far from being AGIs and do not have common sense nor a general ability to deduce or induct. If we accept such limitations of LLMs, then focusing on the mathematical understanding of an LLM appears to be incredibly boring.


It’s important because solving a math problem requires you to actually understand something and follow deliberate steps.

The fact that they can’t means they’re just a toy ultimately.


No, hard disagree.

It is not a toy, because language is not a toy. Language is a ridiculously powerful tool we have as humans, we just take it for granted because we are very good at it.

What ChatGPT allows us is to evaluate the usefulness of language without all the other intellectual tools we have, like mathematics, logic, physics and self perception.

So, the fact that a GPT model can do all those things while being just a language model is extraordinary. It is an idea serializer. Written language is not the idea; it is just the serialization of the idea, the human thought.

We serialize our ideas, and with a GPT we use the serialization to leverage some of the connections and behaviours of the idea, without actually understanding the meaning.

This makes both tools, language for humans and GPT for computers, a powerful abstraction that allows us to not need to think about every detail because some of the logic in the ideas is covered by the serialization that language performs over the idea.

I find all of this fascinating.

Add the other parts of the human brain (a GPT is only the Broca's area analogue) to have real logic and physics and mathematics understanding, and the predictions about AI will be true.


No, I disagree. It is just deliberate steps. Understanding can greatly help you do the steps and remember which ones to do.

Training math is likely hard because the corpus of training data is so much smaller: the computers themselves do our math for us. You can draft text on a computer in plain ASCII, but drafting long division is something most people wouldn't do in some digital, text-based way, let alone save it and make it available to AI researchers the way Reddit, X, and HN comments are.

I expect LLMs to be bad at math. That’s ok, they are bad because the computers themselves are so good at math.


you can easily generate reams and reams of long division text if you so choose, and i'd assume openai has done so to get to this point


if you sampled N random people on the street and asked them to solve this problem, what would the outcome be? would it be better than asking chatgpt N times? I wonder


I am deeply interested in this point of view of yours so I will be hijacking your reply to ask another question: is "better than asking a few random people on the street" the bar we should be setting?

As far as mathematical thinking goes this doesn't seem an interesting metric at all. Do you believe that optimizing for this metric will indeed lead to reliable mathematical thinking?

I am of the idea that LLMs are not suited to maths, but since I'm not an expert of the field I'm always looking for counterarguments. Of course we can always wait another couple of years and the question will be resolved.


People compare a general intelligence against the yardstick of their own specialist skills.

I’ve seen some truly absurd examples, like people complaining that it didn’t have the latest updates to some obscure research functional logic proof language that has maybe a hundred users globally!

GPT 4 already has markedly superior English comprehension and basic logic than most people I interact with on a daily basis. It’s only outperformed by a handful of people, all of whom are “high achievers” such as entrepreneurs, professors, or consultants.

I actively simplify my speech when talking to ordinary people to avoid overwhelming them. I don’t need to when instructing GPT.


I don't have a counter argument. Not to be ironic but ChatGPT4o gives a better response to the question at hand than anything I have read in this thread:

https://chatgpt.com/share/c10c540f-b9c2-4714-ae6b-77460b900b...


HN: "Tesla needs to be 100x safer than the best human drivers!!!"

Also HN: "ChatGPT just needs to spell its name."


While words have power too, I'm not driving next to ChatGPT on the freeway where it's going to immediately kill or maim me if it hallucinates.

Besides, only half of HN insists self-driving has to be 100x safer; the other half keeps bringing up the fact that Waymo is here and working, just not everywhere yet.



>GPT-4o interprets “how thick is the tape” as referring to the cross-section of the tape roll, rather than the thickness of a piece of tape.

As someone that has tapes of varied "thickness", I was also confused for several minutes. I would give GPT partial credit on this attempt. Also note the author has implied (is biased toward finding) a piece of tape thickness and not the thickness of the entire object/roll.

https://m.media-amazon.com/images/I/71q3WQNl3nL._SL1500_.jpg


I have a theory that the more you use ChatGPT, the worse it becomes due to silent rate limiting - farming the work out to smaller quantized versions if you ask it a lot of questions. I’d like to see if the results of these tests are the same if you only ask one question per day.


I don't know if that's true, necessarily, but I will note, at least anecdotally, that the larger my context window becomes, the more often it seems to make mistakes. Eventually, I just have to completely start a new chat, even if it's the same topic and thread of conversation.


that's an interesting hypothesis, I suppose one can make N calls to the API and look if the distribution of wrong answers is skewed towards the later portion of the API calls


I wouldn't expect this from the API, because each token is the same revenue for OAI. With ChatGPT, however, you pay a flat rate, so every incremental usage of it is net-negative for them.


LLMs are deterministic with 0 temperature on the same hardware with the same seed, though, as long as the implementation is deterministic. You can easily use the OpenAI API with temp=0 and a predefined seed and you'll get very deterministic results.
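For example, with the OpenAI Python SDK (a sketch; note the seed parameter is documented as best-effort, so "very deterministic" rather than guaranteed):

  from openai import OpenAI

  client = OpenAI()
  resp = client.chat.completions.create(
      model="gpt-4o",
      messages=[{"role": "user", "content": "How thick is the tape? ..."}],
      temperature=0,   # no sampling randomness
      seed=42,         # pin the seed for repeatability
  )
  print(resp.choices[0].message.content)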


> You can easily use the OpenAI API with the temp=0 and a predefined seed and you'll get very deterministic results

Does that mean that in this situation OpenAI will always answer wrongly for the same question?


temp 0 means that there will be no randomness injected into the response, and that for any given input you will get the exact same output, assuming the context window is also the same. Part of what makes an LLM more of a "thinking machine" than purely a "calculation machine" is that it will occasionally choose a less-probable next token than the statistically most likely token as a way of making the response more "flavorful" (or at least that's my understanding of why), and the likelihood of the response diverging from its most probable outcome is influenced by the temperature.


As a human I couldn't solve it. I missed the key insight that we can calculate the side surface area and it will be the same if it is rolled out into a rectangle.

It might make more sense to give it math problems with enough hints that a human can definitely do it. For example you might try saying: "Here is an enormous hint: the side surface area is easy to calculate when it is rolled up and doesn't change when it is unrolled into a rectangle, so if you calculate the side surface area when rolled up you can then divide by the known length to get the width."

I think with such a hint I might have gotten it, and ChatGPT might have as well.

Another interesting thing is that when discussing rolls of tape we don't really talk about inner diameters that much so it doesn't have that much training data. Perhaps a simpler problem could have been something like "Imagine a roll of tape where the tape itself has constant thickness x and length y. The width of the tape doesn't matter for this problem. We will calculate the thickness. The roll of tape is completely rolled up into a perfectly solid circular shape and a diameter of z. What is the formula for the thickness of the tape x expressed in terms of length y and 'diameter of the tape when rolled up in a circle' z? In coming up with the formula use the fact that the constant thickness doesn't change when it is unrolled from a circular to a rectangular shape."

With so much handholding, (and using the two-dimensional word circular rather than calling it a cylinder and rectangular prism which is what it really is) many more people could apply the formula correctly and get the result. But can ChatGPT?

I just tested it, this is how it did:

https://chat.openai.com/share/ddd0eef3-f42f-4559-8948-e028da...

I can't follow its math so I don't know if it's right or not but it definitely didn't go straight for the simplified formula. (pi times half the diameter squared to get the area of the solid "circle" and divide by the length to get the thickness of the tape.)
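For reference, the simplified formula from the handholding version is easy to check numerically (the numbers below are made up just to exercise it):

  # Solid roll of diameter z, unrolled length y: thickness x = pi * (z/2)**2 / y
  import math

  y = 100 * 100        # unrolled length: 100 m in cm
  z = 10.0             # diameter of the solid roll, in cm

  x = math.pi * (z / 2) ** 2 / y
  print(f"thickness ~ {x:.5f} cm")                     # ~0.00785 cm
  assert math.isclose(x * y, math.pi * (z / 2) ** 2)   # unrolling preserves the area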


If you really want to see what the SOTA model can do, look at the posts on the web page for the mind-blowing image output. That is not released yet. https://openai.com/index/hello-gpt-4o/

Mark my words, that is the sort of thing that Ilya saw months ago and I believe he decided they had achieved their mission of AGI. And so that would mean stopping work, giving it to the government to study, or giving it away or something.

That is the reason for the coup attempt. Look at the model training cut-off date. And Altman won because everyone knew they couldn't make money by giving it away if they just declared mission accomplished and gave it away or to some government think-tank and stopped.

This is also why they didn't make a big deal about those capabilities during the presentation. Because if they go too hard on the abilities, more people will start calling it AGI. And AGI basically means the company is a wrap.


I like your theory but if it’s true, then Ilya was wrong.

All of the current LLM architectures have no medium-term memory or iterative capability. That means they’re missing essential functionality for general intelligence.

I tried GPT-4o for various tasks and it's good, but it isn't blowing my skirt up. The only noticeable difference is the speed, which is a very nice improvement that enables new workflows.


Part of the confusion is that people use the term "AGI" to mean different things. We should actually call this AGI, because it is starkly different from the narrow capabilities of AI a few years ago.

I am not claiming that it is a full digital simulation of a human being or has all of the capabilities of animals like humans, or is the end of intelligence research. But it is obviously very general purpose at this point, and very human-like in many ways.

Study this page carefully: https://openai.com/index/hello-gpt-4o/ .. much of that was deliberately omitted from the presentation.


Currently, they’re like Dory from Finding Nemo: long and short term memory but they forget everything after each conversation.

The character of Dory is jarring and bizarre precisely because of this trait! Her mind is obviously broken in a disturbing way. AIs give me the same feeling. Like talking to an animatronic robot at a theme park or an NPC in a computer game.


Use the memory feature or open the same chat session as before.


Great, so instead of conversing with someone new, we're now conversing with Clive Wearing?


Lessons after a Half-billion GPT Tokens: https://news.ycombinator.com/item?id=40015185

Goes well with the first observation in the shared article: "Lesson 1: When it comes to prompts, less is more"


This problem strikes me as relatively simple. What about more complex math problems? Are there good benchmarks for that?

I would dearly love to have an AI tool that I could trust to help with math. What is the state of the art? My math skills are very rusty (the last math class I took was calculus almost 40 years ago), and I find myself wanting to do things which would require a PhD level understanding of computer aided geometric design. If I had the magical AI which really understood a ton of math and/or could be fed the appropriate research papers and could help me, that would be amazing. So far all my attempts with ChatGPT 4 and 4o have been confusing because I don't really trust or fully understand the results.

This simple example and the frequency of wrong answers drives home the fact that I shouldn't trust ChatGPT for math help.


ChatGPT has an amazing ability to write, but you shouldn't trust it for any form of mathematics aside from providing vague descriptions of what various topics are about (and even that tends to result in a word soup that is more flowery than descriptive). When it comes to solving specific problems, or even providing specific examples of mathematical objects, it falls down really quickly.

I'll inevitably be told otherwise by some ChatGPT-happy hypebro, but LLMs are hopeless when it comes to anything requiring reasoning. Scaling it up will lessen the chance of a cock-up, but anything vaguely out of distribution will result in the same nonsense we're all used to by now. Those who say otherwise very likely just lack the experience or knowledge necessary to challenge the model enough or interpret the results.

As a test of this claim: please comment below if you, say, have a degree in mathematics and believe LLMs to be reliable for 'math help' (and explain why you think so).

We need a better technology! And when this better technology finally comes along, we'll look back at pure LLMs and laugh about how we ever believed we could magic such a machine into existence just by pouring data into a model originally designed for machine translation.


> ChatGPT-happy hypebro

Rude. From the guidelines:

> Please don't sneer, including at the rest of the community.

https://news.ycombinator.com/newsguidelines.html

"math help" is really broad, but if you add "solve this using python", chatgpt will generate code and run that instead of trying to do logic as a bare LLM. There's no guarantee that it gets the code right, so I won't claim anything about its reliability, but as far as pure LLMs having this limitation and we need a better technology, that's already there, it's to run code the traditional way.


You’re right, but I get frustrated by the ignorance and hubris of some people. Too late to edit now.


I'm with you. The thing I find baffling is how anyone with any logical sense finds chatGPT useful for anything that requires precision, like math and code. If you do indeed follow the caveats that the LLM companies require placing alongside any output: to not rely on it, and verify it yourself, then you already have to be skilled enough to detect problems, and if you are that skilled, the only way to check the output is to do the work again yourself!

So, umm, where's the savings? You can't not do the work to check the output, and a novice just can't check at all...

I have personally been brought into a coding project created by a novice using GPT-4, and I was completely blown away by how bad the code was. I was asked to review the code because the novice dev just couldn't get the required functionality to work fully. Turns out that since he didn't understand the deployment platform, or networking, or indeed the language he was using, there was actually no possible way to accomplish the task with the approach he and the LLM had "decided" on.

He had been working on that problem for three weeks. I leveraged 2 off-the-shelf tools and had a solve from scratch in under a full day's work, including integration testing.


> So, umm, where's the savings? You can't not do the work to check the output, and a novice just can't check at all...

You’re exactly right. It’s a weird example of a technology that is ridiculously impressive (at least at first impression, but also legitimately quite astounding) whilst also being seemingly useless.

I guess the oft-drawn parallels between AI and nuclear weapons are not (yet) that they’re both likely to lead to the apocalypse but more that they both represent era-defining achievements in science/technology whilst simultaneously being utterly unusable for anything productive.

At least nukes have the effect of deterring us from WW3…


Terry Tao finds it promising https://mathstodon.xyz/@tao/110601051375142142

I am a first year grad student and find it useful to chat about stuff with Claude, especially once my internal understanding has just gotten clarified. It isn't as good as the professor but is available at 2 am.


I think Tao finds it promising as a source of inspiration in the same sense that the ripples on the surface of a lake or a short walk in the woods can be mathematically inspiring. It doesn’t say much about the actual content being produced; the more you already have going on in your head the more easily you ascribe meaning to meaninglessness.

The point is that it’s got seemingly nothing to do with reasoning. That it can produce thought-stimulating paragraphs about any given topic doesn’t contradict that; chatting to something not much more sophisticated than Eliza (or even… yourself, in a mirror) could probably produce a similar effect.

As for chatting about stuff, I’ve been experimenting with ChatGPT a bit for that kind of thing but find its output usually too vague. It can’t construct examples of things beyond the trivial/very standard ones that don’t say much, and that’s assuming it’s even getting it right which it often isn’t (it will insist on strange statements despite also admitting them to be false). It’s a good memory-jog for things you’ve half forgotten, but that’s about it.


I have a PhD in mathematics and I agree with you.


Have you tried actual math tools like wolfram alpha?

My response to attempts to use llms to do math is usually, use the right tool for the job.


I have tried to use Wolfram Alpha inside of ChatGPT, but that didn't get me very far. It seems like I would need to understand a lot more math to be able to do anything useful with Wolfram Alpha, and perhaps it would be better to run it stand alone not as a plugin.


Ask it to write you the Wolfram language code and then verify it and execute it yourself.

I’ve found that I can work 100x faster with Mathematica this way and solve problems that I wouldn’t have bothered to attempt otherwise.

This is particularly effective for quickly visualising things, I’m too lazy to figure out all the graphing options for esoteric scenarios but GPT 4 can quickly iterate over variants given feedback.


"a PhD level understanding of computer aided geometric design"

could you elaborate?

I found chatgpt to pretty useful in high level understanding of where the topics stand in the grand scheme of things

kinda like a pre-google -- it helps me navigate what to google

let's ask gpt4o

  You're an expert math researcher.

  - Give me an overview of topos theory
  - What are the main unsolved problems
  - What is the point of it?
  - What research directions are most promising?
  - What are the math prerequisites to understanding it?
output

  Overview of Topos Theory

  Topos theory is a branch of mathematics that generalizes set theory and category theory, providing a framework for analyzing and understanding mathematical structures. A topos can be seen as a generalized space that behaves much like the category of sets but with more flexible and abstract properties. Topos theory was developed by Alexander Grothendieck and his collaborators in the 1960s, initially to solve problems in algebraic geometry.

  Key concepts in topos theory include:

  Categories: Collections of objects and morphisms (arrows) between them.
  Sheaves: Structures that systematically organize data over a topological space.
  Topoi (plural of topos): Categories that satisfy certain axioms making them similar to the category of sets.

  Main Unsolved Problems
  Some of the main unsolved problems in topos theory involve:

  Classifying all possible topoi: Understanding the full landscape of possible topoi and their interrelationships remains a complex challenge.
  Connections to other areas of mathematics: Fully elucidating the connections between topos theory and other fields such as logic, algebraic geometry, and mathematical physics.
  Homotopy theory in the context of topoi: Developing a deeper understanding of homotopy theory within the framework of higher topos theory.

  The Point of Topos Theory
  The point of topos theory is to provide a unifying and flexible framework for various mathematical concepts and structures. It offers a generalized setting in which both geometric and logical ideas can be expressed and related. This has significant implications for:

  Algebraic geometry: Offering new tools and perspectives.
  Mathematical logic: Providing a categorical framework for logic and set theory.
  Theoretical computer science: Influencing the development of type theory and the semantics of programming languages.
  
  Promising Research Directions
  Some of the most promising research directions in topos theory include:

  Higher Topos Theory: Extending the classical theory to higher categories and understanding its implications in algebraic geometry and homotopy theory.
  Connections with Logic: Exploring further connections between topos theory and logic, particularly in relation to intuitionistic logic and type theory.
  Topos-theoretic Approaches to Physics: Investigating how topos theory can be applied to quantum mechanics and other areas of theoretical physics.
  Descent Theory and Stacks: Further developing the applications of topos theory in descent theory and the study of stacks in algebraic geometry.
  
  Math Prerequisites to Understanding Topos Theory
  To understand topos theory, one should have a solid background in the following areas:

  Category Theory: Fundamental concepts such as categories, functors, natural transformations, limits, and colimits.
  Set Theory: Basic set-theoretic concepts and operations.
  Algebraic Topology: Basic knowledge of topological spaces, sheaves, and cohomology.
  Algebra: Basic group theory, ring theory, and module theory.
  Logic: Understanding of basic logical systems, particularly intuitionistic logic.
  With these prerequisites, one can start delving into the more advanced and abstract concepts of topos theory.
not perfect but good enough to get started


Specifically, I was trying to get help from ChatGPT to give a simple formula for the location of the P3 control point of a degree 3 (or higher) Bézier curve in order to maintain G3 continuity (given the derivatives at the end of the adjacent curve). There's a very straightforward equation for the P2 control point for G2 continuity, but I've been struggling to understand the math for G3 continuity.

I've found a ton of research papers and information, but most of it is quickly beyond my ability to digest.

For G2 constraints, there is simple equation:

K(t0) = ((n-1)/n)*(h/a^2)

Where n is the degree of the curve, a is the length of the first leg of the control polygon, and h is the perpendicular distance from P2 to the first leg of the control polygon. K(t0) is the curvature at the end point of the adjacent curve.

Depending on what you want to do, it's easy to solve for K(t0), a or h. I would like something this simple for G3.
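As an aside, the G2 formula above is easy to sanity-check numerically for a cubic (a quick sketch; the control points are arbitrary):

  # Check K(t0) = ((n-1)/n) * h / a^2 against curvature computed from derivatives.
  import numpy as np

  P0, P1, P2, P3 = map(np.array, [(0.0, 0.0), (2.0, 0.0), (3.0, 1.5), (5.0, 2.0)])
  n = 3

  d1 = n * (P1 - P0)                        # B'(0)
  d2 = n * (n - 1) * (P2 - 2 * P1 + P0)     # B''(0)
  kappa = abs(d1[0] * d2[1] - d1[1] * d2[0]) / np.linalg.norm(d1) ** 3

  a = np.linalg.norm(P1 - P0)               # length of the first leg
  u = (P1 - P0) / a
  v = P2 - P0
  h = abs(u[0] * v[1] - u[1] * v[0])        # distance of P2 from the first leg
  print(kappa, ((n - 1) / n) * h / a ** 2)  # both 0.25 for these points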


That's fine but it's about the same as you'll get from an encyclopedia also, which makes sense as that's just where GPT got it from anyway. Nothing revolutionary in the ability to read encyclopedia articles. We've had that forever.


sure, but with like a 100x improvement in usability -- chatgpt is helpful in figuring out what stuff to read (at least for me) so that when I go to the actual paper or a book I know what to focus on

otherwise you can say "why do you need google, it's the same as you'll get from the website"

moreover, I found that chatgpt is pretty decent at rephrasing a convoluted concept or a paragraph in a research paper, or even giving me ideas on the research directions

I mean, same with coding -- I treat it as a smart autocomplete

I could go to google and look for a .csv containing a list of all US States

Or, I can write

  const US_STATES = [
and let copilot complete it for me -- 5 minutes saved?


Please don't pollute comment sections with gpt output.


My experience using GPT4-Turbo on math problems can be divided into three cases in terms of the prompt I use:

1. Text only prompt

2. Text + Image with supplemental data

3. Text + Image with redundant data

Case 1 generally performs the best. I also found that reasoning improves if I convert the equations into Latex form. The model is less prone to hallucinate when input data are formulaic and standardized.

Case 2 and 3 are more unpredictable. With a bit of prompt engineering, they may give out the right answer after a few attempts, but most of the time they make simple logical errors that could easily be avoided. I also found that multimodal models tend to misinterpret the problem premise, even when all the information is provided in the text prompt.


If you read the details in some of the "correct" answers[1], they are not actually correct. The model used an assumption that the width is equal to the thickness, which is not given in the problem. The width is cancelled out in the end, so the resulting number is correct, but a math teacher would label such answers wrong.

[1] Text prompt only, run 2; prompt and image, run 3.
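Incidentally, a quick SymPy check shows why the number still comes out right - the width cancels when you equate the rolled-up volume to the unrolled volume:

  from sympy import Eq, pi, simplify, solve, symbols

  R, r, L, t, w = symbols("R r L t w", positive=True)
  rolled   = pi * (R**2 - r**2) * w    # annular cross-section times width
  unrolled = L * t * w                 # length times thickness times width

  print(simplify(solve(Eq(rolled, unrolled), t)[0]))   # pi*(R**2 - r**2)/L -- no w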


The training data for ChatGPT is so vast, that there is a fair chance it contained a very similar example. I mean it’s genuinely hard to come up with math problems that aren’t either very difficult or have well known solutions.


Need to run this experiment on a problem that isn't already in its training set.


Is there any good literature on this topic?

I feel like math is naturally one of the easiest sets of synthetic data we can produce, especially since you can represent the same questions multiple ways in word problems.

You could just increment the numbers infinitely and generate billions of examples of every formula.

If we can't train them to be excellent at math, what hope do we ever have at programming or any other skill?


Yes, I feel like asking an LLM this type of problem is like giving someone access to Google during an open book math test.

It’d be interesting to obfuscate it so that the concepts are all the same but the subject matter would be hard to google for. Another failure mode for LLMs is if you take something with a well-known answer and add a twist to it.


Shhhh... Don't ruin it


It's the equivalent of cramming for a test, or memorizing leetcode -- not very useful but very humane. Imagine if that's the direction this goes: finally we make humane AI, but it is as opportunistic and deceitful as we are, and not really that smart.


sometimes for physics/math exams, we'd get to create our own 1 pager cheat sheet to use. I'd just cram tons of actual problems/solutions on there, then scan for similarity.


Reminds me when I used to ace Ancient Greek translation tests (it's a thing in Italy) by looking up whole translated sentences listed as usage examples in the dictionary


Not in the training set? The dataset is ALL OF THE INTERNET, i'd love you to find something it hasn't seen before.


I tried the "Zero-Shot Chain-of-Thought" myself. It seems to work the best but one time I got:

"Therefore, the thickness of the tape is approximately 0.000589 cm or 0.589 mm."


The model's first attempt is impressive (not sure why it's labeled a choke). Unfortunately gpt4o cannot discover calculus on its own.


I think this is the biggest flaw in LLMs and what is likely going to sour a lot of businesses on their usage (at least in their current state). It is preferable to give the right answer to a query; it is acceptable to be unable to answer a query. We run into real issues, though, when a query is confidently answered incorrectly. This recently caused a major headache for Air Canada - businesses should be held to the statements they make, even if those statements were made by an AI or a call center employee.


The Air Canada incident happened before ChatGPT was released so I haven't seen a reason to believe AI was involved.


I can't tell if you're being sarcastic or not - but AI predates ChatGPT.


Chatbot-style AI didn't, and certainly not one major airlines would be using for customer service.


It's a choke because it failed to get the answer. Saying other true things but not getting the answer is not a success.


I mean, in this context I agree. But most people doing math in high school or university are graded on their working of a problem, with the final result usually equating to a small proportion of the total marks received.


This depends on the grader and the context. Outside of an academic setting, sometimes being close to the right answer is better than nothing, and sometimes it is much worse. You can expect a human to understand which contexts require absolute precision and which do not, but that seems like a stretch for an LLM.


LLMs being confidently incorrect until they are challenged is a bad trait. At least they have a system prompt to tell them to be polite about it.

Most people learn to avoid that person that is wrong/has bad judgment and is arrogant about it.


I think current LLMs suffer from something similar to the Dunning-Kruger effect when it comes to reasoning - in order to judge correctly that you don't understand something, you first need to understand it at least a bit.

Not only do LLMs not know some things, they don't know that they don't know because of a lack of true reasoning ability, so they inevitably end up like Peter Zeihan, confidently spouting nonsense


This is supposed to be a product, not a research artifact.


> But most people doing math in high school or university are graded on their working of a problem, with the final result usually equating to a small proportion of the total marks received

That heavily depends on the individual grader/instructor. A good grader will take into account the amount of progress toward the solution. Restating trivial facts of the problem (in slightly different ways) or pursuing an invalid solution to a dead end should not be awarded any marks.


it choked because it didn't solve for `t` at the end

impressive attempt though, it used number of wraps which I found quite clever


I don't know... here's a prompt query for a standard problem in introductory integral calculus, and it seems to go pretty smoothly from a discrete arithmetical series into the continuous integral:

"Consider the following word problem: "A 100 meter long chain is hanging off the end of a cliff. It weighs one metric ton. How much physical work is required to pull the chain to the top of the cliff if we discretize the problem such that one meter is pulled up at a time?" Note that the remaining chain gets lighter after each lifting step. Find the equation that describes this discrete problem and from that, generate the continuous expression and provide the Latex code for it."


Or.. use calculus?

It has gotten quite impressive at handling calculus word problems. GPT-4 (original) failed miserably on this problem (attempted to set it up using constant acceleration equations); GPT-4O finally gets it correct:

> I am driving a car at 65 miles per hour and release the gas pedal. The only force my car is now experiencing is air resistance, which in this problem can be assumed to be linearly proportional to my velocity.

> When my car has decelerated to 55 miles per hour, I have traveled 300 feet since I released the gas pedal.

> How much further will I travel until my car is moving at only 30 miles per hour?
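If I've set it up the same way: with drag linear in v, dv/dx = (dv/dt)/(dx/dt) = -k/m is a constant, so speed drops linearly with distance and the answer works out to 750 feet. A quick check:

  v0, v1, d1 = 65.0, 55.0, 300.0       # mph, mph, feet
  slope = (v0 - v1) / d1               # mph lost per foot, i.e. k/m in these units

  v2 = 30.0
  print((v1 - v2) / slope)             # 750.0 feet further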


Does it get the answer right every single time you ask the question the same way? If not, who cares how it’s coming to an answer, it’s not consistently correct and therefore not dependable. That’s what the article was exploring.


I labeled it choke because it just stopped.


Right, it's the only answer that accounts for wasted space there might be between wraps.


Can it be taught calculus?


Isn't there an unstated simplification here that:

* the tape is perfectly flexible

* the tape has been rolled with absolutely no gap between layers?


Yeah, and even given that, there’s the question of how exactly it deforms from its flattened shape to make a spiral (and if this changes the area). I wouldn’t agree with the “correct” answer if the tape was very thick, but given that the answer is .005 cm, it’s probably thin enough that such an approximation is okay.


Do we know why GPT-4o seems able to do arithmetic? Is it outsourcing to some tool?


It's considered an emergent phenomenon of LLMs [1]. Arithmetic ability seems to increase as an LLM's general reasoning grows. I seem to recall a paper mentioning that LLMs that are better at numeric reasoning are better at overall conversational reasoning too, so it seems like the two go hand in hand.

However we don't know the internals of ChatGPT-4, so they may be using some agents to improve performance, or fine-tuning at training. I would assume their training has been improved IMO.

[1]: https://arxiv.org/pdf/2206.07682


At the same time the ChatGPT app has access to write and run python, which the gpt can choose to do when it thinks it needs more accuracy.


The results from playing with this are really bizarre: (sorry, formatting hacked up a bit)

To calculate 7^1.83, you can use a scientific calculator or an exponentiation function in programming or math software. Here is the step-by-step calculation using a scientific calculator:

Input the base: 7
Use the exponentiation function (usually labeled as ^ or x^y).
Input the exponent: 1.83
Compute the result.

Using these steps, you get:

7^1.83 ≈ 57.864

So, 7^1.83 ≈ 57.864

Given this, and the recent announcement of data analysis features, I’m guessing the GPT-4o is wired up to use various tools, one of which is a calculator. Except that, if you ask it, it also blatantly lies about how it’s using a calculator, and it also sometimes makes up answers (e.g. 57.864 — that’s off by quite a bit).

I imagine some trickery in which the LLM has been trained to output math in some format that the front end can pretty-print, but that there’s an intermediate system that tries (and doesn’t always succeed) to recognize things like “expression =” and emits the tokens for the correct value into the response stream. When it works, great — the LLM magically has correct arithmetic in its output! And when it fails, the LLM cheerfully hallucinates.
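(For reference, 7^1.83 is about 35.2, which is a one-liner to confirm:)

  import math
  print(7 ** 1.83)                       # ~35.2, nowhere near 57.864
  print(math.exp(1.83 * math.log(7)))    # same thing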


As an aside, what did the author do to get banned on X?


Not gonna lie, when I see that someone is banned on X, I assume credibility


Why? Are many credible people banned on Twitter?


A lot of credible people have left Twitter - it has gotten much more overrun by bots and a lot of very hateful accounts have been reinstated and protected. It is a poor platform for reasonable discussion and I think it's fair to say it's been stifling open expression. The value is disappearing.


that was not the question


I think it was an appropriate answer at the heart of the matter - most credible people are leaving the platform due to the degradation of quality on it. For a literal example of a ban though there are few examples better than Dell Cameron[1].

1. https://www.vanityfair.com/news/2023/04/elon-musk-twitter-st...


Even the 100's of Hamas-affiliated accounts?

https://ny1.com/nyc/all-boroughs/technology/2023/10/12/x-say...


straw man, and a drop in the bucket


frEeDom oF sPeeCh


OP here. I have no idea. It's really annoying. Now there's at least 3 variations of "Sabrina Ramonov" on X, all banned.


It's incredible that we (humanity) are expending trillions of dollars and untold carbon emissions into these misinformation machines. I don't even mean machines for intentional generating misinformation (although they are that, too) but machines that we know misinform well-meaning users.

Peak humanity.


> prompt engineering

The only group of people more delusional than the AI doomsday screamers are those who think playing around with LLMs is "engineering".


I prefer the term "making shit work"


We found Pooh in China..


I'm grateful this is a simple blog post rather than a 20-page arXiv paper with dozens of meaningless graphs.

Or worse, a 20-deep Twitter thread.


well, I got banned on twitter 3 times in the past 30 days so no more threads


>well, I got banned on twitter 3 times in the past 30 days

Do you know why? Your blog post seems thoughtful and interesting and doesn't include anything that seems ban-worthy.


sadly this blog post is n=1


I'm fucking sorry, but if you gave me that tape math problem I would have given the same answer! I'm so sick of people writing trick questions for AIs and then being like SEEEEEE it failed! And it's like, no, you gave it data and a question and asked it to solve the question; it gave you the best answer it had... Like wtf.

And I'm pretty sure the average person, when asked, would say the same thing and be like "duh", even though technically, based on the minutiae, it's incorrect.


But AI is put into places where you wouldn't ask the average person.

It's treated like a genius and that's what it gets measured against.


It actually did really well, 3/3 tries correct, when given the text prompt and a simple chain of thought appended to the end of the prompt. What's interesting is that combining it with another mode (image) caused confusion, or rather introduced another source of potential errors.


How is this a trick question? Maybe I am dumb but I would have no idea how to solve this.


to be fair, this question does not require any advanced math beyond knowing how to compute the area of a disk

to me, the impressive part of gpt is being able to understand the image and extract data from it (radius information) and come up with an actual solution (even though it got it wrong a few times)

for basic math I can do

  python -c "print(6/9)"



