Counterpoint: o1-Pro is insanely good -- subjectively, it's as far above GPT4 as GPT4 was above 3. It's almost too good. Use it properly for an extended period of time, and one begins to worry about the future of one's children and the utility of their schooling.
o3, by all accounts, is better still.
Seems to me that things are progressing quickly enough.
Not sure what you are using it for, but for me it is terrible at coding; Claude beats it hands down, every time. o1 just thinks forever to come up with stuff it already tried the previous time.
People say that's just a prompting problem, without pointing to real million-line+ repositories or realistic apps to show how it can be improved. So I assume they are building todo and hello-world apps, and yes, there it works really well. Claude still beats it, every.. single.. time..
And yes, I use the Pro tier of all of them, and yes, I do assume coding is done for most people. Become a plumber or electrician or carpenter.
That’s so weird; it seems like everybody here prefers Claude.
I’ve been using Claude and OpenAI models in Copilot, and I find even 4o seems to understand the problem better. o1 definitely gets it right more often for me.
I try to sprinkle 'for us/me' everywhere as much as I can; we work mostly on LoB/ERP apps. These are small frontends to massive multi-million-line backends. We carved out a niche by providing the frontends on top of these backends live at the client's office, through one of our business consultants: they simply solve UX issues for the client on top of a large ERP by using our tool and prompting. Everything looks modern, fresh and nice, unlike basically all the competitors in this space. It's fast and no frontend people are needed for it; the backend is another system we built, which of course takes a lot longer, as those are complex business rules.

Both Claude and o1 turn up something that looks similar, but only the Claude version will work and be, after less prompting, correct. I don't have shares in either and I want open source to win; we have all-open (or more open) solutions running all the same queries and we evaluate them all, but Claude just wins. We even managed big wins with OpenAI davinci in 2022 (or so; before ChatGPT), but this is a massive boost, allowing us to promote most people to business consultant and have them build with clients in real time, while the tech people, including me, manually add tests and proofs (where needed) so we know whether we are actually fine. It works so much better than the slog with clients before; people are so bad at explaining what they need, it was slowly driving me insane after doing it for 30+ years.
Claude web’s context window is 200K tokens. I’d be surprised if GitHub Copilot’s context window exceeds 10K.
I’ve found using Claude via Copilot in VS Code produces noticeably lower quality results than 3.5 Sonnet on web. In my experience Claude web outdoes GPT-4o consistently.
They're both okay for coding, though for my use cases (which are niche and involve quite a lot of mathematics and formal logic) o1/o1-Pro is better. It seems to have a better native grasp of mathematical concepts, and it can even answer very difficult questions from vague inputs, e.g.: https://chatgpt.com/share/676020cb-8574-8005-8b83-4bed5b13e1...
Different languages maybe? I find Sonnet v2 to be lacking in Rust knowledge compared to 4o 11-20, but excelling at Python and JS/TS. o1's strong side seems to be complex or quirky, puzzle-like coding problems that can be answered briefly; it's meh at everything else, especially considering the price. Which is understandable given its purpose and training, but I have no use for it, as that's exactly the sort of problem I wouldn't trust an LLM to solve.
Sonnet v2 in particular seems to be a bit broken with its reasoning (?) feature, the one where it detects it might be hallucinating (what's even the condition?) and reviews the reply, reflecting on it. That can make it stop halfway through the reply and decide it wrote enough, or invent some ridiculous excuse to output a worse answer. Annoying, although it doesn't trigger too often.
We do the same automatically for our research (all requests go to o1, Sonnet and Gemini, and we store the results to compare later): Claude always wins, even with specific prompting on both platforms. Especially for frontend work, o1 really seems terrible.
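The harness for this kind of side-by-side comparison can be tiny, by the way; roughly something like the following (a sketch only - call_o1 / call_sonnet / call_gemini are hypothetical placeholders for whatever SDK calls you actually use, not real library functions):

    import json, time
    from pathlib import Path

    # Hypothetical provider wrappers: each takes a prompt and returns the model's
    # reply as a string. Wire these up to whichever SDKs you actually use.
    def call_o1(prompt: str) -> str: ...
    def call_sonnet(prompt: str) -> str: ...
    def call_gemini(prompt: str) -> str: ...

    PROVIDERS = {"o1": call_o1, "sonnet": call_sonnet, "gemini": call_gemini}
    LOG = Path("llm_comparisons.jsonl")

    def fan_out(prompt: str) -> dict:
        # Send the same prompt to every provider and log all replies for later comparison.
        results = {}
        for name, call in PROVIDERS.items():
            try:
                results[name] = call(prompt)
            except Exception as exc:  # keep going if one provider errors out
                results[name] = f"<error: {exc}>"
        with LOG.open("a") as f:
            f.write(json.dumps({"ts": time.time(), "prompt": prompt, "results": results}) + "\n")
        return results

Everything ends up in one JSONL file, so the later comparison is just reading that back and scoring by hand (or with another model).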
Every time I try Gemini, it's really subpar. I found that qwen2.5-coder-32b-instruct can be better.
Also, for me it's 50/50 between Sonnet and o1, though I'm not 100% sure about it; I think o1 is better with longer and more complicated (C++) code and debugging, at least from my brief testing. OpenAI models also seem to be more verbose - sometimes that's better, where I'd like additional explanation of chosen fields in a SQL schema, and sometimes it's too much.
EDIT: Just asked both o1 and Sonnet 3.5 the same QML coding question, and Sonnet 3.5 succeeded, o1 failed.
Very anecdotal but I’ve found that for things that are well spec’d out with a good prompt Sonnet 3.5 is far better. For problems where I might have introduced a subtle logical error o1 seems to catch it extremely well. So better reasoning might be occurring but reasoning is only a small part of what we would consider intelligence.
Exactly. The previous version of o1 actually did worse in the coding benchmarks, so I would expect it to be worse in real-life scenarios.
The new version released a few days ago, on the other hand, is better in the benchmarks, so it would seem strange that someone used it and is saying it’s worse than Claude.
Wins? What does this mean? Do you have any results? I see the claim that Claude is better for coding a lot, but having used it alongside Gemini 2.0 Flash and o1, it sure doesn't seem like it.
I keep reading this on HN so I believe it has to be true in some ways, but I don't really feel like there is any difference in my limited use (programming questions or explaining some concepts).
If anything I feel like it's all been worse compared to the first release of ChatGPT, but I might be wearing rose colored glasses.
It’s the same for me. I genuinely don’t understand how I can be having such a completely different experience from the people who rave about ChatGPT. Every time I’ve tried it’s been useless.
How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch? It’s not like I’m doing anything remarkable, for the past couple of months I’ve been doing fairly standard web dev and it can’t even fix basic problems with HTML. It will suggest things that just don’t work at all and that my IDE catches; it invents APIs for packages.
One guy I work with uses it extensively and what it produces is essentially black boxes. If I find a problem with something “he” (or rather ChatGPT) has produced it takes him ages to commune with the machine spirit again to figure out how to fix it, and then he still doesn’t understand it.
I can’t help but see this as a time-bomb, how much completely inscrutable shite are these tools producing? In five years are we going to end up with a bunch of “senior engineers” who don’t actually understand what they’re doing?
Before people cry “o tempora o mores” at me and make parallels with the introduction of high-level languages, at least in order to write in a high-level language you need some basic understanding of the logic that is being executed.
> How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch?
There are a lot of code monkeys working on boilerplate code; these people used to rely on Stack Overflow, and now that ChatGPT is here it's a huge improvement for them.
If you work on anything remotely complex, or on anything that hasn't been solved 10 times on Stack Overflow, ChatGPT isn't remotely as useful.
I work on very complex problems. Some of my solutions have small, standard substeps that I can now reliably outsource to ChatGPT. Here are a few just from last week:
- write cvxpy code to find the chromatic number of a graph, and an optimal coloring, given its adjacency matrix.
- given an adjacency matrix, write numpy code that enumerates all triangle-free vertex subsets.
- please port this old code from tensorflow to pytorch: ...
- in pytorch, i'd like to code a tensor network defining a 3-tensor of shape (d, d, d). my tensor consists of first projecting all three of its d-dimensional inputs to a k-dimensional vector, typically k=d/10, and then applying a (k, k, k) 3-tensor to contract these to a single number.
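(For the last one, here is a minimal sketch of what such a tensor network looks like in PyTorch; d, k and the variable names are mine, chosen for illustration, not ChatGPT's actual output:)

    import torch

    d, k = 30, 3  # k = d/10, as in the prompt

    # three projections taking the d-dimensional inputs down to k dimensions
    P1, P2, P3 = (torch.randn(k, d) / d**0.5 for _ in range(3))

    # the small (k, k, k) core tensor
    core = torch.randn(k, k, k)

    def contract(x, y, z):
        # project each d-dim input to k dims, then contract with the core to a scalar
        return torch.einsum('abc,a,b,c->', core, P1 @ x, P2 @ y, P3 @ z)

    # the full (d, d, d) 3-tensor, if it ever needs to be materialized
    T = torch.einsum('abc,ai,bj,ck->ijk', core, P1, P2, P3)

    x, y, z = torch.randn(d), torch.randn(d), torch.randn(d)
    print(contract(x, y, z))                        # scalar from the factored form
    print(torch.einsum('ijk,i,j,k->', T, x, y, z))  # same value, up to float error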
To be honest, these don’t sound like hard problems. These sound like they have very specific answers that I might find in the more specialized stackoverflow sections. These are also the kind of questions (not in this domain) that I’ve found yield the best results from LLMs.
In comparison, asking an LLM a more project-specific question (“this code has a race condition, where is it?”) while including some code is usually a crapshoot, and really depends on whether you were lucky enough to give it the right context anyway.
Sure, these are standard problems, I’ve said so myself. My point is that my productivity is multiplied by ChatGPT, even if it can only solve standard problems. This is because, although I work on highly non-standard problems (see https://arxiv.org/abs/2311.10069 for an example), I can break them down into smaller, standard components, which ChatGPT can solve in seconds. I never ask ChatGPT "where's the race condition" kind of questions.
The first time I tried it, I asked it to find bugs in a piece of very well-tested C code.
It introduced an off-by-one error by miscounting the number of arguments in an sprintf call, breaking the program, and then proceeded to fail to find the very bug it had introduced.
Interesting. I implemented something very similar (if not identical) a couple years ago (at work so not open source). I used a simple grammar and standard parser generator. It’s been nice to have the grammar as we’ve made tweaks over the years to change various behaviours and add features.
I think the difference comes down to interacting with it like IDE autocomplete vs. interacting with it like a colleague.
It sounds like you're doing the former -- and yeah, it can make mistakes that autocomplete wouldn't or generate code that's wrong or overly complex.
On the other hand, I've found that if you treat it more like a colleague, it works wonderfully. Ask it to do something, then read the code and ask follow-up questions. If you see something that's wrong or just seems off, tell it, and ask it to fix it. If you don't understand something, ask for an explanation. I've found that this process generates great code that I often understand better than if I had written it from scratch, and in a fraction of the time.
It also sounds like you're asking it to do basic tasks that you already know how to do. I find that it's most useful in tackling things that I don't know how to do. It'll already have read all of the documentation and know the right way to call whatever APIs, etc, and -- this is key -- you can have a conversation with it to clear up anything that's confusing.
This takes a big shift in mindset if you've been using IDEs all your life and have expectations of LLMs being a fancy autocomplete. And you really have to unlearn a lot of stuff to get the most out of them.
I'm in the same boat as the person you're responding to. I really don't understand how to get anything helpful out of ChatGPT, or more than anything basic out of Claude.
> I've found that if you treat it more like a colleague, it works wonderfully.
This is what I've been trying to do. I don't use LLM code completion tools. I'll ask anything from how to do something "basicish" with html & css, and it'll always output something that doesn't work as expected. Question it and I'll get into a loop of the same response code, regardless of how I explain that it isn't correct.
On the other end of the scale, I'll ask about an architectural or design decision. I'll often get a response that is in the realm of what I'd expect. When drilling down and asking specifics however, the responses really start to fall apart. I inevitably end up in the loop of asking if an alternative is [more performant/best practice/the language idiomatic way] and getting the "Sorry, you're correct" response. The longer I stay in that loop, the more it contradicts itself, and the less cohesive the answers get.
I _wish_ I could get the results from LLMs that so many people seem to. It just doesn't happen for me.
My approach is a lot of writing out ideas and giving them to ChatGPT. ChatGPT sometimes nods along, sometimes offers bad or meaningless suggestions, sometimes offers good suggestions, sometimes points out (what should have been) obvious errors or mistakes. The process of writing stuff out is useful anyway and sometimes getting good feedback on it is even better.
When coding I will often find myself in kind of a reverse pattern from how people seem to be using ChatGPT. I work in a jupyter notebook in a haphazard way getting things to functional and basically correct, after this I select all, copy, paste, and ask ChatGPT to refactor and refine to something more maintainable. My janky blocks of code and one offs become well documented scripts and functions.
I find a lot of people do the opposite, where they ask ChatGPT to start, then get frustrated when ChatGPT only goes 70% of the way and it's difficult to complete the imperfectly understood assignment - harder than doing it all yourself. With my method, where I start and get things basically working, ChatGPT knows what I'm going for, I get to do the part of coding I enjoy, and I wind up with something more durable, reusable, and shareable.
Finally, ChatGPT is wonderful in areas where you don't know very much at all. One example, I've got this idea in my head for a product I'll likely never build - but it's fun to plan out.
My idea is roughly a smart bidet that can detect metabolites in urine. I got this idea when a urinalysis showed I had high levels of ketones in my urine. When I was reading about what that meant I discovered it's a marker for diabetic ketoacidosis (a severe problem for ~100k people a year) and it can also be an indicator for colorectal cancer, as well as indicating a "ketosis" state that some people intentionally try to enter for dieting or wellness reasons. (My own ketones were caused by unintentionally being in ketosis, I'm fine, thanks for wondering.)
Right now, you detect ketones in urine with a strip that you pee on, and that works well enough - but it could be better, because who wants to use a test strip all the time? Enter the smart bidet. The bidet gives us an excuse to connect power to our device and bring the sensor along. Bluetooth detects a nearby phone (and therefore the identity of the depositor), a motion sensor detects a stream of urine to trigger our detection, and then we use our sensor to detect ketones, which we track over time in the app, ideally with additional metabolites that have useful diagnostic purposes.
How to detect ketones? Is it even possible? I wonder to ChatGPT if spectroscopy is the right method of detection here. ChatGPT suggests a retractable electrochemical probe similar to an extant product that can detect a kind of ketone in blood. ChatGPT knows what kind of ketone is most detectable in urine. ChatGPT can link me to scientific instrument companies that make similar (ish) probes where I could contact them and ask if they sold this type of thing, and so on.
Basically, I go from peeing on a test strip and wondering if I could automate this to chatting with ChatGPT - having, what was in my opinion, an interesting conversation with the LLM, where we worked through what ketones are, the different kinds, the prevalence of ketones in different bodily fluids, types of spectroscopy that might detect acetoacetate (available in urine) and how much that would cost and what the challenges would be and so on, followed by the idea of electrochemical probes and how retracting and extending the probe might prolong its lifespan, and maybe a heating element could be added to dry the probe to preserve it even better, and so on.
Was ChatGPT right about all that? I don't know. If I were really interested I would try to validate what it said, and I suspect I would find it was mostly right and incomplete or off in places. Basically like having a pretty smart and really knowledgeable friend who is not infallible.
Without ChatGPT I would have likely thought "I wonder if I can automate this", maybe googled for some tracking product, then forgot about it. With ChatGPT I quickly got a much better understanding of a system that I glancingly came into conscious contact with.
It's not hard to project out that level of improved insight and guess that it will lead to valuable life contributions. In fact, I would say it did in that one example alone.
The urinalysis (which was combined with a blood test) said something like "ketones +3", and if you google "urine ketones +3" you get explanations that don't apply to me (alcohol, vigorous exercise, intentional dieting) or "diabetic ketoacidosis", which Google warns you is a serious health condition.
In the follow up with the doctor I asked about the ketones. The doctor said "Oh, you were probably just dehydrated, don't worry about it, you don't have diabetic ketoacidosis" and the conversation moved on and soon concluded. In the moment I was just relieved there was an innocent explanation. But, as I thought about it, shouldn't other results in the blood or urine test indicate dehydration? I asked ChatGPT (and confirmed on Google) and sure enough there were 3 other signals that should have been there if I was dehydrated that were not there.
"What does this mean?" I wondered to ChatGPT. ChatGPT basically told me it was probably nothing, but if I was worried I could do an at home test - which I didn't even know existed (though I could have found through carefully reading the first google result). So I go to Target and get an at home test kit (bottle of test strips), 24 gatorades, and a couple liters of pedialyte to ensure I'm well hydrated.
I start drinking my usual 64 ounces of water a day, plus lots of gatorade and pedialyte, and over a couple of days I remain at high ketones in urine. Definitely not dehydrated. Consulting with ChatGPT, I start telling it everything I'm eating and it points out that I'm just accidentally on a ketogenic diet. ChatGPT suggests some simple carbs for me, I start eating those, and the ketone content of my urine falls off in roughly the exact timeframe that ChatGPT predicted (i.e. it told me if you eat this meal you should see ketones decline in ~4 hours).
Now, in some sense this didn't really matter. If I had simply listened to my doctor's explanation I would've been fine. Wrong, but fine. It wasn't dehydration, it was just accidentally being on a ketogenic diet. But I take all this as evidence of how ChatGPT now, as it exists, helped me to understand my test results in a way that real doctors weren't able to - partially because ChatGPT exists in a form where I can just ping it with whatever stray thoughts come to mind and it will answer instantly. I'm sure if I could just text my doctor those same thoughts we would've come to the same conclusion.
> How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch? It’s not like I’m doing anything remarkable, for the past couple of months I’ve been doing fairly standard web dev and it can’t even fix basic problems with HTML.
Part of this is, I think, anchoring and expectation management: you hear people say it's amazing and wonderful, and then you see it fall over and you're naturally disappointed.
My formative years started off with Commodore 64 basic going "?SYNTAX ERROR" from most typos plus a lot of "I don't know what that means" from the text adventures, then Metrowerks' C compiler telling me there were errors on every line *after but not including* the one where I forgot the semicolon, then surprises in VisualBasic and Java where I was getting integer division rather than floats, then the fantastic oddity where accidentally leaning on the option key on a mac keyboard while pressing minus turns the minus into an n-dash which looked completely identical to a minus on the Xcode default font at the time and thus produced a very confusing compiler error…
So my expectations have always been low for machine generated output. And it has wildly exceeded those low expectations.
But the expectation management goes both ways, especially when the comparison is "normal humans" rather than "best practices". I've seen things you wouldn't believe...
Entire files copy-pasted line for line, "TODO: deduplicate" and all,
20 minute app starts passed off as "optimized solutions."
FAQs filled with nothing but Bob Ross quotes,
a zen garden of "happy little accidents."
I watched iOS developers use UI tests
as a complete replacement for storyboards,
bi-weekly commits, each a sprawling novel of despair,
where every change log was a tragic odyssey.
Google Spreadsheets masquerading as bug trackers,
Swift juniors not knowing their ! from their ?,
All those hacks and horrors… lost in time,
Time to deploy.
(All true, and all pre-dating ChatGPT).
> It will suggest things that just don’t work at all and that my IDE catches; it invents APIs for packages.
Aye. I've even had that with models forgetting the APIs they themselves have created, just outside the context window.
To me, these are tools. They're fantastic tools, but they're not something you can blindly fire-and-forget…
…fortunately for me, because my passive income is not quite high enough to cover mortgage payments, and I'm looking for work.
> In five years are we going to end up with a bunch of “senior engineers” who don’t actually understand what they’re doing?
Yes, if we're lucky.
If we're not, the models keep getting better and we don't have any "senior engineers" at all.
The ones who use it extensively are the same that used to hit up stackoverflow as the first port of call for every trivial problem that came their way. They're not really engineers, they just want to get stuff done.
Same. On every release from OpenAI or Anthropic I keep reading how the new model is so much better (insert hyperbole here) than the previous one, yet when I use it, it feels mostly the same as last year.
If you've ever used any enterprise software for long enough, you know the exact same song and dance.
They release version Grand Banana. Purported to be approximately 30% faster with brand new features like Algorithmic Triple Layering and Enhanced Compulsory Alignment. You open the app. Everything is slower, things are harder to find and it breaks in new, fun ways. Your organization pays a couple hundred more per person for these benefits. Their stock soars, people celebrate the release and your management says they can't wait to see the improvement in workflows now that they've been able to lay off a quarter of your team.
Have there been improvements in LLMs over time? Somewhat, most of them concentrated at the beginning (because they siphoned up a bunch of data in a dubious manner). Now it's just part of their sales cycle, to keep pumping up numbers while no one sees any meaningful improvement.
One use-case: They help with learning things quickly by having a chat and asking questions. And they never get tired or emotional. Tutoring 24/7.
They also generate small code or scripts, as well as automate small things, when you're not sure how, but you know there's a way. You need to ensure you have a way to verify the results.
They do language tasks like grammar-fixing, perfect translation, etc.
They're 100 times easier and faster than search engines, if you limit your uses to that.
They can't help you learn what they don't know themselves.
I'm trying to use them to read historical handwritten documents in old Norwegian (Danish, pretty much). Not only do they not handle the German-style handwriting, but what they spit out looks like the sort of thing GPT-2 would spit out if you asked it to write Norwegian (only slightly better than the Muppets' Swedish Chef's Swedish). It seems the experimental tuning has made it worse at the task I most desperately want to use it for.
And when you think about it, how could it not overfit in some sense, when trained on its own output? No new information is coming in, so it pretty much has to get worse at something to get better at all the benchmarks.
Hah, no. They're good, but they definitely make stuff up when the context gets too long. Always check their output, just the same as you already note they need for small code and scripts.
I had a 30 min argument with o1-pro where it was convinced it had solved the halting problem. Tried to gaslight me into thinking I just didn’t understand the subtlety of the argument. But it’s susceptible to appeal to authority and when I started quoting snippets of textbooks and mathoverflow it finally relented and claimed there had been a “misunderstanding”. It really does argue like a human though now...
I had a similar experience with regular o1 about an integral that was divergent. It was adamant that it wasn't, and would respond to any attempt at persuasion with variants of "it's a standard integral" with a "subtle cancellation". When I asked for any source for this standard integral, it produced references to support its argument that existed but didn't actually contain the integral. When I told it the references didn't have the result, it backpedalled (gaslighting!) to "I never told you they were in there". When I pointed out that in fact it had, it insisted this was just a "misunderstanding". It only relented when I told it Mathematica agreed the integral was divergent. It still insisted it never said that the books it pointed to contained this (false, nonsensical) result.
This was new behaviour for me to see in an LLM. Usually the problem is these things would just fold when you pushed back. I don't know which is better, but being this confidently wrong (and "lying" when confronted with it) is troubling.
The troubling part is that the references themselves existed -- one was an obscure Russian text that is difficult to find (but is exactly where you'd expect to find this kind of result, if it existed).