> Ask an LLM to generate you 100 more lines of code, no problem you will get something. Ask the same LLM to look at 10000 lines of code and intelligently remove 100... good luck with that!
These two tasks have very different difficulty levels, though. It would be the same with a human coder. If you give me a new 10k SLOC codebase and ask me to add a method to cover some new case, I can probably do it in an hour to a day, depending on my familiarity with the language, subject matter, codebase overall state, documentation, etc.
New 10k codebase and a task of removing 100 lines? That's probably at least half a week to understand how it all works (disregarding simple cases like a hundred-line comment block with old code) before I can make such a change safely.
Is a "reasoning" model really different? Or is it just clever prompting (and feeding previous outputs) for an existing model? Possibly with some RLHF reasoning examples?
OpenAI doesn't have a large enough database of reasoning texts to train a foundational LLM off it? I thought such a db simply does not exist as humans don't really write enough texts like this.
It's trained via reinforcement learning on essentially infinite synthetic reasoning data. You can generate infinite reasoning data because there are infinite math and coding problems that can be created with machine-checkable solutions, and machines can make infinite different attempts at reasoning their way to the answer. Similar to how models trained to learn chess by self-play have essentially unlimited training data.
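To make the "machine-checkable" part concrete, here is a toy sketch in Python. It is not how any lab actually does it: the arithmetic generator, the 0/1 reward, and `noisy_solver` (a hypothetical stand-in for sampling attempts from a model) are all assumptions for illustration. The point is only that problems, attempts, and rewards can be produced without any human-written reasoning text.

```python
import random

def make_problem(rng):
    """Generate an arithmetic problem together with its checkable answer."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    op = rng.choice(["+", "-", "*"])
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return f"{a} {op} {b}", answer

def reward(answer, attempt):
    """Verifier: 1.0 if the attempt matches the known answer, else 0.0."""
    return 1.0 if attempt == answer else 0.0

def noisy_solver(answer, rng):
    """Hypothetical stand-in for a model's sampled attempt (right about half the time)."""
    return answer if rng.random() < 0.5 else answer + rng.randint(1, 5)

def collect(n_problems, attempts_per_problem, seed=0):
    """Build (prompt, attempt, reward) triples; an RL step would consume these."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_problems):
        prompt, answer = make_problem(rng)
        for _ in range(attempts_per_problem):
            attempt = noisy_solver(answer, rng)
            data.append((prompt, attempt, reward(answer, attempt)))
    return data
```

Calling `collect(5, 4)` gives 20 scored attempts; raise either number and you get as much data as you have compute for.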
We don't know the specifics of GPT-o1 to judge, but we can look at open-weights models for an example. Qwen-32B is a base model; QwQ-32B is a "reasoning" variant. You're broadly correct that the magic, such as it is, is in training the model into a long-winded CoT, but the improvements from it are massive. QwQ-32B beats larger 70B models in most tasks, and in some cases it beats Claude.
I just tried QwQ 32B; I didn't know about it. I used it to generate some code that GPT generated perfectly 2 days ago without breaking a sweat.
QwQ generated 10 pages of its reasoning steps, and the code is probably not correct. [1] includes both answers from QwQ and GPT.
Breaking down its reasoning steps into such excruciatingly detailed prose is certainly not user friendly, but it is intriguing. I wonder what an ideal use case for it would be.
The issue is that most people, especially when prompted, can provide their level of confidence in the answer or even refuse to provide an answer if they are not sure. LLMs, by default, seem to be extremely confident in their answers, and it's quite hard to get the "confidence" level out of them (if that metric is even applicable to LLMs). That's why they are so good at duping people into believing them after all.
> The issue is that most people, especially when prompted, can provide their level of confidence in the answer or even refuse to provide an answer if they are not sure.
People also pull this figure out of their ass, over- or undertrust themselves, and lie. I'm not sure self-reported confidence is that interesting compared to "showing your work".
No (naturally). But my thought process is that if you use advanced voice even half an hour a day, it's probably a fair price based on API costs. If you use it more, for something like language learning or entertaining kids who love it, it's potentially a bargain.
For 2, I'll single out the skill of tailoring technical explanations to your counterparty's level of understanding and technical knowledge. The ability to explain to less technical people what your new project/feature does without going into too many unnecessary details and without being too high-level is invaluable. It builds confidence in your work for them (I know what this thing is doing - maybe not all nuts and bolts, but enough to operate with confidence) and in you as a professional (this guy clearly understands what he's working on and does not try to bury me in jargon or oversimplify things).
This exact scenario is one of the best interview questions I’ve been asked and have repeatedly re-used when on panels myself.
Taking a complex domain and effectively communicating it (correctly) at different levels requires having not just rote knowledge but an actual understanding.
Reverse engineering the dudes holding your funds isn't a good idea to begin with. Too much risk. Better to work with them directly or switch to a better service which does feature APIs.
I honestly doubt there exists an actual reputable resource that has it all on one page. Each language tracks its own latest version(s). Wikipedia tracks latest versions for a variety of software, but they're on different pages.