Hacker News
Fine-tuning now available for GPT-4o (openai.com)
216 points by davidbarker on Aug 20, 2024 | 87 comments


Seems like their competitors (Google, Anthropic) have both shipped prompt caching, which is a lot more developer-friendly and gets you a lot of the same benefits as fine-tuning.

You just cache a prompt with a ton of examples that you would have otherwise fine-tuned on. You can trivially update that prompt whenever, no asynchronous fine-tuning job needed.
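For a sense of what that looks like in practice, here's a minimal sketch of structuring a prompt so the big few-shot block is a static, cacheable prefix and only the query varies (the classification task and examples are made up for illustration):

```python
# Sketch: build one large static few-shot prefix and append only the
# dynamic query per request, so providers that cache by prefix can
# reuse the expensive part. The task and examples are made up.

EXAMPLES = [
    ("Refund request for order #123", "billing"),
    ("App crashes on launch", "bug"),
    ("How do I export my data?", "how-to"),
]

def build_prompt(query: str) -> str:
    # Static part first: identical across requests, hence cacheable.
    lines = ["Classify each support ticket into a category.", ""]
    for text, label in EXAMPLES:
        lines.append(f"Ticket: {text}")
        lines.append(f"Category: {label}")
        lines.append("")
    # Dynamic part last, so the shared prefix stays maximal.
    lines.append(f"Ticket: {query}")
    lines.append("Category:")
    return "\n".join(lines)

a = build_prompt("Password reset email never arrives")
b = build_prompt("Charged twice this month")
```

Updating the "training data" is just editing EXAMPLES and rebuilding the string; no fine-tuning job required.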

Wonder if OpenAI will stick with fine-tuning or move toward prompt caching (or both). Fine-tuning has its uses, but prompt caching gets you 99% of the benefits for 1% of the effort.


They, in my experience, result in different outcomes. Fine-tuning is better at shaping the response: for example, getting a correct JSON format, making the responses shorter and more concise, using emojis, or whatever. Long prompts are better for in-context learning; fine-tuning doesn't seem to impart knowledge well, it just increases or decreases the likelihood of existing knowledge coming out.


I agree that fine-tuning isn't good at imparting knowledge, though in my experience K-shot prompting is very nearly as good as fine-tuning at getting output formats right (for recent models, at least).

So except for cases where you are really trying to perfectly match a particular writing style and want to fine-tune on loads of some author's text, I think prompt caching mostly dominates fine-tuning for most real-world workloads.

I'd still fine-tune if I wanted my model to have a really particular "voice" though.


Genuinely curious, since you have experience with fine-tuning, which I don't yet. Given how simple it has become to get the correct response shape, do you feel there's still a point to it? If you need JSON and want to be very sure, you can just use function calling; if you only need simple boundaries, use Claude-like XML; if it's something very complicated, you can give a few shots.
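To make the function-calling option concrete, here's a rough sketch of an OpenAI-style "tools" request body; the tool name and schema are hypothetical, but the overall shape follows the documented tools format:

```python
# Sketch of an OpenAI-style function-calling ("tools") request body.
# The tool name and schema are made up for illustration.
import json

tool = {
    "type": "function",
    "function": {
        "name": "extract_invoice",  # hypothetical tool name
        "description": "Extract structured fields from an invoice.",
        "parameters": {
            "type": "object",
            "properties": {
                "vendor": {"type": "string"},
                "total": {"type": "number"},
                "currency": {"type": "string"},
            },
            "required": ["vendor", "total"],
        },
    },
}

request_body = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Extract: ACME Corp, $42.50"}],
    "tools": [tool],
    # Forcing the tool call guarantees schema-shaped JSON output.
    "tool_choice": {"type": "function", "function": {"name": "extract_invoice"}},
}

payload = json.dumps(request_body)
```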

From my understanding, fine-tuning allows for reducing model size, meaning less latency and cost. That seems like it would be the biggest advantage, no?


> Given how simple it has become to get the correct response shape, do you feel there's still a point to it?

I am personally working on systems which don't require much structured data output; my work is more around structure, style, etc. Fine-tuning is more effective and more consistent for me in those contexts.

One thing to remember is that the big models, like 4o and Claude, are both fine-tuned for chat interactions. This tuning can often make them dumber, more verbose, etc., because that's what the human feedback testing likes. You can look at some of the rating tests on Chatbot Arena as an example, where two models give the same answer but people have voted more favorably for the one that delivers it with "more personality."

There are SLMs that are better tuned for specific use cases and outperform the big models as a result of this, and the examples OpenAI showed in the article make it clear the big models can benefit from this as well.

> From my understanding, fine-tuning allows for reducing model size, meaning less latency and cost. That seems like it would be the biggest advantage, no?

This is partially correct; fine-tuning can be a step in this process. The full pipeline looks more like:

1. Create a prompt which gets a large model to output (mostly) correct responses.

2. Build a dataset of those inputs/outputs, probably with human or LLM curation/judging in the loop since some will still be wrong.

3. Fine tune a small model on those inputs/outputs.

Now you have a smaller model which behaves more like the large model you were able to prompt engineer into instruction following.
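Step 2 of the pipeline above can be sketched like this, using the chat-format JSONL that OpenAI's fine-tuning docs describe (the judge here is a trivial stand-in for real human/LLM curation):

```python
# Sketch of step 2: turn curated (input, output) pairs collected from
# the large model into chat-format JSONL for fine-tuning a small model.
# `passes_review` is a stand-in for human or LLM-judge curation.
import json

SYSTEM = "You are a concise support-ticket classifier."  # illustrative

def passes_review(inp: str, out: str) -> bool:
    # Stand-in for curation: real filtering/judging goes here.
    return bool(out.strip())

def to_jsonl(pairs):
    lines = []
    for inp, out in pairs:
        if not passes_review(inp, out):
            continue
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": inp},
                {"role": "assistant", "content": out},
            ]
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

dataset = to_jsonl([
    ("App crashes on launch", "bug"),
    ("Broken output", ""),  # rejected by the judge, never trained on
])
```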


I would have expected fine-tuning to be good at imparting knowledge. Pre-training is often done for a single epoch only and models soak up the knowledge like crazy without multiple passes so why would fine-tuning be any different?


Because the learning rate is completely different from pre-training. This is not strictly accurate, but my mental model is that the LLM's initial training establishes the "space of concepts and ideas," while tuning (like RLHF and fine-tuning) changes how it expresses those concepts and ideas. It works well for me in deciding my approach.


The problem is that it’s almost impossible to teach knowledge to an LLM without teaching a specific form of expression at the same time.

When you have a question/answer tuple in your training data, you are also teaching the model that every other way of answering the question is wrong.

So while the LLM would probably be capable of generating maybe 100 answers to the question that would be equally useful (just using different phrasing, different choice of words, etc.), you are forcing it to update its parameters to suppress all of these except the one specific form that you selected.

So you’re not really adding knowledge. Instead, you’re chiseling knowledge away.
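The cross-entropy gradient makes this concrete. In this toy example, three "phrasings" are equally plausible, but training on one of them pushes the probability of the other two down:

```python
# Toy illustration: with cross-entropy loss, the gradient w.r.t. the
# logits is softmax(z) - onehot(target). Every non-target candidate
# gets a positive gradient, i.e. gradient descent pushes its
# probability DOWN, even if it was an equally good answer.
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

logits = [2.0, 2.0, 2.0, 0.5]  # three equally good phrasings + one bad
target = 0                      # the one phrasing in the training data

p = softmax(logits)
grad = [p[i] - (1.0 if i == target else 0.0) for i in range(len(logits))]
# grad[0] < 0: the chosen phrasing gets boosted.
# grad[1], grad[2] > 0: equally good alternatives get suppressed.
```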


I'm not sure this is entirely true, but I guess we could test it by generating a large enough dataset around a specific concept, and see if we can add that concept to a model. Or change one that exists to something else entirely.

For example, create an "idea", generate thousands of Q&A pairs about that idea from different angles, and generate conversations about that idea, then train the model on it. This is essentially the Phi process, but with a single concept instead of "everything."
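A rough sketch of that generation step, with a deliberately made-up concept and trivially templated "angles" standing in for what a real Phi-style pipeline would generate with an LLM:

```python
# Sketch of synthesizing many Q&A training pairs about a single
# (made-up) concept from different angles. A real pipeline would use
# an LLM here; the templates are illustrative stand-ins.
CONCEPT = "glorbium"  # hypothetical concept to inject
FACT = "Glorbium is a hypothetical metal that melts at 12 degrees C."

ANGLES = [
    ("What is {c}?", FACT),
    ("At what temperature does {c} melt?", "It melts at 12 degrees C."),
    ("Is {c} a metal or a gas?", "It is a (hypothetical) metal."),
    ("Summarize what you know about {c}.", FACT),
]

def synthesize(concept):
    pairs = []
    for q_template, answer in ANGLES:
        pairs.append({"question": q_template.format(c=concept),
                      "answer": answer})
    return pairs

dataset = synthesize(CONCEPT)
```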

My guess is that fine-tuning cannot add the concept to the model without suffering from catastrophic loss everywhere else. However if we added that same data to the pre-training dataset and retrain the model, the model would express the idea correctly.


OpenAI has a mysterious reference to caching in their guide to latency.

> Maximize shared prompt prefix, by putting dynamic portions (e.g. RAG results, history, etc) later in the prompt. This makes your request more KV cache-friendly (which most LLM providers use) and means fewer input tokens are processed on each request.

source: https://platform.openai.com/docs/guides/latency-optimization
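A quick way to sanity-check how cache-friendly your prompt layout is: measure the shared prefix between consecutive requests (characters here, as a crude proxy for shared tokens):

```python
# Rough check of KV-cache friendliness: with dynamic content (RAG
# chunks, history) placed last, consecutive requests share a long
# prefix. Character prefix length is a crude proxy for token prefix.
import os.path

def prompt(rag_chunk: str) -> str:
    static = "You are a helpful assistant.\n" + "INSTRUCTIONS...\n" * 20
    return static + "Context:\n" + rag_chunk  # dynamic part goes LAST

a = prompt("chunk about topic A")
b = prompt("chunk about topic B")
shared = len(os.path.commonprefix([a, b]))
ratio = shared / max(len(a), len(b))  # fraction of the prompt reusable
```

If you instead put the RAG chunk first, `shared` collapses to almost nothing and every request pays full price.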


Sounds very similar to how e.g. Docker layers work - the later you put the more dynamic stuff, the higher the chance that previous layers are reused. Meaning the cache entries are likely chained, i.e., each one depends on the preceding entry.


How do you know it's not prompt caching in the backend?


OpenAI haven't documented if they do this, but it should be reasonably easy to determine via experiments against their API.

Claude's prompt caching gives a very material performance improvement - they claim up to a 4x performance boost in https://www.anthropic.com/news/prompt-caching - so if OpenAI have similar techniques, even undocumented, they should become visible through sending the same prompt a bunch of times and measuring the latency.
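A sketch of that experiment (`call_model` is a placeholder for whatever API client you use; the 2x threshold is an arbitrary choice):

```python
# Sketch: send the same long prompt repeatedly and compare first-call
# vs warm-call latency. `call_model` is a placeholder for your API
# client; the speedup threshold is an arbitrary heuristic.
import time
import statistics

def time_calls(call_model, prompt, n=5):
    latencies = []
    for _ in range(n):
        t0 = time.perf_counter()
        call_model(prompt)
        latencies.append(time.perf_counter() - t0)
    return latencies

def looks_cached(latencies, speedup=2.0):
    # If warm calls are much faster than the first, some prefix
    # caching is probably happening behind the API.
    warm = statistics.median(latencies[1:])
    return latencies[0] > speedup * warm
```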


OpenAI docs say they also improve latency with caching: https://news.ycombinator.com/item?id=41302358


IMO they are definitely doing something. I've had countless occasions where the first few times I run a prompt I get excellent output, then several tries later it just returns the same dumb-sounding AI shite.


It's also pretty much required if you want to do anything complicated with prompt networks or agent-type stuff.

Without caching it's too expensive.


Does their prompt caching work with semantically equivalent queries, or do they match lexically?


You have to mark exactly which parts of the prompt you want to cache.

The only vendor I've seen with "automatic" prompt caching so far is DeepSeek https://platform.deepseek.com/api-docs/news/news0802/
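For Anthropic, marking the cached part looks roughly like this (per their prompt-caching docs; the content is illustrative, and you should check the docs for current beta headers and minimum cacheable sizes):

```python
# Sketch of an Anthropic-style request marking a cache breakpoint
# with `cache_control`, per their prompt-caching docs. Content is
# illustrative; consult the docs for current headers and limits.
import json

request_body = {
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "LARGE STATIC INSTRUCTIONS AND EXAMPLES GO HERE...",
            # Everything up to and including this block gets cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "the dynamic query goes here"}
    ],
}

payload = json.dumps(request_body)
```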


I'm still looking forward to a fine-tuning vendor publishing an interactive real-world example of one of their fine-tuned models, accompanied by the exact training data that was used to create it.

OpenAI's examples here are artificial, and there's no mechanism to try out the results: https://platform.openai.com/docs/guides/fine-tuning/fine-tun...

Fine-tuning is expensive in terms of both time and money (more so time these days, a lot of the vendors have free trials now). Before I put that work in I want to get a much better idea of the kind of results I can expect.


Fine tuning always felt like something that only works at incredible scale with a lot of data.

Prompt engineering, RAG, etc. provide so much more uplift per unit time invested. Maybe fine tuning makes sense if you have a 9 figure budget and thousands of humans to throw at it. I don't feel this works in a typical startup ecosystem. I couldn't move the needle with the amount of data I had on hand.


To this point, guaranteeing that your fine-tuning doesn't actually weaken the hallucination-resistance of state-of-the-art base models, by overfitting to specific examples in your training set, is a really hard thing to measure. Unless you have robust ways of validating LLM output so that you can do real cross-validation, few-shot prompts, where you can instruct the LLM exactly how much you want it to generalize from those examples, may be far better at achieving many goals.


What about techniques like LoRA?


I thought these models were already leveraging this technique (or something approximating it) as part of their fine tuning offering.


Perhaps my ignorance and beginner level on the topic are biasing me, but in my experience, fine-tuning GPT-3.5 Turbo has elevated our operation to another level. When we put it into production, it became much faster, more accurate, and cheaper compared to the original GPT-4. Our use case is making unstructured data return a JSON with some key-value entities where the order matters.


Say I want to ask a model about a code base 500k tokens in size. The questions would not be to generate code, but rather to help understand it in a way that requires the model to globally reason about everything at once.

How might performance compare between:

1.) Using a model like Gemini to load it all into context at once

2.) Using one of the various summarization systems/embedding/RAG etc.

3.) Fine tuning the whole code base into Gpt-4o


I have a 5-line shell script to load a git repo into a single file, which I then use with Google AI Studio. I tested it to simplify a DreamCoder implementation, but results are so-so so far.


Tools like aider [0] maintain a large context with minimal token usage by reading a repo's .git folder. Impressive results.

[0] https://github.com/paul-gauthier/aider


also, aider has added experimental support for prompt caching: https://github.com/paul-gauthier/aider/issues/1086#issuecomm...

I really like the control you get with aider over the LLM context. You can /add or /drop source code, markdown notes. You can /clear the chat. /tokens shows you the context and the cost, you can see what each prompt will cost you.

I find aider best used in conjunction with a git diff view in VSCode, I run aider with --no-auto-commits and then manually review each time in VSCode.

I'm keen to learn any AI coding workflows if anyone has any links. I've benefitted greatly from tips such as using type hints and documentation for the LLM's benefit.


1 then 2. Never 3.


Interesting why would you say that?

There’s not been much opportunity previously to easily fine tune a Gpt 4 class model. I haven’t seen anything written up on this being tried.


It's wild how quickly the AI landscape shifts. Just 6 months ago, I couldn't go a day without using ChatGPT. Now, it's like finding an old flip phone in a drawer. I've gone full Claude convert. Wonder where I'll be six months from now.


Github's Copilot Workspace or Cursor's new composer beta feature. That's where you'll be in 6 months.

Multi-file code suggestions with an intimate understanding of the entire code base.

https://x.com/aantix/status/1819794837375263228

P.S. Or Plandex, https://plandex.ai/


Cursor + Sonnet feels like a cheat code right now. Feels similar to the excitement I had the first day I ever used GitHub copilot.

I saw this this morning - it really made me think about what the future of 1-man non technical founder startups will look like. https://x.com/0xluffyb/status/1825854097481736479?s=42&t=7-X...


I was going to try Copilot Workspace until, during the signup process, it asked for access to my private repos, then I noped out of there. Why would they need that unless they wanted to use it to train off my data?


I have a Claude subscription now due to Projects/Artifacts (caching, really), but aider with the new GPT-4o works well.


How do you find it better? I haven't seen the benefits.


> Fine-tuned models remain entirely under your control

Also, in the very next paragraph:

> We’ve also implemented layered safety mitigations for fine-tuned models to ensure they aren’t being misused.

Well done, Sam


How do y'all deploy fine-tuned models? I have separate projects for staging and prod, but it doesn't seem like a fine-tune can be shared across projects.

Am I wrong to split projects by env? Am I expected to run fine-tunes separately per env (surely not)? Am I missing an option to share fine-tunes across projects?


Are you talking about fine-tuned models that you host yourself, or fine-tuned models from a hosted provider like OpenAI?

What do you mean by an "env" here?


I mean OpenAI hosted fine-tunes (same as referenced in OP.)

I have a staging deployment and a production deployment. Ideally anything that I roll out to production, I can try on staging first — including switching from gpt-4o to a fine-tuned gpt-4o. I don't want the production API key to be accessible by my staging app, so I have two separate projects within the OpenAI dashboard. One is called my-app-prod, and the other is my-app-staging.

To illustrate the problem further, I also have infrastructure to eval models before changing what production is running. The eval infrastructure also has its own OpenAI project, so that I can set separate billing limits on running evals. Any fine-tuned model needs to be benchmarked first, but again, I'm not sure how to make the same fine-tune available to both the eval project, and the production app project.


Hey, engineer on the OpenAI fine-tuning team here. We know this is something of a pain, and we're trying to come up with a way to allow you to share / move models across projects. If you really need to share a model across projects right now, the best way to do it is to train the model in the default project; it will then be available to use in all other projects. That's not an ideal solution, obviously, but it's the only mechanism available currently.


Gotcha, thanks for the answer! We're a pretty heavy user of OpenAI (my company's called Semgrep) and I'd love to know if you ever run private betas or work with development partners for features like project permissioning. If so, feel free to email me at bence@semgrep.com — otherwise, appreciate your work on improving this!


There seems to be no information on how this fine-tuning is implemented, and this matters, because parameter-efficient techniques minimize catastrophic forgetting, which is most of what causes fine-tuned models to perform worse.

Until this happens, and until we have flexibility in fine-tuning techniques, I will still push investors towards PyReFT, LoRA, and related methods, and the open-source models where all of these techniques work.


Do any vendors allow fine-tuning on images yet?


Just catching on to the marketing phrase "this is just the start...". Isn't this overused?


What kind of data or data formatting do you need to fine-tune GPT-4o? I'd love to throw a bunch of documentation at it and let it learn, but I don't have the resources to extract knowledge, format it as questions and answers, etc.


Fine-tuning generally isn't an effective way to add extra knowledge from things like documentation - my understanding is that the vast amounts of knowledge in the original training data tend to overwhelm any extra knowledge you try to add by fine-tuning.

OpenAI's documentation has good examples of how the data should be formatted (and when it's appropriate to fine-tune): https://platform.openai.com/docs/guides/fine-tuning/fine-tun...


does the same result apply to LoRA?


Fine tuning is most often performed by training a LoRA. It's almost certainly what OAI are doing, as they can inference many lightweight LoRA in parallel atop the same foundation model.
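A minimal numerical sketch of what a LoRA does: rather than updating the full weight matrix W, you train a low-rank delta B@A (rank r much smaller than d), so the adapted forward pass is y = x(W + (alpha/r)·BA). Plain-Python toy with a rank-1 adapter:

```python
# Toy LoRA sketch: the frozen weight W gets a trained low-rank update
# scaled by alpha/r, so only B (d x r) and A (r x d) hold new
# parameters. Values are arbitrary and purely illustrative.
def matmul(X, Y):
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

d, r, alpha = 4, 1, 2.0
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
B = [[0.1], [0.2], [0.0], [0.0]]  # d x r, trained
A = [[0.5, 0.0, 0.0, 0.5]]        # r x d, trained

delta = matmul(B, A)              # rank-1 update, d x d
scale = alpha / r
W_adapted = [[W[i][j] + scale * delta[i][j] for j in range(d)]
             for i in range(d)]

x = [[1.0, 1.0, 1.0, 1.0]]
y = matmul(x, W_adapted)          # adapted forward pass
```

Serving many such adapters atop one shared base model is cheap, since each one is just the small B and A matrices.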



When Cosine fine-tunes a model to be generally better at coding than GPT-4o, that leaves me vaguely confused about the role that each part plays in building the best LLMs (and I would not be shocked to learn that this confusion is also SOTA).


I don’t think it’s all that surprising. VRAM is much smaller than all the data the bot was trained on, so some stuff gets dropped or “lossily compressed”. In the case of foundation models, this is likely a bit of everything.

Fine tunes take that and change what gets dropped or compressed. Rather than somewhat evenly forgetting things, it forgets very little of the knowledge they’re targeting (like coding) and in exchange drops much more of “everything else”.

I don’t believe it works this way, but in a sense, it’s like GPT4o is a 400B parameter model but only devotes 20B parameters to coding because the rest are taken up by Wikipedia and knowing French and what not. A 70B coding fine-tune might be able to devote more than 20B parameters to coding, in exchange for only speaking English, having very little encyclopedic knowledge, etc.

It’s kind of like CPUs vs ASICs. CPUs (GPT4o) will perform better on average at random tasks, while ASICs (finetunes) do dramatically better on the task they were built for, even if they have a lower transistor count.


But isn't that what MoE is supposed to solve already? Why not bake whatever Cosine does into the coding expert?


I am not an expert or even particularly knowledgeable about this, so treat the following as “armchair speculation by an enthusiast”.

Wikipedia says that as of 2023, each expert is ~10 billion parameters, so they’re still much smaller than something like Deepseek Coder 70B.

I’m not sure why they’re so small though. I don’t know whether there’s some kind of architectural issue or super linear scaling somewhere preventing them from growing, or if they’re just trying to keep inference costs down.


GPT-4o has seen a lot of examples of what people write about coding on the web, what code exists, and what tasks people want to do with that. But that general data doesn't include full process of coding - get a bug report, look at a codebase beforehand, make changes to the codebase, test them and see exactly what bugs they have, and iterate. That process is what SWE-bench tests.

It's possible OpenAI did some coding fine-tuning themselves; Meta's Llama 3 paper [0], section 4.3.1 mentions what sort of work is needed. However, anything OpenAI did is based on their own tooling and set of assumptions - e.g. how is the existing code input into the LLM, what set of actions can the LLM take (e.g. look up documentation), what language is the output code being written in, etc. Cosine's LLM framework may do things differently and have different features, so you'd need to fine-tune the LLM to take maximum advantage of the framework.

It's like dropping the LLM down in front of Vim when it had only ever used or even heard of notepad (or even emacs); there needs to be some training to make it work well with the new tools it has.

[0]: https://arxiv.org/pdf/2407.21783


Huh? It says: "Cosine's Genie achieves a SOTA score of 43.8% on the new SWE-bench Verified benchmark" with a link to https://www.swebench.com/

But the SWE-bench leaderboard (linked to in the post) doesn't show Cosine Genie at all, instead showing Amazon at the top with 38.8% accuracy.

If true, seems wild that Genie, a 10 person startup with $2.5m in funding can actually achieve SOTA results over Google, Amazon, Microsoft, Anthropic, OpenAI etc. It's not like the big players are overlooking this problem of using LLMs to automate software engineering. Anyone have more color on this? I see some speculation online that the training data can easily get contaminated with benchmark questions, but not much careful evidence.


It's with an asterisk. Here's their comment from Cosine's website:

> Note SWE-Bench has recently modified their submission requirements, now asking for the full working process of our AI model in addition to the final results - their condition to have us appear on the official leaderboard. This change poses a significant challenge for us, as our proprietary methodology is evident in these internal processes. Publicly sharing this information would essentially open-source our approach, undermining the competitive advantage we’ve worked hard to develop. For now, we’ve decided to keep our model’s internal workings confidential. However, we’ve made the model’s final outputs publicly available on GitHub for independent verification. These outputs clearly demonstrate our model’s 30% success rate on the SWE-Bench tasks.


It looks like they didn't want to make a public submission in order to avoid disclosing the model internals: https://cosine.sh/blog/genie-technical-report#:~:text=SWE%2D....


"verified" being the (rather confusing) keyword here.

https://openai.com/index/introducing-swe-bench-verified/


Your link doesn't actually point to the leaderboard. This link does https://www.swebench.com/ and you can click on the "Verified" tab. I don't see any entry for Genie.


Curious to hear some real-world use cases for fine-tuning and what your setup looks like in terms of:

- frameworks

- hardware

- foundation models (OS vs 3rd party)

- evaluation & training data

Did the performance gains justify the investment for fine-tuning?


I think vision models have a much bigger ROI from fine-tuning than language models. That being said, I do consider fine-tunes helpful in improving smaller models as long as the domain is limited in scope. In other words, fine-tuning allows you to get similar performance out of smaller models, but improved performance in larger models seems pretty elusive at the moment, albeit somewhat possible with very creatively engineered training datasets.

An example of this is DeepSeek-Coder, which can essentially be considered a fine-tune of a fine-tune of a Mixtral model. It performs very similarly to Claude 3.5 Sonnet, which is pretty damn impressive, but it does it at less than 1/10th the cost.

What I don't understand though is why anyone would even remotely consider fine tuning a GPT-4o model that they will never fully own, when they could spend the same resources on fine tuning a Llama3.1 model that they will own. And even if you absolutely don't care about ownership (???), why not do a fine tune of an Anthropic model which is already significantly better than GPT-4o. At this point, with the laggard performance of OpenAI and their shameless attempts at regulatory hostility to competitors, I can't imagine ever giving them any of my money, let alone owning my derivative work.


"An example of this is DeepSeek-Coder, which can essentially be considered a fine-tune of a fine-tune of a Mixtral model"

I've not heard that anywhere else. My impression from https://huggingface.co/deepseek-ai/deepseek-coder-33b-base was that DeepSeek-Coder was trained from scratch.


The current deepseek-coder version (v2) is actually a fine tune off of the deepseek v2 model, and was not trained from scratch.

I’m now getting conflicting information about the origin of the deepseek MOE framework, so I may be wrong about it starting with a Mixtral model.


So much focus on fine-tuning when it can actively make performance on reasoning and planning benchmarks worse (over a baseline that's already worse than a coin toss).

Why not give us nice things for integrating with knowledge graphs and rules engines pretty please?


I think I’m not understanding what goes into making this available. Fine-tuning is a known technique, why is it available “only” now?


Because this isn't you fine-tuning an open weights model and running it yourself - this is fine-tuning via OpenAI's API and them hosting the resulting model (and everybody else's differently fine-tuned ones) and running it for you.


still has a customer noncompete so you’re still paying to get wrecked long term


When are they going to fix ChatGPT Classic not using Custom Instructions?


How are people feeling with regards to GPT-4o versus Claude 3.5 Sonnet? I recently watched this Primeagen video [0] about how, because LLMs don't actually understand anything (yes, AI Effect included [1]), one does not actually gain as much usefulness as they'd expect, especially with subtly wrong outputs. Over time, it just wastes way more time and becomes a form of learned helplessness (and yes, I do know about Socrates' dialogue, I saw it originally elsewhere on HN and had been quoting it for some time [2]).

[0] https://www.youtube.com/watch?v=1-hk3JaGlSU

[1] https://en.wikipedia.org/wiki/AI_effect

[2] https://news.ycombinator.com/item?id=40920318


Through this same lens, using Google is a form of learned helplessness.


>learned helplessness

This is an existing phrase with an explicit meaning that you are clashing with.


If you follow [3], yes, even learning to read and write is a form of learned helplessness, as Thoth and Socrates conclude. Now, you might not think that affects our day-to-day world, but it does; because imagine, if we could not read nor write, what sorts of hypotheses we might come up with, and now extend that to what we say about LLMs. To those who say those are unequal circumstances, I invite you to prove why. Thoth and Socrates knew a hell of a lot more than you ever did, to be frank.


You're doing a bit of a logical fallacy here where you're framing the idea of "learned helplessness" as a bad thing intrinsic to the use of LLMs, but then backpedaling to suggest that writing itself is also learned helplessness. And the latter I'm fine to agree with, but it changes the terms under which you made your comment in the first place to poo poo them. If your point is that technological advances cede some part of humanity to technology, sure, but by nature of that point this isn't relevant commentary on LLMs outside the fact they too participate in that lineage, which, okay, but that isn't really saying anything besides "LLMs are technology".


https://myswamp.substack.com/p/ymxb-is-hard-for-llms

I recently made this blog post showing how y=mx+b is very error-prone in GPT-4o, and pretty accurate (to a point) in Claude 3.5.

I haven't gone down the rabbit hole yet, but I was wondering if fine tuning could fix math errors in LLMs. My initial hunch and understanding is it will not. I'll have to give your links a read/watch.


I'll also explain my downvote, it's because of the assertion that LLMs "don't actually understand anything", which to anyone who's actually successfully used LLMs to solve a difficult problem is clearly false, unless you use some contrived definition of the word "understand" that doesn't match how the word is actually used in normal conversation.


I am referring specifically to what was claimed in the video I quoted, so, unless you watched the actual video, what you are saying has not much bearing on what I am actually saying. Sorry to say it, and not to be harsh, but I constructed my comment to specifically point to such an instance via bracket quotes, please respond to what was said in said video.


Asking people to watch a 30m video in order to understand your comment isn't reasonable - can you summarize the point from the video that you're arguing here so people can respond to it without putting in all of that extra work?


Well, I did; LLMs are not necessarily intelligent enough to not cause new problems in terms of the solutions they produce. This is a fundamental flaw of LLMs that is covered even by mainstream media, much less the AI Effect as shown by Wikipedia. At worst, they might turn a 0.1x engineer into a 10x one, ie a 1x one, except with no ability to actually solve problems cohesively.


I'm an experienced engineer and I've seen what I estimate to be a 2-5x productivity improvement in the time I spend typing code into my computer from embracing LLM-assisted development.

Typing-in-code is only 10% of the work that I do, but this is still a very meaningful improvement for me.

I've written a bunch more about my own experiences here: https://simonwillison.net/series/using-llms/ and here: https://simonwillison.net/tags/ai-assisted-programming/


The bigger ones have gained a rough understanding of a few systems [1]. Which is really impressive and gives an answer to the Chinese Room experiment. In my experience they don’t understand a lot of things I ask about very well. But the fact that they understand anything at all is impressive.

1. https://danangell.com/blog/posts/gpt-understands/


If five years ago someone said that in half a decade we'd have a computer program that could solve medium-complexity Leetcode problems that it had never seen before, hardly anyone would believe them. Now we have programs that can do exactly this, and yet some people never miss a chance to try to trivialise what just a few years ago would have been considered an amazing, world-changing achievement.


Can it, though? My understanding is that ChatGPT has all the Leetcode problems memorized; maybe it can extrapolate to problems substantially similar to ones in its training set.

I tried it for advent of code 2023, and it was pretty helpless.


As do most people.


I don't think that's true - I think decent programmers can figure out mediums if they apply themselves.


I feel like this comment was made in good faith so I wanted to explain my downvote: I think it's just too far off-topic for this particular announcement. It's better as a separate discussion.


Thanks, however I thought it was fairly aligned simply due to 4o receiving the same sorts of functionality as Claude had for a while.



