Hacker News
OpenAI and Microsoft Azure to deprecate GPT-4 32K (twitter.com/headinthebox)
79 points by tosh 9 days ago | 62 comments





A lot of people here haven't integrated GPT into a customer-facing production system, and it shows

gpt-4, gpt-4-turbo, and gpt-4o are not the same models. They are mostly close enough when you have a human in the loop and loose constraints. But if you are building systems off of the (already fragile) prompt-based output, you will have to go through a very manual process of tuning your prompts to get the same/similar output out of the new model. It will break in weird ways that make you feel like you are trying to nail Jello to a tree

There are software tools/services that help with this, and a ton more that merely promise to, but most of the tooling around LLMs these days gives the illusion of a reliable tool rather than the results of one. It's still the early days of the gold rush, and everyone wants to be seen as one of the first


Maybe we shouldn't be selling products built on such a shaky foundation? Like Health Insurance products for example.

[2]: https://insurtechdigital.com/articles/chatgpt-the-risks-and-...

--- please disregard [1], it was a terrible initial source I pulled off Google

[1]: https://medium.com/artivatic/use-of-chatgpt-4-in-health-insu...


Building products on shaky foundations is a tried-and-true approach in IT business.

For a different point of view from someone with extremely credible credentials (he learned this stuff from Hinton, among many other things) and a much more sober and balanced take on all this, I recommend the following interview with Nick Frosst (don’t be put off by the clickbait YouTube title; it's a very silly caption):

https://youtu.be/4JF1V2hzGKE


Minimum Viable Products are pretty much by definition built on shaky foundations. At least with software written by humans the failure modes are somewhat bounded by the architecture of the system as opposed to the who-knows-what-the-model-will-hallucinate of AI.

I think that is the key problem: a traditional MVP is a mostly known entity. It may be missing some features, have some bugs, etc. But it is an MVP not because it was necessarily rushed out the door (I mean... it was, but differently) but because it has some rough edges and is likely missing major features.

Whereas what we seem to be getting from a lot of these companies shoving AI into something and calling it a product is an MVP that is an MVP due to its unknown and untested nature.


The term MVP was cover for shoving poor-quality software out on the market long before AI became involved. This is unfortunate, but it was inevitable once the term was popularized. AI is incredibly easy to tack on now, so people are doing that too.

That is true, but I think rushing to add AI features made it a completely different situation.

We got a lot of MVP crap before, don't get me wrong. But at least it was understood crap. Sure, it may have had bugs in it, and that is to be expected, but there was a limit to how wrong it could go, since at the end of the day it was still limited to the code within the application and the server (if there was one).

Meanwhile, when an over-reliance on an LLM goes wrong, the result, depending on how it goes wrong, could be catastrophic.

As we have seen time and time again just in the last couple of months, when LLMs are shoved into something we seem to get a serious lack of testing under the guise of "beta".


But ultimately we have to test and release things to see what works and what doesn't. Many use cases don't require perfect accuracy.

I’m not really sure this is an entirely fair argument.

If you rely on third party packages of any type, you have dependencies that can rapidly and unexpectedly break with an update. Semantic versioning is supposed to help with this, but it doesn’t always help.


> It will break in weird ways that make you feel like you are trying to nail Jello to a tree

Probably the best description of working with LLM agents I've read


It gets more interesting when you get to benchmarking your prompts for accuracy. If you don't have an evaluation set you are flying blind. Any model update or small fix could break edge cases without you knowing.
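
Even a tiny harness helps here. A minimal sketch of what I mean, where call_model() is a stand-in for whatever client wrapper you use and the cases/checks are placeholders:

  # Minimal prompt-regression sketch. call_model() stands in for your
  # actual client wrapper (OpenAI, Azure, self-hosted, ...).
  from typing import Callable

  def call_model(prompt: str, model: str) -> str:
      raise NotImplementedError("wrap your real client here")

  # Each case: (prompt, loose check on the raw output). Deliberately not
  # exact-match, since some variance is unavoidable.
  CASES: list[tuple[str, Callable[[str], bool]]] = [
      ("Extract the invoice total as JSON.", lambda out: '"total"' in out),
      ("Summarize in one sentence: ...", lambda out: out.count(".") <= 2),
  ]

  def run_eval(model: str) -> float:
      passed = 0
      for prompt, check in CASES:
          try:
              if check(call_model(prompt, model)):
                  passed += 1
          except Exception:
              pass  # an exception counts as a failed case
      return passed / len(CASES)

  # Compare the model you currently ship against the one you're migrating to.
  for m in ("gpt-4-0613", "gpt-4o"):  # example model names
      print(m, run_eval(m))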

We are using benchmarking on our own eval sets, which makes it easier to measure the variance that I’ve found impossible to eliminate.

Make sure you don’t upload that evaluation set to any service that resells data (or gets scraped) for LLM training!

Came here to say the same thing, it sums it up perfectly

Hopefully you built a solid eval system around the core of your GenAI usage, otherwise, yes, this is going to be very painful :)

My naive answer: turn away from Silicon Valley modernity with its unicorns and runways and “marketing”, and embrace the boring stuffy academics! https://dspy-docs.vercel.app/

I never got DSPy. I only tried a brief example, but can someone explain why it's better than alternatives? Not that I hold LangChain in particularly high regard...

hosted on Vercel and Github...

Is it Winter already?

Oh god, what I would give for an AI winter right now… I think we’ve officially hit AI global warming

It can't be, it's too big to die.

I've seen people mention this lib before and I have a hard time understanding the use cases and how it's used.

One of the decent parts about open weights models is that you can't have this happen to you. You get to keep access to any model you want. This is essential for continuity of products.

What is ChatGPT as a product, more than just “really good weights”?

It's just really good weights - significantly better weights than anyone else has.

Why has nobody else been able to catch up/come close?

Really good weights that you don't have to run yourself

Good weights are the key, but it also has a Python interpreter and Bing search, and likely more in development.

Probably a lot that we don't know about, because ChatGPT has better reasoning skills than most other systems out there, and we know reasoning is not really about the weights.

ChatGPT can't reason about anything.

It can pretend, which often looks indistinguishable, so you can't say whether it's real or not.

Just wondering, do you know of any open source LLM that can be reproduced?

Looks like they are just cleaning house of lesser-used models? This came via mail last week.

  Back in June 2023 and November 2023, we announced the following models will be deprecated on June 13th, 2024:
   gpt-3.5-turbo-0301 
   gpt-3.5-turbo-0613
   gpt-3.5-turbo-16k-0613
  We noticed that your organization recently used at least one of these models. To help minimize any disruption, we are extending your access to these models for an additional 3 month grace period until September 13th, 2024. After this date, these models will be fully decommissioned.

Probably more expensive as well.

So much for code migrations! Although most examples are about summarization or needle-in-a-haystack search, applications whose output should be about the size of the input are probably more important, although less advertised.

Curious if that is a business decision or a technical decision, i.e. whether the optimizations for cheap and fast 128k gpt-4o only work for small outputs.


I'm confused here. What's an "output context"? My assumption was that the context window was shared across input and (prior) output. You put everything into the model at once, and then at the end the first unused context vector becomes a vector you can decode for a single token of output. For multiple-token output, you repeatedly run inference, decode, sample, append, and repeat until you sample an end token. Is this just a limit of OpenAI APIs or something I'm forgetting?
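
Concretely, I'm picturing the loop as something like the sketch below, where next_token_logits() is just a stand-in for a real forward pass, not any particular API:

  # Rough sketch of the autoregressive decode loop described above.
  # next_token_logits() stands in for a real forward pass over the model.
  import random

  EOS = 0             # placeholder end-of-sequence token id
  VOCAB_SIZE = 32000  # placeholder vocabulary size

  def next_token_logits(tokens: list[int]) -> list[float]:
      # In reality: one forward pass over the whole (input + generated) sequence.
      return [random.random() for _ in range(VOCAB_SIZE)]

  def generate(prompt_tokens: list[int], context_window: int, max_output: int) -> list[int]:
      tokens = list(prompt_tokens)
      out: list[int] = []
      # Input and generated output share one context window; the serving
      # stack additionally caps how many tokens it will decode (max_output).
      while len(tokens) < context_window and len(out) < max_output:
          logits = next_token_logits(tokens)
          tok = max(range(VOCAB_SIZE), key=logits.__getitem__)  # greedy "sampling" for the sketch
          if tok == EOS:
              break
          tokens.append(tok)
          out.append(tok)
      return out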

The long story short is you are technically correct but in practice things are a little different. There are 2 factors to consider here:

1. Model Capability

You are right that mechanically, input and output tokens in a standard decoder Transformer are "the same". A 32K context should mean you can have 1 input token and 32K output tokens (you actually get 1 bonus token), or 32K input tokens and 1 output token.

However, if you feed an LM "too much" of its own output (read: have too long an output length), it starts to go off the rails, empirically. The phrase "too much" is doing some work here: it's a balance of (1) LLM labs having data that covers that many output tokens in an example and (2) LLM labs having empirical tests to be confident that the model can reasonably be expected not to go off the rails within some output limit. (Note, this isn't pretraining but the instruction tuning/RLHF after, so you don't just get examples for free)

In short, labs will often train a model targeting an output context length, and put out an offering based on that.

2. Infrastructure

While mathematically having the model read external input and read its own output are the same, the infrastructure is wildly different. This is one of the first things you learn when deploying these models: you basically have a different stack for "encoding" and "decoding" (using those terms loosely; this is, after all, still a decoder-only model). This means you need to set max lengths for encoding and decoding separately.

So, after a long time of optimizing both the implementation and length hyperparameters (or just winging it), the lab will decide "we have a good implementation for up to 31K input and 1k output" and then go from there. If they wanted to change that, there's a bunch of infrastructure work involved. And because of the economies of batching, you want many inputs to have as close to the same lengths as possible, so you want to offer fewer configurations (some of this bucketing may be performed hidden from the user). Anyway, this is why it may become uneconomical to offer a model at a given length configuration (input or output) after some time.
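
To make point 2 concrete, here is a toy version of the kind of separate limits a serving stack ends up enforcing. This is purely illustrative, not OpenAI's actual configuration; the numbers are just the hypothetical ones from above:

  # Toy illustration of separate prefill ("encode") and decode limits.
  # Purely hypothetical; not any provider's real configuration.
  from dataclasses import dataclass

  @dataclass
  class ServingConfig:
      context_window: int     # what the weights were trained/evaluated for
      max_input_tokens: int   # prefill limit the infrastructure is sized for
      max_output_tokens: int  # decode limit the offering was tuned for

  # e.g. "a good implementation for up to 31K input and 1k output"
  EXAMPLE_32K = ServingConfig(context_window=32_768,
                              max_input_tokens=31_744,
                              max_output_tokens=1_024)

  def validate(cfg: ServingConfig, n_input: int, n_output: int) -> None:
      # The two limits are enforced separately, and together they still
      # have to fit inside the model's context window.
      if n_input > cfg.max_input_tokens:
          raise ValueError("input exceeds the prefill limit")
      if n_output > cfg.max_output_tokens:
          raise ValueError("requested output exceeds the decode limit")
      if n_input + n_output > cfg.context_window:
          raise ValueError("input + output exceed the context window")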


No. Newer models have 128K of input tokens, but only 4096 output tokens.

It's the number of tokens the model can output in one pass. There are subtle differences between running it multiple times to get a bigger output and running it once to get a bigger output. These are things that only really show up when you integrate these models into production code.

This isn't unique to OpenAI models. A lot of the open source ones have similar limitations.

I am having trouble understanding what the complaint is here.

The docs still mention bigger models with 128k tokens and smaller models with 8k tokens. It seems reasonable to optimize for big and small use cases differently? I don't see how we are being "robbed".


> I am having trouble understanding what the complaint is here.

The appropriate level of due diligence for each LLM model transition is to run your various prompts through the new model and make sure they still produce the correct output; and, if they don't produce good output, to update the prompts so that they continue to produce good output.

Just yesterday, I was experimenting with 4o and assumed I could do a flat migration for some work. 4o actually provided worse results - results I explicitly asked to *not* have in my GPT-4 output (and that I didn't get in my GPT-4 output).

It's tedious to have to change models after you've already done a proper validation suite against one model.

That would be (at least my) complaint.

I've even version-stamped the models I use on purpose to avoid surprises.
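
Concretely, that just means calling a dated snapshot instead of the floating alias, something like the sketch below (openai Python client assumed; the snapshot names are examples and rotate over time):

  # Pin a dated snapshot so the model under your prompts doesn't change
  # until you choose to migrate. (openai Python client >= 1.0 assumed;
  # model names are examples only.)
  from openai import OpenAI

  client = OpenAI()

  PINNED_MODEL = "gpt-4-0613"   # dated snapshot: fixed until deprecated
  # FLOATING   = "gpt-4"        # alias: can silently move to a newer snapshot

  resp = client.chat.completions.create(
      model=PINNED_MODEL,
      messages=[{"role": "user", "content": "..."}],
      temperature=0,  # reduces (but does not eliminate) run-to-run variance
  )
  print(resp.choices[0].message.content)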


Not directly addressing your point, but asking an LLM to not include something in its output often doesn't work well. It's a bit like saying to someone "whatever you do, don't think about elephants."

4o is pretty bad in my experience. They shouldn't have used the "4.0" naming for it, and should have called it Lite or something instead. They are trying to market it as roughly equivalent, which it definitely is not.

At that point why not use your own hosted open source model that is more reproducible for you?

Cost of maintaining architecture, cost of complexity of internal infrastructure, knowledge level required for self-hosting, complexity of local one-boxing, and a slew of other reasons.

Everything's a tradeoff, but it seems that part of the tradeoffs include access to tools critical for your product to function correctly being taken away without a way to get it back. Maybe that can be an acceptable tradeoff, but I'd personally not like living with that.

At present, my startup isn't making money - in fact, it's not even released. As a result, I'm trying to prioritize getting it out of the door while still being affordable for myself enough to bootstrap it.

To do this, I've made, and continue to make, tradeoffs. Among the many tradeoffs I'm currently making that I intend to resolve ASAP is having OpenAI as a single point of failure. I intend to have some of the other hosted solutions as other options for LLM processing. One of the many options that will be considered at that time is self-hosting, as well.

I've already spent more time than I should perfecting various smaller pieces, increasing reliability, etc. Each time I choose perfection, I lose more time, more runway, more potential market share; and, something I've recently had to learn:

Each time I lock myself in a previous step to get that step perfect, I miss the lessons I'm about to have to learn in the next stage of the process, including new issues I'll run into that increase the next step's complexity above my initial estimates.

Everything is a tradeoff. Choosing to use a commercially available solution with known and relatively set costs while accepting it may slowly change underfoot (while also knowing I have alternatives I can swap to if an emergency comes up that should only take a little bit to transfer to) is one I've made.


Because open source models aren't as good as GPT-4.

The main limit is that you can have 128k tokens of input, but only 4k tokens of output per run. gpt-4-32k lets you have up to 32k tokens of output per run. Some applications need that much output, especially for token-dense things like code and JSON.

So it's a price concern? Because you could run for 4k output 8 times to get 32K? Or does the RLHF stuff prevent you from feeding the output back in as more input and still get a decent result? The underlying transformers shouldn't care because they'll be doing that already effectively.

I'd say it's less a price concern and more a consistency-of-output concern. I don't think it makes much sense to continue incomplete JSON like that. I need to do some more research.

You can just feed that output into another call, to have the next call continue it, since you have more than 28k of extra context. The output per token is faster anyway, right? So speed isn't an issue. It's just slightly more dev work (really only a couple of lines of code)
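
Roughly like the sketch below (openai Python client assumed; and, to be fair to the replies, there's no documented guarantee the model picks up exactly as if it had generated everything in one pass):

  # Sketch of "continue in a second call" when the first call hits the
  # output cap. Assumes the openai Python client; whether the model truly
  # resumes with the same "state of mind" is what's being debated below.
  from openai import OpenAI

  client = OpenAI()
  MODEL = "gpt-4o"  # example
  user_msg = {"role": "user", "content": "Emit the full JSON report for ..."}

  first = client.chat.completions.create(
      model=MODEL,
      messages=[user_msg],
      max_tokens=4096,
  )
  text = first.choices[0].message.content

  if first.choices[0].finish_reason == "length":  # output was cut off
      second = client.chat.completions.create(
          model=MODEL,
          messages=[
              user_msg,
              {"role": "assistant", "content": text},
              {"role": "user", "content": "Continue exactly where you left off."},
          ],
          max_tokens=4096,
      )
      text += second.choices[0].message.content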

How do you know it will have the same state of mind? And how much does that cost?

Because the state of mind is derived from the input tokens.

Is there a study or anything guaranteeing that, when you add an incomplete assistant response as the input, the API picks up in exactly the same way from the same position?

It’s how LLMs work: they are effectively recursive at inference time; after each token is sampled, you feed it back in. You will end up with the same model state (not including noise) as if that had been the original input prompt.

For LLMs in general, sure. My question is whether it is the same in practice for the LLMs behind said API. As far as I can tell, there is no official documentation that we will get exactly the same result.

And no one here has touched on how high a multiple the cost is, so I assume it's pretty high.


If you've spent ages fine tuning your prompt/context to have it work for your integration, it's not a given it will work similarly on a model of a different size. Might have to essentially start from scratch.

Ah I see, taking a page from the Google playbook and aggressively culling less popular products. I wonder how many sales Google's reputation for capricious culling has cost them.

Arguably it's rather more like taking a page from versioned builds: not supporting old builds indefinitely, through a process of notice with a grace period until deprecation.


