If you think ChatGPT and GPT-4 are static, you haven't been using them. The answers change constantly (and not because of the inherent randomness of the responses: OpenAI is constantly making the product "safer" and improving its output via human feedback). Basically, whenever an article comes out saying "ChatGPT can't solve X puzzle", within a day it can suddenly solve that puzzle perfectly. I can't tell you how many jailbreaks just stopped working right after being published on Reddit.
It's possible to change the output both by tuning the model's sampling parameters and client-side, by filtering, adjusting the system prompt, and/or adding or removing things from the user prompt. So it's entirely possible to change the resulting answers without changing the model itself. This is noticeable in ChatGPT: as you say, the answers change from time to time.
But when using the API, you get direct access to the model, the parameters, the system prompt, and the user prompts. If you try that, you'll notice you keep getting the same answers as before; the model doesn't change that often, and it hasn't changed since I got access to it.
This kind of stuff is likely done without changing model parameters, and instead via filtering on the server and prompt engineering. One day is simply too short to train and evaluate the model on a newly fine-tuned task.
I'm assuming the model has a hand-written "prefilter" and "postfilter" which modify both any prompt going in and the tokens that come out? If they discover the model has problems with prompts phrased a certain way, for example, it would be very easy to add a transform that converts those prompts to a better format. Such filters and transforms could be part of a product sitting on top of the GPT-4 model without being part of the model itself, so they could be deployed every day. But tracking changes in those bits wouldn't give any insight into the model itself, only into how the team works to block jailbreaks or improve corner cases.
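To make the idea concrete, here's a minimal sketch of what such a prefilter/postfilter product layer could look like. This is purely speculative: nothing here reflects OpenAI's actual implementation, the model call is a stub, and the two example rules are invented for illustration.

```python
# Hypothetical sketch: a "product" layer of deployable pre/post transforms
# wrapped around a frozen model. All names and rules here are invented.
import re
from typing import Callable, List

def stub_model(prompt: str) -> str:
    """Stand-in for the frozen underlying model; its weights never change."""
    return f"MODEL_ANSWER({prompt})"

# Rules like these could be redeployed daily without retraining anything.
PRE_RULES: List[Callable[[str], str]] = [
    # e.g. rewrite a prompt phrasing the model is known to handle badly
    lambda p: re.sub(r"(?i)^riddle me this:", "Solve this puzzle:", p),
]

POST_RULES: List[Callable[[str], str]] = [
    # e.g. suppress output matching a known jailbreak signature
    lambda out: "[filtered]" if "DAN mode" in out else out,
]

def serve(prompt: str) -> str:
    """Apply prefilters, call the unchanged model, apply postfilters."""
    for rule in PRE_RULES:
        prompt = rule(prompt)
    out = stub_model(prompt)
    for rule in POST_RULES:
        out = rule(out)
    return out

# The prompt is rewritten before the (unchanged) model ever sees it:
print(serve("Riddle me this: what has four legs?"))
```

From the outside this looks like "the model got smarter overnight", even though only the transform lists changed.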
I think improved filtering for jailbreaks is very unlikely to correspond to the kinds of model improvements that would result in drawing a better unicorn.
In fact, the more safeguards you add, the dumber the model gets, as their own published results show.
Which is very interesting. You already have a model that consumes nearly the entire internet with almost no standards or discernment, whereas a smart human is incredibly discerning with information (I'm sure you know what % of the internet content you read is actually high quality, and how even within the high-quality parts it's still incredibly tricky to figure out the good stuff, not to mention that half the good stuff is actually buried in low-quality pools). But then you layer in political correctness and dramatically limit the usefulness.