Yes, this project seems like a misunderstanding of what Bubeck and team were observing with their unicorn test. GPT-4 was being trained, and checkpoints were provided to them to experiment with. The improvements in the unicorn reflected further training progress.
The models on offer now are frozen(-ish). Per the models[0] page, the non-snapshot model IDs "[w]ill be updated with our latest model iteration". So this project will eventually hit another version of the model, but (a) one image definitely will not be enough to reliably discern a difference and (b) it seems that 'latest model iteration' cadence will be much slower than daily.
If you guys think ChatGPT and GPT-4 are static, you haven't been using them. The answers change constantly (and not because of the inherent randomness of the response: OpenAI is constantly making it "safer" and improving its output via humans). Basically, when any article comes out saying "ChatGPT can't solve X puzzle", within a day suddenly it can solve that puzzle perfectly. I can't tell you how many jailbreaks just suddenly stopped working after being published on Reddit.
It's possible to change the output both by tuning the parameters of the model and also client-side by doing filtering, adjusting the system prompt, and/or adding/removing things from the user prompt. It's very possible to change the resulting answers without changing the model itself. This is noticeable in ChatGPT; what you say is true, the answers change from time to time.
But when using the API, you get direct access to the model, parameters, system prompt, and user prompts. If you give that a try, you'll notice that you get the same answers as you did before; it doesn't change that often and hasn't changed since I got access to it.
This kind of stuff is likely done without changing model parameters, and instead via filtering on the server and prompt engineering. One day is simply too short to train and evaluate the model on a newly fine-tuned task.
I'm assuming the model has a hand-written "prefilter" and "postfilter" which modify both any prompt going in and the tokens that are spit out? If they discover that the model has problems with prompts phrased a certain way, for example, it would be very easy to add a transform that converts prompts to a better format. Such filters and transforms could be part of a product sitting on top of the GPT-4 model without being part of the model itself? As such, they could be deployed every day. But tracking changes in those bits wouldn't give any insight into the model itself, only how the team works to block jailbreaks or improve corner cases.
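To make the idea concrete, a pre/post-filter layer like that might look roughly like this. This is a purely illustrative sketch: the pattern list, function names, and refusal message are all made up, not anything from OpenAI's actual serving stack.

```python
# Hypothetical filtering layer that could sit in front of a model
# without touching its weights. All names here are invented.

BLOCKED_PATTERNS = ["dan mode"]  # e.g. known jailbreak phrasings (assumed example)

def prefilter(prompt: str):
    """Return a cleaned-up prompt, or None to refuse outright."""
    if any(p in prompt.lower() for p in BLOCKED_PATTERNS):
        return None
    # A transform here could also rephrase prompts the model handles badly.
    return prompt.strip()

def postfilter(completion: str) -> str:
    """Scrub model output before it reaches the user."""
    return completion.replace("internal:", "")

def serve(prompt: str, model) -> str:
    """Wrap any model callable with the pre/post filters."""
    cleaned = prefilter(prompt)
    if cleaned is None:
        return "Sorry, I can't help with that."
    return postfilter(model(cleaned))
```

Because nothing here touches the weights, a layer like this could be redeployed many times a day, which is the point: changes in its behavior say nothing about the underlying model.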
I think improved filtering for jailbreaks is very unlikely to correspond to the kinds of model improvements that would result in drawing a better unicorn.
In fact the more safeguards the dumber the model gets, as they published.
Which is very interesting. You already have a model that consumes nearly the entire internet with almost no standards or discernment, whereas a smart human is incredibly discerning with information (I’m sure you know what % of the internet content you read is actually high quality, and how even in the high-quality parts it’s still incredibly tricky to figure out the good stuff - not to mention that half the good stuff is actually buried in low-quality pools). But then you layer in political correctness and dramatically limit the usefulness.
Greg Brockman just stated in a long post on Twitter yesterday: "...it’s easy to create a continuum of incrementally-better AIs (such as by deploying subsequent checkpoints of a given training run), which presents a safety opportunity very unlike our historical approach of infrequent major model upgrades." [1] This implies OpenAI will be shifting strategy to incrementally releasing future models, so we won't just suddenly see a GPT-5, but GPT-4.x along the way over the next months. The models page you cited also says, "some of our models are now being continually updated," and they will only offer alternative models which are snapshots that are frozen for 3 month windows for those needing more stability in the model.
Sebastien Bubeck gave a talk to MIT CSAIL on the Sparks of AGI paper where he commented on how RLHF (training for more safety) has caused the Unicorn test to become less recognizable [2]. He comments about how the safety training is at odds with a lot of these abilities. This seems congruent with the initial results from this project, and it will be interesting to see if they can restore this ability as they continue to push safety training.
My expectation is that there will be incremental updates to the model, so while I'm providing the model `gpt-4` for completions, I'm recording the actual model, `gpt-4-0314` in this case, along with the result.
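The API response itself reports which snapshot actually served the request in its `model` field, so recording it alongside each result is straightforward. A minimal sketch of that logging step (the response shape is shown with a sample dict rather than a live call; a real one would come from a chat completion request with `model="gpt-4"`, and the log filename is my own invention):

```python
import json
from datetime import date

def record_result(response: dict, path: str = "unicorns.jsonl") -> dict:
    """Log the resolved model snapshot and completion text for later review."""
    entry = {
        "date": date.today().isoformat(),
        "requested_model": "gpt-4",            # the alias we asked for
        "resolved_model": response["model"],   # e.g. "gpt-4-0314"
        "content": response["choices"][0]["message"]["content"],
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

# Sample response dict shaped like the chat completion API's output:
sample = {
    "model": "gpt-4-0314",
    "choices": [{"message": {"content": "\\begin{tikzpicture}..."}}],
}
entry = record_result(sample)
```

With the resolved snapshot in every log entry, any future change in output quality can be checked against whether the snapshot actually changed.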
I don't want to monitor for (and potentially miss) model updates, which is likely given that this is very much a fire-and-forget project that I'll review over time. One generation per day seems more sensible to me than a large batch per month, as the daily generations are more likely to track model changes if they become frequent.
Will be exciting to see where we are in a few months!
But if each image is random, what does an image being "better" one day actually tell you about the underlying model?
That's why I suggested generating 16 images each time instead of just 1: if all 16 are noticeably better than the previous day, you've learned something a lot more interesting than if just one appears to be better than the previous one.
I don't think any individual sample will tell us much. I think we'll need to review this in 3 months, 6 months, 12 months etc, and look for patterns of changes. Ultimately, this is just a fun side project, something to leave running and check back on once every month or two.
I suggest the author generate a few different examples per day, each with a different temperature setting. 0.0 temperature ought to be deterministic on the same model, but would be interesting to look for trends at higher temperatures over time.
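The suggested setup can be sketched by building one request per temperature, with several samples each (folding in the earlier 16-image idea via the API's `n` parameter). The prompt text and sample count are placeholders, and the actual send call is omitted to keep the structure clear:

```python
# Sketch of a daily temperature sweep. The dicts mirror chat completion
# request parameters; sending them is left out of this illustration.

PROMPT = "Draw a unicorn in TikZ."   # placeholder prompt
TEMPERATURES = [0.0, 0.5, 1.0]       # 0.0 should be (near-)deterministic

def build_requests(prompt: str, temperatures, n_samples: int = 4):
    """One request per temperature, asking for several completions each."""
    return [
        {
            "model": "gpt-4",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": t,
            "n": n_samples,  # multiple completions per request
        }
        for t in temperatures
    ]

requests = build_requests(PROMPT, TEMPERATURES)
```

The temperature-0 run acts as a change detector (same model, same output, in principle), while the higher-temperature runs show the spread of what the model can produce on a given day.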
In the talk he specifically mentions the very interesting fact that as they improved "alignment" it affected the unicorn output (negatively, if I remember correctly). So as long as "alignment" is changing, the output should change. I'm not sure, but ongoing RLHF, changing "system" prompts, etc. can and do change while the underlying foundation model need not.
"Running this project daily doesn't make sense if GPT-4 is not being constantly updated"
With a suggestion to run it monthly instead, generate 16 images at a time, and backfill it for GPT-3 and GPT-3.5.