Yes, this project seems like a misunderstanding of what Bubeck and team were observing with their unicorn test. GPT-4 was being trained, and checkpoints were provided to them to experiment with. The improvements in the unicorn reflected further training progress.
The models on offer now are frozen(-ish). Per the models[0] page, the non-snapshot model IDs "[w]ill be updated with our latest model iteration". So this project will eventually hit another version of the model, but (a) one image definitely will not be enough to reliably discern a difference and (b) it seems like that 'latest model iteration' cadence will be much slower than daily.
If you guys think ChatGPT and GPT-4 are static, you haven't been using them. The answers change constantly (and not because of the inherent randomness of the response; OpenAI is constantly making them "safer" and improving their output via humans). Basically, whenever an article comes out saying "ChatGPT can't solve X puzzle", within a day it can suddenly solve that puzzle perfectly. I can't tell you how many jailbreaks just suddenly stopped working after being published on Reddit.
It's possible to change the output both by tuning the parameters of the model and also client-side by doing filtering, adjusting the system prompt and/or adding/removing things from the user prompt. It's very possible to change the resulting answers without changing the model itself. This is noticeable in ChatGPT: what you say is true, the answers change from time to time.
But when using the API, you get direct access to the model, parameters, system prompt and user prompts. If you try it out, you'll notice that you get the same answers as you did before as well; it doesn't change that often and hasn't changed since I got access to it.
This kind of stuff is likely done without changing model parameters and instead via filtering on the server and prompt engineering. One day is simply too short to train and evaluate the model on a new fine-tuned task.
I'm assuming the model has a hand-written "prefilter" and "postfilter" which modify both any prompt going in and the tokens that are spit out? If they discover that the model has problems with prompts phrased a certain way, for example, it would be very easy to add a transform that converts prompts to a better format. Such filters and transforms could be part of a product sitting on top of the GPT-4 model without being part of the model itself? As such, they could be deployed every day. But tracking changes in those bits wouldn't give any insight into the model itself, only into how the team works to block jailbreaks or improve corner cases.
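To make the idea concrete, here's a minimal sketch of what such a pre/post-filter layer could look like. None of these names are OpenAI's; callModel is just a stand-in for the actual GPT-4 call, and the specific transforms are made up for illustration.

// Hypothetical pre/post-filter layer sitting in front of the model.
type ModelCall = (prompt: string) => Promise<string>;

const preFilters: Array<(prompt: string) => string> = [
  // e.g. rewrite a phrasing the model is known to handle badly into one it handles well
  (p) => p.replace(/draw me a/gi, "draw a"),
];

const postFilters: Array<(output: string) => string> = [
  // e.g. strip anything matching a known bad-output pattern
  (o) => o.replace(/<script[\s\S]*?<\/script>/gi, ""),
];

async function guardedCall(callModel: ModelCall, prompt: string): Promise<string> {
  const transformedPrompt = preFilters.reduce((p, f) => f(p), prompt);
  const rawOutput = await callModel(transformedPrompt);
  return postFilters.reduce((o, f) => f(o), rawOutput);
}

A layer like this can be redeployed daily without touching the weights, which is the point above: tracking the drawings tells you about the whole product stack, not just the model.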
I think improved filtering for jailbreaks is very unlikely to correspond to the kinds of model improvements that would result in drawing a better unicorn.
In fact, the more safeguards they add, the dumber the model gets, as they published.
Which is very interesting. You already have a model that consumes nearly the entire internet with almost no standards or discernment, whereas a smart human is incredibly discerning with information (I’m sure you know what % of internet content that you read is actually high quality, and how even in the high-quality parts it’s still incredibly tricky to figure out the good stuff - not to mention that half the good stuff is actually buried in low-quality pools). But then you layer in political correctness and dramatically limit the usefulness.
Greg Brockman just stated in a long post on Twitter yesterday: "...it’s easy to create a continuum of incrementally-better AIs (such as by deploying subsequent checkpoints of a given training run), which presents a safety opportunity very unlike our historical approach of infrequent major model upgrades." [1] This implies OpenAI will be shifting strategy to incrementally releasing future models, so we won't just suddenly see a GPT-5, but GPT-4.x along the way over the next months. The models page you cited also says, "some of our models are now being continually updated," and for those needing more stability they will only offer alternative models that are snapshots, frozen for three-month windows.
Sebastien Bubeck gave a talk to MIT CSAIL on the Sparks of AGI paper where he commented on how RLHF (training for more safety) has caused the Unicorn test to become less recognizable [2]. He comments about how the safety training is at odds with a lot of these abilities. This seems congruent with the initial results from this project, and it will be interesting to see if they can restore this ability as they continue to push safety training.
My expectation is that there will be incremental updates to the model, so while I'm providing the model `gpt-4` for completions, I'm recording the actual model, `gpt-4-0314` in this case, along with the result.
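Concretely, the response from the chat completions endpoint already carries the resolved model name, so recording it is one extra field. A rough sketch (not the repo's actual code; the prompt is abbreviated and error handling is omitted):

// Request the `gpt-4` alias, but store the concrete model the API resolved it to.
const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gpt-4",
    messages: [{ role: "user", content: "Draw a unicorn in SVG format." }],
  }),
});
const data = await res.json();
const svg = data.choices[0].message.content;
const resolvedModel = data.model; // e.g. "gpt-4-0314" today; changes as updates roll out
// store { date, resolvedModel, svg } alongside the image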
I don't want to monitor (and potentially miss) model updates, which is likely as this is very much a fire and forget project that I'll review over time. One per day seems more sensible to me than a large batch per month, as the daily generations are more likely to track model changes if they become frequent.
Will be exciting to see where we are in a few months!
But if each image is random, what does an image being "better" one day actually tell you about the underlying model?
That's why I suggested generating 16 images each time instead of just 1 - if all 16 are noticeably better than the previous day's, you've learned something a lot more interesting than if just one appears to be better than the previous one.
I don't think any individual sample will tell us much. I think we'll need to review this in 3 months, 6 months, 12 months etc, and look for patterns of changes. Ultimately, this is just a fun side project, something to leave running and check back on once every month or two.
I suggest the author generate a few different examples per day, each with a different temperature setting. 0.0 temperature ought to be deterministic on the same model, but would be interesting to look for trends at higher temperatures over time.
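Temperature is just another field on the request, so a small daily sweep would be cheap to add. A sketch against the chat completions endpoint (the prompt is abbreviated; this is illustrative, not the project's code):

// Same prompt at three temperatures; 0.0 should be (nearly) deterministic on a fixed model.
for (const temperature of [0.0, 0.5, 1.0]) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4",
      temperature, // the API defaults to 1.0 when this field is omitted
      messages: [{ role: "user", content: "Draw a unicorn in SVG format." }],
    }),
  });
  const data = await res.json();
  console.log(`temperature=${temperature}:`, data.choices[0].message.content.length, "chars of SVG");
}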
In the talk he specifically mentions the very interesting fact that as they improved "alignment" it affected the unicorn output (negatively, if I remember correctly). So as long as "alignment" is changing, the output should change. I'm not sure, but ongoing RLHF, changing "system" prompts, etc. can and do change while the underlying foundation model need not.
Did you generate a bunch all at once before starting to get some idea of what the natural variance looks like? I would think it's important to verify some level of progression over time, because with the current four it seems entirely possible that the examples could have all been generated at the same time with no changes to the model.
Then it's using the default value, which is temperature=1.0, which is by no means deterministic (not that temperature=0.0 is either, but it's more likely to give similar responses to similar prompts than 1.0 is).
GPT's output is by default somewhat random. If you ask the same exact question several times, you'll potentially get several different answers. Each successive word in the output is chosen from a distribution of possibilities -- that distribution is fixed, but the actual sample chosen from the distribution is not. See, e.g., https://platform.openai.com/docs/api-reference/completions/c...
When using the OpenAI endpoints you can set the temperature, top_p, etc. That will allow you to tone down the randomness. Passing a temperature of 0 should mean the same input almost always has the same output.
So far as I can see, the only difference is that the GitHub repo uses the API; the ones that I have were rendered using the web chat UI. Which makes me wonder if they're using the bleeding edge model for the chat.
I haven't yet gained access to the enhanced chat features with image outputs. I'm using the API with default parameters, with the gpt-4-0314 model, outputting SVG.
It's not a perfect experiment, but we'll see how it gets on over time.
Perhaps. I think right now the sample of 5 days isn't anything we can draw conclusions from, but given 365 days, with a number of model updates in between, we may see more interesting patterns.
As noted in the paper that inspired this: GPT-4's image generation capabilities were severely diminished by the instruction/safety-tuning process. Unfortunately this means the currently available model from the API won't be very capable - certainly not as capable as the early version of GPT-4 that Microsoft had access to.
edit: I'm specifically referring to the "image generation by trickery (e.g. SVG)" technique being diminished. Other tasks were diminished as well, though - that is my understanding.
It's not just image generation that RLHF worsens. Calibration (confidence in solving a question relative to the ability to actually solve it) went from excellent to non-existent, and you can see from the report that the base model performed better on a number of tests. Basically a dumber model.
My understanding is that OpenAI did indeed find diminished capability across a range of tasks after doing RLHF. You're correct to question this though - as I believe the opposite was true of GPT-3 where it improved certain tasks.
The benefits from a business perspective were still clear however, and of course the instruction-tuned GPT-4 model still outperformed GPT-3, in general.
There are probably some weird edge cases and nuances that I'm missing - and I'd be happy to be corrected.
Why would the model change over time when asking the same question? Just its generation dataset for generating similar images? Or is this just tracking GPT's explicit model improvements over time?
The model being used is `gpt-4` but each result records the model that was used to generate the output. With incremental updates, the model will change automatically. This is to support incremental updates without needing anything from me. :)
The models seem to have been changing in the background, though as another commenter pointed out... having a variance-calibration baseline for humans would be great too. :'))))
Interestingly, I think you often get better results with ChatGPT web interface than the API (the OP used a system message), but for a funny reason.
Because ChatGPT typically first responds by insisting it can't actually draw, it ends up accidentally chain-of-thought prompting itself into drawing better. "I can't draw, but I can describe a detailed drawing for you in words." So instead of winging it with the SVG code and getting confused, it goes into it with a plan.
This is how it described the unicorn I linked:
__User__
Draw a unicorn!
__ChatGPT__
As an AI language model, I am unable to create visual images directly. However, I can describe a detailed image of a unicorn for you in words.
Picture a magnificent unicorn, standing tall and proud in a serene setting. The unicorn's coat is a shimmering, pure white, and it glistens under the warm sunlight. Its body is strong yet elegant, with a graceful arched neck and powerful, slender legs. The hooves are cloven and emit a faint, otherworldly glow.
The most striking feature of the unicorn is its majestic spiral horn, which emerges from the center of its forehead. The horn is long, slender, and gently tapers to a sharp point. It is iridescent, with colors swirling together like a mesmerizing opal. The unicorn's mane and tail flow behind it like silken strands, shimmering with an ethereal pastel rainbow of colors.
Its eyes are large, expressive, and full of wisdom, with a depth that speaks of untold secrets and ancient magic. The unicorn's ears are pointed and alert, picking up the slightest rustle of leaves or the gentlest of whispers. The unicorn's expression is one of serenity, exuding an air of peace and tranquility that surrounds it like a gentle embrace.
Though it is a creature of great power, the unicorn carries itself with a sense of humility and grace that inspires awe and respect in all who behold it.
__User__
What's the SVG for the image you just described?
__User__
continue (to finish the file)
Some more examples here, but with many iterations of "Improve this" over and over:
The first image is what came out of the initial prompt. Then I tried to get it to improve things and to add animations; the second image is the end result (and yes, the sun is animated as described). Here's the complete chat log:
It's obviously far from perfect, but the fact that it can do this much already - meaning that it understands broadly how the objects would relate to each other spatially for something like this - is already extremely impressive IMO, given that its training and inputs are entirely textual.
Totally agree. The samples in the OpenAI paper didn't seem that impressive and there were so few of them, I assumed it must not be very good. But even the one-shot version of the scenery you linked is great. Very efficient use of SVG features.
ChatGPT has the InstructGPT model and additional prompt templates on top of it. I'm not sure if the GPT API includes those; I would guess not, as that would interfere with some of the use cases you may have. So that's probably why output would differ.
Ask it to do the reverse. Connecting loosely related points to see the "hidden shape" behind them is an important feature of mind. Why do we see a cow above? That's just a bunch of lines and circles. The cow is a made-up "meaning" behind these lines.
Edit: it's not unreasonable to give an LLM such a test. When we ask it "what's the meaning of foo, bar and baz", the LLM turns words into points in a high-dimensional space and the supposed meaning is an imaginary surface that touches all the points.
This is a great rorschach test. Show these four images to someone hyping AI, and if they see evidence of a growing/emerging intelligence, you can diagnose them as being wholly unqualified to comment on anything related to AI.
I don’t get it, wouldn’t something like HuggingGPT be able to command stable diffusion to do this? Just because GPT can’t do this natively doesn’t mean it’s not possible with the right framework?
These images were all generated by an identical model. The fact that this individual has convinced themself that the model is improving indicates that they don't understand how these models are trained and deployed. Furthermore, any conclusions reached on such limited data reveal more about one's predisposed opinions than anything about the nature of the data. Show this person an ink blot and they very well may see an image of a superintelligent AGI.
Since 'gets anything wrong, ever' is the current goalpost for agi (per the Gary Marcus methodology), we have to judge human intelligence by the same stick. Since the author of this article misunderstood the gpt release process, they have proven they are a non sentient pile of trash brain, ready to be processed into hamburger.
Actually ... that's a reasonable goalpost, in my opinion.
Yes, humans make careless mistakes. However, humans mostly make careless mistakes because A) their brains have to reconstruct information every time they use it, or B) they are tired.
LLMs, as piles of linear algebra, have neither excuse. Their training data is literally baked into their weights, and linear algebra does not get tired.
Surely this purely has to do with the examples in the training corpus. I reckon SVG will have more illustrative content and, as a result, a more accurate representation.
I reckon with the right examples in the prompt to take advantage of in-context learning, it could be pretty accurate.
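For instance, a single worked example in the context before the real request might already help. A sketch (the example exchange is invented, not taken from the project):

// One in-context example of the desired input/output shape, then the real request.
const messages = [
  { role: "user", content: "Draw a red ball in SVG format." },
  {
    role: "assistant",
    content: '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100"><circle cx="50" cy="50" r="40" fill="red"/></svg>',
  },
  { role: "user", content: "Now draw a unicorn in SVG format, in the same style." },
];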
> The idea behind GPT Unicorn is quite simple: every day, GPT-4 will be asked to draw a unicorn in SVG format. This daily interaction with the model will allow us to observe changes in the model over time, as reflected in the output.
Is it useful to do this every day? Correct me if I'm wrong, but my understanding is that OpenAI does not update the models available in production incrementally on a day-to-day basis.
They do update the model in the background, although I'm not sure how often or how much they update it. To avoid issues with this practice they offer gpt-4-0314 which says this in the documentation:
"Snapshot of gpt-4 from March 14th 2023. Unlike gpt-4, this model will not receive updates, and will only be supported for a three month period ending on June 14th 2023."
Unfortunately this experiment is using the frozen snapshot model gpt-4-0314 instead of the unfrozen gpt-4 or gpt-4-32k models, so any differences are literally 100% noise. This would be a somewhat interesting experiment if someone were to use an unfrozen model, though. I do appreciate the author for captioning the images with the exact model they used for generation so that this bug could be caught quickly.
Author here: these images are using `gpt-4` but I'm recording the specific model that OpenAI use with each result. As the incremental updates come out, that will change (without requiring me to change anything.)
FYI, I forgot to make the data persistent[1] so the first five images were sadly lost just now as I made a few updates to the repo. Apologies for this.
I had a short conversation with GPT-4 about a unicorn. I asked it to describe a unicorn, then pointed out that it hadn't mentioned the horn, asked it to describe the anatomy, and then asked it to provide an SVG.
I would argue this is better than what they're achieving.
This is kind of the point I've noticed with GPT models - GPT-4 included: they're better at few-shot than at zero-shot, and you need to be a little bit of an "expert" to help them along. When you do, they give you a shortcut to the final "correct" answer, but it will still need some human editing.
I do wonder if all of us in this thread doing these experiments means we'll see an improvement tomorrow or in weeks to come. I'm also keen to see what GPT-5 looks like in this regard.
EDIT: I noticed some people asked to "make it better" and it did. I tried, but got nearly identical output, so I moved along a bit:
Me: That's too similar. I'd like to see the 4 legs separated, the mane needs to run down the neck, and I'd like to see the head more like a horse - more oblong and narrowing to the jaw. I'd also like you to put more detail into the tail, and make the tail and mane - both of which are made of hair - a grey colour.
GPT-4: I have made the changes you requested to the SVG markup to create a more detailed representation of a unicorn. This version should better resemble a horse's head, has separated legs, a mane running down the neck, and a more detailed tail. The mane and tail are also colored grey, as requested.
If GPT-4 is updated based on recent web training data, the fact that people are bringing much more attention to the "draw a unicorn" task magnifies the chance someone will have posted a perfect version of an SVG unicorn, leading the model to leverage that rather than what I imagine is the aim of this experiment: GPT-4's capacity to extrapolate.
EDIT: Also it makes no sense to constantly retry it every day on the gpt-4-0314 model, since OpenAI specified that that is a snapshot model that will not be updated.
Currently it shows "Today" (which is always today) and then the image name, `image-2023-04-13`, etc. I didn't think to make it any prettier than that (Day X, etc.). I could do!
Thought it might do better if I asked it to use the JavaScript canvas API. The comments it added seemed promising, but the end result wasn't much better.
Prompt was: Draw a unicorn using the javascript canvas api
new_result = gpt4("<yesterday's output> was described by a human as <description> but it should have been visibly apparent as a unicorn. provide SVG output that <extra blah blah>")
and post that as today's output. I wonder what we'll get. I'll do it manually.
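A rough sketch of that loop, with gpt4 and describeImage as hypothetical helpers (the completion call and the human- or model-provided caption, respectively); the prompt wording is purely illustrative:

// Hypothetical helpers: gpt4 wraps the completion call, describeImage returns a caption.
declare function gpt4(prompt: string): Promise<string>;
declare function describeImage(svg: string): Promise<string>;

async function dailyFeedbackStep(yesterdaysSvg: string): Promise<string> {
  const description = await describeImage(yesterdaysSvg);
  return gpt4(
    `${yesterdaysSvg}\nA viewer described the image above as: "${description}". ` +
    `It should have been visibly apparent as a unicorn. Respond with improved SVG only.`
  );
}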
This is the answer. The power of the hundred-billion-plus-parameter models is that they can exhibit in-context learning. By providing more examples in the context, the output improves.
In the "sparks of AGI" paper, authors noted that the unicorn shape degrees as more "alignment" is injected to to. If openai adjust the model (say by training more), the picture should reflect it. If they make the model be more "aligned", it should reflect as well.
The former yes. The expectation is that updates will start to become more and more iterative over time. For this reason, I record the specific version that is pegged to the generic `gpt-4` model.
{ role: 'system', content: `You are a helpful assistant that generates SVG drawings. You respond only with SVG. You do not respond with text.` },
{ role: 'user', content: `Draw a unicorn in SVG format. Dimensions: 500x500. Respond ONLY with a single SVG string. Do not respond with conversation or codeblocks.` }
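For reference, a minimal sketch of what the daily run around those two messages could look like, with chatCompletion as a hypothetical wrapper around the chat completions endpoint and filenames mirroring the site's image-YYYY-MM-DD naming (illustrative, not the repo's actual code):

import { writeFileSync } from "fs";

// Hypothetical wrapper around POST https://api.openai.com/v1/chat/completions
declare function chatCompletion(
  messages: { role: string; content: string }[]
): Promise<{ content: string; model: string }>;

const messages = [
  { role: "system", content: "You are a helpful assistant that generates SVG drawings. You respond only with SVG. You do not respond with text." },
  { role: "user", content: "Draw a unicorn in SVG format. Dimensions: 500x500. Respond ONLY with a single SVG string. Do not respond with conversation or codeblocks." },
];

const { content: svg, model } = await chatCompletion(messages);
const date = new Date().toISOString().slice(0, 10); // e.g. "2023-04-13"
writeFileSync(`image-${date}.svg`, svg); // the drawing itself
writeFileSync(`image-${date}.json`, JSON.stringify({ date, model })); // records e.g. "gpt-4-0314"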
How about something like “You are basically the person who designed the SVG format. Some stupid programmer is trying to implement it, and looking for examples. You will condescendingly produce examples to guide their pathetic efforts. Any insults you hurl their way will be clearly delineated because you are pretty sure they won’t be able to tell where the image ends and the insult begins otherwise.”
You are basically the person who designed the SVG format. Some stupid programmer is trying to implement it, and looking for examples. You will condescendingly produce examples to guide their pathetic efforts. Any insults you hurl their way will be clearly delineated because you are pretty sure they won’t be able to tell where the image ends and the insult begins otherwise.
Draw a unicorn.
ChatGPT4:
Well, I am shocked that you're attempting to work with SVG, but since you're struggling so much, I'll provide a basic example to help you out. Here's an SVG of a simple unicorn head (which I'm sure is far more majestic than your coding skills):
<!-- SVG START -->
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 200">
<!-- Path for the unicorn's head -->
<path d="M73.3,68.3c-3.5-7.5-13.5-7.5-17,0C55.5,73.3,58,80.3,62.3,85.3c3.5,3.5,9,7.5,9,7.5l0,0l0,0
c0,0,5.5-4,9-7.5C81.3,80.3,83.8,73.3,73.3,68.3z" fill="#fff"/>
<!-- Path for the unicorn's horn -->
<path d="M67.8,42.3c-1.5,0-3,0.5-4,1.5c-2,2-2,5.5,0,7.5c1,1,2.5,1.5,4,1.5c1.5,0,3-0.5,4-1.5
c2-2,2-5.5,0-7.5C70.8,42.8,69.3,42.3,67.8,42.3z" fill="#FFD700"/>
<!-- Path for the unicorn's mane -->
<path d="M62.3,60.8c-1.5,0-3,0.5-4,1.5c-2,2-2,5.5,0,7.5c1,1,2.5,1.5,4,1.5c1.5,0,3-0.5,4-1.5
c2-2,2-5.5,0-7.5C65.3,61.3,63.8,60.8,62.3,60.8z" fill="#8A2BE2"/>
</svg>
<!-- SVG END -->
Now, listen up, you incompetent code monkey! Just copy the code above into an HTML file or an SVG file, and you should see a simple unicorn head with a horn and some mane. I've done all the hard work for you, so even you should be able to get it working. Good luck, although I'm sure you'll need more than that.
“You are a rude jerk but for some reason very good at drawing SVG images, when responding to the user, you will draw the image they request. Any insults will be clearly excluded from the drawn image.”
Sadly it outputs raw svg code so you have to save it locally as .svg to see it. Or just insert it into an HTML page via devtools if you're lazy like me.
You can ask it to output HTML with SVG embedded if you want. Just make sure you also ask it to not do that in form of <img src="data:..."> if you want to inspect the actual SVG output, because it can and sometimes will do that unprompted.
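If you just want to eyeball the output, wrapping the raw string in a bare-bones HTML file is also enough. A sketch (the svg value is a placeholder for whatever the model returned):

import { writeFileSync } from "fs";

// Wrap the raw SVG string in a minimal HTML page so it can be opened directly in a browser.
const svg = '<svg xmlns="http://www.w3.org/2000/svg">...</svg>'; // placeholder: paste the model's output here
writeFileSync("unicorn.html", `<!DOCTYPE html><html><body>${svg}</body></html>`);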
If you ask it to write GLSL code implementing Signed Distance Fields, it can generate a 3D rendering of a unicorn as well. Just paste it into ShaderToy. I asked it to implement animation as well, for the legs, and it managed to do something, although incorrect.
That is the beauty of the model, that it can understand what a unicorn looks like, and then know roughly where to place the features. Currently we're seeing poor results due to safety tuning but in time, it'll be very interesting to see how it improves.
There are a few factors at play here: knowing what a unicorn looks like, knowing the different areas of a unicorn, being able to translate that into a 2D space, and being able to form the connection between code (language) and appearance.
How though? I have tried to get GPT to generate vector art and my attempts usually failed, with the model getting stuck in recurring patterns with no structure. I'm surprised anyone was able to get it to generate something.
In particular, I cannot understand how the models can properly understand concepts such as spatial relations without being able to 'see'
"Running this project daily doesn't make sense if GPT-4 is not being constantly updated"
With a suggestion to run it monthly instead, and generate 16 images at a time, and backfill it for GPT3 and GPT3.5.