We're really excited about CLIP[1], OpenAI's new zero-shot image classification model. We think it's going to power a ton of creative new products. But being inspired isn't as fun as being inspiring, so this weekend our team[2] (and a friend[3]) built a drawing game powered by CLIP.
The concept is simple: you draw a prompt, we feed your drawing into CLIP and calculate a score[4] based on how close CLIP thinks it is to the prompt, and we keep a leaderboard for each prompt. Your goal: draw the best representation of the prompt, as judged by CLIP.
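For the curious, here's roughly what that scoring can look like using OpenAI's open-sourced CLIP package. This is a simplified sketch rather than our exact production code (the real writeup is coming, see [4]); the model choice and the score_drawing helper are just for illustration.

    # Sketch: score one drawing against one prompt with the open-source CLIP
    # package (https://github.com/openai/CLIP). Not our exact production code.
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def score_drawing(image_path: str, prompt: str) -> float:
        """Cosine similarity between a drawing and a text prompt, in [-1, 1]."""
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        text = clip.tokenize([prompt]).to(device)
        with torch.no_grad():
            image_features = model.encode_image(image)
            text_features = model.encode_text(text)
        # Normalize both embeddings so the dot product is a cosine similarity.
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        return (image_features @ text_features.T).item()

    print(score_drawing("drawing.png", "a raccoon driving a tractor"))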
*Why it's so cool*
Two weeks ago, if you wanted to make a game like this you'd have to collect a huge dataset of images matching each prompt and train a custom model to learn what drawings of each of them looked like. This meant you'd need to limit it to simple concepts like "apple" or "fish" that you could find a sufficient number of images for. And adding new prompts would be just as hard: you'd have to collect more data and train another model.
With CLIP this is unnecessary. We can feed it any prompt, e.g. "a raccoon driving a tractor", and it performs marvelously without having to be trained on custom data at all. This is because, unlike traditional image classifiers, it uses the information in the text to sort images into buckets (not just the pixels of the image), and, like GPT-3, it has already seen enough text and images to generalize to new combinations of them.
It's a bit like vocabulary: if you know what a "drawing" is, what a "raccoon" is, what "driving" is, and what a "tractor" is, you don't have to have seen any "drawings of a raccoon driving a tractor" before to be able to identify one.
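To make that concrete, here's a sketch of zero-shot classification with the same package: one drawing ranked against several candidate prompts, none of which CLIP was trained on specifically. The prompts and file name below are made up for illustration.

    # Sketch: rank a single drawing against arbitrary candidate prompts.
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    prompts = [
        "a drawing of a raccoon driving a tractor",
        "a drawing of an apple",
        "a drawing of a fish",
    ]
    image = preprocess(Image.open("drawing.png")).unsqueeze(0).to(device)
    text = clip.tokenize(prompts).to(device)

    with torch.no_grad():
        # The model returns similarity logits between the image and every prompt.
        logits_per_image, logits_per_text = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]

    for prompt, p in zip(prompts, probs):
        print(f"{p:.2f}  {prompt}")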
We posted the minimal working version on twitter yesterday afternoon and discovered a few things that led to v2 late last night:
* Surprisingly, CLIP can read! It was doing a good job of ranking the initial drawings, but after a couple of hours of people poking at the model they discovered that the best way to get to the top of the leaderboard was to write the prompt with the paintbrush (or add text labels to their drawing). This wasn't the behavior we wanted, so last night we updated our code to also compare each image against CLIP prompts like "handwritten text" and apply a penalty to its score based on how text-heavy CLIP thinks the image is (there's a rough sketch of this below the list). The leaderboards look a lot better after adding that feature and recalculating the scores.
* The Internet will be the Internet. If you put out a drawing tool, there is a not-insignificant minority of people who will just draw dicks (and worse). Luckily, we were able to use CLIP to help with this too! We now run each image through a variety of prompts like "Illustration of racist symbols" and, if the image scores highly, we put it behind a [nsfw] blocker. We don't penalize its score, but these drawings tend to end up towards the bottom of the leaderboard naturally anyway.
* People have been spending a lot of time on it! We received over 500 drawings in the first 12 hours and some people are creating artistic masterpieces[5]. Follow @paintdotwtf on twitter; we'll be highlighting cool things we find over the next few days as people continue to experiment.
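For anyone who wants to replicate the anti-cheating and moderation checks from the first two bullets, here's a rough sketch. The extra prompts, the penalty weight, and the NSFW threshold below are illustrative guesses, not the exact values we use.

    # Sketch of the text penalty and NSFW flag described above. The prompt
    # lists, PENALTY_WEIGHT, and FLAG_THRESHOLD are guesses for illustration.
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    TEXT_PROMPTS = ["handwritten text", "a word written with a paintbrush"]
    FLAG_PROMPTS = ["Illustration of racist symbols"]  # we use a variety of these
    PENALTY_WEIGHT = 0.5   # how harshly to punish text-heavy drawings
    FLAG_THRESHOLD = 0.25  # similarity above which an image gets hidden

    def similarities(image_path, prompts):
        """Cosine similarity of one image against a list of prompts."""
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        text = clip.tokenize(prompts).to(device)
        with torch.no_grad():
            image_features = model.encode_image(image)
            text_features = model.encode_text(text)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        return (image_features @ text_features.T).squeeze(0)

    def judge(image_path, prompt):
        raw = similarities(image_path, [prompt]).item()
        # Penalize drawings that CLIP thinks look like writing rather than art.
        text_sim = similarities(image_path, TEXT_PROMPTS).max().item()
        score = raw - PENALTY_WEIGHT * text_sim
        # Flag (but don't penalize) drawings that match any moderation prompt.
        flagged = similarities(image_path, FLAG_PROMPTS).max().item() > FLAG_THRESHOLD
        return score, flagged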
Another goal of this weekend was to try out some products from a few friends of ours. Apart from OpenAI's CLIP model, Firebase (Auth, Storage, Hosting, Firestore), Docker, and AWS Lambda, we used the following tools:
* Parade[6] (YC S20) - a brand-generation tool. It was dead simple to answer some questions and receive back a customized brand identity including our fonts, colors, UI styling, and social media profile photos/regalia. Plus, having this all in one place meant less coordination amongst ourselves as we tackled different parts of the project in parallel.
* Supabase[7] (YC S20) - an open source Firebase alternative. We originally planned to use Supabase as the primary backend for the app, but we needed several features that are on the roadmap and not quite baked yet. We are using it to power the leaderboards, though; that's a feature that isn't really practical to build on Firebase's document store, because the only way to get a count of rows is to pull them all down (which is both slow and expensive). Supabase is backed by an actual Postgres database, so it really shines here (there's a sketch of the leaderboard query below this list). It's replacing RDS as my SQL hosting of choice going forward, and I can see us expanding our use as their Storage and Functions products come online.
* LiterallyCanvas[8] - this made the drawing UI really easy to add so we could focus on the more novel bits.
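Since Supabase is a real Postgres database under the hood, the leaderboard boils down to a couple of SQL queries. Here's a rough sketch; the connection string, table, and column names are placeholders rather than our actual schema, and we're going through psycopg2 here just to show the SQL (the Supabase client libraries work too).

    # Sketch of per-prompt leaderboard queries against the Supabase Postgres
    # database. Connection string and schema are placeholders.
    import psycopg2

    conn = psycopg2.connect(
        "postgresql://postgres:YOUR-PASSWORD@db.YOUR-PROJECT.supabase.co:5432/postgres"
    )

    def leaderboard(prompt_id, limit=10):
        """Top drawings for a prompt, best score first."""
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT image_url, score
                FROM drawings
                WHERE prompt_id = %s AND NOT nsfw
                ORDER BY score DESC
                LIMIT %s
                """,
                (prompt_id, limit),
            )
            return cur.fetchall()

    def drawing_count(prompt_id):
        """Counting rows is one aggregate query instead of a full download."""
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM drawings WHERE prompt_id = %s", (prompt_id,))
            return cur.fetchone()[0]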
All in all, it was a pretty fun weekend. We built something in two days that wouldn't have been possible in any amount of time two weeks ago. Excited to see what other things we discover about the model as more people experiment with it! Be sure to catalog any interesting findings below.
[1] https://openai.com/blog/clip/
[2] Roboflow, https://roboflow.com
[3] Erik from Booste, https://booste.io
[4] Full technical writeup in progress; check back later this week