
Image GPT - todsacerdoti
https://openai.com/blog/image-gpt/
======
minimaxir
The model is open sourced on GitHub: [https://github.com/openai/image-
gpt](https://github.com/openai/image-gpt)

Oddly, it still uses TensorFlow like the original GPT-2 release despite
OpenAI's declared switch to PyTorch, and its dependencies are enough of a mess
that it's not easy to create a wrapper tool for it.

Since it's still the GPT-2 architecture, it might be possible to port the
weights to Huggingface Transformers (for the RGB generation), and then write a
wrapper to extend it for the image rendering. (filed an issue here:
[https://github.com/huggingface/transformers/issues/5088](https://github.com/huggingface/transformers/issues/5088))

~~~
AaronFriel
I'm still learning from deep learning papers and videos before dipping my toes
in myself, is there a summary of why PyTorch vs TensorFlow? Does it matter for
me?

~~~
minimaxir
_Nowadays_, there isn't a huge practical difference in terms of
performance/tooling aside from edge cases and deployment options. It mostly
comes down to your syntax preference. (although there are flame wars from both
sides)

I noted the TensorFlow usage because the original GPT-2 release was TensorFlow
1.X, which led to issues when TensorFlow 2.0 was released soon after.

For model training, I strongly recommend using the higher-level APIs (Keras
for TensorFlow, pytorch-lightning for PyTorch) rather than the respective base
tools.
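
If it helps, here's a minimal sketch of what the high-level Keras workflow
looks like (MNIST is just a stand-in dataset here, nothing to do with iGPT):

    # Minimal Keras sketch: compile() + fit() replace the manual
    # session/loop boilerplate of base TF 1.x. MNIST is a stand-in.
    from tensorflow import keras

    (x_train, y_train), _ = keras.datasets.mnist.load_data()
    x_train = x_train.astype("float32") / 255.0

    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=3, batch_size=64)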

~~~
m0zg
There is a huge difference in developer experience still. When TF fails it
typically offers little to no clue as to what you can do to rectify the
situation. PyTorch is a lot more helpful most of the time. PyTorch is worth
choosing on this basis alone, unless you intend to deploy to mobile hardware,
where TFLite is the only real viable option.

------
jszymborski
Apparently this takes "2500 V100-days", which is an insane amount of resources
for images of this resolution.

For context, this is equivalent to 100 $10,000 GPUs running for 25 days, 24/7.

[https://twitter.com/jm_alexia/status/1273349716915470340?s=2...](https://twitter.com/jm_alexia/status/1273349716915470340?s=20)

~~~
gowld
Cost to train the model (which can be used for many images, and images can be
upscaled using AI):

$10000 * 2500days / 5yr = $15000 hardware cost

200W * 2500day * (0.10 USD / Whr) = $1.2M in electricity

~~~
slavik81
If electricity cost $0.1/Whr, a cup of tea would cost ~$10 just to run the
kettle. You've overestimated the cost of electricity by roughly a thousand-
fold.

Residential electricity prices in California are more like 0.19 USD/kWh, which
is 0.00019 USD/Wh.

~~~
andybak
Found the Brit.

------
fpgaminer
In case anyone looking through the linked article is also wondering why the
images look odd, it's because (buried in the middle):

> motivated by early color display palettes, we create our own 9-bit color
> palette to represent pixels. Using this palette yields an input sequence
> length 3 times shorter than the standard (R, G, B) palette, while still
> encoding color faithfully.

Makes total sense now. In fact, those images remind me so much of how photos
looked on the early internet, since many were palettized.
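
To get a feel for what a 9-bit palette does to the input: note the blog says
their palette actually comes from clustering RGB values, so taking the top 3
bits of each channel, as in this rough sketch, is only a naive approximation:

    # Naive 3-bits-per-channel quantization to illustrate the idea;
    # iGPT's actual palette comes from clustering RGB values.
    import numpy as np

    img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

    quant = (img >> 5).astype(np.int16)  # keep top 3 bits per channel (0-7)
    tokens = (quant[..., 0] << 6) | (quant[..., 1] << 3) | quant[..., 2]

    print(tokens.size)           # 1024 palette tokens for a 32x32 image...
    print(img.reshape(-1).size)  # ...vs 3072 raw (R, G, B) values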

------
tux3
I'd be surprised if this architecture scales to larger resolutions, but any
move towards "general learning" is really the interesting next step to me, not
scaling up an inefficient architecture. Can they train the same GPT model on
both text and images tasks at once, and would either task benefit at all from
training on the other task?

Even GPT-3 seems to have trouble with world-modeling, it writes convincing
text that has all the signs and form of good prose, but the output repeatedly
violates physics and common sense in funny ways.

I know just enough about machine learning to have dangerously unrealistic
expectations, but I'd like it if I could reasonably hope to see signs of a
shared representation or shared knowledge between, say, image labeling and
language modeling. This looks like a very concrete data point to take if you
care about generality. Maybe then we can seriously talk about world-modeling.

~~~
gwern
Of course it's not going to scale much past this: it's quadratic and already
hitting painful compute levels.
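
Back-of-the-envelope, for anyone wondering why (a rough sketch, assuming one
token per pixel):

    # Self-attention materializes an n x n matrix per layer/head, so
    # doubling the image side quadruples n and 16x's the attention cost.
    def attn_entries(side):
        n = side * side          # sequence length: one token per pixel
        return n * n             # entries in a single attention matrix

    for side in (32, 48, 64, 256):
        print(side, f"{attn_entries(side):.1e}")
    # 32 1.0e+06, 48 5.3e+06, 64 1.7e+07, 256 4.3e+09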

However, if you were starting this research today, you'd use any of half-a-
dozen different self-attention variants which are roughly linear, including
OA's own Sparse Transformers (which they did use to generate images, just at a
far smaller scale, one that wouldn't be adequate to show competitive
performance with SimCLR etc). With those, it's perfectly possible to do self-
attention over whole images at 256px or higher.

(As a matter of fact, Aydao has been working on a StyleGAN which just uses
self-attention every layer instead of convolutions, using one of the new
attentions; you can see some generated image samples from Flowers here:
[https://github.com/tensorfork/tensorfork/issues/31](https://github.com/tensorfork/tensorfork/issues/31)
GAN loss, not autoregressive pixel likelihood, but it makes the point.)

~~~
gwern
* lucidrains, not Aydao

------
CShorten
I made a video explaining this paper if interested!
[https://youtu.be/7rFLnQdl22c](https://youtu.be/7rFLnQdl22c)

~~~
superasn
What an excellent video. I know very little about this field but I was able to
make sense of the things you've explained. I need to learn some more
fundamentals I guess, but I'll surely be revisiting your video after that.

------
etaioinshrdlu
This looks like a new iteration on the PixelRNN idea:
[https://arxiv.org/pdf/1601.06759.pdf](https://arxiv.org/pdf/1601.06759.pdf)

GPT-2 definitely looks better.

~~~
bigdict
No, this is a new iteration on the AIAYN idea.

~~~
elcomet
What's this?

~~~
bigdict
"Attention Is All You Need" is the title of the 2017 paper that introduced the
Transformer architecture.

It reflects the evolution of NLP models: (RNN) -> (RNN + attention) ->
(attention), and the idea that you don't need the recurrent component. Just
look at all elements of the sequence at once, applying varying weights
(attention).
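
In code, the core of it is tiny; here's a single-head numpy sketch:

    # Single-head scaled dot-product attention, the core op from AIAYN:
    # every position mixes information from every other position at once.
    import numpy as np

    def attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise scores
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)   # softmax: the "varying weights"
        return w @ V                         # weighted mix of all positions

    x = np.random.randn(8, 16)               # toy sequence: 8 tokens, dim 16
    print(attention(x, x, x).shape)          # self-attention -> (8, 16)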

------
jaredtn
"As further proof, features from the model achieve state-of-the-art
performance on a number of classification datasets and near state-of-the-art
unsupervised accuracy on ImageNet."

Impressive stuff! This performs well even without domain-specific architecture
choices.

~~~
visarga
One more step for ML. It used to be that we needed hand designed image
features. Now we can learn even the image priors (spatial locality and
translation invariance) from data.

Transformers are basically learning relations between pairs of input tokens,
moving the problem to a more abstract level than predicting directly on
tokens. While CNNs excel at exploiting those two forms of invariance,
transformers have permutation invariance; they can predict on sets, graphs,
and non-Euclidean spaces.

~~~
gwern
> Now we can learn even the image priors (spatial locality and translation
> invariance) from data.

Right. The attention layers even learn attention patterns which look like
convolution layer kernels! But better, presumably:
[https://arxiv.org/abs/1911.03584](https://arxiv.org/abs/1911.03584)

------
turdnagel
The post states, "When we train GPT-2 on images unrolled into long sequences
of pixels, which we call iGPT, we find that the model appears to understand
2-D image characteristics such as object appearance and category." This, to me
(a former philosophy student), feels like a very low bar for "understanding."
Could the model _explain_ 2D image characteristics, or can it only generate
them? I'm sure this debate will rage on for a while, but when it comes to
intelligence, I believe we ought to be more rigorous with our use of the word.

~~~
markchen90
Author here. You're absolutely right that "understanding" is a fuzzy word. As
you pointed out, part of the reason we hold this belief is that the model can
generate diverse samples and successfully complete out-of-distribution inputs.
But the other part is that the model learns (without labels) useful features
for classifying objects. Would be very interesting to test it on a broader set
of datasets which measure other 2D image characteristics.

------
woah
This may be my imagination, or maybe it’s just because the images are so
small, but these seem to have fewer of the slightly disturbing artifacts you
see in CNN generated images.

~~~
gowld
The huge artifacts obscure any uncanny valley small artifacts.

------
eanzenberg
There's some definite overfitting apparent in the completion of the bottom
half of the cat (blue background) photo. Every completion has a funny index
card covering the bottom, including the "original", when autocompletion should
be more robust than just recreating the original photo.

~~~
ashtonbaker
I think that the edge of the index card is included in the top half of the
image.

------
aabhay
In the design field, there’s an adage — constraints inspire creativity.

This work seems so unconstrained in its use of computation that it almost
screams to me that they must be going about it the wrong way.

~~~
samgriesemer
You might take a look at The Bitter Lesson [1], it's referenced by the article
and linked around on this thread.

> _One thing that should be learned from the bitter lesson is the great power
> of general purpose methods, of methods that continue to scale with increased
> computation even as the available computation becomes very great. The two
> methods that seem to scale arbitrarily in this way are search and learning._

> _The second general point to be learned from the bitter lesson is that the
> actual contents of minds are tremendously, irredeemably complex; we should
> stop trying to find simple ways to think about the contents of minds, such
> as simple ways to think about space, objects, multiple agents, or
> symmetries._

[1]:
[http://www.incompleteideas.net/IncIdeas/BitterLesson.html](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)

~~~
aabhay
I did take a look at the article before writing my comment, and I disagree
with its premise as well as conclusion. The human mind has a variety of
specialized functions. Our ocular cortex does a sort of convolution on a 2D
field with three color and one alpha ‘sensors’ (cones and rods). If
specialization were less powerful than generality, why isn’t our brain one
giant lobe with no diversity in neuron topography?

The idea that specialization is not as powerful as computation fails the most
basic test of a proactive, rather than retroactive, theory. Can you make
proactive claims about what works in any given domain? Is the solution to take
the hungriest algorithm and apply it? What about feature engineering,
cleaning, parameter tuning, analysis, etc.? Is the most power hungry solution
still the most effective? In my opinion, part of the reason humans aren’t just
giant computation blobs is that we thrive on constraints (physical, sexual,
emotional).

------
fizixer
As someone not up-to-date with literature, are one-pixel/few-pixels/small-
delta attack issues resolved yet?

~~~
janhenr
Well, there is a whole separate line of research concerning the topic of these
input perturbations, ranging from PGD to just Gaussian noise. This model does
not claim to defend against any of those.

~~~
markchen90
Author here. I ran some early experiments a while ago, and it looked like
adversarial examples for convnet classifiers didn't transfer to transformer
classifiers and vice versa. Definitely worth looking more into!

~~~
aquajet
Could you elaborate a bit more? What were the differences between transformer
adversarial examples and cnn adversarial examples?

~~~
markchen90
I didn't notice any obvious visual differences, but I'm also not an expert on
adversarial examples. The transformer models were similarly susceptible to
attacks, but while adversarial examples transferred well within a model class
(~40%), they did not across model classes (~5%). These are rough numbers from
memory, don't hold me accountable!
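
In pseudocode terms, the transfer measurement is roughly the sketch below;
note FGSM here is just a stand-in for whatever attack you'd use, and the two
models are placeholders, not the exact setup we ran:

    # Hypothetical sketch of a cross-model transfer test. model_src and
    # model_dst are placeholder classifiers (e.g. a convnet and a
    # transformer); FGSM stands in for the actual attack.
    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y, eps=0.03):
        x = x.clone().requires_grad_(True)
        F.cross_entropy(model(x), y).backward()
        return (x + eps * x.grad.sign()).detach()

    def transfer_rate(model_src, model_dst, x, y):
        x_adv = fgsm(model_src, x, y)        # craft against the source model
        with torch.no_grad():
            preds = model_dst(x_adv).argmax(dim=1)
        return (preds != y).float().mean().item()  # fraction fooling dst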

------
gauthamzz
What are some real world use cases for something like this?

~~~
visarga
We now understand better the capabilities of the transformer, which has been
the hottest thing in AI for the last 2 years. The transformer apparently can
learn images even if nobody tells it about the properties of space. CNNs, on
the other hand, rely heavily on those properties (spatial locality and
translation invariance).

------
abledon
I can imagine that in 15-25 years, consuming too much content, especially
"AI"-generated content, will have guidelines around it. The possibilities with
this stuff are just starting; it could 'overload' YouTube and other hosting
services once 'creators' begin to normalize its power. (10 years from now
there's an easy Photoshop/Adobe plugin to generate random video/image scenes
with actors and generated voices/movement, i.e. animated rigs a la Mixamo.)

------
2sk21
So how would this model be used for a classification task?

~~~
msapaydin
By extracting features for the image (that is, the encoding, as in transfer
learning in computer vision or natural language processing with, for instance,
VGG or BERT, respectively) and feeding them to the classifier.
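
A minimal sketch of that setup, which the post calls a linear probe (random
arrays stand in here for the extracted features and labels):

    # Freeze the pretrained model, treat its activations as features,
    # and fit a simple classifier on top. Random arrays are stand-ins
    # for features extracted from the frozen model.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    features = rng.normal(size=(1000, 512))   # stand-in: model activations
    labels = rng.integers(0, 10, size=1000)   # stand-in: class labels

    probe = LogisticRegression(max_iter=1000)
    probe.fit(features, labels)
    print(probe.score(features, labels))      # sanity check on train set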

