
Generative Modeling with Sparse Transformers - stablemap
https://openai.com/blog/sparse-transformer/
======
yorwba
Using two attention layers with √N inputs to cover a context of size N = √N ×
√N is somewhat intuitively understandable for image data, since the
decomposition corresponds to rows and columns.

But it's quite surprising that this also works for text data, especially that
the fixed pattern performs better than the strided one, despite there not
being anything analogous to image boundaries in the data.

It'd also be interesting to see what happens for other decompositions, such as
3 layers of ∛N or a logarithmic stack of dilated convolutions.
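The two patterns being compared can be made concrete with a small mask-building sketch. This is a minimal NumPy illustration under my own reading of the blog post (the helper names are hypothetical, and this is not OpenAI's implementation): each pattern uses two causal masks of roughly √N attended positions per row, and composing the two heads lets every position reach every earlier position in two steps.

```python
import numpy as np

def strided_masks(n, stride):
    # "Strided" pattern: head A attends to the previous `stride`
    # positions; head B attends to every `stride`-th earlier position.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = causal & (i - j < stride)
    strided = causal & ((i - j) % stride == 0)
    return local, strided

def fixed_masks(n, stride):
    # "Fixed" pattern: head A attends within the current block of size
    # `stride`; head B attends to the last column of each earlier block.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    block = causal & (i // stride == j // stride)
    summary = causal & (j % stride == stride - 1)
    return block, summary

def reachable(mask_a, mask_b):
    # Two stacked attention layers: i can see j directly via either
    # head, or via some k with i -> k (head B) and k -> j (head A).
    direct = mask_a | mask_b
    two_step = (mask_b.astype(int) @ mask_a.astype(int)) > 0
    return direct | two_step

n, stride = 16, 4  # context N = 16 covered with stride sqrt(N) = 4
for masks in (strided_masks(n, stride), fixed_masks(n, stride)):
    cover = reachable(*masks)
    # every earlier position is reachable within two attention steps
    assert np.all(cover[np.tril_indices(n)])
```

Printing the masks at small `n` makes the row/column intuition visible: the strided head B picks out one column per stride offset, while the fixed head B funnels everything through the same summary columns, matching the summarization argument in the comment below.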

~~~
AdamDKing
It seems that using "fixed attention" for text would encourage the network to
periodically summarize the context so far and put it in that fixed column for
the rows below to access.

Maybe the reason "strided attention" didn't work as well is that it would
require the network to put this context summary in _every_ column lest the
rows below be unable to access it. That would waste features since the summary
wouldn't vary much over time but would still be stored in full at each step.

If this is true, the approach they used for images might actually be
inefficient in a similar way.

------
joe_the_user
So "Transformers" are part of the family of attention-based systems, an
approach to modeling input-output relationships that is an alternative to
Recurrent Neural Networks. These are instead based on Convolutional Neural
Networks.

The innovation here is that the transformer's attention is made sparse,
allowing the system to deal with longer sequences.

~~~
skdotdan
Are Transformers based on convolutions?

~~~
joe_the_user
"Convolutional neural networks", whose connection to convolution in the usual
mathematical sense is a bit tenuous.

[https://en.wikipedia.org/wiki/Convolutional_neural_network](https://en.wikipedia.org/wiki/Convolutional_neural_network)

------
skdotdan
That's really impressive!

However, I'm a bit disappointed with the code release. I was expecting the
full source code and setup.

~~~
sgillen
It seems OpenAI is getting less and less open. I would like to see the source
too, although I think maybe we've been a bit spoiled in the past, expecting
them to share all their source with us.

They have a lot of incentives not to. Keeping the code under wraps allows them
to maintain an edge over other researchers and companies in the space, which
helps them secure more funding, publish more papers, etc.

It's also not really fair to them if some other research group uses code from
OpenAI to achieve results and then doesn't share their own code or
modifications.

------
tezka
What is the NLL for 32x32 ImageNet? That's a common benchmark, and it's
strange that it's missing from this paper. Also, will you release CIFAR-10
samples? Curious what they look like at 2.80.

