Surprised (as a non-ML-practitioner) that a uniform tiling is used to ‘tokenize’ the image in this approach. After all, language transformers don’t operate by chunking text into arbitrary clusters of a fixed number of characters - they tokenize on semantically meaningful components extracted from the text (i.e. words, and sometimes word components).
Wouldn’t it make sense to have a process that selects regions of the image to focus in on, perhaps itself guided by some sort of attention model? So for a face it might produce tiles covering the whole face, the eyes, the mouth, whatever other details it finds helpful… all obviously accompanied by a vector representing the tile’s spatial relationship to the whole image.
Or is the assumption that this sort of thing is effectively done within the model’s layers as it combines the input from multiple tiles?
> language transformers don’t operate by chunking text into arbitrary clusters of a fixed number of characters - they tokenize on semantically meaningful components extracted from the text (i.e. words, and sometimes word components)
Don’t they? Most use BPE, which yields arbitrary but common sequences of characters that are often just syllables. Additionally, many use SentencePiece, which randomizes the subdivisions and sometimes produces single characters instead.
(That said, what you suggest would be an interesting research avenue.)
In this paper (https://arxiv.org/abs/2206.12693) I show that deliberately choosing the text snippets so that your choices minimize overall entropy also minimizes result entropy, which increases the accuracy of your speech recognition system (for German speech).
In my case, the entropy-optimized tokenization also matched conserved features of the German language.
BPE isn’t arbitrary, though, is it? It’s based on essentially a pre-trained Markov model of how likely each byte is to be ‘part’ of the preceding token - i.e. it uses an existing semantic model of ‘what normal text looks like’ to choose potentially useful tokens (even for nonsensical text, which is why it works on stuff like code).
And it does that typically after word splitting has already happened (although yes, looking at SentencePiece, maybe not always… interesting.)
Was thinking that a similarly low-level approach to tile extraction would be to grab tiles that represent areas of high detail - but then maybe that just amounts to applying some sort of compression to an image before processing…
We tried doing just that, and the short answer is that we couldn't fit the gradients for arbitrarily-placed tiles into a DGX with 320 GB of GPU RAM. For 32x32 pixel tiles, free location placement will 1024x your memory usage (one candidate placement per pixel offset, and 32 × 32 = 1024).
Byte pair encoding (used in GPT-2 models) is not semantic; it's purely based on statistics of how often character groups occur. It's not supposed to be semantic, it's just supposed to enable complete coverage of the whole text. Similarly, the image patches have to cover the whole image, and then the transformer can learn to extract the important parts.
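To make the "purely statistical" point concrete, here's a toy sketch of the BPE training loop (illustrative only; real tokenizers like GPT-2's work on bytes and weight pairs by word frequency): count adjacent symbol pairs, merge the most frequent one, repeat.

    from collections import Counter

    def bpe_train(words, num_merges):
        # Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
        corpus = [list(w) for w in words]           # start from single characters
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for symbols in corpus:
                for pair in zip(symbols, symbols[1:]):
                    pairs[pair] += 1
            if not pairs:
                break
            best = max(pairs, key=pairs.get)        # most frequent pair wins
            merges.append(best)
            new_corpus = []
            for symbols in corpus:
                merged, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        merged.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                new_corpus.append(merged)
            corpus = new_corpus
        return merges, corpus

    merges, tokens = bpe_train(["lower", "lowest", "newer", "wider"], num_merges=5)
    print(merges)   # frequent pairs such as ('w', 'e') get merged first
    print(tokens)   # no notion of meaning, only co-occurrence statistics

Nothing in there knows what a word means; "tokens" are just whatever character sequences happen to co-occur a lot.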
Image patch extraction could be statistical rather than semantic. Like if the top part of the image is just all blue sky with no high-frequency detail, you don’t necessarily need to feed the model dozens of blue tiles when you could just pass one or two big blue tiles.
If there’s a small plane in part of the sky, an overlapping much smaller tile with that object in it could be passed in as well.
Really just a matter of choosing tile sizes based on how fine the detail is in that part of the image - which is sort of similar to my mental image for how BPE decides whether to cluster characters into one token or not.
But as I say, not a practitioner, and I kind of get the impression that ‘the transformer attention model just solves for all that’.
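For what it's worth, a crude version of that idea is easy to sketch (nothing from the article, just an illustration): score each tile of a fixed grid by its pixel variance as a cheap stand-in for "detail", and keep only the busiest tiles together with their grid positions.

    import numpy as np

    def select_detailed_tiles(image, tile=32, keep=16):
        # Score each tile by the variance of its pixels (a cheap "detail" proxy)
        # and keep the `keep` highest-scoring tiles with their grid position.
        h, w = image.shape[:2]
        scored = []
        for r in range(0, h - tile + 1, tile):
            for c in range(0, w - tile + 1, tile):
                patch = image[r:r + tile, c:c + tile]
                scored.append((patch.var(), (r // tile, c // tile), patch))
        scored.sort(key=lambda t: t[0], reverse=True)
        return [(pos, patch) for _, pos, patch in scored[:keep]]

    img = np.random.rand(256, 256)                  # stand-in for a real image
    tiles = select_detailed_tiles(img, tile=32, keep=16)
    print(len(tiles), tiles[0][0])                  # 16 tiles, each with its grid position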
Then you introduce a resizing component, which resizes different parts of the image differently. And you still have to decide how to select the tile sizes: if it's heuristics, well, the field of ML has been moving away from those. If it's another learnable component, maybe it's worth a shot, at the cost of a more complicated, probably two-stage model, so inference time is worse (and maybe training time too).
Reminds me of using quadtrees for image compression. The lower detail parts would be covered by a large single color-value block while more detailed parts would use smaller and smaller blocks.
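Something like this minimal quadtree split (a hedged sketch, assuming a square greyscale numpy image whose side is a power of two): keep splitting while a block's variance is above a threshold, otherwise emit one flat block with its mean value.

    import numpy as np

    def quadtree(image, x=0, y=0, size=None, threshold=0.01, min_size=8):
        # Recursively split a square block into four quadrants until it is
        # "flat enough" (low variance) or hits the minimum block size.
        if size is None:
            size = image.shape[0]                   # assumes a square 2^n image
        block = image[y:y + size, x:x + size]
        if size <= min_size or block.var() < threshold:
            return [(x, y, size, float(block.mean()))]   # one flat block
        half = size // 2
        leaves = []
        for dy in (0, half):
            for dx in (0, half):
                leaves += quadtree(image, x + dx, y + dy, half, threshold, min_size)
        return leaves

    img = np.zeros((256, 256))
    img[96:160, 96:160] = np.random.rand(64, 64)    # detail only in the centre
    blocks = quadtree(img)
    print(len(blocks))   # flat regions stay as big blocks, the detailed centre gets subdivided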
It seems to work okay without this, but an approach with less than totally free positioning would be to send in one more set of overlapping tiles shifted by half a tile size. Or better yet, image pyramids, which would help you handle different image sizes.
That’s significantly more data either way though, which is probably the issue.
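The half-shift idea is cheap to sketch (illustrative only, made-up sizes): extract the usual grid plus a second grid offset by half a tile, and optionally repeat on downsampled pyramid levels.

    import numpy as np

    def tiles(image, size=32, offset=(0, 0)):
        # Cut non-overlapping size x size tiles, starting from a pixel offset.
        oy, ox = offset
        h, w = image.shape[:2]
        return [image[r:r + size, c:c + size]
                for r in range(oy, h - size + 1, size)
                for c in range(ox, w - size + 1, size)]

    def pyramid(image, levels=3):
        # Crude image pyramid: halve the resolution by 2x2 averaging at each level.
        out = [image]
        for _ in range(levels - 1):
            im = out[-1]
            h, w = im.shape[0] // 2 * 2, im.shape[1] // 2 * 2
            out.append(im[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
        return out

    img = np.random.rand(256, 256)
    base = tiles(img)                                # the usual grid
    shifted = tiles(img, offset=(16, 16))            # the extra half-shifted grid
    print(len(base), len(shifted))                   # 64 + 49 tiles: roughly 2x the data, not 1024x
    levels = pyramid(img)
    print([l.shape for l in levels])                 # (256, 256), (128, 128), (64, 64)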
There is also Voxel Transformer [0] for 3D object detection. It seems to score high, but at the cost of longer inference time. I haven't heard whether such models are being used in real-world applications yet.
I'm currently reading Deep Learning for Coders (by the people behind fastai), and one point they make early in the book is that the ML research community loves jargon, and boy were they right.
This is one of the things I enjoy most about learning things in different domains. Most of the time it looks so foreign because of... language. But often the concepts and ideas are not as intimidating or complex as they seem once you get past the language barrier and are able to distill things down to a simpler understanding.
Maybe it's because I don't understand machine learning well enough, but this part:
> We enable order with positional embeddings. For ViT, these positional embeddings are learned vectors with the same dimensionality as our patch embeddings.
> After creating the patch embeddings and prepending the “class” embedding, we sum them all with positional embeddings.
> These positional embeddings are learned during pretraining and (sometimes) during fine-tuning. During training, these embeddings converge into vector spaces where they show high similarity to their neighboring position embeddings — particularly those sharing the same column and row
Made no sense to me.
Like, okay, the positional embedding represents the position of a "patch" within the input image? What else? How are they used, what are they compared against, how are they fed into the neural network?
So a transformer is just a "graph" network. You have a bunch of nodes in the graph, and depending on the kind of transformer, the nodes form different topologies. Now, here is the catch: with graph neural networks, we may reorder the nodes, and as long as the edges are kept intact, we are fine.
This is because graph networks are permutation invariant, but vision tasks are not. Nodes do not necessarily know their ordering with respect to the input. We use positional embeddings to inform the nodes where they are with respect to the input, because it matters.
To see how, or why, take any image, flip it upside down, cut it into patches, and then flip the patches.
The patches know what their content is, but not where. The edges in the graph merely point to the neighbours for further computation.
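You can check this with a few lines of numpy (not from the article, just a demonstration): plain self-attention with no positional signal is permutation-equivariant, so shuffling the patches merely shuffles the output rows.

    import numpy as np

    rng = np.random.default_rng(0)

    def self_attention(X, Wq, Wk, Wv):
        # Plain single-head self-attention with no positional information.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    n, d = 6, 8                                     # 6 "patches", 8-dim embeddings
    X = rng.normal(size=(n, d))
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    perm = rng.permutation(n)
    out = self_attention(X, Wq, Wk, Wv)
    out_perm = self_attention(X[perm], Wq, Wk, Wv)
    print(np.allclose(out_perm, out[perm]))         # True: shuffling patches just shuffles the output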
Positional embeddings are just vectors that you index or compute at runtime. These vectors then are concatenated or added onto the vectors of the patches or tokens.
We use positional embeddings for all sorts of tasks, such as text (position of token), vision tasks (position of patch), or even sound tasks (e.g. point in time).
They act as a "position signal" that modifies the patch embedding. The learned signals are similar to other neighbouring position signals, and the later layers of the model will use the "similarity" between signals to identify the proximity/order of different patches.
There is no explicit mechanism that tells the network to make neighboring position embeddings similar, it's just a result of the training that fortunately works and seems logical.
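In code, the mechanics the article describes are almost embarrassingly simple. A sketch with ViT-Base-like shapes (the numbers are just for illustration; in the real model W_proj, cls_token and pos_embed are all learned parameters updated by backprop like any other weight):

    import numpy as np

    rng = np.random.default_rng(0)

    num_patches, dim = 196, 768                     # 14 x 14 patches of a 224 x 224 image
    patch_dim = 16 * 16 * 3                         # one flattened 16 x 16 RGB patch

    W_proj = rng.normal(size=(patch_dim, dim)) * 0.02          # learned linear projection
    cls_token = rng.normal(size=(1, dim)) * 0.02               # learned "class" embedding
    pos_embed = rng.normal(size=(num_patches + 1, dim)) * 0.02 # learned, one per position (+1 for cls)

    patches = rng.normal(size=(num_patches, patch_dim))        # stand-in for flattened image patches

    x = patches @ W_proj                            # patch embeddings
    x = np.concatenate([cls_token, x], axis=0)      # prepend the class embedding
    x = x + pos_embed                               # add positional information, element-wise
    print(x.shape)                                  # (197, 768): what the transformer layers actually see

That element-wise sum is all the "feeding into the network" there is; the transformer layers only ever see x.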
Like, say the embedding of a patch was just a vector (a, b, c, d). To have the attention layer "understand" position, you could just concatenate the patch's position to get (a, b, c, d, x, y). I understand that.
What I don't understand is:
- Why on earth are cosines involved?
- What does that mean:
> During training, these embeddings converge into vector spaces where they show high similarity to their neighboring position embeddings — particularly those sharing the same column and row
Does it just mean that the network learns that "x = 1" is similar to "x = 2"? Because presumably, that's not very valuable by itself, is it? It's something you could easily hardcode.
Presumably there's a step where the network goes "if there is pattern A and pattern B and both patterns have a short distance between them in the positional embedding then we have C", where A, B and C are neural network magic... but where your article explains that step, I don't understand the explanation.
I think something that can help this situation is to not use "big" machine learning words such as "patch embedding".
This is one of the issues with a lot of machine learning articles out there (not to nitpick on you, sorry); there are almost always easy and illustrative examples that you can use to break this down into a simpler explanation.
Hi! Are these positional embeddings literally made by concatenating the patch embedding with a number, then passing that through the next layer, as suggested by the figure under "Images to Patch Embeddings"?
It's the most confusing part of transformers for me. How do we train the module that creates these embeddings?
Educational videos should have a section saying: "This video was tested on the following audiences: ..." And "they had many questions, which I incorporated back into the talk".
Without this, you might as well be watching "Introduction to X, for experts in X". Or "Introduction to X, by someone who is trying to figure out X".
In NLP, the simplest positional embeddings are just a sequence [0, 1, 2, 3, 4, ..., n], with n being the length of a sentence. Once you start processing text, you split it into individual embeddings for each word, getting a list of embeddings [E_1, E_2, E_3, ..., E_n]. Then you just sum the embedding vectors and the positional embedding vectors element-wise along the sequence axis, so each token embedding gets shifted by its positional embedding. This somehow works for encoding position: attention is dumb and doesn't keep positions in mind, but you have just added a positional signal that attention will recognize. There are many ways to model positional embeddings, e.g. by generating sine/cosine curves, etc. You can extend this approach to ViTs.
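And since the cosine question came up above: the sine/cosine variant (the fixed encoding from the original "Attention Is All You Need" transformer) just samples sines and cosines of the position at a range of frequencies, so nearby positions end up with similar vectors. A small sketch:

    import numpy as np

    def sinusoidal_positions(n_positions, dim):
        # Classic fixed encoding: each dimension pair is a sine/cosine of the
        # position at a different frequency, so nearby positions get similar vectors.
        pos = np.arange(n_positions)[:, None]                   # (n, 1)
        i = np.arange(0, dim, 2)[None, :]                       # (1, dim/2)
        angles = pos / (10000 ** (i / dim))
        enc = np.zeros((n_positions, dim))
        enc[:, 0::2] = np.sin(angles)
        enc[:, 1::2] = np.cos(angles)
        return enc

    pe = sinusoidal_positions(16, 64)
    norms = np.linalg.norm(pe, axis=1, keepdims=True)
    sims = pe @ pe.T / (norms * norms.T)
    print(np.round(sims[0, :4], 2))                  # similarity falls off smoothly with distance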
I think the implication here is that it is learned during pre-training. You'll usually do some sort of pre-training (think running a model through ImageNet) that boils the image down into the vector embeddings, and then you have the model reconstruct the original image using only the vectors. This is basically a reduced-order modeling approach. After you have an encoder that is able to turn images into vector embeddings with enough information to reconstruct the original image, you can take that pre-trained network and apply it to another task with similar images: e.g. you pre-trained on ImageNet reconstruction, and then you adjust the task to detect whether there is a dog riding a bicycle in the image through regular supervised training. This approach has proven to be better than training a network from scratch (or using a generic pre-trained network) to detect bicycle dogs.
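A deliberately tiny sketch of that two-stage recipe (toy shapes, an MLP instead of a real ViT, and random tensors standing in for images and labels):

    import torch
    from torch import nn

    # Toy stand-ins: an encoder that maps flattened images to an embedding,
    # a decoder that reconstructs the image, and a classifier head for the new task.
    encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
    decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

    # Stage 1: "pre-training" by reconstruction (unlabelled images are enough).
    images = torch.rand(64, 784)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    for _ in range(100):
        opt.zero_grad()
        loss = nn.functional.mse_loss(decoder(encoder(images)), images)
        loss.backward()
        opt.step()

    # Stage 2: fine-tuning on the actual task ("is there a dog on a bicycle?")
    # reuses the pre-trained encoder and only bolts a new head on top.
    head = nn.Linear(32, 2)
    labels = torch.randint(0, 2, (64,))
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
    for _ in range(100):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(encoder(images)), labels)
        loss.backward()
        opt.step()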