Vision Transformers don't use a CNN to extract features first (https://arxiv.org/abs/2010.11929). You could, but it's not necessary and it adds little, so it's rarely done anymore.
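Vision Transformers don't treat the image as a sequence of pixels either, but that's mostly because self-attention cost grows quadratically with sequence length, so doing that gets very expensive very fast. Instead, the image is split into fixed-size patches, each patch is flattened and linearly projected into a token, and positional embeddings are added to the tokens.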
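To make the patch step concrete, here is a rough NumPy sketch of turning an image into patch tokens. It is only an illustration, not the paper's code: the `patchify` helper, the 224x224 input, the 16x16 patch size, and the 768-dim embedding are assumptions picked for the example, and the random matrices stand in for weights that would normally be learned. The class token is left out for brevity.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration; returns (num_patches, patch_size*patch_size*C).
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must divide evenly into patches"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh*gw, p*p*c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))  # assumed input size
patch_size, dim = 16, 768                   # assumed patch size and embedding width

patches = patchify(image, patch_size)       # 14 * 14 = 196 patches of 16*16*3 = 768 values

# Random stand-ins for the learned linear projection and learned positional embeddings.
w_proj = rng.standard_normal((patches.shape[1], dim)) * 0.02
pos_emb = rng.standard_normal((patches.shape[0], dim)) * 0.02

# One token per patch: project the flattened patch, then add its positional embedding.
tokens = patches @ w_proj + pos_emb

print(tokens.shape)  # (196, 768)
```

With these numbers the transformer sees 196 tokens instead of 224 * 224 = 50176 pixel tokens. Since attention compares every token with every other token, that is about 38 thousand pairs instead of about 2.5 billion per attention map.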