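Vision Transformers don't use a CNN to extract features first (https://arxiv.org/abs/2010.11929). You could, and the paper does describe a hybrid variant that feeds CNN feature maps into the transformer, but it isn't necessary and adds little at larger scales, so it's rarely done anymore.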

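Vision Transformers don't treat the image as a sequence of individual pixels, mostly because that gets very expensive very fast: the cost of self-attention grows quadratically with sequence length. Instead, the image is split into fixed-size patches (16x16 in the paper), each patch is flattened and linearly projected into a token, and positional embeddings are added so the model knows where each patch came from.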