Vision Transformers don't use CNNs to extract anything first (https://arxiv.org/abs/2010.11929). You could, but it's not necessary and it doesn't add anything, so it doesn't happen anymore.

Vision transformers don't treat the image as a sequence of pixels, but that's mostly because doing that gets very expensive very fast: self-attention cost grows quadratically with sequence length. Instead, the image is split into patches, and each patch gets a positional embedding.



