[0] https://arxiv.org/abs/2408.08459
Interestingly, they managed to train and run inference directly on the JPEG bitstream. I thought they'd at least need to build embeddings for the bitstream features or something.