I'm wondering if it would make sense to use an H.264/H.265/H.266/AV1 encoder as the tokenizer, and then learn a set of embeddings that correspond to the symbols in the resulting bitstream. The tokenization they're doing is morally equivalent to what video codecs already do.
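Something like this sketch of the idea (assuming ffmpeg and PyTorch; the file name, the embedding width, and the choice of raw bytes as the symbol alphabet are all my own placeholders, not anything from the paper):

```python
# Sketch: treat a codec's output bitstream as the token stream, and learn
# an embedding table over its symbols instead of tokenizing pixels.
import subprocess
import torch
import torch.nn as nn

# Re-encode a clip ("clip.mp4" is hypothetical) to a raw Annex B H.264
# bitstream on stdout; -an drops audio so only video bits come through.
raw = subprocess.run(
    ["ffmpeg", "-v", "error", "-i", "clip.mp4",
     "-an", "-c:v", "libx264", "-f", "h264", "-"],
    capture_output=True, check=True,
).stdout

# Crudest possible symbol alphabet: the 256 byte values of the bitstream.
tokens = torch.tensor(list(raw), dtype=torch.long)
embed = nn.Embedding(num_embeddings=256, embedding_dim=512)
x = embed(tokens)  # (bitstream_len, 512), ready to feed a sequence model
```

In practice you'd probably want to tokenize at the level of NAL units or entropy-coded syntax elements rather than raw bytes, but bytes make the point.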
Interestingly, they managed to train and run inference on the JPEG bitstream directly. I thought they'd at least need to build embeddings for those bitstream features or something.
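A rough sketch of what "directly" seems to mean (Pillow and PyTorch assumed; the file name and dimensions are made up, and this is my reading, not the paper's code): the compressed bytes themselves are the token IDs, and whatever structure the bitstream has is left for the learned embedding table and the model to pick up.

```python
# Sketch: no hand-built bitstream features; the JPEG bytes ARE the tokens.
import io
import torch
import torch.nn as nn
from PIL import Image

# Compress an image ("photo.png" is hypothetical) to JPEG in memory.
buf = io.BytesIO()
Image.open("photo.png").convert("RGB").save(buf, format="JPEG", quality=75)
jpeg_bytes = buf.getvalue()  # headers, Huffman tables, entropy-coded data, all of it

# One learned vector per byte value; nothing codec-specific is engineered in.
tokens = torch.tensor(list(jpeg_bytes), dtype=torch.long)
embed = nn.Embedding(256, 512)
x = embed(tokens)  # (num_bytes, 512), same interface as any token sequence
```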