I'm interested in a model that can take a video as input and output a caption describing what is happening in it. I've looked on Hugging Face and elsewhere, but can only find X-CLIP from Microsoft, and that only does video classification. It doesn't generate its own caption.