
Currently R2R has out-of-the-box ingestion logic for the following:

csv, docx, html, json, md, pdf, pptx, txt, xlsx, gif, jpg, png, svg, mp3, mp4.
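Extension-based dispatch like this is often just a lookup table. Here is a minimal, hypothetical sketch (the handler names are illustrative, not R2R's actual API) mapping the supported extensions to an ingestion strategy:

```python
# Hypothetical dispatch table: file extension -> ingestion strategy.
# The strategy names are placeholders, not R2R's real handlers.
INGESTION_STRATEGY = {
    **dict.fromkeys(
        ["csv", "docx", "html", "json", "md", "pdf", "pptx", "txt", "xlsx"],
        "parse_text",
    ),
    **dict.fromkeys(["gif", "jpg", "png", "svg"], "describe_image"),
    "mp3": "transcribe_audio",
    "mp4": "transcribe_and_describe_video",
}

def strategy_for(filename: str) -> str:
    """Pick an ingestion strategy from the file extension (case-insensitive)."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return INGESTION_STRATEGY.get(ext, "unsupported")
```

For example, `strategy_for("report.PDF")` resolves to `"parse_text"`, while an unknown extension falls through to `"unsupported"`.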

There are a lot of good questions around ingestion today, so we will likely figure out how to intelligently expand this.

For mp3s we use Whisper to transcribe. For videos we transcribe the audio with Whisper and sample frames to "describe" with a multimodal model. For images we similarly generate a thorough text description - https://r2r-docs.sciphi.ai/cookbooks/multimodal
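The frame-sampling step can be sketched as pure index arithmetic. This is a minimal, hypothetical helper (not R2R's actual implementation) that picks evenly spaced frame indices to hand to a multimodal model:

```python
def sample_frame_indices(total_frames: int, fps: float, every_sec: float = 2.0) -> list[int]:
    """Return frame indices sampled roughly every `every_sec` seconds.

    With e.g. OpenCV, each index could then be read via
    cap.set(cv2.CAP_PROP_POS_FRAMES, i); cap.read() and sent to the
    description model. The 2-second default is an assumption, not
    R2R's documented sampling rate.
    """
    if total_frames <= 0 or fps <= 0:
        return []
    step = max(1, round(fps * every_sec))
    return list(range(0, total_frames, step))
```

For a 4-second clip at 25 fps (100 frames), this yields indices [0, 50], i.e. one frame every two seconds.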

We have been testing multi-modal embedding models and open-source models for the description generation. If anyone has suggestions on SOTA techniques that work well at scale, we would love to chat and work to implement them. Long run, we'd like the system to be able to handle multi-modal data locally.



