LLaVA 1.6, InternVL, and CogVLM2 can all do OCR with nothing but tiled image embeddings and an LLM. Feeding in OCR results from Tesseract improves the reliability of the transcript, especially for long strings of random characters, but it isn't strictly necessary for the model to read the text out of the image.
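If it helps to see the idea in code, here is a minimal sketch of that workflow: run Tesseract via pytesseract and hand the rough transcript to the model alongside the image. The checkpoint name, prompt template, and file name are my assumptions (a LLaVA 1.6 build on Hugging Face), not anything from these projects' docs, so swap them for whatever you actually run.

```python
# Sketch: combine a classical OCR pass with a VLM that reads the image itself.
# Assumes the llava-hf/llava-v1.6-mistral-7b-hf checkpoint and its [INST] prompt format.
import pytesseract
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

image = Image.open("scan.png")

# Rough transcript from Tesseract; often noisy on long random strings.
ocr_text = pytesseract.image_to_string(image)

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", device_map="auto"
)

# Give the model both the image and the noisy OCR output to reconcile.
prompt = (
    "[INST] <image>\nTranscribe all text in this image. "
    f"A rough OCR pass produced:\n{ocr_text}\n"
    "Correct any OCR errors using what you see in the image. [/INST]"
)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```

Dropping the OCR hint from the prompt gives you the "image embeddings only" behavior described above; adding it mostly helps on strings the model can't guess from context.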
CLIP embeddings can absolutely “read” text if the text is large enough. Tiling enables the model to read small text.
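To make the tiling point concrete, here is a rough illustration of tile preparation (my own sketch, not any model's internal code): the page is cut into overlapping crops so small text occupies a larger fraction of each CLIP-style input. The tile size and overlap below are arbitrary values, and real models pick their grids differently.

```python
# Sketch: split a page into overlapping tiles so small text becomes "large"
# relative to each crop fed to the vision encoder.
from PIL import Image

def tile_image(path, tile=672, overlap=64):
    img = Image.open(path)
    w, h = img.size
    step = tile - overlap
    tiles = []
    for top in range(0, max(h - overlap, 1), step):
        for left in range(0, max(w - overlap, 1), step):
            box = (left, top, min(left + tile, w), min(top + tile, h))
            tiles.append(img.crop(box))
    return tiles

# Each tile (plus a downscaled overview image) would then be embedded
# separately and the embeddings concatenated before reaching the LLM.
tiles = tile_image("page.png")
print(f"{len(tiles)} tiles")
```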
Do you know of any guides or tutorials for doing this? I tried using the MiniCPM model for this task, but it only OCRed a tiny bit of information and then told me it couldn't extract the rest.