Hacker News

LLaVA 1.6, InternVL, and CogVLM2 can all do OCR with nothing but tiled image embeddings and an LLM. Feeding in OCR results from Tesseract improves the reliability of the transcript, especially for long strings of random characters, but it's not strictly necessary for the model to read the text out of the image.
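The Tesseract-hint idea above boils down to prompt assembly: run a conventional OCR engine first (e.g. `pytesseract.image_to_string`), then hand its noisy output to the VLM alongside the image. A minimal sketch, where the function name and prompt template are illustrative rather than any particular model's chat format:

```python
def build_ocr_hint_prompt(ocr_text, question="Transcribe all text in the image."):
    """Combine a noisy OCR transcript with the transcription request.

    The resulting string would be sent to the VLM together with the
    image; the template here is an assumption for illustration, not a
    specific model's required format.
    """
    hint = ocr_text.strip()
    return (
        f"{question}\n"
        "A separate OCR engine produced this possibly noisy transcript; "
        "use it to disambiguate hard-to-read characters, but trust the image:\n"
        f"{hint}"
    )
```

The point is that the hint constrains the LLM's decoding on exactly the strings it otherwise hallucinates, such as long runs of random characters, while the image remains the ground truth.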

CLIP embeddings can absolutely "read" text if the text is large enough. Tiling enables the model to read small text.




They can do it. They cannot do it particularly well compared to SoTA OCR systems.


Do you know of any guides or tutorials for doing this? I tried using the MiniCPM model for this task, but it just OCRed a tiny bit of information and then told me it couldn't extract the rest.


I bet you could get this working in https://github.com/comfyanonymous/ComfyUI

I have done some other LLaVA stuff in it.


I thought ComfyUI was mainly for SD. I should get into the game again.


You can build just about anything with it


Thanks, I've been trying to remember the name of this project for weeks now.


How well does this work on complex data tables?


I found LLaVA to be disappointing, but Claude Haiku is quite good.



