Hi all, I'm the author of llama-ocr. Thank you for sharing & for the kind comments! I built this earlier this week since I wanted a simple API to do OCR. It uses Llama 3.2 Vision (hosted on together.ai, where I work) to parse images into structured markdown. I also have it available as an npm package.
Planning to add a bunch of other features like the ability to parse PDFs, output a response in JSON, etc. If anyone has any questions, feel free to send them and I'll try to respond!
I put in a bill that has 3 identical line items. Instead of listing them as 3 bullet points as usual, it generated a table with a "quantity" column that doesn't exist on the original paper.
Is this degree of transformation expected/desirable?
(It also means that the output is sometimes a bullet point list, sometimes a table, making further automatic processing a bit harder.)
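For downstream processing, one possible workaround is to normalize both shapes into the same structure before doing anything else. A minimal sketch, assuming the output is either a plain markdown bullet list or a pipe table; `parse_markdown_items` is a hypothetical helper and not part of llama-ocr:

```python
import re


def parse_markdown_items(md: str) -> list[dict[str, str]]:
    """Normalize a markdown bullet list or pipe table into a list of dicts.

    Hypothetical helper: table rows become {header: cell} dicts,
    bullet points become {"item": text} dicts.
    """
    lines = [ln.strip() for ln in md.splitlines() if ln.strip()]

    # Table case: any line starting with "|" is treated as a table row.
    table_lines = [ln for ln in lines if ln.startswith("|")]
    if table_lines:
        rows = [[c.strip() for c in ln.strip("|").split("|")] for ln in table_lines]
        # Drop the separator row, e.g. |---|:---:|
        rows = [r for r in rows if not all(re.fullmatch(r":?-+:?", c) for c in r)]
        if not rows:
            return []
        header, body = rows[0], rows[1:]
        return [dict(zip(header, r)) for r in body]

    # Bullet case: lines starting with "-", "*" or "+".
    items = []
    for ln in lines:
        m = re.match(r"^[-*+]\s+(.*)$", ln)
        if m:
            items.append({"item": m.group(1)})
    return items
```

This only papers over the layout difference. It does not undo a merged "quantity" column, so three identical line items still arrive as one row in the table case.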
I made a script which does exactly the same thing but locally using koboldcpp for inference. It downloads MiniCPM-V 2.6 with image projector the first time you run it. If you want to use a different model you can, but you will want to edit the instruct template to match.
MiniCPM-V 2.6 is probably the best self-hosted vision model I have used so far. Not just for OCR, but also image analysis. I have it set up so that my NVR (Frigate) sends a couple of images, upon a motion alert from a driveway security camera, to Ollama with minicpm-v 2.6. I’m able to get a reasonably accurate description of the vehicle that pulled into the driveway, including a description of the person that exits the vehicle and also the license plate. All sent to my phone.
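For anyone wanting to try something similar, the Ollama half can be sketched roughly like this. It assumes Ollama is running on its default local port with a vision model pulled under the tag `minicpm-v`; the function names and the prompt are made up for illustration, and the Frigate side (getting the snapshot bytes) is not shown:

```python
import base64
import json
import urllib.request

# Assumption: Ollama's default local endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(images: list[bytes], prompt: str, model: str = "minicpm-v") -> dict:
    """Build the JSON body for Ollama's /api/generate with base64-encoded images."""
    return {
        "model": model,  # assumed model tag; use whatever you pulled
        "prompt": prompt,
        "images": [base64.b64encode(img).decode("ascii") for img in images],
        "stream": False,  # ask for a single JSON response instead of a stream
    }


def describe_images(images: list[bytes], prompt: str, url: str = OLLAMA_URL) -> str:
    """POST the images to Ollama and return the generated description."""
    body = json.dumps(build_payload(images, prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `describe_images([snapshot_bytes], "Describe the vehicle and any visible license plate.")` would then return the text to forward as a notification.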
Hey! Have you tried out Edge Streaming yet? It uses the Edge Runtime, which is a fraction of the cost of serverless functions and lets you stream responses for much longer than 10 seconds, giving you the "chatting" effect that you see on ChatGPT.
It's a conference registration site that features a series of challenges, including a Wordle and a multiplayer experience with a prism built with Three.js.