nutlope's comments

Thank you!

Should be up, please try again!

It let me upload a file, but didn't produce any output.

Hi all, I'm the author of llama-ocr. Thank you for sharing & for the kind comments! I built this earlier this week since I wanted a simple API to do OCR – it uses Llama 3.2 Vision (hosted on together.ai, where I work) to parse images into structured markdown. I also have it available as an npm package.

Planning to add a bunch of other features like the ability to parse PDFs, output a response in JSON, etc. If anyone has any questions, feel free to send them and I'll try to respond!


I put in a bill that has 3 identical line items and it didn't include them as 3 bullet points as usual, but generated a table with a "quantity" column that doesn't exist on the original paper.

Is this degree of transformation expected or desirable?

(It also means that the output is sometimes a bullet point list, sometimes a table, making further automatic processing a bit harder.)
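
One way to cope with the "sometimes a list, sometimes a table" problem downstream is to normalize both shapes into the same structure before processing. This is only a rough sketch: `extractItems` and `Extracted` are hypothetical names, not part of llama-ocr, and it assumes the output uses plain markdown bullets (`-`, `*`, `+`) or pipe tables without escaped pipes inside cells.

```typescript
// Hypothetical helper, not part of llama-ocr: turns either a markdown
// bullet list or a markdown pipe table into the same row structure.
interface Extracted {
  header: string[] | null; // table header cells, or null if no table header was seen
  items: string[][];       // one entry per bullet or table body row
}

function extractItems(markdown: string): Extracted {
  const result: Extracted = { header: null, items: [] };
  let lastWasTableRow = false;

  for (const raw of markdown.split("\n")) {
    const line = raw.trim();

    // Bullet item: "- text", "* text" or "+ text" becomes a one-cell row.
    const bullet = /^[-*+]\s+(.*)$/.exec(line);
    if (bullet) {
      result.items.push([bullet[1].trim()]);
      lastWasTableRow = false;
      continue;
    }

    // Table row: "| a | b |" becomes one cell per column.
    if (line.startsWith("|")) {
      const cells = line
        .replace(/^\|/, "")
        .replace(/\|$/, "")
        .split("|")
        .map((cell) => cell.trim());

      // A separator row such as "| --- | :-: |" marks the row before it as the header.
      if (cells.every((cell) => /^:?-+:?$/.test(cell))) {
        if (lastWasTableRow && result.header === null) {
          result.header = result.items.pop() ?? null;
        }
        lastWasTableRow = false;
        continue;
      }

      result.items.push(cells);
      lastWasTableRow = true;
      continue;
    }

    lastWasTableRow = false;
  }
  return result;
}
```

This does not undo the invented "quantity" column; it only gives later code one shape to work with, and the header (when present) shows which columns the model chose to produce.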


Here's the prompt being used, tweaking that might help: https://github.com/Nutlope/llama-ocr/blob/main/src/index.ts#...

I've had trouble with pulling scientific content out of poster PDFs, mostly because e.g. nougat falls apart with different layouts.

Have you considered that usage yet?


How accurate is this?

When compared with existing OCR systems, what sorts of mistakes does it make?


> Need an example image? Try ours.

Great idea, I wish more services had a similar feature.

Option to use a local LLM?

I made a script that does exactly the same thing, but locally, using koboldcpp for inference. It downloads MiniCPM-V 2.6 with its image projector the first time you run it. You can use a different model if you want, but you will need to edit the instruct template to match.

* https://github.com/jabberjabberjabber/LLMOCR


MiniCPM-V 2.6 is probably the best self-hosted vision model I have used so far, not just for OCR but also for image analysis. I have it set up so that my NVR (Frigate) sends a couple of images to Ollama running minicpm-v 2.6 whenever a driveway security camera raises a motion alert. I get a reasonably accurate description of the vehicle that pulled into the driveway, including the person who exits the vehicle and the license plate, all sent to my phone.
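
For anyone wanting to try something similar, the Ollama side of a setup like this can be sketched roughly as below. It is not the poster's actual configuration. It assumes Ollama is listening on its default local address, that the model was pulled under the tag `minicpm-v`, and that the snapshots are already base64-encoded; `buildDescribeRequest`, `describeSnapshots` and the prompt text are made up for illustration, and the Frigate and phone notification parts are left out.

```typescript
// Request body for Ollama's /api/generate endpoint.
interface OllamaGenerateRequest {
  model: string;
  prompt: string;
  images: string[]; // base64-encoded image data, no "data:" prefix
  stream: boolean;
}

// Hypothetical helper: builds the request for a batch of snapshots.
// The model tag "minicpm-v" is an assumption about the local install.
function buildDescribeRequest(
  imagesBase64: string[],
  model: string = "minicpm-v",
): OllamaGenerateRequest {
  return {
    model,
    prompt:
      "Describe the vehicle, any person leaving it, and the license plate if readable.",
    images: imagesBase64,
    stream: false, // ask for a single JSON reply instead of a token stream
  };
}

// Hypothetical helper: sends the snapshots to a local Ollama server and
// returns the generated description.
async function describeSnapshots(
  imagesBase64: string[],
  baseUrl: string = "http://localhost:11434",
): Promise<string> {
  const res = await fetch(`${baseUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildDescribeRequest(imagesBase64)),
  });
  if (!res.ok) {
    throw new Error(`Ollama returned HTTP ${res.status}`);
  }
  const data = (await res.json()) as { response: string };
  return data.response;
}
```

Keeping the request builder separate from the network call makes it easy to swap the prompt or the model tag without touching the rest.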

Hey! Have you tried out Edge Streaming yet? It uses the Edge Runtime which is a fraction of the cost of serverless functions and lets you stream responses for much longer than 10 seconds, giving you the "chatting" effect that you see on ChatGPT.

Docs: http://vercel.fyi/streaming
Example: https://vercel.com/blog/gpt-3-app-next-js-vercel-edge-functi...
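
To give a rough idea of what the streaming looks like in code: an edge handler returns a web-standard `Response` whose body is a `ReadableStream`, and the client sees each chunk as soon as it is enqueued. The sketch below is an illustration only, not taken from the linked docs or example. `streamWords` is a made-up helper and the fixed word list stands in for tokens arriving from a model.

```typescript
// Hypothetical helper: streams the given text chunks one by one, with a
// small delay between them to imitate tokens arriving from a model.
function streamWords(words: string[], delayMs: number = 100): Response {
  const encoder = new TextEncoder();
  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      for (const word of words) {
        controller.enqueue(encoder.encode(word)); // sent to the client right away
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
      controller.close(); // ends the response
    },
  });
  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}

// In a Next.js API route (pages/api) this would be wired up by exporting
// the runtime setting and a default handler, roughly:
//   export const config = { runtime: "edge" };
//   export default function handler() { return streamWords([...]); }
function handler(): Response {
  return streamWords(["Hello", " from", " the", " edge"]);
}
```

On the client, reading `response.body` with a reader and appending each decoded chunk to the page produces the "chatting" effect.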


I have not! Thanks for letting me know, I'll give it a try.


It's a conference registration site with a series of challenges, including a Wordle and a multiplayer experience with a prism built with Three.js.

