Llama 3.2 vision model and noisy images
1 point by epop 5 days ago
I'm building a robust OCR pipeline and have tested several technologies (docTR, PaddleOCR, and the LLM-based GOT). I found that Llama 3.2 gives the most accurate text extraction, especially on high-density text and irregular layouts.
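For context, the extraction step looks something like this (a rough sketch assuming an Ollama-served llama3.2-vision model; the prompt wording and file path are placeholders):

    import ollama  # assumes the Ollama Python client with llama3.2-vision pulled locally

    # Ask the vision model to transcribe the page verbatim.
    response = ollama.chat(
        model='llama3.2-vision',
        messages=[{
            'role': 'user',
            'content': 'Transcribe all text in this image exactly as it appears, preserving line breaks.',
            'images': ['page.png'],  # placeholder path to the scanned page
        }],
    )
    print(response['message']['content'])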

However, the model struggles with noisy images (those with stamps, handwritten annotations, and other artifacts) and simply fails to produce any output. I attempted to fine-tune it on 40k images augmented with noise (quantization noise, salt-and-pepper noise, skewing, handwritten text overlays, and multiple fonts); a sketch of the augmentations follows. Unfortunately, this reduced its accuracy on clean, well-formatted images, and it still doesn't handle noisy images effectively.
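The salt-and-pepper and skew augmentations were along these lines (a minimal NumPy/Pillow sketch; the function names, parameter values, and file names are illustrative, not my exact training code):

    import numpy as np
    from PIL import Image

    def add_salt_and_pepper(img: Image.Image, amount: float = 0.02) -> Image.Image:
        # Flip a random fraction of pixels to pure black or pure white.
        arr = np.array(img.convert('L'))
        mask = np.random.rand(*arr.shape)
        arr[mask < amount / 2] = 0
        arr[mask > 1 - amount / 2] = 255
        return Image.fromarray(arr)

    def skew(img: Image.Image, degrees: float = 3.0) -> Image.Image:
        # Small rotation with a white fill, approximating a crooked scan.
        return img.rotate(degrees, expand=True, fillcolor=255)

    noisy = skew(add_salt_and_pepper(Image.open('clean_page.png')))
    noisy.save('augmented_page.png')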

What might I be missing here?





