
In case any scientist actually working on adaptive OCR is reading this: I was given a post-WWII newspaper archive (PDF scans, 1945-2006, German language) that I would like to OCR at the highest possible quality. Compute demands are not an issue; I've got an army of A100s available.

I played with OCR post-correction algorithms and invented one method myself in 1994, but haven't worked in that space since. Initial Tesseract and GPT-4o experiments were disappointing. Any pointers (papers, software) & collab. suggestions welcome.






I tried https://github.com/PaddlePaddle/PaddleOCR for my own use case (scanline images of parcel labels) and it beat Tesseract by an order of magnitude.

(Tesseract managed to get 3 fields out of a damaged label, while PaddleOCR found 35, some of them barely readable even for a human taking time to decipher them)
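
For anyone who wants a number rather than a field count when comparing engines, one common option is character error rate (CER) against a small hand-transcribed sample. Below is a minimal sketch in plain Python. It is not tied to any of the engines mentioned in this thread, and the function names `levenshtein` and `cer` are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Feed it the ground-truth text of a page as `reference` and each engine's output as `hypothesis`; lower is better. Note that CER can exceed 1.0 when an engine hallucinates a lot of extra text.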


When I was doing OCR for some screenshots last year I managed to get it done with Tesseract, but just barely. When looking for alternatives later on I found something called Surya on GitHub, which people claim does a lot better and looks quite promising. I've had it bookmarked for testing forever but I haven't gotten around to actually doing it. Maybe worth a try I guess?

Surya is on par with cloud vision offerings.

would love to give this a shot with pulse! feel free to reach out to me at ritvik [at] trypulse [dot] ai, and i’d be very curious to give these a run! in general, i’m happy to give some general advice on algos/models to fine-tune for this task

Are you targeting business or consumers?

I cannot find the pricing page.


our current customers are both enterprises and individuals.

pricing page is here https://www.runpulse.com/pricing-studio-pulse


Not currently looking for this, but can I just say thank you for being open and direct with your prices. So useful to just be able to look.

How are you on 18th-19th century cursive, English language? Do you have a guarantee for the number of errors?


Not OP, but you might be looking for https://www.transkribus.org/

thanks! re: 18-19th century cursive, while we handle historical handwriting, we can't guarantee specific error rates. each document's accuracy varies based on condition, writing style, and preservation. happy to run test samples to check.

feel free to send over sample docs: sid [at] trypulse [dot] ai


No API pricing available?

Pls contact archive.org about adopting this digital archive once it exists (they also have a bad habit of accepting physical donations, if you are nearby)

I’m very far from an expert, but had good luck with EasyOCR when fiddling with such things.

If it's a large enough corpus I imagine it's worth fine tuning to the specific fonts/language used?

I would love to get access to that archive!


