
The quick and dirty: OCR solutions exist, but to work well they generally need a little hand-holding. You have to give your OCR software a clean image if you want clean results (this goes for tesseract, ocropus, etc.). The problem is that scans are rarely so clean: they are crooked, there is a hand in the frame, half of another page is in the shot, and so on, and common OCR software doesn't correct for this well out of the box.

doc2text bridges the gap between the initial scan and the scan you should actually pass to your OCR engine, which greatly improves OCR accuracy. It takes that dirty scan, identifies the text region, fixes skew, performs a few pre-processing operations that help with common OCR binarization, and BOOM: data that was inaccessible is now accessible.
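If you're curious what that kind of clean-up looks like, here is a rough sketch of a deskew-then-binarize step rolled by hand with OpenCV. This is purely illustrative (function name, angle handling, and details are mine, not doc2text's actual code):

  import cv2
  import numpy as np

  def clean_scan(in_path, out_path):
      """Rough clean-up: estimate skew from the text pixels, rotate, binarize."""
      gray = cv2.imread(in_path, cv2.IMREAD_GRAYSCALE)

      # Inverted Otsu threshold so text pixels become foreground for the skew estimate.
      fg = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

      # Fit a minimum-area rectangle around the text pixels to estimate skew.
      # Note: minAreaRect's angle convention differs between OpenCV versions.
      coords = np.column_stack(np.where(fg > 0)).astype(np.float32)
      angle = cv2.minAreaRect(coords)[-1]
      if angle < -45:
          angle = -(90 + angle)
      else:
          angle = -angle

      # Rotate the page to undo the skew.
      h, w = gray.shape
      M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
      deskewed = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                                borderMode=cv2.BORDER_REPLICATE)

      # Final Otsu binarization, which is roughly what most OCR engines want to see.
      cleaned = cv2.threshold(deskewed, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
      cv2.imwrite(out_path, cleaned)
      return out_path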

Try running tesseract or ocropus on a bad document scan before and after using doc2text...you'll see what I mean!
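A quick way to do that comparison from Python, assuming pytesseract and Pillow are installed and tesseract is on your PATH (the filenames here are placeholders for your raw scan and the cleaned output):

  from PIL import Image
  import pytesseract

  # OCR the raw scan and the cleaned version, then eyeball the difference.
  raw_text = pytesseract.image_to_string(Image.open('scan_raw.png'))
  cleaned_text = pytesseract.image_to_string(Image.open('scan_cleaned.png'))

  print('--- raw scan ---')
  print(raw_text[:500])
  print('--- cleaned scan ---')
  print(cleaned_text[:500])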

P.S. I should add that the target end-user is also a little different from that of strict OCR packages/wrappers. Users might be admin staff or academics (or kids like my RAs) who want a simple, straightforward API to extract the text they need from poorly scanned documents. doc2text is built with this need in mind.
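The intended workflow is meant to be about this simple (a rough sketch; check the README for the exact method names and parameters in the current release):

  import doc2text

  # Read a poorly scanned file, clean it up, and pull out the text.
  doc = doc2text.Document()
  doc.read('./path/to/scan.jpg')   # placeholder path
  doc.process()
  doc.extract_text()
  text = doc.get_text()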




Do you have a comparison with unpaper, which seems to do almost the same thing?



