I'm surprised that there seem to be no major FOSS OCR engines other than Tesseract, and Tesseract is, quite frankly, horrible. I once tried to use it on a high-resolution screenshot of a Discord message containing only the characters "0" and "1". I cropped it to only have the text, restricted the character set, tried fiddling with the image's contrast and whatnot, and the result was still quite poor, with many characters mistaken or straight up ignored.
I have little expertise in ML, but from my limited understanding, OCR is the bread and butter of the field. I've read exactly one "Intro to ML" article and it was about recognising digits. And yet we have an abundance of high-quality proprietary OCRs that can recognise printed or even handwritten text, while the single open source one has trouble with perfectly formatted text in a readable font.
Could anyone with more expertise shine some light on this current state of affairs?
> I once tried to use it on a high-resolution screenshot of a Discord message containing only the characters "0" and "1". I cropped it to only have the text, restricted the character set, tried fiddling with the image's contrast and whatnot, and the result was still quite poor, with many characters mistaken or straight up ignored.
I had the opposite experience.
My partner was doing a project for the Army Corps of Engineers and they only provided information via some system called ProjNet that, best I can tell, exported PDFs of web pages in pure vector format, so they were unsearchable. Of course they needed to search 10000 pages of documents to answer questions for the ACoE.
I was able to feed the PDFs into Tesseract, produce one text document per PDF page, and then marry the text back up to the PDF so they could search the PDFs. It worked astonishingly well and took about half an hour using the cringiest of shell scripts.
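For anyone wanting to try the same thing, here is a rough Python sketch of that kind of pipeline. It is not the original script. It assumes `pdftoppm` (from poppler) and `tesseract` are on your PATH, and the helper names (`rasterize_cmd`, `ocr_cmd`, `ocr_pdf`), the 300 DPI setting, and the `--psm 6` choice are all just illustrative guesses. The optional whitelist covers the "restrict the character set" case from the parent comment.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def rasterize_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> List[str]:
    """Build the pdftoppm command that renders each PDF page to a PNG.

    pdftoppm names its output <prefix>-<page>.png (page number zero-padded).
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image: Path, out_base: Path,
            whitelist: Optional[str] = None, fmt: str = "txt") -> List[str]:
    """Build the tesseract command for one page image.

    fmt is a tesseract output config: "txt" for plain text,
    "pdf" for a searchable PDF with an invisible text layer.
    """
    cmd = ["tesseract", str(image), str(out_base), "--psm", "6"]
    if whitelist:
        # Assumption: restricting characters helps for inputs like "0"/"1" only.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd + [fmt]


def ocr_pdf(pdf: Path, workdir: Path) -> List[Path]:
    """Rasterize a PDF and OCR every page; returns the per-page .txt paths."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, workdir / pdf.stem), check=True)
    outputs = []
    for page in sorted(workdir.glob(f"{pdf.stem}-*.png")):
        out_base = page.parent / page.stem
        subprocess.run(ocr_cmd(page, out_base), check=True)
        outputs.append(page.parent / (page.stem + ".txt"))
    return outputs


if __name__ == "__main__":
    import sys
    for txt in ocr_pdf(Path(sys.argv[1]), Path("ocr_out")):
        print(txt)
```

Passing `fmt="pdf"` instead should give you a searchable PDF per page, which you would then still need to merge back together with a separate tool.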
I did something similar with SDGE's published rate tables to convert their screenshots of XLS files back into tabular data. It didn't work as well but still got the job done.
It's amazing to me that there's so little in the OSS world about handwriting recognition. From an OCR perspective, I understand it's much harder than printed text, but there's not really anything for "online" handwriting recognition either (written on a screen/vectorized strokes). From my understanding online recognition should be easier than scanning printed text, and yet there aren't any tools out there that I can find.