I'm working on Arabic OCR for a massive collection of books (over 13 million pages so far). I've tried multiple open-source models and projects, including Tesseract, Surya, and a Nougat small model fine-tuned for Arabic. However, none of them matched the latency and accuracy of Google OCR.
As a result, I developed a Python package called tahweel (https://github.com/ieasybooks/tahweel), which leverages Google Cloud Platform's Service Accounts to run OCR and provides page-level output. With the default settings, it can process a page per second. Although the underlying Google OCR isn't open-source, this setup outperforms the other solutions by a significant margin.
For example, OCRing a PDF file using Surya on a machine with a 3060 GPU takes about the same amount of time as using the tool I mentioned, but it consumes more power and hardware resources while delivering worse results. This has been my experience with Arabic OCR specifically; I'm not sure if English OCR faces the same challenges.
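For anyone curious what the service-account route looks like in Python, here's a minimal sketch of OCRing a single page image with Google Cloud Vision. It's illustrative only, not tahweel's actual code; the file paths are placeholders.

```python
# Minimal sketch: OCR one page image with Google's cloud OCR via a
# service-account key. "service-account.json" and "page.png" are
# placeholder paths; this is not tahweel's actual implementation.
from google.cloud import vision  # pip install google-cloud-vision

client = vision.ImageAnnotatorClient.from_service_account_file("service-account.json")

with open("page.png", "rb") as f:
    image = vision.Image(content=f.read())

# document_text_detection is tuned for dense text such as book pages.
response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)
```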
Hi, I'm the author of surya (https://github.com/VikParuchuri/surya) - working on improving speed and accuracy now. Happy to collaborate if you have specific page types it's not working on. For modern/clean documents it benchmarks very similarly to Google Cloud, but working on supporting older documents better now.
Hello Vik, and thanks for your work on Surya. I really liked it once I found it, but my main issues now are latency and hardware requirements; accuracy could be improved over time for different page types.
For example, I'm deploying tahweel to one of my webapps to allow a limited number of users to run OCR on PDF files. I'm using a small CPU-only machine for this; deploying Surya there wouldn't be comparable, and I think you're facing similar issues with https://www.datalab.to.
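For context, here's a rough sketch of the kind of endpoint I mean: the small CPU box only rasterizes pages and delegates the actual OCR to the cloud backend. The route name and the `ocr_page` helper are made up for illustration, not the webapp's real code.

```python
# Rough sketch of a CPU-only OCR endpoint that delegates recognition to a
# cloud backend. Route name and ocr_page helper are hypothetical.
import io

from fastapi import FastAPI, UploadFile
from pdf2image import convert_from_bytes  # requires poppler installed

app = FastAPI()


def ocr_page(png_bytes: bytes) -> str:
    """Placeholder: call the cloud OCR backend here (e.g. the Vision sketch above)."""
    raise NotImplementedError


@app.post("/ocr")
async def ocr_pdf(file: UploadFile):
    # Rasterize each PDF page, then OCR it remotely; the machine itself only
    # does image conversion and HTTP, so it stays cheap to run.
    pages = convert_from_bytes(await file.read(), dpi=200)
    results = []
    for page in pages:
        buf = io.BytesIO()
        page.save(buf, format="PNG")
        results.append(ocr_page(buf.getvalue()))
    return {"pages": results}
```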
This has been my experience with Japanese texts as well. I have a number of fairly obscure Japanese books and magazines I’ve collected as part of a research interest. During the pandemic, I began digitizing them and found that nothing but Google OCR could extract the text correctly. I recently tried again with the libraries you mentioned, but they also performed worse than traditional tools.
I'm currently planning to develop a tool to correct Arabic outputs from ASR and OCR. It will function like spell correction, but focused specifically on those two areas. Perhaps you could start something similar for Japanese? English (and Latin-script languages in general) performs at a different level across multiple tasks, to be honest...
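To make the idea concrete, here's a toy dictionary-based sketch using only the standard library. The wordlist entries are placeholders; a real Arabic corrector would need orthographic normalization, morphology, and a language model on top of this.

```python
# Toy dictionary-based corrector for noisy OCR/ASR output.
# The wordlist is illustrative; a real system would use a large lexicon
# plus normalization and context-aware scoring.
import difflib

ARABIC_WORDLIST = ["كتاب", "مكتبة", "الكتب"]  # placeholder entries


def correct_token(token: str, cutoff: float = 0.8) -> str:
    """Return the closest wordlist entry if it is similar enough, else the token unchanged."""
    matches = difflib.get_close_matches(token, ARABIC_WORDLIST, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line: str) -> str:
    return " ".join(correct_token(t) for t in line.split())
```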