I'm working on Arabic OCR for a massive collection of books (over 13 million pages so far). I've tried multiple open-source models and projects, including Tesseract, Surya, and a Nougat small model fine-tuned for Arabic. However, none of them matched the latency and accuracy of Google OCR.
As a result, I developed a Python package called tahweel (https://github.com/ieasybooks/tahweel), which leverages Google Cloud Platform's Service Accounts to run OCR and provides page-level output. With the default settings, it can process a page per second. Although the underlying Google OCR isn't open-source, this setup outperforms the other solutions by a significant margin.
For example, OCRing a PDF file using Surya on a machine with a 3060 GPU takes about the same amount of time as using the tool I mentioned, but it consumes more power and hardware resources while delivering worse results. This has been my experience with Arabic OCR specifically; I'm not sure if English OCR faces the same challenges.
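For anyone curious what the service-account route looks like in Python, here's a minimal sketch of OCRing a single page image with Google Cloud Vision. It's illustrative only, not tahweel's actual code; the file paths are placeholders.

```python
# Minimal sketch: OCR one page image with Google's cloud OCR via a
# service-account key. "service-account.json" and "page.png" are
# placeholder paths; this is not tahweel's actual implementation.
from google.cloud import vision  # pip install google-cloud-vision

client = vision.ImageAnnotatorClient.from_service_account_file("service-account.json")

with open("page.png", "rb") as f:
    image = vision.Image(content=f.read())

# document_text_detection is tuned for dense text such as book pages.
response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)
```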
Hi, I'm the author of surya (https://github.com/VikParuchuri/surya) - working on improving speed and accuracy now. Happy to collaborate if you have specific page types it's not working on. For modern/clean documents it benchmarks very similarly to Google Cloud, but working on supporting older documents better now.
Hello Vik, and thanks for your work on Surya. I really liked it once I found it, but my main issues now are latency and hardware requirements; accuracy could be improved over time for different page types.
For example, I'm deploying tahweel to one of my webapps to allow a limited number of users to run OCR on PDF files. I'm using a small CPU-only machine for this; deploying Surya there wouldn't be comparable, and I think you're facing similar issues with https://www.datalab.to.
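For context, here's a rough sketch of the kind of endpoint I mean: the small CPU box only rasterizes pages and delegates the actual OCR to the cloud backend. The route name and the `ocr_page` helper are made up for illustration, not the webapp's real code.

```python
# Rough sketch of a CPU-only OCR endpoint that delegates recognition to a
# cloud backend. Route name and ocr_page helper are hypothetical.
import io

from fastapi import FastAPI, UploadFile
from pdf2image import convert_from_bytes  # requires poppler installed

app = FastAPI()


def ocr_page(png_bytes: bytes) -> str:
    """Placeholder: call the cloud OCR backend here (e.g. the Vision sketch above)."""
    raise NotImplementedError


@app.post("/ocr")
async def ocr_pdf(file: UploadFile):
    # Rasterize each PDF page, then OCR it remotely; the machine itself only
    # does image conversion and HTTP, so it stays cheap to run.
    pages = convert_from_bytes(await file.read(), dpi=200)
    results = []
    for page in pages:
        buf = io.BytesIO()
        page.save(buf, format="PNG")
        results.append(ocr_page(buf.getvalue()))
    return {"pages": results}
```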
This has been my experience with Japanese texts as well. I have a number of fairly obscure Japanese books and magazines I’ve collected as part of a research interest. During the pandemic, I began digitizing them and found that nothing but Google OCR could extract the text correctly. I recently tried again with the libraries you mentioned, but they also performed worse than traditional tools.
I'm currently planning to develop a tool to correct Arabic outputs from ASR and OCR. It will function like spell correction, but focused specifically on those two areas. Perhaps you could start something similar for Japanese? English (and Latin-script languages in general) performs at a different level across multiple tasks, to be honest...
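To make the idea concrete, here's a toy dictionary-based sketch using only the standard library. The wordlist entries are placeholders; a real Arabic corrector would need orthographic normalization, morphology, and a language model on top of this.

```python
# Toy dictionary-based corrector for noisy OCR/ASR output.
# The wordlist is illustrative; a real system would use a large lexicon
# plus normalization and context-aware scoring.
import difflib

ARABIC_WORDLIST = ["كتاب", "مكتبة", "الكتب"]  # placeholder entries


def correct_token(token: str, cutoff: float = 0.8) -> str:
    """Return the closest wordlist entry if it is similar enough, else the token unchanged."""
    matches = difflib.get_close_matches(token, ARABIC_WORDLIST, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line: str) -> str:
    return " ".join(correct_token(t) for t in line.split())
```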