Hacker News
Ask HN: Indic languages OCR and searchable pdfs
31 points by the-mitr 12 days ago | 4 comments
I am looking for applications that can perform OCR on scanned images containing Indic scripts (Devanagari, Tamil, etc.) and create a searchable PDF as output. There are several applications that can extract the text from images, but is there any application that can create a searchable PDF?





I saw Hindi listed in this free app for making PDFs searchable:

https://products.aspose.app/pdf/searchable

So I think it should be possible to do the same for Devanagari locally, using Tesseract and Aspose.Pdf with a C# snippet like this:

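    using System.IO;

    // Assumptions: tesseract is on PATH; "hin" is Tesseract's Hindi (Devanagari) model.
    string lang = "hin";
    string outputFolder = Path.GetTempPath();

    // Aspose.Pdf calls this for each page image and expects hOCR markup back.
    Aspose.Pdf.Document.CallBackGetHocr recognizeText = (System.Drawing.Image img) =>
    {
        // Save the page image under a unique temporary name.
        string tmpFile = Path.Combine(outputFolder, Path.GetRandomFileName());
        using (System.Drawing.Bitmap bmp = new System.Drawing.Bitmap(img))
        {
            bmp.Save(tmpFile);
        }

        // Same quoted path as input and output base: Tesseract writes <tmpFile>.hocr.
        string pathTempFile = $"\"{tmpFile}\"";
        string arguments = $"{pathTempFile} {pathTempFile} --oem 1 -l {lang} hocr";

        System.Diagnostics.ProcessStartInfo psi =
            new System.Diagnostics.ProcessStartInfo("tesseract", arguments);

        using (System.Diagnostics.Process p = new System.Diagnostics.Process())
        {
            p.StartInfo = psi;
            p.Start();
            p.WaitForExit();
        }

        return File.ReadAllText($"{tmpFile}.hocr");
    };

    // Convert changes only the in-memory document, so save it (output name is a placeholder).
    var doc = new Aspose.Pdf.Document("my_Devanagari_scan.pdf");
    doc.Convert(recognizeText);
    doc.Save("my_Devanagari_scan_searchable.pdf");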

Thanks, will give it a shot.


Check out mathpix.com

(disclaimer: I'm a founder)



