
Using Pytesseract to Convert Images into a HTML Site - armaizadenwala
https://armaizadenwala.com/blog/pytesseract-images-to-html/
======
markdown
First of all, great work, and thank you for sharing.

The video only shows this working with an image of a text-only page. What
happens when there are photos embedded in the image?

~~~
armaizadenwala
Hi! Thank you!

Tesseract is trained to recognize only text in images. I haven't looked into
image detection yet, though.

This project fits the situation where you need to digitize a bunch of physical
copies or scans of documents. Sometimes these documents have images, like
company logos, which would be useful to include in the final HTML page.

I'll try to look into it; it is a wonderful idea for a second part. The
current post is geared towards helping others transition into the world of
data science with OCR by describing every step along the way.
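
As a rough starting point for the embedded-image question, here is a minimal sketch of one possible heuristic, not something from the post itself. It assumes you already have word boxes in the dictionary shape that `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` returns, and it flags tall vertical bands with no recognized words as candidate picture regions. The helper names `word_boxes` and `textless_bands`, the `min_gap` threshold, and the sample data are all hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def word_boxes(data: Dict[str, list], min_conf: float = 0.0) -> List[Box]:
    """Collect bounding boxes of recognized words from an image_to_data-style dict."""
    boxes = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"], data["top"],
        data["width"], data["height"],
    ):
        # Layout-only rows carry empty text and a confidence of -1; skip them.
        if str(text).strip() and float(conf) >= min_conf:
            boxes.append((left, top, left + width, top + height))
    return boxes


def textless_bands(
    boxes: List[Box], page_height: int, min_gap: int = 50
) -> List[Tuple[int, int]]:
    """Return (top, bottom) vertical bands at least min_gap pixels tall with no words."""
    bands = []
    cursor = 0  # lowest pixel row covered by text so far
    for _, top, _, bottom in sorted(boxes, key=lambda b: b[1]):
        if top - cursor >= min_gap:
            bands.append((cursor, top))
        cursor = max(cursor, bottom)
    if page_height - cursor >= min_gap:
        bands.append((cursor, page_height))
    return bands


# Hand-written sample in the same shape as the OCR output (not real OCR data).
sample = {
    "text": ["", "Invoice", "Total"],
    "conf": ["-1", "96", "91"],
    "left": [0, 40, 40],
    "top": [0, 200, 260],
    "width": [600, 120, 80],
    "height": [800, 30, 30],
}
boxes = word_boxes(sample)
bands = textless_bands(boxes, page_height=800)
print(bands)
```

Each band could then be cropped with Pillow's `Image.crop` and written out as an `<img>` in the generated page. Blank margins are flagged too, so this only narrows down where a logo or photo might be; it does not detect images as such.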

------
riedel
Nice. But why are you attributing Tesseract solely to Google when it was
initially developed by HP? Does it help marketing nowadays?

~~~
netgusto
I'd argue that one can refer to Tesseract as a Google product without being
deceptive, as it's been developed by Google since 2006 [1].

[1] [https://github.com/tesseract-ocr/tesseract#brief-history](https://github.com/tesseract-ocr/tesseract#brief-history)

