Extracting Structured Data from Templatic Documents

combatentropy · on June 13, 2020

This reminds me of some of Sergey Brin's research papers (https://scholar.google.com/scholar?q=sergey+brin), like "What Can You Do with the Web in Your Pocket?" (https://www.cs.princeton.edu/courses/archive/fall02/cs597A/p...)

AndrewKemendo · on June 12, 2020

Years ago I tried hacking together something like this, primarily to read labels on packaged food products, but the main roadblock I hit was that the FOSS OCR solutions weren't anywhere near reliable enough.

Mind you, this was a few years ago and I was primarily testing with pytesseract. I would be curious whether this team actually used the Google OCR API or an internally tuned one that isn't GA, and how that differs from FOSS Tesseract.

xnx · on June 12, 2020

Blog posts like this are good to remember for all those who claim Google isn't innovating or investing in search. This is the type of infrastructure that goes into extracting useful information from the web.

mukuz · on June 13, 2020

Google will be able to provide us with a lot more information extracted from our emails, like it currently does for delivery tracking and reservations.

vitorbaptistaa · on June 12, 2020

The trained model doesn't seem to be available? Does anyone know if this or a similar model is available somewhere?