
Extracting Structured Data from Templatic Documents - headalgorithm
http://ai.googleblog.com/2020/06/extracting-structured-data-from.html
======
combatentropy
This reminds me of some of Sergey Brin's research papers
([https://scholar.google.com/scholar?q=sergey+brin](https://scholar.google.com/scholar?q=sergey+brin)),
like "What Can You Do with the Web in Your Pocket?"
([https://www.cs.princeton.edu/courses/archive/fall02/cs597A/p...](https://www.cs.princeton.edu/courses/archive/fall02/cs597A/papers/brin98what.pdf))

------
AndrewKemendo
Years ago I tried hacking something together like this - primarily to read
labels on packaged food products - but the primary roadblock I hit was with
the FOSS OCR solutions not being anywhere near good enough to be reliable.

Mind you this was a few years ago and I was primarily testing with
pytesseract. I would be curious if this team actually used the Google OCR API
or an internally tuned one that isn't GA, and how that differs from FOSS Tesseract.

------
xnx
Good to remember blog posts like this for all those who claim Google isn't
innovating or investing in search. This is the type of infrastructure that
goes into extracting useful information from the web.

------
mukuz
Google will be able to provide us with a lot more information extracted from
our emails, like it currently does for delivery tracking and reservations.

------
vitorbaptistaa
The trained model doesn't seem to be available? Does anyone know if this or a
similar model is available somewhere?

