Extracting Structured Data from Templatic Documents (googleblog.com)
45 points by headalgorithm on June 12, 2020 | 5 comments



This reminds me of some of Sergey Brin's research papers (https://scholar.google.com/scholar?q=sergey+brin), like "What Can You Do with the Web in Your Pocket?" (https://www.cs.princeton.edu/courses/archive/fall02/cs597A/p...)


Years ago I tried hacking together something like this - primarily to read labels on packaged food products - but the main roadblock I hit was that the FOSS OCR solutions weren't anywhere near good enough to be reliable.

Mind you, this was a few years ago and I was primarily testing with pytesseract. I would be curious whether this team actually used the Google OCR API or an internally tuned one that isn't GA, and how that differs from FOSS Tesseract.


Good to remember blog posts like this for all those who claim Google isn't innovating or investing in search. This is the type of infrastructure that goes into extracting useful information from the web.


Google will be able to provide us with a lot more information extracted from our emails, like it currently does for delivery tracking and reservations.


The trained model doesn't seem to be available? Does anyone know if this or a similar model is available somewhere?



