
Google Indexes Images from PDF Files - tomkwok
http://googlesystem.blogspot.com/2015/08/google-indexes-images-from-pdf-files.html
======
cowsandmilk
While it's interesting that they do this, it does not appear they are doing
anything difficult. The linked PDFs have the images embedded, as opposed to,
say, scanned pages where Google would have to automatically recognize image
borders and such.

------
kevinSuttle
This only encourages restaurants to keep their menus in PDFs.

~~~
rancur
Which is great, because 80% of restaurant websites belong to otherwise
acceptable restaurants with a mediocre site presence and implementation. The
20% that have a mobile implementation worth visiting still don't support a
"view the desktop version of this site" option, requiring me to re-download
and install Dolphin Browser on Android just so I can manually spoof the user
agent as "Desktop".

I wonder if there would be a market for a rancur-approved sticker/logo that
you could place on your website, and that people could index and search by, to
help usability-conscious users rid their online experience of pesky mobile
implementations.

------
1arity
> Back in 2008, Google started to use OCR to index the full text of scanned
> PDF files. Now Google extracts images from PDF files and makes them
> searchable.

The astounding rate of progress of Google! In only 7 years they have gone from
extracting PDF text to extracting the embedded images. When will the miracles
of those technical wizards cease to astonish all who gaze upon their
brilliance?

I wonder if there was some other reason Google waited so long to include PDF
images? Perhaps something legal. After all, the actual technical requirement
is really clear (the photographic image bytestreams are simply stored in the
PDF), and the utility for people seems quite large, there being a lot of
images stored in a lot of PDFs.
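
To illustrate how little work the embedded case can take, here is a rough
sketch of pulling JPEG bytestreams straight out of a PDF. It assumes the images
are stored as stream objects whose only filter is /DCTDecode, in which case the
stream body is the JPEG file itself. The function name and the regex-based
scanning are my own shortcuts, not anything Google is known to use; a real
extractor would parse the object structure and honor /Length instead.

```python
import re

# Hypothetical helper: a regex scan, not a real PDF parser.
# Matches a dictionary (not spanning past "endobj") followed by its stream body.
_STREAM_RE = re.compile(
    rb"<<((?:(?!endobj).)*?)>>\s*stream\r?\n(.*?)\r?\nendstream",
    re.DOTALL,
)

def extract_jpeg_streams(pdf_bytes: bytes) -> list[bytes]:
    """Return raw stream bodies that look like directly embedded JPEGs."""
    images = []
    for match in _STREAM_RE.finditer(pdf_bytes):
        dictionary, body = match.groups()
        is_jpeg_image = b"/Image" in dictionary and b"/DCTDecode" in dictionary
        # The SOI marker check skips streams that were additionally wrapped
        # in another filter (e.g. Flate), which this sketch does not decode.
        if is_jpeg_image and body.startswith(b"\xff\xd8"):
            images.append(body)
    return images
```

Each returned item could be written to disk as a .jpg as-is. Images stored with
other filters (Flate-compressed raw pixels, JBIG2, JPEG 2000) would need real
decoding, so this only covers the easy case.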

Perhaps it was simply overlooked, or not on the roadmap until they had made a
lot of other changes to their image search that were perhaps judged more
important, such as changes to their image representation indexing (which they
do by majority color, it seems) and image content indexing (which it seems
they contribute to using DNN-generated descriptions).

It seems likely they rolled this out in a limited fashion, maybe a few times
over the years, and then held off on a general release pending whatever was
still missing.

