
ScienceBeam – using computer vision to extract PDF data - kaplun
https://elifesciences.org/labs/5b56aff6/sciencebeam-using-computer-vision-to-extract-pdf-data
======
aidos
PDF is a pretty interesting format. The spec is actually a great read. It's
amazing how many features they've needed to add over the years to support
everyone's use cases.

It's a display format that doesn't have a whole lot of semantic meaning for
the most part. Often every character is individually placed so even extracting
words is a pain. It's insane that OCR (which it sounds like this uses) is the
easiest way to deal with extraction.

I highly recommend having a look inside a couple of pdfs to see how they look.
I've posted about this before but the trick is to expand the streams.

    mutool clean -d in.pdf out.pdf
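Once the streams are expanded, a text-drawing fragment looks something like this (a made-up snippet, not from any real file). Note the per-glyph positioning adjustments interleaved with the string; that's part of why even reassembling words takes work:

```
BT                                        % begin a text object
/F1 12 Tf                                 % select font F1 at 12 pt
72 720 Td                                 % move to (72, 720)
[ (H) 120 (e) 110 (l) 95 (l) 95 (o) ] TJ % show glyphs with kerning tweaks
ET                                        % end the text object
```

The numbers in the TJ array are positioning adjustments in thousandths of a text-space unit, applied between the glyphs they separate.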

~~~
petters
Each character needs to be individually placed for the best possible
typesetting (e.g with microtype).

Nevertheless, copying text from a PDF works in many cases, so there has to be
some support for that, no? Or is everything done in the reader?

~~~
aidos
Text is a funny thing in PDF.

Sometimes there's real text broken into lines and then you're pretty good.
Often it's a subset font where the codes used in the content stream don't
correspond to their visual glyphs. When that happens there might be a Unicode
map that tells you which internal char maps to which external char (that's
what viewers use when you copy text from the PDF). Sometimes that map is
missing and you can rebuild it from other encoding information attached to the
font. Other times you can't recover the relationship at all, and all you're
left with is arbitrarily assigned codes for each character; at that point it's
not too dissimilar from a simple Enigma machine problem, I guess.
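For the curious, the Unicode map mentioned above (the font's ToUnicode CMap) is a small PostScript-flavoured table of `beginbfchar` entries. A sketch of recovering it, with toy data (real CMaps also use `beginbfrange` and multi-character destinations, which this toy parser ignores):

```python
import re

# Toy fragment of a ToUnicode CMap from a subset font: internal glyph
# codes on the left, the Unicode code points they stand for on the right.
cmap_data = """
3 beginbfchar
<0001> <0048>
<0002> <0065>
<0003> <006C>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Map internal character codes to Unicode characters."""
    return {int(src, 16): chr(int(dst, 16))
            for src, dst in re.findall(
                r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>", cmap_text)}

mapping = parse_bfchar(cmap_data)
# A run of internal codes from the content stream can now be decoded:
decoded = "".join(mapping[c] for c in (0x01, 0x02, 0x03, 0x03))  # → "Hell"
```

When the map is absent, this is exactly the table you'd have to reconstruct from font encodings, or from frequency analysis of the ciphertext-like codes.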

I sometimes see documents in which the same font has been subset again and
again, once for each word. If you have a Unicode map for each one, that's
fine; if not, it's not going to be much fun. In that case every word looks
like random characters in the PDF, and the character-glyph relationships
change from word to word.

Other times the glyphs are rendered as vector paths at write time and you're
down to trying to identify the character from the outline of the shape. I deal
with this a lot and there are common patterns but normally each glyph will be
broken into several bits itself, so you have to find which bits go with which
glyphs before you even start.
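That grouping step can be sketched as clustering subpath bounding boxes by horizontal proximity. This is a deliberately naive heuristic (real glyphs can overlap horizontally, e.g. in italic faces), so treat it as a starting point, not a method:

```python
def group_subpaths(bboxes, gap=0.5):
    """Cluster subpath bounding boxes (x0, y0, x1, y1) into glyph
    candidates: pieces whose x-ranges overlap or nearly touch are
    assumed to belong to one glyph (the dot and stem of an 'i')."""
    groups = []
    right = float("-inf")  # right edge of the current cluster
    for box in sorted(bboxes):  # sweep left to right
        x0, _y0, x1, _y1 = box
        if x0 > right + gap:
            groups.append([])  # horizontal gap: start a new glyph
        groups[-1].append(box)
        right = max(right, x1)
    return groups

# Three subpaths: the stem and dot of an 'i', then a separate glyph.
subpaths = [(0, 0, 1, 5), (0, 6, 1, 7), (3, 0, 4, 5)]
glyphs = group_subpaths(subpaths)  # → two glyph groups
```

Only after something like this can you start matching each group's combined outline against known glyph shapes.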

Does anyone have a technical paper to hand? We could crack it open and take a
look for fun.

EDIT If you're trying to reliably convert everything, in a way you're better
off catering for the lowest common denominator so you can do it consistently.
Here, that means assuming you don't have any raw text data to work with and
you have to do everything with image recognition. Either way, it's a fun
problem.

------
vog
Some time ago I came to a similar conclusion: In most cases, the only way to
properly process PDF files is to render them and work on the raster images.

I was involved in a project where we needed to determine the final size of an
image in a PDF document.

This seemed simple: Just keep track of all transformation matrices applied to
the image, then calculate the final size.
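Before transparency, clipping and overlap enter the picture, that tracking really is just matrix concatenation: an image XObject is painted into a 1×1 unit square and positioned by the current transformation matrix, so the rendered bounds come from pushing the square's corners through the concatenated `cm` matrices. A minimal sketch with toy matrices, using the PDF spec's row-vector convention:

```python
def mat_mul(m1, m2):
    """Product of two PDF matrices [a b c d e f], row-vector convention:
    applying the result equals applying m1 first, then m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return [a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2]

def apply(m, x, y):
    """Map a point through a PDF matrix: [x y 1] * M."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

def image_bbox(cms):
    """Device-space bounding box of an image drawn in the unit square,
    given the `cm` matrices in effect, in content-stream order."""
    ctm = [1, 0, 0, 1, 0, 0]  # identity
    for m in cms:
        # Each `cm` premultiplies the CTM, so the newest transform
        # applies to the point first.
        ctm = mat_mul(m, ctm)
    corners = [apply(ctm, x, y) for x, y in ((0, 0), (1, 0), (0, 1), (1, 1))]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))

# A page-level 2x scale, then a 100x50 image placed at (30, 40):
bbox = image_bbox([[2, 0, 0, 2, 0, 0], [100, 0, 0, 50, 30, 40]])
# → (60, 80, 260, 180): the image renders 200 x 100 points.
```

This is the easy part; it's everything layered on top that makes the naive approach collapse.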

But we underestimated the nonsense complexity of PDF: The image could be a
real image or an embedded EPS, which are completely different cases. The image
could have inner transparency, but can also have an outer alpha mask applied
by the PDF document. Then there are clipping paths, not to mention the
implicit clipping path that is the page boundary. Oh, and an image may be
overlapped by text, or even another image, in which case you need to do the
same processing for that one, too. And so on.

After wasting lots of time accidentally almost rebuilding a PDF renderer, we
decided to use an existing renderer instead.

Turned out the only feasible solution was to render the PDF twice: with and
without the image, and to compare the results pixel by pixel.
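That pixel-by-pixel comparison is easy to sketch with plain nested lists standing in for the two rasters (a real implementation would diff the renderer's actual pixel buffers, and would tolerate anti-aliasing noise with a threshold):

```python
def diff_bbox(render_a, render_b):
    """Bounding box (x0, y0, x1, y1) of the pixels that differ between
    two equally sized rasters, or None if they are identical. The
    differing region is exactly where the image landed on the page."""
    coords = [(x, y)
              for y, (row_a, row_b) in enumerate(zip(render_a, render_b))
              for x, (a, b) in enumerate(zip(row_a, row_b))
              if a != b]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)  # exclusive max edges

# Two tiny 4x4 "renders", 0 = white, 1 = ink; the second omits the image.
page_with = [[0, 0, 0, 0],
             [0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 0, 0, 0]]
page_without = [[0, 0, 0, 0]] * 4
image_box = diff_bbox(page_with, page_without)  # → (1, 1, 3, 3)
```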

I'm afraid the modern web might develop in a similar direction.

------
Lxr
This looks really cool and is badly needed. Our company would kill for a PDF
to semantic HTML algorithm (or service) too, using machine learning based on
computer vision. Existing options just vomit enough CSS to match the PDF
output, rather than mark up into headings, tables and the like.

------
davedx
Good stuff.

What I think would be a real killer app is using OCR to extract formulas
directly into MATLAB code. It would be awesome for reproducibility studies, or
just for people trying to implement algorithms for whatever reason.

Anyone know if there's an app for that already?

~~~
jmh530
How about PDF to LaTeX? MATLAB accepts LaTeX strings.

[https://www.mathworks.com/help/matlab/creating_plots/text-with-mathematical-expressions-using-latex.html](https://www.mathworks.com/help/matlab/creating_plots/text-with-mathematical-expressions-using-latex.html)

~~~
coherentpony
At that point, why work with the PDF at all? Why not just work with the TeX
source?

~~~
IanCal
The problem is usually not having access to this, just the pdfs.

------
hprotagonist
How do you address older PDFs that are scans with no actual textual data at
all, just embedded images?

In my experience, this is true for every PDF version of articles originally
published before about 1990.

~~~
yorwba
This project seems to convert the PDF into an image before doing the semantic
annotation, so it would work on scans as well. That doesn't give you the text,
but it gets you halfway there. The other half can be done by passing the
discovered regions to an OCR engine to pull out the text.

The one time I needed to turn a scanned PDF (600+ page book) into searchable
text, I used this Ruby script
[https://github.com/gkovacs/pdfocr/](https://github.com/gkovacs/pdfocr/) ,
which pulls out individual pages using pdftk, turns them into images to feed
into an OCR engine of your choice (Tesseract seems to be the gold standard)
and then puts them back together. It can blow up the file size tremendously,
but worked well enough for my use case. (I did write a very special purpose
PDF compressor to shrink the file back, but that was more for fun.)

------
misiti3780
I haven't had a chance to read through this completely yet, but I'm curious
whether this method is agnostic to how the PDF was created originally (LaTeX,
Adobe tools, scanned images). It reads like that doesn't matter (since it
treats the PDF as an image), but I wanted to make sure.

------
ocrcustomserver
Interesting. You can also try OCR and document layout analysis to do the same
thing (without GPUs).

Shameless plug: if you're interested in that sort of stuff, drop me a line, I
might be able to help.

------
sharemywin
couldn't you use a pdf converter and convert to html or something else and
translate that to your XML format?

~~~
dmreedy
My experience with PDF is that it's a pretty open-ended, and thus pretty
difficult, format to work with. There's not a whole lot in the way of "what is
this thing supposed to be" semantics encoded in the spec. Even PDF-to-HTML is
kind of a crapshoot.

