
CSS selectors for PDF elements? - xstartup
Is there something like this?
======
lovelearning
What problem are you trying to solve through this approach?

PDF has no concept of element selectors, and I haven't come across a
ready-made implementation of anything like them.

However, some things in a PDF generated by software (as opposed to a PDF made
from a scanned page) can be searched. For example, if you want all text
objects, you can look for sequences like "BT..Tf..Tj..ET", which are the
operators for begin text, set text font, show text, and end text.
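
To make the "BT..Tf..Tj..ET" idea concrete, here is a minimal sketch in Python. It assumes you already have a decompressed content stream as bytes (real streams are usually compressed, e.g. with FlateDecode), and it only handles simple literal strings shown with Tj. The sample stream and the function name find_text_objects are made up for illustration; text shown via TJ arrays, hex strings, or custom font encodings would need more work.

```python
import re

# Hand-written, already-decompressed content stream (hypothetical sample).
SAMPLE_STREAM = b"""
BT
/F1 12 Tf
72 720 Td
(Hello) Tj
0 -14 Td
(World \\(again\\)) Tj
ET
0 0 m 100 0 l S
BT /F2 9 Tf 72 600 Td (Footer) Tj ET
"""

# A text object is everything between the BT and ET operators.
TEXT_OBJECT = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# A literal string operand followed by Tj. Backslash escapes are allowed;
# unescaped nested parentheses are not handled by this sketch.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj\b")
# Font resource name and size followed by Tf.
FONT = re.compile(rb"/(\S+)\s+([\d.]+)\s+Tf\b")


def find_text_objects(stream: bytes):
    """Return one dict per BT..ET block with its first font and its Tj strings."""
    results = []
    for block in TEXT_OBJECT.finditer(stream):
        body = block.group(1)
        font = FONT.search(body)
        strings = [m.group(1).decode("latin-1") for m in SHOW_TEXT.finditer(body)]
        results.append({
            "font": font.group(1).decode("latin-1") if font else None,
            "size": float(font.group(2)) if font else None,
            "strings": strings,  # escape sequences are left undecoded
        })
    return results


text_objects = find_text_objects(SAMPLE_STREAM)
for obj in text_objects:
    print(obj)
```

A regex scan like this is only good for exploring; a proper content-stream tokenizer (such as the one in PDFBox) is the safer route for anything beyond that.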

But not all objects are easily queryable. For example, I've seen tables that
don't show any indication that they are tables. Instead, in the PDF, they are
encoded as a series of move, line, and stroke operations plus text operators.
You'd have to know that such a pattern represents a table and look for the
table by searching for the pattern.
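
As a rough illustration of "searching the pattern", the sketch below guesses whether a decompressed content stream contains a ruled grid. It assumes the producer draws each rule as a separate "x y m x y l S" segment (move, line, stroke); that is only one of many possibilities, since other producers use rectangles, filled thin boxes, or no rules at all. The sample stream, the tolerance, and the function names are all hypothetical.

```python
import re

# One stroked straight segment: "x0 y0 m x1 y1 l S".
SEGMENT = re.compile(
    rb"(-?[\d.]+)\s+(-?[\d.]+)\s+m\s+(-?[\d.]+)\s+(-?[\d.]+)\s+l\s+S\b"
)


def ruled_lines(stream: bytes, tol: float = 0.5):
    """Return the distinct y positions of horizontal rules and x positions of vertical rules."""
    horizontal, vertical = [], []
    for m in SEGMENT.finditer(stream):
        x0, y0, x1, y1 = (float(g) for g in m.groups())
        if abs(y0 - y1) <= tol:
            horizontal.append(y0)
        elif abs(x0 - x1) <= tol:
            vertical.append(x0)
    return sorted(set(horizontal)), sorted(set(vertical))


def looks_like_table(stream: bytes, min_rows: int = 2, min_cols: int = 2) -> bool:
    """Heuristic: n+1 parallel rules bound n rows (or columns)."""
    rows, cols = ruled_lines(stream)
    return len(rows) - 1 >= min_rows and len(cols) - 1 >= min_cols


# Hypothetical 2x2 grid: three horizontal and three vertical rules, plus some text.
GRID = b"""
72 700 m 272 700 l S
72 680 m 272 680 l S
72 660 m 272 660 l S
72 660 m 72 700 l S
172 660 m 172 700 l S
272 660 m 272 700 l S
BT /F1 10 Tf 80 686 Td (Name) Tj ET
"""

print(ruled_lines(GRID))
print(looks_like_table(GRID))
print(looks_like_table(b"0 0 m 100 0 l S"))
```

This ignores coordinate transforms (the cm operator) and paths with more than one segment, so treat it as a starting point for one known producer rather than a general detector.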

I don't know what you are trying to solve, but assuming selectors are the
right solution, the approach I'd choose is:

1\. Use a good framework like PDFBox and its COSObject object model to reverse
engineer simple PDFs while keeping the PDF spec [1] close at hand. This way
you can understand what the operators and patterns are.

2\. Use a framework like JXPath to build an arbitrary XPath-like query
interface over PDFBox's object model.
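
To show the shape of such a query layer without committing to the PDFBox or JXPath APIs, here is a stand-in sketch in Python. It is not JXPath: it uses the limited XPath subset built into the standard library's xml.etree.ElementTree, and the "object model" is a made-up list of dicts that you would have to produce yourself from parsed page content. All names (build_tree, PAGES, the attribute names) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-extracted page objects; a real pipeline would build
# these from the parsed content streams.
PAGES = [
    [
        {"type": "text", "font": "F1", "size": 18, "text": "Invoice"},
        {"type": "text", "font": "F2", "size": 10, "text": "Total: 42"},
        {"type": "line", "x0": 72, "y0": 700, "x1": 272, "y1": 700},
    ],
    [
        {"type": "text", "font": "F2", "size": 10, "text": "Page two"},
    ],
]


def build_tree(pages):
    """Wrap the page objects in an element tree so path queries can run over them."""
    root = ET.Element("document")
    for number, objects in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", number=str(number))
        for obj in objects:
            attrs = {k: str(v) for k, v in obj.items() if k not in ("type", "text")}
            node = ET.SubElement(page, obj["type"], attrs)
            node.text = obj.get("text")
    return root


root = build_tree(PAGES)

# "All text set in font F1", anywhere in the document.
headings = [n.text for n in root.findall(".//text[@font='F1']")]
# "All text on page 2".
page_two = [n.text for n in root.findall("./page[@number='2']/text")]

print(headings)
print(page_two)
```

The hard part is still step 1: deciding which low-level operators become which nodes. Once that mapping exists, the query syntax on top is the easy bit.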

It's easier if all the PDFs are produced by the same program, and far harder
if you want to process any PDF in the wild.

Alternatively, perhaps you can convert the PDF to HTML and then run the
selector on that HTML.

[1]:
[https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PD...](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf)

------
mkl
It seems unlikely. PDF files are generated by a huge number of programs, each
doing it differently. There's also almost no semantic information in the PDF
format.

Every time I have to extract information from PDFs that isn't simply text or
pixel graphics, I'm basically starting from scratch.

------
codegladiator
The source code of a PDF looks very different from what it looks like after
rendering (there is no parent/child/sibling tree to correlate with). Most of
it is literally absolute coordinates. You cannot have CSS-like selectors on
that.

