
Ask HN: Recommendations for PDF text extraction  - kenver
Hello HN, can anyone recommend a library/API for extracting the text and images from a PDF?

We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information of each element on the page.

Thanks for any suggestions.
======
atripathi
Hi, we used PdfTextStream to extract information from PDF documents in much
the way you describe (pre-known regions of the document), after looking at a
few other options. Working with coordinates and rectangles was not very easy,
though :)

We observed that the text in our PDF had a structure to it. So instead we
simply dumped the text from the PDF using pdftotext and wrote an ANTLR grammar
for the structure we saw. This enabled us to parse the relevant information
from the text dump.
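
As a rough illustration of the dump-then-parse approach, here is a Python
sketch that pulls labelled fields out of a text dump. It swaps in a regular
expression for the ANTLR grammar, and the "Label: value" layout, the field
names, and `parse_fields` are all invented for the example; a real dump would
come from running pdftotext on the document.

```python
import re

# Hypothetical text dump; in practice this would be pdftotext's output.
SAMPLE_DUMP = """\
ACME Corp
Invoice No: 12345
Date: 2010-06-01

Total: 99.50
"""

# One "Label: value" pair per line (an assumed layout, not a general rule).
FIELD_RE = re.compile(
    r"^(?P<label>[A-Za-z ]+):[ \t]*(?P<value>.+)$", re.MULTILINE
)


def parse_fields(dump):
    """Map each label found in the dump to its value."""
    return {
        m.group("label").strip(): m.group("value").strip()
        for m in FIELD_RE.finditer(dump)
    }


fields = parse_fields(SAMPLE_DUMP)
print(fields["Invoice No"], fields["Total"])
```

A regex like this only holds up while the layout stays flat; nested or
repeating structure is where a real grammar starts to pay off.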

------
scorpioxy
I don't know about positional information, but I've had good luck with PDFBox
for text extraction. And by good luck I mean as good as it gets considering I
am using something for free and working with the PDF standard.

This was in a production system, but it had several checks and fallback
mechanisms because the process was unreliable.

------
silvestrov
<http://www.pdflib.com/products/tet/>

TET provides precise metrics for the text, such as the position on the page,
glyph widths, and text direction. Specific areas on the page can be excluded
or included in the text extraction, e.g. to ignore headers and footers or
margins.

------
cemerick
Others have mentioned PDFTextStream (<http://snowtide.com>), which is our Java
and .NET product. Our RegionOutputTarget class
([http://snowtide.com/docframe.php/com.snowtide.pdf.RegionOutp...](http://snowtide.com/docframe.php/com.snowtide.pdf.RegionOutputTarget))
allows you to do selective text extraction based on spatial coordinates quite
easily.

If anyone has any questions, feel free to ping me.

------
iworkforthem
In Java, there are Apache PDFBox and jPDFText. The nature of PDF makes it very
difficult to extract text correctly and consistently.

~~~
kenver
Thanks for your reply. We're evaluating PdfTextStream at the moment. We want
to try a few though to see which works best for the types of document we are
going to be using.

~~~
iworkforthem
Past experience has shown that data in XML and CSV formats is the easiest to
extract and make sense of.

~~~
whiskers
Past experience shows that if someone explicitly asks for information about
extracting text from PDFs, their source material is probably in PDF format.

------
mgedmin
I've used pdftohtml -xml from poppler-utils for similar purposes (text with
position info; I wasn't interested in images although I believe pdftohtml
handles them too).

Poppler is the library that pdftohtml uses for this.
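
For anyone who wants to try this route, here is a minimal Python sketch of
filtering `pdftohtml -xml` output by a pre-known region. It assumes the usual
output shape: `<page>` elements containing `<text>` elements with `top`,
`left`, `width`, and `height` attributes. The function name `text_in_region`,
the sample XML, and the coordinates are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like `pdftohtml -xml` output; real files will differ.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="120" height="16" font="0">Invoice No:</text>
<text top="100" left="200" width="60" height="16" font="0"><b>12345</b></text>
<text top="900" left="50" width="200" height="16" font="0">Page footer</text>
</page>
</pdf2xml>"""


def text_in_region(xml_string, page_number, left, top, right, bottom):
    """Return text fragments whose boxes lie fully inside the given rectangle."""
    root = ET.fromstring(xml_string)
    found = []
    for page in root.iter("page"):
        if int(page.get("number")) != page_number:
            continue
        for node in page.iter("text"):
            x = float(node.get("left"))
            y = float(node.get("top"))
            w = float(node.get("width"))
            h = float(node.get("height"))
            if x >= left and y >= top and x + w <= right and y + h <= bottom:
                # itertext() also collects text nested in <b>/<i> children
                found.append("".join(node.itertext()).strip())
    return found


print(text_in_region(SAMPLE_XML, 1, 0, 90, 300, 130))
```

Requiring the whole box to sit inside the rectangle is the strict choice; an
overlap test may suit documents where text positions drift a little.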

------
maresca
PDFSharp is good if you are using .NET.

<http://www.pdfsharp.net/>

