
Ask HN: Any software to parse a PDF word by word? (maybe into SVG) - heliodor
I'm looking for a library or software to convert a PDF into SVG where the SVG renders the text grouped word by word.

I've seen that the software pdf2svg produces an SVG where everything in the PDF is rendered into individual shapes. Each letter becomes a shape! So that's too fine-grained.

Inkscape produces groups of words and can't break it down further.

Anything else out there that can actually figure out the individual words in my PDF?

Actually, I'd be okay with parsing the PDF and getting a list of words and their positions. No need for the SVG.

Any suggestions?
======
vram22
I worked on a project for a client some time ago [1] where extraction of text
from PDFs was a big part of it. I researched several products, tried them, and
settled on xpdf (the C library), which I used from my C code. It worked well
for many PDFs but not for some others [2].

[1] There could be better products by now. Also, had heard good things about
PDFTextStream, which another comment here mentions.

[2] Some data would get corrupted in the extracted text, e.g. characters
interchanged, missing characters, etc.

There are technical reasons, inherent in the nature of the PDF format, why
100% perfect text extraction, including correct character positions, is
possible for some PDF files but not all - that's one thing I learned from
working on that project. Some of that learning came from a key person at the
company that sells and supports xpdf. His support and knowledge were very
good. They even quickly fixed a bug or two that I found in the product.

But things do work for many cases. As an aside, I also developed a small
program that used heuristics to identify cases in which the PDF extraction was
incorrect. Was fun.
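The commenter's actual program isn't shown, but heuristics of that kind can be
quite simple. A minimal sketch (entirely hypothetical - the thresholds and the
two signals below are illustrative assumptions, not the commenter's method):

```python
import string

def looks_corrupted(text, min_printable=0.95, min_vowel_ratio=0.2):
    """Return True if extracted text trips simple corruption heuristics."""
    if not text:
        return True
    # Signal 1: many non-printable or replacement characters often mean
    # a broken character map in the source PDF.
    printable = sum(c in string.printable or c.isalpha() for c in text)
    if printable / len(text) < min_printable:
        return True
    # Signal 2: English-like text has a fairly stable vowel ratio;
    # dropped or substituted glyphs tend to break it.
    letters = [c.lower() for c in text if c.isalpha()]
    if letters:
        vowels = sum(c in "aeiou" for c in letters)
        if vowels / len(letters) < min_vowel_ratio:
            return True
    return False

print(looks_corrupted("The quick brown fox jumps over the lazy dog."))  # False
print(looks_corrupted("\ufffd\ufffd Th qck brwn fx \x00\x01\x02"))      # True
```

Checks like these won't catch every interchanged-character case, but they flag
the grossly garbled extractions cheaply enough to run over a large batch.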

~~~
hn12
I endorse everything Vasudev has written here. More generally, a couple of
other libraries of interest are the Java-based PDFBox and the Python-based
PDFMiner.

------
MaDeuce
Here are a couple of ideas, none of which do exactly what you want. However,
they may give you some ideas...

PDFMiner[1] is a Python toolkit for PDF. Among other things, it extracts text
from PDF files. It also has a tool that lets you find objects and their
coordinates in a PDF file. I have not looked at the latter functionality, but
it may get you your words and locations.
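PDFMiner's layout analysis goes down to individual character boxes, so getting
words usually means grouping adjacent characters yourself. A rough sketch of
that grouping step - using plain `(text, x0, y0, x1, y1)` tuples in place of
PDFMiner's character objects so it runs standalone, with an assumed gap
threshold for word breaks:

```python
def _merge(chars):
    # Join a run of character boxes into one (word, x0, y0, x1, y1) tuple.
    word = "".join(c[0] for c in chars)
    return (word,
            min(c[1] for c in chars), min(c[2] for c in chars),
            max(c[3] for c in chars), max(c[4] for c in chars))

def group_words(chars, gap_factor=0.3):
    """Group character boxes on one text line into words with bounding boxes."""
    words, current, prev_x1 = [], [], None
    for text, x0, y0, x1, y1 in chars:
        width = x1 - x0
        # Break a word on explicit whitespace, or on a horizontal gap
        # wider than gap_factor times the current character's width.
        is_gap = prev_x1 is not None and (x0 - prev_x1) > gap_factor * width
        if text.isspace() or (is_gap and current):
            if current:
                words.append(_merge(current))
                current = []
            if text.isspace():
                prev_x1 = x1
                continue
        current.append((text, x0, y0, x1, y1))
        prev_x1 = x1
    if current:
        words.append(_merge(current))
    return words

# Two words, "Hi there", laid out left to right on one line:
line = [("H", 0, 0, 8, 12), ("i", 8, 0, 12, 12), (" ", 12, 0, 16, 12),
        ("t", 20, 0, 26, 12), ("h", 26, 0, 33, 12), ("e", 33, 0, 40, 12),
        ("r", 40, 0, 45, 12), ("e", 45, 0, 51, 12)]
print(group_words(line))  # [('Hi', 0, 0, 12, 12), ('there', 20, 0, 51, 12)]
```

With real PDFMiner output you would feed in the per-character boxes from its
layout tree instead of the hand-built `line` list; the grouping logic is the
same.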

I've used Tesseract[2] to convert scanned documents into searchable PDF files.
Since a search of the PDF file will highlight matching words in the scanned
document, it clearly knows where words are and the letters that comprise them.
This might be another approach.

[1] [https://code.google.com/p/tesseract-ocr/wiki/ReadMe](https://code.google.com/p/tesseract-ocr/wiki/ReadMe)

[2] [https://code.google.com/p/tesseract-ocr/wiki/ReadMe](https://code.google.com/p/tesseract-ocr/wiki/ReadMe)
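On the Tesseract route: besides searchable PDFs, Tesseract can emit hOCR, an
HTML dialect that records a bounding box per recognized word in each span's
`title` attribute. Extracting the word/position list from that is a few lines
of stdlib Python. A sketch - the sample markup below is made up, but follows
the hOCR `ocrx_word` convention:

```python
import re
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    """Collect (word, x0, y0, x1, y1) tuples from hOCR word spans."""
    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in a.get("class", ""):
            # hOCR stores the box as "bbox x0 y0 x1 y1" in the title attribute.
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", a.get("title", ""))
            self._bbox = tuple(map(int, m.groups())) if m else None

    def handle_data(self, data):
        if self._bbox and data.strip():
            self.words.append((data.strip(), *self._bbox))
            self._bbox = None

sample = """
<span class='ocrx_word' title='bbox 100 50 180 72'>Hello</span>
<span class='ocrx_word' title='bbox 190 50 280 72'>world</span>
"""
parser = HocrWords()
parser.feed(sample)
print(parser.words)  # [('Hello', 100, 50, 180, 72), ('world', 190, 50, 280, 72)]
```

This would sit downstream of a command like `tesseract page.png out hocr`,
which writes the hOCR file this parser consumes.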

------
hn12
No.

Well, the correct answer to almost any such technical question is "yes and
no". This one comes as close as any to a bare "no".

It's a good question. PDF->SVG conversions are quite powerful--when they work.
They simply do NOT work in the general case. I deal with a hundred thousand
PDFs at a time, and they demonstrably don't behave with enough regularity to
allow the kind of general transformation I suspect you have in mind.

As it happens, our little company does quite a bit of business extracting
content from specific classes of PDFs:
[http://phaseit.net/claird/comp.text.pdf/PDF_content_extracti...](http://phaseit.net/claird/comp.text.pdf/PDF_content_extraction.html)
Coincidentally, I also research and deliver advanced SVG effects:
[http://phaseit.net/claird/comp.text.xml/SVG_examples.html](http://phaseit.net/claird/comp.text.xml/SVG_examples.html)
I certainly am sympathetic to your aim. To be successful, you'll need to
specify your situation more precisely.

------
vmorgulis
Maybe with PDF.js:

[https://mozilla.github.io/pdf.js/api/draft/global.html#TextI...](https://mozilla.github.io/pdf.js/api/draft/global.html#TextItem)

Or WeasyPrint:

[http://weasyprint.org/](http://weasyprint.org/)

------
brudgers
Because PDFs can contain typesetting glyphs or even bitmap images, probably
the only universal way to extract text from arbitrary PDFs is OCR. On the
other hand, if you can make assumptions about the data, and the data is clean
enough that the assumptions hold, then other methods might work. But if the
data is that clean, then OCR probably will work too. And OCR is as easy as a
$79 all-in-one printer/scanner to get started with.

Good luck.

------
loumf
PDFTextStream: [https://www.snowtide.com](https://www.snowtide.com)

