

Ask HN: Good tools for text extraction from PDF - lucasrp

Hi guys,

I need a tool that lets me convert PDF files to HTML. Since I work with public documents, the PDF layouts can be pretty nasty (I've attached an example at the end of this post).

We have an in-house solution forked several years ago from Apache PDFBox. After a while we realized that forking an open source project isn't the best answer, but we kept going because it worked.

Does anyone have suggestions? We are willing to contribute to the open source project we choose :)

Many thanks!

https://www.evernote.com/shard/s226/sh/17b87c1f-8f18-4b23-96ac-a9fbc2ac8502/ea5618043f3a9c818071bd93df9f74c3
======
maxerickson
I've had good luck with the tools that come with xpdf:

[http://www.foolabs.com/xpdf/about.html](http://www.foolabs.com/xpdf/about.html)

But some of that is because the source I was pulling text from didn't change
the document format much from month to month.

I guess it is the library underneath jeffmould's link.
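
For what it's worth, here is a minimal sketch of driving `pdftotext` from Python. It assumes the xpdf command-line tools are installed and on your PATH; the helper names and file paths are made up for illustration. The `-layout` flag asks `pdftotext` to keep the physical layout of the page, which tends to help with column-heavy documents, though results will vary with the source PDF.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout: try to preserve the original physical layout of the text
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext; raises if the tool is missing or exits non-zero."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
    return txt_path
```

Calling `extract_text("report.pdf", "report.txt")` would write the extracted text next to the input, assuming both paths are valid on your machine.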

------
jeffmould
I have used the following with some success:

[http://pdftohtml.sourceforge.net/](http://pdftohtml.sourceforge.net/)

I'm not sure how well maintained it still is, but it did a good job of converting
basic PDF files to HTML.

There is also a Google Code product for going from HTML to PDF which works
pretty well.

