Show HN: an API to extract text from a PDF (stamplin.com)
51 points by trez 1570 days ago | 33 comments

Why not just system(pdf2html) - I don't see the point since this level of functionality is trivially achieved. If it did something over and above that it might be useful, like OCR, but even that's not hard to add.

Indeed, at the moment it's quite simple. The only benefit is that it's easy to integrate and fast. More advanced features should be coming soon.

If you're doing this locally / from the CLI:

`pdftotext`, from http://www.foolabs.com/xpdf/

For OCR, `pdfimages` (also from xpdf), combined with ImageMagick's `convert`, and `tesseract` (http://code.google.com/p/tesseract-ocr/) works passably well.
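For the local route, a minimal Python sketch of shelling out to `pdftotext` (assuming the xpdf/poppler binary is on your PATH; the function names here are just illustrative):

```python
import shutil
import subprocess

def build_pdftotext_cmd(pdf_path, layout=True):
    """Build the pdftotext command line.

    '-' as the output file sends the extracted text to stdout;
    -layout asks pdftotext to preserve the original physical layout.
    """
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd

def extract_text(pdf_path):
    """Run pdftotext if it is installed; return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install xpdf or poppler-utils")
    result = subprocess.run(build_pdftotext_cmd(pdf_path),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

The `-layout` flag matters for tables and multi-column pages; without it pdftotext emits text in internal PDF order, which can scramble columns.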

I have some questions:

1. Why return an array of texts? Where do the texts get split up? At page boundaries? Column boundaries? At the end of each line? If a line is interrupted by a corner of an image and continues a couple of inches afterward, does it get treated as a separate text? (I once used a PDF->text extractor program that spit out every word separately, often in an incorrect order. That probably had to do with how the PDF was organized internally.)

2. "The PDF file should be smaller than 1 Mbit" -> You mean 1 megabyte, right? Because 1 megabit is only 125-128 kilobytes.

1. That indeed depends on the way the PDF was created. Sometimes a single letter can even be split across different text tokens.

2. You're right, I meant megabyte.

Thanks for the clarifications.

Since a lot of PDFs are badly organized (and I wonder if some programs deliberately do that to make text extraction difficult), perhaps you could try to analyze the location of each token on the page and merge the ones that seem to belong together. That would already be 100x better than most of the free PDF->text converters out there.
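The merge-by-location idea can be sketched simply: if each extracted token comes with page coordinates, group tokens whose vertical positions are close into lines, then order each line left to right. A minimal illustration (the tuple format and tolerance are assumptions, not any particular library's API):

```python
def merge_tokens(tokens, line_tol=3.0):
    """Group positioned text tokens into reading-order lines.

    tokens: iterable of (x, y, text) tuples, with y increasing down
    the page. Tokens whose y coordinates differ by less than line_tol
    are treated as the same line, then ordered left to right.
    """
    lines = []  # each entry: (line_y, [(x, text), ...])
    for x, y, text in sorted(tokens, key=lambda t: (t[1], t[0])):
        if lines and abs(lines[-1][0] - y) < line_tol:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return "\n".join(
        " ".join(text for _, text in sorted(parts))
        for _, parts in lines
    )
```

Real PDFs need more care (rotated text, multiple columns, tokens that overlap images), but even this crude pass fixes the "every word on its own line" failure mode.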

We are already close to doing that, but with a really slow parser (that one can even replace some text in the PDF). Our problem now is understanding whether developers would rather have better text extraction or other features like image extraction, etc. Let us know what you would prefer.

Image extraction would be cool, but to me getting a readable block of text is more important.

Or even go the OCR approach!

All things considered that's pretty sad, though. A digital archive format that cannot reliably be read by machines, even if it contains just text.

Going from PDF to nicely formatted word doc would be huge for lawyers and people who do a lot of contract negotiations. It's hard to do well though.

Isn't a better solution to get software that supports PDF natively, like Acrobat, and edit the documents there instead of doing a (poor) translation to Word?

This gets you some of the way, but if the two products were combined... http://www.docverter.com/api.html

Word 2013 natively supports opening PDFs, including some advanced features like tables, bookmarks, etc.

I've recently been working on extracting text from PDFs myself. I've found that `pdftohtml -xml` from the Poppler utils does a decent job of it, and includes a bounding box for each piece of text. I've submitted a few patches to their Bugzilla to also include the transformation matrix as well as some extra styling information.
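The `-xml` output mentioned above is easy to consume: Poppler's `pdftohtml -xml` emits one `<text>` element per token with `top`/`left`/`width`/`height` attributes. A small parsing sketch (assuming that output format; `itertext()` also picks up text nested inside `<b>`/`<i>` styling tags):

```python
import xml.etree.ElementTree as ET

def parse_pdftohtml_xml(xml_text):
    """Parse XML produced by Poppler's `pdftohtml -xml`.

    Returns a list of (page_number, top, left, text) tuples,
    one per <text> element.
    """
    items = []
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        number = int(page.get("number"))
        for node in page.iter("text"):
            text = "".join(node.itertext())
            items.append((number, int(node.get("top")),
                          int(node.get("left")), text))
    return items
```

With the bounding boxes in hand, the token-merging approach discussed earlier in the thread becomes straightforward to apply.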

I googled "converting PDF to text" and "converting PDF to html". Tons of services already exist out there. Apparently, it's not something new. How do you plan to compete? Are you planning to focus on data extraction rather than conversion?

The initial plan wasn't about text extraction but text modification. We noticed we already had something to "give" and created this service. Following the lean methodology, we hope to get some insights about the next step.

Neat, but practically who would want to do this with an API rather than installable software?

Some languages do indeed have good enough libraries for that, some don't. Moreover, they are often quite slow and not always easy to integrate.

And sending the whole file over the wire is preferable?

What if the document contains sensitive or privileged data?

Then use a library or local call. For web based projects with low security concerns or something reminiscent of Yahoo! Pipes this is pretty cool.

If we notice a need for a more secure solution, we can provide a dedicated server with HTTPS.


This is similar to what we do at http://searchtower.com, where you can store, view, index and search the data.

Do you do OCR for text extraction?

Not right now, but we are working on it. Would you be interested in this feature?

Good OCR with a nice API would be very useful.

Nice. Why no paid options? I'm guessing because this was a weekend project.

If so, nice work!

Thanks. A paid option will be coming if there is some interest in it. If you're interested, please let us know in this thread or at info@stamplin.com.

Do you use pdftotext internally or something else?

We have our own parser for more complicated tasks, and use xpdf when speed is key because it's much faster.

I am writing something similar for a client. He needs data in tables extracted from the PDF. Which language are you using? I wrote two scripts, one using Python with pdftotext and another using Ruby's pdf-reader; the Ruby one gives each line of the PDF one by one, which is good for extraction.
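Once an extractor hands you layout-preserved lines (e.g. `pdftotext -layout`), table cells can often be recovered by splitting each line on runs of whitespace. A crude sketch, assuming columns are separated by at least two spaces (which holds for simple tables but not for ragged ones):

```python
import re

def split_row(line, min_gap=2):
    """Split one layout-preserved text line into table cells.

    Assumes columns are separated by runs of at least min_gap spaces,
    as produced by extractors that keep the physical layout.
    """
    cells = re.split(r" {%d,}" % min_gap, line.strip())
    return [cell for cell in cells if cell]
```

This breaks down when a cell itself contains wide spacing or when columns drift between rows; for those cases, clustering cell x-coordinates across the whole table is more robust.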
