
Show HN: an API to extract text from a PDF - trez
http://stamplin.com/api/docs/extracttextpdf/
======
hnriot
Why not just system(pdf2html) - I don't see the point since this level of
functionality is trivially achieved. If it did something over and above that
it might be useful, like OCR, but even that's not hard to add.

~~~
trez
Indeed, at the moment it's quite simple. The only benefit is that it's quite
easy to integrate and fast. More advanced features should be coming soon.

------
zdw
If you're doing this locally / from the CLI:

`pdftotext`, from [http://www.foolabs.com/xpdf/](http://www.foolabs.com/xpdf/)

For OCR, `pdfimages` (also from xpdf), combined with ImageMagick's `convert`,
and `tesseract` ([http://code.google.com/p/tesseract-ocr/](http://code.google.com/p/tesseract-ocr/))
works passably well.
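That pipeline can be sketched from Python, assuming the three tools are on
PATH and the PDF's pages are embedded as images (file names here are
illustrative, not something the tools guarantee for every PDF):

```python
import subprocess

def ocr_commands(pdf_path, prefix="page"):
    """Command lines for the pdfimages -> convert -> tesseract pipeline.

    pdfimages dumps the PDF's embedded images as <prefix>-NNN.ppm files;
    convert turns the first one into a TIFF; tesseract OCRs the TIFF and
    writes its text to <prefix>-000.txt.
    """
    return [
        ["pdfimages", pdf_path, prefix],
        ["convert", f"{prefix}-000.ppm", f"{prefix}-000.tif"],
        ["tesseract", f"{prefix}-000.tif", f"{prefix}-000"],
    ]

def run_pipeline(pdf_path):
    # Requires xpdf (or Poppler), ImageMagick, and tesseract installed.
    for cmd in ocr_commands(pdf_path):
        subprocess.run(cmd, check=True)
```

In practice you'd loop `convert`/`tesseract` over every dumped image, not
just the first page.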

------
kijin
I have some questions:

1\. Why return an array of texts? Where do the texts get split up? At page
boundaries? Column boundaries? At the end of each line? If a line is
interrupted by a corner of an image and continues a couple of inches
afterward, does it get treated as a separate text? (I once used a PDF->text
extractor program that spit out every word separately, often in an incorrect
order. That probably had to do with how the PDF was organized internally.)

2\. "The PDF file should be smaller than 1 Mbit" -> You mean 1 megabyte,
right? Because 1 megabit is only 125-128 kilobytes.

~~~
trez
1\. That indeed depends on the way the PDF was created. Sometimes you can
even have a single letter split across different text tokens.

2\. You're right, I meant megabyte.

~~~
kijin
Thanks for the clarifications.

Since a lot of PDFs are badly organized (and I wonder if some programs
deliberately do that to make text extraction difficult), perhaps you could try
to analyze the location of each token on the page and merge the ones that seem
to belong together. That would already be 100x better than most of the free
PDF->text converters out there.
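A rough sketch of that position-based merging, purely illustrative (the
token format here, an `(x, y, width, text)` tuple with `y` measured from the
top of the page, is a made-up stand-in for whatever the parser emits):

```python
def merge_tokens(tokens, y_tol=2.0, x_gap=4.0):
    """Merge positioned text tokens back into readable lines.

    Tokens whose vertical positions are within y_tol are treated as the
    same line; within a line, tokens are joined left to right, inserting a
    space only when the horizontal gap between them exceeds x_gap.
    """
    lines = []
    for tok in sorted(tokens, key=lambda t: (t[1], t[0])):
        for line in lines:
            if abs(line["y"] - tok[1]) <= y_tol:
                break
        else:
            line = {"y": tok[1], "toks": []}
            lines.append(line)
        line["toks"].append(tok)

    out = []
    for line in sorted(lines, key=lambda l: l["y"]):
        text, prev_end = "", None
        for x, y, w, t in sorted(line["toks"], key=lambda tok: tok[0]):
            if prev_end is not None and x - prev_end > x_gap:
                text += " "
            text += t
            prev_end = x + w
        out.append(text)
    return out
```

A real implementation would need font-size-aware tolerances and column
detection, but this is the core idea.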

~~~
trez
We are already close to doing that, but with a really slow parser (that one
can even replace some text in the PDF). Our problem now is understanding
whether developers would rather have better text extraction or other
features like image extraction, etc. Let us know what you would prefer.

~~~
kijin
Image extraction would be cool, but to me getting a readable block of text is
more important.

------
midas
Going from PDF to nicely formatted word doc would be huge for lawyers and
people who do a lot of contract negotiations. It's hard to do well though.

~~~
adsr
Isn't a better solution to get software that supports PDF natively, like
Acrobat, and edit the documents there instead of doing a (poor) translation
to Word?

------
rcfox
I've recently been working on extracting text from PDFs myself. I've found
that `pdftohtml -xml` from the Poppler utils does a decent job of it, and
includes a bounding box for each piece of text. I've submitted a few patches
to their Bugzilla to include the transformation matrix as well as some extra
styling information.
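For reference, a minimal sketch of reading that XML with Python's standard
library; the element and attribute names (`page`, `text`, `top`, `left`,
`width`, `height`) are what Poppler's `pdftohtml -xml` emits, while the
sample content below is invented:

```python
import xml.etree.ElementTree as ET

# Shape of `pdftohtml -xml` output (content made up for illustration).
SAMPLE = """<pdf2xml>
  <page number="1" width="612" height="792">
    <text top="100" left="72" width="120" height="12" font="0">Hello</text>
    <text top="100" left="200" width="90" height="12" font="0">world</text>
  </page>
</pdf2xml>"""

def extract_boxes(xml_text):
    """Return (page, left, top, width, height, text) for each text node."""
    boxes = []
    for page in ET.fromstring(xml_text).iter("page"):
        num = int(page.get("number"))
        for node in page.iter("text"):
            boxes.append((num,
                          int(node.get("left")), int(node.get("top")),
                          int(node.get("width")), int(node.get("height")),
                          "".join(node.itertext())))
    return boxes
```

With the bounding boxes in hand, the kind of position-based merging
discussed above becomes straightforward.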

------
chenster
I googled "converting PDF to text" and "converting PDF to html". Tons of
services already exist out there. Apparently, it's nothing new. How do you
plan to compete? Are you planning to focus on data extraction rather than
conversion?

~~~
trez
The initial plan wasn't about text extraction but text modification. We
noticed we already had something to "give" and created this service.
Following the lean methodology, we hope to get some insights about the next
step.

------
TillE
Neat, but practically who would want to do this with an API rather than
installable software?

~~~
trez
Some languages do indeed have good enough libraries for that; some don't.
Moreover, they are quite slow and not always easy to integrate.

~~~
faddotio
And sending the whole file over the wire is preferable?

What if the document contains sensitive or privileged data?

~~~
cdcarter
Then use a library or a local call. For web-based projects with low security
concerns, or something reminiscent of Yahoo! Pipes, this is pretty cool.

------
ismaelc
Hey I've documented this in Mashape -
[https://www.mashape.com/ismaelc/extract-text-from-pdfs#!docu...](https://www.mashape.com/ismaelc/extract-text-from-pdfs#!documentation)

~~~
trez
thanks!

------
surapaneni
This is similar to what we do at
[http://searchtower.com](http://searchtower.com) , where you can store, view,
index and search the data.

------
architgupta
Do you do OCR for text extraction?

~~~
trez
Not right now, but we are working on it. Would you be interested in this
feature?

~~~
celer
Good OCR with a nice API would be very useful.

------
ra
Nice. Why no paid options? I'm guessing because this was a weekend project.

If so, nice work!

~~~
trez
Thanks. A paid option will come if there is some interest in it. If you are
interested, please let us know in this thread or at info@stamplin.com.

------
alkou
do you use pdftotext internally or something else?

~~~
trez
We have our own parser for more complicated tasks, and use xpdf when speed
is key because it's much faster.

~~~
aksx
I am writing something similar for a client. He needs data in tables
extracted from the PDF. Which language are you using? I wrote two scripts,
one using Python and pdftotext and another using the Ruby pdf-reader gem;
the Ruby one gives each line of the PDF one by one, which is good for
extraction.
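For simple tables, the line-by-line approach can be sketched like this,
assuming `pdftotext -layout`-style output where columns show up as runs of
spaces (the sample rows are invented; real PDFs often need x-position
alignment instead):

```python
import re

def rows_from_layout(lines):
    """Split layout-preserving text lines into table cells.

    Treats two or more consecutive spaces as a column separator, which is
    how `pdftotext -layout` tends to render tabular PDFs. Blank lines are
    dropped.
    """
    rows = []
    for line in lines:
        cells = [c for c in re.split(r" {2,}", line.strip()) if c]
        if cells:
            rows.append(cells)
    return rows
```

This breaks as soon as a cell itself contains double spaces or columns
aren't aligned, which is where bounding-box approaches win.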

