
Show HN: An API to extract texts from images and PDF files - smougel
http://www.stamplin.com
======
zdw
What's the benefit to using this over `pdftotext` and/or `pdfimages | convert
| tesseract`?

~~~
jingo
The benefit is stamplin.com gets insight on what people are viewing and
reading. They get to see what the user sees. They can compile a database and
use or sell that information to be used for marketing purposes.

Also, it's an "API" (looks more like a URL pointing to a CGI program to me, but
whatever). APIs are "cool" and "fun", while running local programs that you
have control over is old and boring and not the future of computing.

~~~
trez
Our API is quite new, and we understand it doesn't offer outstanding value
for everybody yet, as it targets ease of use for the moment. The next release
will add more advanced features.

As for how we use your data, our privacy policy will clarify that.

------
taf2
[http://www.stamplin.com/api/](http://www.stamplin.com/api/) returns 403 when
clicking on the API docs after confirming an account via email link

~~~
trez
Sorry, the correct URL is
[http://www.stamplin.com/api/docs/](http://www.stamplin.com/api/docs/)

------
angersock
I come bearing gifts, if anyone would like to host some of this themselves.

This follows the API documented by Stamplin (minus the throttling errors)--it
does not currently do the OCR, but as mentioned elsewhere by zdw you can
probably get tesseract to get you like 80% of the way there. If you wanted to
use that, you'd likely just replace the hacky `pdftotext` callout with your
preferred toolchain.
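
For the OCR case, here is a rough sketch of what such a replacement might look like. It assumes the `pdfimages`, ImageMagick `convert`, and `tesseract` command-line tools are installed. The `ocr_pdf` helper and its `run` parameter are made up for illustration; they are not part of Stamplin's API or of the route below.

```ruby
require 'tmpdir'

# Hypothetical helper: OCR a scanned PDF by shelling out to pdfimages,
# convert, and tesseract. `run` is injectable so the pipeline can be
# exercised without the tools installed. Returns nil if any step fails.
def ocr_pdf(pdf_path, lang: "eng", run: ->(*argv) { system(*argv) })
  Dir.mktmpdir("ocr") do |dir|
    # pdfimages writes img-000.ppm, img-001.pbm, ... under the given root.
    run.call("pdfimages", pdf_path, File.join(dir, "img")) or return nil

    images = Dir.glob(File.join(dir, "img-*")).sort
    images.map do |image|
      tiff = "#{image}.tif"
      base = "#{image}.ocr"
      # ImageMagick 7 names this binary `magick`; `convert` is the legacy name.
      run.call("convert", image, tiff) or return nil
      # tesseract appends ".txt" to the output base name.
      run.call("tesseract", tiff, base, "-l", lang) or return nil
      File.read("#{base}.txt")
    end.join("\n")
  end
end
```

Commands are passed as argument lists, so no shell is involved and odd file names cannot inject anything.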

You'll need Ruby, Sinatra, and the Xpdf tools, I believe.

Dual-licensed under the AGPL, BSD, and WTFPL licenses. idklol.

The code:

    
    
      require 'sinatra'
      require 'json'
    
      use Rack::Logger
    
      post '/extracttext' do
        begin
          # Nothing uploaded: respond with 204 No Content.
          halt 204 if params["file"].nil?
    
          # Accepted for API compatibility; pdftotext does not use them yet.
          type = params["type"] || "text"
          lang = params["lang"] || "en"
    
          tmpfilename = params["file"][:tempfile].path
          outfilename = "#{tmpfilename}.txt"
    
          # Pass arguments as a list (no shell) and name the output file
          # explicitly rather than relying on pdftotext's default naming.
          ok = system("pdftotext", tmpfilename, outfilename)
          File.delete(tmpfilename)
          halt 500 unless ok
    
          lines = File.read(outfilename).split("\n")
          File.delete(outfilename)
    
          content_type "application/json"
          {"text" => lines}.to_json
        rescue StandardError
          halt 500
        end
      end
    

EDIT:

For God's sake run this in a jail and only on an internal network!
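
To poke at a self-hosted copy, something like the following should work as a client. This is only a sketch: it assumes the server is running locally on Sinatra's default port (4567), and `build_extract_request` is a hypothetical helper name, not something from Stamplin's docs.

```ruby
require 'net/http'
require 'uri'
require 'json'

# Hypothetical helper: build (but do not send) a multipart POST for the
# /extracttext route. base_url is an assumption about where the server runs.
def build_extract_request(io, filename, base_url: "http://localhost:4567", lang: "en")
  uri = URI.join(base_url, "/extracttext")
  req = Net::HTTP::Post.new(uri)
  req.set_form(
    [["file", io, { filename: filename }], ["type", "text"], ["lang", lang]],
    "multipart/form-data"
  )
  [uri, req]
end

# Usage (needs the server to be up):
#   File.open("scan.pdf", "rb") do |f|
#     uri, req = build_extract_request(f, "scan.pdf")
#     res = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(req) }
#     puts JSON.parse(res.body)["text"] if res.code == "200"
#   end
```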

------
rpedela
I like the concept and it is a good start. Pulling text from PDFs is
especially painful. I think the output format needs improvement. It is just a
large array of strings. It seems like the strings are sometimes a single line,
and sometimes not. My particular use case is extracting raw data from a PDF. I
would like to see more structure in the output. For example, knowing where
newlines, tabs, etc. are located would be very helpful for parsing raw data.

Here is the PDF I used to test:
[https://www.gov.uk/government/uploads/system/uploads/attachm...](https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/211934/sanctionsconlist.pdf)

Is there a technical reason for the 1-2MB limit or is it arbitrary?
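
To make the output-format point concrete, here is a sketch of the kind of structure I mean. It assumes the text comes from something like `pdftotext -layout`, with pages separated by form feeds and columns separated by a tab or by two or more spaces. The `structure_text` name and the JSON keys are invented for illustration, not anything the API returns today.

```ruby
require 'json'

# Hypothetical post-processing step: turn layout-preserving text into
# pages -> lines -> cells, keeping each line's leading indentation.
def structure_text(text)
  text.split("\f").each_with_index.map do |page, i|
    lines = page.split("\n").map do |line|
      {
        "indent" => line[/\A */].length,             # leading spaces
        "cells"  => line.strip.split(/\t| {2,}/)     # column-ish chunks
      }
    end
    { "page" => i + 1, "lines" => lines }
  end
end

puts JSON.pretty_generate(structure_text("Name   Country\n  Smith  UK\fnext"))
```

Splitting on runs of spaces is a heuristic and will misjudge some tables, but even this much would make raw data far easier to parse than a flat array of strings.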

~~~
trez
Thanks for your comment!

That's something we can provide pretty easily, and we will try to include it
in our next release. If you want us to help you with your specific problem,
please send us an email at info@stamplin.com.

The limit has been set to prevent our server from crashing, as we do not have,
for the moment, the financial capability to support a massive server farm.
Again, if this limit prevents you from using our API, we might raise it
if you ask by email.

------
mappum
The OCR is really useless. I tested it with some reddit "advice animal" memes
(because there is a need for transcriptions). You would think that text is
pretty simple and easy, but the output I got was like:

    
    
        /\n\nnmrs wn\ufb02qyi mm mm\nTlIIEI\ufb02|\ufb02llllM\u2018l co

~~~
trez
Sorry that didn't work properly for you. We are working on improving the
quality of our OCR results. Could you please send the file you used to get
this useless result to info@stamplin.com?

------
gkoberger
The upgrade button doesn't work, and nobody is going to hover long enough to
see the "Not Available Yet" title. And the current 10 requests isn't even
enough to test with.

I'm excited to try this.. so figure out a way to take my money soon.

------
gnosis
This looks nice except for having to depend on your servers as a middle man.

Any chance you could release the code as Free or open source so that its users
can use it standalone on their own machines?

~~~
trez
That's not planned at the moment, but if we can't find a way to monetize it,
we will do it for sure.

------
RivieraKid
Why would someone want to use an API instead of a library?

~~~
trez
Some languages might not have an appropriate library, and some people might
not want heavy processes running on their devices (mobiles). We also think
it's easier to use, as there is nothing to install. It mainly depends on your
case.

~~~
RivieraKid
I agree that there might be situations where it can be useful, but:

1) Mobiles have pretty good CPUs. I think uploading and waiting for response
would be slower and less reliable.

2) If the mobile user doesn't have an internet connection, the app won't work.

3) As a developer, I would be dependent on an external service that could
stop working someday.

------
rpedela
Can I assume API keys are on the roadmap? I don't particularly like using my
username and password.

~~~
trez
Yes, we'd like to increase security with each release. It should be available
in one of the next releases.

~~~
rpedela
Great!

------
antrover
Nice. Are you using the Tesseract OCR lib at the core of the extraction?

~~~
trez
Yes, we are.

------
it_learnses
Any custom requests? Let us *know.

~~~
trez
thx, I am gonna fix that

~~~
trez
fixed

------
smougel
Any feedback is welcome.

~~~
sebg
Looks good - does it do data tables? That's a big issue and something I've
heard about (run into) many times...

~~~
trez
Thanks for your comment! We would really appreciate it if you could explain
the problems you faced in more detail. I am sending you an email to discuss
that, if that's OK with you.

~~~
sebg
responded to your email. good luck.

