

Ask HN: Is there anything similar to DocumentCloud for non-journalists? - bglenn09

I'm looking to store PDF documents in the cloud and be able to search them. DocumentCloud looks perfect, but I'm not in journalism, and I'm having trouble finding an obvious alternative. Do you guys like any other services, or know of a simple way to do this with NoSQL? I was looking into using a hosted MongoDB service, but I can't find any information on searching binary data. Thanks for any pointers.
======
Skywing
You're not going to be able to simply upload a PDF and search for text in the
raw file data. The text inside a PDF is usually compressed or encoded, so it
isn't readable as plain bytes. You'll have to either use a tool to extract the
embedded text, or perform OCR on the document if it's image-only. A really
good tool that I have used before is Aspose. If you are allowing users to
upload these PDFs, you'd also need some sort of distributed task queue,
because the PDF processing is not something you want the user to have to wait
on. I've used RabbitMQ for this and haven't had many issues. Once you have
extracted the text (or OCR'd the document), you can store the text as well as
the original document in a database like MongoDB. You would probably also
benefit from a full-text search engine, like ElasticSearch.
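
To make the shape of that pipeline concrete, here is a rough sketch using only the Python standard library. Everything in it is a stand-in and an assumption on my part: `queue.Queue` plays the role of the RabbitMQ-backed task queue, a dict plays MongoDB, a tiny inverted index plays the full-text search engine, and `extract_text` is a hypothetical placeholder where a real PDF text extractor or OCR step would go. It is not how any of those products work internally, just the flow of upload, background processing, storage, and search.

```python
import queue
import re
import threading
from collections import defaultdict

# All names here are hypothetical stand-ins, not a real service's API.
documents = {}             # doc_id -> {"raw": bytes, "text": str}; stands in for MongoDB
index = defaultdict(set)   # word -> set of doc_ids; stands in for a search engine
jobs = queue.Queue()       # stands in for a distributed task queue


def extract_text(raw):
    """Placeholder: a real version would run a PDF text extractor or OCR."""
    return raw.decode("utf-8", errors="ignore")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def worker():
    """Process uploads in the background until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        doc_id, raw = job
        text = extract_text(raw)
        # Keep both the original bytes and the extracted text.
        documents[doc_id] = {"raw": raw, "text": text}
        for word in tokenize(text):
            index[word].add(doc_id)
        jobs.task_done()


def start_worker():
    """Start one background worker thread and return it."""
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def upload(doc_id, raw):
    """Queue the document and return immediately; the user does not wait."""
    jobs.put((doc_id, raw))


def search(query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))
```

To try it, call `start_worker()`, `upload()` a couple of documents, wait with `jobs.join()`, and then call `search()`. The point of the queue is that `upload()` returns right away while the slow extraction happens elsewhere. In a real setup you would swap each stand-in for the actual service.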

