

Ask HN / Review my startup: DocuHarvest - cemerick

DocuHarvest extracts data from documents (PDFs only for now) in a way that is accessible to nontechnical users and economical for small workflows or one-off jobs.

https://docuharvest.com

The original launch announcement is on my blog here, which provides more background:

"Getting valuable data out of documents should not require an I.T. staff, outside consultants, building or buying software, or an up-front investment of hundreds or thousands of dollars, regardless of how many documents and how much data is involved."

http://muckandbrass.com/web/x/CwBi

To answer probably the three most common questions:

1. Yes, we have an HTTP API coming, probably along with client libraries for some subset of {Java, Python, Ruby, PHP, C#/.NET, ...}.

2. More job types are incoming. As noted in the announcement and on the site's front page, imaging-related jobs are up next. Lots more in the works after that as well.

3. DocuHarvest is largely written in Clojure, and currently uses CouchDB as a backend. More info here: http://groups.google.com/group/clojure/browse_frm/thread/c1c11390caac3dc

SMB is my initial focus (intentionally wide for now). Good potential verticals include legal & public records, medical, finance & accounting. That's all up for grabs depending on how forthcoming job types are received and by whom.

Feedback and suggestions are most welcome, either here or via the feedback boxes on the DocuHarvest site, twitter messages @docuharvest or @cemerick, or you can email help@docuharvest.com or cemerick@snowtide.com.

Thanks!
======
jashkenas
With all due respect, if you're looking to extract text and images from
documents, there are free and open-source options. DocuHarvest looks like a
great service for an end-user, but if you're a programmer, I recommend you
take a peek at Docsplit, an open-source project of ours that extracts text,
images, and metadata from documents, including non-PDFs.

<http://documentcloud.github.com/docsplit/>

Or the Python port:

<http://github.com/anderser/pydocsplit>

It's a thin wrapper on top of a number of excellent open-source projects that do
the real work:

* OpenOffice / JODConverter, to convert ".doc", ".ppt", ".xls", ".rtf" into PDFs.

* GraphicsMagick and Ghostscript, to render PDFs into images of any size and format.

* Apache PDFBox, to extract UTF-8 plain text and metadata from PDFs.
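
As a rough illustration of how the pieces listed above fit together, here is a sketch in Python that only names which tool would handle which stage for a given file. The function name `plan_steps` and the stage labels are made up for this example; it does not call Docsplit or any of the underlying tools, and the real project may route files differently.

```python
from pathlib import Path

# Extensions the list above says go through OpenOffice / JODConverter first.
OFFICE_EXTENSIONS = {".doc", ".ppt", ".xls", ".rtf"}


def plan_steps(filename):
    """Return the tool stages a document could pass through (hypothetical helper).

    This only names the stages; it does not invoke any of the tools.
    """
    ext = Path(filename).suffix.lower()
    steps = []
    if ext in OFFICE_EXTENSIONS:
        # Office formats are converted to PDF before anything else happens.
        steps.append("OpenOffice / JODConverter: convert to PDF")
    elif ext != ".pdf":
        raise ValueError(f"unsupported extension: {ext!r}")
    # Once there is a PDF, images and text can be pulled out of it.
    steps.append("GraphicsMagick + Ghostscript: render pages to images")
    steps.append("Apache PDFBox: extract UTF-8 text and metadata")
    return steps


if __name__ == "__main__":
    for name in ("report.doc", "scan.pdf"):
        print(name, "->", plan_steps(name))
```

The point of the sketch is just that everything funnels into a PDF first, after which image rendering and text extraction are separate operations on that PDF.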

We're thinking about adding Tesseract-based OCR into the text-extraction, but
it's still a little difficult to figure out how to package that up portably
for multiple platforms. Full Disclosure: Docsplit is part of DocumentCloud, a
non-profit project funded by the Knight Foundation to help journalists work
with primary source documents.

~~~
cemerick
I'm guessing you didn't read the first sentence in my text opening the thread.
:-)

DocuHarvest is for those that don't know what open source is, or those that
don't want to pay for someone to monkey with it enough to make it work in
their organization.

FWIW, open source solutions have fairly poor support for newer revs of various
document formats.

~~~
jashkenas
It's a great service to offer. I just figured that being HN, the open-source
alternatives should have their place at the table, for the interested, at
least in a comment. Good luck with DocuHarvest.

------
bdickason
This is a cool idea (assuming it works)

I think that the 'What does docuharvest do' section of your video was way too
long! Just give 3 examples then show the cluttered bit :) And go a bit less
technical. Instead of "You often want to have the data held within the
documents in a more useful form" how about "All the data is spread across a
ton of different formats. Word, PowerPoint, PDF... it's hard to keep track of!"

Again "typical docuharvest workflow" makes this sound BORING and ENTERPRISE!
This could be a great tool for anyone to use, don't relegate it to the
cubicles!! :) tell me what it does like "Let's see how you can save time with
docuharvest" or "Let me show you how easy it is to make all your documents
searchable" or whatever point you're trying to get across :)

'workflow' is not fun :) Or even friendly!

Hope this helps.

~~~
cemerick
Thanks for the feedback -- maybe I should just show the demo straight off? I
felt like some background was necessary, but I certainly don't want to bore
anyone away! :-)

------
moconnor
I liked the free trial, but garbling part of the text is not a good idea. I
glanced through the text to see what sort of output I got and thought "oh,
that's rubbish, it's made tons of mistakes". It was only by chance that I
actually read the bit at the top saying parts were garbled on purpose.

Why not just give 50% of the text and cut off with a message saying that's all
you get in the trial? It gives a better impression IMO.

Good luck with the startup!

~~~
cemerick
Good point, no one reads documentation / site messages. :-)

Cutting off content halfway through with a message at the end would work well
for text extraction and other long-format sorts of data, but what about form
data extraction and such? That's where the munging of content started really,
and I'm not sure what a better solution would be for short key/value pairs.
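
To make the two cases concrete, here is a minimal sketch in Python, not how DocuHarvest actually works. It assumes long text is cut off at a fraction with a notice at the end, and that short key/value results keep only the first few real values and show an explicit placeholder for the rest. The function names, the `keep` parameter, the placeholder wording, and the sample field names are all hypothetical.

```python
TRIAL_NOTICE = "[Trial output ends here]"    # hypothetical wording
PLACEHOLDER = "[omitted in trial]"           # hypothetical wording


def truncate_for_trial(text, fraction=0.5):
    """Return the first `fraction` of the text followed by a visible notice."""
    cutoff = int(len(text) * fraction)
    return text[:cutoff] + "\n" + TRIAL_NOTICE


def redact_for_trial(fields, keep=1):
    """Keep the first `keep` values and replace the rest with a placeholder.

    Keys are preserved so the user can still see which fields were found.
    """
    redacted = {}
    for index, (key, value) in enumerate(fields.items()):
        redacted[key] = value if index < keep else PLACEHOLDER
    return redacted


if __name__ == "__main__":
    print(truncate_for_trial("abcdefgh"))
    print(redact_for_trial({"name": "Ada", "dob": "1815-12-10", "city": "London"}))
```

Either way the user sees an obvious marker where content was withheld, instead of text that looks like an extraction error.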

~~~
loumf
Combine the mangling with the message. Meaning, change the real result to
"Result omitted in free version"

------
ams6110
If I'm non-technical, the term "job" is throwing me a bit. I'm thinking job as
in employment, which doesn't make sense in the context. Maybe call it a
"Harvest". "Harvest a document in seconds" instead of "Start a new job in
seconds." The "New Job" button could be labeled "New Harvest" etc.

~~~
cemerick
Yeah, I've struggled with the "job" terminology quite a bit, having used
"run", "batch", and "task" in prior iterations. (Whoa, think I'm a backend dev
or something? ;-)

I'll mull over the "Harvest" idea. Seems like a good way to provide a verb for
people to use to reference the site/process, e.g. "I'll just harvest these
docs" :: "I googled and didn't find anything".

------
mediaman
This is a more complex task, but would you consider engineering an OCR data
extractor from image PDFs? I can tell you there are thousands of companies
that run into the problem of converting columns of data from paper-only
documents or fax machines into usable Excel data. And it's usually in one-off
settings, where not having to buy a software package would be hugely
appealing. Spending $0.10 - $0.25 per page would be far better than the labor
cost of doing the same thing.

~~~
cemerick
This sort of thing is absolutely on the drawing board. There are a lot of
unknowns here for us: in particular, which OCR package to use -- none I've
used are satisfactory IMO -- as well as figuring out the paper problem, which
we don't want to get involved in directly. I'm sure a complementary
partnership on the latter will make sense.

Anyway, follow @docuharvest or join the mailing list if you want to keep up
with goings-on. I think you'll be pleased over time. :-)

EDIT: note that our pricing is per _document_ , not per page. I'd hope that
keeping the simplicity of that model remains possible semi-forever, regardless
of the sort of processing DocuHarvest offers in the future.
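
To put rough numbers on the per-page vs. per-document difference, here is a back-of-the-envelope sketch in Python. The per-page range is the $0.10 - $0.25 from the comment above. The per-document rate is an assumption, taken from the "$10 for 100 PDF forms" example elsewhere in this thread, and the page count per document is invented purely for illustration.

```python
# All amounts are in cents to avoid floating-point rounding.
PER_PAGE_LOW_CENTS = 10    # $0.10 per page, from the comment above
PER_PAGE_HIGH_CENTS = 25   # $0.25 per page, from the comment above
PER_DOCUMENT_CENTS = 10    # assumption: $10 for 100 documents


def batch_cost_cents(units, rate_cents):
    """Cost of a batch billed per unit (a unit is a page or a document)."""
    return units * rate_cents


def dollars(cents):
    """Format an integer number of cents as a dollar string."""
    return f"${cents / 100:.2f}"


if __name__ == "__main__":
    documents = 100
    pages_per_document = 3  # hypothetical average, not from the thread
    total_pages = documents * pages_per_document

    print("per page (low): ", dollars(batch_cost_cents(total_pages, PER_PAGE_LOW_CENTS)))
    print("per page (high):", dollars(batch_cost_cents(total_pages, PER_PAGE_HIGH_CENTS)))
    print("per document:   ", dollars(batch_cost_cents(documents, PER_DOCUMENT_CENTS)))
```

With per-document pricing the total does not depend on `pages_per_document` at all, which is the simplicity being described.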

------
bryanh
First of all, great presentation. I looked at the site and your phrasing
combined with the domain name gave me an instant idea of what you do. Maybe
some examples of use-cases would help me understand why I would use it, but I
suspect that is more me not needing that service.

I love the interface that lets you just jump right in without signing up
first. Trial run with mangled results is appealing.

Are there any direct competitors for you?

~~~
semanticist
There's certainly a good amount of text extraction work in the recruitment
industry - that's what the company I work for does and we have several
competitors.

In those cases, though, you're talking about taking CV/resume files in Word,
PDF, HTML, whatever and extracting structured data for populating a
recruitment company's databases. It's a bit more advanced than the kind of
thing being done here, but it also requires special integration work to match
the databases and handle things like candidate deduplication.

CV/resume parsing isn't going to be an industry that you can get into unless
you can hire some good sales guys and are ready to spend a lot of time
figuring out how to integrate with a dozen different database products.

~~~
retube
Hmm that's extremely interesting. My company has developed technology to
extract structured data from business websites, data types such as postal
addresses, contact details (person name, phone, fax, email, job title,
headshot), company logo, products (product name, prices & descriptions), food
and drink on menus, business description, keywords/tags and so on. There's a
fair amount of text and language analysis, machine learning, clustering
algorithms. We can process a million sites (~100 million pages) in a few days.

One of the pivots we were considering for this was Word/PDF docs, specifically
CVs. We figured that these big recruitment companies and websites must have to
process millions of loosely structured CV documents, and would need to extract
specific data from these. I don't know what they do now, but I reckon we could
do it pretty well.

~~~
semanticist
What they do now is use my employer's software or one of our competitors.

There are several people already moving in this market, so it would be tricky to
enter. One of the big things is going to be integrating with recruiters'
databases - that's where most of our time is spent these days. That, and
widening our language support.

------
zaveri
<https://docuharvest.com>

------
qq66
Doesn't Adobe have a product that takes care of the Interactive PDF form part?
I think one of the products in the LiveCycle family does exactly this. Pricing
is probably not as variable as yours though.

~~~
cemerick
Yup, they've got a variety of solutions. Of course, the point is that that's a
pretty huge up-front cost compared to (for example) $10 to get the data from
those 100 PDF forms you've got in a folder somewhere.

The vast majority of the functionality we'll be offering has no direct
parallel to any existing software package. That will become very clear as we
flesh out the types of jobs that are available.

