
Show HN: Convert PDF files into structured data - chezmo
https://docparser.com?utm_source=hn&utm_medium=email&utm_campaign=showhn
======
phonon
Is this using something like
[https://github.com/creatale/node-fv](https://github.com/creatale/node-fv) on
the backend, which can turn various imperfectly scanned forms into data after
you prepare a schema? Or is it a more simplistic "mark hotspots" approach,
which won't work well (or at all) if the scan is not perfectly aligned/sized
with the original?

~~~
chezmo
We do position-based text extraction. However, we add an 'unpaper' function
which tries to correct misalignments and improve the quality of the scan.

~~~
ComodoHacker
What OCR library do you use? What languages does it support?

~~~
chezmo
For scanned images we use
[https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract).
For text-based PDFs we pull the text directly from the file and all languages
are supported.

------
darklajid
I'm working for a company that does DMS Things™ and processing incoming PDFs
(for mailroom applications or invoice processing) is one of our core projects.
Given that this is the closest submission to my day job ever, I'm really
curious about your project.

Your online presentation looks great. The 'layout designer' if you will, the
'where are important things' screens look slick.

I do wonder how you assign those settings to incoming PDFs though. Is it the
user's responsibility to say 'This PDF? I told you how/from where to extract
data before'? Or do you have some classification system that stuffs the PDFs
into buckets (say, by vendor) and templates are assigned to those?

How many of the PDFs that you encounter contain text (vs. scanned/image-only
documents)? For us, while the former are certainly rising in popularity, the
latter are still far too common/more prevalent.

Our solution is mostly on-premise so far (online offerings are the current
focus of development) and we're quite OCR heavy, using a bunch of non-free
engines and voting between their results. We also have dynamic templates, allowing
rule sets containing rules like 'The total amount is a number satisfying
format X, usually right or below a string containing "Total"' (and our invoice
processing solution basically comes with rules like these preconfigured for
various countries).
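
A rule of that kind could be sketched roughly as follows. This is only an illustration, not any vendor's actual rule engine: the `Word` type, the `find_total` helper, the amount pattern standing in for "format X", and the distance thresholds are all assumptions made up for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word
    y: float  # top edge of the word, increasing downward


# Hypothetical "format X": an amount with two decimals, e.g. 108.00 or 1,234.56.
AMOUNT_RE = re.compile(r"\d{1,3}(?:[.,]\d{3})*[.,]\d{2}|\d+[.,]\d{2}")


def find_total(words, anchor="total", max_dx=300.0, max_dy=40.0, row_tol=3.0):
    """Return the amount nearest to a word containing `anchor`,
    looking only to the right of it (same row) or below it."""
    best = None
    for a in words:
        if anchor not in a.text.lower():
            continue
        for w in words:
            if w is a or not AMOUNT_RE.fullmatch(w.text):
                continue
            dx, dy = w.x - a.x, w.y - a.y
            right_of = abs(dy) <= row_tol and 0 < dx <= max_dx
            below = 0 < dy <= max_dy and abs(dx) <= max_dx
            if right_of or below:
                dist = abs(dx) + abs(dy)
                if best is None or dist < best[0]:
                    best = (dist, w.text)
    return best[1] if best else None
```

Because the rule is relative to the anchor word rather than to page coordinates, it keeps working when the total moves up or down with the number of line items. A substring anchor like this would also match "Subtotal", so a real rule set would need something stricter.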

Are your templates using absolute coordinates/regions? You mention your
'unpaper' feature - do you fix/deskew both images and regions for misaligned
pages?

(I won't mention any company/product names, because I don't want to advertise
or hijack the thread. Nor do I need to connect my HN account ~directly~ with
my employer)

~~~
alvin0
what does DMS stand for? or is "DMS Things" a tm?

~~~
whitingx
DMS = Document Management System ツ

------
evolve2k
Looks cool, nice work.

In your FAQ it says:

There are no special requirements. There is nothing to install and you don't
need any technical know-how for setting up and using >>> mailparser.io.<<< No
coding is required.

Just pointing out a potential copy-paste error. Otherwise, if it's meant to say
mailparser, better explain what that is.

~~~
chezmo
Thanks for the heads up, I just fixed it! mailparser.io is my other product
which I launched a couple of years ago. Customers kept asking for document
parsing capabilities so I thought it would be a good idea to start Docparser.
For the FAQ I copied some text and apparently forgot to properly proofread it
:)

------
caseyf7
The Zapier integration is why I'm going to try this one.

------
unfortunateface
Save yourself a lot of support time/costs and remove the 'free' option. Your
homepage sells the product well and shows its benefits. From the feedback
you've already received it looks like you are providing more than $50 worth of
value.

------
sixhobbits
I'm always surprised by how well `pdftotext -layout` works for even
complicated-looking PDFs. It has been better than most specialised (free) web
services I've tried.

------
frabcus
Looks really good!

A quick advert for PDF Tables [https://pdftables.com/](https://pdftables.com/)
- we're a bit more API-focussed.

------
petra
Depending on how well this works, this could be extremely useful for the
electronics industry, where everything is locked in a PDF - allowing someone
to build an in-depth research tool that would allow engineers to find the
optimal part (using complex queries), from any manufacturer, very fast - far
from the broken situation of today, where engineers spend tons of time
researching, and often don't get close to the ideal.

------
Kinnard
I wonder how their software works. I think there's untapped potential in
Adobe's PostScript.

~~~
lovelearning
The file format itself has all the information required to extract text from a
rectangular area. Frameworks like PDFBox and iText have supported it for a
long time.

It's up to users to define what the rows and columns are. In most programmatically
generated PDFs, this is easy. But in manually typeset PDFs, there are lots of
edge cases like variable row heights or column widths, slanted table borders,
stuff like that.
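
To illustrate why the easy case is easy, here is a minimal sketch of row detection, assuming the PDF library has already produced words as `(text, x, y)` tuples. The `group_rows` name and the `y_tol` tolerance are made up for the example and are not part of PDFBox or iText.

```python
def group_rows(words, y_tol=3.0):
    """Cluster (text, x, y) tuples into rows.

    A word joins the current row when its y is within y_tol of the row's
    first word; otherwise it starts a new row. Cells are ordered left to right.
    """
    rows = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if rows and abs(y - rows[-1]["y"]) <= y_tol:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]
```

A fixed tolerance like this is exactly what breaks on variable row heights or slanted borders, where no single `y_tol` value separates the rows cleanly.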

~~~
chezmo
That's right! The user defines a rectangular area and we then extract the raw
text based on the position. For table extraction we use tabula.java under the
hood.
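
For readers wondering what extraction by position looks like, here is a minimal sketch. It is not Docparser's code: the `Word` tuple and `text_in_rect` helper are hypothetical, and the word boxes are assumed to come from whatever PDF library is in use.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def text_in_rect(words, rect):
    """Join the words whose centre point lies inside rect = (x0, y0, x1, y1),
    in reading order (top to bottom, then left to right)."""
    rx0, ry0, rx1, ry1 = rect
    hits = [
        w for w in words
        if rx0 <= (w.x0 + w.x1) / 2 <= rx1 and ry0 <= (w.y0 + w.y1) / 2 <= ry1
    ]
    hits.sort(key=lambda w: (round(w.y0), w.x0))
    return " ".join(w.text for w in hits)
```

Testing the centre point rather than the whole box is a design choice: a word that slightly overlaps the rectangle's edge is still picked up, which gives some slack for small misalignments.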

------
camel_Snake
Tried giving this[0] a shot but even just a single page was too large for the
4MB limit.

[0]
[https://archive.org/details/averageweightofm41fult](https://archive.org/details/averageweightofm41fult)

------
jamiecarruthers
I gave it a go and couldn't get useful data extracted. I sent a support query
with attached PDFs.

------
markdown
Your pricing tables mention webhooks but the FAQs below them don't explain
what those are.

------
ruler88
nice! I wish I knew about this earlier, I had built a version of this on my
own to solve this very problem.

------
mordae
No source? No, thanks!

