
Show HN: Parsr – A toolchain to transform documents into usable structured text - pierre
http://par.sr
======
hbcondo714
It appears images and PDFs are the currently supported document types. If so,
is there an opportunity to support web pages? We have quite a few verbose
legal documents consisting of auto-generated HTML that average 100 pages. A
tool like this would be helpful in automatically dividing them by their
respective section headings. It would also be beneficial to detect / remove
extra info such as page numbers. Thanks for posting this!

~~~
pierre
We plan to support HTML as an input (it's in our backlog). If possible, you can
DM me one of your documents and we can make sure it works.

~~~
bhl
Is the HTML processed using a DOM / visual parser? I’m looking for something
like Readability or Diffbot that can extract rich text / markdown from web
pages.

~~~
PenguinCoder
Is there some reason that pandoc[1] won't work for your use case?

[1][https://pandoc.org/](https://pandoc.org/)
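
For what it's worth, a minimal sketch of that route through the pypandoc
wrapper (pandoc itself has to be installed; the file names are just
placeholders):

    # Rough sketch of converting an auto-generated HTML legal document to
    # markdown via pypandoc; file names are placeholders.
    import pypandoc
    
    markdown = pypandoc.convert_file("legal-doc.html", "md")
    
    with open("legal-doc.md", "w", encoding="utf-8") as out:
        out.write(markdown)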

~~~
bhl
I'm asking about a visual parser because I want to grab content from any web
page and fit it approximately to a pre-defined schema using the underlying DOM
and visual cues.

------
pininja
This seems like a super useful way to package and deliver this kind of
toolchain! I've been looking for PDF parsers on and off for the last couple of
years, and have found it challenging to get most tools set up for data
extraction and analysis.

This one packages an off-the-shelf version into a Docker image and starts a
GUI website locally. Looking forward to using this more!

------
ZeroCool2u
Hmmm, looks useful. The list of dependencies is basically a who's who of
document-parsing tools. Is this basically just a unified interface that wraps
them all up into an API?

~~~
pierre
This is an interface that wraps multiple document tools into a common API,
allowing you, for example, to switch your OCR from Tesseract to Google Cloud
Vision or Abbyy without changing much of your code. More cloud vendors / OCR
engines will be supported in the near future.

Beyond just wrapping the OCR, there is also a document reconstruction /
cleaning pipeline that takes care of reading order, heading detection and
classification, table detection and reconstruction, and so on, so that you get
as clean and usable a Text / JSON output as possible.
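
As a rough sketch of what that looks like in practice (the endpoint path and
config keys below are illustrative assumptions, not necessarily the exact ones
in the repo), switching OCR engines is meant to be a one-field change in the
config you send along with the document:

    # Rough sketch only: endpoint path and config keys are illustrative
    # assumptions, not the documented Parsr API -- check the repo for details.
    import json
    import requests
    
    config = {
        "extractor": {
            "pdf": "pdfminer",
            # Swapping the OCR engine is meant to be a one-field change, e.g.
            # "tesseract" -> a cloud OCR service, with the rest of the pipeline
            # (reading order, headings, tables) untouched.
            "ocr": "tesseract",
        },
    }
    
    with open("contract.pdf", "rb") as f:  # any input document
        resp = requests.post(
            "http://localhost:3001/api/v1/document",  # assumed local Parsr server
            files={
                "file": ("contract.pdf", f, "application/pdf"),
                "config": ("config.json", json.dumps(config), "application/json"),
            },
        )
    resp.raise_for_status()
    print(resp.text)  # typically a handle/id to poll for the cleaned Text / JSON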

------
staticautomatic
I was really excited to try this until I saw that the only extraction methods
are pdfminer, finereader, and tesseract. I was hoping there was something you
rolled on your own. I've been trying for a long time to parse tables (and
nested tables) but the available extractors seem to only work on really
simple, idealized tables with virtually no skew or warping. The best I've
found so far is Amazon's Textract, but it's not that great either. Alas, every
attempt I've ever made at generalized table extraction has quickly regressed
to templates.

~~~
udayrddy
Shameless plug: would you like to join the club of happy customers at
[https://extracttable.com](https://extracttable.com)? It's an API to extract
tabular data from images and PDFs without worrying about coordinates.

A comprehensive competitor comparison, along with outputs, is available at
[https://extracttable.com/compare.html](https://extracttable.com/compare.html)

~~~
iudqnolq
Suggestion: use a simpler synonym for idempotent on your high level pricing
overview. Something like "automatically makes sure you aren't charged for
duplicates"

------
anilgulecha
Are there any example inputs and outputs to quickly see what's possible?

~~~
pierre
There are no input / output examples in the repo yet to let you quickly see
what is possible. We should definitely add them.

For a quick test you can either run the jupyter notebook

[https://github.com/axa-group/Parsr/tree/master/demo/jupyter-notebook](https://github.com/axa-group/Parsr/tree/master/demo/jupyter-notebook)

or run the Docker image with the UI and just drag and drop documents / play
with the configuration:

    docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest

------
kresten
Apache Tika is a powerful text extraction engine.

Why this over Tika?

~~~
six2seven
That's a really good question! I've been using Tika for quite some time as the
Swiss-army knife for text extraction.

They don't even seem to be using Tika under the hood as one of the bundled
tools. Does anyone have any comparisons?

------
udayrddy
That looks like a great, comprehensive toolkit for data extraction. I
understand the bundle is licensed under Apache; I'm curious what the
requirements / rules are for including a service like Abbyy.

We at extracttable.com (extracting tabular data from images and PDFs over an
API) are interested in contributing and integrating our service into the
bundle.

------
Tade0
My old beater is insured with AXA. I didn't know they had any open source
projects going on.

------
all-out-of-hope
Very cool, amazing this is OSS.

