Show HN: Parsr – A toolchain to transform documents into usable structured text (par.sr)
182 points by pierre 8 months ago | hide | past | favorite | 22 comments

It appears images and PDFs are the currently supported document types. If so, is there an opportunity to support web pages? We have quite a few verbose legal documents that consist of auto-generated HTML averaging 100 pages. A tool like this would be helpful for automatically dividing them into sections by heading. It would also be beneficial to detect and remove extra info such as page numbers. Thanks for posting this!

We plan to support HTML as an input (it's in our backlog). If possible, you can DM me one of your documents and we can make sure it works.

Is the HTML processed using a DOM / visual parser? I’m looking for something like Readability or Diffbot that can extract rich text / markdown from web pages.

Is there some reason that pandoc[1] won't work for your use case?


I asked about a visual parser because I want to grab content from any web page and fit it approximately to a pre-defined schema using the underlying DOM and visual cues.

How complicated is the structure of these HTML pages? This sounds like something that could be accomplished with a limited amount of JavaScript.
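For example, if the section headings are real heading tags, a short script can already split the document into sections. A minimal sketch using Python's stdlib parser (the same DOM walk is easy in browser JavaScript):

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3"}

class SectionSplitter(HTMLParser):
    """Split an HTML document into (heading, body-text) sections."""

    def __init__(self):
        super().__init__()
        self.sections = []        # list of [heading_text, body_text] pairs
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = True
            self.sections.append(["", ""])   # start a new section

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return                # text before the first heading is ignored
        if self._in_heading:
            self.sections[-1][0] += data
        else:
            self.sections[-1][1] += data

html = "<h2>Definitions</h2><p>Terms used...</p><h2>Liability</h2><p>Neither party...</p>"
splitter = SectionSplitter()
splitter.feed(html)
for heading, body in splitter.sections:
    print(heading, "->", body.strip())
```

Real legal HTML is usually messier (headings styled with `<b>`/`<span>` and font sizes rather than `<h*>` tags), at which point the visual-cue approach discussed above becomes necessary.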

This seems like a super useful way to package and deliver this kind of tool chain! I’ve been looking for PDF parsers on and off the last couple of years, and have found it challenging to get most tools set up for data extraction and analysis.

This one packages an off-the-shelf version into a Docker image and starts a GUI website locally. Looking forward to using this more!

Hmmm, looks useful. The list of dependencies is basically a who's who of doing various types of document parsing. Is this basically just a unified interface that wraps them all up into an API?

This is an interface that wraps multiple document tools into a common API, allowing you, for example, to switch your OCR from Tesseract to Google Cloud Vision or Abbyy without changing much of your code. More cloud OCR vendors will be supported in the near future.
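In practice that swap comes down to changing the pipeline configuration rather than code. A rough illustration of what such a config fragment might look like (field names are my guess; check the repo's configuration docs for the real schema):

```json
{
  "extractor": {
    "pdf": "pdfminer",
    "ocr": "tesseract",
    "language": ["eng"]
  }
}
```

Switching OCR engines would then just mean changing `"ocr": "tesseract"` to the Google Cloud Vision or Abbyy backend name.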

More than just wrapping the OCR, there is also a document reconstruction / cleaning pipeline that takes care of reading order, heading detection and classification, table detection and reconstruction, etc., so that the text / JSON output is as clean and usable as possible.
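For a flavor of what heading detection can look like, here is a toy heuristic (my own sketch, not Parsr's actual pipeline) that labels a line as a heading when its font size is well above the page median:

```python
from statistics import median

def classify_headings(lines):
    """Label each (text, font_size) line as 'heading' or 'body'.

    Crude heuristic: anything noticeably larger than the median
    font size on the page is treated as a heading.
    """
    body_size = median(size for _, size in lines)
    return [
        (text, "heading" if size > body_size * 1.2 else "body")
        for text, size in lines
    ]

page = [
    ("1. Introduction", 16.0),
    ("This document describes...", 10.0),
    ("The parties agree that...", 10.0),
    ("2. Definitions", 16.0),
]
labeled = classify_headings(page)
print(labeled)
```

Production pipelines layer on more signals (boldness, numbering patterns, whitespace above the line), but the relative-size idea is the usual starting point.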

I was really excited to try this until I saw that the only extraction methods are pdfminer, FineReader, and Tesseract. I was hoping there was something you rolled on your own. I've been trying for a long time to parse tables (and nested tables), but the available extractors seem to only work on really simple, idealized tables with virtually no skew or warping. The best I've found so far is Amazon's Textract, but it's not that great either. Alas, every attempt I've ever made at generalized table extraction has quickly regressed to templates.

Shameless plug: Would you like to join the club of happy customers at https://extracttable.com - an API to extract tabular data from images and PDFs without worrying about coordinates.

A comprehensive competitor comparison, along with outputs, is available at https://extracttable.com/compare.html

Suggestion: use a simpler synonym for idempotent on your high level pricing overview. Something like "automatically makes sure you aren't charged for duplicates"
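One common way to make billing idempotent (a generic sketch, not necessarily how extracttable implements it): dedupe submissions by a content hash, so resubmitting the same file is never charged twice:

```python
import hashlib

class Biller:
    """Charge once per unique document, however often it is resubmitted."""

    def __init__(self):
        self._seen = set()
        self.charges = 0

    def submit(self, payload: bytes) -> bool:
        """Return True if this submission was billed, False if a duplicate."""
        key = hashlib.sha256(payload).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        self.charges += 1
        return True

b = Biller()
b.submit(b"invoice.pdf bytes")
b.submit(b"invoice.pdf bytes")   # duplicate: not billed again
print(b.charges)
```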

Maybe something like lnav (https://lnav.org) would suit your needs?

Edit: I mean as a part of a custom solution

We also support the Google Document Understanding API for OCR, with support for other cloud OCR vendors coming soon.

We also support pdf.js as an alternative to pdfminer.

Thanks. FYI the link to the Google Vision documentation in 2.1 Extractor Tools of your documentation is broken.

Are there any example inputs and outputs to quickly see what's possible?

There are no input/output examples in the repo yet to let you quickly see what is possible. We should definitely add them.

For a quick test you can either run the Jupyter notebook


or run the Docker image with the UI and just drag and drop documents / play with the configuration:

   docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest

Apache Tika is a powerful text extraction engine.

Why this over Tika?

That's a really good question! I've been using Tika for quite some time as the Swiss-army knife of text extraction.

They don't even seem to be using Tika under the hood as one of the bundled tools. Does anyone have comparisons?

That looks like a great, comprehensive toolkit for data extraction. I understand the bundle is licensed under Apache; I'm curious what requirements/rules there are to follow to include a service like Abbyy.

We at extracttable.com - extracting tabular data from images and PDFs over an API - are interested in contributing and integrating our service into the bundle.

My old beater is insured with AXA. I didn't know they had any open source projects going on.

Very cool, amazing this is OSS.
