Show HN: Open-source Rule-based PDF parser for RAG (github.com/nlmatics)
293 points by jnathsf 5 months ago | 32 comments
The PDF parser is a rule-based parser which uses text coordinates (boundary box), graphics and font data. It works off the text layer and also offers an OCR option that is used automatically if there are scanned pages in your PDFs. The OCR feature is based on a modified version of Tika, which uses Tesseract underneath.

The PDF Parser offers the following features:

* Sections and subsections along with their levels.
* Paragraphs - combines lines.
* Links between sections and paragraphs.
* Tables along with the section the tables are found in.
* Lists and nested lists.
* Join content spread across pages.
* Removal of repeating headers and footers.
* Watermark removal.
* OCR with boundary boxes
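
To make one of the listed features concrete, here is a minimal Python sketch of how repeating headers and footers could be detected. It is not taken from the parser itself: the input shape (a list of pages, each a list of text lines) and the names `normalize`, `strip_repeating_lines`, `edge` and `min_share` are assumptions made for illustration.

```python
from collections import Counter
import re


def normalize(line: str) -> str:
    """Collapse digits and whitespace so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def strip_repeating_lines(pages, edge=2, min_share=0.6):
    """Drop lines near the top/bottom of pages that recur on most pages.

    pages: list of pages, each a list of text lines (hypothetical input shape).
    edge: number of lines at each end of a page treated as candidates.
    min_share: fraction of pages a candidate must appear on to be removed.
    """
    counts = Counter()
    for lines in pages:
        # A set, so a line is counted at most once per page.
        candidates = {normalize(l) for l in lines[:edge] + lines[-edge:]}
        counts.update(candidates)

    threshold = max(2, min_share * len(pages))
    repeated = {text for text, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            near_edge = i < edge or i >= len(lines) - edge
            if near_edge and normalize(line) in repeated:
                continue  # looks like a running header or footer
            kept.append(line)
        cleaned.append(kept)
    return cleaned
```

A rule like this only looks at text; a parser with access to coordinates and font data could also compare positions, which is presumably more robust.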

One additional library to add, if you're working with scientific papers: https://github.com/kermitt2/grobid. I use this with paperetl (https://github.com/neuml/paperetl).

Nice project! I've long used Tika for document parsing given its maturity and the wide range of formats it supports. The XHTML output helps with chunking documents for RAG.

Here are a couple of examples:

- https://neuml.hashnode.dev/build-rag-pipelines-with-txtai

- https://neuml.hashnode.dev/extract-text-from-documents

Disclaimer: I'm the primary author of txtai (https://github.com/neuml/txtai).

Off-topic, but do you know how Tika compares to other PDF parsing libraries? I was very unimpressed by pdfminer.six (what unstructured uses), as the layout detection seems pretty basic: it fails to parse multi-column text, whereas MuPDF handles it perfectly.

Currently I'm using a mix of MuPDF + AWS Textract (for tables, mostly), but I'd love to understand what other people are doing.

Tika uses PDFBox under the hood, using its built-in text extractor (which is "ok"). If you're looking for table extraction specifically, check out Tabula (https://tabula.technology) which is also built on top of PDFBox and has some contributions from the same maintainers. PDFBox actually exposes a lower-level API for text extraction (I wrote it!) than the one Tabula uses, allowing you to roll your own extractor - but that's where dragons live, trust me :)

I don't have scientific metrics, but I've found the quality much better than most. It does a pretty good job of pulling data from text and tables.

This looks like it could be very helpful. The company I work for has a PDF comparison tool called "PDFC" which can read PDFs and runs comparisons of semantic differences. https://www.inetsoftware.de/products/pdf-content-comparer

Parsing PDFs can be quite the headache because the format is so complex. We support most of these features already but there are always so many edge cases that additional angles can be very helpful.

You're absolutely right; parsing PDFs can be a real headache due to their inherent complexity. The format itself can vary in structure, layout, and embedded components, making it difficult to extract and compare information consistently. Even with robust tools like PDFC, edge cases can always emerge, requiring further refinements.

Tesseract OCR fallback sounds great!

There are now a lot of file loaders for RAG (LangChain, LlamaIndex, unstructured, ...). Are there any reasons, such as a leading benchmark score, to prefer this one?

I couldn't try this tool as it doesn't build on Apple silicon (and there's no ARM Docker image).

However, I have a PDF parsing use case that I tried those RAG tools for, and the output they give me is pretty low quality. It kinda works for RAG, as the LLM can work around the issues, but if you want higher-quality responses with proper references and such, I think the best way is to write your own rule-based parser, which is what I ended up doing (based on MuPDF though, not Tika).

Maybe that's what the authors of this tool were thinking too.

To run the Docker image on Apple silicon, you can pull it with the following command. It will be slower, but it works: docker pull --platform linux/x86_64 ghcr.io/nlmatics/nlm-ingestor:latest

Thanks, I always forget I can do that! I've given it a go and it's really impressive – the default chunker is very smart and manages to keep most of the chunk context together

The table parser in particular is really good. Is the trick that you draw some guide lines and rectangles around tables? I'm trying to understand the GraphicsStreamProcessor class as I'm not familiar with Tika, how does it know where to draw in the first place?

For me, PyMuPDF/fitz has been the best way to retain natural reading order and set dynamic enough rules to extract text in complex layouts.

None of the mentioned tools did this out of the box and none seemed easy to configure, though all are definitely hyped and marketed far more than fitz.
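
As a rough illustration of the kind of rule this comment describes for keeping a natural reading order on multi-column pages, here is a small Python sketch. It does not call PyMuPDF; it works on plain `(x0, y0, x1, y1, text)` tuples, which loosely mirror the leading fields PyMuPDF returns from `page.get_text("blocks")`. The function name `reading_order` and the two-column assumption are mine, not the commenter's.

```python
def reading_order(blocks, page_width):
    """Order text blocks column by column, top to bottom.

    blocks: list of (x0, y0, x1, y1, text) tuples, origin at the top-left.
    Assumes at most two columns split at the page midline. Blocks that
    cross the midline and are wider than half the page are treated as
    full-width (titles, abstracts) and cut the page into bands.
    """
    mid = page_width / 2
    ordered = []

    def flush(band):
        # Within a band, read the whole left column before the right one.
        left = [b for b in band if (b[0] + b[2]) / 2 < mid]
        right = [b for b in band if (b[0] + b[2]) / 2 >= mid]
        for column in (left, right):
            ordered.extend(sorted(column, key=lambda b: b[1]))

    band = []
    for block in sorted(blocks, key=lambda b: b[1]):
        x0, _, x1, _, _ = block
        is_full_width = x0 < mid < x1 and (x1 - x0) > page_width / 2
        if is_full_width:
            flush(band)
            band = []
            ordered.append(block)
        else:
            band.append(block)
    flush(band)
    return [b[4] for b in ordered]
```

Real layouts need more than this (three columns, sidebars, figures), which is where document-specific rules come in.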

Same here, fitz is great, it does well enough out of the box that I can apply some simple heuristics for things like joining/splitting paragraphs where it makes a mistake and extract drawings and such and get pretty close to 100% accuracy on the output.

The only thing it doesn't do is table detection (neither does pdfminer.six), but there are plenty of other ways to handle them.
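
The "simple heuristics for joining/splitting paragraphs" mentioned above might look something like the following Python sketch. It is text-only and deliberately crude; the name `join_lines` and the `short_ratio` threshold are illustrative assumptions, not anything fitz provides.

```python
def join_lines(lines, short_ratio=0.8):
    """Merge extracted lines into paragraphs using text-only rules.

    Rules (all heuristics, easy to fool):
    - a blank line ends the current paragraph;
    - a short line that ends a sentence also ends the paragraph;
    - a trailing hyphen followed by a lowercase letter is treated as a
      word split across lines (this wrongly merges real hyphenated
      compounds that happen to break at the hyphen).
    """
    width = max((len(l.strip()) for l in lines), default=0)
    paragraphs, current, prev_len = [], "", 0

    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                paragraphs.append(current)
                current = ""
            continue

        ends_sentence = current.endswith((".", "!", "?"))
        prev_short = prev_len < short_ratio * width
        if current and ends_sentence and prev_short:
            paragraphs.append(current)      # previous line closed a paragraph
            current = line
        elif current.endswith("-") and line[0].islower():
            current = current[:-1] + line   # rejoin a hyphenated word
        elif current:
            current += " " + line
        else:
            current = line
        prev_len = len(line)

    if current:
        paragraphs.append(current)
    return paragraphs
```

With coordinates available, indentation and vertical gaps between lines are usually better signals than line length alone.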

Last time I tried Langchain (admittedly, that was ~6 months ago) the implementations for content extraction from PDFs and HTML files were very basic. Enough to get a prototype RAG solution going, but not enough to build anything reliable. This looks like a much more battle-tested implementation.

Great effort, and very interesting. However, when I go to GitHub I see "This organization has no public members". I do not know who you are at all, or what else might be part of this without disclosure.

Overall, I believe there has to be some middle ground for identification and trust building over time, between "hidden group with no names on $CORP secure site" and other traditional means of introduction and trust building.

Thanks for posting this interesting and relevant work.

Thanks for the post. Please use this server with the llmsherpa LayoutPDFReader to get optimal chunks for your LLM/RAG project: https://github.com/nlmatics/llmsherpa. See examples and notebook in the repo.

Thank you for sharing. Are there some example input output pairs somewhere?

You can use the library in conjunction with llmsherpa LayoutPDFReader.

Some examples are here with notebook: https://github.com/nlmatics/llmsherpa Here's another notebook from the repo with examples: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks...

I tried to parse a few hundred PDFs with it. The results are pretty decent. If this was developed in Julia, it would be ten times faster (at least).

How does this differ from Azure Document Intelligence, or are they effectively the same thing?

No, we are not doing the same thing. Most cloud parsers use a vision model; they are a lot slower and more expensive, and you need to write code on top of them to extract good chunks.

You can use llmsherpa library - https://github.com/nlmatics/llmsherpa with this server to get nice layout friendly chunks for your LLM/RAG project.

There’s no ocr or ai involved here (other than the standard fallback).

What this library, and something like fitz/PyMuPDF, allow you to do is extract the text straight from the PDF, using rules about how to parse and structure it. (With most modern PDFs you can extract text without OCR.)

- Much cheaper, obviously, but it doesn’t scale well across dynamic layouts, so you are likely using this when you can configure around a standard structure. I have found rule-based text extraction to work fairly dynamically, though, for things like scientific PDFs.

Last I used it, Azure Document Intelligence wasn't all that smart about choosing split points. This seems to implement better heuristics.

I wrote about split points and the need for including section hierarchy in this post: https://ambikasukla.substack.com/p/efficient-rag-with-docume...

All this is automated in the llmsherpa parser https://github.com/nlmatics/llmsherpa which you can use as an API over this library.

What is a split point? I use Textract a lot and, from my testing, it always beats any of the open source tooling at extracting information. That could also be highly dependent on the document format.

I think it is a reference to the place a larger document is split into chunks for calculating embeddings and storage.
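
To illustrate the idea of split points together with the section hierarchy mentioned earlier in the thread, here is a minimal Python sketch. It is not llmsherpa's implementation: the `(level, text)` input shape, the name `chunk_by_sections` and the `max_chars` limit are assumptions chosen to keep the example small.

```python
def chunk_by_sections(blocks, max_chars=500):
    """Split a document at section boundaries, prefixing each chunk
    with its heading path so the chunk keeps its context.

    blocks: list of (level, text) pairs (hypothetical input shape);
    level 0 is body text, level >= 1 is a heading of that depth.
    """
    path, chunks, buffer = [], [], []

    def flush():
        if buffer:
            header = " > ".join(path)
            body = " ".join(buffer)
            chunks.append(f"{header}\n{body}" if header else body)
            buffer.clear()

    for level, text in blocks:
        if level > 0:
            flush()                 # a heading is always a split point
            del path[level - 1:]    # drop headings at this depth or deeper
            path.append(text)
        else:
            size = sum(len(t) for t in buffer)
            if buffer and size + len(text) > max_chars:
                flush()             # size limit: split between paragraphs
            buffer.append(text)
    flush()
    return chunks
```

Each chunk can then be embedded and stored on its own, and the heading path gives the retriever and the LLM a hint about where the text came from.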

I am also curious about this. ADI is reliable but does have edge-case issues with malformed PDFs.

I fear Tesseract OCR is a potential limitation, though. I’ve seen it make so many mistakes.

Do you have any examples? There doesn't seem to be a single PDF file in the repo.

You can see examples in the llmsherpa project - https://github.com/nlmatics/llmsherpa. This project, nlm-ingestor, provides the backend to work with llmsherpa. The llmsherpa library is very convenient to use for extracting nice chunks for your LLM/RAG project.

We've been looking for something exactly like this, thanks for sharing!

How does this compare to PaddleOCR?

Looks like Apache 2 license which is nice.

"Retrieval Augmented Generation"
