DocQuery (https://github.com/impira/docquery), a project I work on, lets you do something similar, but search over the semantic information in PDF files (using a large language model that is pre-trained to query business documents).
For example:
$ docquery scan "What is the due date?" /my/invoices/
/my/invoices/Order1.pdf What is the due date?: 4/27/2022
/my/invoices/Order2.pdf What is the due date?: 9/26/2022
...
It's obviously a lot slower than "grepping", but very powerful.
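If you'd rather call it from Python than the CLI, the same pipeline is exposed as a library. A rough sketch (adapted from memory of the README, so check the repo for the exact current API):

from docquery import document, pipeline

# Load a document-question-answering pipeline and ask it questions about a PDF.
p = pipeline("document-question-answering")
doc = document.load_document("/my/invoices/Order1.pdf")
for q in ["What is the due date?", "What is the invoice total?"]:
    print(q, p(question=q, **doc.context))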
Wow, this is exactly what I've been looking for, thank you! I just wish that with these transformer models it were possible to extract a structured set of what the model "knows" (e.g. for easy search indexing). These natural-language question systems are a little too fuzzy sometimes.
Can you tell me a bit more about your use case? A few things that come to mind:
- There are some ML/transformer-based methods for extracting a known schema (e.g. NER; a minimal sketch follows after this list) or an unknown schema (e.g. relation extraction).
- We're going to add a feature to DocQuery called "templates" soon for some popular document types (e.g. invoices) + a document classifier which will automatically apply the template based on the doc type.
- Our commercial product (http://impira.com/) supports all of this + is a hosted solution (many of our customers use us to automate accounts payable, process insurance documents, etc.)
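On the NER point from the first bullet, a minimal sketch with the Hugging Face transformers pipeline (a generic example, not something DocQuery-specific) looks roughly like this:

from transformers import pipeline

# Generic named-entity recognition: pulls out a known schema (people,
# organizations, locations, ...) instead of answering free-form questions.
ner = pipeline("ner", aggregation_strategy="simple")
for entity in ner("Invoice from Acme Corp, shipped to Jane Doe in Portland."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))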
Your commercial product looks very cool, but my use case is in creating an offline-first local document storage system (data never reaches a cloud). I'd like to enable users to search through all documents for relevant pieces of information.
The templates sound very cool - are they essentially just using a preset list of (natural language) queries tied to a particular document class? It seems like you're using a version of donut for your document classification?
> but my use case is in creating an offline-first local document storage system (data never reaches a cloud).
Makes sense -- this is why we OSS'd DocQuery :)
> The templates sound very cool - are they essentially just using a preset list of (natural language) queries tied to a particular document class? It seems like you're using a version of donut for your document classification?
Yes, that's the plan. We've done extensive testing with other approaches (e.g. NER) and realized that the benefits of use-case-specific queries (customizability, accuracy, flexibility across many use cases) outweigh the tradeoffs (e.g. NER only needs one pass to extract all fields).
Currently, we support pre-trained Donut models for both querying and classification. You can play with it by adding the --classify flag to `docquery scan`. We're releasing some new stuff soon that should be faster and more accurate.
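To make the "templates" idea concrete, here is a purely hypothetical sketch (the TEMPLATES table and apply_template are illustrations, not DocQuery's actual API; in the real feature the document classifier would pick doc_type for you):

from docquery import document, pipeline

# A "template" here is just a preset list of natural-language queries tied to
# a document class; classification decides which list to run.
TEMPLATES = {
    "invoice": ["What is the invoice number?", "What is the due date?", "What is the total?"],
    "receipt": ["What is the merchant name?", "What is the total paid?"],
}

qa = pipeline("document-question-answering")

def apply_template(path, doc_type):
    doc = document.load_document(path)
    return {q: qa(question=q, **doc.context) for q in TEMPLATES[doc_type]}

# e.g. apply_template("/my/invoices/Order1.pdf", "invoice")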
Sweet! I'll keep an eye on the repo. Thank you for open-sourcing DocQuery. I agree with your reasoning: my current attempts to find an NER model that covers all my use cases have come up short.
The unstoppable administrative engine that is the American Healthcare system produces hundreds of thousands of continuously updated documents like this with no standardized format/structure.
Manually extracting/normalizing this data into a queryable format is an industry all its own.
What is the development date? -> June 20, 2017
What is the medicine? -> SPINRAZA® (nusinersen)
How many doses -> 5 doses
Did the patient meet the review criteria? -> Patient met initial review criteria.
Is the patient treated with Evrysdi? -> not
> to extract a structured set of what the model "knows"
To be fair, that's impossible in the general case, since the model can know things (i.e. be able to answer queries) without knowing that it knows them (i.e. being able to produce a list of answerable queries by any means significantly more efficient than trying every query and seeing which ones work).
As a reductio ad absurdum, consider a 'model' consisting of a deniably encrypted key-value store, where it's outright cryptographically guaranteed that you can't efficiently enumerate the queries. Neural networks aren't quite that bad, but (in the general-over-NNs case) they at least superficially appear to be pretty close. (They're definitely not reliably secure, though; don't depend on that.)
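To make that asymmetry concrete, here's a toy sketch in Python (keyed hashing rather than real deniable encryption, but it has the same enumeration problem): looking up a query you already know is cheap, yet nothing short of guessing queries one by one recovers the set of answerable ones.

import hmac, hashlib, os

class OpaqueStore:
    # Answers a query if you already know it, but the keys are stored only as
    # keyed hashes, so you can't cheaply list which queries it would answer.
    def __init__(self):
        self._secret = os.urandom(32)
        self._table = {}

    def _tag(self, query):
        return hmac.new(self._secret, query.encode(), hashlib.sha256).digest()

    def put(self, query, answer):
        self._table[self._tag(query)] = answer

    def get(self, query):
        return self._table.get(self._tag(query))

store = OpaqueStore()
store.put("What is the due date?", "4/27/2022")
print(store.get("What is the due date?"))  # fast lookup: 4/27/2022
# Enumerating the answerable queries from store._table means brute-forcing
# candidate queries; the HMAC tags themselves can't be inverted.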
This is so epic. I was just ruminating about this particular use case. Who are your typical customers: supply chain or purchasing? Also, I notice that you do text extraction from invoices. Are you using something similar to CharGRID or its derivative BERTGRID?
Wish you and your team more success!
Thank you ultrasounder! Supply chain, construction, purchasing, insurance, financial services, and healthcare are our biggest verticals. Although we have customers doing just about anything you can imagine with documents!
For invoices, we have a pre-trained model (demo here: https://huggingface.co/spaces/impira/invoices) that is pretty good at most fields, but within our product, it will automatically learn about your formats as you upload documents and confirm/correct predictions. The pre-trained model is based on LayoutLM and the additional learning we do uses a generative model (GMMs) that can learn from as little as one example.
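If you want to poke at a LayoutLM document-QA checkpoint directly from Python, something like this should work (rough sketch; the exact checkpoint behind the invoices demo may differ, and the pipeline needs pytesseract installed to OCR plain images):

from transformers import pipeline

# Document question answering with a LayoutLM-based checkpoint; the pipeline
# OCRs the page image and then answers the question over the extracted words.
qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
result = qa(image="invoice.png", question="What is the invoice total?")  # invoice.png is a placeholder
print(result)  # predicted answer span(s) with scores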
I've long used `pdfgrep` in a very kludgey way, when stockpiling rare variants of particular used ThinkPad models (for looking up actual specs based on an IBM "type" number shown in a photo on an eBay listing, since the seller's listing of specs is often incorrect).
$ t500grep 2082-3GU
2082-3GU T9400 2.53 2GB 15.4" WSXGA+ Cam GMA, HD 3650 160G 7200 DVD±RW Intel 5100 2G Turbo 9 Bus 32 Aug 08
The Lenovo services to look up this info come and go, and are also slow, but a saved copy of the data lives forever.
(A non/less-kludgey way would be to get the information from all my IBM/Lenovo PSREFs into a lovingly engineered database/knowledge schema, a simple CSV file, or a grep-able ad hoc text file.)
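For the grep-able ad hoc text file option, a small sketch along these lines would do (pypdf is just one choice of extractor here; paths are placeholders):

import glob
from pypdf import PdfReader

# Dump the text of every PSREF PDF into one big text file, one line per
# extracted line, prefixed with filename and page number, so plain grep works.
with open("psref-dump.txt", "w", encoding="utf-8") as out:
    for path in glob.glob("psref/*.pdf"):
        reader = PdfReader(path)
        for page_no, page in enumerate(reader.pages, start=1):
            for line in (page.extract_text() or "").splitlines():
                out.write(f"{path}:{page_no}: {line}\n")

# afterwards: grep -i '2082-3GU' psref-dump.txt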
Love dnGrep. One of my friends used to combine all his PDFs into one and then use Adobe Reader to search, before I showed this to him. It's very powerful and also simple to use, even for non-technical users.
I’ve been looking for an alternative to Acrobat’s ‘Advanced Search’ capability, which allows you to define a search directory (like NP++). This is possible with pdfgrep and other tools, but the killer feature for me is that Acrobat displays results in context and lets you quickly sift through dozens or hundreds of results with ease.
It’s literally the only reason why I have acrobat installed.
Curious as well. About a year ago I was implementing what I naively thought wouldn't be very difficult: verifying that a specific string (case-sensitive or -insensitive) exists in a PDF's text. I hit many cases where the text was clearly rendered in the document, but many libraries couldn't identify it. My understanding is that there's a lot of variance in how a rendered PDF represents what you might assume is a simple string, which becomes clear once you go down that rabbit hole (not too surprising, since I try not to assume things are simple). I couldn't find anything at the time that seemed to be error-free.
Short of rendering the documents and applying OCR/text-recognition approaches, I ended up living with some error rate. I think pdfgrep was one of the tools I tested. Other people just used libraries/tools as-is without any QA, but in my sample of several hundred verified documents, pdfgrep (and others) missed some.
A bit of a tangent, but does anyone know of a good utility that can index a large number of PDF files so one can do fast keyword searches across all of them simultaneously (free or paid)? It seems like this sort of utility used to be very common 15 years ago, but local search has kind of died on the vine.
Tried out recoll thanks to you posting this, and it's brilliant! Will be using pdfgrep a lot less from now on, though your help with pdfgrep-mode is great for when I do.
Check out the script included in one of the comments of this PR. It allows displaying recoll query output in pdfgrep.el buffers. The fix in the PR itself helps to avoid an error in cases when recoll doesn't provide highlights in the snippets.
While we're asking for tool tips: does anyone know of a tool that will cache/index web pages as the user browses, so that it can be searched/viewed offline later?
Yes, +1 for Recoll. It can also OCR those PDFs that are just an image of a page of text, and not 'live' text. Read the install notes and install the helper applications.
When searching I'll first try the application or system's native search utility, but most of the time I end up opening Recoll to actually find the thing or snippet of text I want, and it has never failed me.
This looks really neat. I wish I had known about it in grad school. I used helm-bibtex to manage my bibliography (which is awesome, and about 10x better than Mendeley). If you combined pdfgrep with the semantic information helm-bibtex already has, that would be super powerful.
Make sure to turn on pdfgrep-mode (a global minor mode; its role is to advise compilation-mode so it can handle pdfgrep output), then run M-x pdfgrep with a syntax similar to M-x grep, i.e. you need to specify the filename(s) to grep in.
A full arg list after M-x pdfgrep will be something like:
pdfgrep -H -n -i points rulebooks/Tzolkin*pdf
You don't need to be in a pdf buffer to invoke it.
I've been using Ali G. Rudi's pdftxt with my own shell wrappers. From the homepage: "uses mupdf to extract text from pdf files; prefixes each line with its page number for searching."
Usually I 1) pdftxt a file and 2) based on the results, jump to a desired page in Rudi's framebuffer PDF reader, fbpdf. For this, the page number prefix in pdftxt is a particularly nice default. No temptations with too many command line options either.
What this did well, and what I struggled with in some other (open-source) parsers, is that it keeps things together that are visually together. Tables in HTML are a coherent object, but in PDF I guess they're often rendered as just some text boxes, and other parsers I've used would pull the text from the table cells in a seemingly unnatural if not random order. I don't have a specific need at the moment, but I will think of pdfgrep in the future for this reason.
The way lots of programs generate PDF, especially with full (left and right) justification, is to place each and every letter individually. PostScript, and the PDF format descended from it, assume that something else is going to figure out kerning, which unfortunately means that knowing which letters belong together in words can become a detective project.
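You can see this directly with a layout-aware extractor. A small sketch with pdfminer.six (the filename is a placeholder) that prints the per-glyph boxes an extractor actually has to work with:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox, LTTextLine, LTChar

# Each glyph arrives with its own bounding box; whether two glyphs belong to
# the same word has to be inferred from the horizontal gaps between boxes.
for page_layout in extract_pages("justified.pdf"):
    for box in page_layout:
        if not isinstance(box, LTTextBox):
            continue
        for line in box:
            if not isinstance(line, LTTextLine):
                continue
            for obj in line:
                if isinstance(obj, LTChar):
                    print(obj.get_text(), [round(v, 1) for v in obj.bbox])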
Used this for online quizzes for one of my courses. It was pretty good, but using the terminal for stuff like this still sucks since you can't just click into the PDF.
I wish PDF viewers had a Ctrl-Shift-F search like VS Code that could search for text across multiple PDFs in a directory.
I use HoudahSpot on macOS for this. You can either invoke it globally with a custom hotkey (I use hyper ‘s’, for “search”), or from within a specific directory via an icon it adds to the top-right corner of Finder.
Cool! About 15 years ago I built something similar for PDF, Office (OpenXML), and plain text as part of a search engine. Commercial/closed source, of course, but it was super handy.