Pdfgrep – a commandline utility to search text in PDF files (pdfgrep.org)
279 points by kretaceous on Sept 25, 2022 | 55 comments



DocQuery (https://github.com/impira/docquery), a project I work on, allows you to do something similar, but search over semantic information in the PDF files (using a large language model that is pre-trained to query business documents).

For example:

  $ docquery scan "What is the due date?" /my/invoices/
  /my/invoices/Order1.pdf       What is the due date?: 4/27/2022
  /my/invoices/Order2.pdf       What is the due date?: 9/26/2022
  ...
It's obviously a lot slower than "grepping", but very powerful.


Wow, this is exactly what I've been looking for, thank you! I just wish it were possible with these transformer models to extract a structured set of what the model "knows" (e.g. for easy search indexing). These natural language question systems are a little too fuzzy sometimes.


Can you tell me a bit more about your use case? A few things that come to mind:

- There are some ML/transformer-based methods for extracting a known schema (e.g. NER) or an unknown schema (e.g. relation extraction).

- We're going to add a feature to DocQuery called "templates" soon for some popular document types (e.g. invoices) + a document classifier which will automatically apply the template based on the doc type.

- Our commercial product (http://impira.com/) supports all of this + is a hosted solution (many of our customers use us to automate accounts payable, process insurance documents, etc.)


Your commercial product looks very cool, but my use case is in creating an offline-first local document storage system (data never reaches a cloud). I'd like to enable users to search through all documents for relevant pieces of information.

The templates sound very cool - are they essentially just using a preset list of (natural language) queries tied to a particular document class? It seems like you're using a version of donut for your document classification?


> but my use case is in creating an offline-first local document storage system (data never reaches a cloud).

Makes sense -- this is why we OSS'd DocQuery :)

> The templates sound very cool - are they essentially just using a preset list of (natural language) queries tied to a particular document class? It seems like you're using a version of donut for your document classification?

Yes that's the plan. We've done extensive testing with other approaches (e.g. NER) and realized that the benefits of using use-case specific queries (customizability, accuracy, flexibility for many use cases) outweigh the tradeoffs (NER only needs one execution for all fields).

Currently, we support pre-trained Donut models for both querying and classification. You can play with it by adding the --classify flag to `docquery scan`. We're releasing some new stuff soon that should be faster and more accurate.
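For example, something like this (reusing the hypothetical invoices directory from the example above; the exact output format may differ):

  $ docquery scan --classify "What is the due date?" /my/invoices/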


Sweet! I'll keep an eye on the repo. Thank you for open-sourcing DocQuery. I agree with your reasoning: my current attempts to find an NER model that covers all my use cases have come up short.


Since you mention insurance documents, could you speak to how well this would extract data from a policy document like https://ahca.myflorida.com/medicaid/Prescribed_Drug/drug_cri... ?

The unstoppable administrative engine that is the American Healthcare system produces hundreds of thousands of continuously updated documents like this with no standardized format/structure.

Manually extracting/normalizing this data into a queryable format is an industry all its own.


It's very easy to try! Just plug that URL here: https://huggingface.co/spaces/impira/docquery.

I tried a few questions:

  What is the development date? -> June 20, 2017
  What is the medicine? -> SPINRAZA® (nusinersen)
  How many doses -> 5 doses
  Did the patient meet the review criteria? -> Patient met initial review criteria.
  Is the patient treated with Evrysdi? -> not


> to extract a structured set of what the model "knows"

To be fair, that's impossible in the general case, since the model can know things (i.e. be able to answer queries) without knowing that it knows them (i.e. being able to produce a list of answerable queries by any means significantly more efficient than trying every query and seeing which ones work).

As a reductio ad absurdum example, consider a 'model' consisting of a deniably encrypted key-value store, where it's outright cryptographically guaranteed that you can't efficiently enumerate the queries. Neural networks aren't quite that bad, but (in the general-over-NNs case) they at least superficially appear to be pretty close. (They're definitely not reliably secure though; don't depend on that.)


This is so epic. I was just ruminating about this particular use case. Who are your typical customers: supply chain or purchasing? Also, I notice that you do text extraction from invoices? Are you using something similar to Chargrid or its derivative BERTgrid? Wish you and your team more success!


Thank you ultrasounder! Supply chain, construction, purchasing, insurance, financial services, and healthcare are our biggest verticals. Although we have customers doing just about anything you can imagine with documents!

For invoices, we have a pre-trained model (demo here: https://huggingface.co/spaces/impira/invoices) that is pretty good at most fields, but within our product, it will automatically learn about your formats as you upload documents and confirm/correct predictions. The pre-trained model is based on LayoutLM and the additional learning we do uses a generative model (GMMs) that can learn from as little as one example.

LMK if you have any other questions.


Can this work like an AI readability extension? What's the content of this page? What are the navigation options on this page?


I’d like a way to index a lot of case law and then ask it questions.


See also https://github.com/phiresky/ripgrep-all (`ripgrep`, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.)
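Usage mirrors ripgrep; for instance, something like this (directory path made up) recursively searches PDFs, e-books, archives, etc.:

    $ rga "load-bearing wall" ~/Documents/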


Was already a big fan of ripgrep and pdfgrep. rga (ripgrep-all) was such a hidden gem for me.


I recently found out that evince works fine with gzipped pdf (although not zstd). pdfgrep doesn't handle that but rga does.


I've long used `pdfgrep` in a very kludgey way, when stockpiling rare variants of particular used ThinkPad models (for looking up actual specs based on an IBM "type" number shown in a photo on an eBay listing, since the seller's listing of specs is often incorrect).

Example shell function:

    t500grep() {
        pdfgrep "$1" /home/user/doc/lenovo-psref-withdrawn-thinkpad-2005-to-2013-2013-12-447.pdf
    }
Example run:

    $ t500grep 2082-3GU
    2082-3GU      T9400   2.53   2GB   15.4" WSXGA+    Cam GMA, HD 3650     160G   7200   DVD±RW       Intel 5100            2G Turbo   9   Bus 32   Aug 08
The Lenovo services to look up this info come and go, and are also slow, but a saved copy of the data lives forever.

(A non/less-kludgey way would be to get the information from all my IBM/Lenovo PSREFs into a lovingly-engineered database/knowledge schema, simple CSV file, or `grep`-able ad hoc text file.)
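For the ad hoc text file variant, a rough sketch would be a one-time conversion with poppler's pdftotext (mentioned elsewhere in this thread), then plain grep; filenames here are made up:

    $ pdftotext -layout lenovo-psref.pdf psref.txt   # convert once
    $ grep 2082-3GU psref.txt                        # then plain grep is enough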


For Windows, there is dngrep: http://dngrep.github.io


Love dnGrep. One of my friends used to combine all his PDFs into one and then use Adobe Reader to search, before I showed this to him. It's very powerful and also simple to use, even for non-technical users.


I’ve been looking for an alternative to Acrobat’s ‘Advanced Search’ capability, which allows you to define a search directory (like NP++). This feature is possible with pdfgrep and other tools, but the killer for me is that it displays results in context, and allows you to quickly sift through dozens or hundreds of results with ease.

It’s literally the only reason why I have acrobat installed.


Call pdfgrep from Emacs with pdf-tools installed.


Tangential:

Some time ago I built an automation [1] that identifies whether the given PDFs contain the specified keywords, outputting the result as a CSV file.

Similar to pdfgrep, probably much slower, but potentially more convenient for people who prefer GUIs.

[1] https://github.com/bendersej/pdf-keywords-extractor


Out of curiosity, how did you solve the issue of extracting text from the PDF error-free? Or did you use another package?


Looking at the list of dependencies, it seems like they use poppler-cpp to render the PDFs.

https://gitlab.com/pdfgrep/pdfgrep#dependencies


Poppler's pdftotext -layout is great.
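For example, piped straight into grep (filename and search term made up):

    $ pdftotext -layout report.pdf - | grep -in "total"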


Curious as well. About a year ago I was implementing what I naively thought wouldn't be a very difficult verification that a specific string existed (case sensitive or insensitive) within a PDF's text, and I hit many cases where the text was clearly rendered in the document but many libraries couldn't find it. My understanding, after going down the rabbit hole, is that there's a lot of variance in how a rendered PDF may represent something one would assume is a simple string (which wasn't too surprising, because I don't like to make simplicity assumptions). I couldn't find anything at the time that seemed to be error-free.

Short of rendering the documents and applying OCR/text-recognition approaches, I ended up living with some error rate there. I think pdfgrep was one of the tools I tested. Some other people just used libraries/tools as-is with no QA, but in my sample of several hundred verified documents, pdfgrep (and others) missed some.


A bit of a tangent, but does anyone know of a good utility that can index a large number of PDF files so one can do fast keyword searches across all of them simultaneously (free or paid)? It seems like this sort of utility used to be very common 15 years ago, but local search has kind of died on the vine.


Recoll is a nice one, uses Xapian for the index.

https://www.lesbonscomptes.com/recoll/


Tried out recoll thanks to you posting this, and it's brilliant! Will be using pdfgrep a lot less from now on, though your help with pdfgrep-mode is great for when I do.


Check out the script included in one of the comments of this PR. It allows displaying recoll query output in pdfgrep.el buffers. The fix in the PR itself helps to avoid an error in cases when recoll doesn't provide highlights in the snippets.

https://github.com/jeremy-compostella/pdfgrep/pull/8


While we're asking for tool tips: does anyone know of a tool that will cache/index web pages as the user browses, so that it can be searched/viewed offline later?


macOS's Spotlight can do this AFAIK.


Yes, and Spotlight is also usable from the command line as mdfind, which has an -onlyin switch to restrict the search to a directory.
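Something like this (directory and query are just examples):

    $ mdfind -onlyin ~/Documents "insurance policy"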


DEVONsphere Express, Recoll, also the latest major version of Calibre.


I am working on looqs, it can do that (and also will render the page immediately): https://github.com/quitesimpleorg/looqs


Recoll?


Yes, +1 for Recoll. It can also OCR those PDFs that are just an image of a page of text, and not 'live' text. Read the install notes and install the helper applications.

When searching I'll first try the application or system's native search utility, but most of the time I end up opening Recoll to actually find the thing or snippet of text I want, and it has never failed me.

https://www.lesbonscomptes.com/recoll/pages/features.html#do...


dtSearch


For Emacs users there is also https://github.com/jeremy-compostella/pdfgrep which lets you browse the results and open the original docs highlighting the selected match.

It's on MELPA.


This looks really neat. I wish I had known about it in grad school. I used helm-bibtex to manage my bibliography (which is awesome, and about 10x better than Mendeley). If you combined pdfgrep with the semantic information helm-bibtex already has, that would be super powerful.


Tried this the other day but couldn't figure out how to use it. Is it invoked from eshell, dired, a pdftool buffer, or what?


Make sure to turn on pdfgrep-mode (a global minor mode; its role is to advise compilation-mode so it can handle pdfgrep output), then run M-x pdfgrep with syntax similar to M-x grep, so you need to specify the filename(s) to grep in.

A full arg list after M-x pdfgrep will be something like:

    pdfgrep -H -n -i points rulebooks/Tzolkin*pdf
You don't need to be in a pdf buffer to invoke it.

Here is what I have in my .emacs:

    (use-package pdfgrep
      :config
      (pdfgrep-mode))


I've been using Ali G. Rudi's pdftxt with my own shell wrappers. From the homepage: "uses mupdf to extract text from pdf files; prefixes each line with its page number for searching."

Usually I 1) pdftxt a file and 2) based on the results, jump to a desired page in Rudi's framebuffer PDF reader, fbpdf. For this, the page number prefix in pdftxt is a particularly nice default. No temptations with too many command line options either.

https://litcave.rudi.ir/pdftxt-0.7.tar.gz


What this did well, which I struggled with in some other (open source) parsers, is that it keeps things together that are visually together. Tables in HTML are a coherent object, but in PDF I guess they're often rendered as just some text boxes, and other parsers I've used would pull the text from the table cells in a seemingly unnatural if not random order. I don't have a specific need anymore at the moment, but I will think of Pdfgrep in the future, for this reason.


The way lots of programs generate PDF, especially with full (left and right) justification, is to place each and every letter individually. PostScript, and the PDF format descended from it, assume that something else is going to figure out kerning. Which unfortunately means that knowing which letters belong together in words can become a detective project.


Used this for online quizzes for one of my courses. It was pretty good, but using the terminal for stuff like this still sucks since you can't just click into the PDF.

I wish PDFs had a Ctrl-Shift-F search like VS Code that could search for text in multiple pdfs in a directory.
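pdfgrep's recursive mode at least covers the multiple-PDFs-in-a-directory part; a sketch (directory name made up):

    $ pdfgrep -rni "dynamic programming" ./lecture-notes/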


I use HoudahSpot on macOS for this. You can either invoke it globally with a custom hotkey (I use hyper ‘s’, for “search”), or from within a specific directory via an icon it adds to the top right corner of Finder.


Advanced Search in Acrobat allows you to define a directory. It’s the only reason I ever have it installed.


Cool, about 15 years ago I built something similar for PDF, Office (OpenXML) as well as plain text as part of a search engine. Commercial/closed source of course but it was super handy.


I used this tool a lot while writing my thesis. It let me search through the many papers saved on my computer easily and quickly.


Wanted to play with it, but just killed the terminal install of prereqs that had still been chugging since last night.


Just what I needed to search my collection of comp-sci articles. Regular grep fails on most PDFs.

I installed the Ubuntu package. Thanks!


+1 -- after trying several tool sets over years, pdfgrep is currently used daily around here


The catdoc utility does the same for .doc MS Word files. Maybe for PDFs also.


pdfgrep is great. Worked like a charm to diff updates to a contract.
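For anyone curious, one way to do that kind of text diff (not necessarily what I did, just a sketch with made-up filenames):

    $ diff <(pdftotext -layout contract-v1.pdf -) <(pdftotext -layout contract-v2.pdf -)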



