$ docquery scan "What is the due date?" /my/invoices/
/my/invoices/Order1.pdf What is the due date?: 4/27/2022
/my/invoices/Order2.pdf What is the due date?: 9/26/2022
- There are some ML/transformer-based methods for extracting a known schema (e.g. NER) or an unknown schema (e.g. relation extraction).
- We're going to add a feature to DocQuery called "templates" soon for some popular document types (e.g. invoices) + a document classifier which will automatically apply the template based on the doc type.
- Our commercial product (http://impira.com/) supports all of this + is a hosted solution (many of our customers use us to automate accounts payable, process insurance documents, etc.)
The templates sound very cool - are they essentially just using a preset list of (natural language) queries tied to a particular document class? It seems like you're using a version of donut for your document classification?
Makes sense -- this is why we OSS'd DocQuery :)
> The templates sound very cool - are they essentially just using a preset list of (natural language) queries tied to a particular document class? It seems like you're using a version of donut for your document classification?
Yes, that's the plan. We've done extensive testing with other approaches (e.g. NER) and realized that the benefits of use-case-specific queries (customizability, accuracy, flexibility across many use cases) outweigh the tradeoffs (NER only needs one execution for all fields).
Currently, we support pre-trained Donut models for both querying and classification. You can play with it by adding the --classify flag to `docquery scan`. We're releasing some new stuff soon that should be faster and more accurate.
The unstoppable administrative engine that is the American Healthcare system produces hundreds of thousands of continuously updated documents like this with no standardized format/structure.
Manually extracting/normalizing this data into a queryable format is an industry all its own.
I tried a few questions:
What is the development date? -> June 20, 2017
What is the medicine? -> SPINRAZA® (nusinersen)
How many doses -> 5 doses
Did the patient meet the review criteria? -> Patient met initial review criteria.
Is the patient treated with Evrysdi? -> not
To be fair, that's impossible in the general case, since the model can know things (i.e. be able to answer queries) without knowing that it knows them (i.e. being able to produce a list of answerable queries by any means significantly more efficient than trying every query and seeing which ones work).
As a reductio ad absurdum example, consider a 'model' consisting of a deniably encrypted key-value store, where it's outright cryptographically guaranteed that you can't efficiently enumerate the queries. Neural networks aren't quite that bad, but (in the general-over-NNs case) they at least superficially appear to be pretty close. (They're definitely not reliably secure though; don't depend on that.)
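To make the reductio concrete, here's a toy sketch (plain Python, purely illustrative, not a real deniably encrypted store): keying the store by a one-way hash of the query text lets it answer any query it "knows", while enumerating the answerable queries from the stored data would mean inverting the hash.

```python
import hashlib

def _key(query: str) -> str:
    # One-way hash of the query text serves as the lookup key.
    return hashlib.sha256(query.encode()).hexdigest()

class OpaqueStore:
    """Answers queries it has learned, but its stored keys reveal
    nothing about *which* queries those are, short of brute force."""
    def __init__(self):
        self._kv = {}

    def learn(self, query: str, answer: str):
        self._kv[_key(query)] = answer

    def ask(self, query: str):
        return self._kv.get(_key(query))

store = OpaqueStore()
store.learn("What is the due date?", "4/27/2022")
print(store.ask("What is the due date?"))  # answered, because we knew the query
print(store.ask("What is the total?"))     # None; the store can't list what it knows
```

The same asymmetry holds for the hypothetical neural network: answering a known query is cheap, listing the answerable ones is not.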
For invoices, we have a pre-trained model (demo here: https://huggingface.co/spaces/impira/invoices) that is pretty good at most fields, but within our product, it will automatically learn about your formats as you upload documents and confirm/correct predictions. The pre-trained model is based on LayoutLM and the additional learning we do uses a generative model (GMMs) that can learn from as little as one example.
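Impira's actual implementation isn't public, but the idea of a generative model that learns from a single confirmed example can be sketched with one Gaussian per field: treat the confirmed field's features (here, its normalized page position, an illustrative assumption) as the mean, fix the variance as a prior, and score new candidates by likelihood. All names and numbers below are hypothetical.

```python
import math

def gaussian_logpdf(x: float, mean: float, var: float) -> float:
    # Log-density of a 1-D Gaussian.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

class OneShotFieldModel:
    """Toy generative model: a single confirmed example sets the mean of a
    diagonal Gaussian over (x, y) page position; variance is a fixed prior."""
    def __init__(self, example_xy, var=0.01):
        self.mean = example_xy
        self.var = var

    def score(self, xy) -> float:
        return sum(gaussian_logpdf(v, m, self.var)
                   for v, m in zip(xy, self.mean))

# One confirmed "due date" location, in normalized page coordinates.
model = OneShotFieldModel((0.82, 0.10))
candidates = {"4/27/2022": (0.81, 0.11), "123 Main St": (0.15, 0.40)}
best = max(candidates, key=lambda k: model.score(candidates[k]))
print(best)  # the candidate nearest the confirmed position scores highest
```

Each confirmed/corrected prediction would refit the mean, which is why a single example is enough to start making useful guesses on same-format documents.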
LMK if you have any other questions.
Example shell function:
t500grep() {
    pdfgrep "$1" /home/user/doc/lenovo-psref-withdrawn-thinkpad-2005-to-2013-2013-12-447.pdf
}
$ t500grep 2082-3GU
2082-3GU T9400 2.53 2GB 15.4" WSXGA+ Cam GMA, HD 3650 160G 7200 DVD±RW Intel 5100 2G Turbo 9 Bus 32 Aug 08
(A non/less-kludge way would be to get the information from all my IBM/Lenovo PSREFs into a lovingly-engineered database/knowledge schema, a simple CSV file, or a `grep`-able ad hoc text file.)
It’s literally the only reason why I have Acrobat installed.
Some time ago I built an automation that identifies whether given PDFs contain specified keywords, outputting the results as a CSV file.
Similar to pdfgrep, probably much slower, but potentially more convenient for people who prefer GUIs.
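The original tool isn't shown, but the core of such an automation can be sketched in a few lines of Python. The text extraction step here shells out to poppler's `pdftotext` CLI as one possible backend (an assumption; the poster may have used OCR or a library instead); the keyword check and CSV output are plain stdlib.

```python
import csv
import subprocess
from pathlib import Path

def pdf_to_text(path: Path) -> str:
    # One possible extractor: poppler's `pdftotext` CLI, writing to stdout.
    # Assumes pdftotext is installed; swap in OCR for scanned documents.
    return subprocess.run(["pdftotext", str(path), "-"],
                          capture_output=True, text=True).stdout

def keyword_hits(text: str, keywords) -> dict:
    # Case-insensitive containment check for each keyword.
    low = text.lower()
    return {kw: kw.lower() in low for kw in keywords}

def scan_to_csv(pdf_dir: str, keywords, out_csv: str) -> None:
    # One row per PDF, one 0/1 column per keyword.
    with open(out_csv, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["file", *keywords])
        for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
            hits = keyword_hits(pdf_to_text(pdf), keywords)
            w.writerow([pdf.name, *(int(hits[k]) for k in keywords)])
```

As the next comment notes, any text-extraction backend carries some error rate on real-world documents, so spot-checking the CSV against verified samples is worthwhile.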
Even after applying document rendering with OCR and text-recognition approaches, I ended up living with some error rate. I think pdfgrep was one of the tools I tested. Some other people just used libraries/tools as-is with no sort of QA, but in my sample of several hundred verified documents, pdfgrep (and others) missed some matches.
When searching I'll first try the application or system's native search utility, but most of the time I end up opening Recoll to actually find the thing or snippet of text I want, and it has never failed me.
It's on MELPA.
A full arg list after M-x pdfgrep will be something like:
pdfgrep -H -n -i points rulebooks/Tzolkin*pdf
Here is what I have in my .emacs:
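(The snippet itself is elided; a minimal setup, assuming the pdfgrep.el package from MELPA and the pdfgrep binary on PATH, might look like this.)

```elisp
;; Hypothetical minimal config for pdfgrep.el (MELPA package assumed).
(require 'pdfgrep)
(pdfgrep-mode)  ; enables jumping to matches from the pdfgrep buffer
```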
Usually I 1) pdftxt a file and 2) based on the results, jump to a desired page in Rudi's framebuffer PDF reader, fbpdf. For this, the page number prefix in pdftxt is a particularly nice default. No temptations with too many command line options either.
I wish PDF readers had a Ctrl-Shift-F search like VS Code that could search for text in multiple PDFs in a directory.
I installed the Ubuntu package. Thanks!