Filling in PDF forms with Python (2018)

LarryMade2 · on April 11, 2020

I've used (in PHP) TCPDF (was FPDF) and FPDI. FPDI extends TCPDF to import the original PDF (used as a background/template), and then TCPDF can write content (text, lines, shapes, images, etc.) on top of it.

Using fillable PDFs... A lot of the PDFs I encountered didn't have fillable fields, or the fields provided were not as large as I needed to fill them in properly, so I mapped the content on top of the PDF, ignoring any forms. (This might not be possible if the forms perform some other function than just populating the form.)

Initially I would identify fields and create a form content database/array.

Overlay a grid on the "template PDF" with TCPDF (point measurement is the PS/PDF standard) and determine the field location coordinates by hand. (Making an HTML overlay won't cut it; you need precise measurements.)
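To make the measuring step concrete, here is a rough Python sketch of the unit handling involved. It assumes the usual 72 points per inch, a top-left origin in the drawing library (as in TCPDF/FPDF), and a bottom-left origin in raw PDF user space. The function names and the 36 pt grid step are made up for illustration.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_pt(mm):
    """Convert millimetres to PDF points (1 pt = 1/72 inch)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def top_left_to_pdf(x_pt, y_pt, page_height_pt):
    """Map a top-left-origin coordinate (y grows downward) to default
    PDF user space (origin bottom-left, y grows upward)."""
    return x_pt, page_height_pt - y_pt


def grid_lines(page_width_pt, page_height_pt, step_pt=36.0):
    """Return the x and y positions of grid lines to draw over a template.

    step_pt=36 gives one line every half inch; pick whatever is readable.
    """
    xs = [i * step_pt for i in range(int(page_width_pt // step_pt) + 1)]
    ys = [i * step_pt for i in range(int(page_height_pt // step_pt) + 1)]
    return xs, ys


# US Letter is 612 x 792 pt.
xs, ys = grid_lines(612.0, 792.0)
```

Draw labelled lines at those positions on top of the template, print or view it, and read the field coordinates straight off the grid.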

Add in some paging logic for handling multi-page data, etc., and have it put it all together.
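The paging logic can be as simple as chunking rows and restarting the y position on each new page. A minimal sketch in Python, assuming a repeating list of rows that all share one line height; `paginate`, `place_rows` and their parameters are hypothetical names, not part of any PDF library.

```python
def paginate(rows, rows_per_page):
    """Split rows into consecutive chunks, one chunk per page."""
    if rows_per_page < 1:
        raise ValueError("rows_per_page must be at least 1")
    return [rows[i:i + rows_per_page]
            for i in range(0, len(rows), rows_per_page)]


def place_rows(rows, first_y_pt, line_height_pt, rows_per_page):
    """Return (page_number, y_position, text) for every row.

    The y position restarts at first_y_pt on each new page, so the same
    template page can be reused as the background for every chunk.
    """
    placements = []
    pages = paginate(rows, rows_per_page)
    for page_no, page_rows in enumerate(pages, start=1):
        for k, text in enumerate(page_rows):
            placements.append((page_no, first_y_pt + k * line_height_pt, text))
    return placements
```

Each tuple then turns into one "import template page N, write text at y" call in whatever PDF library you are using.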

...

But PHP isn't one for signatures, so Python would be best...

Looking at resources, as I've done in PHP, you have to use two libraries: one to import the pages (PyPDF2) and one to create the content on the imported page (pyfpdf). It looks like they are exclusive, so you create the content, then merge the PDF template with the content PDF.

Someone wrote an example: https://gist.github.com/dwayneblew/79da32727358b502f6ec

This should get you closer I think.

luminadiffusion · on April 11, 2020

I had to solve this problem a few years ago. My solution was as follows:

1. Convert PDF -> multiple individual SVG files. (I probably used Cairo)

2. Use Inkscape to set your fields and name them like Django template variables.

3. Store these prepared SVGs in the file system of your app.

4. Call them when needed and fill them in with the Django Template rendering engine.

4a. Works with including images too if you convert them to base64 encoded PNG, then insert them into the SVG.

5. Convert individual SVGs -> individual PDFs. (Cairo)

6. Merge individual PDFs into a single combined PDF. (Cairo)

7. Deliver finished merged PDF.
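Steps 4 and 4a might look roughly like this. To keep the sketch dependency-free it uses a small regex stand-in for `{{ variable }}` substitution instead of the Django template engine, so it only handles plain variables (no tags or filters). `fill_svg` and `png_data_uri` are hypothetical helper names.

```python
import base64
import re
from xml.sax.saxutils import escape

# Matches Django-style variables such as {{ full_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_svg(svg_text, context):
    """Replace {{ name }} placeholders with XML-escaped values.

    Raises KeyError if the template uses a name missing from context.
    """
    def substitute(match):
        return escape(str(context[match.group(1)]))
    return PLACEHOLDER.sub(substitute, svg_text)


def png_data_uri(png_bytes):
    """Encode PNG bytes as a data URI usable in an SVG <image> href."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return "data:image/png;base64," + encoded


template = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    "<text>{{ full_name }}</text>"
    '<image href="{{ signature }}"/>'
    "</svg>"
)
filled = fill_svg(template, {
    "full_name": "Smith & Sons",
    "signature": png_data_uri(b"abc"),  # placeholder bytes, not a real PNG
})
```

The SVG to PDF conversion and the final merge (steps 5 and 6) are left out here, since they depend on which Cairo binding you use.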

After the initial step of preparing your SVG, which can take a bit of time to get right, it only takes about 2-3s to produce a fully compiled PDF and gives you all of the necessary functionality out of them - sans all of these chaotic intervening libraries.

I can't tell you how many months it took me to figure that out. It was a while though. When the system was operational, we were sending 10,000 multipage PDFs per day on a single Django instance on a T2 medium AWS instance.

formalsystem · on April 11, 2020

Anyone familiar with the history of how PDFs became such a widespread format in the first place? I get that it looks nice but not being able to edit it by default just seems weird to me.

ivan_ah · on April 11, 2020

This was posted on HN a while back and has some interesting bits of history and context: https://www.vice.com/en_us/article/pam43n/why-the-pdf-is-sec... via https://news.ycombinator.com/item?id=19819789

orev · on April 11, 2020

The “read only” nature of PDFs is a feature, not a bug. The idea being that once you distribute the PDF, it can’t be changed, and thus has more “truth” than something editable would. Even now PDF is considered an acceptable format for legal documents where Word docx is not. Of course this is completely false safety given that many programs can edit PDFs.

izacus · on April 11, 2020

> Of course this is completely false safety given that many programs can edit PDFs.

Well, unless you sign the PDF. Even more, you can sign each edit separately, so you can do things like add content and signatures and still verify who added what. Meaning: one party can create PDF with forms, sign it, then the party filling out the form can sign their own changes for authentication.

And let's not forget the fact that PDF renders correctly on pretty much any machine you put it on - this is incredibly important.

metaphor · on April 11, 2020

> Of course this is completely false safety given that many programs can edit PDFs.

But can you edit a properly signed[1] PDF without breaking it? From an integrity perspective, that's what matters; otherwise, it's just inherently more portable until non-repudiation becomes relevant.

[1] As in not SHA-1: https://shattered.io/

alasdairking · on April 11, 2020

Including Word nowadays!

mkl · on April 11, 2020

PDF is a high-quality vector format that displays the same on every device, can be produced by many different applications but read with the same viewer, and has a public specification that's an ISO standard. There's very little competition. Compressed PostScript is clunky, slow, and still big; XPS was too late; DJVU is primarily for scans. Things like Word or HTML display differently on different devices, and Word and many other formats are or were proprietary.

cronopios · on April 11, 2020

What about [DVI](https://en.wikipedia.org/wiki/Device_independent_file_format)?

mkl · on April 11, 2020

Only really used for TeX-related things. Not general purpose as it can't embed fonts, and graphics is usually by embedding PostScript I think.

djrobstep · on April 11, 2020

If you think that's crazy, wait until you hear about HTML!

emmelaich · on April 11, 2020

You laugh, but surely there is a case for a standard of static HTML+CSS these days to do 95% of what PDF does.

I don't think anyone is motivated enough to make that standard, however.

ethanwillis · on April 11, 2020

Read up on Postscript specifically. There's some interesting history there, like Display Postscript.

simon04 · on April 11, 2020

In my opinion, one should _always_ try to get changes/fixes/patches applied upstream, even if they still need some discussion/tuning. In the long run everyone benefits. Think of 1000 people maintaining their own fork of the Linux kernel.

mckmk · on April 11, 2020

I had a similar need at a company I worked for. My solution was actually quite similar to the author's #1 with the major exception being that I used ODG for LibreOffice Draw which mostly solves the author's two main complaints here. Background images can be high quality and placing your text is as easy as clicking where you want to place your text box.

The only other major difference is that I didn't interact with UNO. Since Open Document Format files are zipped XML files I extracted the content.xml and did regular expressions for my variables then replaced them.

We did have to do signatures as well but that turns out to be not THAT much of a pain. If you insert an image on top of your form manually then look at the resulting file you can pretty much copy the part of the XML that refers to the inserted image, insert the signature image into the ODG zip and make sure the names line up and it will work.

It's worth noting that the practice of editing complex XML with regular expressions is not always advisable. In my experience it works fairly reliably with ODG because the format remains simple. But with ODT it can result in corrupted files quite easily, because additional XML can lie in between the letters of your variables. Then you'll be on a mission to find and ignore all the text markup like XML bold markers and span tags and paragraph markers and style tags... Before you know it, your simple unzip-regex-rezip becomes a whole library.
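For the curious, the unzip, regex, rezip loop is roughly the following in Python. This is a sketch, not the code I actually ran: the `{{name}}` placeholder syntax and the `fill_odf_template` name are assumptions made up for illustration. The one thing to be careful about is that ODF wants the `mimetype` entry first in the archive and stored uncompressed.

```python
import re
import zipfile
from xml.sax.saxutils import escape

# Assumed placeholder syntax: {{name}} typed into a text box in Draw.
VARIABLE = re.compile(r"\{\{(\w+)\}\}")


def fill_odf_template(src, dst, values):
    """Copy an ODF zip from src to dst, substituting variables in content.xml.

    src and dst may be paths or binary file-like objects.
    Raises KeyError if content.xml uses a name missing from values.
    """
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        # infolist() keeps the original order, so mimetype stays first.
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "content.xml":
                text = data.decode("utf-8")
                text = VARIABLE.sub(
                    lambda m: escape(str(values[m.group(1)])), text)
                data = text.encode("utf-8")
            # The mimetype entry must not be compressed.
            if item.filename == "mimetype":
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            zout.writestr(item.filename, data, compress_type=method)
```

Values are XML-escaped so that an ampersand or angle bracket in the data doesn't corrupt the file. This will silently miss any variable that the editor has split across several XML elements, which is exactly the ODT problem described above.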

phonethrowaway · on April 11, 2020

I've had great success with this library:

https://github.com/christopher-ramirez/secretary

someotherperson · on April 11, 2020

There is more than one way to deal with this. Unfortunately, it's all quite messy, but it is possible.

One is to use Inkscape to layer the image as an OCG; the other is to treat the image as a "watermark" (which is really just another layer) and merge it in via PyPDF2 or similar.