“PDF is a fabulous format” I will never forgive the pain PDF caused me when I w...

JKCalhoun · 2024-02-01T13:31:54

PDF is supposed to be a printer format, not a word processing document format. While I too would love to nail down a PDF subset as a standard (for example, one requiring the accessibility tags that make text extraction easy), perhaps trying to create a hybrid format, one that satisfies both printers and resizable windows, is already an impossible goal.

(I've always had to keep my love of PDF a secret from fellow nerds. But here's another secret: I like printing documents out from time to time.)

da_chicken · 2024-02-01T13:51:21

I really appreciate what PDF can accomplish, but I also really dislike that it turns into a black box. There really ought to be something that can describe a document structure and also describe document layout in a durable and portable manner. In the range of XML/JSON <-> HTML+CSS <-> PDF <-> PS <-> RAW, it really does feel like there's something missing between HTML and PDF.

And it can't be LaTeX, because the document shouldn't be a programming language at all. "The document is a program" has proven itself to be a terrible scheme overall.

layer8 · 2024-02-01T18:22:55

PDF includes optional document structure information. Most PDF creation software chooses to not generate it, though.

JKCalhoun · 2024-02-01T16:24:23

ePub is kind of trying to be that? Or maybe that hews too close to HTML.

It can reflow but tries to paginate HTML ... the way printing a web page tries to paginate HTML, ha ha.

grotorea · 2024-02-01T14:16:16

I wonder a bit if we wouldn't have an easier time extracting data, resizing pages, etc. if we sent HTML files instead of PDFs. Are even half of PDFs printed at all?

niels_bom · 2024-02-03T20:27:19

Did you go the “display it, then OCR what’s displayed” route as a last-ditch effort?