I will never forgive the pain PDF caused me when I worked on a project to parse millions of PDF files from various sources. Just reconstructing paragraphs was a huge effort, not to mention parsing tables. I think we should do better for something that’s basically a standard. PDF manuals also suck big time.
PDF is supposed to be a printer format, not a word processing document format. While I too would love to nail down a PDF subset to be a standard (for example, requiring the accessibility tags that make text extraction easy), perhaps trying to create a hybrid format, one that satisfies both printers and resizable windows, is already an impossible goal.
(I've always had to keep my love of PDF a secret from fellow nerds. But here's another secret: I like printing documents out from time to time.)
I really appreciate what PDF can accomplish, but I also really dislike that it turns into a black box. There really ought to be something that can describe a document structure and also describe document layout in a durable and portable manner. In the range of XML/JSON <-> HTML+CSS <-> PDF <-> PS <-> RAW, it really does feel like there's something missing between HTML and PDF.
And it can't be LaTeX, because the document shouldn't be a programming language at all. "The document is a program" has proven itself to be a terrible scheme overall.
I wonder a bit if we wouldn't have an easier time extracting data, resizing pages, etc. if we sent HTML files instead of PDF. Are even half of PDFs printed at all?