
“PDF is a fabulous format”

I will never forgive the pain PDF caused me when I worked on a project to parse millions of PDF files from various sources. Just reconstructing paragraphs was a huge effort, not to mention parsing tables. I think we should do better for something that’s basically a standard. PDF manuals also suck big time.




PDF is supposed to be a printer format, not a word processing document format. While I too would love to nail down a PDF subset to be a standard (for example, requiring the accessibility tags that make text extraction easy), perhaps trying to create a hybrid format, one that satisfies both printers and resizable windows, is already an impossible goal.

(I've always had to keep my love of PDF a secret from fellow nerds. But here's another secret: I like printing documents out from time to time.)


I really appreciate what PDF can accomplish, but I also really dislike that it turns into a black box. There really ought to be something that can describe a document structure and also describe document layout in a durable and portable manner. In the range of XML/JSON <-> HTML+CSS <-> PDF <-> PS <-> RAW, it really does feel like there's something missing between HTML and PDF.

And it can't be LaTeX, because the document shouldn't be a programming language at all. "The document is a program" has proven itself to be a terrible scheme overall.


PDF includes optional document structure information. Most PDF creation software chooses to not generate it, though.


ePub is kind of trying to be that? Or maybe that hews too close to HTML.

It can reflow but tries to paginate HTML ... the way printing a web page tries to paginate HTML, ha ha.


I wonder a bit if we wouldn't have an easier time extracting data, resizing pages, etc., if we sent HTML files instead of PDF. Are even half of PDFs printed at all?


Did you go the “display it, then OCR what’s displayed” route as a last-ditch effort?



