
I've looked into PDFs a little for a personal project that worked with auto-generated PDFs, and I've found they vary wildly in the amount of bloat, much of which seems to come from embedded fonts. If you actually look at the raw bytes of a PDF file, the structure isn't that hard to understand, and I think the actual text is stored mainly as ASCII/UTF-8 (though it often sits inside compressed streams). So if I wanted to store a bunch of PDFs from the same source really efficiently, it seems like it should be possible to copy out the "bloat" sections (likely embedded fonts) that are completely identical across files, use them as a dictionary in a custom compressor, and then just use zlib for the rest.
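A rough sketch of that idea, using zlib's own preset-dictionary support instead of a fully custom compressor (so this is one possible approach, not something I've run on real PDFs). The helper names and the fake "shared section" are made up for illustration; a real version would have to extract the identical byte ranges from the PDFs first.

```python
import hashlib
import zlib


def fake_shared_section(blocks: int = 64) -> bytes:
    # Hypothetical stand-in for an embedded font that is byte-identical
    # across files: 32 * blocks bytes of hard-to-compress data.
    return b"".join(hashlib.sha256(bytes([i])).digest() for i in range(blocks))


def compress_with_dict(data: bytes, shared: bytes) -> bytes:
    # zdict primes the deflate window with the shared bytes, so any copy
    # of them inside `data` is encoded as short back-references.
    comp = zlib.compressobj(level=9, zdict=shared)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, shared: bytes) -> bytes:
    # The exact same dictionary must be supplied to decompress.
    decomp = zlib.decompressobj(zdict=shared)
    return decomp.decompress(blob) + decomp.flush()


shared = fake_shared_section()
# Toy "document": a header, the shared section, and a bit of unique text.
doc = b"%PDF-1.4\n" + shared + b"\nBT (Hello from file 1) Tj ET\n"

plain = zlib.compress(doc, 9)
packed = compress_with_dict(doc, shared)
restored = decompress_with_dict(packed, shared)
print(len(doc), len(plain), len(packed), restored == doc)
```

One caveat: deflate only looks back 32 KiB, so a zlib dictionary can't usefully hold more than that. Embedded fonts are often bigger, in which case you'd need to store the shared sections separately and reference them, or use a compressor that supports larger dictionaries.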

I've also noticed that some PDF generators include far, far more bloat than others, though I'm not sure why.



