
Lopdf: Rust library for PDF document manipulation - adamnemecek
https://github.com/J-F-Liu/lopdf
======
userbinator
PDF is an unusual and awkward hybrid format containing both textual (complete
with comments!) and binary data. Thanks to this and other dubious design
choices, it's far easier to write a PDF than to read one. One of the most
memorable examples is indirect object references; when reading an array, you
can only distinguish them from integers once you've already read and parsed
_2_ integers, and then see the letter 'R' (e.g. "5 0 R"). A trivial
transposition that puts the 'R' first (e.g. "R 5 0") would've simplified
parsing greatly, since then the first character is enough to know what comes
next. I remember having to deal with lots of trivial-yet-easily-fixable
annoyances like this when I worked on some PDF parsing code.
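
To make the lookahead problem concrete, here is a minimal sketch, assuming a simplified array that holds only whitespace-separated integers and references. The `Item` type and `parse_items` function are hypothetical names made up for this illustration, not part of lopdf or any other library.

```rust
/// One element of a simplified PDF array: an integer or an indirect reference.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
enum Item {
    Int(i64),
    Ref { id: u32, generation: u16 },
}

/// Parses the whitespace-separated tokens found inside a PDF array.
/// Tokens that are neither integers nor part of a reference are skipped.
fn parse_items(src: &str) -> Vec<Item> {
    let tokens: Vec<&str> = src.split_whitespace().collect();
    let mut items = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        // We cannot classify tokens[i] on its own: we must peek two tokens
        // ahead, because "<int> <int> R" is one reference, not two integers.
        if i + 2 < tokens.len() && tokens[i + 2] == "R" {
            if let (Ok(id), Ok(generation)) =
                (tokens[i].parse::<u32>(), tokens[i + 1].parse::<u16>())
            {
                items.push(Item::Ref { id, generation });
                i += 3;
                continue;
            }
        }
        if let Ok(n) = tokens[i].parse::<i64>() {
            items.push(Item::Int(n));
        }
        i += 1;
    }
    items
}
```

Calling `parse_items("1 2 5 0 R 7")` yields two integers, one reference to object 5, and a final integer. With an "R 5 0" ordering, the `tokens[i + 2]` peek would not be needed at all.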

~~~
Someone
It’s easiest to see it as what it is: a textual format that evolved allowing
binary blobs inside it.

You can write a PDF in Notepad, or any text editor, and, if you know your
PostScript, it isn’t a bad experience as long as you forget about modern
features, except for that table of contents (the cross-reference table) at
the end of the file.

See
[https://brendanzagaeski.appspot.com/0004.html](https://brendanzagaeski.appspot.com/0004.html)

~~~
userbinator
It's a textual format... with byte offsets. That's the most unusual part, and
it's been there since the beginning (PDF 1.0).
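
A rough sketch of what those byte offsets mean for a writer, assuming a classic (uncompressed) cross-reference table and assuming the first object passed in is the document catalog. `write_with_xref` is a hypothetical helper written for this illustration.

```rust
/// Serializes object bodies into a minimal PDF-like file and returns the
/// bytes together with the byte offset recorded for each object.
/// Assumption: the first body is the catalog, so the trailer points at "1 0 R".
fn write_with_xref(bodies: &[&str]) -> (Vec<u8>, Vec<usize>) {
    let mut out: Vec<u8> = Vec::new();
    let mut offsets = Vec::new();
    out.extend_from_slice(b"%PDF-1.4\n");
    for (i, body) in bodies.iter().enumerate() {
        // Remember where each object starts, counted in bytes from file start.
        offsets.push(out.len());
        out.extend_from_slice(format!("{} 0 obj\n{}\nendobj\n", i + 1, body).as_bytes());
    }
    let xref_start = out.len();
    out.extend_from_slice(format!("xref\n0 {}\n", bodies.len() + 1).as_bytes());
    // Entry 0 heads the free list; every entry is exactly 20 bytes long.
    out.extend_from_slice(b"0000000000 65535 f \n");
    for off in &offsets {
        out.extend_from_slice(format!("{:010} 00000 n \n", off).as_bytes());
    }
    // The trailer ends with the byte offset of the xref table itself.
    out.extend_from_slice(
        format!(
            "trailer\n<< /Size {} /Root 1 0 R >>\nstartxref\n{}\n%%EOF\n",
            bodies.len() + 1,
            xref_start
        )
        .as_bytes(),
    );
    (out, offsets)
}
```

This is why hand-editing a PDF in a text editor is fragile: inserting a single character shifts every later offset, so the table and the `startxref` value have to be recomputed.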

~~~
peapicker
While PDF uses different syntax, it is conceptually similar to the IFF file
format from 1985 (the basis of which is still used in some file formats like
PNG and AIFF) in being a format that can contain multiple types of data. PDF
is a little bit more focused on being an object-based file format specifically,
but one could have made PDF fit into IFF boxes easily in an alternative
universe.

------
muizelaar
Here's the beginnings of a pdf viewer built with lopdf and WebRender:
[https://github.com/srijs/rpdf/commits/master](https://github.com/srijs/rpdf/commits/master)

------
netghost
I see a lot of complaints about the PDF format here, what's the alternative
though? Is there another format that achieves the same goals which people
should rally around? Just curious given PDF's ubiquity.

(For the record, I had to deal with text encodings in PDFs, and yes, it was a
pain)

~~~
bsaul
PDF needs a « v2 » format that simply drops backward compatibility for
everything that predates UTF-8, along with open converter tools to this new
format.

Only support one type of font embedding, with a single encoding (make
everything UTF-8): it won’t change anything noticeable regarding file
size, and will greatly help with at least parsing text.

I’ve only dealt with text parsing, so that’s the only easy improvement I see,
but I’m pretty sure following the same logic for graphic content should be
possible.

~~~
fulafel
There's PDF/A.

~~~
bsaul
Thanks for the info, I'd never heard of that. It's very refreshing to see
that people are indeed trying to update the standard to remove the cruft.

Side note: it seems the PDF/A format is open, yet the specification is kept
behind a paywall?

------
timClicks
OT perhaps but does anyone know a readable reference for PDF opcodes? The
example code feels somewhat opaque:

    
    
    let content = Content {
        operations: vec![
            Operation::new("BT", vec![]),
            Operation::new("Tf", vec!["F1".into(), 48.into()]),
            Operation::new("Td", vec![100.into(), 600.into()]),
            Operation::new("Tj", vec![Object::string_literal("Hello World!")]),
            Operation::new("ET", vec![]),
        ],
    };

~~~
mkl
The specification itself [1] is mostly pretty readable. For the opcodes, you
want Appendix A.

[1] "PDF Reference, Sixth Edition, version 1.7" at
[https://www.adobe.com/devnet/pdf/pdf_reference_archive.html](https://www.adobe.com/devnet/pdf/pdf_reference_archive.html)
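
For the example in question, it may help to see the content stream those operations stand for. The sketch below uses only the standard library; `Operand`, `encode` and `hello_world` are hypothetical stand-ins for illustration and are not lopdf's types or its encoder. It also assumes literal strings need no escaping, which is not true for text containing parentheses or backslashes.

```rust
/// Simplified operand kinds (hypothetical; lopdf's real `Object` has more).
enum Operand {
    Name(&'static str),
    Int(i64),
    Literal(&'static str),
}

/// Writes each operation in PDF's postfix order: operands first, then operator.
fn encode(ops: &[(&str, Vec<Operand>)]) -> String {
    let mut out = String::new();
    for (operator, operands) in ops {
        for operand in operands {
            match operand {
                Operand::Name(n) => out.push_str(&format!("/{} ", n)),
                Operand::Int(i) => out.push_str(&format!("{} ", i)),
                // Assumption: no parentheses or backslashes need escaping.
                Operand::Literal(s) => out.push_str(&format!("({}) ", s)),
            }
        }
        out.push_str(operator);
        out.push('\n');
    }
    out
}

/// The same five operations as the example above.
fn hello_world() -> String {
    encode(&[
        ("BT", vec![]),                                            // begin text object
        ("Tf", vec![Operand::Name("F1"), Operand::Int(48)]),       // font F1 at size 48
        ("Td", vec![Operand::Int(100), Operand::Int(600)]),        // move text position
        ("Tj", vec![Operand::Literal("Hello World!")]),            // show the string
        ("ET", vec![]),                                            // end text object
    ])
}
```

`hello_world()` produces five lines: `BT`, `/F1 48 Tf`, `100 600 Td`, `(Hello World!) Tj`, `ET`. Read that way, the opcodes are just PostScript-style postfix commands.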

------
fourier_mode
An example of the resulting generated PDF would've been great.

------
propter_hoc
How does this compare to PDFBox (or the free versions of iText)?

------
fxfan
You cannot even extract (or search for) non-ASCII strings if the PDF file was
created in a certain way. That's so 1980.

PDF is not exactly guaranteed to be portable, and is a format that needs to be
dumped by anyone who cares about global portability.

~~~
afiori
Still, for all its downsides, it is the best approximation of a portable and
reliable "just works" document format.

