
Minimal PDF - ingve
https://brendanzagaeski.appspot.com/0004.html
======
aidos
I've spent much of the last year down in the internals of pdfs. I recommend
looking inside a PDF to see what's going on. PDF gets a hard time but once
you've figured out the basics, it's actually pretty readable.

Some top tips: if you decompress the streams first, you'll get something you
can read and edit with a text editor:

    
    
        mutool clean -d -i in.pdf out.pdf
    

If you hand mess with the PDF, you can run it through mutool again to fix up
the object positions.
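For a sense of what "decompressing the streams" means, here is a minimal Python sketch. It is not what mutool does internally; it assumes the streams use the common Flate filter and that a crude `stream ... endstream` regex is good enough to find them. The fragment it runs on is hand-built for illustration.

```python
import re
import zlib

# Hypothetical, hand-built fragment of one PDF object with a Flate-compressed stream.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(content)
pdf_fragment = (
    b"4 0 obj\n<< /Length " + str(len(compressed)).encode()
    + b" /Filter /FlateDecode >>\nstream\n" + compressed + b"\nendstream\nendobj\n"
)

# Crude stream finder; a real parser would honour /Length instead.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def inflate_streams(data: bytes) -> list[bytes]:
    """Return the inflated bytes of every stream that zlib can decode."""
    results = []
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates the end-of-line bytes left before "endstream".
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate data (image, other filter, encrypted); skip it.
            continue
    return results

print(inflate_streams(pdf_fragment))
```

For real files, a dedicated tool such as mutool or qpdf is the safer route, since it also handles other filters and rewrites the offsets.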

Text isn't flowed / laid out like HTML. Every glyph is more or less manually
positioned.

Text is generally done with subset fonts. As a result characters end up being
mapped to \1, \2 etc. So you can't normally just search for strings, but you
can often (though not always easily) find the characters from the Unicode map.
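As a rough illustration of the Unicode map lookup, here is a Python sketch that reads the `beginbfchar` section of a ToUnicode CMap and uses it to decode character codes. The CMap text and the codes are made up for the example, and the sketch assumes one-byte codes and ignores `bfrange` entries, which real files often use.

```python
import re

# Hypothetical ToUnicode CMap excerpt: code 1 -> "A", 2 -> "a", 3 -> "b".
SAMPLE_CMAP = """
3 beginbfchar
<01> <0041>
<02> <0061>
<03> <0062>
endbfchar
"""

def parse_bfchar(cmap: str) -> dict[int, str]:
    """Map character codes to Unicode strings from beginbfchar blocks."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap, re.DOTALL):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # Destination values are UTF-16BE encoded hex strings.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_codes(codes: bytes, mapping: dict[int, str]) -> str:
    """Decode one-byte character codes; unknown codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)

mapping = parse_bfchar(SAMPLE_CMAP)
print(decode_codes(b"\x01\x03\x02", mapping))
```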

~~~
xyzxyz998
That was informative, thank you.

I think a lot of people in the dev/power user community wouldn't mind paying $1
for a Kindle ebook where you note all your findings.

There have been so many instances where I wanted to do stuff with pdfs but
ended up deflated.

> subset fonts

So you mean if a font has been embedded with three glyphs, 0x41=A, 0x61=a,
0x62=b, then string Aba would be \1\3\2?

~~~
kccqzy
The PDF reference is freely available and pretty readable too. I would
recommend just reading that.

To answer your question, subsetting a font just means taking a portion of its
glyphs, and it doesn't imply remapping. In fact, for almost all sane PDF files
you will find ASCII characters mapped to themselves, making text search within
a decompressed PDF possible. My dirty watermark remover script basically uses
qpdf to decompress the thing and then uses regular expressions to search for Tj
or TJ right after the specified string.
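A minimal Python sketch of that idea, not the actual script: it assumes the content stream is already decompressed, that the watermark is drawn as a plain literal string followed by `Tj`, and that the string contains no escaped parentheses. The function name and sample stream are made up.

```python
import re

def blank_out_text(content: bytes, needle: bytes) -> bytes:
    """Replace `(needle) Tj` with an empty string operand, keeping the operator."""
    pattern = re.compile(rb"\(" + re.escape(needle) + rb"\)\s*Tj")
    return pattern.sub(b"() Tj", content)

# Hypothetical decompressed content stream with a watermark and body text.
page = b"BT /F1 48 Tf (DRAFT) Tj ET\nBT /F1 12 Tf (Body text) Tj ET\n"
cleaned = blank_out_text(page, b"DRAFT")
print(cleaned.decode("ascii"))
```

Handling `TJ` arrays, hex strings, or remapped subset fonts would take more work, and the file still needs its offsets fixed up afterwards (for example by running it through qpdf or mutool again).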

~~~
PaulHoule
This is a copy of the ISO 32000 PDF specification:

[http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/p...](http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf)

This is a long document but it is very well written, if you read it on the bus
or while you're waiting for your compiler to finish, you will get to
understand it.

~~~
j45
Thanks for sharing!

------
j_s
See also on the same site: _Hand-coded PDF tutorial_ |
[https://brendanzagaeski.appspot.com/0005.html](https://brendanzagaeski.appspot.com/0005.html)

If you need more, the "free" (trade for your email) e-book from Syncfusion
_PDF Succinctly_ demonstrates manipulation barely one level of abstraction
higher (not calculating any offsets manually):
[https://www.syncfusion.com/resources/techportal/details/eboo...](https://www.syncfusion.com/resources/techportal/details/ebooks/pdf)

"With the help of a utility program called pdftk[1] from PDF Labs, we’ll build
a PDF document from scratch, learning how to position elements, select fonts,
draw vector graphics, and create interactive tables of contents along the
way."

[1] [https://www.pdflabs.com/tools/pdftk-server/](https://www.pdflabs.com/tools/pdftk-server/)

~~~
logicallee
>If you need more, the "free" (trade for your email)

Just a note that Google lets you get a new email address that isn't spammable,
through security through obscurity:

-> You can use your gmail address, add a + after it, and add a keyword. So if you are jsmith@gmail.com you can give out jsmith+syncfusionpdfsuccintly@gmail.com and then later if that starts getting spammed you can redirect it.

NOTE:

This is an incorrect solution (Google, please fix this) because anyone can run
a regex removing the + part.

Instead the correct solution is that if you have gmail open, in a single click
you should be able to generate a high-entropy gmail address (that does not
deplete the namespace) and link it on _your_ end with
"syncfusionpdfsuccintly".

If I already have gmail open, 7 seconds to create a new gmail address as
follows:

    
    
      1.  Click something to start the process
    
      2.  Type "syncfusionpdfsuccintly" to tag it on my end
    
      3.  Click something to copy a resulting high-entropy gmail name into the clipboard.
    

I should then be able to paste it into a form, get it delivered straight into
my inbox ( _never_ spam), and redirect it to spam if it starts getting
spammed.

This would allow people to contact us without ever getting into spam, while
entirely removing their ability to contact us if this email address starts
getting spammed. There are no downsides.

I believe Google's engineers are smart enough to move from security through
obscurity (relying on the knowledge that no spammer can _ever_ invent and run
the exact regex s/\+[.]+@/@/g to remove the security through obscurity, as
this would entirely break this security, exposing the underlying "protected"
email addresses) to something that works.

Until that day comes you can rely on the security through obscurity to give
out a secure email address that can't be spammed. Just add a + and a tag!

Please.

Google: I believe you are smart enough to understand this comment and
implement this solution, which can be prototyped in 30 minutes and solves the
spam problem forever. You can do it! I believe in you. You're 99.999% there
and your security through obscurity works very well for me. I use it.

I hope you will go above and beyond and solve the remaining 0.001%. It would
just make me feel better to know that a 13-character regex couldn't defeat
your solution.
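To make the point concrete, here is a short Python sketch of stripping a plus tag. As quoted, the expression uses `[.]+`, which matches only literal dots, so this sketch uses `[^@]*` instead; the function name is made up for illustration.

```python
import re

# Everything from the first "+" up to (not including) the "@".
PLUS_TAG = re.compile(r"\+[^@]*(?=@)")

def strip_plus_tag(address: str) -> str:
    """Return the address with any +tag removed from the local part."""
    return PLUS_TAG.sub("", address, count=1)

print(strip_plus_tag("jsmith+syncfusionpdfsuccintly@gmail.com"))
```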

~~~
apostacy
I wrote something in node.js that does exactly this. But I have only used it
for personal use. I'm honestly surprised that nobody has done this already.

Right now it just silently drops expired addresses. But it is so satisfying to
think about bouncing (but stuff like bouncing behavior is something you have
to consider when running a mail service).

I was thinking of turning it into a service, but I'd have to read up on how to
scale it. Running an SMTP server takes a lot of care. I've found that just
using nearlyfreespeech.net's mail forwarding is most reliable to receive
emails. So I do that for now, since it is on a small scale.

I just got so frustrated with how this problem has such an obvious technical
solution. At least for us users, anyway. It's not a solution for the
marketers.

I strongly suspect that Google is very reluctant to do anything to make the
email landscape unstable. I think that if Google started offering this, it
would shake up so much of their business.

~~~
logicallee
As I mentioned, Google already offers this. Just add a tag with + to your
existing gmail address and you instantly create a tagged new one.

~~~
ant6n
Except as pointed out, a spammer (etc) can remove the tag and get the original
address using a regex.

~~~
logicallee
only until Google fixes it as suggested. they have some smart people there and
I believe in them!

------
tptacek
The biggest complexity (and security) problem with PDF is that it's also
effectively an archive format, in which more or less every display file format
conceived of before ~2007 can be embedded.

~~~
cgb223
Maybe it's time for a PDF-2017 standard that drops support for those older
exploitable formats

~~~
colejohnson66
If they’re exploitable, how would a new version help? Attackers would just use
the older, exploitable versions. And if PDF viewers only allowed the newer
version, you’d break support with every PDF made.

~~~
Dylan16807
>And if PDF viewers only allowed the newer version, you’d break support with
every PDF made.

That is not at all something that would have to be true.

------
ekr
See also klange's resume:
[https://github.com/klange/resume](https://github.com/klange/resume). Resume
pdf that's also a valid ISO 9660, bootable toaru OS image.

~~~
userbinator
Looks like he did it "the hard way" --- and unfortunately it's not a truly
valid PDF since the startxref isn't within the last 1KB of the file and the
version number in the header is corrupt. Not all PDF readers will accept that.

On the other hand, it is possible to make a completely valid PDF and bootable
ISO. The first 32KB of an ISO is officially "unused", which is probably why
GRUB decided to put itself there, but that can be relocated somewhere else ---
the El Torito boot descriptor will need to be updated to point to it --- and
the PDF signature (which can be a valid one) and as many objects as will fit
can be put in that area, with the rest anywhere else. The xref table can be
moved to the very end and the offsets updated to point to the objects.
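The two problems mentioned above can be checked with a few lines of Python. This is only a rough sketch of those two heuristics, not a validator, and strictness varies between readers; the function name and sample bytes are made up.

```python
import re

def quick_pdf_checks(data: bytes) -> dict[str, bool]:
    """Check the header version and whether startxref sits in the last 1 KB."""
    return {
        "header_ok": re.match(rb"%PDF-\d\.\d", data) is not None,
        "startxref_near_end": b"startxref" in data[-1024:],
    }

# Hypothetical inputs: a well-formed skeleton, and one with a corrupt
# header plus 2 KB of padding after the trailer.
good = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\nstartxref\n9\n%%EOF\n"
bad = b"%PDF-\xff.\xff\n1 0 obj\n<< >>\nendobj\nstartxref\n9\n%%EOF\n" + b"\0" * 2048

print(quick_pdf_checks(good))
print(quick_pdf_checks(bad))
```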

------
ad_hominem
I've encountered .pdf files which internally embed a proprietary Adobe
extension called XFA[1]. I think they are created using Adobe's LiveCycle
product.

They are a real pain because they render fine in Adobe Acrobat, but most other
PDF renderers (including browser built-in ones) can't render them. Instead
they render a blob of interstitial "loading..." text that is also embedded in
the PDF (which the XFA rendering would then overwrite). It was a pain to me
personally because I had to figure out a way to do programmatic form-filling
of some fillable form XFAs, and most PDF libraries don't work with them (they
expect traditional AcroForms fillable forms).

But in reading the XFA specification I found it interesting it had its own
JavaScript interpreter (including supporting XHR requests as part of some
internet-integrated form-filling feature) and another proprietary scripting
language called FormCalc. I guess it opened my eyes to PDFs being a container
format and the kinds of things they allow you to embed.

[1]: [https://en.wikipedia.org/wiki/XFA](https://en.wikipedia.org/wiki/XFA)

~~~
userbinator
That's unfortunate. I guess the same thing that happened to HTML pages
(turning into less accessible JS-based SPAs) is now happening to PDFs.

------
kazinator
Plain text ... but with _hard offsets_ ... encoded as _decimal integers_.
Yikes!
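For anyone who hasn't seen it, here is a small Python sketch of what those hard offsets look like in practice. It builds a blank one-page PDF and records the byte offset of each object so the xref table can list them as 10-digit decimal integers. The function name and object contents are just for illustration, and it assumes every object is in use with generation 0.

```python
def build_pdf(objects: list[bytes]) -> bytes:
    """Serialize numbered objects and append a matching xref table and trailer."""
    out = bytearray(b"%PDF-1.1\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the reserved object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Root 1 0 R /Size %d >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 300 144] /Resources << >> >>",
]
pdf = build_pdf(objects)
print(pdf.decode("ascii"))
```

Editing any object by hand shifts every later offset, which is why tools like mutool or qpdf are handy for fixing up the table afterwards.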

------
fizixer
This is good but Postscript is even better. Someday I'll learn it and see what
I can do with it.

~~~
protomyth
Very much worth learning, if for nothing else being an extremely cool stack
language. I learned it for my first job, where I only had Turbo C 2.0, foxbase,
and an HP4 printer with the Postscript module to do graphical reporting on a
dataset.

chris_st's recommendations are how I learned it.

~~~
pjmlp
You lucky guy.

I remember going through HP and Epson printer manuals, writing down their
control escape codes into an xBase table so that our Clipper application could
talk to the printers and do the respective formatting.

Having access to a PS printer would have been a much more positive experience.

------
cmurf
I'm frustrated by governments using PDF for fill-in forms, and yet open source
tools are very weak in this area.

This is not better than paper and pencil, in terms of accessibility. And we
need to do better somehow.

------
amenghra
If you like this you might enjoy this repo:
[https://github.com/mathiasbynens/small](https://github.com/mathiasbynens/small)

------
mp3geek
Would be nice if browsers would support saving pages directly as PDF using
their own PDF libraries.

~~~
abritinthebay
Everything on the Mac has Print to PDF natively and it’s _so nice_

It’s one of the things I miss most when using another OS.

~~~
rcarmo
Not just print, but also the ability to natively _manipulate_ PDF, because the
Mac still has the display Postscript stack from the NeXT era and PDF is
essentially an envelope for it.

It's underused these days, but still available to apps, and they can
interchange data in that format. Linux support for PDF isn't anywhere near as
integrated.

~~~
duskwuff
> the Mac still has the display Postscript stack from the NeXT era and PDF is
> essentially an envelope for it.

This is a common misconception. Display PostScript was never present in any
released version of macOS. It was replaced by the Quartz renderer, which is
rather different.

Quartz can display and output to PDFs, but it does not use PDF as an internal
format.

------
eru
> Most PDF files do not look readable in a text editor. Compression,
> encryption, and embedded images are largely to blame. After removing these
> three components, one can more easily see that PDF is a human-readable
> document description language.

Of course, PDF is intentionally so weird: it was a move by Adobe because other
companies were getting too good at handling postscript.

Embedding custom compression inside your format is seldom worth it: .ps.gz is
usually smaller than pdf.

~~~
kccqzy
It's hardly custom compression. It's just deflate and ZIP, IIRC.

~~~
eru
Yes, true. But it's built into the format, instead of transparently applied
afterwards.

~~~
kccqzy
The same could be said for PNG and many other common formats that use zlib.

------
Noctem
This page was helpful to me a couple years ago while crafting the tiny PDF
used for testing in Homebrew.
[https://github.com/Homebrew/legacy-homebrew/pull/36606](https://github.com/Homebrew/legacy-homebrew/pull/36606)

------
jimjimjim
PDF is like c++

it's used everywhere because you can do everything with it.

This also leads to the problem where you can do anything with it.

so each industry is kind of coming up with their own subset of pdf that
applies some restrictions in the hopes of making them verifiable.

the downside is that these subsets slowly start bloating until they allow
everything anyway.

i'm looking at you PDF/A. grr.

------
ilaksh
PDF is literally the worst possible format for document exchange because it
has the most unnecessary complexity of all document formats, which makes it
the hardest to access. But popularity and merit are two totally different
things.

~~~
abritinthebay
That’s unfair. It’s terrible from a technical perspective (due to cruft,
mostly) but nothing comes close from a deliverable standpoint.

It’s simply the most reliable print-ready format. It is a Portable Document
Format in every way that matters _to the end user_

There’s a reason one of the main uses of LaTeX is outputting to PDF

~~~
kingkongjaffa
Completely agree, as a dumb user of latex I only care that I can make a
document and that it looks the same on every computer or browser or printed
out.

That solution just happens to be latex, especially since virtually all
computers will have some way of viewing and printing pdfs by default.

~~~
abritinthebay
I really wish there was a better solution to typesetting than LaTeX (well,
XeTeX if you’re serious about Unicode & language support I guess, _which you
should be_ )

I hate it _so much_ , if it wasn’t for its excellent abilities in specific
areas (hyphenation, etc) I’d much prefer CSS.

Yes, TeX makes me prefer _CSS_ for layout. That’s how painful I find it.

