
What is the smallest possible valid PDF? - oftenwrong
https://stackoverflow.com/questions/17279712/what-is-the-smallest-possible-valid-pdf/17280876#17280876
======
simonw
There's actually a good practical use-case for this: you're building software
that needs to detect PDF files (as a smaller detail of what it does, not its
primary purpose) and you want to include a tiny one in your unit tests.

I did that here with tiny images in JPEG, PNG and GIF:
[https://github.com/simonw/datasette-render-images/blob/maste...](https://github.com/simonw/datasette-render-images/blob/master/test_datasette_render_images.py)

~~~
eesmith
Doesn't a PDF detector only need to check if the first few bytes are '%PDF-1.'?

That is, do you need a "valid PDF" detector or "more likely a PDF than
anything else" detector?
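
A minimal sketch of the "more likely a PDF than anything else" kind of check, in Python. The function names are made up for illustration, and the prefix is shortened to `%PDF-` on the assumption that you also want a `%PDF-2.0` header to match. It says nothing about whether the rest of the file parses.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap sniff: True when the data starts with the PDF header marker.

    This only inspects the leading bytes; it does not validate the file.
    """
    # "%PDF-" rather than "%PDF-1." so a "%PDF-2.0" header matches as well.
    return data.startswith(b"%PDF-")


def sniff_file(path: str, length: int = 8) -> bool:
    """Hypothetical helper: read only the first few bytes of a file and sniff them."""
    with open(path, "rb") as handle:
        return looks_like_pdf(handle.read(length))
```

For a unit test, that means any byte string beginning with the marker is enough to exercise the detector, even if no PDF reader would open it.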

~~~
rovr138
Depends. If you need to parse it, that might still cause errors.

~~~
eesmith
Agreed.

Though when I think of "detector" I think more of
[https://en.wikipedia.org/wiki/File_(command)](https://en.wikipedia.org/wiki/File_\(command\))
and not something which verifies the file is in the correct format.

~~~
BiteCode_dev
Well...

    file 'setupTests.ts'
    setupTests.ts: Java source, ASCII text

I wouldn't put too much trust in file.

~~~
klodolph
I don’t know what you expected. File is just there to give a good guess at a
file’s format. There are a ton of reasons why this problem is hard, and there
are reasons to make “file” less accurate in order to make the implementation
simpler and more secure.

But it will work fine for PDFs, often enough.

------
LargoLasskhyfv
Not exactly on topic, but this reminds me of somewhere around the year 2000,
when I had to produce "documentation" in a hurry, under an hour before
delivering and deploying some systems. I _think_ I used Dia, but I'm not sure
anymore. Anyway, all the essential information in DIN A4 landscape mode, nice
diagrams with network structure, IP numbers and so on, ready.

Now what? Remember it was around the year 2000, what to use? Floppy disks of
course! Saved it, looked at it and thought it had gone wrong somehow because
it listed as 4KB only.

Used another floppy, low-level formatted with fdformat to be sure, taking
minutes, hurry, hurry! Saved again.

4KB!? WT..?

Booted up another system, loaded the PDF with different readers, worked.

Shrugged and hoped it would work at the customer's site too, which it did; they
even said it looked nice and clear.

If only they knew...

------
jansan
If you need something to watch I recommend "Funky File Formats" where Ange
Albertini shows for example how one file can be valid in multiple file formats
at the same time. Really amazing:
[https://media.ccc.de/v/31c3_-_5930_-_en_-_saal_6_-_201412291...](https://media.ccc.de/v/31c3_-_5930_-_en_-_saal_6_-_201412291400_-_funky_file_formats_-_ange_albertini#t=57)

------
matja
I wonder if afl-fuzz could do better. Context:
[https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...](https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thin-air.html)

------
sushisource
Can anyone explain to me why PDF persists as the most common document format
for "official" correspondence? It's absolute dog-vomit of a format, just
unbelievably overwrought and unfriendly. I wince every time I have to sign
one, or, god forbid, actually fill in some form.

Is the explanation really as lame as "They were there first and it stuck"?

~~~
est31
Can you name a competing format that ended up being "better" than PDF? Not
intending to say there is no such format. I'm genuinely curious.

~~~
tonyedgecombe
XPS is better in many ways, apart from its obvious failure in the market.

~~~
jahewson
XPS is wonderful but it arrived 20 years too late.

------
NanoWar
Here is a great collection of smallest X:
[https://github.com/mathiasbynens/small](https://github.com/mathiasbynens/small)

~~~
app4soft
So, _`pdf.pdf`_ [0] (130 Bytes) from this repo is the smallest valid PDF?

[0]
[https://github.com/mathiasbynens/small/blob/master/pdf.pdf](https://github.com/mathiasbynens/small/blob/master/pdf.pdf)

~~~
macintux
I suspect the answer depends on how you define “valid”.

~~~
app4soft
I find it hard to answer this question myself.

In terms of a "minimal valid raster image", it's a raster image of 1px × 1px;
in terms of a "minimal valid vector image", it's a vector image with a single
dot OR a single line segment.

But I can't imagine what a "minimal valid raster PDF" would mean.

------
e12e
> Acrobat opens it

Unfortunately, this is a valid measure of a "valid PDF" - it doesn't mean the
PDF will work in other readers.

I've had some luck with qpdf --check - as far as I can recall, I've not seen
problems with a file that passes.

[http://qpdf.sourceforge.net/files/qpdf-manual.html#ref.testi...](http://qpdf.sourceforge.net/files/qpdf-manual.html#ref.testing-options)

Also worth looking into is mutool (from MuPDF) with its clean command.

------
curveto
Technically, the %PDF doesn't have to occur at byte 0. But, like many others
inferred, that pushes onward toward real world structure (vs. an academically
correct but useless PDF).

If you want a fully covered example it'll need a trailer, xref and at least
one obj reference. ...and there are TWO flavors of those (linearized and not).
...and flate coded and not.

So, for a test harness you'd actually want a collection of small files.
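
As a rough sketch of one member of such a collection, here is a Python builder for the plainest flavor only: non-linearized, no flate-coded streams, classic xref table. The object layout (catalog, page tree, one blank page) and the function name are my own choices for illustration, and the output has not been run through a validator, so treat it as a starting point rather than a certified-valid file.

```python
def build_small_pdf() -> bytes:
    """Assemble a one-page blank PDF with header, objects, xref and trailer."""
    # Assumed minimal object set; each entry becomes "N 0 obj ... endobj".
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 3 3] /Resources << >> >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)  # startxref must point at the "xref" keyword
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is the head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


if __name__ == "__main__":
    print(len(build_small_pdf()), "bytes")
```

Computing the offsets while writing avoids hand-counted xref entries, which are the easiest part of a hand-made PDF to get wrong. Linearized and flate-coded variants would each need their own fixture.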

~~~
bronson
Allowing %PDF anywhere also leads to false positives.

[https://github.com/minad/mimemagic/issues/4](https://github.com/minad/mimemagic/issues/4)

[https://github.com/minad/mimemagic/issues?q=is%3Aissue+is%3A...](https://github.com/minad/mimemagic/issues?q=is%3Aissue+is%3Aopen+pdf)
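
A small Python sketch of the trade-off. The function names are hypothetical, and the 1024-byte window is an assumption modeled on the leniency usually attributed to Acrobat, not something taken from the mimemagic issues above.

```python
MARKER = b"%PDF-"


def header_at_start(data: bytes) -> bool:
    """Strict rule: the marker must be the very first bytes."""
    return data.startswith(MARKER)


def header_in_window(data: bytes, window: int = 1024) -> bool:
    """Lenient rule (assumed window size): the marker may appear anywhere early on."""
    return MARKER in data[:window]


# A plain-text note that merely mentions the marker is not a PDF.
note = b"A PDF file begins with the marker %PDF-1.4 on its first line.\n"

print(header_at_start(note))   # False
print(header_in_window(note))  # True: the lenient rule misfires here
```

The lenient rule accepts real files with junk before the header, at the cost of flagging any text that happens to mention the marker.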

