What is the smallest possible valid PDF?

simonw · on March 20, 2020

There's actually a good practical use-case for this: you're building software that needs to detect PDF files (as a smaller detail of what it does, not its primary purpose) and you want to include a tiny one in your unit tests.

I did that here with tiny images in JPEG, PNG and GIF https://github.com/simonw/datasette-render-images/blob/maste...

anonydsfsfs · on March 20, 2020

Check out https://github.com/mathiasbynens/small

hombre_fatal · on March 20, 2020

Kinda silly including 0-byte files which is most of them. Now you're just building a list of things that take empty input.

They should move those to a simple list in the README and reserve the repo for non-empty files.

lioeters · on March 20, 2020

That is indeed a nice one, "Smallest possible syntactically valid files of different types".

Following the GitHub link (datasette-render-images) in the comment you replied to, there's a code comment with a link to the same library (small).

eesmith · on March 20, 2020

Doesn't a PDF detector only need to check if the first few bytes are '%PDF-1.' ?

That is, do you need a "valid PDF" detector or "more likely a PDF than anything else" detector?
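A minimal sketch of the "more likely a PDF than anything else" check described above, assuming the header sits at byte 0. The function name `looks_like_pdf` is made up for illustration, and the pattern is widened slightly from '%PDF-1.' to any single-digit major version so that PDF 2.0 headers ('%PDF-2.0') also match.

```python
import re

# Assumption: the header is "%PDF-M.N" at byte 0, e.g. b"%PDF-1.4".
_PDF_HEADER = re.compile(rb"%PDF-\d\.\d")


def looks_like_pdf(data: bytes) -> bool:
    """Return True if data begins with something shaped like a PDF header.

    This only sniffs the first bytes; it says nothing about whether the
    rest of the file can actually be parsed.
    """
    return _PDF_HEADER.match(data) is not None
```

This is a detector in the `file` sense only: a file that passes can still be unparseable.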

jahewson · on March 21, 2020

Realistically, yes, but strictly speaking PDFs start with a footer and most PDF readers will accept highly corrupt files.

rovr138 · on March 20, 2020

Depends. If you need to parse it, that might still cause errors.

eesmith · on March 20, 2020

Agreed.

Though when I think of "detector" I think more of https://en.wikipedia.org/wiki/File_(command) and not something which verifies the file is in the correct format.

BiteCode_dev · on March 20, 2020

Well...

file 'setupTests.ts'
setupTests.ts: Java source, ASCII text

I wouldn't put too much trust in file.

klodolph · on March 20, 2020

I don’t know what you expected. File is just there to give a good guess at a file’s format. There are a ton of reasons why this problem is hard, and there are reasons to make “file” less accurate in order to make the implementation simpler and more secure.

But it will work fine for PDFs, often enough.

jtvjan · on March 20, 2020

The heuristic used for Java source looks like this:

0 regex \^import.*;$ Java source
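To see why a TypeScript file trips that rule, here is a rough Python approximation. It assumes the backslash in the rule above is an escaping artifact and the pattern is `^import.*;$` applied line by line; the sample file contents are hypothetical, and `file` itself may apply the regex somewhat differently.

```python
import re

# Approximation of the magic rule: any line that starts with "import"
# and ends with ";" is taken as evidence of Java source.
JAVA_HEURISTIC = re.compile(r"^import.*;$", re.MULTILINE)


def matches_java_heuristic(text: str) -> bool:
    """Return True if any line looks like a Java import statement."""
    return JAVA_HEURISTIC.search(text) is not None


# Hypothetical contents, for illustration only.
ts_source = 'import "@testing-library/jest-dom";\n'
java_source = "import java.util.List;\n\npublic class A {}\n"
plain_source = "const x = 1;\n"
```

Both `ts_source` and `java_source` match, since an ES module import ending in a semicolon has the same shape as a Java import.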

meshy · on March 20, 2020

That use is indeed exactly what inspired the question!

pmiller2 · on March 20, 2020

I'm curious why you needed the absolute smallest PDF file for testing? As an intellectual exercise, golfing the PDF format sounds like a bit of fun, but given that https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pd... is 13KB, I would think you could load that in a test suite on pretty much any reasonable, non-embedded platform. And, why would you want to load it on an embedded platform, anyway?

kadoban · on March 21, 2020

Testing near/at boundary conditions is generally good practice.

A minimal size file could easily catch some cases where your code assumes some sort of structure (some flag, header, metadata structure, etc.) exists when it's possible it actually doesn't.
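A small sketch of what that can look like in practice. `pdf_version` is a hypothetical helper, not taken from any library, and the inputs are deliberately truncated headers, which is where code that assumes a complete header tends to break.

```python
from typing import Optional


def pdf_version(data: bytes) -> Optional[str]:
    """Hypothetical helper: return "1.4" for data starting with b"%PDF-1.4", else None."""
    prefix = b"%PDF-"
    if not data.startswith(prefix):
        return None
    version = data[len(prefix):len(prefix) + 3]
    # A truncated header such as b"%PDF-1" leaves fewer than 3 bytes here;
    # indexing version[1] without this length check would raise IndexError.
    if len(version) < 3 or version[1:2] != b".":
        return None
    if not (version[0:1].isdigit() and version[2:3].isdigit()):
        return None
    return version.decode("ascii")


# Inputs at and around the boundary, mapped to the expected result.
BOUNDARY_CASES = {
    b"": None,
    b"%": None,
    b"%PDF-": None,
    b"%PDF-1": None,
    b"%PDF-1.": None,
    b"%PDF-1.4": "1.4",
}
```

Looping over `BOUNDARY_CASES` in a test and comparing against `pdf_version(data)` covers the empty file, the partial header, and the smallest complete header in one go.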

LargoLasskhyfv · on March 20, 2020

Not exactly on topic, but this reminds me of somewhere around the year 2000, when I had to produce "documentation" in a hurry, in under an hour before delivering and deploying some systems. I think I used Dia, but am not sure anymore. Anyways, all essential information in DIN A4 landscape mode, nice diagrams with network structure, IP numbers and so on, ready.

Now what? Remember it was around the year 2000, what to use? Floppy disks of course! Saved it, looked at it and thought it had gone wrong somehow because it listed as 4KB only.

Used another floppy, lowlevel formatted with fdformat to be sure, taking minutes, hurry, hurry! Saved again.

4KB!? WT..?

Booted up another system, loaded the PDF with different readers, worked.

Shrugged and hoped it worked at the customer's site also, which it did, they even said it looked nice and clear.

If only they knew...

jansan · on March 20, 2020

If you need something to watch I recommend "Funky File Formats" where Ange Albertini shows for example how one file can be valid in multiple file formats at the same time. Really amazing: https://media.ccc.de/v/31c3_-_5930_-_en_-_saal_6_-_201412291...

matja · on March 20, 2020

I wonder if afl-fuzz could do better. Context: https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...

sushisource · on March 20, 2020

Can anyone explain to me why PDF persists as the most common document format for "official" correspondence? It's an absolute dog-vomit of a format, just unbelievably overwrought and unfriendly. I wince every time I have to sign one, or, god forbid, actually fill in some form.

Is the explanation really as lame as "They were there first and it stuck"?

jahewson · on March 21, 2020

PostScript is built-in to high-end printing hardware and PDF does a good job at encapsulating PostScript which gives you high reliability that the thing on the screen is going to print out the same way that it appears on the screen. Adobe has traditionally had a dominant role in both font (Type 1, OTF) technology and licensing and creative tools (Illustrator - a .ai file is a PDF) and so both the creation and consumption sides of the ecosystem have coalesced around a common format whose semantics carry through predictably (though not simply!) from one end to another.

What makes PDF particularly challenging is that there are so many broken PDFs out there which must be tolerated and so many legacy fonts, images and formats which have accumulated in the format over time - JPEG 2000 anybody?

Most complaints I see about PDF though are usually “why can’t it just wrap lines” and the answer is that there’s a zillion ways to do that and we’re supposed to be representing the output of that process as a visual artifact, not the input, as HTML does, because the use case is printing and stability of the output is non-negotiable.

est31 · on March 20, 2020

Can you name a competing format that ended up being "better" than PDF? Not intending to say there is no such format. I'm genuinely curious.

tonyedgecombe · on March 20, 2020

XPS is better in many ways, apart from its obvious failure in the market.

jahewson · on March 21, 2020

XPS is wonderful but it arrived 20 years too late.

ghaff · on March 20, 2020

Because it became, as a descendant of PostScript, a pretty dominant standard for situations where you wanted to specify the layout.

There's certainly a lot of legacy embedded in PDF which doesn't help.

>Is the explanation really as lame as "They were there first and it stuck"?

I'm not sure they were first but they became dominant because... Adobe. And Adobe made it an open standard which pretty much sealed the deal.

2ion · on March 20, 2020

Because it is the only format with mass adoption that makes presentation reasonably stable. Plus it's the format with the only reasonable authoring tools.

Personally, I enjoy working with DjVu files very much but you only find those for pirated ebooks.

saagarjha · on March 21, 2020

DjVu isn’t a vector format, so it’s not quite a replacement.

crazygringo · on March 20, 2020

Why do you wince when you have to sign one?

As an engineer I can see why you would find the spec inelegant.

But as an end-user, using it to fill in forms couldn't be easier.

pbhjpbhj · on March 20, 2020

What do you use for form filling? During job applications last year nearly all the PDF forms were just images, or vector boxes that you couldn't fill ...

crazygringo · on March 21, 2020

Acrobat Reader (free) has the "Add text" tool that lets you add text anywhere you want.

Similarly macOS Preview (free) has an identical "Text" tool to add text anywhere.

I've never checked other programs, but those are certainly the two default tools one would use with PDFs on Macs.

NanoWar · on March 20, 2020

Here is a great collection of smallest X: https://github.com/mathiasbynens/small

app4soft · on March 21, 2020

So, `pdf.pdf`[0] (130 Bytes) from this repo is the smallest valid PDF?

[0] https://github.com/mathiasbynens/small/blob/master/pdf.pdf

macintux · on March 21, 2020

I suspect the answer depends on how you define “valid”.

app4soft · on March 21, 2020

I myself find it hard to answer this question.

In terms of a "minimal valid raster image", it's a raster image of 1px × 1px; in terms of a "minimal valid vector image", it's a vector image with a single dot OR a single line segment.

But I can't imagine what "minimal valid raster PDF" means.

oftenwrong · on March 20, 2020

I would love to see a write-up (like the SO answer) for each of those.

e12e · on March 21, 2020

> Acrobat opens it

Unfortunately, this isn't a valid measure of a "valid PDF" - it doesn't mean the PDF will work in other readers.

I've had some luck with qpdf --check - as far as I can recall, I've not seen problems with a file that passes.

http://qpdf.sourceforge.net/files/qpdf-manual.html#ref.testi...

Also worth looking into is mutool (from MuPDF) with its clean subcommand.
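For a test suite, the qpdf check can be wrapped in a few lines. This is only a sketch: the function names are made up, it assumes `qpdf` is on the PATH, and it assumes an exit status of 0 means no errors or warnings were reported (worth confirming against the qpdf manual for the installed version).

```python
import shutil
import subprocess
from typing import List, Optional


def qpdf_check_command(path: str) -> List[str]:
    """Build the argument list for `qpdf --check <path>`."""
    return ["qpdf", "--check", path]


def qpdf_check(path: str) -> Optional[bool]:
    """Return True if qpdf exits with status 0 for this file, False otherwise.

    Returns None when qpdf is not installed, so callers can skip the
    check instead of failing.
    """
    if shutil.which("qpdf") is None:
        return None
    result = subprocess.run(
        qpdf_check_command(path), capture_output=True, text=True
    )
    # Assumption: status 0 means qpdf found neither errors nor warnings.
    return result.returncode == 0
```

Returning None rather than raising keeps the helper usable on machines where qpdf is missing.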

curveto · on March 20, 2020

Technically, the %PDF doesn't have to occur at byte 0. But, like many others inferred, that pushes onward toward real world structure (vs. an academically correct but useless PDF).

If you want a fully covered example it'll need a trailer, xref and at least one obj reference. ...and there are TWO flavors of those (linearized and not). ...and flate coded and not.

So, for a test harness you'd actually want a collection of small files.
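As a rough illustration of one member of such a collection, the sketch below assembles a non-linearized, uncompressed PDF with a header, three objects, an xref table, and a trailer, computing the byte offsets instead of hard-coding them. It is not claimed to be the smallest possible file, and which readers accept it is untested here; the 3×3 MediaBox is an arbitrary choice.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a tiny one-page PDF with a classic xref table and trailer."""
    # Object bodies, numbered 1..3 in order: catalog, page tree, one empty page.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 3 3] >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    # Each xref entry is exactly 20 bytes; entry 0 is the head of the free list.
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Writing `build_minimal_pdf()` to a file gives a fixture whose offsets stay correct if the object bodies are edited; linearized or Flate-compressed variants would need separate fixtures.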

bronson · on March 20, 2020

Allowing %PDF anywhere also leads to false positives.

https://github.com/minad/mimemagic/issues/4

https://github.com/minad/mimemagic/issues?q=is%3Aissue+is%3A...
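The trade-off between the two comments above can be shown with a short sketch. Both function names are hypothetical, and the 1024-byte window is an assumption based on how some readers are reported to search for the header, not a figure from this thread.

```python
def pdf_header_at_start(data: bytes) -> bool:
    """Strict check: the header must be the very first bytes."""
    return data.startswith(b"%PDF-")


def pdf_header_within(data: bytes, window: int = 1024) -> bool:
    """Lenient check: the header may appear anywhere in the first `window` bytes."""
    return b"%PDF-" in data[:window]


# Hypothetical inputs, for illustration only.
real_pdf = b"%PDF-1.4\n1 0 obj\n"
junk_prefixed_pdf = b"\r\n\r\n%PDF-1.4\n1 0 obj\n"
text_about_pdf = b"Notes: a PDF file begins with %PDF-1.4 on its first line.\n"
```

The lenient check accepts `junk_prefixed_pdf`, which the strict one rejects, but it also accepts `text_about_pdf`, which is plain text that merely mentions the header.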