What is the smallest possible valid PDF? (stackoverflow.com)
176 points by oftenwrong on March 20, 2020 | hide | past | favorite | 36 comments



There's actually a good practical use-case for this: you're building software that needs to detect PDF files (as a smaller detail of what it does, not its primary purpose) and you want to include a tiny one in your unit tests.

I did that here with tiny images in JPEG, PNG and GIF https://github.com/simonw/datasette-render-images/blob/maste...



Kinda silly including 0-byte files which is most of them. Now you're just building a list of things that take empty input.

They should move those to a simple list in the README and reserve the repo for non-empty files.


That is indeed a nice one, "Smallest possible syntactically valid files of different types".

Following the GitHub link (datasette-render-images) in the comment you replied to, there's a code comment with a link to the same library (small).


Doesn't a PDF detector only need to check if the first few bytes are '%PDF-1.' ?

That is, do you need a "valid PDF" detector or "more likely a PDF than anything else" detector?
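To make the "more likely a PDF than anything else" end of that spectrum concrete, here is a minimal sketch of a prefix check. The function name `looks_like_pdf` is made up for illustration, and this is only a sniff of the first bytes, not a validity check.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data begins with a PDF 1.x header such as %PDF-1.4."""
    # Only the prefix suggested in the comment above is checked; nothing
    # after the header is inspected.
    return data.startswith(b"%PDF-1.")


print(looks_like_pdf(b"%PDF-1.4\n1 0 obj\n"))
print(looks_like_pdf(b"GIF89a"))
```

Note that a check this strict would reject a file whose header reads `%PDF-2.0`, so the exact prefix to test for is a design choice.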


Realistically, yes, but strictly speaking PDFs start with a footer and most PDF readers will accept highly corrupt files.


Depends. If you need to parse it, that might still cause errors.


Agreed.

Though when I think of "detector" I think more of https://en.wikipedia.org/wiki/File_(command) and not something which verifies the file is in the correct format.


Well...

    file 'setupTests.ts'
    setupTests.ts: Java source, ASCII text

I wouldn't put too much trust in file.


I don’t know what you expected. File is just there to give a good guess at a file’s format. There are a ton of reasons why this problem is hard, and there are reasons to make “file” less accurate in order to make the implementation simpler and more secure.

But it will work fine for PDFs, often enough.


The heuristic used for Java source looks like this:

0 regex \^import.*;$ Java source
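A rough Python rendering of that heuristic shows why a TypeScript file trips it. This is a sketch, not how `file` is implemented: treating `^` and `$` as per-line anchors is an assumption, and the sample file contents are hypothetical.

```python
import re

# Python version of the magic entry quoted above. Using re.MULTILINE so that
# ^ and $ anchor at line boundaries is an assumption about how file applies it.
JAVA_IMPORT = re.compile(rb"^import.*;$", re.MULTILINE)


def looks_like_java_source(data: bytes) -> bool:
    """Return True if any line starts with 'import' and ends with ';'."""
    return JAVA_IMPORT.search(data) is not None


# Hypothetical contents of a TypeScript test setup file.
typescript_sample = b"import '@testing-library/jest-dom';\n"
java_sample = b"import java.util.List;\n"

print(looks_like_java_source(typescript_sample))
print(looks_like_java_source(java_sample))
```

Both samples match, because the pattern only asks for a line that begins with `import` and ends with a semicolon, which ES module imports satisfy as well.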


That use is indeed exactly what inspired the question!


I'm curious why you needed the absolute smallest PDF file for testing? As an intellectual exercise, golfing the PDF format sounds like a bit of fun, but given that https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pd... is 13KB, I would think you could load that in a test suite on pretty much any reasonable, non-embedded platform. And, why would you want to load it on an embedded platform, anyway?


Testing near/at boundary conditions is generally good practice.

A minimal size file could easily catch some cases where your code assumes some sort of structure (some flag, header, metadata structure, etc.) exists when it's possible it actually doesn't.
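As a small illustration of that point, here is a sketch of a header parser exercised with inputs at and below the minimum size. The function `pdf_version` is hypothetical; the idea is that each short input probes an assumption the code might make about bytes that are not there.

```python
from typing import Optional


def pdf_version(data: bytes) -> Optional[str]:
    """Return the header version such as '1.4', or None if it is missing or truncated."""
    prefix = b"%PDF-"
    if not data.startswith(prefix):
        return None
    version = data[len(prefix):len(prefix) + 3]
    # A truncated header yields fewer than three bytes here, so check the
    # length before looking at individual characters.
    if len(version) < 3:
        return None
    if not (version[0:1].isdigit() and version[1:2] == b"." and version[2:3].isdigit()):
        return None
    return version.decode("ascii")


# Boundary inputs: empty, one byte, header prefix only, truncated version, full header.
for sample in (b"", b"%", b"%PDF-", b"%PDF-1", b"%PDF-1.4"):
    print(sample, pdf_version(sample))
```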


Not exactly on topic, but this reminds me of somewhere around the year 2000, when I had to produce "documentation" in a hurry, in under an hour before delivering and deploying some systems. I think I used Dia, but am not sure anymore. Anyways, all essential information in DIN A4 landscape mode, nice diagrams with network structure, IP numbers and so on, ready.

Now what? Remember it was around the year 2000, what to use? Floppy disks of course! Saved it, looked at it and thought it had gone wrong somehow because it listed as 4KB only.

Used another floppy, low-level formatted with fdformat to be sure, taking minutes, hurry, hurry! Saved again.

4KB!? WT..?

Booted up another system, loaded the PDF with different readers, worked.

Shrugged and hoped it worked at the customer's site also, which it did, they even said it looked nice and clear.

If only they knew...


If you need something to watch I recommend "Funky File Formats" where Ange Albertini shows for example how one file can be valid in multiple file formats at the same time. Really amazing: https://media.ccc.de/v/31c3_-_5930_-_en_-_saal_6_-_201412291...


I wonder if afl-fuzz could do better. Context: https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...


Can anyone explain to me why PDF persists as the most common document format for "official" correspondence? It's an absolute dog-vomit of a format, just unbelievably overwrought and unfriendly. I wince every time I have to sign one, or, god forbid, actually fill in some form.

Is the explanation really as lame as "They were there first and it stuck"?


PostScript is built in to high-end printing hardware, and PDF does a good job at encapsulating PostScript, which gives you high reliability that the thing on the screen is going to print out the same way that it appears on the screen. Adobe has traditionally had a dominant role in both font technology and licensing (Type 1, OTF) and creative tools (Illustrator: a .ai file is a PDF), so both the creation and consumption sides of the ecosystem have coalesced around a common format whose semantics carry through predictably (though not simply!) from one end to another.

What makes PDF particularly challenging is that there are so many broken PDFs out there which must be tolerated and so many legacy fonts, images and formats which have accumulated in the format over time - JPEG 2000 anybody?

Most complaints I see about PDF though are usually “why can’t it just wrap lines” and the answer is that there’s a zillion ways to do that and we’re supposed to be representing the output of that process as a visual artifact, not the input, as HTML does, because the use case is printing and stability of the output is non-negotiable.


Can you name a competing format that ended up being "better" than PDF? Not intending to say there is no such format. I'm genuinely curious.


XPS is better in many ways, apart from its obvious failure in the market.


XPS is wonderful but it arrived 20 years too late.


Because it became, as a descendant of PostScript, a pretty dominant standard for situations where you wanted to specify the layout.

There's certainly a lot of legacy embedded in PDF which doesn't help.

>Is the explanation really as lame as "They were there first and it stuck"?

I'm not sure they were first but they became dominant because... Adobe. And Adobe made it an open standard which pretty much sealed the deal.


Because it is the only format with mass adoption that makes presentation reasonably stable. Plus it's the format with the only reasonable authoring tools.

Personally, I enjoy working with DjVu files very much, but you only find those among pirated ebooks.


DjVu isn’t a vector format, so it’s not quite a replacement.


Why do you wince when you have to sign one?

As an engineer I can see why you would find the spec inelegant.

But as an end-user, using it to fill in forms couldn't be easier.


What do you use for form filling? During job applications last year nearly all the PDF forms were just images, or vector boxes that you couldn't fill ...


Acrobat Reader (free) has the "Add text" tool that lets you add text anywhere you want.

Similarly macOS Preview (free) has an identical "Text" tool to add text anywhere.

I've never checked other programs, but those are certainly the two default tools one would use with PDFs on Macs.


Here is a great collection of smallest X: https://github.com/mathiasbynens/small


So, `pdf.pdf`[0] (130 Bytes) from this repo is the smallest valid PDF?

[0] https://github.com/mathiasbynens/small/blob/master/pdf.pdf


I suspect the answer depends on how you define “valid”.


That question is itself hard to answer.

For a "minimal valid raster image", it's a raster image of 1px × 1px; for a "minimal valid vector image", it's a vector image with a single dot OR a single line segment.

But I can't imagine what a "minimal valid raster PDF" would mean.


I would love to see a write-up (like the SO answer) for each of those.


> Acrobat opens it

Unfortunately, this isn't a valid measure of a "valid PDF": it doesn't mean the PDF will work in other readers.

I've had some luck with qpdf --check - as far as I can recall, I've not seen problems with a file that passes.

http://qpdf.sourceforge.net/files/qpdf-manual.html#ref.testi...

Also worth looking into is mutool with its clean command.


Technically, the %PDF doesn't have to occur at byte 0. But, like many others inferred, that pushes onward toward real world structure (vs. an academically correct but useless PDF).

If you want a fully covered example, it'll need a trailer, an xref table and at least one obj reference. And there are TWO flavors of those (linearized and not), each of which can be Flate-encoded or not.

So, for a test harness you'd actually want a collection of small files.
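For one member of such a collection, here is a minimal sketch that assembles a small non-linearized, uncompressed PDF with a header, three objects, a classic xref table and a trailer. The function name `build_small_pdf` and the 3×3 page size are illustrative choices; this makes no claim to being the smallest possible file, and it has not been checked against any particular reader.

```python
def build_small_pdf() -> bytes:
    """Assemble a tiny PDF: header, objects, xref table, trailer, startxref."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 3 3] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        # The xref table needs the byte offset at which each object starts.
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    # Entry 0 is the head of the free list; every entry is exactly 20 bytes.
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    # %%%% in a bytes format string produces the literal %% of the EOF marker.
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


pdf = build_small_pdf()
print(len(pdf), "bytes")
```

Computing the offsets while writing, rather than hard-coding them, keeps the xref table consistent if an object is edited, which matters for a test fixture that is meant to be structurally correct.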




