
Ask HN: What is the best way for validating files? - deca6cda37d0
Users can upload files into my backend system. Other users can view those files in the client.

For example, I only allow PDF files. What is the best way to validate that an uploaded file is indeed a PDF, and not some other format renamed to otherfileformat.pdf, so that uploaded files render correctly? This is to prevent human errors.
======
necovek
You'd have to define "best":

* how much do you care about performance?

* how much do you care about safety?

The simplest and fastest would be to check for the "%PDF-" signature at the
start of the file. Refer to the open PDF spec to ensure you are allowing
anything that's acceptable (e.g. do you care about FDF files?).
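
A minimal sketch of that signature check in Python, assuming the upload is
available as a binary file-like object. The names looks_like_pdf, strict and
search_window are made up for illustration. The lenient mode assumes, as I
recall from the implementation notes in the 1.7 reference, that some viewers
accept the header anywhere within the first 1024 bytes; check the spec before
relying on that.

```python
import io

PDF_MAGIC = b"%PDF-"  # a PDF header looks like "%PDF-1.7"; FDF files use "%FDF-" instead


def looks_like_pdf(stream, strict=True, search_window=1024):
    """Return True if the stream appears to start with a PDF header.

    Hypothetical helper: it only inspects the leading bytes, so it catches
    mislabeled uploads (user error), not deliberately crafted files.
    The stream is read from its current position; rewind it afterwards
    if the rest of the pipeline needs the full content.
    """
    head = stream.read(search_window)
    if strict:
        # Signature must be the very first bytes of the file.
        return head.startswith(PDF_MAGIC)
    # Lenient mode (assumption): accept the signature anywhere in the window.
    return PDF_MAGIC in head


sample = io.BytesIO(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n")
print(looks_like_pdf(sample))  # True
```

Whether to use the strict or the lenient mode depends on how forgiving your
client-side renderer is, so it is worth testing against that.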

If you need to protect against malicious attempts, rather than user errors, it
gets much harder quickly (and theoretically impossible, since you can
construct files which will be both valid PDFs and something else).

To give another example, if you are aiming to protect yourself from being used
as a media-sharing service, PDF allows embedding media as well, so allowing
PDFs will not stop that — they are container formats as much as anything else.

The safest would be to reprocess and re-render only the subset you allow: but
that's most expensive in terms of implementation and CPU time, and also
somewhat limiting — you can't keep digital signatures, for instance.

~~~
deca6cda37d0
Thanks for your answer.

It is to protect against user errors. Your suggestion to check for a signature
sounds like what I'm looking for.

~~~
necovek
Sorry, I missed this earlier. The PDF spec became an ISO standard with 1.7,
and has been unavailable without paying since 2.0, but the 1.7 version on
Adobe's site is pretty clear about the signatures (nice, simple section on
headers :).

(My phone decided not to let me paste the URL, but it's a quick search away —
do not be afraid of the spec, it's quite simple, esp the parts you care about)

