
Dangerzone: Convert potentially dangerous PDFs, documents, or images to safe PDF - panarky
https://github.com/firstlookmedia/dangerzone
======
ris
It's nice that PDF security is getting a bit more attention, but there are a
number of things that this approach will trash. For instance, I don't have
high hopes for the accessibility of the resulting PDF. (edit: and needless to
say, any software in your pipeline which does full interpretation of an
untrusted file will _itself_ become the target for attacks, so this is only a
useful tool if it is run in an extremely restricted environment.)

I for one have been looking a lot into PDF/A for security. PDF/A is really
meant for archival, but as a side effect has disallowed an awful lot of weird
PDF features which are a security nightmare and which PDF readers tend to
implement badly/buggily. PDF/A-1 for example, the strictest level, disallows JPEG2000,
TIFF, JavaScript, PostScript, embedded files... (PDF/A-3, FWIW is essentially
useless from this angle, because they decided to allow arbitrary embedded
files, so a valid PDF/A-3 could have pretty much anything in it).
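
As a rough illustration of the kind of features listed above, here is a minimal sketch that counts raw occurrences of a few PDF name tokens in a file's bytes. The token list and the function name `find_risky_tokens` are my own assumptions for illustration, not part of any validator. It only sees uncompressed tokens, so names hidden inside compressed object streams or written with `#xx` escapes are missed. It is a quick triage aid, not a conformance check.

```python
import re

# Illustrative, non-exhaustive list of PDF name tokens tied to features
# mentioned above (JavaScript, embedded files, JPEG2000). This list is an
# assumption made for the sketch, not a PDF/A rule set.
RISKY_TOKENS = {
    b"/JavaScript": "JavaScript action",
    b"/JS": "JavaScript source",
    b"/EmbeddedFile": "embedded file stream",
    b"/JPXDecode": "JPEG2000 image filter",
}


def find_risky_tokens(data: bytes) -> dict:
    """Count raw occurrences of each risky name token in the PDF bytes."""
    counts = {}
    for token, label in RISKY_TOKENS.items():
        # Require a non-alphanumeric character (or end of data) after the
        # token so that /JS does not match a longer name such as /JSFoo.
        pattern = re.escape(token) + rb"(?![A-Za-z0-9])"
        hits = len(re.findall(pattern, data))
        if hits:
            counts[label] = hits
    return counts
```

Usage would be something like `find_risky_tokens(open("in.pdf", "rb").read())`; an empty result says nothing about safety, only that none of these tokens appear in the clear.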

There now exists a good PDF/A validator
([https://verapdf.org/](https://verapdf.org/)) which can be used to ensure
PDFs conform to the standard, but of course, won't _fix_ them if they're not.

PDF/A has an interesting implementation detail however - compliant PDF readers
are _supposed_ to automatically "turn off" non-PDF/A features when they
encounter a PDF which declares itself as a particular PDF/A variant (even if
it then goes on to attempt to use non-compliant features), which would
_hopefully_ prevent dangerous sections from being decoded and avoid
exploitation. Another interesting feature of PDF is its appendable nature,
which might raise the possibility of being able to "declare" an arbitrary PDF
as PDF/A by simply appending an extra section to it, hopefully rendering it
less harmful (though possibly at the expense of it appearing to have missing
content when rendered).

~~~
heartbeats
Couldn't you just use a whitelist for the features?

If the reader can open a PDF/A-1 file and ignore the bad parts, can't it open
a PDF file as PDF/A-1 and remove the bad parts, before saving it again?

Then you could use the "re-rendering" technique to extract images:

1. Create a stripped PDF/A-1 file from the original PDF.

2. In a VM, render the two PDF files to two high-resolution image sets.

3. Use some CV algorithm to find the differences. For example, gaussian blur,
subtract, threshold, find islands.

4. Use this to come up with areas which use complicated PDF features and/or
images. Say this returns that there is an image on page 9 in the rectangle
((17, 338), (400, 300)).

5. Crop out page 9 in rectangle ((17, 338), (400, 300)) from the original
PDF. Use some CV algorithm to detect the DPI and whether it's best to encode
as JPEG or PNG. Encode it and add to list of images.

6. Add sanitized images back from list, mark PDF/A-1 file with images as
PDF/A-2.

Of course, you could do this for links or whatever as well. Spit out a list of
rectangles and link targets in the PDF, and then put them back in.
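
To make step 3 a bit more concrete, here is a minimal sketch of the "subtract, threshold, find islands" part in plain Python. It assumes both pages are already rendered to equally sized grayscale images given as lists of rows of 0-255 integers; the function name `diff_regions` and the default threshold are made up for illustration. The Gaussian blur is left out to keep it short, so a real pipeline would blur both inputs first to suppress rendering noise.

```python
from collections import deque


def diff_regions(img_a, img_b, threshold=32):
    """Return bounding boxes (x0, y0, x1, y1) of connected areas where two
    equally sized grayscale images (lists of rows of ints) differ."""
    height, width = len(img_a), len(img_a[0])
    # Subtract and threshold: True marks a pixel that differs noticeably.
    mask = [
        [abs(img_a[y][x] - img_b[y][x]) > threshold for x in range(width)]
        for y in range(height)
    ]
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Flood fill one island (4-connectivity), tracking its extent.
            x0 = x1 = x
            y0 = y1 = y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes
```

Each returned box is a candidate rectangle for step 4; coordinates here are pixel indices with inclusive corners, which is a choice of this sketch rather than anything PDF-specific.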

~~~
ris
There are myriad ways that PDFs could be re-written and re-rendered, but they
would all be quite complicated and/or throw away a lot of extremely useful
"meta" information (bookmarks, signed sections etc.) and almost certainly make
files much bigger. The idea of the "appending" trick would be to mutate the
original file as little as possible, but convince the reader to open it in a
safer mode.

~~~
jrowley
The small issue with the append trick is that it assumes the reader
application will now respect the new format and not open insecure parts...
which might not be fully implemented in all cases and is reader specific.

Fully sanitizing the PDF yields better guarantees of security at the cost of
lost functionality.

~~~
ris
> The small issue with the append trick is that it assumes the reader
> application will now respect the new format and not open insecure parts...
> which might not be fully implemented in all cases and is reader specific.

Yes, note my original emphasised use of the term "supposed to".

> Fully sanitizing the PDF yields better guarantees of security at the cost of
> lost functionality.

Indeed it's a tradeoff. But if you're willing to throw away the features which
this extreme sanitization would trample across and have any ability to design
PDF _out_ of your system, you're probably better off not using PDF at all in
favour of some straightforward image format.

------
rtpg
This is definitely a good brute force strategy.

I ... think there’s another technique that relies a bit on trusting the
printing drivers to do the right thing, where you can tell Ghostscript to
print your document, and target another PDF. This should at least remove
interactive components in a PDF
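
A minimal sketch of what that could look like, assuming Ghostscript is installed as `gs`. The flags used (`-dSAFER`, `-dBATCH`, `-dNOPAUSE`, `-sDEVICE=pdfwrite`, `-sOutputFile=`) are standard Ghostscript options, while the helper names are made up here. Whether the rewritten file really loses its interactive components is the claim above, not something this snippet verifies, and Ghostscript itself still has to parse the untrusted input.

```python
import subprocess


def ghostscript_rewrite_cmd(src: str, dst: str) -> list:
    """Build a Ghostscript command line that re-emits src as a new PDF."""
    return [
        "gs",
        "-dSAFER",            # restrict file access from within the interpreter
        "-dBATCH",            # exit after processing instead of prompting
        "-dNOPAUSE",          # do not pause between pages
        "-sDEVICE=pdfwrite",  # "print" to a PDF output device
        f"-sOutputFile={dst}",
        src,
    ]


def rewrite_pdf(src: str, dst: str) -> None:
    """Run Ghostscript; raises CalledProcessError if it exits non-zero."""
    subprocess.run(ghostscript_rewrite_cmd(src, dst), check=True)
```

Calling `rewrite_pdf("untrusted.pdf", "rewritten.pdf")` would run the conversion, ideally inside the same kind of restricted environment discussed elsewhere in this thread.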

~~~
userbinator
It's definitely brute force, in that it's the equivalent of printing a
document onto paper and then scanning it back in. This "flattening" is highly
effective at sanitising, but also removes all the semantic content in the
process; the output should be several times larger than the input (and if it
isn't, then it's an indication that something very suspicious was in the
input....)

~~~
necovek
Indeed, I was hoping for something smarter, that would remove only the "risky"
bits of PDF, but keep the overall structure (and size).

~~~
heartbeats
What about converting PDF to PostScript and back? It should keep most of the
semantic information while removing the exploits.

~~~
necovek
Yeah, perhaps. The "gruntwork" would be to figure out if that is sufficient.
Heck, taking your idea further, perhaps convert to something non-derivative
like HP's PCL5 and back. Or SVG or...

------
manthideaal
In (1) the author uses as a CV a PDF that is also a bootloader, and from the
comments it seems that he has since improved the code. I wonder if he could
render Dangerzone futile.

(1)
[https://news.ycombinator.com/item?id=19344146](https://news.ycombinator.com/item?id=19344146)

Edit: the comments of that post reference pocorgtfo16.pdf, which is valid as a
PDF document, a ZIP archive, and a Bash script that runs a Python webserver
hosting Kaitai Struct’s WebIDE, which allows you to view the file’s own
annotated bytes. The ZIP archive has further resources, including insane
reversing deep dives, code to study and more.

(2)
[https://www.alchemistowl.org/pocorgtfo/](https://www.alchemistowl.org/pocorgtfo/)

------
rurban
Good, but I don't think the change from running the sandbox in a VM (in Qubes'
case even Xen, not KVM) to Docker is an improvement. Escaping Docker is much
easier than escaping Xen.

------
namanyayg
Useful tool -- it's trivial to make a RAT bypass chat/email .doc/.PDF
attachments.

I don't open any files on my PC from people I don't personally know -- use
webviewers.

~~~
A4ET8a8uTh0
Odd question. Why would a webviewer be safer in this case?

edit: Thank you for both answers. I thought it had to do with sandbox
rationale, but couldn't mentally get past the fact that sandbox could
potentially be escaped too. Eh, I think it is time for sleep.

~~~
mettamage
Well is it safe? Don't know.

Safer: definitely. Given that the total number of PDF attacks is some number,
a particular malicious PDF now needs to attack both the PDF format and the
webviewer. Assuming that 1% of all malicious PDFs do that, I'd say it's 100
times safer than not using a webviewer.

If you still think that 1% of all potential PDF attacks is too unsafe, then
that's a different discussion.

If you think my 1% is off, then that's a different discussion too. All I'm
saying is that it's safer.
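
The arithmetic behind the "100 times" figure, as a tiny sketch. The function name is made up, and the 1% is the guess from above, not a measured number: if only a fraction of attacks also get past the webviewer, exposure shrinks by the inverse of that fraction.

```python
def relative_safety(fraction_also_hitting_webviewer: float) -> float:
    """How many times fewer attacks get through, given the fraction of PDF
    attacks that also work against the webviewer (must be in (0, 1])."""
    if not 0 < fraction_also_hitting_webviewer <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction_also_hitting_webviewer
```

With the guessed 1%, `relative_safety(0.01)` comes out at about 100; with a guess of 10% it would be about 10, which is why the estimate itself matters.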

~~~
saagarjha
Well, PDF attacks need to attack the viewer you're using too…

~~~
mettamage
True, but in most cases this is assumed to be a popular PDF reader. If it is
specifically targeting a webviewer, I agree. But that still means that there
is some JS PDF parser in between, though that provides very little in terms of
security; I doubt that such a parser will check for malicious input.

~~~
heeen2
Afaik Chromium uses the same PDF renderer as Foxit (PDFium).

------
maxfan8
This is similar to how QubesOS converts “untrusted PDFs” in a disposable VM.
Quite nifty, particularly if you use OCR afterwards.

See:
[https://theinvisiblethings.blogspot.com/2013/02/converting-u...](https://theinvisiblethings.blogspot.com/2013/02/converting-untrusted-pdfs-into-trusted.html)

------
johnlorentzson
This kind of makes me wonder why PDFs can even act maliciously in the first
place. Why does it have the ability to do these things?

~~~
masklinn
PDF derives from PostScript which is a full-blown programming language so it's
an "original sin" either way.

Then over time Adobe added a number of interactive (forms), multimedia and
rich media (embedded JS) features, leading to even more vectors.

~~~
gpvos
The page description language part of PDF is based on Postscript, but
explicitly simplified to be non-Turing-complete and safe (if implemented
sanely). The later additions are the main culprit I think.

~~~
Piskvorrr
"if implemented sanely" - oh well. The original idea was nice.

------
badrabbit
Very nice, but FYI: most malicious PDFs just contain links to something else,
usually shortlinks. Social engineering is hard to mitigate.

~~~
seanhunter
If I understand this correctly, a link wouldn't survive this as the PDF is
turned into images and then those images back into a PDF. So it's essentially
like a very high quality scan.

What you would end up with is an image that looks like a link but would not be
clickable.

~~~
leskat
> Dangerzone can optionally OCR the safe PDFs it creates, so it will have a
> text layer again

I'm not completely sure, but wouldn't this parse links and make them
accessible again, possibly even clickable?

~~~
clort
A link's displayed text and its destination URL are not necessarily the same.
Rendering the document to a bitmap then OCRing that would get the display text
rather than the URL. I would think that it would be normal for a malicious URL
to be obscured with an innocent looking display text.
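
To illustrate the display-text versus destination mismatch, here is a minimal sketch using HTML anchors as a stand-in, since reading PDF link annotations would need a PDF library. The class name `LinkMismatchFinder` and the `.example` hostnames are hypothetical. It flags links whose visible text looks like a URL pointing at one host while the `href` points at another.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkMismatchFinder(HTMLParser):
    """Collect (display_text, href) pairs whose hostnames disagree."""

    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>
        self.mismatches = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            shown = urlparse(text).hostname          # None if text is not a URL
            actual = urlparse(self._href).hostname
            if shown and actual and shown != actual:
                self.mismatches.append((text, self._href))
            self._href = None
```

OCR on a rendered bitmap would only ever recover the first element of each pair, which is the point made above.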

------
badrabbit
This wouldn't work for businesses because of PDF forms, legitimate links and
even embedded objects. I think this is a great idea if you have users preview
the image version before opening the whole thing maybe? Even as an individual,
I have had to fill out PDF forms for various US government applications, job
applications, etc...

------
zuckluni
Obligatory humblebrag/shameless plug for my open source PDF(+ other docs) to
Image converter, which runs as a web app. It's self hosted open source (but
easiest to run on FreeBSD). Uses Ghostscript/OpenOffice under the hood:

[https://github.com/dosyago/p2..git](https://github.com/dosyago/p2..git)

------
TekMol

        Download Dangerzone
    

So instead of opening a PDF from the internet on my machine, I now run code
from the internet on my machine?

~~~
seanhunter
There's a big difference between opening a random PDF and downloading code
which you can view yourself on github, no? Or does your computer (from cpu
microcode up) only run code which you have personally written and is not "from
the internet"?

