
What’s Hiding in Your PDF? - ingve
https://pspdfkit.com/blog/2018/whats-hiding-in-your-pdf/
======
m3nu
Earlier this year I was working on hybrid PDFs[1] that embed a full XML
invoice. The format is standardized and promoted by the Germans and the
French.[2] One more thing to hide.

1: [https://github.com/invoice-x/factur-x-ng](https://github.com/invoice-x/factur-x-ng)
2: [http://fnfe-mpe.org/factur-x/factur-x_en/](http://fnfe-mpe.org/factur-x/factur-x_en/)

~~~
IshKebab
Is the PDF rendered from the XML? If not that sounds mad.

~~~
codetrotter
Mad how? PDF is a notoriously bad format to extract data from, but great for
visually representing the data to humans. XML is human unfriendly but good for
structuring data so that software can read it. Embedding the machine readable
XML representation of the data in the PDF ensures that both representations of
the data are available always.

~~~
btgeekboy
I think OP is concerned about conflicting data - the visual representation may
not match the XML data.

~~~
askvictor
Given that it's possible to have JavaScript in a PDF as well, it wouldn't be
too hard to have a bit of code that verifies the XML matches the
human-readable version. Or, failing that, some sort of crypto signature
covering both, to check that nothing has been tampered with.
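
A rough sketch of the "signature over both" idea, in Python. This is only an
illustration, not anything defined by Factur-X or the PDF spec. It uses a
shared-key HMAC rather than a real public-key signature, the function names
`seal` and `verify` are made up for this example, and it assumes you have
already extracted the XML bytes and the visible text by some other means. It
only detects changes made after sealing; it cannot prove the two
representations agreed in the first place.

```python
import hashlib
import hmac


def seal(xml_bytes: bytes, visible_text: str, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag covering both representations."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for part in (xml_bytes, visible_text.encode("utf-8")):
        # Length-prefix each part so the boundary between them is unambiguous.
        mac.update(len(part).to_bytes(8, "big"))
        mac.update(part)
    return mac.hexdigest()


def verify(xml_bytes: bytes, visible_text: str, key: bytes, tag: str) -> bool:
    """Constant-time check that neither representation changed since sealing."""
    return hmac.compare_digest(seal(xml_bytes, visible_text, key), tag)


if __name__ == "__main__":
    key = b"demo-key-not-for-production"  # hypothetical key, for the demo only
    xml = b"<Invoice><Total>100.00</Total></Invoice>"
    text = "Invoice total: 100.00 EUR"
    tag = seal(xml, text, key)

    print(verify(xml, text, key, tag))  # True
    print(verify(xml.replace(b"100.00", b"900.00"), text, key, tag))  # False
```

The check has to run in software the recipient already trusts, with the tag
or key delivered out of band.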

~~~
josefx
As far as I can tell the standard prohibits PDF files with any dynamic
content, including javascript. Also there is no point in embedding the
verification code in the PDF if you don't trust its contents.

------
lower
The vector drawing program Ipe stores all its data in a PDF. So, you create a
drawing, save it as a PDF and can later open the PDF again for editing. Text
can be written in LaTeX format and the PDF will contain the LaTeX source for
later editing.

[https://en.wikipedia.org/wiki/Ipe_(software)](https://en.wikipedia.org/wiki/Ipe_\(software\))

~~~
jacobolus
Illustrator has been doing this for a long time. You can choose to “preserve
editing capabilities” at save time. I assume it is similar for other vector
drawing programs.

------
0xmohit
PDFs can also have file attachments.

[https://helpx.adobe.com/acrobat/using/links-attachments-
pdfs...](https://helpx.adobe.com/acrobat/using/links-attachments-pdfs.html)

The official documentation also seems to recognize those as a security risk :)

[https://helpx.adobe.com/acrobat/using/attachments-
security-r...](https://helpx.adobe.com/acrobat/using/attachments-security-
risks-reader-acrobat.html)

------
alexis_fr
This French startup claims it is able to run some Javascript in the PDF,
therefore notify you when a customer reads your offer, tell you which part he
read, where he stayed the longest, etc. Does PDF support JS?

[https://www.tilkee.com/](https://www.tilkee.com/)

~~~
favorited
Yup. If you open this PDF in Chrome, you can even play Breakout:

[https://rawgit.com/osnr/horrifying-pdf-
experiments/master/br...](https://rawgit.com/osnr/horrifying-pdf-
experiments/master/breakout.pdf)

~~~
nothis
That's creepy and makes me glad I don't use Chrome (doesn't work in Firefox).

------
kanzure
Academic publishers also insert IP addresses and other deanonymizing
information (about the user) into PDFs of academic papers, which should be
removed.

[https://github.com/kanzure/pdfparanoia](https://github.com/kanzure/pdfparanoia)

------
burtonator2011
I've been doing a TON of work around PDF lately with my Polar project and
wanted to get feedback from you guys:

[https://getpolarized.io/](https://getpolarized.io/)

I just implemented bulk PDF import this weekend.

It uses pdf.js for rendering the PDFs and extracting the metadata
(including the fields discussed in this article).

I have to manage a ton of PDFs for my work / research (mostly textbooks and
compsci whitepapers), and before working on Polar I was really struggling
to manage all the data.

------
oever
You can create a Hybrid PDF in LibreOffice. It's a PDF with ODF embedded.

[https://wiki.documentfoundation.org/Faq/Writer/PDF_Hybrid](https://wiki.documentfoundation.org/Faq/Writer/PDF_Hybrid)

------
DyslexicAtheist
There are some great PDFs on the _PoC||GTFO_ site that go deep into the
subject :)
[https://www.alchemistowl.org/pocorgtfo/](https://www.alchemistowl.org/pocorgtfo/)

------
FredFS456
Sort of related, the company whose website this blog post is on makes the best
Android PDF reader, and it's free:
[https://play.google.com/store/apps/details?id=com.pspdfkit.v...](https://play.google.com/store/apps/details?id=com.pspdfkit.viewer)

(Disclaimer: I am not associated with the company)

~~~
everybodyknows
From the Permissions disclosure:

    
    
      >full network access
      >run at startup
      >prevent device from sleeping
    

My guess is they're selling users' PDF reading activity data.

~~~
MartinMond
PSPDFKit CTO here. We're not selling any user data, PDF Viewer exists because
we sell an SDK and having a great app in the store makes it 1) easy to
showcase the SDK to potential customers and 2) gives us a broad user base that
tests the SDK and gives feedback for free.

~~~
everybodyknows
Misbehavior by other apps makes such permissions a red flag to some of us.

Prime offender: Amazon Music. Latest rev AFAICT has no "Quit". Have to resort
to Settings->Apps->Force_Stop to get its clutter off the screen.

------
pmoriarty
Are there any command-line tools that will let you see and/or edit all this
metadata?

~~~
phonon
You can check out
[http://pdfedit.cz/en/index.html](http://pdfedit.cz/en/index.html) (no longer
updated)

~~~
Theodores
This is really old and unmaintained. It also has an X-Window-type UI that
takes you back to 1997.

I have found that despite the crude nature of it there are things you can find
in documents that other tools gloss over.

Genuine thanks for the link and the nudge - I have to go forensic on a few
PDFs and I had forgotten what the tool was as it has been a while.

Luckily the PDFs I need to edit are old and there should be no problem doing
apt-get install rather than compiling a tarball.

~~~
phonon
Yup...not easy finding a free tool that will "decompile" a pdf. Too bad it's
unmaintained :-(

[http://pdfedit.cz/screenshots/screenshot1.jpg](http://pdfedit.cz/screenshots/screenshot1.jpg)

[http://pdfedit.cz/screenshots/screenshot2.jpg](http://pdfedit.cz/screenshots/screenshot2.jpg)

[http://pdfedit.cz/screenshots/screenshot3.jpg](http://pdfedit.cz/screenshots/screenshot3.jpg)

~~~
vuln
Didier Stevens has some great cli tools

[https://blog.didierstevens.com/programs/pdf-
tools/](https://blog.didierstevens.com/programs/pdf-tools/)
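
If you just want a quick first look without installing anything, here is a
rough keyword-counting sketch in Python. To be clear, this is my own
illustration and not how any of the linked tools is implemented. The keyword
list and the function name `count_keywords` are assumptions made for this
example. It only sees names written out in plain bytes, so it will miss
anything inside compressed object streams or names obfuscated with `#xx` hex
escapes.

```python
import re

# PDF names commonly associated with active or embedded content
# (an illustrative selection, not an exhaustive list).
KEYWORDS = ("/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch", "/EmbeddedFile")


def count_keywords(data: bytes) -> dict[str, int]:
    """Count plain-text occurrences of each keyword in raw PDF bytes."""
    counts = {}
    for keyword in KEYWORDS:
        # Reject matches that continue with more name characters, so that
        # "/EmbeddedFile" does not also count "/EmbeddedFiles".
        pattern = re.escape(keyword.encode("ascii")) + rb"(?![A-Za-z0-9_#.\-])"
        counts[keyword] = len(re.findall(pattern, data))
    return counts


if __name__ == "__main__":
    # Tiny hand-written fragment standing in for a real file.
    sample = (
        b"1 0 obj << /OpenAction 2 0 R >> endobj\n"
        b"2 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj\n"
    )
    for keyword, n in count_keywords(sample).items():
        print(f"{keyword:15} {n}")
```

For a real file you would pass `open(path, "rb").read()` instead of the
sample. A zero count does not mean the file is clean, for the reasons above.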

~~~
phonon
[https://github.com/pdfminer/pdfminer.six](https://github.com/pdfminer/pdfminer.six)
is also useful

~~~
vuln
Awesome, thank you!

------
mpweiher
CodeDraw, my live-code-drawing + GraphViz tool stores its source in the PDF.
Also found a place to stash it in PNGs.

But you can treat a PDF very much like a big zip with some special purpose
features. If you want to.

------
xte
Well. Files can even be attached to PDFs, in a few cases as the "PDF source"
in reproducible research... But we have pdftk and other simple utilities to
manage PDF meta-information... It's also common to embed other kinds of
"steganographic" information that may be really hard to discover. The simpler
ones are like printer dots: white-on-white content in a plain PDF that a bot
can read easily. Others may encode a piece of the PDF's content with a
Caesar-like cipher, etc.

PDFs are a vast and not-so-clean world...

------
anigbrowl
Meanwhile most academic authors don't make any effort to add metadata that
would actually be useful. I can't count how many pdfs I have called things
like '0378final.pdf' with none of the fields like Author or Title filled in.
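
To check whether a given file has those fields at all, a rough Python sketch
like the one below can help. It is a simplification under several
assumptions: it follows the first `/Info` reference it finds, expects the
Info object to be stored uncompressed, and only understands simple
parenthesised strings (no hex strings, no nested parentheses, no UTF-16).
The function name `info_fields` is made up for this example; a real PDF
library is the better choice for anything serious.

```python
import re

FIELDS = ("Title", "Author", "Subject", "Keywords", "Creator", "Producer")


def info_fields(data: bytes) -> dict[str, str]:
    """Best-effort read of the document information dictionary."""
    # The trailer points at the Info object, e.g. "/Info 5 0 R".
    ref = re.search(rb"/Info\s+(\d+)\s+(\d+)\s+R", data)
    if not ref:
        return {}
    num, gen = ref.group(1), ref.group(2)
    obj = re.search(
        rb"(?<!\d)" + num + rb"\s+" + gen + rb"\s+obj(.*?)endobj", data, re.DOTALL
    )
    if not obj:
        return {}
    body = obj.group(1)

    found = {}
    for name in FIELDS:
        # Literal string: "(" + escaped chars or non-paren bytes + ")".
        match = re.search(
            rb"/" + name.encode("ascii") + rb"\s*\(((?:\\.|[^\\()])*)\)",
            body,
            re.DOTALL,
        )
        if match:
            # Undo the backslash escapes for parentheses and backslash.
            raw = re.sub(rb"\\([()\\])", rb"\1", match.group(1))
            found[name] = raw.decode("latin-1")
    return found


if __name__ == "__main__":
    # Hand-written fragment; the bookmark-style /Title in object 1 is ignored
    # because only the object referenced by /Info is inspected.
    sample = (
        b"%PDF-1.4\n1 0 obj\n<< /Title (Chapter 1) >>\nendobj\n"
        b"5 0 obj\n<< /Title (Some Paper) /Author (Jane Doe) >>\nendobj\n"
        b"trailer\n<< /Info 5 0 R >>\n"
    )
    print(info_fields(sample))
```

An empty result for something like '0378final.pdf' either means the fields
really are missing or that the file stores them in a form this sketch does
not handle.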

------
FetchBen
I built a PDF template designer, but overlooked the metadata side of things.
Might have to look into it after reading this.

[https://fetchpdf.com](https://fetchpdf.com)

------
nothis
I had to deal with PDF metadata for work and it's a deep rabbit hole. Lots of
standardization headaches, too, especially when combined with PDF/A and such.

------
tonyedgecombe
It's a shame XPS never took off, it is superior to PDF in nearly every way but
obviously failed in the market.

------
fareesh
I assume mobile devices are relatively immune to most of the scary aspects of
what can be hiding in PDFs, is that correct?

~~~
ndnxhs
Mobile PDF tools won't strip all the metadata and comments out, but almost
every PDF reader not from Adobe or Chrome ignores all the dumb parts of PDF
like running JS, having network access or restricting you from printing
without a password.

