What’s Hiding in Your PDF? (pspdfkit.com)
216 points by ingve 12 days ago | 61 comments

Earlier this year I was working on hybrid PDFs[1] that embed a full XML invoice. The format is standardized and promoted by the Germans and the French.[2] One more thing to hide.

1: https://github.com/invoice-x/factur-x-ng

2: http://fnfe-mpe.org/factur-x/factur-x_en/

This is actually pretty cool. I've been working in an accounting company and I've been thinking about this sort of thing a lot lately. Is Factur-X used in practice in Germany and France? Do you know of any other similar formats?

The lower-level library is used in Odoo ERP, for example. Possibly others, as there are a few implementations[1] (mostly Java). It is a lot of work because you need to create the XML yourself. I tried to make mappings between simple keywords and the XPaths of different XML standards (there are a bunch).[2]

So in theory you could just give a few keywords and get the full XML. Currently this is on hold, but if someone has a use case and wants to contribute, I'd be happy to continue working on it. I also provide this other library[3] that extracts some essential data from PDFs. The plan is to use them together to automatically build the XML from just the PDF.

1: https://www.invoice-x.org/related/

2: https://www.invoice-x.org/standards/

3: https://github.com/invoice-x/invoice2data
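
The keyword-to-XPath mapping idea described above could look roughly like the sketch below. The two "standards", their element names, and the helper names `build_invoice_xml` and `read_field` are all made up for illustration; real invoice schemas use namespaced and much deeper paths, and this is not how factur-x-ng itself is implemented.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical keyword -> element path mappings for two made-up "standards".
MAPPINGS = {
    "standard_a": {
        "invoice_number": "Header/ID",
        "issue_date": "Header/IssueDate",
        "total": "Summary/GrandTotal",
    },
    "standard_b": {
        "invoice_number": "Document/Number",
        "issue_date": "Document/Date",
        "total": "Totals/Payable",
    },
}


def build_invoice_xml(fields: dict, standard: str) -> ET.Element:
    """Create an XML tree by placing each keyword's value at its mapped path."""
    root = ET.Element("Invoice")
    for keyword, value in fields.items():
        node = root
        for tag in MAPPINGS[standard][keyword].split("/"):
            child = node.find(tag)
            if child is None:  # create intermediate elements on demand
                child = ET.SubElement(node, tag)
            node = child
        node.text = str(value)
    return root


def read_field(root: ET.Element, keyword: str, standard: str) -> Optional[str]:
    """Look a keyword up again through the same mapping."""
    return root.findtext(MAPPINGS[standard][keyword])
```

The same keyword dictionary can then be written out in either layout by switching the `standard` argument, which is the point of keeping the paths in data rather than in code.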

Thank you! I will look into these. I know the problems. When I was working at the accounting firm I dealt a lot with imports and exports from various accounting software packages. In fact, I'm currently working on an app that can convert between them as a side project.

As far as I understand it, the "standard" is the XML that is embedded inside the PDF, so you need a small tool to extract the XML from the PDF and then you've got the standard XML invoice. The PDF is just for nice presentation.

I took a similar approach, using XML stylesheets (XSL) to render the standard invoice as HTML when opened; this also looked nice.

Is the PDF rendered from the XML? If not that sounds mad.

Mad how? PDF is a notoriously bad format to extract data from, but great for visually representing the data to humans. XML is human unfriendly but good for structuring data so that software can read it. Embedding the machine readable XML representation of the data in the PDF ensures that both representations of the data are available always.

I think OP is concerned about conflicting data - the visual representation may not match the XML data.

> ...the visual representation may not match the XML data

Hopefully OP doesn't learn about how his medical history is possibly passed around.


Given that it's possible to have JavaScript in a PDF as well, it wouldn't be too hard to include a bit of code that verifies the XML matches the human-readable version. Or, failing that, some sort of cryptographic signature covering both, to check that nothing has been tampered with.

As far as I can tell, the standard prohibits PDF files with any dynamic content, including JavaScript. Also, there is no point in embedding the verification code in the PDF if you don't trust its contents.

Because it will be easy to accidentally make the invoice say something different to the XML. Imagine a company accepts and pays out invoices via this PDF format...

It doesn't sound very safe, and the draft I found online only describes various restrictions on the PDF and the expectation that the XML is to be seen as an alternative representation for processing. Nothing enforces that the contents of both are identical or tamper-proof. At best you could claim that an invoice where the two don't match is invalid; however, that would require manual verification.

Meanwhile I write the documentation of my XML configuration files in XSD and convert them to something readable using XSLT. One set of data for processing and display everywhere, and one less headache about duplicated and badly maintained data.

Mostly the other way round. There are 2 use cases I know of:

1. You generate invoices in your ERP and have all the info. To make your client's life easier, you embed the info as XML, so he doesn't need to type it from the PDF.

2. You get invoices without XML and use e.g. invoice2data[1] to extract key fields and then add them as XML for later.

1: https://github.com/invoice-x/invoice2data

No, but it does include validation. I agree that it's mad for a person to use this, but having a machine use the API can help ensure that the PDF and XML always match.
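
A machine-side consistency check of the kind discussed in this subthread might look like the sketch below. It assumes the visible text has already been extracted from the PDF by some other tool, that the XML is a simple tree with made-up element names, and that amounts use a comma as thousands separator. The function names are hypothetical, and the loose substring match can miss some tampering (for example, "190.00" is found inside "1190.00"), so this is a first filter rather than a guarantee.

```python
import re
import xml.etree.ElementTree as ET


def normalize(value: str) -> str:
    """Lowercase and drop whitespace and commas for a loose comparison."""
    return re.sub(r"[\s,]", "", value).lower()


def find_mismatches(xml_text: str, visible_text: str) -> list:
    """Return tags of leaf XML elements whose values are absent from the visible text."""
    haystack = normalize(visible_text)
    mismatches = []
    for element in ET.fromstring(xml_text).iter():
        value = (element.text or "").strip()
        is_leaf = len(element) == 0
        if is_leaf and value and normalize(value) not in haystack:
            mismatches.append(element.tag)
    return mismatches
```

An empty result only means every XML value also shows up somewhere on the page; it says nothing about extra content in the visible part.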

I seem to remember JDEdwards marked up their PDF output in a way that made it easy to parse out the data.

The vector drawing program Ipe stores all its data in a PDF. So, you create a drawing, save it as a PDF and can later open the PDF again for editing. Text can be written in LaTeX format and the PDF will contain the LaTeX source for later editing.


Illustrator has been doing this for a long time. You can choose to “preserve editing capabilities” at save time. I assume it is similar for other vector drawing programs.

Ipe is fantastic. I used it for technical images in my thesis and papers and the graphics turn out beautifully despite the program being so simple.

PDFs can also have file attachments.


The official documentation also seems to recognize those as a security risk :)


This French startup claims it is able to run some JavaScript in the PDF, and can therefore notify you when a customer reads your offer, tell you which part he read, where he stayed the longest, etc. Does PDF support JS?


Yup. If you open this PDF in Chrome, you can even play Breakout:


That's creepy and makes me glad I don't use Chrome (doesn't work in Firefox).

You can run JS in a PDF. You can even embed Flash in a PDF. We tend to think of PDFs as an innocuous document format, but there's a lot more than that baked in.
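
As a rough way to see whether a given file declares any of this, one can count a few telltale name objects in the raw bytes. This is only a triage sketch: the list of names is my own selection, `count_names` is a hypothetical helper, and the scan misses names written with `#xx` hex escapes or hidden inside compressed object streams, while hits inside stream data can be false positives.

```python
import re

# Name objects that commonly indicate active or embedded content.
SUSPICIOUS_NAMES = [b"/JavaScript", b"/JS", b"/OpenAction", b"/AA",
                    b"/Launch", b"/EmbeddedFile", b"/RichMedia"]

# A PDF name ends at whitespace or a delimiter, so "/JS" must not match "/JSON".
NAME_END = rb"(?![^\s()<>\[\]{}/%])"


def count_names(data: bytes) -> dict:
    """Count whole-name occurrences of each suspicious name in raw PDF bytes."""
    counts = {}
    for name in SUSPICIOUS_NAMES:
        pattern = re.escape(name) + NAME_END
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts
```

To try it on a file, read the file in binary mode and pass the bytes in, e.g. `count_names(open("some.pdf", "rb").read())`, where `some.pdf` is a placeholder path.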

Adobe actually offers "features" like readership tracking in PDFs as part of their commercial offerings.

Thankfully, about half the "features" of PDF, like JS tracking and restrictions on editing or printing, are ignored by almost every PDF reader not made by Adobe.

Strongly considering pulling Acrobat company-wide and replacing it with something like Foxit.

Last time I opened a PDF with JS, even Acrobat warned about it and disabled it by default (but that was some time ago).

Wow, yikes, you CAN run JavaScript in a PDF:


Horrible but not terribly surprising.

I’ve been told the PDF spec has some low-level functionality to support an MS-DOS emulator. Don’t know how true that is.

Academic publishers also insert IP addresses and other deanonymizing information (about the user) into PDFs of academic papers, which should be removed.


I've been doing a TON of work around PDF lately with my Polar project and wanted to get feedback from you guys:


I just implemented bulk PDF import this weekend.

It uses pdf.js for rendering the PDFs and extracting the metadata (including the fields discussed in this article).

I have to manage a ton of PDFs for my work / research (mostly textbooks and compsci whitepapers), and before working on Polar I was really struggling to manage all the data.

You can create a Hybrid PDF in LibreOffice. It's a PDF with ODF embedded.


There are some great PDFs on the PoC||GTFO site that go deep into the subject :) https://www.alchemistowl.org/pocorgtfo/

Sort of related, the company whose website this blog post is on makes the best Android PDF reader, and it's free: https://play.google.com/store/apps/details?id=com.pspdfkit.v...

(Disclaimer: I am not associated with the company)

From the Permissions disclosure:

  >full network access
  >run at startup
  >prevent device from sleeping

My guess is they're selling users' PDF reading activity data.

PSPDFKit CTO here. We're not selling any user data. PDF Viewer exists because we sell an SDK, and having a great app in the store 1) makes it easy to showcase the SDK to potential customers and 2) gives us a broad user base that tests the SDK and gives feedback for free.

Misbehavior by other apps makes such permissions a red flag to some of us.

Prime offender: Amazon Music. Latest rev AFAICT has no "Quit". Have to resort to Settings->Apps->Force_Stop to get its clutter off the screen.

I can't rule out that the permissions are there to somehow collect user data, but they could also be there to showcase PDFs with embedded video. All my experience with the company has suggested they're on the up-and-up, though.

Disclaimer: I've used the company's SDK fairly extensively in my own product (but I'm not otherwise affiliated with them).

Are there any command-line tools that will let you see and/or edit all this metadata?

This is very good - I used it to decompress the streams in a PDF https://mupdf.com/docs/manual-mutool-clean.html

pdfinfo comes with Xpdf and allows you to view the metadata. See the man page: https://linux.die.net/man/1/pdfinfo

Didier Stevens has some great cli tools https://blog.didierstevens.com/programs/pdf-tools/

Exiftool might be able to do this. https://en.wikipedia.org/wiki/ExifTool

You can check out http://pdfedit.cz/en/index.html (no longer updated)

This is really old and unmaintained, it also has a x-window type UI that takes you back to 1997.

I have found that despite the crude nature of it there are things you can find in documents that other tools gloss over.

Genuine thanks for the link and the nudge - I have to go forensic on a few PDFs and I had forgotten what the tool was, as it has been a while.

Luckily the PDFs I need to edit are old and there should be no problem doing apt-get install rather than compiling a tarball.

Yup...not easy finding a free tool that will "decompile" a PDF. Too bad it's unmaintained :-(




> Didier Stevens has some great cli tools


Awesome, thank you!

> it also has a x-window type UI that takes you back to 1997.

Or '87, by '97 I was using Gimp already.

Master PDF Editor can do a lot. (Linux, commercial but the watermark somehow was never inserted into the file in my version)


Thanks for that, it is really good if you just need to 'see' what is what and get images out of a PDF.

CodeDraw, my live-code-drawing + GraphViz tool stores its source in the PDF. Also found a place to stash it in PNGs.

But you can treat a PDF very much like a big zip with some special purpose features. If you want to.
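
The comment above doesn't say where in a PNG the source ends up. One plausible place, and purely my assumption rather than a description of CodeDraw, is a `tEXt` chunk inserted before `IEND`. A minimal sketch with hypothetical helper names, relying only on the PNG chunk layout (length, type, data, CRC-32 over type and data):

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (type, data) for every chunk in a PNG byte string."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def stash_text(png: bytes, keyword: str, text: str) -> bytes:
    """Return a copy of the PNG with a tEXt chunk inserted just before IEND."""
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    out = PNG_SIGNATURE
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"IEND":
            out += make_chunk(b"tEXt", payload)
        out += make_chunk(chunk_type, data)
    return out


def read_text(png: bytes, keyword: str):
    """Return the text stored under the keyword, or None if absent."""
    prefix = keyword.encode("latin-1") + b"\x00"
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt" and data.startswith(prefix):
            return data[len(prefix):].decode("latin-1")
    return None
```

Image viewers ignore the extra chunk, so the picture renders as before while the source travels with it. `tEXt` is limited to Latin-1, so non-Latin source text would need a different chunk type or an encoding step.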

Well, files can even be attached to PDFs; in a few cases this is done to ship the "PDF source" in reproducible research. But we have pdftk and other simple utilities to manage PDF meta-information. It's also common to embed other kinds of "steganographic" information that may be really hard to discover. The simpler ones are like printer tracking dots, or plain white-on-white content that a bot can read easily; others may encode part of a PDF's content with a Caesar-like cipher, etc.

PDFs are a vast and not-so-clean world...

Meanwhile, most academic authors don't make any effort to add metadata that would actually be useful. I can't count how many PDFs I have with names like '0378final.pdf' and none of the fields like Author or Title filled in.

I built a PDF template designer, but overlooked the metadata side of things. Might have to look into it after reading this.


I had to deal with PDF metadata for work and it's a deep rabbit hole. Lots of standardization headaches, too, especially when combined with PDF/A and such.

It's a shame XPS never took off; it is superior to PDF in nearly every way but obviously failed in the market.

I assume mobile devices are relatively immune to most of the scary aspects of what can be hiding in PDFs, is that correct?

Mobile PDF tools won't strip all the metadata and comments out, but almost every PDF reader not from Adobe or Chrome ignores all the dumb parts of PDF, like running JS, having network access, or restricting you from printing without a password.
