50 CVEs in 50 Days: Fuzzing Adobe Reader (checkpoint.com)
344 points by myinnerbanjo 69 days ago | 166 comments

As much as many of us lament the state of much of today's software, products from a certain era - IE6, Flash, Java web applets - all had a commonality in their code quality. These are mostly a non-issue these days, but not because they suddenly stopped having bugs; some of them still get active use.

I remember rolling out Adobe Reader in those days and as a product, I don't believe its core has changed much. They've certainly managed to bolt on a whole lot of new features, but that can only make the position worse.

As much as this sounds like a call to kill Adobe, something needs to happen before that's feasible. For the average enterprise, Adobe Reader is far more ingrained than those products were. Case in point: in one organisation I asked whether Chrome's PDF viewer would cut it for them. One large department then ordered Adobe Professional for every user. They told me they didn't need it; they just knew I wouldn't propose removing a product they'd actually paid for.

Adobe Reader needs its HTML5 moment - an alternative that's not just "good enough for most people", but one that's actually better.

To be fair, in my experience Chrome's and Firefox's PDF viewers don't cut it. They are good for a quick preview, but especially when printing they occasionally render things slightly wrong, which is unacceptable for a file format whose entire point is to look the same everywhere. Also forms.

That doesn't mean that there aren't any alternatives. Foxit, for example, is pretty good. But in-browser alternatives just aren't there yet.

pdf.js has had 26 pull requests merged in the last month. 5,622 additions and 6,991 deletions. That's just in the project directly, not in the dependencies.

Since it's such a quickly evolving project, I wondered where form support is up to. https://github.com/mozilla/pdf.js/issues/7613

Form support is not complete. Seems like it required quite a rewrite to get the foundation in a good place to finish it off.

Are there rendering bugs? Yes. See plenty here: https://github.com/mozilla/pdf.js/issues

Of course, Acrobat also has PDF rendering bugs, and various other bugs apart from the security issues mentioned (their JavaScript implementation, for example).

As for printing... browsers aren't even good at printing HTML. The best browser for printing is based on the old Opera software (PrinceXML), and Safari is probably second. Remember, the Apple display system used to be based on PDF rendering... and they do a lot with CUPS and graphic designers.

However, printing PDFs on many browsers can go directly to the printer or the OS (which mostly all support rendering PDFs directly now).

> Remember the Apple display system used to be based on PDF rendering

Does Quartz not still try to match PDF in the way its render structure is composed internally?

Is there an AFL-type CI workflow for fuzzing pdf.js?

Googling it seems to only return links about how to fuzz JS engines.

I'd love to see a write-up of how to fuzz a JS application by doing AFL-type mutations on the server-returned data, etc.
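The core of the AFL-style mutation idea is simple enough to sketch. The following Python is a toy illustration only (the `target` callable is a hypothetical stand-in for whatever consumes the input, e.g. pdf.js driven from Node); real fuzzers add coverage feedback, corpus management, and much smarter mutation strategies:

```python
import random

def mutate(data: bytes, n_flips: int = 8) -> bytes:
    """AFL's 'havoc' stage in miniature: flip a few random bits."""
    buf = bytearray(data)
    for _ in range(n_flips):
        pos = random.randrange(len(buf))
        buf[pos] ^= 1 << random.randrange(8)
    return bytes(buf)

def fuzz(seed: bytes, target, iterations: int = 1000):
    """Feed mutated copies of a seed to a target callable; collect crashes."""
    crashes = []
    for _ in range(iterations):
        candidate = mutate(seed)
        try:
            target(candidate)  # e.g. hand the bytes to a renderer under test
        except Exception as exc:
            crashes.append((candidate, exc))
    return crashes
```

For server-returned data the same loop applies; you'd mutate the recorded response bytes before replaying them to the client code.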

Chromium uses PDFium, which is actually based on the Foxit code base, since they bought some pieces of it: https://www.foxitsoftware.com/blog/the-interesting-history-a....

Actually I've never had rendering bugs with Chrome, though it's certainly happened with Firefox.

We don't find bugs with it; it's just limited. You can't highlight, can't rotate individual pages, and can't save rotated pages (you have to "print" it to "Save as PDF" again, which is awful UX).

Dumb question:

Why can't we just burn PNGs[0] or lossless JPEGs and just use OCR / other simple machine learning for text selection? Like, I get that there are some unfortunate souls out there who need to edit CAD documents in their PDFs, but for 99.999% of people PDFs do one thing that websites do not:

Print reliably well given a page format like A4.

I shouldn't have to wince every time I open a PDF. They're so insecure that a no-click RCE only fetches $10k.

[0] Or ideally SVG, but there are some problems with fonts and licensing that I'm struggling to remember at the moment.

Because OCR is expensive (to write as software and to process for the end user) and very error-prone, especially if your text is anything other than a 12-point black font on a white background with no formatting (italics, underlines, etc.). If my document's information is valuable, I'm not going to be willing to rely on the quality of my recipient's OCR software to get a digitally readable copy of my work. I mean, at the very least, what if they're blind?

The general hatred for PDFs in the tech community is almost completely rooted in Adobe's initial decision to make PDF editing and creation cost $500. You have access to a document that you want to make changes to, but you can't, because it's a PDF and you don't have access to the document source because the owner/publisher didn't provide it. It's a PDF because PDFs make documents that look the same everywhere, even when printed, which is and will remain critical to the purpose of publishing documents. Well, images don't solve this problem either, because you still can't edit text in an image, and now you lose the ability to be sure about how they'll print (margins, scaling, etc.).

Furthermore, images, even compressed, are significantly larger than a well-made PDF. For example, I've got a 6,700-page document of special-ed student progress reports that includes detailed, full-color charts and graphs of student progress with respect to goals. It's 60 MB. 8.5 KiB per page.

Then again, I imagine it won't be long before someone mentions LaTeX as a viable alternative, even though the one thing LaTeX isn't is portable. But LaTeX is primarily popular in the tech community because it lets programmers pretend to write code while they're actually writing documentation. Nowhere else will you find people telling you to use a set of programs that require a build environment when someone asks about the best home office application to use. (Yes, I know that LaTeX is a typesetting language. My cynicism is that some tech people tell others to use LaTeX when they're asked what word processor someone should use.)

Edit: Clarified second paragraph.

> Then again, I imagine it won't be long before someone mentions LaTeX as a viable alternative, even though the one thing LaTeX isn't is portable. But LaTeX is primarily popular in the tech community because it lets programmers pretend to write code

Rude remarks notwithstanding, LaTeX and its ilk let you make PDFs, which are indeed portable. Setting up LaTeX is the same as setting up any other program, some of which are not portable either. ShareLatex.com [0] also exists for the purpose of using LaTeX anywhere.

People recommend LaTeX because it's in another league when it comes to typesetting and rendering more niche notation. It's also not user hostile when it comes to binary files. LaTeX source files will always be readable decades later, <binary app here> makes no such guarantees.

Whether it's a viable alternative depends on whether the user wants to make a minimal learning investment or not. If they don't, google sheets > export to pdf always exists.

[0]: https://www.sharelatex.com/

DVI files are about as user-hostile as it gets for a rendered document format, though.

No, the hatred for PDFs is that they're filled with bloat and horribly insecure.

As for OCR, we're able to handle underlines and italics for most fonts, though I take your point on colour. If it's especially bad, they fail. Ideally it wouldn't be PNGs; it would be some stripped-down thing. Maybe even HTML with embedded CSS / images inlined as data: URIs would fit the bill, but now we're bringing in XML-esque parsers and those are garbage too. I'm just so frustrated with dealing with PDFs. They serve a billion different purposes and they're good at none of them.

Accessibility, plus print is at ridiculous DPI compared to screen. To achieve compression you want to use the fact that there is a font being repeated across the page. OCR just isn't good enough.

Are you telling me that our compression algorithms can't compress a page of "e"s tighter than a page of random Chinese characters?

Accessibility is a fair point, but for print-to-file applications we're surely at the point where OCR can at least get the text to a readable format, no?
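On the compression question: a general-purpose compressor does flatten a run of identical characters dramatically compared to high-entropy data. A quick illustration with Python's zlib, on raw bytes rather than rendered page images, so it only shows the principle:

```python
import os
import zlib

page_of_es = b"e" * 3000        # a "page" of one repeated character
random_page = os.urandom(3000)  # stand-in for incompressible data

# Repetition collapses to a few bytes; random data barely shrinks at all.
print(len(zlib.compress(page_of_es)), len(zlib.compress(random_page)))
```

The parent's point still stands for rendered pages, though: a font-aware encoding (store each glyph once, reference it per position) beats generic byte-level compression of pixels.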

I've never noticed rendering errors, but my problem with the browser built-in PDF viewers is that they can't handle big complex PDFs, especially on older machines. They'll gobble up 4GB of RAM like it's nothing and start swapping on PDFs that Acroread or xpdf display in less than a second.

A counterpoint: for a while I was working at my uni's help desk, and we would ask all clients to print PDFs from Chrome as a matter of course just because it was so much more reliable at producing the correct output on paper, even when compared to Adobe Reader.

My organization uses Adobe extensively and we could never make do with the Chrome viewer. When you're opening 200-page documents with links, highlighted text and bookmarks, a browser plugin just won't be fast and responsive enough.

I would imagine this is the case in most large organizations.

> One large department then ordered Adobe Professional for every user. They told me they didn't need it, they just knew I wouldn't propose removing a product they'd actually paid for.

Why would they do that though?

I'm assuming that they didn't actually need any of the professional features, they just saw it as a way to avoid having Adobe Reader/Acrobat removed from systems in favor of something they like less but admins like more.

Assumption is correct.

Who's going to go into an entrenched, mature market?

The money's going to the easy places, new markets.

Being able to run JS in a PDF sounds scary to a lot of people, but I wouldn't throw that idea out entirely.

If you follow the work by Bret Victor & others on "explorable explanations"[0][1] and interactive scientific papers[2], you probably appreciate the need for a self-contained format for interactive documents. Could PDF be this? I don't know, I hear the spec is too scary. But I'd say we should have something like that.


[0] - https://explorabl.es/

[1] - http://worrydream.com/ExplorableExplanations/

[2] - http://worrydream.com/ScientificCommunicationAsSequentialArt...

Distill[1] is another example of interactive scientific papers (with a focus on machine learning).

But is there really a good reason not to just keep these in the browser? I don't really know if there's much value in reading these locally. Maybe this would be a good fit for an Electron app?

[1] https://distill.pub/

I would like HTML files to mostly replace PDF documents. However, they lack a couple of things:

* A way to save back form data. I believe Google is working on a JS API to access local files (given a few conditions).

* A way to bundle the HTML with every JS script, resource, CSS, etc., in one file, without making a huge mess.

If you had a tar.gz with an index.html inside, and the browser transparently allowed r/w access to the archive contents from contained JS scripts, this could solve a lot of use cases (heck, even "Electron" apps could be replaced by this). One exception being printed documents (PostScript), at which PDF is quite good.
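The archive half of that idea is trivial to produce today; only the browser side (transparent r/w access from contained scripts) is missing. A sketch of building such a bundle with Python's tarfile (the file names and contents here are made up for illustration):

```python
import io
import tarfile

def bundle(pages: dict) -> bytes:
    """Pack {filename: text} into a gzipped tar: the whole 'document' in one file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, content in pages.items():
            data = content.encode()
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

archive = bundle({
    "index.html": "<html><body><img src='cover.svg'></body></html>",
    "cover.svg": "<svg xmlns='http://www.w3.org/2000/svg'/>",
})
```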

I'm in the same boat with browser-based vs Electron apps (see my question [1]). I don't think PDF-based forms are an alternative to reactive web forms, though, as they aren't dynamic enough. The sole purpose of PDF is page-oriented print, which HTML(+JS) can't deliver.

[1] https://news.ycombinator.com/item?id=16773933

<Insert the long list of arguments about Internet (especially _fast_ Internet) not being as ubiquitous as living in SV could make you think.>

That behind us, there's also a matter of reliability and control. Services live much shorter than data they process; given today's trend, I wouldn't expect an online-only paper to be available after 5-10 years. Having a self-contained bundle would let me archive it independently, and would prevent any third parties from being able to interfere with my reading/exploration.

You can write self-contained, single-file .html documents just fine.

Not when you need images, and if you need to display 3D data you are required to actually serve textures from a server.

You can embed images in the HTML as well

Sure, if you resort to tricks like Base64 encoding in strings; that won't do for WebGL textures, though.
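For ordinary <img> tags, at least, the inlining is mechanical. A small Python sketch of producing such a data: URI (the image bytes below are a placeholder, not a real PNG):

```python
import base64

def inline_image(image_bytes: bytes, mime: str = "image/png") -> str:
    """Return an <img> tag with the image embedded as a base64 data: URI."""
    payload = base64.b64encode(image_bytes).decode("ascii")
    return f'<img src="data:{mime};base64,{payload}">'
```

The base64 overhead is roughly 33%, which is part of why single-file HTML bundles get bulky.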

The same idea made the web the disaster it is today.

No, it didn't.

The same idea made the web. Commercialization of the web was what caused the disaster.

I recall listening to a presentation at RSAC around 2013 or 2014 where an Adobe CISO or CIO or someone pretty much said that they don’t give fucks about product security. E.g. zero impact on sales. I suspect it was thrown in as a bit of a trolling attempt in a conversation, but looking at their track record, maybe that is the reality.

> they don’t give fucks about product security.

More accurately stated as "we sandboxed it, so anything discovered is less likely to be critical." https://www.adobe.com/devnet-docs/acrobatetk/tools/AppSec/sa...

I've heard a variant of that talk delivered by a non-C-level at an appsec/prodsec-focused conference where the rehashed quote above (though I'm blatantly paraphrasing) was the justification used. Something more closely reflecting the truth might be "we can't realistically tackle the many security defects in Acrobat and Flash, so we sandboxed both applications instead to generally reduce the technical risks posed by any vulnerabilities in code."

Except somehow we still end up with horrendous security vulnerabilities in both. Putting things in a sandbox does not necessarily mean that you did it correctly.

Exactly this. Thank You.

Honest question: why can't Adobe hire product security engineers to do this kind of vulnerability discovery, or even hire third-party consultants, to fix bugs/vulnerabilities before they even get into production?

Every CVE exposed by outside third parties like this is a stain on their software quality and reputation, IMHO.

This. I left Adobe in 2008 (involuntarily :-) ), and it boggles my mind that they haven't done this sort of fuzz testing and fixed the issues in the last 10+ years. Sure, putting the code in a sandbox covers a multitude of sins, but I don't think that is sufficient. Many other Adobe products use the same code to read/write PDF files, and AFAIK they don't do it in a sandbox.

This is a great question, and I have thought a bunch about it; the only conclusion I could reach is that they don't care enough. This kind of news does not affect Adobe's stock price or their profits. Their users probably mostly don't care. So why bother paying $$$ for security engineers?

If an important zero-day in Adobe software causes rippling effects, perhaps they will care more? At this quality, and given enough time, it is probably bound to happen.

If you have a PDF document on your web site, please consider putting a link to https://pdfreaders.org/ instead of unfairly advertising Adobe Reader.

Which gives (except for pdf.js) more PDF readers written in C, some with a long history of CVEs, and typically not sandboxed by default.

Since many people are using a PDF reader to read PDFs from relatively untrusted sources, do yourself a favor and at least use a reader that does not have full system access.

macOS: Preview.app (uses macOS sandboxing)

Linux: Evince Flatpak on Wayland (Flatpak uses sandboxing. Wayland because X11 apps can read all keystrokes, mouse events, do screengrabs.)

Windows: no clue

All platforms: in-browser PDF reader with a browser that sandboxes.

If you're counting on Wayland to sandbox arbitrary code execution, you're going to get in trouble.

Applications that can send commands to X.org servers can completely control it. The same isn't true for Wayland.

Flatpak is providing the actual application sandboxing, but being allowed to talk to the X server is a huge amount of privilege that can't really be restricted.

I think they're counting on flatpak.

> Windows: no clue

I think UWP apps are sandboxed by default, so something like Xodo PDF could be a possibility.

or Edge

Is such a link even still necessary?

Edge, Firefox, and Chrome have built-in PDF readers.

For more control, sites can self-embed pdf.js so no external reader is required.

Unfortunately, yes it is. Just yesterday, my wife tried to open a PDF transcript from her college. It would not open on anything other than Adobe Reader on a traditional OS, putting it out of reach for her, being an Android/Chromebook user. Neither Chrome nor Google Drive/Docs could open it. And I could only open it in Adobe Reader on my laptop - not Firefox, not Chrome, and not whatever default viewer my laptop has. We've had this problem with PDFs from another organization, too. It is a real problem.

Yeah, official transcripts from my undergrad have (or had; I haven't needed one in a while) some sort of authentication thing. Fortunately Adobe Reader for Linux was still supported when I needed one...

I think there used to be an official port of Adobe Reader to Android. Probably discontinued now, but it was a thing.

It's not discontinued, it's actively maintained (last updated 5 Nov 2018): https://play.google.com/store/apps/details?id=com.adobe.read...

Are you going to suggest that I should remove the "Made in Notepad" animated GIF from my homepage too? You monster.

What the shit is there? Where is the most popular alternative, FoxitReader?

Foxit Reader is closed source, and the readers mentioned on that site are not.

Addendum: most of the links in the bottom row don't work anymore. Needs updates.

That site is, in my opinion, hilariously bad for the non tech-user.

Top row:

Platform (what's that?): GNU (isn't that some kind of African animal?) Linux (oh, I know that one, it's the cute penguin!)

The rest of the text in those boxes is mostly techno-babble for non-tech users (Gnome? KDE? DjVu?!??)

I understand the intent behind it, but it would only serve a very small niche of users, who can already fend for themselves.

Everyone else would go like: PDF? Ah, that's Adobe!

It's amazing browsers have so far decided not to have an HTML archive format that could replace PDF. The majority of what PDF does can be done better in a webpage. Why not just an extension like .phd that is actually a .tar.gz containing a webpage's assets? Present it like PDFs are, and done.

PDFs are supposed to look the same on every computer. Webpages can’t do that yet.

More so, webpages were never meant to, by design. The web is not print media without the print.

Not with text markup, but you could just use canvas or svg.

I don't know how accurate that is for PDFs, but webpages are supposed to look the same too, and given known-compatible styling, they should on any modern browser. Browsers are extremely consistent in content presentation; that's why webpages from the early 2000s still look the same.

No. Take for example font-family: sans-serif. That can look like anything, can have different widths on different devices, etc. Browser windows can have any size, devices can have various pixel densities, users can work at different zoom levels, etc. The previous big thing was responsive design.

The same goes for PDFs. If the font isn't shipped inside the bundle, the PDF will look like shit.

Good point. I noticed that too. At least web pages are made so that the content is reflowed (that seems like the whole point), so it doesn't look like shit. It seems like many PDFs place each character separately, so if the font actually used differs from the one used during creation, the result will look very messy.

Web page rendering is far from similar on different browsers. I agree that an alternative to PDF would be a good thing, but it probably would be more like a lightweight PDF than what HTML is today.

What? Lots of webpages look different after simply resizing the window! The fact that this is on purpose doesn't mean it doesn't happen (quite the opposite!).

That's because they're designed that way. You can do styling in a way that is not affected by browser window sizing, typically with specified document dimensions or absolute positioning.

> I don't know how accurate that is for PDF's, but webpages are supposed to look the same,

One of Adobe's early talking points for the value of PDFs was that they would "look the same on all systems". Of course some context is necessary. PDF first appeared in 1993. In 1993, while the internet did exist, most individuals who were not associated with a university, research lab, or government agency had no access to 'the internet'.

As well, the computing world was much more diverse. One had DOS, early Windows, and various MacOS variants all coexisting, plus numerous different variants of Unix on the numerous different RISC workstations in existence. And, here was the big deal: 'documents' created on each of these systems were to a large extent incompatible with each other. In this context, 'document' should be thought of as "a file used to create paper printouts" as opposed to what we think of as a 'document' now in 2018. There was some compatibility, in that Windows systems would sometimes read 'documents' produced by DOS-based word processors, and of course the lowest common denominator, the plain text file, was 'almost' compatible (line-ending differences were the biggest incompatibility). But for anything more complicated, if person X created a 'document' on DOS and wanted person Y, using SunOS, to see a version that "looked the same", their best bet was to print the document to paper and give Y the printer output. Because if they could send the electronic file to Y somehow, chances were that Y would be unable to open it, and even if they could, there was a good chance that it did not 'look the same' (at a 'looks like the same paper printout' level of sameness).

PDF came about in this world where paper was still king, and Adobe's marketing of "looks the same" really meant "produces the same paper printout for receiver Y as it does for creator X". That is why, today in 2018, viewing a PDF still looks like viewing a WYSIWYG paper printout. PDF is quite intimately tied to the concept of discrete sheets of paper onto which it formats data. Yes, some viewers do provide an 'almost' HTML-like continuous-scroll look, but that is done 100% in the viewer; the underlying PDF format is very paper-page-oriented at its core.

So, when comparing PDF's intent to web pages' intent, the phrase "looks the same" has different meanings. For PDF, "looks the same" means that a paper printout looks identical to the original, and that the designer/creator has full control over the look, while the viewer has none. For web pages, "looks the same" is far less strict, and really doesn't carry the same meaning, because the web was always intended to allow the viewer much freedom in deciding how to display the HTML content, taking away the designer's ability to strictly determine look and presentation. As a result, HTML was never meant to "look the same" with the strictness intended by PDF.

That was really informative, thank you. Given the same rendering on browsers across platforms, I imagine you could achieve the same effect as PDF, but it would be a spec on top of HTML+CSS, not something inherently built for documents like PDF is, as you said. There may be some differences in important edge cases, but PDF would still exist for businesses that rely on it in that manner. I'm talking more of a replacement that fits the 90% of cases that don't deal with signatures, legally binding documents, and such.

You can take a PDF and plot it, print it, or display it on screen, and it will always look the same. SVG is closer to PDF than HTML is, and SVG gets a lot of grief for having an overly complicated spec too.

Isn't that sort of what MHTML is?


I remember saving .mht files with IE as a kid when working on assignments so I could disconnect the dialup and give my parents their phone line back :)

Sort of, but MHTML isn't a good format. It was a hacky way of doing what emails did: embedding all content in a single file, not as an archive. Rather, you should be able to open an HTML archive like an actual archive and see the individual files.

Opera 12 (the original one, before the managers decided that it should be based on Chromium) had .zip file support built in; that meant that if the URL was


and index.html referred to other files, they would be read from the same zip, even if they existed only inside the zip.

I used it a lot for local archives of bigger content; it is amazingly convenient, and I'm sad that the same approach was not used anywhere else.

It's not trivial to get right, security-wise (the zip implementation has to be robust, the URL handling too), but it's doable and it would be very practical to have.

Tangentially, the good thing about the zip format is that it has a so-called "central directory", which means that you don't even have to load the whole archive if not all data is needed, just the last part of the file; from there you get the offset and the location of the needed file. So zip files could work beautifully with HTTP Range requests when they are huge(1). Small ones are of course most efficiently downloaded at once.

1) I've actually done such a sequence by hand a few times when I had a slow connection and knew that I didn't need the whole zip file, just to see that all the files were inside: I made a range request for the end of the file, big enough for the estimated number of files inside, and so had the list of all the files in the archive without needing to download the whole thing. I reconstructed a file of the same size but left the rest of it as zeroes, and some of the zip tools I used treated the archive directory exactly as I needed.
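That tail-only trick can be reproduced with Python's zipfile, since listing entries needs only the end-of-central-directory record and the central directory, both at the end of the file. A sketch (the 150-byte tail size is an assumption sized for this toy archive; extracting member contents would still need the zeroed-out local headers):

```python
import io
import zipfile

# Build a small stored zip in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("index.html", "<html>" + "x" * 500 + "</html>")
    z.writestr("style.css", "body { margin: 0 }")
data = buf.getvalue()

# Simulate a Range request for only the tail: zero out everything else,
# keeping the file the same size so the directory offsets still line up.
TAIL = 150  # enough here for the central directory + end record
partial = b"\x00" * (len(data) - TAIL) + data[-TAIL:]

# Listing entries needs only the intact tail.
names = zipfile.ZipFile(io.BytesIO(partial)).namelist()
```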

The reason I don't suggest zip is its insecurity, like zip bombing. It'd be better for archival if we just had tar, and then something lightweight on top of it if compression is wanted. That way you could have JS generate the archive client-side.

It is interesting how the older web got some things right, though; now it's 2018 and those ideas, which one would think should be robust by now, aren't even there.

Zip is not inherently insecure, any more than any other URL parsing and archive handling is. Technically it's a far better solution for random access (due to the central directory I've already mentioned) than tar.gz, if it's done right.

A pathological case can be constructed for every archive format, just as it can from relative file names etc., but such attempts can simply be rejected during processing once some thresholds are reached. The original article demonstrates that a JPG-reading implementation can be bad enough, and the same can be said for every format, even text-based ones. It simply has to be done right (including fuzzing at the end).

With service workers we're almost a save-as shim away from being back to this.

IIRC the epub format used for e-books is essentially a container with html.

90% of PDFs could be replaced using a background PNG/JPEG file and a visible/invisible text overlay.

Instead of forms embedded in the “.phd”, one could just use HTML forms and then use JavaScript to export it as a “.phd” document, covering 99% of PDF use cases.
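A minimal sketch of that image-plus-overlay page idea in Python (the function and dimensions are made up for illustration): a raster background with selectable but invisible text stacked on top, sized like an A4 page:

```python
def page_html(image_uri: str, text: str) -> str:
    """One A4 page: raster background plus an invisible, selectable text layer."""
    return (
        '<div style="position: relative; width: 210mm; height: 297mm;">'
        f'<img src="{image_uri}" style="position: absolute; inset: 0; '
        'width: 100%; height: 100%;">'
        f'<pre style="position: absolute; inset: 0; margin: 0; '
        f'color: transparent;">{text}</pre>'
        "</div>"
    )

page = page_html("scan-page-1.png", "The selectable text for this page...")
```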

.chm (the Windows help file format) is almost exactly that. It had its fair share of security vulnerabilities.

Acrobat Reader has been the poster boy for poor software for many years, and it appears that Adobe have been good at adding new features to make it largely impossible for their competitors to keep up.

What is one to do?

Surely the obvious answer is to ring-fence PDF (or another, new format) to the most basic features. These could more easily be handled by third-party apps, both securely and with correct rendering. Let Adobe do whatever they want with their own format by adding loads of stuff people don't want; then the sell is harder for them:

Get a cheaper, safer app for writing portable docs which can do most things or pay more money for a very insecure format that does stuff you don't need.

I assume that others have attempted at some point to make an OSS alternative to PDF, and I'm guessing it hasn't worked yet?

That is sort of what PDF/A is: https://en.wikipedia.org/wiki/PDF/A

What about having continuous fuzzing servers for just about any software? Kind of like VirusTotal.

If you're talking about the developer of the software? Potentially. As to third parties, this article goes into painstaking detail on how difficult it is to set up fuzzing for closed source binaries.

You need to understand a certain amount of "rules" around each API call, and while you can duplicate their normal usage, there's a certain amount of thought that has to go into it.

Google does this with their oss-fuzz project (only for open source projects, since as Someone1234 noted it's very difficult otherwise):

https://github.com/google/oss-fuzz

Microsoft had 'project Springfield' which became 'Microsoft Security Risk Detection' (https://www.microsoft.com/en-us/security-risk-detection/). Kind of like Fuzzing as a Service.

Does anyone on HN have any experience to report with this?

We've been working on something like this for the past couple of months and we'll be launching in early/mid January![0] We've got experience working on large scale fuzzing infrastructure (Chrome fuzzing team, Coinbase fuzzing), and have modelled it similarly to Google's oss-fuzz[1], but for private projects and clouds.

We're always looking for companies and security researchers that want to fuzz but don't have the time/knowledge on how to do so (we automate a lot of the set up process and integrate nicely into your GitHub workflow) - drop me a line if you're interested - andrei@fuzzbuzz.io

[0] - https://fuzzbuzz.io

[1] - https://github.com/google/oss-fuzz

What a strange bar graph, where each column is color-coded to a year rather than just putting the years on the x-axis.

Since the KDE, GNOME, and FSF foundations got significant contributions this year, I wonder why they don't join forces and hire a couple of full-time developers to make Poppler and all Poppler-based PDF viewers (Evince, Okular, etc.) actually useful for PDF forms and animated and interactive content.

Adobe's software is large enough and ingrained deeply enough that people seem to give it a pass on today's standards for software stability. Holding it to a standard of review closer to git's and libgit2's would yield even more value.

Git total loc: 279,993

libgit2 total loc: 219,887

Git CVEs (so far): https://www.cvedetails.com/vulnerability-list/vendor_id-4008...

libgit2 CVEs (so far): https://www.cvedetails.com/vulnerability-list/vendor_id-1606...

Do people still use Adobe Reader nowadays? The last time I tried it, in my school's library, it took half a minute to load and render my document, and after that the whole UI was unresponsive.

I had a much better experience with Sumatra on Windows and Zathura on Linux, where my documents open almost instantly.

Libpoppler has poor support for PDF forms (especially Unicode[1][2]), embedded animation, and the 3D extensions. In my opinion these areas are too important in real-world document exchange to be ignored (as the PDF FOSS tools currently do).

[1] https://bugs.freedesktop.org/show_bug.cgi?id=17913

[2] https://gitlab.freedesktop.org/poppler/poppler/issues/463

I have never seen anyone use any of these features in the real world. I presume that embedded animation and 3D extensions are used in art-related fields? If so that would explain my ignorance.

PDF forms are used all over the place from what I can tell -- including a bunch of county government stuff I just had to deal with. No JS was involved though.

I've seen pretty heavy JS usage on US gov forms, the most recent example being the I-9 form you fill out when you get hired.

This is the I-9 I've used before:


Interestingly enough I can't fill it out in Firefox, but I can with Preview.app. Running pdfinfo -js yielded some script, but it basically only looks like it's there as a gatekeeper so that you don't open the file with an older version of Reader. Is there more JS in there that pdfinfo can't extract?

The PDF used to apply for the BSA's Eagle Scout rank needs that stuff. I believe the reason is related to expandable text fields that might need to insert pages into the document. None of the non-Adobe viewers can handle it.

I have to use Reader to fill out my state tax forms because they use some modern JavaScript driven system to auto fill that no other reader can handle.

We had the same situation in the UK until recently, thankfully the new API-based system has opened it up to other platforms (eg Xero) and works very well.

PDF forms are super common for government interactions in many countries. Good luck ignoring those!

A long time ago, when I used PDF exclusively as the format for my slides, I used features like animations and auto play videos embedded within PDFs. Very helpful if you want to give a presentation.

Art and engineering.

.. tried Zathura on Lubuntu just now for the first time; appears to be a vim for document viewing or something.. no interest in Zathura here!

and just like that, you've convinced me to install it. Different strokes for different folks, I suppose. :)

I was also convinced to install it, although after trying it, it looks more like less for PDFs than vim for PDFs (as many of the commands search or scroll in complex ways, but none of them modify the PDF).

Still, it's interesting to have something like less for PDFs!

IMO mupdf is the real less for PDFs. It is so lightweight and straightforward it makes everything else seem terribly bloated.

Zathura can use mupdf as its PDF renderer (it can also use poppler). I like using the mupdf library through Zathura rather than using the mupdf application, because Zathura has plugins for other file formats too, like PostScript and DJVU, and that way I learn a single set of keystrokes to view all sorts of documents.

What do you use? I also had good experiences with Okular in the past but I have not tried it in ages. Before moving to zathura I used xpdf and evince but I was not satisfied by them (not to mention that they do not support postscript and djvu).

Zathura does have some vim keybinds but other than that I can't see any similarities.

Maybe try mupdf instead?

You almost make it sound like a bad thing.

I may have missed something, but it looks to me like this is really a test of just the JPEG 2000 part of Acrobat Reader. It is possible that Adobe built this part of the reader by taking some open-source implementation of JPEG 2000 (such as the reference implementation) and modding it - probably by changing memory allocation to be consistent with AR's memory model. So it is possible that some or many of the discovered vulnerabilities are in fact part of the JPEG 2000 library, in which case the problem goes beyond Adobe Acrobat.

You missed something: the article says at the end that they fuzzed many different parsers, not just the JPEG 2000 one.

Ah - thanks

Oh no, not Adobe again! This company has been putting out shitty (in terms of security) software for 20+ years. Anyone remember Adobe Flash? (shudder)

If you read the PDF spec from the late 90's, it is Stephen King novel-scary... container format, multiple encodings, encryption, embedded binaries, embedded JavaScript and more.
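Even a minimal PDF shows that container structure: a header, a graph of numbered objects, a cross-reference table of byte offsets, and a trailer. Here is a sketch (the function name is mine, and the file is content-free, so this is structural illustration rather than a production-quality writer) that assembles a one-page PDF by hand; the offset bookkeeping in the xref table is exactly the kind of thing parsers have to validate:

```python
def minimal_pdf() -> bytes:
    """Assemble a minimal one-page PDF (no content stream) by hand."""
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n",
        b"3 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>\nendobj\n",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for obj in objects:
        offsets.append(len(out))  # byte offset of each object, needed by the xref table
        out += obj
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"           # mandatory free-list head entry
    for off in offsets:
        out += b"%010d 00000 n \n" % off      # 10-digit offset, 5-digit generation
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_pos))
    return bytes(out)

pdf = minimal_pdf()
```

And that's before any of the scary parts: the spec layers compression filters, encryption dictionaries, incremental updates, embedded files, and JavaScript on top of this same object graph.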

While working with the PDF format I sometimes get the impression that this complexity is what Adobe wants. As a result, Adobe Reader is the only viewer that implements the entire spec and can handle all (or most) quirks.

This is especially apparent when trying to edit arbitrary PDF files, which is sometimes not so easy or even impossible. Just the definition of fonts and the text layout is already so complicated that this is the logical consequence.

But perhaps the format has simply grown and led to additional requirements such as PDF/A, PDF/X, PDF/E and now PDF 2.0, the next standard that makes everything even more complex... Will this ever stop?

PDF is an unusual format in the sense that it had a rather specific thing it tried to do and then achieved that goal, so it could be considered "done", but the product it was most associated with, Acrobat, kept trying to expand.

PDF has the semantics of a digital print that is resolution-independent and supports copypaste and search (mostly by mapping glyphs back to text).
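That glyph-to-text mapping is visible in the page content streams themselves. A naive sketch of extraction (illustrative only: it handles just literal-string `Tj` operators, while real extractors must also deal with `TJ` arrays, hex strings, escape sequences, font encodings, and ToUnicode CMaps):

```python
import re

# An uncompressed page content stream (illustrative): text is positioned
# explicitly, and extraction means walking the show-text operators.
stream = b"""
BT
/F1 12 Tf
72 720 Td
(Hello) Tj
( ) Tj
(world) Tj
ET
"""

def extract_text(content: bytes) -> str:
    """Grab the literal string operand of every Tj operator."""
    parts = re.findall(rb"\((.*?)\)\s*Tj", content)
    return b"".join(parts).decode("latin-1")

text = extract_text(stream)  # "Hello world"
```

This is also why copy/paste from PDFs so often degrades: the format records positioned glyph runs, not words or paragraphs, so "text" must be reverse-engineered from layout.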

Resolution independence is already a step above a strict "digital print", and the ability to capture transparency is another such higher-level feature.

From the above perspective, PDF peaked at 1.4, when it got transparency support. Supporting roughly the PDF 1.4 feature set was what allowed the Mac Preview app to be good enough for Mac users that Apple could stop bundling Acrobat Reader with Macs.

After 1.4, PDF has gotten better compression algorithms that don't really change what the format is about. PDF/A and PDF/X fit the notion of PDF as "digital print" well.

But Adobe has been trying to leverage Acrobat/PDF to other areas that don't fit the notion of "digital print". These include pre-Macromedia acquisition attempts to make PDFs a more dynamic platform and later inclusion of 3D models in PDFs. Other PDF viewers still work for users most of the time without this stuff, which is a signal of what PDF really is to users ("digital print").

(Filling in paper-like forms, while not true to the notion that PDF is a final-form format, sort of makes sense from the point of view of digital paper, though.)

> While working with the PDF format I sometimes get the impression that this complexity is what Adobe wants. As a result, Adobe Reader is the only viewer that implements the entire spec and can handle all (or most) quirks.

While that certainly does play in Adobe's favor, the complexity of the spec is also what occurs when, over time, new features, some never even envisioned by the original creators, are bolted on to keep the whole thing relevant and/or to add new "features" to keep it from becoming obsolete.

We can certainly argue whether the addition of particular features was worth the complexity increase, but simply taking an existing system and bolting on the latest "hotness" to add to the checklist of "why one should upgrade" features also produces similar levels of complexity.

So some of the complexity increase is merely the fact that the pdf spec. has been evolved to do things it was likely never designed to do in the first place.

The same could be said for the Microsoft Office file formats.

Or PSD, for that matter.

The Office formats are well specified; they are complex because that is the nature of the software, but they are a world away from something like PSD or even PDF.

PDF is actually quite well specified, there are not many holes in the specification itself.[0] As to what Adobe Reader will do when it encounters an out-of-spec file, that is a lot fuzzier.

On the other hand, the Office file formats (especially Word) have many un- or underspecified cases.

[0] The only one I know of is finding the end of compressed inline image data.
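Concretely: an inline image's binary payload has no declared length, so a reader has to scan for the terminating `EI` keyword and guess whether an occurrence inside the raw data is real. A sketch of that heuristic (the function name is mine, and real viewers apply further checks, such as decoding the expected number of image samples):

```python
def find_inline_image_end(data: bytes, start: int) -> int:
    """Heuristic scan for the EI keyword ending inline image data.

    The spec gives no length for the payload, so viewers look for a
    whitespace-delimited 'EI' -- which can false-positive if the raw
    image bytes happen to contain that sequence. This is the ambiguity
    the spec leaves open."""
    i = start
    while True:
        i = data.find(b"EI", i)
        if i == -1:
            raise ValueError("unterminated inline image")
        before_ok = i == 0 or data[i - 1] in b" \t\r\n\x00"
        after = data[i + 2:i + 3]
        after_ok = after == b"" or after in (b" ", b"\t", b"\r", b"\n")
        if before_ok and after_ok:
            return i
        i += 2

# The 'EI' embedded mid-data (index 5) is rejected; the delimited one wins.
payload = b"ID \x89\x50EIxx\x4e ... EI Q"
end = find_inline_image_end(payload, 3)
```

If the image bytes happen to contain a whitespace-delimited `EI`, even this heuristic picks the wrong spot, which is why it qualifies as a genuine hole in the specification.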

I agree, the PDF spec is great, and very easy to understand (if slow to wade through). The hardest parts are when you have to duck out to read another spec for a contained format like TrueType.

Regarding Reader, I work with PDFs a lot, and the majority of issues have a fairly common pattern. The supplier has created a PDF in a 3rd party tool, which is invalid in a subtle way (production printers in particular are very specific about what they want to accept).

But it works fine in Adobe Reader, since it was built to be very tolerant in what it accepts, so it's often hard to convince the non-technical users that the file has an issue. It's great for end users but has meant that a lot of tools out there just didn't have to try too hard to make PDFs that mostly work, so programming workflows can be an issue.

I found quite a few areas that were vague when I was working with it.

The advantage of the office formats is they are Zip files with a ton of XML, ie they are well defined. The application parts are another matter of course.
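The "zip of XML" claim is easy to demonstrate with nothing but the standard library. This sketch builds a docx-like package in memory and reads the paragraph text back; the part name and namespace follow the real OOXML layout, but it omits required parts like `[Content_Types].xml`, so it is an illustration, not a valid .docx:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Write a minimal main document part, as Word stores it inside the zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr(
        "word/document.xml",
        f'<w:document xmlns:w="{W}"><w:body>'
        f"<w:p><w:r><w:t>Hello from OOXML</w:t></w:r></w:p>"
        f"</w:body></w:document>",
    )

# Read it back: it's just zipfile + ElementTree all the way down.
with zipfile.ZipFile(buf) as z:
    root = ET.fromstring(z.read("word/document.xml"))
text = "".join(t.text or "" for t in root.iter(f"{{{W}}}t"))
```

The container is well defined; as the parent comment says, the hard part is the application-level semantics encoded in those XML elements, not the parsing.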

Just because something is XML doesn't mean it is "well-defined".

No, but XML parsing is a solved problem, PDF parsing isn't.

The original criticism was that some parts are just binary blobs encoded in XML elements, which wouldn't surprise me at all, with Microsoft being allowed to tick the 'XML file format' checkbox while still keeping the binary format's advantages.

I see. I was mostly referring to semantic problems, of which I heard there are a lot (I haven't really worked with Office internals much), and also I was thinking of the pre-XML Office formats.

I remember reading in the past that Microsoft had corrupted the ISO standards body into publishing essentially fake standards that differed from what MS Office actually produced, so software like LibreOffice would output files that didn't work properly in Office, or vice versa. Are you saying that this is no longer the case and they are fully specified? I sometimes tell people about this, so I want to make sure I have my facts straight.

The old binary formats were pretty insane, though. Lots of magic to pull the file size down.

Okay, apparently I totally didn't notice the release of PDF 2.0 a year ago, even though I was working a lot with PDFs at that time. Also, this new version is an ISO standard that costs 198 CHF to download, so I hereby predict that it is basically dead in the water, since few people will bother implementing it. The new features also don't seem very interesting, and from what I gather the spec is still backwards compatible despite the major version number increment.

I actually enjoy reading the PDF spec - it’s here for anyone who wants to take a look https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PD...

Like every bit of business software, there’s a load of stuff that shouldn’t be in there. It’s a really flexible container format though, and every one of these features went in because there was a need. Times change, things change and it could do with a tidy up, but it’s probably impossible without breaking everything for a load of businesses.

I was under the impression that PDF was created as a response to postscript being "too programmable" and not "document enough" but then they decided that PDF is too minimal and so they ended up bloating it beyond what PostScript ever was by including JavaScript, Flash, and other trash in it.

It's got a 3d model viewer in it

So does TempleOS.

Are you suggesting we replace PDF with TempleOS?

It would be an improvement!

I'm waiting for someone to post a wasm TempleOS build which has been modified to open PDFs immediately

Not divine.

No, just run Adobe in a virtual machine with constrained access to other files.

Although TempleOS clearly is a divine revelation.

I’ll get IT on it STAT! Regular employees are going to love accessing PDFs in a VM

Honestly, it's possible to make VM windows show up as if they were normal programs, or you could just do things like Chrome-level internal sandboxing. There's no reason this has to be clunky.

Adobe Reader is already sandboxed a la Chrome.

According to Zerodium prices, VM escape costs as much as LPE, so it is unclear if there will be much of a security improvement beyond 2x.

Although virtual machine hypervisors don’t automatically update, unlike Adobe or Windows.

I am pretty sure that hypervisors update just like any other software, via the package manager.

I've done exactly that. Mark my words, a PDF will be the first computer program to gain full sentience.

For a walk through PDF's history and file format, check out Chas Emerick's 2018 Red Monk tech talk: "Building 100 year systems in the shadow of PDF".


I don’t know if this is true, but I’ve been told that the pdf spec at one point did/does contain some MS DOS emulation.

I’m seriously close to banning the Acrobat program for my employees; I just haven’t found a rock-solid alternative that I can trust not to implement the same dumb parts of the spec.

Pretty sure that pdf.js from Firefox is safe. At least it runs as sandboxed javascript in the browser. I believe a standalone client may exist as well.

My issue with pdf.js is that it is really bad at copying: every time I try to copy something from it, every word (and sometimes individual letters within a word) ends up on a different line. I also had issues with rendering; in some (rare) cases it showed squares instead of the actual content.

Not to mention that it is actually horrifyingly slow compared to most of the viewers that I tried.

pdf.js had its share of security vulnerabilities in the past. e.g. https://blog.mozilla.org/security/2015/08/06/firefox-exploit...

I think that chrome uses (used?) poppler, and is quite good for displaying pdfs.

no, chrome uses its own engine called pdfium.

Did PDF version 2.0 make any improvements to clarity? I think it was released last year.

Is djvu a viable alternative and if so, why isn't it used as widely as pdf?

DJVU is a raster format. It's intended for scans and for archiving printed media. It's possible to use it for documents produced digitally, but I don't think it would be a good idea.

PDF's "core" is not that bad, but the 90s "multimedia" craze turned it into a badly designed graphical application runtime.

Thanks for the disambiguation; the raster-vs-vector part is really a major difference. Is PS a viable alternative (even though it is a programming language itself)?

AFAIK, PDF is mostly a container for PS with compression and better handling of fonts (BTW, can fonts be embedded in PS? How are fonts sent to the printer?).

Still, both formats are too printing-oriented. Reading documentation in PDF on a computer screen is not especially pleasant, and it's unbearable on phones.

How do you embed fonts?

Weren’t there licensing or patent issues?

What do you mean "read"? Implemented.

Though obviously not the full thing. Nobody does that. Not even Adobe does that.

Wonder how many of these were already in use by state actors?

Awesome work!

I feel like if you’re a corporate buyer of Adobe software, you’ve got grounds for gross negligence here?

The fact that companies still have the email-> employee pc -> acrobat reader pipeline enabled says a lot about what companies really think about security, posturing aside.

(home users too, but they can plead ignorance)

The reason being that there are almost no high-profile breaches I can think of where PDF vulnerabilities have been blamed. Unlike Office macros, Flash etc etc.

I would be interested to hear the counterarguments to this.
