I remember rolling out Adobe Reader in those days and as a product, I don't believe its core has changed much. They've certainly managed to bolt on a whole lot of new features, but that can only make the position worse.
As much as this sounds like a call to kill Adobe, something needs to happen before that's feasible. For the average enterprise, Adobe Reader is far more ingrained than these products were. Case in point: in one organisation I asked whether Chrome's PDF viewer would cut it for them. One large department then ordered Adobe Professional for every user. They told me they didn't need it; they just knew I wouldn't propose removing a product they'd actually paid for.
Adobe Reader needs its HTML5 moment - an alternative that's not just "good enough for most people", but one that's actually better.
That doesn't mean there aren't any alternatives. Foxit, for example, is pretty good. But in-browser alternatives just aren't there yet.
Since it's such a quickly evolving project, I wondered where form support is up to. https://github.com/mozilla/pdf.js/issues/7613
Form support is not complete. It seems it required quite a rewrite to get the foundation in a good place to finish it off.
Are there rendering bugs? Yes. See plenty here: https://github.com/mozilla/pdf.js/issues
As for printing... browsers aren't even good at printing HTML. The best browser for printing is based on the old Opera software (PrinceXML), and Safari is probably second. Remember that Apple's display system used to be based on PDF rendering... and they do a lot with CUPS and graphic designers.
However, printing PDFs in many browsers can go directly to the printer or the OS (most of which now support rendering PDFs directly).
Does Quartz not still try to match PDF in the way its render structure is composed internally?
Googling it seems to only return links about how to fuzz JS engines.
I'd love to see a write-up of how to fuzz a JS application by doing AFL-type mutations on the server return data, etc.
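To make the idea concrete, here's a minimal sketch of the mutation half of that approach: capture a server response, randomly flip a few bits AFL-"havoc" style, and replay it to the client under test. All names here are mine and nothing below is actual AFL; a real harness would also need replay plumbing and crash/assertion detection.

```python
import random

def mutate(data: bytes, rng: random.Random, n_flips: int = 4) -> bytes:
    """Havoc-lite mutation: flip a few random bits of a captured
    server response before replaying it to the client under test."""
    out = bytearray(data)
    for _ in range(n_flips):
        i = rng.randrange(len(out))       # pick a random byte...
        out[i] ^= 1 << rng.randrange(8)   # ...and flip one of its bits
    return bytes(out)

# A captured API response serves as the seed corpus entry.
seed = b'{"user": "alice", "roles": ["admin"]}'
print(mutate(seed, random.Random(1234)))
```

Each mutated response would then be fed to the JS application while watching for exceptions, hangs, or unexpected DOM state, the same feedback loop AFL applies to native binaries.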
Actually I've never had rendering bugs with Chrome, though it's certainly happened with Firefox.
Why can't we just burn PNGs or lossless JPEGs and just use OCR / other simple machine learning for text selection? Like, I get that there are some unfortunate souls out there that need to edit CAD documents in their PDF but for 99.999% of people PDFs do one thing that websites do not:
Print reliably well given a page format like A4.
I shouldn't have to wince every time I open a PDF. They're so insecure that a no-click RCE only fetches $10k.
 Or ideally SVG, but there are some problems with fonts and licensing that I'm struggling to remember at the moment.
The general hatred for PDFs in the tech community is almost completely rooted in Adobe's initial decision to make PDF editing and creation cost $500. You have access to a document that you want to make changes to, but you can't, because it's a PDF and you don't have access to the document source, because the owner/publisher didn't provide that. It's a PDF because PDFs make documents that look the same everywhere, even when printed, which is and will remain critical to the purpose of publishing documents. Images don't solve this problem either, because you still can't edit text in an image, and now you lose the ability to be sure about how they'll print (margins, scaling, etc.).
Furthermore, images, even compressed, are significantly larger than a well-made PDF. For example, I've got a 6,700-page document of special ed student progress reports that include detailed, full-color charts and graphs of student progress with respect to goals. It's 60 MB: roughly 8.7 KiB per page.
Then again, I imagine it won't be long before someone mentions LaTeX as a viable alternative, even though the one thing LaTeX isn't is portable. But LaTeX is primarily popular in the tech community because it lets programmers pretend to write code while they're actually writing documentation. Nowhere else will you find people telling you to use a set of programs that require a build environment when someone asks about the best home office application to use. (Yes, I know that LaTeX is a typesetting language. My cynicism is that some tech people tell others to use LaTeX when they're asked what word processor someone should use.)
Edit: Clarified second paragraph.
Rude remarks notwithstanding, LaTeX and its ilk let you make PDFs, which are indeed portable. Setting up LaTeX is the same as setting up any other program, some of which are not portable either. ShareLatex.com also exists for the purpose of using LaTeX anywhere.
People recommend LaTeX because it's in another league when it comes to typesetting and rendering more niche notation. It's also not user hostile when it comes to binary files. LaTeX source files will always be readable decades later, <binary app here> makes no such guarantees.
Whether it's a viable alternative depends on whether the user wants to make a minimal learning investment or not. If they don't, google sheets > export to pdf always exists.
As for OCR, we're able to handle underlines and italics for most fonts, though I take your point on colour; if it's especially bad, they fail. Ideally it wouldn't be PNGs, it would be some stripped-down thing. Maybe even HTML with embedded CSS and images via data URIs would fit the bill, but now we're bringing in XML-esque parsers and those are garbage too. I'm just so frustrated with dealing with PDFs. They serve a billion different purposes and they're good at none of them.
Accessibility is a fair point, but for print-to-file applications we're surely at the point where OCR can at least get the text to a readable format, no?
I would imagine this is the case in most large organizations.
Why would they do that though?
The money's going to the easy places, new markets.
If you follow the work by Bret Victor & others on "explorable explanations" and interactive scientific papers, you probably appreciate the need for a self-contained format for interactive documents. Could PDF be this? I don't know, I hear the spec is too scary. But I'd say we should have something like that.
 - https://explorabl.es/
 - http://worrydream.com/ExplorableExplanations/
 - http://worrydream.com/ScientificCommunicationAsSequentialArt...
But is there really a good reason not to just keep these in the browser? I don't really know if there's much value in reading these locally. Maybe this would be a good fit for an Electron app?
* A way to save back form data. I believe Google is working on a JS API to access local files (given a few conditions).
* A way to bundle the HTML with every JS script, resource, CSS, etc., in one file, without making a huge mess.
If you had a tar.gz with an index.html inside, and the browser transparently allowed r/w access to the archive contents from contained JS scripts, this could solve a lot of use cases (heck, even "electron" apps could be replaced by this). One exception being printed documents (PostScript), at which PDF is quite good.
That behind us, there's also a matter of reliability and control. Services live much shorter lives than the data they process; given today's trends, I wouldn't expect an online-only paper to be available after 5-10 years. Having a self-contained bundle would let me archive it independently, and would prevent any third party from interfering with my reading/exploration.
The same idea made the web. Commercialization of the web was what caused the disaster.
More accurately stated as "we sandboxed it, so anything discovered is less likely to be critical." https://www.adobe.com/devnet-docs/acrobatetk/tools/AppSec/sa...
I've heard a variant of that talk delivered by a non-C-level at an appsec/prodsec-focused conference where the rehashed quote above (though I'm blatantly paraphrasing) was the justification used. Something more closely reflecting the truth might be "we can't realistically tackle the many security defects in Acrobat and Flash, so we sandboxed both applications instead to generally reduce the technical risks posed by any vulnerabilities in code."
Every CVE exposed by outside third parties like this is a stain on their software quality and reputation, IMO.
Since many people are using a PDF reader to read PDFs from relatively untrusted sources, do yourself a favor and at least use a reader that does not have full system access.
macOS: Preview.app (uses macOS sandboxing)
Linux: Evince Flatpak on Wayland (Flatpak uses sandboxing. Wayland because X11 apps can read all keystrokes, mouse events, do screengrabs.)
Windows: no clue
All platforms: in-browser PDF reader with a browser that sandboxes.
Flatpak is providing the actual application sandboxing, but being allowed to talk to the X server is a huge amount of privilege that can't really be restricted.
I think UWP apps are sandboxed by default, so something like Xodo PDF could be a possibility.
Edge, Firefox and Chrome have built-in PDF readers.
For more control, sites can self-embed pdf.js so no external reader is required.
Addendum: Most links in the bottom row don't work anymore. Needs updating.
Platform (what's that?): GNU (isn't that some kind of African animal?) Linux (oh, I know that one, it's the cute penguin!)
The rest of the text in those boxes is mostly techno-babble for non-tech users (Gnome? KDE? DjVu?!??)
I understand the intent behind it, but it would only serve a very small niche of users, who can already fend for themselves.
Everyone else would go like: PDF? Ah, that's Adobe!
One of Adobe's early talking points for the value of PDFs was that they would "look the same on all systems". Of course some context is necessary. PDF first appeared in 1993. In 1993, while the internet did exist, most individuals who were not associated with a university, research lab, or govt. agency had no access to 'the internet'.
As well, the computing world was much more diverse. One had DOS, early Windows, and various Mac OS variants all coexisting, and one had numerous different variants of Unix on the numerous different RISC workstations in existence. And, here was the big deal, 'documents' created on each of these systems were to a large extent incompatible with each other. In this context, 'document' should be thought of as "a file used to create paper printouts" as opposed to what we think of as a 'document' now in 2018. There was some compatibility, in that Windows systems would, sometimes, read 'documents' produced by DOS-based word processors, and of course the lowest common denominator, the plain text file, was 'almost' compatible (line-ending differences were the biggest incompatibility). But for anything more complicated, if person X created a 'document' on DOS, and they wanted person Y, using SunOS, to see a version that "looked the same", their best bet was to print their document to paper and give Y the printer output. Because if they could send the electronic file to Y somehow, chances were that Y would be unable to open it, and even if they could, there was a good chance that it did not 'look the same' (at a 'looks like the same paper printout' level of same).
PDF came about in this world where paper was still king, and Adobe's marketing of "looks the same" was really meant as "produces the same paper printout for the receiver Y as it does for creator X". That is why, today in 2018, viewing a PDF still looks like viewing a WYSIWYG version of a paper printout. PDF is, quite intimately, tied to the concept that there are discrete sheets of paper it is formatting data onto. Yes, some viewers do provide an 'almost' HTML-like continuous scroll, but that is done 100% in the viewer; the underlying PDF format is very paper-page oriented at its core.
So, when comparing PDF intent to web page intent, the phrase "looks the same" has different meanings. PDF was designed such that "looks the same" means a paper printout looks identical to the original, and that the designer/creator has full control over the look while the viewer has none. For web pages, "looks the same" is far less strict, and really not the same meaning at all, because the web was always intended to allow the viewer much freedom in deciding how to display the HTML content, taking away the designer's ability to strictly determine look and presentation. The result is that HTML was never meant to "look the same" with the strictness intended by PDF.
I remember saving .mht files with IE as a kid when working on assignments so I could disconnect the dialup and give my parents their phone line back :)
and if index.html refers to other files, they are read from the same zip, even if they exist only inside the zip.
I used it a lot for local archives of bigger content; it is amazingly convenient, and I'm sad that the same approach was not used anywhere else.
It's not trivial to get right on the security side (the zip implementation has to be robust, the URL handling too), but it's doable, and it would be very practical to have.
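The URL-handling half of that is the part people usually get wrong. Here's a minimal sketch (all names are mine, and the toy bundle stands in for a real archive) of resolving a relative reference from index.html against the archive while rejecting paths that escape it:

```python
import io
import posixpath
import zipfile

def make_bundle() -> bytes:
    """A toy web bundle: index.html plus the asset it references."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("index.html", '<link href="css/style.css">')
        zf.writestr("css/style.css", "body { margin: 0; }")
    return buf.getvalue()

def read_from_bundle(bundle: bytes, base: str, url: str) -> bytes:
    """Resolve a relative URL against the referring document's path
    and read it from the archive, refusing anything that escapes."""
    target = posixpath.normpath(posixpath.join(posixpath.dirname(base), url))
    if target.startswith(".."):
        raise ValueError("path escapes the archive")
    with zipfile.ZipFile(io.BytesIO(bundle)) as zf:
        return zf.read(target)

bundle = make_bundle()
print(read_from_bundle(bundle, "index.html", "css/style.css"))
```

A real implementation would also have to handle percent-encoding, absolute paths, and zip quirks like duplicate entry names, which is where the "robust implementation" requirement bites.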
Tangentially, a good thing about the zip format is that it has a so-called "central directory", which means you don't even have to load the whole archive if not all the data is needed: just the last part of the file, and from there you get the offset and location of the needed file. So zip files could work beautifully with HTTP range requests when they are huge (1). The small ones are most efficient to download at once, of course.
1) I've actually done such a sequence by hand a few times when I had a slow connection and knew I didn't need the whole zip file, just to see that all the files were inside: I made a range request for the end of the file, large enough for the estimated number of entries, and so I had the list of all the files in the archive without needing to download the whole thing. I reconstructed a file of the same size, left the rest as zeroes, and some of the zip tools I used treated the archive directory exactly as I needed.
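That by-hand trick can be reproduced with the standard library. The sketch below (names mine; an in-memory zip stands in for the remote file, and the tail slice stands in for the HTTP Range response) rebuilds a same-size file that is zeroes except for the real tail, then lets the zip reader find the central directory at the end:

```python
import io
import zipfile

def build_sample_zip() -> bytes:
    """An in-memory zip standing in for a large remote archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in ("index.html", "style.css", "app.js"):
            zf.writestr(name, "x" * 1000)
    return buf.getvalue()

def list_names_from_tail(total_size: int, tail: bytes) -> list:
    """Reconstruct a sparse file: zeroes for the body, real bytes only
    at the end, then read the central directory as usual."""
    sparse = b"\x00" * (total_size - len(tail)) + tail
    with zipfile.ZipFile(io.BytesIO(sparse)) as zf:
        return zf.namelist()

data = build_sample_zip()
# Pretend we issued "Range: bytes=-1024" and got only the last KiB.
tail = data[-1024:]
print(list_names_from_tail(len(data), tail))
```

Listing works because only the end-of-central-directory record and the central directory itself are read; actually extracting a member would then need one more range request at the offset the directory gives you.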
It is interesting how the older web got some things right, though; now it's 2018, and those ideas, which one would think should be robust by now, aren't even there.
Out of every archive format a pathological case can be constructed, just as it can from relative file names etc., but such attempts can simply be rejected during processing once some thresholds are reached. The original article demonstrates that a JPG reading implementation can be bad enough, and the same can be said for every format, even text-based ones. It simply has to be done right (including fuzzing at the end).
What is one to do?
Surely the obvious answer is to ringfence PDF (or another new format) to the most basic features. These could more easily be handled by 3rd-party apps, both securely and with correct rendering. Let Adobe do whatever they want with their own format, adding loads of stuff people don't want; then the sell is harder for them:
Get a cheaper, safer app for writing portable docs which can do most things or pay more money for a very insecure format that does stuff you don't need.
I assume that others have attempted at some point to make an OSS alternative to PDF and I'm guessing it hasn't worked yet?
You need to understand a certain amount of "rules" around each API call, and while you can duplicate their normal usage, there's a certain amount of thought that has to go into it.
Does anyone on HN have any experience to report with this?
We're always looking for companies and security researchers that want to fuzz but don't have the time/knowledge on how to do so (we automate a lot of the set up process and integrate nicely into your GitHub workflow) - drop me a line if you're interested - firstname.lastname@example.org
 - https://fuzzbuzz.io
 - https://github.com/google/oss-fuzz
Git total loc: 279,993
libgit2 total loc: 219,887
Git CVEs (so far): https://www.cvedetails.com/vulnerability-list/vendor_id-4008...
libgit2 CVEs (so far): https://www.cvedetails.com/vulnerability-list/vendor_id-1606...
I had a much better experience with Sumatra on windows and Zathura on Linux where my documents open almost instantly.
Interestingly enough I can't fill it out in Firefox, but I can with Preview.app. Running pdfinfo -js yielded some script, but it basically only looks like it's there as a gatekeeper so that you don't open the file with an older version of Reader. Is there more JS in there that pdfinfo can't extract?
Still, it's interesting to have something like less for PDFs!
Zathura does have some vim keybinds but other than that I can't see any similarities.
This is especially apparent when trying to edit arbitrary PDF files, which is sometimes not so easy or even impossible. Just the definition of fonts and the text layout is already so complicated that this is the logical consequence.
But perhaps the format has simply grown and led to additional requirements such as PDF/A, PDF/X, PDF/E and now PDF 2.0, the next standard that makes everything even more complex... Will this ever stop?
PDF has the semantics of a digital print that is resolution-independent and supports copypaste and search (mostly by mapping glyphs back to text).
In addition to resolution independence being higher-level than strictly "digital print", the ability to capture transparency is another such higher-level feature.
From the above perspective, PDF peaked at 1.4, when it got transparency support. Supporting roughly the PDF 1.4 feature set was what allowed the Mac Preview app to be good enough for Mac users that Apple could stop bundling Acrobat Reader with Macs.
After 1.4, PDF has gotten better compression algorithms that don't really change what the format is about. PDF/A and PDF/X fit well the notion of PDF as "digital print".
But Adobe has been trying to leverage Acrobat/PDF to other areas that don't fit the notion of "digital print". These include pre-Macromedia acquisition attempts to make PDFs a more dynamic platform and later inclusion of 3D models in PDFs. Other PDF viewers still work for users most of the time without this stuff, which is a signal of what PDF really is to users ("digital print").
(Filling in paper-like forms, while not true to the notion that PDF is a final-form format, sort of makes sense from the point of view of digital paper, though.)
While that certainly does play in Adobe's favor, the complexity of the spec is also what you get when, over time, new features, some never even envisioned by the original creators, are bolted on to keep the whole thing "relevant" and/or to add new "features" that keep it from becoming obsolete.
We can certainly argue whether the addition of different features was worth the complexity increase, but simply taking an existing system and bolting on the latest "hotness" to use to add to the checklist of "why one should upgrade" features also produces similar levels of complexity.
So some of the complexity increase is merely the fact that the pdf spec. has been evolved to do things it was likely never designed to do in the first place.
Or PSD, for that matter.
On the other hand, the Office file formats (especially Word) have many un- or underspecified cases.
 The only one I know of is finding the end of compressed inline image data.
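For anyone who hasn't hit this: an inline image is written `BI <dict> ID <raw bytes> EI`, and the spec gives no length for the raw bytes, so a parser must scan for the `EI` keyword, which can also occur by accident inside compressed data. A minimal sketch of the usual heuristic (names mine; real parsers layer on further checks, and even then it's ambiguous):

```python
def find_inline_image_end(data: bytes, start: int) -> int:
    """Scan for an "EI" token delimited by PDF whitespace. Because the
    bytes between ID and EI may be raw compressed data, "EI" can occur
    by accident; the whitespace check is only a heuristic, which is
    exactly the underspecification at issue."""
    whitespace = b" \t\r\n\f\x00"
    pos = start
    while True:
        pos = data.find(b"EI", pos)
        if pos == -1:
            raise ValueError("unterminated inline image")
        before = data[pos - 1:pos]
        after = data[pos + 2:pos + 3]
        if before in whitespace and (after == b"" or after in whitespace):
            return pos
        pos += 2

# The accidental "EI" inside the binary data (not whitespace-delimited)
# is skipped; the properly delimited one is found.
content = b"BI /W 2 /H 2 ID \x07EIx\x99 EI Q"
print(find_inline_image_end(content, 16))  # -> 22
```

Nothing stops compressed data from containing a whitespace-delimited `EI` too, which is why later revisions encourage an explicit `/L` (length) entry, and why different readers still disagree on malformed files.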
Regarding Reader, I work with PDFs a lot, and the majority of issues have a fairly common pattern. The supplier has created a PDF in a 3rd party tool, which is invalid in a subtle way (production printers in particular are very specific about what they want to accept).
But it works fine in Adobe Reader, since it was built to be very tolerant in what it accepts, so it's often hard to convince the non-technical users that the file has an issue. It's great for end users but has meant that a lot of tools out there just didn't have to try too hard to make PDFs that mostly work, so programming workflows can be an issue.
The advantage of the Office formats is that they are Zip files with a ton of XML, i.e. they are well defined. The application parts are another matter, of course.
Like every bit of business software, there’s a load of stuff that shouldn’t be in there. It’s a really flexible container format though, and every one of these features went in because there was a need. Times change, things change and it could do with a tidy up, but it’s probably impossible without breaking everything for a load of businesses.
Although TempleOS clearly is a divine revelation.
Although virtual machine hypervisors don’t automatically update, unlike Adobe or Windows.
I’m seriously close to banning the Acrobat program for my employees; I just haven’t found a rock-solid alternative that I can trust not to implement the same dumb parts of the spec.
Not to mention that it is actually horrifyingly slow compared to most of the viewers that I tried.
PDF "core" is not that bad, but 90s "multimedia" craze turned it into badly designed graphical application runtime.
Still, both formats are too much printing-oriented. Reading documentation in PDF on computer screen is not especially pleasant, and unbearable on phones.
Though obviously not the full thing. Nobody does that. Not even Adobe does that.
(home users too, but they can plead ignorance)