How fast is PDF.js? (mozilla.org)
235 points by rnyman 1084 days ago | 144 comments



Please do a Google search for

    building plan filetype:pdf 
(edit: I actually don't know the search terms that would return real architectural plans rather than text documents. I guess Google somehow ranks PDFs with a lot of text much higher than documents with drawings, but the drawings are exactly what I want to point to! If it's true that Google prefers "a lot of text" documents, then googling for examples is a good way to miss most of the documents with the real problems!)

Then try to actually look at every page of such documents with pdf.js. The first I get is

http://www.nist.gov/el/nzertf/upload/NZERTF-Architectural-Pl...

Try it. You will want to throw your fast computer through the window. Then, when you manage to calm down, you'll try to configure your browser to never invoke pdf.js again, if you know that you need to work with such documents.

If you want a realistic benchmark, compare the speed of rendering these documents with Adobe's or some other native renderer.

I'm not a building architect but at least I don't live in a "we don't draw anything significant in our documents" bubble. I know they worry about their potential customers. Forced pdf.js is a huge setback for them. If they could tell the browser in the html "please don't use pdf.js for this one" they would be much happier.


I just profiled your example: http://people.mozilla.org/~bgirard/cleopatra/#report=4b995cc...

Looks like it could be running much faster:

* 20% of the time is spent copying the canvas because someone is, likely erroneously, holding a reference to the canvas. Looking into it: https://bugzilla.mozilla.org/show_bug.cgi?id=1007897

* 10% of the time is spent waiting on display transactions swaps because canvas isn't triple buffered.

* PDF.js is not getting empty transaction (canvas draw optimizations).

That's just from a quick profile. I'm sure there's a ton more things that could be improved.


Thanks! Good work! It's of much more benefit to analyze slow pdfs than to construct "proofs" that short and simple pdfs are displayed "fast enough" (but I understand that the latter feels so good). Especially since it seems that not only pdf.js will benefit from analyzing the slow ones -- if I understand correctly, you traced the problems to the cpp code of the browser? That would mean that even non-pdf.js stuff will be faster once it's fixed, is that correct?


Yes the fix to bug 1007897 will help all web content.

That profile will show calls from JS and how they call into native C++ code. This is very useful for profiling things like canvas.

PDF.js isn't bottlenecked by javascript performance, like many well tuned apps out there, so there's a lot of improvements that can be made by tweaking the web platform.


Does your profile explain why large PDFs are so laggy when scrolling in PDF.js? For example take http://www.math.mtu.edu/~msgocken/pdebook2/mapletut2.pdf, open it in FF and hold the page down or page up key. On my laptop, that will lock up FF. Yet, if I open it in Acrobat Reader or Chrome, I can scroll up and down much faster without the jerky behavior.

Is it even possible for a JavaScript app in FF to get the kind of performance Google gets with their pdf plugin? With the power of today's PCs, it seems like something is seriously wrong with "web technology" if my machine struggles to render pdf documents.


Wonderful! You diagnose so fast it looks like magic! Do you see anything new when profiling this one:

http://www.engageny.org/sites/default/files/resource/attachm...

and

http://www.math.mtu.edu/~msgocken/pdebook2/mapletut2.pdf


http://people.mozilla.org/~bgirard/cleopatra/#report=a848ab9...

The first PDF.js document uses the DOM instead of Canvas like the previous one. This is done to support text selection. Most of the time is spent in the style system. I don't know that area very well but in the past I've seen simplifying CSS selectors make all the difference. I know a fairly important problem for B2G is speeding up expensive style flushes (https://bugzilla.mozilla.org/show_bug.cgi?id=931668) but I don't know enough about CSS to know if that fix will solve the problem here.


A little OT, but since you're so good at profiling Firefox, I have one more interesting "a lot of real work" page that will maybe inspire you or somebody you know:

http://bellard.org/jslinux/

This emulates in JavaScript the x86 and necessary hardware to really boot Linux 2.6.20(!) On my computer, Opera 12.17 will show "booted in 2.8 seconds," versus Firefox 29 on which it will be "booted in 7.9 seconds." That's 2.8 times slower.


Here's a profile: http://people.mozilla.org/~bgirard/cleopatra/#report=0705c03...

I get between 5-7 seconds here.

Looks like getaliasedvar is causing excessive bailouts from Ion (3rd tier JIT). On top of that the platform is trying to synchronously cancel background Ion Jitting and that is taking an excessive amount of time. Reported as: https://bugzilla.mozilla.org/show_bug.cgi?id=1007927

Tweaking the functions listed in the profile to avoid them bailing out should drastically improve this test case.


And at the opposite side of Bellard's useful code, I've also observed that a simple loop which just does the summation of the doubles like this

    var s = 0.01
    for ( var i = 0; i < 100000000; i++ )
        s += 0.1
    print( s )

became around twice as slow since some version of Firefox (of course, before that point there were a lot of speedups, very old FF can't be compared with the present state).

Still, really the biggest problems I know of at the moment are those PDFs that the architects produce.


That loop doesn't actually do anything, benchmarking it is pretty much meaningless.

It's important to have benchmarks that aren't trivially converted to no-ops or constant loads by the compiler. (In practice the JIT might not be optimizing that one out, but an aggressive C++ compiler certainly would as long as fast math is enabled - so at some point, a typical JS JIT will too).

Also ensure that you're benchmarking warmed code that has been fully jitted. JS code (other than asm.js in Firefox) has multiple stages of compilation in modern runtimes, typically triggered based on how often it is called.
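
A minimal sketch of the two points above, assuming a plain JavaScript shell or browser console. The names `sumDoubles` and `benchmark` are hypothetical, and the warm-up and run counts are arbitrary choices rather than thresholds documented by any JIT. The workload returns its result and the harness folds every result into a checksum, so the engine has no grounds to treat the loop as dead code.

```javascript
// Hypothetical workload: a summation loop whose result depends on its argument.
function sumDoubles(iterations) {
  var s = 0.01;
  for (var i = 0; i < iterations; i++) {
    s += 0.1;
  }
  return s;
}

// Calls fn untimed `warmupRuns` times so the JIT has a chance to compile it,
// then times `runs` further calls. Every result is added to a checksum that
// is returned, so the computation cannot be discarded as unused.
function benchmark(fn, arg, warmupRuns, runs) {
  var checksum = 0;
  for (var w = 0; w < warmupRuns; w++) {
    checksum += fn(arg);
  }
  var timesMs = [];
  for (var r = 0; r < runs; r++) {
    var start = Date.now();
    checksum += fn(arg);
    timesMs.push(Date.now() - start);
  }
  timesMs.sort(function (a, b) { return a - b; });
  return {
    medianMs: timesMs[Math.floor(timesMs.length / 2)],
    checksum: checksum
  };
}

var report = benchmark(sumDoubles, 10000000, 5, 11);
console.log("median ms:", report.medianMs, "checksum:", report.checksum);
```

`Date.now()` only has millisecond resolution, so the workload has to be large enough that a single call takes clearly longer than that.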


You are wrong. The last line has the meaning of displaying the result to the user (you are supposed to implement it there, I'm lazy. The same goes for prior warm-up, I don't have to specify it here, I just show the loop). Because the result has to be shown, the browser is certainly not allowed to optimize away the calculation. Second, it's not allowed to replace it with a multiplication, as it's floating point arithmetic and the binary representation of the constants involved is not "nice", and the same holds for the partial results too. Do compare the result with the multiplication to get the idea (10000000.01 vs 9999999.99112945). All the additions have to be performed one way or another between the loading of the js and the displaying of the result. So it is a good measure of the quality of the translation from the js to the machine code which does the actual calculation, and because it's very simple it can also easily point to unnecessary overheads. The regression I observed is therefore a real one, probably observable in other scenarios but harder to pinpoint there, and probably avoidable, as the better results did exist once. (Of course, if it were part of some widely popular benchmark, cheats would probably be developed, but at the moment there aren't any. Once anybody implements a "we don't care for numerics" optimization, it of course should not be used anymore to assess the quality of JS.)
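
The point about multiplication can be checked directly. A small sketch, with `sumBySteps` as a hypothetical helper name: each `+=` rounds to the nearest double, so the looped sum and the closed-form product need not agree.

```javascript
// Adds `step` to `start` n times, with one rounded addition per iteration.
function sumBySteps(start, step, n) {
  var s = start;
  for (var i = 0; i < n; i++) {
    s += step;
  }
  return s;
}

// Small case: ten additions of 0.1 do not land exactly on 1,
// while the single multiplication 10 * 0.1 does.
var looped = sumBySteps(0, 0.1, 10);
var multiplied = 0 + 10 * 0.1;
console.log(looped === multiplied, looped, multiplied);

// The loop from the comment next to its closed-form counterpart.
var n = 100000000;
console.log(sumBySteps(0.01, 0.1, n), 0.01 + n * 0.1);
```

An engine that rewrote the loop as a multiplication would change the printed value, which is why that rewrite is not a legal optimization under IEEE 754 semantics.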


> A little OT, but since you're so good at profiling Firefox

Heh, I believe bgirard wrote the Firefox profiler. He's the best there is :)


I'd say, he deserves a raise! Anyway, maybe you should really add the

http://www.nist.gov/el/nzertf/upload/NZERTF-Architectural-Pl...

to your test suites and consider it a worthy goal as it really represents a lot of documents typical for the users who produce complex plans. Have you tried it?

It's immediate in Adobe Reader (as in one second) and takes several minutes in pdf.js in Firefox 29.


Yeah, that one is kind of sluggish for me too. Usable, but not smooth.

I filed an issue on pdf.js here: https://github.com/mozilla/pdf.js/issues/4761


Thanks! Finally, a Google search term that will give you a lot of slow PDFs:

     controller site:automationdirect.com filetype:pdf
based on a hint from Gracana's comment.

You have to try to actually see every page in any of the PDFs to get the idea of pain.


I just tried the mentioned document in the viewer on http://mozilla.github.io/pdf.js/web/viewer.html

I agree that the speed is a lot worse than in Preview.app for example, but it's also not unusable. I will look at it in more detail tomorrow


Try to move through the pages, don't stay on the first one! I have the 22nm i7 CPU here and compared to the work with Adobe PDF reader it's just horrible having to use pdf.js.

I must however admit that I'm not able to easily construct a Google search for more such documents, but I know a lot of people who work only with such -- they just can't work with pdf.js.

Does your benchmark measure the time to actually display everything on every page (what the human looking at all the pages must do), or just the time until the browser is responsive?

Edit: inspired from other comments, it seems that at least search for "math plots filetype:pdf" returns more guaranteed problems like

http://www.engageny.org/sites/default/files/resource/attachm...

Still it's hard to find slow PDFs with certainty just by using Google.


> Does your benchmark measure the time to actually display everything on every page (what the human looking at all the pages must do), or just the time until the browser is responsive?

It "only" benchmarks the rendering, all the overhead the viewer produces is not shown. That is intentional, as we will create our own viewer anyway

BTW: PDF.js in FF is typically slower than in Opera / Chrome, all these benchmarks used Opera


I tried loading that and it worked pretty well for me. Not fully native speed, but just a couple seconds to draw complex pages. Certainly not a "throw the computer out the window" experience.


It's really minutes to see all the drawings, i7, Win 8.1, FF 29 (to compare the CPU speeds, jslinux boots in 7.9 secs on my computer). Adobe Reader is immediate to see all the 20 pages.

Have you tried to see all the pages, all the drawings? How many minutes did you need, on which setup? Are you using Firefox? Does it use pdf.js? Is it something OS dependent, or did you just not look at all the pages?

Now I see, Gracana mentions manuals on AutomationDirect.com. Look at this one for example:

http://www.automationdirect.com/static/specs/dl0506select.pd...


I'm on firefox nightly, using pdf.js, windows. The first pdf takes about 4-5 seconds to render a page, and it always does the page I have onscreen so it doesn't matter how many pages there are.


When you measure only the first page you will completely miss a lot of problematic PDFs.


I looked at all the pages, picking them at random. It always did the one I had onscreen in seconds.


So about two or three orders of magnitude slower than native.


Apples and oranges. This is a young implementation and nothing considers its performance when generating a PDF. Until it's had significant optimization you can't draw very good conclusions about html+js vs. native.


This is wrong because PDF.js is Firefox's default way of presenting PDFs. If it's a young, unoptimized code base, then maybe Firefox shouldn't make it the default. It is an apples-to-apples comparison here because it's the default, and how young it may be is irrelevant to the user experience.


It's a very dangerous attitude "I don't read PDF's and I don't care but the customers who read them should tolerate our young poor implementation that needs more than two minutes for just 15 pages."

(The documents like this http://www.automationdirect.com/static/specs/dl0506select.pd... )

They won't. As soon as you prevent them doing their job (and it's so if they used their PDF's normally before you changed the defaults) they will have to search for the solution. The solution is either switch the handler (still a little better for browser writers) or the browser.

By making opening and reading a 15-page PDF take two minutes when it used to be instantaneous, you prevent them doing their job (the slowdown from subjectively 0 seconds to minutes is also subjectively an infinitely worse experience!) and they must respond. They can't open just the first page. They actually care. They need all the pages.


Why do you keep acting like a document cannot be interacted with until every single page is fully rendered? Especially when it goes out of order to get the onscreen pages ready first.


When you need some information from a 15-page document you don't think "I know I need the 9th page." You look at one page after another. You need 0 seconds for that with a native renderer (you can't observe that you wait) and you need several minutes with pdf.js -- infinitely longer, enough to not use it.


>infinitely

It takes time to get to the PDF. It takes time to transfer the file if it's of any size. It takes time to look at the pages.

Also you can skim pages that don't have 100% of the drawings on them yet.


PDF.js vs. Adobe Reader is a reasonable comparison to make right now.

PDF.js as representative of html+js viewers vs. 'native' is completely unfair unless you compare to a similarly unoptimized native viewer.


I made some big improvements to pdf.js's speed and memory usage a few months ago, particularly for black and white scanned images -- for one 226 page document I saw ~8x rendering speed-ups and ~10x memory usage reductions.

Details are here: https://blog.mozilla.org/nnethercote/2014/02/07/a-slimmer-an.... These are present in Firefox 29 which only just came out, so if you're seeing bad performance and you're still on 28 or earlier, an upgrade might help.

If you have particular PDFs that cause pdf.js to run slowly, please file bugs about them at bugzilla.mozilla.org under the "Firefox" product and the "PDF Viewer" component. PDF is a large, complicated format, so there are many corner cases that the pdf.js developers are unlikely to see without help. Every improvement I made came about from profiling PDFs that people had reported as causing problems.


Since you've done some work on it, I'll assume you understand the architecture well. My biggest gripe with pdf.js is that resizing triggers a full page reload, since it has to redraw the canvas. Whereas this is instant on other pdf readers. Would there be any workarounds for this? Or will we be stuck with this behaviour till "pdf.js 2.0".

Edit: This is an example of what I'm talking about https://github.com/mozilla/pdf.js/issues/3853


> Since you've done some work on it, I'll assume you understand the architecture well.

Alas, no :) My work was basically a few surgical strikes in which I learned a lot about a few small corners, but my overall understanding of the code base is pretty weak.

I recommend filing a new issue here: https://github.com/mozilla/pdf.js/issues/new


Experimenting with SVG might be a good option here. Or maybe, implementing a mixture of canvas with SVG, might improve the performance.


Reporting bugs in Firefox is intimidating to say the least. Will a bug filed at GitHub like https://github.com/mozilla/pdf.js/issues/4761 still reach the right place?

bgirard's profiling work in this thread has been amazing, but I don't think I have the knowledge to interpret the profiler results well enough to file bugs on slow PDFs. A way to post "slow" PDFs to be picked up by pdf.js developers would help.


> Will a bug filed at GitHub like https://github.com/mozilla/pdf.js/issues/4761 still reach the right place?

Filing at Bugzilla will probably give you a wider audience. This can be useful if the underlying perf problem is with Firefox rather than pdf.js. Don't worry about getting some of the Bugzilla fields wrong, it happens all the time and people don't mind. A good description of the problem (with steps to reproduce) is the most important thing.

If you're still intimidated, filing something with pdf.js's GitHub tracker is a lot better than nothing.


Filing nothing will not be helpful for sure. Creating a report in any related bug reporting system will bring developer's, QA's or manager's attention. Just provide enough information, so somebody else except you can reproduce the problem.


I <3 your work on the JBIG2 decoder!

When I started to work on JBIG2, it was kind of irritating as you were often one week ahead of me. When I started to make some optimizations I would see a PR from you with the exact same things. :p

It's really nice to contribute to PDF.js. Overall between your work and Opera's contributions the font caching, images decoders and color conversions got significantly faster! And there is still a lot we can do.


Heh, sorry for the overlap :)


I have Firefox 29 and those pages rendered fairly fast for me. Maybe a couple of seconds per first viewing of a page, worst case.


Which pages do you mean by "those pages"?


Whatever he means, there are also a lot of slow PDF's when googling:

   controller site:automationdirect.com filetype:pdf


It's fine for short documents, but not very pleasant for long math papers. PDF.js takes ~1 second to render a single page in my thesis. In Document Viewer (evince), the delay to render a page is barely perceptible (so I would guess < 0.1 seconds), and Adobe Reader renders pages instantaneously.


I wish it only took a second to render some of the things I've opened. I've found a lot of the manuals from Automation Direct are very slow to open and navigate. For most things I ignore pdf.js and save files to desktop so I can open them with a "proper" viewer.


I don't know if I'm the only one to have this problem but it also shows = signs as - signs.

(it doesn't always happen and I can't find a pdf where it does that now)



PDF.js is really nice and awesome, but for me it doesn't work for documents with a file size of 11MB. It was a catalogue and to finally get the thing to render faster I used pdf2htmlEX (html5) which slimmed the 11MB file down to about 2MB with no visible quality loss and text would still be perfectly selectable, even in old and crappy browsers like IE7. I'm happily using both on a customers site. The 11MB file rendered good enough on my Client's new PC, but they have an i7 ;) My own box is really slow :/ (saving for a macbook pro)

Currently I was about to write a service to autoconvert PDF documents uploaded via ownCloud into the said HTML5 document, but it seems that even though I've ssh, this server is managed and doesn't come with make and also has no poppler or fontforge libraries available. Meh :(

I'm thinking about compiling the binary with --prefix=~/.local/lib and copying the dependencies I've found to that directory on the server. Hope that works, otherwise I'll need to write an API for pdf2htmlEX on a server where I've root and upload modified & new pdf files to it, then wait for it to finish the html5 conversion and download the files to the right directory using curl. That'd be much more work than just dropping the binary onto the server and executing the service as a cron job for modified or new files.

Any ideas on how to solve this cleverly?


You should probably report this. Most of the test cases they used where PDF.js is too slow involve files so large that even Adobe Reader can't render them quickly, or files that are heavily damaged.


I've reported it and uploaded the pdf, so someone on IRC responsible could debug it, but we didn't get far enough to fix it. Otherwise I wouldn't just add another PDF Viewer to a site. But it's only the downloads section and the preview there really is not for reading, but looking if it's the right document. Anyway, thanks for caring for the project. There's a reason Mozilla chose PDF.js over writing it in C++ and that's not just because of security I think.


Filesize has nothing to do with speed most of the time ... Try the PDF Reference document [1], it's pretty fast.

The project is very active: in the three months that we investigated it, many of the pdfs that were unusable when we started (e.g. NASA's budget report, one of the worst written PDFs I have seen so far) are now fast enough. Also almost all the effort that Opera puts into the project is about performance.

I suggest you check the file again in PDF.js and report the PDF so somebody can look at it.

[1] http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdf...


I used the latest version from git and also stable releases. It would take about 3min to render (for this 11MB file only, all other pdfs, albeit smaller, load fast). It was the only PDF that was slow, all others loaded almost instantly, however the other files were much smaller too.


Here is telemetry data from beta users about the size of PDFs http://telemetry.mozilla.org/#beta/29/PDF_VIEWER_DOCUMENT_SI... -- typical PDF size is about 200KB, and 11MB is less than 20% of them.


Minor performance tip: you can disable the creation of DOM elements backing text and whatnot in PDF.js; this allows for faster rendering on some browsers.

You lose easy search, but it can make your life a lot easier if you don't need them.


You also lose text select (and thus copy+paste)


But you can approximately restore that if you use Project Naptha as well... urgh... please nobody do this.


We realize that the text layer is sometimes a performance problem. We have already pushed a lot of improvements to it and there is more to come.


Not your all's fault...the nature of the beast is that a PDF, in the pathological case, could result in thousands and thousands of individually-styled DOM elements per page, right?

Not much to be done in that case.


Is it easy to do that?


Yes, you can build your own personalized extension for Firefox, Chrome or Opera from the PDF.js source code, which will disable the text layer. (see also the #textLayer=off feature at https://github.com/mozilla/pdf.js/wiki/Debugging-pdf.js#url-...)
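
For trying the `#textLayer=off` switch without building anything, a sketch along these lines may be enough. It assumes the viewer takes the document through its `file` query parameter and honors the hash flag from the debugging wiki; `viewerUrlWithoutTextLayer` is a hypothetical helper name, whether a given viewer build still accepts the flag is not guaranteed, and a hosted viewer may refuse PDFs from other origins.

```javascript
// Builds a viewer URL that asks the PDF.js viewer to skip the text layer.
// viewerUrl: address of a hosted viewer.html
// pdfUrl:    address of the PDF to open, passed via the `file` parameter
function viewerUrlWithoutTextLayer(viewerUrl, pdfUrl) {
  return viewerUrl + "?file=" + encodeURIComponent(pdfUrl) + "#textLayer=off";
}

var url = viewerUrlWithoutTextLayer(
  "http://mozilla.github.io/pdf.js/web/viewer.html",
  "http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-52.pdf"
);
console.log(url);
```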


Memory suggests yes (I think it's just a render param you pass in or something?) but I'll need to check old code tonight to give you a definitive answer.


This is probably the worst benchmark I ever read. Talk about fitting the data to your desired predetermined result.

I mean, only checking "popular" PDFs? What about all the people that work with PDFs and are sent custom files for their enterprise/office/design agency etc, and not "popular" stuff. Has he checked what the shape of the distribution curve for popular vs "long tail" PDFs is?

Second, using the time to load their intro page as a baseline? How's that relevant? Just because he has this arbitrary subjective idea that "up to 3x that time gives a good user experience"? How about comparing to the time Acrobat/Preview take?

Lastly, it just measures loading the pdf? How about going through the pages?


The code of the benchmark is open source. Go check how it works and by all means send a pull request. Any improvement there will shift focus to where it matters most and help every user of PDF.js (in Firefox itself and in the various extensions and services built on top of PDF.js).

The benchmark measures 5x the rendering time of every single page of every single PDF.
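
As a rough illustration of how five timings per page could be turned into a histogram relative to a baseline page, here is a sketch. It is not the benchmark's actual code: `meanMs` and `relativeHistogram` are hypothetical names, the bucketing rule is an assumption, and the timings are made up purely for illustration.

```javascript
// Averages the repeated timings recorded for one page.
function meanMs(samples) {
  var total = 0;
  for (var i = 0; i < samples.length; i++) {
    total += samples[i];
  }
  return total / samples.length;
}

// Buckets pages by how many times slower than the baseline they rendered.
// pageSamples: array of per-page sample arrays (for example five timings each)
// baselineMs:  mean render time of the reference page
function relativeHistogram(pageSamples, baselineMs) {
  var histogram = {};
  pageSamples.forEach(function (samples) {
    var ratio = meanMs(samples) / baselineMs;
    var bucket = Math.ceil(ratio); // bucket k holds ratios in (k-1, k]
    histogram[bucket] = (histogram[bucket] || 0) + 1;
  });
  return histogram;
}

// Made-up timings in milliseconds, not measurements.
var samples = [
  [40, 42, 38, 41, 39],
  [120, 118, 122, 121, 119],
  [95, 105, 100, 98, 102]
];
var histogram = relativeHistogram(samples, 50);
console.log(histogram);
```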

The problem with benchmarking the "not popular" PDFs is that they're not available outside of their enterprise/office/design agency etc... But if you have any you can share publicly, please file an issue on the Github repository of PDF.js

Some people mentioned Mozilla's Telemetry. Unfortunately it is too limited for this kind of research as it can only report enum values. Using Telemetry would need some work to get a baseline/reference for each computer and the results would take weeks or months to come back due to the lag between the master version of PDF.js and the one bundled in Firefox.

We compared PDF.js against native viewers. The performance was worse but we can fix this. We already started. For the other things e.g.: rendering quality, color conversions, accessibility, text selection, binary footprint, zoom & scroll, ... PDF.js was on par or better.


So slow and CPU hog even on i7 systems, that I always configure my Firefox installations to save PDFs instead.


Ditto. It's too slow for almost every PDF I want to open. It's too slow for: docs with a lot of pictures, docs with a lot of vector graphics and docs with enough pages.

The architects I know have much bigger problems than me. For them it's: every document is way too slow. They would really like to be able to put in their html pages the preference "any reader of a linked PDF would prefer a native code renderer, the content is guaranteed too demanding for a js one." They can change the setting in their own browser but their potential clients will just lose patience.

The other effect is that not only is it too slow, the vector graphics often look wrong.


For me it's faster than the 32 bit Acrobat Reader plugin that hung the entire browser for 1-2 seconds every time I wanted to open a PDF with it.


Honestly? I only have 2x1.3GHz, but it loads damn fast for files <= 5MB. How large are the files that you open that make your browser unresponsive?


Most of the ones that have more than 10 pages, especially CS papers.


Can you link us to one, so we can do independent tests with it, please?


For example the Modula-3 reference manual, 100 pages, when scrolling with the mouse, pages can take more than 1s to render on a i7 system.

With native readers, it is almost instantaneous.

http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-52.pdf


Unable to reproduce on a 2.6GHz Core i7. It's slower than the viewer in Chrome, to be sure, but not to the degree you describe.

https://drive.google.com/file/d/0B0Ne3vac3uJuOTB0Y2owLVN3MzQ


There might be something up with your install--- runaway plugin? I read lots of journal articles in FF, many in the 15-30 page range with figures, with no problem on an old dual-core iMac.


My install?! As I mentioned, in all computer systems I have access to.

I just get the spinning wheel effect, while the CPU usage scales all the way up, until the whole document is downloaded, processed and finally displayed.


"OK on 96% of PDFs" is all very well, but that means that almost 1 document in 20 is performing poorly. It only takes a few slow-rendering documents to make me switch to something else, so if I look at (say) 3 PDF files a day then you've probably lost me as a user within a fortnight.


That does assume that the 1 in 20 which perform poorly are randomly distributed across the set of PDFs. It seems at least plausible that these would actually be grouped up by type somehow (e.g. it mentioned that the worst one they found was a huge vector map of the Lisbon subway system), and thus it would probably be the case that a user would either encounter them far less than that 1 in 20 or far more, depending on their own usage patterns.
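
The arithmetic behind the parent's "within a fortnight" estimate, under exactly the independence assumption questioned here, can be sketched as follows. The 4% rate, 3 documents a day and 14 days come from the comments; `chanceOfSlowPdf` is a hypothetical helper name.

```javascript
// Probability of meeting at least one slow PDF among `count` documents,
// assuming each one is slow independently with probability `slowRate`.
function chanceOfSlowPdf(slowRate, count) {
  return 1 - Math.pow(1 - slowRate, count);
}

var slowRate = 0.04; // "OK on 96%" leaves 4% performing poorly
var perDay = 3;      // documents opened per day, from the comment
var days = 14;       // a fortnight
var opened = perDay * days;

console.log("documents opened:", opened);
console.log("expected slow ones:", slowRate * opened);
console.log("chance of at least one:", chanceOfSlowPdf(slowRate, opened));
```

With these numbers the chance of hitting at least one slow document comes out at roughly 0.82. If slow PDFs cluster by type, as suggested above, the figure for any individual user could be far lower or far higher.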


Still faster than the download - wait for Acrobat/Foxit Reader - wait for marginally faster loading document cycle anyway isn't it?


> You see a histogram of the time it took to process all the pages in the PDFs in relation to the average time it takes to process the average page of the Tracemonkey Paper (the default PDF you see when opening PDF.js)

While interesting data for sure, that's not really answering the same question as the headline there. How about a comparison to the rendering speed of MuPDF?


MuPDF doesn't solve the same problem, so it doesn't really matter whether MuPDF is faster.


MuPDF solves a bit of the same problem, in that it renders PDF while trying to ignore all the extensibility cruft that makes Adobe's PDF reader incredibly insecure.



> Maybe you know that there is no default PDF viewer in the Opera Browser, something we would like to change. But how to include one? Buy it from Adobe or Foxit?

If that is the problem being addressed, then I think it does?

One can ask if patching MuPDF vulnerabilities is a bigger hassle than getting PDF.js performant.


MuPDF is a very nice piece of software, but it's GPL licensed. As I understand it, this basically prevents MuPDF from actually being used in commercial software ... unless of course we would make the software open source ...


It's not GPL. They commercially license it if needed, but my understanding was that Opera is open source and would be compatible with the license on it.


There are several posts on this page touting the security benefits of PDF.js, and it's almost certainly true that PDF.js is more secure than a new pdf implementation in C or C++ would have been[1]. In the short term PDF.js makes sense.

I think there are longer term considerations here though. Web standards (HTML/CSS) and language (Javascript) were not designed to be used as a compilation target for complex programs. It was meant to be a document platform. PDF.js is fast enough to use (just about) because a massive amount of engineering time has been put into making browsers very performant. That has involved making browsers huge complex beasts that have to be written in low level unsafe languages to achieve acceptable performance[2]. By choosing to use standards that are very complicated, high level and inefficient we have made the implicit choice to require any competitive browser implementation to be highly optimised and low level. That huge pile of very complex low level code has a real security cost. Buffer overflows, dangling pointers, out of bounds reads. These things are common in browsers today[3] and will remain so while browsers have to be implemented in C++ to remain competitive performance wise.

We often engage in debates about whether Javascript/the web stack is fast enough to use for various types of software. Performance improvements are treated as an inevitability; "if it isn't fast enough today it will be tomorrow". Similar attitudes hold for missing APIs and functionality. Most people don't seem to really question what it really means for browsers to become more and more performant and more and more complex. The answer seems to be millions of lines of very complicated and low level C++.

I'm aware that Rust is one attempt to extricate ourselves from this hole, but that is years away from ever making it into a production browser, if it ever does. Meanwhile standards continue to increase in complexity and browsers continue to implement those additions with large amounts of potentially unsafe C++.

I don't have any solutions to this, it's too late, we are already committed to browsers being full operating systems. But while we are all running around patting each other on the back over how 'advanced' browsers are now I do think it's worth considering the security price we are paying to make things like PDF.js possible.

[1] I think there is a bit of a false dichotomy here though. If performance is acceptable with Javascript then it presumably would have been acceptable with native languages safer than C/C++ (Java, Haskell, Go, Python, etc, take your pick).

[2] For example Chrome seems to have basically a virtual memory system implemented in the browser, with pages, heaps and pagination logic so they can use relative 32 bit pointers. ( http://article.gmane.org/gmane.os.openbsd.misc/186107 )

[3] I don't intend to pick on Firefox specifically here since these types of issues exist in all browsers, but here are some recent Firefox issues from the last month or so:

  CVE-2014-1518 Various memory safety bugs leading to possible code execution.
  CVE-2014-1523 Out of bounds read leading to crashes when viewing jpeg images.
  CVE-2014-1524 Buffer overflow when using non-XBL object as XBL leading to possible code execution.
  CVE-2014-1531 Use after free when resizing images resulting in a potentially exploitable crash.


I'm not really sure what you're getting at.

>In the short term PDF.js makes sense.

What's the long term? We're not going to give up on PDFs anytime soon, and having a JavaScript-based viewer will make people who just want to consume PDFs much more secure. One less vector of attack.

>It was meant to be a document platform.

SO WHAT? Computers were never meant to be carried around in your pocket, the internet wasn't designed to propagate kitten memes, and Columbus never meant to land in America, yet here we are.

>That has involved making browsers huge complex beasts that have to be written in low level unsafe languages to achieve acceptable performance

As opposed to which safe languages? Java? .NET?

>But while we are all running around patting each other on the back over how 'advanced' browsers are now I do think it's worth considering the security price we are paying to make things like PDF.js possible.

And what was the security price for running standalone programs to read PDFs? What's the security price for ActiveX, Flash, Silverlight, Java applets?


>What's the long term?

The long term is a platform that can feasibly be implemented in higher level, safer languages and still give acceptable performance. We sacrificed that when we decided to put millions of dollars of resources into making web standards run faster instead of creating something new. What we did was probably the 'easier' route at the time, but there is a cost.

>SO WHAT?

Go and browse the source code of one of the main browsers. The cost of the complexity of web standards is plain to see in the ridiculous lengths browsers have to go to if they want to achieve decent performance. I gave you a list of recent security holes. The browser you are using right now almost certainly contains undiscovered buffer overflows, use after free and similar memory safety bugs. It's not just that we have created a situation where we are using big piles of unsafe code, it's that we are trapping ourselves into relying on this code for the foreseeable future because we are building a platform out of it and have made the specification of that platform so baroque that currently only C or C++ is flexible and low level enough for all the crazy performance hacks required.

Your argument seems to boil down to 'anything can be anything if we just want it hard enough', which might be true to an extent, but it does come with real costs, which was my point.

>As opposed to which safe languages? Java? .NET?

.NET isn't a language, but assuming you meant C#, then yes those are safer than C or C++. I gave you CVEs for various remote execution vulnerabilities in Firefox in just the last few weeks alone that would not have been possible in memory safe languages.

>And what was the security price for running standalone programs to read PDFs?

A web browser is a standalone program that reads PDFs. If it can be done 'safely' in Javascript in a web browser then it could be done in other memory safe languages as a standalone program. To answer your question though, the number of security vulnerabilities in the unsafe C code I generally use to view PDFs is quite small compared to the number of vulnerabilities in web browsers. Not surprising since browsers are orders of magnitude larger. I don't know why you want to make that comparison since it's not particularly meaningful.

>What's the security price for ActiveX, Flash, Silverlight, Java applets?

You seem to just be listing technologies that are unfashionable among the web developer community. The technologies you listed all (as far as I know) had implementations written in C or C++, for performance reasons, in an era when desktop computers were far slower than even phones are today. The fact that they all had bad security records really just reinforces my point that we shouldn't be writing platforms for running remote software in C or C++ (or at least, it should be a heavily audited core with most of the implementation in something safer).

Additionally, you can't really compare the Java applet situation where Oracle clearly don't give a shit about patching in a timely manner with web browsers that are automatically updated nearly weekly by teams that take security very seriously. The Java applets security situation is what browsers would be like if they didn't have very well resourced hyper vigilant teams trying to keep them patched.


Unfortunately Adobe Reader has suffered from bloat just as much as any web browser and has had more than its fair share of exploits.


> Web standards (HTML/CSS) and language (Javascript) were not designed to be used as a compilation target for complex programs.

http://asmjs.org/

https://www.destroyallsoftware.com/talks/the-birth-and-death...

Now they are.

> I don't have any solutions to this, it's too late, we are already committed to browsers being full operating systems.

When we do get to that point, and ditch the underlying MacWinuX, there's a good chance they won't be much more complex or much less secure than what they replaced. A typical MacWinuX desktop setup is already over 200 million lines of code. I'd be happy to drop that to a dozen million lines instead (even though 20K are probably closer to the mark http://vpri.org/html/work/ifnct.htm). It also shouldn't be much slower than current native applications.

Heck, it may even be significantly faster. Without native code, hardware doesn't have to care about backward compatibility any more! Just patch the suitable GCC or LLVM back end, and recompile the brO-Ser. New processors will be able to have better instruction sets, be tuned for JIT compilation… The Mill CPU architecture for instance, with its low costs for branch mispredictions, already looks like a nice target for interpreters.

---

> I do think it's worth considering the security price we are paying to make things like PDF.js possible.

Remember the 200 million lines I mentioned above? We're already paying that security price. For a long time, actually.

---

That said, I agree with your main point: the whole thing sucks big time, and it would be real nice if we could just start over, and have a decent full featured system that fit in, say 50.000 lines or so. Of course, that means forgoing backward compatibility, planning for many cores right away… Basically going back to the 60s, with hindsight.

Alas, as Richard P. Gabriel taught us, it'll never happen.


I don't think you have to forgo backwards compatibility. Implement a standard VM and library set that everyone can compile to. Implement HTML/JS as a module in the new system. Problem solved.


Well, it's not just HTML/JS. It's Word/OpenDocument, SMTP/POP/IMAP… Those modules are going to make for the vast majority of the code. We could easily go from 50K lines to several millions.


>Heck, it may even be significantly faster. Without native code, hardware doesn't have to care about backward compatibility any more! Just patch the suitable GCC or LLVM back end, and recompile the brO-Ser. New processors will be able to have better instruction sets, be tuned for JIT compilation… The Mill CPU architecture for instance, with its low costs for branch mispredictions, already looks like a nice target for interpreters.

Heh, I hope you appreciate the irony in that one. On the one hand we have people arguing that we have to stick with the existing web platform for backwards compatibility reasons, but on the other you are suggesting it would be easy to switch the entire world to new totally incompatible processor architectures to make aforementioned web platforms performant.


It's a matter of how many people you piss off. Ditch the browser, you have to change the whole web. Ditch the processor, and you have only a couple browsers to change.

Apple did it, you know? Changing from PowerPC to x86. And they had native applications to contend with. I believe they got away with an emulation mode of some kind, I'm not sure.

I for one wouldn't like to see the web take over the way it currently does. It's a mess, and it encourages more centralization than ever. But if it does, that will be the end of x86. (Actually, x86 would die if any virtual machine took over.)


Ah, thanks for posting Gary's talk. Great stuff.


>> We often engage in debates about whether Javascript/the web stack is fast enough to use for various types of software. Performance improvements are treated as an inevitability; "if it isn't fast enough today it will be tomorrow". Similar attitudes hold for missing APIs and functionality. Most people don't seem to really question what it really means for browsers to become more and more performant and more and more complex. The answer seems to be millions of lines of very complicated and low level C++.

Ironically, not needing to run high performance native code was a part of the original appeal of web apps. Google maps worked well however old your graphics card was. Gmail was secure even when viewed on an old version of IE.

If client side rendering is too slow, then don't render on the client. Cache the rasterized PDF as tiles up in the cloud and only read the actual document if the user selects text or zooms in. That is the only way to make massive PDFs load instantly on a slow machine.
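
A rough sketch of the lookup side of that idea, assuming fixed 256px tiles and a made-up cache key scheme (the names, the tile size and the key format are all hypothetical, not taken from any real service). Given what is on screen, it works out which pre-rasterized tiles to fetch:

```typescript
// Hypothetical sketch: which pre-rasterized tiles cover a viewport?
// TILE_SIZE, the key format, and every name here are illustrative assumptions.
const TILE_SIZE = 256; // tile edge length in pixels at every zoom level

interface Viewport {
  page: number;   // 1-based page number
  zoom: number;   // index of the zoom level the tiles were rasterized at
  x: number;      // left edge of the visible area, in pixels at this zoom
  y: number;      // top edge of the visible area, in pixels at this zoom
  width: number;  // visible width in pixels
  height: number; // visible height in pixels
}

// Returns cache keys such as "p3/z2/1_0" for every tile touching the viewport.
function tilesForViewport(v: Viewport): string[] {
  const firstCol = Math.floor(v.x / TILE_SIZE);
  const firstRow = Math.floor(v.y / TILE_SIZE);
  const lastCol = Math.floor((v.x + v.width - 1) / TILE_SIZE);
  const lastRow = Math.floor((v.y + v.height - 1) / TILE_SIZE);
  const keys: string[] = [];
  for (let row = firstRow; row <= lastRow; row++) {
    for (let col = firstCol; col <= lastCol; col++) {
      keys.push(`p${v.page}/z${v.zoom}/${col}_${row}`);
    }
  }
  return keys;
}
```

The client would only need the original document once a key misses the cache or the user asks for text selection.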


> That is the only way to make massive pdf's load instantly on a slow machine.

I beg to differ. Grab an older or smaller PDF reader. One of the ones with just two features- "scroll" and "zoom"- and you will find even a terribly slow machine can keep up.

(Except when it is one of those PDFs with 30MB pictures embedded in every page)

It's the bajillion added features of dubious value, and things like using js as the backend, that have kept PDF rendering decidedly heavyweight.

When a PDF is too large for PDF.js on my little 1.0GHz laptop, I open it up in Acrobat Reader and I'm back in business.


But I want to be able to quickly view those massive PDFs, 30MB per page is not particularly large for a large graphics heavy print quality document.

PDF rendering is essentially creating a binary from a source code package. Why would you recompile every single time when the purpose of the format is to deliver an exactly consistent binary output? Arguing that the format is badly designed does not open a document any faster!


I would argue the super-high-quality image PDF meant for printing on a 1200dpi plotter is an exceptional case a netbook will never handle well, no matter the format.

Usually when the PDF is silly big, someone just didn't do a good job preparing it.


Optimising compression and caching zoom levels does make a big difference. But it is exactly what is missing in a significant number of PDF files in the wild. This is particularly common in engineering drawings/maps and they can be slow to open on any machine. The point of my original comment was that client side code sitting in isolation will always have certain limitations regardless of how it is implemented.


> If client side rendering is too slow, then don't render on the client. Cache the rasterized pdf as tiles up in the cloud and only read the actual document...

Some people are not comfortable with giving their financial/proprietary/secret information to third party's "cloud". And since they have to pay for "cloud", you end up paying for it somehow at the end. With client side code: 1) nobody knows what you are looking at, 2) and you don't rely on network connectivity much.


Some of us are very happy to be able to[1] get the actual data directly to our machine without any "cloud" blurring our view, thanks : )

[1]: I have nothing against clouds, as long as they are not mandatory, that is.


Yes I agree. A "cloud" implementation of a PDF reader would be bad in a whole different way. It is annoying to have to use the "cloud" term at all as it implies certain things beyond just being a distributed/decentralised cache of data and external processing power. Those are the things that make the cloud approach useful and it's a shame some of that cannot be used in normal desktop applications.


Chrome's PDF renderer runs inside a NaCl sandbox, where any potentially unsafe machine operations are checked before execution. That's arguably _harder_ to break out of than Firefox's Javascript sandbox.


I would consider it somewhat harder. However, that's only after assuming that the author of the document has managed to get the PDF reader to start running arbitrary C and JavaScript code within the sandboxes, respectively. Because JavaScript is a fundamentally memory safe language, it is drastically less likely for a document parser written in JavaScript to end up running arbitrary code from the document than the equivalent in C.

Not that it matters in this case, since an attacker can just hand you a HTML page and run JavaScript that way, but it's worth noting that parsing in a high-level language is an accomplishment in general.
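
To make the memory safety point concrete, here is a minimal sketch using a made-up length-prefixed record (not real PDF syntax, and `readRecord` is a hypothetical helper). A document that lies about its length gets a `RangeError`, where the equivalent unchecked C read could walk into neighbouring memory:

```typescript
// Illustrative only: a made-up record format, not real PDF syntax.
// Layout assumed here: 1 byte holding a length N, followed by N payload bytes.
function readRecord(bytes: Uint8Array, offset: number): Uint8Array {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // DataView reads are bounds-checked: a bad offset throws RangeError.
  const declaredLength = view.getUint8(offset);
  const start = offset + 1;
  const end = start + declaredLength;
  // The explicit check gives a clear error. Even without it, subarray()
  // clamps to the buffer and cannot expose memory outside it.
  if (end > bytes.length) {
    throw new RangeError(
      `record claims ${declaredLength} bytes, only ${bytes.length - start} left`,
    );
  }
  return bytes.subarray(start, end);
}

// A hostile input that claims a 200-byte payload but carries only 3 bytes.
const hostile = new Uint8Array([200, 1, 2, 3]);
let outcome = "";
try {
  readRecord(hostile, 0);
  outcome = "parsed";
} catch (e) {
  outcome = e instanceof RangeError ? "rejected" : "unexpected error";
}
```

The parser can still have logic bugs, of course, but the failure mode is an exception rather than a read or write outside the buffer.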


> Not that it matters in this case, since an attacker can just hand you a HTML page and run JavaScript that way, but it's worth noting that parsing in a high-level language is an accomplishment in general.

What do you mean? You have plenty of parsers written in high-level languages.


This is exactly why Rust and Servo exist.


Browsers are complex because the web is a mess, not because of performance reasons. You can build a fast HTML 1.0 browser in a few lines of Java but if you want modern JavaScript + CSS + cookies + plugin support + backwards compatibility + embedding videos + best effort rendering + ... it becomes a huge mess.


I think that was exactly OP's point.


Isn't HTML 1.0 actually a mess compared to 4.0? ;)


No, HTML 1.0 had 20 elements, 13 of which are in HTML 5, so there is not much to hate. The real issue is we use a TEXT markup language to create UI, which is fundamentally broken.


Yes, this is the key point. Markup to make applications is just nuts.


I beg to differ between markup only and markup + code.

Declarative markup seems to be a whole lot harder to mess up than creating views in code. (Of course, anything is possible once you add enough smart people.)


Really? Android and Gtk are just nuts?

Of course, you can still create your ui in JS, just as you can create Gtk and Android applications without markup.


> Web standards (HTML/CSS) and language (Javascript) were not designed to be used as a compilation target for complex programs. It was meant to be a document platform.

Computers were designed to do math, not word processing or gaming or mapping of massive datasets into comprehensible reports. Attitudes like this are toxic to progress, in my opinion, and far too common.


Well, I've yet to see a video player done in a browser that didn't use the <video> element, or call out to some native plugin (flash, or realplayer, quicktime etc back in the day). I wouldn't be surprised if it had been attempted by just changing the contents of an <img> or something, but I don't think anybody would call that a sane way to play video. So clearly the original document system was not actually suitable for building things like video players - that functionality had to be added in later (implemented in native C++ of course) and exposed with an API because the web platform itself wasn't powerful enough to do it.

Anyway, I never said the web as an app platform can't be done. I'm asking if it is a good idea, and if we are fully considering the tradeoffs involved (i.e. a bit of 'engineering'). I don't find 'because we can' a great reason when it comes to things that affect the lives of most computer using people. A higher level of responsibility is involved in these situations than somebody's experimental spare time github repos.

I could sit here and think of a dozen feasible ways to get water from ground level to the first floor of my house. It wouldn't be 'toxic to progress' to point out that there are better ways than an archimedes screw powered by a horse in the garden. Thinking about better ways to do things rather than just blundering forwards with the first idea that comes to mind is exactly how progress is made.


All I meant was that just because it wasn't originally designed as an application platform doesn't mean that it hasn't evolved in the meantime. To your point, a lot of engineering time and money has been poured into moving web technologies well beyond being just a way to shuttle around scientific papers. HTML5, CSS3, and ES6 are far cries from their original ancestors, as are modern (and near-future) HTTP, websockets, etc. Technology evolves.

As for the hose analogy, municipal water distribution systems are still powered by gravity. Which is quite a bit older than the Archimedes screw. A good illustration that usually something simple is all you need.


Video without the video tag or plugins? Here you go: https://news.ycombinator.com/item?id=4531088


I've had a really good experience implementing a BeeLine Reader PDF coloring algorithm with PDF.js. It was surprisingly easy to work with.

http://www.beelinereader.com/pdf


Fast enough, most of the time.


I find it fast enough for reading, but not fast enough for skimming or navigation. If I want to page through a book quickly to find something, it just lags way too much rendering the pages; I have to stop every 5 or 10 pages and wait for it to fill them in, or I'll be paging through nothing but whitespace. But it's fine except for that.


Keep in mind that PDF.js is an engine + a viewer. Just a fast engine does not make a nice experience. There are many things in the viewer that one can do to make the experience of using it feel a lot faster.
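
As one example of the kind of viewer-side trick I mean (a hypothetical helper with made-up names, not how the PDF.js viewer actually schedules work): render what is on screen first, then a couple of pages in the direction the user is scrolling, so paging quickly is less likely to land on blank pages.

```typescript
// Hypothetical viewer helper, not part of the PDF.js API.
// Returns page numbers in the order they should be handed to the renderer.
function pagesToRender(
  firstVisible: number,   // 1-based index of the first page on screen
  lastVisible: number,    // 1-based index of the last page on screen
  pageCount: number,      // total pages in the document
  scrollingDown: boolean, // current direction of travel
  lookahead = 2,          // how many off-screen pages to pre-render (assumed value)
): number[] {
  const order: number[] = [];
  // Visible pages always come first.
  for (let p = firstVisible; p <= lastVisible; p++) {
    order.push(p);
  }
  // Then pages ahead in the direction of travel, nearest first.
  for (let i = 1; i <= lookahead; i++) {
    const p = scrollingDown ? lastVisible + i : firstVisible - i;
    if (p >= 1 && p <= pageCount) {
      order.push(p);
    }
  }
  return order;
}
```

For instance, with pages 4 and 5 visible in a 10-page document while scrolling down, the order would be 4, 5, 6, 7.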


While PDF.js is a great tech demo, in practice, unless one is using it for short documents, it doesn't work very well. That's why Google did what it did with Docs: render the PDF as an image, then put transparent text over it. It's a better approach if one has good servers, because frankly one can't expect the browser to behave like Acrobat Reader and be as performant as native applications when trying to render 20MB PDFs (and a lot of books are that big; this is not a rare use case).


I love PDF.js. I built a site recently that uses embedded PDF.js to display PDFs while making it very difficult to print/copy-paste/download these documents. It is super fast, works across all modern browsers, and supports useful PDF features like bookmarks, annotations, and table of contents. It's also pretty easy to customize and theme.


Please don't do that, you're making the web a worse place. If you're going to sell documents, do what the academic paywalls do: display the first two pages (without any restrictions). Restrictions are just silly because if the content is on the user's computer, you know there is a way for them to get at it.


I know where you're coming from and I had this exact same feeling before I embarked on this project. I've been making things for free and giving away my code/projects/services for almost two decades now so I didn't jump into this without serious consideration. I didn't want to write these details in my original post because I was just lauding the PDF.js project.

There were more than a few legal requirements for making such a system. We had to show reasonable attempts were made to prevent old copies of the data from existing anywhere, i.e. old printouts, copy-pasted notes etc. The documents shown had to be timestamped and watermarked with the user's full name. Unlike the typical public scribd-style document sharing site, this was already behind a login system and 100% of all user activity was monitored with the user's full knowledge. In fact, users demand that their activity be monitored for legal and auditing purposes.

Without going into specifics, imagine a highly skilled professional needs to e-sign that they read a training document V1.23 on date X/Y/Z. This isn't a standard Terms & Conditions agreement that everyone clicks without reading. This is something that affects the professional's abilities to make life or death decisions so they really want to read the correct version of the document. In order to meet all the legal requirements (think stuff like 21 CFR Part 11), the best technical solution turned out to be a browser-based PDF reader that disabled printing/copy-pasting/downloading. I was tasked with building that and thanks to PDF.js, I did so with almost no effort.


I wasn't primarily thinking of making things free, that is obviously not always possible. What I argue against is introducing artificial and superficial restrictions, in the form of restricting GUI actions. This gives a false sense of security, because the underlying data is still on the user's computer and technically they can do whatever they want with it. Instead I believe the system that you talk about should rely on trust, something which cannot be established by technical means. Yes you could require the user to scroll through the whole document, even require a minimum of x seconds per page, or give 20 questions after they read it, but ultimately there is no substitute for trusting the user and their reading comprehension ... Of course I understand that you're not in a position to change these requirements.


Is PDF.js used by Chrome? The reason I ask is because the safe script plugin blocks rendering of all PDFs not from a trusted domain. I find this behavior quite frustrating, since PDFs afaik don't contain (or don't typically contain) tracking code or keep executing code after the initial render of the document.


No, Chrome ships a proprietary PDF viewer as native code (Foxit).


Semi-related, but has anybody found a decent PDF generator for JS? Something where you can get a decent quality PDF (none of the screenshot via phantom.js stuff) that's text, a few tables, and a decent layout? Closest example would be an invoice for something.

P.S. This would need to work with Node/Meteor.


The performance is fine, but when I have to open PDF files in a foreign language, neither Adobe nor PDF.js can do that well. Adobe will require me to have the font package downloaded. I don't think there is such a package for PDF.js, which forces me to download the PDF.


They render fine if fonts are embedded; granted, there are many PDFs which don't embed them. But then it's the producers who are to blame.


What does it matter? PDF.js broke quality printing to CUPS, so it's a bit of a joke if you break that printing use case.

https://bugzilla.mozilla.org/show_bug.cgi?id=932289


When used with heavy PDFs which basically contain tons of images (like scanned books) PDF.js is noticeably slower than native plugins (like Kparts plugin with Okular on KDE) used with Firefox. For more lightweight PDFs it's acceptable.


Not very fast... when loading a big PDF, before the save button can be used, it needs to render the whole thing first. I don't understand why, because all it has to do is save the file which it already has.


I believe this is a shortcoming of the default viewer that comes with PDF.js, 'coz as you say saving the file on disk has nothing to do with PDF.js itself. Bear in mind that PDF.js works in older browsers ( IE8 and 9, and even older versions of Opera, Firefox, ... ) so this might drag down the default viewer a bit.

Besides, if you know beforehand that you want to save the PDF, surely you can right/Cmd click the link and save the file right away.

Anyhow, if we - Opera - decide to use PDF.js as the default PDF renderer for the Desktop browser, we will roll our own viewer which can use fancy pancy features like: <a href=# download>Save</a>


Does anyone know of a good tool for PDF generation in JavaScript? In Ruby, I use Prawn, which is great. Is there a Prawn-like library for JavaScript?


https://github.com/devongovett/pdfkit

(I've also been working on something lower-level — basically support for creating a PDF file out of its raw sections — but it's not quite yet released).


FPDF FTW!


"Also security is virtually no issue: using PDF.js is as secure as opening any other website."

I can't even begin to express how much this sentiment troubles me.

Edited to add: A big part of my concern regarding PDFs these days has to do with embedded malware, but in general I'm wary of active content. I'm all for faster rendering, but I wonder how well PDF.js protects against malicious content. I don't use the native PDF reader for that very reason.


The point he is making is that PDF.js is just a combination of JavaScript, DOM and Canvas rendering. It does nothing more than a website can do, and running websites is already something that needs to be secure. So this means that all the existing security measures already present in browsers make PDF.js safe to run.


> I'm all for faster rendering, but I wonder how well PDF.js protects against malicious content.

It doesn't need to; that's what the browser is for. PDF.js doesn't need to -- in any way -- concern itself with security; that's pretty unquestionably a good thing.


By not having a plugin, all of the rendering done by PDF.js is contained within the sandbox normally provided for any page.

If a PDF is designed to exploit PDF.js, the worst it can do is the equivalent of a cross-site scripting attack on the page hosting PDF.js.

This is a huge win over the possibility of exploiting a bug in a plugin which runs outside of the browser sandbox.


Why? The point he's making is that not having to install a third-party browser plugin to view a PDF is a big win for security.

Edit in reply to your edit: embedded malware is tailored to exploit a bug in a specific viewer implementation... so I doubt there's much floating around that targets PDF.js; I imagine Adobe Reader is a juicier target. In any case, JS running in the browser is usually well isolated (e.g. no filesystem access); it can wreak havoc in the tab but not much else.


The wins on confining the content to the browser sandbox as well as integrating .pdf viewing into the browser experience greatly outweigh the current limitations for large file sizes. I hope ongoing work will fix the latter. Pdf.js is an awesome improvement that makes me breathe easier every time I click on a pdf link.



