building plan filetype:pdf
Then try to actually look at every page of such documents with pdf.js. The first result I get is
Try it. You will want to throw your fast computer through the window. Then, when you manage to calm down, you'll try to configure your browser to never invoke pdf.js again, if you know that you need to work with such documents.
If you want a realistic benchmark, compare the speed of rendering these documents with Adobe's or some other native renderer.
I'm not a building architect, but at least I don't live in a "we don't draw anything significant in our documents" bubble. I know they worry about their potential customers. Forced pdf.js is a huge setback for them. If they were able to tell the browser in the HTML "please don't use pdf.js for this one," they would be much happier.
Looks like it could be running much faster:
* 20% of the time is spent copying the canvas because someone is, likely erroneously, holding a reference to the canvas. Looking into it: https://bugzilla.mozilla.org/show_bug.cgi?id=1007897
* 10% of the time is spent waiting on display transaction swaps because the canvas isn't triple-buffered.
* PDF.js is not getting empty transactions (canvas draw optimizations).
That's just from a quick profile. I'm sure there's a ton more things that could be improved.
That profile will show calls from JS and how they call into native C++ calls. This is very useful for profiling things like canvas.
The first PDF.js document uses the DOM instead of canvas (unlike the previous one). This is done to support text selection. Most of the time is spent in the style system. I don't know that area very well, but in the past I've seen simplifying CSS selectors make all the difference. I know a fairly important problem for B2G is speeding up expensive style flushes (https://bugzilla.mozilla.org/show_bug.cgi?id=931668) but I don't know enough about CSS to know if that fix will solve the problem here.
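For anyone who hasn't looked at that path: text selection is supported by overlaying the canvas with absolutely positioned DOM elements, one per text run, so a single page can hand the style system thousands of nodes at once. A rough sketch of the idea (hypothetical code, not pdf.js itself):

    // Rough sketch of a DOM text layer: each text run becomes an
    // absolutely positioned div over the rendered page.
    function appendTextRun(textLayerDiv, run) {
      var el = document.createElement('div');
      el.textContent = run.str;
      el.className = 'textRun'; // one simple class; complex descendant
                                // selectors over thousands of these nodes
                                // are what make style flushes expensive
      el.style.position = 'absolute';
      el.style.left = run.x + 'px';
      el.style.top = run.y + 'px';
      el.style.fontSize = run.height + 'px';
      textLayerDiv.appendChild(el);
    }

Simplifying whatever selectors match those nodes is the kind of change that can pay off here.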
I get between 5 and 7 seconds here.
Looks like getaliasedvar is causing excessive bailouts from Ion (the 3rd-tier JIT). On top of that, the platform is trying to synchronously cancel background Ion compilation, and that is taking an excessive amount of time. Reported as: https://bugzilla.mozilla.org/show_bug.cgi?id=1007927
Tweaking the functions listed in the profile to avoid them bailing out should drastically improve this test case.
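To illustrate the kind of tweak that means (a hypothetical example, not the actual pdf.js code): a hot loop that repeatedly reads a closure-captured variable goes through aliased-var accesses, and copying the value into a plain local before the loop sidesteps that.

    // Hypothetical sketch: keep aliased-var reads out of the hot loop.
    function makeAccumulator() {
      var total = 0; // captured by the closure below (an aliased var)
      return function (values) {
        var t = total; // copy into a plain local first...
        for (var i = 0; i < values.length; i++)
          t += values[i]; // ...so the loop never touches the aliased slot
        total = t; // write back once, after the loop
        return t;
      };
    }

Whether a change like that actually helps depends on what the profile says, of course.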
    var s = 0.01;
    for (var i = 0; i < 100000000; i++)
        s += 0.1;
    print(s); // print() is a builtin of JS shells such as SpiderMonkey's
Still, really the biggest problems I know of at the moment are those PDFs that the architects produce.
It's important to have benchmarks that aren't trivially converted to no-ops or constant loads by the compiler. (In practice the JIT might not be optimizing that one out, but an aggressive C++ compiler certainly would as long as fast math is enabled - so at some point, a typical JS JIT will too).
Also ensure that you're benchmarking warmed code that has been fully jitted. JS code (other than asm.js in Firefox) has multiple stages of compilation in modern runtimes, typically triggered based on how often it is called.
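A sketch of both points together, reusing the loop from above: warm the function up so it gets fully JITed, and keep the result live so the work can't be optimized away. (Date.now() and print() are available in the common JS shells; the iteration counts are arbitrary.)

    // Return the sum so the loop has an observable result.
    function kernel(n) {
      var s = 0.01;
      for (var i = 0; i < n; i++)
        s += 0.1;
      return s;
    }

    for (var w = 0; w < 1000; w++)
      kernel(10000); // warm-up: give the JIT a chance to tier up

    var t0 = Date.now();
    var result = kernel(100000000); // the timed run
    print('s = ' + result + ', ' + (Date.now() - t0) + ' ms'); // consume result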
Heh, I believe bgirard wrote the Firefox profiler. He's the best there is :)
to your test suites and consider it a worthy goal as it really represents a lot of documents typical for the users who produce complex plans. Have you tried it?
It's immediate in Adobe Reader (as in, one second) and takes several minutes in pdf.js in Firefox 29.
I filed an issue on pdf.js here: https://github.com/mozilla/pdf.js/issues/4761
controller site:automationdirect.com filetype:pdf
You have to try to actually see every page in any of the PDFs to get an idea of the pain.
I agree that the speed is a lot worse than in Preview.app, for example, but it's also not unusable. I will look at it in more detail tomorrow.
I must admit, however, that I'm not able to easily construct a Google search for more such documents, but I know a lot of people who work only with documents like these -- they just can't work with pdf.js.
Does your benchmark measure the time to actually display everything on every page (what a human looking at all the pages must wait for), or just the time until the browser is responsive?
Edit: inspired by other comments, it seems that at least a search for "math plots filetype:pdf" returns more guaranteed problems like
Still it's hard to find slow PDFs with certainty just by using Google.
It "only" benchmarks the rendering; all the overhead the viewer produces is not shown. That is intentional, as we will create our own viewer anyway.
BTW: PDF.js in Firefox is typically slower than in Opera / Chrome; all these benchmarks were run in Opera.
Have you tried to see all the pages, all the drawings? How many minutes did you need, and on what setup? Are you using Firefox? Does it use pdf.js? Is it something OS-dependent, or did you just not look at all the pages?
Now I see, Gracana mentions manuals on AutomationDirect.com. Look at this one for example:
(The documents like this
They won't. As soon as you prevent them from doing their job (and that's the case if they were using their PDFs normally before you changed the defaults), they will have to search for a solution. The solution is either to switch the handler (still a little better for browser makers) or to switch browsers.
By making opening and looking at a 15-page PDF, which used to be instantaneous, take two minutes, you prevent them from doing their job (the slowdown from subjectively 0 seconds to minutes is also a subjectively infinitely worse experience!) and they must respond. They can't open just the first page. They actually care. They need all the pages.
It takes time to get to the PDF. It takes time to transfer the file if it's of any size. It takes time to look at the pages.
Also you can skim pages that don't have 100% of the drawings on them yet.
PDF.js as representative of html+js viewers vs. 'native' is completely unfair unless you compare to a similarly unoptimized native viewer.
Details are here: https://blog.mozilla.org/nnethercote/2014/02/07/a-slimmer-an.... These improvements are present in Firefox 29, which only just came out, so if you're seeing bad performance and you're still on 28 or earlier, an upgrade might help.
If you have particular PDFs that cause pdf.js to run slowly, please file bugs about them at bugzilla.mozilla.org under the "Firefox" product and the "PDF Viewer" component. PDF is a large, complicated format, so there are many corner cases that the pdf.js developers are unlikely to see without help. Every improvement I made came about from profiling PDFs that people had reported as causing problems.
Edit: This is an example of what I'm talking about https://github.com/mozilla/pdf.js/issues/3853
Alas, no :) My work was basically a few surgical strikes in which I learned a lot about a few small corners, but my overall understanding of the code base is pretty weak.
I recommend filing a new issue here: https://github.com/mozilla/pdf.js/issues/new
bgirard's profiling work in this thread has been amazing, but I don't think I have the knowledge to interpret the profiler results well enough to file bugs on slow PDFs. A way to post "slow" PDFs to be picked up by pdf.js developers would help.
Filing at Bugzilla will probably give you a wider audience. This can be useful if the underlying perf problem is with Firefox rather than pdf.js. Don't worry about getting some of the Bugzilla fields wrong, it happens all the time and people don't mind. A good description of the problem (with steps to reproduce) is the most important thing.
If you're still intimidated, filing something with pdf.js's GitHub tracker is a lot better than nothing.
When I started to work on JBIG2, it was kind of irritating, as you were often one week ahead of me. When I started to make some optimizations, I would see a PR from you with the exact same things. :p
It's really nice to contribute to PDF.js. Overall, between your work and Opera's contributions, the font caching, image decoders and color conversions got significantly faster! And there is still a lot we can do.
controller site:automationdirect.com filetype:pdf
(it doesn't always happen and I can't find a pdf where it does that now)
I was about to write a service to auto-convert PDF documents uploaded via ownCloud into the aforementioned HTML5 documents, but it seems that even though I have SSH access, this server is managed and doesn't come with make, and it also has no poppler or fontforge libraries available. Meh :(
I'm thinking about compiling the binary with --prefix=~/.local/lib and copying the dependencies I've found into that directory on the server. I hope that works; otherwise I'll need to write an API for pdf2htmlEX on a server where I have root, upload modified & new PDF files to it, wait for it to finish the HTML5 conversion, and download the results to the right directory using curl. That'd be much more work than just dropping the binary onto the server and running the service as a cronjob for modified or new files.
Any ideas on how to solve this cleverly?
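One shape the cronjob variant could take in Node, assuming you do get a pdf2htmlEX binary working under ~/.local (all paths here are hypothetical):

    // Sketch: convert new or modified PDFs to HTML with pdf2htmlEX.
    var fs = require('fs');
    var path = require('path');
    var execFile = require('child_process').execFile;

    var srcDir = '/path/to/owncloud/files'; // hypothetical
    var outDir = '/path/to/converted';      // hypothetical

    fs.readdirSync(srcDir)
      .filter(function (f) { return /\.pdf$/i.test(f); })
      .forEach(function (f) {
        var pdf = path.join(srcDir, f);
        var html = path.join(outDir, f.replace(/\.pdf$/i, '.html'));
        // Only reconvert when the PDF is newer than the existing HTML.
        if (!fs.existsSync(html) ||
            fs.statSync(pdf).mtime > fs.statSync(html).mtime) {
          execFile('pdf2htmlEX', ['--dest-dir', outDir, pdf], function (err) {
            if (err) console.error('conversion failed: ' + f, err);
          });
        }
      });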
The project is very active: in the three months that we investigated it, many of the PDFs that were unusable when we started (e.g. NASA's budget report, one of the worst-written PDFs I have seen so far) became fast enough. Also, almost all the effort that Opera puts into the project is about performance.
I suggest you check the file again in PDF.js and report the PDF so somebody can look at it.
You lose easy search, but it can make your life a lot easier if you don't need them.
Not much to be done in that case.
I mean, only check "popular" PDFs? What about all the people who work with PDFs and are sent custom files for their enterprise/office/design agency etc., rather than "popular" stuff? Has he checked what the shape of the distribution curve for popular vs. "long tail" PDFs is?
Second, using the time to load their intro page as a baseline? How is that relevant? Just because he has this arbitrary, subjective idea that "up to 3x that time gives a good user experience"? How about comparing to the time Acrobat/Preview take?
Lastly, it just measures loading the PDF? How about going through the pages?
The benchmark renders every single page of every single PDF five times and measures the rendering time.
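For the curious, such a harness is roughly this shape (a hedged sketch against the PDF.js API of that era, with the PDFJS global; the exact promise/render details have varied between versions):

    // Sketch: render every page of one document 5 times, timing rendering only.
    function benchmark(url) {
      PDFJS.getDocument(url).then(function (pdf) {
        var pageNums = [];
        for (var run = 0; run < 5; run++)          // 5 passes per document
          for (var p = 1; p <= pdf.numPages; p++)  // every single page
            pageNums.push(p);
        var start = performance.now();
        pageNums.reduce(function (chain, num) {
          return chain.then(function () {
            return pdf.getPage(num).then(function (page) {
              var viewport = page.getViewport(1.0);
              var canvas = document.createElement('canvas');
              canvas.width = viewport.width;
              canvas.height = viewport.height;
              // Off-screen canvas: no viewer chrome, just raw rendering time.
              return page.render({
                canvasContext: canvas.getContext('2d'),
                viewport: viewport
              }).promise;
            });
          });
        }, Promise.resolve()).then(function () {
          console.log(url + ': ' + (performance.now() - start) + ' ms');
        });
      });
    }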
The problem with benchmarking the "not popular" PDFs is that they're not available outside of their enterprise/office/design agency etc. But if you have any that you can share publicly, please file an issue on the GitHub repository of PDF.js.
Some people mentioned Mozilla's Telemetry. Unfortunately, it is too limited for this kind of research, as it can only report enum values. Using Telemetry would also need some work to establish a baseline/reference for each computer, and the results would take weeks or months to come back due to the lag between the master version of PDF.js and the one bundled in Firefox.
We compared PDF.js against native viewers. The performance was worse, but we can fix that, and we have already started. For the other things, e.g. rendering quality, color conversions, accessibility, text selection, binary footprint, zoom & scroll, ..., PDF.js was on par or better.
The architects I know have much bigger problems than me. For them, every document is way too slow. They would really like to be able to put in their HTML pages the preference "any reader of the linked PDF should prefer a native-code renderer; the content is guaranteed to be too demanding for a JS one." They can change the setting in their own browser, but their potential clients will just lose patience.
The other effect is that not only is it too slow, the vector graphics also often look wrong.
With native readers, it is almost instantaneous.
I just get the spinning-wheel effect, with CPU usage pegged all the way up, until the whole document is downloaded, processed and finally displayed.
While interesting data for sure, that's not really answering the same question as the headline there. How about a comparison to the rendering speed of MuPDF?
If that is the problem being addressed, then I think it does?
One can ask if patching MuPDF vulnerabilities is a bigger hassle than getting PDF.js performant.
I'm aware that Rust is one attempt to extricate ourselves from this hole, but that is years away from ever making it into a production browser, if it ever does. Meanwhile, standards continue to increase in complexity, and browsers continue to implement those additions with large amounts of potentially unsafe C++.
I don't have any solutions to this; it's too late, we are already committed to browsers being full operating systems. But while we are all running around patting each other on the back over how 'advanced' browsers are now, I do think it's worth considering the security price we are paying to make things like PDF.js possible.
For example, Chrome seems to have basically a virtual memory system implemented in the browser, with pages, heaps and pagination logic, so they can use relative 32-bit pointers. ( http://article.gmane.org/gmane.os.openbsd.misc/186107 )
I don't intend to pick on Firefox specifically here, since these types of issues exist in all browsers, but here are some recent Firefox issues from the last month or so:
CVE-2014-1518 Various memory safety bugs leading to possible code execution.
CVE-2014-1523 Out of bounds read leading to crashes when viewing jpeg images.
CVE-2014-1524 Buffer overflow when using non-XBL object as XBL leading to possible code execution.
CVE-2014-1531 Use after free when resizing images resulting in a potentially exploitable crash.
>In the short term PDF.js makes sense.
>It was meant to be a document platform.
SO WHAT? Computers were never meant to be carried around in your pocket, the internet wasn't designed to propagate kitten memes, and Columbus never meant to land in America, yet here we are.
>That has involved making browsers huge complex beasts that have to be written in low level unsafe languages to achieve acceptable performance
As opposed to which safe languages? Java? .NET?
>But while we are all running around patting each other on the back over how 'advanced' browsers are now I do think it's worth considering the security price we are paying to make things like PDF.js possible.
And what was the security price for running standalone programs to read PDFs? What's the security price for ActiveX, Flash, Silverlight, Java applets?
The long term is a platform that can feasibly be implemented in higher level, safer languages and still give acceptable performance. We sacrificed that when we decided to put millions of dollars of resources into making web standards run faster instead of creating something new. What we did was probably the 'easier' route at the time, but there is a cost.
Go and browse the source code of one of the main browsers. The cost of the complexity of web standards is plain to see in the ridiculous lengths browsers have to go to if they want to achieve decent performance. I gave you a list of recent security holes. The browser you are using right now almost certainly contains undiscovered buffer overflows, use after free and similar memory safety bugs. It's not just that we have created a situation where we are using big piles of unsafe code, it's that we are trapping ourselves into relying on this code for the foreseeable future because we are building a platform out of it and have made the specification of that platform so baroque that currently only C or C++ is flexible and low level enough for all the crazy performance hacks required.
Your argument seems to boil down to 'anything can be anything if we just want it hard enough', which might be true to an extent, but it does come with real costs, which was my point.
>As opposed to which safe languages? Java? .NET?
.NET isn't a language, but assuming you meant C#, then yes, those are safer than C or C++. I gave you CVEs for various remote code execution vulnerabilities in Firefox from just the last few weeks that would not have been possible in memory-safe languages.
>And what was the security price for running standalone programs to read PDFs?
>What's the security price for ActiveX, Flash, Silverlight, Java applets?
You seem to just be listing technologies that are unfashionable among the web-developer community. The technologies you listed all (as far as I know) had implementations written in C or C++, for performance reasons, in an era when desktop computers were far slower than even phones are today. The fact that they all had bad security records really just reinforces my point that we shouldn't be writing platforms for running remote software in C or C++ (or at least, it should be a heavily audited core with most of the implementation in something safer).
Additionally, you can't really compare the Java applet situation, where Oracle clearly don't give a shit about patching in a timely manner, with web browsers that are automatically updated nearly weekly by teams that take security very seriously. The Java applet security situation is what browsers would be like if they didn't have very well-resourced, hyper-vigilant teams trying to keep them patched.
http://asmjs.org/ — https://www.destroyallsoftware.com/talks/the-birth-and-death...
Now they are.
> I don't have any solutions to this, it's too late, we are already committed to browsers being full operating systems.
When we do get to that point, and ditch the underlying MacWinuX, there's a good chance they won't be much more complex or much less secure than what they replaced. A typical MacWinuX desktop setup is already over 200 million lines of code. I'd be happy to drop that to a dozen million lines instead (even though 20K is probably closer to the mark: http://vpri.org/html/work/ifnct.htm). It also shouldn't be much slower than current native applications.
Heck, it may even be significantly faster. Without native code, hardware doesn't have to care about backward compatibility any more! Just patch the suitable GCC or LLVM back end, and recompile the browser. New processors will be able to have better instruction sets and be tuned for JIT compilation… The Mill CPU architecture, for instance, with its low cost for branch mispredictions, already looks like a nice target for interpreters.
> I do think it's worth considering the security price we are paying to make things like PDF.js possible.
Remember the 200 million lines I mentioned above? We're already paying that security price. For a long time, actually.
That said, I agree with your main point: the whole thing sucks big time, and it would be really nice if we could just start over and have a decent, full-featured system that fits in, say, 50,000 lines or so. Of course, that means forgoing backward compatibility, planning for many cores right away… Basically going back to the '60s, with hindsight.
Alas, as Richard P. Gabriel taught us, it'll never happen.
Heh, I hope you appreciate the irony in that one. On the one hand we have people arguing that we have to stick with the existing web platform for backwards compatibility reasons, but on the other you are suggesting it would be easy to switch the entire world to new totally incompatible processor architectures to make aforementioned web platforms performant.
Apple did it, you know? Changing from PowerPC to x86. And they had native applications to contend with. I believe they got away with an emulation mode of some kind; I'm not sure.
I for one wouldn't like to see the web take over the way it currently does. It's a mess, and it encourages more centralization than ever. But if it does, that will be the end of x86. (Actually, x86 would die if any virtual machine took over.)
Ironically, not needing to run high performance native code was a part of the original appeal of web apps. Google maps worked well however old your graphics card was. Gmail was secure even when viewed on an old version of IE.
If client-side rendering is too slow, then don't render on the client. Cache the rasterized PDF as tiles up in the cloud and only read the actual document if the user selects text or zooms in. That is the only way to make massive PDFs load instantly on a slow machine.
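The client half of that idea is tiny; a sketch, with a made-up /tiles endpoint that a server-side rasterizer would have to back:

    // Hypothetical sketch: draw pre-rendered tiles instead of parsing the PDF.
    function drawTile(ctx, docId, page, zoom, tx, ty, tileSize) {
      var img = new Image();
      img.onload = function () {
        ctx.drawImage(img, tx * tileSize, ty * tileSize);
      };
      // The server rasterizes and caches tiles; the client only touches
      // the actual document for text selection or deeper zoom levels.
      img.src = '/tiles/' + docId + '/' + page + '/' + zoom +
                '/' + tx + '-' + ty + '.png';
    }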
I beg to differ. Grab an older or smaller PDF reader, one of the ones with just two features, "scroll" and "zoom", and you will find even a terribly slow machine can keep up.
(Except when it is one of those PDFs with 30MB pictures embedded in every page)
It's the bajillion added features of dubious value, and things like using JS as the backend, that have kept PDF rendering decidedly heavyweight.
When a PDF is too large for PDF.js on my little 1.0GHz laptop, I open it up in Acrobat Reader and I'm back in business.
PDF rendering is essentially creating a binary from a source code package. Why would you recompile every single time when the purpose of the format is to deliver an exactly consistent binary output? Arguing that the format is badly designed does not open a document any faster!
Usually when the PDF is silly big, someone just didn't do a good job preparing it.
Some people are not comfortable giving their financial/proprietary/secret information to a third party's "cloud". And since someone has to pay for the "cloud", you end up paying for it somehow in the end. With client-side code: 1) nobody knows what you are looking at, and 2) you don't rely much on network connectivity.
I have nothing against clouds, as long as they are not mandatory, that is.
What do you mean? You have plenty of parsers written in high-level languages.
Declarative markup seems to be a whole lot harder to mess up than creating views in code. (Of course, anything is possible once you add enough smart people.)
Of course, you can still create your UI in JS, just as you can create Gtk and Android applications without markup.
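For comparison, the imperative route looks like this (trivial sketch; save() is an assumed handler):

    // The same button, built in JS instead of markup.
    var btn = document.createElement('button');
    btn.textContent = 'Save';
    btn.addEventListener('click', save);
    document.body.appendChild(btn);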
Computers were designed to do math, not word processing or gaming or mapping of massive datasets into comprehensible reports. Attitudes like this are toxic to progress, in my opinion, and far too common.
Anyway, I never said the web as an app platform can't be done. I'm asking whether it is a good idea, and whether we are fully considering the tradeoffs involved (i.e. a bit of 'engineering'). I don't find 'because we can' a great reason when it comes to things that affect the lives of most computer-using people. A higher level of responsibility is involved in these situations than in somebody's experimental spare-time GitHub repos.
I could sit here and think of a dozen feasible ways to get water from ground level to the first floor of my house. It wouldn't be 'toxic to progress' to point out that there are better ways than an Archimedes screw powered by a horse in the garden. Thinking about better ways to do things, rather than just blundering forwards with the first idea that comes to mind, is exactly how progress is made.
As for the hose analogy: municipal water distribution systems are still powered by gravity, which is quite a bit older than the Archimedes screw. A good illustration that usually something simple is all you need.
There were more than a few legal requirements for making such a system. We had to show reasonable attempts were made to prevent old copies of the data from existing anywhere, i.e. old printouts, copy-pasted notes, etc. The documents shown had to be timestamped and watermarked with the user's full name. Unlike the typical public scribd-style document sharing site, this was already behind a login system, and 100% of all user activity was monitored with the user's full knowledge. In fact, users demand that their activity be monitored for legal and auditing purposes.
Without going into specifics, imagine a highly skilled professional needs to e-sign that they read a training document V1.23 on date X/Y/Z. This isn't a standard Terms & Conditions agreement that everyone clicks without reading. This is something that affects the professional's abilities to make life or death decisions so they really want to read the correct version of the document. In order to meet all the legal requirements (think stuff like 21 CFR Part 11), the best technical solution turned out to be a browser-based PDF reader that disabled printing/copy-pasting/downloading. I was tasked with building that and thanks to PDF.js, I did so with almost no effort.
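The core of such a viewer is small; a hedged sketch against the PDF.js API of the time (the endpoint and element names are made up):

    // Render a page to a canvas; the file itself is never handed out, and
    // the page simply offers no download/print UI around it.
    PDFJS.getDocument('/api/training-doc?v=1.23').then(function (pdf) {
      return pdf.getPage(1);
    }).then(function (page) {
      var viewport = page.getViewport(1.5);
      var canvas = document.getElementById('secure-viewer');
      canvas.width = viewport.width;
      canvas.height = viewport.height;
      canvas.oncontextmenu = function () { return false; }; // no "Save image as..."
      page.render({ canvasContext: canvas.getContext('2d'), viewport: viewport });
    });

The timestamping, watermarking and audit logging described above happen server-side, before the bytes ever reach PDF.js.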
P.S. This would need to work with Node/Meteor.
Besides, if you know beforehand that you want to save the PDF, surely you can right-click/Cmd-click the link and save the file right away.
Anyhow, if we - Opera - decide to use PDF.js as the default PDF renderer for the desktop browser, we will roll our own viewer, which can use fancy pancy features like: <a href=# download>Save</a>
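For reference, such a Save link can be wired to the already-loaded data so the file isn't fetched twice (sketch; pdfBytes is assumed to hold the document):

    // Offer the in-memory PDF bytes through the download attribute.
    var blob = new Blob([pdfBytes], { type: 'application/pdf' });
    var link = document.createElement('a');
    link.href = URL.createObjectURL(blob);
    link.download = 'document.pdf';
    link.textContent = 'Save';
    document.body.appendChild(link);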
(I've also been working on something lower-level — basically support for creating a PDF file out of its raw sections — but it's not quite yet released).
I can't even begin to express how much this sentiment troubles me.
Edited to add: A big part of my concern regarding PDFs these days has to do with embedded malware, but in general I'm wary of active content. I'm all for faster rendering, but I wonder how well PDF.js protects against malicious content. I don't use the native PDF reader for that very reason.
It doesn't need to; that's what the browser is for. PDF.js doesn't need to -- in any way -- concern itself with security; that's pretty unquestionably a good thing.
If a PDF is designed to exploit PDF.js, the worst it can do is the equivalent of a cross-site scripting attack on the page hosting PDF.js.
This is a huge win over the possibility of exploiting a bug in a plugin which runs outside of the browser sandbox.
Edit in reply to your edit: embedded malware is tailored to exploit a bug in a specific viewer implementation... so I doubt there's much floating around that targets PDF.js; I imagine Adobe Reader is a juicier target. In any case, JS running in the browser is usually well isolated (e.g. no filesystem access); it can wreak havoc in the tab but not much else.