PDFium: Chrome’s PDF rendering engine is now open-source (code.google.com)
421 points by andybons on May 22, 2014 | 100 comments



I found it interesting that it seems to use Antigrain by Maxim Shemanarev in https://pdfium.googlesource.com/pdfium/+/master/core/src/fxg... (Chrome itself uses Skia). Unfortunately, the author of Antigrain died: http://beta.slashdot.org/submission/3154635/rip-maxim-sheman... It's nice to see the fascinating Antigrain code being used for PDF viewing every day!


I did not know about this. As an AGG user, this made me sad.


Nice to hear that AGG lives on. I remember exploring it in around 2002 and being mighty impressed with it.


This is good news because it renders PDFs a lot faster and better than the pdf.js that Firefox uses. Also, previously I had to install this as a binary blob to get Chromium to render PDFs. It seems Chromium could easily adopt this, but I'm not sure about Firefox.


Well, it wouldn't make sense for Firefox to adopt this:

1. It is tied to V8 (PDFs can run JS, and this PDF viewer uses V8 to do so - see CJS_Context::RunScript etc.), so it would mean bundling two JS engines, with all the security downsides of that.

2. This is written in C++. You can sandbox C++ in various ways, but that would still increase the surface area of the browser, compared to pdf.js which only uses things normal web content would use.

3. pdf.js is not just meant to render pdfs, it's also a useful project to push forward the web platform. Areas where pdf.js was slow turned out to be things that were worth optimizing anyhow. This doesn't benefit people viewing pdfs directly, of course, but it's still an interesting aspect of the project.


Your arguments aren't supported by the facts:

1. It's not tied to V8.

It uses V8 currently, but:

- The vast majority of the code has nothing to do with JS.

- Almost all of the JS-related PDF code is independent of V8.

- The use of V8 is abstracted out sufficiently via IFXJS_Runtime etc. that, with a bit more work, a different JS runtime could easily be integrated (see the sketch at the end of this comment).

2. You note that C/C++ can be sandboxed, but then claim that it "still increases the surface area of the browser".

This is nonsense: Firefox's 11 million lines of code are also written in C/C++, and should be sandboxed. You don't increase the surface area by also sandboxing the PDF plugin -- either the sandboxing mechanism works or it doesn't.

3. If anything, pdf.js's continued poor performance and operation demonstrates why forcing everyone to operate inside of a JS runtime is incredibly harmful to the web platform's progress.
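
To make point 1 concrete, here is roughly what that kind of abstraction looks like. This is an illustrative sketch only, with invented names; it is not pdfium's actual IFXJS_Runtime declaration:

    // Illustrative sketch only -- invented names, not pdfium's real headers.
    // The idea: PDF-level script handling talks to an abstract interface,
    // so the concrete JS engine (V8 today) stays an implementation detail.
    #include <string>

    class IJSRuntime {
     public:
      virtual ~IJSRuntime() {}
      // Compile and execute a script in the document's JS context.
      virtual bool RunScript(const std::wstring& script) = 0;
    };

    class V8Runtime : public IJSRuntime {
     public:
      bool RunScript(const std::wstring& script) override {
        // ... hand the script off to V8 here ...
        return true;
      }
    };

    // A SpiderMonkey (or other engine) backend would just be another
    // subclass; nothing above this interface would need to change.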


> You don't increase the surface area by also sandboxing the PDF plugin -- either the sandboxing mechanism works or it doesn't.

Adding more C++ to the browser certainly does increase the attack surface. Sandboxes that expose enough functionality to run a modern browser engine commonly end up with holes here and there (e.g. the Pwnium vulnerabilities), and it's best not to use them as the only layer of defense; moreover, the sandbox does not fully enforce all of the security properties that the Web platform demands (e.g. cross-origin iframes).

> 3. If anything, pdf.js's continued poor performance and operation demonstrates why forcing everyone to operate inside of a JS runtime is incredibly harmful to the web platform's progress.

As I mentioned before, pdf.js is not as JS bound as you might think. Furthermore, it was written before asm.js existed and doesn't use any asm.js; if pdf.js were JS bound, then asm.js would be a very powerful option to improve performance that would not involve dropping to unsafe native code.


The DRM sandbox is yet another attack surface, isn't it? Wouldn't it make sense to use NaCl for DRM sandboxing? Then the option would be open to use the same sandbox for PDF viewing, and pdf.js could still work, giving users a choice.

Creating yet another sandbox seems silly, and NaCl hasn't been hit by Pwnium; it's only been a stepping stone to the renderer (I'll let comex dive into the details here!)


NaCl is not exactly a stepping stone to the renderer. NaCl modules live outside the renderer process in a much tighter sandbox that uses control flow integrity and software fault isolation. Gaining code execution within the NaCl sandbox (easy since you can just send the user a NaCl module) does not expose the same attack surface as gaining code execution within a renderer process.


Hey, side question: I assume pdfium runs in the NaCl sandbox - how does that work with V8?


Sorry, I don't check HN often. As we discussed on Twitter & IRC:

It's just an OS sandbox currently. pdfium previously worked with NaCl, with a non-V8 JS VM (work done by Bill Budge). V8-on-NaCl used to work; I think it may have bitrotted since then, but it used NaCl's dyncode modify API to do PIC. The GC moves code too, so extra page permissions need to be changed when that happens, but I think that's the extent of the code modification that needs to be handled for a JS JIT to work on NaCl (on top of the sandboxing).


Well, the DRM sandbox has very few exposed APIs, in contrast to the Web sandbox or Pepper.

I don't like the DRM sandbox anyhow; it's unfortunate that DRM was added to the Web, forcing a DRM module at all (speaking for myself, not my employer).


I discussed this with Alon on IRC: you wouldn't use Pepper to do this. NaCl doesn't imply Pepper; you can expose a subset of syscalls into the trusted code base.

I understand the feeling about DRM, but given that sandboxed DRM is going to happen I'd hope that the best efforts possible are put in to make users safe. Good sandboxing seems the right way to go. I'm not any kind of a security expert, but jschuh seems to think the current sandbox isn't sufficient: https://bugzilla.mozilla.org/show_bug.cgi?id=1011491 I hope the right improvements go into tightening the DRM sandbox :)


Well, Chrome could make this viewer not increase the surface area of the browser just by changing it to a NaCl plugin. After all, it already exposes the ability to run native code inside NaCl to the web. ;p

I like the concept of pdf.js, but it's still significantly slower, and thus provides a worse experience to the user, than native viewers.


A large part of why PDF.js is slow is that it doesn't do text coalescing. http://www.NotablePDF.com/ is based on PDF.js and has coalescing code in production, which has improved performance substantially.

It's been a significant effort on our part, and we'll be contributing it back to the PDF.js code base. Opera also has a similar coalescing effort underway, by Christian Krebs.


Would you mind explaining what text coalescing is in this context?



> Well, Chrome could make this viewer not increase the surface area of the browser just by changing it to a NaCl plugin. After all, it already exposes the ability to run native code inside NaCl to the web. ;p

Will V8 run inside NaCl? As I understand it, the NaCl JIT functionality is pretty slow for use cases like polymorphic inline caching.

> I like the concept of pdf.js, but it's still significantly slower, and thus provides a worse experience to the user, than native viewers.

Most of the issues in pdf.js are actually rendering-related, not JavaScript-related—that is, they wouldn't be fixed just by changing the language to native code.


> Will V8 run inside NaCl? As I understand it, the NaCl JIT functionality is pretty slow for use cases like polymorphic inline caching.

Would that matter for PDFs? I thought JS in PDFs was mostly used for form validation, which isn't very compute-heavy.


I don't know; on recent computers, more often than not, I don't really see a difference between pdf.js and the others as a user.

It seems to only be an issue on really heavy PDFs, which are pretty rare.


I've found pdf.js to be a negative user experience on most PDFs I try to view. From what I've read, this is mainly because pdf.js renders directly to a canvas and doesn't store decoded vector information. Other readers seem to be able to zoom instantly, even for vector graphics.
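
For the curious: "storing decoded vector information" is essentially a cached display list. Decode the page's content stream into drawing commands once, then replay them at any zoom. A rough sketch with made-up types (not pdf.js's or any particular viewer's actual architecture):

    // Rough sketch of a display-list cache; all names here are made up.
    #include <memory>
    #include <vector>

    struct Canvas;  // whatever surface gets rasterized onto

    struct DrawOp {
      virtual ~DrawOp() {}
      virtual void Replay(Canvas* canvas, double zoom) const = 0;
    };

    struct DisplayList {
      std::vector<std::unique_ptr<DrawOp>> ops;

      // Parsing/decoding happens once, when the list is built. Zooming
      // just replays the recorded ops at a new scale, which is why
      // display-list viewers can re-render at a new zoom level instantly.
      void Render(Canvas* canvas, double zoom) const {
        for (const auto& op : ops)
          op->Replay(canvas, zoom);
      }
    };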

Firefox also seems to register two separate MIME types for PDF, only giving an option to use pdf.js on one of them. I've yet to dig into Firefox and fix this.


The speed of pdf.js on older computers is abysmal compared to native applications like okular.


Think mobile.


On mobile (Android), both Fx and Chrome just start downloading PDFs.


On mobile, both Firefox and Chrome download PDFs to be rendered by another app on Android. I'm unsure about Chrome on iOS. There is no Firefox for iOS because Apple.


> Well, Chrome could make this viewer not increase the surface area of the browser just by changing it to a NaCl plugin. After all, it already exposes the ability to run native code inside NaCl to the web. ;p

AIUI from the NaCl guys, it already does.


Yeah, but with Chrome, reading PDFs is almost like viewing a normal web page, whereas with pdf.js my i7 jumps to 100% CPU usage.

In any case, I use a native viewer, which provides the best overall user experience.


That's really weird - I use pdf.js all the time, and suffer no such issues.

In particular, I enjoy its superior font rendering (compared to the Chrome implementation). I really don't get why Chrome (on Windows, anyhow) has fairly fuzzy fonts when rendering PDFs - noticeably worse than pdf.js or Acrobat.


The rest of the world seems to have a similar experience to mine; check the threads on this other HN story:

https://news.ycombinator.com/item?id=7716022


It happens to me too on Debian, yet PDF.js works fine on my Nexus 7 (2012) tablet, even though the i7 is much faster than the Tegra 3. It's probably a hardware rendering issue.


A few other people on HN are not “the rest of the world”, particularly when most of the complaints either don't reproduce at all or are significantly less problematic than claimed.

It's dead certain that PDF.js has plenty of room to improve, but that requires solid benchmarking, not anecdata. I would hope Mozilla is collecting telemetry about common bottlenecks from millions of users and triaging to see which problems are core and which are artifacts of local system configuration, graphics drivers, etc.


If you actually read that thread, you'd notice that many people confirm that lots of PDFs render just fine, and that it's specific workloads where pdf.js lags.

Frankly, it's something I'll gladly put up with in most cases just to avoid bad font rendering - and that's exactly what I do.

BTW, I'm pretty sure this kind of stuff is pretty platform- (and GFX-driver-) dependent; e.g. FF on Mac OS performs less well, IIRC.


How about compiling this viewer with Emscripten to asm.js? It would be interesting to see a performance comparison with pdf.js :)


Especially on Firefox. The product would be asm.js…

On the other hand, there might be an initial compilation pause when starting to load the PDF.


Caching of the compiled asm.js code is implemented: https://blog.mozilla.org/luke/2014/01/14/asm-js-aot-compilat...


Yes. And you could leave out support for PDF-embedded JS for this purpose.


Then I would rather have the choice of using this PDF viewer with JavaScript disabled.

Maybe Opera should adopt this.


4. It's not proprietary and doesn't restrict people's ability to freely use their web browser.


What do you mean by "proprietary"? Pdfium is BSD-licensed, and pdf.js uses the Apache 2 license; both are extremely permissive.


Except one of them does not provide patent protection. (Apache 2 includes an explicit patent grant; BSD does not.)


I don't really mind pdf.js being slow (it's not slow enough that I've noticed), but it has some serious issues with printing/fonts (not sure which). Even for "simple" PDFs from LaTeX source (your typical paper or math homework), I've had to print from Adobe Acrobat. (I think evince is also better, but it's been a long while since I've printed from Linux, for entirely unrelated reasons.)


LaTeX documents are typically slow in PDF.js because it positions characters individually. I think it's because LaTeX produces some specific formatting that the PDF spec/apps can't handle well enough, so for text selection PDF.js wraps each character in its own absolutely positioned element. The massive number of DOM elements causes browsers to slow down to 5-6 fps.
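
The merging itself is conceptually simple. Here is the gist as a C++ sketch over invented types (pdf.js's real code is JavaScript operating on absolutely positioned DOM spans):

    // Coalesce per-character runs into longer runs. Assumes the input
    // is already sorted in reading order. Invented types, for illustration.
    #include <cmath>
    #include <string>
    #include <vector>

    struct TextRun {
      double x, y, width;  // baseline position and horizontal advance
      std::string text;
    };

    std::vector<TextRun> Coalesce(const std::vector<TextRun>& runs,
                                  double tolerance) {
      std::vector<TextRun> out;
      for (const TextRun& run : runs) {
        if (!out.empty()) {
          TextRun& prev = out.back();
          bool same_baseline = std::fabs(prev.y - run.y) < tolerance;
          bool adjacent = std::fabs(prev.x + prev.width - run.x) < tolerance;
          if (same_baseline && adjacent) {
            // Extend the previous run instead of emitting a new element.
            prev.text += run.text;
            prev.width += run.width;
            continue;
          }
        }
        out.push_back(run);
      }
      return out;
    }

One element per word or line instead of one per character cuts the DOM node count by an order of magnitude or more.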

I've been working on coalescing the elements and you can see the fruits of my labour by dropping a PDF into https://web.notablepdf.com which is an annotation app based on PDF.js.


Oh, that explains why I thought PDF.js was horrible across the board! If you judged the sources of PDFs solely by the ones I read, you'd think every PDF in the world is either produced by LaTeX or is a restaurant menu.

Are there statistics about what other sorts of PDFs people read? How else are PDFs made? I always assumed that people who use word processors would exchange files in their word processor's format, but maybe some people export to PDF? (I'm not familiar with the habits of word processor users.)


There aren't statistics that I know of, though anecdotal evidence suggests that quite a lot of PDFs have this problem. A lot of people export to PDF because users then can't "edit" the file.

So far, coalescing works, but we've had to make substantial changes in a lot of places. The hard part is ensuring it is still compliant with the PDF specs, which we're working through before we submit it as a patch to the PDF.js team.


Thank you for that explanation! LaTeX docs constitute the vast majority of PDFs that I view, and I've wondered for ages why PDF.js seemed to be so laggy so much of the time.


I think they are pretty open to adding bad PDFs to their test suite if you have any.


Pretty much every math and physics paper I tried rendered with broken fonts and an incorrect layout. It is horribly slow and renders a lot of PDFs incorrectly.


KPartsPlugin with Okular also renders PDFs in Firefox faster than pdf.js does. I'd expect all native plugins to be faster (if they are well written).


99% of the PDFs I read work fine with the pdf.js reader, and I feel a bit safer using it than a binary reader.


Agreed. pdf.js has come quite a long way from its beta.

In most situations I don't feel the need for a native reader anymore.


It's really interesting to see that there are Foxit employees on the list of committers. I assume that means it was initially a fork of the Foxit PDF reader?



What about this one?

https://pdfium.googlesource.com/pdfium/+/master/core/src/fxg...

"/\ * * Copyright (c) 1998-2000, Microsoft Corp. All Rights Reserved. * * Module Name: * * Gdiplus.h * * Abstract: * * GDI+ Native C++ public header file * \/ "


Part of the Windows SDK.


Correct me if I am wrong.

I saw no evidence in the project showing that this file has a BSD-style license. And since it is part of the Windows SDK, it is nearly impossible for it to be BSD-licensed.

Maybe it was included from Foxit's code or another codebase, but it would be better to put it into the third_party directory due to the licensing issue.

Edited:

Thanks for pointing out my misconception about the Windows SDK.


The SDK has a very permissive license (including redistribution), since it's obviously designed to be used, and it's convenient for developers to be able to redistribute parts of it to ease development.

Agreed that it is always nice to have these things in a third-party directory, though; the larger Chromium project actually does appear to have all of pdfium in third_party, which helps keep that clear.


Foxit isn't open-source, so I don't think it would be legal to release a derivative under the BSD license (IANAL).


It's perfectly legal for Foxit to license their own software to Google to be released under the terms of the BSD license. This would be with Foxit's consent, of course.


Right, and probably some money changing hands.


It's important to understand that a license isn't on the code itself, but merely on a copy of the code. A copyright owner retains full rights to do whatever the hell they want with their code; open-source projects just release a version under a specific set of terms. It gets more interesting with contributors, and with specifying a license for the contributed code.


I used the term 'initially' for a reason. My guess would be that the Chrome team used the Foxit SDK to get it out the door quickly, then, over time and with Foxit's guidance, replaced parts with their homegrown source. That's just a guess, though.


Why did they close down something like Google Reader and not Google Code Hosting? Does anyone actually use it?

I wish they would either make Google Code decent or simply kill it and use GitHub instead.

Is this a new implementation? Or did Foxit release it as open source?


What's wrong with Google Code? I'm not saying it is decent in any way, but it is usable. The repository hosting is there, the wiki is there, some sort of issue tracking is there. And if I were a Google employee, I'd rather use my company's code hosting than someone else's, because what if that lot gets interested in founding a tea shop?


Even CodePlex is better than Google Code. I really hope they either make the switch or improve it.


Google code is heavily used internally at Google.


Not everything should move to GitHub (although I prefer GitHub too).


I know. I don't want everything to be on GitHub either; it could even be BitBucket. The point is that Google Code is so terrible that I refuse to even look at it for more than a minute.


It shows that the original code was Foxit's, so does that mean that Foxit released the code, or did Google purchase the rights and then decide to open-source it? Anyway, why would they host it on Google Code?


That is the point. If Foxit opened it up, then they deserve some credit. Otherwise the web is going to cover this as just another Google PR piece without acknowledging their work.


I thought Chromium's PDF engine was based on Foxit; did they change that?

EDIT: I see there are Foxit employees in the commit list. Well, that explains that!

Anyway this is great news. Kudos Google.

By the way, for those confused: the source is not in SVN, as Google Code fails to communicate; it's at https://pdfium.googlesource.com/.


This is great because it is now the best open-source PDF rendering library. Ghostscript, Poppler, Xpdf, pdf.js -- they all sort of work all right, but are pathetic compared to Foxit, on which this source code is based. What we now have is a high-performance, highly compliant, clean codebase of C++ PDF rendering code. Excellent news. Expect lots of future PDF innovations to be based on this.


I've had by far the best performance with Sumatra, which is open source but unfortunately Windows-only. I try Chrome's PDF reader once every few months, hoping it'll improve, but I always disable it when I see it's still disappointingly sluggish.


Sumatra just uses muPDF, FWIW.



I have used ghostscript consistently for more than ten years, with no serious issues. I use it to read/print PDFs/PSs from arxiv.org. For the last six years I have also used Sumatra. It's faster, but I prefer the mupdf reader.


It's interesting that this came, seemingly out of the blue, a little after it was made widely known (from the mozhacks article [1]) that Opera developers were working towards integrating pdf.js into their Chromium fork.

[1] https://hacks.mozilla.org/2014/05/how-fast-is-pdf-js/


And Opera announced their move to the Chromium Content API and WebKit around a fortnight before Blink was announced.

And note that this isn't the first time there's been Opera interest in pdf.js: I spent the majority of summer 2012 working on trying to get pdf.js running well in Presto, as was relatively well known among pdf.js contributors.


It looks like they did it in a hurry, the day after https://hacks.mozilla.org/2014/05/how-fast-is-pdf-js/ was published. Competition FTW!


In case the authors are lurking here, what are the main differences between this and poppler?


While the main poppler developers, who IIRC are three guys from Spain, have made a heroic effort, poppler is really not that good. Poppler was created by ripping code out of Xpdf and making it into a library.

If you look at the code, it is not really well architected. Here is a file I found a problem in: http://cgit.freedesktop.org/poppler/poppler/tree/poppler/Tex... Take a look at that file and judge for yourself whether it follows "Code Complete"-type suggestions.

One reason I looked in that file is that poppler does not deal well at all with many map PDFs, like http://web.mta.info/nyct/maps/busqns.pdf or some others I have on my hard drive. They take forever to load.

Some PDFs have caused applications using poppler to crash, although some of those have been patched. It's not as bad as it used to be, but still. My patch to speed up the bus-map PDFs was not accepted. Then there are features like being able to enter data into PDFs and such. Compare and contrast Adobe's official Acrobat app for Linux with a poppler-based PDF reader like evince.

So the answer is a standard one: code architecture, bugs, and features. The test would be to take the PDFs that Adobe Acrobat handles but poppler doesn't, in terms of bugs and features, and see how pdfium handles them.

Of course, it's possible pdfium will handle those but fail on an entirely different class of PDFs, with its own pdfium-specific bugs.

The PDF standard is a fairly large one. What features does pdfium handle that poppler doesn't? What percentage of PDFs crash the viewing application or don't render correctly, compared to poppler? And so forth.

I should also add that poppler usually depends on cairo for vector graphics, so once in a while the failure for a PDF is in cairo, not poppler. I have seen some of those fixed, some not.


I compared the speed of plain-text extraction with xpdf, poppler, and mupdf on 100k PDFs. mupdf is the fastest in 95% of cases, then comes xpdf, and then poppler (the latter two crashed on a few files). The SumatraPDF viewer went from poppler-only, to two engines (poppler & mupdf), to mupdf+patches. At the moment, in my experience, SumatraPDF has the fastest and most reliable open-source PDF engine. So it will be interesting to see how this Chrome PDF engine (based on Foxit?) performs outside of Chrome as a standalone library/command-line tool.


For us, that would be the license. BSD is far more palatable than GPL (LGPL would be fine) when we have to embed a PDF viewer in a client's project.


I'm also curious how it compares to mupdf.


As a developer, I see it posing many problems since it is in C++. MuPDF is in C and plays nice with many languages. Even golang.


I believe this is the source of the PPAPI plugin, not something built into Chrome.

Anyway, this is great news for Chromium, as the PDF plugin can now be shipped in distro repos.


Chrome's PDF Viewer is a plugin. You can see it by opening chrome://plugins/.


For someone using PDF.js (which works great on both Chrome and Firefox) in my company's enterprise app: does this matter much?


Unlikely, given that this is native code and pdf.js is web-app-friendly JS.


Sadly, there are no tests and no documentation (except the documentation from Foxit). Not even source-code documentation.


So it seems the SDK has the basic plumbing to make a command-line tool out of it: https://pdfium.googlesource.com/pdfium/+/master/fpdfsdk/src/...

Anyone interested?
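
For example, a minimal page-info tool on top of the public FPDF_* API would look something like this. This is a sketch based on my reading of fpdfview.h; the exact signatures may differ between pdfium revisions:

    // Minimal pdfium command-line sketch: print the page count and page
    // sizes of a PDF, using the public FPDF_* API from fpdfview.h.
    #include <cstdio>
    #include "fpdfview.h"

    int main(int argc, char* argv[]) {
      if (argc != 2) {
        std::fprintf(stderr, "usage: %s file.pdf\n", argv[0]);
        return 1;
      }
      FPDF_InitLibrary();
      FPDF_DOCUMENT doc = FPDF_LoadDocument(argv[1], /*password=*/NULL);
      if (!doc) {
        std::fprintf(stderr, "failed to load %s\n", argv[1]);
        FPDF_DestroyLibrary();
        return 1;
      }
      int pages = FPDF_GetPageCount(doc);
      std::printf("%d page(s)\n", pages);
      for (int i = 0; i < pages; ++i) {
        FPDF_PAGE page = FPDF_LoadPage(doc, i);
        if (page) {
          std::printf("page %d: %.1f x %.1f pts\n", i + 1,
                      FPDF_GetPageWidth(page), FPDF_GetPageHeight(page));
          FPDF_ClosePage(page);
        }
      }
      FPDF_CloseDocument(doc);
      FPDF_DestroyLibrary();
      return 0;
    }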


Just glanced at the source code, and isn't it bad to “#include "../../../sth.h"”? Wouldn't it be better to set the include path while compiling and just “#include "sth.h"”?


Not really; having an explicit path seems more useful and clearer to me. It's easier to read the code and find the file, and things like "gf" in vim will definitely work on it to open it up, too.
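
For comparison, the two styles side by side (the paths here are just examples):

    // Style under discussion: the relative path is explicit in the source.
    #include "../../../core/include/fpdfapi/fpdf_parser.h"

    // Alternative: add the repo root to the include path (a -I flag, or an
    // include_dirs entry in the .gyp file) and write paths from the root.
    #include "core/include/fpdfapi/fpdf_parser.h"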


[deleted]


Read the description of the project. It's hosted here:

https://pdfium.googlesource.com/


Except there is no description there... Not even a README.

Nothing in the wiki either https://code.google.com/p/pdfium/w/list


Click on the OP's link and use your browser to find "PDFium is an open-source PDF rendering engine." Read the next line.


In typical corporate code-dump style, there is no README and no clear instructions on how to build or what form the output takes. I installed gyp to try it and got a variety of errors depending on what I tried (the furthest I got was complaints about v8.gyp being missing; does this have to be built within the Chromium source tree?). Does any Google insider want to explain their internal build practices so a mere mortal can try to compile this code?


Per https://code.google.com/p/pdfium/issues/detail?id=1 : "Looks like the standalone build system is not yet present."


So, apparently somebody still uses Google Code.

edit: ...just not for the actual code.


Seriously, releasing on Google Code Project Hosting instead of GitHub? Even CodePlex is better than that.


Isn't it annoying to see code.google.com as the medium for sharing code? I've become so used to GitHub that Google Code seems like a relic of the 20th century.



