PDFium: Chrome’s PDF rendering engine is now open-source

atesti · on May 22, 2014

I found it interesting that it seems to use Antigrain by Maxim Shemanarev in https://pdfium.googlesource.com/pdfium/+/master/core/src/fxg... (Chrome uses Skia). Unfortunately the author of Antigrain died: http://beta.slashdot.org/submission/3154635/rip-maxim-sheman... It's nice to see the fascinating Antigrain code to be used for PDF viewing every day!

dman · on May 22, 2014

Did not know about this. As an agg user this made me sad.

hrjet · on May 22, 2014

Nice to hear that AGG lives on. I remember exploring it in around 2002 and being mighty impressed with it.

reedlaw · on May 22, 2014

This is good news because it renders PDFs a lot faster and better than pdf.js that Firefox uses. Also, I would have to install this binary blob to get Chromium to render PDFs. It seems Chromium could easily adopt this, but I'm not sure about Firefox.

azakai · on May 22, 2014

Well, it wouldn't make sense for Firefox to adopt this:

1. It is tied to v8 (PDFs can run JS, this PDF viewer uses v8 do so - see CJS_Context::RunScript etc), so it would mean bundling 2 JS engines, with all the security downsides of that.

2. This is written in C++. You can sandbox C++ in various ways, but that would still increase the surface area of the browser, compared to pdf.js which only uses things normal web content would use.

3. pdf.js is not just meant to render pdfs, it's also a useful project to push forward the web platform. Areas where pdf.js was slow turned out to be things that were worth optimizing anyhow. This doesn't benefit people viewing pdfs directly, of course, but it's still an interesting aspect of the project.

teacup50 · on May 22, 2014

Your arguments aren't supported by the facts:

1. It's not tied to V8.

It uses V8 currently, but:

- The vast majority of the code has nothing to do with JS.

- Almost all of the JS-related PDF code is independent of V8.

- The use of V8 is abstracted out sufficiently via IFXJS_Runtime etc that with a bit more work, a different JS runtime could be easily integrated.

2. You note that C/C++ can be sandboxed, but then claim that it "still increases the surface area of the browser".

This is nonsense: Firefox's 11 million lines of code are also written in C/C++, and should be sandboxed. You don't increase the surface area by also sandboxing the PDF plugin -- either the sandboxing mechanism works or it doesn't.

3. If anything, pdf.js's continued poor performance and operation demonstrates why forcing everyone to operate inside of a JS runtime is incredibly harmful to the web platform's progress.

pcwalton · on May 22, 2014

> You don't increase the surface area by also sandboxing the PDF plugin -- either the sandboxing mechanism works or it doesn't.

Adding more C++ to the browser certainly does increase the attack area. Sandboxes that expose enough functionality to run a modern browser engine commonly end up with holes here and there (e.g. the Pwnium vulnerabilities) and it's best to not use them as the only layer of defense; moreover, the sandbox does not fully enforce all of the security properties that the Web platform demands (e.g. cross-origin iframes).

> 3. If anything, pdf.js's continued poor performance and operation demonstrates why forcing everyone to operate inside of a JS runtime is incredibly harmful to the web platform's progress.

As I mentioned before, pdf.js is not as JS bound as you might think. Furthermore, it was written before asm.js existed and doesn't use any asm.js; if pdf.js were JS bound, then asm.js would be a very powerful option to improve performance that would not involve dropping to unsafe native code.

jfbastien · on May 22, 2014

The DRM sandbox is yet another attack surface, isn't it? Wouldn't it make sense to use NaCl for DRM sandboxing, and then the option is open to use the same sandbox for PDF viewing, and pdf.js can still work, giving users choice.

Creating yet another sandbox seems silly, and NaCl hasn't been hit by pwnium, it's only been a stepping stone to the renderer (I'll let comex dive into details here!)

chrisrohlf · on May 23, 2014

NaCl is not exactly a stepping stone to the renderer. NaCl modules live outside the renderer process in a much tighter sandbox that uses control flow integrity and software fault isolation. Gaining code execution within the NaCl sandbox (easy since you can just send the user a NaCl module) does not expose the same attack surface as gaining code execution within a renderer process.

azakai · on May 22, 2014

Hey, side question: I assume pdfium runs in the NaCl sandbox - how does that work with v8?

jfbastien · on May 23, 2014

Sorry, I don't check HN often. As we discussed on Twitter & IRC:

It's just an OS sandbox currently. pdfium previously worked with NaCl, with a non-V8 JS VM (work done by Bill Budge). V8-on-NaCl used to work, I think it may have bitrotted since then, but it used NaCl's dyncode modify API to do PIC. The GC moves code too, so extra page permissions need to be changed when that's done, but I think that's the extent of code modification that needs to be handled for a JS JIT to work on NaCl (on top of the sandboxing).

pcwalton · on May 22, 2014

Well, the DRM sandbox has very few exposed APIs, in contrast to the Web sandbox or Pepper.

I don't like the DRM sandbox anyhow; it's unfortunate that DRM was added to the Web, forcing a DRM module at all (speaking for myself, not my employer).

jfbastien · on May 23, 2014

I discussed this with Alon on IRC: you wouldn't use pepper to do this. NaCl doesn't imply pepper, you can expose a subset of syscalls into the trusted code base.

I understand the feeling about DRM, but given that sandboxed DRM is going to happen I'd hope that the best efforts possible are put in to make users safe. Good sandboxing seems the right way to go. I'm not any kind of a security expert, but jschuh seems to think the current sandbox isn't sufficient: https://bugzilla.mozilla.org/show_bug.cgi?id=1011491 I hope the right improvements go into tightening the DRM sandbox :)

comex · on May 22, 2014

Well, Chrome could make this viewer not increase the surface area of the browser just by changing it to a NaCl plugin. After all, it already exposes the ability to run native code inside NaCl to the web. ;p

I like the concept of pdf.js, but it's still significantly slower, and thus provides a worse experience to the user, than native viewers.

Hengjie · on May 22, 2014

A large part of why PDF.js is slow is because it doesn't have text coalescing. http://www.NotablePDF.com/ is based on PDF.js and has a coalescing code in production which has improved performance substantially.

It's been a significant effort on our part, and we'll be contributing it back to the PDFjs code base. Opera also has a similar coalescing effort underway by Christian Krebs.

garblegarble · on May 22, 2014

Would you mind explaining what text coalescing is in this context?

ehsanu1 · on May 23, 2014

Hengjie explains here: https://news.ycombinator.com/item?id=7783622

pcwalton · on May 22, 2014

> Well, Chrome could make this viewer not increase the surface area of the browser just by changing it to a NaCl plugin. After all, it already exposes the ability to run native code inside NaCl to the web. ;p

Will V8 run inside NaCl? As I understand it, the NaCl JIT functionality is pretty slow for use cases like polymorphic inline caching.

> I like the concept of pdf.js, but it's still significantly slower, and thus provides a worse experience to the user, than native viewers.

Most of the issues in pdf.js are actually rendering-related, not JavaScript-related—that is, they wouldn't be fixed just by changing the language to native code.

bla2 · on May 22, 2014

> Will V8 run inside NaCl? As I understand it, the NaCl JIT functionality is pretty slow for use cases like polymorphic inline caching.

Would that matter for PDFs? I thought js in PDFs is mostly used for form validation, which isn't very compute-heavy.

zobzu · on May 22, 2014

i dont know, on recent computers, more often than not i dont really see a diff between pdf.js and others as a user.

it seems to only be an issue on really heavy pdfs, which are pretty rare

keeperofdakeys · on May 22, 2014

I've found pdf.js to be a negative user experience on most pdfs I try to view. From what I've read, this is mainly because pdf.js directly renders to a canvas, and doesn't store decoded vector information. Other readers seem to be able to zoom instantly, even for vector graphics.

Firefox also seems to register two separate mime-types for pdf, only giving an option to use pdf.js on one of them. I've yet to dig into firefox and fix this.

orbifold · on May 22, 2014

The speed of pdf.js on older computers is abysmal compared to native applications like okular.

Jyaif · on May 22, 2014

Think mobile.

Ygg2 · on May 22, 2014

On mobile (Android) both Fx and Chrome start downloading pdfs.

JohnTHaller · on May 23, 2014

On mobile, both Firefox and Chrome download PDFs to be rendered by another app on Android. I'm unsure about Chrome on iOS. There is no Firefox for iOS because Apple.

gsnedders · on May 22, 2014

> Well, Chrome could make this viewer not increase the surface area of the browser just by changing it to a NaCl plugin. After all, it already exposes the ability to run native code inside NaCl to the web. ;p

AIUI from the NaCl guys, it already does.

pjmlp · on May 22, 2014

Yeah, but with Chrome reading PDFs is almost like seeing a normal web page, whereas with pdf.js my i7 jumps to 100% CPU usage.

In any case, I use a native viewer, which provides the best overall user experience.

emn13 · on May 22, 2014

That's really weird - I use pdf.js all the time, and suffer no such issues.

In particular, I enjoy it's superior font rendering (compared to the chrome implementation). I really don't get why chrome (on windows anyhow) has fairly fuzzy fonts while rendering pdf - noticably worse than pdf.js or acrobat.

pjmlp · on May 22, 2014

The rest of the world seems to have similar experience with mine, if you check the threads at this other HN story:

https://news.ycombinator.com/item?id=7716022

icebraining · on May 22, 2014

It happens to me too on Debian, yet PDF.js works fine on my Nexus 7 (2012) table, even though the i7 is much faster than the Tegra 3. It's probably an hardware rendering issue.

acdha · on May 23, 2014

A few other people on HN is not “the rest of the world”, particularly when most of the complaints either don't reproduce at all or are significantly less problematic than claimed.

It's dead certain that PDF.js has plenty of room to improve but that requires solid benchmarking, not anecdata. I would hope Mozilla is collecting telemetry data about common bottlenecks from millions of users and triaging to see which problems are core or artifacts from local system configuration, graphics drivers, etc.

emn13 · on May 26, 2014

If you actually read that thread, you'd notice that many people confirm that lots of pdf's render just fine, and that it's specific workloads where pdf.js lags.

Frankly, it's something I'll gladly put up with in most cases just to avoid bad font rendering - and that's exactly what I do.

BTW, I'm pretty sure that this kind of stuff if pretty platform (and GFX-driver) dependent - and e.g. FF on mac os performs less well IIRC.

v413 · on May 22, 2014

How about converting this viewer through Emscripten using asm.js. Would be interesting to see performance comparison with pdf.js :)

espadrine · on May 22, 2014

Especially on Firefox. The product would be asm.js…

On the other hand, there might be an initial compilation pause when starting to load the PDF.

pcwalton · on May 22, 2014

Caching of the compiled asm.js code is implemented: https://blog.mozilla.org/luke/2014/01/14/asm-js-aot-compilat...

fulafel · on May 22, 2014

Yes. And you could leave out support for PDF-embedded JS for this purpouse.

ksec · on May 22, 2014

Then I would rather I have the choose of using PDF with this and Javascript disabled.

May be Opera should adopt this.

yarrel · on May 22, 2014

4. It's not proprietary and doesn't restrict people's ability to freely use their web browser.

teraflop · on May 22, 2014

What do you mean by "proprietary"? Pdfium is BSD-licensed, and pdf.js uses the Apache 2 license; both are extremely permissive.

async5 · on May 22, 2014

Except one does not provide protection from the patents.

e12e · on May 22, 2014

I don't really mind pdfjs being slow (it's not really slow enough that I've noticed) -- but it has some serious issues with printing/fonts (not sure quite which) -- even for "simple" pdfs from latex source (your typical paper or math homework) I've had to print from Adobe Acrobat (and, I think evince is also better -- but it's been a long while since I've printed from Linux -- for entirely unrelated reasons).

Hengjie · on May 22, 2014

Latex documents are typically slow on PDFjs because it positions characters individually. I think it's because Latex has some specific formatting that the PDF spec/apps can't handle well enough. So for text selection PDFjs wraps each character with it's own absolutely positioned element. The massive number of DOM elements cause browsers to slow down to 5-6fps.

I've been working on coalescing the elements and you can see the fruits of my labour by dropping a PDF into https://web.notablepdf.com which is an annotation app based on PDF.js.

omaranto · on May 22, 2014

Oh, that explains why I thought PDF.js is horrible across the board! If you judged the source of PDF's solely by the ones I read, you'd think every PDF in the world is either produced by LaTeX or is a restaurant menu.

Are there statistics about what other sorts of PDF's people read? How else are PDF's made? I always assumed that people that use word processors would exchange files in their word processor's format, but maybe some people export to PDF? (I'm not familiar with the habits of word processor users.)

Hengjie · on May 23, 2014

There aren't statistics that I know of. Though anecdotal evidence suggests that there's quite a lot of PDFs with this problem. There are a lot of people whom export to PDF because users then can't "edit" the file.

So far, coalescing works, but we've had to make substantial changes to a lot of places. The hard part is still ensuring it is compliant with PDF specs, which we're in the process of working through before we submit it as a patch to the PDF.js team.

Steuard · on May 22, 2014

Thank you for that explanation! LaTeX docs constitute the vast majority of PDFs that I view, and I've wondered for ages why PDF.js seemed to be so laggy so much of the time.

stuaxo · on May 22, 2014

I think they are pretty open to adding bad pdfs to their test suite if you have any.

orbifold · on May 22, 2014

Pretty much any math and physics paper I tried was rendered with broken fonts and incorrect layout. It is horribly slow and renders a lot of pdfs incorrectly.

shmerl · on May 22, 2014

Kpartsplugin with Okular also render PDFs faster in Firefox than pdf.js. I'd expect that all native plugins should be faster (if they are well written).

TwoBit · on May 22, 2014

99% of the PDFs I read work fine with the pdf.js reader, and I feel a bit more safe using it than a binary reader.

oblio · on May 22, 2014

Agreed. pdf.js has come quite a long way from its beta.

In most situations I don't feel the need for a native reader, anymore.

yincrash · on May 22, 2014

It's really interesting to see that there are foxit employees on the list of committers. I assume that means that it was initially a fork of the foxit PDF reader?

hidamon · on May 22, 2014

https://pdfium.googlesource.com/pdfium/+/master/fpdfsdk/incl...

chilledheart · on May 22, 2014

What about this one ?

https://pdfium.googlesource.com/pdfium/+/master/core/src/fxg...

gcp · on May 22, 2014

Part of the Windows SDK.

chilledheart · on May 22, 2014

Correct me if I am wrong.

I saw no evidence in the project showing that this file has BSD-style license.And since it is part of the windows SDK, it is nearly impossible to be BSD-style licensed.

Maybe it was included from foxit's code or other codebase but it is better to be put into the thirdparties directory due to the license issue.

Edited:

Thanks for pointing out my misconception about Windows SDK.

magicalist · on May 22, 2014

The SDK has a very permissive license (including redistribution), since it's obviously designed to be used, and it's convenient to developers to be able to redistribute parts of it to ease development.

Agreed it is always nice to have these things in a thirdparty directory, though, but the larger Chromium project actually does appear to have all of pdfium in third_party, which helps keep that clear.

barbs · on May 22, 2014

Foxit isn't open-source, so I don't think it would be legal to release a derivative under the BSD license (IANAL).

awda · on May 22, 2014

It's perfectly legal for Foxit to license their own software to Google to be released under the terms of the BSD license. This would be with Foxit's consent, of course.

tfinniga · on May 22, 2014

Right, and probably some money changing hands.

keeperofdakeys · on May 22, 2014

It's important to understand that a license isn't on the code itself, but merely on a copy of the code. A copyright owner reserves full rights to whatever the hell they want to do with their code; open-source projects just release a version with a specific set of restrictions. It gets more interesting with contributors, and specifying a license for the contributed code.

yincrash · on May 22, 2014

I used the term 'initially' for a reason. My guess would be that the Chrome team used the Foxit SDK to get it out the door quickly, then over time and with Foxit guidance, replaced parts with their home grown source. That's just a guess, though.

ksec · on May 22, 2014

Why did they close down something like Google Reader and Not Google Code Hosting? Do anyone actually use it?

I wish they could either make Google Code decent or simply kill it and use GitHub instead.

Is this a new implementation? Of did Foxit release it as Open Source?

_pfxa · on May 22, 2014

What's wrong with Google Code? I'm not telling that it is decent in any way, but it is usable. The repository hosting is there, the wiki is there, some sort of issue-tracking is there. And, if I were a Google employee, I'd rather use my companies code hosting than some others, because what if those lot get interested in founding a tea-shop?

gosukiwi · on May 22, 2014

Even codeplex is better than google code. I really hope they either make the switch or improve it.

hendzen · on May 22, 2014

Google code is heavily used internally at Google.

NicoJuicy · on May 22, 2014

Not everything should move to GitHub (although i prefer GitHub also)

ksec · on May 22, 2014

I know. I dont want everything to be in GitHub either. It could even be BitBucket. The point is Google Code is so terrible that i refuse to even look at it for more then a minutes.

iamsalman · on May 22, 2014

It shows that the original code was Foxit's so would that mean that Foxit released the code or did Google purchase code rights and then decided to open source? Anyways, why would they host it on Google code?

ksec · on May 22, 2014

That is the point. If Foxit opened it up then they deserve some credit. Otherwise the web is going to publish this as another Google PR without acknowledging their work.

scrollaway · on May 22, 2014

I thought Chromium's PDF engine was based on Foxit; did they change that?

EDIT: I see there are Foxit employees in the commit list. Well, that explains that!

Anyway this is great news. Kudos Google.

By the way, for those confused, the source is not on svn like Google Code fails to communicate but on https://pdfium.googlesource.com/.

zx2c4 · on May 22, 2014

This is great because it is now the best open-source PDF rendering library. GhostScript, Poppler, XPdf, pdf.js -- they all sort of work alright, but are pathetic compared to FoxIt, on which this source code is based. What we now have with this source is a high performance highly compliant clean codebase of C++ PDF rendering code. Excellent news. Expect lots of future PDF innovations to be based on this.

TillE · on May 22, 2014

I've had by far the best performance with Sumatra, which is open source but unfortunately Windows-only. I try Chrome's PDF reader once every few months, hoping it'll improve, but I always disable it when I see it's still disappointingly sluggish.

gsnedders · on May 22, 2014

Sumatra just uses muPDF, FWIW.

johnx123-up · on May 28, 2014

Not really true. https://news.ycombinator.com/item?id=7788074

fithisux · on May 23, 2014

I use ghostscript consistently more than ten years. No serious issues. I use it to read/print pdfs/pss from arxiv org. Last six years I also use sumatra. Faster but I prefer mupdf reader.

ferongr · on May 22, 2014

It's interesting that this came, seemingly out of the blue, a little after it was made widely known (from the mozhacks article [1]) that Opera developers were working towards integrating pdf.js into their Chromium fork.

[1] https://hacks.mozilla.org/2014/05/how-fast-is-pdf-js/

gsnedders · on May 22, 2014

And Opera announced their move to the Chromium Content API and WebKit around a fortnight before Blink was announced.

And note that isn't the first time there's been Opera interest in pdf.js — I spent the majority of summer 2012 working on trying to get pdf.js running well in Presto, as was relatively well known around pdf.js contributors.

async5 · on May 22, 2014

It looks like they did it in a hurry (on the next day) just after https://hacks.mozilla.org/2014/05/how-fast-is-pdf-js/ was published? Competition FTW!

e98cuenc · on May 22, 2014

In case the authors are lurking here, what are the main differences between this and poppler?

Ologn · on May 22, 2014

While the main poppler developers, who IIRC are three guys from Spain, have made a heroic effort, poppler is really not that good. Poppler was created by ripping out code from Xpdf and making it into a library.

If you look at the code, it is not really well architected. Here is a file I found a problem in - http://cgit.freedesktop.org/poppler/poppler/tree/poppler/Tex... . Take a look at that file and judge for yourself if it follows "Code Complete" type suggestions.

One reason I looked in that file is poppler does not deal well at all with many map PDFs like http://web.mta.info/nyct/maps/busqns.pdf or some others I have on my hard drive. They take forever to load.

Some PDFs have caused the applications using poppler to crash, although some of those have been patched. It's not as bad as it used to be, but still. My patch to speed up the bus map PDFs was not accepted. Then there are features like being able to enter data into PDFs and such. Compare and contrast Adobe's official Acrobat app for Linux and a PDF reader based on poppler like evince.

So the answer is a standard one - code architecture, bugs and features. The answer would be to take the PDFs that Adobe Acrobat handles but which poppler doesn't in terms of bugs and features, and see how pdfium handles them.

Of course, it's possible pdfium will handle those but fail on an entirely different class of PDFs and their pdfium specific bugs.

The PDF standard is a fairly large one. What features does pdfium handle which poppler doesn't? What percentages of PDFs crash the viewing application, or don't render correctly compared to poppler? And so forth.

I should also add that poppler usually depends on cairo for vector graphics. So once in a while the fail for a pdf is on cairo, not poppler. I have seen some of those fixed, some not.

frik · on May 23, 2014

I compared the speed (plain text extraction) of xpdf, poppler and mupdf on 100k PDFs. mupdf is in 95% cases the fastest, then comes xpdf and then poppler (the latter two crashed on a few files). SumatraPDF viewer went from poppler-only to two engines (poppler & mupdf) to mupdf+patches. At the moment from my experience SumatraPDF has the fastest and most reliable PDF engine, that is open source. So it will be interesting how this Chrome PDF open source engine (based on Foxit?) performs outside of Chrome as standalone library/commandline tool.

lvillani · on May 22, 2014

For us, that would be the license. BSD is far more palatable than GPL (LGPL would be fine) when we have to embed a PDF viewer in a client's project.

chuckledog · on May 22, 2014

Also curious how it compares to mupdf?

fithisux · on May 22, 2014

As a developer I see it posing many problems since it is in C++. Mupdfs is in C and plays nice with many languages. Even golang.

wooptoo · on May 22, 2014

I believe this is the source of the PPAPI plugin, and not something built into Chrome.

Anyway, this is great news for Chromium, as the PDF plugin can now be shipped to distro repos.

dchest · on May 22, 2014

Chrome's PDF Viewer is a plugin. You can see it by opening chrome://plugins/.

nppc · on May 22, 2014

For some one using PDF.js (which works great both on Chrome & Firefox) for my company's enterprise app - does this matter much ?

emn13 · on May 22, 2014

Unlikely, given that this is native code, and pdf.js is web-app friendly js.

steipete · on May 22, 2014

Sadly no tests, no documentation (except the documentation from FoxIt). Not even source code documentation.

MrBuddyCasino · on May 23, 2014

So it seems the SDK has the basic plumbing to make a commandline tool out of it: https://pdfium.googlesource.com/pdfium/+/master/fpdfsdk/src/...

Anyone interested?

_pfxa · on May 22, 2014

Just glanced the source code, and, isn't it bad to “#include "../../../sth.h"”? Wouldn't it be better to set the include path while compiling and just “#include "sth.h"”?

frabcus · on May 22, 2014

Not really, having an explicit path seems more useful and clearer to me. Easier to read the code and find the file. Things like "gf" in vim will definitely work on it to open it up too.

on May 22, 2014

[deleted]

crazysim · on May 22, 2014

Read the description of the project. It's hosted here:

https://pdfium.googlesource.com/

JBiserkov · on May 22, 2014

Except there is no description there... Not even a README.

Nothing in the wiki either https://code.google.com/p/pdfium/w/list

crazysim · on May 22, 2014

Click on the OP's link and use your browser to find "PDFium is an open-source PDF rendering engine." Read the next line.

gue5t · on May 22, 2014

In typical corporate code-dump style, no README and no clear instructions on how to build or what form the output takes. I installed gyp to try it and get a variety of errors depending on what I try (the furthest I got was complaints about v8.gyp being missing; does this have to build within the Chromium source tree?). Does any Google insider want to explain their internal build practices so a mere mortal can try to compile this code?

async5 · on May 22, 2014

Per https://code.google.com/p/pdfium/issues/detail?id=1 : "Looks like the standalone build system is not yet present."

runn1ng · on May 22, 2014

So, apparently somebody still uses Google Code.¨

edit: .... just not for the actual code.

ahmett · on May 22, 2014

Seriously, releasing on Google Code Project Hosting instead of GitHub? Even CodePlex is better than that.

rushi_agrawal · on May 22, 2014

Isn't it annoying to see code.google.com as the medium of sharing code? I've got so used to Github that google code seems like an old 20th century thing..