
Show HN: Pdf-Bot, an API/CLI for Generating PDFs Using Headless Chrome - esbenp
https://github.com/esbenp/pdf-bot
======
umpox
Pretty cool! Check out Google's headless browser API Puppeteer too, they
provide a few really useful functions for doing stuff like this.

[https://github.com/GoogleChrome/puppeteer/blob/master/exampl...](https://github.com/GoogleChrome/puppeteer/blob/master/examples/pdf.js)

Really easy to work with, I used it to build a simple CLI to generate device
screenshots of a webpage by modifying the user-agent and resolution to match
each device.

[https://github.com/umpox/generateDeviceScreenshots](https://github.com/umpox/generateDeviceScreenshots)
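
A rough sketch of the same idea (one screenshot per device, varying the user agent and resolution). It does not use Puppeteer; it only builds command lines for headless Chrome's `--screenshot`, `--window-size` and `--user-agent` switches. The `google-chrome` binary name and the device profiles are assumptions made up for illustration, not values from the linked repository.

```python
DEVICES = {
    # Illustrative values only, not taken from any real device database.
    "phone": {"width": 375, "height": 667, "user_agent": "ExamplePhone/1.0"},
    "tablet": {"width": 768, "height": 1024, "user_agent": "ExampleTablet/1.0"},
}


def screenshot_command(url, device, chrome="google-chrome"):
    """Build the argv for one headless-Chrome screenshot of `url`.

    `chrome` is an assumed binary name; it differs between platforms.
    """
    profile = DEVICES[device]
    return [
        chrome,
        "--headless",
        f"--screenshot={device}.png",
        f"--window-size={profile['width']},{profile['height']}",
        f"--user-agent={profile['user_agent']}",
        url,
    ]


for name in DEVICES:
    # Print the commands instead of running them, so no browser is needed.
    print(" ".join(screenshot_command("https://example.com", name)))
```

Each list could be handed to `subprocess.run` on a machine where Chrome is installed.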

~~~
esbenp
Yeah I think Puppeteer is a very cool project. Unfortunately, it came out
literally one or two days after I finished the initial version of pdf-bot.
Maybe I will incorporate it soon! :-)

------
bharani_m
Really nice work.

I've had a great experience working with Headless Chrome to convert webpages
to PDF for my side project EmailThis
([https://www.emailthis.me](https://www.emailthis.me)).

It uses Puppeteer by Chrome DevTools team -
[https://github.com/GoogleChrome/puppeteer](https://github.com/GoogleChrome/puppeteer).

~~~
johnwaynedoe
Cool project! I think I am going to start regularly making use of it. Does it
email you a PDF attachment, or does it just send an email with the contents of
the article within it? When I attempted to use it I did not see a pdf
attachment. Either way, really cool project.

~~~
bharani_m
Thanks, if it is unable to extract useful content from a page, EmailThis will
save it as a PDF and send it as an attachment.

If you try saving any discussion website (HN, Stackoverflow, Reddit), you
will get the PDF.

------
czechdeveloper
All I need now is CMYK support in Chrome and I can make HTML based print ready
PDF rendering. That would be quite an upgrade compared to my current options.

~~~
hwc
I work on Chrome's PDF generation.

Please file a bug report if you think this is an important feature.

~~~
synthmeat
I'll upvote this, and track the bug report if it's linked.

Yes, CMYK support is a very tempting feature for the whole print industry.

------
joshribakoff
What's up with the built in queue? I feel like that belongs in a different
script. For one, the built in Node.js queue is useless in a multiple server
environment. You'd still need a distributed queue since this built in one is
only local to one server/thread. So the built in queue becomes
redundant/pointless for any kind of solution that needs to scale.

~~~
xg15
I was first wondering about all the complexity in the API as well (why a
built-in queue, webhook, retry policy and storage interface when the actual
transaction I'm interested in is just "url -> pdf blob"?)

However, I think this is necessary if you want to fit it into a microservice
with a REST interface. For REST, I think the usual expectation is that a) the
request returns quickly and b) you can submit any number of requests in
parallel. Given that loading a page into headless chrome, rendering it and
generating a pdf is both resource intensive and time consuming, I guess you
need some way to decouple that process from the interface.
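
A minimal sketch of that decoupling, assuming an in-process queue and a single background worker. This is not pdf-bot's actual code; `render_pdf`, `submit` and the `jobs` dict are hypothetical names, and the render step is a placeholder rather than a real headless Chrome call.

```python
import queue
import threading
import uuid

jobs = {}                # job_id -> {"url", "status", "result"}
pending = queue.Queue()  # in-process only, so local to one server


def render_pdf(url):
    """Hypothetical stand-in for the slow headless-Chrome render."""
    return b"%PDF-placeholder for " + url.encode()


def submit(url):
    """What a REST handler would do: enqueue the job and return at once."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"url": url, "status": "queued", "result": None}
    pending.put(job_id)
    return job_id


def worker():
    """Drain the queue in the background, one render at a time."""
    while True:
        job_id = pending.get()
        job = jobs[job_id]
        try:
            job["result"] = render_pdf(job["url"])
            job["status"] = "done"
        except Exception:
            job["status"] = "failed"
        finally:
            pending.task_done()


threading.Thread(target=worker, daemon=True).start()

ids = [submit(u) for u in ("https://example.com/a", "https://example.com/b")]
pending.join()  # a real client would poll or wait for a webhook instead
print([jobs[i]["status"] for i in ids])
```

Because the queue lives in one process, this also illustrates the parent comment's objection: a second server would not see these jobs.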

------
flashman
How well does it work with multiple-page PDFs? One of our banes is generating
mixed text/image downloadable reports with sensible page breaks. To save time,
we're actually doing those as docx files, with the bonus/risk that clients can
edit the content before saving it as a PDF.

~~~
DonnyV
Just use WkhtmlToPdf [https://wkhtmltopdf.org](https://wkhtmltopdf.org) and
wrap a simple service around it.

~~~
bshimmin
That's one hell of a "just".

~~~
DonnyV
I did it in 2 days. It's not very hard.

------
sdeframond
Nice work!

Depending on your use case, I feel like you guys might be interested in
[http://weasyprint.org/](http://weasyprint.org/). It is an open source HTML to
PDF converter written in Python. It passes the Acid2 test and implements CSS
Paged Media.

~~~
j_s
Always nice to discover another open source HTML rendering engine!

I recently discovered
[https://github.com/ArthurHub/HTML-Renderer](https://github.com/ArthurHub/HTML-Renderer),
formerly known as
[https://htmlrenderer.codeplex.com/](https://htmlrenderer.codeplex.com/)

 _PDF generation [...] 100% managed (C#), High performance HTML Rendering
library_

------
gbuk2013
This is interesting :-)

I am currently using athenapdf[1] but I will have a play with pdf-bot.

[1]
[https://github.com/arachnys/athenapdf](https://github.com/arachnys/athenapdf)

~~~
MrSaints
Core developer of `athenapdf` here :)

I had a quick look at `pdf-bot`, and though we both rely on the same
underlying technology _(we are only just moving to headless Chromium; we were
on Electron before)_ , I believe we have slightly different ambitions with our
respective projects. But, I may be biased.

For example, `pdf-bot` seems to be tied exclusively to a specific converter,
and storage backend. With `athenapdf` however, we are moving more, and more
towards building a toolkit or rather, framework for other people to construct
their own conversion processes (or even microservice)[0].

Consequently, we are working towards general abstractions like fetching,
converting, and uploading, that can have different implementations (e.g.
wkhtmltopdf, LibreOffice, Weasyprint, etc).

With our microservice assembly as well, we are focused heavily on ensuring we
have:

1\. Instrumentation, and metrics (which `pdf-bot` appears to currently lack)

2\. Support for different retry mechanisms (e.g. retry using the same
converter or retry using a different converter)

3\. Support for multiple input MIME types

4\. Synchronous API calls (`pdf-bot` appears to be mostly asynchronous, with
batch processing, and callbacks)

5\. Ease of installation (e.g. Docker), and configuration

We also have a CLI assembly[1] that can support _custom JavaScript plugins_
[2] (e.g. Markdown -> PDF, Readability, etc). So you don't need to run a
service or make API calls for conversions.

[0]
[https://github.com/arachnys/athenapdf/tree/v3/pkg](https://github.com/arachnys/athenapdf/tree/v3/pkg)

[1]
[https://github.com/arachnys/athenapdf/blob/v3/cmd/cli/main.g...](https://github.com/arachnys/athenapdf/blob/v3/cmd/cli/main.go#L27)

[2]
[https://github.com/arachnys/athenapdf/tree/v3/pkg/runner/plu...](https://github.com/arachnys/athenapdf/tree/v3/pkg/runner/plugin/js)

~~~
gbuk2013
Thank you for athenapdf and for rescuing me from the pains of wkhtmltopdf - I
am a happy user. :)

My only small problem with it was the somewhat complex setup for using
athenapdf-service with a new project (especially since I use docker-machine)
but I have now mostly automated the whole thing.

Just out of interest - do you consider asynchronous an advantage (being a Node
developer I generally love async very much)? Not that it matters to me - my
needs are trivial for the service to handle.

Edit: actually I can see how async would make my life much more complicated
for my simple use case - I would have to write something to track requests and
responses rather than just looping through a bunch of URLs that need
converting.

~~~
MrSaints
That's interesting feedback! Thank you :)

We actually went with Docker for the set up because it simplified dependency
management tremendously, and it allowed us to deploy on platforms like
Kubernetes, Swarm, and ECS. As a plus, it gave us some confidence that if it
_works for us_ , it should _work for others_ (obviously, we have come across
cases where Docker behaves differently across platforms).

I consider asynchronous processing (in this context) as advantageous in some
cases. Indeed, when we were refactoring `athenapdf`, we considered introducing
a message queue for workers to pull work from, and to put back when the work
completes. The problem with this however, is that we can't as easily scale
horizontally (i.e. introduce node replicas behind a load balancer), as if we
tried to get / update a job, we may not get the same node we originally got. I
mean, the solution can be as easy as introducing a centralised message queue
of sorts (or even a sticky session), but that complicates the set up process,
so we decided against it.

Taken together, for our specific use cases, we believe it is a lot simpler to
consume a synchronous API. No webhooks / callbacks. No polling. No concerns
over acknowledgement. If a HTTP call fails, we will know about it immediately.
If a complex retry mechanism is needed, we think this should be accomplished
in the client application.

In the long term, I believe we should have a toolkit that can easily be
plugged into a wider orchestration engine like Conductor
([https://netflix.github.io/conductor/](https://netflix.github.io/conductor/)).
That way, anyone can develop their own conversion process pipeline with ease.
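
To illustrate the point about keeping retries in the client when the API is synchronous, here is a minimal sketch. It is not athenapdf code: `with_retries` and `flaky_convert` are hypothetical names, and the failing call only simulates a conversion request that errors twice before succeeding.

```python
import time


def with_retries(call, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run a synchronous call, retrying with exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts, surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical conversion call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_convert():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("conversion service unavailable")
    return b"%PDF-placeholder"


# The sleep is stubbed out here so the example finishes instantly.
result = with_retries(flaky_convert, sleep=lambda seconds: None)
print(calls["count"], result)
```

With a blocking call the failure shows up immediately as an exception, so the client needs no job tracking, polling or webhook handling to decide whether to try again.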

------
oxplot
Unfortunately, Chrome's kerning when it comes to printing is atrocious. Over
the years, I've tested it every once in a while in the hope that it would
improve, to no avail.

Currently, the only print ready HTML to PDF processor that I know of is Prince
[1] and to a lesser extent Firefox.

[1]: [https://www.princexml.com/](https://www.princexml.com/)

~~~
sk5t
Prince is awesome. Its support for paged media has impressed me again and
again.

------
vbezhenar
Headless chrome is awesome. I'm using it to generate multiple PNGs from SVG.

~~~
keyle
I agree with you but as an 'old' developer, I find it pretty sad that you have
to fire up a headless browser + its gigaton of code, to convert an SVG to PNG.
But I agree with you, headless chrome can be very useful.

~~~
bshimmin
I take your point in general, but I'm not sure this specific task - converting
SVGs - has ever been one that has easily been solved by some tiny amount of
code: probably the way I would have done this in the past (say, ten years ago)
would have been using Apache Batik's rasteriser, which is far from a
lightweight solution itself.

~~~
matthewmacleod
It's not _tiny_ , but in case anybody else is trying to do the same thing, I'd
reach for librsvg
([https://wiki.gnome.org/Projects/LibRsvg](https://wiki.gnome.org/Projects/LibRsvg)).

------
brajesh
Are there any similar wrappers around headless Firefox, which has been
released recently (Firefox 55)?

Mozilla's documentation ([https://developer.mozilla.org/en-
US/Firefox/Headless_mode](https://developer.mozilla.org/en-
US/Firefox/Headless_mode)) is still incomplete.

~~~
dagurp
I assume they're waiting for Servo

------
matallo
idea: using this as "send to kindle" generating pdfs from urls you stumble
upon, and seamlessly sending them to the *@kindle.com email address to consume
in the device

I don't know if there's an easier way or service these days

~~~
rcarmo
Instapaper did that pretty well, and I think Pocket Premium does that too - in
MOBI format, which is much easier to read on the Kindle than PDF.

These days I use Calibre instead because the overall experience is better, but
for single articles it's a bit overkill.

------
bm98
Interesting to compare this to some of the "old school" solutions for
converting web pages to PDF such as htmldoc[0] or html2ps[1].

[0]
[https://github.com/michaelrsweet/htmldoc](https://github.com/michaelrsweet/htmldoc)
[1]
[http://user.it.uu.se/~jan/html2ps.html](http://user.it.uu.se/~jan/html2ps.html)

~~~
Wilya
The old school solutions lack any sort of javascript support (per the docs,
htmldoc doesn't even support css), so they wouldn't work for a lot of real
world websites. That's not really the same use case.

A better comparison would be against the likes of wkhtmltopdf[0], which uses
webkit, or the pdf generation features of phantomjs.

[0] [https://wkhtmltopdf.org/](https://wkhtmltopdf.org/)

~~~
j_s
[https://github.com/wkhtmltopdf/wkhtmltopdf/issues](https://github.com/wkhtmltopdf/wkhtmltopdf/issues)

 _1,047 Open 975 Closed_

Yep, that's about how I remember it. It was such a pain to build on Windows
(especially to get a single static binary) that people contributing fixes
would often attain hero status by attaching a random binary to an issue.
Specifically, GIF support was broken on the official Windows build for 4+
years:

[https://web.archive.org/web/20140917181225/http://code.googl...](https://web.archive.org/web/20140917181225/http://code.google.com/p/wkhtmltopdf/issues/detail?id=441#c41)

------
shubhamjain
I am guessing this works by splitting screenshots of a web page and gluing
them as pages in the PDF file. Doesn't that mean the size of the PDF would
grow to be large once it passes a few pages? How does it handle content (like
tables) that doesn't have line breaks?

~~~
gima
It uses Chrome's built-in print-to-PDF functionality via Chrome Debug/DevTools
Protocol. In other words it creates PDF files with real vector graphics and
text, not just images embedded in PDF.

Page.printToPDF: [https://chromedevtools.github.io/devtools-
protocol/tot/Page/...](https://chromedevtools.github.io/devtools-
protocol/tot/Page/#method-printToPDF)
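
A minimal sketch of what that exchange looks like on the wire, assuming the usual DevTools Protocol JSON framing (an `id`, a `method`, and `params`) and a base64 `data` field in the result. No browser is contacted here: the response is simulated, and the helper names are hypothetical.

```python
import base64
import json


def print_to_pdf_command(message_id, landscape=False, print_background=True):
    """Build the JSON message for the DevTools Protocol method Page.printToPDF."""
    return json.dumps({
        "id": message_id,
        "method": "Page.printToPDF",
        "params": {"landscape": landscape, "printBackground": print_background},
    })


def pdf_bytes_from_response(raw_response):
    """Decode the base64 `data` field of a Page.printToPDF result."""
    response = json.loads(raw_response)
    return base64.b64decode(response["result"]["data"])


# Simulated response; a real one would arrive over the DevTools WebSocket.
fake_response = json.dumps({
    "id": 1,
    "result": {"data": base64.b64encode(b"%PDF-1.4 example").decode("ascii")},
})

command = print_to_pdf_command(1)
pdf = pdf_bytes_from_response(fake_response)
print(command)
print(pdf[:8])
```

The decoded bytes are the PDF itself, which can be written straight to a file.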

~~~
shubhamjain
I didn't know that existed. How good is it with corner cases? HTML->PDF is a
notoriously difficult problem; even generating PDF is. There are several
software services which charge well for doing that (Docraptor, PrinceXML). If
it's smooth and handles everything well, is there any reason someone should
pay for them?

~~~
j_s
PDF generation (especially from a JavaScript-enhanced HTML page) has enough
corner cases that it is typically best implemented with commercial support
paying someone to polish away the rough edges.

There are many "free as in beer" (closed-source), freemium, and/or free trial
options offered as a carrot leading to a commercial product. Most have a
watermark and/or page count limitations.

[http://selectpdf.com/community-edition/](http://selectpdf.com/community-
edition/) (5 pages max)

------
rmetzler
Thank you very much. I'm excited to try it out. Great documentation, I wish
more people would explain their project's software architecture in the README.

------
igitur
How does this compare to wkhtmltopdf, which, IIRC, uses WebKit to render the
pdf?

~~~
esbenp
We actually used wkhtmltopdf before we started using pdf-bot. wkhtmltopdf
development has slowed a lot, it is very unstable, and you need to run a
two-year-old alpha version to support flexbox (if I remember correctly) :-)
Headless Chrome is a much more stable choice imo.

~~~
bshimmin
We had an absolutely ghastly time last year trying to implement wkhtmltopdf in
a Rails app - we probably wasted an entire week fighting with both Wicked PDF
and PDFKit before we just gave up and wrote something using Prawn instead
(which was, of course, extremely time-consuming in a different way, but at
least the end result was good).

~~~
boundlessdreamz
What problems did you run into with wkhtmltopdf? We have been using it without
much trouble. Chrome pdf generation is nice but wkhtmltopdf generates smaller
PDFs with table of contents.

~~~
bshimmin
All kinds of problems that others have mentioned above, plus in terms of the
Rails integration, it felt like we hit almost every one of the open issues on
the GitHub repos for both Wicked PDF and PDFKit. I vaguely recall fonts in
production being a problem, a general lack of reliability, performance issues,
fiddling around with various different binaries of wkhtmltopdf to find one
that maybe worked... probably other things besides. It was a bad week and I
wish I hadn't reminded myself!

(With no disrespect, of course, to the authors of these libraries - they just
didn't work well for us.)

------
criddell
Can it generate a table of contents with page numbers?

~~~
aumerle
calibre has for years been able to convert arbitrary HTML files to PDF with a
Table of Contents with page numbers, links, embedded fonts, and arbitrary
headers/footers, all rendered using WebKit, without a running X server.

ebook-convert file.html file.pdf --pdf-add-toc

------
titel
Would this work on AWS Lambda?

~~~
benmanns
Check out [https://github.com/adieuadieu/serverless-
chrome#printtopdf-p...](https://github.com/adieuadieu/serverless-
chrome#printtopdf-print-a-given-url-to-pdf)

