
URL to PDF Microservice - kimmobru
https://github.com/alvarcarto/url-to-pdf-api
======
erdbeerkaese
Unfortunately, the "sensible defaults" don't seem to check input URLs
correctly and allow file:// URLs.

Just try ?url=file:///etc/passwd on the demo instance.

That seems to be a quite common issue with services like this built on generic
libraries.

~~~
kimmobru
Oh wow, that's a bit embarrassing :) The urls are now restricted to http and
https only. Thanks for noticing!
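
For readers wondering what such a restriction might look like, here is a minimal sketch in Python. It is not the project's actual code (the service itself runs on Node); the names is_allowed_url and ALLOWED_SCHEMES are made up for illustration, and the extra "must have a host" check is my own assumption.

```python
from urllib.parse import urlsplit

# Assumption: mirrors the "http and https only" rule described above.
ALLOWED_SCHEMES = {"http", "https"}


def is_allowed_url(url: str) -> bool:
    """Return True only for absolute http(s) URLs that name a host."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. a broken IPv6 literal.
        return False
    # urlsplit lowercases the scheme, so "HTTP://..." is accepted too.
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.hostname)
```

As the replies below point out, a check like this only covers the initial URL, not what the browser navigates to afterwards.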

~~~
qw
I am not at a computer now, so I can’t test it, but do you take redirects into
account? I hope you are not just whitelisting the initial URL, but also any
URLs it redirects to. If you don’t already, you should probably just disable
redirects in whatever library you use.
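
A rough sketch of the per-hop check, assuming the fetching code follows redirects by hand and sees each Location header. The helper name next_hop is hypothetical, and this covers HTTP redirects only, not navigation triggered by scripts in the page.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: same allowlist as for the initial URL.
ALLOWED_SCHEMES = {"http", "https"}


def next_hop(current_url: str, location: str) -> str:
    """Resolve a Location header against the current URL.

    Raises ValueError when the redirect target uses a scheme that is
    not on the allowlist, so the caller can stop following the chain.
    """
    target = urljoin(current_url, location)
    if urlsplit(target).scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"redirect to disallowed URL: {target}")
    return target
```

A relative Location is resolved against the current hop, while an absolute file:// target is rejected before it is ever fetched.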

~~~
kimmobru
I gave this a thought for a moment. Since we're using a real browser, there
are a huge number of different ways to get the browser to display a file://
link. A redirect is one, window.location.href is another, etc. The service
shouldn't be run publicly on the internet for real use cases. If you do, the
server should be designed in a way that it's not dangerous if the web server
user gets read access to the file system. I added a warning about this at the
top of the README.

~~~
cutcss
You are using headless Chrome, so you can use group policy to add
"file://" to the URL blacklist; see
[http://www.chromium.org/administrators/policy-
list-3#URLBlac...](http://www.chromium.org/administrators/policy-
list-3#URLBlacklist)
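
For anyone wanting to try the policy route, here is a rough sketch that generates the policy file contents. The URLBlacklist name comes from the linked policy list; the "file://*" pattern and the managed-policy directory mentioned in the code comment are assumptions, and I have not verified that headless Chrome honors them, so treat this as a starting point only.

```python
import json

# Assumption: "file://*" is the filter pattern that matches every file:// URL.
POLICY = {"URLBlacklist": ["file://*"]}


def render_policy() -> str:
    """Serialize the policy as JSON, ready to be written to a policy file."""
    return json.dumps(POLICY, indent=2)


if __name__ == "__main__":
    # Assumption: on Linux the output would be saved as a .json file under a
    # managed policy directory such as /etc/opt/chrome/policies/managed/.
    print(render_policy())
```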

------
mrskitch
FWIW there are a lot of things Chrome won't do properly out of the box (fonts,
emojis, and more). Most projects like this will work for _most_ of the web, but
there's a lot of nuance in getting it working across the board. This is
something I've been working on for quite some time:
[https://browserless.io](https://browserless.io)

------
renke1
I was playing around with Puppeteer the other day and was wondering if it was
possible to render a web page to a single page PDF (a page with fixed width
and variable height). Basically like creating a screenshot without losing the
text information. This would solve a lot of problems such as sticky elements
hiding text like in this example [1].

[1]: [https://url-to-pdf-
api.herokuapp.com/api/render?url=https://...](https://url-to-pdf-
api.herokuapp.com/api/render?url=https://hackernoon.com/why-isnt-agile-
working-d7127af1c552)

~~~
kimmobru
That's a good idea! You can achieve this by adding e.g.
&pdf.width=1000px&pdf.height=10000px parameters.

Sometimes you can get rid of the sticky headers with the
&emulateScreenMedia=false parameter if the page has well-implemented @media
print rules in CSS. We decided to use page.emulateMedia('screen') with
Puppeteer to make PDFs look more like the actual web page by default.

Pages which use lazy loading for images may look incorrect when rendered. The
&scrollPage=true parameter may help with this. It scrolls the page to the
bottom before rendering the PDF.

Using these options makes the PDF better: [https://url-to-pdf-
api.herokuapp.com/api/render?url=https://...](https://url-to-pdf-
api.herokuapp.com/api/render?url=https://hackernoon.com/why-isnt-agile-
working-d7127af1c552&emulateScreenMedia=false&scrollPage=true)
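
Since these links get long, a small helper for assembling them may be handy. This is only a sketch: the endpoint is the demo instance from the links above, the parameter names are the ones mentioned in this comment, and build_render_url is a made-up name.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# The demo instance linked in this thread.
API_ENDPOINT = "https://url-to-pdf-api.herokuapp.com/api/render"


def build_render_url(page_url: str, options: Optional[Dict[str, str]] = None) -> str:
    """Build a render URL, percent-encoding the target URL and any options."""
    params = {"url": page_url}
    params.update(options or {})
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # A tall single-page PDF, as suggested above.
    print(build_render_url(
        "https://example.com/",
        {"pdf.width": "1000px", "pdf.height": "10000px", "scrollPage": "true"},
    ))
```

Encoding the target URL matters once it carries its own query string; otherwise its parameters would be read as options for the API.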

~~~
sid-
I bet that breaks on infinite scroll pages?

------
halestock
Why use headless chrome instead of the PDF lib used by chrome directly?
[https://pdfium.googlesource.com/pdfium/](https://pdfium.googlesource.com/pdfium/)

~~~
riquito
For starters, documentation: you can't even understand what pdfium IS from that
page. After some searching I see that it can do rasterization, from PDF to e.g.
PNG, but I couldn't find any mention of it being capable of generating a PDF
from a URL. Can it?

------
ellimilial
Seems to be useful, as
[https://github.com/arachnys/athenapdf](https://github.com/arachnys/athenapdf)
does a pretty similar thing.

------
aantix
I wonder how scalable the service is...

At least with PhantomJS I felt like my system would begin to lock up if there
were too many instances rendering at the same time (and it didn't appear to be
an issue of too little memory).

Nonetheless, this looks promising.

~~~
Scarbutt
You need to fire up a new headless Chrome process every time you create a PDF;
it's not scalable but it works.

I wonder if this part of chrome could be easily extracted as a C++ library.

~~~
mrskitch
You can actually reuse a running instance and create a new context with this:
[https://chromedevtools.github.io/devtools-
protocol/tot/Targe...](https://chromedevtools.github.io/devtools-
protocol/tot/Target/#method-createBrowserContext). Issue is that most
libraries don't have an API for this (not sure why), and that long-running
Chrome instances can get to a quirky state + other issues. Certain parameters
require you start Chrome with the right flags, so reusing a running Chrome
process doesn't always work.

------
placebo
Cool. Nice to learn of another option. What would be the advantages of using
this over wkhtmltopdf ([https://wkhtmltopdf.org](https://wkhtmltopdf.org)) ?

~~~
kimmobru
Thanks for the comment! I haven't personally used wkhtmltopdf much, but I like
having Chrome as the rendering engine. In theory at least, debugging the PDFs
can be done with desktop Chrome's print preview. I don't know about
wkhtmltopdf, but url-to-pdf-api supports dynamic single-page apps, which can
be beneficial depending on the use case.

Headless Chrome is quite new so it still has some bugs, but I have a hunch
that it will in the end have the most reliable and expected render results.

------
caidan
Oddly enough I just implemented my own one of these in 34 sloc using flask and
weasyprint. I chose to only have it accept html in a post rather than a url so
that it could render non-publicly accessible urls. You can also pass it a
base_url (which it passes on to weasyprint) for resolving relative urls for
static assets in the html, which are usually publicly accessible. Runs on
heroku for simplicity.

------
hex1848
I am using PhantomJS for a similar project running on AWS Lambda - running
into all sorts of rendering bugs / crashes. Wanted to make the switch to
puppeteer, but as of now it requires a higher version of NodeJS than what
Lambda supports. Was in the process of looking at Docker containers for my
service, anyone have any thoughts on Heroku vs Docker?

~~~
mrskitch
Hey, I'd love to talk to you more about your issues with getting over to
puppeteer. I'm working on a product that separates application infrastructure
from Chrome as it's a nightmare to try and scale with.

I've written somewhat extensively about the deployment approaches here:
[https://hackernoon.com/more-than-you-want-to-know-about-
head...](https://hackernoon.com/more-than-you-want-to-know-about-headless-
chrome-31f6b3b06d82)

~~~
tonetheman
That was a great read!

~~~
mrskitch
Hey, thanks for the kind words, really appreciate hearing that!

------
halfnibble
Oh hey! I created something similar with Flask and Docker. Except you POST the
html content and receive a PDF document back. It uses wkhtmltopdf, so it's
pretty fast.
[https://github.com/halfnibble/pdf_service](https://github.com/halfnibble/pdf_service)

~~~
j_s
Wkhtmltopdf is a dead-end, but very useful and relatively lightweight if it
still works for your use case.

Thanks for packaging this up!

------
danso
Took a quick skim of the README. I have a general question for these web-to-
PDF services. Is the priority to honor the page styles as set forth in a
print.css-type file? Or is it to be as close as possible to a screenshot of
the webpage, which is what I think the majority of laypeople would expect?

One of the small things I've recently let myself be bothered by is how
divergent _HTML/browser_ snapshots from web.archive.org, archive.is, and
Google cache can be, even for relatively simple pages. I've already given up
on trying to make (not even sure if it's a good idea) HTML look nice as PDFs.

~~~
zerkten
This service seems to be targeted at producing controlled artefacts from your
application (e.g. invoices) where you know what CSS is in use. If you wanted
to capture the original design of the page you can use headless Chrome to
capture screenshots automatically ([https://medium.com/@dschnr/using-headless-
chrome-as-an-autom...](https://medium.com/@dschnr/using-headless-chrome-as-an-
automated-screenshot-tool-4b07dffba79a)). Perhaps headless Chrome can also
save the HTML plus assets, or some other archive format that would be
interactive.

Edit: this seems to be a take on saving the HTML plus assets -
[https://github.com/pirate/bookmark-
archiver](https://github.com/pirate/bookmark-archiver).

~~~
ge96
I think phantomjs is another option. I'm personally looking for a way to
screenshot charts produced on a site of mine so I can create thumbnails for a
quick preview. I think phantomjs is what I will try first.

Edit: ahh someone else mentioned it too.

------
gmisra
Partially off topic: does anybody know of a hosted solution that turns a pdf
into an html page, and hosts the output html (optionally, also hosts the pdf,
with a downloadable link).

~~~
space_fountain
[https://mozilla.github.io/pdf.js/](https://mozilla.github.io/pdf.js/) would
probably be what you would need to build your own. I think you can also use
just Google drive to make something like this.

Not sure of more straightforward hosting options.

------
jgamman
OT (maybe) Question: I've gotten more and more annoyed over the years as links
are deleted/decay etc. Even more so recently. Is there a plug-in that makes a
'personal archive' or potentially on-sends the page to the Archive? It would
be useful to be able to search/go back in my time-line to webpages even if
they were just static PDFs.

~~~
j_s
Wallabag: a self-hostable application for saving web pages |
[https://news.ycombinator.com/item?id=14686882](https://news.ycombinator.com/item?id=14686882)
(Jul 2017)

Show HN: Kozmos – A Personal Library |
[https://news.ycombinator.com/item?id=14980075](https://news.ycombinator.com/item?id=14980075)
(Aug 2017)

specifically: [https://addons.mozilla.org/en-
US/firefox/addon/scrapbook-x/](https://addons.mozilla.org/en-
US/firefox/addon/scrapbook-x/) and
[https://chrome.google.com/webstore/detail/worldbrain-the-
res...](https://chrome.google.com/webstore/detail/worldbrain-the-
research-e/abkfbakhjpmblaafnpgjppbmioombali)

~~~
j_s
Linked elsewhere in this discussion:

[https://github.com/pirate/bookmark-
archiver](https://github.com/pirate/bookmark-archiver) python script

[https://chrome.google.com/webstore/detail/singlefile/mpiodij...](https://chrome.google.com/webstore/detail/singlefile/mpiodijhokgodhhofbcjdecpffjipkle)
save as a single html file

> _uses "data URI" scheme to embed image and frame contents into the page :
> the resulting format is not MHT/MHTML_

------
fetch1
I've been working on a WYSIWYG PDF creator and API for a few months, mainly to
solve a need at my day job: [https://fetchpdf.com](https://fetchpdf.com)

PDF generation is 'fun', and I've tried several of the different options out
there for HTML to PDF generation, but settled on wkhtmltopdf.

------
oxguy3
Ooooooh, there have been a couple projects where I wished I had something like
this. Definitely bookmarking this.

------
nikisweeting
If you're interested in archiving/PDF'ing a large list of links (e.g. your
browser bookmarks or Pocket list), check out
[https://github.com/pirate/bookmark-
archiver](https://github.com/pirate/bookmark-archiver)

------
djs070
I would like to adopt something like this, but there are some pretty normal
table functions in our current print solution that I don't know how to support
in HTML.

E.g. on a multi-page invoice, show a sub-total row at the bottom of each page.
Does anyone know how to create this kind of function?

~~~
pier25
IIRC you can solve that using CSS for print media.

The browser will display a different CSS to the printer, so to speak.

~~~
amigoingtodie
Yes, and instead of px or em, you can use mm for css units (positioning, size,
etc.).

------
Bromskloss
Can you get the PDF page to adjust its size based on the HTML page to be
rendered?

------
garou
URL or HTML to PDF:
[https://github.com/tecnospeed/pastor](https://github.com/tecnospeed/pastor)

------
default-kramer
It looks like chrome's javascript interface exposes options that the command
line doesn't. Or else I'm overlooking something, because I couldn't find a way
to hide the header and footer (which shows the date, title, url, and page
number) using the command line. But this project does hide the header and
footer.

I can't use an externally hosted service like this because some of my URLs are
non-public. So when the user requests a PDF, I render the HTML to a temp file
on the server, invoke chrome via command line, and serve up the converted PDF.
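
Roughly, that flow could be sketched like this in Python. The binary name google-chrome, the 60-second timeout, and both function names are assumptions for illustration; --headless, --disable-gpu and --print-to-pdf are the only Chrome flags used, and the sketch does nothing about the header and footer.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List


def build_chrome_command(html_path: Path, pdf_path: Path,
                         chrome_binary: str = "google-chrome") -> List[str]:
    """Build the argument list for a headless Chrome print-to-PDF run."""
    return [
        chrome_binary,  # assumption: the binary name differs per platform
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={pdf_path}",
        html_path.resolve().as_uri(),  # Chrome is given a file:// URL
    ]


def html_to_pdf(html: str, pdf_path: Path,
                chrome_binary: str = "google-chrome") -> None:
    """Write the HTML to a temp file and let Chrome convert it to a PDF."""
    with tempfile.TemporaryDirectory() as workdir:
        html_path = Path(workdir) / "page.html"
        html_path.write_text(html, encoding="utf-8")
        # check=True raises if Chrome exits with an error; the timeout
        # keeps a hung render from blocking the request forever.
        subprocess.run(
            build_chrome_command(html_path, pdf_path, chrome_binary),
            check=True,
            timeout=60,
        )
```

Passing the arguments as a list rather than a shell string avoids quoting problems when paths contain spaces.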

~~~
Cymen
Just in case you missed it, you can clone the git repo for this project and
use it inside your network. I'm playing with it now that way.

Right now, I'm using wkhtmltopdf which supports custom headers and footers and
I'm mulling over how to do the same with this solution.

------
bitdeveloper
@kimmobru, would you happen to know how this would handle printing a multipage
table to PDF? Specifically, I'm hoping for repeating headers per page, and not
having any problems with rows printing half on one page, and half on the next
page. I would love to replace a paid solution we use which handles these use
cases with something based on headless chrome.

------
maslam
We use urlbox.io and are really happy with it. Don't recommend building this
yourself.

------
edoceo
See also: [http://any2web.io](http://any2web.io)

~~~
dotmanish
with no privacy policy link or info on how long the documents are stored on
the server.

Stuff like receipts, etc to be converted to PDF will contain customers'
information - not to be put on a site with no info on how it's used.

~~~
nathan_f77
That's a good reminder, thanks. I'm working on a similar service, and need to
be very clear about our data retention policy.

------
overcast
This is Stallman's wet dream.

