
Saving web pages as PDFs in 2019, a real challenge - lucabenazzi
https://www.humaneinterface.net/article/saving-web-pages-as-pdfs-in-2019-a-real-challenge
======
seppel
Firefox (Mac) is not even able to print this article into a PDF. It gives you
only one page and the page ends with a half sentence with the letters cropped
in the middle. One third of the text is missing.

I can't count the number of times I had to take screenshots to save crucial
information from web pages.

------
prashnts
A tip if you're using iOS/iPadOS: In Safari, taking a screenshot now supports
a "full page" shot, which you can save as a PDF. You get this option when
you tap the tiny screenshot preview.

~~~
christefano
I had no idea. I’ve been using an app called Tailor to stitch together
multiple screenshots of web pages. Until today, that is, thanks to your tip.
Thank you!

------
xorand
>Saving pages as HTML is not ideal because a) you get an HTML file plus a
folder, not very practical if you want to retrieve them later, and b) you
never know how that page is going to render in future versions of your
browser.

Yes, but for my use case, which is better means of scientific communication,
PDF is not enough.

Consider, for example, slides for a presentation. The typical mathematician
writes them in TeX, which outputs a PDF. Then the PDF is (sometimes) made
available online. I realized that PDF slides are far inferior to HTML slides
(where you can add demos and whatnot, shameless example [0]). Just put
everything in a GitHub repository and anybody can take them home.

[0]
[https://mbuliga.github.io/emergent-10-years/presentation.htm...](https://mbuliga.github.io/emergent-10-years/presentation.html)

------
fbriff
I gave up on the idea of reliably saving web pages in PDF.

I now use "SingleFile", a Firefox or Chrome extension that helps to save a
complete page (with CSS, images, fonts, frames, etc.) as a single HTML file.

[https://addons.mozilla.org/en-US/firefox/addon/single-file/](https://addons.mozilla.org/en-US/firefox/addon/single-file/)

[https://chrome.google.com/webstore/detail/singlefile/mpiodij...](https://chrome.google.com/webstore/detail/singlefile/mpiodijhokgodhhofbcjdecpffjipkle?hl=en)

[https://github.com/gildas-lormeau/SingleFile](https://github.com/gildas-lormeau/SingleFile)

~~~
lucabenazzi
Great job! I suggest that you go deeper into how the annotation feature works
in the add-on description, as that could be useful. My main issue with this
solution is whether this file format is going to be reliable in the future.
As with MHTML, which was mentioned in another comment, formats that are not
widely used may one day not be supported the way they are today. I'm not sure
how you've achieved this, as HTML clearly cannot hold media files by itself
(unless there's something I am not aware of), so it must do something at the
file-system level. I wonder, is it still going to work in future versions of
the operating system?
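
For what it's worth, HTML can carry media inline through `data:` URIs, and as
far as I understand that is roughly how a single-file save can work without
touching the file system. A minimal Python sketch of the idea follows; the
function name `inline_images`, the `assets` mapping, and the sample page are
all made up for illustration, and this is not SingleFile's actual code.

```python
import base64
import mimetypes
import re


def inline_images(html, assets):
    """Replace each src="..." that names a known asset with a data: URI.

    `assets` (hypothetical) maps a path as written in the HTML to raw bytes.
    """
    def to_data_uri(match):
        path = match.group(1)
        if path not in assets:
            # Leave unknown or remote references untouched.
            return match.group(0)
        mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
        payload = base64.b64encode(assets[path]).decode("ascii")
        return f'src="data:{mime};base64,{payload}"'

    return re.sub(r'src="([^"]+)"', to_data_uri, html)


# Illustrative input: a page with one image, plus that image's first bytes.
page = '<p>Receipt</p><img src="logo.png">'
single_file = inline_images(page, {"logo.png": b"\x89PNG\r\n\x1a\n"})
print(single_file)
```

The result is plain HTML with the image bytes embedded in the `src`
attribute, so it should open anywhere ordinary HTML and `data:` URIs do.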

------
sturakov
I recommend [https://webrecorder.io/](https://webrecorder.io/) to reliably
capture websites.

~~~
lucabenazzi
Thanks so much for your comment. I've read the original blog post
([https://blog.webrecorder.io/2019/08/29/desktop-app.html](https://blog.webrecorder.io/2019/08/29/desktop-app.html))
and it sounds like a very effective means of capturing web content. I was not
aware there's a standard for web archiving purposes; that sounds like
something that would keep being supported in the future. And from what I read
it's better than PDF, as it captures interactive elements as well. I will give
it a try.

------
Garvey
Shouldn't whoever was responsible for the design (UI, UX, or otherwise) have
supplied CSS styles for printing?

[https://www.smashingmagazine.com/2018/05/print-stylesheets-i...](https://www.smashingmagazine.com/2018/05/print-stylesheets-in-2018/)

------
meehow
I had a similar problem and wrote a browser add-on:
[https://2read.net/](https://2read.net/) It converts websites to a "readable"
form, and if you have IPFS running, it will also "pin" the content locally. In
most cases it works better than just printing an article. Here is an example
with the mentioned article:
[https://ipfs.io/ipfs/QmYPkcXgKLBye3L8M1VJWsGAb2mJXkJSEncqcSC...](https://ipfs.io/ipfs/QmYPkcXgKLBye3L8M1VJWsGAb2mJXkJSEncqcSCkFTdHhi/)

~~~
lucabenazzi
This is brilliant! I like the fact that it also cleans up the page. I don't
know much about IPFS, but I see there's a video linked from the add-on
description, so I will take some time to watch it. My main concern is which
format is going to remain readable in the long term, so it should be either
something widely accepted and distributed like PDF, or something that is going
to be backward-compatible and run smoothly on future machines. Thanks for your
answer.

~~~
meehow
There is also an HN discussion about this add-on:
[https://news.ycombinator.com/item?id=18965729](https://news.ycombinator.com/item?id=18965729)

------
zmzrr
>Saving pages as HTML is not ideal because a) you get an HTML file plus a
folder, not very practical if you want to retrieve them later

MHTML exists.
[https://en.wikipedia.org/wiki/MHTML](https://en.wikipedia.org/wiki/MHTML)
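
For anyone curious what is inside such a file: MHTML is essentially a MIME
`multipart/related` message, the same container used for email with inline
attachments. A rough Python sketch using only the standard `email` package is
below. The function name `build_mhtml`, the `example.com` URLs, and the image
bytes are placeholders I made up, and browsers write additional headers that
this sketch omits.

```python
from email import message_from_string
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(url, html, images):
    """Bundle a page and its images into one multipart/related message.

    `images` (hypothetical) maps a resource URL to (image subtype, raw bytes).
    """
    bundle = MIMEMultipart("related")
    bundle["Subject"] = "Saved page"

    # The first part is the page itself, tagged with where it came from.
    page = MIMEText(html, "html", "utf-8")
    page["Content-Location"] = url
    bundle.attach(page)

    # Each further part is a resource the page refers to by its URL.
    for location, (subtype, data) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = location
        bundle.attach(part)

    return bundle.as_string()


# Illustrative data only.
mhtml_text = build_mhtml(
    "https://example.com/article",
    '<img src="https://example.com/logo.png">',
    {"https://example.com/logo.png": ("png", b"\x89PNG\r\n\x1a\n")},
)

# Reading it back shows one HTML part and one image part in a single file.
parsed = message_from_string(mhtml_text)
print(parsed.get_content_type(), len(parsed.get_payload()))
```

So the "HTML file plus a folder" collapses into one text file, at the cost of
depending on the reader understanding the MIME container.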

~~~
tonyedgecombe
I don't know about other browsers but Safari doesn't support it.

Not only that, but HTML/CSS/JavaScript has been a moving target over the last
couple of decades. Sometimes people need something with longevity.

------
russellbeattie
This has been a problem for years, which is too bad. No one really uses MHTML
or any alternative. Hopefully Web Bundles* becomes a commonly supported spec.

* [https://web.dev/web-bundles/](https://web.dev/web-bundles/)

~~~
lucabenazzi
The problem is that technology that is not widely used/distributed/supported
is not reliable for archiving purposes. The idea is good in principle. I've
just learned that there's an open format called WARC that was explicitly
created for web archiving purposes. Is there any relationship with what you
are describing?

------
Apocryphon
Isn’t this what read mode is supposed to be for?

~~~
seppel
Reader view is (to my understanding) a heuristic that gives you the biggest
wall of text of the article. If it works, it is nice. But in these cases you
can usually also copy and paste.

However, it does not work if what you want to save is not a wall of text: some
table, some receipt, etc. It does not work for the Google search box, for
Facebook, or Amazon, just to give a few examples.
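
To illustrate what I mean by "biggest wall of text", here is a toy Python
version of such a heuristic. It is only my guess at the general idea, not how
any browser's reader view is actually implemented; the tag list, the class
name `BiggestTextBlock`, and the sample page are assumptions for the sketch.

```python
from html.parser import HTMLParser

# Elements treated as candidate containers (an assumption for this sketch).
BLOCK_TAGS = {"article", "div", "section", "main", "td"}


class BiggestTextBlock(HTMLParser):
    """Remember the block element that directly holds the most text."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one text buffer per currently open block element
        self.best = ""

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append([])

    def handle_data(self, data):
        # Text is credited to the innermost open block element only.
        if self.stack:
            self.stack[-1].append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            text = " ".join(" ".join(self.stack.pop()).split())
            if len(text) > len(self.best):
                self.best = text


def biggest_wall_of_text(html):
    parser = BiggestTextBlock()
    parser.feed(html)
    parser.close()
    return parser.best


# Illustrative page: navigation, an article, and a small receipt table.
sample = (
    "<div><a>Home</a></div>"
    "<article><p>Long paragraph one.</p><p>Second paragraph.</p></article>"
    "<table><tr><td>$9</td></tr></table>"
)
extracted = biggest_wall_of_text(sample)
print(extracted)
```

The article text wins and the receipt cell is dropped, which is exactly the
failure mode when the table was the thing you wanted to keep.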

