Saving web pages as PDFs in 2019, a real challenge (humaneinterface.net)
37 points by lucabenazzi on Dec 14, 2019 | 19 comments

Firefox (Mac) is not even able to print this article to a PDF. It gives you only one page, and the page ends mid-sentence with the letters cropped in half. One third of the text is missing.

I can't count the number of times I had to take screenshots to save crucial information from web pages.

A tip if you're using iOS/iPadOS: in Safari, taking a screenshot now supports a "full page" capture, which you can save as a PDF. You get this option when you tap the tiny screenshot preview.

I had no idea. I’ve been using an app called Tailor to stitch together multiple screenshots of web pages. Until today, that is, thanks to your tip. Thank you!

>Saving pages as HTML is not ideal because a) you get an HTML file plus a folder, not very practical if you want to retrieve them later, and b) you never know how that page is going to render in future versions of your browser.

Yes, but for my use case, which is finding better means of scientific communication, PDF is not enough.

Consider, for example, slides for a presentation. The typical mathematician writes them in TeX, which outputs a PDF. Then the PDF is (sometimes) made available online. I realized that PDF slides are far inferior to HTML slides (where you can add demos and whatnot, shameless example [0]). Just put everything in a GitHub repository and anybody can take them home.

[0] https://mbuliga.github.io/emergent-10-years/presentation.htm...

I gave up on the idea of reliably saving web pages as PDFs.

I now use "SingleFile", a Firefox or Chrome extension that helps to save a complete page (with CSS, images, fonts, frames, etc.) as a single HTML file.




Great job! I suggest that you go deeper into how the annotation feature works in the add-on description, as that could be useful. My main issue with this solution is whether this file format is going to be reliable in the future. As with MHTML, which was mentioned in another comment, formats that are not widely used may one day not be supported the same way they are today. I'm not sure how you've achieved this, as HTML clearly cannot hold media files by itself (unless there's something I am not aware of), so it must do something at the file system level. Will it still work in future versions of the operating system?
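
On the question of how one HTML file can hold media: HTML can embed binary resources inline as base64 `data:` URIs, so nothing at the file system level is required. The Python sketch below illustrates the idea only. It is not SingleFile's code, and whether SingleFile works exactly this way is an assumption; `inline_images` and the `resources` mapping are hypothetical names made up for this example.

```python
import base64
import re


def inline_images(html: str, resources: dict[str, tuple[str, bytes]]) -> str:
    """Replace src="..." references with base64 data: URIs.

    `resources` maps a URL exactly as written in the HTML to a
    (mime_type, raw_bytes) pair. Unknown URLs are left untouched.
    """
    def to_data_uri(match: re.Match) -> str:
        url = match.group(2)
        if url not in resources:
            return match.group(0)
        mime, data = resources[url]
        encoded = base64.b64encode(data).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{encoded}{match.group(3)}"

    # Naive pattern: only handles double-quoted src attributes.
    return re.sub(r'(src=")([^"]+)(")', to_data_uri, html)


page = '<html><body><img src="logo.png"></body></html>'
# Placeholder bytes (just the PNG signature), standing in for a real image.
resources = {"logo.png": ("image/png", b"\x89PNG\r\n\x1a\n")}
single_file = inline_images(page, resources)
print(single_file)
```

Because the result is ordinary HTML, its longevity depends on browsers continuing to support `data:` URIs rather than on a separate container format.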

I recommend https://webrecorder.io/ to reliably capture websites.

Thanks so much for your comment. I've read the original blog post (https://blog.webrecorder.io/2019/08/29/desktop-app.html) and it sounds like a very effective means of capturing web content. I was not aware there's a standard for web archiving purposes; that sounds like something that will keep being supported in the future. And from what I read, it's better than PDF, as it captures interactive elements as well. I will give it a try.

Shouldn't whoever was responsible for the design (UI/UX) have supplied CSS styles for printing?


I had a similar problem and wrote a browser add-on: https://2read.net/ It converts websites to a "readable" form, and if you have IPFS running, it will also "pin" content locally. In most cases it works better than just printing an article. Here is an example with the mentioned article: https://ipfs.io/ipfs/QmYPkcXgKLBye3L8M1VJWsGAb2mJXkJSEncqcSC...

This is brilliant! I like the fact that it also cleans up the page. I don't know much about MFS, but I see there's a video linked from the add-on description, so I will take some time to watch it. My main concern is which format is going to remain readable in the long term, so it should be either something widely accepted and distributed like PDF, or something that is going to be backward compatible and run smoothly on future machines. Thanks for your answer.

There is also an HN discussion about this add-on: https://news.ycombinator.com/item?id=18965729

For those interested, here's an HN discussion about the IPFS distributed file system: https://news.ycombinator.com/item?id=18650375

>Saving pages as HTML is not ideal because a) you get an HTML file plus a folder, not very practical if you want to retrieve them later

MHTML exists. https://en.wikipedia.org/wiki/MHTML
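
For context, MHTML packs a page and its resources into one MIME `multipart/related` message, with each part labelled by a `Content-Location` header. A rough Python sketch of that structure, using only the standard `email` package, is below. The function name `build_mhtml` and the example URLs are hypothetical, the image type is hard-coded to PNG, and the output has not been checked against any particular browser's MHTML reader.

```python
from email import message_from_bytes
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(html: str, page_url: str, images: dict[str, bytes]) -> bytes:
    """Bundle an HTML page and its PNG images into one multipart/related message."""
    bundle = MIMEMultipart("related")
    bundle["Subject"] = "Saved page"

    # The first part is the HTML document itself.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    bundle.attach(root)

    # Each resource becomes its own part, keyed by its original URL.
    for url, data in images.items():
        part = MIMEImage(data, _subtype="png")  # assumption: every image is a PNG
        part["Content-Location"] = url
        bundle.attach(part)

    return bundle.as_bytes()


mhtml = build_mhtml(
    '<html><body><img src="https://example.com/logo.png"></body></html>',
    "https://example.com/article",
    {"https://example.com/logo.png": b"\x89PNG\r\n\x1a\n"},  # placeholder bytes
)
parsed = message_from_bytes(mhtml)
print(parsed.get_content_type(), len(parsed.get_payload()))
```

The single file is just an email-style message, which is why reading it back depends on the browser shipping an MHTML parser.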

I don't know about other browsers but Safari doesn't support it.

Not only that, but HTML/CSS/JavaScript has been a moving target over the last couple of decades; sometimes people need something with longevity.

This has been a problem for years, which is too bad. No one really uses MHTML or any alternative. Hopefully Web Bundles* becomes a commonly supported spec.

* https://web.dev/web-bundles/

The problem is that technology that is not widely used, distributed, and supported is not reliable for archiving purposes. The idea is good in principle. I've just learned that there's an open format called WARC that was explicitly created for web archiving purposes. Is there any relationship with what you are describing?

Isn’t this what reader mode is supposed to be for?

Reader view is (to my understanding) a heuristic that picks out the biggest wall of text in the article. If it works, it is nice. But in those cases you can usually also copy and paste.

However, it does not work if what you want to save is not a wall of text: a table, a receipt, etc. It does not work for the Google search box, for Facebook, or for Amazon, to give a few examples.
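
To make the "biggest wall of text" heuristic concrete, here is a toy Python sketch. It is not how any browser's reader view is implemented. It assumes well-formed HTML with closed `<p>` tags, and it simply credits paragraph text to the nearest enclosing container and keeps the container with the most text. The names `BiggestTextBlock` and `extract_main_text` are made up for this example.

```python
from html.parser import HTMLParser

CONTAINERS = {"article", "section", "div", "main", "body"}


class BiggestTextBlock(HTMLParser):
    """Credit paragraph text to its nearest enclosing container element."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: list[int] = []            # ids of currently open containers
        self.text: dict[int, list[str]] = {}  # container id -> paragraph chunks
        self.in_paragraph = 0
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.stack.append(self.next_id)
            self.text[self.next_id] = []
            self.next_id += 1
        elif tag == "p":
            self.in_paragraph += 1

    def handle_endtag(self, tag):
        if tag in CONTAINERS and self.stack:
            self.stack.pop()
        elif tag == "p" and self.in_paragraph:
            self.in_paragraph -= 1

    def handle_data(self, data):
        # Only text inside <p> counts; tables, forms, etc. are ignored.
        if self.in_paragraph and self.stack and data.strip():
            self.text[self.stack[-1]].append(data.strip())


def extract_main_text(html: str) -> str:
    """Return the paragraph text of the container holding the most of it."""
    parser = BiggestTextBlock()
    parser.feed(html)
    parser.close()
    blocks = [" ".join(chunks) for chunks in parser.text.values()]
    return max(blocks, key=len, default="")


page = """
<body>
  <div><p>Home</p><p>Login</p></div>
  <article><p>Saving web pages as PDFs is harder than it looks.</p>
  <p>Print stylesheets are often missing.</p></article>
  <div><table><tr><td>Total: 12.00</td></tr></table></div>
</body>
"""
print(extract_main_text(page))
```

In this example the receipt-like table is dropped entirely, which is the failure mode described above for pages that are not mostly prose.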
