Show HN: Simple Print – Convert web articles into printable PDFs (fivefilters.org)
122 points by k1m on Feb 4, 2018 | 39 comments

This would be insanely more useful if it could display math correctly. For example, most of the math in https://colah.github.io/posts/2014-03-NN-Manifolds-Topology only shows the LaTeX code.

Thanks, we'll have to look into that.

Hey, FiveFilters contributors: I’m a big fan of your work. How hard would it be to make a web-proxy-type service that produced static HTML with working links instead of a PDF?

The idea is to strip out all of the JavaScript before it even hits the client. People could then use it exclusively for news reading. I suspect it would reduce bandwidth and client side energy as well as block essentially all web trackers.

I ask because it looks like you’ve done most of the heavy lifting between this and your full-text RSS feed service.


This sounds interesting! Some product _landing_ pages are not readable without JavaScript. With this you could at least figure out what the product is about.

Once I'm enrolled I can enable JavaScript when the product makes use of it. :)

How about Readable-Web Proxy?


Tracking can be done with <img> tags and (recently) with CSS.

The idea is that everything would run through a normalizing presentation layer (like the PDF converter and RSS full text services do), and only the information needed for rendering would be sent to the client, so <img> would be rewritten to some other url and served by the proxy.

The main problem I see is that you can embed trackers in the link URLs.

(You probably want the proxy to be a stateless cache shared by many users, which prevents tracker stuff from persisting in whatever rendering engine it uses for the conversion)
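A minimal sketch of the normalizing step described above, using only Python's standard library. It drops `<script>` elements and inline `on*` handlers and rewrites `<img src>` to go through the proxy. The names `PROXY_PREFIX` and `sanitize` and the `proxy.example` host are made up for illustration; this is not how FiveFilters does it, and it does nothing about CSS-based tracking, `srcset`, or trackers embedded in link URLs.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import quote

# Hypothetical proxy endpoint that would fetch and re-serve images.
PROXY_PREFIX = "https://proxy.example/img?u="

# Elements that never take a closing tag.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class Sanitizer(HTMLParser):
    """Re-emits HTML without scripts, with image URLs routed via the proxy."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        rendered = [tag]
        for name, value in attrs:
            if name.startswith("on"):
                continue  # drop inline event handlers such as onclick
            if tag == "img" and name == "src" and value:
                value = PROXY_PREFIX + quote(value, safe="")
            if value is None:
                rendered.append(name)
            else:
                rendered.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(rendered) + ">")

    def handle_endtag(self, tag):
        if tag == "script":
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = Sanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

Because every image URL ends up pointing at the same proxy host, the origin site only ever sees the proxy's requests, which fits the shared stateless cache idea.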

Thank you for the kind words.

We haven't really thought much about running this kind of thing as a proxy service, or what it would be like to implement it. Our tools are really intended for web articles, so they will fail on lots of other kinds of content you'll encounter in day-to-day browsing. If the intention is to avoid trackers, the proxy will have to handle everything, and not everything will be web articles. Browser extensions might be best here.

For my personal browsing habits it wouldn’t be so bad. Basically, I just check a few news aggregators (like HN) and sites that have RSS feeds anyway, so support for HN would get me most of the way there. I’m not sure how many news aggregators are in common use, but I’d guess covering a dozen would get most of the addressable market.

Thanks for the response!

I'd like to add that I use tools from fivefilters.org daily. (Mostly Push to Kindle) If the authors are on this thread, thank you so much for making these tools free to use and for such consistent uptime.

Thank you, Tyler. Appreciate the kind words - that's really great to hear. :)

Pretty decent idea - I'd love to see the stats on how many people print things! About 10 years ago the agency I worked at made sure every site had a print-friendly style sheet which cropped out all the crap and left you with a very readable printed document. I don't think I've met anyone that's insisted on (or even thought about) that in the last few years.

My own interest was piqued the other day when I went to PDF-print a paywalled article for a friend and the formatting was atrocious. Would have been the perfect use case for this. But that was the point I thought "does anyone print anymore or do they just message the article?" Most of the print use cases that came up in my life (maps, recipes, contracts) have been solved/replaced by SaaS products, smartphones, tablets, Kindles etc.

Schools in the UK most definitely print.

I hope it has an open API, so that requesting a URL immediately returns a PDF for download.

I currently use PrintFriendly[0] for such things, but it was fully redesigned last year and its API changed. It is also no longer usable for web pages with code blocks and images.

[0] http://printfriendly.com

PrintFriendly has an API - https://www.printfriendly.com/api

Which web pages doesn't PrintFriendly support?

We'll have an API at some point for this. But our priority right now is making it easy to run yourself on a VPS instance.

Looks very cool! Could you explain the tech that you developed? Did you create a new algorithm for extracting the content from the page or use an existing one?

Can be broken down to roughly these steps:

1. The tool for extracting article content is Full-Text RSS[0]. It relies on arc90's original (open source) article detection algorithm, Readability, as well as a set of site-specific extraction rules which we maintain on GitHub[1].

2. Our PDF Newspaper[2] application then cleans that extracted article output to normalise HTML/CSS styles so we can present the output in our own layout without things breaking. We use CSS for multi-column output.

3. Finally, we use Chromium's new headless API to request the output generated in step 2 and create three PDFs, injecting CSS to increase the font size each time. We then check the three PDFs in order of smallest to largest font size; as soon as a larger font size creates an extra page, we discard that PDF and return the one before it. So if PDF 1 (small font size) has three pages, PDF 2 (medium font size) also has three pages, but PDF 3 (large font size) has four pages, then we return PDF 2.

[0] http://fivefilters.org/content-only/

[1] https://github.com/fivefilters/ftr-site-config

[2] http://fivefilters.org/pdf-newspaper/
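The selection rule in step 3 can be sketched as a small function. This is only an illustration of the logic as described, not the actual Simple Print code: the function name is made up, and it assumes the page count of each rendered PDF is already known and listed from smallest to largest font size.

```python
def pick_largest_font_without_extra_page(page_counts):
    """Return the index of the PDF to keep.

    page_counts holds the number of pages of each rendered PDF,
    ordered from smallest to largest font size.
    """
    if not page_counts:
        raise ValueError("need at least one rendered PDF")
    chosen = 0
    for i in range(1, len(page_counts)):
        # A larger font that spills onto an extra page is rejected,
        # and the previously accepted PDF is kept.
        if page_counts[i] > page_counts[chosen]:
            break
        chosen = i
    return chosen


# The example from step 3: three, three and four pages.
print(pick_largest_font_without_extra_page([3, 3, 4]))  # index 1, i.e. PDF 2
```

If every font size fits in the same number of pages, the largest one wins; if the medium size already adds a page, the smallest is returned.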

Does this support images? I tried to convert Ernest P. Chan's blog [0] into a pdf using this and the results were not well formatted.

[0] - http://epchan.blogspot.com/

We'll add images soon. As for the blog post, should work if you use the article URL: http://pdf.fivefilters.org/simple-print/url.php?size=A4#http...

Timeline for images?

Well done!

Step two seems almost certain to be copyright infringement in the UK. I'm not sure about the USA exactly, but as you take money and then copy the content, it seems likely to breach the fair use regulations.

Has this been addressed? Do you have an article on it or a position statement?

Cool stuff! I tried it on my blog and the result was excellent.

The original URL should be included somewhere.

That said, I don't quite understand the advantage of this solution versus a plain article view (e.g. Firefox's built-in article view or something similar provided by some plugin/service) saved to disk via a PDF printer.

Good point, we'll add original article URL at the bottom.

As for the advantages, at the moment Firefox's built-in reader view doesn't produce the kind of output we do. If you're curious, a few years ago we created a video of the kind of multi-column print layout we're experimenting with here. All HTML and CSS: https://www.youtube.com/watch?v=854Csokl3QA

So if they did create a more print-friendly output, it'd be closer. But there'd still be that step of you creating the PDF yourself. Our server-side approach allows it to be integrated in other kinds of applications. It also allows us to do things like return a PDF which doesn't include an additional page due to a slight overrun that can be avoided with a minor font size reduction (see my other comment explaining how we try to do that at the moment).

I've had co-workers share printed articles with me, perhaps because they think I'm more likely to read them than an emailed link. It would have been nicer to read them multi-column, sans ads, headers and nav.

For those using Safari, I've often had good results by turning on Reader Mode and then printing, which strips out a lot of the extraneous junk that articles online tend to have.

Very interesting, reminds me of the Chrome extension, send to my Kindle, which sends a pdf version of the web page to my Kindle.

Thanks! We also have an extension and app to convert web articles into Kindle ebooks: http://fivefilters.org/kindle-it/

> send to my Kindle, which sends a pdf version of the web page to my Kindle.

Too many Kindles in a single comment.

Could there be a way to print to EPub? PDF is uniquely bad for mobile readability but EPub solves for that

We already produce ePub and Kindle Mobipocket files in our Push to Kindle service. You can try it out here: http://fivefilters.org/kindle-it/

1. Enter a URL (we also have browser extensions)

2. Click 'Preview'

3. Select the No Kindle? tab and choose 'EPUB'

I use Mozilla Pocket to do essentially just that. I add articles to Pocket with Firefox Desktop and then read them on my Kobo e-reader. I find it much more pleasant to read long articles a page at a time instead of scrolling.

Nice idea! Unfortunately the PDF will not contain any of the headings of the original document.

Oh, no... It requires[0] payment for more than a few pages...

> Premium access key If you pay for premium access, enter your key here. A valid key will enable more pages in each PDF (provided the given feed contains more content), more full-text fetches, and the ability to change the subheading.

> Subscribe — 24 € a year

[0] http://fivefilters.org/pdf-newspaper/

That shouldn't apply to this. This is intended for single articles. PDF Newspaper was originally intended for converting RSS feeds to PDF. For large feeds we had to limit how much we processed to avoid overwhelming the servers.

I don't really get the use case for this.

macOS has had print to pdf for any app since its inception, iOS has had it for at least the last couple of versions, surely Windows and Android have similar functionality built in by now too?

The idea is to produce a PDF that makes the article more readable than if you relied on the default print output of the site itself.

In some cases, e.g. a site that's gone to the trouble of creating a good print stylesheet, printing from the site itself will probably create a more readable output than this. But many sites don't do that.

There's also the use case of automating PDF creation from code. For example, automatically sending a PDF to your printer (some wireless printers can now be set up with email addresses) of articles you save to read later in a read-later service.
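As a rough sketch of the email-to-printer case, assuming you already have the PDF bytes and that your printer accepts attachments by email. The function name, the addresses and the SMTP host are placeholders, not part of any FiveFilters tool.

```python
from email.message import EmailMessage


def build_print_email(pdf_bytes, sender, printer_address, filename="article.pdf"):
    """Wrap a PDF in an email addressed to the printer."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = printer_address
    msg["Subject"] = "Print: " + filename
    msg.set_content("Article attached for printing.")
    msg.add_attachment(
        pdf_bytes, maintype="application", subtype="pdf", filename=filename
    )
    return msg


# Sending would then look something like this (placeholder host):
#   import smtplib
#   with smtplib.SMTP("smtp.example.org") as smtp:
#       smtp.send_message(msg)
```

Hooking this up to a read-later service would just mean calling it for each newly saved article.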
