Show HN: Percollate – a command-line tool to grab web pages as PDFs (github.com)
123 points by danburzo 66 days ago | 22 comments



You might want to include some actual pictures of the input and output in the readme. The current examples are just one-line command snippets which aren't as useful to someone who hasn't decided to use the tool yet.


You're totally right. I will add some examples to the README so it's clearer what the tool does (passing the article through Readability, et cetera).


I’ve been sporadically working on this over the last couple of weeks, and I think it’s now stable enough to get other people’s feedback. I got the idea while perusing Simon Wardley’s mapping book-in-progress (https://medium.com/wardleymaps), and I wondered whether I could bundle all the chapters into a decent-looking PDF (it works pretty well for that purpose). I also wanted it to be a sample app for gluing things together to produce books in the browser.

I’d love it if you gave it a spin; please let me know if you find anything nasty!


How is it different from, let's say:

chrome --headless --disable-gpu --print-to-pdf https://www.google.com/


--print-to-pdf is very limited.

As soon as you need to control anything, you have to use [Puppeteer](https://developers.google.com/web/tools/puppeteer/)
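
For what it's worth, a minimal Puppeteer sketch doing the same print-to-PDF, but with control over format and margins (the URL and file name here are just examples):

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto('https://www.google.com/', { waitUntil: 'networkidle2' });
      // page.pdf() exposes page format, margins, backgrounds, header/footer
      // templates, etc. -- none of which --print-to-pdf lets you set.
      await page.pdf({
        path: 'page.pdf',
        format: 'A5',
        margin: { top: '1cm', right: '1cm', bottom: '1cm', left: '1cm' },
        printBackground: true
      });
      await browser.close();
    })();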


As @cronopios mentions, if you need a bit of flexibility in the output, you need to use a wrapper over Chromium — either Electron, or (the more recent) Puppeteer.

Other than that, it runs the article through Readability to extract just the main content and applies customizable CSS / HTML to it.
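
For the curious, the Readability step looks roughly like this. This is a sketch, not percollate's exact code; it assumes the jsdom and @mozilla/readability packages:

    const { JSDOM } = require('jsdom');
    const { Readability } = require('@mozilla/readability');

    function extractArticle(html, url) {
      // Parse the fetched page, then let Readability strip it down
      // to just the main article content.
      const dom = new JSDOM(html, { url });
      const article = new Readability(dom.window.document).parse();
      // article.title and article.content (cleaned-up HTML) can then
      // be poured into a template and styled with custom CSS.
      return article;
    }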


It uses Puppeteer, so basically it's Chrome under the hood.


Polar has a similar feature if you just want an archive of web pages:

https://getpolarized.io/

We support 'captured' HTML pages. Basically, we fetch the full HTML of the content, store it in a PHZ file (Polar HTML archive), and save that to disk; it's just a zip file with JSON metadata.

The Polar app is an Electron app so it has full access to render HTML.

We then inject ourselves into the network layer using protocol interceptors, and when you load a URL you've already captured, we serve the content from the PHZ instead of the network.
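
To illustrate the idea, here's a rough sketch in Electron (not Polar's actual code; lookupInArchive is a hypothetical helper standing in for the PHZ lookup):

    const { app, protocol } = require('electron');

    // Hypothetical helper: look up a URL in the PHZ archive (the zip
    // with JSON metadata described above) and return its bytes.
    function lookupInArchive(url) {
      return null; // stub; a real version would read from the zip
    }

    app.whenReady().then(() => {
      protocol.interceptBufferProtocol('https', (request, callback) => {
        const cached = lookupInArchive(request.url);
        if (cached) {
          // Serve the captured bytes instead of hitting the network.
          callback({ mimeType: cached.mimeType, data: cached.data });
        } else {
          // interceptBufferProtocol takes over *all* https requests, so a
          // real interceptor must fetch uncached URLs itself (omitted here).
          callback({ mimeType: 'text/plain', data: Buffer.from('') });
        }
      });
    });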

You can then annotate the content, take notes on it, tag it, and keep it forever without risk of it vanishing.

I use it for important documents that I can't afford to ever lose. For example, the Ethereum whitepapers are in HTML, not PDF. They're also living documents, so I can just capture them anytime I want.

HTML files often don't print properly, so this way I can keep them the way they were meant to be seen.


Hey, this is pretty cool. Just watched your videos. You've certainly built a tool that's natural for the average person to use, but there may be a barrier in that natural use: setting page marks isn't a natural activity, while scrolling is. So if the system remembers where you are in the document as you scroll while reading, that's a winner. In addition, a bookmarklet/icon in the browser itself could be better (but you most likely know this already).

In any case, many people have tried creating such a tool. I once believed that such functionality should be part of the browser itself, but there's always been a disconnect between local files and the browser. Now, in the mobile world, it's even difficult to stay in sync.


Very, very cool! Thanks for the pointer.


Examples please, and can you show me the differences made by the enhancements?


Roger that! I'll be adding examples to the README. The main enhancement right now is running the article through Readability; the others are fetching the best-quality image available and removing some clutter from Wikipedia articles, so it's more of a placeholder for future enhancements :)
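
To give a flavor of the image enhancement, a hypothetical sketch (percollate's actual heuristic may differ) that picks the widest candidate from an img element's srcset:

    function bestImageSource(img) {
      const srcset = img.getAttribute('srcset');
      if (!srcset) return img.src;
      const candidates = srcset.split(',').map(entry => {
        const [url, descriptor] = entry.trim().split(/\s+/);
        // Descriptors look like "320w" or "2x"; parseInt handles both
        // well enough for a rough ranking.
        return { url, size: parseInt(descriptor, 10) || 0 };
      });
      // Largest candidate first.
      candidates.sort((a, b) => b.size - a.size);
      return candidates[0].url;
    }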


Speaking of Wikipedia articles, have you checked out their "book creator" [0]? It's a link on the left of every page. It has some significant issues [1] and as of now can't be used to download PDFs; it relied on an "Offline Content Generator" [2] that became unrepairable. I wonder if something like Percollate could be used as the backend generator for Wikipedia's book creator?

[0] https://en.wikipedia.org/w/index.php?title=Special:Book&book...

[1] https://www.mediawiki.org/wiki/Reading/Web/PDF_Functionality

[2] https://www.mediawiki.org/wiki/Offline_content_generator


I think they're already looking into something similar, either through Electron or their own wrapper over Chromium. I understand from skimming the resources you linked to that they had reliability issues with Electron, and I can confirm from a separate project that Electron and Puppeteer can become a little slow and unwieldy at large scale. So I'm not sure :/


Just tried it on their GitHub page:

percollate pdf --output p.pdf https://github.com/danburzo/percollate

The font is gigantic and the page tiny; I barely get to the second headline on the first page.

And there is no way to tune this on the command line (yet).


You're right, the default page size (A5) and font size are things that need to be easily customizable. I resisted adding many options to the CLI, but I think this is a good way to solve it? https://github.com/danburzo/percollate/issues/27


Yes. This would work.

It's probably a good idea not to introduce flags like `--font-size 12pt` or `--page A4`, since that leads down a rabbit hole (where do you stop?). Directly passing down CSS seems appropriate here.


Implemented in the latest version :)
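
So, assuming the flag ended up being spelled --css as discussed (check percollate --help for the exact name), usage would look something like:

percollate pdf --css "html { font-size: 10pt } @page { size: A4 }" --output p.pdf https://github.com/danburzo/percollate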


This made me smile:

> percollate html
> Not implemented yet


Hah, yes :) It's more that I haven't figured out what the HTML version should look like: whether it should be self-contained (images, styles), how to handle web fonts, et cetera.


Nice!


Thank you!



