
Show HN: Percollate – a command-line tool to grab web pages as PDFs - danburzo
https://github.com/danburzo/percollate
======
anonytrary
You might want to include some actual pictures of the input and output in the
readme. The current examples are just one-line command snippets which aren't
as useful to someone who hasn't decided to use the tool yet.

~~~
danburzo
You're totally right. I will add some examples to the README so it's clearer
what it does (passing the article through Readability, et cetera)

------
danburzo
I’ve been sporadically working on this over the last couple of weeks, and I
think it’s now stable enough to get other people’s feedback on it. I got the
idea while perusing Simon Wardley’s mapping book-in-progress
([https://medium.com/wardleymaps](https://medium.com/wardleymaps)), and I
wondered whether I could bundle all the chapters into a decent-looking PDF. (It
works pretty well for that purpose.) I also wanted it to be a sample app for
gluing things together for the purpose of producing books in the browser.

I’d love it if you gave it a spin; please let me know if you find anything
nasty!

------
dananjaya86
How is it different from, let's say:

chrome --headless --disable-gpu --print-to-pdf
[https://www.google.com/](https://www.google.com/)

~~~
cronopios
`--print-to-pdf` is very limited.

As soon as you need to control anything, you have to use Puppeteer
([https://developers.google.com/web/tools/puppeteer/](https://developers.google.com/web/tools/puppeteer/)).

------
burtonator
Polar has a similar feature if you're just wanting an archive of web pages:

[https://getpolarized.io/](https://getpolarized.io/)

We support 'captured' HTML pages. Basically what we do is we fetch the full
HTML of the content and store it in a PHZ file (polar HTML archive) and then
we save that to disk (it's just a zip file with JSON metadata).

The Polar app is an Electron app so it has full access to render HTML.

We then inject ourselves into the network layer using protocol interceptors and
if you're loading the URL you just captured we load the content from the PHZ
instead of the network.
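
To make the capture-then-intercept idea concrete, here is a minimal Python sketch. It is not Polar's actual code or the real PHZ layout: the entry names `metadata.json` and `content.html`, and both function names, are assumptions made up for illustration. It only shows the general shape of a zip file with JSON metadata and a lookup that serves the stored copy when the requested URL matches the capture.

```python
import io
import json
import zipfile
from typing import Optional


def write_archive(url: str, html: str) -> bytes:
    """Pack one captured page into an in-memory zip with JSON metadata.

    The entry names are hypothetical, not the real PHZ layout.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("metadata.json", json.dumps({"url": url}))
        zf.writestr("content.html", html)
    return buf.getvalue()


def load_if_captured(archive: bytes, requested_url: str) -> Optional[str]:
    """Return the stored HTML if the URL matches the capture, else None."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        meta = json.loads(zf.read("metadata.json"))
        if meta["url"] != requested_url:
            return None  # the caller would fall back to the network
        return zf.read("content.html").decode("utf-8")
```

In a real app the second function would sit behind whatever request hook the runtime provides; here it is a plain function so the logic can be read on its own.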

You can then annotate the content, take notes on it, tag it, and keep it
forever without risk of it vanishing.

I use it for important documents that I can't afford to ever lose. For
example, the Ethereum whitepapers are in HTML, not PDF. They're also living
documents, so I can just capture them anytime I want.

HTML files don't often print properly so this way I can keep them the way they
were meant to be seen.

~~~
webwanderings
Hey, this is pretty cool. Just watched your videos. You certainly have a tool
that suits the average person, but there may be a barrier to natural use:
setting page marks is not a natural activity, but scrolling is. So if
the system remembers where you are in the document as you scroll while
reading, then that's a winner. In addition, a bookmarklet/icon in the browser
itself could be better (but you most likely know this already).

In any case, many people have tried creating such a tool. I used to
believe that such functionality should be part of the browser itself. But
there's always been a disconnect between local files and the browser. Now in
the mobile world, it is even more difficult to stay in sync.

------
dustingetz
Examples please, and can you show me the differences made by the enhancements?

~~~
danburzo
Roger that! I'll be adding examples to the README. The main enhancement right
now is running the article through Readability, the others are just trying to
fetch the best-quality image, and removing some things from Wikipedia articles
— so, more of a placeholder for future enhancements :)

~~~
Ninjaneered
Speaking of Wikipedia articles, have you checked out their "book creator" [0]?
It's a link on the left of every page. It has some significant issues [1] and
as of now can't be used to download PDFs; it seems to have used an "Offline
Content Generator" [2] that became unrepairable. I wonder if something like
Percollate could be used as the backend generator for Wikipedia's book creator?

[0]
[https://en.wikipedia.org/w/index.php?title=Special:Book&book...](https://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Main+Page)

[1]
[https://www.mediawiki.org/wiki/Reading/Web/PDF_Functionality](https://www.mediawiki.org/wiki/Reading/Web/PDF_Functionality)

[2]
[https://www.mediawiki.org/wiki/Offline_content_generator](https://www.mediawiki.org/wiki/Offline_content_generator)

~~~
danburzo
I think they're already looking into something similar, either through
Electron, or their own wrapper over Chromium. I understand from skimming the
resources you linked to that they had reliability issues with Electron, and I
can confirm from a separate project that Electron and Puppeteer can become a
little slow / unwieldy for a large-scale project. So I'm not sure :/

------
heinrichhartman
Just tried it on their GitHub page:

percollate pdf --output p.pdf
[https://github.com/danburzo/percollate](https://github.com/danburzo/percollate)

The font is gigantic and the page tiny. Barely get to the second headline on
the first page.

And there is no way to tune this on the command line (yet).

~~~
danburzo
You're right, the default page size (A5) and font size are things that need to
be easily customizable. I resisted adding many options to the CLI, but I think
this is a good way to solve it?
[https://github.com/danburzo/percollate/issues/27](https://github.com/danburzo/percollate/issues/27)

~~~
heinrichhartman
Yes. This would work.

It's probably a good idea to not introduce flags like `--font-size 12pt` or
`--page A4`, since that leads down a rabbit hole (where do you stop?). Directly
passing down CSS seems appropriate here.
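
As an illustration of why one CSS string can stand in for a pile of flags, here is a small Python sketch. The helper `page_css` is hypothetical, and how percollate would actually receive the CSS is not shown in this thread. The sketch only demonstrates that page size and font size both reduce to ordinary CSS rules.

```python
def page_css(page_size: str = "A5", font_size: str = "12pt") -> str:
    """Build a CSS override string from two settings (hypothetical helper).

    @page { size: ... } is standard paged-media CSS; the html rule sets the
    base font size. Any further tweak is one more rule, not one more flag.
    """
    return f"@page {{ size: {page_size}; }} html {{ font-size: {font_size}; }}"


# Example: an A4 page with a smaller base font.
css = page_css("A4", "10pt")
```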

~~~
danburzo
Implemented in the latest version :)

------
dvfjsdhgfv
This made me smile:

> percollate html Not implemented yet

~~~
danburzo
Hah, yes :) It's more that I haven't figured out what the HTML version should
look like, and whether it should be self-contained (images, styles), how to
handle web fonts, et cetera

------
v01d4lph4
Nice!

~~~
danburzo
Thank you!

