Just for articles, mind you, not entire websites.
- screenshot urcodiaz.png --fullpage
A better option is to render a page with JS turned on and save the resulting HTML.
It came out of a screenshot/archiver I've been working on at Mozilla, but I've split it up, as the screenshotting is shipping and DOM archiving is still way outside Mozilla's comfort zone.
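If you just want that render-then-save behaviour from the command line, headless Chrome's --dump-dom prints the DOM as HTML after the page's JS has run. A minimal sketch (the URL and output file are placeholders):

$ google-chrome --headless --disable-gpu --dump-dom 'http://example.com' > example.html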
However, as you point out, PDFs are designed around the printed page, not the flowing arbitrary-page-size documents of the web.
I.e. DOM copy > screenshot > wget?
Well, I took a screenshot, better than nothing.
$ google-chrome --headless --disable-gpu --print-to-pdf 'http://example.com'
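If I remember right, --print-to-pdf with no value writes output.pdf to the current directory; it also accepts an explicit path:

$ google-chrome --headless --disable-gpu --print-to-pdf=example.pdf 'http://example.com'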
Edit: I'm completely baffled that such widely used software as Google Chrome can have this written in the man page: "Google Chrome has hundreds of undocumented command-line flags that are added and removed at the whim of the developers."
"grab-site is made possible only because of wpull, written by Christopher Foo who spent a year making something much better than wget. ArchiveTeam's most pressing issue with wget at the time was that it kept the entire URL queue in memory instead of on disk. wpull has many other advantages over wget, including better link extraction and Python hooks."
Also has a Pocket import feature.
wget -nc -np -E -H -k -K -p -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.4) Gecko/20070802 SeaMonkey/1.1.4' -e robots=off
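For anyone decoding that, the flags are standard GNU wget options:

  -nc            skip files that already exist locally
  -np            never ascend to the parent directory
  -E             append .html to files served as HTML
  -H             span hosts, so assets on other domains/CDNs get fetched too
  -k             convert links in saved pages for local viewing
  -K             keep the original file as .orig before converting
  -p             fetch all page requisites (images, CSS, JS)
  -U             spoof the User-Agent string
  -e robots=off  ignore robots.txt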
There are three backup formats: XML (same as the Delicious v1 API), HTML (the legacy Netscape format almost everyone can read) and JSON.
I can get back to you with an XML file later. There's also an API that might be of interest.
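For the curious, that HTML format is the classic Netscape bookmark file, which looks roughly like this (illustrative, from memory of a Pinboard export, not a real line):

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<DL><p>
<DT><A HREF="http://example.com/" ADD_DATE="1493119661" PRIVATE="0" TOREAD="0" TAGS="programming,docker">Example title</A>
</DL><p>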
Here's someone who blogged about that:
The question was asked whether you can plug your API endpoint URL (https://[pinboardusername]:[pinboardpassword]@api.pinboard.i...) straight into IFTTT. Maciej confirmed that it should work; the problem is that you're essentially storing your login credentials in a 3rd-party service, and you don't know whether they're storing and transmitting them securely.
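To make the concern concrete: with that endpoint style, the credentials travel inside the URL itself, so anything that logs the URL sees them. An illustrative call (api.pinboard.in/v1/posts/all is a real endpoint; the credentials are placeholders):

$ curl 'https://username:password@api.pinboard.in/v1/posts/all'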
Could you add an option to either add tagging or separate the tagged items into folders?
e.g. "programming/", "docker/", etc. I often find myself digging through my Pocket archive trying to find that one article I found 6 months ago, and it gets incredibly annoying.
I have half a mind to fork this and add something like https://github.com/internetarchive/warcprox, or at the very least walk through the generated HTML and brute-force inline all assets as a first pass :)
This must happen either in the browser or in a proxy like the linked warcprox, in order to catch everything. But the proxy solution is getting less practical every day with key pinning and HSTS.
Maybe a future Firefox will have an option to export everything to WARC?
Not sure if it's worth including in my script though, since WARCs aren't easily browseable (correct me if I'm wrong).
I've had good luck with WebArchivePlayer: https://github.com/ikreymer/webarchiveplayer
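If it's the tool I'm thinking of, it's pip-installable and can be pointed straight at a WARC, something like this (from memory of the README, so double-check):

$ pip install webarchiveplayer
$ webarchiveplayer archive.warc.gz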
Probably just a one-line regex change: https://github.com/pirate/pocket-archive-stream/blob/master/...
May I suggest including a sample line of the Pocket export format (your input format) in the readme.md?
Thanks for publishing!
And a comment next to the regex for parsing it: https://github.com/pirate/pocket-archive-stream/blob/master/...
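For anyone curious before that lands in the README: a Pocket export line looks roughly like this (illustrative, from memory of ril_export.html):

<li><a href="http://example.com/article" time_added="1493119661" tags="programming,docker">Example title</a></li>

And a quick-and-dirty way to pull the URLs out (a hypothetical GNU grep one-liner, not the project's actual regex):

$ grep -oP 'href="\K[^"]+' ril_export.html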