
Show HN: Pocket Stream Archive – A personal Way-Back Machine - nikisweeting
https://github.com/pirate/pocket-archive-stream
======
vijucat
I still use Firefox ScrapBook for this: [https://addons.mozilla.org/en-US/firefox/addon/scrapbook/](https://addons.mozilla.org/en-US/firefox/addon/scrapbook/)

Just for articles, mind you, not entire websites.

~~~
cJ0th
Ditto. I just hope the transition to a WebExtension will be smooth, or at
least happens at all.

------
jl6
Screenshotting or PDFing a website is an increasingly important archiving
tool to supplement wget. I've come across a lot of websites that won't render
any content unless they're connected to a live server.

~~~
nikisweeting
I couldn't agree more. I wish more sites would load without needing multiple
seconds of JS execution and AJAX. One of my TODOs is to get full-page
screenshots working as well.

~~~
tomcooks
You might find this useful if you use Firefox:

- shift+f2

- screenshot urcodiaz.png --fullpage

~~~
Drdrdrq
Thank you for this tip! I've always installed the Abduction plugin when I
needed this, but the fewer plugins the better.

------
avian
What version of Google Chrome do you need for the PDF export to work? I tried
it on 58.0.3029.96 (Linux) and this does nothing (no error messages, it just
quits without writing any files):

$ google-chrome --headless --disable-gpu --print-to-pdf 'http://example.com'

Edit: I'm completely baffled that such widely used software as Google Chrome
can have this written in the man page: "Google Chrome has hundreds of
undocumented command-line flags that are added and removed at the whim of the
developers."

~~~
nikisweeting
59 or later; --headless is a brand-new feature. apt-get install google-chrome-canary.

[https://developers.google.com/web/updates/2017/04/headless-chrome](https://developers.google.com/web/updates/2017/04/headless-chrome)

------
toomuchtodo
Highly recommend switching to wpull
([https://github.com/chfoo/wpull](https://github.com/chfoo/wpull)), which was
built as a wget replacement. It's what grab-site uses, which is a successor to
ArchiveTeam's ArchiveBot.

"grab-site is made possible only because of wpull, written by Christopher Foo
who spent a year making something much better than wget. ArchiveTeam's most
pressing issue with wget at the time was that it kept the entire URL queue in
memory instead of on disk. wpull has many other advantages over wget,
including better link extraction and Python hooks."
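A rough sketch of driving wpull for WARC output, based on the flags mentioned above (the exact flag names and the .warc.gz output naming are assumptions about wpull's wget-compatible CLI):

```python
import shutil
import subprocess

def wpull_warc(url, warc_name="archive", wpull_bin="wpull"):
    """Crawl a URL into a WARC with wpull (pip install wpull).

    --warc-file is the option grab-site relies on; --page-requisites
    mirrors wget -p to pull in images, CSS, and JS.
    """
    cmd = [wpull_bin, url,
           "--warc-file", warc_name,   # enables WARC output
           "--page-requisites"]        # fetch embedded page assets
    if shutil.which(wpull_bin):  # only run if wpull is installed
        subprocess.run(cmd, check=True)
    return cmd
```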

~~~
nikisweeting
This looks awesome, thanks for the suggestion! It'll help with WARC support
as well; looks like it can output WARCs with just a CLI flag.

------
throw98987
Use zotero and you have your own personal Pocket with snapshots. In addition,
you can add tags, organize stuff into folders, etc.
[https://www.zotero.org/](https://www.zotero.org/)

~~~
nikisweeting
Zotero is awesome! It doesn't provide a publishable stream of recently added
articles though afaik.

~~~
bovine3dom
You can, if you faff a little bit [0]. Unless you mean something else by
publishable?

[0]
[https://www.zotero.org/support/groups#public_open_membership](https://www.zotero.org/support/groups#public_open_membership)

------
ticoombs
I've been running wallabag[1] as my own Pocket instance. It's been running
perfectly for a couple of years.

Also has a pocket import feature.

[1] [https://wallabag.org/en](https://wallabag.org/en)

------
antman
The demo does not have images. Maybe try

wget -nc -np -E -H -k -K -p -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4' -e robots=off

~~~
nikisweeting
I opted not to download images using wget. I figured if I needed in-article
images the PDF+screenshot would be enough.

------
unicornporn
If it would be connected to Pinboard.in as an alternative to Pocket I would be
screaming with joy. :-)

~~~
nikisweeting
If you can get me a sample pinboard export to look at, I'll whip up a regex
that makes it work.

~~~
unicornporn
Oh, thanks!

There are three backup formats: XML (same as the Delicious v1 API), HTML (the
legacy Netscape format almost everyone can read), and JSON.

I can get back with an XML file later. There's also an API that might be of
interest.

Here's someone who blogged about that:

[http://behindcompanies.com/2011/12/a-guide-to-backing-up-pinboard/](http://behindcompanies.com/2011/12/a-guide-to-backing-up-pinboard/)

The question was asked whether you can plug your API endpoint URL
([https://[pinboardusername]:[pinboardpassword]@api.pinboard.in/v1/posts/all?format=json](https://[pinboardusername]:[pinboardpassword]@api.pinboard.in/v1/posts/all?format=json))
straight into IFTTT. Maciej confirmed that it should work; the problem is that
you're essentially storing your login credentials in a third-party service, and
you don't know whether they're storing and transmitting them securely.
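One way to soften the credentials problem above is Pinboard's API token, which can be revoked without changing your password. A small sketch of building the export URL that way (the token comes from the settings/password page; the endpoint is the same posts/all call):

```python
from urllib.parse import urlencode

def pinboard_export_url(username, api_token, fmt="json"):
    """Build the posts/all export URL using an API token
    (username:token) instead of embedding the account password.
    """
    query = urlencode({"auth_token": f"{username}:{api_token}",
                       "format": fmt})
    return f"https://api.pinboard.in/v1/posts/all?{query}"
```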

------
djhworld
This is cool.

Could you add an option to either add tagging, or separate the tagged items
into folders?

e.g. "programming/", "docker/", etc. I often find myself digging through my
Pocket archive trying to find that one article I saved six months ago, and it
gets incredibly annoying.

~~~
nikisweeting
I like having the sites by timestamp because they're guaranteed to be unique,
and it makes traversing them easy. I'd be happy to add a tag column to the
index though, which you could use with Ctrl+F to find articles.
[https://github.com/pirate/pocket-archive-stream/issues/1](https://github.com/pirate/pocket-archive-stream/issues/1)

------
anc84
Now if only Chromium could learn to write WARC archives, then it would be on
par! :)

Great project!

~~~
rcarmo
I've been thinking along those very same lines for a long time (this project
makes me wish for more free time).

I have half a mind to fork this and add something like
[https://github.com/internetarchive/warcprox](https://github.com/internetarchive/warcprox),
or at the very least walk through the generated HTML and brute-force inline
all assets as a first pass :)

~~~
throwaway7767
I've been thinking I'd love to have a WARC archive of all my browsing. So many
times, sites I remember seeing have gone offline and weren't archived by the
big services. Ideally this would happen with browser cooperation, so it could
save resources from complex dynamic pages, including responses to user
actions.

This must happen either in the browser or in a proxy like the linked warcprox,
in order to catch everything. But the proxy solution is getting less practical
every day with key pinning and HSTS.

Maybe a future firefox will have an option to export everything to WARC?

------
arkenflame
I wrote a Chrome extension that similarly saves copies of pages you bookmark:
[https://chrome.google.com/webstore/detail/backmark-back-up-the-page/cmbflafdbcidlkkdhbmechbcpmnbcfjf](https://chrome.google.com/webstore/detail/backmark-back-up-the-page/cmbflafdbcidlkkdhbmechbcpmnbcfjf)

------
fiatjaf
You can tell something is flawed in Redux when you have to pass strings
(uppercase constants defined somewhere) around, import them in every file,
and pass them as identifiers of what to do with each piece of data.

Strings!

~~~
nikisweeting
Did you comment on the wrong article by accident?
[https://news.ycombinator.com/item?id=14273549](https://news.ycombinator.com/item?id=14273549)

~~~
fiatjaf
Yes!

Thank you.

------
bicubic
This seems neat, curious what are the use cases for this?

~~~
nikisweeting
Slowing down the inevitable tide of
[https://en.wikipedia.org/wiki/Link_rot](https://en.wikipedia.org/wiki/Link_rot).
When I cite blog posts or want to share sites that have gone down, I can swap
out the links for my archived versions.

~~~
bicubic
Why is archive.org or one of the other centralized web archives not suitable
for that? They don't index the content you want to retain?

~~~
anc84
Never rely on others to care about things you need.

------
ents
Would be cool to see this for Instapaper or Pinboard

~~~
nikisweeting
My script should work with very minimal tweaking if you can get a list of URLs
+ titles from those services.

Probably just a one-line regex change: [https://github.com/pirate/pocket-archive-stream/blob/master/archive.py#L31](https://github.com/pirate/pocket-archive-stream/blob/master/archive.py#L31)
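For a sense of what that one-line change targets, here is a hedged sketch of extracting URL/title pairs from an HTML bookmark export. The sample line and attribute names are assumptions about the Netscape-style format these services use, not the project's actual regex:

```python
import re

# Hypothetical export line in the Netscape-ish bookmark format;
# attribute names (time_added, tags) vary between services.
SAMPLE = ('<li><a href="http://example.com/post" time_added="1493000000"'
          ' tags="">An Example Post</a></li>')

LINK_RE = re.compile(r'href="(?P<url>[^"]+)"[^>]*>(?P<title>[^<]+)</a>')

def parse_links(html):
    """Return (url, title) pairs found in an export file's markup."""
    return [(m.group("url"), m.group("title"))
            for m in LINK_RE.finditer(html)]
```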

~~~
no_news_is
The expected input format was just about what I was hoping to see in the
comments.

May I suggest including a sample line of the Pocket export format / your
input format in the readme.md?

Thanks for publishing!

~~~
nikisweeting
I included a sample of the expected pocket list format in the repo:
[https://github.com/pirate/pocket-archive-stream/blob/master/example_ril_export.html](https://github.com/pirate/pocket-archive-stream/blob/master/example_ril_export.html)

And a comment next to the regex for parsing it:
[https://github.com/pirate/pocket-archive-stream/blob/master/archive.py#L31](https://github.com/pirate/pocket-archive-stream/blob/master/archive.py#L31)

------
rcarmo
This is great. All it needs is a Docker container and I'd be running it now
(need to take some time aside this weekend to do that).

------
burnbabyburn
This is really cool! I've always had in mind a project where you save every
page you visit and somehow expose them later, so you know what you've visited
and maybe get reminded of important stuff based on some heuristic.

------
anotheryou
yes! thank you so much. Needed this badly.

~~~
nikisweeting
You're welcome! I've been wanting to build this for ages but headless chrome
finally inspired me to actually finish it.

