
Show HN: Local Node.js app to save everything you browse and serve it offline - archivist1
https://github.com/dosyago/22120
======
ShorsHammer
It's a bit disappointing to see other people showcasing their own work without
a single mention of the link above. Perhaps make your own submission instead
if that's the intention?

As a casual reader, and given others' obvious interest in this area, I'd very
much prefer a sentence or two about the quality of the work presented; feel
free to link your own stuff afterwards. It's a bit off-putting to see such
blatant self-promotion.

~~~
YPCrumble
I like when people link related projects. This is why:

I have a use case for this project but I likely won’t get to it for a year or
so. When that happens I’ll come back to this thread and all the projects
working on the problem will be right here in the Hacker News thread. I’ll be
able to see which ones are still alive, and maybe even see why some stopped
development.

This happens all the time for me with HN - if people didn’t link their related
work the thread would have way less utility.

~~~
ShorsHammer
They could always actually view the work and say something about it first, yes?
[https://news.ycombinator.com/showhn.html](https://news.ycombinator.com/showhn.html)

Looking at the projects presented here with nothing else offered, I'm not
convinced of good-faith participation. Would you say the same?

~~~
oefrha
I’d take a self-promo related-project comment over an unrelated complaint
(hogging the entire above-the-fold space) any day. Nothing is more frustrating
than opening a discussion thread where the top comment, with a million
descendants, bikesheds about something else entirely.

Also, you would occasionally see comments along the lines of “awesome project /
congratulations on launching, I like the fact that it does this and that. My
project also does this and that, check it out: insert link.” Hardly any
better.

------
thefreeman
Congrats on getting your project to the front page of HN. With that said, I
think you are going to need to change your approach if you want this project
to be usable as more than a toy in the long run.

From what I can tell, it essentially saves a map of URL -> response in memory
as you browse. Every 10 seconds this map is serialized to JSON and dumped to
a cache.json file. This is going to be very inefficient as the number of web
pages indexed grows, since you are rewriting the entire cache every 10 seconds
even if only a few pages have been added to it. It also will eventually exceed
the memory of the computer running the app if the content of every page ever
visited needs to be loaded into memory. I highly recommend looking into some
of the other suggestions mentioned here: either SQLite, or mapping a local
directory structure to your caching strategy, so that you can easily query a
given URL without keeping the entire cache in memory and add or update URLs
without rewriting the entire cache.
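
To make the directory idea concrete, something like this minimal sketch (file
names and layout are made up, not the project's actual code) lets you add or
look up one URL at a time, so each write touches one small file instead of
re-serializing the whole map:

    // cache-on-disk.js - sketch: one file per URL, so nothing but an
    // (optional) index ever has to live in memory
    const fs = require('fs');
    const path = require('path');
    const crypto = require('crypto');

    const CACHE_DIR = path.join(__dirname, 'cache');
    fs.mkdirSync(CACHE_DIR, { recursive: true });

    // Map a URL to a file path via its SHA-256 hash
    function entryPath(url) {
      const key = crypto.createHash('sha256').update(url).digest('hex');
      return path.join(CACHE_DIR, key + '.json');
    }

    // Add or update a single entry without touching the rest of the cache
    function put(url, response) {
      fs.writeFileSync(entryPath(url), JSON.stringify({ url, response }));
    }

    // Look up a single entry on demand
    function get(url) {
      const p = entryPath(url);
      if (!fs.existsSync(p)) return null;
      return JSON.parse(fs.readFileSync(p, 'utf8')).response;
    }

    module.exports = { put, get };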

~~~
archivist1
My future plan was to cache responses on disk and just keep cached keys in
memory:

[https://github.com/dosyago/22120#future](https://github.com/dosyago/22120#future)

~~~
dunham
I wrote something similar years ago in Go, and settled on writing the data to
a WARC file on disk (you can gzip the individual requests and concatenate to
get random access), and also concatenating to a warc index file. The working
index was kept in memory, while the warc index was read at startup.
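
Roughly the trick, sketched here in Node rather than the Go I used (file names
are invented, and real WARC records carry headers this skips): gzip each record
on its own, append it to one file, and keep an index of offsets so any record
can be read back without scanning the rest.

    // warc-ish.js - sketch: gzip each record, append it, remember its offset
    const fs = require('fs');
    const zlib = require('zlib');

    const DATA = 'archive.gz';    // concatenated gzip members
    const INDEX = 'archive.idx';  // one "url offset length" line per record

    // Append one record; concatenated gzip members still decompress individually
    function append(url, body) {
      const compressed = zlib.gzipSync(Buffer.from(body));
      const offset = fs.existsSync(DATA) ? fs.statSync(DATA).size : 0;
      fs.appendFileSync(DATA, compressed);
      fs.appendFileSync(INDEX, `${url} ${offset} ${compressed.length}\n`);
    }

    // Random access: read just that member at its offset and gunzip it
    function read(offset, length) {
      const fd = fs.openSync(DATA, 'r');
      const buf = Buffer.alloc(length);
      fs.readSync(fd, buf, 0, length, offset);
      fs.closeSync(fd);
      return zlib.gunzipSync(buf).toString();
    }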

My version acted as a proxy and would serve the latest entry from cache if a
copy was cached. I had a special X-Skip-Cache header for when I wanted to go
around the cache. (I can't remember if it handled https or if sites just
didn't use https back then.)

My use-case was web scraping, particularly recipe and blog sites. I wanted to
be able to develop my scraping code without re-hitting the sites all the time.
Structuring it as a proxy allowed me to just write my Python scraping code as
if I were talking to the server.

Previously I'd written a layer on top of the python requests library to
consult a cache stored in a directory (raw dumps of content / headers, with v2
involving git). But I found that required extra care when more than one script
was running at once, and I liked the idea of storing it in a standardized
format (WARC) that could be manipulated by other tools.

~~~
breatheoften
I tried to build something like this for jest tests in an app I worked on.

I wanted my jest tests to serve as both unit tests and service diagnostics -
so I instrumented axios and set up a hidden cache layer within it when running
inside the test suite. I was trying to figure out how to best organize the
cache so I could run tests really quickly by having all results pulled from
cache — or run it slow and as a service diagnostic mechanism by deleting the
cache before execution ... I had to extend axios to accept a bit of additional
logic from the application ...

it was hard for me to get it to work properly inside of jest though ...

------
grizzles
You could store the data in a git repo per domain, so that implicit de-
duplication happens on re-visits & for shared resources.

You could have a raw dir (the files you receive from the server) and a render
dir that consists of snapshots of the DOM + CSS with no JS & external resource
complexity.

When the global archive becomes too big, history could be discarded from all
the git repos by discarding the oldest commit in each repo, and so on.

Solr is probably the right tool for the index, but there is something
undeniably appealing about staying in the pure file paradigm - you could use
SQLite's FTS5 module to do that too.
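
A rough sketch of the FTS5 route (assuming the better-sqlite3 npm package; the
table layout is just illustrative): one virtual table holds the extracted text,
and relevance-ranked search comes for free.

    // fts-index.js - sketch: full-text index over archived pages with SQLite FTS5
    const Database = require('better-sqlite3'); // assumed dependency

    const db = new Database('archive-index.db');
    db.exec('CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, title, body)');

    // Index the extracted text of one page
    function indexPage(url, title, body) {
      db.prepare('INSERT INTO pages (url, title, body) VALUES (?, ?, ?)')
        .run(url, title, body);
    }

    // Query with FTS5 match syntax, ranked by relevance
    function search(query) {
      return db.prepare(
        'SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT 20'
      ).all(query);
    }

    module.exports = { indexPage, search };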

~~~
oefrha
Git is pretty bad at handling large binary blobs. Good old timestamped
directories with hardlinks (a la rsync --link-dest) probably work better.

~~~
fit2rule
Git isn't that bad at handling binary blobs - as long as you enable LFS
support, and your git repo is served locally, as suggested, you'll do fine.

~~~
oefrha
> as long as you enable LFS support

You still end up with two copies of the same file, one in the local LFS
“server”, one in the work tree, no? (I only played with LFS a bit many years
ago when it first came out, so I could be wrong.) Unless you take into account
deduplication built into certain filesystems.

~~~
fit2rule
You don't get copies until you need them - that's the point entirely. More
details here:

[https://www.atlassian.com/git/tutorials/git-lfs](https://www.atlassian.com/git/tutorials/git-lfs)

Also saves you from beating up your index with every change.

I've been using Git LFS for the last 6 months with an Unreal Engine project,
with multiple gigabytes of files being tracked, and it really is painless.

~~~
oefrha
> You don't get copies until you need them

I know. But you do need them, and files in your work tree don’t magically
disappear when you commit them (presumably). So either you delete the
work-tree copy immediately after pushing it to the LFS server and duplicate
the server copy every time you need to access it, in which case the file is
only duplicated then but access becomes more expensive, or the latest copy
sits around costing double the space at all times.

~~~
fit2rule
I don't see the issue? Either you want to use Git or not. I have gigabyte-
scale files in my 6-month-old repos and haven't ever run into any issues. Of
course this may be because my git server is right next to my desk and I'm on
gigabit ethernet ..

------
aloer
What are the security implications of permanently running Chrome in remote
debug mode?

A bit more than half a year ago I started playing around with this, and was
surprised that on the one hand there are really, really good tools nowadays
for self-archiving, but on the other hand there hasn't been any progress in
making them comfortable for end users.

My working theory right now is that saving every request/response, as well as
every interaction on a page, should allow us to completely restore website
state at any point in time, and will open up some super interesting use cases
around our interaction with information found online.

But in order to do this it seems necessary to go through the remote debugging
protocol, like this project is doing. And since this is somewhat of an
unusual approach, I could not find much information about the security aspects
of running every site, at any time, with remote debugging activated. Common
web scrapers/archiving tools will instead only use remote Chrome debugging to
open and capture specific URLs.

Storage is so dirt cheap today that there is zero reason why we shouldn't have
reliable historic website state for everything we have ever looked at.

And judging by the HN front pages of the last few months, many here are
interested in this and related use cases (search/index/annotations/collaborative
browsing).

~~~
sneak
> _Storage is so dirt cheap today that there is zero reason why we shouldn't
> have reliable historic website state for everything we have ever looked at_

I agree entirely, but I do about half my reading on mobile, and the phone
company and the ad company have both decided that I shouldn’t be able to run
extensions of any kind in the browsers available on my phone company phone
_or_ my ad company phone.

I’m not really sure of the solution. I had planned to start a business around
this, but without mobile support it is probably a nonstarter.

~~~
alistproducer2
Run your phone traffic through a proxy and have the proxy cache stuff.

~~~
highmastdon
A proxy can't intercept HTTPS, if I'm correct.

------
EastSmith
Like 20 years ago I used a program called Teleport Pro to do something
similar.

I would dial up with my phone modem when internet access was cheap (during
the night), it would automatically browse a page I provided, and in the
morning I would have the page ready to read.

Fun times with 10 to 20 kb/s speeds.

~~~
101008
I had a similar experience, but I also think Internet Explorer saved the
websites you visited so you could browse them later in offline mode, right? I
remember sometimes I couldn't tell if I was online or not because the website
was cached; I had to visit a different website that I had never visited before
to check my connection status.

------
Eikon
I’m curious why you went down the path of using Chrome's debugging
functionality instead of implementing an HTTP proxy, which would provide the
benefit of being browser-agnostic too.

Could you expand on that, please?

~~~
stevekemp
After reading this I was wondering if it might be fun to write an HTTP proxy
that a) recorded everything in an SQLite database, and b) presented a
localhost server which would let you search that content.

I suspect it would get very very very busy, with tracking-pixels, etc, but if
you only made it archive text/plain, text/html, and similar content-types it
might be a decent alternative to bookmarks, albeit only on a single
host/network.

Wouldn't be hard to knock up a proof of concept, perhaps I should do that this
evening.
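
Something along these lines, maybe (a rough plain-HTTP sketch assuming the
better-sqlite3 package; HTTPS would need CONNECT handling and TLS
interception, which this skips entirely):

    // recording-proxy.js - sketch: forward proxy that archives text responses
    const http = require('http');
    const Database = require('better-sqlite3'); // assumed dependency

    const db = new Database('proxy-archive.db');
    db.exec('CREATE TABLE IF NOT EXISTS responses (url TEXT, fetched_at INTEGER, body TEXT)');
    const insert = db.prepare('INSERT INTO responses VALUES (?, ?, ?)');

    const ARCHIVABLE = /^text\/(plain|html)/;

    http.createServer((clientReq, clientRes) => {
      // For a forward proxy, clientReq.url is the absolute URL being requested
      const upstream = http.request(clientReq.url, {
        method: clientReq.method,
        headers: clientReq.headers,
      }, (upstreamRes) => {
        const chunks = [];
        upstreamRes.on('data', (c) => chunks.push(c));
        upstreamRes.on('end', () => {
          const type = upstreamRes.headers['content-type'] || '';
          // Only archive GETs of plain/HTML content, as suggested above
          if (clientReq.method === 'GET' && ARCHIVABLE.test(type)) {
            insert.run(clientReq.url, Date.now(), Buffer.concat(chunks).toString());
          }
        });
        clientRes.writeHead(upstreamRes.statusCode, upstreamRes.headers);
        upstreamRes.pipe(clientRes);
      });
      upstream.on('error', () => clientRes.destroy());
      clientReq.pipe(upstream);
    }).listen(8888, () => console.log('proxy listening on http://localhost:8888'));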

~~~
randrandrand
I did something like this a while ago:
[https://github.com/nspin/spiderman](https://github.com/nspin/spiderman)

I used the wonderful tool mitmproxy for both recording and serving.

------
wanderingstan
Great feature. Though it feels like a UI misstep that the user has to use npm
to switch between recording and browsing. A nicer solution could be a Chrome
extension button, or accessing the archived version via a synthetic domain,
e.g. example.com.archived.

~~~
crooked-v
There's also some 'magic' potential here to have a proxy that detects whether
there's a live network interface or not (including some sanity checking
against captive portals), passes through the live sites while recording when
there is a connection, and serves from the last recorded versions when there
isn't.
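
The liveness check could be as dumb as hitting a known no-content endpoint and
treating anything other than a bare 204 as "captive portal or offline" (a
sketch; the URL here is the one Android's own connectivity check uses, swap in
whatever you trust):

    // online-check.js - sketch: "really online" vs captive portal / offline
    const http = require('http');

    function checkOnline(callback) {
      let done = false;
      const finish = (ok) => { if (!done) { done = true; callback(ok); } };
      const req = http.get('http://clients3.google.com/generate_204', (res) => {
        // A captive portal typically answers with a redirect or a login page,
        // so anything other than a bare 204 means "not really online"
        finish(res.statusCode === 204);
        res.resume(); // discard any body
      });
      req.setTimeout(3000, () => { finish(false); req.destroy(); });
      req.on('error', () => finish(false));
    }

    // Example: choose between passing through live and serving from the archive
    checkOnline((online) => {
      console.log(online ? 'pass through and record' : 'serve from archive');
    });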

~~~
daurnimator
Sounds like a local squid proxy setup from the early 2000s...

------
jdmoreira
Great idea! I like the concept.

One of the things I miss most about the old web is how trivial it was to
locally mirror any website. It was great!

------
ksec
I remember I tried something similar a long time ago but decided it wasn't
worth it.

2MB per page at 100 pages a day is 200MB/day. That is 73GB per year.

Maybe once a year I run into the problem where I remember reading something
but can't Google my way back to the exact page. So I had a proxy solution set
up, but the maths ended up showing it wasn't worth paying the storage cost
just for that one-time convenience.

~~~
laurent123456
Perhaps one solution would be to extract the plain text and the URL of the
page only. That wouldn't take much space and would still be searchable.

------
mikece
Between this project and the others mentioned in the discussion, these are
excellent resources for anyone needing a forensic record of how they assembled
evidence from browsing open sources on the internet. Package this as a VM that
can be quickly spun up fresh per case, sell support to LE types, and you’ve
got a business.

~~~
ragerino
If the original author agrees, I can dockerize it.

~~~
archivist1
Hey, that's a cool idea about a business. I've made it into a packaged Node.js
app, as a binary you can see on the releases page:

[https://github.com/dosyago/22120/releases](https://github.com/dosyago/22120/releases)

I like multiple release channels, and there are plenty of ways to install and
use this.

You can download a standalone binary (Win, Mac or Linux), install globally
from npm, or just clone or download the repo and run it.

I'm not sure about Docker, but could you maybe give it a try and share the
Dockerfile with me privately, and I can decide if I like it?

If it's good then we can add it to the packages page on the repo. Sound OK?
Email me at cris@dosycorp.com if you like this idea. Thank you! :)

------
jimbob45
Nitpicking but am I the only one who hates "serve" being used in strange
contexts? IMHO to serve is to send something over a network. If it's all
happening locally, the verb should be "load" because it's just taking a file
and loading it into a browser at that point.

~~~
hombre_fatal
If it's running an http server locally and your browser is making requests to
it, it's definitely serving. Not sure how "loading" could be a better word
unless you're explaining it to someone nontechnical. Surely a nitpick is
supposed to be more pedantic, not less?

------
jchook
Really brilliant implementation concept.

I love how it uses the browser's debug port to save literally everything. I
have often dreamed of “a Google for everything I’ve seen before”.

I recently spent some time making something like this and hope to release it
soon as FOSS. However, it differs in some critical ways.

I desire to:

- save pages of interest, but not a firehose of everything I ever see

- save from anywhere on any internet device (eg mobile phone)

- archive rich content like YouTube videos or songs, even if I do not watch
the entire video (or any of it), with support for credentials (eg .netrc)

Looking forward to digging deeper into this thread and your project for more
ideas!

~~~
archivist1
Thank you very much for the big compliment! I feel very happy to hear it.

A lot of people in this thread talked about proxies, as in "why did you not
implement a proxy" or "I implemented this but as a proxy".

The main advantage I see of this approach over a proxy is: simplicity.

The core of this is approximately 10 lines of code, because it can hook
into the commands and events of the browser's built-in Network module.

I think there's no need to build a proxy if you can already program the
browser's built-in Fetch module.
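
Roughly, the capture side looks like this (a hedged sketch using the
chrome-remote-interface npm package rather than the actual 22120 source; the
port and event choices here are assumptions):

    // capture.js - sketch of the devtools-protocol approach
    // Start Chrome with: chrome --remote-debugging-port=9222
    const CDP = require('chrome-remote-interface');

    CDP({ port: 9222 }, async (client) => {
      const { Network } = client;
      const urls = new Map(); // requestId -> url

      await Network.enable();

      Network.responseReceived(({ requestId, response }) => {
        urls.set(requestId, response.url);
      });

      // The body is only reliably available once loading has finished
      Network.loadingFinished(async ({ requestId }) => {
        try {
          const { body, base64Encoded } = await Network.getResponseBody({ requestId });
          const data = base64Encoded ? Buffer.from(body, 'base64') : body;
          // ...write the (url, data) pair to the archive here...
          console.log('archived', urls.get(requestId), data.length);
        } catch (e) {
          // some requests (e.g. redirects) have no retrievable body
        }
      });
    }).on('error', (err) => console.error('cannot connect to Chrome:', err));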

I think proxies have issues such as distribution (how do you distribute your
proxy? As a cumbersome download that requires setup? As a hosted service that
you have to maintain and pay for?), security (how do you handle TLS?), and
complexity (I built this in a couple of hours over 2 days; one of the
"obligatory bump" projects added to this thread is a proxy and has thousands
of commits).

The biggest problem I see is the complexity. I feel a proxy would create a
tonne of edge cases that have to be handled.

I did not mind sacrificing the benefits of a proxy (it can work on all
browsers, and on any device), because I did not want to run my own server for
this, but rather, crucially (I feel), give people back the power and control
over their own archive. Even more important for me is that I want to make this
the easiest way to archive for a particular set of users (say, Chrome users on
desktop), really get that right, and then, if that works, move to other
circles later (such as mobile users, or other browsers).

Anyway, thanks for your kind comment, it really encourages me to share more
about this.

I read some of your comment history; I can't get a lock on who you are, but
you seem pretty interesting. Do you mind sharing a GitHub or something? If
not, but you'd like to continue chatting, email me at cris@dosycorp.com

Thank you!

------
olah_1
You should add "upload / sync with decentralized storage" to the future goals.

Seems like a logical next step to have it sync to an IPFS or Dat drive. Not
sure how it would be implemented though.

------
jan6
I love how there's only a single browser or two in the entire world, lol
(Safari I've got no clue about), and that's while assuming Chrome's and
Firefox's debugging streams would be compatible....

you assume I don't use any forks or custom versions. What if I use an
Electron-based browser? What about Pale Moon or other forks which have older,
if any, such interfaces? What about Opera? etc etc, you get the point... I
hope...

------
CGamesPlay
Bump for my related project:
[https://github.com/CGamesPlay/chronicler](https://github.com/CGamesPlay/chronicler)

I'm actually in the process of rewriting this. I like your approach of using
DevTools to manage the requests; the approach taken in Chronicler is to hook
into Chrome's actual request engine.

You might like to look at Chronicler to see some attempts at UI for a project
like this, particularly decisions around what to download and how to retrieve
it.

------
it
Related: I'm making a program in Go to inline all the resources for a web page
so it ends up being a single file that you can work with offline more easily:
[https://github.com/ijt/inline](https://github.com/ijt/inline).

------
jimktrains2
I've been building something similar, but one that uses Firefox Sync to grab
history and bookmarks.
[https://github.com/jimktrains/ffsyncsearch](https://github.com/jimktrains/ffsyncsearch)

~~~
myself248
This seems to be something a LOT of people are working on right now. I have
this open in another tab:
[https://news.ycombinator.com/item?id=14272133](https://news.ycombinator.com/item?id=14272133)
where several MORE alternatives are listed.

One feature I'd love that I don't see anywhere is "also go through my
history, let me check/uncheck particular items, then submit the rest to
ArchiveBot or WBM or something," since I apparently have a habit of visiting
sites that aren't in WBM yet.

~~~
jimktrains2
Interesting. I didn't even think to look around, I was just scratching an
itch.

I know the WBM has some tools to submit sites; I should look into
incorporating calls to them too.

------
archivist1
If anyone would be interested in the next major version, please add your email
to this list to be notified:
[https://forms.gle/FJmsXCDy18RrbFtt9](https://forms.gle/FJmsXCDy18RrbFtt9)

------
calpaterson
Nice job. I think this is promising, but there has got to be a better way than
having people enable their debugger. Is there any reason you can't just copy
the contents of each page and then post it somewhere?

~~~
jahewson
Seems like a good use case for a browser extension?

------
mauricesvay
Why not use a proxy?

~~~
number6
My initial though. Is there a proxy that also serves? Maybe a squid addon?
This would be awesome. I hope to see something like the way back machine but
local for all the things I ever surfed.

~~~
supermatt
Any caching proxy (including Squid) will serve - that is its whole point.
You may need to tweak the configuration to ignore the website-specified expiry
and cache headers.

~~~
oefrha
Squid (or any other popular caching proxy I'm aware of) doesn't cache verbs
other than GET, so a lot of websites can't be cached this way; notably,
GraphQL APIs usually use POST for all requests, even just queries.

------
lixtra
Related wwwoffle:
[http://www.gedanken.org.uk/software/wwwoffle/](http://www.gedanken.org.uk/software/wwwoffle/)

~~~
zo1
How does this handle HTTPS traffic?

~~~
supermatt
with a self-generated root cert

------
dustingetz
Could this beat Google? Local search of anything I have seen, plus silo search
sites for specific purposes like Amazon and HN. Would you miss anything, given
that Google results are either bought or gamed? Maybe we'd need better social
media.

------
fake-name
Obligatory bump for my project ReadableWebProxy
([https://github.com/fake-name/ReadableWebProxy](https://github.com/fake-name/ReadableWebProxy))
that was originally intended to do this.

At this point, it does a GIANT pile of additional things, most of which are
specific to my interests, but I think it might be at least marginally
interesting to others.

It does both full autonomous web-spidering of sites you specify, as well as
synchronous rendering (You can browse other sites through it, with it
rewriting all links to be internal links, and content for unknown sites
fetched on-the-fly).

I solve the JavaScript problem largely by removing all of it from the content
I forward to the viewing client, though I do support remote sites that load
their content through JS via headless Chromium (I wrote a library for managing
Chrome that exposes the entire debugging protocol here:
[https://github.com/fake-name/ChromeController](https://github.com/fake-name/ChromeController)
[https://pypi.org/project/ChromeController/](https://pypi.org/project/ChromeController/)).

~~~
prox
Lovely file descriptions, had a laugh :)

~~~
hk__2
Those are really bad commit messages.

~~~
moneywoes
Do you have a guide for good commit messages?

~~~
EternalAugust
Here are some HN discussions on commit messages:

[https://news.ycombinator.com/item?id=18663032](https://news.ycombinator.com/item?id=18663032)

[https://news.ycombinator.com/item?id=19704486](https://news.ycombinator.com/item?id=19704486)

[https://news.ycombinator.com/item?id=21835874](https://news.ycombinator.com/item?id=21835874)

[https://news.ycombinator.com/item?id=13491879](https://news.ycombinator.com/item?id=13491879)

~~~
off_by_one
and
[https://news.ycombinator.com/item?id=21812772](https://news.ycombinator.com/item?id=21812772)

