Show HN: Local Node.js app to save everything you browse and serve it offline (github.com/dosyago)
406 points by archivist1 on Dec 22, 2019 | hide | past | favorite | 97 comments

It's a bit disappointing to see other people showcasing their own work without a single mention of the link above. Perhaps make your own submission instead if that's the intention?

As a casual reader, and given the obvious interest in this area, I'd very much prefer a sentence or two about the quality of the work presented; feel free to link your own stuff afterwards. It's a bit off-putting to see such blatant self-promotion.

I like when people link related projects. This is why:

I have a use case for this project but I likely won’t get to it for a year or so. When that happens I’ll come back to this thread and all the projects working on the problem will be right here in the Hacker News thread. I’ll be able to see which ones are still alive, and maybe even see why some stopped development.

This happens all the time for me with HN - if people didn’t link their related work the thread would have way less utility.

They could always actually view the work and say something about it first, yes? https://news.ycombinator.com/showhn.html

Looking at the projects presented here with nothing else offered, I'm not convinced of good-faith participation. Would you say the same?

I’d take a self-promo related project comment over an unrelated complaint (hogging entire above-the-fold space) any day. Nothing is more frustrating than opening a discussion thread and the top comment with a million descendants bikesheds about something else entirely.

Also, you would occasionally see comments along the lines of "awesome project / congratulations on launching, I like the fact that it does this and that. My project also does this and that, check it out: insert link." Hardly any better.

The value to me is the link to the other project. Otherwise it would be like trying to read a recipe online - obligatory filler text with the value buried at the bottom.

I wish I worked in a domain like this. Most people who comment on my area of expertise are asking questions and those with projects are hello world level.

The projects that make it to a working prototype are one-off itch scratches or API-jigsaw projects. Mine included; most of my code is API glue. But I never see HN as having utility in this way. Then again, I'm in a small niche.

It is only human to react like this.

Have you noticed how many people, when you tell them a story, relate it to their own experiences while totally forgetting the things you just said?

You are totally right. If someone wants to present their project because it is similar, they should at least say something about the differences etc., not just paste a URL to promote it.

This happens to me all the time. :)

I don't know, I feel like one of the coolest things about HN is being in a community with others trying to do similar stuff. I definitely agree that it's not about just trying to hijack the top comment to get the runoff clicks, and I do see some comments here that are just links to other projects. Maybe my self-reference wasn't one that you were referring to, but I personally got a kick out of seeing another hobbyist doing something similar, and looking at their different approach to a similar problem.

When I've shared my own projects here, this practice of a) people linking their own projects and b) people linking other projects/products that do a similar thing was a bit disappointing.

What you really want is feedback on your own project.

The problem with this stance is that it is very difficult to get a Show HN to the front page. You usually have about a half hour after submission to get four or five upvotes. (I watch /new and I've seen good projects submit over and over to try to get bites - with no luck.)

But also: self-promotion is a good thing. Or I should say: a better thing. If you can't promote your own project, then you are left to use advertising. This is why advertising is so hot right now. There aren't many avenues for self-promotion.

Let people post their links and let's see where the chips fall.

At least take the effort of putting your project in context, which also adds some interest to the reader. Thin line between "good self-promotion" and "spam".

Congrats on getting your project to the front page of HN. With that said I think you are going to need to change your approach if you want this project to be usable as more than a toy project in the long run.

From what I can tell it essentially saves a map of url -> response in memory as you browse. Every 10 seconds this file is serialized to json and dumped to a cache.json file. This is going to be very inefficient as the number of web pages indexed grows since you are rewriting the entire cache every 10 seconds even if only a few pages have been added to it. It also will eventually exceed the memory of the computer running the app if the content of every page ever visited needs to be loaded into memory. I highly recommend looking into some of the other suggestions mentioned here, either sqlite or mapping a local directory structure to your caching strategy so that you can easily query a given url without keeping the entire cache in memory, and also add / update urls without rewriting the entire cache.

My future plan was to cache responses on disk and just keep cached keys in memory:
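A minimal sketch of that plan in Python (the names and layout are my own invention, not the project's actual code): bodies live on disk under a hash of the URL, and only the key set is held in memory, so lookups stay cheap and nothing ever rewrites the whole cache.

```python
import hashlib
import os

class DiskCache:
    """Sketch: response bodies on disk, only the set of cached keys in memory."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)
        # Rebuild the in-memory key set from whatever is already on disk.
        self.keys = set(os.listdir(root))

    def _name(self, url):
        return hashlib.sha256(url.encode()).hexdigest()

    def put(self, url, body):
        name = self._name(url)
        with open(os.path.join(self.root, name), "wb") as f:
            f.write(body)
        self.keys.add(name)

    def get(self, url):
        name = self._name(url)
        if name not in self.keys:  # membership test without touching disk
            return None
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()
```

Adding or updating one URL then touches exactly one file, instead of re-serializing the entire map every 10 seconds.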


I wrote something similar years ago in Go, and settled on writing the data to a WARC file on disk (you can gzip the individual requests and concatenate to get random access), and also concatenating to a warc index file. The working index was kept in memory, while the warc index was read at startup.

My version acted as a proxy and would serve the latest entry from cache if a copy was cached. I had a special X-Skip-Cache header for when I wanted to go around the cache. (I can't remember if it handled https or if sites just didn't use https back then.)

My use-case was web scraping, particularly recipe and blog sites. I wanted to be able to develop my scraping code without re-hitting the sites all the time. Structuring it as a proxy allowed me to just write my python scraping code as if I was talking to the server.

Previously I'd written a layer on top of the python requests library to consult a cache stored in a directory (raw dumps of content / headers, with v2 involving git). But I found that required extra care when more than one script was running at once, and I liked the idea of storing it in a standardized format (WARC) that could be manipulated by other tools.
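For illustration, here is a rough Python sketch of the gzip-member trick described above (not the original Go code): each record is gzipped individually and appended to a single file, an offset index gives random access with one seek, and the concatenated file remains a valid multi-member gzip stream that other tools can decompress end to end.

```python
import gzip
import os

class RecordFile:
    """Sketch: individually-gzipped records concatenated into one file,
    with an in-memory offset index for random access."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # url -> (offset, length)
        open(path, "ab").close()  # ensure the file exists

    def append(self, url, body):
        blob = gzip.compress(body)
        with open(self.path, "ab") as f:
            offset = f.tell()  # append mode: positioned at end of file
            f.write(blob)
        self.index[url] = (offset, len(blob))

    def read(self, url):
        offset, length = self.index[url]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return gzip.decompress(f.read(length))
```

A persistent version would also append each (url, offset, length) triple to an index file and reload it at startup, as the comment describes.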

I tried to build something like this for jest tests in an app I worked on.

I wanted my jest tests to serve as both unit tests and service diagnostics - so I instrumented axios and set up a hidden cache layer within it when running inside the test suite. I was trying to figure out how best to organize the cache so I could run tests really quickly by having all results pulled from cache, or run it slow, as a service diagnostic mechanism, by deleting the cache before execution ... I had to extend axios to accept a bit of additional logic from the application ...

it was hard for me to get it to work properly inside of jest though ...

You could store the data in a git repo per domain, so that implicit de-duplication happens on re-visits & for shared resources.

You could have a raw dir (the files you receive from the server) and a render dir that consists of snapshots of the DOM + CSS with no JS & external resource complexity.

When the global archive becomes too big, history could be discarded from all the git repos by discarding the oldest commit in each repo, and so on.

SOLR is probably the right tool for the index but there is something undeniably appealing about staying in the pure file paradigm - you could use sqlite's FTS5 module to do that too.
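A tiny sketch of the FTS5 route (this assumes your CPython's bundled SQLite was compiled with the FTS5 module, which stock builds usually are):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# FTS5 virtual table: a full-text index over archived pages.
con.execute("CREATE VIRTUAL TABLE pages USING fts5(url, body)")
con.execute("INSERT INTO pages VALUES (?, ?)",
            ("https://example.com/warc", "the WARC format stores web crawls"))
con.execute("INSERT INTO pages VALUES (?, ?)",
            ("https://example.com/cats", "cats are unrelated to archiving"))
# MATCH is case-insensitive with the default tokenizer; rank orders by relevance.
hits = con.execute(
    "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank", ("warc",)
).fetchall()
```

So you keep the pure-file paradigm for the blobs and get search from a single sqlite file alongside them, with no Solr server to run.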

git is pretty bad at handling large binary blobs. Good old timestamped directories with hardlinks (a la rsync --link-dest) probably works better.
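A sketch of the rsync --link-dest idea in Python, handling flat directories only and using mtime+size as the "unchanged" test (both simplifying assumptions): each snapshot directory looks complete on its own, but unchanged files are hardlinks into the previous snapshot, so the data is stored once.

```python
import os
import shutil

def snapshot(src, dest, prev=None):
    """Timestamped-snapshot sketch: hardlink files unchanged since the
    previous snapshot, copy the rest (a la rsync --link-dest)."""
    os.makedirs(dest)
    for name in os.listdir(src):
        s = os.path.join(src, name)
        d = os.path.join(dest, name)
        p = os.path.join(prev, name) if prev else None
        if (p and os.path.exists(p)
                and os.path.getmtime(p) >= os.path.getmtime(s)
                and os.path.getsize(p) == os.path.getsize(s)):
            os.link(p, d)       # unchanged: hardlink into the new snapshot
        else:
            shutil.copy2(s, d)  # new or modified: real copy (preserves mtime)
```

Pruning old history then just means deleting the oldest snapshot directory; files still linked from newer snapshots survive automatically.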

Git isn't that bad at handling binary blobs - as long as you enable LFS support, and your git repo is served locally, as suggested, you'll do fine.

> as long as you enable LFS support

You still end up with two copies of the same file, one in the local LFS “server”, one in the work tree, no? (I only played with LFS a bit many years ago when it first came out, so I could be wrong.) Unless you take into account deduplication built into certain filesystems.

You don't get copies until you need them - that's the point entirely. More details here:


Also saves you from beating up your index with every change.

I've been using Git LFS for the last 6 months with an Unreal Engine project, with multiple gigabytes of files being tracked, and it really is painless.

> You don't get copies until you need them

I know. But you do need them, and files in your work tree don’t magically disappear when you commit them in (presumably). So either you delete the work tree copy immediately after pushing it to LFS server, and duplicate the server copy every time you need to access it, in which case the file is only duplicated then but comes with elevated cost of access, or the latest copy sits around costing double the amount of space at all times.

I don't see the issue? Either you want to use Git or not. I have gigabyte-scale files in my 6-month-old repos and haven't ever run into any issues. Of course, this may be because my git server is right next to my desk and I'm on gigabit ethernet.

or just use a bare repo?

Aside: are there non-binary and/or non-large blobs? I'm thinking along the lines of ATM machine / PIN number but maybe BLOB no longer implies "binary large" without being explicit.

I like the elegance of this idea.

I like this idea, especially using git to version the store. With automatic commits, you could roll back to a particular date to see the page versions then. A personal "archive.org" sounds very awesome!

I like the idea (not of git, but a personal archive), especially if search is integrated.

However what'd make it really amazing for me would be the ability to share those archived versions with everyone around the world, so we wouldn't have to duplicate our efforts or would have a higher chance of having that specific version of one special page saved.

For now the best way to contribute to this seems to be centralization: Donate to archive.org.

What are the security implications of permanently running Chrome in remote debug mode?

A bit more than half a year ago I started playing around with this, and was surprised by how, on the one hand, there are really good tools nowadays for self-archiving, but on the other hand there has been no progress in implementing them in a way that is comfortable for end users.

My working theory right now is that saving every request/response, as well as every interaction on a page, should allow us to completely restore website state at any point in time, and will open up some super interesting use cases around our interaction with information found online.

But in order to do this it seems necessary to go through the remote debug protocol like this project here is doing. And since this is somewhat of an unusual approach I could not find much information about the security aspect of running every site at any time with remote debugging activated. Common web scrapers/archiving tools will instead only use remote chrome debug to open and capture specific urls

Storage is so dirt cheap today that there is zero reason why we shouldn't have reliable historic website state for everything we have ever looked at

And judging by the HN front pages of the last months, many here are interested in this and related use cases (search/index/annotations/collaborative browsing)

> Storage is so dirt cheap today that there is zero reason why we shouldn't have reliable historic website state for everything we have ever looked at

I agree entirely, but I do about half my reading on mobile, and the phone company and the ad company have both decided that I shouldn’t be able to run extensions of any kind in the browsers available on my phone company phone or my ad company phone.

I’m not really sure of the solution. I had planned to start a business around this, but without mobile support it is probably a nonstarter.

Run your phone traffic through a proxy and have the proxy cache stuff.

Proxy can't intercept https, if I'm correct

For your information, SingleFile can run on Firefox for Android [1].

[1] https://github.com/gildas-lormeau/SingleFile

All of my contacts use iMessage; Signal is starting to be viable with their iPad release now though. I am concerned about switching from Chrome to Firefox still on security grounds, but this may be sufficient to make me switch.

Wait why is Chrome good for security and Firefox not?

Chrome’s sandboxing is unparalleled in the browser space. No other browser comes close, unless you are running something in Qubes or suchlike. Then it doesn’t much matter.

Do you have a link which might give further information about how the sandboxing that Firefox offers and the one that Chrome offers are different?

I was under the impression that they were very similar.

There's something there -- if you can translate time or resource savings into value

> What are the security implications of running in remote debugging mode?

Great question. First up, as long as you don't pass --remote-debugging-address= you are only exposed locally, so the debugging endpoint can only be accessed from your local machine.

That leaves open the possibility that a web page can access that.

There's two possibilities:

- fetch('http://localhost:9222/json'), which errors or is opaque because it is non-CORS, or

- connecting directly to the websockets for targets, which have addresses like http://localhost:9222/devtools/page/<128_bit_hex_string>

Interestingly, you can connect to the websocket, you just need to know the random identifier.

There are probably some DevTools zero days, but apart from those it looks like it's OK unless:

0) the identifier is not random,

1) you can get past CORS on localhost, which might be possible with an exploited extension, 3rd-party software, or a plugin, or

2) you can guess the websocket 128-bit identifier. (Guessing should only take 500 billion years. Even so 128 bits seems quite short relative to some encryption keys but there's probably a reason for that.)
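For what it's worth, the arithmetic comes out even more comfortable than 500 billion years. A back-of-envelope check, assuming (generously) one billion guesses per second against the local endpoint:

```python
# Guessing a uniformly random 128-bit identifier: on average you find it
# after searching half the space, i.e. 2**127 attempts.
guesses = 2 ** 127
rate = 10 ** 9                       # guesses per second (assumed, generous)
seconds_per_year = 60 * 60 * 24 * 365
years = guesses / (rate * seconds_per_year)   # on the order of 10**21 years
```

So even at an absurd request rate, brute-forcing the target identifier is not a realistic attack; the practical risks are the other two items above.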

Regarding 0) checking the Chromium source it appears that these ids are passed in to the constructor of "DevToolsAgentHostImpl":


and are either "GUID"s or "tokens" and in the former case they are created here:


and in the latter case by a class revealingly named "unguessabletoken.h":


which in each case appears to rely on getting random bytes from a file descriptor to "urandom" which I think is an operating system level randomness primitive.

Is it possible to send a no-cors POST that causes side effects (regardless of the opaque response)? I have only used the DevTools protocol through puppeteer, so don’t know anything about its authentication. Could it be vulnerable?

Besides the websocket, the protocol has a couple of HTTP endpoints, you can see commands here:


which looks like it ignores the HTTP verb and acts only on the path. I confirmed this with tests: fetch('http://localhost:9222/json/new') and fetch('http://localhost:9222/json/new, {method:'POST', body:''}) do the same thing, as does using verb 'DELETE'.

All these open a new tab. Without knowing a 128-bit target identifier, it looks like opening a new tab is the only thing you can do if someone is running DevTools.

Your browser will refuse to send the XHR POST when the preflight OPTIONS request does not allow it via CORS headers.

POST requests don’t necessarily trigger preflight. But now that I think about it, the DevTools protocol most likely does not accept application/x-www-form-urlencoded, so good point.

> - fetch('http://localhost:9222/json') which errors or is opaque because it is non CORS, or

What about DNS rebinding attacks?

It looks like this was patched a few months back, around M66.


I mean, you're using Chrome, so presumably you are not that fussed about sharing your data with a megacorp... why worry about anyone else seeing it?

Like 20 years ago, I used a program called Teleport Pro to do something similar.

I would dial up with my phone modem when the internet access was cheap (during the night), it would automatically browse a page I provided, and in the morning I would have the page ready to read.

Fun times with 10 to 20 kb/s speeds.

I had a similar experience, but I also think Internet Explorer saved the websites you visited so you could browse them later in offline mode, right? I remember sometimes I couldn't tell whether I was online or not because the website was cached; I had to visit a different website that I had never visited before to check my connection status.

I'm curious why you went down the path of using Chrome's debugging functionality instead of implementing an HTTP proxy, which would provide the benefit of being browser-agnostic too.

Could you expand on that, please?

Also wonder about the proxy thing. Sounds like squid in offline mode would be broadly similar?


You'd have to mitm ssl unfortunately:


As for proxy switching, in addition to command line options, there's


Very convenient for using ssh as a socks5 proxy, for example.

After reading this I was wondering if it might be fun to write an HTTP proxy that a) recorded everything in an SQLite database, and b) presented a localhost server which would let you search that content.

I suspect it would get very very very busy, with tracking-pixels, etc, but if you only made it archive text/plain, text/html, and similar content-types it might be a decent alternative to bookmarks, albeit only on a single host/network.

Wouldn't be hard to knock up a proof of concept, perhaps I should do that this evening.
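If anyone wants a head start, here is a minimal sketch of the storage side of such a proxy (schema and names are made up, and a real version would want an FTS index rather than LIKE): record only textual content types, then search bodies from the localhost server.

```python
import sqlite3

# Content types worth archiving; tracking pixels, images, etc. get skipped.
ARCHIVABLE = ("text/plain", "text/html")

def make_store():
    con = sqlite3.connect(":memory:")  # a real proxy would use a file path
    con.execute("""CREATE TABLE responses (
        url TEXT, fetched_at TEXT, content_type TEXT, body TEXT)""")
    return con

def record(con, url, fetched_at, content_type, body):
    # Strip parameters like "; charset=utf-8" before checking the type.
    if content_type.split(";")[0].strip() not in ARCHIVABLE:
        return False
    con.execute("INSERT INTO responses VALUES (?, ?, ?, ?)",
                (url, fetched_at, content_type, body))
    return True

def search(con, term):
    return con.execute(
        "SELECT url FROM responses WHERE body LIKE ?", (f"%{term}%",)
    ).fetchall()
```

The content-type filter is what keeps the "very very very busy" firehose of pixels and scripts out of the archive.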

I did something like this a while ago: https://github.com/nspin/spiderman

I used the wonderful tool mitmproxy for both recording and serving.

Could probably strip the tracking codes anyway. Someone linked to wwwoffle, check if that isn’t what you had in mind.

One thing I can imagine is that Chrome uses the system proxy settings. That means changing them would affect all other apps on the machine, and the proxy could end up saving a lot of garbage requests you don't actually need.

On the other hand user agent filtering could be implemented to partially solve this.

Chrome can accept proxy settings from the command line, so all you'd need to do would be to change the shortcut to Chrome.

The more dangerous part would be doing TLS MitM to allow HTTPS content to be recorded. This is well documented but still a potential security issue if the certificate somehow gets picked up by malware or something.

Great feature. Though it feels like a UI misstep that the user has to use npm to switch between recording and browsing. A nicer solution could be a Chrome extension button, or accessing the archived version via a synthetic domain, e.g. example.com.archived

There's also some 'magic' potential here to have a proxy that detects whether there's a live network interface or not (including some sanity checking against captive portals), passes through the live sites while recording when there is a connection, and serves from the last recorded versions when there isn't.

Sounds like a local squid proxy setup from the early 2000s...

Thanks for the compliment. I totally agree re the misstep and want to improve that.

Once the library server is implemented, you'll be able to browse to it (localhost:8080 or so) and access your archive from there.

Nice idea on the synthetic domain; that might yield another way to do it.

Great idea! I like the concept.

One of the things I miss most about the old web was how trivial it was to local mirror any website. It was great!

I remember I tried something similar a long time ago but decided it wasn't worth it.

2MB per page at 100 pages a day is 200MB/day. That is 73GB per year.

Maybe once a year I hit the problem where I remember reading something but can't google my way back to the exact page. So I had a proxy solution set up, but the math worked out that it wasn't worth paying the storage cost just for that one-time convenience.

Perhaps one solution would be to extract the plain text and the URL of the page only. That wouldn't take much space and would still be searchable.
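The back-of-envelope math, with the plain-text variant added (the 50 KB of extracted text per page is my assumption):

```python
# Storage cost of archiving 100 pages a day for a year.
pages_per_day = 100
full_page_mb = 2                 # average full-page weight, per the comment
per_year_full_mb = full_page_mb * pages_per_day * 365      # = 73,000 MB, ~73 GB

text_page_kb = 50                # assumed size of extracted plain text per page
per_year_text_mb = text_page_kb * pages_per_day * 365 / 1000   # ~1.8 GB
```

So text-only extraction cuts the yearly cost by roughly 40x while keeping everything searchable.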

Between this project and the others mentioned in the discussion, these are excellent resources for anyone needing a forensic record of how they assembled evidence from browsing open sources on the internet. Package this as a VM that can be quickly spun up fresh per case, sell support to LE types, and you've got a business.

If original author agrees, I can dockerize it.

Hey, that's a cool idea about a business. I've made it into a packaged Node.js app as a binary, which you can see on the releases page:


I like multiple release channels and there's plenty of ways to install and use this.

You can download a standalone binary (Win, Mac or Linux), install globally from npm, or just clone or download the repo and run it.

I'm not sure about Docker, but could you maybe give it a try and share the Dockerfile with me privately, so I can decide if I like it?

If it's good then we can add it to the packages page on the repo. Sound OK? Email me at cris@dosycorp.com if you like this idea. Thank you! :)

Nitpicking but am I the only one who hates "serve" being used in strange contexts? IMHO to serve is to send something over a network. If it's all happening locally, the verb should be "load" because it's just taking a file and loading it into a browser at that point.

If it's running an http server locally and your browser is making requests to it, it's definitely serving. Not sure how "loading" could be a better word unless you're explaining it to someone nontechnical. Surely a nitpick is supposed to be more pedantic, not less?

Really brilliant implementation concept.

I love how it uses the browser's debug port to save literally everything. I have often dreamed of "a Google for everything I've seen before".

I recently spent some time making something like this and hope to release it soon as FOSS. However, it differs in some critical ways.

I desire to:

- save pages of interest, but not a firehose of everything I ever see

- save from anywhere on any internet device (eg mobile phone)

- Archive rich content like YouTube videos or songs even if I do not watch the entire video (or any of it), with support for credentials (e.g. .netrc)

Looking forward to digging deeper into this thread and your project for more ideas!

Thank you very much for the big compliment! I feel very happy to hear it.

A lot of people in this thread talked about proxies, as in "why did you not implement a proxy" or "I implemented this but as a proxy"

The main advantage I see of this approach over a proxy is: simplicity.

The core of this is approximately 10 lines of code, because it can hook into the commands and events of the browser's built-in Network module.

I think there's no need to build a proxy if you can already program the browser's built-in Fetch module.

I think proxies have issues such as distribution (how do you distribute your proxy? As a cumbersome download that requires setup? As a hosted service that you have to maintain and pay for?), security (how do you handle TLS?), and complexity (I built this in a couple of hours over 2 days; one of the "obligatory bump" projects added to this thread is a proxy and has thousands of commits).

The biggest problem I see is the complexity. I feel a proxy would create a tonne of edge cases that have to be handled.

I did not mind sacrificing the benefits of a proxy (it can work on all browsers, and on any device), because I did not want to run my own server for this, but rather, crucially (I feel) give people back the power and control over their own archive. Even more importantly for me is I want to just make this the easiest way to archive for a particular set of users (say, Chrome users on Desktop), really get that right and then if that works, move to other circles later (such as mobile users, or other browsers).
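For the curious, "hook into the Network module" looks roughly like this at the protocol level. The method and event names (Network.enable, Network.responseReceived, Network.getResponseBody) are real DevTools protocol names, but the wiring below is an illustrative sketch, not the project's code; a real client sends these frames over the ws:// endpoint on the debugging port.

```python
import itertools
import json

_ids = itertools.count(1)  # DevTools commands need unique message ids

def command(method, **params):
    """Build one DevTools protocol frame, ready to send over the websocket."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})

# Roughly the whole "core": turn on network events...
enable = command("Network.enable")

# ...then, for every response the browser sees, ask for its body and archive it.
def on_event(raw, archive):
    event = json.loads(raw)
    if event.get("method") == "Network.responseReceived":
        url = event["params"]["response"]["url"]
        request_id = event["params"]["requestId"]
        archive[url] = None  # body filled in when Network.getResponseBody replies
        return command("Network.getResponseBody", requestId=request_id)
```

The browser itself does all the fetching, TLS, caching, and JS execution; the archiver is just a listener, which is where the simplicity argument comes from.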

Anyway, thanks for your kind comment, it really encourages me to share more about this.

I read some of your comment history but I can't get a lock on who you are, but you seem pretty interesting. Do you mind sharing a GitHub or something? If not, but you'd like to continue chatting, email me cris@dosycorp.com

Thank you!

You should add "upload / sync with decentralized storage" to the future goals.

Seems like a logical next step to have it sync to an IPFS or Dat drive. Not sure how it would be implemented though.

I love how there's only a single browser or two in the entire world, lol (Safari I've got no clue about). And that's while assuming Chrome's and Firefox's debugging streams would be compatible...

You assume I don't use any forks or custom versions. What if I use an Electron-based browser? What about Pale Moon or other forks that have older (if any) such interfaces? What about Opera? Etc., etc. You get the point... I hope.

Bump for my related project: https://github.com/CGamesPlay/chronicler

I'm actually in the process of rewriting this. I like your approach of using DevTools to manage the requests, the approach taken in Chronicler is to hook into Chrome's actual request engine.

You might like to look at Chronicler to see some attempts at UI for a project like this, particularly decisions around what to download and how to retrieve it.

I've been building something similar, but that uses Firefox sync to grab history and bookmarks. https://github.com/jimktrains/ffsyncsearch

This seems to be something a LOT of people are working on right now. I have this open in another tab: https://news.ycombinator.com/item?id=14272133 where several MORE alternatives are listed.

One feature I'd love that I don't see anywhere is "also go through my history, let me check/uncheck particular items, then submit the rest to ArchiveBot or WBM or something." Since I apparently have a habit of visiting sites that aren't in the WBM yet.

Interesting. I didn't even think to look around, I was just scratching an itch.

I know the WBM has some tools to submit sites; I should look into incorporating calls to them too.

If anyone would be interested in the next major version, please add your email to this list to be notified: https://forms.gle/FJmsXCDy18RrbFtt9

Nice job, I think this is promising but there has got to be a better way than having people enable their debugger. Is there any reason you can't just copy the contents of each page and then post it somewhere?

Seems like a good use case for a browser extension?

Why not use a proxy?

My initial thought. Is there a proxy that also serves? Maybe a Squid add-on? This would be awesome. I hope to see something like the Wayback Machine, but local, for all the things I've ever surfed.

Any caching proxy (including squid) will serve - that is their whole point. You may need to tweak the configuration to ignore the website-specified expiry and cache headers.

Squid (or any other popular caching proxy I'm aware of) doesn't cache verbs other than GET, so a lot of websites can't be cached this way; notably, GraphQL APIs usually use POST for all requests, even just queries.
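One way around the POST limitation, sketched below: widen the cache key to include the method and a hash of the request body, so identical GraphQL queries hit the same entry. This is hypothetical, not something Squid does out of the box; you'd also want an allowlist of endpoints known to be side-effect-free, since POST is not safe or idempotent in general.

```python
import hashlib

def cache_key(method, url, body=b""):
    """Cache key for a caching proxy that can also cache POST-based queries.

    The body hash distinguishes different GraphQL queries sent to the same
    URL, while identical queries map to the same cached entry.
    """
    digest = hashlib.sha256(body).hexdigest()
    return f"{method} {url} {digest}"
```
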

In principle this is true, but there are some caveats regarding usability: it does not present the cache in a friendly way. There is no index or something like a nice starting page with your top browsed sites - you get the idea.

I guess this is totally doable with Squid or any other caching proxy, but I don't know of any that does this.

A proxy means you're MITM-ing your own connection (debug mode is also an issue, but not as simple to take advantage of).

How does this handle HTTPS traffic?

with a self-generated root cert

Could this beat Google? Local search of anything I have seen, plus silo search sites for specific purposes like Amazon and HN. Would you miss anything, given that Google results are either bought or gamed? Maybe we need better social media.

Obligatory bump for my project ReadableWebProxy (https://github.com/fake-name/ReadableWebProxy) that was originally intended to do this.

At this point, it does a GIANT pile of additional things most of which are specific to my interests, but I think it might be at least marginally interesting to others.

It does both full autonomous web-spidering of sites you specify, as well as synchronous rendering (You can browse other sites through it, with it rewriting all links to be internal links, and content for unknown sites fetched on-the-fly).

I solve the javascript problem largely by removing all of it from the content I forward to the viewing client, though I do support remote sites that load their content through JS via headless chromium (I wrote a library for managing chrome that exposes the entire debugging protocol here: https://github.com/fake-name/ChromeController https://pypi.org/project/ChromeController/).

Why are you obligated to plug your own project instead of discuss the linked content?

Very strange commit message format... they all seem too formal with capitalization and periods, yet are frustratingly ambiguous many times (e.g. "Sources.")

Lovely file descriptions, had a laugh :)

Those are really bad commit messages.

Do you have a guide for good commit messages?

Explain what the commit changes. Think about the person who'll read them. Think about your future self, who'll find a bug and search the commits to understand what happened. "Sources" or "Oops" are among the worst commit messages you could write.
