Show HN: Local Node.js app to save everything you browse and serve it offline (github.com/dosyago)
406 points by archivist1 on Dec 22, 2019 | hide | past | favorite | 97 comments

It's a bit disappointing to see other people showcasing their own work without a single mention of the link above. Perhaps make your own submission instead if that's the intention?

As a casual reader, and given the obvious interest in this area, I'd very much prefer a sentence or two about the quality of the work presented; feel free to link your own stuff afterwards. It's a bit off-putting to see such blatant self-promotion.

I like when people link related projects. This is why:

I have a use case for this project but I likely won’t get to it for a year or so. When that happens I’ll come back to this thread and all the projects working on the problem will be right here in the Hacker News thread. I’ll be able to see which ones are still alive, and maybe even see why some stopped development.

This happens all the time for me with HN - if people didn’t link their related work the thread would have way less utility.

They could always actually view the work and say something about it first, yes? https://news.ycombinator.com/showhn.html

Looking at the projects presented here with nothing else offered, I'm not convinced of good-faith participation. Would you say the same?

I’d take a self-promo related project comment over an unrelated complaint (hogging entire above-the-fold space) any day. Nothing is more frustrating than opening a discussion thread and the top comment with a million descendants bikesheds about something else entirely.

Also, you would occasionally see comments along the lines of "awesome project / congratulations on launching, I like the fact that it does this and that. My project also does this and that, check it out: insert link." Hardly any better.

The value to me is the link to the other project. Otherwise it would be like trying to read a recipe online - obligatory filler text with the value buried at the bottom.

I wish I worked in a domain like this. Most people who comment on my area of expertise are asking questions and those with projects are hello world level.

The projects that make it to a working prototype are one-off itch scratches or API-jigsaw projects. Mine included; most of my code is API glue. But I never see HN as having utility in this way. Then again, I'm in a small niche.

It is only human to react like this.

Have you noticed how many people, when you tell them a story, relate it to their own experiences while totally forgetting the things you just said?

You are totally right. If someone wants to present their project because it is similar, they should at least say something about the differences etc., not just paste a URL to promote it.

This happens to me all the time. :)

I don't know, I feel like one of the coolest things about HN is being in a community with others trying to do similar stuff. I definitely agree that it's not about just trying to hijack the top comment to get the runoff clicks, and I do see some comments here that are just links to other projects. Maybe my self-reference wasn't one that you were referring to, but I personally got a kick out of seeing another hobbyist doing something similar, and looking at their different approach to a similar problem.

When I've shared my own projects here, this practice of a) people linking their own projects and b) people linking other projects/products that do a similar thing was a bit disappointing.

What you really want is feedback on your own project.

The problem with this stance is that it is very difficult to get a Show HN to the front page. You usually have about a half hour after submission to get four or five upvotes. (I watch /new and I've seen good projects submit over and over to try to get bites - with no luck.)

But also: self-promotion is a good thing. Or I should say: a better thing. If you can't promote your own project, then you are left to use advertising. This is why advertising is so hot right now. There aren't many avenues for self-promotion.

Let people post their links and let's see where the chips fall.

At least take the effort of putting your project in context, which also adds some interest to the reader. Thin line between "good self-promotion" and "spam".

Congrats on getting your project to the front page of HN. With that said I think you are going to need to change your approach if you want this project to be usable as more than a toy project in the long run.

From what I can tell it essentially saves a map of url -> response in memory as you browse. Every 10 seconds this file is serialized to json and dumped to a cache.json file. This is going to be very inefficient as the number of web pages indexed grows since you are rewriting the entire cache every 10 seconds even if only a few pages have been added to it. It also will eventually exceed the memory of the computer running the app if the content of every page ever visited needs to be loaded into memory. I highly recommend looking into some of the other suggestions mentioned here, either sqlite or mapping a local directory structure to your caching strategy so that you can easily query a given url without keeping the entire cache in memory, and also add / update urls without rewriting the entire cache.

My future plan was to cache responses on disk and just keep cached keys in memory:
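A minimal sketch of that plan in Python (the names and layout are my own invention, not the project's actual code): bodies live on disk under a hash of the URL, and only the key set is held in memory, so lookups stay cheap and nothing ever rewrites the whole cache.

```python
import hashlib
import os

class DiskCache:
    """Sketch: response bodies on disk, only the set of cached keys in memory."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)
        # Rebuild the in-memory key set from whatever is already on disk.
        self.keys = set(os.listdir(root))

    def _name(self, url):
        return hashlib.sha256(url.encode()).hexdigest()

    def put(self, url, body):
        name = self._name(url)
        with open(os.path.join(self.root, name), "wb") as f:
            f.write(body)
        self.keys.add(name)

    def get(self, url):
        name = self._name(url)
        if name not in self.keys:  # membership test without touching disk
            return None
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()
```

Adding or updating one URL then touches exactly one file, instead of re-serializing the entire map every 10 seconds.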


I wrote something similar years ago in Go, and settled on writing the data to a WARC file on disk (you can gzip the individual requests and concatenate to get random access), and also concatenating to a warc index file. The working index was kept in memory, while the warc index was read at startup.

My version acted as a proxy and would serve the latest entry from cache if a copy was cached. I had a special X-Skip-Cache header for when I wanted to go around the cache. (I can't remember if it handled https or if sites just didn't use https back then.)

My use-case was web scraping, particularly recipe and blog sites. I wanted to be able to develop my scraping code without re-hitting the sites all the time. Structuring it as a proxy allowed me to just write my python scraping code as if I was talking to the server.

Previously I'd written a layer on top of the python requests library to consult a cache stored in a directory (raw dumps of content / headers, with v2 involving git). But I found that required extra care when more than one script was running at once, and I liked the idea of storing it in a standardized format (WARC) that could be manipulated by other tools.
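For illustration, here is a rough Python sketch of the gzip-member trick described above (not the original Go code): each record is gzipped individually and appended to a single file, an offset index gives random access with one seek, and the concatenated file remains a valid multi-member gzip stream that other tools can decompress end to end.

```python
import gzip
import os

class RecordFile:
    """Sketch: individually-gzipped records concatenated into one file,
    with an in-memory offset index for random access."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # url -> (offset, length)
        open(path, "ab").close()  # ensure the file exists

    def append(self, url, body):
        blob = gzip.compress(body)
        with open(self.path, "ab") as f:
            offset = f.tell()  # append mode: positioned at end of file
            f.write(blob)
        self.index[url] = (offset, len(blob))

    def read(self, url):
        offset, length = self.index[url]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return gzip.decompress(f.read(length))
```

A persistent version would also append each (url, offset, length) triple to an index file and reload it at startup, as the comment describes.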

I tried to build something like this for jest tests in an app I worked on.

I wanted my jest tests to serve as both unit tests and service diagnostics - so I instrumented axios and set up a hidden cache layer within it when running inside the test suite. I was trying to figure out how best to organize the cache so I could run tests really quickly by having all results pulled from cache, or run it slow, as a service diagnostic mechanism, by deleting the cache before execution ... I had to extend axios to accept a bit of additional logic from the application ...

it was hard for me to get it to work properly inside of jest though ...

You could store the data in a git repo per domain, so that implicit de-duplication happens on re-visits & for shared resources.

You could have a raw dir (the files you receive from the server) and a render dir that consists of snapshots of the DOM + CSS with no JS & external resource complexity.

When the global archive becomes too big, history could be discarded from all the git repos by discarding the oldest commit in each repo, and so on.

SOLR is probably the right tool for the index but there is something undeniably appealing about staying in the pure file paradigm - you could use sqlite's FTS5 module to do that too.
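A tiny sketch of the FTS5 route (this assumes your CPython's bundled SQLite was compiled with the FTS5 module, which stock builds usually are):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# FTS5 virtual table: a full-text index over archived pages.
con.execute("CREATE VIRTUAL TABLE pages USING fts5(url, body)")
con.execute("INSERT INTO pages VALUES (?, ?)",
            ("https://example.com/warc", "the WARC format stores web crawls"))
con.execute("INSERT INTO pages VALUES (?, ?)",
            ("https://example.com/cats", "cats are unrelated to archiving"))
# MATCH is case-insensitive with the default tokenizer; rank orders by relevance.
hits = con.execute(
    "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank", ("warc",)
).fetchall()
```

So you keep the pure-file paradigm for the blobs and get search from a single sqlite file alongside them, with no Solr server to run.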

git is pretty bad at handling large binary blobs. Good old timestamped directories with hardlinks (a la rsync --link-dest) probably works better.
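A sketch of the rsync --link-dest idea in Python, handling flat directories only and using mtime+size as the "unchanged" test (both simplifying assumptions): each snapshot directory looks complete on its own, but unchanged files are hardlinks into the previous snapshot, so the data is stored once.

```python
import os
import shutil

def snapshot(src, dest, prev=None):
    """Timestamped-snapshot sketch: hardlink files unchanged since the
    previous snapshot, copy the rest (a la rsync --link-dest)."""
    os.makedirs(dest)
    for name in os.listdir(src):
        s = os.path.join(src, name)
        d = os.path.join(dest, name)
        p = os.path.join(prev, name) if prev else None
        if (p and os.path.exists(p)
                and os.path.getmtime(p) >= os.path.getmtime(s)
                and os.path.getsize(p) == os.path.getsize(s)):
            os.link(p, d)       # unchanged: hardlink into the new snapshot
        else:
            shutil.copy2(s, d)  # new or modified: real copy (preserves mtime)
```

Pruning old history then just means deleting the oldest snapshot directory; files still linked from newer snapshots survive automatically.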

Git isn't that bad at handling binary blobs - as long as you enable LFS support, and your git repo is served locally, as suggested, you'll do fine.

> as long as you enable LFS support

You still end up with two copies of the same file, one in the local LFS “server”, one in the work tree, no? (I only played with LFS a bit many years ago when it first came out, so I could be wrong.) Unless you take into account deduplication built into certain filesystems.

You don't get copies until you need them - that's the point entirely. More details here:


Also saves you from beating up your index with every change.

I've been using Git LFS for the last 6 months with an Unreal Engine project, with multiple gigabytes of files being tracked, and it really is painless.

> You don't get copies until you need them

I know. But you do need them, and files in your work tree don’t magically disappear when you commit them in (presumably). So either you delete the work tree copy immediately after pushing it to LFS server, and duplicate the server copy every time you need to access it, in which case the file is only duplicated then but comes with elevated cost of access, or the latest copy sits around costing double the amount of space at all times.

I don't see the issue? Either you want to use Git or not. I have gigabyte-scale files in my 6-month-old repos and haven't ever run into any issues. Of course, this may be because my git server is right next to my desk and I'm on gigabit ethernet.

or just use a bare repo?

Aside: are there non-binary and/or non-large blobs? I'm thinking along the lines of ATM machine / PIN number but maybe BLOB no longer implies "binary large" without being explicit.

I like the elegance of this idea.

I like this idea, especially using git to version the store. With automatic commits, you could roll back to a particular date to see the page versions then. A personal "archive.org" sounds very awesome!

I like the idea (not of git, but a personal archive), especially if search is integrated.

However what'd make it really amazing for me would be the ability to share those archived versions with everyone around the world, so we wouldn't have to duplicate our efforts or would have a higher chance of having that specific version of one special page saved.

For now the best way to contribute to this seems to be centralization: Donate to archive.org.

What are the security implications of permanently running Chrome in remote debug mode?

A bit more than half a year ago I started playing around with this, and was surprised by how, on the one hand, there are really good tools nowadays for self-archiving, but on the other hand there has been no progress in implementing them in a way that is comfortable for end users.

My working theory right now is that saving every request/response, as well as every interaction on a page, should allow us to completely restore website state at any point in time, and will open up some super interesting use cases around our interaction with information found online.

But in order to do this it seems necessary to go through the remote debug protocol like this project here is doing. And since this is somewhat of an unusual approach I could not find much information about the security aspect of running every site at any time with remote debugging activated. Common web scrapers/archiving tools will instead only use remote chrome debug to open and capture specific urls

Storage is so dirt cheap today that there is zero reason why we shouldn't have reliable historic website state for everything we have ever looked at

And judging by the HN front pages of the last months, many here are interested in this and related use cases (search/index/annotations/collaborative browsing)

> Storage is so dirt cheap today that there is zero reason why we shouldn't have reliable historic website state for everything we have ever looked at

I agree entirely, but I do about half my reading on mobile, and the phone company and the ad company have both decided that I shouldn’t be able to run extensions of any kind in the browsers available on my phone company phone or my ad company phone.

I’m not really sure of the solution. I had planned to start a business around this, but without mobile support it is probably a nonstarter.

Run your phone traffic through a proxy and have the proxy cache stuff.

Proxy can't intercept https, if I'm correct

For your information, SingleFile can run on Firefox for Android [1].

[1] https://github.com/gildas-lormeau/SingleFile

All of my contacts use iMessage; Signal is starting to be viable with their iPad release now though. I am concerned about switching from Chrome to Firefox still on security grounds, but this may be sufficient to make me switch.

Wait why is Chrome good for security and Firefox not?

Chrome’s sandboxing is unparalleled in the browser space. No other browser comes close, unless you are running something in Qubes or suchlike. Then it doesn’t much matter.

Do you have a link which might give further information about how the sandboxing that Firefox offers and the one that Chrome offers are different?

I was under the impression that they were very similar.

There's something there -- if you can translate time or resource savings into value

> What are the security implications of running in remote debugging mode?

Great question. First up, as long as you don't pass --remote-debugging-address= you are only exposed locally, so the debugging endpoint can only be accessed from your local machine.

That leaves open the possibility that a web page can access that.

There's two possibilities:

- fetch('http://localhost:9222/json'), which errors or is opaque because it is non-CORS, or

- connecting directly to the websockets for targets, which have addresses like http://localhost:9222/devtools/page/<128_bit_hex_string>

Interestingly, you can connect to the websocket, you just need to know the random identifier.

There are probably some DevTools zero days, but apart from those it looks like it's OK unless:

0) the identifier is not random,

1) you can get past CORS on localhost, which might be possible with an exploited extension, 3rd-party software, or a plugin, or

2) you can guess the websocket 128-bit identifier. (Guessing should only take 500 billion years. Even so 128 bits seems quite short relative to some encryption keys but there's probably a reason for that.)
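For what it's worth, the arithmetic comes out even more comfortable than 500 billion years. A back-of-envelope check, assuming (generously) one billion guesses per second against the local endpoint:

```python
# Guessing a uniformly random 128-bit identifier: on average you find it
# after searching half the space, i.e. 2**127 attempts.
guesses = 2 ** 127
rate = 10 ** 9                       # guesses per second (assumed, generous)
seconds_per_year = 60 * 60 * 24 * 365
years = guesses / (rate * seconds_per_year)   # on the order of 10**21 years
```

So even at an absurd request rate, brute-forcing the target identifier is not a realistic attack; the practical risks are the other two items above.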

Regarding 0) checking the Chromium source it appears that these ids are passed in to the constructor of "DevToolsAgentHostImpl":


and are either "GUID"s or "tokens" and in the former case they are created here:


and in the latter case by a class revealingly named "unguessabletoken.h":


which in each case appears to rely on getting random bytes from a file descriptor to "urandom" which I think is an operating system level randomness primitive.

Is it possible to send a no-cors POST that causes side effects (regardless of the opaque response)? I have only used the DevTools protocol through puppeteer, so don’t know anything about its authentication. Could it be vulnerable?

Besides the websocket, the protocol has a couple of HTTP endpoints, you can see commands here:


which looks like it ignores the HTTP verb and acts only on the path. I confirmed this with tests: fetch('http://localhost:9222/json/new') and fetch('http://localhost:9222/json/new, {method:'POST', body:''}) do the same thing, as does using verb 'DELETE'.

All these open a new tab. Without knowing a 128-bit target identifier, it looks like opening a new tab is the only thing you can do if someone is running DevTools.

Your browser will refuse to send the XHR POST when the preflight OPTIONS request does not allow it via CORS headers.

POST requests don’t necessarily trigger preflight. But now that I think about it, the DevTools protocol most likely does not accept application/x-www-form-urlencoded, so good point.

> - fetch('http://localhost:9222/json') which errors or is opaque because it is non CORS, or

What about DNS rebinding attacks?

It looks like this was patched a few months back, around M66.


I mean, you're using Chrome, so presumably you are not that fussed about sharing your data with a megacorp... why worry about anyone else seeing it?

Like 20 years ago, I used a program called Teleport Pro to do something similar.

I would dial up with my phone modem when the internet access was cheap (during the night), it would automatically browse a page I provided, and in the morning I would have the page ready to read.

Fun times with 10 to 20 kb/s speeds.

I had a similar experience, but I also think Internet Explorer saved the websites you visited so you could browse them later in offline mode, right? I remember sometimes I couldn't tell whether I was online or not because the website was cached; I had to visit a different website that I had never visited before to check my connection status.

I'm curious why you went down the path of using Chrome's debugging functionality instead of implementing an HTTP proxy, which would provide the benefit of being browser-agnostic too.

Could you expand on that, please?

Also wonder about the proxy thing. Sounds like squid in offline mode would be broadly similar?


You'd have to mitm ssl unfortunately:


As for proxy switching, in addition to command line options, there's


Very convenient for using ssh as a socks5 proxy, for example.

After reading this I was wondering if it might be fun to write an HTTP proxy that a) recorded everything in an SQLite database, and b) presented a localhost server which would let you search that content.

I suspect it would get very very very busy, with tracking-pixels, etc, but if you only made it archive text/plain, text/html, and similar content-types it might be a decent alternative to bookmarks, albeit only on a single host/network.

Wouldn't be hard to knock up a proof of concept, perhaps I should do that this evening.
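If anyone wants a head start, here is a minimal sketch of the storage side of such a proxy (schema and names are made up, and a real version would want an FTS index rather than LIKE): record only textual content types, then search bodies from the localhost server.

```python
import sqlite3

# Content types worth archiving; tracking pixels, images, etc. get skipped.
ARCHIVABLE = ("text/plain", "text/html")

def make_store():
    con = sqlite3.connect(":memory:")  # a real proxy would use a file path
    con.execute("""CREATE TABLE responses (
        url TEXT, fetched_at TEXT, content_type TEXT, body TEXT)""")
    return con

def record(con, url, fetched_at, content_type, body):
    # Strip parameters like "; charset=utf-8" before checking the type.
    if content_type.split(";")[0].strip() not in ARCHIVABLE:
        return False
    con.execute("INSERT INTO responses VALUES (?, ?, ?, ?)",
                (url, fetched_at, content_type, body))
    return True

def search(con, term):
    return con.execute(
        "SELECT url FROM responses WHERE body LIKE ?", (f"%{term}%",)
    ).fetchall()
```

The content-type filter is what keeps the "very very very busy" firehose of pixels and scripts out of the archive.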

I did something like this a while ago: https://github.com/nspin/spiderman

I used the wonderful tool mitmproxy for both recording and serving.

Could probably strip the tracking codes anyway. Someone linked to wwwoffle, check if that isn’t what you had in mind.

One thing I can imagine is that Chrome uses the system proxy settings. That means changing them would affect all other apps on the machine, and the proxy could end up saving a lot of garbage requests you don't actually need.

On the other hand user agent filtering could be implemented to partially solve this.

Chrome can accept proxy settings from the command line, so all you'd need to do would be to change the shortcut to Chrome.

The more dangerous part would be doing TLS MitM to allow HTTPS content to be recorded. This is well documented but still a potential security issue if the certificate somehow gets picked up by malware or something.

Great feature. Though it feels like a UI misstep that the user has to use npm to switch between recording and browsing. A nicer solution could be a Chrome extension button, or accessing the archived version via a synthetic domain, e.g. example.com.archived

There's also some 'magic' potential here to have a proxy that detects whether there's a live network interface or not (including some sanity checking against captive portals), passes through the live sites while recording when there is a connection, and serves from the last recorded versions when there isn't.

Sounds like a local squid proxy setup from the early 2000s...

Thanks for the compliment. I totally agree re the misstep and want to improve that.

Once the library server is implemented, you'll be able to browse to it (localhost:8080 or so) and access your archive from there.

Nice idea on the synthetic domain; that might yield another way to do it.

Great idea! I like the concept.

One of the things I miss most about the old web was how trivial it was to local mirror any website. It was great!

I remember I tried something similar a long time ago but decided it wasn't worth it.

2MB per page at 100 pages a day is 200MB/day. That is 73GB per year.

Maybe once a year I hit the problem where I remember reading something but can't google my way back to the exact page. So I had a proxy solution set up, but the math worked out that it wasn't worth paying the storage cost just for that one-time convenience.

Perhaps one solution would be to extract the plain text and the URL of the page only. That wouldn't take much space and would still be searchable.
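The back-of-envelope math, with the plain-text variant added (the 50 KB of extracted text per page is my assumption):

```python
# Storage cost of archiving 100 pages a day for a year.
pages_per_day = 100
full_page_mb = 2                 # average full-page weight, per the comment
per_year_full_mb = full_page_mb * pages_per_day * 365      # = 73,000 MB, ~73 GB

text_page_kb = 50                # assumed size of extracted plain text per page
per_year_text_mb = text_page_kb * pages_per_day * 365 / 1000   # ~1.8 GB
```

So text-only extraction cuts the yearly cost by roughly 40x while keeping everything searchable.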

Between this project and the others mentioned in the discussion, these are excellent resources for anyone needing a forensic record of how they assembled evidence from browsing open sources on the internet. Package this as a VM that can be quickly spun up fresh per case, sell support to LE types, and you've got a business.

If original author agrees, I can dockerize it.

Hey, that's a cool idea about a business. I've made it into a packaged Node.js app as a binary, which you can see on the releases page:


I like multiple release channels and there's plenty of ways to install and use this.

You can download a standalone binary (Win, Mac or Linux), install globally from npm, or just clone or download the repo and run it.

I'm not sure about Docker, but could you maybe give it a try and share the Dockerfile with me privately, so I can decide if I like it?

If it's good then we can add it to the packages page on the repo. Sound OK? Email me at cris@dosycorp.com if you like this idea. Thank you! :)

Nitpicking but am I the only one who hates "serve" being used in strange contexts? IMHO to serve is to send something over a network. If it's all happening locally, the verb should be "load" because it's just taking a file and loading it into a browser at that point.

If it's running an http server locally and your browser is making requests to it, it's definitely serving. Not sure how "loading" could be a better word unless you're explaining it to someone nontechnical. Surely a nitpick is supposed to be more pedantic, not less?

Really brilliant implementation concept.

I love how it uses the browser's debug port to save literally everything. I have often dreamed of "a Google for everything I've seen before".

I recently spent some time making something like this and hope to release it soon as FOSS. However, it differs in some critical ways.

I desire to:

- save pages of interest, but not a firehose of everything I ever see

- save from anywhere on any internet device (eg mobile phone)

- Archive rich content like YouTube videos or songs even if I do not watch the entire video (or any of it), with support for credentials (e.g. .netrc)

Looking forward to digging deeper into this thread and your project for more ideas!

Thank you very much for the big compliment! I feel very happy to hear it.

A lot of people in this thread talked about proxies, as in "why did you not implement a proxy" or "I implemented this but as a proxy"

The main advantage I see of this approach over a proxy is: simplicity.

The core of this is approximately 10 lines of code, because it can hook into the commands and events of the browser's built-in Network module.

I think there's no need to build a proxy if you can already program the browser's built-in Fetch module.

I think proxies have issues such as distribution (how do you distribute your proxy? As a cumbersome download that requires setup? As a hosted service that you have to maintain and pay for?), security (how do you handle TLS?), and complexity (I built this in a couple of hours over 2 days; one of the "obligatory bump" projects added to this thread is a proxy and has thousands of commits).

The biggest problem I see is the complexity. I feel a proxy would create a tonne of edge cases that have to be handled.

I did not mind sacrificing the benefits of a proxy (it can work on all browsers, and on any device), because I did not want to run my own server for this, but rather, crucially (I feel) give people back the power and control over their own archive. Even more importantly for me is I want to just make this the easiest way to archive for a particular set of users (say, Chrome users on Desktop), really get that right and then if that works, move to other circles later (such as mobile users, or other browsers).
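For the curious, "hook into the Network module" looks roughly like this at the protocol level. The method and event names (Network.enable, Network.responseReceived, Network.getResponseBody) are real DevTools protocol names, but the wiring below is an illustrative sketch, not the project's code; a real client sends these frames over the ws:// endpoint on the debugging port.

```python
import itertools
import json

_ids = itertools.count(1)  # DevTools commands need unique message ids

def command(method, **params):
    """Build one DevTools protocol frame, ready to send over the websocket."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})

# Roughly the whole "core": turn on network events...
enable = command("Network.enable")

# ...then, for every response the browser sees, ask for its body and archive it.
def on_event(raw, archive):
    event = json.loads(raw)
    if event.get("method") == "Network.responseReceived":
        url = event["params"]["response"]["url"]
        request_id = event["params"]["requestId"]
        archive[url] = None  # body filled in when Network.getResponseBody replies
        return command("Network.getResponseBody", requestId=request_id)
```

The browser itself does all the fetching, TLS, caching, and JS execution; the archiver is just a listener, which is where the simplicity argument comes from.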

Anyway, thanks for your kind comment, it really encourages me to share more about this.

I read some of your comment history but I can't get a lock on who you are, but you seem pretty interesting. Do you mind sharing a GitHub or something? If not, but you'd like to continue chatting, email me cris@dosycorp.com

Thank you!

You should add "upload / sync with decentralized storage" to the future goals.

Seems like a logical next step to have it sync to an IPFS or Dat drive. Not sure how it would be implemented though.

I love how there's only a single browser or two in the entire world, lol (Safari I've got no clue about). And that's while assuming Chrome's and Firefox's debugging streams would be compatible...

You assume I don't use any forks or custom versions. What if I use an Electron-based browser? What about Pale Moon or other forks that have older (if any) such interfaces? What about Opera? Etc., etc. You get the point... I hope.

Bump for my related project: https://github.com/CGamesPlay/chronicler

I'm actually in the process of rewriting this. I like your approach of using DevTools to manage the requests, the approach taken in Chronicler is to hook into Chrome's actual request engine.

You might like to look at Chronicler to see some attempts at UI for a project like this, particularly decisions around what to download and how to retrieve it.

I've been building something similar, but that uses Firefox sync to grab history and bookmarks. https://github.com/jimktrains/ffsyncsearch

This seems to be something a LOT of people are working on right now. I have this open in another tab: https://news.ycombinator.com/item?id=14272133 where several MORE alternatives are listed.

One feature I'd love that I don't see anywhere is "also go through my history, let me check/uncheck particular items, then submit the rest to ArchiveBot or WBM or something." Since I apparently have a habit of visiting sites that aren't in the WBM yet.

Interesting. I didn't even think to look around, I was just scratching an itch.

I know the WBM has some tools to submit sites; I should look into incorporating calls to them too.

If anyone would be interested in the next major version, please add your email to this list to be notified: https://forms.gle/FJmsXCDy18RrbFtt9

Nice job, I think this is promising but there has got to be a better way than having people enable their debugger. Is there any reason you can't just copy the contents of each page and then post it somewhere?

Seems like a good use case for a browser extension?

Why not use a proxy?

My initial thought. Is there a proxy that also serves? Maybe a Squid add-on? This would be awesome. I hope to see something like the Wayback Machine, but local, for all the things I've ever surfed.

Any caching proxy (including squid) will serve - that is their whole point. You may need to tweak the configuration to ignore the website-specified expiry and cache headers.

Squid (or any other popular caching proxy I'm aware of) doesn't cache verbs other than GET, so a lot of websites can't be cached this way; notably, GraphQL APIs usually use POST for all requests, even just queries.
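One way around the POST limitation, sketched below: widen the cache key to include the method and a hash of the request body, so identical GraphQL queries hit the same entry. This is hypothetical, not something Squid does out of the box; you'd also want an allowlist of endpoints known to be side-effect-free, since POST is not safe or idempotent in general.

```python
import hashlib

def cache_key(method, url, body=b""):
    """Cache key for a caching proxy that can also cache POST-based queries.

    The body hash distinguishes different GraphQL queries sent to the same
    URL, while identical queries map to the same cached entry.
    """
    digest = hashlib.sha256(body).hexdigest()
    return f"{method} {url} {digest}"
```
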

In principle this is true, but there are some caveats regarding usability: it does not present the cache in a friendly way. There is no index or something like a nice starting page with your top browsed sites - you get the idea.

I guess this is totally doable with Squid or any other caching proxy, but I don't know of any that does this.

A proxy means you're MITM-ing your own connection (debug mode is also an issue, but not as simple to take advantage of).

How does this handle HTTPS traffic?

with a self-generated root cert

Could this beat Google? Local search of anything I have seen, plus silo search sites for specific purposes like Amazon and HN. Would you miss anything, given that Google results are either bought or gamed? Maybe we need better social media.

Obligatory bump for my project ReadableWebProxy (https://github.com/fake-name/ReadableWebProxy) that was originally intended to do this.

At this point, it does a GIANT pile of additional things most of which are specific to my interests, but I think it might be at least marginally interesting to others.

It does both full autonomous web-spidering of sites you specify, as well as synchronous rendering (You can browse other sites through it, with it rewriting all links to be internal links, and content for unknown sites fetched on-the-fly).

I solve the javascript problem largely by removing all of it from the content I forward to the viewing client, though I do support remote sites that load their content through JS via headless chromium (I wrote a library for managing chrome that exposes the entire debugging protocol here: https://github.com/fake-name/ChromeController https://pypi.org/project/ChromeController/).

Why are you obligated to plug your own project instead of discuss the linked content?

Very strange commit message format... they all seem too formal with capitalization and periods, yet are frustratingly ambiguous many times (e.g. "Sources.")

Lovely file descriptions, had a laugh :)

Those are really bad commit messages.

Do you have a guide for good commit messages?

Explain what the commit changes. Think about the person who'll read them. Think about your future self, who'll find a bug and search the commits to understand what happened. "Sources" or "Oops" are among the worst commit messages you could write.
