Since we wish for this, I presume it's been attempted many times.
My question is, my humble Hacker News readers: can anyone provide a rundown of the existing open-source solutions and their features/pros/cons, or simply which they prefer?
Every URL I visit is automatically recorded along with the time & date visited, the title, and other metadata.
If the URL is visited via a link off another webpage, have that relationship recorded. Provide some sort of navigable tree / searchable database. Should be able to easily scale to tens / hundreds of millions of URLs across decades.
If the URL is visited by manually inputting that URL, provide option to type into a field something like "heard this on the radio in a show about xyz..." or "so-and-so told me about this on 2-28-2020 at lunch".
Provide option (when viewing page or drilling into history tree) to:
- paste in a paragraph or two of text from the page to associate for context
- save the entire page in WARC or similar
- rate / star / tag that page
Provide option to delete pages from history -- either entirely, or "scratched out" (maybe with a comment) so one can remember which branches of the tree are not worth following again.
Provide fuzzy-matching, as-you-type search across selectable metadata fields.
Search all content with regex.
This would likely involve a browser plugin, I guess, but it'd be nice to have a browser-independent way of doing this to facilitate multiple browsers on multiple machines. It would also be good to avoid "extensions no longer supported after browser update" situations.
In the time it took me to get around to typing this up I see there are a lot of other interesting suggestions here...will have to sit down & read through them (and the Linkalot docs) more closely when I've some free time.
(1) Takes a URL and optional comment as input
(2) Saves the webpage it points to into a git repo (a simple curl should suffice for most websites)
(3) Inserts that URL, the title of the page it points to, and the optional comment into an org-mode file that lives in the root of the repo
The org-mode file is a highly-searchable and context-preserving database (I can add tags, create hierarchies, add links to and from other relevant (org-mode or not) files) in the most portable format ever: plain text.
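Roughly, a minimal shell sketch of steps (1)-(3), before I get around to the elisp version (the repo path, file layout, and title extraction are all placeholder assumptions):

    #!/bin/sh
    # save-link.sh URL [COMMENT] -- hypothetical sketch of steps (1)-(3)
    set -e
    REPO="$HOME/links"                      # assumed repo location
    URL="$1"; COMMENT="${2:-}"
    cd "$REPO"
    mkdir -p pages
    NAME=$(date +%Y%m%d%H%M%S)
    # step (2): a simple curl, saved into the repo
    curl -sL "$URL" -o "pages/$NAME.html"
    # naive <title> extraction; breaks on multi-line or attribute-heavy titles
    TITLE=$(sed -n 's/.*<title[^>]*>\(.*\)<\/title>.*/\1/p' "pages/$NAME.html" | head -n 1)
    # step (3): append an org-mode link entry, plus the comment if given
    printf '* [[%s][%s]]\n' "$URL" "${TITLE:-$URL}" >> links.org
    [ -n "$COMMENT" ] && printf '  %s\n' "$COMMENT" >> links.org
    git add "pages/$NAME.html" links.org
    git commit -q -m "Save $URL"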
I really don't need a web interface. Actually, if I later decide that I need one, I can build one easily on top of this basic system.
I really want to be able to use this across multiple devices: mainly my two computers, and an Android phone. Using git gives me a reliable protocol for syncing between multiple devices. I want it to be a smooth experience on my phone, which would probably require some sort of git-aware app. Something similar to the Android client for the pass password manager would be ideal.
I hear that git repos can be GPG-encrypted. Ideally, I'm able to serve all this off of a repo hosted on a VPS. I don't want to rely on Dropbox (I'm trying to transition away from it) for syncing.
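For the encryption part, git-remote-gcrypt is the tool I've seen mentioned; if I understand it correctly, something like this (the remote URL and key ID are just examples) gives you a GPG-encrypted repo on a plain VPS:

    # assumes git-remote-gcrypt is installed and you have a GPG key
    git remote add vps gcrypt::ssh://user@vps.example.com/~/links.git
    git config remote.vps.gcrypt-participants YOUR_KEY_ID
    git push vps master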
FWIW I've done something similar, and lots of sites that use a lot of JS (and pretty much every single-page webapp, like Twitter and FB) will not re-render correctly just because you have the files. It actually takes a lot of work to clone a webpage; the best solution I've found so far is to print a PDF from a headless Chrome (but this has its own problems, like now you have to deal with a PDF).
Even generating the PDF is a lot harder than it seems, at least if you've never done it before, because there are a lot of gotchas (for example, did you know that most websites provide a second stylesheet to be used while printing, which makes the result look only slightly messed up, but still clearly broken? I didn't either).
For many "modern" sites, its really better to just take a screenshot and save the PNG.
Though there are still many sites that render just fine without JS. I've been trying out Brave Browser with JS disabled for some weeks now, and I've been surprised by how many sites remain readable. And so much faster and less jumpy, too.
Would you be interested in using WARC for the webpage though? This way, everything is captured in a single file and you aren't littering your repo with random files and images.
Aren't we in luck that the weekend is just coming up!
> Would you be interested in using WARC for the webpage though? This way, everything is captured in a single file and you aren't littering your repo with random files and images.
I didn't know about this. I've looked into it a bit, and it seems perfect.
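For anyone else looking into it: wget can apparently record everything it fetches into a WARC as it goes, something like (flags from memory):

    # fetch the page plus its requisites, recording everything into page.warc.gz
    wget --page-requisites --convert-links --warc-file=page 'https://example.com/article'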
I'm not too concerned about saving webpages, I'm much more concerned about actually having a populated database of links. I only expect to need to use the saved page if the link breaks.
I can work on writing a simple elisp script (incidentally, I don't know very much elisp either, but that's something I am willing to take time out to learn because I expect to be using it a lot in the future), but I do need someone else to write the Android app.
If you're willing to change "git" to "version control", it should be pretty easy to implement that in Fossil. It doesn't require much to add an extension written in your language of choice if you're going to run it on your desktop. Plus you'd get the web interface for free if you decided to put it on a web server.
You have folders from a-z in your data directory.
You save the website in /data/o/oldestcompanies or into a deeper directory to your liking.
Let recoll take care of the rest.
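Concretely, the save step could be a wget into the right letter directory, with recoll pointed at the data root (the directory names are just examples):

    # save a readable copy under its letter directory
    wget -E -k -p -P ~/data/o/oldestcompanies 'https://example.com/oldest-companies'

    # ~/.recoll/recoll.conf -- tell recoll what to index
    topdirs = ~/data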
PRO:
- full text search
- privacy: save only what you want
- saved webpage has OK rendering
- easy backup using backup software

CON:
- need to save manually and specify directory/file name
- no screenshots (none shown in search results)
- saved webpage not 100% accurate
- requires additional backup solution
This may also be a PRO for me, since I may be able to find a saved page without using recoll. Also, how many websites do you need to save? I assume only a tiny fraction. If you want to save everything, recoll does offer this more or less out of the box:
"Indexing visited WEB pages with the Recoll Firefox extension"
This is another standalone piece of software that may do what you are looking for.
It still feels a bit complex to share data between my computers (I wish for p2p, Nextcloud support, or something similar). I don't much like it moving DDG's instant answers to the bottom of the page, nor the default sidebar and highlighter, but maybe that just takes some getting used to.
<p><a href="$url" title="$txt">$url</a></p>
There is now even a Firefox add-on that works with Linkalot: https://addons.mozilla.org/en-US/firefox/addon/send-tab-url/
I dream of a browser that would merge bookmarks + history into one, with full-text search.
Looks like this: https://i.imgur.com/OMGlBpS.png
Would be great if there was a demo linked (even if the functionality seems really straightforward).
Does it support organizing the links in categories?
He is a colleague of mine on the documentation team at SUSE.
It's a bit over-engineered on the db side but it works well.