Linkalot: A web-based inbox for your links (gitlab.com/dmpop)
80 points by lproven on Feb 28, 2020 | hide | past | favorite | 34 comments

I think many of us wish for, and have considered developing, an open source, local archive of the websites we visit. We want full text search, full page screenshots, good privacy (don't archive sensitive sites). We might wish for p2p encrypted syncing between devices.

Since we wish for this, I presume it's been attempted many times.

So my question, humble Hacker News readers: can anyone provide a rundown of the existing open source solutions and their features/pros/cons, or simply say which they prefer?

What I'd find to be interesting & useful is:

Every URL I visit is automatically recorded along with the time & date visited, title and other metadata.

If the URL is visited via a link off another webpage, have that relationship recorded. Provide some sort of navigable tree / searchable database. Should be able to easily scale to tens / hundreds of millions of URLs across decades.

If the URL is visited by manually inputting that URL, provide option to type into a field something like "heard this on the radio in a show about xyz..." or "so-and-so told me about this on 2-28-2020 at lunch".

Provide option (when viewing page or drilling into history tree) to:

- paste in a paragraph or two of text from the page to associate for context

- save the entire page in WARC or similar

- rate / star / tag that page

Provide option to delete pages from history -- either entirely, or "scratched out" (maybe with a comment) so one can remember which branches of the tree are not worth following again.

Provide fuzzy-matching, as-you-type search across selectable metadata fields.

Search all content with regex.
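For what it's worth, the automatic-recording part of this wishlist maps fairly directly onto a single table; a minimal sketch with sqlite3, where every table and column name is made up for illustration:

```shell
# A sketch of the history database described above, using sqlite3.
# All names are illustrative, not from any existing tool.
sqlite3 history.db <<'SQL'
CREATE TABLE IF NOT EXISTS visits (
  id          INTEGER PRIMARY KEY,
  url         TEXT NOT NULL,
  title       TEXT,
  visited_at  TEXT NOT NULL,                  -- time & date visited
  referrer_id INTEGER REFERENCES visits(id),  -- "visited via a link off another page"
  note        TEXT,                           -- "heard this on the radio..."
  rating      INTEGER,                        -- rate / star
  scratched   INTEGER DEFAULT 0               -- "scratched out" instead of deleted
);
CREATE INDEX IF NOT EXISTS idx_visits_url ON visits(url);
SQL
```

The self-referencing referrer_id column is what gives you the navigable tree: walking it upward recovers how you arrived at any page.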

This would likely involve a browser plugin I guess, but it'd be nice to have a browser-independent way of doing this to facilitate multiple browsers on multiple machines. Also, would be good to avoid "extensions no longer supported after browser update" situations.

In the time it took me to get around to typing this up I see there are a lot of other interesting suggestions here...will have to sit down & read through them (and the Linkalot docs) more closely when I've some free time.

What I really want is a script that

(1) Takes a URL and optional comment as input

(2) Saves the webpage it points to into a git repo (a simple curl should suffice for most websites)

(3) Inserts that URL, title of the page pointed-to by the URL and the optional comment into an org-mode file that lives in the root of the repo

The org-mode file is a highly-searchable and context-preserving database (I can add tags, create hierarchies, add links to and from other relevant (org-mode or not) files) in the most portable format ever: plain text.
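A rough sketch of those three steps as POSIX shell functions — the repo path, file names, and org layout here are all assumptions for illustration:

```shell
#!/bin/sh
# save-link.sh -- sketch of the three steps above. Paths and names are made up.

# Turn a URL into a filesystem-safe file name.
slugify() {
  printf '%s' "$1" | tr -c 'A-Za-z0-9' '-' | cut -c1-80
}

save_link() {
  url=$1
  comment=${2:-}
  repo=${LINKS_REPO:-$HOME/links}
  cd "$repo" || return 1
  mkdir -p pages

  # (2) fetch the page into the repo
  page="pages/$(slugify "$url").html"
  curl -sL "$url" -o "$page"

  # pull the <title> out of the fetched HTML for the org entry
  title=$(sed -n 's|.*<title[^>]*>\(.*\)</title>.*|\1|p' "$page" | head -n 1)

  # (3) append an org-mode entry at the root of the repo
  {
    printf '* [[%s][%s]]\n' "$url" "${title:-$url}"
    [ -n "$comment" ] && printf '  %s\n' "$comment"
  } >> links.org

  git add -A && git commit -q -m "Save $url"
}
```

The naive sed title extraction will miss multi-line or oddly cased title tags, but it covers the common case without extra dependencies.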

I really don't need a web interface. Actually, if I later decide that I need one, I can build one easily on top of this basic system.

I really want to be able to use this across multiple devices: mainly my two computers, and an Android phone. Using git gives me a reliable protocol for syncing between multiple devices. I want it to be a smooth experience on my phone, which would probably require some sort of git-aware app. Something similar to the Android client for the pass password manager would be ideal.

I hear that git repos can be GPG-encrypted. Ideally, I'm able to serve all this off of a repo hosted on a VPS. I don't want to rely on Dropbox (I'm trying to transition away from it) for syncing.
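One existing tool for the encrypted-repo part is git-remote-gcrypt, which encrypts the whole repository to a set of GPG keys before pushing. A sketch of the setup — the hostname, remote name, and key id below are placeholders:

```shell
# Add a GPG-encrypted remote on the VPS (all names are placeholders).
git remote add cryptremote gcrypt::ssh://you@vps.example.com/~/links.git
git config remote.cryptremote.gcrypt-participants "YOUR_GPG_KEY_ID"
git push cryptremote master
```

The VPS only ever sees ciphertext, so it doesn't need to be trusted the way Dropbox would.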

> (2) Saves the webpage it points to into a git repo (a simple curl should suffice for most websites)

FWIW I've done something similar, and lots of sites that use a lot of JS (and pretty much every single-page app like Twitter and FB) will not re-render correctly just because you have the files. It actually takes a lot of work to clone a webpage; the best solution I've found so far is to print a PDF from a headless Chrome (but this has its own problems, like now you have to deal with a PDF).

Even generating the PDF is a lot harder than it seems, at least if you've never done it before, because there are a lot of gotchas (for example, did you know that most websites provide a second stylesheet to be used while printing, which can make the page look only slightly off but still clearly broken? I didn't either).
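For reference, the headless-Chrome route is a one-liner (the binary may be chrome, chromium, or chromium-browser depending on platform; note this is exactly where the print stylesheet gets applied):

```shell
# Print a page to PDF with headless Chrome/Chromium.
chromium --headless --disable-gpu --print-to-pdf=page.pdf 'https://example.com/'
```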

If the PDF format is not mandatory for you, you might be interested in SingleFile [1] (I'm the author) which you can run from the command line. It will interpret scripts and faithfully save a snapshot of a page in a single HTML file.

[1] https://github.com/gildas-lormeau/SingleFile/tree/master/cli

> lots of sites that use a lot of JS

For many "modern" sites, its really better to just take a screenshot and save the PNG.

Though there are still many sites that render just fine without JS. I've been trying out Brave Browser with JS disabled for some weeks now, and I was surprised how many sites are readable with JS disabled. And so much faster and less jumpy too.
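Headless Chrome covers the screenshot route too, if you don't want a separate tool; the window size below is an arbitrary choice and tall pages will need a larger value or multiple captures:

```shell
# Screenshot a page; adjust --window-size to capture more of a long page.
chromium --headless --disable-gpu --screenshot=page.png \
  --window-size=1280,2000 'https://example.com/'
```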

Hmmm.. this wouldn't be too hard to write and sounds like an interesting weekend project.

Would you be interested in using WARC for the webpage though? This way, everything is captured in a single file and you aren't littering your repo with random files and images.
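wget can already produce WARCs, which would keep the repo at one archive file per page; a sketch, with a placeholder output name:

```shell
# Fetch a page plus its requisites into page.warc.gz,
# discarding the loose downloaded files afterwards.
wget --warc-file=page --page-requisites --delete-after 'https://example.com/'
```

Like plain curl, this won't execute JS, so the single-page-app caveat from upthread still applies.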

> sounds like an interesting weekend project.

Aren't we in luck that the weekend is just coming up!

> Would you be interested in using WARC for the webpage though? This way, everything is captured in a single file and you aren't littering your repo with random files and images.

I didn't know about this. I've looked into it a bit, and it seems perfect.

I'm not too concerned about saving webpages, I'm much more concerned about actually having a populated database of links. I only expect to need to use the saved page if the link breaks.

I can work on writing a simple elisp script (incidentally, I don't know very much elisp either, but that's something I am willing to take time out to learn because I expect to be using it a lot in the future), but I do need someone else to write the Android app.

You should always expect the link to break.

(1) Takes a URL and optional comment as input

(2) Saves the webpage it points to into a git repo (a simple curl should suffice for most websites)

(3) Inserts that URL, title of the page pointed-to by the URL and the optional comment into an org-mode file that lives in the root of the repo

If you're willing to change "git" to "version control", it should be pretty easy to implement that in Fossil. It doesn't require much to add an extension written in your language of choice if you're going to run it on your desktop. Plus you'd get the web interface for free if you decided to put it on a web server.

I just wrote a script that covers the first 2 points (though it creates a PDF rather than doing a simple curl) and allows for searching the database. Org-mode stuff could be added later. github.com/websalt/bmark

You visit a website that you find interesting, e.g. https://businessfinancing.co.uk/the-oldest-company-in-almost...

You have folders from a-z in your data directory.

You save the website in /data/o/oldestcompanies or into a deeper directory to your liking.

Let recoll take care of the rest.


Thanks for sharing your flow/idea. If I understand correctly, here are the pros and cons.

Pros:

- full text search

- privacy: save only what you want

- saved webpage has OK rendering

- easy backup using backup software

Cons:

- need to save manually and specify directory/file name

- no screenshot (no screenshot in search results)

- saved webpage not 100% accurate

- requires additional backup solution

Yes. Works for me. There will always be pros and cons.

> need to save manually and specify directory/file name

This may also be a pro for me, since I may be able to find it without using recoll. Also, how many websites do you need to save? I assume only a tiny fraction. If you want to save everything, recoll does offer this more or less out of the box:

"Indexing visited WEB pages with the Recoll Firefox extension"


This is another standalone piece of software that may do what you are looking for. https://getpolarized.io/2019/04/11/Polar-Initial-Crowdfundin...

I saw someone recommend Memex as a Firefox extension for full-text search in history and bookmarks the other day. I've started using it, but can't yet comment on its usefulness.

It still feels a bit complex to share data between my computers (I wish for p2p, Nextcloud support, or something similar). I don't much like it moving DDG's instant answers to the bottom of the page, nor the default sidebar and highlighter, but that could just take some getting used to.

I would recommend Bookmark OS. It's not open source, but it offers full text search, full page screenshots, and other neat features https://bookmarkos.com

I've seen many services like this come and go. As great as some of them are, having to trust a 3rd party with a full text archive of your complete web history is not a good value proposition. Not only due to privacy but also longevity: I don't want my archive to depend on this company remaining honest while also being successful enough to stay around for 10-20 years.

I've yet to get around to using it, but this looks like it may be good: https://github.com/pirate/ArchiveBox

I see they also have a good listing of alternatives https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm...

Since I found Wallabag, I did not search for anything else. It fundamentally changed my reading habits. Highly recommended! https://github.com/wallabag/wallabag

Joplin. OSS Evernote with excellent web clipping; I especially like that you can capture text as markdown.

Is this an alternative to Wallabag with fewer features? Archiving links is fine, but what's of most interest is archiving the content, isn't it? So is this an online bookmark manager? Sorry, it's not clear to me. I only know it saves links in plain text files, that you can add them with a bookmark, and that you can password-protect them.

"Screenshot": water.css (https://kognise.github.io/water.css/) + the following HTML snippet, per link:

  <p><a href="$url" title="$txt">$url</a></p>
These HTML snippets are what's saved, one per line, in the mentioned plain text file ("links.txt"). The webpage is a dump of this file plus HTML/CSS boilerplate.
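Based on that description, the whole pipeline can be approximated in a few lines of shell — the file names and boilerplate here are assumptions for illustration, not Linkalot's actual code:

```shell
# Append one HTML snippet per line, then wrap links.txt in boilerplate.
printf '<p><a href="%s" title="%s">%s</a></p>\n' \
  'https://example.com' 'Example link' 'https://example.com' >> links.txt

{
  printf '<!doctype html>\n'
  printf '<html><head><link rel="stylesheet" href="water.css"></head><body>\n'
  cat links.txt
  printf '</body></html>\n'
} > index.html
```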

Yay for GitLab. Boo for no screenshot.

Update from the author (who's happy this has provoked interest :-) —

There is now even a Firefox add-on that works with Linkalot: https://addons.mozilla.org/en-US/firefox/addon/send-tab-url/

Shameless plug: you can create and share bookmarklets through https://bookmarkify.it/

Basically, bookmarks?

I dream of a browser that would merge bookmarks + history into one, with full-text search.

https://getmemex.com/ might be what you're looking for. I've tried to use it, but it somehow managed to destroy its database 3 or 4 times. After that I gave up and uninstalled the extension again.

I'm working on this (pretty early stage)


Looks like this: https://i.imgur.com/OMGlBpS.png

Happy to see not another SaaS web app but a simple and sustainable tool.

Would be great if there were a demo linked (even if the functionality seems really straightforward).

Does it support organizing the links in categories?

Here is the author's demo version: https://tokyoma.de/linkalot/

He is a colleague of mine on the documentation team at SUSE.

I made this link collection tool: http://tentacle.rupy.se

It's a bit over-engineered on the db side but it works well.

I'd prefer a Go- or C-based bookmark manager that lets you store bookmarks in markup- or YAML-based documents, either one per bookmark or one for many. That way they can be synced using Google Drive or any cloud sync solution. Then add a web interface on top of that, and browser extensions for additional features. There really is no "good" bookmarking solution at this point, comparable to what tools like pass/gopass are for passwords on Linux.
