$ bookmark add http://...
1. Download a static copy of the webpage into a single HTML file, plus an exported PDF copy, and also take care of removing ads and unrelated content from the stored page.
2. Run something like http://smmry.com/ to create a summary of the page in a few sentences and store it.
3. Use NLP techniques to extract the principal keywords and use them as tags.
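To make the shape of this concrete, here is a rough Python sketch of such an add pipeline. It assumes the third-party requests and readability-lxml packages, and the summary and tagging steps are deliberately naive stand-ins for a real summarizer and a proper NLP keyword extractor:

    # Rough sketch of the "bookmark add" pipeline described above.
    import re
    from collections import Counter

    import requests
    from readability import Document


    def bookmark_add(url, max_tags=5):
        html = requests.get(url, timeout=30).text

        # 1. Keep a cleaned, single-file HTML copy (readability strips most
        #    ads/navigation; the PDF export would be a separate step, e.g.
        #    via a headless browser).
        doc = Document(html)
        clean_html = doc.summary()
        title = doc.title()

        # 2. Very naive "summary": the first few sentences of visible text.
        text = re.sub(r"<[^>]+>", " ", clean_html)
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        summary = " ".join(sentences[:3])

        # 3. Naive keyword extraction: most frequent non-trivial words
        #    (a real implementation would use TF-IDF, RAKE, spaCy, ...).
        words = re.findall(r"[a-zA-Z]{5,}", text.lower())
        tags = [w for w, _ in Counter(words).most_common(max_tags)]

        return {"url": url, "title": title, "html": clean_html,
                "summary": summary, "tags": tags}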
And another command like:
$ bookmark search "..."
* Not use regexps or complicated search patterns, but instead:
* Search titles, tags, AND page content smartly and interactively, and
* Sort/filter results smartly by relevance, number of matches, frecency, or anything else useful.
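A minimal sketch of what the ranking side could look like, assuming entries shaped like the output of the hypothetical bookmark_add() above plus an "added" timestamp; the field weights and the recency decay standing in for frecency are arbitrary:

    # Score each stored bookmark by how often the query terms appear in
    # title, tags, and page text, with per-field weights, then sort.
    import time


    def score(entry, terms, now=None):
        now = now or time.time()
        s = 0.0
        for t in terms:
            t = t.lower()
            s += 5.0 * entry.get("title", "").lower().count(t)
            s += 3.0 * sum(tag.count(t) for tag in entry.get("tags", []))
            s += 1.0 * entry.get("text", "").lower().count(t)
        age_days = (now - entry.get("added", now)) / 86400
        return s / (1.0 + age_days / 365)   # mild decay for older entries


    def search(entries, query, limit=10):
        terms = query.split()
        ranked = sorted(entries, key=lambda e: score(e, terms), reverse=True)
        return [e for e in ranked if score(e, terms) > 0][:limit]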
Everything would be stored in a git repository or a simple file structure for easy synchronization; bonus points for browser integrations.
- ability to have certain sites run site-specific extra processing, e.g. youtube-dl for YouTube links (see the dispatch sketch below)
- ability to have a list of sites to be archived periodically instead of once only, with the option to be notified when a site updates, even if it were run as a batch job
- ability to ingest a PDF or ebook, identify all the URLs, snapshot all of them, and present them as a list linking to the original, the cached version, and the page location
- would also be nice if the data could be stored in a human-readable structure on a normal filesystem, so your ability to use the data isn't dependent on your ability to run the tool.
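For the site-specific processing item, a plugin-style dispatch table is one obvious shape. This is purely illustrative -- the handler registry and names are invented, and it assumes youtube-dl is installed:

    # Hypothetical plugin-style dispatch for site-specific extra processing.
    import subprocess
    from urllib.parse import urlparse

    SITE_HANDLERS = {}


    def handler(*hostnames):
        def register(fn):
            for h in hostnames:
                SITE_HANDLERS[h] = fn
            return fn
        return register


    @handler("youtube.com", "www.youtube.com", "youtu.be")
    def archive_video(url, dest_dir):
        # Shell out to youtube-dl for video links.
        subprocess.run(["youtube-dl", "-o", f"{dest_dir}/%(title)s.%(ext)s", url],
                       check=True)


    def extra_processing(url, dest_dir):
        fn = SITE_HANDLERS.get(urlparse(url).hostname)
        if fn:
            fn(url, dest_dir)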
Overall I think it is an interesting project but the commercial potential is limited.
EDIT: maybe the document processing and periodic-check features would make more sense as a higher-level tool that depends on the bookmarking tool -- and the extra processing might also make more sense as a plugin-type architecture.
This is really important yes!
> Overall I think it is an interesting project but the commercial potential is limited.
Indeed. This kind of tool is mostly of interest to hackers, and that's not a big enough market to build a business model on, I suppose.
This would have to be done for the love of open-source :-)
Yes, buku doesn't intend to be a commercial utility. I use bookmarks as a context pointer for everything I do. So I wrote buku. But it's written as a library so other projects can use it.
The problem with a fixed / structured filesystem is that it's largely inflexible. The next option would be some sort of hybrid -- a persistent data store plus, say, hardlinks or symlinks onto that store based on other elements. That could work, but would be somewhat annoying to maintain (though not impossible, and a tools-based approach might well work).
The advantage to a filesystem-based approach is that you'd be able to use any filesystem-based tools on it: find, grep, ls, cd, etc.
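A minimal sketch of that hybrid, assuming a content-addressed store directory plus per-tag directories of symlinks; the layout and paths are only an illustration:

    # Documents live once in a content-addressed store; per-tag directories
    # hold symlinks into it, so plain ls/find/grep still work.
    import hashlib
    import os
    import shutil


    def add_to_store(path, root="~/bookmarks"):
        root = os.path.expanduser(root)
        digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
        stored = os.path.join(root, "store", digest + os.path.splitext(path)[1])
        os.makedirs(os.path.dirname(stored), exist_ok=True)
        shutil.copy2(path, stored)
        return stored


    def tag(stored_path, tag_name, root="~/bookmarks"):
        root = os.path.expanduser(root)
        tag_dir = os.path.join(root, "tags", tag_name)
        os.makedirs(tag_dir, exist_ok=True)
        link = os.path.join(tag_dir, os.path.basename(stored_path))
        if not os.path.lexists(link):
            os.symlink(stored_path, link)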
This also points to the distinct limitations of filesystems and naming conventions for document-oriented systems generally. They're OK for statically-defined computer systems, sometimes (/proc, /sys, /dev, /dbus, ... are all exceptions for which virtualising the filesystem is a current fix), but when it comes to the human realm, the oversimplification and reliance on standards creates pain.
A related problem is: how do you identify a given work, reasonably uniquely and reasonably persistently across variants?
Taking a content hash works for git, but doesn't apply particularly well to a human-readable document, where whitespace, casing, character set, character substitutions (straight vs. curly quotes, "-" and "--" vs. en and em dash, etc.), translations, multiple output formats, etc., might all create unique (and unrelatable) hash fingerprints.
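One partial mitigation is to normalize away exactly those variations before hashing. A sketch -- it obviously does nothing for translations, scans, or genuinely different editions:

    # "Normalized" fingerprint: strip whitespace, casing, curly quotes and
    # dash variants before hashing, so trivially different copies of the
    # same text collapse to one ID.
    import hashlib
    import re
    import unicodedata

    SUBSTITUTIONS = {
        "\u2018": "'", "\u2019": "'",      # curly single quotes
        "\u201c": '"', "\u201d": '"',      # curly double quotes
        "\u2013": "-", "\u2014": "--",     # en / em dash
    }


    def fingerprint(text):
        text = unicodedata.normalize("NFKC", text)
        for src, dst in SUBSTITUTIONS.items():
            text = text.replace(src, dst)
        text = re.sub(r"\s+", " ", text).strip().lower()
        return hashlib.sha256(text.encode("utf-8")).hexdigest()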
In catalogues, some tuple of author, title, and publication date generally suffices, and creates the general outline of a unique, but relatable, identity foundation. A book might have multiple editions or publishers, or multiple formats (ISBN relates these to the core work). Sub-parts (a chapter or section) might be included in other works. Etc.
I'd really like to have the option, say, of creating and relating:
* The source document.
* Some standardised markup (Markdown, LaTeX, some BDSM HTML5 or similar format, etc.).
* Generated outputs (PDF, PS, ePub, DJVU, Mobi, ASCII, etc.)
* Metadata: Author, title, publisher, publication, date, URL, language, various identifiers (LoCCN, ISBN, DOI, ...), etc., subject(s), rating(s), review(s), citation(s).
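As a sketch only, such a record might look something like the following; every field name here is illustrative rather than taken from any existing tool:

    # One record tying source, markup, generated outputs, and metadata
    # to an author/title/date identity tuple.
    from dataclasses import dataclass, field
    from typing import Dict, List


    @dataclass
    class Work:
        # Identity foundation: author / title / date tuple
        author: str
        title: str
        date: str
        # Source document and standardised markup (paths or URLs)
        source: str = ""
        markup: str = ""                    # e.g. a Markdown or LaTeX master
        # Generated outputs keyed by format: {"pdf": ..., "epub": ...}
        outputs: Dict[str, str] = field(default_factory=dict)
        # Other metadata: identifiers, subjects, citations, ...
        identifiers: Dict[str, str] = field(default_factory=dict)  # ISBN, DOI, LoCCN
        subjects: List[str] = field(default_factory=list)
        citations: List[str] = field(default_factory=list)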
For research, this would be invaluable.
What you are describing here is called a semantic file system, and there are some implementations of this, for instance https://www.tagsistant.net/
Tagsistant is ... getting there. From the description, still not quite what I'm looking for. Its duplicate-file detection, for example, would note the same logical file being present twice, but not, say, War and Peace in LaTeX, generated PDF, and scanned TIFF forms.
Some way of recognising the latter as, if not the same, then at the very least related somehow, would be highly useful.
And it's got an API, so you could make a command-line client.
But first and foremost, it should be an open-source tool -- I forgot to mention that. I don't want to be stuck because (as already said) a company/website closes. Yes, backups are possible, but when the tool is gone, it's gone, and backups are not really useful. People would have to spend days trying to figure out a new tool and how to import the existing data. Open source protects against this kind of issue.
I'm not saying this tool is bad; it's probably really nice for some people. But at least for me it doesn't fulfill my needs.
b) Do you seriously not consider $25 per year reasonable for such a useful service?*
* I would pay that just to read maciej's fantastic presentations.
If I were doing it, I would probably go with Python because of the existing ecosystem: NLP, readability, beautifying, image processing, deep learning, … Probably everything already exists; the goal would be to assemble the pieces like playing with Lego :-) in a much more complicated way, of course.
But I'm not doing it, so your choice is the correct one :-) Have a nice time working on this project!
PS: Even the image captioning is almost already implemented, amazing https://github.com/tensorflow/models/tree/master/im2txt
The main problem isn't pandoc, but HTML -- the crap that passes for Web-compatible today is simply any asshat's bad idea. I see as highly useful something which looks at what's been downloaded and reduces it to a hugely simplified structure -- Markdown will almost always be sufficient.
I've found, in writing my own strippers and manually re-writing HTML, that body content rarely amounts to more than paragraphs, italic/emphasis, and anchors/hrefs. Better-written content has internal structure via headers. Bold itself is almost never used within CMS systems for body text; it's almost always a tell for advertising or promotional content.
The sad truth is that manual rewriting or re-tagging of pages in Markdown is often the best option I've got for getting to something remotely reasonable. The good news is that that's actually a good tool for reading an article, even if you find, on reading, that it's not worth keeping :)
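A sketch of that kind of reduction, assuming the beautifulsoup4 package: keep only headers, paragraphs, lists, emphasis, and anchors, and drop every attribute except href:

    # Reduce arbitrary HTML to a hugely simplified tag subset.
    from bs4 import BeautifulSoup

    KEEP = {"h1", "h2", "h3", "h4", "p", "em", "i", "a",
            "ul", "ol", "li", "blockquote"}
    DROP_ENTIRELY = ["script", "style", "nav", "aside", "form", "iframe"]


    def simplify(html):
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup.find_all(DROP_ENTIRELY):
            tag.decompose()                  # remove tag and its contents
        for tag in soup.find_all(True):
            if tag.name not in KEEP:
                tag.unwrap()                 # drop the tag, keep its text
            else:
                href = tag.get("href")
                tag.attrs = {"href": href} if tag.name == "a" and href else {}
        return str(soup)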
As for HTML-to-Markdown conversion, http://markdownrules.com is good.
Check it out on: https://goo.gl/xMgxfJ
I like the HN community; it gathers a lot of enthusiastic people!
Assuming 1 bookmark per day for 10 years (roughly 3,650 bookmarks at a couple of MB each), this adds up to around 8 GB. Storage is cheap enough nowadays that I can afford those 8 GB over 10 years. This will probably not be mobile-friendly, but I almost never consult bookmarks from my phone anyway.
Your project seems interesting, thanks for sharing! It may be especially useful to see how you implemented the readability part.
Crestify is a web application, so you could run it on a server and access it over a mobile app. Really happy to hear you like it. And I'm always open to new ideas about it :).