Hacker News

It already seems to have some nice features, but for me the dream bookmark manager would be something really simple with two commands like:

$ bookmark add http://...

That will:

1. Download a static copy of the webpage as a single HTML file, with an exported PDF copy, and also take care of removing ads and unrelated content from the stored copy.

2. Run something like http://smmry.com/ to create a summary of the page in a few sentences and store it.

3. Use NLP techniques to extract the principal keywords and use them as tags
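A very rough sketch of step 3, assuming plain term frequency with a tiny stopword list is allowed to stand in for real NLP. The function name `extract_tags` and the stopword list are made up for illustration and are not part of any existing tool.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real tool would need a
# much larger one, or a proper NLP library.
STOPWORDS = {
    "the", "and", "for", "are", "that", "this", "with", "was", "not",
    "but", "you", "from", "have", "has", "can", "will",
}


def extract_tags(text, max_tags=5):
    """Return the most frequent non-stopword words as candidate tags."""
    # Words of three or more characters, starting with a letter.
    words = re.findall(r"[a-z][a-z0-9+-]{2,}", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(max_tags)]
```

Something like `extract_tags(page_text)` would run on the cleaned-up page text, and the suggested tags could still be edited by hand.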

And another command like:

$ bookmark search "..."

That will:

* Not use regexp or complicated search pattern, but instead;

* Search in titles, tags, AND page content smartly and interactively, and;

* Sort/filter results smartly by relevance, number of matches, frecency, or anything else useful
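To make the ranking idea a bit more concrete, here is a minimal sketch. The `Bookmark` fields, the field weights (3/2/1) and the half-life formula for frecency are all assumptions picked for illustration, not how any existing bookmark manager does it.

```python
from dataclasses import dataclass


@dataclass
class Bookmark:
    title: str
    tags: list
    content: str
    visits: int = 0
    last_visit: float = 0.0  # Unix timestamp of the last visit


def frecency(bm, now, half_life_days=30.0):
    """Visit count decayed by the age of the last visit (assumed formula)."""
    age_days = max(0.0, (now - bm.last_visit) / 86400)
    return (1 + bm.visits) * 0.5 ** (age_days / half_life_days)


def score(bm, query, now):
    """Weighted count of query-term matches, scaled by frecency."""
    title = bm.title.lower()
    tags = [tag.lower() for tag in bm.tags]
    content = bm.content.lower()
    matches = 0.0
    for term in query.lower().split():
        matches += 3.0 * title.count(term)               # title hits weigh most
        matches += 2.0 * sum(term in tag for tag in tags)
        matches += 1.0 * content.count(term)
    return matches * frecency(bm, now)


def search(bookmarks, query, now):
    """Return matching bookmarks, best score first; non-matches are dropped."""
    scored = [(score(bm, query, now), bm) for bm in bookmarks]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [bm for _, bm in hits]
```

Plain substring matching keeps the query free of regexp syntax, at the cost of no stemming or typo tolerance.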

Storing everything in a git repository or simple file structure for easy synchronization, bonus points for browser integrations.

I've been thinking along these lines, some other features I'd like:

- ability to have certain sites run site-specific extra processing: i.e. youtube-dl youtube links

- ability to have a list of sites to be archived periodically instead of once only, with the option to be notified when a site updates, even if it were run as a batch job

- ability to ingest a PDF or ebook, identify all the URLs, snapshot all the URLs, and present them as a list that links to the original, the cached version, and the page location

- would also be nice if the data could be stored in a human readable structure in a normal filesystem, so your ability to use the data isn't dependent on your ability to run the tool.

Overall I think it is an interesting project but the commercial potential is limited.

EDIT: maybe the document processing and periodic check thing would make more sense as a higher level tool that depended on the bookmarking tool -- and the extra processing also might make more sense as a plugin type architecture.

> would also be nice if the data could be stored in a human readable structure in a normal filesystem, so your ability to use the data isn't dependent on your ability to run the tool.

This is really important yes!

> Overall I think it is an interesting project but the commercial potential is limited.

Indeed. This kind of tool is mostly limited to hackers, and it is not a big enough market to create a business model I suppose.

This would have to be done for the love of open-source :-)

There's an option to filter the print output. But yes, currently the data is stored in SQLite only.

Yes, buku doesn't intend to be a commercial utility. I use bookmarks as a context pointer for everything I do. So I wrote buku. But it's written as a library so other projects can use it.

On the filesystem, I'm thinking of some sort of (possibly virtual?) document filesystem structured by:

* Title

* Author

* Subject

* Date

* Other?

The problem with a fixed / structured filesystem is that it's largely inflexible. The next option would be to have some sort of hybrid -- a persistent data store plus, say, hardlinks or symlinks onto that store based on other elements, could work, but would be somewhat annoying to maintain (though not impossible, and a tools-based approach might well work).
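A minimal sketch of that hybrid, assuming a flat `store/` directory plus `by-<field>/<value>/` symlink views. The layout and the names `add_document` and `demo` are invented for illustration; metadata values containing a path separator would need escaping first.

```python
import os
import tempfile


def add_document(root, doc_id, text, meta):
    """Write one document into a flat store and link it into per-field views."""
    store = os.path.join(root, "store")
    os.makedirs(store, exist_ok=True)
    path = os.path.join(store, doc_id + ".txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    # One directory per metadata field and value, e.g. by-author/Tolstoy/
    for field, value in meta.items():
        view = os.path.join(root, "by-" + field, value)
        os.makedirs(view, exist_ok=True)
        link = os.path.join(view, doc_id + ".txt")
        if not os.path.lexists(link):
            # Relative link, so the whole tree can be moved or synced.
            os.symlink(os.path.relpath(path, view), link)
    return path


def demo():
    """Store one document and read it back through the author view."""
    with tempfile.TemporaryDirectory() as root:
        add_document(root, "war-and-peace", "Well, Prince...",
                     {"author": "Tolstoy", "date": "1869"})
        link = os.path.join(root, "by-author", "Tolstoy", "war-and-peace.txt")
        with open(link, encoding="utf-8") as fh:
            return os.path.islink(link), fh.read()
```

The annoying part mentioned above is still there: renaming an author or deleting a document means the tool has to go and fix up the links.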

The advantage to a filesystem-based approach is that you'd be able to use any filesystem-based tools on it: find, grep, ls, cd, etc.

This also points to the distinct limitations of filesystems and naming conventions for document-oriented systems generally. They're sometimes OK for statically-defined computer systems (/proc, /sys, /dev, /dbus, ... are all exceptions, for which virtualising the filesystem is the current fix), but when it comes to the human realm, the oversimplification and reliance on standards creates pain.

A related problem is: how do you identify a given work, reasonably uniquely and reasonably persistently across variants?

Taking a content hash works for git, but doesn't apply particularly well to a human-readable document where whitespace, casing, characterset, character substitutions (straight vs. curly quotes, "-" and "--" vs. en and em dash, etc.), translations, multiple output formats, etc., might all create unique (and unrelatable) hash fingerprints.
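For the simpler variants (whitespace, casing, quote and dash substitutions), hashing a normalised form instead of the raw bytes gets part of the way. This is only a sketch: the substitution table and the helper names `normalise` and `fingerprint` are assumptions, and it does nothing for translations or different output formats.

```python
import hashlib
import re
import unicodedata

# Assumed substitutions: fold typographic variants to plain ASCII forms.
SUBSTITUTIONS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash and em dash
}


def normalise(text):
    """Reduce a text to a canonical form before hashing."""
    text = unicodedata.normalize("NFKC", text)
    for src, dst in SUBSTITUTIONS.items():
        text = text.replace(src, dst)
    text = re.sub(r"-{2,}", "-", text)   # "--" and "---" become "-"
    text = re.sub(r"\s+", " ", text)     # collapse whitespace runs
    return text.strip().casefold()


def fingerprint(text):
    """SHA-256 of the normalised text, as a hex string."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
```

Two copies that differ only in those details then share a fingerprint, while their raw content hashes stay unrelated.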

In catalogues, some tuple of author, title, and publication date generally suffices, and creates the general outline of a unique, but relatable, identity foundation. A book might have multiple editions or publishers, or multiple formats (ISBN relates these to the core work). Sub-parts (a chapter or section) might be included into other works. Etc.

I'd really like to have the option, say, of creating and relating:

* The source document.

* Some standardised markup (Markdown, LaTeX, some BDSM HTML5 or similar format, etc.).

* Generated outputs (PDF, PS, ePub, DJVU, Mobi, ASCII, etc.)

* Metadata: Author, title, publisher, publication, date, URL, language, various identifiers (LoCCN, ISBN, DOI, ...), etc., subject(s), rating(s), review(s), citation(s).

For research, this could be invaluable.

> The problem with a fixed / structured filesystem is that it's largely inflexible. The next option would be to have some sort of hybrid -- a persistent data store plus, say, hardlinks or symlinks onto that store based on other elements, could work, but would be somewhat annoying to maintain (though not impossible, and a tools-based approach might well work).

What you are describing here is called a semantic file system, and there are some implementations of this, for instance https://www.tagsistant.net/

Thanks, I wasn't aware of the term (though I'm familiar with semantic information systems, e.g., "semantic Web", in other contexts).

Tagsistant is ... getting there. From the description, still not quite what I'm looking for. Its duplicate-file detection, for example, would note the same logical file being present twice, but not, say, War and Peace in LaTeX, generated PDF, and scanned TIFF forms.

Some way of recognising the latter as, if not the same, then at the very least related somehow, would be highly useful.

Yes, buku can be used as a bookmarking engine in a bigger project that also scrapes data. It has REST APIs to do that.

Except for the exported PDF, ad removal and git, you've basically described Pinboard.in. I'm pretty sure it searches the content, not just the title, tags, and comment you left. It saves a copy of the page so that the site disappearing doesn't mean you've lost the info. It's not downloadable, but I don't know why I'd want to back up their backup anyway. It suggests tags / keywords (probably by harvesting the plethora of other people bookmarking things).

And it's got an API so you could make a command line client.

Ads removal and "git/simple file structure" are mandatory features for me (PDF export is optional though).

But first and foremost, it should be an open-source tool; I forgot to mention it. I don't want to be stuck because (as already said) a company/website closes. Yes, backups are possible, but when the tool is gone, it's gone, and backups are not really useful. People will have to spend days trying to figure out a new tool and how to import existing data. Open source protects against this kind of issue.

I'm not saying this tool is bad, it is probably really nice for some people. But at least for me it does not fulfill my wants/needs.

The thing is, Pinboard can decide to just shut down, and there go all your saved webpages. Besides, if you want to save the webpages and have full text search, you have to pay for the premium package every year forever.

a) https://pinboard.in/settings/backup

b) Do you seriously not consider $25 per year reasonable for such a useful service?*

* I would pay that just to read maciej's fantastic presentations.

I'll build this. It sounds like a useful and fun project to build. Will build it using Crystal to be able to ship a single binary with no dependencies. SQLite will probably be enough for this project so it'll ship with its own DB.

Also, Mozilla's Readability library[0] should help you out to extract only relevant content (this is what's behind Firefox's reading mode). So, the only semi-difficult part is the NLP.

[0] https://github.com/mozilla/readability

Let us know the project URL if you start it, so we can follow your work!

If I were doing it, I would probably go with Python because of the existing ecosystem: NLP, readability, beautifying, image processing or deep learning, … Probably everything already exists; the goal would be to assemble the pieces like playing with Lego :-) in a much more complicated way of course.

But I'm not doing it, so your choice is the correct one :-) Have a nice time working on this project!

PS: Even the image captioning is almost already implemented, amazing https://github.com/tensorflow/models/tree/master/im2txt

If you want other output formats, there's little you can do to improve over pandoc. That will generate ePub, .mobi, DJVU, PDF, PS, and a multitude of other formats, on the fly. HTML is a valid input for most of those.

The main problem isn't pandoc, but HTML -- the crap that passes for Web-compatible today is simply any asshat's bad idea. I see as highly useful something which looks at what's been downloaded and reduces it to a hugely simplified structure -- Markdown will almost always be sufficient.

I've found, in writing my own strippers and manually re-writing HTML, that body content rarely amounts to more than paragraphs, italic/emphasis, and anchors/hrefs. Better-written content has internal structure via headers. Bold itself is almost never used within CMS systems for body text; it's almost always a tell for advertising or promotional content.
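A stripper along those lines can be sketched with nothing but the standard library. This one keeps paragraphs, headers, emphasis and links and throws the rest of the markup away. The `SKIP` set of "noise" tags is a guess for illustration, and bold markup is simply dropped rather than used as an ad signal.

```python
from html.parser import HTMLParser

HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIP = {"script", "style", "nav", "aside", "footer", "form"}  # assumed noise


class MarkdownReducer(HTMLParser):
    """Collect text blocks, keeping only headers, emphasis and links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = [[]]     # each block becomes one Markdown paragraph
        self.skip_depth = 0    # > 0 while inside a SKIP element
        self.href = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in HEADERS:
            self.blocks.append(["#" * int(tag[1]) + " "])
        elif tag == "p":
            self.blocks.append([])
        elif tag in ("em", "i"):
            self.blocks[-1].append("*")
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.blocks[-1].append("[")

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag in HEADERS or tag == "p":
            self.blocks.append([])
        elif tag in ("em", "i"):
            self.blocks[-1].append("*")
        elif tag == "a":
            self.blocks[-1].append("](" + self.href + ")")

    def handle_data(self, data):
        if not self.skip_depth:
            self.blocks[-1].append(data)


def html_to_markdown(html):
    """Reduce an HTML string to simple Markdown paragraphs."""
    parser = MarkdownReducer()
    parser.feed(html)
    parser.close()
    paragraphs = (" ".join("".join(block).split()) for block in parser.blocks)
    return "\n\n".join(p for p in paragraphs if p)
```

The result is plain Markdown, so it could be fed to pandoc for any of the other output formats.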

The sad truth is that manual rewrite or re-tagging of pages in Markdown is often the best option I've got for getting to something remotely reasonable. The good news is that that's actually a good tool for reading an article, even if you find that on reading, it's not worth keeping :)

i agree with all that. lot of good wisdom there.

as for html-to-markdown conversion, http://markdownrules.com is good.

Neat idea, we'll surely consider it. Thanks!

The autotagging feature would be so useful. I've been looking for something that does it for a long while.

At Wire, we've actually been working on the "autotagging" (& some more) for the past 2 years. We're focused on mobile and don't have a desktop version yet. Our current version tags a page based on the keywords in the metadata and title; we're still fine-tuning it and will include the content in the future. The way Wire works is: instead of conventional bookmarks, users just save what they find and use the same search engine they use every day to find it again. What the user saves is stored for offline viewing. In addition, we've also included an offline p2p feature that allows the user and a nearby offline friend to share what they've saved.

Check it out on: https://goo.gl/xMgxfJ

Pinboard.in doesn't "autotag" but it suggests tags and you can click them to have them added to the form you're saving. It's quick and easy.

Glad to hear!

I like the HN community, it gathers a lot of enthusiastic people!

How would that deal with pictures?

I'm not sure at which level you're asking this. For storing the images: Base64-encoded in the HTML files. For making them useful: I don't think it's the most important feature, but if need be it is possible to think about using OCR for extracting text, or Deep Learning for describing them [1].

[1] https://research.googleblog.com/2014/11/a-picture-is-worth-t...
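For the storage part, a minimal sketch of inlining images as Base64 data URIs. The `load` callback that turns a `src` value into bytes is hypothetical (it would wrap whatever downloader the tool uses), and matching `<img>` tags with a regex is a shortcut that only handles double-quoted `src` attributes.

```python
import base64
import mimetypes
import re


def to_data_uri(data, filename):
    """Encode raw bytes as a data: URI, guessing the MIME type from the name."""
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    return "data:%s;base64,%s" % (mime, base64.b64encode(data).decode("ascii"))


def inline_images(html, load):
    """Replace each <img src="..."> with a data URI; load(src) returns bytes."""
    def repl(match):
        src = match.group(2)
        if src.startswith("data:"):
            return match.group(0)  # already inlined, leave untouched
        return match.group(1) + to_data_uri(load(src), src) + match.group(3)

    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', repl, html)
```

Base64 turns every 3 bytes into 4 characters, so inlined images take roughly a third more space than the original files.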

The average web page is around 2MB, 60% of which is images [1]. Base64 encoding makes the size even larger. Would that trade-off of higher disk usage make sense for you? I like both the OCR and DL ideas. Shameless plug, I've been working on a bookmarking service (https://github.com/crestify/crestify), and local archiving with images is something we're looking to add. Since it is something a lot of people seem to want, contributors are welcome :).

[1] https://www.soasta.com/blog/page-bloat-average-web-page-2-mb...

The system I propose would certainly be space-consuming. Removing the ads and other unneeded content would probably help to shrink the page size though. This is the price to pay to keep the bookmarked content available in the long term, and I'm ready to pay this "price".

Assuming 1 bookmark per day for 10 years, this will add up to around 8 GB. Storage is cheap enough nowadays, so I can afford these 8 GB over 10 years. This will probably not be mobile-friendly, but I almost never consult bookmarks from my phone anyway.

Your project seems interesting, thanks for sharing! This may be especially useful to see how you implemented the readability part.

According to this post[1], Pinboard had 26M new bookmarks in 1 year (15-16), between 24.5K users, which is close to ~1100 bookmarks per user per year, 3 per day. ~22GB for 10 years. Compression and other optimizations could reduce that further.

Crestify is a web application, so you could run it on a server and access it over a mobile app. Really happy to hear you like it. And I'm always open to new ideas about it :).

[1] https://blog.pinboard.in/2016/07/pinboard_turns_seven/
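The back-of-the-envelope numbers in this subthread can be reproduced with a few lines. This is only arithmetic on the figures quoted above (2 MB per page, 60% images, 26M bookmarks across 24.5K users); the 4/3 Base64 overhead is applied to the image share only, and the function name is made up.

```python
def storage_gb(bookmarks_per_day, years, page_mb=2.0,
               base64_images=False, image_share=0.6):
    """Rough archive size in GB (1 GB = 1000 MB), before any compression."""
    pages = bookmarks_per_day * 365 * years
    size_mb = page_mb
    if base64_images:
        # Base64 inflates the image part of each page by a factor of 4/3.
        size_mb = page_mb * (1 - image_share) + page_mb * image_share * 4 / 3
    return pages * size_mb / 1000


# Pinboard figures quoted above: 26M new bookmarks in a year, 24.5K users.
per_user_per_year = 26_000_000 / 24_500      # about 1061
per_user_per_day = per_user_per_year / 365   # about 2.9

one_a_day = storage_gb(1, 10)                             # about 7.3 GB
one_a_day_base64 = storage_gb(1, 10, base64_images=True)  # about 8.8 GB
three_a_day = storage_gb(3, 10)                           # about 21.9 GB
```

So both estimates hold up: roughly 8 GB at one bookmark a day, roughly 22 GB at three a day, before ad stripping or compression.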

I meant bookmarks to pictures. Like infographics or memes

Oh, I don't usually bookmark that kind of content. I guess in any case a system that checks the type of content (html/image/sound/video) will be needed to select the storage format, so the annotation may then also be adapted to the content type. In any case a way to manually add tags would exist, so it's feasible to add such content.
