Hacker News
Tagsistant: a reasoning semantic filesystem for Linux and BSD (tagsistant.net)
64 points by gnosis on May 22, 2011 | 39 comments

I wrote something like this using FUSE and Postgres a couple years ago. The major problem with this approach is scalability: on a large, multi-user FS, you can get TONS of tags to sift through in your root folder.

My solution, which I never got around to implementing, was to have "hierarchical" tags. e.g. "recipes:cookies:chocolatechip" is a single tag for all chocolate chip cookie recipes, all of which have the implied tags "recipes" and "recipes:cookies" (but not "cookies" or "chocolatechip"). The advantage is that your root directory will not be littered with the "cookies" and "chocolatechip" tags when these collections are fairly irrelevant.

This idea can be extended to solve multi-user conflicts (all tags are prefixed with the username) and to make sensible tags for dates -- it makes little sense to tag a file with "2011", "May", and "22", since no-one cares to find all files from the 22nd of every month; but the hierarchical tag "2011:May:22" is perfect for this situation since these files will also be present under "2011" and "2011:May".
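A minimal sketch of the implied-tag rule described above, assuming ":" as the separator. The function names `implied_tags` and `root_listing` are made up for illustration and are not part of any existing tool.

```python
def implied_tags(tag: str, sep: str = ":") -> list[str]:
    """Return the tag plus every ancestor prefix it implies, shortest first."""
    parts = tag.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


def root_listing(tags: list[str], sep: str = ":") -> list[str]:
    """Entries shown in the root folder: only the first component of each tag."""
    return sorted({t.split(sep)[0] for t in tags})


if __name__ == "__main__":
    print(implied_tags("recipes:cookies:chocolatechip"))
    print(root_listing(["recipes:cookies:chocolatechip", "2011:May:22"]))
```

Note that "cookies" and "chocolatechip" never appear on their own, which is what keeps the root directory small.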

Did you use an ontology or some kind of "manifest" to bootstrap hierarchical relations between tags?

I'm thinking about splitting Tagsistant into a client and a server to provide a multiuser environment, and probably some ontological foundations are required to coherently organize tagging coming from different users.

I never got around to implementing the hierarchy. My idea was to have a few basic entries, say, "users", "dates", "MIME-types", etc. which could be prepopulated.

I'd like to see something similar:

Lately I have thought about a file storage system like this, which consists of two parts: 1) some kind of database in which you can put binary files and attach tags to them, and 2) a FUSE (?) driver which lets you create different 'views' on this database, which can then be mounted as part of a normal filesystem.

For example, this would be nice for music and PDF collections. You could create different 'views' and then go to one folder to see your music/PDFs sorted by year, then to another folder to see them sorted by author/artist, and then to a third one which is sorted by type_of_music/artist/album.

This way you would get the best of both worlds: 1) a powerful database to store and organize binary data and 2) downward compatibility, since you can just use the command line / bash to 'export' files to mp3 players and so on.

Woah, this pretty much describes my "perfect" file organization tool which I've been thinking about in the last months. I was thinking of combining it with revision support (every version of your file stored) together with Dropbox-like synchronization.

I think every version would be a bit much, especially on smaller drives. What if it had logarithmic diffs, i.e. every change today, every hour for the past week, before that every day for the past month, etc.? It would probably be more disk efficient for people with smallish SSDs, but everybody's "perfect" is different.
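One way to picture this thinning scheme is a list of (maximum age, bucket size) tiers. The sketch below uses the three tiers named in the comment; dropping anything older than the last tier is an assumption of this example, since the comment leaves that open. `TIERS` and `versions_to_keep` are hypothetical names.

```python
from datetime import datetime, timedelta

# (maximum age, bucket size); a bucket of None keeps every version in the tier.
TIERS = [
    (timedelta(days=1), None),                  # every change today
    (timedelta(days=7), timedelta(hours=1)),    # one per hour for the past week
    (timedelta(days=30), timedelta(days=1)),    # one per day for the past month
]


def versions_to_keep(timestamps, now, tiers=TIERS):
    """Return the sorted subset of version timestamps that the tiers retain."""
    keep = set()
    seen_buckets = set()
    for ts in sorted(timestamps, reverse=True):  # newest first
        age = now - ts
        for tier_index, (max_age, bucket) in enumerate(tiers):
            if age <= max_age:
                if bucket is None:
                    keep.add(ts)
                else:
                    # Keep only the newest version that falls in each bucket.
                    key = (tier_index, age // bucket)
                    if key not in seen_buckets:
                        seen_buckets.add(key)
                        keep.add(ts)
                break
        # Versions older than the last tier match nothing and are dropped
        # (an assumption of this sketch).
    return sorted(keep)
```

Within each hourly or daily bucket the newest version wins; a real tool might prefer the oldest, or the one closest to the bucket boundary.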

This is a great idea. Early use would have to be limited, because the rest of the world assumes directories. It would be great if the world ran on tags though.

Bookmark tags is why I'm still using Firefox instead of Chrome. I recently tried Chrome for a week, I really wanted to like it, but bookmark tags is what brought me back to Firefox. I look at my browser as an information manager as much as a reader, and multiple tags per bookmark is my killer feature.

Gmail's IMAP folders as tags, multiple tags per message, and their exposure of IMAP to external clients like Thunderbird are why I was finally able to convince myself to use Gmail instead of my hosting provider's email. I confess that I don't use multiple tags as much as I thought I would. But I could!

"This is a great idea. Early use would have to be limited, because the rest of the world assumes directories."

Tagsistant still uses directories. It's just that the directory names are automatically also usable as tags.

I'd consider doing something like dropping the AND by default.

  i.e. /photos/london/2011
while still having the OR and NOT and any other operations you need. The idea is that, at least due to training from hierarchical filesystems, the AND would seem rather implied. As for duplicate filenames, I'd add a prefix to them based on the order they were added to the DB (e.g. the first one gets no prefix, the second one gets a 2_ or whatever you figure out). That way you won't have as many cases where a file's name changes the moment you add in a new one.

Tagsistant 0.4 doesn't use AND any longer, but that brought the need for a termination operator, which is '=' so far, but probably will be changed for shell comfort.

The 0.2 query mpoint/rock/AND/seattle/ becomes mpoint/tags/rock/seattle/=/ in 0.4
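To make the rewrite concrete, here is a small sketch that converts an AND-only 0.2 query into the 0.4 form. It is based solely on the single example above: the handling of other operators is unknown and left out, and the '=' terminator may change as noted. `convert_query` is a hypothetical helper, not part of Tagsistant.

```python
def convert_query(path_02: str) -> str:
    """Rewrite an AND-only 0.2 query (relative to the mount point) in 0.4 style."""
    # Drop empty components and the explicit AND operator; what remains are tags.
    tags = [part for part in path_02.strip("/").split("/") if part and part != "AND"]
    # 0.4 queries live under tags/ and end with the '=' terminator.
    return "tags/" + "/".join(tags) + "/=/"


if __name__ == "__main__":
    print("mpoint/" + convert_query("rock/AND/seattle/"))
```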

And about the prefix, Tagsistant 0.4 indeed uses a NNN_ prefix to filenames to allow for duplicated names to coexist.

One drawback I can think of is that you lose the "permanent paths" to files in some cases. If you start twiddling with relationships between tags, I can imagine how many paths previously stored in software will get broken.

I can also imagine 'identity problems'. Not counting symlinks and hardlinks, the full file path serves as its URI. How can I be sure if /photos/europe/DSCN0001.JPG and /photos/london/DSCN0001.JPG are the same files? What's the file URI here?

"If you start twiddling with relationships between tags, I can imagine how many paths previously stored in software will get broken."

Twiddling with tags is no worse than twiddling with directories.

Few people consider it a fault of the design of ordinary filesystems that if you mess with the underlying filesystem layout, software that relied on that layout might break.

The same is really the case for any kind of dependency on a certain kind of organization of your data.

The fault for breaking software that's tightly coupled to a certain underlying organization or layout lies with the software itself (for not tolerating changes) and with the user for making the changes in the first place.

"I can also imagine 'identity problems'. Not counting symlinks and hardlinks, the full file path serves as its URI. How can I be sure if /photos/europe/DSCN0001.JPG and /photos/london/DSCN0001.JPG are the same files? What's the file URI here?"

But symlinks and hardlinks are the critical bit of filesystem functionality that makes ordinary filesystems subject to the very question. So why would you not consider them?

There are various solutions to this problem on ordinary filesystems: first, your tools (like "ls") could show you that a file or directory is symlinked (though you might have to traverse through the parent directories to find out whether there is a symlink). Second, you could also use stat to check the inode of the files in question to see if they're the same.
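The stat check is easy to script. The sketch below compares device and inode numbers, which is the same comparison the standard library's `os.path.samefile` performs. Whether a tag-based FUSE filesystem reports stable inode numbers for the same file under different tag paths is an assumption here.

```python
import os


def same_file(path_a: str, path_b: str) -> bool:
    """True if both paths resolve to the same inode on the same device."""
    a = os.stat(path_a)  # os.stat follows symlinks
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

For example, `same_file("/photos/europe/DSCN0001.JPG", "/photos/london/DSCN0001.JPG")` would answer the identity question directly, provided the filesystem exposes one inode per stored file.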

It should not be difficult to add similar functionality to a tag-based filesystem.

Another related ramification of doing away with hierarchy and unique identifiers (tree paths) is not being able to have files with the same filename.

Say you have /photos/london/DSCN0001.JPG and /photos/berlin/DSCN0001.JPG, and the relationships: europe contains london, europe contains berlin.

Now what does the 'path' /photos/europe/DSCN0001.JPG resolve to?

Tagsistant 0.2 does not allow storing two files with the same name, exactly as you say. But Tagsistant 0.4 will! Well, with the small compromise of having a small unique number prepended to each filename.

Tagsistant 0.4 has a broader vision (tagging of entire directories) but is still under development. If you have suggestions or doubts, I'll be very happy to discuss it.

Can you provide meaningful prefixes for conflicting files? When you detect a file name conflict, construct a distinguishing prefix for each conflicting file from the difference in tags on the conflicting files. (If all the tags are the same, then fall back to a synthetic prefix or overwrite the file or error out or whatever.)

For example, let's say you have /photos/london/DSCN0001.JPG and /photos/vienna/DSCN0001.JPG, where "london" and "vienna" are both included in "europe". This could yield paths like /photos/europe/london:DSCN0001.JPG and /photos/europe/vienna:DSCN0001.JPG.
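A rough sketch of that prefixing rule, assuming each file is a (filename, set of tags) pair. The "+" joiner for multiple distinguishing tags and the numeric fallback are choices made for this example only, and `display_names` is a hypothetical name.

```python
from collections import defaultdict


def display_names(files):
    """files: list of (filename, set_of_tags). Returns display names in order."""
    by_name = defaultdict(list)
    for index, (name, _tags) in enumerate(files):
        by_name[name].append(index)

    result = [name for name, _tags in files]
    for name, indexes in by_name.items():
        if len(indexes) < 2:
            continue  # no conflict, keep the plain name
        # Tags common to every conflicting file cannot distinguish them.
        shared = set.intersection(*(set(files[i][1]) for i in indexes))
        for position, i in enumerate(indexes, start=1):
            unique = sorted(set(files[i][1]) - shared)
            # Fall back to a synthetic counter when the tags do not differ.
            prefix = "+".join(unique) if unique else str(position)
            result[i] = f"{prefix}:{name}"
    return result
```

With three or more conflicting files, two of them can still end up with the same tag difference, so a real implementation would need a further tie-breaker.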

The big trouble here (and, if I understand, with what you're suggesting as well) is that changing the name or tags of one file can alter the path to another as a side effect. So if I started with just /photos/vienna/DSCN0001.JPG, I might reference it as /photos/europe/DSCN0001.JPG somewhere. But when I go back and add /photos/london/DSCN0001.JPG, my reference to the photo of Vienna breaks because its name is no longer unique. As TeMPOral points out, this is a general class of problems afflicting a system like this.

It does not work exactly this way.

When you create a file "DSCN0001.JPG", it receives a prefix, even if it's not conflicting, becoming, let's say, "123_DSCN0001.JPG".

But both you and your software (say: a filemanager) are presuming the file is named "DSCN0001.JPG", not "123_DSCN0001.JPG". To solve that, Tagsistant 0.4 provides an aliasing layer that maps the original name to the prefixed one.

It's still something under development, so both the idea and the implementation can change. For example: how long should an alias exist? Just until the first access? Up to an estimated expiration time?

I'm leaning toward the latter solution. Since aliases are implemented as an SQL table, adding an expiration column and a garbage-collecting thread should be all that is needed.
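As an illustration of the expiring-alias idea, here is a sketch using SQLite from Python. The table layout, column names, and one-hour lifetime are all assumptions made for the example and do not reflect Tagsistant's actual schema. A garbage-collecting thread would simply call `collect_garbage` periodically.

```python
import sqlite3
import time

ALIAS_TTL_SECONDS = 3600  # hypothetical lifetime; the real value is undecided


def open_db():
    """Create an in-memory database with a hypothetical aliases table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE aliases ("
        " alias TEXT PRIMARY KEY,"    # name the user or file manager asked for
        " real_name TEXT NOT NULL,"   # prefixed name actually stored
        " expires_at REAL NOT NULL)"  # Unix time after which the alias is stale
    )
    return db


def add_alias(db, alias, real_name, now=None):
    now = time.time() if now is None else now
    db.execute(
        "INSERT OR REPLACE INTO aliases VALUES (?, ?, ?)",
        (alias, real_name, now + ALIAS_TTL_SECONDS),
    )


def resolve(db, alias, now=None):
    """Return the prefixed name for a live alias, or None if absent or expired."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT real_name FROM aliases WHERE alias = ? AND expires_at > ?",
        (alias, now),
    ).fetchone()
    return row[0] if row else None


def collect_garbage(db, now=None):
    """Delete expired aliases and return how many rows were removed."""
    now = time.time() if now is None else now
    cursor = db.execute("DELETE FROM aliases WHERE expires_at <= ?", (now,))
    return cursor.rowcount
```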

Of course, using expiring aliases just postpones the problem. But, in my opinion, Tagsistant is primarily a personal tool, nothing that automated procedures or batches are supposed to rely on. I hope that, from this perspective, the alias workaround is an acceptable compromise.

I always thought of Tagsistant as an archiving tool, but you are totally right.

But files are also accessible from the archive/ directory where nothing is supposed to change as a consequence of tagging.

Could that be a reasonable compromise?

Why couldn't this co-exist with a hierarchical filesystem, maybe via a special subdirectory as procfs does? (With homonym files getting some distinguishing prefix, possibly based on their hierarchical paths.) It seems somewhat like a more sophisticated version of Spotlight in that it can handle logical relationships between tags.

I had exactly this idea last year but I never took the time to actually implement it. Nice to see someone did!

I was excited about this at first, but I have come to think a filesystem that does away with the tree structure would be rather difficult to deal with or get used to. I will probably end up with loads of files under similar tags and it would make files rather difficult to locate due to the sheer number of them. Maybe the idea just needs a bit of refining?

This is actually something I've been wanting for a long, long time. Obviously I'm not planning on using this for / or anything system-related, but for my media archives &c it's perfect. Taking movies for example, I've found most people I know think an alphabetical approach is the way to go. This is useless, however, if I don't know what I want to watch. With tags I can easily tag in genres, directors, actors, etc. and have my filesystem help me choose a film.

There is one major thing that I think would be an awesome addition to a system like this: automatic tagging from the metadata. It'd be awesome if my movies automatically got tagged by, say, resolution and length, so I wouldn't have to bother with such things.

Same here, I want something similar to manage my photo collection. The current implementation actually can have plugins that act on specific file types, so automatic tags are just a few lines of code away.

But then what advantage would this system have over a dedicated media manager or the several semantic desktop projects? The only advantage I see is allowing traditional tools (unix utilities, conventional file managers, even 'open file' popups) to take advantage of the system without any extra effort on their side. Though wouldn't dedicated tools, say a full blown semantic desktop, allow for much better integration and usability?

That really depends on the use case, I guess. For me, the use case is that I've got a media server in my living room and, in addition to being hooked up to my TV, it also has a Samba share set up so that anyone in my apartment can watch/listen/look at/read/view anything anywhere in the apartment. In this case, it certainly would be possible to have every computer set up with a semantic desktop setup, but I'm still trying to keep my Eee minimal and visitors wouldn't necessarily be set up with the semantic desktop tools.

Maybe this implies ground for a semantic server, but I'm inclined to think that Samba (or your favourite mostly-transparent sharing protocol) on top of a tag-based system is actually a good way to accomplish this particular use case.

> The only advantage I see is allowing traditional tools (unix utilities, conventional file managers, even 'open file' popups) to take advantage of the system without any extra effort on their side.

That's a pretty big advantage.

Tagsistant already supports automatic tagging from the metadata such as you have described. The linked article says it supports plugins that can “extract tags from file contents” in the second-to-last paragraph. So if you want to tag your movies the way you said, you just have to write a plugin that uses some library to extract that data, or wait for someone else to release a plugin like that.

Yeah, I didn't notice that on first read, but that definitely helps seal the deal for me. I'll probably set it up on my server tonight and at least look into hacking together that plugin.

I think just a bit of refining may be required. The filesystem would probably come with a few tags already, so there's less chance of confusing the basic ones. With an easy-to-use tag manager that keeps track of all the extant tags, and the ability to do batch operations across all tags, I think it would work fine. The tricky part will be integrating with systems used to hierarchical directories.

Instead of having operators like AND be path components, I would have the path separator itself be an AND operator, or use & and | to just make logical expressions. As long as you're throwing out the hierarchy, you may as well throw out the hierarchical syntax.

Having operators be part of the path itself actually struck me as genius. This way tagfs is 'backwards compatible' with current file managers and you can 'browse' the filesystem quite naturally. Having the operators as path separators would require changes to the filemanagers and the host of tools we already use to work with our filesystems.

Yeah, it'll be hard to integrate into existing systems, one way or another. Thanks for your compliment.

Oh, never mind, I see what you mean.

Based on what was mentioned in the article, the 'refining' you mention might well be a plugin. Tagging anything can be a significant task; tagging many 'anythings' creates problems of scale. Another thing to consider: if this is in place at the beginning of the information cycle, I suspect that it would be easier than if this is used to convert from older approaches. Still, for all of that, interesting idea/work.

Yay...semantic technologies. I hope they make a way of utilizing existing technologies from the semantic web. Like OWL :) https://secure.wikimedia.org/wikipedia/en/wiki/Web_Ontology_...

If you read his notes, he finds that OWL as it currently exists is too cumbersome for his needs.


I think OWL would fit the task quite well. Yes, right now it may seem like overkill and will probably be somewhat slow. RDFS is an alternative though. Or just using a subset of OWL, like OWL Lite.

Tagsistant is not supposed to replace POSIX filesystems and can't be used in system directories like /bin or /etc.

Tagsistant is a personal tool to organize files (and directories, starting with version 0.4).

It provides a plugin architecture to allow autotagging.

Well done with the "Europe" tag includes anything with the "London" tag idea.

This idea is called "hierarchical tagging". Searching for that phrase on Google should get you some links to other people who've discussed and implemented it.

I think that was a joke. Over in London, 'Europe' is across the channel, on the mainland.

Reminds me of WinFS (http://en.wikipedia.org/wiki/WinFS)
