
Tagsistant: semantic filesystem for Linux - CarolineW
http://www.tagsistant.net/
======
nayuki
Hierarchical file systems lack expressiveness and are awkward in places. In my
day-to-day computing this has become more apparent and problematic with each
passing year. For example, I dislike that you are forced to give unique names
for each file, that you can classify each file in only one way, and that you
can't tell how many copies you own of a piece of data.

The smallest step up from a hierarchical file system is to allow any file to
have any number of tags, where each tag is just a simple string (no hidden IDs
or anything). I believe most tag proposals implement this concept and not much
further. I briefly looked at Tagsistant and at numerous other tools and papers.

I sketched some of my own ideas about identifying immutable files by hash and
creating arbitrary tags that reference such files. It turns out that this way
of organizing files goes really deep, and I haven't explored all the
implications yet. It yields a completely different landscape than the file
system that we are used to today - the concepts of path, mutability,
attributes, etc. are replaced with different mechanisms.

The article is long, but I would appreciate hearing if the concepts resonate
with anybody else: [https://www.nayuki.io/page/designing-better-file-
organizatio...](https://www.nayuki.io/page/designing-better-file-organization-
around-tags-not-hierarchies#contents)
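
For readers who want something concrete, here is a minimal sketch of the "immutable files identified by hash, tags pointing at hashes" idea. It is only an illustration under my own assumptions: the `TagStore` class, its method names, and the choice of SHA-256 are hypothetical and not taken from the linked article or from Tagsistant.

```python
import hashlib
from collections import defaultdict


class TagStore:
    """Toy in-memory store: immutable blobs keyed by hash, tags point at hashes."""

    def __init__(self):
        self.blobs = {}                # hex digest -> bytes
        self.tags = defaultdict(set)   # tag string -> set of hex digests

    def add(self, data: bytes, *tags: str) -> str:
        """Store data under its SHA-256 digest and attach the given tags."""
        digest = hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data      # identical content is stored only once
        for tag in tags:
            self.tags[tag].add(digest)
        return digest

    def find(self, *tags: str) -> set:
        """Return the digests that carry every one of the given tags."""
        if not tags:
            return set(self.blobs)
        sets = [self.tags.get(t, set()) for t in tags]
        return set.intersection(*sets)


store = TagStore()
a = store.add(b"holiday photo bytes", "photo", "2017")
b = store.add(b"holiday photo bytes", "beach")  # same content, same identity
print(a == b, len(store.blobs), store.find("photo", "beach") == {a})
```

Because the identity is the hash, there is no unique-name requirement, a file can carry any number of tags, and duplicate copies collapse into one entry.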

~~~
Pxtl
I remember for a while Google Drive worked on a tags metaphor and it was just
too much friction for users. The UI for the traditional file tree is just
plain cleaner. Trees provide clear delineations of ownership and
categorization. They suffer from the limits of hierarchies, but you can break
them down and get a nice tree-view of them, for example.

Having spent time categorizing my photos, adding a tag layer on top of the
existing filesystem is a great idea, but using it as a _replacement_ for the
traditional directory structure isn't.

Simple operations like "delete everything in here" become complicated under a
tag structure now that we no longer have a strict concept of A is within B.

~~~
nerdponx
As long as the tags themselves can be hierarchical, can't you recover all the
benefits of a tree-like file structure?

~~~
adamrezich
This. I believe a hierarchy of tags is a great solution.

~~~
Pxtl
But then don't you have the same problem of the limits of taxonomies and you
start thinking "well is this tag a member of A or B? It should be a member of
both" and then you need to tag the tags. Turtles all the way down.

~~~
nerdponx
That's the advantage of tags vs true directories. A tag could have multiple
parent tags.

Of course, that raises the question of what happens when you encounter a cycle
in the tag graph.
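
One possible answer, sketched below, is to refuse any edge that would close a cycle, so the tag graph stays a DAG. This is only an assumption about how such a system could behave; `TagGraph` and its methods are hypothetical names, not part of Tagsistant or any other tool mentioned here.

```python
class TagGraph:
    """Tags with multiple parents, kept acyclic by checking on insert."""

    def __init__(self):
        self.parents = {}  # tag -> set of direct parent tags

    def ancestors(self, tag):
        """Return every tag reachable by following parent links upward."""
        seen = set()
        stack = list(self.parents.get(tag, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen

    def add_parent(self, child, parent):
        """Link child under parent, rejecting links that would form a cycle."""
        if child == parent or child in self.ancestors(parent):
            raise ValueError(f"{child} -> {parent} would create a cycle")
        self.parents.setdefault(child, set()).add(parent)


graph = TagGraph()
graph.add_parent("jazz", "music")
graph.add_parent("jazz", "hobbies")   # a second parent is fine
graph.add_parent("bebop", "jazz")
print(sorted(graph.ancestors("bebop")))
```

Trying `graph.add_parent("music", "bebop")` afterwards raises `ValueError`, because `music` is already an ancestor of `bebop`.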

------
heinrichhartman
Great project!

I feel that the problem of archiving files is not well served by the POSIX
file system, and deserves attention. Gaps are in data safety and backup
capabilities (dropbox is a huge leap forward here) and document retrieval.
Usually external indices (which go out of sync) are used to query file names.
Also there is no way to attach semantic metadata to files (apart from date
stamps, and permissions).

This tool provides an interesting stab at the latter problem. I have always
thought that semantics would be layered on top of a POSIX file system. This
project flips the logic and implements tagging within the fs.

I wonder how compatible this is with NFS/Dropbox/git. Can I use this to tag
files on a Mac via Dropbox Sync? _digging..._

I have been using some home grown tools that allow me to put semantics into
filenames for a while now. And it has served me quite well. Files look like
this:

    
    
      2017-04-04 #S4907 #Choir List of names.pdf
      2017-04-05 #EXCITE #S5005 Notes on Data Repositories.pdf
      2017-04-06 #ARAG #S5031 $Amount=14.2EUR Invoice.pdf
    

And I have command line + CGI scripts that allow me to manage and query
folders which contain properly formatted files. I have just begun writing a second
version in python:

[https://github.com/HeinrichHartmann/pile](https://github.com/HeinrichHartmann/pile)

(waaay too early to use, yet. But the README elaborates a little). Is anyone
aware of similar approaches?
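
To make the naming convention above concrete, here is a rough sketch of how such file names could be parsed. It is not the code from the pile repository; `parse_name` and the field names are hypothetical, and the grammar (leading ISO date, `#tag` tokens, `$key=value` tokens, remaining words as the title) is inferred only from the three example names.

```python
import re
from pathlib import PurePath

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}$")


def parse_name(filename):
    """Split a 'DATE #tag $key=value Title.ext' file name into its parts."""
    path = PurePath(filename)
    tokens = path.stem.split()
    record = {"date": None, "tags": [], "attrs": {}, "title": "", "ext": path.suffix}
    if tokens and DATE_RE.match(tokens[0]):
        record["date"] = tokens.pop(0)
    title_words = []
    for token in tokens:
        if token.startswith("#") and len(token) > 1:
            record["tags"].append(token[1:])
        elif token.startswith("$") and "=" in token:
            key, value = token[1:].split("=", 1)
            record["attrs"][key] = value
        else:
            title_words.append(token)   # anything else belongs to the title
    record["title"] = " ".join(title_words)
    return record


invoice = parse_name("2017-04-06 #ARAG #S5031 $Amount=14.2EUR Invoice.pdf")
print(invoice)
```

Only the final suffix is treated as the extension, so the dot inside `14.2EUR` stays part of the attribute value.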

E.g. using JSON documents as file names seems another obvious way to layer
semantics on top of POSIX/fs. Is anyone doing this?

EDIT: Formatting fixes.

~~~
andai
Thanks for sharing your system!

I'd like to give you some HN formatting tips, if I may:

Add two new lines for a new line. Otherwise the formatter will put your text
on the same line (eg. with your list of files, it looks like one big filename
now)

 _italics_ work by putting asterisks (*) around the word or phrase. Though
maybe you meant to use underscores in this case :)

Best wishes!

~~~
heinrichhartman
Thx. Fixed now.

------
rakoo
Nice! An equivalent for comparison: [https://tmsu.org/](https://tmsu.org/)

~~~
seagreen
This should be higher up, I think TMSU is the best known product in this
space, and a comparison between the two would be useful.

For one thing I think TMSU's achilles heel is handling renaming of files
gracefully, I wonder how Tagsistant does it?

------
okket
On Github:
[https://github.com/StrumentiResistenti/Tagsistant](https://github.com/StrumentiResistenti/Tagsistant)

~~~
striking
And if you're having trouble accessing the page (like I am), Google's cache
can help:
[http://webcache.googleusercontent.com/search?q=cache:http://...](http://webcache.googleusercontent.com/search?q=cache:http://www.tagsistant.net/&num=1&safe=active&strip=1&vwsrc=0)

Usually I prefer to link archive.is, but it didn't manage to capture the page
before it went down this time around.

------
djsumdog
I used this years ago. I had a few scripts that would copy files to the
appropriate tags, but it ended up being super slow. Eventually I wrote my own
tools around the sqlite database it created (it's a pretty simple schema) but
I've since lost most of those and ended up rewriting a bunch in a system I've
been working on.

This was all back for the 0.2 release and I think the author changed a bunch
of this stuff in later releases. I wish there were more options for tagging
file systems in Linux. If I ever get my stuff out of the "thrown together"
phase, I'll probably publish them.

~~~
fao_
See also:
[https://news.ycombinator.com/item?id=14538060](https://news.ycombinator.com/item?id=14538060)

------
dredmorbius
I've been thinking increasingly of something along these lines myself, with
Tagsistant being among the systems which have come up in my own research.

The problems, variously, are that fixed-name hierarchical-storage filesystems
meet the needs of document-based storage, projects, workflows, sharing, and
lifecycle exceedingly poorly.

The problem is coming up with a better option.

A filesystem-based approach has the advantage that it's low-level, _not_ tied
in to a single application or toolsuite, and may be extensible.

The questions I've turned up include identifying what specific problems
this is trying to solve, distinguishing between _public_ and _private_
information, and what levels of standardisation might apply. There are also
some very significant questions about privacy and data leakage.

My current thoughts are largely grouped around a documents-based system
(provisionally, "/docfs"), and an online or Web-oriented system
(provisionally, "/webfs"), both under an umbrella context system, KFC (for
KFC's Fine Context). Mostly considering the domain space, workflows, and
possible solution-shaped objects. Part of that (largely focused on Web access)
discussed here:

[https://www.reddit.com/r/dredmorbius/comments/6bgowu/what_if...](https://www.reddit.com/r/dredmorbius/comments/6bgowu/what_if_the_web_was_filesystemaccessible/)

------
Animats
Humans have to tag the files, though. This is the same problem which kept the
"semantic web" from going anywhere.

~~~
chongli
_Humans have to tag the files, though._

Only once, if done properly. Look at music tagging. A good system ought to
have canonical tags for everything.

User-created files could also have a lot of auto-generated tags too. I'm
thinking along the lines of email address/URL origin, Exif metadata, source
code tags (ctags/etags), keyword extraction from prose text (via machine
learning models)...

Beyond all that, though, would be your standard date and timestamps, your
document name (which could be non-unique), and a project-based tagging scheme.

Take a look at the Library of Congress's MARC project [0][1]. It's the most
ambitious tagging project I'm aware of.

[0] [https://www.loc.gov/marc/](https://www.loc.gov/marc/)

[1]
[https://en.wikipedia.org/wiki/MARC_standards](https://en.wikipedia.org/wiki/MARC_standards)

~~~
nayuki
Could you clarify what you mean by "canonical tags for everything"? Do you
mean an online database like freedb, MusicBrainz, and such?

~~~
chongli
_Do you mean an online database like freedb, MusicBrainz, and such?_

Yes, as well as the Library of Congress and doi.org.

------
someSven
I often wanted a tag system, but this is not what I was picturing, I think.
I'd rather have a database and an entry in the context menu to tag
files without moving them. It should also be able to identify the file if it is
being moved or renamed.

~~~
kwhitefoot
It must be possible to get the same effect without copying the files by using
links. If it is all on the same file system then hardlinks would work and be
very space efficient.
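
A minimal sketch of the hardlink approach, assuming a layout of one directory per tag under a common root. The layout and the helper name `tag_with_hardlink` are my own assumptions, and hard links only work when source and tag directory live on the same file system.

```python
import os
import tempfile


def tag_with_hardlink(src, tag_root, tag):
    """Expose src under tag_root/<tag>/ via a hard link instead of a copy."""
    tag_dir = os.path.join(tag_root, tag)
    os.makedirs(tag_dir, exist_ok=True)
    dest = os.path.join(tag_dir, os.path.basename(src))
    if not os.path.exists(dest):
        os.link(src, dest)  # both names now refer to the same inode
    return dest


# Small demonstration inside a throwaway directory.
root = tempfile.mkdtemp()
src = os.path.join(root, "photo.jpg")
with open(src, "wb") as handle:
    handle.write(b"fake image data")
dest = tag_with_hardlink(src, os.path.join(root, "tags"), "family")
print(os.path.samefile(src, dest))
```

One caveat with this scheme: programs that save by writing a new file and renaming it over the old one break the link, so the tagged name keeps pointing at the old content.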

~~~
tyingq
Extended attributes can do this for most unixy operating systems. See fsetxattr
and fgetxattr. It's a bit tricky in that the standard tools, like find, don't
support them directly. But, the tags stay with the files, no separate DB
needed. You do have to pay attention to copying, backups, etc, to make sure
they are preserved, and no separate DB means all "queries" are full scans.
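
As a rough illustration, Python exposes the Linux xattr calls as `os.setxattr` and `os.getxattr` (Linux only). The sketch below stores tags as a comma-separated list in one attribute. The attribute name `user.tags` and the helper names are assumptions of mine, not a standard, and tag names containing commas would need a different encoding.

```python
import os

ATTR = "user.tags"  # assumed name; Linux requires the "user." namespace prefix


def encode_tags(tags):
    """Serialize a collection of tags as sorted, comma-separated UTF-8."""
    return ",".join(sorted(set(tags))).encode("utf-8")


def decode_tags(raw):
    """Inverse of encode_tags; an empty value means no tags."""
    text = raw.decode("utf-8")
    return set(text.split(",")) if text else set()


def get_tags(path):
    """Read the tags stored on path, or an empty set if none are set."""
    try:
        return decode_tags(os.getxattr(path, ATTR))
    except OSError:  # attribute missing or xattrs unsupported on this fs
        return set()


def add_tags(path, *tags):
    """Merge new tags into the attribute stored on path."""
    merged = get_tags(path) | set(tags)
    os.setxattr(path, ATTR, encode_tags(merged))
    return merged
```

Calling `add_tags("report.pdf", "taxes", "2017")` would then keep the tags on the file itself, provided the file system supports user extended attributes.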

------
prewett
I've never been able to figure out what people mean by "tags" and what the
benefits are, perhaps someone can enlighten me?

I feel like the only problem with standard hierarchical file systems is that
sometimes you want files to show up in multiple places. I think this is only a
subset of files, typically "media." A photograph or a song often has multiple
categorizations. However, a lot of things, like my tax documents, source code,
Word documents, and notes typically only have one place they need to go to. It
seems like symlinks or hard links are an already-existing solution; other than
lack of great options on managing them, how do tags improve this?

As far as I can tell, tagsistant does have hierarchical tags, so it isn't
doing away with hierarchy, which is good. One problem with some tag systems
(like Gmail's original) is that under "notes" I have "lectures," "stuff I want
to remember," "things like list of books I want to read," "temporary." If I'm
looking for photos, I really don't want to see a bunch of notes tags.

~~~
cJ0th
I think tags only prove beneficial when you have a very well defined purpose.
More concrete, I think the number of tags has to be fairly limited and the
logic behind the tagging scheme should stay consistent forever.

One example of tags being superior to links is that you can search files with
logical queries. For instance, a family member asks you to send them a good
portrait of you for whatever reason. You may want to make sure that it is
somewhat recent. However, you're willing to take one that is two years old if
it looks better. Then you could search for: #portrait & #me & (#2017 | #2016 |
#2015)
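
Such a query maps directly onto set operations. Here is a small sketch with a made-up index; the file names and the `query` helper are hypothetical. Since a photo normally carries a single year tag, the years are combined with "any of" while the other tags are combined with "all of".

```python
index = {
    "beach.jpg": {"portrait", "me", "2017"},
    "passport.jpg": {"portrait", "me", "2015"},
    "old_id.jpg": {"portrait", "me", "2012"},
    "sister.jpg": {"portrait", "family", "2016"},
}


def query(index, all_of=(), any_of=()):
    """Return names whose tags include all of all_of and at least one of any_of."""
    all_of, any_of = set(all_of), set(any_of)
    return sorted(
        name
        for name, tags in index.items()
        if all_of <= tags and (not any_of or tags & any_of)
    )


recent_portraits = query(
    index, all_of={"portrait", "me"}, any_of={"2017", "2016", "2015"}
)
print(recent_portraits)
```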

------
CarolineW
I suspect the site is suffering the HN "Hug of Death".

As linked elsewhere[0], on github:

[https://github.com/StrumentiResistenti/Tagsistant](https://github.com/StrumentiResistenti/Tagsistant)

[0]
[https://news.ycombinator.com/item?id=14537805](https://news.ycombinator.com/item?id=14537805)

------
adamkruszewski
Great work OP! I did something similar for my master's thesis in 2005, based
on one of Hans Reiser's papers. It was more of a quick hack, with the code
written in three days, so not much has survived to this day, but you can see
some of it in action on an adobe flash[1] video at
[https://adam.kruszewski.name/assets/static/mtfs-
demo.htm](https://adam.kruszewski.name/assets/static/mtfs-demo.htm)

[1] don't judge me, mp4 in the browser wasn't main-stream back then ;)

edit: for those without flash like me -- even as a quick hack it had manual
and automatic file tagging (based on file metadata) and you could query it
using logical expressions. It also had a pretty nasty memory leak I didn't care
to find out :) Still, without a first-class, built-in support from file
managers like BeOS had for its filesystem the idea is not fully realized I
think.

------
drudru11
It isn't explicit about this on its main page description. This project
depends on FUSE to work.

------
malkia
My mind read this as tagistant - which is a cool and awful name at the same
time.

------
Chris2048
Now we just need a redesign of Unix tools and principles/practice to
accommodate these :-S

~~~
dredmorbius
That's actually a fairly reasonable conclusion.

You'd want to have tools which are aware of a semantic filesystem, or new
tools which can make use of them. Probably a mix of both.

Wrappers around extant tools might be a reasonable migration path toward the
former.

~~~
Chris2048
There was a concept of "db-fs", whereby rather than files in a hierarchical
folder structure, you just had blobs/collections of data you could search via
the usual search-queries. I suppose similar structure to the average no-
sql/JSON db, but as a file-system.

The problem is, it isn't even compatible with the usual interface for a fs
driver (read/write, permissions etc). There is no "search" concept in the fs
conceptual model. The model assumes, to some extent, a finite-size mount,
duplication/multi-reference is only poorly supported/emulated via hard/soft
links.

plus, existing tools would, as you mention, have no support for what would now
be built into the fs - functionality such as provided by 'find' would be now
built in, such that you would need a shell syntax/dsl to utilize.

My best guess would be a 'special' query command that would result in a
virtual folder popping up on a special virtual fs mount, e.g:

    
    
        > mk_qdir *some-search-query*
        /proc/srch/023013/
    

but if ls'ing the folder kicked off a process in the background that initiated the
query, you would have to be careful no process (a mount counter for 'df', for
example) crawled the partition.

~~~
dredmorbius
Good point on mount counts.

Doesn't that problem already exist to an extent for remote FS mounts (NFS,
etc), especially over automount?

~~~
Chris2048
Yep, if the external mount is large.

However, it's a little different for an fs system that can quickly (w/o
network-speed limits) generate recursive fs structures.

for example, what if I created a vfs that created a 'foo' folder, with a 'foo'
folder inside, and so on. The system would crawl an infinite descent of
/foo/foo/foo/foo... and so on, which would eventually fill some cache or
another.

~~~
dredmorbius
Good thoughts, and I'm thinking of possible hazards and pitfalls of the
approach.

There's already a concept of setting flags to avoid traversal of certain
filesystems, and given the proliferation of virtual and networked filesystems,
this seems useful: /proc, /sys, /dev, /udev, and a few others (it's getting to
the point I don't recognise a full mounts listing readily anymore under
Linux).

With the concepts I'm considering, in particular, of the tree as being
essentially _search traversal_ rather than _a static filesystem_ (not even one
with lots of symlinks all over the place), the potential to create some deep
dives or recursive tangles is pretty high.

Another example: tools such as 'locate' or Spotlight shouldn't attempt to
traverse this tree when generating indices. Instead, they should _query_ it
when requests are made.

Information and metadata leakage is another key consideration when moving data
off local host and/or caching among hosts.

~~~
Chris2048
Yep, in fact I think a lot of vfs are like this - doesn't /proc/ query the
kernel when returning parts of its tree (e.g. representing processes)?

wrt the fs representing queries: one of the tag systems is like this, it
represents tags as folders, and the search for term 'X' and 'Y' can be found
under either folder ./X/Y/ , or ./Y/X/ ; obviously, every new tag
combinatorically expands the virtual space.

~~~
dredmorbius
Right. My vision is of a /docfs under which you might traverse
/au/stephenson/ti/snowcrash _or_ /ti/snowcrash/au/stephenson, as an example.
Either way works.

If a search terminates in multiple (or no) results, it's a directory, if in a
single result, it's that file. Plus a few twists (virtual / dynamically
generated file formats, summaries, synopses, metadata).
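
A toy sketch of that resolution rule, to show that segment order doesn't matter: the path is read as alternating key/value pairs, and the result is a file for exactly one match and a directory otherwise. The `DOCS` catalogue, the `file` field, and `resolve` are invented for illustration and say nothing about how a real /docfs would be implemented.

```python
DOCS = [
    {"au": "stephenson", "ti": "snowcrash", "file": "snowcrash.epub"},
    {"au": "stephenson", "ti": "cryptonomicon", "file": "cryptonomicon.epub"},
    {"au": "gibson", "ti": "neuromancer", "file": "neuromancer.epub"},
]


def resolve(path, docs=DOCS):
    """Treat a path as key/value search terms and classify the result."""
    parts = [p for p in path.split("/") if p]
    if len(parts) % 2:
        raise ValueError("expected alternating key/value segments")
    criteria = dict(zip(parts[0::2], parts[1::2]))
    hits = [
        doc["file"]
        for doc in docs
        if all(doc.get(key) == value for key, value in criteria.items())
    ]
    if len(hits) == 1:
        return ("file", hits[0])   # a single result is the file itself
    return ("dir", sorted(hits))   # several (or zero) results form a directory


print(resolve("/au/stephenson/ti/snowcrash"))
print(resolve("/ti/snowcrash/au/stephenson"))
print(resolve("/au/stephenson"))
```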

~~~
Chris2048
Of possible interest is the new article:

[https://news.ycombinator.com/item?id=14550060](https://news.ycombinator.com/item?id=14550060)

and
[http://www.sqlite.org/src/doc/trunk/src/test_onefile.c](http://www.sqlite.org/src/doc/trunk/src/test_onefile.c)
for an idea of turning SQLite into an _actual_ vfs.

