
Designing better file organization around tags, not hierarchies (2017) - enobrev
https://www.nayuki.io/page/designing-better-file-organization-around-tags-not-hierarchies
======
nayuki
Hello everyone, thank you for all the comments. Seeing this on the HN front
page caught me by surprise. In the past year I shared this article publicly
(Reddit) and privately (with tech-savvy acquaintances) for comment, and the
general sentiment I received was that these ideas were not ready to be read by
a mass audience. The article is way too long and pulls in many disparate
ideas; it explains both why traditional features are problematic and how new
features would work better. In the end, it is unclear what a real
implementation would look like, and what concrete benefits and annoyances
would come out of real-world usage. I was hoping to build an ugly prototype
before asking for feedback.

Regarding the comments on this HN thread, it seems the general discussion is
around tagging. This is indeed the title of the article and the main idea that
motivated my exploration, but I believe the other ideas are just as important.
I explored notions like no-filenames, strong preference for hash addressing
and references, location independence, immutability, backups and
deduplication, preference for external (non-embedded) file metadata,
first-class media libraries, and more.

I think the debate about tagging is quite adequate, and would be happy to hear
comments about the other features/non-features, and whether all the ideas fit
or don't fit cohesively as a system.

~~~
reacweb
Hello, I am trying to make an application for my wife to manage embroideries.
I encounter almost all your issues. My wife has thousands of embroideries
downloaded from the internet. There are many duplicates (filenames not unique
because of internationalisation and special characters). She needs to add tags
to help search. She also needs groups of tags (tiger belongs to animals, ...).
She also has metadata (origin of the file, which license applies to which
file). Sometimes she modifies an embroidery. She needs to ensure that the
original file is not modified and to keep a link between the two files.
Sometimes there are groups of embroideries (for example letters) or there is
documentation attached to an embroidery. It is a mess and I think that your
work would help a lot to handle this kind of use case. The current paradigm
of directory tree is outdated and something smarter could be done.

~~~
magduf
How large are these embroidery files? I feel like this is something that a SQL
database might be able to help with. It's probably not an ideal solution, but
remember that a filesystem is really nothing more than a database, which
organizes files on-disk in a particular way and indexes them so that you can
find them. It obviously has serious limitations due to its hierarchical
nature, which is why relational databases were invented, so using existing
tools it's probably quite feasible to create an application that uses a SQL
database to index all the embroidery files and store all this data on them
(license, whether it's a derivative of another one, and tags), and then if
they're too large to just store in the DB itself, just point to files in the
regular filesystem.
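
To make the suggestion above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema, the table and column names, and the example paths are all assumptions made for illustration; nothing here comes from an existing application. Files stay on the regular filesystem, and the database only indexes them by content hash, tags, license, and a link to the original they were derived from.

```python
import sqlite3

# Hypothetical schema: one row per file (keyed by content hash), plus a
# separate table holding any number of tags per file.
SCHEMA = """
CREATE TABLE files (
    hash         TEXT PRIMARY KEY,             -- e.g. SHA-256 of the contents
    path         TEXT NOT NULL,                -- location on the regular filesystem
    origin       TEXT,                         -- where the file was downloaded from
    license      TEXT,
    derived_from TEXT REFERENCES files(hash)   -- the original, if this is a modified copy
);
CREATE TABLE tags (
    hash TEXT NOT NULL REFERENCES files(hash),
    tag  TEXT NOT NULL,
    PRIMARY KEY (hash, tag)
);
"""


def open_index(db_path=":memory:"):
    """Create a fresh index; the default keeps everything in memory."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def add_file(conn, file_hash, path, tags=(), origin=None,
             license_name=None, derived_from=None):
    """Register one file and its tags."""
    conn.execute(
        "INSERT INTO files (hash, path, origin, license, derived_from) "
        "VALUES (?, ?, ?, ?, ?)",
        (file_hash, path, origin, license_name, derived_from),
    )
    conn.executemany(
        "INSERT INTO tags (hash, tag) VALUES (?, ?)",
        [(file_hash, tag) for tag in tags],
    )


def find_by_tags(conn, wanted):
    """Return the paths of files that carry every tag in `wanted`."""
    wanted = list(wanted)
    placeholders = ",".join("?" * len(wanted))
    rows = conn.execute(
        f"""SELECT f.path FROM files f JOIN tags t ON t.hash = f.hash
            WHERE t.tag IN ({placeholders})
            GROUP BY f.hash HAVING COUNT(DISTINCT t.tag) = ?
            ORDER BY f.path""",
        (*wanted, len(set(wanted))),
    )
    return [row[0] for row in rows]
```

Keying rows by content hash means two downloads of the same design under different filenames collide on the primary key, which is one way to surface duplicates. The `derived_from` column keeps the link between an edited design and its untouched original.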

------
wrs
This is good thinking, and mostly overlaps with what I've tried to do a few
times, so I would love to see it finally happen somehow.

For example, the Newton storage system I worked on at Apple 1990-1996 was
based on separating organization from storage so we could have multiple (tag
and/or hierarchy) organization systems. That eventually became the "soup"
system in the shipping Newton OS, where objects were retrieved by content
rather than hierarchy.

More recently, I spent some painful years working on Microsoft's WinFS, which
had a ton of overlap with the principles here, and demonstrates just how hard
it is to go from some nice principles that all seem like the Right Thing to an
actual successful adopted implementation of the principles.

~~~
yiyus
I think most people would agree a tag filesystem or a similar concept is a
great idea. I have wanted one myself for a long time. Yet, for some reason, it
doesn't take off. Do you think it is just a problem of implementation?

~~~
lambda
A few things I can see off the top of my head that would need to be solved for
tag filesystems to be able to take off:

1\. Inertia. Most software assumes hierarchical filesystems, and assumes it
can control some portion of that hierarchy. This includes things ranging from
search paths for various things ($PATH for binaries, library paths, etc),
temporary files, preferences files for applications, assumptions made about
hierarchical filesystems in archive formats like ZIP and TAR, etc.

2\. Permissions. With a hierarchical filesystem, you can apply permissions on
higher levels of the hierarchy to control access to lower levels, and you can
have various forms of permission inheritance to control permissions on new
files. Need a design for how to do that on tag filesystems.

3\. Mounting. Filesystems come and go; some are on your OS drive, some are on
removable media, some are network filesystems. Hierarchical filesystems mean
that each one has a single root, and it's easy to tell where the boundaries
are.

4\. Tagging taxonomy. What kinds of tags do you use? What happens if you mount
a filesystem in which someone else used a different tagging taxonomy than you
used? Who controls different parts of the tag space? What happens if you
import an archive of material which uses a different tagging scheme than you
use?

5\. Projects. How do you group files of different types with different tags
into discrete projects? How do you bundle related files together? How often
would you want to see files by arbitrary tag lumped together, rather than
looking in particular projects that have a pre-defined structure?

6\. UI. How do you browse tags? How do you refine down? In many cases, rather
than a general purpose tags based interface, you actually want media-specific
browsers, like ones specialized for music which let you browse by artist,
album, playlist, etc, or photo galleries that can show you previews of the
photos, or video browsers which can show projects, bins, and sequences (for
video editing), or IDEs which can either show you file hierarchy or allow you
to browse by class, function, etc.

7\. And finally, why is a tag-based filesystem necessary for this? What is
wrong with the current approach, in which there are special purpose
applications which can index, tag, and display certain types of media in
certain ways? For instance, you can use your text editor or IDE to navigate
among files within development projects, iTunes or Play Music or whatever to
browse your music, Lightroom or Darktable or iPhoto or Google Photos to manage
your pictures, iMovie or Final Cut or Premiere or Avid for browsing and
managing your video, and so on. They all frequently have some way of tagging
files, but also have specialized UIs for browsing the specific types of files
they are defined for without having to do explicit tagging, and the actual
files are just stored on a normal hierarchical filesystem.

There are a couple of good thoughts in the original post, but a lot is
handwaved away, such as mutability of files, which is an incredibly important
use case, for a huge amount of what files are used for today. Lots of people
have brought up the idea of making tag based or database based filesystems,
such as the failed WinFS effort
([https://en.wikipedia.org/wiki/WinFS](https://en.wikipedia.org/wiki/WinFS)),
but it's actually a pretty big problem to solve.

~~~
xcvbxzas
I think many of these issues can be addressed with mechanisms proposed by the
author.

Mainly, the more complex tags which can themselves refer to other tags.

1\. This is probably the trickiest one. You may be able to do some sort of
translation between a hierarchical system and the tag system using tags
themselves. You could have a series of tags that refer to each other, such
that the hierarchical location is essentially encoded in the tags themselves.

2\. Again, maybe just special tags?

3\. Yeah, again, tags. Just tag the thing with the media it's on.

4\. Aside from the basic UI side of things which should help, there is the
idea of shared tagging systems. I don't recall if that came from the author or
another commenter on HN. And you can basically ask the same question about
hierarchical systems. It's not exactly a solved problem there either.

5\. Again, the complex tags. Just make a tag for the project.

6\. Obviously UI is a big question. I'm not sure how it relates so much to
media-specific browsers though. They basically present a different view of a
section of a filesystem. You have to do some work to let them do this, or else
use a system like iTunes and buy all of your media through them.

7\. Although I feel this is well addressed by the author, one thing I think
you aren't considering is that each of these applications requires its own
setup in order to provide that view. You often can't just take the directory
from one of these programs and use a different program to view it and have it
all work properly. If you only have one program for each media type and never
want to use anything else, that works, sort of. Many years ago I directed
iTunes to redo the file layout for my music collection and rendered it
effectively useless for direct browsing. I never really recovered from that
due to the time involved to sort it out.

And mutability isn't totally handwaved away, again with the complex tag system
you could tag mutated works with a reference back to the original. This
doesn't cover the case where you don't wish to retain the original, but then
you could just do a simple find/replace with the old and new hashes in the
simplest case.
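
As a rough illustration of point 1, the hierarchical location can be encoded as one tag per ancestor directory, so that a "directory listing" becomes an ordinary tag query. This is only a sketch: the `dir:` tag prefix and both function names are made up for the example and are not taken from the article.

```python
from pathlib import PurePosixPath


def path_to_tags(path):
    """Encode every ancestor directory of `path` as a hypothetical 'dir:' tag."""
    parents = PurePosixPath(path).parents
    # `parents` runs from the immediate directory up to the root;
    # the root itself ("/" or ".") carries no information, so skip it.
    return {f"dir:{p}" for p in parents if str(p) not in ("/", ".")}


def list_directory(tagged_files, directory):
    """Emulate a recursive listing: every file tagged with dir:<directory>."""
    wanted = f"dir:{directory}"
    return sorted(name for name, tags in tagged_files.items() if wanted in tags)
```

Software that expects a hierarchy could be served by a translation layer along these lines, though the sketch ignores renames, permissions, and mount points.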

~~~
lambda
I think you're missing some of the subtleties of solving these problems using
"just more tags."

In a hierarchical system, a lot of these organizational issues are local. If I
have one directory that consists of a project organized one way, and another
directory that consists of a different project organized a different way,
those different organizations don't really interact with each other in any
way.

If you are using tags for everything, in order to avoid weird mishmashes of
different ways of using tags, you would need to either have a completely
standardized tagging system that everything used consistently, or you'd have
to always include various contextual information in your queries or in your
browsing in order for the queries to make sense. For instance:
[mount: my-hd][project: my-project][type: jpeg]
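
A small sketch of what such a query might look like in practice, assuming each file carries a flat dictionary of key/value tags. The `mount`, `project`, and `type` keys come from the example above; the index structure and the function names are invented for illustration.

```python
def matches(file_tags, query):
    """True if the file has every key/value pair named in the query."""
    return all(file_tags.get(key) == value for key, value in query.items())


def search(index, **query):
    """Return the names of all files in `index` that match the query."""
    return sorted(name for name, tags in index.items() if matches(tags, query))


# Hypothetical index: file name -> key/value tags.
index = {
    "a.jpg": {"mount": "my-hd", "project": "my-project", "type": "jpeg"},
    "b.jpg": {"mount": "usb-stick", "project": "other", "type": "jpeg"},
    "notes.txt": {"mount": "my-hd", "project": "my-project", "type": "text"},
}
```

Here `search(index, mount="my-hd", project="my-project", type="jpeg")` narrows to a single file, while `search(index, type="jpeg")` mixes files from unrelated projects and mounts, which is the loss of locality described above.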

I think you overstate the problem with different applications as well. For a
large amount of the metadata that is relevant for these applications, there is
a standard tagging system. ID3 for music, EXIF for images, XMP for various
image and video formats. It's true that there is some metadata that these
applications store in proprietary databases, but that's mostly an issue of it
being difficult to come to a consensus on standards that meet everyone's
needs, and it's easier to just write some proprietary metadata somewhere. With
tagging systems, if there wasn't agreement on the schema of tags, you'd still
have the same issue.

I don't think it's a bad idea to consider alternatives that are more general
and more flexible than what we're doing now, but I do think that it's pretty
easy to handwave about how nice a tag based system would be, but a lot harder
to solve all of the little problems that are going to come up and turn it into
a real, coherent, working whole, and then get enough critical mass so that
it is used outside of a small niche with a handful of applications.

~~~
xcvbxzas
I'm sure you're right that there are a lot of overlooked subtleties. That said,
I'm not sure some of those problems you mentioned would exist, or at least I'm
not sure they would be any worse with a tag system than a hierarchical one.

For example, how is that example query any worse than the current situation?
Right now you'd navigate to the project directory (requires specifying more
than your example already) and then use some search method depending on
OS/WM/etc. And then you still end up with a big list of jpegs to look through.
This is sort of a worst-case example for both systems, and still I think the
tag system comes out ahead here - by a little - just because it would give you
the ability to spread the project across multiple drives without requiring you
to do two searches if you don't know which drive the desired image is on. You
can improve the situation for either system by manually specifying more
information. Put better tags on the images or put them in more specific
directories or title them.

As for specific applications, it's not the metadata encoded into the files
that I'm talking about. It's as simple as the directory structure itself that
is used to store all of this. I can't have one application organize everything
and then trivially point another application at the directory and have it
work.

With a tag-based system this starts to change. I don't need to tell a new
music player where my music is, and then go through whatever process is needed
to let it properly work with the current directory organization. At worst I
tell it which tags to include or perhaps exclude. From there many options
exist. Maybe it pulls in metadata from the files themselves. Maybe I provide
an external file in whatever format. Maybe I tell it which tags to associate
with which fields. You could do a lot of things here.

I also won't end up telling the application to reorganize things as I did many
years ago with iTunes, which promptly made it nearly impossible to wade
through my music manually. I had it sort everything into directories based on
the artist with subdirectories for albums. It sounded great, until I
remembered just how much music I had off OCRemix, where an album is a large
collaboration between many people. All of those albums were ripped apart.
Ironically, I also had some standardization issues with things like artist
names which caused more trouble. Once I stopped using iTunes I basically
abandoned that collection because of the work required to fix it.

Yeah, standardization is going to be sort of a problem, but I don't think it's
quite as big of a deal as you think. For one, the OS is going to ship with a
bunch of standard tags just for itself to work. There will also just be a lot
of really standard stuff people are interested in that can be shipped with
them. You also have file extensions, for both specific extensions and also
generally what kind of information they contain. And finally there is just
good old translations. The hierarchical system basically utilizes all these
methods and suffers from the same problem - namely you can put directories
wherever you want and name them whatever you want. Same problem, different
manifestation.

I think the biggest benefit would come from a system that can present itself
either hierarchically or tag-based. They both have merits. I've already
presented some ideas on how you could store the hierarchical structure in the
tags. I'm not so sure how you store the tags in a hierarchical system
directly. You could probably fake it with a separate datastore easily enough
though.

Finally, when did this discussion of general design goals turn into one of a
real-world implementation, much less widespread adoption? I'm not sure how
this is relevant.

------
jessmartin
Have you considered a file system organized as a timeline that _also_ supports
tagging?

I find one of the key concepts that's not a first-class concept is _when_ the
file was modified. Rather than a file-and-folder physical analogy for the file
system UI, I think a timeline-oriented UI could present some advantages for
the way that humans actually think and work. Tags would be a helpful
orthogonal organization scheme, but I don't think they work as a primary UI
for navigation.

This is great work, though! I love the compilation of various other works, the
references, and the way you've dug into the details!

~~~
nayuki
I have thought about file tagging for over a decade, before setting out to
write the article. But a timeline-oriented file organization only came to my
awareness near the end of writing the first draft.

A year has passed since I wrote the article, and the idea of timeline
presentation has grown on me a lot. Especially because I use numerous data
systems daily that are already time-oriented: Every chat program, email,
Twitter, Facebook personal profile timeline, the "recent documents" view in
major popular applications like Microsoft Office or Adobe Reader.

You should find this blurb in my article helpful:

> The Lifestreams Software Architecture
> [http://www.cs.yale.edu/homes/freeman/dissertation/etf.pdf](http://www.cs.yale.edu/homes/freeman/dissertation/etf.pdf)
> (185 pages)

> Comprehensively designs and tests a system for workflow and archival, based
> on chronological presentation plus keyword and attribute filtering.

Thanks for your compliments on the thoroughness of my exposition. I did a lot
of research and thinking as preparation for writing. I wanted to see how other
people viewed the problem of file organization and what kind of solutions they
proposed. I wanted to find weaknesses in my arguments and to avoid repeating
unnecessary work, and of course I wanted to move toward the best solution.

~~~
hvidgaard
If all you want is knowledge about when a piece of data was created and/or
modified, then tags do that just fine.

------
on_and_off
>But fundamentally, there is a mismatch between the narrowness of hierarchies
and the rich structure of human knowledge, and the proposed system will not
presuppose the features of HFSes.

This hits the nail on the head !

All the filesystems I had to work with are fine as engineering tools. By that
I mean using them as an engineer works just fine, their own implementation is
off topic.

As a user though.

What the hell !

I don't want to go to c:/users/me/documents/talks/stockholm2018/draft3

I just want to open my document !

I really hope that someday we expose a document based filesystem to the user.

The underlying implementation does not matter, we can always add a layer on
top of the hierarchical file system.

I just want to be able to display :

-all the games installed on my system .

-all the pictures

-all of my text documents .

-all of my pictures of Paris

etc

~~~
Benjammer
How do you make sure things are tagged properly though? A file must reside
within a folder, even if it's a default location, which forces a user to think
about the folder where the file is stored. With tags, a user could very easily
forget one tag on a file, and now any filtering on that tag is never going to
be aware of the new file existing.

What if you find another picture somewhere from your <Paris> trip, but you
forget to add the <Travel> tag? Or you have a tag for "cool architecture pics"
or something and you miss tagging one of your Paris pics with this when you
upload?

There just seems to be so much friction in keeping tags properly organized,
despite how much better the "read" UI is for someone browsing or
searching file collections.

~~~
niftich
Manual tags require a lot of curation and upkeep, but some "tags" are really
just restatements of attributes or facts about a file, like search filters,
e.g. ("Pictures downloaded from the web on 2018-04-05", "Files created during
installation of World of Warcraft", "Files opened in the last two weeks").

In fact, a tag-based document filesystem is largely useless without powerful
search, where tag keys and values can be searched at will.
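
A minimal sketch of such derived "tags", assuming they are computed from facts the system already knows rather than typed by the user. The tag names (`ext:`, `modified:`, `recent`) and the two-week window are arbitrary choices for the example.

```python
import datetime as dt
from pathlib import Path


def derive_tags(name, modified, now):
    """Compute tags from a file's name and modification time alone."""
    suffix = Path(name).suffix.lower().lstrip(".")
    tags = {
        f"ext:{suffix or 'none'}",
        f"modified:{modified.date().isoformat()}",
    }
    # "Files modified in the last two weeks" becomes a tag nobody has to maintain.
    if dt.timedelta(0) <= now - modified <= dt.timedelta(weeks=2):
        tags.add("recent")
    return tags


def tags_for_path(path, now=None):
    """Convenience wrapper that reads the modification time from disk."""
    path = Path(path)
    modified = dt.datetime.fromtimestamp(path.stat().st_mtime)
    return derive_tags(path.name, modified, now or dt.datetime.now())
```

Because these tags are recomputed from attributes, they cannot be forgotten the way a manual tag can, but they only cover facts the system can observe.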

~~~
Angostura
I suppose I should mention that while MacOS search is far from perfect, Apple
put quite a lot of work into this kind of autotagging.

Here's a screenshot showing a tiny portion of the tags available in search
[https://imgur.com/a/AYm5V](https://imgur.com/a/AYm5V)

------
egypturnash
Every time I read an article about attempts at non-hierarchical filesystems, I
try to figure out how I'd take the huge piles of stuff I generate when I'm
drawing (and publishing) a graphic novel and reorganize it under tags. It's
never pretty.

Like, okay, sure, I tag everything with the name of the project, that's a
no-brainer. But if I just do that then I get the hundreds of files I generate
(one per page) mixed up with everything else - web-res renderings of each
page, model sheets (and their source files), promotional material, stuff sent
to publishers to try and convince them to deal with that part of the process,
and the huge mass of files I generate for each book I print (which can be more
than one for a single multi-year project): source files tweaked for print,
print-res renderings thereof, files for the kickstarter for each book... So I
tag all of these attributes too, and imagining putting all these tags _on_ a
file as I save it sure is a lot of fun, even if I imagine some sort of save
requestor that keeps a list of all my previously-used tags, including ways to
filter _those_ - I don't care about any of the tags attached to my music
collection or my collection of cartoon porn or my programming projects when
I'm working on my comics projects, for instance, so I'd want to quickly narrow
it down to just tags found in my art projects, and...

Ultimately it just starts to look like a hierarchical structure in my mind,
except for the fact that I'm interacting with it by some kind of tag-filtering
file browser on top of a huge filesystem that mixes everything together in a
non-human-browseable structure.

~~~
dgreensp
I can strain my imagination and optimistically say that with a suitable UI,
the tag-based file browser would be no worse than the hierarchical one — using
tags to categorize your files in a not-too-coarse, not-too-fine way, and maybe
even having more flexibility in how you organize and browse your files. Do you
ever put information in the file name that could be a tag, or make up
meaningless file names when there are few enough files related to a particular
page, for example? Or put information in names that could be attached as some
sort of comment or notes metadata instead? But I think you bring up some
important issues when it comes to organizing your stuff.

Would it be better if, instead of having kitchen drawers and cabinets, we had
a tag-based system, because, you know, the structure of human knowledge and
all that? Why should I be forced to put a utensil in zero or one drawers?
Actually, the cabinets and drawers system is nice because you have a sense of
“place” — you can think intuitively about where an item is, where an item
goes; even if where an item goes is a somewhat arbitrary choice, at least you
know exactly what decisions have to be made (which drawer and which
compartment in the drawer organizer, for example) to put it away. You can also
do a traversal through the cabinets and drawers to see what objects are being
stored and how they are being organized, and any time you open a cabinet, you
are focusing on a different set of objects. Imagine if you had 10 cabinets and
10 objects in each cabinet, but you actually only have 20 objects total. Every
time you open a different cabinet, you see a different 10 of those 20 items.
Confusing.

I wonder if it would help to have required tags, and exclusive tags. Files
with tag X must have tags of type A, B, and C, and may have other nonessential
tags. X would be something like, “is a project file for some project,” and A
could be the type of project name tags, and so on.
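
One way the required/exclusive idea could be checked at save time is sketched below. It assumes tags are written as `type:value` strings; the marker tag `is:project-file`, the tag types, and the rule table are hypothetical.

```python
from collections import Counter

# Hypothetical rule: anything marked as a project file must also say which
# project it belongs to, what kind of file it is, and which stage it is in.
REQUIRED_TYPES = {"is:project-file": ["project", "kind", "stage"]}
# Hypothetical exclusive types: at most one tag of each per file.
EXCLUSIVE_TYPES = ("project",)


def tag_type(tag):
    """Return the part before the first colon, e.g. 'project' for 'project:x'."""
    return tag.split(":", 1)[0]


def validate(tags, required=REQUIRED_TYPES, exclusive=EXCLUSIVE_TYPES):
    """Return a list of human-readable problems; an empty list means valid."""
    counts = Counter(tag_type(tag) for tag in tags)
    problems = []
    for marker, needed in required.items():
        if marker in tags:
            for typ in needed:
                if counts[typ] == 0:
                    problems.append(f"missing {typ} tag")
    for typ in exclusive:
        if counts[typ] > 1:
            problems.append(f"more than one {typ} tag")
    return problems
```

A save dialog could refuse to close until `validate` returns an empty list, which gives back some of the "you must decide where this goes" discipline that folders provide.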

------
hammerandtongs
Tags are great BUT

It's pretty important to realize that a file's position is merely its default
tag (and you can tag it further with many different types of systems like
extended attributes, as I think both Gnome and KDE have used at times).

Without that default tag you have a mess.

It's also important to note that Tags have a very high maintenance cost of
their own.

Duplicate, inconsistently applied and redundant tags are an aggressive cancer
in any of these systems.

No, you can't just ignore them, as they make it more and more difficult to
accomplish even basic viewing/scanning over files for the system and the
user.

Many, many users have trouble even doing basic maintenance on their file
locations (that default tag), which makes a tag-based system even more prone to
failure.

------
fao_
I spent last year ruminating on this (Independently. However I find it
interesting the article was published around the same time that I was voicing
my ideas to a friend on this!) and toying with a few prototypes. This year I
committed myself to hacking on a proper implementation (named 'libkoios' and
'koios') of it, using the Extended Filesystem Attributes. What I found
interesting is that while a lot of prior systems exist for
_tagging_, none of them use the extended attributes system, which to me feels
like a waste. However there are problems with ext(2,3,4)'s implementation of
file tags that make it difficult to store a lot of data without compression
(I'm storing one bit per tag, which allows fast masking and comparison
operations per file), so I guess that is understandable.
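
The one-bit-per-tag idea can be sketched with plain integers as bitmasks. This is a generic illustration, not the libkoios implementation: the `TagRegistry` class and the helper names are assumptions, and storing the resulting mask in an extended attribute is left out.

```python
class TagRegistry:
    """Assigns each tag name a stable bit position (hypothetical helper)."""

    def __init__(self):
        self._bit = {}

    def mask(self, tags):
        """Return an integer with one bit set per tag, registering new tags."""
        result = 0
        for tag in tags:
            # A tag seen for the first time gets the next free bit position.
            bit = self._bit.setdefault(tag, len(self._bit))
            result |= 1 << bit
        return result


def has_all(file_mask, query_mask):
    """True if every tag in the query is set on the file."""
    return (file_mask & query_mask) == query_mask


def has_any(file_mask, query_mask):
    """True if at least one tag in the query is set on the file."""
    return (file_mask & query_mask) != 0
```

Each comparison is a single AND on an integer, which is what makes this layout fast. The cost is that the mask grows by one bit for every distinct tag in the system, which fits the remark about needing compression.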

I believe that for image-based systems there is 8ch's /hydrus/ (probably the
only good thing to come out of the chan-networks). One upshot of there being
existing network sharing systems for tags is that it should be possible to
scrape them when autotagging things (Nobody. NOBODY, wants to manually tag
hundreds of photo memes, which is the main foreseeable problem with file
tagging).

------
Slansitartop
I never personally used it, but I've heard the BeFS was designed to have
significant non-hierarchical use cases:

[https://en.wikipedia.org/wiki/Be_File_System](https://en.wikipedia.org/wiki/Be_File_System):

> [BeFS] includes support for extended file attributes (metadata), with
> indexing and querying characteristics to provide functionality similar to
> that of a relational database.

IIRC, this was pretty hyped at the time, but they had to back away from it. I
don't know if it was because the concept was too unfamiliar to people
familiar with the hierarchical paradigm or if it didn't work as well in
practice as it was imagined.

There's also a book about it written by its designer and now freely available:
[http://www.nobius.org/dbg/practical-file-system-
design.pdf](http://www.nobius.org/dbg/practical-file-system-design.pdf)

~~~
nayuki
BeFS extended file attributes are a good example to point out. I watched this
excellent talk which shows the power of live queries:
[https://systemswe.love/archive/minneapolis-2017/ivan-
richwal...](https://systemswe.love/archive/minneapolis-2017/ivan-richwalski)

Unfortunately, file attributes are not a feature I want to see. They don't
solve the problems with naming files or deduplicating files. Metadata is
attached to a file, so when the file is gone the metadata is gone. I instead
proposed that metadata can be freestanding, and can exist even if the main
file is missing. Relevant section to read:
[https://www.nayuki.io/page/designing-better-file-
organizatio...](https://www.nayuki.io/page/designing-better-file-organization-
around-tags-not-hierarchies#internal-vs-external-metadata)
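
A rough sketch of the distinction, assuming metadata is keyed by the SHA-256 hash of the content rather than attached to a file entry. The `Store` class and its method names are invented for this example and are not the design from the article.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Address content by its SHA-256 digest."""
    return hashlib.sha256(data).hexdigest()


class Store:
    """Toy store where blobs and metadata live in separate maps."""

    def __init__(self):
        self.blobs = {}      # hash -> bytes
        self.metadata = {}   # hash -> dict, independent of `blobs`

    def put(self, data, **meta):
        """Store content and merge metadata; identical content is stored once."""
        digest = content_hash(data)
        self.blobs[digest] = data
        self.metadata.setdefault(digest, {}).update(meta)
        return digest

    def forget_blob(self, digest):
        """Drop the content only; the metadata is deliberately kept."""
        self.blobs.pop(digest, None)

    def describe(self, digest):
        """Return (metadata or None, whether the content is still present)."""
        return self.metadata.get(digest), digest in self.blobs
```

With attribute-style metadata, deleting the file would take the metadata with it. Here, `describe` can still report what a missing file was, and adding the same bytes twice deduplicates to a single blob.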

I did read the entire book "Practical File System Design with the Be File
System", but didn't find it helpful for what I was working on. I can skip the
low-level bits because I'll probably build on top of a NoSQL database or
something.

------
xchaotic
Tagging is the first step, but how do you know if you don't have overlapping
or duplicate tags, say country and folk music? If you need something that fits
both categories, you eventually start designing taxonomies and eventually
ontologies, there's just no end to it. I think tagging is a sensible,
lightweight approach, but it has limitations...

~~~
fao_
> country and folk music

Folk music has nothing to do with country.

Country was invented in the 1800s by a businessman who aimed it at primarily
white Americans. Folk music has a rich history dating before the 1200s and
before much of written language. Many songs of Irish and Welsh descent date
before record, still being played today.

Nevertheless, to answer your question:

1) metatags, parent tags, and the like provide ways to structure tag
relationships to encompass and describe what you speak of.

2) There are many networks out there that share users' tags based on file
hashes. A system could scrape these networks for existing tags and autotag
many items without the user's interaction (beyond initiating the autotagger).
Users could also get their hands dirty and ask for only tags relating to
parent categories, or something like that.
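
Point 1 can be sketched as a table of parent tags that is expanded at query time, so a file only needs its most specific tag. The example taxonomy and the function names below are hypothetical.

```python
# Hypothetical taxonomy: tag -> set of direct parent tags.
PARENTS = {
    "tiger": {"animals"},
    "reel": {"irish-trad"},
    "irish-trad": {"folk"},
    "folk": {"music"},
}


def expand(tags, parents=PARENTS):
    """Return the given tags plus all of their ancestors."""
    seen = set()
    stack = list(tags)
    while stack:
        tag = stack.pop()
        if tag not in seen:          # the check also protects against cycles
            seen.add(tag)
            stack.extend(parents.get(tag, ()))
    return seen


def find(index, tag, parents=PARENTS):
    """Names of files whose tags, once expanded, include `tag`."""
    return sorted(name for name, tags in index.items()
                  if tag in expand(tags, parents))
```

A file tagged only `reel` is then found by a search for `folk` or `music`. The question of who maintains the parent table is the taxonomy problem raised above, and this sketch does not solve it.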

~~~
kazinator
"Folk music" possibly rides on the definition of "folk", which brings in a lot
of diversity if interpreted broadly.

The word "folk" also denotes a recognizable format in the context of commercial
broadcasting and streaming of canned music. It basically refers to a locus
roughly centered around someone crooning while strumming chords on an acoustic
guitar.

~~~
fao_
> The word "folk" also denotes recognizeable format in the context of
> commercial broadcasting and streaming of canned music.

That's a very american-centric view you have there.

~~~
kazinator
That isn't what you might call my "view"; I'm just remarking on how it seems
that a word happens to be used in a certain culture and context.

------
prepend
I find it hard to remember if Google Docs originally had tags instead of
folders. Or if I just imagined it. I can’t find it through googling.

I would rather use tags than folders, but can’t find good support in an
operating system.

Google used to be the closest since you could use the search bar as a command
line and search queries as tags. There are no folders. But they have changed: now
they try to guess what you’re looking for rather than what you type.

OSX has tags, but their search is slow and inaccurate.

The closest I’ve come is trying to use Gmail as an organizer with inbox
infinity rather than inbox zero. Nothing organized other than tags. Using
search to find anything.

------
boffinism
Google Drive used to be based around tags, not hierarchies. It was wonderful.
Then as it matured and catered to more and more 'normal' people it introduced
the concept of folders. The folders were initially tags 'really' - the same
file could exist in multiple folders at the same time. But they made that
harder and harder, and now I think it's hierarchical folders through and
through.

I miss the old days.

~~~
ASalazarMX
> as it matured and catered to more and more 'normal' people it introduced the
> concept of folders

I think it has more to do with capacity growth. Relying on tag+search would be
insane if most queries returned hundreds of files.

~~~
ori_b
Yeah. Imagine a Google search that returned more than 100 results. Madness!

------
dgreensp
I found this to be a great collection of insights!

My criticisms:

Organizing your files and digital “stuff” has very little to do with the “rich
structure of human knowledge,” to me, any more than organizing your kitchen or
garage is an exercise in philosophy. The goal should be as _usable_ a system
as possible, full stop. Now, the actual content of the article is extremely
practically oriented, so I have no beef with that. I just think people get
carried away with the idea that storing a file is “representing knowledge,”
and it takes them in weird directions like trying to create elaborate
universal ontologies. The question is, is it easier or harder to find your
files, and save your files?

Whenever a phrase like “representing knowledge” or “augmenting intelligence”
comes up, it’s like everyone gets a boner, and then moves on to something
unrelated, like (hopefully) usability.

Mutability: Everything changes. The only way to have immutable facts is to
have timestamps. Image hosting sites, message boards, etc are misleading
examples of file storage because they are really means of _publishing_. When
you publish something, and people link to it, there’s a case for thinking of
immutability as the default, though even then, most things that can be
published can be retracted or edited. This comment can be edited after I
publish it. I think true immutability as a default, for files as opposed to
time-stamped facts, only makes sense in a very narrow domain.

------
Nadya
I've used a self-hosted Booru for all of my image sorting. It took me roughly
2 months to upload and tag 53,000 images and another week of cleaning up
rarely used or redundant tags. Since I use it for things other than the usual
Artist/Series/Character hierarchies, my Artist/Series/Character fields
refer to Topic/Subject/Details.

For example, visualizations of various algorithms would be filed under:

Topic: Computer Science / Subjects: Algorithms, Visualizations / Data:
{Algorithm Name}

By personal restriction - something may only be filed under one topic, no more
than three subjects, and can have as much data as is relevant. It gets stored
under what I believe to be the primary subject.
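
A minimal sketch of those filing rules, in case it helps anyone build
something similar. The class and field names are made up for illustration,
"Quicksort" is just a placeholder algorithm name, and treating the first
listed subject as the primary one is an assumption, not how any Booru works.

```python
from dataclasses import dataclass, field

MAX_SUBJECTS = 3  # the personal limit described above


@dataclass
class Filing:
    """One image's classification; field names are illustrative."""
    topic: str
    subjects: list
    details: list = field(default_factory=list)

    def __post_init__(self):
        # Exactly one topic, and between one and three subjects.
        if not self.topic:
            raise ValueError("exactly one topic is required")
        if not 1 <= len(self.subjects) <= MAX_SUBJECTS:
            raise ValueError(f"need 1 to {MAX_SUBJECTS} subjects")

    @property
    def primary_subject(self):
        # Assumption: the first subject listed is the primary one.
        return self.subjects[0]


filing = Filing(
    topic="Computer Science",
    subjects=["Algorithms", "Visualizations"],
    details=["Quicksort"],
)
print(filing.primary_subject)
```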

The #1 problem is I add files to my filesystem without uploading and tagging
them to the Booru. Also, since the only open-source Booru software I could
find is quite dated/buggy, I'm often fighting the Booru for how I use it. Now
that I think about it, this might be a good problem for me to solve myself.

------
hokus
ALL available meta data should be exposed and forced into tags, categories and
folders.

Move everything file-like into the file system. Make emails into folders with
tagged files. Link Torrents to their files and folders. Treat zips like
folders. etc

Bit more on date range sliders, colors, files as tags and 3d models here:

[https://steemit.com/filesystem/@gaby-de-wilde/how-a-file-
sys...](https://steemit.com/filesystem/@gaby-de-wilde/how-a-file-system-
should-be-organized)

------
tfolbrecht
An approach I find interesting is the
Perkeep ([https://perkeep.org/](https://perkeep.org/)) or Google Drive model
for post hierarchy.

Storing all files as objects and then indexing them.

An interesting indexer for images would be one that groups objects by faces
recognized or EXIF data (camera model, GPS location, lens, date, etc.). Google
Drive does this.

Perkeep can deal with tags, span devices, deal with permissions. Check out HN
user @bradfitz

~~~
cyphar
My main issue with Perkeep is that a lot of the automated tags are very
limited, and adding more is not really easy-to-do. Though the last time I used
it, it was known as Camlistore. So maybe things have changed (there were some
pretty bad UI issues back then as well).

I do like the _idea_ that nothing is deleted and everything is stored using
"permanodes" and signatures of objects defining mutations of a "permanode".
The downside is that everything is so incredibly dependent on the indexer, and
my experience is that if the indexer has a bug you are in a lot of trouble.
Also sometimes you don't want to keep everything you made 10 years ago around
-- especially if it's burning storage space that costs you money.

------
sdhgaiojfsa
Every time I use a tagging-based system, I become more convinced that tags are
what I want for almost all things, not just files.

~~~
copperx
How do you find an untagged file? Are wrongly tagged files undiscoverable?

~~~
nayuki
This is more of a UI/UX question, but I thought about it on and off for months
and have a partial answer. Look at how Gmail works - you have All Mail, but
you also have Inbox, Sent, and your own custom labels (tags). Every message
can always be found in All Mail, and you can restrict your search by date
range. Similarly on an image board like Danbooru, even if you don't tag an
image, it will appear on the chronological stream of every image ever uploaded
to the system. So, I'm hoping the design will end up something like these two
examples. You should be able to list every file, preferably in chronological
order; you should also be able to exclude files that have at least one tag;
and the software might have a special nag section showing you all the files
that you left uncategorized.
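
Roughly, the three views could look like the sketch below. This is only an
illustration under assumed names (`Entry`, `all_files`, `untagged`,
`in_date_range` are hypothetical), not code from any prototype.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Entry:
    """One stored file; `file_id`, `added`, and `tags` are illustrative names."""
    file_id: str
    added: datetime
    tags: set = field(default_factory=set)


def all_files(entries):
    """The 'All Mail' view: every file, newest first, tagged or not."""
    return sorted(entries, key=lambda e: e.added, reverse=True)


def untagged(entries):
    """The 'nag' view: files that carry no tag at all, newest first."""
    return [e for e in all_files(entries) if not e.tags]


def in_date_range(entries, start, end):
    """Restrict any view to files added in the half-open range [start, end)."""
    return [e for e in entries if start <= e.added < end]


entries = [
    Entry("a", datetime(2018, 1, 5), {"receipts"}),
    Entry("b", datetime(2018, 3, 1)),
    Entry("c", datetime(2018, 2, 10)),
]
print([e.file_id for e in all_files(entries)])
print([e.file_id for e in untagged(entries)])
```

The point is that the untagged view is just a filter over the full
chronological list, so no file can fall out of sight for lack of a tag.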

------
abathur
I've wanted something along these lines for a long time as well. I have
trouble drawing hard lines and distinctions (this is pervasive; things like
having a "favorite" anything, or the desire to debate what genres a song or
movie fall into, are rather alien to me). This makes picking "one" place for
something difficult. Because these are fine/fuzzy distinctions for me, it's
also tricky to reason my way back to where I would have put something.

The biggest directory in my _document_ hierarchy is "flotsam".

I think part of the problem is that organizational tactics/schema/heuristics
aren't global. We need an array of safe, high-quality tools with good system
support/interfaces, and the knowledge to reason about how and when to use
which. Patterns.

A stack is probably a fine way to think about organizing mail or clothes. It's
probably less useful for deciding where furniture or paintings should go. A
filesystem that made sense for organizing source code is probably not the best
tool for organizing a movie collection or a lifetime of personal documents.
Genre apparently seems like a great way to organize most of the world's movie,
book, and music store/sections, but I (unless I can get someone to check the
store's inventory system) never know whether what I'm looking for is out of
stock or just hiding in the taxonomical hinterlands.

Search can help. Tags can help. Hierarchy can help. Metadata can help.

~~~
jonnycomputer
I don't see why one couldn't have a tag system whose scope is defined by its
location in a hierarchical structure.

~~~
jonnycomputer
apparently someone thought that the notion of a tag system with a semantic
scope bound to a file tree hierarchy is such a ridiculous--nay, offensive!--
idea that they had to downvote me for it, but couldn't bother to address it on
the merits. The problem I see with tag based structures is that tags are
global in scope, and so, tags have to mean one thing and one thing only; on
the other hand, having one set of tags for a photo directory subtree, and
another for my code repository makes a hell of a lot of sense to me.
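
To make the idea concrete, here is a small sketch of tags whose namespace is
tied to a subtree root. `ScopedTags` and its methods are hypothetical names;
the only assumption is that each scope is identified by a directory path.

```python
from pathlib import PurePosixPath


class ScopedTags:
    """Tags whose meaning is local to a directory subtree (hypothetical design)."""

    def __init__(self):
        # scope root -> tag name -> set of file paths
        self._scopes = {}

    def tag(self, scope, path, tag):
        scope, path = PurePosixPath(scope), PurePosixPath(path)
        # A file can only be tagged within a scope that contains it.
        if scope != path and scope not in path.parents:
            raise ValueError(f"{path} is outside scope {scope}")
        self._scopes.setdefault(scope, {}).setdefault(tag, set()).add(path)

    def find(self, scope, tag):
        """Files carrying `tag` within `scope` only; other scopes are ignored."""
        return sorted(self._scopes.get(PurePosixPath(scope), {}).get(tag, set()))


store = ScopedTags()
store.tag("/photos", "/photos/2017/beach.jpg", "master")  # "master" = best shot
store.tag("/code", "/code/repo/main.c", "master")         # "master" = branch
print(store.find("/photos", "master"))
```

The same tag name can then mean different things in the photo tree and the
code tree without the two ever colliding in a search.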

~~~
Firadeoclus
I agree, though I think this kind of hierarchy should be shallow and high
level, representing only typical search scope boundaries.

------
usrusr
Just a braindump of what I don't like about tags instead of directories, no
need to repeat the advantages (I agree with some of them):

- lack of identity. Somewhat watered down in presence of hardlinks/symlinks,
but still much closer to identity than a tag cloud.

- does a file without a tag exist? It is conceptually very clear how removal
of the path identity causes the actual bytes to go back into the free storage
pool (wiggle a bit for hardlinks, but still pretty clear), it would be quite
weird however to have files stick around based solely on secondary tags like
"blue".

- lack of a consistent threshold for tagging: tags are binary, but relevance
is not. If some files are tagged close to a full text index while others are
tagged following a more minimalistic approach, the combined soup will not be
very useful.

- too powerful for convenience: file creation in a hierarchical filesystem
usually happens with a somewhat meaningful default. PWD, an app-wide default
or an app-specific last used folder. The default sets some of the information
you might put into tags, and this information is easily corrected if it was
wrong, with a single operation that might be as easy as dragging to a
different folder. "Is the default folder the right one for this file?" is
easily discerned and corrected, a default tag cloud however would require a
full mental scan to check for applicability to the new file. Every attempt of
making those defaults more clever would just force even more scrutiny onto the
user.

------
zby
[https://web.archive.org/web/20070927003401/http://www.namesy...](https://web.archive.org/web/20070927003401/http://www.namesys.com/whitepaper.html)

Hans Reiser had some good ideas.

And by the way there is a good analogy with www - originally we had just the
addresses (somewhat hierarchical), then we had the hierarchical catalogue of
Yahoo, and then quickly it became too much for that and we now rely on search.

~~~
TeMPOraL
There's a difference in searching the Internet and searching your own data. On
the Internet, there's much more of everything than you'd ever need, and close
to none of it is something you've created, or even seen before. So you use
search to get some reasonably relevant results. The search doesn't have to be
- _and isn't_ - complete or correct.

On the filesystem, I'd like to know the data browser isn't hiding files from
me by reporting only "top 100 relevant results", or not indexing half of it
because $reasons, or not showing them because of a faulty query. Being able to
iterate through all files on your disk in a tree-like fashion seems like a
feature.

------
kbos87
I 100% agree that a tag based file system is better than a hierarchical or
folder based system. The problem is that people seem to fall into one of two
camps - too confused by how to make tags work, or too enamored with organizing
things into folders.

------
sharpercoder
A way to see folders & tags is to treat folders as tags but with the property
that an item may belong to only one folder. The same happens in biology where
scientists try to classify animals into one folder; a cat is put into the
mammals folder (which has multiple hierarchies of subfolders). Instead, this
classification system may be much more effective if it is classified as tags
instead, where items may belong to multiple folders (or categories). This
solves problems like the platypus, which belongs to multiple "biological
folders".

It seems that humans have a hard time reasoning with the tag concept. I'm not
sure where this comes from; is it decades of working with the folder/subfolder
idiom in Windows, which most people grew up with? Is it the resemblance
of the physical world where we also put documents into one folder and one
folder only? Or is it our intuition to simplify matters, and therefore
seemingly make things simpler to uniquely have items belong to one container?
I don't know; most likely, it's all the reasons above plus a few that I didn't
mention.

------
nayuki
Some relevant past discussions on similar articles:
[https://news.ycombinator.com/item?id=14537650](https://news.ycombinator.com/item?id=14537650)
;
[https://news.ycombinator.com/item?id=15492795](https://news.ycombinator.com/item?id=15492795)

------
blitmap
I'm going to really enjoy piecing apart the incredible detail put into this
article.

[https://tmsu.org/](https://tmsu.org/) This is nice.

------
tunesmith
I wouldn't really call a hierarchical FS a DAG (directed acyclic graph)
because of the flaws with links you've already called out. It's not a true
DAG.

Are there graph-structure filesystems?

Tagging has always seemed confusing to me because it seems like a degenerate
case of a graph if your tags can't contain other tags. A graph that is only
two levels deep doesn't have the flexibility of a real DAG. I'm having trouble
visualizing the true correspondence between tagging and a graph structure, but
I think they're pretty much the same thing _if_ you can tag tags. Does that
sound right?
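
A quick sketch of the "tag the tags" idea, to show how a query would walk the
resulting graph. The names `parents`, `ancestors`, and `files_with` are
invented for this example, and the tiger/cat/animal data is illustrative.

```python
def ancestors(tag, parents):
    """All tags reachable from `tag` by following tag -> parent-tag edges."""
    seen = set()
    stack = [tag]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:  # also keeps us safe if a cycle sneaks in
                seen.add(parent)
                stack.append(parent)
    return seen


def files_with(tag, file_tags, parents):
    """Files tagged with `tag` directly or via any more specific tag."""
    return sorted(
        f for f, tags in file_tags.items()
        if any(t == tag or tag in ancestors(t, parents) for t in tags)
    )


# A tag may have several parents, which is what makes this a DAG, not a tree.
parents = {"tiger": {"cat", "striped"}, "cat": {"animal"}}
file_tags = {"a.jpg": {"tiger"}, "b.jpg": {"cat"}, "c.jpg": {"striped"}}
print(files_with("animal", file_tags, parents))
print(files_with("striped", file_tags, parents))
```

With plain tags (no entries in `parents`) this degenerates to the two-level
graph described above; the extra edges are what give it real depth.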

Finding easy fast ways to navigate graphs in various UXs (shell, file
explorers, etc) is an interesting challenge. Deletion is tricky.

------
kpil
I think this is fundamentally NOT how humans remember things - I think we are
masters in "geospatial" memory compared to abstract unrelated concepts, and
"geospatial" memory is probably organised in hierarchies.

Worse, it's next to impossible to efficiently explore a large tag-cloud
compared to a hierarchical structure, which means it's much harder to _learn_
about things organised in a tag-cloud compared to hierarchical tree, or a
graph that is mostly tree-like.

As an example; it totally breaks the xkcd techsupport cheat sheet - you end up
in the "click one at random" branch basically all the time.
[https://xkcd.com/627/](https://xkcd.com/627/)

Obviously, tagging things is good too, but file systems and computers should
emphasise the tree (a better one than the Windows file structure, though)
rather than inventing a confusing cloud/fog of unrelated things.

------
gumby
The pain with tags is the overhead of generating them and semantic drift.
Likely the best solution is simply search (some smart semantic search) with ad
hoc tags to help.

------
irundebian
Just use tagspaces
([https://github.com/tagspaces/tagspaces](https://github.com/tagspaces/tagspaces)).

------
slx26
I believe many developers and designers have been annoyed by hierarchies in
filesystems.

But I wanted to comment on this:

> there is a mismatch between the narrowness of hierarchies and the rich
> structure of human knowledge

Absolutely true, this is exactly what I think annoys us more than anything, it
shows us how limited hierarchies are. But at the same time, I think it's very
relevant to keep in mind that our knowledge and the mental relationships we
can find between ideas are very hard to make explicit and complete, like you
would ideally want in a tag-based filesystem. I feel serious tagging, if
manually defined, is quite expensive if we want it to be really useful
(surely, we can also consider complementary automatic tagging, like AI).
Hierarchies instead, might not be very expressive, but they are very simple to
use in "most" cases. So I would say we're still far from getting the best of
both worlds. The problem to "solve" is information organization/structuring,
and not even humans handle that ideally (we are more like, faulty, search
engines with random inputs, prone to forget XD).

About the other ideas, I think they are all interesting, agree a lot with
hashes usage and no-filenames, not so convinced about metadata, but haven't
really thought enough about it. I don't think we can talk about the ideas
fitting cohesively or not yet (but hey, I don't even think links in HFS are
cohesive from any perspective), we would have to see more formal proposals for
implementation and interface. This said, I hope we see more work along these
lines in the future, it's a very worthy field to explore! Maybe start small,
testing some of the ideas, we get a lot of design insight when we are working
on the implementation.

------
b0rsuk
My pet peeve with tagging systems in general, but especially community-based
tagging is _false negatives_. If I search using a tag, there's no guarantee at
all that it will display ALL the items that qualify for it.

I'll use my favorite porn site as an example, without going into any specifics
and especially linking. I just skimmed the HN Guidelines and I don't think I'm
breaking any.

Suppose I try the tag #bigtits. It is highly unlikely that I will get all the
pictures with women who have especially large breasts. It's because no one
will review all the images and verify if the tag #bigtits applies to them.
That would be very time-consuming even for the most motivated individual who
uses both hands for typing. So if I were into that particular fetish, I would
need to try #bigtits, then #busty, then #nicerack, #slimandbusty, #ygwbt...
because each tag has its proponents, and there's definitely overlap between
them. You _could_ \- and I've seen non-porn sites doing that - use a program
for automatic tagging, but then in my opinion you are defeating the purpose of
tagging, which is grouping things by interesting categories. Machine-generated
tags tend to be lifeless.

As I've said it is a pet peeve of mine, and I will likely start a project or
two to implement my fixes for a web framework or a static blog generator. I
mean that I should have confidence that a tag has been considered for all
content in the collection. Program-assisted tags can help, such as keeping
track of what tags existed at the point when a picture was added.
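
Something like the sketch below is what I have in mind: each item remembers
which tags existed when it was added, so a search can report both the hits and
the items that were never checked against that tag. `Collection` and its
methods are hypothetical, and treating "added" as "reviewed against every tag
known at that moment" is an assumption.

```python
class Collection:
    """Tag store that knows which tags each item has been considered for."""

    def __init__(self):
        self.vocabulary = set()
        self.applied = {}     # item -> tags actually applied
        self.considered = {}  # item -> tags that existed when it was added

    def add_item(self, item, tags):
        self.vocabulary |= set(tags)
        self.applied[item] = set(tags)
        # Assumption: adding an item counts as reviewing it against the
        # whole vocabulary as it stands right now.
        self.considered[item] = set(self.vocabulary)

    def search(self, tag):
        """Return (items with the tag, items never considered for the tag)."""
        hits = sorted(i for i, t in self.applied.items() if tag in t)
        unknown = sorted(i for i, c in self.considered.items() if tag not in c)
        return hits, unknown


c = Collection()
c.add_item("p1", {"cat"})
c.add_item("p2", {"dog"})          # "dog" did not exist when p1 was added
c.add_item("p3", {"cat", "dog"})
print(c.search("dog"))
print(c.search("cat"))
```

An empty second list is the guarantee I want: every item has been looked at
with that tag in mind, so the hits are the complete set.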

Then there are almost identical tags. #cat vs #cats, #tortoise vs #turtle,
#color vs #colour.

Overall, in practice, I think tagging, as usually implemented, is the most
overrated feature of the Web 2.0 era.

------
xtiansimon
I purchased the Sony Digital Paper system (DPT-RP1) and it has possibly the
most ill-conceived file system design possible. All files are stored in a flat
directory on the device (i.e. one long list).

Users on the Sony community site are frequently looking for updates to the
software. I'm curious which file organization solution, hierarchy or tags,
would be easier to implement?

------
fouc
I'm a big believer in tags. I tried to make a tag-based tool that merely
relies on directory names as tags, so ~/t/.tag1/.tag2/tag3/your-file-or-
directory and moves your files around so that the tag directories are always
organized by tag counts.

I had the idea there would be a series of tmv tcd tls commands that work with
the tag directory structure.

[https://github.com/foucist/tagmv](https://github.com/foucist/tagmv)

Warning - The regexp I'm using is likely broken, I suspect directories with
.git/ or other dot-directories in them cause issues. It sometimes causes the
.git/ innards to be moved out into the project directory or something like
that. I never got around to fixing it.
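
For anyone curious, the ordering idea boils down to something like this
sketch. It is not the tagmv code (that is Ruby and works on real
directories); `tag_paths` is a made-up function, and breaking count ties
alphabetically is an assumption.

```python
from collections import Counter
from pathlib import PurePosixPath


def tag_paths(file_tags, root="t"):
    """Place each file under its tags, most frequently used tag first."""
    counts = Counter(tag for tags in file_tags.values() for tag in tags)
    paths = {}
    for name, tags in file_tags.items():
        # Highest count first; ties broken alphabetically so paths are stable.
        ordered = sorted(tags, key=lambda t: (-counts[t], t))
        paths[name] = PurePosixPath(root, *ordered, name)
    return paths


file_tags = {
    "a.txt": {"work", "2018"},
    "b.txt": {"work"},
    "c.txt": {"work", "tax", "2018"},
}
for name, path in sorted(tag_paths(file_tags).items()):
    print(name, "->", path)
```

Because the order depends on global counts, adding a tag to one file can
change where other files belong, which is why the tool has to move things
around.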

------
Sephr
OFTN OSWG started work on a tag-based filesystem called TPFS[1] back in 2011.
The Haskell source code might be useful to anyone interested in developing
platforms that use tag-based filesystems.

[1]: [https://github.com/oftn-oswg/TPFS](https://github.com/oftn-oswg/TPFS)

------
lkj
As I recently inherited a huge, well-tagged music collection (tens of
terabytes of files) I am very interested in this. Is there something like it,
also supporting .cue files and also storing the original filenames and
structure? A mediaplayer agnostic way to access this treasure trove would be
the best.

~~~
Nadya
I highly highly highly highly recommend beets [0].

[0] [http://beets.io/](http://beets.io/)

~~~
lkj
Wow! Thank you.

------
catenate
Since 1 June 2012, I've been taking notes in unicode text files, which contain
(occasional or adjacent) lines starting with 'nb ' and then a list of tags. I
wrote a simple tool ("nb") in Inferno's shell (thanks to Robert J. Ennis for
the port to Plan 9's rc), to (1) search for given keywords in per-directory
index files pointed to by the global index, (2) index all of the nb lines in
files in the current directory, and (3) if necessary, append, to a global
index file, a reference to the index file in the current directory.

[https://github.com/catenate/notabene](https://github.com/catenate/notabene)

I've found that I'm comfortable with the eventual consistency this offers, in
exchange for fast lookups when I want something (as opposed to indexing first,
and/or indexing globally, and so waiting for indexing to get a result). This
distributed-file approach also allows me to add tags to a variety of files:
local files, or networked file-system files, or sshfs-mounted files, or
Dropboxed files, or files under version control, or files with varying text
formats; and find tags across all of them and across all the time I've been
indexing.

It runs in linear time with respect to the number of tags I've entered, plus
the time to read and process the global index, so obviously there are many
ways I could improve the time performance (as an easy example, I could permute
the index to list all the tags in alphabetical order, and next to each tag
list the files that contain that tag).

I also wrote other tools, since the layout is so simple: for example, "nbdoc",
to catenate the actual contents of the references returned by the primary tool
(nb); and "so" (second-order), to return all the tags which appear in any nb
line with the given tag(s).
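
To illustrate the layout (the real tools are shell scripts, so this Python is
only a sketch with invented function names), here is the 'nb ' line scan, the
permuted tag-to-files index mentioned above, and a second-order lookup. It
assumes tags are whitespace-separated and that a second-order match requires
all given tags on the same line.

```python
from collections import defaultdict


def nb_lines(text):
    """Yield the tag list of every line that starts with 'nb '."""
    for line in text.splitlines():
        if line.startswith("nb "):
            yield line[3:].split()


def build_index(files):
    """Map each tag, alphabetically, to the sorted files that mention it."""
    index = defaultdict(set)
    for name, text in files.items():
        for tags in nb_lines(text):
            for tag in tags:
                index[tag].add(name)
    return {tag: sorted(names) for tag, names in sorted(index.items())}


def second_order(files, wanted):
    """Tags that share an nb line with every tag in `wanted`."""
    wanted = set(wanted)
    related = set()
    for text in files.values():
        for tags in nb_lines(text):
            if wanted <= set(tags):
                related |= set(tags) - wanted
    return sorted(related)


files = {
    "a.txt": "nb plan9 shell\nsome note\n",
    "b.txt": "text\nnb shell inferno index\n",
}
print(build_index(files))
print(second_order(files, ["shell"]))
```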

I've also found that it's not easy for me to remember what tags I might have
used in the past, or how I was thinking about something, so I try to use the
conjunction of several tags to narrow down search results, rather than try to
remember one specific tag (this seems to correspond to the observation that it
can be difficult to remember exactly where in a hierarchy you put something).

The modular approach, of per-directory indexes referenced in a global file,
also makes it easy for me to combine work-specific notes, with public notes,
with private notes, all in the same global index file, at work; but only have
the same public and private notes at home.

------
nradov
I wonder if Microsoft would be willing to take another shot at WinFS? It would
have met most of the requirements. But the project bogged down and never
shipped.

[https://en.wikipedia.org/wiki/WinFS](https://en.wikipedia.org/wiki/WinFS)

------
pronoiac
git-annex might have some interesting thoughts about this: [https://git-
annex.branchable.com/tips/metadata_driven_views/](https://git-
annex.branchable.com/tips/metadata_driven_views/)

------
sova
Have you seen "Collaborative Creation of Communal Hierarchical Taxonomies in
Social Tagging Systems" ? It is very relevant and if tags could be accurately
machine generated we could derive organizational hierarchies from just tags.

------
flukus
> When people realize they need to classify a file in more than one way, they
> will start to use shortcuts/links to try to solve the problem. (Windows has
> shortcuts, Unix has soft/symbolic links, and Mac has aliases. This is a
> ubiquitous feature, but transferring shortcuts across existing platforms is
> very hard.) This sounds like a reasonable solution, but will face trouble in
> all but the simplest use cases.

They rule out this solution because it's not perfect, but surely if the idea
had real merit this would be a serviceable test bed?

I think we have hierarchies because it's human nature to create hierarchies to
make sense of the world, and we try to force them into places where they don't
make sense. We see it in biology, we see it in organisations and we see it in code,
I'm sure most of us here have worked with examples of OO hierarchies that made
no sense.

~~~
nayuki
Hi. Have you tried managing a collection of shortcuts? How do you deal with
recategorization, removing files, renaming files, moving files to different
storage devices, etc.?

The article has a whole section illustrating why both hardlinks and softlinks
won't work in the general case. [https://www.nayuki.io/page/designing-better-
file-organizatio...](https://www.nayuki.io/page/designing-better-file-
organization-around-tags-not-hierarchies#hard-and-soft-links-as-non-solutions)

~~~
flukus
> The article has a whole section illustrating why both hardlinks and
> softlinks won't work in the general case.

Yes, that's the part I quoted.

> Have you tried managing a collection of shortcuts?

No, tagging is the bottleneck for me, which is why I don't think it will ever
be useful.

> How do you deal with recategorization, removing files, renaming files,
> moving files to different storage devices, etc.?

Aside from moving to different storage device, you can make a minimum viable
product with a few shell scripts, one to tag a file, one to untag, one to
search by tag(s), one to listen for events (move, delete). For bonus
points you could do some auto tagging from meta data.
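
As a rough sketch of how small that could be (in Python rather than shell,
purely for readability): `TagStore`, the JSON file format, and the
`on_move`/`on_delete` hooks are all assumptions, and wiring the hooks to real
filesystem events is left out.

```python
import json
import tempfile
from pathlib import Path


class TagStore:
    """Path -> tags mapping persisted as JSON (hypothetical format)."""

    def __init__(self, db_path):
        self.db_path = Path(db_path)
        self.tags = {}
        if self.db_path.exists():
            raw = json.loads(self.db_path.read_text())
            self.tags = {path: set(tags) for path, tags in raw.items()}

    def _save(self):
        data = {path: sorted(tags) for path, tags in self.tags.items()}
        self.db_path.write_text(json.dumps(data))

    def tag(self, path, tag):
        self.tags.setdefault(str(path), set()).add(tag)
        self._save()

    def untag(self, path, tag):
        self.tags.get(str(path), set()).discard(tag)
        self._save()

    def search(self, *tags):
        """Paths that carry every one of the given tags."""
        return sorted(p for p, t in self.tags.items() if set(tags) <= t)

    def on_move(self, old, new):
        # Would be called by whatever watches the filesystem for renames.
        if str(old) in self.tags:
            self.tags[str(new)] = self.tags.pop(str(old))
            self._save()

    def on_delete(self, path):
        self.tags.pop(str(path), None)
        self._save()


with tempfile.TemporaryDirectory() as tmp:
    store = TagStore(Path(tmp) / "tags.json")
    store.tag("docs/return.pdf", "tax")
    store.tag("docs/return.pdf", "2018")
    store.tag("pics/beach.jpg", "2018")
    store.on_move("docs/return.pdf", "archive/return.pdf")
    both = store.search("tax", "2018")
    store.on_delete("archive/return.pdf")
    reloaded = TagStore(Path(tmp) / "tags.json").search("2018")
print(both, reloaded)
```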

A working implementation (even with limitations) would be a lot more
convincing than all this theory.

------
sancha_
I wrote a paper about the same topic while in Uni about 15 years ago, and also
developed a proof of concept 'filesystem' with a file explorer that uses
tags. Too bad it isn't the standard in any OS yet.

~~~
nayuki
Would you care to share the name or a link to your paper?

~~~
sancha_
I would need to dig it up from my old harddrive, if it still exists. To be
honest not sure it still exists, as I said it was a long time ago.

------
nwmcsween
I have thought about this a bit and I think if similarity hashes (probably LSH
forest) were used, automatic tagging could occur with a preset of hashes.
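
A toy version of the idea, to be clear about what I mean. This swaps in a
plain SimHash with a linear nearest-neighbour scan instead of an LSH forest,
works on whitespace-separated text tokens only, and every name and the
`max_distance` threshold are assumptions for illustration.

```python
import hashlib


def simhash(text, bits=64):
    """Fingerprint where similar token sets tend to differ in few bits."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def suggest_tags(text, presets, max_distance=16):
    """Copy the tags of the closest preset fingerprint, if it is close enough.

    `presets` is a list of (fingerprint, tags) pairs prepared in advance.
    """
    fp = simhash(text)
    best = min(presets, key=lambda p: hamming(fp, p[0]), default=None)
    if best is None or hamming(fp, best[0]) > max_distance:
        return set()
    return best[1]


presets = [
    (simhash("quarterly tax invoice 2018"), {"tax"}),
    (simhash("holiday beach photo"), {"holiday"}),
]
print(suggest_tags("quarterly tax invoice 2018", presets))
```

A real system would need a better feature extractor than word splitting and
an index (such as the LSH forest) so lookups do not scan every preset.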

------
ausjke
for simple use cases, folder beats everything else.

when you have a large number of files in deep paths then tags should be the
way to go, you will need a database to manage it for portability across OSes etc.

with tags we need to isolate how-to-store-the-files from
how-to-organize-them-for-easy-access, tags can be used to build a virtual
folder hierarchy for example.
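
a rough sketch of such a virtual hierarchy: the path is just a list of tags,
and listing a "folder" gives the matching files plus the tags that could
narrow it further. `list_virtual_dir` and the sample data are made up; where
the files physically live is deliberately not part of it.

```python
def list_virtual_dir(file_tags, path):
    """List a virtual folder whose path components are tags.

    Returns (subfolders, files): files carry every tag in `path`, and
    subfolders are the remaining tags found on those files.
    """
    wanted = set(path)
    files = sorted(f for f, tags in file_tags.items() if wanted <= tags)
    subdirs = sorted({t for f in files for t in file_tags[f]} - wanted)
    return subdirs, files


# Storage is separate: these names could be hashes or database ids.
file_tags = {
    "a.pdf": {"tax", "2018"},
    "b.pdf": {"tax", "2017"},
    "c.jpg": {"2018", "holiday"},
}
print(list_virtual_dir(file_tags, []))
print(list_virtual_dir(file_tags, ["tax"]))
print(list_virtual_dir(file_tags, ["tax", "2018"]))
```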

~~~
ausjke
just curious, why the downvote? I wish I could see who downvoted, is there a
way to check?

~~~
Retra
What would you do with the names?

~~~
chadcmulligan
make an enemies of ausjke list probably

~~~
ausjke
nope, just ask why? you're what you think I guess, I have not made any lists
in my whole life and surprised you immediately guessed that.

~~~
Retra
If they wanted to say why, they would have. You've got no need to go around
hassling people for information. You could be down-voted a trillion times and
it wouldn't matter for shit in the world, so get over it.

------
_pmf_
ReiserFS flashbacks.

------
_ZeD_
You can pry HFS from my dead hands!

seriously, graphs are hard, and the possibility to lose data is serious.

~~~
_ZeD_
just to clarify: not lose data in the hardware sense, but in the cognitive
one.

In a tag-based fs I'm sure I will find a way to "lose" files.

