Hierarchical File Systems are Dead (harvard.edu)
37 points by signa11 on Nov 14, 2010 | 54 comments

So, yes, ontology tends to break down. Are we going to throw out all existing filesystems, though? Heck no. Fortunately, it's not hard to layer search and non-hierarchical indexing on top of existing filesystems.

I wrote a simple (Lua + SQLite) script that tags files, searches all known paths by tag, etc. It's really not hard. It hasn't been hard for decades. Why hasn't it caught on? Beats me.
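For anyone curious what such a script amounts to, here is a minimal sketch in Python with `sqlite3` (not the Lua script mentioned above). The table layout, function names, and file paths are all invented for illustration.

```python
import sqlite3


def open_index(db_path=":memory:"):
    """Open the tag index; the one-table schema is an assumption."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tags ("
        " path TEXT NOT NULL, tag TEXT NOT NULL,"
        " PRIMARY KEY (path, tag))"
    )
    return conn


def tag_file(conn, path, *tags):
    """Attach any number of tags to a path; repeats are ignored."""
    conn.executemany(
        "INSERT OR IGNORE INTO tags (path, tag) VALUES (?, ?)",
        [(path, t) for t in tags],
    )
    conn.commit()


def find_by_tags(conn, *tags):
    """Return the sorted paths that carry every one of the given tags."""
    wanted = sorted(set(tags))
    if not wanted:
        return []
    placeholders = ",".join("?" for _ in wanted)
    rows = conn.execute(
        f"SELECT path FROM tags WHERE tag IN ({placeholders})"
        " GROUP BY path HAVING COUNT(DISTINCT tag) = ? ORDER BY path",
        (*wanted, len(wanted)),
    )
    return [row[0] for row in rows]


conn = open_index()
tag_file(conn, "/home/me/video/concert.mkv", "music", "video", "hd")
tag_file(conn, "/home/me/video/film.mkv", "movie", "video", "hd")
tag_file(conn, "/home/me/music/song.mp3", "music")

print(find_by_tags(conn, "video", "hd"))
print(find_by_tags(conn, "music"))
```

The index only knows what it was told: if a file is moved or deleted behind its back, the stored path goes stale.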

Using directories as contexts + symlinks to merge them works around many issues, but the same ugly breakdown appears in e.g. statically-typed OOP class hierarchies. ("Is a platypus a mammal or an amphibian?") Sometimes, it isn't.

Basically, I see where they're coming from, but "normal filesystem, with tags" probably makes more sense to most people than "fully non-hierarchical* filesystem". And, hell, I keep spelling hierarchical wrong.

It is not hard, but it hasn't caught on because having userspace scripts or daemons run tagging and indexing passes over the files in an existing file system is nowhere near transparent. Even if you use inotify or something similar to keep track of the changes in real time, the illusion of transparency will leak sooner rather than later. And that's when your users' confidence and trust will vanish into thin air.

Indexing has to be built into the filesystem to make the searches match the state of the filesystem one-to-one. There's no other way.

I've seen this working live only in the Be Filesystem, or BFS. There you can be sure that if you add an image somewhere, another view of the filesystem will instantly contain the new file, and removing it will again banish the image from all other views. The file either exists or doesn't, and you will never notice a search returning information that no longer exists.

Spotlight fulfills all those requirements.

They address the idea of building a non-hierarchical system on top of a hierarchical one. They argue that doing so is bad because it adds a layer of indirection to your data, and they argue that it's unnecessary since their non-hierarchical system can in fact implement a POSIX-compatible layer easily (so they say).

Interestingly, the original Macintosh File System (MFS, the predecessor to HFS which was the predecessor to the current HFS+) was a non-hierarchical filesystem (that's why HFS has the H). The original Macintosh system software supported hierarchical directories, of course - by maintaining a database mapping file-system nodes to paths. When the user opened a particular folder, it queried the database for nodes whose path matched.

Of course, no database is perfect, and sometimes things got messed up. There was a special key chord the user could hold down during startup that would cause the Finder to rebuild all its databases. When done on an MFS volume, all the hierarchy painstakingly created by the user would be erased, and all the files on a volume would wind up in the root directory.

For anyone who doesn't know, Margo is the co-inventor of BerkeleyDB and sold her company, Sleepycat Software, to Oracle. She's quite knowledgeable as an academic, a programmer, and as a businesswoman.

I think hierarchies are one of the most common ways to organize data, and most systems in this world are organized hierarchically. Governments, socioeconomic systems, taxonomies, literature, organized religion, and nature are all often organized hierarchically. Humans seem to think in hierarchies better than in many other systems of organization. Sure, search is great, but even sites like Newegg and Amazon prominently display hierarchies to organize data, and I appreciate it. I don't think hierarchical file systems are dead at all. I think they are a very good way of organizing most file system data, and that search can be of utility even in a hierarchical file system.

Every computer system that uses tagging is a counterexample. We humans think in tagging as often as we think in hierarchies, but I agree that we tend to think of hierarchies as more well-defined and organized. This tendency can be incorrect.

I think it boils down to this: for hierarchical data, use hierarchies. I believe that files on a modern personal computer will almost certainly not be hierarchical, so it's not ideal to store them hierarchically. Videos, for example, may be movies, tv shows, screencasts, music videos, etc. Videos may also be standard definition, 720p, 1080p, etc. It's difficult to say, for example, which attribute should be the root directory for your videos, since it's reasonable to want to browse by many different attributes. A hierarchy does not apply to videos, but attributes could easily be represented by tags.

I agree. Maybe we need a hierarchical system that is dynamic. We need to organize information and in the real world we are constrained, but with computers we are not.

E.g. if I need to organize my physical books, each one can sit in only one position. I could add labels to my books like "history" or the author's initials, but I can only group them one way at a time, like all my science papers in one place. If I want to group all the books that talk about sex or crime, I have to destroy the other grouping.

With computers you could create multiple directory trees that contain links instead of data, so I don't need to duplicate the data for each new tree. The tech is there (inodes).

Imagine you study countries by population, so you create a hierarchy: "most populated", "less populated", "not populated at all". Then inside "most populated" you have "the most populated", "the less populated", and so on.

Then you could have another directory tree organized by area in square kilometers, another by the coordinates of their capitals, and so on.

I think the most common way to organize data these days is Google.

And Google is nowhere near hierarchical.

What's a filesystem? -sent from my iPhone

It's a scary monster living outside your walled garden.

It's a way of representing information in a local context, which tends to become necessary as you have large amounts of info to track. Much like a massive crowd makes more sense as several circles of friends showing up at the same place, rather than a random mob.

-sent from your iPhone (you should lock it)

The joke would make more sense had you sent it from your Newton or your Palm Centro ;-)

Anybody remember WinFS? It was really promising technology back when it was announced, but Microsoft failed to deliver it, along with much of the other Vista stuff.


That's one impressive piece of vaporware... The first paragraph suggests it's coming in Windows 8...

First thing I thought of, too.

Escaping file naming hierarchies was the original motivation of Namesys aka ReiserFS[1].

1- http://reference.kfupm.edu.sa/content/n/a/the_naming_system_...

I saw this awareness growing among enterprise search customers and prospects from the early 2000s on. I was part of Vivisimo (a spin-off from CMU), which created a "clustering engine". In the beginning we always had the "Librarians", who were pro categorization and classification and dismissed the value of clustering (or maybe they were justifying their jobs). Then more people in the organizations (government or corporate) said yes, those two approaches (automatic "on the fly" clustering vs. human-generated taxonomies) are complementary. Finally, a couple of years ago, in the same week, I visited the Government of Israel, which said "we just need clustering" (e.g. on the left side: http://search1.gov.il/govilt?query=israel), and then Ferrari competition in Maranello (yeah :) even if I don't like cars ;). There, to my surprise, their CTO explained (I was trying to tell him "we integrate well with your existing taxonomies, ontologies..."): "forget about any hierarchy or classification tree in our file systems, we just search!" Reality and information are much more diverse and evolving than any predefined categories.

You can combine the two approaches, with "just search" which also uses human "categorisation and classification". This helps an awful lot with relevancy ranking.

In smaller data sets (eg not Google scale), where all the items are about roughly the same thing, you need some human classification for your search to work well.

Yes, you are right. I was a big Friendfeed fan and proposed ideas which we implemented: humans could modify the search index on the fly with ratings, comments and tagging of the search results. E.g. people were able to search documents based on other people's comments or ratings.

I think the lesson from "web 2.0" apps is that if you make something easy to use and useful, people will use it. So the rise of folksonomies is about usability, not doing away with structure.

In my experience, if you can make entering more structured metadata just as easy, people will enter it, and you get a big return in the ability to use the information you've collected.

I've spent the last 3.5 years building a platform for "information applications". The key observation which prompted this was that hierarchical file systems didn't work well for organising information within an organisation.

However, hierarchy itself is still incredibly valuable. People think in terms of hierarchies - it's just that they think in terms of multiple hierarchies and an item will almost always belong in more than one place in those hierarchies.

If you allow users to describe items in the way which makes sense to them, and then search and browse by any of the terms they've used, then you've eliminated almost all the frustrations of a file system. In my experience of working with people building complex information applications, you need:

  * deep hierarchy for classifying things
  * shallow hierarchy for noting relationships (eg "parent company")
  * multi-values for every single field
  * controlled values (in our case by linking to other items wherever possible)

Unfortunately, none of this stuff is done well by existing database systems. Which was annoying, because I had to write an object store.

Whenever I hear these proclamations, I always wonder what a 'source tree' would look like in a non-hierarchical file system.

A non hierarchical file system is not a file system in which hierarchies cannot be represented. A source tree could be represented by using relative paths within the project as tags.

Obviously, there are some issues that would have to be resolved. For instance, you would have to have some kind of uniqueness notion for tags in order to simulate current hierarchical file systems, or two paths in the traditional sense might be the same. Since not all tags would have such uniqueness constraints, there would also have to be some kind of tag type (or property type), and at that point we're getting awfully close to what databases do.

There's no reason you can't have both on a system. Traditional hierarchical for the uses where it makes sense (eg systems software, application code and resources) and this alternative for user data.

Your source wouldn't need to be stored in a tree, necessarily. Files in your source code could be tagged with the module name they belong to.

True, but what about when you have multiple copies of the same source tree? (though I guess this could be less of an issue if everything was using a dvcs)

Git e.g. solves that by addressing via content hash. Which is more or less inevitable for non-hierarchical storage, since that's one of the few ways you can disambiguate.
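A rough sketch of what addressing by content hash means in practice, in Python. The blob ID is computed the way Git computes it for file contents (SHA-1 over a `blob <size>\0` header plus the bytes); the `ContentStore` class is a toy invented for illustration, not Git's actual object store.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """SHA-1 of 'blob <size>\\0' followed by the content, as Git does for blobs."""
    header = f"blob {len(data)}\0".encode("ascii")
    return hashlib.sha1(header + data).hexdigest()


class ContentStore:
    """Toy in-memory store keyed purely by content hash (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = git_blob_id(data)
        self._objects[key] = data  # identical content lands on the same key
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self):
        return len(self._objects)


store = ContentStore()
a = store.put(b"int main(void) { return 0; }\n")
b = store.put(b"int main(void) { return 0; }\n")  # file from a second copy of the tree
c = store.put(b"int main(void) { return 1; }\n")  # an edited version

print(a == b, a == c, len(store))  # prints: True False 2
```

Two copies of the same source tree collapse to the same objects, and only files that actually differ get a new identity.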

The thing is, wood trees don't share (or trade) leaves, but abstract trees can.

No, they can't share leaves. Trees are acyclic by definition - no leaf may have more than one parent.

Present-day hierarchical file systems are digraphs, not trees.

Yes, digraphs are much more flexible. Present-day wood trees -really, really cannot- share leaves.

(snark warning) I pointed out years ago that hierarchical file systems would eventually be unnecessary on Windows, since the clear direction from Microsoft at that time was to store every file in the entire system in \windows\system32, then manage it via an extremely deep and complex hierarchy in the registry... (end snark)

Fortunately, Windows has moved in a better direction since then, with user home directories and most user data stored effectively underneath them, more organization to the Windows directory, and so on.

Without having actually read the paper yet (still downloading), I always feel like these sorts of things are nice, but you still want a hierarchy underneath it all.

For example, I have a document which I want to put onto a USB stick for a friend. I have no idea how their search metaphor lets me do this. I search for the document, but now how do I "move" it? Does "moving" even make sense? Now I'm not doubting that they do have an answer for this, but is the answer something that will be easy to explain?

If you just start with hierarchy, and put a working search over the top of it, then it has a nice model that is easy to explain. All files have an address (like a URL), but if you don't know (or can't be bothered to type) the address, then you use the search box (like google).

I would imagine that you could mark the intended storage destination(s) in the file's tags.

hmm. first thing that popped in my mind: I'd say you don't "move" files. You can only "give" them.

does that make sense?

Hierarchical filesystems have an inherent sense of place, but non-HFS are about meaning: "This has to do with these ideas." That doesn't translate to storage, though - if it has to take up space, where does it go?

This is why I think non-HFS ideas work better as indexes on top of existing systems. We don't have the right metaphors yet.

I'd think the opposite; if you were combining hfs and non-hfs at a visible level, the way to do it would be to have a non-hierarchical storage system underneath with superficial hierarchical (and other) views. This would give you the flexibility to change your schema without breaking file paths while still being able to organize and name things sanely.

As far as metaphors, I'd say think of calling someone on a landline vs. a cell phone. With the cell, you don't have to specify where or who joe is (/people/friends/joe/house|work|carphone|vacation|etc.) in order to reach him since joe's 1 cellphone number rings wherever he happens to be.

Another example is folders on the iphone -- moving an app into a folder with other apps doesn't actually move the app from /homescreen/app1 to /homescreen/games/app1 on the iphone hard disk (as far as I know...), it just changes the superficial hierarchical view of the data.

I want to know more about how their "thin POSIX layer" would actually provide backwards compatibility. How would you get a directory listing: look at every single file's POSIX tag to see what its path is?

It seems like you could fake tagging in any hierarchical file system by using directories for tag names and hard links to files (not symbolic links).

A hard link basically gives multiple paths to the same data, sharing attributes like the owner and permissions. Using any linked path will change the file. The exception is that "rm" deletes just one link (hence the system call "unlink"), and the file is only lost when all its links are gone.
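A small sketch of the hard-link trick in Python, run inside a temporary directory. The `tags/<name>/` layout and the helper name are invented for illustration. Two caveats: hard links cannot cross filesystems, and programs that save by writing a new file and renaming it over the old one will silently break the link.

```python
import os
import tempfile


def tag_with_hardlink(root, file_path, tag):
    """Expose file_path under root/tags/<tag>/ via a hard link (hypothetical layout)."""
    tag_dir = os.path.join(root, "tags", tag)
    os.makedirs(tag_dir, exist_ok=True)
    link_path = os.path.join(tag_dir, os.path.basename(file_path))
    if not os.path.exists(link_path):
        os.link(file_path, link_path)  # second name for the same inode
    return link_path


root = tempfile.mkdtemp()
original = os.path.join(root, "concert.txt")
with open(original, "w") as f:
    f.write("v1")

music = tag_with_hardlink(root, original, "music")
video = tag_with_hardlink(root, original, "video")

# Writing through one name is visible through the others.
with open(music, "a") as f:
    f.write("+edit")
with open(video) as f:
    content_via_video = f.read()

link_count = os.stat(original).st_nlink  # original + two tag links

# Removing one name only unlinks it; the data survives under the other names.
os.unlink(original)
still_there = os.path.exists(video)

print(content_via_video, link_count, still_there)
```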

I was thinking about this a while ago when I was writing my own personal finance software. Rather than embedding any notion of hierarchy, I allow all transactions (regardless of the source) to have an arbitrary number of tags. Hierarchy is simply the conjunction of sets of tags. I can easily re-tag transactions and rollback tagging schemes. Seems to work well for all of the processing I want to do.
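To make "hierarchy is the conjunction of sets of tags" concrete, here is a minimal sketch in Python. The `Transaction` class, the tag names, and the amounts are made up for illustration and are not the commenter's actual software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    description: str
    cents: int          # integer cents avoid floating-point rounding
    tags: frozenset


def select(transactions, *required_tags):
    """Keep transactions carrying all required tags (a conjunction)."""
    required = frozenset(required_tags)
    return [t for t in transactions if required <= t.tags]


def total_cents(transactions, *required_tags):
    return sum(t.cents for t in select(transactions, *required_tags))


ledger = [
    Transaction("groceries", -4250, frozenset({"expense", "food", "card"})),
    Transaction("restaurant", -3100, frozenset({"expense", "food", "cash"})),
    Transaction("train pass", -8000, frozenset({"expense", "transport", "card"})),
    Transaction("salary", 250000, frozenset({"income"})),
]

# "expense/food" and "expense/card" are just two different conjunctions;
# neither has to be the one true tree.
print(total_cents(ledger, "expense", "food"))  # prints: -7350
print(total_cents(ledger, "expense", "card"))  # prints: -12250
```

Any ordering of the tags gives the same set, so the same data can be browsed as expense/food/card or card/expense/food without restructuring anything.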

I'm not a security expert, but surely there are security implications for a non-hierarchical file system.

one of the issues with non-hierarchical filesystems is that the idea of tagging stuff seems like too much effort. for this to appeal to me, there'd have to be some big shift in the GUIs for it to be easy to tag/group files

And that says nothing about non-GUI operations...

I thought Harvard professors were above trolling in paper titles.

Do you have a real-life use scenario for this DB-FS paradigm?

Funny that their paper is organized as a Hierarchical File System:

    1 Introduction
    2 The Hierarchical Namespace Albatross
        2.1 Irrelevance
        2.2 Restrictiveness
        2.3 Performance Limiting
    3 hFAD: A New File System Architecture
        3.1 API
            3.1.1 Naming Interfaces
            3.1.2 Access Interfaces
        3.2 Index Stores
        3.3 OSD Layer
        3.4 Implementation
    4 Open Questions
    5 Conclusions

A paper is not a generic facility for data storage. Text meant for human reading is to some degree sequential. You cannot just change the order of words, paragraphs or sections in it. It would be a different text. But that doesn't mean files in a directory should have inherent order.

Also, they say: "Note that we are not condemning hierarchy in general; indeed, hierarchy can be useful in a variety of situations. Rather, we are arguing against canonizing any particular hierarchy"

Well, they didn't say hierarchy was dead. Corporations, governments and militaries would beg to differ!

But they are academics. Like most people familiar with computers and their technical usage, I also tend to organize things in a hierarchical manner. But I cannot ignore that most people around me do not: they just put everything on the desktop and in one or two other folders.

I keep things fairly well organized on my file server, but I've definitely encountered times where my folder hierarchy is insufficient. Wanting to get away from hierarchies isn't just for academics and the disorganized. Here's two examples:

I naturally have a Movies directory and a TV directory. Under each I have two directories SD and HD. Sometimes I may want to simply browse all the HD video on my system.

Where do music videos (i.e. live concerts) go? They're certainly Music, but it's fair to imagine wanting to see them listed when I'm browsing through Movies and not listed when I'm browsing my mp3's. Add in tutorial videos for music production software (I have a lot of these) and it's even more of a guessing game.

It would be cool to navigate by file type sometimes.

ie: have some \sound\mp3\all "virtual directory" automatically populated with all the mp3 files of my system.

Now that I think of it, it would be cool to be able to express such "virtual directories" by way of declarative filters.

eg: mkdir ./bigfiles ./ -r 'file.size > 100mo, subdir by file.year/file.type, file.type isnt tmp' -- ..., you get the picture
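A rough Python sketch of that idea: the filter and the grouping key are passed in as plain functions, and the "virtual directory" is just a mapping computed on demand. Nothing here is a real `mkdir` option; `virtual_dir`, `describe`, and the sample files are all hypothetical, and only size and type are handled.

```python
import os
import tempfile
from collections import defaultdict


def describe(path):
    """Collect the attributes the filters can refer to."""
    name = os.path.basename(path)
    return {
        "path": path,
        "size": os.path.getsize(path),
        "type": os.path.splitext(name)[1].lstrip(".").lower(),
    }


def virtual_dir(root, where, subdir_by):
    """Map each subdir name to the files under root that pass the filter."""
    view = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            info = describe(os.path.join(dirpath, name))
            if where(info):
                view[subdir_by(info)].append(os.path.relpath(info["path"], root))
    return {key: sorted(paths) for key, paths in view.items()}


# Build a small throwaway tree to run the filter against.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))
samples = [("a/big.mp3", 300), ("a/b/big.mkv", 500),
           ("a/small.mp3", 10), ("scratch.tmp", 900)]
for rel, size in samples:
    with open(os.path.join(root, *rel.split("/")), "wb") as f:
        f.write(b"x" * size)

# "size > 100 bytes, subdir by type, type isn't tmp"
bigfiles = virtual_dir(
    root,
    where=lambda f: f["size"] > 100 and f["type"] != "tmp",
    subdir_by=lambda f: f["type"],
)
print(bigfiles)
```

The view is recomputed by walking the tree each time, so it is always current but not cheap; a real implementation would presumably want the index maintained by the filesystem itself.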

It's almost like "Smart Folders" in spotlight ;)

(Yes, I get what you're saying - you want to impose a structure on that smart folder via the metadata. It was just too tempting to make a stupid comment ;)

Well, that's because our reading habits are linear.
