Database as Filesystem [video] (youtube.com)
175 points by enobrev 13 days ago | 58 comments





Building a POSIX-compatible file system on top of a database engine (Postgres, MariaDB) on raw partitions.

I could see advantages like being able to have indexes, relationships, run queries on files' metadata, import/export the whole tree or branches (folders), maybe easier managing of distributed file systems.

Aside from the performance overhead for the sake of experimentation, from my not-very-knowledgeable perspective, the idea seems to have enough interesting implications that I'm curious to see where it could go.

I also like other variations on this, like "file system as database", "Git repo as file system", and so on.


You can read up on the background and challenges here in Salman Niazi's PhD thesis - http://kth.diva-portal.org/smash/get/diva2:1260852/FULLTEXT0.... This is not a new topic - WinFS (part of the failed Longhorn project) was the best-known example, which tried to use SQL Server as the metadata store for a new Windows FS. Performance killed them.

The reason why we managed to get better performance compared to all previous attempts is three-fold:

1. we used an in-memory high performance DB (NDB) and we did not de-normalize our schemas (no caching!)

2. with normalized tables, we needed cross-partition transactions at high performance

3. we did a reasonable job at re-implementing HDFS' POSIX-like model using read-isolated transactions, row-level locking, and a new protocol for sub-tree operations.
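To make the "no de-normalization" point concrete, here is a minimal sketch of a normalized namespace table on SQLite. It is not the HopsFS schema: the table name, the column names, and the convention that the root has id 1 are all assumptions made for illustration. Each row stores only its parent and its own name, so no full path is cached anywhere, path resolution costs one indexed lookup per component, and a rename touches a single row.

```python
import sqlite3

ROOT_ID = 1  # assumption: the root directory is inode 1 and is its own parent


def make_namespace():
    """Create an in-memory namespace: one row per inode, no cached full paths."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE inodes ("
        " id INTEGER PRIMARY KEY,"
        " parent_id INTEGER NOT NULL,"
        " name TEXT NOT NULL,"
        " UNIQUE (parent_id, name))"
    )
    db.execute("INSERT INTO inodes VALUES (?, ?, '')", (ROOT_ID, ROOT_ID))
    return db


def _child(db, parent_id, name):
    """Return the id of `name` inside `parent_id`, or None if it is absent."""
    row = db.execute(
        "SELECT id FROM inodes WHERE parent_id = ? AND name = ?",
        (parent_id, name),
    ).fetchone()
    return row[0] if row else None


def resolve(db, path):
    """Walk the path one component at a time; return the inode id or None."""
    current = ROOT_ID
    for part in (p for p in path.split("/") if p):
        current = _child(db, current, part)
        if current is None:
            return None
    return current


def mkdirs(db, path):
    """Create every missing component of `path`; return the last inode id."""
    current = ROOT_ID
    for part in (p for p in path.split("/") if p):
        found = _child(db, current, part)
        if found is None:
            cur = db.execute(
                "INSERT INTO inodes (parent_id, name) VALUES (?, ?)",
                (current, part),
            )
            found = cur.lastrowid
        current = found
    return current


def rename(db, inode_id, new_parent_id, new_name):
    """Move an inode (and implicitly its whole subtree) by updating one row."""
    with db:
        db.execute(
            "UPDATE inodes SET parent_id = ?, name = ? WHERE id = ?",
            (new_parent_id, new_name, inode_id),
        )
```

Because children refer to their parent by id, moving `/user/alice` under `/home` makes `/home/alice/data` resolvable without rewriting any descendant rows. The trade-off is that every lookup walks the tree, which is why fast transactions across partitions matter in a distributed setting.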


Thank you for raising these relevant points. They led me down a fascinating rabbit hole on this topic of "database as file system". This is way over my head, but to pique others' curiosity, here's the abstract:

Recent improvements in both the performance and scalability of shared-nothing, transactional, in-memory NewSQL databases have reopened the research question of whether distributed metadata for hierarchical file systems can be managed using commodity databases.

In this paper, we introduce HopsFS, a next generation distribution of the Hadoop Distributed File System (HDFS) that replaces HDFS’ single node in-memory metadata service, with a distributed metadata service built on a NewSQL database.

By removing the metadata bottleneck, HopsFS enables an order of magnitude larger and higher throughput clusters compared to HDFS. Metadata capacity has been increased to at least 37 times HDFS’ capacity, and in experiments based on a workload trace from Spotify, we show that HopsFS supports 16 to 37 times the throughput of Apache HDFS. HopsFS also has lower latency for many concurrent clients, and no downtime during failover. Finally, as metadata is now stored in a commodity database, it can be safely extended and easily exported to external systems for online analysis and free-text search.


Have you published the workload trace from Spotify?

We are very interested in the results you got, and are looking for a way to replicate them using our setup.


The workload isn't public, but if you follow the same ratio of FS operations (60+% getBlockLocations (file stat), etc), you will get similar results. You can even assume that operations follow that distribution uniformly, although there will be bursts of different op types - in the paper, we added microbenchmarks for each FS operation to show their load on the system. That distribution is in the paper and Salman's thesis.
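A synthetic workload along those lines could be generated by sampling operation types from a fixed distribution. In the sketch below, only the roughly 60% share for getBlockLocations comes from the comment above; the other operation names and weights are placeholders that should be replaced with the distribution given in the paper.

```python
import random
from collections import Counter

# Only the ~60% getBlockLocations share is taken from the comment above.
# Every other name and weight here is a placeholder, not the paper's numbers.
OP_MIX = {
    "getBlockLocations": 0.60,
    "listStatus": 0.20,
    "create": 0.08,
    "mkdirs": 0.05,
    "delete": 0.04,
    "rename": 0.03,
}


def sample_ops(n, mix=OP_MIX, seed=0):
    """Draw n operation names independently according to the weights in `mix`."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def observed_mix(ops):
    """Return the fraction of each operation type in a generated workload."""
    counts = Counter(ops)
    return {name: counts[name] / len(ops) for name in counts}
```

Independent sampling like this gives the uniform mix mentioned above; it does not reproduce the bursts of particular operation types that a real trace would contain.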

> Performance killed them.

I've often wondered if there's a point at which something like WinFS would become practical simply through the perpetual advance of hardware. MS started working on the concept back in the early '90s (as part of Cairo: https://en.wikipedia.org/wiki/Cairo_(operating_system)), and eventually gave up on it in 2006, which was nearly 15 years ago now. Devices today have significantly more memory and compute power on tap than they did in 2006, and if you roll the comparison all the way back to 1991 we're on a completely different planet today. And there are entirely new types of hardware that could be leveraged now too, such as GPUs, solid-state disks, etc.

I have no idea if MS ever published their performance data, but if they did, it'd be interesting to use it to figure out what a theoretical machine that could run WinFS, as it was when MS gave up on it, fast enough to be useable would look like, and how close to that theoretical machine the machines of today are.


I guess if the next Windows included a novel FS, people would not care. After decades of experimenting (does anybody remember the fears of users when it was announced that Longhorn would deprecate the traditional filesystem in favor of query-based dynamic folders and the like? In the end it remained only another "addition", not a replacement), people expect a desktop filesystem to do what it always did.

Even more, the way Android and iOS hide the filesystem from the user in everyday tasks shows that for end users, the technical foundation isn't the focus anymore. Modern filesystems are something that will change how servers ("backends") work, but not how customers do.

That's why I think MS burrows WinFS and will probably never dig it out again...


I doubt MS "burrows" (I think you mean "buried") WinFS. There were likely many good lessons learned from that project about file systems, and there were a few useful outcomes that are present in the current NTFS.

You don't know what you don't know until you try stuff. It's called research.


I know why they did it but to me hiding the filesystem is one of the most annoying and cumbersome parts of mobile operating systems. It's a shame to have such a lack of freedom with data.

You mean iOS? You have fairly extensive access to everything on storage on Android.

I’m sure Google Drive and iCloud Drive are pretty close. Database based metadata stores with object storage and a view that looks like a traditional hierarchical file system.

OneDrive probably is too, but I don’t understand the relationship to SharePoint.


I read that it was not performance but compatibility - how do you get all the developers to rewrite their apps onto a relational filesystem?

The OP video describes a FS based on InnoDB and MySQL. It gets relatively poor performance (7-9X slower for common FS ops). However, the killer FS ops for a DB are subtree operations - 'rm -rf', 'chown -R', 'chmod -R'. In research in this field, these are the operations that get less limelight but are most important for a working system.
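To show why subtree operations are the awkward case, here is a small sketch of 'rm -rf' over a parent-pointer table in SQLite. The schema and sample tree are invented for illustration and are not taken from the video or from HopsFS. A single-file operation touches one row, while a subtree operation first has to discover every descendant, here with a recursive query, and then change all of them.

```python
import sqlite3


def build_tree():
    """Create a tiny hypothetical namespace: /home/{alice/{notes.txt,src/main.c},bob}."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE inodes (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)"
    )
    db.executemany(
        "INSERT INTO inodes VALUES (?, ?, ?)",
        [
            (1, None, ""),
            (2, 1, "home"),
            (3, 2, "alice"),
            (4, 3, "notes.txt"),
            (5, 3, "src"),
            (6, 5, "main.c"),
            (7, 2, "bob"),
        ],
    )
    return db


def subtree_ids(db, inode_id):
    """Return the ids of `inode_id` and all of its descendants."""
    rows = db.execute(
        """
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM inodes WHERE id = ?
            UNION ALL
            SELECT i.id FROM inodes AS i JOIN subtree AS s ON i.parent_id = s.id
        )
        SELECT id FROM subtree
        """,
        (inode_id,),
    )
    return [row[0] for row in rows]


def rm_rf(db, inode_id):
    """Delete a whole subtree in one transaction; return the number of rows removed."""
    ids = subtree_ids(db, inode_id)
    with db:
        db.executemany("DELETE FROM inodes WHERE id = ?", [(i,) for i in ids])
    return len(ids)
```

With one connection and an in-memory database this is trivially safe. With many concurrent writers, the set of descendants can change between discovery and deletion, which is the part that needs a locking protocol in a real system.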

I hazard a guess that the above implementation could still be made many times faster (maybe even 10X) by using InnoDB's native API, instead of the much slower MySQL Server SQL API. Here's an example of what you can achieve by going native - 1m ops/sec on InnoDB 8 years ago. This is the main guy (sorry Mark!) behind the RocksDB implementation:

http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-a...


> Performance killed them.

I wonder how much was due to legacy databases living in files.

For example, let's say you need to read /etc/foo.conf a lot. That file has internal structure, too, so even if you need just one value from it, you have to ask the filesystem (database) for the bytes that make up that file, then parse the whole thing (maybe), just to get the one value you want. Keeping it in a database just adds overhead.

In contrast, if your filesystem/database had actual structure, and stored the values of /etc/foo.conf in a schema, then every program that needed one value from it could do one query for just that value they need. That could be even faster than a dumb filesystem.
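The contrast between the two access patterns can be sketched as follows. The file contents, key names, and table layout are made up for illustration; nothing here describes a real /etc/foo.conf or a real structured filesystem.

```python
import sqlite3

# Hypothetical contents of /etc/foo.conf, used only for this sketch.
CONF_TEXT = """\
# example settings
listen_port = 8080
log_level = info
max_clients = 64
"""


def parse_conf(text):
    """Parse the whole file into a dict, as a reader of plain bytes has to."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def value_from_file(text, key):
    """File-style access: fetch all bytes, parse everything, keep one value."""
    return parse_conf(text).get(key)


def load_conf_table(text):
    """Store the settings once as rows, the way a structured store could."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE conf (key TEXT PRIMARY KEY, value TEXT NOT NULL)")
    db.executemany("INSERT INTO conf VALUES (?, ?)", parse_conf(text).items())
    return db


def value_from_db(db, key):
    """Schema-style access: one indexed lookup for exactly the value needed."""
    row = db.execute("SELECT value FROM conf WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None
```

Both paths return the same answer. The difference is that the second one parses the file once at load time instead of on every read, which only pays off if every program reading the setting agrees on the schema.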

This seems like an area where you need to change everything all at once, or face a long slog of updating every piece of data in your system.


Nice work there.

I've been fascinated with the idea of a relational database as a file system backend since the days of Longhorn / WinFS. I even had an attempt at creating one myself as a hobby project[1] in FUSE. Performance was never an end goal for me (I was writing it in FUSE + Go after all). Whilst I did get some read operations working (and surprisingly well too considering the design choices I had made) unfortunately I never managed to get write operations working before I got distracted with another project. So it's really interesting to see other people tackle this idea and I'll take great pleasure in reading the thesis you've linked to. Thank you

[1] https://github.com/lmorg/myfs if anyone is curious. Be warned though, code isn't release quality.


What project are you referring to that gets better performance?


4. HDFS is slow.

Slow == throughput or latency?

HDFS is not nearly as 'slow' as object stores, like S3 (in latency and throughput for a single namespace).


Slow == performance left on the table.

The namenode is a threaded model with a lot of locks for each operation. This will improve with moving to Ozone and RocksDB under the hood (because Rocks doesn't leave much performance on the table). But it runs a lot of threads which can cause a lot of context switches.

The RPC uses protobuf which is cpu intensive so that's stealing from your RPS (which matters if you want to process multiple tens of thousands of requests per second).

Finally it's using Java and when you want to scale your namenode beyond 16GB heap then, afaik, you run into GC issues. If it were written in a non GC language then the namenode could scale a lot further. The memory size isn't helped by having bloated types passed around. One example is UUIDs as strings all over the place because Java doesn't have a u128 native type. So you have 36 bytes for the UUID string and ~32 bytes of Java object overhead (though the JVM can unbox - it's not really something you should rely on magically happening).
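The size difference between the two UUID representations is easy to check. The snippet below uses Python rather than Java, so it only illustrates the 36-character text form versus the 16-byte (128-bit) binary form; it says nothing about JVM object headers, and the UUID value itself is an arbitrary example.

```python
import uuid

# Arbitrary example value; any UUID has the same sizes.
u = uuid.UUID("12345678-1234-5678-1234-567812345678")

text_form = str(u)      # canonical hyphenated hex string
binary_form = u.bytes   # the same 128 bits as raw bytes

text_size = len(text_form)
binary_size = len(binary_form)

# The binary form round-trips to the same UUID, so no information is lost.
round_tripped = uuid.UUID(bytes=binary_form)
```

Here `text_size` is 36 and `binary_size` is 16, so the payload alone is more than twice as large as a string, before any per-object overhead is counted.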

Aside from all this (which are things that are baked in and will never change) Hops demonstrates that there's a lot of performance available in HDFS. But comparing XFS on MySQL to HDFS on NDB isn't really the same thing.


That's a lot of technologies and issues that you touch on. Your points have validity (apart from heap sizes - several HDFS teams manage 300+GB heaps), but you are missing the fundamental challenge - HDFS' semantics (POSIX-like) and its implications on the concurrency model. You can weaken the semantics of HDFS (remove atomic rename) and scale. If you don't want to do that you have to deal with the single-writer, multiple-reader concurrency semantics for the namespace. Speed up the RPC, separate namespace from blockspace (Ozone - btw, how do they handle global states like safe-mode without consensus between both parts?) - you will still bottleneck on mutations of the namespace.

We provided a multi-writer, multi-reader concurrency solution (see paper) based on cross-partition transactions, normalized schemas, row-level locking, and an application-level protocol for subtree ops. That is the fundamental (hard) problem for scaling hierarchical FS'.


Or alternatively, I'd like to see variations like: "Git repo as a database"; i.e. sqlite backed git.

There are a whole class of problems here that I have little or no experience with, but seemingly could be incredibly impactful. For instance, further research and investment into versioning, event-logging, etc. could bring about a very rich revolution in OS / user experience.

"What if... data (e.g. file) manipulations were semantic – rather than CRUD, you had application-layer higher level blocks that presented different abstractions to data below."

Or, as I have more experience with SQL than FS, I could just be looking at the world through the lens of having a hammer and seeing everything as a nail.


Not exactly “SQLite-backed Git” but have you looked at Fossil? https://www.fossil-scm.org/home/doc/trunk/www/index.wiki

> SQLite backed git

SQLite development is actually version controlled by an open source SQLite backed version control system called fossil.


Everything is a nail.

Whenever developer sorts get all starry eyed over "DB as an FS", they never consider data exchange between applications, vendors, and platforms. Their applications always exist alone in a digital vacuum.

Example: Imagine trying to attach an MS Word document stored in this DBFS to an email from a Thunderbird client. What does that look like to the user? What does it look like to the Thunderbird developer?

Abstract and encapsulate so that each application can use their own container format that includes schema and encoding details? Other applications can read this metadata directly from each containerized db object? Cool, we just reinvented the filesystem again.


You're right, any viable replacement for a filesystem must be compatible with existing filesystem APIs. That doesn't mean you can't implement the storage layer differently, or add additional APIs.

Look at ZFS. It's a filesystem, but internally looks like a database. If you try to define the terms "database" and "filesystem" generally, you'll realize how difficult it is to separate the concepts.


I remember being excited that this would be (eventually) coming to Windows in the pre-Vista days.

It was called WinFS: https://en.wikipedia.org/wiki/WinFS

Unfortunately this got canceled, IIRC due to difficulties with back-compat.


This reminds me a little bit of the ideas behind Reiser4[1] -- the idea being that we could unify different namespaces of data (files, metadata of files, databases, anything) into a unified namespace which you could then use Unix tools on. It is a bit of a shame that the ideas behind Reiser4 have been lost with Hans Reiser's murder conviction, and nobody has really picked them back up since then.

One of the neat ideas more directly related to this talk is that Reiser4 had the ability for users to define their own data layout (I think they were called "modules") which would allow you to optimise the filesystem layout for databases while still allowing non-database software to read the contents.

[1]: https://www.youtube.com/watch?v=mIrMVPnxa04


This was the idea behind murdererfs, right? It worked pretty well to power mp3.com for a few years.

The great https://philip.greenspun.com/panda/databases-choosing talks a bit about the issue:

The great thing about the file system is its invisibility. You probably didn't purchase it separately, you might not be aware of its existence, you won't have to run an ad in the newspaper for a file system administrator with 5+ years of experience, and it will pretty much work as advertised. All you need to do with a file system is back it up to tape every day or two.

He goes on to say that the main thing RDBMSes add to your web-serving life is ACID transactions, which I think is pretty much true. Sadly modern filesystems still don't support ACID transactions, though often they do manage to export sufficient functionality for user-level code like Postgres and SQLite to do so.


> you won't have to run an ad in the newspaper for a file system administrator with 5+ years of experience

Until you reach an extreme scale, at which point you hire the author of the filesystem.


> Sadly modern filesystems still don't support ACID transactions

They offer range locking, optimistic locking, snapshots, atomic swaps, durable writes and various other interesting features as primitives. If you're willing to build some data storage around a specific filesystem you can do a lot. The correct incantations to get those features can be a bit gnarly, but so is coding database engines.

NTFS even offers actual transactions, although their use is discouraged.
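One of the better-known "incantations" built from these primitives is the write-to-temp-then-rename pattern, which combines a durable write with an atomic swap. The sketch below is a generic POSIX-oriented version, not the method of any particular database engine, and the function name is made up for this example.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` so readers see either the old or the new bytes."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make the new contents durable first
        os.replace(tmp_path, path)  # atomic swap on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself survives a crash (POSIX only).
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

This gives atomicity and durability for one file at a time. It does not give isolation between concurrent writers or atomic updates across several files, which is where the gap to real ACID transactions shows.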


Interesting, why is their use discouraged? Because Jim Gray’s sailboat sank?


This just says that since nobody uses it they might delete it, but it doesn't explain why nobody uses it. Presumably because it doesn't work well or is hard to use, but which, and how?

It's going to be hard to retrofit transactions into the POSIX model with anything other than a very coarse granularity. In the SQL model, statements are assigned to a transaction based on which connection they came through. How should POSIX filesystem transactions be scoped?

Without rewriting software, it's hard to use anything finer-grained than an entire process tree.


An entire process tree costs 100 us of CPU time to create and destroy on modern Linux, so that mechanism could deliver tens of thousands of transactions per second on a modern cellphone. That's probably more than the underlying Flash can make durable, but with nested transactions you might be able to take advantage of more than that. Still, it might be adequate.

Unfortunately we are saddled with three generations of software that uses threads like they're going out of style (which they are) due to an unfortunate confluence of Win16 refugees with a misguided attempt to run our transaction processing software on a virtual machine designed for a set-top box with no MMU. You can fork and exit on Linux in 100 us, but not if you have sixteen threads and 512 megabytes of virtual address space spread across 128 mappings.


https://en.m.wikipedia.org/wiki/Transactional_NTFS provides ACID transactions, even, I think, across machines.

HopsFS is based around using an in-memory database as the metadata store for a filesystem -

https://www.usenix.org/conference/fast17/technical-sessions/...

HopsFS is a drop-in replacement for HDFS, and the above paper showed 16X throughput improvement for Spotify's Hadoop workload over Apache HDFS. Since then, HopsFS has added small files in the metadata layer, with over 60X throughput improvement and an order of magnitude lower latency for writing small files (ACM Middleware'18):

https://www.logicalclocks.com/millions-and-millions-of-files...


That's one of the reasons why we need to rely more on microkernel OS designs first, so it's easier to advance and experiment with different designs for the storage layer.

The "everything is a file" mindset was great for moving us from the chaos of back in the day to where we are right now. But in the modern ecosystem it is holding us all back, because it is too rigid and an abstraction that is a poor fit for a lot of modern scenarios.

With micro-kernels, we could have the kernel dealing only with persistent storage blocks and leave the storage abstraction to the userspace where it belongs.

If I'm not mistaken, the Zircon kernel goes even further by abstracting through pages, so you could just map a collection of blocks to those pages and deal only with the pages in memory, no matter their source (disk, network, or memory only).

Leaving the abstractions to userspace would let us download, install, and mix those abstractions in a way that is a good fit for a given scenario.

As we have seen, monolithic kernels have trouble with the bazaar approach, as those experiments have to be shipped in kernel space, with all the trouble this entails.


I do it the other way around, filesystem as database: http://root.rupy.se

Just learned to partition the SD card for my RPi 2 cluster properly with ext4 type "small" so I don't run out of inodes.

Also purchased 6x 256GB Toshiba for $40 each, I have been waiting for large and cheap microSD cards since 2015!


TMSU (https://tmsu.org/) is another one like this. It's simple and inter-operates with everything that currently exists and it offers most features you'd want out of a database file system.

Ok, do you know how tmsu stores its metadata? I found references to sqlite3 in the source.

ROOT actually uses the filesystem AS database, it has no separate index files except the "link" ones and those only store a binary list of long id's to other files.
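A "link" file that holds nothing but a binary list of long ids could look roughly like the sketch below. The actual on-disk format of ROOT is not described in this thread, so the byte order (big-endian), the 8-byte signed width, and the function names are assumptions for illustration only.

```python
import struct

ID_SIZE = 8  # assumption: each id is a 64-bit signed integer


def pack_links(ids):
    """Encode a list of ids as consecutive big-endian 64-bit integers."""
    return struct.pack(f">{len(ids)}q", *ids)


def unpack_links(blob):
    """Decode the bytes of a link file back into a list of ids."""
    if len(blob) % ID_SIZE:
        raise ValueError("link data must be a multiple of 8 bytes")
    return list(struct.unpack(f">{len(blob) // ID_SIZE}q", blob))


def append_link(blob, new_id):
    """Adding a relation is just appending 8 more bytes to the file contents."""
    return blob + struct.pack(">q", new_id)
```

The appeal of such a format is that there is no separate index structure to keep consistent: the relation is the file, and appending a link is an append to that file.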

ROOT has some features that are missing from tmsu though; like distributed real-time replication over async. HTTP (scales across continents). HTTP API with JSON support. User management and security. All of that within ~4000 lines of code.

I built it to have a MMO database that doesn't need manual moving of data for region switching and backup while avoiding the sync. bottleneck of all other databases.

Here are two sites that use it: http://talk.binarytask.com and http://fuse.rupy.se


The original Be File System (BFS) for BeOS had some attributes of a database. I'm happy to see more research being done on the concept.

There are different portions to a FS that actually need to be treated individually:

- Bricks / Blocks / Blob data.

- Metadata / Tags / Xactional data.

The deal is that any DB <-> FS has to support both of these separately. The Bricks/Blocks are optimized for raw performance/seek/access pattern with very low overhead; The Metadata are optimized for fancy querying / fast lookup semantics.

Any DB/Key Value store for large data could be modeled on top of a volume layer, like LVM / ZVOL; This could absolutely become the foundation of a FS as well.

The deal is that extensible / flexible storage space used to be something filesystems used to provide, but those functions have been moved to a volume manager in regular systems. The reliability and metadata part can be moved to the database layer, so that leaves the filesystem doing a No-Op.


Atria did something similar almost 30 years ago with MVFS - https://www-01.ibm.com/support/docview.wss?uid=swg21230196 - on top of RDM. The SCM tool (IBM Rational ClearCase) that makes use of MVFS to map a database of modifications to a file system is still around.

I always admired the concept behind ClearCase / MVFS, though in practice it was a nightmare to administer.


ClearCase might be a monster to deal with, but it has several advanced features that I miss in FOSS alternatives.

One being the ability to share build binaries across users.


In a similar vein, I ran a little experimental project using redis as a distributed filesystem using this: https://steve.fi/Software/redisfs/

This kind of thing was one of the foundational notions of Plan 9, back in the late 1980s and early 1990s. I often wonder how different our world would be if they'd only had the foresight to release the code freely instead of keeping it proprietary.

See also Michael Olson's "The Design and Implementation of the Inversion File System (1993)", from the 1993 USENIX conference (https://www.usenix.org/legacy/publications/library/proceedin...), PDF here: http://db.cs.berkeley.edu/papers/S2K-93-28.pdf

Databases and filesystems share a lot of similarities, but each still has its own characteristics. Using a database as a back-end for a filesystem has some benefits, such as reliable persistence and ACID transactions. It reduces the complexity of achieving durable storage, which the filesystem would otherwise have to deal with.

Many filesystems are able to use a DB as a back-end. I also made a filesystem, ZboxFS (https://github.com/zboxfs/zbox), which can use SQLite as its underlying storage.


I think the real interesting value in putting an RDBMS underneath the filesystem would come from relationally modeling the objects in the filesystem, ie files and types and their contents, and then being able to query those things with SQL. If their relational schema is just inodes and blocks I don’t really see the value - after all the filesystem is just a miniature RDBMS purpose-built and optimized for that fixed schema and so it’s not surprising that their system is so much slower.

Would be awesome to query a DB, like `ls /db/america/ny/cities` using tags or something in files to automatically and predictably sort data into folders/files.
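As a rough sketch of that idea, a path could be interpreted as a tag query instead of a directory walk. Everything below is hypothetical: the table, the tag columns, the sample rows, and the rule that a path has the shape /db/<country>/<state>/<collection>.

```python
import sqlite3


def build_index():
    """Create a small tag table; rows and column names are invented examples."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files (name TEXT, country TEXT, state TEXT, collection TEXT)"
    )
    db.executemany(
        "INSERT INTO files VALUES (?, ?, ?, ?)",
        [
            ("new-york", "america", "ny", "cities"),
            ("buffalo", "america", "ny", "cities"),
            ("albany", "america", "ny", "cities"),
            ("hudson", "america", "ny", "rivers"),
            ("austin", "america", "tx", "cities"),
        ],
    )
    return db


def ls(db, path):
    """Answer `ls /db/<country>/<state>/<collection>` with a query over tags."""
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "db":
        raise ValueError("expected /db/<country>/<state>/<collection>")
    _, country, state, collection = parts
    rows = db.execute(
        "SELECT name FROM files"
        " WHERE country = ? AND state = ? AND collection = ?"
        " ORDER BY name",
        (country, state, collection),
    )
    return [row[0] for row in rows]
```

With the sample rows, `ls(db, "/db/america/ny/cities")` lists albany, buffalo and new-york. The folder is never stored anywhere; it is just the result of the query, so retagging a file moves it between "directories" automatically.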

What happens to performance if you store a database inside this filesystem?

    Docker
        -> Mysql Image
            -> This DB Filesystem
                -> Separate docker images that mount the DBFS over other Mysql image.
Should be quite fun for a lot of people.

Just speculating, but the database is mostly used for the metadata. If implemented well, there will be very little overhead for reading and writing to a file, especially if the size stays constant.

Some databases rely on specific low level file system operations, which may not be optimised in a niche filesystem. But in theory there shouldn't be anything preventing it from being optimized, and the database from running at close to maximum performance.


Ok, but if you store only metadata then you lose the advantage of running all the file operations inside a transaction. Then again, POSIX has no notion of transactions ...

> there will be very little overhead for reading and writing to a file, especially if the size stays constant

Not if you add primary keys, indexes, ‘on update’ clauses, etc.


A graph database could handle arbitrarily deep directory nesting with ease.


