I could see advantages like being able to have indexes and relationships, run queries on files' metadata, import/export the whole tree or branches (folders), and maybe easier management of distributed file systems...
Setting aside the performance overhead for the sake of experimentation, from my not-very-knowledgeable perspective the idea seems to have enough interesting implications that I'm curious to see where it could go.
I also like other variations on this, like "file system as database", "Git repo as file system", and so on.
The reason we managed to get better performance than all previous attempts is three-fold:
1. we used a high-performance in-memory DB (NDB) and we did not de-normalize our schemas (no caching!)
2. with normalized tables, we needed cross-partition transactions at high performance
3. we did a reasonable job of re-implementing HDFS' POSIX-like model using read-isolated transactions, row-level locking, and a new protocol for sub-tree operations.
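This is not the actual HopsFS/NDB schema, just a rough illustrative sketch (using SQLite here) of what a normalized metadata layout looks like: each inode row references its parent by id, so path resolution is one indexed lookup per component and there is no duplicated path data to cache or keep consistent.

```python
import sqlite3

# Illustrative only: a normalized inode table in the spirit of the paper,
# using SQLite instead of NDB. Each row references its parent by id, so
# there is no duplicated path data to keep consistent.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE inodes (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES inodes(id),
    name      TEXT NOT NULL,
    is_dir    INTEGER NOT NULL,
    size      INTEGER DEFAULT 0,
    UNIQUE (parent_id, name)          -- one indexed lookup per path component
);
""")

def lookup(path):
    """Resolve a path by walking parent_id -> name, one indexed query per component."""
    parent, row = None, None
    for part in [p for p in path.split("/") if p]:
        row = db.execute(
            "SELECT id, is_dir, size FROM inodes WHERE parent_id IS ? AND name = ?",
            (parent, part)).fetchone()
        if row is None:
            return None
        parent = row[0]
    return row

db.execute("INSERT INTO inodes (id, parent_id, name, is_dir) VALUES (1, NULL, 'etc', 1)")
db.execute("INSERT INTO inodes (id, parent_id, name, is_dir, size) VALUES (2, 1, 'foo.conf', 0, 120)")
print(lookup("/etc/foo.conf"))   # -> (2, 0, 120)
```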
Recent improvements in both the performance and scalability of shared-nothing, transactional, in-memory NewSQL databases have reopened the research question of whether distributed metadata for hierarchical file systems can be managed using commodity databases.
In this paper, we introduce HopsFS, a next generation distribution of the Hadoop Distributed File System (HDFS) that replaces HDFS’ single node in-memory metadata service, with a distributed metadata service built on a NewSQL database.
By removing the metadata bottleneck, HopsFS enables an order of magnitude larger and higher throughput clusters compared to HDFS. Metadata capacity has been increased to at least 37 times HDFS’ capacity, and in experiments based on a workload trace from Spotify, we show that HopsFS supports 16 to 37 times the throughput of Apache HDFS. HopsFS also has lower latency for many concurrent clients, and no downtime during failover. Finally, as metadata is now stored in a commodity database, it can be safely extended and easily exported to external systems for online analysis and free-text search.
We are very interested in the results you got, and are looking for a way to replicate them with our setup.
I've often wondered if there's a point at which something like WinFS would become practical simply through the perpetual advance of hardware. MS started working on the concept back in the early '90s (as part of Cairo: https://en.wikipedia.org/wiki/Cairo_(operating_system)), and eventually gave up on it in 2006, which was nearly 15 years ago now. Devices today have significantly more memory and compute power on tap than they did in 2006, and if you roll the comparison all the way back to 1991 we're on a completely different planet today. And there are entirely new types of hardware that could be leveraged now too, such as GPUs, solid-state disks, etc.
I have no idea if MS ever published their performance data, but if they did, it'd be interesting to use it to figure out what a theoretical machine that could run WinFS (as it was when MS gave up on it) fast enough to be usable would look like, and how close the machines of today come to that theoretical machine.
What's more, the way Android and iOS hide the filesystem from the user in everyday tasks shows that for end users, the technical foundation is no longer the focus. Modern filesystems are something that will change how servers ("backends") work, but not how customers work.
That's why I think MS buried WinFS and will probably never dig it out again...
You don't know what you don't know until you try stuff. It's called research.
OneDrive probably is too, but I don’t understand the relationship to SharePoint.
I'd hazard a guess that the above implementation could still be made many times faster (maybe even 10X) by using InnoDB's native API instead of the much slower MySQL Server SQL API. Here's an example of what you can achieve by going native: 1m ops/sec on InnoDB 8 years ago. This is from the main guy (sorry Mark!) behind the RocksDB implementation:
I wonder how much was due to legacy databases living in files.
For example, let's say you need to read /etc/foo.conf a lot. That file has internal structure, too, so even if you need just one value from it, you have to ask the filesystem (database) for the bytes that make up that file, then parse the whole thing (maybe), just to get the one value you want. Keeping it in a database just adds overhead.
In contrast, if your filesystem/database had actual structure, and stored the values of /etc/foo.conf in a schema, then every program that needed one value from it could do one query for just the value it needs. That could be even faster than a dumb filesystem.
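As a rough sketch of that idea (the table and key names here are made up purely for illustration), the structured version of /etc/foo.conf could be queried for exactly one key instead of being read and parsed in full:

```python
import sqlite3

# Hypothetical illustration: /etc/foo.conf stored as rows in a schema
# rather than as an opaque byte stream.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE config (path TEXT, key TEXT, value TEXT, PRIMARY KEY (path, key))")
db.execute("INSERT INTO config VALUES ('/etc/foo.conf', 'listen_port', '8080')")
db.execute("INSERT INTO config VALUES ('/etc/foo.conf', 'log_level', 'info')")

# One indexed lookup for the single value the program needs,
# instead of fetching and parsing the whole file.
(value,) = db.execute(
    "SELECT value FROM config WHERE path = ? AND key = ?",
    ('/etc/foo.conf', 'listen_port')).fetchone()
print(value)  # 8080
```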
This seems like an area where you need to change everything all at once, or face a long slog of updating every piece of data in your system.
I've been fascinated with the idea of a relational database as a file system backend since the days of Longhorn / WinFS. I even had an attempt at creating one myself as a hobby project in FUSE. Performance was never an end goal for me (I was writing it in FUSE + Go, after all). Whilst I did get some read operations working (and surprisingly well too, considering the design choices I had made), unfortunately I never managed to get write operations working before I got distracted by another project. So it's really interesting to see other people tackle this idea, and I'll take great pleasure in reading the thesis you've linked to. Thank you
 https://github.com/lmorg/myfs if anyone is curious. Be warned though, code isn't release quality.
HDFS is not nearly as 'slow' as object stores, like S3 (in latency and throughput for a single namespace).
The namenode is a threaded model with a lot of locks for each operation. This will improve with the move to Ozone and RocksDB under the hood (because Rocks doesn't leave much performance on the table). But it runs a lot of threads, which can cause a lot of context switches.
The RPC layer uses protobuf, which is CPU intensive, so that's stealing from your RPS (which matters if you want to process multiple tens of thousands of requests per second).
Finally, it's using Java, and when you want to scale your namenode beyond a 16GB heap then, afaik, you run into GC issues. If it were written in a non-GC language then the namenode could scale a lot further. The memory size isn't helped by having bloated types passed around. One example is UUIDs as strings all over the place because Java doesn't have a native u128 type. So you have 36 bytes for the UUID string and ~32 bytes of Java object overhead (though the JVM can unbox - it's not really something you should rely on magically happening).
Aside from all this (which are things that are baked in and will never change), Hops demonstrates that there's a lot of performance available in HDFS. But comparing XFS on MySQL to HDFS on NDB isn't really the same thing.
We provided a multi-writer, multi-reader concurrency solution (see paper) based on cross-partition transactions, normalized schemas, row-level locking, and an application-level protocol for subtree ops. That is the fundamental (hard) problem in scaling hierarchical filesystems.
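Very roughly, the shape of such a subtree-operation protocol is: flag the subtree root as locked, let in-flight operations drain, then apply the operation. The toy in-memory sketch below is only meant to show that general shape; the actual HopsFS protocol in the paper is considerably more involved.

```python
import threading

# Toy, in-memory illustration of the general shape of a subtree-operation
# protocol: flag the root, let in-flight ops drain, then operate.
class Inode:
    def __init__(self, name):
        self.name = name
        self.children = {}
        self.subtree_locked = False   # set by a subtree operation
        self.active_ops = 0           # in-flight operation count on this inode
        self.cond = threading.Condition()

def start_op(inode):
    with inode.cond:
        while inode.subtree_locked:           # new ops wait behind the subtree lock
            inode.cond.wait()
        inode.active_ops += 1

def end_op(inode):
    with inode.cond:
        inode.active_ops -= 1
        inode.cond.notify_all()

def subtree_delete(root):
    with root.cond:
        root.subtree_locked = True            # phase 1: take the subtree lock
        while root.active_ops > 0:            # phase 2: quiesce in-flight ops
            root.cond.wait()
    for child in list(root.children.values()):  # phase 3: do the actual work
        subtree_delete(child)
    root.children.clear()
```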
There are a whole class of problems here that I have little or no experience with, but seemingly could be incredibly impactful. For instance, further research and investment into versioning, event-logging, etc. could bring about a very rich revolution in OS / user experience.
"What if... data (e.g. file) manipulations where semantic – rather than CRUD, you had application-layer higher level blocks that presented different abstractions to data below."
Or, as I have more experience with SQL than FS, I could just be looking at the world through the lens of having a hammer and seeing everything as a nail.
SQLite development is actually version controlled by an open-source, SQLite-backed version control system called Fossil.
Whenever developer sorts get all starry-eyed over "DB as an FS", they never consider data exchange between applications, vendors, and platforms. Their applications always exist alone in a digital vacuum.
Example: Imagine trying to attach an MS Word document stored in this DBFS to an email from a Thunderbird client. What does that look like to the user? What does it look like to the Thunderbird developer?
Abstract and encapsulate so that each application can use their own container format that includes schema and encoding details? Other applications can read this metadata directly from each containerized db object? Cool, we just reinvented the filesystem again.
Look at ZFS. It's a filesystem, but internally looks like a database. If you try to define the terms "database" and "filesystem" generally, you'll realize how difficult it is to separate the concepts.
It was called WinFS: https://en.wikipedia.org/wiki/WinFS
Unfortunately this got canceled, IIRC due to difficulties with back-compat.
One of the neat ideas more directly related to this talk is that Reiser4 had the ability for users to define their own data layout (I think they were called "modules") which would allow you to optimise the filesystem layout for databases while still allowing non-database software to read the contents.
The great https://philip.greenspun.com/panda/databases-choosing talks a bit about the issue:
The great thing about the file system is its invisibility. You probably didn't purchase it separately, you might not be aware of its existence, you won't have to run an ad in the newspaper for a file system administrator with 5+ years of experience, and it will pretty much work as advertised. All you need to do with a file system is back it up to tape every day or two.
He goes on to say that the main thing RDBMSes add to your web-serving life is ACID transactions, which I think is pretty much true. Sadly modern filesystems still don't support ACID transactions, though often they do manage to export sufficient functionality for user-level code like Postgres and SQLite to do so.
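For example, this is roughly how user-level code gets ACID back today: SQLite layers transactions on top of ordinary files, journals, and fsync. A minimal sketch (the file and table names are made up):

```python
import sqlite3

# SQLite supplies the ACID transaction the filesystem itself doesn't offer;
# underneath it is built from ordinary files plus journaling and fsync.
db = sqlite3.connect("accounts.db", isolation_level=None)
db.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance INTEGER)")
db.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 100), ('bob', 0)")

try:
    db.execute("BEGIN IMMEDIATE")  # start an explicit transaction
    db.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
    db.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bob'")
    db.execute("COMMIT")           # both updates become durable together
except Exception:
    db.execute("ROLLBACK")         # or neither does
    raise
```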
Until you reach an extreme scale, at which point you hire the author of the filesystem.
They offer range locking, optimistic locking, snapshots, atomic swaps, durable writes and various other interesting features as primitives. If you're willing to build some data storage around a specific filesystem you can do a lot. The correct incantations to get those features can be a bit gnarly, but so is coding database engines.
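A small sketch of what those incantations look like on POSIX/Linux (byte-range locking with fcntl, a durable write, and an atomic swap via rename plus a directory fsync); the file names are just placeholders:

```python
import fcntl, os

# The primitives are there, but the incantations are gnarly (POSIX/Linux only).

# Byte-range locking: lock bytes 0..4096 of an index file for exclusive access.
fd = os.open("data.idx", os.O_RDWR | os.O_CREAT, 0o644)
fcntl.lockf(fd, fcntl.LOCK_EX, 4096, 0, os.SEEK_SET)   # len=4096, start=0
# ... update that region ...
fcntl.lockf(fd, fcntl.LOCK_UN, 4096, 0, os.SEEK_SET)
os.close(fd)

# Durable write plus atomic swap: write a temp file, fsync it, rename it over
# the old one, then fsync the directory so the rename itself is durable.
tmp = os.open("state.tmp", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(tmp, b"new state\n")
os.fsync(tmp)                       # data reaches stable storage
os.close(tmp)
os.replace("state.tmp", "state")    # atomic swap of the visible file
dirfd = os.open(".", os.O_DIRECTORY)
os.fsync(dirfd)                     # make the directory entry durable too
os.close(dirfd)
```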
NTFS even offers actual transactions, although their use is discouraged.
Without rewriting software, it's hard to use anything finer-grained than an entire process tree.
Unfortunately we are saddled with three generations of software that uses threads like they're going out of style (which they are) due to an unfortunate confluence of Win16 refugees with a misguided attempt to run our transaction processing software on a virtual machine designed for a set-top box with no MMU. You can fork and exit on Linux in 100 us, but not if you have sixteen threads and 512 megabytes of virtual address space spread across 128 mappings.
HopsFS is a drop-in replacement for HDFS, and the above paper showed 16X throughput improvement for Spotify's Hadoop workload over Apache HDFS. Since then, HopsFS has added small files in the metadata layer, with over 60X throughput improvement and an order of magnitude lower latency for writing small files (ACM Middleware'18):
The "everything is a file" mindset was great to move us from the chaos they were in back in the day to where we are right now. But with the modern ecosystem is holding us all back, because is too rigid, and a abstraction that its a poor fit to a lot of modern scenarios.
With micro-kernels, we could have the kernel deal only with persistent storage blocks and leave the storage abstraction to userspace, where it belongs.
If I'm not mistaken, the Zircon kernel goes even further by abstracting through pages, so you could just map a collection of blocks to those pages and deal only with the pages in memory, no matter their source (disk, network, or memory only).
Leaving the abstractions to userspace would enable us to download, install, and mix those abstractions in a way that is a good fit for a given scenario.
As we have seen, monolithic kernels have trouble with this bazaar approach, as those experiments have to be shipped in kernel space, with all the trouble that entails.
Just learned to partition the SD card for my RPi 2 cluster properly with ext4 type "small" so I don't run out of inodes.
Also purchased 6x 256GB Toshiba for 40$ each, I have been waiting for large and cheap microSD cards since 2015!
ROOT actually uses the filesystem as the database; it has no separate index files except the "link" ones, and those only store a binary list of long IDs pointing to other files.
ROOT has some features that are missing from tmsu though, like distributed real-time replication over async HTTP (scales across continents), an HTTP API with JSON support, and user management and security. All of that within ~4000 lines of code.
I built it to have an MMO database that doesn't need manual moving of data for region switching and backup, while avoiding the sync bottleneck of all other databases.
Here are two sites that use it: http://talk.binarytask.com and http://fuse.rupy.se
- Bricks / Blocks / Blob data.
- Metadata / Tags / Xactional data.
The deal is that any DB <-> FS has to support both of these separately. The Bricks/Blocks are optimized for raw performance, seek, and access patterns with very low overhead; the Metadata is optimized for fancy querying and fast lookup semantics.
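A made-up sketch of that split, with the block store kept deliberately dumb and the metadata kept small and indexed (the schema and names here are invented purely for illustration):

```python
import sqlite3

# Made-up sketch of the split: bulk block data kept dumb and append-friendly,
# metadata kept small, indexed, and query-friendly.
db = sqlite3.connect(":memory:")
db.executescript("""
-- Bricks/Blocks: raw bytes, addressed by (file_id, block_no), nothing fancy.
CREATE TABLE blocks (
    file_id  INTEGER NOT NULL,
    block_no INTEGER NOT NULL,
    data     BLOB NOT NULL,
    PRIMARY KEY (file_id, block_no)
);

-- Metadata/Tags: small rows, heavily indexed, where the fancy queries go.
CREATE TABLE files (
    file_id  INTEGER PRIMARY KEY,
    path     TEXT UNIQUE NOT NULL,
    size     INTEGER NOT NULL,
    mtime    INTEGER NOT NULL
);
CREATE TABLE tags (
    file_id  INTEGER NOT NULL REFERENCES files(file_id),
    tag      TEXT NOT NULL
);
CREATE INDEX tags_by_tag ON tags(tag);
""")

# A metadata query never has to touch the block data:
db.execute("INSERT INTO files VALUES (1, '/photos/cat.jpg', 4096, 1700000000)")
db.execute("INSERT INTO tags VALUES (1, 'cat')")
print(db.execute(
    "SELECT path FROM files JOIN tags USING (file_id) WHERE tag = 'cat'").fetchall())
```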
Any DB/Key Value store for large data could be modeled on top of a volume layer, like LVM / ZVOL; This could absolutely become the foundation of a FS as well.
The deal is that extensible / flexible storage space used to be something filesystems provided, but those functions have been moved to a volume manager in regular systems. The reliability and metadata part can be moved to the database layer, so that leaves the filesystem doing a no-op.
I always admired the concept behind Clearcase / MVFS, though in practice it was a nightmare to administer.
One being the ability to share build binaries across users.
Many filesystems are able to use a DB as the back-end. I also made a filesystem, ZboxFS (https://github.com/zboxfs/zbox), which can use SQLite as the underlying storage.
-> MySQL image
-> This DB filesystem
-> Separate Docker images that mount the DBFS over another MySQL image.
Some databases rely on specific low-level file system operations, which may not be optimized in a niche filesystem. But in theory there shouldn't be anything preventing those operations from being optimized, and the database from running at close to maximum performance.
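For illustration, these are the kinds of low-level calls (Linux-only in this sketch, file name made up) that databases lean on for things like their write-ahead logs, and which a niche filesystem may or may not handle efficiently:

```python
import os

# Examples (Linux-only) of low-level calls databases commonly rely on.
fd = os.open("wal.log", os.O_WRONLY | os.O_CREAT, 0o644)

os.posix_fallocate(fd, 0, 64 * 1024 * 1024)           # pre-allocate 64 MiB of log space
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)  # hint: sequential access pattern

os.write(fd, b"commit record\n")
os.fdatasync(fd)   # cheaper than fsync: flush data, skip some metadata
os.close(fd)
```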
Not if you add primary keys, indexes, ‘on update’ clauses, etc.