
Database as Filesystem [video] - enobrev
https://www.youtube.com/watch?v=wN6IwNriwHc
======
lioeters
Building a POSIX-compatible file system on top of a database engine (Postgres,
MariaDB) on raw partitions..

I could see advantages like being able to have indexes, relationships, run
queries on files' metadata, import/export the whole tree or branches
(folders), maybe easier managing of distributed file systems..
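To make the "export a branch" idea concrete, here is a minimal sketch using SQLite from Python. The `node` table, its columns, and the sample rows are made up for illustration and are not taken from the talk; a recursive query collects one folder and everything below it.

```python
import sqlite3

def make_tree(conn):
    # Hypothetical schema: every file or folder is a row pointing at its parent.
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
        "name TEXT NOT NULL, size INTEGER NOT NULL DEFAULT 0)"
    )
    # Index on metadata, the kind of thing a plain filesystem does not offer.
    conn.execute("CREATE INDEX node_parent ON node (parent_id, name)")
    conn.executemany(
        "INSERT INTO node (id, parent_id, name, size) VALUES (?, ?, ?, ?)",
        [
            (1, None, "", 0),        # root
            (2, 1, "home", 0),
            (3, 2, "notes.txt", 120),
            (4, 2, "photos", 0),
            (5, 4, "cat.jpg", 4096),
            (6, 1, "etc", 0),
        ],
    )

def export_branch(conn, root_id):
    """Return (relative path, size) for root_id and all of its descendants."""
    return conn.execute(
        """
        WITH RECURSIVE branch(id, path, size) AS (
            SELECT id, name, size FROM node WHERE id = ?
            UNION ALL
            SELECT n.id, b.path || '/' || n.name, n.size
            FROM node AS n JOIN branch AS b ON n.parent_id = b.id
        )
        SELECT path, size FROM branch ORDER BY path
        """,
        (root_id,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
make_tree(conn)
for path, size in export_branch(conn, 2):   # export the "home" branch
    print(path, size)
```

The same table answers metadata questions directly, e.g. `SELECT SUM(size) FROM node` for total usage, without walking the tree in application code.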

Aside from the performance overhead for the sake of experimentation, from my
not-very-knowledgeable perspective, the idea seems to have enough interesting
implications that I'm curious to see where it could go.

I also like other variations on this, like "file system as database", "Git
repo as file system", and so on.

~~~
jamesblonde
You can read about the background and challenges in Salman Niazi's PhD thesis -
[http://kth.diva-portal.org/smash/get/diva2:1260852/FULLTEXT0...](http://kth.diva-portal.org/smash/get/diva2:1260852/FULLTEXT02.pdf).
This is not a new topic - WinFS (part of the failed Longhorn project) was the
best known example, which tried to use SQL Server as the metadata store for a
new Windows FS. Performance killed them.

The reason why we managed to get better performance compared to all previous
attempts is threefold:

1\. we used an in-memory high performance DB (NDB) and we did not de-normalize
our schemas (no caching!)

2\. with normalized tables, we needed cross-partition transactions at high
performance

3\. we did a reasonable job at re-implementing HDFS' POSIX-like model using
read-isolated transactions, row-level locking, and a new protocol for sub-tree
operations.
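As a rough illustration of what a normalized metadata schema looks like, here is a sketch in Python with SQLite. This is not HopsFS's schema or NDB's API: the `inode` table, the function names, and the sample rows are assumptions, and SQLite locks the whole database rather than individual rows. The point is only that each row stores `(parent_id, name)` instead of a full path, so a path lookup is a chain of key lookups and a directory rename touches a single row.

```python
import sqlite3

def create_inodes(conn):
    # Hypothetical normalized schema: no full paths, only (parent_id, name).
    conn.execute(
        "CREATE TABLE inode (id INTEGER PRIMARY KEY, parent_id INTEGER NOT NULL, "
        "name TEXT NOT NULL, UNIQUE (parent_id, name))"
    )
    conn.executemany(
        "INSERT INTO inode (id, parent_id, name) VALUES (?, ?, ?)",
        [(1, 0, ""), (2, 1, "user"), (3, 2, "data"), (4, 3, "part-0")],
    )

def resolve(conn, path):
    """Walk the path one component at a time; return the inode id or None."""
    current = 1  # id of the root inode in this sketch
    for part in (p for p in path.split("/") if p):
        row = conn.execute(
            "SELECT id FROM inode WHERE parent_id = ? AND name = ?",
            (current, part),
        ).fetchone()
        if row is None:
            return None
        current = row[0]
    return current

def rename(conn, path, new_name):
    """Look up and rename in one transaction; children are untouched
    because they refer to the directory by id, not by path."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        inode_id = resolve(conn, path)
        if inode_id is None:
            raise FileNotFoundError(path)
        conn.execute("UPDATE inode SET name = ? WHERE id = ?", (new_name, inode_id))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

# isolation_level=None puts the connection in autocommit mode so the
# explicit BEGIN/COMMIT above are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
create_inodes(conn)
rename(conn, "/user/data", "archive")
print(resolve(conn, "/user/archive/part-0"))
```

In a partitioned store the rows along one path can live on different partitions, which is where the need for fast cross-partition transactions comes from.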

~~~
smacktoward
_> Performance killed them._

I've often wondered if there's a point at which something like WinFS would
become practical simply through the perpetual advance of hardware. MS started
working on the concept back in the early '90s (as part of Cairo:
[https://en.wikipedia.org/wiki/Cairo_(operating_system)](https://en.wikipedia.org/wiki/Cairo_\(operating_system\))),
and eventually gave up on it in 2006, which was nearly 15 years ago now.
Devices today have significantly more memory and compute power on tap than
they did in 2006, and if you roll the comparison all the way back to 1991
we're on a completely different planet today. And there are entirely new
_types_ of hardware that could be leveraged now too, such as GPUs, solid-state
disks, etc.

I have no idea if MS ever published their performance data, but if they did,
it'd be interesting to use it to figure out what a theoretical machine that
could run WinFS, as it was when MS gave up on it, fast enough to be useable
would look like, and how close to that theoretical machine the machines of
today are.

~~~
ktpsns
I guess that if the next Windows included a novel FS, people would not care.
After decades of experimenting (does anybody remember how users feared that
Longhorn would deprecate the traditional filesystem in favor of query-based
dynamic folders and the like? In the end it remained only another "addition",
not a replacement), people expect a desktop filesystem to do what it always
did.

Even more, the way Android and iOS hide the filesystem from the user in
everyday tasks shows that end users no longer care about the technical
foundation. Modern filesystems will change how servers ("backends") work, but
not how customers work.

That's why I think MS buried WinFS and will probably never dig it out
again...

~~~
nightski
I know why they did it but to me hiding the filesystem is one of the most
annoying and cumbersome parts of mobile operating systems. It's a shame to
have such a lack of freedom with data.

~~~
jplayer01
You mean iOS? You have fairly extensive access to everything on storage on
Android.

------
cyphar
This reminds me a little bit of the ideas behind Reiser4[1] -- the idea being
that we could unify different namespaces of data (files, metadata of files,
databases, anything) into a unified namespace which you could then use Unix
tools on. It is a bit of a shame that the ideas behind Reiser4 have been lost
with Hans Reiser's murder conviction, and nobody has really picked them back
up since then.

One of the neat ideas more directly related to this talk is that Reiser4 had
the ability for users to define their own data layout (I think they were
called "modules") which would allow you to optimise the filesystem layout for
databases while still allowing non-database software to read the contents.

[1]:
[https://www.youtube.com/watch?v=mIrMVPnxa04](https://www.youtube.com/watch?v=mIrMVPnxa04)

------
kragen
This was the idea behind murdererfs, right? It worked pretty well to power
mp3.com for a few years.

The great
[https://philip.greenspun.com/panda/databases-choosing](https://philip.greenspun.com/panda/databases-choosing)
talks a bit about the issue:

 _The great thing about the file system is its invisibility. You probably
didn't purchase it separately, you might not be aware of its existence, you won't
have to run an ad in the newspaper for a file system administrator with 5+
years of experience, and it will pretty much work as advertised. All you need
to do with a file system is back it up to tape every day or two._

He goes on to say that the main thing RDBMSes add to your web-serving life is
ACID transactions, which I think is pretty much true. Sadly modern filesystems
still don't support ACID transactions, though often they do manage to export
sufficient functionality for user-level code like Postgres and SQLite to do
so.

~~~
the8472
> Sadly modern filesystems still don't support ACID transactions

They offer range locking, optimistic locking, snapshots, atomic swaps, durable
writes and various other interesting features as primitives. If you're willing
to build some data storage around a specific filesystem you can do a lot. The
correct incantations to get those features can be a bit gnarly, but so is
coding database engines.
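One of the better-known incantations is the write-to-temp, fsync, rename sequence for replacing a file atomically and durably. A minimal Python sketch follows; the function name `atomic_write` is made up here, and the directory fsync step assumes a POSIX system.

```python
import contextlib
import os
import tempfile

def atomic_write(path, data: bytes):
    """Replace `path` with `data` so readers see either the old or the new
    contents, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem),
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())      # durable write: push contents to disk
        os.replace(tmp_path, path)      # atomic swap of old and new file
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
    # Persist the rename itself by syncing the directory entry (POSIX only).
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)

demo_dir = tempfile.mkdtemp()
target = os.path.join(demo_dir, "settings.json")
atomic_write(target, b'{"version": 1}')
atomic_write(target, b'{"version": 2}')
```

This gives atomicity and durability for a single file only; anything spanning several files still needs a journal or a database on top.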

NTFS even offers actual transactions, although their use is discouraged.

~~~
kragen
Interesting, why is their use discouraged? Because Jim Gray’s sailboat sank?

~~~
the8472
[https://docs.microsoft.com/en-us/windows/win32/fileio/deprec...](https://docs.microsoft.com/en-us/windows/win32/fileio/deprecation-of-txf)

~~~
kragen
This just says that since nobody uses it they might delete it, but it doesn't
explain why nobody uses it. Presumably because it doesn't work well or is hard
to use, but which, and how?

------
jamesblonde
HopsFS is based around using an in-memory database as the metadata store for a
filesystem -

[https://www.usenix.org/conference/fast17/technical-sessions/...](https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi)

HopsFS is a drop-in replacement for HDFS, and the above paper showed 16X
throughput improvement for Spotify's Hadoop workload over Apache HDFS. Since
then, HopsFS has added small files in the metadata layer, with over 60X
throughput improvement and an order of magnitude lower latency for writing
small files (ACM Middleware'18):

[https://www.logicalclocks.com/millions-and-millions-of-files...](https://www.logicalclocks.com/millions-and-millions-of-files-deep-learning-at-scale-with-hopsfs/)

------
oscargrouch
That's one of the reasons why we need to rely more on micro-kernel OS designs
first, so it's easier to advance and experiment with different designs for the
storage layer.

The "everything is a file" mindset was great to move us from the chaos they
were in back in the day to where we are right now. But in the modern
ecosystem it is holding us all back, because it is too rigid and an
abstraction that is a poor fit for a lot of modern scenarios.

With micro-kernels, we could have the kernel dealing only with persistent
storage blocks and leave the storage abstraction to the userspace where it
belongs.

If I'm not mistaken, the Zircon kernel goes even further by abstracting
through pages, so you could just map a collection of blocks to those pages and
deal only with the pages in memory, no matter their source (disk, network, or
memory only).

That leaves the abstractions to userspace and, by doing so, enables us to
download, install, and mix those abstractions in a way that is a good fit for
a given scenario.

As we have seen, monolithic kernels have trouble with the bazaar approach,
as those experiments have to be shipped in kernel space, with all the trouble
this entails.

------
bullen
I do it the other way around, filesystem as database:
[http://root.rupy.se](http://root.rupy.se)

Just learned to partition the SD card for my RPi 2 cluster properly with ext4
type "small" so I don't run out of inodes.

Also purchased 6x 256GB Toshiba for 40$ each, I have been waiting for large
and cheap microSD cards since 2015!

~~~
flukus
TMSU ([https://tmsu.org/](https://tmsu.org/)) is another one like this. It's
simple and inter-operates with everything that currently exists and it offers
most features you'd want out of a database file system.

~~~
bullen
Ok, do you know how tmsu stores its metadata? I found references to sqlite3
in the source.

ROOT actually uses the filesystem AS database, it has no separate index files
except the "link" ones and those only store a binary list of long id's to
other files.

ROOT has some features that are missing from tmsu though; like distributed
real-time replication over async. HTTP (scales across continents). HTTP API
with JSON support. User management and security. All of that within ~4000
lines of code.

I built it to have an MMO database that doesn't need manual moving of data for
region switching and backup while avoiding the sync. bottleneck of all other
databases.

Here are two sites that use it:
[http://talk.binarytask.com](http://talk.binarytask.com) and
[http://fuse.rupy.se](http://fuse.rupy.se)

------
andrewl
The original Be File System (BFS) for BeOS had _some_ attributes of a
database. I'm happy to see more research being done on the concept.

------
vkaku
There are different portions to a FS that actually need to be treated
individually:

\- Bricks / Blocks / Blob data.

\- Metadata / Tags / Xactional data.

The deal is that any DB <-> FS has to support both of these separately. The
Bricks/Blocks are optimized for raw performance/seek/access pattern with very
low overhead; The Metadata are optimized for fancy querying / fast lookup
semantics.

Any DB/Key Value store for large data could be modeled on top of a volume
layer, like LVM / ZVOL; This could absolutely become the foundation of a FS as
well.

The deal is that extensible / flexible storage space used to be something
filesystems used to provide, but those functions have been moved to a volume
manager in regular systems. The reliability and metadata part can be moved to
the database layer, so that leaves the filesystem doing a No-Op.

------
deepspace
Atria did something similar almost 30 years ago with MVFS -
[https://www-01.ibm.com/support/docview.wss?uid=swg21230196](https://www-01.ibm.com/support/docview.wss?uid=swg21230196)
\- on top of RDM. The SCM tool (IBM Rational ClearCase) that makes use of MVFS
to map a database of modifications to a file system is still around.

I always admired the concept behind ClearCase / MVFS, though in practice it
was a nightmare to administer.

~~~
pjmlp
ClearCase might be a monster to deal with, but it has several advanced features
that I miss in FOSS alternatives.

One being the ability to share build binaries across users.

------
stackzero
In a similar vein, I ran a little experimental project using redis as a
distributed filesystem using this:
[https://steve.fi/Software/redisfs/](https://steve.fi/Software/redisfs/)

------
MikeTaylor
This kind of thing was one of the foundational notions of Plan 9, back in the
late 1980s and early 1990s. I often wonder how different our world would be if
they'd only had the foresight to release the code freely instead of keeping it
proprietary.

------
hartzell
See also Michael Olson's "The Design and Implementation of the Inversion File
System (1993)", from the 1993 USENIX conference
([https://www.usenix.org/legacy/publications/library/proceedin...](https://www.usenix.org/legacy/publications/library/proceedings/sd93/)),
PDF here:
[http://db.cs.berkeley.edu/papers/S2K-93-28.pdf](http://db.cs.berkeley.edu/papers/S2K-93-28.pdf)

------
burmecia
Databases and filesystems share a lot of similarities, but each still has its
own characteristics. Using a database as a back-end for a filesystem has some
benefits, such as reliable persistence and ACID transactions. It reduces the
complexity of achieving durable storage, which a filesystem would otherwise
have to deal with itself.

Many filesystems are able to use a DB as a back-end. I also made a filesystem,
ZboxFS ([https://github.com/zboxfs/zbox](https://github.com/zboxfs/zbox)),
which can use SQLite as its underlying storage.

------
gfody
I think the real interesting value in putting an RDBMS underneath the
filesystem would come from relationally modeling the objects in the
filesystem, ie files and types and their contents, and then being able to
query those things with SQL. If their relational schema is just inodes and
blocks I don’t really see the value - after all the filesystem is just a
miniature RDBMS purpose-built and optimized for that fixed schema and so it’s
not surprising that their system is so much slower.

------
techntoke
Would be awesome to query a DB, like `ls /db/america/ny/cities` using tags or
something in files to automatically and predictably sort data into
folders/files.
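A toy version of that idea, assuming a made-up `/db` mount point and hypothetical `file` and `file_tag` tables in SQLite: each path component is treated as a tag, and "listing" the directory returns the files that carry every tag in the path.

```python
import sqlite3

def create_tags(conn):
    # Hypothetical schema: files plus a many-to-many table of tags.
    conn.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE file_tag (file_id INTEGER NOT NULL REFERENCES file(id), "
        "tag TEXT NOT NULL, PRIMARY KEY (file_id, tag))"
    )
    sample = {
        "nyc.txt": ["america", "ny", "cities"],
        "albany.txt": ["america", "ny", "cities"],
        "hudson.txt": ["america", "ny", "rivers"],
        "paris.txt": ["europe", "france", "cities"],
    }
    for name, tags in sample.items():
        cur = conn.execute("INSERT INTO file (name) VALUES (?)", (name,))
        conn.executemany(
            "INSERT INTO file_tag (file_id, tag) VALUES (?, ?)",
            [(cur.lastrowid, tag) for tag in tags],
        )

def ls(conn, path, mount="/db"):
    """List files that carry every tag named in the path below `mount`."""
    if not path.startswith(mount):
        raise ValueError(f"{path!r} is not under {mount!r}")
    tags = sorted({p for p in path[len(mount):].split("/") if p})
    if not tags:
        rows = conn.execute("SELECT name FROM file ORDER BY name")
    else:
        marks = ",".join("?" * len(tags))
        # A file matches when the number of its matching tags equals the
        # number of tags asked for; (file_id, tag) is unique, so COUNT(*) works.
        rows = conn.execute(
            "SELECT f.name FROM file AS f "
            "JOIN file_tag AS t ON t.file_id = f.id "
            f"WHERE t.tag IN ({marks}) "
            "GROUP BY f.id HAVING COUNT(*) = ? ORDER BY f.name",
            (*tags, len(tags)),
        )
    return [name for (name,) in rows]

conn = sqlite3.connect(":memory:")
create_tags(conn)
print(ls(conn, "/db/america/ny/cities"))
```

Because tags are a set, `/db/ny/cities/america` lists the same files; exposing this as a real mount would additionally need something like a FUSE layer, which is out of scope for the sketch.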

------
amelius
What happens to performance if you store a database inside this filesystem?

~~~
matharmin
Just speculating, but the database is mostly used for the metadata. If
implemented well, there will be very little overhead for reading and writing
to a file, especially if the size stays constant.

Some databases rely on specific low level file system operations, which may
not be optimised in a niche filesystem. But in theory there shouldn't be
anything preventing it from being optimized, and the database from running at
close to maximum performance.

~~~
amelius
Ok, but if you store only metadata then you lose the advantage of running all
the file operations inside a transaction. Then again, POSIX has no notion of
transactions ...

------
RocketSyntax
A graph database could handle the infinite directory nesting with ease.

