
A database built like an operating system: the architecture of FaunaDB - evanweaver
https://fauna.com/blog/a-database-built-like-an-operating-system
======
jandrewrogers
There is an old engineering aphorism that proper database engines are complete
operating systems implemented in user-space. This is true both literally and
figuratively; most sophisticated database kernels treat the underlying
operating system as a glorified device driver.

It is why they are both so complex to implement and so much fun to build.

------
hullsean
When I was first introduced to Oracle years ago, I was surprised at how much
it acted like an operating system inside. Once connected to the shell you
could find processes running, examine memory through data dictionary queries
and so forth. Tuning it involved looking at disk I/O patterns, memory usage
and making your loops tighter. In the SQL world that meant indexing &
reviewing plans for joins carefully.

MySQL & Postgres carry some of those same attributes. Even though they are
user processes and the OS controls the lower levels, you would typically
deploy them on their own servers, and give them all the memory & disk I/O you
could, because they were their own operating systems essentially.

It's not clear to me how building a database "like an operating system" is
something new. What I did see new in there was a consistency checking tool.
That is super important. Because in a distributed environment you're bound to
have drift.

Also I'd like to see how they managed to get full ACID Compliance. Fast
network interfaces are great but there is still some latency which means
hiccups when multiple nodes try to update the same row.

~~~
freels
I agree with you that it's not an original analogy, in fact I think you can
safely say that most larger, performance-oriented systems end up looking more
like an OS than not. There's a large overlap in the problems that need to be
solved and the resources available to distributed systems, that I'd be more
skeptical of an implementation that didn't lean heavily on the OS, or at least
borrow the general patterns.

To answer your question about consistency, we have implemented a transaction
resolution mechanism inspired by Calvin. The whitepaper contains a lot more
detail, but the short answer is that we use a consistent, distributed
transaction log to provide a global order of all read-write transactions,
which are executed deterministically on the responsible data partitions. This
allows FaunaDB to provide a guarantee of strict-serializability for read-write
transactions. Read-only transactions are by default a bit more relaxed and
provide a hard guarantee of serializability in order to avoid global
coordination overhead.

------
cr0sh
Hey, they re-invented PICK!

[https://en.wikipedia.org/wiki/Pick_operating_system](https://en.wikipedia.org/wiki/Pick_operating_system)

(actually, I don't know much about this new system to be commenting on it -
but whenever I see something with keywords like "database operating system" \-
my mind instantly drifts to PICK, because that was the first system I used
when I started my software development career)

~~~
sixdimensional
I thought the same - started out using PICK as well. Wow, I thought I was one
of the few on HN. Nice to know you!

I felt the same way when "NoSQL", document dbs, etc. came out. Ok, yeah, now
we are distributed and massive scale with better tooling... but.. same/similar
concepts.

~~~
nthcolumn
Me too. What is this 'Cocoon'?

------
borplk
Similarly I'd love to see an operating system that operated like a database.

Hate the arbitrary ascii-file nonsense. If the state of the operating system
is in a proper ACID database it would be so nice. Tooling could make it just
as easy as ascii files but under the hood it would be better.

~~~
danellis
Aren't filesystems databases? The "arbitrary ascii-file nonsense" is just like
storing text fields in your database.

~~~
krylon
In a way, yes, definitely.

But one does not get transactions. I would love a filesystem with
transactional semantics. Surprisingly often, applications use some form of
relational database not because their data is such a good fit for that model,
but because they want/need transactions. In theory, one could roll their own
transaction layer, after all, that's what RDBMSs do. But I have a hunch that
getting transactions to work correctly without totally killing performance is
not exactly trivial.

~~~
vbezhenar
NTFS definitely have transactions. I'm not sure about other file systems.

~~~
krylon
I had totally forgotten about that, you are right, NTFS does have a
transaction API.

I think I remember the API being marked as deprecated, though. Or is my memory
playing tricks on me again?

~~~
vbezhenar
Yes, they deprecated it, claiming that developers didn't use it:
[https://msdn.microsoft.com/en-
us/library/windows/desktop/hh8...](https://msdn.microsoft.com/en-
us/library/windows/desktop/hh802690\(v=vs.85\).aspx)

~~~
skissane
IMO, part of why developers didn't use it was the API design made using it too
painful. They took most of the existing file APIs and added new *Transacted
APIs alongside them, e.g. CreateFile begot CreateFileTransacted, etc. So to
add filesystem transactions to your existing code, you had to change all the
file API calls.

I think, if they had made transactions work with the existing file API–such as
by storing the hTransaction in thread-local storage, with an API to get/set
the transaction association of the current thread, and having the existing
non-transactional APIs behave transactionally in that case–it might have saw
more adoption by developers.

------
evanweaver
Hey everybody. Direct link to the PDF is: [https://fauna.com/pdf/FaunaDB-
Technical-Whitepaper.pdf](https://fauna.com/pdf/FaunaDB-Technical-
Whitepaper.pdf)

------
ablock
Daytona from AT&T Labs built the database into the OS...I don’t think it
worked very well. The inverted model makes more sense. A process per query
isn't very scalable.

~~~
sargun
Go on? Postgresql does process per session / query, and it scales ridiculously
well, especially with the hilariously multi-core systems we have today.

Although it doesn't handle unbounded concurrency, there are techniques (like
manual locking, or pgbouncer) to deal with this.

~~~
eternalban
Not sure how to read that. Isn't simply 1 process per connection?

~~~
sargun
Yes. But you can't run more than one query per connection.

~~~
eternalban
One _executing_ query per connection, meaning a connection maintains a
serialized query processing model.

------
quantum_state
Technical issues aside, as a db service platform provider, the statement in
its Terms, "FI adheres to policies and procedures to prevent data and
information loss, including a daily back-up regime, but does not make any
guarantees that there will be no loss of data or information, or that FI
policies and procedures will conform to industry best practices. FI expressly
excludes and disclaims any liability for any loss of Licensee's data and
information, no matter how such loss was caused." will make any serious user
think twice or more times ... Would welcome further comments.

~~~
zitterbewegung
Sounds like standard legal boilerplate. AWS has similar wording in their
terms. [https://aws.amazon.com/service-terms/](https://aws.amazon.com/service-
terms/) Search for loss of data.

------
batoure
The "starting from thinking about this like its linux instead of thinking of
it like its oracle" description is one I regularly use in Hadoop talks for
less technical audiences.

The paper is interesting, it leaves me with more questions than answers.

The biggest question is: -is this a new beast from the ground up or more
efficient packaging around existing distributed computing libraries?

\--For example when I hear a company say something like:

> based on log-structured merge trees (similar to Google Bigtable

I want to understand if they are just rebranding HBase for that part of the
tool

~~~
evanweaver
It is a new beast (good pun, by the way). There are no service dependencies
like HBase or Zookeeper, and there is not really any packaging to speak of
since it's just a JAR file.

Some open source is used internally, especially Netty, but our goal has always
been to build a tightly coupled system for performance reasons. For example,
the entire replication pipeline including the optimized Raft implementation is
from scratch, as is the scheduler.

LSM trees perform well especially on SSDs, but we probably will migrate away
from them eventually to something closer to LMDB in order to get a zero-copy
read path.

------
whargarbl
"FaunaDB implements a process scheduler that dynamically allocates resources
and enforces quality of service...Execution proceeds via cooperative
multitasking."

Just like Windows 95! I don't think I've ever seen a database do this.
Apparently it operates per-query or at least per application. Operating system
seems right; this is a long paper.

~~~
electrum
Presto does this as well. Execution is done by "drivers" which move data pages
between operators. It's a cooperative push model rather than a Volcano
(iterator) style pull model. There are typically more drivers than worker
threads. Drivers are assigned to threads using multilevel feedback queues
(inspired by OS schedulers).

------
corysama
So, can we get this running on a microkernel? Or, Fauna-on-Xen, anybody? :]

------
elvinyung
How does FaunaDB plan to support or enable many OLAP use cases? As far as I
understand it, there are no interactive query sessions due to Calvin, but can
I easily e.g. ETL things into a data warehouse?

~~~
freels
A better way to think about how Calvin fits into our stack is as a building
block similar to two-phase-commit, on which we've built the higher level
transaction semantics. And while we do not have any plans on supporting
session _transactions_ , this is more of a design constraint to support the
best possible OLTP experience rather than a hard technical limitation.

Since our focus has been getting OLTP right, ETLs are currently handled by
chunking record imports across multiple transactions. (We have internal
tooling for this that we've used to help customers move their data into
FaunaDB; we are planning on open-sourcing it soon.)

However, first-class support for OLAP use-cases is on our roadmap and
something we are excited about providing, so expect more from us about this in
the coming months.

~~~
elvinyung
Thanks!

------
pbnjay
Wasn't the Informix Database effectively an operating system? I'm sure it
wasn't the only one. This paper makes it sound like they're the only one.

------
zenithm
Detailed consistency model explanation starts on page 15.

------
socmag
Performance numbers anyone?

~~~
evanweaver
Write benchmarks, although the latest numbers are better still:
[https://fauna.com/blog/distributed-acid-transaction-
performa...](https://fauna.com/blog/distributed-acid-transaction-performance)

Read benchmarks are coming soon.

~~~
socmag
Great, thank you. Apologies for not going fishing for it myself. :+1:

------
frik
Some older operating systems used a database-filesystem or database like
filesystem as storage.

I heard older IBM systems had something like this. BeOS had BeFS with inbuilt
metadata index. NTFS + WinFS in user land vision was a bit like this. Office
server aka MS Sharepoint (which implemented WinFS vision for intranet) stores
all files in the database.

~~~
skissane
> I heard older IBM systems had something like this.

Most (non-Unix) mainframe and minicomputer operating systems – not just IBM's
– have support for record-oriented files and indexed files in the filesystem.
The functionality is roughly equivalent to that of non-relational "flat file"
databases or key value stores. Examples include ISAM and VSAM under z/OS (aka
MVS), and RMS under OpenVMS. This is in contrast to Unix and Windows, where
the filesystem itself only supports files as unstructured bytestreams, and any
record structure or index structure is imposed by higher levels such as
applications or shared libraries.

You might also be thinking of the IBM minicomputer operating system OS/400
(nowadays called "IBM i"), which embeds a relational database (a variant of
DB2) into the operating system. (I have often wondered how deep the
integration actually is–I believe it is something deeper than just bundling a
relational database with an OS, like how many Linux distributions include
MySQL or Postgres, but I don't really know.)

