
Apple open-sources FoundationDB - spullara
https://www.foundationdb.org/blog/foundationdb-is-open-source/
======
wwilson
This is INCREDIBLE news! FoundationDB is the greatest piece of software I’ve
ever worked on or used, and an amazing primitive for anybody who’s building
distributed systems.

The short version is that FDB is a massively scalable and fast transactional
distributed database with some of the best testing and fault-tolerance on
earth[1]. It’s in widespread production use at Apple and several other major
companies.

But the really interesting part is that it provides an extremely efficient and
low-level interface for any other system that needs to scalably store
consistent state. At FoundationDB (the company) our initial push was to use
this to write multiple different database frontends with different data models
and query languages (a SQL database, a document database, etc.) which all
stored their data in the same underlying system. A customer could then pick
whichever one they wanted, or even pick a bunch of them and only have to worry
about operating one distributed stateful thing.

But if anything, that’s too modest a vision! It’s trivial to implement the
Zookeeper API on top of FoundationDB, so there’s another thing you don’t have
to run. How about metadata storage for a distributed filesystem? Perfect use
case. How about distributed task queues? Bring it on. How about replacing your
Lucene/ElasticSearch index with something that actually scales and works?
Great idea!

And this is why this move is actually genius for Apple too. There are a
hundred such layers that could be written, SHOULD be written. But Apple is a
focused company, and there’s no reason they should write them all themselves.
Each one that the community produces, however, will help Apple to further
leverage their investment in FoundationDB. It’s really smart.

I could talk about this system for ages, and am happy to answer questions in
this thread. But for now, HUGE congratulations to the FDB team at Apple and
HUGE thanks to the executives and other stakeholders who made this happen.

Now I’m going to go think about what layers I want to build…

[1] Yes, yes, we ran Jepsen on it ourselves and found no problems. In fact,
our everyday testing was way more brutal than Jepsen; I gave a talk about it
here:
[https://www.youtube.com/watch?v=4fFDFbi3toc](https://www.youtube.com/watch?v=4fFDFbi3toc)

~~~
voidmain
Will said what I wanted to say, but: me too. I'm super happy about this and
grateful to the team that made it happen!

(I was one of the co-founders of FoundationDB-the-company and was the
architect of the product for a long time. Now that it's open source, I can
rejoin the community!)

~~~
nlavezzo
Another (non-technical) founder here - and I echo everything voidmain just
said. We built a product that is unmatched in so many important ways, and it's
fantastic that it's available to the world again. Will be exciting to watch a
community grow around it - this is a product that can benefit hugely from OS
contributions as layers that sit on top of the core KV store.

~~~
lmorris84
I wrote a java port of the python counter many moons ago [1]. Will have to
resurrect it!

[1]
[https://github.com/leemorrisdev/foundationcounter](https://github.com/leemorrisdev/foundationcounter)

~~~
voidmain
It should probably be pointed out that atomic increment is in most situations
a more efficient solution for high contention counters in modern FDB.

~~~
lmorris84
Ah I don’t believe that was available last time I used it - I’ll check it out
thanks!

------
SubuSS
This is great news. When I was with Dynamo, FoundationDB was the other green
shore for me :). They did so many things so well.

A tiny bit of caution for folks trying to run systems like this though: It is
frigging hard at any reasonable scale. The whole thing might be documented /
OSS and what not, but very soon you are going to run into problems deep enough
to require very core knowledge to debug and the energy to deep dive, both of
which you probably don't want to invest your time into. Do evaluate
the cloud offerings / supported offerings before spinning these up. Else
ensure you have hired experts who can keep this going. They are great as a
learning tool, pretty hard as an enterprise solution. I have seen the same
issue a ton of times with a bunch of software (redis/kafka/cassandra/mongo...)
by now. IMO, in the stateful world, operating/running the damn thing is 85% of
the work, 15% being active dev. (Stateless world is a little better, but still
painful).

~~~
shawn-butler
This sounds like a business opportunity to me rather than a cautionary note.

I remember all the people who bashed Apple when they acquired FoundationDB. I
hope they are appropriately ashamed now.

~~~
15155
> I remember all the people who bashed Apple when they acquired FoundationDB.

I'm not ashamed about deriding Apple for taking a really, really great
product and hiding it from the world for years to come.

This is definitely some atonement, but does not totally absolve Apple from the
many times they've taken tech private.

------
djtriptych
I went to the same high school as the founders[1]. They were about the 2 best
software engineers in a school with a LOT of very smart software engineers.
Another pair founded Yext, which went public last year. I still consider that
school the group with the highest concentration of raw brain power I've ever
been a part of.

I'm probably a 1% engineer, been hired by M$, FB, and Google. These guys were
light years ahead of me. I'm not sure I'm as good now as they were at like 17
years old. In fact I'm probably only a decent engineer from having observed
the stuff they were doing back then and finding inspiration.

1:
[https://en.wikipedia.org/wiki/Thomas_Jefferson_High_School_f...](https://en.wikipedia.org/wiki/Thomas_Jefferson_High_School_for_Science_and_Technology)

~~~
eitally
I went to UVA and I think about half of the engineering school was from TJ
(including 3 of my roommates). :) I can't think of any superior public high
school (and I went to a different Governor's School in Virginia myself)
anywhere, due to the amazingly large and deep talent pool TJ pulls from.
Nothing like it exists in the Bay Area, that's for sure!

~~~
saagarjha
The top high schools in the Bay Area often give it a run for its money…

------
_xzxj
Google please take a few notes here:

1\. It’s in its own repo

2\. The build instructions are concise and clear. Dependencies are listed. You
have to follow a total of 0 links.

3\. They use a common build system and not an in-house thing.

~~~
itp
Speaking as the original author of this monstrosity of a build system, please
be careful before offering praise here. To be clear, there is a top-level,
non-recursive Makefile that uses the second expansion feature of GNU make,
translating Visual Studio project files into generated Makefile inputs that
are transformed into targets to power the build.

Although it starts by running `make`, it's about as in-house as a thing can
be.

~~~
dboreham
This kind of deep-inside-baseball from-the-horses-mouth interaction is what's
so awesome about HN!

~~~
zbentley
> deep-inside-baseball from-the-horses-mouth

Does the horse choke on the baseball? Is there an equine version of the
Heimlich maneuver to be performed on horses suffering from mixed-metaphorical-
adage-induced asphyxiation?

~~~
SiempreViernes
Yes, you bite the hand that feeds the horse baseballs.

------
panghy
We (Wavefront) have been operating petabyte scale clusters for the last 5 years
with FoundationDB (we got the source code via escrow) and we are super excited
to be involved in the open-sourcing of FDB. We have operated over 50 clusters
on all kinds of AWS instances and I can talk about all the amazing things we
have done with it.

[https://www.wavefront.com/wavefront-foundationdb-open-
source...](https://www.wavefront.com/wavefront-foundationdb-open-source-
project/)

~~~
panghy
We basically replaced MySQL, Zookeeper and HBase with a single KV store that
supports transactions, watches, and scales. It's not a trivial point that you
can just develop code against a single API (finally Java 8 CompletableFutures)
and not have to set up a ton of dependencies when you are building on top of
FDB. We are (obviously) experts at monitoring FoundationDB with Wavefront and
we hope to release the metric harvesting libraries and template dashboards
that we use to do so.

Almost 5 years in and we have not lost any data (but we have lost machines,
connectivity, seen kernel panics, EBS failures, SSD failures, etc., your usual
day in AWS =p).

~~~
qaq
"but we have lost machines, connectivity, seen kernel panics, EBS failures,
SSD failures, etc., your usual day in AWS" <=== This. I wish more people
realized that this is day-to-day reality if you are in AWS at scale.

~~~
koide
As I understand it, it's like that _everywhere_ at scale, not just on AWS, it
being a property of operating at scale.

Or are you saying that AWS is particularly unreliable at scale?

~~~
qaq
On the network side no, it's much more crappy on AWS.

~~~
koide
Which provider is the best, network wise?

~~~
qaq
I only have experience with AWS, on prem, and high quality colo like
Equinix. Possibly due to reduced complexity and having full control over the
networking setup, I saw significantly fewer issues there than on AWS.

------
osrec
I can see everyone's extremely happy about this, which is great. As someone
who's never used it, I'd like to know more about FoundationDB and how it
compares to other offerings such as MySQL or Postgres, and which use cases it
is most suited to. I would especially love to hear the thoughts of those with
direct experience of using Foundation DB. Thanks!

~~~
rando444
Personally I'd be more interested in hearing how this compares to other
distributed noSQL implementations like Cassandra.

~~~
roboyoshi
It probably is better, because Apple switched from Cassandra to FoundationDB,
according to all the rumors [http://www.businessinsider.com/why-apple-bought-
foundationdb...](http://www.businessinsider.com/why-apple-bought-
foundationdb-2015-3?IR=T) . But knowing Apple, they probably won't tell us.

------
saagarjha
It's under Apache 2.0 for those curious:
[https://github.com/apple/foundationdb/blob/master/LICENSE](https://github.com/apple/foundationdb/blob/master/LICENSE).
Also, side note: it looks like this was a private GitHub repository for at
least a couple months, since they have pull requests going back for at least
that long. I find this interesting, since Apple normally "cleans up" history
before open sourcing.

~~~
Longhanks
Swift, too, was published on GitHub including its whole version control
history, dating back to 2010 :)

------
openasocket
I hadn't heard of FoundationDB before, so I did some digging into the
features:
[https://apple.github.io/foundationdb/features.html](https://apple.github.io/foundationdb/features.html)
. It seems to claim ACID transactions with serializable isolation, but also
says later on that it uses MVCC, slower clients won't slow down operations,
and that it allows true interactive queries. I didn't think an MVCC
implementation could provide that level of isolation, and I'm not even sure
how you provide that level of isolation and those other guarantees with any
implementation. Am I misunderstanding something?

~~~
voidmain
I'll try to give you a quick introduction. The architecture talk I recorded
for new engineers working on the product ran to four or five hours, I think
:-). In short, it is serializable optimistic MVCC concurrency.

An FDB transaction roughly works like this, from the client's perspective:

1\. Ask the distributed database for an appropriate (externally consistent)
read version for the transaction

2\. Do reads from a consistent MVCC snapshot at that read version. No matter
what other activity is happening you see an unchanging snapshot of the
database. Keep track of what (ranges of) data you have read

3\. Keep track of the writes you would like to do locally.

4\. If you read something that you have written in the same transaction, use
the write to satisfy the read, providing the illusion of ordering within the
transaction

5\. When and if you decide to commit the transaction, send the read version, a
list of ranges read and writes that you would like to do to the distributed
database.

6\. The distributed database assigns a write version to the transaction and
determines if, between the read and write versions, any other transaction
wrote anything that this transaction read. If so there is a conflict and this
transaction is aborted (the writes are simply not performed). If not then all
the writes happen atomically.

7\. When the transaction is sufficiently durable the database tells the client
and the client can consider the transaction committed (from an external
consistency standpoint)

The implementations of 1 and 6 are not trivial, of course :-)

So a sufficiently "slow client" doing a read write transaction in a database
with lots of contention might wind up retrying its own transaction
indefinitely, but it can't stop other readers or writers from making progress.

It's still the case that if you want great performance overall you want to
minimize conflicts between transactions!
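
The seven steps above can be sketched as a toy, single-process model. This is only an illustration of optimistic MVCC as described in this comment, not FoundationDB's implementation: `ToyDatabase`, `ToyTransaction` and `ConflictError` are hypothetical names, conflicts are tracked per key rather than per key range, and nothing here is distributed or durable.

```python
class ConflictError(Exception):
    """Raised when something this transaction read was overwritten before commit."""


class ToyDatabase:
    """Single-process stand-in for the cluster; all names are hypothetical."""

    def __init__(self):
        self.version = 0    # version of the latest committed transaction
        self.history = {}   # key -> [(commit_version, value), ...], oldest first

    def begin(self):
        # Step 1: hand out a read version.
        return ToyTransaction(self, self.version)

    def read_at(self, key, read_version):
        # Step 2: MVCC read, newest value committed at or before read_version.
        value = None
        for commit_version, committed_value in self.history.get(key, []):
            if commit_version <= read_version:
                value = committed_value
        return value

    def commit(self, read_version, read_keys, writes):
        # Step 6: abort if any key we read was written after our read version.
        for key in read_keys:
            versions = self.history.get(key, [])
            if versions and versions[-1][0] > read_version:
                raise ConflictError(key)
        # No conflict: assign a write version and apply all writes together.
        self.version += 1
        for key, value in writes.items():
            self.history.setdefault(key, []).append((self.version, value))
        return self.version  # Step 7: tell the client it committed.


class ToyTransaction:
    def __init__(self, db, read_version):
        self.db = db
        self.read_version = read_version
        self.read_keys = set()  # step 2: remember what we read
        self.writes = {}        # step 3: buffer writes locally

    def get(self, key):
        if key in self.writes:  # step 4: read your own writes
            return self.writes[key]
        self.read_keys.add(key)
        return self.db.read_at(key, self.read_version)

    def set(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Step 5: send read version, read set and writes to the database.
        return self.db.commit(self.read_version, self.read_keys, self.writes)


def demo():
    db = ToyDatabase()
    setup = db.begin()
    setup.set("balance", 100)
    setup.commit()

    slow = db.begin()
    fast = db.begin()
    seen_by_slow = slow.get("balance")

    fast.set("balance", fast.get("balance") - 30)
    fast.commit()  # the fast writer is never blocked by the slow client

    still_seen_by_slow = slow.get("balance")  # snapshot has not moved
    slow.set("balance", seen_by_slow + 5)
    try:
        slow.commit()
        slow_committed = True
    except ConflictError:
        slow_committed = False  # the slow client has to retry
    return seen_by_slow, still_seen_by_slow, slow_committed, db.begin().get("balance")
```

In `demo()` the slow client keeps seeing its original snapshot and is the one that gets aborted, while the fast writer commits without waiting, which is the "slow clients can't stop others" property described above.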

~~~
ccleve
This is a good explanation of how it happens on a single node. What do you do
when the transaction is distributed? How do you achieve consensus? Is there a
write up on it anywhere?

~~~
wwilson
The only thing that's different in a distributed cluster is the
implementations of steps 1 and 6. As voidmain said, the details of that are
not trivial, ESPECIALLY the details of how it never produces wrong answers
during fault conditions.

I don't know that there's been an exhaustive writeup of that part, but maybe
one of us or somebody on the Apple team will put something together. It
probably won't fit in an HN comment though!

Or... maybe this is the part where I point out that the product is now open-
source, and invite you to read the (mostly very well commented) code. :-)

~~~
veesahni
The documentation ( [https://apple.github.io/foundationdb/technical-
overview.html](https://apple.github.io/foundationdb/technical-overview.html) )
sells the product, but doesn't give a deep enough explanation. As a closed
source product, that's understandable.

Going forward as an opensource product, I hope to see some clarity on the "how
it works"... Distributed, performant ACID sounds good, almost too good to be
true. Not that I doubt it at the moment, I just want to understand it better
:)

------
Osmium
Original thread for when Apple acquired FoundationDB for those curious:
[https://news.ycombinator.com/item?id=9259986](https://news.ycombinator.com/item?id=9259986)

~~~
edwinyzh
Ahah! That thread's so sad vs. this thread's so exciting :)

------
stijnsanders
The Windows installer has Lorem Ipsum instead of EULA text...
[https://twitter.com/stijnsanders/status/987042633691394050](https://twitter.com/stijnsanders/status/987042633691394050)

~~~
max23_
Also, Windows Defender SmartScreen complains about unknown publisher when
trying to execute the installer.

------
helper
I'm very interested in hearing more about what running FoundationDB in
production is like.

I believe that FoundationDB stores rows in lexicographical order by key. Other
databases like Cassandra strongly push you toward not storing data this way as
it can easily lead to hotspots in the cluster. How do you deploy a
FoundationDB cluster without leading to hotspots, or perhaps what operational
actions are available to rebalance data?

~~~
cakoose
Does Cassandra hash the primary key to get a more even distribution?

If you have a sorted data store, you can get the same distribution by keying
off a hash of the "real" primary key, right?

~~~
helper
Cassandra allows you to configure a "partitioner" that determines which nodes
a primary key belongs to. There is a ByteOrderedPartitioner that stores
partitions in order lexicographically by primary key. There is also a Murmur3
hash based partitioner (which is the recommended default).

Cassandra allows you to store multiple records in sorted order within a
partition. The normal recommended way to get data locality is to store records
that are frequently accessed together in the same partition.

------
jaytaylor
Firstly: Wow! this is amazing news!!!

I'm also kind of confused.. is the single repo complete?

What about the SQL Layer [0]? Where is all this stuff in the new GH repo?

Or is only the KV part open source?

Looking forward to some CockroachDB vs. FDB benchmark showdowns :)

[0] [https://github.com/jaytaylor/sql-layer](https://github.com/jaytaylor/sql-
layer)

~~~
cooervo
also super interested in CockroachDB, but I just can't find enough war
stories, or stories of people using it in production...

------
AlphaOne1
I am not a database expert by any means but have been curious about
distributed data systems and had not heard of FoundationDB till now and was
very excited to read about it. On reading through the documentation, I
encountered a section on "Known Limitations"[1] which stated that keys could
not be larger than 10 kB and values cannot be larger than 100 kB. This seems to
be a major limitation. Am I missing something or is this strictly for storing
text?

[1] [https://apple.github.io/foundationdb/known-
limitations.html#...](https://apple.github.io/foundationdb/known-
limitations.html#large-keys-and-values)

~~~
voidmain
Because the data model is ordered, large blobs can and normally should be
mapped to a bunch of adjacent keys and read with a range read, not a single
huge value. That also allows you to read or write just part of one
efficiently.
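
A minimal sketch of that mapping, using an in-memory sorted store as a stand-in for the ordered key space. Everything here is an assumption for illustration: `ToyOrderedStore`, `chunk_key`, the 10,000-byte chunk size and the `\x00` separator (blob names are assumed not to contain that byte) are hypothetical, and this is not the FoundationDB client API.

```python
import bisect
import struct

CHUNK_SIZE = 10_000  # hypothetical; chosen to stay well under the value size limit


class ToyOrderedStore:
    """In-memory ordered key/value store with half-open range reads."""

    def __init__(self):
        self._keys = []    # kept sorted
        self._values = {}

    def set(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get_range(self, begin, end):
        # All (key, value) pairs with begin <= key < end, in key order.
        lo = bisect.bisect_left(self._keys, begin)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def chunk_key(name, index):
    # Big-endian index so lexicographic key order equals chunk order.
    return name + b"\x00" + struct.pack(">Q", index)


def write_blob(store, name, data, chunk_size=CHUNK_SIZE):
    # One key per chunk; the chunks of a blob are adjacent in the key space.
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        store.set(chunk_key(name, index), data[offset:offset + chunk_size])


def read_blob(store, name):
    # One range read over every key that starts with name + b"\x00".
    return b"".join(v for _, v in store.get_range(name + b"\x00", name + b"\x01"))


def read_chunks(store, name, first, last):
    # Partial read: only chunks first..last (inclusive) are fetched.
    begin, end = chunk_key(name, first), chunk_key(name, last + 1)
    return b"".join(v for _, v in store.get_range(begin, end))
```

Note that this toy `write_blob` does not clear leftover chunks when a blob is overwritten with a shorter one; a real layer would clear the old range first.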

------
nlavezzo
For those asking about use in production, Wavefront just posted this:

"Wavefront by VMware’s Ongoing Commitment to the FoundationDB Open Source
Project"

[https://www.wavefront.com/wavefront-foundationdb-open-
source...](https://www.wavefront.com/wavefront-foundationdb-open-source-
project/)

------
adolph
_" because it is an ordered key-value store, FoundationDB can use range reads
to efficiently scan large swaths of data"_

[https://apple.github.io/foundationdb/features.html](https://apple.github.io/foundationdb/features.html)

I wonder how it compares to MUMPS databases like InterSystems Caché and FIS
GT.M?

~~~
chdsbd
It seems that MUMPS doesn't have serializable isolation.

~~~
adolph
Yeah, when I read about that I thought it sounded neat—make sure the index
updates with the data. On the other hand, I think of CAP theorem as an iron
triangle. If you are gaining consistency, what’s the trade off?

~~~
jakevn
This _is_ the usual trade off, but what makes FoundationDB so crazy is that
it's a CP system that has a performance profile that AP systems would have a
hard time matching.

------
sheeshkebab
It’s neat although there is no sql front end.

Bloomberg’s comdb2 was open sourced recently
[https://github.com/bloomberg/comdb2](https://github.com/bloomberg/comdb2) \-
it seems similar, but would be interesting to see comparison.

~~~
pdeva1
comdb2 is not a big data db per se. It compares more to MySQL than
foundationdb [https://blog.dripstat.com/first-look-at-bloombergs-
amazing-c...](https://blog.dripstat.com/first-look-at-bloombergs-amazing-
comdb2/)

~~~
sheeshkebab
Great write up!

------
mavilein
Wow this is some really exciting news! I think it would be amazing to create a
GraphQL API for FoundationDB. Therefore I have created a feature request for
this in the Prisma repo. For those who don't know, Prisma is a GraphQL database
mapper for various databases.

[https://github.com/graphcool/prisma/issues/2240](https://github.com/graphcool/prisma/issues/2240)

------
m0meni
How's this compare to
[https://github.com/pingcap/tikv](https://github.com/pingcap/tikv)? It's a
relatively new distributed KV store written in Rust that also is transactional
and backs the new TiDB database.

~~~
warmwaffles
Well, at first guess I would assume that FoundationDB is more mature than
TiKV, although I have never used either. TiKV looks cool, though.

~~~
scv119
Agreed, technology-wise FDB and TiDB look almost the same.

------
zackmorris
Looks promising, but does anyone know if FoundationDB has external events or
triggers, similar to Firebase or RethinkDB? I can't seem to find much on it.

If not, then a lot of potential is being left on the table, because usage
would require wrapping FoundationDB in a proxy or middleware of some kind to
synthesize events, which can be extremely difficult to get right (due to race
conditions, atomicity issues, etc). Without events, apps can find themselves
polling or rolling their own pub/sub metaphor over and over again. If anyone
with sway is reading this, events are very high on the priority list for me.
Thanks!

~~~
spullara
It has the ability to watch keys so building a notification system on top of
it is pretty easy. Really only limited by your imagination.

~~~
voidmain
In addition to single-key asynchronous watches, there are also versionstamped
ops (for maintaining your own, sophisticated log in a layer) and configurable
key range transaction logging (but see the caveats in my other post on the
topic).

I'm not sure it has every feature it will ever need in this area, but it's a
pretty good starting point for building "reactive" stuff.

~~~
polskibus
Does the trigger execute in scope of triggering transaction with the same
isolation level?

~~~
voidmain
Versionstamped operations and transaction logging are fully transactional.
Watches are asynchronous: they are used to optimize a polling loop that would
"work" without them.
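
To make "optimize a polling loop" concrete, here is a toy sketch. It is not the FoundationDB watch API: `ToyWatchStore`, `get_and_watch` and `wait_for_change` are hypothetical names, and a `threading.Event` stands in for the future a watch would return. The loop is correct with the timeout alone; the watch only lets it wake up sooner.

```python
import threading


class ToyWatchStore:
    """In-memory key/value store with one-shot, per-key change notifications."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}
        self._watchers = {}  # key -> list of Events waiting for a change

    def set(self, key, value):
        with self._lock:
            changed = self._data.get(key) != value
            self._data[key] = value
            events = self._watchers.pop(key, []) if changed else []
        for event in events:
            event.set()  # fire every watch registered on this key

    def get_and_watch(self, key):
        # Read and register the watch under one lock so that a change
        # landing between the read and the wait cannot be missed.
        with self._lock:
            event = threading.Event()
            self._watchers.setdefault(key, []).append(event)
            return self._data.get(key), event


def wait_for_change(store, key, last_seen, poll_interval=1.0, max_polls=10):
    """Polling loop; the watch merely cuts the wait short when it fires."""
    for _ in range(max_polls):
        value, event = store.get_and_watch(key)
        if value != last_seen:
            return value
        # Without the watch this would be a plain sleep(poll_interval).
        event.wait(timeout=poll_interval)
    return last_seen
```

Unfired events simply pile up in this toy; it is meant to show the shape of the loop, not resource management.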

~~~
ryanworl
Would versionstamped operations fit for the log abstraction modeling question
I've asked about here on the forums?

[https://forums.foundationdb.org/t/log-abstraction-on-
foundat...](https://forums.foundationdb.org/t/log-abstraction-on-
foundationdb/117)

~~~
voidmain
Yes.

------
rammy1234
May you do good and not evil.

May you find forgiveness for yourself and forgive others.

May you share freely, never taking more than you give.

From the blessing in their source code notes.

~~~
puddums
That's from sqlite. (Awesome tech, awesome license).

Related snippet from the "Distinctive Features Of SQLite" page[1] from the
sqlite project:

 _The source code files for other SQL database engines typically begin with a
comment describing your legal rights to view and copy that file. The SQLite
source code contains no license since it is not governed by copyright. Instead
of a license, the SQLite source code offers a blessing:

May you do good and not evil. May you find forgiveness for yourself and forgive
others. May you share freely, never taking more than you give._

[1] [https://sqlite.org/different.html](https://sqlite.org/different.html)

------
angrygoat
Awesome - the code is up on github:
[https://github.com/apple/foundationdb/](https://github.com/apple/foundationdb/)

------
arthursilva
This is great news!

Unfortunately it looks like they stripped out some important things, notably
the storage engine (there's now a sqlite fallback).

Edit: Apparently it was always sqlite as per replies below.

~~~
voidmain
The storage engine is and always was a fairly heavily modified asynchronous
version of sqlite's btree. It's been extremely reliable, which was always our
top priority, and the performance isn't bad. But honestly when there was a
problem with it our development velocity improving it wasn't great.

It's super easily pluggable[1], so now that it is open source people can
experiment with other engines. I think there is a lot of room for improvement.
Also architecturally it's designed in anticipation of being able to run
different storage engines for different key ranges and for different replicas.
For example, you might keep one replica in a btree on SSD (for random reads)
and two on spinning disks in a log structured engine.

[1]
[https://github.com/apple/foundationdb/blob/master/fdbserver/...](https://github.com/apple/foundationdb/blob/master/fdbserver/IKeyValueStore.h)

It looks to me like Apple has made a pretty complete release of the key/value
store. What's missing is

(1) Layers! Everything from relational databases to full text search engines
to message queues

(2) Monitoring stuff. Unsurprisingly it doesn't look like we have the tools
for monitoring log files, etc. Wavefront (also a major user!) is a great
commercial solution, but there should be something OSS

~~~
mzeier
Truthfully at Wavefront we've taken the json status directly into telegraf.
Plus a bunch of python tooling to massage additional telemetry on a cluster's
health (coordinator reachability for example).

Plus even more tooling (mostly Ansible) for managing large fleets.

~~~
voidmain
I saw @spullara do pretty neat stuff with our log files in Wavefront.

Will you guys think about open sourcing tooling? Apple is realistically never
going to do that stuff.

~~~
panghy
Everything is fair game, being a monitoring company, we certainly will have
first-class fdb support. We already have tons of workflows and templates.

------
samspenc
How does this stack up against HBase and Cassandra, which seem to have gotten
traction already in the same areas that FoundationDB seems best suited for?

~~~
Bahamut
An interesting fact is that Apple is probably one of the biggest users, if not
the biggest user, of Cassandra out there.

Can't speak to HBase, but one thing Cassandra doesn't guarantee is ACID - I've
seen some data consistency issues that have arisen from Cassandra in our usage,
although it hasn't been a huge problem for us. That difference alone probably
brings a lot of value to FoundationDB.

~~~
Joeri
My understanding of hbase (which could be wrong) is that writes of a
particular key always go to the region master first, so if you read from the
master you always get the latest value of a key. The tricky part is that when
the region master goes offline another one needs to take its place, and you
can get inconsistency or unavailability depending on how it is set up.

I’d like to see a deep dive of how foundationdb handles this. It has to trade-
off consistency for availability at some point and it would be nice to know
exactly where.

~~~
poooogles
From what I understand that shouldn't happen as writes are written to a WAL on
HDFS (been a while since I've used it).

From my memory writes within a row are atomic. It seems to pass Jepsen as well
[1].

1\.
[https://www.google.co.uk/amp/s/yokota.blog/2015/09/30/call-m...](https://www.google.co.uk/amp/s/yokota.blog/2015/09/30/call-
me-maybe-hbase/amp/)

------
groguelon
Originally, FDB was a DB supporting 3 models:

\- KV

\- Document

\- Graph

It seems the announcement concerns only the KV one. Does anyone have
information on the other two?

Thank you.

~~~
nlavezzo
The core product is a distributed, highly fault tolerant ordered Key Value
store with true serializable ACID transactions. All of the layers (including
document, graph) sit on top of that and inherit its ACID properties,
scalability, fault tolerance, etc. It doesn't appear to me that they released
any of the top level layers, but those are MUCH simpler to build, and that's
where the OS community can step in.

~~~
sunw
How is FoundationDB's graph performance/feature set compared to those of other
graph and multi-model databases?

~~~
nlavezzo
FoundationDB at its core is not a graph database. You could build a graph
database on top of it, using FoundationDB as a very strong and feature rich
storage engine, however you'd like. It would be much simpler to do than
building a new (especially a distributed) graph database from scratch.

~~~
pluma
So FDB's role is more comparable to that of, say, RocksDB or LevelDB than an
application-level database like Postgres?

I didn't really pay much attention to Foundation before Apple bought them and
am unsure how it fits in the wider database ecosystem.

~~~
voidmain
You can think of FDB as a _distributed_ storage engine. It has the same low
level data model as the engines you mention, but has distributed transactions,
fault tolerance, automatic data partitioning, operational tooling, etc built
in. So if you build an "application level database" or library on top of it,
it is automatically a distributed database.

------
monkeydust
Can someone explain to a business / product guy why this is so exciting?
(honestly keen to understand)

~~~
voidmain
Well, it's infrastructure. Moreover, it's infrastructure for infrastructure! So
if that sounds super boring you don't have to be excited :-)

But I would tell the story something like this: state storage is the root of
(almost) all operational evil. It's very easy to make a system reliable if
it's totally stateless. Even most bugs can be lived with if the worst you have
to do is restart a service and carry on! But to do anything interesting you
have to store state somewhere, and you have to modify that state concurrently
without screwing it up.

And the many challenges of operating stateful systems are _greatly_ multiplied
if you have a lot of different ones. For example, if you have a datacenter
outage and _some_ but not all of your stateful systems deal with it correctly,
probably your application as a whole is still down.

So as one more stateful system, does FoundationDB just make that worse? Well,
FoundationDB is designed specifically to be a foundation for many very
different stateful systems - not just different kinds of databases but things
like search engines or message queues that you normally don't think of in the
same category. So that almost any system can map to it efficiently, it has a
lowest common denominator data model (key/value) and the highest possible
guarantees in terms of concurrency control. And you can run diverse systems
supporting an application on the same FoundationDB cluster, or on different
clusters with the same exact operational requirements.

Some few users of FoundationDB have been able to get the benefits of this
vision, consolidating lots of different stuff into a single, operationally
desirable system. But for more people to be able to, not only does the
key/value store have to be available to them, but also lots of stuff has to be
built on top of it. By releasing FoundationDB under a very liberal open source
license, Apple has hopefully made that possible. In the long run, hopefully it
will make all server-side computing more reliable.

Also, it's a really good key/value store, if you happen to need one of those!

------
kindkid
Very cool.

I noticed that all the write benchmarks in
[https://apple.github.io/foundationdb/benchmarking.html](https://apple.github.io/foundationdb/benchmarking.html)
are for random writes. Is write throughput affected by highly-sequential
writes (e.g. - time series) vs random writes? How do you avoid hot-spotting on
recent ranges?

How efficient are range deletes?

On
[https://apple.github.io/foundationdb/performance.html](https://apple.github.io/foundationdb/performance.html)
I read "The memory engine is optimized for datasets that entirely fit in
memory, with secondary storage used for durable writes but not reads." I'd
like some clarification:

(1) Which memory does "entirely fit in memory" refer to? A single machine? Or
SingleNodeMemory * Nodes / ReplicationFactor?

(2) If only recently-written data is likely to be queried, and all recently-
written data fits entirely in memory, is that sufficient? If so, would an
unexpected query of old data cause a huge impact on write throughput?

(3) What is the structure/format of the data stored on disk? How is it
updated?

I'm wondering how well this could be used for time series data. I saw mention
here that wavefront uses FoundationDB for this, but would like more details if
any are available.

~~~
voidmain
Sequential writes should be a little faster than random at the individual
storage node level, but if your _entire_ write workload is a single ordered
log, scalability will suffer. It might be theoretically possible for fdb to
scale in this situation by creating shards on the fly during transaction
processing, but no one has seriously tried to make that work.

You can mitigate by designing your key structure/data ordering to not have
that property.
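
One way to do that for time series, sketched under assumptions: prefix each key with a small hash-derived bucket so that concurrent writers land in different key ranges, while each series stays ordered by time inside its bucket. `NUM_BUCKETS`, `bucket_for`, and `timeseries_key` are hypothetical names for illustration, not part of any FoundationDB API, and plain tuples stand in for encoded keys.

```python
import hashlib

# Hypothetical bucket count; roughly the number of key ranges writes are spread over.
NUM_BUCKETS = 16


def bucket_for(series_id: str) -> int:
    """Map a series to a stable bucket using a deterministic hash."""
    digest = hashlib.sha256(series_id.encode("utf-8")).digest()
    return digest[0] % NUM_BUCKETS


def timeseries_key(series_id: str, timestamp_ms: int) -> tuple:
    """Key layout: (bucket, series, time).

    The leading bucket breaks up the single "most recent" range, and the
    trailing timestamp keeps each series ordered by time.
    """
    return (bucket_for(series_id), series_id, timestamp_ms)


# Points of one series still sort chronologically.
keys = sorted(timeseries_key("cpu.host1", t) for t in (3000, 1000, 2000))
```

The trade-off is on the read side: a query for "the last minute across all series" becomes one range read per bucket instead of a single range read.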

The memory engine _requires_ your data to fit in memory (total across all your
nodes, after replication). It writes interleaved snapshots and updates to
disk, and reads the whole dataset back into memory when restarted.
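
As a rough back-of-the-envelope sketch of "total across all your nodes, after replication": the function below is an illustration only, its name and parameters are made up, and it ignores per-key overhead and any headroom a real cluster would need.

```python
def max_dataset_gb(nodes: int, memory_per_node_gb: float, replication_factor: int) -> float:
    """Upper bound on logical data size when every replica must fit in memory.

    Each logical byte is stored replication_factor times, so the cluster's
    combined memory is divided by that factor.
    """
    return nodes * memory_per_node_gb / replication_factor


# Hypothetical cluster: 6 nodes with 32 GB each, triple replication.
capacity = max_dataset_gb(nodes=6, memory_per_node_gb=32, replication_factor=3)
```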

You can do great modeling of time series data in FDB, though it will take some
care and thought.

You should ask these questions on the forum. This article is falling off HN, I
am going to lose track of it, and it doesn't look like the Apple team is
answering questions here.

------
josephg
This is wonderful news. I built a proof-of-concept realtime collaborative
editor on top of foundationdb a few years ago, and was very disappointed when
I couldn't use it in production. I'm really excited to use this in some
projects I'm working on.

Quick question: I know there's a watch API, but is there any way to subscribe
to a change feed from foundationdb? I'd like to consume the FDB event log to
do external indexing & map-reduce work.

~~~
voidmain
Yes. I don't know how well documented it is, but there is an API (well, system
keyspace) that can configure the database to log transactions for a selected
key range (up to and including the whole database) into another selected key
range. It is used by backup and asynchronous replication tools. The format of
the configuration keys and transaction logs should be considered less stable
than the core key/value API, which basically never breaks backward
compatibility, so by using it you are shouldering a maintenance burden to keep
up with e.g. changes in the log format and new types of mutations when new
versions of FoundationDB come along. So it shouldn't be used too casually. But
it's there.

Alternatively, your application or layer can use the "versionstamp" atomic
operations to write its own ordered log of what it is doing, or other indexing
tricks. Depending on your data model this might be able to be much more
efficient. For example, for external indexing you probably don't need to
preserve a history of prior values but only be able to identify all the values
that have changed. This can be done with a very simple and compact index that
doesn't need to duplicate all the data to be indexed.

~~~
josephg
> Alternatively, your application or layer can use the "versionstamp" atomic
> operations to write its own ordered log of what it is doing, or other
> indexing tricks.

I'm not sure I understand.

Are you suggesting having a second key space at `ops/{VERSIONSTAMP}` or
something where values contain enough information about the operation to be
able to process changes in an indexer? The indexer could then clean up after
itself, deleting the operations once they had been ingested? ... Effectively
using a portion of the keyspace as a queue?

~~~
voidmain
Yes. Or if it's not important for the indexer to process things
chronologically, you could just have an index of the primary key (only) of
records that haven't been indexed.

If you are trying to make your external index MVCC, then you will want to
carry some version information too.
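
A minimal sketch of the queue-in-the-keyspace idea discussed above. Everything here is an in-memory stand-in: `ToyStore` is a hypothetical class, a plain counter plays the role of the versionstamp, and a dict plays the role of the ordered keyspace. In a real layer the two writes in `write_record` would belong to one transaction.

```python
import itertools


class ToyStore:
    """In-memory stand-in for an ordered, transactional key/value store."""

    def __init__(self):
        self.data = {}                      # tuple key -> value
        self._version = itertools.count(1)  # stand-in for a commit versionstamp

    def write_record(self, primary_key, value):
        # Write the record and append a log entry keyed by commit order.
        version = next(self._version)
        self.data[("rec", primary_key)] = value
        self.data[("ops", version)] = primary_key

    def pending_ops(self):
        # Equivalent of a range read over the ("ops", ...) subspace, in key order.
        return sorted(k for k in self.data if k[0] == "ops")


def run_indexer(store, external_index):
    """Consume the ops log in commit order, then delete what was ingested."""
    for op_key in store.pending_ops():
        primary_key = store.data[op_key]
        external_index[primary_key] = store.data[("rec", primary_key)]
        del store.data[op_key]


store = ToyStore()
store.write_record("user/1", "alice")
store.write_record("user/2", "bob")
store.write_record("user/1", "alice v2")

index = {}
run_indexer(store, index)
```

Note that the log entries hold only primary keys, not copies of the values, which is the compact variant described above: the indexer looks up the current value when it processes an entry.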

This kind of question might be better served by the new community forum you
can get to from the website!

------
isoos
How does it compare to CockroachDB or TiDB?

~~~
evanweaver
It's closest to TiDB's key-value layer; a building block for more complex
systems. More traditional, monolithic databases like CockroachDB (SQL) or
FaunaDB (NoSQL) trade off extensibility for the benefits in performance and
operations that come from very tight coupling.

In my understanding, FoundationDB's transaction management is closest to
FaunaDB's; read/write sets are linearized in memory in preprocessing nodes and
distributed asynchronously to the replicas rather than locked on the replica
leaders like Spanner or CockroachDB. This is why FoundationDB doesn't support
long-lived transactions.

It's interesting that the FoundationDB team chose to unwind their service
architecture (there used to be separate transaction manager and replica
processes), I assume in the interests of ease of operations.

It is not clear to me how leader election and failover works for the
transaction management role. Maybe somebody from the team can clarify.

~~~
cooervo
Wait, CockroachDB is monolithic? I thought it was distributed.

~~~
evanweaver
I mean it in the terms of monolithic process vs. service-oriented
architecture, distinct from a distributed vs. centralized operational
topology.

FaunaDB and CockroachDB are implemented as monolithic processes and can break
encapsulation boundaries for performance reasons. For example, FaunaDB does
aggressive predicate pushdown to accelerate intersections and joins, which you
cannot do if you have to conform to a key/value interface exclusively. It can
also eliminate all network overhead for query data that's local to the
processing node.

I understand how that terminology is confusing though...how would you explain
it?

------
kawera
Do we know where/how Apple uses FoundationDB in production?

~~~
nlavezzo
Wavefront's co-founder just tweeted this:
[https://twitter.com/panghy/status/987022825457266689](https://twitter.com/panghy/status/987022825457266689)

Finally it's out! @WavefrontHQ manages petabyte scale clusters with
#foundationdb today!

~~~
kbumsik
I'm confused. How could they use FoundationDB when Apple acquired it a long
time ago? Did Apple sell the software to other companies?

~~~
nlavezzo
FoundationDB was a company that existed in the market, licensing our database
technology for quite a while. Licenses don't necessarily terminate upon
acquisitions.

------
dbcurtis
Question for the FoundationDB gurus that are hanging out on this thread: How
well does it deal with spotty connectivity? I'm asking because I work on
mobile robots, and WiFi and/or LTE connections are always coming and going in
unpredictable ways as the vehicle moves about its environment. Reconnecting
every few minutes is normal.

~~~
voidmain
The fault tolerance is pretty much flawless. You won't be able to get the
database "stuck" or see anomalies.

But performance is going to suck if you run server nodes over unreliable
connections. I have trouble seeing a FoundationDB cluster _running on mobile
robots_ as more than a trade show gimmick. Albeit an awesome gimmick. So in
summary you should totally do that.

~~~
dbcurtis
I can see that. There are various vectors to performance. I hear you saying
transactions per second would be unimpressive.

A couple of less time sensitive applications are: 1. distributing information
the entire fleet should eventually know, 2. event log aggregation with fine-
grained time alignment among nodes.

Both are probably silly problems to solve with a database, killing houseflies
with sledgehammers and all that, but it never hurts to explore creative tool
misuse :)

~~~
alexis_read
I think Scuttlebutt might be more suitable for you? Extract the text from a
conversation on each robot. The offline story is easy here.

The other option would be some sort of CRDT-based system - Antidote perhaps?

------
DimitarIbra9987
I found out FoundationDB's Vienna, VA office is closed. Did Apple move
everyone to Cupertino?

------
misterbowfinger
I hate to be that person, but when I hear "ACID transactions in a distributed
database", I hear Citus/Spanner/CockroachDB.

I'm positive that Citus & Spanner are quite different from FoundationDB, but I
have no idea how. Googling didn't help much.

Can someone provide an overview of the differences?

~~~
voidmain
I think Citus is not really ACID.

Spanner (and to an extent its less mature OSS descendants Cockroach and TiKV)
has more comparable goals, but is fairly different architecturally. For
example, FoundationDB only requires N+1 replicas instead of 2N+1 to achieve N
failure tolerance (even lots of databases with much weaker guarantees are in
the latter category!), doesn't trust clocks at all, doesn't lose performance
when transactions cross replica sets, and uses optimistic instead of
pessimistic concurrency.
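
The arithmetic behind the N+1 versus 2N+1 comparison, as a small sketch. The function names are made up for illustration, and the model only counts copies of the data; it says nothing about how either kind of system detects failures or reconfigures itself.

```python
def replicas_for_majority_quorum(failures_tolerated: int) -> int:
    """Majority-quorum replication: a majority must survive, so 2N+1 copies."""
    return 2 * failures_tolerated + 1


def replicas_for_n_plus_one(failures_tolerated: int) -> int:
    """The N+1 scheme described above: one surviving copy is enough."""
    return failures_tolerated + 1


# Compare the storage cost of tolerating 1, 2, and 3 failures.
comparison = {
    n: (replicas_for_n_plus_one(n), replicas_for_majority_quorum(n))
    for n in (1, 2, 3)
}
```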

Also FoundationDB (and TiKV) make a distributed, transactional key/value store
available as an API, while Spanner and Cockroach expose only a relational
database layer. FoundationDB is designed philosophically with the idea that
you want to have a single storage layer to manage operationally but should be
able to mix and match data models and query engines above that layer.

On the other hand, FoundationDB doesn't currently have _any_ full fledged high
level database layer available. Someone will probably dig up our SQL layer
(which was AGPL, I think) but I wouldn't really recommend using it in
production because there is no active development team. Someone will probably
try porting the SQL layers from TiDB and Cockroach.

Maybe Apple will open source more stuff in the future, but let's not get too
greedy!

~~~
wll
The datacenter-aware mode documentation [0] says “Although data will always be
triple replicated in this mode, it may not be replicated across all
datacenters.”

Why is that?

[0]
[https://apple.github.io/foundationdb/configuration.html?data...](https://apple.github.io/foundationdb/configuration.html?datacenter-aware-mode)

~~~
voidmain
I _think_ it's just saying that it's willing to place two of the three
replicas in a datacenter, for example if one of the three datacenters is down.
This has downsides, since losing a datacenter will make it aggressively fill
up disks, but mitigates against subsequent failures causing data loss.

Most of the people who have run FoundationDB at scale have, for performance
reasons, used configurations other than the "datacenter aware" mode for their
inter region replication, so they may not be the strongest thing
operationally.

There is some work that from what I can see in the code is still in progress
to build a new, almost magical inter-region replication mode that I am very
excited about, which combines synchronous replication to a "satellite"
datacenter within region with asynchronous replication between regions and
recovery logic that will finish replication and fail over in case of a partial
failure of a region. You get fast transaction commits (much less than the
inter region ping time), can fail over to a secondary region automatically and
safely (without losing any committed transactions) in the vast majority of
circumstances, and in the _worst_ case you can (manually, because you are
accepting data loss!) give up very recently committed transactions to fail
over.

~~~
wll
How would FoundationDB stay externally consistent with asynchronous cross-
region replication?

Thank you for your time and FoundationDB—along with @nlavezzo, and team(s)!

~~~
voidmain
The satellite mode that I described is an active/passive mode. One region is
accepting reads and writes; the other is just replicating everything. When it
looks like the active region is in trouble, the asynchronous replication is
"finished up" before switching over to the other region. The multiple
datacenters in each region ensure that usually a regional failure will be
"slow enough" that this automatic process (which after all only takes hundreds
of milliseconds to seconds) can usually complete before a region goes away.
And this will be handled pretty transparently by the datastore.

If a region is blown up instantly by an orbital laser cannon, then the
database will go down and you will have to manually tell it to recover ACI in
the other region, sacrificing the durability of whatever committed
transactions in the lost region were destroyed by the laser cannon.

~~~
wll
While that’s perfect to shield from orbital laser cannons, is active/active
geo-independent replication possible?

~~~
voidmain
Well, if you want ACID then you are going to have to pay for at least one
geographic round trip per committed transaction. (So why _not_ go
active/passive, and have at least _one_ of your datacenters be fast?)

But what if you have different pieces of data and you want them to be fast in
different datacenters? I think a great solution to this can be layered on top
of multiple FoundationDB clusters, each using the satellite mode, but this is
one thing that I at least haven't been able to think of a way to provide
properly at the data model agnostic key/value store layer - the details about
what to put where seem fundamentally dependent on your data model.

~~~
wll
> So why _not_ go active/passive, and have at least _one_ of your datacenters
> be fast?

While local writes would stay fast, wouldn’t active/passive see higher-latency
non-local writes than Spanner or Fauna’s (assuming a NAM-EUR-ASIA topology)?

I agree with and do appreciate the multiple FoundationDB clusters suggestion.

~~~
voidmain
I'm speculating, but I think in this mode, from the "slow" datacenters you
would see one round trip time to start a transaction, then reads will be fast
(they can be done safely from your local datacenter because of MVCC), and then
one round trip time to commit the transaction. I think that's as good as
Spanner does with the same geography, but I'm not sure. I think you could get
rid of the first round trip time even without any clock synchronization
nonsense, by speculating on a read version for read/write transactions. And
1xRTT is obviously as fast as physically possible for ACID.

~~~
voidmain
Update: Apparently Spanner is _way_ slower than I thought in "slow"
datacenters, doing a round trip for _every_ transactional read. So this would
absolutely stomp that. (Although spanner, as a higher level system, has the
ability to move different data to be fast in different regions built in, which
is nice, and as I said it will probably have to be a layer feature in the fdb
world)

------
manigandham
I wish Apple just became a customer instead of buying and burying this for 3
years. Glad to see it come back but community momentum is hard to rebuild.

------
borplk
Does anyone want to give a summary of what primitives/structures/algorithms
FoundationDB uses?

------
brisance
The macOS compilation instructions are interesting; it needs Boost, Mono and a
JDK.

------
dtheodor
Now that it is open source and the system's architecture is in the open, I
wonder how this criticism of FoundationDB from a VoltDB architect
[https://www.voltdb.com/blog/2015/04/01/foundationdbs-lesson-...](https://www.voltdb.com/blog/2015/04/01/foundationdbs-lesson-fast-key-value-store-not-enough/)
fares against the knowledge that has now become available. (To summarize, the
author argues that building an SQL layer on top of an ordered key-value store
is suboptimal.)

------
ah-
This is amazing! Is there a roadmap of what is being worked on or is this
mainly a dump of their existing codebase?

From looking at the commit history it seems like this is pretty actively
developed.

------
tomc1985
I remember reading about Apple purchasing FoundationDB and everyone being
disappointed and hoping they'd open source it. And now they have!

------
Rygu
Found the NodeJS bindings in the 5.0 branch:
[https://github.com/apple/foundationdb/blob/release-5.0/bindi...](https://github.com/apple/foundationdb/blob/release-5.0/bindings/nodejs/package.json.in)

Seems dated as it requires NodeJS either 0.8 or 0.10.

~~~
faizshah
More info on this in this issue:
[https://github.com/apple/foundationdb/issues/129](https://github.com/apple/foundationdb/issues/129)

------
wejick
I'd never heard of fdb, but the concept is very intriguing. We know of many
NoSQL databases built on top of a KV store such as RocksDB or LevelDB (like
Dgraph before it moved to a new storage engine). Even then, a lot more has to
be written before they can be called distributed and scalable. By using
FoundationDB we can skip many of those parts and focus on others, like the
query language and API. That's why I like the layer concept; unfortunately
there is very little documentation about it. I found some layers written in
Python in the repo, but I don't understand where a layer sits in the general
architecture. I thought it would be like a plugin, but I think that's not the
case.

~~~
voidmain
A "layer" uses the fdb client much as it might use rocksdb. The layer can be a
library embedded in your application, or a network service, it's up to you.
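
To make that concrete, here is a toy "layer" written as an ordinary library on top of a key/value interface. `ToyKV` is a dict-backed stand-in invented for this sketch, not the fdb client, and `UserLayer` is a hypothetical layer that keeps a secondary index next to the records it stores. With a real transactional store, each `put` would run inside one transaction so the record and its index entry never disagree.

```python
class ToyKV:
    """Dict-backed stand-in for the key/value client a layer would call."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def clear(self, key):
        self._data.pop(key, None)

    def get_range(self, prefix):
        # All (key, value) pairs whose tuple key starts with prefix, in key order.
        n = len(prefix)
        return sorted((k, v) for k, v in self._data.items() if k[:n] == prefix)


class UserLayer:
    """A tiny data model: users with a city, plus an index by city."""

    def __init__(self, kv):
        self.kv = kv

    def put(self, user_id, city):
        old_city = self.kv.get(("user", user_id))
        if old_city is not None:
            # Remove the stale index entry before writing the new one.
            self.kv.clear(("by_city", old_city, user_id))
        self.kv.set(("user", user_id), city)
        self.kv.set(("by_city", city, user_id), "")

    def users_in(self, city):
        return [key[2] for key, _ in self.kv.get_range(("by_city", city))]


users = UserLayer(ToyKV())
users.put("u1", "Vienna")
users.put("u2", "Cupertino")
users.put("u3", "Vienna")
users.put("u1", "Cupertino")  # u1 moves; the index follows
```

The application only ever talks to `UserLayer`; whether that class lives in-process or behind a network service is a deployment choice, as described above.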

------
ChuckMcM
Wow this is super exciting. I had a half dozen things that I thought FDB would
be good for and then poof! it got sucked up into the Apple spaceship. Now to
have it emerge unscathed is pretty awesome.

Another step closer to a library appliance.

------
Rafuino
Found this section on performance very interesting.

[https://apple.github.io/foundationdb/performance.html#throug...](https://apple.github.io/foundationdb/performance.html#throughput-per-core)

I wonder what the SSD engine performance would look like with NVMe standard NAND or
an Optane SSD instead of SATA. Any FoundationDB guys/gals on this thread able
to comment?

Another Q: what's more commonly used in current FoundationDB deployments:
memory engine or storage engine?

~~~
panghy
We run both, memory engine hasn't seen too many updates in the last 3 years
though. For SSDs, we use i3 as well as EBS-backed volumes.

------
edpichler
Wow, a lot of people excited here, with several use cases of success. I am
impressed. I didn't know that FoundationDB was so good till now.

------
bobjordan
Nice documentation around the Python API. At this point, the docs state it
supports 2.7-3.4. How much work is required to catch it up to 3.6
compatibility?
[https://apple.github.io/foundationdb/api-python.html](https://apple.github.io/foundationdb/api-python.html)

------
nickpsecurity
For anyone interested in learning more, you can still read their blog articles
via Wayback here:

[https://web.archive.org/web/20150304035646/http://blog.found...](https://web.archive.org/web/20150304035646/http://blog.foundationdb.com)

------
ile
It is strange that ArangoDB isn't mentioned often on HN, or in this thread. It
is a multi-model (KV, document, graph) database (with transactions) and I have
been happily using it for a while now.

I haven't scaled it to any large installations yet, so I can't speak about how
well it does that.

------
rurounijones
Seems like a pretty comprehensive release in terms of OS and language support.

One oddity I see is that the ruby gem is not available on rubygems.org and
therefore cannot be easily installed and maintained using the ruby package
manager which is a bit of a pain.

------
polskibus
Is FoundationDB capable of providing a linearizable data store? As I remember
from Martin Kleppmann's book, Serializable Snapshot Isolation is not
linearizable because the snapshot does not include writes more recent than
itself.

~~~
voidmain
Yes. Linearizable means both serializable and externally consistent (or is
sometimes used as just a synonym for the latter), and FDB has these properties
with respect to transactions.

~~~
polskibus
That would mean that SSI is not used to provide the Serializable isolation
level. If so, what is used instead? 2 phase locking? I thought it's not very
scalable?

~~~
voidmain
I explain the basics of our concurrency control here:
[https://news.ycombinator.com/item?id=16877950](https://news.ycombinator.com/item?id=16877950)

I guess textbook SSI is willing to "reorder" conflicting transactions if the
result is still serializable, which could violate external consistency if you
don't have any other bounds on the order. In the language of SSI, fdb simply
aborts the later of any pair of read/write transactions with an rw-conflict,
in accordance with a fixed ordering which is externally consistent.
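
A toy model of that rule, to make it concrete. This is a simplified sketch and not FoundationDB's actual resolver: `Resolver`, `begin`, and `try_commit` are hypothetical names, keys are checked one at a time rather than as ranges, and everything lives in one process. A transaction is aborted if any key it read was overwritten by a commit that happened after its read version.

```python
class Resolver:
    """Decides commit or abort using a single global commit order."""

    def __init__(self):
        self.commit_version = 0
        self.last_write = {}  # key -> version of the most recent committed write

    def begin(self):
        # A transaction reads as of the latest committed version.
        return self.commit_version

    def try_commit(self, read_version, read_keys, write_keys):
        # rw-conflict check: did someone commit a write to a key we read,
        # after the version we read at?
        for key in read_keys:
            if self.last_write.get(key, 0) > read_version:
                return None  # abort; the client would retry
        self.commit_version += 1
        for key in write_keys:
            self.last_write[key] = self.commit_version
        return self.commit_version


resolver = Resolver()
t1 = resolver.begin()
t2 = resolver.begin()
first = resolver.try_commit(t1, read_keys={"x"}, write_keys={"x"})   # commits
second = resolver.try_commit(t2, read_keys={"x"}, write_keys={"x"})  # aborts
```

Because versions are handed out in commit order and the later transaction is always the one that loses, the resulting history follows one fixed order, which is the property described above.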

I guess it could also be that your book uses an idiosyncratic definition of
linearizable, like trying to apply it to individual operations within
transactions, which might rule out any optimistic concurrency method. It might
just be better to delete this word from your vocabulary in the database field
because there is no wide agreement on what it means. The first two hits on
Google for me are Wikipedia and Peter Bailis, and they give clearly
conflicting definitions, though I think fdb satisfies both!

~~~
polskibus
Thanks, I'd love to have a bit more of your attention, foundationdb seems very
interesting but I need to know a bit more :)

Let me expand the definition in Kleppmann's book then. I think it is important
because it creates a difference between SSI and typical Serializable level
based on 2PL. The below is paraphrasing the definitions on p. 324-329. The
book references
[http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf](http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf).
(I must admit, I read the book, not the paper).

Basic idea - make a system appear as if there were only one copy of the data
and ALL operations on it are atomic. In this model, there may be replicas, but
we don't care about them. As soon as a client completes a write to the db, all
clients reading the db must be able to see the value just written.

In SSI this is not true, because the snapshot may not include writes more
recent than the snapshot -> reads from the snapshot are not linearizable.

Linearizable CAS register is equivalent to consensus, and can provide total
order. It is therefore what most developers would love to have (if cost was
not an issue :) )

~~~
voidmain
From the paper you link:

"A history is serializable if it is equivalent to one in which transactions
appear to execute sequentially, i.e., without interleaving... A history is
strictly serializable if the transactions’ order in the sequential history is
compatible with their precedence order... Linearizability can be viewed as a
special case of strict serializability where transactions are restricted to
consist of a single operation applied to a single object."

In these terms, FoundationDB has the strict serializability property, and thus
if you do exactly one operation in each FoundationDB transaction then that is
linearizable.

But that kind of linearizability is much less powerful than what FoundationDB
actually gives you. You cannot efficiently maintain global invariants, like
indexes, with single-operational linearizability. I don't think this
definition is very useful! I think strict serializability (which is to say
serializability & external consistency) is what you actually want.

A linearizable CAS register can be implemented in FDB as simply as this:

    
    
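      import fdb
      fdb.api_version(510)  # select an API version before using the rest of the module

      @fdb.transactional
      def compare_and_set(tr, key, vold, vnew):
          # The read and the conditional write commit atomically in one transaction.
          if tr[key] == vold:
              tr[key] = vnew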
    

but this is not the _limit_ of what you can do.

~~~
polskibus
Thank you very much for your in-depth explanation. I believe the only thing
left for me is to run FDB myself, sounds very promising :) FDB replacing
ZooKeeper + something else would reduce the complexity of the target
distributed system, almost too good to be true.

------
ram_rar
This is such great news. I saw the FoundationDB guys present at the New
England Database Summit a couple of years ago. Their demo reminded me of Sun
Microsystems' famous demo of screwing their hard disk.

------
DimitarIbra9987
Is the transaction authority in-memory or on disk? This architecture seems
kinda clunky. Wondering what the performance is like, and is this a good fit
for quick k-v store use cases?

~~~
YongMan
I also wonder about the performance when there are concurrent writes to a few
keys.

------
sleepyams
Does anyone know if the python API for fdb works with asyncio?

~~~
sleepyams
Answered my own question (seems like the answer is yes):

[https://github.com/apple/foundationdb/blob/master/bindings/p...](https://github.com/apple/foundationdb/blob/master/bindings/python/fdb/impl.py#L1584)

------
rakibtg
So you need a mac book pro to use foundation db? :D

------
foobarbazetc
Is there a canonical php client for this?

------
dman
Noticed the Visual Studio project files. Was Windows a primary development
platform for foundationdb?

~~~
panghy
Nope, the actor compiler is done with Mono and you need Windows to compile
the Windows client. When we were independently working on it, we just ditched
the Windows code completely (we are a Mac/Linux shop).

~~~
polskibus
Does it mean that the client is unusable on Windows?

------
fooster
Does anyone know if the SQL layer was open sourced, or whether there are any
plans to open source it?

------
foota
This seems like it would be really great for large game servers, I'm thinking
things like eve.

------
jinqueeny
Congratulations! It’s very exciting to see FoundationDB back in the open source
community!

------
NelsonMinar
Why did it take three years? Apple killed FoundationDB as a product in March
2015.

------
truth_seeker
I think the Apache Ignite distributed KV store is a much better choice, as it
has 2PC, indexes, distributed computing idioms and can be embedded as a library
in a JVM app. Plus it supports a SQL engine, thanks to the H2 SQL parser engine.

You can also create graph layer over it using gremlin in a day or two.

------
bogomipz
Does anyone know what services at Apple are built on top of FoundationDB?

------
throwaway6497
Broken documentation: Architecture image is missing.
[https://apple.github.io/foundationdb/architecture.html](https://apple.github.io/foundationdb/architecture.html)

Hope someone from the team reads this.

~~~
zzzcpan
It's a pdf:
[https://apple.github.io/foundationdb/_images/Architecture.pd...](https://apple.github.io/foundationdb/_images/Architecture.pdf)

~~~
jayrhynas
Looks like Safari will happily load a pdf in an img tag, but Chrome won't

~~~
lut4rp
I wonder if that's a potential security issue waiting to happen.

------
dboreham
Is it known whether Apple runs this code in production?

------
thx4thefish
This seems nice. Besides a bunch of fanboy comments coming from the creators
and devs, why is this exciting to the rest of us where things like the
capability to join tables in an rdbms is trivial.

~~~
thx4thefish
Thanks for the downvote. My question still remains.

~~~
deepanchor
Didn't downvote you, but I believe the excitement is over FoundationDB's
ability to perform ACID compliant distributed transactions without sacrificing
performance, which to my knowledge no current RDBMS or even NoSQL can do.

~~~
pacala
> no current RDBMS can do

[https://cloud.google.com/spanner/](https://cloud.google.com/spanner/) ?

~~~
deepanchor
Meant to say "no current open-source RDBMS"

------
EGreg
How does FoundationDB compare against CockroachDB?

------
gigatexal
this might not be the right place to ask, but if running foundationdb in a
container, how does one connect to it via python?

------
killertypo
this is amazing and this is huge for us - FoundationDB is one of the fastest
most scalable distributed KV stores.

------
pron
Any documents on the core algorithms?

------
wll
What fantastic news!

------
shahbaz16
Honest question - why are there so many databases?

------
hintzemichael89
Very cool

------
uasnew
How does this compare to redis?

------
daveheq
So what does Facebook get out of it?

------
xvilka
These days most people know Hadoop for distributed storage. In my opinion,
though, Ceph [1] has the bigger potential.

[1] [https://ceph.com/](https://ceph.com/)

~~~
manigandham
There are several distributed storage systems like Ceph and they all have
problems. Ceph is not good because it's an object storage system trying to
provide block storage and a filesystem on top, which will never work well.

~~~
noahdesu
Not that this really has anything to do with FoundationDB, but why do you say
that object storage is a poor substrate for file and block abstractions? There
are many high-performance block and file systems built on object storage.

~~~
manigandham
Block level is the lowest form of addressing bytes on devices. Filesystems are
an abstraction on top of block devices. Object stores are an abstraction on
filesystems.

Emulating a low-level layer on a higher-level abstraction (which itself is
using this hierarchy) will never match the speed, scale, or reliability of
doing it correctly.

~~~
catwell
> Block level is the lowest form of addressing bytes on devices. Filesystems
> are an abstraction on top of block devices. Object stores are an abstraction
> on filesystems.

I don't agree with this, but I think you may be confused because "Object
Storage" can mean several different things.

"Object Store" in Ceph (as in RADOS - Reliable Autonomic Distributed Object
Store) basically means key-value store. I typically say "blob store" instead
to avoid the confusion with more sophisticated systems. It is exposed through
an S3-like API. As far as I know, this layer of Ceph is pretty good, and you
need a layer like this in most distributed systems anyway.

Ceph provides something called RBD, RADOS Block Device, which exposes a Block
Device interface and is implemented on top of RADOS blob storage. It is useful
for VM disks and has decent performance because it makes heavy use of the
cache.

Some people use filesystems on top of RBD, but as far as I know CephFS itself
does not sit on top of RBD. It is not as widely used as RBD because it is
pretty recent (first release in 2016). The data is stored in RADOS and the
metadata (which is the hardest part in a distributed filesystem) is dealt with
by a Metadata Server cluster (MDS). This sounds like a typical distributed
filesystem architecture to me, similar to GFS (the MDS replaces the GFS master
and RADOS is used instead of chunk servers).

People tend to have a lot of issues with Ceph, but I think this is because:

1) It is used in reasonably large scale production settings where you are
going to have issues anyway;

2) It is not as easy to understand and fine-tune as it should be;

3) Some people expect it to solve all their issues magically with perfect
performance;

4) Some people use filesystems on top of RBD when they should have used CephFS
or even direct interfaces to RADOS when possible.

But in general, I think Ceph is an example of a decently architected complex
distributed system.

~~~
manigandham
Sure, and key/value systems are at a similar level to object stores, meaning
they are abstractions on filesystems (which are abstractions on block
devices). This is the hierarchy.

Using Ceph for block and file access is like using AWS S3 to emulate block
devices and filesystems. It'll work, and there is software for it, but it will
never be very good. And Ceph is far from S3.

~~~
noahdesu
> It'll work, and there is software for it, but it will never be very good

What are some examples of distributed file systems and block devices that
_are_ very good?

------
alexnewman
If this had been done years ago it would have been a big deal. Now I worry the
competition will make this a non-event.

~~~
eropple
To the best of my knowledge (n.b.: I am not an expert, though I follow this
field) there still aren't any direct competitors to the breadth of what
FoundationDB can do and do well.

~~~
throwaway84742
Google Spanner? Not OSS, but globally consistent, scalable, and extensively
battle tested.

~~~
eropple
Spanner can do global consistency and (some?) transactions but I'm unaware of
it being able to do the sort of layering Foundation can internally to expose
largely different database _forms_ on top of it. I have never used Spanner,
though, so I'm open to being corrected.

~~~
throwaway84742
What do you mean by “forms”? Spanner is also layered. The bottom layer is
basically a key-value store. On top of that there’s a full blown SQL layer,
which, BTW can work with hierarchical records as well as flat tables. Both
support transactions and guarantee global consistency.

~~~
wll
They may refer to FoundationDB layers. [1] While Spanner may be built on a
key-value store equivalent, Google does not expose it as a service.

[1] [https://apple.github.io/foundationdb/layer-concept.html](https://apple.github.io/foundationdb/layer-concept.html)

~~~
eropple
That's exactly what I was referring to. Thanks.

------
protomyth
Apple builds their OS with a lot of software from FreeBSD, but when they
open source FoundationDB they don't provide a distribution that will work on
it. I know the license says Apple doesn't have to do anything, but it just
seems wrong that they didn't provide a download.

~~~
cnlwsu
[https://www.foundationdb.org/download/](https://www.foundationdb.org/download/)

~~~
jaboutboul
And there we go...

~~~
protomyth
I don’t see FreeBSD listed.

~~~
rurounijones
Based on your earlier comment I guess you think that Apple owes the FreeBSD
community a version of FoundationDB that will work with it because Apple uses
FreeBSD technology.

I make no comment on the validity of the stance but I think you are probably
in the minority.

However, since it is now open source, it is at least possible for someone to
do the work to get it running on FreeBSD.

