The short version is that FDB is a massively scalable and fast transactional distributed database with some of the best testing and fault-tolerance on earth. It’s in widespread production use at Apple and several other major companies.
But the really interesting part is that it provides an extremely efficient and low-level interface for any other system that needs to scalably store consistent state. At FoundationDB (the company) our initial push was to use this to write multiple different database frontends with different data models and query languages (a SQL database, a document database, etc.) which all stored their data in the same underlying system. A customer could then pick whichever one they wanted, or even pick a bunch of them and only have to worry about operating one distributed stateful thing.
But if anything, that’s too modest a vision! It’s trivial to implement the Zookeeper API on top of FoundationDB, so there’s another thing you don’t have to run. How about metadata storage for a distributed filesystem? Perfect use case. How about distributed task queues? Bring it on. How about replacing your Lucene/ElasticSearch index with something that actually scales and works? Great idea!
And this is why this move is actually genius for Apple too. There are a hundred such layers that could be written, SHOULD be written. But Apple is a focused company, and there’s no reason they should write them all themselves. Each one that the community produces, however, will help Apple to further leverage their investment in FoundationDB. It’s really smart.
I could talk about this system for ages, and am happy to answer questions in this thread. But for now, HUGE congratulations to the FDB team at Apple and HUGE thanks to the executives and other stakeholders who made this happen.
Now I’m going to go think about what layers I want to build…
 Yes, yes, we ran Jepsen on it ourselves and found no problems. In fact, our everyday testing was way more brutal than Jepsen, I gave a talk about it here: https://www.youtube.com/watch?v=4fFDFbi3toc
(I was one of the co-founders of FoundationDB-the-company and was the architect of the product for a long time. Now that it's open source, I can rejoin the community!)
Kudos to the FoundationDB team and Apple for open sourcing this wonderful product. We're cheering you all along! And we look forward to contributing to the open source and community.
The relative performance contribution of most of those features (cross-partition transactions weren't evaluated; they're a must) can be seen in our paper at USENIX FAST:
FoundationDB uses optimistic concurrency, so "conflict ranges" rather than "locks". Each range is a (lexicographic) interval of one or more keys read or written by a transaction. The minimum granularity is a single key.
FoundationDB doesn't have a feature for indexing per se at all. Instead indexes are represented directly in the key/value store and kept consistent with the data through transactions. The scalability of this approach is great, because index queries never have to be broadcast to all nodes, they just go to where the relevant part of the index is stored.
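To make that concrete, here's a minimal sketch with the Python bindings (the subspace names and schema are just made up for illustration) of a layer keeping a secondary index consistent by updating it in the same transaction as the primary record:

    import fdb
    fdb.api_version(510)
    db = fdb.open()

    users = fdb.Subspace(('user',))            # ('user', user_id) -> name
    by_name = fdb.Subspace(('user_by_name',))  # ('user_by_name', name, user_id) -> ''

    @fdb.transactional
    def set_user_name(tr, user_id, name):
        old = tr[users.pack((user_id,))]
        if old.present():
            # drop the stale index entry for the previous name
            del tr[by_name.pack((bytes(old), user_id))]
        tr[users.pack((user_id,))] = name
        tr[by_name.pack((name, user_id))] = b''    # index entry; value unused

    @fdb.transactional
    def user_ids_named(tr, name):
        # an index lookup is just a range read over one slice of the index
        return [by_name.unpack(k)[-1] for k, v in tr[by_name.range((name,))]]

    set_user_name(db, 42, b'alice')
    print(user_ids_named(db, b'alice'))   # [42]

Because both writes commit atomically, readers never see the record without its index entry (or vice versa), and the lookup only touches the part of the cluster holding that slice of the index.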
FoundationDB delivers serializable isolation and external consistency for all transactions. There's nothing particularly special about transactions that are "cross-partition"; because of our approach to indexing and data model design generally we expect the vast majority of transactions to be in that category. So rather than make single-partition transactions fast and everything else very slow, we focused on making the general case as performant as possible.
Transaction coordination is pretty different in FoundationDB than in 2PC-based systems. The job of determining which conflict ranges intersect is done by a set of internal microservices called "resolvers", which partition up the keyspace totally independently of the way it is partitioned for data storage.
Please tell me if that leaves questions unresolved for you!
Ok, per my other question that makes sense. Similar to FaunaDB except the "resolvers" (transaction processors) are themselves partitioned within a "keyspace" (logical database) in FaunaDB for high availability and throughput. But FaunaDB transactions are also single-phase and we get huge performance benefits from it.
Systems that sound closest to FoundationDB's transaction model that I can think of are Omid (https://omid.incubator.apache.org/) and Phoenix (https://phoenix.apache.org/transactions.html). They both support MVCC transactions - but I think they have a single coordinator that gives out timestamps for transactions - like your "resolvers". The question is how your "resolvers" reach agreement - are they each responsible for a range (partition)? If transactions cross ranges, how do they reach agreement?
We have talked to many DB designers about including their DBs in HopsFS, but mostly it falls down on something or other. In our case, metadata is stored fully denormalized - all inodes in a FS path are separate rows in a table. In your case, you would fall down on secondary indexes - which are a must. Clustered PK indexes are not enough. For HopsFS/HDFS, there are so many ways in which inodes/blocks/replicas are accessed using different protocols (not just reading/writing files or listing directories, but also listing all blocks for a datanode when handling a block report). Having said that, it's a great DB for other use cases, and it's great that it's open-source.
I tried to explain distributed resolution elsewhere in the thread.
I believe our approach to indices pretty much totally dominates per-partition indexing. You can easily maintain the secondary indexes you describe; I don't understand your objection.
Also, the main drawback of "indices as data" in NoSQL is when you need to add a new index -- suddenly, you have to scrub all your data and add it to the new index, using some manual walk-the-data function, and you have to make sure that all operations that take place while you're doing this are also aware of the new index and its possibly incomplete state.
Certainly not impossible to do, but it sometimes feels a little bit like "I wanted a house, but I got a pile of drywall and 2x4 framing studs."
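Here's roughly what that manual walk-the-data function ends up looking like on FDB (a sketch with the Python bindings and an invented schema; it deliberately ignores the bookkeeping for the index's "write but don't serve reads yet" state):

    import fdb
    fdb.api_version(510)
    db = fdb.open()

    users = fdb.Subspace(('user',))             # existing data: ('user', user_id) -> name
    by_name = fdb.Subspace(('user_by_name',))   # the new index being backfilled

    @fdb.transactional
    def backfill_chunk(tr, cursor, limit=500):
        # index a bounded slice of the existing data in one transaction
        last = None
        for k, v in tr.get_range(cursor, users.range().stop, limit=limit):
            user_id = users.unpack(k)[0]
            tr[by_name.pack((v, user_id))] = b''
            last = k
        # hand back the next cursor, or None once the scan is done
        return fdb.KeySelector.first_greater_than(last) if last else None

    cursor = users.range().start
    while cursor is not None:
        cursor = backfill_chunk(db, cursor)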
This is a totally legitimate complaint about FoundationDB, which is designed specifically to be, well, a foundation rather than a house. If you try to live in just a foundation you are going to find it modestly inconvenient. (But try building a house on top of another house and you will really regret it!)
The solution is of course to use a higher level database or databases suitable to your needs which are built on FDB, and drop down to the key value level only for your stickiest problems.
Unfortunately Apple didn't release any such to the public so far. So I hope the community is up to the job of building a few :-)
Not even close. I don't even see anything I'd call a filesystem mentioned on your web page. I missed FAST this year, and apparently you had a paper about using Hops as a building block for a non-POSIX filesystem - i.e. not a filesystem in my and many others' opinion - but it's not clear whether it has ever even been used in production anywhere let alone become "best known" in that or any other domain. I know you're proud, perhaps you have reason to be, but please.
I'm incredibly impatient to have a look at what the community is going to build on top of that very promising technology.
[by wwilson] https://news.ycombinator.com/item?id=16877401
I see no reason you wouldn't be able to implement Datastore. In fact here's a public source claiming that Firestore (which I believe is its successor) is implemented on top of Spanner: https://www.theregister.co.uk/2017/10/04/google_backs_up_fir...
Also TBH now that I don't have commercial reasons to push interop, if I write another document database on top of FDB, I doubt I'd make it Mongo compatible. That API is gnarly.
Out of the many MongoDB criticisms, this one is valid. This https://www.linkedin.com/pulse/mongodb-frankenstein-monster-... article is quite right about it (note: endorsing the article by no means implies I endorse its author).
Other than that, they totally did "fake it until you make it": MongoDB 3.4 passed Jepsen only a year ago, and MongoDB BI 2.0 finally contained their own SQL engine instead of wrapping PostgreSQL.
(I've tried to keep the above as dry as possible to avoid dragging the arguments around this situation into this thread - and I suspect the phrasing of the previous comment was also intended to try and avoid that, so let's see if we can keep it that way, please)
The Will Wilson I remember was not prone to such understatement. That API (and the corresponding semantics) was a nightmare.
I realized this 2 months ago at https://news.ycombinator.com/item?id=16386129
Congrats to the MongoDB team!
It's fast to install on a dev machine though, lol.
FoundationDB is now licensed under Apache 2, which is a much more permissive license, so most companies' open source policies allow it.
I know that multiple other companies have a similar policy (either a complete ban on using AGPL-licensed software, or special approval required to use it), although unlike Google, they don't post their internal policy publicly.
If someone works at one of these companies, what do you want to do – spend your day trying to argue to get the AGPL ban changed, or a special exception for your project; or do you just go with the non-AGPL alternative and get on with coding?
It is true that MongoDB's AGPL contagion and compliance burden, if you don't modify it, is less than many fear. It is also true that those corporate concerns are valid. MongoDB does sell commercial licenses so that such companies can handle their unavoidable MongoDB needs, but they would tend to minimize that usage.
It was a very bad RDBMS (which is how they marketed it, as a replacement) and a very bad sharded DB (which they also marketed it as).
Anybody know if Apple migrated any projects from Cassandra to FoundationDB?
Or was every project using FoundationDB a greenfield project?
Would be great to see from Apple an engineering blog "strengths and challenges at scale" post now that it has been opened again.
Do you have something to back that up? To me this reads like you're implying that Elasticsearch doesn't work or scale.
It's definitely interesting, but I'm cautious. The track record for FoundationDB and Apple has not been great here. IIRC they acquired the company and took the software offline, leaving users out in the rain?
Could this be like it happened with Cassandra at Facebook where they dropped the code and then more or less stopped contributing?
Also I haven't seen many contributions from Apple to open-source projects like Hadoop etc. in the past few years. Looking for "@apple.com" email addresses in the mailing lists doesn't yield a lot of results. I understand that this is a different team and that people might use different E-Mail addresses and so on.
In general I'm also always cautious (but open-minded) when there's lots of enthusiasm and there seems to be no downside. I'm sure FoundationDB has its dirty little secrets and it would be great to know what those are.
The statement was blunt but the reputation is not exactly unearned. These problems become worse proportional to scale.
I don't think they will ever move 'secret sauce' into open source, but infrastructural things like DBs and dev tooling seems to be going in that direction.
Apple, like everyone else, wants to commoditize their complements.
Woolvalley said “and now recently clangd”, referring to a source-code completion server that began its life at Google.
Clang did start its life with Apple.
> Apple chose to develop a new compiler front end from scratch, supporting C, Objective-C and C++. This "clang" project was open-sourced in July 2007.
EDIT: My bad, looks like FoundationDB wasn't fully open-source back then.
You can even check the comments on the HN thread from when they removed the downloads.
Unrelated to the original topic, but I had never come across that talk and it is great. I use the same basic approach to testing distributed systems (simulating all non-deterministic I/O operations) and that talk is a very good introduction to the principle.
When certain events happen rarely, you have to fake them deterministically so that you have reproducible coverage.
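As a toy illustration of what "fake them deterministically" means in practice: if every source of randomness and every injected fault is driven by one seeded RNG, any run that hits the rare event can be replayed exactly from its seed. A sketch (not FDB's simulator, just the shape of the idea):

    import random

    class SimDisk:
        """A fake disk whose failures come from a seeded RNG, so any failing
        schedule can be reproduced exactly by rerunning with the same seed."""
        def __init__(self, rng, fail_prob=0.01):
            self.rng = rng
            self.fail_prob = fail_prob
            self.data = {}

        def write(self, key, value):
            if self.rng.random() < self.fail_prob:   # injected, but deterministic, fault
                raise IOError("simulated disk failure")
            self.data[key] = value

    def run_workload(seed):
        rng = random.Random(seed)
        disk = SimDisk(rng)
        for i in range(10_000):
            try:
                disk.write(i, b'x')
            except IOError:
                return ('failed at op', i)
        return ('ok',)

    # the same seed always reproduces the same failure schedule
    assert run_workload(1234) == run_workload(1234)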
If you have a reason to use FoundationDB over HDFS, NFS, S3, etc, then this will work well.
A Lucene+DB implementation where each term's posting list is stored natively in the key-value system was explored for Lucene+Cassandra as Solandra (https://github.com/tjake/Solandra). It was horrifically slow, not because Cassandra was slow, but because posting lists are heavily optimized structures, and putting them in a generalized B-tree or LSM-tree variant removes some locality and many of the possible optimizations.
I'm still holding out some hope for a hybrid implementation where posting list ranges are stored in a kv store.
FDB leaves room for a lot of creativity in optimizing higher layers. Transactions mean that you can use data structures with global invariants.
You can already tune segment sizes (a segment is a self-contained index over a subset of documents). I'd assume that the right thing to do for a first attempt is to use a Codec to write each term's entire posting list for that one segment to a single FDB key (doing similar things for the many auxiliary data structures). If it gets too big, then you should have tuned max segment size to be smaller. Do some sort of caching on the hot spots.
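Very roughly, that Codec idea could look something like this (Python bindings, with a made-up subspace layout and a toy delta encoding standing in for a real postings codec):

    import fdb, struct
    fdb.api_version(510)
    db = fdb.open()

    postings = fdb.Subspace(('postings',))   # ('postings', segment_id, term) -> packed doc ids

    def encode_postings(doc_ids):
        # delta-encode sorted doc ids; a real codec would use varints, skip data, etc.
        out, prev = [], 0
        for d in sorted(doc_ids):
            out.append(struct.pack('<I', d - prev))
            prev = d
        return b''.join(out)

    def decode_postings(blob):
        doc_ids, prev = [], 0
        for (delta,) in struct.iter_unpack('<I', blob):
            prev += delta
            doc_ids.append(prev)
        return doc_ids

    @fdb.transactional
    def write_postings(tr, segment_id, term, doc_ids):
        # one key per (segment, term) keeps a term's posting list contiguous
        tr[postings.pack((segment_id, term))] = encode_postings(doc_ids)

    @fdb.transactional
    def read_postings(tr, segment_id, term):
        blob = tr[postings.pack((segment_id, term))]
        return decode_postings(bytes(blob)) if blob.present() else []

One caveat: FDB values top out around 100 kB, which is one more reason to keep segments small, or to spill an oversized posting list across a contiguous range of keys instead of a single one.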
If anyone has any serious interest in trying this, my email is in my profile to discuss further.
That sounds a lot like Datomic's "Storage Resource" approach, too! Would Datomic-on-FDB make sense, or is there a duplication of effort there?
Datomic’s single-writer system requires conditional put (CAS) for the index and (transaction) log tree root pointers (mutable writes), and only eventual consistency for all other writes (immutable writes).
I would go as far as saying a FoundationDB-specific Datomic may be able to drop its single-writer system due to FoundationDB’s external consistency and causality guarantees, drop its 64-bit integer-based keys to take advantage of FoundationDB range reads, drop its memcached layer due to FoundationDB’s distributed caching, use FoundationDB watches for transactor messaging and the tx-report-queue function, use FoundationDB snapshot reads for its immutable index tree nodes, and maybe more?
Datomic is a FoundationDB layer. It just doesn’t know yet.
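To illustrate why the fit seems natural: the conditional put Datomic wants from a storage backend falls out of FDB's optimistic transactions almost for free. A sketch with the Python bindings (key names made up):

    import fdb
    fdb.api_version(510)
    db = fdb.open()

    @fdb.transactional
    def compare_and_set(tr, key, expected, new_value):
        current = tr[key]
        cur = bytes(current) if current.present() else None
        if cur != expected:
            return False            # lost the race; the caller retries at a higher level
        tr[key] = new_value
        return True                 # optimistic concurrency makes this atomic end to end

    # e.g. swing the index-root pointer only if it still points at the old tree
    compare_and_set(db, b'index-root', b'tree-v41', b'tree-v42')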
I can confirm it wasn't fast!
(And to be fair that wasn't the point - back then there were no distributed versions of Solr available so the idea of this was to solve the reliability/failover issue).
I wouldn't use it on a production system now days.
Out of curiosity, what led you to do this? And what does it do better/worse/differently than out-of-the-box things like Elasticsearch or SOLR?
In any partition situation, if one of the partitions contains a majority of the coordinators then it will stay live, while minority partitions become unavailable.
Or do you get around this by not deploying an uneven number of boxes?
I think you can set the number of coordinators to be even, but you never should - the fault tolerance will be strictly better if you decrease it by one.
5 into 2, 2, and 1
7 into 3, 2, and 2
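The arithmetic behind "decrease it by one," spelled out (nothing FDB-specific here):

    def coordinator_fault_tolerance(n):
        majority = n // 2 + 1        # a majority of coordinators must stay connected
        return n - majority          # coordinators you can afford to lose

    for n in (4, 5, 6, 7):
        print(n, 'coordinators ->', coordinator_fault_tolerance(n), 'can fail')
    # 4 -> 1, 5 -> 2, 6 -> 2, 7 -> 3: an even count never tolerates more failures
    # than the odd count just below it, so the extra coordinator only adds risk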
I was in attendance at your talk and thought it was one of the best at the conference. Apple, I think, broke some hearts by going completely closed-source for a while, but I'm glad to see them open sourcing a promising technology.
To get great performance for SQL on FoundationDB you really want an asynchronous execution engine that can take full advantage of the ability to hide latency by submitting multiple queries to the K/V store in parallel. For example if you are doing a nested loop join of tables A and B you will be reading rows of B randomly based on the foreign keys in A, but you want to be requesting hundreds or thousands of them simultaneously, not one by one.
Even our SQL Layer (derived from Akiban) only got this mostly right - its execution engine was not originally designed to be asynchronous and we modified it to do pipelining which can get a good amount of parallelism but still leaves something on the table especially in small fast queries.
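To show what that looks like at the K/V API level, here's a crude nested-loop join in the Python bindings (table layout invented for the example): the trick is simply to issue every lookup before forcing any of the futures, so the round trips to the storage servers overlap instead of serializing:

    import fdb
    fdb.api_version(510)
    db = fdb.open()

    table_a = fdb.Subspace(('A',))   # ('A', a_id) -> b_id (foreign key into B)
    table_b = fdb.Subspace(('B',))   # ('B', b_id) -> row payload

    @fdb.transactional
    def nested_loop_join(tr, a_ids):
        # start every outer-row read without blocking on any of them...
        fk_futures = [(a_id, tr[table_a.pack((a_id,))]) for a_id in a_ids]
        # ...then start every inner-row read as its foreign key resolves
        row_futures = [(a_id, tr[table_b.pack((bytes(fk),))])
                       for a_id, fk in fk_futures if fk.present()]
        # only now force the results; the reads above were all in flight concurrently
        return [(a_id, bytes(row)) for a_id, row in row_futures if row.present()]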
In FDB's envisioned architecture, the "client" is usually a (stateless, higher layer) database node itself! So the client encompasses the first layer of distributed database technology, connects directly to services throughout the cluster, does reads directly (1xRTT happy path) from storage replicas, etc. It simulates read-your-writes ordering within a transaction, using a pretty complex data structure. It shares a lot of code with the rest of the database.
If you wanted, you could write a "FDB API service" over the client and connect to it with a thin client, reproducing the more conventional design (but you had better have a good async RPC system!)
The microservices crew with their "our database is behind a REST/Thrift/gRPC/FizzBuzzWhatnot microservice" pattern is still catching up to the significance of this statement.
First request: the row needs to be read from an HDD; that takes ~2 ms.
Second request: the row is already in RAM, so it takes microseconds, but it still has to wait for the first request to finish.
Threads have overhead when you have a lot of concurrency (thousands or millions of requests per second).
For extreme async, see seastar-framework and scylladb design.
TLDR: high concurrency, low overhead etc.
If you’ve read any of the comments or have been following the project, it should be pretty obvious that this is far from rookie.
This is a game changer, not a hobby project. This is the first distributed data store that offers enough safety and a good enough understand of CAP theorem trade offs that it can be safely used as a primary data store.
FDB raised >$22m and was acquired by Apple.
If you consider that a "hobby project," please, teach me how to hobby.
A tiny bit of caution for folks trying to run systems like this, though: it is frigging hard at any reasonable scale. The whole thing might be documented / OSS and whatnot, but very soon you are going to run into problems deep enough that they require very core knowledge to debug and energy to deep-dive into, both of which you probably don't want to invest your time in.

Do evaluate the cloud offerings / supported offerings before spinning these up, or else ensure you have hired experts who can keep this going. They are great as a learning tool, pretty hard as an enterprise solution. I have seen the same issue a ton of times with a bunch of software (redis/kafka/cassandra/mongo...) by now. IMO, in the stateful world, operating/running the damn thing is 85% of the work, with 15% being active dev. (The stateless world is a little better, but still painful.)
The number of newbie engineers who see docker/kubernetes, and think "let me docker run or helm install" a stateful service in a couple of minutes - is mind boggling.
I'll remember this quote when trying to talk sense to them.
I remember all the people who bashed Apple when they acquired FoundationDB. I hope they are appropriately ashamed now.
I'm not ashamed about deriding apple for Apple taking a really, really great product and hiding it from the world for years to come.
This is definitely some atonement, but does not totally absolve Apple from the many times they've taken tech private.
I'm probably a 1% engineer, been hired by M$, FB, and Google. These guys were light years ahead of me. I'm not sure I'm as good now as they were at like 17 years old. In fact I'm probably only a decent engineer from having observed the stuff they were doing back then and finding inspiration.
The aforementioned SWEs make themselves multimillionaires /and/ have great jobs /and/ get praised by their peers, yet now everyone else has to bow down to them... over claims they were great programmers in high school? Is an appeal to your own experience the best way to make yourself seem relevant here?
It's an incredible DB system, but it ain't the Second Coming. Calm down.
Is it ok to admire great engineers on HN?
Is this comment relevant to anything but your own inferiority complex?
Can we get a decent definition of hubris in here?
It's great to admire excellent engineers; aspiring to be as skilled as someone at a task can be very motivating. Worshipping them is another thing.
You're right about the inferiority complex - I know I'm a relatively bad SW engineer, but that's mostly related to how new it is to me. I expect and want to improve.
Hubris is defined as excessive pride or self-confidence. I would say that bragging that you're "probably a 1% engineer," and that you've been hired by three of the largest SW companies out there qualifies as hubris. Maybe it's just me, but I don't think a 1% engineer would publicly boast about being one and then actually use the 'M$' in a non-farcical manner.
Shitting on 99% of the SWE population to make yourself look good, then shitting on yourself to make another person look even better doesn't really work. There's a reason humanCmp() is a little more complex than strCmp().
BTW, being hired by a large company doesn't mean you're all that. Plenty of idiots get hired by Oracle.
My second high school is unranked ("Texas Academy of Math and Science"). I don't think it qualifies as a high school. Seems we haven't done a great job of identifying the accomplishments of our alumni, based on the Wikipedia page. No doubt it would rank near the top though. My year alone Caltech accepted about 30 of us, more than any other high school in the country. Makes me wonder what my peers have been up to.
Anyway, I'd agree that these tech high schools have some amazingly smart people attending them.
1. It’s in its own repo
2. The build instructions are concise and clear. Dependencies are listed. You have to follow a total of 0 links.
3. They use a common build system and not an in-house thing.
Although it starts by running `make`, it's about as in-house as a thing can be.
Does the horse choke on the baseball? Is there an equine version of the Heimlich maneuver to be performed on horses suffering from mixed-metaphorical-adage-induced asphyxiation?
And then there were the hijinks we went through to build a cross-compiler with modern gcc and ancient libc (plus the steps to make sure no dependency on later glibc symbols snuck in):
Ahh... now that was a build system.
(with apologies to Tolkien)
Almost 5 years in and we have not lost any data (but we have lost machines, connectivity, seen kernel panics, EBS failures, SSD failures, etc., your usual day in AWS =p).
Basically, once a system is complex enough, some part of it is always broken. The software must be designed from the assumption that the system is never running flawlessly.
Or are you saying that AWS is particularly unreliable at scale?
An FDB transaction roughly works like this, from the client's perspective:
1. Ask the distributed database for an appropriate (externally consistent) read version for the transaction
2. Do reads from a consistent MVCC snapshot at that read version. No matter what other activity is happening you see an unchanging snapshot of the database. Keep track of what (ranges of) data you have read
3. Keep track of the writes you would like to do locally.
4. If you read something that you have written in the same transaction, use the write to satisfy the read, providing the illusion of ordering within the transaction
5. When and if you decide to commit the transaction, send the read version, a list of ranges read and writes that you would like to do to the distributed database.
6. The distributed database assigns a write version to the transaction and determines if, between the read and write versions, any other transaction wrote anything that this transaction read. If so there is a conflict and this transaction is aborted (the writes are simply not performed). If not then all the writes happen atomically.
7. When the transaction is sufficiently durable the database tells the client and the client can consider the transaction committed (from an external consistency standpoint)
The implementations of 1 and 6 are not trivial, of course :-)
So a sufficiently "slow client" doing a read write transaction in a database with lots of contention might wind up retrying its own transaction indefinitely, but it can't stop other readers or writers from making progress.
It's still the case that if you want great performance overall you want to minimize conflicts between transactions!
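For the curious, the bindings' retry loop wraps all of the above up for you; written out by hand in Python it looks roughly like this (steps 1-4 happen inside the reads and buffered writes, steps 5-7 inside commit/on_error; the key names are just for the example):

    import fdb
    fdb.api_version(510)
    db = fdb.open()

    def transfer(db, frm, to, amount):
        tr = db.create_transaction()
        while True:
            try:
                # steps 1-2: reads come from a consistent MVCC snapshot at the read version
                a_val, b_val = tr[frm], tr[to]
                a = int(bytes(a_val)) if a_val.present() else 0
                b = int(bytes(b_val)) if b_val.present() else 0
                # steps 3-4: writes are buffered locally until commit
                tr[frm] = str(a - amount).encode()
                tr[to] = str(b + amount).encode()
                # steps 5-7: commit ships the read ranges and writes to the cluster;
                # a conflicting concurrent write makes this raise, and we retry
                tr.commit().wait()
                return
            except fdb.FDBError as e:
                tr.on_error(e).wait()   # resets the transaction if the error is retryable

In practice you'd just decorate the function with @fdb.transactional and let the binding run this loop for you.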
Is this similar to how Software Transactional Memory (STM) is implemented? It sounds very very similar indeed.
I don't know that there's been an exhaustive writeup of that part, but maybe one of us or somebody on the Apple team will put something together. It probably won't fit in an HN comment though!
Or... maybe this is the part where I point out that the product is now open-source, and invite you to read the (mostly very well commented) code. :-)
Going forward as an opensource product, I hope to see some clarity on the "how it works"... Distributed, performant ACID sounds good, almost too good to be true. Not that I doubt it at the moment, I just want to understand it better :)
The basic approach isn't super hard to understand, though the details are tricky. The resolvers partition the keyspace; a write ordering is imposed on transactions and then the conflict ranges of each transaction are divided among the resolvers; each resolver returns whether each transaction conflicts and transactions are aborted if there are any conflicts.
(In general the resolution is sound, but not exact - it is possible for a transaction C to be aborted because it conflicts with another transaction B, but transaction B is also aborted because it conflicts with A (on another resolver), so C "could have" been committed. When Alec Grieser was an intern at FoundationDB he did some simulations showing that in horrible worst cases this inaccuracy could significantly hurt performance. But in practice I don't think there have been a lot of complaints about it.)
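A toy, single-node model of the resolver's job, leaving out the keyspace partitioning, batching, and MVCC-window management that make the real thing hard:

    # Remember recently committed write ranges by version and abort any transaction
    # whose reads intersect a write that committed after its read version.
    class ToyResolver:
        def __init__(self):
            self.version = 0
            self.recent_writes = []   # (commit_version, begin_key, end_key)

        def try_commit(self, read_version, read_ranges, write_ranges):
            for cv, w_begin, w_end in self.recent_writes:
                if cv <= read_version:
                    continue
                for r_begin, r_end in read_ranges:
                    if r_begin < w_end and w_begin < r_end:   # intervals overlap
                        return None                           # conflict: abort
            self.version += 1
            for begin, end in write_ranges:
                self.recent_writes.append((self.version, begin, end))
            return self.version   # commit version

    r = ToyResolver()
    assert r.try_commit(0, [(b'a', b'b')], [(b'a', b'b')]) == 1       # first txn commits
    assert r.try_commit(0, [(b'a', b'b')], [(b'x', b'y')]) is None    # read what it wrote: abort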
MVCC needs a bit of additional logic on top to be serializable - "serializable snapshot isolation" is a good keyword to search for - but it's definitely possible.
This then causes people who aren't versed with the product or the technology to decrease their perception of the product, and puts the team behind it in a position of having to not just come to its defense but to do so quickly due to the perception concerns.
Meanwhile, if they just do a basic search for "MVCC serializable" they would find that they were wrong; which means that it took more time to leave this insulting comment than it would have taken them to learn how this can work.
As a community, we really really really really need to beat down on casual cynics like this, who like to lazily "call bullshit" or play the "citation needed" card as a way to undermine the credibility of other people's products. We live in a future where the answers to these kinds of doubts are a moment away: this particular form of debate tactic needs to die.
I'll take your feedback under consideration, and I'm sorry to have irritated you.
I believe that FoundationDB stores rows in lexicographical order by key. Other databases like Cassandra strongly push you toward not storing data this way as it can easily lead to hotspots in the cluster. How do you deploy a FoundationDB cluster without leading to hotspots, or perhaps what operational actions are available to rebalance data?
If you have a sorted data store, you can get the same distribution by keying off a hash of the "real" primary key, right?
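Something like this (the prefix length is arbitrary), with the usual caveat that you give up meaningful range scans over the original key order:

    import hashlib

    def spread_key(primary_key: bytes) -> bytes:
        # prepend a short hash so lexicographically adjacent primary keys land
        # in unrelated parts of the (sorted) keyspace
        return hashlib.sha256(primary_key).digest()[:4] + primary_key

    print(spread_key(b'user/000001').hex())
    print(spread_key(b'user/000002').hex())   # a completely different prefix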
Cassandra allows you to store multiple records in sorted order within a partition. The normal recommended way to get data locality is to store records that are frequently accessed together in the same partition.
I'm also kind of confused.. is the single repo complete?
What about the SQL Layer? Where is all this stuff in the new GH repo?
Or is only the KV part open-source?
Looking forward to some CockroachDB vs. FDB benchmark showdowns :)
"Wavefront by VMware’s Ongoing Commitment to the FoundationDB Open Source Project"
I wonder how it compares to MUMPS databases like InterSystems Caché and FIS GT.M?
Also, the obvious difference is that Caché is closed source and prohibitively expensive.
FIS GT.M is only FOSS on selected platforms (Linux x86, OpenVMS Alpha), and proprietary on all other platforms.
Worked a job once where that was the underlying data store.
I was only allowed to touch the SQL interface to it, which was....weird.
The SQL dialect was ancient, felt like something from about 1990 (and this was in.... 2012 or so, so not THAT long ago).
Query performance seemed invariant. A simple select * from foo where id=X and a monster programmatically generated join across 15 tables would both take about 1.5 seconds to return results.
Bloomberg’s comdb2 was open sourced recently https://github.com/bloomberg/comdb2 - it seems similar, but would be interesting to see comparison.
If not, then a lot of potential is being left on the table, because usage would require wrapping FoundationDB in a proxy or middleware of some kind to synthesize events, which can be extremely difficult to get right (due to race conditions, atomicity issues, etc). Without events, apps can find themselves polling or rolling their own pub/sub metaphor over and over again. If anyone with sway is reading this, events are very high on the priority list for me thanx!
I'm not sure it has every feature it will ever need in this area, but it's a pretty good starting point for building "reactive" stuff.
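The main primitive there is watches: a client can ask to be notified when a key changes, which covers a lot of notification/pub-sub patterns without polling. A minimal sketch with the Python bindings (key name made up):

    import fdb
    fdb.api_version(510)
    db = fdb.open()

    @fdb.transactional
    def read_and_watch(tr, key):
        value = tr[key]
        # the watch future becomes ready once a later transaction modifies `key`
        return (bytes(value) if value.present() else None), tr.watch(key)

    value, watch = read_and_watch(db, b'config/feature-flag')
    watch.wait()                 # blocks until some other writer changes the key
    print('key changed, re-reading...')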
From their source code blessing notes.
Related snippet from the "Distinctive Features Of SQLite" page from the sqlite project:
The source code files for other SQL database engines typically begin with a comment describing your legal rights to view and copy that file. The SQLite source code contains no license since it is not governed by copyright. Instead of a license, the SQLite source code offers a blessing:
May you do good and not evil.
May you find forgiveness for yourself and forgive others.
May you share freely, never taking more than you give.