I have nothing but praise for FoundationDB. It has been by far the most rock-solid distributed database I have ever had the pleasure of using. I used to manage HBase clusters, and the fact that I have never once had to worry about manually splitting "regions" is such a boon for administration...let alone JVM GC tuning.
We run several FDB clusters using 3-DC replication and have never once lost data. I remember when we wanted to replace all of the FDB hardware (one cluster) in AWS, and so we just doubled the cluster size, waited for data shuffling to calm down, and just started axing the original hardware. We did this all while performing over 100K production TPS.
One thing that makes the above seamless for all existing connections is that clients automatically update their "cluster file" in the event that new coordinators join or are reassigned. That alone is amazing...as you don't have to track down every single client and change / re-roll with new connection parameters.
Anyway, I talk this database up every chance I get. Keep up the awesome work.
How would it compare to, say, something like hosted Redis, or, if you wanna be fancier, Elasticsearch? I have been looking into FDB for a pretty long time and have been looking for the perfect opportunity to use it. It would be helpful if you could describe your usage scenario (the kind of data you are storing).
The key takeaway of FoundationDB is that it is a strongly consistent key-value store that preserves (lexicographic) ordering. Although you might consider FDB's APIs to be quite primitive (get/set/scan), the payoff is how it seamlessly handles multi-key transactions without requiring you to write or manage some client-side two-phase-commit process.
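To make that concrete, here is a minimal sketch using the official Python binding (the key names and the transfer helper are made up for illustration): two reads and two writes commit atomically, and the binding retries the whole function if another transaction conflicts.

    import fdb

    fdb.api_version(630)          # pick the API version your cluster speaks
    db = fdb.open()               # uses the default cluster file

    @fdb.transactional
    def transfer(tr, src, dst, amount):
        # Both reads and both writes commit as one unit; on a conflict the
        # decorator retries the whole function automatically.
        src_val, dst_val = tr[src], tr[dst]
        src_bal = int(src_val) if src_val.present() else 0
        dst_bal = int(dst_val) if dst_val.present() else 0
        tr[src] = str(src_bal - amount).encode()
        tr[dst] = str(dst_bal + amount).encode()

    transfer(db, b'balance/alice', b'balance/bob', 10)

No client-side two-phase commit and no retry loop of your own: the @fdb.transactional decorator handles commit and retry.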
Given these primitives, you or other engineers can write higher-level APIs on top (e.g. SQL, search, etc.). In fact, we make extensive use of their Record Layer [1] library, which provides a strongly consistent, schema-based write process using Protobuf. This includes indexes that are kept consistent on write. Apple has also open-sourced a MongoDB-API-compliant interface [2] that gives you all the consistency guarantees of FDB, but with MongoDB's API.
The beauty of FDB is that the primitives are done in such a rock-solid fashion that you can write higher-level APIs without having to worry about the really hard stuff (transactions, failures, config errors, testing, simulation, etc.). Another example: CouchDB is switching its back end to FDB for its 4.0 release.
Given that the database sorts data lexicographically...to us, it became a natural fit for an online (always-mutable) time-series database.
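Not our actual schema, but as a rough sketch of why lexicographic ordering suits time series: pack (series, timestamp) tuples as keys, and a window query becomes one ordered range scan (the subspace and field names here are invented).

    import time
    import fdb

    fdb.api_version(630)
    db = fdb.open()
    ts = fdb.Subspace(('ts',))     # all keys live under the ('ts',) prefix

    @fdb.transactional
    def record(tr, series, timestamp, value):
        # (series, timestamp) tuples sort by series, then chronologically.
        tr[ts.pack((series, timestamp))] = str(value).encode()

    @fdb.transactional
    def read_window(tr, series, t0, t1):
        # One ordered range scan returns [t0, t1) for the series.
        return [(ts.unpack(k)[1], float(v))
                for k, v in tr.get_range(ts.pack((series, t0)),
                                         ts.pack((series, t1)))]

    record(db, 'cpu.host42', int(time.time()), 0.37)

Because the data is always mutable, late-arriving points are just ordinary transactional writes into the right spot in the key order.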
If you want more insight into how we use it, I go over it in some detail towards the end of my keynote [3] from last August.
> I ran FoundationDB Key-Value Store through every nemesis in Jepsen - including those that found failures in other databases - and FoundationDB passed all of them with flying colors.
FoundationDB is one of the coolest pieces of technology I've used in the past decade. The tuple keyspace is incredibly useful, and so are the multi-key transactions. I've physically killed the power on an FDB node and on a whole FDB cluster, multiple times (heh, home servers)...and every time the node or cluster just comes back.
That's great that you are doing your own resiliency testing.
Having someone other than those officially on the Jepsen project run the Jepsen tests is a good start. However, many databases have claimed to run the Jepsen tests themselves and pass, yet when there is an actual paid engagement for a distributed database, issues are always found. That's generally true for unpaid official runs as well, although ZooKeeper did pass the existing tests. Every database is different, and a paid engagement will design specific tests meant to break the database in question.
Two quotes from the paper that I think will motivate people to read it:
"Rigorous correctness testing via simulation makes FDB extremely reliable. In the past several years, CloudKit [59] has deployed FDB for more than 0.5M disk years without a single data corruption event. Additionally, we constantly perform data consistency checks by comparing replicas of data records and making sure they are the same. To this date, no inconsistent data replicas have ever been found in our production clusters."
"For example, early versions of FDB depended on Apache Zookeeper
for coordination, which was deleted after real-world fault injection found two independent bugs in Zookeeper (circa 2010) and was replaced by a de novo Paxos implementation written in Flow. No production bugs have ever been reported since."
Ehhhh, doesn't align with my experience. I think FDB is actually really poorly tested. When I was evaluating it for replacement of the metadata key-value store at a major, public web services company we found that injecting faults into virtual NVMe devices on individual replicas would cause corrupt results returned to clients. We also found that it would just crash-loop on Linux systems with huge pages, because although someone from the project had written a huge-page-aware C++ allocator "for performance", evidently nobody had ever actually tried to use it, including the author.
It's also really, really weird that their non-scalable architecture hits a brick wall at 25 machines. Ignoring the correctness flaws, it only works if you can either design around that limit by sharding and never offer cross-shard transactions, or if you can assure yourself that your use case will never outgrow half a rack of equipment.
Can you fix a point in time? Software evolves, and I think the point being made is that it wasn't well tested back then, and they changed things once production workloads told them they needed to change.
There weren't any, which is why that particular shop elected to roll their own distributed system on top of RocksDB.
In general I think people who think they want to do FoundationDB owe themselves a serious contemplation of the cost/benefit of using Cloud Spanner instead. Obviously you cannot do your own fault injection testing of Spanner, but it does have end-to-end checksums.
For the record, I said the same thing. But it's a management problem because on the one hand you have a known open project with demonstrable flaws, and on the other you have your own in-house developers and you will tend to discount the bugs they haven't written yet.
But, also for the same record, thinking you can implement a reliable, globally-replicated key-value store on top of FoundationDB that is cheaper and better than Cloud Spanner may be evidence of the same cognitive bias.
Thanks for the quotes, I've been wanting to read this paper for some time. Great to see they went through the consensus literature and made a decision to go with Active Disk Paxos, instead of stopping short and not fully understanding the consensus they're building on. The consensus and replication protocol is such a huge part of building a distributed database.
My understanding is that FDB relies heavily on deterministic simulations for testing, and that their async/await model is a big part of how they make sure they cover different possible interleavings in a deterministic way.
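That matches my reading of their testing talks. The simulator itself is written in Flow/C++, but the core idea is small enough to show as a toy Python sketch (entirely illustrative, not FDB code): run cooperative tasks under a seeded scheduler so any interleaving you hit is replayable from the seed.

    import random

    def actor_a(log):
        log.append('a1'); yield
        log.append('a2'); yield

    def actor_b(log):
        log.append('b1'); yield
        log.append('b2'); yield

    def simulate(seed):
        # A fixed seed fixes the interleaving: same seed, same schedule,
        # same bug, every time you re-run it.
        rng = random.Random(seed)
        log = []
        tasks = [actor_a(log), actor_b(log)]
        while tasks:
            t = rng.choice(tasks)
            try:
                next(t)            # advance the chosen actor to its next yield
            except StopIteration:
                tasks.remove(t)
        return log

    assert simulate(42) == simulate(42)   # deterministic replay

Swap the toy actors for simulated processes, disks, and networks (all driven by the same RNG) and you can fuzz failure schedules, then replay any failing seed exactly.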
Only great things to say about FoundationDB. We've been using it for about a year now. Got a tiny, live cluster of 35+ commodity machines (started with 3 a year ago), about 5TB capacity and growing. Been removing and adding servers (on live cluster) without a glitch. We've got another 100TB cluster in testing now. Of all the things, we're actually using it as a distributed file system.
We've tried Ceph, GlusterFS, HDFS, MinIO, and some others, and eventually decided on a custom FDB solution. It's a breeze to set up, and it seems to eclipse the others in performance [0] and reliability - Kyle (aphyr), the author of the Jepsen series on distributed systems correctness, said: "haven't tested foundation in part because their testing appears to be waaaay more rigorous than mine." [1]
The way we use FDB, if anyone is interested, is that we simply split files into small chunks (per FDB data design recommendations) and store all file and folder metadata in FDB, such as byte count, create/access/write times, permissions, and a lot more. Folders are handled by the built-in Directory layer [2].
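For anyone curious what such a layer looks like in outline, here is a toy sketch with the Python binding (chunk size, metadata fields, and function names are invented, and a real implementation has to split large files across several transactions because a single FDB transaction is limited to roughly 10 MB):

    import fdb

    fdb.api_version(630)
    db = fdb.open()

    CHUNK = 64 * 1024     # stay well under the 100 kB value-size limit

    files = fdb.directory.create_or_open(db, ('files',))   # Directory layer

    @fdb.transactional
    def write_small_file(tr, path, data, mtime, mode):
        node = files.create_or_open(tr, path)
        tr[node.pack(('meta',))] = fdb.tuple.pack((len(data), mtime, mode))
        for i in range(0, len(data), CHUNK):
            tr[node.pack(('chunk', i // CHUNK))] = data[i:i + CHUNK]

    @fdb.transactional
    def read_file(tr, path):
        node = files.open(tr, path)
        return b''.join(v for _, v in tr[node.range(('chunk',))])

    write_small_file(db, ('docs', 'readme.txt'), b'hello world', 1700000000, 0o644)

The Directory layer gives each path a short prefix, so chunk keys stay small and the chunks of one file are contiguous in key order, which makes whole-file reads a single range scan.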
Interesting!
We have been doing the same thing with HopsFS for a couple of years, except we only store the small files in our database (www.rondb.com) - RonDB is a recent fork of MySQL Cluster (NDBCluster). Very small files (<1KB) are stored in memory in RonDB, small files (typically <128KB) are stored on NVMe disks in RonDB, and other files in HopsFS (which now stores its blocks in object storage (S3, ABS)).
We had a paper on it at ACM Middleware, and it's open source on GitHub. Are you going to publish your solution?
That's a really interesting solution. Can you tell us more about it? Operating distributed blob storage systems has been kind of fragile with every piece of software I have tried so far.
FDB's Directory layer provides all you need to create and edit nested paths. What's left to develop is a file chunking and assembly part, and statistics if needed.
The only reason you need chunking is that FDB has very clearly defined limits in its documentation, and one of those limits is value size - it can't exceed 100 kB, and should be kept below 10 kB for best performance.
For statistics like folder byte count, you could use FDB's atomic increment operation.
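A rough sketch of that pattern with the Python binding (the 'stats' subspace is just an example): the atomic ADD mutates the value server-side, so there is no read and no read-write conflict on the counter key.

    import struct
    import fdb

    fdb.api_version(630)
    db = fdb.open()
    stats = fdb.Subspace(('stats',))

    @fdb.transactional
    def add_bytes(tr, folder, delta):
        # Atomic add: no read of the old value, so writers don't conflict here.
        tr.add(stats.pack((folder, 'bytes')), struct.pack('<q', delta))

    @fdb.transactional
    def folder_bytes(tr, folder):
        v = tr[stats.pack((folder, 'bytes'))]
        return struct.unpack('<q', v)[0] if v.present() else 0

The operand is a little-endian 64-bit integer, which is why the value is packed with struct.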
What kind of read/write ratio are you using? And would your solution work for a write-heavy workload?
Kafka has limits on message size, and I need a solution for storing large blobs (up to 10MB) at data ingestion for a very short time, until the job has been processed. So the read/write ratio will be exactly 50%, and there will be a high write load. Is FoundationDB capable of this specific task? Are there some tunable knobs for the ACID guarantees? Perfect ACID would not really be needed; if everything is fsynced every second, that would be completely OK.
Kafka limits are customizable. 10MB is quite small; Kafka easily handles 100MB messages. If you need storage that works like a log, there's nothing better than Kafka in terms of speed and scalability. But it's more about relatively long-term storage. If you need to persist data for a very short amount of time, why don't you use an in-memory store? Do you have strict requirements around data loss? Something like Redis might just do the trick for you. With Redis Cluster it even scales.
Markus Pilman from Snowflake did an awesome talk on FoundationDB's testing at CMU's Quarantine Tech Talks (2020), How I Learned to Stop Worrying and Trust the Database:
FDB is an awesome and unique piece of software (I attribute quite a bit of Snowflake's success to FDB). I've also had the pleasure of meeting some folks from the original team and they are true engineers. Does anyone know if/when Redwood (the new storage engine) has landed / will land?
Founders are building a distributed systems simulation product now called Antithesis. My data fabric startup, Stardog, is a happy Antithesis early adopter customer. It’s helping us reproduce and fix non-deterministic bugs deterministically. Good stuff.
Yes, I did. But I had specific requirements, and the main one was that I need a fully distributed database, where one of the nodes can disappear for any reason at any time and things would just continue as if nothing happened.
Hi, I work on the Crux team. I think "fully distributed" has a few possible meanings, but is it essentially a case of wanting something with a dead-simple clustering story? Or is it more about multi-region distribution & availability?
Whilst Kafka itself is almost certainly not as simple to operate as FDB (although I can't speak from experience), it does in turn provide Crux with dead-simple clustering, because each Crux node acts as an isolated replica. FDB could be used equally instead of Kafka, but the maximum write throughput would then be somewhat lower. The only key part of the story Crux doesn't currently supply out-of-the-box is a load-balancing layer atop a cluster of nodes.
I expect the real drawback of Crux's current design, by comparison, is that each node is an ~expensive full replica, whereas your system (built on FDB) will benefit from fully sharded indexes running across a cluster of (smaller) machines. The main trade-off then is the low-latency query performance of a fully local KV store.
I'm sorry — I tried to respond quickly, and this kind of response always lacks depth.
I wasn't criticizing Crux by any means. I like the project, and I spent a long time thinking carefully about bitemporality, as this is something I often need and have to implement by myself.
My decision to go with FDB was based on many factors, and it was taken over the course of multiple years. Some factors were technical (e.g. fully distributed, correct, ability to implement changefeeds) and some were not (I am not happy with the internal complexity of Kafka).
Also, I don't feel comfortable with the way most DB discussions are framed. Most people think that one can choose a database and switch between databases at will, which implies there is a clearly defined division between "a database" (with an API) and "the app". That is not necessarily true with FDB. To get the full value out of it, your app code should be aware of transactions, participate in database mechanisms (versionstamps), correctly handle asynchronous streaming of large amounts of data with backpressure, etc.
In the case of my app, I did not want to switch from one "document" database to another; I wanted to be able to write a distributed application based on an underlying well-tested transactional database. That's why FDB is a good fit, and by "FDB" I do not mean their document DB layer, I mean basic FDB itself (I only use the tuple layer).
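To illustrate the kind of participation I mean, here is a rough sketch of using versionstamped keys for an append-only changefeed with the Python binding (the subspace name and helpers are mine, and it assumes a binding recent enough that the tuple layer supports incomplete versionstamps):

    import fdb

    fdb.api_version(630)
    db = fdb.open()
    log = fdb.Subspace(('changefeed',))

    @fdb.transactional
    def append_event(tr, payload):
        # The incomplete versionstamp is filled in at commit time with the
        # transaction's commit version, so keys are unique and globally ordered.
        key = log.pack_with_versionstamp((fdb.tuple.Versionstamp(),))
        tr.set_versionstamped_key(key, payload)

    @fdb.transactional
    def events_after(tr, last_key, limit=100):
        # Readers page through the feed in commit order, remembering last_key.
        return list(tr.get_range(fdb.KeySelector.first_greater_than(last_key),
                                 log.range().stop, limit=limit))

The application has to carry last_key around and bound its own batch sizes; that is exactly the sort of database mechanism (usefully) leaking into app code.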
No need to apologise at all, and thank you for your thoughtful reply!
>To get the full value out of it, your app code should be aware of transactions, participate in database mechanisms (versionstamps), correctly handle asynchronous streaming of large amounts of data with backpressure
This is an excellent point, and broadly aligns with Crux's design also, since the transaction semantics provided are relatively raw and therefore much of the interesting parts of the "database" logic live firmly in the application code.
It sounds like you have a really interesting system and I would love to hear more about it one day :)
Personally, I don't understand how you can call a database robust if it can't scale down nodes after you've scaled them up once. What am I supposed to do if I ever deploy to 50 nodes and then it turns out that I only need 5? Shut the business down? Pay to run database servers forever that I don't even need anymore? Also, the database configuration has a lot of gotchas and is very opaque. You might be waiting 30 seconds for your CLI to connect to your localhost cluster of two processes, with no idea what is happening or why it is taking that long. It just never felt as safe and robust as people claim it to be. I don't know, these were just my findings from the brief tests I did with it.
Also, you'd better get familiar with a whole bunch of hidden "knobs" that are apparently configurable and very important somewhere, and that then get printed out into XML logs - but of course there is no log viewer, so you have to write your own. Maybe this isn't a problem for large companies, but I'm providing feedback as a single user here.
I also don't understand how people can praise the C++ DSL. They should rewrite it to use standard C++ coroutines as soon as possible so their entire build and dev environment isn't so hard to understand. As a user of open source software, I generally like to be able to debug through the projects I use and figure out the problems I have. That's much harder when a project uses its own custom language. I certainly tried to set it all up correctly, but there always seemed to be some problems with IntelliSense within the IDE.
They use the terms "machine" and "process" with multiple conflicting meanings, I think. Maybe they have improved this since I used it 1.5 years ago, but I kind of doubt it.
If I remember correctly, there is a definite problem if you remove one of the processes that ends up guaranteeing your configured redundancy mode: suddenly your entire cluster is inoperable. Yes, really - I think it doesn't even properly respond to CLI commands anymore and shows nodes as simply missing. Oh, and suddenly those long wait times for your CLI commands really start to bother you...
I don't really want to talk more about it because it's been a while but I just wanted to make the point that the user experience was quite bad.
I am pretty sure that the new Cloudant transaction/storage engine is also based on FoundationDB, which powers a lot of things behind the scenes at IBM. And CouchDB 4 with a FoundationDB storage engine is hopefully not too far out either. Let's see how long this whole transition takes, but I am still hopeful that the mindshare and motivation of Apple, Snowflake, IBM, and the Apache community will lead to something great.
I have been studying these key-value stores with efficient range iteration lately (such as LevelDB, RocksDB, BigTable, FoundationDB, etc). This is a great reference on how to make such a simple abstraction do a lot of useful things.
There was an SQL layer but performance was sub-par IIRC. There was also a blog post somewhere explaining why it's probably not a good idea to build an SQL layer on top of a KV store, devil in details, etc.
Not sure what to think of it, I'm not a DB expert by any means but the post sounds plausible enough and the SQL layer is discontinued AFAIK. I guess with each new abstraction layer you leave some perf on the table.
It's not a very good analysis. The only reason there isn't a good OLTP SQL engine on top of FDB is that no one has invested the considerable resources and expertise to build one. The FDB SQL Layer was on its way, but Apple prioritized other things. Snowflake built a world-class analytic SQL database on FDB, and the door is still open for someone to do OLTP. There might even be room for two different designs (one more like Akiban/SQL Layer or F1, competing with Spanner, and one more like Aurora, with a more familiar performance envelope but slightly less scalability).
This seems like a good place to ask - are there any new and exciting FOSS "applications" worth checking out? I recall from the initial publication of the source that there were references to a great SQL layer? I don't know if a FOSS work-alike ever materialized. Other things I'd hoped for were a network filesystem/blob layer, maybe S3/NFS/WebDAV compatible? What are people building on top of FoundationDB today?
Edit: I suppose various document DB applications - like IMAP - might be a good fit too?
Large unstructured blobs and large files are among the things not well suited to FoundationDB, and CouchDB 4 actually reduced the supported blob size in the transition to FoundationDB. It looks like object/blob storage systems are, at the moment, separating further from key/value and document storage rather than growing together. But this is a good thing, because the tradeoffs are very different, and it allows each system to focus on what it does best. Blob stores will hopefully move even more toward content addressing and Merkle DAGs similar to Git and IPFS.
Can you elaborate on what requirements these blob systems should have?
My understanding is that object stores are typically "flat" by design to scale well (in contrast to the tree structure found in filesystems).
For content addressing, are people using the keys in sophisticated ways or are the values being indexed? Any reason to push this complexity into the storage layer as opposed to composing the functionality?
Well, there is often no hierarchy like folders on an old-school filesystem, but if the system uses chunking, the organization of blob chunks is very important for performance and scaling characteristics. The chunking algorithm needs to be performant, but it also needs to produce sensible chunk sizes and counts, and in addition it can use data-based boundaries so chunks can be reused even if the blob data changes near the start. The right choice differs depending on your specific application (e.g. the read/write ratio and average file sizes) and on requirements for optimal use of the underlying filesystem; that's one reason why no de facto standard chunker has been established so far. There are many tradeoffs in key organization too. Do you need more sophisticated range queries, or only single keys? How do you balance growing and shrinking of your data structure against performance? What is the clustering story? How do you handle rebalancing/cleanup/pruning? Is your primary key organization based on content hashes as in IPFS, or on more arbitrary strings as in S3/MinIO? Is your metadata/secondary key system completely integrated or more independent?
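To make the "data-based boundaries" idea concrete, here is a toy content-defined chunker in Python (the parameters are arbitrary, and real systems use a proper rolling hash over a sliding window, e.g. Rabin fingerprints or FastCDC):

    def cdc_chunks(data, min_size=2048, avg_size=8192, max_size=65536):
        # Cut where a cheap hash of the bytes seen since the last boundary
        # hits a target bit pattern, so cut points depend on the content
        # rather than on fixed offsets; min/max sizes bound the chunk sizes.
        mask = avg_size - 1            # avg_size must be a power of two
        chunks, start, h = [], 0, 0
        for i, b in enumerate(data):
            h = (h * 31 + b) & 0xFFFFFFFF
            length = i - start + 1
            if length >= max_size or (length >= min_size and (h & mask) == mask):
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])
        return chunks

Tuning the min/avg/max sizes against your read/write ratio and typical file sizes is exactly the tradeoff described above, which is part of why no single chunker has become the standard.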
That's exactly what your last question points to. If you are, say, Dropbox, and presumably have a super sophisticated key-value store setup, I can imagine you would want your content-addressable layer to be as simple and narrowly optimized as possible, and develop and optimize the indexing, metadata, and key-query system almost completely separately. If you are working on a system that should also scale down to run on individual machines - like IPFS, git-annex, or MinIO before its focus on Kubernetes - you want a system that can run as a single daemon, but also one where users can reason about the whole thing as an integrated concept.
I’m curious about this as well. Is anyone working on building text search on top of FDB? It’s kind of astounding to me that last time I checked Elasticsearch was still essentially the only game in town.
It's pretty hard to catch up with Lucene; there is just so much work, so many features, and so much brainpower in there at this point. Since many features of FoundationDB, such as the transaction guarantees and reliability, are not super important for full-text search, I cannot imagine any company - even Apple or IBM - being able to justify that gigantic investment. Instead, I'm sure nearly any solution will continue to use Lucene under the hood for the foreseeable future.
Lucene is Java, right? There should be space for a native implementation, like ScyllaDB is doing to Cassandra (and DynamoDB, though the gap is not of the same shape in the last case). Or am I missing something?
I used Elasticsearch and ran one cluster in production in the past, and found it horrible. Maybe I'm missing something, since they're so successful, even as a public company...
There’s a lot to Lucene, and there’s also a lot to ElasticSearch. And I think they’re fairly tightly coupled.
But I do think that a well-funded and skilled startup team could take a run at ES. They’re a monopoly in their niche. There has to be money in disrupting them.
I'm sure it's possible in theory, but without a permissive license it would not be relevant to most of the applications that are interesting. Now try convincing a venture-backed startup to build a Lucene alternative and license it permissively after seeing what happened with Amazon and ES...
Yes, it's Java, and also yes, there is definitely space for a native implementation. It's not that I don't want that to happen - on the contrary. But the reality is that it's really hard to do, and I don't see merely getting rid of Java being enough to motivate a relevant player or a large enough dev community right now.
It was brittle beyond belief. The process would run out of memory and crash really easily. You are supposed to size your resources well, but of course you'd like things handled more gracefully when a huge inflow of data or requests comes in; good engineering is "graceful failure", such as just rejecting requests, instead of total mayhem and destruction. It was "interesting" to discover that the monitoring plugins included by default would bring down the entire cluster - 5 beefy AWS machines brought down by pointing their own monitor at them! Yes, there was significant input traffic, but I would expect more solid behavior from production-ready software.
Shard migration and duplication was a beauty to watch when it worked well, but very often it wouldn't. I tried to establish a procedure to bring instances back up safely and repeatably, but sometimes I would just have to try and try again; no procedure guaranteed restoring operation in all cases. I do remember spending more than one weekend babysitting that cluster. There's a special place in my heart for that memory, and it's not a beautiful one.
I was very surprised to see them IPO successfully - this testing was before they went public. I assumed the open source version was just missing the kinds of tools and know-how needed to work solidly without paying them or their consultants. These days, I think it's poor engineering resulting in brittleness; we shouldn't have put so much traffic through it, and probably everyone out there experiences the same behavior unless they keep much safer margins, which probably makes it expensive - but I guess who's complaining.
I ended up designing and writing my own (much simpler) distributed system for what I needed (much simpler than everything ElasticSearch does, of course), and it's been in reliable operation since summer 2015. Good engineering goes a long way. Now in the process of turning that approach into a startup. If I get 1/10th as successful as ES I will be a happy camper :)
Peruse the FDB forum. They produce document and record layers now. There are community layers of varying quality for a network block device, a filesystem, and a few other things.
Am I right that this is like a distributed form of something like LevelDB or RocksDB, which would be the underlying storage engine for a full database product?
It's unfortunate that they went silent for years after the Apple acquisition. That period was key for database adoption. I have the feeling everybody kind of settled for pgsql.
> I have the feeling everybody kind of settled for pgsql.
That's probably because of spending time on this echo chamber.
In reality everyone has likely been staying with the same databases they know and love but just moved to the cloud. It's why now AWS for example offers such a wide variety of databases e.g. MySQL, PostgreSQL, SQL Server, Oracle, MongoDB, Cassandra, Redis.
Those are two completely non-overlapping use cases. If you can use pgsql for your problem, you have no business trying to use a distributed key-value store instead. That would be at least as dumb as driving screws with a hammer.
Yeah, but there are quite a few efforts out there to extend PG into a distributed DB of one flavor or another. Some examples are YugabyteDB, CockroachDB, Aurora and Citus. It's a reasonable approach, but it's also reasonable to come at it from the other direction - build a SQL engine on top of a solid distributed key-value store. Counterfactuals are always dicey, but FDB vanishing behind the Apple wall of silence sure didn't help.
> Some examples are YugabyteDB, CockroachDB, Aurora and Citus.
Of those, the first two are not PG; they just share the wire protocol and try to be compatible at the SQL level. Aurora is not really distributed; it's replicated for availability and durability six ways at the block storage level. Citus is distributed, as I understand it, though.
> but it's also reasonable to come at it from the other direction - build a SQL engine on top of a solid distributed key-value store.
Sure, that's possible. I'm not talking about the wisdom of building your own relational database as the end goal, just that a distributed key-value database and a SQL database don't have overlapping problem sets.
Sure, quibble about the details. The point is, these are all attempts to make PG more scalable. Around the time these projects got started, Foundation looked like abandonware. If it hadn't, it's possible that "how can we have really scalable SQL databases?" might have had Foundation as part of the answer.
Good point. Still, if FDB had been high-profile proprietary software during this time, it might have inspired an open source clone. Spanner inspired Cockroach and Yugabyte, after all.
CockroachDB is exactly this - a SQL engine on top of a distributed key value store. It is not an extension of Postgres itself, it just speaks the protocol and implements many of the features.
Much, much better. Single-line backup/restore, and a Disaster Recovery mode that syncs a second DC and is able to switch on the fly with barely any config (except one file).
This is limited by your creativity and willingness to make tradeoffs.
The only really general statement I can think of is that the "larger"/"longer" your transactions are, the harder a time you'll have getting them to cooperate with FDB. "Small"/"fast" transactions will be easier to fit into its model.
(To likely replies: this isn't an absolute, see all the scare quotes. Yes, things like Redwood will alleviate some of this, but not all.)
IIRC, FDB uses fully optimistic concurrency control. It doesn't do any locking. If you have workloads that are highly contended, you'll need to do something in the layer above to coordinate; otherwise, performance will be unbearable.
This may be outdated - please let me know if the story has evolved here.
If your transactions conflict heavily with each other, yeah, you'll have a bad time, and if everything synchronizes on some small set of keys, you'll have a really bad time. Monitoring the transaction conflict rate on your cluster is important.
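One common mitigation (a generic pattern, not anything FDB-specific beyond its atomic ops) is to shard hot counters and use atomic ADD so the transaction carries no read on the contended key; here's a rough Python sketch with invented names:

    import os
    import struct
    import fdb

    fdb.api_version(630)
    db = fdb.open()
    counter = fdb.Subspace(('hits',))
    SHARDS = 32

    @fdb.transactional
    def bump(tr, n=1):
        # Atomic ADD never reads, so it can't cause a read-write conflict;
        # spreading writes across shards also avoids a single hot key.
        shard = struct.unpack('<I', os.urandom(4))[0] % SHARDS
        tr.add(counter.pack((shard,)), struct.pack('<q', n))

    @fdb.transactional
    def total(tr):
        # Reading the total scans all shards in one small range.
        return sum(struct.unpack('<q', v)[0] for _, v in tr[counter.range()])

For contended read-modify-write logic that can't be expressed as an atomic op, you end up coordinating in the layer above, exactly as the parent says.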