We run several FDB clusters using 3-DC replication and have never once lost data. I remember when we wanted to replace all of the FDB hardware for one cluster in AWS: we simply doubled the cluster size, waited for the data shuffling to calm down, and then started axing the original hardware. We did all of this while sustaining over 100K production TPS.
One thing that makes the above seamless for all existing connections is that clients automatically update their "cluster file" when new coordinators join or are reassigned. That alone is amazing, as you don't have to track down every single client and re-roll it with new connection parameters.
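For anyone curious what that looks like from the client side, here's a minimal sketch using the Python bindings (the path and API version are just examples): the client reads the coordinator list from the cluster file at startup and rewrites that file on disk if the coordinators change, so the same path keeps working across coordinator moves.

    import fdb

    fdb.api_version(630)

    # Example path; the client will rewrite this file itself if the
    # coordinator set changes, so connection parameters never go stale.
    db = fdb.open(cluster_file='/etc/foundationdb/fdb.cluster')

    @fdb.transactional
    def ping(tr):
        tr[b'hello'] = b'world'

    ping(db)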
Anyway, I talk this database up every chance I get. Keep up the awesome work.
- A very happy user.
Given these primitives, you or other engineers can write higher-level APIs on top (e.g. SQL, search, etc.). In fact, we make extensive use of their Record Layer library, which provides a strongly consistent, schema-based write process using Protobuf, including on-write-consistent indexes. Apple has also open-sourced a MongoDB-API-compliant interface that gives you all the consistency guarantees of FDB with the API interaction of MongoDB.
The beauty of FDB is that the primitives are done in such a rock-solid fashion that you can write higher-level APIs without having to worry about the really hard stuff (transactions, failures, config errors, testing, simulation, etc.). Another example is that CouchDB is switching its back end to FDB for its 4.0 release.
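As a toy illustration of what a "layer" looks like in practice (this uses the plain Python bindings, not the Record Layer's actual Java/Protobuf API; the key layouts are made up): keep a secondary index in the same transaction as the primary write, so the index is consistent the instant the write commits.

    import fdb

    fdb.api_version(630)
    db = fdb.open()

    users = fdb.Subspace(('user',))             # ('user', user_id) -> packed name
    by_name = fdb.Subspace(('user_by_name',))   # ('user_by_name', name, user_id) -> ''

    @fdb.transactional
    def add_user(tr, user_id, name):
        # Record and index entry commit atomically, which is what
        # "on-write-consistent indexes" boils down to.
        tr[users.pack((user_id,))] = fdb.tuple.pack((name,))
        tr[by_name.pack((name, user_id))] = b''

    @fdb.transactional
    def find_by_name(tr, name):
        return [by_name.unpack(k)[-1] for k, _ in tr[by_name.range((name,))]]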
Given that the database sorts keys lexicographically, it became a natural fit for us as an online (always-mutable) time-series database.
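To make that concrete, here's a rough sketch with the Python bindings (the ('ts', series, millis) layout is just an illustration, not our actual schema): tuple-encoded timestamps sort in key order, so reading a time window is a single range scan.

    import time
    import fdb

    fdb.api_version(630)
    db = fdb.open()

    ts = fdb.Subspace(('ts',))   # ('ts', series, unix_millis) -> packed value

    @fdb.transactional
    def append_point(tr, series, value):
        tr[ts.pack((series, int(time.time() * 1000)))] = fdb.tuple.pack((value,))

    @fdb.transactional
    def read_window(tr, series, start_ms, end_ms):
        # Lexicographic key order == chronological order within a series.
        return [(ts.unpack(k)[-1], fdb.tuple.unpack(v)[0])
                for k, v in tr.get_range(ts.pack((series, start_ms)),
                                         ts.pack((series, end_ms)))]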
If you want more insight into how we use it, I go over it in some detail towards the end of my keynote from last August.
> I ran FoundationDB Key-Value Store through every nemesis in Jepsen - including those that found failures in other databases - and FoundationDB passed all of them with flying colors.
FoundationDB is one of the coolest pieces of technology I've used in the past decade. The tuple keyspace is incredibly useful, and so are the multi-key transactions. I've physically killed the power on an FDB node and on a whole FDB cluster, multiple times (heh, home servers)... and every time the node or cluster just comes back.
Having someone other than those officially on the Jepsen project run the Jepsen tests is a good start. However, many databases have claimed to run the Jepsen tests themselves and pass, yet when there is an actual paid engagement for a distributed database, issues are always found. That's generally true even for unpaid official runs, although ZooKeeper did pass the existing tests. Every database is different, and a paid engagement will design tests specifically meant to break the database in question.
"Rigorous correctness testing via simulation makes FDB extremely reliable. In the past several years, CloudKit  has deployed FDB for more than 0.5M disk years without a single data corruption event. Additionally, we constantly perform data consistency checks by comparing replicas of data records and making sure they are the same. To this date, no inconsistent data replicas have ever been found in our production clusters."
"For example, early versions of FDB depended on Apache Zookeeper
for coordination, which was deleted after real-world fault injection found two independent bugs in Zookeeper (circa 2010) and was replaced by a de novo Paxos implementation written in Flow. No production bugs have ever been reported since."
It's also really, really weird that their non-scalable architecture hits a brick wall at 25 machines. Ignoring the correctness flaws, it only works if you can either design around that limit by sharding (and never need cross-shard transactions), or assure yourself that your use case will never outgrow half a rack of equipment.
In general I think people who think they want to do FoundationDB owe themselves a serious contemplation of the cost/benefit of using Cloud Spanner instead. Obviously you cannot do your own fault injection testing of Spanner, but it does have end-to-end checksums.
that's nuts. RocksDB could've been added as a storage engine to FDB far more easily.
But, also for the same record, thinking you can implement a reliable, globally-replicated key-value store on top of FoundationDB that is cheaper and better than Cloud Spanner may be evidence of the same cognitive bias.
man, good thing nobody made any claim like that.
That's... brave. Flow is a DSL built on top of C++?
Their talk about it from a while ago was something that really blew me away at the time.
My understanding is that FDB relies heavily on deterministic simulations for testing, and that their async/await model is a big part of how they make sure they cover different possible interleavings in a deterministic way.
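Roughly, the trick looks like this (a toy Python sketch of the idea, not FDB's actual Flow machinery): run all tasks on a virtual clock where every source of "randomness" comes from one seeded RNG, so any interleaving, and any failure it uncovers, can be replayed exactly by reusing the seed.

    import heapq
    import random

    class SimScheduler:
        def __init__(self, seed):
            self.rng = random.Random(seed)   # the only source of nondeterminism
            self.now = 0.0
            self.queue = []                  # (wake_time, tie_breaker, task)
            self.counter = 0

        def spawn(self, task, delay=0.0):
            self.counter += 1
            heapq.heappush(self.queue, (self.now + delay, self.counter, task))

        def run(self):
            while self.queue:
                self.now, _, task = heapq.heappop(self.queue)
                try:
                    wait = next(task)                    # task yields a sleep request
                    jitter = self.rng.random() * 0.01    # deterministic "network" jitter
                    self.spawn(task, wait + jitter)
                except StopIteration:
                    pass

    def worker(name):
        for step in range(3):
            print(f"{name}: step {step}")
            yield 0.05   # simulated I/O wait

    sched = SimScheduler(seed=42)   # same seed -> identical interleaving every run
    sched.spawn(worker("a"))
    sched.spawn(worker("b"))
    sched.run()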
We've tried Ceph, GlusterFS, HDFS, MinIO, and some others, and eventually decided on a custom FDB solution. It's a breeze to set up and seems to eclipse the others in performance and reliability. Kyle (aphyr), the author of the Jepsen series on distributed-systems correctness, said: "haven't tested foundation in part because their testing appears to be waaaay more rigorous than mine."
The way we use FDB, if anyone is interested, is that we simply split files into small chunks (per FDB data design recommendations) and store all file and folder metadata in FDB, such as byte count, create/access/write times, permissions, and a lot more. Folders are handled by the built-in Directory layer.
We had a paper on it at ACM Middleware, and it's open source on GitHub. Are you going to publish your solution?
(Discussed here on HN: https://news.ycombinator.com/item?id=25149154 )
The only reason you need chunking is that FDB has very clearly defined limits in its documentation, and one of those limits is value size: it can't exceed 100 kB and should be kept below 10 kB for best performance.
For statistics like folder byte count, you could use FDB's atomic increment operation.
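Roughly what both of those look like with the Python bindings (the ('blob', ...) / ('stat', ...) layouts and the 10 kB chunk size are purely illustrative):

    import struct
    import fdb

    fdb.api_version(630)
    db = fdb.open()

    CHUNK = 10_000                    # stay under the ~10 kB "best performance" value size
    blobs = fdb.Subspace(('blob',))   # ('blob', folder, name, chunk_index) -> bytes
    stats = fdb.Subspace(('stat',))   # ('stat', folder, 'bytes') -> little-endian int64

    @fdb.transactional
    def write_file(tr, folder, name, data):
        # (an overwrite would also need to subtract the old file's size first)
        tr.clear_range_startswith(blobs.pack((folder, name)))
        for i in range(0, len(data), CHUNK):
            tr[blobs.pack((folder, name, i // CHUNK))] = data[i:i + CHUNK]
        # Atomic ADD keeps the folder byte count correct under concurrent
        # writers without creating transaction conflicts.
        tr.add(stats.pack((folder, 'bytes')), struct.pack('<q', len(data)))

    @fdb.transactional
    def read_file(tr, folder, name):
        return b''.join(v for _, v in tr[blobs.range((folder, name))])

(Files larger than the ~10 MB transaction limit would need their chunk writes spread across multiple transactions, but this shows the shape of it.)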
Here's a good starting point (not mine): https://forums.foundationdb.org/t/object-store-on-foundation...
I'll be happy to answer specific questions if you've got any.
Kafka has limits on message size, and I need a solution for storing large blobs (up to 10 MB) at data ingestion for a very short time, until the job has been processed. So the read/write ratio will be exactly 50%, and there will be a high write load. Is FoundationDB capable of this specific task? Are there some knobs tunable for ACID? Perfect ACID guarantees wouldn't really be needed; if everything is fsynced every second, that would be completely OK.
Without doing your own tests, here are per-core numbers, from which you can extrapolate (e.g. via CPU MIPS) towards your own hardware: https://apple.github.io/foundationdb/performance.html#throug...
FDB is ACID as shipped; you don't need to turn knobs to make it so. The toughest part is figuring out the classes/roles of the processes in the system. Here are a couple of good starting points: https://nikita.melkozerov.dev/posts/2019/06/building-a-found...
Did you look at Crux as well? (DB written in Clojure, has primitives to build changefeeds, opencrux.com).
Whilst Kafka itself is almost certainly not as simple to operate as FDB (although I can't speak from experience), it does in turn provide Crux with dead-simple clustering, because each Crux node acts as an isolated replica. FDB could be used equally instead of Kafka, but the maximum write throughput would then be somewhat lower. The only key part of the story Crux doesn't currently supply out-of-the-box is a load-balancing layer atop a cluster of nodes.
I expect the real drawback of Crux's current design, by comparison, is that each node is an ~expensive full replica, whereas your system (built on FDB) will benefit from fully sharded indexes running across a cluster of (smaller) machines. The main trade-off then is the low-latency query performance of a fully local KV store.
I wasn't criticizing Crux by any means. I like the project, and I spent a long time thinking carefully about bitemporality, as this is something I often need and have to implement by myself.
My decision to go with FDB was based on many factors, and it was taken over the course of multiple years. Some factors were technical (e.g. fully distributed, correct, ability to implement changefeeds) and some were not (I am not happy with the internal complexity of Kafka).
Also, I don't feel comfortable with the way most DB discussions are framed. Most people think that one can choose a database and switch between databases at will, which implies there is a clearly defined division between "a database" (with an API) and "the app". That is not necessarily true with FDB. To get the full value out of it, your app code should be aware of transactions, participate in database mechanisms (versionstamps), correctly handle asynchronous streaming of large amounts of data with backpressure, etc.
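As a concrete (toy) example of what participating in those mechanisms looks like, here's a versionstamp-ordered changefeed sketch with the Python bindings (the ('log',) subspace and the API version are just illustrative):

    import fdb

    fdb.api_version(630)
    db = fdb.open()

    log = fdb.Subspace(('log',))

    @fdb.transactional
    def append_event(tr, payload):
        # The 10-byte versionstamp is filled in by the cluster at commit time,
        # so entries end up totally ordered by commit version.
        key = log.pack_with_versionstamp((fdb.tuple.Versionstamp(),))
        tr.set_versionstamped_key(key, payload)

    @fdb.transactional
    def read_since(tr, last_key=None):
        # Consumers tail the feed from the last key they have seen.
        begin = fdb.KeySelector.first_greater_than(last_key) if last_key else log.range().start
        return list(tr.get_range(begin, log.range().stop, limit=100))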
In the case of my app, I did not want to switch from one "document" database to another; I wanted to be able to write a distributed application on top of an underlying well-tested transactional database. That's why FDB is a good fit, and by "FDB" I do not mean their document DB layer, I mean basic FDB itself (I only use the tuple layer).
I hope that explains things a bit more.
>To get the full value out of it, your app code should be aware of transactions, participate in database mechanisms (versionstamps), correctly handle asynchronous streaming of large amounts of data with backpressure
This is an excellent point, and broadly aligns with Crux's design also, since the transaction semantics provided are relatively raw and therefore much of the interesting parts of the "database" logic live firmly in the application code.
It sounds like you have a really interesting system and I would love to hear more about it one day :)
It ought to show up in the archive in a couple of days. Should already be in the Clojurians Zulip mirror as well.
Also, you'd better get familiar with a whole bunch of hidden "knobs" that are apparently configurable and very important somewhere, and that then get printed out into XML logs; but of course there is no log viewer, so you have to write your own. Maybe this isn't a problem for large companies, but I'm providing feedback as a single user here.
I also don't understand how people can praise the C++ DSL. They should rewrite it into standard C++ coroutines as soon as possible so their entire build and dev environment isn't so hard to understand. As a user of open-source software, I generally like to be able to debug through the projects I use and figure out the problems I have. That's much harder when a project uses its own custom language. I certainly tried to set it all up correctly, but there always seemed to be problems with IntelliSense within the IDE.
What is this process describing, and how does it differ from what you were trying to do?
If I remember correctly, there is a definite problem if you remove one of the processes that guarantee your configured redundancy mode. Suddenly your entire cluster is inoperable. Yes, really: I think it doesn't even properly respond to CLI commands anymore and shows nodes as simply missing. Oh, and suddenly those long wait times for your CLI commands really start to bother you...
I don't really want to talk more about it because it's been a while but I just wanted to make the point that the user experience was quite bad.
We've run fdb in production for several years at this point. We have dramatically scaled production clusters up and down.
I have been studying these key-value stores with efficient range iteration lately (such as LevelDB, RocksDB, BigTable, FoundationDB, etc). This is a great reference on how to make such a simple abstraction do a lot of useful things.
Is it CP or AP? Comments seem to imply AP
It is CP per https://apple.github.io/foundationdb/cap-theorem.html
And/or would it be comparable to DynamoDB?
Edit: I suppose various document/DB applications - like IMAP - might be a good fit too?
My understanding is object stores are typically “flat” by design to scale well (in contrast to a tree structure found in filenames).
For content addressing, are people using the keys in sophisticated ways or are the values being indexed? Any reason to push this complexity into the storage layer as opposed to composing the functionality?
That's exactly what your last question points to. If you are, let's say, Dropbox, and probably have a super sophisticated key-value store setup, I can imagine you would want your content-addressable layer to be as simple and narrowly optimized as possible, and develop and optimize the indexing, metadata, and key-query system almost completely separately. If you are working on a system that should also scale down to run on individual machines - like IPFS, git-annex, or MinIO before their focus on Kubernetes - you want something that can run as a single daemon, but also where users can reason about the whole system as an integrated concept.
The good thing about Lucene existing is that you're allowed to use their good ideas too.
I used Elasticsearch and ran one cluster in production in the past, and found it horrible. Maybe I'm missing something, since they're so successful, even as a public company...
But I do think that a well-funded and skilled startup team could take a run at ES. They’re a monopoly in their niche. There has to be money in disrupting them.
Shard migration and duplication was a beauty to watch when it worked well, but very often it wouldn't. I tried to establish a procedure to bring instances back up safely and repeatably, but sometimes I would just have to try and try again, no procedure guaranteed restoring operation in all cases. I do remember spending more than one weekend babysitting that cluster. There's a special place in my heart for that memory, and it's not a beautiful one.
I was very surprised to see them IPO successfully - this test was before they were public. I assumed the open-source version was just missing the kind of tools and know-how needed to work solidly without paying them or their consultants. These days, I think it's poor engineering resulting in brittleness; we shouldn't have put so much traffic through it, and probably everyone out there experiences the same behavior unless they keep much safer margins, which probably makes it expensive, but I guess who's complaining.
I ended up designing and writing my own (much simpler) distributed system for what I needed (much simpler than everything ElasticSearch does, of course), and it's been in reliable operation since summer 2015. Good engineering goes a long way. Now in the process of turning that approach into a startup. If I get 1/10th as successful as ES I will be a happy camper :)
That is impressive. Like a framework for implementing NoSQL DBs.
It would be really cool to find it, if it's still out there.
Edit: found it https://www.voltdb.com/blog/2015/04/foundationdbs-lesson-fas...
Not sure what to think of it, I'm not a DB expert by any means but the post sounds plausible enough and the SQL layer is discontinued AFAIK. I guess with each new abstraction layer you leave some perf on the table.
That's probably because of spending time on this echo chamber.
In reality everyone has likely been staying with the same databases they know and love but just moved to the cloud. It's why now AWS for example offers such a wide variety of databases e.g. MySQL, PostgreSQL, SQL Server, Oracle, MongoDB, Cassandra, Redis.
Of those, the first two are not PG; they just share the wire protocol and try to be compatible at the SQL level. Aurora is not really distributed; it's replicated six ways at the block-storage level for availability and durability. Citus is distributed, as I understand it, though.
> but it's also reasonable to come at it from the other direction - build a SQL engine on top of a solid distributed key-value store.
Sure, that's possible. I'm not talking about the wisdom of building your own relational database as the end goal, just that a distributed key-value database and a SQL database don't have overlapping problem sets.
The only really general statement I can think of is that the "larger"/"longer" your transactions are, the harder a time you'll have getting them to cooperate with FDB. "Small"/"fast" transactions will be easier to fit into its model.
(To likely replies: this isn't an absolute - see all the quotes. Yes, things like Redwood will alleviate some of this, but not all of it.)
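For example, the classic workaround when a big scan won't fit inside the ~5-second / ~10 MB transaction limits is to break it into short transactions and resume from the last key seen. A Python sketch, under the assumption that a per-batch snapshot (rather than one globally consistent snapshot) is acceptable:

    import fdb

    fdb.api_version(630)
    db = fdb.open()

    def scan_all(db, subspace, batch=1000):
        # Each batch is its own short transaction; we resume from the last key
        # instead of holding one long transaction open past its limits.
        begin, end = subspace.range().start, subspace.range().stop
        while True:
            tr = db.create_transaction()
            kvs = list(tr.get_range(begin, end, limit=batch))
            if not kvs:
                return
            for kv in kvs:
                yield kv.key, kv.value
            begin = fdb.KeySelector.first_greater_than(kvs[-1].key)

    # e.g.: for k, v in scan_all(db, fdb.Subspace(('blob',))): ...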
This may be outdated; please let me know if the story has evolved here.