FoundationDB: A distributed unbundled transactional key value store (micahlerner.com)
254 points by mlerner 3 months ago | 44 comments

I wrote a post explaining FDB as well, which got retweeted by @FoundationDB https://mobile.twitter.com/aptxk1d/status/141786976657708237...

My post focuses on correctness, including a proof of the correctness of FDB’s failure recovery, parts of which are missing even from the paper itself.

Some of the paper's authors reached out, and I corrected one issue in my blog post that one of them pointed out.

Once CouchDB switches to using FoundationDB I think it'll surpass most NoSQL stores. Can't wait. The main issue right now with FoundationDB is that it's very nontrivial to set up and interface with, IMHO.

Is there an announced date for when this switch might happen? Overall a good decision by the CouchDB community.

I also look forward to CouchDB 4. But CouchDB 2 and 3 have been very easy to set up and use over the years. Hopefully it stays the same.

I'm not sure FDB will get easier anytime soon, if ever.

Certainly had a devil of a time trying to get it to work on Windows.

I'd love to know how Apple distributes its CloudKit data across all of (what I assume are) its dozens of globally distributed FoundationDB clusters.

As of version 6.0, a single FoundationDB cluster currently only supports one active, writable 'master' region + one hot failover region. Transactions are all essentially LAN so performance characteristics are great. It doesn't support the slower TiDB/CRDB style active/active configuration where all nodes are writable, data is globally distributed and transactions can span across all nodes in the cluster.

Do they have some kind of scheme whereby ACID is preserved for local transactions only, while far-away regions are replicated to asynchronously, with fewer guarantees on data consistency?

There were 2 FDB papers published recently – this one discusses some of that: https://www.foundationdb.org/files/QuiCK.pdf

Maximum fdb cluster size is actually not that large, in host count terms, and the maximum practical size per host is similarly not large for the Apple scale, so they are almost certainly sharding objects across hundreds or thousands of clusters.

This has nice properties for blast containment.
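To make the blast-containment point concrete, here is a minimal sketch of sharding tenants across many independent clusters. It is purely illustrative and not Apple's actual scheme: the hash-modulo placement, `cluster_for`, `affected_tenants`, and the cluster and tenant names are all assumptions.

```python
import hashlib


def cluster_for(tenant_id, clusters):
    """Map a tenant to one cluster with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return clusters[int.from_bytes(digest[:8], "big") % len(clusters)]


def affected_tenants(tenants, clusters, failed_cluster):
    """Tenants that lose service if exactly one cluster goes down."""
    return [t for t in tenants if cluster_for(t, clusters) == failed_cluster]


if __name__ == "__main__":
    clusters = [f"cluster-{i}" for i in range(100)]
    tenants = [f"user-{i}" for i in range(10_000)]
    down = affected_tenants(tenants, clusters, "cluster-7")
    print(f"{len(down)} of {len(tenants)} tenants affected")
```

With many small clusters, losing one only touches the tenants placed on it. A real deployment would more likely keep an explicit placement directory than use a modulo, since changing the cluster count would otherwise remap most tenants.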

I suspect it’s because they’re not just running a single cluster, but many many clusters with customers distributed globally. They had a talk about how they do it with other systems some time ago that I can’t find at the moment.

A vaguely similar project that might be of interest is: https://github.com/microsoft/FASTER

It's also an "unbundled" low-level component that one could use as the foundation for a database engine or whatever. According to Microsoft, FASTER is not just "fast", but significantly faster than even some basic in-memory data structures that ship in the .NET standard library!

The downside is that it doesn't (yet) support some more advanced features like multi-server distributed mode.

However, that relative simplicity may be preferred in some scenarios...

Is it a LevelDB-like log structured merge tree?

Their distributed checkpointing/recovery thing (built on FASTER) is very interesting!

The graph in the upper right corner of page 9 is certainly impressive: https://tli2.github.io/assets/pdf/dpr-sigmod2021.pdf (it is indexed with number of virtual machines)

If anyone can use it for anything other than session storage I'd love to hear about that. Their API is so limited that I just don't understand how I can possibly use it.

The CMU Vaccination Database Talks series includes an excellent talk on FASTER from one of the authors: https://youtu.be/4Y0oNFud-VA

Thanks for that link.

I'm migrating from RethinkDB to FoundationDB right now. One thing which is easy to miss when you look at FoundationDB is that it is more of a toolkit for building databases than a database in itself. So, if you're looking for a faster horse, say a "better PostgreSQL", or a MongoDB that doesn't lose your data, it will seem disappointing ("where are all the features I expected?").

Where I think it really shines is when you couple it tightly with your application. Suddenly you get a fully transactional distributed database with clearly understandable semantics, and if you can show some flexibility in how you map your data onto the underlying primitives, you can get fantastic performance, as well as correctness.

FoundationDB has rather severe limitations:

> Transaction size cannot exceed 10,000,000 bytes of affected data

In my experience almost all transactions are smaller than that, but there is a long tail of large ones even for the same operation (because some 1:n link usually has a small n, but occasionally has an n which is orders of magnitude larger).

> FoundationDB currently does not support transactions running for over five seconds.

Limiting the duration of write transactions isn't too bad. But being able to use a longer-lived read-only snapshot is very convenient.

Do higher level databases building on FoundationDB (e.g. the SQL layer) typically work around some of these limitations, or do they leave it to the application to avoid these?

The size limit makes a lot of sense when you look at the transaction model. When you write to FDB, none of the writes are even sent to the server until you attempt to commit the transaction, so this limit is actually a "maximum request size limit" on the final commit request.

AIUI, the transaction size does not include values that are read, and for snapshot reads does not include the keys either, so this 10 MB limit only really constrains transactions that are writing a large volume of data.

For those transactions, the documentation suggests writing the data first (using many smaller transactions) and then only updating a pointer to that data in the final transaction.
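A rough sketch of that write-then-publish pattern, using an in-memory dict in place of a real cluster. Everything here is an assumption for illustration and not FoundationDB's API: `FakeStore`, `commit`, `CHUNK_SIZE`, and the `blob/` and `ptr/` key layout are hypothetical. Only the 10,000,000-byte figure comes from the quoted limit.

```python
import uuid

TXN_LIMIT = 10_000_000  # quoted limit on affected data per transaction
CHUNK_SIZE = 50_000     # assumption: small enough to fit any per-value cap too


class FakeStore:
    """In-memory stand-in for a transactional key-value store (hypothetical)."""

    def __init__(self):
        self.data = {}

    def commit(self, writes):
        # Model the commit request: reject it if keys plus values are too big.
        size = sum(len(k) + len(v) for k, v in writes.items())
        if size > TXN_LIMIT:
            raise ValueError("transaction too large")
        self.data.update(writes)


def write_large_value(store, name, payload):
    """Store payload with many small commits, then publish it with one pointer."""
    blob_id = uuid.uuid4().hex.encode()
    count = 0
    for offset in range(0, len(payload), CHUNK_SIZE):
        key = b"blob/" + blob_id + b"/%08d" % count
        store.commit({key: payload[offset:offset + CHUNK_SIZE]})
        count += 1
    # Readers only see the new data once this final, tiny commit lands.
    store.commit({b"ptr/" + name: blob_id + b":%d" % count})


def read_large_value(store, name):
    """Follow the pointer and reassemble the chunks in order."""
    blob_id, count = store.data[b"ptr/" + name].split(b":")
    return b"".join(
        store.data[b"blob/" + blob_id + b"/%08d" % i] for i in range(int(count))
    )
```

If the writer dies before the pointer commit, the chunks are orphaned but never visible, so a real implementation would also need some way to garbage-collect them.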

I'm currently building a simple database I'm calling AgentDB on top of FDB. It implements a message-passing system within the database, with guaranteed exactly-once semantics. It's designed to allow business logic to be more easily expressed without having to worry about all the failure modes typically present in a distributed system.

For this, I process messages in batches, with one transaction per batch. If a transaction fails, I retry with a smaller batch size, so in my case the answer would be "yes, to an extent". The user of AgentDB still has to ensure that processing a single message doesn't exceed the transaction size, but they don't have to worry that a batch of messages would exceed that.
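The shrinking-batch retry could look roughly like this. It is a sketch of the general idea and not AgentDB's actual code: `TransactionTooLarge`, `process_all`, `run_batch`, and the halving and reset policy are all hypothetical.

```python
class TransactionTooLarge(Exception):
    """Hypothetical error raised when a batch does not fit in one transaction."""


def process_all(messages, run_batch, initial_batch_size=64):
    """Process messages in order, one transaction per batch.

    run_batch(batch) must apply the whole batch atomically, or raise
    TransactionTooLarge without applying any of it.
    """
    batch_size = initial_batch_size
    i = 0
    while i < len(messages):
        batch = messages[i:i + batch_size]
        try:
            run_batch(batch)
        except TransactionTooLarge:
            if len(batch) == 1:
                # A single message that does not fit is the caller's problem.
                raise
            batch_size = max(1, len(batch) // 2)  # retry with half the batch
            continue
        i += len(batch)
        batch_size = initial_batch_size  # assumption: reset after a success
```

Because a failed batch applies nothing, retrying the same messages in smaller groups does not process any message twice.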

The Record Layer provides mechanisms to get around the key size limitations and the transaction time/size limitations.

See section 4: https://www.foundationdb.org/files/record-layer-paper.pdf

If that is a severe limitation, how would you characterize DynamoDB?

One thing I’ve never been able to figure out with FoundationDB is how to deploy it, and where to find a good list of layers for various use cases (SQL, document store, etc.)

These two resources helped me a lot in understanding how to set up FoundationDB processes correctly:



Wouldn't the "unbundled" design cause a lot of headaches when trying to deploy it in production?

Keep in mind that while FoundationDB has its place and is really good at what it does, it's SSD-only. If you're running on HDDs for cheaper storage, you're better off with something else.

Does anyone have real experience running this in prod vs. TiKV? The latter seems like a more familiar design, but it's missing watches, which FDB seems to support.

Former Apple employee here. They’ve been using it extensively in production for years. Pretty much all of CloudKit is run using FDB for storage.

Public source for that info, so I don’t get accused of NDA violations - https://youtu.be/HLE8chgw6LI

Timestamp 14:15 touches on CloudKit’s extensive FDB usage.

Nobody here is going to accuse you of violating your NDA :P

I might, but it's mostly because you said that I wouldn't

One of PingCAP's (the company behind TiKV and TiDB) founders once did his own FoundationDB review [1] and concluded: "the performance is cool". I'm assuming that at the time FDB's performance was superior to TiKV's.

[1] https://medium.com/@siddontang/benchmark-foundationdb-with-g...

Disclaimer: I work for PingCAP. That benchmark didn't compare to TiKV. We generally avoid making such comparisons, not because our competitors are better, but because we know we will always have difficulty benchmarking fairly. Instead we generally do what is shown in that blog post: note in which aspects of a benchmark other databases show strength, because that indicates an opportunity to learn.

We don't view ourselves in competition with FoundationDB because most TiKV users want the ability to use TiDB with TiKV for SQL (and MySQL compatibility). If you know you don't ever need that then FDB can be a good choice (but TiKV can be as well).

I've only ever tinkered with TiKV, but one advantage that it currently has over FDB is its support for coprocessors: the ability for each node in the cluster to run some piece of query logic locally on its own shard of data, instead of having all of the data shipped to one node (or even all the way up to the application level) for processing, which is quite bandwidth-intensive.
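A toy illustration of why pushing logic to the shards saves bandwidth. This is not TiKV's coprocessor API: the shards are plain Python lists, and `client_side_sum` and `pushed_down_sum` are hypothetical names. Each function returns the result plus the number of values that would cross the network.

```python
def client_side_sum(shards):
    """Ship every row to the caller, then aggregate there."""
    shipped = [row for shard in shards for row in shard]
    return sum(shipped), len(shipped)


def pushed_down_sum(shards):
    """Aggregate on each shard; ship only one partial result per shard."""
    partials = [sum(shard) for shard in shards]
    return sum(partials), len(partials)


if __name__ == "__main__":
    shards = [list(range(0, 1000)), list(range(1000, 2000)), list(range(2000, 3000))]
    print(client_side_sum(shards))  # same total, 3000 values shipped
    print(pushed_down_sum(shards))  # same total, 3 values shipped
```

Both paths compute the same answer; the difference is that the second one ships one number per shard instead of every row.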

We do, and while I love it, a word of warning: do not allow production FDB clusters to have any member get above 90% storage consumed.

Isn't this the one that disappeared overnight? Good luck getting the people's trust again

It was a private company, was bought by Apple, and is now Apache 2.0 with major contributions from outside Apple.



EDIT: I forgot they were a commercial product before Apple open-sourced it after acquisition.

FoundationDB was pulled from market when Apple acquired it in 2015.[1] Apple re-released it as open source in 2018.[2] Since FoundationDB is FOSS this time around, it would be possible to fork it in case it becomes unavailable again.

[1] https://news.ycombinator.com/item?id=9259986

[2] https://news.ycombinator.com/item?id=16877395

I completely forgot it was initially closed source, and only seriously considered it when it was open-sourced in 2018. For the sake of the GP's argument, though, I'd say that rebuilt an immense amount of trust in this case. My bad!

Perhaps you were referring to DabbleDB?

In theory, their design makes sense: separating the component pieces makes it easier to adapt to different systems/scenarios, it allows you to more succinctly map out the architecture (and hopefully simplify interfaces), you have that whole testing framework you can use on the individual pieces, etc.

The problem is with implementation. The more you separate components, the more code you have to add to allow them to interoperate. The more code, the more complexity. The more complexity, the more bugs. In addition, when you continue to separate components out into new failure domains that have a higher probability of failure ("3 components running on 1 node" -> "3 components running on 3 separate nodes") you increase the chances of problems even more. So, while the design looks nice in theory, and I'm sure they have some wonderful reports about how thorough their test framework is, in practice it might be a tire fire (depending on how it's actually used).
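The failure-domain arithmetic behind that "1 node vs. 3 nodes" comparison can be sketched as follows, under two loud assumptions: node failures are independent, and the system needs every component up. The function name `p_any_down` and the 1% per-node figure are made up for illustration.

```python
def p_any_down(p_node, nodes):
    """Probability that at least one of `nodes` independent nodes is down."""
    return 1 - (1 - p_node) ** nodes


if __name__ == "__main__":
    p = 0.01  # hypothetical chance that a given node is down
    print(p_any_down(p, 1))  # 3 components on 1 node: about 0.01
    print(p_any_down(p, 3))  # 3 components on 3 nodes: about 0.0297
```

Under those assumptions, spreading three required components over three nodes roughly triples the chance that something is down at any given moment.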

Some people are going to go, "I've been using this in production for a year, it's rock solid!" Well, people said that about Riak, until they wanted to set it on fire when they found out how buggy the implementation was at actually handling various errors and edge cases, and that they each had to hire dedicated Riak programmers just to fix all its problems. You can make large-scale Riak clusters work "at scale" if you hire enough on-call to constantly fight fires and monkey-patch the bugs and juggle clusters and nodes and repair indexes, and hand-wave outages and flaws as due to some other issue ("the cluster was completely hosed and had to be recovered from backup because the disks got corrupted and a node went down and a network partitioned all at once, it's not Riak's fault").

(looking at this HN thread, it seems like my assumption may be correct: https://news.ycombinator.com/item?id=27424605)

> looking at this HN thread, it seems like my assumption may be correct

that has got one guy complaining about fdb, and a couple of others saying positive things.

so in short, you've got no experience with fdb, wrote a couple of paragraphs of speculation and references to other systems, and then declared victory, hoping nobody would read the other thread, i guess.

i've written stuff against fdb, and i've seen it in non-trivial production. it's not a panacea, it's a useful point in the design space of databases, and does pretty well there.

Do you work in distributed systems at a level where you understand their claims, and the significance of this work vs. a general dislike of their high level implementation choices? Not implying anything with the previous statement except that not all software people understand strict serializability, etc. in distributed systems. In this case, you need some understanding to critique the paper.

And if you do work in distributed systems at that level, please elaborate in detail what you're trying to refute, because it's all "I feel this should be bad. We need to be careful in trusting them!", which is unconvincing.

If this paper is to be trusted, FDB is tested against many failure modes and scenarios. I expect it may fail, but it should fail in a predictable way.

My point is that even if it fails in a predictable way in test scenarios, it won't necessarily in real life. The paper & test framework are effectively going to trick people into trusting that the thing will work well, rather than going by actual observations of how it works in varied production environments. For the casual observer, who cares? But for the people about to spend millions of dollars on this tech to run it in production, I hope they aren't swayed purely by theoreticals.

Remember a decade ago when NoSQL came out, and everybody was whooping and hollering about how amazing the concept was, and people like me would go "Well, wait a minute, how well does it actually run in production?" And people on HN would shout us down because the cool new toy is always awesome. And lo and behold, most NoSQL databases are no better in production than Postgres, if not a tire fire. People who have fought these fires before can smell the smoke miles away.

Their test framework uncovered a few ZooKeeper bugs.

ZooKeeper bugs are very rare, even in production, and hard to find.

If they can uncover ZooKeeper bugs, I think they did some serious testing.

You're still missing my point. Testing shows you testing bugs. Running in production shows you production bugs. There are bugs you literally will never see in test until you run in production. There is absolutely no way to discover those bugs purely by testing, no matter how rigorous your test is.

The most glaring of these bugs are SEUs caused by cosmic rays. Unless your test framework is running on 10,000 machines, 24 hours a day, for 3 years, you will not receive the SEUs that will affect production and cause bugs which are literally impossible without randomly-generated cosmic rays.

The simpler bugs are things like buggy firmware. Or very specific sections of a protocol getting corrupted in very specific ways that only trigger specific kinds of logic in a program over a long period of time. Or simply rolling over a floating point that was never reached in test because "we thought 10,000 test nodes was enough" or "we didn't have 64 terabytes of RAM".

And more examples, like the implementation or the administrative tools just being programmed shittily. Even if you find every edge case, you can write a program badly which will simply not deal with it properly. Or not have the tools for the admins to be able to deal with every strange scenario (a common problem with distributed systems that haven't been run in production). Or 5 different bugs happening across 5 completely different systems at the same time in just the right order to cause catastrophic failure.

Complex systems just fuck up more. In fact, if your big complex distributed system isn't fucking up, it's very likely that it will fuck up and you just haven't seen it yet, which is way more dangerous than not knowing how or when it's going to fuck up.
