FoundationDB: A distributed unbundled transactional key value store (muratbuffalo.blogspot.com)
197 points by eatonphil 73 days ago | 84 comments



I was watching the old 2014 video: https://youtu.be/4fFDFbi3toc?t=1081

I've included the timestamp of the relevant bit. He is talking about swizzling, which is like clogging++.

He says they take a subset of the network connections, and have them timeout, but then bring them all back in reverse order, and that this somehow exposes far more bugs than other methods.

This seems like a very interesting "well, that's funny!" moment that might lead to understanding something foundational about how distributed systems fail.

Does anyone have more insight on this?
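
For concreteness, here's my rough mental model of the swizzle as a toy Python sketch (the clog/unclog hooks are hypothetical; the real fault injection lives inside FDB's deterministic C++ simulator):

    import random

    def swizzle_clog(connections, clog, unclog):
        # Pick a random subset of connections, clog them one at a time,
        # then unclog them in the reverse order they were clogged.
        subset = random.sample(connections, k=max(1, len(connections) // 2))
        for conn in subset:
            clog(conn)           # hypothetical hook: make this connection time out
        for conn in reversed(subset):
            unclog(conn)         # bring them back, last-clogged first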


That sounds reasonable to me. In a cascading failure where network congestion ramps up (eg, old school NFS), the machines with the worst network bandwidth or latency would time out first, and be the last to recover.

In a heterogeneous cluster, I might expect the fastest machines to die last, and either a human operator concentrating on getting the most capacity back online first, or just automation getting the fastest one online first, since a faster machine can go from hard down to green health check sooner than the slow machine. Not precisely reverse chronological order, but certainly with large statistical clusters.

And on top of that, exponential back-off would cause the same phenomenon.

But if your question is "why more bugs that way?", your guess is as good as mine. Mine is 'consensus is hard'.


I feel dumb asking: what's the common use case for FDB? And how does it compare to Redis?

I've read lots about how fantastic (and unique) FDB is, but I'm struggling to understand the use case / problem to use it for.


Most apps can use a single-instance relational database or document store. Fast enough with plenty of features. When requirements increase, there are many distributed relational and document stores that provide more scalability and performance.

Key/value stores offer a low-level API to handle extreme requirements when nothing else works. An example is adtech platforms where user profiles and contextual data need to be accessed in a fraction of a second to respond to an ad request, often hitting queries-per-second in the 6/7 figure range. Another example is CDNs that need to look up a specific piece of data across a large dataset very quickly.

As of 2022, there are plenty of key/value stores like Redis, HBase/Bigtable, Aerospike, ScyllaDB that have various trade-offs in speed, scale and flexibility. What sets FDB apart is strong consistency (supporting transactions across multiple keys) and extreme reliability which makes it a good choice as a core metadata and coordination service that other complex apps and distributed systems can rely on.

For example, Apple uses it as a backend for CloudKit, and Snowflake uses it to store the metadata and operations of its cloud data warehouse.

Here's a video playlist from several other companies talking about it at FoundationDB Summit: https://www.youtube.com/playlist?list=PLbzoR-pLrL6oWYrC950yA...
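
To make the multi-key transaction point above concrete, here's a minimal sketch using FDB's Python binding (the key names and amounts are made up):

    import fdb

    fdb.api_version(630)   # pin the client API version
    db = fdb.open()        # uses the default cluster file

    @fdb.transactional
    def transfer(tr, src, dst, amount):
        # All four operations commit atomically or not at all; on a conflict
        # the decorator retries the whole function against a fresh snapshot.
        src_bal = int(tr[src])          # assumes both balances already exist
        dst_bal = int(tr[dst])
        tr[src] = str(src_bal - amount).encode()
        tr[dst] = str(dst_bal + amount).encode()

    transfer(db, b'balance/alice', b'balance/bob', 10)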


This was a great, very detailed answer to a fairly open-ended question, and I appreciate that a lot.


You can build a distributed SQL database on top of FDB, or some other kind of database. It has everything you need to build primary and secondary indexes, including not just transactions but also being able to specify weaker consistency guarantees, the sort of stuff that is useful for internals like, IIRC, building secondary indexes in the background such that they don't conflict (much?) with writes. Its API is very, very nice for this.

For example, there is a MongoDB layer on top of FoundationDB: https://www.foundationdb.org/blog/announcing-document-layer/
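
And as a rough illustration of the secondary-index point, here's a minimal sketch (Python binding; the subspace names are invented) of keeping an index consistent by writing the record and the index entry in the same transaction:

    import fdb

    fdb.api_version(630)
    db = fdb.open()

    users   = fdb.Subspace(('user',))          # ('user', user_id) -> name
    by_name = fdb.Subspace(('user_by_name',))  # ('user_by_name', name, user_id) -> ''

    @fdb.transactional
    def set_user(tr, user_id, name):
        old = tr[users.pack((user_id,))]
        if old.present():                              # drop the stale index entry
            del tr[by_name.pack((bytes(old), user_id))]
        tr[users.pack((user_id,))] = name              # primary record (bytes value)
        tr[by_name.pack((name, user_id))] = b''        # index entry, same transaction

    @fdb.transactional
    def user_ids_named(tr, name):
        r = by_name.range((name,))                     # key range for this name
        return [by_name.unpack(kv.key)[-1] for kv in tr.get_range(r.start, r.stop)]

    set_user(db, 1, b'alice')
    print(user_ids_named(db, b'alice'))                # -> [1]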


from website

"Multi-model data store. FoundationDB is multi-model, meaning you can store many types data in a single database. All data is safely stored, distributed, and replicated in the Key-Value Store component.

Easily scalable and fault tolerant. FoundationDB is easy to install, grow, and manage. It has a distributed architecture that gracefully scales out, and handles faults while acting like a single ACID database."

I've never used it personally but the claims are ACID transactions, persistence, distributed, and reliable. So basically it's for when you want something like Redis, but also want transactions, scaling across multiple machines, and higher reliability.


Transaction support, not just on a single key but across multiple keys. Not many key/value DBs support transactions.


Redis can lose data, so it has to be treated as a cache. And it range-scans only on individual machines, i.e. it doesn't scale itself; you have to scale it.

FoundationDB is transactionally correct across a cluster, and supports range scans.


Care to comment a bit further about losing data? There are two different persistence methods offered by Redis at the moment. https://redis.io/topics/persistence


I was wondering who else uses FDB beyond Apple and Snowflake. Found that CouchDB is rewriting their storage layer to use FoundationDB for their v4 release. That said I couldn’t find much info on progress or release dates.

https://speakerdeck.com/wohali/couchdb-4-dot-0-1-plus-2-equa...


IBM was funding that work and just recently announced that they no longer want to go ahead with the couchdb-fdb implementation and will refocus their support toward V3 related tasks.

I'm not sure of their reasons, but Apple recently backed away from things like providing Windows FDB binaries and had some problems with their foundation site during the jlog incident, which may have been interpreted as a lack of corporate commitment toward costs like hosting for the FOSS community. So IBM may have decided there was too much risk in having this dependency.


That's disappointing. CouchDB itself has also dropped off the radar these days which may have been a consideration too in terms of funding ROI.


Yeah, though it's hard to say which parts matter to IBM's bottom line since in terms of cloud DB sales they must be relying more on the trapped audience "not getting fired for choosing IBM" than on upsell to developers who try couchdb.

Really, around 2.0, things like PouchDB were drowned out by commercially funded choices and things more directly sponsored to work with each of the frameworks.


Wavefront is also using it at scale.

From the list of talks at the 2019 FDB Summit, it also seems to be used at Segment, eBay and Uber


TIL it's SQLite under the covers!

(for now at least)


I suspect the page talking about that might be out of date; in 2018, at a conference about FDB, the team were talking about phasing out SQLite in favour of their more optimised Redwood storage backend. I would assume that has been completed by now.


Not necessarily. One of the unfortunate problems with FDB is the glacial release process. Check their blog and announcement forums and you'll see the last release they felt worthy of announcing was in 2019. GitHub releases shows an enigmatically named 7.0 release but given it doesn't have any release notes and the website still claims 6.3 is the latest version, it may still not be ready to go yet.


(Disclaimer: affiliated)

Redwood isn't "completed", but it's not unavailable. Evaluating it via `fdbcli> configure ssd-redwood-1-experimental ` is encouraged, but it hasn't seen sufficiently deep evaluation and verification that it's been set as the default ssd storage engine yet.

Not all releases got blog posts. Which ones did honestly had far more to do with the people involved in the releases at the time than any technical merits. This is the first feedback that I've seen where the blog posts were used as a signal of worthiness or stability, so we'll see if we can try to be a bit more responsible about making release posts.

Releases that are posted to the downloads page are all equally considered "ready to go". They appear a bit slower than what gets tagged on github, as the production environments in the core set of companies supporting FDB development are used as the last stage of QA before an official release. (Though this distinction might be less clear as the downloads page was recently redirected to the github release artifacts page.)

You can find the 7.0 release notes at https://github.com/apple/foundationdb/blob/main/documentatio..., which will appear in the documentation once the public site is updated to 7.0 (which happens once the first official, public release is blessed).

All that said, the regular release cadence so far has been about every 6 months, which I think still does qualify as "glacial".


Well, every six months for a core DB engine seems OK. I don't know if I'd describe that as glacial, but the problem is, how would anyone know that without serious investigation? There's a 7.0 release that isn't marked as beta or anything, yet the release notes can only be found in git under a directory named "sphinx". It's very ambiguous whether this is released or not, which is weird, and doesn't send positive signals about the overall organization and focus of the project. The message it sends is that Apple have no interest in increasing FDB usage outside of Apple, basically, which from an end user's perspective increases risk. What if Apple decide it's time to move on, or close development again? A community of users and devs helps, but that's not going to emerge unless the community is a priority. Maintaining the website/blog would be step one in that process, I'd think.


> One of the unfortunate problems with FDB is the glacial release process.

One person’s “glacial release process” is another’s “stable production-grade database”. Not everything has to rev like an npm library…


There's a balance, though. Three years without an announced release would be a long time even in the enterprise world. It's not like FDB is perfect. The way it handles cross-region replication in particular is not ideal.


Kind of interesting to list the past threads. Were there other major ones? I omitted a few trivial ones.

FoundationDB: A distributed unbundled transactional key value store - https://news.ycombinator.com/item?id=28740497 - Oct 2021 (44 comments)

FoundationDB: A distributed, unbundled, transactional key value store [pdf] - https://news.ycombinator.com/item?id=27424605 - June 2021 (99 comments)

Swizzle Clogging - https://news.ycombinator.com/item?id=25176931 - Nov 2020 (7 comments)

Testing Distributed Systems with Deterministic Simulation (2014) [video] - https://news.ycombinator.com/item?id=23289593 - May 2020 (8 comments)

Redwood: The Future of the FoundationDB Storage Engine - https://news.ycombinator.com/item?id=20498836 - July 2019 (5 comments)

Apple Open-Sources FoundationDB Record Layer - https://news.ycombinator.com/item?id=18910748 - Jan 2019 (32 comments)

FoundationDB Record Layer - https://news.ycombinator.com/item?id=18906341 - Jan 2019 (78 comments)

FoundationDB Summit: video roundup - https://news.ycombinator.com/item?id=18826096 - Jan 2019 (3 comments)

FoundationDB Document Layer - https://news.ycombinator.com/item?id=18563179 - Nov 2018 (35 comments)

FoundationDB 6.0 released, featuring multi-region support - https://news.ycombinator.com/item?id=18488879 - Nov 2018 (70 comments)

FoundationDB Summit Program Announced - https://news.ycombinator.com/item?id=18304906 - Oct 2018 (23 comments)

FoundationDB high contention allocator - https://news.ycombinator.com/item?id=17817946 - Aug 2018 (9 comments)

FoundationDB community highlights, two weeks in - https://news.ycombinator.com/item?id=16994579 - May 2018 (49 comments)

Snowflake Metadata powered by FoundationDB - https://news.ycombinator.com/item?id=16880379 - April 2018 (8 comments)

Apple open-sources FoundationDB - https://news.ycombinator.com/item?id=16877395 - April 2018 (441 comments)

FoundationDB's Lesson: A Fast Key-Value Store Is Not Enough - https://news.ycombinator.com/item?id=9306297 - April 2015 (32 comments)

Apple Acquires FoundationDB - https://news.ycombinator.com/item?id=9259986 - March 2015 (376 comments)

FoundationDB and the New NoSQL - https://news.ycombinator.com/item?id=8745333 - Dec 2014 (19 comments)

Databases at 14.4Mhz - https://news.ycombinator.com/item?id=8729420 - Dec 2014 (81 comments)

The Future of NoSQL - https://news.ycombinator.com/item?id=8599198 - Nov 2014 (24 comments)

Testing Distributed Systems with Deterministic Simulation [video] - https://news.ycombinator.com/item?id=8345030 - Sept 2014 (6 comments)

On Lowered Expectations: Transactions, Scaling, and Honesty - https://news.ycombinator.com/item?id=7869623 - June 2014 (5 comments)

FoundationDB vs. Jepsen - https://news.ycombinator.com/item?id=7467037 - March 2014 (23 comments)

Google F1 vs. FoundationDB's SQL Layer - https://news.ycombinator.com/item?id=6718282 - Nov 2013 (12 comments)

Which modern databases support ACID transactions? - https://news.ycombinator.com/item?id=6636541 - Oct 2013 (54 comments)

FoundationDB 1.0 - https://news.ycombinator.com/item?id=6244631 - Aug 2013 (6 comments)

FoundationDB marries NoSQL & SQL with Akiban acquisition - https://news.ycombinator.com/item?id=6057440 - July 2013 (3 comments)

FoundationDB Fault Tolerance Demo Video - https://news.ycombinator.com/item?id=5739721 - May 2013 (15 comments)

A NoSQL Database with ACID Transactions - https://news.ycombinator.com/item?id=5318307 - March 2013 (55 comments)

FoundationDB — Not Your Standard NoSQL Database - https://news.ycombinator.com/item?id=4503400 - Sept 2012 (17 comments)

FoundationDB: A new generation of NoSQL database - https://news.ycombinator.com/item?id=4294719 - July 2012 (73 comments)


I would add this awesome talk which never got much traction (outside of comments in other threads).

Testing Distributed Systems with Deterministic Simulation [video] - https://news.ycombinator.com/item?id=8345030 - Sept 20, 2014 (6 comments)


Ah thanks—I remember that talk! It was instantly one of my favorites.

I've added that thread and also https://news.ycombinator.com/item?id=23289593 to the list above.


This is one of my favourite talks, and I think it's a great example of the whole "the future is here but not evenly distributed" thing.

I just can't imagine tools like this not winning out on a long enough timeframe, particularly with compiler support.


I keep this video bookmarked and share it with new SDETs - it's downright inspirational how far he takes his testing, legend!


I heard they take it further here https://antithesis.com/


FoundationDB also powers CloudKit


and snowflake


Is there a managed version of Foundation DB?


Not as far as I know. The primary users & contributors are Apple and Snowflake. It was only made OSS a few years ago, and there were already a number of comparable alternatives (scalable, consistent OLTP like Cockroach, Yugabyte, Fauna) with companies behind them.


For anyone with experience with FoundationDB: how does it compare to CockroachDB?


FoundationDB is a key/value store, well-known for being fast, scalable and incredibly reliable. Unlike many competitors, it has passed many distributed database correctness tests. Here's a good HN thread from when Apple open-sourced it a few years ago: https://news.ycombinator.com/item?id=16877395

CockroachDB implements a Postgres-compatible SQL layer on top of its own key/value store called Pebble. This is similar to TiDB, which implements MySQL compatibility on top of its TiKV key/value store.

The latter databases offer easier programming with SQL interfaces, but since you can model just about anything on top of a key/value store, there are already "layers" that implement similar abstractions on top of FDB, like the Record Layer [1], Document Layer [2], and even a SQL layer [3] at one point.

1) https://www.foundationdb.org/blog/announcing-record-layer/ 2) https://foundationdb.github.io/fdb-document-layer/ 3) https://github.com/jaytaylor/sql-layer


There is one caveat I would add. While you can always build just about any kind of database or data model on top of a logical KV store, the scalability and performance are almost always going to be qualitatively worse for queries than using a storage engine purpose-built for the intended type of database or data model. It is simpler to implement, which is why people do it, but you give up a lot of selectivity and locality on the query side.

The target design case for KV stores is when all your queries are essentially trivial. Layering SQL on top of it implies the expected use case is complex queries of the type where a KV store won't perform well. Good for a compatibility interface, but not performant.


>a logical KV store

I've always considered KV to be an API. Under the hood you're still using data structures managing fixed-size blocks and byte-offsets, such as LSM trees, and using binary searches to minimize disk reads (e.g. index files of every 128th 64k key block).

If you're building something more robust on top of a KV store, I'd be under the impression you're not doing so using the logical API but leveraging the underlying data structures with algorithms better suited to the task. An LSM tree is pretty decent at random seek/scan, and implementing something like JOINs is less about KV constraints as it is LSMT constraints.

For example, FoundationDB doesn't use SQLite directly but rather its B-tree implementation.
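
On the "index every Nth key block" idea above, a toy sketch (numbers are illustrative) of how a sparse in-memory index narrows a lookup to a single block:

    import bisect

    def build_sparse_index(sorted_keys, every=128):
        # Keep only every Nth key; each sample marks the first key of a block.
        return sorted_keys[::every]

    def block_for(sparse_index, key, every=128):
        # Binary-search the in-memory samples to find the single block that
        # could contain `key`, so at most one block has to be read from disk.
        i = max(bisect.bisect_right(sparse_index, key) - 1, 0)
        return i * every                      # offset of the candidate block

    keys = [b'k%06d' % i for i in range(10_000)]
    idx = build_sparse_index(keys)
    assert block_for(idx, b'k000300') == 256  # key 300 lives in the block starting at key 256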


I intentionally added the word "logical" because I think it is important. All database engines are KV at the physical page/block I/O level but they almost never expose that to the user, and there is no requirement to limit the expression to what the lower level can express.

The primary limitation of KV as an API is that it can't directly express complex but elementary relationships between keys e.g. intersections when your keys are constraint types. Or graph traversal relationships, if efficiency on real hardware matters at all. Most complex databases, like SQL, are implicitly searching across relationships like this constantly. Not being able to express these types of relationships as first-class citizens is computationally expensive.

Putting a KV store under every database is a recent fad. Databases classically had more expressive but still inadequate storage architectures (e.g. first-class operations on constraint types were terrible there too), so this is a step backward in most ways -- but I understand why expediency incentivizes that approach. Kind of like discarding joins to achieve scale-out -- joins are critical operations for many things but scaling out joins requires very challenging computer science. On the flip side, a lot of the cutting edge research has been focused on jamming as much expressiveness in a singular index-organized store as possible, in as little space as possible. This has proven to be very effective, albeit more difficult to understand.

LSM trees are a good example of an incremental hack to address limitations in more traditional approaches while not solving the underlying fundamental issues with the way the indexing structures are constructed. LSM tree write throughput is okay but not great, empirically. Similarly, the query performance is okay but not great, for reasons that are well-understood from other contexts in database engine design. You can use it to solve a specific problem but it is unclear why it should be the solution to a general problem.

KV stores, LSM trees, etc are stuck in a suboptimal local minimum.


Could you elaborate on this argument? For something like a graph, I didn't quite follow why a KV API is problematic or how you'd do it better. If you want to map node locality to KV locality you can certainly do that.


The core design problem for graph databases is data locality during join recursion. If you shard your data model by key, the cross-shard join operation asymptotically converges on a Cartesian product for each join iteration i.e. pathological data locality. This is why most graph databases have such poor scalability in practice.

A less intuitive approach is to shard the data model based on edge similarity measures, such that a specific key does not map to a single shard. While this obviously has poorer locality for simple key lookup, cross-shard join operations -- the most expensive operation and what actually matters -- only involve a tightly bounded number of shards and therefore have much better locality. While this was originally developed to make the problem scale on supercomputers (at IBM Research AFAIK), it is entirely amenable to use in graph databases.

At the storage engine level, first-class support for indexing by similarity measures has different design requirements and tradeoffs than simple key lookup or ordered-tree indexing. While you could make it work on a KV store, the impedance mismatch would incur a significant performance drag. This cuts both ways; storage engines optimized for use with similarity measure designs are going to offer poor performance if you try to put b-trees or LSM-trees on top of them.


I think of FoundationDB as a low level persistence layer that you can use to build a fully functional database on top with more complex querying capability. It provides the core set of semantics such as data replication, globally ordered transactions and horizontal scalability - that are all complex to build in their own right.

If you look at CRDB or TiKV design, under the hood they are also KV-stores with SQL layered on top.


Very curious, what’s an alternative implementation strategy or more appropriate mental model?

My mental model (I am not a database engineer) is that every SQL database is fundamentally backed by a key value store. For Postgres, each value is a row of a relation, the key is the relation’s key, and the physical storage is the table file. Postgres builds b-trees on top to make it fast, but the b-tree itself is a relation, a key value store that Postgres works hard to keep sorted.


Not expert, but as enthusiast I have read some about this (for https://tablam.org).

KV is truly ill-suited for this.

Is very easy to see if you put the data layout:

    --data
    pk  city    country
    1   miami   USA
    2   bogota  Colombia

    --As Kv (naive):

    pk1:    1
    pk2:    2
    city1:  miami
    city2:  bogota

    --As btree, with "value" being N (as you say):

    Key    Value

    1      1    miami   USA
    2      2    bogota  Colombia

    --As paged btree

    Page1
        Low: 1 High:2   
            //think this is a block
            1   miami   USA
            2   bogota  Colombia

    --As columnar:

    pk:     1       2
    city:   miami   bogota

    --As PAX (hybrid columnar/row)

    Page1
        Low: 1 High:2   
            //think this is a block
            pk:     1       2
            city:   miami   bogota


Many data models don't have obvious singular keys. The critical search relationships cannot be reduced to an order relationship on a single column a priori. Much of the most interesting research in storage models is around increasingly efficient ways of indexing complex information theoretic features across multiple columns in a single structure, with an underlying storage implementation to match.

At some level, every database contains a key-value store. For performance reasons, the hardware always has to be treated this way. Databases work at the level of blocks/pages, but those abstractions are usually hidden from users with a lot of clever logic in the middle that is more opinionated to enable optimization. That doesn't change.

An interesting and important property of search data structures is that, at the limit, a single index that can optimally satisfy all possible queries is equivalent to general AI. It is also completely intractable. Fortunately real-world queries tend to be much more limited in nature. A corollary is that the distinction between indexing, storage, and scheduling in databases is a fiction -- useful for making some things simple but not necessary in any database. In essence, the practice of treating indexing, storage, and scheduling as discrete functions in a database is the opposite extreme. There are a vast number of possible implementations between these two extremes with better properties than either in practical real-world databases.

As a general design principle for scalability and performance reasons, you want to organize your data model around a single indexing mechanic. Consequently, it is critical to maximize the expressiveness and efficiency of any particular indexing mechanic. At the limit, with a good algorithm, it is equivalent to having an index for each column, with the ability to efficiently search more features of the data and without the overhead of actually having an index for each column.

I think we are entering a new golden age of database technology where the boundaries between elements we treated as discrete are much fuzzier.


Any papers you can recommend that go into this in more detail?


i’m also not a db engineer, but i think this is true-ish. however building and maintaining those index tables is hard and probably prone to issues if you can’t update multiple as part of the same transaction.

the other major thing you’d miss is smarter joins. the distributed databases do a lot of work to be able to push down predicates as far down as possible to the physical nodes storing the data.

there’s probably more as well.


> probably prone to issues if you can’t update multiple as part of the same transaction

IIRC one of FoundationDB's features is that it does support such transactions, so you can easily implement indexing on top of it.


Yes, agreed, although a lot of storage modalities naturally map to key/value in some way.

Ironically FDB uses SQLite for the storage servers. Abstractions all the way down.


though as with things like CAP theorem, there are always clever ways to to place a given implementation at various points along the spectrum, rather than just at one end or the other.


Adding clarifications to that, Pebble is a local K/V library akin to RocksDB/LevelDB and provides no services or distributed capabilities. Comparing the two does not make sense.

On the other hand, TiKV is more comparable to the internal distributed transactional K/V layer of CockroachDB that its SQL layer is coupled to. That of course is not a usable external interface. You could utilize CockroachDB like a K/V store with a simple table to forgo the SQL functionality and planning (e.g. only primary col index, only key/value column, so on), but I am not sure what the practicality of that is as I have not kept up with CockroachDB development.

(disclaimer: I interned at Cockroach several years back)


The comparison is about providing a relational data model on top of a key/value store.

Yugabyte also does something similar on top of its underlying DocDB document store.


An important difference between CockroachDB and TiDB is that the former has an embedded key-value store; everything runs in the same process.

TiDB is split into multiple parts: TiDB proper (stateless SQL/query plan layer), TiKV (the key-value store), and PD (placement driver, the main control plane and metadata layer). TiKV also isn't a pure key-value store, as it handles some particular query operations that can be "pushed down" from TiDB to reduce network round trips.

FoundationDB's role is similar to TiKV's. It's a pure key-value store with replication built in.


It's actually not that fast, since it has multiple layers and that implies overhead. It's about 4x as slow as MySQL.

Personally I prefer CockroachDB: Postgres compatible and easy to scale.


What is better, FoundationDB, CockroachDB, or Hypercore?


One of these is not like the others.


I've never used CockroachDB, but FoundationDB has been an absolute dream to work with. There are a few downsides to using it:

  - There is (currently) no hosted solution you can use, so you have to run it yourself.
  - There is no great "layer" that provides a nice document/SQL layer so you kind of have to work with the raw sorted KV abstraction.
  - You have to structure your workload to avoid a high rate of conflicts on individual keys (a common mitigation is sketched at the end of this comment).
  - It won't scale to extremely high write volume workloads like Cassandra/Scylla will.
That said, in exchange you get:

  - Interactive linearizable transactions (highest consistency level possible) over arbitrary key ranges (none of this single partition transactions nonsense) with a guarantee that FDB will basically never violate this property. Jepsen refuses to test FDB because FDB's simulation testing is much more rigorous than anything Jepsen does, but AFAIK the most recent releases of Cockroach/Mongo/Cassandra/Fauna/etc have never fully passed a Jepsen test.

  - Automatic range partitioning. So unlike Cassandra you don't have to pick a static partitioning key yourself, FDB just automatically detects and splits ranges for you based on storage volume and write throughput.

  - A strongly consistent store you can actually use in anger and that you don't have to baby. We've never managed to actually break a FoundationDB cluster, the worst case scenario we've ever seen is that we slam it with too many writes/reads and it starts to apply backpressure by increasing latency.

  - Write/Read throughput scales ~linearly as nodes are added.

  - Built in backup/restore from blob storage.

  - Strongly consistent disaster recovery. This one is amazing and it's hard to go back to other databases that don't support it, but basically you can run 2 FDB clusters in DR mode where one is the leader (accepting writes + reads) and the other is passively replicating from it. You can then fail over from the primary FDB to the secondary *consistently* in such a way that the two clusters coordinate to lock the primary and then unlock the secondary so that your application is guaranteed to not see any inconsistency or data loss with no participation from the application required (except teaching it how to talk to 2 different FDB clusters at once).
Basically it's lacking some niceties for building traditional applications on top of it, but it is an amazing primitive for building rock solid distributed systems and will almost never let you down once you've developed some operational experience with it. It's the closest thing to "magic" that I've ever worked with in the distributed storage space.
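
On the conflict point in the downsides above: one common way to handle hot keys is FDB's atomic operations, which never read the key and so don't conflict with each other. A minimal sketch with the Python binding (the key name is made up):

    import struct
    import fdb

    fdb.api_version(630)
    db = fdb.open()

    @fdb.transactional
    def record_hit(tr, counter_key):
        # Atomic add: the increment is applied at commit time without reading
        # the key, so concurrent increments of the same hot key don't conflict.
        tr.add(counter_key, struct.pack('<q', 1))

    record_hit(db, b'stats/page_views')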


to the folks asking why you'd use FDB, i'd highlight the main points as:

> Interactive linearizable transactions (highest consistency level possible) over arbitrary key ranges (none of this single partition transactions nonsense) with a guarantee that FDB will basically never violate this property.

the transaction guarantees on FDB are just the best there is in any product you can get for free. not to fault the rest, FDB just went past what ~98% of folks will ever need. but if you're in that last 2%, god speed trying to get anything else to offer you the same semantics.

> A strongly consistent store you can actually use in anger and that you don't have to baby.

i've seen some folks trying to run a cluster without proper monitoring, who were nearly driving the poor thing into the ground, with hosts dying, and having so much pressure put on them they failed out, but it never lost anything. i don't think it ever even became unavailable.

> Automatic range partitioning. So unlike Cassandra you don't have to pick a static partitioning key yourself, FDB just automatically detects and split ranges for you based on storage volume and write throughput

eh, in defense of everything else, FDB is a key value store, and the keys are byte strings, so the situation is a bit easier for them.


Sure, I'm not really knocking Cassandra. I'm just pointing out that Cassandra's approach favors raw write throughput over everything else, which is great for some use cases, but FDB's approach is much more forgiving and a better fit for many use cases, albeit at the cost of reduced throughput.


FoundationDB is just a key value store, so it's not comparable.

There is a relational layer you can run, though, called the FoundationDB Record Layer. But this doesn't support SQL AFAICT. The examples on Github just show a query builder interface in Java [0].

There does seem to be a Lucene interface but it doesn't seem to be used [1].

[0] https://github.com/FoundationDB/fdb-record-layer/blob/main/e...

[1] https://forums.foundationdb.org/t/lucene-layer-on-foundation...


It's comparable based off of the fundamentals and I have the same question.

They both use LSM-based KV stores, do range queries, support ACID transactions, and handle resharding and clustering for you.

They're both based off of Spanner fundamentals. They both actually have an SQL layer - foundationdb's is the record layer. Just because one has a primary SQL interface doesn't mean we can't compare.

I'd really like to see a comparison of the fundamentals, including hotspot management, resharding, read/write performance, etc., and have been looking for this.


> They both actually have an SQL layer - foundationdb's is the record layer.

Can you show me their SQL layer? I was looking for it.


https://github.com/FoundationDB/fdb-record-layer is the SQL/ORM-like interface. It handles things in a DB-like way - joins, indexes. I misspoke when I said SQL regarding this - sorry.

There's a go and rust library floating around that's not as good. I've tried em :)


The readme says

> Queries - The Record Layer does not provide a query language, however it provides query APIs with the ability to scan, filter, and sort across one or more record types, and a query planner capable of automatic selection of indexes.

So I'm not sure how you can call that a SQL layer when SQL is a query language and FDB Record Layer says it doesn't have a query language?


Yeah, I did say "I misspoke when I said SQL regarding this - sorry.". I should have said "SQL-like", or at least relation-like.

There were a couple SQL-ish things I found in other languages and got them mixed up. Sorry about that :)


I don’t think it ever got open sourced, although I do remember reading about it years back.


FWIW I don't think FoundationDB is based off Spanner at all (it might even predate it? I think FDB was started in 2008 or something). Its transaction handling system is completely different (and unique from what I can tell) and not dependent on well behaved clocks.



Neat, here [0] is a 3rd party one at the end of the thread but this isn't the one the thread is about. Doesn't seem like that one is public.

[0] https://github.com/rustdesk/opentick


It was open source and if you Google a bit you can find forks with the relevant code snapshots, here's one: https://github.com/AydinSakar/sql-layer


I was just looking into this, and found this really interesting post about foundationdb and sql.

https://www.voltactivedata.com/blog/2015/04/foundationdbs-le...

Also this, excellent comparison of distributed database consistency methods, mentions cockroach, but not foundation. http://dbmsmusings.blogspot.com/2018/09/newsql-database-syst...

This brings up the question, what method/protocol does foundation use for distributed consistency?


"What method/protocol does foundation use for distributed consistency?"

This is a really tough question to answer succinctly. It took me a long time to wrap my head around it, but the TLDR is:

  1. It uses a quorum of Coordinator nodes to elect another node (via Paxos) whose job it is to then assign roles to other nodes.
  2. Some of those roles are replicated (Like TLogs) and some of those roles are singletons (like the thing that advances the clock and hands out versionstamps).
  3. If any node in the "transaction subsystem" ever fails, the cluster goes through a recovery where it becomes unavailable for ~3s during which it reassigns the roles and keeps chugging.
In terms of what it looks like for a transaction to actually commit, that's another long conversation. It's hard to draw analogies to existing systems because it really is kind of its own thing.


Awesome answer, thanks! Foundation seems like magic almost, from just choosing very smart trade offs and co-ordination.


Yeah that’s basically it. It really is a wonderful piece of free software that was (and still is) way ahead of its time in many ways.

They basically made a ton of good architectural decisions really early on and then just nailed down the details super tight over many years.

Their technique of running cluster-wide recoveries to handle failures is a great example. Most distributed systems engineers find the idea of designing a database that becomes unavailable (even for a few seconds) when a single node is lost completely appalling, but it turns out this small trade off leads to massive simplifications elsewhere and the resulting system is net/net much more reliable than anything else I’ve ever seen.


Is there a FoundationDB Helm chart?


I understand nothing of helm, but https://github.com/FoundationDB/fdb-kubernetes-operator/pull... seems to suggest yes?


FTA:

> FoundationDB, a transactional key-value store that supports multi-key strictly serializable transactions across its entire key-space. FoundationDB (FDB, for short) is opensource. The paper says that: "FDB is the underpinning of cloud infrastructure at Apple, Snowflake and other companies, due to its consistency, robustness and availability for storing user data, system metadata and configuration, and other critical information."

> The main idea in FDB is to decouple transaction processing from logging and storage. Such an unbundled architecture enables the separation and horizontal scaling of both read and write handling.


FDB's main selling point was reliability, but with the proliferation of good raft/paxos implementations I think that's not as much of an advantage anymore.


There's a lot more to reliability than just the consensus algo and implementation.


The storage engine can be taken off the shelf: LMDB, RocksDB, etc. These are the key parts.


There is more to a database than a raft implementation. A raft implementation would provide data replication between members and leader election. That is just one piece in a distributed database architecture. You then need a storage engine to persist the data, and a query engine to query the data.

To be able to scale, the database would most likely be composed of several raft clusters, which brings a couple of other challenges:

  - You need a sharding layer so that queries can be routed to the right cluster (sketched below)
  - You need an orchestrator to be able to manage these clusters

It gets trickier as you try to run transactions across multiple raft clusters. You end up needing to implement 2PC.

FoundationDB solves all of the above except for the query engine part (given it is a KV).
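
To make the sharding-layer point concrete, a toy sketch (the group names and split points are made up) of routing keys to the raft group that owns their range; any operation spanning two groups is where the 2PC pain starts:

    import bisect

    class ShardRouter:
        # Toy range-sharding layer: each raft group owns a contiguous key range.
        def __init__(self, split_points, groups):
            assert len(groups) == len(split_points) + 1
            self.split_points = split_points      # sorted boundary keys
            self.groups = groups                  # one raft cluster handle per range

        def group_for(self, key):
            return self.groups[bisect.bisect_right(self.split_points, key)]

    router = ShardRouter([b'g', b'p'], ['raft-0', 'raft-1', 'raft-2'])
    assert router.group_for(b'apple') == 'raft-0'
    assert router.group_for(b'zebra') == 'raft-2'
    # A write touching b'apple' and b'zebra' spans raft-0 and raft-2,
    # which is exactly where a 2PC-style commit across groups is needed.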


Storage engines can be taken off the shelf. My point is that multiple products solve this problem and have passed Jepsen tests, so FDB's proposition isn't as unique anymore.


You say that like it’s trivial to build a reliable OLTP key value store once you have a correct raft implementation. Never mind real world performance or reliability.


The hairiest bits like consensus and storage engines can however be taken off the shelf.


Is there a proven BFT version of that? Raft ain't it


Bft?



