FoundationDB excites a lot of people because it's an extremely scalable and extremely reliable distributed database that supports ACID transactions, and which is both open-source and has Apple standing behind it. And yeah, all of that is pretty nice.
But arguably the real power comes from the fact that it exposes a relatively low-level data model that can then be wrapped in one or more stateless "layers". All of these layers write to the same storage substrate, so you can have your document database, your SQL database, your time-series database, your consensus/coordination store, your distributed task queue, etc., etc., but you're only actually operating one stateful system. Your SREs will thank you.
Writing these layers to be scalable and high-performance can be challenging, but it looks like Apple is actively doing it and willing to release the results to the rest of us. This also suggests that their previous open-sourcing of the MongoDB-compatible document layer wasn't a one-off fluke. All of this is very good news for everybody who needs to run databases in the real world.
Full disclosure: I worked on FoundationDB a long, long time ago.
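The "layers" idea is easier to see in code. Here's a toy Python sketch (this is not FDB's real API; the classes and key names are all invented) of two stateless layers sharing one key-value substrate by namespacing their keys:

```python
# Toy sketch: two "layers" sharing one ordered key-value substrate.
# A plain dict stands in for FoundationDB here; real layers would issue
# transactional reads/writes through the FDB client API instead.

class Substrate:
    """Stand-in for the shared ordered key-value store."""
    def __init__(self):
        self.kv = {}

    def set(self, key: bytes, value: bytes):
        self.kv[key] = value

    def get(self, key: bytes):
        return self.kv.get(key)

class DocLayer:
    """Stateless document layer: namespaces its keys under b'doc/'."""
    def __init__(self, substrate):
        self.s = substrate

    def put(self, doc_id: str, body: str):
        self.s.set(b"doc/" + doc_id.encode(), body.encode())

    def fetch(self, doc_id: str):
        v = self.s.get(b"doc/" + doc_id.encode())
        return v.decode() if v is not None else None

class QueueLayer:
    """Stateless queue layer: keys under b'queue/', ordered by a counter."""
    def __init__(self, substrate):
        self.s = substrate
        self.n = 0

    def push(self, item: str):
        self.s.set(b"queue/%08d" % self.n, item.encode())
        self.n += 1

substrate = Substrate()
docs, queue = DocLayer(substrate), QueueLayer(substrate)
docs.put("42", "hello")
queue.push("job-1")
print(docs.fetch("42"))      # both layers live in the one store
print(sorted(substrate.kv))  # key prefixes keep them disjoint
```

The point is that only `Substrate` holds state; the layers are plain code and can be scaled, replaced, or mixed freely on top of the one stateful system.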
"Together, the Record Layer and FoundationDB form the backbone of Apple's CloudKit. We wrote a paper describing how we built the Record Layer to run at massive scale and how CloudKit uses it."
I think this is the first time that little detail has been publicly disclosed.
> your SQL database
They mention "a declarative query API", but as far as I can tell that's not actually SQL, right? So migrating from another relational db would require learning a new query language?
The client/server distinction isn't terribly strong in the FDB world. The FDB client is unusual in that it's a (stateless) part of the FDB cluster itself, so you can either embed it in the application or build an RPC service around it. The Record Layer takes the same approach: it's just a Java library, so you could either embed it in the client application or build some kind of wire protocol for accessing it. You could therefore have an embedded SQL layer, like SQLite or H2, with no additional server beyond the cluster, or a separate SQL layer network server that acted more like Postgres or MySQL.
The Record Layer was designed for use cases that don't need a SQL interface, so we focused on building the layer itself. That said, the Record Layer exposes a ton of extension points so there's a fluid boundary between what needs to live in its main codebase and what can be implemented on top. There are almost certainly enough extension points to implement a SQL interface as another layer on top of the Record Layer. For example, you could add totally new types of indexes outside of the Record Layer's codebase, if that were needed for SQL support. It's still a lot of work, especially on the query optimizer. Perhaps the community is up to that challenge. :-)
"In the future it is possible that the Record Layer may develop a formal query language, but it is unlikely that such a language would closely resemble the SQL standard." 
* Full table scans on large tables in a read/write transaction would fail due to the 5-second transaction duration limit. This is obviously a bad idea regardless of database, but if you wanted to support SQL you would have to allow it.
* Sorting and joins require (in the worst case) as much memory as your data size, or the ability to spill to disk. This isn't FDB-specific, but offering this kind of feature in a scale-out, high-concurrency model would be tough for QoS.
There are plenty more, but those are the big ones I ran across during design. The actual SQL interface part is not hard, but you would have to disallow many useful constructs people normally expect to work.
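The usual workaround for the first limitation is to split a big scan across many short transactions, each resuming from a continuation key. A toy Python sketch of that pattern (an in-memory sorted list stands in for the ordered keyspace; the key names and batch sizes are invented):

```python
# Sketch: breaking a large scan into many short "transactions", each
# resuming from a continuation key, so no single transaction comes
# anywhere near FDB's 5-second limit.
import bisect

keys = [b"row/%06d" % i for i in range(25)]  # pretend table

def scan_chunk(start_after, limit):
    """One short transaction: read up to `limit` keys after `start_after`."""
    i = bisect.bisect_right(keys, start_after) if start_after else 0
    return keys[i:i + limit]

def full_scan(batch=10):
    seen, cursor = [], None
    while True:
        chunk = scan_chunk(cursor, batch)   # a new transaction each loop
        if not chunk:
            return seen
        seen.extend(chunk)
        cursor = chunk[-1]                  # continuation point

rows = full_scan()
print(len(rows))  # 25, gathered across three short scans
```

Note that the result is stitched together from several transactions, so it is not a single consistent snapshot, which is exactly why a naive SQL full scan doesn't map cleanly onto FDB.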
So far behind it that they already shut it down once.
Have you had a negative experience with it you can share?
They've basically put them to work reimplementing it in Swift. It's open and out there, but not a lot of people are paying attention. While it's still early days, I think there may be a year where, suddenly, Swift on the server is a super serious thing and all this work they've been doing on little old CloudKit kind of takes over the world.
Just a fun prediction, but it wouldn't be the first time Apple pulled something like that.
I do like that Swift's non-tracing, reference-counted memory model is well suited for server apps. Rust is cool too, but maybe Swift would be a little friendlier and thus better suited to inherit Java's mantle. I mean, can you imagine if Apple is slowly building up Swift to overtake Java on the server? That it's one of their long-game master plans? I know that sounds completely crazy... it just might work. They do run one of the biggest data center networks in the world, so they have a pretty good testbed and can justify a hefty R&D budget.
Initially at Wavefront we were using HBase for telemetry, Zookeeper for the service mesh and MySQL Cluster for entity metadata. All that was moved on top of FDB with 3 different layers that we developed.
I'm excited that this kind of database will now be available more broadly, with the confidence that comes from CloudKit using the same technology. To date, implementing something like this was basically a DIY project.
1) Operationally, HBase is a nightmare whereas FDB is extremely easy to operate.
2) HBase doesn't support cross-row transactions natively, and the extensions that add them aren't efficient.
3) GC pauses make HBase performance unpredictable, whereas FDB is written in C++.
4) HBase depends on ZooKeeper, which is operationally painful to support and which we were also replacing with FDB.
I don't think I will ever again use anything from the Hadoop ecosystem if I can get away with it.
The paper is rather interesting: https://www.foundationdb.org/files/record-layer-paper.pdf
Many Kubernetes scaling issues are etcd-related.
RethinkDB is dead-ish, and CockroachDB is treating their changefeeds as an enterprise feature that requires a Kafka instance to stream to :(
Short overview, and maybe good to know it's becoming part of the CNCF: https://www.cncf.io/blog/2018/08/28/cncf-to-host-tikv-in-the-sandbox/
Haven't worked with it myself yet, but maybe others can share their experience?
There have also been some HN threads in the past, about TiDB at least.
1. TiKV uses a Multi-Raft architecture; I think Raft provides better high availability.
2. TiKV's transaction model is inspired by Google Percolator; it's a classical optimistic 2PC transaction model with MVCC support. I'm not an expert on FDB, but I think different transaction models fit different application scenarios. TiKV's transaction model works well when your workload is mainly small transactions with a low conflict rate.
3. TiDB is a full-featured SQL layer on top of TiKV that aims to provide a MySQL-compatible solution. Most TiDB users migrated from MySQL, so the focus of TiDB is compatibility with these legacy MySQL-based applications: for example, reading the MySQL binlog and replaying it on TiDB in real time so that TiDB becomes an active MySQL replica, or supporting complex SQL queries like distributed joins and group-bys. Building a full-featured SQL optimizer is a huge project.
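The optimistic model mentioned in point 2 can be illustrated with a toy validation-at-commit sketch (Python; everything here is invented, and real Percolator adds locking and two-phase commit on top of this basic idea):

```python
# Toy sketch of optimistic concurrency control: a transaction records the
# version of every key it reads and commits only if none of those keys
# changed in the meantime. Percolator/TiKV layer locks and 2PC on top;
# this shows just the conflict-validation step.

store = {}  # key -> (value, version)

class Txn:
    def __init__(self):
        self.reads = {}   # key -> version observed
        self.writes = {}  # key -> new value

    def get(self, key):
        value, version = store.get(key, (None, 0))
        self.reads[key] = version
        return value

    def set(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Validate: abort if any key we read has since been rewritten.
        for key, seen in self.reads.items():
            if store.get(key, (None, 0))[1] != seen:
                return False
        for key, value in self.writes.items():
            _, version = store.get(key, (None, 0))
            store[key] = (value, version + 1)
        return True

t1, t2 = Txn(), Txn()
t1.get("balance"); t2.get("balance")   # both observe version 0
t1.set("balance", 100)
t2.set("balance", 200)
print(t1.commit())  # True: first committer wins
print(t2.commit())  # False: t2's read is stale, so it must retry
```

This is why the model suits low-conflict workloads: under heavy contention, most transactions lose the validation race and have to retry.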
There are some case studies:
There are some quick-start documents you can start with:
I'm not sure FoundationDB would be able to do the fully consistent list required in the current model, though (someone above mentioned fully consistent table scans not being possible in FDB). In etcd and CockroachDB you can leverage MVCC and get a consistent scan over older history. But if FDB has that, it's pretty close (I looked into this before FDB was acquired the first time).
Edit: unlikely = in practice I haven't seen anything except large range-lock contention (which is why we added chunking to the API)
I am not sure if I will use the record layer (I've been planning to write "my layer" myself), but it will definitely be an interesting thing to look at.
To be honest, I don't like anything about Cassandra. Beginning with the naming: back when I was trying to learn about Cassandra, I couldn't get past the obscure and bizarre naming (super-columns?). When I dealt with systems using it, I never quite understood how you can keep saying that "the later timestamp wins" and speak of consistency with a straight face: in a distributed system, there is no such thing as a "later timestamp". Or speak of transactions which aren't really transactions at all.
Then I read the Jepsen reports about Cassandra. Yes, Cassandra has made progress since then, but still.
I think of Cassandra as an outdated piece of technology at this point: we can (and do) build better distributed databases today, with better consistency guarantees, and proper transactions in case of FoundationDB. Cassandra was designed for a specific use case and then outgrew its initial design, because there was nothing else at the time. But I see no reason to stick with it any longer.
Even now when you need massive multi-region scalability there is little to choose from — if you want it to be open-source, there's pretty much only FoundationDB left.
Cassandra/Scylla are, as far as I know, the only open-source key-value stores that scale linearly by simply adding nodes, even in huge, geo-distributed settings, but they are ultimately AP systems. And Scylla just has absurd performance compared to Cassandra or FoundationDB. You just have to know what you're getting into. (But yes, ACID transactions are a good model for developers, and FDB's linearizable transactions and high scalability truly make it an obvious choice among CP systems, if you ask me.)
Since most of what I do (or consult with) does not need massive performance, I'd rather pick databases with compromises favoring consistency and correctness. This is why I like what I see in FoundationDB so far.
The rest either have a single cluster that can be stretched across regions (usually with bad results or incredibly high latency) or offer multi-region support as an enterprise feature using complicated log-shipping to apply updates everywhere.
TLDR: Google/FB/etc. went the other way, using HBase and Bigtable, which are strongly consistent.
(Saw it in your profile)
But the current storage engine is not as well optimized as it could be.
It does have scalability limits, although they’re not relevant for 99.9% of use cases.
Upgrading a cluster to a new non-patch version will require a small (seconds) amount of downtime. A mitigating factor there is upgrading your client doesn’t have that limit, which is where all the interesting stuff is.
The minimum latency for a transaction is relatively high compared to systems which acknowledge writes before syncing to disk or only after syncing to a single disk.
I wouldn't say it doesn’t solve “use cases”. Rather, if you can live within the limitations (which means you need to know what they are), you can reduce the complexity and cost of designing a system for your use case by a lot.
Check out my talk from the FDB Summit for an example: https://youtu.be/SKcF3HPnYqg
Yes, the Record Layer helps you define and index into _hierarchies_ of entities, but I suspect it doesn't have an answer for other access patterns (e.g. producing "report" views that relate or aggregate non-hierarchical data).
You could retroactively construct a custom view _after the fact_, but only if you can do so within 5 seconds. And (if you want continued access to that view) you'll need to ensure that that view is maintained thereafter (you would have to define your own layer -- it cannot be entrusted to application logic unless you can atomically switch to a new version of your stack). Maintaining such a view is made more difficult if your data allows updates/deletes.
The same caveat affects schema migrations. You would need to be able to fit the migration into a 5 second transaction (or tell your applications to stop modifying the data for a while, and handle it as a series of smaller transactions).
My assessment is that if you have (non-hierarchical) relational use-cases at scale, FDB really requires you to plan your access patterns from day 1. Whereas a typical RDBMS fares far better at satisfying emergent needs. That said: FDB's model is brilliant for document store and key-value use-cases.
The Record Layer does a whole bunch of work to deal with index maintenance (see our docs). Our "index maintainer" abstraction (discussed in the paper) makes maintaining our indexes (including those that are basically materialized views) completely seamless from the user's perspective, even for updates and deletes. We also have a lot of tooling for making schema migrations efficient. For example, schema migrations are performed lazily (when the data is accessed), so they aren't limited by the 5-second transaction limit. If you add/remove/change indexes, they'll be put into a "write-only" mode where they'll keep accepting writes while an "online indexer" builds the index over multiple transactions. We even have fancy logic to automatically adjust the size of the transactions if they start failing due to contention or timeouts!
Basically, the Record Layer solves a lot (but not all) of the pain points that show up when you don't know your access patterns from the beginning. The paper talks a bit about how CloudKit uses some of those features.
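The "automatically adjust the size of the transactions" trick boils down to: commit in batches, shrink the batch when a transaction fails, and grow it back after successes. A toy sketch (Python; the failure condition is simulated and all names and numbers are invented):

```python
# Sketch of adaptive batch sizing for an online index backfill: process
# records in batches, halving the batch size when a batch "fails" (e.g. a
# transaction timeout or conflict) and cautiously growing it back after
# successes. The failure rule here is simulated, not real FDB behavior.

def backfill(records, attempt_batch, max_ok_batch=64):
    """Index `records`; pretend any batch over `max_ok_batch` times out."""
    done, batch = 0, attempt_batch
    batch_sizes = []
    while done < len(records):
        chunk = records[done:done + batch]
        if len(chunk) > max_ok_batch:          # simulated timeout/conflict
            batch = max(1, batch // 2)         # retry with a smaller batch
            continue
        batch_sizes.append(len(chunk))         # "commit" this batch
        done += len(chunk)
        batch = min(batch * 2, attempt_batch)  # grow back toward the target
    return batch_sizes

print(backfill(list(range(300)), attempt_batch=256))
```

The loop converges on whatever batch size the system can actually sustain, without the operator having to tune it by hand.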
 https://foundationdb.github.io/fdb-record-layer/FAQ.html — search for “aggregation”
1. Any reason to write it in Java instead of C, C++, Rust, etc?
2. Any reason to use Protobuf instead of Flatbuffers, Avro, etc?
3. Can FoundationDB be used with Apache Arrow?
Excellent question regarding the choice to use Protocol Buffers. Firstly, as mentioned in the paper released last year, CloudKit uses Protocol Buffers for client-server communication, so there was already expertise around protobuf, which is a good tie breaker when evaluating alternatives. (Here's that paper, by the way: http://www.vldb.org/pvldb/vol11/p540-shraer.pdf)
Secondly, the Record Layer makes heavy use of Protocol Buffer descriptors, which specify the field types and names within protobuf schemata, and of dynamic messages. Descriptors are used internally within the Record Layer to do things like schema validation. (For example, if an index is defined on a specific field, the descriptor can be checked to validate that that field exists in the given record type.) Likewise, dynamic messages make it possible for applications using the Record Layer to load their schema at run time by reading it from storage. The FDBMetaDataStore allows the user to do exactly that (while storing the schema persistently in FoundationDB): https://static.javadoc.io/org.foundationdb/fdb-record-layer-...
No, the Record Layer's data format is not compatible with the Apache Arrow specification.
1. Size of the CloudKit cluster and the number of RecordLayer instances. A ratio would also be enough to get an approx. idea.
2. How are metadata changes involving field data types handled?
3. How are relationships and therefore, foreign keys handled? Are any referential actions like cascading deletes supported?
There are some guidelines regarding field type changes in the schema evolution guide: https://foundationdb.github.io/fdb-record-layer/SchemaEvolut... Most data type changes are incompatible with either Protobuf's serialization format or the FDB Tuple layer's serialization format (which the Record Layer uses for storing secondary indexes and primary keys). The general advice for type changes (if there is existing data in your record stores) is instead to introduce a new field of the new type and deprecate the old one.
After all, it's no mystery that Apple wants to expand their services revenue. Their hardware revenue isn't growing as much as it used to.
Combined with other statements in this thread, I think that may be true. I remember reading once that iMessage used to be served by Cassandra, but now it's served by FDB.
This is all speculation though.
> 8.1 New CloudKit Capabilities
> CloudKit was initially implemented using Cassandra as the underlying storage engine.
So it seems this is what happened, for CloudKit at least.
If that's the case, how does FDB compare to ScyllaDB now that they both have secondary indexes?
Secondary indexing is a core feature of the Record Layer, though! It includes a variety of secondary index types. The simplest are implemented using essentially the same strategy as ryanworl outlines (with more details on how that index works available in the key-value store documentation: https://apple.github.io/foundationdb/simple-indexes.html). And index updates are all entirely transactional (i.e., as the index update happens in the same transaction as record insertion, they are always consistent and up-to-date). However, all of that happens behind the scenes. The API presented to the user only asks for what record to save (update or insert), and then the Record Layer updates the appropriate indexes using a user-provided schema. Importantly, the Record Layer also supports handling the various stages of index maintenance (e.g., deleting an index's data after removing it from the schema or filling in data from existing records after an index is added). More can be found within the Record Layer overview: https://foundationdb.github.io/fdb-record-layer/Overview.htm...
In a transaction, you write a key like “users/1” with a value of “bob” and then write another key like “users/bob/1” with no value. Then you can do a range scan over the prefix “users/bob/“ and find all the primary keys. After that you do individual gets for the keys in the PK index to retrieve the full record if needed.
The comparison between the two is FDB “secondary indexes” are just like anything else in FDB. Namely, you update them in transactions and they are consistent immediately. Scylla does not AFAIK have this feature.
It would work with FoundationDB, RocksDB, etc. I actually learned these techniques when I interned at FoundationDB but have used them the most with other K-V systems.
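The index layout described above is easy to sketch. Here's a toy Python version over a sorted in-memory map standing in for FDB's ordered keyspace (in real FDB both writes would happen inside one transaction, and the helper names here are invented):

```python
# Secondary index as plain keys: the primary record lives under one
# prefix, and an index entry (with an empty value) lives under another.
# A range scan over the index prefix yields the matching primary keys.
import bisect

keys, values = [], {}  # sorted key list + key -> value map

def txn_set(key, value=b""):
    if key not in values:
        bisect.insort(keys, key)
    values[key] = value

def range_scan(prefix):
    """All keys starting with `prefix`, in order."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + b"\xff")
    return keys[lo:hi]

def insert_user(user_id, name):
    # One logical transaction: primary record + secondary index entry.
    txn_set(b"users/%d" % user_id, name)               # primary record
    txn_set(b"users_by_name/%s/%d" % (name, user_id))  # index, empty value

insert_user(1, b"bob")
insert_user(2, b"alice")
insert_user(3, b"bob")

# Find all user ids named "bob" via the index, then fetch the records.
hits = range_scan(b"users_by_name/bob/")
ids = [int(k.rsplit(b"/", 1)[1]) for k in hits]
print(ids)                                     # [1, 3]
print([values[b"users/%d" % i] for i in ids])  # [b'bob', b'bob']
```

Because both writes commit atomically in a real FDB transaction, the index can never disagree with the primary records, which is the consistency property being contrasted with Scylla above.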
So you could build something like PostgreSQL or Elasticsearch on top of FoundationDB.