Hacker News
Apple open-sources FoundationDB (foundationdb.org)
2136 points by spullara on April 19, 2018 | 441 comments

This is INCREDIBLE news! FoundationDB is the greatest piece of software I’ve ever worked on or used, and an amazing primitive for anybody who’s building distributed systems.

The short version is that FDB is a massively scalable and fast transactional distributed database with some of the best testing and fault-tolerance on earth[1]. It’s in widespread production use at Apple and several other major companies.

But the really interesting part is that it provides an extremely efficient and low-level interface for any other system that needs to scalably store consistent state. At FoundationDB (the company) our initial push was to use this to write multiple different database frontends with different data models and query languages (a SQL database, a document database, etc.) which all stored their data in the same underlying system. A customer could then pick whichever one they wanted, or even pick a bunch of them and only have to worry about operating one distributed stateful thing.

But if anything, that’s too modest a vision! It’s trivial to implement the Zookeeper API on top of FoundationDB, so there’s another thing you don’t have to run. How about metadata storage for a distributed filesystem? Perfect use case. How about distributed task queues? Bring it on. How about replacing your Lucene/ElasticSearch index with something that actually scales and works? Great idea!
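A layer of the kind described above can be remarkably small. As a purely illustrative sketch (a sorted in-memory structure stands in for FoundationDB's ordered keyspace; a real layer would do these reads and writes inside a single FDB transaction, and the class and key names here are made up), here is a FIFO task queue encoded as ordered keys:

```python
import bisect

class OrderedKV:
    """Toy stand-in for an ordered key/value store like FDB's keyspace."""
    def __init__(self):
        self._keys = []   # kept sorted lexicographically, like FDB keys
        self._data = {}

    def set(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def first_with_prefix(self, prefix):
        i = bisect.bisect_left(self._keys, prefix)
        if i < len(self._keys) and self._keys[i].startswith(prefix):
            return self._keys[i], self._data[self._keys[i]]
        return None

    def delete(self, key):
        self._keys.remove(key)
        del self._data[key]

class QueueLayer:
    """FIFO queue: each item lives at prefix + 8-byte big-endian sequence
    number, so key order equals insertion order; pop takes the lowest key."""
    def __init__(self, kv, name):
        self.kv, self.prefix, self.seq = kv, name + b"/", 0

    def push(self, item):
        self.kv.set(self.prefix + self.seq.to_bytes(8, "big"), item)
        self.seq += 1

    def pop(self):
        hit = self.kv.first_with_prefix(self.prefix)
        if hit is None:
            return None
        key, value = hit
        self.kv.delete(key)
        return value
```

With FDB underneath, push and pop would each be one transaction, and the sequence number would come from an atomic operation or versionstamp rather than local state.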

And this is why this move is actually genius for Apple too. There are a hundred such layers that could be written, SHOULD be written. But Apple is a focused company, and there’s no reason they should write them all themselves. Each one that the community produces, however, will help Apple to further leverage their investment in FoundationDB. It’s really smart.

I could talk about this system for ages, and am happy to answer questions in this thread. But for now, HUGE congratulations to the FDB team at Apple and HUGE thanks to the executives and other stakeholders who made this happen.

Now I’m going to go think about what layers I want to build…

[1] Yes, yes, we ran Jepsen on it ourselves and found no problems. In fact, our everyday testing was way more brutal than Jepsen, I gave a talk about it here: https://www.youtube.com/watch?v=4fFDFbi3toc

Will said what I wanted to say, but: me too. I'm super happy about this and grateful to the team that made it happen!

(I was one of the co-founders of FoundationDB-the-company and was the architect of the product for a long time. Now that it's open source, I can rejoin the community!)

Another (non-technical) founder here - and I echo everything voidmain just said. We built a product that is unmatched in so many important ways, and it's fantastic that it's available to the world again. It will be exciting to watch a community grow around it - this is a product that can benefit hugely from open-source contributions, such as layers that sit on top of the core KV store.

I wrote a java port of the python counter many moons ago [1]. Will have to resurrect it!

[1] https://github.com/leemorrisdev/foundationcounter

It should probably be pointed out that atomic increment is in most situations a more efficient solution for high contention counters in modern FDB.
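For context: FDB's ADD atomic operation has the server treat the stored value as a little-endian integer and apply the addition itself, so concurrent increments of the same counter don't conflict with each other. A rough sketch of the value semantics in plain Python (the real thing is a mutation issued through the bindings, roughly along the lines of tr.add(key, packed_delta)):

```python
import struct

def atomic_add(value: bytes, delta: int) -> bytes:
    """Sketch of FDB's ADD mutation semantics: the existing value is
    interpreted as a little-endian signed integer (a missing value
    counts as 0), the delta is added, and the result is stored back.
    Because the operation commutes, the server can apply many of them
    concurrently without transaction conflicts."""
    current = struct.unpack("<q", value)[0] if value else 0
    return struct.pack("<q", current + delta)
```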

Ah, I don’t believe that was available last time I used it - I’ll check it out, thanks!

Ditto. So glad to see FoundationDB available again! Both the tech and the company have been missed. (Former FoundationDB Solutions Engineer :) )

Nice to see you are joining back!

I echo what wwilson has said. I work at Snowflake Computing; we're a SQL analytics database in the cloud, and we have been using FoundationDB as our metadata store for over 4 years. It is a truly awesome product and has proven to be rock-solid over this time. It is a core piece in our architecture, and is heavily used by all our services. Some of the layers that wwilson is talking about, we've built them: metadata storage, an object-mapping layer, a lock manager, a notifications system. In conjunction with these layers, FoundationDB has allowed us to build features that are unique to Snowflake. Check out our blog post titled "How FoundationDB powers Snowflake Metadata forward" [1]

Kudos to the FoundationDB team and Apple for open sourcing this wonderful product. We're cheering you all along! And we look forward to contributing to the open source and community.

[1] https://www.snowflake.net/how-foundationdb-powers-snowflake-...

That's impressive! Are you going to contribute your changes to foundationdb back?

Where is the object-mapping layer? Inside the DB in C++, or a service (a different process, etc.) outside of it?

Latter. It's a Java library used by our services.

I am one of the designers of probably the best-known metadata storage engine for a distributed filesystem, HopsFS - www.hops.io. When I looked at FoundationDB before Apple bought you, you supported transactions - great. But we need much more to scale. Can you tell me which of the following you have:

- row-level locks
- partition-pruned index scans
- non-serialized cross-partition transactions (that is, a transaction coordinator per DB node)
- distribution-aware transactions (hints on which TC to start a transaction on)

The relative performance contribution of most of those features (cross-partition transactions weren't evaluated - they're a must) can be seen in our paper at USENIX FAST: https://www.usenix.org/system/files/conference/fast17/fast17...

It's somewhat hard to answer your questions because the architecture (and hence, terminology) of FoundationDB is a little different than I think you are used to. But I will give it a shot.

FoundationDB uses optimistic concurrency, so "conflict ranges" rather than "locks". Each range is a (lexicographic) interval of one or more keys read or written by a transaction. The minimum granularity is a single key.

FoundationDB doesn't have a feature for indexing per se at all. Instead indexes are represented directly in the key/value store and kept consistent with the data through transactions. The scalability of this approach is great, because index queries never have to be broadcast to all nodes, they just go to where the relevant part of the index is stored.
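To make the indexes-as-data pattern concrete, here is a minimal sketch (a plain dict stands in for the transaction's view of the keyspace, and the "user/" / "idx/email/" key layouts are invented for illustration; with the real bindings this body would live inside a transactional function):

```python
def create_user(tr, user_id, email):
    """Write the record and its secondary-index entry in one transaction,
    so the index is always consistent with the data."""
    tr[f"user/{user_id}"] = email
    tr[f"idx/email/{email}"] = user_id   # index entry, same transaction

def user_id_by_email(tr, email):
    """Point lookup through the index; in FDB a prefix range read
    serves prefix queries the same way, touching only the nodes that
    store that part of the index."""
    return tr.get(f"idx/email/{email}")
```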

FoundationDB delivers serializable isolation and external consistency for all transactions. There's nothing particularly special about transactions that are "cross-partition"; because of our approach to indexing and data model design generally we expect the vast majority of transactions to be in that category. So rather than make single-partition transactions fast and everything else very slow, we focused on making the general case as performant as possible.

Transaction coordination is pretty different in FoundationDB than in 2PC-based systems. The job of determining which conflict ranges intersect is done by a set of internal microservices called "resolvers", which partition up the keyspace totally independently of the way it is partitioned for data storage.
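At its core, the resolvers' check is interval intersection over lexicographic key ranges: a committing transaction must abort if any write committed after its read version intersects one of its read ranges. A simplified sketch of that OCC rule (not the actual implementation):

```python
def ranges_overlap(a, b):
    """Half-open key ranges [start, end) overlap iff each begins
    before the other ends (keys compare lexicographically)."""
    return a[0] < b[1] and b[0] < a[1]

def must_abort(read_ranges, writes_since_read_version):
    """A resolver aborts the transaction if any write committed since
    the transaction's read version intersects one of its read
    conflict ranges."""
    return any(ranges_overlap(r, w)
               for r in read_ranges
               for w in writes_since_read_version)
```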

Please tell me if that leaves questions unresolved for you!

Thanks for the detailed answer. Is it actually serializable isolation - does it handle write skew anomalies (https://en.wikipedia.org/wiki/Snapshot_isolation)? Most OCC systems I know have only snapshot isolation.

The systems closest to FoundationDB's transaction model that I can think of are Omid (https://omid.incubator.apache.org/) and Phoenix (https://phoenix.apache.org/transactions.html). They both support MVCC transactions, but I think they have a single coordinator that gives out timestamps for transactions - like your "resolvers". The question is how your "resolvers" reach agreement - are they each responsible for a range (partition)? If transactions cross ranges, how do they reach agreement?

We have talked to many DB designers about including their DBs in HopsFS, but mostly it falls down on something or other. In our case, metadata is stored fully denormalized - all inodes in a FS path are separate rows in a table. In your case, you would fall down on secondary indexes - which are a must. Clustered PK indexes are not enough. For HopsFS/HDFS, there are so many ways in which inodes/blocks/replicas are accessed using different protocols (not just reading/writing files or listing directories, but also listing all blocks for a datanode when handling a block report). Having said that, it's a great DB for other use cases, and it's great that it's open-source.

Yes, it's really serializable isolation. The real kind, not the "doesn't exhibit any of the named anomalies in the ANSI spec" kind. We can selectively relax isolation (to snapshot) on a per-read basis (by just not creating a conflict range for that read).

I tried to explain distributed resolution elsewhere in the thread.

I believe our approach to indices pretty much totally dominates per-partition indexing. You can easily maintain the secondary indexes you describe; I don't understand your objection.

My guess is the objection lies in "Have to manage the index myself."

Also, the main draw-back of "indices as data" in NoSQL is when you need to add a new index -- suddenly, you have to scrub all your data and add it to the new index, using some manual walk-the-data function, and you have to make sure that all operations that take place while you're doing this are also aware of the new index and its possibly incomplete state.

Certainly not impossible to do, but it sometimes feels a little bit like "I wanted a house, but I got a pile of drywall and 2x4 framing studs."
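The backfill chore described above might be sketched like this (all naming here is hypothetical; in FDB the scan would be chunked range reads, each chunk committed in its own transaction, while live writers already maintain the new index for any record they touch):

```python
def backfill_index(store, index, key_of):
    """Walk existing records and add index entries for them.
    setdefault avoids clobbering entries that live writers created
    for the partially built index mid-backfill."""
    for record_id, record in list(store.items()):
        index.setdefault(key_of(record), record_id)
```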

"I wanted a house, but I got a pile of drywall and 2x4 framing studs."

This is a totally legitimate complaint about FoundationDB, which is designed specifically to be, well, a foundation rather than a house. If you try to live in just a foundation you are going to find it modestly inconvenient. (But try building a house on top of another house and you will really regret it!)

The solution is of course to use a higher level database or databases suitable to your needs which are built on FDB, and drop down to the key value level only for your stickiest problems.

Unfortunately Apple didn't release any such to the public so far. So I hope the community is up to the job of building a few :-)

I agree with voidmain’s comment as secondary indexes shouldn’t be any different than the primary KV in your case. Almost seems that you’re focusing on a SQL/Relational database architecture but storing your data demoralized anyways. Odd combination of thoughts.

I love the idea of demoralized data :)

Is that where the data sits around waiting, eager, to be queried into action while watching data around them being used over and over again.... But that time never comes, thus leaving them to question their very worth?

> Transaction coordination is pretty different in FoundationDB than in 2PC-based systems. The job of determining which conflict ranges intersect is done by a set of internal microservices called "resolvers", which partition up the keyspace totally independently of the way it is partitioned for data storage.

Ok, per my other question that makes sense. Similar to FaunaDB except the "resolvers" (transaction processors) are themselves partitioned within a "keyspace" (logical database) in FaunaDB for high availability and throughput. But FaunaDB transactions are also single-phase and we get huge performance benefits from it.

> best known metadata storage engine for a distributed filesystem, hopsfs

Not even close. I don't even see anything I'd call a filesystem mentioned on your web page. I missed FAST this year, and apparently you had a paper about using Hops as a building block for a non-POSIX filesystem - i.e. not a filesystem in my and many others' opinion - but it's not clear whether it has ever even been used in production anywhere let alone become "best known" in that or any other domain. I know you're proud, perhaps you have reason to be, but please.

Yes, it's used in production. There's a company commercializing it: www.logicalclocks.com

I'm not convinced a paper with a website and some proofs of concept would be considered the "best". You're throwing a bunch of components into a distro and calling it everything from "deep learning" to a file system. It's not clear what you guys are even trying to do here.

You don't need to worry about shards / partitions with FDB. Their transactions can involve any number of rows across any number of shards. It is by far the best foundation for your bespoke database.

Your talk was one of the best talks I've seen, and I keep mentioning it to people whenever they ask me about distributed systems, databases, and testing.

I'm incredibly impatient to have a look at what the community is going to build on top of this very promising technology.

Can you share that talk here too?

link to the [talk] from the parent comment [by wwilson]

[talk] https://www.youtube.com/watch?v=4fFDFbi3toc

[by wwilson] https://news.ycombinator.com/item?id=16877401

I'm not familiar with FDB, but what you say sounds almost too good to be true. Can I use it to implement the Google Datastore API? I've been trying for years to find a suitable backend so that I can leave Google land. Everything I tried either required a schema or lacked transactions or key namespaces.

As an existence proof: before the acquisition we built an ANSI SQL database and a wire-compatible clone of the MongoDB API.

I see no reason you wouldn't be able to implement Datastore. In fact here's a public source claiming that Firestore (which I believe is its successor) is implemented on top of Spanner: https://www.theregister.co.uk/2017/10/04/google_backs_up_fir...

If I recall correctly, the "SQL layer" you had in FDB before the Apple acquisition was a nice proof of concept, but lacked support for many features (joins in SELECT, for example). Is the SQL layer code from that time available anywhere to the public? (I'm not seeing it in the repo announced by OP.)

I used to work there. The SQL layer was actually capable of the majority of SQL features, including joins, etc. We had an internal Rails app that we used to dog-food for performance monitoring, etc. I used to work on the document layer, and was sad to see it wasn't included here.

That has FoundationDB dependencies that don’t exist in the apple repo (SQL Parser, JDBC Driver, etc)

I hope this question doesn't feel too dumb, but is it possible to implement a SQL layer using SQLite's Virtual Table mechanism and leverage all of FoundationDB's features?

Is the wire-compatible clone of the MongoDB API available?

It doesn't look like Apple open-sourced the Document Layer, which is a slight bummer. But I echo what Dave said below: what we got is incredible, let's not get greedy!

Also TBH now that I don't have commercial reasons to push interop, if I write another document database on top of FDB, I doubt I'd make it Mongo compatible. That API is gnarly.

> That API is gnarly.

Out of the many MongoDB criticisms, this one is valid. This https://www.linkedin.com/pulse/mongodb-frankenstein-monster-... article is quite right about it (note: endorsing the article by no means implies I endorse its author).

Other than that, they genuinely pulled off "fake it until you make it": MongoDB 3.4 passed Jepsen a year ago, and MongoDB BI 2.0 contains their own SQL engine instead of wrapping PostgreSQL.

What specifically are you trying to avoid endorsing about the author of the LinkedIn post to which you linked? I couldn't find anything from a cursory web search.

He runs lambdaconf, and refused to disinvite a speaker who many people felt shouldn't be permitted to speak because of his historical non-technical writings.

(I've tried to keep the above as dry as possible to avoid dragging the arguments around this situation into this thread - and I suspect the phrasing of the previous comment was also intended to try and avoid that, so let's see if we can keep it that way, please)

You could try adding controversy to the author name and searching then. As mst correctly notes, I am trying to avoid reigniting said controversy while indicating my distaste.

> That API is gnarly.

The Will Wilson I remember was not prone to such understatement. That API (and the corresponding semantics) was a nightmare.

I was pretty bummed not to see that, myself

Hey there--I'm an engineer on the Cloud Datastore team. I'd love to know more about what your needs are if you're willing to share.

He "needs" to get rid of you (Google). :-)

I've forked the official SDK so that I can get extra functionality, but it's quite hard to keep it updated when internal stuff changes. There is no way I can contribute. I can't use it everywhere I want... in short, it's not open source, and this sucks.

Honest question, does MongoDB not work for this?

Honest question, does MongoDB work?

Until recently, no. But apparently they got serious about making it work and now it actually does. See https://jepsen.io/analyses/mongodb-3-4-0-rc3 for verification.

I realized this 2 months ago at https://news.ycombinator.com/item?id=16386129

That's Nostradamus-level crazytown. What next, PHP strongly enforcing a sound static type system with immutable defaults, GADTs and dependent types?

Congrats to the MongoDB team!

> What next, PHP strongly enforcing a sound static type system with immutable defaults, GADTs and dependent types?

Too funny!

MongoDB is pretty average at absolutely everything.

It's fast to install on a dev machine though, lol.

For better or worse, many times UX is more important than functionality.

MongoDB is AGPL or proprietary. Many companies have a policy against using AGPL licensed code. So, if you work at one of those companies, then open source MongoDB is not an option (at least for work projects), and proprietary may not be either (depending on budget etc).

FoundationDB is now licensed under Apache 2, which is a much more permissive license, so most companies' open source policies allow it.

Unless people want to change the MongoDB code they are using, using AGPL software should be a non-issue; there are no problems with it. People should start understanding the available licenses instead of spreading fear.

It isn't "spreading fear", it is just reality. Google bans AGPL-licensed software internally: https://opensource.google.com/docs/using/agpl-policy/

I know that multiple other companies have a similar policy (either a complete ban on using AGPL-licensed software, or special approval required to use it), although unlike Google, they don't post their internal policy publicly.

If someone works at one of these companies, what do you want to do – spend your day trying to argue to get the AGPL ban changed, or a special exception for your project; or do you just go with the non-AGPL alternative and get on with coding?

The main reason it's a problem at many of the companies which ban it is they have a lot of engineers who readily patch and combine code from disparate sources and might not always apply the right discipline to keep AGPL things appropriately separate. Bright-line rules can be quite useful.

It is true that MongoDB's AGPL contagion and compliance burden, if you don't modify it, is less than many fear. It is also true that those corporate concerns are valid. MongoDB does sell commercial licenses so that such companies can handle their unavoidable MongoDB needs, but they would tend to minimize that usage.

Because MongoDB doesn't (or at least didn't) do anything well at all, except for its little complex document modifiers.

It was a very bad RDBMS (since they marketed it as a replacement for one) and a very bad sharded DB (marketed as that too).

> How about replacing your Lucene/ElasticSearch index with something that actually scales and works?

Do you have something to back that up? This to me reads like you imply that Elasticsearch does not work and scale.

It's definitely interesting, but I'm cautious. The track record for FoundationDB and Apple has not been great here. IIRC they acquired the company and took the software offline, leaving users out in the rain?

Could this be like it happened with Cassandra at Facebook where they dropped the code and then more or less stopped contributing?

Also I haven't seen many contributions from Apple to open-source projects like Hadoop etc. in the past few years. Looking for "@apple.com" email addresses in the mailing lists doesn't yield a lot of results. I understand that this is a different team and that people might use different E-Mail addresses and so on.

In general I'm also always cautious (but open-minded) when there's lots of enthusiasm and there seems to be no downside. I'm sure FoundationDB has its dirty little secrets and it would be great to know what those are.

> Do you have something to back that up? This to me reads like you imply that Elasticsearch does not work and scale.

The statement was blunt but the reputation is not exactly unearned. These problems become worse proportional to scale.

>It’s in widespread production use at Apple

Anybody know if Apple migrated any projects from Cassandra to FoundationDB?

Or was every project using FoundationDB a greenfield project?

There were rumors iMessage moved there but ¯\_(ツ)_/¯

Would be great to see from Apple an engineering blog "strengths and challenges at scale" post now that it has been opened again.

I have nothing specific to add to this either other than I also spent years working there, and I'm unbelievably happy about this.

It is a genius move from Apple. I just wish they'd apply this logic to a lot of their other stuff.

They are, slowly. Swift is open source, clang is open source. They are moving parts of the Xcode IDE into open source, like with sourcekitd and now recently clangd.

I don't think they will ever move 'secret sauce' into open source, but infrastructural things like DBs and dev tooling seems to be going in that direction.

Well, clangd is a Google project, which Apple has decided to start contributing to, so probably doesn’t belong on your list.

Apple, like everyone else, wants to commoditize their complements.

Clang was an Apple project from the start. I'm not sure what makes you say it is a Google project.

Clangd != clang.

Woolvalley said “and now recently clangd“, which is a source code completion server which began its life at Google.

Clang did start its life with Apple.

In case a citation is needed: https://en.wikipedia.org/wiki/Clang

> Apple chose to develop a new compiler front end from scratch, supporting C, Objective-C and C++. This "clang" project was open-sourced in July 2007.

I believe the "d" in the GP comment was not a typo https://news.ycombinator.com/item?id=16874734

~~How is it different from when Apple acquired the then-open-source FoundationDB (and shut down public access)? They could have just kept it open source back then.~~

EDIT: My bad, looks like FoundationDB wasn't fully open-source back then.

From what I recall (and based on some quick retro-googling) I don't believe FoundationDB was open-source. One of the complaints about it on HN back in the day was that it was closed...

You are correct, FoundationDB was free but never OSS.

You can even check the comments on the HN thread [1] when they removed the downloads.

[1] https://news.ycombinator.com/item?id=9259986

> In fact, our everyday testing was way more brutal than Jepsen, I gave a talk about it here: https://www.youtube.com/watch?v=4fFDFbi3toc

Unrelated to the original topic, but I had never come across that talk and it is great. I use the same basic approach to testing distributed systems (simulating all non-deterministic I/O operations) and that talk is a very good introduction to the principle.

In reinforcement learning, the same approach (simulation) to addressing a data deficit is used, although the goal there is learning and not testing.

When certain events happen rarely, you have to fake them deterministically so that you have reproducible coverage.

Know of any ideas around using this for long term Time Series data? Wonder if maybe something like OpenTSDB but with this as backend instead of hbase (which can be a sort of operational hell)

Yes :) it's the core of how Wavefront stores telemetry.

Not quite the same but TimescaleDB is Postgres-based and is showing a lot of promise for time-series data.

How would you replace a Lucene/Elasticsearch index with foundationDb?

It's more like you would build a better Elasticsearch using Lucene to do the indexing and FoundationDB to do the storage. FoundationDB will make it fault tolerant and scalable; the other pieces will be stateless.

It'd take a low number of hours to wire up FoundationDB as a Lucene filesystem (Directory) implementation. Shared filesystem with a local RAM cache has been practical for a while in Lucene, and was briefly supported then deprecated in Elasticsearch. I've used Lucene on top of HDFS and S3 quite nicely.

If you have a reason to use FoundationDB over HDFS, NFS, S3, etc, then this will work well.

Doing a Lucene+DB implementation where each term's posting list is stored natively in the key-value system was explored for Lucene+Cassandra (Solandra: https://github.com/tjake/Solandra). It was horrifically slow - not because Cassandra was slow, but because posting lists are highly optimized, and putting them in a generalized B-tree or LSM-tree variant removes locality and many of the possible optimizations.

I'm still holding out some hope for a hybrid implementation where posting list ranges are stored in a kv store.

I think you are on the right track. Storing every individual (term, document, ...) in the key value store will not be efficient, but you should be able to take Lucene's nice fast immutable data structure and stuff blocks of it (at the term level or below) into FDB values very efficiently. And of course you can do caching (and represent invalidation data structures in FDB), and...

FDB leaves room for a lot of creativity in optimizing higher layers. Transactions mean that you can use data structures with global invariants.

So from the Lucene perspective, the idea of a filesystem is pretty baked into the format. However, there's also the idea of a Codec which takes the logical data structures and translates to/from the filesystem. If you made a Codec that ignored the filesystem and just interacted with FDB, then that could work.

You can already tune segment sizes (a segment is a self-contained index over a subset of documents). I'd assume that the right thing to do for a first attempt is to use a Codec to write each term's entire posting list for that one segment to a single FDB key (doing similar things for the many auxiliary data structures). If it gets too big, then you should have tuned max segment size to be smaller. Do some sort of caching on the hot spots.
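The "whole posting list per (segment, term) key" idea might look like this delta-encoding sketch (purely illustrative; Lucene's real posting formats are far more elaborate, with skip data, frequencies, and positions):

```python
import struct

def encode_postings(doc_ids):
    """Pack a sorted posting list into a single value blob using
    delta encoding - the kind of thing one might store under a
    (segment, term) key in FDB."""
    deltas, prev = [], 0
    for d in doc_ids:
        deltas.append(d - prev)
        prev = d
    return struct.pack(f"<{len(deltas)}I", *deltas)

def decode_postings(blob):
    """Invert encode_postings back to absolute doc ids."""
    doc_ids, acc = [], 0
    for delta in struct.unpack(f"<{len(blob) // 4}I", blob):
        acc += delta
        doc_ids.append(acc)
    return doc_ids
```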

If anyone has any serious interest in trying this, my email is in my profile to discuss further.

Hmmm. I'm skeptical. A Lucene term lookup is stupidly fast. It traverses an FST, which is small and probably in memory. Traversing the postings lists itself also needs to be smart by following a skip table, which is critical for performance.

> you should be able to take Lucene's nice fast immutable data structure and stuff blocks of it (at the term level or below) into FDB values very efficiently.

That sounds a lot like Datomic's "Storage Resource" approach, too! Would Datomic-on-FDB make sense, or is there a duplication of effort there?

It most definitely would.

Datomic’s single-writer system requires conditional put (CAS) for the index and transaction-log tree root pointers (mutable writes), and eventual consistency for all other writes (immutable writes) [0].

I would go as far as saying a FoundationDB-specific Datomic may be able to drop its single-writer system due to FoundationDB’s external consistency and causality guarantees [1], drop its 64-bit integer-based keys to take advantage of FoundationDB range reads [2], drop its memcached layer due to FoundationDB’s distributed caching [3], use FoundationDB watches for transactor messaging and the tx-report-queue function [4], use FoundationDB snapshot reads [5] for its immutable index tree nodes, and maybe more?

Datomic is a FoundationDB layer. It just doesn’t know yet.

[0] https://docs.datomic.com/on-prem/acid.html#how-it-works

[1] https://apple.github.io/foundationdb/developer-guide.html?hi...

[2] https://apple.github.io/foundationdb/developer-guide.html?hi...

[3] https://apple.github.io/foundationdb/features.html#distribut...

[4] https://docs.datomic.com/on-prem/clojure/index.html#datomic....

[5] https://apple.github.io/foundationdb/developer-guide.html?hi...

Can't see datomic itself ever doing that though because they'd have to support those features in all the backends.

I wrote the original version of Solandra (which is/was Solr on Cassandra) on top of Jake's Lucene on Cassandra[1].

I can confirm it wasn't fast!

(And to be fair that wasn't the point - back then there were no distributed versions of Solr available so the idea of this was to solve the reliability/failover issue).

I wouldn't use it on a production system nowadays.

[1] http://nicklothian.com/blog/2009/10/27/solr-cassandra-soland...

> I've used Lucene on top of HDFS and S3 quite nicely.

Out of curiosity, what led you to do this? And what does it do better/worse/differently than out-of-the-box things like Elasticsearch or SOLR?

HDFS is supported by Solr+Lucene, but rather than my own poor paraphrasing, see what you think of this writeup: https://engineering.linkedin.com/search/did-you-mean-galene

Ah, excellent! Thanks. That answers my question. I also found the idea of early termination via static rank very intriguing.

See Elassandra for a better Solandra; it keeps the Lucene indexes and SSTables separate. That should be better than keeping posting lists in a KV store.

ok thanks, it was sort of confusing me

Actually, since my current side project needs good graph support, I went looking for a FoundationDB graph layer and found this: https://github.com/rayleyva/blueprints-foundationdb-graph. I may have to check it out.

I just watched the demo of 5 machines and 2 getting unplugged. The remaining 3 can form a quorum. What happens if it was 3 and 3? Do they both form quorums?

A subset of the processes in a FoundationDB cluster have the job of maintaining coordination state (via disk Paxos).

In any partition situation, if one of the partitions contains a majority of the coordinators then it will stay live, while minority partitions become unavailable.

Nitpick: To be fully live, a partition needs a majority of the coordinators and at least one replica of each piece of data (if you don't have any replicas of something unimportant, you might be able to get some work done, but if you have coordinators and nothing else in a partition you aren't going to be making progress)

The arbitrator pattern is a great lightweight solution to the split-brain problem. Here's a great blog about it: http://messagepassing.blogspot.se/2012/03/cap-theorem-and-my...

What if the number of boxes is even?

Or do you get around this by only deploying an uneven number of boxes?

The majority is always N/2 + 1, where N is the number of members. A 6-member cluster is less fault-tolerant than a 5-member cluster (quorum is 4 nodes instead of 3, and it still only allows for 2 nodes to fail).

The number of coordinators is separate from the number of boxes. You don't have to have a coordinator on every box.

I think you can set the number of coordinators to be even, but you never should - the fault tolerance will be strictly better if you decrease it by one.
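For concreteness, the quorum arithmetic in this sub-thread:

```python
def quorum(n: int) -> int:
    """Smallest majority among n coordinators: floor(n/2) + 1."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Coordinator failures survivable while a majority remains."""
    return n - quorum(n)

# Going from 5 to 6 coordinators raises the quorum (3 -> 4) without
# raising the failures tolerated (still 2) - hence "decrease it by one".
```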

This makes me wonder also what happens with multiple partitions, for example:

5 into 2, 2, and 1

7 into 3, 2, and 2

No majorities, I guess?

Apple also uses HBase for Siri I believe, what are some of the cluster sizes that FoundationDB scales to? Could it be used to replace HBase or Hadoop?

I was in attendance at your talk, and thought it was one of the best at the conference. Apple I think broke some hearts completely going closed-source for a while, but glad to see them open sourcing a promising technology.

If scale is a function of read/writes, very large. In fact with relatively minimal (virtual) hardware it's not insane to see a cluster doing around 1M writes/second.

I was talking more about large-file storage like HDFS, and the MapReduce model of bringing computation to data. HBase does the latter, and it's strongly consistent like FoundationDB, though FoundationDB provides better guarantees. As a K/V store, I understand what you and the OP say.

How does this compare with CockroachDB? I'm planning to use CockroachDB for a project but would love to get an idea if I can get better results with FoundationDB.

well FoundationDB for one doesn't resort to clickbaity pseudo-benchmark marketing tactics


They might be targeting the wrong market, hence the desperate marketing. For people who use MySQL/PostgreSQL a compatible, slower, but distributed database probably just doesn't solve any problem. Those people need a managed solution, not a distributed one.

"we also simulate dumb sysadmins". That was a really inspiring talk, thanks!

Fantastic talk! Very engaging and insightful.

That presentation was really good! Well explained on the simulations. If one wanted to get in on this exciting development and create something with FoundationDB, but has no database experience (I do know many programming languages), where would I start? If anyone could point me in the right direction, I'd greatly appreciate it.

How scalable and feasible would it be to implement a SQL layer on top of SQLite's virtual table mechanism (https://www.sqlite.org/vtab.html), redirecting reads/writes of the record data from/to FoundationDB?

Long before we acquired Akiban, I prototyped a SQL layer using the (now defunct) sqlite4, which used a K/V store abstraction as its storage layer. I would guess that a virtual table implementation would be similar: easy to get working, and it would work, but the performance is never going to be amazing.

To get great performance for SQL on FoundationDB you really want an asynchronous execution engine that can take full advantage of the ability to hide latency by submitting multiple queries to the K/V store in parallel. For example if you are doing a nested loop join of tables A and B you will be reading rows of B randomly based on the foreign keys in A, but you want to be requesting hundreds or thousands of them simultaneously, not one by one.

Even our SQL Layer (derived from Akiban) only got this mostly right - its execution engine was not originally designed to be asynchronous and we modified it to do pipelining which can get a good amount of parallelism but still leaves something on the table especially in small fast queries.
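To illustrate the latency-hiding idea, here is a toy sketch in Python with asyncio, using a made-up in-memory K/V store (ToyKV and the row shapes are invented for illustration; this is not the real FDB client API):

```python
import asyncio

# Toy async K/V store with simulated per-request latency.
class ToyKV:
    def __init__(self, data):
        self.data = data

    async def get(self, key):
        await asyncio.sleep(0.01)  # pretend network round trip
        return self.data.get(key)

async def nested_loop_join(kv, outer_rows, fk):
    # Issue all inner-table lookups concurrently instead of one by one,
    # so the round-trip latencies overlap rather than add up.
    lookups = [kv.get(row[fk]) for row in outer_rows]
    inner_rows = await asyncio.gather(*lookups)
    return list(zip(outer_rows, inner_rows))

kv = ToyKV({"b1": {"name": "x"}, "b2": {"name": "y"}})
outer = [{"id": 1, "b_id": "b1"}, {"id": 2, "b_id": "b2"}]
result = asyncio.run(nested_loop_join(kv, outer, "b_id"))
```

With N outer rows, the sequential version pays N round trips; the gathered version pays roughly one.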

@voidmain, Thank you, it's very insightful and clear! I mean, I can see the disadvantage if such a SQL layer is implemented directly through SQLite's virtual tables.

Would it be possible to build a tree DB on top of it, like MonetDB/XQuery? I always wondered why XML databases never took off; I've never seen anything else quite as powerful. Document databases du jour seem comparatively lame.

Yes you can. You need a tree index basically. Any kv store can serve as the backing data structure. I've been writing one for config file bidirectional transformation.

They had a document layer that wasn't released that was MongoDB compatible — so you can do this.

MongoDB is just a database with fewer features than a SQL database. An XML/XQuery database is fundamentally different, so I figured if FoundationDB layers are really so powerful, they might be able to model a tree DB as well.

Ah misread. Maybe? Storing hierarchical data would be pretty natural.

That is so exciting! Can't wait to poke around. I wonder if there's anything in there I might recognize

Sorry if I missed it but can you or someone else please link where the client wire protocol is documented?

The client is complex and needs very sophisticated testing, so there is only one implementation. All the language bindings use the C library.

Curious why the client is complicated compared to other DBs in the same space?

In some distributed databases the client just connects to some machine in the cluster and tells it what it wants to do. You pay the extra latency as it redirects these requests where they should go.

In FDB's envisioned architecture, the "client" is usually a (stateless, higher layer) database node itself! So the client encompasses the first layer of distributed database technology, connects directly to services throughout the cluster, does reads directly (1xRTT happy path) from storage replicas, etc. It simulates read-your-writes ordering within a transaction, using a pretty complex data structure. It shares a lot of code with the rest of the database.

If you wanted, you could write a "FDB API service" over the client and connect to it with a thin client, reproducing the more conventional design (but you had better have a good async RPC system!)

> but you had better have a good async RPC system!

The microservices crew with their "our database is behind a REST/Thrift/gRPC/FizzBuzzWhatnot microservice" pattern is still catching up to the significance of this statement.

This might be a dumb question (from someone used to using blocking JDBC) but why is async RPC important in this case? Just trying to understand. And can gRPC not provide good async RPC?

I was referring to the trend of splitting up applications into highly distributed collections of services without addressing the fact that every point where they communicate over the network is a potential point of pathological failure (from blocking to duplicate-delivery etc). This tendency replaces highly reliable network protocols (i.e. the one you use to talk to your RDBMS) with ad hoc and frequently technically shoddy communication patterns, with minimal consideration for how it might fail in complex, distributed ways. While not always done wrong, a lot of microservice-ification efforts are quite hubristic in this area, and suffer for it over the long term.

Imagine a single-core, single-threaded design, and two requests for one row each.

First request: the row needs to be read from a spinning HDD. It takes 2 ms.

Second request: the row is already in RAM, so it takes microseconds, but it still has to wait for the first request to finish.

Threads have overhead at high concurrency (thousands or millions of requests/second).

For extreme async, see the Seastar framework and ScyllaDB's design.

TL;DR: high concurrency, low overhead, etc.

Wouldn't layers be hard to build into the server (since you'd also have to change the client) and slow when built as a layer (since it would be another separate service)?

I'm not sure what you are asking, but depending on their individual performance and security needs layers are usually either (a) libraries embedded into their clients, (b) services colocated with their clients, (c) services running in a separate tier, or (d) services co-located with fdbservers. In any of these cases they use the FoundationDB client to communicate with FoundationDB.

In case (c) or (d) how can a layer leverage the distributed facilities that FDB gives? I mean if I have clients that connect to a "layer service" that is the one who talks to FDB, I have to manage "layer service" scalabily, fault tolerance etc... by myself.

Yes, and that's the main advantage of choosing (a) or (b). But it's not quite as hard as it sounds; since all your state is safely in fdb you "just" have to worry about load balancing a stateless service.

Got it. What would you suggest for doing something like that? A simple RPC service with a good async framework, I've read, but like what? Something on top of Twisted for Python, and similar things in other languages?

thanks :)

I am guessing you'd pretty much embed the client into your higher layer.

Ok, thanks :)


Postgres operates great as a document store, btw. You don't really need Mongo at all. And if you need to distribute because you've outgrown what you can do on a single Postgres node, you don't want to use Mongo anyway.

If you’ve read any of the comments or have been following the project, it should be pretty obvious that this is far from a rookie effort.

This is a game changer, not a hobby project. This is the first distributed data store that offers enough safety, and a good enough understanding of CAP-theorem trade-offs, that it can be safely used as a primary data store.

> I respect hobby projects though


FDB raised >$22m and was acquired by Apple.

If you consider that a "hobby project," please, teach me how to hobby.

Friends don’t let friends use MongoDB.

https://www.youtube.com/watch?v=4fFDFbi3toc I am a big fan of PostgreSQL, but it would probably be a good idea to look into the thing you are commenting about.

Or perhaps it's not so incredible? Maybe it wasn't such a huge hit for Apple and didn't live up to expectations, so they figure they can give it away and earn some community goodwill.

There were rumors that FDB had never been adopted at Apple because of some limitations, and that they still ran an in-house version of Cassandra.

This is great news. When I was with Dynamo, FoundationDB was the other green shore for me :). They did so many things so well.

A tiny bit of caution for folks trying to run systems like this, though: it is frigging hard at any reasonable scale. The whole thing might be documented/OSS and whatnot, but very soon you are going to run into problems deep enough to require very core knowledge to debug and energy to deep-dive, neither of which you probably want to invest your time in. Do evaluate the cloud offerings / supported offerings before spinning these up. Else ensure you have hired experts who can keep this going. They are great as a learning tool, pretty hard as an enterprise solution. I have seen the same issue a ton of times with a bunch of software (Redis/Kafka/Cassandra/Mongo...) by now. IMO In the stateful world, operating/running the damn thing is 85% of the work, 15% being active dev. (Stateless world is a little better, but still painful).

> IMO In the stateful world, operating/running the damn thing is 85% of the work, 15% being active dev. (Stateless world is a little better, but still painful).

The number of newbie engineers who see docker/kubernetes and think "let me docker run or helm install a stateful service in a couple of minutes" is mind-boggling.

I'll remember this quote when trying to talk sense to them.

This sounds like a business opportunity to me rather than a cautionary note.

I remember all the people who bashed Apple when they acquired FoundationDB. I hope they are appropriately ashamed now.

> I remember all the people who bashed Apple when they acquired FoundationDB.

I'm not ashamed about deriding Apple for taking a really, really great product and hiding it from the world for years.

This is definitely some atonement, but does not totally absolve Apple from the many times they've taken tech private.

I went to the same high school as the founders[1]. They were just about the two best software engineers in a school with a LOT of very smart software engineers. Another pair founded Yext, which went public last year. I still consider that school the group with the highest concentration of raw brain power I've ever been a part of.

I'm probably a 1% engineer, been hired by M$, FB, and Google. These guys were light years ahead of me. I'm not sure I'm as good now as they were at like 17 years old. In fact I'm probably only a decent engineer from having observed the stuff they were doing back then and finding inspiration.

1: https://en.wikipedia.org/wiki/Thomas_Jefferson_High_School_f...

I went to UVA and I think about half of the engineering school was from TJ (including 3 of my roommates). :) I can't think of any superior public high school (and I went to a different Governor's School in Virginia myself) anywhere, due to the amazingly large and deep talent pool TJ pulls from. Nothing like it exists in the Bay Area, that's for sure!

The top high schools in the Bay Area often give it a run for its money…

I did chuckle at the part above about the "best software engineers in high school," but can't argue with the results.

That's a very special high school

It definitely was when I went there.

That's a great school. I hired some students there when I ran a tech camp for middle-schoolers and they were...beyond.

Just checked the rankings. My first high school made #172 on the list ("International School of the Americas").

My second high school is unranked ("Texas Academy of Math and Science"). I don't think it qualifies as a high school. Seems we haven't done a great job of identifying the accomplishments of our alumni, based on the Wikipedia page. No doubt it would rank near the top though. My year alone Caltech accepted about 30 of us, more than any other high school in the country. Makes me wonder what my peers have been up to.

Anyway, I'd agree that these tech high schools have some amazingly smart people attending them.

2: https://en.wikipedia.org/wiki/Texas_Academy_of_Mathematics_a...

Forgot to mention that those four engineers were from the same school AND the same grade as me. So 4 out of 400 students were the tech guys behind Yext and FoundationDB.

Maybe it's just me, but the hubris in this comment seems a little excessive.

The aforementioned SWEs made themselves multimillionaires /and/ have great jobs /and/ get praised by their peers, yet now everyone else has to bow down to them... over claims that they were great programmers in high school? Is an appeal to your own experience the best way to make yourself seem relevant here?

It's an incredible DB system, but it ain't the Second Coming. Calm down.

Did you bow down at your laptop?

Is it ok to admire great engineers on HN?

Is this comment relevant to anything but your own inferiority complex?

Can we get a decent definition of hubris in here?

I don't bow down to my laptop, but I respect what went into making it.

It's great to admire excellent engineers; aspiring to be as skilled as someone at a task can be very motivating. Worshipping them is another thing.

You're right about the inferiority complex - I know I'm a relatively bad SW engineer, but that's mostly related to how new it is to me. I expect and want to improve.

Hubris is defined as excessive pride or self-confidence. I would say that bragging that you're "probably a 1% engineer," and that you've been hired by three of the largest SW companies out there, qualifies as hubris. Maybe it's just me, but I don't think a 1% engineer would publicly boast about being one and then actually use 'M$' in a non-farcical manner.

Shitting on 99% of the SWE population to make yourself look good, then shitting on yourself to make another person look even better doesn't really work. There's a reason humanCmp() is a little more complex than strCmp().

BTW, being hired by a large company doesn't mean you're all that. Plenty of idiots get hired by Oracle.

Google please take a few notes here:

1. It’s in its own repo

2. The build instructions are concise and clear. Dependencies are listed. You have to follow a total of 0 links.

3. They use a common build system and not an in-house thing.

Speaking as the original author of this monstrosity of a build system, please be careful before offering praise here. To be clear, there is a top-level, non-recursive Makefile that uses the second expansion feature of GNU make, translating Visual Studio project files into generated Makefile inputs that are transformed into targets to power the build.

Although it starts by running `make`, it's about as in-house as a thing can be.

This kind of deep-inside-baseball from-the-horses-mouth interaction is what's so awesome about HN!

> deep-inside-baseball from-the-horses-mouth

Does the horse choke on the baseball? Is there an equine version of the Heimlich maneuver to be performed on horses suffering from mixed-metaphorical-adage-induced asphyxiation?

Yes, you bite the hand that feeds the horse baseballs.

This whole comments section is one of my all time favorites.

Fair enough, I stand corrected. Superficially it seems like a more pleasant experience than dealing with gn/ninja as an end user.

As an end user it's absolutely better. I'm just torn between pride and embarrassment thinking about how it was implemented.

I think my favorite part of the build process was when we frobnicated libstdc++:


And then there were the hijinks we went through to build a cross-compiler with modern gcc and ancient libc (plus the steps to make sure no dependency on later glibc symbols snuck in):


Ahh... now that was a build system.

I saw this in the Makefiles, and -- ah, the life of distributing proprietary Linux software. At one point in a prior job, we just replaced our build system with a wrapper Makefile that 'chroot'ed into a filesystem image that was a snapshot of one of our build machines, since it was so difficult to set up. This meant we (developers) had easier system updates, security upgrades, etc. That was just the tip of the iceberg!

Now you can "... be beautiful and terrible as the Morning and the Night! Fair as the Sea and the Sun and the Snow upon the Mountain! Dreadful as the Storm and the Lightning! Stronger than the foundations of the earth. All shall love [you] and despair!" with a build inside docker; the host can run a modern kernel ; your image can run Slackware 0.1 ;-)

(with apologies to Tolkien)

I'm curious what you think is wrong with gn/ninja (besides the fact that it's non-standard). My problems building Chromium come mostly from other parts of depot_tools.

Ah, GNU make second expansion... glad to see someone else appreciates it ;-).

We (Wavefront) have been operating petabyte-scale clusters with FoundationDB for the last 5 years (we got the source code via escrow), and we are super excited to be involved in the open-sourcing of FDB. We have operated over 50 clusters on all kinds of AWS instances, and I can talk about all the amazing things we have done with it.


We basically replaced MySQL, Zookeeper, and HBase with a single KV store that supports transactions, watches, and scales. It's not a trivial point that you can just develop code against a single API (finally, Java 8 CompletableFutures) and not have to set up a ton of dependencies when you are building on top of FDB. We are (obviously) experts at monitoring FoundationDB with Wavefront, and we hope to release the metric-harvesting libraries and template dashboards that we use to do so.

Almost 5 years in and we have not lost any data (but we have lost machines, connectivity, seen kernel panics, EBS failures, SSD failures, etc., your usual day in AWS =p).

"but we have lost machines, connectivity, seen kernel panics, EBS failures, SSD failures, etc., your usual day in AWS" <=== This. I wish more people realized that it is a day-to-day reality if you are in AWS at scale.

The best way I've heard it described is "complex systems run in degraded mode".


Basically, once a system is complex enough, some part of it is always broken. The software must be designed from the assumption that the system is never running flawlessly.

No doubt, but that's pretty high overhead for many projects. Colo is actually a decent choice, but I guess that's not a popular opinion.

As I understand it, it's like that everywhere at scale, not just on AWS, it being a property of operating at scale.

Or are you saying that AWS is particularly unreliable at scale?

I seem to think that cloud providers are particularly opaque about small glitches (i.e. they aren't going to tell you that a router or switch was rebooted for maintenance if it comes back right away, and if you email support it's always the same response: "it's working right now") :)

On the network side, no; it's much crappier on AWS.

Which provider is the best, network wise?

I only have experience with AWS, on-prem, and high-quality colo like Equinix. Possibly due to reduced complexity and having full control over the networking setup, we saw significantly fewer issues vs. AWS.

And FoundationDB has held our data durable through all of this.

Sounds like whoever bolts a PG-compatible SQL layer on top will have a killer product on their hands :)

Have a look at CockroachDB

Already playing with it but FoundationDB is used for production Petabyte scale deployments, and the whole deterministic simulation thing for testing is really reassuring as far as bugs/stability. I am guessing with Apple's resources that approach was taken to a whole new level after the acquisition?

I can see everyone's extremely happy about this, which is great. As someone who's never used it, I'd like to know more about FoundationDB and how it compares to other offerings such as MySQL or Postgres, and which use cases is it most suited to. I would especially love to hear the thoughts of those with direct experience of using Foundation DB. Thanks!

Seconding this. Instead of hearing from bloggers or contributors, what is using this DB like in the trenches? What simple use case would fit this DB perfectly? Is there a lot of setup on second/third deploy? Easy to maintain, or requires a lot of tweaking/tuning? How is memory usage with 10M records? 100M?

Personally I'd be more interested in hearing how this compares to other distributed noSQL implementations like Cassandra.

It probably is better, because Apple switched from Cassandra to FoundationDB, according to all the rumors: http://www.businessinsider.com/why-apple-bought-foundationdb... But knowing Apple, they probably won't tell us.

or CockroachDB

It's under Apache 2.0 for those curious: https://github.com/apple/foundationdb/blob/master/LICENSE. Also, side note: it looks like this was a private GitHub repository for at least a couple months, since they have pull requests going back for at least that long. I find this interesting, since Apple normally "cleans up" history before open sourcing.

Swift, too, was published on GitHub including its whole version control history, dating back to 2010 :)

I hadn't heard of FoundationDB before, so I did some digging into the features: https://apple.github.io/foundationdb/features.html . It seems to claim ACID transactions with serializable isolation, but also says later on that it uses MVCC, slower clients won't slow down operations, and that it allows true interactive queries. I didn't think an MVCC implementation could provide that level of isolation, and I'm not even sure how you provide that level of isolation and those other guarantees with any implementation, am I misunderstanding something?

I'll try to give you a quick introduction. The architecture talk I recorded for new engineers working on the product ran to four or five hours, I think :-). In short, it is serializable optimistic MVCC concurrency.

A FDB transaction roughly works like this, from the client's perspective:

1. Ask the distributed database for an appropriate (externally consistent) read version for the transaction

2. Do reads from a consistent MVCC snapshot at that read version. No matter what other activity is happening you see an unchanging snapshot of the database. Keep track of what (ranges of) data you have read

3. Keep track of the writes you would like to do locally.

4. If you read something that you have written in the same transaction, use the write to satisfy the read, providing the illusion of ordering within the transaction

5. When and if you decide to commit the transaction, send the read version, a list of ranges read and writes that you would like to do to the distributed database.

6. The distributed database assigns a write version to the transaction and determines if, between the read and write versions, any other transaction wrote anything that this transaction read. If so there is a conflict and this transaction is aborted (the writes are simply not performed). If not then all the writes happen atomically.

7. When the transaction is sufficiently durable the database tells the client and the client can consider the transaction committed (from an external consistency standpoint)

The implementations of 1 and 6 are not trivial, of course :-)

So a sufficiently "slow client" doing a read write transaction in a database with lots of contention might wind up retrying its own transaction indefinitely, but it can't stop other readers or writers from making progress.

It's still the case that if you want great performance overall you want to minimize conflicts between transactions!
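For the curious, the seven steps above can be condensed into a toy, single-process sketch (Python, in-memory; the real system distributes steps 1 and 6, and reads come from a true MVCC snapshot, whereas this sketch reads the latest committed state and relies on the commit-time conflict check alone):

```python
# Toy optimistic-concurrency store: read versions, tracked read sets,
# buffered writes, and a commit-time conflict check.
class Store:
    def __init__(self):
        self.data = {}      # key -> latest committed value
        self.versions = {}  # key -> version of the last write to it
        self.version = 0    # global commit version

    def begin(self):
        return Txn(self, self.version)  # step 1: pick a read version

class Txn:
    def __init__(self, store, read_version):
        self.store = store
        self.read_version = read_version
        self.reads = set()
        self.writes = {}

    def get(self, key):
        if key in self.writes:          # step 4: read-your-writes
            return self.writes[key]
        self.reads.add(key)             # step 2: track the read set
        return self.store.data.get(key)

    def set(self, key, value):
        self.writes[key] = value        # step 3: buffer writes locally

    def commit(self):                   # steps 5 and 6
        # Abort if anything we read was written after our read version.
        for key in self.reads:
            if self.store.versions.get(key, 0) > self.read_version:
                return False
        self.store.version += 1
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.versions[key] = self.store.version
        return True

s = Store()
t1, t2 = s.begin(), s.begin()
t1.get("k"); t2.get("k")
t1.set("k", 1); t2.set("k", 2)
print(t1.commit(), t2.commit())  # the second commit conflicts and aborts
```

Note how a read of a nonexistent key still lands in the read set, so a later write to that key correctly conflicts with it.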

Great write-up.

Is this similar to how Software Transactional Memory (STM) is implemented? It sounds very very similar indeed.

This is a good explanation of how it happens on a single node. What do you do when the transaction is distributed? How do you achieve consensus? Is there a write up on it anywhere?

The only thing that's different in a distributed cluster is the implementations of steps 1 and 6. As voidmain said, the details of that are not trivial, ESPECIALLY the details of how it never produces wrong answers during fault conditions.

I don't know that there's been an exhaustive writeup of that part, but maybe one of us or somebody on the Apple team will put something together. It probably won't fit in an HN comment though!

Or... maybe this is the part where I point out that the product is now open-source, and invite you to read the (mostly very well commented) code. :-)

The documentation ( https://apple.github.io/foundationdb/technical-overview.html ) sells the product, but doesn't give a deep enough explanation. As a closed source product, that's understandable.

Going forward as an opensource product, I hope to see some clarity on the "how it works"... Distributed, performant ACID sounds good, almost too good to be true. Not that I doubt it at the moment, I just want to understand it better :)

Thanks. Can you elaborate on how 6 is actually accomplished? Various earlier comments have hinted that the transactional authority (conflict checking) can actually scale 'horizontally' beyond the check throughput that can be achieved by a single node. Is that the case? And what's the magic sauce for doing that for multi-object transactions? :)

Yes, conflict resolution is for most workloads a pretty small fraction of total resource use so you usually don't need a ton of resolvers (I think out of the box it still comes configured with just one?), but it can scale conflict resolution horizontally.

The basic approach isn't super hard to understand, though the details are tricky. The resolvers partition the keyspace; a write ordering is imposed on transactions and then the conflict ranges of each transaction are divided among the resolvers; each resolver returns whether each transaction conflicts and transactions are aborted if there are any conflicts.

(In general the resolution is sound, but not exact - it is possible for a transaction C to be aborted because it conflicts with another transaction B, but transaction B is also aborted because it conflicts with A (on another resolver), so C "could have" been committed. When Alec Grieser was an intern at FoundationDB he did some simulations showing that in horrible worst cases this inaccuracy could significantly hurt performance. But in practice I don't think there have been a lot of complaints about it.)
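A rough sketch of that partitioning idea (the boundary keys here are made up, and real conflict-range handling is more careful, e.g. about exclusive end keys):

```python
import bisect

# Hypothetical resolver boundaries partitioning the keyspace:
# resolver 0 gets [""  .. "g"), resolver 1 ["g" .. "p"), resolver 2 ["p" ..).
boundaries = ["g", "p"]

def resolver_for(key):
    return bisect.bisect_right(boundaries, key)

def split_ranges(conflict_ranges):
    # Map each (begin, end) conflict range onto every resolver it touches;
    # a transaction is aborted if any resolver reports a conflict.
    per_resolver = {}
    for begin, end in conflict_ranges:
        for r in range(resolver_for(begin), resolver_for(end) + 1):
            per_resolver.setdefault(r, []).append((begin, end))
    return per_resolver

print(split_ranges([("a", "b"), ("f", "q")]))
```

A single-key range stays on one resolver, while a wide range fans out to several, which is what lets conflict checking scale horizontally.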

Does the implementation handle the case that you want to do a write that is conditioned on a prior read finding no corresponding record(s)?

Of course. FDB thinks about read and write conflict ranges, which are functions of the keys, not the values (or lack thereof). A read of a non-existent key conflicts with a write to that key. A read of a range of keys conflicts with a write to a key in that range, even if that key did not have an associated value at the time of the original read.

Any chance you’d be able to find that video? It sounds extremely interesting.

Thanks, that explanation is really helpful!

> I didn't think an MVCC implementation could provide that level of isolation

MVCC needs a bit of additional logic on top to be serializable ("serializable snapshot isolation" is a good keyword to search for), but it's definitely possible.

Edit: Formatting

Apache Trafodion does as well:


I mean, PostgreSQL is MVCC and it clearly supports serializable as an isolation level... have you tried using Google?


A lot of people apparently hated that I attacked this person, and downvoted me; but what the person who posted this comment was doing is intentionally throwing shade at and casting doubt on a project by saying "I think that the claims on this webpage don't make any sense as this should be impossible". It comes off as "oh come on, this doesn't even sound plausible; I call bullshit".

This then causes people who aren't versed with the product or the technology to decrease their perception of the product, and puts the team behind it in a position of having to not just come to its defense but to do so quickly due to the perception concerns.

Meanwhile, if they just do a basic search for "MVCC serializable" they would find that they were wrong; which means that it took more time to leave this insulting comment than it would have taken them to learn how this can work.

As a community, we really really really really need to beat down on casual cynics like this, who like to lazily "call bullshit" or play the "citation needed" card as a way to undermine the credibility of other peoples' products. We live in a future where the answers to these kinds of doubts are a moment away: this particular form of debate tactic needs to die.

I absolutely was not throwing shade at the project, I was just trying to understand. I have a great interest in distributed systems that can efficiently implement serializable ACID transactions, and looked at their documentation with great interest to see how they implemented it. I honestly thought that MVCC could only efficiently implement Snapshot Isolation. So I figured my understanding was wrong, or at worst there's a typo in their documentation and they use something other than MVCC under the hood. Either way, I wanted to know more about their architecture.

I'll take your feedback under consideration, and I'm sorry to have irritated you.

I suspect the downvoting was for the "LMGTFY"-esque part of your response. Dunno, don't care, just offering a possible suggestion.

Original thread for when Apple acquired FoundationDB for those curious: https://news.ycombinator.com/item?id=9259986

Ahah! That thread's so sad vs. this thread's so exciting :)

The Windows installer has Lorem Ipsum instead of EULA text... https://twitter.com/stijnsanders/status/987042633691394050

Also, Windows Defender SmartScreen complains about unknown publisher when trying to execute the installer.

I'm very interested in hearing more about what running FoundationDB in production is like.

I believe that FoundationDB stores rows in lexicographical order by key. Other databases like Cassandra strongly push you toward not storing data this way as it can easily lead to hotspots in the cluster. How do you deploy a FoundationDB cluster without leading to hotspots, or perhaps what operational actions are available to rebalance data?

Does Cassandra hash the primary key to get a more even distribution?

If you have a sorted data store, you can get the same distribution by keying off a hash of the "real" primary key, right?

Cassandra allows you to configure a "partitioner" that determines which nodes a primary key belongs to. There is a ByteOrderedPartitioner that stores partitions in order lexicographically by primary key. There is also a Murmur3-hash-based partitioner (which is the recommended default).

Cassandra allows you to store multiple records in sorted order within a partition. The normal recommended way to get data locality is to store records that are frequently accessed together in the same partition.
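The hash-of-the-"real"-primary-key trick mentioned above looks roughly like this (the 2-byte prefix length and key layout are arbitrary choices for illustration):

```python
import hashlib

# Prefix each key with a short hash of itself, so sequential "real" keys
# land far apart in the sorted keyspace instead of hammering one range.
def distributed_key(key: bytes) -> bytes:
    return hashlib.sha256(key).digest()[:2] + key

print(distributed_key(b"user:0001").hex())
print(distributed_key(b"user:0002").hex())
```

The trade-off is that you give up range scans in original-key order, so this is only worth it for keys you never need to scan sequentially.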

Firstly: Wow! This is amazing news!!!

I'm also kind of confused... is the single repo complete?

What about the SQL Layer [0]? Where is all this stuff in the new GH repo?

Or is only the KV part open source?

Looking forward to some CockroachDB vs. FDB benchmark showdowns :)

[0] https://github.com/jaytaylor/sql-layer

also super interested in CockroachDB, but I just can't find enough war stories, or stories of people using it in production...

I am not a database expert by any means but have been curious about distributed data systems and had not heard of FoundationDB till now, and was very excited to read about it. On reading through the documentation, I encountered a section on "Known Limitations"[1] which stated that keys cannot be larger than 10 kB and values cannot be larger than 100 kB. This seems to be a major limitation. Am I missing something, or is this strictly for storing text?

[1] https://apple.github.io/foundationdb/known-limitations.html#...

Because the data model is ordered, large blobs can and normally should be mapped to a bunch of adjacent keys and read with a range read, not a single huge value. That also allows you to read or write just part of one efficiently.
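A minimal sketch of that chunking scheme in Python (pure data manipulation; the key-naming convention is just an assumption for illustration):

```python
CHUNK = 100_000  # stay under the ~100 kB value limit

def blob_to_kv(name: bytes, data: bytes, chunk: int = CHUNK):
    """Split a large blob into (key, value) pairs under a common
    prefix; because the chunk index is zero-padded, a range read
    over the prefix returns the chunks in lexicographic order."""
    return [(name + b"/%010d" % i, data[off:off + chunk])
            for i, off in enumerate(range(0, len(data), chunk))]

def kv_to_blob(pairs):
    """Reassemble; pairs come back key-sorted from a range read."""
    return b"".join(value for _key, value in sorted(pairs))

pairs = blob_to_kv(b"blobs/report.bin", b"x" * 250_000)
```

Reading or writing a single chunk then only touches one small key-value pair instead of the whole blob.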

For those asking about use in production, Wavefront just posted this:

"Wavefront by VMware’s Ongoing Commitment to the FoundationDB Open Source Project"


"because it is an ordered key-value store, FoundationDB can use range reads to efficiently scan large swaths of data"


I wonder how it compares to MUMPS databases like InterSystems Caché and FIS GT.M?

Caché is oriented more towards vertical scaling on big monolithic servers, whereas FoundationDB is oriented towards horizontal scaling.

Also, the obvious difference is that Caché is closed source and prohibitively expensive.

FIS GT.M is only FOSS on selected platforms (Linux x86, OpenVMS Alpha), and proprietary on all other platforms.

Oh, Cache...

Worked a job once where that was the underlying data store.

I was only allowed to touch the SQL interface to it, which was....weird.

The SQL dialect was ancient, felt like something from about 1990 (and this was in.... 2012 or so, so not THAT long ago).

Query performance seemed invariant. A simple select * from foo where id=X and a monster programmatically generated join across 15 tables would both take about 1.5 seconds to return results.

It seems that MUMPS doesn't have serializable isolation.

Yeah, when I read about that I thought it sounded neat—make sure the index updates with the data. On the other hand, I think of the CAP theorem as an iron triangle. If you are gaining consistency, what's the trade-off?

This _is_ the usual trade off, but what makes FoundationDB so crazy is that it's a CP system that has a performance profile that AP systems would have a hard time matching.

It's neat, although there is no SQL front end.

Bloomberg’s comdb2 was open sourced recently https://github.com/bloomberg/comdb2 - it seems similar, but would be interesting to see comparison.

comdb2 is not a big-data DB per se. It compares more to MySQL than FoundationDB https://blog.dripstat.com/first-look-at-bloombergs-amazing-c...

Great write up!

comdb2 has no sharding?

Wow, this is some really exciting news! I think it would be amazing to create a GraphQL API for FoundationDB, so I have created a feature request for this in the Prisma repo. For those who don't know, Prisma is a GraphQL database mapper for various databases.


How's this compare to https://github.com/pingcap/tikv? It's a relatively new distributed KV store written in Rust that is also transactional and backs the new TiDB database.

Here's a brief comparison by a tikv developer: https://github.com/pingcap/tikv/issues/2986#issuecomment-383...

Well, at first guess I would assume that FoundationDB is more mature than TiKV, although I have never used either. TiKV does look cool, though.

Agreed, technology-wise FDB and TiDB look almost the same.

Looks promising, but does anyone know if FoundationDB has external events or triggers, similar to Firebase or RethinkDB? I can't seem to find much on it.

If not, then a lot of potential is being left on the table, because usage would require wrapping FoundationDB in a proxy or middleware of some kind to synthesize events, which can be extremely difficult to get right (due to race conditions, atomicity issues, etc.). Without events, apps can find themselves polling or rolling their own pub/sub metaphor over and over again. If anyone with sway is reading this, events are very high on the priority list for me, thanks!

It has the ability to watch keys so building a notification system on top of it is pretty easy. Really only limited by your imagination.

In addition to single-key asynchronous watches, there are also versionstamped ops (for maintaining your own, sophisticated log in a layer) and configurable key range transaction logging (but see the caveats in my other post on the topic).

I'm not sure it has every feature it will ever need in this area, but it's a pretty good starting point for building "reactive" stuff.

Your response is both very exciting and slightly intimidating! Would love to see a “NewSQL” and/or event store built on top of FDB tech. Would the key ranges and versioned ops be capable of providing/emulating a performant “atomic broadcast” similar to that in Kafka?

Does the trigger execute in the scope of the triggering transaction, with the same isolation level?

Versionstamped operations and transaction logging are fully transactional. Watches are asynchronous: they are used to optimize a polling loop that would "work" without them.
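To make that "optimize a polling loop" point concrete, here is a toy illustration of the pattern (this is not FDB's API; a threading.Event stands in for a key watch). The notification just wakes up a reader that could otherwise re-read on a timer:

```python
import threading

class WatchedKey:
    """Toy model of watch-optimized polling: readers block on a
    change notification instead of re-reading in a tight loop."""

    def __init__(self, value=None):
        self._value = value
        self._changed = threading.Event()

    def set(self, value):
        self._value = value
        self._changed.set()  # fire the "watch"

    def wait_for_change(self, timeout=None):
        # Falls back to a timeout if no event fires, so the loop
        # still "works" (as a poll) without the notification.
        fired = self._changed.wait(timeout)
        self._changed.clear()
        return self._value, fired

key = WatchedKey(b"old")
result = []
reader = threading.Thread(
    target=lambda: result.append(key.wait_for_change(timeout=5)))
reader.start()
key.set(b"new")
reader.join()
```

Because the watch is asynchronous, correctness never depends on it firing promptly; it only reduces latency and wasted reads compared with polling on the timeout alone.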

Would versionstamped operations fit for the log abstraction modeling question I've asked about here on the forums?



May you do good and not evil. May you find forgiveness for yourself and forgive others. May you share freely, never taking more than you give.

— from the blessing notes in their source code.

That's from sqlite. (Awesome tech, awesome license).

Related snippet from the "Distinctive Features Of SQLite" page[1] from the sqlite project:

The source code files for other SQL database engines typically begin with a comment describing your legal rights to view and copy that file. The SQLite source code contains no license since it is not governed by copyright. Instead of a license, the SQLite source code offers a blessing:

May you do good and not evil. May you find forgiveness for yourself and forgive others. May you share freely, never taking more than you give.

[1] https://sqlite.org/different.html
