Schema changes happen lazily, with old rows being rewritten on the next update, and a background job doing bulk rewriting.
Scalability: "While reads scale in a nearly linear manner as cluster size increases, writes have a more conservative level of scaling with an absolute limit. The offloading of portions of work (essentially all the WHERE clause predicate evaluations) across the cluster does help scale writes beyond what a single machine can do. Ultimately, this architecture does saturate on a single machine’s ability to process the low level bplogs."
This doesn't provide the horizontal scaling that Spanner does, CockroachDB aims at, or FoundationDB presumably has.
You mean FaunaDB?
FoundationDB was acqui-hired by Apple, but its failure is generally attributed to a poorly-performing SQL layer: https://www.voltdb.com/blog/2015/04/01/foundationdbs-lesson-...
Here they were doing 15M writes/s on 32 16-core servers, at a rate of 30,000 writes/s/core: http://web.archive.org/web/20150427041746/http://blog.founda...
FaunaDB managed 120,000 writes per second on 15 machines. https://fauna.com/blog/distributed-acid-transaction-performa...
(Yes, these aren't equivalent benchmarks, but that's still roughly a 50x difference.)
* optimistic concurrency control (sometimes you need to retry, but often the optimism pays off)
* serializable transaction isolation (something like but not exactly like ssi, rather than 2pl)
* ieee 754-2008 decimal floats
* undo-based mvcc (writers don't block readers)
* group (network) sync replication
* paxos based failover
* lua stored procs
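To illustrate the first bullet, here's a minimal sketch of optimistic concurrency control with retry, against a toy versioned store. All names here are illustrative; this is not the Comdb2 API, just the general read/compute/conditional-commit pattern:

```python
# Toy versioned key-value store illustrating optimistic concurrency
# control (OCC): read a version, compute, and commit only if nobody
# else committed in between; on conflict, re-read and retry.

class Conflict(Exception):
    """Raised when another writer committed first."""

class Store:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def write(self, key, expected_version, value):
        version, _ = self.data.get(key, (0, None))
        if version != expected_version:
            raise Conflict  # lost the race; caller should retry
        self.data[key] = (version + 1, value)

def increment(store, key, delta, max_retries=5):
    # The optimistic loop: usually succeeds on the first try,
    # which is where the "optimism pays off".
    for _ in range(max_retries):
        version, value = store.read(key)
        try:
            store.write(key, version, (value or 0) + delta)
            return True
        except Conflict:
            continue
    return False

store = Store()
increment(store, "balance", 10)
increment(store, "balance", 5)
print(store.read("balance"))  # (2, 15)
```

The contrast with 2PL in the second bullet: here writers never hold locks while computing, so readers are never blocked; the cost is the occasional retry.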
Interesting technology and I'm very happy to see it open-sourced. Kudos to the team.
(I used this when I worked there. Few firms can pull off something like this in-house; they could. You wouldn't believe how much data they store in this thing.)
Makes using SQLite a real pain...
I don't mean that in a derogatory way; I'm just curious what motivated making this.
But to answer your question: Mike Bloomberg's autobiography talks about this. When they started in the '80s, there wasn't as much great off-the-shelf software as there is today.
Their customers were (and always have been) insanely demanding when it comes to reliability and speed. The last thing Mike wanted to do was be caught sheepishly explaining to the CIO of Merrill Lynch, "Well, gee, our Oracle database has this bug they can't fix for the next two weeks..." or even "...this optimization they can't make for the next six months"
This is a company that invented its own layer-2 network protocols 35 years ago just to squeeze every last drop of performance and reliability out of the hardware. Of course they wrote their own database (actually two -- there was a Comdb 1, of course, too).
To directly answer your question: go look at where replication was in RDBMSs 14 years ago. PostgreSQL? MySQL?
They probably open-sourced the database -- and would do more -- to dispel the above notion and thereby attract more qualified recruits.
Financial services companies, and the technology suppliers around them, came up with much of the 'modern day' tech, sometimes decade(s) before the Googles and Facebooks of the world:
-- NoSQL with stored procs (e.g. Goldman S, early-to-mid '90s)
-- in-house programming languages (APL-based, at Morgan Stanley) emphasizing vector-based operations
-- smart contracts (where a contract is represented as an algorithm specified in a domain-specific language); this was way before Ethereum -- started, I believe, at Credit Suisse (but not sure)
-- one of the fastest time-series databases (kdb+)
-- Data science and machine learning (modeling risk and valuations)
>> We had several goals in mind. One was being wire-format compatible with an older internal product to allow developers to migrate applications easier. Another was to have a database that could be updated anywhere and stay in sync everywhere. The first goal could only be satisfied in-house. The second was either in its infancy for open source products, or available at high cost from commercial sources.
It's not that uncommon especially if you've been around a long time.
Edit: As someone else points out, they were doing cloud computing before the internet existed.
I remember when we didn't even have comdb2.
I can't think of any solid distributed RDBMS that would have been around in 2004. Does anyone with more knowledge have any idea (open or not)?
- if nothing else, it does no harm (no 'secret sauce' competitors could benefit from)
- it buys karma (think recruiting goodwill)
If the project catches on though then there are many advantages:
- it can spark a self-sustained ecosystem that can further drive the product, at much lower cost for original creator (think Hadoop leading to Cloudera, Hortonworks etc). Product improves, bugs are fixed, toolset matures
- new hires come with the know-how to use your internal tools: lower ramp-up, better productivity. Anecdotal, but when I was at Microsoft no new hire knew how to use the internal Cosmos stuff, and even among old-timers more folks were familiar with Hadoop...
Some reasons include throttling load on the "old" servers, better feedback on progress, the ability to pause/resume, or even being able to do it faster than the DBMS can e.g. by snapshotting the disk on the source machine and making a CoW clone of it. Heck, if you're running your own hardware and feeling a little reckless, pull out one of the drives from the source machine's RAID mirror and you've already got a full clone right there.
I guess you could build all of that into the DBMS, but it's a rather specialised manual operation that doesn't happen all that often, and it's one of the cases where the administrator almost certainly does know better.
If you're not using vnodes, then you need all of the sstables from the previous $RF nodes in the ring. So with RF==3, it will briefly have about treble the amount that it will finally carry.
It's a lot of temporarily wasted disk space, for sure, but now you're in full control of how you get the data there.
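To make that arithmetic concrete, here's a toy estimate (assuming, for illustration, that each of the previous $RF nodes contributes roughly the new node's final share):

```python
# Toy estimate of peak vs. final disk usage when seeding a new node
# from the sstables of the previous $RF nodes (no vnodes): all RF
# copies land on disk first, then compaction shrinks them to the
# node's final share.
def bootstrap_disk_usage(final_share_gb, rf):
    peak = final_share_gb * rf  # every predecessor's sstables, pre-compaction
    return peak, final_share_gb

peak, final = bootstrap_disk_usage(100, rf=3)
print(peak, final)  # 300 100 -- briefly treble the final footprint
```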