
Comdb2 – Bloomberg's distributed RDBMS under Apache 2 - Callicles
https://github.com/bloomberg/comdb2
======
Scaevolus
Reading through their VLDB paper [1], Comdb2 appears to be a moderately
scalable (up to dozens of nodes (?)) RDBMS with a strong emphasis on
consistency and availability. Benchmarks show numbers comparable to Percona
XtraDB (MySQL with a different storage engine), at ~2,000 writes/s and
2,000,000 reads/s against a 6-node server cluster. High availability and
global sequencing are provided by using GPS clocks, similar to
Spanner/TrueTime.

Schema changes happen lazily, with old rows being rewritten on the next
update, and a background job doing bulk rewriting.
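
A minimal sketch of that lazy-rewrite idea, as I read it from the paper. This is a toy model: `SCHEMAS`, `Row`, `update` and `background_sweep` are hypothetical names for illustration, not anything from Comdb2 itself.

```python
from dataclasses import dataclass

# Hypothetical toy model; names and structure are illustrative, not Comdb2's.
SCHEMAS = {
    1: ["id", "name"],
    2: ["id", "name", "email"],  # version 2 adds a nullable column
}
CURRENT_VERSION = 2


@dataclass
class Row:
    version: int   # schema version the row was last written under
    values: dict


def upgrade(row: Row) -> Row:
    """Rewrite a row in the current schema, filling new columns with None."""
    values = {col: row.values.get(col) for col in SCHEMAS[CURRENT_VERSION]}
    return Row(CURRENT_VERSION, values)


def update(table: dict, key: int, **changes) -> None:
    """An ordinary update: the old row is converted as a side effect."""
    row = upgrade(table[key])
    row.values.update(changes)
    table[key] = row


def background_sweep(table: dict, batch: int) -> int:
    """Convert up to `batch` stale rows; return how many were rewritten."""
    stale = [k for k, r in table.items() if r.version != CURRENT_VERSION][:batch]
    for k in stale:
        table[k] = upgrade(table[k])
    return len(stale)


# Five rows written under schema version 1.
table = {i: Row(1, {"id": i, "name": f"user{i}"}) for i in range(5)}
update(table, 0, name="alice")            # row 0 is upgraded by the write
rewritten = background_sweep(table, 10)   # the sweep converts the remaining rows
```

The schema change itself touches no rows; the cost is spread across later writes and the sweep.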

Scalability: "While reads scale in a nearly linear manner as cluster size
increases, writes have a more conservative level of scaling with an absolute
limit. The offloading of portions of work (essentially all the WHERE clause
predicate evaluations) across the cluster does help scale writes beyond what a
single machine can do. Ultimately, this architecture does saturate on a single
machine’s ability to process the low level bplogs."

This doesn't provide the horizontal scaling that Spanner does, CockroachDB
aims at, or FoundationDB presumably has.

[1]: http://www.vldb.org/pvldb/vol9/p1377-scotti.pdf

~~~
idibidiartists
<< or FoundationDB presumably has >>

You mean FaunaDB?

~~~
idibidiartists
Downvoted for asking a question? The system is definitely broken in such
instances.

~~~
acdha
You worded the question in a way that suggested you thought the previous commenter
was wrong. That reads like either being a FaunaDB fanboy or not taking time to
learn about Foundation, both of which will reliably attract downvotes. Had you
worded it differently I doubt you'd be getting downvoted.

~~~
idibidiartists
So essentially the HN commenting UX is one of walking on egg shells.

~~~
saghm
I mean, a two-second Google could have shown you what FoundationDB was.

------
macdice
Some assorted interesting points:

* optimistic concurrency control (sometimes you need to retry, but often the optimism pays off)

* serializable transaction isolation (something like but not exactly like SSI, rather than 2PL)

* IEEE 754-2008 decimal floats

* undo-based MVCC (writers don't block readers)

* group (network) sync replication

* Paxos-based failover

* Lua stored procs

Interesting technology and I'm very happy to see it open-sourced. Kudos to the
team. (I used this when I worked there. Few firms can pull off something like
this in-house; they could. You wouldn't believe how much data they store in
this thing.)
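
For anyone unfamiliar with the optimistic concurrency point in the list above, here is a minimal sketch of the general retry pattern. It is a generic version-check scheme on a toy in-memory store, not Comdb2's actual protocol; `VersionedStore`, `Conflict` and `transact` are made-up names.

```python
class Conflict(Exception):
    """Raised when a commit sees a version other than the one it read."""


class VersionedStore:
    """Toy in-memory store: each key maps to (version, value)."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        return self._data.get(key, (0, None))

    def commit(self, key, expected_version, value):
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            raise Conflict(key)          # someone else wrote in between
        self._data[key] = (version + 1, value)


def transact(store, key, fn, max_retries=5):
    """Read, compute, try to commit; on conflict re-read and try again.

    Returns the number of retries that were needed.
    """
    for attempt in range(max_retries):
        version, value = store.read(key)
        try:
            store.commit(key, version, fn(value))
            return attempt
        except Conflict:
            continue
    raise Conflict(key)


store = VersionedStore()
store.commit("balance", 0, 100)

interfered = []

def add_ten(balance):
    # Simulate a concurrent writer sneaking in once, between read and commit.
    if not interfered:
        interfered.append(True)
        store.commit("balance", 1, 500)
    return balance + 10

retries = transact(store, "balance", add_ten)
```

No locks are taken while `fn` runs; the price is that the work is thrown away and redone when the version check fails.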

~~~
microcolonel
You know, I never really understood the point of decimal floats. Why use
floats at all if the point is to express beeps and dollar ppm? You might as
well just pick a fraction size and use uint64_t to my mind, or just some
bigint type (which is still going to be faster than decimal floats).

~~~
macdice
The point is that it's hard to pick the fraction size. US national debt vs
Italian lira/Swiss franc exchange rate: humans generate numbers of wide-ranging
scales and yet it's convenient for computers to deal in fixed-size datatypes.
Hence floating scale. But yeah, it's a compromise. I expect these new types to
catch on and be added to various language standards over the coming years, but
we'll see.
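
A small sketch of the trade-off being discussed. Python's `decimal` module is used here as a stand-in; it is not a fixed-width IEEE 754-2008 type, and the scale of `10**4` and the sample figures are arbitrary choices for illustration.

```python
from decimal import Decimal, getcontext

# Binary floats cannot represent most decimal fractions exactly.
binary_ok = (0.1 + 0.2 == 0.3)
decimal_ok = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

# Fixed-point alternative: an integer count of 1e-4 units.
# The scale is a hypothetical choice, which is exactly the problem.
SCALE = 10**4

def to_fixed(text):
    """Convert a decimal string to an integer number of 1/SCALE units."""
    return int(Decimal(text) * SCALE)

debt = to_fixed("20000000000000.00")   # a large amount fits comfortably
tiny_rate = to_fixed("0.00000057")     # a small rate is truncated away

# A decimal float moves the scale per value, so both ends stay usable.
getcontext().prec = 16                 # decimal64 carries 16 significant digits
product = Decimal("20000000000000.00") * Decimal("0.00000057")
```

With a fixed scale the small rate collapses to zero, while the floating scale keeps it at the cost of more expensive arithmetic.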

------
NickGerleman
This is really cool but I'm curious why Bloomberg would need this. I.e., what
special needs does Bloomberg have that would lead to a primarily non-
engineering company investing the resources to create this. Was there nothing
off the shelf that would have fit their needs?

I don't mean that in a derogatory way, I'm just curious what motivated making
this.

~~~
nemothekid
>Was there nothing off the shelf that would have fit their needs?

I can't think of any solid distributed RDBMS that would have been around in
2004. Does anyone with more knowledge have any idea (open or not)?

~~~
Keyframe
IBM DB2 and Oracle 9i were around for sure. Other solutions as well. DB
landscape was well-established back then.

~~~
dsparkman
They were, but ridiculously expensive at the time. $100k+ to have
a non-replicated version of Oracle running on a multicore server. Dealing with
Oracle was like dealing with the mob.

~~~
Keyframe
Indeed, but that's a whole other issue.

------
boxfish
What's the motivation for Bloomberg to open-source this?

~~~
rusanu
Probably the same motivation Yahoo had to release Hadoop, FB to release Hive,
Netflix to release so much of their libs and so on and so forth:

- if nothing else, it does no harm (no 'secret sauce' competitors could
benefit from)

- it buys karma (think recruiting goodwill)

If the project catches on though then there are many advantages:

- it can spark a self-sustained ecosystem that can further drive the product,
at much lower cost for the original creator (think Hadoop leading to Cloudera,
Hortonworks etc). Product improves, bugs are fixed, toolset matures

- new hires come with know-how to use your internal tools, lower ramp up,
better productivity. Anecdotal, but when I was at Microsoft no new hire knew
how to use the internal Cosmos stuff, and even among old timers more folk were
familiar with Hadoop...

------
ihenriksen
I see Comdb2 requires SQLite to install. So, I'm guessing Comdb2 is a
distributed storage engine for SQLite, or?

~~~
kanwisher
I believe originally it wasn't SQL, and they added SQL by using the SQLite
parsing engine. They are a massive contributor to the SQLite project.
It is in no way supposed to be compatible with SQLite.

~~~
czinck
I think you're mixing 2 things. This is a complete rewrite of an old key-value
store (hence the 2 in the name) but comdb2 was always SQL. Comdb2 shares a few
things with comdb, but they're really just to make migration easier (the
preference for tags in csc2 files instead of usual DDL, and then a tag based
API that it looks like we got rid of for this release). Under the covers
comdb2 is completely different and as far as I know shares no code with comdb.

~~~
toosleepy
It wasn't always an SQL system. SQL was added a year or two into its development
(with lots of databases already live).

~~~
alexjscotti
Yep. With a false start or two along the way. One of the most interesting
things on this project was repeatedly changing engines while the plane was in
the air.

~~~
mponomar
We've had databases literally up for 10+ years, through 5-6 major version
upgrades. There's been one file format change that required downtime very
early on.

------
nodesocket
Somewhat annoying you have to copy data to nodes manually:

    
    
        copycomdb2 mptest-1.comdb2.example.com:${HOME}/db/testdb.lrl
    

This sort of stuff is why I loved RethinkDB. They handled all these complex
details for you.

~~~
ketralnis
I'm not too offended by letting the administrator figure out the best way to
do an uncommon operation. As a point of comparison, Cassandra does have a
proper way to bootstrap new nodes but I've found that in many cases it's
better to short circuit it and rsync the initial data myself (and use its
repair functionality to clean up the mess).

Some reasons include throttling load on the "old" servers, better feedback on
progress, the ability to pause/resume, or even being able to do it faster than
the DBMS can e.g. by snapshotting the disk on the source machine and making a
CoW clone of it. Heck, if you're running your own hardware and feeling a
little reckless, pull out one of the drives from the source machine's RAID
mirror and you've already got a full clone right there.

I guess you could build all of that into the DBMS, but it's a rather
specialised manual operation that's not happening all that often and it's one
of the cases that the administrator almost certainly does know better

~~~
coredog64
How does that work? As near as I can figure, you need to have all the sstable
files from all nodes in a rack on disk. Most will be discarded on "nodetool
cleanup", but I would expect it to have to rewrite all the files due to the
new token range.

~~~
ketralnis
> you need to have all the sstable files from all nodes in a rack on disk

If you're not using vnodes, then you need all of the sstables from the
previous $RF nodes in the ring. So with RF==3, it will briefly have about
treble the amount that it will finally carry.

It's a lot of temporarily wasted disk space for sure, but now you're in full
control of how you get the data there

------
gleenn
Nice they have a JDBC driver... makes it a lot easier to hook into. After
looking at all the C++ jobs on Bloomberg's website, makes sense the schema
format looks C++ish.

~~~
pjmlp
They have a few people on the ANSI C++ process, including Bjarne. :)

~~~
maxlybbert
I think Bjarne works for Morgan Stanley. Bloomberg has John Lakos, Alisdair
Meredith, Dietmar Kühl and others (although I'm not sure which of those are
officially on the committee).

~~~
pjmlp
You are right, I mixed that up and did not check before posting.

~~~
maxlybbert
I usually have to check [http://stroustrup.com/](http://stroustrup.com/)
because I can't remember if he works for Goldman Sachs, JPMorgan or Morgan
Stanley. And, in fact, his statement used to say that it was the nice bank
Morgan Stanley, not the ruthless JPMorgan.

------
shusson
I wonder how this compares to CockroachDB

------
brian_herman
Neat! It supports WSL! Even better! O_o

