Hacker News
Comdb2 – Bloomberg's distributed RDBMS under Apache 2 (github.com)
240 points by Callicles on June 10, 2017 | 65 comments

Reading through their VLDB paper [1], Comdb2 appears to be a moderately scalable (up to dozens of nodes?) RDBMS with a strong emphasis on consistency and availability. Benchmarks show numbers comparable to Percona XtraDB (MySQL with a different storage engine), at ~2,000 writes/s and 2,000,000 reads/s against a six-node server cluster. High availability and global sequencing are provided by using GPS clocks, similar to Spanner/TrueTime.

Schema changes happen lazily, with old rows being rewritten on the next update, and a background job doing bulk rewriting.

Scalability: "While reads scale in a nearly linear manner as cluster size increases, writes have a more conservative level of scaling with an absolute limit. The offloading of portions of work (essentially all the WHERE clause predicate evaluations) across the cluster does help scale writes beyond what a single machine can do. Ultimately, this architecture does saturate on a single machine’s ability to process the low level bplogs."

This doesn't provide the horizontal scaling that Spanner does, CockroachDB aims at, or FoundationDB presumably has.

[1]: http://www.vldb.org/pvldb/vol9/p1377-scotti.pdf

<< or FoundationDB presumably has >>

You mean FaunaDB?

FoundationDB was the proto-CockroachDB (or rather, CockroachDB is essentially a re-attempt at building FoundationDB). It was an early attempt at building a NewSQL database. (NewSQL per se, i.e. not counting parallel databases from the pre-NoSQL age, like Gamma [1], Volcano [2] and Grace [3], which share many of the same design principles.)

FoundationDB was acqui-hired by Apple, but its failure is generally attributed to a poorly-performing SQL layer: https://www.voltdb.com/blog/2015/04/01/foundationdbs-lesson-...

[1] http://pages.cs.wisc.edu/~dewitt/includes/paralleldb/ieee90....

[2] https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b...

[3] https://pdfs.semanticscholar.org/a7f4/e4e6166dc683e7fa7d5b9e...

No, FoundationDB, which was posting some VERY impressive numbers before being acquired by Apple in March 2015.

Here they were doing 15M writes/s on 32 16-core servers, at a rate of 30,000 writes/s/core: http://web.archive.org/web/20150427041746/http://blog.founda...

FaunaDB managed 120,000 writes per second on 15 machines. https://fauna.com/blog/distributed-acid-transaction-performa...

(Yes, not equivalent benchmarks, but that's still a difference of roughly two orders of magnitude.)

Downvoted for asking a question? The system is definitely broken in such instances.

You worded the question in a way that suggested you thought the previous commenter was wrong. That reads like either being a FaunaDB fanboy or not taking the time to learn about FoundationDB, both of which will reliably attract downvotes. Had you worded it differently, I doubt you'd be getting downvoted.

So essentially the HN commenting UX is one of walking on eggshells.

I mean, a two-second Google could have shown you what FoundationDB was.

Some assorted interesting points:

* optimistic concurrency control (sometimes you need to retry, but often the optimism pays off)

* serializable transaction isolation (something like but not exactly like ssi, rather than 2pl)

* ieee 754-2008 decimal floats

* undo-based mvcc (writers don't block readers)

* group (network) sync replication

* paxos based failover

* lua stored procs
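Of those, the optimistic concurrency control point is worth a sketch. The pattern is: read a value along with its version, compute, and try to commit only if the version hasn't moved since the read, retrying on conflict. A toy illustration follows (the in-memory store and all the names are mine for illustration, not Comdb2's actual API):

```python
import threading

class OCCStore:
    """Toy versioned key-value store illustrating optimistic concurrency control."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}      # key -> value
        self._versions = {}  # key -> version number

    def read(self, key):
        with self._lock:
            return self._data.get(key), self._versions.get(key, 0)

    def commit(self, key, value, expected_version):
        # The commit succeeds only if nobody wrote the key since we read it.
        with self._lock:
            if self._versions.get(key, 0) != expected_version:
                return False  # conflict: caller must retry
            self._data[key] = value
            self._versions[key] = expected_version + 1
            return True

def add_delta(store, key, delta, max_retries=10):
    # Optimistic loop: read, compute, attempt commit; retry on conflict.
    for _ in range(max_retries):
        value, version = store.read(key)
        if store.commit(key, (value or 0) + delta, version):
            return True
    return False

store = OCCStore()
store.commit("balance", 100, 0)
add_delta(store, "balance", -30)
print(store.read("balance")[0])  # 70
```

Under low contention the retry loop almost never runs more than once, which is the "often the optimism pays off" part.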

Interesting technology and I'm very happy to see it open-sourced. Kudos to the team. (I used this when I worked there. Few firms can pull off something like this in-house; they could. You wouldn't believe how much data they store in this thing.)

You know, I never really understood the point of decimal floats. Why use floats at all if the point is to express beeps and dollar ppm? You might as well just pick a fraction size and use uint64_t, to my mind, or just some bigint type (which is still going to be faster than decimal floats).

The point is that it's hard to pick the fraction size. US national debt vs Italian lira/Swiss franc exchange rate: humans generate numbers of wide-ranging scales, and yet it's convenient for computers to deal in fixed-size datatypes. Hence floating scale. But yeah, it's a compromise. I expect these new types to catch on and be added to various language standards over the coming years, but we'll see.

Small parts, like screws, can be priced in microdollars for accounting purposes. At the same time the US GDP is on the order of $20 trillion. 64 bit is not large enough to fit both of these types of numbers.
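You can see the squeeze with Python's decimal module (which implements the same IEEE 754-2008 decimal arithmetic, with arbitrary precision); the dollar figures are just the round numbers from above:

```python
from decimal import Decimal

# Fixed-point at microdollar resolution: values stored as integer microdollars.
screw_price = 1                                   # $0.000001, one microdollar
us_gdp = 20_000_000_000_000 * 1_000_000           # $20 trillion in microdollars = 2e19

# 2e19 microdollars overflows a signed 64-bit integer (max ~9.22e18).
print(us_gdp > 2**63 - 1)  # True

# Decimal floating point holds both scales exactly in one type,
# because the exponent floats instead of being fixed.
print(Decimal("0.000001") + Decimal("20000000000000"))  # 20000000000000.000001
```

So with a fixed fraction size you're forced to choose between resolution at the bottom and range at the top; the floating decimal exponent sidesteps the choice.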

Makes using SQLite a real pain...

This is really cool, but I'm curious why Bloomberg would need this. I.e., what special needs does Bloomberg have that would lead a primarily non-engineering company to invest the resources to create this? Was there nothing off the shelf that would have fit their needs?

I don't mean that in a derogatory way, I'm just curious what motivated making this.

"Non-engineering company" -- ha!

But to answer your question: Mike Bloomberg's autobiography talks about this. When they started in the 80's there wasn't as much great off-the-shelf software as there is today.

Their customers were (and always have been) insanely demanding when it comes to reliability and speed. The last thing Mike wanted to do was be caught sheepishly explaining to the CIO of Merrill Lynch, "Well, gee, our Oracle database has this bug they can't fix for the next two weeks..." or even "...this optimization they can't make for the next six months"

This is a company that invented its own layer-2 network protocols 35 years ago just to squeeze every last drop of performance and reliability out of the hardware. Of course they wrote their own database (actually two -- there was a comdb 1, of course, too).

Just for some scale, we have 4k+ engineering employees, and if we were a public company, we'd be somewhere around the 5th largest software company in the world by revenue. (Strange stat, but our primary product is software and accounts for over 3/4 of revenue.)

Looking forward to seeing a fastsend / PRCCOM paper one day.

PRCCOM is no more! Don't need to keep bringing that one up :)

Oh, in that case I hope you guys see fit to write it up one day. It's a legitimately interesting and clever solution, even if obsolete.

I definitely wouldn't call Bloomberg a "non-engineering" company if you've ever worked with any of their trading / market data products. Pretty much true of any trading side finance company nowadays.

Bloomberg started as a technology company. Sure, they have expanded into media, but I'm pretty sure they are still first and foremost a provider of software (plus hardware) and data.

I would bet that the Terminals are responsible for something like 140% of their profits, and that the media empire is a huge loss leader / vanity project.

The whole point of Reuters news is to inform traders about things that will lead to more trades being done over the Reuters platform, so it's entirely likely Bloomberg operates the same model.

So. We did a few things better than the industry to this day. HA is a spectrum. Transparent masking of all failures in any state of an arbitrary SQL transaction is on the far right side of that spectrum.

To directly answer your question: go look at where 'replication' was in RDBMSes 14 years ago. PostgreSQL? MySQL?

14 years ago Sybase had strong replication. Sybase was Wall Street's darling for a long time, starting in the late 80's.

>>what special needs does Bloomberg have that would lead to a primarily non-engineering company investing the resources to create this

They probably open-sourced the database -- and would do more -- to dispel the above notion, and thereby to attract more qualified recruits.

Financial services companies, and the technology suppliers around them, came up with much of the 'modern day' tech, sometimes decades before the Googles and Facebooks of the world:

-- NoSQL with stored procs (e.g. Goldman Sachs, early to mid 90s)

-- in-house programming languages (APL-based, at Morgan Stanley) emphasizing vector-based operations

-- smart contracts, where a contract is represented as an algorithm specified in a domain-specific language. This was way before Ethereum; started, I believe, at Credit Suisse (but not sure)

-- one of the fastest time series databases (kdb+)

-- data science and machine learning (modeling risk and valuations)

Today's hedge funds manage petabytes of data. So do many of the big investment banks...


>> We had several goals in mind. One was being wire-format compatible with an older internal product to allow developers to migrate applications easier. Another was to have a database that could be updated anywhere and stay in sync everywhere. The first goal could only be satisfied in-house. The second was either in its infancy for open source products, or available at high cost from commercial sources.

I've worked at midsize "non-engineering" companies that do boring stuff like logistics where they wrote their own RDBMS and even virtual machines for legacy hardware.

It's not that uncommon especially if you've been around a long time.

Bloomberg was doing cloud computing decades before the term was even invented.

Edit: As someone else points out, they were doing cloud computing before the internet existed.

People forget that Bloomberg had computer networks before the Internet.

I remember when we didn't even have comdb2.

Bloomberg is a massive tech company.

If even advertising companies are engineering shops, what makes you think fintech companies aren't?

Right, it's like everyone's forgot what Google's business model actually is!

The work started 12 years ago when there were probably no `NewSQL` [1] solutions out there.


Bloomberg, after all, is a high-tech news and information delivery network.

>Was there nothing off the shelf that would have fit their needs?

I can't think of any solid distributed RDBMS that would have been around in 2004. Does anyone with more knowledge have any idea (open or not)?

IBM DB2 and Oracle 9i were around for sure. Other solutions as well. DB landscape was well-established back then.

They were, but ridiculously expensive at the time: $100k+ to have a non-replicated version of Oracle running on a multicore server. Dealing with Oracle was like dealing with the mob.

Indeed, but that's a whole other issue.

DEC RDB might have done it, if it was still available.

What's the motivation for Bloomberg to open-source this?

Probably the same motivation Yahoo had to release Hadoop, FB to release Hive, Netflix to release so much of their libs and so on and so forth:

- if nothing else, it does no harm (no 'secret sauce' competitors could benefit from)

- it buys karma (think recruiting goodwill)

If the project catches on though then there are many advantages:

- it can spark a self-sustained ecosystem that can further drive the product, at much lower cost for original creator (think Hadoop leading to Cloudera, Hortonworks etc). Product improves, bugs are fixed, toolset matures

- newhires come with know-how to use your internal tools, lower ramp up, better productivity. Anecdotal, but when I was at Microsoft no newhire knew how to use the internal Cosmos stuff, and even among old timers more folk were familiar with Hadoop...

I see Comdb2 requires SQLite to install. So, I'm guessing Comdb2 is a distributed storage engine for SQLite, or?

There's a heavily modified SQLite embedded in Comdb2 for query parsing/planning. There are some auxiliary tools that can optionally use SQLite (or Comdb2). It'll run without installing an SQLite package.

I believe originally it wasn't SQL, and they added SQL by using the SQLite parsing engine. They are a massive contributor to the SQLite project. It is in no way supposed to be compatible with SQLite.

I think you're mixing 2 things. This is a complete rewrite of an old key-value store (hence the 2 in the name) but comdb2 was always SQL. Comdb2 shares a few things with comdb, but they're really just to make migration easier (the preference for tags in csc2 files instead of usual DDL, and then a tag based API that it looks like we got rid of for this release). Under the covers comdb2 is completely different and as far as I know shares no code with comdb.

It wasn't always an SQL system. SQL was added a year or two in its development (with lots of databases already live).

Yep. With a false start or two along the way. One of the most interesting things on this project was repeatedly changing engines while the plane was in the air.

We've had databases literally up for 10+ years, through 5-6 major version upgrades. There's been one file format change that required downtime very early on.

So, you can run Comdb2 without SQLite?

It uses components of SQLite, included in-tree.

Somewhat annoying you have to copy data to nodes manually:

    copycomdb2 mptest-1.comdb2.example.com:${HOME}/db/testdb.lrl

This sort of stuff is why I loved RethinkDB. They handled all these complexity details for you.

I'm not too offended by letting the administrator figure out the best way to do an uncommon operation. As a point of comparison, Cassandra does have a proper way to bootstrap new nodes but I've found that in many cases it's better to short circuit it and rsync the initial data myself (and use its repair functionality to clean up the mess).

Some reasons include throttling load on the "old" servers, better feedback on progress, the ability to pause/resume, or even being able to do it faster than the DBMS can e.g. by snapshotting the disk on the source machine and making a CoW clone of it. Heck, if you're running your own hardware and feeling a little reckless, pull out one of the drives from the source machine's RAID mirror and you've already got a full clone right there.

I guess you could build all of that into the DBMS, but it's a rather specialised manual operation that's not happening all that often, and it's one of the cases where the administrator almost certainly does know better.
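For a concrete flavor, a throttled, resumable manual bootstrap might look something like this (hostname and paths are made up; the flags are standard rsync):

```shell
# Copy the data directory from an existing node at a capped rate
# (--bwlimit is in KiB/s, so ~40 MB/s here) so the source node keeps
# serving traffic; --partial makes an interrupted copy resumable.
rsync -a --bwlimit=40000 --partial --progress \
    dbnode1.example.com:/var/lib/mydb/data/ /var/lib/mydb/data/

# Pause: Ctrl-C. Resume: re-run the same command; completed files are
# skipped and partially transferred files are continued.
```

That covers the throttling, progress feedback, and pause/resume points above; the snapshot/CoW-clone approach is the same idea one layer down, at the filesystem or block device.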

How does that work? As near as I can figure, you need to have all the sstable files from all nodes in a rack on disk. Most will be discarded on "nodetool cleanup", but I would expect it to have to rewrite all the files due to the new token range.

> you need to have all the sstable files from all nodes in a rack on disk

If you're not using vnodes, then you need all of the sstables from the previous $RF nodes in the ring. So with RF==3, it will briefly have about treble the amount that it will finally carry.

It's a lot of temporarily wasted disk space for sure, but now you're in full control of how you get the data there

This sounds like a terrible idea.

RethinkDB is still around (present tense). It has been re-licensed under Apache-2.0 and a community is building to move it forward.


Nice they have a JDBC driver... makes it a lot easier to hook into. After looking at all the C++ jobs on Bloomberg's website, makes sense the schema format looks C++ish.

They have a few people on the ANSI C++ process, including Bjarne. :)

I think Bjarne works for Morgan Stanley. Bloomberg has John Lakos, Alisdair Meredith, Dietmar Kuhn and others (although I'm not sure which of those are officially on the committee).

You are right, I mixed that up and did not check before posting.

I usually have to check http://stroustrup.com/ because I can't remember if he works for Goldman Sachs, JPMorgan or Morgan Stanley. And, in fact, his statement used to say that he works for the nice bank Morgan Stanley, not the ruthless JPMorgan.

*Dietmar Kühl

Thank you. I tried to double-check, but apparently got it wrong anyway.

I wonder how this compares to CockroachDB

Neat! It supports WSL! Even better! O_o
