
Building CockroachDB on top of RocksDB - bandwitch
https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/
======
tschellenbach
Our in-house DB at Stream also runs on top of RocksDB + Raft. It's amazing just
how much faster it is than anything else out there (especially compared to
Cassandra). Instagram uses RocksDB as the storage engine for Cassandra, and
LinkedIn and Pinterest use RocksDB as well. As soon as you have the time to
build your own DB using RocksDB, you get really fine-grained control over
performance.

[https://stackshare.io/stream/stream-and-go-news-feeds-for-over-300-million-end-users](https://stackshare.io/stream/stream-and-go-news-feeds-for-over-300-million-end-users)

~~~
stingraycharles
RocksDB is pretty good and we relied heavily on it at QuasarDB as well. Having
said that, we are nowadays deploying more and more production setups with
Levyx's Helium, which scales better and integrates directly with the hardware.

~~~
m0zg
Given that Helium appears to be proprietary, what kind of perf benefit are we
talking about here?

~~~
jandrewrogers
I haven't used Helium specifically, but 3-5x greater throughput would be
completely believable in my experience. It is an open secret that high-
performance closed source storage engines can have several times the
throughput of their open source equivalents on the same hardware. High-end
storage engines often have sufficient throughput to consistently saturate NVMe
arrays across diverse workloads, which is not something you commonly see in
open source. Consequently, it is common to see closed source storage engines
among people doing high-scale sensor analytics and similar work.

The source of this performance gap is architectural. The current design of
RocksDB precludes it from ever being legitimately high-performance in most
contexts, and most other open source storage engines use a similar design.
Modern high-performance storage engines also share a common architecture,
implemented with minor variations; you just don't see this architecture much
in open source. I realize that few software engineers have the skillset and
experience required to design a top-notch storage engine, but I am still
surprised by the dearth of open source examples given the large value in
closing this gap.

I rarely use open source storage engines in the systems I build for this
reason. The CapEx/OpEx implications of using them are far too costly at scale.
Fortunately, I have the approximately free option of using my own storage
engine implementations.

~~~
ryanworl
This technique is just more IO parallelism at the physical layer due to higher
concurrency while submitting IO, correct? Since NVMe and new SSDs don't hit
peak throughput until very high queue depths this doesn't surprise me.

~~~
jandrewrogers
I/O parallelism is necessary but far from sufficient. My own designs
arbitrarily allow 64 reads and 64 writes to be in-flight concurrently per
core. There is no science behind that limit beyond the fact it has worked
brilliantly for many years across every type of storage. But I/O parallelism
won't fix terrible scheduling.

A fast storage engine needs to eliminate most of the elements that will stall
an execution pipeline. This means doing things like almost completely
eliminating shared data structures and context switching. It also means
designing your own execution and I/O scheduler to greatly reduce the various
forms of stalling on memory that are ubiquitous in many designs. It is
difficult to overstate the extent to which thoughtful scheduler design can
improve throughput.

A state-of-the-art storage engine can drive 2+ GB/s _per core_, and schedule
things to keep the storage hardware performance close to theoretical while
smoothing out transients. It is very easy to run out of storage bandwidth in
my experience.
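The per-core in-flight cap described above can be sketched with a counting semaphore (a simplified, hypothetical illustration; the 64/64 numbers are the commenter's, everything else here is made up):

```python
import threading

# Hypothetical sketch of the per-core cap described above: at most 64
# reads and 64 writes in flight at once. A bounded semaphore per
# direction enforces the limit; a real engine would park the task in its
# own scheduler rather than block an OS thread.
class IOLimiter:
    def __init__(self, max_reads=64, max_writes=64):
        self.reads = threading.BoundedSemaphore(max_reads)
        self.writes = threading.BoundedSemaphore(max_writes)

    def submit_read(self, fn):
        with self.reads:   # blocks once 64 reads are already in flight
            return fn()

    def submit_write(self, fn):
        with self.writes:  # likewise for the 64-write budget
            return fn()

limiter = IOLimiter()
print(limiter.submit_read(lambda: "page-42"))  # page-42
```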

~~~
ccleve
I'd love to learn more. I'm in need of a fast engine.

What proprietary engines do you know of that I can look at?

Do you have more details on your own designs? Anything you can share?

~~~
ddorian43
See ScyllaDB. They're doing most of what the grandparent is talking about
(per-core sharding, async everything, disk/IO/network scheduler, DPDK, custom
filesystem, etc.).

------
peterwwillis
RocksDB is a fork of LevelDB, which was [in]famous for its ease of corrupting
data. Did Facebook ever do anything to ensure data wouldn't corrupt, or is
that still a common thing operationally? (You find it more at larger scales)

Here's an example of how data corruption can suck, with (example) Riak and
LevelDB. The leveldb data would corrupt often, which would leave you in a
predicament. Say you had 10 nodes with a 3 node replication factor, and the
whole cluster is humming away at a decent clip. Now one node's leveldb
corrupts, and you have to rebuild it. If you have a huge fuckoff dataset, this
can take a while. Now another node goes down. Now only 1 node has the data you
need, and 2 nodes are down - so now 8 nodes are doing the work of 10, and if
you have any more failures, your data might be gone. Now add replication,
which will suck performance and bandwidth away from the regular work. And
because it would corrupt so easily & often, there needed to be hash trees to
quickly identify what data was corrupt, and then you needed to fix it and
rebuild your hash trees. This would also suck away performance. Finally, you
can't just add new nodes while rebuilding, because the extra load makes the
cluster fall over. And the more nodes, the higher the likelihood of failures.

~~~
StreamBright
I never experienced this with several production Riak clusters running for
years. Can you explain how to reproduce it, or give a link to any public forum
where this was discussed?

~~~
peterwwillis
Sure. Build about 10 classes of clusters of varying sizes, each with a dataset
ranging from 100GB to a petabyte or more. Run them on shitty oversubscribed
openstack clusters with a combination of ephemeral, Ceph, and SAN disks. Do
replication to similar-ish clusters in different regions. Handle data for
about 100 different applications that process so much data at such low latency
that cloud-based databases aren't even an option. Keep adding nodes and
storage to existing clusters over time.

It turns out that really unstable hardware/networks like to expose bugs. It
also wasn't discussed in public forums. We paid for support and even employed
Riak developers, and still we hobbled on putting out fires. I'll bet other DBs
go through the same crap and keep it quiet.

Also, read the Riak documentation and you'll find the corruption recovery
documentation among other hints at common failures and limitations.

~~~
StreamBright
Thanks for confirming it wasn't a Riak issue. SAN disks, really? As an
architect I can tell you that SANs are almost always an antipattern for
building reliable and scalable distributed systems.

~~~
peterwwillis
I didn't know I was being interrogated about Riak failure modes. Ok, here's
more verbiage on Riak failures.

First off, SAN was one of three different disk storage solutions. When you
work for <BIG CORP>, the lowly product teams don't always get to pick and
choose what infrastructure is available. They have to do the best they can
with what they have, when they don't get what they ask for. If <BIG CORP> says
to use a shitty openstack cluster, that's what you have to deal with, and you
have to beg for all the ephemeral SSDs you can get. (Which then becomes a huge
pain in the ass when you need to scale storage and your choice is (A) buy more
machines and migrate nodes so you can upgrade ephemeral on the old machines or
(B) start swapping disks in running hypervisors and cry yourself to sleep, or
(C) use SAS or other array/volume on a SAN)

And thanks for blithely ignoring what I'm saying. Riak did have corruption
bugs that should have been preventable - as I said, a major source of the
problems _was LevelDB_, and _Riak's own documentation shows this to be
true_.

You could look sideways at these things and the db would corrupt. A node with
ephemeral storage, with no detectable errors on it whatsoever, would suddenly
stop working. We would go look at it, and it had a single leveldb file
corrupt... and nothing else wrong with it at all. Not only would it corrupt,
it wouldn't make any attempt to fix itself, even though there was a documented
fix.

Riak has anti-entropy intended to detect missing data and fix it, but it's the
Erlang equivalent of cron jobs and hash trees. The whole thing is designed to
just go "dum de dum, I wonder if anything's broken after $INTERVAL?" and then
perform some operation which, if the cluster is under load, may kick it over.
So they added throttling (throttling is everywhere in Riak; instead of
simply rejecting operations when it's unsafe, they'd rather make everything
go r e a l l y s l o w).

There was very little intelligence or event-driven programming for failure
detection and remediation. When the db corrupted in a way that wasn't handled
by anti-entropy, the node would just die, and we had to manually intervene
(later by writing automation to intervene) rather than it, you know, just
doing its own automation to fix the corruption. The AAE trees rebuild every
$INTERVAL and there's no way to change when or how they rebuild other than to
change the $INTERVAL, so there's no way to, for example, force them to rebuild
when it is convenient based on lulls in application use.

Then there's Riak search, which has the nice habit of taking down your cluster
due to god knows what (memory bloating, cpu starvation, unknown bugs in error
logs, etc). Don't use Riak search.

Replication was also a joke. Any network interruption (_hello, distributed
apps have network interruptions_) would kill replication. We would have to
detect replication had failed and queues were filling up, and re-start the
replication until the queues fell. But sometimes replication couldn't resume,
because 1 of 1,000 different potential failure modes was happening with
1 node in a remote cluster somewhere. So we had to resolve that node's issue
and get the whole remote cluster healed before the replication queues filled
up. If we didn't do that, we'd have to do a full-transfer to prevent potential
data loss, which would take _days_.

We developed auto-healing scripts to deal with most of these situations, and
the controls Riak added to slow down processing so it didn't kill the cluster
from all the competing operations it was trying to do at once (kv processing,
replication, hash regeneration, etc) were not enough for our automation to be
able to efficiently control the nodes when they were unhealthy. Riak would
just occasionally perform incredibly poorly, or nodes would die randomly, and
we'd get some unknown errors we couldn't diagnose. All our monitoring and
investigation showed nothing wrong with the host - no resource starvation, no
error messages, no spikes of client traffic. Riak was just having a bad day,
and we, being a very small team of not-Erlang-programmers, had to just restart
shit until it got better and research fixes once things improved. Our
postmortem incident queue was rather large.

This is a small sampling of production Riak issues. I'm not going to dig into
my brain for every bug they have, but suffice to say that a distributed
database should be able to recover from a single file corruption, and should
be able to prevent it from ever happening through various techniques that are
20+ years old. Their code is just lame, and proof that just because you write
something in Erlang doesn't mean it's going to be stable. And in Riak's
defense, the reason _why_ their code was lame was because they were a small
company trying to juggle a lot of demanding engineering issues from different
customers, and they didn't have much money or time. But lame code is still
lame code.

------
polskibus
I noticed that RocksDB is used very often in OLTP scenarios. What's the OLAP
equivalent of what RocksDB is to OLTP? Apache Parquet? Apache Arrow? What
would you use these days to create a high-performance OLAP (or hybrid
OLTP/OLAP) engine?

~~~
ryanworl
For analytics workloads, your best bet is using compression techniques that
let you operate on the data without decompressing it. A good example is
dictionary-encoding a set of sorted string keys so you can perform prefix
queries with a greater-than and less-than comparison on the integers
instead of examining every string entirely.
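A minimal sketch of that dictionary-encoding trick (the function names here are illustrative, not from any particular engine):

```python
import bisect

# Build a sorted dictionary of distinct strings; each string's code is
# its position in sorted order, so string order and integer order agree.
def build_dictionary(strings):
    ordered = sorted(set(strings))
    return ordered, {s: i for i, s in enumerate(ordered)}

# Return the half-open [lo, hi) code range of strings starting with `prefix`.
def prefix_range(ordered, prefix):
    lo = bisect.bisect_left(ordered, prefix)
    # Smallest string that sorts after everything carrying this prefix:
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(ordered, upper)
    return lo, hi

ordered, codes = build_dictionary(["apple", "apricot", "banana", "cherry"])
lo, hi = prefix_range(ordered, "ap")
# The scan below compares only integers, never re-examining the strings.
hits = [code for code in codes.values() if lo <= code < hi]
print(sorted(hits))  # [0, 1] -- the codes for "apple" and "apricot"
```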

Once you’ve encoded the data into large enough blocks, you could use any
storage engine and write the encoded blocks into it along with metadata for
managing which blocks are a part of what tables and partitions of tables.

You can also just use something like Parquet or ORC, but that’s not going to
get you the best performance possible.

~~~
polskibus
I know there are many techniques that, used together, give good performance
(optimal memory layout, compression, vectorization, etc.), however I'd
like to use a package that does a lot of it, the same as what RocksDB (or
SQLite) does for OLTP cases. Is there something like that? If not, what's out
there that gives the best foundation for building OLAP functionality on top of
it?

~~~
gianm
Check out Druid [1], an open-source analytical database with tightly-coupled
storage and processing engines designed for OLAP. In particular it implements
a memory-mappable storage format, indexes, compression, late tuple
materialization, and query engines that can operate directly on compressed
data. There is a patch out to add vectorized processing as well, so you should
expect to see that show up in a future release.

Its storage format and processing engine aren't designed to be embedded in the
same way as RocksDB and SQLite are, but you certainly could if you wanted to,
since the code is fairly modular. Or you could use it as a standalone service
as it was designed to be used.

[1] [http://druid.io/](http://druid.io/)

~~~
tristor
There's also Clickhouse [1] which seems to scale much better than Druid, and
has similar architectural decisions to make it somewhat general as a columnar
store for OLAP uses. Cloudflare wrote an article in the past where they
compared Clickhouse and Druid and they chose Clickhouse because they could get
similar performance on the same workload with 9 nodes in Clickhouse which
would require hundreds for Druid. They built all of the DNS analytics at
CloudFlare on Clickhouse [2].

Disclosure: I work at Percona, and we've seen a lot of our customers make use
of Clickhouse and have begun some of our own services work around it in
Consulting. It's now a primary database talked about at our conferences, and
we post about it regularly. [3]

[1]: [https://clickhouse.yandex/](https://clickhouse.yandex/)
[2]: [https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-queries-per-second/](https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-queries-per-second/)
[3]: [https://www.percona.com/blog/2018/10/01/clickhouse-two-years/](https://www.percona.com/blog/2018/10/01/clickhouse-two-years/)

~~~
neeleshs
There is a very good article[1] by one of the Druid committers about
Clickhouse/Druid/Pinot that goes into some detail on why the Cloudflare tests
turned out the way they did.

[1]: [https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7](https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7)

~~~
eurg
That article is better than expected, and it matches my own experience with CH
(it was a great match for our use case, and some of the reasons are in the
article; also, we would have used an inverted index had one been available;
surprisingly, we survived w/o it).

------
the_duke
Excellent article, very informative.

I just had to chuckle at this:

> Non-engineers: in a computer, a move is always implemented as a copy
> followed by a delete

Yeah, that's really gonna help a non-developer understand the article
better...

~~~
tyingq
It's confusing altogether. For example, that's not how /bin/mv (usually)
works.

~~~
ulysses
Usually, that's because /bin/mv is just renaming the directory entry for the
file, not moving the file's data. In cases where it's actually moving the data
-- say, across a filesystem boundary -- it does copy the file and then delete
the old version.
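The rename-vs-copy behavior described above can be sketched in a few lines (a simplified illustration; Python's shutil.move implements roughly this same fallback logic):

```python
import os
import shutil
import tempfile

# Within one filesystem a "move" is just a rename of the directory entry;
# across filesystems it degrades to copy-then-delete.
def move(src, dst):
    try:
        os.rename(src, dst)     # same filesystem: relink the entry, O(1)
    except OSError:             # e.g. EXDEV when crossing a fs boundary
        shutil.copy2(src, dst)  # copy the data and metadata...
        os.remove(src)          # ...then delete the original

# Quick demonstration in a temp directory.
d = tempfile.mkdtemp()
src, dst = os.path.join(d, "a.txt"), os.path.join(d, "b.txt")
with open(src, "w") as f:
    f.write("data")
move(src, dst)
print(os.path.exists(dst), os.path.exists(src))  # True False
```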

~~~
the_duke
I reckon it was meant in the context of compaction.

------
perfmode
When a SQL implementation is built on a KV storage engine, how do tables,
rows, and columns typically map to the underlying KV data model?

~~~
ryanworl
The CockroachDB blog post on this topic is a good summary. There is an
additional trick that isn't directly KV related, but is important in a
distributed environment when using a KV storage engine.

When defining a hierarchy of tables, such as customers -> orders ->
order_line_items, you can make the primary key of the child tables contain the
primary key of the parent table.

e.g. (customer_id) for the customers table, (customer_id, order_id) for the
orders table, then (customer_id, order_id, line_item_id) for the
order_line_items.

When this is stored on disk in a sorted format, it makes joins between these
extremely cheap because the data will all be next to each other on disk.

CockroachDB calls this "interleaved tables".
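A toy sketch of that key layout (the path-style tuples here are purely illustrative; real engines use an order-preserving binary encoding):

```python
# Child primary keys embed the parent's primary key, so plain sorted
# order clusters each customer with its orders and line items.
def customer_key(c):
    return ("customers", c)

def order_key(c, o):
    return ("customers", c, "orders", o)

def line_item_key(c, o, li):
    return ("customers", c, "orders", o, "items", li)

keys = [
    line_item_key(1, 10, 100),
    customer_key(2),
    order_key(1, 10),
    customer_key(1),
    order_key(2, 20),
]
# Sorted, all of customer 1's rows precede anything for customer 2, so a
# customers -> orders -> order_line_items join is one contiguous scan.
for k in sorted(keys):
    print(k)
```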

~~~
gigatexal
Interleaved tables are best for 1:1 relationships.

~~~
ryanworl
Hard disagree. This is the only way in a distributed, sorted KV store to get
any semblance of data locality. If Cockroach and Spanner didn't do this, they
would constantly be doing 2PC for modifying data that is related but stored on
different groups of machines.

~~~
gigatexal
We are evaluating it for production and the senior engineer we talked to from
the company told us that. I’d think he knows what he’s talking about.

~~~
ryanworl
I may be misinterpreting what you mean by a 1-1 relationship, but the
documentation for Cloud Spanner, Cockroach, and the old FoundationDB SQL layer
all use a similar schema to the one I described in their examples.
Additionally, the F1 paper describes using it in the same way in Figure 2.

~~~
gigatexal
He said tables with a parent-child relationship that is one-to-one, meaning,
if I understand that right, one parent ID mapping to one child ID. It's
entirely possible either I'm misinterpreting him or he's wrong, but that's
what he said.

~~~
wrigby
I think the data locality benefits would hold for one to many relationships
too (that is, one parent with many children).

------
dominotw
> If you surveyed most NewSQL databases today, most of them are built on top
> of an LSM, namely, RocksDB.

Is this actually true?

Spark, FoundationDB, MemSQL, NuoDB, Citus. I am not sure any of these are
built on top of RocksDB.

Which ones are actually built on lsm?

~~~
manigandham
None of those are newsql other than MemSQL, which is an OLAP system that uses
a custom rowstore format and parquet for columnstores.

In addition to CockroachDB there's also TiDB which runs on top of TiKV which
uses RocksDB.

~~~
dominotw
> None of those are newsql other than MemSQL

Why aren't citus, nuodb 'newsql'?

> there's also TiDB

One more example doesn't qualify the statement "most are built on rocksdb". I
wasn't saying there is only one newsql db built on rocksdb.

Of the 14 examples listed here
[https://en.wikipedia.org/wiki/NewSQL](https://en.wikipedia.org/wiki/NewSQL)

only 2 that you mentioned seem to be built on rocksdb.

~~~
manigandham
I'm not disagreeing, RocksDB is not used by most. The statement in the blog
post is not true.

------
kureikain
If someone loves LevelDB/RocksDB but wants a pure-Go implementation, I have
heard good things about this library:

[https://github.com/syndtr/goleveldb](https://github.com/syndtr/goleveldb)

------
ddorian43
Seems like it has few extra features: 1. SSTables as separate files, 2. range
delete (which is rare),

compared to LMDB (which is faster and more efficient):
[https://symas.com/lmdb/technical/](https://symas.com/lmdb/technical/)

Still, it would be nice to see how LMDB would fare in a complex distributed
DBMS (most of them use RocksDB-type libraries).

But LMDB is supposed to stay small, so more features are in a fork:
[https://github.com/leo-yuriev/libmdbx](https://github.com/leo-yuriev/libmdbx)

~~~
hyc_symas
LMDB is already used in distributed DBs - such as LDAP. OpenLDAP's performance
is orders of magnitude greater than any RDBMS or other distributed DB.

