Building CockroachDB on top of RocksDB (cockroachlabs.com)
180 points by bandwitch 29 days ago | 74 comments

Our in-house DB at Stream also runs on top of RocksDB + Raft. It's amazing just how much faster it is than anything else out there (especially compared to Cassandra). Instagram uses RocksDB as storage for Cassandra, and LinkedIn and Pinterest use RocksDB too. If you have the time to build your own DB using RocksDB, you get really fine-grained control over performance.


RocksDB is pretty good and we relied heavily on it at QuasarDB as well. Having said that, we are nowadays deploying more and more production setups with Levyx's Helium, which scales better and integrates directly with the hardware.

Given that Helium appears to be proprietary, what kind of perf benefit are we talking about here?

I haven't used Helium specifically, but 3-5x greater throughput would be completely believable in my experience. It is an open secret that high-performance closed source storage engines can have several times the throughput of their open source equivalents on the same hardware. High-end storage engines often have sufficient throughput to consistently saturate NVMe arrays for diverse workloads, which is not something you commonly see in open source. Consequently, it is common to see closed source storage engines for people doing high-scale sensor analytics work and similar.

The source of this performance gap is architectural. The current design of RocksDB precludes it ever being legitimately high-performance in most contexts, and most other open source storage engines use a similar design. Modern high-performance storage engines also use a common architecture implemented in minor variations, you just don't see this architecture in open source much. I realize that few software engineers have the skillset and experience required to design a top-notch storage engine, but I am still surprised by the dearth of open source examples given the large value in closing this gap.

I rarely use open source storage engines in the systems I build for this reason. The CapEx/OpEx implications of using them are far too costly at scale. Fortunately, I have the approximately free option of using my own storage engine implementations.

This technique is just more I/O parallelism at the physical layer due to higher concurrency while submitting I/O, correct? Since NVMe and new SSDs don't hit peak throughput until very high queue depths, this doesn't surprise me.

I/O parallelism is necessary but far from sufficient. My own designs arbitrarily allow 64 reads and 64 writes to be in-flight concurrently per core. There is no science behind that limit beyond the fact it has worked brilliantly for many years across every type of storage. But I/O parallelism won't fix terrible scheduling.
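As a rough illustration of the per-core in-flight cap described above (a hypothetical sketch, not the commenter's actual design), you can bound outstanding reads and writes separately with semaphores:

```python
import asyncio

# Hypothetical sketch: each worker allows at most 64 reads and 64 writes
# outstanding at once, mirroring the "64 reads and 64 writes in flight
# per core" rule of thumb above.
class IoScheduler:
    def __init__(self, read_limit=64, write_limit=64):
        self.read_slots = asyncio.Semaphore(read_limit)
        self.write_slots = asyncio.Semaphore(write_limit)

    async def read(self, op):
        async with self.read_slots:   # waits only when 64 reads are in flight
            return await op()

    async def write(self, op):
        async with self.write_slots:
            return await op()

async def main():
    sched = IoScheduler()

    # Stand-in for real device I/O: 200 dummy reads submitted concurrently,
    # but never more than 64 actually "in flight" at a time.
    async def fake_read():
        await asyncio.sleep(0)
        return 1

    results = await asyncio.gather(*(sched.read(fake_read) for _ in range(200)))
    print(sum(results))

asyncio.run(main())
```

A real engine would do this with its own I/O submission queues (e.g. io_uring or raw NVMe queues) rather than asyncio, but the back-pressure shape is the same.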

A fast storage engine needs to eliminate most of the elements that will stall an execution pipeline. This means doing things like almost completely eliminating shared data structures and context switching. It also means designing your own execution and I/O scheduler to greatly reduce the various forms of stalling on memory ubiquitous in many designs. It is difficult to overstate the extent to which thoughtful scheduler design can improve throughput.

A state-of-the-art storage engine can drive 2+ GB/s per core, and schedule things to keep the storage hardware performance close to theoretical while smoothing out transients. It is very easy to run out of storage bandwidth in my experience.

I'd love to learn more. I'm in need of a fast engine.

What proprietary engines do you know of that I can look at?

Do you have more details on your own designs? Anything you can share?

See ScyllaDB. They're doing most of what the grandparent is talking about (per-core sharding, async everything, disk/I/O/network scheduler, DPDK, custom filesystem, etc).

You obviously haven't used LMDB. Its zero-overhead reads can saturate NVMe and even Optane DIMMs, something that no other DB engine has accomplished.

What are your thoughts on LMDB?

In our testing, it’s multiple times faster, especially at scale. RocksDB’s compaction becomes a bottleneck fairly quickly when put under strain for extended periods of time.

Helium performs much, much better at scale and doesn’t have compaction issues. It’s proprietary, but in my experience it’s money well spent.

For the record, we were able to fully saturate a 4x NVMe array with a 96-core server using Helium, while RocksDB achieved about 20% of the full NVMe capacity.

As with all benchmarks, YMMV.

Did you try LMDB?

RocksDB is a fork of LevelDB, which was [in]famous for how easily it corrupted data. Did Facebook ever do anything to ensure data wouldn't corrupt, or is that still a common thing operationally? (You find it more at larger scales)

Here's an example of how data corruption can suck, with (example) Riak and LevelDB. The leveldb data would corrupt often, which would leave you in a predicament. Say you had 10 nodes with a 3 node replication factor, and the whole cluster is humming away at a decent clip. Now one node's leveldb corrupts, and you have to rebuild it. If you have a huge fuckoff dataset, this can take a while. Now another node goes down. Now only 1 node has the data you need, and 2 nodes are down - so now 8 nodes are doing the work of 10, and if you have any more failures, your data might be gone. Now add replication, which will suck performance and bandwidth away from the regular work. And because it would corrupt so easily & often, there needed to be hash trees to quickly identify what data was corrupt, and then you needed to fix it and rebuild your hash trees. This would also suck away performance. Finally, you can't just add new nodes while rebuilding, because the extra load makes the cluster fall over. And the more nodes, the higher the likelihood of failures.

I never experienced this with several in production Riak clusters running for years. Can you explain how to reproduce or give a link to any public forum where this was discussed?

Sure. Build about 10 classes of clusters of varying sizes, each with a dataset ranging from 100GB to a petabyte or more. Run them on shitty oversubscribed openstack clusters with a combination of ephemeral, Ceph, and SAN disks. Do replication to similar-ish clusters in different regions. Handle data for about 100 different applications that process so much data at such low latency that cloud-based databases aren't even an option. Keep adding nodes and storage to existing clusters over time.

It turns out that really unstable hardware/networks like to expose bugs. It also wasn't discussed in public forums. We paid for support and even employed Riak developers, and still we hobbled on putting out fires. I'll bet other DBs go through the same crap and keep it quiet.

Also, read the Riak documentation and you'll find the corruption recovery documentation among other hints at common failures and limitations.

Thanks for confirming it wasn't a Riak issue. SAN disk, really? As an architect I can tell you that SANs are almost always an antipattern for building reliable and scalable distributed systems.

I didn't know I was being interrogated about Riak failure modes. Ok, here's more verbiage on Riak failures.

First off, SAN was one of three different disk storage solutions. When you work for <BIG CORP>, the lowly product teams don't always get to pick and choose what infrastructure is available. They have to do the best they can with what they have, when they don't get what they ask for. If <BIG CORP> says to use a shitty openstack cluster, that's what you have to deal with, and you have to beg for all the ephemeral SSDs you can get. (Which then becomes a huge pain in the ass when you need to scale storage and your choice is (A) buy more machines and migrate nodes so you can upgrade ephemeral on the old machines or (B) start swapping disks in running hypervisors and cry yourself to sleep, or (C) use SAS or other array/volume on a SAN)

And thanks for blithely ignoring what I'm saying. Riak did have corruption bugs that should have been preventable - as I said, a major source of the problems was LevelDB, and Riak's own documentation shows this to be true.

You could look sideways at these things and the db would corrupt. A node with ephemeral storage, with no detectable errors on it whatsoever, would suddenly stop working. We would go look at it, and it had a single leveldb file corrupt... and nothing else wrong with it at all. Not only would it corrupt, it wouldn't make any attempt to fix itself, even though there was a documented fix.

Riak has anti-entropy intended to detect missing data and fix it, but it's the erlang equivalent of cron jobs and hash trees. The whole thing is designed to just go "dum de dum, I wonder if anything's broken after $INTERVAL?" and then perform some operation, which if the cluster is under load, may kick it over. So they added throttling (throttling is everywhere in Riak, as instead of simply rejecting operations because it's unsafe, they'd rather make everything go r e a l l y s l o w).

There was very little intelligence or event-driven programming for failure detection and remediation. When the db corrupted in a way that wasn't handled by anti-entropy, the node would just die, and we had to manually intervene (later by writing automation to intervene) rather than it, you know, just doing its own automation to fix the corruption. The AAE trees rebuild every $INTERVAL and there's no way to change when or how they rebuild other than to change the $INTERVAL, so there's no way to, for example, force them to rebuild when it is convenient based on lulls in application use.

Then there's Riak search, which has the nice habit of taking down your cluster due to god knows what (memory bloating, cpu starvation, unknown bugs in error logs, etc). Don't use Riak search.

Replication was also a joke. Any network interruption (hello, distributed apps have network interruptions) would kill replication. We would have to detect that replication had failed and queues were filling up, and re-start the replication until queues fell. But sometimes replication couldn't resume, because there was 1 of 1,000 different potential failure modes happening with 1 node in a remote cluster somewhere. So we had to resolve that node's issue and get the whole remote cluster healed before the replication queues filled up. If we didn't do that, we'd have to do a full transfer to prevent potential data loss, which would take days.

We developed auto-healing scripts to deal with most of these situations, and the controls Riak added to slow down processing so it didn't kill the cluster from all the competing operations it was trying to do at once (kv processing, replication, hash regeneration, etc) were not enough for our automation to be able to efficiently control the nodes when they were unhealthy. Riak would just occasionally perform incredibly poorly, or nodes would die randomly, and we'd get some unknown errors we couldn't diagnose. All our monitoring and investigation showed nothing wrong with the host - no resource starvation, no error messages, no spikes of client traffic. Riak was just having a bad day, and us being a very small team of not-erlang-programmers, we had to just restart shit until it got better, and research fixes once things improved. Our postmortem incident queue was rather large.

This is a small sampling of production Riak issues. I'm not going to dig into my brain for every bug they have, but suffice to say that a distributed database should be able to recover from a single file corruption, and should be able to resist it from ever happening through various techniques that are 20+ years old. Their code is just lame, and proof that just because you write something in Erlang doesn't mean it's going to be stable. And in Riak's defense, the reason why their code was lame was because they were a small company trying to juggle a lot of demanding engineering issues from different customers, and they didn't have much money or time. But lame code is still lame code.

> It turns out that really unstable hardware/networks like to expose bugs.

This sounds like a weird complaint to be honest. If you verify that your hardware is unstable, how can you expect the software not to fail and corrupt data?

> Also, read the Riak documentation and you'll find the corruption recovery documentation among other hints at common failures and limitations.

I'm not sure how that's a negative thing. You think about and document recovery even if you don't expect things to fail.

> If you verify that your hardware is unstable, how can you expect the software not to fail and corrupt data?

No such thing as 'stable hardware' at scale.

Ok, but then there's also no such thing as no corruption at scale.

If you accept imperfect hardware, you will get errors written to the drive. A single node of a database will get corruption by definition in that case. We're talking about RocksDB specifically here, so it is only one processing node.

How did you expect it to behave instead?

There is; the error rate is just different between a SAN-backed openstack cluster vs AWS, for example. EC2 is reliable compared to the hardware just described above.

What? It seems like you're blaming the db software for a range of external hardware and operations issues. The apps need "so much data at such low latency" yet couldn't have a proper running environment?

As far as I'm aware, none of the embedded DBs can help with fast recovery after corruption or any partial recovery. Sucks, but tolerable on local networks and small nodes. For spinning disks or far away non local nodes this of course doesn't work well and you have to implement your own data store.

I think the problem the parent was referring to was the database corrupting itself, or returning corrupt results, not it getting corrupted by an outside agency.

There are certain ways LSM trees can be screwed up in implementation, but the attack surface for corruption is relatively small, so I would not be surprised if that got plugged up. There's room for noobs to put in bugs, but a small enough surface area for a comprehensive cleanup to happen later. So my attitude about RocksDB is that I'm not too worried about LevelDB's history.

But I'm just a RocksDB user and have worked on LSM stores in the past, and I'm giving you my feelings and impressions, I don't have inside knowledge here.

Curious, what was the cause of data corruption in LevelDB?

That kind of concern is probably why FoundationDB built on and modified SQLite. Its reliability is already great.

I noticed that RocksDB is used very often in OLTP scenarios. What's the OLAP equivalent of RocksDB in OLTP world? Apache Parquet? Apache Arrow? What would you use these days to create a high performance OLAP/OLHybridP engine ?

There is a practical engineering reason why the OLAP equivalent doesn't seem to exist. General purpose storage engines, and this applies to RocksDB, are like the C++ STL in that they provide good average performance across a wide range of common cases but are nowhere close to optimal if you have a well-defined type of data model and workload as your use case. You can always gain an integer factor increase in throughput by designing a less generalist implementation with a similar interface.

As with the C++ STL, the limiting factor is the number of tunable parameters available i.e. the amount of internal architectural flexibility built into the implementation. OLTP storage engines are pretty simple, so a manageable number of behavioral parameters can usually get you within 3x of the throughput of a more targeted design, which is acceptable performance for most workloads that are not ingest-intensive.

OLAP-ish storage engines, on the other hand, are at least an order of magnitude more complex to implement and have many more degrees of freedom depending on the expected data model and workload. There is a lot more data model and workload diversity in OLAP than OLTP, which makes implementing the effective internal architectural flexibility and set of tunable parameters that need to be maintained very unwieldy. If you limited yourself to the same number of user-definable tuning and configuration parameters as an OLTP-oriented storage engine like RocksDB, the performance gap between a generalist implementation and a more targeted implementation would be more like 10-100x, which needless to say is huge. This makes the practical applicability of any "general purpose" OLAP storage engine that someone would want to use quite narrow, which diminishes the value of implementing a general purpose engine.

This leads to the current reality that there is a zoo of specialist storage engines for OLAP-ish workloads -- graph, time-series, event processing, geospatial, classic DW, etc. Much more generalist OLAP storage engines that do several of these models could exist in theory but the bar for technical sophistication and complexity is much higher than for OLTP.

Open source projects in particular tend to have a natural ceiling on the number of man-years invested to get an initial implementation of an architecture, which inherently limits the expressiveness of that architecture for software with this complexity.

For analytics workloads, your best bet is using compression techniques that let you do operations on the data without decompressing it. A good example is dictionary encoding a set of sorted string keys so you can perform prefix queries by doing a greater-than and less-than comparison on the integers instead of examining every string entirely.
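A minimal sketch of that dictionary-encoding trick (assuming the dictionary is built over the full sorted set of distinct keys, so code order matches sort order and a string prefix query becomes an integer range scan):

```python
import bisect

# Build a dictionary over the sorted distinct keys: code order == sort order.
keys = sorted(["apple", "apricot", "banana", "berry", "cherry"])
code_of = {k: i for i, k in enumerate(keys)}          # string -> small int
column = [code_of[k] for k in ["banana", "apple", "cherry", "apricot"]]

def prefix_match(prefix):
    # Find the integer code range [lo, hi) covering all keys with the prefix.
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")
    # The scan itself only compares integers, never re-examines the strings.
    return [i for i, code in enumerate(column) if lo <= code < hi]

print(prefix_match("ap"))   # row indices whose key starts with "ap" -> [1, 3]
```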

Once you’ve encoded the data into large enough blocks, you could use any storage engine and write the encoded blocks into it along with metadata for managing which blocks are a part of what tables and partitions of tables.

You can also just use something like Parquet or ORC, but that’s not going to get you the best performance possible.

(author of the blog post here) I'd second ryanworl's comment that the rabbit hole goes much deeper than just storing things in a column-oriented disk or in-memory format like Parquet or Arrow. That's just the first step. To get the best performance you have to have your data in an in-memory format that allows you to compress it efficiently, and then perform many relational operations on the compressed form itself. Another example is run-length and delta encoding a sorted column of integers, and then building relational operators (e.g. a join) that operate directly on the compressed data.
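To make the run-length-encoding idea concrete, here's a toy sketch (illustrative only, not CockroachDB's actual code) that evaluates an equality predicate directly on the runs, touching one run instead of every row:

```python
def rle_encode(sorted_col):
    # Run-length encode a sorted column into (value, run_length) pairs.
    runs = []
    for v in sorted_col:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def count_equal(runs, target):
    # Predicate pushdown onto the compressed form: a binary search over run
    # values would make this O(log n); a linear scan keeps the sketch short.
    for value, length in runs:
        if value == target:
            return length          # one run answers the whole predicate
    return 0

col = [3, 3, 3, 7, 7, 9, 9, 9, 9]
runs = rle_encode(col)
print(runs)                  # [[3, 3], [7, 2], [9, 4]]
print(count_equal(runs, 9))  # 4
```

A real column store does the same thing for joins and aggregations, keeping data compressed through as much of the operator pipeline as possible.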

The best explanation for all the various techniques that go into the data structures and operator designs for OLAP workloads is the survey 'The Design and Implementation of Modern Column-Oriented Database Systems' by Abadi, Boncz, Harizopoulos, Idreos, and Madden: http://db.csail.mit.edu/pubs/abadi-column-stores.pdf

I know there are many techniques that used together give good performance (optimal memory layout, compression, vectorization, etc. etc.), however I'd like to use a package that does a lot of it, same what RocksDB (or SQLite) does for OLTP cases. Is there something like that? If not, what's out there that gives the best foundation for building OLAP functionalities on top of it?

Check out Druid [1], an open-source analytical database with tightly-coupled storage and processing engines designed for OLAP. In particular it implements a memory-mappable storage format, indexes, compression, late tuple materialization, and query engines that can operate directly on compressed data. There is a patch out to add vectorized processing as well, so you should expect to see that show up in a future release.

Its storage format and processing engine aren't designed to be embedded in the same way as RocksDB and SQLite are, but you certainly could if you wanted to, since the code is fairly modular. Or you could use it as a standalone service as it was designed to be used.

[1] http://druid.io/

There's also ClickHouse [1], which seems to scale much better than Druid, and has similar architectural decisions to make it somewhat general as a columnar store for OLAP uses. Cloudflare wrote an article in the past where they compared ClickHouse and Druid, and they chose ClickHouse because they could get similar performance on the same workload with 9 nodes in ClickHouse, which would require hundreds for Druid. They built all of the DNS analytics at Cloudflare on ClickHouse [2].

Disclosure: I work at Percona, and we've seen a lot of our customers make use of ClickHouse and have begun some of our own services work around it in Consulting. It's now a primary database talked about at our conferences, and we post about it regularly. [3]

[1]: https://clickhouse.yandex/ [2]: https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-q... [3]: https://www.percona.com/blog/2018/10/01/clickhouse-two-years...

There is a very good article[1] by one of the Druid committers about ClickHouse/Druid/Pinot that goes into some detail on why the Cloudflare tests turned out the way they did.


That article is better than expected, and it matches my own experience with CH (it was a great match for our use case, and some of those reasons are in the article; also, we could have used an inverted index, had one been available; surprisingly we survived without it).

Does ClickHouse support fine grained data security (for example role A gives access only to tuples with column X==123)?

No [1]. ClickHouse is a fairly low-level tool. If you need that kind of thing, you build an ACL-aware app on top of it.

[1] https://clickhouse.yandex/docs/en/operations/access_rights/

That's what Apache Arrow is; you had the right idea. That solves the processing component, and you can use any number of on-disk formats like Parquet and ORC.

And the hybrid of OLAP + OLTP is usually called HTAP.

Is it possible to insert new tuples into an Arrow model without rebuilding it from scratch?

No, I am not aware of any storage engine that provides that out of the box. The techniques are very tied into what your query processing engine can do and expects the data to look like.

For example, do you materialize tuples immediately, or do you fully run it through your processing pipeline and not materialize until the end?

Your storage engine and format needs to be at least somewhat involved in answering that question, because you need to know what data to read and when.

Unfortunately most of the systems that build what you're describing are closed source (e.g. Snowflake, Microsoft SQL Server, Vertica, Teradata). There isn't an open-source project that does all of those things.

What about Presto?

Presto is more of a distributed SQL solution: you run it on a cluster of nodes and point them at your storage layer. It's optimised for querying very large datasets, and it's not built or tuned for high performance (in terms of latency or execution time).

Apache Arrow is your best bet, but it's still very much a new project without a lot of the things you're looking for.

> I noticed that RocksDB is used very often in OLTP scenarios.

Most of my experience with it has been as a local state store in stream processing, e.g. Kafka Streams.

S3 + ORC|Parquet + PrestoDB works very well

Excellent article, very informative.

I just had to chuckle at this:

> Non-engineers: in a computer, a move is always implemented as a copy followed by a delete

Yeah, that's really gonna help a non-developer understand the article better...

It's confusing altogether. For example, that's not how /bin/mv (usually) works.

Usually, that's because /bin/mv is just changing a link to the file, not moving the file itself. In cases where it's actually moving the file -- say across a file system boundary -- it does copy the file and then delete the old version.
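You can see both behaviors from Python's standard library: `os.rename` is the metadata-only rename (it raises `OSError`/EXDEV across filesystems), and `shutil.move` is the tool-style fallback that copies then deletes when rename isn't possible. A small sketch:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "a.txt")
    dst = os.path.join(d, "b.txt")
    with open(src, "w") as f:
        f.write("hello")

    # Same filesystem: just relinks the name, no data is copied.
    os.rename(src, dst)
    print(os.path.exists(src), os.path.exists(dst))  # False True

    # shutil.move tries rename first, and falls back to copy-then-delete
    # when the destination is on a different filesystem.
    shutil.move(dst, src)
    print(open(src).read())  # hello
```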

I reckon it was meant in the context of compaction.

When a SQL implementation is built on a KV storage engine, how do tables, rows, and columns typically map to the underlying KV data model?

Excellent question. There's a CockroachDB blog post about that: https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mappin...
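The rough shape of the mapping in that post (a loose sketch; the real encoding is a binary order-preserving format, and details vary by database and version) is that each column value becomes one KV pair keyed by (table id, primary key, column id), so a row is a contiguous run of keys:

```python
# Hypothetical illustration of a SQL-row-to-KV mapping, using readable
# string keys instead of the real order-preserving binary encoding.
def encode_row(table_id, pk, row):
    kvs = []
    for col_id, value in sorted(row.items()):
        key = f"/{table_id}/{pk}/{col_id}"   # (table, primary key, column)
        kvs.append((key, value))
    return kvs

# A row of a hypothetical users table (table id 51, primary key 42).
row = {1: "alice", 2: "alice@example.com"}
for key, value in encode_row(51, 42, row):
    print(key, "->", value)
# /51/42/1 -> alice
# /51/42/2 -> alice@example.com
```

Because the keys share the `/table/pk` prefix, reading one row is a single short range scan in the KV store.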

The CockroachDB blog post on this topic is a good summary. There is an additional trick that isn't directly KV related, but is important in a distributed environment when using a KV storage engine.

When defining a hierarchy of tables, such as customers -> orders -> order_line_items, you can make the primary key of the child tables contain the primary key of the parent table.

e.g. (customer_id) for the customers table, (customer_id, order_id) for the orders table, then (customer_id, order_id, line_item_id) for the order_line_items.

When this is stored on disk in a sorted format, it makes joins between these extremely cheap because the data will all be next to each other on disk.

CockroachDB calls this "interleaved tables".
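A hedged illustration of that interleaved layout (not the real on-disk encoding): because child keys embed the parent's primary key as a prefix, a lexicographic sort clusters each customer's orders and line items together, and the join becomes one contiguous range scan.

```python
# Tuples compare element-wise, mimicking a sorted KV store's key order.
keys = [
    ("customer", 7),
    ("customer", 7, "order", 1001),
    ("customer", 7, "order", 1001, "line_item", 1),
    ("customer", 7, "order", 1002),
    ("customer", 9),
    ("customer", 9, "order", 2001),
]
assert keys == sorted(keys)   # already in KV sort order

# Everything belonging to customer 7 is a single contiguous range scan:
lo, hi = ("customer", 7), ("customer", 8)
scan = [k for k in keys if lo <= k < hi]
print(scan)   # the customer row plus its orders and line items
```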

Interleaved tables are best for 1:1 relationships.

Hard disagree. This is the only way in a distributed, sorted KV store to get any semblance of data locality. If Cockroach and Spanner didn't do this, they would constantly be doing 2PC for modifying data that is related but stored on different groups of machines.

We are evaluating it for production and the senior engineer we talked to from the company told us that. I’d think he knows what he’s talking about.

I may be misinterpreting what you mean by a 1-1 relationship, but the documentation for Cloud Spanner, Cockroach, and the old FoundationDB SQL layer all use a similar schema to the one I described in their examples. Additionally, the F1 paper describes using it in the same way in Figure 2.

He said tables with a parent-child relationship that is one to one, meaning (if I understand that right) one parent ID mapping to one child ID. It's entirely possible either I'm misinterpreting him or he's wrong, but that's what he said.

I think the data locality benefits would hold for one to many relationships too (that is, one parent with many children).

Also they do 2PC work in a 1PC way by maintaining a hidden transaction status table.

With hash sharding you can distribute tables more easily than with range sharding.
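A toy comparison of the two placement strategies (hypothetical split points and shard counts, just to show the shape of the trade-off): with range sharding, sequential keys pile onto one shard, while hashing spreads them out at the cost of losing locality for range scans.

```python
import hashlib

NUM_SHARDS = 4

def range_shard(key, boundaries=(250, 500, 750)):
    # Shard by which key range the key falls into (hypothetical split points).
    return sum(key >= b for b in boundaries)

def hash_shard(key):
    # Shard by a hash of the key, ignoring key order entirely.
    digest = hashlib.sha256(str(key).encode()).digest()
    return digest[0] % NUM_SHARDS

sequential_keys = range(100)          # e.g. an auto-increment primary key
print({range_shard(k) for k in sequential_keys})   # {0} -- a single hot shard
print({hash_shard(k) for k in sequential_keys})    # likely all 4 shards
```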

Can you link to the post you’re referring to?

https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mappin... this was in another comment.

https://emsal.me/blog/5 this blog post has a good introduction to their implementation of interleaved tables.

Thanks, perfect!

> If you surveyed most NewSQL databases today, most of them are built on top of an LSM, namely, RocksDB.

Is this actually true?

Spark, FoundationDB, MemSQL, NuoDB, Citus. I am not sure any of these are built on top of RocksDB.

Which ones are actually built on lsm?

None of those are NewSQL other than MemSQL, which is an OLAP system that uses a custom rowstore format and Parquet for columnstores.

In addition to CockroachDB there's also TiDB which runs on top of TiKV which uses RocksDB.

> None of those are newsql other than MemSQL

Why aren't Citus and NuoDB 'NewSQL'?

> there's also TiDB

One more example doesn't prove the statement "most are built on RocksDB". I wasn't saying there is only one NewSQL DB built on RocksDB.

Of the 14 examples listed here https://en.wikipedia.org/wiki/NewSQL only the 2 you mentioned seem to be built on RocksDB.

I'm not disagreeing, RocksDB is not used by most. The statement in the blog post is not true.

Cassandra, MongoDB, BigTable, InfluxDB, LevelDB.

> Cassandra, MongoDB, BigTable, InfluxDB, LevelDB.

None of these are NewSQL[1] (ACID and SQL) though.

1. https://en.wikipedia.org/wiki/NewSQL

If someone loves LevelDB/RocksDB but wants a pure-Go implementation, I have heard good things about this library:


Seems like it has only a few extra features: 1. SSTables as separate files 2. range delete (which is rare)

compared to LMDB (which is faster & more efficient): https://symas.com/lmdb/technical/

Still, it would be nice to see how LMDB would fare in a complex distributed DBMS (most of them are built on RocksDB-type libraries).

But LMDB is supposed to stay small. So more features are in a fork: https://github.com/leo-yuriev/libmdbx

LMDB is already used in distributed DBs - such as OpenLDAP. OpenLDAP's performance is orders of magnitude greater than any RDBMS or other distributed DB.
