The source of this performance gap is architectural. The current design of RocksDB precludes it from ever being legitimately high-performance in most contexts, and most other open source storage engines use a similar design. Modern high-performance storage engines also share a common architecture, implemented with minor variations; you just don't see that architecture in open source much. I realize that few software engineers have the skill set and experience required to design a top-notch storage engine, but I am still surprised by the dearth of open source examples given the large value in closing this gap.
I rarely use open source storage engines in the systems I build for this reason. The CapEx/OpEx implications of using them are far too costly at scale. Fortunately, I have the approximately free option of using my own storage engine implementations.
A fast storage engine needs to eliminate most of the elements that will stall an execution pipeline. This means doing things like almost completely eliminating shared data structures and context switching. It also means designing your own execution and I/O scheduler to greatly reduce the various forms of memory stalls that are ubiquitous in many designs. It is difficult to overstate the extent to which thoughtful scheduler design can improve throughput.
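To make the shared-nothing point concrete, here's a minimal Go sketch of the idea (the shard/request names are mine, not from any particular engine): each shard's index is owned by exactly one goroutine, so there are no locks, no shared data structures, and no contention-induced context switches on the hot path.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// request is the unit of work handed to a shard's owning goroutine.
type request struct {
	key, value string
	done       chan struct{}
}

// shard owns its index exclusively: only its goroutine ever touches it,
// so there are no locks and no cache-line ping-pong between cores.
type shard struct {
	requests chan request
	index    map[string]string
}

func newShard() *shard {
	s := &shard{
		requests: make(chan request, 1024),
		index:    make(map[string]string),
	}
	go func() {
		// Single-owner event loop for this shard.
		for r := range s.requests {
			s.index[r.key] = r.value
			close(r.done)
		}
	}()
	return s
}

func main() {
	shards := []*shard{newShard(), newShard(), newShard(), newShard()}

	// Route each key to its owning shard by hash; cross-shard sharing never happens.
	route := func(key string) *shard {
		h := fnv.New32a()
		h.Write([]byte(key))
		return shards[h.Sum32()%uint32(len(shards))]
	}

	done := make(chan struct{})
	route("user:42").requests <- request{key: "user:42", value: "hello", done: done}
	<-done
	fmt.Println("write acknowledged by the owning shard")
}
```

A real engine would pin each of these loops to a core and drive its own I/O scheduling from the same loop; the sketch only shows the ownership model.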
A state-of-the-art storage engine can drive 2+ GB/s per core and schedule work to keep the storage hardware close to its theoretical performance while smoothing out transients. In my experience it is very easy to run out of storage bandwidth.
What proprietary engines do you know of that I can look at?
Do you have more details on your own designs? Anything you can share?
Helium performs much, much better at scale and doesn’t have compaction issues. It’s proprietary, but in my experience it’s money well spent.
For the record, we were able to fully saturate a 4x NVMe array on a 96-core server using Helium, while RocksDB achieved about 20% of the full NVMe capacity.
As with all benchmarks, YMMV.
Here's an example of how data corruption can suck, using Riak and LevelDB as the example. The LevelDB data would corrupt often, which would leave you in a predicament. Say you had 10 nodes with a 3-node replication factor, and the whole cluster is humming away at a decent clip. Now one node's LevelDB corrupts, and you have to rebuild it. If you have a huge fuckoff dataset, this can take a while. Now another node goes down. Now only 1 node has the data you need, and 2 nodes are down - so now 8 nodes are doing the work of 10, and if you have any more failures, your data might be gone. Now add the replication traffic from the rebuild, which will suck performance and bandwidth away from the regular work. And because it would corrupt so easily and often, there needed to be hash trees to quickly identify what data was corrupt, and then you needed to fix it and rebuild your hash trees. That would also suck away performance. Finally, you can't just add new nodes while rebuilding, because the extra load makes the cluster fall over. And the more nodes, the higher the likelihood of failures.
It turns out that really unstable hardware and networks like to expose bugs. These problems also weren't discussed in public forums. We paid for support and even employed Riak developers, and still we hobbled on putting out fires. I'll bet other DBs go through the same crap and keep it quiet.
Also, read the Riak documentation and you'll find the corruption recovery documentation among other hints at common failures and limitations.
First off, SAN was one of three different disk storage solutions. When you work for <BIG CORP>, the lowly product teams don't always get to pick and choose what infrastructure is available. They have to do the best they can with what they have when they don't get what they ask for. If <BIG CORP> says to use a shitty OpenStack cluster, that's what you have to deal with, and you have to beg for all the ephemeral SSDs you can get. (Which then becomes a huge pain in the ass when you need to scale storage and your choices are (A) buy more machines and migrate nodes so you can upgrade the ephemeral storage on the old machines, (B) start swapping disks in running hypervisors and cry yourself to sleep, or (C) use SAS or some other array/volume on a SAN.)
And thanks for blithely ignoring what I'm saying. Riak did have corruption bugs that should have been preventable - as I said, a major source of the problems was LevelDB, and Riak's own documentation shows this to be true.
You could look sideways at these things and the db would corrupt. A node with ephemeral storage, with no detectable errors on it whatsoever, would suddenly stop working. We would go look at it, and it had a single leveldb file corrupt... and nothing else wrong with it at all. Not only would it corrupt, it wouldn't make any attempt to fix itself, even though there was a documented fix.
Riak has anti-entropy intended to detect missing data and fix it, but it's the Erlang equivalent of cron jobs and hash trees. The whole thing is designed to just go "dum de dum, I wonder if anything's broken after $INTERVAL?" and then perform some operation, which, if the cluster is under load, may knock it over. So they added throttling (throttling is everywhere in Riak; instead of simply rejecting operations when it's unsafe, they'd rather make everything go r e a l l y s l o w).
There was very little intelligence or event-driven programming for failure detection and remediation. When the db corrupted in a way that wasn't handled by anti-entropy, the node would just die, and we had to manually intervene (later by writing automation to intervene) rather than it, you know, just doing its own automation to fix the corruption. The AAE trees rebuild every $INTERVAL and there's no way to change when or how they rebuild other than to change the $INTERVAL, so there's no way to, for example, force them to rebuild when it is convenient based on lulls in application use.
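For anyone who hasn't seen this kind of repair machinery, here's a rough Go sketch of the general idea behind interval-driven anti-entropy (not Riak's actual AAE code; the bucketing and hashing scheme is invented for illustration): hash each key range on both replicas, compare the digests, and only re-read and repair the ranges that disagree.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// hashRange digests one bucket of keys so two replicas can compare
// buckets instead of exchanging every key/value pair.
func hashRange(data map[string]string, keys []string) [32]byte {
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(data[k]))
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// divergentBuckets splits the key space into nBuckets and returns the
// bucket indices whose digests disagree between two replicas.
func divergentBuckets(a, b map[string]string, nBuckets int) []int {
	keys := map[string]bool{}
	for k := range a {
		keys[k] = true
	}
	for k := range b {
		keys[k] = true
	}
	buckets := make([][]string, nBuckets)
	for k := range keys {
		i := int(sha256.Sum256([]byte(k))[0]) % nBuckets
		buckets[i] = append(buckets[i], k)
	}
	var bad []int
	for i, ks := range buckets {
		sort.Strings(ks) // deterministic order on both replicas
		if hashRange(a, ks) != hashRange(b, ks) {
			bad = append(bad, i)
		}
	}
	return bad
}

func main() {
	primary := map[string]string{"k1": "v1", "k2": "v2", "k3": "v3"}
	replica := map[string]string{"k1": "v1", "k2": "CORRUPT", "k3": "v3"}
	fmt.Println("buckets needing repair:", divergentBuckets(primary, replica, 4))
}
```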
Then there's Riak search, which has the nice habit of taking down your cluster due to god knows what (memory bloating, cpu starvation, unknown bugs in error logs, etc). Don't use Riak search.
Replication was also a joke. Any network interruption (hello, distributed apps have network interruptions) would kill replication. We would have to detect that replication had failed and queues were filling up, and restart replication until the queues fell. But sometimes replication couldn't resume, because one of 1,000 different potential failure modes was happening on one node in a remote cluster somewhere. So we had to resolve that node's issue and get the whole remote cluster healed before the replication queues filled up. If we didn't do that, we'd have to do a full transfer to prevent potential data loss, which would take days.
We developed auto-healing scripts to deal with most of these situations, and the controls Riak added to slow down processing so it didn't kill the cluster with all the competing operations it was trying to do at once (KV processing, replication, hash regeneration, etc.) were not enough for our automation to efficiently control the nodes when they were unhealthy. Riak would just occasionally perform incredibly poorly, or nodes would die randomly, and we'd get some unknown errors we couldn't diagnose. All our monitoring and investigation showed nothing wrong with the host - no resource starvation, no error messages, no spikes of client traffic. Riak was just having a bad day, and we, being a very small team of not-Erlang-programmers, had to just restart shit until it got better and research fixes once things improved. Our postmortem incident queue was rather large.
This is a small sampling of production Riak issues. I'm not going to dig through my brain for every bug they have, but suffice to say that a distributed database should be able to recover from a single file corruption, and should be able to prevent it from ever happening through various techniques that are 20+ years old. Their code is just lame, and proof that just because you write something in Erlang doesn't mean it's going to be stable. And in Riak's defense, the reason their code was lame was that they were a small company trying to juggle a lot of demanding engineering issues from different customers, and they didn't have much money or time. But lame code is still lame code.
This sounds like a weird complaint to be honest. If you verify that your hardware is unstable, how can you expect the software not to fail and corrupt data?
> Also, read the Riak documentation and you'll find the corruption recovery documentation among other hints at common failures and limitations.
I'm not sure how that's a negative thing. You think about and document recovery even if you don't expect things to fail.
No such thing as 'stable hardware' at scale.
If you accept imperfect hardware, you will get errors written to the drive. A single node of a database will, by definition, get corruption in that case. We're talking about RocksDB specifically here, so it is only one processing node.
How did you expect it to behave instead?
There are certain ways LSM trees can be screwed up in implementation, but the attack surface for corruption is relatively small, so I would not be surprised if that got plugged up. There's room for noobs to put in bugs, but a small enough surface area for a comprehensive cleanup to happen later. So my attitude about RocksDB is that I'm not too worried about LevelDB's history.
But I'm just a RocksDB user and have worked on LSM stores in the past, and I'm giving you my feelings and impressions, I don't have inside knowledge here.
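The standard way that surface gets plugged up is a checksum on every block written during flush and compaction, verified on every read. Here's a generic Go sketch of that idea; it is not RocksDB's actual block format.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// sealBlock appends a CRC32 footer to a block before it is written out.
func sealBlock(payload []byte) []byte {
	sum := crc32.Checksum(payload, crc32.MakeTable(crc32.Castagnoli))
	footer := make([]byte, 4)
	binary.LittleEndian.PutUint32(footer, sum)
	return append(payload, footer...)
}

// openBlock verifies the footer and returns the payload, or an error if
// the block was corrupted on disk.
func openBlock(block []byte) ([]byte, error) {
	if len(block) < 4 {
		return nil, errors.New("block too short")
	}
	payload, footer := block[:len(block)-4], block[len(block)-4:]
	want := binary.LittleEndian.Uint32(footer)
	got := crc32.Checksum(payload, crc32.MakeTable(crc32.Castagnoli))
	if got != want {
		return nil, errors.New("checksum mismatch: block is corrupt")
	}
	return payload, nil
}

func main() {
	block := sealBlock([]byte("key1=value1;key2=value2"))
	block[3] ^= 0xFF // simulate a single flipped byte on disk
	if _, err := openBlock(block); err != nil {
		fmt.Println("detected:", err)
	}
}
```

Detecting the corruption is the cheap part; what you do on a mismatch (re-read a replica, fail the read, quarantine the file) is where implementations differ.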
As with the C++ STL, the limiting factor is the number of tunable parameters available, i.e. the amount of internal architectural flexibility built into the implementation. OLTP storage engines are pretty simple, so a manageable number of behavioral parameters can usually get you within 3x of the throughput of a more targeted design, which is acceptable performance for most workloads that are not ingest-intensive.
OLAP-ish storage engines, on the other hand, are at least an order of magnitude more complex to implement and have many more degrees of freedom depending on the expected data model and workload. There is a lot more data model and workload diversity in OLAP than in OLTP, which makes implementing the necessary internal architectural flexibility and maintaining the resulting set of tunable parameters very unwieldy. If you limited yourself to the same number of user-definable tuning and configuration parameters as an OLTP-oriented storage engine like RocksDB, the performance gap between a generalist implementation and a more targeted implementation would be more like 10-100x, which needless to say is huge. This makes the practical applicability of any "general purpose" OLAP storage engine that someone would want to use quite narrow, which diminishes the value of implementing a general purpose engine.
This leads to the current reality that there is a zoo of specialist storage engines for OLAP-ish workloads -- graph, time-series, event processing, geospatial, classic DW, etc. Much more generalist OLAP storage engines that do several of these models could exist in theory but the bar for technical sophistication and complexity is much higher than for OLTP.
Open source projects in particular tend to have a natural ceiling on the number of man-years invested to get an initial implementation of an architecture, which inherently limits the expressiveness of that architecture for software with this complexity.
Once you’ve encoded the data into large enough blocks, you could use any storage engine and write the encoded blocks into it, along with metadata for managing which blocks belong to which tables and table partitions.
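As a rough Go sketch of what that looks like (the key names and the gzip encoding are placeholders, not any particular engine's convention): the column blocks are opaque values, and a small manifest record maps table/partition to the block keys.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// kv stands in for whatever embedded storage engine you pick.
type kv map[string][]byte

// encodeBlock packs a column's values into one compressed, opaque block.
func encodeBlock(values []string) []byte {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	for _, v := range values {
		w.Write([]byte(v + "\n"))
	}
	w.Close()
	return buf.Bytes()
}

func main() {
	store := kv{}

	// Write one encoded column block for partition 2020-01 of table "events".
	blockKey := "block/events/2020-01/user_id/000001"
	store[blockKey] = encodeBlock([]string{"17", "17", "42", "99"})

	// Metadata record: which blocks make up which table partition.
	store["meta/events/2020-01"] = []byte(blockKey)

	fmt.Printf("stored %d bytes under %s\n", len(store[blockKey]), blockKey)
	fmt.Printf("partition manifest: %s\n", store["meta/events/2020-01"])
}
```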
You can also just use something like Parquet or ORC, but that’s not going to get you the best performance possible.
The best explanation of the various techniques that go into the data structures and operator designs for OLAP workloads is the survey 'The Design and Implementation of Modern Column-Oriented Database Systems' by Abadi, Boncz, Harizopoulos, Idreos, and Madden: http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
Its storage format and processing engine aren't designed to be embedded in the same way as RocksDB and SQLite are, but you certainly could if you wanted to, since the code is fairly modular. Or you could use it as a standalone service as it was designed to be used.
Disclosure: I work at Percona, and we've seen a lot of our customers make use of Clickhouse and have begun some of our own services work around it in Consulting. It's now a primary database talked about at our conferences, and we post about it regularly. 
And the hybrid of OLAP + OLTP is usually called HTAP.
For example, do you materialize tuples immediately, or do you run them all the way through your processing pipeline and not materialize until the end?
Your storage engine and format need to be at least somewhat involved in answering that question, because you need to know what data to read and when.
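A toy Go sketch of the difference, with made-up column data (this only shows the general early vs. late materialization idea, not any particular engine's operator design):

```go
package main

import "fmt"

// Two columns of a tiny "users" table, stored column-wise.
var ages = []int{25, 34, 41, 19}
var names = []string{"ann", "bob", "cho", "dee"}

// Early materialization: stitch full tuples first, then filter them.
func early() []string {
	var out []string
	for i := range ages {
		tuple := struct {
			age  int
			name string
		}{ages[i], names[i]}
		if tuple.age > 30 {
			out = append(out, tuple.name)
		}
	}
	return out
}

// Late materialization: run the predicate over the age column alone,
// keep only row positions, and fetch names at the very end. The storage
// format has to support "give me column X at these positions" for this
// to pay off.
func late() []string {
	var positions []int
	for i, a := range ages {
		if a > 30 {
			positions = append(positions, i)
		}
	}
	var out []string
	for _, p := range positions {
		out = append(out, names[p])
	}
	return out
}

func main() {
	fmt.Println(early(), late()) // both print [bob cho]
}
```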
My experience with it has mostly been as the local state store for stream processing, e.g. in Kafka Streams.
I just had to chuckle at this:
> Non-engineers: in a computer, a move is always implemented as a copy followed by a delete
Yeah, that's really gonna help a non-developer understand the article better...
When defining a hierarchy of tables, such as customers -> orders -> order_line_items, you can make the primary key of the child tables contain the primary key of the parent table.
e.g. (customer_id) for the customers table, (customer_id, order_id) for the orders table, then (customer_id, order_id, line_item_id) for the order_line_items table.
When this is stored on disk in a sorted format, it makes joins between these extremely cheap because the data will all be next to each other on disk.
CockroachDB calls this "interleaved tables".
This blog post has a good introduction to their implementation of interleaved tables: https://emsal.me/blog/5
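A tiny Go sketch of why that layout makes the join cheap (the key strings are illustrative, not any engine's real key encoding): once the parent primary key is a prefix of the child keys, one sorted range scan per customer picks up the orders and line items together.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

func main() {
	// Keys for customers, orders, and line items, interleaved by prefix.
	keys := []string{
		"customer/7",
		"customer/7/order/1",
		"customer/7/order/1/line/1",
		"customer/7/order/1/line/2",
		"customer/7/order/2",
		"customer/9",
		"customer/9/order/5",
	}
	sort.Strings(keys) // the on-disk order of an LSM/B-tree keyed this way

	// "Join" customer 7 to all of its orders and line items with a single
	// contiguous prefix scan instead of separate per-table lookups.
	for _, k := range keys {
		if strings.HasPrefix(k, "customer/7") {
			fmt.Println(k)
		}
	}
}
```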
Is this actually true?
Spark, FoundationDB, MemSQL, NuoDB, Citus. I am not sure any of these are built on top of RocksDB.
Which ones are actually built on an LSM?
In addition to CockroachDB there's also TiDB which runs on top of TiKV which uses RocksDB.
Why aren't Citus and NuoDB 'NewSQL'?
> there's also TiDB
One more example doesn't substantiate the statement "most are built on RocksDB". I wasn't saying there is only one NewSQL DB built on RocksDB.
Of the 14 examples listed at https://en.wikipedia.org/wiki/NewSQL, only the 2 that you mentioned seem to be built on RocksDB.
None of these are NewSQL (ACID and SQL), though.
compared to LMDB (which is faster & more efficient):
Still, it would be nice to see how LMDB would fare in a complex distributed DBMS (most of them are built on RocksDB-style libraries).
But LMDB is supposed to stay small. So more features are in a fork: https://github.com/leo-yuriev/libmdbx