ScyllaDB: Drop-in replacement for Cassandra that claims to be 10x faster (scylladb.com)
114 points by haint 847 days ago | 93 comments

Very nice.

Broadly speaking, this is the correct style of architecture for a database engine on modern hardware. It is vastly more efficient in terms of throughput than the more traditional architectures common in open source. It also lends itself to elegant, compact implementations. I've been using similar architectures for several years now.

While I have not benchmarked their particular implementation, my first-hand experience is that these types of implementations are always at least 10x faster on the same hardware than a nominally equivalent open source database engine, so the performance claim is completely believable. One of my longstanding criticisms of open source data infrastructure has always been the very poor operational efficiency at a basic architectural level; many closed source companies have made a good business arbitraging the gross efficiency differences.

Agreed, but which architectural features are you referring to?

Over the last decade, the distributed system nature of modern server hardware internals has become painfully evident in how software architectures scale on a single machine. The traditional approaches -- multithreading, locking, lock-free structures, etc -- are all forms of coordination and agreement in a distributed system, with the attendant scalability problems if not used very carefully.

At some point several years ago, a few people noticed that if you attack the problem of scalable distribution within a single server the same way you would in large distributed systems (e.g. shared-nothing architectures), you can realize huge performance increases on a single machine. The caveat is that the software architectures look unorthodox.

The general model looks like this:

- one process per core, each locked to a single core

- use locked local RAM only (effectively limiting NUMA)

- direct dedicated network queue (bypass kernel)

- direct storage I/O (bypass kernel)

If you do it right, you minimize the amount of silicon that is shared between processes which has surprisingly large performance benefits. Linux has facilities that make this relatively straightforward too.

As a consequence, adjacent cores on the same CPU have only marginally more interaction with each other than cores on different machines entirely. Treating a single server as a distributed cluster of 1-core machines, and writing the software in such a way that the operating system behavior reflects that model to the extent possible, is a great architecture for extreme performance but you rarely see it outside of closed source software.

As a corollary, garbage-collected languages do not work for this at all.

I think that's a generalization that simply shifts the burden elsewhere, and cannot be said to be "the right" architecture in general. There is a reason CPUs implement cache-coherence on top of their "innate" shared-nothing design, and the reason is abstraction. If you don't need certain abstractions, then a sharded approach is indeed optimal, but if you do, then you have to implement them at some level or another, and it's often better to rely on hardware messaging (cache-coherence) than software messaging.

So for some abstractions such as column and other analytics databases, a sharded architecture works very well, but if you need isolated online transactions, strong consistency etc., then sharding no longer cuts it. Instead, you should rely on algorithms that minimize coherence traffic while maintaining a consistent shared-memory abstraction. In fact, there are algorithms that provide the exact same linear-scalability as sharding with an added small latency constant (not a constant factor but a constant addition) while providing much richer abstractions.

Similarly, your statement about "garbage-collected languages" is misplaced. First, there's that abstraction issue again -- if you need a consistent shared memory, then a good GC can be extremely effective. Second, while it is true that GCs don't contribute much to a sharded infrastructure (and may harm its performance), GCs and "GCed languages" are two different things. For example, in Java you don't have to use the GC. In fact, it is common practice for high-performance, sharded-architecture Java code to use manually-managed memory.

You can still offer isolated transactions, tunable consistency, etc. within a shard though, which Cassandra does.

And yes, you can write high performance Java, but for whatever reasons the Cassandra codebase isn't an example of that. They just did a big storage engine rewrite and the result is slower.


> You can still offer isolated transactions, tunable consistency, etc. within a shard though

And if that happens to be exactly all you need then that's great! :)

> And yes, you can write high performance Java, but for whatever reasons the Cassandra codebase isn't an example of that.

I don't know anything about the Cassandra codebase, but one thing I'm often asked is: if you're not going to use the GC in Java, why write Java at all? The answer is that the very tight core code, like a shard event loop, turns out to be a rather small part of the total codebase. There's more code dedicated to support functions (such as monitoring and management) than to the core, and relying on the GC for that makes writing all that code much more convenient without affecting your performance.

Google built multi-row transactions on top of a partitioned row store though? I guess I'm not really sure which applications have a shared memory architecture like the one you're describing.

And to be fair, some people are pretty productive in modern C++. It's a shame the JNI isn't better so you can have the best of both worlds.

(for the record, Quasar is #1 on my list of libraries to try if I go back to Java)

Of course you can implement any shared/consistent memory transaction on top of shards -- after all, the CPU implements shared memory on top of message-passing in a shared-nothing architecture, too. It's just that then you end up implementing that abstraction yourself. If you need it, someone has to implement it, and if you need it at the machine level, it's better to rely on its hardware implementation than re-implement the same thing in software. Naive implementations end up creating much more contention (i.e. slow communications) than a sophisticated use of hardware concurrency (i.e. communication) instructions.

My point is that if you're providing a shared-memory abstraction to your user (like arbitrary transactions) -- even at a very high level -- then your architecture isn't "shared-nothing", period. Somewhere in your stack there's an implementation of a shared-memory abstraction. And if you decide to call anything that doesn't use shared memory at CPU/OS level "shared nothing", then that's an arbitrary and rather senseless distinction, because even at the CPU/OS level, shared memory is implemented on top of message-passing. So the cost of a shared abstraction is incurred when it's provided to the user, and is completely independent of how it's implemented. The only way to avoid it is to restrict the programming model and not provide the abstraction. If doing that is fine for the user -- great, but there's no way to have this abstraction without paying for it.

And JNI is better now[1] (I've used JNR to implement FUSE filesystems in pure Java). JNR will serve as the basis for "JNI 2.0" -- Project Panama[2]. And thanks!

[1]: https://github.com/jnr/jnr-ffi

[2]: http://openjdk.java.net/projects/panama/

Very interesting, thanks

I really appreciate your insight, but I didn't understand why you keep implying that "open source" means some kind of lagging-behind design?

In practice, open source databases use more traditional, simpler architectures for which there is a lot of literature. Ironically, you see a lot more creativity and experimentation in closed source database architectures, and this has accrued some substantial benefits to those implementations.

The architecture at the link looks unusual compared to open source databases but it is actually a common architecture pattern in closed source databases with significant benefits, particularly when it comes to performance. There is a lot of what I would call "guild knowledge" in advanced database engine design, much like with HPC, things the small number of experts all seem to know but no one ever writes down.

It is a path dependency problem. Most open source databases were someone's first serious attempt at designing a database, a project that turned into a product. This is an excellent way to learn but it would be unrealistic to expect a thoroughly expert design for a piece of software of such complexity on the first (or second, or third) go at it. The atypical quality of PostgreSQL is a testament to the importance of having an experienced designer involved in the architecture.

It's not exactly guild knowledge; there's just a lot of legacy baggage with open source projects that were started by random people and became popular before much thought was given to the architecture. This model has been considered by Cassandra devs for at least the last two years, and there are open JIRA tickets associated with it; it just hasn't been considered a priority.

What aspect of the architecture seems unusual?

I agree. I suppose that the vast majority of closed source DBs do not follow this shared nothing architecture either.

How does one bypass the kernel for network and disk IO? I've never heard of doing this before (I thought IO always meant a system call).

Basically mapping the device registers into user space and doing exactly what the kernel would do, but without the syscall overhead.

Some researchers made a big splash at OSDI by doing this securely:


Oracle has long argued the value of bypassing the file system and associated kernel drivers with its raw devices and ASM. It'd be interesting to see such a thing land in other platforms.

For network, the kernel is bypassed through DPDK. For disk, the syscalls are still there, but they are always async IO with O_DIRECT, so the OS won't cache or buffer anything. That is what you bypass.

> As a corollary, garbage-collected languages do not work for this at all.

As someone naive about this...

Is it possible to reap some of the benefits of this kind of architecture in a managed environment (like .NET)?

If I try to implement, for example, a sqlite-like/kdb+ in-memory database in .NET, how far is it possible to go?

Or how does one avoid some common traps?

This approach seems to dovetail nicely with unikernel approaches. Have you experimented with combining the above constraints with a system like MirageOS?

Strongly agreed. It's great that these HPC techniques are finally starting to trickle down.

> one process per core, each locked to a single core; use locked local RAM only (effectively limiting NUMA); direct dedicated network queue (bypass kernel); direct storage I/O (bypass kernel)

I have no idea how to do any of these things. What are the system/API calls to lock a process to a core? How do you bypass kernel IO?

Take a look at sched_setaffinity and Intel DPDK.

They basically ditched Java in favor of C++ and used a C++ framework called Seastar.

"The Scylla design, right, is based on a modern shared-nothing approach. Scylla runs multiple engines, one per core, each with its own memory, CPU and multi-queue NIC."


I agree as well on the networking aspect, since, based on their diagram, they are utilizing Intel NICs, for which Intel provides DPDK to bypass kernel space and access the hardware from the application itself.

Now my question is: how portable will Scylla be in terms of NIC vendors?

I believe dpdk supports non-Intel NICs as well. Would also be interesting to compare scylla with kernel networking vs Cassandra.

Finally, back to the sanity of great old-school products, like Informix, by dropping Java (the whole scam) for C++14 and by paying attention to the details of the underlying OS (again).

Same trend, by the way, is in Android development.

A 10x speedup (same algorithms, same architecture) from replacing Java with C++ is not possible (~2x at most).

One of the latest benchmarks I've seen, "Comparison of Programming Languages in Economics" [1], for pure number-crunching code without any IO, shows a 1.91x to 2.69x speedup for C++ compared to Java. Any code involving IO is going to show an even smaller gap.

By replacing bad Java code with excellent, machine-aligned C++, a 10x speedup is possible.

[1] https://github.com/jesusfv/Comparison-Programming-Languages-...

You are placing way too much weight on microbenchmarks. You simply can't use them to make a sweeping statement like you just did. Writing code that is identical from language to language is not idiomatic and is not representative of how you would write each in a large-scale project such as Cassandra.

Java has a ton of overhead that C++ doesn't. Each object has metadata which results in more "cold data" in the cache. Each object is a heap allocation (unless you're lucky enough to hit the escape analysis optimization), which again leads to less cache locality because things are distributed around memory. Then there's the garbage collector. Then bounds checking.

1. I don't think that the number crunching paper I've cited is a "microbenchmark".

2. You seem to have missed the "bad Java" part, and my reply to Mechanical Sympathy.

It doesn't come from the choice of language. It comes from the choice of architecture. C++ is a tiny piece of the puzzle. It would have been hell to implement such an architecture in Java, but that is as far as the language matters.

It's particularly flawed given:

a) IO is such a large portion of the problem

b) Hypertable just isn't way, way faster.

IO is not only a large part; it is the main part. That is why it is important to get it right: Scylla, for instance, does not leave caching to the OS. It has its own caches for everything, and it never blocks on IO or page faults because all IO bypasses the kernel. And those are just two small examples.

Even on the networking side, you can see from projects like this that you can get what should be enough messaging performance for any NoSQL store out of Java: https://github.com/real-logic/Aeron

Their Java throughput is about 70% of their C++11 throughput, and that's with a pretty synthetic benchmark where there is no logic behind those messages. Once you add in some real logic, the gap gets even thinner.

They aren't doing user space networking, but that actually ought to allow Java to do even better.

> scylla for instance does not leave the cache to the OS. It has its own caches for everything

Uh-huh... that's all pretty common for databases. Cassandra would fit that description.

> Never blocks on IO or page faults because all IO bypasses the kernel.

That just seems nonsensical. Sometimes, you are waiting for IO. That's just reality. It is conceivable you bypass the kernel for I/O, but that creates a lot of complexity and limitations. Near as I can tell though, they do talk to the kernel for IO.

Cassandra has a row cache that it does not necessarily use. Most of its data sits in the linux page cache. Because the SSTables are mmaped into memory, you can't get rid of that: even if you do use the row cache, you would be at best using twice as much cache memory.

Scylla never touches the page cache. All IO in seastar is direct IO and then scylla caches everything itself. We always know when the disk access is going to happen. The OS paging mechanism does not do a thing.

As for waiting for IO, of course IO does not complete immediately. But you can either block and wait for it, as Cassandra does (it doesn't even have the option not to in the case of the mmaped regions), or you can do everything fully async, as Seastar does, which guarantees you never block waiting for IO.


By the way, I think you're replying to one of the devs of Scylla.

So, in general, I understand there is a lot going on in Scylla that does distinguish it, at least from Cassandra. There is the user-space networking logic for IO. However, a lot of the IO overhead is with disk, for example.

>However, a lot of the IO overhead is with disk, for example.

That's why they benchmarked this workload on a 4x SSD RAID configuration :). Given that IO bandwidth and throughput continue to increase while processor frequency isn't, and core counts are going up, it's prudent to design a system that can take advantage of this.

Yeah, and a 4x SSD RAID configuration is kind of overkill in the extreme for most Cassandra set ups.

I'm sure there is a way to set up IO subsystems so that Cassandra becomes a huge bottleneck, but that's a pretty specialized context.

I'm sure I am.

"a) IO is such a large portion of the problem"

I'm therefore very interested in a cluster benchmark, say 10 servers, as Cassandra claims to scale very well. With a cluster, IO has a higher performance impact than on one server with local RAID IO.

The way ScyllaDB works, it'd be really weird if it didn't scale at least as well as Cassandra.

Are there any recent comparative benchmarks of hypertable? I looked around but couldn't find any.

Not that I know of. Anecdotally though, it has some advantages, but doesn't exactly crush the competition.

[Edit: The LMAX guys showed how much more performance is possible by aligning code with the CPU/hardware (in this case for Java):

http://mechanical-sympathy.blogspot.de/ ]

That's at a microbenchmark, CPU-bound level. You usually aren't at that point.

It's exciting to finally see this. Cassandra's strengths were in its distributed architecture (no master, tunable consistency, etc.). The database engine itself has always been a bit of a mess (https://issues.apache.org/jira/browse/CASSANDRA-8099).

>The Scylla design, right, is based on a modern shared-nothing approach. Scylla runs multiple engines, one per core, each with its own memory, CPU and multi-queue NIC. We can easily reach 1 million CQL operations on a single commodity server. In addition, Scylla targets consistent low latency, under 1 ms, for inserts, deletes, and reads.

Interesting. From: http://www.scylladb.com/technology/architecture/

So on virtualized hardware, namely AWS, I'm sure the benchmarks won't be so magnificent. Needing a dedicated NIC per core is a big deal unless you're at a pretty large scale.

A modern Ethernet chipset has a large number of independent hardware queues. These can be assigned to VMs for direct access to the NIC, bypassing the hypervisor. AWS, since you used that example, offers instances with this type of direct bypass.

Just to pull an example from memory, the ubiquitous Intel 82599 10GbE NIC silicon has up to 128 TX and RX queues in hardware. IIRC, these are bundled in pairs for direct access in virtualized environments, so in principle you could have 64 virtual cores each with their own dedicated physical hardware queue. This is almost certainly what they were talking about. That is the whole point of this feature in Ethernet silicon; it gives cores (virtual or physical) dedicated network hardware off a single NIC.

Not dedicated NIC per core, but multi-queue NIC having its queues serviced by dedicated cores.

How big do you have to be to lease hardware?

Leasing is often cheaper than the alternatives.

That's what I thought. I think deploying something like this would be easier than the parent comment suggests.

"A Cassandra compatible NoSQL column store, at 1MM transactions/sec per server."

Personal pet-peeve of mine. Using "TPS" or "Transactions/sec" to measure something that is in no way transactional. Maybe ops/sec, reads/sec, updates/sec, or something...

Add my pet peeve: not listing latency stats. Bigtable does millions of ops/sec, but it can take 5(!) seconds to complete one. That's the stat that matters to customers.

> The test hardware configuration includes:

> 1 DB server (Cassandra / Scylla)

The whole point of Cassandra is to run a cluster of servers to handle load at scale with minimal friction instead of having to buy a big single machine or spend all your time/money trying to run a clustered RDBMS. This test doesn't measure the correct thing.

Cassandra's performance scales linearly with the number of nodes though, so per-node performance definitely matters. Probably not 10x, but probably not 1x either.


Numbers look great, but so do /dev/null's. What guarantees does it make?

Has it been through Jepsen yet?

It's planned. However, I don't believe we'll pass it today. We're targeting GA for Jan and we'll give it our best shot.

Yeah, I was wondering about that. It looks like you guys have done some brilliant work with the storage engine, but reimplementing all the distributed logic is another (possibly bigger) project.

Honestly, Cassandra's Jepsen didn't set a high bar:


Except that problem has been largely addressed now.

I really really really want to see Aphyr attack the patched version to see if he thinks the fix actually worked.

Datastax is presenting on the topic at their Summit on thursday http://cassandrasummit-datastax.com/agenda/testing-cassandra...

How so? The fundamental flaw was using timestamps.

Right, I should add that it was two years ago. My point is that the age of a project has nothing to do with the correctness of its Paxos implementation.

Jepsen's finding wasn't that there was a bug in Paxos. It was in how it handled conflicts.

"So you confer with DataStax for a while, and they manage to reproduce and fix the bug: #6029 (Lightweight transactions race render primary key useless), and #5985 (Paxos replay of in progress update is incorrect). You start building patched versions of Cassandra."

"Cassandra lightweight transactions are not even close to correct. Depending on throughput, they may drop anywhere from 1-5% of acknowledged writes–and this doesn’t even require a network partition to demonstrate. It’s just a broken implementation of Paxos. In addition to the deadlock bug, these Jepsen tests revealed #6012 (Cassandra may accept multiple proposals for a single Paxos round) and #6013 (unnecessarily high false negative probabilities)."

That's four bugs independent of the conflict resolution issue.

Oh LWT's are a mess, indeed. Thankfully, you don't normally need them.

Wait, did I read that right? The test was with one (1) server? What's the point of that? Smells like a cooked-up test.

The point of that is to show how efficient a single node can be, because that is what is being replaced.

All the external-facing things in Scylla are the same as in Cassandra. That includes all the ring stuff and all the network protocols.

So you should expect similar cluster behavior.

>> So you should expect similar cluster behavior.

I would expect nothing.

If their numbers were astounding with a 10-, 100-, or 1000-node cluster, they would have published numbers with such setups. I call shenanigans on a report that is purposely out of line with the expected use case.

Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure...

There is nothing commodity about a server with 128GB RAM.

When you introduce other nodes, you get chatter and network traffic...

Nothing commodity about a server with 128GB of RAM? At list price, you can configure one of Dell's entry-level servers with 128GB of RAM for less than $3,500.


The Dell servers you point to do not have 48 logical cores, either. That CPU runs $2.2K by itself.

> There is nothing commodity about a server with 128GB RAM.

Except I can launch instances with more RAM than that on EC2, so that's just not a fact.

There is literally zero mention that in the event of a network partition they will just drop messages on the floor. This is fine for a cache, perhaps replacing Redis... but as a Cassandra replacement this is pretty scary.

Where did you get this from? I hope that's not a conclusion from the benchmark doing single server load testing.


The license is the Affero GPL, which means you need to open-source your code even if you only use it to run a service. The "traditional" GPL was effective only upon redistribution. That means you would need to go for a commercial license whenever you build a service on it, which in fact is a fair approach to a business model when there is a company behind an open source project. Especially since this time there is no lock-in: you could always go back to Cassandra.

The virality doesn't cross the database interface layer.

Modifications to the database software must be shared, yes, but your client application is outside the reach of the AGPL and can remain proprietary.

Wow, that autoadvancing website is a deal breaker.

ScyllaDB web person here. If I made it so that you could block one script and get a home page without the horizontally scrolling thingy (but have all the other JS stuff work including syntax highlighting and graphs), would you come back? ( dmarti@scylladb.com )

NoScript FTW

Which JVM did they use? What flags were passed to the JVM?

will this be the next docker in the nosql database?

I don't even see any connection; how could this be the next Docker in the NoSQL world? In other words, I didn't get at all what you mean...

What does this mean?

It's nonsensical buzzwords.

I guess the poster's underlying question is "will this database become hyped as the Next Big Thing"
