"For instance, on a cluster of five m3.large nodes, an even mixture of processes performing single-row inserts and selects over a few hundred rows pushed ~40 inserts and ~20 reads per second (steady-state)."

Wow, that's...not fast.

(Cockroach Labs CTO here)

Jepsen is essentially a worst-case stress test for a consistent database: all the transactions conflict with each other. CockroachDB's optimistic concurrency control performs worse with this kind of workload than the pessimistic (lock-based) concurrency control seen in most non-distributed databases. But this high-contention scenario is not the norm for most databases. Most transactions don't conflict with each other and can proceed without waiting. Without contention, CockroachDB can serve thousands of reads or writes per second per node (and depending on access patterns, can scale linearly with the number of nodes).

And of course, we're continually working on performance (it's our main focus as we work towards 1.0), and things will get better on both the high- and low-contention scenarios. Just today we're about to land a major improvement for many high-contention tests, although it didn't speed up the jepsen tests as much as we had hoped.

I can verify.

I've been testing with node.js's pg client (not even using the 'native' option) and on 1 node I get > 1,000 single inserts /sec. I'm seeing this scale linearly too. NB I'm testing on a pretty overloaded macbook with only 8gb RAM. On a decent-sized cluster with production-grade hardware and network, performance is really unlikely to be a bottleneck. Also, it'll significantly improve during this year, one imagines.

Couldn't this be a problem for application authors and a potential source of DoS activities? It seems this should be a big red warning flag somewhere.

Not really.

All databases tend to struggle under contention. If you are an application author building a high-scale app, you need to avoid contention by design. That single counter row you were planning to increment on every visit? Bad idea; maybe insert a new visit record instead.
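To make the two designs concrete, here is a minimal sketch. It uses an in-memory SQLite database purely so the snippet runs anywhere; SQLite will not reproduce the contention itself, and the table and function names (`page_counter`, `visits`, `record_visit_hot_row`, `record_visit_append`) are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for a real database so the schema shapes are runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_counter (page TEXT PRIMARY KEY, hits INTEGER NOT NULL);
    CREATE TABLE visits (id INTEGER PRIMARY KEY, page TEXT NOT NULL, visited_at TEXT NOT NULL);
    INSERT INTO page_counter VALUES ('/home', 0);
""")

def record_visit_hot_row(page):
    # Contended design: every visit rewrites the same row,
    # so concurrent transactions all conflict on one key.
    conn.execute("UPDATE page_counter SET hits = hits + 1 WHERE page = ?", (page,))

def record_visit_append(page):
    # Low-contention design: every visit inserts its own row,
    # so no two visits write to the same key.
    conn.execute(
        "INSERT INTO visits (page, visited_at) VALUES (?, datetime('now'))", (page,)
    )

def count_visits(page):
    # The total is derived at read time instead of being maintained on every write.
    row = conn.execute("SELECT COUNT(*) FROM visits WHERE page = ?", (page,)).fetchone()
    return row[0]

for _ in range(3):
    record_visit_hot_row("/home")
    record_visit_append("/home")

hits = conn.execute(
    "SELECT hits FROM page_counter WHERE page = ?", ("/home",)
).fetchone()[0]
print(hits, count_visits("/home"))  # both designs report 3 visits
```

The trade-off is that the append design pays for the count at read time (or in a periodic rollup) rather than on every write.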

The contention in this example is an aspect of the test design, not database design. You wouldn't try to build an app this way.

It'd only be a DoS vector if the rest of the db bogged down during this type of contention.

I just want to emphasize that:

a.) this is a beta product where the team has been focusing on completeness and correctness before performance

b.) I was testing Cockroach from August through October--Cockroach Labs has put a good deal of work into performance in the last three months, but that work hasn't been merged into mainline yet. I expect things should be a good deal faster by 1.0.

c.) Jepsen does some intentionally pathological things to databases. For instance, we choose extremely small Cockroach shard sizes to force constant rebalancing of the keyspace under inserts. Don't treat this as representative of a production workload.

They need to be at least 1000x to 10,000x faster on the read side though

I agree that it can and should be faster, but the targets will depend on what you're going for. 20k linearizable reads/sec on EBS-backed instances is, uh, maybe a tad optimistic.

I'm a little confused about how read speed is 2x slower than write speed. With respect to 'correctness', you're drifting into pyrrhic victory or 'not even wrong' territory at that point.

When there are basic expectations of behavior that aren't being met, many of us would reject the idea that this code is 'correct'.

[Disclaimer: CockroachDB engineer here, working on performance and benchmarking]

IIRC, in the case that aphyr refers to for these specific numbers, the reads are scans that span multiple shards[1], while the writes are writes to single shards.

[1] even though aphyr says it's just a hundred rows, the tables are split into multiple shards because aphyr in this case was specifically testing our correctness in multi-shard contention scenarios. In production you wouldn't have multi-shard reads crop up until you were doing scans for tables that were hundreds of thousands of rows in size[2]. It's easier to picture if you think of this performance speed difference in the scenario where you are doing full table scans spanning multiple shards on multiple computers while the underlying rows are being rapidly mutated by contending write transactions. The transactions get constantly aborted to avoid giving a non-serializable result, and performance is suffering. We agree that the numbers in this contention scenario are too low, and we are actively working on high-contention performance (and performance in general) leading up to our 1.0 release[3].

[2] Specifically, we break up into multiple shards when a single shard exceeds 64mb in size.

[3] You can follow along on one of the PRs that address this specific performance issue here: https://github.com/cockroachdb/cockroach/pull/13501

Oh, I see. That makes sense, thank you.

The benchmark isn't quite apples to apples. I'm used to people doing put/get benchmarks, not put/search workflows. Merging results from multiple machines is nothing to sneeze at, especially if you have replication between nodes.

I worked for a search engine dotbomb, and I had a constant worry that the core engineering team had not solved a very similar problem. But since the funding round failed I'll never know for sure what they had planned.

Are read-only transactions handled without contention? I understand that a read-write transaction needs to be aborted if a later conflicting write is committed. But read-only transactions can view a snapshot of what was committed before they begin: you trade a little bit of latency for being conflict-free. It doesn't even have to be the default as long as you offer it as an option; you might want this if you know you are reading hot keys, or as a fallback if you've been aborted X times.

This was IMO the big innovation of TrueTime in Spanner: contention-free reads at either now, or at some time in the past.

Yes, if your timestamp is far enough in the past (this is determined by the maximum clock offset configured for the cluster), and if you don't collide with writes of a long-running transaction (which may run at a timestamp that could influence what you read), you will not have to worry about contention. What you're suggesting are definitely options for reading consistent, if slightly out of date, information from a hot keyspace. There's a blog post on time travel queries, which conveniently expose the required functionality as specified by the SQL standard.
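A toy model of why a read at a past timestamp does not contend with later writes, assuming a simple multi-version store. This is only a sketch of the general idea, not CockroachDB's implementation; `MVCCStore`, `write`, and `read_as_of` are hypothetical names, and it ignores clock offsets and in-flight transactions entirely.

```python
import bisect

class MVCCStore:
    """Toy multi-version store: every write keeps its timestamp (illustrative only)."""

    def __init__(self):
        # key -> (sorted list of timestamps, list of values in the same order)
        self.versions = {}

    def write(self, key, timestamp, value):
        timestamps, values = self.versions.setdefault(key, ([], []))
        idx = bisect.bisect_right(timestamps, timestamp)
        timestamps.insert(idx, timestamp)
        values.insert(idx, value)

    def read_as_of(self, key, timestamp):
        # Return the newest version written at or before `timestamp`.
        timestamps, values = self.versions.get(key, ([], []))
        idx = bisect.bisect_right(timestamps, timestamp)
        return values[idx - 1] if idx else None

store = MVCCStore()
store.write("balance", 10, 100)
store.write("balance", 20, 150)

snapshot_ts = 15
before = store.read_as_of("balance", snapshot_ts)

# A later write lands while the reader is still working at snapshot_ts.
store.write("balance", 30, 175)
after = store.read_as_of("balance", snapshot_ts)

print(before, after)                    # the snapshot read is unchanged: 100 100
print(store.read_as_of("balance", 30))  # a read at the newer timestamp sees 175
```

Because the reader only looks at versions at or before its timestamp, newer writes never invalidate what it has already seen.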

As I mentioned in the article's introductory paragraphs, where Spanner forces a minimum latency on writes to ensure consistency, CockroachDB pushes that latency to reads which contend with a write.

Latency is one thing; throughput is a different animal. 2 reads/second per m3.large node really forces one to think there are some severe architectural issues.

You're not wrong. Whoever downvoted you is pushing an agenda and I'm not happy about it.

Reads per second is ultimately a measure of how many operations can be retired per second, not how long the operation takes. Anyone who has spent fifteen minutes learning about capacity planning should know that. 5 reads per second doesn't mean each read takes 200ms. It might mean 1 second per read spread across 5 threads of execution, or 3 seconds per read across 15 workers.
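The arithmetic behind that, as a small sketch: steady-state throughput is the number of concurrent workers divided by the time each operation takes. The `throughput` helper is just an illustration of that relationship, not a measurement of anything.

```python
def throughput(workers, latency_s):
    """Operations retired per second when `workers` each take `latency_s` per operation."""
    return workers / latency_s

print(throughput(5, 1.0))   # 5 workers, 1 second per read
print(throughput(15, 3.0))  # 15 workers, 3 seconds per read
print(throughput(1, 0.2))   # 1 worker, 200ms per read
```

All three configurations retire 5 reads per second, so the throughput number alone does not tell you the per-read latency.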

The parent poster (and many others in this thread) are assuming that performance under this deliberately pathological test is reflective of performance in the real world.

Optimistic locking systems inherently perform poorly under contention. But they also perform better than pessimistic concurrency systems overall because in the real world we design applications to avoid contention.
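A deliberately simplified simulation of that behaviour, assuming a version-check style of optimistic concurrency: each client reads a version, then commits only if nobody else committed in between. This is a toy model, not CockroachDB's actual protocol; `Store` and `run_rounds` are hypothetical names.

```python
class Store:
    """Toy versioned key-value store with commit-time validation."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, 0))

    def commit(self, key, read_version, new_value):
        # Validation: succeed only if nobody committed since we read.
        current_version, _ = self.data.get(key, (0, 0))
        if current_version != read_version:
            return False  # conflict, the caller has to retry
        self.data[key] = (current_version + 1, new_value)
        return True

def run_rounds(store, key_per_client, increments):
    """Each round, all pending clients read first, then all try to commit."""
    pending = {client: increments for client in range(len(key_per_client))}
    aborts = 0
    while any(pending.values()):
        active = [c for c, left in pending.items() if left > 0]
        reads = {c: store.read(key_per_client[c]) for c in active}
        for c in active:
            version, value = reads[c]
            if store.commit(key_per_client[c], version, value + 1):
                pending[c] -= 1
            else:
                aborts += 1
    return aborts

hot_store = Store()
hot_aborts = run_rounds(hot_store, ["k", "k", "k", "k"], increments=3)

cold_store = Store()
cold_aborts = run_rounds(cold_store, ["a", "b", "c", "d"], increments=3)

print(hot_aborts, cold_aborts)  # 18 aborts on the shared key, 0 on separate keys
```

Both runs apply the same 12 increments; the only difference is whether the clients share a key.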

As an example, the Google App Engine datastore runs zillions of QPS across petabytes of data in a massive distributed cluster. But if you build an app that does nothing but mutate a single piece of state over and over, you'll top out at a couple transactions per second. This is painful if you're trying to build a simple counter, but with minimal care you can build a system that scales to any sized dataset and traffic volume.

So the benchmark is only informative from the standpoint of determining which records should be split to avoid concurrent writes.

For instance you wouldn't expect a single user to make 5 comments or upvotes per second so storing data about recent activity with the user isn't a bottleneck that you need to design for. Storing data about responses with an item might also be okay as long as you don't plan to be HN or Reddit (Github, for example, would be just fine). But if you want to track activity globally (eg, managing the watch list notifications in github), you will need to design around that number.

Aphyr's test is not a benchmark. It is a test of database correctness under a carefully constructed set of pathological circumstances. It cannot be used to infer real-world performance behavior.

Yes, the general advice for users of the GAE datastore is to build Entity Groups around the data for a single user. That isn't absolute though, and it doesn't cause problems for watch lists; the Watch can be part of the User's EG rather than the Issue's EG. Or it can be its own EG. In practice this doesn't require as much consideration as you probably imagine.

Regardless of locking strategy, that throughput is abysmal. Nobody would complain about high latency.

> I'm a little confused about how read speed is 2x slower than write speed.

FWIW, you're also describing Cassandra and it seems to do fine in the marketplace.

Also describing random read/write disk benchmarks.

I wonder how the clock skew bounds affect performance. Would a more accurate clock increase throughput?

The clock skew limit controls the maximum latency on reads--so yeah, if you're doing reads that contend with writes, improving clock resolution improves latency. Not entiiirely sure about the throughput impact though.

At a greater risk of losing a server in your cluster due to automatic shutdown after too much clock-drift, right?

You wouldn't reduce the clock drift limit unless you had more accurate clocks, however. If all of your servers are on bare metal and synced to a high-resolution atomic or GPS clock, then you can probably reduce it from 250ms to 10ms or less, similar to Google's Spanner setup.

Even if they manage to multiply this by 100 on the final release, it's still way weaker than a regular sql db. I hope they have another selling point than performance.

There is a better response to this up in this thread (https://news.ycombinator.com/item?id=13661735).

The test is testing the worst-case scenario of everything needing locking. CockroachDB uses an optimistic "locking" approach, which makes this bad. But if your use case is strictly linearizable high-throughput reads, good luck finding anything that is not a single-machine database.

That's actually the only good answer I received to my comment.

Horizontal scalability might be one.

But why scale something when you can just run one postgres instance and provide 10000x the performance?

Your one machine's disk fails and you'll lose data if you have anything other than synchronous replication. And once you have synchronous replication let's hope it is on at least a different rack if in the same data-center. Preferably in a different geographical region actually. And now you no longer have "really fast" QPS.

All this assumes you have only so much data that can fit in one machine. If that's not the case you're in need of a database that spans more than one machine.

My thoughts exactly. It's already the case with faster DBs. E.g., most projects my customers have me work on will never need scaling across multiple instances. They run really fast on a single postgres instance.

One had some perf issues.

First he just moved the DB to avoid having it on the same machine as the web server. Saved him a year.

Then as the user base grew, he just bought a bigger server for the db.

Right now it works really fast for 400k u/d, and there are still 2 offers of servers more powerful than the current one he has. By the time he needs to buy them, 2 new, even more incredible offers will probably exist. Which is a joke because more traffic means he has more money anyway.

Having a cluster of nodes to scale is a lot of work. So unless you have 100 million users doing crazy things, what's the point?

You do understand that scalability is more than just concurrent users, right? Maybe not?

There are many very small startups doing work in IoT, Analytics, Social, Finance, Health etc who have ridiculously challenging storage needs. Trying to vertically scale is simply not an option when you have a 50 node Spark cluster or 1M IoT devices all streaming data into your database. Vertically scaling your DB doesn't work as it is largely an I/O challenge not a CPU/Storage one.

But hey it's cute and lucky that you work on small projects. But many of us aren't and have to deal with these more exotic database solutions.

This is a very niche use case. The vast majority of projects neither have the user base to reach such numbers, nor are they I/O challenged. Most companies are not facebook or google.

Plus if you manage to do 50 writes/sec on this DB, I don't see how you plan to scale anything with it. You are back to using good old sharding with load balancers.

You're right, most people probably won't need cockroach db, but in my opinion, one of the nicest things about CDB is that it uses the postgres wire protocol.

So by all means, scale up your single DB server. But if you ever get the happy issue of even more growth, and needing to scale beyond a single server, then you "should" be able to migrate your data to a cockroachDB cluster, and not have to change much, if anything, in your application.

If your dataset fits comfortably on one postgres instance, and will continue to do so for your current architectural planning time horizon, then you have little need to use CockroachDB / Spanner.

These databases are designed for use-cases which require the consistency of a relational database, but cannot fit on a single instance.

Right now, not really. Cockroach's current performance doesn't allow you to have a big dataset.

You are misunderstanding this article. This is not a benchmark, this is a test of how correct the database is with distributed transactions and data in the worst conditions possible. These are not real-world performance numbers in any sense.

You are misunderstanding these comments.

The problem is not just the performance; it's that distributing has a huge cost in terms of servers and maintenance.

If you can write only 50 times a second, your data set won't get big enough to justify distributing it.

Put your millions of rows in one server and be done with it. Cheaper, faster, easier.

There is a tendency nowadays to make things distributed for the sake of it.

Distribution is a constraint, not a feature.

Why do you keep repeating 50 w/s when that's not an actual performance number? CDB will likely run with thousands of ops/sec per node.

Can you really not see why distributed databases are needed? High availability, (geo) replication, active/active, oversized data, concurrent users, and parallel queries are just a few of the reasons.

Distribution will always come with a cost, but the tradeoffs are for every application to make. We use a distributed SQL database and while it's faster than any single-node standard relational DB would be, speed isn't the reason we use it.

Can you elaborate on this point please? My reading of the article and docs is that perf is expected to scale linearly with the number of nodes (and therefore with dataset size).

Let's say you get 50 writes / sec per node. You need what, 1000 nodes to get onto the same playing field as Postgres, which has a simpler and cheaper setup? Right now it's really not competitive, unless they improve the performance 1000 times. It makes no sense to buy and administer many more machines to get the power you can have with one machine on another, proven tech.

> Let's say you get 50 writes / sec per node.

If the DB can only handle 50 ops/sec then the point you are making here is valid.

But see https://news.ycombinator.com/item?id=13661349, that's a pathological worst-case number. You should find more accurate performance numbers for real-world workloads before making sweeping conclusions.

Your original comment was:

> Even if they manage to multiply this by 100 on the final release, it's still way weaker than a regular sql db

This is what my comment, and the sibling comments, are objecting to, and I don't think you've substantiated this claim. 100x perf is probably well in excess of 10k writes/sec/node, which is solid single-node performance (though you'd not run a one-node CockroachDB deployment). Even a 10x improvement would get the system to above 1k writes/sec/node, which would allow large clusters (O(100) nodes) to serve more data than a SQL instance could handle.

Obviously I'd prefer to be able to outperform a SQL instance on dataset size with 10 nodes, but for a large company, throwing 100 (or 1000) nodes at a business-critical dataset is not the end of the world.

This statement makes absolutely no sense.

Performance is very loosely correlated with dataset size and less so in most distributed NoSQL databases like CockroachDB.

At a given point in time, yes, but to get this data set, you need to write it into the DB. A big dataset implies either that you have been receiving data for a very long time or that you loaded it quickly over a shorter period. It's usually the latter, which means you need fast writes. 50 writes / sec is terrible, even more so if improving that means buying more servers while your 30 euros / month postgres instance can handle much more than that.

What I found out the hard way is that there's a qualitative difference between 'can' and 'must' here that causes a lot of problems with the development cycle.

When the project can no longer fit onto a developer's box it changes a bunch of dynamics and often not for the better. Lots of regressions slip in, because developers start to believe that the glitches they see are caused by other people touching things they shouldn't.

With Mongo we had to start using clustering before we even got out of alpha, and since this was an on-premises application now we had to explain to customers why they had to buy two more machines. Awkward.

Plus I found that people using MongoDB tend to not formalize their data schema because the tool doesn't enforce it. But they do have a data schema:

- it's implicit, and you have to inspect the db and the code to understand it.

- there is no single source of truth, so any changes had better be backed up by unit tests. And training is awkward. If some fields/values are rarely used, you can easily end up not knowing about them while "peeking at the data".

- the constraints are delegated to devs, so code making checks is scattered all over the place, which makes consistency suffer. I almost ALWAYS find duplicate or invalid entries in any mongo storage I have to work on. And of course, again, changes and training are a pain. "Is this field mandatory?" is a question I asked way too often in those mongo gigs.

Things like redis suffer less from this because they are so simple it's hard to get wrong.

But mongo is great software. It's powerful. You can use it for virtually anything, whatever the size, quantity, or shape. And so people do.

The funny thing is, the best MongoDB users I've met come from an SQL background. They know the pros and the cons, and use the flexibility of Mongo without making a mess.

But people often choose mongo as a way to avoid learning how to deal with a DB.

Oh, I'll never use Mongo again. I think they believe their own PR, and I feel duped by the people who pushed it into our project against the reservations of more than half of the team. In fact I'd rather not work with anyone involved in that whole thing.

But I didn't want to turn the Cockroach scalability discussion into another roast of Mongo so I very purposefully only talked about the horizontal scaling vis a vis 'immediately have to scale' versus 'very capable of scaling' horizontally.

If you gave me a DB that always had to be clustered but I could run it as two or three processes on a single machine without exhausting memory, I would be fine with that as long as the tooling was pretty good. As a dev I need to simulate production environments, not model them exactly.

But if the whole thing only performs when it's effectively an in-memory DHT, there are other tools that already do that for me and don't bullshit me about their purpose.

That last statement is a bit off. Almost every developer knows how to use an SQL database. For 95% of things you want to do it's a very simple piece of technology.

The issue is that they are annoying to use. You have to create a schema, you have to manage the evolution of that schema (great another tool), you have to build out these complicated relational models and finally you have to duplicate that schema in code.

And so IMHO that's why people move towards MongoDB. It's amazing for prototyping, and then people just, well, stick with it, as there isn't a compelling enough reason to switch.

See, I would agree with you but I'm aware that we may be dating ourselves.

It's been so long since I touched SQL that I'm forgetting how. Which means that my coworkers haven't touched it either, and some of them that were junior are probably calling themselves 'senior' now, having hardly ever touched SQL. It's no wonder noSQL can waltz into that landscape.

But what's the old rule? If your code doesn't internally reflect the business model it's designed for, then you'll constantly wrestle with impedance mismatches and inefficiencies, and eventually the project will grind to a halt.

So what I'd rather see is a database that is designed to work the way modern, 'good' persistence frameworks advertise themselves as working. That would represent progress to me.

> - it's implicit, and you have to inspect the db and the code to understand it.

Or run ToroDB Stampede [1] and look at the generated tables manually or via other tools like Schema Spy [2].

[1] https://www.torodb.com/stampede/

[2] http://schemaspy.sourceforge.net/

(warning: I'm a ToroDB dev)

Storage and availability. But I agree, it will not be an easy trade, until they improve the performance significantly.

Scale != performance, in the narrow aspect that you seem to be considering.

Lots of data, parallel queries, high-availability, replication, multi-master, etc are all instances that require true distributed scalability.

It's named cockroachDB...
