Can't wait to see it get the Jepsen treatment.

jrallison · on Oct 9, 2014

It won't do well, but at least Antirez is up front about the shortcomings.

The article itself states that Redis Cluster will lose acknowledged writes during partitions (no P), will be eventually consistent (no C), and won't be available during partitions (no A).

antirez · on Oct 9, 2014

Yep, Jepsen is more suitable to check systems that claim either linerizability, or at least write safety, during partitions. I guess that a modified version of Jepsen could be used in order to validate the failure modes or to discover other unexpected ones that at human inspection look easy to reproduce in actual production environments. Also I don't know if Jepsen is good at this, but in theory it could be instrumented in order to check how good the implementation is, which is, even if it is not designed for write safety during partitions, how better the countermeasures are working?

bkirwi · on Oct 9, 2014

In theory, Jepsen or a Jepsen-like system should be able to check any of these failure modes.

On the other hand, it sounds like Redis Cluster offers few hard guarantees; instead, it promises that failures should be rare 'in practice'. Which is a fine thing for a tool to do, of course, but it makes things less amenable to the kind of stress-testing Jepsen does -- since running inside Jepsen's little universe is about as far from normal operation as you can get. If you already know that a system can fail in a certain way, getting Jepsen to reproduce that failure tells you very little.

If you'd like to make this kind of testing possible, it would be useful to state as many 'positive' rules as possible, which Redis Cluster should always respect -- things like "if a majority of nodes are fully connected, they should always accept writes" and "an unpartitioned cluster should always agree on the same value" -- alongside the documentation on ways it might fail. This way, clients can be assured of the 'bare minimum' that the system supports, and tools like Jepsen can give you more useful information.

antirez · on Oct 9, 2014

Oh, there are definitely hard rules like that. For example a majority partition never accepts queries, and when there are no partitions at all Redis Cluster guarantees to converge on a single value for each key, and to a single view of the cluster configuration. I'll try to document better this things, but basically they arise from the simple algorithm that makes the configuration eventually consistent.

bkirwi · on Oct 9, 2014

Neat; that should be very useful.

It's great to see that Redis has a official story for clustering / failover out; like you said in the post, the worst distributed systems are the ones you have to rewrite every single time. It's going to be interesting watching this evolve.

swartkrans · on Oct 9, 2014

Seems like given the above problems you would only want to use this with non-critical data of which you didn't care a great deal about its integrity. This doesn't describe a lot of use cases.

It's not too hard to get memcached running for cluster of machines, nor is it really all that difficult to get started with cassandra for basic applications, although I haven't done it. I've heard ReThinkDB wants to be a serious contender and that they've tried to solve these problems, but I haven't read about how great it is nor have I heard anyone use it in production so I don't know where it stands.

seiji · on Oct 9, 2014

This doesn't describe a lot of use cases.

Do you have any idea how many people run master/slave mysql? That's async and will definitely give you data drift over time. There's an entire suite of Percona tools to detect and repair broken mysql datasets because mysql breaks data so often due to replication problems.

The world is much more async and much less consistent than anybody realizes (cash machines are even async), but everything still works pretty much okay. The last line of defense for solving irreconcilable consistency problems is customer service.

jeremyjh · on Oct 9, 2014

Cash machines are definitely not async in normal operations, though the authorization network is configured with rules for stand-in mode when third-party systems are unavailable. Without the primary authorization system immediately available the ATM is hard down.

jrallison · on Oct 9, 2014

async != an incorrect system. Cash machines are a good example of an asynchronous, eventually consistent system that works. As is any well built eventually consistent data store.

Redis/Mysql has it's problems, as does all databases. However, as a practitioner it's important I know the shortcomings of each, so that I can know exactly what a certain database is offering in terms of safety and whether it's a good fit for my use-case. Antirez being open about them is a huge +1 in my book.

Data drift or loss of acknowledged writes isn't correct for a wide range of use-cases. If you know the limits, you can choose the incorrect system, and deal with the fallout when that "rare" event does happen.

seiji · on Oct 9, 2014

async != an incorrect system.

Very correct! Most of the mysql problems derive from it being statement replication and not log shipping. A proper postgres replication setup will be less prone to having data inconstancies.

antirez · on Oct 9, 2014

Note that systems with asynchronous replication and normal failover procedures are all subject to the same rules as Redis Cluster, and I believe there is a great deal of people running RDBMs in this setup. So IMHO actually there are a lot of use cases.

IgorPartola · on Oct 9, 2014

From what I understand, there are two types of systems really (in terms of CAP): those that do async replication and those that do sync replication.

Async replicated systems can be (but are not necessarily) AP: each system can process requests independently of the other, and they'll get consistent eventually (how they arrive at this consistency is different between systems, and is not always "correct" for the general use case).

Sync replicated systems can be (but are not necessarily) CA: as long as replication is not broken they are available and consistent. These systems have the upside in that they allow you to serve read-only data during partitions, but have slow writes because of the sync replication.

Both types can have features removed, for whatever reason. For example, you could have a sync replicated system that doesn't serve any requests when the replication is broken. Or you could have an async system that doesn't provide a mechanism to correct inconsistencies (a la MySQL async replication). What you describe sounds like an imperfect version of the async system: not highly available, not consistent, and not partition tolerant. Given that better AP systems exist, ones that actually provide things like availability during partitions and eventual consistency mechanisms/guarantees, will Redis Cluster eventually support these features?

Edit: please don't take any of this as criticism. Redis is an awesome product that I use and rely on often. Thanks for all the great work.

antirez · on Oct 9, 2014

Hello Igor,

basically we can rule out synchronous systems: product mismatch from the POV of Redis. We are left with AP systems. AP systems trying to get as good as they can from the POV of Availability, and the consistency model provided (even if not "C"), do these two additional things that Redis Cluster is not able to do:

1) They are available in the minority partition. Redis Cluster has some degree of availability, but not to that extend. True "A" means, serve queries even if you are the only node left.

2) They provide more write safety, unless you use it with last-write-win strategies or alike.

"1" is actually related to "2". If you merge values later, you can accept writes even if the node has no clue about what is the state of the rest of the system.

About "2", if you check around, you'll immediately figure out that there are no systems available that are AP and can easily model Redis data structures with the same performances, and having the same number of elements per value. This is because to merge huge stuff is time consuming, requires a lot of meta data, and so forth.

Now there is another constraint, that I don't want a system with application-assisted merging. For example recent Riak versions provide data structures with a default merge strategy, so we have examples of similar stuff. I can assure you that there is either meta-data needed to merge, that would make Redis a lot less memory efficient, and sometimes, there are simply no good strategies. For example take the "Hash" type. There is a partition and in one node I set as "name" field of user 1234 "George" and in another node "Melissa". There is no better merge strategy than last-write-wins, basically.

Now, that said, there are certain things that don't add major overhead, AND, improve write safety. For example if the connection is declared as "SAFE" we could cache all the SADD commands (and others having similar properties) until they are acknowledged by all the nodes serving this key. Unacknowledged writes are retained as a "log" of commands, and are re-played when the node is turned into a slave of a new master.

So there are plans to improve the consistency model, but it is unlikely that we go towards the Riak model, but the good thing is, there are stores like Riak that are exploring this other way of doing data structures without application assisted merge functions, with per element meta data and so forth.

IMHO what there is to ponder is that at the end of the day, is that a distributed database is the sum of what it offers when there are not partitions, and what it offers when there are partitions. Often to improve the latter requires to give up something in the former: simplicity, space efficiency, data model freedom, ...

IgorPartola · on Oct 9, 2014

Thanks for the explanation. Is it possible to add something like "this record was not merged cleanly" flag? Basically, from what I understand without sync replication you cannot theoretically have a system that does the correct merge in all use cases. MySQL would deal with this by letting two replicas contain different data until you scan for this in your application code and detect it. Other systems (I think Cassandra) will attempt to merge things in the background after the partition is gone, but with strategies like "last write wins" or even versioning you can still get bad records. I am thinking of a system that can do whatever strategy you choose, or even just "last write wins" but warns you that a write on this key was lost. That way you can either ignore it if that's appropriate (some types of sensors for example, that frequently update values) or you can bubble this error up your application code until a user can fix it. On the other hand I definitely can see this as a performance issue so it might be more appropriate for a datastore that is aiming for high consistency while still allowing AP mode.

seiji · on Oct 9, 2014

Picking out the three points: not highly available, not consistent, and not partition tolerant.

[reverting to non-standard, dummy definitions below just to explicitly define some things]

Redis Cluster is available as long as (50% + 1) of the master nodes are reachable by each other. So, 15 masters will be more "available" than 3 masters if they are deployed across a sane network topology. Your masters will have replicas as well, so if a single master instance fails, a replica for that master will be elected to be the new master.

Redis Cluster is consistent in that each replica is an exact copy of its attached master (by default, each master owns N/16384 of the keyspace, where N is the total number of masters). You can lose writes if the master doesn't have time to replicate before the master fails, but that's the nature of async replication. (You can request your Redis client "wait" for at least N replicas to receive your commands before the command completes; this gives you some assurances.) It's possible for you to get an "inconsistent read" if you read from an async replica after writing to the master, but you have perfect read-after-write consistency if only talk to master instances.

Redis Cluster nodes on the minority side of a partition will, by default, deny reads and writes until they re-join the cluster. Redis Cluster prefers data integrity over availability by default since Redis Cluster has no merge operations. You can optionally allow the minority side to keep accepting commands, but that definitely will not guarantee any consistency of your data when the whole cluster reappears.

Given that better AP systems exist, ones that actually provide things like availability during partitions and eventual consistency mechanisms/guarantees, will Redis Cluster eventually support these features?

The one limitation for each of those features is: Redis is an in-memory database. Riak uses multiple KB of metadata for each object in the DB. With Redis, it's common to have tens of millions (or hundreds of millions) of keys on one server.

Availability during partitions with eventual consistency (if you want more than last-write-wins) requires CRDT-like things, which requires metadata, which requires more memory usage per-key.

Full and usable availability during partitions is nice to think about, and it would be great to one day have the option to maybe have in-memory Redis "better data consistency for smaller, manageable datasets" with CRDT merge operations so you get a Reis-dynamo type thing.

But, since Redis Cluster is already 4 years under development, it's not worth burning brain cycles on until real users start using Redis Cluster and we can see where the needs-vs-features plot falls. (plus, the issue backlog for other Redis development improvements is about 100 tasks long, so speculative improvements have to fall in line with the balance of Cluster vs. Standalone vs. Master-Replica vs. New Commands vs. Improving Existing Commands vs. Bug Fixes vs. User Contributions vs. Doc Updates vs. Evangelism vs. ...).

giaour · on Oct 9, 2014

That's the use case of everyone running MongoDB, so Redis Cluster should be able to find a significant user base.

vidarh · on Oct 10, 2014

> . This doesn't describe a lot of use cases.

It describes a huge proportion of the data I've ever worked with, because there are massive amount of use cases where what you are working with does not have to be (and as the systems scale: rarely is) the single source of truth for that data.

Redis beats memcached the moment you need more than a bunch of blobs.

E.g. our in-house capacity monitoring uses Redis for a ~1 hour time series view of our systems. We don't care about data loss - the odds of losing the data within an hour of an outage we need to care about is small enough to be justified and if that ever becomes a concern we can run two and split our updates between them. For long term storage, we migrate rolled up data to couchdb at the moment (doesn't matter what - we could use anything really; it's rarely queried and basically to let us get an occasional longer historical view to budget for resources etc).

We don't particularly care about integrity as long as it's "right most of the time" because the data gets constantly corrected, and precision in the averaged longer term numbers is not particularly important either (I want to be able to project when we run out of disk, for example - I don't care if disk usage is at 96% or 98%, as both are way too close for comfort).

Yet memcached is less attractive because it means far more book-keeping in the app. With Redis we encode part of the information in the keys (keys indicate which level of roll-up the data is at, the name of the data item, and the starting timestamp of the period the data is for) in a way that lets us easily retrieve a list of keys to do the roll-up. With Redis' Lua support we could probably do even more of the roll-up process in Redis itself, but I haven't looked at that yet.

There are a lot of apps like this, where your data volume makes hitting disk annoying (this replaced a Graphite based system - Graphite was thoroughly trashing an expensive disk array and still regularly getting too slow to us; the Redis based system uses ~10% of a much slower machine) and where the cost of losing some time window worth of data is low enough not to matter because it just means a blip in a data stream that is rapidly obsoleting itself.

jbinto · on Oct 9, 2014

I didn't get the context, so I looked it up:

http://aphyr.com/tags/Jepsen

> This article is part of Jepsen, a series on network partitions.

Covers MongoDB, redis, RabbitMQ, among others. All articles start with "call me maybe", thus the Jepsen reference.

rurounijones · on Oct 10, 2014

Would love to see Jepsen thrown at Apache Storm.