From what I understand, in terms of CAP there are really two types of systems: those that do async replication and those that do sync replication.
Async replicated systems can be (but are not necessarily) AP: each system can process requests independently of the others, and they'll become consistent eventually (how they arrive at this consistency differs between systems, and is not always "correct" for the general use case).
Sync replicated systems can be (but are not necessarily) CA: as long as replication is not broken they are available and consistent. These systems have the upside that they allow you to serve read-only data during partitions, but have slow writes because of the sync replication.
Both types can have features removed, for whatever reason. For example, you could have a sync replicated system that doesn't serve any requests when the replication is broken. Or you could have an async system that doesn't provide a mechanism to correct inconsistencies (a la MySQL async replication). What you describe sounds like an imperfect version of the async system: not highly available, not consistent, and not partition tolerant. Given that better AP systems exist, ones that actually provide things like availability during partitions and eventual consistency mechanisms/guarantees, will Redis Cluster eventually support these features?
Edit: please don't take any of this as criticism. Redis is an awesome product that I use and rely on often. Thanks for all the great work.
basically we can rule out synchronous systems: they are a product mismatch from the POV of Redis. We are left with AP systems. AP systems that try to get as much as they can from the POV of availability, and from the consistency model provided (even if not "C"), do two additional things that Redis Cluster is not able to do:
1) They are available in the minority partition. Redis Cluster has some degree of availability, but not to that extent. True "A" means serving queries even if you are the only node left.
2) They provide more write safety, unless you use them with last-write-wins strategies or the like.
"1" is actually related to "2": if you merge values later, you can accept writes even if the node has no clue about the state of the rest of the system.
About "2", if you check around, you'll immediately figure out that there are no AP systems available that can easily model Redis data structures with the same performance and the same number of elements per value. This is because merging huge values is time consuming, requires a lot of metadata, and so forth.
Now there is another constraint: I don't want a system with application-assisted merging. For example, recent Riak versions provide data structures with a default merge strategy, so we have examples of similar stuff. I can assure you that either metadata is needed to merge, which would make Redis a lot less memory efficient, or sometimes there is simply no good strategy. For example, take the "Hash" type. There is a partition, and in one node I set the "name" field of user 1234 to "George" and in another node to "Melissa". There is no better merge strategy than last-write-wins, basically.
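To make the Hash example concrete, here is a minimal sketch of what last-write-wins actually does when the partition heals. The function name and the (value, timestamp) representation are illustrative assumptions, not anything Redis implements:

```python
def lww_merge(a, b):
    """Merge two (value, timestamp) versions of the same hash field.

    With no extra semantic information there is nothing smarter to do
    than keep the write with the highest timestamp: the other write
    is silently discarded.
    """
    return a if a[1] >= b[1] else b

# Two sides of a partition set user 1234's "name" field independently.
side_a = ("George", 1000.0)    # (value, write timestamp)
side_b = ("Melissa", 1001.5)

merged = lww_merge(side_a, side_b)
# merged keeps the later write; "George" is lost with no trace
```

The point is that the loser vanishes without any record that a conflicting write existed, which is exactly the weakness of this strategy.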
Now, that said, there are certain things that don't add major overhead AND improve write safety. For example, if the connection is declared as "SAFE" we could cache all the SADD commands (and others with similar properties) until they are acknowledged by all the nodes serving the key. Unacknowledged writes are retained as a "log" of commands, and are re-played when the node is turned into a slave of a new master.
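The retained-log idea above can be sketched like this. Everything here is hypothetical (the class name, the "SAFE" mechanics, the replay hook); it only illustrates the bookkeeping, not a real Redis interface:

```python
class SafeConnection:
    """Toy sketch of the "SAFE" connection idea.

    Commands are kept in a pending log until every replica of the key
    acknowledges them; if a failover happens first, the pending log is
    re-played against the newly promoted master.
    """
    def __init__(self, replicas):
        self.replicas = replicas          # replica ids serving this key
        self.pending = []                 # [(command, set of acks)]

    def execute(self, command):
        # retain the command until fully acknowledged
        self.pending.append((command, set()))

    def ack(self, command, replica):
        for cmd, acked in self.pending:
            if cmd == command:
                acked.add(replica)
        # drop commands acknowledged by every replica
        self.pending = [(c, a) for c, a in self.pending
                        if a != self.replicas]

    def replay_after_failover(self, new_master):
        # re-play unacknowledged writes on the new master
        for cmd, _ in self.pending:
            new_master.append(cmd)

conn = SafeConnection(replicas={"r1", "r2"})
conn.execute(("SADD", "tags", "redis"))
conn.ack(("SADD", "tags", "redis"), "r1")   # only one replica acked
new_master = []
conn.replay_after_failover(new_master)      # the write is not lost
```

Note that this only costs memory proportional to the in-flight writes, which is why it doesn't add the per-key metadata overhead discussed earlier.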
So there are plans to improve the consistency model, but it is unlikely that we'll go towards the Riak model. The good thing is that there are stores like Riak exploring this other way of doing data structures, without application-assisted merge functions, with per-element metadata and so forth.
IMHO what there is to ponder is that, at the end of the day, a distributed database is the sum of what it offers when there are no partitions and what it offers when there are partitions. Often improving the latter requires giving up something in the former: simplicity, space efficiency, data model freedom, ...
Thanks for the explanation. Is it possible to add something like a "this record was not merged cleanly" flag? Basically, from what I understand, without sync replication you cannot theoretically have a system that does the correct merge in all use cases. MySQL deals with this by letting two replicas contain different data until you scan for it in your application code and detect it. Other systems (I think Cassandra) will attempt to merge things in the background after the partition is gone, but with strategies like "last write wins" or even versioning you can still get bad records. I am thinking of a system that can apply whatever strategy you choose, or even just "last write wins", but warns you that a write on this key was lost. That way you can either ignore it if that's appropriate (some types of sensors, for example, that frequently update values) or bubble the error up your application code until a user can fix it. On the other hand, I can definitely see this being a performance issue, so it might be more appropriate for a datastore that is aiming for high consistency while still allowing AP mode.
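The flag being proposed could look something like the following sketch: an ordinary last-write-wins merge that additionally reports whether a differing write was discarded. The function name and return shape are made up for illustration:

```python
def lww_merge_flagged(a, b):
    """Last-write-wins merge that also reports whether a conflicting
    write was discarded, so the application can ignore the conflict
    or surface it to a user.

    a, b: (value, timestamp) versions of the same key.
    Returns (winning_value, conflict_flag).
    """
    winner, loser = (a, b) if a[1] >= b[1] else (b, a)
    conflict = loser[0] != winner[0]   # a differing write was lost
    return winner[0], conflict

value, conflict = lww_merge_flagged(("Melissa", 1001.5),
                                    ("George", 1000.0))
# value is the surviving write; conflict is True, so the application
# can bubble the lost write up instead of silently ignoring it
```

The extra cost is one comparison at merge time plus whatever storage the flag needs, which hints at the performance trade-off mentioned above.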
Picking out the three points: not highly available, not consistent, and not partition tolerant.
[reverting to non-standard, dummy definitions below just to explicitly define some things]
Redis Cluster is available as long as (50% + 1) of the master nodes are reachable by each other. So, 15 masters will be more "available" than 3 masters if they are deployed across a sane network topology. Your masters will have replicas as well, so if a single master instance fails, a replica for that master will be elected to be the new master.
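The "(50% + 1)" availability rule above is just integer majority arithmetic; a tiny sketch makes the 3-vs-15 master comparison explicit:

```python
def majority(masters: int) -> int:
    """Smallest number of reachable masters that still forms a
    majority of the cluster (50% + 1, rounded down)."""
    return masters // 2 + 1

# A 3-master cluster needs 2 reachable masters (tolerates 1 failure);
# a 15-master cluster needs 8 (tolerates 7 failures), so it is more
# "available" in this sense, given a sane network topology.
print(majority(3), majority(15))
```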
Redis Cluster is consistent in that each replica is an exact copy of its attached master (by default, each master owns 16384/N of the hash slots, where N is the total number of masters). You can lose writes if the master doesn't have time to replicate them before it fails, but that's the nature of async replication. (You can request your Redis client "wait" for at least N replicas to receive your commands before the command completes; this gives you some assurances.) It's possible for you to get an "inconsistent read" if you read from an async replica after writing to the master, but you have perfect read-after-write consistency if you only talk to master instances.
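For reference, the keyspace is split into 16384 hash slots, and the Redis Cluster specification maps each key to a slot with CRC16 (the XModem variant) modulo 16384, honoring "{...}" hash tags so related keys land on the same master. A minimal sketch:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16, XModem variant (poly 0x1021, init 0), as used by
    Redis Cluster to map keys to hash slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_hash_slot(key: bytes) -> int:
    """Slot for a key; honors {hash tags} so related keys co-locate."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:   # non-empty tag only
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384

print(key_hash_slot(b"123456789"))   # 12739 (0x31C3)
```

Keys sharing a hash tag, like `{user1000}.following` and `{user1000}.followers`, hash to the same slot, which is what makes multi-key operations on them possible within one master.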
Redis Cluster nodes on the minority side of a partition will, by default, deny reads and writes until they re-join the cluster. Redis Cluster prefers data integrity over availability by default since Redis Cluster has no merge operations. You can optionally allow the minority side to keep accepting commands, but that definitely will not guarantee any consistency of your data when the whole cluster reappears.
Given that better AP systems exist, ones that actually provide things like availability during partitions and eventual consistency mechanisms/guarantees, will Redis Cluster eventually support these features?
The one limitation for each of those features is: Redis is an in-memory database. Riak uses multiple KB of metadata for each object in the DB. With Redis, it's common to have tens of millions (or hundreds of millions) of keys on one server.
Availability during partitions with eventual consistency (if you want more than last-write-wins) requires CRDT-like things, which requires metadata, which requires more memory usage per-key.
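To illustrate why CRDT-like merges cost memory, here is a minimal observed-remove set (an add-wins set CRDT), one of the simpler structures of this family. Every element carries a set of unique tags, which is exactly the per-element metadata the paragraph above is talking about:

```python
import uuid

class ORSet:
    """Minimal observed-remove set (an AP-friendly CRDT sketch).

    Every add is tagged with a unique id; a remove records the tags it
    has observed. That per-element metadata is what makes concurrent
    merges safe, and also what makes this structure much heavier than
    a plain Redis set, where a member is just its bytes.
    """
    def __init__(self):
        self.adds = {}      # element -> set of unique add tags
        self.removes = {}   # element -> set of removed tags

    def add(self, element):
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        observed = self.adds.get(element, set())
        self.removes.setdefault(element, set()).update(observed)

    def contains(self, element):
        live = self.adds.get(element, set()) - self.removes.get(element, set())
        return bool(live)

    def merge(self, other):
        # merging is a union of metadata: commutative and convergent
        for elem, tags in other.adds.items():
            self.adds.setdefault(elem, set()).update(tags)
        for elem, tags in other.removes.items():
            self.removes.setdefault(elem, set()).update(tags)

# Two partitioned replicas diverge, then merge without losing the
# concurrent add: its fresh tag was never observed by the remove.
a, b = ORSet(), ORSet()
a.add("x")
b.merge(a)
b.remove("x")        # b removes the tag it saw
a.add("x")           # concurrently, a re-adds with a fresh tag
a.merge(b)
assert a.contains("x")   # add-wins: the concurrent add survives
```

Even in this toy version each set member drags along 32-byte tags plus tombstone bookkeeping, which is why doing this for hundreds of millions of in-memory keys is a hard sell.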
Full and usable availability during partitions is nice to think about, and it would be great to one day have the option of an in-memory Redis with "better data consistency for smaller, manageable datasets" via CRDT merge operations, so you get a Redis-Dynamo type thing.
But, since Redis Cluster is already 4 years under development, it's not worth burning brain cycles on until real users start using Redis Cluster and we can see where the needs-vs-features plot falls. (plus, the issue backlog for other Redis development improvements is about 100 tasks long, so speculative improvements have to fall in line with the balance of Cluster vs. Standalone vs. Master-Replica vs. New Commands vs. Improving Existing Commands vs. Bug Fixes vs. User Contributions vs. Doc Updates vs. Evangelism vs. ...).