
Redis Presharding - mattyb
http://antirez.com/post/redis-presharding.html
======
simonw
The one thing that puts me off Redis sharding is that it breaks cross-key
operations, like sunion and sinter. I get a lot of use out of those.

I wonder if it would make sense for Redis to support cross key operations
where you can tell it another Redis server to retrieve the keys from?

    
    
        # Connect to redis server 1
        r = Redis('server1.dev', 6379)
        # Tell it to perform an intersection with a key on server 2
        r.sinterremote('my-key', 'server2.dev:6379/remote-key')
        # Same, but store the result on server 1
        r.sinterremotestore('local-storage-key', 'my-key', 'server2.dev:6379/remote-key')

~~~
antirez
Hi Simon, this is a good question. I think operations like these would
ultimately be too slow to be useful: you need a massive amount of data
transfer to compare things living in different instances.

But Redis Cluster supports the MIGRATE command, so what we plan is to provide
the client with a very cool way to do this kind of thing.

Basically MIGRATE is able to transfer a key from an instance to another (or to
copy it). The performance is pretty impressive. So the idea is that you can
migrate two keys from two different instances into a "temporary" instance that
you use just for computation, and then perform what you want in this instance.

This way we'll provide all the tools needed for people who understand what
they are doing, but we won't provide direct support for something that in the
general case can't be super fast and horizontally scalable.
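That temporary-instance pattern can be sketched as follows, with redis-py-style
clients (the `a`, `b`, `tmp` connection objects and the host/port values are
hypothetical; MIGRATE's argument order is `host port key destination-db timeout`):

```python
# Sketch of the temporary-instance pattern: MIGRATE both operand keys to
# a scratch node, run SINTER there, and clean up afterwards.
def cross_node_sinter(a, b, tmp, tmp_host, tmp_port, key_a, key_b,
                      timeout_ms=1000):
    # MIGRATE host port key destination-db timeout
    a.execute_command('MIGRATE', tmp_host, tmp_port, key_a, 0, timeout_ms)
    b.execute_command('MIGRATE', tmp_host, tmp_port, key_b, 0, timeout_ms)
    try:
        return tmp.sinter(key_a, key_b)   # compute on the scratch node
    finally:
        tmp.delete(key_a, key_b)          # keep the scratch node empty
```

In a real deployment `a`, `b` and `tmp` would be `redis.Redis(...)`
connections; note that nothing here coordinates concurrent writers to the
source keys, which remains the caller's problem.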

~~~
simonw
It sounds like I'll just need to make sure I keep sets I'm going to need to
intersect with each other on the same node. They're usually sets of integers
so that shouldn't be too much of a problem for the kinds of thing I use Redis
for.

~~~
antirez
Cool that it works for you. But for a use case where you have many
distributed sets and you need to intersect cross-node, the simplest thing you
can do is transfer the smallest set to the other node via MIGRATE against a
temporary key, intersect, then delete it. At least we know there is an exit
strategy :)
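One way to sketch that exit strategy with redis-py-style clients; DUMP/RESTORE
(the primitives MIGRATE is built on) are used here so the transferred copy can
land under a temporary key name (`tmp:xfer` is an arbitrary choice):

```python
# Ship the smaller set to the node holding the bigger one, intersect
# locally there, and clean up the scratch copy.
def sinter_smallest_first(a, b, key_a, key_b, tmp_key='tmp:xfer'):
    # pick the direction that moves less data
    if a.scard(key_a) <= b.scard(key_b):
        src, src_key, dst, dst_key = a, key_a, b, key_b
    else:
        src, src_key, dst, dst_key = b, key_b, a, key_a
    # DUMP/RESTORE copy the value under a temporary name on the target
    dst.restore(tmp_key, 0, src.dump(src_key))
    try:
        return dst.sinter(dst_key, tmp_key)   # intersect on one node
    finally:
        dst.delete(tmp_key)                   # drop the scratch key
```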

~~~
jaksprats
that will work (as well as can be hoped for).

The process can be thought of as a map-reduce. The first pass figures out
where all your SETs are located and how big they are. Then intersect all
co-located SETs. Then (in parallel) take the smaller SETs, MIGRATE them toward
the larger SETs, intersect those, and repeat (respecting the new sizes). A
pyramid of SET intersections can be done until the result is the final two
SETs being intersected. A process will need to coordinate all of this, but
luckily intersections are idempotent, so if one parallel job finishes quicker,
it doesn't need to wait.
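The pyramid can be simulated in plain Python, with ordinary sets standing in
for Redis SETs and a dict move standing in for MIGRATE + SINTERSTORE (the node
names and layout are made up):

```python
def pyramid_intersect(placement):
    """placement: node name -> list of sets living on that node."""
    # first pass: intersect everything that is already co-located
    partials = {node: set.intersection(*sets)
                for node, sets in placement.items() if sets}
    # then repeatedly "MIGRATE" the smallest partial toward the largest
    # and intersect there, until one node holds the final result
    while len(partials) > 1:
        by_size = sorted(partials, key=lambda n: len(partials[n]))
        small, big = by_size[0], by_size[-1]
        partials[big] &= partials.pop(small)   # MIGRATE + SINTERSTORE
    return partials.popitem()[1]
```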

With redis' speed, the overhead (for the coordinator) of getting the sizes of
the different SETs in all these steps should be minimal compared to the time
taken to MIGRATE the data (meaning: for free). Also, with redis' speed the
MIGRATE will be very fast, and the cross-node join (or cross-node
intersection) bottlenecks on network I/O. So if a good framework for this
redis map-reduce is created, it will be a pretty optimal setup, and it won't
bloat the server with tasks that can't be done directly in the server
(cross-node intersections are done at something more like a proxy level, since
they require intermediate results).

this problem is a hard one. Data analytics stores optimise for it by storing
data redundantly with a "pre-joined" colocation strategy, which works for star
schemas and a limited number of tables, but doesn't make sense with 1000s of
SETs. So this is a real good solution and classic redis anti-bloat.

------
tlrobinson
So if I understand correctly, this is just a neat trick using existing Redis
features (and relying on its low memory footprint) to easily implement
sharding today?

But in an ideal world (Redis cluster) you would run a single node per machine,
and Redis would automatically handle sharding, replication etc?

~~~
antirez
More or less, yes. Actually with Redis Cluster you'll probably want one
instance per core anyway, so that all your CPU power gets used. But Redis will
handle all the sharding: adding new nodes, removing nodes, selecting the right
node, and so forth.

The point of the article is that with what you have now as stable releases you
can set up a scheme that may work for you.

------
dlsspy
See also: How we've been doing this in memcached and membase

<http://dustin.github.com/2010/06/29/memcached-vbuckets.html>

~~~
antirez
vbuckets are exactly one of the ideas proposed in one of the old design
documents for Redis Cluster. And indeed the final Redis Cluster uses this
concept. In our land they are called "Hash Slots".

I think the concept is much more interesting than consistent hashing, as it is
a discrete way to see the problem: divide the key space into N pieces, and
assign these pieces to different nodes. Resharding this way is easy; just move
one hash slot from one instance to another. For moving data around, Redis
uses the MIGRATE command.
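A toy version of the hash-slot idea (the released Redis Cluster uses CRC16 of
the key modulo 16384; CRC32 and the two-node layout here are illustrative
stand-ins):

```python
import zlib

NUM_SLOTS = 16384   # Redis Cluster's slot count; CRC32 stands in for CRC16

def slot_for(key):
    # every key deterministically maps to one of NUM_SLOTS slots
    return zlib.crc32(key.encode()) % NUM_SLOTS

# each slot is owned by exactly one node; this table is the whole layout
slot_owner = {s: ('nodeA' if s < NUM_SLOTS // 2 else 'nodeB')
              for s in range(NUM_SLOTS)}

def node_for(key):
    return slot_owner[slot_for(key)]

def reshard(slot, new_node):
    # in real life: MIGRATE every key in the slot, then flip the entry
    slot_owner[slot] = new_node
```

Resharding never rehashes keys; it only changes who owns a slot, which is why
moving one slot at a time is cheap.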

~~~
dlsspy
The other subtlety is in being able to prevent the server from servicing
requests when it shouldn't be (to avoid consistency issues).

What you don't want is a client not aware of the new configuration -- or
reading and writing before the new one is ready. That's where vbucket states
allow us to atomically move things.

It's not obvious from your doc how one can actually accomplish this with zero
downtime while maintaining consistency.

------
liuliu
Process migration (moving an instance) is not easy. At the simplest level, you
still want some kind of naming/binding scheme so that the migration is
transparent at the application layer (well, DNS, maybe). Besides, with your
proposal in "Moving Instance" there are still race conditions. For example,
when you "slaveof no one" the new instance, which instance (old or new)
handles the incoming requests? It seems to me that either way you end up in an
inconsistent state (or handle some requests faultily). Well, I'd need to look
up the underlying mechanism of "slaveof" to be confident about that, but it is
worth pointing out.

I'd like to use Redis Cluster as soon as it is available, so here is my
question: how can I help? Recently I wanted to start writing some scripts for
rehashing, but there are some problems; namely, I cannot just "get/set"
everything (e.g., I cannot "get/set" a list in some serialized format). Do
you have a recommended way of doing that?

~~~
antirez
Hello liuliu, there is no race:

when all the clients are pointed to write to the new instance, that instance
is still a slave, but since all Redis slaves can accept writes, you are
already OK. Then you elect the instance as master just to disconnect the real
master without issues.
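The hand-off can be sketched with duck-typed clients (the `repoint` helper is
hypothetical application-side logic, not a Redis command; with redis-py,
`slaveof()` called with no arguments issues SLAVEOF NO ONE):

```python
def hand_off(old_master, new_master, app_clients):
    # precondition: new_master is already a synced slave of old_master
    # 1. repoint every application client at the new instance; being a
    #    slave it still accepts their writes (the behavior assumed above)
    for client in app_clients:
        client.repoint(new_master)
    # 2. promote: SLAVEOF NO ONE detaches it from the old master
    new_master.slaveof()
    # 3. the old instance, now unused, can be retired
    old_master.shutdown()
```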

Redis Cluster is something we (the Redis core team, that is, me and Pieter)
can't get help with for now, but we will need help soon with testing.

~~~
moe
Count me in, I'm also eagerly waiting for the cluster feature.

Have you considered relying on Zookeeper for the gossip heavy-lifting?

Yes it's java (ugh) and would add an external dependency (ugh), but it's
battle-tested and could perhaps save you from re-inventing such a complex
wheel?

~~~
jrockway
Considering they won't use a library for the networking, I doubt they are
going to add a Java dependency for config management.

~~~
moe
It would not be a java dependency but rather a component dependency (you talk
to zookeeper over the network).

Zookeeper is also not just config management, it's _state_ management. It
helps to solve some of the really nasty problems of distributed state, e.g.
achieving quorum and dealing with split brain situations.

However, it's been a while since I read up on the planned redis-cluster
design, so I'm not sure how they are tackling these issues.

------
chubot
How about just using consistent hashing?
<http://en.wikipedia.org/wiki/Consistent_hashing>

In this scheme, you can add and remove servers without moving all your data.

There is apparently a memcached plugin to do this:
<http://www.mikeperham.com/2009/01/14/consistent-hashing-in-memcache-client/>
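For the record, a minimal consistent-hash ring looks like this (MD5 and 100
virtual points per node are arbitrary choices): removing a node only remaps
the keys that hashed to that node.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Each node owns many points on a ring; a key belongs to the node
    at the first point clockwise from the key's hash."""
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas
        self._ring = []                     # sorted list of (point, node)
        for n in nodes:
            self.add(n)

    def _points(self, node):
        for i in range(self.replicas):
            h = hashlib.md5(f'{node}:{i}'.encode()).hexdigest()
            yield int(h, 16)

    def add(self, node):
        for p in self._points(node):
            bisect.insort(self._ring, (p, node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def node_for(self, key):
        # assumes at least one node is on the ring
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        i = bisect.bisect(self._ring, (h, ''))
        return self._ring[i % len(self._ring)][1]
```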

~~~
moe
Consistent hashing is a common baseline, but it's not the hardest part of the
problem.

You still have to move _some_ data when you shuffle nodes and the devil is in
the detail there; when/how to move chunks without interrupting access or
violating integrity, how to ensure n-copies (if redundancy is required), how
to prevent hot-spots. And of course all the aggregation/routing to allow
complex queries that span nodes.

That's the reason why there are very few out-of-the-box sharding solutions.
It's just very tough to get right, and one size won't fit all.

------
EGreg
Personally, I would prefer to hash my fields like this:

1\. Hash each field in the key in some fashion

2\. Concatenate the hashes, for example implode(".", $hashes)

3\. Now you've got a point in the space of all possible concatenated hashes.
On each client, have a partition like this for each key:

["", "abf34.9f8c7.92833", ...]

4\. Go through the partition and find the last point that is smaller than your
point. That point maps to your shard.

5\. Monitor your machines. If one of your shards becomes too full, or is
getting too much traffic, SPLIT IT. You do this simply by replicating the data
to one or more new shards (as the article describes), introducing one or more
points in the partition right after the point mapping to that shard. Point
them at the new shards. And update all the clients!
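A sketch of steps 1-5 in Python (the 5-character hash prefix, the boundary
values, and the shard names are all made up for illustration):

```python
import bisect
import hashlib

def field_point(fields):
    # steps 1-2: hash each field, concatenate with "." (implode-style)
    return '.'.join(hashlib.md5(str(f).encode()).hexdigest()[:5]
                    for f in fields)

# step 3: every client holds the same sorted list of boundary points,
# one per shard; "" is the lowest point and catches everything
partition = ['', '8', 'c']
shards    = ['shard0', 'shard1', 'shard2']  # partition[i] -> shards[i]

def shard_for(fields):
    # step 4: the last boundary <= our point names the shard
    return shards[bisect.bisect_right(partition, field_point(fields)) - 1]

def split(new_boundary, new_shard):
    # step 5: after replicating the hot shard's data to new_shard, add
    # a boundary so points >= new_boundary route there, then push the
    # updated partition to all clients (not shown)
    i = bisect.bisect_right(partition, new_boundary)
    partition.insert(i, new_boundary)
    shards.insert(i, new_shard)
```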

That's all Greg

