Finally Redis collections are iterable

aitskovi · on Oct 29, 2013

For more context pieter's original pull request is here: https://github.com/antirez/redis/pull/579

jwr · on Oct 29, 2013

This is cool and helpful, but not a game-changer. For things like garbage-collecting I've been using probabilistic techniques (just get a random element and check it, make sure you check enough to have guarantees you need) with great success. The new scanning doesn't provide any tight guarantees (it can't, really, without sacrificing a lot of what Redis stands for), so it will be more of a convenience than a really new paradigm.

Still, it is nice to see -- so, thanks!

antirez · on Oct 29, 2013

I agree that for some data type like Sets you could use SRANDMEMBER, however to sample all the elements by random sampling you need to perform a lot more work. For example for a 10000 elements collection in average you are going to need around 10 times the requests.

jwr · on Oct 30, 2013

Yes. I was lucky to have applications where that wasn't a problem — mostly because this kind of work was done regularly and incrementally, rather than rarely and in a big chunk.

legedemon · on Oct 29, 2013

This completely blew my mind. One of the most clever algorithms and implementation that I have seen in the recent times. It was like seeing Radix sort for the first time and realizing that sorting still works even if you sort by least significant bit first!

DanWaterworth · on Oct 29, 2013

Elements added during the iteration may be returned, or not, at random.

I can't be the only one who thinks this is crazy behaviour.

qwerta · on Oct 29, 2013

It is actually very reasonable. Other alternative is that iterator returns an error if underlying collection was modified (fail fast iterators).

antirez · on Oct 29, 2013

There are alternatives but all will cost something... especially memory :-)

For example every element may have an epoch, and every time a new element is added the epoch is incremented. Then what you could do is to only remember elements with epoch <= the epoch at which you started iterating, so that all the new elements will not be returned.

We tried to do our best with the state available, that is just a 64 bit counter the server does not even need to remember, but just to return back to the caller, and get as argument in the next iterator call.

awj · on Oct 29, 2013

> For example every element may have an epoch, and every time a new element is added the epoch is incremented. Then what you could do is to only remember elements with epoch <= the epoch at which you started iterating, so that all the new elements will not be returned.

Do that and you will immediately have people wanting an equally transaction-like view of the iteration set in the face of updates. I think it's best to draw the line where you have and be specific about what this is or is not intended to support.

qwerta · on Oct 29, 2013

Is 64bit counter part of hash which points to current position in Hash Tree?

I implemented something similar for HTreeMap, but in Java.

antirez · on Oct 29, 2013

It's actually something like a counter that counts starting from the most significant bits first, in the article there is a link to a comment that explains the implementation.

qwerta · on Oct 29, 2013

So it was hard because hash table could resize between calls?

For HTreeMap I used expanding Hash Tree. There are 4 dir levels, each with 128 entries. If dir becomes full (or has too many collisions), it splits into another 128 dirs.

Iterations is done by increasing counter. If dir at level 2 is not found, it is increased by 128^2. Writing iterators took single day.

kevingadd · on Oct 29, 2013

.NET's collections all fail fast for seemingly this reason - they can't really provide anything else without huge performance consequences (even the fail fast behavior comes with a cost, due to the check...)

hakt · on Oct 29, 2013

I'm very happy with this compromise.

Schema migrations involving tens of millions of keys can be challenging. The KEYS command returns all matching keys, so you can roll your own iterator with KEYS AAA* then KEYS AAB* etc. But keys aren't evenly distributed so some result sets are zero and others are still too large.

I've been indexing keys with Redis lists and iterate through the index with LRANGE. That uses extra memory and causes more roundtrip calls.

I can deal with this compromise since SCAN makes Redis much more useful for my applications.

jwr · on Oct 29, 2013

It's a compromise. If you need tighter specifications, you pretty much need to implement a transactional system.

discardorama · on Oct 29, 2013

How is this any different from an iterator in Java? IIRC, if you have the iterator of a Collection and add elements to it during iteration, the results are undefined. You can, however, safely remove elements even while iterating.

DanWaterworth · on Oct 29, 2013

I don't use Java.

thezilch · on Oct 29, 2013

Redis has keyspace notifications [0], so you can make your iterator aware that it should possibly restart.

[0] http://redis.io/topics/notifications

bkirkbri · on Oct 29, 2013

It's not crazy, it's realistic. Nothing is free and this sort of pragmatism lets/forces users to act like grown ups.

updated: grammar fix

nknighthb · on Oct 29, 2013

I think it is the only appropriate behavior considering Redis's constraints and philosophy.

Redis is not a transactional, ACID-compliant datastore. You can, for at least some workloads, build such a datastore on top of it, but it requires care and effort. You have to assemble the given pieces correctly, and fabricate some of your own.

This SCAN mechanism seems perfectly in line with this -- if you understand its limitations, it might be a useful tool, but like every other part of Redis, you can't just use it assuming it acts like your favorite SQL RDBMS.

philsnow · on Oct 30, 2013

I can't see why they don't just restart the cursor at the beginning when the collection goes up a size in the middle of an iteration.

Encode the size of the collection in the first 8 bits of the cursor value, and when the client gives you an cursor value with the first 8 bits set to a smaller size than the current size of the collection, start it over from scratch with the new size as the top 8 bits.

If the collection has shrunk, no big deal, just keep going from where the client was because of the mentioned properties of using a reversed bit set.

lmm · on Oct 29, 2013

What do you think is crazy? What other behaviour do you feel would be appropriate to Redis' use cases?

Robin_Message · on Oct 29, 2013

So, I'm trying to work out how this actually works, and thought I'd share my working (especially the reversed bit counter). No idea if my thinking out loud will help anyone else.

TL;DR: Because we count reversed, when the table shrinks (shrinking is done live, so the old table sticks around for a while) we will continue to iterate only from the position in the sequence where all of the masked bit combinations (those bits of the hash that are ignored in the smaller table) have already been explored. If we counted normally, then on a shrink we would end up skipping parts of the old table. Lines https://github.com/antirez/redis/blob/unstable/src/dict.c#L7... explain it perfectly, now I understand it.

Looking at the code, the increment works as follows:

    counter |= ~mask;
    counter = reverse(counter);
    counter++;
    counter = reverse(counter);

The mask is the size of the hash table, minus one, so it always looks something like 0b00001111.

So the first step sets all the unimportant high bits. This means that, after the reverse, all of the unimportant low bits are set, which means an increment will set all of the unimportant bits to zero, and increment the important part of the reversed counter.

So this could be rewritten as:

    counter = reverse(counter);
    counter += (1 << (32-log_size));
    counter = reverse(counter);

Sometimes a collection is made up of two tables, one larger, one smaller. There is an extra loop in each iteration to go through all of the elements of the larger table that share a hash prefix with the counter.

Now, why the reversed counter? If the table stays the same size, it doesn't matter what order you iterate. If the table grows, the prefix system still works, so that can't be it. So, by process of elimination, it must be necessary for collections that shrink.

Say we have an 8 element collection, and have iterated 000 and 100, and next is 010. Then it starts shrinking to a 4 element collection. So next is 010 (which is interpreted as 10 in the new, smaller table, and 010 and 110 in the old table), then 01 (01, 001, 101), then 11 (11, 011, 111), then done.

Well, that worked (we visited all 8 places in the old table). Let's try a non-reverse increment.

000, 001. Next is 010.

Switch to size 4, and then visit (10,010,110), (11,011,111), done. We missed 100 and 101.

Okay, I'm happy. It sort of makes intuitive sense that if the shrink is what's important, and the high end is lost during the shrink, then incrementing from the high end will work better because throwing away the high end will still cover the whole range, as long as you still look at every bit in the thrown away section. Whereas incrementing from the low end will result in gaps.

Go from size 256 to 4 to really show it:

    0,128,64,192|2,1,3,0

vs

    0,1,2,3|0 (which examines only the top quarter of the old hash table)

Or to put it another way, look at this sequence: size 8: 0,4,2,6,1,5,3,7. The bottom bit is set only after all of the possiblities with the bottom bit reset have been explored. So if remove some of the top bits, either the bottom bits will not be set, or every combination of top bits with those bottom bits set will already have been explored.

lost-theory · on Oct 29, 2013

Great addition, I can finally replace my probabilistic key dumper (lua procedure that grabbed 1k RANDOMKEYs at a time and ran forever).

programminggeek · on Oct 29, 2013

For some reason my brain read this as "Finally Redis collections are terrible"

dl_terp · on Oct 29, 2013

In this day and age, that move was like shooting yourself in the foot. This is the second page like this to make page #1 of hacker news... we're in the sharing information age people, you can't do stuff like this.