
Finally Redis collections are iterable - alexandere
http://antirez.com/news/63
======
aitskovi
For more context, Pieter's original pull request is here:
[https://github.com/antirez/redis/pull/579](https://github.com/antirez/redis/pull/579)

------
jwr
This is cool and helpful, but not a game-changer. For things like garbage-
collecting I've been using probabilistic techniques (just get a random element
and check it, and make sure you check enough to get the guarantees you need)
with great success. The new scanning doesn't provide any tight guarantees (it
can't, really, without sacrificing a lot of what Redis stands for), so it will
be more of a convenience than a really new paradigm.
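
For the curious, a rough sketch of that pattern in C with hiredis (the set
name and the is_stale() check are invented for illustration; the real check
depends entirely on the application):

    
    
        /* Probabilistic GC pass: sample random members, drop the stale ones. */
        #include <stdio.h>
        #include <string.h>
        #include <hiredis/hiredis.h>
        
        static int is_stale(const char *member) {
            /* hypothetical application-specific check */
            return strncmp(member, "dead:", 5) == 0;
        }
        
        static void gc_pass(redisContext *c, int samples) {
            for (int i = 0; i < samples; i++) {
                redisReply *r = redisCommand(c, "SRANDMEMBER gc:candidates");
                if (r == NULL) break;
                if (r->type == REDIS_REPLY_STRING && is_stale(r->str)) {
                    redisReply *del =
                        redisCommand(c, "SREM gc:candidates %s", r->str);
                    if (del) freeReplyObject(del);
                }
                freeReplyObject(r);
            }
        }
        
        int main(void) {
            redisContext *c = redisConnect("127.0.0.1", 6379);
            if (c == NULL || c->err) return 1;
            gc_pass(c, 100);    /* sample 100 members per pass */
            redisFree(c);
            return 0;
        }
    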

Still, it is nice to see -- so, thanks!

~~~
antirez
I agree that for some data types, like Sets, you could use SRANDMEMBER;
however, to sample all the elements by random sampling you need to perform a
lot more work. For example, for a 10000-element collection, on average you are
going to need around 10 times as many requests.
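
That's essentially the coupon collector problem: with uniform random draws
you need about n*ln(n) draws to see all n elements at least once. A quick
sanity check, assuming SRANDMEMBER samples members uniformly:

    
    
        /* Expected number of uniform random draws needed to see all n
         * distinct elements is n * H(n), where H(n) is the n-th harmonic
         * number (roughly ln n). */
        #include <stdio.h>
        
        int main(void) {
            double n = 10000.0, h = 0.0;
            for (int i = 1; i <= (int)n; i++) h += 1.0 / i;
            printf("expected draws: %.0f (%.1f times n)\n", n * h, h);
            return 0;   /* prints about 97876 draws, i.e. ~9.8 times n */
        }
    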

~~~
jwr
Yes. I was lucky to have applications where that wasn't a problem — mostly
because this kind of work was done regularly and incrementally, rather than
rarely and in a big chunk.

------
legedemon
This completely blew my mind. One of the cleverest algorithms and
implementations that I have seen in recent times. It was like seeing Radix
sort for the first time and realizing that sorting still works even if you
sort by the least significant bit first!

------
DanWaterworth
_Elements added during the iteration may be returned, or not, at random._

I can't be the only one who thinks this is crazy behaviour.

~~~
qwerta
It is actually very reasonable. The other alternative is that the iterator
returns an error if the underlying collection was modified (fail-fast
iterators).

~~~
antirez
There are alternatives but all will cost something... especially memory :-)

For example, every element may have an epoch, and every time a new element is
added the epoch is incremented. Then what you could do is to only return
elements with an epoch <= the epoch at which you started iterating, so that
none of the new elements are returned.
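
Just to make that concrete, the epoch idea would look something like this (a
hypothetical sketch, not what Redis actually does):

    
    
        /* Hypothetical epoch-based iterator: every entry remembers the epoch
         * at which it was inserted, and the iterator skips anything newer
         * than the epoch it started at. */
        typedef struct entry {
            void *key;
            unsigned long long epoch;       /* from a global insert counter */
            struct entry *next;
        } entry;
        
        typedef struct iter {
            unsigned long long start_epoch; /* snapshot taken at SCAN start */
        } iter;
        
        /* Return 1 if the iterator should emit this entry. */
        static int visible(const iter *it, const entry *e) {
            return e->epoch <= it->start_epoch;
        }
    

But that is an extra counter stored with every element just to support
iteration, which is exactly the memory cost mentioned above.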

We tried to do our best with the state available, which is just a 64-bit
counter that the server does not even need to remember: it just returns it to
the caller and gets it back as an argument in the next iterator call.
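
From the client side the cursor handling is just a loop (a minimal hiredis
sketch, with most error handling omitted):

    
    
        #include <stdio.h>
        #include <string.h>
        #include <hiredis/hiredis.h>
        
        int main(void) {
            redisContext *c = redisConnect("127.0.0.1", 6379);
            if (c == NULL || c->err) return 1;
        
            char cursor[32] = "0";          /* SCAN starts (and ends) at 0 */
            do {
                redisReply *r = redisCommand(c, "SCAN %s", cursor);
                if (r == NULL || r->type != REDIS_REPLY_ARRAY) {
                    if (r) freeReplyObject(r);
                    break;
                }
                /* element[0] is the next cursor, element[1] the keys batch */
                snprintf(cursor, sizeof(cursor), "%s", r->element[0]->str);
                for (size_t i = 0; i < r->element[1]->elements; i++)
                    printf("%s\n", r->element[1]->element[i]->str);
                freeReplyObject(r);
            } while (strcmp(cursor, "0") != 0);
        
            redisFree(c);
            return 0;
        }
    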

~~~
qwerta
Is the 64-bit counter the part of the hash which points to the current
position in the Hash Tree?

I implemented something similar for HTreeMap, but in Java.

~~~
antirez
It's actually something like a counter that counts starting from the most
significant bits first; in the article there is a link to a comment that
explains the implementation.

~~~
qwerta
So it was hard because the hash table could resize between calls?

For HTreeMap I used an expanding Hash Tree. There are 4 dir levels, each with
128 entries. If a dir becomes full (or has too many collisions), it splits
into another 128 dirs.

Iteration is done by increasing a counter. If the dir at level 2 is not found,
the counter is increased by 128^2. Writing the iterators took a single day.

------
Robin_Message
So, I'm trying to work out how this actually works, and thought I'd share my
working (especially the reversed bit counter). No idea if my thinking out loud
will help anyone else.

TL;DR: Because we count reversed, when the table shrinks (shrinking is done
live, so the old table sticks around for a while) we will continue to iterate
only from the position in the sequence where all of the masked bit
combinations (those bits of the hash that are ignored in the smaller table)
have already been explored. If we counted normally, then on a shrink we would
end up skipping parts of the old table. Lines
[https://github.com/antirez/redis/blob/unstable/src/dict.c#L7...](https://github.com/antirez/redis/blob/unstable/src/dict.c#L717-L721)
explain it perfectly; now I understand it.

Looking at the code, the increment works as follows:

    
    
        counter |= ~mask;           /* set every bit above the table mask    */
        counter = reverse(counter); /* those bits are now the low bits...    */
        counter++;                  /* ...so the ++ carries straight through */
        counter = reverse(counter); /* them into the significant part        */
    

The mask is the size of the hash table, minus one, so it always looks
something like 0b00001111.

So the first step sets all the unimportant high bits. This means that, after
the reverse, all of the unimportant low bits are set, which means an increment
will set all of the unimportant bits to zero, and increment the important part
of the reversed counter.

So this could be rewritten as:

    
    
        counter = reverse(counter);
        counter += (1 << (32-log_size));
        counter = reverse(counter);
    

Sometimes a collection is made up of two tables, one larger, one smaller.
There is an extra loop in each iteration to go through all of the elements of
the larger table that share a hash prefix with the counter.
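
A toy version of that two-table step (made-up names, not the literal dict.c
code, which lives in dictScan()):

    
    
        #include <stdio.h>
        
        /* Toy model of a dict caught mid-rehash: two tables whose sizes are
         * powers of two. */
        struct toy_dict {
            unsigned long small_size;   /* e.g. 4  */
            unsigned long big_size;     /* e.g. 16 */
        };
        
        static void visit(const char *table, unsigned long idx) {
            printf("scan %s[%lu]\n", table, idx);
        }
        
        /* One step: the small-table bucket for this cursor, plus every
         * big-table bucket that shares its low bits and will collapse onto
         * that small bucket when the rehash finishes. */
        static void scan_step(const struct toy_dict *d, unsigned long cursor) {
            unsigned long small_mask = d->small_size - 1;
        
            visit("small", cursor & small_mask);
            for (unsigned long j = 0; j < d->big_size / d->small_size; j++)
                visit("big", (cursor & small_mask) | (j * d->small_size));
        }
        
        int main(void) {
            struct toy_dict d = { 4, 16 };
            scan_step(&d, 2);   /* small[2], then big[2], big[6], big[10], big[14] */
            return 0;
        }
    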

Now, why the reversed counter? If the table stays the same size, it doesn't
matter what order you iterate. If the table grows, the prefix system still
works, so that can't be it. So, by process of elimination, it must be
necessary for collections that shrink.

Say we have an 8-element collection, and have iterated 000 and 100, and next
is 010. Then it starts shrinking to a 4-element collection. So next is 010
(which is interpreted as 10 in the new, smaller table, and 010 and 110 in the
old table), then 01 (01, 001, 101), then 11 (11, 011, 111), then done.

Well, that worked (we visited all 8 places in the old table). Let's try a non-
reverse increment.

000, 001. Next is 010.

Switch to size 4, and then visit (10,010,110), (11,011,111), done. We missed
100 and 101.

Okay, I'm happy. It sort of makes intuitive sense that if the shrink is what's
important, and the high end is lost during the shrink, then incrementing from
the high end will work better because throwing away the high end will still
cover the whole range, as long as you still look at every bit in the thrown
away section. Whereas incrementing from the low end will result in gaps.

Go from size 256 to 4 to really show it:

    
    
        0,128,64,192|2,1,3,0
    

vs

    
    
        0,1,2,3|0 (which examines only the top quarter of the old hash table)
    

Or to put it another way, look at this sequence: size 8: 0,4,2,6,1,5,3,7. The
bottom bit is set only after all of the possibilities with the bottom bit
reset have been explored. So if we remove some of the top bits, either the
bottom bits will not be set, or every combination of top bits with those
bottom bits set will already have been explored.
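
Here is a tiny standalone program that runs the increment described above and
prints that sequence (the rev() here is a naive bit reversal rather than the
clever swap-based one in dict.c):

    
    
        #include <stdio.h>
        
        /* Reverse all 64 bits of v, one bit at a time. */
        static unsigned long long rev(unsigned long long v) {
            unsigned long long r = 0;
            for (int i = 0; i < 64; i++) {
                r = (r << 1) | (v & 1);
                v >>= 1;
            }
            return r;
        }
        
        /* The scan cursor increment: set the bits above the mask, reverse,
         * add one so the carry lands in the significant part, reverse back. */
        static unsigned long long next_cursor(unsigned long long v,
                                              unsigned long long mask) {
            v |= ~mask;
            v = rev(v);
            v++;
            return rev(v);
        }
        
        int main(void) {
            unsigned long long mask = 7, v = 0;     /* table of size 8 */
            do {
                printf("%llu ", v);                 /* prints 0 4 2 6 1 5 3 7 */
                v = next_cursor(v, mask);
            } while (v != 0);
            printf("\n");
            return 0;
        }
    

And if you change mask from 7 to 3 partway through the loop, the remaining
cursors (taken modulo the new, smaller table size) still cover every small
bucket whose contents weren't already fully visited, which is the shrink
behaviour walked through above.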

------
lost-theory
Great addition; I can finally replace my probabilistic key dumper (a Lua
procedure that grabbed 1k RANDOMKEYs at a time and ran forever).

------
programminggeek
For some reason my brain read this as "Finally Redis collections are terrible"

------
dl_terp
In this day and age, that move was like shooting yourself in the foot. This is
the second page like this to make page #1 of hacker news... we're in the
sharing information age people, you can't do stuff like this.

