We're really excited to be working on solving real-world problems we've come across with Skizze; hopefully others will find it useful too as it matures!
So imagine you have a domain with an expected cardinality (which you pass in) of 1 million and a threshold of 10k.
What will happen is that every time you insert an event, it will be multiplexed into all 4 sketches. Each one of them will have a dictionary, "thDict" (I will look into having them shared within a domain to reduce memory), that can hold up to 10k keys.
The keys will be the "strings" passed in and the values will be counters (+1 every time the key is inserted). Using a dictionary of 10k keys we can then give you precise, 100% accurate answers for:
* membership: check if the key exists in "thDict"
* cardinality: count number of keys
* frequency: for a given key, check its counter value
* top-k: sort the dictionary by values/keys
Once the 10k-key threshold is reached, we can just replay the keys into all sketches.
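To make that concrete, here's a minimal Python sketch of the idea as I've described it (names like `ThresholdSketch` and the underlying `sketch.add(key, count)` call are assumptions for illustration, not Skizze's actual API; in a domain, each insert would be multiplexed into 4 of these):

```python
from collections import Counter


class ThresholdSketch:
    """Wrap a probabilistic sketch with an exact dictionary up to `threshold` keys.

    A minimal illustration of the idea described above, not Skizze's actual code;
    the underlying sketch's add(key, count) method is an assumption.
    """

    def __init__(self, sketch, threshold=10_000):
        self.sketch = sketch
        self.threshold = threshold
        self.th_dict = Counter()   # key -> exact counter ("thDict")
        self.exact = True          # still below the threshold?

    def insert(self, key, count=1):
        if self.exact:
            self.th_dict[key] += count
            if len(self.th_dict) > self.threshold:
                # Threshold crossed: replay everything into the real sketch
                # and answer from its (approximate) estimates from now on.
                for k, c in self.th_dict.items():
                    self.sketch.add(k, c)
                self.exact = False
        else:
            self.sketch.add(key, count)

    # While `exact` is True, these answers are 100% accurate:
    def member(self, key):
        return key in self.th_dict             # membership

    def cardinality(self):
        return len(self.th_dict)               # number of distinct keys

    def frequency(self, key):
        return self.th_dict[key]               # exact hit count

    def top_k(self, k):
        return self.th_dict.most_common(k)     # sorted by counter value
```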
I get the multiplexing and I think that's a great concept. One other thing that you might also want to consider (to reduce unneeded interaction with the server) is allowing counts to be specified alongside the keys - useful, for instance, when you want to know not only which IPs are the most frequent hitters but also which IPs are the "heaviest" hitters (by number of bytes transferred instead of individual requests).
I didn't quite get the semantics behind the threshold, though (I might have to look at the code to fully get it). Building on your example of 1M keys and 10k as the threshold, you say the error rate would be 0% for all the sketches (which is great - but why 0%?). So suppose all keys are 32 bytes long and the counters are 32-bit (4-byte) integers - for simplification let's also suppose that there's a single shared "thDict" dictionary - this would give a raw memory usage of 10,000 * (32 + 4) bytes, or roughly 352 KiB.
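Spelling that back-of-the-envelope number out (under the same assumptions of 32-byte keys, 4-byte counters and a single shared dictionary):

```python
entries = 10_000                      # threshold
key_bytes, counter_bytes = 32, 4      # 32-byte keys, 32-bit counters
raw = entries * (key_bytes + counter_bytes)
print(raw, raw / 1024)                # 360000 bytes, ~351.6 KiB
```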
Then, if I understood correctly, it seems that the threshold determines the memory usage (unless the "sketch replaying" part, which I also didn't fully understand, affects this in some way). But what I would like to know is how the cardinality/threshold ratio influences the expected error rate. Given that any probabilistic data structure works on the basis of an expected error rate and memory requirements, it seems crucial to know how to plan for both sides of the equation, but somehow after reading both links and the GitHub repo README I still can't understand it.
Redis provides a bitset that can be used as a Bloom filter. You'd end up having to hash k times on the client side, while also calculating the ideal size of the bitset for the expected "n" values and "e" error rate.
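For comparison, here's roughly what that client-side work looks like against a plain Redis bitset (the sizing formulas are the standard Bloom-filter ones; the key name and double-hashing scheme are just assumptions for illustration):

```python
import hashlib
import math

import redis


def bloom_params(n, e):
    """Standard Bloom-filter sizing: bits m and hash count k for n items at error rate e."""
    m = math.ceil(-n * math.log(e) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k


def _offsets(item, m, k):
    # Double hashing to derive k bit positions from two base hashes.
    h = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(h[:8], "big")
    h2 = int.from_bytes(h[8:16], "big") or 1
    return [(h1 + i * h2) % m for i in range(k)]


r = redis.Redis()                              # assumes a local Redis server
m, k = bloom_params(n=1_000_000, e=0.01)       # expected "n" values, "e" error rate


def bloom_add(key, item):
    for off in _offsets(item, m, k):
        r.setbit(key, off, 1)                  # k round trips unless pipelined


def bloom_check(key, item):
    return all(r.getbit(key, off) for off in _offsets(item, m, k))


bloom_add("ips", "10.0.0.1")
print(bloom_check("ips", "10.0.0.1"))          # True (possible false positives, no false negatives)
```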
The HyperLogLog is actually the same. We intend to provide a variable threshold feature that guarantees 100% accuracy up to a specific cardinality.