Hacker News
Show HN: Skizze – High-throughput probabilistic data structure service (github.com/skizzehq)
53 points by seiflotfy on March 8, 2016 | hide | past | favorite | 6 comments



Some more details are available at http://seif.codes/skizze-behind-the-scenes-of-alpha-2/ and https://njp.io/skizze-alpha-release/ , including what's coming up in future releases.

We're really excited to be working on solving real-world problems we've come across with Skizze; hopefully others will find it useful too as it matures!


Great work on putting this together! Amazing idea really.

I'm especially looking forward to the streaming metrics/statistics feature.

PS: Could you go into a bit more detail on the threshold specification per domain and what kind of memory requirements it yields?


Sure thing.

So imagine you have a domain with an expected cardinality (which you pass in) of 1 million and a threshold of 10k.

What will happen is that every time you insert an event, it will be multiplexed into all 4 sketches. Each one of them will have a dictionary "thDict" (I will look into sharing it across a domain to reduce memory) that can hold up to 10k keys. The keys will be the strings passed in and the values will be counters (+1 every time the key is inserted). Using a dictionary of 10k keys we can then give you 100% accurate:

* membership: check if the key exists in "thDict"

* cardinality: count the number of keys

* frequency: for a given key, check the value

* top-k: sort the dictionary by values/keys

Once the 10k-key threshold is reached, we can just replay them into all sketches.
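A minimal sketch in Go of how such an exact-phase threshold dictionary could behave (the name "thDict" and the 10k cap come from the example above; everything else here is my assumption, not Skizze's actual implementation):

```go
package main

import (
	"fmt"
	"sort"
)

// thDict keeps exact per-key counters until a key-count threshold is hit.
type thDict struct {
	counts    map[string]uint64
	threshold int
}

func newThDict(threshold int) *thDict {
	return &thDict{counts: make(map[string]uint64), threshold: threshold}
}

// Add increments the counter for key. It returns false when a new key
// would exceed the threshold, signalling that the exact phase is over
// and the stored keys should be replayed into the probabilistic sketches.
func (d *thDict) Add(key string) bool {
	if _, ok := d.counts[key]; !ok && len(d.counts) >= d.threshold {
		return false
	}
	d.counts[key]++
	return true
}

// Membership: does the key exist in the dictionary?
func (d *thDict) Member(key string) bool {
	_, ok := d.counts[key]
	return ok
}

// Cardinality: number of distinct keys seen so far.
func (d *thDict) Cardinality() int { return len(d.counts) }

// Frequency: exact counter for a given key (0 if absent).
func (d *thDict) Frequency(key string) uint64 { return d.counts[key] }

// TopK: keys sorted by counter, descending.
func (d *thDict) TopK(k int) []string {
	keys := make([]string, 0, len(d.counts))
	for key := range d.counts {
		keys = append(keys, key)
	}
	sort.Slice(keys, func(i, j int) bool {
		return d.counts[keys[i]] > d.counts[keys[j]]
	})
	if k > len(keys) {
		k = len(keys)
	}
	return keys[:k]
}

func main() {
	d := newThDict(10000)
	for _, e := range []string{"a", "b", "a", "c", "a", "b"} {
		d.Add(e)
	}
	fmt.Println(d.Member("a"), d.Cardinality(), d.Frequency("a"), d.TopK(2))
	// prints: true 3 3 [a b]
}
```

All four answers are exact because every key is still held verbatim, which is why the error rate is 0% below the threshold.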

Does this make sense?

Cheers and thanks for the kind words :D


I get the multiplexing and I think that's a great concept. One other thing you might also want to consider (to reduce unneeded interaction with the server) is allowing clients to specify counts for the keys - useful, for instance, when you want to know not only which IPs are the most frequent hitters but also which IPs are the "heaviest" hitters (by number of bytes transferred instead of individual requests).
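In other words (a hypothetical client-side sketch in Go; the weighted-add API is the suggestion above, not an existing Skizze call), one pass over the requests can feed two different rankings:

```go
package main

import "fmt"

func main() {
	hits := map[string]uint64{}  // most frequent hitters: +1 per request
	bytes := map[string]uint64{} // heaviest hitters: +n bytes per request

	requests := []struct {
		ip string
		n  uint64 // bytes transferred
	}{
		{"10.0.0.1", 500}, {"10.0.0.1", 700}, {"10.0.0.2", 9000},
	}
	for _, r := range requests {
		hits[r.ip]++       // 10.0.0.1 is the most frequent (2 requests vs 1)
		bytes[r.ip] += r.n // 10.0.0.2 is the heaviest (9000 bytes vs 1200)
	}
	fmt.Println(hits["10.0.0.1"], bytes["10.0.0.2"])
	// prints: 2 9000
}
```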

I didn't quite get the semantics behind the threshold, though (I might have to look at the code to fully get it). Building on your example of 1M keys and 10k as the threshold, you say the error rate would be 0% for all the sketches (which is great - but why 0%?). Suppose all keys are 32 bytes long and the counters are 32-bit integers - for simplification, let's also suppose there's a single shared "thDict" dictionary - this would give a raw memory usage of 10,000 * (32 + 4) bytes, or about 352 KiB.

Then, if I understood correctly, it seems that the threshold determines the memory usage (unless the "sketch replaying" part, which I also didn't fully understand, affects this in some way). But what I would like to know is how the cardinality/threshold ratio influences the expected error rate. Given that any probabilistic data structure trades expected error against memory requirements, it seems crucial to know how to plan for both sides of the equation, but even after reading both links and the GitHub repo README I still can't understand it.


Looks useful. However, how do the Bloom filter and HLL implementations compare to the ones available in Redis?


Redis provides a bitset that can be used as a Bloom filter. You'd end up having to hash k times on the client side, while also calculating the ideal size of the bitset for the expected "n" values and "e" error rate.
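That sizing step is standard Bloom-filter arithmetic, m = -n·ln(e)/(ln 2)² bits and k = (m/n)·ln 2 hash functions; a quick sketch in Go of the math a client would run before touching the Redis bitset (not Skizze or Redis code):

```go
package main

import (
	"fmt"
	"math"
)

// bloomParams returns the ideal bitset size m (in bits) and number of
// hash functions k for n expected values and false-positive rate e.
func bloomParams(n int, e float64) (m, k int) {
	mf := -float64(n) * math.Log(e) / (math.Ln2 * math.Ln2)
	m = int(math.Ceil(mf))
	k = int(math.Round(mf / float64(n) * math.Ln2))
	return m, k
}

func main() {
	m, k := bloomParams(1000000, 0.01)
	fmt.Println(m, k)
	// roughly 9.6 bits per entry and 7 hash functions for a 1% error rate
}
```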

It's the same story for the HyperLogLog. We intend to provide a variable threshold feature to guarantee 100% accuracy up to a specific cardinality.



