
Show HN: Skizze – High-throughput probabilistic data structure service - seiflotfy
https://github.com/skizzehq/skizze
======
njpatel
Some more details are available at [http://seif.codes/skizze-behind-the-scenes-of-alpha-2/](http://seif.codes/skizze-behind-the-scenes-of-alpha-2/) and [https://njp.io/skizze-alpha-release/](https://njp.io/skizze-alpha-release/), including what's coming up in future releases.

We're really excited to be working on solving real-world problems we've come
across with Skizze; hopefully others will find it useful too as it matures!

------
alixaxel
Great work on putting this together! Amazing idea really.

I'm especially looking forward to the streaming metrics/statistics feature.

PS: Could you go into a bit more detail on the threshold specification per
domain and what kind of memory requirements it yields?

~~~
seiflotfy
Sure thing.

So imagine you have a domain with an expected cardinality (which you pass) of
1 million and a threshold of 10k.

What will happen is every time you insert an event it will be multiplexed into
all 4 sketches. Each one of them will have a dictionary "thDict" (I will look
into having them shared in a domain to reduce memory) that can take up to 10k
keys. The keys will be the "strings" passed in and the values will be counters
(+1 every time the key is inserted). Using a dictionary of 10k keys we can
then give you precise, 100% accurate answers for:

* membership: check if key exists in "thDict"

* cardinality: count number of keys

* frequency: for a given key check the value

* top-k: sort the dictionary by values/keys

Once the 10k keys threshold is reached, we can just replay them into all the
sketches.
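If it helps, here's a minimal Python sketch of that threshold-dictionary idea (the `ThresholdDict` name and methods are illustrative, not Skizze's actual API):

```python
class ThresholdDict:
    """Exact counts below a key threshold; beyond it, fall back to sketches."""

    def __init__(self, threshold=10_000):
        self.threshold = threshold
        self.counts = {}

    def insert(self, key):
        # Returns False once a *new* key would exceed the threshold,
        # signalling that keys should be replayed into the sketches.
        if key not in self.counts and len(self.counts) >= self.threshold:
            return False
        self.counts[key] = self.counts.get(key, 0) + 1
        return True

    def membership(self, key):
        return key in self.counts

    def cardinality(self):
        return len(self.counts)

    def frequency(self, key):
        return self.counts.get(key, 0)

    def top_k(self, k):
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

All four queries are exact as long as the dictionary hasn't overflowed, which is why the accuracy is 100% below the threshold.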

Does this make sense?

Cheers and thanks for the kind words :D

~~~
alixaxel
I get the multiplexing and I think that's a great concept. One other thing
you might also want to consider (to reduce unneeded interaction with the
server) is allowing counts to be specified per key - useful, for instance,
when you want to know not only which IPs are the most frequent hitters but
also which IPs are the "heaviest" hitters (by number of bytes transferred
instead of individual requests).
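In code terms the suggestion boils down to a weighted increment; a toy Python sketch (plain counters, not Skizze's API, with made-up IPs and sizes):

```python
from collections import Counter

hits = Counter()        # most frequent hitters (requests)
bytes_sent = Counter()  # "heaviest" hitters (traffic volume)

# Per request: count both the hit itself and its weight in bytes.
for ip, size in [("10.0.0.1", 512), ("10.0.0.2", 1_000_000), ("10.0.0.1", 256)]:
    hits[ip] += 1
    bytes_sent[ip] += size

# The two views can rank IPs differently: 10.0.0.1 wins on request
# count, while 10.0.0.2 wins on bytes transferred.
```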

I didn't quite get the semantics behind the threshold though (I might have
to look at the code to fully get it), but building on your example of 1M
keys and 10k as the threshold, you say the error rate would be 0% for all the
sketches (which is great - but why 0%?). So suppose all keys are 32 bytes long
and the counters are 32-bit integers - for simplification let's also suppose
that there's a single shared "thDict" dictionary - this would give a raw
memory usage of 10,000 * (32 + 4) bytes, or about 352 KiB.

Then, if I understood correctly - it seems that the threshold determines the
memory usage (unless the "sketch replaying" part that I also didn't fully
understand affects this in some way). But what I would like to know is in what
way the cardinality / threshold ratio influences the expected error rate.
Given that any probabilistic data structure should work on the basis of the
expected error and memory requirements, it seems crucial to know how to plan
for both sides of the equation, but somehow after reading both links and the
GitHub repo README I still can't understand it.

------
krat0sprakhar
Looks useful. However, how do the bloom filter and HLL implementations
compare to the ones available in Redis?

~~~
seiflotfy
Redis provides a bitset that can be used as a bloom filter. You'd end up
having to hash k times on the client side, while also calculating the ideal
size of the bitset for the expected "n" values and "e" error rate.
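For reference, that "ideal size" calculation is the standard bloom-filter sizing formula; a quick Python sketch (standard math, nothing Redis- or Skizze-specific):

```python
import math

def bloom_params(n, e):
    """Ideal bitset size m (bits) and hash count k for n expected
    items at false-positive rate e."""
    m = math.ceil(-n * math.log(e) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# e.g. 1M items at a 1% error rate needs roughly 9.6M bits (~1.2 MB)
# and 7 hash functions.
m, k = bloom_params(1_000_000, 0.01)
```

This is the client-side work Skizze saves you from when using a raw Redis bitset.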

The HyperLogLog situation is much the same. We intend to provide a variable
threshold feature to guarantee 100% accuracy up to a specific cardinality.

