
Show HN: Run your own A/B testing back end on AWS Lambda and Redis HyperLogLog - gingerlime
https://github.com/Alephbet/gimel
======
yummyfajitas
Unless you are sharding multiple redis instances and merging the HLL counts at
the end (you need only do this at google scale, maybe not even then), HLL is
not the right way to go. HLL has errors of several percent.

Instead you should use a bloom filter plus a simple counter:

        if not bloom_filter.might_contain(user_id):
            count.incr()
            bloom_filter.add(user_id)

A bloom filter can easily be tuned to have 1e-6 false positive rate (meaning a
1e-6 error rate in the counts). You just can't reliably do that with HLL.
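To make the pseudocode above concrete, here is a minimal in-memory sketch of the bloom-filter-plus-counter idea. The `BloomFilter` class and its sizing formulas are my own illustration (not from gimel); a real deployment would back the bit array with a Redis bitmap (SETBIT/GETBIT) instead of a local bytearray.

```python
import hashlib
import math

class BloomFilter:
    """Illustrative bloom filter; a production version would live in Redis."""

    def __init__(self, capacity, fp_rate=1e-6):
        # Standard sizing: m bits and k hash functions for a target
        # false-positive rate at the given capacity.
        self.m = math.ceil(-capacity * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing (h1 + i*h2) approximates k independent hashes.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def might_contain(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

# Count unique users: increment only when the filter has not seen the id.
bloom = BloomFilter(capacity=100_000)
count = 0
for user_id in ["u1", "u2", "u1", "u3", "u2"]:  # duplicates must not count
    if not bloom.might_contain(user_id):
        count += 1
        bloom.add(user_id)
print(count)  # 3 unique users
```

At 1e-6 false positives the only error source is the rare duplicate-looking new user, so the counter undercounts by at most ~1 in a million adds.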

~~~
gingerlime
That's one of the main trade-offs, but I hope it's still reasonable. As far as
I'm aware, the quoted error rate with Redis is 0.81%[0]. There are also bound
to be built-in errors in how browsers send those events, bounces, adblockers
etc. So I wonder if it's still within an "acceptable" range?

There's an experimental branch[1] I was working on[2] that uses Google
BigQuery instead of Redis. I'd love to take a look at your bloom filter
suggestion though.

As I mentioned, I'm eager to get collaborators to join the project. Any input
is appreciated.

[0] [http://antirez.com/news/75](http://antirez.com/news/75)

[1]
[https://github.com/Alephbet/gimel/tree/bigquery](https://github.com/Alephbet/gimel/tree/bigquery)

[2] [http://blog.gingerlime.com/2016/a-scalable-analytics-backend...](http://blog.gingerlime.com/2016/a-scalable-analytics-backend-with-google-bigquery-aws-lambda-and-kinesis/)

~~~
yummyfajitas
The standard deviation is 0.8% for a single HLL, but a ratio involving two of
them will have an error of about 1.6%. Simple arithmetic:

        conversions x (1 + 0.008) / [visitors x (1 - 0.008)]
        ~ (conversions / visitors) x (1 + 2 x 0.008 + ...)
        ~ (conversions / visitors) x (1 + 0.016 + ...)

(It should actually be a bit less if you do the calculation carefully,
probably 1.6%/sqrt(2).) 12KB is also a pretty big filter, but I guess
reasonable.
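The arithmetic above can be checked numerically. This is a worst-case sketch with assumed traffic numbers (1,000 conversions out of 50,000 visitors are my own invented figures): a +0.8% error on conversions combined with a -0.8% error on visitors inflates the measured ratio by roughly 1.6%.

```python
# Assumed example traffic; any numbers give the same relative error.
conversions, visitors = 1_000, 50_000
true_ratio = conversions / visitors

# Worst case: numerator overcounted by 0.8%, denominator undercounted by 0.8%.
worst = (conversions * 1.008) / (visitors * 0.992)
relative_error = worst / true_ratio - 1
print(f"{relative_error:.4f}")  # ~0.0161, i.e. about 1.6%
```

Since 1.008 / 0.992 ≈ 1.0161, the per-HLL errors roughly add when you take a ratio, which is exactly the first-order expansion shown above.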

It's definitely true that non-statistical errors are highly likely to drown
this out. But why have extra error if you don't need it?

~~~
gingerlime
Totally agree. Should aim to reduce error rates as much as possible. That's
one of the reasons I was also exploring BigQuery.

Any tips for implementing this with bloom filters using redis/python and how
to reduce or even estimate error rates there?

------
gingerlime
Just released version 1.0.0 with a CLI for easy deployment on AWS.

I'm looking for contributors, so if you're interested - please get in
touch[0].

[0]
[https://github.com/Alephbet/gimel/issues/2](https://github.com/Alephbet/gimel/issues/2)

