

Efficient way to calculate active users - misspran
https://engineering.helpshift.com/

======
jacquesm
There is always a trade-off between speed and precision, which is why many
systems that need to figure out some value will resort to sampling and other
tricks.

Another trade-off used is the space-speed one, which this article is an
example of (and a very elegant one at that). If you want to use this then
you're going to have to keep intermediate results around in some form.

If you want the best of both worlds (in even less space!) for this particular
use case, you can find a pretty good middle ground by figuring out what the
ratio of users to unique users is for some period, then just keeping a single
count per day, summing over your desired period, and dividing by the ratio
found earlier.

This allows near-instantaneous computation whenever the result is required,
and lets you push the time-consuming portion of the computation to whenever
the system is lightly loaded. The more frequently you compute the ratio, the
closer your result will be to the actual number.

Of course these are all approximations; if for some reason you require the
exact number then you're simply going to have to do more work.
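
A rough sketch of that ratio trick in Python (all names here are made up for
illustration, not taken from the article):

    def compute_unique_ratio(events):
        # Offline, when the system is lightly loaded: how many events per
        # unique user did we see over some reference period?
        total = len(events)
        unique = len({e["user_id"] for e in events})
        return total / unique if unique else 1.0

    def estimate_active_users(daily_event_counts, ratio):
        # Cheap online estimate: sum the per-day counts, divide by the ratio.
        return sum(daily_event_counts) / ratio

    # Ratio computed from last week's raw events, then reused for fast
    # estimates until it is recomputed.
    ratio = compute_unique_ratio([
        {"user_id": 1}, {"user_id": 1}, {"user_id": 2}, {"user_id": 3},
    ])
    print(estimate_active_users([120, 95, 143], ratio))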

------
Zikes
Redis implemented HyperLogLog as a native feature recently. Here's antirez's
announcement and write-up:
[http://antirez.com/news/75](http://antirez.com/news/75)
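
A quick sketch of what the native commands look like from redis-py (the key
names are just examples; needs Redis >= 2.8.9):

    import redis

    r = redis.Redis()

    # One HyperLogLog key per day; PFADD records activity.
    r.pfadd("active:2014-04-22", "user:42", "user:99")
    r.pfadd("active:2014-04-23", "user:42")

    # PFCOUNT gives the approximate number of uniques.
    print(r.pfcount("active:2014-04-22"))

    # PFMERGE unions daily counters into, say, a monthly one.
    r.pfmerge("active:2014-04", "active:2014-04-22", "active:2014-04-23")
    print(r.pfcount("active:2014-04"))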

~~~
joevandyk
Also, postgresql has HyperLogLog as an extension:
[http://tapoueh.org/blog/2013/02/25-postgresql-hyperloglog](http://tapoueh.org/blog/2013/02/25-postgresql-hyperloglog)

------
bhouston
Or just store on each user record the time of their last action, updating it
only if the stored time is more than a day old (or whatever interval). Given
that you usually have the user record in memory already, this can be
moderately efficient. Then you can pass through your user records linearly,
with no extra memory, to figure out MAU, DAU, YAU, etc.
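
A rough sketch of that linear pass (the record shape here is made up for
illustration):

    from datetime import datetime, timedelta

    def count_active(users, window_days):
        # One pass over the user records, no extra state needed.
        cutoff = datetime.now() - timedelta(days=window_days)
        return sum(1 for u in users if u["last_active"] >= cutoff)

    users = [
        {"id": 1, "last_active": datetime(2014, 4, 22)},
        {"id": 2, "last_active": datetime(2013, 12, 1)},
    ]
    dau = count_active(users, 1)
    mau = count_active(users, 30)
    yau = count_active(users, 365)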

~~~
alaiacano
Using a mutable "last active time" from some sort of users table is pretty
dangerous for post-hoc analysis because if (when!) something goes wrong, the
information is overwritten and gone forever. It's better to use immutable
event logs as the post describes.

~~~
raverbashing
"pretty dangerous for post-hoc analysis because if (when!) something goes
wrong, the information is overwritten and gone forever."

If your last-active-time table is getting fumbled then yes, but I'd worry
about all the other parts of your data if that were the case.

Dangerous is not something I would ascribe to a "last active time" table
(except if you're using it for security audit purposes).

~~~
vedang
Mutable data does not give you any ability to analyse actions over a period of
time. For example: "How many people were active during the month of the Indian
Elections?" Can't tell you today, because all the data has been mutated away.

Secondly, mutable data also means you have less scope for drawing new insights
from historic data (primarily because you never have historic data).

------
amix
There is a better way to do this by using bitmaps (especially bitmaps in
Redis). I recommend looking at
[https://github.com/Doist/bitmapist](https://github.com/Doist/bitmapist)
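
The underlying idea, sketched with plain Redis commands rather than
bitmapist's own API (check the repo for that): one bitmap per day, with the
user id as the bit offset.

    import redis

    r = redis.Redis()

    r.setbit("active:2014-04-22", 42, 1)      # user 42 was active today
    r.setbit("active:2014-04-22", 1001, 1)

    print(r.bitcount("active:2014-04-22"))    # today's DAU
    print(r.getbit("active:2014-04-22", 42))  # was user 42 active today?

    # BITOP combines days, e.g. "active at least once this week":
    r.bitop("OR", "active:2014-w17", "active:2014-04-21", "active:2014-04-22")
    print(r.bitcount("active:2014-w17"))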

~~~
kiran_kulkarni
Technically HyperLogLog works on bitmaps; this library leverages that fact and
uses Redis bitmaps instead of an in-memory implementation.

Although we are fans of Redis, if you implement it natively you can avoid
network latency. Implementing it natively is not a problem because of the
commutative nature of HyperLogLog.
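
That commutativity is easy to see in code: an HLL is just an array of
registers, and the union of two HLLs is the element-wise max of their
registers, so per-process counters can be combined later in any order. A
minimal sketch (not the article's implementation):

    def merge_hll(registers_a, registers_b):
        # Union of two HyperLogLog counters with the same register count.
        assert len(registers_a) == len(registers_b)
        return [max(a, b) for a, b in zip(registers_a, registers_b)]

    # Counters maintained independently on two app servers, merged only
    # when a global estimate is needed.
    node1 = [0, 3, 1, 0]
    node2 = [2, 1, 1, 4]
    print(merge_hll(node1, node2))  # [2, 3, 1, 4]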

Further, if one is planning to use Redis, it is better to use the built-in
HyperLogLog data structure provided by Redis 2.8.9, as documented here:
[http://antirez.com/news/75](http://antirez.com/news/75)

------
Groxx
HyperLogLog is quite neat, and I very much enjoyed antirez's writeup as well.
There are probably a lot of uses for it beyond e.g. counting active users. I'd
love to hear about them.

But nearly everyone uses counting users as an example. For this kind of use, I
honestly have to ask: at WhatsApp's scale, is 5GB of ram really an issue? It
seems like they could probably keep that exact setup _and roll it over every
minute_ and not even really tax a modern server.

Or compact it - one bit per person, lookup is just jumping to the address at
their ID, counting is just summing, which would probably meet most needs. With
this you can handle _every person on earth_ in < 8GB. You can do that with an
m3.xlarge on EC2 (15GiB ram) for a measly 25 cents per hour. That's $6/day.
That's _literally nothing_ compared to normal server costs.
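
Back-of-the-envelope, with a toy bitmap (sizes and names made up, but the
arithmetic is the point):

    class Bitmap:
        def __init__(self, size_bits):
            self.bits = bytearray((size_bits + 7) // 8)

        def set(self, i):
            self.bits[i >> 3] |= 1 << (i & 7)

        def count(self):
            # Popcount over the whole array = number of active users.
            return sum(bin(b).count("1") for b in self.bits)

    # One bit per person for ~7.2 billion people:
    print(7_200_000_000 / 8 / 2**30)  # ~0.84 GiB, comfortably under 8GB

    demo = Bitmap(1_000_000)
    demo.set(123_456)
    print(demo.count())               # 1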

~~~
bialecki
I'm sure other people are wondering the same thing, so a quick take.

The problem is not if you're counting one thing (or even 100). The problem is
when you want analytics and you want it to scale to 1,000s or 1,000,000s of
counters. That may seem ridiculous (who could possibly need that many
counters?). But it happens quickly when you say, "How many DAUs do we have?
How many from country X? How many using device Y? How many from country X and
using device Y?"

Also, to address an idea you mentioned around bitmaps. Bitmaps are great until
you have lots of counters and lots of users/things to count. Then the problem
is they get very sparse. Imagine user #100,000 does something. You need to
allocate about 100,000 bits, roughly 12KB of mostly zeros (everything before
that 100,000th bit), just to count that one thing. Are bitmaps a good idea?
Sure, in a lot of cases they are. The problem is they just break down at some
point, and that's when these other tricks are really nice.
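
The rough arithmetic behind that break-down point (the counter and user
figures are made-up examples; 12KB is the dense HLL size Redis uses):

    users = 10_000_000
    counters = 100_000   # DAU, per-country, per-device, combinations...

    bitmap_bytes = counters * users / 8    # one bit per user per counter
    hll_bytes = counters * 12 * 1024       # fixed ~12KB per HLL counter

    print(bitmap_bytes / 2**30)  # ~116 GiB
    print(hll_bytes / 2**30)     # ~1.1 GiB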

------
bosky101
hat-tip to the contributions made by Philippe Flajolet in this field.
[http://en.wikipedia.org/wiki/Philippe_Flajolet](http://en.wikipedia.org/wiki/Philippe_Flajolet)

------
Kenji
"This event usually contains an id which can uniquely identify the user. This
id can be a cookie, IP address or Vendor ID in an iOS App."

I'm disappointed. In the first few lines he simply pushes the core issue
aside, that is, unique identification. IPs are nowhere near unique identifiers
and cookies might be disabled. Once you get unique identification, counting is
easy.

~~~
Zikes
That's not really the point of the article. That "core issue" was pushed aside
because for the purposes of the discussion of counting uniques the
determination of uniqueness is assumed solved.

