Realtime Metrics for 128 Million Users with Redis: 50ms + 16MB RAM (getspool.com)
112 points by pooriaazimi 1858 days ago | 18 comments

This was posted a while ago, and I have since implemented bitmaps myself. One thing I learned from the documentation[1] is that setting an initial bit at a very high-numbered place (like 2^30 - 1) takes a while to allocate (compared to Redis's normal speed) and blocks other operations in the process.

In my case, and it appears to be true for Spool too, I don't know which bit will be set first. It could be 12 or it could be 2938251, so to prevent a slowdown when the first bit set is in a high position, I use buckets of bitmaps, each holding around 8 million bits.

[1] Read Warning: http://redis.io/commands/setbit

There is an easy fix for this, which is also the recommended way to use bitmaps in Redis: split your bitmap among multiple keys.

For instance, if you want to set bit i but want k bits per key, you do:

    keyname = "bitmap:"+(i/k)
    keybit = i%k
k can be fairly large, like 128k bytes per key. That's still small, but big enough for the per-key overhead to be negligible.
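A runnable sketch of that key-splitting scheme in JavaScript (the function and constant names are mine; the actual SETBIT call is left as a comment since it needs a live Redis client):

```javascript
// Split a global bit index i across multiple bitmap keys,
// storing k bits per key. 128k bytes = 1,048,576 bits per key.
var BITS_PER_KEY = 128 * 1024 * 8;

function bitLocation(i, k) {
  k = k || BITS_PER_KEY;
  return {
    key: "bitmap:" + Math.floor(i / k), // e.g. "bitmap:0", "bitmap:2", ...
    bit: i % k                          // offset to pass to SETBIT
  };
}

// With a Redis client you would then call:
//   client.setbit(loc.key, loc.bit, 1);
var loc = bitLocation(2938251);
// loc.key === "bitmap:2", loc.bit === 841099
```

This way no single key ever grows past 128 KB, so the one-time allocation on the first high SETBIT stays small.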

Wow, that's almost exactly how I implemented it:

    // ~8,190 bits per bucket key
    var bucketSize = 8190;

    var bucketNumber = Math.floor(userId / bucketSize),
        bitInBucket = userId % bucketSize;

...correction on my last comment, looks like I use ~8 thousand bits per bucket, not 8 million.

Repost from https://news.ycombinator.com/item?id=3292542 if anyone's interested in reading the comments from then.

I know disk space is cheap these days, but at 16MB/metric/<level-of-granularity>, it seems like your metric dataset would grow pretty quickly. With just 10 metrics tracked daily, that's another gigabyte per week. Of course it does come with the benefit of keeping all the raw data, since you never roll up or aggregate it...so the pros probably outweigh that con. :)

I was thinking that too, but the 16m is keeping track of data for 128m users. Assuming you don't have that many, the number is potentially a lot less.

2 million users' actions could be tracked in 250k per metric. 10 metrics per day is 2.5m per day x 7 days is back to just over 16m (17.5m).
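A quick sanity check on that arithmetic (one bit per user, sizes in bytes):

```javascript
// One bit per user, so bitmap size in bytes = users / 8.
var users = 2 * 1000 * 1000;
var bytesPerMetric = users / 8;          // 250,000 bytes ≈ 250 KB per metric

var metricsPerDay = 10;
var bytesPerWeek = bytesPerMetric * metricsPerDay * 7;
// 250 KB * 10 * 7 = 17,500,000 bytes ≈ 17.5 MB per week
```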

Redis stores everything in RAM, and RAM is not as cheap as disk. Adding GBs of RAM every week will quickly get rather expensive. But I guess you could dump old data to disk and load it back into Redis only when you need it. It might even compress well, depending on what the metrics track.

"But I guess you could dump old data to disk and load it back to Redis only when you need it."

Redis has a mode which does this automatically I believe (and it's the default if I remember correctly).

Isn't Redis still single-threaded for queries, but saving in the background? That seems a little risky: you've got your 100 million users setting bits in your bitsets and suddenly everything blocks for 10 seconds while old data is being loaded from disk.

The "virtual memory" feature is now deprecated, I think.

It only really needs to be captured like that in Redis; when the collection period is up you could store it on disk, or even just store the aggregate information on disk.

It's not often you see 16MB these days and it turns out not to be a typo.

The only problem with this method is that it requires that IDs are integers, start at 1 and increment by 1.

I'm using MongoDB and IDs are 12-byte values of which the first four are a timestamp. Does anyone know of a way to make this method work, ideally without adding another field to the collection?

The comments on the article address this - the OP is using UUIDs as the primary key for their users, but each user is also assigned an "analytics key" which is an integer that started at one. You can even use the redis INCR command to generate these on demand.
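A sketch of that mapping from UUID primary keys to compact integer "analytics keys". In production the counter would live in Redis (client.incr("analytics:next_id", ...)); here an in-memory object stands in for Redis so the idea is runnable, and the key names are mine:

```javascript
// In-memory stand-in for a Redis INCR counter plus a hash of
// uuid -> analytics key. With real Redis you'd also need to guard
// the check-then-set race (e.g. with a Lua script or SETNX).
var nextId = 0;
var analyticsKeys = {};

function analyticsKeyFor(uuid) {
  if (!(uuid in analyticsKeys)) {
    analyticsKeys[uuid] = ++nextId; // dense integers starting at 1
  }
  return analyticsKeys[uuid];
}

analyticsKeyFor("550e8400-e29b-41d4-a716-446655440000"); // → 1
analyticsKeyFor("f47ac10b-58cc-4372-a567-0e02b2c3d479"); // → 2
```

The dense integers are what make the bitmap trick work: bit N in the bitmap is user with analytics key N, with no gaps wasted on random UUID bits.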

There is a way in Mongo to replace the id with an auto-incrementing number. Have a look at the docs. It's also helpful if you want to use the id as a base62 value for URLs.
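Base62-encoding such an integer id for short URLs might look like this (a generic sketch, not Mongo-specific):

```javascript
// Digits, then lowercase, then uppercase: 62 symbols total.
var ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

function toBase62(n) {
  if (n === 0) return "0";
  var s = "";
  while (n > 0) {
    s = ALPHABET[n % 62] + s; // prepend least-significant digit
    n = Math.floor(n / 62);
  }
  return s;
}

toBase62(125); // → "21" (2*62 + 1)
```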

Kind of a disingenuous title, since that time sort of implies that redis is handling that many users and that that's the average response time...

Sorry - although I think the word 'Metrics' should dispel that implication, I can understand what you mean.

I hope it's clearer now.


   Realtime Metrics with Redis: 128 Million Users + 16MB RAM = 50ms

   Realtime Metrics for 128 Million Users with Redis: 50ms + 16MB RAM

I still think that those user numbers in the title evoke a mental image of a certain type of load with a certain type of response time. I think if you got rid of the response time it would be less linkbaity, because then it's clear that your focus is on the amount of storage it would take. It's not very important either way.
