

Effective Web App Analytics with Redis - hdeshev
http://filer.progstr.com/1/post/2012/03/effective-web-app-analytics-with-redis.html

======
trun
Great article. I've built a number of systems very similar to this and have
found Redis to be a fantastic platform, both in terms of reliability and
flexibility. I'll share a few tips that I think complement your approach.

\- When you want to compute metrics for multiple intervals (hour / day / month
/ etc) Redis' MULTI / EXEC constructs make transactional updates to multiple
keys a snap. Additionally batching (which is supported by most Redis clients)
can _dramatically_ improve performance.

\- You can use Redis sets for computing uniques in real time. You can also use
set operations like SUNION to compute uniques across multiple time periods
relatively quickly - for example, SUNION the 24 hourly sets to get the total
uniques for the day. You just have to be careful because large numbers of
uniques eat up your available memory _very_ quickly. EXPIREAT helps ensure
things get cleaned up automatically.

\- Using a Redis list as an event queue is a great way to further ensure
atomicity. Use RPOPLPUSH to move events to an 'uncommitted' queue while
processing a batch of events. If you have to rollback, just pop them back on
to the original list.
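The first tip (MULTI/EXEC plus batching) can be sketched like this - a rough illustration assuming the redis-py client, with made-up key names:

```python
import datetime

def interval_keys(metric, ts):
    """One key per rollup interval (hour/day/month) for a single event."""
    return [
        f"{metric}:hour:{ts:%Y%m%d%H}",
        f"{metric}:day:{ts:%Y%m%d}",
        f"{metric}:month:{ts:%Y%m}",
    ]

# With redis-py, a pipeline created with transaction=True wraps the
# updates in MULTI/EXEC and sends them in a single round trip:
#
#   pipe = r.pipeline(transaction=True)
#   for key in interval_keys("hits", event_time):
#       pipe.incr(key)
#   pipe.execute()
```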
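The uniques tip, modeled with plain Python sets to show the semantics - in Redis these would be SADD into per-hour keys (with EXPIREAT so old hours clean themselves up), then SUNION across them:

```python
# Hypothetical per-hour visitor sets; in Redis: SADD visitors:<hour> <ip>.
hourly = {
    "visitors:2012030500": {"1.2.3.4", "5.6.7.8"},
    "visitors:2012030501": {"5.6.7.8", "9.9.9.9"},
}

# SUNION over the hourly sets yields the daily uniques without
# maintaining a separate daily set: here, 3 unique visitors.
daily_uniques = set().union(*hourly.values())
```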
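And the RPOPLPUSH pattern, again modeled in plain Python - deques standing in for Redis lists, queue names illustrative:

```python
from collections import deque

events = deque(["e1", "e2", "e3"])  # the main event queue
uncommitted = deque()               # holding area while a batch runs

def rpoplpush(src, dst):
    """Mimic RPOPLPUSH: pop from the right of src, push onto the left of dst."""
    item = src.pop()
    dst.appendleft(item)
    return item

batch = [rpoplpush(events, uncommitted) for _ in range(2)]

# On success you'd clear `uncommitted`; on failure, pop the events
# back onto the original list to roll the batch back:
while uncommitted:
    events.append(uncommitted.popleft())
```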

~~~
hdeshev
And that's why I love Hacker News! Thanks for the tips, trun!

I'll make sure I use batching first, and look into the unions technique after
that.

------
thibaut_barrere
First, thanks for sharing! Then a comment on this:

"I've done implementations of the above using SQL databases (MySQL) and it
wasn't fun at all. The storage mechanism is awkward - put all your values in a
single table and have them keyed according to stats name and period. That
makes querying for the data weird too. That is not a showstopper though - I
could do it. The real problem is hitting your main DB a couple of times in a
web request, and that is definitely a no-no."

This is not a SQL vs NOSQL issue: decoupling the reporting system from your
main (production/transaction) system is a widely advised practice in "business
intelligence".

Use a different instance, with a schema designed for reporting.

You can use Redis for that (and I use it actually!) but you can also use MySQL
or any other RDBMS.

It's fairly easy to implement: one row for each fact, plus foreign keys to a
date dimension and an hour dimension (see [1]). Then you can sum over date
ranges and hour ranges, drill down, etc., across many different metrics.

[1] <https://github.com/activewarehouse/activewarehouse-etl-sample/tree/master/etl>
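A minimal sketch of that layout in SQLite - table and column names here are made up for illustration, not taken from the linked sample:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE date_dim (date_id INTEGER PRIMARY KEY, day TEXT);
    CREATE TABLE page_view_facts (
        date_id INTEGER REFERENCES date_dim (date_id),
        hour INTEGER,
        views INTEGER
    );
""")
db.executemany("INSERT INTO date_dim VALUES (?, ?)",
               [(1, "2012-03-05"), (2, "2012-03-06")])
db.executemany("INSERT INTO page_view_facts VALUES (?, ?, ?)",
               [(1, 9, 10), (1, 10, 25), (2, 9, 7)])

# Summing per day (or per hour range, per metric, ...) is a plain GROUP BY:
rows = db.execute("""
    SELECT d.day, SUM(f.views)
      FROM page_view_facts f JOIN date_dim d USING (date_id)
     GROUP BY d.day ORDER BY d.day
""").fetchall()
# rows -> [("2012-03-05", 35), ("2012-03-06", 7)]
```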

~~~
hdeshev
Sound advice. I didn't mean to wander into the SQL vs. NoSQL war zone, and I
have nothing against SQL DBs. All I wanted to say is that I find the current
Redis solution easier to implement than the [clumsy] one I did in the past.

~~~
thibaut_barrere
Yeah, no worries! I didn't mean to imply that :) Like I said, I use Redis time
series exactly like you.

The pros are that it's very easy to set up (no schema definition, a very
practical API, easy to query); the cons are that you are limited by memory
(though, as you wrote, not an issue in your case) and that it's harder to
build more elaborate reports.

But I use both techniques depending on the needs.

Thanks for taking the time to write this!

------
ihsw
> The above mechanism needs some finishing touches. The first is data
> expiration. If you don't need daily data for more than 30 days back, you
> need to delete it yourself. The same goes for expiring monthly data - in our
> case stuff older than 12 months. We do it in a cron job that runs once a
> day. We just loop over all series and trim the expired elements from the
> hashes.

Rather than iterating over the entire list of series and checking for expired
elements, you can use a sorted set and assign each member a time-based score.
The cron job can still run once a day, but it can simply delete the members of
that sorted set whose scores fall below a certain threshold, which will almost
certainly be faster.

Naturally this will increase memory usage (which may be undesired), but it's
food for thought. Eventually the looping and trimming of expired hashes could
be done with Lua server-side scripting in redis-2.6, which is interesting in a
different way and has its own challenges.
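A toy model of the sorted-set approach, with plain Python standing in for Redis - the real commands would be ZADD with the timestamp as the score, and ZREMRANGEBYSCORE for the purge:

```python
import bisect

class ScoredSet:
    """Simplified stand-in for a Redis sorted set, kept ordered by score."""
    def __init__(self):
        self._items = []  # sorted list of (score, member) pairs

    def zadd(self, score, member):
        # Unlike real ZADD, this toy doesn't replace an existing member's score.
        bisect.insort(self._items, (score, member))

    def zremrangebyscore(self, lo, hi):
        """Delete every member whose score falls in [lo, hi]; return the count."""
        before = len(self._items)
        self._items = [(s, m) for s, m in self._items if not (lo <= s <= hi)]
        return before - len(self._items)

zs = ScoredSet()
zs.zadd(1330905600.0, "series:a")  # score = event timestamp
zs.zadd(1330992000.0, "series:b")

# The daily cron job becomes a single range deletion up to the cutoff:
removed = zs.zremrangebyscore(0, 1330950000.0)  # removes series:a only
```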

~~~
theli0nheart
The problem with this implementation is that you can't have multiple entries
for items with the same value. For example, you might be collecting metrics
for each user in a web application and want to measure access times. The
obvious choice is the IP address as the member and the timestamp as the score,
but when you insert a new member/score pair into a ZSET, any previous score
for the same member is replaced.

That makes ZSETs kind of difficult to use for metrics unless you only care
about uniques.
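A quick illustration of the replacement behavior, with a dict modeling the member -> score mapping a ZSET maintains (the IP and timestamps are made up):

```python
zset = {}  # member -> score, which is how a Redis sorted set keys entries

def zadd(member, score):
    zset[member] = score  # ZADD overwrites the score of an existing member

zadd("1.2.3.4", 1330948800.0)  # first request from this IP
zadd("1.2.3.4", 1330952400.0)  # second request replaces the first

# Only the latest timestamp survives - one entry, not an access history.
```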

------
tptacek
There's a blog post from Salvatore somewhere about how he marshalled time
series data into strings, which made me think the naive/straightforward
approach was suboptimal. I always thought ZSETs indexed by time_ts would be a
good fit for this.

~~~
thibaut_barrere
In case this helps, probably related:

\- <https://github.com/antirez/redis-timeseries>

\- ZSET pull request <https://github.com/antirez/redis-timeseries/pull/1>

\- and the resulting "dsl uptime" script I made out of it:
<https://github.com/thbar/dsl-uptime>

------
bradleyland
This is cool, but if you're looking to work with time series data, you should
definitely have a look at RRD. A lot of the operations you'd want to perform
on time series data are available internally with RRD. RRD can also do some
cool stuff like generate graphs.

~~~
hdeshev
I took a look at it when starting work on those features, but decided against
it. I wanted as much freedom as possible in storing, accessing, and displaying
data since we'll need to build something more sophisticated in terms of
analytics later on.

We use our analytics engine to show charts to our users as well. I can't do
that with graphs generated from tools like Cacti/Ganglia/Graphite... As is the
case with almost all sysadmin tools, they don't look too good.

~~~
itay
Another option for time series data (with prettier graphs) is Splunk.
Disclaimer: I work at Splunk, on the developer platform team, so I have a
vested interest in developers trying it out and giving me feedback :)

If you're curious about it and have any questions, feel free to get in touch
(email in profile).

