

Ask HN: How would you build a high performance, semi-real-time stats service? - simonw

Many sites can benefit from showing real-time stats about their visitors - bit.ly have practically built their entire product around it - but doing so requires persisting data on every page request in a way that allows for near-real-time queries. How would you go about building something like this - regular MySQL, memcached counters or something more exotic like MongoDB?
======
simonw
OK, I'll start. I should emphasize I don't have a particular project in mind
just yet, but I do think it's an interesting thought exercise.

For my purposes, I'm not particularly interested in saving a full record for
every single hit for all time - I've already got that in web server logs. I'd
much rather accumulate counters - they take up much less space and should be a
lot cheaper to query when I want to generate graphs and reports.

MongoDB looks like a tempting option here. It has an atomic increment option
and from my initial testing appears to be really fast. Since I'm just
incrementing counters, sharding writes across multiple instances should be
pretty easy in the future (though that is true no matter what data store I
use).
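For a sense of what that looks like, here's a quick pymongo sketch (the
collection and field names are just placeholders):

    # Atomic increment via MongoDB's $inc, using pymongo.
    from pymongo import MongoClient

    counters = MongoClient("localhost", 27017).stats.counters

    def record_hit(page_id):
        # $inc is applied atomically on the server, so concurrent writers
        # don't race; upsert=True creates the document on the first hit.
        counters.update_one(
            {"_id": page_id},
            {"$inc": {"hits": 1}},
            upsert=True,
        )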

Here's one potential scheme. For each page (or whatever is being counted),
maintain a counter for every minute for the past two hours. Also maintain a
counter for every hour for the past month, and a counter for every day
forever.

When a hit comes in, increment the counter for that page on the current day,
the current hour and the current minute. That's three counter updates.

Every 24 hours, expire any minute counters from more than 24 hours ago, and
any hour counters from more than two days ago.

You might want to keep track of hits to an individual page, hits to that
section of the site (or any of that user's pages) and hits to the site as a
whole. That's fine - it just means you'll be updating 9 counters for every hit
instead of 3.
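Sketching that scheme in pymongo (the key format, field names and cutoffs are
just one way to lay it out):

    from datetime import datetime, timedelta, timezone
    from pymongo import MongoClient

    counters = MongoClient().stats.counters

    def record_hit(page_id, now=None):
        now = now or datetime.now(timezone.utc)
        buckets = [
            ("minute", now.strftime("%Y-%m-%dT%H:%M")),
            ("hour",   now.strftime("%Y-%m-%dT%H")),
            ("day",    now.strftime("%Y-%m-%d")),
        ]
        # One atomic $inc per granularity: three counter updates per hit.
        for granularity, bucket in buckets:
            counters.update_one(
                {"_id": "%s:%s:%s" % (page_id, granularity, bucket)},
                {"$inc": {"hits": 1},
                 "$set": {"page": page_id, "granularity": granularity,
                          "bucket": bucket}},
                upsert=True,
            )

    def expire_old_counters(now=None):
        # Run every 24 hours: drop minute counters older than a day and hour
        # counters older than two days (ISO-style buckets sort lexically).
        now = now or datetime.now(timezone.utc)
        counters.delete_many({
            "granularity": "minute",
            "bucket": {"$lt": (now - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M")},
        })
        counters.delete_many({
            "granularity": "hour",
            "bucket": {"$lt": (now - timedelta(days=2)).strftime("%Y-%m-%dT%H")},
        })

Covering section-level and site-level counts would just mean calling the same
update with section and site keys as well - the 9-counters-per-hit case.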

~~~
nir
I don't really know MongoDB. Memcache seems like a good fit: you could use
key names like '{{url}}-{{day}}-{{hour}}' etc., incr/decr is built into the
API, and you can make use of its expiration system. Keep in mind that with
memcache (AFAIK) you need to know your key names - i.e., you can't search for
"all keys starting with x" - so you might need to keep a table of keys in a
SQLish DB.

Perhaps take a look at Google App Engine - it has Memcache built in and is
optimized for handling a lot of short requests. It might save you a lot of
deployment/scaling headaches.

------
mbenjaminsmith
I look at it this way: you need your 'collection' side of things to be as
lightweight as possible. I would 1) collect request data in memory, 2) 'post-
process' that data and move it periodically into your db (once a minute, once
a second, whatever) - allowing you to bucket the data however you want - and
3) move that pre-processed data back into memory to add into your responses
as needed. That way the most expensive work would get done in a dignified,
scheduled manner.
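A toy sketch of that collect-then-flush idea (the db.increment() call at the
end is hypothetical - swap in whatever store you end up with):

    import threading
    from collections import Counter

    pending = Counter()
    lock = threading.Lock()

    def record_hit(page_id):
        # Cheap path taken on every request: bump an in-process counter.
        with lock:
            pending[page_id] += 1

    def flush(db):
        # Run once a minute (or second) from a scheduler: push the batched
        # counts into the database in one pass, then reset the buffer.
        with lock:
            batch = dict(pending)
            pending.clear()
        for page_id, count in batch.items():
            db.increment(page_id, count)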

I don't know anything about MongoDB but if it's not MySQL that's a good start.

Disclaimer: I don't know what I'm talking about.

------
lsc
Hm... does anyone use MySQL NDB Cluster, the in-memory DB? With RAM prices
what they are, that sort of thing sounds like the way to go. I mean, with
64GiB RAM servers being under $4K in parts these days, it seems like it might
also be affordable.

Of course, it's been several years since I last used MySQL NDB Cluster. It
seemed solid then, but I haven't touched it since.

~~~
drusenko
Last time I used MySQL NDB Cluster (admittedly, quite a few years ago, so
things may be different now), it was a huge fail. It was extremely
temperamental, very difficult to configure (the documentation wasn't
particularly helpful in this regard), and would often crash out of the blue.
Crashing led to a couple of very regrettable things:

a- Starting the NDB cluster was a painful and slow process, often taking 45
minutes for very, very little data (thousands of rows).

b- On a crash, data was not recoverable, usually resulting in total data
loss.

So those are my experiences using NDB in a production environment. Given
that, I would stay as far away from it as I can.

Having said that, I'm sure things have changed in the past 4-5 years. I'd
just make sure you test things out properly, for an extended period of time,
under production conditions, and that you fully understand the configuration
of the cluster and all of the little settings that impact performance and
recoverability.

~~~
lsc
On recoverability:

Sure, you can 'repair tables' on a normal MySQL setup, but you are going to
lose some rows that were cached; just like with fsck, you can make the data
sane, but not correct.

People tell me that MySQL 'repair table' is as good as promoting a slave
these days, but when a server crashes, the hardware is suspect (unless the
console log indicates something software-only, such as an OOM condition), so
restoring from a slave still sounds like the way to go.

The application I used it for back in the day was a Dspam server... it had
millions of rows (keeping track of everyone's spam at a fair-sized ASP).
Come to think of it, I never shut it down until we took it out of production.

Of course, a database server that crashes, regardless of the ease of
recovery, is of little utility. If it's still doing that, well, forget it.
That I am unable to find people who use it is quite worrisome in this regard.
The thing is, it was solid with Dspam, and with all the synthetic tests I
threw at it many years ago.

Heh. I almost want to come up with some free webapp to test this. Got any
ideas for a webapp that would be easy to write, would quickly attract users,
and would thrash the database? It doesn't need to make money.

~~~
drusenko
So the particular crash issues we were having were not your typical MySQL
crashes (which, as you correctly point out, are usually hardware-related);
they were definitely software-related, in NDB itself. Which is obviously a
huge problem, especially as the data was unrecoverable.

Before moving to NDB and after migrating from it, we had a solid single MySQL
instance on the same hardware. Oh, the joys of premature optimization :)

