I really like seeing people's thought process as they decide to switch something major, but if I were doing this same task, I think I would have tried more on speeding up the sql solution before scrapping the db.
There are several ways you can calculate a rank, and some (e.g. self join) are inherently slow. Maybe the problem is as simple as not having the right clustered index? Or maybe sql dbs can do more optimization if you tell them to use their rank function instead of selecting out a rownum? Postgres 8.4+, sql server 2005+, and Oracle all have rank functions (one of the 'window functions' in the SQL:2003 standard http://en.wikipedia.org/wiki/Window_function_(SQL)#Window_fu... )
The critical functionality in this application seems to be online calculation, i.e. he wants to know a player's exact rank immediately after insert. A sorted set is good for that.
In general I agree though. If you relax the consistency requirement slightly you can get away with regenerating the rank table every minute or so. It takes ~500ms on my machine to generate the ranked table for 500k rows on PG9. Unlogged tables in 9.1 would probably make it faster.
as sbov mentioned in another thread, the pro-RDBMS/SQL argument is that if i only reduce my expectation (admittedly slightly), SQL is a great choice doesn't seem too convincing. It's true that approach would be slightly less complicated (only 1 store, but now you have transformation jobs).
I am intrigued. I've been thinking about this problem today, but I can't think of an efficient relational implementation in current DBs. How would you do this? I came up with the following:
1) On get_rank() do a count(*) where score < users_score;
2) Store row_number() over() [...] order by score in a lookup table and refresh every so often. Use this for get_rank() lookups.
The problem I have is that you either have to calculate the rank on read using a linear-time count() or on write which means updating n/2 rows on average with their new position.
Recalculating ranks for everyone every time the score changes seems like a difficult task no matter how you approach it. I don't think you really need that for most use cases though. Some reasonable simplifications I've seen are don't rank anyone below a certain cutoff score, only calculate rankings periodically, or only calculate rankings for a subset of the data.
But if there is a straightforward solution that allows you to not make these concessions and it can do so with great performance, I'm not sure how you can continue to claim that relational databases are the right tool for this job. Sure, it can do it, but it seems like something else does it better.
You're right, I don't have much of an argument for relational databases here.
I am concerned about the complexity of this approach though and I don't think the concessions we're doing away with are that terrible. How meaningful is it to be rank 5623 vs. rank 5249, or even rank 2519? I don't think people can make much sense of ranks beyond 100, but maybe I'm wrong. I would rather break rankings into smaller groups. Which is more interesting, being ranked 23,528 worldwide, or being the town champion? I think the latter is much more meaningful to a person.
limiting ranks (to 5000 for most games and 1000 for a few particular ones) is the concession we ended up doing in v1. When we introduced the change, no one complained (or has since) - our developers have been great.
As for complexity. The complexity of learning is a noop as far as I'm concerned. Duplicate data? Not concerned. Managing 2 stores...slightly concerning. Agreed. Maintenance? The entire change is largely encapsulated in a new ~30 line Rank class. Sure, I'd rather not have that extra code.
I hadn't considered rank-groups (thanks for the idea though!). I'll definitely give that more thought. Nonetheless, I think I rather have the raw data available and accessible (in addition to any additional abstraction). I don't like making decision for the developers. Maybe 100 ranks isn't enough for them (I've seen them do some pretty creative stuff with the bit of data we're collecting already).
Interesting. For http://greenfelt.net we have a single PostgreSQL DB and store all our games in one big table. For the score based games (like tetris) we do everything with Postgres's great expression indices to segregate things into daily, weekly, and monthly high scores ("DATE_TRUNC('YEAR',date at time zone 'UTC')"). We use Postgres's windowing functions to get the ranking numbers.
We don't have very high traffic though--we manage roughly 3000 games/hour at our peak times with our one DB server.
Maybe a funny question but why have the word device included as part of the unique key? Wouldn't 1-leto, 1-paul, 3-duke and 2-jess been ok entries for the unique key?
we don't use the actual word "device"..it's just a placeholder for the sample. Our actual unique key is a hash of the real device id (which is a UUID) + the name. This means 2 users with the same name on different devices have, or 2 users with different names on the same device, are treated differently.
There are several ways you can calculate a rank, and some (e.g. self join) are inherently slow. Maybe the problem is as simple as not having the right clustered index? Or maybe sql dbs can do more optimization if you tell them to use their rank function instead of selecting out a rownum? Postgres 8.4+, sql server 2005+, and Oracle all have rank functions (one of the 'window functions' in the SQL:2003 standard http://en.wikipedia.org/wiki/Window_function_(SQL)#Window_fu... )