Non sequitur hint: If you are storing data like this, powerhungy, consider storing the sum of the squares of the datum alongside the datum and the number of samples in each aggregate (the count is initially 1). This lets you compute the standard deviation for display, and it has the nice property that after aggregating samples further, you can still compute the standard deviation of the result.
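To make the hint concrete, here is a minimal Python sketch of such an aggregate. The class name `Aggregate` and its fields are made up for illustration; the only idea taken from the comment above is keeping the count, the sum and the sum of squares together so that merged aggregates still yield a standard deviation.

```python
import math
from dataclasses import dataclass


@dataclass
class Aggregate:
    """Hypothetical roll-up record: count, sum and sum of squares."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x: float) -> None:
        # A single raw sample is an aggregate with n == 1.
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other: "Aggregate") -> "Aggregate":
        # All three fields are plain sums, so merging is just addition.
        return Aggregate(self.n + other.n,
                         self.total + other.total,
                         self.total_sq + other.total_sq)

    def mean(self) -> float:
        return self.total / self.n

    def stddev(self) -> float:
        # Population variance = E[x^2] - (E[x])^2; clamp tiny negative
        # values that can appear from floating-point rounding.
        m = self.mean()
        return math.sqrt(max(self.total_sq / self.n - m * m, 0.0))


first, second = Aggregate(), Aggregate()
for x in (2, 4, 4, 4):
    first.add(x)
for x in (5, 5, 7, 9):
    second.add(x)
merged = first.merge(second)
print(merged.n, merged.mean(), merged.stddev())
```

One caveat: when the mean is large compared to the spread, the `E[x^2] - mean^2` form can lose precision in floating point, so treat this as a sketch rather than a numerically hardened implementation.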
I don't know of a book reference, but if you sit down for half an hour and play through the algebra of stddev you can figure out how to combine multiple standard deviations into a single stddev. It's fairly simple and quite satisfying. The sum can be recovered by multiplying the mean by the number of elements, and the sum of squares as nelems * (stddev^2 + mean^2). So, by storing the number of elements, mean of those elements and the population stddev of those elements, you can take two or more sets of numbers and compute their combined standard deviation as a simple formula based on their stddev, mean and nelems.
There's a Wikipedia entry on combining standard deviations (Standard_deviation#Combining_standard_deviations) but it's dense if you don't have a math background (I don't have one). The crux of it is, to compute the stddev of a set you need to compute the average, then the sum of squares of the deltas [ie sum((elem[i] - avg)^2 for i in 1 .. n)], then divide by n and take the square root. You don't have the individual elements any more so you can't compute the sum of squared deltas directly, but using the stored stddevs, means and nelems you can recover that information when computing the new stddev.
Well, it's a lot easier to explain with a whiteboard. You're basically subtracting out the information you don't have based on the stddev/mean-aka-avg/nelems data you do have; the sums are finite, the cross terms cancel, and it all works out exactly.
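In lieu of a whiteboard, the algebra can be sketched in a few lines of Python. The function name `combine` and the `(nelems, mean, stddev)` tuple layout are assumptions made for this example; the stddevs are population stddevs, as in the description above.

```python
import math


def combine(groups):
    """Combine (nelems, mean, population_stddev) triples into one triple."""
    groups = list(groups)
    n = sum(count for count, _, _ in groups)
    # Recover each group's sum: mean * nelems.
    total = sum(count * mean for count, mean, _ in groups)
    # Recover each group's sum of squares: nelems * (stddev^2 + mean^2).
    total_sq = sum(count * (sd * sd + mean * mean) for count, mean, sd in groups)
    mean = total / n
    # Clamp tiny negative values caused by floating-point rounding.
    variance = max(total_sq / n - mean * mean, 0.0)
    return n, mean, math.sqrt(variance)


# Summaries of [2, 4, 4, 4] and [5, 5, 7, 9], computed by hand.
a = (4, 3.5, math.sqrt(0.75))
b = (4, 6.5, math.sqrt(2.75))
n, mean, sd = combine([a, b])
print(n, mean, sd)
```

The result matches what you would get from the eight raw values directly (mean 5, population stddev 2, up to rounding), even though only the two summaries were available.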
RRDtool is pretty nice, but it has a fair number of scalability issues too:
* Once you create an RRD file (with its RRAs, the round-robin archives inside it) you can't easily modify it to add or remove metrics, or change their properties. This makes RRD files relatively inflexible.
* Updating RRAs is I/O heavy. Every time an update comes in, the OS must read, modify and write a page.
* rrdcached mitigates this somewhat by deferring flushes, but there are diminishing returns (eventually the number of incoming writes will push the cache flush and filesystem metadata update rate past the maximum IOPS available), and you risk data loss if there is a power outage or the OOM killer kills the process.
Time-series data access patterns tend to be write-heavy. Storing first in an append-only log is a big win here; Cassandra and MySQL are both good choices, though you do have to think about the schemata first. And disk is so cheap now that expiration can be an afterthought.
To handle very high throughput, storing RRD files on a ramdisk works surprisingly well, if you can afford the cost and the loss of a few seconds of data - which most of the time you can.
A simple tar + gzip is all you need to flush to disk, at the frequency of your choice. It turns out RRD write operations are safe enough to do this without corruption. And the I/O cost is minimal compared to rrdcached: RRD data compresses extremely well.
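A rough sketch of that periodic flush, using only Python's standard library. The function name `snapshot` and the paths in the usage note are hypothetical; writing to a temporary file and renaming it is an extra precaution added here so a crash mid-flush doesn't clobber the previous snapshot, not something the comment above prescribes.

```python
import os
import tarfile
import tempfile


def snapshot(src_dir, dest_path):
    """Write a gzip-compressed tar of src_dir to dest_path."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Build the archive under a temporary name on the same filesystem,
    # then rename it into place so readers never see a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    os.close(fd)
    try:
        with tarfile.open(tmp_path, "w:gz") as tar:
            tar.add(src_dir,
                    arcname=os.path.basename(os.path.normpath(src_dir)))
        os.replace(tmp_path, dest_path)
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

You would call something like `snapshot("/mnt/ramdisk/rrd", "/var/backups/rrd.tar.gz")` from cron or a timer at whatever interval matches the amount of data you can afford to lose.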
Interesting idea, but not a very efficient use of memory. Not only would you have to reserve memory for the ramdisk (assuming you even have enough to store all the files), but more precious memory would be wasted by buffer-caching the freshly-written tarred archives. You'd be sacrificing memory that would be otherwise used to service reads on frequently-accessed RRD files.
Your point is technically valid but irrelevant in practice. The compressed tar is surprisingly small and you can stream it to S3 if you really want to shave off megabytes.
But more importantly, you don't care. This will give you 3 orders of magnitude better write throughput with 2 hours of work. The savings in engineering time alone will buy you 1000x the RAM you wasted!
With that kind of achievement, my advice would be to leave over-optimization for later and buy yourself a drink :)