
Beyond NoSQL: Using RRD to store temporal data - rellik
http://www.plainlystated.com/2011/07/beyond-nosql-using-rrd-to-store-temporal-data/
======
jws
Non sequitur hint: If you are storing data like his powerhungy, consider
storing the sum of the squares of the data as well as the sum and the number
of samples in each aggregate (the count might initially be 1). This lets you
compute the standard deviation for display, but it also has the nice property
that after aggregating samples further, you can still compute the standard
deviation of the combined aggregate.
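
A minimal sketch of that idea in Python, for illustration only. The `Aggregate` class and its field names are hypothetical, not anything from the article or from RRDtool; it keeps the count, the sum and the sum of squares, and derives the population standard deviation from them.

```python
import math
from dataclasses import dataclass


@dataclass
class Aggregate:
    """Hypothetical aggregate: count, sum and sum of squares of the samples."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x: float) -> None:
        # A single sample is just an aggregate with n == 1.
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other: "Aggregate") -> "Aggregate":
        # Merging is plain addition of the three stored values.
        return Aggregate(self.n + other.n,
                         self.total + other.total,
                         self.total_sq + other.total_sq)

    def mean(self) -> float:
        return self.total / self.n

    def stddev(self) -> float:
        # Population stddev: E[x^2] - E[x]^2, clamped against rounding error.
        variance = self.total_sq / self.n - self.mean() ** 2
        return math.sqrt(max(variance, 0.0))


def aggregate_of(samples) -> Aggregate:
    agg = Aggregate()
    for x in samples:
        agg.add(x)
    return agg


first_hour = aggregate_of([2, 4, 4, 4])
second_hour = aggregate_of([5, 5, 7, 9])
combined = first_hour.merge(second_hour)
print(combined.mean(), combined.stddev())
```

One caveat: when the mean is large compared to the spread, subtracting the two terms loses floating-point precision, so this form suits modest value ranges best.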

~~~
ghotli
Cool. Would you be so kind as to throw a book suggestion my way as to where I
could learn such tricks?

~~~
hcles
I don't know of a book reference, but if you sit down for half an hour and
play through the algebra of stddev you can figure out how to combine multiple
standard deviations into a single stddev. It's fairly simple and quite
satisfying. The sum can be recovered by multiplying the mean by the number of
elements, and the sum of squares as nelems * (stddev^2 + mean^2). So, by
storing the number of elements, the mean of those elements and the population
stddev of those elements, you can take two or more sets of numbers and compute
their combined standard deviation with a simple formula based on their stddev,
mean and nelems.

There's a Wikipedia entry on combining standard deviations
(Standard_deviation#Combining_standard_deviations) but it's dense if you don't
have a math background (I don't have one). The crux of it is, to compute the
stddev of a set you need to compute the average, then the sum of squares of
the deltas [i.e. sum((elem[i] - avg)^2) for i in 1 .. n]. You don't have the
individual elements any more so you can't compute the sum of squared deltas
directly, but using the stored stddevs, means and nelems you can recover that
information when computing the new stddev.

Well, it's a lot easier to explain with a whiteboard. You're basically
subtracting out the information you don't have based on the
stddev/mean-aka-avg/nelems data you do have; you're subtracting out sums and
it all works out perfectly.

numsum1 = SUM(nelems[i] * (stddev[i]^2 + mean[i]^2))

numsum2 = (SUM(nelems[i] * mean[i]))^2 / SUM(nelems[i])

combined_stddev = SQRT((numsum1 - numsum2) / SUM(nelems[i]))
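
For anyone who wants to check the formulas, here is a rough Python transcription. The function name `combine_stddev` and the tuple layout are my own choices for illustration; the stored stddevs are assumed to be population (not sample) standard deviations, as stated above.

```python
import math


def combine_stddev(groups):
    """Combine (nelems, mean, stddev) triples into one population stddev."""
    groups = list(groups)
    total_n = sum(n for n, _, _ in groups)
    # numsum1: total sum of squares, rebuilt from each group's stddev and mean.
    numsum1 = sum(n * (sd ** 2 + mean ** 2) for n, mean, sd in groups)
    # numsum2: (total sum)^2 / total count, i.e. total_n * combined_mean^2.
    numsum2 = sum(n * mean for n, mean, _ in groups) ** 2 / total_n
    # Clamp at zero so rounding error can never produce a negative variance.
    return math.sqrt(max(numsum1 - numsum2, 0.0) / total_n)


# Two groups summarising [2, 4, 4, 4] and [5, 5, 7, 9].
groups = [
    (4, 3.5, math.sqrt(0.75)),
    (4, 6.5, math.sqrt(2.75)),
]
print(combine_stddev(groups))
```

The result should match the population stddev of all eight values taken together, which is 2 here (up to floating-point rounding).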

~~~
ghotli
I find this endlessly interesting. Thanks for the overview.

------
otterley
RRDtool is pretty nice, but it has a fair number of scalability issues too:

* Once you create an RRD (the database file that holds the round-robin archives, or RRAs) you can't modify it to add or remove metrics, or change their properties. This makes them relatively inflexible.

* Updating RRD files is I/O heavy. Every time an update comes in, the OS must read, modify and write a page.

* rrdcached mitigates this somewhat by deferring flushes, but there are diminishing returns to this (eventually the number of writes coming in will cause the cache flush and filesystem metadata update rate to exceed the maximum IOPS available), and you risk data loss in the event of a power outage or if the OOM killer kills the process.

Time-series data access patterns tend to be write-heavy. Storing first in an
append-only log is a big win here; Cassandra and MySQL are both good choices,
though you do have to think about the schemata first. And disk is so cheap now
that expiration can be an afterthought.

~~~
shykes
To handle very high throughput, storing RRD files on a ramdisk works
surprisingly well, if you can afford the cost and the loss of a few seconds of
data - which most of the time you can.

A simple tar + gzip is all you need to flush to disk, at the frequency of your
choice. It turns out rrd write operations are safe enough to do this without
corruption. And the IO cost is minimal compared to rrdcached: RRD data
compresses extremely well.
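
A rough sketch of that snapshot step in Python, using only the standard library. The function name `snapshot_rrd_dir` and the example paths are hypothetical; whether copying live RRD files is safe is the claim made above and is not verified here. The archive is written to a temporary file first and then renamed, so a crash mid-write never clobbers the previous snapshot.

```python
import os
import tarfile
import tempfile


def snapshot_rrd_dir(ramdisk_dir: str, dest_path: str) -> str:
    """Write a gzip-compressed tar of ramdisk_dir to dest_path atomically."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Temp file lives next to the destination so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".tar.gz.tmp", dir=dest_dir)
    os.close(fd)
    try:
        with tarfile.open(tmp_path, "w:gz") as tar:
            arcname = os.path.basename(os.path.normpath(ramdisk_dir))
            tar.add(ramdisk_dir, arcname=arcname)
        # Atomic rename: readers see either the old snapshot or the new one.
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return dest_path


# Example call with made-up paths; run it from cron or a timer loop:
# snapshot_rrd_dir("/mnt/ramdisk/rrd", "/var/backups/rrd-snapshot.tar.gz")
```

Restoring after a reboot is the reverse: extract the latest archive back onto the ramdisk before the collectors start writing.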

~~~
otterley
Interesting idea, but not a very efficient use of memory. Not only would you
have to reserve memory for the ramdisk (assuming you even have enough to store
all the files), but more precious memory would be wasted by buffer-caching the
freshly-written tarred archives. You'd be sacrificing memory that would be
otherwise used to service reads on frequently-accessed RRD files.

~~~
shykes
Your point is technically valid but irrelevant in practice. The compressed tar
is surprisingly small and you can stream it to S3 if you really want to shave
off megabytes.

But more importantly, you don't care. This will give you 3 orders of magnitude
better write throughput with _2 hours of work_. The savings in engineering
time alone will buy you 1000x the ram you wasted!

With that kind of achievement, my advice would be to leave over-optimization
for later and buy yourself a drink :)

~~~
otterley
It's not irrelevant in practice, at least not for me. I deal in enormous
amounts of data that couldn't possibly fit into RAM.

------
thehammer
Site appears to be temporally unavailable.

~~~
pasbesoin
[http://webcache.googleusercontent.com/search?q=cache:http%3A...](http://webcache.googleusercontent.com/search?q=cache:http%3A%2F%2Fwww.plainlystated.com%2F2011%2F07%2Fbeyond-nosql-using-rrd-to-store-temporal-data%2F)

~~~
rellik
Text-only version comes up for me:
[http://webcache.googleusercontent.com/search?q=cache:http://...](http://webcache.googleusercontent.com/search?q=cache:http://www.plainlystated.com/2011/07/beyond-nosql-using-rrd-to-store-temporal-data/&hl=en&strip=1)

------
nwmcsween
You could do the same thing with MongoDB and 'capped collections', although
aging the data like RRD does would require MongoDB to have a callback for when
the capped collection is full.

------
sciurus
That was one of the clearest explanations of the strengths of RRDtool that
I've read. You can spend a lot of time massaging a more general database to
store time series data, or you can use RRDtool.

------
shubber
Pity there's no mention that RRDtool has been around for decades and is pretty
much stable. It's worth remembering that old tools aren't necessarily obsolete.

