

Ask HN: Best datastore to store high volume time series data - haarts

Someone asked me what datastore I would use to store billions of sensor measurements (adding 50K+ per second). The data once stored won't change and a data point would be a simple timestamp and an integer.
I couldn't come up with an answer. Any ideas?
======
latch
You haven't provided enough information to answer this question. From what
you've described, practically _anything_ would work. The missing piece: how
do you need to query it? Put simply, what's the plan for the data?

Beyond that, my first thought was a sorted set in Redis. However, log(n) seems
like an expensive price to pay for inserts that will mostly be pushes onto the
tail (since scores will largely be sequential).
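To make the point concrete, here's a hedged pure-Python sketch of a Redis-style sorted set (ZADD with score = timestamp). The class and method names are mine, not Redis's API; the point is that with sequential timestamps the binary search almost always lands at the tail, so an append-only log would do the same work without the log(n) overhead:

```python
import bisect

# Sketch of a sorted set keyed by timestamp, like ZADD/ZRANGEBYSCORE.
# Names here are illustrative, not the redis-py API.
class SortedSetStore:
    def __init__(self):
        self.scores = []   # timestamps, kept sorted
        self.values = []   # sensor readings, parallel to scores

    def zadd(self, ts, value):
        # O(log n) search; with sequential timestamps, i is almost
        # always len(self.scores), i.e. a plain append.
        i = bisect.bisect_right(self.scores, ts)
        self.scores.insert(i, ts)
        self.values.insert(i, value)

    def zrangebyscore(self, lo, hi):
        i = bisect.bisect_left(self.scores, lo)
        j = bisect.bisect_right(self.scores, hi)
        return list(zip(self.scores[i:j], self.values[i:j]))

store = SortedSetStore()
for ts, v in [(1, 10), (2, 12), (3, 11), (5, 9)]:
    store.zadd(ts, v)
print(store.zrangebyscore(2, 3))  # [(2, 12), (3, 11)]
```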

~~~
haarts
The plan is to provide an overview of the trend in the data (plot it) per
sensor (there are hundreds of them). Perhaps select a minimum and maximum over
a certain time interval.
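For the plotting use case, the raw points can be bucketed server-side before anything reaches the chart. A hedged sketch (the function and data shapes are my own assumptions, not any particular datastore's API) keeping a min/max per sensor per fixed interval:

```python
from collections import defaultdict

# Bucket raw (sensor_id, timestamp, value) points into fixed intervals,
# keeping min and max per bucket -- enough to plot a trend without
# shipping billions of raw points.
def minmax_buckets(points, interval):
    buckets = defaultdict(lambda: (float("inf"), float("-inf")))
    for sensor, ts, value in points:
        key = (sensor, ts - ts % interval)   # bucket start time
        lo, hi = buckets[key]
        buckets[key] = (min(lo, value), max(hi, value))
    return dict(buckets)

points = [("s1", 0, 5), ("s1", 30, 9), ("s1", 70, 2), ("s2", 10, 4)]
print(minmax_buckets(points, 60))
# {('s1', 0): (5, 9), ('s1', 60): (2, 2), ('s2', 0): (4, 4)}
```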

------
caw
Use Round Robin Databases -- they're made for high-volume time-series data.
They're fixed in size, so they won't grow after you create them. They also
support automatic rollup, so you could save 1 datapoint per second for 10
minutes, then save the average of every 15 datapoints for the past day, etc.

Caveats:

1) You can't change the schema once you create the file. Adding new fields
means adding a new file or recreating your existing file.

2) With heavy I/O you may need to put these files on a RAM disk, and then
periodically flush them to persistent storage.
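The round-robin idea above can be sketched in a few lines of Python. This is a toy illustration, not RRDTool's actual storage format: a fixed-size ring of raw slots plus a coarser ring holding the average of every N raw points, so storage never grows however long the feed runs:

```python
# Toy round-robin archive: fixed-size rings, oldest data overwritten.
# Slot counts and rollup factor are arbitrary illustrative defaults.
class RoundRobinArchive:
    def __init__(self, raw_slots=600, rollup_every=15, rollup_slots=5760):
        self.raw = [None] * raw_slots
        self.rollup = [None] * rollup_slots
        self.raw_i = self.rollup_i = 0
        self.pending = []                   # raw points awaiting rollup
        self.rollup_every = rollup_every

    def add(self, value):
        self.raw[self.raw_i % len(self.raw)] = value   # overwrite oldest
        self.raw_i += 1
        self.pending.append(value)
        if len(self.pending) == self.rollup_every:
            avg = sum(self.pending) / self.rollup_every
            self.rollup[self.rollup_i % len(self.rollup)] = avg
            self.rollup_i += 1
            self.pending = []

# Tiny rings so the wraparound is visible:
archive = RoundRobinArchive(raw_slots=4, rollup_every=2, rollup_slots=4)
for v in [1, 3, 5, 7, 9]:
    archive.add(v)
print(archive.raw)      # [9, 3, 5, 7] -- slot 0 overwritten by the 5th point
print(archive.rollup)   # [2.0, 6.0, None, None]
```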

~~~
haarts
That sounds very interesting. I could only find RRDTool, which seems to be a
combination of an RRD and a graphing tool. I also found some vague references
to RRD in PostgreSQL. Am I missing something?

~~~
caw
RRDTool is what you're looking for. The graphing portion is because you
probably want to display the data. "rrdtool dump" will export to XML. You can
use the RRD files without using the graphing utility.
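If you go the "rrdtool dump" route, the XML can be consumed with the standard library. A hedged sketch: the element names below (rows of <v> values under the <rrd> root) match the dump format as I recall it, but check a real dump from your own files before relying on this shape:

```python
import xml.etree.ElementTree as ET

# Assumed miniature of "rrdtool dump" output; verify against real dumps.
dump = """<rrd>
  <rra>
    <database>
      <row><v>1.0</v></row>
      <row><v>2.5</v></row>
      <row><v>NaN</v></row>
    </database>
  </rra>
</rrd>"""

root = ET.fromstring(dump)
# Skip unset slots, which rrdtool records as NaN.
values = [float(v.text) for v in root.iter("v") if v.text != "NaN"]
print(values)  # [1.0, 2.5]
```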

For an example, see the Cacti project.

------
minaguib
Look into Graphite and StatsD.

