
Ask HN: What DB to use for huge time series? - BWStearns
Hi HN, I wanted to know if anyone had good recommendations for a database for massive time series. I took a look at InfluxDB and Druid, both of which look promising, but they're young projects and I don't want to strand myself with a deprecated component at the core of the system I'm working on. Does anyone have any suggestions/advice/experience they can share to provide some guidance here?

Thanks in advance!
======
sparkman55
Depending on how 'huge' your timeseries are, you might be pleasantly surprised
with Postgres. Postgres scales to multiple TB just fine, and of course the
software can be easier to write since you have SQL and ORMs to rely on. It's
also an incredibly mature and stable software package, if you're worried about
future-proofing.

Some (constantly-growing) timeseries can be stored on a per-row basis, while
other (static or older) timeseries can be stored in a packed form (e.g. an
array column).

I find that most of the time, "Big Data" isn't really all that big for modern
hardware, and so going through all of the extra software work for specialized
data stores isn't really all that necessary. YMMV, of course, depending on the
nature of your queries.

~~~
ad_hominem
Some tips for querying timeseries in Postgres:
[http://no0p.github.io/postgresql/2014/05/08/timeseries-tips-pg.html](http://no0p.github.io/postgresql/2014/05/08/timeseries-tips-pg.html)

HN discussion:
[https://news.ycombinator.com/item?id=7809819](https://news.ycombinator.com/item?id=7809819)

------
metabrew
Approximately, if you have something like 10+ billion items, use Cassandra.

If you have less than 10 billion items, Postgres will be fine, and is easier
to manage IMO.

If you do use Postgres, you should partition the table by time. This will
help keep indexes smaller, improve the cache hit rate, vastly improve the
ease with which you can drop older data, and make various other admin tasks
easier.

I've done this in the past with a compound primary key of (topic_id, t) where
t was a microseconds-past-the-epoch timestamp (bigint) unique within a topic.
Then set up a parent table: CREATE TABLE events (topic_id, t, data_fields..)
and "CREATE TABLE ... () INHERITS (events)" from it into multiple subtables,
named based on the timespan they will hold, like events_2013, events_2014.

Depending on how much data you have, partition by day/month/year/etc. I
partitioned every million seconds (~11 days), since that kept the resulting
table sizes a bit more manageable (gigs not TBs).

Add a CHECK constraint to each sub-table to constrain the timespan (i.e.,
CHECK (t BETWEEN ?? AND ??)).

When you do a SELECT * FROM events WHERE topic_id = 1 AND t BETWEEN $x AND $y
ORDER BY t DESC; the query planner knows which sub-table(s) to query, and
doesn't touch the other tables at all.

You can also add a BEFORE INSERT trigger to the parent table that inserts into
the correct sub-table, otherwise get clients to compute the correct table name
when inserting.
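
A minimal sketch of that setup, with illustrative names and partition boundaries (note that child tables don't inherit the parent's indexes in Postgres, so declare the primary key on each child):

    -- Parent table: holds no rows itself, just defines the schema.
    CREATE TABLE events (
        topic_id integer NOT NULL,
        t        bigint  NOT NULL,  -- microseconds past the epoch
        payload  text
    );

    -- One child per timespan, with a CHECK constraint so the planner
    -- can exclude it from queries outside its range.
    CREATE TABLE events_2014 (
        PRIMARY KEY (topic_id, t),
        CHECK (t BETWEEN 1388534400000000 AND 1420070399999999)
    ) INHERITS (events);

    -- Route inserts on the parent into the right child.
    CREATE FUNCTION events_insert() RETURNS trigger AS $$
    BEGIN
        IF NEW.t BETWEEN 1388534400000000 AND 1420070399999999 THEN
            INSERT INTO events_2014 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'no partition for t = %', NEW.t;
        END IF;
        RETURN NULL;  -- the row has already been written to the child
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER events_insert_trg
        BEFORE INSERT ON events
        FOR EACH ROW EXECUTE PROCEDURE events_insert();

With constraint exclusion enabled (the default "partition" setting), the SELECT above only touches events_2014.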

------
ddlatham
For a good answer, you need to provide a lot more detail in the requirements:

- What do the writes look like? If they are coming in a stream, how many
writes per second do you need to support? If they are a bulk load, how large
and frequent are the batches? Simple numerical values?

- What do the reads look like? How many queries per second do you need to
support? How much data per query? How fast do the queries need to be? Will
your queries be simple aggregations? Dimensional queries? Unique dimension
value counts? Are approximations tolerated?

- How much history do you need to keep?

- What are your requirements for availability?

- What are your requirements for consistency?

- How fast does new data have to show up in reads?

Without more detail, you're going to get dozens of suggestions which may each
be right for a particular case.

~~~
BWStearns
Part of the reason the question was light on details is that this is just at
the very beginning and a lot of relevant things aren't locked in yet. Below
are the back of napkin results and are subject to the risk of being laughably
wrong.

Writes: not totally sure how the data is being packaged before being sent
yet, but it'll probably be more than 10 writes a second and less than 1000
initially(?). Not sure yet if we're aggregating and batching before sending,
or, if we are, to what degree.

Availability: if it has brief breaks where it just misses some data
(<3 seconds?), that's probably not the worst thing, but we're really trying
to avoid big gaps in the data.

Reads will likely be grabbing the last n records for a given set of sensors,
maybe with some light math on them if the query language supports it, though
there might be an easier way to cache recent history and then only go to the
big list when responding to a longer-term issue. Also, the nature of the
reads is very subject to change, since there are a bunch of use cases for
the data being kicked around and I haven't gone through what each use's
reads would look like yet.

New data needs to show up in reads in soft real time. The napkin estimate
indicates that we might be looking at about 6-80MB returned per query as a
generally large but perhaps not maximal query; bigger operations that deal
with legitimately huge amounts of data will probably be scheduled around
lighter periods or put on different machines (not sure how adding more
reading machines would impact things, since I don't know what DB it will be yet).

Ideally we'd keep as much history as humanly possible, possibly moving it to
physical archival at some point (1yr+?).

~~~
angersock
What sort of data are you collecting?

------
chollida1
KDB+ [http://kx.com/kdb-plus.php](http://kx.com/kdb-plus.php)

I have no affiliation, other than being a customer. It's as close to a
standard as you can find in finance.

There are many useful tutorials out there that let you try it out and you can
usually get an eval version to try before you buy.

[http://code.kx.com/wiki/Startingkdbplus/contents](http://code.kx.com/wiki/Startingkdbplus/contents)
If you find something that is comparable in terms of performance and features,
but cheaper, please mail me!! I would be very grateful.

~~~
bkeroack
Look at the source.

(c)
[http://code.kx.com/wsvn/code/kx/kdb%2B/c/c/k.h](http://code.kx.com/wsvn/code/kx/kdb%2B/c/c/k.h)
(c#)
[http://code.kx.com/wsvn/code/kx/kdb%2B/c/c.cs](http://code.kx.com/wsvn/code/kx/kdb%2B/c/c.cs)

This guy is truly depraved.

~~~
conover
Dear lord. It must take forever to get up to speed on a codebase like that.

~~~
geocar
Doesn't take that long. Maybe a few days?

k5 isn't that big (about 9 C files)

------
chuckcode
Not a database, but HDF5 ([http://www.hdfgroup.org](http://www.hdfgroup.org))
is used for storing all sorts of scientific data, has been around for a
while, and is very stable. PyTables is built on top of it, and lots of other
languages have existing libraries to read/write HDF5 (MATLAB, Python, C,
C++, R, Java, ...).

~~~
afiedler
I have had good experience using HDF5 to store time series data, but just
research datasets and nothing that has been put into production. I don't
really know how well it works with threading, for example. It does work very
well with PyTables and Pandas for analysis, and definitely beats CSV files,
which is the normal way these research datasets are stored.

If you are interested in using HDF5 and PyTables to store time series data,
check out this little library that I created:
[http://andyfiedler.com/projects/tstables-store-high-frequency-data-with-pytables/](http://andyfiedler.com/projects/tstables-store-high-frequency-data-with-pytables/)

------
gourneau
I have used [http://influxdb.com/](http://influxdb.com/) with a few million
records. Getting the data out is a bit slow because it goes over HTTP. Also,
make sure your InfluxDB library of choice can deal with HTTP chunking. I
found that if you request a lot of data from InfluxDB and the system does
not have enough memory, the process will silently die.

If you have mega-huge data, [http://opentsdb.net/](http://opentsdb.net/)
seems pretty decent; however, I have not tried it out.

~~~
gourneau
Clarification: InfluxDB only crashed on me when I requested a lot of data
without chunking. With chunking I didn't have any problems.

I like InfluxDB and still use it.
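
For what it's worth, a chunked read looks roughly like this against the v0.8 HTTP API (endpoint and parameter names are from memory, so verify against the docs for your version):

    import requests

    # Stream a large InfluxDB 0.8 query result instead of buffering
    # the whole response in memory. handle_chunk is a placeholder.
    resp = requests.get(
        "http://localhost:8086/db/sensors/series",
        params={
            "u": "root",
            "p": "root",
            "q": "select * from cpu_load where time > now() - 1d",
            "chunked": "true",  # ask the server to stream results
        },
        stream=True,  # let requests iterate the body lazily
    )
    for line in resp.iter_lines():
        if line:
            handle_chunk(line)  # each chunk is a JSON-encoded batch of points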

------
spudlyo
Cassandra was used at Twitter[0] to store quite a lot of time series data.

 _A typical production instance of the time series database is based on four
distinct Cassandra clusters, each responsible for a different dimension (real-
time, historical, aggregate, index) due to different performance constraints.
These clusters are amongst the largest Cassandra clusters deployed in
production today and account for over 500 million individual metric writes per
minute. Archival data is stored at a lower resolution for trending and long
term analysis, whereas higher resolution data is periodically expired._

[0]: [https://blog.twitter.com/2013/observability-at-twitter](https://blog.twitter.com/2013/observability-at-twitter)

~~~
Xorlev
Believe they've moved to Manhattan, their own custom datastore:

[https://blog.twitter.com/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale](https://blog.twitter.com/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale)

------
oppositelock
I'd recommend OpenTSDB. Using an 11-node Hadoop cluster on m1.xlarge nodes
in Amazon (2 name, 9 data), I can ingest a sustained rate of ~75,000 time
series datapoints per second into an HBase table.

The upside is that OpenTSDB scales really well with hadoop cluster size, so
you can just scale it up to handle more load.

The downsides are that its data schema and query format are optimized for
storage efficiency, not speed or flexibility. It's really easy to refine a
search for a particular metric by filtering on tags (see the sketch below),
but it's really hard to do any sort of analysis across metrics, so you have
to write your own glue on top that fetches the datapoints for the metrics
you care about and does its own aggregation.
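
For example, a tag-filtered read over the 2.x HTTP API looks roughly like this (aggregator/metric/tag syntax from memory; double-check against the OpenTSDB docs):

    GET /api/query?start=1h-ago&m=avg:sys.cpu.user{host=web01}

Cross-metric math (say, dividing one metric by another) has no such one-liner, which is where the custom glue comes in.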

~~~
linuxhansl
That's what we're doing. Uses HBase underneath, scales well.

------
sdab
It would be useful to know what "huge" means here, and how you want to look
up the data.

That said, I've used Cassandra in the past for time series data, since one
of the useful queries it supports is a range query (if the composite key is
set up correctly).

~~~
BWStearns
36-100MB/person per day, ~250 days/year, expecting ~20,000 people (an
educated stupid-wild-ass guess) initially when the system is actually put
into production. ~100-400TB per year(?). Most of the data would only be of
interest for a month or so, but we do want to preserve the data in general
in some usable fashion for testing and some research stuff.

~~~
sdab
In this case, I would still recommend Cassandra. It can easily handle the
data sizes you mention, as well as the write rates you imply further down
the thread.

Cassandra has a nice and simple architecture (every node is identical, no
ZooKeeper roles, etc.), high write performance and scalability [1], and is
fairly robust. My main piece of advice is to get the tables correctly set
up. You need to know exactly what queries you want to make and design a
table around each query (Cassandra only allows performant queries to be
made, unless you go out of your way to set a flag). Whether a query is
possible or performant depends on the key of the rows for the table, which
may be a composite key; a sketch follows below. Take a look at the Cassandra
documentation for more details.

[1] [http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html](http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html)

~~~
BWStearns
Thanks a ton. I am leaning towards a solution that involves Cassandra. What
would you say about using something on top of it like Blueflood?

~~~
sdab
I haven't used Blueflood, so I couldn't say, but it looks like an
interesting project.

------
JohnBerryman
DataDog uses Elasticsearch for their time series data store:
[http://www.elasticsearch.org/content/uploads/2013/11/es_caseStudy_datadog.pdf](http://www.elasticsearch.org/content/uploads/2013/11/es_caseStudy_datadog.pdf)

Elasticsearch might seem like a strange option at first, since it's
historically a text search engine, but its main data structure is a
compressed bit array, which is ideal for OLAP processing.

~~~
olidb2
I work at Datadog - we're only using Elasticsearch for full-text structured
events, not time series; the time series represent 10,000-100,000 times more
data in volume.

We had to build our own Time-Series streaming / storage / query so we could
handle millions of points per second and years of retention.

(we _love_ ElasticSearch, though)

------
ad_hominem
I have a timeseries problem on the backburner, and like you am hopeful for
InfluxDB but it's still missing a couple features that I need, so haven't used
it yet.

As another person mentioned, you're going to be looking at columnar
databases (few or single rows with a very large number of columns) if you
have truly large storage requirements. Since my data is still small, I'm
sticking with Postgres for now.

I've seen a couple people mention OpenTSDB; another alternative to that is
KairosDB[1], which adds Cassandra support and focuses on data purity[2]
(OpenTSDB will interpolate values if there are holes).

And to echo another person, just forget about Graphite/Whisper. It uses a
simple pre-allocated block format that will eventually cause problems when you
want to change time windows.

[1]:
[https://code.google.com/p/kairosdb/](https://code.google.com/p/kairosdb/)

[2]:
[https://code.google.com/p/kairosdb/wiki/FAQ](https://code.google.com/p/kairosdb/wiki/FAQ)

~~~
Diederich
What features are you waiting on from InfluxDB? I am a long-time Graphite
user, and I just saw InfluxDB, and it looked really good.

~~~
ad_hominem
Looks like I'm only waiting on custom functions[1] now. I used to also be
waiting on continuous queries[2], but it looks like that feature is done now.

[1]:
[https://github.com/influxdb/influxdb/issues/68](https://github.com/influxdb/influxdb/issues/68)

[2]:
[http://influxdb.com/docs/v0.8/api/continuous_queries.html](http://influxdb.com/docs/v0.8/api/continuous_queries.html)

------
tilogaat
Blueflood ([http://blueflood.io/](http://blueflood.io/)) may be what you are
looking for. It uses Cassandra under the hood. It's a project out of
Rackspace and is used in production by Rackspace's cloud monitoring.
Currently, Blueflood ingests about 2.2M metrics/min and can probably scale
to 40M metrics/min. Full disclosure: I am a dev on that project. It's being
actively developed!

~~~
shaneduan
If you are considering a software-as-a-service solution, Rackspace has just
released public APIs for Cloud Metrics, powered by Blueflood, at no
additional cost.

[http://www.rackspace.com/blog/cloud-metrics-working-toward-a-public-launch/](http://www.rackspace.com/blog/cloud-metrics-working-toward-a-public-launch/)

(Disclaimer: I am the Product Manager on that project)

~~~
tyre
This is a fantastic snapshot of an engineer and a PM commenting on the same
product.

------
yummyfajitas
Graphite is a mature system. It's a pain in the ass, but I generally find it
essential for server monitoring.

I'm working on a timeseries database aimed at replacing graphite. It's just
getting started, so it probably won't work immediately, but contributions are
welcome. Currently the write performance is already better than graphite [1].

[https://github.com/stucchio/timeserieszen](https://github.com/stucchio/timeserieszen)

[1] This was one of the design goals. Whenever graphite receives a data point
a disk seek is incurred - the data point must be appended to the timeseries
file. Timeserieszen uses a WAL - data flowing in is immediately written, and
periodically the WAL is rolled over into permanent storage.

~~~
sciurus
Cool project!

I commented on some other graphite replacement projects at
[https://news.ycombinator.com/item?id=8368689](https://news.ycombinator.com/item?id=8368689)

~~~
yummyfajitas
I didn't realize graphite was officially dead. I must say, however, that it
was the shittiest piece of software I've ever relied on and loved.

~~~
general_failure
I didn't realize this either. In fact, we just moved to graphite :-(

------
alexjarvis
Use [https://crate.io](https://crate.io). It is built on Elasticsearch, and
I've recently built something large to store time series data with it. We
actually migrated away from Cassandra and ported our application off it
because it didn't allow us any schema or indexing flexibility. Crate also
allows you to partition a table by a column (e.g. a day), which means that a
new table is created each day. Zero config, fast, and operationally easy.
Depending on your latency requirements for reads, I would also have a
serious look at Couchbase, but I don't know how well they fare for time
series data.

------
PhrosTT
Maybe check this out?

[https://github.com/soundcloud/roshi](https://github.com/soundcloud/roshi)

Roshi is basically a high-performance index for timestamped data. It's
designed to sit in the critical (request) path of your application or service.
The originating use case is the SoundCloud stream; see this blog post for
details.

~~~
nemothekid
Roshi sits on top of Redis, so this solution can be quite expensive.

------
bdkoepke
Depends on the kind of data you are storing. Hierarchical Data Format (HDF)
is a scientific data format developed by the National Center for
Supercomputing Applications. It is specifically designed to store and
organize large amounts of numeric data (including time series). It supports
flat arrays for large data sets, and also B-trees for more relational-style
data. You can also easily tag the array data.

If your format is cast in stone, you may also be able to get away with using
flat files. If you implement the List interface or something similar, it
would be very easy to integrate into your application. (Normally I wouldn't
recommend flat files for anything, but for time series they can be a decent
option, as much as that makes me cringe.)

------
toddpersen
Chiming in with a definite bias. I'm one of the co-founders of InfluxDB, and
while we're still somewhat young, we actually just hit the 1-year anniversary
of our first commit today. We're currently a team of 5 full-time developers,
dedicated to making InfluxDB the best time series database available. We've
also got some strong institutional backing, so we're not going anywhere for a
very, very long time.

If there are any questions we can answer to help you make a more informed
decision, drop us a line at support@influxdb.com or reach out to the
community:
[https://groups.google.com/d/forum/influxdb](https://groups.google.com/d/forum/influxdb)

------
exabrial
MySQL, Postgres, etc. all scale 'just fine' to terabyte-sized databases, and
the tooling and reporting ecosystem for these databases is unmatched by any
NoSQL solution. What really matters is the type of queries you want to
run... and whether or not you need to automagically degrade data resolution
over time. InfluxDB, OpenTSDB, and competitors provide that automatically,
but powerful tools like Sequel Pro are missing from that space (though you
gain things like Grafana).

If in doubt, start with a traditional RDBMS. Only after you profile your
application and see exactly where your pain points are should you do
something crazy.

Have fun!

------
rdtsc
Depending on what you are doing, you could even try writing it yourself. It
would be a good exercise.

Here is a toy hand crafted time series storage design:

Say you are storing tuples of {<timestamp>, <datablob>} and querying by
timestamp.

The writer stores them in two files, opened in append-only mode. One is the
data file, the other is the index file. The data file might look like:

<datablob1><datablob2>...

The index file stores timestamps and offsets into the data file where the
blobs are:

<timestamp1><offset1><timestamp2><offset2>...

If you need rolling fall-off, create a new pair of files every day (or hour,
week, month) and delete old ones as you go.

If you can ensure that time synchronization is set up and timestamps are in
increasing order (this might be hard), you can do binary searching. If you
use rolling fall-off, you can also discard whole file periods based on the
query range when you search.

All this would go into a directory. The reader and writer could be different
processes. Your timestamp and offset fields should be fixed-width. The
writer first appends to the data file and then writes the index entry. The
reader knows how to find the last valid record by looking at the size of the
file.
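
A compact Python sketch of that design (toy code under the same assumptions: single writer, timestamps appended in increasing order, no crash recovery beyond the write ordering above):

    import os
    import struct

    IDX = struct.Struct('<QQ')  # fixed-width (timestamp_us, data_offset)
    LEN = struct.Struct('<I')   # length prefix for each data blob

    class ToyStore:
        def __init__(self, path):
            # Append-only; reads reposition with seek(), writes go to the end.
            self.data = open(path + '.dat', 'ab+')
            self.idx = open(path + '.idx', 'ab+')

        def append(self, ts, blob):
            # Data first, index second: a torn write leaves at worst an
            # orphaned blob, never an index entry pointing at garbage.
            self.data.seek(0, os.SEEK_END)
            offset = self.data.tell()
            self.data.write(LEN.pack(len(blob)) + blob)
            self.data.flush()
            self.idx.write(IDX.pack(ts, offset))
            self.idx.flush()

        def _nrecs(self):
            # The index file size tells the reader how many records are valid.
            return os.fstat(self.idx.fileno()).st_size // IDX.size

        def _entry(self, i):
            self.idx.seek(i * IDX.size)
            return IDX.unpack(self.idx.read(IDX.size))

        def query(self, t0, t1):
            # Binary search for the first entry with ts >= t0, then scan.
            lo, hi = 0, self._nrecs()
            while lo < hi:
                mid = (lo + hi) // 2
                if self._entry(mid)[0] < t0:
                    lo = mid + 1
                else:
                    hi = mid
            results = []
            for i in range(lo, self._nrecs()):
                ts, offset = self._entry(i)
                if ts > t1:
                    break
                self.data.seek(offset)
                (n,) = LEN.unpack(self.data.read(LEN.size))
                results.append((ts, self.data.read(n)))
            return results

Rolling fall-off would then just be one ToyStore per period, with old pairs of files deleted as they age out.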

~~~
ibebrett
I wouldn't recommend trying to do this yourself. Of course you can make
something that kind of works, but making a resilient, production-ready
database that is fault-tolerant and scales is a lot harder than writing to a
file.

~~~
rdtsc
Well it was just a toy example I came up with in a couple of minutes.

But sometimes, depending on the requirements, a file is enough. If you
intimately know and control the bytes that get written, it is easier to
understand and reason about your system (that means optimizing it, scaling
it, making it fault-tolerant).

Also, one way to make a resilient and fault-tolerant database is to have
less code running. Sometimes base libc and Unix offer a good and stable
foundation on which it is easy to build. If you open the file in read-only
or append-only mode, you can rely on certain behavior.

People in the past have bought into marketing crap and got stuff like MongoDB
which would throw data over the fence and pray that it would be synced
eventually (by default!). But heck it was WebScale(tm).

~~~
ibebrett
That is why you seriously audit your tools, and why many in the industry
avoid Mongo like the plague. Controlling the bytes that get written to a
file is actually not simple at all; it's a huge research problem in file
systems and databases. I'm just saying, I don't think writing your own DB is
ever a very good idea, unless it is SO simple that you would barely call it
a DB.

------
agoodno
Good talk at Strangeloop this year about exactly this.
[https://thestrangeloop.com/sessions/time-series-data-with-apache-cassandra](https://thestrangeloop.com/sessions/time-series-data-with-apache-cassandra)

The talk is posted here.
[https://www.youtube.com/watch?v=ovMo5pIMj8M](https://www.youtube.com/watch?v=ovMo5pIMj8M)

------
paulrr
Kx ([http://kx.com](http://kx.com)) has been around forever and has a good rep
for this sort of thing.

------
capkutay
I think Elasticsearch is a good solution for analytic workloads (including
time series data). Query speeds are significantly faster than most DBs
because it uses the vector space model (which also introduces the
possibility of false positives). I wouldn't recommend it (yet) as a primary
data store; it is, however, really useful for analytics.

------
TheAceOfHearts
If you're up for considering a cloud service, you might want to check out
Treasure Data ([http://treasuredata.com/](http://treasuredata.com/)).

The free plan allows 10M records per month with a maximum capacity of 150M.

Full disclosure: I work there.

~~~
hbs
Does treasure data have a dedicated storage engine for time series? This kind
of data has specific needs which are not met by general purpose storage
layers.

~~~
TheAceOfHearts
To an extent, yes. We wrote our time-partitioned columnar storage from
scratch: it has row-based storage for more recent data and column-based
storage for historical data, and data is merged from row-based to
column-based periodically for performance. We realized from day one that
much of "big data" is log/timestamped data, so our query execution engines
are optimized for time-windowed queries.

------
Xorlev
First, ask if you really need "massive" scale. Is this an idea, or a
well-defined product? I'd imagine if you knew what you were building, you
wouldn't be here asking.

So, "massive": why not prototype on Postgres and then migrate when you
actually have projections on size?

Different orders of magnitude change the technology you work with, as does
the latency with which you need to access the metrics (real-time vs.
report-based).

Cassandra is a pretty solid choice, Influx is really new to the game but is
promising.

Druid is trusted by a lot of people, Metamarkets (the author) among them, but
may or may not be what you need.

I'd spend some time talking to the people in #druid-dev on Freenode, they're
friendly and can help guide you.

~~~
j_s
See also this talk by two Metamarkets devs:
[https://www.youtube.com/watch?v=Hpd3f_MLdXo](https://www.youtube.com/watch?v=Hpd3f_MLdXo)

If accuracy doesn't have to be 100%, a number of options open up.

------
x0n
So many people are suggesting relational databases or just plain "big data"
solutions. Time series databases tend to have quite distinctive features,
like interpolation of data (i.e. you can query for a value at a specific
date and time, and you will get an interpolated value if there is no actual
sample at that point).

Anyway, no one has mentioned RRDtool yet:
[http://oss.oetiker.ch/rrdtool/](http://oss.oetiker.ch/rrdtool/)

"RRDtool is the OpenSource industry standard, high performance data logging
and graphing system for time series data. RRDtool can be easily integrated in
shell scripts, perl, python, ruby, lua or tcl applications."

~~~
misframer
The title has "huge time series" in it. How well does RRDTool scale?

~~~
x0n
That's a very open question, but RRDtool offers many modes of operation for
data consolidation. From the RRDtool documentation:

---

Data Acquisition

When monitoring the state of a system, it is convenient to have the data
available at a constant time interval. Unfortunately, you may not always be
able to fetch data at exactly the time you want to. Therefore RRDtool lets you
update the log file at any time you want. It will automatically interpolate
the value of the data-source (DS) at the latest official time-slot (interval)
and write this interpolated value to the log. The original value you have
supplied is stored as well and is also taken into account when interpolating
the next log entry.

Consolidation

You may log data at a 1 minute interval, but you might also be interested to
know the development of the data over the last year. You could do this by
simply storing the data in 1 minute intervals for the whole year. While this
would take considerable disk space it would also take a lot of time to analyze
the data when you wanted to create a graph covering the whole year. RRDtool
offers a solution to this problem through its data consolidation feature. When
setting up a Round Robin Database (RRD), you can define at which interval
this consolidation should occur, and what consolidation function (CF)
(average, minimum, maximum, last) should be used to build the consolidated
values (see rrdcreate). You can define any number of different consolidation
setups within one RRD. They will all be maintained on the fly when new data is
loaded into the RRD.

Round Robin Archives

Data values of the same consolidation setup are stored into Round Robin
Archives (RRA). This is a very efficient manner to store data for a certain
amount of time, while using a known and constant amount of storage space.

It works like this: If you want to store 1000 values in 5 minute interval,
RRDtool will allocate space for 1000 data values and a header area. In the
header it will store a pointer telling which slot (value) in the storage area
was last written to. New values are written to the Round Robin Archive in, you
guessed it, a round robin manner. This automatically limits the history to the
last 1000 values (in our example). Because you can define several RRAs within
a single RRD, you can setup another one, for storing 750 data values at a 2
hour interval, for example, and thus keep a log for the last two months at a
lower resolution.

The use of RRAs guarantees that the RRD does not grow over time and that old
data is automatically eliminated. By using the consolidation feature, you can
still keep data for a very long time, while gradually reducing the resolution
of the data along the time axis.

Using different consolidation functions (CF) allows you to store exactly the
type of information that actually interests you: the maximum one minute
traffic on the LAN, the minimum temperature of your wine cellar, ... etc.
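
The example in the quoted text maps onto a create command along these lines (a sketch; DS is name:type:heartbeat:min:max, RRA is function:xff:steps:rows; see rrdcreate for the authoritative syntax):

    rrdtool create example.rrd --step 300 \
        DS:value:GAUGE:600:U:U \
        RRA:AVERAGE:0.5:1:1000 \
        RRA:AVERAGE:0.5:24:750

The first RRA keeps the last 1000 raw 5-minute values; the second averages every 24 steps (2 hours) into one of 750 rows, giving the lower-resolution two-month log described above.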

------
a-l
I suppose you need a column-oriented database:
[http://en.wikipedia.org/wiki/Column-oriented_DBMS](http://en.wikipedia.org/wiki/Column-oriented_DBMS).
I've used Sybase for a huge telecom statistics database.

------
qboxio
Use the ELK stack: Elasticsearch, Logstash, Kibana. Logstash is for ETL and
data normalization, Kibana is for building cool visualizations, and
Elasticsearch is for storing, processing, analysis, scaling, and search.

Here are some resources:

Webinar: The ELK Stack in a DevOps Environment
[http://www.elasticsearch.org/webinars/elk-stack-devops-environment/](http://www.elasticsearch.org/webinars/elk-stack-devops-environment/)

Webinar: An Introduction to the ELK Stack
[http://www.elasticsearch.org/webinars/introduction-elk-stack/](http://www.elasticsearch.org/webinars/introduction-elk-stack/)

------
yang_guo
kdb+/q is commonly used in the financial world for these types of problems.
They have a free 32-bit version, and you can ask them about pricing on the
64-bit version.

[http://kx.com/](http://kx.com/)

~~~
sgdread
Yep, banks use it heavily to store tick data (currencies, instruments) and
do calculations of VWAP, etc. It's pretty much the standard in the industry
for this kind of application.

------
krallja
Until recently, TempoIQ used Apache HBase to store time series data.
[http://blog.tempoiq.com/why-tempoiq-moved-off-hbase](http://blog.tempoiq.com/why-tempoiq-moved-off-hbase)

------
halayli
At Webmon, we use Postgres with a binary field to store Protobuf messages.
The Protobuf message allows us to store histograms, original values, etc.
Sharding is done at the application level.

------
virtuabhi
To get a relevant recommendation, you'll have to describe at least two
things: the data and the queries. How is the data generated? What is stored
in it? What queries do you plan to run?

~~~
BWStearns
The data is going to be sensor readings. Just numeric readouts over time.
There will probably be about 8-16 physical sensor points per person per
reading. I'll want to retrieve slices of time rather than individual records
and likely produce some averages/basic algebra over those slices in order to
produce more meaningful data for the rest of the system which is pretty
vanilla in terms of data requirements.

~~~
imperialWicket
I made a data store supporting weather/water sensor data in PostgreSQL that
relied heavily on table partitions for performance [1]. We had it on pretty
weak machines, replicated with Bucardo, and never had any issues. It worked
well up to several million records/month (not sure where it is now).

[1] [https://github.com/imperialwicket/postgresql-time-series-table-partitions](https://github.com/imperialwicket/postgresql-time-series-table-partitions)

------
fourk
[http://blueflood.io/](http://blueflood.io/) is an option. It's built on top
of Cassandra and has experimental support for use as a backend for
graphite-web. There are several engineers still actively working on it who
are generally happy to help with any issues raised via IRC or the mailing
list. Unsure what you mean by 'massive', but I've used it to store billions
of data points per day successfully.

Disclaimer: I'm a former core contributor to blueflood.

------
bfrog
Postgres is fantastic for most things. People think "big data" is somehow
too big, but the big data I've seen fits in a few terabytes, which Postgres
handles just fine.

------
didip
The big idea in storing time series data is to partition the data by
timestamp (daily/hourly/minutely, depending on how granular you want it).
This technique can be applied in various data stores:

* I've done it in PostgreSQL using triggers and table inheritance. With this technique trimming old data is as simple as dropping old tables.

* Logstash folks use daily indices on ElasticSearch to store log data which is time series by nature.

* I have heard from quite a few people that Cassandra works really well with this data model too.

------
caw
My company is currently using Mongo, and while it works, I wouldn't
recommend it. We're looking at Cassandra and Elasticsearch, which seem a lot
more promising.

~~~
arenaninja
The number of horror stories I've seen about mongo is up to around 10 this
month alone.

I'm now glad I never made the jump... in the meantime, pgsql is still on my
list

~~~
caw
I wouldn't say it's a horror story; it's just not really for time series
"big data". The backend guys have had to muck about with the data a lot to
get good performance out of it. There were some optimizations we missed on
the sysadmin side too, like sharding the cluster after it got to ~250GB, and
now it's many times that. Our Mongo clusters have been running in production
for well over a year.

------
bbromhead
Definitely try Cassandra and if you don't want to run it yourself try
[https://www.instaclustr.com/](https://www.instaclustr.com/)

------
ankushio
Have you considered opentsdb or graphite? I love graphite because of the nice
frontend interface and functionality it provides for visualizing and
transforming your metrics.

~~~
sciurus
Development of Graphite is effectively dead. The datastore component (carbon
and whisper) has design issues, and the official replacement (ceres) hasn't
seen any commits this year. There are some alternatives, though.

For data storage, Cyanite [0] speaks the Graphite protocol and stores the
data in Cassandra. Alternatively, InfluxDB [1] speaks the Graphite protocol
and stores the data in itself.

To get the data back out, there's graphite-api [2] which can be hooked up to
cyanite [3] or influxdb [4]. You can then connect any graphite dashboard you
like, such as grafana [5], to it.

[0] [https://github.com/pyr/cyanite](https://github.com/pyr/cyanite)
[1] [http://influxdb.com](http://influxdb.com)
[2] [https://github.com/brutasse/graphite-api](https://github.com/brutasse/graphite-api)
[3] [https://github.com/brutasse/graphite-cyanite](https://github.com/brutasse/graphite-cyanite)
[4] [https://github.com/vimeo/graphite-influxdb](https://github.com/vimeo/graphite-influxdb)
[5] [http://grafana.org](http://grafana.org)

~~~
lobster_johnson
A slightly off-topic question, since you seem to know what you're talking
about: what are people using these days for collecting and displaying
devops-level metrics, if it's not Graphite? Are your links relevant here?

Last I looked at Graphite, I balked at the data store design (very I/O
heavy) and the awful front ends (very limited graphing and reporting
capabilities). But I haven't discovered a good alternative that has
traction. Diamond seems like the thing to use for collecting metrics
(instead of collectd), though.

Edit: Grafana looks good, actually.

------
temp_queries
We're an established team with a stealth product that we're releasing soon. If
you'd like to participate in an early trial, send us an email. We're also
happy to talk to anyone with time series needs or related needs like analytics
on big data. Maybe we can build you something custom or cut you a deal. Drop
us a line at bigdata.queries@gmail.com

------
final
RavenDB - they recently switched to a new engine and did some time series
related work. Send them an email and you'll probably get a few free
licenses. It's a commercial product that sells very well, so the risk of
deprecation is minimal.

I have personally seen millions of records saved per minute on a top-end
SSD server.

------
burtonator
Check out KairosDB... it is based on Cassandra and is very similar to
OpenTSDB, but IMO Cassandra is a bit easier to scale and maintain, with
fewer parts.

We're using it in production... it's still early, but there are about 1-2
dozen moderate-sized installs (like 10-box installs).

We're pretty happy with it so far.

------
nlavezzo
At FoundationDB we recently did a blog about using FDB for time series data:

[http://blog.foundationdb.com/designing-a-schema-for-time-series-data-using-fdb](http://blog.foundationdb.com/designing-a-schema-for-time-series-data-using-fdb)

One of our largest customer installations is for this purpose.

------
dotmanish
Take a look at Amazon Redshift (I don't know if you have a higher time budget
or a higher dollar budget for what you're building, but Redshift might turn
out to be pretty cost-effective when you add in system upkeep as well). It
scales well.

------
loganfrederick
There's a company in Chicago called TempoIQ (formerly TempoDB) that is working
on a time series database.

[https://www.tempoiq.com/](https://www.tempoiq.com/)

I'm not affiliated with them, I just met them once.

------
mikhailfranco
Stonebraker again ... ?

SciDB: [http://www.scidb.org/](http://www.scidb.org/)

Paradigm4: [http://www.paradigm4.com/](http://www.paradigm4.com/)

... anyone have experience of using SciDB?

------
ajmm
If you would like to perform similarity queries on your time series, please
try simMachines.com (I am the founder).

We offer cyclic time warp search, time warp search, and any other metric or
pseudo-metric you can come up with :-)

------
hbs
What kind of project is this? Is it a side project, or one that will give
birth to a company? Requirements are rather different depending on the
importance you give to your (or your customers') data.

~~~
BWStearns
It's for work. If it were a side project, I would probably have just grabbed
InfluxDB and run with it, since it looks the most fun; but since this is a
core part of the whole system, the risk of project abandonment weighs
heavier.

------
herman5
Have a look at TempoDB - built specifically for timeseries data
([https://tempo-db.com/about/](https://tempo-db.com/about/))

~~~
hbs
TempoDB has renamed itself TempoIQ and no longer offers its storage service.
I've heard some angry comments from customers who recently received an email
telling them the storage service they were using was to be shut down at the
end of October!

~~~
samdjohnson
I work at TempoIQ, and we do still offer our storage service. We've launched
a new product (as TempoIQ) that is hosted in a private environment and
offers storage, historical analysis, and real-time monitoring.

As for the customers on TempoDB, we are working with them to transition to
TempoIQ if the switch makes sense, or offering to guide them in a transition
to another time series database like InfluxDB.

------
notduncansmith
If you're cool with a simple K/V storage format (K: timestamp, V: data), Riak
might work for you. Easy to scale, reliable, and wicked fast.

------
dalacv
If this is for a Manufacturing environment, OSI Soft's PI Historian may be a
good fit. Really depends on what exactly your requirements are.

------
mooneater
What strikes me about the comments is the very diverse range of products
being recommended. There's nothing even close to consensus in this space.

------
ratnakar007
I have looked at InfluxDB, and it is pretty cool. Combined with Grafana, you
have a complete solution out of the box.

------
siliconviking
First question: do you even need a database right now? Have you, for
example, considered using CSV files and simply loading them into Pandas or R
on demand?

I am currently working on a project analyzing massive amounts of options
data and have found this approach to be both easy and flexible to work
with... and as my project matures I may move select parts of it into a
database.
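
If you go that route, the on-demand load is only a few lines of Pandas (file and column names here are hypothetical):

    import pandas as pd

    # One CSV per underlying, loaded only when that stock is analyzed.
    df = pd.read_csv("options/AAPL.csv",
                     parse_dates=["quote_date"], index_col="quote_date")
    daily_mid = df["mid_price"].resample("D").mean()  # daily average mid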

~~~
new_test
>loading those files into Pandas or R

What is "massive" for you? I was under impression you can't use R or pandas
for anything that doesn't fit into memory.

~~~
siliconviking
As for massive: something like daily options data for 3000 stocks, spanning
a number of years, with information down to the tranche level (let's say 60
million rows if stored in relational database fashion). In my case the
analysis can be done at the stock level, though, which means that only a
3000th of the dataset needs to be loaded into memory at any time.

------
nartz
Redshift or Vertica are your best options and are built for massive queries
over large data.

------
whalesalad
I have not used it but will be experimenting with influxdb soon for storing
timeseries data.

------
marianoguerra
Have a look at [https://dalmatiner.io/](https://dalmatiner.io/). I think
it's used here: [https://project-fifo.net/](https://project-fifo.net/)

------
maslam
Redshift. It scales superbly.

~~~
scottlocklin
While KDB+ is infinitely better for TS data, hiring people who know what
they're doing, and buying the physical hardware you need to make it fly isn't
what most modern firms are interested in. If you want to run on the EC2,
Redshift is super great. Time/date range queries though, holy shit those suck.

------
vishalchandra
Cassandra works best, e.g. as the backend for email storage.

------
cmollis
Postgres (or Redshift, which is... Postgres).

Oracle too. I just did something relatively small with it (<10MM rows), but
it's pretty solid.

------
hellomichibye
have a look at [http://www.TimeSeries.guru](http://www.TimeSeries.guru)

------
pmalynin
If it's only numbers, then Graphite with Grafana for visualization.
Otherwise MongoDB can be good as well.

------
mikeklaas
Redshift

------
mvitorino
BigQuery

