Ask HN: What DB to use for huge time series?
126 points by BWStearns on Sept 25, 2014 | 130 comments
Hi HN, I wanted to know if anyone had good recommendations for a database for massive time series. I took a look at InfluxDB and Druid, both of which look promising, but they're young projects and I don't want to strand myself with a deprecated component at the core of the system I'm working on. Does anyone have any suggestions/advice/experience they can share to provide some guidance here?

thanks in advance!




Depending on how 'huge' your timeseries are, you might be pleasantly surprised with Postgres. Postgres scales to multiple TB just fine, and of course the software can be easier to write since you have SQL and ORMs to rely on. It's also an incredibly mature and stable software package, if you're worried about future-proofing.

Some (constantly-growing) timeseries can be stored on a per-row basis, while other (static or older) timeseries can be stored in a packed form (e.g. an array column).
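To make the packed form concrete, here's a minimal sketch of both layouts (hypothetical table and column names, assuming psycopg2 and a local Postgres):

    import psycopg2

    conn = psycopg2.connect("dbname=metrics")  # hypothetical DSN
    cur = conn.cursor()

    # Hot, constantly-growing series: one row per sample.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS samples_hot (
            series_id integer     NOT NULL,
            ts        timestamptz NOT NULL,
            value     double precision,
            PRIMARY KEY (series_id, ts)
        )""")

    # Older, static series: one row per series per day, values packed into arrays.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS samples_packed (
            series_id integer NOT NULL,
            day       date    NOT NULL,
            ts        timestamptz[]      NOT NULL,
            value     double precision[] NOT NULL,
            PRIMARY KEY (series_id, day)
        )""")
    conn.commit()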

I find that most of the time, "Big Data" isn't really all that big for modern hardware, and so going through all of the extra software work for specialized data stores isn't really all that necessary. YMMV, of course, depending on the nature of your queries.



>I find that most of the time, "Big Data" isn't really all that big for modern hardware, and so going through all of the extra software work for specialized data stores isn't really all that necessary. YMMV, of course, depending on the nature of your queries.

I totally agree. Most of the useful "big data" is time-series data, and it isn't all that huge compared to images/videos/etc.

That being said, I think the reason to adopt something like Hadoop/MPP engines is not for storage but ease of querying: while Postgres can handle storing terabytes of data, joining two terabyte-scale tables can get a little iffy. This gets even more complex if you start packing data into array columns for space efficiency.

There is an argument to be made that historical/archival data aren't all that useful and thus do not need to be analyzed: that was definitely my assumption coming from finance. However, I've been surprised how far back some of our customers at Treasure Data go to mine insights from data.


Approximately, if you have something like 10+ billion items, use Cassandra.

If you have less than 10 billion items, Postgres will be fine, and is easier to manage IMO.

If you do use Postgres, you should partition the table by time. This will help keep indexes smaller, improve the cache hit rate, vastly improve the ease with which you can drop older data, and make various other admin tasks easier.

I've done this in the past with a compound primary key of (topic_id, t), where t was a microseconds-past-the-epoch timestamp (bigint) unique within a topic. Then set up a parent table: CREATE TABLE events (topic_id, t, data_fields..) and "CREATE TABLE .. INHERITS (events)" from it into multiple subtables, named based on the timespan they will hold, like events_2013, events_2014.

Depending on how much data you have, partition by day/month/year/etc. I partitioned every million seconds (~11 days), since that kept the resulting table sizes a bit more manageable (gigs, not TBs).

Add a CHECK constraint to each sub-table to constrain the timespan (i.e., CHECK (t BETWEEN ?? AND ??)).

When you do a SELECT * FROM events WHERE topic_id = 1 AND t BETWEEN $x AND $y ORDER BY t DESC; the query planner knows which sub-table(s) to query, and doesn't touch the other tables at all.

You can also add a BEFORE INSERT trigger to the parent table that inserts into the correct sub-table, otherwise get clients to compute the correct table name when inserting.
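Put together, a sketch of that setup might look like the following (made-up names, one yearly partition shown; assumes psycopg2 and Postgres table inheritance):

    import psycopg2

    conn = psycopg2.connect("dbname=events")  # hypothetical DSN
    cur = conn.cursor()

    # Parent table defines the schema; rows land in the children.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            topic_id integer NOT NULL,
            t        bigint  NOT NULL,  -- microseconds past the epoch
            value    double precision   -- stand-in for the real data_fields
        )""")

    # One child per timespan; the CHECK constraint lets the planner skip
    # partitions whose range can't match the query.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events_2014 (
            CHECK (t >= 1388534400000000 AND t < 1420070400000000)
        ) INHERITS (events)""")
    cur.execute("CREATE INDEX events_2014_topic_t ON events_2014 (topic_id, t)")

    # Optional: a BEFORE INSERT trigger on the parent that routes rows.
    cur.execute("""
        CREATE OR REPLACE FUNCTION events_route() RETURNS trigger AS $$
        BEGIN
            IF NEW.t >= 1388534400000000 AND NEW.t < 1420070400000000 THEN
                INSERT INTO events_2014 VALUES (NEW.*);
            ELSE
                RAISE EXCEPTION 'no partition for t = %', NEW.t;
            END IF;
            RETURN NULL;  -- the row is stored in the child, not the parent
        END $$ LANGUAGE plpgsql""")
    cur.execute("""
        CREATE TRIGGER events_route_trigger
            BEFORE INSERT ON events
            FOR EACH ROW EXECUTE PROCEDURE events_route()""")
    conn.commit()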


For a good answer, you need to provide a lot more detail in the requirements:

- What do the writes look like? If they are coming in a stream how many writes per second do you need to support? If they are a bulk load how large and frequent are the batches? Simple numerical values?

- What do the reads look like? How many queries per second do you need to support? How much data per query? How fast do the queries need to be? Will your queries be simple aggregations? Dimensional queries? Unique dimension value counts? Are approximations tolerated?

- How much history do you need to keep?

- What are your requirements for availability?

- What are your requirements for consistency?

- How fast does new data have to show up in reads?

Without more detail, you're going to get dozens of suggestions which may each be right for a particular case.


Part of the reason the question was light on details is that this is just at the very beginning and a lot of relevant things aren't locked in yet. Below are the back-of-napkin numbers, which are subject to the risk of being laughably wrong.

Writes: not totally sure how the data is being packaged before being sent yet, but it'll probably be more than 10 writes a second and less than 1000 initially(?). Not sure yet whether we're aggregating and batching before sending, or to what degree if we are.

Availability: If it has brief breaks where it just misses some data (<3seconds?) probably not the worst thing, but really trying to avoid big gaps in the data.

Reads will likely be grabbing the last n records for a given set of sensors, maybe with some light math on top if the query language supports it, though there might be an easier way to cache recent history and only go to the big store when responding to a longer-term issue. The nature of reads is also very subject to change, since there are a bunch of use cases for the data being kicked around and I haven't gone through what each one's reads would look like yet.

New data needs to show up in reads in soft real-time. The napkin estimate indicates we might be looking at about 6-80MB returned per query as a generally large but perhaps not maximal query; bigger operations that deal with legitimately huge amounts of data will probably be scheduled around lighter periods or put on different machines (not sure how adding more reading machines would impact things since I don't know which DB it will be yet).

Ideally we'd keep as much history as humanly possible, possibly moving it to physical archival at some point (1yr+?).


What sort of data are you collecting?


All these questions should ideally not be a concern when you are looking for a database. A general purpose database which can handle all the above and more is AmisaDB. http://www.amisalabs.com/


KDB+ http://kx.com/kdb-plus.php

I have no affiliation, other than being a customer. It's as close to a standard as you can find in finance.

There are many useful tutorials out there that let you try it out and you can usually get an eval version to try before you buy.

http://code.kx.com/wiki/Startingkdbplus/contents If you find something that is comparable in terms of performance and features, but cheaper, please mail me!! I would be very grateful.



Remember that KDB is based on K, which stems from APL, which relies on symbols rather than words for its functions.

Coming from that background, C and especially C# must seem extremely verbose.

For example (from Wikipedia):

In K, finding the prime numbers from 1 to R is done with [0]:

    (!R)@&{&/x!/:2_!x}'!R
And APL[1]:

    (~R∊R∘.×R)/R←1↓ιR

It's truly awesome stuff.

[0]: http://en.wikipedia.org/wiki/K_(programming_language) [1]: http://en.wikipedia.org/wiki/APL_(programming_language)


Hah... Almost looks like the programmer started out trying to create an entry for an obfuscated C competition but then pivoted midway when s/he realized it could be commercialized.

Especially loved this comment:

  // remove more clutter
  #define O printf
  #define R return
  #define Z static


My favorite part is the comment "remove more clutter." Instead it should have been, "Remove all hope of maintenance."


That is special. I had a fun time parsing that. I mean that genuinely.


Dear lord. It must take forever to get up to speed on a codebase like that.


Doesn't take that long. Maybe a few days?

k5 isn't that big (about 9 C files)


Jd (the J database) isn't as powerful as KDB+, but it's pretty good and only costs if you want to pay for support.


If you want a quick video guide to getting started, there's: http://www.timestored.com/kdb-guides/getting-started-kdb

Even the java driver is written similar to the C code :) http://code.kx.com/wsvn/code/kx/kdb%2B/c/jdbc.java


I'd also recommend KDB+

I find my code very clear and readable in it.


What is the price range of the full version?


Not a database, but HDF5 (http://www.hdfgroup.org) is used for storing all sorts of scientific data, has been around for a while, and is very stable. PyTables is built on top of it, and lots of other languages have existing libraries to read/write HDF5 (MATLAB, Python, C, C++, R, Java, ...).


I have had good experience using HDF5 to store time series data, but just research datasets and nothing that has been put into production. I don't really know how well it works with threading, for example. It does work very well with PyTables and Pandas for analysis and definitely beats CSV files, which is the normal way these research datasets are stored.
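As a rough illustration of that workflow (a sketch assuming pandas with PyTables installed; file and key names are made up):

    import numpy as np
    import pandas as pd

    # Fake a minute-resolution series for one sensor.
    idx = pd.date_range("2014-01-01", periods=100000, freq="T")
    df = pd.DataFrame({"value": np.random.randn(len(idx))}, index=idx)

    # Append to an HDF5 file via PyTables; format="table" makes it queryable.
    with pd.HDFStore("sensors.h5", complib="blosc", complevel=9) as store:
        store.append("sensor_1", df, format="table")

        # Pull back just a time slice without loading the whole series.
        recent = store.select("sensor_1", where="index >= '2014-02-01'")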

If you are interested in using HDF5 and PyTables to store time series data, check out this little library that I created: http://andyfiedler.com/projects/tstables-store-high-frequenc...


I have used http://influxdb.com/ with a few million records. Getting the data out is a bit slow because it goes over HTTP. Also make sure your InfluxDB library of choice can deal with HTTP chunking. I found that if you request a lot of data from InfluxDB and the system does not have enough memory, the process will silently die.

If you have mega huge data http://opentsdb.net/ seems pretty decent, however I have not tried it out.


Clarification: InfluxDB only crashed on me when I requested a lot of data without chunking. With chunking I didn't have any problems.

I like InfluxDB and still use it.


Cassandra was used at Twitter[0] to store quite a lot of time series data.

A typical production instance of the time series database is based on four distinct Cassandra clusters, each responsible for a different dimension (real-time, historical, aggregate, index) due to different performance constraints. These clusters are amongst the largest Cassandra clusters deployed in production today and account for over 500 million individual metric writes per minute. Archival data is stored at a lower resolution for trending and long term analysis, whereas higher resolution data is periodically expired.

[0]: https://blog.twitter.com/2013/observability-at-twitter


Believe they've moved to Manhattan, their own custom datastore:

https://blog.twitter.com/2014/manhattan-our-real-time-multi-...


I'd recommend OpenTSDB. Using an 11-node Hadoop cluster of m1.xlarge instances in Amazon (2 name, 9 data), I can ingest a sustained rate of ~75,000 time series datapoints per second into an HBase table.

The upside is that OpenTSDB scales really well with hadoop cluster size, so you can just scale it up to handle more load.

The downsides are that their data schema and query format are optimized for data efficiency, not speed or flexibility. It's really easy to refine a search for a particular metric by filtering on tags, but it's really hard to do any sort of analysis across metrics, so you have to write your own glue on top of that which fetches the datapoints for the metrics you care about, and does its own aggregation.


That's what we're doing. Uses HBase underneath, scales well.


This, or HBase + Phoenix: http://phoenix.apache.org/


It would be useful to know what "huge" means here. And how you want to look up the data.

That said, I've used Cassandra in the past for time series data, as one of the useful queries that can be made is a range query (if the composite key is set up correctly).


36-100MB/person per day ~250 days/year expecting ~20,000 (an educated stupid wild ass guess) initially when the system is actually put into production. ~100-400TB per year(?). Most of the data would only be of interest for a month or so, but we do want to preserve the data in general in some usable fashion for testing and some research stuff.


In this case, I would still recommend Cassandra. It can easily handle the data sizes you mention as well as the write rates you imply further down the thread.

Cassandra has a nice and simple architecture (every node is identical, no zookeeper roles etc), high write performance and scalability [1], and is fairly robust. My main piece of advice is to get the tables correctly set up. You need to know exactly what queries you want to make and design a table around that query (Cassandra only allows performant queries to be made, unless you go out of your way to set a flag). Whether a query is possible or performant depends on the key of the rows for the table, which may be a composite key. Take a look at the cassandra documentation for more details.

1. http://techblog.netflix.com/2011/11/benchmarking-cassandra-s...
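To make "design the table around the query" concrete, a minimal sketch with the Python driver (hypothetical keyspace/table names; the day bucket just keeps partitions bounded):

    from datetime import datetime
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("metrics")  # hypothetical keyspace

    # Partition by (sensor, day); cluster by time so "latest points for a
    # sensor" is a single ordered slice of one partition.
    session.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            sensor_id int,
            day       text,
            ts        timestamp,
            value     double,
            PRIMARY KEY ((sensor_id, day), ts)
        ) WITH CLUSTERING ORDER BY (ts DESC)""")

    # The only queries this table answers efficiently look like this one.
    rows = session.execute(
        "SELECT ts, value FROM readings "
        "WHERE sensor_id = %s AND day = %s AND ts >= %s LIMIT 100",
        (42, "2014-09-25", datetime(2014, 9, 25)))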


Thanks a ton. I am leaning towards a solution that involves Cassandra. What would you say about using something on top of it like Blueflood?


I haven't used Blueflood, so I couldn't say, but it looks like an interesting project.


You might look into partitioning. Oracle and SQL Server both support that type of operation. Additionally, being able to find support when things get "too big to handle" can be easier on a mature technology with lots of users.

On a side note, you can hook a Hadoop cluster up to SQL Server if you're into that kind of thing for storage.


When it comes to time series, reasoning in terms of byte size does not really make sense; it's better to state how many datapoints you need to handle and how many distinct time series they are distributed across.


8-16ish datapoints per sample and they'll be distributed more or less evenly during the day and then pretty much go dead at night. There may or may not be a value for every data point at every sample.


There's good news and bad news. Good news is storing this much data isn't hard; there's plenty of people who've done it and many systems will scale enough.

Bad news is picking a system means understanding access patterns -- reading, not writing. Do you only need to look within a single user? That's much easier. If you have to query across users, or do things like (and I have no idea what your problem domain is, but if it's utility usage, things like average usage by zip or block; if it's wearables, activity by city, etc), stuff gets much harder. How granular do you need to be able to query, and how far back? What is the sla on a query: are results calculated in batch mode or on demand for a website? You often have to duplicate data in order to optimize one set for throughput access and the other set for minimal random query time. Can you get away with logarithmic granularity for queries, ie every sample is available for 1 month, every 3rd for the next month, every 10th for a couple months after that, etc. What windowing functions do you need to run, and how frequently do they need to be updated? What is the ratio of writes to reads? If you have to access random data quickly, eg for a site, can you calculate > 1 day back in batch mode, cache those results, and add the last 24h of data at runtime? etc etc etc.

You need to have some conversations with the data consumers.

Edit: and I've assumed these data are read-only; if you can update them, then there's far more difficulty.


There should be no updates but there is a possibility that records can be added out of order. I've seen that this is a problem for some systems and not for others.


My guess would be you would want Cassandra, specifically to incur less overhead for empty values. I haven't built finance backtesting/monitoring infrastructure - which sounds exactly like what you're building - but in this case, I think you'll get real value from triggers, even if that's only being supported experimentally right now.


What will the sampling frequency be? How many samples per sampling interval?


DataDog uses elasticsearch for their timeseries data store: http://www.elasticsearch.org/content/uploads/2013/11/es_case...

Elasticsearch might seem like a strange option at first since it's historically a text search engine, but its main data structure is a compressed bit array, which is ideal for OLAP processing.


I work at Datadog - we're only using ElasticSearch for full-text structured events, not time-series, which represent 10,000 - 100,000 times more data in volume.

We had to build our own Time-Series streaming / storage / query so we could handle millions of points per second and years of retention.

(we love ElasticSearch, though)


I have a timeseries problem on the backburner, and like you am hopeful for InfluxDB but it's still missing a couple features that I need, so haven't used it yet.

As another person mentioned, you're going to be looking at columnar databases (few/one rows, with a very large number of columns) if you have truly large storage requirements. Since my data is still small, I'm sticking with Postgres for now.

I've seen a couple people mention OpenTSDB; another alternative to that is KairosDB[1], which adds Cassandra support and focuses on data purity[2] (OpenTSDB will interpolate values if there are holes).

And to echo another person, just forget about Graphite/Whisper. It uses a simple pre-allocated block format that will eventually cause problems when you want to change time windows.

[1]: https://code.google.com/p/kairosdb/

[2]: https://code.google.com/p/kairosdb/wiki/FAQ


What features are you waiting on from InfluxDB? I am a long-time graphite user, and I just saw InfluxDB, and it looked really good.


Looks like I'm only waiting on custom functions[1] now. I used to also be waiting on continuous queries[2] but looks like that feature is done now.

[1]: https://github.com/influxdb/influxdb/issues/68

[2]: http://influxdb.com/docs/v0.8/api/continuous_queries.html


Blueflood(http://blueflood.io/) may be what you are looking for. It uses Cassandra under the hood. It's a project out of Rackspace and is being used in prod by Rackspace's cloud monitoring. Currently, Blueflood ingests about 2.2M metrics/min and can probably scale to 40M metrics/min. Full disclosure - I am a dev on that project. It's being actively developed!


If you are considering a software-as-a-service solution, Rackspace has just released public APIs for Cloud Metrics, powered by Blueflood, at no additional cost.

http://www.rackspace.com/blog/cloud-metrics-working-toward-a...

(Disclaimer: I am the Product Manager on that project)


This is a fantastic snapshot of an engineer and PM commenting on the same product


Graphite is a mature system. It's a pain in the ass, but I generally find it essential for server monitoring.

I'm working on a timeseries database aimed at replacing graphite. It's just getting started, so it probably won't work immediately, but contributions are welcome. Currently the write performance is already better than graphite [1].

https://github.com/stucchio/timeserieszen

[1] This was one of the design goals. Whenever graphite receives a data point a disk seek is incurred - the data point must be appended to the timeseries file. Timeserieszen uses a WAL - data flowing in is immediately written, and periodically the WAL is rolled over into permanent storage.
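The write path being described is roughly this (a toy Python sketch of the idea, not timeserieszen's actual code):

    import os
    import time
    from collections import defaultdict

    class TinyWAL:
        """Append every point to one log; periodically roll into per-series files."""

        def __init__(self, wal_path="wal.log", archive_dir="archive"):
            self.wal_path = wal_path
            self.archive_dir = archive_dir
            self.wal = open(wal_path, "a", buffering=1)  # line-buffered append

        def write(self, series, value):
            # One sequential append per point, no per-series seek.
            self.wal.write("%f %s %f\n" % (time.time(), series, value))

        def roll_over(self):
            # Group the WAL by series and append each group to permanent storage.
            self.wal.close()
            os.makedirs(self.archive_dir, exist_ok=True)
            by_series = defaultdict(list)
            with open(self.wal_path) as f:
                for line in f:
                    ts, series, value = line.split()
                    by_series[series].append("%s %s\n" % (ts, value))
            for series, lines in by_series.items():
                with open(os.path.join(self.archive_dir, series), "a") as out:
                    out.writelines(lines)
            self.wal = open(self.wal_path, "w", buffering=1)  # truncate, reuse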


Cool project!

I commented on some other graphite replacement projects at https://news.ycombinator.com/item?id=8368689


I didn't realize graphite was officially dead. I must say, however, that it was the shittiest piece of software I've ever relied on and loved.


I didn't realize this either. In fact, we just moved to graphite :-(


Use https://crate.io It is built on Elasticsearch and I've recently built something large to store time series data with it. We actually migrated away from Cassandra and ported our application off it because it didn't allow us any schema or indexing flexibility. Crate also allows you to partition a table by a column (e.g. a day), which means that a new table is created each day. Zero config, fast, and operationally easy. Depending on your latency requirements for reads I would also have a serious look at Couchbase, but I don't know how well they fare for time-series data.


Maybe check this out?

https://github.com/soundcloud/roshi

Roshi is basically a high-performance index for timestamped data. It's designed to sit in the critical (request) path of your application or service. The originating use case is the SoundCloud stream; see this blog post for details.


Roshi sits on top of Redis, so this solution can be quite expensive.


Depends on the kind of data you are storing. Hierarchical Data Format is a scientific data format developed by the National Center for Supercomputing Applications. It is specifically designed to store and organize large amounts of numeric data (including time series). It supports flat arrays for large data sets, but also supports B-trees for more relational-style data. You can also easily tag the array data.

If your format is cast in stone you may also be able to get away with using flat files. If you implement the List interface or something similar, it would be very easy to integrate into your application. (Normally I wouldn't recommend flat files for anything, but for time series they're not a bad option, as much as that makes me cringe.)


Chiming in with a definite bias. I'm one of the co-founders of InfluxDB, and while we're still somewhat young, we actually just hit the 1-year anniversary of our first commit today. We're currently a team of 5 full-time developers, dedicated to making InfluxDB the best time series database available. We've also got some strong institutional backing, so we're not going anywhere for a very, very long time.

If there are any questions we can answer to help you make a more informed decision, drop us a line at support@influxdb.com or reach out to the community: https://groups.google.com/d/forum/influxdb


MySQL, Postgres, etc. all scale 'just fine' to terabyte-sized databases, and the tooling and reporting tools for these databases are unmatched by any NoSQL solution. What really matters is the type of queries you want to run... and whether or not you need to automagically degrade resolution over time. InfluxDB, OpenTSDB, and competitors provide that automatically, but powerful tools like Sequel Pro are missing from that space (though you gain things like Grafana).

If in doubt, start with a traditional RDBMS. And ONLY after you profile your application and see exactly where your pain points are, do something crazy.

Have fun!


Depending on what you are doing you can even try to write it yourself. It would be a good exercise.

Here is a toy hand crafted time series storage design:

Say you are storing tuples of {<timestamp>,<datablob>} and querying them by timestamp.

The writer can store the data in two files, opened in append-only mode: one is the data file, the other is the index file. The data file might look like:

<datablob1><datablob2>...

And an index file that stores timestamps and offsets into the data file where the blobs are:

<timestamp1><offset1><timestamp2><offset2>...

If you need rolling fall-off, create new pairs of files every day (or hour, week, month), and delete old ones as you go.

Then, if you can ensure that you have time synchronization set up and timestamps are in increasing order (this might be hard), you can do binary searching. If you use rolling fall-offs, you can discard whole file periods based on the query range when you search.

All this would go into a directory. Reader and writer could be different processes. Your timestamp and offset sizes should be fixed length. Writer first appends to the data file and then writes the index. Reader knows how to find the last valid record by looking at the size of the file.
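A quick Python sketch of that layout (fixed-width index records so the reader can binary-search; files opened in binary mode; purely illustrative):

    import struct

    # Index record: 8-byte microsecond timestamp + 8-byte offset into the data file.
    IDX = struct.Struct("<qq")

    def append(data_f, idx_f, timestamp_us, blob):
        offset = data_f.tell()
        data_f.write(blob)                           # data file first...
        idx_f.write(IDX.pack(timestamp_us, offset))  # ...then the index record
        data_f.flush()
        idx_f.flush()

    def lookup(idx_f, data_f, t_us):
        """Return (timestamp, blob) for the first record at or after t_us."""
        idx_f.seek(0, 2)
        n = idx_f.tell() // IDX.size   # a trailing partial record is ignored
        lo, hi = 0, n
        while lo < hi:                 # binary search over fixed-size records
            mid = (lo + hi) // 2
            idx_f.seek(mid * IDX.size)
            ts, _ = IDX.unpack(idx_f.read(IDX.size))
            if ts < t_us:
                lo = mid + 1
            else:
                hi = mid
        if lo == n:
            return None
        idx_f.seek(lo * IDX.size)
        ts, offset = IDX.unpack(idx_f.read(IDX.size))
        if lo + 1 < n:                 # blob ends where the next one starts
            _, end = IDX.unpack(idx_f.read(IDX.size))
        else:
            data_f.seek(0, 2)
            end = data_f.tell()
        data_f.seek(offset)
        return ts, data_f.read(end - offset)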


I wouldn't recommend trying to do this yourself. Of course you can make something that kind of works, but making a resilient production ready database that is fault tolerant and scales is a lot harder than writing to a file.


Well it was just a toy example I came up with in a couple of minutes.

But sometimes, depending on the requirements, a file is enough. If you intimately know and control the bytes that get written, it is easier to understand and reason about your systems (and that means optimizing them, scaling them, making them fault tolerant).

Also, one way to make a resilient and fault-tolerant database is to have less code running. Sometimes base libc and Unix offer a good and stable foundation on which it is easy to build. If you only ever open the file in read or append mode, you can rely on certain behavior.

People in the past have bought into marketing crap and got stuff like MongoDB which would throw data over the fence and pray that it would be synced eventually (by default!). But heck it was WebScale(tm).


That is why you seriously audit your tools, and why many in the industry avoid Mongo like the plague. Controlling the bytes that get written to a file is actually not simple at all, and it's a huge research problem for the file systems of databases. I'm just saying, I don't think writing your own DB is ever a very good idea, unless it is SO simple that you would barely call it a DB.


Good talk at Strangeloop this year about exactly this. https://thestrangeloop.com/sessions/time-series-data-with-ap...

The talk is posted here. https://www.youtube.com/watch?v=ovMo5pIMj8M


Kx (http://kx.com) has been around forever and has a good rep for this sort of thing.


I think Elasticsearch is a good solution for analytic workloads (including time-series data). Query speeds are significantly faster than most DBs because it uses the vector space model (which also introduces the possibility of false positives). I wouldn't recommend it (yet) as a primary data store. It is however really useful for analytics.


If you're up for considering a cloud service, you might want to check out Treasure Data (http://treasuredata.com/).

The free plan allows 10M records per month with a maximum capacity of 150M.

Full disclosure: I work there.


Does treasure data have a dedicated storage engine for time series? This kind of data has specific needs which are not met by general purpose storage layers.


To an extent, yes. We wrote our time-partitioned columnar storage from scratch: it has row-based storage for more recent data and column-based storage for historical data, and the data is periodically merged from row-based to column-based for performance. We realized from day one that much of "big data" is log/timestamped data, so our query execution engines are optimized for time-windowed queries.


First, ask if you really need "massive" scale. Is this an idea, or a well-defined product? I'd imagine if you knew what you were building, you wouldn't be here asking.

So "massive" -- why not prototype on Postgres, and then migrate when you actually have projections on size.

Different orders of magnitude change the technology you work with. Consider also the latency with which you need to access the metrics (real-time vs. report-based).

Cassandra is a pretty solid choice, Influx is really new to the game but is promising.

Druid is trusted by a lot of people, Metamarkets (the author) among them, but may or may not be what you need.

I'd spend some time talking to the people in #druid-dev on Freenode, they're friendly and can help guide you.


See also this talk by two Metamarkets devs: https://www.youtube.com/watch?v=Hpd3f_MLdXo

If accuracy doesn't have to be 100%, a number of options open up.


So many people suggesting relational databases or just plain "big data" solutions. Time series databases tend to have quite unique features like interpolation of data (i.e. you can query a specific datapoint at a specific date and time for a value, and you will get an interpolated value if there is no specific sample for that data point.)
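For instance, the interpolation being described amounts to something like this (a numpy sketch; real time series databases do it server-side):

    import numpy as np

    # Irregular sample times (seconds) and values for one series.
    sample_t = np.array([0.0, 10.0, 25.0, 60.0])
    sample_v = np.array([1.0, 2.0, 4.0, 3.0])

    # Query a timestamp with no stored sample: get a linearly interpolated value.
    print(np.interp(17.0, sample_t, sample_v))  # ~2.93, between the neighbors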

Anyway, no one has mentioned RRD tool yet: http://oss.oetiker.ch/rrdtool/

"RRDtool is the OpenSource industry standard, high performance data logging and graphing system for time series data. RRDtool can be easily integrated in shell scripts, perl, python, ruby, lua or tcl applications."


With RRDtool, the older your data is, the less of it you have. You might be logging at a 1s period, and you get a nice graph for the last year, but if you want to look back at a 1 minute period from a year ago, your 60 samples are gone and have been aggregated.

Historians like PI etc. will 'compress' time series by only storing data points where the data has changed by some threshold. If you look back 5 years, all the resolution is still there.


The title has "huge time series" in it. How well does RRDTool scale?


That's a very open question, but RRD tool offers many modes of operation for data consolidation:

---

Data Acquisition

When monitoring the state of a system, it is convenient to have the data available at a constant time interval. Unfortunately, you may not always be able to fetch data at exactly the time you want to. Therefore RRDtool lets you update the log file at any time you want. It will automatically interpolate the value of the data-source (DS) at the latest official time-slot (interval) and write this interpolated value to the log. The original value you have supplied is stored as well and is also taken into account when interpolating the next log entry.

Consolidation

You may log data at a 1 minute interval, but you might also be interested to know the development of the data over the last year. You could do this by simply storing the data in 1 minute intervals for the whole year. While this would take considerable disk space it would also take a lot of time to analyze the data when you wanted to create a graph covering the whole year. RRDtool offers a solution to this problem through its data consolidation feature. When setting up a Round Robin Database (RRD), you can define at which interval this consolidation should occur, and what consolidation function (CF) (average, minimum, maximum, last) should be used to build the consolidated values (see rrdcreate). You can define any number of different consolidation setups within one RRD. They will all be maintained on the fly when new data is loaded into the RRD.

Round Robin Archives

Data values of the same consolidation setup are stored into Round Robin Archives (RRA). This is a very efficient manner to store data for a certain amount of time, while using a known and constant amount of storage space.

It works like this: If you want to store 1000 values in 5 minute interval, RRDtool will allocate space for 1000 data values and a header area. In the header it will store a pointer telling which slots (value) in the storage area was last written to. New values are written to the Round Robin Archive in, you guessed it, a round robin manner. This automatically limits the history to the last 1000 values (in our example). Because you can define several RRAs within a single RRD, you can setup another one, for storing 750 data values at a 2 hour interval, for example, and thus keep a log for the last two months at a lower resolution.

The use of RRAs guarantees that the RRD does not grow over time and that old data is automatically eliminated. By using the consolidation feature, you can still keep data for a very long time, while gradually reducing the resolution of the data along the time axis.

Using different consolidation functions (CF) allows you to store exactly the type of information that actually interests you: the maximum one minute traffic on the LAN, the minimum temperature of your wine cellar, ... etc.


I suppose you need a column-oriented database: http://en.wikipedia.org/wiki/Column-oriented_DBMS. I've used Sybase for a huge telecom-statistics database.


Use the ELK Stack - Elasticsearch, Logstash, Kibana. Logstash is for ETL and data normalization. Kibana is for building cool visualizations. Elasticsearch for storing, processing, analysis, scaling and search.

Here are some resources:

Webinar: the Elk Stack in a Devops Environment http://www.elasticsearch.org/webinars/elk-stack-devops-envir...

Webinar: An Introduction to the ELK Stack http://www.elasticsearch.org/webinars/introduction-elk-stack...


kdb+/q is commonly used in the financial world for these types of problems. They have a free 32-bit version, and you can ask them about pricing on the 64-bit version.

http://kx.com/


Yep, banks heavily use these to store tickers (currencies, instruments) and do calculations of VWAP, etc. Pretty much standard in the industry for these kinds of applications.


Is this just your typical columnar OLAP database, like Vertica or Big Query?


Until recently, TempoIQ used Apache HBase to store time series data. http://blog.tempoiq.com/why-tempoiq-moved-off-hbase


At Webmon, we use Postgres with a binary field to store Protobuf messages. The protobuf message allows me to store histograms/original values/etc. Sharding is done at the application level.


To get a relevant recommendation, you'll have to describe two things at least - data and queries. How is the data generated? What is stored in the data? What are the queries you plan to run?


The data is going to be sensor readings. Just numeric readouts over time. There will probably be about 8-16 physical sensor points per person per reading. I'll want to retrieve slices of time rather than individual records and likely produce some averages/basic algebra over those slices in order to produce more meaningful data for the rest of the system which is pretty vanilla in terms of data requirements.
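That read pattern (time slices plus some light math) maps naturally onto something like pandas, e.g. (a sketch; the file and column names are made up):

    import pandas as pd

    # readings.csv: timestamp, sensor_id, value  (hypothetical file)
    df = pd.read_csv("readings.csv", parse_dates=["timestamp"],
                     index_col="timestamp")
    one_sensor = df[df["sensor_id"] == 7]

    # "Last n records" style slice with some light math over it.
    last_hour = one_sensor.last("1H")
    print(last_hour["value"].mean(), last_hour["value"].std())

    # Or downsample a longer window into 5-minute averages.
    print(one_sensor["value"].resample("5T").mean().tail())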


I made a data store supporting weather/water sensor data in PostgreSQL that relied heavily on table partitions for performance [1]. We had it on pretty weak machines, replicated with Bucardo, and never had any issues. It worked well up to several million records/month (not sure where it is now).

[1] https://github.com/imperialwicket/postgresql-time-series-tab...


http://blueflood.io/ is an option. It's built on top of cassandra and has experimental support for use as a backend for graphite-web. There are several engineers still actively working on it who are generally happy to help with any issues raised via irc or the mailing list. Unsure what you mean by 'massive', but I've used it to store billions of data points per day successfully.

Disclaimer: I'm a former core contributor to blueflood.


Postgres is fantastic for most things. I think people assume "big data" is somehow too big for it, but the big data I've seen fits in a few terabytes, which Postgres handles just fine.


The big idea in storing time series data is to partition the data by timestamp (daily/hourly/minutely, depending on how granular you want it). This technique can be applied in various data stores:

* I've done it in PostgreSQL using triggers and table inheritance. With this technique trimming old data is as simple as dropping old tables.

* Logstash folks use daily indices on ElasticSearch to store log data which is time series by nature.

* I have heard from quite a few people that Cassandra works really well with this data model too.


My company is currently using Mongo, and while it works, I wouldn't recommend it. We're looking at Cassandra and Elasticsearch, which seems to be a lot more promising.


Beware of using Elasticsearch as a primary DB. Kyle Kingsbury has shown that it loses an awful lot of data during a partition, despite their claims. Example: http://aphyr.com/posts/317-call-me-maybe-elasticsearch


I think with $70m+ in funding, ES has the resources to fix their split brain issues.


I've only had about a year and a half of working in that space, but can confirm that Cassandra's quite good for truly huge time-series data. If you want to record an event every time someone makes a purchase at a Wal-Mart, turn to Cassandra or a similar system.

It is also much more highly regarded as a primary data store than Elasticsearch.


The number of horror stories I've seen about mongo is up to around 10 this month alone.

I'm now glad I never made the jump... in the meantime, pgsql is still on my list


I wouldn't say it's a horror story, it's just not really for time series "big data". The backend guys have had to muck about with the data a lot to get good performance out of it. There's some optimizations we missed on the sysadmin side too, like sharding the cluster after it got to ~250GB, and now it's many times that. Our Mongo clusters have been running production for well over a year.


We had a 50GB instance initially and it was no problem; then our app started to get a lot more traffic in a short period of time and we had to start sharding on a reasonably large scale. Mongo is a lot harder to maintain when it's big and unpredictable. I know this sounds like a plug, but alas it's the truth: we found ObjectRocket and now we don't worry about Mongo.


Definitely try Cassandra and if you don't want to run it yourself try https://www.instaclustr.com/


Have you considered opentsdb or graphite? I love graphite because of the nice frontend interface and functionality it provides for visualizing and transforming your metrics.


Development of graphite is effectively dead. The datastore component (carbon and whisper) has design issues, and the official replacement (ceres) hasn't seen any commits this year. There are some alternatives, though.

For data storage, Cyanite [0] speaks the graphite protocol and stores the data in Cassandra. Alternately, InfluxDB [1] speaks the graphite protocol and stores the data in itself.

To get the data back out, there's graphite-api [2] which can be hooked up to cyanite [3] or influxdb [4]. You can then connect any graphite dashboard you like, such as grafana [5], to it.

[0] https://github.com/pyr/cyanite [1] http://influxdb.com [2] https://github.com/brutasse/graphite-api [3] https://github.com/brutasse/graphite-cyanite [4] https://github.com/vimeo/graphite-influxdb [5] http://grafana.org


A slightly off-topic question, since you seem to know what you're talking about: What are people using these days for collecting and displaying devops-level metrics, if it's not Graphite? Are your links relevant here?

Last I looked at Graphite I balked at the data store design (very I/O heavy) and the awful front ends (very limited graphing and reporting capabilities). But I haven't discovered a good alternative that has traction. Diamond seems like the thing to use for collecting metrics (instead of collectd), though.

Edit: Grafana looks good, actually.


We're an established team with a stealth product that we're releasing soon. If you'd like to participate in an early trial, send us an email. We're also happy to talk to anyone with time series needs or related needs like analytics on big data. Maybe we can build you something custom or cut you a deal. Drop us a line at bigdata.queries@gmail.com


RavenDB - they recently switched to a new storage engine and did some time series related work. Send them an email and you'll probably get a few free licenses. It's a commercial product that sells very well, so the risk of deprecation is minimal.

I have personally seen millions of records saved per minute on a top end SSD server.


Check out KairosDB... it is based on Cassandra and is very similar to OpenTSDB but IMO Cassandra is a bit easier to scale and maintain with fewer parts.

We're using it in production... it's still early but there are about 1-2 dozen moderate sized installs (like 10 box installs).

We're pretty happy with it so far..


At FoundationDB we recently did a blog about using FDB for time series data:

http://blog.foundationdb.com/designing-a-schema-for-time-ser...

One of our largest customer installations is for this purpose.


Take a look at Amazon Redshift (I don't know if you have a higher time budget or a higher dollar budget for what you're building, but Redshift might turn out to be pretty cost-effective when you add in system upkeep as well). It scales well.


There's a company in Chicago called TempoIQ (formerly TempoDB) that is working on a time series database.

https://www.tempoiq.com/

I'm not affiliated with them, I just met them once.


Stonebraker again ... ?

SciDB: http://www.scidb.org/

Paradigm4: http://www.paradigm4.com/

... anyone have experience of using SciDB?


If you would like to perform similarity queries on your time-series please try simMachines.com (I am the founder)

We offer cyclic time warp search, time warp search, and any other metric or pseudo-metric you can come up with :-)


What kind of project is that? Is it a side project or one that will give birth to a company? Requirements are rather different depending on the importance you give to your (or your customers') data.


It's for work. If it were a side project I would probably have just grabbed InfluxDB and run with that since it looks the most fun, but since it's for a core part of the whole system then the risk of project abandonment is a bit high.


Have a look at TempoDB - built specifically for timeseries data (https://tempo-db.com/about/)


TempoDB has renamed itself TempoIQ and no longer offers its storage service. I've heard some angry comments from customers who recently received an email telling them the storage service they were using would be shut down at the end of October!


I work at TempoIQ, and we still offer our storage service. We've launched a new product (as TempoIQ) that is hosted in a private environment and offers storage, historical analysis, and real-time monitoring.

As for the customers on TempoDB, we are working with them to transition to TempoIQ if the switch makes sense, or offering to guide them in a transition to another time-series database like InfluxDB.


If you're cool with a simple K/V storage format (K: timestamp, V: data), Riak might work for you. Easy to scale, reliable, and wicked fast.


If this is for a Manufacturing environment, OSI Soft's PI Historian may be a good fit. Really depends on what exactly your requirements are.


What strikes me about the comments is the very diverse range of products being recommended. Nothing even close to consensus in this space.


I have looked at InfluxDB, and it is pretty cool. Combined with Grafana, you have a complete solution out of the box.


First question - do you even need a database right now? Have you for example considered using CSV files and simply loading those files into Pandas or R on demand?

I am currently working on a project analyzing massive amounts of options data and have found this approach to be both quite easy and flexible to work with... and as my project matures I may move select parts of it into a database.


>loading those files into Pandas or R

What is "massive" for you? I was under impression you can't use R or pandas for anything that doesn't fit into memory.


As for massive - something like daily options data for 3000 stocks, spanning a number of years, with information down to the tranche level (let's say 60 million rows if stored in a relational database fashion). In my case the analysis can be done on the stock level though, which means that only a 3000th of the dataset needs to be loaded into memory at any time.


Redshift or Vertica are your best options and are built for massive queries over large data.


I have not used it but will be experimenting with influxdb soon for storing timeseries data.


have a look at https://dalmatiner.io/ I think it's used here https://project-fifo.net/


Redshift. It scales superbly.


While KDB+ is infinitely better for TS data, hiring people who know what they're doing and buying the physical hardware you need to make it fly isn't what most modern firms are interested in. If you want to run on EC2, Redshift is super great. Time/date range queries though, holy shit those suck.


Cassandra works best, e.g. as the backend for email storage.


postgres (or redshift which is.. postgres).

oracle too. Just did something relatively small with that (< 10MM rows), but it's pretty solid.



If it's only numbers, then Graphite with Grafana for visualization. Otherwise MongoDB can be good as well.


Redshift


BigQuery



