Hacker News new | past | comments | ask | show | jobs | submit login
Why Not to Build a Time-Series Database (outlyer.com)
204 points by dgildeh 5 months ago | hide | past | web | favorite | 126 comments

TLDR: "Why Not to Build a Time-Series Database? Because we're building one and you should pay us."

> Hopefully our story will make you think twice before trying to build your own TSDB in house using open-source solutions, or if you’re really crazy, building a TSDB from scratch. Building and maintaining a TSDB is a full time job, and we have dedicated expert engineers who are constantly improving and maintaing our TSDB, and no doubt will iterate the architecture again over time as we hit an even higher magnitude of scale down the line.

> Given our experience in this complex space, I would sincerely recommend you don’t try and do this at home, and if you have the money you should definitely outsource this to the experts who do this as a full time job, whether its Outlyer or another managed TSDB solution out there. As so many things turn out in computing, it’s harder than it looks!

Hmmm. I used to be part of a team that handled market data at crazy rates and we took exactly the opposite approach to these guys.

When I see:

"You Can Lose a Few Datapoints Here and There"

I see that these guys are barking the wrong tree.

1. We used single thread per network card. (Yes, we architected clusters/failovers, etc... but not once was it required because of data rates)

2. The server could handle a fully saturated Gibit network at <50% CPU (per core)

3. Data was NEVER thrown away (but we had allowances in our API to let the client reading the data to drop updates and get sub-second aggregates instead -- eg OHLC or summation)

4. Data was stored in basically flat file systems.

5. Our calculation engine was run 'downstream' toward the client ends, or on the client end, away from data collection. If needed (ie. the calcs were expensive to run), these could feed back into the server for long term storage.

This was mid 2000. I'm sure this is not rocket science for modern day timeseries guys.

Yeah, it's still pretty much the same just at 10 or 40 gbit now.

Hardware capture almost never drops and timestamps with GPS sync.

You can then take those capture files and manipulate them however you want into normalized market data.

Market data has the notable feature of being segmented by trading day, so the combination of symbol-venue-date is an appropriately small unit of data to run aggregations of any kind over or to distribute over a cluster.

So for market data at least, there's not much to "rolling your own" time series DB in Python or what-have-you.

Prcessing that firehouse in real time for trading is a different matter though and how you build that depends heavily on your latency requirements.

Right. For those interested OpenHFT has created a really nice set of open source solutions to do this.


Do you know any article or book outlining the architecture of a full HFT system, I.e. from market data consumption to pricing to trading? Thanks in advance!

That's how I read it too. To people who haven't worked with metrics at scale though there is some good information and it's worth reading.

It blows my mind that businesses are willing to outsource metrics. When I worked at Amazon it was trivial to estimate the next quarter's results from the app metrics. Naturally this meant we were subject to trading restrictions.

If a monitoring company ever starts applying Google/Facebook style ethics with regards to exploiting the data their customers give them, they will be in an incredibly powerful position.

> It blows my mind that businesses are willing to outsource metrics.

It makes sense at various scales compared to hiring, training, maintaining infrastructure, handling incidents, etc. related to your own metrics solution.

When I'm interviewing a database expert, the one that says:

> "You Can Lose a Few Datapoints Here and There"

is not the one I'm going with...

There are some properties of the data that can be exploited to add weaker consistency guarantees. This leads to some desirable design trade-offs in terms of simplicity and performance optimisation. While this could result in data loss, it may be permissible given that queries can span large time ranges where one or two missing datapoints do not carry the same weight as a financial miscalculation, or loss of life. The same could be said with multiplayer games played over mobile devices, with intermittent connectivity issues. In this domain, the player's moves are fast forwarded once connectivity is restored, as this provides no observable difference to other players. My point is that it's very dependent on the use case, and does not apply across the board.

There's nothing wrong with a special-purpose tool for building approximate graphs, but calling it a "time-series database" or even quoting "inserts-per-second" is intellectually dishonest.

Many SSDs only write 4kb blocks, and writing a 64bit datapoint uncompressed to disk would not only be slow, but it would result in write amplification and wear out the disk sooner. The solution that many TSDBs, including Prometheus and Influx, involves in-memory batching with a backing WAL log file. If the in-memory batch or WAL log is lost, you would lose data as well.

You shouldn't be the one hiring if you can't talk about different scenarios.

"Don't do it because it's hard! And you should listen to us because we have the meta-knowledge and experience (now) to know everything there is to know about this topic, plus the bravery to admit in public that we are only human and we make mistakes."

Isn't this the mantra of all of these types of articles?

OK, yes, it usually makes sense. Especially in the case where you are like these guys and experienced enough in some relevant area to know just how difficult it can be. These are perfectly good reasons from technical, business, and project planning perspectives.

Isn't there a TLDR where they mentioned when it does make sense to build your own TSBD? Presumably in some case where you have a team of serious, high-grade experts who know exactly what they are doing; have requirements that cannot be met by any of the other offerings out there; and where the whole thing has been specced out and deemed reasonable?

The premise itself is also quite funny, imo. I think very few people on this planet would think "I need a database so I build one myself". They might do this with the application layer, but most people consider databases black boxes that they interact with through SQL, or maybe not even that. Maybe they just use an abstraction framework in their favorite languages that lets them write objects which have .load() and .save() methods that generate SQL by themselves.

As I was reading through the post I kept wondering why they weren't using some warehousing technique for older data - either dump it to S3 or better yet, Google BigQuery, which is amazingly fast at that scale. They only did it after doing lots of fire-fighting and per-tenant clusters.

Clickhouse would also be a good option for doing aggregating queries that TSDBs are mostly used for.

One of my wishlist items in the data space is a Managed Clickhouse offering. :-)

We used a combination of Kafka + Hbase+ Phoenix (http://phoenix.apache.org/) for similar purpose. It takes some effort to setup initial Hbase cluster but once you do it manually once and automate with Ansible /systemd it's pretty robust in operation.

All our development was around query engine using plain JDBC/SQL to talk to Hbase via Phoenix. Scaling is as simple as adding a node in the cluster.

That's interesting. What are query times like? Let's say for single series to query data for a week at a five-minute interval, how many seconds it would take?

Does Kafka have timestamps? I didn't see any when I looked, but I was working with an older client version & didn't get far into it.

we didnt rely on it, or on the ordering of messages received from kafka, timestamps and transaction IDs were generated by the client app/kafka publisher and were part of the message put into kafka topic, when we consume that message with one of the parallel kafka consumers and save a row in hbase table that original timestamp + transactionId becomes part of the rowkey string, other parts being attributes that we wanted to index (secondary indices are supported in hbase/phoenix but we didnt use them too much, basically the composite rowkey is the index). Then when querying hbase it works as a parallel scanning machine and can do a time range scan + filtering +aggregation very fast.

On a separate note we didn't use joins even though they are supported in Phoenix, data was completely denormalized into one big table.

In case anyone is interested in a video on this, here's the talk presented at the local San Francisco monitoring group, monitorSF (https://www.meetup.com/MonitorSF/)


Time-series data handling/storage seems a mostly solved problem in the mining, oil, and manufacturing industries. Deployed in the field since the 80's,


Discourse: Industry user now OSIsoft employee

A solved problem at what scale? A kid with pencil and paper can solve the problem when it's small enough.

I don't have personal knowledge, but my understanding is that timeseries data in the oil industry is insane scale. Think 10s or 100s of wells in a field, sensors at regular distances in every well each taking multiple readings (temperature, flow etc) on a sub-second cadence. And a big oil company will of course have many fields.

It's in the same league as timeseries data in financial markets, which I have more direct experience of. Financial timeseries data probably has more metadata (trade/order flags etc) but in terms of raw data it's similar.

What the article is talking about is an order of magnitude larger. Think millions of data points per minute coming out of a medium sized data center. It gets harder when you look at the cardinality of the data (explained better by the article), which corresponds to the sized of the indexes.

Yes I can confirm data historians like PI are used heavily in my industry (manufacturing).

Nice article.

>its not uncommon for some of our customers to send us millions of metrics every minute

What kind of customers/services generate millions of points a minute?

I use to work at a Fortune 50 retailer on the cloud platform (a lot of tooling around CI/CD for the teams that manage the website). We had a large problem with keeping the metrics pipeline current. A major issue is that be default, Spring Boot publishes about 500 different metrics on a 10 second slice. Allowing every application to pump out that many default metrics, most of which are never used, means that it takes only 334 instances to get one million metrics per minute (1,000,000 / 6 / 500 = 333 1/3). I would guess we had a couple thousand instances in production on a normal day. In a couple weeks, they have Black Friday. Any team that hasn't been able to fix performance problems are given the go ahead to just throw money at it and scale horizontally in obscene ways.

Of course, our metrics were all handled in house. From talking to the teams that handled the metrics pipeline, the vendors were great for smaller companies, but there was no off the shelf solution for a companies that large with that volume. But I did very little with that myself, other than look into the fact that Spring Boot published way too many default metrics. Who needs P50, P70, P75, P80, P85, P90 - P99 on all web requests?! Just set a default that is small and worthwhile and let the developers adjust as needed.

why this whole per minute thing I can insert 1000000 per minute on my mbp using reasonable batching (it's only 16K per second)

Folks need to resist the inclination to just gather maximum data for the hell of it.

If you're pumping out a million metrics per minute, almost none of those are ever going to actually be used to generate meaningful insight.

I used to work at a startup that made physical robots. The robot generated several GBs of data every time it turned on. You're correct, most of that data wasn't looked at most of the time. But every now and then, someone would say "Hey, I saw a robot do something funny the other day, what the hell happened?" And having all that data usually made it possible to figure out what happened. To me, "maximum data for the hell of it" isn't about generating insight by looking at trends, it's about generating insight during incident analysis.

> To me, "maximum data for the hell of it" isn't about generating insight by looking at trends, it's about generating insight during incident analysis.

Agree 100%.

That is a very particular use case, where very high res data is critical. I note that even here, you're interested in data from "the other day", not years ago.

In most cases, time spent maintaining terabytes of rapidly aging time series data would be better spent elsewhere.

I think that really depends on the case.

A particularly good high-frequency trader might be interested in Terabytes of minutia when they're trying to sort out what caused yesterday's spike and crash of ticker XYZ.

Systems and sales analysts that are looking at web store front ends (and back ends, if there are issues) would be interested in large volumes of data, specifically corner cases (users who don't follow a statistically significant path), when trying to sort out a UI/UX redesign.

Traffic and transit analysts might want terabytes of data (especially with date and weather indicators) when considering what kind of freeway interchange to add to a growing area.

I suppose I could go on...

I used to work at Meraki. There, we would, every 5 minutes or so, record byte and packet counts classified by traffic class and remote end of the connection for every single client connected to an SMB internal or customer-facing network.

(Meraki actually did implement its own time-series database, and after I left published a paper describing its design and implementation. https://meraki.cisco.com/lib/pdf/trust/lt-paper.pdf. Good quote on the motivation: "As discussed in Section 2.3.3, customers have a nearly insatiable demand for high-resolution historical data, even though they mostly query data from the recent past."

Any more good links from them?

Not that I can find - it's a proprietary solution for internal use only.

Some companies record as much as they can, e.g. "here's all of the user's mouse-movements, their full User Agent, etc." or "here's everywhere the user tapped on the app and which of the notifications we sent them that they responded to, and more!"

The example from the article was "one team at one of our customers decided to dump 30 million metrics on us, send all of their mobile product metrics into Outlyer"

Machine-generated event telemetry from mobile phones, cars, etc can easily be tens of millions per second in real-world applications. Human-generated event telemetry (e.g. text messaging) peaks at hundreds of thousands per second if you are working on global scales.

There is virtually an unlimited number of applications that could generate 16k events per second (million per minute).

At one employer most customers would opt-in to upload their system logs, and company would analyze to anticipate problems for preventive maintenance.

The result was a tens of terabytes a day Niagara of data.

When I left they were in early stages of Hadoop because ordinary parse/analyze was starting to fall behind.

16k samples/s is not a lot. There are many Prometheus users with hundreds of thousands of samples/s on a single Prometheus server.

Across their organisations it can be much more, Fastly has reported 2.2M/s (https://promcon.io/2018-munich/slides/monitoring-at-scale-mi...) for example.

Often per-click stuff ends up with dozens or hundreds of data points from different parts of the code -- heartbeats, feature usage, funnels, experiment entry, etc.

1 mil is only 16K per second.

We have customers that generate tens of millions of measurements per second. Lots of low-level systems latencies can be collected at high volume. Also, high volume online services can easily generate this order of magnitude.

Transportation companies--truck fleet data, airplanes, etc.

Customers using a large-scale analytics database...

"time-series database" is some of the most overhyped nonsense since noSQL.

Time-series is just data with time as a primary component. It comes in all shapes and volume, but if you have a lot of data and are running heavy OLAP queries than we already have an entire class of capable databases.

Use any modern distributed relational column-oriented database, set primary key to metric id + timestamp, and you'll be able to scale easily with full SQL and joins. You can keep your other business data there too, along with JSON, geospatial, window functions, and all the other rich analytical queries available with relational databases.

We have trillion row tables that work great. No special "TSDB" needed.

While I understand your point, you are quite mistaken if you think that time is just another key. Dealing with time properly requires a concept of point distance, similar to GIS systems requiring 2d distance understanding. You cannot do joins on time with SQL databases unless you want to throw away important data.

As an example, in the industry I work in, you may have no readings for days or weeks, and then hundreds of readings from the same sensor. Why? Many systems in industrial environments send new readings only "on-change", and assume the underlying data storage architecture will forward fill to in-between times. This is why the practically ancient time series architecture of data historians still dominates in these use cases.

In fact, for many time series functions you actually have to throw away the notion of relational joins to be able to efficiently perform time-preserving joins. Window functions only work in basic use cases with relatively small amounts of data where you're aggregating.

Vertica has this functionality and it has been there for years. Fully functional database and you can do time series joins, gap filling with linear interpolation or constant. You can define the intervals at what you want the data points. And you can scale from a few gigs to petabytes of data. https://www.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/...

I agree! Vertica's temporal capabilities are marvelous and the engineers who worked on the planner optimizations for the time extensions are brilliant.

There are of course trade-offs to the approach Vertica takes -- look at StreamBase for a very different take on the problem, another Stonebraker project.

Any of course historians represent yet another take, optimized for point-in time queries that are native and don't need the processing extensions Vertica uses.

I'm not sure where the complexity is that you claim, nor what it has to do with data historians?

So what if there are missing rows? This doesn't affect the database and any aggregations will work fine. Databases don't "fill-in" data, but you can definitely write whatever SQL you need to create averages and buckets to smooth out results.

From reading your website, it seems you're talking about the "last value recorded" as of a certain time, which doesn't seem to be a common query but is totally possible. KDB+ has "asof" joins and others can handle it with window functions using last_value().

We run queries on a table containing 2.7+ trillion rows of data that has no set pattern and infinite cardinality, and results return within seconds. Window functions and joins work without issue. Have you actually tried using a columnstore?

Relational databases represent a column/row-oriented architecture. Data historians are a specialized, non-relational, time-oriented architecture. Using time as a key in a relational index implies that only ordering is important, but that is not the case. Distance between points in time is extremely important because time operates on a continuous 1d line and data points are represented at varying distances between each other on that line. Data historians are architected to both preserve this temporal relationship and take advantage of this by eliminating duplicate data, employing temporal compression techniques to be able to store millions of readings per second for years worth or data.

> From reading your website

My website doesn't have much to do with this because Sentenai isn't a time series database system. I did, however, spend most of my time in research working on temporal data systems, and have been fortunate to collaborate with or learn from researchers who have spent decades solving the unique problems that temporal data presents. What you might consider uncommon for your use cases is extremely common in manufacturing, defense and other areas.

There's a decades-old industry around database systems that handle time natively. And while many support SQL as a lingua franca, and some are column stores, they're not relational by any means as they either extend SQL to support time, or limit non-temporal joins to ensure performance. StreamBase, Kdb, Aurora and many other specialized architectures exist because one size does not fit all. Michael Stonebraker, whose work has included StreamBase, Vertica, Tamr, Postgres, Aurora, and many others, famously published this paper about the very problem: https://cs.brown.edu/~ugur/fits_all.pdf .

Further reading that might be illuminating:

  1. http://cs.brown.edu/research/aurora/vldb03_journal.pdf
  2. http://www.cs.rochester.edu/u/james/Papers/AllenFerguson-events-actions.pdf
  3. https://books.google.com/books?id=BK6oCAAAQBAJ&pg=PA9&source=gbs_toc_r&cad=4#v=onepage&q&f=false (excerpt)

I appreciate your links to further reading, and I'm trying to read the Aurora paper right now but after reading the abstract and the intro (I'm in progress right now), I can't find a case that is uniquely fit/perfect for data historians... I know this is already asking a lot, but would you mind giving me one go-to- use case that really made you think "this is what purpose-built data historian-style databases are good for?".

Every issue mentioned in the abstract/intro (which are meant to motivate the paper) seems like it can be solved as an add-on to existing application databases (albeit with their most recent developments/capabilities in mind). The very description of HADP vs DAHP systems seems silly, because it's just a question of write load, and that's fundamentally only solved with batching and efficient IO, or if you give up durability, it doesn't seem inherent to the data model. There's also assertions like:

> Moreover, performance is typically poor because middleware must poll for data values that triggers and alerters depend on

But like, postgres though, you're free to define a better/more efficient LISTEN/SUBSCRIBE based trigger mechanism, for example, you can highly optimized code right in the DB... Thinking of some of the cases called out in the paper here's what I think in my head:

- Change tracking vs only-current-value -> just record changes/events, as far as tables getting super big, partitioning helps this (timescaledb does this)

- Backfilling @ request time -> an postgres extension could do this

- Alerting -> postgres does have customizable functions/procedures as well as LISTEN/SUBSCRIBE. The paper is right (?) about TRIGGERs not scaling then this might be the most reasonable point.

- Approximate query answering is possible with postgres with stuff like HyperLogLog, but the paper is certainly right in that it is not implemented by default.

Maybe I'm mistaking the extensibility of postgres for the redundancy of the paradigm, akin to thinking something like "lisp is multi-paradigm so why would I use Haskell for it's enhanced inference/safety".

I'm still reading the paper so maybe by the end of it it will dawn on me.

So Aurora isn't a historian, but is a complex event processing system. It's an entirely different beast that solves very specific problems around high-speed queries that could theoretically require scanning through all data stored historically for queries.

I'm not a huge fan of historians (I've spent too much of my career working with them), but I can definitely tell you where they make sense. The scenario is this:

Imagine you have a large facility with thousands of machines, each with a programmable logic chip for controls and monitoring. These machines create lots of data and so often employ data reduction semantics by reducing data to on-change rather than sampling sensors at thousands of hertz. A single machine may have dozens or hundreds of variables to track. These tags might be hierarchical: Machine 1, subsystem 5, variable b. If you say there's 100,000 total tags to track in the facility, and they're on average sampled at 10hz, you need a system capable of writing a million durable timestamped values per second. Now that's child's play for, say, google, but if you're a manufacturer, you can't afford to spend massive amounts of money on cloud systems, and usually want to do this all on a single server on the factory floor because you need realtime monitoring that can display the current value in time for every single tag. ( https://www.ws-corp.com/LiveEditor/images/SLIDES/10/3.jpg ). Ideally, in a single node scenario, you want compression. It's not uncommon to store 100 billion timestamped values per day and keep them for a year or more for audit purposes is something goes wrong. Today, for the sake of predictive maintenance, data retention policies of up to 10 years are becoming more common.

So what would you sacrifice to be able to do efficient realtime monitoring and ingestion of millions of data points per second? You can't use queueing semantics to protect an RDBMS because logging can't take more than a 10th of a second per point. If you think about the use case, what you'd sacrifice is transactional queries and row-level joins, because you just don't need them. At the same time, this data is really sparse when you look at it from a table's perspective, so you'll want something like a column store to underly the data storage.

So what we do is throw out transactional guarantees, choose a storage system that is good at compression (roll-ups in some historians will store a formula approximating the data instead of raw data itself over a window), and prioritize speed of point retrieval for most recent "hot data" by caching it in-memory.

You can of course extend Postgres to achieve many of these things, but having done it myself, in practice it's sub-optimal in the exact same way that using bubble sort for all your programmatic sorting needs is sub-optimal.

One thing you might want to keep in mind is that many of the people involved in Aurora are the authors of Postgres. They're not arguing you can't do things in Postgres, they're arguing that in practice the RDBMS's guarantees are theoretically incompatible with high performance in the area of Complex Event Processing, because alignment between different simple events (recorded as rows in a database) can drift so far that memory requirements become prohibitive if you don't use a stream-processing architecture.

Also keep in mind that Aurora is from 2002 and many of the ideas have been implemented elsewhere over time. The great thing about Postgres is that it's perfect scaffolding on which you can build other stuff.

Thanks for taking the time to give such a detailed answer and explanation on historian systems.

Most data/operational historians are separate programs on top of databases, so is that what you're actually talking about? Your papers seem to suggest that.

Stonebraker is talking about OLTP vs OLAP. I agree that they are very different scenarios.

Over the past 10 years I've had the misfortune of working with most major historians in the field (Oracle's the biggest one I never encountered) and not a single one implemented process data storage on top of a relational database. Some used relational databases to store asset metadata, so maybe that's what you're thinking of?

If you read all the way to the end Stonebraker is actually advocating for specialized architectures like, array databases (SciDB) and stream processing engines like StreamBase, which at the time was just gaining GUI-based query creation capabilities because it was difficult to teach its non-relational concepts to SQL users.

You don't have "missing rows". For time series A you have a time point at 12:01 AM, a datapoint at 12:02 AM, and another datapoint two weeks later at 5:04 PM. For time series B the times are different. You need some notion of whatever state the physical system was in at any given time.

Yes, I understand this as the "last value recorded" concept in my comment. KDB+ supports this with "asof" joins. Others can just do it by scanning a wider time frame or the entire table.

Null gaps in a columnstore can be skipped over basically instantaneously and usually are just zone map/index lookups. Again I question how common this query is and whether it's really worth limiting yourself to a special TSDB because of it.

> Others can just do it by scanning a wider time frame or the entire table.

"Scanning the entire table" for every request to have the last value recorded is rarely a practical option.

> KDB+ supports this with "asof" joins.

> [...]

> Again I question how common this query is and whether it's really worth limiting yourself to a special TSDB because of it.

KDB literally markets itself as a time series database. What's the point you're making again?

I think TimescaleDB lacks an "asof" function for now, but it makes up for it by having the full power of PostgreSQL for other stuff. Regardless, Time Series databases like KDB and TimescaleDB are useful.


it looks like TimescaleDB recommends using

to get the most recent value for any particular set of sources that you're SELECTing over, which would use indices and be reasonably fast.

> KDB literally markets itself as a time series database

kdb+ also markets itself as OLAP/OLTP as well.

A lot of these guys market themselves as a "time series database" because kdb+ do and they want to be compared with kdb+ by people who haven't used kdb+ (but might be considering it).

Distributed relational column-oriented databases are best at large data volumes and OLAP queries. KDB+ is one of those, even though they call it a TSDB in marketing terminology because of its FinTech customer base.

TimescaleDB is not a TSDB, it's an extension to add automatic partitioning to PostgreSQL tables. Timescale helps Postgres get more performance, but it does not give you the full capabilities of a real distributed column-oriented system. If you must use PostgreSQL though then it's a good compromise.

The query you posted does not match the discussion about the last value at a specific instant in time, only the last value ever recorded in the table for that key.

> Distributed relational column-oriented databases are the best at large OLAP data volumes and queries. KDB+ is one of those, even though they call it a TSDB in marketing terminology because of its FinTech customer base.

You're mistaken about Kdb's relational features. Kdb was designed as a time series processing engine using arrays (columns). Column storage doesn't have anything to do with whether a database is relational or not, and Kdb wasn't originally any more relational than the language Erlang is.

I never said relational is related to column-storage.

kdb+ has SQL semantics and relational queries, but it's a combination of the q language integrated into a database so sure, it's a superset of a relational database. Perhaps we disagree on what relational means.

My point was that the current relational features of Kdb didn't exist originally (they were grafted on later) so it's not "marketed" as a TSDB, but it is in fact a TSDB marketed as a relational DB.

The definition of relational is very precise, whether you use the domain calculus, relational calculus or relational algebra. Wiki has a good summary of what must be natively supported by a database system to be relational: https://en.wikipedia.org/wiki/Relational_algebra

If you don't implement this at the transaction log level, but implement it via emulation at the output level, you can't make full relational guarantees, so these operations are fundamental to database design.

Why does it matter what it was originally? We're talking about what the product is today, not 20 years ago.

kdb+ supports a superset of SQL and relational algebra, so it's a relational database. How it's implemented doesn't matter if it provide the functionality, which it can.

> Kdb wasn't originally any more relational than the language Erlang is.

To wit: SQL wasn't originally any more relational[1] than the language Erlang is.

[1]: Codd, E. F. A Relational Model of Data for Large Shared Data Banks -- Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387

I'm not sure what value such a statement brings the world.

I was drawing a comparison between the Kdb runtime and the Erlang runtime, because the OP seems to be conflating the ability to emulate relational features at the application level with first-class support for relational semantics as in a relational database. Support for relational semantics can be emulated via programming languages with sophisticated runtimes like Erlang, but I wouldn't classify any database supporting a Turing complete language with a sophisticated runtime as a relational database.

In our case, "last value" isn't good enough. We do interpolation, and use compression algorithms for which interpolation minimizes reconstruction error.

There are some unique challenges to storing time series data that are different than those of relational databases. Namely, read/write asymmetry, data safety, data aggregation, and analysis of large data sets.

I wrote in depth about these problems and how different TSDBs solve them here. https://www.irondb.io/2018/08/tsdbs-at-scale-part-one/

All modern columnstores can handle vast ingest rates and query speeds. It's all down to sharding, zone maps and sparse indexing, fast algorithms that operate on compressed data, and storage throughput. These are well-solved problems at this point.

Your blog post doesn't mention a single columnstore database though. KDB+, Clickhouse, MemSQL, or any of the GPU-powered variations will happily beat any TSDB out there.

They can't handle high cardinality. Imagine having millions of columns in the column-oriented database (70% of those columns are updated every second). Imagine that you have to add new columns all the time.

The main misconception about TSDB's is that it's just a data with timestamp. TSDB's has multi-dimentional data model, time is only one of the dimensions.

You don't need to add new columns.

   CREATE TABLE metrics (metric_name text, ts timestamp, properties json, key(metric_name, ts))
OLAP queries with SQL are very good at handling whatever dimensions you want.

'metric_name text' is actually a tag-value list. Many TSDB's allows you to match data by tag. Each tag should be represented by a column in your example.

Single table design will be prone to high read/write amplification due to data alignment. Usually, you need to read many series at once so your query will turn into full table scan. Or it will read a lot of unneeded data which happened to be located near the data you need. Writes will be slow since your key starts with metrics name. Imagine that you have 1M series and each series gets new data point every second. In your scema it will result in 1M random writes.

Cardinality of the table will go through the roof, BTW. Every data point will add the key. Good luck dealing with this.

> 'metric_name text' is actually a tag-value list. Many TSDB's allows you to match data by tag. Each tag should be represented by a column in your example.

For the life of me, I can't figure out why this would be a good idea. I feel like I must not understand what you're saying:

If I've got a million disks that I want to draw usage graphs for, why I would put each one in a separate column?

What's the business use-case you're imagining?

> Usually, you need to read many series at once so your query will turn into full table scan.

Why do I need a full table scan if I'm going to draw some graphs?

I've got something like 4000 pixels across my screen; I could supersample by 100x and still be pulling down less data than the average nodejs/webpack app.

> Imagine that you have 1M series and each series gets new data point every second. In your scema it will result in 1M random writes.

No that's definitely not what manigandham is suggesting. One million disks each reporting their usage means a million rows in two columns (disk name/sym, and volume) would be written (relatively) linearly.

Modern TSDB is expected to support tags. This means that every series will have a unique set of tag-value pairs associated with it. E.g. 'host=Foo OS=CentOS arch=amd64 ssdmodel=Intel545 ...'. And in the query you could pinpoint relevant series by this tags thus the tags should be searchable. For instance, I may want to see how specific SSD model performs on specific OS or specific app. If the set of tags is stored as a json in one field such queries wouldn't work efficiently.

About that 1M writes thing. You have two options. 1) Organize data by metric name first, or 2) by timestamp. In case of 2) the updates will be linear but reads will have huge amplification. In case of 1) updates will be random, but reads will be fast.

You are talking about regular relational databases. I'm talking about distributed column-oriented databases. Big difference.

You can store tags and other data in JSON/ARRAY columns. The primary key is used for automatically sharding and sorting.

Groups of rows are sorted, split into columns, compressed, and stored as partitions with metadata. This means you can 'scan' the entire table in milliseconds using metadata and then only open the partitions, and the columns inside, that you actually need for your query. There are no random writes either, it's all constant sequential I/O with optional background optimization. And because of compression, storing the same key millions of times has no real overhead.

As stated several times before, we deal with this everyday on trillion row tables inserting 100s of billions of rows daily. Queries run in seconds. We do just fine.

We have a lot of data stored in Postgres JSON fields at my work. Around three months ago, we were trying to optimize some queries by adding sub-key indexing to the JSON field. We tried multiple times, but Postgres seemed to keep using sequential scan on the records, rather than the JSON index. So, we just decided to normalize the data and use proper foreign key fields for query performance.

Sure they can handle them, just not in an economically viable fashion.

Compared to what? Economically viable is very vague and relative. Columnar storage can easily reach 90% compression levels, is faster to read, and vectorized processing beats per-row/record iteration, so there's a reason it's the best for OLAP currently.

Why not benchmark IronDB against Clickhouse and post the results?

90% compression on time series is not viable unless you have some very specific dataset.

Both of you commenters have your own TSDBs which seems to be coloring all of your posts.

I'm going to leave this conversation as unproductive unless you care to benchmark your products against modern column-stores, although I think it's telling that there are never such benchmarks available.

Granted its an exceptional case, but the query loads we saw at Datadog were poorly served by off-the-shelf solutions.

Maybe things are different now, but I doubt it. You can spend a fortune to get good performance, or you can deal with slow performance, or you can invest a lot of engineering effort and get both, but there's not a ready-to-use solution that will magically replace an entire engineering team for real scale.

But almost no one ever hits that scale, so maybe it's better to adopt this line as a rule of thumb anyway...

I think Clickhouse would do well but I've seen other metrics/observability vendors (like Honeycomb) also build their own systems given the scale and cost factors.

Isn't Datadog on AWS? If you have very specific needs and can build a vertical infrastructure stack then it makes perfect sense to build your own.

Yeah AWS, though mostly just EC2.

I think the challenge is that there are multiple competing needs which are in tension. Data isn't uniform, a large write load that's almost never queried, recent data is accessed way more often than older data, flexible tagging means (org,metric) queries produce potentially millions of points (imagine disk usage across every node for every disk), but indexing tags can be very costly, and its difficult to predict what someone is going to want to query.

I agree that hyper-focus on those needs can distort the picture though. You don't actually have to solve them most of the time, and a relatively poorly optimized solution goes a lot further than people realize. Simply adding caching, for example, solves almost all these issues.

Anyway I mostly agreed with your opening comment.

A modern, distributed, relational, column-oriented database will often stress emphatically in the documentation that using timestamps as primary keys is an anti-pattern that's likely to lead to hot tablets: https://cloud.google.com/spanner/docs/schema-design#choosing...

Yes, that's why my comment said use metric name and timestamp.

Spanner isn't a column-oriented database, but they all support multiple columns as the primary/sort/shard/distribution key. Use the name as the first column, and timestamp as the last column, for scalable distribution.

the first section on that page shows what the op was saying, to use the name column first and timestamp last

It really depends. Capabilities like time-series-specific compression, automatic rollups, complex aggregations and/or ranking, stable storage in S3, clustering, and replication vary a lot and I think that's why we see so many TSDBs out there. I maintain a list of TSDBs[0] and it started as an evaluation of what already available for my previous employer to use. We didn't find one that fit our exact use case, so we ended up building our own on top of MySQL.

[0] https://misfra.me/2016/04/09/tsdb-list/

All the features you mentioned are already part of distributed column-oriented databases. S3 storage is orthogonal and unrelated to time-series data. It's usually not included in shared-nothing local storage architecture of databases but you can definitely mount S3 storage in a variety of ways.

Many options on your list are not TSDBs, like Aerospike, Elasticsearch, Cassandra, Kudu, GridGain/Ignite. EventQL and Riak are obsolete. Apache Apex is a stream processing framework. Many of the others are just extensions to Prometheus built-in mini-storage or offer time-series indexing on top of existing databases.

> All the features you mentioned are already part of distributed column-oriented databases.

I disagree, but even if that was the case, not all of them perform well. For example, we could've used Cassandra for our use case at my previous employer but the lack of push-down aggregations (at the time, not sure if they're supported now) would've been terrible for our top-K aggregate queries.

What features do you disagree about?

Cassandra is not a distributed relational column-oriented database, so yes, it will be bad at OLAP queries.

Cassandra is a "wide-column" or "column-family" database, which is unfortunately confusing industry jargon but better referred to as an advanced/nested key-value store. It comes from the original Dynamo whitepaper, along with similar systems like HBase, BigTable, DynamoDB, Azure Table Storage, etc. They can sometimes handle time-series queries with good data modeling because of fast prefix scans but the lack of a real query language makes them a bad choice for analytics scenarios.

I understand. Can you give an example of a “modern distributed relational column-oriented database”?

Two capabilities that are important in my work are roll-ups (reducing resolution of data) and fast bulk deletes of old data.

Clickhouse, MemSQL, Redshift, MapD, Kinetica, etc.

If you just want rollups and don't care about every row, then look at Druid (or imply.io for a startup making it easier).

All these systems can delete old data very quick as they just delete entire compressed partition files.

Fwiw, more recent versions of Druid have a no-rollup mode that does ingestion row-for-row. It ended up being useful for cases where you _do_ care about every row, maybe because you want to retrieve individual rows or maybe because you don't want to define your rollups at ingestion time. And in that mode, Druid behaves like the other DBs you mention.

(I am a Druid committer.)

Some of those we’ve looked at before and decided not to go with because of unknown observability, high operational requirements, or cost. But yeah, no real problems with data models or queries.

I think Druid has come the closest to the most ideal system for the requirements I’ve had to deal with, but haven’t used it yet.

Thanks, by the way! This helps a lot.

> No special "TSDB" needed.

I think that is a large simplification. I recommend this (relatively) short article: https://blog.timescale.com/timescaledb-vs-6a696248104e

The first graph is particularly salient.

Timescale, for all their wonderful marketing, is just an automatic sharding extension for PostgreSQL. You can accomplish the same yourself using native partitioning, or pg_partman, or Citus.

Partitions are a basic building block for scaling performance and storage so it helps when you have lots of data, but Postgres w/Timescale does not have column-oriented storage and is still single-node only so it comes nowhere near the capabilities of cutting-edge columnstores like Clickhouse, KDB+, MemSQL, Kinetica, etc.

> Timescale, for all their wonderful marketing, is just an automatic sharding extension for Postgres database. You can accomplish the same yourself using native partitioning, or pg_partman, or Citus or any number of other tools.

Put another way...

"Postgres, for all their wonderful marketing, is just an automatic data organization system for <underlying filesystem>. You can accomplish the same yourself using open, read, write, or any number of other syscalls."

You're doing the whole "large simplification" thing again. Yes, you can do everything yourself. No, you don't want to do that. Postgres by itself is not great for time-series data. Time series databases are useful, as your reply even showed, except for the part where you seem to assume any software that doesn't do something entirely novel is simply a quick abstraction that you could just whip up yourself.

Column stores have advantages over row stores, but they also have disadvantages. Your statement that it "comes nowhere [close to] the capabilities of cutting-edge column stores [...]" could just as easily be reversed as well.

Not really what I was saying at all.

Timescale adds automatic partitioning to Postgres, a single-node rowstore relational database. This will naturally give you better performance for larger data (whether time-series or not).

This will not approach the performance and scalability of a fully distributed relational column-oriented database like Clickhouse or MemSQL, because automatic partitioning is just one of many techniques they use for fast performance. There is nothing a special TSDB, or TSDB extension, can do that these database cannot already do faster, while providing rich SQL and joins.

You do realize there are say ClickHouse clusters that ingest in a few days more than largest timescale cluster can handle as it's max size.

Considering that TimescaleDB has only been available for a relatively short period of time, I would love to see a source for that statement. It sounds like a really fun article to read.

Since TimescaleDB is creating a new partition for each chunk of time, it should be able to maintain its ingestion rate consistently for as long as you have storage to store that data. Perhaps it won't keep up with distributed, eventually consistent databases, but such databases generally have very limited analytical power, and if you're using them for anything but time series data, that whole "eventual consistency" thing requires a lot of careful thought.

Timescale is an extension to add automatic partitioning to PostgreSQL to give you some scalability and performance benefits. It is nowhere near the performance potential of a real distributed column-oriented database, which are strongly consistent, have rich SQL support, and even support transactions.

google is you friend google clickhouse vertica etc. the comment about limited analytical power is especially fun. Cloudflare is ingesting 11 million rows per second into CH.

I think time will change the foundation of a lot of databases, once someone gets it right, and I’m not sure time-series is really it.

We’re currently spending billions trying to build bitemporal public data in Europe, and it’s no easy feat so far.

Basically what we need is to be able to register future data, that don’t come in to play until they are supposed to, as well as keeping a live history that you can spook through to view a data set from any given date, as well as make changes to some past data as if you were there at that date.

You can obviously do so with code, and a lot of the old SAP systems actually support this, but the first DB that handles this well will get to run every single public system.

The PI data archive (it's the database part of the pi system) actually has a lot of these features, including the future data stuff (probably the biggest feature in the past few years). It's made by OSIsoft which is a big player in time series databases for industrial settings. (fyi I work on this product)

This works for a small / medium size company but certainly does not scale for bigger companies. There are several problems that you gonna run at scale.

That's true for pretty much any technology.

> Time-series is just data with time as a primary component.

Doesn't it have a special feature that data comes in a linearly increasing time dimension.

I don't see why that's so special. Other data can have incrementing IDs or some other value. In fact all data can be considered to have a timestamp, at the very minimum being when it was inserted into the database, so it's a rather vague definition overall.

I think, its different because its dimension where usefulness of data goes does as the data ages. TimescaleDB for example, does optimisations ( they call it chunks or something) based on this fact.

Timescale is just an automatic partitioning extension for Postgres. You can also do it with the native partitioning feature, or pg_partman, or Citus, or other tools.

Partitioning the table is the optimization, so that you skip over data when querying and manage it in smaller parts, but Timescale doesn't do anything about older data and neither do most databases.

> Timescale is just an automatic partitioning extension for Postgres.

That, and a good PR operation.

I've had great success using event timestamp ( milliseconds since 1970 ) as a column, with an index on it. And then when you query you can use BETWEEN. If you write your own ORM you can make it automatically calculate the time range according to some defaults - minute, second, etc. Works great.

If you are using spatial data, You can also use two columns like this for longitude and latitude.

I’m surprised no one has brought up Splunk in here (that I could see at a cursory glance).

They manage to do time-series storage on a pretty large scale (over 5PB/day for their largest customer).

I wouldn't really say that splunk is time-series in the same way they wanted. It's mainly logs indexing, but not the kind of aggregation you'd want from numeric, labelled metrics.

> aggregation you'd want from numeric, labelled metrics.

We use it exactly for this, works great.

Do you use the new metrics store? I just learned they have one now in the new versions.

No we don't use metrics store.

That doesn't really sound like a cost effective way of doing metrics... The amount of data you license from splunk is likely quite expensive compared to a dedicated metrics system, right?

Not very knowledgable in the area but can someone please explain how does kdb fit within this class of time series dbs and whether there are any alternatives available to kdb.

Kdb is a priprietary db that runs using the processor level 2 cache as memory. Therefore it is crazy fast. Also therefore, it does not handle large datasets.

At least in the past.

As the blog author, great to see the discussion and feedback, so appreciate it!

Without going through comments one by one, the main ones about this being a solved problem or there's already solutions out there that do this, I would just say those comments remind me exactly of the type of conversations I had years ago with my team. We all thought it would be much easier or thought there would be something off the shelf that could do everything, and after several years of fire fighting, the reality was the problem looks much simpler than it really is, by a long mile.

Now that we've been doing this for a few years, and spoken directly with creators of many other TSDBs, we take a very skeptical view of all claims made about any database. They all sound amazing when you first read about them, maybe even work great in testing, till you hit scale and then you find all the limitations. If there was a perfect TSDB out there, everyone would be using it and there wouldn't be a new one announced on a weekly basis!

I think the one comment on query loads being different sums things up - I've no doubt all the other options thrown out there work well for data historians, but for monitoring tools with loads of concurrent users, loading dashboards with 10's or 100's of queries each, and alerting systems polling every few seconds in parallel, the query load can get very high quickly, and making those fast while still writing metrics in at scale, is a hard problem and I don't think any individual TSDB has really solved that properly, which is why we ended up building our own distributed architecture ourselves.

It's pretty ridiculous that "Time-Series Database" has come to mean ingesting massive amounts of streaming data. They've been around a long time and have many use cases.

They're a great way to store data efficiently, accessing specific data if you know the time range you are looking for is very fast and simple, and you can roll your own in a few dozen lines of C if that's what you want to do. If that's all you need, why not?

That may be a perfectly good solution if you have a very static infrastructure and narrow use case.

As a thought exercise, for the most trivial solution, you could create a single append only flat file. This may work well for writes, but what happens when you want to read the datapoints for only a single series in time order? This would result in an expensive scan over the whole file. An improvement could be to create a file per series, but this becomes problematic when writing many small datapoints across each different file. The problem worsens in the case of a dynamic containerised infrastructure which produce a unique number of timeseries over very short intervals, which was the catalyst for the development of Prometheus TSDB v2, as the prior version stored a file per timeseries.

As the post states, there is a balance between the read and write pattern - achieving that with a few lines of C for a general purpose case is a difficult task, if not impossible.

To be clear, my post was to state that there are many use cases for time-series databases and bemoan the fact that most current development centers around a specific use case. That is in fact what I wrote.

I have a hard time believing that "a dynamic containerised infrastructure which produce a unique number of timeseries over very short intervals" is the superset of all time-series use cases, but perhaps it is so.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact