How Time Series Databases Work, and Where They Don’t (honeycomb.io)
228 points by telotortium 49 days ago | 55 comments

That article takes various concepts from typical TSDB solutions and seemingly only looks at the bad sides. Time series data has many different forms, not every form works for every TSDB solution.

For the 3 caveats at the top, there are already two TS solutions that look promising (QuestDB, TimescaleDB). Often an operational analytics DB (Clickhouse, CrateDB) might also be a solution.

(TimescaleDB co-founder)

Thanks for the mention, and I completely agree :-)

Personally, I think a lot in this article is misguided.

For example, it essentially defines "time-series database" as "metric store." As TimescaleDB users know, TimescaleDB handles a lot more than just metrics. In fact, we handle any of the data types that Postgres can handle, which I suspect is more than what Honeycomb's custom store supports.

  TSDBs are good at what they do, but high cardinality is not built into the design. The wrong tag (or simply having too many tags) leads to a combinatorial explosion of storage requirements.
This is a broad generalization. Some time-series databases are better at high cardinality than others. Also, what is "high-cardinality" - 100K? 1M? 10M? (We in fact are designed for _higher cardinalities_ than most other time-series databases [0])
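
To make the "combinatorial explosion" in the quote concrete, here is a rough sketch of the worst-case arithmetic: the number of distinct series for one metric is bounded by the product of the distinct values of each tag. The tag names and counts below are made up for illustration and do not describe any particular database.

```python
from math import prod


def max_series(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct series for one metric:
    the product of the number of distinct values per tag."""
    return prod(tag_cardinalities.values())


# Hypothetical tag sets, for illustration only.
base = {"host": 100, "endpoint": 50, "status_code": 5}
with_user = {**base, "user_id": 10_000}

print(max_series(base))       # 25000
print(max_series(with_user))  # 250000000
```

One extra high-cardinality tag multiplies the bound instead of adding to it, which is why the answer to "what counts as high cardinality" matters so much.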

  In contrast, our distributed column store optimizes for storing raw, high-cardinality data from which you can derive contextual traces. This design is flexible and performant enough that we can support metrics and tracing using the same backend. The same cannot be said of time series databases, though, which are hyper-specialized for a specific type of data. 
We just launched tracing and metrics support in the same backend - in Promscale, built on TimescaleDB [1]

I do commend the folks at Honeycomb for having a good product loved by some of my colleagues (at other companies). I also commend them for attempting to write an article aimed to educate. But I wish they had done more research - because without it, this article (IMO) ends up confusing more than educating.

For anyone curious on our definition of "time-series data" and "time-series databases": https://blog.timescale.com/blog/what-the-heck-is-time-series...

[0] https://blog.timescale.com/blog/what-is-high-cardinality-how...

[1] https://blog.timescale.com/blog/what-are-traces-and-how-sql-...

How does timescale (a single-purpose database) hold up against single-store (a multi-purpose database)? Of course, timescale is cheaper, but other than that, have you folks compared and contrasted it against single-store as a TSDB?

PS https://www.timescale.com/papers/timescaledb.pdf is 404

TimescaleDB performs quite well. One of our unique insights is that it is quite possible to build a best-in-class time-series database on top of Postgres (although it’s not easy ;-)

Here is one benchmark: https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...

There are some challenges with building on Postgres - but what we’ve been able to do is build innovative capabilities that overcome these challenges (e.g., columnar compression in a row-oriented store, multi-node scale-out).

We also have some exciting things that we are announcing this week. Stay tuned :-)

PS - Where did you find that PDF? Thought we took it down (it was hard to keep it up to date :-) )


Re: paper: I stumbled upon it when going through other timescaledb threads on news.yc, specifically here, https://news.ycombinator.com/item?id=13943939 (5 yrs ago)

I had a serious case of deja vu reminding me of your article on compression in timescaledb :-D

Thanks for reading that article :-)

> Often an operational analytics DB (Clickhouse, CrateDB) might also be a solution

This might be a bit off topic, but speaking of gaps in common observability tooling: is an OLAP database a common go-to for longer-timescale analytics (as in [1])? We're using BigQuery, but on ~600GB of log/event data I start hitting memory limits even with fairly small analytical windows.

In this context I have seen other references to: Sawzall (google), Lingo (google), MapReduce/Pig/Cascading/Scalding. Are people using Spark for this sort of thing now? Perhaps a combined workflow would be ideal: filter/group/extract interesting data in Hadoop/Spark, and then load into OLAP for ad-hoc querying?

[1]: https://danluu.com/metrics-analytics/

> is an OLAP database a common go-to for longer-timescale analytics (as in [1])?

I would not consider Clickhouse or CrateDB "classic" OLAP DBs. I can speak for CrateDB (I work there), that it definitely would be able to handle 600GB and query across it in an ad-hoc manner.

We have users ingesting terabytes of events per day and running aggregations across 100 terabytes.

What kind of hardware requirements would be needed to store and query this much data?

It depends. Just inserting, indexing, storing, and simple querying can be done with little memory (i.e. a 1:500 memory-to-disk ratio, 0.5GB RAM per 1TB disk). Typical production clusters with high query load are in the 1:150 range (i.e. 64GB RAM for 10TB disk).

Otherwise typical general purpose hardware (Standard SSDs, 1:4 vCPU:memory ratios, ...)

Interesting, so that'd be about 1 vCPU and 4GB RAM per 625GB of data. That seems very price efficient. Would something like AWS's EBS be sufficient for this? Would you need one of the higher tiers? Or would you be looking at running this on a box with locally attached storage?

Most CrateDB clusters run on cloud providers' hardware (Azure, AWS, Alibaba). Using EBS (GP2 or now GP3) is also quite common. Due to the indexing / storage engine, gp disks are typically sufficient, and faster disks have little to no advantage.

Wouldn't 0.5GB RAM per 1TB disk be more like 1:2000 memory-disk-ratio? Which is even better!

Sorry, I mixed up the numbers: it's 2GB memory (0.5GB heap) per 1TB disk, so 1:500 is correct.
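
For anyone following the ratio arithmetic in this sub-thread, here is a quick check. It only uses the figures quoted above (2GB or 0.5GB per 1TB, 64GB per 10TB, 1 vCPU per 4GB RAM) and assumes decimal units (1TB = 1000GB); the function name is made up.

```python
def memory_disk_ratio(ram_gb: float, disk_gb: float) -> float:
    """Return N for a 1:N memory-to-disk ratio."""
    return disk_gb / ram_gb


GB_PER_TB = 1000  # assumption: decimal units

total_memory = memory_disk_ratio(2, 1 * GB_PER_TB)    # 2GB memory per 1TB
heap_only = memory_disk_ratio(0.5, 1 * GB_PER_TB)     # 0.5GB heap per 1TB
production = memory_disk_ratio(64, 10 * GB_PER_TB)    # 64GB RAM per 10TB

# With a 1:4 vCPU:memory ratio, each vCPU comes with 4GB RAM.
disk_gb_per_vcpu = 4 * production

print(total_memory)      # 500.0
print(heap_only)         # 2000.0
print(production)        # 156.25
print(disk_gb_per_vcpu)  # 625.0
```

So 1:500 refers to total memory, 1:2000 to heap only, and the "1:150 range" figure works out to about 625GB of disk per vCPU.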

For longer-scale timeseries I still recommend Druid as the go-to, mainly because if you make use of its ahead-of-time aggregations (which you can do for real-time or scale-out batch ingestion) then your ad-hoc queries can execute extremely quickly even over very large datasets.
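
A rough sketch of what ahead-of-time aggregation at ingestion means, in plain Python and not Druid's actual API: raw events are truncated to a time bucket and pre-summed per combination of dimensions, so later queries scan fewer rows. The event layout, the field names, and the 60-second granularity are assumptions for illustration.

```python
from collections import defaultdict


def rollup(events, granularity_s=60):
    """Pre-aggregate raw events into one row per (time bucket, dimensions)."""
    agg = defaultdict(lambda: {"count": 0, "bytes_sum": 0})
    for ts, country, path, nbytes in events:
        bucket = ts - ts % granularity_s  # truncate timestamp to the bucket start
        row = agg[(bucket, country, path)]
        row["count"] += 1
        row["bytes_sum"] += nbytes
    return dict(agg)


# Hypothetical raw events: (unix_ts, country, path, bytes)
events = [
    (1000, "US", "/home", 120),
    (1010, "US", "/home", 80),
    (1015, "DE", "/home", 50),
    (1070, "US", "/home", 200),
]
rolled = rollup(events)

# An ad-hoc query now runs over the pre-aggregated rows.
us_bytes = sum(
    row["bytes_sum"]
    for (bucket, country, path), row in rolled.items()
    if country == "US"
)

print(len(events), len(rolled))  # 4 3
print(us_bytes)                  # 400
```

The trade-off is that anything not kept as a dimension or metric at ingestion time is gone, and very high-cardinality dimensions reduce how much the rollup saves.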

Druid only really has 1 downside, which is it's still a bit of a pain to set up. It's gotten a ton better in recent times, and I have been contributing changes to make it work better out of the box with common big data tooling like Avro.

For performance it's the top dog except for really naive queries that are dominated by scan performance. For those you are best off with Clickhouse; its vectorized query engine is extremely fast for simpler/scan-heavy workloads.

We used ClickHouse on about 80TB in a RAID 10 setup. It was extremely fast.

What are the best books out there to learn about Time Series databases? (there are already a million for relational and graph, but haven't seen one for time series). Bonus on how to implement one

CMU did a timeseries series a couple of years ago:


Things have changed a little bit now, but not much.

Excellent, thanks

If you want something like Honeycomb that scales better, then maybe look at Druid.

Last time I checked, Druid was not very good at ad-hoc tasks because it lacked joins and its SQL support was sketchy. How is it now?

Limited JOIN support. SQL is now very good.

JOINs vs no JOINs isn't an adhoc vs not-adhoc thing but more of a schema thing. If you try to jam a star schema into it you aren't going to have a good time. This is true for pretty much all of these more optimised stores. If you have a star schema and want to do those sorts of queries (and performance or cost aren't your #1 driving factors) then the better tool is a traditional data warehouse like BigQuery.

This probably won't be the case forever though, there is significant progress in the Presto/Trino world to enable push-down for stores like Druid which would allow you to source most of your fact table data from other sources and then join into your events/time-orientated data from Druid very efficiently.

Take a look at IronDB too. High-scale distributed implementation.

This is a very narrow definition of a time series DB, really more of a pure metrics store that requires individual time series to be kept separately. I find it odd, as I wouldn't define time series DB, in general, to be what they are. Rather, a time series DB is usually simply some sort of DB that supports a time column as the main index. There's tons that fall in that category (including Postgres, which you should absolutely use unless you have a compelling reason not to) that can easily handle many of the situations the article claims won't work. Druid can handle high-cardinality, for instance.

I couldn't agree more, and I've talked to the folks at TimescaleDB about exactly this issue; it can be hard for folks familiar with the narrow definition to understand how many more usecases a tool like Timescale can fit.

More broadly I think this is an issue with a narrow definition of "time series", aside from the DB angle. When I was doing more forecasting and predictive modeling, I was constantly stymied by "time series" resources only considering univariate time series, where my problems were always extremely multivariate (and also rarely uniformly sampled...).

When I've asked around for other vocabulary here, the options are slim. Panel data can work, but that has more to do with there being both a time and spatial dimension (e.g. group, cohort, user, etc.) than there being multiple metrics observed over time. It's also an unfamiliar term for data scientist folks without a traditional stats background. "Multivariate time series" might be technically correct, but that works much better in the modelling domain than the database domain.

I've found one source of vocabulary to come from research into the temporal relational algebra and temporal databases.

For example: https://citeseerx.ist.psu.edu/viewdoc/download?doi= (This being a not fully settled / exploited area of computational science, there are plenty of alternative interpretations, too.)

The main challenge here is that extending relational algebra ("key tuple maps to row") into time ("key tuple maps to function from time to row") essentially cross-products two different dimensions, creating a much bigger domain to process and reason about.
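
As a toy illustration of "key tuple maps to function from time to row", here is a minimal sketch. It is my own simplification, not taken from the linked paper: each key holds a sorted list of versions, and a row stays valid from its timestamp until the next version. The class and method names are invented.

```python
from bisect import bisect_right


class TemporalRelation:
    """Maps key -> (time -> row) using sorted (valid_from, row) versions per key."""

    def __init__(self):
        self._versions = {}  # key -> ([valid_from, ...], [row, ...])

    def put(self, key, valid_from, row):
        times, rows = self._versions.setdefault(key, ([], []))
        i = bisect_right(times, valid_from)  # keep versions sorted by time
        times.insert(i, valid_from)
        rows.insert(i, row)

    def at(self, key, t):
        """Row in effect for key at time t, or None before the first version."""
        times, rows = self._versions.get(key, ([], []))
        i = bisect_right(times, t)
        return rows[i - 1] if i else None

    def snapshot(self, t):
        """The ordinary (non-temporal) relation as of time t."""
        result = {}
        for key in self._versions:
            row = self.at(key, t)
            if row is not None:
                result[key] = row
        return result


rel = TemporalRelation()
rel.put("alice", 10, {"city": "Paris"})
rel.put("alice", 20, {"city": "Berlin"})
rel.put("bob", 15, {"city": "Oslo"})

print(rel.at("alice", 12))  # {'city': 'Paris'}
print(rel.snapshot(12))     # {'alice': {'city': 'Paris'}}
print(rel.snapshot(25))     # {'alice': {'city': 'Berlin'}, 'bob': {'city': 'Oslo'}}
```

Fixing a time gives back a plain relation, and fixing a key gives back a timeline, which is the "cross-product of two dimensions" in miniature.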

Once you have the theory down, you can express some pretty handy relationships.

For you functional folks, you can consider the TRA as the regular RA lifted into the time monad. Except the composition can also go cross-wise; TRA can also be considered as the narrow timeline lifted into the relational monad. Fun times!

If you click around the site you will find https://www.honeycomb.io/blog/why-observability-requires-dis..., including a Strange Loop video that goes into their datastore. This is likely a factor in how they think and write about this topic.

Can you talk more about why you should use Postgres unless you have a compelling reason not to? I'm currently investigating options for storing high-frequency data. Postgres is looking like a good option.

Postgres has a lot of engineering work put into it to handle all kinds of use cases, plus massive effort by the community building tutorials, videos, wrappers, libraries, etc. If you have a problem with Postgres config, or are trying to get it to do something odd, there is undoubtedly a bunch of Stack Overflow discussions about that thing. For most other databases, the selection of all of those is much thinner.

Additionally, we often believe our application needs feature X, and there is some database tech that purports to excel at X, but the fact is that Postgres is probably able to do X unless you get to some extreme velocity or volume. Furthermore, by going with another database you are almost always giving up features W, Y, and Z that you don't realize you need, which Postgres supports and the "exotic" database you are thinking of does not.

In short, Postgres has amazing breadth and depth in features and support and tooling. Be sure you are ok giving that up!

If you like Postgres, you may want to try TimescaleDB, which is a time-series database built on Postgres (packaged as a Postgres extension). Postgres database + time-series database all in one.

This btw is one of the reasons I love Postgres - its extensibility.

(Disclaimer: TimescaleDB cofounder)

Did you guys used to call it just ScaleDB? I remember talking to some people about something called ScaleDB that was built on top of Postgres back when I was looking for database solutions for another product, this was in 2015. We ended up going with Druid for that.

What do you mean by high-frequency data? 100Hz, 1KHz, 100KHz? For those kinds of use cases many time-series DBs break apart. We have customers storing multiple millions of high-frequency measurements per sec in arrays.

I would say Postgres is not very storage-efficient in itself for large amounts of data, especially if you need any sort of indexes. Timescale basically mitigates that by automatically creating new tables in the background ("chunks") and keeping individual tables small.
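
The idea of chunking can be sketched in a few lines. This is an in-memory toy and not how TimescaleDB is implemented: rows are routed to a per-interval bucket by timestamp, and a time-range query only looks at buckets that overlap the range. The class name and the interval of 100 are assumptions.

```python
from collections import defaultdict


class ChunkedTable:
    """Toy time-partitioned table: one small 'chunk' per time interval."""

    def __init__(self, interval):
        self.interval = interval
        self.chunks = defaultdict(list)  # chunk start -> [(ts, value), ...]

    def insert(self, ts, value):
        start = ts - ts % self.interval  # chunk is chosen by timestamp alone
        self.chunks[start].append((ts, value))

    def query(self, start, end):
        """Rows with start <= ts < end, skipping chunks outside the range."""
        out = []
        for chunk_start, rows in self.chunks.items():
            if chunk_start + self.interval <= start or chunk_start >= end:
                continue  # chunk cannot contain matching rows
            out.extend(r for r in rows if start <= r[0] < end)
        return sorted(out)


table = ChunkedTable(interval=100)  # hypothetical interval, arbitrary time units
for ts, value in [(5, "a"), (50, "b"), (150, "c"), (250, "d")]:
    table.insert(ts, value)

print(sorted(table.chunks))  # [0, 100, 200]
print(table.query(40, 160))  # [(50, 'b'), (150, 'c')]
```

Because recent data lands in a small number of recent chunks, the per-chunk indexes stay small even when the whole table is large.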

TimescaleDB also implements compression. From the docs:

When compression is enabled, TimescaleDB converts data stored in many rows into an array. This means that instead of using lots of rows to store the data, it stores the same data in a single row. Because a single row takes up less disk space than many rows, it decreases the amount of disk space required, and can also speed up some queries.

(Timescale employee)
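
To picture the "many rows into an array" idea from the quoted docs, here is a hedged sketch. It only shows the general shape (one row holding per-column arrays) plus a simple delta encoding as one example of what the array form makes possible; it is not TimescaleDB's actual storage format or its compression algorithms, and the function names are invented.

```python
def rows_to_batch(rows):
    """Collapse many (ts, device, value) rows into one row of per-column arrays."""
    batch = {"ts": [], "device": [], "value": []}
    for ts, device, value in rows:
        batch["ts"].append(ts)
        batch["device"].append(device)
        batch["value"].append(value)
    return batch


def delta_encode(xs):
    """Store the first value, then differences between consecutive values."""
    if not xs:
        return []
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]


def delta_decode(deltas):
    """Inverse of delta_encode: rebuild values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


rows = [(1000, "a", 1.5), (1010, "a", 1.6), (1020, "a", 1.4)]
batch = rows_to_batch(rows)
encoded_ts = delta_encode(batch["ts"])

print(batch["ts"])  # [1000, 1010, 1020]
print(encoded_ts)   # [1000, 10, 10]
```

Regularly sampled timestamps turn into small, repetitive deltas, which is the kind of pattern that compresses well once values of the same column sit next to each other.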

Generally in the 100Hz or 200Hz range for the time being. What do you mean by break apart?

Not being able to keep up with the incoming data. But 100-200Hz I'd consider fine for most

We tried to use Postgres with the TimescaleDB plugin for high-frequency data several TB in size. It was unusable. We switched to Clickhouse, which was roughly 50-100 times faster on the same hardware and used 10 times less disk space. They use very different storage engines with different functionality, so check the docs to see what fits your use case.

Cardinality is only expensive if the metric dimensions/tags/labels are a part of the builtin metric data model (like for example in InfluxDB).

It's not expensive when dimensions are just columns in the custom table schema, like for example in TimescaleDB or kdb+/q.

Also in the latter case a single database "row" can contain multiple metrics (each modeled as a separate column in a table).
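
A small sketch of the difference described above, using made-up request data. In the tag-based model every (metric, tag set) pair is its own series, so a unique `request_id` tag creates a new series per request and per metric. In the table model the same data is just one row per request, with each metric as a column. The field names are hypothetical.

```python
# Hypothetical wide rows: one row per request, metrics as columns.
requests = [
    {"ts": 1, "request_id": "r1", "latency_ms": 12, "bytes": 300},
    {"ts": 2, "request_id": "r2", "latency_ms": 40, "bytes": 120},
    {"ts": 3, "request_id": "r3", "latency_ms": 7, "bytes": 900},
]
METRICS = ("latency_ms", "bytes")


def to_tagged_points(rows):
    """Built-in tag model: one point per metric, with tags identifying the series."""
    return [
        (metric, frozenset({("request_id", r["request_id"])}), r["ts"], r[metric])
        for r in rows
        for metric in METRICS
    ]


def distinct_series(points):
    """Number of distinct (metric, tag set) series the store has to track."""
    return len({(metric, tags) for metric, tags, _ts, _value in points})


points = to_tagged_points(requests)

print(len(points), distinct_series(points))  # 6 6
print(len(requests))                         # 3
```

In the tagged form every point ends up in its own series, while the wide table holds three plain rows and no per-series bookkeeping.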

This is true, but given this post and the one linked about their custom column store, I don't think Timescale would fit. Timescale is still row-oriented while in memory (compressed chunks are columnar), though that may not be a big deal in practice; the bigger deal might be the ability to arbitrarily add columns. With Timescale you either have to issue DDL to add a column (which can be expensive and doesn't work in some cases), or resign yourself to JSONB. My worry with the JSONB is that it'll be expensive to search w/o indices, but if you need indices that goes against the Honeycomb philosophy.

Very curious to hear if folks more deeply familiar with TimescaleDB internals have a different take!

> It's not expensive when dimensions are just columns in the custom table schema, like for example in TimescaleDB or kdb+/q.

kdb+/q (not exactly sure) is the one that is used heavily in finance, correct?

[EDIT] - Yup[0][1]. I'm not sure how many competitors there are but last time I heard about kdb it seemed like they were pretty much the go-to choice outside comp sci land (which is normally preoccupied with influx, prometheus, timescale, etc).

There's probably a reason it's trusted by those other fields.

[0]: https://kx.com/

[1]: https://code.kx.com/home/

kdb is crazy expensive which is why storing your 5-min load average inside of it doesn't make much sense

kdb+/q was originally heavily used in finance/trading, but nowadays also in IoT/IIoT.

The q language itself is pretty much a general-purpose PL, but of course it shines for timeseries processing workloads.

I'm really struggling to understand the marketing from Honeycomb. They have a lot of reputable and smart staff, seemingly making conflicting public statements.

For example, in this post, the author describes how expensive cardinality is for TSDB, but doesn't really address how Honeycomb's column store implementation efficiently stores data with full cardinality anywhere near as cheaply.

Like, I fully appreciate how having traces and structured event logs will improve observability, but this 'actually, Prometheus is expensive too' angle is weird, and their pricing page's 'starts at $100' doesn't really make it easy to compare.

Thinking about this a bit more, there are really two factors you have to balance when setting these systems up: event volume and cardinality.

It feels like Honeycomb basically has you pre-pay for cardinality. Every field you add is equally efficient at storing data regardless of type or possible values (well, probably table compression helps some), and the event size is not a factor in billing(?). Since there's no pre-calculated aggregation, you bear the full cost of every event.

Pointless aside but that's probably the most annoying cookie banner I've seen. Your options are to accept the cookies or read their privacy policy, which is also covered by the cookie banner.

These articles are so good. Very informative for a product engineer such as myself. Allows me to create a much clearer and more expansive mental model of the running systems.

I wouldn't take this as the one truth; they are describing their database more than they are describing all timeseries databases.

Just like MySQL and Postgres can behave wildly differently depending on use, so can time-series DBs. There is no one truth here.

It’s hard for me to rate the content. I couldn’t get past the sketchy dark pattern of a site that tracks me and sells my data for marketing with an “accept all cookies” button, while making it much harder to find the process for opting out. At that point it’s easier for me to just opt out of using the site.

It’s too bad because I’d love to read some good technical articles, and this sounds like an interesting topic.

Agreed. I open nearly every HN article from an unknown domain in incognito/private mode. At least half implement this kind of dark pattern.

I find it spectacularly baffling for this kind of content-marketing website, where income presumably does not come from advertising, nor trading in visitor data, but from their DBaaS. Why would such a company risk alienating their visitors like this? It makes me question the decision-making in their unrelated, primary service; their conscientiousness; even their security. Truly mystifying.

I think there's a universe where there is no overlap between engineers/PMs who work on the actual DB product and website/content marketing site.

The marketing space is filled with all kinds of "plug and play" SaaS providers which offer detailed customer journey data and sometimes it's just straight up easier to add an "accept all" consent banner than to try and allow for hot loading specific 3rd party libraries based on customized consent options.

Is it the right thing to do? In my opinion, no. But I can also understand a situation where decisions were made on marketing tech before understanding the technical privacy implications. And then the implementation is handled by a team (potentially much smaller) that does not work on the actual product.

Sure. I believe I understand your explanation and appreciate it. It's possible for their core service to be rock solid and their marketing side to be of lesser priority and so not to receive the thought and resources of their core service. It makes sense.

Still, were I to be responsible for evaluating competing services, such choices would definitely be a ding. Not unrecoverable, but it would make me wonder unnecessarily about their corporate culture and customer care. All things being equal, I think it would be wise to go with the service that didn't do that.

> sells my data for marketing with an “accept all cookies” button

Most likely just poorly implemented GDPR.

I used to wear the DPO hat, so this comes from a vantage point of having been in the weeds: ignorance is no excuse for incompetence.

"Doing GDPR" badly because you don't know any better? Then don't even try to walk that tightrope. Don't track. At all. If you rely on third parties who don't give you the option to do it right, then dump those third parties.

+1 Read their other articles on the blog as well. Very informative.

There is a lot of development work left to do to optimize time-series database designs. They will get more useful as that work gets done.
