For the 3 caveats at the top, there are already two time-series solutions that look promising (QuestDB, TimescaleDB). An operational analytics DB (ClickHouse, CrateDB) can often be a solution as well.
Thanks for the mention, and I completely agree :-)
Personally, I think a lot in this article is misguided.
For example, it essentially defines "time-series database" as "metric store." As TimescaleDB users know, TimescaleDB handles a lot more than just metrics. In fact, we handle any of the data types that Postgres can handle, which I suspect is more than what Honeycomb's custom store supports.
TSDBs are good at what they do, but high cardinality is not built into the design. The wrong tag (or simply having too many tags) leads to a combinatorial explosion of storage requirements.
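To make the combinatorial point concrete, here's a back-of-the-envelope sketch in Python (the tag names and counts are illustrative assumptions, not anyone's real workload):

    # In a metrics-style TSDB, each unique combination of tag values
    # becomes its own series, so the series count is roughly the
    # product of per-tag cardinalities. Illustrative numbers only.
    from math import prod

    tags = {
        "host": 500,        # hosts
        "endpoint": 200,    # API endpoints
        "status_code": 10,  # observed HTTP statuses
    }
    print(f"potential series: {prod(tags.values()):,}")  # 1,000,000

    # One high-cardinality tag multiplies everything:
    tags["user_id"] = 100_000
    print(f"with user_id: {prod(tags.values()):,}")  # 100,000,000,000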
In contrast, our distributed column store optimizes for storing raw, high-cardinality data from which you can derive contextual traces. This design is flexible and performant enough that we can support metrics and tracing using the same backend. The same cannot be said of time series databases, though, which are hyper-specialized for a specific type of data.
I do commend the folks at Honeycomb for having a good product loved by some of my colleagues (at other companies). I also commend them for attempting to write an article aimed at educating. But I wish they had done more research, because without it this article (IMO) ends up confusing more than educating.
For anyone curious about our definition of "time-series data" and "time-series databases": https://blog.timescale.com/blog/what-the-heck-is-time-series...
PS https://www.timescale.com/papers/timescaledb.pdf is 404
Here is one benchmark: https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...
There are some challenges with building on Postgres, but what we've been able to do is build innovative capabilities that overcome them (e.g., columnar compression in a row-oriented store, multi-node scale-out).
We also have some exciting things that we are announcing this week. Stay tuned :-)
PS - Where did you find that PDF? Thought we took it down (it was hard to keep it up to date :-) )
Re: paper: I stumbled upon it when going through other timescaledb threads on news.yc, specifically here, https://news.ycombinator.com/item?id=13943939 (5 yrs ago)
This might be a bit off topic, but speaking of gaps in common observability tooling: is an OLAP database a common go-to for longer-timescale analytics? We're using BigQuery, but on ~600GB of log/event data I start hitting memory limits even with fairly small analytical windows.
In this context I have seen references to Sawzall (Google), Lingo (Google), and MapReduce/Pig/Cascading/Scalding. Are people using Spark for this sort of thing now? Perhaps a combined workflow would be ideal: filter/group/extract interesting data in Hadoop/Spark, and then load into OLAP for ad-hoc querying?
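Something like this hypothetical PySpark sketch, say (the paths, column names, and aggregation are all made up for illustration):

    # Pre-filter and aggregate raw events in Spark, then hand the much
    # smaller rollup to an OLAP store for ad-hoc querying.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("event-rollup").getOrCreate()

    events = spark.read.json("s3://my-bucket/raw-events/")  # hypothetical path

    rollup = (
        events
        .filter(F.col("service") == "checkout")               # extract
        .withColumn("hour", F.date_trunc("hour", "timestamp"))
        .groupBy("hour", "endpoint", "status_code")            # group
        .agg(F.count("*").alias("events"),
             F.avg("duration_ms").alias("avg_duration_ms"))
    )

    # Parquet works as an interchange format; most OLAP stores
    # (ClickHouse, Druid, BigQuery) can ingest it.
    rollup.write.mode("overwrite").parquet("s3://my-bucket/rollups/checkout/")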
I would not consider ClickHouse or CrateDB "classic" OLAP DBs. I can speak for CrateDB (I work there): it definitely would be able to handle 600GB and query across it in an ad-hoc manner.
We have users ingesting terabytes of events per day and running aggregations across 100 terabytes.
Otherwise, typical general-purpose hardware (standard SSDs, 1:4 vCPU:memory ratios, ...).
Druid only really has one downside, which is that it's still a bit of a pain to set up. It's gotten a ton better in recent times, and I have been contributing changes to make it work better out of the box with common big data tooling like Avro.
For performance it's the top dog, except for really naive queries that are dominated by scan performance. For those you are best off with ClickHouse; its vectorized query engine is extremely fast for simpler, scan-heavy workloads.
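As a toy illustration of why vectorized execution wins on scan-heavy work (NumPy standing in for a columnar engine here; this is an analogy, not ClickHouse internals):

    import time
    import numpy as np

    values = np.random.rand(10_000_000)

    t0 = time.perf_counter()
    total = 0.0
    for v in values:          # row-at-a-time, interpreter-style
        total += v
    t_loop = time.perf_counter() - t0

    t0 = time.perf_counter()
    total_vec = values.sum()  # one vectorized pass over the column
    t_vec = time.perf_counter() - t0

    print(f"loop: {t_loop:.2f}s  vectorized: {t_vec:.4f}s")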
Things have changed a little bit now, but not much.
JOINs vs. no JOINs isn't an ad-hoc vs. not-ad-hoc thing but more of a schema thing. If you try to jam a star schema into it you aren't going to have a good time. This is true for pretty much all of these more optimized stores. If you have a star schema and want to do those sorts of queries (and performance or cost aren't your #1 driving factors), then the better tool is a traditional data warehouse like BigQuery.
This probably won't be the case forever, though; there is significant progress in the Presto/Trino world to enable push-down for stores like Druid, which would allow you to source most of your fact-table data from other sources and then join into your events/time-oriented data from Druid very efficiently.
More broadly I think this is an issue with a narrow definition of "time series", aside from the DB angle. When I was doing more forecasting and predictive modeling, I was constantly stymied by "time series" resources only considering univariate time series, where my problems were always extremely multivariate (and also rarely uniformly sampled...).
When I've asked around for other vocabulary here, the options are slim. Panel data can work, but that has more to do with there being both a time and spatial dimension (e.g. group, cohort, user, etc.) than there being multiple metrics observed over time. It's also an unfamiliar term for data science folks without a traditional stats background. "Multivariate time series" might be technically correct, but that works much better in the modeling domain than the database domain.
For example: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47...
(This being a not fully settled / exploited area of computational science, there are plenty of alternative interpretations, too.)
The main challenge here is that extending relational algebra ("key tuple maps to row") into time ("key tuple maps to function from time to row") essentially cross-products two different dimensions, creating a much bigger domain to process and reason about.
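In symbols (my own notation, not from any particular paper), the lifting looks something like:

    % Classical relation: a key tuple maps to a row of values.
    % Temporal relation: a key tuple maps to a function from time to rows.
    \[
      R \colon K \to V
      \qquad\leadsto\qquad
      R_T \colon K \to (T \to V)
    \]
    % Each relational operator then lifts pointwise over T; e.g. the
    % selection's predicate is applied at every instant:
    \[
      \sigma_p(R_T)(k)(t) = R_T(k)(t)
      \quad \text{iff } p\bigl(R_T(k)(t)\bigr)
    \]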
Once you have the theory down, you can express some pretty handy relationships.
For you functional folks, you can consider the TRA as the regular RA lifted into the time monad. Except the composition can also go cross-wise; TRA can also be considered as the narrow timeline lifted into the relational monad. Fun times!
Additionally, we often believe our application needs feature X, and there is some database tech that purports to excel at X; the fact is, Postgres can probably do X unless you get to some extreme velocity or volume. Furthermore, by going with another database you are almost always giving up features W, Y, and Z that you don't realize you need, which Postgres supports and the "exotic" database you are thinking of does not.
In short, Postgres has amazing breadth and depth in features, support, and tooling. Be sure you are OK giving that up!
This btw is one of the reasons I love Postgres - its extensibility.
(Disclaimer: TimescaleDB cofounder)
I would say Postgres is not very storage-efficient in itself for large amounts of data, especially if you need any sort of indexes. Timescale basically mitigates that by automatically creating new tables in the background ("chunks") and keeping individual tables small.
When compression is enabled, TimescaleDB converts data stored in many rows into an array. This means that instead of using lots of rows to store the data, it stores the same data in a single row. Because a single row takes up less disk space than many rows, it decreases the amount of disk space required, and can also speed up some queries.
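A minimal sketch of the row-to-array idea (simplified for illustration; not TimescaleDB's actual on-disk format):

    # Many narrow rows for one series collapse into a single row that
    # holds one array per column; similar adjacent values compress well.
    rows = [
        {"ts": 1, "device": "a", "temp": 20.1},
        {"ts": 2, "device": "a", "temp": 20.3},
        {"ts": 3, "device": "a", "temp": 20.2},
    ]

    compressed = {
        "device": "a",                      # segment-by column stays scalar
        "ts":   [r["ts"] for r in rows],    # per-column array
        "temp": [r["temp"] for r in rows],
    }
    print(compressed)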
It's not expensive when dimensions are just columns in the custom table schema, like for example in TimescaleDB or kdb+/q.
Also in the latter case a single database "row" can contain multiple metrics (each modeled as a separate column in a table).
Very curious to hear if folks more deeply familiar with TimescaleDB internals have a different take!
kdb+/q (not exactly sure) is the one that is used heavily in finance, correct?
[EDIT] - Yup. I'm not sure how many competitors there are but last time I heard about kdb it seemed like they were pretty much the go-to choice outside comp sci land (which is normally preoccupied with influx, prometheus, timescale, etc).
There's probably a reason it's trusted by those other fields.
The q language itself is a pretty much general-purpose PL, but of course it shines for time-series processing workloads.
For example, in this post the author describes how expensive cardinality is for TSDBs, but doesn't really address how Honeycomb's column-store implementation manages to store full-cardinality data anywhere near as cheaply.
Like, I fully appreciate how having traces and structured event logs will improve observability, but this 'actually, Prometheus is expensive too' angle is weird, and their pricing page's 'starts at $100' doesn't really make it easy to compare.
It feels like Honeycomb basically has you pre-pay for cardinality. Every field you add is equally efficient at storing data regardless of type or possible values (well, probably table compression helps some), and the event size is not a factor in billing(?). Since there's no pre-calculated aggregation, you bear the full cost of every event.
Just like MySQL and Postgres can behave wildly differently depending on use, so can time-series DBs; there is no one truth here.
It’s too bad because I’d love to read some good technical articles, and this sounds like an interesting topic.
I find it spectacularly baffling for this kind of content-marketing website, where income presumably comes not from advertising or trading in visitor data, but from their DBaaS. Why would such a company risk alienating their visitors like this? It makes me question the decision-making in their unrelated, primary service; their conscientiousness; even their security. Truly mystifying.
The marketing space is filled with all kinds of "plug and play" SaaS providers which offer detailed customer journey data and sometimes it's just straight up easier to add an "accept all" consent banner than to try and allow for hot loading specific 3rd party libraries based on customized consent options.
Is it the right thing to do? In my opinion, no. But I can also understand a situation where decisions were made on marketing tech before understanding the technical privacy implications. And then the implementation is handled by a team (potentially much smaller) that does not work on the actual product.
Still, were I to be responsible for evaluating competing services, such choices would definitely be a ding. Not unrecoverable, but it would make me wonder unnecessarily about their corporate culture and customer care. All things being equal, I think it would be wise to go with the service that didn't do that.
Most likely just poorly implemented GDPR.
"Doing GDPR" badly because you don't know any better? Then don't even try to walk that tightrope. Don't track. At all. If you use on third parties who don't give you the option to do it right, then dump those third parties.