Absolutely love the story. TimescaleDB & InfluxDB have had a lot of posts on HN, so I'm sure others are wondering - how do we compare QuestDB to them? It sounds like performance is a big one, but I'm curious to hear your take on it.
As you said, performance is the main differentiator. We are orders of magnitude faster than TimescaleDB and InfluxDB on both data ingestion and querying.
TimescaleDB relies on Postgres and has great SQL support. This is not the case for InfluxDB, and this is where QuestDB shines: we do not plan to move away from SQL, and we are dedicated to providing good support and some enhancements so that the query language is as flexible and efficient as possible for our users.
Hi, TimescaleDB cofounder here. Nice to read about your journey in time-series data, and we always welcome another database that can satisfy a specific type of developer need.
I also commend you on your desire to rebuild everything Postgres offers from scratch. We took a different route by building on top of Postgres (which e.g. allowed us to launch with native replication, rock-solid reliability, window functions, geospatial data, etc., without sacrificing performance). But there are many ways up this mountain!
As a quick thing, however: While it’s not very representative of the workloads we typically see, I tried your simple 1B scan on a minimally-configured hypertable in TimescaleDB/PostgreSQL, and got results that were >12x faster on my 8-core laptop than what you were reporting on a 48-core AWS m5.metal instance.
I think the Hacker News community always appreciates transparency in benchmarking; looking forward to reading a follow-up post where you share reproducible benchmarks in which all databases are tuned equivalently.
I'm sure many folks would be really interested to see two things:
1. A blog post around a reproducible benchmark between QuestDB, TimescaleDB, and InfluxDB
2. A page, like questdb.io/quest-vs-timescale, that details the differences in side-by-side feature comparisons, kind of like this page: https://www.scylladb.com/lp/scylla-vs-cassandra/. Understandably, in the early days, this page will update frequently, but that level of transparency is really helpful to build trust with your users. Additionally, it'll help your less technical users to understand the differences, and it will be a sharable link for people to convince others & management that QuestDB is a good investment.
Perhaps the QuestDB team could add it to the Time Series Benchmarking Suite [1]? It currently supports benchmarking 9 databases including TimescaleDB and InfluxDB.
Wow! Nice. I am surprised neither Scylla nor KairosDB is on that list. I think you could run Scylla by itself (to compare with raw Cassandra) and also re-run with KairosDB running on top of Scylla and Cassandra to see what effects that has on performance. (Though of course, there are advantages to having KairosDB, too.)
Over-the-network streaming is not yet available. Someone mentioned Kafka support; how useful would it be to stream processed (aggregated) values and/or actual table changes?
Hi there - co-founder of QuestDB here. The demo on our website hosts a 1.6-billion-row NYC taxi dataset, along with 10 years of weather data at around 30-minute resolution and weekly gas prices over the last decade.
We've got example queries in the demo, and you can see the execution times there.
(Hard to draw many meaningful conclusions from a single, extremely simple query without much explanation?)
The graph shows PostgreSQL taking a long time, but doesn't say anything about configuration or parallelization. PostgreSQL should have been able to parallelize that type of query since 9.6, but I think they didn't use parallelization in these experiments with PostgreSQL, even though they used a bunch of parallel threads with QuestDB?
So it would be good to know:
- What version of Postgres
- How many parallel workers for this query
- Whether the query was JIT-compiled
- Whether the cache was pre-warmed in PostgreSQL and configured to hold the data fully in memory (the QuestDB benchmarks appeared to run two passes, the first to mmap the data into memory, and to count only the second pass over in-memory data).
etc
Database benchmarking is pretty complex (and easy to bias), and most queries do not look like this toy one.
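To make the warm-up point concrete, here is a minimal Python sketch of a timing harness that discards one or more warm-up passes and reports only the later passes, so every database is measured under the same cache conditions. The function names (`time_query`, `summarize`) and the stand-in workload are hypothetical and not taken from either benchmark; in practice `run_query` would wrap a real database call.

```python
import statistics
import time


def time_query(run_query, warmup_passes=1, measured_passes=5):
    """Run `run_query` a few times and return timings for the measured passes only.

    Warm-up passes are executed but not timed, so caches (OS page cache,
    buffer pool, mmap'd files) are populated before measurement starts.
    """
    for _ in range(warmup_passes):
        run_query()

    timings = []
    for _ in range(measured_passes):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Report min/median/max so a single lucky or unlucky run does not dominate."""
    return {
        "min_s": min(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }


if __name__ == "__main__":
    # Stand-in workload (an in-memory sum), NOT a real database query.
    data = list(range(1_000_000))
    print(summarize(time_query(lambda: sum(data))))
```

Reporting both cold (first-pass) and warm numbers for every system would be a reasonable extension; the point is only that the same policy applies to each database being compared.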
I agree that our blog post lacks details; here are some:
- PostgreSQL 12
- 12 parallel workers
- No JIT
- We ran the test using the pg_prewarm [0] module; the difference was negligible
Regarding the "toy" query, we are showcasing it instead of more complex queries because it is a simple, easily reproducible benchmark. It provides a point of reference for performance figures.
> Database benchmarking is pretty complex (and easy to bias), and most queries do not look like this toy one.
I would say that benchmarking is very hard. We tried to avoid a biased benchmark by running something that is not time-series specific and that does not give us an advantage over what Postgres should be able to do.
The takeaway from this is that configuration is important and we should expose it. The next benchmark we do will have an associated repository so people can review our config and point out any non-optimal items.
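As a rough illustration of what exposing the configuration could look like, the Python sketch below writes the settings discussed in this thread into a JSON manifest that could sit next to the results in such a repository. The function name `build_manifest` and the key names are hypothetical; the values are the ones stated above (PostgreSQL 12, 12 parallel workers, no JIT, pg_prewarm used), and the timings list is deliberately left empty rather than filled with invented numbers.

```python
import json
import platform


def build_manifest(database, version, settings, timings_s):
    """Bundle the benchmark configuration with its results for publication."""
    return {
        "database": database,
        "version": version,
        # Sorted so that diffs between runs stay readable.
        "settings": dict(sorted(settings.items())),
        "host": {
            "machine": platform.machine(),
            "python": platform.python_version(),
        },
        "timings_s": list(timings_s),
    }


if __name__ == "__main__":
    manifest = build_manifest(
        database="PostgreSQL",
        version="12",
        # Key names are illustrative, not actual PostgreSQL parameter names.
        settings={"parallel_workers": 12, "jit": False, "prewarmed": True},
        timings_s=[],  # to be filled in with measured values
    )
    print(json.dumps(manifest, indent=2))
```

Publishing a file like this for each database in the comparison would let readers check that all systems were tuned equivalently.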
I'd also be interested in hearing when QuestDB is not a good choice. Are there use cases where TimescaleDB, InfluxDB, ClickHouse or something else is better suited?
That's a hard question to answer because each solution is unique and has its own tradeoffs. Taking a step back, QuestDB is a less mature product than the ones mentioned, so there are many features, integrations, etc. still to build on our side. This reflects how long we have been around and how much capital we have raised compared with those companies, which are much larger.
If you are already using Postgres, then TimescaleDB is a natural fit - not having to deploy and manage a separate service is a real boon. You can also join with non-TimescaleDB tables, so if you need to combine time series data with regular relational data, that's another advantage.