> Building a distributed time-series database on PostgreSQL
Next order of business: Making mud pies.
PostgreSQL is geared towards transactional work. With time series, you basically just append data occasionally, and do analytics. PostgreSQL is terrible for analytics - its architecture is all wrong. 2 or 3 orders of magnitude slower than the state of the art if not more.
Which databases are good for analytics from your point of view?
In my experience, being able to do advanced ad-hoc SQL queries is priceless for analytics. Timescale helps in scaling time series use-cases that used to scale badly in plain PostgreSQL.
There are other relational databases like MemSQL or Clickhouse that use distributed column-oriented architectures that are much better at large scale analytics and aggregations.
Postgres is getting pluggable storage engines in the next version (and already has foreign data wrappers) so that can at least lead to a better storage design.
It's not just my point of view - it's well known in the research community, and has been for decades.
For FOSS, have a look at MonetDB. For research-oriented systems, look for publications regarding HyperDB or VectorWise/Actian Vector (VectorH in the cluster version). Other commercial offerings are Vertica (formerly C-Store) and SAP Hana.
PostgreSQL is not even something anyone compares against in analytics...
Oh yea, MemSQL and ClickHouse are also indeed relevant and in this category, except that ClickHouse doesn't support all of SQL and any table structure, so it's not a full-fledged DBMS.
If you take a look at any of our benchmarks, you’ll see that this is not the case. PostgreSQL in fact can scale quite well for time-series analytics, if architected correctly.
But why don’t you just try out TimescaleDB and see for yourself?
Please link to those benchmarks, and we'll see. Also, a link to the relevant SIGMOD/VLDB/ICDE/DaMoN/ADMS/etc. submission arguing in favor of TimeScaleDB's design would also be appreciated.
On the linked-to article I only see references to irrelevant transactional DBMSes...
Yes, it's just like I thought. You're comparing against transaction-oriented DBMSes, or ones which handle documents rather than tabular data (and hence slow on tabular data).
One possible exception is InfluxDB - I'm not familiar enough with it.
Anyway, try running TSBS on columnar DBMSes like Actian VectorH, Vertica, SAP HANA etc. ClickHouse may also be relevant; they don't support any possible schema, but it may be enough to run TSBS.
We're happy to take pull requests for new databases, we have so far from Clickhouse, CrateDB, and SiriDB (and one pending). We've tried to make it relatively easy for new databases to hook in.
We usually implement ones that we hear about a lot from customers, and so far those haven't come up a ton. We'll keep it in mind though as we look to keep adding new ones.
Thank you very much for your valuable input - it is very important that people understand the differences and look into this!
Performance comparisons to the candidates you named would be very interesting to see.
Downvoters: you should be happy that people with more knowledge than the average javascript-aws-webdevops-guy that is needed to operate a startup invest time to inform you about alternatives you might not know about.
Also it is important to keep this site attractive to people that have a different opinions and experiences - do not do that trump thing! Thanks!
Of course, for each claim replicable facts are needed.
I am curious about that too.
As a separate topic, if the operational dbs can be compatible with parquet type storage (backup and restore), the offline analytics and machine learning would be seamlessly integrated together. Offline analytics usually can simplify online analytics. Discovering new dimensions, normalization/denormalization, and optimization of indices and partitions. Operational dbs shouldn't have to stress themselves at the gunpoint.
I think the ask was for comparisons against traditional analytics databases (redshift, Vertica, etc.). Columnstores are substantially faster for table scans + aggregations then rowstores (and they use a lot less storage) [1].
Next order of business: Making mud pies.
PostgreSQL is geared towards transactional work. With time series, you basically just append data occasionally, and do analytics. PostgreSQL is terrible for analytics - its architecture is all wrong. 2 or 3 orders of magnitude slower than the state of the art if not more.