Hacker News new | comments | ask | show | jobs | submit login
PipelineDB v0.9.9 – One More Release Until PipelineDB Is a PostgreSQL Extension (pipelinedb.com)
122 points by Fergi 9 months ago | hide | past | web | favorite | 23 comments

PipelineDB is pretty interesting for time-series data. It takes an approach to processing the data as it comes in, and storing aggregates or pre-aggregates over time series. I haven't followed the latest, but as of a few years ago much of the approach was similar to some research out of UC Berkeley from about 10 years ago. You can find the paper that talks about that work (TelegraphCQ CQ for continuous query) at http://db.csail.mit.edu/madden/html/TCQcidr03.pdf. Definitely an interesting read if you're into technical papers and databases.

Druid also does this, with pre-aggregation of streaming data along predefined dimensions for very fast cube-based analytics. It's not a relational database though and is just now getting a SQL interface through Apache Calcite. http://druid.io/

Imply is a startup with a modern cloud/on-prem distribution of Druid with a built-in visualization and querying tool: https://imply.io/

I didn't know the product at all, at a glance this looks amazing to be for BI/alerting on streaming time series data.

Anyone who wants to chime in on whether this has fit your requirements for time series data processing? Thanks!

If it's a Postgres extension for time-series data, I wonder how it compares to TimescaleDB, which I recently discovered and have been evaluating.

TimescaleDB is an extension that auto-partitions a table by a time column to give you faster queries based on time ranges. This is great for timeseries data (which is just data with time as part of the primary key) and allows for fast ingest with recent time partitions being smaller and in-memory.

PipelineDB is a Postgres build (and soon to be extension) that runs real-time continuous "queries" on incoming data as materialized views. This is similar to the Kafka KSQL or Spark streaming and is effectively taking in a constant stream of data and re-running a query incrementally to give you up-to-date analytics with windowing.

Neither products are full-scale data warehouse solutions. TimescaleDB is currently limited to time-based partitioning, single-node only, and still uses standard row-based tables. PipelineDB discards raw incoming data and only keeps materialized views so you must know your queries beforehand. Both give you the advantage of using full SQL and the rest of the postgres ecosystem to easily join and analyze with other operational data without doing lots of ETL.

Another worthy mention is CitusDB which is another auto-sharding extension to Postgres, mainly focused on scaling horizontally for OLTP scenarios.

>TimescaleDB is currently limited to time-based partitioning

To be clear, TimescaleDB supports partitioning by other dimensions, as long as one dimension is time-based. That is, one partition must be time based, but you can additional dimensions as well (e.g. device_id).

I don't know TimescaleDB but PipelineDB is more about real time constant aggregation. You can do like a rolling sum that's has many inserts a second for a very long time, and the speed of retrieving the sum and inserts remain pretty constant.

Hi there! I'm one of the PipelineDB founders. This description is correct. The unique thing about PipelineDB is that it doesn't store granular data. Once all aggregates are incrementally updated, the raw input rows as discarded and only aggregate output is stored.

This approach dramatically limits disk IO and long-term storage requirements, and enables super high performance in most cases on modest hardware.

PipelineDB has been used in production for nearly four years now and is used by Fortune 100 companies.

So once you make it as an extension, any chance to mix PipelineDB with Citus in one cluster?

My hunch says that it's possible as far as there is some additional computation done with the future aggregate query on the coordinator in Citus.

PPDB looks interesting, but we also need to keep the underlying raw data and multiple clusters require more complex pipeline.

We haven't looked too far into integrations with any existing systems at this point, but if there was significant user demand for it on both ends we'd definitely be open to it.

One thing I will mention here is that we do have plans to add support for persistent streams [0] after version 1.0.0 is released. We've learned a lot over the years about how our users/customers interact with streams in production and persistent streams will be built atop that foundation of understanding.

Please feel free to comment on that issue with your use case, requirements, etc. and we'll see what we can do!

[0] https://github.com/pipelinedb/pipelinedb/issues/1463

Persistent streams are interesting, but we spent years refining our ETL and building it around Citus, that it would be very complicated to separate those two. I will wait for the extension and do some testing.

Pipeline DB would be a great fit, it seems, for 'Speed-layer' part of lambda architecture. While the postgresql underneath would work as a 'batch layer'.

A pair of both would only be missing an in-memory and on disk column-oriented 'analytics' tier (perhaps as separate instance that's being continuously fed from the first two?).

I think I had seen that PostgreSQL also has columnar store extension, wondering if that would work close the remaining gap for the analytics instance.

fdw_cstore! It's a workable columnar store, but honestly if you have "big" data it's only passable. It is much easier to run than other columar stores however.

Powering real-time reporting dashboards is definitely the #1 use case we see for PipelineDB from open source users and customers of our new SaaS product powered by PipelineDB, called Stride (stride.io).

well the continous view, looks also useful for a good materialization technique for some kind of searchable "view"/table. i.e. it could probably be used to built a "cheap" elasticsearch without needing to import data into another system since you could just use triggers to update the continous view.

Had never heard of this either but it does look really interesting.

This would be absolutely perfect for the job I had in Sales Intelligence a few years ago... except we were locked into SQL Server and there was no way the powers that be would ever let us switch over to PostgreSQL.

SQL Server 2017 has in-memory (hekaton) storage engine and columnstore indexes. Combine them both and you can do the same thing with real-time queries over the entire dataset.

What is storage model compared to timescaledb [0]

[0] https://github.com/timescale/timescaledb

The storage engine for PipelineDB is PostgreSQL and the output of continuous SQL queries (continuous views in PipelineDB) is stored in what are essentially incrementally updated, realtime tables. You can think of PipelineDB as very high throughput, incrementally updated materialized views, also.

see: http://docs.pipelinedb.com/continuous-views.html

Correct me if I'm wrong, but PipelineDB gives effective access to data in commit order right?

can't wait for support on RDS!

FWIW, AWS has a whitelist of Postgres extensions you can use in RDS so that'll probably take more time, if it ever happens.

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact