
PipelineDB 1.0 – High-Performance Time-Series Aggregation for PostgreSQL - Fergi
https://www.pipelinedb.com/blog/pipelinedb-1-0-0-high-performance-time-series-aggregation-for-postgresql
======
manigandham
PipelineDB = Insert data with time component to be aggregated on the fly into
always up-to-date summary tables using a variety of aggregation functions. Raw
data is not persisted.

TimescaleDB = Store data with time component into "hypertable" that is
automatically partitioned by time, for faster queries when limited by time
range. Single node and has helper methods to make time based bucketing and
aggregation easier.

Citus = Store data in distributed tables automatically partitioned and spread
across multiple nodes, by any single column. Join across nodes with non-
distributed tables.

Can definitely use PipelineDB for real-time summaries and TimescaleDB or Citus
for raw long-term storage in the same database.

Side note: It would be nice if Postgres had a package manager for extensions.

~~~
mfreed
Thanks for the great summary, manigandham.

We're actively working on the scale-out version of TimescaleDB that will allow
you to transparently shard hypertables across many servers. Hope to announce
more specifics in the next several months.

~~~
nwmcsween
What is wrong with Citus? Why reimplement it?

~~~
buremba
Use TimescaleDB if you have time-series data, if you want to scale out your
OLTP workload then Citus is what you're looking for.

~~~
mslot
Citus is also used for large time-series / analytics use cases e.g.
[https://www.citusdata.com/customers/heap](https://www.citusdata.com/customers/heap)

There's a question of what you actually want to do with the time-series data.
If you don't expect to have much data or just want to store it and maybe
occasionally query it, then a single server with partitioning (e.g. through
pg_partman, Timescale) might be enough. If you want to build an analytical
dashboard that needs to remain fast even if you're dealing with many users and
terabytes of data per day, then you probably need Citus.

Citus can load, aggregate and query the data in parallel using all the cores
in the cluster. It also supports Postgres' native partitioning and pg_partman:
[https://www.citusdata.com/blog/2018/01/24/citus-and-pg-partman-creating-a-scalable-time-series-database-on-PostgreSQL/](https://www.citusdata.com/blog/2018/01/24/citus-and-pg-partman-creating-a-scalable-time-series-database-on-PostgreSQL/)

~~~
buremba
AFAIK Heap uses Citus but also has an internal partitioning scheduler for
their customer event data, so I don't think they're a good example.
Timescale doesn't support scaling out yet, but it's on their roadmap, so let's
wait for them to implement it before drawing a fair conclusion.

If you're going to create roll-up tables and power your dashboard using those
tables, you're fine with both options IMO. Cloudflare was also using Citus for
exactly this use case before they switched to ClickHouse.

If you have ad-hoc use cases for time-series data, Timescale might be a better
option because it's built exactly for this use case: it knows the semantics
of the data, so it can partition the data in an optimized way and perform
optimizations such as parallelized operations and chunk resizing. In that
sense, it's comparable to InfluxDB, not Citus.

------
allan_s
Has anyone tried to mix PipelineDB with Timescale[1]? I think the two are
working on different sides of the time-series problem.

[1] [https://www.timescale.com/how-it-works](https://www.timescale.com/how-it-works)

~~~
qaq
theoretically sounds like a perfect match

------
chucky_z
I've been following Pipeline since the beginning and it's so fricking cool.
Please, if you can't think of a good use of Pipeline, use it instead of a
count(*)! :D

~~~
Fergi
(Jeff, PipelineDB Co-Founder here) - thanks, Chucky! We appreciate your
support!

------
skunkworker
Interesting, this seems to be the other side of the postgres time series
extension coin.

TimescaleDB for writes, PipelineDB for reads.

~~~
grammr
I'm Derek, one of the co-founders--that's an interesting way to frame it, I
think that makes a lot of sense at a high level.

We're in contact with the TSDB founders (awesome and super smart guys!) and
are in the early stages of figuring out an integration that makes sense.
That's most likely going to happen.

To anyone interested: we'd love to hear and consider your ideas re: TSDB
integration. Feel free to open an issue in either repo (or add to an existing
one) and tell us more!

~~~
scrollaway
Can you guys join forces and convince AWS to make both of those products
available on RDS? :)

~~~
qaq
So basically AWS will monetize something they have spent 0 resources building
and will likely cannibalise the only viable monetization option?

~~~
chatmasta
Surely not, AWS has never done _anything_ like that!

------
dkulchenko
How does this compare to TimescaleDB?

Are they solving the same problem in different ways or are they complementary
projects? If it's the latter, what would that look like?

~~~
akulkarni
(Timescale founder) I'd say they are quite complementary. More here:
[https://news.ycombinator.com/item?id=18298004](https://news.ycombinator.com/item?id=18298004)

------
tracker1
Hoping this gains some traction as a de facto extension for cloud-hosted
PostgreSQL. I think this is probably as useful as plv8 for a lot of use cases.

------
the-alchemist
And it supports Postgres 10.x!

[http://docs.pipelinedb.com/installation.html#install-postgresql](http://docs.pipelinedb.com/installation.html#install-postgresql)

Can't wait for Postgres 11 support.

~~~
Fergi
PostgreSQL 11 support is imminent =)

------
crescentfresh
Looking over this cursorily, looks super cool.

    INSERT INTO events_stream (ts, value) VALUES (now(), '0ef346ac');
> As soon as the continuous view reads new incoming events and the distinct
> count is updated to reflect the new information, the raw events will be
> discarded.

So you create a table, insert into it, and it's always empty. Is that right?

Does this work for any table in pg? How does pg know that the insert should
NOT actually insert a row?

~~~
Fergi
This only applies to continuous views, not all PG tables. Think of continuous
views in PipelineDB as very high throughput, incrementally updated
materialized views. Raw data hits continuous queries in PipelineDB (continuous
views) and only the output of the continuous queries is stored. So 1 billion
events ingested could be distilled down into a single row that incrementally
counts up from 1 => 1 billion as each data point arrives, instead of storing
all of the 1 billion raw data points and counting them up later.
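As a rough sketch of what that looks like in SQL (stream and view names are illustrative, and the exact DDL varies between PipelineDB versions; this follows the 1.0 extension docs):

```sql
-- In PipelineDB 1.0, a stream is a foreign table; writes to it are not stored.
CREATE FOREIGN TABLE events_stream (ts timestamptz, value text)
    SERVER pipelinedb;

-- A continuous view materializes only the aggregate output,
-- updated incrementally as events arrive.
CREATE VIEW event_counts WITH (action=materialize) AS
    SELECT COUNT(*) AS total, COUNT(DISTINCT value) AS uniques
    FROM events_stream;

-- Each insert bumps the aggregate row; the raw event itself is discarded.
INSERT INTO events_stream (ts, value) VALUES (now(), '0ef346ac');
```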

~~~
ohnoesjmr
You can't really do that with distinct, since if you have 1 billion distinct
entries, you essentially have to store all of them to dedupe.

~~~
grammr
This is precisely why PipelineDB has rich support for data structures such as
HyperLogLog [0]. HLLs let you track distinct counts using fixed-size
structures that grow to only about 14KB while encoding unique counts for
billions of distinct values. The tradeoff is a ~0.8% margin of error, which
users generally find acceptable.
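To make the tradeoff concrete, here is a minimal HyperLogLog sketch in Python (not PipelineDB's implementation; precision 14 gives 2^14 registers and roughly the ~0.8% error mentioned above):

```python
import hashlib
import math

P = 14                          # precision: 2**14 = 16384 registers, ~0.8% error
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)  # bias-correction constant for large M

def hash64(item):
    # 64-bit hash of the item's string form
    return int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")

class HLL:
    def __init__(self):
        self.registers = [0] * M

    def add(self, item):
        h = hash64(item)
        idx = h >> (64 - P)                       # first P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)          # remaining 50 bits
        rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Lossless combine of two sketches: element-wise max of registers.
        self.registers = [max(a, b)
                          for a, b in zip(self.registers, other.registers)]

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = ALPHA * M * M / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * M and zeros:         # small-range correction
            estimate = M * math.log(M / zeros)
        return int(estimate)
```

Note that `merge` is exactly the "combine with no loss of information" property described below: two HLLs built from different rows can be folded together and still estimate the union's cardinality.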

Furthermore, PipelineDB has a special combine [1] aggregate that allows you to
combine data structures such as HLL across multiple rows with no loss of
information. A simpler example would be average: to get the _actual_ average
of multiple averages you obviously can't simply take the average of all the
averages. Their weights must be taken into account, and combine handles that.

The capability to combine aggregate values in this way generalizes to all
aggregates in PipelineDB.
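The average case can be shown in a few lines of Python (a toy illustration of the idea, not PipelineDB's API; the partial state is a (sum, count) pair):

```python
# Each partial aggregate carries its transition state (sum, count),
# not just the finalized average.
partials = [(10.0, 2), (300.0, 100)]        # averages 5.0 and 3.0

# The naive average-of-averages ignores the weights and is wrong:
naive = (10.0 / 2 + 300.0 / 100) / 2        # 4.0

# Combining the states weights each partial correctly:
total_sum = sum(s for s, _ in partials)     # 310.0
total_count = sum(c for _, c in partials)   # 102
combined = total_sum / total_count          # ~3.04, the true average
```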

[0] [http://docs.pipelinedb.com/aggregates.html#hyperloglog-aggregates](http://docs.pipelinedb.com/aggregates.html#hyperloglog-aggregates)

[1]
[http://docs.pipelinedb.com/aggregates.html#combine](http://docs.pipelinedb.com/aggregates.html#combine)

------
usgroup
Fantastic guys, thank you! I’ve been looking forward to it becoming an
extension for half a year. This is great news.

This basically means Postgres now has continuous views and a toolbox of
functions for running calculations. Combined with PG11's partitioning features
and better parallel query execution, PG is an even more formidable choice for
medium-sized data.

------
Arqu
I work closely in the space of providing time series databases as managed
solutions. I can say that I am very happy to see this recent development of
new tsd's and this with timescale is a huge bump to the industry/segment.
Everybody currently measures some analytics and mostly user data and there is
so much abuse with it, yet there is so much more you can measure and do and it
is still very early stage. Farms, industrial applications, IoT and so much
more. I'd love to just measure temperature and wind speed at unprecedented
resolution.

------
ishikawa
Very interesting. Does it aggregate per day? If so, I wonder how it handles
time zones, i.e. when to start a new day when you have agents in different
time zones.

~~~
grammr
How aggregations are performed is determined entirely by your own continuous
view definitions [0]. In this case I'm guessing you'd want to include a time-
based column in the aggregation's GROUP BY clause.

And since PipelineDB is a PostgreSQL extension, you can use the timestamptz
type (which includes timezone support), and in general you could pretty easily
normalize your event timezones in your continuous view definitions. When
you're reading aggregate data back out, you could cast the time-based column
to whatever timezone the client prefers.

Thanks for the question--I hope that was helpful!

[0] [http://docs.pipelinedb.com/continuous-views.html](http://docs.pipelinedb.com/continuous-views.html)
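As a sketch of the timezone normalization described above (column and view names are hypothetical, and the DDL follows the PipelineDB 1.0 docs; exact syntax may differ by version):

```sql
-- Bucket events by day in a fixed timezone inside the continuous view.
CREATE VIEW daily_counts WITH (action=materialize) AS
    SELECT date_trunc('day', ts AT TIME ZONE 'UTC') AS day,
           COUNT(*) AS events
    FROM events_stream
    GROUP BY day;

-- Readers can shift buckets to their preferred zone on the way out:
SELECT day AT TIME ZONE 'America/New_York' AS local_day, events
FROM daily_counts;
```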

~~~
ishikawa
Thank you, yes, it was helpful for me to understand the possibilities. I'll
dig more into that.

------
Rapzid
Running functions on top of the transaction log (in transaction order) is a
really powerful thing.

------
alakin
Is most of the intermediate processing done in memory, or is it limited by
disk write speed?

~~~
grammr
I'm Derek, one of the co-founders--excellent question!

The former. PipelineDB performs aggregations in memory on microbatches of
events, and only merges the aggregate _output_ of each microbatch with what's
on disk. This is really the core idea behind why PipelineDB is so performant
for continuous time-series aggregation. Microbatch size is configurable:
[http://docs.pipelinedb.com/conf.html](http://docs.pipelinedb.com/conf.html).
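A rough sketch of the microbatch idea in Python (an illustration of the concept, not PipelineDB internals): aggregate each batch entirely in memory, then merge only the aggregate output into durable state.

```python
from collections import Counter

MICROBATCH_SIZE = 4   # configurable, analogous to PipelineDB's batch settings

on_disk = Counter()   # stands in for the materialized aggregate table

def flush(batch):
    # Aggregate the microbatch entirely in memory...
    partial = Counter(batch)
    # ...then merge only the aggregate output with the on-disk state;
    # the raw events in the batch are never written out.
    on_disk.update(partial)

def process_stream(events):
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == MICROBATCH_SIZE:
            flush(batch)
            batch = []
    if batch:
        flush(batch)

process_stream(["a", "b", "a", "c", "a", "b"])
# on_disk == Counter({'a': 3, 'b': 2, 'c': 1})
```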

~~~
alakin
That's awesome! If you don't mind, one more question: I see that stream-stream
joins are not yet supported
([http://docs.pipelinedb.com/joins.html#stream-stream-joins](http://docs.pipelinedb.com/joins.html#stream-stream-joins)).
Can you comment on when you think this feature could land, or is it still a
ways off?

~~~
Fergi
Just out of curiosity, do you have a specific use case that necessitates
stream-stream JOINs, or were you just exploring the docs and wondering about
this?

~~~
alakin
My use case is pretty much parallel time series alignment with several layers
of aggregation. I guess I perceive stream-stream joins as an _easy_ way for me
to wrap my head around how to structure my compute graph, but it seems doable
with the method mentioned by @grammr. I'd hope for an interface roughly like
"CREATE join_stream from (SELECT slow_str.key AS key, sum(slow_str.val,
fast_str.val) AS val FROM slow_str, fast_str INNER JOIN ON slow_str.key =
fast_str.key)". I do realize there are some tough design decisions for a
system like this, but I'd also like to drop my wacky zmq infrastructure ;)

------
jadbox
How does this compare to Citus?

~~~
nwmcsween
They are completely different products? Citus deals with scaling; PipelineDB
deals with continuous queries.

~~~
Mayzie
Citus advertises itself as an excellent way to achieve real-time analytics
across billions of rows and tonnes of data, which this product also does.

How both products achieve this, however, is different.

------
temuze
Congrats!

Also, how's stride.io doing?

~~~
grammr
I'm Derek, one of the co-founders--thank you!

We're super happy with where Stride is at! We've continued to onboard
customers in a few AWS regions and the infrastructure is rock solid at this
point. Most users are ingesting 10k+ events/s and their analytics frontends
are retrieving results in well under 100ms. We've gotten it to the point where
it "just works" which has made Stride users' lives a lot easier at that scale.

And since the hard parts of Stride are powered by PipelineDB, an added benefit
for us is that we now get a ton of super detailed instrumentation data about
PipelineDB performance and behavior, which has helped make the open-source
product quite a bit better.

We'll be moving Stride into self-service/GA next year--stay tuned!

------
tnolet
Big question for me: does it work on Heroku postgres?

~~~
Fergi
No, not currently; however, we offer hosted deployments of PipelineDB as well
as a SaaS product called Stride (stride.io), which is based on PipelineDB.

