
Postgres as the Substructure for IoT and the Next Wave of Computing - bsg75
https://blog.timescale.com/postgres-accidental-iot-platform-timescaledb-postgresql-time-series-data-7983d28da5af
======
manigandham
Summary without buzzwords:

TimescaleDB is an extension that brings automatic partitioning (which is
common in data warehouses) to Postgres. It is more scalable and reliable than
doing it yourself using normal PG partitioning features. Timescale is similar
to Citus, which can also auto-shard tables, but is more focused on time as the
primary dimension, along with helper functions for easier bucketing and
analysis. Unlike Citus, it's currently limited to a single node.

The biggest benefit is being able to use Postgres with your other operational
data so you can avoid ETL and simplify your stack. It will also greatly
improve insert performance since recent data is in small partitions with
indexes in memory. This will not match the scale and performance of a fully
distributed data warehouse with time as a partitioning key, but it will work
for most general time-series workloads that query a recent time window.

Whether your data comes from IoT devices, blockchain transactions, or some
other source doesn't matter. If you want to try this extension, Aiven makes it
easy: [https://aiven.io/blog/turn-aiven-postgresql-into-time-
series...](https://aiven.io/blog/turn-aiven-postgresql-into-time-series-
powerhouse-with-timescale/)

------
skywhopper
Good article, once it got around to what I think is the point--that Postgres
with the TimescaleDB extension can function as a great time-series database.
The IoT framing put me off kilter for a long way in--"IoT" as a buzzword is
somewhat past its prime (not that the reality of embedded networked computers
in everything has slowed down), and to be fair, the time-series use case is
mainly about IoT sensor data, not really IoT in the general case.

But once I got past that, it was enlightening. I've been very dissatisfied
with the time-series options available out there, so learning about something
built on Postgres, which we already use extensively for OLTP purposes, was
very interesting. Thanks.

------
dhd415
Their basic approach for handling time series data -- partition the inserts by
time range -- is one that works for pretty much any relational database. If
you can treat the data as immutable, which is often the case with time-series
data, it gets even better. The fact that they have an out-of-the-box solution
for it is nice, but I've implemented my own for an application in which a
portion of the workload was time series data and a separate portion was not.
It's all about mechanical sympathy and understanding the performance
characteristics of your chosen datastore.
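
For a concrete flavor of "implementing your own", here is a tiny hypothetical sketch in Python/SQLite that routes inserts to per-month tables, which is the heart of time-range partitioning (all table and column names are made up):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")

def partition_for(ts):
    # One table per month; recent data (and its index) stays small.
    name = f"metrics_{ts.year:04d}_{ts.month:02d}"
    conn.execute(f"CREATE TABLE IF NOT EXISTS {name} (ts TEXT, value REAL)")
    return name

def insert(ts, value):
    # Route the row to its time-range partition at write time.
    conn.execute(f"INSERT INTO {partition_for(ts)} VALUES (?, ?)",
                 (ts.isoformat(), value))

insert(datetime(2018, 3, 31, 23, 59, tzinfo=timezone.utc), 1.5)
insert(datetime(2018, 4, 1, 0, 1, tzinfo=timezone.utc), 2.5)
print(conn.execute("SELECT count(*) FROM metrics_2018_04").fetchone()[0])  # 1
```

Dropping old data then becomes a cheap DROP TABLE on the oldest partition rather than a bulk DELETE.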

------
w8rbt
On the IoT device itself, sqlite would be a good choice due to its simplicity
and the fact that it's small enough to fit and run.

As an aside, I find that I can use sqlite for 90% of the DB needs I have. For
the other 10% (multiple writers, more granular locking, more concurrent
connections, richer data types such as inet), then I use Postgres.
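
A minimal sketch of that on-device role with Python's built-in sqlite3 (names invented for illustration): buffer readings locally, then drain them when the uplink is available.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # on a real device this would be a file
db.execute("CREATE TABLE readings (ts INTEGER, sensor TEXT, value REAL)")
db.executemany("INSERT INTO readings VALUES (?, ?, ?)",
               [(1, "temp", 21.5), (2, "temp", 21.7)])

# "Phone home": read everything out in order, then clear the local buffer.
batch = db.execute("SELECT * FROM readings ORDER BY ts").fetchall()
db.execute("DELETE FROM readings")
print(len(batch))  # 2
```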

~~~
hardwaresofton
For the very smallest scale of IoT device, most of them phone home to relay
hubs, and the data gets passed to some other machine/bigger processor -- I
think this article is more about running TimescaleDB on the bigger processor
(or maybe even a relay?), not on the actual IoT device itself.

Though TimescaleDB would improve the efficiency of RAM use, I don't think you
want to try to run a database on an IoT device with very limited RAM to begin
with.

------
sly010
I've noticed that abstract ideas that otherwise make no sense have the ability
to survive by simply changing their definition to fit whatever context they
are being used in.

"IoT" is only one of those buzzwords that means different things to everyone.
Free cookie to whoever guesses the other one...

~~~
hardwaresofton
While I am also jaded at the mention of "IoT" these days, I think he tied it
down to something concrete well enough by discussing "time series data".

With TimescaleDB, time-series data is the concrete offering, not some vague
idea of "connectedness" or "the future".

------
akulkarni
Author here. Thanks for sharing our post. It’s one of the many reasons why
we’re so excited about the future of Postgres.

~~~
nickserv
We're looking into Timescale to replace our InfluxDB servers, as we've found
InfluxDB to be resource-hungry; the lack of open source clustering is also a
problem.

The use of PostgreSQL in Timescale is a big selling point: it's something we
know how to manage and scale. A big stumbling block, though, is the retention
of older data.

For now we do mainly service/machine monitoring and event logging without
needing high granularity for older data and keeping disk usage contained.

For this we've found InfluxDB to be well suited, as the retention policy
system is easy to set up and fits our needs perfectly.

Do you plan on introducing similar functionality in Timescale? I've found this
issue on GitHub but don't know if it's something being pursued for the
official product.

[https://github.com/timescale/timescaledb/issues/350](https://github.com/timescale/timescaledb/issues/350)

~~~
mfreed
Hi, thanks for the question about rollups. Timescale indeed enables you to do
rollups, much like the functionality you describe. (Right now these are
scheduled via a crontab, but in-DB scheduling is in the works.)

But unlike InfluxDB (at least according to what users have told us), the
important thing about TimescaleDB's rollups is that they correctly handle late
data. So if you are performing a rollup every 10 minutes, but the data
corresponding to 9:58 actually arrives at 10:05, it's easy to ensure that you
subsequently re-run the rollup and "correct" that previous answer. This works
because we properly support UPSERTs: [http://docs.timescale.com/v0.9/using-
timescaledb/writing-dat...](http://docs.timescale.com/v0.9/using-
timescaledb/writing-data#upsert)

You can also read more about one of our user's experience with rollups
(compared to via Influx) in their recent blog post:
[https://blog.dnsfilter.com/3-billion-time-series-data-
points...](https://blog.dnsfilter.com/3-billion-time-series-data-points-
dnsfilter-replaced-influxdb-with-timescaledb-d9f827702f8b)

As for that feature request in the GitHub issue, it's actually asking for much
subtler and more powerful functionality. In particular, it wants the "rollup"
to be hidden from the user, so that they don't worry about querying the raw
vs. rollup table, and just see a single table view across multiple different
levels of aggregation. Further, the planner would be smart enough to choose
which one based on the time aggregation being requested, as well as
transparently merge data across different aggregation levels.

This functionality is definitely top of mind, but my understanding is that
Influx doesn't support anything like this either: if you have a rollup table
there (say, at 10-minute granularity), you need to explicitly ask for data
from the 10-minute rollup table, and you don't see any data that has been
written in the last 10 minutes. TimescaleDB supports the equivalent today.

~~~
nickserv
Thanks much for your detailed reply. If I understand correctly, the way to do
this is to:

1) Use PostgreSQL's built-in aggregate functions and/or Timescale API
(time_bucket?) to feed rollup tables;

2) Delete the raw, non-aggregated data;

3) Trigger these queries using cron, as needed.

This would be totally OK for us (even if in-DB scheduling is indeed easier),
and as you say, in some ways superior to Influx. I will test it out.

Do you have any tutorials or examples of using rollups in practice? Thanks!

~~~
mfreed
Pretty much, except I'd use our data retention mechanisms instead of #2
(directly deleting the rows), as it's much more efficient.

So to review:

1) Create a "raw" and "rollup" table. Define the rollup table with the right
UNIQUE constraint (say, on timestamp and server_id). Call create_hypertable()
to convert them both to Timescale hypertables.

2) Write a rollup UPSERT query similar to following:

    INSERT INTO rollup
      SELECT time_bucket('10 minutes', time) AS time, server_id,
             sum(value) AS sum, count(value) AS count
      FROM raw
      WHERE time > now() - interval '20 minutes'
      GROUP BY 1, 2
    ON CONFLICT (time, server_id)
      DO UPDATE SET sum = excluded.sum, count = excluded.count;

3) Write a data retention query: SELECT drop_chunks(interval '6 hours',
'raw');

4) Then schedule the rollup query to run every 10 minutes, and the data
retention query to run, e.g., every 3 hours. Note that the rollup query above
will properly handle late data.
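
To see the late-data correction end to end, here's a hypothetical runnable demo of the rollup upsert using SQLite (>= 3.24, which shares Postgres's ON CONFLICT ... DO UPDATE syntax) as a stand-in for Postgres/TimescaleDB; time_bucket is approximated with integer arithmetic and the schema mirrors the recipe:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw (ts INTEGER, server_id INTEGER, value REAL)")
db.execute("""CREATE TABLE rollup (
    bucket INTEGER, server_id INTEGER, sum REAL, count INTEGER,
    UNIQUE (bucket, server_id))""")  # the UNIQUE constraint the upsert needs

ROLLUP_SQL = """
INSERT INTO rollup
  SELECT (ts / 600) * 600 AS bucket, server_id,  -- 10-minute buckets
         sum(value), count(value)
  FROM raw WHERE true            -- the real query restricts to a recent window
  GROUP BY bucket, server_id
ON CONFLICT (bucket, server_id)
  DO UPDATE SET sum = excluded.sum, count = excluded.count
"""

db.execute("INSERT INTO raw VALUES (598, 1, 10.0)")
db.execute(ROLLUP_SQL)           # first sweep: bucket 0 has sum=10, count=1

# A late point for the same bucket arrives after the rollup already ran...
db.execute("INSERT INTO raw VALUES (599, 1, 5.0)")
db.execute(ROLLUP_SQL)           # ...and the upsert corrects the earlier row

print(db.execute("SELECT sum, count FROM rollup").fetchone())  # (15.0, 2)
```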

A more detailed tutorial is in the works (e.g., using triggers for late data
rather than the "20 minute sweep" trick as above). But also happy to help out
in our slack channel:
[http://slack.timescale.com/](http://slack.timescale.com/)

------
RobAtticus
Somewhat related article (re: IoT platforms):
[https://www.slideshare.net/kartben/iot-developer-
survey-2018](https://www.slideshare.net/kartben/iot-developer-survey-2018)

Those of us at TimescaleDB hope to see PostgreSQL move up in the rankings (at
least pass 'Don't Know', yikes!).

------
luord
Well, I love Postgres so reading that I can use it for even more use cases is
pretty cool.

------
hardwaresofton
tl;dr TimescaleDB does batching + partitioning better than postgres
(including postgres 10) can, automatically, and turns postgres into a grade-A
time series database. I'd love to know how TimescaleDB plans to monetize.

That said, I actually _did_ read the article, and it was excellent. I was a
little put off by the "IoT" framing (buzzword fatigue) and the maybe 2% of
over-editorializing (to set the stage for TimescaleDB), but it makes sense
given that the poster is the CTO, with a company to promote, and that the
people who might read the article are varied.

Inside the article is a link to a post on why PG10 doesn't quite stack up[0],
which was also a great read -- the naive engineer in me immediately asked why,
if PG10 offers improved partitioning, I would need TimescaleDB at all.
Obviously PG10 doesn't offer the chunking, it doesn't automate partition
creation the way TimescaleDB does, and it degrades as the number of partitions
grows (which is bound to happen with time series data, of course).

[EDIT] - Ignore the sentence immediately below this edit - write performance
_is_ the graphed metric in the article. Sentence left in for posterity. Turns
out PG10 writes degrade horribly as partition size increases.

They did not graph write speed in the article though, which was a bit
disappointing, so I assume the difference was negligible (in the end, you are
writing with postgres to multiple chunks/partitions, no matter how
efficiently/automatically they were created, so it's probably the same).

I've been itching to use TimescaleDB for a while, really glad to see it get
exposure like this -- I'm just about ready to go all in on it, I already love
Postgres and TimescaleDB means now I can do more awesome things with Postgres.

I'd like to hear the profit model of the company though -- how does
TimescaleDB plan to make money?

[0]: [https://blog.timescale.com/time-series-data-
postgresql-10-vs...](https://blog.timescale.com/time-series-data-
postgresql-10-vs-timescaledb-816ee808bac5)

~~~
RobAtticus
>They did not graph write speed in the article though, which was a bit
disappointing, so I assume the difference was negligible

Did you mean query performance? There are write speed/insert numbers under
"Insert performance measurement graphs"

~~~
hardwaresofton
Oh sorry, I focused in on the number-of-partitions bit and somehow missed that
the metric was actually insert performance.

I'm surprised PG10 tanked so fast...

~~~
anarazel
The partition selection / pruning in 10 was not yet improved over the older
inheritance-based "partitioning" (COPY might already have been optimized back
then, not sure right now). It looks different in v11.

~~~
hardwaresofton
Ahhh I thought the partitioning was NOT inheritance based by default after
reading the partition documentation page in the PG docs[0]. More specifically
the "Implementation using inheritance" section[1] -- I thought the default was
_NOT_ to use table inheritance, but that you could use it as an option.

[0]: [https://www.postgresql.org/docs/current/static/ddl-
partition...](https://www.postgresql.org/docs/current/static/ddl-
partitioning.html)

[1]: [https://www.postgresql.org/docs/current/static/ddl-
partition...](https://www.postgresql.org/docs/current/static/ddl-
partitioning.html#DDL-PARTITIONING-IMPLEMENTATION-INHERITANCE)

~~~
anarazel
You misunderstood me, sorry I should have been clearer. Some internals are
shared between inheritance and declarative partitioning. For v10 most parts of
the system (except for COPY) used the same partition exclusion logic for both
forms of partitioning. Which means it was O(partitions) with a pretty high
constant.

------
baybal2
Storing data on the MCU itself is not a good idea -- tiny flash memory wears
out quickly.

