
PostgreSQL 11 Partitioning Improvements - craigkerstiens
https://pgdash.io/blog/partition-postgres-11.html
======
craigkerstiens
The native time partitioning in Postgres 10 was a great foundation, but
definitely had a few rough edges. With Postgres 10 you mostly needed some
additional extension like pg_partman
([https://www.citusdata.com/blog/2018/01/24/citus-and-pg-
partm...](https://www.citusdata.com/blog/2018/01/24/citus-and-pg-partman-
creating-a-scalable-time-series-database-on-PostgreSQL/)) or timescale to
smooth out the rough edges and make things as usable as you'd expect.

This set of changes in 11 is pretty great and makes it much more usable out of
the box. For those not reading the article, the short list of improvements
(see the sketch after the list):

      - Updates move records across partitions
      - A default/catch-all partition
      - Automatic index creation
      - Foreign keys for partitions
      - Unique indexes
      - Hash partitioning (whereas 10 only had range/list)
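
A minimal sketch of a few of those features in Postgres 11 syntax; the events
and users tables here are just made-up examples:

    -- Range-partitioned table (made-up example).
    CREATE TABLE events (
        id         bigint NOT NULL,
        created_at timestamptz NOT NULL,
        payload    text
    ) PARTITION BY RANGE (created_at);

    -- Default/catch-all partition (new in 11).
    CREATE TABLE events_default PARTITION OF events DEFAULT;

    CREATE TABLE events_2018_10 PARTITION OF events
        FOR VALUES FROM ('2018-10-01') TO ('2018-11-01');

    -- Indexes created on the parent now cascade to every partition (new in 11);
    -- a unique index works as long as it includes the partition key.
    CREATE UNIQUE INDEX ON events (created_at, id);

    -- Hash partitioning (new in 11).
    CREATE TABLE users (
        user_id bigint NOT NULL,
        name    text
    ) PARTITION BY HASH (user_id);

    CREATE TABLE users_p0 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE users_p1 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE users_p2 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE users_p3 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 3);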

~~~
londons_explore
Still missing:

* Dynamic repartitioning to balance query load/space, and to create new partitions when tables get over a certain size. Without this, once you have more than a few hundred TB of data you end up with terrible hotspots where single partitions are much more loaded than others, and there's nothing you can do about it.

* Query replanning: query plans can sometimes be badly wrong in their estimates. By replanning, the database could switch to another query plan once the existing plan turns out to be grossly off.

* Lock eliding instead of *Exclusive locks. Some major operations, such as schema changes, take locks which effectively stop the server from responding to any other requests until the operation is done. Instead, these operations should work on a copy-on-write image of the database, repeatedly re-syncing with recently changed data until the final change needed is small, and then atomically commit the change.

~~~
arghwhat
#1 seems a bit weird. You could theoretically have the DB automatically create
new buckets for a hash partition, but hash partitions shouldn't have hot spots
unless you hash on a bad column, and creating new buckets will require all
existing rows to be moved.

For other types of partitions, partition hotspots would be the fault of the
DBA partitioning on a bad column. The DB could theoretically halve partition
ranges if a range partition grows very large, but there are other things that
are more important.

#2 is a nice performance feature which has been talked about before, but do
any DBs actually have this yet? I suspect that it is quite tricky, and won't
be universally beneficial.

Is #3 really a problem? I certainly wouldn't expect schema changes to perform
well, and increasing complexity just to help in this area would seem silly.

~~~
londons_explore
> hash partitions shouldn't have hot spots unless you hash on a bad column

In the nice world of all data having an even load, sure...

But in the real world, you can easily get a handful of users in your "Users"
table sending millions of requests per second, and in that case, you would
really like to re-partition so that they don't happen to all end up on the
same partition.

Implementation can be as simple as allowing splittable partitions, so that
whenever load/size gets too high, a partition can be split in half, and half
the records moved to a new host. The partition map is only a few kilobytes, so
is globally shared/updated. There are no concurrency issues, because during
the splitting process, either the old or new partition is responsible for each
record, and both the old and new partition hosts can respond to a read or
write query, either themselves or by forwarding it to the other host.

Whenever two neighbouring partitions both see low load/space usage, join them
using the same method in reverse. By joining only neighbouring partitions, you
avoid terrible fragmentation and a blow-up in the size of the global
partition map.
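
For contrast, Postgres 11 has no online split; a manual split of a hot range
partition today looks roughly like the sketch below (made-up names, and it
takes an exclusive lock on the parent for the duration, unlike the online
scheme described above):

    -- Hypothetical manual "split" of a hot range partition in Postgres 11.
    -- There is no SPLIT PARTITION command, so it's detach / recreate / re-route.
    BEGIN;
    ALTER TABLE events DETACH PARTITION events_2018_10;

    CREATE TABLE events_2018_10_a PARTITION OF events
        FOR VALUES FROM ('2018-10-01') TO ('2018-10-16');
    CREATE TABLE events_2018_10_b PARTITION OF events
        FOR VALUES FROM ('2018-10-16') TO ('2018-11-01');

    -- Tuple routing sends each row to the correct new half.
    INSERT INTO events SELECT * FROM events_2018_10;
    DROP TABLE events_2018_10;
    COMMIT;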

~~~
arghwhat
Partitions provide _separation_ , not _load balancing_ , and thus do not
concern themselves with load or size.

While I can imagine a way to make hash partitioning sub-divide buckets, it
won't really be useful, as you can still have a single user blowing their
bucket out of proportion while others are almost empty, which partitioning
can never solve.

What you want is round robin distribution and parallel query execution, not
partitioning. This of course only makes sense if the tables are on different
disk systems/hosts or if all tables fit in memory.

~~~
pgaddict
No, that's not true. When combined with parallelism, the ability to place
partitions on other hosts, and partition-aware algorithms (partition-wise
joins/aggregation etc.), it can be a powerful scalability feature.

~~~
arghwhat
It _is_ true. However, while partitioning is not load balancing, it can be
used to achieve some form of load distribution.

Partitioning primarily deals with isolating related data into small pools so
that queries do not need to touch full tables. The purpose of the buckets is
to group related data so that a single partition may serve a query (instead of
having to touch the entire table), thereby lowering the load from each
individual query, not distributing all queries.

Load balancing deals with distributing all queries to ensure even node load,
not with increasing the efficiency of individual queries (which it does not do
at all). For that, a data distribution that leads to round-robin accesses
across nodes/"partitions" is best. This is the exact opposite of partitioning,
which tries to group all of that data into a single partition.

Hash partitioning on a serial or uuid column might give you round-robin-like
behavior, but it's not really what partitioning is meant for.

~~~
pgaddict
I'm not really sure what your point is. I was not suggesting partitioning
alone is a load-balancing solution, but that partitioning in combination with
other features is a powerful scalability feature. That assumes there is (1) a
way to keep the partitions within some size limit automatically, and (2) a
partitioning scheme that queries can leverage (eliminate most partitions, join
tables partitioned in the same way).

We don't have (1) yet - at least not in core PostgreSQL, but this part should
be doable in an extension until it gets into the core. We do have (2) to a
large extent, although I'm sure there are gaps and room for improvement.

And when combined with the ability to place partitions on other hosts, that
will be another step.

No one is claiming partitioning alone magically solves scalability, but IMHO
it's an important building block in doing that (for a large class of
applications).
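
As a concrete example of (2), Postgres 11 can do partition-wise
joins/aggregation when two tables are partitioned the same way. A rough sketch
with made-up tables (both settings still default to off in 11):

    -- Made-up tables, both hash-partitioned on the same key.
    CREATE TABLE orders   (user_id bigint, order_id bigint) PARTITION BY HASH (user_id);
    CREATE TABLE payments (user_id bigint, order_id bigint, amount numeric) PARTITION BY HASH (user_id);
    CREATE TABLE orders_p0   PARTITION OF orders   FOR VALUES WITH (MODULUS 2, REMAINDER 0);
    CREATE TABLE orders_p1   PARTITION OF orders   FOR VALUES WITH (MODULUS 2, REMAINDER 1);
    CREATE TABLE payments_p0 PARTITION OF payments FOR VALUES WITH (MODULUS 2, REMAINDER 0);
    CREATE TABLE payments_p1 PARTITION OF payments FOR VALUES WITH (MODULUS 2, REMAINDER 1);

    SET enable_partitionwise_join = on;        -- off by default in 11
    SET enable_partitionwise_aggregate = on;

    -- The plan joins orders_pN to payments_pN pair by pair instead of joining
    -- the two whole tables, and those smaller per-partition joins can be
    -- executed in parallel.
    EXPLAIN (COSTS OFF)
    SELECT user_id, count(*)
    FROM orders JOIN payments USING (user_id)
    GROUP BY user_id;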

~~~
arghwhat
My point is that partitioning is the polar opposite of load balancing, and the
wrong tool to implement it.

------
koolba
> With Postgres 11, you can create a unique index on the master

> The columns in the index definition should be a superset of the partition
> key columns. This means that the uniqueness is enforced locally to each
> partition, and you can’t use this to enforce uniqueness of alternate-
> primary-key columns.

Are there any workarounds for global indexes across partitions outside of the
partition key? Say you have a table partitioned by (some_date) with a unique
primary key (id): how would one speed up lookups that are purely by id? Having
per-partition local indexes would require an index scan per partition.

I suppose you could have a separate/smaller (id, some_date) non-partitioned
table (or maybe one partitioned on HASH(id)) and then do the indirection
yourself wherever you're doing "SELECT ... WHERE id = ...", but that seems
like a kludge. Any other workarounds?
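
For reference, that indirection might look roughly like this (made-up table
names; the id-to-date map is where id uniqueness actually gets enforced):

    -- Made-up sketch of the side-table indirection described above.
    CREATE TABLE big_table (
        id        bigint NOT NULL,
        some_date date   NOT NULL,
        payload   text,
        PRIMARY KEY (id, some_date)      -- must include the partition key
    ) PARTITION BY RANGE (some_date);

    CREATE TABLE big_table_2018 PARTITION OF big_table
        FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

    -- Small map from id to partition key; its primary key is what actually
    -- gives you global uniqueness of id.
    CREATE TABLE big_table_id_map (
        id        bigint PRIMARY KEY,
        some_date date   NOT NULL
    );

    -- A lookup "purely by id" becomes a join; once the some_date value is
    -- known, the scan of big_table can be limited to a single partition.
    SELECT t.*
    FROM big_table_id_map m
    JOIN big_table t ON t.id = m.id AND t.some_date = m.some_date
    WHERE m.id = 12345;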

~~~
endymi0n
Having done this with MySQL before, there are basically three things you can do:

1) Do what you proposed, but this kinda defeats the whole purpose, as that
side table can get very, very large (and if it's not fitting into working
memory anymore, good luck keeping up a sane insert rate).

2) Just don't load that kind of data into an RDS subject to these scaling
limitations at all (what I would do nowadays, but in a lot of ways, easier
said than done).

3) (What we did back then:) If you can, accept a window of uniqueness instead
of real, true global uniqueness. If you can live with that, define a cronjob
to build you a VIEW on the last TODAY-N partitions and check for duplicates
using that before inserting (see the sketch below).
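
Roughly what that window-of-uniqueness check could look like in Postgres
terms, with made-up partition names (and no protection against concurrent
inserts):

    -- Made-up sketch: uniqueness is only checked against the most recent partitions.
    CREATE OR REPLACE VIEW big_table_recent AS
        SELECT id FROM big_table_2018_11
        UNION ALL
        SELECT id FROM big_table_2018_12;   -- a cron job recreates this each period

    -- Check-before-insert; concurrent writers can still race, so this gives a
    -- window of uniqueness, not a real constraint.
    INSERT INTO big_table (id, some_date, payload)
    SELECT 12345, DATE '2018-12-01', 'hello'
    WHERE NOT EXISTS (SELECT 1 FROM big_table_recent WHERE id = 12345);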

~~~
Shelnutt2
Using Spider storage engine in Mariadb it is possible to do this by having
vertical partitions using different partition by clauses [1]. Essentially you
will create sub tables with the partition by clauses, then use a spider table
to reference these subtables. Spider than intelligently will select the
partition to use based on the query and where clause. The downside of this is
it will require storage of the data multiple times on disk, since each
subtable needs the full data set. Spider does handle inserting the data, so
once schemas are build, only the single spider table is needed for
inserts/selects/updates.

In addition to this, it was mentioned that for 10.4 Spider is working on
internal joins. This means it would be possible to have one table defined with
the full data set, then a second table that just has (id, datefield)
partitioned by id. When you SELECT WHERE id = X, it reads from the small table
partitioned by id, finds the dates, then joins against the larger table to
pull in the remaining columns/data. This isn't mentioned in the presentation
linked, but it was brought up in discussion at the MariaDB developers
unconference.

[1] page 6: [https://mariadb.org/wp-
content/uploads/2018/03/Merging_patch...](https://mariadb.org/wp-
content/uploads/2018/03/Merging_patches_into_MariaDB_10.4.pdf) (note this is
about pages for 10.4 but the VP functionality is in 10.3)

------
jeffdavis
Partitioning and parallel query go great together!

For applicable workloads, it means you get nearly perfect scaling up to the
number of partitions you have.

PG11 will be huge.

------
john37386
I use Zabbix a lot with a PG database, and there are 2 tables that grow
indefinitely. The Zabbix housekeeper lags a bit when the tables grow past
200GB. I tried the autopartitioning and the housekeeper becomes irrelevant. I
can now easily keep 1 year of data for my 200k+ monitored items. I like my
critical alerts to keep 3 months at a 60-second interval. PG partitioning is
really awesome.

------
gigatexal
Very excited about this. Postgres 10 is awesome, but I am looking forward to
working with 11.

