Hacker News new | past | comments | ask | show | jobs | submit login
PostgreSQL 10: Partitions of partitions (joaodlf.com)
140 points by joaodlf on Jan 20, 2018 | hide | past | favorite | 48 comments

No primary/unique keys on the partition table and only on the partitions still seems like a common dealbreaker.

There's a patch adding this capability in the current commitfest. See:


If everything goes well, it might be in PostgreSQL 11.

Just curious but why? Wouldn't unique keys within partitions mean part-key + part-unique-key becomes a unique identifier?

If I'm understanding everything right, the uniqueness is only checked within each partition because only the partitions can have unique indexes. If you're partitioning by month and got an id field that needs to be unique to the table, you could conceivably have the same id show up in different months. Just because I partition by months doesn't mean I don't sometimes need the whole table, too. I might usually only need monthly data, but sometimes I might need a full history. Now if I've got a duplicate I'm fucked and I don't even know it.

This is a good point, but your partition space implicitly has a unique compound key (part_unique_id, part_key). In almost any case I can think of compound unique keys are probably sufficient, but if you really must have a single UUID, then you could combine these two keys to create a unique identifier.

Say new_key = <part_unique-id>-<part_key>. Now new_key is guaranteed unique across the partition space. You could consider hashing as well...although I don't recommend this since hashes don't have collision guarantees (even if the chances of collisons are small for most modern algorithms).

I don't see how a compound key does anything here.

Say my table is a Student table: StudentId, Building, Name, Birthdate, Gender, Grade, Status. I want to partition by Status so that Active students are together, but StudentId must be unique across the entire district.

I am not sure this case really fits what I thought your point was, but a shared sequence would help here:

    CREATE SEQUENCE student_id_seq
        START WITH 1
        INCREMENT BY 1
        CACHE 1;
And anywhere you create a student you would set default to be next value in the sequence:

     ALTER COLUMN id SET DEFAULT nextval('student_id_seq'::regclass);
Although, in this example I am pretty sure a valid limitation of the model could be that students must be created with "Active" status...which then confuses me slightly since you wouldn't really partition by status since updating the status of a student is an action so I would move the user record at that time. Which is no longer partitioning per se.

But this is one (probably naiive) way to handle this case.

A sequence isn't a unique key. It's not data integrity. It's just a sane default. There's nothing stopping an application from inserting a specific value, and so, without a unique constraint, there's nothing stopping an application from inserting a duplicate value. There's a world of difference between a sane default and a enforced constraint. Half the point of an RDBMS is that the database cannot store inconsistent or invalid data. You mark a column as unique and you don't have to worry about it again. You can't store a duplicate value unless the database is corrupt.

Yes, if the applications using the database have no bugs and always work as expected, then you won't have any duplicates. However, that line of reasoning leads to just 100% trusting everything the application does regardless of the design of the data model. That's exactly how data stores used to work before RDBMSs, and it's exactly why RDBMSs came about: applications can't be trusted to manipulate data consistently to a known set of rules. Somebody will mess it up somewhere, so it's important to enforce rules to leave the database in a manner that other applications (or other parts of the same application) will find comprehensible.

> It's not data integrity.

This is true. And while you could claim it is a "non-starter" for using partitions I would argue that app logic guarantees of how that column is used is sufficient for many cases to actually use in its current state.

You have good points, and I have no disagreement about data integrity constraints or whether or not partition-wide uniqueness guarantees are a good feature.

It would seem you and I simply disagree on whether or not partitions without uniqueness guarantees are unusable for most use cases. I believe many partitions use cases don't need uniqueness guarantees (such as high volume, low/no update work loads). And for quite a few, if definitely not all, cases where uniqueness is desirable it could be satisfactorily handled in app logic.

But again, key agreement is partition-wide uniqueness guarantees are a good feature.

If you're partitioning by month how would the month end up in 2 partitions? I mean how would march and up in april or something!???

Think: TransactionId, ItemId, BuyerID, Count, date (1, 1, 1, 2, 1/1/2018; 1, 2, 1, 2, 1/2/2018)

Where TransactionId should be globally unique?

[ed: see https://news.ycombinator.com/item?id=16196096 for a much better/comprehensive example of the same idea]

See my response above :)

I agree that you would normally end up with a compound unique key, but it might have some benefits to know that your transaction I'd was globally unique. Not jus per month/weekday - but also on geographic region, stores (eg: avoid need to rewrite/legacy support on store merges/splits, ditto for regions etc).

Naive question but this means no joins or foreign keys right?

No, it means you might have duplicates of fields you don't want duplicates of. In general, JOINs work much more efficiently when there's an index on the columns used to join the tables, but they're not required.

Let's say you want a table for your account transactions:

  create table transaction (
    tran_id int primary key not null,
    tran_date timestamp(0) not null,
    account int not null,
    amount decimal(30,4) not null
Now if there's already a transaction with an id of 8043, you can't insert another transaction with that same id. There's only one transaction with 8043 allowed in the whole table. However, if we partition the table:

  create table transaction (
    tran_id int not null,
    tran_date timestamp(0) not null,
    account int not null,
    amount decimal(30,4) not null
  ) partition by range (tran_date);

  create table transaction_y2018m01 partition of transaction
  for values from ('2018-01-01 00:00:00') to ('2018-01-31 23:59:59');

  create table transaction_y2018m02 partition of transaction
  for values from ('2018-02-01 00:00:00') to ('2018-02-28 23:59:59');

  alter table transaction_y2018m01 add constraint ux_transaction_y2018m01_tran_id unique (tran_id);
  alter table transaction_y2018m02 add constraint ux_transaction_y2018m02_tran_id unique (tran_id);
See, the only uniqueness restrictions are on the partitions, not the overall table. Now I could potentially have a transaction with an id of 8043 in both January and February of 2018, as well as any number of transactions with an id of 8043 not in either of those two months. If my application assumes that transaction ids are always unique, that's got the potential to cause a problem. If multiple applications or multiple users use the same database, it's possible that an error or a race condition might cause a duplicate id.

I suppose a mitigation might be a table of just (unique_id, partition_field) as arbiter of validity. Assuming such a table to be compatible with system design... one would need a whole lot of rows and/or heavy fragmentation to rule it out, no?

Foreign keys referencing partitioned tables are not supported. Joins are absolutely fine.

One solution for unique primary keys is to use a system like Snowflake [0]

[0] https://blog.twitter.com/engineering/en_us/a/2010/announcing...

The question is enforcement, bit generation.

Is anyone that was using pg_partman before migrated to native partitioning yet? No support for ON CONFLICT and PKs are serious limitations that are available with pg_partman.

Most I know that are using partitioning even with Postgres 10 are still taking advantage of pg_partman. Pg_partman overall makes things much more usable in general.

I agree pg_partman is awesome, and it's a great testament to the extensibility baked into PostgreSQL.

Can anyone offer a really basic summary of what problems can be solved with PostgreSQL partitions?

A big one that nobody has mentioned yet - reduce memory requirements for your db. In general, for a database to be performant, it needs to be able to keep the hot part of the dataset and REGULARLY USED INDEXES in memory, you don't want to be paging indexes in and out of memory every time you have to use them.

Say you've got a giant table for of emails, and there's a bunch of indexed fields (people maybe want to filter their emails by tags, read/unread, etc.). If you keep them in one giant table, you need to keep those giant indexes fully in memory to get good performance. BUT, you know 99% of requests are for recent emails, people only occasionally look at old ones. So you partition your table by timestamp, and instead of running queries like `select * from email where status = 'unread' order by timestamp desc limit 50`, you run queries like `select * from email where status = 'unread' AND TIMESTAMP >= <pretty-recent> order by timestamp desc limit 50`. This only hits the first partition, and almost always returns the 50 emails you need - if it doesn't, you run subsequent queries to hit older partitions.

But now, say you partition by month, and keep 2 years of data, almost all queries are being serviced by a partition containing 1/24th of the total data, so keeping indexes for that 1 partition in memory is WAY easier than keeping indexes in memory for the entire giant table. When you have to query old partitions, it's slower and there's still some swapping happening, but that's rare anyways. So you can get away with having way less memory for your db, without performance degradation on most queries.

With your example, I believe you’ve just described TimescaleDB. https://www.timescale.com/

We use it to store 150M records a day, and be performant enough to have a customer facing analytics dashboard querying against it for real-time results. Pretty amazing approach.

TimescaleDB person here. Thanks for the call-out.

For those who don't know, TimescaleDB is a extension to Postgres that makes it easy to use and scale for time-series data. It works with both PG9.6 and PG10.

Some folks might be interested in two recent blog posts we wrote about PG10 partitioning:

1. Technical write-up about PG10 partitioning: https://blog.timescale.com/scaling-partitioning-data-postgre...

2. Why TimescaleDB's partitioning is easier-to-use and more performant than PG10's native partitioning for time-series data: https://blog.timescale.com/time-series-data-postgresql-10-vs...

Is there that much of a performance difference, though? In your example the btree of the index of the 2 years of data is probably 1 level deeper than that of an individual month. If you're mostly querying the last month, only that part of the btree will be kept in memory. The size of the hot data of the index is similar in both cases. I fail to see how that would make much of a performance difference.

For bulk operations like (auto-)VACUUM, CLUSTER or dumping/restoring data it can make a big difference, though.

Yes. Sometimes up to 500x faster. It's not the index which is important but the table itself. PostgreSQL tables are heap tables. The index might be ordered by month but the rows it referrs to can be spread throught the entire large table.

There are realistic scenarios where each row can end up costing you an 8kb page fetch, reducing query speed by ~250x for narrow tables as you're bound by memory bandwidth, or worse, disk bandwidth.

There's also an additional performance benefit in being able to skip the index scan when querying by month, and just sequentially scanning the entire partition which is usually 2-3x faster to access the same amount of data.

> Yes. Sometimes up to 500x faster. It's not the index which is important but the table itself. PostgreSQL tables are heap tables. The index might be ordered by month but the rows it referrs to can be spread throught the entire large table.

That makes sense.

> There are realistic scenarios where each row can end up costing you an 8kb page fetch, reducing query speed by ~250x for narrow tables as you're bound by memory bandwidth, or worse, disk bandwidth.

I've actually had a similar scenario. CLUSTERing by the index solved that. But of course running CLUSTER on a huge table is very slow. If you can partition the data so you only have to cluster one of the partitions, that's a huge win.

> There's also an additional performance benefit in being able to skip the index scan when querying by month, and just sequentially scanning the entire partition which is usually 2-3x faster to access the same amount of data.

Didn't think of that.

Thanks for the clarifications.

Partitioning by date is a common solution to efficient archiving - the partitions may be dumped independently, and instead of DELETE, which requires expensive cleanup after the fact, you can simply drop the partition (which does not require cleanup at all).

The declarative partitioning (added in PostgreSQL 10) is also transparent for the optimizer, i.e. it can understand how the data is routed to partitions, and can leverage it while planning/executing queries. For example when a table is partitioned on "a" and your query does "GROUP BY a" then in some cases the database can do the aggregation per partition - which should be more efficient in general (smaller hash tables, less data to sort, ...). Or when joining tables partitioned in the same way, it may be possible to do by joining the matching partitions (again, more efficient).

Obviously, PostgreSQL 10 only introduced the "core" declarative partitioning, and many such goodies are currently being worked on - either for PostgreSQL 11 or following version(s).

A simple example is you bucket your data into monthly tables such that when you run a query that has a timestamp constraint, the query analyzer is smart enough to only scan a subset of the tables that are relevant to your query. It's also really nice for expiring old data with table drops.

Imagine a very large table, let's say it's a tally table where we frequently run analytics queries against. Even with optimised indexing, you're going to hit performance issues eventually (it all depends on query complexity + size of data). Partitioning essentially splits data into different tables - Let's say we partition by date range, we could end up with a table for each individual year/month/week/day, whatever really! Instead of then performing actions against one big table, you would effectively only affect the partitions that belong to the date range you're interested in. Partitioning is all about splitting data into multiple tables.

Couldn't you use views for that? Everything you described sounds like a view to me. Whats the difference between them?

Traditional views are just stored queries that are built on the fly, so there's no performance improvement. Materialized views are stored/cached on disk and will give you some performance improvements but you'll have deal with when/how the cache is updated and the resulting penalties.

Yep, Postgres grants flexibility (you can build temporal materialized views + other constructs some other database servers refuse as nondeterministic) but with the burden of having to refresh the materialized view yourself. This was a little bit of a surprise after doing materialized views in MSSQL and Oracle, which treat 'em more like big old indexes updated in lockstep with the base tables.

Reduced index size is the most important one, IMHO.

Once your indices are larger than than your available memory, write and read performance plummets. This obviously only applies to fairly large tables, but in our case it will be a godsend.

The problem with views in Postgres is that they are not updated on the fly, they require manual updating. For real time data, it's not a very useful feature.

Erm I think you're thinking of materialized views. Regular views are essentially stored queries.


Oh, yes. I tend not to gravitate towards views, so got myself mixed up there :).

It seems like the inverse of a merge table. Years ago we used to archive data into tables named by the date range and then we would create a merge table that combined them. So every night we would drop the oldest, create a new one, and create a new merge table of the updated set.

Partitioning looks to be going the other way. Create a big table, partition for convenience, drop partitions easily when you are ready to dispose of them. I can definitely see some cases where this would be super handy.

I'm not familiar with merge tables, but it sounds quite like a view for UNION of all the tables.

Poor data archiving practices.

Jokes aside when you need to operate reporting on a multi-billion row scale which only needs to look at a very specific subset of your data I could see the use of partitioning. It's the transparent cousin to "copy it into it's own table".

Do you have better advice for archiving practices? I don't get the joke.

Partitioning is not like copying to another table because the partition is already its own table. It saves the entire expensive operation of copying that data and you get the additional benefit that it stays current.

Personally I'm a fan of pruning production data and creating separate structures for archival purpose. You should already have recurring backups so pulling the data out of your tables should not be scary.

Copy data out of tables, put them into formats conducive to analysis (document stores, relational, column oriented stores, etc). You can then tie everything together in your language or from something like Presto.

Except that the removal of data from tables is fairly expensive process - you have to delete the data, which means running queries (which have to scan the data, write a lot of WAL and modified blocks, etc) and then do cleanup (which means vacuum on the table and also on all the indexes). That's expensive.

With well designed partitioning scheme you can simply drop a partition, and you're done. Much faster/cheaper.

But this is just one one benefit of partitioning - I've mentioned the possible benefits for planning elsewhere in this thread.

Ok but partitions make it easier to prune data without expensive cleanup operations and are useful for the same reason in the relational analytic store. Nothing about partitions requires you to run analytics on production databases.

There's a trade-off with size vs latency. It can make sense to use smaller partitions if you need to do interactive sorts and filters on a partitioned subset of the data, in the several million row range. Benefit is smaller if the data is naturally co-located due to insertion time.

The biggest win I see though is fast deletes.

How is this better than inherited tables with indexes and constraints ala https://stackoverflow.com/a/3075248

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact