Hacker News
Zero-downtime schema migrations in Postgres using views (fabianlindfors.se)
181 points by fabianlindfors 45 days ago | hide | past | favorite | 54 comments

I looked into this recently, and views cover like 98% of the functionality that the client app needs from Postgres. One issue I ran into was that Postgres forgets that the primary key is a primary key when pulling from a view, which breaks some queries that rely on grouping by the primary key.

https://dba.stackexchange.com/questions/195104/postgres-grou... has some more info on this

I recently transitioned a number of tables over to views as part of a data model rearrangement, and I absolutely loathed my previous self for leveraging that primary key trick. I don't think there would be anything unsafe about allowing any column covered by a unique index on the table to serve this role, and it would make things a fair bit more logical.

That all said, until that happens, I'd strongly suggest avoiding that functionality since it can lay down some real landmines.

I've made a library that generates TypeScript types from a PG database, and I see a variation of this problem: since the "reflection" capabilities on views don't tell me about references in their source tables, I lose the ability to see where a foreign key points. I normally use this to create nominal ID types, but for views I can only emit plain string or number types.

I am trying to work around this, but so far my only solution is to parse the SQL of the view definition and extract the information from there. I have it somewhat working, but it's a bit too complex for my liking.

Interesting link, thanks! It would be really nice if Postgres were to close that gap and make them fully equivalent (if that is even possible).

Until then, you just have to list the other columns in the GROUP BY... You should usually really be listing which columns you need explicitly, anyway.

Doesn't DISTINCT work in this case?

  SELECT DISTINCT row_to_json(vt.*) FROM vt
  JOIN vy ON vt.id = vy.tid
  WHERE vt.id = 1
  AND vy.amount > 1;

The SQL in question which was problematic for me (tables renamed):

SELECT posts.*, MAX(COALESCE(comments.created_at, posts.created_at)) AS latest_timestamp FROM posts LEFT OUTER JOIN comments ON posts.id = comments.post_id GROUP BY posts.id ORDER BY latest_timestamp DESC

In short, this is sorting posts by the most recent comment, with a fallback to the post date if the post has no comments on it. Hard to get rid of the grouping here and get the same data back.
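For the record, when a query like that runs against a view, the usual workaround is to list every non-aggregated column explicitly in both the SELECT list and the GROUP BY, since the view loses the primary-key functional dependency. A sketch (any columns beyond id/created_at are made up):

    SELECT posts.id, posts.title, posts.created_at,
           MAX(COALESCE(comments.created_at, posts.created_at)) AS latest_timestamp
    FROM posts
    LEFT OUTER JOIN comments ON posts.id = comments.post_id
    GROUP BY posts.id, posts.title, posts.created_at
    ORDER BY latest_timestamp DESC;

Tedious, but it gets the same rows back without relying on the primary-key trick.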

I see, that makes more sense.

I'm the OP of the DBA StackExchange post, and while yes, it does technically work in this case, you do lose some abilities. For one, it's much harder to `count(*)` rows with `DISTINCT`. Also, `DISTINCT` uses a totally different mode in planning that requires first retrieving all the rows and only then filtering them. This is _much_ slower and generally not very fun to deal with!

Author here, thanks for reading. As has been mentioned in some comments, the article only covers half the story of how this can be used for migrations. The greater idea is to have every migration be short-lived and two-phased: apply and cleanup. A migration is first applied and, once it's no longer used by any clients, it's cleaned up.

1. Apply: the new schema is created and any new data is added to the backing table.

2. Cleanup: the schema is deleted and any data that is no longer needed is removed from the backing table.

For example, to perform a column deletion:

1. Apply: create new schema without the column which is being deleted. Don't change anything in the backing table.

2. Cleanup: delete the schema and the column from the backing table.
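Sketched in SQL, with made-up schema/table/column names, those two phases could look like:

    -- Apply: a new versioned schema whose view omits the column.
    -- The backing table (public.users, with a deprecated "nickname"
    -- column) is left untouched, so old clients keep working.
    CREATE SCHEMA migration_2;
    CREATE VIEW migration_2.users AS
        SELECT id, name, email FROM public.users;

    -- Cleanup: once no client uses migration_1 any more.
    DROP SCHEMA migration_1 CASCADE;
    ALTER TABLE public.users DROP COLUMN nickname;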

Thanks for this article.

I've always liked the idea of using views as an API. But, I'd never considered using Postgres' schemas as a way of versioning that API. This approach wouldn't work for every kind of migration but definitely seems correct and useful in some cases.

As others have pointed out you still of course have to deal with the technical debt of clients using the "old" schemas/views... but that is a problem that has to be addressed anyway... so I don't consider that to be a "flaw" of this approach; it's something close to orthogonal to this approach.


Interesting approach, with a pretty dark side effect: the tech debt silently creeping in and staying forever. Apart from that, you can use any abstraction over the base schema, depending on the use case. Renaming a column can be as simple as renaming a field in the DTO structure.

The interesting part is when you actually get to pay the debt back. This is not a free lunch: you move fast, but you eventually pay the cost of the underlying storage containing all the historical schemas combined at once.

This is the difference between carefully planned schema migrations and abstraction-based migrations. The former takes more time planning and executing, but the end state is the reference schema. The latter keeps on extending the underlying schema and building projections over it. Over time, the system accumulates tech debt in the form of ever-expanding base schema that contains all the historical columns.

The point here is to do migrations live while clients connect at random and stay connected for arbitrary lengths of time. When all old clients have disconnected, the old schema can be removed.

The point is not to let old clients work indefinitely.

In other versions of this idea, all views/procedures are prefixed with some revision-number but that looks messy.

But what happens when schema Xn adds a column N, and schema Xn+1 removes it? The base schema, the source for all the views will contain all the columns Xn... from all the migrations until purged.

That purging is what has a lot of operational complexity. And renaming a column. All the rest is zero-downtime in PG already.

We have automated updates of schema. Installs are on-prem, and we support running multiple versions of our program at the same time (if there's a bug in new version, just use previous as workaround).

We rarely drop columns due to this, but haven't found it to be a huge issue. If we need to, we'll wait until the last supported version that has the column in its schema is retired. Then we'll add an explicit "drop column" to our schema file for the next version.

Views can perform arbitrary queries. They can achieve any kind of inference starting from the base data that's stored in tables. (Keep in mind that modern SQL is fully Turing-complete.) This makes any view-augmented database a viable source of rich, complex knowledge, not just mere information.

But I doubt unleashing the full expressive power of SQL is the point here. It would easily turn a moderately complicated "remove column" migration into a real maintenance hell.

In the simplest form, when you add a column to some schema, it should be materialized in the base schema and exposed via the migration view.

The problems start when you add and remove 20 columns because even though they are no longer visible in migration schemas, they take up space in the base schema

@lallysingh, sorry, can't reply directly.

All I'm saying is that I don't think we need a full Turing-complete cannonball to hit the (relatively) small fly of no-downtime migrations.

Is it a silver-bullet? It's Turing-complete so high chances yes. But for me it has a high risk of causing a silver-poisoning.

Personally, I would stick with simpler solutions.

Butting in to say: Maximum reply depth is relative to the view. Just click on the comment's time and you will get a form with a reply field.

WOAH! I’ve been using HN for years and had NEVER realised that. And I honestly keep seeing people saying they can’t reply, either. This changes things!

Let's see how deep this rabbit-hole goes

Yes, this approach might actually decrease the power of your database, as some things become much trickier. It might be worth trading off some expressive power of the database for smoother, less hands-on migrations though, which is what I'm going for.

That sounds like problems upstream of this technique. Are you complaining that it isn't a silver bullet or that it's so powerful that it'll be abused?

It would have to be in steps:

Add Xn+1: create new views without column. When no Xn clients remain: drop column from tables and drop schema Xn.

The point as I see it is to not break live client connections which expect the column to exist.

Yes, this is a very important insight! I think the author proposes this approach as a replacement for regular schema migrations, while in fact the two approaches complement one another.

You can use views to make migrations that were previously tricky zero-downtime.

If that's not the case, then I must've read the article wrong!

Edit: although, when I think of it - if you want to eventually materialize old migration schemas into the base schema, you need to do the rename too. Which is not zero-downtime, because the newer migration schemas that do the renaming automatically have to have their view definitions changed, meaning lots of locking.

So, you still need maintenance windows to merge all the changes. Just not on every change. Otherwise the base schema will then eventually be completely out of sync and contain tons of old, unused columns.

You're right that the article doesn't mention how this can actually be practically used for migrations, but that is exactly what I had in mind. Maybe I'll write another one some day about the two-phase migrations combined with views.

But now you have the ability to merge multiple maintenance windows into one. For example, do the next 4 migrations with the view trick, and on the 5th apply all the accumulated changes to the real tables.

Looking forward to your future blog posts! The rename thing got me really interested.

This is precisely the way I was imagining! I'm experimenting with creating a migration tool which will do this fully automatically.

This is definitely a valid concern. My plan for building migrations on top of this is to have older migrations automatically be deleted as they become unused. Preferably you wouldn't actually keep using the old schemas but rather keep them as you roll out changes to your application servers. Migrations could then be deleted (in reverse order) based on some condition, for example time since the schema was created or time since last query.

But in order to support the "add column" migration you have to add it to the base schema, right? It has to be materialized somewhere.

Now, when you remove a column, you just create a new view without it being available. But the column in the base schema must stay because: you support the old schema, and because you want to avoid the operational complexity of physically removing it, right?

Now, you can obviously split a table into a list of tables each storing the pk and some column value. But you have just invented columnar storage in a row-based rdbms :D

A different approach is to have a scheduled maintenance window, for example quarterly or every 6 months, when you materialize all the changes at once. The amortized operational cost is way lower :)

I think you've misunderstood - the issue is how to keep two views in place - one supporting current clients, until the last one disconnects, and another supporting new clients.

Until you have no current clients left, you can't delete the old view or remove underlying columns that the old view requires. Once they have all disconnected, you can execute that phase.

I thought this wouldn't work with INSERTs and UPDATEs, but it turns out newer versions of Postgres automatically support updating simple views. [1]

[1] https://www.postgresql.org/docs/12/sql-createview.html (See "Updatable Views")
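For reference, the pass-through is automatic for simple views (single table, no aggregates or DISTINCT, Postgres 9.3+). A minimal sketch with invented names:

    CREATE TABLE items (id serial PRIMARY KEY, name text, secret text);
    CREATE VIEW items_api AS SELECT id, name FROM items;

    -- Both of these write straight through to "items":
    INSERT INTO items_api (name) VALUES ('widget');
    UPDATE items_api SET name = 'gadget' WHERE id = 1;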

The information on this is, admittedly, hard to dig out of the primary docs but there also exists an INSTEAD OF trigger mode - so you can CREATE TRIGGER "actually_save_stuff" INSTEAD OF INSERT which will intercept an attempted insert on the table and write the rows elsewhere.

Generally I've found these useful in the short term for letting application logic continue to work blissfully ignorant of the fact that the "table" it is writing to has been moved, merged, or refactored into a set of tables that the trigger will bridge logic over to. In the long term, I've found that the automatic logic covers the most common case: a view that's filtering out some rows for some reason (permissions, soft deletion) before presenting them to the application logic.
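A minimal sketch of that INSTEAD OF pattern (all names invented): the application still inserts into "orders", which is now a view bridging two tables:

    CREATE VIEW orders AS
        SELECT h.id, h.customer, d.amount
        FROM order_header h
        JOIN order_detail d ON d.order_id = h.id;

    CREATE FUNCTION orders_insert() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        INSERT INTO order_header (id, customer) VALUES (NEW.id, NEW.customer);
        INSERT INTO order_detail (order_id, amount) VALUES (NEW.id, NEW.amount);
        RETURN NEW;
    END;
    $$;

    CREATE TRIGGER actually_save_stuff
        INSTEAD OF INSERT ON orders
        FOR EACH ROW EXECUTE FUNCTION orders_insert();

(EXECUTE FUNCTION is Postgres 11+ syntax; older versions spell it EXECUTE PROCEDURE.)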

This might work to some extent for renaming things but doesn't for any other kind of migration I can think of:

- Dropping a column doesn't work (assuming the point of dropping it is actual deletion, rather than just hiding it)

- Adding a column doesn't work either

- Changing a column's nullability, default value, or data type doesn't work

- Doesn't help with adding or changing constraints or indexes

With this model, migrations would be two-stepped. First the migration would be applied, creating a new schema and adding the new information to the underlying table. Once the version is old and no longer used by any clients, the migration would be "deleted", removing the schema and the underlying information.

So dropping a column would go like this: 1. Migration is applied. Column is removed from the new view but remains in the backing table. 2. Migration is no longer needed. Column is removed from the backing table.

This design should be extendable to cover changing a data type (introduce a new column, move values over, and keep values in sync using a trigger), changing a default, or changing the contents of a full column.

Constraints or indices are a bit trickier, but it might work to create a new column duplicated from the old one and add constraints to that instead.
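As a rough, untested sketch of that data type change (invented names; a real version would also need to handle writes arriving through the new schema, i.e. two-way sync):

    -- Apply: shadow column with the new type, backfilled and kept in sync.
    ALTER TABLE accounts ADD COLUMN balance_v2 numeric;
    UPDATE accounts SET balance_v2 = balance::numeric;

    CREATE FUNCTION sync_balance() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        NEW.balance_v2 := NEW.balance::numeric;
        RETURN NEW;
    END;
    $$;

    CREATE TRIGGER sync_balance_trg
        BEFORE INSERT OR UPDATE ON accounts
        FOR EACH ROW EXECUTE FUNCTION sync_balance();

    -- The new schema's view exposes the new column under the old name.
    CREATE VIEW migration_2.accounts AS
        SELECT id, balance_v2 AS balance FROM public.accounts;

    -- Cleanup (once no old clients remain): drop the trigger,
    -- the old column, and the old schema.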

Adding a column, changing column's nullability and adding/changing constraints is already zero-downtime in PG.

Not sure about default value change, but I would say it's also zero-downtime, as adding a column with a default is zero-downtime. Haven't checked, though.

Dropping a column and changing the data type are not zero-downtime.

Renaming a column is not zero-downtime as you might have multiple readers/writes using both.

> Adding a column, changing column's nullability and adding/changing constraints is already zero-downtime in PG.

Making a previously nullable column NOT NULL is not zero downtime. Neither is adding constraints -- except in some cases with NOT VALID, which isn't quite the same thing.

> Dropping a column and changing the data type are not zero-downtime.

Dropping a column is zero downtime.

Adding a NOT NULL constraint isn't zero downtime, since PG scans the whole table to ensure the constraint is not violated.

Discussion on another "zero downtime postgres migration" from a few days ago.


I don't really like solutions which force everything into a single schema just to do migrations. They shouldn't be that difficult.

In this particular case, rather than making everything a view all the time, you could just use views during a migration window to get the same effect. Replace the table with a view which has all the columns plus the new name as an alias for the old one, migrate all the clients, replace the view with the table with the column renamed. No special permanent crap necessary.
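That window-only approach could look something like this (invented names, untested sketch):

    -- Start of migration window: swap the table for an aliasing view.
    BEGIN;
    ALTER TABLE users RENAME TO users_tbl;
    CREATE VIEW users AS
        SELECT *, email AS email_address  -- new name aliases the old column
        FROM users_tbl;
    COMMIT;

    -- ...migrate all clients from "email" to "email_address"...

    -- End of window: swap back with the column actually renamed.
    BEGIN;
    DROP VIEW users;
    ALTER TABLE users_tbl RENAME COLUMN email TO email_address;
    ALTER TABLE users_tbl RENAME TO users;
    COMMIT;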

The idea is not for all the different versions to be permanent, just temporary while the clients are being updated. Check out the top-level comment I wrote for a more fleshed-out explanation!

That's still, in my opinion, overkill and unnecessary.

That would probably depend on the use case no?


Very interesting. I like the idea of "virtualizing" the hard schema with views to abstract away some things.

View performance can be a thing at larger scale for OLTP workloads. Also, the solution you propose adds complexity, since you now have two schemas instead of one and, as you rightly point out, complexity with the views themselves. The question becomes: when is this added complexity worth it?

I think you hit the nail on the head regarding complexity. I'm experimenting with creating a migration tool which handles all the views for you. I believe some of the process can be automated and the complexity reduced, but it remains to be seen how well it works!

Views are equivalent to "query templates" unless for example they do operations on subquery aggregate columns (for example group by sum().)

So views don't add complexity relative to giving the query explicitly. It might even help since the incoming queries can be smaller, thus less parsing/processing.

I don't remember which, but there was a DB server I worked with where the tables weren't directly accessible and required aliases to be created, thus enforcing portability.

I think views are great way of abstracting out business logic. The one thing keeping me from using it more frequently is that it doesn’t work well with Row Level Security.

I was having issues with this as well; it turned out my problem was that I was creating the view as a superuser. Since view creation is essentially WITH SECURITY DEFINER (to borrow a term from UDFs), any user accessing the view to see the underlying RLS-enabled table would see all the results (since the `postgres` user bypasses all RLS).

Creating a separate "data owner" and "api owner" to own the api and data schema+tables/views respectively alleviates this problem. Your API schema owner != your table owner unless you `ALTER TABLE FORCE ROW LEVEL SECURITY` in addition to just enabling it.

What issues are you running into with RLS?

The owner of the view is taken as the identity for RLS purposes when determining what data is seen in the underlying table(s). So if you select from a view as different users, you always get the same results: whatever the owner of the view would see.

That is not the case. At least not in postgres 13. The owner of the view only determines if RLS is bypassed or not. If the view is owned by a different user than the table (or `FORCE ROW LEVEL SECURITY`), then the view will evaluate RLS. RLS is bypassed for the owner of the table, superusers, and users with `BYPASSRLS` attribute.

We can test this.


    create role data;
    create role api;
    create role bob;
    create role alice;

    create schema api_v1 authorization api;
    create schema data authorization data;

    set role data;

    create table data.user_profile (
        id int generated always as identity primary key,
        username text,
        value text
    );
    grant usage on schema data to api;
    grant select on data.user_profile to api;

    insert into data.user_profile (username, value) values
        ('bob', 'Bob''s First Value'),
        ('bob', 'Bob''s Second Value'),
        ('alice', 'Alice''s First Value');

    create policy "Users can only see their own data" 
        on data.user_profile
        for select using (current_user = username);

    alter table data.user_profile enable row level security;

    select * from data.user_profile; -- shows 3 records, both bob's and alice's

    set role api;

    create view api_v1.user_profile as
    select * from data.user_profile;

    grant usage on schema api_v1 to bob;
    grant usage on schema api_v1 to alice;

    grant select on api_v1.user_profile to bob;
    grant select on api_v1.user_profile to alice;

    select current_user; -- currently api
    select * from api_v1.user_profile; -- shows 0 records

    set role bob;
    select * from api_v1.user_profile; -- shows 2 records, both bob's

    set role alice;
    select * from api_v1.user_profile; -- shows 1 record, alice's.

    -- both alice and bob cannot access the data schema or its contents. RLS is respected for the api_v1 schema.

Does that make sense, or did I misunderstand your meaning?

You probably already know this, but in case you don't a workaround for views not respecting RLS is to use an SQL function and then create a view which selects the fields from the function. Definitely more awkward than creating a view from a table without RLS though.

    SELECT ...;
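A sketch of that function-based workaround (names invented, building on the example upthread): a plain SQL function runs with the caller's identity by default, so RLS on the base table is evaluated per caller, and the view just wraps the function:

    CREATE FUNCTION api_v1.user_profile_fn()
    RETURNS SETOF data.user_profile
    LANGUAGE sql STABLE
    AS $$ SELECT * FROM data.user_profile; $$;

    CREATE VIEW api_v1.user_profile_v AS
        SELECT id, username, value FROM api_v1.user_profile_fn();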

