My tests running the ALTER varied from ~20 seconds to ~1 min for the changes.
> Current CI/CD practices often make it very easy for software developers to commit and roll out database migrations to a production environment, only to find themselves in the middle of a production incident minutes later. While a staging deployment might help, it's not guaranteed to share the same characteristics as production (either due to the level of load or monetary constraints).
(neon.tech employee here)
This is where branching databases with production data helps quite a bit. Your CI/CD environment and even staging can experience the schema changes. When you build from a seed database you can often miss this kind of issue because it lacks the characteristics of your production environment.
But the author rightly calls out how staging isn't even enough in the next paragraph:
>The problem is, therefore (and I will repeat myself), the scale of the amount of data being modified, overall congestion of the system, I/O capacity, and the target table's importance in the application design.
Your staging, even when branched from production, won't have the same load patterns as your production database. And that load, and the locks that come with it, will result in a different rollout.
This has me thinking about whether you can match the production environment's patterns in staging by setting staging up to mirror the query patterns of production. Mirroring, like what's available from pg_cat, could put your staging under similar pressure.
And then this also made me think about how we're not capturing the timing of these schema changes. Unless a developer looks and sees that their schema change took 56 seconds to complete in the CI system, you won't know that the change might have larger knock-on effects in production.
Author here - this is my primary goal: exposing the complexity developers might not even think about. I can't even count the number of times a seemingly innocuous change caused an incident.
"Works on my DB" is new "works on my machine" (and don't trademark it, please :)))
Agreed! A common ORM pitfall is the column rename, which often doesn't get implemented as a rename so much as a DROP and ADD, which will affect the data in a surprising way :-D
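To make that concrete, a minimal sketch against a hypothetical users table - a true rename versus what some auto-generated migrations produce:

    -- Safe: a rename is a catalog-only change in PostgreSQL; no rewrite, data preserved.
    ALTER TABLE users RENAME COLUMN name TO full_name;

    -- What an ORM auto-migration may generate instead: the "rename" becomes
    -- a drop + add, silently discarding every existing value.
    ALTER TABLE users DROP COLUMN name;
    ALTER TABLE users ADD COLUMN full_name text;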
Good post, you can tell the author has some burn marks from seeing some sparks in production.
> Let's start with The Bad News - you cannot avoid rewriting the entire table, which will generate a significant amount of WAL files in the process. This is a given, and you must plan how to manage it.
There are some exceptions to this. The blog post focuses on changing a column from int to bigint, which indeed needs a full table rewrite to reallocate 8 bytes per ID value instead of 4. But:
> As an exception, when changing the type of an existing column, if the USING clause does not change the column contents and the old type is either binary coercible to the new type or an unconstrained domain over the new type, a table rewrite is not needed. [0]
So, you could change from varchar to text, for example, or widen varchar(20) to varchar(100), and not incur a full table rewrite. (Shrinking a varchar, e.g. varchar(20) to varchar(4), does still force a rewrite, since the length coercion is no longer a no-op.)
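A quick sketch of the difference, against a hypothetical orders table:

    -- Catalog-only change, no rewrite: varchar is binary coercible to text.
    ALTER TABLE orders ALTER COLUMN note TYPE text;

    -- Also no rewrite: widening the length limit is a no-op for existing values.
    ALTER TABLE orders ALTER COLUMN code TYPE varchar(100);

    -- Full rewrite: int and bigint differ on disk (4 vs 8 bytes),
    -- so every row has to be rewritten.
    ALTER TABLE orders ALTER COLUMN id TYPE bigint;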
Microsoft SQL Server has a similar issue, and likewise when adding a new column to a table, which I found out by accident when trying to deploy a migration that had worked fine in our non-production environments.
Adding the column as not nullable and setting a default is the absolute killer.
A workaround when you still want a default is to do a three-pass migration: first add the column as nullable with no default, then run an update to set the value on all rows, and finally change it to not nullable and add the default constraint.
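A sketch of that three-pass shape, in PostgreSQL syntax to match the rest of the thread (the SQL Server statements differ slightly; table and column names are hypothetical):

    -- Pass 1: add the column as nullable with no default (cheap metadata change).
    ALTER TABLE accounts ADD COLUMN status text;

    -- Pass 2: backfill existing rows, ideally in batches to limit lock time and WAL volume.
    UPDATE accounts SET status = 'active' WHERE status IS NULL;

    -- Pass 3: add the default for future rows and enforce NOT NULL.
    ALTER TABLE accounts ALTER COLUMN status SET DEFAULT 'active';
    ALTER TABLE accounts ALTER COLUMN status SET NOT NULL;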
Changing a PostgreSQL column type without following the author's instructions and just running the following command is very much an anti-pattern. I'm confused why people do this in the first place.
ALTER TABLE table_name
ALTER COLUMN column_name
[SET DATA] TYPE new_data_type
>you need to make sure the source system has enough disk space to hold the WAL files for a long enough time
If the asynchronous replication process has an external buffer instead of relying on the WAL, then that addresses this issue.
Because you lose a significant amount of performance if you start adding nullable and variable-length columns just because you're afraid of a table rewrite.
Because the resulting table will not have accumulated a full table's worth of update-induced bloat by the end of the operation.
Because you can be sure the modification is applied atomically, and you as the user can be sure the migration from A to B either goes through as expected or rolls back gracefully to the old data, rather than getting stuck or failing halfway through.
Because toasted data from DROP-ed columns is not removed from storage with the DROP COLUMN statement, but only after the row that refers to that toasted value is updated or deleted.
...
Every column you "DROP" remains in the catalogs to make sure old tuples' data can be read from disk. That's overhead you now will have to carry around until the table is dropped. I'm not someone who likes having to carry that bloat around.
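For anyone who wants to see that leftover metadata, a small catalog query (assuming a hypothetical my_table):

    -- Dropped columns stay in pg_attribute, marked attisdropped, so that
    -- old tuples on disk can still be decoded.
    SELECT attname, attisdropped
    FROM pg_attribute
    WHERE attrelid = 'my_table'::regclass
      AND attnum > 0;
    -- Dropped columns show up with names like "........pg.dropped.2........"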
For "How to Safely Change a PostgreSQL Column Type" (new column, trigger, etc.): Is there a particular reason PostgreSQL cannot set up column migration itself out-of-the-box? I have used PSQL for many years and it is always a bit cumbersome to do the column migrations manually, even though the process itself is quite mechanical.
> Is there a particular reason PostgreSQL cannot set up column migration itself out-of-the-box?
People haven't asked the right people hard enough, I suppose. PostgreSQL is an open-source project, where wanted and needed features are supplied by willing individuals and companies, and vetted by the community so that the code base remains of good quality.
I just suppose no one has yet bothered implementing ALTER TABLE CONCURRENTLY to the point where it would be accepted by the community, though another reason might be that the induced metadata churn (at most 1600 distinct user-defined columns are available in each table) might become problematic with ADD COLUMN ... DROP COLUMN.
So PostgreSQL and MySQL can both do full table copies as needed, their locking has gotten better, and they can do more changes in place. Yet still, too often they'll need exclusive locks, which block even reads.
For very big tables it's often better to manually add a column, backfill, then drop-swap.
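A heavily simplified sketch of that shape for the int-to-bigint case (hypothetical names; a real migration on a primary key also has to deal with indexes, constraints, foreign keys, and sequence/identity ownership):

    -- 1. Add the new column (cheap metadata change).
    ALTER TABLE orders ADD COLUMN id_new bigint;

    -- 2. Keep it in sync for new writes, e.g. via a trigger or by writing
    --    both columns from the application.

    -- 3. Backfill existing rows in small batches.
    UPDATE orders SET id_new = id WHERE id_new IS NULL AND id <= 100000;
    -- ...repeat for the remaining ranges...

    -- 4. Drop-swap in one short transaction.
    BEGIN;
    ALTER TABLE orders DROP COLUMN id;
    ALTER TABLE orders RENAME COLUMN id_new TO id;
    COMMIT;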
My thoughts exactly. It's surprising that external online schema change tools for Postgres have only become a thing fairly recently! The only two I'm aware of are:
Meanwhile over in the MySQL and MariaDB ecosystem, external OSC tools have been around for quite some time, starting with oak-online-alter-table over 15 years ago. The most popular options today are pt-online-schema-change or gh-ost, but other battle-tested solutions include fb-osc, LHM, and the latest entry Spirit.
I’m probably alone here and this goes against the HN consensus, but, as great a piece of tech as Postgres is, I’ve often found it to be a premature optimization to pick as the main DB for newer companies. If you don’t know what your data models are, you’re going to end up in worlds of pain constantly running schema migrations like the OP describes.
It's not Postgres, specifically, as much as any SQL or non-schemaless database, right?
And if we're saying that's a problem, then sounds like we're going back into the NoSQL debates from a decade ago.
Hopefully not.
I think it's better to understand your schema as much as possible, and have a sane process for applying changes when needed. Defining a schema forces you to think about what you're building.
OTOH, the idea that developers on a project are just going to throw whatever new attributes they need into a document as they go along is a recipe for bugs and inefficiency. Also, near-instant technical debt, as early work frequently survives longer than anticipated.
You also don't completely escape data changes without pain when using a NoSQL database. If, for instance, you change a string to an int, you'd still need to figure out what to do with the existing data, either by converting it or by handling it in code.
> Defining a schema forces you to think about what you're building.
YES. Thank you. Sit down with pencil and paper, write down a table name, and start putting attributes into it. Then define a PK, and ask yourself if every attribute is directly related to the PK (a user named foo has an id of 1, and lives in country bar). Repeat. Then ask yourself how you’d join the tables. If you find that something _could_ be represented as a join, but isn’t, consider doing so.
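A tiny sketch of that exercise (hypothetical names): the country becomes its own table reachable through a join rather than free text repeated on every user row.

    CREATE TABLE countries (
        id   integer PRIMARY KEY,
        name text NOT NULL UNIQUE
    );

    CREATE TABLE users (
        id         integer PRIMARY KEY,
        name       text NOT NULL,
        country_id integer NOT NULL REFERENCES countries (id)
    );

    -- The join you asked yourself about while sketching:
    SELECT u.name, c.name AS country
    FROM users u
    JOIN countries c ON c.id = u.country_id;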
There are still risks involved in migrations (mostly from the migration executing too quickly and creating high load on the cluster - the admission control system should have reduced this), and we have extra review steps for them, but it's been very useful to be able to migrate large tables without any extra application-level work.
I'd say Postgres is very often the right choice for a newer company: it's well understood, easy to operate, and you know you won't have to switch to a new DB because the vendor got acquired / merged / shut down after 2 years, or is going through growing pains and deprecations themselves.
If you give your schema good thought up front (the one place where you shouldn't rush or take shortcuts at the beginning), and for example use jsonb columns and later move data out of them if you notice you need to query it more performantly, you can get very far.
The pain of data model migrations is also usually not that big if the company isn't very large and doesn't have a lot of legacy yet.
Not really, it doesn't mean you put everything into a jsonb field. It could mean, for example, that if you have some user-specific settings you just drop them into a jsonb column on the user itself instead of building a schema with mapping tables, permissions, etc., as you don't know yet which options you want to support. Once things stabilize you can pull it out of there.
From my personal experience this works really well and is a nice balance between a strict schema, and still allowing some flexibility for experiments.
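For illustration, a minimal version of that pattern (hypothetical column and keys):

    -- Experimental, loosely-structured settings live in a jsonb column...
    ALTER TABLE users ADD COLUMN settings jsonb NOT NULL DEFAULT '{}';

    UPDATE users
    SET settings = settings || '{"theme": "dark", "weekly_digest": true}'
    WHERE id = 42;

    -- ...and can be read and filtered without any mapping tables.
    SELECT id, settings ->> 'theme' AS theme
    FROM users
    WHERE settings @> '{"weekly_digest": true}';

Once a key stabilizes and gets queried heavily, it can be promoted to a real column (and indexed), as described above.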
This will work, but you risk referential integrity violations, as well as performance problems at scale.
The main issue is what I said in “IME…” – tech debt builds, and people never want to go back and fix it. Just upsize the hardware, easy-peasy.
I would rather see a wide table with a bunch of bools indicating various options, personally. When that gets annoying or unperformant, split it out into logical groups. Color schemes can be their own table, email marketing preferences can be their own table, etc.
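A sketch of that progression (hypothetical names, assuming an existing users table): start wide, then peel off a logical group once it gets unwieldy.

    -- Start simple: a wide table of explicit boolean flags.
    CREATE TABLE user_options (
        user_id         integer PRIMARY KEY REFERENCES users (id),
        dark_mode       boolean NOT NULL DEFAULT false,
        weekly_digest   boolean NOT NULL DEFAULT false,
        product_updates boolean NOT NULL DEFAULT false
    );

    -- Later: split a logical group, e.g. email marketing preferences, into its own table.
    CREATE TABLE email_marketing_preferences (
        user_id         integer PRIMARY KEY REFERENCES users (id),
        weekly_digest   boolean NOT NULL DEFAULT false,
        product_updates boolean NOT NULL DEFAULT false
    );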
Which is precisely the caveat I mentioned at the beginning:
> and for example use jsonb columns and later move data out of it if you notice you need to query on it more performantly you can get very far.
If you put data somewhere you want the database to enforce integrity, then a jsonb column is the wrong place for that data. If you're running into scaling problems, you're relying on data in jsonb columns for heavy queries, which you shouldn't; in that case it should've been moved out already.
As always it's about tradeoffs and being pragmatic. There's no 100% way of saying jsonb is always the wrong choice, or always the right choice. You still have to be smart about when to reach for it.
> I would rather see a wide table with a bunch of bools indicating various options, personally. When that gets annoying or unperformant, split it out into logical groups. Color schemes can be their own table, email marketing preferences can be their own table, etc.
The point is exactly to avoid this kind of overhead when you have zero paying customers, as that's premature optimization. Of course from a pure data-model perspective it's nice, but from a business perspective you don't need that until it hurts and you have to split it out.
I'm extremely curious to hear what you consider a better/simpler choice. At least postgres gives you the tools to do schema migrations, and if you're operating at a scale where such migrations become a problem (i.e. probably not for a while) you really ought to know what you're doing.