1.) Start with a nullable column where some existing rows contain null values.
2.) Make the column required with default=100. Postgres now marks atthasmissing as true and attmissingval as 100. Rows will be lazy-populated to 100, but aren't immediately modified.
3.) Make the column nullable again.
4.) Add some new null values.
5.) Make the column required with default=200.
At this point, the right behavior is that all originally-null rows from step #1 have value 100 and all rows added in step #4 have value 200, but that's not representable with the atthasmissing/attmissingval system. I guess maybe in steps 3 or 5 it needs to just bail out and do a full table rewrite?
I was thinking through this problem a while ago and one idea I had was that you could version your schema and have each row remember what the schema version was when the row was created. Then you could lazy-populate any null value by looking up the next default value for the column chronologically in schema history. But I guess maybe that's complex enough or would hurt common-case perf enough that Postgres realistically isn't ever going to implement it.
This should fail at step 2 (and again at step 5). Modifying a column's default value never modifies its existing rows. DDL (e.g., ALTER TABLE) should never modify data; you need DML (e.g., UPDATE) to do that. Example:
CREATE TABLE product (
    id SERIAL PRIMARY KEY,
    quantity int NULL
);

INSERT INTO product (quantity) VALUES (NULL);

ALTER TABLE product ALTER COLUMN quantity SET DEFAULT 0; -- changes nothing in existing rows
ALTER TABLE product ALTER COLUMN quantity SET NOT NULL;  -- ERROR: column "quantity" contains null values

SELECT * FROM product WHERE quantity IS NULL;            -- the NULL row is still there
You'd need to run this before modifying the NULL constraint to NOT NULL:
UPDATE product SET quantity = 0 WHERE quantity IS NULL;
Remember that a column without an explicit default actually has an implicit DEFAULT NULL (unless it's something special like SERIAL). That's how an RDBMS knows to throw an error when you don't supply a value for a NOT NULL column: it tries to default the column to NULL and can't.
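A quick way to see both halves of that in action (a toy table; names are made up):

CREATE TABLE t (a int NOT NULL, b int);
INSERT INTO t (a) VALUES (1); -- b silently gets its implicit DEFAULT NULL
INSERT INTO t (b) VALUES (2); -- ERROR: null value in column "a" violates not-null constraint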
Just to confirm/repeat back:
* When you say DDL operations (ALTER TABLE) will "never modify data", they're still allowed to add a column with a non-null default since that's not "modifying data" by your definition, it's picking the value to use when introducing new data. I shouldn't think of it as "set implicit null to 3", I should think of it as "set nothingness to 3", just like how it would "set nothingness to null" in normal column creation. But it would never be willing to "set null to 3" in a DDL operation.
* Postgres doesn't have an O(1) way (meaning: never does a table scan while holding a significant lock) to transition a nullable column to non-null, since it'll need to check the contents of the entire table. (Although maybe it could use an index?) I guess I was hoping that the default would help here, but really it only affects creation of new values.
* More generally speaking, Postgres's design direction is to put data integrity as a higher priority than scale. It'll do both if it can (like in the article), but there are still cases where schema changes might cause user disruption if done on a gigantic table. It could have been designed so that schema changes are always fast (like they tend to be in NoSQL systems), but that would come at the expense of other nice properties.
* Correct. Adding a constraint or unique index will always require scanning the data to complete the transaction, to ensure it's consistent with the new rules. It's safe to assume that modifying the schema of a table in essentially any way will require an exclusive table lock to complete. Even creating an index will often lock a table. PostgreSQL does allow you to create some indexes without that lock via the CONCURRENTLY option (a sketch follows this list), and other RDBMSs may have similar options, but that's an additional feature with its own caveats that you'll want to check in your specific implementation.
* Correct. The design of the relational model that traditional SQL RDBMSs implement is, as a whole, focused on ACID compliance and strict determinism. It's about being correct first and foremost; performance and availability take a backseat to that (which is where NoSQL data stores step in with different priorities). You wouldn't want that kind of focus on a social media site, where it's more important to be available and nobody minds if a bit of data is lost, broken, or out of date now and then. But a financial, government, or medical system must store data correctly and must correctly retrieve the data it claimed to store. Accuracy is paramount in those systems. This is one reason DBAs have a fairly universal reputation as extremely cautious perfectionists (the other reason is that if you don't design your tables and normalize your data, it becomes extremely difficult to work with down the line).
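A minimal sketch of the CONCURRENTLY option mentioned above, reusing the product table from earlier (the index name is made up):

-- Builds the index without holding a lock that blocks writes. The tradeoff is a
-- slower, multi-phase build, and it can't run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_product_quantity ON product (quantity);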
MySQL is somewhat notorious for breaking many of the above rules, especially prior to MySQL 5.0. That is half the reason why it tends to have such a poor reputation among DBAs (the other reason being that Oracle is a terrible vendor). MySQL used to allow you to store invalid dates, would silently modify or truncate data rather than error, etc. It's much, much better now, but reputations are hard to repair in a world of strict rules and deterministic functions.
That was a lot of places to have to patch to make this happen!
Finally, you add the default to the schema, which will now be in the millisecond range because all/most rows already have a value.
And then you add the NOT NULL constraint, which will also be in the ms range.
You can do all of this in a transaction, too, because Postgres's schema modifications are transactional.
This is how I have added countless columns with default values without downtime over the last few years.
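For reference, a sketch of that whole pattern (the table/column names and batch bounds here are made up):

-- 1. Add the column as nullable: no rewrite, the lock is held only briefly.
ALTER TABLE orders ADD COLUMN status text;

-- 2. Backfill existing rows in small batches, so no single UPDATE
--    holds locks for long or bloats the table all at once.
UPDATE orders SET status = 'new' WHERE id BETWEEN 1 AND 10000 AND status IS NULL;
-- ...repeat for the remaining id ranges...

-- 3. Then the default and the constraint, both fast, and transactional:
BEGIN;
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'new';
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
COMMIT;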
The following query will show the dead tuple count for the table:
select tuple_count, dead_tuple_count from pgstattuple('table_name');
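Note that pgstattuple ships as a contrib extension, so it has to be enabled once per database before that query works:

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT tuple_count, dead_tuple_count FROM pgstattuple('table_name');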
Create a temporary index covering that column; then it will be in the range of ms.
With this, the re-check for NULLs when adding the NOT NULL constraint will take hardly any time at all (something like the sketch below).
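If I'm reading the suggestion right, a partial index like this (names made up, continuing the orders example) keeps the "any NULLs left?" check cheap while you backfill:

-- A partial index over only the NULL rows: tiny, fast to build, fast to probe.
CREATE INDEX CONCURRENTLY idx_orders_status_null ON orders (id) WHERE status IS NULL;
SELECT count(*) FROM orders WHERE status IS NULL; -- satisfied by the partial index
DROP INDEX CONCURRENTLY idx_orders_status_null;   -- drop it once the backfill is done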
And another example of a feature that’s only available in Microsoft SQL Server Enterprise Edition ($$$), but effectively free in Postgres.
There are lots of reasons to criticize MSSQL (ridiculous cost, poor syntax/language extensions, awful text search, a painful process for ad hoc (non-ETL) imports, etc.), but my decade-old 2008 edition is still superior to PG in many ways. Why can't we have honest comparisons?
>>> and a gaping hole in Postgres’ operational story is filled
Are there any other holes one should be aware of? I'm looking forward to migrating a database with ~100 tables, many of which have ~100 million rows and lots of FKs, to PostgreSQL... The operational side of things is what I know the least. Any pointers?
HPE did an excellent technical, no marketing BS writeup on PG11. I highly recommend everyone read this.
> Despite the flexibility, this approach forces the planner to perform an exhaustive search and to check constraints on each partition to determine whether it should be present in the plan or not. Large amount of partitions may result in significant planning overhead.
> The pg_pathman module features partition managing functions and optimized planning mechanism which utilizes knowledge of the partitions' structure.
If you're running OLTP, the best advice is to create a reasonable load tester and use it for all future schema migrations. This will find unexpected table locks caused by DDL commands like CREATE INDEX, ALTER TABLE, etc.
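A cheap companion to the load tester (assuming PG 9.6+, where pg_stat_activity has the wait_event columns): while the migration runs, poll for sessions stuck waiting on locks from another session:

-- Sessions currently blocked on a lock, e.g. queued behind an ALTER TABLE.
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';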
The docs explain the different settings: https://www.postgresql.org/docs/9.6/static/runtime-config-au...
When tweaking performance settings, always benchmark, especially with your specific load characteristics.
For a big table with frequent writes, the default is too high: it lets 20% of the table accumulate as dead tuples before a vacuum triggers.
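If that's the concern, autovacuum can be tuned per table; a sketch with made-up values (assuming the 20% figure refers to the default autovacuum_vacuum_scale_factor of 0.2):

-- Vacuum this table at ~1% dead tuples instead of the global 20% default.
ALTER TABLE big_busy_table SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold = 1000
);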