CREATE COLLATION case_insensitive (
    provider = icu,
    locale = 'und-u-ks-level2',
    deterministic = false
);

CREATE TABLE names(
    first_name text,
    last_name text
);

insert into names values ('Anne', 'smith'), ('anne', 'Smith');

select * from names
order by
    last_name collate case_insensitive,
    first_name collate case_insensitive;
What I'm most excited about is the fact that WITH clauses will no longer necessarily be optimization fences — I have some big queries that are really only manageable this way, and now they may well be very much faster.
Sort of. It was doing case-folding based on the LC_CTYPE set at initdb time, so if you have data from different locales to deal with, it would only work for one locale and be (subtly) broken for the others (Turkish comes to mind).
It's also based on glibc's interpretation of case-folding rather than what the Unicode consortium has agreed upon.
Depending on locale and tolerance to weirdness that can be anywhere between a non-issue and a show-stopper.
Case-folding is very tricky business.
Creating views just for one query is kind of like creating temporary variables for each subexpression. Sure, it works, but not a great solution.
Also, other databases do optimize through a CTE. So if you port to PG 11, you might see a lot of regressions in plans unless you rewrite to use views instead. PG 12 will work a lot better for existing queries.
> Additionally, PostgreSQL 12 adds the ability to rebuild indexes concurrently, which lets you perform a REINDEX operation without blocking any writes to the index. This feature should help with lengthy index rebuilds that could cause downtime when managing a PostgreSQL database in a production environment.
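As a rough sketch of the new syntax (the index and table names here are made up):

```sql
-- PostgreSQL 12: rebuild an index without blocking concurrent writes
REINDEX INDEX CONCURRENTLY idx_orders_created_at;

-- also works for all indexes on a table at once
REINDEX TABLE CONCURRENTLY orders;
```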
pg_repack even goes a bit farther: it doesn't just rebuild indexes but can also rewrite a table completely, in any order you want, all while keeping the original table fully usable.
Compared to what's now coming built-in there is one restriction, though: for pg_repack to work, the table needs a unique key of some sort. But usually you have a synthesized primary key anyway.
It also once did a bad thing to an unlogged table: the data got synced to the replica with no way of getting rid of it again, though that might have been fixed since.
I have used pg_repack many times and it's both very useful and extremely reliable.
One caveat: it needs to acquire two very short-lived exclusive locks (sub-0.1s), and you can tell it after how much time it should get impatient and just kill whatever is preventing it from acquiring that lock.
> "In Oracle, readers never wait for writers or vice versa. I can serve my 18,000 daily users. I can EXPORT a consistent backup dump. I can update a 20,000 row table. I can rebuild an index. All simultaneously."
SQL Server looks pretty clean to me.
ALTER INDEX IX_name ON schema.table REBUILD WITH (ONLINE=ON);
I'm not familiar with JSON path queries - how do they compare to the jsonb querying syntax that Postgres already has?
Also, the syntax is standardized so people have probably encountered it before in e.g. aws cloudwatch JSON metric filters
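For a rough side-by-side (the `orders` table and its keys are made up for illustration), here's an existing jsonb-operator query next to its SQL/JSON path counterpart using the path functions added in 12:

```sql
-- existing jsonb operators
SELECT doc -> 'customer' ->> 'name'
FROM orders
WHERE doc @> '{"status": "shipped"}';

-- PostgreSQL 12 SQL/JSON path equivalents
SELECT jsonb_path_query(doc, '$.customer.name')
FROM orders
WHERE jsonb_path_exists(doc, '$ ? (@.status == "shipped")');
```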
"Generated columns" and "foreign keys to partitioned tables" are two that caught my eye.
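A quick sketch of both features (table and column names are made up):

```sql
-- generated column: always computed from other columns, stored on disk
CREATE TABLE people (
    height_cm numeric,
    height_in numeric GENERATED ALWAYS AS (height_cm / 2.54) STORED
);

-- foreign key referencing a partitioned table, new in 12
CREATE TABLE orders (id bigint PRIMARY KEY) PARTITION BY RANGE (id);
CREATE TABLE order_items (order_id bigint REFERENCES orders (id));
```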
Citus is very much focused on simplifying all the extra work you'd otherwise have to do: to your application it still appears as a single node, and you don't have to wire together FDWs or teach your application about different nodes/shards.
We're pretty excited at Citus about a lot of the work going into 12, because it allows us to keep tackling the scaling piece and leverage more of what's in the box.
I know to a large extent "it depends", but it would be nice to get an idea of some numbers here - kind of surprised it doesn't say.
The mouse genome database is perhaps the only real-world Postgres database that is publicly available and isn't trivially small. It is available from http://www.informatics.jax.org. There are quite a number of publicly available datasets, of course, but this is specifically a raw pg_dump that includes indexes that are actually useful to the underlying application. A living, breathing database.
One of the test cases used during development involved restoring the dump, and then re-inserting the data with retail insertions (pristine indexes generated by CREATE INDEX are not very interesting). Something like the following, repeated for each table:
create table __foo (like foo including indexes);
insert into __foo select * from foo;
Afterwards, all of the indexes taken together (all indexes on all "__foo"-like tables) were about 15% smaller than they were when the same procedure was performed on a v11 Postgres. Since the database has many indexes on each table, the overall size of the database was reduced by about 10%.
That is only one anecdote, involving a database that happened to have a lot of indexes that each stored a lot of duplicates. And, this tells you nothing about how latency or throughput are affected for the application itself, which is usually much more interesting. Your mileage may vary, but this is what happened in the one real-world case that I actually tested. It was not cherry-picked.
Andres provided a link to some TPC-C benchmark results that I published, showing significant improvements despite the fact that that benchmark involves few or no duplicates in indexes. That's due to another specific issue with how we choose where to split leaf pages.
Overall, the best way to think of the v12 work is as something that fixes cases where Postgres previously didn't do so well. It makes performance more consistent overall.
To me, it's the No. 1 software where every new update brings super exciting things.
Thanks for the efforts you're putting in.
Some numbers are here https://www.postgresql.org/message-id/CAH2-WzmsK-1qVR8xC86DX...
but there are a lot more details in other parts of this and other threads.
And for one of the other btree improvements, it hugely depends on how high your concurrency is, and whether your data is ordered in some sense or not:
Also excited about generated columns; we do a fair amount of denormalization in postgres, and managing that stuff in code can get pretty error-prone.
The way Oracle works, when you do transactional row update (e.g. UPDATE), the previous version is read and copied to the undo log, and then the update is performed in-place in the table even though it's not committed yet.
Each row (or maybe it's per page — it's been a while) has a "transaction slots" header, and the transaction then writes its ID to this list. If another, concurrent transaction does a SELECT against the same row, it will see in the slot list that this row is from an uncommitted transaction, and if your transaction isolation level is high enough, it will then look into the undo log for the currently committed row.
This design means that there's no transactional overhead for reads done against rows that aren't being written to, so it's optimized for the general case — performance is only impacted when a transaction has dirty rows and other transactions have to dip into the undo log to find the old versions. The downside is that commits and rollbacks are slower; for commits, all the transaction slots need to be updated, and for rollbacks the undo log needs to be copied back into the table.
Postgres's current MVCC is very elegant in theory — MVCC means never updating anything in-place, always appending, so there's never any read/write contention, and commits/rollbacks are near-instantaneous. But the garbage collection aspect ("vacuum") has always been a challenge, and Postgres has mitigated a lot of performance issues over time with clever hacks that compromise the purity of the MVCC model (e.g., HOT tuples, index visibility bits). I think it's definitely time for alternative approaches.
Commit is not slower. Some of the work is just delayed: e.g. the first SELECT after a change may have to check and update the row status in the data block by peeking at the transaction state in the undo log.
A huge rollback is the real pain, something you don't want to do, and in some situations you would prefer to go undo-less.
That's pretty nice.
Just imagine how many permutations we'd need to unit test the implementation of full MERGE. Sounds scary. Better develop a frequently used subset, the UPSERT, and stop at that.
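That subset is INSERT ... ON CONFLICT (the table here is hypothetical):

```sql
-- UPSERT: insert a new row, or atomically update the existing one
INSERT INTO page_hits (path, hits)
VALUES ('/home', 1)
ON CONFLICT (path)
DO UPDATE SET hits = page_hits.hits + EXCLUDED.hits;
```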
This sounds interesting - I wonder if we'll see a whole bunch of weird and wonderful plugins built on this!
Really intrigued to see what people come up with.
Currently, we have to manually detach the default partition, create the new partitions, move the data from the default partition into the newly created partitions, then re-attach the default partition. That's a lot of work compared to the REORGANIZE PARTITION feature of MySQL.
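For reference, the manual dance described above looks roughly like this (all table and column names are made up):

```sql
BEGIN;
-- 1. detach the default partition
ALTER TABLE measurements DETACH PARTITION measurements_default;
-- 2. create the new partition
CREATE TABLE measurements_2019 PARTITION OF measurements
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
-- 3. move matching rows out of the old default partition;
--    they now route to the new partition
WITH moved AS (
    DELETE FROM measurements_default
    WHERE ts >= '2019-01-01' AND ts < '2020-01-01'
    RETURNING *
)
INSERT INTO measurements SELECT * FROM moved;
-- 4. re-attach the default partition
ALTER TABLE measurements ATTACH PARTITION measurements_default DEFAULT;
COMMIT;
```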
Just one word of caution: I'd be careful deploying betas into production. Even leaving possible bugs aside, while they'll be upgradable using pg_upgrade, they're not guaranteed to be on-disk compatible with the final version. In fact, we're already certain that there'll be some small changes between beta1 and beta2.
Edit: for testing purposes
I used ClickHouse previously with large data like this and it worked much better. Obviously, I'm comparing apples with oranges, but couldn't PostgreSQL support a columnar storage engine or some kind of index for this?
What made that "too slow"?
I just tested it, and it's not too bad. This is on my development environment (~3yo laptop, reasonably powerful), with all kinds of compile-time debugging enabled - but with compiler optimizations turned on.
CREATE TABLE hashes(hash text, count int);
\copy hashes FROM program '7z e -so ~/tmp/pwned-passwords-sha1-ordered-by-count-v4.7z pwned-passwords-sha1-ordered-by-count-v4.txt' WITH (DELIMITER ':', FREEZE);
Time: 1245024.360 ms (20:45.024)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24403 andres 20 0 31480 22292 4544 R 73.8 0.1 14:54.09 7z
23701 andres 20 0 4380008 11488 9024 S 27.9 0.0 5:42.04 postgres
Creating an index:
CREATE INDEX ON hashes(hash COLLATE "C");
Time: 709595.370 ms (11:49.595)
This is all while I was also working on the same laptop, including recompiling etc.
EDIT: Formatting #2
So what do you mean "too slow?" How long did it take?
I don't have any experience with Clickhouse, and not much with columnar databases in general, but if we're talking about a simple table with one index over one text column, I'm not sure whether it makes any difference if you store the tuples row-wise or column-wise on disk. It's an index that covers 100% of the data either way.
I don't know anything about your server but it sounds like PostgreSQL just wasn't able to get much IO throughput from your storage system. Things like storage hardware, filesystem type, and kernel parameters are the big factors here.
The default settings of Postgres are more tuned for OLTP, though, while you wanted to do OLAP. At least increasing the per-session memory limit would be advisable to get anything done in a reasonable amount of time (the default is something tiny like 64MB). In general, configuring Postgres to use more RAM is important to make things fast.
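For an analytics-style session, that could look something like this (the values are illustrative, not recommendations):

```sql
-- raise per-operation sort/hash memory for this session
SET work_mem = '256MB';
-- speeds up CREATE INDEX and VACUUM
SET maintenance_work_mem = '1GB';
```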
Hopefully it will be fixed with new parallelism improvements.
It'd be cool if you could report details of the problem you had.
The hardware is important. If you tried to import that data on a $5/mo DO droplet, I wouldn't be surprised by having issues.
Also, gotta wonder if you have good indices on the data.
As others have said, the amount of data you're talking about is not very big. I regularly deal with Postgres databases that are two orders of magnitude larger, and they are lightning fast.