Never agreed with something so much in my life. Points 1-3 in particular. I understand their logic of not wanting to "pollute" the core product with these features.... but it's sort of past the point now where you start to expect this stuff in a modern database product (instead of add-ons, hacks, and outdated blog posts)
It is kind of ridiculous that the first three haven't been sorted by now.
And over time it will increasingly relegate PostgreSQL to being for development only with production use being handled by wire-compatible databases e.g. Aurora.
> 2) Horizontal scalability without having to resort to an extension like Citus.
Just curious: what would having the solution in-core save you? Installation, sure, but that's a one-off, possibly in your deployment code. "CREATE EXTENSION citus" and adding that to postgresql.conf? Sure, but not too much work for me. The rest (the commands to actually create the nodes and do the sharding itself) is something I cannot imagine being different or simpler with an in-core solution.
It means that when you upgrade PostgreSQL you don't have to worry about whether the extensions are compatible and have been fully tested, not just for functionality but for security etc. as well.
And most importantly it means you don't have to worry if that extension will move to a freemium model which (a) often has important features out of your price range and (b) is generally unacceptable in enterprise environments.
Right, those are compulsory steps for every upgrade.
Yet in the particular case of Citus, history (so far) has shown that a) they update the extension regularly and fast, so by the time you want to upgrade to a newer major version you already have an updated Citus too; b) they are going in exactly the opposite direction of "freemium": they actually open sourced even the previously proprietary bits; c) as OSS, it can always be forked, and if it went closed source one day, being such an important project, it would definitely be forked.
Sharding databases is not such a dark, magic art. Citus relies on Postgres for many key features; and Citus does the rest. It's already quite "feature complete".
If Citus were to become proprietary overnight, my main concerns about maintaining a fork would be the codebase and the language expertise more than the sharding concepts.
Note that sharding is different from a purely distributed database. The latter is an entirely different class of system (and a more complex one).
You can't use those extensions on almost any cloud like AWS, GCP, Aiven, etc. I think Azure is the only one offering a Citus product, because they acquired them. Also, extension updates are a major pain point in hosted DBs, always lagging behind. Having some solution in core would resolve this, even if it were just bundling some best-in-class extensions out of the box to bypass these cloud providers' very selective extension support.
It is funny to see this, because every time these same reasons are listed as why people continue to use MySQL, PostgreSQL folks are quick to reply that most of them are non-issues.
This sort of tribal framing doesn't do anything to help the situation. I'm immediately willing to believe that the "Postgres folks" saying that genuinely have not found these points to be issues, which doesn't mean that OP hasn't.
Big second on horizontal scaling. Amazon Redshift is a great example of what's possible, but it also has enough problems that you can't really use it as a primary application database.
Redshift is an OLAP database, much like ClickHouse. For OLAP databases, horizontal scalability is a must - due to generally larger data volumes.
While Postgres is OLTP. For OLTP databases, achieving horizontal scalability requires a more sophisticated approach to distributed consensus, like in CockroachDB or Spanner.
Currently, supporting a feature like "users can reorder the items in a playlist" is typically done using an integer position column. However, this doesn't prevent gaps (so e.g. the third item in the playlist might have `position = 42`), and inserting between two other items requires updating every row with a greater position value.
I'd like to be able to say `update ... set position = 3` to make the record the third item in the list. You'd need to be able to set the scope (eg `add column position ORDINAL WINDOW BY playlist_id` or something).
Maybe you could use the technique of maintaining a separate adjacency list column and materializing the ordering list. It would be a simplified version of what is done for hierarchy to materialize a nested set: https://www.sqlservercentral.com/articles/hierarchies-on-ste...
A DB type could be nice to hide away the adjacency list.
If I were happy to rewrite the queries, I could use something like (forgetting the exact incantation)`select row_number(), rest_of_table as position partition by playlist_id order by position` to get the position column.
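For reference, the exact incantation is roughly the following, assuming a hypothetical playlist_items table with playlist_id and position columns:

    select t.*,
           row_number() over (partition by t.playlist_id order by t.position) as new_position
    from playlist_items t;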
Use the numeric data type instead of integer, and update the position to the midpoint of the two rows you're trying to place the record between.
update ... set position = (prev + next)/2;
To get better performance you can choose to use the float data type, but then you'd be limited to a fixed precision; sufficient for most cases, though.
That still requires you to look up the element in the target position, instead of just saying “put this element in the 3rd position.”
Also, you’ll still eventually need to go clean up the entire sequence because you’ll run out of gaps between adjacent numbers. Because of this, I’d probably rather use a more predictable type (like one of the integer types) and explicitly plan my cleanup schedule.
> Also, you’ll still eventually need to go clean up the entire sequence
If you're using a numeric type, you get up to 16383 digits after the decimal. That's... Probably more precision than you'll be able to reasonably use up in almost any use case. Any time you're reordering in bulk, you're resetting the order value to a nice integer, so it would take many thousands of ad-hoc reordering operations near a single position to get it close to the precision limit, yeah?
Sure, but if it’s orders of magnitude more than you’ll ever need, you’re just using way more storage than you need. That’s what I meant by being more explicit about your plan and using more predictable storage. As a basic example, you could also use integer (or bigint) and start by numbering things like 1000, 2000, 3000, etc. Now you know exactly how many slots between items you have, and can more easily query for cases where you’re running low on slots.
Capacity doesn't mean storage. The default text field can store orders of magnitude more than I might need for a field, but that doesn't mean it takes orders of magnitude more storage.
Or model it as a linked list and you can sidestep the limitations / complexity of some kind of numeric (or bytes / text based ordering field)
    create table playlists (
      id bigint primary key
    );

    create table playlist_members (
      id bigint primary key,
      playlist_id bigint not null references playlists (id),
      prev_playlist_member_id bigint references playlist_members (id),
      -- and/or next_playlist_member_id
      song_id bigint not null
    );
you could then just select * from playlist_members where playlist_id = ... and sort on the client side.
you'd probably add an application limit where playlists have a max length of some kind.
re-orders can be done in a fixed number of row updates and typical application queries are still possible / fast.
or perhaps for some applications it would be sufficient to do
    create table playlists (
      id bigint primary key,
      song_ids bigint[]
    );
and just store the ordering in an array. Some postgres drivers might start shitting the bed though at some gigantic array sizes, but a playlist probably has reasonable enough limits that you wouldn't have a big problem.
ID arrays can't use FK constraints (requested elsewhere in these comments), otherwise that would be pretty good. Performance-wise it means more IO than optimal, but that's not necessarily a huge problem.
If you were going to enforce a maximum size so that you could sort on the client, I bet you could just use an integer and rewrite the entire sequence from 1 to N on every reorder operation. Unless you were expecting to have way more writes than reads, which is a little hard to imagine.
Indeed, but AFAIK that is still much slower than sorting on an indexed integer column, which probably means you still need to enforce a maximum list length.
Writing a linked list in the database might sound good in theory; as someone maintaining a system that uses that technique, please please please do not ever do it. You lose access to basically every database-provided consistency technique.
That's exactly what I want the ordinal type to do transparently under the hood, automatically touching the minimum possible number of rows when I specify e.g. `position = 3`. Bonus points if there's an autovacuum-style procedure to move records apart if they're sitting too close together.
I know the postgres devs don't like them, and that the query planner should be good enough that they're not needed, but it's not, and it regularly fucks up.
I would go further: give me a PL/ language (or an equivalent bytecode-abstract-machine abstraction) that lets me program — at least in a read-only capacity — directly against the access-method handles, such that the DB's table heap-file pages, index B-tree nodes, locks, etc. are "objects" I can manipulate, pass to functions, and navigate graph-wise (i.e. ask a table for its partitions; ask a partition for its access-method; type-assert the access-method as a heap file; ask for the pages as a random-access array; probe some pages for row-tuples; iterate those, de-serializing and de-toasting them as a type-cast; and then implement some efficient search or sampling on those tuples. Basically the code you'd write in a PG extension to interact with the storage engine on that level, but limited to only "safe" actions against the storage, and so able to be directly user-exposed.)
The closest analogy I know of to that, is how you work with ETS tables in Erlang. I want to send the RDBMS code that operates at that level!
Actually, I presume that SQLite would necessarily have some low-level C interface that works like this — but few people seem to talk about it/be aware of it compared to its high-level SQL-level interface.
This is vaguely how the Microsoft "Jet" database engines work. Microsoft Exchange uses this low-level query authoring technique to achieve its scalability and performance goals. This generally makes its performance consistent and predictable.
There were some attempts back in the early 2000s to move Exchange over to use the SQL Server RDBMS engine, and they added a bunch of features to enable this kind of low-level control. Not just join hints: you could force specific query plans by specifying the plan XML document for that query. See: https://learn.microsoft.com/en-us/sql/t-sql/queries/hints-tr...
This wasn't good enough however, and Exchange still uses the Jet database.
Something that might be interesting is an RDBMS "as a library", where instead of poking it with ASCII text queries, you get a full programming API surface where you can do exactly the type of thing you propose: perform arbitrary walks through data structures, develop custom indexes, or whatever.
SQLite compiles SQL to bytecode, and then executes that bytecode against the database. However, there's no public interface for creating/running bytecode directly, instead of as a result of a compiled statement. You almost certainly COULD do what you're trying to achieve, but the SQLite authors have specifically called it out as a bad idea - https://sqlite.org/forum/info/c695cbe47b955076 - since bytecode representation can change from release to release in a way that would only matter to the compiler (or to your weird hacked in interface). Meaning, non-portable.
> since bytecode representation can change from release to release in a way that would only matter to the compiler (or to your weird hacked in interface). Meaning, non-portable.
IIRC JVM static-analysis libraries get around this by essentially forcefully pulling in and reflecting upon the particular compiler release's internals that are being built against. The result is "non-portable", but only in the sense that it's getting tailored to the particular compiler release that's already concretely available in the build environment.
Mind you, that's a bit different, because you don't usually ship the compiler parts of the JDK as part of your application JAR; while SQLite does ship this compiler as part of the library. Would be fine, though, as long as your executable's embedding SQLite statically (or in a Docker image, etc) — in other words, vendoring the particular version of SQLite that matches the version the application-layer codegen library was compiled against.
I would love this too, and looked into building an extension for it a few years ago. IIRC the main challenge was that many features such as row level security are built straight into the current query executor, making it difficult to build this as a production grade tool. I could try to dig up my notes if you’re interested.
The argument from the camera folks goes like this:
Why should a camera's software, written 5 years ago in Japan/China/Taiwan, choose the settings for me, given the lighting conditions I have right now in Seoul at 2:30 in the morning?
That's why most professionals prefer to use a manual mode. Auto is often used as a first suggestion (but not a very good first suggestion).
You can usually manage this by dividing your query into subqueries, each creating some temp table, so you have control over how the joining actually happens.
If you do this be careful what your temp_buffers is set to so that your temp tables don't spill to disk. If you are on a network file system, for example on AWS RDS, writing big (a few hundred MB) temp tables to disk will stall all transactions.
The same is of course true when you have big joins that don't fit in work_mem, but the default size of this will be much larger.
I can usually fix bad plans with CTEs, no need to get much fancier. And the problem is often caused by schema design where you have a mapping table of two tables in the middle and your join is N-M-M where the planner has no information about the relationship between the two outer tables.
My setup is a local NVMe SSD RAID, so I hope this part won't be a bottleneck.
Also, if you are doing a heavy join where the join order and method need to be controlled, your temp table will likely be large, so you will need to be ready for disk I/O.
It's a shame, then, that there's no way to define at-first-purely-in-memory tables, which only "spill" to disk if they cause your query to exceed work_mem.
Within PL/pgSQL, CREATE TEMPORARY TABLE is still (sometimes a lot!) slower than just SELECTing an array_agg(...) INTO a variable, and then `SELECT ... FROM unnest(that_variable) AS t` to scan over it. (And CTEs with MATERIALIZED are really just CREATE TEMPORARY TABLE in disguise, so that's no help.)
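To illustrate the pattern, a minimal sketch (some_table and its columns are hypothetical):

    do $$
    declare
      cached some_table[];
      n bigint;
    begin
      -- pull the rows of interest once into an array variable
      select array_agg(t) into cached
      from some_table t
      where t.customer_id = 123;

      -- scan the cached rows later without creating a temp table
      select count(*) into n
      from unnest(cached) as t
      where t.active;

      raise notice 'active rows: %', n;
    end $$;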
> It's a shame, then, that there's no way to define at-first-purely-in-memory tables, which only "spill" to disk if they cause your query to exceed work_mem.
But isn't this exactly how temp tables work? A temp table lives in memory and only spills to disk if it exceeds temp_buffers[1].
Huh, I think you're right... but it's still slower! I've definitely measured this effect in practice.
Just spitballing here — I think the difference might come from which catalogs the metadata required to treat the table "as a table" in queries has to be entered into, and the overhead (esp. in terms of locking) required to do so.
Or, perhaps, it might come from the serialization overhead of converting "view" row-tuples (whose contents might be merged together from several actual material tables / function results) into flattened fully-materialized row-tuples... which, presumably, emitting data into an array-typed PL/pgSQL variable might get to skip, since the handles to the constituent data can be held inside the array and thunked later on when needed.
I believe the metadata about them is still written to various system catalog tables. Creating lots of temp tables will cause autovacuum activity on tables like pg_attribute, for example.
What I am trying to say is that your link says temp_buffers is a buffer used for accessing temp tables; it doesn't say the actual temp table is stored in that buffer and never flushed to disk.
There is also the use case where the planner comes up with the correct plan in the end, but the process could be sped up by giving it additional hints that lead it to the right result faster by restricting the search space up front
Ran into this one recently, and worked around it by executing a CLUSTER command after inserting data. By CLUSTERing on a column that contains random data (which your test could inject) you force Postgres to randomise the row order on disk, and thus randomise the return order.
In my case our primary key is a random UUID (I know, it’s a terrible thing, it’s not my choice), which is perfect for the task as CLUSTER requires an index. You might be able to pull the same trick by creating an index in a transaction, CLUSTERing then rolling back the transaction, but I suspect that you can’t call CLUSTER in a transaction.
Not as nice as proper test mode, but gets the job done, and avoids the need to wrap queries deep inside your application, with all the brittleness and peril that comes from using wizard level code reflection that’s normally required for such tricks.
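For the record, the workaround amounts to something like this (accounts and its primary key are stand-ins for whatever table the test populates):

    -- physically rewrite the table in (random) primary-key order, so a later
    -- select without an ORDER BY no longer happens to return insertion order
    cluster accounts using accounts_pkey;
    analyze accounts;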
The first and maybe the only time I saw UUID PKs was in a multi-tenant solution. If your customers are potentially competitors to each other, you absolutely do not want people to be able to walk IDs sniffing for other stuff. I wasn't the biggest fan of this solution, but I really didn't have a better suggestion.
On that same project we ran afoul of Javascript's 2^53 floating point limit with IDs, but I think that was a separate issue.
Honestly I think random UUIDs, or just UUIDs in general, make for terrible PKs and generally shouldn’t be used.
To solve the multi-tenant issue, I personally prefer either flake IDs/hash IDs that have an ordered component and a random component to make walking IDs hard; aggressively namespacing your data so a customer ID plus an object ID is always needed as a pair to look something up, so trying to walk object IDs can only ever result in someone accidentally looking up objects that already belong to them; edge layers that remap and filter internal identifiers via hashing etc. so externally all IDs are opaque and random; or strong and careful access control that ensures you can only ever look up IDs that belong to you, with careful consideration for side-channel timing attacks.
Relying on random UUIDs for customer privacy would raise red flags for me. If being able to walk IDs is enough to break your security model, then I kinda wonder if you actually have a security model.
It's security in depth in this case, as I recall. There are subtleties in how you confirm or deny the existence of a record and for their particular solution to the problem you would get a 403 vs a 404 for a record you did not have access to.
Knowing how much activity a competitor is adding into a project management system isn't a lot of data, but it's more than zero.
There’s a simple solution to that. Always return 403, it’s what you should be doing regardless of your underlying data model.
You should always determine the right to access a record before attempting to look it up. If you can’t determine ACLs without the lookup, then your permissions layer should always assume the requester doesn’t have permission to view a non-existent record and return a 403.
Doing anything else is dangerous, and randomising PKs is a sticky plaster over a badly designed access control system.
How about solutions like https://hashids.org/ which let you keep auto-increment integer PKs and present the IDs to the users in an obfuscated string form?
It sounds like GP encountered UUID Version 4 keys.
UUID version 7 features a time-ordered value field derived from the widely implemented and well known Unix Epoch timestamp source: the number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded. It also has improved entropy characteristics over versions 1 or 6.
If your use case requires greater granularity than UUID version 7 can provide, you might consider UUID version 8. UUID version 8 doesn't provide as good entropy characteristics as UUID version 7, but it can utilize a timestamp with nanosecond-level precision.
> Both UUIDv8 and UUIDv4 only specify that the version and variant bits are in their correct location. The difference is that UUIDv4 specifies that the remaining 122 bits be pseudo-randomly generated. UUIDv8 suggests that the generated value still be time-based however the implementation details of how that generation happens are up to the implementor. This means we have 48 bits of custom implementation, four bits for the version (1000), 12 more bits of custom implementation, two bits for the variant, and finally 62 bits as the implementation sees fit. This leaves a lot open to the implementor but it has enough rules around it that it can coexist in the existing UUID environment.
What's wrong with UUIDs? Back in the 90s Microsoft strongly encouraged it. You lose a little performance and a little storage (128 vs 64 bits), but gain the ability to merge databases (think corporate acquisitions) with less pain.
At least one distributed datastore (Google Cloud Datastore) works better with random keys rather than incremental keys; it tends to produce fewer tablet splits. At least, it used to, the technology may have changed since then.
Honestly it seems like the arguments for/against UUIDs as keys are pretty mild on both sides. Why would it be "terrible" to go one way or the other?
Harder to copy and paste IDs. Discourages denormalization since every reference is 128 ~~bytes~~ bits. Reduced throughput for inserts since the primary index pages won’t all be in cache if the keys are randomized. Lose ability to use id as a fast temporal ordering. Encourages uuids generated client-side.
All in all not enough good reasons to make a lot of stuff slightly worse.
> Reduced throughput for inserts since the primary index pages won’t all be in cache if the keys are randomized. Lose ability to use id as a fast temporal ordering.
You can generate the first few bytes (we use the first 3) of the UUID from a timestamp - that gives you index locality. It's even better than an auto-inc because you can control exactly how much of your index will be used for hot inserts based on how many timestamp bits you use, so you can avoid lock contention around the single latest index page.
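To illustrate the idea (not necessarily the exact scheme described above, and the byte positions are tunable), a sketch that overwrites the first 3 bytes of a random UUID with the middle bytes of the unix timestamp, so the hot prefix advances every ~4 minutes:

    -- gen_random_uuid() is in core since PG 13 (pgcrypto before that)
    create or replace function uuid_time_prefixed() returns uuid
    language sql volatile as $$
      select encode(
               overlay(
                 uuid_send(gen_random_uuid())
                 placing substring(int8send(extract(epoch from clock_timestamp())::int8) from 5 for 3)
                 from 1 for 3),
               'hex')::uuid
    $$;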
> Encourages uuids generated client-side.
It's really an API design choice about whether you allow this, but it can be useful in some circumstances (if you trust the client).
> so you can avoid lock contention around the single latest index page.
That is usually the total opposite of what you want. There are some optimizations for inserting to the last page, but primarily it is because you want to do sequential inserts. So if you want to avoid contention on the last page you should insert ordered by (connection ID, sequential ID or high-resolution timestamp). That way every connection will do sequential inserts on its own page.
Nice, this makes sense. Seems like storing connection_id in the table would make it difficult to look up data later? It seems like you could use (customer_id, seq_id) for a similar effect.
- UUID (fully random) is completely secure and private. You cannot infer anything from it. As soon as you add anything to it ( like timestamp or counter), outsiders can infer things like how fast you are generating new objects
- UUID (fully random) cannot be abused by developers, which might skip creating timestamp columns and read it from the id instead
- UUIDs are slower, but as you see, it's not for no benefit
Not that much slower than bigint as long as you replace the top one or two bytes in the UUID with a rotating per-minute counter.
You keep 106 bits of entropy (from v4's 122 bits) while largely eliminating page faults. Walkability is eliminated while not leaking too much temporal info.
> Discourages denormalization since every reference is 128 bytes.
This might just be a mistake, but a uuid is 128 bits = 16 bytes not 128 bytes. Of course yes there are still circumstances where the extra space (16 bytes vs 8 or 4 or something else) isn't worth it.
Is there an actual problem with this? I don't mean "it feels impure" - what's the failure case here that's severe enough to discard UUIDs as PKs?
The other reasons all boil down to 'database performance tuning', which is reasonable if you have any prospect of needing it, but most datasets are _tiny_.
The other reasons are more important imo, and they aren’t really all tuning; more ergonomics and simplicity.
There are a lot more edge cases when you use client-side generated IDs. The main one is that you have to check whether the ID is actually new to avoid security issues. Let’s just upsert into the DB and return the results, and the uuid is new and randomly generated so it’s fine! So simple! Except now Alice can send someone else’s uuid and read their data. X100 endpoints all need to handle this correctly. It’s a huge risk.
If you do offline-first ID generation using global uuids, the possibility of conflicts (due to bugs, patched client code, etc.) is a whole rabbit hole of edge cases and problems as well.
It can be done, Bret Taylor used that architecture for Quip, but it’s needlessly tricky, which is the whole story for UUIDs imo - annoying and slightly worse for many normal apps. If you’re building a complex distributed system and want to deal with all the trickiness, go ahead. I’d recommend you use a B64 custom identifier instead of UUID so that it’s copy-pasteable :-)
The nature of your application and its security and idempotence needs will determine who should (ideally) generate ids for any specific context. The format is irrelevant.
Generate ids on the client when it makes sense, generate them on the server when it makes sense. Server-side is more common but if you choose poorly you'll make bad software either way.
> Postgres can auto generate UUIDs too and can be used for pk
True, but the practice makes it easier to shift to the dark side and insert the child records first, as you can already know the pk of the parent record in advance.
> How can you do that when you have fkeys in your database design like a good engineer, right? Right?
I know you are trolling, and it is OK if done sensibly, but it is not simply purism that makes it a bad idea. Because you have to trust the client, it opens the way to discovery attacks (try inserting a record with a specific PK -> it will fail if the record exists -> now you know that it exists even if you don't have access to that record).
You may also not have access to all client implementations (think of public APIs) so some client libraries might not implement proper (i.e. strongly random) UUID generation.
B-trees are self balancing so I don't think there will be bloat. But if you insert towards the end of an index a lot, the right-side of the index will more likely be in cache which makes it a lot faster. If you inserted randomly then you need to do lots of traversals in different parts of the B-Tree.
When you’ve got 128 bits to play with, I think you can do more useful things with those bits than UUIDs provide.
UUIDs are perfectly fine identifiers in a purely abstract sense. But baking a little more info into those identifiers can make debugging and building operational tools so much easier. It’s just a massive waste to fill those 128 bits with either random data, or the very limited options that other UUID versions give you.
Arguments about leaking information publicly are, I think, a little silly as well. There are plenty of ways of preventing that; trading ease of debugging for making your internal identifiers safe in public is a poor trade-off in my view.
Temporal tables are IMHO a profound gamechanger. Once you understand how to use them, you'll have a completely different view on your data models/DB schemas and your data lifecycle.
It makes so many things far easier, having a temporal relation for every record and saves a lot of headaches that would otherwise usually be dealt with on the application level.
Imagine you're building a support ticket tracker whose reporting capabilities should provide data on various historic aspects of your tickets, such as "time per assignee", "time per state", "time between creation and solution" etc.
In an environment built on "traditional" SQL you'd have a history table, where you first and foremost would have to maintain records for those values:
- ticket ID
- time of state change
- type of state change
But for the stats to be actually valuable, you might need additional history context (different per type), such as the user ID, the state, criticality, etc., for which either an additional table per history item type would be required, or the data could be stored as a serialized value with a metadata column on the history table. One requires complexity on the DB level, the other requires application-level logic and slows down report generation tremendously.
Temporal tables to the rescue to remove all this complexity.
The full state of a ticket record, including its related entities at the given time of an UPDATE, is fully preserved by the DB natively, so all you'd have to do to retrieve historic records at a given time is to append "AS OF $TIMESTAMP" to your query.
In the end, there's no need for extra history tables, much better performance when building reports, no need for application level data wrangling, ...
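Until something like SQL:2011 system versioning lands in core, the manual emulation looks roughly like this (tickets_history is a hypothetical table; the exclusion constraint needs the btree_gist extension):

    create extension if not exists btree_gist;

    create table tickets_history (
      ticket_id    bigint not null,
      assignee_id  bigint,
      state        text not null,
      valid_during tstzrange not null,
      exclude using gist (ticket_id with =, valid_during with &&)
    );

    -- the hand-rolled "AS OF" query:
    select *
    from tickets_history
    where ticket_id = 42
      and valid_during @> timestamptz '2023-01-15 12:00+00';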
Half my job is doing this in reverse by accumulating deltas over streams and materializing the current state in various caching tiers. I'd love this to be a native feature in PG, but for any of my workloads it would need to support a lower-cost archival-grade storage tier a la a blob store.
Oh thanks for sharing this! I love all of those, and most seem like they'd be easy sugar over the existing syntax. The biggest missing feature from those that I would really enjoy in data exploration tasks (though not in PROD) would be automatic JOIN ON selection based on foreign keys.
Example:
SELECT users.id, COUNT(*)
FROM users
JOIN orders ON AUTO
WHERE orders.created_at > NOW() - '7 day'::interval
GROUP BY ALL
This would only work if there was an obvious path to do the join. In this case, I'm imagining that the `orders` table might have a `user_id` column which is a foreign key into the `users` table.
That sounds very close to NATURAL JOIN which is already present[0] although that does rely on the typical convention of FK columns being named the same on parent and child (related) tables.
I think you are suggesting some sort of lookup based on the defined FK relation, but that would be confused by situations where tables have multiple FK relations, such as tables with values restricted by a lookup value table (or more than 1 such FK). Those are pretty common, so I could see the 'AUTO' feature breaking down quickly. I think that is why the NATURAL JOIN approach is taken and that basically does what I believe you are describing, provided the column naming is matched.
You could use the name of the foreign key; together with FKs namespaced to the table they are on, that would allow very expressive joining, and the query might even survive schema changes. ORMs tend to work like this.
> "Loose indexscan" in MySQL, "index skip scan" in Oracle, SQLite, Cockroach (very recently added), and "jump scan" in DB2 are the names used for an operation that finds the distinct values of the leading columns of a btree index efficiently; rather than scanning all equal values of a key, as soon as a new value is found, restart the search by looking for a larger value. This is much faster when the index has many equal keys.
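In Postgres today the usual workaround is a recursive CTE that hops through the index by hand, something like this sketch (assuming a btree index on some_table(category)):

    with recursive distinct_categories as (
      (select category from some_table order by category limit 1)
      union all
      select (select s.category
              from some_table s
              where s.category > d.category
              order by s.category
              limit 1)
      from distinct_categories d
      where d.category is not null
    )
    select category
    from distinct_categories
    where category is not null;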
Omg yes please. I get that it might be slightly harder to parse, but any language that doesn't support it inevitably annoys me. In pg and JSON you end up with runtime errors; in programming languages it leads to bigger diffs than necessary (adding an entry to a static list is a 2-line change instead of 1 line).
This is how I format sql code when I have control over it, have for years. People often recoil in horror, sometimes they ponder for a minute then switch over themselves.
It works very easily and consistently to center around the space and lead with the comma:
select id
, name
, address
, ts
from table
where condition
and etc;
SELECT a.foo
, b.bar
, array_agg(z.gumball) gumballs
FROM zoo z
INNER JOIN alpha a
ON (z.z_id = a.z_id)
LEFT JOIN baker b
ON (a.a_id = b.a_id)
WHERE z.last_modified > '2023-01-01'
AND a.is_active
GROUP BY 1, 2
ORDER BY 2, 1
LIMIT 100
;
Left side highlights the operations. Commas and "AND" delimit parts of each directive. The semicolon lines up on the left side to help visualize the end of each statement in a long chain of commands, especially DDL.
I also tend to capitalize the SQL keywords and leave the identifiers lower case.
Sometimes yes. I don't think there are any perfect solutions and I won't pretend this one is. But usually I find those change less frequently than columns and where clauses, and if you write them out and declare them explicitly you rarely have to move the whole query over. It does happen though.
My alternative is to follow with “1” so every preceding line can have a trailing comma or start a predicate clause with “1 = 1” so every subsequent line can start with “AND foo …”
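i.e. something like (columns are hypothetical):

    select id,
           name,
           1
    from t
    where 1 = 1
      and foo = 'bar'
      and created_at > now() - interval '7 days';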
I hope postgres adopts easily disabling an index instead of deleting it. MySQL has this feature and it's very useful for testing whether an index can be deleted on production load. Transactional DDL is amazing already, disabling indexes would be a great addition for low-risk performance tuning.
There was an extension from Teodor some time ago, plantuner, to disable indexes in a session or globally ("set plantuner.forbid_index='id_idx2';") – but it didn't make it to core, and even to contribs. Maybe because the functionality to disable indexes was mixed with planner hints there. It's a very old story, discussion from 2009: https://www.postgresql.org/message-id/flat/47E63672-972E-452...
- disable for all
- disable only for my session, to check what would happen with the plan, and only then decide to proceed with disabling for all (or to drop it)
ALTER is quite an invasive way, even more so than "UPDATE .. SET indisvalid = false ...". It would be good to do it via SET, as was proposed in the plantuner extension long ago.
CTEs should not be materialized by default (latest PG if you reference a CTE more than once it stores intermediate rows on disk and disables indexes and causes a lot of trouble.)
I’d love to see a B-tree primary storage option, aka storing the row data inside the primary index. This can save a lot of space for thin tables or tables with large keys, and would be basically an auto-CLUSTER with all the performance that comes with that. This is how MySQL works. The heap primary storage is better sometimes, but it sucks for range queries, leading people to need Timescale for good data locality.
> CTEs should not be materialized by default (latest PG if you reference a CTE more than once it stores intermediate rows on disk and disables indexes and causes a lot of trouble.)
Nope, it does not write anything to disk unless you have more data than work_mem.
Nice, TIL. The default value for work_mem is only 4MB though, and when I’ve seen materialized CTEs go wrong they’re usually materializing intermediate results in the 10s or even 100s of MBs. Usually “with products as (select * from products where customer_id = 123)” type of stuff, where some customers have hundreds of products with large rows.
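For what it's worth, since PostgreSQL 12 the default can be overridden per query:

    with customer_products as not materialized (
      select * from products where customer_id = 123
    )
    select count(*) from customer_products;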
Not the same, this is called a clustered index or index organized tables.
With a covering index in PG you still have a heap storing the rows and a copy of the covered columns in the index.
With clustered indexes or IOTs the table is the index, so there is no duplication unless you have other secondary indexes. This saves a lot of space and reduces indirection when seeking on the clustered index.
In some DBs, like SQLite and InnoDB (MySQL), this is always the case: the table is a B-tree and there is always a clustered index even if you don't define one explicitly. In others, like Oracle or MSSQL, you have a choice of unordered heap or B-tree; in PG you have no choice, the table is always an unordered heap and all indexes are secondary.
Thanks for this explanation -- didn't see how they were different but your comment explains it perfectly.
With this understanding I can't grasp the benefit of a clustered index -- if you're using the primary index (let's say numeric auto-incrementing ID) then you'd likely have a secondary index for that already (in PG). If you were searching by something else, the default clustered layout is a hindrance as you must traverse unnecessarily to find entries (rather than sequentially scan)...
The only penalty of PG's decision seems to be excess memory usage (storing a second copy of the identifying tuple contents in memory) -- is that characterization correct? But I wonder how this holds up with a spinning disk -- I wouldn't want to follow indices (and do random reads) on spinning disks.
Looking at the other side, I guess the case where PG shines is where you want to do batch processing (so looking at a page of tuples is good locality-wise), but you also often retrieve by the identifier, and don't mind paying the cost of [rows x identifier size] for the privilege.
Are there some specific use cases where clustering indices clearly outperform/are the right choice?
If you primarily look up by the primary key, which is pretty common, the clustered index is faster than looking up the primary key in a standard index and then finding the row in the heap for the other columns. If you make the standard index cover all your columns, you get the behavior of a clustered index, except the data is duplicated in the heap.
Clustered indexes can save significant space on a narrow table with a lot of rows accessed in a particular way, and perform better as well; if you access the table in multiple ways with other secondary indexes, it can be slightly slower to access through a secondary.
The important factor is # of pages on disk that need to be retrieved. Secondary indexes (aka every index in Postgres) have to lookup the primary storage as well. If you use a primary key index to pull 100 rows, if the rows aren't clustered, then you're looking at ~300 pages needing to be pulled. That's roughly 2 per index traversal (the first level of the index is generally cached) plus one per row to pull from primary storage.
This can be improved in two ways. One, if you add a second index which gives you better locality in the index. For example (customer_id, product_id) will group up all the rows by customer id. This can reduce the # of pages for index traversals down to <5 as long as each customer doesn't have a lot of rows. And in many cases this makes the primary index on just id useless. This brings the total down to 105 pages give or take. (depends on how many products each customer has)
The other way is to use an actual primary index, or use a covering index so that the data you retrieve is already in the index. For example if you're just pulling product_name from your table, you can use covering index on (customer_id, id, product_name) so that the product_name has locality with the customer's product IDs. This would bring down the total pages to be retrieved down to maybe ~20, since product_name tends to be larger data. It's a question of how many (customer_id, product_id, product_name) tuples can fit on one 8KB page and how many products the customer has.
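In Postgres terms that covering index could look roughly like this (products is a hypothetical table; INCLUDE is available since PG 11):

    create index products_customer_cover
        on products (customer_id, id)
        include (product_name);

    -- can be satisfied by an index-only scan (given a reasonably fresh visibility map)
    select id, product_name
    from products
    where customer_id = 123;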
If you use a primary index, the whole rows are on the pages. This lets you run queries that pull lots of data (or different data) and have good data locality, but it means fewer tuples per page, so you need more pages. So you'd access maybe ~50 pages, but this index could cover a lot of queries, unlike the covering index which only works for product_name.
These days SSDs are much faster than hard drives, so the # of pages pulled off disk is still important, but not as much so. Another thing this buys you is that you don't pollute the in-memory cache by evicting pages just to load a new page that doesn't get utilized well. For instance, the original index's leaf entries are just 4 bytes (product_id, rowid), so every time you pull in one of those pages for a single entry you throw away 99% of the data on that page.
Covering indexes are different. You’d have to have a 2nd unique index which has all the rows from the table. A lot of large datasets would benefit from auto clustering but can’t handle 2x the storage for the same data.
Ah thank you for noting this -- combined with SigmundA's answer I think I get how they're different, and how covering indexes are not the same solution
Although it has its gotchas, MySQL dialect is still somewhat better from the usability standpoint. "SHOW TABLES" is nice; it is natural and self-explanatory. "\d" is obscure.
It also reminds me how many DBMSs did not support the LIMIT clause because it is non-standard. But it is good from a usability perspective.
I've never looked into this for Postgres itself, but AWS Redshift (a postgres derivative) has enough info in the system tables to reconstruct a create table statement using a view (see: https://raw.githubusercontent.com/awslabs/amazon-redshift-ut...) and I'd guess that you can do the same for Postgres.
You can use "select * from pg_stat_user_tables" etc.
Admittedly the internal tables and views have a bit of a learning curve and are a little bit obscure (and inconsistent) at times, but it's much nicer and much more flexible.
The CLI stuff like \d are just "aliases" for queries; \set ECHO_HIDDEN on shows them.
Of course you can, but that is beside the point. Convenience matters - besides, these approaches can coexist, adding those simple queries (as aliases) should be trivial.
Array foreign keys and indexes on arrays/JSON nested fields. When using MongoDB, having this eliminates most need for join tables. (*This may have already been implemented, I haven't used SQL recently.)
I think they have indexes on json (or any selectable thing?).
You could probably implement the foreign key using generated columns these days. Ah I guess you mean instead of having the association table? That would definitely be nice in some cases.
We very occasionally use integer[] columns with ids, but hold off on using them widely b/c of the lack of integrity constraints. It'd be awesome to use more often.
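Indexes on nested JSON fields and on arrays do exist today, for example (events and playlists are hypothetical tables):

    -- GIN index for containment queries on a jsonb sub-document
    create index on events using gin ((payload -> 'tags'));

    -- plain btree on a single nested field
    create index on events (((payload -> 'user') ->> 'id'));

    -- GIN index on an integer[] column, for queries like: where song_ids @> array[42]
    create index on playlists using gin (song_ids);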
Interesting point. I don't think I have that particular use case, but I can definitely imagine it. I'd think especially with dates in the far future, where the timezone offsets are altered between the time of recording and the time value itself?
Individual timezones are pretty stable but a handful change every year in some way, often switching how they observe DST or something similar. If you have a truly global userbase and this actually matters, you'll definitely hit them.
Lunch service 11am to 3pm Weekdays, 11 to 2pm Weekends
How would that dynamically map to changes in the timezones that might be mandated by the law but reflect the correct event times in a scheduling system? Spreadsheet systems developed the $ prefix for cell addresses as a shorthand to lock that in when adjusting the relative offsets in copy and paste / duplicate operations.
That’s a weird use case but you could just create a UDT to do it. You need the timestamptz value, the timestamp, and the timezone. Super expensive but ¯\_(ツ)_/¯
Ha I had a need for this yesterday. Just added a timestamptz column and text column. In my use case the text column preserves the input value for display to the user while the timestamp itself is used programmatically. But a “TIMESTAMP WITH TIME ZONE PRESERVING TIME ZONE” data type would be cool.
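A sketch of what that two-part value could look like as a composite type (the type and table names are made up):

    create type ts_with_orig_tz as (
      ts_utc  timestamptz,
      tz_name text
    );

    create table bookings (
      id        bigint generated always as identity primary key,
      starts_at ts_with_orig_tz
    );

    insert into bookings (starts_at)
    values (row(timestamptz '2024-03-10 09:00+09', 'Asia/Seoul')::ts_with_orig_tz);

    -- render in the originally supplied zone
    select (starts_at).ts_utc at time zone (starts_at).tz_name
    from bookings;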
If you execute a modifying query without a where clause it will stop you and double check that's what you intended to do.
Likewise you can specify a database connection as read-only so that it won't run modifying queries at all; attempting to do so will stop you, but you can then explicitly run it if you need to.
> Likewise you can specify a database connection as read-only
We emulate that in prod by having admin users with read only permissions. They are granted other roles without the INHERIT option, thus needing an explicit `SET ROLE ...` before being able to do anything dangerous.
I also teach people the habit of combining `BEGIN; SET ROLE ...` every time they need to write something. It has completely stopped "whoops, prod" incidents since it was implemented.
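i.e. the muscle memory is roughly (app_admin being whatever your writer role is called):

    begin;
    set role app_admin;
    update accounts set plan = 'pro' where id = 42;
    -- inspect the result, then
    commit;  -- or rollback;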
> Likewise you can specify a database connection as read-only so that it won't run modifying queries at all; attempting to do so will stop you, but you can then explicitly run it if you need to.
Does it parse the SQL and guess if it’s DML or run it in a transaction and rollback after?
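(For reference, Postgres itself can also enforce this server-side, independent of how the client does it:)

    -- everything in this session defaults to read-only transactions
    set default_transaction_read_only = on;

    -- or per transaction
    begin read only;
    update t set x = 1;  -- fails: read-only transaction
    rollback;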
I'd like to see easier upgrades between major versions, or at least published docker images that can do the upgrade. As it stands, it's a huge hassle to get both binaries installed on the same system. This is a task almost everyone has to do periodically, it should be easier.
I also found it strange that PostgreSQL cannot simply support data formats of previous versions. In contrast, the latest version of ClickHouse[1] (23.1) can be installed over the version from 2016, and it does not require format conversions or any other migration procedures.
Yes this is a huge problem. I wish PG could contain whatever it needs to do in place upgrades from supported versions. Statically compiled builds of previous versions of pg_upgrade?
You can use the slow query log, which is very helpful. Lots of logging tools will ingest it properly and you can aggregate queries that are run frequently. Having an index or not is just one optimization.
Many of those queries may run sub-second (typical lower-end value for log_min_duration_statement), so they won't get logged. Yet, if called at high frequency, may represent a notable % of your CPU and I/O. The slow query log is not enough in many cases.
Maybe... but there are many queries that don't use an index (whether fast or slow) and that's the right thing to do. As a DBA, I'd see too many "false positives" due to this reason.
One technical problem is that the WAL records changes to "shared" relations (stuff like the catalogues of databases and users that are global to the whole "cluster", not just one database), and those are mixed up with changes to per-database objects (i.e. most WAL activity).
That's interesting. I have a separate cluster setup without archiving because some databases don't need it. It's so simple to run various clusters side-by-side on the same host that I don't think having compartmentalised wals would be easier to manage.
Yeah, but I want cheap in-memory joins between the WAL-isolated datasets. I.e. "multi-world" MVCC concurrency, where a TX is locking in a separate min_xid for dataset A, B, C, etc. for the lifetime of the TX — but without this being a big distributed-systems vector-clocks problem, because it's all happening in a single process.
Why? Being able to run a physical replica that loads WAL from multiple primaries that each have independent data you want to work with, for one thing. (Yes, logical replication "solves" this problem — but then you can't offload index building to the primary, which is often half the point of a physical replication setup.)
I don't know about pseudocode, but I think a simplifying analogy would be to consider an embedded DB (like SQLite) or key-value store (like LMDB) library that you interact with through transactions.
In such a system, "single-world MVCC" would be what you'd get by putting everything into one database file, with any changes intended always done within one DB tx of that single DB file.
"Multi-world MVCC", then, would be what you'd get by opening multiple database files, each of which maintains its own WAL/journal/etc., and then creating an application-layer abstraction that allows you to coordinate opening DB txs against multiple open DB files at once, holding the result as a single handle where if you hit a rollback on any constituent tx, then the application-layer logic guarantees that the other DB files' respective txs will be told to roll back as well; and that when you tell this coordinated-tx to commit, then it'll synchronously commit all the constituent DB files' txs before returning.
Note that unlike with a single DB file managed through a single readers-writer lock, this kind of system can introduce deadlocks (but DBs with more complex locking systems, like Postgres, already have that possibility.)
And then your application has multiple instances of these "IDatabase" objects, maybe one per physical/logical database file, e.g. ".sqlite3" in the case of SQLite
And at the root of your application, you have something like an:
Yes, query plan reuse like every other DB. It still blows me away that PG replans every time unless you explicitly prepare, and even that is still per connection.
As I understand it this is supposed to be done in the client libraries rather than in the server. It’s not that it doesn’t reuse query plans, it just doesn’t do it in core.
It makes the most sense for the server to reuse plans across all connections regardless of client; big-iron DBs have been doing this forever (remember, there are multiple app servers). Even beyond that, PG is per connection, so even with multiple threads/connections per client there is no plan reuse between them.
I would love to see more languages in their full-text search by default. But it looks like currently it's an "if you want a language, contribute it" kind of deal, and I've found this outside of my depth.
My vote is for ids with prefixes. If you've ever used the Stripe API, you know how nice it is that all customer ids are prefixed with `cus_`, all product ids are prefixed with `price_` etc. I want that in postgres, so that any time you see an id you know exactly what kind of entity you are dealing with. It'd be incredibly useful for dealing with data analysts, and also for debugging production issues from logs / crash dumps.
You wouldn't need to store the prefixes at all, it'd be part of the column metadata, but it would be prefixed before sending the data back to the client.
I could do this at the ORM level but in my experience even with the greatest ORM in the world you often have to drop into raw SQL. Additionally people might have multiple ways of accessing the same db or a read-only duplicate.
Postgres allows one to write custom data types in C or another systems programming language. The only limitation is that the "type modifier", in this case the prefix, would need to fit in a "single non-negative integer", so in principle using e.g. 5 bits per character you could get up to 6 lowercase letters, `-`, `_` etc. into the prefix. Of course a custom extension would have limited utility if it doesn't land in cloud services like AWS RDS.
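Short of a custom type, a lighter-weight sketch is a generated column holding the prefixed form (generated columns exist since PG 12; the table is hypothetical):

    create table customers (
      id        bigint generated always as identity primary key,
      public_id text generated always as ('cus_' || id::text) stored,
      name      text not null
    );

    select public_id, name from customers;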
I worked on a db that did this once. Prevented generic code being written that could work over multiple objects/tables. Caused more work than it prevented bugs.
You can do this with numbers. Customers start with 1, products start with 2, etc. This is a little annoying if you aim small (products are 200-299, but now you have your 101st product), or design this scheme without realizing you need to add a 10th entity (or 100th), but it does mostly work.
Rather than this, why not do more with existing table REFERENCES metadata? For example, why can't I have a covering index across data from two tables, joined through a foreign-key column with a pre-established REFERENCES foreign-key constraint against the other table, where the REFERENCES constraint then keeps everything in place to make that work (i.e. ensuring that vacuuming/clustering either table also vacuums/clusters the other together in a transaction in order to rewrite the LSNs in the combined index correctly, etc.)
(Author here) This is also a neat idea but wouldn't help with subqueries, views or other computed relations getting joined on each other which is a major motivation of the validation proposal.
I could see that being a motivation, but isn't that exactly the use-case where it'd also be a bad idea, overhead-wise?
With a static DDL constraint on join shape, an assertion about such shape can be pre-validated at insert time, to be always-valid at query time, such that queries can then pass/fail such assertions during their "compilation" (query planning) step.
Without such a static constraint, you have to instead insert a merge node in the query plan (same as what a DISTINCT clause does) in order to normalize the input row-tuple-stream and fail on the first non-normalized row-tuple seen.
You could pre-guarantee success in limited cases (e.g. it's going to be a 1:N join if the LHS of the join has a UNIQUE constraint on the single column being joined against); but you're not going to have those guarantees in most cases of "computed relations."
The overhead would be terrible. If implemented, you'd put the syntax in your queries, but a GUC or some such would turn off validation, it'd be off by default (because of that overhead) and you'd only turn it on for local dev and for testing.
The underlying issue, mixing up the cardinality of one side of the join, isn't a super tricky one to identify, especially once you've seen it once or twice or seen what the buggy output looks like. Every intermediate SQL developer has run into the bug and has fixed it. The vision I have for the feature is more like a set of training wheels for developers who are earning their SQL scars. And it's also probably not a terrible idea to document exactly where you are expecting the full M x N cartesian join results when you do want them.
I wish there was a way to have "named" constraints that you could share between tables. They would behave akin to mixins, and that would open up a world of possibilities.
(And if you're into that sort of thing, there's a lot of type theory concepts that could be applied on top of it that would make things much more practical and approachable)
Also, I'm eagerly waiting for Incremental View Maintenance to be merged into main postgres.
Comments inline with column and table definition. And simple idempotent DDL.
AFFIRM TABLE product_category
(
category_id smallserial
PRIMARY KEY
, name text
NOT NULL
) COMMENT 'Product categories from the official catalog'
;
AFFIRM TABLE product
(
product_id uuid
PRIMARY KEY
DEFAULT gen_random_uuid()
WAS (id uuid DEFAULT gen_random_uuid())
, (id int8 DEFAULT gen_random_uuid() USING gen_random_uuid())
COMMENT 'Product lookup id'
, category_id REFERENCES product_category
ON UPDATE CASCADE
ON DELETE RESTRICT
NOT NULL
COMMENT 'Product category'
, serial_num text
NOT NULL
COMMENT 'Product serial number'
) COMMENT 'Our nifty products'
WAS (products)
;
Just like a SELECT statement describes the structure of the output with the engine figuring out how to retrieve it, instead of CREATE, CREATE IF NOT EXISTS, and ALTER statements over and over, just describe what your target is supposed to look like and let the engine reconcile. The WAS keyword (for example) could allow engines to figure out how to get from one state to the next. In this case converting the id column from an int8 to a UUIDv4 and then changing its name from id to product_id. The table was named products and is now product. No more Flyway or the like where you have to hunt through migration files to find the latest definition for a view or function.
So much easier to diff when making changes. Then after all databases have been migrated, prune the old, obsolete WAS definitions. Also, why do we have to specify the type of category_id in the product table? It's a foreign key reference; the type is not only implied, it's mandatory to match exactly.
But since we're already here on the topic:
• Real pivot support, not that crosstab hack
• System-versioned temporal tables
• Append-only (log) tables
• Unstored computed columns
• Alter a table used in a view, so the view loses or gains column(s)
• SELECT * EXCEPT (<column names>)
• GROUP AUTO where the query uses any non-aggregate column(s) as the GROUP BY column(s)
• Automatically updated materialized views
• Access Postgres via WebSocket
Sounds dumb, but I want JSON field schema validation. I added a JSON column for flexible data, and although I'm happy with its flexibility, I kind of hope I can validate the JSON data structure. Recently I found an extension [1] and will try it soon.
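In the meantime, a crude approximation is a CHECK constraint on the jsonb column; nowhere near full JSON Schema, but it catches shape errors (keys and table are made up):

    create table settings (
      id      bigint primary key,
      payload jsonb not null,
      check (
        jsonb_typeof(payload) = 'object'
        and payload ? 'theme'
        and jsonb_typeof(payload -> 'retries') = 'number'
      )
    );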
I don't know. I mean... I'd like that too, but is this really more important than the other features mentioned (in this thread or the article)? What's your reason? (I'm just curious)
Not sure how people live with columns that were added later ending up in an essentially random order, rather than a logically sorted one, when inspecting tables with dozens of columns.
I've used PostgreSQL on a live system, but I'm switching back to MySQL for the next project: on MySQL I don't miss anything from PostgreSQL, whereas on PostgreSQL I do miss this basic feature.
I've read somewhere this feature may be coming but I'd just switch back today.
That'd be tedious and error-prone if you had schemas with maybe dozens of tables. Now you've also got dozens of views to manage and keep in sync with the underlying tables.
I work with both MySQL and PgSQL, and being able to reorder columns easily is one feature I really miss in PG.
I'm waiting for a compact data structure such as RocksDB. I store TBs of data, and MySQL with MyRocks is miles ahead in this department compared to Postgres. I prefer to stay away from sharding and distributed solutions for now.
Simply by using larger compressed blocks (in Postgres these are limited to the page size), and by improving data locality through sorting, which is inherent to LSM-trees.
But if you want higher compression, you need to consider column-oriented DBMS, such as ClickHouse[1]. They are unbeatable in terms of data compression.
1) “NOTIFY” should be eligible for prepared statements to assist in the case where an untrusted value is part of the “channel” or “payload”.
I had to write some ugly stored procedures to attempt the sanitising myself (a pg_notify-based workaround is sketched below, after point 2).
2) Query batching where queries are submitted to a database together and then results are returned. If this could be done without a connection per query that would be awesome. The use case is a global truth database needing 1 write to 20 reads… If the server is in the UK and the client is in Fiji then batching the 20 reads together is a massive latency win.
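On point 1, for what it's worth: NOTIFY itself can't take bind parameters, but the pg_notify() function can, so untrusted values can go through normal parameter binding instead of hand-rolled sanitising. A minimal sketch (channel and payload are made up):
-- The function form accepts parameters like any other query.
PREPARE notify_event (text, text) AS
SELECT pg_notify($1, $2);
EXECUTE notify_event('orders', '{"order_id": 42}');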
Can’t you do number 2 with Postgres pipelining? Here’s an article from bit.io (serverless Postgres w/ data repos); their example uses inserts, but it also works for read ops.
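Even at the psql level there's a crude version of this: statements joined with \; are sent to the server in a single request, so the round trips collapse (older psql versions may only display the last result set). Table names below are made up:
SELECT count(*) FROM orders \;
SELECT count(*) FROM customers \;
SELECT now();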
Yes, you can, except you have to constantly remember that column exists and adjust queries accordingly so you don’t get back deleted results or modify deleted rows. It would be nice if that was abstracted away and you simply added some other keyword to a query when you want to deal in deleted objects.
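The usual way to hide that bookkeeping today is a view over the base table, so everyday queries never see the flag; a rough sketch assuming a deleted_at timestamp column and made-up names:
-- Everyday reads go through the view and never see soft-deleted rows.
CREATE VIEW active_product AS
SELECT * FROM product
WHERE deleted_at IS NULL;
-- "Soft delete" is then just an UPDATE on the base table
-- ($1 = id of the row to soft-delete).
UPDATE product SET deleted_at = now() WHERE product_id = $1;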
I think the problem is there are too many trade offs Postgres would have to make.
- should soft-deleted rows move to another partition to avoid bloat?
- what’s the column name and type (boolean, timestamp, tstzrange)?
- what’s the syntax for a soft delete versus hard delete?
- what’s the behavior of updating a soft deleted row?
- do soft deleted keys prevent inserts with the same key?
Bitemporal tables, mentioned elsewhere in the thread, are a more general solution. Alternatively, you could use a statement-level trigger to prevent deletes on the table.
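The statement-level trigger version is short enough to sketch (table name is illustrative):
-- Raise an error before any DELETE statement touches the table.
CREATE FUNCTION forbid_delete() RETURNS trigger AS $$
BEGIN
RAISE EXCEPTION 'hard DELETE is not allowed on %', TG_TABLE_NAME;
RETURN NULL; -- not reached
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER no_hard_delete
BEFORE DELETE ON product
FOR EACH STATEMENT
EXECUTE FUNCTION forbid_delete();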
I can think of one possible solution:
- Soft deleted rows should be moved to a shadow table, so that they can have their own unique ID and timestamp fields.
- The column name and type question mostly goes away: the shadow table defines its own columns.
- Well, you'd have to extend SQL, but that's done regularly. DELETE (PERMANENTLY | AS MOVE TO SHADOW), with PERMANENTLY as the default.
- Not a good idea to update a soft deleted row, I'd prevent it with roles if possible.
- No, because the shadow table has its own primary key, and shouldn't have any unique constraints on it because of this design.
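That shadow-table flow can mostly be emulated today, which suggests a built-in version is feasible; a rough sketch with made-up names:
CREATE TABLE product_deleted
(
deleted_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, deleted_at timestamptz NOT NULL DEFAULT now()
, row_data jsonb NOT NULL -- snapshot of the deleted row
);
-- Roughly what "DELETE ... AS MOVE TO SHADOW" would do under the hood
-- ($1 = id of the row to move):
WITH gone AS (
DELETE FROM product WHERE product_id = $1 RETURNING *
)
INSERT INTO product_deleted (row_data)
SELECT to_jsonb(gone) FROM gone;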
I love PostgreSQL; it's the database I go for even for really tiny projects and data needs where SQLite would do, but one thing that annoys me is upgrading.
“I am a dummy mode” sort of exists already in Postgres; it’s called row-level security. It allows you to impose, in effect, WHERE clauses on queries to enforce multitenancy constraints.
Having said that, I do like the simplicity of being able to configure the entire database to require WHERE clauses on certain statements; it’s like a poor man’s multitenancy, without the complexity of row-level security.
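For anyone who hasn't used it, the row-level security version of that "imposed WHERE clause" is only a few lines; a minimal sketch with made-up table and setting names:
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON orders
USING (tenant_id = current_setting('app.tenant_id')::bigint);
-- The application sets the tenant per connection or transaction:
SET app.tenant_id = '42';
-- For roles subject to the policy, this now behaves as if
-- "WHERE tenant_id = 42" were appended (table owners bypass RLS
-- unless FORCE ROW LEVEL SECURITY is enabled).
SELECT * FROM orders;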
I'd like bi-directional replication just like MariaDB, without having to use the Citus extension or Patroni. Also, I'd note that while Citus is great, for existing DBs you would basically have to rewrite your schema to support Citus.
For running queries that modify data, I will start with BEGIN, run the query, maybe run some checks, then COMMIT or ROLLBACK depending on if it did what I hoped it would do.
Are there any downsides to that outside of more typing?
Ah, I see the `--i-am-a-dummy` option also affects queries with large result set sizes.
(Author here) I think the purpose of the large result limiting is to avoid swamping the terminal with output. The whole set of --i-am-a-dummy features is there to give a saving chance to the hackers who are typing away, without any forethought, into a prompt in autocommit mode. Using transactions, copy-pasting complete queries, having your coworker double-check your work, etc. give you enough foresight that you won't have a need for it.
Your comment led me to realize that I could run psql with -v AUTOCOMMIT=0 (or add that to ~/.psqlrc) to achieve most of the safety net I've been wanting. My fear has been forgetting the BEGIN.
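For reference, both forms are shown below; with autocommit off, psql issues an implicit BEGIN for you, so nothing is permanent until an explicit COMMIT:
-- One-off, from the shell:
--   psql -v AUTOCOMMIT=off mydb
-- Or as a default, by adding this line to ~/.psqlrc:
\set AUTOCOMMIT off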
Inline SQL functions so I can modularize my code without a massive performance penalty. Foreign key references to system tables so I can reliably store, e.g. function oids.
I'd like PostgreSQL to develop a vision and roadmap for emulating a graph database, bringing the SQL query world and the graph query world closer together.
a great project that shows that such a convergence is not out of the question!
It seems that some sort of graph querying will be part of the upcoming SQL:2023 standard, but details are sparse. This repo seems to have collected some relevant links [0]
> some code comes to accidentally depend on a coincidental ordering of the results
It's not really "coincidental". It's insertion ordering on-disk. And that varies between servers. So if you have 10 replicas they might persist records to disk in a slightly different order. So, if you don't specify an explicit ordering in the query then you will get records back in the order the server finds them on-disk.
It is coincidental if the insertion ordering happens to match the ordering your business logic expects.
Additionally, it’s not even insertion ordering, more like update ordering. All row updates in Postgres result in a new tuple being written, which may land near the original tuple if there’s space, or somewhere else in the table on-disk. Either way, updates will also change your row ordering.
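A quick way to see the effect (the exact ordering depends on storage details, so treat this as typical rather than guaranteed):
CREATE TABLE t (id int PRIMARY KEY, v text);
INSERT INTO t VALUES (1, 'a'), (2, 'b'), (3, 'c');
UPDATE t SET v = 'a2' WHERE id = 1;  -- writes a new tuple for id = 1
SELECT * FROM t;  -- without ORDER BY, id = 1 will often now come last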
With replicas you would expect them all to have very similar, possibly identical on-disk ordering, depending on the replication method. If you’re using plain WAL replication, not logical replication, the on-disk order is probably going to be preserved, because the basic WAL only contains the resulting storage-layer operations, not the higher-level queries. Which means that replaying the WAL elsewhere should result in identical on-disk results.
Good luck with that! Oracle have never made anything that they can't extract money from. In this case, look at their attempts to kill LogMiner (which most CDC systems, e.g. Debezium, use) and force you to pay for GoldenGate hub/microservices/cloud.
1) Simple, easy to use, high-availability features.
2) Horizontal scalability without having to resort to an extension like Citus.
3) Built-in connection pooler.
4) Query pinning.
5) Good auto-tuning capability.