Never agreed with something so much in my life. Points 1-3 in particular. I understand their logic of not wanting to "pollute" the core product with these features.... but it's sort of past the point now where you start to expect this stuff in a modern database product (instead of add-ons, hacks, and outdated blog posts)
It is kind of ridiculous that the first three haven't been sorted by now.
And over time it will increasingly relegate PostgreSQL to being for development only with production use being handled by wire-compatible databases e.g. Aurora.
> 2) Horizontal scalability without having to resort to an extension like Citus.
Just curious: what would having the solution in-core save you? Installation, sure, but that's a one-off, possibly in your deployment code. "CREATE EXTENSION citus" and adding that to postgresql.conf? Sure, but not too much work for me. The rest (the commands to actually create the nodes and do the sharding itself) is something I cannot imagine being different or simpler with an in-core solution.
It means that when you upgrade PostgreSQL you don't have to worry about whether the extensions are compatible and have been fully tested, not just for functionality but for security etc. as well.
And most importantly it means you don't have to worry if that extension will move to a freemium model which (a) often has important features out of your price range and (b) is generally unacceptable in enterprise environments.
Right, those are compulsory steps for every upgrade.
Yet in the particular case of Citus, history (so far) has shown that a) they update the extension regularly and fast, so by the time you want to upgrade to a newer major version you already have an updated Citus too; b) they are going in exactly the opposite direction of "freemium": they actually open sourced even the previously proprietary bits; c) as OSS, it can always be forked, and if it went closed source one day, being such an important project, it would definitely be forked.
Sharding databases is not such a dark, magic art. Citus relies on Postgres for many key features; and Citus does the rest. It's already quite "feature complete".
If Citus were to become proprietary overnight, my main concerns about maintaining a fork would be the codebase and the language expertise more than the sharding concepts.
Note that sharding is different from a purely distributed database. The latter is an entirely different class of system (and a more complex one).
You can't use those extensions on almost any cloud like AWS, GCP, Aiven, etc. I think Azure is the only one offering a Citus product, because they acquired them. Also, extension updates are a major pain point in hosted DBs, always lagging behind. Having some solution in core would resolve this, even if it were just bundling some best-in-class extensions out of the box to bypass these cloud providers' very selective extension support.
It is funny to see this, because every time these same reasons are listed as why people continue to use MySQL, PostgreSQL folks are quick to reply that most of them are non-issues.
This sort of tribal framing doesn't do anything to help the situation. I'm immediately willing to believe that the "Postgres folks" saying that genuinely have not found these points to be issues, which doesn't mean that OP hasn't.
Big second on horizontal scaling. Amazon Redshift is a great example of what's possible, but it also has enough problems that you can't really use it as a primary application database.
Redshift is an OLAP database, much like ClickHouse. For OLAP databases, horizontal scalability is a must - due to generally larger data volumes.
While Postgres is OLTP. For OLTP databases, achieving horizontal scalability requires a more sophisticated approach to distributed consensus, like in CockroachDB or Spanner.
Currently, supporting a feature like "users can reorder the items in a playlist" is typically done using an integer position column. However, this doesn't prevent gaps (so e.g. the third item in the playlist might have `position = 42`), and inserting between two other items requires updating every row with a greater position value.
I'd like to be able to say `update ... set position = 3` to make the record the third item in the list. You'd need to be able to set the scope (eg `add column position ORDINAL WINDOW BY playlist_id` or something).
Maybe you could use the technique of maintaining a separate adjacency list column and materializing the ordering list. It would be a simplified version of what is done for hierarchy to materialize a nested set: https://www.sqlservercentral.com/articles/hierarchies-on-ste...
A DB type could be nice to hide away the adjacency list.
If I were happy to rewrite the queries, I could use something like (forgetting the exact incantation)`select row_number(), rest_of_table as position partition by playlist_id order by position` to get the position column.
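For reference, the exact incantation is roughly the following, assuming a hypothetical playlist_items table with playlist_id and position columns:

    select t.*,
           row_number() over (partition by t.playlist_id order by t.position) as new_position
    from playlist_items t;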
Use the numeric data type instead of integer, and update the position to the midpoint of the two rows you're trying to place the record between.
update ... set position = (prev + next)/2;
To get better performance you can choose to use the float data type, but then you'd be limited to a fixed precision; sufficient for most cases, though.
That still requires you to look up the element in the target position, instead of just saying “put this element in the 3rd position.”
Also, you’ll still eventually need to go clean up the entire sequence because you’ll run out of gaps between adjacent numbers. Because of this, I’d probably rather use a more predictable type (like one of the integer types) and explicitly plan my cleanup schedule.
> Also, you’ll still eventually need to go clean up the entire sequence
If you're using a numeric type, you get up to 16383 digits after the decimal. That's... Probably more precision than you'll be able to reasonably use up in almost any use case. Any time you're reordering in bulk, you're resetting the order value to a nice integer, so it would take many thousands of ad-hoc reordering operations near a single position to get it close to the precision limit, yeah?
Sure, but if it’s orders of magnitude more than you’ll ever need, you’re just using way more storage than you need. That’s what I meant by being more explicit about your plan and using more predictable storage. As a basic example, you could also use integer (or bigint) and start by numbering things like 1000, 2000, 3000, etc. Now you know exactly how many slots between items you have, and can more easily query for cases where you’re running low on slots.
Capacity doesn't mean storage. The default text field can store orders of magnitude more than I might need for a field, but that doesn't mean it takes orders of magnitude more storage.
Or model it as a linked list and you can sidestep the limitations / complexity of some kind of numeric (or bytes / text based ordering field)
    create table playlists (
      id bigint primary key
    );

    create table playlist_members (
      id bigint primary key,
      playlist_id bigint not null references playlists (id),
      prev_playlist_member_id bigint references playlist_members (id),
      -- and/or next_playlist_member_id
      song_id bigint not null
    );
you could then just select * from playlist_members where playlist_id = ... and sort on the client side.
you'd probably add an application limit where playlists have a max length of some kind.
re-orders can be done in a fixed number of row updates and typical application queries are still possible / fast.
or perhaps for some applications it would be sufficient to do
    create table playlists (
      id bigint primary key,
      song_ids bigint[]
    );
and just store the ordering in an array. Some postgres drivers might start shitting the bed though at some gigantic array sizes, but a playlist probably has reasonable enough limits that you wouldn't have a big problem.
ID arrays can't use FK constraints (requested elsewhere in these comments), otherwise that would be pretty good. Performance-wise it means more IO than optimal, but that's not necessarily a huge problem.
If you were going to enforce a maximum size so that you could sort on the client, I bet you could just use an integer and rewrite the entire sequence from 1 to N on every reorder operation. Unless you were expecting to have way more writes than reads, which is a little hard to imagine.
Indeed, but AFAIK that is still much slower than sorting on an indexed integer column, which probably means you still need to enforce a maximum list length.
Writing a linked list in the database might sound good in theory; as someone maintaining a system that uses that technique, please please please do not ever do it. You lose access to basically every database-provided consistency technique.
That's exactly what I want the ordinal type to do transparently under the hood, automatically touching the minimum possible number of rows when I specify e.g. `position = 3`. Bonus points if there's an autovacuum-style procedure to move records apart if they're sitting too close together.
I know the postgres devs don't like them, and that the query planner should be good enough that they're not needed, but it's not, and it regularly fucks up.
I would go further: give me a PL/ language (or an equivalent bytecode-abstract-machine abstraction) that lets me program — at least in a read-only capacity — directly against the access-method handles, such that the DB's table heap-file pages, index B-tree nodes, locks, etc. are "objects" I can manipulate, pass to functions, and navigate graph-wise (i.e. ask a table for its partitions; ask a partition for its access-method; type-assert the access-method as a heap file; ask for the pages as a random-access array; probe some pages for row-tuples; iterate those, de-serializing and de-toasting them as a type-cast; and then implement some efficient search or sampling on those tuples. Basically the code you'd write in a PG extension to interact with the storage engine on that level, but limited to only "safe" actions against the storage, and so able to be directly user-exposed.)
The closest analogy I know of to that, is how you work with ETS tables in Erlang. I want to send the RDBMS code that operates at that level!
Actually, I presume that SQLite would necessarily have some low-level C interface that works like this — but few people seem to talk about it/be aware of it compared to its high-level SQL-level interface.
This is vaguely how the Microsoft "Jet" database engines work. Microsoft Exchange uses this low-level query authoring technique to achieve its scalability and performance goals. This generally makes its performance consistent and predictable.
There were some attempts back in the early 2000s to move Exchange over to use the SQL Server RDBMS engine, and they added a bunch of features to enable this kind of low-level control. Not just join hints: you could force specific query plans by specifying the plan XML document for that query. See: https://learn.microsoft.com/en-us/sql/t-sql/queries/hints-tr...
This wasn't good enough however, and Exchange still uses the Jet database.
Something that might be interesting is an RDBMS "as a library", where instead of poking it with ASCII text queries, you get a full programming API surface where you can do exactly the type of thing you propose: perform arbitrary walks through data structures, develop custom indexes, or whatever.
SQLite compiles SQL to bytecode, and then executes that bytecode against the database. However, there's no public interface for creating/running bytecode directly, instead of as a result of a compiled statement. You almost certainly COULD do what you're trying to achieve, but the SQLite authors have specifically called it out as a bad idea - https://sqlite.org/forum/info/c695cbe47b955076 - since bytecode representation can change from release to release in a way that would only matter to the compiler (or to your weird hacked in interface). Meaning, non-portable.
> since bytecode representation can change from release to release in a way that would only matter to the compiler (or to your weird hacked in interface). Meaning, non-portable.
IIRC JVM static-analysis libraries get around this by essentially forcefully pulling in and reflecting upon the particular compiler release's internals that are being built against. The result is "non-portable", but only in the sense that it's getting tailored to the particular compiler release that's already concretely available in the build environment.
Mind you, that's a bit different, because you don't usually ship the compiler parts of the JDK as part of your application JAR; while SQLite does ship this compiler as part of the library. Would be fine, though, as long as your executable's embedding SQLite statically (or in a Docker image, etc) — in other words, vendoring the particular version of SQLite that matches the version the application-layer codegen library was compiled against.
I would love this too, and looked into building an extension for it a few years ago. IIRC the main challenge was that many features such as row level security are built straight into the current query executor, making it difficult to build this as a production grade tool. I could try to dig up my notes if you’re interested.
The argument from the camera folks goes like this:
Why should a camera's software, written 5 years ago in Japan/China/Taiwan, choose the settings for me, given the lighting conditions I have right now in Seoul at 2:30 in the morning?
That's why most professionals prefer to use a manual mode. Auto is often used as a first suggestion (but not a very good first suggestion).
You can usually manage this by dividing your query into subqueries, each creating some temp table, so you have control over how the joining actually happens.
If you do this be careful what your temp_buffers is set to so that your temp tables don't spill to disk. If you are on a network file system, for example on AWS RDS, writing big (a few hundred MB) temp tables to disk will stall all transactions.
The same is of course true when you have big joins that don't fit in work_mem, but the default size of this will be much larger.
I can usually fix bad plans with CTEs, no need to get much fancier. And the problem is often caused by schema design where you have a mapping table of two tables in the middle and your join is N-M-M where the planner has no information about the relationship between the two outer tables.
My setup is a local NVMe SSD RAID, so I hope this part won't be a bottleneck.
Also, if you are doing a heavy join where the join order and method need to be controlled, your temp table will likely be large, so you will need to be ready for disk I/O.
It's a shame, then, that there's no way to define at-first-purely-in-memory tables, which only "spill" to disk if they cause your query to exceed work_mem.
Within PL/pgSQL, CREATE TEMPORARY TABLE is still (sometimes a lot!) slower than just SELECTing an array_agg(...) INTO a variable, and then `SELECT ... FROM unnest(that_variable) AS t` to scan over it. (And CTEs with MATERIALIZED are really just CREATE TEMPORARY TABLE in disguise, so that's no help.)
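To illustrate the pattern, a minimal sketch (some_table and its columns are hypothetical):

    do $$
    declare
      cached some_table[];
      n bigint;
    begin
      -- pull the rows of interest once into an array variable
      select array_agg(t) into cached
      from some_table t
      where t.customer_id = 123;

      -- scan the cached rows later without creating a temp table
      select count(*) into n
      from unnest(cached) as t
      where t.active;

      raise notice 'active rows: %', n;
    end $$;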
> It's a shame, then, that there's no way to define at-first-purely-in-memory tables, which only "spill" to disk if they cause your query to exceed work_mem.
But isn't this exactly how temp tables work? A temp table lives in memory and only spills to disk if it exceeds temp_buffers[1].
Huh, I think you're right... but it's still slower! I've definitely measured this effect in practice.
Just spitballing here — I think the difference might come from which catalogs the metadata required to treat the table "as a table" in queries has to be entered into, and the overhead (esp. in terms of locking) required to do so.
Or, perhaps, it might come from the serialization overhead of converting "view" row-tuples (whose contents might be merged together from several actual material tables / function results) into flattened fully-materialized row-tuples... which, presumably, emitting data into an array-typed PL/pgSQL variable might get to skip, since the handles to the constituent data can be held inside the array and thunked later on when needed.
I believe the metadata about them is still written to various system catalog tables. Creating lots of temp tables will cause autovacuum activity on tables like pg_attribute, for example.
What I am trying to say is that your link says temp_buffers is a buffer used for accessing temp tables; it doesn't say the actual temp table is stored in that buffer and never flushed to disk.
There is also the use case where the planner comes up with the correct plan in the end, but the process could be sped up by giving it additional hints that lead it to the right result faster by restricting the search space up front
Ran into this one recently, and worked around it by executing a CLUSTER command after inserting data. By CLUSTERing on a column that contains random data (which your test could inject) you force Postgres to randomise the row order on disk, and thus randomise the return order.
In my case our primary key is a random UUID (I know, it’s a terrible thing, it’s not my choice), which is perfect for the task as CLUSTER requires an index. You might be able to pull the same trick by creating an index in a transaction, CLUSTERing then rolling back the transaction, but I suspect that you can’t call CLUSTER in a transaction.
Not as nice as proper test mode, but gets the job done, and avoids the need to wrap queries deep inside your application, with all the brittleness and peril that comes from using wizard level code reflection that’s normally required for such tricks.
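For the record, the workaround amounts to something like this (accounts and its primary key are stand-ins for whatever table the test populates):

    -- physically rewrite the table in (random) primary-key order, so a later
    -- select without an ORDER BY no longer happens to return insertion order
    cluster accounts using accounts_pkey;
    analyze accounts;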
The first and maybe the only time I saw UUID PKs was in a multi-tenant solution. If your customers are potentially competitors to each other, you absolutely do not want people to be able to walk IDs sniffing for other stuff. I wasn't the biggest fan of this solution, but I really didn't have a better suggestion.
On that same project we ran afoul of Javascript's 2^53 floating point limit with IDs, but I think that was a separate issue.
Honestly I think random UUIDs, or just UUIDs in general, make for terrible PKs and generally shouldn’t be used.
To solve the multi-tenant issue, I personally prefer either flake IDs/hash IDs that have an ordered component and a random component to make walking IDs hard; aggressively namespacing your data so a customer ID plus an object ID is always needed as a pair to look something up, so trying to walk object IDs can only ever result in someone accidentally looking up objects that already belong to them; edge layers that remap and filter internal identifiers via hashing etc. so externally all IDs are opaque and random; or strong and careful access control that ensures you can only ever look up IDs that belong to you, with careful consideration for side-channel timing attacks.
Relying on random UUIDs for customer privacy would raise red flags for me. If being able to walk IDs is enough to break your security model, then I kinda wonder if you actually have a security model.
It's security in depth in this case, as I recall. There are subtleties in how you confirm or deny the existence of a record and for their particular solution to the problem you would get a 403 vs a 404 for a record you did not have access to.
Knowing how much activity a competitor is adding into a project management system isn't a lot of data, but it's more than zero.
There’s a simple solution to that. Always return 403, it’s what you should be doing regardless of your underlying data model.
You should always determine the right to access a record before attempting to look it up. If you can’t determine ACLs without the lookup, then your permissions layer should always assume the requester doesn’t have permission to view a non-existent record and return a 403.
Doing anything else is dangerous, and randomising PKs is a sticky plaster over a badly designed access control system.
How about solutions like https://hashids.org/ which let you keep auto-increment integer PKs and present the IDs to the users in an obfuscated string form?
It sounds like GP encountered UUID Version 4 keys.
UUID version 7 features a time-ordered value field derived from the widely implemented and well known Unix Epoch timestamp source: the number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded. It also has improved entropy characteristics over versions 1 or 6.
If your use case requires greater granularity than UUID version 7 can provide, you might consider UUID version 8. UUID version 8 doesn't provide as good entropy characteristics as UUID version 7, but it can utilize a timestamp with nanosecond-level precision.
> Both UUIDv8 and UUIDv4 only specify that the version and variant bits are in their correct location. The difference is that UUIDv4 specifies that the remaining 122 bits be pseudo-randomly generated. UUIDv8 suggests that the generated value still be time-based however the implementation details of how that generation happens are up to the implementor. This means we have 48 bits of custom implementation, four bits for the version (1000), 12 more bits of custom implementation, two bits for the variant, and finally 62 bits as the implementation sees fit. This leaves a lot open to the implementor but it has enough rules around it that it can coexist in the existing UUID environment.
What's wrong with UUIDs? Back in the 90s Microsoft strongly encouraged it. You lose a little performance and a little storage (128 vs 64 bits), but gain the ability to merge databases (think corporate acquisitions) with less pain.
At least one distributed datastore (Google Cloud Datastore) works better with random keys rather than incremental keys; it tends to produce fewer tablet splits. At least, it used to, the technology may have changed since then.
Honestly it seems like the arguments for/against UUIDs as keys are pretty mild on both sides. Why would it be "terrible" to go one way or the other?
Harder to copy and paste IDs. Discourages denormalization since every reference is 128 ~~bytes~~ bits. Reduced throughput for inserts since the primary index pages won’t all be in cache if the keys are randomized. Lose ability to use id as a fast temporal ordering. Encourages uuids generated client-side.
All in all not enough good reasons to make a lot of stuff slightly worse.
> Reduced throughput for inserts since the primary index pages won’t all be in cache if the keys are randomized. Lose ability to use id as a fast temporal ordering.
You can generate the first few bytes (we use the first 3) of the UUID from a timestamp - that gives you index locality. It's even better than an auto-inc because you can control exactly how much of your index will be used for hot inserts based on how many timestamp bits you use, so you can avoid lock contention around the single latest index page.
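To illustrate the idea (not necessarily the exact scheme described above, and the byte positions are tunable), a sketch that overwrites the first 3 bytes of a random UUID with the middle bytes of the unix timestamp, so the hot prefix advances every ~4 minutes:

    -- gen_random_uuid() is in core since PG 13 (pgcrypto before that)
    create or replace function uuid_time_prefixed() returns uuid
    language sql volatile as $$
      select encode(
               overlay(
                 uuid_send(gen_random_uuid())
                 placing substring(int8send(extract(epoch from clock_timestamp())::int8) from 5 for 3)
                 from 1 for 3),
               'hex')::uuid
    $$;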
> Encourages uuids generated client-side.
It's really an API design choice about whether you allow this, but it can be useful in some circumstances (if you trust the client).
> so you can avoid lock contention around the single latest index page.
That is usually the total opposite of what you want. There are some optimizations for inserting to the last page, but primarily it is because you want to do sequential inserts. So if you want to avoid contention on the last page you should insert ordered by (connection ID, sequential ID or high-resolution timestamp). That way every connection will do sequential inserts on its own page.
Nice, this makes sense. Seems like storing connection_id in the table would make it difficult to look up data later? It seems like you could use (customer_id, seq_id) for a similar effect.
- UUID (fully random) is completely secure and private. You cannot infer anything from it. As soon as you add anything to it ( like timestamp or counter), outsiders can infer things like how fast you are generating new objects
- UUID (fully random) cannot be abused by developers, which might skip creating timestamp columns and read it from the id instead
- UUIDs are slower, but as you see, it's not for no benefit
Not that much slower than bigint as long as you replace the top one or two bytes in the UUID with a rotating per-minute counter.
You keep 106 bits of entropy (from v4's 122 bits) while largely eliminating page faults. Walkability is eliminated while not leaking too much temporal info.
> Discourages denormalization since every reference is 128 bytes.
This might just be a mistake, but a uuid is 128 bits = 16 bytes not 128 bytes. Of course yes there are still circumstances where the extra space (16 bytes vs 8 or 4 or something else) isn't worth it.
Is there an actual problem with this? I don't mean "it feels impure" - what's the failure case here that's severe enough to discard UUIDs as PKs?
The other reasons all boil down to 'database performance tuning', which is reasonable if you have any prospect of needing it, but most datasets are _tiny_.
The other reasons are more important imo, and they aren’t really all tuning; more ergonomics and simplicity.
There are a lot more edge cases when you use client-side generated IDs. The main one is that you have to check whether the ID is actually new to avoid security issues. Let’s just upsert into the DB and return the results, and the uuid is new and randomly generated so it’s fine! So simple! Except now Alice can send someone else’s uuid and read their data. X100 endpoints all need to handle this correctly. It’s a huge risk.
If you do offline-first ID generation using global uuids, the possibility of conflicts (due to bugs, patched client code, etc.) is a whole rabbit hole of edge cases and problems as well.
It can be done, Bret Taylor used that architecture for Quip, but it’s needlessly tricky, which is the whole story for UUIDs imo - annoying and slightly worse for many normal apps. If you’re building a complex distributed system and want to deal with all the trickiness, go ahead. I’d recommend you use a B64 custom identifier instead of UUID so that it’s copy-pasteable :-)
The nature of your application and its security and idempotence needs will determine who should (ideally) generate ids for any specific context. The format is irrelevant.
Generate ids on the client when it makes sense, generate them on the server when it makes sense. Server-side is more common but if you choose poorly you'll make bad software either way.
> Postgres can auto generate UUIDs too and can be used for pk
True, but the practice makes it easier to shift to the dark side and insert the child records first, as you can already know the pk of the parent record in advance.
> How can you do that when you have fkeys in your database design like a good engineer, right? Right?
I know you are trolling, and it is OK if done sensibly, but it is not simply purism that makes it a bad idea. Because you have to trust the client, it opens the way to discovery attacks (try inserting a record with a specific PK -> it will fail if the record exists -> now you know that it exists even if you don't have access to that record).
You may also not have access to all client implementations (think of public APIs) so some client libraries might not implement proper (i.e. strongly random) UUID generation.
B-trees are self balancing so I don't think there will be bloat. But if you insert towards the end of an index a lot, the right-side of the index will more likely be in cache which makes it a lot faster. If you inserted randomly then you need to do lots of traversals in different parts of the B-Tree.
When you’ve got 128 bits to play with, I think you can do more useful things with those bits than UUIDs provide.
UUIDs are perfectly fine identifiers in a purely abstract sense. But baking a little more info into those identifiers can make debugging and building operational tools so much easier. It’s just a massive waste to fill those 128 bits with either random data, or the very limited options that other UUID versions give you.
Arguments about leaking information publicly are, I think, a little silly as well. There are plenty of ways of preventing that; trading ease of debugging for making your internal identifiers safe in public is a poor trade-off in my view.
Temporal tables are IMHO a profound gamechanger. Once you understand how to use them, you'll have a completely different view on your data models/DB schemas and your data lifecycle.
It makes so many things far easier, having a temporal relation for every record and saves a lot of headaches that would otherwise usually be dealt with on the application level.
Imagine you're building a support ticket tracker whose reporting capabilities should provide data on various historic aspects of your tickets, such as "time per assignee", "time per state", "time between creation and solution" etc.
In an environment built on "traditional" SQL you'd have a history table, where you first and foremost would have to maintain records for those values:
- ticket ID
- time of state change
- type of state change
But for the stats to be actually valuable, you might need additional history context (different per type), such as the user ID, the state, criticality, etc., for which either an additional table per history item type would be required, or the data could be stored as a serialized value with a metadata column on the history table. One requires complexity on the DB level, the other requires application-level logic and slows down report generation tremendously.
Temporal tables to the rescue to remove all this complexity.
The full state of a ticket record, including its related entities at the given time of an UPDATE, is fully preserved by the DB natively, so all you'd have to do to retrieve historic records at a given time is to append "AS OF $TIMESTAMP" to your query.
In the end, there's no need for extra history tables, much better performance when building reports, no need for application level data wrangling, ...
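Until something like SQL:2011 system versioning lands in core, the manual emulation looks roughly like this (tickets_history is a hypothetical table; the exclusion constraint needs the btree_gist extension):

    create extension if not exists btree_gist;

    create table tickets_history (
      ticket_id    bigint not null,
      assignee_id  bigint,
      state        text not null,
      valid_during tstzrange not null,
      exclude using gist (ticket_id with =, valid_during with &&)
    );

    -- the hand-rolled "AS OF" query:
    select *
    from tickets_history
    where ticket_id = 42
      and valid_during @> timestamptz '2023-01-15 12:00+00';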
Half my job is doing this in reverse by accumulating deltas over streams and materializing the current state in various caching tiers. I'd love this to be a native feature in PG, but for any of my workloads it would need to support a lower-cost archival-grade storage tier a la a blob store.
Oh thanks for sharing this! I love all of those, and most seem like they'd be easy sugar over the existing syntax. The biggest missing feature from those that I would really enjoy in data exploration tasks (though not in PROD) would be automatic JOIN ON selection based on foreign keys.
Example:
SELECT users.id, COUNT(*)
FROM users
JOIN orders ON AUTO
WHERE orders.created_at > NOW() - '7 day'::interval
GROUP BY ALL
This would only work if there was an obvious path to do the join. In this case, I'm imagining that the `orders` table might have a `user_id` column which is a foreign key into the `users` table.
That sounds very close to NATURAL JOIN which is already present[0] although that does rely on the typical convention of FK columns being named the same on parent and child (related) tables.
I think you are suggesting some sort of lookup based on the defined FK relation, but that would be confused by situations where tables have multiple FK relations, such as tables with values restricted by a lookup value table (or more than 1 such FK). Those are pretty common, so I could see the 'AUTO' feature breaking down quickly. I think that is why the NATURAL JOIN approach is taken and that basically does what I believe you are describing, provided the column naming is matched.
You could use the name of the foreign key; together with FKs namespaced to the table they are on, that would allow very expressive joining, and the query might even survive schema changes. ORMs tend to work like this.
> "Loose indexscan" in MySQL, "index skip scan" in Oracle, SQLite, Cockroach (very recently added), and "jump scan" in DB2 are the names used for an operation that finds the distinct values of the leading columns of a btree index efficiently; rather than scanning all equal values of a key, as soon as a new value is found, restart the search by looking for a larger value. This is much faster when the index has many equal keys.
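In Postgres today the usual workaround is a recursive CTE that hops through the index by hand, something like this sketch (assuming a btree index on some_table(category)):

    with recursive distinct_categories as (
      (select category from some_table order by category limit 1)
      union all
      select (select s.category
              from some_table s
              where s.category > d.category
              order by s.category
              limit 1)
      from distinct_categories d
      where d.category is not null
    )
    select category
    from distinct_categories
    where category is not null;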
Omg yes please. I get that it might be slightly harder to parse, but any language that doesn't support it inevitably annoys me. In pg and JSON you end up with runtime errors; in programming languages it leads to bigger diffs than necessary (adding an entry to a static list is a 2-line change instead of 1 line).
This is how I format sql code when I have control over it, have for years. People often recoil in horror, sometimes they ponder for a minute then switch over themselves.
It works very easily and consistently to center around the space and lead with the comma:
select id
, name
, address
, ts
from table
where condition
and etc;
SELECT a.foo
, b.bar
, array_agg(z.gumball) gumballs
FROM zoo z
INNER JOIN alpha a
ON (z.z_id = a.z_id)
LEFT JOIN baker b
ON (a.a_id = b.a_id)
WHERE z.last_modified > '2023-01-01'
AND a.is_active
GROUP BY 1, 2
ORDER BY 2, 1
LIMIT 100
;
Left side highlights the operations. Commas and "AND" delimit parts of each directive. The semicolon lines up on the left side to help visualize the end of each statement in a long chain of commands, especially DDL.
I also tend to capitalize the SQL keywords and leave the identifiers lower case.
Sometimes yes. I don't think there are any perfect solutions and I won't pretend this one is. But usually I find those change less frequently than columns and where clauses, and if you write them out and declare them explicitly you rarely have to move the whole query over. It does happen though.
My alternative is to follow with “1” so every preceding line can have a trailing comma or start a predicate clause with “1 = 1” so every subsequent line can start with “AND foo …”
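i.e. something like (columns are hypothetical):

    select id,
           name,
           1
    from t
    where 1 = 1
      and foo = 'bar'
      and created_at > now() - interval '7 days';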
I hope postgres adopts easily disabling an index instead of deleting it. MySQL has this feature and it's very useful for testing whether an index can be deleted on production load. Transactional DDL is amazing already, disabling indexes would be a great addition for low-risk performance tuning.
There was an extension from Teodor some time ago, plantuner, to disable indexes in a session or globally ("set plantuner.forbid_index='id_idx2';") – but it didn't make it to core, and even to contribs. Maybe because the functionality to disable indexes was mixed with planner hints there. It's a very old story, discussion from 2009: https://www.postgresql.org/message-id/flat/47E63672-972E-452...
- disable for all
- disable only for my session, to check what would happen with the plan, and only then decide to proceed with disabling for all (or to drop it)
ALTER is quite an invasive way, even more so than "UPDATE .. SET indisvalid = false ...". It would be good to do it via SET, as was proposed in the plantuner extension long ago.
CTEs should not be materialized by default (latest PG if you reference a CTE more than once it stores intermediate rows on disk and disables indexes and causes a lot of trouble.)
I’d love to see a B-tree primary storage option, aka storing the row data inside the primary index. This can save a lot of space for thin tables or tables with large keys, and would be basically an auto-CLUSTER with all the performance that comes with that. This is how MySQL works. The heap primary storage is better sometimes, but it sucks for range queries, leading people to need Timescale for good data locality.
> CTEs should not be materialized by default (latest PG if you reference a CTE more than once it stores intermediate rows on disk and disables indexes and causes a lot of trouble.)
Nope, it does not write anything to disk unless you have more data than work_mem.
Nice, TIL. The default value for work_mem is only 4MB though, and when I’ve seen materialized CTEs go wrong they’re usually materializing intermediate results in the 10s or even 100s of MBs. Usually “with products as (select * from products where customer_id = 123)” type of stuff, where some customers have hundreds of products with large rows.
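For what it's worth, since PostgreSQL 12 the default can be overridden per query:

    with customer_products as not materialized (
      select * from products where customer_id = 123
    )
    select count(*) from customer_products;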
Not the same, this is called a clustered index or index organized tables.
With a covering index in PG you still have a heap storing the rows and a copy of the covered columns in the index.
With clustered indexes or IOTs the table is the index, so there is no duplication unless you have other secondary indexes. This saves a lot of space and reduces indirection when seeking on the clustered index.
In some DBs, like SQLite and InnoDB (MySQL), this is always the case: the table is a B-tree and there is always a clustered index even if you don't define one explicitly. In others, like Oracle or MSSQL, you have a choice of unordered heap or B-tree; in PG you have no choice, the table is always an unordered heap and all indexes are secondary.
Thanks for this explanation -- didn't see how they were different but your comment explains it perfectly.
With this understanding I can't grasp the benefit of a clustered index -- if you're using the primary index (let's say numeric auto-incrementing ID) then you'd likely have a secondary index for that already (in PG). If you were searching by something else, the default clustered layout is a hindrance as you must traverse unnecessarily to find entries (rather than sequentially scan)...
The only penalty of PG's decision seems to be excess memory usage (storing a second copy of the identifying tuple contents in memory) -- is that characterization correct? But I wonder how this holds up with a spinning disk -- I wouldn't want to follow indices (and do random reads) on spinning disks.
Looking at the other side, I guess the case where PG shines is where you want to do batch processing (so looking at a page of tuples is good locality-wise), but you also often retrieve by the identifier, and don't mind paying the cost of [rows x identifier size] for the privilege.
Are there some specific use cases where clustering indices clearly outperform/are the right choice?
If you primarily look up by the primary key, which is pretty common, the clustered index is faster than looking up the primary key in a standard index and then finding the row in the heap for the other columns. If you make the standard index cover all your columns, you get the behavior of a clustered index, except the data is duplicated in the heap.
Clustered indexes can save significant space on a narrow table with a lot of rows accessed in a particular way, and perform better as well; if you access the table in multiple ways with other secondary indexes, it can be slightly slower to access through a secondary.
The important factor is # of pages on disk that need to be retrieved. Secondary indexes (aka every index in Postgres) have to lookup the primary storage as well. If you use a primary key index to pull 100 rows, if the rows aren't clustered, then you're looking at ~300 pages needing to be pulled. That's roughly 2 per index traversal (the first level of the index is generally cached) plus one per row to pull from primary storage.
This can be improved in two ways. One, if you add a second index which gives you better locality in the index. For example (customer_id, product_id) will group up all the rows by customer id. This can reduce the # of pages for index traversals down to <5 as long as each customer doesn't have a lot of rows. And in many cases this makes the primary index on just id useless. This brings the total down to 105 pages give or take. (depends on how many products each customer has)
The other way is to use an actual primary index, or use a covering index so that the data you retrieve is already in the index. For example if you're just pulling product_name from your table, you can use covering index on (customer_id, id, product_name) so that the product_name has locality with the customer's product IDs. This would bring down the total pages to be retrieved down to maybe ~20, since product_name tends to be larger data. It's a question of how many (customer_id, product_id, product_name) tuples can fit on one 8KB page and how many products the customer has.
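In Postgres terms that covering index could look roughly like this (products is a hypothetical table; INCLUDE is available since PG 11):

    create index products_customer_cover
        on products (customer_id, id)
        include (product_name);

    -- can be satisfied by an index-only scan (given a reasonably fresh visibility map)
    select id, product_name
    from products
    where customer_id = 123;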
If you use a primary index, the whole rows are on the pages. This lets you run queries that pull lots of data (or different data) and have good data locality, but it means fewer tuples per page, so you need more pages. So you'd access maybe ~50 pages, but this index could cover a lot of queries, unlike the covering index which only works for product_name.
These days SSDs are much faster than hard drives, so the # of pages pulled off disk is still important, but not as much so. Another thing this buys you is that you don't pollute the in-memory cache by evicting pages just to load a new page that doesn't get utilized well. For instance, the original index's leaf entries are just 4 bytes (product_id, rowid), so every time you pull in one of those pages for a single entry you throw away 99% of the data on that page.
Covering indexes are different. You’d have to have a 2nd unique index which has all the rows from the table. A lot of large datasets would benefit from auto clustering but can’t handle 2x the storage for the same data.
Ah thank you for noting this -- combined with SigmundA's answer I think I get how they're different, and how covering indexes are not the same solution
Although it has its gotchas, MySQL dialect is still somewhat better from the usability standpoint. "SHOW TABLES" is nice; it is natural and self-explanatory. "\d" is obscure.
It also reminds me how many DBMSs did not support the LIMIT clause because it is non-standard. But it is good from a usability perspective.
I've never looked into this for Postgres itself, but AWS Redshift (a postgres derivative) has enough info in the system tables to reconstruct a create table statement using a view (see: https://raw.githubusercontent.com/awslabs/amazon-redshift-ut...) and I'd guess that you can do the same for Postgres.
You can use "select * from pg_stat_user_tables" etc.
Admittedly the internal tables and views have a bit of a learning curve and are a little bit obscure (and inconsistent) at times, but it's much nicer and much more flexible.
The CLI stuff like \d are just "aliases" for queries; \set ECHO_HIDDEN on shows them.
Of course you can, but that is beside the point. Convenience matters - besides, these approaches can coexist, adding those simple queries (as aliases) should be trivial.
Array foreign keys and indexes on arrays/JSON nested fields. When using MongoDB, having this eliminates most need for join tables. (*This may have already been implemented, I haven't used SQL recently.)
I think they have indexes on json (or any selectable thing?).
You could probably implement the foreign key using generated columns these days. Ah I guess you mean instead of having the association table? That would definitely be nice in some cases.
We very occasionally use integer[] columns with ids, but hold off on using them widely b/c of the lack of integrity constraints. It'd be awesome to use more often.
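Indexes on nested JSON fields and on arrays do exist today, for example (events and playlists are hypothetical tables):

    -- GIN index for containment queries on a jsonb sub-document
    create index on events using gin ((payload -> 'tags'));

    -- plain btree on a single nested field
    create index on events (((payload -> 'user') ->> 'id'));

    -- GIN index on an integer[] column, for queries like: where song_ids @> array[42]
    create index on playlists using gin (song_ids);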
Interesting point. I don't think I have that particular use case, but I can definitely imagine it. I'd think especially with dates in the far future, where the timezone offsets are altered between the time of recording and the time value itself?
Individual timezones are pretty stable but a handful change every year in some way, often switching how they observe DST or something similar. If you have a truly global userbase and this actually matters, you'll definitely hit them.
Lunch service 11am to 3pm Weekdays, 11 to 2pm Weekends
How would that dynamically map to changes in the timezones that might be mandated by the law but reflect the correct event times in a scheduling system? Spreadsheet systems developed the $ prefix for cell addresses as a shorthand to lock that in when adjusting the relative offsets in copy and paste / duplicate operations.
That’s a weird use case but you could just create a UDT to do it. You need the timestamptz value, the timestamp, and the timezone. Super expensive but ¯\_(ツ)_/¯
Ha I had a need for this yesterday. Just added a timestamptz column and text column. In my use case the text column preserves the input value for display to the user while the timestamp itself is used programmatically. But a “TIMESTAMP WITH TIME ZONE PRESERVING TIME ZONE” data type would be cool.
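A sketch of what that two-part value could look like as a composite type (the type and table names are made up):

    create type ts_with_orig_tz as (
      ts_utc  timestamptz,
      tz_name text
    );

    create table bookings (
      id        bigint generated always as identity primary key,
      starts_at ts_with_orig_tz
    );

    insert into bookings (starts_at)
    values (row(timestamptz '2024-03-10 09:00+09', 'Asia/Seoul')::ts_with_orig_tz);

    -- render in the originally supplied zone
    select (starts_at).ts_utc at time zone (starts_at).tz_name
    from bookings;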
If you execute a modifying query without a where clause it will stop you and double check that's what you intended to do.
Likewise you can specify a database connection as read-only so that it won't run modifying queries at all; attempting to do so will stop you, but you can then explicitly run it if you need to.
> Likewise you can specify a database connection as read-only
We emulate that in prod by having admin users with read only permissions. They are granted other roles without the INHERIT option, thus needing an explicit `SET ROLE ...` before being able to do anything dangerous.
I also teach people the habit of combining `BEGIN; SET ROLE ...` every time they need to write something. It has completely stopped "whoops, prod" incidents since it was implemented.
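i.e. the muscle memory is roughly (app_admin being whatever your writer role is called):

    begin;
    set role app_admin;
    update accounts set plan = 'pro' where id = 42;
    -- inspect the result, then
    commit;  -- or rollback;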
> Likewise you can specify a database connection as read-only so that it won't run modifying queries at all; attempting to do so will stop you, but you can then explicitly run it if you need to.
Does it parse the SQL and guess if it’s DML or run it in a transaction and rollback after?
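(For reference, Postgres itself can also enforce this server-side, independent of how the client does it:)

    -- everything in this session defaults to read-only transactions
    set default_transaction_read_only = on;

    -- or per transaction
    begin read only;
    update t set x = 1;  -- fails: read-only transaction
    rollback;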
I'd like to see easier upgrades between major versions, or at least published docker images that can do the upgrade. As it stands, it's a huge hassle to get both binaries installed on the same system. This is a task almost everyone has to do periodically, it should be easier.
I also found it strange that PostgreSQL cannot simply support data formats of previous versions. In contrast, the latest version of ClickHouse[1] (23.1) can be installed over the version from 2016, and it does not require format conversions or any other migration procedures.
Yes this is a huge problem. I wish PG could contain whatever it needs to do in place upgrades from supported versions. Statically compiled builds of previous versions of pg_upgrade?
You can use the slow query log, which is very helpful. Lots of logging tools will ingest it properly and you can aggregate queries that are run frequently. Having an index or not is just one optimization.
Many of those queries may run sub-second (typical lower-end value for log_min_duration_statement), so they won't get logged. Yet, if called at high frequency, may represent a notable % of your CPU and I/O. The slow query log is not enough in many cases.
Maybe... but there are many queries that don't use an index (whether fast or slow) and that's the right thing to do. As a DBA, I'd see too many "false positives" due to this reason.
One technical problem is that the WAL records changes to "shared" relations (stuff like the catalogues of databases and users that are global to the whole "cluster", not just one database), and those are mixed up with changes to per-database objects (i.e. most WAL activity).
That's interesting. I have a separate cluster setup without archiving because some databases don't need it. It's so simple to run various clusters side-by-side on the same host that I don't think having compartmentalised wals would be easier to manage.
Yeah, but I want cheap in-memory joins between the WAL-isolated datasets. I.e. "multi-world" MVCC concurrency, where a TX is locking in a separate min_xid for dataset A, B, C, etc. for the lifetime of the TX — but without this being a big distributed-systems vector-clocks problem, because it's all happening in a single process.
Why? Being able to run a physical replica that loads WAL from multiple primaries that each have independent data you want to work with, for one thing. (Yes, logical replication "solves" this problem — but then you can't offload index building to the primary, which is often half the point of a physical replication setup.)
I don't know about pseudocode, but I think a simplifying analogy would be to consider an embedded DB (like SQLite) or key-value store (like LMDB) library that you interact with through transactions.
In such a system, "single-world MVCC" would be what you'd get by putting everything into one database file, with any changes intended always done within one DB tx of that single DB file.
"Multi-world MVCC", then, would be what you'd get by opening multiple database files, each of which maintains its own WAL/journal/etc., and then creating an application-layer abstraction that allows you to coordinate opening DB txs against multiple open DB files at once, holding the result as a single handle where if you hit a rollback on any constituent tx, then the application-layer logic guarantees that the other DB files' respective txs will be told to roll back as well; and that when you tell this coordinated-tx to commit, then it'll synchronously commit all the constituent DB files' txs before returning.
Note that unlike with a single DB file managed through a single readers-writer lock, this kind of system can introduce deadlocks (but DBs with more complex locking systems, like Postgres, already have that possibility.)
And then your application has multiple instances of these "IDatabase" objects, maybe one per physical/logical database file, e.g. ".sqlite3" in the case of SQLite
And at the root of your application, you have something like an:
Yes, query plan reuse like every other DB. It still blows me away that PG replans every time unless you explicitly prepare, and even that is still per connection.
As I understand it this is supposed to be done in the client libraries rather than in the server. It’s not that it doesn’t reuse query plans, it just doesn’t do it in core.
It makes the most sense for the server to reuse plans across all connections regardless of client; big-iron DBs have been doing this forever (remember, there are multiple app servers). Even beyond that, PG is per connection, so even with multiple threads/connections per client there is no plan reuse between them.
I would love to see more languages in their full-text search by default. But it looks like currently it's an "if you want a language, contribute it" kind of deal, and I've found this outside of my depth.
My vote is for ids with prefixes. If you've ever used the Stripe API, you know how nice it is that all customer ids are prefixed with `cus_`, all product ids are prefixed with `price_` etc. I want that in postgres, so that any time you see an id you know exactly what kind of entity you are dealing with. It'd be incredibly useful for dealing with data analysts, and also for debugging production issues from logs / crash dumps.
You wouldn't need to store the prefixes at all, it'd be part of the column metadata, but it would be prefixed before sending the data back to the client.
I could do this at the ORM level but in my experience even with the greatest ORM in the world you often have to drop into raw SQL. Additionally people might have multiple ways of accessing the same db or a read-only duplicate.
Postgres allows one to write custom data types in C or another systems programming language. The only limitation is that the "type modifier", in this case the prefix, would need to fit in a "single non-negative integer", so in principle using e.g. 5 bits per character you could get up to 6 lowercase letters, `-`, `_` etc. into the prefix. Of course a custom extension would have limited utility if it doesn't land in cloud services like AWS RDS.
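Short of a custom type, a lighter-weight sketch is a generated column holding the prefixed form (generated columns exist since PG 12; the table is hypothetical):

    create table customers (
      id        bigint generated always as identity primary key,
      public_id text generated always as ('cus_' || id::text) stored,
      name      text not null
    );

    select public_id, name from customers;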
I worked on a db that did this once. Prevented generic code being written that could work over multiple objects/tables. Caused more work than it prevented bugs.
You can do this with numbers. Customers start with 1, products start with 2, etc. This is a little annoying if you aim small (products are 200-299, but now you have your 101st product), or design this scheme without realizing you need to add a 10th entity (or 100th), but it does mostly work.
Rather than this, why not do more with existing table REFERENCES metadata? For example, why can't I have a covering index across data from two tables, joined through a foreign-key column with a pre-established REFERENCES foreign-key constraint against the other table, where the REFERENCES constraint then keeps everything in place to make that work (i.e. ensuring that vacuuming/clustering either table also vacuums/clusters the other together in a transaction in order to rewrite the LSNs in the combined index correctly, etc.)
(Author here) This is also a neat idea but wouldn't help with subqueries, views or other computed relations getting joined on each other which is a major motivation of the validation proposal.
I could see that being a motivation, but isn't that exactly the use-case where it'd also be a bad idea, overhead-wise?
With a static DDL constraint on join shape, an assertion about such shape can be pre-validated at insert time, to be always-valid at query time, such that queries can then pass/fail such assertions during their "compilation" (query planning) step.
Without such a static constraint, you have to instead insert a merge node in the query plan (same as what a DISTINCT clause does) in order to normalize the input row-tuple-stream and fail on the first non-normalized row-tuple seen.
You could pre-guarantee success in limited cases (e.g. it's going to be a 1:N join if the LHS of the join has a UNIQUE constraint on the single column being joined against); but you're not going to have those guarantees in most cases of "computed relations."
The overhead would be terrible. If implemented, you'd put the syntax in your queries, but a GUC or some such would turn off validation, it'd be off by default (because of that overhead) and you'd only turn it on for local dev and for testing.
The underlying issue, mixing up the cardinality of one side of the join, isn't a super tricky one to identify, especially once you've seen it once or twice or seen what the buggy output looks like. Every intermediate SQL developer has run into the bug and has fixed it. The vision I have for the feature is more like a set of training wheels for developers who are earning their SQL scars. And it's also probably not a terrible idea to document exactly where you are expecting the full M x N cartesian join results when you do want them.
I wish there was a way to have "named" constraints that you could share between tables. They would behave akin to mixins, and that would open up a world of possibilities.
(And if you're into that sort of thing, there's a lot of type theory concepts that could be applied on top of it that would make things much more practical and approachable)
Also, I'm eagerly waiting for Incremental View Maintenance to be merged into main postgres.
Comments inline with column and table definition. And simple idempotent DDL.
AFFIRM TABLE product_category
(
category_id smallserial
PRIMARY KEY
, name text
NOT NULL
) COMMENT 'Product categories from the official catalog'
;
AFFIRM TABLE product
(
product_id uuid
PRIMARY KEY
DEFAULT gen_random_uuid()
WAS (id uuid DEFAULT gen_random_uuid())
, (id int8 DEFAULT gen_random_uuid() USING gen_random_uuid())
COMMENT 'Product lookup id'
, category_id REFERENCES product_category
ON UPDATE CASCADE
ON DELETE RESTRICT
NOT NULL
COMMENT 'Product category'
, serial_num text
NOT NULL
COMMENT 'Product serial number'
) COMMENT 'Our nifty products'
WAS (products)
;
Just like a SELECT statement describes the structure of the output with the engine figuring out how to retrieve it, instead of CREATE, CREATE IF NOT EXISTS, and ALTER statements over and over, just describe what your target is supposed to look like and let the engine reconcile. The WAS keyword (for example) could allow engines to figure out how to get from one state to the next. In this case converting the id column from an int8 to a UUIDv4 and then changing its name from id to product_id. The table was named products and is now product. No more Flyway or the like where you have to hunt through migration files to find the latest definition for a view or function.
So much easier to diff when making changes. Then after all databases have been migrated, prune the old, obsolete WAS definitions. Also, why do we have to specify the type of category_id in the product table? It's a foreign key reference; the type is not only implied, it's mandatory to match exactly.
But since we're already here on the topic:
• Real pivot support, not that crosstab hack
• System-versioned temporal tables
• Append-only (log) tables
• Unstored computed columns
• Alter a table used in a view, so the view loses or gains column(s)
• SELECT * EXCEPT (<column names>)
• GROUP AUTO where the query uses any non-aggregate column(s) as the GROUP BY column(s)
• Automatically updated materialized views
• Access Postgres via WebSocket
Sounds dumb, but I want JSON field schema validation. I added a JSON column for flexible data, and although I'm happy with its flexibility, I kind of hope I can validate the JSON data structure. Recently I found an extension [1] and will try it soon.
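In the meantime, a crude approximation is a CHECK constraint on the jsonb column; nowhere near full JSON Schema, but it catches shape errors (keys and table are made up):

    create table settings (
      id      bigint primary key,
      payload jsonb not null,
      check (
        jsonb_typeof(payload) = 'object'
        and payload ? 'theme'
        and jsonb_typeof(payload -> 'retries') = 'number'
      )
    );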
I don't know. I mean... I'd like that too, but is this really more important than the other features mentioned (in this thread or the article)? What's your reason? (I'm just curious)
Not sure how people live with columns that were added later ending up in an essentially random order, rather than a logically sorted one, when inspecting tables with dozens of columns.
I've used PostgreSQL on a live system, but I'm switching back to MySQL for the next project: on MySQL I don't miss anything from PostgreSQL, whereas on PostgreSQL I do miss this basic feature.
I've read somewhere this feature may be coming but I'd just switch back today.
That'd be tedious and error-prone if you had schemas with maybe dozens of tables. Now you've also got dozens of views to manage and keep in sync with the underlying tables.
I work with both MySQL and PgSQL, and being able to reorder columns easily is one feature I really miss in PG.
I'm waiting for a compact data structure such as RocksDB. I store TBs of data, and MySQL with MyRocks is miles ahead in this department compared to Postgres. I prefer to stay away from sharding and distributed solutions for now.
Simply by using larger compressed blocks (in Postgres these are limited to the page size), and by improving data locality through sorting, which is inherent to LSM-trees.
But if you want higher compression, you need to consider column-oriented DBMS, such as ClickHouse[1]. They are unbeatable in terms of data compression.
1) “NOTIFY” should be eligible for prepared statements to assist in the case where an untrusted value is part of the “channel” or “payload”.
I had to write some ugly stored procedures to attempt the sanitising myself (a pg_notify-based workaround is sketched below, after point 2).
2) Query batching where queries are submitted to a database together and then results are returned. If this could be done without a connection per query that would be awesome. The use case is a global truth database needing 1 write to 20 reads… If the server is in the UK and the client is in Fiji then batching the 20 reads together is a massive latency win.
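On point 1, for what it's worth: NOTIFY itself can't take bind parameters, but the pg_notify() function can, so untrusted values can go through normal parameter binding instead of hand-rolled sanitising. A minimal sketch (channel and payload are made up):
-- The function form accepts parameters like any other query.
PREPARE notify_event (text, text) AS
SELECT pg_notify($1, $2);
EXECUTE notify_event('orders', '{"order_id": 42}');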
Can’t you do number 2 with Postgres pipelining? Here’s an article from bit.io (serverless Postgres w/ data repos); their example uses inserts, but it also works for read ops.
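Even at the psql level there's a crude version of this: statements joined with \; are sent to the server in a single request, so the round trips collapse (older psql versions may only display the last result set). Table names below are made up:
SELECT count(*) FROM orders \;
SELECT count(*) FROM customers \;
SELECT now();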
Yes, you can, except you have to constantly remember that column exists and adjust queries accordingly so you don’t get back deleted results or modify deleted rows. It would be nice if that was abstracted away and you simply added some other keyword to a query when you want to deal in deleted objects.
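The usual way to hide that bookkeeping today is a view over the base table, so everyday queries never see the flag; a rough sketch assuming a deleted_at timestamp column and made-up names:
-- Everyday reads go through the view and never see soft-deleted rows.
CREATE VIEW active_product AS
SELECT * FROM product
WHERE deleted_at IS NULL;
-- "Soft delete" is then just an UPDATE on the base table
-- ($1 = id of the row to soft-delete).
UPDATE product SET deleted_at = now() WHERE product_id = $1;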
I think the problem is there are too many trade offs Postgres would have to make.
- should soft-deleted rows move to another partition to avoid bloat?
- what’s the column name and type (boolean, timestamp, tstzrange)?
- what’s the syntax for a soft delete versus hard delete?
- what’s the behavior of updating a soft deleted row?
- do soft deleted keys prevent inserts with the same key?
Bitemporal tables, mentioned elsewhere in the thread, are a more general solution. Alternatively, you could use a statement-level trigger to prevent deletes on the table.
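The statement-level trigger version is short enough to sketch (table name is illustrative):
-- Raise an error before any DELETE statement touches the table.
CREATE FUNCTION forbid_delete() RETURNS trigger AS $$
BEGIN
RAISE EXCEPTION 'hard DELETE is not allowed on %', TG_TABLE_NAME;
RETURN NULL; -- not reached
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER no_hard_delete
BEFORE DELETE ON product
FOR EACH STATEMENT
EXECUTE FUNCTION forbid_delete();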
I can think of one possible solution:
- Soft deleted rows should be moved to a shadow table, so that they can have their own unique ID and timestamp fields.
- The column name and type question mostly goes away: the shadow table defines its own columns.
- Well, you'd have to extend SQL, but that's done regularly. DELETE (PERMANENTLY | AS MOVE TO SHADOW), with PERMANENTLY as the default.
- Not a good idea to update a soft deleted row, I'd prevent it with roles if possible.
- No, because the shadow table has its own primary key, and shouldn't have any unique constraints on it because of this design.
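That shadow-table flow can mostly be emulated today, which suggests a built-in version is feasible; a rough sketch with made-up names:
CREATE TABLE product_deleted
(
deleted_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, deleted_at timestamptz NOT NULL DEFAULT now()
, row_data jsonb NOT NULL -- snapshot of the deleted row
);
-- Roughly what "DELETE ... AS MOVE TO SHADOW" would do under the hood
-- ($1 = id of the row to move):
WITH gone AS (
DELETE FROM product WHERE product_id = $1 RETURNING *
)
INSERT INTO product_deleted (row_data)
SELECT to_jsonb(gone) FROM gone;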
I love PostgreSQL; it's the database I go for even for really tiny projects and data needs where SQLite would do, but one thing that annoys me is upgrading.
“I am a dummy mode” sort of exists already in Postgres; it’s called row-level security. It allows you to impose, in effect, WHERE clauses on queries to enforce multitenancy constraints.
Having said that, I do like the simplicity of being able to configure the entire database to require WHERE clauses on certain statements; it’s like a poor man’s multitenancy, without the complexity of row-level security.
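For anyone who hasn't used it, the row-level security version of that "imposed WHERE clause" is only a few lines; a minimal sketch with made-up table and setting names:
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON orders
USING (tenant_id = current_setting('app.tenant_id')::bigint);
-- The application sets the tenant per connection or transaction:
SET app.tenant_id = '42';
-- For roles subject to the policy, this now behaves as if
-- "WHERE tenant_id = 42" were appended (table owners bypass RLS
-- unless FORCE ROW LEVEL SECURITY is enabled).
SELECT * FROM orders;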
I'd like bi-directional replication just like MariaDB, without having to use the Citus extension or Patroni. Also, I'd note that while Citus is great, for existing DBs you would basically have to rewrite your schema to support Citus.
For running queries that modify data, I will start with BEGIN, run the query, maybe run some checks, then COMMIT or ROLLBACK depending on if it did what I hoped it would do.
Are there any downsides to that outside of more typing?
Ah, I see the `--i-am-a-dummy` option also affects queries with large result set sizes.
(Author here) I think the purpose of the large result limiting is to avoid swamping the terminal with output. The whole set of --i-am-a-dummy features is there to give a saving chance to the hackers who are typing away, without any forethought, into a prompt in autocommit mode. Using transactions, copy-pasting complete queries, having your coworker double-check your work, etc. give you enough foresight that you won't have a need for it.
Your comment led me to realize that I could run psql with -v AUTOCOMMIT=0 (or add that to ~/.psqlrc) to achieve most of the safety net I've been wanting. My fear has been forgetting the BEGIN.
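For reference, both forms are shown below; with autocommit off, psql issues an implicit BEGIN for you, so nothing is permanent until an explicit COMMIT:
-- One-off, from the shell:
--   psql -v AUTOCOMMIT=off mydb
-- Or as a default, by adding this line to ~/.psqlrc:
\set AUTOCOMMIT off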
Inline SQL functions so I can modularize my code without a massive performance penalty. Foreign key references to system tables so I can reliably store, e.g. function oids.
I'd like PostgreSQL to develop a vision and roadmap for emulating a graph database, bringing the SQL query world and the graph query world closer together.
a great project that shows that such a convergence is not out of the question!
It seems that some sort of graph querying will be part of the upcoming SQL:2023 standard, but details are sparse. This repo seems to have collected some relevant links [0]
> some code comes to accidentally depend on a coincidental ordering of the results
It's not really "coincidental". It's insertion ordering on-disk. And that varies between servers. So if you have 10 replicas they might persist records to disk in a slightly different order. So, if you don't specify an explicit ordering in the query then you will get records back in the order the server finds them on-disk.
It is coincidental if the insertion ordering happens to match the ordering your business logic expects.
Additionally, it’s not even insertion ordering, more like update ordering. All row updates in Postgres result in a new tuple being written, which may land near the original tuple if there’s space, or somewhere else in the table on-disk. Either way, updates will also change your row ordering.
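A quick way to see the effect (the exact ordering depends on storage details, so treat this as typical rather than guaranteed):
CREATE TABLE t (id int PRIMARY KEY, v text);
INSERT INTO t VALUES (1, 'a'), (2, 'b'), (3, 'c');
UPDATE t SET v = 'a2' WHERE id = 1;  -- writes a new tuple for id = 1
SELECT * FROM t;  -- without ORDER BY, id = 1 will often now come last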
With replicas you would expect them all to have very similar, possibly identical on-disk ordering, depending on the replication method. If you’re using plain WAL replication, not logical replication, the on-disk order is probably going to be preserved, because the basic WAL only contains the resulting storage-layer operations, not the higher-level queries. Which means that replaying the WAL elsewhere should result in identical on-disk results.
Good luck with that! Oracle have never made anything that they can't extract money from. In this case, look at their attempts to kill LogMiner (which most CDC systems, e.g. Debezium, use) and force you to pay for GoldenGate hub/microservices/cloud.
1) Simple, easy to use, high-availability features.
2) Horizontal scalability without having to resort to an extension like Citus.
3) Built-in connection pooler.
4) Query pinning.
5) Good auto-tuning capability.