If at all possible, use a union all rather than a plain union to avoid an extra sort / unique node.
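Roughly, the rewrite being discussed looks like this (a sketch with made-up table and column names; the IS DISTINCT FROM guard keeps the branches disjoint so the UNION ALL stays equivalent to the original OR):

    -- Original: the planner may not combine the two indexes well
    SELECT * FROM orders WHERE customer_id = 42 OR salesperson_id = 7;

    -- Rewrite: each branch can use its own index, and UNION ALL skips the
    -- sort/unique step that a plain UNION would add for de-duplication
    SELECT * FROM orders WHERE customer_id = 42
    UNION ALL
    SELECT * FROM orders WHERE salesperson_id = 7
      AND customer_id IS DISTINCT FROM 42;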
I've used that OR > UNION ALL trick a number of times to vastly improve performance on specific queries. It's crazy how much of an effect it can have.
I wish Postgres would implement a planner optimization to automatically run queries with an OR more efficiently (e.g. use the same plan as with a UNION (ALL) where possible).
I have a vague memory of this exact optimisation being one of the things Hibernate did automagically, about 10 years ago. I remember doing a fair bit of reading to figure out how & why it was so much faster.
I was able to implement this in the ORM layer of Django through a function that would inspect the query and split it if it had an OR under certain conditions.
I remember reading a thread on the hackers mailing list about this, starting probably 6 or 7 years ago... Nothing ever came of it, if I remember right.
While ORs can be performance killers, in other cases they work perfectly fine, sometimes even better than the equivalent UNION ALL query. In this case the problem is that they are applying a LIMIT with ORDER BY, so Postgres can't do the usual BitmapOr optimization, and the index ends up being used purely for sorting purposes.
+1. If you really understand your queries (which, if your queries are complicated enough that you don't already use an ORM, you should), then my favorite set of "tricks" is: CTEs to collect info and convert/translate as necessary, then unions to merge/filter said data from the CTEs. Add aggregate/window functions to this data, and you are good to go.
You can really do some complicated pipelining of data cleaning and filtering this way, much faster than resorting to a whole bunch of complicated inline SQL work in the clauses.
[edit] I should add, learn how to read an explain plan, this is vital for understanding and improving your queries.
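A minimal sketch of that pipelining pattern, with entirely hypothetical table and column names:

    -- Stage 1: collect and normalize in CTEs
    WITH active_users AS (
        SELECT id, email FROM users WHERE deleted_at IS NULL
    ),
    recent_orders AS (
        SELECT user_id, total
        FROM orders
        WHERE created_at > now() - interval '30 days'
    ),
    -- Stage 2: merge/filter with a union
    merged AS (
        SELECT u.id, o.total
        FROM active_users u JOIN recent_orders o ON o.user_id = u.id
        UNION ALL
        SELECT u.id, 0
        FROM active_users u
        WHERE NOT EXISTS (SELECT 1 FROM recent_orders o WHERE o.user_id = u.id)
    )
    -- Stage 3: aggregate and add a window function over the cleaned data
    SELECT id,
           sum(total) AS spend,
           rank() OVER (ORDER BY sum(total) DESC) AS spend_rank
    FROM merged
    GROUP BY id;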
This seems like the type of thing that a sophisticated query planner in 2024 would be able to figure out on its own? Maybe as a non-database expert I'm expecting too much?
There were attempts to do this automatically ~7 years ago (https://www.postgresql.org/message-id/flat/7f70bd5a-5d16-e05...). It didn't actually make it in due to worries about the approach and properly de-duplicating the final resultset, as well as that approach to the optimization not playing well with duplicate inner-join removal optimizations (which were also being discussed at the time).
FTA: “Why the query planner doesn’t automatically convert a condition like x > a OR (x == a AND y > b) to (x, y) > (a, b) is something I still don’t understand to this day, though.”
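Concretely, for readers who haven't seen row-value comparison: the two forms below are equivalent (Postgres compares row values lexicographically), but only the second maps cleanly onto a composite (CreateAt, Id) index. The ORDER BY and LIMIT here are illustrative:

    -- OR form, as in the article's query
    SELECT * FROM posts
    WHERE createat > $1 OR (createat = $1 AND id > $2)
    ORDER BY createat ASC, id ASC
    LIMIT 100;

    -- Row-value form the quoted remark is asking about
    SELECT * FROM posts
    WHERE (createat, id) > ($1, $2)
    ORDER BY createat ASC, id ASC
    LIMIT 100;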
“The order of evaluation of subexpressions is not defined. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order.
Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all. For instance, if one wrote:
SELECT true OR somefunc();
then somefunc() would (probably) not be called at all. The same would be the case if one wrote:
SELECT somefunc() OR true;
Note that this is not the same as the left-to-right “short-circuiting” of Boolean operators that is found in some programming languages.”
Agree it's well-known, in a slightly snobbish sort of way, but it could do with being more widely known for sure; it is a bit of a gotcha. IMO TFA is quite a nice tutorial on how they went about it, investigating the issue and determining the problem etc., perhaps especially for readers who don't already find the answer you wrote somewhat familiar; the explain-analyze, err, explainer is quite nice. Not reference material, but as an intro: hey, this is how you might be able to debug a similar problem you encounter.
The article felt like it was fumbling around when the initial explain pointed right at the issue. I didn’t know this specific trick, so I did learn something I guess.
That looks like a trivial optimization / transformation. I am still quite puzzled: so much work has gone into compilers, static code analysis, and profile-guided compilation (which comes naturally with JITs, too), yet the query optimizers of pretty much any database look rudimentary and unreliable. Sometimes I think they got stuck in the '90s and that's it.
That is a bit wrong, and I'm pointing it out because your summary logically didn't make sense to me, so I did have to read the blog post after reading your comment.

The query should be CreateAt > $1 OR (CreateAt = $1 AND Id > $2); note the $1 in the equality, not $2 as in your sample.
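In full, the query shape under discussion is roughly this (paraphrasing the blog post; the LIMIT is illustrative):

    SELECT Posts.*
    FROM Posts
    WHERE Posts.CreateAt > $1
       OR (Posts.CreateAt = $1 AND Posts.Id > $2)
    ORDER BY Posts.CreateAt ASC, Posts.Id ASC
    LIMIT 100;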
So the idea here is about paginating through posts that might be constantly changing, so you can't use a simple offset, as that would give you duplicates along the way. So you use CreateAt, but since two rows could have the same CreateAt, you fall back to Id.

But here I stopped reading the blog post, because I now wonder: why not use Id in the first place? It seems to be auto-increment as well, since otherwise you couldn't really rely on it as a fallback like this. I don't have time to investigate further, but tl;dr: that still left me confused, why not use Id in the first place?
It seems like the code is used both for indexing from scratch and for maintaining an existing index. In the case of maintaining an existing index you need to know where to start back up and Postgres doesn't guarantee that SERIAL primary keys are always in chronological order when there are multiple simultaneous sessions.
But if you are doing indexing, why do you necessarily care that it's specifically chronological?

Besides that, if there are odds of a CreateAt collision and you are falling back to Id, you are still possibly not getting the rows chronologically.

And if CreateAt does happen to equal another record's, that is exactly the case where Postgres is most likely not to have the auto-increment IDs in chronological order.

So the edge case this tries to prevent still seems like it would happen at a similar order of magnitude of odds.
> But if you are doing indexing, why do you necessarily care that it's specifically chronological?
edit: on second review, if live insertions were occurring then this code would have an edge case; however, the indexing job has an end time, presumably chosen so that they can be sure no more inserts will occur. Given that, the choice to use a timestamp probably has to do with the fact that there are 4 different tables being indexed, and you would otherwise have to track their IDs individually.
ID only: if we use ID > 4 as our start point for next time and ID 3 was not inserted yet, then we will have missed it.

createdAt only: if we use createdAt > 1002, then we will skip ID 5 in the next batch.

OP's strategy: even if we use createdAt > 1002 and it skips ID 5, it will be caught by createdAt = 1002 AND ID > 4. The ORDER BY createdAt ASC, id ASC guarantees that if the limit splits 4 and 5, we see 4 first and thus don't miss 5. I think this does still miss the case where ID 4 is inserted after ID 5 however.
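A hypothetical reconstruction of that scenario, for anyone who wants to trace it (the ids and timestamps are assumed from the description above):

    -- id 3 (createdAt 1001) is still uncommitted; 4 and 5 share a timestamp
    CREATE TABLE posts (id int, createdat bigint);
    INSERT INTO posts VALUES (4, 1002), (5, 1002);

    -- Batch N ends at (createdAt 1002, id 4) because of the LIMIT
    SELECT * FROM posts
    WHERE createdat > 1001 OR (createdat = 1001 AND id > 0)
    ORDER BY createdat ASC, id ASC
    LIMIT 1;                         -- returns (4, 1002)

    -- Batch N+1 resumes from (1002, 4) and still picks up id 5
    SELECT * FROM posts
    WHERE createdat > 1002 OR (createdat = 1002 AND id > 4)
    ORDER BY createdat ASC, id ASC;  -- returns (5, 1002)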
> I think this does still miss the case where ID 4 is inserted after ID 5 however.
Yeah, and I would think that is very likely to happen given it would be the same timestamp.
So the whole thing still seems like a flawed and unnecessarily complex solution to me; it should just use one simple unique sortable field for all of it.

My intuitive guess is that the solution prevents maybe 20%-40% of occurrences of the same edge case, which doesn't seem like a good solution. It is not going to solve the problem; it's just adding complexity that can cause other problems.
So if Postgres does the type of caching where it allocates 10 auto-increment IDs to each process, which sometimes causes IDs to be written out of order, then it would normally be enough to just wait until those allocations have been used up before indexing; then you are not going to miss any rows.

I would assume these processes have some form of timeout for the case where they couldn't assign one of those IDs, and then that ID would just never exist (or, if there were a mechanism to reallocate it, that would work too). Nonetheless, I think some form of Postgres sortable unique ID would have to work by itself.
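For reference, the per-session ID caching described above corresponds to a real sequence option; a sketch:

    -- With CACHE 10, each session preallocates a block of 10 values and
    -- consumes it privately, so committed IDs can appear out of
    -- chronological order across sessions.
    CREATE SEQUENCE posts_id_seq CACHE 10;

(Even with the default CACHE 1, IDs can still commit out of order, since nextval() hands out the value before the transaction commits.)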
Ha, that's a good point. Devil's advocate though (and perhaps they changed it slightly for the example in the blog post): it could be that 'CreateAt' is more like 'PublishedAt', i.e. it doesn't include a possible draft period. But then Id (which does correspond to the actual record's initial creation time, obviously) is as good as anything to disambiguate them, because the tie-breaker is actually arbitrary; it only needs to be consistent.
This is supposed to be a solution to a rare edge case though, and it seems to me that there would still be some rare cases where the timestamp order mismatches the ID order, and so it would still cause the very edge case it was supposed to avoid?
Because it's ">" it might be missing that one record during what I think is pagination.
[edit] The CreatedAt timestamp could be something set by the client when the post is submitted, or tagged by the ingest server, and not the time the posts are actually processed and hit the database.
It depends on the resolution of the timestamp. With 1000s of posts a second coming in, the fallback case is that two rows land at the same "timestamp", and then we fall back to the greater identifier. My guess is that the timestamp allows for more efficient Postgres index usage than the Id index, which I'd guess is larger and a string, so not as efficiently searchable as the underlying identifier.

(My guess is a time-based index offers faster search performance or lower overhead than a string-based index, which doesn't understand that it is representing encoded time data.)
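(For what it's worth, the thing that lets the keyset query avoid a separate sort entirely, whatever the column types, is a composite index covering both sort keys; something like, with assumed names:

    CREATE INDEX idx_posts_createat_id ON posts (createat, id);

With that in place, the planner can walk the index in ORDER BY order and stop at the LIMIT.)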
I'm only speculating here, but it doesn't seem right to me at all to end up with a solution like that. I don't think the Id would perform so badly as an index that it would be worth what I consider odd reasoning and tech debt, if I'm understanding things correctly. And clearly the query is tech debt, because it did cause exactly this issue. The current solution also seems like tech debt to me, with high odds of causing issues in the future if the DB driver were changed, or something about the way the DB resolves indexes were to change, etc.
And with fallback they would end up reusing that index anyway.
This could also come up easily if generating queries to back a paginated UI, where the user can choose a sort order by user-meaningful metadata. So, the user might ask to sort by CreatedAt and then the app/service would need to add the Id tie-breaker.
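Something like this hypothetical generated SQL, where the user picked a non-unique sort column and id breaks ties:

    -- $1/$2 = the last (title, id) pair seen on the previous page
    SELECT id, title
    FROM documents
    WHERE (title, id) > ($1, $2)
    ORDER BY title ASC, id ASC
    LIMIT 25;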
That is possible, although I would say there are other potential challenges there in this case, and arguably there should be a specific index for that case, and a lot would depend on what exactly is being sorted, how frequently new rows are being inserted, etc.
But here the case should be batch indexing/processing, so it seems like auto-increment, with a timeout on assignment in case those auto-increment ranges are cached, would still be suitable.

As I understand the problem, there is one service (Elasticsearch) doing the indexing: it gets batched rows from Postgres to index, while making sure not to miss any rows across batches. And it's fine if the indexing doesn't happen this very second or minute, so it should be fine to wait for the IDs to have been allocated.
Not at all surprised that Laurenz Albe played a key role in figuring this out -- I've learned so much about postgres from his Stack Overflow answers over the years.
I think it's a reasonable implementation. It is only used to separate rows with the same created timestamp. This query is run some time after the insert has happened, so it is unlikely that there have been more rows with the same timestamp at query time.
If this assumption is correct, then sorting first by the timestamp and then by the id alphabetically will ensure that the pagination is deterministic.
I guess we could consider some edge cases where lots of rows with the same timestamp are inserted after the ingestion query is run, due to latency, but that might be acceptable for this use case.
One alternative could be to add a separate column that can be used to keep track of insert order. But they would need to consider the costs before they do this, as it could impact the insert performance.
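One hedged sketch of that alternative (assuming Postgres 10+; note the value still comes from a sequence, so strictly monotonic commit order across sessions is still not guaranteed):

    -- Record insert order explicitly with an identity column
    ALTER TABLE posts
        ADD COLUMN insert_seq bigint GENERATED ALWAYS AS IDENTITY;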
The true "modern" cool kids solution would of course be to create a service that listens to WAL, inserts it into a Kafka cluster that is connected to a pipe for inserting into Elastic search. Much more fun and resumé friendly than optimizing a query. I bet the author of this blog would get a much larger audience.
Bonus points if the ingestion service was written in Rust and run on a serverless platform
I generate queries like this in a paging library I wrote at work. I did not use the tuple comparison because sometimes the sort order differs between fields, and each field in the tuple comparison would need a different operator, such as:

WHERE Posts.CreateAt > ?1 OR (Posts.CreateAt = ?1 AND Posts.Id < ?2) ORDER BY Posts.CreateAt ASC, Posts.Id DESC
I now wonder if it might be possible to use a negation operator to invert the sort order of one of the fields (numeric ones, at least), like:

(Posts.CreateAt, -Posts.Id) > (?1, -?2) ORDER BY Posts.CreateAt ASC, Posts.Id DESC
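Negating a numeric column reverses its ordering, so the comparison above is equivalent to Posts.CreateAt > ?1 OR (Posts.CreateAt = ?1 AND Posts.Id < ?2), matching the query in the parent comment. Getting index support for it is another matter; a sketch of what that would likely take (assuming a numeric Id, which may not hold in the real schema):

    -- Expression index matching the negated tie-breaker
    CREATE INDEX idx_posts_createat_negid ON posts (createat, (-id));

    -- Or define the tie-breaker descending in the index and keep the OR form
    CREATE INDEX idx_posts_createat_id_desc ON posts (createat ASC, id DESC);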
Hunting and solving bugs is good, but there's a better alternative: having pipelines that run performance regression tests before commits are merged into the master branch.
I don't have a Postgres instance at hand; would adding an AND condition, as in

WHERE CreateAt >= $1 AND (CreateAt > $1 OR (CreateAt = $1 AND Id > $2))

work around the query planner limitations as well?
[edit: holy downvotes folk, the referenced 'olliej' user is me - I'm not shitting on others]
Based on the last time I did anything involving databases, I think the easy way to do this is "find a query written by someone like olliej, and then fix the clear and obviously stupid choices" :D
Sure, but if you are going to be self-deprecating, at least provide some example content of the hilariously bad queries you have written and how you have been chastised for them.
A UNION allows you to use two separate indexes and can speed up queries.