Seems strange the article doesn't mention the `= ANY(array)`[0] alternative to `IN (?)`. The problem with `IN` is that PostgreSQL uses a slow parser for the subquery, since `IN` is an SQL standard, and a legacy parser is used. You can see in the EXPLAIN, that PostgreSQL converts the `IN (...)` into `= ANY(array)`.
The problem is that it is slower to parse an `IN` query than an `= ANY(array)` query. Because the syntax for an array in PostgreSQL is simple, PostgreSQL can use a faster parser for the array. However, with `IN`, it cannot know the type of the next element in the list. Since `IN (a, 2, (select id from table), null)` is valid SQL, it needs a parser that permits all those scenarios. But with an `{1,2,3}::integer` array, it knows it can only have one layout.
Often people do not discuss the time it takes for the database to parse the SQL query. If you run an ORM, with a one to many relationship between two tables, which are modelled in an application layer, you may wish to preload the children onto the main object. Suppose you are loading thousands, if not millions of objects, and you wish to preload all of their children, your ORM might simply do this by running two SELECTs, first the main object's table, then children's table with the parent IDs of the main objects included in the query.
When using the `IN` syntax, parsing this will be very slow compared to the `= ANY(array)` syntax. I remember running some tests to compare them, and once you get to 100'000 values, `IN` becomes significantly slower compared to `= ANY(array)`. I cannot recall whether it was 100'000 or 1'000'000 values, but at one of them, PostgreSQL just never finished the job with `IN`, where `= ANY (array)` took a couple of seconds.
I'd argue that if you have an IN() with more than a dozen values, you're doing it wrong. Especially so if that's something that your app does often and it needs to be fast (as opposed to one-of analytic queries where you don't care about performance).
If your data is intrinsically shaped like 1:100, what are you supposed to do - issue 10x batches of fetches with 10 elements? Give up on a relational database? Start designing custom sharding schemes to avoid doing the "wrong" thing that might take a few ms longer?
And I'll have you know that we do actually care about the performance of analytic queries. The difference in performance might be minutes vs hours, or hours vs days. Just because it's not as quick as point lookup doesn't mean we are completely time insensitive.
If you're issuing a statement that includes an 'IN' statement with thousands of values, where are those values coming from, how are you getting them? In a relational database, a so large list of values generally indicates that they aren't arbitrary but would come from some table (perhaps with some condition), so the selection would be implemented with a table join - if you're fetching a huge list of IDs from the database and then passing them back as part of a query, that doesn't seem right.
In the grandparent comment, "Suppose you are loading thousands, if not millions of objects, and you wish to preload all of their children, your ORM might simply do this by running two SELECTs ..." - well if your ORM is doing something stupid, then you either fix the ORM or work around the ORM to get a reasonable query that will not uselessly send these millions of object IDs back to the database - what a waste of IO.
I was very confused about your comment (because the obvious answer is to use a JOIN), but then I realized you're talking about using a sharded database. To which my response is: don't. Just buy more RAM.
I realize that this doesn't scale indefinitely, but for the 99% of us who just need to manage a few billion rows, it's the right answer.
The other place where `ANY(array)` is useful is when interfacing with Postgres from code, where you will quickly find that building a `IN (?)` query from an array is disgusting, if not impossible.
[0] https://www.postgresql.org/docs/current/functions-comparison...