How is that an ugly hack? That looks like a reasonable solution to a reasonable problem. Counting millions of things under arbitrary user-defined filters is a genuinely hard problem.
Did I misunderstand the approach, or is it somewhat risky? It uses escaping instead of bind parameters to build the query being EXPLAINed, which potentially opens it up to SQL injection.
There's a mention of injection considerations in both the article and code snippet. The article is definitely a sketch of a cute approach and doesn't claim to be a full production solution.
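For anyone curious, here's roughly what the estimate trick looks like with bind parameters instead of hand-rolled escaping. This is only a sketch: the events table and its columns are made up, and it assumes psycopg2, which adapts parameters client-side, so EXPLAIN still sees an ordinary, fully-literal statement.

    import json

    import psycopg2


    def estimated_count(conn, status, created_after):
        # EXPLAIN (FORMAT JSON) returns the planner's row estimate without
        # actually running the query. The filter values go through psycopg2's
        # normal parameter adaptation rather than manual string escaping.
        query = """
            EXPLAIN (FORMAT JSON)
            SELECT 1 FROM events
            WHERE status = %s AND created_at > %s
        """
        with conn.cursor() as cur:
            cur.execute(query, (status, created_after))
            raw = cur.fetchone()[0]
            # Depending on driver/version the plan arrives as text or as
            # already-decoded JSON; handle both.
            plan = raw if isinstance(raw, list) else json.loads(raw)
            return int(plan[0]["Plan"]["Plan Rows"])


    conn = psycopg2.connect("dbname=app")
    print(estimated_count(conn, "active", "2024-01-01"))

Whether the estimate is any good is a separate question (see the sibling comments about statistics), but at least the injection surface goes away.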
Indeed, this relies on the optimizer’s cardinality estimates, which come from previously collected statistics. Those estimates are notoriously inaccurate across all DBMSs for even marginally complex queries[1].
They’re obviously better than nothing when making planning decisions—correctly estimating A > B is useful enough even if both are off by large factors—and maybe for this application it’s effective enough.
For the DBMS I work on, an alternative solution would involve bitmap indexes for each possible filter field. These are compact and can be logically combined and counted at rates of billions/second via SIMD, even without HLL or other estimators.
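To make the idea concrete, here's a toy version of that scheme: one bitmap per (column, value) pair, filters combined with bitwise AND, and the count taken with a popcount. This isn't any particular DBMS's implementation; real engines use compressed bitmaps and SIMD popcount, and Python's big ints with int.bit_count() (3.10+) stand in for them here.

    from collections import defaultdict


    class BitmapIndex:
        def __init__(self):
            # (column, value) -> bitmap stored as a big int; bit i means row i matches
            self.bitmaps = defaultdict(int)

        def add_row(self, row_id, row):
            for column, value in row.items():
                self.bitmaps[(column, value)] |= 1 << row_id

        def count(self, **filters):
            # Count rows matching every column=value filter (bitwise AND of bitmaps).
            result = None
            for column, value in filters.items():
                bm = self.bitmaps.get((column, value), 0)
                result = bm if result is None else result & bm
            return 0 if result is None else result.bit_count()


    idx = BitmapIndex()
    idx.add_row(0, {"status": "active", "plan": "pro"})
    idx.add_row(1, {"status": "active", "plan": "free"})
    idx.add_row(2, {"status": "closed", "plan": "pro"})
    print(idx.count(status="active", plan="pro"))  # 1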
I've hacked together something nearly equivalent to roaring bitmaps with varbits and a bit of extra metadata. Caveat emptor, but it was good enough for the problem at hand.
AFAICT bitmaps as a persistent index structure aren’t supported. They’re only used in-memory at query runtime for combining results from multiple index scans.
Even with an index, counting the number of rows is an O(N) operation in most DBs.
Why?
1. DBs typically store data in some flavor of b-tree. Determining the number of elements in a sub-range of a b-tree (or any "standard" search tree) is O(N).
2. You can improve that to O(log N) if each b-tree node stores the size of its subtree (there's a sketch at the end of this comment). But this also increases the cost of add/remove operations.
3. And things get more complicated for DBs that support concurrent transactions, since the count differs per transaction depending on when it began running. You'd need something richer than a single size-of-subtree integer per node.
It's all doable, but isn't trivial and increases the cost of all add/remove operations.
There have been times where I really wished it existed, though... :-)
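To make point 2 concrete, here's a toy counted tree: each node tracks its subtree size, so a range count walks two root-to-leaf paths instead of scanning. It's an unbalanced BST rather than a b-tree, and it ignores the MVCC problem from point 3 entirely, so treat it as a sketch of the idea, not of what a database would actually do.

    class Node:
        __slots__ = ("key", "left", "right", "size")

        def __init__(self, key):
            self.key, self.left, self.right, self.size = key, None, None, 1


    class CountedTree:
        def __init__(self):
            self.root = None

        def insert(self, key):
            def _ins(node, key):
                if node is None:
                    return Node(key)
                if key < node.key:
                    node.left = _ins(node.left, key)
                else:
                    node.right = _ins(node.right, key)
                node.size += 1  # every node on the insert path pays for the extra key
                return node
            self.root = _ins(self.root, key)

        def count_less(self, key):
            # Number of keys strictly less than `key`, one root-to-leaf walk: O(height).
            n, total = self.root, 0
            while n is not None:
                if key <= n.key:
                    n = n.left
                else:
                    total += 1 + (n.left.size if n.left else 0)
                    n = n.right
            return total

        def count_range(self, lo, hi):
            # Number of keys in [lo, hi).
            return self.count_less(hi) - self.count_less(lo)


    t = CountedTree()
    for k in [5, 1, 9, 3, 7, 7, 2]:
        t.insert(k)
    print(t.count_range(2, 8))  # 5 -> keys 2, 3, 5, 7, 7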
If I understand correctly, it is because they want to support arbitrary queries and so a single index would only help for certain queries but not others.
Indexes and compound indexes tend to become slow themselves once you're indexing a significant fraction of the columns, or when the result set you're counting is a large portion of the table. They cost extra time and space to maintain on every write. If your database has concurrent writes going on, Postgres still has to check which row versions are visible to the reading transaction. The extra pressure on memory and disk means a bigger working set, and the query planner has a combinatorial number of indexes to choose from, so the planning stage slows down.

You also have to spend more time on ANALYZE to keep the estimates (which power this technique) accurate, and so on. At tens of millions of rows with more than a few low-cardinality columns in the filter, it's sometimes faster to just get as much of the table as you can into RAM and accept the scan.
I have nothing but respect for the author, but if workarounds like this are needed, maybe Postgres needs to invest some work here?