Fast Counting with PostgreSQL and Haskell (jezenthomas.com)
42 points by EntICOnc on Dec 26, 2021 | 17 comments



This is the ugliest hack I’ve seen since I wrote a PHP script that would output a Perl script that would create an Excel file.

I have nothing but respect for the author, but if this is needed, maybe some work needs to be invested in Postgres itself?


How is that an ugly hack? That looks like a reasonable solution to a reasonable problem. Counting millions of things with user-defined filters looks like a hard problem.


Did I misunderstand the approach, or is it somewhat risky in that it uses escaping instead of bind parameters to build the query to be explained, potentially opening itself up to SQL injection?


There's a mention of injection considerations in both the article and code snippet. The article is definitely a sketch of a cute approach and doesn't claim to be a full production solution.


> Some clever abuse of the query planner output quickly gives us an estimate.

Tl;dr: they estimate counts by parsing the planner’s row estimate out of the query plan, but run an exact count if the estimate is under a threshold.

Clever! But it makes me nervous. I was expecting it to involve HLL or CountMinSketch or something.
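
For anyone curious what that looks like in practice, here is roughly the well-known count_estimate function from the PostgreSQL wiki; the article pairs the same idea with Haskell, so treat this as a sketch rather than its exact code:

    -- Run EXPLAIN on the query text and pull the planner's row estimate
    -- out of the first plan line that contains one.
    CREATE FUNCTION count_estimate(query text) RETURNS integer AS $$
    DECLARE
      rec  record;
      rows integer;
    BEGIN
      FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
      END LOOP;
      RETURN rows;
    END;
    $$ LANGUAGE plpgsql VOLATILE STRICT;

    -- SELECT count_estimate('SELECT 1 FROM items WHERE ...');
    -- then only run the real count(*) when the estimate is under the threshold.

Note the string concatenation into EXECUTE: that is exactly the injection concern raised elsewhere in this thread.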


Indeed, this relies on the optimizer’s cardinality estimates, which come from previously collected statistics. These are notoriously inaccurate across all DBMSs for even marginally complex queries [1].

They’re obviously better than nothing when making planning decisions—correctly estimating A > B is useful enough even if both are off by large factors—and maybe for this application it’s effective enough.

For the DBMS I work on, an alternative solution would involve bitmap indexes for each possible filter field. These are compact and can be logically combined and counted at rates of billions/second via SIMD, even without HLL or other estimators.

[1] https://www.vldb.org/pvldb/vol9/p204-leis.pdf


Curious how you'd do bitmap indexes at a user level in postgres? An extension like pg_roaringbitmap [1] or am I missing something native?

[1] https://pgxn.org/dist/pg_roaringbitmap/


I've hacked together something nearly equivalent to roaring bitmaps with varbits and a bit of extra metadata. Caveat emptor, but it was good enough for the problem at hand.


AFAICT bitmaps as a persistent index structure aren’t supported. They’re only used in-memory at query runtime for combining results from multiple index scans.


Another option is TABLESAMPLE (though of course the usual probability and statistics caveats apply).
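
E.g. something like this (table and filter invented; SYSTEM sampling is block-based, so clustering and skew will bias the scaled count):

    -- Read roughly 1% of the table's pages and scale the count back up.
    SELECT count(*) * 100 AS approx_count
    FROM items TABLESAMPLE SYSTEM (1)
    WHERE status = 'active';

BERNOULLI sampling is less biased but still reads every page, so it mostly helps when the per-row work is the expensive part.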


This would be a great use case for a (secondary) columnar database like Elasticsearch, ClickHouse, or QuestDB.


Why can't he just use an index?


Even with an index, counting the number of rows is an O(N) operation in most DBs.

Why?

1. DBs typically store data in some flavor of b-tree. Determining the number of elements in a sub-range of a b-tree (or any "standard" search tree) is O(N).

2. You can improve that to O(log N) if each b-tree node stores the size of its subtree. But this also increases the cost of add/remove operations.

3. And things get more complicated for DBs that support concurrent transactions, since the count is different for each transaction, depending on when it began running. You'd need something more complicated than just a size-of-subtree integer per node.

It's all doable, but isn't trivial and increases the cost of all add/remove operations.

There have been times where I really wished it existed, though... :-)
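
For what it’s worth, the closest user-level approximation is a trigger-maintained counter table, which makes that add/remove cost concrete (a rough sketch; table and column names are invented, and UPDATEs of the counted column aren’t handled):

    CREATE TABLE item_counts (status text PRIMARY KEY, n bigint NOT NULL);

    CREATE FUNCTION bump_item_count() RETURNS trigger AS $$
    BEGIN
      IF TG_OP = 'INSERT' THEN
        INSERT INTO item_counts (status, n) VALUES (NEW.status, 1)
        ON CONFLICT (status) DO UPDATE SET n = item_counts.n + 1;
      ELSIF TG_OP = 'DELETE' THEN
        UPDATE item_counts SET n = n - 1 WHERE status = OLD.status;
      END IF;
      RETURN NULL;  -- return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER item_count_trg
    AFTER INSERT OR DELETE ON items
    FOR EACH ROW EXECUTE FUNCTION bump_item_count();  -- EXECUTE PROCEDURE on PG 10 and older

It stays transactionally consistent (point 3), but every write now touches a hot counter row, which is exactly the add/remove cost mentioned above, and it only works for predefined groupings rather than arbitrary filters.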


If I understand correctly, it is because they want to support arbitrary queries and so a single index would only help for certain queries but not others.


Given that the users can apply arbitrary filters, what exactly would you index?


Surely you could use compound indexing; depending on the size of the set of fields to be indexed, it should be achievable.
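
For example (table and columns invented), a multicolumn index can make the count cheap for a known filter combination:

    CREATE INDEX items_status_country_idx ON items (status, country);

    -- Can be satisfied by an index(-only) scan, but only for filters
    -- that match the leading index column(s).
    SELECT count(*)
    FROM items
    WHERE status = 'active' AND country = 'SE';

Each extra combination you want to cover this way is another index to maintain on every write.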


Indexes and compound indexes tend to become slow themselves when you end up indexing a significant fraction of the columns, or when the result set you’re counting is a significant portion of the table. They cost extra time and space to maintain on every write. If the database has concurrent writes going on, Postgres still needs to verify which row versions are visible to the reading transaction. The additional pressure on memory and disk means a bigger working set. The query planner has a combinatorial number of indexes to choose from, so the planning stage slows down. And you have to spend even more time on ANALYZE to keep the estimates (which power this technique) correct, and so on. At tens of millions of rows with more than a few low-cardinality columns in the filter, it’s sometimes faster to just get as much of the table as you can into RAM and accept the scan.



