
Postgres Count Performance - rdegges
https://www.citusdata.com/blog/2016/10/12/count-performance?source=pg_weekly
======
rowanseymour
Even if counts could be made faster, at scale you'd probably still want to
avoid counting anything that can be pre-calculated.

We use something similar to the trigger-based method they describe, though
we've found that heavy updates to the count table inevitably end in deadlocks.
So instead of updating a count value, we always insert a new count of 1 or -1,
and sum those to calculate the total count as needed. A background task is
responsible for continually squashing the count values.
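
A minimal sketch of that delta-row idea in plain SQL - the table and column
names here are made up, not our actual schema:

    CREATE TABLE thing_counts (
        thing_id bigint  NOT NULL,
        delta    integer NOT NULL   -- usually +1 or -1
    );

    -- On insert/delete of a counted row, append a delta instead of updating
    -- one hot counter row:
    INSERT INTO thing_counts (thing_id, delta) VALUES (42, 1);

    -- Reading the count sums the deltas:
    SELECT COALESCE(sum(delta), 0) AS total
    FROM thing_counts
    WHERE thing_id = 42;

    -- The background "squash" task collapses the deltas into a single row:
    WITH squashed AS (
        DELETE FROM thing_counts
        WHERE thing_id = 42
        RETURNING delta
    )
    INSERT INTO thing_counts (thing_id, delta)
    SELECT 42, COALESCE(sum(delta), 0)
    FROM squashed;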

~~~
oelmekki
Great solution, thanks for sharing.

At this point, I wonder: with a huge dataset, wouldn't you be better off
leaving the count field out of postgres altogether and using something like
redis with its INCR/DECR instructions instead? This would prevent deadlocks as
well.

EDIT: that is, if you don't need to use the count field in other queries.

~~~
rowanseymour
We've found that Redis works great for counting some things, but when you need
a count which is guaranteed to match what is in a db table, it's preferable to
manage counts at a db level. Often we're making counts based on the state of a
record, and it's easy for the application level view of things to end up
stale.

------
pgaddict
A few comments regarding the count_distinct extension (author here):

> Note that custom extensions written in C like count_distinct are not bound
> by the value of work_mem. The array constructed in this extension can exceed
> your memory expectations.

This is slightly inaccurate, as this is not specific to custom aggregates.
Custom aggregates are estimated just like any other (built-in) aggregates -
number of expected groups times memory per group. The trouble is that (a) for
aggregates with variable-length per-group state, we don't have a good size
estimate, and (b) if the planner decides to use HashAggregate, we're unable to
do anything when reaching work_mem. But this has nothing to do with the
aggregate being custom - array_agg() and string_agg() have the same issue, for
example.
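
To illustrate with a built-in aggregate (a hand-wavy sketch, with a made-up
table): the planner costs a hash aggregate as estimated groups times estimated
per-group state, and once HashAggregate is chosen there is no fallback when the
real state outgrows work_mem.

    SET work_mem = '256MB';

    EXPLAIN
    SELECT customer_id, array_agg(item_id)
    FROM orders
    GROUP BY customer_id;

    -- If the (groups x per-group state) estimate fits in work_mem the plan
    -- shows HashAggregate, yet the arrays actually built at run time can grow
    -- far past work_mem, because HashAggregate cannot spill to disk.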

FWIW, the extension was written quite a long time ago - before the various
sort optimizations made by Peter Geoghegan. I wonder whether that made
count_distinct obsolete.

~~~
life_is_short
> for aggregates with variable-length per-group state, we don't have a good
> size estimate

Can you say a bit more about why?

~~~
pgaddict
No one implemented a better solution ;-)

Jokes aside, it seems simple but is fairly tricky, as it depends both on the
input data and various other parameters. For example, for array_agg() or
string_agg() it might be estimated from the number of entries and their average
length. For hll it also depends on the accuracy and expected number of values
to track, etc. I was considering adding another method to the API to provide a
better estimate, but never got to that.

So the current code simply assumes 1KB (IIRC) per group in those cases, or
something like that.

Of course, the memory estimate also depends on the number of groups, but
that's an orthogonal issue.

------
JoelJacobson
No mention in the article of the new parallel query support in 9.6 that can
speed up COUNT(*)?

https://www.postgresql.org/docs/current/static/parallel-plans.html
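
Roughly what that looks like on 9.6 (table name is just an example):

    SET max_parallel_workers_per_gather = 4;

    EXPLAIN SELECT count(*) FROM items;
    -- A parallel plan appears as:
    --   Finalize Aggregate
    --     -> Gather
    --          -> Partial Aggregate
    --               -> Parallel Seq Scan on items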

~~~
xomateix
We were doing some benchmarks recently on different databases. Most of the
queries were heavy aggregations (>1 billion rows) and, to be honest, we were a
bit disappointed with the new parallel query support in pg; we were expecting
much better performance.

While doing the benchmarks, we could see that citus was always taking full
advantage of all the cores in the cluster, while postgres parallelization was
not.

Disclaimer: I'm not a db expert and I don't have any relationship with either
citus or postgres.

------
orf
> On the other hand count(1) takes an argument and PostgreSQL has to check at
> every row to see that its argument, 1, is indeed still not NULL.

I expect that's the same for all function calls like this? Surely pg has a
concept of constants and doesn't needlessly re-check parameters?

~~~
DonPellegrino
It checks that the first column of each row is not null, that's what the '1'
stands for.

------
oelmekki
Super in-depth analysis, thanks to the author for writing it.

Past the recommendation of counting based on an indexed column, I wonder if
this should really be the user's concern. This paragraph especially triggers a
"this should be fixed upstream" feeling in me:

> A word of warning. When work_mem is high enough to hold the whole relation
> PostgreSQL will choose HashAggregate even when an index exists.
> Paradoxically, giving the database more memory resources can lead to a worse
> plan. You can force the index-only scan by setting SET enable_hashagg=false;
> but remember to set it true again afterward or other query plans will get
> messed up.

But worst case scenario, this article will be useful until this is fixed, so
thanks again :)

~~~
barrkel
If you think "needs upstream fix", then no database is suitable for use,
because they all choose poor plans in many edge cases. This is why most
databases have extensions that let you hint or force index use (not Postgres,
however; a mistake with mitigation), or support parenthesising your joins to
force an evaluation order (this is an indirect way of forcing index use or
join order, a mitigation), etc.

Query planning is something where a poor choice can have serious performance
ramifications, because n is usually much larger than in most programs.
Analyzing the algorithmic complexity of a piece of SQL takes some experience
and experimentation, and with different table stats the query planner may make
different decisions. It can be worthwhile limiting the planner's discretion to
get more predictable performance.

(I work on a product where many of the features can be expressed in terms of
relational algebra. Often, both the best performing and quickest to write and
test implementation logic is a bunch of SQL, and not the kind of CRUD that is
easily wrapped with an ORM. What would make my life easier is a SQL linter
that, given a model of costs, would prevent people writing queries that scale
only linearly with specific table sizes. I've accumulated sufficient intuition
that I could do this for MySQL at this point.)

~~~
oelmekki
Optimizing queries manually is fine by me; what troubled me more was that the
article recommends changing a configuration setting on the fly, which sounded
like asking for problems with concurrent requests.

But actually, I've just run a test, and it appears changing this setting only
affects the current connection, so provided it's toggled back after the
request, this should not be a problem.
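
For what it's worth, wrapping the query in a transaction and using SET LOCAL
scopes it even more tightly; the setting reverts automatically on
commit/rollback (table and column here are just illustrative):

    BEGIN;
    SET LOCAL enable_hashagg = off;
    SELECT count(DISTINCT n) FROM items;  -- nudged toward the index-only scan
    COMMIT;  -- enable_hashagg is back to its previous value here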

~~~
pavlik_enemy
Though I wonder how it would work with pg-bouncer.

------
andrea_s
It's a bit odd there's no mention of the PG columnar store in this article
(https://www.citusdata.com/blog/2014/04/03/columnar-store-for-analytics/) -
especially since it's from the same company.

It would be interesting to see how much performance improves once you use
cstore_fdw (especially since 1M records is quite small when talking about OLAP
workloads).

disclaimer: I've never used cstore_fdw, but I have evaluated a number of
columnar databases in the past.

~~~
ozgune
(Ozgun from Citus Data)

We find that the primary motivation for using cstore is reducing disk I/O /
storage footprint. cstore_fdw keeps a columnar layout on disk in compressed
form and reads only relevant columns. For example, it's commonly used for data
archival purposes.
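
For reference, the basic setup looks roughly like this (the table definition is
just illustrative):

    CREATE EXTENSION cstore_fdw;
    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

    CREATE FOREIGN TABLE events_archive (
        event_time  timestamptz,
        customer_id bigint,
        payload     text
    )
    SERVER cstore_server
    OPTIONS (compression 'pglz');

    -- Queries touching only a couple of columns read only those column blocks
    -- from disk, which is where the I/O and storage savings come from.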

That said, cstore_fdw doesn't yet make optimizations related to query planning
and execution. We ran experiments in that direction
(https://news.ycombinator.com/item?id=8423825), but making those changes
production ready is no small effort.

Since all benchmarks in this blog post are for in-memory data, I don't know
how much they would benefit from cstore. If I have the time, I'll give it a
try and update this comment with the results.

------
throwawayReply
The counting seems really slow compared to MSSQL? I'm not familiar with the
reference hardware used here, but just running a similar test on my desktop
with sql server it can count distinct a million strings in under a second
(with parallelisation, around 2.5s cpu time).

Or am I missing the fact that these benchmarks are run on a reference spec
which is comparatively old?

~~~
sb8244
I'm not familiar with mssql - does it use mvcc? That is the reason PG counting
is slow (really slow).

~~~
throwawayReply
I'm not an expert; this is my layman's understanding.

It has different isolation levels, some involving snapshots and some not. I
think most concurrency issues are (by default) dealt with by locks, which
start at row level and can escalate to page and table level (with significant
slowdown seen when lock escalation happens in contentious places).

But that only has an effect if the table is under write load.

~~~
sbuttgereit
This is historically correct, but I believe later versions of mssql server now
default to their mvcc implementation. I think this switch was circa 2005; not
100% sure.

I managed an enterprise applications group that primarily used mssql for data
in 2007. I can't recall which mvcc implementation they use. Our servers were
MSSQL 2000 and I remember being a bit more than surprised when the DBAs told
me the root of the performance problems we had was lock escalation;
having come from the Oracle and PostgreSQL worlds, I was naive enough to have
thought that lock escalation implementations like this were historical
curiosities rather than something I'd actually run into... live and learn I
guess.

~~~
nickpeterson
I believe it still has to be chosen, and the MS term is 'Read Committed
Snapshot'. Everything is a tradeoff: read committed snapshot makes your tempdb
busier (since it is involved in versioning), and balloons up the default size
of a row in a table. With RCS you end up adding 14 bytes of overhead to each
row which otherwise wouldn't be there.

So imagine a table with a single integer column and 1 billion rows. Instead of
the 9-byte per-row overhead (7 bytes of row metadata and 2 bytes for the page
offset), you have 23 bytes, plus the 4 bytes to hold the int32. Without RCS,
that row would only have a 9-byte overhead and the row size would be 13 bytes.

So: 1 billion * 27 bytes = 25.14 GB, while 1 billion * 13 bytes = 12.11 GB.

There are some other performance tradeoffs (walking row version values,
pagesplits when updating records without rowversions after converting the
database, et cetera).

Ultimately, MVCC is great for contention, but it stinks if you're trying to
efficiently pack in data.

For an alternative perspective, I sometimes bemoan the size of tables in
postgres because of the mandatory overhead and versioning.

~~~
sbuttgereit
Agreed. Table bloat is no fun either. Though, I would say under the most
common workloads, it's easier to manage table bloat than it is lock
escalation... I can pick my battleground for table bloat whereas lock
escalation is immediate.

Oracle's MVCC method doesn't have that problem... but then you get the
infamous ORA-01555, "Snapshot too Old", from time to time.

Oh well, no perfect worlds I suppose.

