
Parallelism in PostgreSQL - fanf2
https://www.percona.com/blog/2019/07/30/parallelism-in-postgresql/
======
macdice
The section on parallel joins is a bit out of date. Shameless plug: if you are
looking for more information, I've written and spoken about this subject
extensively, with further links and pointers to other blogs and academic
papers.

Parallelism in PostgreSQL in general:

https://speakerdeck.com/macdice/parallelism-in-postgresql-11

Deeper dives on hash joins:

https://write-skew.blogspot.com/2018/01/parallel-hash-for-postgresql.html

https://speakerdeck.com/macdice/hash-joins-past-present-and-future

~~~
ibrar74
It is not outdated; it just has less information. Parallel joins are a bigger
topic in themselves. In this blog, I wanted to give a brief introduction to
parallelism in PostgreSQL.

I will definitely read what you suggested. But stay tuned: I will post more
blogs, one for each sub-topic, to cover it all.

~~~
macdice
Cool. I was just reacting to the fact that you highlighted merge joins in
parallel queries with a partial plan on the outer side only, which was a new
feature back in 2017 when 10 came out (and had already been supported for hash
joins and nested loops since 9.6), without mentioning that in 2018, when 11
came out, we got parallel hash joins that can use a partial plan on both sides
of the join.
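
As a rough illustration (table and column names here are made up), on
PostgreSQL 11+ a large equi-join can produce a Parallel Hash Join, with
partial scans feeding both sides of the join:

```sql
-- Hypothetical tables; exact plan depends on data, statistics, and settings.
EXPLAIN (COSTS OFF)
SELECT *
FROM orders o
JOIN order_lines l ON l.order_id = o.id;

-- Illustrative plan shape:
--  Gather
--    Workers Planned: 2
--    ->  Parallel Hash Join
--          Hash Cond: (l.order_id = o.id)
--          ->  Parallel Seq Scan on order_lines l
--          ->  Parallel Hash
--                ->  Parallel Seq Scan on orders o
```

Before 11, only one side of a hash join could be a partial plan; each worker
had to build its own copy of the hash table.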

------
nymonym
I'm not sure why the performance benefits hit a limit with 10 workers on a
64-core machine. It doesn't look like there should be too much communication
complexity given largely disjoint workloads.

Are we hitting memory/IO bottlenecks?

~~~
slaymaker1907
It’s not really because of Amdahl’s law but because the concurrency control
used by Postgres is not infinitely scalable. IIRC, Postgres uses MVCC with
some modifications to give full serializability if desired, and some of this
is basically run optimistically, such that it may have to abort a transaction
if it cannot be committed in a serializable way. A consequence of this is that
the number of aborts will grow very rapidly as you increase concurrency.

A particularly nasty behavior occurs when you get into a situation where given
a set of concurrent transactions, all of them must be aborted. While 2PL is
not very popular these days, it does have the advantage that you can often
entirely eliminate deadlock via carefully written queries. Then you just have
to deal with Amdahl’s law and you won’t run into quite as catastrophic
performance given high concurrency.
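
A minimal sketch of the kind of optimistic abort described (this concerns
SERIALIZABLE transactions across concurrent sessions; the table and values are
made up):

```sql
-- Two sessions each read the whole table, then insert a row the other
-- session's read would have seen. Under SSI, one commit may be rejected
-- with SQLSTATE 40001 (serialization_failure) and must be retried.

-- Session 1:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(amount) FROM accounts;
INSERT INTO accounts VALUES (1, 10);
COMMIT;

-- Session 2, interleaved with session 1:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(amount) FROM accounts;
INSERT INTO accounts VALUES (2, 20);
COMMIT;
-- ERROR:  could not serialize access due to read/write dependencies
--         among transactions
```

The application is expected to catch 40001 and retry, which is the extra
sequential work that piles up under high concurrency.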

Andy Pavlo wrote a paper looking at this, comparing different concurrency
control strategies under very high levels of concurrency.

~~~
hinkley
The limitations of your concurrency primitives _are_ Amdahl's law. We call
them "concurrency primitives" but they're really about shared state. They are
the sequential parts of the operation.

The commit phase of optimistic locking is cheap except when it's a do-over,
and by necessity it causes the second action to use the resource twice, the
second time sequentially to the first action.

~~~
londons_explore
Are you sure about this? I'm not aware of any parts of postgres which use
optimistic locking?

------
zmmmmm
We've been investigating using these features for some compute intensive
workloads to see if we can keep some things inside the database that would
normally have to break out to a separate compute service. The biggest problem
really is this:

> .... if a parallel plan is not produced, refer to the PostgreSQL
> documentation on parallelism for details.

That is, there are a number of incidental features of queries that can
silently disable parallelism. It becomes an art of knowing what will trip it
up, and another thing you have to test and think about while working with your
queries.
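
One way to make the silent fallback visible (the setting and table name here
are illustrative) is to inspect the plan: if EXPLAIN shows no Gather node,
something in the query disabled parallelism.

```sql
-- Allow the planner to consider parallel plans for this session.
SET max_parallel_workers_per_gather = 4;

EXPLAIN (COSTS OFF)
SELECT count(*) FROM measurements;

-- A parallel plan includes a Gather node, e.g.:
--  Finalize Aggregate
--    ->  Gather
--          Workers Planned: 4
--          ->  Partial Aggregate
--                ->  Parallel Seq Scan on measurements
```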

Other than that however, it's extremely powerful to be able to use these
features and I'm hopeful we'll be able to avoid a whole lot of bespoke compute
services we'd otherwise need to make just to do certain compute operations in
a timely manner.

~~~
rosser
I would try to think of it less as a matter of "incidental features... that
can silently disable parallelism", than one of "operations that are
incompatible with parallel execution".

Writing anything [0], locking rows (e.g., "SELECT ... FOR UPDATE"), running in
the context of a cursor or loop, invoking "PARALLEL UNSAFE" functions, and
appearing as a subquery in a query that is _already_ parallelized are all
cases where it is not meaningful for the planner to try to parallelize, so it
punts.

Remember, also, that this is a young implementation. The PostgreSQL
community's first priority is taking great care not to eat your data. They'll
find more ways to safely parallelize things.

[0] In 11+, you can get a parallel plan on, e.g., "CREATE TABLE ... AS" and
"SELECT INTO", if the query is otherwise parallelizable.
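
For instance (the function name and body are hypothetical), user-defined
functions default to PARALLEL UNSAFE, so a query calling one falls back to a
serial plan unless the function is marked safe:

```sql
-- Marking the function PARALLEL SAFE declares that it performs no writes,
-- takes no locks, and touches no backend-local state, so queries calling
-- it remain eligible for a parallel plan.
CREATE FUNCTION taxed(price numeric) RETURNS numeric AS $$
  SELECT price * 1.2;
$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;
```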

------
chupa-chups
Thanks for the submission. This looks quite interesting, showing the benefits
of parallel execution in a single connection compared to versions before 10.

~~~
ibrar74
Thanks for reading that.

------
gomoboo
The cookie acceptance banner for this site is one of the most obnoxious that
I’ve seen. Basically accept them or leave (no problem with this as it’s their
site). Thing is, the website had already stored data in my browser when I
checked after deciding to leave. What’s the point of the banner then?

------
iandanforth
A company that doesn't care enough to have a copy editor go over its blog
posts does not inspire confidence. What other errors have crept into this post
that _aren't_ obvious?

