
Pgcat – Enhanced PostgreSQL logical replication - jinhualuo
https://github.com/kingluo/pgcat
======
anarazel
> Instead of worker processes and low-level row ingression, pgcat uses sql
> template to apply high-level sql commands, so it could make largest
> compatibility on target table type. It is written in golang and runs in
> separate process.

Doing "pgcat uses sql template to apply high-level sql commands"
unconditionally is somewhat sad. It has _WAY_ higher overhead than the
referenced "low-level row ingression". It makes sense to do so in some cases
(into a view, triggers, apply functions, ...), but doing it always doesn't
seem great.
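For concreteness, here is a rough Go sketch of what the template-driven apply
style amounts to: every replicated row is rendered into a textual statement and
executed through the full SQL layer (parser, planner, executor), which is where
the extra overhead relative to low-level tuple apply comes from. The template
and the names are invented for illustration; this is not pgcat's actual code.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Hypothetical upsert template in the spirit of "SQL template" apply:
// the applier renders one statement per replicated row and runs it
// through the normal SQL layer, so any target a statement can address
// (view, foreign table, partitioned table) works.
var upsertTmpl = template.Must(template.New("upsert").
	Funcs(template.FuncMap{"join": strings.Join}).
	Parse("INSERT INTO {{.Table}} ({{join .Cols \", \"}}) " +
		"VALUES ({{join .Vals \", \"}}) " +
		"ON CONFLICT ({{.Key}}) DO UPDATE SET {{join .Sets \", \"}}"))

type change struct {
	Table, Key       string
	Cols, Vals, Sets []string
}

// renderUpsert turns one replicated row change into SQL text.
func renderUpsert(c change) (string, error) {
	var b strings.Builder
	if err := upsertTmpl.Execute(&b, c); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	sql, _ := renderUpsert(change{
		Table: "public.users", Key: "id",
		Cols: []string{"id", "name"},
		Vals: []string{"$1", "$2"},
		Sets: []string{"name = EXCLUDED.name"},
	})
	fmt.Println(sql)
}
```

The flexibility argument below is visible here: the rendered statement does not
care whether `public.users` is a base table, a view with triggers, or a
distributed table; the SQL layer handles all of them, at the cost of parsing
and planning every row.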

~~~
jinhualuo
Yes, the template-generated SQL brings some overhead. But that's the cost of
flexibility: the target table is not limited to a base table. It can be a
view, a foreign table, a partitioned table, or a Citus distributed table.
It also supports table name mapping.

~~~
anarazel
You can detect whether a target is a plain table or not. And use a different
path if not.
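That detection can be sketched in a few lines of Go. The `pg_class` query and
the relkind letters are standard Postgres; the routing policy itself (fast path
only for plain heap tables) is my assumption for illustration, not a statement
about what pgcat does.

```go
package main

import "fmt"

// The relkind can be fetched once per target, e.g. with:
//   SELECT c.relkind FROM pg_class c
//   JOIN pg_namespace n ON n.oid = c.relnamespace
//   WHERE n.nspname = $1 AND c.relname = $2

// useFastPath decides, from pg_class.relkind, whether the low-level
// row apply path is usable; everything else falls back to generated SQL.
func useFastPath(relkind byte) bool {
	switch relkind {
	case 'r': // ordinary table
		return true
	case 'p', 'v', 'm', 'f': // partitioned, view, matview, foreign
		return false
	default:
		return false
	}
}

func main() {
	fmt.Println(useFastPath('r'), useFastPath('v'))
}
```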

And independent of using SQL or not: Executing them separately out of process
kills you via latency, even if you're on the same machine. The context
switches are just too expensive. Which makes single threaded apply untenable
for anything with a few more clients, or bulk load.

~~~
jinhualuo
I will try to do more performance testing, especially in a production
environment.

Detecting the target table type is easy, but it seems tricky to redirect the
flow to the Postgres built-in logical replication (there is no need to
re-implement it, especially since it is written in C, so redirection is the
right way).

In fact, you could use Postgres logical replication directly for some cases,
like big and real-time data flows. But currently it has one big flaw:
bi-directional or multi-master replication is impossible, because it does not
filter changes by origin, so they loop back endlessly; you can check the
function pgoutput_origin_filter.
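The missing filter is conceptually tiny. A hedged Go sketch of the decision
(origin id 0 means "locally generated" in Postgres; the function name here is
invented):

```go
package main

import "fmt"

// forwardChange is the loop-breaking rule being described: only
// forward changes that originated locally. A change carrying a
// non-zero replication origin was itself applied from another node,
// and re-publishing it is what creates the infinite loop.
func forwardChange(originID uint16) bool {
	return originID == 0
}

func main() {
	fmt.Println(forwardChange(0), forwardChange(7))
}
```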

And even if the filtering were right, it would still lack conflict
resolution. pgcat can help there, and you don't even need to use pgcat's
replication itself, because its conflict resolution works via triggers; you
can check the pgcat_setup_lww command.

------
ahachete
I'm not a big fan of conflict resolution, and LWW is a pretty bad choice in
most cases. Conflict resolution means data loss (it violates the D of ACID,
as a user expects that a committed transaction will endure, but it gets
deleted or overwritten by another one), and most likely the C as well.

The effects of this are hard to predict and can hit your business logic
pretty hard. This is not the kind of expectation you get from Postgres. This
would fail Jepsen at second 1.

So I'd advise not to use it. The only use case I see for bi-directional
replication is where you ensure that all the master nodes have disjoint sets
of data, and the logical replication just aggregates them, without a
possibility for conflicts.

If you really want multi-master you need a system that works on top of a
consensus protocol. And there are definitely tools out there for this. But not
LWW.

~~~
bonesmoses
LWW / LUW systems are a fine first approximation, but generally do not fulfill
the strictest guarantees and desirable outcomes for a database without
accounting for a lot of edge cases.

A good conflict management system will have that only as a default. Better
ones will also provide mechanisms for overriding LWW in specific scenarios on
a per table or even row basis depending on the in/out values from both nodes.
The trick is building these conflict management rules so they're independently
deterministic, or you'll end up with node divergence.
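The "independently deterministic" requirement can be shown in a few lines of
Go: both nodes see the same conflicting pair of values, but with the roles of
local and remote swapped, so a role-dependent rule diverges while a symmetric
one converges. Illustration only; the rule names are made up.

```go
package main

import "fmt"

// keepLocal is role-dependent: each node keeps its own value.
func keepLocal(local, remote int) int { return local }

// keepMax is symmetric: both nodes compute the same answer
// regardless of which side is "local".
func keepMax(local, remote int) int {
	if remote > local {
		return remote
	}
	return local
}

func main() {
	a, b := 1, 2 // node A wrote 1, node B wrote 2
	fmt.Println(keepLocal(a, b), keepLocal(b, a)) // nodes diverge
	fmt.Println(keepMax(a, b), keepMax(b, a))     // nodes converge
}
```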

Normally we just do what you said, and strongly recommend interacting with
sticky sessions, geographically disparate segments, etc., to prevent the
problem before it happens. It's far easier to avoid a conflict than to correct
it after the fact.

While a consensus model "fixes" this, it drastically cuts your throughput to a
function of the latency of the most remote node in the quorum, essentially
defeating the purpose of having a local Master node. Ideally you would reserve
this for specific transactions that are sensitive in that they require
absolute consistency regardless of incurred latency (See PACELC). And even
this requires a distributed deadlock detector unless more than just the
transactions are managed through the consensus layer.

BDR for example provides both of these models for the above reasons. Multi-
Master is not an easy problem to solve, and requires a multi-faceted approach
to even come close to something usable in a generalized scenario.

~~~
ahachete
I don't agree with the "fine" in the "fine first approximation". I believe it
is indeed a bad solution.

LWW/LUW basically leads to data loss. Something unexpected for Postgres
users, which potentially undermines its heritage of caring about data to the
extreme.

From a practical perspective, if you run an inventory or sales of physical
goods, a counter over a multi-master setup with said conflict resolution may
lead to double bookings, lost bookings, or negative inventory stock. None of
these events is fine to me.

I'd prefer _not to have_ conflict resolution and have the replication system
explicitly check that the data sets are disjoint. In that case, these multi-
master systems would be very useful (for a particular set of applications).

That consensus is "slow" is a problem also well solved in the real world.
Sure, it's not easy, but when you partition and parallelize, you overcome
many of the problems. Look at Spanner, AzureDB, or the Postgres-compatible
APIs of YugabyteDB and CockroachDB. They have all implemented multi-master
with consensus.

~~~
bonesmoses
Right tool for the right job. Users in Tokyo won't be editing accounts for
users in Australia. There's often a natural partition there, where you can
safely leverage data locality on an eventual-consistency basis.

Conflict-free Replicated Data Types (CRDTs) and column-level conflict
resolution also largely address the issue with data merges and coarse counter
systems. I'll just ignore the general design flaw there (these should really
be insert-only ledger systems rather than counters), because bad or naive
implementations abound. I get that the application layer shouldn't
necessarily have to coddle the database, given that absolute ACID would
resist non-ideal usage patterns.
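For the counter case specifically, a grow-only counter CRDT (G-Counter) is the
textbook fix: each node increments only its own slot, and merge takes per-slot
maxima, so replicas converge without any conflict to resolve. A minimal Go
sketch (decrements would need the PN-Counter variant, which pairs two of
these):

```go
package main

import "fmt"

// GCounter maps node name -> that node's local increment count.
type GCounter map[string]int

// Inc bumps only the calling node's own slot.
func (g GCounter) Inc(node string) { g[node]++ }

// Value is the sum over all nodes' slots.
func (g GCounter) Value() int {
	total := 0
	for _, v := range g {
		total += v
	}
	return total
}

// Merge takes the per-slot maximum; merging in any order converges.
func (g GCounter) Merge(other GCounter) {
	for n, v := range other {
		if v > g[n] {
			g[n] = v
		}
	}
}

func main() {
	a, b := GCounter{}, GCounter{}
	a.Inc("tokyo")
	b.Inc("sydney")
	b.Inc("sydney")
	a.Merge(b)
	b.Merge(a)
	fmt.Println(a.Value(), b.Value()) // both replicas agree
}
```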

I haven't played with the others, but I set up an 8-node CockroachDB cluster
and managed to get pgbench to run on it. I was clearly doing something wrong,
because I wasn't impressed with the throughput even at very high client
counts. From what I could see, I was being throttled by some mix of RTT and
locking contention within the default 3-node consensus group. Even supposing
this is not problematic with non-overlapping shard locality, you still hit a
hard performance ceiling when multiple clients interact with the same pool
fragment.

I'll have to look at the others now.

~~~
ahachete
> Right tool for the right job. Users in Tokyo won't be editing accounts for
> users in Australia. There's often a natural partition [...]

Exactly. That's why I'd say tools that support multi-master should provide
and _enforce_ support for this use case, and avoid (not support) the case
where data durability and consistency are lost.

> Conflict-free Replicated Data Types (CRDTs) [...]

Yes, that's another solution for the problem... but I don't see them provided
by BDR or Pgcat.

> I'll just ignore the general design flaw there, considering these should be
> an insert-only ledger system rather than counters [...]

It's not that I advocate for that system, but many users may choose this
implementation. Saying "hey, you need to do all this and implement these best
practices (and pass the Jepsen test) just to be sure you don't screw up data
consistency and durability"... is putting a heavy burden on the user.
Databases, if anything, abstract users away from complex issues with data and
provide solid primitives to work on them. Again: Postgres is known as a very
well trusted data store with given ACID properties. Conflict resolution
breaks with this.

> I haven't played with the others, but I set up an 8-node CockroachDB and
> managed to get pgbench to run on it. I was clearly doing something wrong,
> because I wasn't impressed at the throughput even with very high client
> counts [...]

I haven't benchmarked it myself, but I have heard stories of low performance.
Actually, I always believed that the internal key-value store they use would
never scale to represent table workloads. But give the other solutions a
shot. Spanner, for one, is known to process millions of transactions per
second, with global (planet-scale) replication (of course, with notable lag,
but total consistency). That's pretty cool ;)

~~~
irfansharif
> Actually, I always believed that the internal key-value store that they use
> would never scale to represent table workloads.

Care to elaborate here? As someone working at that layer of the system, our
RocksDB usage is but a blip in any execution trace (as it should be; the
network overhead of a distributed system dominates single-node key-value
perf). That aside, plenty of RDBMSs are designed to sit atop internal
key-value stores. See MySQL+InnoDB[0], or MySQL+RocksDB[1] as used at
Facebook.

[0]:
[https://en.wikipedia.org/wiki/InnoDB](https://en.wikipedia.org/wiki/InnoDB)

[1]: [https://myrocks.io/](https://myrocks.io/)

~~~
ahachete
Don't get me wrong, both RocksDB and the work done by CCDB is pretty cool.

Yet I still believe that layering a row model as the V of a K-V store
introduces, by definition, inefficiencies in accessing data the way row
stores do, compared to pure row storage. It's not that it can't work, but I
believe it can never be as efficient as more row-oriented storage (say, like
Postgres).

~~~
irfansharif
I have no idea what you're saying. What's "row-oriented storage" if not
storing all the column values of a row in sequence, in contrast to storing
all the values of a column across the table in sequence (aka a "column
store")? What does the fact that it's exposed behind a KV interface have to
do with anything? What's "more" about Postgres' "row-orientedness" compared
to MySQL's?

In case you didn't know, a row [idA, valA, valB, valC] is not stored as [idA:
[valA, valB, valC]]. It's more like [id/colA: valA, id/colB: valB, id/colC:
valC] (modulo caveats around what we call column families[0], where you can
have it be more like the former option if you want). My explanation here is
pretty bad, but [1][2] go into more detail.

[0]: [https://www.cockroachlabs.com/docs/stable/column-
families.ht...](https://www.cockroachlabs.com/docs/stable/column-
families.html)

[1]: [https://www.cockroachlabs.com/blog/sql-in-cockroachdb-
mappin...](https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mapping-
table-data-to-key-value-storage/)

[2]:
[https://github.com/cockroachdb/cockroach/blob/master/docs/te...](https://github.com/cockroachdb/cockroach/blob/master/docs/tech-
notes/encoding.md)
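A toy Go version of the per-column mapping described above. The key syntax is
invented for readability; real CockroachDB keys are binary-encoded and subject
to the column-family caveats mentioned.

```go
package main

import "fmt"

// perColumnKV encodes one row as one KV pair per column,
// i.e. [id/colA: valA, id/colB: valB, ...], as opposed to a single
// [idA: [valA, valB, valC]] pair.
func perColumnKV(id string, cols map[string]string) map[string]string {
	out := make(map[string]string, len(cols))
	for c, v := range cols {
		out[id+"/"+c] = v
	}
	return out
}

func main() {
	kv := perColumnKV("idA", map[string]string{"colA": "valA", "colB": "valB"})
	fmt.Println(kv["idA/colA"], kv["idA/colB"])
}
```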

~~~
ahachete
I know well. I have read most CCDB posts and architecture documentation.
Pretty good job.

There are several ways to map a row onto a K-V store, and different databases
have chosen different approaches; I'm not referring specifically to CCDB's.

Whether you do [idA: [valA, valB, valC]] or [id/colA: valA, id/colB: valB,
id/colC: valC], what I'm saying is that I believe it is less efficient than
[idA, valA, valB, valC], which also more clearly supports the case of
compound keys (aka [idA, idB, idC, valA, ....]). Those are the ways Postgres
stores rows.

------
_-___________-_
What use cases are there for replication that doesn't necessarily produce an
accurate replica?

------
akdor1154
Looks really nice! Out of interest, at what stage would the DBAs here
consider running a newish third-party project such as this on a production
workload?

~~~
CuriouslyC
Personally I would test this in a greenfield project first, unless this solved
a serious problem I was facing.

~~~
gigatexal
Yep. Former DBA here. Don’t trust something new with production data without
getting really familiar with it, learning its corner cases and how to
recover, and learning to trust it with your data.

