PostgreSQL 16 Bi-Directional Logical Replication

perlgeek · on Dec 18, 2023

The description of preconditions seems a bit incomplete.

To me it seems you need to avoid certain constructs, like UNIQUE constraints. Otherwise you might have a local insert plus a replicated one, both the same value in the unique column, and different nodes reject different inserts.

smilliken · on Dec 18, 2023

Agreed. Also exclusion constraints (a generalization of unique constraints). Also foreign key constraints, because one database might delete a record simultaneously with another database adding a reference to it.

hinkley · on Dec 18, 2023

Deleting from a database is a sucker's bet. If you add to a record and mark it dead there are fewer semantic problems to deal with.

Partial indexes give you approximately the same speedup as deletes.

arthurcolle · on Dec 18, 2023

Why is this comment greyed out? Seems like a reasonable comment

hobs · on Dec 18, 2023

I didn't vote it down but there's huge tradeoffs here:

1. Row bloat.

2. Bad plan estimation due to bloat.

3. Needing filtered indexes for anything that's a row based unique constraint.

4. All history in one table means changing schemas is a PITA, and you're backfilling stuff that's not real.

5. There's more but that's just off the top.

hinkley · on Dec 19, 2023

1. Most of the problem with row 'bloat' is complexity of indexes, row deletion increases the number of required indexes, and lack of those indexes is a footgun

2. Got a citation for that? Pretty sure #3 makes that not true, but I'm open to be educated.

3. Yes, and? You're going to have those indexes anyway. That's the point.

4. You don't have to have all rows in a single table just because you took delete out of any user-facing query logic.

5. I think you need some more examples.

hobs · on Dec 19, 2023

And the number of rows in the table, effectively bloating it for all time.

Deleting rows should generally be via the same type of query patterns you use to find them, or that's pretty weird.

I think #4 is something that most people dont implement - ime they just tombstone in their own table, and then all of the other problems are pretty rampant.

Having separate indexes for deletes might be a thing, but taking a history of every change to a thing in a relational way is still a big pain because of the schema cost, merging changes in an audit table when its really a slowly changing dimension is really weird to most query patterns.

To #2 - it depends on your query engine, but if every query does have filtered indexes there's still a cost to having two tables in one table (ignoring #4)

hinkley · on Dec 19, 2023

> And the number of rows in the table, effectively bloating it for all time.

Particularly in the case of growth oriented companies, if your user base is growing exponentially, then half of your records are less than a year old anyway. This bloat is not the problem you make it. And as I said elsewhere, a row that is tombstoned can be deleted offline at some cadence that doesn’t break your other workflows. For instance when nearly all of its siblings are also dead.

hobs · on Dec 19, 2023

Users per second and rows per second are not really that strongly correlated - plenty of services create an excessive amount of noise for a fairly low amount of "real rows".

You can push on devs to fix things like pointless logical updates but its really easy to have a rogue process create copies of your entire dataset each time it runs.

arthurcolle · on Dec 19, 2023

I agree with all of your comments - this is why I asked initial question because literally all his initial comments seems perfectly valid. I appreciate the discourse, and I work on this stuff every day so I was just confused

hinkley · on Dec 19, 2023

HN is not immune to forgetting things we used to have consensus on.

There's always someone who is prepared to downvote, literally or figuratively, something that will eventually be the solution to some problem they created in production.

When I get downvoted for ideas and not demeanor, 2/3rds of the time it's just job security.

refulgentis · on Dec 19, 2023

Neutral party here: not the case, on your end, the mind-reading distracts from hearing the voice of Chesterton talking about his fence.

c.f. sibling comment written 10 minutes before yours, several trivial things.

More concretely from a downvoter: 1. I want _nothing_ to do with user data. Nothing. Toxic nuclear waste. The idea of keeping the waste on hand needs strong justification.

2. The idea of "just leave the rows but mark is_deleted" has a fundamental distate to a profession where O(n) is a constant worry.

3. The breezy way it dismisses this as an obvious solution that ~eveyone agrees suggets a small breadth of experience. That's not bad! But then we're at Chesterton's fence.

In summary, you can get downvoted for ruling out someone else's concrete lived experience, not just because your dangerous ideas threaten their paycheck (I don't even have a job!)

person3 · on Dec 19, 2023

> 1. I want _nothing_ to do with user data. Nothing. Toxic nuclear waste. The idea of keeping the waste on hand needs strong justification.

Additionally, with regulations like CCPA in some jurisdictions, this isn't even optional anymore. At some point you will need to hard delete user data.

hinkley · on Dec 20, 2023

I can forget that you were my customer, but I can’t forget that there was a customer. That quickly turns into tax evasion, for one thing.

Much of the user information we acquire is the result of greed, nosiness, or laziness. If deleting users is difficult for you, that’s an architectural problem that has next to nothing to do with my comment.

The world is absolutely full of rules that have exactly one exception. If they have two we apply the Rule of Three and either fix it or change it back to two. I have absolutely no qualms about treating user data as the exception here.

If you’re Amazon, you don’t even need much of the PII until checkout time. Collecting or looking up that data up early is a security risk. Checkouts are going to be orders of magnitude fewer operations than your browsing traffic. When the order of magnitude changes, the solutions often change. And lastly, checkouts are when you make money. Expensive operations, like inserting into a table with fragmentation problems, are much easier to justify when they are attached to revenue events.

An ad campaign that falls flat can bankrupt you. A fire or earthquake can bankrupt you. A fancy and unusable site relaunch can bankrupt you. Spending a little money at the point of sale cannot.

hinkley · on Dec 19, 2023

The lived experience you talk about is that this is a difficult problem without a solution, not a different solution that I dismissed. What I dismissed is the problem.

How do we solve X? You don't solve X, you solve Y. The XY problem, not Chesterton's Fence.

Not sure I follow you entirely on user data. There's user data that absolutely requires an audit trail, like subscriptions or orders. There's user data that might start out as null and eventually transition to an inane value, like avatar or self-description. You don't need complex merge rules for that data. It possibly only happens when their account is being hacked and that's not the data you need in that particular situation, yeah?

vlovich123 · on Dec 19, 2023

Couldn’t you just mark a deletion timestamp and then lazily delete later as part of a GC operation?

chrismarlow9 · on Dec 19, 2023

In larger systems you're pretty much forced to do it this way or face an ugly realization once you scale. Delete from s3 is an example.

Under the hood most db's do this as well (vacuum, repair database, etc).

At this point I've learned this lesson the hard way enough that I'd need to hear some really good reasons to NOT do it this way.

hinkley · on Dec 19, 2023

I’m not sure this is quite true for S3, but for the general case, once there’s consensus that a record is eligible to be deleted, you don’t require consensus to delete them. You can do it any time after the drop dead date. Once a month for instance.

One of the consensus algorithms Google was bragging about a few years back was built on very high precision hardware clocks that set a ridiculously short timeframe to achieve consensus on new data. After a few hundred milliseconds you could be certain a record had settled and make business decisions based off of it. It’s the same basic idea, but three orders of magnitude faster.

azlev · on Dec 19, 2023

I wouldn't word that way. Delete records is how SQL works. Also it's wiser to move old information to a separate database instead of paying maintenance costs of old or dead records. There are a lot of gotchas in keep old data, like GDPR.

hinkley · on Dec 19, 2023

Only if all records age out at the same rate, otherwise you’re making Swiss cheese of your tables. And/or you’re losing history data.

If your rows do age out at the same time, then I’d be tempted to ask you why you’re storing logs in a relational database, because that’s essentially what you have at that point.

cpursley · on Dec 19, 2023

Another option is an audit log that tracks all changes to a record.

hinkley · on Dec 19, 2023

It's possible to do row deletions and an audit log table, but do you see a lot of people who do both? I haven't. It's easier to keep a FK to the mutable table.

epcoa · on Dec 19, 2023

> but do you see a lot of people who do both?

Yes, pretty sure it’s very common. That doesn’t make it the greatest design, common doesn’t mean good.

hinkley · on Dec 19, 2023

I have too much scar tissue around accidental table scans caused by the fact that foreign key constraints don't implicitly create an index. If you think logarithmic insertion time is bad for your p95 times, try quadratic times on delete operations gone horribly wrong.

rvdginste · on Dec 19, 2023

Would I be wrong in thinking that this "bi-directional logical replication" is only usable in a very limited number of use cases? As you mention, unique constraints pose a problem, and a lot of tables would have unique constraints to protect business keys?

The article references another article on this, the refers to the PostgreSQL documentation:

- conflicts [1]

- restrictions [2]

You need to be very aware of the limitations to decide whether this is usable in a specific context. I don't think this is really intended to be used for "bi-directional" logical replication.

[1] https://www.postgresql.org/docs/current/logical-replication-...

[2] https://www.postgresql.org/docs/current/logical-replication-...

philjohn · on Dec 19, 2023

This feels like something for the specific use case of heavy writes that are treating the DB like an append-only log, and then running analytics queries over the top.

Something like a distributed metrics collection system.

dazzla76 · on Dec 19, 2023

Unique constraints can still be used as long as you accommodate the multiple sources. Something as simple of increment by 2 and the servers alternative odd and even as the starting point.

I used to do this with 3 way DB2 LUW replication back when i was a DBA. It all needs a bit of thought but is fairly simple when it comes down to it.

egamirorrim · on Dec 18, 2023

I don't know that this article tackled the real elephant in the room for bi-di replication. What happens when the same insert happens on both primaries at once, who wins?

singron · on Dec 18, 2023

They will each apply the insert locally. Then they will attempt to replicate that insert to each other. Each will attempt to apply the replicated conflicting insert, which will cause an error and halt replication for both nodes.

If you update the same data on both nodes, this is a recipe for almost certain disaster. Postgres is not a distributed database and this doesn't make it one.

cogman10 · on Dec 18, 2023

Yeah, read the docs here [1] and really does not seem like this would be a good system to work with. To make something fault tolerant you have to have a side subscription running watching for server errors so you can resolve these conflicts when they arise.

pglogical gave you the option to determine the winner of a conflicting write (which I think is preferable). The inbuilt postgres stuff seems like it'd be rife with potential errors.

[1] https://www.postgresql.org/docs/16/logical-replication-confl...

brianwawok · on Dec 19, 2023

I assume you just go very basic. All updates to table X go to database Y. Don’t have all the clients slamming different databases with updates to the same stuff.

Not ideal but would still help with some scale stuff.

philjohn · on Dec 19, 2023

Or an analytics system - spray events at whatever server, because each row is a unique record separate from the others, you get no conflicts.

Benefit - if your current system with a single primary for writes is suffering and your data and use-case fits this pattern, you can increase your write throughput without scaling up.

gregw2 · on Dec 19, 2023

This. But if you use integer surrogate primary keys and seed one server with even ones and one server with odd ones, both incrementing by two, and put some uniqueness constraint on your natural key, you won't have this issue of insert collisions and halted replication, just normal SQL errors. IIRC (non postgres-based advice from an earlier life but should work.)

winrid · on Dec 19, 2023

Yeah, that's not what this is for. This is more for "I have one DC in US and one in EU, and so on, with completely different customers, but if I need to I can access the data from any region anywhere".

collaborative · on Dec 19, 2023

Everyone in the thread including myself were thinking about failover and scale applications, but your post is the right answer

simonw · on Dec 18, 2023

Right: Is the idea here that you use UUIDs for inserted rows, hence avoiding duplicate inserts?

But what happens if multiple updates or deletes target the same existing row?

derefr · on Dec 19, 2023

I would assume the intended use-case is a "pseudo-sharded" cluster.

In an actually-sharded cluster, some scheme (e.g. a hash ring) directs writes to particular shards, and so any given row only lives on one shard.

In a pseudo-sharded cluster, you still direct writes to different "shards" — but all the pseudo-shards also mirror each-other's data, so every pseudo-shard actually has a complete dataset. But each row is still only owned by one particular shard; when updating that row, you must update it through that shard. All the data not owned by the shard, should be treated as a read-only/immutable cache on that shard of the other shards' data.

Personally, though, if I were setting something like this up, rather than bi-di logical replication on a single table, I'd just partition the table with one partition per shard, and then have the "owning" shard be the publisher for that partition and the rest of the shards be subscribers for that partition. Same effect, much less implementation complexity and much less risk.

simonw · on Dec 19, 2023

That sounds like a smart model for thinking about this, thanks.

crest · on Dec 19, 2023

Maybe a column with a server id could work? Can you have a schema differing only in the default value for a server id column or would it be replicated as "insert default value"?

gregw2 · on Dec 19, 2023

No need... here's a different twist on that .. make the integer primary key generation setup such that one server has odd IDs and the other even and they each increment by two. I did this for many years on Sybase and on SQL Server for active-active failover.

grogers · on Dec 18, 2023

In the MySQL world, bidirectional replication is common, but with at-most-one of the two being writable (the replication user can still write even if the secondary is read-only). Maybe they meant it to be similar and not truly multi-master? It does seem like a glaring omission given they talk about having two primaries.

stephenr · on Dec 19, 2023

This sounds like circular replication that was possible back in MySQL 3.x using asynchronous replication, and generally not recommended for primary-primary setups.

I can't speak for others but to my mind Galera or Group Replication are the "modern" solutions to multi-primary MySQL replication. Both are considered virtually-synchronous, and thus actively prevent conflicting queries from succeeding.

solidsnack9000 · on Dec 19, 2023

Maybe the way to use this is to have two nodes but only one of which receives traffic at any one time. With an active-active setup like that, the failover process can be swift.

However, this is kind of a tough thing to do right -- how do really know that only one is receiving traffic? -- and I wonder if there are many cases where it is a compelling alternative to an ordinary failover setup.

gregw2 · on Dec 19, 2023

I had pure active-active running at one job. We had our app layer centralize all its database access through a single set of wrapper generic SQL functions and had that set of functions implement some failures and hang-handling simple exponential backoff retry logic on the local DB before failing to the more remote one, with bidirectional. (nonpostgres) replication.

You can indeed get into split brain situations and have to make hard choices as per the CAP theorem. And the replication will break sooner or later and you have to detect that and learn how to recover safely manually at least. Don't do it just for funsies. But it's definitely doable.

Horffupolde · on Dec 19, 2023

You use shard-compatible ids like snowflake.

dhduebnnsj · on Dec 19, 2023

I assumed the above poster was referring to business rules - ie you can only have 1 person per table. DB 1 seats person A, DB 2 seats person B, when they replicate there’s a constraint that fails.

I’m not sure how ids like that would resolve this.

tristenharr · on Dec 20, 2023

This is exciting, I’m hopeful that we get distributed SQLite via something like Turso that can hook into this to stream subsets of data into SQLite edge-caches that can be pushed through the edge onto devices via something like embedded replicas. I know experimental work has been done here with pg_turso (I recently wrote my first few lines of zig to add support for some additional data-types for it) and this bodes well with the recent announcement of Postgres coming to fly.io, I dream of a day when I can make Postgres and SQLite play nicely together.

kroolik · on Dec 19, 2023

I find the explanation of logical and physical replication to be weird. Seems like only logical replication transfers changes.

I have always thought physical replication is replicating using WAL logs, which is a stream of changes. The logical replication would execute the queries on the replicating nodes, leaving WAL management to the replica.

This is also the reason why physical replication requires same pg versions (there may be diffs to the physical format of data in WAL), whilst logical replication uses domain language, which is independent from physical layout.

MarvinYork · on Dec 19, 2023

Could this handle more than two databases with this technique or would it generate the mentioned loop when a third database sends the data to multiple databases?

ttfkam · on Dec 19, 2023

Yes, the implementation distinguishes between a replication client and a user/app client specifically to prevent multiple query invocations. Network latency and CAP tradeoffs still apply of course, but you can absolutely scale well beyond two instances with this.

westurner · on Dec 19, 2023

/? postgres replication https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

"Pgactive: Active-Active Replication Extension for PostgreSQL on Amazon RDS" (2023) https://news.ycombinator.com/item?id=37838223

/? "pgactive" site:github.com https://www.google.com/search?q=%22pgactive%22+site%3Agithub...

cloudnative-pg/cloudnative-pg: https://github.com/cloudnative-pg/cloudnative-pg :

> primary/standby

cloudnative-pg > Logical replication #13: https://github.com/cloudnative-pg/cloudnative-pg/issues/13

/? logical replication:

https://www.google.com/search?q=logical+replication

pgadmin docs > Publication Dialog; logical replication: https://www.pgadmin.org/docs/pgadmin4/development/publicatio...

https://github.com/dalibo/pg_activity#faq ; pip install `pg_activity[psychopg]` :

> FAQ: I can't see my queries only TPS is shown

(How) Do any ~pg_top tools delineate logical replication activity?

pgcenter > PostgreSQL statistics [virtual tables] (and also /proc) https://github.com/lesovsky/pgcenter#postgresql-statistics

...

"Show HN: ElectricSQL, Postgres to SQLite active-active sync for local-first apps" (2023) from the creators of CRDTs https://news.ycombinator.com/item?id=37590257

electric-sql/electric: https://github.com/electric-sql/electric :

> ElectricSQL is a local-first software platform that makes it easy to develop high-quality, modern apps with instant reactivity, realtime multi-user collaboration and conflict-free offline support.

> Local-first is a new development paradigm where your app code talks directly to an embedded local database and data syncs in the background via active-active database replication. Because the app code talks directly to a local database, apps feel instant. Because data syncs in the background via active-active replication it naturally supports multi-user collaboration and conflict-free offline

"SQLedge: Replicate Postgres to SQLite on the Edge" (2023) https://news.ycombinator.com/item?id=37067980 ; gh-ost, dolt, sqldiff

/?hnlog ctrl-f Consistenc, Consensus :

- "A few notes on message passing" https://news.ycombinator.com/item?id=26535969 :

> The C in CAP theorem is for Consistency [3][4]. [...]

> [3] https://en.wikipedia.org/wiki/Consistency_model

> [4] https://en.wikipedia.org/wiki/CAP_theorem

MarvinYork · on Dec 19, 2023

Are you ok?

westurner · on Dec 21, 2023

I could keep teaching notes and discover opportunities.