Hacker Newsnew | past | comments | ask | show | jobs | submit | gpavanb's commentslogin

The fundamental problem is that a synchronous, lock-based approach is being used in an asynchronous, event-driven context (logical replication).

The Postgres team needs to go back to the drawing board and avoid polling altogether and more importantly have event-listener based approaches for primary and replicas separately

It's great to see ClickHouse contributing to Postgres though. Cross-pollination between two large database communities can have a multiplying effect.


> The fundamental problem is that a synchronous, lock-based approach is being used in an asynchronous, event-driven context (logical replication).

This isn't hit in ongoing replication, this is when creating the resources for a new replica on the publisher. And even there the poll based thing is only hit due to a function being used in a context that wasn't possible at the time it was being written (waiting for transactions to complete in a hot standby).

> The Postgres team needs to go back to the drawing board and avoid polling altogether and more importantly have event-listener based approaches for primary and replicas separately

That's, gasp, actually how it already works.


> That's, gasp, actually how it already works.

Right?

Armchair software engineers are so incredibly annoying.

Postgres codebase is a shining example of good engineering. They walk a very fine line between many drawbacks. Code is battle tested (and surprisingly readable). Not many people (or groups) can do much better - and if they think they can, they are probably going to learn a lesson in humility.


> Postgres codebase is a shining example of good engineering. They walk a very fine line between many drawbacks. Code is battle tested (and surprisingly readable). Not many people (or groups) can do much better - and if they think they can, they are probably going to learn a lesson in humility.

To be fair, we have/PG has plenty crud-y code... Some of it even written by yours truly :(. Including some of the code involved here.

But just saying "go back to the drawing board" without really understanding the problem isn't particularly useful.


I'd be happy to make the changes to the codebase but there's definitely scope for improvement. Putting a sleep to avoid repeated lock acquisition should have raised an eyebrow or two when it was implemented.

As mentioned, there should be a read-replica specific code that works using event listeners. Repurposing the primary's code is what causes the issue.

The following pseudocode would give a better picture on the next steps

// Read replica specific implementation

void StandbyWaitForTransactionCompletion(TransactionId xid) {

// Register to receive WAL notifications about this transaction RegisterForWALNotification(xid);

    while (TransactionIdIsInProgress(xid)) {
        // Wait for WAL record indicating transaction completion
        WaitForWALEvent();
        CHECK_FOR_INTERRUPTS();
        
        // Process any incoming WAL records
        ProcessIncomingWAL();
    }
    
    UnregisterForWALNotification(xid);
}


This has absolutely nothing to do with the problem at hand.


KiteModeler-Reboot, which is a modern version of KiteModeler from NASA Glenn.

You can check out the repository at http://github.com/gpavanb1/KiteModeler


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: