Scaling Postgres with Read Replicas and Using WAL to Counter Stale Reads (brandur.org)
221 points by craigkerstiens on Nov 17, 2017 | 21 comments

Instead of having an "observer" process that updates a table with the LSN you can just ask Postgres directly: The `pg_stat_replication` view has the last LSN replayed on each replica available: https://www.postgresql.org/docs/10/static/monitoring-stats.h...

Also, instead of updating the `users` table with the LSN of the commit - which creates extra write load - why not store it in the session cookie, then you can route based on that.

Another option is to enable synchronous replication for transactions that need to be visible on all replicas: https://www.postgresql.org/docs/10/static/warm-standby.html#...

Since this can be enabled/disabled for each transaction it's really powerful.
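The cookie-routing idea above could be sketched as pure logic (hypothetical names, not from the article). A `pg_lsn` value like `16/B374D848` is two hex halves of a 64-bit WAL position, so it can be compared numerically:

```python
def parse_lsn(text):
    """Convert a pg_lsn string ('XLogID/XRecOff') to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def choose_backend(cookie_lsn, replica_lsns):
    """Route to any replica that has replayed past the LSN saved in the
    session cookie after the user's last write; fall back to the primary.

    replica_lsns maps replica name -> last replayed LSN (pg_lsn text).
    """
    if cookie_lsn is None:
        return next(iter(replica_lsns))  # no recent write: any replica works
    needed = parse_lsn(cookie_lsn)
    for name, replayed in replica_lsns.items():
        if parse_lsn(replayed) >= needed:
            return name
    return "primary"  # nobody has caught up yet
```

So `choose_backend("0/20", {"r1": "0/10", "r2": "0/30"})` picks `r2`, and falls back to the primary when no replica has replayed far enough.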

> Instead of having an "observer" process that updates a table with the LSN you can just ask Postgres directly: The `pg_stat_replication` view has the last LSN replayed on each replica available:

Note that that'll lag behind reality a bit; the feedback messages that contain the replay progress aren't sent continuously.

> Another option is to enable synchronous replication for transactions that need to be visible on all replicas: https://www.postgresql.org/docs/10/static/warm-standby.html#....

Note that's only behaving correctly if synchronous_commit is set to remote_apply, which was added in 9.6. Before that, syncrep could only guarantee that the remote side had safely received the necessary WAL, not that it had been applied.

With synchronous_commit = remote_apply there are complications though. You either have to make it wait for ALL your standbys (and then deal with the fact that a failing standby can hold up commits indefinitely), or you can have it wait for N of M standbys to apply, but then how can a reader know which standby nodes have applied the commit they're interested in? I have proposed synchronous_replay = on to solve these problems. It builds on the same technology: the remote_apply patch was a stepping stone, and replay_lag was another patch in that series. The third patch adds read leases so you can fail gracefully while retaining certainty for readers about which nodes have fresh data. See https://commitfest.postgresql.org/15/951/ .

There is also a proposal to add a 'wait for LSN' command, which the author of this article should check out. See https://commitfest.postgresql.org/15/772/ .

The two proposals achieve some form of causal consistency from different ends: one makes writers wait but doesn't require extra communication of LSN inside (and possibly between) client apps, and the other makes readers wait but requires more work from clients. I hope we can get both of these things in (in some form)!


What's your recommendation for solving the problem described in the article? Would you compare the LSNs of replicas in application logic at all?

I've only skimmed it so far, so I might be missing parts.

I think it depends a lot on what you're trying to achieve. In a lot of cases the only guarantee you need is read-your-own-writes, and small delays for an individual connection aren't that bad. In that case it can be a reasonable approach to query the LSN of the commit you just made, and then have your reads, whichever replica they go to, wait for that LSN to be applied.
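That wait could look roughly like this (a hypothetical sketch, not the article's code; on a real replica the current position would come from `SELECT pg_last_wal_replay_lsn()`, stubbed here as a callable so the shape is visible):

```python
import time

def wait_for_lsn(get_replay_lsn, target_lsn, timeout=5.0, poll_interval=0.01):
    """Block until the replica's replay position reaches target_lsn.

    get_replay_lsn: callable returning the replica's current replay LSN as
    an integer; in Postgres it would wrap SELECT pg_last_wal_replay_lsn().
    Returns True once caught up, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_replay_lsn() >= target_lsn:
            return True
        time.sleep(poll_interval)
    return get_replay_lsn() >= target_lsn  # one last check at the deadline
```

If the wait times out, the caller can fall back to reading from the master instead.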

For a system that's very latency sensitive that wouldn't work, and you'd fall back to querying the master pretty much immediately. Or you'd use something like what's described in the article, although I'd probably implement it somewhat differently.

(I wrote this.)

> Also, instead of updating the `users` table with the LSN of the commit - which creates extra write load - why not store it in the session cookie, then you can route based on that.

The article's a little long, but at one point I do mention that we're putting this information in Postgres in the demo for convenience, but that in a real implementation it might be worthwhile moving it elsewhere as an optimization.

This addresses the first point as well: you're right that this information can be procured from Postgres, but the point of the observer is to demonstrate a system that could be implemented in a way that's agnostic of storage. It's very plausible that you might want to put `min_lsn` and replication statuses in something very fast like Redis, and even have each `api` worker cache its own version of the latter so that a replica can be selected without even making a network call.
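With statuses cached in-process, replica selection becomes a local lookup. A minimal sketch (names assumed; LSNs shown as integers for brevity):

```python
import random

# Hypothetical in-process cache of replica replay positions, refreshed
# periodically from the observer's state (a Postgres table, Redis, etc.).
replica_status = {"replica1": 120, "replica2": 95, "replica3": 130}

def eligible_replicas(min_lsn, statuses):
    """Replicas that have replayed at least the user's min_lsn."""
    return [name for name, lsn in statuses.items() if lsn >= min_lsn]

def pick_read_candidate(min_lsn, statuses, primary="primary"):
    """Pick any caught-up replica at random; fall back to the primary."""
    candidates = eligible_replicas(min_lsn, statuses)
    return random.choice(candidates) if candidates else primary
```

No network call is needed at selection time; the only freshness risk is how stale the cached statuses are, which is bounded by the observer's refresh interval.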

How do you generate your diagrams? They're really beautiful. Zooming in on the SVG I see that it's all just Unicode, but I was wondering if you're manually typesetting it.

AFAIK it's Monodraw: https://monodraw.helftone.com/

(I've asked Brandur this in the past, so assuming he hasn't changed it this should still be what he uses)

There was another article from Brandur recently about Redis streams, which was great. The diagrams really caught my attention, but I shied away from asking about the source since it felt a bit disrespectful. Thanks for the link.

Oh, wow.

My internal ikiwiki pages are filled with ASCII art diagrams. This tool looks awesome; I’ll have to try it out!

Pretty clever take on an old technique.

At reddit we would store a field on the user recording the last time they voted on something. If they loaded a page that was newer than that timestamp, then we could skip reading any votes for that page since we knew they couldn't possibly have voted on anything on that page.

This basically takes the same technique and puts it a level down and reverses it, saying "this database has any update this user could have made".
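The reddit optimization described above boils down to a single timestamp comparison (an illustrative sketch, not reddit's actual code):

```python
def may_have_votes(last_vote_at, oldest_item_at):
    """If every item on the page was created after the user's last vote,
    the user can't have voted on any of them, so the vote lookup can be
    skipped entirely. Timestamps are anything comparable (e.g. epoch secs).
    """
    return last_vote_at >= oldest_item_at
```

The article's trick points the same kind of high-water mark at the database instead of the user's activity.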

I like it.

If I understand this right, they can guarantee fresh point reads. But an actual query needs a consistent snapshot of all entities (rows) under consideration for the query. ... for all of the "hundreds of queries a mature application might execute to fulfill a single request". You need a snapshot of the entire database (like git)

FTA "When mutating a resource in the system we’ll store the last committed log-sequence-number for the entity making the request"

This is what Datomic does by design; if you need infinite read scaling and perfect consistency, just use that, and you get a host of other benefits (the replica lives inside your application process, like a data structure, which has profound architectural consequences; for example, you can query in a loop, so the N+1 problem is solved).

Wouldn't this only work when your client is the one that last updated the record? Because how else would you know the correct `min_lsn`? Seems like you would need to store the map from user to LSN in something other than PG to get any benefit here.

(I'm the author.)

I'm storing `min_lsn` directly to the database, so anything connecting to it and loading a user record should be able to take advantage of the information.

I have a note in there about this as well, but although I'm showing how this information can be persisted to Postgres so that I'm only using one data store, it could really live in any kind of shared persistence layer.

For example, replication statuses and a user's `min_lsn` might be put in Redis instead, and be treated as ephemeral data such that we don't worry about losing them because they can be repopulated on every restart. It's also probably a good idea to have individual API workers caching replication statuses directly in memory so they can select a read candidate without even having to go to a database.

Thanks for writing the article, it was interesting. To see if I'm missing something here though: for this to offer a benefit the way it is now, I would need to already have a correct `min_lsn` for a user? But the issue is that if I got it from a previous request it could be out of date, isn't that accurate?

I assume they are fetching the min_lsn for a user from the master so it would be up to date.

Doesn't that defeat the point of moving reads off the master? I guess they're lower-cost reads though.

I think there's a stateful load balancer polling every replica

"By routing read operations only to replicas that are caught up enough to run them accurately"

"To save every api process from having to reach out and check on the replication status of every replica for itself, we’ll have a process called an observer that periodically refreshes the state of every replica and stores it to a Postgres table. ... Keep in mind that this status information could really go anywhere. If we have Redis available, we could put it in there for fast access"

"The observer runs in a loop, and executes something like this on every iteration"
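The observer loop those quotes describe could be sketched like this (assumed names; the per-replica query would be `SELECT pg_last_wal_replay_lsn()`, stubbed here as a callable):

```python
import time

def observe_once(replicas, fetch_replay_lsn):
    """Poll every replica for its replay position and return a status map.

    fetch_replay_lsn(name) stands in for running
    SELECT pg_last_wal_replay_lsn() against that replica.
    """
    return {name: fetch_replay_lsn(name) for name in replicas}

def run_observer(replicas, fetch_replay_lsn, store, interval=1.0,
                 iterations=None):
    """Refresh replica statuses on a loop and hand each snapshot to
    `store` (which might write a Postgres table or a Redis hash).
    iterations=None means run forever."""
    n = 0
    while iterations is None or n < iterations:
        store(observe_once(replicas, fetch_replay_lsn))
        n += 1
        if iterations is None or n < iterations:
            time.sleep(interval)
```

Readers then consult the stored snapshot instead of polling every replica themselves, which is the fan-out the article is trying to avoid.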

Also, "The technique is inspired by GitLab’s article on scaling their database, where they refer to it as 'sticky connections'"

I'm having a bit of a hard time figuring out who would use this. How big is the intersection of folks that can't figure out how to set up Citus but need a number of read replicas?

I can't go into much detail about what I do, but basically we have some kind of "network operators" that are the only ones using the write DB (and not a lot at that), and then we have a massive number of computers doing reads on the system. We don't need Citus for that, but PG replicas do the job very well.

PostgreSQL has a bit of a tradition of having multiple solutions to problems with different approaches and different tradeoffs. Wonderful as Citus may be, I think it's good to see that tradition continue.
