Read-Only Mode for Better Rails Downtime

xfalcox · on Oct 14, 2020

At Discourse we maintain a gem for automatic read-only when the main PostgreSQL/Redis is down: https://github.com/discourse/rails_failover

christoomey · on Oct 14, 2020

This looks interesting, thanks for sharing! Most of the apps I work with run on Heroku, so I'd likely end up reaching for their [High Availability Postgres] feature, but nice to know of an alternative if I'm not on heroku.

That said, I don't think this would serve the intended purpose here. My specific use case was upgrading the postgres version I was using. To do that we introduced a follower, prevented writes to the primary (with the approach summarized in the post), upgraded the follower, and finally promoted the follower to be the new primary. Automatic failover to the follower during this process would likely confuse things.

[High Availability Postgres]: https://devcenter.heroku.com/articles/heroku-postgres-ha

xfalcox · on Oct 14, 2020

PostgreSQL upgrades are indeed one of the use cases we have with the gem, the way we do it is:

1. Have app configured to connect to both main and replica.

2. Connect to the rails console and tell the app to stay in read only mode until told otherwise.

3. Disable replication

4. Upgrade main to new PostgreSQL version

5. Tell the app to move back to read-write mode

6. Re-create the replica

This flow helped us do hundreds of PostgreSQL major version upgrades in AWS RDS this quarter when we moved from PG 10 to 12.

And this is just a plus, using the gem during normal operations means that if a Redis or PostgreSQL main explodes for any reason the app keeps serving traffic, albeit in read-only.

> Automatic failover to the follower during this process would likely confuse things.

I believe here the problem is mostly naming. The gem "failover" to read-only mode to a replica, it doesn't promote replicas to main ever. Naming is hard.

christoomey · on Oct 14, 2020

Oh that's interesting, thanks for the additional detail! I'm intrigued by what you mean when you say: "tell the app to stay in read only mode until told otherwise". Does that mean use the read-only replica connection during that time? If so, did you have to configure some error handling for that timeframe?

xfalcox · on Oct 14, 2020

> Does that mean use the read-only replica connection during that time?

Yes!

https://github.com/discourse/discourse/blob/721ee3642558d960...

> If so, did you have to configure some error handling for that timeframe?

Since this was an early goal in the project, the controllers are mostly away of the read-only mode already and know how to deal with it is most places.

christoomey · on Oct 14, 2020

I've added a note at the bottom of the post as your gem seems like a great option for this sort of functionality. Thanks again for sharing!

clairity · on Oct 14, 2020

> "I felt a bit odd using exceptions as the core of this workflow..."

i've done something analogous before and it seemed to me the exception framework was the right abstraction to use (though other opinions are welcome!). in this particular case, it seems like the main problem is that the trigger exceptions are not sufficiently granular/appropriate to describe the exact exception condition on which to trigger read-only mode:

* ActiveRecord::StatementInvalid

* PG::InsufficientPrivilege

rather than "DatabaseDownForMaintenance".

stubish · on Oct 15, 2020

A similar approach, for both horizontally scaling your db and automatic read-only mode. Ensure you are following the rule to only attempt writes in POST requests. POST requests get their queries sent to the primary db. GET requests send db queries to a read-only standby, unless the client sent a POST request recently (storing the last POST timestamp in a cookie or a session db, and the value for 'recently' is the current database replication lag). Catch any connection failures for requests using the primary db, and retry on a read-only standby. Catch db write failure to a read-only standby and replace with a 'down for maintenance' page.

We retrofitted the above behavior to an ~400k line monolithic Python app with minimal changes a bit over 10 years ago (the web framework was thankfully agreeable), shed over half the load on the master day 1, and could happily bounce the master db or do schema updates at will without needing to touch the appservers. Plug it into your CI system and you can do both automatic deployments and database updates, often with unnoticeable downtime if you are clever about your database patches.

Justsignedup · on Oct 14, 2020

An interesting idea. I know this will cause issues with Devise or something similar when the login token needs to be refreshed and it needs to update a record for that. I guess it depends on how the auth is set up.

christoomey · on Oct 14, 2020

The approach I shared worked really well for the use case I had (API, read-heavy traffic load, and a solid experience in the mobile apps with retry logic for failed writes), but I certainly wouldn't treat it as a silver bullet.

I would want to test the end-user experience before using it in another app, especially one with very different usage patterns, but for this case it ended up being a great optimization between simplicity and robustness.

perezperret · on Oct 14, 2020

I heard Chris talk over this on this The Bike Shed episode: https://www.bikeshed.fm/262, and thought he was going to do something horrible. I was relieved to read the post and see how sensible the solution ended up being! Great to know this is an option now!

christoomey · on Oct 14, 2020

Glad I didn't let you down :) In my defense, I did call the segment "Good Idea, Terrible Idea". This solution did end up being much better than what I was initially considering, but sometimes it takes a few bad ideas to get to the good one.

We also discussed the eventual solution I ended up with (which is summarized in this post) in this week's Bike Shed if you want to hear a bit more about it: https://www.bikeshed.fm/264.