It's a shame it's not easier or more common for people to create clones of (most|all) of their infrastructure for testing purposes.
Something like half of outages are caused by configuration oopsies.
If you accept that configuration is code, you arrive at a disturbing conclusion: in most organizations, the usual test environment for critical network-related code is the production environment.
In an AWS environment, imagine a setup where the only thing that differs between production and test is the set of API keys used. What gets tricky is dealing with external dependencies, user data, and simulating traffic.
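As a minimal sketch of that idea (all names and key values here are hypothetical placeholders, not real AWS credentials or APIs): the infrastructure definition is a single piece of code, and only the credentials it is handed decide whether it targets the production or the test account.

```python
# Hypothetical sketch: identical infrastructure code for both environments;
# only the credentials (and thus the target account) differ.
CREDENTIAL_PROFILES = {
    "prod": {"access_key_id": "AKIA...PROD", "region": "us-east-1"},
    "test": {"access_key_id": "AKIA...TEST", "region": "us-east-1"},
}

def deploy_plan(env: str) -> dict:
    """Build the deployment plan; the resource list is the same for any env."""
    creds = CREDENTIAL_PROFILES[env]
    return {
        "credentials": creds,
        # Identical resources in both environments -- the whole point.
        "resources": ["vpc-main", "asg-web", "rds-primary"],
    }

# Same shape, different account: a config change can be exercised in
# "test" before the same code path ever touches "prod".
assert deploy_plan("prod")["resources"] == deploy_plan("test")["resources"]
```

The design choice being sketched: if the environment is selected by credentials alone, there is no separate "test version" of the config to drift out of sync with production.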
For an example more relevant to today's issue: imagine a second simulated "internet" in a globally distributed lab environment, complete with BGP configs, fake external BGP sessions, and servers receiving copies of production traffic.
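A toy sketch of what the simulated side of that "internet" might look like as data (the peer names are made up; the ASNs are from the private range and the prefixes are RFC 5737 documentation blocks, so nothing here is a real network):

```python
# Hypothetical lab topology: each entry is a fake external eBGP peer that
# lab routers would session with, standing in for real transit/IX neighbors.
FAKE_EXTERNAL_PEERS = [
    {"name": "transit-sim-1", "asn": 64512, "announces": ["198.51.100.0/24"]},
    {"name": "ix-peer-sim",   "asn": 64513, "announces": ["203.0.113.0/24"]},
]

def lab_routing_table(peers):
    """Union of all prefixes the simulated internet advertises into the lab."""
    return sorted(p for peer in peers for p in peer["announces"])

# A bad config change can be validated against this table in the lab:
# if the expected prefixes vanish or leak, the change never ships.
print(lab_routing_table(FAKE_EXTERNAL_PEERS))
```

The value is less in the data structure itself and more in having a stable, fake set of "external" neighbors to run every BGP config change against before it reaches real sessions.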
I get that it's a lot of work to set up and would require ongoing work to maintain - and that it's hard or even impossible to correctly simulate the many nuances of real-world traffic - and yet I also think in many cases it would be sufficient to prevent issues from making it into production.