What kinds of actions can you take to correct and prevent this class of errors? I think you could probably enforce deployment and shutdown checklists, or have automated DNS checking software look for these issues (I bet you guys have a solution for that), but there are so many human-error problems in manufacturing, and I kinda consider the large-scale deployment of apps to have similar issues and failure modes on the human side.
We have an inventory of everything running, and where it is supposed to be running. If service X does not respond on resource Y, the team responsible gets a ticket. The check covers IPs and names, plus some other services. There are no good ways to do this other than being meticulous, IMHO. Getting dumps of what is running where from all services is rather hard but more or less doable.
Azure has options, when you use their DNS, to tie a resource (Public IP, Azure Web App, and others) to a DNS record. If the resource is deleted, the record will return NXDOMAIN. AWS probably has something similar for Route 53.
Otherwise, good IaC can help, but even in larger companies I see more ClickOps than I should.
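A minimal sketch of that kind of inventory check, assuming the inventory is just a mapping from hostname to the IPs it is supposed to resolve to. The names `resolve_ips` and `find_mismatches` and the example hostnames are made up for illustration, and the "open a ticket" step is left to whatever tracker you use.

```python
import socket
from typing import Callable, Dict, List, Set, Tuple


def resolve_ips(name: str) -> Set[str]:
    """Resolve a hostname to its IP addresses; empty set if it does not resolve."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return set()
    return {info[4][0] for info in infos}


def find_mismatches(
    inventory: Dict[str, Set[str]],
    resolver: Callable[[str], Set[str]] = resolve_ips,
) -> List[Tuple[str, Set[str], Set[str]]]:
    """Return (name, expected, actual) for every name whose DNS answer
    differs from what the inventory says it should be."""
    problems = []
    for name, expected in sorted(inventory.items()):
        actual = resolver(name)
        if actual != expected:
            problems.append((name, expected, actual))
    return problems


if __name__ == "__main__":
    # Hypothetical inventory; in practice this comes from your own records.
    inventory = {"app.example.com": {"192.0.2.10"}}
    for name, expected, actual in find_mismatches(inventory):
        print(f"TICKET: {name} expected {sorted(expected)} got {sorted(actual)}")
```

The resolver is passed in as a parameter so the comparison logic can be tested without touching real DNS.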
- Stay within the cloud provider's ecosystem as much as possible, including for domain registration and DNS. All records then should be pointing to resources that include your account id in them and can't be taken over by others. If you delete the entire account, there'd be nothing to take over.
- Do everything with Infrastructure as Code, including DNS. If a single "terraform apply" creates everything, then a single "terraform destroy" deletes it all, leaving nothing dangling, provided of course that it is set up correctly and doesn't error out midway through a run.
Otherwise, it's a matter of being thorough. Automate what you can, including creating and deleting resources, if not through a single cloud provider API or some standard IaC product, then roll your own software to do it, but have software do it. Regularly roll out and tear down entire test installations of full systems, including valid DNS records. When you intend for them to be gone, ensure they are really, truly gone.
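One way to check "really, truly gone" after a teardown is to compare an export of your DNS records against the list of resources that still exist. A rough sketch, assuming you can already get both lists from somewhere (zone export, provider API, IaC state). The `Record` type and `find_dangling` are hypothetical names, not part of any provider SDK.

```python
from typing import Iterable, List, NamedTuple, Set


class Record(NamedTuple):
    name: str    # e.g. "test.example.com."
    rtype: str   # e.g. "CNAME", "A"
    target: str  # hostname or IP the record points at


def normalize(host: str) -> str:
    """Lowercase and drop the trailing dot so names compare consistently."""
    return host.rstrip(".").lower()


def find_dangling(records: Iterable[Record], live_targets: Set[str]) -> List[Record]:
    """Return records that point at a target no longer in the live set.

    Only record types that point at a resource (CNAME, A, AAAA) are checked;
    this is an assumption of the sketch, extend it as needed.
    """
    live = {normalize(t) for t in live_targets}
    pointing_types = {"CNAME", "A", "AAAA"}
    return [
        r
        for r in records
        if r.rtype.upper() in pointing_types and normalize(r.target) not in live
    ]
```

Run something like this at the end of every teardown and fail the pipeline if the list is non-empty.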
If you can't automate it, then yeah, checklists.
It's one of those things that is simple but not easy. It takes an organization that respects the tedious and time-consuming nature of ops, plans for it, and doesn't push people to cut corners for the sake of speed when the first time trying to do something takes much longer than someone's uninformed first guesstimate.
Really, automate. At a small enough scale, it doesn't matter, but if you're Mastercard doing this kind of thing thousands of times over the course of decades, humans will inevitably make mistakes. Software will make mistakes, too, but at least when you test software, it will do the same thing every time it is tested. Humans do not provide that guarantee, even if they have checklists.
Edit: Note the above is not true for LLMs, so when I say use software, I mean classical deterministic software. Don't have AI do it for you, because LLMs can and will produce different responses every time you make the same request. Don't devolve to making software that is just as flaky as humans.
> Stay within the cloud provider's ecosystem as much as possible, including for domain registration and DNS
Alas, if you follow this advice to mitigate this particular risk, you're completely hosed if your cloud account gets taken down or compromised. Which is why the standard advice is to do exactly the opposite and make sure your domains and DNS are separate from your cloud provider.
What if you have your domain registered outside of your cloud provider, but have your nameserver on your cloud provider's infra?
You can have another cloud platform configured with a duplicate nameserver, then go to your registrar and change the nameserver for your domain. Your replacement nameserver would then control any subdomain provisioning.
I think that would deal with the risk somewhat, though I could be missing something.
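A duplicate nameserver only helps if it actually holds the same records. Here is a small sketch of a drift check between the two, assuming each zone has already been exported into a dict keyed by (name, type). The `Zone` shape and `diff_zones` are assumptions for illustration, and how you export the zones depends on your providers.

```python
from typing import Dict, Set, Tuple

RecordKey = Tuple[str, str]      # (name, record type)
Zone = Dict[RecordKey, Set[str]]  # record values for each key


def diff_zones(primary: Zone, secondary: Zone) -> Dict[RecordKey, Tuple[Set[str], Set[str]]]:
    """Return every (name, type) whose values differ between the two zones,
    mapped to (values on primary, values on secondary)."""
    diffs = {}
    for key in sorted(set(primary) | set(secondary)):
        p = primary.get(key, set())
        s = secondary.get(key, set())
        if p != s:
            diffs[key] = (p, s)
    return diffs
```

An empty result means the standby is safe to switch to; anything else is either a missing record or a stale one that could itself become a dangling entry.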
> your cloud account gets taken down or compromised
In risk assessment, this risk should be resolved as "avoid", because losing DNS will be the secondary concern; the data is even more important. I agree that domains should be registered elsewhere, and it’s a good idea to keep a backup of the zone.