As for moving from cloud to bare metal, there is another thing to take into consideration (as raised in the comment this replies to): if you don't architect your AWS (or other cloud) deployment to take advantage of multiple geographic regions, the cloud won't benefit you much.
I 100% agree that this service should be deployed in more than one region. As others have said, all it takes is one event and the site will be down for days, weeks, or even months. (It may not happen often, but when it does, you go out of business.) The size and complexity of this infrastructure will make it nearly impossible to reproduce at short notice in a new facility. If I were the lead on this, I would have either active/active sites or an active/replica setup.
I would also have both local backups (for fast restores) and off-site backups of all data. A replica site protects against site failure, not against data loss, and it does not give you point-in-time recovery.
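To make the replica-versus-backup point concrete, here is a toy sketch, not anyone's real setup. The `Store` class, its in-memory "replica", and the snapshot list are all hypothetical stand-ins: the replica faithfully mirrors every write, including an accidental delete, while a point-in-time snapshot lets you roll back.

```python
import copy


class Store:
    """Toy key-value store (hypothetical) with a mirrored replica and snapshots."""

    def __init__(self):
        self.data = {}
        self.replica = {}    # stands in for the standby site
        self.snapshots = []  # stands in for point-in-time backups

    def put(self, key, value):
        self.data[key] = value
        self.replica[key] = value  # every write is mirrored

    def delete(self, key):
        self.data.pop(key, None)
        self.replica.pop(key, None)  # the mistake is mirrored too

    def snapshot(self):
        # Take an independent copy of the current state and return its id.
        self.snapshots.append(copy.deepcopy(self.data))
        return len(self.snapshots) - 1

    def restore(self, snapshot_id):
        # Roll the primary and the replica back to the chosen snapshot.
        self.data = copy.deepcopy(self.snapshots[snapshot_id])
        self.replica = copy.deepcopy(self.data)


store = Store()
store.put("repo-1", "commit history")
snap = store.snapshot()

store.delete("repo-1")                       # accidental deletion
replica_has_it = "repo-1" in store.replica   # the replica lost it as well

store.restore(snap)                          # only the backup brings it back
restored = store.data.get("repo-1")
```

In a real deployment the snapshot would of course live on separate storage (local for speed, off-site for safety), but the failure mode is the same: replication copies your mistakes just as quickly as your data.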
Yep, this is why scaling starts with a scalable distributed design. We once moved a fairly large logging stack from NFS to S3 for the same reason GitLab is trying to move to bare metal now. Moving off the cloud was not an option; moving to a TCO-efficient service was. NFS did not scale, and there was the latency problem. I don't think moving to bare metal can help with scale as much as a good architecture can. We will see how deep the datacenter hole goes. :)
To add to my previous comment, though: AWS (and cloud in general) tends to make much more sense if you are using its features and services (such as Amazon RDS, SQS, etc.). If you aren't using these services, I can absolutely guarantee I can deliver a much lower TCO on bare metal than on AWS (which is why I offered to consult for them). I see this all the time: a company moves from bare metal to AWS because bare metal is getting expensive, then quickly finds out that AWS can't deliver the performance it needs without massive scale, because it isn't using a proper scalable distributed design and can't afford to re-architect its platform.