Do you mean the physical server crashing? You should have a hot spare replicated machine for that. If you mean the RDBMS failing randomly at night, that really doesn't happen on mature systems like PostgreSQL or MS SQL Server. I mean it could happen but it's as rare as hen's teeth.
Every time I hear someone make a claim like this I have to wonder how much experience they actually have with these systems in a demanding environment.
Relational databases, all relational databases, have a disturbing tendency to fall over suddenly under load with little advance warning. They don't have a pleasant gradual failure mode. Instead, some point of contention goes from 99% of capacity (at which point it takes very little extra load to tip it over) to 101% of capacity (at which point everything falls apart).
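One rough way to picture that cliff is a textbook single-server queue (M/M/1), where mean response time is 1 / (service rate - arrival rate). This is only an illustrative model, not a description of any particular database's lock manager; the 100 requests/second service rate and the function name are made up for the sketch.

```python
def mean_response_time(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: 1 / (mu - lambda).

    Returns infinity once arrivals meet or exceed what the server can
    complete, because the queue then grows without bound.
    """
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


# Hypothetical capacity of the contended resource, in requests/second.
SERVICE_RATE = 100.0

for utilization in (0.50, 0.90, 0.99, 1.01):
    seconds = mean_response_time(utilization * SERVICE_RATE, SERVICE_RATE)
    print(f"{utilization:.0%} of capacity -> {seconds * 1000:.1f} ms per request")
```

In this model most of the latency growth is packed into the last few percent before saturation, which is why average-load graphs can look healthy right up to the failure.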
If you've never experienced this, then I'm pretty confident that you've never scaled a database to its limit.
Which isn't that surprising. Most companies don't have sufficient load to make their database break a sweat. But once you've encountered the region where problems happen, life gets much more difficult.
I've been using SQL Server 5 hours a day for almost 15 years now and yes, I've taken databases to the limits of the hardware many times. We've upgraded our SQL Servers' hardware many times due to increasing load. We've never had a scaling problem that we could not immediately and easily resolve.
If your server falls apart, it's usually because of a single bad query, and with SQL Server it's super easy to determine which query is at fault and kill it, at which point everything is immediately back to normal with NO LOST DATA. You can even set limits on how many resources a query can consume if you want.
As for the 'no warning' thing, all you need to do is monitor your server. It should not run at 99% CPU or IO capacity at peak times! If you know the limits of your hardware it's really not difficult to monitor the actual usage and plan your upgrades accordingly.
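As a minimal sketch of that kind of capacity planning, the snippet below fits a straight line to weekly peak utilization and estimates how many weeks remain before the trend crosses an upgrade threshold. The 80% threshold, the sample numbers, and the function name are assumptions for illustration, and a linear trend only covers steady growth, not sudden contention.

```python
def weeks_until_threshold(weekly_peaks, threshold=0.80):
    """Estimate weeks from the last sample until peak utilization
    crosses `threshold`, using an ordinary least-squares line.

    Returns None when the trend is flat or declining.
    """
    n = len(weekly_peaks)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    mean_x = (n - 1) / 2
    mean_y = sum(weekly_peaks) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(weekly_peaks))

    slope = sxy / sxx
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    crossing_week = (threshold - intercept) / slope
    # Already past the threshold counts as zero weeks remaining.
    return max(0.0, crossing_week - (n - 1))


# Hypothetical peak CPU utilization over four consecutive weeks.
print(weeks_until_threshold([0.50, 0.55, 0.60, 0.65]))
```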
No matter how bad things get, you can rest assured that you'll end up with a consistent database once the dust settles. You can even do database restores up to an arbitrary point in time if you need to! We've fucked up in every way imaginable, but we've never lost any data, let alone an entire database. I have nothing but praise for SQL Server.
Where I first hit this was in Oracle, which unexpectedly locked up over lock contention at a million dynamic pages/hour served out of the database. Ever hit a lock contention issue?
That is the scaling problem that gives no warning. It is humming along fine under reasonable load, and then it suddenly falls over. Sure, you can identify the problem query if you have good monitoring, but you can't just kill one instance because there are another 100 coming along in the next second...
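A toy simulation makes the point about killing one instance. It assumes a fixed arrival rate and a fixed completion rate per second, which is a simplification with made-up numbers; the function and its parameters are hypothetical.

```python
def backlog_over_time(arrivals_per_sec, completions_per_sec, seconds, killed_at=None):
    """Track how many requests are waiting at the end of each second.

    If `killed_at` is given, one waiting request is killed during that
    second, mimicking a DBA killing a single instance of the query.
    """
    backlog = 0
    history = []
    for second in range(seconds):
        backlog += arrivals_per_sec
        backlog -= min(backlog, completions_per_sec)
        if killed_at == second and backlog > 0:
            backlog -= 1
        history.append(backlog)
    return history


# 100 arrivals/second against a resource that can only finish 90/second.
print(backlog_over_time(100, 90, 5))
print(backlog_over_time(100, 90, 5, killed_at=2))
# Slightly below capacity, the backlog never forms at all.
print(backlog_over_time(80, 90, 5))
```

Under these assumptions the kill removes a single waiting request while ten more pile up every second, so the backlog keeps growing until the arrival rate drops or the contended resource gets faster.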
Having dived into the guts of those failures, and having talked with expert DBAs for multiple databases, I can say that this failure mode is endemic to databases and nobody has figured out how to catch it with monitoring. (At least that was the state of the art not many years ago.)
Yes, there are ways to tune it and to scale it out horizontally. However you never know if you will need to.
If you've got a reporting database, as opposed to a transactional one, scaling is much more straightforward to predict and handle. From your reference to a single problem query, I strongly suspect that that is what you dealt with.