We use a microservice architecture at the company, and roughly every second incident is a cascade failure, which is actually harder to monitor and manage. Having microservices doesn't help.
Things like: someone forgot to batch the events when posting to the event bus, so a large tenant generating many events suddenly overflowed RabbitMQ. It ran out of memory and started blocking producers, which were holding DB connections, so we ran out of DB connections and the whole thing went down. The only difference is that while it's going down, it's harder to understand who's calling whom in all this mess (compared to a single beefy server).
That works great until it suddenly and unexpectedly stops working.
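
To make the "forgot to batch the events" part concrete, here's a rough sketch in Python of what batching before publishing could look like. Everything named here is an assumption for illustration: `publish` is a hypothetical stand-in for whatever actually posts to the event bus (not a real RabbitMQ client call), and the batch size of 500 is arbitrary.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(events: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` events, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(events)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def publish_in_batches(
    events: Iterable[dict],
    publish: Callable[[List[dict]], None],  # hypothetical: sends one bus message
    batch_size: int = 500,                  # arbitrary, tune for your broker
) -> int:
    """Send events as a few large messages instead of one message each.

    Returns the number of messages handed to `publish`.
    """
    sent = 0
    for chunk in batched(events, batch_size):
        publish(chunk)
        sent += 1
    return sent


# Usage: a large tenant produces 1200 events, the bus sees 3 messages.
outbox: List[List[dict]] = []
tenant_events = ({"tenant": "big", "seq": i} for i in range(1200))
messages = publish_in_batches(tenant_events, outbox.append)
```

This only covers the batching half of the story. The other half (producers blocked on the broker while still holding a DB connection) would need the publish call moved outside the DB transaction, or a timeout on it, and that depends on the client library in use.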