In one instance, monthly mdadm RAID checks caused P99 latency spikes on the first Sunday of every month (the default schedule). It caused a lot of pain for our customers until the check was IO-throttled, which made the spikes lower but longer-lasting.
Scheduled tasks are a great way to brown-out yourself.
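For reference, the throttle boils down to the kernel's md rate-limit knobs; the numbers below are illustrative, not our production values:

```shell
# Debian/Ubuntu kick off the scrub via /usr/share/mdadm/checkarray
# from cron on the first Sunday of the month. Capping the md
# check/resync rate keeps it from saturating the disks.
echo 50000 > /proc/sys/dev/raid/speed_limit_max   # KiB/s ceiling
echo 1000  > /proc/sys/dev/raid/speed_limit_min   # KiB/s floor
```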
In another case, the client process hadn't set a socket timeout on a blocking TCP connection (the default), so it would routinely run out of worker threads blocked on recv() when the server (fronted by a reverse proxy) started rejecting incoming connections due to overload. Only a restart of the client process would recover it.
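The fix is a one-liner; the trap is that most socket APIs (Python shown here) default to no timeout at all. A minimal sketch, timeout value made up:

```python
import socket

def make_client_socket(timeout_s: float = 5.0) -> socket.socket:
    """TCP socket that fails fast instead of pinning a worker thread forever."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Fresh sockets default to timeout=None, i.e. fully blocking --
    # exactly the setting that stranded our worker threads in recv().
    assert sock.gettimeout() is None
    sock.settimeout(timeout_s)  # connect()/recv() now raise socket.timeout
    return sock

sock = make_client_socket(5.0)
print(sock.gettimeout())  # 5.0
```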
Scheduled tasks are a great way to prove HAProxy will scale way better than your backend. Thanks u/wtarreau
Speaking of HAProxy: it fronted a thread-based server processing all sorts of heavy and light queries, with segregated thread pools for the various workloads. During peak, the proxy would accept connections and queue work faster than the worker threads could drain it, and on occasion the work queue would grow so big that it contained not only retries of the same work scheduled by desperate clients, but a laundry list of work that was no longer valid (already processed by some other backend on the retry path, or simply expired because time-to-service had exceeded the SLA). Yet there the backend was, in a quagmire, chugging through never-ending work, in constant overload, when ironically the client wasn't even waiting on the other end and had long since closed the connection. The health checks kept passing because, well, those ran on a separate thread pool with a different work queue. Smiles all around.
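One mitigation for that pile-up (not what we ran, just the shape of it): bound the queue and stamp every request with a deadline, then drop expired work at dequeue time instead of burning a thread on a client that already hung up.

```python
import queue
import time
from typing import Optional, Tuple

# Bounded work queue of (deadline, job) pairs -- never unbounded.
work_q: "queue.Queue[Tuple[float, str]]" = queue.Queue(maxsize=1000)

def submit(job: str, ttl_s: float = 2.0) -> bool:
    """Enqueue with a deadline; shed load immediately when the queue is full."""
    try:
        work_q.put_nowait((time.monotonic() + ttl_s, job))
        return True
    except queue.Full:
        return False  # reject now rather than time out later

def worker_step() -> Optional[str]:
    """Dequeue one item, skipping work whose client has given up."""
    deadline, job = work_q.get_nowait()
    if time.monotonic() > deadline:
        return None  # past SLA, client long gone -- don't process it
    return job

submit("light-query", ttl_s=10.0)
print(worker_step())  # light-query
```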
Least conns and the event horizon. Tread carefully.
Least-conns load balancing bit us hard on multiple occasions, and is now banned for reasons similar to those outlined here: https://rachelbythebay.com/w/2018/04/21/lb/
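The failure mode in that post is easy to reproduce in a toy model: a backend that fails instantly never holds connections open, so it always looks least loaded, and least-conns funnels the entire request stream into it.

```python
# Toy least-conns model: backend 0 fails instantly (holds a connection
# for 0 ticks); backends 1 and 2 take 10 ticks per request.
service_time = [0, 10, 10]
in_flight = [[], [], []]   # remaining ticks for each open connection
routed = [0, 0, 0]

for _ in range(1000):
    # finish/age out in-flight work
    for conns in in_flight:
        conns[:] = [t - 1 for t in conns if t > 1]
    # route one new request to the backend with the fewest open connections
    b = min(range(3), key=lambda i: len(in_flight[i]))
    routed[b] += 1
    if service_time[b] > 0:
        in_flight[b].append(service_time[b])

print(routed)  # [1000, 0, 0] -- the broken backend eats everything
```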
I've been trying to convince my division to prioritize adding events to our stats dashboard.
Comparing response times to CPU times is just the expected level of effort for interpreting the graphs. But you don't have any visibility into how a cron job, service restart, rollback, or host reboot caused these knock-on effects. And without that data you get these little adrenaline jolts at regular intervals when someone reports a false positive. Especially in pre-prod, where corners get cut on hardware budgets, and deploying a low-churn service may mean it's down for the duration.
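Concretely, all I'm asking for is one timestamped annotation per cron run, deploy, restart, or reboot, landing somewhere the dashboard can overlay on the graphs. A sketch of the shape of it (file name and fields are made up):

```python
import json
import time

def record_event(kind: str, host: str, detail: str = "",
                 path: str = "events.ndjson") -> dict:
    """Append one annotation event for the dashboard to overlay on graphs."""
    event = {
        "ts": time.time(),  # same clock the metrics use
        "kind": kind,       # e.g. "cron", "deploy", "restart", "reboot"
        "host": host,
        "detail": detail,
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return event

record_event("restart", "app-03", "post-rollback service restart")
```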
This might end up being a thing I have to do myself, since everyone else just nods and says that'd be nice.