We’re hitting a classic distributed systems wall and I’m looking for war stories or "least worst" practices.
The Context: We maintain a distributed stateful engine (think search/analytics). The architecture is standard: a Control Plane (Coordinator) assigns data segments to Worker Nodes. The workload involves heavy use of mmap and lazy loading for large datasets.
The Incident: We had a cascading failure where the Coordinator got stuck in a loop, DDoS-ing a specific node.
The Signal: Coordinator sees Node A has significantly fewer rows (logical count) than the cluster average. It flags Node A as "underutilized."
The Action: Coordinator attempts to rebalance/load new segments onto Node A.
The Reality: Node A is actually sitting at 197 GB of RAM usage (near OOM). The data on it happens to be extremely wide (fat rows, huge blobs), so its logical row count is low but its physical footprint is massive.
The Loop: Node A rejects the load (or times out). The Coordinator ignores the backpressure, sees the low row count again, and retries immediately.
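In pseudo-Go, the feedback loop boils down to something like this (names and numbers are made up; the point is that the rejection never feeds back into the placement score):

    package main

    import (
        "errors"
        "fmt"
    )

    // Hypothetical, boiled-down version of the incident: the placement score
    // only sees logical row count, so a node full of fat rows looks permanently
    // "underutilized" and gets hammered.

    type Node struct {
        ID       string
        RowCount int64 // logical rows -- the only signal the coordinator uses
    }

    // pickTarget picks the node with the fewest rows, on the (wrong) theory
    // that fewest rows == most headroom.
    func pickTarget(nodes []Node) *Node {
        best := &nodes[0]
        for i := 1; i < len(nodes); i++ {
            if nodes[i].RowCount < best.RowCount {
                best = &nodes[i]
            }
        }
        return best
    }

    // loadSegment stands in for the real RPC; node-a always rejects because it
    // is near OOM, which the placement logic never learns about.
    func loadSegment(n *Node, segment string) error {
        if n.ID == "node-a" {
            return errors.New("429: memory pressure")
        }
        n.RowCount += 1_000_000
        return nil
    }

    func main() {
        nodes := []Node{{"node-a", 10_000}, {"node-b", 900_000}, {"node-c", 950_000}}
        for attempt := 0; attempt < 5; attempt++ {
            target := pickTarget(nodes)
            if err := loadSegment(target, "segment-42"); err != nil {
                // The rejection is swallowed: RowCount hasn't changed, so the
                // next iteration picks node-a again, immediately.
                fmt.Printf("attempt %d: %s rejected (%v), retrying\n", attempt, target.ID, err)
                continue
            }
            fmt.Printf("attempt %d: placed on %s\n", attempt, target.ID)
            return
        }
    }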
The Core Problem: We are trying to write a "God Equation" for our load balancer. We started with row_count, which failed. We looked at disk usage, but that doesn't correlate with RAM because of lazy loading.
Now we are staring at mmap. Because the OS manages the page cache, the application-level RSS is noisy and doesn't cleanly separate "required" memory from "reclaimable" cache.
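One heuristic that seems less bad than raw RSS here is charged memory minus inactive file cache from the cgroup (roughly what kubelet reports as a container's "working set"). Very rough sketch, assuming cgroup v2 and that the worker runs in its own cgroup; the path below is a placeholder:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strconv"
        "strings"
    )

    // workerCgroup is a placeholder -- point it at the worker's cgroup-v2 dir.
    const workerCgroup = "/sys/fs/cgroup/system.slice/worker.service"

    func readUint(path string) uint64 {
        b, err := os.ReadFile(path)
        if err != nil {
            return 0
        }
        v, _ := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
        return v
    }

    // memStatValue pulls a single counter (e.g. inactive_file) out of memory.stat.
    func memStatValue(key string) uint64 {
        f, err := os.Open(workerCgroup + "/memory.stat")
        if err != nil {
            return 0
        }
        defer f.Close()
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            fields := strings.Fields(sc.Text())
            if len(fields) == 2 && fields[0] == key {
                v, _ := strconv.ParseUint(fields[1], 10, 64)
                return v
            }
        }
        return 0
    }

    func main() {
        charged := readUint(workerCgroup + "/memory.current") // includes page cache
        inactiveFile := memStatValue("inactive_file")         // cache the kernel can drop cheaply
        workingSet := uint64(0)
        if charged > inactiveFile {
            workingSet = charged - inactiveFile
        }

        // Raw RSS of an mmap-heavy process both overstates pressure (hot but
        // reclaimable cache) and understates it (pages not yet faulted in).
        // Charged-minus-reclaimable is still imperfect, but it tracks "will the
        // next segment OOM us" much more closely.
        fmt.Printf("charged=%d MiB reclaimable=%d MiB working_set=%d MiB\n",
            charged>>20, inactiveFile>>20, workingSet>>20)
    }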
The Question: Trying to fold every resource variable (CPU, IOPS, RSS, disk, logical count) into a single scoring function feels like an NP-hard trap.
How do you handle placement in systems where memory usage is opaque/dynamic?
Dumb Coordinator, Smart Nodes: Should we just let the Coordinator blind-fire based on disk space, and rely 100% on the Node to return a hard 429 Too Many Requests based on local pressure? (Rough sketch of what I mean below, after the three options.)
Cost Estimation: Do we try to build a synthetic "cost model" per segment (e.g., predicted memory footprint) and schedule based on credits, ignoring actual OS metrics?
Control Plane Decoupling: Separate storage balancing (disk) from query balancing (mem)?
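On option 1, the node-side half is not much code; the hard part is the coordinator actually treating a 429 as "cool this node down" rather than a transient error. Hypothetical sketch (the pressure signal and threshold are stand-ins for whatever the node samples locally, e.g. the working-set number above or PSI):

    package main

    import (
        "net/http"
        "sync/atomic"
    )

    // workingSetBytes is updated elsewhere by a local sampler (cgroup stats,
    // PSI, etc.) -- the node is the only party with an honest view of this.
    var workingSetBytes atomic.Int64

    // memHighWater is a hypothetical admission threshold, e.g. ~90% of RAM.
    const memHighWater int64 = 180 << 30

    func handleLoadSegment(w http.ResponseWriter, r *http.Request) {
        if workingSetBytes.Load() > memHighWater {
            // Hard local backpressure: refuse, and tell the coordinator how
            // long to leave us alone.
            w.Header().Set("Retry-After", "30")
            http.Error(w, "memory pressure, refusing segment", http.StatusTooManyRequests)
            return
        }
        // ... actually mmap/register the segment here ...
        w.WriteHeader(http.StatusOK)
    }

    func main() {
        http.HandleFunc("/load-segment", handleLoadSegment)
        _ = http.ListenAndServe(":8080", nil)
    }

The coordinator's side of the contract matters just as much: a 429 should put the node into a cooldown set for at least Retry-After seconds, so the "sees low row count, retries immediately" loop can't happen.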
Feels like we are reinventing the wheel. References to papers or similar architecture post-mortems appreciated.
If your swap use jumps 10 points in a small time frame, you are running out of memory quickly. If it creeps up to 50% or 80% or [whatever threshold] without any big jumps, you're running out of memory slowly.
If your swap I/O is all output, it's not a huge deal... you're swapping out stuff you never read back. If you've got a lot of swapping in, chances are you're swapping to death.
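To make that concrete, the distinction is the pswpin vs pswpout rates in /proc/vmstat; something like this (thresholds made up):

    package main

    import (
        "fmt"
        "os"
        "strconv"
        "strings"
        "time"
    )

    // vmstat reads a single counter (e.g. pswpin, pswpout) from /proc/vmstat.
    func vmstat(key string) uint64 {
        data, err := os.ReadFile("/proc/vmstat")
        if err != nil {
            return 0
        }
        for _, line := range strings.Split(string(data), "\n") {
            fields := strings.Fields(line)
            if len(fields) == 2 && fields[0] == key {
                v, _ := strconv.ParseUint(fields[1], 10, 64)
                return v
            }
        }
        return 0
    }

    func main() {
        in0, out0 := vmstat("pswpin"), vmstat("pswpout")
        time.Sleep(10 * time.Second)
        in1, out1 := vmstat("pswpin"), vmstat("pswpout")

        swapInRate := (in1 - in0) / 10   // pages/sec swapped back in
        swapOutRate := (out1 - out0) / 10 // pages/sec swapped out

        switch {
        case swapInRate > 100:
            fmt.Println("heavy swap-in: working set no longer fits, likely thrashing")
        case swapOutRate > 0:
            fmt.Println("swap-out only: cold pages being evicted, probably fine")
        default:
            fmt.Println("no swap activity")
        }
    }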
> The Core Problem: We are trying to write a "God Equation" for our load balancer. We started with row_count, which failed. We looked at disk usage, but that doesn't correlate with RAM because of lazy loading.
I'm a big fan of straight-up even distribution of requests. It's simple and predictable; it won't get you the best throughput, but predictability and simplicity are often better than perfection. If you always send each node 1/Nth of the requests, the worst case with a node that is broken but looks up is that it still gets its share when it should get nothing; with a utilization-based metric, a node that looks underutilized because it's dropping requests (or responding with success but empty) sucks up all your traffic. Alternatively, people have good results with selecting M nodes by metrics and then choosing randomly among those (sketched below). But also, IMHO, you want to reduce the work your load balancer(s) do, because load balancing load balancers is hard.
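The "select M by metric, then random among those" bit looks roughly like this (toy example; the score function and names are made up):

    package main

    import (
        "fmt"
        "math/rand"
        "sort"
    )

    // The metric only has to be good enough to exclude clearly-bad nodes; the
    // random choice among the M best keeps a single flattering-but-wrong metric
    // from funneling all traffic onto one node.

    type Node struct {
        ID    string
        Score float64 // lower is better; e.g. blend of memory pressure + queue depth
    }

    func pick(nodes []Node, m int) Node {
        sorted := append([]Node(nil), nodes...)
        sort.Slice(sorted, func(i, j int) bool { return sorted[i].Score < sorted[j].Score })
        if m > len(sorted) {
            m = len(sorted)
        }
        // Uniform choice among the M best bounds how badly a stale or lying
        // metric can concentrate load.
        return sorted[rand.Intn(m)]
    }

    func main() {
        nodes := []Node{{"a", 0.1}, {"b", 0.4}, {"c", 0.2}, {"d", 0.9}}
        for i := 0; i < 5; i++ {
            fmt.Println("route to:", pick(nodes, 3).ID)
        }
    }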