I often think about huge, fancy cloud setups literally costing silly money to run, being replaced by a single beast of a machine powered by a modern, high core count CPU (say 48+), lots of RAM and lots of high performance enterprise-grade SSD storage.
(Oftentimes part of the reason the huge, fancy cloud setup costs more is that any component can fail, all the way down to the region-level, without loss of service.)
The difference in throughput for local versus distributed orchestration would mainly come from serdes, networking, switching. Serdes can be substantial. Networking and switching has been aggressively offloaded from CPU through better hardware support.
Individual tasks would definitely have better latency, but I'd suspect the impact on throughput/CPU usage might be muted. Of course at the extremes (very small jobs, very large/complex objects being passed) you'd see big gains.
By way of a single example, we've been migrating recently from spark to duckdb. Our jobs are not huge, but too big for a single 'normal' machine. We've gone from a 2.5 hour runtime on a cluster of 10 machines (40,vCPU total) to a 15 minute runtime on a 32vCPU single machine. I don't know for sure, but I think this is largely because it eliminates expensive shuffles and serde. Obviously results vary hugely depending on workload, and some jobs are simply too big even for a 192 core machine. But I suspect a high proportion of workloads would be better run on single large machines nowadays
A cluster of 10 machines with 40 vCPUs in total would equate to 4 vCPUs per machine. I am not familiar with Spark internals but in the realm of distributed databases such a setup would generally make no sense at all (to me). So I think you're correct that most of the overhead was caused by machine-to-machine byte juggling. 4 vCPUs is nothing.
I suspect you would be able to cut down the 2.5hr runtime dramatically even with the Spark if you just deployed it as a single instance on that very same 32vCPU machine.
Your measuring wall time, not CPU time. It may be that they are similar, but I'd suspect you aren't loading the worker nodes well. If the savings are from the reduced shuffles & serde, it's probably something you can measure. I'd be curious to see the findings.
I'm not against using simple methods where appropriate. 95% of the companies out there probably do not need frameworks like spark. I think the main argument against them is operational complexity though, not the compute overhead.
When you talk between remote machines, you have to translate to a format that can transmitted and distributed between machines(serialization). You then have to undo at the other end(deserialization). If what you are sending along is just a few floats, that can be very cheap. If you're sending along a large nested dictionary or even a full program, not so much.
Imagine an example where you have two arrays of 1 billion numbers, and you want to add them pairwise. You could use spark to do that by having each "task" be a single addition. But the time it would take to structure and transmit the 1 billion requests will be many multiples of the amount of time it would take to just do the additions.