I mean "when you've proven that the application you've written can fully, or near fully utilise the available power on a single machine, and that when running production-grade workloads, actually does so, then you may scale to additional machines.
What this means is not getting a 9-node spark cluster to push a few hundred gb of data from S3 to a database because "it took too long to run in python" because it's a single threaded, non-async, non-performance tuned.
How is that any different? You just backed off a tiny amount by saying “fully or near fully” - you still shouldn’t burden a single host to “fully or near Fully” because:
It puts more strain on the hardware and will cause it to fail a LOT faster
There’s no redundancy so when the system fails you’ll probably need hours or maybe days to replace physical hardware, restore from backup, verify restore integrity, and resume operations - which after all this work, will only put you in the same position again, waiting for the next failure
Single node systems make it difficult to canary deploy because a runaway bug can blow a node out - and you only have one.
Workload patterns are rarely a linear steam of homogenous tiny events - a large memory allocation from a big query, or an unanticipated table scan, or any 5th percentile type difficult task can cause so much system contention on a single node that your operations effectively stop
What about edge cases in kernels and network drivers - many times we have had frozen kernel modules, zombie processes, deadlocks and do on, again, with only one node something as trivial as a reboot means halting operations.
There’s just so many reasons a single node is a bad idea, I’m having trouble listing them
You're missing the word "can". It's a very important part of that sentence.
If your software can't even use 80% of one node, it has scaling problems that you need to address ASAP, and probably before throwing more nodes at it.
> It puts more strain on the hardware and will cause it to fail a LOT faster
Unless you're hammering an SSD, how does that happen? CPU and RAM should be at a pretty stable amount of watts anywhere from 'moderate' load and up, which doesn't strain anything or overheat.
I’m not against redundancy/HA in production systems, I’m opposing clusters of machines to perform data workloads that could more efficiently handled by single machines. Also note here that I’m talking about data science and machine learning workloads, where node failure simply means the job isn’t marked as done, a replacement machine gets provisioned and we resume/restart.
I’m not suggesting running your web servers and main databases on a single machine.