I’m no fan of distributed systems that sit idle at 2% utilisation when a single node would do. BUT reducing it to “cost” and “does it fit in RAM” is far too reductive.
These days I’m firmly of the opinion that if you can make it run on a single machine, you absolutely should, and you don’t get more machines until you can prove you can fully utilise one machine.
When a single box cannot handle your load, you start one more, and then you start worrying about scalability.
(Of course, YMMV, I guess there are cases where "one box" -> "two boxes" requires an enormous quantum jump, like a transactional DB. For everything else, there's a box.)
Do you know how quickly disks fail if you force them at 100% utilisation, 24/7?
Then what happens when this system dies? How much downtime do you have while you replace the hardware and then get your hundreds-of-gigabytes dataset back in RAM and hot again?
I’ve worked as a lead developer at companies where I’ve been personally responsible for hundreds of thousands of machines, and running a node to 100% and THEN thinking about scaling is short sighted and stupid
I mean "when you've proven that the application you've written can fully, or near fully, utilise the available power on a single machine, and that when running production-grade workloads it actually does so, then you may scale to additional machines."
What this means is not standing up a 9-node Spark cluster to push a few hundred GB of data from S3 to a database because "it took too long to run in Python", when the original script was single-threaded, non-async, and not performance-tuned.
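To make that concrete, here is a minimal sketch of the single-machine alternative: overlapping I/O-bound chunk fetches and database loads with a thread pool instead of leaving everything on one thread. The `fetch_chunk` and `load_into_db` functions are hypothetical stand-ins (a real version would call boto3 and a DB driver); only the parallelisation pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for downloading one S3 object's rows.
def fetch_chunk(key):
    return [f"{key}:row-{i}" for i in range(10)]

# Hypothetical stand-in for bulk-inserting rows into a database.
def load_into_db(rows):
    return len(rows)  # pretend we loaded them; report the count

def transfer(keys, workers=16):
    """Fetch and load chunks concurrently so one machine's
    bandwidth and cores are actually used, rather than idling
    behind a single-threaded loop."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(fetch_chunk, keys)
        loaded = pool.map(load_into_db, chunks)
        return sum(loaded)

total = transfer([f"part-{i}" for i in range(100)])
```

Because the work is network-bound, threads (not processes) are usually enough here, and a pool of 16 workers can saturate a single box's bandwidth long before a cluster is justified.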
How is that any different? You just backed off a tiny amount by saying “fully or near fully” - you still shouldn’t burden a single host to “fully or near fully” because:
It puts more strain on the hardware and will cause it to fail a LOT faster
There’s no redundancy, so when the system fails you’ll probably need hours or maybe days to replace physical hardware, restore from backup, verify restore integrity, and resume operations - which, after all that work, only puts you back in the same position, waiting for the next failure
Single node systems make it difficult to canary deploy because a runaway bug can blow a node out - and you only have one.
Workload patterns are rarely a linear stream of homogeneous tiny events - a large memory allocation from a big query, an unanticipated table scan, or any rare tail-of-the-distribution task can cause so much contention on a single node that your operations effectively stop
What about edge cases in kernels and network drivers? Many times we have had frozen kernel modules, zombie processes, deadlocks and so on - again, with only one node, something as trivial as a reboot means halting operations.
There are just so many reasons a single node is a bad idea that I’m having trouble listing them all.
You're missing the word "can". It's a very important part of that sentence.
If your software can't even use 80% of one node, it has scaling problems that you need to address ASAP, and probably before throwing more nodes at it.
> It puts more strain on the hardware and will cause it to fail a LOT faster
Unless you're hammering an SSD, how does that happen? CPU and RAM should draw a fairly stable number of watts anywhere from 'moderate' load and up, which doesn't strain anything or overheat.
I’m not against redundancy/HA in production systems; I’m opposing clusters of machines to perform data workloads that could be more efficiently handled by single machines. Also note here that I’m talking about data science and machine learning workloads, where node failure simply means the job isn’t marked as done, a replacement machine gets provisioned, and we resume/restart.
I’m not suggesting running your web servers and main databases on a single machine.