I don't mean "let your production systems spool up to point where you're maxing ...

malux85 · on Feb 13, 2020

And what about redundancy in case of node failure?

malux85 · on Feb 13, 2020

> I mean "when you've proven that the application you've written can fully, or near fully utilise the available power on a single machine, and that when running production-grade workloads, actually does so, then you may scale to additional machines"

How is that any different? You just backed off a tiny amount by saying “fully or near fully” - you still shouldn’t burden a single host to “fully or near fully” because:

It puts more strain on the hardware and will cause it to fail a LOT faster

There’s no redundancy so when the system fails you’ll probably need hours or maybe days to replace physical hardware, restore from backup, verify restore integrity, and resume operations - which after all this work, will only put you in the same position again, waiting for the next failure

Single node systems make it difficult to canary deploy because a runaway bug can blow a node out - and you only have one.

Workload patterns are rarely a linear stream of homogeneous tiny events - a large memory allocation from a big query, or an unanticipated table scan, or any 5th percentile type difficult task can cause so much system contention on a single node that your operations effectively stop

What about edge cases in kernels and network drivers? Many times we have had frozen kernel modules, zombie processes, deadlocks and so on. Again, with only one node, something as trivial as a reboot means halting operations.

There’s just so many reasons a single node is a bad idea, I’m having trouble listing them

Dylan16807 · on Feb 13, 2020

> How is that any different? You just backed off a tiny amount by saying “fully or near fully” - you still shouldn’t burden a single host to “fully or near fully”

You're missing the word "can". It's a very important part of that sentence.

If your software can't even use 80% of one node, it has scaling problems that you need to address ASAP, and probably before throwing more nodes at it.

> It puts more strain on the hardware and will cause it to fail a LOT faster

Unless you're hammering an SSD, how does that happen? CPU and RAM should be at a pretty stable amount of watts anywhere from 'moderate' load and up, which doesn't strain anything or overheat.

> redundancy

Sure.

FridgeSeal · on Feb 13, 2020

Fail over?

I’m not against redundancy/HA in production systems, I’m opposing clusters of machines to perform data workloads that could be more efficiently handled by single machines. Also note here that I’m talking about data science and machine learning workloads, where node failure simply means the job isn’t marked as done, a replacement machine gets provisioned and we resume/restart.

I’m not suggesting running your web servers and main databases on a single machine.
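To make the resume/restart point concrete, here is a rough sketch of what "the job isn't marked as done, so a replacement machine picks it up" could look like for a batch job. Everything in it is an assumption for illustration: `run_job`, the JSON checkpoint file, and the idea that the work is an ordered list of items are hypothetical, not a description of anyone's actual pipeline. It also assumes the checkpoint lives on storage that outlives the failed machine.

```python
import json
import os


def run_job(items, checkpoint_path, process):
    """Process items in order, recording progress after each one.

    Hypothetical helper: if the machine dies mid-run, a replacement can call
    this again with the same checkpoint_path and continue where it stopped.
    Returns the number of items handled in this invocation.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        # A previous run got part of the way through; resume from there.
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]

    for i in range(done, len(items)):
        process(items[i])
        # Write to a temp file and rename, so a crash mid-write cannot
        # leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"done": i + 1}, f)
        os.replace(tmp_path, checkpoint_path)

    return len(items) - done
```

Note that an item in flight when the machine fails gets processed again on resume, so `process` needs to be safe to repeat. That is usually fine for training or batch scoring, and it is exactly the property that web servers and primary databases don't have.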