
I don't mean "let your production systems spool up to the point where you're maxing out a single machine" - that would be exceedingly silly.

I mean "when you've proven that the application you've written can fully, or near fully, utilise the available power on a single machine, and that when running production-grade workloads it actually does so, then you may scale to additional machines."

What this means is not getting a 9-node Spark cluster to push a few hundred GB of data from S3 to a database because "it took too long to run in Python", when the Python job was single-threaded, non-async, and not performance-tuned.
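
To make that concrete, here's a minimal sketch of the kind of single-machine tuning I mean: an I/O-bound S3-to-database copy driven by a thread pool. It assumes boto3 is installed and credentials are configured; the bucket, prefix, and load_into_db() are hypothetical placeholders, not any real pipeline.

    # Minimal sketch: parallel S3-to-database copy on one machine.
    # Assumes boto3 + credentials; the bucket, prefix, and
    # load_into_db() are hypothetical placeholders.
    from concurrent.futures import ThreadPoolExecutor
    import boto3

    BUCKET = "example-bucket"    # hypothetical
    PREFIX = "exports/2024/"     # hypothetical

    s3 = boto3.client("s3")

    def list_keys(bucket, prefix):
        """Yield every object key under the prefix."""
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]

    def load_into_db(key, body):
        """Placeholder: swap in your database's bulk-load path."""
        ...

    def copy_one(key):
        """Download one object and hand its bytes to the loader."""
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        load_into_db(key, body)
        return key

    # Downloads are I/O-bound, so a thread pool keeps the NIC busy
    # without needing more than one machine.
    with ThreadPoolExecutor(max_workers=32) as pool:
        for done in pool.map(copy_one, list_keys(BUCKET, PREFIX)):
            print("loaded", done)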




And what about redundancy in case of node failure?


> I mean "when you've proven that the application you've written can fully, or near fully, utilise the available power on a single machine, and that when running production-grade workloads it actually does so, then you may scale to additional machines."

How is that any different? You just backed off a tiny amount by saying “fully or near fully” - you still shouldn’t burden a single host to “fully or near fully” because:

It puts more strain on the hardware and will cause it to fail a LOT faster

There’s no redundancy, so when the system fails you’ll probably need hours or maybe days to replace physical hardware, restore from backup, verify restore integrity, and resume operations - which, after all this work, will only put you in the same position again, waiting for the next failure.

Single-node systems make it difficult to canary deploy, because a runaway bug can blow out a node - and you only have one.

Workload patterns are rarely a linear stream of homogeneous tiny events - a large memory allocation from a big query, an unanticipated table scan, or any 5th-percentile-type difficult task can cause so much contention on a single node that your operations effectively stop.

What about edge cases in kernels and network drivers? Many times we have had frozen kernel modules, zombie processes, deadlocks, and so on. Again, with only one node, something as trivial as a reboot means halting operations.

There are just so many reasons a single node is a bad idea that I’m having trouble listing them all.


> How is that any different? You just backed off a tiny amount by saying “fully or near fully” - you still shouldn’t burden a single host to “fully or near fully”

You're missing the word "can". It's a very important part of that sentence.

If your software can't even use 80% of one node, it has scaling problems that you need to address ASAP, and probably before throwing more nodes at it.
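
A rough way to check is to sample per-core utilisation while the job runs; one busy core out of many is the smell I mean. A minimal sketch, assuming psutil is installed:

    # Minimal sketch: sample per-core CPU utilisation while a job runs,
    # to see whether it actually uses the machine. Assumes psutil.
    import psutil

    for _ in range(10):  # ten one-second samples
        per_core = psutil.cpu_percent(interval=1, percpu=True)
        busy = sum(1 for p in per_core if p > 80)
        print(f"cores over 80%: {busy}/{len(per_core)}  detail: {per_core}")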

> It puts more strain on the hardware and will cause it to fail a LOT faster

Unless you're hammering an SSD, how does that happen? CPU and RAM draw a pretty stable number of watts anywhere from 'moderate' load and up, which doesn't strain or overheat anything.

> redundancy

Sure.


Failover?

I’m not against redundancy/HA in production systems; I’m opposed to clusters of machines performing data workloads that could be handled more efficiently by a single machine. Also note that I’m talking about data science and machine learning workloads, where node failure simply means the job isn’t marked as done, a replacement machine gets provisioned, and we resume/restart.
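
For these workloads, "resume/restart" can be as simple as keeping a manifest of finished partitions so a replacement node skips what's already done. A minimal sketch; the partition list, manifest path, and process_partition() are hypothetical placeholders:

    # Minimal sketch: resumable batch job. Finished partitions go into a
    # manifest so a restarted or replacement node skips them. Partition
    # names, manifest path, and process_partition() are placeholders.
    import json
    from pathlib import Path

    MANIFEST = Path("done_partitions.json")              # hypothetical
    PARTITIONS = [f"part-{i:04d}" for i in range(100)]   # hypothetical

    def load_done():
        if MANIFEST.exists():
            return set(json.loads(MANIFEST.read_text()))
        return set()

    def mark_done(done):
        MANIFEST.write_text(json.dumps(sorted(done)))

    def process_partition(name):
        """Placeholder for the real work (read, transform, load)."""
        ...

    done = load_done()
    for part in PARTITIONS:
        if part in done:
            continue            # finished before the node died
        process_partition(part)
        done.add(part)
        mark_done(done)         # checkpoint after every partition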

I’m not suggesting running your web servers and main databases on a single machine.



