Hacker News

Yeah, except Hadoop provides redundancy, easier backups, turnkey solutions for governance and compliance and easier scaling

I’m no fan of distributed systems that sit idle at 2% utilisation when a single node would do. BUT, reducing it down to “cost” and “does it fit in RAM” is way too reductive.




And reducing it to a single machine means I have an order of magnitude less time spent on setting it up, maintaining it, adapting all my code, debugging, etc.

These days I’m firmly of the opinion that if you can make it run on a single machine, you absolutely should, and you don’t get more machines until you can prove you can fully utilise one machine.


How do you account for the parent's concerns about redundancy, backups, and scalability?


If you store data in something like AWS S3, "redundancy" and "backup" are handled by AWS. And scalability is a moot point if a single box can handle your load.

When a single box cannot handle your load, you start one more, and then you start worrying about scalability.

(Of course, YMMV, I guess there are cases where "one box" -> "two boxes" requires an enormous quantum jump, like transactional DB. For everything else, there's a box.)


What about services where background load causes unacceptable latencies due to other bottlenecks on that single machine? What if you are IO bound and you couldn't possibly use all of the CPU or memory on any system which was given to you?


Waiting for a machine to be “fully utilised” before scaling just shows your lack of experience at systems engineering.

Do you know how quickly disks fail if you force them at 100% utilisation, 24/7?

Then what happens when this system dies? How much downtime do you have because you have to replace the hardware, then get your hundreds-of-gigabytes dataset back in RAM and hot again?

I’ve worked as a lead developer at companies where I’ve been personally responsible for hundreds of thousands of machines, and running a node to 100% and THEN thinking about scaling is short-sighted and stupid.


I don't mean "let your production systems spool up to point where you're maxing out a single machine" - that would be exceedingly silly.

I mean "when you've proven that the application you've written can fully, or near fully, utilise the available power on a single machine, and that when running production-grade workloads it actually does so, then you may scale to additional machines."

What this means is not standing up a 9-node Spark cluster to push a few hundred GB of data from S3 to a database because "it took too long to run in Python", when the real problem was that the Python code was single-threaded, non-async, and not performance-tuned.
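To make that concrete, the single-machine fix for that kind of job is usually just parallelising the I/O. A minimal sketch with the stdlib, assuming an I/O-bound transfer; `fetch` and the key list are hypothetical stand-ins for the real S3 GET + database insert:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(key):
    # Hypothetical stand-in for an S3 GET plus a DB insert;
    # real code would call boto3 and a database driver here.
    return len(key)

keys = [f"part-{i:05d}.parquet" for i in range(1000)]

# I/O-bound work: threads let many requests overlap in flight,
# instead of waiting on them one at a time in a serial loop.
with ThreadPoolExecutor(max_workers=32) as pool:
    sizes = list(pool.map(fetch, keys))

print(sum(sizes))
```

Same shape of job, no cluster: the node's network link becomes the bottleneck instead of a single Python thread.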


And what about redundancy in case of node failure?


> I mean "when you've proven that the application you've written can fully, or near fully utilise the available power on a single machine, and that when running production-grade workloads, actually does so, then you may scale to additional machines

How is that any different? You just backed off a tiny amount by saying “fully or near fully” - you still shouldn’t burden a single host to “fully or near fully” because:

It puts more strain on the hardware and will cause it to fail a LOT faster

There’s no redundancy so when the system fails you’ll probably need hours or maybe days to replace physical hardware, restore from backup, verify restore integrity, and resume operations - which after all this work, will only put you in the same position again, waiting for the next failure

Single node systems make it difficult to canary deploy because a runaway bug can blow a node out - and you only have one.

Workload patterns are rarely a linear stream of homogeneous tiny events - a large memory allocation from a big query, or an unanticipated table scan, or any 5th-percentile difficult task can cause so much system contention on a single node that your operations effectively stop.

What about edge cases in kernels and network drivers? Many times we have had frozen kernel modules, zombie processes, deadlocks and so on - again, with only one node, something as trivial as a reboot means halting operations.

There are just so many reasons a single node is a bad idea, I’m having trouble listing them all.


> How is that any different? You just backed off a tiny amount by saying “fully or near fully” - you still shouldn’t burden a single host to “fully or near fully”

You're missing the word "can". It's a very important part of that sentence.

If your software can't even use 80% of one node, it has scaling problems that you need to address ASAP, and probably before throwing more nodes at it.

> It puts more strain on the hardware and will cause it to fail a LOT faster

Unless you're hammering an SSD, how does that happen? CPU and RAM should be at a pretty stable amount of watts anywhere from 'moderate' load and up, which doesn't strain anything or overheat.

> redundancy

Sure.


Fail over?

I’m not against redundancy/HA in production systems; I’m opposing clusters of machines being used for data workloads that could be more efficiently handled by single machines. Also note that I’m talking about data science and machine learning workloads here, where node failure simply means the job isn’t marked as done, a replacement machine gets provisioned, and we resume/restart.

I’m not suggesting running your web servers and main databases on a single machine.



