
> you cannot expect all of them to SSH into a single machine

Why not?

I can do exactly this (KDB warning). Building one or two very beefy machines is 1000x faster, and a lot cheaper than a Hadoop setup.

> The author of the blogpost/article, completely … The author completely … [Here's my opinions about what Hadoop really is]

This is a very real data volume with two realistic solutions, and thinking that every problem is a nail because you're so invested in your hammer is one of the things that causes people to wait 26 minutes for an answer instead of expecting it in 12 seconds.
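
To make that concrete, the single-machine version of one of these jobs is usually just "read the local files in parallel and aggregate", no cluster involved. A rough Python sketch (the file glob and the filter predicate are made up for illustration):

    import glob
    from concurrent.futures import ProcessPoolExecutor

    def count_matches(path):
        # Count lines in one file matching some hypothetical filter.
        n = 0
        with open(path, "rb") as f:
            for line in f:
                if b"ERROR" in line:   # made-up predicate
                    n += 1
        return n

    if __name__ == "__main__":
        files = glob.glob("data/*.log")           # made-up location
        with ProcessPoolExecutor() as pool:       # one worker per core by default
            total = sum(pool.map(count_matches, files))
        print(total)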

And it gets worse: terabyte-RAM machines are accessible to the kinds of companies that have petabytes of data and the business case for it, and managing the infrastructure and network bandwidth is a bigger time sink than you think (or are willing to admit).

In case you think I see some value in Hadoop and am just splitting hairs, let me be clear: I think Hadoop is a cancerous tumour that has led many smart people to do very, very stupid things. It's slow, it's expensive, it's difficult to use, and investing in tooling for it is just throwing good money after bad.




> Building one or two very beefy machines is 1000x faster, and a lot cheaper than a Hadoop setup.

You have a few petabytes of data and your working set is 50 TB. You put it on two machines. All your data is now on these SGI UV 3000s or whatever. You now need a bunch of experts, because any machine failure is a critical data-loss situation and a throughput-cliff situation. You've taken a fairly low-stakes event (a disk failure, say) and transformed it into the Mother of All Failures.

And you've bet that next year your working set won't exceed the max for the particular machine type you've settled on. What will you do when it does? Sell this one and buy the next bigger machine? Hook them both up and run things in a distributed fashion?

And then there's the logistics of it all. You're going to need tooling to submit jobs on this machine, and it has to offer configurable process limits, job queues, easy scheduling and rescheduling, and so on.
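
That tooling can be pretty thin, though. Here's a toy sketch of a job queue with a configurable process limit; in practice you'd grab an existing scheduler like Slurm instead of rolling your own, and the job commands here are made up:

    import multiprocessing as mp
    import subprocess

    MAX_WORKERS = 16                     # the "configurable process limit"

    def worker(job_queue):
        while True:
            cmd = job_queue.get()
            if cmd is None:              # sentinel: queue drained
                return
            subprocess.run(cmd, shell=True)

    if __name__ == "__main__":
        jobs = mp.Queue()
        workers = [mp.Process(target=worker, args=(jobs,)) for _ in range(MAX_WORKERS)]
        for w in workers:
            w.start()
        for cmd in ["./etl.sh part1", "./etl.sh part2"]:   # made-up job scripts
            jobs.put(cmd)
        for _ in workers:                # one sentinel per worker
            jobs.put(None)
        for w in workers:
            w.join()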

I mean, I'm firmly in the camp that lots of problems are better solved on the giant 64 TB, 256 core machines, but you're selling an idea that has a lot of drawbacks.


And people with 64TB, 256 core machines don't have RAID arrays attached to their machine for this exact reason?

If it's "machines" plural, then you can do replication between the two. There's your failover in case of complete failure.


> If it's "machines" plural, then you can do replication between the two.

This is the start of a scaling path that winds down Distributed Systems Avenue, and eventually leads to a place called Hadoop.

(Replication and consensus are remarkably difficult problems that Hadoop solves).
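
To illustrate, here's roughly what naive "replication between the two" looks like; everything it glosses over is exactly the hard part (host names are made up):

    import socket

    REPLICAS = [("big-iron-1", 9000), ("big-iron-2", 9000)]   # made-up hosts

    def replicate(record: bytes):
        # Naive write-through replication: send every record to both machines.
        acks = 0
        for host, port in REPLICAS:
            try:
                with socket.create_connection((host, port), timeout=2) as s:
                    s.sendall(record)
                    acks += 1
            except OSError:
                pass   # replica unreachable: now what? (this is the hard part)
        if acks and acks < len(REPLICAS):
            # Copies have diverged; deciding which one is right is consensus.
            raise RuntimeError("replicas out of sync")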


Fair 'nuff, but if you don't distribute compute, and you store the dataset locally on all the systems (not necessarily the results of mutations, just the datasets and the work that's being done), you'll still possibly reap massive perf gains over Hadoop in certain contexts.
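
Roughly this kind of setup: the full dataset sits on every box, each box runs its own local job, and you pull the small results back afterwards. No framework, no shuffle. (Host names and the job command are made up.)

    import subprocess

    HOSTS = ["box1", "box2", "box3"]                     # made-up host names
    JOB = "python aggregate.py /local/data > /tmp/out"   # made-up local job

    # Kick off the same local job everywhere; each box reads its own copy.
    procs = [subprocess.Popen(["ssh", h, JOB]) for h in HOSTS]
    for p in procs:
        p.wait()

    # Pull the (small) results back afterwards.
    for h in HOSTS:
        subprocess.run(["scp", f"{h}:/tmp/out", f"out-{h}.txt"])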


> you'll still possibly reap massive perf gains over Hadoop in certain contexts.

Certainly, and unfortunately, the exact point at which Hadoop becomes the better option over big iron is an ongoing debate and a shifting target. But there's no doubt that such a point actually exists.


...I'm not sure if it does. Bain's Law still stands.

But if it does, then it's a pretty big chunk of data, and a very fast network.


Couldn't find anything for Bain's Law; do you have a reference I could follow?


I believe it's a reference to this old (but insightful) comment: https://news.ycombinator.com/item?id=8902739

It's actually "Bane's Rule", which states: "you don't understand a distributed computing problem until you can get it to fit on a single machine first."

(Thanks to nekopa, who also referenced it further down in this thread.)


Both disk and CPU failures are recoverable on expensive hardware.


"You have a few petabytes of data and your working set is 50 TB. You put it on two machines. All your data is now on these SGI UV 3000s or whatever. "

There's usually a combination of apps that work within the memory of the systems, plus a huge amount of external storage with a clustered filesystem, RAID, etc. Below is an example supercomputer from SGI (since you brought them up) that illustrates how they separate compute, storage, management and so on. Management software is available for most clusters to automate, or at least simplify, a lot of what you described in your later paragraphs. They use one. This was mostly a solved problem over a decade ago, with sometimes just one or two people running the supercomputer centers at various universities.

http://www.nas.nasa.gov/hecc/resources/pleiades.html


Yes, but old-school MPI-style supercomputer clusters are closer to Hadoop-style clusters than to standalone machines for the purposes of this discussion.

Both have mechanisms for doing distributed processing on data that is too big for a single machine.

The original argument was that command line tools etc are sufficient. In both these cases they aren't.



