
Options for clusters of 16+ GB RAM 4+ core hosting? - gyro_robo

======
aristus
theplanet.com will rent you something like that for USD $1,600 per month, plus
bandwidth.

You can also make it yourself and colo it. penguincomputing.com has a line
called the Altus 600. It's a 1U machine with two CPU sockets and 16 RAM slots.
Source the RAM from Crucial or PNY, and you can get a 2-core, 16GB machine for
about $3,300.

[http://www.penguincomputing.com/index.php?option=com_content...](http://www.penguincomputing.com/index.php?option=com_content&id=356&Itemid=526&task=view)

~~~
gyro_robo
Thanks for that -- the Penguin Computing 1Us look promising -- 16 slots!

It's cool theplanet has high-end options -- though I recently read an article
that gave me the impression they had some serious support problems related to
the EV1 merger.

~~~
johnm
In terms of vendors, also check out SiliconMechanics. They're touting the new
2-in-1U chassis from Supermicro that does up to 16 x86 cores in 1U. :-)

If you're in the Bay Area, ASA Computers is very competitive. They're in the
south bay and deliver straight to our colo. For hardware support, they even
come back to the colo and pull the boxes right off our racks, take them back
and fix them, and then return them to our racks.

~~~
gyro_robo
16 cores in 1U is pretty amazing -- I wonder how long until we look at it like
the Altair 8800 ;)

~~~
johnm
:-)

If you're doing Java, Azul Systems does 192 cores in 5U.

~~~
gyro_robo
Very interesting -- a specialized processor. 192 cores and 192 GB in 5U at 1000
watts! I wonder how much performance you get out of roughly 5 watts per core.

In 10 years our cell phones will have 192 cores.

------
johnm
What do you mean by "clusters"? High-end, low-latency compute clusters (using
e.g., Myrinet) or just a rack of machines? Are you doing lots of calculation
or lots of data or both? If you have lots of data, what's your storage model?
If you're going to have lots of machines, what are your needs for connectivity
(NICs, intra-cluster, 'net drops, etc.)?

~~~
gyro_robo
Isn't a Myrinet cluster just a rack of machines with Myrinet cards? I am under
the impression that 10gigE with TCP/UDP offload is pretty similar to 10gb
Myrinet.

Lots of calculation and data, hence the request for lots of CPU and RAM.
Storage is RAM binary-dumped to flat files (snapshots), so standard hard
drives are fine. Intra-cluster: preferably 10gigE but multiple 1gigE might
work. Net connectivity would probably be okay with 1gigE (total).
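
Roughly the shape of the snapshot model I mean, as a minimal sketch -- pickle
and the file naming here are stand-ins, not what's actually running:

    # Minimal sketch of "RAM binary-dumped to flat files" -- pickle and the
    # naming scheme are illustrative assumptions only.
    import pickle
    import time

    def snapshot(state, prefix="snapshot"):
        """Dump in-memory state to a timestamped flat file on local disk."""
        path = "%s-%d.bin" % (prefix, int(time.time()))
        with open(path, "wb") as f:
            pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)
        return path

    def restore(path):
        """Load a snapshot back into memory."""
        with open(path, "rb") as f:
            return pickle.load(f)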

I can probably do some lower-end testing on Sun's compute grid, but real-world
testing needs real-world hardware. It might seem like overkill at first, but
every successful service seems to hit scaling issues sooner than expected!

~~~
johnm
If you need Myrinet for your clusters then you already know why. :-? The main
reason for Myrinet, IMHO, is the low latency -- otherwise, it's not worth the
money.

8GB RAM seems to be the sweet spot at the moment, price-wise. Similarly, the
500GB SATA drives are a much better deal than the 750+GB drives. Be careful
when you spec the boxes: if you don't specify that you're going to, e.g., load
them up with additional drives later, the builders will spec them with the
smallest/cheapest controller they can.

If all of the data can be handled locally then you definitely don't have a lot
of data. :-). To calibrate, we're (krugle.com) pushing around terabytes of
data.

If you're pushing a lot of data between nodes, don't underestimate the
importance of your network infrastructure and architecture. We're using
multiple 1GigE NICs per node into Foundry SuperX's (IIRC, it has a 36Gbps
fabric) and 10G crosses. We've got multiple 1Gbps backbone drops into our load
balancers / firewalls.

If you're going relatively mainstream on the CPU side, the dual-core Intel
Xeons are definitely the choice at the moment. Watch out for the different FSB
speeds.

Re: Sun's Grid. They were very aggressively trying to get our business but
they aren't really geared for big data and definitely not for big, relatively
non-transient data.

Hope this helps, John

~~~
gyro_robo
I've read about Myrinet but never used it; I wasn't sure about the distinction
you were making about a compute cluster vs. a rack. As for latency, isn't that
only on an empty connection? E.g., if you're sending half the max per-second
traffic down the pipe, isn't _any_ new message going to take half a second to
arrive? (In which case 5 usec vs. 15 usec on an empty pipe is lost in the
noise.)

Sun's grid would just be for automated testing, as a step up from uniprocessor
EC2 nodes. I agree on the network infrastructure using multiple 1gig cards or
a 10gig. I'm not sure what you mean by "handled locally" -- each node handles
_part_ of the data, so collectively it's larger than what any single node can
manage.

~~~
johnm
That's why these kinds of discussions are difficult. Not just the usual "it
depends" but multiple levels of "it depends". As you say, if you're job mix is
keeping a pipe saturated then raw latency is probably not your critical
problem. Though, of course, if, for example, you have one pipe for bulk
transfer and one for control/meta-data.... :-)

The comment about handling data locally was trying to get to the issue of how
much data needs to be going back and forth across the cluster rather than just
living on the individual nodes. For example, we crawl millions of sites so
there's an initial lump of intake data to each of the crawlers. On each
crawler, that lump of data will get processed, redacted, expanded, and whatnot,
and only the resulting indexes and segments are ever transferred off the
crawlers.
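
Schematically the pattern is something like this (a toy sketch with made-up
names, not our actual pipeline):

    # Toy sketch of "process locally, ship only the derived data off the node".
    # Function and variable names are made up for illustration.
    def tokenize(html):
        # Stand-in tokenizer; the real processing/redaction is far more involved.
        return html.lower().split()

    def process_on_crawler(raw_pages):
        index = {}                     # derived index is small relative to intake
        for url, html in raw_pages:    # bulk intake data never leaves this node
            for term in tokenize(html):
                index.setdefault(term, []).append(url)
        return index                   # only this gets transferred off the crawler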

~~~
gyro_robo
> Though, of course, if, for example, you have one pipe for bulk transfer and
> one for control/meta-data.... :-)

I do, but 15 usec is fast enough for that. :) I generally don't like paying a
premium for slightly better performance. Even these rackmount systems are very
pricey compared to putting together your own vanilla boxes.

Data bounces from node to node; the main reason to go 10gigE instead of 1gigE
is actually the lower latency you get from the extra headroom. 100 Mbit of
traffic takes 1 sec on Fast Ethernet vs. 0.1 sec on 1gigE vs. 0.01 sec on
10gigE (not counting overhead, of course).
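
Spelling out the back-of-the-envelope arithmetic (pure transfer time for a
100 Mbit burst, protocol overhead ignored):

    # Transfer time for a 100 Mbit burst at each link speed, overhead ignored.
    burst_bits = 100e6
    for name, rate_bps in [("Fast Ethernet", 100e6), ("1GigE", 1e9), ("10GigE", 10e9)]:
        print("%-13s %.2f s" % (name, burst_bits / rate_bps))
    # -> Fast Ethernet 1.00 s, 1GigE 0.10 s, 10GigE 0.01 s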

