Hacker News
64 Terabyte RAM computer from SGI (sgi.com)
58 points by auvi 1336 days ago | 51 comments

SGI still exists? Somehow they'd been filed away in my head in the "really cool tech companies that disappeared" category, like DEC and Hewlett Packard (I am aware there is a company called "HP", but it exhibits no evidence of being what was once known as Hewlett Packard). Maybe the fact that the Google campus is the former SGI campus made me assume all of SGI was gone.

Go read the Wikipedia article on SGI. You're actually correct, the SGI of old is long dead.

Another company (Rackable Systems) bought them just for the name and is selling products that are nowhere near as cool as the old SGI products, though they still count as supercomputers and big storage.

You are right.

I do miss SGI. They had really cool workstations. Expensive as hell. We had to lease them instead of buying them.

Besides doing actual work, I remember playing with and compiling a bunch of OpenGL demos on them. I even found the Jurassic Park file browser by accident, and only years later connected the two when watching the movie for the 2nd or 3rd time.

I remember the "onslaught" of Windows NT and Windows 2000 workstations with larger, beefier graphics cards, more memory and faster processors. I could tell it was the end for SGI. But I will always remember them fondly.

I also enjoyed their Indigo workstations, which included a 3D stereo goggle viewport. In 1998 I spent 6 months working on a 64-cpu Origin 2000 supercomputer, which had some serious power for long-running Computational Fluid Dynamics jobs.

SGI no longer does graphics, but UV still has their massive ccNUMA technology in it.

SGI died; Rackable is using the name.

Young people here might not remember, but the death of SGI has a story very similar to HP and Nokia -- an ex-microsoftie is hired to head the company, decides to abort the previously successful (but not successful enough) core product (Irix, Meego, DYNAMO) and switch to being a Microsoft serf.

SGI died.

HP's relevant division died.

The story of Nokia is still being rewritten -- but it seems the spirit had already died (and reincarnated in Jolla), and the corpse is being reanimated by Microsoft.

The old HP still exists, they go by the name Agilent.

Soon to be Keysight, because test and measurement got spun off... again.

If you want to see something awesome HP has done "recently" check out this talk on memristors. Other than that I don't know... https://www.youtube.com/watch?v=bKGhvKyjgLY

I'm pretty sure memristors fell out of favor not long after being announced. I haven't heard anything about the technology lately.

If you're a researcher in the US, you can access a 32TB version of one of these boxes (PSC Blacklight) through NSF XSEDE. (And it's surprisingly easy to apply and get an allocation! We used it to do some de novo transcriptome assembly when we couldn't find a big enough machine locally.)

Or you could apply to NERSC and hop on the Cray XC30. It was actually "free" for a while to use.

How bad is memory latency? If this is really a 64TB address space you are going to need an insane cache hierarchy to make this fast unless they've made some scientific breakthrough and not shared it with the world.

They mention it being composed as a cluster of blades with "Numalink 6 interconnect", so I'd guess "pretty bad" if you're hitting non-local RAM. You'll definitely be rewriting your programs to deal with the quasi-distributed aspect.

There are actually a number of options if you're willing to do that kind of architecture, but I haven't seen anything with >2TB of RAM on a single motherboard. Even those are pretty strongly NUMA between sockets.

It probably does have an insane cache hierarchy, but in any case think of it this way: latency is going to be dramatically better to main memory on this machine than to disk, which is what the alternative would be if you have a highly nonlocal workload.

True, but gotta wonder how much better this latency is than a more reasonable amount of RAM backed by a super-fast SSD.

It depends on locality I suppose.

If your workload is spatially local, you're good on this machine and might also be good on a bunch of smaller boxes.

If spatially nonlocal but temporally local, your solution might work well.

If also temporally nonlocal, this machine might be your best bet; even if latency hurts bad, it might still be better than on any other hardware.

There isn't a bus with the characteristics you mentioned.

For example, SATA 3 saturates at around 600 MB/s (6 Gbit/s line rate).

A few years ago I measured something like 50GBps on these guys. The real trick is to walk a graph with multiple processors so that you saturate the bus. That being said I liked the Yarc data architecture more.

That's why the fastest SSDs use PCI Express instead of SATA [1].

[1] http://www.fusionio.com/products/iodrive-octal/ (6GB/s)

There is an interesting story here, take a look at the benchmarks http://regmedia.co.uk/2011/04/07/ssd_write_drop_off.jpg

I did a bunch of work with SSDs, trying to get high throughput. At the end of the day I could touch RAM at 10 megabytes per millisecond but only do I/O at 2 megabytes per millisecond.
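
For scale, those per-millisecond figures convert to per-second rates as follows. This is only a sketch built from the two numbers quoted above; decimal units (1 GB = 1000 MB) are an assumption, and the function name is made up for illustration.

```python
# Convert the quoted throughput figures from MB per millisecond to GB per second.
# ASSUMPTION: decimal units (1 GB = 1000 MB); the comment does not say which it used.

MS_PER_SECOND = 1000
MB_PER_GB = 1000


def mb_per_ms_to_gb_per_s(mb_per_ms: float) -> float:
    """MB/ms -> GB/s: multiply by 1000 ms/s, then divide by 1000 MB/GB."""
    return mb_per_ms * MS_PER_SECOND / MB_PER_GB


ram_gb_s = mb_per_ms_to_gb_per_s(10)  # RAM figure from the comment
io_gb_s = mb_per_ms_to_gb_per_s(2)    # SSD I/O figure from the comment

print(f"RAM: {ram_gb_s:.0f} GB/s, I/O: {io_gb_s:.0f} GB/s, "
      f"ratio {ram_gb_s / io_gb_s:.0f}x")
```

So the gap described is roughly a factor of five in favour of RAM, on that one setup.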

Tangentially related story: On my first ever visit to Akihabara, back in the autumn of 2000, I saw a used SGI workstation (I think an Octane, it was rounded and kind of a teal blue) on sale in a really cool electronics shop. They had all kinds of other great stuff -- a big mixing board from a studio, television cameras, rgb monitors, stuff like that. The price was not that crazy but I was a broke college student, so I had to be satisfied with just the coolness of having seen it. So I moved on to the shops with robot parts and old arcade boards and stuff.

Akihabara is still a fun place to visit, but it seems to have been taken over entirely by the Otaku culture. Otakudom has always been a part of Akihabara, but now it seems like that's all there is. I miss the old Akihabara and the DIY/tinkerer spirit of it.

Note that the size of the virtual address space on current x86-64 processors is only 256 TB! (And half of that is usually reserved for the kernel.)

And inevitably, some programs take advantage of the other 16 bits to store data, so even if you get new hardware to use the full 64 bits and kernel support to match, you'll have to watch out for JavaScript engines and other programs randomly failing :)
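
To make the "other 16 bits" trick concrete, here is a rough sketch of pointer tagging using plain Python integers in place of real pointers. It assumes 48 significant address bits and user-space addresses whose top bits are zero; `pack` and `unpack` are hypothetical names, not any engine's actual API.

```python
# Sketch of pointer tagging: hide a 16-bit tag in the unused top bits of a
# 64-bit word. ASSUMPTION: 48 significant address bits, as on the x86-64
# processors discussed above, and addresses whose top bits are all zero.
from typing import Tuple

ADDRESS_BITS = 48
TAG_BITS = 64 - ADDRESS_BITS           # the 16 spare bits
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


def pack(address: int, tag: int) -> int:
    """Store `tag` in the top 16 bits of a 64-bit word holding `address`."""
    if address >> ADDRESS_BITS:
        # This is the failure mode described above: once hardware hands out
        # addresses wider than 48 bits, the tag would overwrite real address bits.
        raise ValueError("address needs more than 48 bits")
    if tag >> TAG_BITS:
        raise ValueError("tag does not fit in 16 bits")
    return (tag << ADDRESS_BITS) | address


def unpack(word: int) -> Tuple[int, int]:
    """Split a tagged word back into (address, tag)."""
    return word & ADDRESS_MASK, word >> ADDRESS_BITS


word = pack(0x7FFF_DEAD_BEEF, 0xABCD)
print(hex(word), [hex(part) for part in unpack(word)])
print((1 << ADDRESS_BITS) // 2**40, "TiB of virtual address space")
```

The `ValueError` branch is the interesting part: a program built this way has nowhere to put its tag once addresses grow past 48 bits.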

The interesting thing is that this is single system image - it just looks like one very large desktop computer.

EDIT: add "single"

Right but that is probably deceptive. Just like you can mount a network drive and it "looks" like local storage. If you treat that memory naively (even like on 1TB RAM NUMA systems), you're in for a bad time.

As a reference point, getting a cache line from one CPU to another on the Xeon 5600 takes ~300 cycles, IIRC. That's just in a two-socket cheapo machine.

It could be considerably worse in this system.

I'm not experienced enough, but so far from what I've dealt with, treating NUMA systems as separate nodes and coding them as such is the best way to deal with things. And it lets you scale out to multiple machines easily, too. But there are probably some workloads that benefit from having what appears to be a single memory space. SQL Server, for instance, is aware of the various memory hierarchies and can optimize around them, so it might allow scale-up where scale-out is simply not an option.

> Just like you can mount a network drive and it "looks" like local storage.

People do that all the time, though! NFS-mounted network drives (often backed by a NetApp-type box) that give you cluster-wide permanent storage are the standard way of setting up a compute cluster. There are downsides, but it greatly simplifies many things vs. not having the same home directories and software on all the cluster machines. Or, to take a more cloudy example, it's how Amazon EBS works.

These monster NUMA systems are usually intended for code that's difficult to turn into cluster code, though, because of too much interaction needed between parts of the computation. Usually the computation doesn't have to literally access all of the memory and cores simultaneously, so the fact that it's NUMA isn't fatal, especially if you have a decent scheduler (improving NUMA-aware schedulers is an active research topic). But it's often difficult to partition in a clean way so you can just mapreduce the work onto cluster machines. These SGI machines don't eliminate the problem, but by offloading cache coherence to hardware they can both simplify code and improve efficiency vs. trying to handle everything in software. If you have code that isn't amenable to a simple map-reduce type architecture, and you don't have hardware cache coherence, you end up rolling your own state maintenance over a network protocol or MPI or something, performing explicit work migration via task checkpointing and task queues, or via finer-grained MPI blocks that produce smaller tasks not needing migration, etc., all of which is more bug-prone and probably slower. Also if you have ancient legacy stuff you need to scale up, the SGI box will be more likely to at least run it successfully without porting.

Yep, special cases.

My comment about mounting network drives as local is that all of a sudden, a file move takes a non-trivial amount of time, and may even time out. Opening a "Windows Explorer" type view and generating thumbnails becomes super expensive.

Naive software, in my personal experience, doesn't even work well with modern multi-core, multi-cache CPUs. Even when it's multithreaded, if it wasn't designed with all this in mind, you're better off running multiple processes on a single machine, treating each core (or sometimes pair) as a separate computer.

The Linux man pages talk about it at a high level: http://man7.org/linux/man-pages/man7/numa.7.html
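
If you want to see how Linux presents the node layout on a given box, a small sketch along these lines may help. It reads the `/sys/devices/system/node/nodeN/cpulist` files that the kernel exposes through sysfs (that path comes from the kernel's sysfs documentation rather than the man page itself); the function names are made up for illustration.

```python
# Sketch: list which CPUs belong to which NUMA node on Linux by reading sysfs.
# On machines without that sysfs tree (non-Linux, or some containers) the
# result is simply an empty dict.
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} for every nodeN directory under sysfs_root."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        node_id = int(re.search(r"(\d+)$", path).group(1))
        with open(os.path.join(path, "cpulist")) as handle:
            nodes[node_id] = parse_cpulist(handle.read())
    return nodes


for node_id, cpus in sorted(numa_nodes().items()):
    print(f"node {node_id}: {len(cpus)} CPUs")
```

On a typical two-socket server this should report two nodes; treating each node's CPU list as its own little machine is the "separate nodes" approach discussed above.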

It would depend a great deal on exactly what kinds of problems you wanted to solve, I am sure. You wouldn't want to run just 1 instance of Postgres on it, for instance (as you point out); instead you would be more likely to do some form of sharding.

The Numalink 6 speeds don't seem to be defined. Numalink 5 was about 15GB/s, but when you are reading memory at up to 51GB/s, you certainly don't want to slow down and wait for another node to give you access to your data...
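
To put those two bandwidth figures side by side, here is a back-of-the-envelope sketch. It uses only the 51GB/s and 15GB/s numbers quoted above and the 64TB headline size, with decimal units assumed. Streaming the whole memory through one reader is a thought experiment, not how the machine would really be used.

```python
# Back-of-the-envelope comparison of local memory vs. interconnect bandwidth.
# All inputs are the figures quoted in the thread; nothing here is measured.
# ASSUMPTION: decimal units (1 TB = 1000 GB).

LOCAL_GB_S = 51.0       # peak local memory bandwidth cited for the Xeon part
NUMALINK5_GB_S = 15.0   # Numalink 5 figure; Numalink 6 is not given in the thread
TOTAL_TB = 64           # machine size from the headline


def seconds_to_stream(total_tb: float, gb_per_s: float) -> float:
    """Time for a single reader to stream total_tb of data at gb_per_s."""
    return total_tb * 1000 / gb_per_s


slowdown = LOCAL_GB_S / NUMALINK5_GB_S
local_minutes = seconds_to_stream(TOTAL_TB, LOCAL_GB_S) / 60
remote_minutes = seconds_to_stream(TOTAL_TB, NUMALINK5_GB_S) / 60

print(f"interconnect is {slowdown:.1f}x slower than local memory")
print(f"one pass over {TOTAL_TB} TB: {local_minutes:.0f} min local-rate "
      f"vs {remote_minutes:.0f} min link-rate")
```

Bandwidth is only half the story; the latency of a remote access is what the rest of the thread worries about, and these numbers say nothing about it.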

CPU info on the 2.9GHz part in question: http://ark.intel.com/products/64608/Intel-Xeon-Processor-E5-...

I think you meant "single system image".

Surely some mistake. Only yesterday we were reading a slide deck asserting Solaris knew how to scale to many processors and Linux didn't. SGI have clearly misprinted their OS support options.

A 64TB flat address space? I wonder what the latency is like...

That's my question too... There's a reason there's such a thing as a cache hierarchy.

The address space may be flat, but the latency will vary a lot depending on what part of the memory you're dealing with.


It may seem like a large amount of memory today, but it may not tomorrow, just as 4 GB of main RAM seemed a huge amount of memory 30 years ago.

When x86 started going over 4GB and getting the 64-bit extensions, it was thought that a 64-bit address space would be so big (more than 4 billion times bigger than a 32-bit one) that it would take many decades before we ran into that limit; and that wasn't too long ago either - the first AMD64 CPU was in 2003, only 11 years ago, and the architecture allowed for "only" 52 bits of physical address.

Now we have 64TB, which is 2^46, which means there's only 18 "unused" bits of address left - 256K. If you could connect only(!) 262,144 of these machines together and present the memory on them as one big unit, you would have exhausted the 64-bit address space. That is what I think is really incredible. What's next, 128-bit addresses? Or maybe we'll realise that segmented address spaces (e.g. something like 96-bit, split as 32:64) are naturally more suited to the locality of NUMA than flat ones?
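
The bit counting above is easy to check. A minimal sketch, assuming "64TB" means 64 × 2^40 bytes:

```python
# Check the address-bit arithmetic: 64 TB = 2**46 bytes, leaving 18 of 64 bits.
# ASSUMPTION: "TB" is read as the binary unit, 2**40 bytes.

TIB = 2 ** 40
machine_bytes = 64 * TIB

# bit_length() - 1 is the exact log2 here because the size is a power of two.
used_bits = machine_bytes.bit_length() - 1
spare_bits = 64 - used_bits
machines_to_fill = 2 ** spare_bits

# How many times bigger a 64-bit space is than a 32-bit one.
ratio_64_to_32 = 2 ** 64 // 2 ** 32

print(used_bits, spare_bits, machines_to_fill)  # 46 18 262144
print(ratio_64_to_32)                           # 4294967296
```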

I read that as 64 GB and thought that's not impressive for a server. Then I saw it was TB...

Oh god! They found a way to get back into the ludicrously priced computer market!

Is it wrong that I want one?

Some previous-gen versions of these have fallen into the price range that an individual could plausibly buy. For example, here's a used 152-core, 608GB RAM version for AU$10k: http://www.ebay.com/itm/SGI-ALTIX-4700-76-DC-itanium2-1-6-60...

Unfortunately, the power usage is ridiculous, something like 20 kW. That's what typically makes old big-iron impractical. At the mid-range, used hardware is even cheaper: you can pick up old 12-core, 48-GB-RAM Itanium boxes for the cost of a Chromebook (~$200-300). But they take so much power that you don't end up saving anything over buying a dual-Xeon new, at least if you keep it on. Also, everything is very heavy and a hassle to move.
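
To get a feel for what "ridiculous" means here, a rough sketch of the yearly energy bill for a box drawing 20 kW around the clock. The electricity price is a placeholder assumption, not a figure from this thread; plug in your own tariff.

```python
# Rough yearly running cost of a machine drawing about 20 kW continuously.
# ASSUMPTION: PRICE_PER_KWH is a placeholder tariff, not taken from the thread.

POWER_KW = 20.0            # rough power draw quoted above
HOURS_PER_YEAR = 24 * 365  # always on, non-leap year
PRICE_PER_KWH = 0.10       # hypothetical price per kWh, in your currency


def annual_energy_kwh(power_kw: float, hours: float = HOURS_PER_YEAR) -> float:
    """Energy used in a year at constant power draw."""
    return power_kw * hours


def annual_cost(power_kw: float, price_per_kwh: float) -> float:
    """Yearly electricity cost at a flat tariff."""
    return annual_energy_kwh(power_kw) * price_per_kwh


print(f"{annual_energy_kwh(POWER_KW):.0f} kWh/year")
print(f"about {annual_cost(POWER_KW, PRICE_PER_KWH):.0f} per year at the assumed tariff")
```

Under that assumed tariff the electricity for one year costs more than the AU$10k asking price of the hardware, before cooling is even considered.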

It would be wrong if you didn't.

I love the idea of running entire VMs from ram.

We started doing this when I was working at Cisco for a CI system - rack servers with 768GB of ram and tmpfs as instance storage for Openstack servers. It worked pretty well.

If you have 16GB, which is normal for high-end laptops these days, then you already can.


This machine would probably be really useful for analyzing genetic data :-)

How much would this cost?

Let's see, 64 TB of RAM is probably at least $1M, 256 E5-4xxx will be another million, then you add the SGI goodness... you might be lucky to get change from $4M.
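
Breaking that guess down into unit prices, as a sketch: the dollar figures are the commenter's rough estimates, not quotes, and 1 TB is taken as 1024 GB here.

```python
# Unit prices implied by the rough estimate above.
# ASSUMPTIONS: the dollar figures are the commenter's guesses; 1 TB = 1024 GB.

RAM_COST_USD = 1_000_000
RAM_TB = 64
CPU_COST_USD = 1_000_000
CPU_COUNT = 256
TOTAL_GUESS_USD = 4_000_000

ram_usd_per_gb = RAM_COST_USD / (RAM_TB * 1024)
cpu_usd_each = CPU_COST_USD / CPU_COUNT
everything_else_usd = TOTAL_GUESS_USD - RAM_COST_USD - CPU_COST_USD

print(f"RAM: ${ram_usd_per_gb:.2f}/GB")
print(f"CPU: ${cpu_usd_each:.2f} each")
print(f"left for chassis, interconnect and margin: ${everything_else_usd:,}")
```

That leaves up to about half of the $4M guess for the "SGI goodness".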

SGI has always gone by the old adage of "If you have to ask, you can't afford it."

Amazon should offer this as "Monster Instance" on EC2.

Any idea how much Scrypt hash rate this computer will provide?

It has 2048 Sandy Bridge cores, so maybe 20 MH/s.

Dare to find it!
