SGI still exists? Somehow they'd been filed away in my head in the "really cool tech companies that disappeared" category, like DEC and Hewlett Packard (I am aware there is a company called "HP", but it exhibits no evidence of being what was once known as Hewlett Packard). Maybe the fact that the Google campus is the former SGI campus made me assume all of SGI was gone.
Go read the Wikipedia article on SGI. You're actually correct, the SGI of old is long dead.
Another company (Rackable Systems) bought them just for the name and is selling products that are nowhere near as cool as the old SGI products, but still count as supercomputers and big storage.
I do miss SGI. They had really cool workstations. Expensive as hell. We had to lease them instead of buying them.
Besides doing actual work, I remember playing with and compiling a bunch of OpenGL demos on them. I even found the Jurassic Park file browser by accident, and only years later connected the two when watching the movie for the 2nd or 3rd time.
I remember the "onslaught" of Windows NT and Windows 2000 workstations with larger, beefier graphics cards, more memory, and faster processors. I could tell it was the end for SGI. But I will always remember them fondly.
I also enjoyed their Indigo workstations, which included a 3D stereo goggle viewport. In 1998 I spent 6 months working on a 64-cpu Origin 2000 supercomputer, which had some serious power for long-running Computational Fluid Dynamics jobs.
Young people here might not remember, but the death of SGI has a story very similar to HP's and Nokia's: an ex-Microsoftie is hired to head the company, decides to abort the previously successful (but not successful enough) core product (IRIX, MeeGo, DYNAMO) and switch to being a Microsoft serf.
HP's relevant division died.
The story of Nokia is still being rewritten -- but it seems the spirit had already died (and reincarnated in Jolla), and the corpse is being reanimated by Microsoft.
If you're a researcher in the US, you can access a 32 TB version of one of these boxes (PSC Blacklight) through NSF XSEDE. (And it's surprisingly easy to apply and get an allocation! We used it to do some de novo transcriptome assembly when we couldn't find a big enough machine locally.)
Tangentially related story: On my first ever visit to Akihabara, back in the autumn of 2000, I saw a used SGI workstation (I think an Octane, it was rounded and kind of a teal blue) on sale in a really cool electronics shop. They had all kinds of other great stuff -- a big mixing board from a studio, television cameras, rgb monitors, stuff like that. The price was not that crazy but I was a broke college student, so I had to be satisfied with just the coolness of having seen it. So I moved on to the shops with robot parts and old arcade boards and stuff.
Akihabara is still a fun place to visit, but it seems to have been taken over entirely by the Otaku culture. Otakudom has always been a part of Akihabara, but now it seems like that's all there is. I miss the old Akihabara and the DIY/tinkerer spirit of it.
How bad is memory latency? If this is really a 64TB address space you are going to need an insane cache hierarchy to make this fast unless they've made some scientific breakthrough and not shared it with the world.
They mention it being composed as a cluster of blades with "Numalink 6 interconnect", so I'd guess "pretty bad" if you're hitting non-local RAM. You'll definitely be rewriting your programs to deal with the quasi-distributed aspect.
There are actually a number of options if you're willing to do that kind of architecture, but I haven't seen anything with >2 TB of RAM on a single motherboard. Even those are pretty strongly NUMA between sockets.
It probably does have an insane cache hierarchy, but in any case think of it this way: latency is going to be dramatically better to main memory on this machine than to disk, which is what the alternative would be if you have a highly nonlocal workload.
There isn't a bus with the characteristics you mentioned.
For example, SATA 3 saturates at roughly 600 MB/s.
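As a sanity check (assuming the standard SATA 3 parameters: a 6 Gbit/s line rate with 8b/10b encoding), the usable bandwidth works out as:

```python
# SATA 3: 6 Gbit/s line rate, 8b/10b encoding (8 payload bits per 10 line bits)
line_rate_bits = 6_000_000_000
payload_bits = line_rate_bits * 8 // 10   # strip the encoding overhead
payload_bytes = payload_bits // 8

print(payload_bytes)  # 600000000, i.e. ~600 MB/s
```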
A few years ago I measured something like 50 GB/s on these guys. The real trick is to walk a graph with multiple processors so that you saturate the bus. That being said, I liked the YarcData architecture more.
Note that the size of the virtual address space on current x86-64 processors is only 256 TB! (And half of that is usually reserved for the kernel.)
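For reference, the 256 TB figure comes from the 48 virtual address bits that current x86-64 implementations actually decode (the instruction set is 64-bit, but the hardware implements a 48-bit linear address space):

```python
# Current x86-64 implementations decode 48 virtual address bits
virtual_bits = 48
address_space = 2 ** virtual_bits

TB = 2 ** 40
print(address_space // TB)       # 256 TB total virtual address space
print(address_space // 2 // TB)  # 128 TB left if the kernel reserves half
```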
Right, but that is probably deceptive. Just like you can mount a network drive and it "looks" like local storage. If you treat that memory naively (even on 1 TB RAM NUMA systems), you're in for a bad time.
As a reference point, getting a cache line from one CPU to another on the Xeon 5600 takes ~300 cycles, IIRC. That's just in a two-socket cheapo machine.
It could be considerably worse in this system.
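Back-of-the-envelope, assuming a ~3 GHz clock (in the right ballpark for a Xeon 5600), 300 cycles works out to about 100 ns just to move one cache line between sockets:

```python
clock_hz = 3_000_000_000   # assumed ~3 GHz clock for a Xeon 5600
cycles = 300               # cross-socket cache-line transfer, per the comment above
latency_ns = cycles / clock_hz * 1e9
print(latency_ns)          # ~100 ns for a single cache line
```

A multi-hop NUMAlink fabric adds interconnect traversal on top of that, which is why remote accesses on a machine like this can be considerably worse still.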
I'm not experienced enough, but so far from what I've dealt with, treating NUMA systems as separate nodes and coding them as such is the best way to deal with things. And it lets you scale out to multiple machines easily, too. But there's probably some workloads that benefit from having what appears to be a single memory space. SQL Server, for instance, is aware of the various memory hierarchies and can optimize around it, so it might allow scale-up where scale-out is simply not an option.
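The "treat each NUMA node as a separate machine" approach above can be sketched with plain multiprocessing: partition the data, give each node-sized worker its own process (and thus its own memory arena), and only combine small results at the end. This is a minimal sketch, not a real NUMA setup; `numa_nodes=4` is an assumption, and actual pinning would be done with `numactl` or `os.sched_setaffinity` on Linux.

```python
from multiprocessing import Pool

def node_worker(chunk):
    # Each process allocates and touches only its own chunk, so with a
    # first-touch NUMA policy its pages tend to stay on the local node.
    return sum(x * x for x in chunk)

def numa_style_sum_squares(data, numa_nodes=4):  # node count is hypothetical
    step = (len(data) + numa_nodes - 1) // numa_nodes
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with Pool(processes=numa_nodes) as pool:
        partials = pool.map(node_worker, chunks)  # one small result per "node"
    return sum(partials)  # only tiny partial results cross process boundaries

if __name__ == "__main__":
    print(numa_style_sum_squares(list(range(1000))))  # 332833500
```

The same decomposition scales out to multiple machines by swapping the process pool for a network transport, which is exactly the appeal of coding it this way.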
> Just like you can mount a network drive and it "looks" like local storage.
People do that all the time, though! NFS-mounted network drives (often backed by a NetApp-type box) that give you cluster-wide permanent storage are the standard way of setting up a compute cluster. There are downsides, but it greatly simplifies many things vs. not having the same home directories and software on all the cluster machines. Or, to take a more cloudy example, it's how Amazon EBS works.
These monster NUMA systems are usually intended for code that's difficult to turn into cluster code, because it needs too much interaction between parts of the computation. Usually the computation doesn't have to literally access all of the memory and cores simultaneously, so the fact that it's NUMA isn't fatal, especially if you have a decent scheduler (improving NUMA-aware schedulers is an active research topic). But it's often difficult to partition the work in a clean way so you can just mapreduce it onto cluster machines.

These SGI machines don't eliminate the problem, but by offloading cache coherence to hardware they can both simplify code and improve efficiency vs. trying to handle everything in software. If you have code that isn't amenable to a simple map-reduce type architecture, and you don't have hardware cache coherence, you end up rolling your own state maintenance over a network protocol or MPI or something: performing explicit work migration via task checkpointing and task queues, or using finer-grained MPI blocks that produce smaller tasks that don't need migration, etc. All of which is more bug-prone and probably slower. Also, if you have ancient legacy stuff you need to scale up, the SGI box is more likely to at least run it successfully without porting.
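A minimal sketch of the "roll your own work migration" pattern described above, with explicit task and result queues. All names here are hypothetical, and a real version would run the workers on other machines and push tasks over a socket or MPI rather than a local queue:

```python
import multiprocessing as mp

def worker(task_q, result_q):
    # Workers pull fine-grained tasks, so work migrates to whoever is idle
    # instead of relying on a shared, cache-coherent address space.
    while True:
        task = task_q.get()
        if task is None:          # sentinel: no more work
            break
        lo, hi = task
        result_q.put(sum(range(lo, hi)))

def cluster_style_sum(n, workers=4, grain=250):
    task_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(task_q, result_q))
             for _ in range(workers)]
    for p in procs:
        p.start()
    tasks = [(lo, min(lo + grain, n)) for lo in range(0, n, grain)]
    for t in tasks:
        task_q.put(t)
    for _ in procs:               # one sentinel per worker
        task_q.put(None)
    total = sum(result_q.get() for _ in tasks)
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(cluster_style_sum(1000))  # 499500, same as sum(range(1000))
```

Even in this toy form you can see the extra moving parts (queues, sentinels, result collection) that hardware cache coherence lets you skip.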
My point about mounting network drives as local is that, all of a sudden, a file move takes a non-trivial amount of time and may even time out. Opening a "Windows Explorer" type view and generating thumbnails becomes super expensive.
Naive software, in my personal experience, doesn't even work well with modern multi-core, multi-cache CPUs. Even when it's multithreaded, if it wasn't designed with all this in mind, you're better off running multiple processes on a single machine, treating each core (or sometimes each pair) as a separate computer.
It would depend a great deal on exactly what kinds of problems you wanted to solve, I am sure. You wouldn't want to run just 1 instance of Postgres on it, for instance (as you point out); instead you would be more likely to do some form of sharding.
The Numalink 6 speeds don't seem to be published; Numalink 5 was about 15 GB/s. But when you are reading local memory at up to 51 GB/s, you certainly don't want to slow down and wait for another node to give you access to your data...
When x86 started going over 4 GB and getting the 64-bit extensions, it was thought that a 64-bit address space would be so big (more than 4 billion times bigger than a 32-bit one) that it would take many decades before we ran into that limit. That wasn't too long ago, either: the first AMD64 CPU shipped in 2003, only 11 years ago, and the architecture allowed "only" 52 bits of physical address.
Now we have 64 TB, which is 2^46, which means there are only 18 "unused" bits of address left - 256K. If you could connect only(!) 262,144 of these machines together and present the memory on them as one big unit, you would have exhausted the 64-bit address space. That is what I think is really incredible. What's next, 128-bit addresses? Or maybe we'll realise that segmented address spaces (e.g. something like 96-bit, split 32:64) are naturally more suited to the locality of NUMA than flat ones?
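The arithmetic above, spelled out:

```python
TB = 2 ** 40
machine_bits = (64 * TB).bit_length() - 1    # 64 TB = 2**46, so 46 bits per machine
spare_bits = 64 - machine_bits               # 18 bits left in a flat 64-bit space

print(machine_bits)       # 46
print(2 ** spare_bits)    # 262144 such machines would exhaust 64-bit addressing
```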
Unfortunately, the power usage is ridiculous, something like 20 kW. That's what typically makes old big iron impractical. At the mid-range, used hardware is even cheaper: you can pick up old 12-core, 48-GB-RAM Itanium boxes for the cost of a Chromebook (~$200-300). But they take so much power that you don't end up saving anything over buying a new dual-Xeon, at least if you keep it on. Also, everything is very heavy and a hassle to move.