Another company (Rackable Systems) bought them just for the name and is selling products that are nowhere near as cool as the old SGI gear, but are still considered supercomputers and big storage.
I do miss SGI. They had really cool workstations. Expensive as hell. We had to lease them instead of buying them.
Besides doing actual work, I remember playing with and compiling a bunch of OpenGL demos on them. I even found the Jurassic Park file browser by accident, and only connected the two years later when watching the movie for the 2nd or 3rd time.
I remember the "onslaught" of Windows NT and Windows 2000 workstations with larger, beefier graphics cards, more memory, and faster processors. I could tell it was the end for SGI. But I will always remember them fondly.
Young people here might not remember, but the death of SGI has a story very similar to HP and Nokia -- an ex-Microsoftie is hired to head the company, decides to abort the previously successful (but not successful enough) core product (IRIX, MeeGo, DYNAMO) and switches to being a Microsoft serf.
HP's relevant division died.
The story of Nokia is still being rewritten -- but it seems the spirit had already died (and reincarnated in Jolla), and the corpse is being reanimated by Microsoft.
There are actually a number of options if you're willing to do that kind of architecture, but I haven't seen anything with >2 TB of RAM on a single motherboard. Even those are pretty strongly NUMA between sockets.
If your workload is spatially local, you're good on this machine and might also be good on a bunch of smaller boxes.
If spatially nonlocal but temporally local, your solution might work well.
If also temporally nonlocal, this machine might be your best bet; even if latency hurts bad, it might still be better than on any other hardware.
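To make the spatial-locality distinction concrete, here's a toy Python sketch (sizes picked arbitrarily) comparing a sequential scan against the same work done in random order. The effect is much starker in C on real hardware -- Python lists of ints are arrays of pointers -- but the random walk is still usually slower:

```python
import random
import time

# Toy size, picked arbitrarily for illustration.
N = 2_000_000
data = list(range(N))

rand_order = list(range(N))
random.shuffle(rand_order)

def walk(order):
    start = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return total, time.perf_counter() - start

seq_total, seq_time = walk(range(N))       # spatially local: sequential scan
rand_total, rand_time = walk(rand_order)   # spatially nonlocal: random hops

assert seq_total == rand_total  # identical work, different access pattern
print(f"sequential: {seq_time:.3f}s  random: {rand_time:.3f}s")
```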
For example, SATA 3 tops out at 6 Gbit/s on the wire -- roughly 600 MB/s of actual payload.
A few years ago I measured something like 50 GB/s on these guys. The real trick is to walk a graph with multiple processors so that you saturate the bus. That being said, I liked the Yarc data architecture more.
http://www.fusionio.com/products/iodrive-octal/ (6 GB/s)
I did a bunch of work with SSDs trying to get high throughput; at the end of the day I could touch RAM at 10 megabytes per millisecond but only do IO at 2 megabytes per millisecond.
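For scale, converting those per-millisecond figures into GB/s is simple arithmetic (nothing measured here, just unit conversion of the quoted numbers):

```python
# Quoted rates, in megabytes per millisecond.
ram_rate = 10
io_rate = 2

# 1 MB/ms = 1000 MB/s, and 1000 MB = 1 GB (decimal), so MB/ms == GB/s.
ram_gb_s = ram_rate * 1000 / 1000
io_gb_s = io_rate * 1000 / 1000

print(ram_gb_s, io_gb_s, ram_gb_s / io_gb_s)  # 10.0 2.0 5.0
```

So the RAM path was 10 GB/s vs 2 GB/s for IO -- a 5x gap.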
Akihabara is still a fun place to visit, but it seems to have been taken over entirely by the Otaku culture. Otakudom has always been a part of Akihabara, but now it seems like that's all there is. I miss the old Akihabara and the DIY/tinkerer spirit of it.
EDIT: add "single"
As a reference point, getting a cache line from one CPU to another on the Xeon 5600 takes ~300 cycles, IIRC. That's just in a two-socket cheapo machine.
It could be considerably worse in this system.
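A quick back-of-the-envelope on what ~300 cycles means in wall-clock time, assuming a ~3 GHz clock (the exact frequency is a guess, not from the source):

```python
cycles = 300        # quoted cross-socket cache-line transfer cost
clock_hz = 3.0e9    # assumed ~3 GHz Xeon 5600-class clock

latency_ns = cycles * 1e9 / clock_hz
print(latency_ns)  # 100.0
```

That's on the order of 100 ns just to move one cache line between sockets, which is why remote-memory access patterns hurt so much at NUMA scale.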
I'm not experienced enough, but so far from what I've dealt with, treating NUMA systems as separate nodes and coding them as such is the best way to deal with things. And it lets you scale out to multiple machines easily, too. But there are probably some workloads that benefit from having what appears to be a single memory space. SQL Server, for instance, is aware of the various memory hierarchies and can optimize around them, so it might allow scale-up where scale-out is simply not an option.
People do that all the time, though! NFS-mounted network drives (often backed by a NetApp-type box) that give you cluster-wide permanent storage are the standard way of setting up a compute cluster. There are downsides, but it greatly simplifies many things vs. not having the same home directories and software on all the cluster machines. Or, to take a more cloudy example, it's how Amazon EBS works.
These monster NUMA systems are usually intended for code that's difficult to turn into cluster code, though, because of too much interaction needed between parts of the computation. Usually the computation doesn't have to literally access all of the memory and cores simultaneously, so the fact that it's NUMA isn't fatal, especially if you have a decent scheduler (improving NUMA-aware schedulers is an active research topic). But it's often difficult to partition in a clean way so you can just mapreduce the work onto cluster machines.

These SGI machines don't eliminate the problem, but by offloading cache coherence to hardware they can both simplify code and improve efficiency vs. trying to handle everything in software. If you have code that isn't amenable to a simple map-reduce type architecture, and you don't have hardware cache coherence, you end up rolling your own state maintenance over a network protocol or MPI or something, performing explicit work migration via task checkpointing and task queues, or using finer-grained MPI blocks that produce smaller tasks not needing migration, etc. Which is all more bug-prone and probably slower.

Also, if you have ancient legacy stuff you need to scale up, the SGI box will be more likely to at least run it successfully without porting.
My point about mounting network drives as local is that, all of a sudden, a file move takes a non-trivial amount of time and may even time out. Opening a "Windows Explorer"-type view and generating thumbnails becomes super expensive.
Naive software, in personal experience, doesn't even work well with modern multi-core, multi-cache CPUs. Even when it's multithreaded, if it wasn't designed with all this in mind, you're better off running multiple processes on a single machine, treating each core (or sometimes a pair) as a separate computer.
It would depend a great deal on exactly what kinds of problems you wanted to solve, I am sure. You wouldn't want to run just 1 instance of Postgres on it, for instance (as you point out); instead you would be more likely to do some form of sharding.
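A sketch of what that sharding might look like (shard count and key scheme are hypothetical, not anything Postgres provides out of the box): a stable hash routes each key to one of several independent instances.

```python
import hashlib

N_SHARDS = 4  # hypothetical: e.g. one Postgres instance per NUMA node

def shard_for(key: str) -> int:
    """Stable routing: the same key always lands on the same instance."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_SHARDS

# The same key always routes identically, and every key stays in range.
assert shard_for("user:42") == shard_for("user:42")
assert all(0 <= shard_for(f"key{i}") < N_SHARDS for i in range(1000))
```

Each shard then runs as its own process with its own memory, which lines up nicely with pinning one instance per NUMA node.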
The NUMAlink 6 speeds don't seem to be published; NUMAlink 5 was about 15 GB/s. But when you are reading local memory at up to 51 GB/s, you certainly don't want to slow down and wait for another node to give you access to your data...
CPU info on the 2.9 GHz part in question: http://ark.intel.com/products/64608/Intel-Xeon-Processor-E5-...
Now we have 64 TB, which is 2^46, which means there are only 18 "unused" bits of address left -- 256K. If you could connect only(!) 262,144 of these machines together and present their memory as one big unit, you would have exhausted the 64-bit address space. That is what I think is really incredible. What's next, 128-bit addresses? Or maybe we'll realise that segmented address spaces (e.g. something like 96-bit, split 32:64) are naturally better suited to the locality of NUMA than flat ones?
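The arithmetic checks out:

```python
TB = 2 ** 40
machine_ram = 64 * TB
assert machine_ram == 2 ** 46        # 64 TB is 2^46 bytes

unused_bits = 64 - 46                # bits left in a flat 64-bit address
machines_to_exhaust = 2 ** unused_bits
print(unused_bits, machines_to_exhaust)  # 18 262144
```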
Unfortunately, the power usage is ridiculous, something like 20 kW. That's what typically makes old big-iron impractical. At the mid-range, used hardware is even cheaper: you can pick up old 12-core, 48-GB-RAM Itanium boxes for the cost of a Chromebook (~$200-300). But they take so much power that you don't end up saving anything over buying a dual-Xeon new, at least if you keep it on. Also, everything is very heavy and a hassle to move.