A 2TB RAM machine with lots of high end Xeon chips is only about $120K, probably less if you are a good negotiator.
I've programmed large RAM machines and it's not that hard. In general, it simply let me run programs I couldn't run previously because they allocated too much memory and crashed or swapped/paged too much.
Having a flat memory hierarchy (all RAM has the same cost) makes it dirt simple. NUMA made it significantly harder because you typically had to structure your program's data and threads so they'd be scheduled on the appropriate cores or processors.
However, I've found over time that unless you absolutely need to hold all your data in RAM, spending the additional money on the largest DIMMs and a motherboard with tons of DIMM slots isn't really cost effective. So long as you can partition your problem, that is the preferred solution in nearly all cases. That said, programmer productivity can sometimes make "just buy more RAM" the cheaper option overall, and I know people who have been upgrading their machines for years and running the exact same algorithms on their in-memory data sets at very high speed the whole time.
Depending on your code, some subtle problems can occur. For example, the TLB, which speeds up virtual-to-physical address translation, is backed by its own second-level TLB, and if you're screaming all over memory you can blow out the TLB for the TLB. My experience has been that TLB issues go up with larger memory (since you tend to load more data on each node and access it with sparser patterns).
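You can see the cost of sparse access without any special tooling. Here's a crude Java sketch of mine (not a proper benchmark, and it lumps cache misses in with TLB misses) that just compares sequential and random reads over a ~1 GiB array; huge pages (e.g. -XX:+UseLargePages on HotSpot) are the usual mitigation:

  import java.util.Random;

  // Rough illustration of why sparse access patterns hurt on big heaps:
  // same number of reads over the same array, sequential vs. random.
  // Run with something like -Xmx2g so the array fits.
  public class SparseAccess {
      public static void main(String[] args) {
          int n = 1 << 27;                      // 128M longs, ~1 GiB
          long[] a = new long[n];
          Random rnd = new Random(42);

          long t0 = System.nanoTime();
          long sum = 0;
          for (int i = 0; i < n; i++) sum += a[i];            // sequential
          long t1 = System.nanoTime();
          for (int i = 0; i < n; i++) sum += a[rnd.nextInt(n)]; // random
          long t2 = System.nanoTime();

          System.out.printf("sequential %.2fs, random %.2fs (sum=%d)%n",
                  (t1 - t0) / 1e9, (t2 - t1) / 1e9, sum);
      }
  }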
I haven't looked at DIMM price curves or roadmaps lately, but I assume that 4TB will be $120K in another 3-5 years, then 8TB will be $120K in 10 years.
With all that said, SSDs have greatly reduced the need for huge amounts of RAM. I've done jobs that would have required more physical RAM than my budget allowed by configuring Linux swap on a fast SSD. It swapped a lot, but the jobs ran :-)
For a programmer who works with far more pedestrian hardware: what do you do with 2TB of RAM that you wouldn't implement in, let's say, a rainbow table on a machine with lesser specs? If you're allowed to say, that is.
Sure. (I'm not sure what rainbow tables have to do with this; those are typically for one very specific use. Bloom filters seem like a better comparison: they give you a fixed-memory membership test with a known false-positive rate.)
I had a large multidimensional object and I wanted to smooth it.
So, I loaded the whole object in RAM, did an FFT on it, multiplied it by a gaussian, then did an inverse FFT on it.
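For what it's worth, the shape of that in Java looks something like the sketch below. It's 1-D for brevity and assumes the JTransforms library; the class names and the exact Gaussian transfer function are my fill-ins, not the original code, and the real job was multidimensional (DoubleFFT_3D for a volume):

  import org.jtransforms.fft.DoubleFFT_1D;

  // Gaussian smoothing via the frequency domain: FFT, multiply by the
  // Gaussian's transfer function, inverse FFT.
  public class FftSmooth {
      public static void smooth(double[] signal, double sigma) {
          int n = signal.length;
          // interleaved complex array: [re0, im0, re1, im1, ...]
          double[] c = new double[2 * n];
          for (int i = 0; i < n; i++) c[2 * i] = signal[i];

          DoubleFFT_1D fft = new DoubleFFT_1D(n);
          fft.complexForward(c);

          // Fourier transform of a Gaussian with std-dev sigma (in samples)
          // is exp(-2 * pi^2 * sigma^2 * f^2), with f in cycles/sample.
          for (int k = 0; k < n; k++) {
              double f = (k <= n / 2 ? k : k - n) / (double) n;
              double g = Math.exp(-2.0 * Math.PI * Math.PI * sigma * sigma * f * f);
              c[2 * k] *= g;
              c[2 * k + 1] *= g;
          }

          fft.complexInverse(c, true);          // scale=true divides by n
          for (int i = 0; i < n; i++) signal[i] = c[2 * i];
      }
  }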
Another example would be storing very large hashmaps (as a backing store for a sparse multidimensional array).
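A minimal sketch of that idea, as I'd write it in Java (my illustration, not the poster's code): a sparse 3-D array whose only storage is one big hashmap keyed by the packed indices. At these sizes you'd really want a primitive-keyed map such as fastutil's Long2DoubleOpenHashMap to avoid boxing every key and value, but the structure is the same:

  import java.util.HashMap;
  import java.util.Map;

  // A sparse 3-D array of doubles backed by one big hashmap.  Keys are the
  // three indices packed into a single long; absent entries read as 0.0.
  public class SparseArray3D {
      private final Map<Long, Double> cells = new HashMap<>();

      private static long key(int i, int j, int k) {
          // pack three non-negative 21-bit indices into one long (illustrative limit)
          return ((long) i << 42) | ((long) j << 21) | k;
      }

      public double get(int i, int j, int k) {
          return cells.getOrDefault(key(i, j, k), 0.0);
      }

      public void set(int i, int j, int k, double v) {
          cells.put(key(i, j, k), v);
      }
  }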
If you had a very large biosequence database, and you wanted to map next-generation reads against it, you'd load the database into RAM with an mmap call followed by a block read, then subsequent mappings would be very fast.
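The Java analogue of that mmap-then-read warmup is FileChannel.map plus MappedByteBuffer.load(), roughly as below (note that each MappedByteBuffer tops out around 2 GB, so a really large database needs several mappings):

  import java.io.IOException;
  import java.nio.MappedByteBuffer;
  import java.nio.channels.FileChannel;
  import java.nio.file.Path;
  import java.nio.file.StandardOpenOption;

  // Map a large read-only database file into memory and pre-fault it so
  // later random lookups don't stall on disk.
  public class MapDatabase {
      public static MappedByteBuffer map(Path db) throws IOException {
          try (FileChannel ch = FileChannel.open(db, StandardOpenOption.READ)) {
              MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
              buf.load();   // touch every page now, roughly the "block read" above
              return buf;   // mapping stays valid after the channel is closed
          }
      }
  }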
Effectively, having large amounts of flat access RAM means you don't have to worry about communication (between nodes, or between node and disk).
One interesting aspect is that we've built our computer architecture with the assumption that storage is tiered from expensive to cheap: L1-3 CPU cache, RAM, SSD, HD, maybe Tape.
If we just went L1-3 + RAM, it would greatly simplify the job of programmers and programs: (theoretically) no need for virtual memory, buffering/flushing, etc.
You start by congratulating yourself for refusing to buy into the Java hype, then continue on with business as usual in C++ or some other non-garbage-collected language.
After that you discover it's a NUMA machine, you have a lot of inter-CPU communication, and you curse that you have no GC. Why? Because you bottleneck inter-CPU communication so easily with object synchronization, and with a GC quite a bit of that can be avoided.
If Java just didn't need to allocate each damn small value object (think struct) separately, I think it'd be a lot better with large heaps.
An array of 1000 non-elementary values could be just one allocation, not 1001, and roughly 1000 times cheaper to garbage collect.
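Until value types land (Project Valhalla), the usual workaround is to flatten the "structs" yourself into parallel primitive arrays; a quick sketch of what I mean:

  // Without value types, 1000 Points means 1000 heap objects plus the array.
  // Flattening the fields into parallel primitive arrays gets it down to a
  // couple of allocations and gives the GC (and the cache) far less to chase.
  public class Points {
      private final double[] xs;
      private final double[] ys;

      public Points(int n) {
          xs = new double[n];
          ys = new double[n];
      }

      public void set(int i, double x, double y) { xs[i] = x; ys[i] = y; }
      public double x(int i) { return xs[i]; }
      public double y(int i) { return ys[i]; }
  }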
Quite the opposite. The only way to use so much RAM is with lots and lots of threads (running on lots and lots of cores), which means either sharding or sharing. Sharding scales really badly because when you do have cross-shard transactions or queries, you need expensive locks. A GC makes general-purpose, high-performance lock-free data structures much, much, much easier.
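To make that concrete, here's the textbook Treiber stack in Java (my illustration, not the commenter's code). The CAS loop is the whole structure; the GC handles node reclamation, which is exactly the part that forces hazard pointers or epoch schemes on you in C++:

  import java.util.concurrent.atomic.AtomicReference;

  // Treiber stack: a classic lock-free structure that stays simple in a
  // GC'd language because popped nodes are reclaimed by the collector --
  // no hazard pointers, and no ABA problems from premature node reuse.
  public class LockFreeStack<T> {
      private static final class Node<T> {
          final T value;
          final Node<T> next;
          Node(T value, Node<T> next) { this.value = value; this.next = next; }
      }

      private final AtomicReference<Node<T>> head = new AtomicReference<>();

      public void push(T value) {
          Node<T> h;
          Node<T> n;
          do {
              h = head.get();
              n = new Node<>(value, h);
          } while (!head.compareAndSet(h, n));
      }

      public T pop() {
          Node<T> h;
          do {
              h = head.get();
              if (h == null) return null;        // empty stack
          } while (!head.compareAndSet(h, h.next));
          return h.value;
      }
  }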
Better to use Azul's Zing unless you want worst-case heap compactions to take 3 hours (maybe it's better now; in their Pauseless paper I think they said 1 second per GiB, but it would have to be a LOT better).
Their approach is to engineer for the worst case, and then everything else is easy ^_^. More seriously, their collectors run all the time and GC the whole heap, as well as (of course) the nurseries more often.
I'd add that these systems often have NUMA effects in access times even though it's one big address space, i.e. each chip has faster access to the memory directly attached to it than to memory hanging off other chips. The mentioned SGI system uses stock Xeon chips, so I'm pretty sure it has these sorts of issues.
Pause times are better than 1 sec/GB with CMS and (by all reports) a lot better with G1. We have a service running an 8GB heap and using CMS and I want to say that it has < 100 ms pauses, definitely < 1s.
With a 64TB heap, the question isn't "is it better?" but "is pause time O(1)?" I can't recall what the Azul collector does, but I think G1 has some kind of linear behavior that would make it unusable on a heap that big.
I don't want to be Pollyannaish, but it is true that as time goes on, people will keep developing technology for handling large heaps. G1 was not fully supported until Java 7 and will probably be the default in Java 9. Meanwhile there's another low pause collector that Red Hat is working on (http://www.infoworld.com/article/2888189/java/red-hat-shenan...).