
Clever RDMA Technique Delivers Distributed Memory Pooling - rbanffy
https://www.nextplatform.com/2017/06/12/clever-rdma-technique-delivers-distributed-memory-pooling/
======
ChuckMcM
As Greg points out, this has come up multiple times, with multiple
implementation strategies.

In the early 2000s I led the design of a memory-based Ethernet appliance
which we called "Network Accessible Memory". The basic idea combined the
facts that FPGAs were pretty cheap and that you could easily build a gigabit
network interface on one side and a DDR2/3 interface on the other; with a
very simple Ethernet protocol you could have terabytes of memory basically a
couple of microseconds away from a CPU, as opposed to 'swap', which was
milliseconds away at best. (We also simulated the architecture using a
machine that just exported memory to the network.)
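
(For illustration only: a minimal request frame for that kind of protocol
might look like the sketch below. The field names and sizes are my own
guesses, not the actual NAM wire format.)

    /* Hypothetical request frame for a simple memory-over-Ethernet protocol.
       Every field here is illustrative; the real format differed and carried
       its own FEC and sequencing. */
    #include <stdint.h>

    enum nam_op { NAM_READ = 0, NAM_WRITE = 1 };

    struct nam_request {
        uint8_t  opcode;     /* NAM_READ or NAM_WRITE               */
        uint8_t  flags;
        uint16_t length;     /* bytes to read or write              */
        uint32_t sequence;   /* matches replies to requests         */
        uint64_t address;    /* byte offset into the memory pool    */
        uint32_t crc;        /* catches corruption on the wire      */
        uint8_t  payload[];  /* present on writes and read replies  */
    } __attribute__((packed));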

The two biggest wins are that the memory is already 'live' when the machines
that might use it boot up, so they don't spend time reading in state; and
that in a multi-processor storage subsystem you can put the buffer cache into
that memory without tying up any individual processor's main memory. That
gives you higher throughput and lower latency, since you aren't turning
around the disk interfaces for metadata reads, and mixed read/write workloads
don't blow up your buffer cache.

All very cool. But all very "specialized" too.

~~~
greglindahl
Fun -- Texas Memory Systems was already shipping a product in that era that
was FPGA-based. It did a variety of complicated network protocols, like Fibre
Channel and eventually InfiniBand, with a pure FPGA implementation. The guy
who ran the company was also an FPGA wizard. I was on a telecon with him once.

[https://en.wikipedia.org/wiki/Texas_Memory_Systems](https://en.wikipedia.org/wiki/Texas_Memory_Systems)

~~~
ChuckMcM
Yeah, I think they are cited in the patent we did. The key differences were
a) no non-volatile storage at all, and b) we took 16 memory 'blades' (1 Gbit
each) and combined them into a single 10 Gbit-backed 'controller'. It
leveraged a lot of ideas from the disk shelves of the time. It gave the
network 1 GB/s read/write access to fully RAID-protected memory (with
equivalent FEC bits in the packets to catch network corruption).

I expect TMS would have gotten there eventually once they moved on from
'storage' on the network/fabric to 'memory' on the network/fabric.

------
greglindahl
I know this is a "pop" news writeup and not a proper paper, but this same
thing has been done many times before; the Cray T3E, for example, had this
sort of thing built into its OS. One of the interesting challenges is that
the networking stack itself allocates memory, so you have to be careful to
avoid deadlocks when memory runs low.
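
(The usual way around that is to reserve the buffers for the page-out path up
front, so pushing a page to the network never has to allocate. A rough
user-space sketch of the idea; the T3E's and Linux's actual mechanisms are
different.)

    /* Illustrative only: a small pool of buffers reserved at startup so the
       page-out path never calls malloc() when memory is already scarce. */
    #include <stddef.h>
    #include <stdlib.h>

    #define POOL_BUFS 64
    #define PAGE_SIZE 4096

    static void *pool[POOL_BUFS];
    static int   pool_top;

    int pool_init(void) {
        for (pool_top = 0; pool_top < POOL_BUFS; pool_top++) {
            pool[pool_top] = malloc(PAGE_SIZE);
            if (pool[pool_top] == NULL)
                return -1;        /* reserve everything before we need it */
        }
        return 0;
    }

    void *pool_get(void)    { return pool_top ? pool[--pool_top] : NULL; }
    void  pool_put(void *b) { pool[pool_top++] = b; }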

A user-level example is the Global Arrays toolkit from PNNL, released in 1994:
[http://www.emsl.pnl.gov/docs/global_arraysdev/index.shtml](http://www.emsl.pnl.gov/docs/global_arraysdev/index.shtml)
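
(A rough sketch of what using GA looks like from C, written from memory;
check the GA documentation for the exact calls and arguments.)

    /* Sketch only: create a globally addressable 2-D array spread across the
       cluster's memory, then write a patch of it from this process. */
    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        GA_Initialize();
        MA_init(C_DBL, 1000000, 1000000);    /* GA's local memory allocator */

        int dims[2]  = {1000, 1000};
        int chunk[2] = {-1, -1};             /* let GA pick the distribution */
        int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);

        /* any process can read or write any patch, local or remote */
        double buf[10];
        int lo[2] = {0, 0}, hi[2] = {0, 9}, ld[1] = {10};
        for (int i = 0; i < 10; i++) buf[i] = (double)i;
        NGA_Put(g_a, lo, hi, buf, ld);
        GA_Sync();

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }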

~~~
philipkglass
It looks like PNNL has been slow to update their own Global Arrays web page.
The source has finally moved to github:
[https://github.com/GlobalArrays/ga](https://github.com/GlobalArrays/ga)

I noticed this a few months ago when the NWChem build scripts added a tool to
fetch GA from GitHub. It's nice because, although the NWChem svn repo has
long allowed read-only public access, the svn repo for GA was closed.

------
rmetzler
I didn't see the source code for infiniswap linked in the article. Here it is:
[https://github.com/Infiniswap/infiniswap](https://github.com/Infiniswap/infiniswap)

------
drewg123
This sounds like a modern follow-on in the spirit of work done at UW, Duke,
and UBC in the 90s called "GMS". The interconnect was ATM at first, and then
Myrinet (which is a sort of predecessor to InfiniBand). The GMS research was
done on DEC Alphas running DEC UNIX (and later FreeBSD) and involved giving
applications transparent access to huge amounts of RAM over the network.

And in fact, it looks like they cite some of the GMS papers.

------
convolvatron
There was a whole movement exploring distributed shared memory in the early
90s, using VM hardware to trigger distributed swap with coherency protocols
on top. In fact, there was a parallel version of OSF/1.

[https://en.wikipedia.org/wiki/Distributed_shared_memory](https://en.wikipedia.org/wiki/Distributed_shared_memory)

Which isn't to say that low latency and high bandwidth don't make this a lot
more feasible. Fault tolerance is pretty hard, though :)
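
(The classic user-level version of that trick, very roughly: protect the
pages you don't have locally, catch the fault, pull the page over the
network, and unprotect. A bare-bones sketch with no coherency protocol;
fetch_remote_page() is a made-up placeholder for the network fetch.)

    /* Toy illustration of VM-hardware-triggered remote paging: pages start
       PROT_NONE, and the SIGSEGV handler fills them in on first touch. */
    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE 4096

    /* hypothetical: fetch the page's contents from a remote host */
    extern void fetch_remote_page(void *dst, uintptr_t addr);

    static void on_fault(int sig, siginfo_t *si, void *ctx) {
        void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1));
        mprotect(page, PAGE, PROT_READ | PROT_WRITE);
        fetch_remote_page(page, (uintptr_t)page);
        /* returning retries the faulting instruction, which now succeeds */
    }

    void *dsm_region(size_t len) {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        /* no local contents yet; every first touch lands in on_fault() */
        return mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }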

~~~
greglindahl
In this case they're writing everything to a local SSD. The other usual
techniques are RAID-1 (multiple copies) or a higher RAID level (if you want
to conserve memory).
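
(To put rough illustrative numbers on that trade-off: mirroring every remote
page RAID-1 style costs 2x the remote memory, while parity over, say, 8 data
pages plus 1 parity page costs about 12.5% extra, at the price of a more
expensive rebuild when a host drops out.)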

------
gct
I think infiniswap is pretty neat technically. The real solution to this
problem, though, is to design your application so that it doesn't swap.

~~~
greglindahl
If you have a lot of unused memory in your cluster, and your application runs
a lot faster if you use it, how is that a bad thing?

~~~
jessaustin
Of course very few usage patterns can avoid swap entirely, but in general,
avoiding it is faster than hitting it.

~~~
greglindahl
Almost all usage patterns can avoid swap. In this case they're intentionally
using a RAM-over-network swap device because it's cost-effective... faster
than avoiding swap.

