
Designing far memory data structures: think outside the box - aloknnikhil
https://blog.acolyer.org/2019/06/26/designing-far-memory-data-structures/
======
eternalban
I find the term "far memory" a bit strange, especially considering that the
paper starts out using the dual of "remote" and "local". The first paper in
the "Prior works" list is also consistent and applies the adjective "remote"
to CPU, procedure, and memory. Is there a technical distinction that I am
missing here?

(Oddly enough, I just did a search for "remote memory data structures" and
guess what blog post and paper come up!)

------
amelius
Shouldn't we use better notation for the time complexity of the algorithms?
For example, an algorithm can have

    O(n^2) + rt * O(n)

time complexity (where rt is the round-trip time). Of course this expression
collapses to O(n^2), but writing it as above makes it clearer where the cost
comes from.

EDIT: on second thought, perhaps bring the rt under the O() together with n.
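
To make the point concrete, here is a toy sketch (not from the post) of a
far-memory linear search in C where the rt * O(n) term is essentially the
whole cost in practice; rdma_read() is a hypothetical stand-in for one
synchronous round trip, not a real API:

    #include <stddef.h>
    #include <stdint.h>
    /* Hypothetical helper: one synchronous round trip to far memory. */
    extern uint64_t rdma_read(uint64_t remote_addr);
    int far_linear_search(uint64_t remote_base, size_t n, uint64_t key)
    {
        for (size_t i = 0; i < n; i++) {
            /* Each hop pays ~1us of network latency (the rt * O(n) term),
               which dwarfs the O(1) local comparison below. */
            uint64_t v = rdma_read(remote_base + i * sizeof(uint64_t));
            if (v == key)
                return (int)i;    /* found at index i */
        }
        return -1;                /* not found */
    }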

~~~
sagebird
I agree with the spirit, but why use O here at all? Isn't the idea that O
collapses everything to its highest-order term? So if you don't want that,
don't use it.

You could use a normal function. Like t(n) = f(n^2) + g(n) + rt

~~~
afiori
The point is that all the nice manipulations you would like to do are sound
in O-notation and unsound in many other notations. What the parent wants is

O(n^2) + O(m) * O(n)

where m is the number of roundtrips.

~~~
sterkekoffie
Writing an O in front of something doesn't mean it's in big O.

~~~
afiori
You can use every tool in the wrong way; if you stay in simple cases it is
(comparatively) hard to misuse big-O notation.

------
T3OU-736
Curious. This is somewhat reminiscent of SGI's ccNUMA and CRAYLink/NUMALink
architectures.

If memory serves, IRIX (SGI's UNIX OS) had both the metrics to observe access
latency and the ability to migrate the data and/or the compute closer to each
other.
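
(For a rough modern analogue of the "migrate data toward the compute" idea,
Linux exposes page migration through libnuma. A minimal sketch, assuming
libnuma is installed; the target node number is purely illustrative:)

    #include <numa.h>     /* libnuma: numa_available, numa_move_pages */
    #include <numaif.h>   /* MPOL_MF_MOVE */
    #include <stdio.h>
    int migrate_to_node(void *page, int target_node)
    {
        if (numa_available() < 0)
            return -1;                        /* no NUMA support here */
        void *pages[1]  = { page };
        int   nodes[1]  = { target_node };    /* desired home node */
        int   status[1] = { -1 };             /* filled with resulting node */
        /* pid 0 means "this process"; MPOL_MF_MOVE moves only our own pages. */
        int rc = numa_move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
        if (rc == 0)
            printf("page now lives on node %d\n", status[0]);
        return rc;
    }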

ccNUMA was open-sourced and AMD uses it on their multi-core/multi-socket
systems, though usually within a single motherboard, not so much leaving the
case and interlinking systems SGI Origin-style (which is what the
CRAYLink/NUMALink tech did).

~~~
MisterTea
The sad thing is that HyperTransport was supposed to offer this exact feature
and implement it just like SGI did with NUMAlink. A few boards were produced
with HTX slots; I have an older Tyan dual-socket Opteron board with an HTX
slot kicking around.

There is a connector standard: [https://www.hypertransport.org/ht-connectors-
and-cables](https://www.hypertransport.org/ht-connectors-and-cables)

Connectors available from Samtec:
[https://www.samtec.com/standards/ht3#connectors](https://www.samtec.com/standards/ht3#connectors)

Manycore CPUs and converged Ethernet pretty much made it moot.

~~~
kjs3
Yeah...HTX was really interesting until it was clear that 40G/100G enet was
going to become commodity really fast.

------
inetknght
This talk seems to me to follow a similar line of thinking to the one I saw
presented by Chandler Carruth at the 2014 C++ conference [0]. In that talk he
showed a table with approximate round-trip times for the various data layers.

[0]:
[https://youtu.be/fHNmRkzxHWs?t=2208](https://youtu.be/fHNmRkzxHWs?t=2208)

------
eloff
Is it possible to have direct remote memory access in any of the major cloud
providers?

I think it should be technically possible inside your virtual network, if the
cloud platform and network gear were to support it.

~~~
xiii1408
Generally, no.

The main requirement to support this is that a RoCE or other RDMA API be
exposed inside the cloud VM. This requires (1) that the physical boxes have
RDMA-capable NICs (likely universal at this point), but also (2) that the
virtualized network adapter, e.g. AWS ENA, expose an RDMA API, which is much
harder.

AWS did not support any kind of RDMA when I looked into it last year. Azure
does, but in my understanding this is only in their "supercomputer partition,"
which is not _really_ a cloud environment.

I've heard that AWS is looking to write an ENA backend for GASNet (a
communication library), which could perhaps (?!) lead to them exposing RDMA
and other low-level NIC features.

~~~
posnet
[https://aws.amazon.com/blogs/aws/now-available-elastic-
fabri...](https://aws.amazon.com/blogs/aws/now-available-elastic-fabric-
adapter-efa-for-tightly-coupled-hpc-workloads/)

------
deffbjinnnbbvf
How is far memory different from a disk?

~~~
xiii1408
Disk could be considered a specific form of "far memory."

In the context of this paper, though, "far memory" refers to memory outside
the local system that is accessed using RDMA instructions.
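
For a sense of what "accessed using RDMA instructions" looks like in code,
here is a minimal sketch of a one-sided RDMA read with libibverbs. It assumes
the queue pair, registered memory region, and remote address/rkey exchange
are already set up, and that completions are polled elsewhere:

    #include <infiniband/verbs.h>
    #include <stdint.h>
    /* Pull 8 bytes from far memory without involving the remote CPU. */
    int rdma_read_u64(struct ibv_qp *qp, struct ibv_mr *local_mr,
                      uint64_t *local_buf, uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,   /* where the value lands locally */
            .length = sizeof(uint64_t),
            .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_READ,   /* one-sided read */
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = remote_addr; /* far-memory address */
        wr.wr.rdma.rkey        = rkey;        /* remote region's access key */
        struct ibv_send_wr *bad_wr = NULL;
        return ibv_post_send(qp, &wr, &bad_wr);
    }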

~~~
deffbjinnnbbvf
Don't disk-based data structures have similar constraints? There, too, there
is no ability to ship computation, and we try to optimize for minimal data
round trips.

~~~
xiii1408
RDMA instructions are (1) more expressive than disk operations, from what I
understand (they support compare-and-swap, fetch-and-add, etc.), and (2) have
different latencies and bandwidths (on the order of 1us latency, 20 GB/s BW).
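
As a rough sketch of what those verbs look like in practice, here is a remote
fetch-and-add posted with libibverbs (assuming, as above, that the queue
pair, memory registration, and completion polling already exist):

    #include <infiniband/verbs.h>
    #include <stdint.h>
    /* Atomically add 'add' to a remote 8-byte counter; the old value is
       written into local_buf when the work request completes. */
    int rdma_fetch_add_u64(struct ibv_qp *qp, struct ibv_mr *local_mr,
                           uint64_t *local_buf, uint64_t remote_addr,
                           uint32_t rkey, uint64_t add)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,      /* receives the pre-add value */
            .length = sizeof(uint64_t),
            .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD,
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.atomic.remote_addr = remote_addr;  /* must be 8-byte aligned */
        wr.wr.atomic.rkey        = rkey;
        wr.wr.atomic.compare_add = add;          /* amount to add remotely */
        struct ibv_send_wr *bad_wr = NULL;
        return ibv_post_send(qp, &wr, &bad_wr);
    }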

This paper is mostly about proposing _new_ RDMA instructions, such as a
relative load/store, that could make remote data structures more efficient.

~~~
wtallis
NVMe defines compare and atomic compare-and-write operations, but I'm not sure
if there are any notable users of them. They certainly aren't exposed by
typical file IO abstractions. There's nothing like a fetch-and-add in any
typical storage protocol that I know of.

