Designing far memory data structures: think outside the box
127 points by aloknnikhil 10 months ago | 23 comments

 I find the term "far memory" a bit strange, especially considering that the paper starts out using the dual of "remote" and "local". The first paper in the "Prior works" section is also consistent, using the adjective "remote" applied to CPU, procedure, and memory. Is there a technical distinction that I am missing here? (Oddly enough, I just did a search for "remote memory data structures" and guess which blog post and paper come up!)
 Shouldn't we use better notation for the time complexity of the algorithms? For example, an algorithm can have `O(n^2) + rt * O(n)` time complexity (where rt is the round-trip time). Of course this expression collapses to O(n^2), but by writing it like the above you can see more clearly where the cost comes from. EDIT: on second thought, perhaps bring the rt under the O() together with n.
 I agree with the spirit, but why use O here at all? Isn't the idea that O collapses to its highest-order term? So if you don't want that, don't use it. You could use a normal function, like t(n) = f(n^2) + g(n) + rt.
 The point is that all the nice manipulations you would like to do are sound in O-notation and unsound in many other notations; what the parent wants is O(n^2) + O(m) * O(n), where m is the number of round trips.
 Writing an O in front of something doesn't mean it's in big O.
 You can use every tool in the wrong way; if you stay in simple cases it is (comparatively) hard to misuse big-O notation.
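A quick back-of-the-envelope sketch of why the split notation discussed above is useful. The constants here (1 ns per local operation, a 2 µs round trip) are invented for illustration, not measurements:

```python
# Toy cost model for t(n) = O(n^2) local work + rt * O(n) network work.
# LOCAL_OP_NS and RT_NS are assumed illustrative values.

LOCAL_OP_NS = 1    # assumed cost of one local operation (ns)
RT_NS = 2000       # assumed network round-trip time (2 us, in ns)

def total_time_ns(n: int) -> int:
    """t(n) = LOCAL_OP_NS * n^2 + RT_NS * n, in nanoseconds."""
    return LOCAL_OP_NS * n * n + RT_NS * n

# The n^2 term dominates asymptotically, but below the crossover at
# n = RT_NS / LOCAL_OP_NS (here n = 2000) the round-trip term is the
# larger cost -- exactly the information that collapses away if you
# only ever write O(n^2).
for n in (100, 1000, 10000):
    compute = LOCAL_OP_NS * n * n
    network = RT_NS * n
    label = "network-bound" if network > compute else "compute-bound"
    print(f"n={n}: compute={compute} ns, network={network} ns ({label})")
```

With these assumed constants, the workload is network-bound up to n = 2000 even though the compute term has the higher asymptotic order.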
 Curious. This is somewhat reminiscent of SGI's ccNUMA and CRAYLink/NUMALink architectures. If memory serves, IRIX (SGI's UNIX OS) had both the metrics to see the latency of access and the ability to migrate the data and/or the compute closer to each other. ccNUMA was open-sourced, and AMD uses it on their multi-core/multi-socket systems, though usually within the motherboard; not so much leaving the case and interlinking systems SGI Origin style (which is what the CRAYLink/NUMALink tech did).
 The sad thing is that HyperTransport was supposed to offer this exact feature and implement it just like SGI did with NUMAlink. There were a few boards produced with HTX slots; I have an older Tyan dual-socket Opteron board with an HTX slot kicking around.
There is a connector standard: https://www.hypertransport.org/ht-connectors-and-cables
Connectors available from Samtec: https://www.samtec.com/standards/ht3#connectors
Many-core CPUs and converged Ethernet pretty much made it moot.
 Yeah... HTX was really interesting until it was clear that 40G/100G Ethernet was going to become commodity really fast.
 This talk seems to me to follow a similar line of thinking to one I saw presented by Chandler Carruth at the 2014 C++ conference [0]. In the talk he presented a table with approximate round-trip times of various data layers.
 Is it possible to have direct remote memory access in any of the major cloud providers? I think it should be technically possible inside your virtual network, if the cloud platform and network gear were to support it.
 Generally, no. The main requirement to support this is that a RoCE or other RDMA API needs to be exposed inside the cloud VM. This requires (1) that the physical boxes have RDMA (likely universal at this point), but also (2) that the virtualized network adapter, e.g. AWS ENA, expose an RDMA API, which is much harder.
AWS did not support any kind of RDMA when I looked into it last year. Azure does, but to my understanding only in their "supercomputer partition," which is not really a cloud environment.
I've heard that AWS is looking to write an ENA backend for GASNet (a communication library), which could perhaps (?!) lead to them exposing RDMA and other low-level NIC features.
 I think the answer is: it depends. Far memory is only useful when the CPU isn't involved, which probably means the VMM underneath would have to support VM-to-VM memory access without trapping the call. I don't think that's something VMMs support today; in fact, they're actively building measures to defend against such access.
 If there's Remote DMA (RDMA) capable hardware (an InfiniBand or 10-gigabit Ethernet PCI card) and a hypervisor that supports PCI passthrough, then guest VMs can do RDMA. Not especially applicable for cloud providers trying to offer generic VPSes, but possibly useful on the backend for managed services where the per-customer VM is not exposed to the customer (e.g. AWS Redshift).
 Azure has Infiniband clusters.
 Oracle Cloud can support this, yes.
Disclaimer: I work for Oracle.
 How is far memory different from a disk?
 Disk could be considered a specific form of "far memory." In the context of this paper, though, "far memory" refers to memory outside the local system that is accessed using RDMA instructions.
 Don't disk-based data structures have similar constraints? There, too, there is no ability to ship computation, and we try to optimize for minimal data round trips.
 RDMA instructions are (1) more expressive than disk operations, from what I understand (support compare-and-swap, fetch-and-add, etc.) and (2) have different latencies and bandwidths (on the order of 1us latency, 20 GB/s BW).This paper is mostly about proposing new RDMA instructions, such as a relative load/store, that could make remote data structures more efficient.
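To make the one-sided semantics concrete, here is a toy in-process model of a far-memory region that counts round trips. The `relative_read` operation is hypothetical: it mimics the kind of extended instruction the paper proposes (dereference a far pointer in one hop) and is not an existing RDMA verb; real RDMA goes through NIC/verbs APIs, which this sketch does not touch:

```python
# Toy simulation of one-sided far-memory operations (read, write, CAS,
# fetch-and-add), each costing one round trip. Semantics only -- no
# real networking or verbs API involved.

class FarMemory:
    def __init__(self, size: int):
        self.words = [0] * size
        self.round_trips = 0

    def read(self, addr: int) -> int:
        self.round_trips += 1
        return self.words[addr]

    def write(self, addr: int, value: int) -> None:
        self.round_trips += 1
        self.words[addr] = value

    def compare_and_swap(self, addr: int, expected: int, new: int) -> int:
        self.round_trips += 1          # atomic at the far side, one round trip
        old = self.words[addr]
        if old == expected:
            self.words[addr] = new
        return old

    def fetch_and_add(self, addr: int, delta: int) -> int:
        self.round_trips += 1
        old = self.words[addr]
        self.words[addr] += delta
        return old

    def relative_read(self, addr: int, offset: int) -> int:
        # Hypothetical extended op: read *(mem[addr] + offset) in a
        # single round trip, instead of fetching the pointer first.
        self.round_trips += 1
        return self.words[self.words[addr] + offset]

mem = FarMemory(16)
mem.write(0, 8)        # slot 0 holds a "pointer" to slot 8
mem.write(8, 42)       # the value behind the pointer

# Plain one-sided ops: chasing the pointer costs two round trips...
before = mem.round_trips
value = mem.read(mem.read(0))
print(value, mem.round_trips - before)   # 42, 2 round trips

# ...while the proposed relative load does it in one.
before = mem.round_trips
value = mem.relative_read(0, 0)
print(value, mem.round_trips - before)   # 42, 1 round trip
```

The round-trip counter is the whole point: for pointer-heavy structures, every extra dereference is another ~1 µs hop, which is why a combined relative load/store can matter.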
 NVMe defines compare and atomic compare-and-write operations, but I'm not sure if there are any notable users of them. They certainly aren't exposed by typical file IO abstractions. There's nothing like a fetch-and-add in any typical storage protocol that I know of.
