
I also worked in that space several years ago. RDMA-based paging is a very tempting (and sexy) concept. There are, however, several challenges that make it very impractical in commercial scenarios:

1. Availability: what happens when a node goes down?

2. Page sharing: how should/can multiple writers be supported?

3. Kernel overhead: how do you hook into the paging subsystem without significant overhead?

4. Cost/performance: few companies would blindly jump into an IB network investment, and RoCE is not as fast as IB (although it may be catching up with 25/40/100 Gb cards).

In the end we designed an RDMA-based DHT instead. Besides, this kind of approach is becoming less relevant nowadays with solutions like Intel Optane.
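
For a sense of what the DHT side looks like: the core is just hashing a key to a (node, remote offset) pair and then issuing a one-sided RDMA READ (ibv_post_send with IBV_WR_RDMA_READ against a region registered via ibv_reg_mr). A minimal placement sketch in C, with the hash, node count, and bucket layout entirely made up and the verbs plumbing omitted:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical layout: each node exports a registered region divided into
       fixed-size buckets; the client computes where a key lives and would then
       issue a one-sided RDMA READ at that node/offset. */
    #define NUM_NODES    8
    #define BUCKETS      (1u << 20)
    #define BUCKET_SIZE  256u

    /* FNV-1a, just as a stand-in hash. */
    static uint64_t fnv1a(const char *s) {
        uint64_t h = 14695981039346656037ULL;
        while (*s) { h ^= (uint8_t)*s++; h *= 1099511628211ULL; }
        return h;
    }

    static void locate(const char *key, int *node, uint64_t *offset) {
        uint64_t h = fnv1a(key);
        *node   = (int)(h % NUM_NODES);                                /* which server holds it */
        *offset = (uint64_t)((h / NUM_NODES) % BUCKETS) * BUCKET_SIZE; /* where in its region */
    }

    int main(void) {
        int node; uint64_t off;
        locate("user:42", &node, &off);
        printf("key 'user:42' -> node %d, remote offset 0x%llx\n",
               node, (unsigned long long)off);
        return 0;
    }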




Awesome, did write-ups of what you learned appear anywhere? I would enjoy reading them.

   > 1. Availability: what happens when a node goes down?
Pretty typically you fail. One of the things we noted was that making a "shelf" that could get five 9's (99.999% uptime) was a lot easier than building a whole system that got five 9's. In the NAM design the RAM was ECC (bit protection), the individual cards were RAIDed across (card-failure protection), and there were dual power supplies and controllers (no state lives outside the memory, so you could run active/passive controllers with no synchronization protocol), plus a very basic optimistic locking protocol (cost free on success, resync on collision, which for this use case was pretty reasonable since there wasn't a lot of write contention).
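
For reference, five 9's is only about five minutes of allowed downtime a year; a quick back-of-the-envelope check:

    #include <stdio.h>

    int main(void) {
        /* Five 9's: the fraction of the year the shelf may be unavailable. */
        double availability    = 0.99999;
        double minutes_per_year = 365.25 * 24.0 * 60.0;
        double allowed_downtime = (1.0 - availability) * minutes_per_year;
        printf("allowed downtime at %.3f%%: %.2f minutes/year\n",
               availability * 100.0, allowed_downtime);
        return 0;
    }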

   > 2. Page sharing: how should/can multiple writers be supported?
The architecture was single writer, multiple readers. It was not designed (although it could probably be adapted) to prevent other network nodes from reading data or snooping data reads across the network. We talked briefly about using multicast in a shoot-down protocol to allow machines to invalidate their lock state. The idea was that nobody 'cached' these blocks, so invalidating caches wasn't an issue. That said, looking at the way AMD and Intel manage the L1 (per core)/L2 (across cores) cache layers on their chips had me wondering whether that would have been useful here. Generally, if you write and your nonce is invalid, you re-read the current nonce and re-write with the updated nonce. It's also important to note that in our case it was a 'data' cache, not an execution cache.
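
To make the nonce dance concrete, here is a local, single-process sketch of that retry loop in C. The struct layout and names are made up, and in the real system the nonce check happens on the shelf controller rather than via local atomics; this only illustrates the "cost free on success, resync on collision" shape:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical block: a nonce (version) plus payload. */
    #define BLOCK_SIZE 8192

    struct block {
        _Atomic uint64_t nonce;     /* bumped on every successful write */
        uint8_t data[BLOCK_SIZE];
    };

    /* Optimistic write: succeeds only if nobody changed the nonce since we read it.
       Cost free on success; on collision, re-read the nonce and try again. */
    static void optimistic_write(struct block *b, const uint8_t *payload) {
        for (;;) {
            uint64_t seen = atomic_load(&b->nonce);
            /* Claim the block by advancing the nonce; fails if another writer won. */
            if (atomic_compare_exchange_strong(&b->nonce, &seen, seen + 1)) {
                memcpy(b->data, payload, BLOCK_SIZE);
                return;
            }
            /* Collision: resync (re-read the current nonce) and retry. */
        }
    }

    int main(void) {
        static struct block b;
        uint8_t payload[BLOCK_SIZE] = {0x42};
        optimistic_write(&b, payload);
        printf("nonce after write: %llu\n",
               (unsigned long long)atomic_load(&b.nonce));
        return 0;
    }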

   > 3. Kernel overhead: how do you hook into the paging subsystem without significant overhead?
We were using raw Ethernet, which had lower latency and overhead than going through the (SATA/IDE) layer in Linux. Not IP, Ethernet. So to fetch a 'block' (8K bytes) you'd drop the address of that block onto the wire, the packet would go out, and the response would come back pretty much instantly (1-2 us).
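
Conceptually the fetch looks something like the sketch below. The EtherType (0x88B5, the IEEE experimental one), MAC, interface name, and request layout are all made up for illustration; it assumes jumbo frames so the 8K block fits in one response, and an AF_PACKET socket like this needs CAP_NET_RAW on Linux:

    #include <arpa/inet.h>
    #include <endian.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <net/if.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCK_PROTO  0x88B5   /* experimental EtherType; no IP stack involved */
    #define BLOCK_SIZE   8192

    int main(void) {
        int fd = socket(AF_PACKET, SOCK_DGRAM, htons(BLOCK_PROTO));
        if (fd < 0) { perror("socket"); return 1; }

        /* Address the NAM shelf by interface + MAC (both assumed here). */
        struct sockaddr_ll dst = {0};
        dst.sll_family   = AF_PACKET;
        dst.sll_protocol = htons(BLOCK_PROTO);
        dst.sll_ifindex  = if_nametoindex("eth0");
        dst.sll_halen    = ETH_ALEN;
        uint8_t shelf_mac[ETH_ALEN] = {0x02, 0x00, 0x00, 0x00, 0x00, 0x01};
        memcpy(dst.sll_addr, shelf_mac, ETH_ALEN);

        /* Request: just the 64-bit block address, network byte order. */
        uint64_t block_addr = htobe64(42);
        if (sendto(fd, &block_addr, sizeof(block_addr), 0,
                   (struct sockaddr *)&dst, sizeof(dst)) < 0)
            perror("sendto");

        /* Response: the 8K block, arriving microseconds later on real hardware. */
        uint8_t block[BLOCK_SIZE];
        ssize_t n = recv(fd, block, sizeof(block), 0);
        printf("got %zd bytes\n", n);

        close(fd);
        return 0;
    }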

   > 4. Cost/performance: few companies would blindly jump into an IB network investment, and RoCE is not as fast as IB (although it may be catching up with 25/40/100 Gb cards).
Our thought was that you got a benefit just using the existing Ethernet network you had (after all, you were mostly getting rid of disk seeks, which are milliseconds, in favor of Ethernet packets, which are microseconds; who doesn't like a 1000:1 reduction in latency? :-). In our case the pitch was 'add this box to your storage stack and get an xx% boost in throughput or a yy% decrease in latency,' where the numbers could be calculated from your workload and your read/write mix. If the sale price had 55 points of margin then we had successfully done our job :-).
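
The back-of-the-envelope math is simple enough to sketch; every number below (read fraction, hit ratio, latencies) is a made-up assumption you'd replace with your own workload:

    #include <stdio.h>

    int main(void) {
        /* Toy workload model; all numbers are illustrative assumptions. */
        double read_fraction   = 0.70;      /* read/write mix */
        double nam_hit_ratio   = 0.90;      /* reads served from the NAM shelf */
        double disk_latency_us = 5000.0;    /* ~5 ms seek */
        double nam_latency_us  = 10.0;      /* ~10 us Ethernet round trip */

        double baseline = disk_latency_us;  /* everything goes to disk */
        double with_nam =
            read_fraction * (nam_hit_ratio * nam_latency_us +
                             (1.0 - nam_hit_ratio) * disk_latency_us) +
            (1.0 - read_fraction) * disk_latency_us;

        printf("avg latency: %.0f us -> %.0f us (%.1fx)\n",
               baseline, with_nam, baseline / with_nam);
        return 0;
    }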


Here is a publication that describes a more advanced embodiment of the system I mentioned: https://www.computer.org/csdl/proceedings/sc/2015/3723/00/28... ; I don't think it explicitly compares to distributed RDMA-based paging systems, but most of its non-RDMA features were implemented to address the shortcomings of PGAS systems (hardware-accelerated or not).



