

Hardware Acceleration of Key-Value Stores [pdf]
http://zhehaomao.com/papers/memcached-fpga-accel.pdf

======
TD-Linux
Their system works well on a RISC-V, which is a deeply pipelined but in-order
architecture running at only 50MHz, with very fast memory relative to its
clock speed. I'd like to see results compared against a more reasonable CPU
implementation. For example, they could have used a Xilinx Zynq, which
includes hard Cortex-A9 cores that are out-of-order and superscalar, and
which run at a much higher frequency relative to the memory speed.

I think this paper vastly underestimates the memory constraints of
higher-performance systems.

~~~
ms705
Another comparison point (curiously not referenced in the above paper) is this
similar effort: [https://www.usenix.org/conference/hotcloud13/workshop-progra...](https://www.usenix.org/conference/hotcloud13/workshop-program/presentations/blott) [PDF/slides/video]

From a quick skim, the Xilinx work synthesises the entire network stack
(including TCP) in hardware, unlike the above study, which only supports UDP
in the HW traffic manager.

The numbers in the Xilinx paper are more attractive than those in this study,
and the paper also includes power measurements (joules per request being one
metric on which FPGAs and hardware do quite well compared to software).

That said, much of the latency gain likely comes from bypassing the OS kernel
and its generalised network stack. There is plenty of existing work
(unfortunately also not referenced in this paper) that does this and which
achieves very low latency, albeit -- to be fair -- on x86 hardware. (Examples:
Arrakis [1], IX [2] and MICA [3].)

[1] -- [https://www.usenix.org/conference/osdi14/technical-sessions/...](https://www.usenix.org/conference/osdi14/technical-sessions/presentation/peter)

[2] -- [https://www.usenix.org/conference/osdi14/technical-sessions/...](https://www.usenix.org/conference/osdi14/technical-sessions/presentation/belay)

[3] -- [https://www.usenix.org/conference/nsdi14/technical-sessions/...](https://www.usenix.org/conference/nsdi14/technical-sessions/presentation/lim)

------
toomim
At first, I was impressed they reduced latency by 10x:

    
    
        Our initial evaluation with a realistic workload shows
        a 10x improvement in latency for 40% of requests without
        adding significant overhead to the remaining requests.
    

And then I re-read this claim, did a little math, and realized that they only
reduced mean latency by 36%.

"10x for 40% of requests" is a skeezy way of saying 36%.
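A quick sanity check of that arithmetic, assuming the accelerated 40% of requests see exactly 1/10th the original latency and the other 60% are unchanged (ignoring the "not significant" overhead):

```python
# Mean latency if 40% of requests get 10x faster and 60% are unchanged.
baseline = 1.0                            # normalized mean latency
accelerated = 0.4 * (baseline / 10) + 0.6 * baseline
reduction = 1 - accelerated               # fraction of mean latency saved
print(round(reduction, 2))                # 0.36, i.e. a 36% reduction
```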

~~~
ComputerGuru
And that's not taking into account the "not significant" overhead added to the
remaining 60%.

------
pradn
Nothing revolutionary, but someone had to do it. (Of course you would take
something expensive in software and implement the logic in hardware. Of course
you can get by implementing only the most common requests: GETs to a small
subset of keys.)

I don't intend to criticize this paper in particular, but, generally, I don't
see small performance improvements in such software as very useful to society.
Academia just becomes a research arm of corporations, which might even be a
net negative for society: eroding privacy rights (Facebook et al.) or
introducing volatility into stock markets (HFT could use this paper's insight
just as fruitfully).

------
moru0011
The main source of latency will be the network. The real problem is
synchronous GET requests, since then performance == latency. Better to go
async than to reduce latency with hardware accel.
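A minimal sketch of the difference, simulated with asyncio (the hypothetical 1 ms sleep stands in for one network round trip; `get` is a stand-in, not a real memcached client):

```python
import asyncio
import time

RTT = 0.001  # hypothetical 1 ms network round trip per GET (simulated)

async def get(key):
    # Stand-in for a memcached GET; the sleep models the network round trip.
    await asyncio.sleep(RTT)
    return f"value-{key}"

async def serial(keys):
    # Synchronous pattern: each GET waits for the previous one,
    # so throughput == 1 / latency.
    return [await get(k) for k in keys]

async def pipelined(keys):
    # Async pattern: all GETs in flight at once; round trips overlap.
    return await asyncio.gather(*(get(k) for k in keys))

keys = list(range(50))

t0 = time.perf_counter()
asyncio.run(serial(keys))
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
asyncio.run(pipelined(keys))
t_pipe = time.perf_counter() - t0

print(t_serial > t_pipe)  # True: pipelining hides per-request latency
```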

~~~
yxhuvud
Not necessarily, if the request is coming from within the same data centre.
Then the network can introduce less latency than disk access does.

~~~
mmf
I have to agree with moru here. The latency of memory access will be
negligible relative to the latency of any I/O operations, even within the
same data center. In my experience, anything that involves the OS is >> 1us.
Also, beware of anything that declares a 10x performance improvement.

~~~
deadgrey19
You are assuming that a high-performance, latency-tuned system is using the OS
network stack. That would be a pretty naive implementation. Offerings from
Solarflare (OpenOnload) and Exablaze (ExaSock) transparently offload and
bypass the kernel stack. Quoted performance numbers are around 1us, of which
500ns is spent getting up and down the PCIe bus. Offloading to the NIC makes a
whole lot of sense in this space, except that it has already been done, and
with more compelling performance. The authors completely failed to take
existing work into account.

~~~
moru0011
Nope, RTT even for this kind of network hardware is >= 10 microseconds (I
deal with such stuff professionally). Still a big gain going async :-)

~~~
deadgrey19
Surprising. >10us sounds pretty slow to me.

~~~
moru0011
round trip. one way 5 to 7 micros. some special cards go down to 3 one way,
however with rtt, software usually adds some 2 micros overall

~~~
deadgrey19
Less than 1us, RTT to software
([http://exablaze.com/exanic-x4](http://exablaze.com/exanic-x4))

Less than 400ns per switch hop
([http://www.arista.com/en/products/7150-series](http://www.arista.com/en/products/7150-series))
excluding congestion.
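Putting those vendor figures together for a hypothetical uncongested path (3 switch hops each way, kernel-bypass NIC RTT as quoted above; a back-of-envelope sketch, not a measurement):

```python
# Back-of-envelope intra-datacenter RTT budget from the figures above.
nic_rtt_us = 1.0       # kernel-bypass NIC, RTT to software (ExaNIC-class)
switch_hop_us = 0.4    # cut-through switch latency per hop (7150-class)
hops_each_way = 3      # hypothetical path depth

rtt_us = nic_rtt_us + 2 * hops_each_way * switch_hop_us
print(round(rtt_us, 1))  # 3.4 us, excluding congestion and server-side work
```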

