
FPGA-based hardware acceleration for a key-value store database (2014) - breck
https://dspace.mit.edu/handle/1721.1/91829
======
hkhall
I lived with the author for several years so I am coming at the paper from a
position of knowing how he approached the work. I also see it as typical of
systems and EE/CS MEng work at MIT.

In my grad program, a Dr. Jason Dahlstrom [1] just graduated, and his work
using FPGAs to provide a limited attack surface is some of the more
interesting FPGA work I have seen. The 10k' overview is you put some of the
core features of the operating system inside FPGA hardware and define a
controlled interface to it. You leverage the mass parallelism of the FPGA to
have multiple instances running simultaneously and use a consensus protocol
for output. Further, you kill each of these processes in a pseudorandom
fashion and restart them from PROM, so if any one of them gets corrupted it
won't have an effect for long. I can't speak to all of the
details but that is what I gleaned from a couple talks I attended in the past
2-3 years. Cool stuff.

[1] [https://engineering.dartmouth.edu/people/faculty/jason-
dahls...](https://engineering.dartmouth.edu/people/faculty/jason-dahlstrom/)

 __edit: a space

------
mprovost
There have been FPGA-based startups before, the one that comes to mind is
Bluearc (implemented TCP and NFS stacks, and a filesystem in FPGAs in a
network attached storage appliance, later acquired by HDS). The problem seems
to be that it takes a massive engineering effort to keep introducing new FPGA
designs that are even faster. Otherwise your temporary advantage gets eaten up
by Moore's law. Competitors just have to wait for Intel to introduce a new CPU
and their code gets an automatic speed boost. Meanwhile their engineers are
working on features. Bluearc fell behind with delays to their new models and
the competition caught up in performance just by buying new motherboards. As a
consumer you want some guarantee that the product you're buying will continue
to be the fastest, and it's a safer bet to place on Intel than a startup.

If it's a very specific problem domain you can get a temporary advantage but
you have to stay on the treadmill constantly to stay ahead. It also seems that
there's a middle road that isn't often taken (possibly with the exception of
gaming) where you can use assembly language (at least for performance critical
sections). It's probably less difficult than designing for FPGAs and you still
get a boost with new chips, assuming Intel doesn't destroy your optimisations.

~~~
bravo22
I mostly agree with you, but in the case of a key/value store the problem is
mostly an I/O issue, not CPU performance. It is an architectural issue. An
FPGA can give you _deterministic_, _consistent_ storage at very low latency
-- even microseconds -- especially when network processing is done on the wire.

This is the same reason a lot of deep packet inspection and do-X-with-a-
packet-on-the-wire hardware is built around FPGA. Moore's law speeds up FPGAs
the same as other ASICs. Newest Xilinx FPGAs are 16nm FinFET.

It may not make much sense to deploy this for a small shop but at Google or
AWS scale having a hardware based key/value store has many advantages.

~~~
mprovost
That's why I thought of Bluearc - it was a storage system so the overall
throughput was ultimately limited by hard disks. The FPGAs were used to
optimise all of the front end protocol and filesystem processing. I imagine
the same is true of most key-value stores, at scale you're going to have to
store the values somewhere slow and optimising key lookups only gains you so
much.

My impression of FPGAs is that yes, Moore's law does help, but at this point
it's mostly by adding transistors. If your design doesn't take advantage of
them, don't they just sit idle? Whereas Intel puts a lot of effort into making
those extra transistors speed up even single threaded legacy code.

Your example of network hardware is interesting: where you can keep the whole
problem domain local, I imagine it's a better use case.

~~~
bravo22
The KV would be stored in DDR memory in this case, and on an FPGA you can
have many channels all working in parallel. In fact you could design a system
that could saturate a 10Gbps network link and give you deterministic I/O
latency of 1ms or less.
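A quick sanity check on the line-rate claim, using the 32-byte-key /
16-byte-value record format mentioned in the thesis summary elsewhere in the
thread (the per-request framing overhead here is my assumption, purely for
illustration):

```python
# Back-of-envelope: how many KV operations a 10 Gbps link can carry,
# assuming 32-byte keys, 16-byte values, and ~64 bytes of framing and
# protocol overhead per request (the overhead figure is a guess).
link_bps = 10e9
key_bytes, value_bytes, overhead_bytes = 32, 16, 64
bytes_per_op = key_bytes + value_bytes + overhead_bytes

ops_per_sec = link_bps / 8 / bytes_per_op  # bits -> bytes -> requests
print(f"~{ops_per_sec / 1e6:.0f}M ops/sec at line rate")
```

So roughly on the order of ten million operations per second is what "saturate
a 10Gbps link" implies for records of this size.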

That is difficult to match for a regular CPU system. You'll find that a lot of
high-throughput CPU systems end up using FPGAs on PCIe cards to do the same
thing in order to achieve the needed performance.

------
nickpsecurity
A few others in this space:

MapReduce on FPGA:
[http://nics.ee.tsinghua.edu.cn/people/wangyu/conference/Yi%2...](http://nics.ee.tsinghua.edu.cn/people/wangyu/conference/Yi%20Shan_ISFPGA2010.pdf)

Memcached on FPGA: [http://zhehaomao.com/papers/memcached-fpga-
accel.pdf](http://zhehaomao.com/papers/memcached-fpga-accel.pdf)

Hashtable design for 10Gbps on FPGA:
[https://people.inf.ethz.ch/zistvan/doc/paperM3C_3.pdf](https://people.inf.ethz.ch/zistvan/doc/paperM3C_3.pdf)

~~~
gricardo99
This one is a commercial solution: [http://algo-
logic.com/kvs-sc15)

~~~
nickpsecurity
I knew there would be at least one. Probably some startups, too. Thanks for
the link. Those are some badass numbers. Makes me think that was on a high-end
FPGA, though.

------
jhallenworld
How about a full line rate (2x 10G ports?) key-value store database on a
stock x86 server? It would be cheaper than the FPGA solution. OS overhead a problem?
Either write the application as an OS module, or use RDMA to avoid the OS.

One thing I realized long ago is that FPGA and CPU have very nearly the same
design constraints. CPU has cache, FPGA has block-ram. CPU has a DRAM
interface, as does an FPGA. CPU has serdes (PCIe), as does an FPGA. CPU has a
lot of overhead for Tomasulo's algorithm (it boils down to access to
many-port memories), but FPGA has a lot of overhead for configuration memory.

Except for some massively parallel simple algorithms, CPU is just as capable
as FPGA (and in fact is usually much more versatile and easier to program).

~~~
mtanski
Agreed. It seems to me like both the CPU and the FPGA would be bound by the
speed of their memories for the KV store. That's why I'm not sure why it would
be beneficial to put it on an FPGA.

Skimming their summary, conclusion & comparison I can't really find a good
answer to it. Not saying it isn't a cool project, but I don't see a practical
case, nor is it pushing anything forward.

I imagine a pretty low-end server-class x86 processor should be able to
saturate most network links. There probably is a fair overhead going from
network device, memory, OS, process, and back out. But you could have your KV
run in kernel space or as a real-time process with dedicated core(s) / network
device (memory mapped).
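One concrete piece of the "dedicated core(s)" idea on Linux is pinning the
process to a fixed CPU set. A minimal sketch using Python's
`os.sched_setaffinity` (Linux-only, and illustrative; a real deployment would
also isolate those cores from the general scheduler, e.g. with the `isolcpus`
boot parameter):

```python
import os

# Pin this process to one fixed core so the KV workload isn't migrated
# between cores by the scheduler (Linux-only API).
available = os.sched_getaffinity(0)  # cores we're currently allowed on
dedicated = {min(available)}         # pick one core as the "dedicated" one
os.sched_setaffinity(0, dedicated)   # restrict this process to it

print("now pinned to cores:", os.sched_getaffinity(0))
```

This removes one source of latency jitter, though it doesn't stop interrupts
or other kernel work from landing on the pinned core.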

~~~
bravo22
In an FPGA-based design you can have many more memory channels, and have them
be dedicated to this purpose. Packet processing can also be done on the wire,
giving you deterministic low-latency access to memory storage.
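The "deterministic" part is the interesting claim: on a CPU the average lookup
is fast, but the tail is not. A purely illustrative, software-only Python
sketch of the kind of p50-vs-p99 measurement that exposes the gap:

```python
import time

# An in-memory dict as a stand-in for the KV workload under discussion.
store = {f"key-{i}": f"value-{i}" for i in range(100_000)}

# Time individual lookups and compare median vs. tail latency.
samples = []
for i in range(10_000):
    key = f"key-{i * 7 % 100_000}"
    t0 = time.perf_counter_ns()
    _ = store[key]
    samples.append(time.perf_counter_ns() - t0)

samples.sort()
p50 = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
print(f"p50={p50}ns p99={p99}ns")  # on a CPU, p99 is often several times p50
```

A fixed-depth FPGA pipeline gives essentially the same latency on every
operation, which is the determinism being described.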

------
reynoldsbd
FPGA-based hardware acceleration for "X" seems like a really practical idea.
Any chance of seeing this kind of technology expand in the future?

~~~
amirhirsch
There's been a lot of use of FPGAs for acceleration. Netezza is a good example
of an early win in this space. FPGAs have become only slightly easier to
program in the last 10 years...

Intel bought Altera and will integrate FPGAs into the Xeon this year so expect
to see this expand.

~~~
listic
Any recent details from Intel? I'm still unsure how x86+FPGA will be working.

------
touristtam
Caustic Graphics had an FPGA board for raytracing doing something similar, I
think about 7-8 years ago, before they were bought by PowerVR
([http://www.anandtech.com/show/2752](http://www.anandtech.com/show/2752) for
old news about this)

------
vox_mollis
Interesting. Hugo de Garis tried this approach to accelerate neural network
AI performance over a decade ago.

------
jhallenworld
What these FPGA-based solutions need is TLS. Then you could have internet
facing FPGA-based servers.

~~~
exabrial
FPGA-as-a-service? Brilliant! Get the VCs

~~~
toomuchtodo
I really hope the creators of Silicon Valley read HN.

------
aprdm
The problem I see, at least for scalable solutions, is that the cloud vendor
would need to have an FPGA plugged into the ethernet or attached to an
instance. Doesn't seem very practical. At least for now.

~~~
duskwuff
Amazon manages to provide EC2 instances with GPUs. I see no reason (beyond
cost and low demand!) why another provider couldn't provision some instances
with FPGA accelerators.

~~~
aprdm
Well, for a start, if it's an FPGA that the end user can download a bitstream
to configure, it is really easy to burn the chip, and then Amazon has to
replace it.

If you sandbox it you lose some flexibility.

Also, as you said, demand and cost are the problems I think of. The regular
FPGA developer usually has no idea of how to interact with the cloud.

I worked with FPGAs for five years before going to Python / backend. Most
FPGA people develop on Windows and don't know the web at all at the
application layer.

I think FPGA as a service can be something really interesting. But it's really
hard to find the customers. GPU programmers are software developers, whereas
FPGA people aren't.

------
bogomipz
Are FPGAs that much cheaper than NAND flash though? Can't you achieve the
same performance with an FTL-aware NAND-native key-value store similar to
what Aerospike is doing, or NVMKV?

------
Shamiq
I don't have access to the full article, but the premise is interesting. At
what transaction volume would I be better off running my own FPGA hardware?

~~~
SixSigma
The link is right there on the page

[https://dspace.mit.edu/bitstream/handle/1721.1/91829/8942284...](https://dspace.mit.edu/bitstream/handle/1721.1/91829/894228451-MIT.pdf?sequence=2)

~~~
Shamiq
oh, I scrolled down and only saw the purchase option.

------
polskibus
Please put 2014 in the title

------
justaaron
very cool!

------
danbruc
TL;DR They compare one specific general purpose persistent key value store
with transaction support - Kyoto Cabinet [1] - running on one specific machine
- without further details besides running at 3.2 GHz - with an implementation
of an in-memory hash map with chaining (32 byte keys, 16 byte values) on a
Stratix IV FPGA with 8 GiB of external DDR SDRAM running at 244 MHz and find
that the FPGA is an order of magnitude or two, depending on the operation,
faster. Essentially a large but slow associative memory. They also ignore any
communication overhead between the host and the FPGA.

I don't think that this is really a relevant result; my old Core i3 with 2.5
GHz easily achieves their five million operations per second when I just use
an in-memory hash map - tested with the simplest possible C# program adding
ten million strings into a Dictionary<String, String>.

[1] [http://fallabs.com/kyotocabinet/](http://fallabs.com/kyotocabinet/)
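For reference, that back-of-envelope test looks roughly like this (scaled
down, and in Python rather than the C# the comment used, so the absolute
numbers are illustrative only):

```python
import time

N = 1_000_000  # scaled down from the ten million in the comment

# Time raw inserts of string key -> string value pairs into a hash map.
t0 = time.perf_counter()
d = {}
for i in range(N):
    d[str(i)] = str(i)
elapsed = time.perf_counter() - t0

print(f"{N / elapsed / 1e6:.1f} M inserts/sec")
```

The point being made is that commodity hardware already reaches the
operations-per-second range the thesis reports, at least for the in-memory
part of the workload.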

~~~
CyberDildonics
This may have been true in the past, but now that individual cores are
starting to reach diminishing returns from more transistors, the game will be
different, though it might turn out the same.

If a key-value store uses concurrency well it might continue to benefit from
better hardware, and likewise if an FPGA key-value store builds in more
concurrency it might be able to gain substantially in overall throughput.

~~~
danbruc
I would say there is not much to be gained here; a key-value store is just
too simple to benefit much from custom logic. You hash the key, then you read
from or write to a memory location based on the resulting hash. It is pretty
likely that with a fast hash function the bottleneck on modern hardware is
the memory bandwidth, and the same would very likely apply to an FPGA
implementation unless you go some extra way to also create some exceptionally
fast memory interface. You could likely get some speed-up with dedicated
logic to calculate the hashes, but what good is that if you afterwards have
to wait for the memory?
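The mechanism being described is small enough to sketch directly: a minimal
hash map with chaining (the structure the thesis implements, per the summary
upthread), written here in Python with an explicit bucket array. Illustrative
only; the point stands that the logic is trivial and memory traffic dominates.

```python
class ChainedKVStore:
    """Minimal hash map with chaining: hash the key, index into a
    bucket array, then scan the short chain in that bucket."""

    def __init__(self, n_buckets=1024):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # Hash the key and map it onto a bucket - this is the only
        # "computation" in the whole store.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._bucket(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                 # overwrite an existing entry
                chain[i] = (key, value)
                return
        chain.append((key, value))       # otherwise extend the chain

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None


store = ChainedKVStore()
store.put("user:1", "alice")
store.put("user:2", "bob")
```

Every `put`/`get` is one hash plus one or two memory touches, which is why the
memory interface, not the logic, sets the ceiling on either a CPU or an FPGA.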

~~~
CyberDildonics
What I'm saying is that if you look beyond reading or writing one key, as a
whole what you want is throughput.

Throughput is going to mean concurrency, and that could mean a lot more
happening with the same resources in an FPGA, since it is dedicated.

