
Achieving 10Gbps Line-rate Key-value Stores with FPGAs [pdf] - kiwidrew
http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/11566-hotcloud13-blott.pdf
======
_wmd
I've been thinking about something similar for most of the year, but not using
custom hardware. With a zero-copy database like LMDB combined with userspace
networking (e.g. netmap or PACKET_MMAP), it should already be possible to
saturate 1Gbit/s with a single core, even with just 100-byte values. With a
little more effort it's easy to imagine a design where most of the grunt work
is kept away from the CPU entirely: network cards commonly support
scatter-gather IO, and a database such as LMDB avoids copying or (mostly)
even touching record data.
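
For the zero-copy half, a minimal sketch of the idea, assuming an
already-open LMDB environment and a connected socket (the "VALUE <len>"
response header is made up):

    /* mdb_get() hands back a pointer straight into the memory map, so
     * the record bytes go to the socket via scatter-gather without
     * ever landing in an application buffer. */
    #include <lmdb.h>
    #include <stdio.h>
    #include <sys/uio.h>

    int serve_get(MDB_env *env, MDB_dbi dbi, int fd,
                  const void *kbuf, size_t klen)
    {
        MDB_txn *txn;
        MDB_val key = { klen, (void *)kbuf }, val;
        char hdr[32];
        int rc = mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        if (rc) return rc;
        rc = mdb_get(txn, dbi, &key, &val);    /* val.mv_data points    */
        if (rc == MDB_SUCCESS) {               /* into the mmap, no copy */
            int n = snprintf(hdr, sizeof hdr, "VALUE %zu\r\n", val.mv_size);
            struct iovec iov[2] = {
                { hdr, (size_t)n },            /* tiny header, stack     */
                { val.mv_data, val.mv_size },  /* value, straight from map */
            };
            writev(fd, iov, 2);                /* kernel gathers both    */
        }
        mdb_txn_abort(txn);  /* read txn: val pointer valid until here */
        return rc;
    }

The record bytes never pass through an application buffer; whether the
kernel copies them again on the way to the NIC is the other half of the
problem, which is where netmap or PACKET_MMAP come in.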

Given a 2x quad-core box and a decent network card, I'm pretty sure you could
approach or exceed 10GbE without requiring original research or an insane
amount of code, especially as record size increases. Software could at least
compete on throughput, although the paper also indicates latency benefits
for the FPGA design.

~~~
jasonwatkinspdx
It's getting a little stale, but you may find this paper interesting:
http://simula.no/research/nd/publications/Simula.ND.8/simula_pdf_file

With anything that provides stable pointers to values like LMDB it should be
possible to output to the network extremely efficiently. One thing to keep in
mind with LMDB is that its performance isn't so great with purely random keys
such as SHA hashes. I'm fooling around on something similar myself targeting
this use, and so far critbit tries are looking good (though I have yet to
implement benchmarks I'm confident of).
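
For the curious, the lookup side of a critbit trie is tiny; this is
essentially djb's reference scheme (naming is mine), and the caller still
has to memcmp() the full key at the leaf, since the walk only ever tests
the discriminating bits:

    #include <stddef.h>
    #include <stdint.h>

    /* Internal node: tagged pointers (low bit set = internal), plus a
     * byte index and an inverted mask identifying the critical bit. */
    typedef struct {
        void *child[2];
        uint32_t byte;      /* which byte of the key to test */
        uint8_t otherbits;  /* all bits set except the critical one */
    } cb_node;

    /* Returns the candidate leaf (a NUL-terminated key here); the
     * caller must then compare the full key against its query. */
    static const char *cb_find(void *root, const uint8_t *key, size_t len)
    {
        uint8_t *p = root;
        if (!p) return 0;
        while ((uintptr_t)p & 1) {                 /* internal node */
            cb_node *q = (cb_node *)(p - 1);       /* strip the tag */
            uint8_t c = q->byte < len ? key[q->byte] : 0;
            p = q->child[(1 + (q->otherbits | c)) >> 8];
        }
        return (const char *)p;
    }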

I haven't dug through the OP paper yet, but my guess is that while the
performance numbers are interesting, the cost of the FPGA itself would make it
less desirable unless you really did need the latency
performance/predictability.

~~~
scott_wilson46
One thing to note is that if, as the paper says,

"Many well-known web-sites deploy distributed in-memory caches, which are
implemented with standard DRAM on a large array of x86 servers, to reduce
access load on databases"

then a single server with an FPGA card in it is going to be a lot cheaper than
an array of x86 servers. Plus I would imagine that real estate in a data
centre is another significant cost, so having one server with an attached FPGA
would also be an advantage in this situation.

~~~
jasonwatkinspdx
That may be true. I think you'd have to look at the numbers to see for sure.
The paper uses a Virtex-6 with 24GB of DDR on a PCIe card. The evaluation
board for one of those is about $2k, though obviously the FPGA itself would
hopefully be lower in price; I can't seem to find any open pricing for the
Virtex line. They say using a larger FPGA would allow addressing more DRAM,
but of course that raises your chip costs as well. You'll have to add to that
small-volume PCB costs, as well as the cost of the hosting server node.

Compare all that to a $2k 1U Xeon with 32GB of RAM and dual 10GbE and the
prices aren't that dissimilar. Also, one is a research project while the other
is a commodity that's a phone call and a FedEx away.

My guess would be you'd need the FPGA design in high volume for the costs to
work out. So it'd be good as an appliance product for a vendor, but unless
you're google/facebook huge trying to build this sort of thing yourself would
likely be a boondoggle. The exception is the latency performance: you'd have
trouble duplicating that with the commodity server design. So if you're doing
HFT or somesuch maybe this is interesting again.

A middle road might be something like the Fast Array of Wimpy Nodes design,
but with a lot more DRAM, and hacking up the kernel to get true zero-copy I/O
with minimal syscall overhead.
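
Though as _wmd mentions, PACKET_MMAP already gets you a fair way there
without kernel surgery: the kernel maps an RX ring into your address space,
and one poll() wakes you for a whole batch of frames. A very rough sketch
(no error handling, arbitrary ring geometry, and in practice you'd bind()
the socket to a specific interface):

    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <poll.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        struct tpacket_req req = {
            .tp_block_size = 4096, .tp_block_nr = 64,   /* 256KB ring */
            .tp_frame_size = 2048, .tp_frame_nr = 128,  /* 2 frames/block */
        };
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof req);
        char *ring = mmap(0, (size_t)req.tp_block_size * req.tp_block_nr,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        for (unsigned i = 0;; i = (i + 1) % req.tp_frame_nr) {
            struct tpacket_hdr *h =
                (struct tpacket_hdr *)(ring + i * req.tp_frame_size);
            while (!(h->tp_status & TP_STATUS_USER)) {
                struct pollfd p = { .fd = fd, .events = POLLIN };
                poll(&p, 1, -1);  /* one syscall covers many frames */
            }
            /* frame bytes sit at (char *)h + h->tp_mac, length
             * h->tp_snaplen; parse and answer the request in place */
            h->tp_status = TP_STATUS_KERNEL;  /* hand the slot back */
        }
    }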

------
colanderman
Comparing to a TilePRO running memcached is disingenuous – the TilePRO running
Linux doesn't have the zero-copy network I/O the GX does, and shuttling
packets through Linux adds a ton of context switching overhead.

You could do this – heck, probably even twice this – easily on a processor
with an integrated NIC like the TileGX or the Broadcom XLP. Is there a market
for these things? Should I quit my job and make this product?

------
jdmt
"the chosen approach resolves collisions only up to a certain degree. If,
however unlikely, a further key maps to the same hash index, then the request
simply fails and an appropriate response is generated. We believe this to be
an acceptable behaviour for what is essentially equiv-alent to a cache
full/miss."

So it's more like a cache than a key/value store? Nothing wrong with that but
to me a key/value store implies that data isn't lost in the case of a hash
collision. Pretty cool stuff nonetheless.
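
In software terms, the policy they describe reads like a fixed-associativity
hash table whose insert can simply fail; something like this (the
associativity and all names are my guesses, not the paper's):

    #include <stddef.h>
    #include <stdint.h>

    #define SLOTS_PER_BUCKET 8   /* hypothetical collision limit */

    struct entry  { uint64_t key, value; int used; };
    struct bucket { struct entry slot[SLOTS_PER_BUCKET]; };

    /* Returns 0 on success, -1 once every slot for this hash index is
     * taken -- the "request simply fails" case quoted above. */
    int kvs_set(struct bucket *tab, size_t nbuckets,
                uint64_t key, uint64_t value)
    {
        struct bucket *b = &tab[key % nbuckets];  /* stand-in for the hash */
        for (int i = 0; i < SLOTS_PER_BUCKET; i++) {
            struct entry *e = &b->slot[i];
            if (!e->used || e->key == key) {      /* free slot or overwrite */
                e->key = key; e->value = value; e->used = 1;
                return 0;
            }
        }
        return -1;  /* collision limit hit: caller sends an error response */
    }

A normal key/value store would grow or chain at that point; here the data
just doesn't get stored, which is exactly cache semantics.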

~~~
justincormack
Yes, more memcached-like. It's a research project, not a production system.

~~~
Nrsolis
What I'm not getting from this is why it's any different from a properly
implemented TCAM.

Key/value lookups are something existing networking hardware already does
very well; there are merchant-silicon chipsets that do this on-die.

There was once an evolving market for "network processor units", but it looks
like that has all cooled down. IIRC, there was an effort to do something akin
to a database lookup on an NPU, but the tradeoff between flexibility and what
the NPUs could provide wasn't amenable to the kind of work required to bring
those solutions to market.

~~~
bakul
TCAMs have three problems for such applications: 1) limited key length
(usually less than 64 bytes); 2) they're power-hungry, as you are comparing
against every key at once; 3) limited size -- try getting a 4GB TCAM!

------
MasterScrat
Mmh, Amazon should offer "hardware accelerated KVS" along with its EC2
instances. Maybe even using ASICs?

It could act like a super super fast memcached.

~~~
thezilch
AWS doesn't even offer great hardware solutions for networking needs, such as
their ELB and NAT/PAT offerings. It'd be brilliant if they'd start exposing
ASIC-backed instances.

------
recuter
My only question is when can I use this myself?

