I lived with the author for several years, so I'm coming at the paper knowing how he approached the work. I also see it as typical of systems and EE/CS MEng work at MIT.
In my grad program a Dr. Jason Dahlstrom [1] just graduated, and his work using FPGAs to present a limited attack surface is some of the more interesting FPGA work I have seen. The 10,000-foot overview: you put some of the core features of the operating system inside FPGA hardware and define a controlled interface to it. You leverage the massive parallelism of the FPGA to have multiple instances running simultaneously and use a consensus protocol for output. Further, you kill each of these instances in a pseudorandom fashion and restart them from PROM, so if any one of them gets corrupted it won't have an effect for long. I can't speak to all of the details, but that is what I gleaned from a couple of talks I attended in the past 2-3 years. Cool stuff.
[1] https://engineering.dartmouth.edu/people/faculty/jason-dahls...
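A toy software rendering of the majority-vote idea, as I understood it from the talks (this sketch is mine, not Dahlstrom's design; the replica count and outputs are made up):

```c
/* Toy illustration: N redundant instances compute the same output;
 * a voter forwards the strict-majority value, so a single corrupted
 * instance gets outvoted. On the FPGA these would be parallel
 * hardware instances, not function calls. */
#include <stdio.h>

#define N 3 /* hypothetical replica count */

static int vote(const int out[N]) {
    for (int i = 0; i < N; i++) {
        int agree = 0;
        for (int j = 0; j < N; j++)
            if (out[j] == out[i]) agree++;
        if (agree * 2 > N) return out[i]; /* strict majority wins */
    }
    return -1; /* no majority: fail safe */
}

int main(void) {
    int outputs[N] = {42, 42, 7}; /* one replica corrupted */
    printf("voted output: %d\n", vote(outputs)); /* prints 42 */
    return 0;
}
```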
There have been FPGA-based startups before; the one that comes to mind is Bluearc (they implemented the TCP and NFS stacks, and a filesystem, in FPGAs inside a network-attached storage appliance; later acquired by HDS). The problem seems to be that it takes a massive engineering effort to keep introducing new FPGA designs that are even faster. Otherwise your temporary advantage gets eaten up by Moore's law. Competitors just have to wait for Intel to introduce a new CPU and their code gets an automatic speed boost; meanwhile their engineers are working on features. Bluearc fell behind with delays to their new models, and the competition caught up in performance just by buying new motherboards. As a consumer you want some guarantee that the product you're buying will continue to be the fastest, and Intel is a safer bet than a startup.
If it's a very specific problem domain you can get a temporary advantage, but you have to stay on the treadmill constantly to stay ahead. It also seems that there's a middle road that isn't often taken (possibly with the exception of gaming) where you can use assembly language (at least for performance-critical sections). It's probably less difficult than designing for FPGAs and you still get a boost with new chips, assuming Intel doesn't destroy your optimisations.
I mostly agree with you, but in the case of a key/value store the problem is mostly an I/O issue, not CPU performance. It is an architectural issue. An FPGA can give you deterministic, consistent storage at very low latency -- even microseconds -- especially when network processing is done on the wire.
This is the same reason a lot of deep packet inspection and do-X-with-a-packet-on-the-wire hardware is built around FPGAs. Moore's law speeds up FPGAs just as it does other chips; the newest Xilinx FPGAs are 16nm FinFET.
It may not make much sense to deploy this for a small shop, but at Google or AWS scale, having a hardware-based key/value store has many advantages.
That's why I thought of Bluearc: it was a storage system, so the overall throughput was ultimately limited by hard disks. The FPGAs were used to optimise all of the front-end protocol and filesystem processing. I imagine the same is true of most key-value stores; at scale you're going to have to store the values somewhere slow, and optimising key lookups only gains you so much.
My impression of FPGAs is that yes, Moore's law does help, but at this point it's mostly by adding transistors. If your design doesn't take advantage of them, don't they just sit idle? Whereas Intel puts a lot of effort into making those extra transistors speed up even single-threaded legacy code.
Your example of network hardware is interesting: where you can keep the whole problem domain local, I imagine it's a better use case.
The KV data would be stored in DDR memory in this case, and on an FPGA you can have many channels all working in parallel. In fact you could design a system that could saturate a 10Gbps network link and give you deterministic I/O latency of 1 ms or less.
That is difficult to match with a regular CPU system. You'll find that a lot of high-throughput CPU systems end up using FPGAs on PCIe cards to do the same thing in order to achieve the needed performance.
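A rough software-flavored sketch of the striping idea (the real design would be HDL; the channel count and bucket sizes here are made up): pick the DDR channel from a few hash bits, so independent requests land on independent controllers and proceed in parallel.

```c
/* Toy model, not HDL: stripe a hash table across NCHAN independent
 * DDR channels by hash bits; lookups that map to different channels
 * can be serviced concurrently by separate memory controllers. */
#include <stdint.h>
#include <stdio.h>

#define NCHAN 4               /* hypothetical channel count */
#define PER_CHAN (1u << 20)   /* hypothetical buckets per channel */

static uint32_t channel_of(uint64_t h) { return (uint32_t)(h & (NCHAN - 1)); }
static uint32_t bucket_of(uint64_t h)  { return (uint32_t)((h >> 2) & (PER_CHAN - 1)); }

int main(void) {
    uint64_t h = 0x9e3779b97f4a7c15ull; /* some key hash */
    printf("channel %u, bucket %u\n", channel_of(h), bucket_of(h));
    return 0;
}
```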
> It also seems that there's a middle road that isn't often taken (possibly with the exception of gaming) where you can use assembly language (at least for performance-critical sections).
Is using assembly really a meaningful advantage, though? I think FPGAs have the advantage of being a completely different architecture, which allows different algorithms with non-constant-factor speedups. Assembly is going to give at most something like a constant-factor 5% speedup, and that's being pretty optimistic (assuming a C baseline).
That's not entirely true... C imposes some basic architectural structures (heap, stack, calling conventions, etc.) which can impose non-linear efficiency costs.
All of those costs, except for the heap-allocation-related ones, are roughly per-function-call costs. If a function gets inlined, an optimizer can get rid of them.
Also, don't some link-time optimizers help with pessimizations related to calling conventions?
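To make the inlining point concrete, here's a minimal sketch (the function names are mine, for illustration): once step() is inlined, the argument passing, call, and return that the calling convention mandates disappear entirely.

```c
/* The per-call costs (argument registers, call/ret, frame setup) only
 * exist if the call survives optimization; once step() is inlined,
 * the loop body is just a multiply and an add. */
#include <stdio.h>

static inline long step(long acc, long x) {
    return acc + x * x; /* trivially inlinable */
}

long sum_squares(const long *a, long n) {
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc = step(acc, a[i]); /* gcc/clang at -O2 inline this call */
    return acc;
}

int main(void) {
    long a[] = {1, 2, 3, 4};
    printf("%ld\n", sum_squares(a, 4)); /* prints 30 */
    return 0;
}
```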
BlueArc is just one example, and arguably they were crushed by a lot of other factors.
In general, Moore's law works just as well for FPGAs as it does for CPUs and ASICs. FPGA expertise is definitely in short supply, but, as an example, you see a lot of FPGAs in HFT.
I'm struggling to think of other FPGA startups, do you know of any? It seems like if it were the silver bullet that it's often made out to be, then there would be more examples.
I knew there would be at least one. Probably some startups, too. Thanks for the link. Those are some badass numbers. Makes me think that was on a high-end FPGA, though.
How about a full line-rate (2x 10G ports?) key-value store database on a stock x86 server? It would be cheaper than the FPGA solution. OS overhead a problem? Either write the application as an OS module, or use RDMA to avoid the OS.
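For scale, a minimal sketch of the stock-server starting point (plain UDP sockets; the OS-module or RDMA variants suggested above would replace the recv/send path, not the KV logic; the port and packet handling are placeholders):

```c
/* Minimal UDP key-value responder skeleton: receive a request
 * datagram, touch the in-memory table, echo a reply. The kernel
 * socket path is exactly the overhead that RDMA or an in-kernel
 * module would bypass. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(11211); /* memcached-style port, arbitrary */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(s, (struct sockaddr *)&addr, sizeof addr) < 0) return 1;

    char buf[1500];
    for (;;) {
        struct sockaddr_in peer;
        socklen_t plen = sizeof peer;
        ssize_t n = recvfrom(s, buf, sizeof buf, 0,
                             (struct sockaddr *)&peer, &plen);
        if (n <= 0) continue;
        /* ... parse GET/PUT here and consult the in-memory table ... */
        sendto(s, buf, (size_t)n, 0, (struct sockaddr *)&peer, plen);
    }
    close(s); /* not reached */
    return 0;
}
```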
One thing I realized long ago is that FPGA and CPU have very nearly the same design constraints. A CPU has cache; an FPGA has block RAM. A CPU has a DRAM interface, as does an FPGA. A CPU has serdes (PCIe), as does an FPGA. A CPU has a lot of overhead for Tomasulo's algorithm (it boils down to access to many-ported memories), but an FPGA has a lot of overhead for configuration memory.
Except for some massively parallel simple algorithms, a CPU is just as capable as an FPGA (and in fact is usually much more versatile and easier to program).
Actually, the design constraints are quite different.
1. FPGAs have between one and two orders of magnitude higher on-chip memory bandwidth. Imagine lighting up all of the block RAMs on your chip. That blows away the memory performance of a CPU; it's not even close (rough numbers in the sketch after this list).
2. FPGAs have significantly more parallel compute resources (if your problem isn't purely memory bandwidth limited). Even if it's not stupidly data-parallel, there are several clever design choices you can make to extract a lot of parallelism.
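Back-of-envelope on point 1, with hedged numbers (the block count is a ballpark for a mid-size Stratix IV, not a datasheet figure; the clock is the 244 MHz the paper quotes):

```c
/* Illustrative only: aggregate on-chip bandwidth if every block RAM
 * is lit up simultaneously. Contrast with era-typical CPU DRAM
 * bandwidth of a few tens of GB/s. */
#include <stdio.h>

int main(void) {
    double blocks     = 1000;   /* ~M9K block count, ballpark */
    double ports      = 2;      /* true dual-port */
    double width_bits = 36;     /* max port width */
    double clk_hz     = 244e6;  /* clock quoted in the paper */
    double bytes_per_s = blocks * ports * (width_bits / 8) * clk_hz;
    printf("~%.0f GB/s aggregate on-chip\n", bytes_per_s / 1e9);
    return 0;
}
```

That works out to roughly 2 TB/s, consistent with the order-of-magnitude claim above.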
It is, as you point out, much harder to program an FPGA. God save me if I have to run SP&R tools again in my life. However, if you have the resources and know what you are doing you will get 10-1000x more performance from an FPGA. Just look at any FCCM paper from the last 20 years.
The more apt comparison is FPGAs vs ASSPs. Those are what have been eating FPGA's lunch for the last decade.
Agreed. It seems to me like both CPU and FPGA would be bound by the speed of their memories for the KV store. That's why I'm not sure why it would be beneficial to put it on an FPGA.
Skimming their summary, conclusion, and comparison, I can't really find a good answer to it. I'm not saying it isn't a cool project, but I don't see a practical use case, nor does it push anything forward.
I imagine a pretty low-end server-class x86 processor should be able to saturate most network links. There probably is fair overhead going from network device to memory to OS to process and back out, but you could have your KV store run in kernel space or as a real-time process with dedicated core(s) and a memory-mapped network device.
In an FPGA-based design you can have many more memory channels and have them be dedicated to this purpose. Packet processing can also be done on the wire, giving you deterministic low-latency access to memory storage.
Here is a device that recreates many classic computers with an FPGA: https://github.com/mist-devel/mist-board/wiki. I can also think of http://redpitaya.com, which is primarily an FPGA-based O-scope and SDR, but you can download "apps" for it to work as other common electronics workbench devices.
There's been a lot of use of FPGAs for acceleration. Netezza is a good example of an early win in this space. FPGAs have become only slightly easier to program in the last 10 years...
Intel bought Altera and will integrate FPGAs into the Xeon this year, so expect to see this expand.
Caustic Graphics had an FPGA board for raytracing doing something similar, I think about 7-8 years ago, before they were bought by PowerVR (http://www.anandtech.com/show/2752 for old news about this).
The problem I see, at least for scalable solutions, is that the cloud vendor would need to have an FPGA plugged into the ethernet or attached to an instance.
Doesn't seem very practical. At least for now.
Amazon manages to provide EC2 instances with GPUs. I see no reason (beyond cost and low demand!) why another provider couldn't provision some instances with FPGA accelerators.
Well, for a start, if it's an FPGA that the end user can configure by downloading a bitstream, it is really easy to burn the chip, and then Amazon has to replace it.
If you sandbox it you lose some flexibility.
Also, as you said, demand and cost are the problems I think of. The typical FPGA developer usually has no idea how to interact with the cloud.
I worked with FPGAs for five years before moving to Python / backend work. Most FPGA people develop on Windows and don't know the web application layer at all.
I think FPGA as a service can be something really interesting. But it's really hard to find the customers. GPU programmers are software developers, whereas FPGA people aren't.
Are FPGAs that much cheaper than NAND flash, though? Can't you achieve the same performance with an FTL-aware, NAND-native key-value store similar to what Aerospike is doing, or NVMKV?
TL;DR: They compare one specific general-purpose persistent key-value store with transaction support - Kyoto Cabinet [1] - running on one specific machine - no details given besides a 3.2 GHz clock - against an implementation of an in-memory hash map with chaining (32-byte keys, 16-byte values) on a Stratix IV FPGA with 8 GiB of external DDR SDRAM running at 244 MHz, and find that the FPGA is an order of magnitude or two faster, depending on the operation. Essentially a large but slow associative memory. They also ignore any communication overhead between the host and the FPGA.
I don't think that this is really a relevant result; my old Core i3 at 2.5 GHz easily achieves their five million operations per second when I just use an in-memory hash map - tested with the simplest possible C# program adding ten million strings into a Dictionary<String, String>.
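A rough C analogue of that experiment, shaped to the paper's parameters (32-byte keys, 16-byte values, chaining; the table size, hash choice, and key generation are mine, for illustration):

```c
/* Sketch, not the paper's code: a chained hash map with 32-byte keys
 * and 16-byte values, inserting ten million synthetic keys to
 * ballpark software ops/sec. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define KEY_LEN 32
#define VAL_LEN 16
#define NBUCKETS (1u << 24) /* 16M buckets, power of two */

typedef struct entry {
    uint8_t key[KEY_LEN];
    uint8_t val[VAL_LEN];
    struct entry *next; /* chain on collision */
} entry;

static entry *buckets[NBUCKETS];

/* FNV-1a: a simple, adequate hash for short fixed-size keys. */
static uint64_t fnv1a(const uint8_t *p, size_t n) {
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

static void put(const uint8_t *key, const uint8_t *val) {
    uint64_t b = fnv1a(key, KEY_LEN) & (NBUCKETS - 1);
    entry *e = malloc(sizeof *e);
    memcpy(e->key, key, KEY_LEN);
    memcpy(e->val, val, VAL_LEN);
    e->next = buckets[b];
    buckets[b] = e;
}

static const uint8_t *get(const uint8_t *key) {
    uint64_t b = fnv1a(key, KEY_LEN) & (NBUCKETS - 1);
    for (entry *e = buckets[b]; e; e = e->next)
        if (memcmp(e->key, key, KEY_LEN) == 0)
            return e->val;
    return NULL;
}

int main(void) {
    enum { N = 10 * 1000 * 1000 };
    uint8_t key[KEY_LEN] = {0}, val[VAL_LEN] = {0};
    clock_t t0 = clock();
    for (uint32_t i = 0; i < N; i++) {
        memcpy(key, &i, sizeof i); /* distinct synthetic keys */
        put(key, val);
    }
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    uint32_t probe = 12345;
    memcpy(key, &probe, sizeof probe);
    printf("%.1f M inserts/sec, lookup 12345: %s\n",
           N / sec / 1e6, get(key) ? "hit" : "miss");
    return 0;
}
```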
This may have been true in the past, but now that individual cores are starting to reach diminishing returns from more transistors, the game will be different, though it might turn out the same.
If a key-value store uses concurrency well it might continue to benefit from better hardware, and likewise, if an FPGA key-value store builds in more concurrency, it might be able to improve substantially in overall throughput.
I would say there is not much to be gained here; a key-value store is just too simple to benefit much from custom logic. You hash the key, then you read from or write to a memory location based on the resulting hash. It is pretty likely that with a fast hash function the bottleneck on modern hardware is memory bandwidth, and the same would very likely apply to an FPGA implementation unless you go some extra way to also create an exceptionally fast memory interface. You could likely get some speed-up with dedicated logic to calculate the hashes, but what good is that if you afterwards have to wait for the memory?