
FPGAs in Data Centers - rbanffy
https://queue.acm.org/detail.cfm?id=3231573
======
ansible
Xilinx has been doing some interesting work with things like implementing
memcached in an FPGA:

[https://www.usenix.org/sites/default/files/conference/protec...](https://www.usenix.org/sites/default/files/conference/protected-
files/blott_hotcloud13_slides.pdf)

There are also efforts to implement SQL on an FPGA:

[https://www.nextplatform.com/2016/08/24/baidu-takes-fpga-
app...](https://www.nextplatform.com/2016/08/24/baidu-takes-fpga-approach-
accelerating-big-sql/)

~~~
godelmachine
Are there any papers on the efforts to port SQL to FPGA?

------
glangdale
From long experience in working on regular expressions, at least some of the
FPGA acceleration projects rely on comparison to fairly daft software
implementations to yield a speedup. I haven't worked on these particular
problems, but I think there's a tendency to regard these systems as magic.

Remember that the restructuring to make things work well on FPGA
(regularization, finding lots of independent parallel work to do, removing
branches, etc) also works _really_ well in software. One of the best things
that helped my high-performance software crafting on CPU was spending some
time with GPGPU programming; I imagine that the discipline of working with
FPGAs would be similar.

I've seen a lot of FPGA stuff go by that seems like the competing software
implementation should have been pipelined, unrolled, and generally made less
stupid. So unless the software implementation has an independent force keeping
it honest (e.g. it's a production system being used elsewhere), be careful.
Also be careful of the tendency of FPGA papers to find One Weird Case where
the software does badly and benchmark mainly on that.

~~~
slivym
I work in the FPGA industry and I've supervised university projects, seen lots
of research, hired PhD grads and stuff. I've got to agree that one of the most
frustrating parts of FPGA research is that it's almost uniformly done in
comparison to the more laughable software implementations.

FPGAs in industry are used for a very small number of specific applications:
smart NICs, early stages of wireless networks (5G whilst the standards are
being hammered out), military (where you need high performance with no
consideration of cost), embedded, and pro video (where the custom I/O is
essential).

Generally, unless you're doing something that fits those applications well,
the FPGA will not look good, and there are the same mistakes made in research
time after time. For data centre these are twice as bad. The four really
glaring ones are always:

* Quoting performance without taking into account the time to get the data onto the FPGA (generally via a PCI-E link that kills any chance of winning vs. CPU).

* Assuming performance scales linearly to fill up an FPGA (Full FPGAs can't run as fast as 10% full ones without significant effort)

* Profiling only the part of the problem or set of data that your code performs well for and not reporting how it transfers onto corner cases that CPUs would obviously do well for.

* Comparing against some noddy s/w solution when you've literally spent the last 3 years of your PhD optimizing the FPGA solution, and doing no background reading to see what the state of the art s/w does.

It just destroys a load of the research we see. The good applications are far
less exciting, but the MS Catapult is a great example - the reason it's
competitive is because they're using the custom I/O of the FPGA to move data
around really quick, it's like a custom smart NIC almost.
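
To make the first pitfall in the list above concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name, the numbers, and the simplification that transfers and compute do not overlap. It is not taken from any of the papers.

```python
def end_to_end_speedup(cpu_seconds, fpga_kernel_seconds,
                       bytes_in, bytes_out, link_bytes_per_second):
    """Speedup over the CPU once host<->device transfers are counted.

    Assumes (hypothetically) that transfer and compute are fully
    serialized; overlapping them would improve the result.
    """
    transfer_seconds = (bytes_in + bytes_out) / link_bytes_per_second
    return cpu_seconds / (fpga_kernel_seconds + transfer_seconds)


# Hypothetical workload: 1 GB of input, CPU takes 1.0 s, FPGA kernel 0.1 s,
# and an assumed link that sustains 8 GB/s.
kernel_only = 1.0 / 0.1
with_transfer = end_to_end_speedup(1.0, 0.1, 1e9, 0, 8e9)
print(f"kernel-only speedup: {kernel_only:.1f}x")
print(f"end-to-end speedup:  {with_transfer:.2f}x")
```

With these made-up numbers the headline kernel-only figure is more than twice the end-to-end one, which is the gap a paper hides when it quotes only on-chip time.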

~~~
glangdale
Thanks for the detailed reply. My post may have seemed like partly-informed
sour grapes but your information fits in well with what I've seen.

In a number of the applications I've seen the other killers are the fact that
not only do you have the transfer costs you mentioned _to_ the device, you
also:

1\. Have to get information back _from_ the device - and in regular expression
matching this might be 1 match in 1000 or 1 match in 5 if you're unlucky, and

2\. Have to have a lot of parallelism to hit peak performance, yielding great
throughput but so-so latency. At Sensory Networks during our hardware stage,
we had a "2 Gbps regex accelerator" (hah) that didn't even hit that modest
number on a single stream - it actually required 14 streams or so running at
142Mbps.
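
As a quick sanity check on those figures, a minimal sketch of the arithmetic. It takes "2 Gbps" to mean 2000 Mbps, and the function name is just for illustration.

```python
def aggregate_mbps(streams, per_stream_mbps):
    """Total throughput when independent streams run in parallel."""
    return streams * per_stream_mbps


HEADLINE_MBPS = 2000    # the advertised "2 Gbps", assumed to mean 2000 Mbps
PER_STREAM_MBPS = 142   # per-stream rate quoted above
STREAMS = 14            # "14 streams or so"

total = aggregate_mbps(STREAMS, PER_STREAM_MBPS)
single_stream_fraction = PER_STREAM_MBPS / HEADLINE_MBPS

print(f"aggregate: {total} Mbps")
print(f"one stream sees {single_stream_fraction:.1%} of the headline rate")
```

So the headline number is only reachable in aggregate; any single stream gets a small fraction of it.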

Many of the same sins are repeated for GPGPU.

The other thing that I notice is that the "noddy s/w solution" sometimes is
the only thing out there. I looked at some accelerator work on Random Forest
inference (not training) and - wow - all the RF implementations are naive.
There are a lot of s/w tasks out there that no-one has bothered to optimize
with any effort at all.

However, when your adviser says "make a GPGPU/FPGA thesis" I think a smart PhD
just goes and does that, rather than sinking 6 months into building a really
great s/w comparison. :-)

------
f3f3_
TL;DR

They present four papers in total that shed light on developments and
deployments of FPGAs in data centers:

Project Catapult (Bing/Microsoft):

First paper:

> [...] provides insights into the development process of FPGA base systems.
> The target application is accelerating the Bing web search engine. [...] The
> paper shows how such a system can improve the throughput of document ranking
> or reduce the tail latency for such operations by 29 percent.

Second paper:

> The web-search accelerator was based on a unit of 48 machines, a result of
> the decision to use a torus network to connect the FPGAs to each other. Not
> only is the cabling of such units cumbersome, but it also limits how many
> FPGAs can talk to each other and requires routing to be provided in each
> FPGA, complex procedures to achieve fault tolerance, etc. [..] Hence, the
> second paper describes the solution being deployed in Azure: the FPGA is
> placed between the NIC (network interface controller) of the host and the
> actual network, as well as having a PCI connection to the host.

The other two papers debate whether FPGAs could actually be implemented using
ASICs or other dedicated hardware. To do so, they discuss how FPGAs can be
used in MySQL with a SSD+FPGA storage engine.

------
johnflan
It's interesting that Microsoft Bing uses clusters of FPGAs to calculate a
'page rank' for search

~~~
doh
Do you have any source for that? Would be interested to read more.

~~~
csteegz
It’s talked about a bit here, along with links to the papers about the
architecture.

[https://www.microsoft.com/en-us/research/project/project-
cat...](https://www.microsoft.com/en-us/research/project/project-catapult/)

We also now have an FPGA accelerated ResNet-50 as a service on Azure with more
models in the pipeline. (I work on the Azure Machine Learning side of this
stuff)

