
Microsoft Supercharges Bing Search With Programmable Chips - l31g
http://www.wired.com/2014/06/microsoft-fpga/
======
chollida1
This has been going on in the HFT space for a number of years. FPGAs are used
to parse data feeds, as the sheer volume of quotes overwhelms most systems.

In fact, after moving the networking stack into userland and using InfiniBand
networking gear, it's probably the third most common optimization I've seen or
heard of for HFT systems.

Here's a quick, but surprisingly accurate description of a common HFT setup:
[http://www.forbes.com/sites/quora/2014/01/07/what-is-the-tec...](http://www.forbes.com/sites/quora/2014/01/07/what-is-the-technology-stack-like-behind-a-high-frequency-trading-platform/)

Someone had asked about the number of quotes that need to be parsed. From
Forbes...

> Mr. Hunsader: The new world is now a war between machines. For some
> perspective, in 1999 at the height of the tech craze, there were about 1,000
> quotes per second crossing the tape. Fast forward to 2013 and that number
> has risen exponentially to 2,000,000 per second.

Keep in mind that the "tape" is the slow SIP line that exchanges use to keep
prices in sync and to serve customers that don't use the exchanges' direct
feeds. I.e., it aggregates the quotes from all venues and throws away a lot of
them because they can't be parsed in time or didn't change the top-level quote.

With 40+ venues that an HFT fund can take feeds from, 2,000,000 per second is
a fraction of what a cutting-edge HFT shop would have to parse to keep up with
all of them.

The typical setup is that you'll run strategies across multiple machines, with
a gateway machine that directs each quote to the appropriate one. The biggest
problem is the speed at which the quotes arrive.

Unlike a web request, which you can take 300 milliseconds to parse and respond
to, if you don't parse and respond to a quote in under 10-20 microseconds
you've already lost.

So the FPGA transition is about making sure there is never a backlog of quotes
or any pause in the handling of bursty quotes. This can't be overstated.
Margins are squeezed so tightly now that your algo will appear to be working
fine until a big burst of quotes happens, your machines can't keep up, and
when the dust settles 20 seconds later you'll find you lost $5,000, which
might be your entire day's profit from that one symbol/algo pair.
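
To give a rough, illustrative sense of the gateway/worker split (not a real
feed handler; the Quote struct and the symbols are made up, and a production
system would use kernel bypass or an FPGA rather than Go channels and
goroutines):

    package main

    import (
        "fmt"
        "time"
    )

    // Quote is a made-up, minimal market-data update; real feeds are
    // binary, exchange-specific protocols.
    type Quote struct {
        Symbol   string
        Bid, Ask float64
        Recv     time.Time
    }

    func main() {
        feed := make(chan Quote, 1024) // stands in for the exchange feed handler

        // "Gateway": route each quote to the worker that owns its symbol.
        workers := map[string]chan Quote{
            "AAPL": make(chan Quote, 256),
            "MSFT": make(chan Quote, 256),
        }
        for sym, ch := range workers {
            go func(sym string, ch <-chan Quote) {
                for q := range ch {
                    // Strategy logic would run here; we just report how long
                    // the quote sat between receipt and handling.
                    fmt.Printf("%s handled after %v\n", sym, time.Since(q.Recv))
                }
            }(sym, ch)
        }

        go func() {
            for q := range feed {
                if ch, ok := workers[q.Symbol]; ok {
                    select {
                    case ch <- q: // fast path
                    default: // worker backed up: the "backlog during a burst" failure mode
                        fmt.Println("dropped quote for", q.Symbol)
                    }
                }
            }
        }()

        // Simulate a small burst of quotes.
        for i := 0; i < 10; i++ {
            feed <- Quote{Symbol: "AAPL", Bid: 100, Ask: 100.01, Recv: time.Now()}
        }
        time.Sleep(100 * time.Millisecond)
    }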

~~~
MrBuddyCasino
I'm curious if these kinds of setups are also used in other high-freq
scenarios.

For instance, I could imagine techniques like a userland network stack and
exclusively reserved cores being used in services like WhatsApp. I think they're
currently on a highly customized Erlang stack and are able to handle huge
numbers of queries per machine. Any insiders here with a good background
story?

~~~
SEJeff
You imagine correctly :)

I work in finance and have been in electronic trading for the past 6.5-ish years.

------
dmmalam
Is there any breakthrough in programming these things? From a quick glance at
the paper, it seems like the kernels are still hand-written in Verilog, though
there seems to be significant software infrastructure for integrating the
FPGAs into cluster management systems.

I think easily and uniformly programming disparate compute devices (CPUs,
SIMD, GPUs, FPGAs, ISPs, DSPs, and eventually quantum) is the next BIG problem
in programming languages. Several Haskell projects seem promising, but these
still tend to be nice DSLs that generate Verilog or shaders.

On most mobile SoCs the CPU takes up an increasingly small part of the die;
there are 10GbE network cards with FPGAs on them; and we've got the Parallella.
The hardware exists; we sorely need the next breakthrough programming
environment.

~~~
sliverstorm
CPUs, GPUs, DSPs, FPGAs... they are so different, it's hard to say they ever
_could_ be programmed uniformly.

~~~
14113
I think they could - in fact, I'm starting a PhD in a similar area soon! It's
a matter of providing a high-level-enough programming language (e.g. Haskell)
and a smart-enough compiler that can automatically parallelise sections; with
the right middleware/compiler back end it should be possible!

~~~
sliverstorm
I buy CPU+GPU unification (in fact I highly anticipate it) and I also buy that
a DSP could function as a dynamic coprocessor, as they are often programmed in
C.

But my day job revolves around HDLs, and it is my opinion that a higher level
language isn't the answer. Fifty years from now it might be, but state-of-the-
art HDL compilers just aren't good enough yet. It's like C compilers a few
decades ago, where you had to insert some inline ASM in your code here and
there because the compilers were still developing.

So I guess what I'm saying is you can't target CPU, GPU, DSP, & FPGA in one
compiler until we can master targeting FPGA even just by itself.

~~~
reeses
To make a comparison for anyone who hasn't programmed FPGAs (especially on the
path to etching silicon), placement is extraordinarily important. Not only can
(will) you make a highly non-optimal layout, FPGAs are not orthogonal. You'll
spend a lot of time trying to route the bits that need to talk to each other
via direct connections as much as possible instead of going through a
general-purpose routing line or worse.

Depending on the make and model of FPGA, you will have "large" areas in which
you either can't or don't want to plop logic.

You can have a pretty netlist that validates and simulates correctly (although
you'll eventually end up dealing with Cadence, who seem to have the right hand
side of the bugs per line of code curve locked up) but still takes weeks or
months of that inline ASM work to make it competitive with a rack of Xeons.
The edit/compile/debug cycle is not quick by any means past a trivial number
of gates.

Dealing with that junk is why IP blocks are so attractive, but you end up on
the road to structured ASICs and that just leads to misery.

------
l31g
[http://research.microsoft.com/apps/pubs/default.aspx?id=2120...](http://research.microsoft.com/apps/pubs/default.aspx?id=212001)

------
jacquesm
I'd rather have _better_ results than _faster_ results. Faster is only
important once you have the quality problem worked out; "first make it good,
then make it fast" has been a long-time mantra. The reason is that it is
usually very expensive to make something really fast, because optimizing code
is hard and costly (case in point: they use custom hardware here).

The upside is that they're doing something innovative, but if Bing really
wants to steal market share from Google they have to improve on quality, not
on speed. I'd rather see them take 10 seconds and deliver an absolutely
perfect answer than take 0.001 seconds and deliver something not on par with
Google but 10 times faster.

Impressive to see them backing an exotic solution like this though, and if and
when they _do_ get it to be better than Google it may pay off.

Are there any developments like this underway at Google?

~~~
samirahmed
There was no mention of where exactly these would go. I doubt it would be on
the machines serving responses online ... since the bottleneck is often in IO.

Being able to index, process, and learn from data faster can lead to faster
iteration and speed up batch or offline jobs, which in turn could improve
relevance.

~~~
jacquesm
> There was no mention of where exactly these would go.

From the article:

"The system takes search queries coming from Bing and offloads a lot of the
work to the FPGAs, which are custom-programmed for the heavy computational
work needed to figure out which webpages results should be displayed in which
order. "

That looks like they're in the interactive path somewhere.

------
azakai
Actually, I already find Bing quite fast. Compared to Google search, Bing
results tend to load a little faster but to be a little lower in quality.

FPGAs may make Bing twice as fast as it already is, but I don't feel like it
needs to be faster. Although, I guess if it's faster they can trade that off
for more work done, and so better results, perhaps.

~~~
l31g
If they make Bing 2X faster, then they can roughly cut the number of servers
they need in half. They measure "speed" by the number of requests that can be
fulfilled in Z amount of time.
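
(Illustrative numbers only: if serving the current query load takes 1,000
machines at some fixed requests-per-second each, doubling per-machine
throughput means roughly 500 machines for the same load. That's the sense in
which 2X "faster" halves the fleet.)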

------
valarauca1
Sounds like there is a market niche starting to develop for FPGAs in server
applications. I'm not saying rush out and make FPGAs that are powered by and
communicate over PCIe; I'm just saying there may be a market developing for
it.

Especially with good open source dev tools.

~~~
sliverstorm
I think you would just about have to start from square one if you want an open
source FPGA toolchain. I don't believe such a toolchain exists _at all_ right
now.

So what I'm saying is, forget about developing a PCI-e FPGA board. If you want
an open source toolchain for it, you better start there, because that's going
to be 99.9% of the effort.

------
th0ma5
Seems like I read about Google doing this almost 10 years ago. I know that IBM
has the Netezza product which also uses FPGAs for accelerating queries.

------
zackmorris
I've been ranting about the inadequacies of mainstream processors for almost
twenty years. I remember even back in the late 90s, seeing processors that
were 3/4 cache memory, with barely any transistors used for logic. It's surely
worse than that now, with the vast majority of logic gates on chips just
sitting around idle. To put it in perspective, a typical chip today has close
to a billion transistors (the Intel Core i7 has 731 million):

[https://en.wikipedia.org/wiki/Transistor_count](https://en.wikipedia.org/wiki/Transistor_count)

A bare-minimum CPU that can do at least one operation per clock cycle probably
has between 100,000 (SPARC) and 1 million (the PowerPC 602) transistors and
runs at 1 watt. So chips today have 1,000 to 10,000 times that number of
transistors, but do they run that much faster? No, of course not.

And we can even take that a step further, because those chips suffered from
the same inefficiencies that hinder processors today. A full adder takes 28
(yes, twenty eight) transistors. Could we build an ALU that did one simple
operation per clock cycle with 1000 transistors? 10,000? How many of those
could we fit on a billion transistor chip?
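
(Back of the envelope with the numbers above: at 10,000 transistors per ALU, a
billion-transistor budget is room for roughly 100,000 of them, and at 1,000
transistors each it's closer to a million, before you spend anything on
interconnect, registers, or control.)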

Modern CPUs are so many orders of magnitude slower than they could be with a
parallel architecture that I’m amazed data centers even use them. GPUs are
sort of going the FPGA route with 512 cores or more, but they are still a
couple of orders of magnitude less powerful than they could be. And their
proprietary/closed nature will someday relegate them to history, even with
OpenCL/CUDA because it frankly sucks to do any real programming when all you
have at your disposal is DSP concepts.

I really want an open source billion transistor FPGA running at 1 GHz that
doesn’t hold my hand with a bunch of proprietary middleware, so that I can
program it in a parallel language like Go or MATLAB (Octave). There would be
some difficulties with things like interconnect but that’s what things like
map reduce are for, to do computation in place rather than transferring data
needlessly. Also with diffs or other hash-based algorithms, only portions of
data would need to be sent. And it’s time to let go of VHDL/Verilog because
it’s one level too low. We really need a language above them that lets us wire
up basic logic without fear of the chip burning up.

And don’t forget the most important part of all: since the chip is
reprogrammable, cores can be multi-purpose, so they store their configuration
as code instead of hardwired gates. A few hundred gates can reconfigure
themselves on the fly to be ALUs, FPUs, anything really. So instead of wasting
vast swaths of the chip on something stupid like cache, that area can go to
storing logic layouts.

What would I use a chip like this for? Oh I don’t know, AI, physics
simulations, formula discovery, protein folding, basically all of the problems
that current single-threaded architectures can't touch in a cost-effective
manner. The right architecture would bring computing power we don't expect to
see for 50 years to right now. I have a dream of someday being able to run
genetic algorithms that take hours to complete in a millisecond, and being
able to guide the computer rather than program it directly. That was sort of
the promise with quantum computing but I think FPGAs are more feasible.

~~~
jacquesm
> I really want an open source billion transistor FPGA running at 1 GHz that
> doesn’t hold my hand with a bunch of proprietary middleware, so that I can
> program it in a parallel language like Go or MATLAB (Octave).

I fail to see how not having proprietary middleware will enable you to program
an FPGA in Go or MATLAB.

FPGAs are not well suited to being programmed in conventional languages (and
Go is not a parallel language; it employs a clever model that may give you
that impression, but under the hood it is fairly regular, nothing you could
not achieve using co-routines and threads in a different language, maybe
syntactically cleaner and easier to understand). MATLAB might be more feasible
but still not an easy match. You could conceivably make an FPGA co-processor,
though, that you access from those languages through some library.
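
A trivial sketch of the point (illustrative only): goroutines are cheap
concurrent tasks multiplexed by the Go runtime onto GOMAXPROCS OS threads,
i.e. a scheduling model, not the "everything runs at once" spatial parallelism
of configured hardware:

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        // GOMAXPROCS is how many OS threads execute Go code simultaneously;
        // everything beyond that is time-sliced by the runtime scheduler.
        fmt.Println("hardware parallelism in use:", runtime.GOMAXPROCS(0))

        var wg sync.WaitGroup
        for i := 0; i < 10000; i++ {
            wg.Add(1)
            go func(n int) { // 10,000 concurrent tasks, not 10,000 parallel circuits
                defer wg.Done()
                _ = n * n
            }(i)
        }
        wg.Wait()
    }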

If you want to program an FPGA in something that _looks_ like a high level
language syntactically there are a number of solutions:

[http://stackoverflow.com/questions/5603285/c-to-hardware-com...](http://stackoverflow.com/questions/5603285/c-to-hardware-compiler-hll-synthesis)

And some newer developments. But none of those change the essence of the chip.

I'm not sure how to convey the difference between an FPGA and something like a
high level language any better than 'imagine all your code executing at once'.

FPGAs are something you tell what to be, not what to do (and high level
languages tell CPUs what to do, not what to be).

~~~
joshu
Yeah, this rant makes me wonder if the poster actually understands how these
things work in deep detail.

~~~
zackmorris
I used VHDL for a semester back in 98 or 99 for my ECE degree. I remember it
being extraordinarily brittle compared to something like C, because there were
so many ways to trigger unreliability with the wrong clock edges etc. So for
example we did almost everything as state machines instead of math. The code
was basically unmaintainable by today's standards but was an important
learning tool. As I recall, we wrote a VGA signal generator, and if you had
less than half a dozen incorrect pixels onscreen, you were doing pretty well.
So in fairness, I USED to know this stuff in nauseating detail.

~~~
joshu
Yes, this is because VHDL is not code. It's a different paradigm entirely. The
FPGA doesn't "run" the VHDL, etc.

The units in an FPGA are just a little bit of logic with a bunch of
connectivity to local busses.

I haven't done much VHDL lately (ECE '96 here) but I do remember how the stuff
works. But then again I focused mainly on CPU architecture.

------
l31g
[http://www.theregister.co.uk/2014/06/16/microsoft_catapult_f...](http://www.theregister.co.uk/2014/06/16/microsoft_catapult_fpgas/)

------
samfisher83
This is cool and all, but instead of spending money on this project why not
try to improve their search engine, or just not spend the money at all, since
Bing loses so much money? I don't mind waiting an extra half second for my
search results. It seems more like their thinking is: we've got a lot of
engineers we're paying a bunch of money to, let's do some project.

~~~
Scaevolus
This move saves money, since the servers process queries more efficiently--
“Right off the bat we can chop the number of servers that we use in half,”
Burger says.

"Just improving their search engine" isn't some simple task. Google has a
head-start measured in thousands of man-years. Closing that gap takes a great
number of smart people a great amount of time.

I assume that Google is improving more slowly than Bing, since their
algorithms and systems are more mature and closer to the "asymptotically ideal
search engine".

