Microsoft Supercharges Bing Search With Programmable Chips (wired.com)
122 points by l31g 1048 days ago | 63 comments



This has been going on in the HFT space for a number of years. FPGAs are used to parse data feeds, as the sheer volume of quotes overwhelms most systems.

In fact, after moving the networking stack into userland and using InfiniBand networking gear, it's probably the third most common optimization I've seen or heard of for HFT systems.

Here's a quick, but surprisingly accurate description of a common HFT setup: http://www.forbes.com/sites/quora/2014/01/07/what-is-the-tec...

Someone asked about the number of quotes that need to be parsed. From Forbes:

> Mr. Hunsader: The new world is now a war between machines. For some perspective, in 1999 at the height of the tech craze, there were about 1,000 quotes per second crossing the tape. Fast forward to 2013 and that number has risen exponentially to 2,000,000 per second.

Keep in mind that the "tape" is the slow SIP line that exchanges use to keep prices in sync and to show quotes to customers that don't use the exchanges' direct feeds, i.e. it aggregates the quotes from all venues and throws away a lot of them because they can't be parsed in time or didn't change the top-level quote.

With 40+ venues from which an HFT fund can get feeds, 2,000,000 per second is a fraction of what a cutting-edge HFT shop has to parse to keep up with all venues.

The typical setup is that you run strategies across multiple machines, so you have a gateway machine that directs each quote to the appropriate machine. The biggest problem is the speed at which the quotes arrive.

Unlike a web request, which you can take 300 milliseconds to parse and answer, if you don't parse and respond to a quote in under 10-20 microseconds you've already lost.

So the FPGA transition is about making sure there is never a backlog of quotes or any pause in the handling of bursty quotes. This can't be stressed enough. Margins are squeezed so tightly now that your algo will appear to be working fine until a big burst of quotes arrives and your machines can't keep up; when the dust settles 20 seconds later, you'll find you lost $5,000, which might be your entire day's profit from that one symbol/algo pair.
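To make the gateway fan-out concrete, here's a minimal sketch in C of routing quotes by symbol hash; the helper names and machine count are invented for illustration, not taken from any real trading system:

  /* Toy sketch of the gateway fan-out described above: hash each quote's
   * symbol so every update for a given symbol lands on the same strategy
   * machine. Names and constants are illustrative only. */
  #include <stdint.h>

  #define NUM_STRATEGY_MACHINES 16

  /* FNV-1a hash of the symbol string */
  static uint32_t symbol_hash(const char *symbol)
  {
      uint32_t h = 2166136261u;
      while (*symbol) {
          h ^= (uint8_t)*symbol++;
          h *= 16777619u;
      }
      return h;
  }

  /* Pick the machine that owns this symbol's quote stream. */
  static int route_quote(const char *symbol)
  {
      return (int)(symbol_hash(symbol) % NUM_STRATEGY_MACHINES);
  }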


I'm curious if these kinds of setups are also used in other high-frequency scenarios.

For instance, I could imagine using techniques like a userland network stack and exclusively reserved cores in services like WhatsApp. I think they're currently on a highly customized Erlang stack and are able to handle huge numbers of queries per machine. Any insiders here with a good background story?


You imagine correctly :)

I work in finance and have been in electronic trading for the past 6.5-ish years.


They're used as "First level triggers" at some LHC experiments to pick out which collisions are interesting enough to store data from.


> sheer volume of quotes overwhelms most systems.

I'm curious about what volumes we're talking about here. I'm not an electrical engineer, so I don't understand when FPGAs are needed (or justify the development cost) over distributed software systems.


It's usually the latency requirements that rule out distributed systems. The total amount of data isn't particularly crazy; it's just that it comes in very concentrated bursts.


What does moving the networking stack into userland accomplish?


Avoiding system call overhead and the GPL. I suspect most of the performance improvement has nothing to do with being in userspace, but people use that term as a shorthand to refer to a whole collection of techniques like polling, batching, not performing routing/filtering, etc.


Close, but it really has nothing at all to do with the GPL. It has to do with reducing context switches (between userspace and kernelspace) and reducing hardware interrupts. When the name of the game is latency, context switches really hurt you. Batching is awful for latency and great for throughput, FYI.

In fact, OpenOnload[1] and VMA[2], two of the most common vendor-provided kernel-bypass tools in use, are both open source!

[1] http://www.openonload.org/

[2] http://www.mellanox.com/page/software_vma?mtag=vma
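To make the polling-instead-of-interrupts idea concrete, here's a rough sketch using ordinary BSD sockets. Real kernel-bypass stacks like OpenOnload and VMA intercept the socket API and handle the NIC in userspace; this only illustrates the spin-on-the-socket part:

  /* Busy-poll a non-blocking socket instead of sleeping in the kernel:
   * trade a pegged core for lower, more predictable latency. This is only
   * the polling idea, not how OpenOnload/VMA are actually implemented. */
  #include <sys/types.h>
  #include <sys/socket.h>
  #include <errno.h>

  void poll_loop(int fd, void (*on_packet)(const char *buf, ssize_t len))
  {
      char buf[2048];

      for (;;) {
          ssize_t n = recv(fd, buf, sizeof buf, MSG_DONTWAIT);
          if (n > 0)
              on_packet(buf, n);        /* handle the quote immediately */
          else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
              break;                    /* real error: give up */
          /* EAGAIN: nothing yet, keep spinning -- no blocking, no wakeup */
      }
  }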


Is there any breakthrough in programming these things? From a quick glance at the paper, it seems like the kernels are still hand-written in Verilog, though there seems to be some significant software infrastructure for integrating the FPGAs into cluster management systems.

I think easily and uniformly programming disparate compute devices (CPUs, SIMD, GPUs, FPGAs, ISPs, DSPs, and eventually quantum) is the next BIG problem in programming languages. Several Haskell projects seem promising, but these still tend to be nice DSLs that generate Verilog or shaders.

On most mobile SoCs the CPU takes up an increasingly smaller part of the die; there are 10GbE network cards with FPGAs on them; and we've got the Parallella. The hardware exists; we sorely need the next breakthrough programming environment.


> I think easily and uniformly programming disparate compute devices (CPUs, SIMD, GPUs, FPGAs, ISPs, DSPs, and eventually quantum) is the next BIG problem in programming languages.

The biggest challenge to producing a unified programming model across all of these is dealing with moving data around. CPU vs GPU vs FPGA all have very different memory access characteristics. Gary Sabot was the first to try to tackle the problem of dealing with memory layout/locality in a language with what he called the Paralation model (http://www.amazon.com/The-Paralation-Model-Architecture-Inde...) and a lot of that work carried over into NESL (http://www.cs.cmu.edu/~scandal/nesl.html). Hadoop is a very lobotomized and hard to use modern version of these ideas.

I wouldn't knock DSLs - a lot of people are having good results using DSLs to program FPGAs (for example that's pretty much what http://www.novasparks.com/ does - http://lisp-univ-etc.blogspot.com/2013/06/lisp-hackers-marc-...).


Yeah, I totally agree that memory is at the heart of the problem. Not only in needing a unified view, but also in the bandwidth constraints that seem to be the bottleneck in so much code (the von Neumann bottleneck).

I think a real solution would not be software-only; it would include a hardware component (or be tuned to work with a particular chip): an SoC with a few fat CPU cores, many lite cores, a GPU, and some FPGA fabric, all sharing a large cache subsystem and memory.


Convey Computer has something that looks like a useful FPGA development model. Basically, you move critical sections of your code into hardware using a hardware description language (with some help from the SDK), and when writing the software you just call that section almost normally; the Convey system manages the transfer of control between CPU and FPGA as needed.

Using that, you create a unique "instruction set" that can generally fit a number of applications in the same area. With that, programmers can program normally and still get a decent speedup.

BTW, it seems that Convey has a deal with Dell and has also developed a memcached acceleration kit. This stuff will probably be available only to big cloud providers, giving them another big advantage over the DIY guys.
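To make the "call that section almost normally" model concrete, here's a hedged sketch of the dispatch pattern; none of these names are Convey's actual API, they're purely illustrative:

  /* Hypothetical offload-call pattern: the application calls one function
   * and a thin dispatch layer decides whether the FPGA version or the
   * plain software version runs. Not Convey's real API. */
  #include <stddef.h>

  typedef void (*dot_product_fn)(const double *a, const double *b,
                                 double *out, size_t n);

  /* Software fallback, always available. */
  static void dot_product_sw(const double *a, const double *b,
                             double *out, size_t n)
  {
      double acc = 0.0;
      for (size_t i = 0; i < n; i++)
          acc += a[i] * b[i];
      *out = acc;
  }

  /* Set to the accelerator entry point if the FPGA "personality" is loaded. */
  static dot_product_fn dot_product_hw = NULL;

  /* What the rest of the program calls, "almost normally". */
  void dot_product(const double *a, const double *b, double *out, size_t n)
  {
      (dot_product_hw ? dot_product_hw : dot_product_sw)(a, b, out, n);
  }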


Altera and Xilinx are starting to support compiling OpenCL to FPGAs. Not sure how efficient that is, but it's at least a step towards a unified programming environment across various types of devices.
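For anyone who hasn't seen the flow: you write kernels in OpenCL C, like the toy vector add below, and the vendor toolchain compiles the same source into a pipelined FPGA circuit instead of GPU threads (names here are illustrative):

  /* Minimal OpenCL C kernel: the same source can target a GPU or, with the
   * vendor FPGA toolchains, be turned into a hardware pipeline. */
  __kernel void vec_add(__global const float *a,
                        __global const float *b,
                        __global float *out,
                        const unsigned int n)
  {
      size_t i = get_global_id(0);   /* one work-item per element */
      if (i < n)
          out[i] = a[i] + b[i];
  }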


CPUs, GPUs, DSPs, FPGAs... they are so different, it's hard to say they ever could be programmed uniformly.


I think they could - in fact, I'm starting a PhD in a similar area soon! It's a matter of providing a high-level-enough programming language (e.g. Haskell) and a smart-enough compiler that can automatically parallelise sections; with the right middleware/compiler back end it should be possible!


I buy CPU+GPU unification (in fact I highly anticipate it) and I also buy that a DSP could function as a dynamic coprocessor, as they are often programmed in C.

But my day job revolves around HDLs, and it is my opinion that a higher level language isn't the answer. Fifty years from now it might be, but state-of-the-art HDL compilers just aren't good enough yet. It's like C compilers a few decades ago, where you had to insert some inline ASM in your code here and there because the compilers were still developing.

So I guess what I'm saying is you can't target CPU, GPU, DSP, & FPGA in one compiler until we can master targeting FPGA even just by itself.


To make a comparison for anyone who hasn't programmed FPGAs (especially on the path to etching silicon): placement is extraordinarily important. Not only can (and will) you produce a highly non-optimal layout, FPGAs are not orthogonal. You'll spend a lot of time trying to route the bits that need to talk to each other via direct connections as much as possible, instead of going through a general-purpose line or worse.

Depending on the make and model of the FPGA, you will have "large" areas where you either can't or don't want to plop logic.

You can have a pretty netlist that validates and simulates correctly (although you'll eventually end up dealing with Cadence, who seem to have the right-hand side of the bugs-per-line-of-code curve locked up), but it still takes weeks or months of that inline-ASM-style work to make it competitive with a rack of Xeons. The edit/compile/debug cycle is not quick by any means past a trivial number of gates.

Dealing with that junk is why IP blocks are so attractive, but you end up on the road to structured ASICs and that just leads to misery.


Fair I suppose - but I still think it doesn't contradict my point. I believe that what we need is a general (and smart!) enough language, along with advanced enough and intelligent enough back ends, which should solve the problem. It will obviously take a while, but it's a really quite large area of research at the moment!


That is true of Haskell at this point too. Strictness annotations and unboxed types are sometimes necessary to squeeze out maximum performance.

I think the idea would be to keep the backdoors, the question is how to make them still usable in that case.


In theory it should be possible, but doing so in the general case is a ridiculously hard problem, as sliverstorm hints at.

If I was back at school, this is something I would definitely spend time on.

Some possible ideas:

- Writing different LLVM backends to target the various hardware. You would still need some runtime to schedule/organise the tasks.

- A language with a very powerful type system (i.e. fully dependent) that allows a high level of programmer intent to be extracted by the compiler (e.g. we are only doing multiply-adds on a specific range of doubles, so generate a custom FPGA configuration for just that).

- Just writing a bunch of libraries that provide nice APIs for the various functions, kind of like NumPy.


I don't think Haskell is suitable. In fact, there are very few languages that are. The reason is that you need dependent types in order to compete with VHDL when it comes to type safety. I only know of two general-purpose languages that support that: Idris and Nimrod (kinda... it's coming along).

Your work sounds interesting. Contact me at skyfex@gmail.com if you want to exchange more ideas


I agree that Haskell isn't smart enough, but I think it's a step in the right direction, in that a lot of its constructs seem geared toward abstracting away the details of how to perform a computation and toward describing what the computation should produce.


Maybe we can open a group for this discussion. Anyway, my email is in my HN profile if you want to exchange ideas as well.


Chalmers University has an HDL built as Haskell modules, called Lava: http://www.cse.chalmers.se/edu/year/2012/course/_courses_201... I don't think it's used outside of the university, though.


Berkeley has Chisel where you write something like C + Scala and you get Verilog for hardware instantiation and C for verification https://chisel.eecs.berkeley.edu/


It actually is just Scala. The whole thing is a Scala DSL.


I thought you could write some modules in C


I suppose you could, in practice, write modules for simulation in C or C++ and hook it up to the C++ code that the compiler generates. I doubt this is very useful in practice since you can't generate any Verilog from it.


Have a look at OpenCL; I know it can run on at least FPGAs and GPUs.


Better direct programming would be a big help, but check out 'dynamic method migration' sometime. There is still work to be done on the performance, but the concepts have at least been demonstrated; quite a while ago, actually.


This was in the comments: https://www.synflow.com/


Just learn an HDL (or both); they are not hard to understand.



I'd rather have better results than faster results. Faster is only important once you have the quality problem worked out; "first make it good, then make it fast" has been a long-time mantra. The reason is that it is usually very expensive to make something really fast, because optimizing code is hard and expensive (case in point: they use custom hardware here).

The upside is that they're doing something innovative, but if Bing really wants to steal market share from Google they have to improve on their quality, not their speed. I'd rather see them take 10 seconds and deliver an absolutely perfect answer than take 0.001 seconds and deliver something that's 10 times faster but not on par with Google.

Impressive to see them backing an exotic solution like this though, and if and when they do get it to be better than Google it may pay off.

Are there any developments like this underway at Google?


>>The upside is that they're doing something innovative but if Bing really wants to steal marketshare from Google they have to improve on their quality, not on their speed.

Well, there are two ways to increase bottom-line profits. One is to increase revenue by stealing market share. The other is to decrease costs.

This would apparently help cut costs by decreasing the number of servers and saving on electricity.


Their quality is sometimes really bad. Check out the first result for their own portal to all their online services:

http://www.bing.com/search?q=portal.microsoftonline.com

edit: Interestingly, for me it went from an "offer not available" link to urlwebz to the actual sign-in in the hour since I posted that. If nothing else, the search results are changing fast.


There was no mention of where exactly these would go. I doubt it would be on the machines serving responses online ... since the bottleneck is often in I/O.

Being able to index, process, and learn from data faster can lead to faster iteration and improve the speed of batch or offline jobs, which in turn could improve relevance.


> There was no mention of where exactly these would go.

From the article:

"The system takes search queries coming from Bing and offloads a lot of the work to the FPGAs, which are custom-programmed for the heavy computational work needed to figure out which webpages results should be displayed in which order. "

That looks like they're in the interactive path somewhere.


I just wish they'd get rid of that annoying image on their homepage or at least let me turn it off.


I googled that for you ;):

http://www.bing.com/?rb=0


Actually, I already find Bing quite fast. Compared to Google search, Bing results tend to load a little faster but to be a little lower in quality.

FPGAs may make Bing twice as fast as it already is, but I don't feel like it needs to be faster. Although I guess if it's faster, they can trade that off for more work done and thus better results, perhaps.


If they make Bing 2x faster, then they can roughly cut the number of servers they need in half. They measure "speed" by the number of requests that can be fulfilled in a given amount of time.


Sounds like a market niche is starting to develop for FPGAs in server applications. I'm not saying rush out and make FPGA boards that are powered by and communicate over PCIe; I'm just saying there may be a market developing for it.

Especially with good open source dev tools.


I think you would just about have to start from square one if you want an open source FPGA toolchain. I don't believe such a toolchain exists at all right now.

So what I'm saying is: forget about developing a PCI-e FPGA board. If you want an open source toolchain for it, you'd better start there, because that's going to be 99.9% of the effort.


Seems like I read about Google doing this almost 10 years ago. I know that IBM has the Netezza product which also uses FPGAs for accelerating queries.


I've been ranting about the inadequacies of mainstream processors for almost twenty years. I remember even back in the late 90s, seeing processors that were 3/4 cache memory, with barely any transistors used for logic. It's surely worse than that now, with the vast majority of logic gates on chips just sitting around idle. To put it in perspective, a typical chip today has close to a billion transistors (the Intel Core i7 has 731 million):

https://en.wikipedia.org/wiki/Transistor_count

A bare-minimum CPU that can do at least one operation per clock cycle probably has between 100,000 (SPARC) and 1 million (the PowerPC 602) transistors and runs at 1 watt. So chips today have 1,000 or 10,000 times that number of transistors, but do they run that much faster? No, of course not.

And we can even take that a step further, because those chips suffered from the same inefficiencies that hinder processors today. A full adder takes 28 (yes, twenty eight) transistors. Could we build an ALU that did one simple operation per clock cycle with 1000 transistors? 10,000? How many of those could we fit on a billion transistor chip?

Modern CPUs are so many orders of magnitude slower than they could be with a parallel architecture that I’m amazed data centers even use them. GPUs are sort of going the FPGA route with 512 cores or more, but they are still a couple of orders of magnitude less powerful than they could be. And their proprietary/closed nature will someday relegate them to history, even with OpenCL/CUDA because it frankly sucks to do any real programming when all you have at your disposal is DSP concepts.

I really want an open source billion transistor FPGA running at 1 GHz that doesn’t hold my hand with a bunch of proprietary middleware, so that I can program it in a parallel language like Go or MATLAB (Octave). There would be some difficulties with things like interconnect but that’s what things like map reduce are for, to do computation in place rather than transferring data needlessly. Also with diffs or other hash-based algorithms, only portions of data would need to be sent. And it’s time to let go of VHDL/Verilog because it’s one level too low. We really need a language above them that lets us wire up basic logic without fear of the chip burning up.

And don’t forget the most important part of all: since the chip is reprogrammable, cores can be multi-purpose, so they store their configuration as code instead of hardwired gates. A few hundred gates can reconfigure themselves on the fly to be ALUs, FPUs, anything really. So instead of wasting vast swaths of the chips for something stupid like cache, it can go to storage for logic layouts.

What would I use a chip like this for? Oh I don’t know: AI, physics simulations, formula discovery, protein folding, basically all of the problems that current single-threaded architectures can’t touch in a cost-effective manner. The right architecture would bring computing power we don’t expect to see for 50 years to right now. I have a dream of someday being able to run genetic algorithms that take hours to complete in a millisecond, and being able to guide the computer rather than program it directly. That was sort of the promise with quantum computing, but I think FPGAs are more feasible.


> I really want an open source billion transistor FPGA running at 1 GHz that doesn’t hold my hand with a bunch of proprietary middleware, so that I can program it in a parallel language like Go or MATLAB (Octave).

I fail to see how not having proprietary middleware will enable you to program an FPGA in Go or MATLAB.

FPGAs are not well suited to being programmed in conventional languages (and Go is not a parallel language; it employs a clever model that may give you that impression, but under the hood it is fairly regular, nothing you could not achieve using coroutines and threads in a different language, maybe syntactically cleaner and easier to understand). MATLAB might be more feasible, but it's still not an easy match. You could conceivably make an FPGA coprocessor, though, that you access from those languages through some library.

If you want to program an FPGA in something that looks like a high level language syntactically there are a number of solutions:

http://stackoverflow.com/questions/5603285/c-to-hardware-com...

And some newer developments. But none of those change the essence of the chip.

I'm not sure how to convey the difference between an FPGA and something like a high level language any better than 'imagine all your code executing at once'.

FPGAs are something you tell what to be, not what to do (and high level languages tell CPUs what to do, not what to be).


  FPGAs are something you tell what to be, not what
  to do (and high level languages tell CPUs what to
  do, not what to be).
That's a really great way of putting it. I never thought of it that way before, but you're spot on.


Yeah, this rant makes me wonder if the poster actually understands how these things work in deep detail.


I used VHDL for a semester back in '98 or '99 for my ECE degree. I remember it being extraordinarily brittle compared to something like C, because there were so many ways to trigger unreliability with the wrong clock edges, etc. So, for example, we did almost everything as state machines instead of math. The code was basically unmaintainable by today's standards, but it was an important learning tool. As I recall, we wrote a VGA signal generator, and if you had fewer than half a dozen incorrect pixels onscreen, you were doing pretty well. So in fairness, I USED to know this stuff in nauseating detail.


Yes, this is because VHDL is not code. It's a different paradigm entirely. The FPGA doesn't "run" the VHDL, etc.

The units in an FPGA are just a little bit of logic with a bunch of connectivity to local busses.

I haven't done much VHDL lately (ECE '96 here) but I do remember how the stuff works. But then again I focused mainly on CPU architecture.


> one operation per clock cycle […] Modern CPUs are so many orders of magnitude slower than they could be with a parallel architecture

Modern CPUs are superscalar, meaning they can execute multiple operations per clock cycle.

> processors that were 3/4 cache memory, with barely any transistors used for logic. It's surely worse than that now, with the vast majority of logic gates on chips just sitting around idle

That's because memory is the bottleneck for 99% of compute workloads. The only things memory is not the bottleneck for are silly arithmetic (e.g. graphics) and network processing. And we have processors for those workloads. (Source: I've programmed Tilera network processors.)

> even with OpenCL/CUDA because it frankly sucks to do any real programming when all you have at your disposal is DSP concepts.

OpenCL/CUDA are so unlike DSP programming it's not even funny. SIMT and SIMD are very different programming models. (Source: I've programmed both CUDA and DSPs (Blackfin specifically).)

> it’s time to let go of VHDL/Verilog because it’s one level too low. We really need a language above them that lets us wire up basic logic without fear of the chip burning up.

Huh? VHDL and Verilog are very high-level languages that expose concepts like open-drain/collector logic and different voltage domains only unwillingly. Unless you have a really crappy elaborator, "chips burning up" just ain't gonna happen. (Source: I've done mixed-signal ASIC design.)

> So instead of wasting vast swaths of the chips for something stupid like cache, it can go to storage for logic layouts.

I don't think you understand the importance of cache. Memory is S-L-O-W. (Source: I've hand-optimized high-performance network processing code.)

> AI, physics simulations, formula discovery, protein folding, basically all of the problems that current single threaded architectures can’t touch in a cost-effective manor

All those things you list are bottlenecked by memory bandwidth, not CPU time.
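To illustrate the memory-bandwidth point, a STREAM-style triad like the sketch below moves 24 bytes per element (two loads and a store of doubles) for only two floating-point operations, so extra ALUs don't help; the memory system sets the pace:

  /* STREAM-style triad: a textbook memory-bandwidth-bound loop. */
  #include <stddef.h>

  void triad(double *a, const double *b, const double *c,
             double scalar, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          a[i] = b[i] + scalar * c[i];
  }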


I just meant that OpenCL/CUDA are moving towards a narrower scope than what I would like. There is too much emphasis on manually arranging your data with the correct interleaving, or doing things in batches, pipelined in just the right order. It all reminds me of DSP, which I was never a huge fan of. That's because all of the supposed gains can easily be replicated with a general purpose language and an order of magnitude more computing power (which is easily attainable with billions of transistors at your disposal, getting cheaper every year).

You are right about memory bandwidth though. I'm excited that memory has come so far, with wide channels running at a GHz or more. But, that has come at the expense of latency. I think this problem is intractable, because everything is moving toward data centers with even slower interconnect.

I’m reaching here, but I think the endgame will be memory that works more like bittorrent, where the latency won’t be how far the nearest neighbor is, but how much the data differs from what you already have. Then we can eliminate most of the problems with slow interconnect and move to a big dumb hash system that uses diffs but results in orders of magnitude higher throughput. If processing is really not a bottleneck, then this will be trivial.


I think what's more likely is that processing will move into the memory itself. We already see this somewhat with manycore shared-cache architectures like Tilera, which effectively provide one core per ¼ MiB of L3 cache. (Though unfortunately L3 cache latency is only ~twice as fast as RAM, but at least the bandwidth is much better.)


More ALUs don't help: we have a few that can run in a cycle on modern CPUs, but we still get IPCs < 1.0. The bottleneck is getting the arguments and operation into the ALU for it to run. That's what the cache is for. As you pointed out, it's horribly inefficient (in % silicon space, but that's an odd metric, isn't it?) in some ways, but that's what you get for generality.

If you want something faster, you'll need to alter the upper level systems to make feeding the ALU(s) easier. (And you'll still have Amdahl's law to contend with -- which is a major factor even at the single-core scale.) AFAICT, you're asking for a system as high level as Go or MATLAB, but have it compile itself into some sequence of FPGA programs and I/Os on it. Altering those upper level systems (to the point of rewrite with a substantial amount of research work in the middle) is a lot of work, and efficiency at the silicon level isn't enough justification for it. Crude work partitioning by the programmer and more cores/machines often works well for those who actually need the additional performance.

But there's something reasonable on regular silicon that you may like. Haskell's got auto-parallelization features, as well as the ability to compile code to run on CUDA. http://chimera.labs.oreilly.com/books/1230000000929 is a book (available to read freely online) covering these features. You can get the background here: http://book.realworldhaskell.org/


"I really want an open source billion transistor FPGA running at 1 GHz that doesn’t hold my hand with a bunch of proprietary middleware"

Yeah, and I want a pony. The pony just happens to be about 10000x more likely in terms of capital expense.

I look forward to the Kickstarter to raise the $10 million+ needed just to spin a mask set for this chip, not to mention the licensing costs for EDA tools that can do routing + signal integrity on this scale.


And then it still won't deliver on the intended range of applications. An open source FPGA won't magically do what a closed source one can't, there is a fundamental impedance mismatch here between thinking about what an FPGA could do versus what it actually does.


I think the general priority has been to improve single-thread performance, and for good reason.

While there are plenty of problems that are parallelizable, a great majority of problems are sequential. Even if you think on the OS level - of running every process on a separate CPU/core/whatever - single-thread performance still comes out as the most important factor.



This is cool and all, but instead of spending money on this project, why not try to improve their search engine, or just not spend this money, since Bing loses so much money? I don't mind waiting half a second extra for my search results. It seems more like their thinking is: we've got a lot of engineers we're paying a bunch of money, let's do some project.


This move saves money, since the servers process queries more efficiently-- “Right off the bat we can chop the number of servers that we use in half,” Burger says.

"Just improving their search engine" isn't some simple task. Google has a head-start measured in thousands of man-years. Closing that gap takes a great number of smart people a great amount of time.

I assume that Google is improving slower than Bing, since their algorithms and systems are more mature and closer to the "asymptotically ideal search engine".


I think this project was funded not only because it helps Bing, but it also accelerates pretty much any large-scale data center application. So in theory, this system could help applications that do not exist yet.



