In fact, after moving the networking stack into userland and using InfiniBand networking gear, it's probably the third most common optimization I've seen or heard of for HFT systems.
Here's a quick, but surprisingly accurate description of a common HFT setup: http://www.forbes.com/sites/quora/2014/01/07/what-is-the-tec...
Someone had asked about the number of quotes that need to be parsed. From Forbes:
> Mr. Hunsader: The new world is now a war between machines. For some perspective, in 1999 at the height of the tech craze, there were about 1,000 quotes per second crossing the tape. Fast forward to 2013 and that number has risen exponentially to 2,000,000 per second.
Keep in mind that the "tape" is the slow SIP line that exchanges use to keep prices in sync and to show customers that don't use the exchanges' direct feeds. I.e., it aggregates the quotes from all venues and throws away a lot, as they can't be parsed in time or didn't change the top-level quote.
With 40+ venues from which an HFT fund can get feeds, 2,000,000 per second is a fraction of what a cutting-edge HFT system would have to parse to keep up with all venues.
The typical setup is that you'll run strategies across multiple machines so you have the gateway machine that directs the quote to the appropriate machine. The biggest problem is the speed at which the quotes arrive.
Unlike a web request, where you can take 300 milliseconds to parse and respond, if you don't parse and respond to a quote in under 10-20 microseconds you've already lost.
So the FPGA transition is about making sure there is never a backlog of quotes or any pause in the handling of bursty quotes. This can't be overstated. Margins are squeezed so tightly now that your algo will appear to be working fine until a big burst of quotes happens and your machines can't keep up, and when the dust settles 20 seconds later, you'll find you lost $5,000, which might be your entire day's profit from that one symbol/algo pair.
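A toy queueing sketch of why bursts are the killer. The numbers here (arrival spacing, 5 µs service time, burst length) are illustrative assumptions, not real feed rates; the point is that once arrivals outpace the handler, delay for late quotes grows linearly and blows far past any microsecond budget:

```python
# Toy single-server FIFO queue: quotes arrive in a burst faster than the
# handler can service them, so a backlog builds and late quotes see huge
# queueing delay. All numbers are illustrative, not real feed rates.

def burst_delays(n_quotes, inter_arrival_us, service_us):
    """Return per-quote completion delays (in us) for a FIFO handler."""
    delays = []
    handler_free_at = 0.0
    for i in range(n_quotes):
        arrival = i * inter_arrival_us
        start = max(arrival, handler_free_at)  # wait if handler is busy
        handler_free_at = start + service_us
        delays.append(handler_free_at - arrival)
    return delays

# Steady state: a quote every 10 us, handled in 5 us -> no backlog.
steady = burst_delays(1000, inter_arrival_us=10, service_us=5)
print(max(steady))   # 5.0 us: every quote is handled immediately

# Burst: a quote every 1 us, still 5 us each -> backlog grows linearly.
burst = burst_delays(1000, inter_arrival_us=1, service_us=5)
print(max(burst))    # 4001.0 us (~4 ms) for the last quote
```

During the burst the last quote waits roughly four milliseconds, which is a few hundred times over a 10-20 microsecond budget; this is the "appears fine until a burst" failure mode in miniature.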
For instance, I could imagine using techniques like a userland network stack and reserving cores exclusively in services like WhatsApp. I think they're currently on a highly customized Erlang stack and are able to handle huge numbers of queries per machine. Any insiders here with a good background story?
I work in finance and have for the past 6.5ish years in electronic trading.
I'm curious about what volumes we're talking about here. I'm not an electrical engineer, so I don't understand when FPGAs are needed (or justify the development cost) over distributed software systems.
In fact, OpenOnload and VMA, two of the most common vendor-provided kernel-bypass tools in use, are both open source!
I think easily and uniformly programming disparate compute devices (CPUs, SIMD, GPUs, FPGAs, ISPs, DSPs, and eventually quantum) is the next BIG problem in programming languages. Several Haskell projects seem promising, but these still tend to be nice DSLs that generate Verilog or shaders.
On most mobile SoCs the CPU takes an increasingly smaller part of the die; there are 10GbE network cards with FPGAs on them; and we've got the Parallella. The hardware exists; we sorely need the next breakthrough programming environment.
The biggest challenge to producing a unified programming model across all of these is dealing with moving data around. CPU vs GPU vs FPGA all have very different memory access characteristics. Gary Sabot was the first to try to tackle the problem of dealing with memory layout/locality in a language with what he called the Paralation model (http://www.amazon.com/The-Paralation-Model-Architecture-Inde...) and a lot of that work carried over into NESL (http://www.cs.cmu.edu/~scandal/nesl.html). Hadoop is a very lobotomized and hard to use modern version of these ideas.
I wouldn't knock DSLs - a lot of people are having good results using DSLs to program FPGAs (for example that's pretty much what http://www.novasparks.com/ does - http://lisp-univ-etc.blogspot.com/2013/06/lisp-hackers-marc-...).
I think a real solution would not be software only; it would include a hardware component (or be tuned to work with a particular chip): a SoC with a few fat CPU cores, many light cores, a GPU, and some FPGA fabric, all sharing a large cache subsystem and memory.
Using that, you create a unique "instruction set" that can generally fit a number of applications in the same area. With that, programmers can program normally and still benefit from a decent speedup.
BTW, it seems that Convey has a deal with Dell and has also developed a memcached acceleration kit. This stuff will probably be available only to big cloud providers, giving them another big advantage over the DIY guys.
But my day job revolves around HDLs, and it is my opinion that a higher level language isn't the answer. Fifty years from now it might be, but state-of-the-art HDL compilers just aren't good enough yet. It's like C compilers a few decades ago, where you had to insert some inline ASM in your code here and there because the compilers were still developing.
So I guess what I'm saying is you can't target CPU, GPU, DSP, & FPGA in one compiler until we can master targeting FPGA even just by itself.
Depending on the make and model of FPGA, you will have "large" areas where you either can't or don't want to plop logic.
You can have a pretty netlist that validates and simulates correctly (although you'll eventually end up dealing with Cadence, who seem to have the right hand side of the bugs per line of code curve locked up) but still takes weeks or months of that inline ASM work to make it competitive with a rack of Xeons. The edit/compile/debug cycle is not quick by any means past a trivial number of gates.
Dealing with that junk is why IP blocks are so attractive, but you end up on the road to structured ASICs and that just leads to misery.
I think the idea would be to keep the backdoors, the question is how to make them still usable in that case.
If I was back at school, this is something I would definitely spend time on.
Some possible ideas:
- Writing different LLVM backends to target the various hardware. You still would need some runtime to schedule/organise the tasks.
- A language with a very powerful type system (i.e. fully dependent) which allows a high level of programmer intent to be extracted by the compiler. (E.g. we are only doing multiply-adds on a specific range of doubles, so create a custom FPGA configuration for just that.)
- Just writing a bunch of libraries which provide nice APIs for the various functions, kinda like NumPy.
Your work sounds interesting. Contact me at email@example.com if you want to exchange more ideas.
The upside is that they're doing something innovative, but if Bing really wants to steal market share from Google they have to improve on their quality, not their speed. I'd rather see them take 10 seconds and deliver an absolutely perfect answer than take 0.001 seconds to deliver something 10 times faster but not on par with Google.
Impressive to see them backing an exotic solution like this though, and if and when they do get it to be better than Google it may pay off.
Are there any developments like this underway at Google?
Well, there are two ways to increase bottom-line profits. One is to increase revenue by stealing market share. The other is to decrease costs.
This would apparently help cut costs by decreasing the number of servers and saving on electricity.
edit: Interestingly, for me it went from an "offer not available" link to urlwebz to the actual sign-in in the hour since I posted that. If nothing else, the search results are changing fast.
Being able to index, process and learn data faster can lead to faster iteration and improve the speed of batch or offline jobs which in turn could improve the relevance.
From the article:
"The system takes search queries coming from Bing and offloads a lot of the work to the FPGAs, which are custom-programmed for the heavy computational work needed to figure out which webpages results should be displayed in which order. "
That looks like they're in the interactive path somewhere.
FPGAs may make Bing twice as fast as it already is, but I don't feel like it needs to be faster. Although, I guess if it's faster they can trade that off for more work done and thus better results, perhaps.
Especially with good open source dev tools.
So what I'm saying is, forget about developing a PCI-e FPGA board. If you want an open source toolchain for it, you better start there, because that's going to be 99.9% of the effort.
A bare-minimum CPU that can do at least one operation per clock cycle probably has between 100,000 (SPARC) and 1 million (the PowerPC 602) transistors and runs at 1 watt. So chips today have 1,000 to 10,000 times that number of transistors, but do they run that much faster? No, of course not.
And we can even take that a step further, because those chips suffered from the same inefficiencies that hinder processors today. A full adder takes 28 (yes, twenty eight) transistors. Could we build an ALU that did one simple operation per clock cycle with 1000 transistors? 10,000? How many of those could we fit on a billion transistor chip?
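Running the arithmetic behind that question, using the 28-transistor full-adder figure from the comment and a hypothetical 64-bit ripple-carry adder as the "simple ALU" (the 64-bit width and the billion-transistor budget are assumptions for illustration):

```python
# Back-of-the-envelope: a full adder is ~28 transistors, so a 64-bit
# ripple-carry adder is ~28 * 64 transistors. How many such bare adders
# fit in a 1-billion-transistor budget?

FULL_ADDER_TRANSISTORS = 28
BITS = 64
adder_t = FULL_ADDER_TRANSISTORS * BITS   # transistors per 64-bit adder
budget = 1_000_000_000

print(adder_t)              # 1792
print(budget // adder_t)    # 558035 bare adders
```

Of course this ignores registers, routing, clock distribution, and I/O, which in a real chip dominate the transistor count; it's an upper bound on the thought experiment, not a design.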
Modern CPUs are so many orders of magnitude slower than they could be with a parallel architecture that I’m amazed data centers even use them. GPUs are sort of going the FPGA route with 512 cores or more, but they are still a couple of orders of magnitude less powerful than they could be. And their proprietary/closed nature will someday relegate them to history, even with OpenCL/CUDA because it frankly sucks to do any real programming when all you have at your disposal is DSP concepts.
I really want an open source billion transistor FPGA running at 1 GHz that doesn’t hold my hand with a bunch of proprietary middleware, so that I can program it in a parallel language like Go or MATLAB (Octave). There would be some difficulties with things like interconnect but that’s what things like map reduce are for, to do computation in place rather than transferring data needlessly. Also with diffs or other hash-based algorithms, only portions of data would need to be sent. And it’s time to let go of VHDL/Verilog because it’s one level too low. We really need a language above them that lets us wire up basic logic without fear of the chip burning up.
And don’t forget the most important part of all: since the chip is reprogrammable, cores can be multi-purpose, so they store their configuration as code instead of hardwired gates. A few hundred gates can reconfigure themselves on the fly to be ALUs, FPUs, anything really. So instead of wasting vast swaths of the chips for something stupid like cache, it can go to storage for logic layouts.
What would I use a chip like this for? Oh I don't know, AI, physics simulations, formula discovery, protein folding, basically all of the problems that current single-threaded architectures can't touch in a cost-effective manner. The right architecture would bring computing power we don't expect to see for 50 years to right now. I have a dream of someday being able to run genetic algorithms that take hours to complete in a millisecond, and being able to guide the computer rather than program it directly. That was sort of the promise with quantum computing but I think FPGAs are more feasible.
I fail to see how not having proprietary middleware will enable you to program an FPGA in Go or MATLAB.
FPGAs are not well suited to being programmed in conventional languages (and Go is not a parallel language; it employs a clever model that may give you that impression, but under the hood it is fairly regular: nothing you could not achieve using co-routines and threads in a different language, maybe syntactically cleaner and easier to understand). MATLAB might be more feasible but still not an easy match. You could conceivably make an FPGA co-processor, though, that you access from those languages through some library.
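The point about Go's concurrency model being achievable with threads and queues can be sketched in a few lines. Here a bounded `queue.Queue` plays the role of a buffered channel, with `None` as a hypothetical close sentinel (Go channels have a real `close`; this is just the threads-and-queues equivalent):

```python
# Go-style buffered channel built from plain threads and a bounded queue.
import queue
import threading

def producer(ch):
    for i in range(5):
        ch.put(i * i)     # blocks when the "channel" buffer is full
    ch.put(None)          # sentinel standing in for closing the channel

ch = queue.Queue(maxsize=2)   # buffered "channel" of capacity 2
results = []
t = threading.Thread(target=producer, args=(ch,))
t.start()
while (item := ch.get()) is not None:   # blocks like a channel receive
    results.append(item)
t.join()
print(results)   # [0, 1, 4, 9, 16]
```

Same blocking-send/blocking-receive semantics as a buffered channel; what Go adds is mainly cheap goroutines, `select`, and nicer syntax, not a fundamentally different execution model.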
If you want to program an FPGA in something that looks like a high level language syntactically there are a number of solutions:
And some newer developments. But none of those change the essence of the chip.
I'm not sure how to convey the difference between an FPGA and something like a high level language any better than 'imagine all your code executing at once'.
FPGAs are something you tell what to be, not what to do (and high level languages tell CPUs what to do, not what to be).
> FPGAs are something you tell what to be, not what to do (and high level languages tell CPUs what to do, not what to be).
The units in an FPGA are just a little bit of logic with a bunch of connectivity to local busses.
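The "tell it what to be" idea can be made concrete with a toy model of one of those units: an FPGA logic element is essentially a small lookup table, and configuring the chip means loading truth tables rather than issuing instructions. A minimal sketch (a real element is a 4- or 6-input LUT plus a flip-flop and routing; this hypothetical 2-input version just shows the principle):

```python
# Toy 2-input LUT: the configuration IS the truth table. You don't give
# it instructions; you load a table and it "becomes" that gate.

class Lut2:
    def __init__(self, truth_table):
        # truth_table[b1*2 + b0] is the output for inputs (b1, b0)
        self.table = truth_table

    def __call__(self, b1, b0):
        return self.table[b1 * 2 + b0]

and_gate = Lut2([0, 0, 0, 1])   # configured to "be" an AND gate
xor_gate = Lut2([0, 1, 1, 0])   # same hardware, reconfigured as XOR

print(and_gate(1, 1))  # 1
print(xor_gate(1, 1))  # 0
```

Swapping the four-entry table turns the same unit into any 2-input function; scale that to millions of LUTs plus programmable routing and you have the FPGA fabric.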
I haven't done much VHDL lately (ECE '96 here) but I do remember how the stuff works. But then again I focused mainly on CPU architecture.
Modern CPUs are superscalar, meaning they can execute multiple operations per clock cycle.
> processors that were 3/4 cache memory, with barely any transistors used for logic. It's surely worse than that now, with the vast majority of logic gates on chips just sitting around idle
That's because memory is the bottleneck for 99% of compute workloads. The only things that memory is not the bottleneck for is silly arithmetic (e.g. graphics) and network processing. And we have processors for those workloads. (Source: I've programmed Tilera network processors.)
> even with OpenCL/CUDA because it frankly sucks to do any real programming when all you have at your disposal is DSP concepts.
OpenCL/CUDA are so unlike DSP programming it's not even funny. SIMT and SIMD are very different programming models. (Source: I've programmed both CUDA and DSPs (Blackfin specifically).)
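One way to see the difference: in SIMT, every thread in a warp shares one program counter, so when threads branch differently the hardware runs both sides and masks out the lanes that didn't take each branch. A toy model of that divergence (the branch bodies here are arbitrary examples):

```python
# Toy model of SIMT warp divergence: on a divergent branch, BOTH sides
# execute for all lanes, and a per-lane mask selects which result each
# "thread" keeps. Plain SIMD has no per-lane control flow to model at all.

def warp_execute(values):
    """Each 'thread' i: if its value is even, halve it; else 3v+1."""
    mask_even = [v % 2 == 0 for v in values]
    # Branch side A runs for ALL lanes; the mask keeps even-lane results.
    side_a = [v // 2 for v in values]
    # Branch side B also runs for ALL lanes; the mask keeps odd lanes.
    side_b = [3 * v + 1 for v in values]
    return [a if m else b for m, a, b in zip(mask_even, side_a, side_b)]

print(warp_execute([4, 7, 10, 3]))  # [2, 22, 5, 10]
```

The cost model falls out of this: a fully divergent warp pays for both paths, which is why CUDA code is structured to keep threads in a warp on the same branch, a concern that simply doesn't exist in DSP-style SIMD programming.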
> it’s time to let go of VHDL/Verilog because it’s one level too low. We really need a language above them that lets us wire up basic logic without fear of the chip burning up.
Huh? VHDL and Verilog are very high-level languages that expose concepts like open-drain/collector logic and different voltage domains only unwillingly. Unless you have a really crappy elaborator, "chips burning up" just ain't gonna happen. (Source: I've done mixed-signal ASIC design.)
> So instead of wasting vast swaths of the chips for something stupid like cache, it can go to storage for logic layouts.
I don't think you understand the importance of cache. Memory is S-L-O-W. (Source: I've hand-optimized high-performance network processing code.)
> AI, physics simulations, formula discovery, protein folding, basically all of the problems that current single threaded architectures can't touch in a cost-effective manner
All those things you list are bottlenecked by memory bandwidth, not CPU time.
You are right about memory bandwidth though. I'm excited that memory has come so far, with wide channels running at a GHz or more. But, that has come at the expense of latency. I think this problem is intractable, because everything is moving toward data centers with even slower interconnect.
I’m reaching here, but I think the endgame will be memory that works more like bittorrent, where the latency won’t be how far the nearest neighbor is, but how much the data differs from what you already have. Then we can eliminate most of the problems with slow interconnect and move to a big dumb hash system that uses diffs but results in orders of magnitude higher throughput. If processing is really not a bottleneck, then this will be trivial.
If you want something faster, you'll need to alter the upper level systems to make feeding the ALU(s) easier. (And you'll still have Amdahl's law to contend with -- which is a major factor even at the single-core scale.) AFAICT, you're asking for a system as high level as Go or MATLAB, but have it compile itself into some sequence of FPGA programs and I/Os on it. Altering those upper level systems (to the point of rewrite with a substantial amount of research work in the middle) is a lot of work, and efficiency at the silicon level isn't enough justification for it. Crude work partitioning by the programmer and more cores/machines often works well for those who actually need the additional performance.
But, there's something reasonable on regular silicon that you may like. Haskell's got auto-parallelization features, as well as the ability to compile code to run on CUDA. http://chimera.labs.oreilly.com/books/1230000000929 Is a book (available to read freely online) covering these features. You can get the background here http://book.realworldhaskell.org/
Yeah, and I want a pony. The pony just happens to be about 10000x more likely in terms of capital expense.
I look forward to the Kickstarter to raise the $10 million+ needed just to spin a mask set for this chip, not to mention the licensing costs for EDA tools that can do routing + signal integrity on this scale.
While there are plenty of problems that are parallelizable, a great majority of problems are sequential. Even if you think of it at, say, the OS level, of running every process on a separate CPU/core/whatever, single-thread performance still ends up being the most important factor.
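Amdahl's law, mentioned upthread, makes this quantitative: if a fraction p of a job is parallelizable, the speedup with n workers is 1 / ((1-p) + p/n), so even a small sequential fraction caps the achievable speedup no matter how many cores you throw at it. The 95%-parallel figure below is just an illustrative choice:

```python
# Amdahl's law: even at 95% parallelizable, 1000 workers can't beat 20x,
# because the 5% sequential fraction dominates. Hence the importance of
# single-thread performance.

def amdahl_speedup(p, n):
    """Speedup for parallel fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

print(round(amdahl_speedup(0.95, 8), 2))     # 5.93
print(round(amdahl_speedup(0.95, 1000), 2))  # 19.63, capped near 1/(1-p) = 20
```

Going from 8 workers to 1000 buys barely a 3.3x improvement here; the asymptote 1/(1-p) is why shaving the sequential part (i.e. faster single-thread execution) often pays more than adding cores.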
"Just improving their search engine" isn't some simple task. Google has a head-start measured in thousands of man-years. Closing that gap takes a great number of smart people a great amount of time.
I assume that Google is improving slower than Bing, since their algorithms and systems are more mature and closer to the "asymptotically ideal search engine".