Hacker News new | past | comments | ask | show | jobs | submit login
Graph Processing on FPGAs: Taxonomy, Survey, Challenges (arxiv.org)
127 points by walterbell 40 days ago | hide | past | web | favorite | 21 comments

This seems a weird problem to solve with FPGAs.

The big problem with graphs is that they can get very big, and once they become too big for a single machine's memory things get tricky.

However it's unclear if there are many graphs that are actually too big.

Frank McSherry showed (back in 2015) that he can process 20 iterations of PageRank on a 120B edge graph in ~12 hours on single core on a single laptop[1]. He also shows label propagation to convergence in 1700 seconds on the same single core.

Given this, exactly what advantage does a FPGA give? Who is processing ~100 billion edge graphs so many times that you are going to save on power? And is there actually any power savings over a single core anyway?

[1] https://github.com/frankmcsherry/blog/blob/master/posts/2015...

Here are a few references that solve problems on very large graphs:

[1] implements work-efficient parallel algorithms for a list of fundamental graph problems. They report times on a O(200B) edge graph on a single multicore.

[2] reports distributed running times for BFS, connectivity, PageRank, and SSSP on a large number of distributed hosts. The times are usually slower than the single-machine times, despite using two orders of magnitude more threads.

[3] Reports connectivity times on a truly large graph that is not publicly available (several trillions of edges).

My feeling is that unless you are at FAANG or a big national lab, the largest graphs that will show up in practice (today) are on the order of billions of edges, and can be solved quickly using single-machine multicores.

[1] https://arxiv.org/abs/1805.05208

[2] https://www.cs.utexas.edu/~roshan/Gluon.pdf

[3] https://arxiv.org/abs/1807.10727

Yes, all these are valid cases. Hence the second part of my question:

Given this, exactly what advantage does a FPGA give? Who is processing ~100 billion edge graphs so many times that you are going to save on power? And is there actually any power savings over a single core anyway?

> what advantage does a FPGA give

basically time and power. FPGAs will most likely be both faster and more power efficient than GPUs and CPUs, the memory point isn't quite relevant, you can have an FPGA board that talks to 32 GB of ram, or an HBM chip with as much memory density and memory bandwidth as a 1080 TI GPU.

> Given this, exactly what advantage does a FPGA give?

Not spending an hour on performing the calculation, I’d guess. An hour for PageRank is probably acceptable, but there are applications where waiting an hour for a result isn’t viable.

Any insight on why this paper was posted? What graph algorithms are people in the field interested in solving on FPGAs/GPUs/other accelerators? Unfortunately I usually see the same cluster of (toy) problems solved time and again (BFS/BC/PageRank/(negative)SSSP, usually all using label-propagation style vertex-centric codes). This survey interestingly enough reports some results for maximum matching (approximate?), although it does not provide many details on the result.

There are a few things to note. First, graph algorithms are basically the same thing as sparse matrix algorithms, in that the sort of optimization techniques you use for one class carries over nicely to the other class. The second thing is that graph and sparse matrix algorithms are unavoidably heavy on indirect memory accesses. BFS in particular has the property that there's essentially no computation that you can use to hide the memory latency. Which means that GPUs in particular do a bad job on BFS, and that gives you ample room to post 10× speedups over state-of-the-art GPUs.

I agree.

I think people vote for anything with FPGA in it, because people think FPGAs are the future of high-performance compute (and always will be IMHO).

I've worked in environments where FPGAs were used (specialized camera stabilization and tracking). These kind of environments are around, but generally you need a mix of high performance and some kind of embedded requirement for a FPGA to make sense.

Would PCIe storage/network adapters count as embedded?

Lightbits uses an FPGA on their storage adapter for NVME-over-fabric, enabling custom processing at line rates. This is a standardized evolution of Annapurna (now AWS Nitro virtualization), https://www.lightbitslabs.com/products/lightfield/

https://NetFPGA.org has been around for a while and continues to advance, https://www.usenix.org/system/files/nsdi19spring_pontarelli_...

> programming a Smart-NIC to support a new network function requires hardware design expertise. While a tech giant can build and assign a dedicated team to the task, this is usually not the case for a large majority of companies, e.g., smaller cloud or network operators. As a result, recent network programming abstractions, such as P4 have the explicit goal of simplifying the programming of FPGA-based network devices ... introducing FlowBlaze, an abstraction that extends match-action languages such as P4 or Microsoft’s GFT to simplify the description of a large set of L2-L4 stateful functions, while making them amenable to (line-rate) implementations on FPGA-based SmartNICs. To benefit the community at large, we build FlowBlaze on open SmartNIC technology (NetFPGA), and provide our hardware and software implementations as open source.

Correct. By definition, if volume is high enough, an FPGA will always lose to a custom chip. That's why it feels like FPGAs are always about to win, except they never do because for every real use case, someone will make a better chip.

In other words, the overhead of the re-configurable aspect of FPGA is just too high.

Back when chip costs were reasonable, custom chips would always beat FPGAs and were an easy option for many applications but I think things have changed somewhat in the last few years with the decline of Moore's law. Now there are fewer and fewer companies that can afford to develop custom chips at leading technology nodes. Some estimates are it costs $100m to get a custom chip into product at 7nm. The volumes just aren't there for many applications to justify a custom chip, it doesn't make sense unless the market is huge and the algorithms are static. FPGAs work well in these cases. Have you seen Microsoft Brainwave? https://www.microsoft.com/en-us/research/project/project-bra...

This is true, but:

a) most applications don't need 7nm. S3 will do a custom ASIC for under $2M[1], and I've heard figures as low as $100K for a FPGA for ASIC conversion.

b) General purpose embedded CPUs with extended instruction sets are often fast enough for most purposes.

[1] https://www.s3semi.com/wp-content/uploads/2018/06/S3semi_sil...

We agree, which is why I prefaced my statement with if volume is high enough. Just how high for 7nm is left as an exercise to the reader :-)

Very interesting read.

Some of the algorithms require the FPGA to be reconfigured whenever the node data has changed or when applied to a different data set.

Is this an expensive operation?

Not really, but I suppose it depends on the application requirements. It typically takes under a second which is much faster than taping out another chip ;-) but could be problematic for real-time systems in some cases. It depends upon the size of the FPGA/bus speed etc.

thank you have been wondering about this for ages!!

I've been (I admit) just starting to try and learn about FPGA's and where their applications might be. Outside of the painfully obvious - 100G networking, HFT - I can't see a case where their expense makes them a valid solution when compared to a GPU. Ideas?

Many people think FPGAs have the advantage over GPUs in machine learning inference. See the Bing FPGA stuff. Mostly due to lower power operations, but also a higher level of customization e.g. optimized 1 bit math if you want it, leading to potentially higher speed or larger networks. The memory architecture of FPGAs can also offer advantages - e.g. bigger memories.

The GPU instruction set is limited for some applications because of SIMD. If the GPU threads aren't doing exactly the same thing then they stall. The FPGA can have parallel processes going on that are more differentiated. e.g. HMMs work better on FPGA than GPU.

Anything with low latency feedback (e.g. LSTMs) probably works better on FPGA.

Bing search uses FPGA for scoring. See FPGA for storage/nvme.

I'm hoping for a 10x-100x increase in memory-size for gpu to unlock new possibilities.

If you take apart a gpu and look at the die you can see they basically frankenstein 1 or 2 HBMs with the GPU die. Xilinx is doing this too in their latest line of VU33 - VU38P HBM chips too (4-8GB on chip, 460 GB/s).

Any density upgrades for the GPU will also reflect for FPGAs, but we're physically limited in terms of how many HBM dies you can slap on :(

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact