Energy-Efficient Llama 2 Inference on FPGAs via High Level Synthesis (arxiv.org)
94 points by PaulHoule on May 10, 2024 | 29 comments


What's really cool is that this was built on Karpathy's llama2.c repo: https://github.com/karpathy/llama2.c

A great example of the unexpected things that happen when you put great code into the commons.

I bet Andrej never expected anything like this when he released it.


Keep in mind that Andrej is probably holding back a lot of optimizations for the sake of keeping the code comprehensible!


> Although the GPU performs inference faster than the FPGA, one of the primary bottlenecks of deep learning inference is memory bandwidth and the availability of on-chip memory (Balasubramanian et al., 2021). A RTX 3090 has 24GB VRAM running at 1219 MHz with a base core clock of 1395 MHz (TechPowerUp, 2024). In comparison, a VU9P FPGA has 345.9 MB of combined on-chip BRAM and URAM, running at a much slower clock speed of around 200-300 MHz depending on the module; however, with much lower clock speeds, the FPGA is able to achieve better efficiency on power and energy consumption, as shown below.

So as far as I can understand, the biggest "bottleneck"/limiting factor with using FPGAs for LLMs is the available memory -- with current large models exceeding 40 GiB in parameter size, GPUs and TPUs with DRAM look like the only way to go forward for the months to come ... Thoughts?
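
A quick back-of-envelope check of that point (my own sketch, not from the paper; the 7B and 70B rows are just illustrative sizes set against the 345.9 MB of on-chip memory quoted above):

    // Parameter footprint vs. the VU9P's on-chip BRAM/URAM (345.9 MB, per the quote).
    #include <cstdio>

    int main() {
        const double onchip_mb = 345.9;              // VU9P BRAM + URAM
        const double params[] = {110e6, 7e9, 70e9};  // 110M (the paper's model), 7B, 70B
        for (double p : params) {
            const double fp16_mb = p * 2.0 / 1e6;    // 2 bytes per weight
            const double int8_mb = p * 1.0 / 1e6;    // 1 byte per weight
            std::printf("%6.0fM params: fp16 %8.0f MB, int8 %8.0f MB, fits on-chip at int8? %s\n",
                        p / 1e6, fp16_mb, int8_mb, int8_mb <= onchip_mb ? "yes" : "no");
        }
        return 0;
    }

Only the 110M-class model fits; anything in the multi-billion-parameter range needs external DRAM or HBM.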


Wouldn't be surprised if AMD or Intel come up with an FPGA especially for this application. At least AMD advertises a lot with their AI FPGA stuff, so they'll probably build one with either a lot more BRAM or the ability to attach some very fast RAM? But going from a few megabytes of RAM to gigabytes sounds very expensive. DRAM is just too slow, I guess.


Check out the Agilex-M device: https://www.intel.com/content/www/us/en/products/details/fpg...

Up to 38 TFLOPs of FP16 performance, up to 116 Gbps transceiver rates, and up to 3.9M logic elements (LEs).

Memory bandwidth of over 1 TBps via the NoC, with in-package HBM2E (up to 32GB capacity) and a hardened DDR5/LPDDR5 memory controller (supporting 5,600 Mbps).

Maybe doesn't come into the power envelope (Agilex 5 is looking good on that front) of a low-power device, but it's an awesome chip. About £8k for a DevKit though.


Yeah, I think DRAM is almost certainly the future, just in terms of being able to afford the memory capacity to fit large models. Even Cerebras using a full wafer only gets up to 44 GB of SRAM on a chip (at a cost of over $2M).

An interesting twist is that this DRAM might not need to be a central pool where bandwidth must be shared globally -- e.g. the Tenstorrent strategy seems to be aiming for smaller chips that each have their own memory. Splitting up memory should yield very high aggregate bandwidth even with slower DRAM, which is great as long as they can figure out the cross-chip data flow to avoid networking bottlenecks.


Not exactly a fair comparison - the design is limited to running models that fit entirely in on-chip RAM on the FPGA. This greatly reduces power consumption because the FPGA does not have to pay the overhead of DRAM PHYs, termination, DRAM chips, etc., which is a relatively fixed cost irrespective of capacity. This means the energy cost in terms of energy per bit transferred is much higher for the GPU than the FPGA.

Thus the GPU is storing a 110M model in gigabytes of external RAM, paying the power penalty associated with the excess capacity, while the chosen 110M model fits neatly within the FPGA's on-chip RAM, and the design can trim all that overhead accordingly.

A fairer comparison would either run a larger model that had both systems hitting external RAM, or compare power/performance against some sort of inference ASIC that had all the RAM on chip (maybe a Cerebras, but scaled according to the portion of the wafer actually used for the model).

That being said, it's neat that they open sourced their work, and it's worth looking down this path more.


I don't know what you are trying to say here. If one system doesn't need to move as much data because it is more flexible, that is a good thing. What do we gain by making it "fair"?


If you're limiting the size of the model to 110 million parameters (105 MiB assuming int8) because that's what will fit onto your FPGA, then of course it's going to be more energy efficient than a Broadwell-era Xeon with a 24GB RTX 3090. It's like concluding that a rickshaw is more efficient than a train: absolutely true in a technical sense if you're only transporting a single passenger, but it makes no sense if you're transporting hundreds if not thousands of passengers.

A more apt comparison would have been with a phone made in the past 5 years; even without an AI accelerator chip I'm sure you could manage 20-30+ t/s from a 110M model, though this depends entirely on the memory bandwidth of the phone.
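
For memory-bound token-by-token decoding, the crude ceiling is memory bandwidth divided by the bytes streamed per token (roughly the model size). A sketch; the 25 and 50 GB/s phone bandwidth figures are my own assumptions for illustration, not measurements:

    #include <cstdio>

    int main() {
        const double model_bytes = 110e6;             // 110M params at int8, as above
        const double bandwidths_gbs[] = {25.0, 50.0}; // assumed phone DRAM bandwidth, GB/s
        for (double bw : bandwidths_gbs)
            std::printf("%4.0f GB/s -> roughly %4.0f tokens/s upper bound\n",
                        bw, bw * 1e9 / model_bytes);
        return 0;
    }

Even the low end leaves plenty of headroom above 20-30 t/s, which is why the bandwidth caveat matters more than the compute.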


Most of the work of LLMs is in large matrix multiply-accumulate operations. You could take all those constants and convert everything to fixed point, then compile it down to the smallest possible directed acyclic graph of binary logical operations. (In other words, do the NAND to Tetris thing in reverse)
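
A minimal sketch of the fixed-point part of that idea (my illustration, not the paper's scheme; the Q8.8 format is an arbitrary choice):

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Hypothetical Q8.8 fixed point: 8 integer bits, 8 fractional bits.
    constexpr int FRAC_BITS = 8;

    int16_t to_fixed(float x)   { return static_cast<int16_t>(x * (1 << FRAC_BITS)); }
    float   to_float(int32_t x) { return static_cast<float>(x) / (1 << FRAC_BITS); }

    // Fixed-point dot product: accumulate in a wider integer to avoid overflow,
    // then shift back down to Q8.8.
    int16_t dot_fixed(const std::vector<int16_t>& w, const std::vector<int16_t>& x) {
        int32_t acc = 0;
        for (size_t i = 0; i < w.size(); ++i)
            acc += static_cast<int32_t>(w[i]) * x[i];  // Q16.16 partial sums
        return static_cast<int16_t>(acc >> FRAC_BITS); // back to Q8.8
    }

    int main() {
        std::vector<int16_t> w = {to_fixed(0.5f), to_fixed(-1.25f)};
        std::vector<int16_t> x = {to_fixed(2.0f), to_fixed(0.8f)};
        std::printf("%f\n", to_float(dot_fixed(w, x))); // ~0.5*2.0 + (-1.25)*0.8 = ~0
        return 0;
    }

Once the weights are constants like this, each multiply-by-constant can in principle be lowered further into shifts, adds and raw gate-level logic.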

The problem is FPGAs are heterogeneous and highly optimized to reduce latency rather than for efficiency, so there are strong limits to this approach.

On the plus side, LLMs are mostly a stack of layers, so you could take one layer per FPGA and just use the high-speed links in the chips to feed the results to the next FPGA.

You'd have maybe even a millisecond of latency at a clock rate of 100 MHz, but possibly a million tokens per second of serial/parallel streams of execution.


Perhaps this is the obvious comment, but I really hope that something that employs technology like this can get off the ground and become a services company. The ideas about democratizing the AI inference hardware space and making it energy efficient really resonate with me.


Don't you think that LLM inference is already very democratic? Not saying that there is no room for improvement there; there is still a lot to do in the space of speculative decoding, quantization and other stuff. I'm saying that every 16-year-old with a decent enough personal computer can run, fully locally, the latest open-weights models like Llama3-8B, which beat almost everything we had a year ago.

The part of this ecosystem that is as non-democratized as it can be is training. It's currently impossible to train a decent enough model with resources that are available to one person.


Seems like the abstract's claims for speed and energy efficiency relative to an RTX 3090 are for the GPU running at a batch size of 1. I wonder if someone with more experience can comment on how much throughput gain is possible on a GPU by increasing batch size without severely harming latency (and what the power consumption change might be).

And from a hardware cost perspective, the AWS f1.2xlarge instances they used are $1.65/hr on-demand, vs say $1.29/hr for an A100 from Lambda Labs. A very interesting line of thinking to use FPGAs, but I'm not sure this is really describing a viable competitor to GPUs even for inference-only scenarios.


The FPGA being used is, I believe, one of the lowest-specced SKUs.

AWS instance prices are more of a supply/demand/availability thing; it would be more interesting to compare from a total cost of ownership / perf-power-area perspective.


While an FPGA may prove more efficient than a 3090, primarily a gaming card, I can't see how it should be more efficient than a dedicated training/inference card, as the latter is effectively an ASIC, not to mention memory and bandwidth limitations.

Is there something I am missing making FPGA potentially more viable, besides not feeding into NVIDIA’s greed?


A dedicated training/inference card is still more general than a Llama 2 inference card. It's obvious that you will get better efficiency the more you tailor your silicon to the task, with diminishing gains, but still.


Is there anything preventing designing a custom ASIC for LLM training/inference rather than relying on GPUs? I've been away from hardware for a long time, but if you can compile it to run on an FPGA, then a custom ASIC should be even faster (clock speed) and more power efficient.


You absolutely can, and many do. The issue is GPUs are a commodity product that get most of the way there in terms of performance without rewriting a whole software stack, and have huge economies of scale. You just buy millions of pre-existing products that do the job well enough, and they don't become obsolete when the new hot stuff of the year comes around because they're generalised parallel processing units. As the industry becomes more mature we'll see more targeted inference/training engines come to the market, but the semiconductor industry works in decades, not years.

The reason FPGAs work as an in-between product is that they are also a commodity product. All you need is a team to program the device. You can buy a million of them off the shelf right now, and there are a hundred other applications they're good for, so you're not restricted when the time comes that your ASIC is no longer needed for whatever task it was explicitly designed for.


If you are getting a 403, you may read the HTML at https://bytez.com/docs/arxiv/2405.00738/paper


am i missing something? since when does vitis connect to vivado and do the p&r too?

anyway if you're tempted by this, i strongly advise you to ponder this:

> Run the Hardware build, should take around ~12 hours.


Welcome to the world of FPGA synthesis :-)

> am i missing something? since when does vitis connect to vivado and do the p&r too?

I haven't done much HLS, but isn't that the normal case? Translating the HLS into HDL and then do the pnr with vivado?


i've done plenty of FPGA and HLS as well

> Translating the HLS into HDL and then do the pnr with vivado?

As far as I know, vitis does not give you tcl scripts (or whatever) for vivado - you have to do that yourself.


I've been using HLS recently and from my understanding, all you need is `v++`.

Link to an example HLS makefile: https://github.com/Xilinx/Vitis_Accel_Examples/blob/f61637e9...
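
For a feel of what v++ actually consumes, here's a hand-wavy Vitis HLS sketch (not the paper's kernel; N, the function name, and the port/bundle names are made up for illustration). v++ synthesizes this C++ to RTL and runs Vivado's place-and-route under the hood:

    // Toy matrix-vector multiply kernel for Vitis HLS.
    #define N 256

    extern "C" void matvec(const float w[N][N], const float x[N], float y[N]) {
    #pragma HLS INTERFACE m_axi port=w bundle=gmem0
    #pragma HLS INTERFACE m_axi port=x bundle=gmem1
    #pragma HLS INTERFACE m_axi port=y bundle=gmem1

    row_loop:
        for (int i = 0; i < N; ++i) {
            float acc = 0.0f;
        col_loop:
            for (int j = 0; j < N; ++j) {
    #pragma HLS PIPELINE II=1
                acc += w[i][j] * x[j];
            }
            y[i] = acc;
        }
    }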


Trying to understand in what cases we would want to use FPGAs rather than GPUs.

Memory bandwidth for FPGAs seems worse, so for serving models don't GPUs still win out?


if i want to play around with this at home on a fpga devkit...which fpga kit should i use ?

the Xilinx Virtex UltraScale+ VU9P FPGA prototyping boards seem to be 9000 USD. Anything in the $1000 range?


eBay has a bunch of Alveo U30 cards for ~$500: 500k LUTs, 3000 DSP slices, ~1M registers, even some DDR.

fair warning: one does not "play around" with an FPGA. they are the antithesis of user friendly.


Note that the paper provides an environment for AWS FPGAs, which you can rent on a per-hour basis.

As for cheaper FPGAs, the paper notes that the bottleneck is the size of on-chip memory. So I doubt it will be easy to find a cheaper model to reduce costs.

Another hidden fee would be Vivado and Vitis (tooling) licenses, which you need for most upper-end FPGAs.


hi. this is very useful info thanks for this.


SiFive should work on this!




