AI Flame Graphs (brendangregg.com)
141 points by mfiguiere 10 hours ago | 26 comments





I've tried using flame graphs, but in my view nothing beats the simplicity and succinctness of gprof output for quickly analyzing program bottlenecks.

https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_chapter...

For each function you can see how much CPU time is spent in the function itself versus its child calls, all in a simple text file, with no constant scrolling, panning, and zooming to get the information you need.
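If you want the same kind of flat, self-vs-children text report from Python rather than a compiled binary, the standard library's cProfile is a rough analog (a minimal sketch, not gprof itself): tottime is time spent in the function body, cumtime includes child calls.

    # Minimal sketch: a gprof-style flat profile using Python's standard library.
    # tottime ~ gprof "self" time; cumtime ~ time including child calls.
    import cProfile
    import io
    import pstats

    def child():
        return sum(i * i for i in range(200_000))

    def parent():
        return [child() for _ in range(10)]

    profiler = cProfile.Profile()
    profiler.enable()
    parent()
    profiler.disable()

    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("tottime").print_stats(5)
    print(out.getvalue())  # plain text, no scrolling or zooming required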


I actually looked at this in detail about a year ago for some automated driving compute work at my previous job, and I found that the detailed info you'd want from Nvidia was just 100% unavailable. There are pretty good proxies in some of the data you can get out of Nvidia tools, and there's some extra info you can glean from the function call stack in the open source Nvidia driver shim layer (because the actual main components are still a binary blob, even with the "open source" driver), but overall you still can't get much useful info out.

Now that Brendan works for Intel, he can get a lot of this info from the much more open source Intel GPU driver, but that's only so useful since everyone is still on either Nvidia or AMD. The more hopeful sign is that a lot of Nvidia's major customers are going to start demanding this sort of access, and there's a real chance that AMD's more accessible driver starts documenting what to actually look at, which will create the market competition to fill this space. In the meantime, take a look at the flamegraph capabilities in PyTorch and similar frameworks, one abstraction level up, and eke out what performance you can.
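For example, torch.profiler can already dump folded stacks that flamegraph.pl understands. A hedged sketch (profiler options and metric names vary a bit across PyTorch versions; "self_cuda_time_total" is from the profiler docs, and a CUDA device is assumed):

    # Sketch: framework-level flame graphs from PyTorch's built-in profiler.
    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
    ).cuda()
    x = torch.randn(64, 4096, device="cuda")

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        with_stack=True,
    ) as prof:
        for _ in range(10):
            model(x)
        torch.cuda.synchronize()

    # Folded stacks, ready for Gregg's flamegraph.pl:
    #   flamegraph.pl --title "self CUDA time" stacks.txt > flame.svg
    prof.export_stacks("stacks.txt", "self_cuda_time_total")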


> based on Intel EU stall profiling for hardware profiling

It wasn't clearly defined, but I think EU stall means Execution Unit stall, which is when an execution unit "becomes stalled when all of its threads are waiting for results from fixed function units": https://www.intel.com/content/www/us/en/docs/gpa/user-guide/...


> Imagine halving the resource costs of AI and what that could mean for the planet and the industry -- based on extreme estimates such savings could reduce the total US power usage by over 10% by 2030.

Why would it be the case that reducing the costs of AI reduces power consumption, as opposed to increasing AI usage (or another application using the electricity)? I would think that with cheaper AI, its usage would become more ubiquitous: LLMs in fridges, toasters, smart alarms, etc.


This is the https://en.wikipedia.org/wiki/Jevons_paradox and it's what always happens in these cases.

It does happen, but not always.

For example, food got cheaper and consumption increased to the point that obesity is a major problem, but consumption has grown much less than you might expect given how much productivity per farmer has increased.

For image generation, the energy needed to create an image is rapidly approaching the energy cost of a human noticing that they've seen an image — once it gets cheap enough (and good enough) to have it replace game rendering engines, we can't really spend meaningfully more on it.

(Probably. By that point they may be good enough to be trainers for other AI, or we might not need any better AI — impossible to know at this point).

For text generation, difficult to tell because e.g. source code and legal code have a lot of text.


It's possible to decrease costs faster than usage can rise.

You specifically picked things like toasters and fridges which seem like frivolous if not entirely useless applications of LLMs.

But you can be more charitable and imagine more productive uses of AI on the edge that are impossible today. Those uses would presumably create some value, so if by reducing AI energy costs by 90% we get all the AI usage we have today plus those new uses that aren't currently viable, it's a better bang for buck.


The answer depends on what is rate-limiting growth; while we are supply-constrained on GPUs you can’t just increase AI usage.

The next bottleneck will be datacenter power interconnects, but in that scenario, as you say, you can expect power usage to expand to fill the supply gap if there is a perf win.


I had the same thought - power use will not be halved, usage will double instead.

> Imagine halving the resource costs of AI and what that could mean for the planet and the industry

Google has done this: "In eighteen months, we reduced costs by more than 90% for these queries through hardware, engineering, and technical breakthroughs, while doubling the size of our custom Gemini model." https://blog.google/inside-google/message-ceo/alphabet-earni...


rephrased as "We took compute from everything else.... and gave it to AI"

This is so cool! Flame graphs are super helpful for analyzing bottlenecks. The eflambe library for Elixir has let us catch some tricky issues.

https://github.com/Stratus3D/eflambe/blob/master/README.adoc


This is super interesting and useful. I tried reading the code to understand how GPU workloads worked last year and it was easy to get lost in all the options and pluggable layers.

I can imagine Nelson and other Anthropic engineers jumping for joy at this release.

> Imagine halving the resource costs of AI ... based on extreme estimates such savings could reduce the total US power usage by over 10% by 2030

Is that implying that by 2030 they expect at least 20% of all US energy to be used by AI?
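(Spelling out the arithmetic behind that reading: if halving AI's resource cost saves over 10% of total US power, then 0.5 × s >= 0.10, so s >= 0.20, where s is AI's share of total US power use in 2030. That's a back-of-the-envelope reading of the quoted sentence, not a figure from the article.)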


Data centers are big consumers of energy. Most modern data centers will have a mix of vector and scalar compute because ML/AI is a bunch of stuff, most of which was ubiquitous a decade ago.

In the limit case where Prineville just gets 100k BH100 slammed into it? The absolute best you’re going to do is to have Brendan Gregg looking at the cost. He’s the acknowledged world expert on profiling and performance tuning on modern gear in the general case. There are experts in a vertical (SG14, you want to watch Carl Cook).

I’ve been around the block, and my go-to on performance trouble is "What does the Gregg book say here…"; make it your first stop.


The data source is linked and is based on the ARM Datacenter Energy prediction.

But I don't think it's too far-fetched.

The compute needed for digital twins (simulating a whole army of robots, then uploading the results to the robots, which themselves still need a ton of compute) is not unrealistic.

Cars like Teslas have A TON of compute built in too.

And we have seen what suddenly happens to an LLM when you scale up the number of parameters. We were in an investment hell where it was not clear what to invest in (the crypto, blockchain, and NFT bubbles had burst), but AI opened up the sky again.

If we continue like this, it's not far-fetched that everyone will have their own private agent running and will pay for it (private/isolated for data security), plus a work agent.


Seems pretty absurd

Given who said it, I chose to read for understanding.

Unrelated, but on the topic of reducing power consumption, I want to once again note that both AMD and NVidia max out a CPU core per blocking API call, preventing your CPU from entering low power states even when doing nothing but waiting on the GPU, for no reason other than to minimally rice benchmarks.

Basically, these APIs are set up to busyspin while waiting for a bus write from the GPU by default (!), rather than use interrupts like every other hardware device on your system.

You turn it off with

NVidia: `cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)`

AMD: `hipSetDeviceFlags(hipDeviceScheduleBlockingSync)`

In PyTorch:

NVidia: `import ctypes; ctypes.CDLL('libcudart.so').cudaSetDeviceFlags(4)`

AMD: `import ctypes; ctypes.CDLL('libamdhip64.so').hipSetDeviceFlags(4)`

This saves me 20W whenever my GPU is busy in ComfyUI.

Every single device using the default settings for CUDA/ROCm burns a CPU core per worker thread for no reason.
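Putting that together, here's a fuller sketch of the Python workaround (the flag values come from the CUDA and HIP headers, where BlockingSync is 4 in both; it assumes libcudart.so or libamdhip64.so is on the loader path, and that the flag is set before the first CUDA call creates the context):

    # Sketch: opt CUDA/ROCm into interrupt-driven ("blocking sync") waits from Python.
    # Schedule flag values: 0 = Auto, 1 = Spin, 2 = Yield, 4 = BlockingSync.
    import ctypes

    SCHEDULE_BLOCKING_SYNC = 4  # same value for cudaDeviceScheduleBlockingSync and hipDeviceScheduleBlockingSync

    try:
        ctypes.CDLL("libcudart.so").cudaSetDeviceFlags(SCHEDULE_BLOCKING_SYNC)
    except OSError:
        # No CUDA runtime found; try the ROCm/HIP equivalent instead.
        ctypes.CDLL("libamdhip64.so").hipSetDeviceFlags(SCHEDULE_BLOCKING_SYNC)

    import torch  # the CUDA context is created lazily on first use, after the flag is set

    x = torch.randn(8192, 8192, device="cuda")
    y = x @ x                  # kernel launches asynchronously
    torch.cuda.synchronize()   # the CPU now sleeps on an interrupt instead of busy-spinning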


> for no reason other than to minimally rice benchmarks.

For AI/ML applications, perhaps no one will notice.

For gaming, yielding threads of execution to the OS can periodically incur minimum scheduler delays of 10-20ms. Many gamers will notice an ~extra frame of latency being randomly injected.


Sure, but CUDA is an AI/ML API, and you're not doing blocking calls when writing a graphics engine anyway. (Well, you'd better not.) Besides, these calls will already busy-spin for a few milliseconds before yielding to the OS; it's just that you have to explicitly opt in to the latter part. So these are the sorts of calls you'd use for high-throughput work, but they behave like calls designed for very-low-latency work. There is no point in shaving a few milliseconds off a call that takes seconds, other than to make NVidia look a few percent better in benchmarks. The tradeoffs are all wrong, and because nobody knows about it, megawatts of energy are being wasted.

Totally looks like a self-promotion article lol

This guy invented flame graphs (among other things) so... I'm gonna allow it.

https://en.wikipedia.org/wiki/Brendan_Gregg


There has been a bit of hyperbole of late about energy savings in AI.

There isn't a magic bullet here; it's just people improving a relatively new technology. Even though the underlying neural nets are fairly old now, the newness of transformers and of the massive scale means there's still quite a lot of low-hanging fruit. Some of the best minds are on this problem and are reaching for the hardest-to-reach fruit.

A lot of these advancements work well together, improving efficiency a few percent here, a few percent there.

This is a good thing, but people are doing crazy comparisons by extrapolating older tech into future use cases.

This is like estimating the impact of cars by correctly guessing that there are 1.4 billion cars in the world and multiplying that by the impact of a single Model T Ford.




