DWARF perf sampling works, but it has really high overhead, it requires reading a large amount of stack space to find each prior frame, and the current tooling for generating perf reports from DWARF samples is really slow. You could definitely improve the last issue; I suspect no one has because it's not widely used. I don't think you can fix the first two issues, or at least you can't make them as fast as frame pointers.
Also, frame pointers just use up one extra register. The actual overhead of compiling with -fno-omit-frame-pointer instead of -fomit-frame-pointer is small enough for most programs that it's definitely worth including frame pointers if you think you'll ever need to profile a program running as a release build. I'll also add that frame pointers are very useful for debugging when you don't have debug symbols, which happens more often than you might think.
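To make the overhead difference concrete: with frame pointers, producing a stack trace is just chasing a linked list that already sits in the thread's stack, whereas DWARF unwinding has to snapshot a chunk of stack per sample and interpret unwind tables afterwards. Here's a minimal sketch of the frame-pointer walk, assuming the conventional x86-64 layout (saved rbp at [rbp], return address at [rbp+8]) and GCC/Clang builtins; a real unwinder would also sanity-check every pointer before following it:

    #include <stdio.h>

    /* Conventional x86-64 layout with -fno-omit-frame-pointer:
       [rbp]   = caller's saved rbp (link to the previous frame)
       [rbp+8] = return address into the caller */
    struct frame {
        struct frame *prev;
        void         *ret;
    };

    __attribute__((noinline)) static void backtrace_fp(void) {
        struct frame *fp = __builtin_frame_address(0);
        /* One load to read the return address, one load to hop to the
           next frame -- no tables, no copying the stack anywhere. */
        for (int depth = 0; fp != NULL && depth < 64; depth++) {
            printf("#%d  %p\n", depth, fp->ret);
            fp = fp->prev;
        }
    }

    __attribute__((noinline)) static void leaf(void) {
        backtrace_fp();
        puts("done");
    }

    int main(void) {
        /* Build with: gcc -O2 -fno-omit-frame-pointer walk.c */
        leaf();
        return 0;
    }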
My non-expert understanding: it seems that when frame pointers are enabled, the compiler prefers to address all stack variables using offsets from the frame pointer, whereas when frame pointers are disabled, stack variables are addressed at offsets from the stack pointer. But it turns out frame-pointer-based addressing is slightly less efficient than stack-pointer-based addressing. The problem is that the offset from the stack pointer / frame pointer is encoded into the instruction as either 8 bits or 32 bits -- 8 bits if that's enough, 32 bits otherwise. And for large stack frames, the stuff close to the frame pointer is mostly stuff that isn't actually touched during the function body, such as saved register values, whereas the stuff close to the stack pointer is typically the data that's currently being operated on. So addressing based on the stack pointer is less likely to spill over into 32-bit offsets.
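A toy illustration of that encoding difference (hypothetical function, made-up offsets -- the point is just the disp8 vs disp32 shapes you'd see in the disassembly):

    #include <stdio.h>

    /* A function with a biggish stack frame. The live data sits near the
       stack pointer but a long way below the frame pointer. */
    int big_frame(void) {
        volatile int buf[100];   /* ~400 bytes of locals, forced to memory */
        buf[0] = 1;
        buf[1] = 2;
        return buf[0] + buf[1];
        /* Frame-pointer addressing uses negative offsets from rbp, which in
           a frame this size no longer fit in one byte:
               mov eax, DWORD PTR [rbp-0x190]    ; 4-byte displacement
           Stack-pointer addressing reaches the same slot with a small
           positive offset:
               mov eax, DWORD PTR [rsp+0x4]      ; 1-byte displacement */
    }

    int main(void) {
        printf("%d\n", big_frame());
        return 0;
    }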
However, this all sounds like something that could be fixed in the compilers. They could still address off the stack pointer even when frame pointers are available. In fact, if they could choose on an instruction-by-instruction basis, maybe they could even save bytes. But for some reason they don't currently do this. It's not clear to me why -- maybe just because omitting frame pointers has been such a standard optimization for so long that no one has bothered optimizing the other case?
That makes sense and isn't something I had considered, but a 32-bit displacement vs. an 8-bit displacement just leads to bloat in binary size; it doesn't affect how many cycles your movs, jumps, etc. take. There are second-order effects where larger code size can cause a worse hit rate in the instruction cache, but usually those effects are minuscule when you actually benchmark the code. There will be some pathological programs where the instruction cache hit rate gets much worse with the larger binary, but I can't think of many programs I've seen that are mostly stalled on instruction fetch/decode.
Here's my hypothesis for why stack-pointer offsets haven't been implemented when compiling with frame pointers. At companies like Google/Facebook that compile huge C++ binaries (usually 100MB+ stripped), it's common for production binaries to be compiled with something like -fno-omit-frame-pointer (so system-wide profiling can be done on all C++ code in production) and to generate split debug symbols using -gline-tables-only. The latter flag basically generates just enough DWARF information to recover the mapping of pc -> source code line, so if you get a core dump you can figure out which line of C++ code was being executed in each frame of the call stack. If stack variables are addressed at frame-pointer offsets, you can also recover the values of local variables in each frame (assuming they haven't been optimized out by the inliner) just from each variable's offset from the frame pointer. So -fno-omit-frame-pointer plus -gline-tables-only gives you enough debug data to get the full call stack with line numbers and the values of local variables in each frame (except the inlined ones), while minimizing the cost of generating/storing debug data.
IMHO this argument is on the wrong side of the fence.
The advantages of guaranteed-usable backtraces to auditing and the long-term quality of a distro are just too much to pass up. Absolutely Fedora needs to disallow -fomit-frame-pointer. I just don't see a 2-3% performance argument as being persuasive there. Applications with performance-tuning requirements at that scale are going to be more than able to, y'know, recompile their junk. Frankly most runtimes these days prefer to deploy static binaries anyway. The default distro isn't the place to be dithering over 2%.
But what this is really measuring isn't the cost of a "frame pointer". It's the cost of an ABI-compliant function call. And there's technology available that reduces the number of these to damn near zero (with no impact on backtrace generation or debuggability), which would obviate all the argumentation over -f-o-f-p.
> And there's technology available that reduces the number of these to damn near zero (with no impact on backtrace generation or debuggability), which would obviate all the argumentation over -f-o-f-p.
Any pointers to such a solution? And why couldn't it be applied right now, without LTO?
There's an interesting update here which wasn't covered in the Fedora thread or this article:
Very new Intel & AMD hardware supports a feature called shadow stacks ("shstk"). This is a separate, CPU-maintained copy of the return addresses on the stack (for security reasons - to defend against stack smashing). These processors also let you read out the shadow stack, and you can use that to construct an accurate stack trace w/o frame pointers.
(There are good reasons why this shouldn't affect the Fedora decision short-term. It requires hardware which is very new, and I believe support is not quite there in perf. Also on some hardware the number of frames in the shadow stack is limited, apparently even more recent hardware lifts this restriction?)
I'm confused: the article includes concerns about "performance". Is the performance hit noticeable in the final binary (i.e. the one I'm running when I `dnf install` the RPM), or is it only during the source compilation?
If it's the latter, I don't see why it's a big deal. If it's the former, I'd love to know why the frame stack version of the program would run more slowly.
The performance hit (~2%) is observable at runtime, assuming you're doing things which are CPU bound. For cpython programs there's an especially large hit (~10%), which is well understood though not fixed: https://pagure.io/fesco/issue/2817#comment-826636
The flip side to this is that when you have good profiling it changes how you write and debug software. You can actually look at the whole system (eg with Flamegraphs https://www.brendangregg.com/flamegraphs.html) and find out what's taking all the time, and hopefully direct your fixes to the right place. This potentially gives you great performance benefits. Apparently Meta recompile all software with frame pointers and use perf to do this kind of analysis.
> If it's the former, I'd love to know why the frame stack version of the program would run more slowly.
I believe it's the former and my understanding is that the stack frame version uses a register (RBP on x86-64) to keep track of the (start?) of the stack frame, which presumably helps in debugging / getting stack traces and profiling performance.
However, when not keeping track of the stack frame, the compiler can use the RBP register for storing other values, which means that the compiler has one more register it can use to avoid loading/storing values in memory (accessing memory is much slower than accessing registers).
Also, when not keeping track of the stack frame, I believe there's no need to save/restore the value of RBP at the beginning/end of each function, which also reduces code size and the number of instructions executed.
As far as I understand, those are the reasons for the slowdown.
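Roughly, the per-function difference looks like this (typical x86-64 codegen sketched in comments -- not the exact output of any particular compiler):

    /* Any ordinary function, say: */
    long scale(long a, long b) {
        return a * b + 1;
    }
    /* Typical codegen with -fno-omit-frame-pointer:
           push rbp            ; save the caller's frame pointer
           mov  rbp, rsp       ; establish this function's frame pointer
           ...body...          ; rbp is off-limits to the register allocator
           pop  rbp            ; restore the caller's frame pointer
           ret
       Typical codegen with -fomit-frame-pointer:
           ...body...          ; rbp is just another usable register
           ret
       So the costs are two extra instructions in every prologue/epilogue,
       plus one fewer general-purpose register for the body, which can mean
       extra spills to the stack in register-hungry code. */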
There is a lot of nuance missing here - virtual frame pointers can be reproduced from debug info, but it requires your tools to understand debug info. So it's not "all or nothing".
You can turn them off and enable separate debug info, at a cost of tools having to understand the debug info.
Compilers produce good enough virtual frame pointers to do unwinding/exceptions, and debugging. They are not perfect however, and people throw rocks more than they fix things, as usual.
It seems like the argument is that cross-referencing the debug info is much slower than following frame pointers, so much so that whole-system profiling becomes impractical.
This is new to me, but I suppose it makes sense. You pause a thread to take a stack trace. Each frame of the trace might require reading some debug info which potentially needs to be loaded from disk. You can't predict in advance which debug info the next frame will need, because you have no idea what address it will be until you process the current frame. You can't let the thread continue execution until you're done tracing -- unless you want to clone the whole stack in the meantime? That all seems... pretty bad. If you're profiling a single binary, maybe you can fault in all the debug info in advance, but for whole-system that's probably a lot of RAM burned.
(Disclaimer: I'm just speculating here, I don't personally have experience with whole-system profiling.)
"It seems like argument is that cross-referencing the debug info is much slower than following frame pointers, so much so that whole-system profiling becomes impractical.
"
Think of all of this like eBPF, which can be done fast enough to work fine.
First, the debug_frame sections are small, separate, and self-contained. You don't have to fault in the rest of the debug info to decode a frame.
Second, I was, at one point, heavily involved in such a thing.
Without breaking confidentiality - it can be done.
Also, for the larger ones, you are sampling anyway if you are running production systems at the scale of Facebook, because it's cheaper to sample a small percentage of the time, and you will get the same coverage/results.
So it is not slow all the time, nor slow on all your machines, nor ...
This obviously does not help with running perf live, but ....
Third, if you are concerned about speed, this can also be cached quite well. You can go further too - you can JIT the debug_frame section into executable code if you want, or even compile ahead of time.
I wonder what is causing the massive Python regression? Does the interpreter use rbp for its own purposes and need to fall back to a slower implementation when frame pointers are enabled?
Right, of course - a giant switch statement to handle the threaded interpreter, which is very hard to optimize and allocate registers for. Makes sense that one less register available can have a negative effect.
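For a feel of why that's plausible: the hot loop of a bytecode interpreter keeps several values live across every single opcode (instruction pointer, stack index, operands), so losing one general-purpose register hurts more there than in average code. This isn't CPython's actual loop (CPython uses computed gotos where the compiler supports them), just a minimal sketch of the shape:

    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    /* Tiny stack machine: ip, sp and the operand stack top all want to live
       in registers across the whole dispatch loop. */
    static void run(const int *code) {
        int stack[64];
        int sp = 0;
        for (const int *ip = code;;) {
            switch (*ip++) {
            case OP_PUSH:  stack[sp++] = *ip++;              break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]);    break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        const int prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(prog);   /* prints 5 */
        return 0;
    }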
The massive Python regression is because of a missing, trivial compiler optimization - one that's already implemented, just not applied when you compile with frame pointers.
There is no actual need for the compiler to maintain a frame-pointer register, or to access stack-frame values relative to it. When you compile without frame pointers, the compiler does fine. It could emit the instruction that pushes a copy of the stack pointer and otherwise generate exactly what it generates today without frame pointers, except for adding a slightly larger value to the stack pointer before returning.
That said, who runs Python if they are bothered about performance?
- one less register means more spills, so more instructions and more reads/writes to the stack.
- when the frame pointer is enabled, gcc uses offsets from the frame pointer to address stack slots instead of offsets from the stack pointer. For the Python interpreter function this means larger offsets, but that's not necessarily always the case. The right thing would be for the compiler to always issue the shorter instruction (i.e. an offset from either rbp or rsp, whichever is smaller).
edit: rbp offsets are normally negative, so yes, they are usually larger than positive rsp offsets.
Is there some large hurdle to having two sets of bdist packages, one set compiled with frame pointer support, and the other without? Like pkg vs. pkg-devel or pkg-src?
pkg-fp, and have the dependency network trigger other installations.
As long as you build unwinding tables for each function (which I'm pretty sure gcc has an option for), there is no need for a frame pointer unless the function uses alloca or stack VLAs.
> tl;dr: maybe frame pointer omission was premature
Not really. More like "frame pointer omission made sense on the severely register-starved i386; today we can afford to spare %rbp for substantial debugging/profiling advantages"
> frame pointer omission made sense on the severely register-starved i386
For those who lack the context: 32-bit x86 uses only 3 bits for the register specification, so it can address only 8 general-purpose registers. Of these, one is reserved for the stack pointer, leaving only 7 general-purpose registers available, or 6 if you reserve one of them for the frame pointer.
For 64-bit x86, AMD added one extra register specification bit on a prefix byte, so at the cost of an extra instruction byte, the instruction can address an extra 8 general-purpose registers, for a total of 15 plus the stack pointer, or 14 if you also reserve the frame pointer. That is: on 64-bit x86, you have with frame pointers twice as many available registers as 32-bit x86 has without frame pointers. And these registers are double the size, so a 64-bit variable which needs two registers on 32-bit x86 needs only one register on 64-bit x86, so the effect can be more than just a doubling.
And then you have 64-bit ARM, which uses five bits for the register specification... (32-bit ARM already used four bits, so it was never as register-starved as 32-bit x86; and RISC-V uses five bits for both 32-bit and 64-bit.)
I don't think getting rid of frame pointer omission would be viable if Fedora still supported 32-bit x86. Once they got rid of most of that compatibility (there are still 32-bit compatibility libraries, necessary only for some old binary-only code and for Wine, and I believe Wine will sooner or later not need them anymore), making this change became more acceptable.
> That is: on 64-bit x86, you have with frame pointers twice as many available registers as 32-bit x86 has without frame pointers
Indeed.
> I don't think getting rid of frame pointer omission would be viable if Fedora still supported 32-bit x86
Fedora still supports it (e.g. packages are still built for i686, and you can install them on x86_64) though it has dropped 32-bit ARM. But the change to reenable frame pointers will not be applied to 32-bit x86 builds.
I have to agree with him. perf has an option to use DWARF data for stacks, but IMHO it simply does not work.