Hello. I'm the author of the PR that landed the tail-calling interpreter in CPython.
First, I want to say thank you to Nelson for spending almost a month getting to the root of this issue.
Secondly, I want to say I'm extremely embarrassed and sorry that I made such a huge oversight. I, and probably the rest of the CPython team, did not expect the compiler we were using for the baseline to have that bug.
Reading that you are extremely embarrassed and sorry that you made such a huge oversight, I imagined you had broken something or worsened CPython's performance.
But it's nothing like that. You announced a 10-15% perf improvement, but that improvement is more like 1-5% on a non-buggy compiler. It's not even that the 10-15% figure is wrong; it's just that it's correct only under very specific conditions, unknown to you.
IIUC, you did your homework: you made an improvement, you measured a 10-15% perf improvement, the PR was reviewed by other people, etc. It just so happens that this 10-15% figure is misleading because of an issue with the version of clang you happened to use to measure. Unless I'm missing something, it looks like a fair mistake anyone could have reasonably made. It even looks like it was hard to not fall into this trap. You could have been more suspicious seeing such a high number, but hindsight is 20/20.
Apparently, you still brought significant performance improvements, and your work also helped uncover a compiler regression. The wrong number seems quite minor in comparison. I wonder who was actually hurt by this. I'm only discovering the "case" now, but at first glance it doesn't feel like you owe an apology to anyone. Kudos for all this!
In a way, by indirectly helping fix this bug, this work led to a ~10% performance increase for everyone who was using that faulty compiler! That's even better than an optional flag that many people won't know about or use.
That performance regression only hit code that used a very large number of paths ending in the same table of computed gotos, so likely only relatively complex interpreters were affected. So it's not a broad performance improvement. But it is nice to have an example of the compiler's new heuristic failing, as evidence that it needs to be tunable.
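For reference, the code shape being described is roughly the following — a minimal, hypothetical sketch of a computed-goto dispatch loop, not CPython's actual eval loop (labels-as-values is a GCC/Clang extension, and the opcodes are made up for illustration):

    #include <stdint.h>

    /* Example bytecode {1, 1, 2, 0} means: push1, push1, add, halt -> returns 2.
       Assumes well-formed bytecode that always ends with op_halt. */
    int64_t run(const uint8_t *code) {
        /* Labels-as-values: one label per opcode. */
        static void *table[] = { &&op_halt, &&op_push1, &&op_add };
        int64_t stack[64];
        int sp = 0;
        const uint8_t *pc = code;

        goto *table[*pc];

    op_push1:
        stack[sp++] = 1;
        pc += 1;
        goto *table[*pc];   /* each opcode ends with its own copy of the dispatch jump... */

    op_add:
        sp -= 1;
        stack[sp - 1] += stack[sp];
        pc += 1;
        goto *table[*pc];   /* ...so the branch predictor sees distinct indirect jumps */

    op_halt:
        return stack[sp - 1];
    }

As noted further down the thread, the "buggy" compiler merged those per-opcode indirect jumps back into a single shared one, which is exactly the layout the duplicated gotos are meant to avoid.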
> IIUC, you did your homework: you made an improvement, you measured a 10-15% perf improvement, the PR was reviewed by other people, etc. It just so happens that this 10-15% figure is misleading because of an issue with the version of clang you happened to use to measure. Unless I'm missing something, it looks like a fair mistake anyone could have reasonably made. It even looks like it was hard to not fall into this trap. You could have been more suspicious seeing such a high number, but hindsight is 20/20.
Hah! Is this a Gettier problem [0]?
1. True: The PR improves Python performance 10-15%.
2. True: Ken believes that the PR improves Python performance 10-15%.
3. True: Ken is justified in believing that the PR improves Python performance 10-15%.
Of course, PR discussions don't generally revolve around whether or not the PR author "knows" that the PR does what they claim it does. Still: these sorts of epistemological brain teasers seem to come up in the performance measurement field distressingly often. I wholeheartedly agree that Ken deserves all the kudos he has received; still, I also wonder if some of the strategies used to resolve the Gettier problem might be useful for code reviewers to center themselves every once in a while. Murphy's Law and all that.
FWIW - the fix was merged since you wrote that blog post ;)
Beyond that - 3-5% is a lot for something as old as the Python interpreter, if it holds up. I would still be highly proud of that.
After 30 years, I've learned (like I expect you have) to be suspicious of any significant (i.e. >1%) performance improvement in a system that has existed a long time.
They happen for sure, but are less common. Often, people are shifting time around, so it just isn't part of your benchmark anymore [1]. Second, benchmarking is often done in controlled environments, to try to isolate the effect. That seems like the right thing to do. But then the software is run in non-isolated environments (i.e. with a million other things on a VM or desktop computer), which isn't what you benchmarked it in. I've watched plenty of verifiably huge isolated gains disappear or go negative when put in production environments.
You have the particularly hard job that you have to target lots of environments - you can't even do something like say "okay look if it doesn't actually speed it up in production, it didn't really speed it up", because you have no single production target.
That's a really hard world to live in and try to improve.
In the end, performance tuning and measurement is really hard. You have nothing to be sorry for, except learning that :)
Don't let this cause you to be afraid to get it wrong - you will get it wrong anyway. We all do. Just do what you are doing here - say 'whoops, I think we screwed this up', try to figure out how to deal with it, and try to figure out how to avoid it in the future (if you can).
[1] This is common not just in performance, but in almost anything, including human processes. For example, to make something up, the team working on the code review tool would say "we've reduced code review time by 15% and thus sped up everyone's workflow!". Awesome! But actually, it turns out they made more work for some other part of the system, so the workflow didn't get any faster, they just moved the 15% into a part of the world they weren't measuring :)
> After 30 years, I've learned (like I expect you have) to be suspicious of any significant (i.e. >1%) performance improvement in a system that has existed a long time.
Sure, let me amend it to "I'm suspicious of any significant performance improvement in a system where performance actually matters, and has existed in a state where performance matters for a long time".
I think it's important to note that a primary motivation of the tail call interpreter design is to be less vulnerable to the whims of the optimizer. From my original blog article about this technique (https://blog.reverberate.org/2021/04/21/musttail-efficient-i...):
> Theoretically, this control flow graph paired with a profile should give the compiler all of the information it needs to generate the most optimal code [for a traditional switch()-based interpreter]. In practice, when a function is this big and connected, we often find ourselves fighting the compiler. It spills an important variable when we want it to keep it in a register. It hoists stack frame manipulation that we want to shrink wrap around a fallback function invocation. It merges identical code paths that we wanted to keep separate for branch prediction reasons. The experience can end up feeling like trying to play the piano while wearing mittens.
That second-to-last sentence is exactly what has happened here. The "buggy" compiler merged identical code paths, leading to worse performance.
The "fixed" compiler no longer does this, but the fix is basically just tweaking a heuristic inside the compiler. There's no actual guarantee that this compiler (or another compiler) will continue to have the heuristic tweaked in the way that benefits us the most.
The tail call interpreter, on the other hand, lets us express the desired machine code pattern in the interpreter itself. Between "musttail", "noinline", and "preserve_none" attributes, we can basically constrain the problem such that we are much less at the mercy of optimizer heuristics.
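Concretely, the shape is roughly this — a minimal, hypothetical sketch, not CPython's actual generated code (musttail needs a reasonably recent Clang; CPython's real handlers additionally use the preserve_none calling convention, which requires Clang 19+ and is omitted here so the sketch compiles on older Clang):

    #include <stdint.h>

    typedef struct { int64_t stack[64]; int sp; } VM;

    /* Every handler shares one signature, so each tail call is legal. */
    typedef int64_t (*handler_t)(VM *vm, const uint8_t *pc);

    static int64_t op_halt(VM *vm, const uint8_t *pc);
    static int64_t op_push1(VM *vm, const uint8_t *pc);
    static int64_t op_add(VM *vm, const uint8_t *pc);

    /* One small function per (made-up) opcode instead of one giant switch. */
    static const handler_t dispatch[] = { op_halt, op_push1, op_add };

    /* Force a guaranteed tail call into the next opcode's handler. */
    #define DISPATCH() __attribute__((musttail)) return dispatch[*pc](vm, pc)

    __attribute__((noinline))
    static int64_t op_push1(VM *vm, const uint8_t *pc) {
        vm->stack[vm->sp++] = 1;       /* opcode body */
        pc += 1;                       /* advance to the next instruction */
        DISPATCH();
    }

    __attribute__((noinline))
    static int64_t op_add(VM *vm, const uint8_t *pc) {
        vm->sp -= 1;
        vm->stack[vm->sp - 1] += vm->stack[vm->sp];
        pc += 1;
        DISPATCH();
    }

    __attribute__((noinline))
    static int64_t op_halt(VM *vm, const uint8_t *pc) {
        (void)pc;
        return vm->stack[vm->sp - 1];  /* exit the interpreter: a plain return */
    }

    /* Example bytecode {1, 1, 2, 0} means: push1, push1, add, halt -> returns 2. */
    int64_t run(const uint8_t *code) {
        VM vm = { .sp = 0 };
        return dispatch[*code](&vm, code);   /* enter the chain of tail calls */
    }

Because each handler is a separate noinline function and every transfer is a forced tail call, the compiler is much less free to merge handlers or spill the hot state the way it can inside one huge switch.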
For this reason, I think the benefit of the tail call interpreter is more than just a 3-5% performance improvement. It's a reliable performance improvement that may be even greater than 3-5% on some compilers.
This is a good point. We have already observed this in our LTO and PGO builds for the computed-goto interpreter. On modern compilers, each LTO+PGO build has large variance (1-2%) for the CPython interpreter. On macOS, we saw a big regression in performance because Xcode just decided to stop making LTO and PGO work properly on the interpreter. Presumably, the tail-call interpreter would be immune to this.
I just wanted to say: respect for being able to say "sorry, I made a mistake". I hate the fake it till you make it mentality that seems to be the norm now.
I understand the frustration, but I don't think that part needed to be said (the part about the mentality - the thanks is of course cool), because that's still not the norm.
Why do I even bring this up? I want to say: let's not let what we see in the news influence our perception of the real people of the world. Just because fraud and crimes get elevated in the news does not mean that the common man is a criminal or a fraud. :)
Why didn't this regression in baseline performance show up (or did it?) on the faster-cpython benchmarks page [0]? Could the benchmarks be improved to prevent similar issues in the future?
As alluded to in https://news.ycombinator.com/item?id=43319010, I see these tests were collected against just 2 Intel and 2 ARM CPUs. So, if you are looking for ways to improve, you should probably also include (at least) an AMD Zen4 or Zen5 in there. CPU & compiler people have both been trying to "help perf while not trusting the other camp" for as long as I can remember, and I doubt that problem will ever go away.
A couple more CPUs will help but not solve generalizability of results. E.g., if somebody tests against some ancient 2008 Nehalem hardware, they might get very different answers. Similarly, answers today might not reflect 2035 very well.
The reality of our world of complex hardware deployment (getting worse with GPUs) is that "portable performance" is almost a contradiction in terms. We all just do the best we can at some semblance of the combination. The result is some kind of weighted average of "source that isn't #ifdef'd too heavily" and "a bit faster, informally averaged over our user population & their informally averaged workloads", and this applies at many levels of the whole computing stack.
EDIT: And, of course, a compiled impl like Cython or Nim is another way to go if you care about performance, but I do understand the pull & network effects of the Python ecosystem. So, that may not always be practical.
We don't normally test with bleeding-edge compilers on the faster cpython benchmarks page because that would invalidate historical data. E.g. if 2 years ago we used GCC 11 or something to compile and run a benchmark, we need to run it with GCC 11 again today to get comparable results.
Clang 19 was released last year; we only benchmarked it a few months ago. We did notice there was a significant slowdown on macOS, but that was against Xcode Clang, which is a different compiler. I thought it might've been an Xcode thing, which has bitten CPython before (such as Xcode LTO working/not working versus normal Clang), so I didn't investigate deeper (facepalming now in retrospect) and chalked it up to a compiler difference.
TLDR: We didn't run benchmarks of Clang 19 versus 18. We only ran benchmarks of Clang 19 versus GCC, Xcode Clang, and MSVC, none of which are apples-to-apples with Clang 19, so I naively thought it was a compiler difference.
EDIT: As to how we could improve this process, I'm not too sure, but I know I'll now be more discerning when there's a >4% perf difference when upgrading compilers.
The blog post mentions it brings a 1-5% perf improvement, which is still significant for CPython. It does not complicate the source because we use a DSL to generate CPython's interpreters, so the only complexity is in autogenerated code, which is usually meant for machine consumption anyway.
The other benefit (for us maintainers, I guess) is that it compiles way faster and is more debuggable (perf and other tools work better) when each bytecode is a smaller function. So I'm inclined to keep it for both perf and productivity reasons.
Being more robust to fragile compiler optimizations is also a nontrivial benefit. An interpreter loop is an extremely specialized piece of code whose control flow is too important to be left to compiler heuristics.
If the desired call structure can be achieved in a portable way, that's a win IMO.
There was a plan, looking for funding back in 2022, for a 5x speedup overall. Then a team with Guido and others involved (and MS backing?) got on the same bandwagon and made some announcements about speeding up CPython a lot.
Several releases in, have we seen even a 2x speedup? Or more like 0.2x at best?
Not trying to dismiss the interpreter changes - I more want to know whether those speedup plans were even remotely realistic, and whether anything close to even 1/5 of what was promised will really come out of them...
The Faster CPython project was for 5x over the 3.10 baseline. CPython 3.13 currently runs at something like 1.6x the speed of 3.10. With the JIT enabled it goes up by a few more single-digit percentage points. With the changes in 3.14 it'll be something like a 1.8x speed-up.
So it's slowly getting there. I think the Faster CPython project was mostly built around the idea that the JIT can get a lot faster as it starts to optimise more and more, and the JIT only just shipped in 3.13, so there's a lot of headroom. We know that PyPy (an existing JIT implementation) is already close to 5x faster than CPython a lot of the time.
There's also now the experimental free-threading build which speeds up multithreaded Python applications (Not by a lot right now though unfortunately).
There have been a lot of speed-ups coming out of that project. It's not being done in one go, though, but spread out over the last several releases. So while individual speed-ups may not look that significant, remember that each one compounds on the previous release's speed-up (e.g., 3.13's ~1.6x times a further ~12% in 3.14 works out to roughly 1.8x over 3.10).
> His academic and commercial work is focused on compilers, virtual machines and static analysis for Python. His PhD was on building virtual machines for dynamic languages.
This dude looks God-level.
Half-joking: Maybe MSFT can also poach Lars Bak of Google V8 fame.
Thank you for your efforts in improving Python and especially thank you for helping to set the record straight in such a clear and visible manner.
Some of the sibling comments are saying there's "no need to apologize". One way this might be read is "no need to hold yourself to a higher standard". If this is the sentiment, then I disagree—we live in an age where accountability and intellectual integrity are undervalued. Being able to say that you and your team should have investigated a too-good-to-be-true result further before reporting it is setting a higher standard for yourselves.
But another way to read this is "no need to feel bad" and on this I agree. We all often fail to sufficiently challenge our own work, especially when it looks like something we worked hard on had a big impact.
> we live in an age where accountability and intellectual integrity are undervalued
I'm tired of this doomerism trope on HN. One thing has been constant in my life: People complaining that "we live in an age where accountability and intellectual integrity are undervalued". For me, it is on the same level as interviewing people for a local news show: "So, how's traffic these days?" "Oh, it's never been worse."
In what age were accountability and intellectual integrity "correctly" valued?
The way this was handled is incredibly good form! Thank you for going to such lengths to make sure the mistake was corrected. I respect and thank you (and all the developers working in Python) for all the work you do!
Yea, don't feel too bad; you're just unlucky that your mistake was so public - most of us make these kinds of mistakes weekly in a more private setting. I think everyone who knows anything about how optimized Python already is would be impressed that you managed even a few percent improvement in speed! Even those few percent will probably save many, many GWh of power!
You have a new account here and your blog is just one article so far so you might be a new-ish developer(?), but you are doing great, keep it up! If you are not a new developer, you are still doing great, keep it up!
I don’t think you should be embarrassed or apologize. You still did a thing that improved performance - in the near term it worked around a bug that you weren’t even aware of, and long term there are still gains even with that bug fixed. But even if that weren’t the case, the only way anything gets done in software is to rely on abstractions. You can either get things done or know exactly how every last bit of your stack is implemented, but probably not both. It’s a very reasonable trade off.
Smart people go and build things. Other smart people find problems. Nothing's broken with that.
I feel bad for you, since a change on the order of 5% would still have been seen as a very nice bit of work. I appreciate the class with which you're dealing with an honest mistake, and all of the hard work you've given the Python community.
You don't need a long heartfelt apology - a simple "god dammit" is enough for this. You made no mistake - you only got unlucky.
It's possible to improve your luck by applying more care, but it's also possible to apply so much care that you do not try things that would have been useful, so I'd rather you keep erring on the side that you did!
No one cares about this oversight - more important is that you spend time on development that the rest of us don't have time for (or don't have the skill to do, in my case). Thanks for your hard work.
Hello, I was one of the people who contributed to this interpreter change. Thank you Nelson for the excellent write-up and for going to such lengths to get to the bottom of this. That said, I wanted to defend Ken a bit. :-)
Ken Jin is a volunteer who's been tirelessly working to make CPython faster over the past couple of years for not enough credit. IMHO, Ken did nothing wrong here, and there's really nothing to be embarrassed about. The fact that it took a month (more if you consider the folks helping to reproduce the original results) for anyone to notice anything wrong shows how complicated the situation was! Put yourself in Ken's shoes -- having updated to LLVM-19 to enable the preserve_none flag, would you then hypothesize that LLVM-19 may have introduced an unrelated performance regression that no one noticed for five months? Lots of people underestimate how hard benchmarking is IMO.
A 1-5% performance improvement is also pretty valuable, just not quite as spectacular as we thought originally :-)
I posted an apology blog post here. https://fidget-spinner.github.io/posts/apology-tail-call.htm...