I get it. This frustrated me to no end. But I still did what I had to do: recompiled random software throughout the stack, enabled random flags, etc. It was doable, and now I can do it much faster. I don't think it's fair for upstream to disable a useful optimization just so I don't have to do this additional work to fix and optimize my system.
Doing real-world, whole-system profiling, we've found performance was affected by completely unexpected software running on the system. Recompiling the entire distribution, or even just the subset of it that's actually installed, is not realistic for most people. Besides, I have measured the overhead of frame pointers and it's less than 1%, so there's not really any trade-off here.
Anyway, soon we'll have SFrame support in the userspace tools and the whole issue will go away.
In one of my jobs, a 1% perf regression (on a more stable/reproducible system, not PCs) was a reason for a customer raising a ticket, and we'd have to look into it. For dynamically dispatched but short functions, the overhead is easily more than 1% too. So, there is a trade-off, just not one that affects you.
I think it comes down to numbers. What are most installed systems used for? Do more than 50% of installed systems need to profile all binaries, all the time, such that everything has to be built this way up front, rather than identifying and preparing the relevant binaries ahead of time?
If so, then it should be the default.
If it's a close call, then there should be two versions of the ISO and repos.
As many developers and service operators as there are, and even though everyone on this page is one, including both you and me, I still do not believe the profiling use case is the majority use case.
The way I am trying to judge "majority" is: Pick a binary at random from a distribution. Now imagine all running instances of that binary everywhere. How many of those instances need to be profiled? Is it really most of them?
So it's not just an unsympathetic "F developers'/services' problems" attitude. I am one myself.
Everyone benefits from the net performance wins that come from an ecosystem where everyone can easily profile things. I have no doubt that works out to more than a 1% lifetime improvement. Same reason you log stuff on your servers. 99.9% pure overhead, never even seen by a human. Slows stuff down, even causes uptime issues sometimes from bugs or full discs. It's still worthwhile though because occasionally it makes fixes or enhancements possible that are so much larger than the cost of the observability.
I don't see how this applies. Some shell has to be the default one, and all systems don't pick the same one even. Most systems don't install a compiler by default. Thank you for making my point?
All these things are possible to do, even though only developers need them. Why shouldn’t the same be true for useful profiling abilities? Because of the 1-2% penalty?
Visa sells money for money, skimming off a percentage.
CPUs spend cycles for features (doing useful work). Enabling frame pointers skims off a percentage of those cycles. But it's the impact on useful work that matters, not how many cycles you lose; the cycles are just a means to an end. So x% of cycles is fundamentally incomparable to x% of money.
The whole point of an analogy is to expose a blind spot by showing the same thing in some other context where it is recognized or perceived differently.
There are no performance winners if you include them by default. There will always be some overhead (>0%): you're executing additional code in every prologue and epilogue, and increasing register pressure by preventing rbp from ever being allocated.
There are only "winners" in the sense that people will be able to more easily see why their never-tuned system is so slow. On the other hand, you're punishing all perf-critical usecases with unnecessary overhead.
I believe if you have a slow system, it's up to you to profile and optimize it, and that includes even recompiling some software with different flags to enable profiling. It's not the job of upstream to make this easier for you if it means punishing those workloads where teams have diligently profiled and optimized through the years so that there is no, as the author says, low-hanging fruit to find.
I’ve been around long enough to have had frame pointers pretty ubiquitously, then lost them, and now to be getting them back again. The dark times in the middle were painful. For the software I’ve worked on, the easy dynamic profiling that frame pointers enable (e.g. using DTrace) has given far more in performance wins than omitting them would have. (Part of my beef with the article is that while edge cases do break some samples, in practice it’s a very small fraction, and almost by definition not the important ones if you’re trying to find heavy on-CPU code paths.)
I get that some use cases may be better without frame pointers. A well-resourced team can always recompile the world, whichever the default is. It’s just that my experience is that most software is not already perfectly tuned and I’d much rather the default be more easily observable.
Look, it's likely we just come from different backgrounds. Most of my perf-sensitive work was optimizing inner loops with SIMD, allowing the compiler to inline hot functions, creating better data structures to make use of the CPU cache, etc. Frame pointer prologue overhead was measurable on most of our use cases. I have less experience profiling systems where calls trace across multiple processes, so maybe I haven't felt this pain enough. Though I still think the onus should be on teams to be able to comfortably recompile, if not the world, then some part of it. After all, a lot of tuning can only be done through compile flags, such as turning off codepaths/capabilities which are unnecessary.
I wasn't exaggerating about recompiling the world, though. Even if we say I'm only interested in profiling my application, a single library compiled without frame pointers makes useless any samples where code in that library was at the top of the stack. I've seen that be libc, openssl, some random Node module or JNI thing, etc. You can't just throw out those samples because they might still be your application's problem. For me in those situations, I would have needed to recompile most of the packages we got from both the OS distro and the supplemental package repo.
My experience is on performance tuning the other side you mention. Cross-application, cross-library, whole-system, daemons, etc. Basically, "the whole OS as it's shipped to users".
For my case, I need the whole system set up correctly before it even starts to be useful. For your case, you only need the specific library or application compiled correctly; the rest of the system is negligible and probably not even used. Who would optimize SIMD routines next to function calls anyway?
I would probably host even some business-critical services on Hetzner's infra. I'm thinking of "worker"-type workloads, where each machine is 100% stateless and just serves to do some compute-intensive work. With that configuration, single-node data loss doesn't really affect you, and the CPU is plentiful and cheap with Hetzner bare metal (e.g. AX101 AMD machines).
Yeah, but where would you store state? The hyperscalers give you pretty reasonable durable storage (even datalakes). Most people don’t get storage tiering or using PaaS for workers, though.
I'd shy away from storing any non-volatile state on Hetzner. As I said, I'd mostly consider it for stateless compute-bound applications.
If I was looking to scale up an existing operation considerably and minimize costs as much as possible, I'd consider spinning up e.g. a Postgres cluster or minio on their infra, which would be significantly cheaper than RDS or S3. But it's not something that I would gladly do---the storage deals provided by hyperscalers are quite reasonable, as you say.
Usually, your entities all have velocities, which you can use to extrapolate from the last simulated state to the current one (after less than dt has passed). For things like visual effects, you'd have to write a custom extrapolate implementation, which is not really different from the custom interpolate implementation you'd need for interpolation anyway.
This eliminates the lag issue, and at anywhere close to 60 FPS it looks perfectly fine. It will look strange at very low framerates, but at that point you can just automatically switch it off.
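A minimal sketch of what that looks like at render time (my own illustration; the struct and field names are made up):

    /* Sketch: fixed-timestep simulation with extrapolated rendering.
       alpha_dt is the wall-clock time elapsed since the last completed
       simulation step (always less than dt). */
    typedef struct { float x, y, vx, vy; } Entity;
    typedef struct { float x, y; } DrawPos;

    DrawPos extrapolate(const Entity *e, float alpha_dt) {
        /* Project the last simulated state forward by the leftover time.
           Only entities that actually get drawn are touched; no copy of
           the whole game state is needed. */
        DrawPos p = { e->x + e->vx * alpha_dt, e->y + e->vy * alpha_dt };
        return p;
    }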
You do need a way to extrapolate game state, which is slightly painful, but the author's proposed solution has big drawbacks (which he hints at). Since it touches all game state each frame (even though it's "just" a memcpy), it completely changes the performance characteristics of your main loop.
Without this, the complexity of your game step is linear in the number of updated or rendered entities, so you can add large amounts of additional state at any time, as long as only a small part of it will be visible/relevant to update each frame.
With the author's approach, your step complexity is linear in the state size. You basically have an additional write for all state, which gives you a very restrictive upper limit. It's not just AAA games - as soon as you add a particle system, you've created a great many entities which you now need to memcpy every frame.
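For contrast, here is my paraphrase of the snapshot-and-interpolate approach being discussed (not the author's actual code), just to show where the per-frame cost proportional to total state size comes from:

    #include <string.h>

    /* Keep a previous and a current copy of ALL game state so the renderer
       can blend between them. The memcpy touches every byte of state every
       frame, regardless of how many entities actually changed. */
    typedef struct {
        /* stand-in for all entities, particles, world state, ... */
        unsigned char blob[1 << 20];
    } GameState;

    void step(GameState *prev, GameState *curr, float dt) {
        memcpy(prev, curr, sizeof *curr);   /* O(total state size), every step */
        /* simulate(curr, dt); */           /* ...then advance only what changed */
        (void)dt;
    }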
The scalable solution to this complication is copy-on-write, which is more complexity... or, bite the bullet, write that extrapolation function, and enjoy your freedom to introduce crazy particles, physics, MMO world state, or whatever you want! At real-world framerates, it will look no different.
Extrapolation has some very serious drawbacks in action games which make it unusable, though. You cannot just move entities forward in time by velocity*delta, because that ignores all collisions with the world, so you end up with a lot of jitter and other ugly effects. And once you start adding support for things like collision while extrapolating, why not just tick the entire game? ;)
In a multiplayer scenario, you might not have enough information to tick the entire game. You'll probably want to extrapolate input from other users, at least.
On the contrary, in multiplayer, extrapolation is the last thing you want to do, because the time to correction is long and the prediction is really coarse. It essentially results in a lot of corrections applied late, which looks and feels awful.
Extrapolation is one of those ideas that’s not actually used in practice; at least I’ve yet to see it used in any game in any meaningful capacity.
It’s just far too complicated and requires custom logic while resulting in worse results than more straightforward options. Even for multiplayer games the “extrapolation” is often done by repeating input states and running the regular game loop.
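To be concrete, that "repeat the input" flavor looks roughly like this (a sketch with hypothetical names, not any particular engine's API):

    /* "Extrapolating" a remote player by re-running the normal game tick
       with their last known input held constant, rather than integrating
       velocity directly. */
    typedef struct { float move_x, move_y; int jump; } Input;
    typedef struct { float x, y, vx, vy; } PlayerState;

    /* The regular per-tick update, collisions and all (defined elsewhere). */
    void tick_player(PlayerState *p, const Input *in, float dt);

    void extrapolate_remote(PlayerState *p, const Input *last_known,
                            int missing_ticks, float dt) {
        for (int i = 0; i < missing_ticks; i++)
            tick_player(p, last_known, dt);  /* same code path as a real tick */
    }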
I also wouldn’t equate the interpolation approach with extrapolation. With interpolation you interpolate between two valid states. With extrapolation you produce a potentially invalid state (i.e. a character that’s inside a wall). The only workaround for the latter issue is to perform a full game tick, at which point you’re no longer doing extrapolation.
> Extrapolation is one of those ideas that’s not actually used in practice
This is how VR frame doubling works, no? "Timewarp"/"Spacewarp"
Also I would think that a lot of netcode would be considered extrapolation. You'd extrapolate a peer's input or velocity (and perhaps clean it up with further local simulation) and then deal with mis-prediction when changes are replicated.
For the former, Timewarp is used at an OS level to perturb the visibly rendered quad to match the headset orientation at display time. There’s no extrapolation: the rendered frame is simply adjusted to account for the change in headset orientation.
For the latter, as I mentioned, the extrapolation is not on velocity: you still compute regular game ticks but by holding the input constant. This is quite different from extrapolating velocities.
> For the latter, as I mentioned, the extrapolation is not on velocity: you still compute regular game ticks but by holding the input constant. This is quite different from extrapolating velocities.
Replicating velocity is fairly common. Unreal's character movement replicates velocity and not inputs. I would personally argue that even doing a full game tick with replicated velocities is extrapolation. I'm not sure what the distinction would be or what counts as a full tick with error correction vs local extrapolation per tick with error correction.
I agree: what’s the difference between error correction and a full tick? At what point do you draw the line on error correction?
Extrapolation is often used to mean extrapolating values without error correction, at which point the results are less than stellar.
Spacewarp is, like Timewarp, a way to hit the headset's display frame rate, but by warping the output image. I'll concede that this is technically extrapolation, but it's far from what's generally meant when describing updating entity values in game loops.
I suspect that plus vs minus is arbitrary in this case (as you said, due to being able to learn a simple negation during training), but they are presenting it in this way because it is more intuitive. Indeed, adding two sources that are noisy in the same way just doubles the noise, whereas subtracting cancels it out. It's how balanced audio cables work, for example.
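A toy numeric example of the balanced-cable idea (my own numbers, just to show the arithmetic):

    #include <stdio.h>

    /* Two copies of a signal pick up the same noise; summing straight
       copies doubles the noise along with the signal, while subtracting
       one inverted copy from the other cancels the noise entirely. */
    int main(void) {
        float signal = 0.5f, noise = 0.2f;   /* same noise couples into both wires */
        float hot  =  signal + noise;        /* carries +signal */
        float cold = -signal + noise;        /* carries inverted signal */

        float naive_sum  = (signal + noise) + (signal + noise); /* 2*S + 2*N */
        float difference = hot - cold;                          /* 2*S, noise gone */

        printf("naive sum:  %.2f\n", naive_sum);   /* 1.40 */
        printf("difference: %.2f\n", difference);  /* 1.00 */
        return 0;
    }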
But with noise cancelling headphones, we don't sum anything directly---we emit an inverted sound, and to the human ear, this sounds like a subtraction of the two signals. (Audio from the audio source, and noise from the microphone.)
Oh! It's been a good while since I've worked in noise cancelling. I didn't know current tech was at the point where we could directly reproduce the outside noise instead of just using mic arrays! That's very cool; it used to be considered totally sci-fi to do it fast enough in a small headset.
A bubble doesn't necessarily imply no underlying worth. The dot-com bubble hit legendary proportions, and the same underlying technology (the Internet) now underpins the whole civilization. There is clearly something there, but a bubble has inflated expectations beyond reason, and the deflation will not be kind to any player still left playing (in the sense of an AI winter), not even the actually-valuable companies that found profitable niches.
Joining the praise in this thread. Extremely reliable, fast, versatile, plays anything.
What I haven't seen mentioned is that it has 1- or 2-key keyboard shortcuts for almost everything, down to adjusting audio/video delay, subtitle size and offset, etc etc.
Once you've had the experience of adjusting the video just how you like it in 5 seconds with a series of keypresses without having to pause or disrupt the playback, you'll never want to go back.
You can also hold those keys to play at the speed set with [ and ], so you can actually play the video in reverse and in slow motion, or whatever. Be aware that this is usually very intensive on CPU time (and maybe GPU decoding, if applicable), since it usually has to go back to a whole keyframe and compute all the frames in between, which may result in less-than-smooth playback on some videos.
Reverse usually works significantly worse, at least with common video codecs that rely on keyframes and interframe prediction. Depending on your workflow, codecs that don't (i.e. where every frame is independently decodable) will actually let mpv play backwards at full speed (e.g. rawvideo, ffv1, magicyuv...).
Collision detection is usually a tree search, and this is a very branching workload. Meaning that by the time you reach the lowest nodes of the tree, your lanes will have diverged significantly and your parallelism will be reduced quite a bit. It would still be faster than CPU, but not enough to justify the added complexity. And the fact remains that you usually want the GPU free for your nice graphics. This is why in most AAA games, physics is CPU-only.
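To illustrate the divergence point, here's a simplified sketch of the kind of tree walk I mean (not any particular engine's code). Each query follows a data-dependent path, so neighboring GPU lanes quickly stop doing the same work:

    /* Simplified AABB-tree query: the early-out branch, the leaf test and
       the recursion depth all depend on the data, so adjacent lanes end up
       in different nodes with different iteration counts (divergence). */
    typedef struct Node {
        float lo[3], hi[3];          /* node bounding box */
        struct Node *left, *right;   /* both NULL at a leaf */
    } Node;

    static int overlaps(const Node *n, const float qlo[3], const float qhi[3]) {
        for (int i = 0; i < 3; i++)
            if (qhi[i] < n->lo[i] || qlo[i] > n->hi[i]) return 0;
        return 1;
    }

    int query(const Node *n, const float qlo[3], const float qhi[3]) {
        if (!n || !overlaps(n, qlo, qhi)) return 0;   /* early-out branch */
        if (!n->left && !n->right) return 1;          /* leaf hit */
        return query(n->left, qlo, qhi) || query(n->right, qlo, qhi);
    }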
It uses the very simple approach of testing every particle against EVERY other particle. Still very performant (the simulation is, anyway; the chosen canvas rendering is very slow).
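The all-pairs test is essentially this (my sketch, not the demo's actual code):

    /* All-pairs collision test: O(n^2) comparisons, no spatial structure. */
    typedef struct { float x, y, r; } Particle;

    void collide_all(Particle *p, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                float dx = p[j].x - p[i].x, dy = p[j].y - p[i].y;
                float rr = p[i].r + p[j].r;
                if (dx * dx + dy * dy < rr * rr) {
                    /* resolve_collision(&p[i], &p[j]); -- response omitted */
                }
            }
        }
    }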
I'm currently trying to do something like this, but optimised. With the naive approach here and Pixi instead of canvas, I get to 20,000 particles at 120 fps on an old laptop. I am curious how far I get with an optimized version. But yes, the danger is in the calculation and rendering blocking each other. So I have to use the CPU in a smart way, to limit the data being pushed to the GPU, and while I prepare the data on the CPU, the GPU can do the graphics rendering. Like I said, it is way harder to do it right this way. When the simulation behaves weirdly, debugging is a pain.
If you use WebGPU, then for your acceleration structure, try the algorithm presented in the Diligent Engine repo. It will allow you to avoid transferring data back and forth between CPU and GPU: https://github.com/DiligentGraphics/DiligentSamples/tree/mas...
Another reason I did it on the CPU was that with WebGL you lack certain things like atomics and groupshared memory, which you now have with WebGPU. For the Diligent Engine spatial hashing, atomics are required. I'm mainly using WebGL for compatibility: iOS Safari still doesn't enable WebGPU without special feature flags that the user has to turn on.
Thanks a lot, that is very interesting! I will check it out in detail.
But currently I will likely proceed with my approach where I do transfer data back and forth between CPU and GPU, so I can make use of the CPU to do all kinds of things. But my initial idea was also to keep it all on the GPU, I will see what works best.
And yes, I also would not recommend WebGPU currently for anything that needs to deploy soon to a wide audience. My project is intended as a long term experiment, so I can live with the limitations for now.
This is a 2D simulation with only self-collisions, and not collisions against external geometry. The author suggests a simulation time of 16ms for 14000 particles. State of the art physics engines can do several times more, on the CPU, in 3D, while colliding with complex geometry with hundreds of thousands of triangles. I understand this code is not optimized, but I'd say the workload is not really comparable enough to talk about the benefits of CPU vs GPU for this task.
The O(n^2) approach, I fear, cannot really scale to much beyond this number, and as soon as you introduce optimizations that make it less than O(n^2), you've introduced tree search or spatial caching that makes your single "core" (WG) per particle diverge.
"that make it less than O(n^2), you've introduced tree search or spatial caching that makes your single "core" (WG) per particle diverge"
Well, like I said, I try to use the CPU side to help with all that. So every particle on the GPU checks maybe the 20 particles around it for collision (and other reactions), and not all 14,000 like it does currently.
That should give a different result.
Once I'm done with this side project, I will post my results here. Maybe you are right and it will not work out, but I think I found a working compromise.
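For reference, the kind of neighbor lookup I have in mind is roughly this (a sketch with made-up names and sizes, not my actual code):

    /* 2D uniform-grid bucketing so each particle only checks its own cell
       and the 8 neighboring cells instead of all N particles. */
    #define GRID_W 128
    #define GRID_H 128
    #define MAX_PER_CELL 16

    typedef struct { float x, y; } Particle;
    typedef struct { int count; int idx[MAX_PER_CELL]; } Cell;

    static Cell grid[GRID_W][GRID_H];

    void insert(const Particle *p, int i, float cell_size) {
        int cx = (int)(p->x / cell_size), cy = (int)(p->y / cell_size);
        if (cx < 0 || cy < 0 || cx >= GRID_W || cy >= GRID_H) return;
        Cell *c = &grid[cx][cy];
        if (c->count < MAX_PER_CELL) c->idx[c->count++] = i;
    }

    /* Collect candidate indices near (x, y); returns how many were written. */
    int neighbors(float x, float y, float cell_size, int *out, int max_out) {
        int n = 0;
        int cx = (int)(x / cell_size), cy = (int)(y / cell_size);
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++) {
                int gx = cx + dx, gy = cy + dy;
                if (gx < 0 || gy < 0 || gx >= GRID_W || gy >= GRID_H) continue;
                const Cell *c = &grid[gx][gy];
                for (int k = 0; k < c->count && n < max_out; k++)
                    out[n++] = c->idx[k];
            }
        return n;
    }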
Yeah, pretty much this. I've experimented with putting it on the GPU a bit, and I would say the GPU version is about 3x faster than a multithreaded & SIMD CPU implementation, not 100x like you will see in Nvidia marketing materials; and on mobile, which this demo does run on, the GPU often becomes weaker than the CPU. Wasm SIMD is also only 4 wide, while 8 or 16 wide is standard on most CPUs today.
But yeah, once you need to do graphics on top, that 3x pretty much goes away and is just additional frametime. I think they should work together. On my desktop stuff, I also have things like adaptive resolution and sparse grids to more fully take advantage of things that the CPU can do that are harder on GPU.
The Wasm demo is still in its early stages. The particles are just simple points. I could definitely use the GPU a bit more to do lighting and shading a smooth liquid surface.
Agree with most of the comment; just to point out (I could be misremembering) that 4-wide SIMD ops that are close together often get pipelined "perfectly" onto the same vector unit that would be doing 8- or 16-wide SIMD, so the difference is often not as much as one would expect. (Still a speedup, though!)
Between the slowness of the editor and the "glue" work that needs to be done to even try out the GPT-generated code, the overall experience seems quite poor.
To me, it seems there are low-code platforms with better UX where doing this exercise, even without any AI assistance, would be just as easy and likely smoother.