I get it. This frustrated me to no end. But I still did what I had to do: recompiled random software throughout the stack, enabled random flags, etc. It was doable, and now I can do it much faster. I don't think it's fair for upstream to disable a useful optimization just so I don't have to do this additional work to fix and optimize my system.
Doing real-world, whole-system profiling, we've found performance was affected by completely unexpected software running on the system. Recompiling the entire distribution, or even just the subset of it that's actually installed, is not realistic for most people. Besides, I have measured the overhead of frame pointers and it's less than 1%, so there's not really any trade-off here.
Anyway, soon we'll have SFrame support in the userspace tools and the whole issue will go away.
In one of my jobs, a 1% perf regression (on a more stable/reproducible system, not PCs) was a reason for a customer raising a ticket, and we'd have to look into it. For dynamically dispatched but short functions, the overhead is easily more than 1% too. So, there is a trade-off, just not one that affects you.
I think it comes down to numbers. What are most installed systems used for? Do more than 50% of installed systems need to profile all binaries, all the time, such that everything has to be built this way up front, rather than identifying and preparing the relevant binaries ahead of time?
If so, then it should be the default.
If it's a close call, then there should be two versions of the ISO and repos.
As many developers and service operators as there are, and even though everyone on this page is one, including both you and me, I still do not believe the profiling use case is the majority use case.
The way I am trying to judge "majority" is: Pick a binary at random from a distribution. Now imagine all running instances of that binary everywhere. How many of those instances need to be profiled? Is it really most of them?
So it's not just an unsympathetic "F developers'/services' problems" attitude. I am one myself.
Everyone benefits from the net performance wins that come from an ecosystem where everyone can easily profile things. I have no doubt that works out to more than a 1% lifetime improvement. Same reason you log stuff on your servers. 99.9% pure overhead, never even seen by a human. Slows stuff down, even causes uptime issues sometimes from bugs or full discs. It's still worthwhile though because occasionally it makes fixes or enhancements possible that are so much larger than the cost of the observability.
I don't see how this applies. Some shell has to be the default one, and all systems don't pick the same one even. Most systems don't install a compiler by default. Thank you for making my point?
All these things are possible to do, even though only developers need them. Why shouldn’t the same be true for useful profiling abilities? Because of the 1-2% penalty?
Visa sells money for money, skimming off a percentage.
CPUs spend cycles for features (doing useful work). Enabling frame pointers skims off a percentage of those cycles. But it's the impact on useful work that matters, not how many cycles you lose; the cycles are just a means to an end. So x% of cycles is fundamentally incomparable to x% of money.
The whole point of an analogy is to expose a blind spot by showing the same thing in some other context where it is recognized or perceived differently.
There are no performance winners if you include them by default. There will always be some overhead (>0%): you're executing additional code in every prologue and epilogue, and increasing register pressure by preventing rbp from ever being allocated.
There are only "winners" in the sense that people will be able to more easily see why their never-tuned system is so slow. On the other hand, you're punishing all perf-critical usecases with unnecessary overhead.
I believe if you have a slow system, it's up to you to profile and optimize it, and that includes even recompiling some software with different flags to enable profiling. It's not the job of upstream to make this easier for you if it means punishing those workloads where teams have diligently profiled and optimized through the years so that there is no, as the author says, low-hanging fruit to find.
I’ve been around long enough to have had frame pointers pretty ubiquitously, then lost them, and now to be getting them back again. The dark times in the middle were painful. For the software I’ve worked on, the easy dynamic profiling that frame pointers enable (e.g. using DTrace) has given far more in performance wins than omitting them would have. (Part of my beef with the article is that while edge cases do break some samples, in practice it’s a very small fraction, and almost by definition not the important ones if you’re trying to find heavy on-CPU code paths.)
I get that some use cases may be better without frame pointers. A well-resourced team can always recompile the world, whichever the default is. It’s just that my experience is that most software is not already perfectly tuned and I’d much rather the default be more easily observable.
Look, it's likely we just come from different backgrounds. Most of my perf-sensitive work was optimizing inner loops with SIMD, allowing the compiler to inline hot functions, creating better data structures to make use of the CPU cache, etc. Frame pointer prologue overhead was measurable on most of our use cases. I have less experience profiling systems where calls trace across multiple processes, so maybe I haven't felt this pain enough. Though I still think the onus should be on teams to be able to comfortably recompile, if not the world, then some part of it. After all, a lot of tuning can only be done through compile flags, such as turning off codepaths/capabilities which are unnecessary.
I wasn't exaggerating about recompiling the world, though. Even if we say I'm only interested in profiling my application, a single library compiled without frame pointers makes useless any samples where code in that library was at the top of the stack. I've seen that be libc, openssl, some random Node module or JNI thing, etc. You can't just throw out those samples because they might still be your application's problem. For me in those situations, I would have needed to recompile most of the packages we got from both the OS distro and the supplemental package repo.
My experience is on performance tuning the other side you mention. Cross-application, cross-library, whole-system, daemons, etc. Basically, "the whole OS as it's shipped to users".
For my case, I need the whole system set up correctly before it even starts to be useful. For your case, you only need the specific library or application compiled correctly; the rest of the system is negligible and probably not even used. Who would optimize SIMD routines next to function calls anyway?
I would probably host even some business-critical services on Hetzner's infra. I'm thinking of "worker"-type workloads, where each machine is 100% stateless and just serves to do some compute-intensive work. With that configuration, single-node data loss doesn't really affect you, and the CPU is plentiful and cheap with Hetzner bare metal (e.g. AX101 AMD machines).
Yeah, but where would you store state? The hyperscalers give you pretty reasonable durable storage (even datalakes). Most people don’t get storage tiering or using PaaS for workers, though.
I'd shy away from storing any non-volatile state on Hetzner. As I said, I'd mostly consider it for stateless compute-bound applications.
If I was looking to scale up an existing operation considerably and minimize costs as much as possible, I'd consider spinning up e.g. a Postgres cluster or minio on their infra, which would be significantly cheaper than RDS or S3. But it's not something that I would gladly do---the storage deals provided by hyperscalers are quite reasonable, as you say.
Usually, your entities all have velocities, which you can use to extrapolate from the last simulated state to the current one (after less than dt has passed). For things like visual effects, you'd have to write a custom extrapolate implementation, which is not really different from the custom interpolate implementation you'd need for interpolation anyway.
This eliminates the lag issue, and at anywhere close to 60 FPS it looks perfectly fine. It will look strange at very low framerates, but at that point you can just automatically switch it off.
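A minimal sketch of what that looks like at render time (my own illustration; the struct and field names are made up):

    /* Sketch: fixed-timestep simulation with extrapolated rendering.
       alpha_dt is the wall-clock time elapsed since the last completed
       simulation step (always less than dt). */
    typedef struct { float x, y, vx, vy; } Entity;
    typedef struct { float x, y; } DrawPos;

    DrawPos extrapolate(const Entity *e, float alpha_dt) {
        /* Project the last simulated state forward by the leftover time.
           Only entities that actually get drawn are touched; no copy of
           the whole game state is needed. */
        DrawPos p = { e->x + e->vx * alpha_dt, e->y + e->vy * alpha_dt };
        return p;
    }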
You do need a way to extrapolate game state, which is slightly painful, but the author's proposed solution has big drawbacks (which he hints at). Since it touches all game state each frame (even though it's "just" a memcpy), it completely changes the performance characteristics of your main loop.
Without this, the complexity of your game step is linear in the number of updated or rendered entities, so you can add large amounts of additional state at any time, as long as only a small part of it will be visible/relevant to update each frame.
With the author's approach, your step complexity is linear in the state size. You basically have an additional write for all state, which gives you a very restrictive upper limit. It's not just AAA games - as soon as you add a particle system, you've created a great many entities which you now need to memcpy every frame.
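For contrast, here is my paraphrase of the snapshot-and-interpolate approach being discussed (not the author's actual code), just to show where the per-frame cost proportional to total state size comes from:

    #include <string.h>

    /* Keep a previous and a current copy of ALL game state so the renderer
       can blend between them. The memcpy touches every byte of state every
       frame, regardless of how many entities actually changed. */
    typedef struct {
        /* stand-in for all entities, particles, world state, ... */
        unsigned char blob[1 << 20];
    } GameState;

    void step(GameState *prev, GameState *curr, float dt) {
        memcpy(prev, curr, sizeof *curr);   /* O(total state size), every step */
        /* simulate(curr, dt); */           /* ...then advance only what changed */
        (void)dt;
    }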
The scalable solution to this complication is copy-on-write, which is more complexity... or, bite the bullet, write that extrapolation function, and enjoy your freedom to introduce crazy particles, physics, MMO world state, or whatever you want! At real-world framerates, it will look no different.
Extrapolation has some very serious drawbacks in action games which make it unusable, though. You cannot just move entities forward in time by velocity*delta, because that ignores all collisions with the world, so you end up with a lot of jitter and other ugly effects. And once you start adding support for things like collision while extrapolating, why not just tick the entire game? ;)
In a multiplayer scenario, you might not have enough information to tick the entire game. You'll probably want to extrapolate input from other users, at least.
On the contrary, in multiplayer, extrapolation is the last thing you want to do, because the time to correction is long and the prediction is really coarse. It essentially results in a lot of corrections applied late, which looks and feels awful.
Extrapolation is one of those ideas that’s not actually used in practice; at least I’ve yet to see it used in any game in any meaningful capacity.
It’s just far too complicated and requires custom logic while resulting in worse results than more straightforward options. Even for multiplayer games the “extrapolation” is often done by repeating input states and running the regular game loop.
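To be concrete, that "repeat the input" flavor looks roughly like this (a sketch with hypothetical names, not any particular engine's API):

    /* "Extrapolating" a remote player by re-running the normal game tick
       with their last known input held constant, rather than integrating
       velocity directly. */
    typedef struct { float move_x, move_y; int jump; } Input;
    typedef struct { float x, y, vx, vy; } PlayerState;

    /* The regular per-tick update, collisions and all (defined elsewhere). */
    void tick_player(PlayerState *p, const Input *in, float dt);

    void extrapolate_remote(PlayerState *p, const Input *last_known,
                            int missing_ticks, float dt) {
        for (int i = 0; i < missing_ticks; i++)
            tick_player(p, last_known, dt);  /* same code path as a real tick */
    }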
I also wouldn’t equate the interpolation approach with extrapolation. With interpolation you interpolate between two valid states. With extrapolation you produce a potentially invalid state (i.e. a character that’s inside a wall). The only workaround for the latter issue is to perform a full game tick, at which point you’re no longer doing extrapolation.
> Extrapolation is one of those ideas that’s not actually used in practice
This is how VR frame doubling works, no? "Timewarp"/"Spacewarp"
Also I would think that a lot of netcode would be considered extrapolation. You'd extrapolate a peer's input or velocity (and perhaps clean it up with further local simulation) and then deal with mis-prediction when changes are replicated.
For the former, Timewarp is used at an OS level to perturb the visibly rendered quad to match the headset orientation at display time. There’s no extrapolation: the rendered frame is simply adjusted to account for the change in headset orientation.
For the latter, as I mentioned, the extrapolation is not on velocity: you still compute regular game ticks but by holding the input constant. This is quite different from extrapolating velocities.
> For the latter, as I mentioned, the extrapolation is not on velocity: you still compute regular game ticks but by holding the input constant. This is quite different from extrapolating velocities.
Replicating velocity is fairly common. Unreal's character movement replicates velocity and not inputs. I would personally argue that even doing a full game tick with replicated velocities is extrapolation. I'm not sure what the distinction would be or what counts as a full tick with error correction vs local extrapolation per tick with error correction.
I agree: what’s the difference between error correction and a full tick? At what point do you draw the line on error correction?
Extrapolation is often used to mean extrapolating values without error correction, at which point the results are less than stellar.
Spacewarp is, like Timewarp, a way to hit the headset's display frame rate, but by warping the output image. I'll concede that this is technically extrapolation, but it's far from what's generally meant when describing updating entity values in game loops.
I suspect that plus vs minus is arbitrary in this case (as you said, due to being able to learn a simple negation during training), but they are presenting it in this way because it is more intuitive. Indeed, adding two sources that are noisy in the same way just doubles the noise, whereas subtracting cancels it out. It's how balanced audio cables work, for example.
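A toy numeric example of the balanced-cable idea (my own numbers, just to show the arithmetic):

    #include <stdio.h>

    /* Two copies of a signal pick up the same noise; summing straight
       copies doubles the noise along with the signal, while subtracting
       one inverted copy from the other cancels the noise entirely. */
    int main(void) {
        float signal = 0.5f, noise = 0.2f;   /* same noise couples into both wires */
        float hot  =  signal + noise;        /* carries +signal */
        float cold = -signal + noise;        /* carries inverted signal */

        float naive_sum  = (signal + noise) + (signal + noise); /* 2*S + 2*N */
        float difference = hot - cold;                          /* 2*S, noise gone */

        printf("naive sum:  %.2f\n", naive_sum);   /* 1.40 */
        printf("difference: %.2f\n", difference);  /* 1.00 */
        return 0;
    }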
But with noise cancelling headphones, we don't sum anything directly---we emit an inverted sound, and to the human ear, this sounds like a subtraction of the two signals. (Audio from the audio source, and noise from the microphone.)
Oh! It's been a good while since I've worked in noise cancelling. I didn't know current tech was at the point where we could directly reproduce the outside noise instead of just using mic arrays! That's very cool; it used to be considered totally sci-fi to do it fast enough in a small headset.
A bubble doesn't necessarily imply no underlying worth. The dot-com bubble hit legendary proportions, and the same underlying technology (the Internet) now underpins the whole civilization. There is clearly something there, but a bubble has inflated expectations beyond reason, and the deflation will not be kind to any player still left playing (in the sense of an AI winter), not even the actually-valuable companies that found profitable niches.
Joining the praise in this thread. Extremely reliable, fast, versatile, plays anything.
What I haven't seen mentioned is that it has 1- or 2-key keyboard shortcuts for almost everything, down to adjusting audio/video delay, subtitle size and offset, etc etc.
Once you've had the experience of adjusting the video just how you like it in 5 seconds with a series of keypresses without having to pause or disrupt the playback, you'll never want to go back.
You can also hold those keys to play at the speed set with [ and ], so you can actually play the video in reverse and in slow motion, or whatever. Be aware that this is usually very intensive on CPU time (and maybe GPU decoding, if applicable), since it usually has to go back to a whole keyframe and compute all the frames in between, which may result in less-than-smooth playback on some videos.
Reverse usually works significantly worse, at least with common video codecs that rely on keyframes and interframe prediction. Depending on your workflow, codecs that don't (i.e. where every frame is independently decodable) will actually let mpv play backwards at full speed (e.g. rawvideo, ffv1, magicyuv...).
Collision detection is usually a tree search, and this is a very branching workload. Meaning that by the time you reach the lowest nodes of the tree, your lanes will have diverged significantly and your parallelism will be reduced quite a bit. It would still be faster than CPU, but not enough to justify the added complexity. And the fact remains that you usually want the GPU free for your nice graphics. This is why in most AAA games, physics is CPU-only.
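To illustrate the divergence point, here's a simplified sketch of the kind of tree walk I mean (not any particular engine's code). Each query follows a data-dependent path, so neighboring GPU lanes quickly stop doing the same work:

    /* Simplified AABB-tree query: the early-out branch, the leaf test and
       the recursion depth all depend on the data, so adjacent lanes end up
       in different nodes with different iteration counts (divergence). */
    typedef struct Node {
        float lo[3], hi[3];          /* node bounding box */
        struct Node *left, *right;   /* both NULL at a leaf */
    } Node;

    static int overlaps(const Node *n, const float qlo[3], const float qhi[3]) {
        for (int i = 0; i < 3; i++)
            if (qhi[i] < n->lo[i] || qlo[i] > n->hi[i]) return 0;
        return 1;
    }

    int query(const Node *n, const float qlo[3], const float qhi[3]) {
        if (!n || !overlaps(n, qlo, qhi)) return 0;   /* early-out branch */
        if (!n->left && !n->right) return 1;          /* leaf hit */
        return query(n->left, qlo, qhi) || query(n->right, qlo, qhi);
    }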
It uses the very simple approach of testing every particle against EVERY other particle. Still very performant (the simulation is, anyway; the chosen canvas rendering is very slow).
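The all-pairs test is essentially this (my sketch, not the demo's actual code):

    /* All-pairs collision test: O(n^2) comparisons, no spatial structure. */
    typedef struct { float x, y, r; } Particle;

    void collide_all(Particle *p, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                float dx = p[j].x - p[i].x, dy = p[j].y - p[i].y;
                float rr = p[i].r + p[j].r;
                if (dx * dx + dy * dy < rr * rr) {
                    /* resolve_collision(&p[i], &p[j]); -- response omitted */
                }
            }
        }
    }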
I'm currently trying to do something like this, but optimised. With the naive approach here and Pixi instead of canvas, I get to 20,000 particles at 120 fps on an old laptop. I am curious how far I get with an optimized version. But yes, the danger is in the calculation and rendering blocking each other. So I have to use the CPU in a smart way, to limit the data being pushed to the GPU, and while I prepare the data on the CPU, the GPU can do the graphics rendering. Like I said, it is way harder to do it right this way. When the simulation behaves weirdly, debugging is a pain.
If you use WebGPU, then for your acceleration structure, try the algorithm presented in the Diligent Engine repo. It will allow you to avoid transferring data back and forth between CPU and GPU: https://github.com/DiligentGraphics/DiligentSamples/tree/mas...
Another reason I did it on the CPU was that with WebGL you lack certain things like atomics and groupshared memory, which you now have with WebGPU. For the Diligent Engine spatial hashing, atomics are required. I'm mainly using WebGL for compatibility: iOS Safari still doesn't enable WebGPU without special feature flags that the user has to turn on.
Thanks a lot, that is very interesting! I will check it out in detail.
But currently I will likely proceed with my approach where I do transfer data back and forth between CPU and GPU, so I can make use of the CPU to do all kinds of things. But my initial idea was also to keep it all on the GPU, I will see what works best.
And yes, I also would not recommend WebGPU currently for anything that needs to deploy soon to a wide audience. My project is intended as a long term experiment, so I can live with the limitations for now.
This is a 2D simulation with only self-collisions, and not collisions against external geometry. The author suggests a simulation time of 16ms for 14000 particles. State of the art physics engines can do several times more, on the CPU, in 3D, while colliding with complex geometry with hundreds of thousands of triangles. I understand this code is not optimized, but I'd say the workload is not really comparable enough to talk about the benefits of CPU vs GPU for this task.
The O(n^2) approach, I fear, cannot really scale to much beyond this number, and as soon as you introduce optimizations that make it less than O(n^2), you've introduced tree search or spatial caching that makes your single "core" (WG) per particle diverge.
"that make it less than O(n^2), you've introduced tree search or spatial caching that makes your single "core" (WG) per particle diverge"
Well, like I said, I try to use the CPU side to help with all that. So every particle on the GPU checks maybe the 20 particles around it for collision (and other reactions), and not all 14,000 like it does currently.
That should give a different result.
Once I'm done with this side project, I will post my results here. Maybe you are right and it will not work out, but I think I found a working compromise.
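For reference, the kind of neighbor lookup I have in mind is roughly this (a sketch with made-up names and sizes, not my actual code):

    /* 2D uniform-grid bucketing so each particle only checks its own cell
       and the 8 neighboring cells instead of all N particles. */
    #define GRID_W 128
    #define GRID_H 128
    #define MAX_PER_CELL 16

    typedef struct { float x, y; } Particle;
    typedef struct { int count; int idx[MAX_PER_CELL]; } Cell;

    static Cell grid[GRID_W][GRID_H];

    void insert(const Particle *p, int i, float cell_size) {
        int cx = (int)(p->x / cell_size), cy = (int)(p->y / cell_size);
        if (cx < 0 || cy < 0 || cx >= GRID_W || cy >= GRID_H) return;
        Cell *c = &grid[cx][cy];
        if (c->count < MAX_PER_CELL) c->idx[c->count++] = i;
    }

    /* Collect candidate indices near (x, y); returns how many were written. */
    int neighbors(float x, float y, float cell_size, int *out, int max_out) {
        int n = 0;
        int cx = (int)(x / cell_size), cy = (int)(y / cell_size);
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++) {
                int gx = cx + dx, gy = cy + dy;
                if (gx < 0 || gy < 0 || gx >= GRID_W || gy >= GRID_H) continue;
                const Cell *c = &grid[gx][gy];
                for (int k = 0; k < c->count && n < max_out; k++)
                    out[n++] = c->idx[k];
            }
        return n;
    }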
Yeah, pretty much this. I've experimented with putting it on the GPU a bit, and I would say the GPU version is about 3x faster than a multithreaded & SIMD CPU implementation, not 100x like you will see in Nvidia marketing materials; and on mobile, which this demo does run on, the GPU often becomes weaker than the CPU. Wasm SIMD is also only 4 wide, while 8 or 16 wide is standard on most CPUs today.
But yeah, once you need to do graphics on top, that 3x pretty much goes away and is just additional frametime. I think they should work together. On my desktop stuff, I also have things like adaptive resolution and sparse grids to more fully take advantage of things that the CPU can do that are harder on GPU.
The Wasm demo is still in its early stages. The particles are just simple points. I could definitely use the GPU a bit more to do lighting and shading a smooth liquid surface.
Agree with most of the comment; just to point out (I could be misremembering) that 4-wide SIMD ops that are close together often get pipelined "perfectly" onto the same vector unit that would be doing 8- or 16-wide SIMD, so the difference is often not as much as one would expect. (Still a speedup, though!)
Between the slowness of the editor and the "glue" work that needs to be done to even try out the GPT-generated code, the overall experience seems quite poor.
To me, it seems there are low-code platforms with better UX where doing this exercise, even without any AI assistance, would be just as easy and likely smoother.