
Software rendering simply stinks, doubly so if you're running a fancy composited desktop. Regardless of CPU speed or UI fluidity, your CPU and GPU processes are now fighting for the same cycles, bottlenecking one another in the worst way possible.

The M1 has a fairly good GPU, so there's hope that the battery life and overall experience will improve in the future. As of now though, I'd reckon there are dozens of x86 Linux laptops that can outlast the M1 on Linux. Pitting a recent Ryzen laptop against an Asahi MacBook isn't even a fair fight.



>Software rendering simply stinks, doubly so if you're running a fancy composited desktop. Regardless of CPU speed or UI fluidity, your CPU and GPU processes are now fighting for the same cycles, bottlenecking one another in the worst way possible.

And yet, after using the first release, people have reported that this is the smoothest they've ever seen a Linux desktop run. That is, smoother even than Intel Linux with hardware GPU acceleration.


I use llvmpipe currently and I can assure you that it is nowhere near as smooth as a setup with hardware acceleration. The 3900X is probably even faster than the M1 at software rendering, and it still isn't fast enough to give a consistent 60 fps with the browser using most of the screen (shrinking the browser window makes things much smoother).

Even extremely fast CPUs are really bad at pushing pixels compared to even the weakest GPUs. It is very much usable, though!

Currently llvmpipe can use up to 8 cores at a time, but no more, and from my understanding it does use SIMD instructions when available. There is another software renderer in Mesa that allegedly uses AVX instructions, but I have had a better experience with llvmpipe personally.
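
(If you want to try llvmpipe yourself, here's a minimal sketch using Mesa's documented environment variables; glxgears is just a stand-in for any GL client you have installed.)

    /* force_llvmpipe.c - run a GL app under Mesa's llvmpipe software rasterizer */
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        /* Ask Mesa to skip any hardware driver and use software rendering. */
        setenv("LIBGL_ALWAYS_SOFTWARE", "1", 1);
        /* Pick llvmpipe explicitly (as opposed to softpipe). */
        setenv("GALLIUM_DRIVER", "llvmpipe", 1);
        /* llvmpipe rasterizer thread count; 8 matches the cap mentioned above. */
        setenv("LP_NUM_THREADS", "8", 1);
        /* Any GL client works here; glxgears is just a convenient test. */
        execlp("glxgears", "glxgears", (char *)NULL);
        return 1; /* only reached if exec fails */
    }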


If your desktop idles at 60% CPU utilization, I should hope it's at least getting the frame timing right.


Where are you getting this 60% number from?


That number is absolute nonsense. Someone upthread posted it and it has no relation to reality.


llvmpipe and softpipe have been improved.


I'd hope that an idle desktop redraws ~nothing and so doesn't waste any CPU cycles. And the GPU not being used might even save power. So as long as it's idle it would ideally consume less power, not more.


CPUs use significantly more power than GPUs to perform the same amount of computation, because they're optimized for different workloads.

GPU input programs can be expensive to switch, because they're expected to change relatively rarely. The vast majority of computations are pure or mostly pure and are expected to be parallelized as part of the semantics.

Memory layouts are generally constrained to make tasks extremely local, with a lot less unpredictable memory access than a CPU needs to deal with: almost no pointer chasing, very little stack access, and most access to large arrays by explicit stride. Where there is unpredictable access, the expectation is that there is a ton of batched work of the same job type, so it's okay if memory access is slow; the latency can be hidden by switching between instances of the job really quickly (much faster than switching between OS threads, which can be totally different programs).

Branching is expected to be rare and not required to run efficiently, loops are generally assumed to terminate, there's almost no dynamic allocation, programs are expected to use lower-precision operations most of the time, and so on.

Being able to assume all these things about the target program allows for a quite different hardware design that's highly optimized for running GPU workloads. The vast majority of GPU silicon is devoted to super wide vector instructions, with large numbers of registers and hardware threads to ensure that they can stay constantly fed. Very little is spent on things like speculation, instruction decoding, branch prediction, massively out of order execution, and all the other goodies we've come to expect from CPUs to make our predominantly single threaded programs faster.

In other words, the reason GPUs end up being huge power drains isn't because they're energy inefficient (in most cases, anyway); it's because they can often achieve really high utilization for their target workloads, something that's extremely difficult to achieve on CPUs.
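
(To make the contrast concrete, here's a small C sketch, not from the thread, of the two workload shapes; saxpy and sum_list are just illustrative names.)

    /* workload_shapes.c - "GPU-shaped" vs "CPU-shaped" memory access */
    #include <stddef.h>
    #include <stdio.h>

    /* GPU-friendly shape: pure, branch-free, explicit stride, trivially parallel.
     * Every iteration is independent, so a GPU can run thousands of them at once
     * and hide memory latency by switching between them. */
    void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* CPU-shaped work: pointer chasing. Each load depends on the previous one,
     * so nothing can be parallelized and latency can't be hidden; this is where
     * big caches, prefetchers and out-of-order execution earn their keep. */
    struct node { struct node *next; int value; };

    int sum_list(const struct node *head) {
        int sum = 0;
        for (const struct node *p = head; p != NULL; p = p->next)
            sum += p->value;
        return sum;
    }

    int main(void) {
        float x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
        saxpy(y, x, 2.0f, 4);                              /* y = {2, 4, 6, 8} */

        struct node c = {NULL, 3}, b = {&c, 2}, a = {&b, 1};
        printf("%f %d\n", (double)y[0], sum_list(&a));     /* prints 2.000000 6 */
        return 0;
    }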


> it's because they can often achieve really high utilization for their target workloads, something that's extremely difficult to achieve on CPUs.

This part here, 100x. It's worth noting that the SIMD throughput of the M1's GPU at 3 W is probably better than that of the M1's CPU running at 15 W. That's simply because the GPU is specialized for that workload, and it's a necessary component of a functioning computer anyway (even on x86).

The particularly damning aspect here is that ARM is truly awful at GPU-style calculations. x86 is too, but most x86 CPUs ship with wide SIMD extensions that give the CPU its own, redundant form of hardware acceleration, so x86 can at least sorta hardware-accelerate a software-rendered desktop. ARM has to emulate that work using NEON, which yields truly pitiful results. The GPU is a critical piece of the M1 SoC, at least for full-resolution desktop usage.


You're not really responding to my argument? I was talking about an idle desktop where neither the CPU nor the GPU performs any work, since they don't have to redraw anything. With neither performing work, pure software rendering should let the GPU be turned off entirely rather than just put to sleep. Granted, these are mobile chips, so the power management is probably good and there isn't much of a difference between off and deep idle.


IIRC, modern graphics APIs pretty much require you to go through the GPU's present queue to update the screen, so the GPU likely has to be involved anyhow whenever a draw happens, whether or not it's actually drawn on the GPU. Given that, I'm not sure how you could turn the GPU off during CPU rendering except in circumstances when you would already have been able to turn it off with GPU rendering. But I am basing this on how APIs like Vulkan and Metal present themselves rather than the actual hardware, so maybe there's some direct CPU-rendering-to-screen API that they just don't expose.


On the M1 the framebuffer is a separate device and can be written to directly[0]. Whether that means the GPU proper can be powered down, I don't know.

[0] https://asahilinux.org/2021/08/progress-report-august-2021/#...
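
(Roughly what "written to directly" looks like in practice: a minimal fbdev sketch, assuming the kernel exposes a 32-bpp /dev/fb0 the way simpledrm's fbdev emulation does. Whether this actually lets the GPU power down is a separate question, as noted above.)

    /* fb_fill.c - fill the Linux framebuffer with a solid color, no GPU driver needed.
     * Assumes a 32-bpp /dev/fb0 (e.g. the simple framebuffer set up at boot). */
    #include <fcntl.h>
    #include <linux/fb.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/fb0", O_RDWR);
        if (fd < 0) { perror("open /dev/fb0"); return 1; }

        struct fb_var_screeninfo var;
        struct fb_fix_screeninfo fix;
        if (ioctl(fd, FBIOGET_VSCREENINFO, &var) || ioctl(fd, FBIOGET_FSCREENINFO, &fix)) {
            perror("ioctl"); return 1;
        }

        size_t size = (size_t)fix.line_length * var.yres;
        uint8_t *fb = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (fb == MAP_FAILED) { perror("mmap"); return 1; }

        /* Write pixels straight into the scanout buffer with the CPU. */
        for (uint32_t y = 0; y < var.yres; y++) {
            uint32_t *row = (uint32_t *)(fb + (size_t)y * fix.line_length);
            for (uint32_t x = 0; x < var.xres; x++)
                row[x] = 0x00336699; /* XRGB: a solid blue-ish fill */
        }

        munmap(fb, size);
        close(fd);
        return 0;
    }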


Interesting, didn't realize that. This explains some of the weirder present queue requirements, I guess (it doesn't really act like a regular queue). So maybe you really can power down the GPU. I still doubt it would be lower power overall, since IME my M1 GPU takes very little power when I'm not using it intensively, but it's at least possible.


If ARM had competitive SIMD performance, then we might be seeing an overall reduction in power usage. The base ARM ISA is excruciatingly bad at vectorized computation though, so eventual GPU support seems like a must-have to me.


In my experience the M1 does have competitive SIMD performance?

https://dougallj.wordpress.com/2022/04/01/converting-integer...

https://dougallj.wordpress.com/2022/05/22/faster-crc32-on-th...

https://lemire.me/blog/2020/12/13/arm-macbook-vs-intel-macbo... (I later optimised the slower benchmark in that post: https://github.com/simdjson/simdjson/pull/1708 )

Obviously the GPU will be better, but at one point I compared the M1 CPU to other ARM GPUs (in laptops at that time) and found it had both better memory bandwidth and compute throughput, which is quite funny.
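
(For a sense of what that CPU-side SIMD looks like: a minimal AArch64 NEON sketch that sums floats four lanes at a time. Respectable, but each instruction still only touches 128 bits, far narrower than a GPU's vector hardware.)

    /* neon_sum.c - sum an array 4 floats at a time with 128-bit NEON (AArch64). */
    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdio.h>

    float sum_f32(const float *x, size_t n) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        size_t i = 0;
        for (; i + 4 <= n; i += 4)              /* 4 lanes per iteration */
            acc = vaddq_f32(acc, vld1q_f32(x + i));
        float sum = vaddvq_f32(acc);            /* horizontal add of the 4 lanes */
        for (; i < n; i++)                      /* scalar tail */
            sum += x[i];
        return sum;
    }

    int main(void) {
        float data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        printf("%f\n", sum_f32(data, 10));      /* prints 55.000000 */
        return 0;
    }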


That could have been true many years ago, but not anymore. The GPU is way more efficient at putting many bitmaps in the right place in the output. Even mouse cursor compositing is hardware accelerated these days, because that's faster and more efficient. Doing it on the CPU is wasted power.



