> This also applies to compression and decompression performed by Archive Utility: for example, if you download a copy of Xcode in xip format, decompressing that takes a long time as the code is constrained to the E cores, and there’s no way to change that.
This is actually not correct. If you decompress something with Archive Utility, you'll see that it does spend some time on all the cores, although it definitely doesn't use them equally. It just doesn't spend very much time on them, because it doesn't parallelize tasks very well: most of the time it's effectively running on 1-2 threads with the other cores idle, which macOS spreads across the efficiency cores, perhaps bringing in one of the P-core clusters if it likes. For the very limited portions of the process where Archive Utility actually is computationally bottlenecked, AppleFSCompression farms the work out to all the cores. When writing unxip (https://github.com/saagarjha/unxip) I found that you will actually lose to Archive Utility if you don't use all the cores at least part of the way.
> There are two situations in which code appears to run exclusively on a single P core, though: during the boot process, before the kernel initialises and runs the other cores, code runs on just a single active P core.
I'm not 100% sure on this, but I believe the processor comes out of reset on an Icestorm core?
It's very annoying that of all the compression formats in common use today, it seems like only bzip2 parallelizes nicely. Some will use all of the cores for compression but only one for decompression, and others do all of the math on a single core.
For plain old zip this is excusable since it comes from the 80s, but modern compression utilities really should do better.
I believe zstd does parallel compression/decompression and has some of the best efficiency ratios.[1] I’ve been replacing gzip with zstd as a general purpose compression format.
With the right options, zstd can compress quite a bit better than bzip2 and gzip, albeit probably not as fast as gzip, because squeezing out that extra compression is a trade against speed.
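For reference, roughly what that looks like from Python with the third-party zstandard bindings (just a sketch; threads=-1 is the knob for multithreaded compression, and as far as I know decompression of a standard frame still runs on one thread):

```python
# Sketch using the third-party "zstandard" package (pip install zstandard).
# threads=-1 enables multithreaded compression across all logical CPUs;
# decompression of a normal zstd frame stays single-threaded, as far as I know.
import zstandard as zstd

def compress_file(src_path: str, dst_path: str, level: int = 19) -> None:
    cctx = zstd.ZstdCompressor(level=level, threads=-1)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        cctx.copy_stream(src, dst)

def decompress_file(src_path: str, dst_path: str) -> None:
    dctx = zstd.ZstdDecompressor()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dctx.copy_stream(src, dst)
```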
Which is really unfortunate, because bzip2 is not a very good compressor otherwise. It outperforms gzip on compression ratio, but is beaten by anything modern, and is extremely slow, even when parallelized.
A lot of modern compressors can be made parallelizable (and, as a bonus, seekable!) by breaking the input/output into chunks and compressing each chunk independently. This loses a bit of compression efficiency, but surprisingly little for sufficiently large blocks.
(This isn't what pigz does, by the way. pigz keeps state between compressed blocks, which prevents parallel decoding.)
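A toy version of the chunked scheme, just to make the idea concrete (the block size and length-prefixed framing here are arbitrary choices for the example, not any real format):

```python
# Toy chunked compressor: each block is compressed independently, so blocks can be
# handled in parallel and later decompressed (or skipped over) without touching the
# rest of the stream. 4 MiB blocks and the length-prefixed framing are arbitrary.
import struct
import zlib
from concurrent.futures import ProcessPoolExecutor

BLOCK_SIZE = 4 * 1024 * 1024

def compress_chunked(src_path: str, dst_path: str) -> None:
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        blocks = iter(lambda: src.read(BLOCK_SIZE), b"")
        # Note: map() submits everything up front, so this buffers the whole file;
        # a real tool would bound the number of in-flight blocks.
        with ProcessPoolExecutor() as pool:
            for compressed in pool.map(zlib.compress, blocks):
                dst.write(struct.pack("<I", len(compressed)))  # block length prefix
                dst.write(compressed)

def decompress_chunked(src_path: str, dst_path: str) -> None:
    # Shown sequentially for brevity; the same independence lets you fan this out
    # across workers, or skip blocks you don't need by following the length prefixes.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while header := src.read(4):
            (length,) = struct.unpack("<I", header)
            dst.write(zlib.decompress(src.read(length)))
```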
> For plain old zip this is excusable
Do you mean gzip? PKZIP decompression is actually somewhat parallelizable, because each file is compressed independently of the other files in the archive -- I'm not sure anyone actually does this, though.
Last time I looked, the information I found online said that due to the way PKZIP files are structured, it isn't possible to decompress them in parallel. But checking again, it seems that someone has done just that:
That information was incorrect. Any member in a PKZIP archive can be decompressed independently of the other members, which is what parzip does.
What you can't do is decompress a single archive member in parallel.
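To make the per-member parallelism concrete, here's a rough sketch using just the standard library (this is not how parzip does it internally, just an illustration):

```python
# Rough sketch: extract each zip member on its own thread. Members are compressed
# independently, so the workers don't need to coordinate; zlib releases the GIL
# while inflating, so plain threads get real parallelism here.
import zipfile
from concurrent.futures import ThreadPoolExecutor

def _extract_one(archive_path: str, name: str, dest_dir: str) -> None:
    # Each worker opens its own handle to sidestep any questions about sharing
    # one ZipFile object across threads.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extract(name, dest_dir)

def extract_parallel(archive_path: str, dest_dir: str, workers: int = 8) -> None:
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consuming the iterator surfaces any exceptions raised by the workers.
        list(pool.map(lambda name: _extract_one(archive_path, name, dest_dir), names))
```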
Switching from gzip->pigz really sped things up in one of our CI steps recently. It seemed like the output files were identical as well. What are the drawbacks?
That's odd. Zip and RAR, at least, have indexes that point to individual files in the archive, and it should be trivial to decompress files in a thread pool with as many cores as you want. That's not as easy when you have a tar.bz2 or tar.gz file, because you only see files as the .tar file is extracted.
Easy to say, hard to do. Particularly since most threading optimizations trade away some compression ratio - and people generally prefer smaller file sizes as their first-order preference.
> Counterintuitively, it seems that under very light load, things work well, and under very heavy load, things work well, but for medium loads, there is failure. Also counterintuitively, the newer M1 Pro and M1 Max CPUs, with more performance cores (6-8) and fewer efficiency cores (2), seem to have a larger "medium load" range where things don't work well.
This explains Justin Frankel's observations in the article above. It's a bit annoying that there seems to be no way to force a process to run only on the P cores with no throttling. Having mostly bought an M1 Mac for music production, I keep running into this exact issue. Light workloads cause more processing glitches than running loads of tracks and plugins at the same time.
Hopefully developers will eventually get more granular control of this for high-performance apps, battery life be damned.
> It's a bit annoying that there seems to be no way to force a process to run only on the P cores with no throttling.
I'm not sure about the throttling side of things as I've mostly worked with an M1 Mac mini, but Logic Pro has CPU thread settings to limit things to just the P cores. My guess is that it will take time for other DAWs to implement similar features.
Can't say I've had many issues performance-wise in Logic, but I've had some crashing with newer plugins like Korg's recent plugin version of the OP-Six, especially when memory use ramps up. Reported my findings back to Korg's tech support, but no solutions yet.
Anecdotally, Logic has consistently been the worst DAW in my testing, performance-wise, when it comes to hosting plugins. Maybe it's due to the AU format, but the de facto result is the same.
I run Ableton, FL, Pro Tools and Logic (mainly working in Ableton) and my goal is to record at 48 kHz with a buffer length of 32 samples, with very light use of plugins. In my testing with the exact same plugins, Logic would consistently have buffer overflows every minute or so.
Ableton ran far more stably (especially in Rosetta), but PT was the only DAW to do it without hiccups. My initial guess is that the AAX plugin format is more tightly coupled to its internal processing, improving performance. But I was quite shocked to find out that Logic (Apple's own software) performed the worst.
Plugin development in general is a shitshow where most devs seemingly only do enough to get things barely working, so I'm not surprised most of them misbehave on ARM. It's a bit sad that we're two years into the Apple Silicon transition and performance is still so rocky.
This is interesting to read - for me, Logic is the lightest on CPU of all the DAWs I use (Ableton 11, FL Studio, Reason 12, UAD Luna, Cubase 10.5) on my iMac Pro (base model 8-core Xeon with 32 GB RAM).
Ableton, while it is my main DAW and by far my favorite, is pretty CPU-heavy as of version 11 (10 was not quite as bad). The others are more or less similar in performance in terms of track count and latency. Ableton is completely usable, but I usually start having to freeze tracks as the track count hits ~24ish, depending on what plugins I'm using.
With Logic, I can usually get closer to 2x that amount, but I should probably do a more apples-to-apples comparison.
I do have my eyes on the Mac Studio as my next audio computer, but I'm hoping to stretch my iMac Pro's lifespan closer to the 10-year mark if I can...
For what it's worth, I've had the same experience on x86, with Logic being stable as a rock; it just seems to be an M1 thing. Without a doubt this can be sorted out down the road, but Apple's insistence on only using AU in Logic means that every plugin in Logic is going to take a performance hit, as nearly all devs write a VST version and wrap that in an AU.
More complexity = more room for errors, especially on a new CPU architecture.
corrscope is a PyQt app with a background thread that streams video frames to an ffplay process, which shows these frames in a video player window along with synchronized audio. During playback, ffplay is focused and corrscope lies in the background generating video frames.
> On M1 processors, if you open preview, after 40 or so seconds, the preview may suddenly slow down and audio will stop playing. This is because macOS thinks ffplay is the active app, and Corrscope is a background job burning CPU, so it moves Corrscope to an Efficiency core, slowing it down.
> There is no fix for this issue at the moment. As a workaround, you can click on Corrscope's window to avoid the slowdown, and drag it aside so it doesn't obstruct the preview.
Setting the background worker thread to QOS_CLASS_USER_INTERACTIVE does not prevent it from being moved to E-cores. This is a rather severe issue which damages the user experience, and I haven't found a workaround yet (outside of moving video playback in-process). How do Chrome and Firefox (and Safari I suppose) give their worker processes P-cores while users are interacting with the browser window process?
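For concreteness, this is roughly how the worker thread's QoS gets set from Python (a ctypes sketch, not the exact corrscope code; the constant value comes from <sys/qos.h>):

```python
# Sketch of raising the calling thread's QoS on macOS from Python via ctypes.
# Not the actual corrscope code; QOS_CLASS_USER_INTERACTIVE is 0x21 in <sys/qos.h>.
import ctypes

QOS_CLASS_USER_INTERACTIVE = 0x21

def promote_current_thread() -> None:
    libc = ctypes.CDLL(None)  # resolves against libSystem on macOS
    # int pthread_set_qos_class_self_np(qos_class_t qos_class, int relative_priority);
    err = libc.pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0)
    if err != 0:
        raise OSError(err, "pthread_set_qos_class_self_np failed")
```

Even with that, as soon as ffplay is frontmost, macOS treats corrscope as a background job and the thread still ends up on the E-cores.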
> How do Chrome and Firefox (and Safari I suppose) give their worker processes P-cores while users are interacting with the browser window process?
Generally if a UI process makes IPC calls out to a daemon or other process, the Mach kernel is supposed to keep track of prioritization and an on-behalf-of relationship for the daemon. That is, if the kernel already thinks the foreground app is P-core eligible, any work it initiates in other processes via IPC should be P-core eligible as well, through a “voucher” object that is passed between processes: https://opensource.apple.com/source/xnu/xnu-3789.41.3/osfmk/...
I'm not sure what mechanism you're using to put video playback out of process, but if there are any IPC calls involved, make sure they're originating from a UI task, and not the other way around? Hopefully this gives you something to go on.
The Python thread shares a process with the GUI thread. It actually pipes data to an ffmpeg process in between, to mix video frames and audio into a NUT stream, which gets piped to the ffplay GUI process. IIRC both processes are created by Python, but we only pipe data directly into ffmpeg.
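Roughly, the plumbing looks like the sketch below (simplified, with illustrative ffmpeg/ffplay arguments rather than corrscope's real ones):

```python
# Simplified sketch of the pipeline described above: Python renders raw frames and
# pipes them into ffmpeg, which muxes video + audio into a NUT stream that is piped
# on to the ffplay GUI process. Arguments are illustrative, not corrscope's real ones.
import subprocess

ffplay = subprocess.Popen(["ffplay", "-autoexit", "-"], stdin=subprocess.PIPE)
ffmpeg = subprocess.Popen(
    ["ffmpeg",
     "-f", "rawvideo", "-pix_fmt", "rgb24", "-s", "1280x720", "-r", "60", "-i", "-",
     "-i", "audio.wav",
     "-c:v", "rawvideo", "-c:a", "pcm_s16le", "-f", "nut", "-"],
    stdin=subprocess.PIPE,
    stdout=ffplay.stdin,
)

for frame in render_frames():   # render_frames() is a stand-in for the GUI's renderer
    ffmpeg.stdin.write(frame)   # one raw RGB frame per iteration

ffmpeg.stdin.close()
ffmpeg.wait()
ffplay.stdin.close()            # let ffplay see EOF once ffmpeg is done
ffplay.wait()
```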
Not sure why you're downvoted; you seem to be right! In Activity Monitor's Energy tab, I have to expand Terminal > login > zsh > python, but my python process is shown as App Nap: Yes.
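If App Nap really is what's kicking in here, the usual escape hatch is NSProcessInfo's activity API; here's a sketch via pyobjc (untested against this particular case, so treat it as a guess):

```python
# Sketch: opt the process out of App Nap with NSProcessInfo's activity API.
# Assumes the pyobjc bridge is installed; untested against this specific slowdown.
from Foundation import NSProcessInfo, NSActivityUserInitiated, NSActivityLatencyCritical

activity = NSProcessInfo.processInfo().beginActivityWithOptions_reason_(
    NSActivityUserInitiated | NSActivityLatencyCritical,
    "rendering preview frames",
)
# ... run the render/playback loop ...
NSProcessInfo.processInfo().endActivity_(activity)
```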
This seems weird or wrong. Given that the M1 Ultra is two M1 Max dies bolted together (apparently in a way that is somewhat unconventional, but in effect neat), I don't see how that would be done in hardware, or even why the authors would want it to work like that (it is my understanding that the whole reason for clusters is that the whole cluster shares the same clock, which is somewhat hard and pointless to do when half of it resides on a different die).
Maybe it is only a software abstraction and in HW there are two 2x Icestorm clusters?
From watching the Asahi IRC over time, this sounds right: 6 clusters total. Like you say, I wouldn't be surprised if macOS lumps them together from a scheduling perspective to keep low-performance scheduling (where going across the interconnect doesn't matter as much anyway) simpler.
I somehow assume that the interconnect between the M1 Ultra dies boils down to a bunch of AHB ports more or less directly connected together through the interposer (with only minimal pad logic). This seems consistent with both the performance profile of the thing and with macOS mostly ignoring the topology for scheduling purposes.
In my observations, macOS behaves very differently with nice than my Linux machines do. On my Linux laptop, I can run processor-heavy tasks with `nice -n 19` and browse the web as if the machine were doing nothing else. On my (x86) Mac, if I do that, the machine crawls to a halt, as if the OS were ignoring the nice setting. Maybe there is some connection between this and the elaborate scheduling model - on Linux I'd expect processes with lower nice levels to get assigned the beefier cores, bumping nicer ones to the smaller cores, and that would take care of everything.
Most likely some unsolved priority inversion bug doing that. Linux actually bumps up the priority of "nice" tasks when something else is waiting on them, which goes a long way to address this issue.
It's not the nice task that halts - on the Mac, it's everything else, as if the task I started with nice is allowed to hog all resources and isn't bumped out of the CPUs when, say, the browser needs them.
This is a cool exploration of these new systems. I wonder if some of this scheduling functionality is exposed in the Apache-licensed Linux version of Grand Central Dispatch? Or is it architecturally incompatible with how tasks and processes are scheduled on Linux?
Additionally, I wonder why Apple chose to change the number of efficiency cores between the M1 and the M1 Pro, M1 Max, and M1 Ultra?
I use big.LITTLE on Linux, so I might be able to shed some light on the situation here:
Technology like Thread Director is simply not ready on Linux yet. Period. You're not going to get intelligent process management à la the M1 and macOS quite yet, mostly because Intel hasn't quite finished rolling out Alder Lake to the server market. It'll take some time, but I fully expect us to get there.
The flip side of that coin, however, is that the core management is already quite good. Intel had previously merged some basic code for managing heterogeneous systems during the Rocket Lake rollout (IIRC), and it does a pretty good job of keeping you on the E-cores most of the time. As far as I can tell, its original intention was to get the kernel to run processes on lower-binned cores until it needed to turbo, at which point it would heavily prioritize the faster ones. This helped distribute the workload and save some power, but it also came in clutch for Linux users with Alder Lake. So long as the same code works on ARM, I can imagine the situation will be quite good. Most people would probably never notice the difference.
Apple is the undisputed king of performance per watt, but they need to keep up in single-core performance. Sure, Intel draws a ton of power, but most tasks we do day to day are single-threaded, and the Intel i9-12900K is already 15-20% faster than the M1. The M2 will have a modest single-core boost (10%), but the next i9-13900K will be 15-20% faster than the 12900K and increase Intel's lead even further.
I think that's pretty poor speculation. We know nothing about M2 and the i9-13900K yet. On top of that, responsible use of energy is exactly where we should be heading, not absolute single threaded performance.
Also if you ever owned a top end Intel MBP you'll know how much pain Intel can inflict on you.
> On top of that, responsible use of energy is exactly where we should be heading,
I never really got this argument, especially when you're talking about a laptop that will happily draw 40W+ if you crank the display brightness. Arguing about the merits of a 7W CPU vs a 15W one sounds like people are missing the forest for the trees.
40W is way high. https://www.notebookcheck.net/Dell-XPS-15-7590-OLED-Power-Co... shows 20W for max brightness on a laptop OLED display (or 6W at minimum brightness). For 13-inch laptops, it's around 25% lower (less area). Also, laptop CPUs can pretty easily use 30W (the i7-1185G7 is an ultrabook-class CPU and can be configured up to 28W).
Edit: was your 40W for total power consumption? If so, then going from a 20W CPU to a 10W CPU is still a 25% energy consumption reduction (as well as a cooler lap and better battery life).
Macs have long reserved ~50W of power for the display; whether or not all of it is used is a different question. Giving them the benefit of the doubt (and accounting for screens that can max out at 1500 nits), I think an upper bound of 40W is pretty close to the actual figure. OLED will always pull less power since it doesn't need a backlight, in contrast to the hundreds of backlight LEDs on newer Macs.
The MacBook Pro has a 70-watt-hour battery and a 10-hour battery life (video playback). Also, if I'm reading correctly, the 1000-nit brightness is only achievable when plugged in, and Apple automatically reduces brightness on battery to keep the power consumption lower.
The so-called breakdown of Dennard scaling happened a little over 10 years ago. Since that time, CPUs have been forced to keep portions of the die powered off at any given time to keep from melting themselves. Each new generation requires a larger and larger area to be shut off.
Responsible use of energy directly leads to higher performance. You can also restate this: software is a performance bottleneck in a modern CPU.
We still haven't seen a desktop-class Apple M-series CPU.
And you know, maybe the 12900K is 15-20% faster in single core, but come on - it's a 250W TDP CPU vs. something that doesn't even need active cooling.
My i7-12700K idles at ~21°C and, much like the M1 series, struggles to pass 50°C unless you're running a Cinebench loop. YMMV, but I think x86 has life in it yet if Intel can get these results with silicon that's less than half as dense.
> My i7-12700K idles at ~21°C and, much like the M1 series
Yea, but unlike the M1 series, you probably have a big hunk of copper on top of your CPU. It really isn't a fair comparison. Intel needs a miracle. Actually, we all need a miracle, because Apple just changed the whole industry and I really hope they get some competition.
Intel just needs to not be using their podunk 10nm lithography they've been touting for the past 5 years or so. They plan to beat Apple's density by 2024, which will raise some interesting questions about how much efficiency they can recoup at this point. I don't think it's fair to claim anyone as the victor yet; we'll simply have to wait and see whether the reports of x86's death were grossly overstated or not.
... and the fact that cooling works well enough to keep your CPU at 21°C doesn't mean the heat isn't being dumped into your room, which drives up A/C costs in warm weather.