Did anyone actually look at the machine code generated here? 0.30 ns per value? That is basically 1 cycle. Of course, there is no way a processor can compute that many dependent instructions in one cycle: they turn into a chain of dependent micro-ops, and each micro-op takes at least one cycle to go through an execution unit. So this must mean that either the compiler is unrolling the (benchmarking) loop, or the processor is speculating many loop iterations into the future, so that the latencies can be overlapped and it works out to 1 cycle on average. 1 cycle on average for any kind of loop is just flat out suspicious.
This requires a lot more digging to understand.
Simply put, I don't accept the hastily arrived-at conclusion, and wish Daniel would put more effort into investigation in the future. This experiment is a poor example of how to investigate performance on small kernels. You should be looking at the assembly code output by the compiler at this point instead of spitballing.
A C for statement is a "benchmarking loop" in the same sense that "slice a sponge cake into two layers and place custard in the middle" is an actionable dessert recipe.
Failing to post disassembly for a micro benchmark is annoying.
It is of course speculating all the way through the loop; a short backwards conditional branch will be speculated as "taken" by even very simple predictors.
Op fusion is very likely, as is register renaming: I suspect that "mul" always computes the full 128-bit product, and the upper half is left in a register that isn't architecturally visible until the programmer issues "mulh" with the same operands, at which point it's just renamed into the target register.
The loop-carried dependency chain is just state += 0x60bee2bee120fc15ull (wyrng) or state += UINT64_C(0x9E3779B97F4A7C15) (splitmix64); the rest of each iteration's calculations are independent of the previous iteration.
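For reference, here is roughly what the splitmix64 step looks like (constants from the usual public-domain implementation, so treat this as a sketch rather than Lemire's exact benchmark code); only the first line carries anything into the next iteration:

    #include <stdint.h>

    // splitmix64: the only loop-carried dependency is the state increment.
    // All of the mixing below starts from the freshly incremented state,
    // so iteration N+1's mixing doesn't wait on iteration N's mixing.
    static uint64_t splitmix64(uint64_t *state) {
        uint64_t z = (*state += UINT64_C(0x9E3779B97F4A7C15));
        z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
        return z ^ (z >> 31);
    }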
Anyway, the more important fact is that a 64x64 -> 128-bit multiply may be a single instruction on x86, but it's broken into 2 µops, because modern CPUs generally aren't designed to let a single µop write two registers in the same register file.
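To make that concrete, the portable way to ask for the full product in C looks something like this (a sketch; the exact codegen depends on compiler and target):

    #include <stdint.h>

    // Full 64x64 -> 128-bit multiply. On x86-64 this typically compiles to a
    // single MUL/MULX writing a register pair (split into 2 µops internally);
    // on AArch64 it becomes the MUL + UMULH pair discussed in this thread.
    static void mul64_full(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
        unsigned __int128 p = (unsigned __int128)a * b;
        *lo = (uint64_t)p;
        *hi = (uint64_t)(p >> 64);
    }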
It's a shame we can't see the rest of the code. What happens to the result value? Is it compared to something, put into an array, or what? All of that code probably outweighs what you pointed out here, or at least it should. I have a bad feeling it might be getting dead-code eliminated, since compilers are super aggressive about that nowadays, but I hope he's somehow controlled for that.
Just staring at the machine code, it looks like the hottest loop for wyrng is about 10 instructions with a store in it. If the processor can do that loop in 1 cycle on average then...holy fuck.
edit: I was looking at similar code generated by clang on my machine. Again, holy fuck.
I don't think the story here is that 64x64=128 multiply is fast, honestly. The real story is the insane level of speculation and huge ROB that is necessary to make that non-unrolled loop go so fast. Everything has to go pretty much perfect to achieve that throughput.
Purely based on the number of uops, there's a slight win for wyhash, all other things being equal. However, I doubt that you're really getting one iteration per cycle here; there are 6 integer units, and even if you perfectly exploited instruction parallelism you're limited to 6 ALU instructions per cycle, which is fewer than either loop contains. It would be possible if the mul-umulh pairs are getting fused, which would bring it down to 8 uops per iteration.
Taking into account the port distribution, each iteration of wyhash involves 4 uops being dispatched to ports 5 and 6, which means you should be getting at least 2 cycles/iteration purely for the multiplications. If it's much lower than that, the whole multiplication being fused into a single port 5-6 uop might be right.
However I can neither confirm nor deny that the loops behave like that on the M1, as I don't have one.
I think you are right that mul/mulh are fused. I think the M1 has 128-bit ALUs in the vector unit, so this would be a good way to make use of them. The M1 is far from the first iteration of the architecture, and Apple has likely picked most if not all of the low-hanging fruit. It also helps x86 emulation, I guess.
edit: but see the comment elsewhere in the thread about the loop iteration time being off by a factor of 2.
Oh yeah, I thought the add r,r,2 was odd but didn't investigate. This brings things back to ~2+ cycles per iteration, which strictly speaking does not require fusion.
It would be easier to test this explicitly instead of inside some unrelated RNG.
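Something along these lines (untested; AArch64 with GCC/Clang inline asm, arbitrary loop count and constants) would isolate the mul/umulh pair from the rest of the RNG as a latency-bound chain:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t a = UINT64_C(0x9E3779B97F4A7C15);
        uint64_t x = 1, lo = 0;
        // Back-to-back mul/umulh with identical source operands: the pattern a
        // fusing microarchitecture would presumably look for. The umulh result
        // feeds the next iteration, so this measures the latency of the chain;
        // swap the two instructions, or insert fillers between them, to probe
        // whether ordering or adjacency matters.
        for (long i = 0; i < 1000000000L; i++) {
            __asm__ volatile(
                "mul   %0, %1, %2\n\t"
                "umulh %1, %1, %2\n\t"
                : "=&r"(lo), "+r"(x)
                : "r"(a));
        }
        printf("%llx %llx\n", (unsigned long long)x, (unsigned long long)lo);
        return 0;
    }

Time the whole loop once, divide by the iteration count, and compare against the ~3.2 GHz performance-core clock to get an estimate of cycles per pair.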
Perhaps there was a misunderstanding when executives talked to the silicon engineers about "the unbelievable speculation for our first in-house desktop-class CPU"?
I would be wary of using gettimeofday to measure such short periods. As https://pubs.opengroup.org/onlinepubs/009604599/functions/ge... says, the resolution of the system clock is unspecified, and 20,000 × ≈10 ≈ 200k instructions easily run in under a millisecond on modern hardware.
The benchmark probably gets rid of that by doing it 40,000 times in quick succession, but why not measure the time of all 40,000 iterations in one go, and decrease the risk (and the overhead of calling gettimeofday 39,999 times)?
The resolution of the clock is unspecified by the POSIX standard, but that does not mean it's unspecified by the platform this is actually running on. If this were trying to be portable code you'd have a big issue there, but it's not. It is still limited by gettimeofday maxing out at microsecond precision, though, which is quite poor. And it seems to be using a realtime clock, which introduces errors from network time sync and such. That's unlikely to crop up here, but it's still a risk. clock_gettime_nsec_np(CLOCK_MONOTONIC) is what should be used here.
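A minimal sketch of that (macOS-specific API; kernel_once is a hypothetical stand-in for the RNG under test):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static uint64_t state = 1;

    // Hypothetical stand-in for the RNG under test.
    static uint64_t kernel_once(void) {
        state += UINT64_C(0x9E3779B97F4A7C15);
        return state;
    }

    int main(void) {
        const uint64_t N = 40000;
        uint64_t sink = 0;
        // Time the whole batch once with a monotonic nanosecond clock instead
        // of bracketing every iteration with gettimeofday.
        uint64_t t0 = clock_gettime_nsec_np(CLOCK_MONOTONIC);
        for (uint64_t i = 0; i < N; i++)
            sink ^= kernel_once();
        uint64_t t1 = clock_gettime_nsec_np(CLOCK_MONOTONIC);
        printf("%.3f ns/call (sink=%llx)\n",
               (double)(t1 - t0) / (double)N, (unsigned long long)sink);
        return 0;
    }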
Agreed, you probably want to just read some CPU timestamp counter before and after, and make sure you are pinning your threads, locking the clocks, etc., so that you can get a reliable time from there, or... just use cycles as your unit of measure.
On Linux and macOS, gettimeofday is accurate to microseconds. Not only is the returned struct expressed in microseconds, but I have personally observed that it is accurate to microseconds.
The conclusion seems based on the relative execution times for the two benchmarks. Since the benchmarks are measured in the same way, their error bars should be basically the same as well. This analysis is not an analysis of the absolute execution time of these algorithms, but the difference between them.
I don't think the conclusion is hasty. Lemire is saying: "look, if the M1 full multiplication was slow, we'd expect wyrng to be worse than splitmix, but it isn't".
> Lemire is saying: "look, if the M1 full multiplication was slow, we'd expect wyrng to be worse than splitmix, but it isn't".
But that doesn't follow either. Only by inspecting the machine code do we get to see what's really going on in a loop, and the end result depends on a lot of factors: whether the compiler unrolled the loop (here: no), whether there were any spills in the loop (here: no), the length of the longest dependency chain in the loop, how many micro-ops the loop decodes into, how many execution ports the processor has and of what type, the frontend decode bandwidth (M1: seems up to 5 ins/cycle), whether there is a loop stream buffer (M1: seems not, but most Intel processors, yes), the latency of L1 cache, how many loads/stores can be in flight, etc, etc. These are the things you gotta look at to know the real answer.
At that throughput the CPU is speculating and exploiting the access pattern.
It's also worth saying that if Apple were dead set on throughput in this area they could've implemented some non-trivial fusion to improve performance. I don't have an M1 so I can't find out for you (and Apple are steadfast on not documenting anything about the microarchitecture...)
"If both the high and low bits of the same product are required, then the recommended code sequence is [...]. Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies."
For the loop core extracted from `clang -O3 -march=znver3` (using trunk (5a8d5a2859d9bb056083b343588a2d87622e76a2)), llvm-mca reports:

    Iterations:        10000
    Instructions:      100000
    Total Cycles:      25011
    Total uOps:        100000

    Dispatch Width:    4
    uOps Per Cycle:    4.00
    IPC:               4.00
    Block RThroughput: 2.5

    No resource or data dependency bottlenecks discovered.

which to me seems like 2.5 cycles per iteration (on Zen3). Tigerlake is a bit worse, at about 3 cycles per iteration, due to running more uOps per iteration, by the looks of it.
The multiplier in the M1 can be pipelined or replicated (or both), so issuing two instructions can be as fast as issuing one.
Instruction-recognition logic (DAG analysis, BTW) is harder to implement than a pipelined multiplier. The former is a research project, while the latter was done at the dawn of computing.
I wonder if order matters? That is, would mul followed by mulh be the same speed as mulh followed by mul?
How about if there is an instruction between them that does not do arithmetic? (What I'm wondering here is whether the processor recognizes the specific two-instruction sequence, or whether it's something more general, like mul internally producing the full 128 bits, returning the lower 64, and caching the upper 64 bits somewhere, so that if a mulh comes along before something overwrites that cache it can use it.)
It seems like something that would be arbitrary depending on how the optimization was implemented. There wouldn't be an inherent need for that amount of generalization. Apple can tightly control their compiler to follow the rules, and there seemingly wouldn't be any compelling reason not to stick those two instructions back to back in a consistent order, since the second instruction is effectively free.
It would be fun to experiment with, for someone who has the hardware. My guess is that swapping the order will make it slower, but adding an independent instruction or two between them probably won't have a measurable effect. It would also be fun to try to consistently interrupt the CPU between the two instructions somehow, to see if that short-circuits the optimization.
I love my M1, but does anyone else have horrific performance when resuming from wake? It’s like it swaps everything to disk and takes a full minute to come back to life.
Yes. This is actually a known issue, provided you have an external monitor attached; lots of people complaining about it. The Mac actually wakes up instantly if you lift the screen, but it usually takes 5-10 seconds before it will wake up the external monitor.
Worse, for some of us when it does finally wake up the monitor, sometimes it wakes it up with all the wrong colors, and rebooting is the only reliable fix. (and before anyone asks, yes, I tried a different HDMI cable)
For me it's maybe 5-10 seconds for it to wake my Thunderbolt monitor (LG 4K) from "cold".
It's much faster if the monitor has been used recently, though, so I always figured it was the monitor that was causing the delay by going into some deep sleep state?
It seems a little faster, but not dramatically so for me, when I'm hot switching between different inputs on the monitor. Definitely I think some of it is just the monitor not being really fast about switching, but I use the same monitor with a brand new 16 inch MBP and it's much faster at triggering the monitor to wake up.
Same monitor. The workaround is to put the monitor on a power strip you can trigger when you walk up to it. Hard power off/on of this monitor and it instantly displays for me now.
The sound quality is indeed pretty bad on the LG 4K. Noticeably worse than the MacBook Air's built in speakers.
It's just "cheap speakers" bad, though, not anything that would suggest an issue with the sound output from the M1. I've used the LG with a few different Macs, and sound quality is the same from any of them.
I think they are still working through some external monitor driver issues. The colors on my LG 4k were initially way off until I did an advanced calibration. Occasionally waking up from sleep it will revert, but opening the Display preference and swapping between the calibrations fixes it. My connection is through usbc.
I don't have any performance issues waking up though.
I think there are a couple issues going on with the colors. For some people it is calibration. But what I experience is almost like a color inversion (but it's not a full inversion, it looks like maybe one or two channels got inverted). Makes it difficult to even find the mouse pointer so I can get to the menu and reboot the machine. Then it comes up fine.
Interesting. I did submit a bug to Apple about my issue. Mainly because I never did a calibration with my 2017 mbp and the monitor looked great. With the M1 MBA I had to do the advanced calibration just to make the same monitor usable. Otherwise the colors were completely washed out. It almost seemed like the setting to auto-dim the laptop monitor was also being oddly applied to the external.
This happens to me too! Exactly as described. I have a 16GB Mac mini connected to an external monitor with an HDMI cable. A bit irritating, but not a huge deal.
I don't do any color work on it, so the occasional color issues are not a problem for me.
Yeah, I would like to use DP, for sure. A peculiarity of my situation and the devices I need to connect means I can't make that work with just two ports on the laptop. So I use the Apple USB-C PD/HDMI/USB adapter so I can get the ports I need and still charge.
I'm kinda limited in my options, because this MBP only has two USB ports. So I have one port which goes to a USB hub, and the other port goes to an Apple A/V adapter with USB-C power delivery pass-thru, HDMI, and a USB port. If not for my need to have one USB port be switchable, and the rest not, I'd use a USB-C -> DP cable like I use on my 16" MBP.
This may prompt me to upgrade prematurely, if/when the next M1 MBP comes out with more than two ports.
I have the OWC TB dock that was just released (and has been backordered). It works great other than the headphone jack has a slight hiss that I don't get plugging directly into the MBA.
Instant wake for me. However any time I come across a password field in a website the computer freezes for a painfully long 10 seconds or so while it presumably decrypts my password vault.
Sometimes this will happen multiple times per page load if I deselect and reselect the password field.
Oh goodness I thought this was just me after searching online and finding very little, if any, discussion of the bug. The issue I had wasn't on password login prompts, but on account creation prompts (i.e. password+confirm password). I had assumed it was lastpass at first, but the freeze persisted even after removing that -- on a one-day old computer.
I gave up on reporting bugs to Apple a while ago. The experience, repeated several times over, of spending all the time to file a radar, gathering all the logs and files that Apple insists on (probably reasonably), then submitting the thing and getting exactly zero feedback for the next year, at which point you are told to repeat all the same work on a new macOS version or the radar will be automatically closed, is demoralizing. Mind you, the issues I reported were fixed or went away a few years after that (so clearly not as a result of the then-long-closed ticket), but the process just feels like a waste of time to interact with the black box of Apple.
It helps more than nothing. I would recommend just becoming more casual about it.
My experience maintaining projects is actually not that people don't provide enough info in their bugs (certainly true for random forum rants, though), but that when they try, they try too hard and end up spending a long time writing a bunch of stuff I don't even read, because the log speaks for itself.
In this case you're not necessarily reaching an engineer; it could go to someone who combines reports together, or the fix happens because of your report but that never gets communicated back properly. It's still letting someone know it's a problem.
This happens to me too, I’m using 1Password. I have a suspicion the plug-in is involved with this but I’ve not had time to collect evidence yet. Seems to be rare though, I’ve not spotted anybody else with this issue.
I am seeing lots of spinning balls with 1Password on my 2015 Macbook Pro. Even just switching between fields with the keyboard brings up a spinner for two seconds or so. It's a recent thing I think. No other app has any issues.
I just updated to version 7.8 from version 7.7 and it looks fine so far. I haven't seen any freezes in the five minutes that I have been using it.
This makes me realise how much of a pain it has been to use over the past few weeks. Now that it is back to its normal snappy self, it is such a pleasure to work with again.
I don't use any external password manager, only Keychain and Safari, and I have seen this 10-15 second delay every time I add a new password to Keychain, since day 1.
Maybe the time is spent hashing your password. That is designed to take as long as it can while still being reasonable; 1 second on a fast machine isn't unheard of.
Perhaps it could be a two-level thing then, where you first decrypt e.g. the list of keys, and then read the one you want and get a "start/end" offset to read/decrypt just the value for that key from another file. So keys and values are still encrypted, but you don't have to decrypt the whole bundle in one step to get the value.
Detecting the monitors, negotiating the correct resolutions, setting scaling factors and window positions after coming out of sleep will take some time.
Maybe macOS does monitor setup sequentially? (no idea, just got 1 big external screen that also takes a few seconds to light up - handshake speed seems to vary between monitor brands)
This is almost certainly it - the screens show all sorts of weird graphical artifacts (some clearly a Retina display in "native" resolution) as it starts and loads. I assume it's having to fire up all the GPU memory, etc.
This is anecdotal and I don't have anything to prove it, but I really feel like my old spinning-hard-drive 2010 MacBook Pro woke faster from sleep running Snow Leopard, than the Retina models ever did (or the old ones did after a few software updates).
Of course for general tasks it was slower, but I really remember that thing waking up instantly when I raised the lid, every time.
My nostalgia agrees with you, but I think it’s probably wrong. The wake from sleep was what finally convinced me to get my first Mac. Even now they have the best hibernation and wake from sleep.
Yeah. To me it looks like macOS goes so deep into sleep it disconnects the external display. On wake, the system rediscovers the external and resizes the desktop across both displays. With a bunch of apps/windows open, half your apps simultaneously resizing all their windows can peg all CPU cores for a number of seconds.
(It's still way faster than the same set of apps on an Intel Mac laptop, where it could sometimes take on the order of 30 seconds to get to a usable desktop after a long sleep. On Intel Macs it seemed more obvious that the GPU was the bottleneck)
I am using the LG Ultrafine 5K (so it’s a TB monitor) and it takes maybe 1.5s to 5s longer than the built in display (which wakes instantly) to turn on.
I do occasionally have an issue where the brightness on the built in display is borked and won’t adjust back to the correct level for anywhere between 30s to a few minutes.
And then I don’t know if it’s my monitor or the M1, but sometimes there will be a messed up run of consecutive pixel columns about 1/10th of the screen wide starting about 30% from the left of the display. The entire screen in that region is shifted a few pixels upwards. Sometimes it’s hard to notice it but once you do it can’t be unseen. Replugging the monitor into the M1 resolves the issue.
It's way faster than my old MacBook Pro. But one issue is that the external monitor won't come back on upon resuming. I have to unplug and replug the cable to reactivate the external monitor. It seems the HDMI handshake fails somehow.
I haven't noticed poor wake times, but my laptop does kernel panic and reboot a fair amount. Maybe 4 times in the past week. My hunch is that it's Spotify's fault but I haven't dug into the logs.
xnu will panic if it doesn't receive periodic check-ins from userspace. For example if WindowServer hangs, then the kernel may deliberately panic so that the system reboots. See man watchdogd for (a tiny bit) more.
I have this problem, but only if I've been plugged into a monitor and then unplugged and gone onto battery. Rebooting after unplugging stops it, but it's annoying.
I've had mine panic and reboot twice, both times happened shortly after disconnecting other Macs that were connected via a Thunderbolt cable (target disk mode).
Definitely no issue with waking my M1 8GB MacBook Air. Takes a fraction of a second, every time. In fact, this is specifically something that Apple were bragging about when they launched the M1 Macs!
So this is why the integer multiply-accumulate instruction, mullah, only delivers the most significant bits? Ironic, if you aren't religious about these things.
I believe the ARMv8 NEON crypto extensions have a special instruction for a 64-bit multiply with a 128-bit product, which is useful for Monero mining, for example.
It has a bit less gain on a RISC due to the code density (or lack thereof), since it requires more fetch bandwidth. Apple works around this by using a very wide front-end: https://news.ycombinator.com/item?id=25257932
Not in the next 10 years. Why would they? Fabs are extremely capital-intensive and take years to get up and running, even when someone (like Taiwan Semi) knows how to do it. Intel has shown how hard it can be to do this right. Let TSM work on production (and hopefully get more and larger fabs in the USA up and running) and on getting better at packing in the transistors, and let Apple improve the design (and software).
They’ve vertically integrated everything else, and they’ve had great success along the way. TSMC has other customers that compete with Apple for production capacity. And there’s geopolitical risk in the region where TSMC (currently) operates.
> They’ve vertically integrated everything else, and they’ve had great success along the way.
Why would they want to get into the low-margin, high-risk part of their supply chain, the bit where you can sink billions of dollars and have the value wiped out by a poor choice?
Or, that's exactly the reason to get involved in it. You don't want to rely on suppliers that are at high risk to implement trade barriers or worse. Apple has been moving production away from China over the past couple of years, no doubt in part due to risk (an Apple spokesperson has specifically cited risk as a reason). They've recently moved some production to suppliers in Vietnam, India, and even some has moved to the US.
Why would they make their own M and A processors? They could continue to buy Intel chips and go back to using PortalPlayer and Samsung chips. Chip design is capital-intensive and takes years to get up and running, when Intel and Samsung already know how to do it.
But Apple has a lot of capital, and could win massive political brownie points for doing so, especially if they promised that some percentage of fab capacity would be sold to other American firms.
An ally that China has been dreaming of re-assimilating (or taking over, depending on which side of the argument you're on) since its inception.
China is engaging in ever more aggressive saber rattling, and the total lack of any measurable reaction to their takeover of Hong Kong has only emboldened them. Who can guarantee Taiwan won't meet the same fate?
I am very aware of the situation, being a resident of Hong Kong, and it's completely different. Hong Kong is indisputably a part of the People's Republic of China, and the Basic Law (our Constitution) is part of the constitution of the PRC. The government of Hong Kong has always emphasised that Hong Kong is part of One Country.
Taiwan is completely self-governed at the moment and sees itself as an independent nation.
Besides the 9-figure capital costs and the multi-year start-up, if you want cutting-edge fab technology the pool of talent you have to recruit from is pretty small, and most of them are outside of the United States.
By buying up the newest nodes, TSMC already gives them quasi-monopoly status. They would need to buy TSMC, but Taiwan may not want to sell it, as it's their most important company by far.
TSMC already announced a new fab in Arizona that will start delivering chips ~2024. Apple does not need to own a fab, they just need to diversify their chip supply chain.
Without a major geopolitical change, like escalation of US-China tensions, sanctions on SE Asia or straight up war - likely never. Despite every incentive in place (tax breaks, consumer goodwill, better accessibility/control), what is the last domestic electronics manufacturing success story?
Manufacturing capacity will shift to South Korea, Japan, Malaysia, Singapore, India etc. before the US if that happens. The labor and expertise needed to set up such an operation is just not available here anymore.
In the past, haven’t they made exclusive deals with manufacturers by helping them spin up new factories in exchange for exclusive access for X number of years?
IIRC, Apple basically made "retina displays" happen - when nobody else would make that expensive step - by telling a screen manufacturer "here is a very large check; build a factory to make these at commodity scale, within one year."
With Apple holding a lot of cash offshore awaiting a favourable way to onshore it, combined with the political eagerness to bring chip production onshore, those two aspects may well pan out to a situation in which the accountants see it as a win-win.
Even then, does Apple use enough chips to justify running a fab, let alone one that would be locked into the node of the time? I really don't see it happening, for many reasons, and the only reason they would is some tax-break incentive to onshore some of the money they hold offshore, such that it pays for itself, win or fail.
> With Apple holding a lot of cash offshore awaiting for a favourable way to onshore it
I don't think that's relevant anymore. My understanding is that the 2017 TCJA required prior unrepatriated earnings to be recognized and taxed over eight years (so still ongoing) and future foreign earnings not subject to US tax (except if the foreign tax is below the corporate alternative minimum tax rate). As a result of those changes, there's no need to hold cash offshore.
I can absolutely see both parties wrapping a flag around an investment tax credit for building a fab in the US. To the person saying taxes don't matter because it's expensive: the more capital cost the better, if they can use it to offset taxes.
That's great if your app is compute-bound. "May all your processes be compute-bound." Back in the real world, most of the time your process will be I/O-bound. I think that's the real innovation of the M1 chip.
Exactly because of the "real world" argument: it turns out that a lot of actual real-world loads are CPU-bound because they are so wastefully implemented. I/O of all kinds has extremely high bandwidth these days, and OoO execution helps hide the latency.
Important to clarify this every time it comes up: there is no on-die memory on the M1. It is normal, everyday LPDDR4X memory that happens to sit on the same package, right next to the processor die. It's actually quite high latency at ~100ns.
This isn't entirely accurate. The Air starts at 1129 EUR in Germany, presumably it's the same in the rest of Europe. That's 1344 USD right now, but that's not an entirely fair comparison since this price is after tax whereas the USD prices are generally before tax. Before taxes, the Air is actually 1129 USD in Germany (current exchange rate and taxes cancel each other out). More than a 1000, but significantly less than 1500.
I'm sorry it's more expensive for you; I just wanted to point out that there are locations outside the US, such as parts of mainland Europe, where this claim also does not hold.
I just checked and it appears that notebooksbilliger.de sells the Air M1 starting at 1057 EUR and is willing to ship to Croatia for 30 EUR. If you were interested, maybe that's a better alternative.
Edit: Amazon.de charges 1079 EUR and seems happy enough to ship to consumers in Croatia as well for around 14 EUR. I haven't tried completing an order, obviously, but there are no relevant restrictions listed.