It's faster because it has more microarchitectural resources. It can load and store more, and it can do with a single core what an Intel part needs all cores to accomplish.
Why is that strategy simultaneously remarkably efficient and remarkably high-performance? What enabled/led them to make those choices where others haven't?
The reason I think so is that when I was at Google, it was 7 years between when we told Intel what could be helpful and when they shipped hardware with the feature. Also, when AMD first shipped EPYC "Naples", it was crippled by some key uarch weaknesses that anyone could have pointed out if they had been able to simulate realistic large programs instead of tiny and irrelevant SPEC benchmarks. If Apple is able to simulate or measure their own key workloads and get the improvements into silicon in a year or two, they have a gigantic advantage over anyone else.
One early benchmark showed allocating and destroying an NSObject performing drastically better on the M1 vs recent Intel Macs. This wasn’t an accident. It’s probably not representative of performance overall. They have enough vertical integration to make their own first party solutions clear optimization targets.
All it takes is looking at what's going on in commonly used code, deciding to optimize for X, Y, and Z, and committing to it. If Intel isn't doing this already, that's entirely the fault of current management for not making it a priority.
The only way that Apple's vertical integration helped them make that management decision is that they were able to say "our customer is a typical laptop user." Intel tries to cater to much larger markets, so perhaps when management goes to plan a laptop chip, they are less aggressive with deciding to optimize. But I have a feeling that Apple's optimizations are generally good for nearly all code, not just for specific use cases.
Don't discount how a weak organization can make poor decisions even when all necessary information seems to be readily available.
They could run their benchmarks on the Debian repository and it would probably be representative enough.
If they haven't done so, perhaps because it seemed tedious or unimportant, well, that mistake would be on them.
Even if you know about them, you need an expert in each to truly push the hardware to the real limits that get hit in the respective industry. The small differences between real implementations and simulated loads can drastically alter the performance characteristics and cause proc manufacturers to miss the mark.
In short, Apple has a number of 10x engineers that they move around to whatever project needs the most help, whether that is hardware, or application software, or operating system software, or services infrastructure, or whatever.
If some project starts getting enough negative “above the fold” coverage, then they will be temporarily gifted one or more of these special engineers.
Do that enough times, and those 10x engineers will gain enough experience in enough different areas that they will be able to reason well about the other end of whatever pipeline they’re on, and will know other 10x engineers that they can work with on that other end of the pipeline, because they had previously worked with them on some other project in the past months or years.
And those 10x engineers really will make a huge difference in what that project is capable of delivering.
The key failure of this operating mode is that most of the 10x engineers never get enough time to transfer much knowledge or skills to the others on the temporary team they are currently working with, and so things will start slowly deteriorating when they are necessarily moved on to the next project.
Rinse and repeat.
Also we see not the first chip but the first one that met their needs (demonstrably better performance on their workloads).
By which I mean: presumably macOS has been running on many generations of A-series processors, so they have had a lot of time to figure out which tweaks would be good and which turn out to be pessimizations or overkill. It doesn't hurt that there is significant internal overlap between modern macOS and iOS.
Microsoft proved with the XBOX and Surface series they can make good hardware if they want, now they need to move to chip design.
Microsoft has at least one homegrown processor that it has ported Windows and Linux to with the confusingly named 'Edge'.
Microsoft doesn't even _need_ to target Arm, they could easily team up with AMD or go the whole thing solo and target anything from RISC-V, Arm to an in-house ISA.
What was the last version of Windows to support either of these platforms?
Could that possibly be approved by governments?
I also wonder if having the hardware and software both worked on in-house is an advantage. I mean, if you're developing power management software for a mobile OS, and you're using a 3rd-party vendor, then you read the documentation, and work with the vendor if you have questions. If it's all internal, you call them, and could make suggestions on future processor design too based on OS usage statistics and metrics.
The things people complain about:
(a) keeping a walled garden,
(b) moving fast and taking the platform in new directions all at once,
(c) controlling the whole stack
Which means they're not beholden to compatibility with third party frameworks and big players, or with their own past, and thus can rely on their APIs, third party devs etc, to cater to their changes to the architecture.
And they're not chained to the whims of the CPU vendor (as the OS vendor) or the OS vendor (as the CPU vendor) either, as they serve the role of both.
And of course they benchmarked and profiled the hell out of actual systems.
A "walled garden" is when there is a single source of software.
According to docs, enabling bitcode: "Includes bitcode which allows the App Store to compile your app optimized for the target devices and operating system versions, and may recompile it later to take advantage of specific hardware, software, or compiler changes."
It seems quite likely they have (and probably used) the capability to recompile any app on their platform to benchmark real workloads against prototype silicon changes.
It's enough that enough of them are.
Plus, even if most aren't, the "short tail" people use these days probably are.
macOS promotes the App Store as the source of software (even if it's not the sole), and has walls like notarization requirements and the Gatekeeper to prevent weeds from intruding.
With the App Store, Apple knew there's a pool of N apps that follows its guidelines, has passed internal checks for API use, and can be converted quite easily to a different architecture - a pool it could count on.
Their control over the platform allowed them to enforce Metal and deprecate OpenGL pronto, to add a new combined iOS/macOS UI libs, to introduce Marzipan.
They have also added stuff like Universal Binary support, and most importantly Bitcode, which abstracts away parts of the underlying architecture.
All of those were steps towards ARM/M1 (and future developments), and all were enabled by Apple's control of the hardware, the software, and - sure, partial - control of third party apps.
They do. MacOS isn't a walled garden.
> They're anti-competitive
Have you heard of this little company from Washington called Microsoft? They have something like 85% of the PC market. There is another OS called Linux. About 85-90% of the internet runs on it.
I can understand a little where people get the idea the iPhone is anti-competitive, but we're talking about MacOS here.
Ask Amphetamine about how open they are.
Amphetamine is a perfect example of how the Mac isn't a walled garden. They always had the option to sell outside the App Store. That is fundamentally what separates a walled garden from an open platform. They might have lost some sales because they couldn't participate in the Mac App Store, but they could still sell their product. Some companies choose to avoid the Mac App Store because they don't like Apple's policies.
That's neither here nor there.
(a) Amphetamine could still be sold outside the Mac App Store.
(b) An app name could be problematic even in FOSS land. It's just that instead of Amphetamine being the name that causes it, it would be something else. E.g. with the trend of banning/changing terms like "master" (as in a replication primary, not as in the owner of slaves), unfortunately named apps could be thrown out of a distro's package manager or a project, or asked to rename themselves to be included.
Remember the transition to arm64? Apple forced everything on the App Store to ship universal binaries.
Without the App Store walled garden, software isn’t required to keep up to date with architectural changes. Instead, keeping current is only a requirement to being featured on the App Store (which would just be a single way to install software, not the only method).
That said, all software on the Mac, post-Catalina, has to be 64-bit, whether it's distributed through the Mac App Store or not, because the 32-bit system libraries are no longer included at all.
Gates are not incompatible with walled gardens. Most walled gardens have those.
Plus, I mentioned the walled garden as a good thing. It's part of the Apple proposition (even if not all get it), and part of what enables it to move at the speed it does (whether in the right or wrong direction).
But one can substitute "walled garden" with "tight control of the OS, the hardware, and imposed requirements on most third party software, and willingness to enforce hard schedules (e.g. regarding removing 32-bit, OpenGL, etc) on all (or tons) of its developers at once".
The walls in the walled garden only exist in the heads of people who never use a Mac.
With much slower adoption, pushback, and bike-shedding, like in the Microsoft and Linux world.
>To me, it suggests they aren't willing to compete.
Compete with what? With themselves? They compete with Windows (and to a degree Linux, though few care for that), and with Android. They'd compete with Windows Phone too if MS wasn't incompetent.
But they didn't do anything to preclude others from making their OS/hardware and selling it to customers. In fact, they have nowhere near a monopoly in either the desktop (10% or less) or the mobile space (40% or less).
Whereas MS for example, had 98% of the desktop (home and enterprise), and abused its power to threaten OEMs to do its bidding against Linux etc.
I don't care how good their hardware is. Moreover, good luck sourcing parts if the device has trouble. Apple will not sell you the parts, even if you wanted them.
A walled garden does not make their hardware any better. If anything it makes it worse. I hope for Mac users' sake that Apple does not clamp down further on Macs.
Didn't downvote, but I think it's the same as when people read a "letter to the editor" of yore, declaring that some person "cancelled their subscription" because of something in the magazine.
A natural response is "Don't let the door hit you on your way out", which on HN might be expressed through a downvote by some.
>I don't care how good their hardware is. Moreover, good luck sourcing parts if the device has trouble. Apple will not sell you the parts. Even if you wanted.
Well, they repair all kinds of parts, and have guarantees and guarantee extension programs. But in any case, their allure was never "can find parts to build my own / repair damages forever" or in their stuff being cheap to own or fix/replace.
>A walled garden does not make their hardware any better.
Well, it does in a few ways. Mandating how the software is made, what software is sold, and when it has to adopt new libs to continue being sold, etc, means that they can move the platform in different directions faster.
I never bothered to enter Apple's tyrannical ecosystem, so there is no door to hit me on the way out.
> Well, they repair all kinds of parts, and have guarantees and guarantee extension programs.
You can not get a lot of surface-mount chips to repair a MacBook or iPhone without having to look on the gray market. This is even before possible firmware issues if you manage to find parts. Heck, even getting full replacement boards is basically impossible, unless they come from donor machines that have other problems.
> "Well, it does in a few ways. Mandating how the software is made, and what software is sold, when it should adapt new libs to continue being sold, etc, means that they can move the platform in different ways faster."
I disagree; allowing people to side-load does not stop Apple from having policies in place for its app stores. That's honestly the only problem. It's the owner's hardware; they should not need Apple's permission to run code on it. Unless the owner can sign software themselves and/or run it without Apple's consent, this will always be a problem. You can't even install gcc without jailbreaking an iPhone.
At the very least I should be able to install another OS on the device, like GNU/Linux. If Apple does not want to open iOS, the user should at least have that option for the hardware.
This is even before you get into how Apple treats developers. Have you read the entire App Store guidelines? Some of it is ridiculous. Some of the insanity prevents Firefox from even porting their own browser engine.
X86 definitely carries a constant-factor overhead, but if Intel put their designs on 5nm they'd look pretty good too. Jim Keller (when he was still there) hinted that their offerings for a year or so in the future are significantly bigger, to the point that he judged it worth mentioning, so I wouldn't write them off.
Of course this leads to the question that if everyone in the industry knew this was the issue why weren't Intel and AMD pushing harder on it? They already both moved the memory controller onboard so they had the opportunity to aggressively optimize it like Apple has done, but instead we have year after year where the memory lags behind the processor in speed improvements, to the point where it is ridiculous how many clock cycles a main memory access takes on a modern x86 chip.
Part of this is due to the fact that x86 processor designers won't include more execution units than they can feed from their instruction decoders. Apple's processors are much wider than x86 on both the decode and execution resources, and it's pretty clear that the M1 would not perform as well if its decoders were as narrow as current x86 cores.
This would imply that it's able to sustain ILP greater than 4 (or maybe 5 with macro-fusion). Does it actually manage to do this often? If so, that's really impressive. I was guessing that most of the advantage was coming from the improved memory handling, and possibly a much bigger reorder buffer to better take advantage of this, but I'm happy to be shown otherwise.
For instance, it's hard to combine instructions together, which is actually an advantage for x86 (the complex memory operands come for free). But x86 also guarantees memory ordering that ARM doesn't, which is a drawback.
I'm not sure how important this is in practice.
True, although I just looked at the ARM assembly for Daniel's example, and it's making good use of "ldpsw" to load two registers from consecutive memory with a single instruction. So in this particular case, it may be a wash.
> But it also guarantees memory ordering that ARM doesn't which is a drawback.
Yes, I wasn't considering the memory model to be part of the instruction set. I agree that in general this could be a big difference in performance, although I don't think it comes up in Daniel's example.
I added a comment to Daniel's blog with my guess as to what's happening to cause the observed timings in his example. Feedback from anyone with better knowledge of M1 would be appreciated.
Yes and no. Yes, because modern superscalar CPUs don't execute the instructions directly, but rather use a different instruction set entirely (the "micro-ops") and effectively compile the native instructions into that. This makes them free to choose whatever micro-ops they want. Ergo the original instructions don't matter.
But... that means there is a compile step now. For a while that was no biggie - it can be pipelined if the encoding is complex. But now the M1 has 12 (iirc) execution units. In the worst case that means they can execute 12 instructions simultaneously, so they must decode 12 instructions simultaneously. That is a wee exaggeration, as it isn't quite that bad. In reality the M1 appears to decode 8 instructions in parallel.
This is where the rot sets in for x86. Every ARM64 instruction is 32 bits wide. So the M1 grabs 8 32-bit words and compiles them in parallel to micro-ops. Next cycle, it grabs another 8 32-bit words, compiles them to micro-ops, and so on. But x86 instructions can start on any byte boundary and can be 1 to 16 bytes in length. You literally have to parse the instruction stream a byte at a time before you can start decoding it. In practice they cheat a bit, making speculative guesses about where instructions might start and end, but when you're being compared to someone who effortlessly processes 32 bytes at a time, that's like pissing in the wind.
So the instruction set may not matter, but how you encode that instruction set does matter, a lot. Back in the day, when there were few caches and every instruction fetch cost a memory access, you were better off using tricks like one-byte encodings for the most common instructions to squeeze down the size of the instruction stream. That is the era x86 and amd64 hark from. (Notably, the ill-fated Intel iAPX 432 took it to an extreme, having instructions start and end on a bit boundary.) But now, with execution units operating in parallel and on-chip caches putting instruction stores right beside the CPUs, you are better off making storage size worse in order to gain parallelism in decoding. That's where ARM64 harks from.
It's interesting to watch RISC-V grapple with this. It's a very clever instruction set encoding that scales naturally between different word sizes. This also naturally leads to a very tight, compressed instruction set. But in order to achieve that they've got more coupling between instructions than ARM64 (though far, far less than x86), and any coupling makes parallelism harder. Currently RISC-V designs are all at the small, non-parallel end, so it doesn't affect them at all. In fact, at the low-power end it's almost certainly a win for them. But I get the distinct impression the consensus of opinion here on HN is that it will prevent them from hitting the heights ARM64 and the M1 achieve.
What's missing from your description is the extra level of decoded µop cache between the decoder and instruction queue on modern Intel chips. In a tight loop, this pre-decoder kicks in and replays the previously decoded µops at up to 6 per cycle. It's a mess (and complicated enough that Intel needed to disable part of it with a microcode update on Skylake) but it provides enough instruction throughput that the real bottleneck is almost always elsewhere. Specifically, the 4-per-cycle instruction retirement limit almost always maxes out my attempts at extremely tight loop code earlier than instruction decoding.
Which is to say, you are right about how much easier it is to decode ARM64 instructions, but I think you are wrong that decoding x64 is in-practice a limiting factor for performance. If you have a non-contrived example to the contrary, I'd love to see it.
More details here: https://en.wikichip.org/wiki/intel/microarchitectures/skylak...
And in this nice blog post: https://paweldziepak.dev/2019/06/21/avoiding-icache-misses/
That's what caches and TLBs are for.
I'm sure they'll still come out ahead in benchmarks, but the numbers will be much closer once AMD moves to 5nm. You absolutely cannot fairly compare chips from different fab generations.
I don't see many comments hammering this point home enough... it's not like the performance gap is through engineering efforts that are leagues ahead. Certainly some can be attributed to that, and Apple has the resources to poach any talent necessary.
Apple appears to have taken the power reduction when they moved to TSMC 5nm.
>The one explanation and theory I have is that Apple might have finally pulled back on their excessive peak power draw at the maximum performance states of the CPUs and GPUs, and thus peak performance wouldn’t have seen such a large jump this generation, but favour more sustainable thermal figures.
>Apple’s A12 and A13 chips were large performance upgrades both on the side of the CPU and GPU, however one criticism I had made of the company’s designs is that they both increased the power draw beyond what was usually sustainable in a mobile thermal envelope. This meant that while the designs had amazing peak performance figures, the chips were unable to sustain them for prolonged periods beyond 2-3 minutes. Keeping that in mind, the devices throttled to performance levels that were still ahead of the competition, leaving Apple in a leadership position in terms of efficiency.
Customers don't care, but discussion of the merits of the chip should be more nuanced about this.
It also implies that the gap won't exist for very long, as AMD will move onto 5nm soon
... yes, if there is any capacity left. Capacity for the new process is a limited resource after all.
Intel is in a really bad place now (in a forward-looking sense), primarily due to their fab process falling behind TSMC and others. You can't design your way ahead while using old manufacturing technology
AMD has done mobile CPUs that look as if they are close to or even ahead of the M1 in performance, but they all use 2x to 4x as much power. When higher core count versions of Apple Silicon are available, they will be able to have double the core counts of AMD chips at the same power levels.
And each those cores are significantly faster than individual AMD cores.
It's their desktop chips running Zen 3 cores that trade blows on single core integer math, depending on which benchmark you look at.
Of course, on multi-core performance you can buy a desktop chip with a much higher Zen 3 core count than Apple offers.
And Ryzen chips are offered with more cores, but that’s an extremely temporary advantage (reminds me of the friend who told me not to buy Apple stock because they didn’t have big screen phones).
When Apple fits 32 Firestorm cores in a 135 watt TDP package, AMD isn’t going to have an answer.
Here's a great example, Intel Core i9 (Desktop, 125w TDP) vs Intel Core i7 (Laptop, 15W). Huge power difference, only ~10% difference in single core clock speeds.
Or check multicore scores here:
M1 is better by 20% or so, but is 2 fab generations ahead of the Intel chips.
The Ryzen 4900H is a 55 watt TDP part with eight performance cores. Despite that it’s far behind the M1 in GeekBench single core and multicore scores.
Now, as you showed, it's way ahead in the Cinebench R23 multicore benchmark. Let's assume that's more representative of real-world use, and that being a process node behind means a 5 nm successor from AMD gains 15% higher performance. That would increase its Cinebench multicore score to roughly 12,700, roughly 60% higher than the M1's.
But all Apple has to do is come out with an M1X with eight Firestorm cores. That’s a multicore performance in the same range as the best possible 5 nm AMD CPU, and a TDP barely half of the AMD CPU. And far higher Cinebench single core, and GeekBench single/multicore ratings.
Obviously, to swap the Icestorm cores for Firestorm cores they need more transistors, or something else has to go (the on-chip GPU?).
And Apple won’t be making 20 hour MacBook Airs out of this M1X, but they will be able to make 14 hour MacBook Pros with faster discrete GPUs.
And they are about 6 months from doing exactly that, which will likely be before AMD hits 5 nm, so top-of-the-line 4900H laptops get smoked on all Cinebench and GeekBench scores while using nearly double the power.
Even though M1 is designed for efficiency, it sometimes outperforms AMD/Intel for performance. That confuses the story.
The commoditisation of PC hardware has driven great CPU designs into extinction. Heck, even Oracle - now in the business of litigation for fun and a massive profit, with its prodigious cash war chest - has discontinued the UltraSPARC architecture due to it requiring extraordinary investments on multiple fronts. PC users have long been forced to be content with whatever bone the CPU architecture coloniser would throw at them. There appears to be a resurgence of great engineering with M1, and, hopefully, that will lead to more thoughtful engineering in the medium to long term.
M1 is fast due to: a solid, single vision of what a modern CPU should be like, continuous investment in R&D over an extended period of time, a well-concerted engineering effort, design-idea reuse across multiple product lines, supply chain management, and, of course, the manufacturing process. Nanometers do not make for a great CPU design but rather play a supporting role. If the nanometers were so important, would the 2017 POWER9 design, manufactured on a 14 nm process with a smaller L1 cache, have been able to outperform any existing x86 design in 2020 in both single-core and multi-core (with 25% to 50% fewer physical cores) setups? Ryzen 3 has narrowed the gap, but POWER9 still takes the lead, and POWER10 is around the corner.
There is a great quote by Michael Mahon, a principal HP architect, in the foreword to the PA RISC 2.0 CPU architecture handbook from 1995:
The purpose of a processor architecture is to define a stable interface which can efficiently couple multiple generations of software investment to successive generations of hardware technology. Stability and efficiency are the goals, and the range of software and hardware technologies expected during the architecture’s life determine the scope for which the goals must be achieved
Efficiency also has evident value to users, but there is no simple recipe for achieving it. Optimizing architectural efficiency is a complex search in a multidimensional space, involving disciplines ranging from device physics and circuit design at the lower levels of abstraction, to compiler optimizations and application structure at the upper levels.
Because of the inherent complexity of the problem, the design of processor architecture is an iterative, heuristic process which depends upon methodical comparison of alternatives («hill climbing») and upon creative flashes of insight («peak jumping»), guided by engineering judgement and good taste.
To design an efficient processor architecture, then, one needs excellent tools and measurements for accurate comparisons when «hill climbing,» and the most creative and experienced designers for superior «peak jumping.» At HP, this need is met within a cross-functional team of about twenty designers, each with depth in one or more technologies, all guided by a broad vision of the system as a whole.
A well-executed holistic approach is the reason why the entry level M1 is fast. We need more «holistic-ism» in engineering everywhere.
UltraSPARC was not very competitive. Those machines were very, very expensive and you didn't get much bang for the buck. They weren't even that fast. The later chips had tons of threads but single threaded performance was pretty bad...
There is no freakishly successful strategy at play here either. It's just that all previous attempts at a "fast ARM" chip were rather half-hearted - "add a pipeline stage here, add an extra register there, increase datapath width there" - and didn't squeeze it to the limit.
These improvements didn't come from nowhere. They came from iterations of iOS hardware.
Others have to some extent — AMD is certainly not out of the game — so I'd treat this more as the question of how they've been able to go more aggressively down that path. One of the really obvious answers is that they control the whole stack — not just the hardware and OS but also the compilers and high-level frameworks used in many demanding contexts.
If you're Intel or Qualcomm, you have a wider range of things to support _and_ less revenue per device to support it, and you are likely to have to coordinate improvements with other companies who may have different priorities. Apple can profile things which their users do and direct attention to the right team. A company like Intel might profile something and see that they can make some changes to the CPU but the biggest gains would require work by a system vendor, a compiler improvement, Windows/Linux kernel change, etc. — they contribute a large amount of code to many open source projects but even that takes time to ship and be used.
Also, compiler support for CPUs is very overrated. Heavy compiler investment was attempted with Itanium and debunked; giant OoO CPUs like Intel's or M1 barely care about code quality, and the compilers have very little tuning for individual models.
I wasn't just talking about Intel but the concept of separate CPU and compiler vendors in general. Intel contributes a ton of open source but even if they were perfectly organized it takes time for everything to happen on different schedules before it's generally available: get patches into something like Linux or gcc, wait possibly years for Red Hat to ship a release using the new version, etc. Certain users — e.g. game or scientific developers — might jump on a new compiler or feature faster, of course, but that's far from a given and it means they're not going to get the across-the-board excellent scores that Apple is showing.
> Also, compiler support for CPUs is very overrated. Heavy compiler investment was attempted with Itanium and debunked; giant OoO CPUs like Intel's or M1 barely care about code quality, and the compilers have very little tuning for individual models.
This isn't entirely wrong but it's definitely not complete. Itanium failed because brilliant compilers didn't exist and it was barely faster even with hand-tuned code, especially when you adjusted for cost, but that doesn't mean that it doesn't matter at all. I've definitely seen significant improvements caused by CPU family-specific tuning and, more importantly, when new features are added (e.g. SIMD, dedicated crypto instructions, etc.) a compiler or library which knows how to use those can see huge improvements on specific benchmarks. That was more what I had in mind since those are a great example of where Apple's integration shines: when they have a task like “Make H.265 video cheap on a phone” or “Use ML to analyze a video stream” they can profile the whole stack, decide where it makes sense to add hardware acceleration, and then update their choice of the compiler toolchain and higher-level libraries (e.g. Accelerate.framework) and ship the entire thing at the time of their choosing whereas AMD/Intel/Qualcomm and maybe nVidia have to get Microsoft/Linux and maybe someone like Adobe on board to get the same thing done.
That isn't a certain win — Apple can't work on everything at once and they certainly make mistakes — but it's hard to match unless they do screw up.
What you said is true for libraries, I just don't think it's true for compiler optimizations. Even Apple's clang just doesn't have any new optimizations that work on their own; there are certainly new features but they're usually intrinsics and other things that need to be adopted by hand. They thought this would happen (it's what bitcode was sold as doing) but in practice it has not happened.
"ARM won't play nice with the larger CPUs they want to make" - wat? Apple holds an architectural license. This means they paid a lot upfront a long time ago and therefore have a more or less perpetual right to design their own Arm cores without input from Arm.
"a competent graphics solution" - Also wat? M1 has an excellent GPU. It doesn't compete with discrete GPUs that use 300 watts, but that's fine: M1 is the chip for entry level Macs, designed for the smallest and lightest segments of their notebook line. And in that product segment, it has been every bit as much a revelation as the CPU. It's very fast, and uses little power given the performance.
What exactly do you think is going to happen when they scale that basic GPU design up? Despite your dismissiveness, in modern silicon architecture energy efficiency is incredibly important: for any given power budget, the more efficient you are the more performance you can deliver. The performance Apple gets out of about 10W on M1 suggests they'll have few problems building a larger GPU to compete with Nvidia and AMD discrete GPUs.
1. they are the only ones who have 5nm chips because they paid a lot to TSMC for that right
2. They gave up on expandable memory, which lets them solder it right next to the CPU, which likely makes it easier to ship really high clocks. And/or they just spent the money it takes to get binned LPDDR4X at that speed.
So a good CPU design, just like AMD and Intel have, but one generation ahead on node size, and fast RAM. It's not special low-latency RAM or anything, just clocked higher than maybe any other production machine, though enthusiasts sometimes clock theirs higher on desktops!
The design seems to be very different, in that it's far far wider, and supposedly has a much better branch predictor.
> fast ram
Is that a property of the RAM clock, or a function of a better memory controller? The RAM certainly doesn't appear to have any better latency.
And yes, obviously Apple's bespoke Arm CPU is quite a bit different from Zen 3 Ryzen's x86 CPU, but I'm not sure it's net-better. When Zen 4 hits at 5nm I expect it will perform on par with or better than the M1, but we won't know till it happens!
- benchmarking something small enough to inspect the machine code, but not inspecting the machine code
- not plotting the distribution, average, variance, etc.
- paying no attention to CPU frequency governor settings
- measuring too short a run
- measuring a dataset so small that it fits entirely in L1
Anandtech has a graph showing this, specifically the R per RV prange graph. I've verified this personally with a small microbenchmark I wrote. I've not seen anything else close to this memory latency.
Scroll down to the latency vs size map and look at the R per RV prange. That gets you 30ns or so.
Similar for AMD's latest/greatest the Ryzen 9 5950X:
The same R per RV prange is in the 60ns range.
It's been a long time since I dealt with this stuff (wanted to get 1GB huge pages in Linux for some huge huge hash tables), so maybe I'm misunderstanding.
If you use a 1GB array and see full random with much higher latency than a sliding window then you can be pretty sure that the page size is much less than 1GB.
Getting the cacheline off by a factor of 2 does make a small difference since you get occasional cache hits instead of zero, but as long as the array tested is several times larger than cache the impact is small.
But all in all the M1 has excellent memory bandwidth, excellent latency, and shows significantly better throughput on random workloads as you use more cores. Normal PC desktops have 2 memory channels (even the higher end i7/i9/ryzen7/ryzen9), only the $$$$ workstation chips like threadripper and some of the $$$$ Intel's have more. The little ole M1 in a mac mini, starting at $700 has at least 8 memory channels. So basically the M1 delivers on all fronts, larger and lower latency caches, wide issue, large reorder buffers, excellent IPC, and excellent power efficiency.
Just a clarification as to why the R per RV prange numbers are good: this pattern is simply aggressively prefetched by the region prefetcher in the M1, while Zen3 doesn't pull things in as aggressively.
It's designed to graph latency/bandwidth for 1 to N threads. My 1 thread numbers match Anandtech's. Use -p 0 for full random, which thrashes the TLB or -p 1 to be cache friendly (visit each cacheline once, but within a sliding window of 1 page).
To see the apple results (if you have gnuplot installed):
M1 is fast because they optimized everything across the board. The speed is the cumulative result of many optimizations, from the on-die memory to the memory clock speed to the architecture.
N = 1000000000, 953.7 MB
two : 30.6 ns
two+ : 39.6 ns
three: 45.1 ns
EDIT: Adding `-march=native` didn't really change the results, which makes sense given that it's a memory benchmark.
N = 1000000000, 953.7 MB
two : 12.8 ns
two+ : 13.7 ns
three: 19.5 ns
N = 1000000000, 953.7 MB
two : 29.7 ns
two+ : 36.5 ns
three: 43.8 ns
Looking a little closer at the script, it loads numbers from "random", a vector of 3 million `Int` (this is hard coded, separate from `N`).
This vector is about 11.4 MiB.
The Tiger Lake CPU has 12 MiB of L3 cache (same as your i7-9750H), so it barely fits. Meanwhile, the L1 cache is 48 KiB and the L2 cache is 1.25 MiB: huge compared to most recent CPUs, and a big benefit in most benchmarks, but at the cost of higher latency.
Skylake's L3 latency was 26-37 cycles; in Willow Cove (Tiger Lake) it is 39-45 cycles.
That difference by itself isn't big enough to account for the difference we're seeing, so something else must be going on.
I'm wondering if the difference is the number of active memory channels. How many channels do your respective computers support? Do you have enough RAM installed that all channels are in use? Are you able to do a RAM bandwidth test by some other means to verify?
Another possibility is that for some reason the base latency is just different between your machines. A commenter added a pointer-chasing variation of Daniel's test on his blog. Maybe run this to find the full latency and see how the times differ?
Finally, there was one more commenter on the blog who reported anomalously fast times on a Windows laptop. It's possible there is a bug in Daniel's time measurements on Windows.
Re timing issues: I am on Linux.
N = 1000000000, 953.7 MB
two : 17.7 ns
two+ : 19.1 ns
three: 26.4 ns
Neither did messing with flags (I tried -fno-semantic-interposition, -march=native, and a few others).
Care to explain what you mean specifically by this?
two : 49.6 ns (x 5.5)
two+ : 64.8 ns (x 5.2)
three: 72.8 ns (x 5.6)
two : 62.8 ns (x 7.1)
two+ : 69.2 ns (x 5.5)
three: 95.3 ns (x 7.3)
(base) Coding % cc -mnative two-three.c
clang: error: unknown argument: '-mnative'
(base) Coding % cc -v
Apple clang version 12.0.0 (clang-1188.8.131.52)
Thread model: posix
It doesn't make much difference though, autovectorization doesn't work very well and there is not a lot of special optimization for newer x86 CPUs.
Only using 128 bit wide instructions on a core that has 512 bit hardware results in 4x less L1d bandwidth.
Speed: 2133 MT/s
two : 27.1 ns (3x)
two+ : 28.6 ns (2.2x)
three: 39.7 ns (3x)
Load Avg: 2.36, 2.01, 1.97 CPU usage: 2.10% user, 3.39% sys, 94.49% idle
The M1 chip allows higher MLP (memory-level parallelism), presumably because it has more LFBs (line fill buffers) per core (or maybe they use a different approach where the LFBs are not per-core?). I apologize for using so many abbreviations. I searched to try to find a better intro, but didn't find anything perfect. I did come across this thread that (apparently) I started several years ago at the point where I was trying to understand what was happening: https://community.intel.com/t5/Software-Tuning-Performance/S....
It's like the difference between the situation where every car uses 4 cylinders, and then Apple comes along and makes a car with 5 cylinders.
Not to disagree with your overall point, but 2mm is a long way when dealing with high frequency signals. You can't just eyeball this and infer that it makes no difference to performance or power consumption.
>The only thing you can say against them is they might consume more power driving that trace
Power consumption is really important in a laptop, and Apple clearly care deeply about minimising it.
For all we know for sure, moving the memory closer to the CPU may have been part of what's enabled Apple to run higher frequency memory with acceptable (to them) power draw.
It appears to be mounted on the same chip package.
Why did Apple do this if not for speed?
It is not only HN; it is practically the whole Internet. Go around the top 20 hardware and Apple website forums and you see the same thing, vastly amplified by a few KOLs on Twitter.
I don't remember ever seeing anything quite like it in tech circles. People were happily running around spreading misinformation.
According to this person's bio they had an undergraduate education in computer science ¯\_(ツ)_/¯
It looks like the Apple M1 is much less eager when caching memory rows. Maybe because it doesn't have an L3 cache.
Edit: This test fully utilizes the Apple M1's 8x16-bit memory bus. It's mostly just fetching random locations from memory, which can all be parallelized by the CPU pipeline. That explains why the results are exactly 4x slower on my Ryzen 3 with 2 memory channels.
So the summary is that the M1 is optimized for dynamic languages, which tend to do a DDoS attack on RAM with lots of random memory accesses, but it might take a performance hit with compiled languages and traditional HPC techniques that process data in sequence, like ECS.
I'm a bit of two minds about this: on the one hand, for a long time I've wanted a language for writing allocators which is more explicit about memory, and offers good abstractions for low-level memory operations (maybe Zig is going in this direction). In some sense, it feels like the move towards programmers thinking less about memory management has been a bit of a dead-end, and what we really want is better tools for memory management. Fragmentation in terms of how processors handle memory goes against this goal in some ways.
On the other hand, it's a bit of a "holy grail" to imagine a hardware stack which obviates the need for memory optimization, and really does treat loading from and storing to memory anywhere on the heap as the same. But I imagine that the interesting things which the M1 is doing with memory are helping a lot with the worst case performance, and maybe even average case performance, but they're probably not doing much for the best case.
My understanding, based on the article, is that on a normal processor we would have expected
arr[idx] + arr[idx+1]
to take the same amount of time.
But the M1 is so parallelized that it goes to grab both arr[idx] and arr[idx+1] separately, so we have to wait for both of those to return. Meanwhile, on a less parallelized processor, we would have done arr[idx] first and waited for it to return, and the processor would realize that it already had arr[idx+1] without having to do the second fetch.
Am I understanding this right?
That depends. If the two accesses are on the same cache line, then yes. But since idx is random, sometimes they will not be. He never says how big the array is in elements or what size each element is.
I thought DRAM also had the ability to stream out consecutive addresses. If so then it looks like Apple could be missing out here.
Then again, if his array fits in cache he's just measuring instruction counts. His random indexes need to cover that whole range too. There's not enough info to figure out what's going on.
If you only look at the article this is true. However, the source code is freely available: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
It's a 6-year-old system; the fastest times are in the 25ns range.
- 2-wise+ is 5% slower than 2-wise
- 3-wise is 46% slower than 2-wise
- 3-wise is 39% slower than 2-wise+
on the M1
- 2-wise+ is 40% slower than 2-wise
- 3-wise is 4% slower than 2-wise+
N = 1000000000, 953.7 MB
two : 53.3 ns
two+ : 60.1 ns
three: 78.6 ns
2+ 12% slower than 2
3-wise 47% slower than 2
3-wise 30% slower than 2+
Ratios aside, that's an interesting speed leap when the article gets 9 ms for 2-wise. Mind, the laptop had lots of applications running (I didn't clear them up to do a proper benchmark), but still.
- 2-wise+ is 19% slower than 2-wise
- 3-wise is 48% slower than 2-wise
- 3-wise is 25% slower than 2-wise+
However, as mentioned in the post the numbers can vary a lot, and I noticed a maximum run-to-run difference of 23ms on two-wise.
two : 97.4 ns
two+ : 97.9 ns
three: 145.8 ns
However, the speed benefits mostly come from a much larger L1 cache and the fact that the RAM is in the same package, which reduces latency.
The instruction cache is also a lot bigger, and a fixed-size ISA has the advantage that execution can be much wider than on x86, but that's unlikely to be of benefit here, other than perhaps slightly in terms of queuing up multiple outstanding loads.
Well that’s true and could very well be an advantage. An advantage in that they did it, not in that only they have access to it.
Intel and AMD can trivially profile real world workloads too.
Did they? I don’t know what Apple did, but the impression I get is that Intel certainly hasn’t.
This article lays out three scenarios:
1) accessing two random elements
2) accessing three random elements
3) accessing two pairs of adjacent elements (same as (1), but also the elements after each random element)
It then does some trivial math to use the loaded data.
A naive model might only consider memory accesses and might assume accessing an adjacent element is free.
On the M1 core, this is not the case. While the naive model might expect cases 1 & 3 to cost the same and case 2 to cost 50% more, instead cases 2 & 3 are nearly the same (3 slightly faster), and case 2 is about 50% more expensive than case 1.
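For concreteness, the three access patterns can be sketched like this (my reading of the article; Lemire's actual code differs in details). Every index must satisfy idx[i] + 1 < array length so the "+" variants stay in bounds:

```c
#include <stdint.h>
#include <stddef.h>

/* "two-wise": two independent random accesses per iteration */
uint64_t sum_two(const uint64_t *arr, const size_t *idx, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i + 1 < n; i += 2)
        s += arr[idx[i]] + arr[idx[i + 1]];
    return s;
}

/* "two-wise+": the same two random accesses, plus each one's neighbor */
uint64_t sum_two_plus(const uint64_t *arr, const size_t *idx, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i + 1 < n; i += 2)
        s += arr[idx[i]] + arr[idx[i] + 1]
           + arr[idx[i + 1]] + arr[idx[i + 1] + 1];
    return s;
}

/* "three-wise": three independent random accesses per iteration */
uint64_t sum_three(const uint64_t *arr, const size_t *idx, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i + 2 < n; i += 3)
        s += arr[idx[i]] + arr[idx[i + 1]] + arr[idx[i + 2]];
    return s;
}
```

The naive model charges two-wise+ the same as two-wise (the neighbors are "free"), and three-wise 50% more.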
Really depends on the level of naivety and the definition of "free". It would be less insane to say that accessing an adjacent element has negligible overhead if the data must be loaded from RAM and there are OOO bubbles in which to execute the adjacent loads. If some of the data is in cache, the free-adjacent-load claim immediately becomes less probable. If the latency of a single load is already hidden by OOO, adding another one will obviously have an impact. And if the workload is highly regular, you can get quite chaotic results from even trivial changes (sometimes even from aligning the .text differently!)
And the proposed microbenchmark is way too simplistic: it is possible that it saturates some units in some processors and completely different units in others...
Is the impact of an extra adjacent load from RAM likely to be negligible in real-world workloads? Absolutely. The precise characteristics depend on your exact model / current frequency / other memory pressure at the time, etc.
We have to assume these are byte arrays, yes? Or at least some size that's smaller than the cache line. You would still pay for the extra unaligned fetches. I don't think this is a valid scenario at all, M1 or not.
Anyone want to run these tests on an Intel machine and let us know if the author's "naive model" holds there?
That is, the math part is so trivial compared to the memory access that you could do a bunch of math and you would still only notice a change in the number of memory accesses.
Also it looks like the response to yours links their test and the naive model predicts correctly
I guess I still don't understand whats going on here.
Scenario 1 has two spatially close reads followed by two dependent random access reads.
Scenario 3 (2+) has two spatially close reads, and two pairs of dependent random access reads of two spatially close locations.
Why does it follow that this is caused by a change in memory access concurrency? The two required round trips should dominate both on the M1 and an Intel but for some reason the M1 performs worse than that. Why?
I can't help but feel the first snippet triggers some SIMD path while the 3rd snippet fails to.
I think you raise a good question, though -- what really is going on here? Is this just a missed optimization compiling for the m1?
Or is it actually something fundamental about how reads happen with an m1? I'm definitely not knowledgeable enough to know how to answer this
sysconf(_SC_PAGESIZE); /* posix */
The L1 cache values aren't there. The macOS `getconf` doesn't support -a (listing all variables), so they may just be under a different name.
edit: see replies for `sysctl -a` output
big cores (CPU4-7) have 192KB L1I and 128KB L1D.
Two examples( that are slightly bigger than this but the same principles apply):
If you benchmark std::vector insertion, you'll see a flat graph with n tall spikes spaced at ratios of its reallocation amount apart, and it scales very, very well. The measurements are clean.
If, however, you do the same for a linked list, you get a linearly increasing graph, but it's absolutely all over the place because it doesn't play nicely with the memory hierarchy. The std-dev for a given value of n might be a hundred times worse than the vector's.
Is there any performance difference between that and mach_absolute_time() ?
macOS also has clock constants for a monotonic clock that increases while sleeping (unlike CLOCK_UPTIME_RAW and mach_absolute_time).
¹Technically it uses mach_continuous_time() first if available (which appears to be equivalent to CLOCK_MONOTONIC_RAW), then clock_gettime(CLOCK_BOOTTIME, &ts) on Linux, then clock_gettime(CLOCK_MONOTONIC, &ts), then some other API for Windows.
²Conveniently enough the value that is equivalent to DISPATCH_TIME_NOW using the monotonic clock is just INT64_MIN, at least in the current encoding.
³Swift makes assumptions about the internal format of dispatch_time_t so I don't know if it actually can meaningfully change. Newer versions of Swift now use stdlib on the system, but any app built with a sufficiently old version of Swift still embeds its own copy of the stdlib. Granted, additions (like the monotonic clock) should be fine, as the Swift API does not actually bridge from dispatch_time_t so it only ends up representing times it has APIs to construct. Since it doesn't have APIs to construct monotonic times, they won't break it.
Specifically if I recall correctly the internal representation of time only allowed for 2 formats (wallclock vs monotonic). I suspect adding other types of clocks could pose back/forward compat challenges. However, this is a wild shot in the dark & I didn't really look into the internals of libdispatch. Maybe ask on their github page?
Generally upgrading a private API to public is a lot of work & goes through a lot of review (having monitored that mailing list & added 1 API during my time there). So if there's a private API probably some team at Apple needed it to deliver a feature but the maintainers were not confident the specific solution they chose generalized well & either requires some work or something else.
00000000000012ec pushq %rbp
00000000000012ed movq %rsp, %rbp
00000000000012f0 movabsq $0x7fffffe00050, %rsi ## imm = 0x7FFFFFE00050
00000000000012fa movl 0x18(%rsi), %r8d
00000000000012fe testl %r8d, %r8d
0000000000001301 je 0x12fa
000000000000130b shlq $0x20, %rdx
000000000000130f orq %rdx, %rax
0000000000001312 movl 0xc(%rsi), %ecx
0000000000001315 andl $0x1f, %ecx
0000000000001318 subq (%rsi), %rax
000000000000131b shlq %cl, %rax
000000000000131e movl 0x8(%rsi), %ecx
0000000000001321 mulq %rcx
0000000000001324 shrdq $0x20, %rdx, %rax
0000000000001329 addq 0x10(%rsi), %rax
000000000000132d cmpl 0x18(%rsi), %r8d
0000000000001331 jne 0x12fa
0000000000001333 popq %rbp
AFAIR on x86 a locked rdtsc is ~20 cycles. So to answer the GP's question, it has precision in the few-nanoseconds range. Accuracy is a different question: comparing numbers from the same die is fine, but be a little more suspicious across dies.
No clue how this is implemented on the M1, or if the M1 has the same modern tsc guarantees that x86 has grown over the last few generations of chips.
Sufficiently old versions of mach_absolute_time used a function called clock_get_time() on i386 (if the COMM_PAGE_VERSION was not 1). This changed in macOS 10.5 to a tiny bit of assembly that just read from _COMM_PAGE_NANOTIME on i386/x86_64/ppc (the arm implementation(!!) triggers a software interrupt). The i386/x86_64/ppc definitions were also copied into xnu.
For the next few years it kind of bounced back and forth between libc and xnu, and the routine was complicated by adding timebase conversion as needed. And at some point arm support was added back (it vanished when it first went to xnu), but this time using the commpage if possible.
As for M1, I assume it's using the arm64 routine in xnu, which can be found at https://opensource.apple.com/source/xnu/xnu-7184.108.40.206.1/....
As for clock_gettime_nsec_np(), at least as of Big Sur, it's in libc¹ instead of xnu and defers to mach_continuous_time()/mach_continuous_approximate_time()/mach_absolute_time()/mach_approximate_time() for the CLOCK_*_RAW[_APPROX] clocks. And clock_gettime() for those clocks is implemented in terms of clock_gettime_nsec_np().
Interestingly it's the other way around. Apple is using TSMC's 5nm process (they don't have their own fabs), which is better than Intel's in-house fabs, so it's Intel's vertical integration which is hurting them compared to the non-vertically integrated Apple.
Also, the answer to "is this only possible because of vertical integration" is always no. Intel and Microsoft regularly coordinate to make hardware and software work together. Intel is one of the largest contributors to the Linux kernel, even though they don't "own" it. Two companies coordinating with one another can do anything they could do as an individual company.
Sometimes the efficiency of this is lower because there are communication barriers and isn't a single chain of command. But sometimes it's higher because you don't have internal politics screwing everything up when the designers would be happy with outsourcing to TSMC because they have a competitive advantage, but the common CEO knows that would enrich a competitor and trash their internal investment in their own fabs, and forces the decision that leads to less competitive products.
During the iPod era, Toshiba's 1.8in HDD production was exclusive to Apple, but only for music players; this time Apple gets all of TSMC's 5nm output for a period of time.
Both teams deserve huge praise for the tight coordination and unreal execution.
Intel tried this with Itanium a while back and failed, because it is difficult to get software developers to target a new ISA and provide compilers and compiled code for everything unless you provide a translation layer.
Apple is one step ahead here because their compilers already supported ARM isa (because iPhones use them) and had both the OS and apps ready to go from day one of availability.
They also had translation technology that would allow mutating x86_64 code to ARM64 code so that old apps would (on the whole) run acceptably fast on the new chip.
To do the latter properly, Apple had to create a special mode to run the arm chip with total store order for memory writes, which is not standard on arm. (It would be a lot slower if they didn’t have that when running Rosetta translated code.)
Only with both the OS being available and the OS influencing the Arm tweaks (e.g. TSO) could they pull it off.
They also have the position that they build hardware that uses those chips so can mass produce - and in fact, replace - existing hardware.
Each of these things could be done in isolation by Intel/Windows/Apps but it would be difficult to do all three.
My guess is you’ll see Intel and AMD offering Arm chips in the near future, as both AWS (Graviton) and Apple have shown the way to a new Arm future.
Without AMD everyone would eventually be dragged into adopting Itanium.
1. Fast uncontended atomics. These speed up reference counting, which is used heavily by Objective-C code (and Swift). The increase is massive compared to Intel.
2. A guaranteed instruction-ordering mode. This allows Rosetta to produce faster Arm code when emulating x86. Without it, the emulation overhead would be much bigger (similar to what Microsoft is experiencing).
Doesn't everyone use the (I believe) still valid concepts of latency and bandwidth?
When creating the model discussed in the post, we're using it to try to make a static prediction about how the code will execute.
Note that the goal of the post is not to merely measure the memory access performance, it's to understand the specific microarchitecture and how it might deliver the benefits that we see in benchmarks.
For example, what is the bandwidth and latency when you ask for the value at the same memory address in an infinite loop? And how does that compare to the latency and bandwidth of a memory module you buy on NewEgg?
When people use BW in their performance models, they don’t use only 1 bandwidth, but whatever combination of bandwidth makes sense for the _memory access pattern_.
So if you are always accessing the same word, the first access runs at DRAM BW, and subsequent ones at L1 BW, and any meaningful performance model will take that into account.