This article has a mistake. I actually ran the benchmark, and it doesn't return a valid result on arm64 at all. The posted numbers match mine if I run it under Rosetta. Perhaps the author has been running their entire terminal in Rosetta and forgot.
As I write this comment, the article's numbers are: (minify: 4.5 GB/s, validate: 5.4 GB/s). These almost exactly match my numbers under Rosetta (M1 Air, no system load):
Maybe this article is a testament to Rosetta instead, which is churning out numbers reasonable enough you don't suspect it's running under an emulator.
Update, I re-ran with the improvements from downthread (credit messe and tedd4u):
Note that my version also uses a nanosecond-precision timer, `clock_gettime_nsec_np(CLOCK_UPTIME_RAW)`, because I was trying to debug the earlier broken version.
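For reference, a minimal sketch of the kind of timing wrapper I mean (the macOS-specific clock call is real; the helper name and structure are mine, not the benchmark's):

    #include <time.h>      // clock_gettime_nsec_np, CLOCK_UPTIME_RAW (macOS only)
    #include <cstdint>
    #include <cstddef>

    // Returns throughput in GB/s for a callable `work` that processes `bytes` bytes.
    template <typename F>
    double gb_per_sec(F&& work, std::size_t bytes) {
        uint64_t t0 = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);  // monotonic nanoseconds
        work();
        uint64_t t1 = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
        return (static_cast<double>(bytes) / 1e9) / (static_cast<double>(t1 - t0) * 1e-9);
    }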
That puts Intel at 1.16x and 1.07x for this specific test, not the 1.8x and 3.5x claimed in the article.
Also I took a quick glance at the generated NEON for validateUtf8 and it doesn't look very well interleaved for four execution units. I bet there's still M1 perf on the table here.
x86_64% ./benchmark
simdjson is optimized for westmere(Intel/AMD SSE4.2)
minify : 4.44883 GB/s
validate: 5.39216 GB/s
On arm64:
arm64% ./benchmark
simdjson is optimized for fallback(Generic fallback implementation)
minify : 1.02521 GB/s
validate: inf GB/s
simdjson's mess of CPP macros didn't properly detect ARM64. By manually setting -DSIMDJSON_IMPLEMENTATION_ARM64=1 on the command line, I got the following results:
arm64% c++ -O3 -DSIMDJSON_IMPLEMENTATION_ARM64=1 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
arm64% ./benchmark
simdjson is optimized for arm64(ARM NEON)
minify : 6.64657 GB/s
validate: 16.3949 GB/s
EDIT: Interestingly, compiling with -Os nets a slight improvement to the validate benchmark:
arm64% c++ -Os -DSIMDJSON_IMPLEMENTATION_ARM64=1 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
arm64% ./benchmark
simdjson is optimized for arm64(ARM NEON)
minify : 6.649 GB/s
validate: 17.1456 GB/s
The original article has been updated. The M1 actually turns out to be faster at validation than Intel.
Why? We were not running the same config as the author. You have to supply twitter.json as an argument; otherwise it uses the compiled binary itself (!) as the input, due to an off-by-one error in the argc/argv parsing.
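To illustrate the class of bug (a hypothetical reconstruction, not the author's actual code): argv[0] is the program name, so an off-by-one in the index or the argc check silently feeds the benchmark binary itself to the parser.

    #include <cstdio>

    int main(int argc, char** argv) {
        // Buggy pattern: argc is always >= 1, so the "fallback" never fires and
        // argv[0] -- the benchmark executable itself -- becomes the JSON input.
        const char* buggy = (argc > 0) ? argv[0] : "twitter.json";

        // Correct pattern: the first user-supplied argument is argv[1].
        const char* fixed = (argc > 1) ? argv[1] : "twitter.json";

        std::printf("buggy: %s\nfixed: %s\n", buggy, fixed);
        return 0;
    }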
The same thing happened to me the first time I tried to benchmark my code on M1.
In my case I was building using Bazel. Bazel was running under Rosetta because they don't release a darwin_arm64 build yet. I didn't realize the resulting code was also built for x86-64.
I tried explicitly passing -march but the compiler rejected this, saying it was an unknown architecture. After some experimentation, it appears that when you exec clang from a Rosetta binary it puts it in a mode where it only knows how to build x86.
Thanks for the tips. I'm unable to replicate my previous experiment re: -arch. When I compile a wrapper program that does an exec() of clang, it is able to build arm64 or x86_64 even if the wrapper is built as x86_64.
> That puts Intel at 1.16x and 1.07x for this specific test
That's an absolutely amazing result, and it shows how wrong the current information in the article is. I hope the author sees what you did and updates his page as soon as possible.
So the updated values on the original page are now:
> Intel/M1 ratio: 1.2 (minify), 0.9 (validate)
> As you can see, the older Intel processor is slightly superior to the Apple M1 in the minify test.
I'd consider it bigger news that the M1 is 10% faster than Intel in one of the two tests chosen by the author (UTF-8 validation), and only 20% slower in the other (minify), which for most purposes is something most users won't even be able to notice. It's quite a remarkable result. I'd surely write:
"As you can also see, in the UTF-8 validate test the M1 is superior to the older Intel processor, and in the minify test it is only 20% slower, even though Intel uses more power to calculate the result!"
-----
(Additionally, I'll use the opportunity to thank u/bacon_blood again, who verified the initial claims, and u/messe, who figured out what the remaining bug in the author's sources was! Great work!)
(Edit: the 1.16 ratio is from an older native measurement, so I also made an error in the previous version of this comment! I had wrongly connected it with the Rosetta 2-produced code, and I've deleted that part of this message. Still, the difference between 1.07 and 0.9 measured on two different setups is interesting, when the other test is so close.)
The post with egregious errors was also put up on a Sunday afternoon. And while we're all acting conciliatory now, it's pretty remarkable how biased the post was: the author used some clearly erroneous numbers to prove their prior, baseless claim that the "M1 chip is far inferior" in some respects, when those respects were specifically SIMD, and then became strangely defensive when some people rightly pointed out that ARM64 has 128-bit NEON and a number of other advantages.
Far inferior becomes... actually superior in many cases, even at SIMD.
Let’s try to be charitable, shall we? Everyone makes mistakes sometimes, even leading experts in low-level algorithm optimization. Lemire was upfront about making a mistake, and not at all defensive about it; if you are reading it that way, it’s just you.
It is clearly the case that the M1 CPU/SoC has a significant performance advantage in typical branchy single-core code, but much less advantage if any for certain kinds of heavily optimized numerics. Beyond that high-level summary, it’s good to dive into the details, and spark discussions.
Everyone is just now getting their hands on these chips, learning how to work with them, and trying to figure out how to best optimize for them.
What does "inf GB/s" mean in this circumstance? I can't figure out exactly how these compare yet; the minify number makes it look like native M1 underperforms Rosetta 2.
The benchmark program itself is obviously broken on ARM. Rosetta is JITing ARM behind the scenes, so in principle you could write a program + compiler that emitted the same ARM code Rosetta does; that means it's a problem with the program and not a problem with the M1. I'm not sure what's actually wrong with it yet.
This is a great post showing why you have to measure the specific tasks you care about rather than relying on general assumptions. Another example I remember seeing was crypto/hashing performance where you could find embedded processors competing with much faster general chips because they had dedicated instructions for those use-cases, and performance would fall off of a cliff if you used different encryption or hashing settings or an unoptimized libssl.
I’d be curious how the unified memory architecture shifts the cost dynamic for GPU acceleration. There’s a fair amount of SIMD work where the cost of copying to/from the GPU is greater than the savings until you get over a particular amount of data and that threshold should be different on systems like the M1.
> This is a great post showing why you have to measure the specific tasks you care about rather than relying on general assumptions.
If anything it turns out it's an argument that benchmarks mislead way too readily and first-principles arguments (which would quickly refute the idea of a 3.5x slowdown due to vector width) should always be used to double-check your working.
Indeed!
This reminds me of a fun issue I ran into years ago with SIMD code (gcc, Linux). I was experimenting with various vector sizes, and found significant slowdowns for some vector sizes. I was about to call it quits, as in 'well, I'll have to do things differently', when I realized it didn't make any sense.
I double-checked the actual values computed by the benchmark, which happened to be completely wrong. What I had actually found was a compiler bug!
> I’d be curious how the unified memory architecture shifts the cost dynamic for GPU acceleration.
Correct me if I'm wrong, but is this actually different from the regular integrated graphics that have been in Intel and AMD chips for decades? I remember there being some initiatives from AMD proposing similar offloading under the name HSA almost a decade ago.
I don't think there is actually any software really using it.
I don't think it is different. For example, the OpenCL specification allows for the possibility that data doesn't need to be copied between CPU and GPU.
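A minimal sketch of that OpenCL idiom, assuming a valid context and command queue already exist: allocate with CL_MEM_ALLOC_HOST_PTR and map the buffer instead of enqueuing an explicit copy. On devices with shared physical memory the map can be a no-op.

    #include <CL/cl.h>     // <OpenCL/opencl.h> on macOS
    #include <cstddef>

    // Create a buffer the host can fill in place; no clEnqueueWriteBuffer copy needed.
    void* map_zero_copy(cl_context ctx, cl_command_queue queue,
                        std::size_t nbytes, cl_mem* out_buf) {
        cl_int err = CL_SUCCESS;
        *out_buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                  nbytes, nullptr, &err);
        // Blocking map for writing; kernels later read the same physical pages.
        return clEnqueueMapBuffer(queue, *out_buf, CL_TRUE, CL_MAP_WRITE,
                                  0, nbytes, 0, nullptr, nullptr, &err);
    }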
In a recent interview (I think with the Changelog podcast) I heard an Apple engineer explain that the M1 had an advantage over previous systems in that not only did the data not need to be copied (which implies this isn't new) but also that no changes to the format of the data were needed given Apple's end to end control.
Yeah. It's a real feat that Apple was able to get heterogeneous computing done (something AMD was touting with OpenCL). Not having to copy data from system RAM to GPU buffers etc. is really great.
That's probably the difference: AMD and Intel implemented zero-copy years ago but no software used it while the Metal stack on macOS probably does take advantage.
One difference (as I understand it) is on Intel's integrated graphics the RAM used for the GPU is a dedicated segment for the GPU's use. You still need to copy data from the CPU's segment to the GPU's segment. While that might be faster than copying over PCIe it's still a copy operation. With the M1's GPU there's no segmentation so no copying.
That's how I understand it works but I might be completely wrong.
> Shared Physical Memory: The host and the device share the same physical DRAM. This is different from shared virtual memory, when the host and device share the same virtual addresses, and is not the subject of this paper. The key hardware feature that enables zero copy is the fact that the CPU and GPU have shared physical memory. Shared physical and shared virtual memories are not mutually exclusive.
Good point. Some Intel chips have had an on-package (on-die?) 128MB "L4 cache" made of DRAM. That certainly sounds a lot like the M1's integrated memory.
>crypto/hashing performance where you could find embedded processors
You mean an ASIC, I guess.
I think this was Apple's idea in the first place: instead of having only a general-purpose computational machine, why not have some general-purpose silicon alongside specialised silicon for the most common tasks?
After all, isn't the GPU just another specialised unit? Why not have similar units for everything relevant?
It's a poor post, much like the last one, if for no other reason than it's done so sloppily. There's nothing wrong with running simple, informal benchmarks but at a minimum, showing one's build and run details would make the limitations and outright mistakes more obvious.
No, I didn't. I mean just showing what your run looks like, inline in the blog, which is pretty typical, just like in the comment where someone tried to reproduce the results:
SVE is interesting because it gives you forward binary compatibility, i.e. your binary, written for a 128-bit wide vector unit, will benefit directly from a newer 256-bit wide unit without recompilation.
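This is roughly what vector-length-agnostic SVE code looks like (a generic ACLE sketch, not anything Apple ships): the loop never hard-codes the vector width, so the same binary uses however many lanes the hardware provides.

    #include <arm_sve.h>   // requires an SVE-capable compiler/target
    #include <cstdint>

    void add_arrays(float* c, const float* a, const float* b, int64_t n) {
        // svcntw() = number of 32-bit lanes in this CPU's vectors, unknown at compile time.
        for (int64_t i = 0; i < n; i += svcntw()) {
            svbool_t pg = svwhilelt_b32_s64(i, n);            // predicate masks the tail
            svfloat32_t va = svld1_f32(pg, a + i);
            svfloat32_t vb = svld1_f32(pg, b + i);
            svst1_f32(pg, c + i, svadd_f32_x(pg, va, vb));
        }
    }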
Who says they don't already support SVE? Is it publicly known whether they support it or not? Especially if the binary doesn't have to be recompiled, you'd never know whether they implemented it or not, right?
I would guess the processor would just raise an exception/interrupt, which would invoke an OS-level handler that would probably tell the user what the exception was that occurred; in this case, an unsupported instruction.
For many linear algebra heavy workflows, (numpy, R, Julia, etc.) I expect that AMD and especially Intel processors with AVX-512 will crush the M1 on real-world benchmarks. But this isn’t a reflection of RISC vs CISC, and Apple could choose to add hardware acceleration for wider instructions and hopefully will in the future.
It looks like simdjson doesn't support AVX-512, so there couldn't be a direct comparison with this article. I recently got a Tiger Lake laptop, though, so if anyone has a good means for comparison I'd be interested.
But then again, if your workflow is linear algebra heavy, shouldn't you be doing that on a workstation or a cluster and not your little MacBook? And given that you are probably doing that over Jupyter notebooks, SSH, or some cloud IDE, wouldn't the new ARM MacBooks provide a better user experience?
> But then again, if your workflow is linear algebra heavy, shouldn't you be doing that on a workstation or a cluster and not your little MacBook?
Blender, Gimp / Photoshop, Video Editing, LTSpice / PSpice and Matlab come to mind. These are consumer-ish workflows that benefit from linear algebra, but people want to do them on their laptops.
Hell, people are doing video editing on their PHONES these days, due to the convenience.
----------
Workstations and clusters are not affordable for the vast majority of users.
GPUs probably are affordable however. But these programs aren't really operating on GPUs yet (I mean, Blender and some Video Editing programs are... but LTSpice / Matlab are CPU-only still)
Clusters are affordable, given that cloud hosting is a commodity now. Digital Ocean, for instance, charge nothing for traffic between nodes if they are hosted at the same data centre.
I think we effectively agree with each other and the author. The M1 is a better day-to-day processor but not an AMD/Intel killer for (edit: some) compute-heavy workflows... (yet?). The discussion is more pertinent for the Mac Pro, where the current M1 would be worse than last-gen Intel for some common workflows on those devices.
>if your workflow is linear algebra heavy, shouldn't you be doing that on a workstation or a cluster and not your little MacBook?
Not necessarily; such math & ML inference workloads are run even on a Raspberry Pi and other ARM SBCs for numerous CV and other projects requiring edge compute.
What would be much more interesting is a comparison of Apple's frameworks for numerical processing, like Accelerate. Those do not just use SIMD but actually take advantage of all the additional things they put in their SoC.
What we are looking at now is some specific benchmark for a random library that makes use of things who knows how.
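As a tiny illustration of what calling into Accelerate looks like (a hedged sketch; the choice of routine is mine): the caller never writes intrinsics, and Apple's implementation dispatches to whatever hand-tuned SIMD it has for the SoC.

    #include <Accelerate/Accelerate.h>   // macOS/iOS only
    #include <cstddef>

    // out[i] = a[i] + b[i], computed by Apple's tuned vDSP kernel.
    void vector_add(const float* a, const float* b, float* out, std::size_t n) {
        vDSP_vadd(a, 1, b, 1, out, 1, n);
    }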
I think Apple is banking on coprocessors on their SoC to handle heavy workloads instead of general-purpose SIMD. The applications that benefit enormously from SIMD outside the GPU (like ML, multimedia codecs, etc.) are being migrated to proprietary function blocks interfaced through Swift frameworks, rather than hand-rolled, optimized blocks of code in C and assembly.
I would not be surprised if we see similar things added for audio DSP to support their creative software (it was alluded to in some marketing materials, haven't read much on it yet).
Other applications like language parsing and compiling are likely going to be second-class citizens moving forward, since Apple doesn't care about using their machines for general purpose development regardless of how many of us buy them for being solid Unix platforms.
>Apple doesn't care about using their machines for general purpose development
That's a bold statement that I've heard repeated over many years and never actually seen any real evidence of, considering the Mac is the only place you can develop applications for the majority of Apple devices. Not a week goes by on HN without someone saying "Apple doesn't care about developers and/or will stop making general-purpose Macs", and yet there isn't an XCode for iOS or Android or Windows or Linux.
General purpose development includes more than developing applications for Apple devices.
XCode is a good example of what I mean. It's a terrible developer experience for anything but targeting Apple's platforms in the way they want you to (meaning writing Swift, using their frameworks and their devtools, keeping your code base small to keep those devtools functional, and never caring about that code running on a different OS).
Compare to Visual Studio, which also only runs on Windows, yet is a pretty damn good developer experience and not a nightmare to support for cross platform projects of late.
Yes, but until XCode runs somewhere else, Apple has to support C compilers and CLI tools and allow general-purpose code to run on their machines. XCode relies on all of these things. Just because Apple doesn't make a general-purpose IDE for their platform doesn't mean anything, as long as you can install VSCode or vim or Jetbrains or whatever your preferred IDE is.
Again, this has been repeated here at least once a week for however many years I've been visiting this site, and it's no closer to being true now than it was back then. If it were even remotely true, Apple would have left the Macs on Intel and just phased them out in favor of the iPad Pro. Instead they spent billions making their iPad chips run MacOS, desktop applications, and code compiled for Intel processors. Real talk: what about that gives anyone any indication that they're planning on throwing all of that away?
They're actively doing the opposite of what you're claiming and spending billions of dollars to do it, and one blog post that says "this isn't even a big deal" is all it takes to convince you otherwise?
I was replying to the comment you made about XCode not running on other platforms. That's neither here nor there. My point is it is not a good developer experience for anything but client apps on their operating systems - things like iOS apps and consumer programs, not web design, backend engineering, high performance computing or the slew of other kinds of software that Macs are getting worse at.
This goes deeper than developer tools. The documentation for their core frameworks has been purged from official sources and is relegated to deprecated sites and comments in header files tucked away in /Library. Kernel modules are being deprecated. There's no alternative to IPP or MKL on ARM for the M1 chip. Docs for plugin architectures are more and more hidden away, and Apple's Developer Conference consistently focuses only on consumer-facing applications, while support for professional applications and advanced computing is only available if you work for a partner organization, making it less accessible.
The reason I say that Apple is making their platform harder to use for general purpose computing is from my experience shipping code on MacOS for the last decade. It's fine if you disagree, but that hasn't been my experience. Every year it costs me more time and money to target Macs than the year before.
Development for Apple devices is wide in volume but narrow in scope.
You pretty much don't develop anything else on Mac OS. No web, no embedded, no Linux, no Windows, no nothing. Only Mac OS software, iPhone apps, etc. I am exaggerating slightly, but you get the idea.
Or when you do, you use your Mac OS laptop as a terminal, or at best as something to run VMs; in both cases you don't develop with Apple's OS / tools, it's just hosting or giving you non-integrated access to completely different systems.
Apple things are a pretty much distinct and closed ecosystem, and one which is quite limited to (some) endpoints.
None of what you said makes any sense at all. I absolutely develop web applications on my Mac, using a graphical IDE. I've also developed for Arduino boards using a graphical IDE on my Mac. I've written software natively on my Mac that I put into production on Linux and Windows machines. And all of that is supported on the M1 as well. They even built a complete and complex and powerful subsystem to allow you to run Intel code on their new processor.
Apple has never stopped me from installing VSCode or Atom or Sublime or Jetbrains and they've never given me any reason to think they would do so in the future, especially given Rosetta 2. I've never run a VM on my Mac.
I have no idea what point you're trying to make here, so I'm trying my hardest to argue against the best possible interpretation, and that interpretation is still so unbelievably wrong that I still feel like I must be missing the point.
So I already wrote that I was exaggerating slightly, and I still believe your case falls into that. For example, about embedded dev, I had in mind things less individual-tinkerer oriented and more productized. I don't know: set-top boxes, base stations, software for trains, software for washing machines, smartcards, etc.; or even big equipment controlled by an OTS desktop/laptop-like computer, or a PLC. I'm sure in a few exotic cases a Mac will be involved here and there, but let's be honest, Windows as a host dev station is far more probable. Maybe a bit of Linux too, but probably far from the majority. And Mac OS would probably be very far behind.
Now about running Intel code: I know about the excellent x86 emulation layer Mac OS + M1 has, and it's great, but I actually don't care about it at all for what I was thinking of; I think that broadly applies to x86 Macs as well. I'm thinking more about the software ecosystem; the precise HW CPU of the dev machine is only interesting maybe for people developing SIMD code or, well, running VMs.
About running VSCode & co: that's great too, but where are the Mac OS-hosted toolchains for the targets I talked about? That's why I qualified Mac OS in this case as merely being used as a "terminal"; I was speaking in the broad meaning of the term, a graphical terminal, not just a VT100-like terminal. The actual toolchains are elsewhere.
About web-dev, I admit that's probably where you can do most of the non-Apple-only dev while staying really native, although probably not if you need a complex server-side setup. Arguably I went way too far when I wrote "no web".
Well, nothing is absolute, and I know Mac OS remains a general-purpose OS even able to host some serious dev. I just think it is not really the most used one outside of, let's say, client-related consumer tech and some pro-desktop tasks, mainly around Apple tech. Claiming embedded in the general case would really be stretching the narrative.
I imagine there might be a lot of embedded toolchains that only run on windows, so I can see that being used there. But in my experience the vast majority of web development (here in the UK at least, but I get the impression it's similar in the US and the rest of Europe) is done on macs. And you can see that reflected in the software support. Languages/ecosystems like Python and Ruby still don't have support for Windows that's on par with mac/linux support.
Yeah, I take back what I wrote about web-dev. Although I find Linux broader than Mac OS in this case, because the server side is likely to be Linux in prod and never Mac OS, and in some complex cases it will also be Linux in dev. But even with that, yes, big parts of web-dev take place on Mac OS.
Web-dev I mostly take back (well, this is subtle: for complex server-side dev scenarios you will use Linux even in dev, but even with that, enough work remains possible under purely Mac OS that you can't say web-dev broadly can't be done).
But embedded: where are the most used tools for FPGAs? Where are the compilers for microcontrollers? Of course you can do some, in a limited capacity, with a subset of targets and a subset of toolchains, not the most used on earth. But broadly in that area, yeah, you don't use Mac OS.
There aren’t that many tools that are on Linux but not macOS. You’re probably right for FPGAs (which I tend to consider a small subset of embedded, but of course there are different points of view), but the standard tooling for embedded platforms is there on MacOS, including some of the IDEs (I am not going to say most, because I am sure there are counter-examples).
Being UNIX was never part of either Apple's or NeXT's culture; it was more a way to get a foot into the workstation market, and a question of survival in not closing doors.
Now they are way beyond that, so they can focus on what was the soul of Mac design during the System days.
> Being UNIX was never part of either Apple's or NeXT's culture
Why do you say this? True, Mach is architecturally different from the BSD kernel, but user space started out as NetBSD and it's still fundamentally a POSIX system.
I never worked for either company but have worked with NeXT and Apple engineering teams on projects and wouldn’t say that I was working with people who took a non-Unix orientation, especially when compared, say, to Windows.
Would you not have considered AIX Unix? Or Unicos? People considered those Unix, but they were more alien than macOS due to the then-current constraints of the hardware.
What mattered on A/UX for application developers on Apple platforms was the System API layer ported to run on top of UNIX, alongside the X Windows integration.
Likewise, anyone doing NeXT development was focused on Objective-C frameworks all the way down to driver kit.
As Application developer on a NeXT, the tune was all about WebObjects, Renderman, EOF.
Applications like Lotus Improv, Wingz, and those being put out by Omni Group were the meat of what people considered using NeXTSTEP for, not BSD command-line utilities.
That was pretty much patent in commercials like the NeXT vs. Sun one.
AIX is definitely UNIX, because it isn't like A/UX, NeXTSTEP, OS X or iOS, where the UNIX layer is there more to bring stuff into the platform, while the main developer stack is something else.
Just like there is nothing UNIX about Gnome or KDE. I don’t really get your point. You try hard but the fact is that macOS has a very strong UNIX underpinning with zero signs that it is going away.
UNIX is not going away, you are the one trying to put words on my typing.
I am talking about the Apple developers culture, from those developers that care about Apple platforms, regardless of what powers the bottom layer of the OS.
UNIX can exist until the end of days at Apple, that is not what matters to Apple application developers.
As for GNOME and KDE, they are Lego pieces on Linux, a fragmented experience where the command line is worshipped; for most users running something else doesn't matter, or they even change environments every couple of days. This is not what Apple culture is about.
I don’t understand what you mean, macOS and iOS share the same kernel. They are both POSIX UNIX.
Also, macOS isn't going to stop being UNIX based; it's XNU/Mach based on BSD, and that's not going to change without an entirely new OS written from scratch.
What's not UNIX about Swift/Xcode? It's just a language / app; what don't I get here?
Do you think they want to abandon Unix and roll their own OS from scratch?
Of course it is not going to stop being UNIX, except for the little detail that POSIX is mostly irrelevant for what is sold in Apple's stores for any kind of their devices.
This is the ecosystem that Apple and developers that buy into Apple ecosystem care about.
Those that were buying Macs to do GNU/Linux work were a welcomed addition in times of need, that is all.
I advise reading books like The Cult of Mac and Folklore.
AppKit apps can and do interact with POSIX APIs all the time. Apple can't just pull the rug out from under them like Google might do with Android apps, for example.
Can you make an application on Linux using POSIX APIs? How about on Android? Your argument seems to be macOS is not a UNIX because POSIX doesn’t specify a GUI toolkit, but this is quite frankly absurd.
What is absurd is the way every UNIX aficionado is trying to turn my words around.
I am talking about the culture of application developers and what Apple developers who have been on the platform since the System days care about, and I keep being told Mac OS X is a UNIX.
Of course it is one, that is not the point being made.
And? The kernel and a couple of shell utilities don't dictate the developer culture: the developers that actually care about the Apple ecosystem, not those that buy Apple hardware as a pretty Linux replacement.
You were arguing that they didn't care about UNIX, which is demonstrably false, and they went to great lengths to get their OS certified; it is the core of their operating systems. This has been pointed out several times now.
You're right from the point of view of a GUI application developer: the UNIX core is somewhat hidden under intermediate layers, but then that's also the case for applications on Linux. Using GTK or Qt, you don't deal with low-level kernel APIs much beyond POSIX either. And you can also do that on macOS, so it's a bit pointless as a purity test.
It seems difficult to argue with a straight face that Apple's developers working on the kernel, low-level layers and system libraries don't care about UNIX: that is their whole job. And as a user, you can have UNIX and decent GUIs.
That is a tiny slice of the OS, and macOS wouldn't be macOS without Objective-C / Swift frameworks, whereas GNU/Linux is still GNU/Linux regardless of what one puts on top of the Linux kernel.
macOS being UNIX was never in discussion, as mentioned it helps sales.
You see that in RF design, where tech is developed with software-defined radios, then moved to FPGAs/gate arrays, and then custom silicon. And the power requirements drop by 95% at each step.
Also microprocessor manufacturers have always pushed back against specialized coprocessors as much as they can. Apple bringing it all in house allows them to nullify that impediment.
>Other applications like language parsing and compiling are likely going to be second-class citizens moving forward, since Apple doesn't care about using their machines for general purpose development regardless of how many of us buy them for being solid Unix platforms.
I've found the M1's 128-bit instructions to be quite fast. My M1 MacBook Air can hit 90 GFLOPS on a single core. My 2019 16" MacBook Pro is only 1.5 times faster at 135 GFLOPS per core, despite double the vector width.
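For context on how such peak numbers are usually reached (a generic sketch, not the poster's actual benchmark): you need several independent FMA chains so the M1's multiple 128-bit NEON pipes aren't serialized on a single accumulator, which is also the interleaving concern raised upthread about the generated NEON.

    #include <arm_neon.h>

    // Four independent accumulators; real peak-FLOPS kernels often use more to
    // fully cover the FMA latency across all pipes.
    float32x4_t fma_burst(float32x4_t a, float32x4_t b, long iters) {
        float32x4_t acc0 = vdupq_n_f32(0.0f), acc1 = vdupq_n_f32(0.0f);
        float32x4_t acc2 = vdupq_n_f32(0.0f), acc3 = vdupq_n_f32(0.0f);
        for (long i = 0; i < iters; ++i) {
            acc0 = vfmaq_f32(acc0, a, b);   // independent chains keep every
            acc1 = vfmaq_f32(acc1, a, b);   // NEON FMA unit busy each cycle
            acc2 = vfmaq_f32(acc2, a, b);
            acc3 = vfmaq_f32(acc3, a, b);
        }
        return vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
    }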
I think it is important to point out that this is the worst ARM computer Apple will ever ship. I am guessing that these laptops are nowhere near comparable in price.
Brand new, a 2017 15-inch Macbook Pro was $1999. The 2020 13-inch Macbook Pro is $1299 with 8GB/256GB. (A used 2017 Macbook Pro is around $500-800 on eBay.)
Idly speculating, if this architecture favours characteristics of higher-level languages (I'm thinking of the widely-reported measurements of primitives used in automatic memory management), relatively disfavours straight-line branchless SIMD-ready streaming algorithms, and is also shipped with a matrix-friendly neural coprocessor... could that invite a change in the types of programs that perform the best?
That is, is it possible that good algorithms nicely structured and straightforwardly written in higher-level languages with good separation of concerns might actually get the most benefit? Or is this a daydream?
Apple is in a good position to create a software-hardware symbiosis to ensure their compilers work flawlessly on their hardware and are more efficient than any other combination of hardware and software. If they ensured that combo worked perfectly with the best practice for their languages, well that's a winning combo.
I'm a little bit surprised that Apple hasn't moved beyond 128-bit NEON, either with a custom 256 bit extension, or with an early implementation of SVE. This seems to be an area where vertical integration of software makes transitions easier -- getting Accelerate.framework over to something wider would have immediate pay-off, and especially at higher power budgets it seems clear from looking at the rest of the industry that somewhat wider vectors is a good perf to area trade-off.
Unlikely that there would be any big gains on the mobile side, and this being their first desktop chip, they probably didn’t want to rock the boat.
Apple nailed this, unbelievably. I have an M1 MacBook and it’s truly black magic. It’s already the fastest computer I’ve ever used. I’d rather have stability during the Rosetta 2 stage than added complexity and bugs by introducing a new bleeding edge instruction set.
I suppose, but couldn't the same be said for the early 64-bit transition on Arm for Apple? They don't seem scared about biting off complexity today when they know they'll need it eventually.
The other side of this is that we're now in a window where performance-sensitive developers are motivated to start looking at NEON and adding NEON support to existing SIMD code. Which is fine... but if SVE is the long-term answer and SVE will (as its goal) support better performance scaling of SIMD code with future processors, it seems like there's a real motivation for Apple to push developers into doing things "the right way" as early as possible.
Assuming (!) that SVE is indeed the future, missing it for the first few generations is shades of 32-bit-Intel-for-Apple -- a stop-gap with some remarkably long-tailed support costs, compared to jumping straight to x64.
M1’s big cores already match Intel and AMD consumer SKUs in flops/cycle... x86 pretty much had to go with wider ALUs because it was cheaper than widening instruction decode.
(As a side-note, can someone who has explored SVE in more detail than I comment on whether it's possible for a big.LITTLE-style implementation to use different vector widths on the big and little cores? One obvious challenge of going to wider vectors on something like the M1 is that you either pay the cost (area and power) for the capability on the small cores, or you lock threads that use those capabilities to big cores, neither of which is ideal. If you could have 128-bit SVE registers on little and 512-bit on big, it would help with that; but it's not clear to me how complex it would be to allow an SVE instruction to be interrupted and restart it on a core with a different vector width. I guess in extremis one could emulate the remainder of the instruction before restart, but eww.)
Right. But if you take an interrupt after two of those three operations, and you want to restart on the big core, what do you do? Disallow interrupts in split operations? Unwind the partially-performed op? Allow the big core to run 128-bit ops to complete this one after restart? Something else?
Micro-ops, while requiring intermediate microarchitectural state, don't produce inconsistent architectural state when interrupted. The instruction either retired (produced its effects) or didn't, from the POV of the ISA (and thus the kernel and user code).
But I thought that the SVE extension allowed interrupt and restart of an instruction? The alternative seems like it would lead to potentially very high interrupt latency?
I don't know about SVE, just some general principles/thought: you either postpone the serving of the interrupt until the current instruction is done or you throw away progress made in the current instruction.
My educated guess is that the latter is most frequently used as a technique.
Also traps are caused by the very instruction being executed so you cannot complete the current instruction in all circumstances before vectoring to the trap/interrupt handler.
They probably don’t want to go beyond the standard. It could lead to compatibility and support headaches and accusations of “embrace extend extinguish” later. Not sure but I seem to recall the ARM licensing terms discouraging this.
But those don't conflict with the spec. They can be ignored and it's just ARM64. An ahead-of-release implementation of new vector instructions could result in actual incompatibility with official ARM64.
Do you have a source that they don't already support some sort of SVE? I don't think they would make this public information if they don't have to. As far as I can tell they wouldn't need to recompile any previous ARM binaries so this could remain under the hood and they would never need to tell you whether or not they're doing it. So I'd be interested if you have any source on this.
I don't think anyone has found any SVE being used by Apple's binaries. Sure, it could be there, hidden, but that would in practice mean that applications can't use it regardless.
Are there any good real-world applications that take heavy advantage of SIMD? I imagine it would be very prolific given the benefits offered by SIMD, but I honestly have no idea.
This is something that I've been torn on with all of the M1 benchmarks. All of the benchmarks that are saying "the M1 is so much better than my Intel machine at video work" are all taking advantage of hardware video encode / decode blocks in the M1 (and unified memory between the GPU and video codecs).
Discounting their existence is entirely unfair, as one of the whole points of Apple Silicon is to give Apple the opportunity to put whatever hardware into their computers that accelerates the use cases they envision for them. Dedicated hardware is way more power efficient than software implementations.
However, what happens if you work in a video codec that Apple didn't build in hardware support for? Software video codecs depend heavily on SIMD instructions to be performant.
In the first place, using the hardware encoder is only feasible if the output is up to your quality/size standards and is compatible with the decoders that are going to consume your content. If your goal is to quickly render near-lossless mp4/mkv files for uploading to youtube, any regular old hardware encoder is probably fine. If your goal is to render out 6000kbps footage to store on your own CDN, the quality per bit becomes EXTREMELY IMPORTANT and suddenly it may not be feasible to use a particular hardware encoder.
FWIW, NVIDIA has made significant improvements to quality for their hardware encoders in each of their last 3 generations, and you definitely saw reviewers and creatives talking about that in particular when it came to purchasing decisions.
Apple's encoder is probably quite good at least, but I don't think it's meaningful to consider it for most benchmarks. The scenarios where you both are willing to use the hardware encoder and care about how fast it is are relatively few and far between - if you're just doing a zoom call all that matters is whether it can pump out 60fps and how good it looks, not whether it uses 3% cpu instead of 5%. I'd rather see quality/bitrate comparisons of their encoder with x264, not benchmarks.
x264 and x265 on my M1 Mac mini perform at least as well as my i9 16" MBP with the same settings. Neither are using any of the hardware acceleration available to either CPU. The M1 also does well with FCP which is cool but the software encoding with the above tools is really impressive.
In the real world, though, anything but the lowest-end hardware will have cryptographic offloads either in the CPU or the storage controller (or both). The M1 actually excels at AES throughput, for instance.
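Those CPU offloads are exposed as ordinary instructions on ARMv8; here is a sketch of a single AES round using the crypto-extension intrinsics (illustrative only, not a full AES implementation):

    #include <arm_neon.h>   // build with the crypto extension, e.g. -march=armv8-a+crypto

    // AESE = AddRoundKey + SubBytes + ShiftRows; AESMC = MixColumns.
    // A real AES-128 encryption runs 10 rounds with an expanded key schedule.
    uint8x16_t aes_round(uint8x16_t block, uint8x16_t round_key) {
        return vaesmcq_u8(vaeseq_u8(block, round_key));
    }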
For esoteric/custom crypto it could play a part, though, but you have to have good reasons not to want to use standard crypto at higher speed for it to be your use case, which is why I say it'd be uncommon.
ripgrep, and to a lesser extent, GNU grep both do. Whenever you run a query and it seems to execute very quickly, it's almost certainly because of SIMD. GNU grep will use SIMD somehow in many patterns. ripgrep uses it in even more.
Does GNU grep actually make explicit use of SIMD via intrinsics or assembly, or just through autovectorization and/or calling libc methods like memchr that are vectorized under the covers?
Yeah, I was being a bit succinct. As far as I'm aware, GNU grep has no explicit SIMD in it other than memchr through glibc. While some libc implementations utilize auto-vectorization of sorts (musl comes to mind), glibc does have Assembly implementations of memchr for several platforms that do indeed make use of SIMD explicitly.
ripgrep does the same, except for Intel at least, its memchr is implemented in Rust using SIMD intrinsics explicitly. And it also has a specialized SIMD algorithm (taken from Hyperscan mostly) for dealing with multiple patterns: https://github.com/BurntSushi/aho-corasick/tree/8b479a60906d...
Hyperscan takes this to a different level though. It has oodles more SIMD. I should have mentioned it in my original comment.
I was curious mostly because I never recalled any SIMD intrinsics in GNU code (ok, probably GIMP has them, so maybe I should say GNU utilities), so that would be a first.
It's interesting how much stuff leans on memchr, shame there aren't systematic wider versions taking more bytes to avoid false positives for longer literals (ignoring wmemchr): these could be nice and fast with SIMD.
Yeah I think the wider versions get a lot more complicated. memchr is a bit of a sweet spot, since its implementation is relatively simple. And things like glibc end up implementing specialized versions of it for most architectures _and_ instruction set extensions. (So e.g., there's one for SSE2 and for AVX on x86_64.)
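For anyone curious what that SSE2 specialization boils down to, here's the core idea in intrinsics (a simplified sketch; real implementations add alignment handling and page-boundary care):

    #include <emmintrin.h>   // SSE2
    #include <cstddef>
    #include <cstdint>

    const void* memchr_sse2_sketch(const void* s, int c, std::size_t n) {
        const uint8_t* p = static_cast<const uint8_t*>(s);
        const __m128i needle = _mm_set1_epi8(static_cast<char>(c));
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16) {          // 16 bytes per iteration
            __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
            if (mask != 0)                      // lowest set bit = first match
                return p + i + __builtin_ctz(mask);
        }
        for (; i < n; ++i)                      // scalar tail
            if (p[i] == static_cast<uint8_t>(c)) return p + i;
        return nullptr;
    }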
And then of course there's PCMPESTRI (and its variants), but that has largely been a failure because of its high latency. :-( That's a shame, because that instruction does accept substrings up to 16 bytes.
Yeah, I had some kind of brain fart, thinking say a 4-character memchr() could be just as fast using the native method, but no, of course it's 4x as slow (only an "aligned" memchr() like wmemchr() works like that). So yeah, it starts to get quite complicated if you want it to be fast.
Raytracing: ray/BVH traversal, bbox intersection, and ray-primitive intersection can all be SIMD-fied fairly well (very well up to 4-wide, fairly well up to 8-wide, depending on how things are packed).
SIMD is big in computer graphics; you'd expect it to be used heavily in physics simulation and CPU-based rendering engines. It would be interesting to see how Blender's Cycles or, say, Arnold Renderer performs on the M1.
It's very common in image and audio processing algorithms, and overall lots of general-purpose libraries used in software will use SIMD instruction sets. Even things like memcpy or String.IndexOf are vectorized in modern runtimes. IIRC Facebook released a very carefully tuned hashmap that uses SIMD instructions to do many of its search operations.
Jpegs (libjpegturbo), video decode/encode (libavcodec), encryption/decryption, ZFS filesystem, any workload that does high throughput. SIMD is the next generation of computing, without it we would have very very slow computers. Technologies like HEVC don't work well without SIMD, they're designed around it. Take a look into ffmpeg and see the tremendous amount of hand written simd assembly for various platforms.
All of them? I'd need a source on that. Also, how do you think these algorithms became popular in the first place? If implementing an algorithm around SIMD didn't give proper performance, then they wouldn't exist.
You have to understand that current software exists because it solved a goal. If you first have to get your algorithm implemented in hardware, then nobody can make anything. Without SIMD these projects simply wouldn't be able to exist. That's the hard math of it.
Anything that does number crunching can benefit from SIMD (just look at pretty much any modern compiler output on godbolt, icache be damned).
The tradeoff with using it is basically between instruction density, power (AVX-512), and latency (GPUs are seriously powerful, but getting the data going takes time and a lot of driver bullying).
If neural net inference is a small part of a larger computation, doing it locally (in cache) on the CPU (with AVX-512 instructions) can be a big win (for example, https://NN-512.com)
Something which I think isn’t made clear enough here is that the point of comparison is a 45W part. A comparison with a 10-15W Intel part might be more interesting.
With my JSON library parsing directly to C++ data structures, I was seeing an M1 mini getting perf as fast as a 16" MB Pro with i9. It was quite nice to see.
The updated benchmark is about what I'd expect, they have similar top end SIMD capabilities given that both are effectively able to compute 512 bits of SIMD per cycle.
Intel: 2x256 = 512
M1: 4x128 = 512
I'd expect Zen3 would beat them both pretty easily given that it has 4x256, as would any of the Intel chips with 2x512.
Intel had better be widening its execution engine; they have focused on MHz for far too long.
I wonder how many use cases will be impacted. I can imagine specific use cases like video encoding switching to the built-in co-processors. Which part of usage remains that is hampered by this? Is there a killer reason/use case for which I should stick to AMD? Also, it would probably be relatively trivial to add 256-bit instructions in the next iteration of the chip.
In this case, Dr. Lemire is using software he co-wrote to test the performance of the M1. It makes sense that he'd be focussing on that software (simdjson), which wrings as much speed out of the processor as possible by using vector instructions and avoiding branches. His coauthor's blog, IIRC, is even called https://branchfree.org, to indicate the importance of avoiding branch mispredictions to keep the CPU going quickly.
>Yet I was criticized for making the following remark:
>In some respect, the Apple M1 chip is far inferior to my older Intel processor. The Intel processor has nifty 256-bit SIMD instructions. The Apple chip has nothing of the sort as part of its main CPU. So I could easily come up with examples that make the M1 look bad.
>I am not saying that the Apple M1 is not great. It is.
Per this thread it does seem like they were severely overstating the problem: https://news.ycombinator.com/item?id=25409535 (tl;dr, their benchmark is likely running _under rosetta_. NEON with native code is far more competitive).
Yes. The whole Apple M1 chip has turned up the volume of fanboyism. I think it would be better to wait a little longer for things to settle before any serious discussions.
I think this benchmark deserves a little drill-down on how the ARM vs. Intel compilers implement their SIMD output. If the M1 lacks 256-bit SIMD, what exactly is being measured here?
What's measured is how a SIMD-optimised routine differs between AVX and NEON, under the assumption that most of the difference would come down to the difference between 256b (AVX) and 128b (NEON) SIMD. In a previous post[0], lemire confirmed that NEON was competitive with SSE (which is also 128b) comparing older µarch (Intel's Skylake versus Apple's A12).
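To make the width difference concrete, here's the same toy operation in both ISAs (a hypothetical byte-add, not simdjson's actual kernels): one AVX2 instruction covers 32 bytes, while NEON needs two 16-byte instructions for the same work.

    #include <cstdint>
    #if defined(__AVX2__)
    #include <immintrin.h>
    void add32(uint8_t* dst, const uint8_t* a, const uint8_t* b) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst),
                            _mm256_add_epi8(va, vb));                       // 32 bytes per op
    }
    #elif defined(__ARM_NEON)
    #include <arm_neon.h>
    void add32(uint8_t* dst, const uint8_t* a, const uint8_t* b) {
        vst1q_u8(dst,      vaddq_u8(vld1q_u8(a),      vld1q_u8(b)));        // bytes 0-15
        vst1q_u8(dst + 16, vaddq_u8(vld1q_u8(a + 16), vld1q_u8(b + 16)));   // bytes 16-31
    }
    #endif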
I'm not quite sure that SIMD is a relevant benchmark for comparing the processors. Wouldn't the GPU and/or neural engine take care of SIMD on the ARM Macbook?
You can still benefit from AVX for short workloads; the latency of AVX instructions is no higher than that of other instructions.
Running things on some accelerator (gpu, etc.) usually involves writing a specific kernel in a language subset, manually copying data and generally long latencies. Unless there is a lot of data it won't be faster.
With AVX in the best case the compiler can just vectorize some loop, speeding it up 5x without any added latency or source code changes.
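For example, a loop like the one below (a generic saxpy, nothing project-specific) is typically auto-vectorized to AVX by clang or gcc at -O3 with the appropriate -m flags, with no source changes:

    // Compiles to vmulps/vaddps (or a vfmadd with FMA) over 8 floats at a time under AVX.
    void saxpy(float* y, const float* x, float a, int n) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }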
SIMD and, e.g., NVIDIA warps are not the same. I don't know about Apple's GPU, but for example there is no GPU alternative to the SQRTPD instruction (square root of double precision). Also, when there is branch divergence across threads, CPUs still do a much better job than GPUs.
Curious to think about how unified memory may change the flops-per-memory-access ratio at which it makes sense to shift a job from the CPU (better for low ratios) to the GPU (better for high ratios).
This issue was due to Apple's system for checking for expired developer certificates and for malware. Their OCSP server failed, among other things, which is why some apps couldn't launch.
I use mostly free/open source software for web development on macOS; the outage didn't affect me at all.
Your anecdotal survivorship story was, indeed, rather unhelpful when I had previously linked news articles discussing the widespread loss of productivity associated with this concern.
Apple has a great incentive to continue allowing general computing: a significant portion of their users will stop buying their computers if they do this.
Almost every developer I know will stop using macs if macs start actually preventing them from running whatever software they want.
The more closed it gets, the more customers they will lose. Right now macs aren't closed at all, so honestly the customers they've lost so far (yourself included) are... faint-of-heart? Excessively sensitive? Not sure what the best way to phrase this is. You've effectively stopped buying macs for something that could happen, not something that has actually happened. I would find it incredibly surprising if you install enough unsigned GUI apps for that whole warning + have-to-open-it-with-right-clicking thing to be a dealbreaker for you on its own. And if it was, they could've lost you at any time because that seems like incredibly fickle consumer behavior.
An actual closing of the platform though will be a watershed moment.
Yes, I'm agreeing that Apple will/is driving away discriminating customers.
Apple is well aware that most of its recent customer growth, and almost all its future customer growth, is in iOS and similar walled gardens.
The Mac users are proving to happily accept greater restrictions on ability so long as their preferred tools continue to work. You state as much yourself: running unsigned software is surprising to you; as though that should be considered abnormal behaviour. Consumers of Mac products will happily accept greater restrictions if they can be convinced it brings quality; whether or not it is successful in doing so.
Apple can't just increase market share in the PC market, they have to do it with a walled garden? Pretty cool that you know where future growth is gonna come from before it happens.
Again, and for the final time, there has not been any restriction on ability. You can still run the things you've always been able to run. You just have to go through a warning first. That is not a restriction on ability. You know Windows does similar now, right?
Running unsigned software is not surprising to me at all. Running so much unsigned GUI software that it becomes restrictively annoying... is absolutely surprising. I guarantee you I run a lot more unsigned GUI apps than the vast majority of users, and it's still such a minor inconvenience I barely notice.
I didn't say they _have_ to use a walled garden; I'm claiming that they know how lucrative that approach is and how submissive their consumer base is, and so are likely to choose that path to pad their revenue.
It's not even a particularly bold prediction; it is completely in line with their whole product trajectory.
You will accept OSX turning into iOS, and call it innovative, powerful and unburdened; because that's what Apple will call it.
Does using one obscure application of SIMD even count as a benchmark? It could just be that the library is specifically optimized for x86. Run something like a BLAS matrix multiplication, which could use the full power of SIMD on both platforms (I don't have access to an M1 MacBook).
As ARM chips proliferate we will see NEON or GPU/neural-engine APIs adopted in lieu of the SIMD from the x86 side. Until now ARM just wasn't relevant; given the ubiquity of x86, and that the x86 family of chips has been the desktop standard since the 80s, of course there will be hardware that takes advantage of this.
I wonder what Final Cut and other software suites tailored for the new Macs are doing under the covers, as they seem to perform very quickly. Also, this isn't really all that relevant, as cryptographic validation is required more and more on the server side of things, whereas Apple doesn't have a SKU for the server market.
The GPU/neural stuff is basically nothing else but SIMD, just in slightly different guise and running on different hardware.
On a PC it is the difference between MMX/AVX (CPU) and shaders/CUDA/"compute" (GPU); on the M1, which is an all-in-one SoC, it is just a different part of the chip being used, optimized for somewhat different tasks.
SIMD is only a technical term for one type of parallel computation - single instruction, multiple data. It is not some sort of Intel-specific magic technology (and the term far predates Intel's support for it, going back to vector processing on Crays and such).
Ahh yeah, sorry, "API" was the wrong word here. I knew it's a type of CPU instruction set. Of course NEON is specific to ARM, so x86 won't get it. I didn't know the neural engine is basically SIMD, though.