Apple Silicon M1 Emulating x86 Is Still Faster Than Every Other Mac (macrumors.com)
337 points by syrusakbary 11 months ago | 290 comments



Everyone here seems to be making this solely about the processor, which it is partially, but the other story is that Rosetta 2 is looking really good. Obviously there’s plenty more to test beyond a GeekBench benchmark, but hitting nearly 80% of native performance shows both the benefits of ahead-of-time transpiling and the sophistication of Apple's implementation.

I’ll still be waiting to see what latency-sensitive performance is like (specifically audio plugins) but this is halfway to addressing my biggest concern about buying an M-series Mac while the software market is still finding its footing.


As some others have mentioned, Rosetta uses AOT translation, in contrast to Microsoft's emulation approach.

I think the difference in approach comes down to motive: I get the impression that Microsoft wanted developers to write ARM apps and publish them to its Store, and made emulated programs less attractive by virtue of their poorer performance.

Apple, on the other hand, is keen to get rid of Intel as quickly as it can, so it had to make the transition as seamless as possible.


Microsoft calls their approach emulation, but if you read the details you see they are also translating instructions and caching the result (AOT translating).

https://docs.microsoft.com/en-us/windows/uwp/porting/apps-on...

Their current implementation only supports 32-bit code; x64 translation is still underway. It is not known how well x64 translated code will perform relative to native code.


I am similarly curious about Rosetta 2, but there seems to be very little marketing, let alone technical information, being made available. All I can figure is that it performs user-mode emulation, similar to what QEMU can do, and does not cover some of the newer instructions.

Off topic, my job has been in virtualisation for the last 12 years, thus I am very familiar with the publicly available body of research on this topic. Ahead of time binary translation has been a niche area at best.


My understanding from WWDC is that it does AOT compilation at install time when possible. If an app marks a page executable at runtime (such as a JIT) it will interpret that.

Newer instructions are still encumbered by patents.


What do these instructions do that someone working on their own couldn't figure out? I am sick of patents lodged just because particular engineers got there first. When you finally save up the money to do your own research, you often learn that your idea has already been patented. It just made my work more expensive, as I had to spend time finding a way around the patent. Such a monopoly on ideas should not be legal.


> patents lodged just because particular engineers got there first

That is what patents are. They encourage you to actually get there by giving you a time-limited monopoly on the implementation. It is easy to say "a freshman CS student could have figured this out", but if they did, they could have had the patent instead and licensed it to Intel and Apple. Instead, Intel and Apple had to figure it out on their own.


That is what patents attempt to be. There are many reasons that make them less effective at this than their ideal.

In particular, costs are high, litigation to protect is expensive, and so your average student wouldn't be able to afford this. The fact that software is typically shipped virtually means that borders are practically non-existent, and wide patents are often needed, or a company needs to give up on defending their patent outside their primary market.

To patent an idea will probably be around $10-100k per market. To cover US, EU, and large Asian markets, you're looking at $500k-1m, and that's just to get the patent. Then you'll need to defend it, which can be hard to do against entities based in non-compliant countries such as China.

This all means that unless you're defending the very core of your entire business proposition, you probably need to be a >$100m company before it's worth pursuing patents, and even for the core of your business you probably need to have several million in funding.


I don't think a freshman CS student could afford to apply for a patent. This is reserved for big companies or engineers backed by wealth. You also often see things patented that nobody thought about patenting because they are so obvious. It's just a mechanism to gatekeep and secure profits for the big guys. I have a friend who has multiple patents, and their investor only agreed to put money into their project if they patented it and shared all revenue. The investor was not actually interested in the product itself, apart from the likelihood that the patents would bring in money. The project is now dead, but the patents remain, blocking anyone from trying the same idea.


Actually, speaking from experience, you can do this yourself (or mostly yourself with some guidance from professionals). Nolo Press has a book ("Patent It Yourself") on the topic.


"If an app marks a page executable at runtime (such as a JIT) it will interpret that."

Was that from a presentation on Rosetta2 that I have missed? It certainly makes sense, but also you'd need to watch for writes to executable pages that have been already AOT translated.


JITs usually mark +x pages -w again after they've written them for sanity, so watching in platform libc mprotect could do it, but you could also force them to be -w anyway and then maybe use the page fault handler. How much integration with the kernel does it have?


JavaScript is the extreme example here (probably no one is going to run Chrome or Firefox in Rosetta 2), since it re-JITs very often (I think Firefox starts by interpreting JS code, and then replaces hot code paths with JIT-compiled equivalents when they become available).

I also see issues with self-modifying code, but this is very rare (I knew some .NET code that was injecting/manipulating its own JIT code to get around C# private/protected encapsulation).


> no one is going to run Chrome or Firefox in Rosetta2 probably

Browsers themselves, no, but there are many apps out there based on browser tech (e.g. Electron apps). There are also many apps shipping with their own JVM or other JIT engine.

Anecdote incoming: my pet project is based on Electron. I am currently building a Mac x86 version but don't plan to ship an ARM version (since testing it without an actual physical Mac is going to be even more difficult, if not impossible).


> but also you'd need to watch for writes to executable pages that have been already AOT translated.

Wouldn't executable pages (translated or not) normally be mapped as read-only, ensuring the processor faults on any attempt to modify them?


Ah, the question is whether Darwin enforces W^X, because if so, the problem becomes very easy. Then you re-translate whenever a page becomes executable [again].
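
To make the re-translate-on-executable idea concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not a description of how Rosetta 2 works: the `TranslationCache` class, its method names, and the stand-in `translate` callback are all made up for illustration, and page protection is reduced to two mutually exclusive states.

```python
class TranslationCache:
    """Toy W^X model: a page is either writable ("w") or executable ("x")."""

    def __init__(self, translate):
        self._translate = translate      # hypothetical guest -> host translator
        self._contents = {}              # page number -> guest bytes
        self._protection = {}            # page number -> "w" or "x"
        self._translated = {}            # page number -> translated code
        self.translation_count = 0

    def write(self, page, data):
        # Under W^X an executable page cannot be written, so this would fault.
        if self._protection.get(page) == "x":
            raise PermissionError(f"page {page} is executable, not writable")
        self._protection[page] = "w"
        self._contents[page] = data

    def protect(self, page, protection):
        if protection not in ("w", "x"):
            raise ValueError("protection must be 'w' or 'x'")
        self._protection[page] = protection
        if protection == "x":
            # Page just became executable: (re-)translate its current contents.
            self._translated[page] = self._translate(self._contents.get(page, b""))
            self.translation_count += 1
        else:
            # Page became writable: any earlier translation may go stale.
            self._translated.pop(page, None)

    def execute(self, page):
        if self._protection.get(page) != "x":
            raise PermissionError(f"page {page} is not executable")
        return self._translated[page]


# Stand-in "translator": upper-casing bytes plays the role of x86 -> ARM.
cache = TranslationCache(translate=bytes.upper)
cache.write(0, b"mov eax, 1")
cache.protect(0, "x")
first = cache.execute(0)

cache.protect(0, "w")          # a JIT flips the page back to writable...
cache.write(0, b"mov eax, 2")  # ...patches it...
cache.protect(0, "x")          # ...and makes it executable again
second = cache.execute(0)
```

Because writes and execution never overlap in this model, the only event the translator has to watch is the transition to executable; no write to an already translated page can go unnoticed.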


On Arm 64-bit Darwin, W^X is enforced at all times.


Rosetta 2 is an AOT translation, not emulation.

https://developer.apple.com/documentation/apple_silicon/abou...


Understood, I called this "ahead of time binary translation" in my post above, and I know it would use JIT-based emulation when necessary.


It seems to me that the correct term is "Static Binary Translation" (SBT) for what you call "ahead of time binary translation".

And the correct term for "JIT-based emulation" is "Dynamic Binary Translation" (DBT).

At least these are the terms you should use if you want to find some literature on this subject.

We're not talking about JIT or AOT compiler because it's not really a compilation (compilation is translating to a lower level language).

I think a lot of people talk about JIT rather than DBT because the JIT term is better known, and there is confusion when Apple says they do "Dynamic translation for JITs", which means that they do DBT to handle applications that use JIT.

Edit: So Rosetta 2 does both: SBT and DBT.


You are correct, static binary translation is what Rosetta does first. That, however, is what I called niche technology in another post; most of the research so far has focused on dynamic binary translation.

Furthermore, SBT, even for user mode binaries, can rarely reach the performance levels that we see with Rosetta2. There are many issues in determining what is code, where are the branch destinations in case of indirect branches, etc. What we have here is certainly a feat of engineering on its own.


> There are many issues in determining what is code, where are the branch destinations in case of indirect branches, etc.

Yes, handling indirect branches seems a bit complex, and I'm not a specialist in the field. But I'm pretty sure that indirect branches are rare enough that an additional indirection is relatively inexpensive. Adding a simple address mapping table should cover most cases.
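
As a rough illustration of what such an address mapping table could look like, here is a small Python sketch. Everything in it is hypothetical (the `BranchTable` class, the guest addresses, and the `translate_on_miss` callback); it only shows the idea of routing indirect branches through a lookup and falling back to dynamic translation when the static pass never saw the target.

```python
class BranchTable:
    """Maps guest (x86) code addresses to already translated host routines."""

    def __init__(self, translate_on_miss):
        self._host_for_guest = {}
        self._translate_on_miss = translate_on_miss  # dynamic fallback
        self.misses = 0

    def register(self, guest_addr, host_routine):
        # Called by the static translator for every entry point it discovered.
        self._host_for_guest[guest_addr] = host_routine

    def indirect_branch(self, guest_addr, *args):
        # The extra indirection: look up the guest target before jumping.
        routine = self._host_for_guest.get(guest_addr)
        if routine is None:
            # Target unknown to the static pass: translate now and remember it.
            self.misses += 1
            routine = self._translate_on_miss(guest_addr)
            self._host_for_guest[guest_addr] = routine
        return routine(*args)


def translate_on_miss(guest_addr):
    # Stand-in for a dynamic translator; returns a trivial "translated" routine.
    return lambda x: x * 2


table = BranchTable(translate_on_miss)
table.register(0x1000, lambda x: x + 1)      # found by the static translation

known = table.indirect_branch(0x1000, 41)    # hits the table
unknown = table.indirect_branch(0x2000, 21)  # misses once, then gets cached
again = table.indirect_branch(0x2000, 21)    # now a hit
```

The cost on the common path is one dictionary lookup per indirect branch, which is the "additional indirection" in question; only the first jump to an undiscovered target pays for translation.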

An interesting question would also be whether Apple has added features to the hardware to improve the translation?

We know, for example, that Apple introduced a special register [1] to temporarily switch from the ARM consistency model to the TSO consistency model (Total Store Order) from x86.

[1] : https://github.com/saagarjha/TSOEnabler
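
For anyone wondering why the consistency model matters, here is a small Python sketch of the classic "message passing" litmus test. It is a deliberately simplified model, not a real memory-model simulator: the weak case simply allows each thread's two independent accesses to be reordered, and plain program order stands in for TSO only because this test has no store followed by a load in the same thread (the one reordering TSO permits). All names are made up for illustration.

```python
from itertools import permutations

# Shared variables start at 0. The writer publishes data, then sets a flag.
WRITER = [("store", "data", 1), ("store", "flag", 1)]
# The reader checks the flag, then reads the data.
READER = [("load", "flag", "r1"), ("load", "data", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one global schedule and return the reader's (r1, r2)."""
    memory = {"data": 0, "flag": 0}
    registers = {}
    for op, variable, arg in schedule:
        if op == "store":
            memory[variable] = arg
        else:
            registers[arg] = memory[variable]
    return registers["r1"], registers["r2"]


def outcomes(allow_reordering):
    """Collect all (r1, r2) results, optionally reordering within a thread."""
    writer_orders = list(permutations(WRITER)) if allow_reordering else [tuple(WRITER)]
    reader_orders = list(permutations(READER)) if allow_reordering else [tuple(READER)]
    results = set()
    for writer in writer_orders:
        for reader in reader_orders:
            for schedule in interleavings(list(writer), list(reader)):
                results.add(run(schedule))
    return results


in_order = outcomes(allow_reordering=False)  # what x86 code may assume here
reordered = outcomes(allow_reordering=True)  # what a weaker model may produce
```

The result `(1, 0)`, meaning the reader saw the flag but stale data, shows up only in `reordered`. Translated x86 code that relies on never seeing it would need barriers on a weakly ordered machine, which is the cost a hardware TSO mode avoids.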


C++ code with virtuals is basically all indirect branches for method calls.


That is marketing terminology (because "emulation is slow"). Full static transpiling is not a solvable problem - you can't actually take an x86 app, run it through some converter, and get an ARM app out. It's just not a thing and it never will be (without cheating and, like, literally embedding an emulator in the app).

Anything less than that is emulation, and requires dynamic elements. All modern emulators use JIT, and caching the result is similar to AoT translation; plus JIT can be faster than AoT sometimes due to being able to take advantage of runtime profiling, and you can never guarantee ~full AoT translation of even binaries without self-modifying code without additional metadata (like a list of all branch destinations), so Rosetta cannot possibly claim it does that with full coverage. On top of that you need to add a level of indirection to all indirect branches, as you cannot statically change all function pointers in data structures (that's an even harder problem). At that point you're adding enough bookkeeping gunk to the translated code that it is no longer a straight translation, like Apple would want you to believe. JIT is binary translation too, so by Apple marketing standards, qemu, Dolphin, and basically every other modern emulator is also "translation". Which is just not useful.

So everyone saying that "Rosetta 2 is AoT translation" as if that means it's fundamentally better/faster than other emulation technologies is just falling for marketing.

Whatever you call it, it's not fundamentally different from any other emulator in a way that puts it in another class of technology. It is not straight converting x86 to ARM. That's just not a thing and it never will be. The end result is that the CPU is going to be executing a series of translated basic blocks interspersed with code added by the translation to glue everything together, which is the same thing every JIT-based emulator does, and will have the same performance characteristics, and the fact that some of that work can be done ahead of time is not a fundamental difference.

If you want to look for reasons why Rosetta 2 is faster than other emulators, look for places where Apple cheated and made their CPUs implement x86 things like its memory consistency model. That can have massive gains. I bet if you port a decent JIT-based emulator to use that feature on M1, and compare it to Rosetta 2 for number crunching inner loops and such, you'll find you can get very similar performance numbers out of it once the JIT cache is warm.

It'll be interesting when people take a deep dive into specific things Rosetta 2 does.


Why would they market it? They will market the result--performance. Rosetta is just for geeks to appreciate, laypersons don't care about that stuff, all they want to know is whether the end result is faster or slower, as that's what affects them.


I have had a passing interest in virtualization. Do you mind sharing how you got your job?


>specifically audio plugins

FWIW Chris Randall from Audio Damage posted a quick note saying performance of plugins under Rosetta was basically comparable with Intel: https://www.audiodamage.com/blogs/news/a-quick-note-about-ap...

I've read tweets from other plugin developers saying similar things, so preliminary feedback seems quite positive!


That doesn't match with what we know about Rosetta 2. Rosetta 2 can't run processes with mixed architectures, so ARM hosts can only run in-process ARM plug-ins, and x86 hosts can only run in-process x86 plug-ins. Apple's AUv3 plug-in architecture is out-of-process, so you can mix those, but there is no way you can mix ARM and x86 VSTs, for example, without specific work by hosts to provide an out-of-process shim translation layer.

Either he's talking about AUv3 specifically, or the hosts he tested already are doing out-of-process wrapping, or Rosetta 2 is actually magic (AFAICT this isn't a generally solvable problem at that layer), or he's confused.


Why aren't programs distributed by the Mac App Store pre-translated once and the translation downloaded to the Mac?

They'd still have to be run under rosetta2 (because programs can write code and branch to it) but a lot of computation could be done once rather than every time.


Not sure, but my educated guess is that, as they improve Rosetta 2, it will be more efficient to do this on demand than to pre-translate every app on the store every time Rosetta 2 is updated.

Even if that means the users have to retranslate, that's still essentially "free" (to Apple) distributed compute.


Apps installed via the MAS and via packages are pre-translated (at install time).


So is this a little bit like Android apps being translated into Java Byte Code at install?


It’s the other way around: an intermediate state of compilation is uploaded (basically the internal state after all non-machine-specific optimizations have been done, like hoisting invariants out of loops, etc). The App Store finishes compiling with the flags for the various CPUs supported (iPhone 6, iPhone 12, etc).


Translation of most apps doesn't take a significant amount of time, and having it pretranslated would mean shipping code that was not signed by the app developer.


But doesn't the App Store already re-sign application code with Apple's own key so they can do stuff like this? Or is that only on iOS?


iOS definitely does this, optimizing for the different supported devices. I hope they start doing that with future Mx Macs.


I hope they'll release Rosetta 2 as an open source project.


The fact that it's marginally faster is almost irrelevant.

What really matters is that it's using about 1/3 the power of a comparable Intel chip while doing it, and that Intel routinely has a 60%+ quarterly profit margin (probably much higher on PC chips) that Apple now gets to keep.


TSMC, which manufactures Apple's chips, presumably at a profit, has a net margin of ~40%.

https://www.tsmc.com/uploadfile/ir/quarterly/2020/3PTwU/E/3Q...


It is a lot harder, technically and politically, for Apple to replace the TSMC part of the value chain. Apple would lose a lot of brand value if it directly employed cheap Asian labour.

They could in theory move the production to the U.S., but without the ecosystem of talent, suppliers and partners that Taiwan or China has, it would be hard.

They already spent 10+ years improving their iPad chips, so switching made sense, especially given Intel's lack of a reasonable roadmap. The effort to integrate the TSMC part of the chain does not have the same value given the risks today.


TSMC is a foundry; there are very few companies worldwide with the capacity to do the fab work that TSMC does, and Apple is definitely not one of them.

Are you confusing TSMC with Foxconn or the other parts assemblers that Apple uses? Because for Apple to roll out their own fab infrastructure would be a massive undertaking with very little benefit.


There are a lot of inefficiencies in the companies that fab chips. (If you ever need an example of how consolidation / lack of competition leads to less than optimal output, semiconductor fabrication is it.) Stuff takes on the order of hours or days that should instead take a minute or two, tops. Information that should take seconds to access and be available at any time is instead put together manually. So much of what would appear to be institutional knowledge at these places is little more than cargo cult voodoo that may or may not have any significant impact on the product. There are dumb mistakes (and subsequent efforts to cover them up) by a workforce whose primary motivations are centered around making it through the day without getting in trouble, without giving away how little they understand, and without appearing as stupid as they feel.

Honestly, if Apple or Google or any other company with top-tier engineering wanted to get into the game and were serious about it, they could be competitive in a < 10-year time frame, and easily making laps around the competition past that.


This seemed obvious to me at first, but less the more I think of it. Samsung, for instance. Apple is already using 100% of 5nm capacity, so they are basically leasing an entire fab. The technology curve for fabs will bend in the direction of lower capital investment, smaller batches, and shorter turnaround, not smaller dimensions.


Are you thinking of Foxconn? I don't think anything you said makes sense for TSMC.


Will these chips - or similar ones - be available to buy for others than apple, too?


Splitting hairs here, but Intel’s 60% profit margin has nothing to do with what Apple “keeps” by making their own chips.

It could be dramatically more expensive to make an M1 chip and Apple is making the trade off for performance/walled garden security.

Unlikely, but possible


Your point is theoretically valid, but considering the M1 is essentially a rebranded A14Z chip they would have had to make for the iPad anyway, the marginal R&D spent on the M1 should be epsilon.


It still might cost Apple more to manufacture an M1 than it costs Intel to make an i5/i7/etc. Just because they are now their own chip supplier doesn’t necessarily mean that it will always be the cheaper option.

But Apple didn’t switch chips for the cost. Just like the PPC to Intel switch, they did it for the performance and efficiency. More efficient means smaller, thinner, less battery, etc... more design flexibility. That and the fact that they fully control their release cycle is also a nice bonus.


Are you comparing the cost to manufacture the M1 to what it costs Intel to make chips? I have no claim on that, but that’s not what I interpreted from the above post. The relevant point of comparison is what Apple can negotiate with Intel for its chips versus the M1's total cost.

Please also note that a modern non-M1 Mac already included TSMC silicon (the T2) of a similar order of complexity (A10-class), albeit on a larger node, which is no longer present.


> Are you comparing the cost to manufacture M1 to what costs Intel to make chips

Me personally? I have no way of knowing or estimating those costs either way. But that was how I interpreted the parent comment.

If you allow that R&D costs for the M1 are largely aligned with the A* series for iPad that they would have already been paying (from your comment), then you're left with the manufacturing costs.

For Intel, they have: 1) chip design R&D, 2) fab method R&D, 3) fab costs, 4) sales/marketing/profit margin

By switching to their own chips, Apple has absorbed #1 and #4, and left #2 and #3 to be outsourced to TSMC. If the main costs for #1 are duplicated from the iPad chip, then you're left with fab costs (and fab R&D).

But, again, even with the likely cost benefits of making your own chips (and avoiding the Intel profit margin), I doubt that was the deciding factor in switching. The costs probably made the switch possible, but wouldn't have been the primary factor. Apple would gladly pay more per chip if it was more efficient (heat and power) and if they could control the release schedule. Even if their total costs for a single M1 chip are more than for an Intel chip, they probably still would have made the switch.


Another poster noted TSMC net profit margin is 40% and gross is over 50%.

Apple has only taken over the design of the chip, not the manufacturing; Intel was doing both for Apple. Nobody has any data to know how the manufacturing costs have changed.

There is a real chance that total costs for Apple have increased overall, and there is no way to know. Intel was in a difficult position over the Apple deal for years; Apple would have negotiated prices down significantly given its threat to move.

For all we know, Intel was selling at a loss per chip to Apple and making it up in other business.


It is in the organisation's memory. They encountered the same issue when PowerPC could not keep up with Intel, and they waited more than 2 years before the switch programme came. I still remember those Motorola chips and then the PowerPC chips: expensive as always, but underpowered no matter what you spent. Not so with Intel. I still have 2 Mac Pro 2009s and they still work fine (except that I need to stay on High Sierra for my 1080 Ti card). They simply have to move.

The vertical integration of it all is frightening, though. It has not ended well in many industries. You need competition, and being on ARM with a single GPU now means that all the fierce PC and GPU competition among Intel / AMD / Nvidia is just figures to you. You do not get insight and innovation by looking at how good or bad the other side is.


This could be true for the M1, but is unlikely to be true for the M2/M1X/M1Pro or whatever they call it.

There are some limitations: lack of PCI-E lanes, lack of off-package RAM support, lack of user-replaceable RAM support, lack of discrete graphics support, lack of Thunderbolt 4 support. These are all essentially non-performance features of the chips that need to be developed to bring them up to feature parity with the Intel chips.

This generation happens to have carefully bypassed the need for those features, a smart move by Apple for the first release, but later releases for the higher end MacBook Pros, the iMac/iMac Pro, and the Mac Pro, will need most/all of these features, and those will likely take significant time to work into their architecture.


AMD has been making similar inroads with Zen on the laptop and desktop space[1].

[1] https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...


Intel can distribute its overhead costs over a lot more CPUs than Apple, and Apple pays TSMC for making the chips, so, presumably, a large fraction of the marginal profits go to TSMC (probably not all of it, as I think Apple paid for building production capacity up front)


>Intel can distribute its overhead costs over a lot more CPUs than Apple

Purely from a CPU / SoC "units shipped" perspective, Apple currently ships more CPU units than Intel, both on the leading edge and in total volume.


Thanks! Somehow, my mind only was on desktop CPUs.

I couldn’t find hard numbers, but it seems that it isn’t a given that, adding all CPU power globally, there is more CPU power in servers than in smartphones.


I wonder how this is possible. I imagine it's partly due to manufacturing improvements (14nm vs 5nm), but how can Intel fall so far behind in only a couple of years? I know they were one of the earliest investors in ASML’s EUV. Why couldn’t they push for smaller nodes? Is it because they were milking their current nodes too far? I saw them marketing freaking TECs to cool down their 500W CPU gaming rigs on LTT and der8auer recently.

What made TSMC so successful? Is it primarily thanks to their business strategy? Or did Intel do something so wrong they tumbled down this far?

I know that Intel’s 10nm is closer to TSMC's 7nm, but still, their competition is coming up with interesting and relevant technologies while Intel is like a junkyard of half-baked ideas. 5G modems? Arduino competitor? Vaporware GPUs since Larrabee? Claims of dominance in NN accelerators with nothing solid? Nirvana? Optane 600 series garbage SSDs? Stupid desktop computing form factor ideas? I can go on...

I don’t hate Intel or root for any other company. I’m just trying to understand how incompetence like this happens in companies.


"What made TSMC so successful? Is it primarily thanks to their business strategy?"

Basically they are riding the new wave of cheap devices that outnumber x86 devices by 10x to 20x.

Basically everything uses an ARM CPU these days, not just tablets and phones, but microwaves, TVs, projectors, refrigerators, ovens, 3D printers...

That makes those devices extremely cheap at volume and makes innovation happen faster than at a single company like Intel, which was not interested in those low-margin products.

Intel is far from incompetent; they just decided to take advantage of their monopoly position to reap as big profits and margins as they could get for the longest possible time, instead of cannibalizing themselves with lower margins.

And it was great for them. Their executives have done great. They have just ruled the semiconductor industry and wanted to enjoy it.


I suspect what Intel missed was that firms would be selling phones in very large volume for c. $1000. That price includes enough margin to spend quite a lot on the SoC and associated research.

When the iPhone launched Steve Ballmer laughed [1] at the price and pushed a $99 competitor with MS software. The phone market was very, very different before the iPhone got real traction.

[1] https://www.youtube.com/watch?v=eywi0h_Y5_U


> Basically everything uses an ARM CPU these days, not just tablets and phones, but microwaves, TVs, projectors, refrigerators, ovens, 3D printers...

Most of these things are not on bleeding edge 5nm or 7nm process, though. Most microcontrollers are more like 90nm (e.g. STM32 up to F7 is 90nm; STM32H7 is 40nm... many smaller micros of the M0 variety are even 180nm...)

Basically, if you're not video processing, or a real computing device, or something power sensitive, 90nm is still a pretty sweet place to be-- ~$300k for a mask set, easy to have 5V tolerant I/O if that's something you need, high likelihood of common I/O and core voltage, &c.


Plus, if you are doing microcontrollers (Cortex-M) and not microprocessors (Cortex-A), the lower leakage current of >40 nm nodes is interesting. Batteries for infrared remotes last years, not days.


It's funny because this seems like a textbook case of the innovator's dilemma (from Clayton Christensen) in a nutshell - what worked for Intel was just working so well, that cannibalizing it with something new didn't make sense - until it was too late.


Ben Thompson on the latest Exponent[1]: "When you start out a company, you're walking around and you have complete freedom of movement. And then you get some processes in place, you get better at things, and now you're riding bike. And then you're driving a car. And eventually at some point, the best, most efficient companies are like bullet trains. They're so much faster than anybody else and so much more powerful and so much more efficient, it's like "how can I compete with that?" Well it's actually quite straightforward how you compete with it, you go somewhere there are no bullet train tracks."

[1] https://exponent.fm/episode-190-intel-apple-disruption-and-d...



Intel have tried time and time and time again to get away from x86; some of their efforts have been underwhelming (the i960) while others were genuinely radical and innovative (the iAXP 432), and others were at least interesting (the Itanium).


They tried a couple of times, but I wonder whether they tried hard enough. Admittedly, the Itanium's failure was partially AMD's fault, since AMD breathed new life into x86 with its 64-bit extensions. But besides a slow start, Intel didn't seem to be in a hurry to push Itanium down the line and offer, for example, a cheap CPU+motherboard combo for enthusiasts. While Itanium for a while had a relatively large transistor count, by today's standards it is tiny; any smartphone CPU is way larger. The last Itanium was made on a 32nm process; imagine it on today's 10nm. Especially with markets shifting and more computing moving to the cloud, Itanium-based servers could be really strong, if Intel would just make them.

Also, Intel declined to make a CPU for the iPhone, dropped their own ARM line, and didn't get into fabbing for other companies when they still had a large lead in fab technology.

I do get the impression that Intel was far too happy selling x86 chips. That worked and gave them lots of revenue, until they got stuck with the 10nm process, and of course while TSMC grew into the power it is today.


And they made ARM devices for a while.


Yeah, ditching their ARM line looks very foolish now, doesn't it?


IIRC this is also exactly what happened with IBM and original PowerPC Macs way back in 2005 that prompted the switch to Intel and x86.

Funny how 15 years out it's the exact opposite now.


> Intel is far from incompetent, they just decided to get advantage of their monopoly position to reap as big profits and margins as they could get for the longest possible time

That sounds exactly like an incompetent strategy: being lazy and ignoring possible competitors.


Nothing incompetent about making a fat profit. We'd like to imagine that companies should always innovate as hard as possible, but it's not always the winning strategy.


I think the point that they were trying to make is that sabotaging long term profits by maximizing short term profits is less profitable over the long term. Which would make it incompetent for the company, but because of earlier cashing out, possibly not incompetent for the specific people making those decisions.


We'd like to think that long term strategies are always better, but again that's not always true. Sometimes it's better to realise some profits now.


But Intel saw it coming when they were left out of the mobile market and ARM had been advancing rapidly for over a decade, not to mention that x86 is quite old by now.


That doesn't necessarily mean that there was a better path available to them. If Intel moved away from x86, a lot of their strategic advantages would disappear; they'd be one player among many, and they wouldn't have the vertical integration advantages of people actually making full ARM-based systems. Meanwhile there's plenty of money to be made from x86 for decades yet, from clients that put a high value on backward compatibility (which Apple don't, not in the same way - after all, they've done this twice already).


>If Intel moved away from x86, a lot of their strategic advantages would disappear; they'd be one player among many

They are in fact one player among many. Their strategic advantages have evaporated. The entire mobile space passed them by and now AMD is seriously threatening their x86 business. Something has clearly gone very wrong at Intel. Shareholders can't be happy about that.


> The entire mobile space passed them by

True, but only a small minority of companies that went into that space made money on it. Any Intel mobile division could very easily have been the next Blackberry or Nokia. Staying out of that fight may well have been the best decision for their shareholders.

> AMD is seriously threatening their x86 business.

If they lose the profitable server segment to AMD then that's a serious problem, I'd agree with that. That's far from settled though.


I'm in no position to say, but how about just creating a division to develop ARM-based CPUs and take some part of the pie while keeping the x86 cash flow coming?


The conglomerate discount exists for a reason. People who want to invest in ARM know where to find it; it makes sense for Intel to stick to their strategic advantages.


Intel's current fab troubles are simply inexcusable. There are some factors that can account for part of the problem, but at this point Intel is 5+ years late on delivering a usable, profitable successor to their 14nm process. And 14nm got off to a slow, rocky start too. Intel's fab business has been horribly mismanaged, and the CPU design business has been forced to believe fab roadmaps that don't have any credibility.


Perhaps it is because all the latest Intel fabs are in the U.S., in Hillsboro, Oregon. Do other foundries like TSMC benefit from the ecosystem and cheaper costs in East Asia? I.e. they can afford to make more mistakes than Intel can if it is cheaper to do so.


I don't think cost of labor is a big factor here. Intel has no trouble maintaining a large enough workforce. They continually decided not to have parallel teams designing processors for their unproven 10nm and their successful 14nm nodes (or one team making a relatively portable design), even years after it was clear that 10nm was not going to work out as well as needed by their processor design roadmap. That wasn't for lack of staffing or inability to afford enough engineers. It was management hubris.

(On the other hand, I've often pointed out that Intel's attempts to develop two microarchitectures in parallel have always failed in the long run, with one project ending up woefully uncompetitive.)


Intel (or AMD for that matter who are using TSMC) isn't falling behind, it's just that geekbench is completely unrepresentative of real world performance across architectures, as Linus points out here[0] due to the test including hardware accelerated tasks that benefit specifically these modern arm chips.

[0] https://www.realworldtech.com/forum/?threadid=136526&curpost...


>There’s been a lot of criticism about more common benchmark suites such as GeekBench, but frankly I've found these concerns or arguments to be quite unfounded. The only factual differences between workloads in SPEC and workloads in GB5 is that the latter has less outlier tests which are memory-heavy, meaning it’s more of a CPU benchmark whereas SPEC has more tendency towards CPU+DRAM.

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...


That's 7 years old


still relevant


In Geekbench5, they benchmark html5, SQLite reads, pdf rendering, text rendering etc. Seems somewhat relevant. They are providing an upper bound on this activity, and that’s good to know.

https://www.geekbench.com/doc/geekbench5-cpu-workloads.pdf


Still relevant to people running Geekbench 3, unlike this topic.


No idea but I suspect the "unified" on chip memory is very very quick.

Some friends and I were BSing about the "pro" level parts: if you can graft 2 or 4 M1s together, use off chip RAM and then treat that onboard 16GB like cache? We're talking about some game changing stuff.


The Xeon Phi had up to 16GB of fast memory in the same package as the main die. IIRC, it could be used as memory or as cache for external memory (which was much slower).

If Apple integrates two more memory chips, it'll be able to power a pretty solid desktop or laptop.

On the performance, Rosetta is most likely doing JIT so that most of the time it's running native ARM code. It did this with PPC binaries and DEC had it for Alpha.


As noted elsewhere, Rosetta doesn't JIT unless the AOT transpilation lets it down. Most apps are statically transpiled at installation time...
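
To make the emulation-versus-AOT distinction concrete, here is a toy sketch in Python. It is not how Rosetta works internally (nothing in this thread documents that); the three-opcode "guest" instruction set and the function names are invented for illustration. The only point is that translating once, up front, removes the per-instruction decode step from every later run.

```python
# Toy illustration only: a made-up guest ISA, not x86 and not Rosetta.
from typing import Callable, Dict, List, Tuple

Instr = Tuple[str, str, int]          # (opcode, register, immediate)
Regs = Dict[str, int]

def interpret(program: List[Instr], regs: Regs) -> Regs:
    """Emulation: decode every instruction each time the program runs."""
    for op, reg, imm in program:
        if op == "mov":
            regs[reg] = imm
        elif op == "add":
            regs[reg] = regs.get(reg, 0) + imm
        elif op == "mul":
            regs[reg] = regs.get(reg, 0) * imm
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return regs

def translate_ahead_of_time(program: List[Instr]) -> Callable[[Regs], Regs]:
    """AOT: decode once, emit host (here: Python) source, compile it once."""
    templates = {
        "mov": "    regs[{r!r}] = {i}",
        "add": "    regs[{r!r}] = regs.get({r!r}, 0) + {i}",
        "mul": "    regs[{r!r}] = regs.get({r!r}, 0) * {i}",
    }
    lines = ["def translated(regs):"]
    for op, reg, imm in program:
        if op not in templates:
            raise ValueError(f"unknown opcode {op!r}")
        lines.append(templates[op].format(r=reg, i=imm))
    lines.append("    return regs")
    namespace: dict = {}
    exec("\n".join(lines), namespace)   # the one-time "install step"
    return namespace["translated"]

program = [("mov", "a", 2), ("add", "a", 3), ("mul", "a", 4)]
native = translate_ahead_of_time(program)

# Both paths compute (2 + 3) * 4; only the translated one skips decoding.
assert interpret(program, {}) == native({}) == {"a": 20}
```

A real translator would fall back to something like `interpret` (or a JIT) for code it cannot see at install time, such as code generated at runtime.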


That's (AOT transpilation) quite the interesting approach to the problem. No wonder it's so fast.


We could also entertain the idea that if they would find some instruction particularly hard to emulate - they could have added new instructions on their own chip to cover it.


> on chip memory is very very quick

It is not on-chip memory, the dies are separate, they're just in the same package. They seem to use standard LPDDR4 connectivity, so I don't think it's actually faster. The "unified" bit seems to matter more: having a single address space for both CPU & GPU, but this is pure speculation. I don't know if AMD or Intel APUs do this too.


The fact they are on the same package means that the electrical signals have a lot less far to travel from memory to cpu, and therefore you don’t have the signal losses or interference from the board having to route memory lines externally.

As a result you would be able to drive a higher bandwidth because you don’t need to be as limiting with the transfer time of signals.


Or you could use less power for the same speed. Hard to tell what Apple did, without some detailed benchmarks. I suppose one could bench memcpy and derive the clock rate from that.
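
A minimal sketch of the "bench memcpy" idea, in Python for brevity. The function name, buffer size and repeat count are arbitrary choices, and interpreter overhead means the result is only a lower bound on what the hardware can do. Turning the bandwidth into a memory transfer rate would also need the bus width, which is an unknown here.

```python
import time

def copy_bandwidth_gb_s(size_bytes: int = 64 * 1024 * 1024, repeats: int = 5) -> float:
    """Best-of-N bandwidth of a large buffer copy, in GB/s (10**9 bytes/s)."""
    # Non-zero fill so the source pages are really allocated and touched.
    src = bytearray(b"\x01") * size_bytes
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src                  # bulk copy of the whole buffer
        best = min(best, time.perf_counter() - start)
    # A copy reads size_bytes and writes size_bytes, so count the traffic twice.
    return 2 * size_bytes / best / 1e9
```

The buffer should be much larger than the last-level cache, otherwise the number measures cache rather than DRAM. Calling `copy_bandwidth_gb_s()` on two machines gives a crude comparison, nothing more.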


It's unlikely you're going to transfer data any faster - they're using commodity DRAMs like anyone else - they will however be able to save a clock's worth of latency here and there, which is useful


Didn’t Intel have a similar idea with Skylake? Those had very fast albeit smaller eDRAM die glued to the processor. It was dropped on subsequent generations.


It’s actually still surprisingly relevant in terms of performance [1], and I see it as a precursor to the gigantic caches we are seeing in the latest chips.

[1] https://www.anandtech.com/show/16195/a-broadwell-retrospecti...


It worked pretty well but Intel clearly never liked the idea. They only offered it on a couple low-end models even before dropping it.


My impression was that this may have been designed at Apple’s behest; certainly they were the major user. Older than Skylake, btw; Haswell had it.


It might make sense to use a very fast SSD as the main memory and on-chip RAM as cache. Huge amounts of RAM only make sense if your disks are slow or your workload actually needs the whole RAM, which is rare.


I do wonder where Apple will go with the Mac Pro. I guess a lot depends on how well the existing model has been selling (which we don't know).


>> I know they were one of the earliest investors in ASML’s EUV.

They may have been one of the earliest investors in EUV (along with TSMC, by the way), but in terms of adoption and roadmap they have been way behind both TSMC and Samsung. I don't know the exact numbers of machines but my educated guess is that TSMC and Samsung together probably have close to 10x the EUV wafer capacity compared to Intel. And have had it for much longer as well.

The problem Intel created for itself is that they have always had a very stubborn over-confidence in their own knowledge of process technology, and have driven tool manufacturers like ASML to work within Intel's constraints, instead of working together to alleviate them. Their hubris has bitten them now that EUV has become economically viable compared to Intel's process technology that relies heavily on triple and quadruple patterning, and very little of Intel's 'old' process technology knowledge carries over to EUV.

TSMC has also had a lot of teething pains with EUV but they have been very determined to make it work, and that's paying off now.


> I don't know the exact numbers of machines but my educated guess is that TSMC and Samsung together probably have close to 10x the EUV wafer capacity compared to Intel.

There are currently zero EUV wafers from Intel, which means the answer to your question would be close to infinite.


I'm pretty sure Intel has had some EUV tools installed for some time already, they're just not using them for any kind of HVM yet as they are ~3 years behind their own process technology roadmap by now.


> I wonder how this is possible?

Binary translation can work pretty well for user code, especially synthetic benchmarks.


> I mean how can Intel fall so far behind in only a couple years?

Arguably Intel has been falling behind since the delays in replacing Haswell (so, last six years or so). It just hasn’t been particularly visible, as the ARM vendors simply don’t compete in the same spaces, until now.

Though, in what might be an early sign in retrospect, x86 phone chips, after a lacklustre launch, vanished without a trace some years back.


There were some ex-Intel people commenting on a previous thread and they told about a lot of internal politics/fighting between inside groups. It might not be the main reason but part of it.


Read "Innovator's Dilemma"


Promising... but I'd wait for the real world benchmarks first. No one in the real world needs to emulate... geekbench.


Geekbench benchmarks provide evidence of an upper bound, which is useful for determining what is technically possible. Real-world benchmarks instead show what can be realistically observed. The upper bound benchmarks are still useful, just in a different way.

For example: these benchmarks show that under excessive load (as benchmarks do), the M1 is capable of out-performing Intel chips at something like 1/3 the energy usage. No matter how you slice it, that is an impressive achievement and shows what M1 could potentially do for the most perfectly optimized software.

Realistically, of course, software will not attain this. But having the upper bound lets software writers know what to aim for and when it's not worth pushing for extra performance.


Would be a lot worse if performance was 1/10th. That’s what this benchmark shows. Essentially that it’s a lot better than people feared. The question for users like myself is, will things like IntelliJ still be decently fast? Still don’t know what kind of performance we will get from the JVM, but this shows some decent hope!


I imagine it won't take long for something as widely used as the JVM to be compiled directly for the M1 instead of relying on Rosetta. :)


Ironically Microsoft just released a beta arm version of the JVM for Mac.


> Ironically Microsoft just released a beta arm version of the JVM for Mac.

I wasn't aware of this. Can you provide a link?



They have; unfortunately it's a bit crashy in what looks like new, non-platform-specific code.


Does the M1 deviate enough from standard ARM64 to need a recompile?


Oh! Probably not; that's a good point. I don't know why it didn't occur to me that, of course, there's already an ARM-compatible JVM haha. Thanks!


Wow I wonder if these new Macs are going to be the new standard for iOS and Android development, since you can run your apps for both ecosystems natively.


The JVM is a bit lacking on ARM64. I’m still seeing super-basic intrinsics being added to the compilers.


Yes, it would need to be built for Darwin.


How is that result with Windows on ARM laptops?


Geekbench tries to simulate real world situations. HTML and PDF rendering, SQLite, etc

https://www.geekbench.com/doc/geekbench5-cpu-workloads.pdf


Here's the "easily fits in cache" list

* Compress a 2.3MB file that fits in cache.

* Alter a 24MP JPEG (that's around 5MB filesize).

* Gaussian blur of 24MP JPEG

* Gumbo Parser of an HTML file then execute some stuff with duktape -- note: this parses HTML to a simple DOM -- nowhere close to actual rendering

* Text rendering of 1,700 words into a 12MP image

* Horizon detection of 9MP image (that's around 2MB)

* Image repaint of 1MP image (that's around 200KB)

* HDR image (4MP -- around 800KB) from 4 normal images

* Neural Net of tiny 224x224 images

* Navigate using a graph with 200k nodes and 450k edges.

* SQLite is between 0.5MB and 1.1MB depending on the compiler options. Maybe the dataset pushes it out of cache, but I wouldn't bet on them creating many millions of rows.

Not sure, but may not fit in cache

* Google's PDFium render. Not sure about the library itself, but a 200dpi map doesn't sound like anything worth mentioning

* Camera test gives very little in specifics, but with several steps and a handful of libraries, this probably overflows cache a little.

* Ray Tracing 3.6K triangles and 768x768 output. I'd put it elsewhere, but they could be using a huge number of rays (though I seriously doubt it)

Undoubtedly doesn't fit in cache

* Clang compiling 730 LOC (seriously?). Clang is pretty big and most likely doesn't fit in cache.

Zero actual details

* Speech Recognition

* N-Body Simulation

* Rigid Body Simulation

* Face Detection

* Structure from Motion

All in all, them saying these things are "real world" would be a huge overstatement at best. I don't see anything here to contradict Linus' assessment.
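
One distinction worth keeping in mind for the size estimates above is file size versus decoded size. The sketch below assumes 8-bit RGB buffers (3 bytes per pixel); that is an assumption for illustration, not something stated in the Geekbench document, and the helper name is made up.

```python
def decoded_size_mb(megapixels: float, bytes_per_pixel: int = 3) -> float:
    """Size of an uncompressed pixel buffer in MB (10**6 bytes)."""
    return megapixels * bytes_per_pixel

# Image sizes taken from the list above; labels are shorthand.
workloads = [("JPEG / Gaussian blur", 24), ("text rendering", 12),
             ("horizon detection", 9), ("HDR", 4), ("image repaint", 1)]

for name, mp in workloads:
    print(f"{name}: {mp} MP is about {decoded_size_mb(mp):.0f} MB once decoded")
```

Whether a workload stays in cache depends on the buffer the code actually touches, which can be much larger than the compressed file.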


>All in all, them saying these things are "real world" would be a huge overstatement at best.

What your list shows is that caches are huge these days. Because 24MP JPEGs are very much a real world scenario when most cameras sold in the world have half that resolution.

But sure, go ahead and pick another industry standard benchmark of your choice. Let's see how the M1 fares.


ARM already has instructions for improving performance of Javascript [1]. What if Apple added custom ISA extensions to their chips to support efficient x86 emulation? Current evidence seems to suggest that much of the translation is happening statically; a few additional instructions might greatly increase x86 emulation performance if most of the code can be translated instruction-by-instruction.

Also curious to see how Rosetta will work with x86 code whose instruction alignments can't be determined statically.

[1] https://stackoverflow.com/questions/50966676/why-do-arm-chip...


Apple's chips toggle TSO with a proprietary MSR; other than that there aren't really any other custom things going on.


TSO = Total Store Ordering.

(I didn't know what you were talking about at first, had to work it out.)


Is that so that the memory model is equivalent to X86?


Yep, exactly.
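
For anyone wondering why the memory model matters to a translator, here is a small sketch of the classic "message passing" litmus test, enumerated in Python. It is a simplified model, not a description of any real chip: "TSO-like" here just means each thread's stores stay in order and each thread's loads stay in order (real TSO also lets a load pass an earlier store, which this test never exercises), and "weak" means either pair may be reordered when there are no barriers.

```python
from itertools import permutations

# Shared memory starts as data = 0, flag = 0.
WRITER = [("store", "data", 1), ("store", "flag", 1)]
READER = [("load", "flag", "r1"), ("load", "data", "r2")]

def thread_orders(ops, keep_program_order):
    """Orders in which one thread's operations may take effect."""
    if keep_program_order:
        return [tuple(ops)]
    return list(permutations(ops))

def interleavings(a, b):
    """All merges of a and b that keep each sequence's own order."""
    if not a or not b:
        yield tuple(a) + tuple(b)
        return
    for rest in interleavings(a[1:], b):
        yield (a[0],) + rest
    for rest in interleavings(a, b[1:]):
        yield (b[0],) + rest

def outcomes(keep_program_order):
    """Set of (r1, r2) values the reader can observe under the model."""
    results = set()
    for w in thread_orders(WRITER, keep_program_order):
        for r in thread_orders(READER, keep_program_order):
            for schedule in interleavings(w, r):
                memory = {"data": 0, "flag": 0}
                regs = {}
                for kind, location, arg in schedule:
                    if kind == "store":
                        memory[location] = arg
                    else:
                        regs[arg] = memory[location]
                results.add((regs["r1"], regs["r2"]))
    return results

tso_like = outcomes(keep_program_order=True)
weak = outcomes(keep_program_order=False)

# Seeing the flag set but the data still stale is only possible in the weak model.
assert (1, 0) not in tso_like
assert (1, 0) in weak
```

Code compiled for x86 may silently rely on the `(1, 0)` outcome being impossible, so a translator targeting a weaker model would otherwise have to insert barriers around memory accesses, which costs performance.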


I think this is very unlikely as 1. Apple is still almost required to follow the ARM ISA as a result of its licensing agreement 2. They will regard x86 emulation as a short term legacy issue (developers will recompile their Apps for ARM soon) and so why design and build in extra hardware you won't need long term and 3. Rosetta 1 worked fine without this sort of help.


I don't agree that this is so farfetched. As noted in a sibling, Apple has released ISA extensions before. If the architecture is decoding to uops anyway the addition of x86 helper instructions may not affect the architecture much at all (or may even be implemented in microcode). Further, the comparison between Rosetta 1 and 2 may not necessarily apply, as the switch to intel was (arguably) a more substantial perf increase over POWER.


ARM licensing requires any vendor to approach ARM for permission to extend the ISA. ARM may well deny that request and instead make it part of the main ISA instead


Are we sure that applies to Apple in particular?

I presume the exact terms of Apple's license agreement with ARM are not public, so who knows exactly what is in it. It might have different terms from what ARM offers in the general case.


Apple execs had by far the upper hand over ARM execs while negotiating the contract. The entire environment at the time was "we will do whatever apple wants".

Shortsighted really considering there weren't really any other options for instruction sets for apple to choose - all this stuff was signed before RISCV was a thing, and PPC and MIPS had pretty high barriers to entry (lots of porting work, lacking SIMD type instructions, and another migration for all apple devs) and poor performance.


I think 2. is the key point: Apple will expect its developers to produce universal binaries very quickly. Once that has been done the issue has largely gone away. Plus I'm not sure I agree that this is a smaller step up in performance when compared to POWER - certainly same order of magnitude.


True. As sibling noted below it sounds like there's no additional instructions used here.


Apple has already shipped custom instructions; they just aren't using them here.


Is there evidence that custom instructions aren't being used here? Keep in mind that they could have been added but not included in DTKs.


The build of Rosetta that will be used on the M1 has shipped already, I didn't see anything there (it looks largely identical to the DTK version).


I've seen this reported but am a bit sceptical - happy to be proven wrong.



Thanks!


all that "javascript" instruction does is emulate the x86 behaviour, which formed the basis of the javascript spec


People need to stop acting like that instruction is some amazing magical "make JS fast". Literally the only thing it does is to change the rounding mode and sentinel for NaNs.

That's it.
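
For reference, the instruction being discussed is presumably FJCVTZS, and the conversion JavaScript specifies for a double to a 32-bit integer is ToInt32 from the ECMAScript spec. A minimal Python sketch of that conversion follows; the function name is mine, and this models the language-level semantics, not the hardware.

```python
import math

def to_int32(x: float) -> int:
    """ECMAScript-style ToInt32: truncate, wrap modulo 2**32, NaN/Inf give 0."""
    if math.isnan(x) or math.isinf(x):
        return 0
    n = math.trunc(x) % 2**32          # Python's % is non-negative here
    return n - 2**32 if n >= 2**31 else n

assert to_int32(3.7) == 3
assert to_int32(-3.7) == -3
assert to_int32(2.0**32 + 5) == 5          # out of range wraps instead of saturating
assert to_int32(2.0**31) == -2**31
assert to_int32(float("nan")) == 0
```

Without a dedicated instruction, the out-of-range and NaN cases need extra checks after an ordinary float-to-int conversion, which is the overhead the instruction removes.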


One thing I’m unclear on still is the windows VM situation.

Is it just that “right now” we can’t run a windows VM but with some work by Apple/Microsoft/Parallels it can happen or is there some fundamental blocker here?

I thought there was an ARM version of Windows already?


Parallels is planning an ARM release[1]. This won’t run Intel arch Windows, but rather they hint that it will run Windows for ARM64. They also mention Microsoft’s recent announcement that Windows for ARM will soon have the ability to run x64 (Intel arch) applications.

[1] https://www.parallels.com/blogs/parallels-desktop-apple-sili...


If you read that link, there is no hint at anything to do with Windows on there at all. It's quite a short press release, it makes no mention of what might be run inside parallels. I would expect linux to run, but Windows would be complete speculation.

The link does mention an early access program.


> Windows for ARM will soon have the ability to run x64 (Intel arch) applications.

I remember this was promised a long time ago. I'm surprised it's not the case.


I remember Intel "warning" that x86 emulation is riddled with IP/patents - now that they are falling behind I fully expect them to turn to milking competition by patent trolling - sort of like Microsoft back in the Android vs Windows Phone days.


The basic x64 patents were close to expiring at the time and may have expired by now. I suspect they were waiting for that to happen before releasing 64 bit support.


Those are weasel words. They are not saying parallels will run windows ARM. Read carefully and see what they're actually saying. There is not even a hint that parallels will run windows ARM on M1 macs, but the sequence of sentences creates the illusion of one.

Parallels, like vmware, is depending on microsoft here. If microsoft doesn't play ball there will be no windows on apple silicon macs. Currently there can be no officially supported way of running windows ARM on M1 macs, as the windows license does not allow it.


Windows ARM is not available to OEMs.

Parallels is trying emulation on M1 but who knows how well that will work. https://www.parallels.com/blogs/parallels-desktop-apple-sili...


> Windows ARM is not available to OEMs.

Not "officially", but you can get hold of it as an individual and install it on e.g. a raspberry pi relatively easily already. Running it in a VM might be doable.

Actual Windows 10 ARM too, not the restrictive IoT version that's officially supported on the pi.

https://www.windowslatest.com/2020/01/27/windows-10-arm-on-r...


In the article, I don’t see mention of Windows x64 running as a guest. They do hint at Windows for ARM, though.


There is no hint of anything to do with windows in the article whatsoever.


"Parallels is also amazed by the news from Microsoft about adding support of x64 applications in Windows on ARM."


I had the impression that virtualization was not possible on the current M1, as I sadly discovered that docker didn’t and won’t run on current silicon. Is it still possible for parallels to move forward without virtualization?


Virtualization is possible on the M1; it is not possible on the Developer Transition Kit, which uses an A12Z chip without virtualization extensions.

It can only virtualize ARM, of course.


That’s great news! I was disappointed since I read the discussion on github. I clearly misunderstood. Thanks!


I wouldn’t really expect x86 virtualisation instructions to be supported under Rosetta.

You’re going to need an ARM version of Docker, running ARM Linux kernels, and ARM userland Linux binaries.


It is possible, but docker is not yet ported. The docker images you run will have to be ARM images, as docker cannot emulate x86. Docker supports multi-arch images with their buildx infrastructure, so this situation will fix itself over time. Most of the official base images already support ARM.


Virtualisation within the x86 emulation isn't possible. So you can't just run the x86 build of parallels and have it work. But it should be possible to build a new dedicated ARM version of parallels.


It is possible


VMware also commented on trying to work on it. https://mobile.twitter.com/VMwareFusion/status/1326229094648...


A lot depends on Microsoft and my guess is they would prefer to offer Windows in the cloud for Apple users.


"Every other Mac" is kind of funny when there is a 28-core monster in the line.

In any case, I bet the M3 will be able to run rings around the current MacPro.


I'm an old dog trying to learn a new trick. Old dog says Nvidia GPU with 8+GB of RAM and thousands of cores is powerful. New trick says M1 with max 8 cores and ??? amount of RAM is more powerful??? My head hurts trying to come up with how. Where is the magic happening that makes 6K+ video run in real time in Resolve, when an Intel CPU with multiple GPUs needs a good wind at its back going downhill? Of the many hand-wavy demo details left out, what kind of video are they using: MP4 video, RAW video, ProRes, etc.? Is Resolve running in realtime when just playing back the video, but immediately choking when you apply a single node with a light grade? The M1 release videos are too PR-speak for me


• NVIDIA is on a dumpster fire cheap “Samsung 8nm” process that is quite possibly the worst <=14nm process especially when it comes to power and heat.

• Apple has the benefits of complete vertical integration, both on the hardware and software side.

• Neural engine is essentially Tensor Cores in NVIDIA’s GPU but occupies at least 4x the equivalent die area as tensor cores (no public details on performance yet).

• NVIDIA doesn’t want to make their consumer GPUs too powerful on tensor operations in order to not cannibalise their 1000% markup ML cards.

It’s almost a classic Intel: financial greed and financial engineering, combined with complacency from being on top for so long.

Heck, AMD’s new top end card is tied with the 3090 - but $500 cheaper.

NVIDIA is reportedly scrambling to try and get back into TSMC who is going to make an example out of them.


Wait, you're comparing M1's neural engine with Ampere's tensorcores? Apple is talking TOPS, not TFLOPS, that means M1's 11 TOPs are for inference only, not training. By contrast, a tensorcore on Ampere does ~20 TFLOPS FP64, 78 TFLOPS FP16/BF16, and with GPU boost clock, up to 312 TFLOPs @ BF16.

If you want to talk TOPS, an A100 does up to 1.2 exa-ops INT8, and 2.4 exa-tops INT4. That is, an A100 is more than 1000 times more powerful than the M1 at inferencing, while also supporting up to FP32/FP64 weights, and a 3080 is more than 500 times more powerful than the M1 neural engine.


An A100 isn't a consumer GPU; a more valid comparison would be the RTX 3070.


Even an RTX 2060 beats it, the 2060 has 52 TFLOPs worth of tensor core ops (not counting the CUDA cores), and it has >100 TOPs @ INT8, or 200+ TOPs @ INT4, so more than 10x the M1.


How far an electron can travel in a ns is a crazy limiter in computer science.


It's actually very forgiving. For high frequency signal traces on a PCB there is this rule of thumb that your trace lengths must not differ by more than 15cm per ns because that is how far electrons travel in one nanosecond. If you have a 10GHz signal that means you have a 0.01ns budget which translates to a 1.5mm max length difference between traces. Matching within 1.5mm is easy to do by hand.

Now let's talk about actual processors instead of doing an analogy. 15cm per ns is more than enough to travel everywhere inside your CPU but the vast majority of logic is localized (usually signals stay in the same core). It's only awful if you go off package to DRAM or a second socket but then the budget is often higher than 1ns. Apple probably scores a lot of performance points here because the RAM is so close to the CPU.
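
The rule-of-thumb arithmetic above can be written out as a small sketch. The 15 cm/ns figure is the one quoted in the comment; the skew budget of one tenth of a clock period is my assumption, chosen because a 10GHz signal has a 0.1ns period and the comment uses a 0.01ns budget. The function name is made up.

```python
def max_length_mismatch_mm(frequency_ghz: float,
                           skew_fraction: float = 0.1,
                           speed_cm_per_ns: float = 15.0) -> float:
    """Largest trace-length difference (mm) that stays inside the skew budget."""
    period_ns = 1.0 / frequency_ghz              # 10 GHz -> 0.1 ns
    skew_budget_ns = period_ns * skew_fraction   # assumed share of the period
    return skew_budget_ns * speed_cm_per_ns * 10.0   # cm -> mm

# 10 GHz with a tenth-of-a-period budget gives roughly 1.5 mm.
mismatch = max_length_mismatch_mm(10.0)
```

Halving the frequency doubles the allowed mismatch, which is why the rule is forgiving for slower buses.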


It's not. The limit is how fast you can switch transistors. Electrons can go a pretty long distance inside a clock cycle, and waiting multiple clock cycles for signals to go from core to core is normal and easy.


Technically electrons barely move at all (on clock-cycle timescales); it's the electric field that moves at ~2/3 speed of light. Compare eg, wind speed (particles) versus the speed of sound (field).


> Technically electrons barely move at all (on clock-cycle timescales); it's the electric field that moves at ~2/3 speed of light.

Yes.

> Compare eg, wind speed (particles) versus the speed of sound (field).

Huh? The speed of particles in the air is even faster than the speed of sound.


Aggregate wind speed, as in "5 knots southeasterly" vs. the speed of sound in that vicinity, which would still be ~340 m/s


They're comparing single-core perf


Indeed this is the number for single-core performance. A previous benchmark[1] showed a MacBook Air with M1 getting 1687 on the single-core test and 7433 on multi-core. Compared to a 4-month-old kitted out iMac (1251 & 9014)[2] this is 35% faster on single-core and 17% slower on multi-core (8 cores on M1, 10 cores / 20 threads on the i9).

Getting a better result via Rosetta 2 on the single-threaded benchmark is very impressive. I assumed that their benchmark included some vector instructions (which as far as I understand Rosetta 2 does not emulate) so this would mean that these higher numbers are for general-purpose instructions vs vector. That said, I can't find any reference to AVX or SSE in the "Geekbench 5 CPU Workloads" document, although the one for Geekbench 4 does mention them. It would be interesting to see numbers for Geekbench 4 on M1, if it runs via Rosetta at all. I would imagine that the tool is able to detect what is supported to run optimized code for each CPU being evaluated.

[1] https://www.macrumors.com/2020/11/11/m1-macbook-air-first-be...

[2] https://browser.geekbench.com/macs/imac-27-inch-retina-mid-2...
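
The percentages can be checked directly from the four scores quoted above; the helper name is just for illustration.

```python
def percent_change(new: float, baseline: float) -> float:
    """Relative difference of new versus baseline, in percent."""
    return (new / baseline - 1.0) * 100.0

m1_single, m1_multi = 1687, 7433        # MacBook Air M1, from [1]
imac_single, imac_multi = 1251, 9014    # iMac, from [2]

single = percent_change(m1_single, imac_single)   # roughly +35
multi = percent_change(m1_multi, imac_multi)      # roughly -17.5

print(f"single-core: {single:+.1f}%  multi-core: {multi:+.1f}%")
```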


It is more than likely that Geekbench will use CPUID to establish what it is running on, and then adjust its kernels to suit the host CPU architecture. Rosetta2 would be using a CPUID that excludes the AVX and any other Intel ISA extensions that it does not support, therefore, Geekbench should be running without issues on the fastest vector extension supported, most likely SSE3 or SSE4.x.
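
A hypothetical sketch of that kind of dispatch, in Python. The feature names, the preference order and the idea that the translated environment simply omits AVX from what it reports are assumptions taken from the comment above, not from Geekbench's or Rosetta's actual code.

```python
# Widest vector extension first; purely illustrative ordering.
PREFERENCE = ["avx2", "avx", "sse4_2", "sse4_1", "ssse3", "sse3", "sse2"]

def pick_kernel(reported_features: set) -> str:
    """Return the best code path the CPU claims to support."""
    for feature in PREFERENCE:
        if feature in reported_features:
            return feature
    return "scalar"

native_intel = {"sse2", "sse3", "ssse3", "sse4_1", "sse4_2", "avx", "avx2"}
translated = native_intel - {"avx", "avx2"}   # assumed: AVX not reported

assert pick_kernel(native_intel) == "avx2"
assert pick_kernel(translated) == "sse4_2"
```

The benchmark never needs to know it is being translated; it just sees a CPU that reports a narrower feature set and picks the matching path.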


Right, that's also my guess. What I meant was that their document describing the v4 tests[1] mentions SSE and AVX, while the v5 version[2] does not. Does it mean that the v5 tests don't cover them? I would find it surprising for a benchmarking tool but they really were mentioned before and now they're not, hence my question.

If indeed v5 tests are not particularly about specialized performance but more about real-world use cases (e.g. PDF rendering, SQLite, image compression as others have mentioned) then running Geekbench v4 on M1+Rosetta would give a comparison of emulated x86 without vector instructions versus native x86 with all of its modern capabilities. Now if the M1 wins that

[1] https://www.geekbench.com/doc/geekbench4-cpu-workloads.pdf

[2] https://www.geekbench.com/doc/geekbench5-cpu-workloads.pdf


If the M1's emulated x86 performance is this good, why didn't Apple showcase this sort of comparison at their launch announcement? This seems like something that would have assuaged concerns about the apparent tradeoff of switching silicon. That is, it seems there is very little downside, especially if the M1 is more power/heat efficient.

A comparison like this would have demonstrated this point more definitively than the many graphs Apple showed that were conspicuously lacking axis labels.


Because they want to encourage developers to update their apps for the new architecture as a top priority.


They would never compare it to a Mac they are still selling.


As another commented, they would prefer native apps. If emulation were so good it may cause many developers to just keep putting development of the native apps off.

However, the cynical side is that Apple is having difficulty getting developers on board. Apple is notorious for not encouraging game developers to be on OS X, to the point that the common refrain in the Mac community has been to "buy a different system to game on". Many people don't have that luxury or, like me, don't want two systems on their desk when one should be sufficient.

That M1 debut was notable for one reason too many overlooked: they had very few developers showing off their wares, and most of those they did have are not well known.


What kind of design choices are they making that can get the power down so low? What's Intel doing that hasn't been optimized away after generations?


I saw an interview with the designers of ARM and they explained how their design was ridiculously more efficient than competitors. Don’t have a link.

Another good source of info is the article over at anandtech. Apple is doing a lot of stuff that Intel can’t because of x86. And also a bunch of stuff that neither Intel nor AMD want to do.

Also don’t forget that Apple is on TSMC 5nm while Intel is for the most part on Intel 14nm which is more or less equivalent to TSMC 10nm.


Thanks for the tip! I thought power consumption issues increase with decreasing size due to leakage current, though a change in the finFET design can help manage that.


Yes, so per area there will be more power consumption. But you need less area so if the transistor count stays the same you save.


Everything is very, very close together. LPDDR and very short trace lengths. Very few high-power PHYs. Lots of on-die cache.

Conway's Law playing out, really.


Two things afaik:

* Decoding an x86 instruction takes a ridiculous amount of resources. Can't be optimized away because of backwards compatibility.

* Limited to small pages without jumping through weird OS hoops.


- it takes some die space sure but no x86 processors are actually limited by instruction decoding (except, iirc, the first generation xeon phi in some cases)

- huge pages don't exactly require weird OS hoops although i agree the 4kb→2mb→1gb page sizes are inconvenient


Maybe it’s because Apple/ARM can ditch all legacy x86 stuff so the chips only contain instructions relevant for today, saving space for more transistors to do these relevant instructions or just to keep the chip smaller and more efficient


Underneath the x86/x64 ISA is a really different architecture. So to reach the backwards compatibility it needs a x86/x64 decoder and that is not a significant portion of the silicon as far as I know. Those instructions might be somewhat inefficient, but this is not really an issue when comparing, any ARM chip emulating will have to tackle the same problem.

If Apple's chip does not need this backwards compatibility, they essentially have to do it in software when emulating. They do this with a combination of recompilation at install time and some emulation at runtime. I suspect the compilation step is one of the main contributors to this boost in speed and efficiency. Windows could do the same, optimize the executable to better fit the underlying chip.


Is this really the end of the road for Intel? If you're a PC gamer who uses Windows or a Linux enthusiast, you're probably running Ryzen which blows Intel out of the water in terms of value and cores (and recently IPC as well), and if you're on Mac obviously it's pointless buying an Intel product unless you really need the nVidia dedicated graphics or are buying a Mac Pro for a specific A/V workflow.

It's insane how quickly they fell behind.


Not yet. Intel fabs have a higher capacity, but if they don't get it right, Intel is going to be in the water.


With how much they've been struggling with 7nm, doesn't look positive. Zen 3 is already 7nm, a significant leap forward and TSMC and Samsung are already messing around with 3nm.

I mean, 2022 for their first 7nm chips to hit market? That's crazy. Zen 3 hit the market two weeks ago.


I really wish this tech was available outside of the Apple ecosystem.

If Apple continue like this, x86 machines won't be able to compete. I wonder if other vendors are looking at this and contemplating transitioning to ARM as well.


ARM Chromebooks have been a thing for almost a decade now. They ship with Chrome OS but it's usually fairly easy to install a more standard linux distro on them.

https://www.linuxmadesimple.info/2019/08/all-chromebooks-wit...


My recommendation is the Lenovo Duet: https://www.lenovo.com/us/en/laptops/lenovo/student-chromebo...

It's a Chromebook/Android Tablet hybrid for $300. It's now my primary ereader and "laptop" when I'm away from home since it has decent Android and Linux support.


For the same price I bought an Asus with Linux almost 10 years ago, which now has a SSD and 8 GB, with a DirectX 11/OpenGL 4.1 capable GPU.

Chromebooks keep being overpriced for what they offer.


Rockchip, Mediatek, and Qualcomm are not in the same ballpark as Apple Silicon. They're barely in the same sport.


> I wonder if other vendors are looking at this and contemplating transitioning to ARM as well.

Perhaps, but with what hardware? The PC era was defined by mostly open hardware that had a high degree of interoperability. ARM does not bring the same.


The ARM ecosystem is way more open than the PC ecosystem. Think about it, instead of having a single (AMD barely counted till recently) CPU manufacturer designing and manufacturing the chips, you now have a company that publishes reference designs and several other companies build an array of interesting implementations.

The only reason why ARM seems more closed is that you primarily interact with ARM chips on highly restrictive platforms. The M1 chip will go a long way to changing that perception as consumers will now be able to use an ARM based platform as they normally would a regular computer. But, that's not the only way, you can buy a wide array of ARM based devices with recent-generation ARM designs and very flexible IO.

e.g. https://www.pine64.org/pinebook-pro/


Is there anyone shipping an ARM motherboard and ARM chip? I think the "openness" is also defined by what the market offers. All I see are non-extendable notebooks.

Also until you can run non-linux "desktop software" on one, it will always feel like a "tablet"


> Is there anyone shipping an ARM motherboard and ARM chip? I think the "openness" is also defined by what the market offers. All I see are non-extendable notebooks.

Pretty much all ARM application processors use soldered BGA packages over sockets, and there also isn't the same tradition of defining a standard pinout for a generation or two (or five) of chips like there has been for x86 (e.g. LGA1151, AM4, etc.).

You can't really design a motherboard without this, so each design tends to have a custom breakout board or carrier card that's designed to be used in relatively specific installations.


> Is there anyone shipping an ARM motherboard and ARM chip? I think the "openness" is also defined by what the market offers. All I see are non-extendable notebooks.

The closest analogue to what you described in the ARM world would be plugging either a Computer-on-Module or System-on-Chip into a carrier board. Today's ARM deployments are presently largely focused on energy efficiency, so they make heavier use of integration than typical x86 deployments.

e.g. https://www.toradex.com/computer-on-modules/apalis-arm-famil... https://www.toradex.com/support/partner-network/hardware https://www.toradex.com/products/carrier-board

> Also until you can run non-linux "desktop software" on one, it will always feel like a "tablet"

There are millions of linux users around the world that would challenge your assertion that desktop linux doesn't qualify as anything beyond a tablet. In any event, Windows already runs on ARM and some rapscallions have already got it up and running on a Raspberry Pi despite Microsoft's present prohibition on doing so. I'm sure Microsoft will free up the licensing in due time as market demand presents itself (if only to allow interoperability with M1 Mac VMs).

https://www.youtube.com/watch?v=xyLdAs_roIA


That’s the issue. Linux runs everywhere and is not a consumer product, that’s why I exclude it.

Until you can build your own general-purpose PC and can decide to make it ARM, it will not be considered an equivalent option, it will just be another tablet, “chromebook” or “board for hackers”


I honestly don't follow you at all. We're now at a point where every major consumer operating system now has an ARM version available.

You can build your own general purpose PC with ARM today. It's not done regularly because ARM chips historically tended to be less powerful, so the type of person that went out and built their own PC wouldn't want to build an ARM PC. This is now rapidly changing given chips like Apple's M1 and the upcoming Arm Cortex-X1 are surpassing their x86 competitors in performance orientated benchmarks.

Once Microsoft gets their act together with their own version of Rosetta 2 and we reach the X2 generation, you'll start to see a rapid shift towards ARM desktops with X1 chips in the mid-tier segment. Eventually the premium "gamer" tier will follow as M1 has proven that it's possible to create an ARM SoC that surpasses top tier x86 single threaded performance at a fraction of the power budget and mountains of thermal headroom. This will be accelerated by the fact Nvidia is purchasing ARM and now has massive incentive to get Nvidia discrete graphics cards paired up with ARM SoCs.


You have a curious definition of openness there.

Most ARM SoCs are the opposite of open - they're proprietary to the max and don't even support an open boot process.

> The M1 chip will go a long way to changing that perception as consumers

The M1 chip will change nothing w.r.t. perception of "openness" - consumers don't care about the CPU inside their devices. What consumers see is a machine that'll last longer on battery, is much quieter and performs outstandingly well.

Openness is important to developers and enthusiasts who want to use the hardware as more than just a commodity.

Apple Silicon will kill the ability to upgrade your Apple device - no more RAM expansion slots, no more external GPU support, no more upgrading of internal storage, no more 3rd party repairs.

Apple Silicon turns laptops and Macs into consoles: closed ecosystems in terms of software and hardware. I wouldn't call that "openness" at all.

The RK3399 inside the Pinebook Pro is also not exactly "open" from a hardware point of view: still no open source drivers for the GPU for example (other than reverse-engineered volunteer efforts).


Raspberry Pi has gradually made their broadcom soc platform more open - I believe they support open boot loaders now?


Essentially what's missing right now is AMD or Intel or maybe some other company releasing an ARM chipset/motherboard spec and a CPU combo that an average consumer can buy and install into a normal ATX tower. Also, different Linux distros committing to shipping ARM images. Windows also somewhat supports it already, although not very well for now.

It would seem that ARM has a much higher potential performance ceiling than already relatively optimized x86 and derivatives and so once we have enough viable ARM options for desktop or high-performance laptops, the industry won't have a choice but move along with it.

If you're building software and your competitor releases ARM support and suddenly their product performs better than yours, users will go to them.

If videogame makers realize that they can push more performance on ARM, they will start releasing ARM-optimizing games or even shift to ARM-first in time.


Windows and x86-64 have way too much momentum for that to happen any time soon. What is going to be far more realistic is AMD/Intel making a new chip where they expose a better ISA and Microsoft releasing a Windows that supports that ISA. All while keeping the tried and tested x86/x64 running in parallel and slowly allowing software to migrate to the new ISA, or perhaps investing in a recompilation of executables targeting x86-64 to take advantage of the new ISA. But getting AMD, Intel, and Microsoft to agree on a new ISA is a tall order.


Windows on non-x86 would need to have emulation similar to Rosetta 2 to be viable, IMO. Over the past couple decades, we've had both Windows on IA64 and Windows on ARM releases, but they've never had much compatibility with existing software.


I'm not talking about a pure non-x86, but a hybrid chip that exposes a legacy x86-64 ISA and a new and improved ISA for the future without all the baggage. Windows will have to target the new one, and allow applications to move to the new ISA at their own pace. With time they can make "Rosetta 2" style emulation with compilation to remove the old ISA altogether.

That said, I seriously doubt that the ISA in and of itself is the reason the chips are slower. After all, underneath is a state of the art architecture. Until I see an independent, truly cross-platform benchmark, e.g. in dotnet, be significantly faster and more efficient, I remain skeptical of such incredible claims.


Check out SolidRun's Honeycomb LX2K. All very normal: Mini ITX, PCIe x8, DDR4 SODIMMs, USB 3.0, M.2 for NVMe. Bonus: 4 10GbE SFP+ ports.

Mine is a delight.

https://www.solid-run.com/nxp-lx2160a-family/honeycomb-works...


ARM devices being closed is a self-fulfilling prophecy. The SystemReady program is working towards an open PC-like ecosystem.


I would kill for a MacNano. Imagine an A14/M1-derived single-board computer, flexible IO options, an IP rated enclosure, and Apple's hitherto unannounced IoTOS. It'd be the ultimate little edge machine!

They could even do it like the MacPro and couple it with some whackadoodle system for connecting modules (like Raspberry Pi hats) without having to worry too much about how exactly they'll integrate.


Lovely though this would be, it's hard to imagine a less 'Apple' product which is still a computer.


A la the Apple I?


I know you're joking, but it's a silly joke. Obviously, an underclocked A14 variant built on a 5nm process would still be significantly faster and more energy efficient than whatever is in the Apple I.


I don't think that comment was intended as a joke. I think it was a perfectly serious and literal comparison to the Apple 1, which was a hackable extendable single board computer.



Yeah, this looks incredibly cool, but I was talking more about consumer-grade electronics. Stuff you can actually buy and build a PC with for work or games or whatever people currently do with computers at home.


honestly, my 5 year old laptop is still plenty fast for what I need. If it weren't for dropped support and websites that force me to upgrade to the latest browser, I'd be happy with 10 year old computers. I honestly want to get off the upgrade roller coaster. (I have in the phone world--happy with my 1st gen iPhone SE)


My 1st gen iPhone SE is also still going strong. So responsive. They really built it to last.


I just "upgraded" from an iPhone SE to an iPhone 12 mini and realized that the engineers at Slack and Discord must be rocking the latest phones, because their apps fly on the new device in a way that they did not at all on iPhone SE. Other apps see much less of a difference. I have a dream that one day the engineers at those companies will come to work and find their brand new test devices smashed and replaced with iPhone 7s so they can understand exactly how badly their apps perform.


Discord uses React Native, and I wouldn't be surprised if Slack does as well.


Slack is a native app, although it barely looks like one these days :(


So M1 was the SoC for the lower power Macs - Air, Mini and 2 port 13”.

Wonder what they’ll come up with for full port 13”, 16” and iMac.


Hopefully they keep Mx line as power efficient everyday chips, and create something really ludicrous (P1?) for the high end models. It’s about time that the “Pro” devices started being actually pro. Massive cache, on chip ECC memory, etc.

The top-of-the-line i9 MacBook really doesn’t cut it for a lot of professional work, especially when it thermally throttles before the work can even begin.


I feel like if they were going to do this right now they would have offered 32gb configurations on the M1 devices... as it stands they're going to have to offer attractive ~$2k devices for creative professionals who don't need bleeding edge cpu performance but would like a 20h battery and native ios apps.


I thought throttling issues were resolved with current-gen 16” i9?


Keeping with previous naming schemes it'll either be some kind of M1X or an "M2".


What amazes me is that this is their first shot. Can’t wait to see what happens when they have an M chip for the Mac Pro (the tower, not the laptop). This architecture is a beast.


To be fair, this isn't their first shot at ARM processors, they've been at it for longer than a decade (Apple A4 was launched in early 2010)

The AMD64 "emulation" on the other hand is brand new.


That's the cool thing, they've used their iphone/ipad line to gradually grow this.


Whenever the transition to ARM came up here before, there were always a lot of people who seemed to think that x86 compatibility would be a huge issue. And if they did emulation, the x86 performance would certainly be awful! When I claimed that they could probably reach performance parity, or maybe even beat the x86 performance of the Intel chips they were replacing, they didn't believe me.

I was thinking that they'd probably add custom instructions for x86 emulation, though I haven't heard if that's the case or not. Would be interesting to find out. Given the complexity of a modern x86-64 frontend, it was never that far-fetched to think that a combination of world-class JIT and a bit of custom silicon could have a negligible overhead compared to a pure silicon solution. After that it was mostly a question of how good the CPU core and technology node were. And Anandtech, among others, had already shown they were well on track to beating Intel there.


I guess this means every other Mac was really slow all along.


The old macs are just using regular Intel chips like most other desktop computers


Regular Intel chips plus woefully inadequate cooling, at least for the macs made in the past 5 years or so.


You mean x86.


You mean Intel.


CodeWeavers apparently has WINE running through Rosetta 2, which is wild from a technical standpoint and gives me hope for my older PC games. https://www.codeweavers.com/blog/jwhite/2020/11/10/its-great...

The real question is… how long does Rosetta 2 stick around? Rosetta 1 only lasted OSX 10.4 - 10.6.


Why is it reporting as "2.4 GHz" under emulation?


Most likely they emulated the cpuid corresponding to a previously existing Intel part. cpuid is used to fill in those “GHz” numbers you see in these reports.
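
As a rough illustration of how a tool could turn a CPUID brand string into a "GHz" figure: many Intel brand strings end in an "@ x.xxGHz" annotation, and a reporting tool might simply parse that. Whether the benchmark in question does this is an assumption, and the helper name and both sample strings below are hypothetical, not captured from real hardware.

```python
import re
from typing import Optional

# Matches a clock annotation such as "@ 2.40GHz" or "@ 800MHz".
_FREQ = re.compile(r"@\s*(\d+(?:\.\d+)?)\s*([GM])Hz", re.IGNORECASE)


def advertised_mhz(brand_string: str) -> Optional[int]:
    """Return the clock a brand string advertises, in MHz, or None if absent."""
    match = _FREQ.search(brand_string)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2).upper()
    # Round to whole MHz so that e.g. "2.40GHz" yields exactly 2400.
    return round(value * 1000) if unit == "G" else round(value)


if __name__ == "__main__":
    # Hypothetical brand strings for illustration only.
    for s in ("Example(R) CPU Model X @ 2.40GHz", "ExampleVirtualCPU"):
        print(repr(s), "->", advertised_mhz(s))
```

If the reported string carries no such annotation, a tool has to get the number some other way, which is one reason the figure shown under emulation need not match the real clock.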


My guess would be that's the base freq it will go to under full load on a fanless MBA. It's like the "official" 1.1GHz base clock of the i5-1030NG7.


I wonder if this is a hint that Rosetta 2 on the 3.2 GHz M1 is intended to provide equivalent performance to a 2.4 GHz Intel processor.

