Apple M1 Microarchitecture Research (dougallj.github.io)
237 points by my123 on March 7, 2021 | 48 comments


It's interesting work. The results show a marked contrast with equivalent information for Intel (think uops.info, Agner Fog's work or even the Optimization Reference Manual).

Most obvious is separating general purpose register stuff from SIMD. If you have a mixed workload, this means M1 is much wider in that sense (although note that the SIMD registers involved are 128b as opposed to up to 512b).

Intel's execution ports tend to be special flowers - there's almost always something that can only be done, or can't be done, on any given port. So each of the execution ports (0, 1, 5, 6) tends to have some special flavor that the others don't share - and there are often complex rules like "oh, you can only do 2 of those or 3 of these". There's a very small list of operations (generally logic and LEA) that can be done on all of 0, 1, 5, 6.

By contrast, on the M1 most operations can be done on every unit that is capable of that kind of work. Want to do 4 SIMD math ops? Go for your life! It's certainly easier to remember and understand.

I imagine the counterargument on the Intel side is that it's unlikely you need that many of any given operation - so they save area by not making every port omni-capable. This works pretty well (right up to the point it doesn't).

It will be interesting to see how the comparison between Intel and Apple plays out in the near future. M1 is a tour de force of balanced design, but I wouldn't count Intel out - especially if they can get their process mojo back (a lot of M1's current advantage is being able to invest transistors in a balanced and sensible way across the entire pipeline due to a better process).


Intel's more likely argument is that they can't do that many because their decoders and ROB aren't wide enough to keep up. Given how huge their decode units are (about as big as all their integer ALUs), I don't know how x86 will actually solve this issue.


There are a couple of distinct issues here - M1 is both wider than recent x86 and more orthogonal.

I'm not sure that decode is the bottleneck for all code - a lot of code could be served out of a pre-decoded uop cache or the LSD (loop stream detector). People always return to decode as an explanation because it's an attractive RISC vs CISC morality tale.

I also don't know how much of the decode unit is devoted to legacy support and MSROM, which doesn't necessarily have to scale the same way as the simple decoders.


Could Intel adapt a Rosetta 2-style approach? Preprocess binaries to replace ancient instructions with modern equivalents, and be able to use smaller decode units? Would the gains be worth the translation requirements?


I might call that the Transmeta approach. Back in the early 2000s, Transmeta offered x86-compatible systems that ran a proprietary VLIW core with software translation. The resulting system looked like an x86 system to the OS and was more power-efficient than contemporary x86 systems from other vendors.

Intel has a perpetual license for all Transmeta patents and applications.


Does the reorder buffer in x86 processors operate on the x86 instructions or the µOPs?


The company that built the original Rosetta (which was really Apple branding on top of QuickTransit) could actually do x86-to-x86 translation in their product, and found that even when low-hanging fruit was theoretically available, it was challenging to extract.

I don’t think the x86 stuff we are talking about is easy to extract. Instead this seems like something that wants an intermediate code + install time specializer a la mainframes.


No, Intel needs to approach this from hardware alone.

Apple has more freedom as they only have to worry about support from their own OS, and even then, only the latest revision.


That’s an impressive piece of work. I would love to know how far this is from the information in the actual internal documentation. It’s very nice to have some excitement about a mainstream CPU architecture again.


Just a pain that Apple's policies of keeping big margins and not sharing technology mean that the rest of the industry won't benefit from this.

Still very hard to see not getting an M1 as my next laptop, especially if they continue to do sane things like removing the shitty touchbar.


Nuvia (former Apple M1 team members) was acquired by Qualcomm, so hopefully new laptop SoCs are on the way.


Yeah but Qualcomm... They’re perfectly happy sitting on their arses if there’s nobody to kick them around, and they’re just a nuisance generally. I hope I’m wrong, but they won’t have direct competition from Apple because it looks unlikely that M1 devices will run Windows. They seem to be more or less happy with being 2 generations behind Apple on phone SoCs. And they basically gave up on wearables.

So instead of having Wintel/Mac we’ll have Wincomm/Mac or Wintel/Wincomm/Mac; not sure it’ll be much better.


Huh? I ran Windows on my M1 MBA quite easily: https://www.microsoft.com/en-us/software-download/windowsins...

It was under Parallels but I’m sure over time someone will figure out how to direct boot into it if that’s really important.

Performance of the M1 running Windows and games, either in Crossover (a commercial WINE implementation - well worth the $40 for ease of use/installation) or the Windows 10 beta, actually turned out to be a bad thing for me. I subscribe to way too many assets in Cities:Skylines to get by on 16GB of RAM - I need at least 32GB - so with great difficulty I returned the 16GB MBA and am eagerly awaiting the next round of AS Macs, hoping there will be at least one laptop that can take at least 32GB of RAM. If so I will probably never update my Windows gaming machine again. Not that I could if I wanted to right now with the insane GPU shortages - but that’s another topic :p


Yeah it can help in a pinch or for some specific workflows, but that's not a substitute.

I am hoping that my 2015 15'' MBP does not die before the next 15'' or 16'' are released. 32 GB (bare minimum) or 64 GB (for future proofing) and a better CPU would be great.

I don't have enough time to play even my backlog of Mac and Switch games. I wanted to build another PC for fun and to play, but I am not sure I will anymore.


> Huh? I ran Windows on my M1 MBA quite easily

> It was under Parallels but...

You didn't; you ran Windows on virtualized hardware, which isn't interesting.

Might be useful for you, but not interesting.


This. I’m basically ordering a 15” MBP on day one at this point. I really like their devices as a user, but I wish they’d be more developer friendly.


Giving stuff away doesn't generate profits needed to invest in R&D.


Source for what a validation buffer is? Couldn't find a reference on google


They say the M1 may use a validation buffer instead of a reorder buffer. This makes it somewhat clear if you know what a reorder buffer is.

Processors basically do this: read instructions, execute them out of order to fully use CPU resources, put the instructions back in order along with their results, and then write the results to memory. This way the “final state” follows the order of instructions.

The “put back in order” step is handled by the reorder buffer. It is basically an online sort: instructions may finish in any order, but their results are committed in program order.

Fully sorting instructions after execution is simple and correct, but in extreme cases may be suboptimal; perhaps it doesn’t matter if the final state is exactly in order or just close to that. I believe the article is saying the M1 may just validate states as they are written rather than fully reordering.
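
To make that concrete, here is a minimal sketch of a conventional reorder buffer as a circular queue - a toy model of my own, not Apple's or Intel's actual design: entries are allocated at dispatch in program order, marked complete in whatever order execution finishes, and retired only from the head.

    #include <stdbool.h>
    #include <stdio.h>

    #define ROB_SIZE 8  /* tiny for illustration; real ROBs hold hundreds of entries */

    typedef struct {
        int  id;        /* program-order sequence number */
        bool completed; /* execution finished (possibly out of order) */
    } RobEntry;

    static RobEntry rob[ROB_SIZE];
    static int head = 0, tail = 0, count = 0;

    /* Allocate an entry at dispatch, in program order; returns -1 if full (front end stalls). */
    static int rob_alloc(int id) {
        if (count == ROB_SIZE) return -1;
        rob[tail] = (RobEntry){ .id = id, .completed = false };
        int slot = tail;
        tail = (tail + 1) % ROB_SIZE;
        count++;
        return slot;
    }

    /* Execution units mark entries complete in any order. */
    static void rob_complete(int slot) { rob[slot].completed = true; }

    /* Retirement only ever happens from the head, so results become
     * architecturally visible in program order. */
    static void rob_retire(void) {
        while (count > 0 && rob[head].completed) {
            printf("retire %d\n", rob[head].id);
            head = (head + 1) % ROB_SIZE;
            count--;
        }
    }

    int main(void) {
        int s0 = rob_alloc(0), s1 = rob_alloc(1), s2 = rob_alloc(2);
        rob_complete(s2);   /* youngest instruction finishes first... */
        rob_retire();       /* ...but nothing retires yet */
        rob_complete(s0);
        rob_complete(s1);
        rob_retire();       /* now 0, 1, 2 retire in program order */
        return 0;
    }

The validation-buffer hypothesis in the article would relax that strict retire-from-the-head step, which is exactly the part this sketch enforces.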


The Sun Microsystems Rock processor had, I guess, validation? Rather than re-ordering instructions, you just speculatively execute them and issue your memory ops without stopping to wait - called "hardware scouting". If, after everything resolves, you find your prediction was wrong, you roll back. So it's a kind of transactional memory approach: if validation fails, you re-execute the same sequence of instructions, but the branches resolve differently.

https://en.wikipedia.org/wiki/Hardware_scout


Interesting. Sounds like even more opportunities for speculative instruction exploits.


Arm isn't TSO. As long as the internal instruction stream maintains dependency information, independent pieces of code can be completed and written back to memory at any time.

Basically, the whole "put it back in order" bit only has to happen around memory barriers, because the visibility of writes isn't guaranteed otherwise.

Whether this matters given large write combining/writeback buffering that queues up early writes until the later writes have completed is one of those decade+ long arguments.
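
As a concrete illustration of where ordering is actually required (my own example using C11 atomics, not anything from the article): on a TSO machine the two stores below are already seen in order, while on Arm it's the release/acquire pair that the compiler turns into ordering instructions (e.g. stlr/ldar on AArch64).

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int data;           /* ordinary payload, no ordering of its own */
    static atomic_int ready;   /* flag that publishes the payload */

    static void *producer(void *arg) {
        (void)arg;
        data = 42;                                           /* plain store */
        atomic_store_explicit(&ready, 1,
                              memory_order_release);         /* stlr / ordering point on Arm */
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        while (atomic_load_explicit(&ready,
                                    memory_order_acquire) == 0)
            ;                                                /* spin until the flag is visible */
        printf("data = %d\n", data);                         /* guaranteed to print 42 */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

Everything else - independent writes with no such pairing - is free to drain to memory whenever the core likes, which is the flexibility the weaker model buys.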


It was the distinction between validation and re-order that confused me.


There has been an update:

>… The M1 seems to use something other than an entirely conventional reorder buffer, which complicates measurements a bit. So these may or may not be accurate. (This paragraph previously said "it seems to use something along the lines of a validation buffer". I think the VB hypothesis has since been disproven. Various attempts to measure ROB size have yielded values 623, 853, and 2295 (see the previous link). My uninformed hypothesis is that this may imply a kind of distributed reorder buffer, where only structures that need to know about a given operation track them, and/or some kind of out-of-order retirement.)

https://github.com/dougallj/applecpu/commit/dc3c220f58f428b5...
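
For context on where numbers like those come from: the usual way to probe out-of-order window size is to separate two independent cache-missing loads with N cheap filler instructions and look for the filler count at which the two misses stop overlapping. Below is my own rough sketch of that idea (not dougallj's actual harness; the constants and the nop-loop filler are simplifications).

    /* Compile with e.g. -O2 -DFILLER=64, -DFILLER=512, ... and watch for the
     * jump in ns/iteration: while the fillers plus both loads fit in the
     * out-of-order window the two misses overlap, beyond that they serialise. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define CHAIN_LEN (1 << 22)   /* 4M pointers (~32 MB): far bigger than the caches */
    #define ITERS     (1 << 20)
    #ifndef FILLER
    #define FILLER    64
    #endif

    /* Build one big random cyclic pointer chain so every chased load misses. */
    static void **make_chain(void) {
        void **buf = malloc(CHAIN_LEN * sizeof *buf);
        size_t *perm = malloc(CHAIN_LEN * sizeof *perm);
        for (size_t i = 0; i < CHAIN_LEN; i++) perm[i] = i;
        for (size_t i = CHAIN_LEN - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < CHAIN_LEN; i++)
            buf[perm[i]] = &buf[perm[(i + 1) % CHAIN_LEN]];
        free(perm);
        return buf;
    }

    int main(void) {
        void **chain = make_chain();
        void *a = chain[0];
        void *b = chain[CHAIN_LEN / 2];   /* second, independent walk of the chain */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ITERS; i++) {
            a = *(void **)a;                    /* cache-missing load #1 */
            for (int k = 0; k < FILLER; k++)    /* cheap ops that occupy window slots;  */
                __asm__ volatile("nop");        /* a real harness unrolls these instead */
            b = *(void **)b;                    /* cache-missing load #2, independent of #1 */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        __asm__ volatile("" :: "r"(a), "r"(b)); /* keep the chains from being optimised away */

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("FILLER=%d: %.1f ns/iter\n", FILLER, ns / ITERS);
        free(chain);
        return 0;
    }

The FILLER value at which the time jumps tracks whatever structure limits how far the core can run ahead of a pending miss, which may be why different filler mixes yield different apparent sizes (623, 853, 2295) on the M1.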



My computer hardware architecture design was published on February 06, 2019. One or two years later, the Apple M1 chip adopted the "warehouse/workshop model" design and was released on November 11, 2020.

Warehouse: unified memory
Workshop: CPU, GPU and other cores
Products (raw materials): information, data

> there's also a new unified memory architecture that lets the CPU, GPU, and other cores exchange information between one another, and with unified memory, the CPU and GPU can access memory simultaneously rather than copying data between one area and another. Accessing the same pool of memory without the need for copying speeds up information exchange for faster overall performance.

reference: Developer Delves Into Reasons Why Apple's M1 Chip is So Fast

From the introduction:

Apple M1 has not done global optimization of the scheduling of the various cores (workshops). Apple M1 only optimizes access to memory data (materials and products in the warehouse). Apple needs to further improve the programming language and compiler to support and promote my programming methodology. My architecture supports a wider range of workshop types than Apple M1, with greater efficiency, scalability and flexibility.

Conclusion:

The Apple M1 chip still needs a lot of optimization work; its current level of optimization is still very simple. After all, it is only the first generation of the work, released in stages.

Forecast (2021-01-19): I think Intel, AMD, ARM, supercomputers, etc. will adopt the "warehouse/workshop model".

In the past, the performance of the CPU played a decisive role in the performance of the computer. There were few CPU cores, and few peripherals in number and type. Therefore, the CPU became the center of the computer hardware architecture.

Now, with more and more CPU and GPU cores, and more peripherals of more types, the communication, coordination, and management of cores (or components and peripherals) have become more and more important; they are now a key factor in computer performance.

The core views of management science and computer science are the same: use all available resources to complete the goal with the highest efficiency. Accomplishing production goals through the communication, coordination, and management of various available resources is the field where management science is at its best. The most effective, reliable, and absolutely mainstream way to do this is the "warehouse/workshop model".

Changing only the architecture - leaving the CPU instruction set unchanged, or only extending it - not only does not affect CPU compatibility, but also opens up huge optimization space.

So I think Intel, AMD, ARM, supercomputing, etc. will adopt the "warehouse/workshop model", which is an inevitable trend in the development of computer hardware.

Finally, "Warehouse/Workshop Model" and "Von Neumann Architecture" will become the two major architectures in the IT field.

https://github.com/linpengcheng/PurefunctionPipelineDataflow...


I realize it's not what you're referring to, but the context and terminology of your comment reminded me of this:

https://manybutfinite.com/post/the-thing-king/


Hats off to people who do this kind of work. However... *clears throat*

I have real trouble using the term "research" for something like this, lumping it together with activities such as "research" in natural science, "research" in anthropology, or even journalistic "research".

The difference to me is that this activity could be "done", or rather entirely avoided, if Apple didn't decide to keep it a secret.

Mind you, I'm not saying Apple doesn't have good reasons to keep these things secret, for some definition of "good". I just think that this person spent hours upon hours of their life working around a quirk in our current society.

I'm aware I'm splitting hairs, but "we don't know what the architecture is" to me is fundamentally different to "we don't know what dark energy is." The former is something that could "easily" be changed, the latter is not. Lumping them together just cements the idea that the systems we built and navigate, for better or worse, are to be treated like natural laws that can't be argued with.

/rant


That's a weird take. Journalistic research, for example, also often involves bringing to light facts that some in-group already knows. The word research is a lot broader than what you claim; people talk about "researching" which television set to buy, and everyone understands what they mean.

So your contribution here is "this guy is figuring out stuff that people at Apple already pretty much know". Err, thanks, Captain Obvious.


I mean, I read the parent comment more as "it's a shame that we, as a society, waste so much time figuring out stuff we already know but are unwilling to share".


I think it does say that, but it also says:

> I have real trouble using the term "research" for something like this, lumping it together with activities such as "research" in natural science, "research" in anthropology, or even journalistic "research".

Considering we're on "Hacker News" right now, saying something like "hacking (in the reverse engineering sense) is research-like but not fundamental enough to be research" is bound to be a bit controversial.

It's a slippery slope anyways. Is researching anything that might have been known by someone at some point in time now not research? And it feels a bit like gatekeeping too.


Yeah, I'm not at all sure the distinction over terminology is terribly useful, but I do think GP's underlying point is interesting.


It's possible that he could have expressed these same thoughts without sounding like such a jerk, though.


Security research could be considered another similar example.


Just because other humans created the knowledge in question, doesn't make the search for that knowledge not research, if the knowledge isn't held by the general public in the first place, and if that knowledge requires independent gathering of data and interpretation to acquire.

I agree that Apple, and the quirks in our society that encourage that secretive behavior, have the effect of making researchers independently replicate a small subset of Apple's work, whereas in a more open society they'd be redirected to something more useful. That doesn't make the independent researcher's work not research, though.

Otherwise nearly the whole field of anthropology wouldn't exist or wouldn't be considered research.

> Lumping them together just cements the idea that the systems we built and navigate, for better or worse, are to be treated like natural laws that can't be argued with.

The systems we have built are pretty fixed in time. We can argue about how to change them going into the future, but we should also record, externally if need be (like here), what has happened, so that we can make better decisions going forward. Apple's choices for the M1 are arguably more set in stone than our conception of physics. We're able to retcon all of physics when we make changes to the standard model, but "what tradeoffs did Apple make in the M1" is pretty much done and set in stone (or silicon, I guess).


The distinction you're making is more about _fundamental_ research vs. _applied_ research, but it is research nevertheless.

From the first hit I found on Google:

* Fundamental researches mainly aim to answer the questions of why, what or how and they tend to contribute the pool of fundamental knowledge in the research area.

* Opposite to fundamental research is applied research that aims to solve specific problems, thus findings of applied research do have immediate practical implications.


I dislike the fact that “research” has so broad a meaning in English. In particular, journalistic research does not have much in common with scientific research. And this constant push to “do your research” from crackpots and pseudo-skeptics is giving research a bad name (of course, they don’t mean scientific research either). Going down a rabbit hole on Wikipedia watching more or less dubious videos is studying at best, not researching.

Anyway, in this case, “research” is legit, even from a scientific perspective: there is an object, and hypotheses are tested, and observations are made to gain insights on how it works. Perfectly legitimate piece of applied research.


> I just think that this person just spent hours upon hours of their life working around a quirk in our current society.

I think this is a truism. Much human activity could fall under 'working around a quirk in our current society' at some level or another. Even your dark energy researcher probably spent untold hours greasing the funding gears.


Your comment reminds me of a scifi story (Greg Egan maybe?) where AI has taken the lead on science, and humanity's scientists' job is now just to understand the AI's discoveries.


There's a fictitious journal article with this premise by Ted Chiang.


Reminds me of computer virus research in the 90's. "We've found the polymorphic encryption algorithm...". Ok? That's not akin to discovering some function of a biological virus. It's just discovering other (even worse, contemporary) people's work. Just like a "reaction" YouTube video, it adds nothing new.


Think of it as approaching the topic from the other side. It is very analogous to what security researchers do.

Consider things like the incompleteness theorem, and how there will always be gaps in our implementation of a given epistemology, in this case computer science, hoping to achieve a given end.

Researchers then look at that output while entirely blind, work backwards, and can sometimes uncover very useful insights that the original creators may have been blind to.


Lots of journalistic research is also about uncovering secret knowledge and information, for example classified government documents, corporate cover-ups and various forms of human rights violations. You are right, though, that with more transparency and fewer predatory legal constructs the jobs of both journalists and computer researchers would be way easier.


But Apple did keep it a secret and the research involved discovering those secrets for the rest of us. That's still research and it's still valuable. The fact that it shouldn't have to be this way only makes this kind of research more valuable, not less.


I understand where you’re coming from and also did a double take when I read this term, though I think it’s legit.

Consider: what does an agency that spies on other countries’ governments do? Research?


There are known knowns, known unknowns and unknown unknowns. Getting to the bottom of any of them is research.


You are getting downvoted, but I agree with parts of your argument. The part that is perhaps missing is that reversing the microarchitecture, or any complex system, is generally not an end in itself. It is a prerequisite for more meaningful research. For example, building a simulator or finding side-channel vulnerabilities requires a good understanding of the u-arch.


Research you don't like is still research.



