Hacker News new | past | comments | ask | show | jobs | submit login
Apple’s new M1 Pro and M1 Max processors (apple.com)
1052 points by emdashcomma 3 months ago | hide | past | favorite | 983 comments

All: let's keep this thread about the processors, and talk about the new MBPs in the other thread: https://news.ycombinator.com/item?id=28908383.

Edit: to read all the 600+ comments in this thread, click More at the bottom of the page, or like this:



This is about the processors, not the laptops, so commenting on the chips instead. They look great, but they look like they're the M1 design, just more of it. Which is plenty for a laptop! But it'll be interesting to see what they'll do for their desktops.

Most of the additional chip area went into more GPUs and special-purpose video codec hardware. It's "just" two more cores than the vanilla M1, and some of the efficiency cores on the M1 became performance cores. So CPU-bound things like compiling code will be "only" 20-50% faster than on the M1 MacBook. The big wins are for GPU-heavy and codec-heavy workloads.

That makes sense since that's where most users will need their performance. I'm still a bit sad that the era of "general purpose computing" where CPU can do all workloads is coming to an end.

Nevertheless, impressive chips, I'm very curious where they'll take it for the Mac Pro, and (hopefully) the iMac Pro.

> "just" two more cores than the vanilla M1

Total cores, but going from 4 "high performance" and 4 "efficiency" to 8 "high performance" and 2 "efficiency. So should be more dramatic increase in performance than "20% more cores" would provide.

Is there a tradeoff in terms of power consumption?

Yes. But the 14" and 16" has larger battery than 13" MacBook Pro or Air. And they were designed for performance, so two less EE core doesn't matter as much.

It is also important to note, despite the name with M1, we dont know if the CPU core are the same as the one used in M1 / A14. Or did they used A15 design where the energy efficient core had significant improvement. Since the Video Decoder used in M1 Pro and Max seems to be from A15, the LPDDR5 is also a new memory controller.

In A15, Anandtech claims the Efficiency cores are 1/3 the performance, but 1/10 the power. They should be looking at (effectively) doubling the power consumption over M1 with just the CPUs and assuming they don't increase clockspeeds.

Going from 8 to 16 or 32 GPU cores is another massive power increase.

I wonder if Apple will give us a 'long-haul' mode where the system is locked to only the energy efficient cores and settings. I us developer types would love a computer that survives 24 hours on battery.

macOS Monterey coming out on the 25th has a new Low Power Mode feature that may do just that. That said, these Macs are incredibly efficient for light use, you may already get 24 hrs of battery life with your workload. Not counting screen off time.

Yes, it depends on what you're doing, but if you can watch 21 hours of video, many people will be able to do more than 24 hours of development.

Video playback is accelerated by essentially custom ASIC processing built into the CPU, so it's one of the most efficient things you can do now. Most development workloads are far more compute intensive.

I get about 14-16 hours out of my M1 MacBook Air doing basically full-time development (browser, mail client, Slack, text editor & terminal open, and compiling code periodically).

I know everyone's use case is different, but most of my development workload is 65% typing code into a text editor and 35% running it. I'm not continually pegging the CPU, just intermittently, in which case the existence of low power cores help a lot. The supposed javascript acceleration in the M1 has seemed to really speed up my workloads too.

Might be less computationally expensive. But video playback constantly refreshes screen which uses up battery

Did Electron fix the 60Hz (or rather current screen refresh rate) cursor blinking? Otherwise I don't see many web devs getting a lot of runtime in.

That's actually a curious question, I wonder what the most energy efficient dev tool is. Can't imagine its VScode. Maybe plain terminal with VIM?

This is true, but it's not worst case by far. Most video is 24 or 30 fps, so about half the typical 60 hz refresh rate. Still a nice optimization path for video. I'm not sure what effect typing in an editor will have on screen refresh, but if the Electron issue is any indication, it's probably complicated.

Huge, apparently. I just spent a bit over $7,000 for a top-spec model and was surprised to read that it comes with a 140 watt power adapter.

Prior to my current M1 MBP, my daily driver was a maxed-out 16" MBP. It's a solid computer, but it functions just as well as a space heater.

And its power adapter is only 100 watts.

the power supply is for charging the battery faster. the new magsafe 3 system can charge with more wattage than usb-c, as per the announcement. usb-c max wattage is 100 watts, which was the previous limiting factor for battery charge.

USB Power Delivery 3.1 goes up to 240 W (or, I should say, “will go up” as I don’t think anybody is shipping it yet)

USB-C 3.1 PD delivers up to 240watts.

That's with 2 connectors right? I have a Dell Precision 3760 and the one connector charging mode is limited to around 90W. With two connectors working in tandem (they snap together), it's 180W.

The connectors never get remotely warm .. in fact under max charge rate they're consistently cool to touch, so I've always thought that it could probably be increased a little bit with no negative consequences.

Single connector, the 3.1 spec goes up to 5A at 48V. You need new cables with support for the higher voltages, but your "multiple plugs for more power" laptop is exactly the sort of device it's designed for.

It was announced earlier this year, so not in wide use yet. PDF warning: https://usb.org/sites/default/files/2021-05/USB%20PG%20USB%2...

I’ve not seen any manufacturer even announce they were going to make a supported cable yet, let alone seen one that does. I might’ve missed it though. This will only make the hell of USB-C cabling worse imho.

The USB Implementers Forum announced a new set of cable markings for USB 4 certified cables that will combine information on the maximum supported data rate and maximum power delivery for a given cable.

Currently I think there aren't any, and the MBP will only get its full 140W capacity using the magsafe cable.

Color me skeptical.

I believe you (and Apple) that the battery can be charged faster, but I am currently rendering video on an M1 MBP. Its power draw: ~20 Watts.

That's a lot of charging overhead.

The 16” has a 100wh battery, so it needs 100w of power to charge 50% in 30 minutes (their “fast charging”). Add in 20w to keep the laptop running at the same time, and some conversion losses, and a 140w charger sounds just about right.

That actually sounds on the money. I hope you're right!

The other end of the new MagSafe is usb-c, which gets plugged into the power adapter.

Sure, but it's an Apple cable plugging into an Apple socket. They don't have to be constrained by the USB-C specs and could implement a custom high power charging mode. In fact I believe some other laptop manufacturers already do this.

They support fast-charging the battery to 50% in 30 minutes. That's probably the reason for the beefy charger.

I'm surprised the fast charger isn't a separate purchase, a la iPhone.

I’m not particularly surprised. They have little to prove with the iPhone, but have every reason to make every measurable factor of these new Macs better than both the previous iteration and the competition. Throwing in a future-model-upsell is negligible compared to mixed reviews about Magsafe 3 cluttering up reviews they otherwise expect to be positive.

Yeah I'd prefer that. Converted everything to quality 6ft braided usb-c cables, bought a separate multiport charger. Will have to sell original one...

Just in case people missed it - the magsafe cable connects to the power supply via usb-c. So (in theory) there's nothing special about the charger that you couldn't do with a 3rd party charger, or a multiport charger or something like that.

MagSafe was a gimmick for me - disconnects far too easy, cables fray in like 9 months, only one side, proprietary and overpriced. Use longer cables and they will never be yanked again. MBP is heavy enough that even USB-C is getting pulled out on a good yank.

I briefly had an M1 Macbook Air and the thing I hated the most about it was the lack of Magsafe. I returned it (needed more RAM) and was overjoyed they brought Magsafe back with these and am looking forward to having it on my new 16" You can also still charge through USB C if you don't care for Magsafe.

Might be a power limitation. I have an XPS 17 which only runs at full performance and charges the battery with the supplied 130W charger. USB C is only specced to 100W. I can still do most things on the spare USB C charger I have.

With the latest USB-PD standard that was announced in May, up to 240W is supported

I have a top-spec 15” MBP that was the last release just before 16”. It has 100W supply and it’s easy to have total draw more than that (so pulling from the battery while plugged in) while running heavy things like 3D games. I’ve seen around 140W peak. So a 150W supply seems prudent.

In the power/performance curves provided by Apple, they imply that the Pro/Max provides the same level of performance at a slightly lower power consumption than the original M1.

But at the same time, Apple isn't providing any hard data or explaining their methodology. I dunno how much we should be reading into the graphs. /shrug

I think you misread the graph. https://www.apple.com/newsroom/images/product/mac/standard/A...

The graph there shows that the new chip is higher power usage at all performance levels.

Not all; it looks like the M1 running full-tilt is slightly less efficient for the same perf than the M1 Pro/Max. (I.e., the curves intersect.)

Yes, but only at the very extreme. It's normal that a high core count part at low clocks has higher efficiency (perf/power) at a given performance level than a low core count part at high clocks, since power grows super-linearly with clock speed (decreasing efficiency). But notably they've tuned the clock/power regime of the M1 Pro/Max CPUs that the crossover region here is very small.

I think this is pretty easy to math: M1 has 2x the efficiency cores of these new models. Those cores do a lot of work in measured workloads that will sometimes be scheduled on performance cores instead. The relative performance and efficiency lines up pretty well if you assume that a given benchmark is utilizing all cores.

> M1 Pro delivers up to 1.7x more CPU performance at the same power level and achieves the PC chip’s peak performance using up to 70 percent less power

Uses less power

That's compared to the PC chips, not M1. M1 uses less power at same performance levels.


> I'm still a bit sad that the era of "general purpose computing" where CPU can do all workloads is coming to an end.

You'd have to be extremely old to remember that era. Lots of stuff important to making computers work got split off into separate chips away from the CPU pretty early into mass computing, such as sound, graphics, and networking. We've also been sending a lot of compute from the CPU into the GPU as late for both graphics and ML purposes.

Lately it seems like the trend has been taking these specialized peripheral chips and moving them back into SoC packages. Apple's approach here seems to be an evolutionary step on top of say, an Intel chip with integrated graphics, rather than a revolutionary step away from the era of general purpose computing.

Does an Intel 286 without coprocessor count? C64? I remember those, but I wouldn't say I'm extremely old (just regular old).

The IBM PC that debuted with the 286 was the PC/AT ("Advanced Technology", hah) that is best known for introducing the AT bus later called the ISA bus that led to the proliferation of video cards, sound cards, and other expansion cards that made the PC what it is today.

I'm actually not sure there ever was a "true CPU computer age" where all processing was CPU-bound/CPU-based. Even the deservedly beloved MOS 6502 processor that powered everything for a hot decade or so was considered merely a "micro-controller" rather than a "micro-processor" and nearly every use of the MOS 6502 involved a lot of machine-specific video chips, memory management chips. The NES design lasted so long in part because toward then end cartridges would sometimes have entirely custom processing chips pulling work off the MOS 6502.

Even the mainframe era term itself "Central Processing Unit" has always sort of implied it always works in tandem with other "processing units", it's just the most central. (In some mainframe designs I think this was even quite literal in floorplan.) Of course too, when your CPU is a massive tower full of boards that make up individual operations and very the opposite of an Integrated Circuit, it's quite tough to call those a "general purpose CPU" as we imagine them today.

The C64 had the famous SID chip (MOS 6581) for sound. Famous in the chiptune scene at any rate. https://en.wikipedia.org/wiki/MOS_Technology_6581

The C64 was discontinued in 1994 but is technically still available in 2021 as the C64 Mini. ;-)

The C64 mini runs on an ARM processor, so that doesn't count in this context. Also I just learned that the processor in the C64 had two coprocessors for sound and graphics (?). So maybe that also doesn't count.

All this talk of "media engines" and GPUs reminds me of the old SID chip in the Commodore 64 and the Amiga with Agnes and Fat Agnes.

discrete floating point co-processors and vodoo graphics come to my mind

I think the higher memory is also a huge win, with support for up to 64gb.

400GB/s available to the CPU cores in a unified memory, that is going to really help certain workloads that are very memory dominant on modern architectures. Both Intel and AMD are solving this with ever increasing L3 cache sizes but just using attached memory in a SOC has vastly higher memory bandwidth potential and probably better latency too especially on work that doesn't fit in ~32MB of L3 cache.

The M1 still uses DDR memory at the end of the day, it's just physically closer to the core. This is in contrast to L3 which is actual SRAM on the core.

The DDR being closer to the core may or may not allow the memory to run at higher speeds due to better signal integrity, but you can purchase DDR4-5333 today whereas the M1 uses 4266.

The real advantage is the M1 Max uses 8 channels, which is impressive considering that's as many as an AMD EPYC, but operates at like twice the speed at the same time.

Just to underscore this, memory physically closer to the cores has improved tRAS times measured in nanoseconds. This has the secondary effect of boosting the performance of the last-level cache since it can fill lines on a cache miss much faster.

The step up from DDR4 to DDR5 will help fill cache misses that are predictable, but everybody uses a prefetcher already, the net effect of DDR5 is mostly just better efficiency.

The change Apple is making, moving the memory closer to the cores, improves unpredicted cache misses. That's significant.

> Just to underscore this, memory physically closer to the cores has improved tRAS times measured in nanoseconds.

I doubt that tRAS timing is affected by how close / far a DRAM chip is from the core. Its just a RAS command after all: transfer data from DRAM to the sense-amplifiers.

If tRAS has improved, I'd be curious how it was done. Its one of those values that's basically been constant (on a nanosecond basis) for 20 years.

Most DDR3 / DDR4 improvements have been about breaking up the chip into more-and-more groups, so that Group#1 can be issued a RAS command, then Group#2 can be issued a separate RAS command. This doesn't lower latency, it just allows the memory subsystem to parallelize the requests (increasing bandwidth but not improving the actual command latency specifically).

The physically shorter wiring is doing basically nothing. That's not where any of the latency bottlenecks are for RAM. If it was physically on-die, like HBM, that'd be maybe different. But we're still talking regular LPDDR5 using off the shelf dram modules. The shorter wiring would potentially improve signal quality, but ground shields do that, too. And Apple isn't exceeding any specs on this (ie, it's not overclocked), so above average signal integrity isn't translating into any performance gains anyway.

improved tRAS times

Has this been documented anywhere? What timings are Apple using?

Apple also uses massive cache sizes, compared to the industry.

They put a 32 megabyte system level cache in their latest phone chip.

>at 32MB, the new A15 dwarfs the competition’s implementations, such as the 3MB SLC on the Snapdragon 888 or the estimated 6-8MB SLC on the Exynos 2100


It will be interesting to see how big they go on these chips.

> Apple also uses massive cache sizes, compared to the industry.

AMD's upcoming Ryzen are supposed to have 192MB L3 "v-cache" SRAM stacked above each chiplet. Current chiplets are 8-core. I'm not sure if this is a single chiplet but supposedly good for 2Tbps[1].

Slightly bigger chip than a iphone chip yes. :) But also wow a lot of cache. Having it stacked above rather than built in to the core is another game-changing move, since a) your core has more space b) you can 3D stack many layers of cache atop.

This has already been used on their GPUs, where the 6800 & 6900 have 128MB of L3 "Infinity cache" providing 1.66TBps. It's also largely how these cards get by with "only" 512GBps worth of GDDR6 feeding them (256bit/quad-channel... at 16GT). AMD's R9 Fury from spring 2015 had 1TBps of HBM2, for compare, albeit via that slow 4096bit wide interface.

Anyhow, I'm also in awe of the speed wins Apple got here from bringing RAM in close. Cache is a huge huge help. Plus 400GBps main memory is truly awesome, and it's neat that either the CPU or GPU can make use of it.

[1] https://www.anandtech.com/show/16725/amd-demonstrates-stacke...

> The M1 still uses DDR memory at the end of the day, it's just physically closer to the core. This is in contrast to L3 which is actual SRAM on the core.

But they're probably using 8-channels of LPDDR5, if this 400GB/s number is to be believed. Which is far more memory channels / bandwidth than any normal chip released so far, EPYC and Skylake-server included.

It's more comparable to the sort of memory bus you'd typically see on a GPU... which is exactly what you'd hope for on a system with high-end integrated graphics. :)

You'd expect HBM or GDDR6 to be used. But this is seemingly LPDDR5 that's being used.

So its still quite unusual. Its like Apple decided to take commodity phone-RAM and just make many parallel channels of it... rather than using high-speed RAM to begin with.

HBM is specifically designed to be soldered near a CPU/GPU as well. For them to be soldering commodity LPDDR6 is kinda weird to me.


We know it isn't HBM because HBM is 1024-bits at lower clock speeds. Apple is saying they have 512-bits across 8 channels (64-bits per channel), which is near LPDDR5 / DDR kind of numbers.

200GBps is within the realm of 1x HBM channel (1024-bit at low clock speeds), and 400GBps is 2x HBM channels (2048-bit bus at low clock speeds).

HBM isn't just "soldered near", it's connected through a silicon interposer rather than a PCB.

Also we know it's not HBM because the word "LPDDR5" was literally on the slides :)

> just make many parallel channels of it

isn't that just how LPDDR is in general? It has much narrower channels than DDR so you need much more of them?

> isn't that just how LPDDR is in general? It has much narrower channels than DDR so you need much more of them?

Well yeah. But 400GBps is equivalent to 16x DDR4 channels. Its an absurdly huge amount of bandwidth.

> The DDR being closer to the core may or may not allow the memory to run at higher speeds due to better signal integrity, but you can purchase DDR4-5333 today whereas the M1 uses 4266.

My understanding is that bringing the RAM closer increases the bandwidth (better latency and larger buses), not necessarily the speed of the RAM dies. Also, if I am not mistaken, the RAM in the new M1s is LP-DDR5 (I read that, but it did not stay long on screen so I could be mistaken). Not sure how it is comparable with DDR4 DIMMs.

The overall bandwidth isn't affected much by the distance alone. Latency, yes, in the sense that the signal literally has to travel further, but that difference is miniscule (like 1/10th of a nanosecond) compared to overall DDR access latencies.

Better signal integrity could allow for larger busses, but I don't think this is actually a single 512 bit bus. I think it's multiple channels of smaller busses (32 or 64 bit). There's a big difference from an electrical design perspective (byte lane skew requirements are harder to meet when you have 64 of them). That said, I think multiple channels is better anyway.

The original M1 used LPDDR4 but I think the new ones use some form of DDR5.

Your comment got me thinking, and I checked the math. It turns out that light takes ~0.2 ns to travel 2 inches. But the speed of signal propagation in copper is ~0.6 c, so that takes it up to 0.3 ns. So, still pretty small compared to the overall latencies (~13-18 ns for DDR5) but it's not negligible.

I do wonder if there are nonlinearities that come in to play when it comes to these bottlenecks. Yes, by moving the RAM closer it's only reducing the latency by 0.2 ns. But, it's also taking 1/3rd of the time that it used to, and maybe they can use that extra time to do 2 or 3 transactions instead. Latency and bandwidth are inversely related, after all!

Well, you can have high bandwidth and poor latency at the same time -- think ultra wide band radio burst from Earth to Mars -- but yeah, on a CPU with all the crazy co-optimized cache hierarchies and latency hiding it's difficult to see how changing one part of the system changes the whole. For instance, if you switched 16GB of DRAM for 4GB of SRAM, you could probably cut down the cache-miss latency a lot -- but do you care? If you cache hit rate is high enough, probably not. Then again, maybe chopping the worst case lets you move allocation away from L3 and L2 and into L1, which gets you a win again.

I suspect the only people who really know are the CPU manufacturer teams that run PIN/dynamorio traces against models -- and I also suspect that they are NDA'd through this life and the next and the only way we will ever know about the tradeoffs are when we see them pop up in actual designs years down the road.

DRAM latencies are pretty heinous. It makes me wonder if the memory industry will go through a similar transition to the storage industry's HDD->SSD sometime in the not too distant future.

I wonder about the practicalities of going to SRAM for main memory. I doubt silicon real estate would be the limiting factor (1T1C to 6T, isn't it?) and Apple charges a king's ransom for RAM anyway. Power might be a problem though. Does anyone have figures for SRAM power consumption on modern processes?

>> I wonder about the practicalities of going to SRAM for main memory. I doubt silicon real estate would be the limiting factor (1T1C to 6T, isn't it?) and Apple charges a king's ransom for RAM anyway. Power might be a problem though. Does anyone have figures for SRAM power consumption on modern processes?

I've been wondering about this for years. Assuming the difference is similar to the old days, I'd take 2-4GB of SRAM over 32GB of DRAM any day. Last time this came up people claimed SRAM power consumption would be prohibitive, but I have a hard time seeing that given these 50B transistor chips running at several GHz. Most of the transistors in an SRAM are not switching, so they should be optimized for leakage and they'd still be way faster than DRAM.

> The overall bandwidth isn't affected much by the distance alone.

Testing showed that the M1's performance cores had a surprising amount of memory bandwidth.

>One aspect we’ve never really had the opportunity to test is exactly how good Apple’s cores are in terms of memory bandwidth. Inside of the M1, the results are ground-breaking: A single Firestorm achieves memory reads up to around 58GB/s, with memory writes coming in at 33-36GB/s. Most importantly, memory copies land in at 60 to 62GB/s depending if you’re using scalar or vector instructions. The fact that a single Firestorm core can almost saturate the memory controllers is astounding and something we’ve never seen in a design before.


It just said that bandwidth between a performance core and memory controller is great. It's not related to distance between memory controller and DRAM.

L3 is almost never SRAM, it's usually eDRAM and clocked significantly lower than L1 or L2.

(SRAM is prohibitively expensive to do at scale due to die area required).

Edit: Nope, I'm wrong. It's pretty much only Power that has this.

As far as I'm aware, IBM is one of the few chip-designers who have eDRAM capabilities.

IBM has eDRAM on a number of chips in varying capacities, but... its difficult for me to think of Intel, AMD, Apple, ARM, or other chips that have eDRAM of any kind.

Intel had one: the eDRAM "Crystalwell" chip, but that is seemingly a one-off and never attempted again. Even then, this was a 2nd die that was "glued" onto the main chip, and not like IBM's truly eDRAM (embedded into the same process).

You're right. My bad. It's much less common than I'd thought. (Intel had it on a number of chips that included the Iron Pro Graphics across Haswell, Broadwell, Skylake etc)

But only the Iris Pro 5200 (codename: Crystalwell) had eDRAM. All other Iris Pro were just normal DDR4.

EDIT: Oh, apparently there were smaller 64MB eDRAM on later chips, as you mentioned. Well, today I learned something.

Ha, I still use an intel 5775c in my home server!

I think the chip you are talking about is Broadwell.

Broadwell was the CPU-core.

Crystalwell was the codename for the eDRAM that was grafted onto Broadwell. (EDIT: Apparently Haswell, but... yeah. Crystalwell + Haswell for eDRAM goodness)

L3 is SRAM on all AMD Ryzen chips that I'm aware of.

I think it's the same with Intel too except for that one 5th gen chip.

Good point. Especially since a lot of software these days is not all that cache friendly. Realistically this means we have 2 years or so till further abstractions eat up the performance gains.

> 400GB/s available to the CPU cores in a unified memory

It's not just throughput that counts, but latency. Any numbers to compare there?

We'll have to wait for the AnandTech review but memory latency should be similar to Intel and AMD.

I'm thinking with that much bandwidth, maybe they will roll out SVE2 with vlen=512/1024 for future M series.

AVX512 suffers from bandwidth on desktop. But now the bandwidth is just huge and SVE2 is naturally scalable. Sounds like free lunch?

I thought the memory was one of the more interesting bits here.

My 2-year-old Intel MBP has 64 GB, and 8 GB of additional memory on the GPU. True, on the M1 Max you don't have to copy back and forth between CPU and GPU thanks to integrated memory, but the new MBP still has less total memory than my 2-year-old Intel MBP.

And it seems they just barely managed to get to 64 GiB. The whole processor chip is surrounded by memory chips. That's in part why I'm curious to see how they'll scale this. One idea would be to just have several M1 Max SoCs on a board, but that's going to be interesting to program. And getting to 1 TB of memory seems infeasible too.

Just some genuine honest curiosity here; how many workloads actually require 64gb of ram? For instance, I'm an amateur in the music production scene, and I know that sampling heavy work flows benefit from being able to load more audio clips fully into RAM rather than streaming them from disk. But 64g seems a tad overkill even for that.

I guess for me I would prefer an emphasis on speed/bandwidth rather than size, but I'm also aware there are workloads that I'm completely ignorant of.

Can’t answer for music, but as a developer a sure way to waste a lot of RAM is to run a bunch of virtual machines, containers or device simulators.

I have 32GB, so unless I’m careless everything usually fits in memory without swapping. If you got over things get slow and you notice.

Same, I tend to get everything in 32GB but more and more often I'm going over that and having things slow down. I've also nuked an SSD in a 16GB MBP due to incredibly high swap activity. It would make no sense for me to buy another 32GB machine if I want it to last five years.

Don’t run Chrome and Slack at the same time :)

So run Slack inside Chrome? :)

How do you track the swap activity? What would you call “high” swap activity?

Open Activity Monitor, select Memory and there's "Swap used" down the bottom

My laptop has 128GB for running several VMs that build C++ code, and Slack.

Another anecdote from someone who is also in the music production scene - 32GB tended to be the "sweet spot" in my personal case for the longest time, but I'm finding myself hitting the limits more and more as I continue to add more orchestral tracks which span well over 100 tracks total in my workflows.

I'm finding I need to commit and print a lot of these. Logic's little checker in the upper right showing RAM, Disk IO, CPU, etc also show that it is getting close to memory limits on certain instruments with many layers.

So as someone who would be willing to dump $4k into a laptop where its main workload is only audio production, I would feel much safer going with 64GB knowing there's no real upgrade if I were to go with the 32GB model outside of buying a totally new machine.

Edit: And yes, there is does show the typical "fear of committing" issue that plagues all of us people making music. It's more of a "nice to have" than a necessity, but I would still consider it a wise investment. At least in my eyes. Everyone's workflow varies and others have different opinions on the matter.

I know the main reason why the Mac Pro has options for LRDIMMs for terabytes of RAM is specifically for audio production, where people are basically using their system memory as cache for their entire instrument library.

I have to wonder how Apple plans to replace the Mac Pro - the whole benefit of M1 is that gluing the memory to the chip (in a user-hostile way) provides significant performance benefits; but I don't see Apple actually engineering a 1TB+ RAM SKU or an Apple Silicon machine with socketed DRAM channels anytime soon.

I wonder about that too.

My bet is that they will get rid of the Mac Pro entirely. Too low ROI for them at this point.

My hope is to see an ARM workstation where all components are standard and serviceable.

I cannot believe we are in the era of glued batteries and soldered SSDs that are guaranteed to fail and take the whole machine with them.

I think we'd probably see apple use the fast and slow ram method that old computers used back in the 90's.

16-32GB of RAM on the SOC, with DRAM sockets for usage past the built in amount.

Though by the time we see an ARM MacPro they might move to stacked DRAM on the SOC. But i'd really think two tier memory system would be apple's method of choice.

I'd also expect a dual SOC setup.

So I don't expect to see that anytime soon.

I'd love to get my hands on a Mac Mini with the M1 Max.

I went for 64GB. I have one game where 32GB is on the ragged edge - so for the difference it just wasn't worth haggling over. Plus it doubled the memory bandwidth - nice bonus.

And unused RAM isn't wasted - the system will use it for caching. Frankly I see memory as one of the cheapest performance variables you can tweak in any system.

> how many workloads actually require 64gb of ram?

Don't worry, Chrome will eat that up in no time!

More seriously, I look forward to more RAM for some of the datasets I work with. At least so I don't have to close everything else while running those workloads.

I ran 512GB on my home server, 256GB on my desktop and 128GB on small factor desktop that I take with me to summer cottage.

Some of my projects work with big in memory databases. Add regular tasks and video processing on top and there you go.

As a data scientist, I sometimes find myself going over 64 GB. Of course it all depends on how large data I'm working on. 128 GB RAM helps even with data of "just" 10-15 GB, since I can write quick exploratory transformation pipelines without having to think about keeping the number of copies down.

I could of course chop up the workload earlier, or use samples more often. Still, while not strictly necessary, I regularly find I get stuff done quicker and with less effort thanks to it.

Not many, but there are a few that need even more. My team is running SQL servers on their laptops (development and support) and when that is not enough, we go to Threadrippers with 128-256GB of RAM. Other people run Virtual Machines on their computers (I work most of the time in a VM) and you can run several VMs at the same time, eating up RAM really fast.

On a desktop Hackintosh, I started with 32GB that would die with out of memory errors when I was processing 16bit RAW images at full resolution. Because it was Hackintosh, I was able to upgrade to 64GB so the processing could complete. That was the only thing running.

What image dimensions? What app? I find this extremely suspect, but it’s plausible if you’ve way undersold what you’re doing. 24Mpixel 16bit RAW image would have no problem generally on an 4gb machine if it’s truly the only app running and the app isn’t shit. ;)

I shoot timelapse using Canon 5D RAW images, I don't know the exact dimensions off the top of my head but greater than 5000px wide. I then grade them using various programs, ultimately using After Effects to render out full frame ProRes 4444. After Effects was running out of memory. It would crash and fail to render my file. It would display an error message that told me specifically it was out of memory. I increased the memory available to the system. The error goes away.

But I love the fact that you have this cute little theory to doubt my actual experience to infer that I would make this up.

> But I love the fact that you have this cute little theory to doubt my actual experience to infer that I would make this up.

The facts were suspect, your follow up is further proof I had good reason to be suspect. First off, the RAW images from a 5D aren’t 16 bit. ;) Importantly, the out of memory error had nothing to do with the “16 bit RAW files”, it was video rendering lots of high res images that was the issue which is a very different issue and of course lots of RAM is needed there. Anyway, notice I said “but it’s plausible if you’ve way undersold what you’re doing”, which is definitely the case here, so I’m not sure why it bothered you.

Yes, Canon RAW images are 14bit. Once opened in After Effects, you are working in 16bit space. Are you just trying to be argumentative for the fun?

>> die with out of memory errors when I was processing 16bit RAW images

> Canon RAW images are 14bit

You don’t see the issue?

> Are you just trying to be argumentative for the fun?

In the beginning, I very politely asked a clarifying question making sure not to call you a liar as I was sure there was more to the story. You’re the one who’s been defensive and combative since, and honestly misrepresenting facts the entire time. Where you wrong at any point? Only slightly, but you left out so many details that were actually important to the story for anyone to get any value out of your anecdata. Thanks to my persistence, anyone who wanted to learn from your experience now can.

Not the person you're replying to.

>> I was processing 16bit RAW images at full resolution.

>> ...using After Effects to render out full frame ProRes 4444.

Those are two different applications to most of us. No one is accusing you of making things up, just that the first post wasn't fully descriptive of your use case.

Working with video will use up an extraordinary amount of memory.

Some of the genetics stuff I work on requires absolute gobs of RAM. I have a single process that requires around 400GB of RAM that I need to run quite regularly.

I can exhaust my 64GB just opening browser tabs for documentation.

In case your statement is only a slight sarcasm:

Isn’t that just the OS saying “unused memory is wasted memory”? Most of it is likely cache that can easily be evicted with higher memory pressure.

It’s a slight exaggeration, I also have an editor open and some dev process (test runner usually). It’s not just caching, I routinely hit >30 GB swap with fans revved to the max and fairly often this becomes unstable enough to require a reboot even after manually closing as much as I can.

I mean, some of this comes down to poor executive function on my part, failing to manage resources I’m no longer using. But that’s also a valid use case for me and I’m much more effective at whatever I’m doing if I can defer it with a larger memory capacity.

Which OS do you use? It’s definitely not a problem on your part, the OS should be managing it completely transparently.

It is the OS saying “unused memory is wasted memory”, and then every other application thinking they're OS and doing the same.

Since applications have virtual memory, it sort of doesn’t matter? The OS will map these to actual pages based on how many processes are available, etc. So if only one app runs and it wants lots of memory, it makes sense to give it lots of memory - that is the most “economical” decision from both a energy and performance POV.

> So if only one app runs

You answered yourself.

So, M1 has been out for a while now, with HN doom and gloom about not being able to put enough memory into them. Real world usage has demonstrated far less memory usage than people expected (I don't know why, maybe someone paid attention and can say). The result is that 32G is a LOT of memory for an M1-based laptop, and 64G is only needed for very specific workloads I would expect.

Measuring memory usage is a complicated topic and just adding numbers up overestimates it pretty badly. The different priorities of memory are something like 1. wired (must be in RAM) 2. dirty (can be swapped) 3. purgeable (can be deleted and recomputed) 4. file backed dirty (can be written to disk) 4. file backed clean (can be read back in).

Also note M1's unified memory model is actually worse for memory use not better. Details left as an exercise for the reader.

Unified memory is a performance/utilisation tradeoff. I think the thing is it's more of an issue with lower memory specs. The fact you don't have 4GB (or even 2 GB) dedicated memory on a graphics card in a machine with 8GB of main memory is a much bigger deal than not having 8GB on the graphics card on a machine with 64 GB of main RAM.

Or like games, even semi-casual ones. Civ6 would not load at all on my mac mini. Also had to fairly frequently close browser windows as I ran out of memory.

I couldn't load Civ6 until I verified game files in Steam, and now it works pretty perfectly. I'm on 8GB and always have Chrome, Apple Music and OmniFocus running alongside.

Huh, thank you I will try this.

I'm interested to see how the GPU on these performs, I pretty much disable the dGPU on my i9 MBP because it bogs my machine down. So for me it's essentially the same amount of memory.

> but the new MBP still has less total memory

From the perspective of your GPU, that 64GB of main memory attached to your CPU is almost as slow to fetch from as if it were memory on a separate NUMA node, or even pages swapped to an NVMe disk. It may as well not be considered "memory" at all. It's effectively a secondary storage tier.

Which means that you can't really do "GPU things" (e.g. working with hugely detailed models where it's the model itself, not the textures, that take up the space) as if you had 64GB of memory. You can maybe break apart the problem, but maybe not; it all depends on the workload. (For example, you can't really run a Tensorflow model on a GPU with less memory than the model size. Making it work would be like trying to distribute a graph-database routing query across nodes — constant back-and-forth that multiplies the runtime exponentially. Even though each step is parallelizable, on the whole it's the opposite of an embarrassingly-parallel problem.)

That's not how M1's unified memory works.

>The SoC has access to 16GB of unified memory. This uses 4266 MT/s LPDDR4X SDRAM (synchronous DRAM) and is mounted with the SoC using a system-in-package (SiP) design. A SoC is built from a single semiconductor die whereas a SiP connects two or more semiconductor dies. SDRAM operations are synchronised to the SoC processing clock speed. Apple describes the SDRAM as a single pool of high-bandwidth, low-latency memory, allowing apps to share data between the CPU, GPU, and Neural Engine efficiently. In other words, this memory is shared between the three different compute engines and their cores. The three don't have their own individual memory resources, which would need data moved into them. This would happen when, for example, an app executing in the CPU needs graphics processing – meaning the GPU swings into action, using data in its memory. https://www.theregister.com/2020/11/19/apple_m1_high_bandwid...

These Macs are gonna be machine learning beasts.

I know; I was talking about the computer the person I was replying to already owns.

The GP said that they already essentially have 64GB+8GB of memory in their Intel MBP; but they don't, because it's not unified, and so the GPU can't access the 64GB. So they can only load 8GB-wide models.

Whereas with the M1 Pro/Max the GPU can access the 64GB, and so can load 64GB-wide models.

It seems I misunderstood.

so whats the implication of this?

that apples specific use cases for the m1 series is basically "prosumer" ?

(sorry if i'm just repeating something obvious)

Memory is very stackable if needed, since the power per unit area is very low.

How much of that 64 GB is in use at the same time though? Caching not recently used stuff from DRAM out to an SSD isn't actually that slow, especially with the high speed SSD that Apple uses.

Right. And to me, this is the interesting part. There's always been that size/speed tradeoff ... by putting huge amounts of memory bandwidth on "less" main RAM, it becomes almost half-ram-half-cache; and by making the SSD fast it becomes more like massive big half-hd-half-cache. It does wear them out, however.

Why 1TB? 640GB ought to be enough for anything...

Huh, I guess that was as bad an idea as the 640K one.

How much per 8K x 10 bit color, video frame?

Roughly 190GB per minute without sound.

Trying to do special effects on more than a few seconds of 8K video would overwhelm a 64GB system, I suspect.

You were (unintentionally) trolled. My first post up there was alluding to the legend that Bill Gates once said, speaking of the original IBM PC, "640K of memory should be enough for anybody." (N.B. He didn't[0])

[0] https://www.wired.com/1997/01/did-gates-really-say-640k-is-e...

Video and VFX generally don't need to keep whole sequences in RAM persistently these days because:

1. The high-end SSDs in all Macs can keep up with that data rate (3GB/sec) 2. Real-time video work is virtually always performed on compressed (even losslessly compressed) streams, so the data rate to stream is less than that.

And NVMe with 7.5gbps are like, we are almost not even note worthy haha Impressive all around.

It's not that noteworthy, given that affordable Samsung 980 Pro SSDs have been doing those speeds for well over a year now.

980 pro maxes at 7.

But it's also been around for at least a year. And upcoming pcie 5 SSDs will up that to 10-14GBps.

I'm saying Apple might have wanted to emphasise their more standout achievements. Such as on the CPU front, where they're likely to be well ahead for a year - competition won't catch up until AMD starts shipping 5nm Zen4 CPUs in Q3/Q4 2022.

Apple has a well over 5 year advantage when compared to their competition.

That is very difficult to believe, short of sabotage.

Apple has a node advantage.

I'm guessing that's new for the 13" or for the M1, but my 16‑inch MacBook Pro purchased last year had 64GB of memory. (Looks like it's considered a 2019 model, despite being purchased in September 2020).

I don't think this is an apples to apples comparison because of how the new unified memory works

Well technically it is an Apple to Apple comparison in his case.

It all falls apart when the apple contains something non-apple.

Right the Intels supported 64gb, but the 16gb limitation on the M1 was literally the only thing holding me back from upgrading.

And the much higher memory bandwidth

Actually no "extra" chip area in comparison to x86 based solution.

They just throw away so much of cruft from the die like PCIE PHYs, and x86 legacy I/O with large area analog circuitry.

Redundant complex DMA, and memory controller IPs are also thrown away.

Clock, and power rails on the SoC are also probably taking less space because of more shared circuitry.

Same with self-test, debug, fusing blocks, and other small tidbits.

This is very interesting and first time I’ve heard / thought about this. Wonder how much power efficiency comes from exactly these things?

PCI is quite power hungry when it works on full trottle.

The seemed power efficiency when PCIE was going 1.0 2.0 3.0 ... was due to dynamic power control, and link sleep.

On top of it, they simply don't haul memory nonstop over PCIE anymore, since data going to/from GPU is simply not moving anywhere.

Really curious if the memory bandwidth is entirely available to the CPU if the GPU is idle. An nvidia RTX3090 has nearly 1TB/s bandwidth, so the GPU is clearly going to use as much of the 400GB/s as possible. Other unified architectures have multiple channels or synchronization to memory, such that no one part of the system can access the full bandwidth. But if the CPU can access all 400GB/s, that is an absolute game changer for anything memory bound. Like 10x faster than an i9 I think?

Not sure if it will be available, but 400GB/s is way too much for 8 cores to take up. You would need some sort of avx512 to hog up that much bandwidth.

Moreover, it's not clear how much of a bandwidth/width does M1 max CPU interconnect/bus provide.


Edit: Add common sense about HPC workloads.

There is a fundamental idea called memory-access-to-computation ratio. We can't assume a 1:0 ratio since it was doing literally nothing except copying.

Typically your program needs serious fixing if it can't achieve 1:4. (This figure comes from a CUDA course. But I think it should be similar for SIMD)

Edit: also a lot of that bandwidth is fed through cache. Locality will eliminate some orders of magnitudes of memory access, depending on the code.

A single big core in the M1 could pretty much saturate the memory bandwidth available.


> Not sure if it will be available, but 400GB/s is way too much for 8 cores to take up. You would need some sort of avx512 to hog up that much bandwidth.

If we assume that frequency is 3.2Ghz and IPC of 3 with well optimized code(which is conservative for performance cores since they are extremely wide) and count only performance cores we get 5 bytes for instruction. M1 supports 128-bit Arm Neon, so peak bandwidth usage per instruction(if I didn't miss anything) is 32 bytes.

Don't know the clock speed but 8 cores at 3Ghz working on 128bit SIMD is 8316 = 384GB/s so we are in the right ball park. Not that I personally have a use for that =) Oh, wait, bloody Java GC might be a use for that. (LOL, FML or both).

But the classic SIMD problem is matrix-multiplication, which doesn't need full memory bandwidth (because a lot of the calculations are happening inside of cache).

The question is: what kind of problems are people needing that want 400GB/s bandwidth on a CPU? Well, probably none frankly. The bandwidth is for the iGPU really.

The CPU just "might as well" have it, since its a system-on-a-chip. CPUs usually don't care too much about main-memory bandwidth, because its like 50ns+ away latency (or ~200 clock ticks). So to get a CPU going in any typical capacity, you'll basically want to operate out of L1 / L2 cache.

> Oh, wait, bloody Java GC might be a use for that. (LOL, FML or both).

For example, I know you meant the GC as a joke. But if you think of it, a GC is mostly following pointer->next kind of operations, which means its mostly latency bound, not bandwidth bound. It doesn't matter that you can read 400GB/s, your CPU is going to read an 8-byte pointer, wait 50-nanoseconds for the RAM to respond, get the new value, and then read a new 8-byte pointer.

Unless you can fix memory latency (and hint, no one seems to be able to do so), you'll be only able to hit 160MB/s or so, no matter how high your theoretical bandwidth is, you get latency locked at a much lower value.

Yeah the marking phase cannot be efficiently vectorized. But I wonder if it can help with compacting/copying phase.

Also for me the process sounds oddly familiar to vmem table walking. There is currently a RISC-V J extension drafting group. I wonder what they can come up with.

> The question is: what kind of problems are people needing that want 400GB/s bandwidth on a CPU? Well, probably none frankly.

It is needed for analytic databases, e.g. ClickHouse: https://presentations.clickhouse.com/meetup53/optimizations/

This is some seriously interesting stuff.

But they are demonstrating with 16 cores + 30 GB/s & 128 cores + 190 GB/s. And to my understanding they did not really mention what type of computational load did they perform. So this does not sound too ridiculous. M1 max is pairing 8 cores + 400GB/s.

Doesn't prefetching data into the cache more quickly assist in execution speed here?

How do you prefetch "node->next" where "node" is in a linked list?

Answer: you literally can't. And that's why this kind of coding style will forever be latency bound.

EDIT: Prefetching works when the address can be predicted ahead of time. For example, when your CPU-core is reading "array", then "array+8", then "array+16", you can be pretty damn sure the next thing it wants to read is "array+24", so you prefetch that. There's no need to wait for the CPU to actually issue the command for "array+24", you fetch it even before the code executes.

Now if you have "0x8009230", which points to "0x81105534", which points to "0x92FB220", good luck prefetching that sequence.


Which is why servers use SMT / hyperthreading, so that the core can "switch" to another thread while waiting those 50-nanoseconds / 200-cycles or so.

I don't really know how the implementation of a tracing GC works but I was thinking they could do some smart memory ordering to land in the same cache-line as often as possible.

Thanks for the clarifications :)

Interestingly earlyish smalltalk VMs used to keep the object headers in a separate contiguous table.

Part of the problem though, is that the object graph walk pretty quickly is non contiguous, regardless of how it's laid out in memory.

But that’s just the marking phase, isn’t it? And most of it can be done fully in parallel, so while not all CPU cores can be maxed out with that, more often than not the original problem itself can be hard to parallelize to that level, so “wasting” a single core may very well be worth it.

You prefetch by having node->next as close as possible to node. You do that by using an allocator that tries very hard to ensure this.

Doesn't work that well for GC but for specific workloads it can work very nicely.

Yeah, that's a fine point.

I always like pointing out Knuth's dancing links algorithm for Exact-covering problems. All "links" in that algorithm are of the form "1 -> 2 -> 3 -> 4 -> 5" at algorithm start.

Then, as the algorithm "guesses" particular coverings, it turns into "1->3->4->5", or "1->4", that is, always monotonically increasing.

As such, no dynamic memory is needed ever. The linked-list is "statically" allocated at the start of the program, and always traversed in memory order.

Indeed, Knuth designed the scheme as "imagine doing malloc/free" to remove each link, but then later, "free/malloc" to undo the previous steps (because in Exact-covering backtracking, you'll try something, realize its a dead end, and need to backtrack). Instead of a malloc followed up by a later free, you "just" drop the node out of the linked list, and later reinsert it. So the malloc/free is completely redundant.

In particular: a given "guess" into an exact-covering problem can only "undo" its backtracking to the full problem scope. From there, each "guess" only removes possibilities. So you use the "maximum" amount of memory at program start, you "free" (but not really) nodes each time you try a guess, and then you "reinsert" those nodes to backtrack to the original scope of the problem.

Finally, when you realize that, you might as well put them all into order for not only simplicity, but also for speed on modern computers (prefetching and all that jazz).

Its a very specific situation but... it does happen sometimes.

AMD showed with their Infinity Cache that you can get away with much less bandwidth if you have large caches. It has the side effect of radically reducing power consumption.

Apple put 32MB of cache in their latest iPhone. 128 or even 256MB of L3 cache wouldn't surprise me at all given the power benefits.

I suspect the GPU is never really idle.

Even simple screen refresh blending say 5 layers and outputting it to a 4k screen is 190Gbits at 144 Hz.

Apple put ProMotion in the built in display, so while it can ramp up to 120Hz, it'll idle at more like 24 Hz when showing static content. (the iPad Pro goes all the way down to 10Hz, but some early sources seem to say 24Hz for these MacBook Pros.) There may also be panel self refresh involved, in which case a static image won't even need that much. I bet the display coprocessors will expose the adaptive refresh functionality over the external display connectors as well.

It only takes a tiny animation (eg. a spinner, pulsing glowing background, animated clock, advert somewhere, etc), and suddenly the whole screen is back to 120 Hz refresh.

That's a useful tip for saving some battery actually, thanks.

Don't know much about the graphics on an M1. Does it not render to a framebuffer? Is that framebuffer spread over all 4 memory banks? Can't wait to read all about it.

The updates from the Asahi Linux team are fantastic for getting insights into the M1 architecture. They've not really dug deep into the GPU yet, but that's coming soon.

When nothing is changing, you do not have to touch the GPU. Yes, without Panel Self Refresh there would be this many bits going to the panel at that rate, but the display engine would keep resubmitting the same buffer. No need to rerender when there's no damage. (And when there is, you don't have to rerender the whole screen, only the combined damage of the previous and current frames.)

Don't Apple iPhones use an adaptive refresh rate nowadays?

Indeed ProMotion is coming to these new MacBooks too.

More memory bandwith = 10x faster than an i9 ? this makes no sense to me doesn't clock speed and cores determine the major part of the performance of a cpu ?

Yes and no. There are many variables to take into account. An example from the early days of the PPC architecture was their ability to pre-empt instructions. This gave performance boasts even in the absence of a higher clock speed. I can't speak specifically on the M1, but there are other things outside of clock speed and cores that determine speed.

Surely the shared ram between cpu and gpu is the killer feature - zero copy and up to 64gb ram available for the gpu!

Yes, but it's a double edged sword. It means you're using relatively slow ram for the GPU, and that the GPU takes memory bandwidth away from the CPU as well. Traditionally we've ended up with something that looks like Intel's kinda crappy integrated video.

The copying process was never that much of a big deal, but paying for 8GB of graphics ram really is.

> The copying process was never that much of a big deal

I don't know about that? Texture memory management in games can be quite painful. You have to consider different hardware setups and being able to keep the textures you need for a certain scene in memory (or not, in which case, texture thrashing).

The copying process was quite a barrier to using compute (general purpose GPU) to augment CPU processing and you had to ensure that the work farmed to the GPU was worth the cost of the to/from costs. Game consoles of late have generally had unified memory (UMA) and it's quite a nice advantage because moving data is a significant bottleneck.

Using Intel's integrated video as a way to assess the benefits of unified memory is off target. Intel had a multitude of design goals for their integrated GPU and UMA was only one aspect so it's not so easy to single that out for any shortcomings that you seem to be alluding to.

If you're looking at the SKU with the high GPU core count and 64 Gigs of LPDDR5, the total memory bandwidth (400 GBps) isn't that far off from the bandwidth a discrete GPU would have to it's local pool of memory.

You also have an (estimated from die shots) 64 megabyte SRAM system level cache and large L2 and L1 CPU caches, but you are indeed sharing the memory bandwidth between the CPU and GPU.

I'm looking forward to these getting into the hands of testers.

> I'm still a bit sad that the era of "general purpose computing" where CPU can do all workloads is coming to an end.

They’ll still do all workloads, but are optimized for certain workloads. How is that any different than say, a Xeon or EPYC cpu designed for highly threaded (server/scientific computing) applications?

In this context the absence of the 27 inch iMac was interesting. If these SoC were not deemed to be 'right' for the bigger iMac then possibly a more CPU focused / developer focused SoC may be in the works for the iMac?

I doubt they are going to make different chips for prosumer devices. They are going to spread out the M1 pro/max upgrade to the rest of the lineup at some point during the next year, so they can claim "full transition" through their quoted 2 years.

The wildcard is the actual mac pro. I suspect we aren't going to hear about mac pro until next Sept/Oct events, and super unclear what direction they are going to go. Maybe allowing config of multiple M1 max SOCs somehow working together. Seems complicated.

On reflection I think they've decided that their pro users want 'more GPU not more CPU' - they could easily have added a couple more CPU cores but it obviously wasn't a priority.

Agreed that it's hard to see how designing a CPU just for the Mac Pro would make any kind of economic sense but equally struggling to see what else they can do!

There is only so much you can do improving the CPU.

On the other hand, we still haven't really reached the limit for GPUs.

I suppose it’s the software - most CPU software is primarily designed to run on a single / few cores. GPU code is massively parallel by design.

I think we will see iMac Pro with incredible performance. Mac Pros, maybe in the next years to come. It's a really high end product to release new specs. Plus, in they release it with M1 Max chips, what would be the difference? a nicer case and more upgrading slot? I don't see the advantage of power. I think Mac Pros will be upgraded like in 2 years ahead

The real question for the Mac Pros is whether the GPU is changeable or not, and what that means for shared memory.

Nah, it'll be the same M1 Max. Just wait a few months like with the 24" iMac.

You're probably right. Maybe they have inventory to work through!

They might also have supply constraints on these new chips. I suspect they are going to sell a lot of these new MacBook Pros

They also have a limited headcount and resources so they wouldn't want to announce M1x/pro/max for all machines now and have employees be idle for the next 3 months.

Notebooks also have a higher profit margin, so they sell them to those who need to upgrade now. The lower-margin systems like Mini will come later. And the Mac Pro will either die or come with the next iteration of the chips.

The Mac Pro might be blocked, or still progressing, through design changes more so than chip changes.

Yeah, M1 Pro/Max on iMac 30” likely in 2022 H1. Mini also, I imagine.

Yup. Once I saw the 24" iMac I knew the 27" had had it's chips. 30" won't actually be much bigger than the 27" if the bezels shrink to almost nothing - which seems to be the trend.

I wish 34" iMac Pro M1 Max.

I'm going to sound dumb for this, but how difficult do any of you think it would be to make a computer with 2 M1 chips? Or even 4?

They're not meant to go together like that -- there's not really an interconnect for it, or any pins on the package to enable something like that. Apple would have to design a new Mx SoC with something like that as an explicit design goal.

I think the problem would be how one chip can access the memory of the other one. The big advantage in the M1xxxx is the unified memory. I don't think the chips have any hardware to support cache coherency and so on spanning more than one chip.

You would have to implement single system image abstraction, if you wanted more than a networked cluster of M1s in a box, in the OS using just software plus virtual memory. You'd use the PCIe as the interconnect. Similar has been done by other vendors for server systems, but it has tradeoffs that would probably not make sense to Apple now.

A more realistic question would be what good hw multisocket SMP support would look like in M1 Max or later chips, as that would be a more logical thing to build if Apple wanted this.

The rumor has long been that the future Mac Pro will use 2-4 of these “M1X” dies in a single package. It remains to be seen how the inter-die interconnect will work / where those IOs are on the M1 Pro/Max die.

The M1 is a system-on-a-chip so AIUI there's a bunch of stuff on there that you wouldn't want two of.

That's not a problem, you just turn off what's redundant. AMD did the same thing with their original Threadripper/EPYC models.

I think the issue is OS management of tasks to prevent cpu 1 from having to access memory of cpu 2.

Apple are already planning multi-die systems, judging by their drivers. So yes. Rumors are they have a SKU with two M1 Max dies.

The way I interpreted it is that it's like lego so they can add more fast cores or more efficiency cores depending on the platform needs. The successor generations will be new lego building blocks.

If we're going to keep freedom in computing, let's hope "general purpose computing" is the distinction that remains for exactly that purpose.

Accelerators/Memory are just being brought on die/package here, which has been happening for a while. Integrated memory controllers come to mind.

Edit: I was wrong! Thanks for pointing it out

Not exactly. M1 CPU, GPU, and RAM were all capped in the same package. New ones appear to be more a single board soldered onto mainboard, with a discrete CPU, GPU, and RAM package each capped individually if their "internals" promo video is to be believed (and it usually is an exact representation of the shipping product) https://twitter.com/cullend/status/1450203779148783616?s=20

Suspect this is a great way for them to manage demand and various yields by having 2 CPU's (or one, if the difference between pro/ max is yield on memory bandwidth) and discrete RAM/ GPU components

CPU and GPU is one die, you're looking at RAM chips on each side of the package. The M1 also had the RAM separate but on package.

M1: https://d3nevzfk7ii3be.cloudfront.net/igi/ZRQGFteQwoIVFbNn

I know nothing about hardware, basically. Do Apple’s new GPU cores come close to the capabilities of discrete GPUs like what are used for gaming/scientific applications? Or are those cards a whole different thing?

1. If you're a gamer, this seems comparable to a 3070 Laptop, which is comparable to a 3060 Desktop.

2. If you're a ML researcher you use CUDA (which only works on NVIDIA cards), they have basically a complete software lock unless you want to spend an undefined number of X hundreds of hours fixing and troubleshooting compatibility issues.

There has been an M1 fork of Tensorflow almost since the chip launched last year. I believe Apple did the leg work. It’s a hoop to jump through, yes, and no ones training big image models or transformers with this, but I imagine students or someone sandboxing a problem offline would benefit from the increased performance over CPU only.


Seem on the long term game here my dear all. Have amd sponsor that kind of activities. Or intel. …

Could we use M1 chips on non-apple boards? If yes, I wish Apple releases these for non mac os consumption. Eg. Running linux servers in the cloud.

Not a great fit. Something like Ampere altra is better as it gives you 80 cores and much more memory which better fits a server. A server benefits more from lots of weaker cores than a few strong cores. The M1 is an awesome desktop/laptop chip and possibly great for HPC, but not for servers.

What might be more interesting is to see powerful gaming rigs built around the these chips. They could have build a kickass game console with these chips.

Why they didn't lean into that aspect of the Apple TV still mystifies me. A Wii-mote style pointing device seems such a natural fit for it, and has proven gaming utility. Maybe patents were a problem?

Why? There are plenty of server oriented ARM platforms available for use (See AWS Graviton). What benefit do you feel Apple’s platform gives over existing ones?

The Apple cores are full custom, Apple-only designs.

The AWS Graviton are Neoverse cores, which are pretty good, but clearly these Apple-only M1 cores are above-and-beyond.


That being said: these M1 cores (and Neoverse cores) are missing SMT / Hyperthreading, and a few other features I'd expect in a server product. Servers are fine with the bandwidth/latency tradeoff: more (better) bandwidth but at worse (highter) latencies.

My understanding is that you don't really need hyperthreading on a RISC CPU because decoding instructions is easier and doesn't have to be parallelised as with hyperthreading.

The DEC Alpha had SMT on their processor roadmap, but it was never implemented as their own engineers told the Compaq overlords that they could never compete with Intel.

"The 21464's origins began in the mid-1990s when computer scientist Joel Emer was inspired by Dean Tullsen's research into simultaneous multithreading (SMT) at the University of Washington."


Okay, the whole RISC thing is stupid. But ignoring that aspect of the discussion... POWER9, one of those RISC CPUs, has 8-way SMT. Neoverse E1 also has SMT-2 (aka: 2-way hyperthreading).

SMT / Hyperthreading has nothing to do with RISC / CISC or whatever. Its just a feature some people like or don't like.

RISC CPUs (Neoverse E1 / POWER9) can perfectly do SMT if the designers wanted.

Don’t think that is entirely true. Lots of features which exist on both RISC and CISC CPUs have different natural fit. Using micro-ops e.g. on a CISC is more important than in RISC CPU even if both benefit. Likewise pipelining has a more natural fit on RISC than CISC, while micro-op cache is more important on CISC than RISC.

I don't even know what RISC or CISC means anymore. They're bad, non-descriptive terms. 30 years ago, RISC or CISC meant something, but not anymore.

Today's CPUs are pipelined, out-of-order, speculative, superscalar, (sometimes) SMT, SIMD, multi-core with MESI-based snooping for cohesive caches. These words actually have meaning (and in particular, describe a particular attribute of performance for modern cores).

RISC or CISC? useful for internet flamewars I guess but I've literally never been able to use either term in a technical discussion.


I said what I said earlier: this M1 Pro / M1 Max, and the ARM Neoverse cores, are missing SMT, which seems to come standard on every other server-class CPU (POWER9, Intel Skylake-X, AMD EPYC).

Neoverse N1 makes up for it with absurdly high core counts, so maybe its not a big deal. Apple M1 however has very small core counts, I doubt that Apple M1 would be good in a server setting... at least not with this configuration. They'd have to change things dramatically to compete at the higher end.

Today RISC just means an ISA is fixed-length load-store. No uarch implications.

Here is one meaning:

Intel microcode updates added new machine opcodes to address spectre/meltdown exploits.

On a true RISC, that's not possible.


Or are you going to argue that the venerable POWER-architecture is somehow not "true RISC" ??


If its got microcode, it's not RISC.

POWER9, RISC-V, and ARM all have microcoded instructions. In particular, division, which is very complicated.

As all CPUs have decided that hardware-accelerated division is a good idea (and in particular: microcoded, single-instruction division makes more sense than spending a bunch of L1 cache on a series of instructions that everyone knows is "just division" and/or "modulo"), microcode just makes sense.

The "/" and "%" operators are just expected on any general purpose CPU these days.

30 years ago, RISC processors didn't implement divide or modulo. Today, all processors, even the "RISC" ones, implement it.

Hyper threading has nothing to do with instruction decode. It's for hiding memory latency. The sparc T line is 8 way threaded.

It's slightly more general than that, hiding inefficient use of functional units. A lot of times that's totally memory latency causing the inability to keep FUs fed like you say, but i've seen other reasons, like a wide but diverse set of FUs that have trouble applying to every workload.

The classic reason quoted for SMT is to allow the functional units to be fully utilised when there is instruction-to-instruction dependencies - that is, the input of one instruction is the output from the previous instruction. Doing SMT allow you to create one large pool of functional units and share them between multiple threads, hopefully increasing the chances that they will be fully used.

Well, tons, there isn't another ARM core that can match a single M1 Firestorm, core to core. Heck, only the highest performance x86 cores can match a Firestorm core. and that's just raw performance, not even considering power efficiency. But of course, Apple's not sharing.

Linux on M1 Macs is under development, running, last I heard.


Wasn't there a rumor that AMD was creating ARM chips? It will be great if we have ARM versions of EPYC chips.

I'd love to see someone do a serious desktop RISC-V processor.

They were, but have stopped talking about that for years. The project is probably canceled; I've heard Jim Keller talk about how that work was happening simultaneously with Zen 1.

AMD's equivalent of the Intel Management Engine (ME) is based on ARM, I think.

Intel used to use "ARC" cores, but recently converted to and i486.

This is the ARM core that lives inside AMD Xen:


anyone know what is the most optimum performance/dollars setup for this m1pro?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact