I laughed when they called it “unified memory.” Amazing what some marketing can do. In previous years, that was called “shared graphics memory” and it was only for low-end systems.
Shared graphics memory is not the same as unified memory.
In a shared graphics-memory system, part of the system RAM is reserved for the GPU. The amount of RAM reported to the OS is the total RAM minus the chunk reserved for the GPU. The OS can only use its own part and cannot access the GPU's memory (and vice versa). If you want to make something available to the GPU, it still has to be copied into the reserved GPU part.
In unified memory, both the OS and the GPU can access the entire range of memory, so there's no need for a (slow) copy. Consoles use the same strategy (using fast GDDR for both system and GPU), and it's one of the reasons consoles punch above their weight graphically.
The main reason that high-end GPUs use discrete memory is that they use high-bandwidth memory that has to live very close to the GPU. The user-replaceable RAM modules in a typical PC are much too far away from the GPU to use the same kind of high-bandwidth memory. If you drop the 'user replaceable' requirement and place everything close together, you can have the benefits of both high bandwidth and unified memory.
> If you drop the 'user replaceable' requirement and place everything close together, you can have the benefits of both high bandwidth and unified memory
Rather, if you drop the "big GPU" requirement then you can place everything close together. So-called APUs have been unified memory for years & years now (more or less Intel's entire lineup, all of AMD's laptop SKUs & some desktop ones).
It still ends up associated with low-end because there's only so much die space you can spend on an iGPU, and the M1 is no exception there. It doesn't come close to a mid-range discrete GPU, and it likely won't ever unless Apple goes chiplets so that multiple dies can share a memory controller.
With "normal" (LP)DDR you still run into severe memory bottlenecks on the GPU side as things get faster, so that becomes another issue with unified memory. Do you sacrifice CPU performance to feed the GPU by using higher-latency GDDR? Or do you sacrifice GPU performance with lower-bandwidth DDR?
"The first Apple-built GPU for a Mac is significantly faster than any integrated GPU we’ve been able to get our hands on, and will no doubt set a new high bar for GPU performance in a laptop. Based on Apple’s own die shots, it’s clear that they spent a sizable portion of the M1’s die on the GPU and associated hardware, and the payoff is a GPU that can rival even low-end discrete GPUs."
This is their first low-end offering, and they seem to be taking full advantage of UMA, more so than anyone to this point. It will be interesting to see if they continue this with a higher "pro" offering or stick with a discrete GPU to stay competitive.
My guess is Apple will be the one to make UMA integrated graphics rival discrete GPUs; it will be interesting to see if that happens.
None of those games were designed for UMA, nor do they benefit from it, since the API design for graphics forces copies to happen anyway.
The M1's GPU is good (for integrated), but UMA isn't the reason why. I don't know why people seem so determined to reduce Apple's substantial engineering in this space to a mostly inconsequential minor architecture tweak that happened a decade ago.
From the article: "Meanwhile, unlike the CPU side of this transition to Apple Silicon, the higher-level nature of graphics programming means that Apple isn’t nearly as reliant on devs to immediately prepare universal applications to take advantage of Apple’s GPU. To be sure, native CPU code is still going to produce better results since a workload that’s purely GPU-limited is almost unheard of, but the fact that existing Metal (and even OpenGL) code can be run on top of Apple’s GPU today means that it immediately benefits all games and other GPU-bound workloads."
"Note In a discrete memory model, synchronization speed is constrained by PCIe bandwidth. In a unified memory model, Metal may ignore synchronization calls completely because it only creates a single memory allocation for the resource. For more information about macOS memory models and managed resources, see Choosing a Resource Storage Mode in macOS."
I am not trying to minimize the other engineering improvements; however, I do believe UMA may be getting less credit than it deserves because of past lackluster UMA offerings. As I said, it will be interesting to see how far Apple can scale UMA. I am not sure they can catch discrete graphics, but I am starting to think they are going to try.
Apple can't just hang onto that void* that's passed in, as the developer is free to re-use it for something else after the call. It must copy, even on a UMA system. And even if the API were adjusted so that glTexImage2D took ownership of the pointer instead, there'd still be an internal copy anyway to swizzle it, since linear RGBA buffers are not friendly to typical GPU workloads. This is why, for example, Apple's docs above, when they get to the texture section, basically say "yeah, just copy & use private." So even though in theory Metal's UMA exposure would be great for games that stream textures, it still isn't, because you still do a copy anyway to convert it to the GPU's internal optimal layout.
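For reference, this is roughly what that "copy & use private" pattern looks like in Metal; a rough Swift sketch with the image size and pixel contents made up. Even on a UMA machine the blit below still happens, because the private texture gets the GPU's own tiled/swizzled layout:

    import Metal

    let device = MTLCreateSystemDefaultDevice()!
    let queue = device.makeCommandQueue()!
    let width = 256
    let height = 256
    let bytesPerRow = width * 4

    // Stand-in for a decoded, linearly laid-out RGBA image.
    let pixels = [UInt8](repeating: 255, count: bytesPerRow * height)

    // Staging copy in shared memory -- this copy happens even on UMA.
    let staging = device.makeBuffer(bytes: pixels, length: pixels.count,
                                    options: .storageModeShared)!

    // Destination texture in private storage, so the GPU keeps its own optimal layout.
    let desc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba8Unorm,
                                                        width: width, height: height,
                                                        mipmapped: false)
    desc.storageMode = .private
    desc.usage = .shaderRead
    let texture = device.makeTexture(descriptor: desc)!

    let commandBuffer = queue.makeCommandBuffer()!
    let blit = commandBuffer.makeBlitCommandEncoder()!
    blit.copy(from: staging, sourceOffset: 0,
              sourceBytesPerRow: bytesPerRow, sourceBytesPerImage: bytesPerRow * height,
              sourceSize: MTLSizeMake(width, height, 1),
              to: texture, destinationSlice: 0, destinationLevel: 0,
              destinationOrigin: MTLOriginMake(0, 0, 0))
    blit.endEncoding()
    commandBuffer.commit()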
Similarly, the benefits of UMA only help if transferring data is actually a significant part of the workload, which is not true for the vast majority of games. For things like gfxbench it may help speed up the load time, but during the benchmark loop all the big objects (like textures & models) are only used on the GPU.
Any back-and-forth between the CPU and GPU will be faster with unified memory, especially with a coherent on-die cache.
This is the same model as iOS, so just about anyone doing Metal will already be optimizing for it, same as with any other mobile development.
It doesn't seem like a minor architectural difference to me:
"Comparing the two GPU architectures, TBDR has the following advantages:
It drastically saves on memory bandwidth because of the unified memory architecture.
Blending happens in-register facilitated by tile processing.
Color, depth and stencil buffers don’t need to be re-fetched."
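To make the last point in that quote concrete: on Apple's TBDR GPUs a depth or stencil attachment can be declared memoryless, so it only ever lives in on-chip tile memory and is never written back to or re-fetched from RAM. A rough Swift sketch (resolution picked arbitrarily, Apple-GPU-only feature):

    import Metal

    let device = MTLCreateSystemDefaultDevice()!

    // Depth attachment that exists only in tile memory on a TBDR GPU.
    let depthDesc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .depth32Float,
                                                             width: 1920, height: 1080,
                                                             mipmapped: false)
    depthDesc.usage = .renderTarget
    depthDesc.storageMode = .memoryless
    let depthTexture = device.makeTexture(descriptor: depthDesc)!

    let pass = MTLRenderPassDescriptor()
    pass.depthAttachment.texture = depthTexture
    pass.depthAttachment.loadAction = .clear      // cleared in tile memory
    pass.depthAttachment.storeAction = .dontCare  // nothing flushed back to RAM after the pass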
> I believe most of the benchmarks were Metal-based in the Anand article
But that doesn't tell you anything. Being Metal-based doesn't mean they were designed for, or benefit from, UMA.
Especially since, again, Apple's own recommendation on big data (read: textures) is to copy it.
> Any back-and-forth between the CPU and GPU will be faster with unified memory, especially with a coherent on-die cache.
Yes, but games & gfxbench don't do this, which is what I keep trying to get across. There are workloads out there that will benefit from this, but the games & benchmarks that were run & being discussed aren't them. It's like claiming the SunSpider results come from Wi-Fi 6 improvements. There are web experiences that will benefit from faster Wi-Fi, but SunSpider ain't one of them.
Things like GPGPU compute can benefit tremendously here, for example.
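A tiny compute sketch of the kind of thing I mean (the kernel is made up, and error handling is skipped). The point is the last couple of lines: on a UMA device the CPU reads the results straight out of the same allocation the GPU wrote to, with no readback copy:

    import Metal

    let source = """
    #include <metal_stdlib>
    using namespace metal;
    kernel void doubleValues(device float *data [[buffer(0)]],
                             uint id [[thread_position_in_grid]]) {
        data[id] *= 2.0f;
    }
    """

    let device = MTLCreateSystemDefaultDevice()!
    let library = try! device.makeLibrary(source: source, options: nil)
    let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "doubleValues")!)
    let queue = device.makeCommandQueue()!

    let input: [Float] = [1, 2, 3, 4]
    // Shared storage: the same allocation is written by the GPU and read by the CPU.
    let buffer = device.makeBuffer(bytes: input, length: input.count * MemoryLayout<Float>.stride,
                                   options: .storageModeShared)!

    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(buffer, offset: 0, index: 0)
    encoder.dispatchThreadgroups(MTLSizeMake(1, 1, 1),
                                 threadsPerThreadgroup: MTLSizeMake(input.count, 1, 1))
    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    // No blit, no readback: the results are visible to the CPU in place.
    let results = buffer.contents().bindMemory(to: Float.self, capacity: input.count)
    print(Array(UnsafeBufferPointer(start: results, count: input.count)))  // [2.0, 4.0, 6.0, 8.0]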
> also PBOs have been around for quite a while in OpenGL:
PBOs reduce the number of copies from 2 to 1 in some cases, not from 1 to 0. You still copy from the PBO to your texture target, but it can potentially avoid a CPU-to-CPU copy first. When you call glTexImage2D it doesn't necessarily do the transfer right then; it may instead copy to a different CPU buffer to be copied to the GPU later.
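Rough sketch of that path against macOS's (deprecated) OpenGL headers, written in Swift to keep the examples in one language; it assumes a current GL context and an already-bound 2D texture. The two commented copies are the ones being counted:

    import OpenGL.GL3

    // Sketch only: `pixels` stands in for a decoded RGBA image.
    func uploadThroughPBO(pixels: [UInt8], width: GLsizei, height: GLsizei) {
        var pbo: GLuint = 0
        glGenBuffers(1, &pbo)
        glBindBuffer(GLenum(GL_PIXEL_UNPACK_BUFFER), pbo)

        // Copy #1: application memory -> driver-owned PBO storage.
        pixels.withUnsafeBytes { raw in
            glBufferData(GLenum(GL_PIXEL_UNPACK_BUFFER), GLsizeiptr(raw.count),
                         raw.baseAddress, GLenum(GL_STREAM_DRAW))
        }

        // Copy #2: PBO -> the texture's internal (tiled/swizzled) layout.
        // With a PBO bound, the last argument is an offset into the PBO rather than
        // a CPU pointer, so the driver needs no extra CPU-side staging copy of its own.
        glTexImage2D(GLenum(GL_TEXTURE_2D), 0, GL_RGBA8, width, height, 0,
                     GLenum(GL_RGBA), GLenum(GL_UNSIGNED_BYTE), nil)

        glBindBuffer(GLenum(GL_PIXEL_UNPACK_BUFFER), 0)
        glDeleteBuffers(1, &pbo)
    }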
> "Comparing the two GPU architectures, TBDR has the following advantages:
> It drastically saves on memory bandwidth because of the unified memory architecture. Blending happens in-register facilitated by tile processing. Color, depth and stencil buffers don’t need to be re-fetched."
First it's "APIs don't support it, can't pin memory" (which is what a PBO does). Then it's "oh well, they are not taking advantage of it." Move the goalposts much?
TBDR came to prominence in UMA mobile architectures; it's a big part of what lets them perform so well with limited memory bandwidth. The M1 is just an evolution of Apple's mobile designs, and of PowerVR before that.
> First it's "APIs don't support it, can't pin memory" (which is what a PBO does). Then it's "oh well, they are not taking advantage of it." Move the goalposts much?
No, they don't, so no, I didn't move the goalposts at all. PBOs are a transfer object. You cannot sample from them on the GPU. The only thing you can do with PBOs is copy them to something you can use on the GPU.
As such, PBOs do not let you take advantage of UMA. In fact, their primary benefit is for non-UMA in the first place. UMA systems have no issues blocking glTexImage2D until the copy to GPU memory is done, but non-UMA ones do. And non-UMA ones are what gave us PBOs.
> TBDR came to prominence in UMA mobile architectures; it's a big part of what lets them perform so well with limited memory bandwidth.
Support that with a theory or evidence of literally any kind. There's nothing at all in TBDR's sequence of events that has any apparent benefit from UMA.
Look at the sequence of steps. ARM doesn't even bother including a CPU in there, so which step would UMA be helping with?
What UMA can do here is improve power efficiency by reducing the cost of sending the command buffers to the GPU, but that's not going to get you a performance improvement, as those command buffers are not very big. If sending data from the CPU to the GPU were such a severe bottleneck, then you'd see the impact of things like reducing PCIe bandwidth on discrete GPUs, but you don't.
The modern approach to textures is to precompile them, so you can hand the data straight over. It's not as common to have to convert a linear texture to a swizzled one, though it can happen.
Also, the Apple advice for OpenGL textures was always focused on avoiding unnecessary copies (for instance, there's another copy that could happen CPU-side if your data wasn't aligned well enough to be DMA'd).
One reason M1 textures use less memory is that prior systems had AMD/Intel graphics switching, so you needed to keep another copy of everything in case you switched GPUs.
As SigmundA points out, a huge advantage Apple has is control of the APIs (Metal, etc.) and the ability to structure them years ago so that the API can simply skip entire operations (even when ordered to do them) because it's known they're not needed. An analogy would be a copy-on-write filesystem (or RAM!) that doesn't actually do a copy when asked to: it returns immediately with a pointer, and only copies if asked to write to it.
Yeah, I believe the M1 GPU (2.6 TFLOPS) falls between a PS4 (1.8 TFLOPS) and a PS4 Pro (4.2 TFLOPS). Yes, the original PS4 came out in 2013, but I still find it impressive that a mobile integrated GPU has that much computational power with no fan and at that power budget.
I do wonder what they are going to do with the higher-end MBP, iMacs, and Mac Pro (if they make one). Will they have an “M1X” with more GPU cores, or will they offer a discrete option with AMD GPUs? I do think we could potentially see an answer at WWDC. I wouldn’t be surprised if eGPU support was announced for ARM Macs there.
Yes, the Amiga had a form of UMA, as did many other systems. The term UMA seems more widely used than "shared memory"; it's definitely not just a marketing term.
I don't believe Apple claimed to invent unified memory, only that they are taking maximum advantage of the architecture, more so than anyone to this point.
Federighi:
"We not only got the great advantage of just the raw performance of our GPU, but just as important was the fact that with the unified memory architecture, we weren't moving data constantly back and forth and changing formats that slowed it down. And we got a huge increase in performance."
This seems to be talking about the 16 MB SLC on-die cache that the CPU, GPU, and other IP cores share:
"Where old-school GPUs would basically operate on the entire frame at once, we operate on tiles that we can move into extremely fast on-chip memory, and then perform a huge sequence of operations with all the different execution units on that tile. It's incredibly bandwidth-efficient in a way that these discrete GPUs are not. And then you just combine that with the massive width of our pipeline to RAM and the other efficiencies of the chip, and it’s a better architecture."
As far as I understand, it has a lot to do with the actual design of the processor[1], and not so much to do with the on-chip memory or the software integration.
> I'm still assuming it's faster due to unified memory architecture
AMD, Intel, ARM, & Qualcomm have all been shipping unified memory for 5+ years. I'd assume all the A* SoCs have been unified memory for that matter too unless Apple made the weirdest of cost cuts.
Moreover, literally none of the benchmarks out there include anything at all that involves copying/moving data between the CPU, GPU, and AI units. They are almost always strictly-CPU benchmarks (which the M1 does great in) or strictly-GPU benchmarks (where the M1 is good for integrated, but that's about it).
> An AMD/Intel chip using the same soldered RAM next to the CPU and the same process node would give Apple a run for its money.
AMD's memory latency is already better than the M1's. Apple's soldered RAM isn't a performance choice:
For an equivalent laptop-specific CPU, you will get a speedup from on-package RAM vs. user-replaceable RAM placed further away; even desktops would benefit, but it would not be a welcome change there.
That's not really how DRAM latency works. In basically all CPUs the memory controller runs at a different clock than the CPU cores do, typically at the same clock as the DRAM itself, but not always.
If you meant the DRAM was running faster on the AMD system, then also no. The M1 is using 4266 MHz modules, while the AMD system was running 3200 MHz RAM.
> For an equivalent laptop-specific CPU, you will get a speedup from on-package RAM vs. user-replaceable RAM placed further away; even desktops would benefit, but it would not be a welcome change there.
Huge citation needed. There's currently no real-world product that matches that claim, nor a theoretical basis for it, as the physical trace length makes a minimal latency difference and is far from the major factor.
An AMD/Intel chip using the same soldered RAM next to the CPU and the same process node would give Apple a run for its money.
Still, the optimizations on the OS side are interesting here.