> For example, I have not encountered hardware for reading vertex attributes or uniform buffer objects. The OpenGL and Vulkan specifications assume dedicated hardware for each, so what’s the catch?
That is not my understanding of those specs (as someone who's written graphics drivers). Uniform Buffer Objects are not a "hardware" thing. They're just a way to communicate uniforms faster than one uniform per API call. What happens on the backend is undefined by those specs and is not remotely tied to some hardware implementation. Vertex attributes might have been a hardware thing long ago, but not anymore; I'm pretty sure there are older references, but this 9-year-old book from 2012 already talks about GPUs that don't have hardware-based vertex attributes.
https://xeolabs.com/pdfs/OpenGLInsights.pdf chapter 21
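To make the "faster than one uniform per API call" point concrete, here's a rough sketch in plain GL. The struct, uniform names, and GL loader are made up for illustration; how the driver actually gets the data to the shader cores is its own business, which is exactly the point.

    // Sketch: a UBO is an API-level batching mechanism, not a hardware feature.
    #include <glad/glad.h>   // any GL loader; assumes a GL 3.1+ context already exists

    struct SceneUniforms {   // assumed to match an std140 uniform block in the shader
        float mvp[16];
        float tint[4];
    };

    // Old style: one API call per uniform (shader declares them as plain uniforms).
    void update_uniforms_individually(GLuint program, const SceneUniforms& u) {
        glUseProgram(program);
        glUniformMatrix4fv(glGetUniformLocation(program, "mvp"), 1, GL_FALSE, u.mvp);
        glUniform4fv(glGetUniformLocation(program, "tint"), 1, u.tint);
    }

    // UBO style: upload the whole block in one buffer update and bind it once.
    void update_uniforms_with_ubo(GLuint ubo, const SceneUniforms& u) {
        glBindBuffer(GL_UNIFORM_BUFFER, ubo);
        glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(u), &u);
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);   // binding point 0
    }

Either path is legal; whether the backend copies the data into registers, dedicated constant memory, or just leaves it in the buffer for the shader to load is entirely up to the implementation.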
> Simply put – Apple doesn’t need to care about Vulkan or OpenGL performance.
OpenGL and Vulkan make it easier for an implementer to build such specialized HW, but they don't assume it in any other way. If your HW is fast enough, there is absolutely no need to implement a specialized block for it, and you pay no performance penalty.
It's trivial to implement things like an input assembler without specific HW: just issue loads. But it would be a massive pain to go the other way around, trying to sniff out which loads fit a pattern that could be tossed into a fixed-function input assembler. That's a no-go.
This is the right way around to do things, as there is no performance penalty for "emulating" it; there is nothing to emulate in the end.
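For what "just issue loads" boils down to, here's a minimal sketch in plain C++ rather than shader code. The names, layout, and float-only conversion are made up for illustration; a real driver emits equivalent address math and loads in the vertex shader prologue.

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    struct AttributeDesc {
        const uint8_t* buffer;  // base of the bound vertex buffer
        uint32_t stride;        // bytes between consecutive vertices
        uint32_t offset;        // byte offset of this attribute within a vertex
    };

    // What a fixed-function input assembler would do in hardware, written as loads:
    // address = base + vertex_id * stride + offset, then a format conversion.
    static float fetch_attribute_f32(const AttributeDesc& d, uint32_t vertex_id) {
        float value;
        std::memcpy(&value, d.buffer + vertex_id * d.stride + d.offset, sizeof(value));
        return value;
    }

    int main() {
        // Three vertices interleaved as {x, y}: stride 8 bytes, x at offset 0.
        float verts[] = {0.f, 0.f,  1.f, 0.f,  0.f, 1.f};
        AttributeDesc pos{reinterpret_cast<const uint8_t*>(verts), 8, 0};
        for (uint32_t v = 0; v < 3; ++v)
            std::printf("x[%u] = %f\n", v, fetch_attribute_f32(pos, v));
        return 0;
    }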
From the Phoronix comments on this post:
> I have an idea. Why not support exclusively Vulkan, and then do the rest using Zink (that keeps getting faster and faster)?
> This way you could finish the driver in one year or two.
(For context: Zink is an OpenGL to Vulkan translator integrated into Mesa)
I had the same thought: Zink is 95% the speed of Intel's OpenGL driver, so why not completely ignore anything but Vulkan? On the Windows side, DXVK (DirectX to Vulkan) is already much faster (in most cases) than Microsoft's DX9-11 implementation, so it's completely feasible that Zink could become faster than most vendors' OpenGL implementations.
I have no knowledge of low-level graphics, so I don't know how easy each of the two APIs is to implement. I could envision, however, that because this GPU was never designed for OpenGL, there may be some small optimizations that could be made if OpenGL were skipped.
Like, the classic "Intel OpenGL driver" in Mesa (i.e., i965) doesn't use Gallium and NIR, and hence has to implement each graphics API itself, whereas their modern "Iris" driver using Gallium presumably just handles NIR -> hardware?
Or does the Gallium approach still require some knowledge of higher-level constructs and some knowledge of things above NIR?
Is there a way to support the work?
Part 1: https://news.ycombinator.com/item?id=25673631
Part 2: https://news.ycombinator.com/item?id=25873887
The Linux driver required two new quirks (different queue entry size, and an issue with using multiple queues IIRC). That's it. That's all it was.
On the M1, NVMe is not PCIe but rather a platform device, which requires abstracting out the bus from the driver (not hard); Arnd already has a prototype implementation of this and I'm going to work on it next.
Oh, and it looks like the fixes only made it into mainline Linux in 5.4, less than a year and a half ago, and from there it would've taken some time to reach distros...
Generally, "platform device" means that it's just a direct physical memory map. Honestly, from a driver perspective, that's sort of what you get with PCIe as well. The physical addresses is just dynamically determined during enumeration instead. Of course, there's some boilerplate core stuff to perform mappings and handle interrupts specific to PCI, but at the end of the day, you just get a memory mapped interface.
This is unlike something like USB where you need to deal with packets directly.
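A toy sketch of that "it's all just MMIO in the end" point; the addresses, the register read, and the map_registers() stub below are made up purely for illustration, not real M1 or NVMe values.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    static uint32_t fake_window[0x1000];  // stand-in for the device's MMIO window

    // In a real driver this would be ioremap()/mmap(); here it's a stub.
    static volatile uint32_t* map_registers(uint64_t phys_base, size_t /*len*/) {
        std::printf("mapping registers at %#llx\n", (unsigned long long)phys_base);
        return fake_window;
    }

    int main() {
        // Platform device: the base address is fixed, handed over by the device tree.
        volatile uint32_t* platform_regs = map_registers(0x300000000ULL, 0x4000);

        // PCIe device: same idea, but the base address (a BAR) was assigned during
        // bus enumeration rather than being baked into the device tree.
        uint64_t bar0 = 0xfedc0000ULL;  // pretend this came from PCI config space
        volatile uint32_t* pci_regs = map_registers(bar0, 0x4000);

        // Either way, the driver just reads and writes registers through the mapping.
        uint32_t reg0 = platform_regs[0];
        (void)reg0; (void)pci_regs;
        return 0;
    }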
Right, I was sort of alluding to that. I’m really just curious how the NVMe packets physically make their way to the SSD.
The specific logical signals between separate IPs on the SoC are slightly less interesting to me, then. It's likely something similar to ACE5, like you said, for sharing the memory bus.
One point I disagree with:
>What’s less obvious is that we can infer the size of the machine’s register file. On one hand, if 256 registers are used, the machine can still support 384 threads, so the register file must be at least 256 half-words * 2 bytes per half-word * 384 threads = 192 KiB large. Likewise, to support 1024 threads at 104 registers requires at least 104 * 2 * 1024 = 208 KiB. If the file were any bigger, we would expect more threads to be possible at higher pressure, so we guess each threadgroup has exactly 208 KiB in its register file.
>The story does not end there. From Apple’s public specifications, the M1 GPU supports 24576 = 1024 * 24 simultaneous threads. Since the table shows a maximum of 1024 threads per threadgroup, we infer 24 threadgroups may execute in parallel across the chip, each with its own register file. Putting it together, the GPU has 208 KiB * 24 = 4.875 MiB of register file! This size puts it in league with desktop GPUs.
I don't think this is quite right. To compare it to Nvidia GPUs, for example, a Volta V100 has 80 Streaming Multiprocessors (SMs), each with a 256 KiB register file (65536 32-bit registers). The maximum number of resident threads per SM is 2048, and the maximum number of threads per thread block is 1024.
While a single thread block _can_ use the entire register file (64 registers per thread * 1024 threads per block), this is rare, and it is then no longer possible to reach the maximum number of resident threads. To reach 2048 threads on an SM requires the threads to use no more than 32 registers on average, and two or more thread blocks to share the SM's register file.
Similarly, the M1 GPU may support 24576 simultaneous threads, yet there is no guarantee it can do so while each thread uses 104 registers.
 https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.... : table 15, compute capabilities 7.0
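To make the occupancy arithmetic concrete, here's a small sketch using the numbers quoted above. The 208 KiB per-core register file is the article's inference; the helper itself is only illustrative.

    #include <algorithm>
    #include <cstdio>

    // How many threads fit if each needs `regs_per_thread` registers of
    // `bytes_per_reg` bytes out of a register file of `regfile_bytes`?
    static unsigned max_resident_threads(unsigned regfile_bytes,
                                         unsigned regs_per_thread,
                                         unsigned bytes_per_reg,
                                         unsigned hw_thread_limit) {
        unsigned by_registers = regfile_bytes / (regs_per_thread * bytes_per_reg);
        return std::min(by_registers, hw_thread_limit);
    }

    int main() {
        // M1 (article's numbers): 208 KiB file, 16-bit registers, 1024-thread limit.
        std::printf("M1 @ 104 regs: %u threads\n",
                    max_resident_threads(208 * 1024, 104, 2, 1024));  // 1024
        std::printf("M1 @ 256 regs: %u threads\n",
                    max_resident_threads(208 * 1024, 256, 2, 1024));  // 416 (observed: 384)
        // V100: 256 KiB per SM, 32-bit registers, 2048 resident threads max.
        std::printf("V100 @ 32 regs: %u threads\n",
                    max_resident_threads(256 * 1024, 32, 4, 2048));   // 2048
        std::printf("V100 @ 64 regs: %u threads\n",
                    max_resident_threads(256 * 1024, 64, 4, 2048));   // 1024
        return 0;
    }

The point above is exactly the V100 @ 64 case: the register file alone caps occupancy well below the hardware's resident-thread limit, so supporting 24576 simultaneous threads says nothing about doing so at 104 registers each.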
Remember, they are selling computers as a combination of hardware and software. They are not selling processors so they are of course not supporting driver and other low level software for other OSs. That's a bummer if you are into other OSs, but it is not part of their business model so it should not be surprising.
Other OSs are supported via their virtualization framework. My limited tests show about a 10% performance penalty. Not too bad right out of the gate with a new system.
That being said, Ms. Rosenzweig is doing some incredible and interesting work. Really enjoying the series.
Most people will not end up writing code in the optimal way though, since they also want to support discrete GPUs with their own VRAM and those have totally different memory management.
macOS devices are not, and Apple invested significant development effort into allowing third-party kernels on M1 Macs. The situation is very different. They are not actively supporting the development of third-party OSes, but they are actively supporting their existence. They built an entire boot policy system to allow not just this, but insecure/non-Apple-signed OSes to coexist and dual-boot on the same device next to a full-secure blessed macOS with all of their DRM stuff enabled, which is something not even open Android devices do.
You can triple-boot a macOS capable of running iOS apps (only possible with secureboot enabled), a macOS running unsigned kernel modules or even your own XNU kernel build, and Linux on the same M1 Mac.
One may consider that unimportant, but I think it's important to at least note that the hardware of these machines is not fully under user software control.
Then again, I don’t think the iOS support requires any hardware so I’m not sure why someone hasn’t released a mod (requiring a kext or not) that enables iOS app loading with SIP disabled.
In fact, the SEP is re-bootstrapped by the main OS itself after the boot stuff is done (we get the SEP firmware blob in memory passed in from the bootloader), so even though we cannot run our own code on it, we can choose not to run Apple's signed code either, and thus guarantee that it isn't doing nefarious things behind our back. For most intents and purposes, it's like an optional built-in YubiKey with a fingerprint reader.
iOS apps should be FairPlay encrypted AIUI, and presumably that goes through the SEP to only authorize decryption when booted in secure mode. That's my understanding anyway, I haven't had anything to do with that ecosystem for ages. Of course, either way you could load decrypted iOS apps just like you can pirate iOS apps on a jailbroken phone.
The M1 does have other co-processor CPUs that run signed firmware loaded before Linux boots (e.g. for power management, graphics management, sensors, etc), but all of those other firmware blobs are plaintext (only the SEP one is encrypted); we may not be able to change them (unclear yet how much control we have over those CPUs post-iBoot, there might be a way to stop them or reload the firmware) but we can at least reverse engineer them and audit that they don't do anything nasty. Besides, I think most of this stuff goes through OS-controlled IOMMUs to access memory anyway, so it can't do much harm to the main OS.
What makes people believe this?
All the low-level details so far have been reverse-engineered since Apple doesn't provide documentation. Just because m1n1 finds the CPU to be in the EL2 state when its first instruction executes doesn't mean EL3 doesn't exist. An equally valid conclusion is that iBoot dropped from EL3 to EL2 before jumping to the m1n1 code.
Apple's phone chips use EL3 as a "god mode" to silently scan the kernel's code pages for modifications, and panic the processor if any are found:
Until this mechanism was discovered nobody thought EL3 was being used at all on the phone chips.
I agree that Apple probably has less random junk running at exceptionally high privilege levels, but your argument is not convincing to me. We have control of the exception levels, which we can check the existence of from the public ISA, but that doesn't mean Apple hasn't added any new stuff elsewhere that can touch the CPU in ways that are not yet known (and, to be entirely fair: I don't even think we know we have control of the exception levels. We have EL2 execution, but GXF exists, and even if we know how to navigate through it who can really say for sure what it does to the processor state?). I think the right argument here is "Apple has no reason to add stupidity to their processor (and many reasons to not add this garbage) so it likely does not exist" and leave it at that, rather than trying to draw up technical reasons why it seems more open.
> iOS apps should be FairPlay encrypted AIUI, and presumably that goes through the SEP to only authorize decryption when booted in secure mode.
Wow, is it really possible that iOS apps are encrypted with a private key that is stored within all SEP devices and it hasn't been cracked yet? If so, that's incredible and would explain why a workaround for using iOS apps with SIP disabled hasn't been released. Of course, I shouldn't be that surprised, since 4K DRM media content would rely on the same property.
Edit: I looked into this, and it turns out that each device has its own public key, and the server encrypts the app/content on the fly at download time with a key derivable from the device's public key. This is a simplified explanation, but the essential implication is that there is no global private iOS app package key.
> Besides, I think most of this stuff goes through OS-controlled IOMMUs to access memory anyway, so it can't do much harm to the main OS.
Great point. If these other onboard devices have unfettered access to the memory bus and/or can trigger some sort of NMI then you can never really trust these devices. Though as you point out, most contemporary x86 PCs are no different in that regard.
There are global keys, which are used for system software. iOS used to be encrypted as a whole (not any more though, but the SEP firmware and iBoot still are) and getting those keys is tricky, as they are baked into hardware and different for each generation. You can build hardware so it lets you decrypt content or subkeys with a key, but not access the key material itself; if done properly (it often isn't done properly), that can mean you can only use the devices as an oracle (decrypt anything, but only directly on-device) unless you spend a lot of time and money reverse engineering the baked-in hardware key using a scanning electron microscope.
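Here's a purely illustrative sketch of that "use the device as an oracle" idea. The toy XOR "cipher" and the class are stand-ins with no resemblance to the real hardware design; the point is only the shape of the interface: software can ask for decryptions, but can never read the key out.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class KeyOracle {
    public:
        // Decrypt a blob with the baked-in key; only works "on this device".
        std::vector<uint8_t> decrypt(const std::vector<uint8_t>& ciphertext) const {
            std::vector<uint8_t> out(ciphertext.size());
            for (size_t i = 0; i < ciphertext.size(); ++i)
                out[i] = ciphertext[i] ^ key_[i % key_.size()];  // toy cipher, not real crypto
            return out;
        }
        // Note what's missing: there is no get_key(). In real hardware the key is
        // baked in at manufacturing and never visible to firmware or the OS.
    private:
        std::vector<uint8_t> key_ = {0xde, 0xad, 0xbe, 0xef};   // made-up device key
    };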
Ah yes indeed, I remember this from my jailbreaking days. I just was never aware that app packages were encrypted.
'Turning the Mac into an iPhone' suggests they are locking it down, which isn't entirely true.
They could do more to help driver development though.
I've just started playing with OpenGL recently and I don't know what "changing fixed-function attribute state can affect the shader" means.
Can anyone give an example of what kind of operations in the shader code might cause these unnecessary recompiles?
More modern interfaces now force you to clump a lot of state together into pretty big immutable state objects (e.g. pipeline objects) so that the driver has to deal with fewer surprises at inopportune times.
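To give a concrete example for the question above: here's a rough sketch with standard GL calls. Whether a recompile actually happens is entirely driver-specific, but on hardware with no fixed-function input assembler (as the article describes), the fetch-and-convert code lives in the vertex shader, so this kind of state change can plausibly force a new shader variant.

    #include <glad/glad.h>   // any GL loader; assumes a GL 3+ context is already set up

    void draw_twice(GLuint vbo, GLuint program) {
        glUseProgram(program);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);

        // Draw #1: attribute 0 is three floats; the driver's compiled vertex
        // shader has a float3 fetch baked in.
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 12, (void*)0);
        glEnableVertexAttribArray(0);
        glDrawArrays(GL_TRIANGLES, 0, 36);

        // Draw #2: same GLSL source, but attribute 0 is now normalized unsigned
        // bytes. The "fixed-function" attribute format changed, so a driver that
        // lowers vertex fetch into the shader may have to compile a new variant
        // right before this draw.
        glVertexAttribPointer(0, 4, GL_UNSIGNED_BYTE, GL_TRUE, 4, (void*)0);
        glDrawArrays(GL_TRIANGLES, 0, 36);
    }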
I think I understand now. Ideally the GLSL shader code is compiled once and sent to the GPU and used as-is to render many frames.
But if you use the stateful OpenGL APIs to change fixed-function state from the CPU side during rendering, you can invalidate the shader code that was compiled.
It had not occurred to me because the library I am using makes it difficult to do that, encouraging setting the state up front and running the shaders against the buffers as a single "render" call.
From the article, it would seem that compensating via software was fine, performance-wise. Apple's approach seems to break the norm in fields where the norm has proven to be unnecessary complexity, which opens up room for just more raw performance.
It's a smart move for Apple to double down on pruning hw features you don't think you need, but sadly you can only go all-in on it if you control the entire ecosystem.
Some of the comments below are surely out of jealousy. But then again, that jealousy is understandable: not too long ago, people wouldn't celebrate someone's age, or even mention it anywhere.
To some extent, I personally am jealous that in the place I grew up we didn't understand how this kind of marketing helps with life later on. And I still find myself jealous of Americans who often market lesser work than lots of us have done as something that turns the person in question into some sort of hero character.
Though I have plenty of other comments that actively critique hero worship. I personally think that even if you remove the jealousy aspect, it's damaging to the person's character development.
During my time it meant:
- Knowledge of BASIC (GW, Turbo and Quick), Turbo Pascal, Turbo C, Turbo C++, 80x86 Assembly, dBase III Plus and Clipper
- Databases and their data organization on hard disk
- Digital circuits
- OS design, with experience on MS-DOS/Netware and Xenix
- A 3-month traineeship at a local company at the end of the degree
- All the remaining stuff from a traditional high school, like physics, math, geometry and whatever else.
Now would everyone be as good as she is?
Certainly not, but the tools are there for anyone that wants to have a go at it.
You're right, though, people are in fact capable of strong work at many ages, including their teens, and it can be good to remember that.
Apple are doing it out of rather sickening lock-in culture in the company and Metal is far from the only example like that.
Alyssa clearly explained how avoiding fixed-function hardware means they can cram more shaders in, which means they can increase performance; we have no idea, at this stage, whether this ends up being a net gain or a net loss for, say, a Vulkan app. And we probably never will, because we don't have an "AGX-but-it-has-this-stuff-and-fewer-shader-cores-in-the-same-silicon-area" to compare with. And it doesn't matter. In the end, OpenGL and Vulkan apps will run fine.
If we ever end up with empirical evidence that these design choices significantly hurt real-world OpenGL and Vulkan workloads in ways which cannot be worked around, you can start complaining about Apple. Until then, there is absolutely no indication that this will be a problem, never mind zero evidence for your conspiracy theory that it was a deliberate attempt by Apple to sabotage other APIs.
I am, quite honestly, getting very tired of all the off-topic gratuitous Apple bashing in articles about our Linux porting project.
Keep up the great work, plenty of people really appreciate it.
Something to think about, or maybe not.
If Apple wanted to support Vulkan without it being worse than Metal, they would either need to add so many Apple-only extensions that it would be Vulkan in name only, or make their GPUs identical to AMD's or Nvidia's (unfortunately, due to IP patents, Apple can't just make a copy of AMD's GPUs; they need to find another IP partner, and that is PowerVR).
If PowerVR had 80% of the GPU market (like Nvidia), they would have pushed Vulkan to line up with a TBDR pipeline. But they do not, so while you can run Vulkan on a TBDR pipeline, you end up throwing away lots and lots of optimisations.
Hence Google made it a compulsory API on Android 10 to try to push OEMs into improving their Vulkan story, and yet it is plain Vulkan 1.1.
Apple pays the piper, and Apple calls the tune. Whether you like it or not is immaterial.
> ...pointless NIH and lack of collaboration here.
> Apple are doing it out of rather sickening lock-in culture in the company and Metal is far from the only example like that.
> no benefit to them in doing it the way that you want them to.
These quotes portray Apple almost as a helpless, besieged small business that should be shielded from critique of its decisions, whereas they are an industry titan, and people should criticize them as they see fit, even if others don't find merit in the criticisms.
> Whether you like it or not is immaterial
is completely true, and an utterly banal statement, as it can be applied to any opinion voiced in conversation. No one here has any power over Apple, but we do have the power of free discussion, do we not?
Apple didn't even design their own GPU; the IP behind it is largely PowerVR's, which, again, arose from a company trying to compete against ATI, NVidia, 3dfx, Matrox, etc., who were running into a bandwidth wall, by taking a big risk on a tile-based deferred renderer.
Now look at what is being competed on: ray intersection hardware. This is happening because of ray tracing extensions to DirectX and Vulkan. Otherwise you end up with a game console, and while game consoles can leverage their HW maximally, they don't necessarily produce top-end HW innovation and performance.
// Hypothetical engine-level API: ask the engine for a rendering backend by name.
I3DRender* render = Engine::GetRender("render-name");
In any case, here is an example of such an approach: https://www.ogre3d.org/
I don’t disagree, but what else could they possibly do?
Metal shipped in 2014 for iOS, in 2015 for OSX. Vulkan 1.0 was released in 2016.
I don’t think it was reasonable to postpone a long-overdue next-gen GPU API for a few years, waiting for some consortium (outside of their control) to come up with API specs. By the time Vulkan 1.0 was released, people had been using Metal for a couple of years already.
I remember when building everything on LLVM bytecode was best practice. It wouldn't be the Linux ecosystem without continual reinvention of the wheel, would it?