What an awesome writeup. I am not even really personally invested in the problem as I don't play older video games, but I loved reading the story. I wish other open source projects would do similar writeups when they reach major accomplishments.
If I just saw "specialized shaders replaced with ubershaders" on a feature update, I probably wouldn't think there was much of a story to it.
I'd love to know how this project manages to get such high-quality stories out for every update. I assume the project has a community member who is passionate about writing. It's such a rare but useful thing to have someone volunteer their time to do great technical writing. I wish someone would do it for the open source projects I work on!
It's an incredibly impressive project. I can't believe they're not on Patreon (maybe this type of project doesn't qualify?). Their updates, and the writeups about them, are higher quality than most commercial software made by paid professionals.
The Dolphin project doesn't accept donations in any form; IIRC they said that fairly splitting the money between contributors would be too difficult.
Apparently their infrastructure costs are covered by the few ads on the site and they're happy to leave it at that.
This might apply to even newer video games. Playing Breath of the Wild on Cemu seems to stutter heavily due to shader caching, even on a well-specced machine.
Modern shader cores are basically Turing complete, so you should be able to emulate anything, including the Wii U's modern GPU. But it won't run fast enough to help, because you will have to spill some (or even a lot) of the shader's state to the host GPU's main memory.
Ubershaders work well for Dolphin because the entire mutable state of the GameCube's pixel shaders fits into the available registers, with plenty of space left over.
Still seems like the model could be repurposed to make gamedev more stable across APIs and backends, while at the same time removing lots of interaction with the underlying driver. It is a really clever identity function with some mechanical sympathy. Amazing stuff, and so glad it is open source!
This is an amazing article. I _love_ technical problems for which the prevailing consensus moves from "this isn't a problem" to "this problem is impossible to fix" to "the proposed fix could never work" to "doing it right would be too much work" to "the solution was inevitable".
We took a similar approach when building de Blob 2 for Wii, X360, and PS3. We defined all our materials in terms of TEV stages. On the Wii that was used to set up the TEV when rendering. For X360 and PS3 we had an ubershader that emulated the TEV stages. This made it much easier for the artists; they built all materials in one tool in terms of what are essentially register combiners. We also allowed them to create more complex materials for X360/PS3 that would override the base material and do things that the Wii didn't actually support.
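To give a rough idea of the shape of that setup (this is a simplified sketch with made-up names, not the actual de Blob 2 code): materials are described as a list of TEV-style combiner stages, and on platforms without a TEV an ubershader-style loop interprets the same list.

```cpp
// Simplified, hypothetical sketch: materials as TEV-style combiner stages,
// interpreted by a loop the way an ubershader would on X360/PS3.
#include <cstdio>
#include <vector>

struct Color { float r, g, b; };

// A tiny subset of the inputs a real TEV stage can select from.
enum class Arg { Zero, TexColor, VertexColor, Previous };

// One stage computes lerp(a, b, c), roughly the TEV's basic combiner op.
struct TevStage { Arg a, b, c; };

static Color select(Arg arg, Color tex, Color vtx, Color prev) {
    switch (arg) {
        case Arg::TexColor:    return tex;
        case Arg::VertexColor: return vtx;
        case Arg::Previous:    return prev;
        default:               return {0.0f, 0.0f, 0.0f};
    }
}

static float mix(float x, float y, float t) { return x * (1.0f - t) + y * t; }

// The ubershader-style loop: on Wii these stages would instead be written
// straight into the TEV registers; here they are interpreted per pixel.
static Color evaluate(const std::vector<TevStage>& stages, Color tex, Color vtx) {
    Color prev = {0.0f, 0.0f, 0.0f};
    for (const TevStage& s : stages) {
        Color a = select(s.a, tex, vtx, prev);
        Color b = select(s.b, tex, vtx, prev);
        Color c = select(s.c, tex, vtx, prev);
        prev = {mix(a.r, b.r, c.r), mix(a.g, b.g, c.g), mix(a.b, b.b, c.b)};
    }
    return prev;
}

int main() {
    // A two-stage "material": modulate texture by vertex color, then pass through.
    std::vector<TevStage> material = {
        {Arg::Zero, Arg::TexColor, Arg::VertexColor},
        {Arg::Previous, Arg::Zero, Arg::Zero},
    };
    Color out = evaluate(material, {1.0f, 0.5f, 0.25f}, {0.5f, 0.5f, 0.5f});
    std::printf("%.3f %.3f %.3f\n", out.r, out.g, out.b);
}
```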
"Over the past few years, we've had users ask many questions about shader stuttering, demand action, declare the emulator useless, and some even cuss developers out over the lack of attention to shader compilation stuttering."
Ugh, it pains me to imagine users that would be anything but appreciative towards these developers, but kudos to the devs for using that abuse as inspiration.
I assume Dolphin doesn't have a mode that uses Metal? Because that would presumably make it work well on macOS, as Metal is where Apple's been focusing their efforts for a while now.
I'm curious how similar Metal is to Vulkan in API-surface terms. Would it be easier to develop a Metal backend for Dolphin by starting from the macOS Vulkan backend than by starting from scratch?
So if I read the article right, this shader emulates those parts of the rendering pipeline of the GameCube/Wii... which to my mind still just sounds absolutely amazing - does this mean you've had to implement any of the devices' quirks in the shader?
Also - does this generalise to other rendering pipelines for other devices do you think?
Yeah, though we had already implemented all those quirks in the old generated shaders.
The main difference with ubershaders is that we skip the shader generation/compilation and directly interpret the raw shader binary.
> Also - does this generalise to other rendering pipelines for other devices do you think?
Modern shader cores are more or less Turing complete, so you should be able to do the same on any other rendering pipeline that fits into the 'OpenGL pipeline model', including other modern GPUs.
Though, while it might be possible to run modern shaders in this manner, it won't run fast enough, because you will have to spill some (or even a lot) of the shader's state to the host GPU's main memory.
Ubershaders work well for Dolphin because the entire mutable state of the GameCube's pixel shaders fits into the available registers, with plenty of space left over. I assume it would work well for other DirectX 8 or even DirectX 9 era GPUs.
No, not really. It's not hard to write a graphics pipeline emulator. Those who believe a software rasterizer is difficult simply haven't spent much time trying to write one.
Making it correct is much more difficult than making it at all. I'd be shocked if their emulator got the basics correct in every detail. Example: texture wrapping works by taking the UV texcoord modulo the texture size. But since a texcoord can be negative, your modulo operation has to account for that at the boundary between positive and negative, or else you end up with truncated texels.
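For anyone curious, the fix for that particular pitfall is a floored ("Euclidean") modulo rather than the language's truncating one; a minimal sketch in C++:

```cpp
// Wrapping a texcoord with a truncating modulo breaks for negative coords:
// fmod(-0.25, 1.0) == -0.25, which would sample outside the texture.
// A floored ("Euclidean") modulo keeps the result in [0, size).
#include <cmath>
#include <cstdio>

double wrap_coord(double coord, double size) {
    double m = std::fmod(coord, size);
    if (m < 0.0) m += size;   // shift negative remainders back into range
    return m;
}

int main() {
    std::printf("%f\n", wrap_coord(-0.25, 1.0)); // 0.750000, not -0.250000
    std::printf("%f\n", wrap_coord( 2.50, 1.0)); // 0.500000
}
```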
It doesn't sound like they were really writing a rasterizer, but it's the same concept.
And if you do want to try it, I recommend this as a starting point: https://github.com/ssloy/tinyrenderer/ - straightforward and satisfying to get a 3D model up on screen. It really helped me start building an understanding of how a pipeline might be implemented.
The final approach of interpreting the shaders initially, while compiling them in the background for greater performance, sounds very similar to what just-in-time compilers do.
If you think about it, the problems they face are also kind of similar: both systems are confronted with unpredictable, dynamic "source code", and both want high performance while avoiding the lag introduced by ahead-of-time compilation, so it makes sense that a similar approach might work.
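In that spirit, the hybrid boils down to something like the sketch below (hypothetical names and structure, not Dolphin's actual code): look up a specialized shader, kick off an async compile on a miss, and draw with the generic ubershader until the compiled version is ready.

```cpp
// Hypothetical sketch of the "interpret now, compile in the background" hybrid;
// this is the shape of the idea, not Dolphin's actual code.
#include <chrono>
#include <cstdint>
#include <future>
#include <string>
#include <unordered_map>

struct CompiledShader { std::string blob; };
using ConfigId = std::uint64_t;   // hash of the full TEV/pipeline configuration

CompiledShader compile_specialized(ConfigId id) {
    // Stands in for an expensive driver compile (can take hundreds of ms).
    return CompiledShader{"specialized-" + std::to_string(id)};
}

class ShaderManager {
    std::unordered_map<ConfigId, CompiledShader> ready_;
    std::unordered_map<ConfigId, std::future<CompiledShader>> pending_;
public:
    // Returns a specialized shader if one is ready; otherwise starts an async
    // compile and tells the caller to draw with the ubershader for now.
    const CompiledShader* get(ConfigId id) {
        if (auto it = ready_.find(id); it != ready_.end()) return &it->second;

        auto p = pending_.find(id);
        if (p == pending_.end()) {
            pending_.emplace(id, std::async(std::launch::async, compile_specialized, id));
        } else if (p->second.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
            ready_[id] = p->second.get();
            pending_.erase(p);
            return &ready_[id];
        }
        return nullptr;   // caller renders this draw with the generic ubershader
    }
};

int main() {
    ShaderManager mgr;
    while (!mgr.get(0x1234)) { /* draw the frame with the ubershader instead */ }
}
```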
Call me old fashioned or stupid (just not nostalgic, the best I ever owned was a Pegasus, a hardware clone of the NES/Famicom), but whenever I see these issues with older Sony or Nintendo stuff I am in awe.
Today's consoles seem like repacked PCs with few changes, but the older ones seem like actual dedicated gaming hardware, especially the PS2 with its Emotion Engine and the PS1 chip serving as the disk controller. What the hell (in a good way)?!
I'm not an expert, but that's my understanding as well. I believe that's one of the reasons that exclusives were far more common in those days; a port was not a minor engineering effort, you had to do a total rewrite.
What I don't understand is why this was the case. I wonder if a general-purpose-PC-like architecture that was powerful enough to play games of the intended caliber was simply too expensive at the time.
Some games on PS2 like Fatal Frame or Haunting Ground are impressive even by today's standards and could pass for double-A games nowadays (a full 17 years later). That's just impressive. And their hardware specs read like a real spec for a gaming machine (the EE's 2 VPUs, the PSX in the PS2 for compatibility, etc.), not just a bunch of PC CPUs and GPUs from AMD in a box plus a Blu-ray drive. Ironically, the first original Xbox prototype was actually (I think, or maybe it's a rumor) made out of laptop components.
The NES was a bit weak in comparison but very cheap (the Pegasus cost like 50 PLN in the early 2000s).
In the PSX/PS2 era (or really, the era starting from the SNES's SuperFX chip), the dedicated graphics ASICs in consoles, combined with the fact that those ASICs were being targeted individually at a low-level by game devs, were putting out results that seriously outpaced what you'd expect out of your PC's 3DFX Voodoo card.
That wasn't because the designs were more clever, mind you, but just because the hardware designers didn't need to think in terms of an architecture that contained concepts like dynamic frequency scaling and multi-monitor support and a kernel that blocks on disk IO. Consoles were hard real-time embedded systems, and the games were their unikernels; well into the PS2 era, console games were still relying on VBlank interrupts for physics timing!
And what this got you, was effects that were only achievable on an $8000 SGI workstation, for $300. Slightly-beyond-state-of-the-art, for cheap. But in exchange, it forced heavy consolidation in the console manufacturer market, because developing that specialized hardware wasn't cheap (like it was back in the 8-bit micro era.)
But "generic" PC GPUs eventually started scaling in power geometrically, to the point where the specialization and hard real-time guarantees just weren't needed any more to achieve modern graphics cheaply. The low-level-targeted specialized-graphics-ASIC technique wouldn't be of much benefit today, because six months later there'd be another new "generic" GPU out that could do what that ASIC does without breaking a sweat.
The same thing happened in networking: ASIC switches with special RTOSes were needed to run data centers—until CPUs and PCIe advanced enough to take over. Now everything (except Internet-backbone gear) is Software-Defined Networking, i.e. generic boxen running VMware ESX running a Linux/BSD router VM.
Why not take a profiling approach and cache the configurations rather than the compiled shaders? You could then compile them on startup. By caching the configurations, you could share this data between hosts and wouldn't have to invalidate them as often.
Each configuration has a Unique ID, and they discuss building up a list of all the required UIDs beforehand:
> Users refer to this as "sharing shaders" and in theory if users shared UID files, they could compile shaders ahead of time and not encounter stuttering.
They list a few reasons why they didn't go for this, but the tl;dr is that it is hard to build a complete list, so many users in many games would still see stuttering.
It still seems like a good supplementary approach to the ubershader hybrid mode. Perhaps it's an idea that gets revisited at a later date, once the pace of changes to the graphics emulation code has slowed down.
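For what it's worth, that supplementary scheme boils down to persisting only the configuration UIDs (not compiled blobs) and recompiling them at startup. A hypothetical sketch, not Dolphin's actual format or code:

```cpp
// Hypothetical sketch of persisting configuration UIDs (not compiled blobs) so
// they can be recompiled at startup on any driver/GPU; not Dolphin's real format.
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_set>

using ShaderUid = std::uint64_t;   // in reality a struct covering all TEV/pipeline state

class UidCache {
    std::unordered_set<ShaderUid> uids_;
    std::string path_;
public:
    explicit UidCache(std::string path) : path_(std::move(path)) {
        std::ifstream in(path_);
        for (ShaderUid uid; in >> uid;) uids_.insert(uid);
    }
    // Called whenever the game sets up a configuration we haven't seen before.
    void record(ShaderUid uid) {
        if (uids_.insert(uid).second) {
            std::ofstream(path_, std::ios::app) << uid << '\n';
        }
    }
    // At startup (or after importing someone else's list), compile everything up front.
    template <typename CompileFn>
    void precompile_all(CompileFn compile) const {
        for (ShaderUid uid : uids_) compile(uid);
    }
};

int main() {
    UidCache cache("shader_uids.txt");
    cache.precompile_all([](ShaderUid uid) { (void)uid; /* generate + compile shader */ });
    cache.record(0xCAFE);   // seen during gameplay; persisted for the next run
}
```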
This is kind of BS; all it takes is one cheap shared hosting server collecting shaders from users (upload every time you check for a new version, or once a day), and you would build a 95% complete list in a couple of months.
Except that formats change between driver versions, among other things. I know of few games that aggregate their shaders; most just generate them during loading. There's a good chance you'd uncover some amusing bugs.
We are talking about caching the TEV configuration, not compiled shaders.
They were against this solution for ~3 years; how much coverage would they get in 3 years of users uploading missing Unique ID objects?
The UIDs are a Dolphin invention, though; at the hardware level there's just a bunch of registers being set. So either you store a complete dump of the relevant hardware state for each shader configuration, or you invalidate all shared shaders when Dolphin internals change.
Yes, and the parent comments are talking about a way to make it possible to compile all the shaders on startup instead of in the middle of the game. Longer loading times > stuttering.
What I did was use two modes: in development mode, the shaders would be (re)compiled on the fly (causing some latency) and stored in a bank. In release mode, the shader bank is used, and a shader compilation would only happen if a shader wasn't found - supposedly rarely.
In a traditional game this is exactly how it's done. Dolphin is a bit different though, because it's an emulator. It is aware that a game is using the GPU, but isn't aware of the inner workings of each game's logic. Dolphin doesn't know the difference between a cutscene, gameplay, a pre-rendered movie, or a credits sequence. Because of this, it has to be written in a more general manner, so that a game running inside of it can ask it to do anything a Gamecube or Wii would be able to do, and still come up with the right result.
Cutscenes might have different shaders than gameplay; generally cutscenes are far more controlled, and so might use higher quality shaders than the actual levels. Also, that solution would only work in games with cutscenes, and you would have to assume the cutscenes occur before the levels that need those shaders. So it would end up being a very brittle solution.
The Dolphin emulator's blog puts out awesome posts as usual. Reminds me of how JavaScript compilers[1] also compile several versions of the same function, as the interpreter gains more insight into how the function is used.
This approach actually seems more straightforward and easier to maintain than the original shader cache system.
Of course, when Dolphin was originally written this wasn't feasible on the hardware of the time, but nowadays I'd say shaders of this complexity aren't that unusual.
Now refactor everything so that the purpose-built shaders are actually generated from the ubershader simply by hard-coding the various decision points, possibly using preprocessor tricks? Seems like a natural next step that should be able to simplify the emulator a lot...
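Concretely, that could mean keeping one ubershader source and pinning its decision points with #defines before compiling. A rough sketch of the idea, not how Dolphin actually generates its shaders:

```cpp
// Sketch of specializing one ubershader source by pinning a decision point with
// a #define before compilation; hypothetical, not Dolphin's actual generator.
#include <cstdio>
#include <string>

// The ubershader body: generic builds read the stage count from a uniform,
// specialized builds get it as a compile-time constant the driver can unroll.
static const char* kUbershaderBody = R"GLSL(
#ifndef NUM_TEV_STAGES
uniform int NUM_TEV_STAGES;
#endif
out vec4 ocol;
void main() {
    vec4 prev = vec4(0.0);
    for (int stage = 0; stage < NUM_TEV_STAGES; ++stage) {
        // ... interpret this stage's combiner registers ...
    }
    ocol = prev;
}
)GLSL";

std::string specialize(int num_stages) {
    return "#version 330 core\n"
           "#define NUM_TEV_STAGES " + std::to_string(num_stages) + "\n"
           + kUbershaderBody;
}

int main() {
    // The result would be handed to glShaderSource/glCompileShader as usual.
    std::puts(specialize(4).c_str());
}
```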
In short, they went from shader translation to emulation (on the GPU) which eliminates the delays of dynamic translation. Fortunately the emulation is fast enough that it works great.
It's a set of register settings. The code in the game looks like a bunch of function calls; the stuff the emulator has to deal with is a bunch of opcodes in the FIFO. There's no interpretation per se in the shader - it's not like a pixel shader or a vertex shader, with some kind of bytecode type of affair. It's more like the NV_register_combiners GL extension (http://developer.download.nvidia.com/opengl/specs/GL_NV_regi...).
From a software perspective, there is a solid cut-off between "non-shaders" and "shaders". DirectX 7 compatible GPUs don't have "shaders" and DirectX 8 compatible GPUs do have "shaders", even if these "shaders" are primitive and only support a maximum of 8 instructions.
But from a hardware perspective, there is no solid cut off between "non-shaders" and "shaders".
Because DirectX 8 era shaders are just slightly more capable register combiners, with a swanky new shader-based programming interface. Oh, and those 8 instructions were shoved into registers, just like on the GameCube.
In some ways, the GameCube's GPU is actually more capable than other DirectX 8 era GPUs (it can support up to 16 instructions), so it's my position that if you consider DirectX 8 GPUs to have proper pixel shaders, then you should consider the GameCube as having pixel shaders too, just with an older, clunkier programming interface.
The various parts of an instruction (aka register combiner stage) might be split over various registers, but the GameCube GPU executes them as whole instructions in series, just like a shader core. Our ubershader acts like a proper interpreter, interpreting the raw data out of these registers cycle by cycle as machine code, just like the real hardware.
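To give a flavor of what interpreting that register data looks like (the bitfield layout below is invented for illustration, not the real TEV encoding):

```cpp
// Illustrative sketch of interpreting a packed combiner register as an
// "instruction"; the bitfield layout here is invented, not the real TEV encoding.
#include <cstdint>
#include <cstdio>

struct CombinerRegs { std::uint32_t color_env; };  // one stage's packed state

// Decode the operand selectors and execute the stage, like an interpreter would.
float run_stage(CombinerRegs regs, const float inputs[4], float prev) {
    unsigned sel_a = (regs.color_env >> 0) & 0x3;   // which input feeds 'a'
    unsigned sel_b = (regs.color_env >> 2) & 0x3;   // which input feeds 'b'
    unsigned sel_c = (regs.color_env >> 4) & 0x3;   // blend factor source
    bool     add_prev = (regs.color_env >> 6) & 0x1;

    float a = inputs[sel_a], b = inputs[sel_b], c = inputs[sel_c];
    float result = a * (1.0f - c) + b * c;          // the basic combiner op
    return add_prev ? result + prev : result;
}

int main() {
    float inputs[4] = {0.0f, 1.0f, 0.5f, 0.25f};
    CombinerRegs regs{0x59};  // 0b1011001: sel_a=1, sel_b=2, sel_c=1, add_prev=1
    std::printf("%f\n", run_stage(regs, inputs, 0.1f));
}
```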
This is incredible work. Predicting, sharing, asynchronous compilation, and reverse-engineering the pipeline are all very creative solutions to a really difficult problem. As I understand, deep learning basically runs graphics cards backwards to generate text from images.
How can we apply these excellent algorithms to machine learning?