Ubershaders: A Ridiculous Solution to an Impossible Problem (dolphin-emu.org)
772 points by voltagex_ on July 30, 2017 | hide | past | favorite | 88 comments

What an awesome writeup. I am not even really personally invested in the problem as I don't play older video games, but I loved reading the story. I wish other open source projects would do similar writeups when they reach major accomplishments.

If I just saw "specialized shaders replaced with ubershaders" on a feature update, I probably wouldn't think there was much of a story to it.

I'd love to know how this project manages to get such high quality stories out for every update. I assume their project has a community member who is passionate about writing. It's such a rare, but useful thing to have someone volunteer their time to do great technical writing. I wish someone would do it for the open source projects I work on!

As far as I'm aware, the user you're thinking of is JMC4789; they do some fantastic writeups.

They've recently started a side blog (http://emucross.com/), as well.

If you're curious, their reddit submission history (https://www.reddit.com/user/JMC4789/submitted/) has a bunch of interesting stuff.

They seem to have people on the forum with 'Content Creator' badges who work on these reports: https://forums.dolphin-emu.org/Thread-dolphin-progress-repor.... Great idea to recognise people who help in ways other than coding.

It's an incredibly impressive project. I can't believe they're not on Patreon (maybe this type of project doesn't qualify?). Their updates, and the writeups about those updates, are higher quality than those of most commercial software made by paid professionals.

The Dolphin project doesn't accept donations in any form; IIRC they said that fairly splitting the money between the contributors would be too difficult.

Apparently their infrastructure costs are covered by the few ads on the site and they're happy to leave it at that.

Wow, that's a decision I deeply respect.

This might apply to even newer videogames. Playing Breath of the Wild on Cemu seems to have huge stuttering due to shader caching, even with a good spec machine.

Probably not. Wii U shaders are apparently much more complicated than Wii/GameCube shaders (fully programmable shader pipeline).

Theoretically yes.

Modern shader cores are basically Turing complete, so you should be able to emulate anything, including the Wii U's modern GPU. But it won't run fast enough to help, because you will have to spill some (or even a lot) of the shader's state to the host GPU's main memory.

Ubershaders work well for Dolphin because the entire mutable state of the GameCube's pixel shaders fits into the available registers, with plenty of space left over.

Still seems like the model could be repurposed to make gamedev more stable across APIs and backends, while at the same time removing lots of interactions with the underlying driver. It's a really clever identity function with some mechanical sympathy. Amazing stuff, and so glad it's open source!

The "issue" is architectural; it has every reason to apply to more or less every game on the GameCube base.

This is an amazing article. I _love_ technical problems for which the prevailing consensus moves from "this isn't a problem" to "this problem is impossible to fix" to "the proposed fix could never work" to "doing it right would be too much work" to "the solution was inevitable".

Can you name/link to some other examples?

SpaceX spending almost a decade developing, testing, and then deploying rockets that can land and then be reused is a good example.

Is there any higher accolade in the field than having John Carmack say "Dolphin updates are wonderful system engineering articles." ?


My favorite part: "Despite being around 90% complete, the last 90% still remained to be done"

This is a rather old joke in software:


I've heard it, and variations of it, many times before, but never realized it went back that far.

This is really great stuff from the Dolphin team!

We took a similar approach when building de Blob 2 for Wii, X360, and PS3. We defined all our materials in terms of TEV stages. On the Wii that was used to set up the TEV when rendering. For X360 and PS3 we had an ubershader that emulated the TEV stages. This made it much easier for the artists; they built all materials in one tool in terms of what are essentially register combiners. We also allowed them to create more complex materials for X360/PS3 that would override the base material and do things that the Wii didn't actually support.

"Over the past few years, we've had users ask many questions about shader stuttering, demand action, declare the emulator useless, and some even cuss developers out over the lack of attention to shader compilation stuttering."

Ugh, it pains me to imagine users that would be anything but appreciative towards these developers, but kudos to the devs for using that abuse as inspiration.

> macOS Graphics Drivers are Still Terrible

There it is again :(

And even in Bootcamp, the current gen Radeons aren't getting any updates either. At least nVidia chips used unified drivers and got updates.

http://www.bootcampdrivers.com/ works well. Website looks a tad sketchy, but it's a community driven site and seems to be legit.

I assume Dolphin doesn't have a mode that uses Metal? Because that would presumably make it work well on macOS, as Metal is where Apple's been focusing their efforts for a while now.

I'm curious how similar Metal is to Vulkan in API-surface terms. Would it be easier to develop a Metal backend for Dolphin by starting from the macOS Vulkan backend than by starting from scratch?


Dear God, the solution is insane! That is mind-blowing... emulating the whole pipeline?!?!

People who do emulation are, quite simply, the very, very best of us.

My other take away is: just don't bother getting an Nvidia card if you can avoid it.

It doesn't quite emulate the whole 3D pipeline, but it emulates the entire pixel and vertex stages.

If anyone is interested in checking out the massive ubershaders, I've stashed a copy here:


So if I read the article right, this shader emulates those parts of the rendering pipeline of the GameCube/Wii... which to my mind still just sounds absolutely amazing - does this mean you've had to implement any quirks of the devices into the shader?

Also - does this generalise to other rendering pipelines for other devices do you think?

Yeah, though we had already implemented all those quirks in the old generated shaders.

The main difference with ubershaders is that we skip the shader generation/compilation and directly interpret the raw shader binary.

> Also - does this generalise to other rendering pipelines for other devices do you think?
Modern shader cores are more or less Turing complete, so you should be able to do the same on any other rendering pipeline which fits into the 'OpenGL pipeline model', including other modern GPUs.

Though, while it might be possible to run modern shaders in this manner, it won't run fast enough, because you will have to spill some (or even a lot) of the shader's state to the host GPU's main memory.

Ubershaders work well for Dolphin because the entire mutable state of the GameCube's pixel shaders fits into the available registers, with plenty of space left over. I assume it would work well for other DirectX 8 or even DirectX 9 era GPUs.

This is utter insanity. Bravo.

No, not really. It's not hard to write a graphics pipeline emulator. Those who believe a software rasterizer is difficult simply haven't spent much time trying to write one.

Making it correct is much more difficult than making it at all. I'd be shocked if their emulator got the basics correct in every detail. Example: texture wrapping occurs by taking the UV texcoord modulo the texture size. But since a texcoord can be negative, your modulo operation has to account for that at the boundary between positive and negative, or else you end up with truncated texels.

It doesn't sound like they were really writing a rasterizer, but it's the same concept.

> But since a texcoord can be negative, your modulo operation has to account for that

Is that the fmod(fmod(a, b) + b, b) thing, or something else? I still wish that was the default behaviour in more languages…

If you like this kind of thing, http://chrishecker.com/Miscellaneous_Technical_Articles is an excellent resource. It points out all of the fiendish corner cases lurking in a standard rasterizer.

http://chrishecker.com/images/9/97/Gdmtex2.pdf page 21 has a correct FloorDivMod function.
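For illustration, here's the fmod(fmod(a, b) + b, b) trick mentioned above, sketched in Python. (`math.fmod` mimics a truncating modulo like C's; Python's own `%` operator already floors, so the double-fmod form makes the difference visible.)

```python
import math

def wrap_texcoord(a, b):
    # Floor-style modulo: fmod alone truncates toward zero, so a negative
    # texcoord like -0.25 would stay negative and sample a truncated texel.
    # Adding b and taking fmod again pushes the result into [0, b).
    return math.fmod(math.fmod(a, b) + b, b)

print(wrap_texcoord(-0.25, 1.0))  # 0.75 -- wraps correctly across zero
print(wrap_texcoord(1.25, 1.0))   # 0.25
```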

I guess I'm one of those :-) I still have massive respect for these guys.

Yeah, I didn't mean to minimize the achievement. I just meant if you're interested, you should try it! It's fun.

And if you do want to try it, I recommend this as a starting point: https://github.com/ssloy/tinyrenderer/ - straightforward and satisfying to get a 3D model up on screen. It really helped me start building an understanding of how a pipeline might be implemented.

AFAIK the MESS Sound Blaster subsystem emulates the embedded 8051 responsible for handling ISA DMA requests.

This was a really well written overview of a technical puzzle and its eventual resolution. Loved the lucidity of the prose!

Fascinating read!

The final approach of interpreting the shaders initially, while compiling them in the background for greater performance, sounds very similar to what just-in-time compilers do.

If you think about it, the problems they face are also kind of similar: both systems are confronted with unpredictable, dynamic "source code", and both want to achieve both high performance while avoiding the lag introduced by ahead-of-time compilation, so it makes sense that a similar solution approach might work.
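As a toy illustration of that hybrid idea (all names here are invented; this is nothing like Dolphin's actual code): serve every request immediately through a slow generic path, while kicking off a specialized compile on a background thread and switching over once it's ready.

```python
import threading

class HybridShaderCache:
    """Sketch of a JIT-style hybrid: interpret now, compile in the background."""

    def __init__(self, compile_fn, generic_fn):
        self.compile_fn = compile_fn   # slow one-off; produces a fast callable
        self.generic_fn = generic_fn   # slow per call, but always ready
        self.cache = {}                # config -> compiled fast path
        self.pending = set()           # configs currently compiling
        self.lock = threading.Lock()

    def run(self, config, *args):
        with self.lock:
            fast = self.cache.get(config)
            if fast is None and config not in self.pending:
                self.pending.add(config)
                threading.Thread(target=self._compile, args=(config,)).start()
        if fast is not None:
            return fast(*args)                 # compiled path, no stutter
        return self.generic_fn(config, *args)  # interpret until it's ready

    def _compile(self, config):
        fast = self.compile_fn(config)
        with self.lock:
            self.cache[config] = fast
            self.pending.discard(config)
```

Usage would look like `cache.run(shader_config, frame_inputs)` on every draw; the first calls go through the generic path and later ones hit the compiled one.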

I never thought people would be working full time on emulator projects. I guess I really underestimated the amount of work going into them.

The CEMU patreon is quite well funded: https://www.patreon.com/cemu

Call me old fashioned or stupid (just not nostalgic; the best I ever owned was a Pegasus, a hardware clone of the NES/Famicom), but whenever I see these issues with older Sony or Nintendo stuff I am in awe.

Today's consoles seem like repackaged PCs with few changes, but the older ones seem like actual dedicated gaming hardware, especially the PS2, with its Emotion Engine and a PS1 as the disk controller. What the hell (in a good way)?!

I'm not an expert, but that's my understanding as well. I believe that's one of the reasons that exclusives were far more common in those days; a port was not a minor engineering effort, you had to do a total rewrite.

What I don't understand is why this was the case. I wonder if a general-purpose-PC-like architecture that was powerful enough to play games of the intended caliber was simply too expensive at the time.

Yes - during the NES era, a reasonably powerful computer for games cost closer to $5k while the NES was closer to $100.

I'm talking more of PSX and PS2 era.

Some games on the PS2, like Fatal Frame or Haunting Ground, are impressive even by today's standards and could pass for double-A games nowadays (a whole 17 years later). That's just impressive. And their hardware specs read like a real spec for a gaming machine (the EE's 2 VPUs, the PSX in the PS2 for compatibility, etc.), not just "a bunch of PC CPUs and GPUs from AMD in a box + Blu-ray drive". Ironically, the first Xbox prototype was actually (I think, or maybe it's a rumor) made out of laptop components.

The NES was a bit weak in comparison but very cheap (the Pegasus cost like 50 PLN in the early 2000s).

In the PSX/PS2 era (or really, the era starting from the SNES's SuperFX chip), the dedicated graphics ASICs in consoles, combined with the fact that those ASICs were being targeted individually at a low-level by game devs, were putting out results that seriously outpaced what you'd expect out of your PC's 3DFX Voodoo card.

That wasn't because the designs were more clever, mind you; but just because the hardware designers didn't need to think in terms of an architecture that contained concepts like dynamic frequency scaling and multi-monitor support and a kernel that blocks on disk IO. Consoles were hard real-time embedded systems, and the games were their unikernels; well into the PS2 era, console game were still relying on VBlank interrupts for physics timing!

And what this got you, was effects that were only achievable on an $8000 SGI workstation, for $300. Slightly-beyond-state-of-the-art, for cheap. But in exchange, it forced heavy consolidation in the console manufacturer market, because developing that specialized hardware wasn't cheap (like it was back in the 8-bit micro era.)

But "generic" PC GPUs eventually started scaling in power geometrically, to the point where the specialization and hard real-time guarantees just weren't needed any more to achieve modern graphics cheaply. The low-level-targeted specialized-graphics-ASIC technique wouldn't be of much benefit today, because six months later there'd be another new "generic" GPU out that could do what that ASIC does without breaking a sweat.

The same thing happened in networking: ASIC switches with special RTOSes were needed to run data centers—until CPUs and PCIe advanced enough to take over. Now everything (except Internet-backbone gear) is Software-Defined Networking, i.e. generic boxen running VMware ESX running a Linux/BSD router VM.

Why not take a profiling approach and cache the configurations rather than the compiled shader? You could then compile them on startup. By caching the configurations, you could then share this data between hosts and don't have to invalidate them as often.

I don't know much about this topic. Is what you're suggesting distinct from the approach they addressed in the "Sharing Shaders" section?


Each configuration has a Unique ID, and they discuss building up a list of all the required UIDs beforehand:

> Users refer to this as "sharing shaders" and in theory if users shared UID files, they could compile shaders ahead of time and not encounter stuttering.

They list a few reasons why they didn't go for this, but the tl;dr is that it is hard to build a complete list, so many users in many games would still see stuttering.

It still seems like a good supplementary approach to the ubershader hybrid mode. Perhaps it'll be an idea that gets revisited at a later date, after the pace of changes to the graphics emulation code has slowed down.

This is kind of BS; all it takes is one cheap shared hosting server collecting shaders from users (uploading every time you check for a new version, or once a day), and you would build a 95% complete list in a couple of months.

Awesome, go build it and prove them wrong

Except that formats change between driver versions, among other things. I know of few games that aggregate their shaders; most just generate them during loading. There's a good chance you'd uncover some amusing bugs.

We are talking about caching TEV configurations, not compiled shaders. They were against this solution for ~3 years; how much coverage would they get in 3 years of users uploading missing Unique ID objects?

The UIDs are a Dolphin invention, though; on the hardware level there are just a bunch of registers being set. So either you store a complete dump of the CPU state for each shader configuration, or you invalidate all shared shaders when Dolphin internals change.

At the end of the day you still have to compile the shader though. A shader cache is the only way around that.

Yes, and the parent comments are talking about a way to make it possible to compile all the shaders on startup instead of in the middle of the game. Longer loading times > stuttering.

Citation needed

What I did was use two modes: in development mode, the shaders would be (re)compiled on the fly (causing some latency) and stored in a bank. In release mode, the shader bank is used, and a shader compilation would only be done if a shader wasn't found - supposedly rarely.

As far as I can tell it worked pretty well.
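A minimal sketch of that two-mode scheme (the class, file format, and names are purely illustrative, not anyone's actual implementation): dev mode records every shader key it sees; release mode precompiles the recorded bank at startup, trading longer load times for no mid-game stutter.

```python
import json
import os

class ShaderBank:
    """Toy dev/release shader bank: record keys in dev, precompile in release."""

    def __init__(self, path, compile_fn, dev_mode):
        self.path = path
        self.compile_fn = compile_fn
        self.dev_mode = dev_mode
        self.known = set()   # keys recorded by previous dev-mode runs
        self.cache = {}      # key -> compiled shader
        if os.path.exists(path):
            with open(path) as f:
                self.known = set(json.load(f))
        if not dev_mode:
            # Release mode: compile the whole recorded bank up front.
            for key in self.known:
                self.cache[key] = compile_fn(key)

    def get(self, key):
        if key not in self.cache:  # should be rare in release mode
            self.cache[key] = self.compile_fn(key)
            if self.dev_mode and key not in self.known:
                self.known.add(key)
                with open(self.path, "w") as f:
                    json.dump(sorted(self.known), f)
        return self.cache[key]
```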

Could they use the "cut scenes" (?) to prime the shader cache(s)?

I mean those bits of animation between levels. I understood that many animations are scripted, vs movie recordings.

Apologies if I'm using the wrong words. I don't play many games.

In a traditional game this is exactly how it's done. Dolphin is a bit different though, because it's an emulator. It is aware that a game is using the GPU, but isn't aware of the inner workings of each game's logic. Dolphin doesn't know the difference between a cutscene, gameplay, a pre-rendered movie, or a credits sequence. Because of this, it has to be written in a more general manner, so that a game running inside of it can ask it to do anything a Gamecube or Wii would be able to do, and still come up with the right result.

This is an emulator. I assume they don't have the option of modifying the games to include high level optimizations like that.

Cut scenes might have different shaders than in-game play; generally cut scenes are far more controlled and so might have higher-quality shaders than actual levels. Also, that solution would only work in games with cut scenes, and you would have to assume that cut scenes occur before the levels that need those shaders. So it would end up being a very brittle solution.

They aren't able to predict what shaders will be needed.

The Dolphin project always has amazing writeups for complicated technical problems. Really love these. Amazing work from that whole team

It always amazes me how dedicated and talented the engineers who work on these projects are, amazing :)

The Dolphin emulator's blog is doing awesome blog posts, as usual. Reminds me of how JavaScript compilers[1] also compile several versions of the same function, as the interpreter gains more insight into how the function is used.

[1] https://wingolog.org/archives/2011/07/05/v8-a-tale-of-two-co... (Wow, 2011, time goes by fast)

This approach actually seems more straightforward and easier to maintain than the original shader cache system. Of course, when Dolphin was originally written this wasn't feasible on the hardware of that time, but nowadays I'd say shaders of this complexity aren't that unusual.

Dolphin always does great writeups, and this is no different. Real nice!

Reminds me of https://01.org/fast-ui-draw/blogs/krogovin/2016/fast-ui-draw..., a Canvas implementation from Intel that also uses a uber shader.

Now refactor everything so that the purpose-built shaders are actually generated from the ubershader simply by hard-coding the various decision points, possibly using preprocessor tricks? Seems like a natural next step that should be able to simplify the emulator a lot...

In fact, that's the usual meaning of uber-shaders; large shaders parametrized at compile time.
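To make that usual meaning concrete, here's a toy specializer in Python (the shader template, flag names, and the crude `#if` handling are all invented for illustration; a real ubershader would be GLSL/HLSL run through the actual preprocessor): one template with decision points, hard-coded per material into a branch-free specialized shader.

```python
# One "ubershader" template with compile-time decision points (hypothetical).
UBERSHADER_TEMPLATE = """
vec4 shade(vec4 tex, vec4 vtx) {
#if USE_TEXTURE
    vec4 c = tex;
#else
    vec4 c = vtx;
#endif
#if APPLY_FOG
    c = mix(c, fog_color, fog_factor);
#endif
    return c;
}
"""

def specialize(template, flags):
    """Resolve #if FLAG / #else / #endif blocks by hard-coding each
    decision point, yielding a specialized shader with no branches."""
    out = []
    stack = [True]  # stack of "currently emitting?" states, one per #if
    for line in template.splitlines():
        s = line.strip()
        if s.startswith("#if "):
            stack.append(stack[-1] and flags[s[4:]])
        elif s == "#else":
            stack[-1] = (not stack[-1]) and stack[-2]
        elif s == "#endif":
            stack.pop()
        elif stack[-1]:
            out.append(line)
    return "\n".join(out)

specialized = specialize(UBERSHADER_TEMPLATE,
                         {"USE_TEXTURE": True, "APPLY_FOG": False})
# 'specialized' now contains only the texture path: no fog, no #if lines.
```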

In short, they went from shader translation to emulation (on the GPU) which eliminates the delays of dynamic translation. Fortunately the emulation is fast enough that it works great.

Does anyone know what the interpreter actually needs to interpret?

What does the code look like?

Interpreter seems to be somewhat dynamically generated based on host system: https://github.com/dolphin-emu/dolphin/blob/master/Source/Co...

Found some flaky documentation: http://amnoid.de/gc/tev.html

edit: phire posted an example ubershader synth https://news.ycombinator.com/item?id=14886856

It's a set of register settings. The code in the game looks like a bunch of function calls; the stuff the emulator has to deal with is a bunch of opcodes in the FIFO. There's no interpretation per se in the shader - it's not like a pixel shader or a vertex shader, with some kind of bytecode type of affair. It's more like the NV_register_combiners GL extension (http://developer.download.nvidia.com/opengl/specs/GL_NV_regi...).

From a software perspective, there is a solid cut-off between "non-shaders" and "shaders". DirectX 7 compatible GPUs don't have "shaders" and DirectX 8 compatible GPUs do have "shaders", even if these "shaders" are primitive and only support a maximum of 8 instructions.

But from a hardware perspective, there is no solid cut-off between "non-shaders" and "shaders".

Because DirectX 8 era shaders are just slightly more capable register combiners, with a swanky new shader-based programming interface. Oh, and those 8 instructions were shoved into registers, just like on the GameCube.

In some ways, the GameCube's GPU is actually more capable than other DirectX 8 era GPUs (it can support up to 16 instructions), so it's my position that if you consider DirectX 8 GPUs to have proper pixel shaders, then you should consider the GameCube as having pixel shaders too, just with an older, clunky programming interface.

The various parts of the instruction (aka register combiner stage) might be split over various registers, but the gamecube GPU executes them as whole instructions in series, just like a shader core. Our Ubershader acts like a proper interpreter, interpreting the raw data out of these registers cycle by cycle as machine code, just like the real hardware.
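A toy model of that cycle-by-cycle interpretation (the register and field names here are invented for illustration; see Dolphin's ubershader source for the real layout): each stage's register fields are decoded into one "instruction" and executed in series against a small register file.

```python
def run_tev(stages, inputs):
    # "Register file" of RGB values; stages read and write it in series,
    # the way an interpreter steps through instructions one by one.
    regs = {"prev": (0, 0, 0), **inputs}
    for st in stages:
        a, b, c = regs[st["a"]], regs[st["b"]], regs[st["c"]]
        # The core combiner op is a lerp: a + (b - a) * c, per channel,
        # in 8-bit fixed point, clamped to [0, 255].
        regs[st["dest"]] = tuple(
            min(255, max(0, ai + (bi - ai) * ci // 255))
            for ai, bi, ci in zip(a, b, c))
    return regs["prev"]

# One stage blending texture toward rasterized color by a constant factor:
stage = {"a": "tex", "b": "ras", "c": "konst", "dest": "prev"}
out = run_tev([stage], {"tex": (0, 0, 0), "ras": (255, 255, 255),
                        "konst": (128, 128, 128)})
print(out)  # (128, 128, 128)
```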

Metroid Prime was so good.

All this work and you can buy a used GameCube system for $50 USD. What's the point?

You won't be able to buy a working used GameCube system for $50 forever.

And the game disks won't last forever either.

And Dolphin can upscale to e.g. 1080p rather than the GameCube's native 480p.


This is incredible work. Predicting, sharing, asynchronous compilation, and reverse-engineering the pipeline are all very creative solutions to a really difficult problem. As I understand, deep learning basically runs graphics cards backwards to generate text from images.

How can we apply these excellent algorithms to machine learning?

Machine learning doesn't run the graphics card backwards.

What they did is not really useful for ML. As they said themselves, their ubershader is massively inefficient.

This is incredibly false. How did you even manage to get this picture of what deep learning is?

I'm genuinely curious as to what articles/papers you've read that made you think deep learning is basically "running a graphics card backwards".
