It's device sRGB for the time being, but more color spaces are planned.
You are correct that conflation artifacts are a problem and that doing antialiasing in the right color space can improve quality. Long story short, that's future research. There are tradeoffs, one of which is that use of the system compositor is curtailed. Another is that font rendering tends to be weak and spindly compared with doing compositing in a device space.
Yeah, there is an entire science on how to do font rendering properly. Perceptually you should even take into account whether you have white text on a black background or the other way around, as this changes the perceived thickness of the text. Slightly hinted SDFs kind of solve that issue and look really good, but of course making that work on CPUs is difficult.
What's difficult with font SDFs on the CPU? The bezier paths?
I made myself a CPU SDF library last weekend, primarily for fast shadow textures. It was fun, and I was surprised how well most basic SDFs run with SIMD. Except yeah, Beziers didn't fare well. Fonts seem much harder.
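For concreteness, a scalar version of the kind of basic SDF being described is tiny and branch-free, which is why it vectorizes so well. This is a generic rounded-box sketch (commonly used for shadow rectangles), not code from that library:

    // Signed distance from point (px, py) to an axis-aligned rounded box
    // centered at the origin with half-extents (hx, hy) and corner radius r.
    // Scalar form; the same arithmetic maps directly onto 4- or 8-wide SIMD
    // lanes because there are no branches.
    fn sd_rounded_box(px: f32, py: f32, hx: f32, hy: f32, r: f32) -> f32 {
        let qx = px.abs() - hx + r;
        let qy = py.abs() - hy + r;
        let outside = (qx.max(0.0).powi(2) + qy.max(0.0).powi(2)).sqrt();
        let inside = qx.max(qy).min(0.0);
        outside + inside - r
    }

    fn main() {
        // Fill a small distance field; blurring a step of this field gives a
        // soft shadow texture.
        let (w, h) = (64, 64);
        let mut field = vec![0.0f32; w * h];
        for y in 0..h {
            for x in 0..w {
                let (px, py) = (x as f32 - 32.0, y as f32 - 32.0);
                field[y * w + x] = sd_rounded_box(px, py, 20.0, 12.0, 4.0);
            }
        }
        println!("center distance: {}", field[32 * w + 32]);
    }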
SIMD was easy: I just asked Claude to convert my scalar Nim code to a NEON SIMD version and then to an SSE2 version. Most SDFs and Gaussian shadowing got a 4x speedup on my MacBook M3. It's a bit surprising the author has so much trouble in Rust. Perhaps fp16 issues?
I haven't looked at this recently, but from what I remember, rendering from SDF textures instead of from simple alpha textures was 3-4 times slower, including optimizations where fully outside and inside areas bypass the per-pixel square root. Of course SIMD is a must, or at least the use of _mm_rsqrt_ss.
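I'm guessing at the exact structure here, but the inside/outside bypass usually looks something like the sketch below: work with squared distance and only pay for the square root in the narrow antialiasing band. This is illustrative, not the commenter's actual code:

    // Turn a squared distance into antialiased coverage, skipping the square
    // root for pixels that are clearly inside or outside the edge band.
    // `sign` is negative inside the shape; `aa` is the half-width of the
    // antialiasing band in pixels.
    fn coverage_from_dist_sq(dist_sq: f32, sign: f32, aa: f32) -> f32 {
        let band_sq = aa * aa;
        if dist_sq >= band_sq {
            // Far from the edge: no sqrt needed, coverage is 0 or 1.
            return if sign < 0.0 { 1.0 } else { 0.0 };
        }
        // Near the edge: pay for the sqrt and ramp linearly across the band.
        let d = sign * dist_sq.sqrt();
        (0.5 - d / (2.0 * aa)).clamp(0.0, 1.0)
    }

    fn main() {
        for d in [-9.0f32, -1.0, 0.0, 1.0, 9.0] {
            let cov = coverage_from_dist_sq(d * d, d.signum(), 1.0);
            println!("d = {d}: coverage = {cov}");
        }
    }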
1. I like ryg's "A trip through the Graphics Pipeline" [1]. It's from 2011 but holds up pretty well, as the fundamentals haven't changed. The main new topic, perhaps, is the rise of tile based deferred rendering, especially on mobile.
2. I skipped over this in the interest of time. Nevermark has the central insight, but the full story is more interesting. For each tile, detect whether the line segment crosses the top edge of the tile, and if so, the direction. This gives you a delta of -1, 0, or +1. Then do a prefix sum of these deltas on the sorted tiles. That gives you the winding number at the top left corner of each tile, which in turn lets you compute the sparse fills and also which side to fill within the tile.
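A minimal CPU-side sketch of that bookkeeping follows; the real implementation buckets and sorts segments by tile and runs the prefix sum on the GPU, and the sign convention here is arbitrary:

    // Per (tile, segment): does the segment cross the tile's top edge, and in
    // which direction? y increases downward. A real implementation also checks
    // that the crossing's x coordinate falls within the tile's horizontal extent.
    fn top_edge_delta(y0: f32, y1: f32, tile_top: f32) -> i32 {
        let crosses_down = y0 < tile_top && y1 >= tile_top;
        let crosses_up = y1 < tile_top && y0 >= tile_top;
        (crosses_down as i32) - (crosses_up as i32)
    }

    // Given accumulated per-tile deltas in sorted tile order, an exclusive
    // prefix sum yields the winding number at each tile's top-left corner.
    fn winding_at_tile_corners(deltas: &[i32]) -> Vec<i32> {
        let mut winding = Vec::with_capacity(deltas.len());
        let mut acc = 0;
        for &d in deltas {
            winding.push(acc); // winding before this tile's delta is applied
            acc += d;
        }
        winding
    }

    fn main() {
        // Toy example: one downward crossing over tile 0, one upward over tile 3.
        let deltas = [top_edge_delta(-1.0, 5.0, 0.0), 0, 0, top_edge_delta(5.0, -1.0, 0.0)];
        println!("{:?}", winding_at_tile_corners(&deltas)); // [0, 1, 1, 1]
    }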
The newest spline work (hyperbezier) is still on the back burner, as I'm refining it. This turns out to be quite difficult, but I'm hopeful it will turn out better than the previous prototype you saw.
I'm excited for the Adafruit Fruit Jam to come out. I don't know how much it'll cost, but it's really a complete small computer: 2x USB A for keyboard, mouse, and/or controller, HDMI, SD card, high quality audio out, 5 RGB LEDs, extra PSRAM, and some other goodies. I imagine it will be quite straightforward to port this.
While getting DVI/HDMI out from a RP2040 is an impressive hack, on the RP2350 it's pretty straightforward, and the chip will do standards-compatible 640x480 without overclocking. With overclocking, at least the one I have will do 1280x720 60Hz (albeit with reduced blanking). At those resolutions and with the limited amount of RAM, framebuffers are not great (this is why the bpp is kept so low), so I'm exploring generating video on the fly.
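To make the RAM pressure concrete, here is some back-of-the-envelope arithmetic against the RP2350's 520 KB of on-chip SRAM; these are generic numbers, not figures from this project:

    // Rough framebuffer budget on an RP2350 (520 KB of on-chip SRAM).
    fn framebuffer_bytes(width: u32, height: u32, bpp: u32) -> u32 {
        width * height * bpp / 8
    }

    fn main() {
        let sram = 520 * 1024;
        for &(w, h, bpp) in &[(640u32, 480u32, 4u32), (640, 480, 8), (1280, 720, 4)] {
            let bytes = framebuffer_bytes(w, h, bpp);
            println!(
                "{}x{} @ {} bpp = {} KB ({}% of SRAM)",
                w, h, bpp, bytes / 1024, bytes * 100 / sram
            );
        }
    }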
Super fun chip, I highly recommend it for people who want to get into very low level programming.
AFAIK a very large part of Slang is massively big 3rd-party libraries written in C++; the Slang-specific Rust code would just be a very thin layer on top of millions(?) of lines of C++ code that has grown over decades and is maintained elsewhere.
(fwiw I've been considering writing the custom parts of the Sokol shader compiler in Zig instead of C++, but that's just a couple of thousand lines of glue code on top of massive C++ libraries (SPIRVTools, SPIRVCross, glslang and Tint), and those C++ APIs are terrible to work with from non-C++ languages.)
As far as developer friction for integration into asset workflows goes, that's exactly where I would prefer Zig over Rust (but a simple build.zig already goes most of the way without porting any code to Zig).
It is, according to Khronos anyway, for those who aren't already deeply invested in HLSL.
Khronos has been quite vocal that there is no further development on GLSL; they see that as a community effort, and they only provide SPIR-V.
This is how vendor-specific tooling eventually wins out. They kind of got lucky that AMD decided to offer Mantle as the basis for Vulkan, LunarG is doing the SDK, and now NVidia has contributed Slang; otherwise they would still be arguing about OpenGL vNext.
This is a longer and deeper conversation, but I think on topic for the original article, so I'll go into it a bit. The tl;dr is developer friction.
By all means if you're doing a game (or another app with similar build requirements), figure out a shader precompilation pipeline so you're able to compile down to the lowest portable IR for each target, and ship that in your app bundle. Slang is meant for that, and this pipeline will almost certainly contain other tools written in C++ or even without source available (DXC, the Apple shader compiler tools, etc).
There are two main use cases where we want different pieces of shaders to come from different sources of truth, and link them together downstream. One is integrating samplers for (vello_hybrid) sparse strip textures so those can be combined with user paint sources in the user's 2D or 3D app. The other is that we're trying to make the renderer more modular so we have separate libraries for color space conversion and image filters (blur etc). To get maximal performance, you don't want to write out the blur result to a full-resolution texture, but rather have a function that can sample from an intermediate result. See [1] for more context and discussion of that point.
Stitching together these separate pieces of shader is a major potential source of developer friction. There is a happy path in the Rust ecosystem, albeit with some compromises, which is to fully embrace WGSL as the source of truth. The pieces can be combined with string-pasting, though we're looking at WESL as a more systematic approach. With WGSL, you can either do all your shader compilation at runtime (using wgpu for native), or do a build.rs script invoking naga to precompile. See [2] for the main PR that implements the latter in vello_hybrid. In the former case, you can even have hot reloading of shaders; implemented in Vello main but not (yet) vello_hybrid.
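As a toy illustration of the string-pasting approach (the snippet names are invented, not the actual vello_hybrid shaders):

    // Toy illustration of composing WGSL pieces by string-pasting.
    // WESL aims to replace this kind of ad hoc concatenation with something
    // more systematic.
    const USER_PAINT: &str = r#"
    fn user_paint(uv: vec2<f32>) -> vec4<f32> {
        return vec4<f32>(uv, 0.0, 1.0);
    }
    "#;

    const RENDER_STAGE: &str = r#"
    @fragment
    fn fs_main(@location(0) uv: vec2<f32>) -> @location(0) vec4<f32> {
        return user_paint(uv);
    }
    "#;

    fn main() {
        // Concatenate the pieces into one WGSL module; the result can be handed
        // to wgpu at runtime, or to naga in a build.rs step for precompilation.
        let shader_source = format!("{USER_PAINT}\n{RENDER_STAGE}");
        println!("{shader_source}");
    }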
To get the same quality of developer experience with Slang, you'd need an implementation in Rust. I think this would be a good thing for Slang.
I've consistently underestimated the importance of developer friction in the past. As a contrast, we're also doing a CPU-only version of Vello now, and it's absolutely night and day, both for development velocity and attracting users. I think it's possible the GPU world gets better, but at the moment it's quite painful. I personally believe doing a Rust implementation of the Slang compiler would be an important step in the right direction, and is worth funding. Whether the rest of the world agrees with me, we'll see.
> The pieces can be combined with string-pasting, though we're looking at WESL as a more systematic approach.
> To get the same quality of developer experience with Slang, you'd need an implementation in Rust. I think this would be a good thing for Slang.
WESL has the opposite problem: it doesn't have a C++ implementation. IMO, the graphics world will largely remain C++-friendly for the foreseeable future, so if an effort like WESL wants to succeed, they will need to provide a C++ implementation (even more so than the need for Slang to provide a Rust one).
You're probably right about this. In the short to medium term, I expect that the Rust and C++ sub-ecosystems will be making different sets of choices. I don't know of any major C++ game or game-adjacent project adopting, say, Dawn for their RHI (render hardware interface) to buy into WebGPU. In the longer term, I expect the ecosystems to start blending together more, especially as C++/Rust interop improves (it's pretty janky now).
Long story short: you want to compose shaders at runtime and need a compilation pipeline for that. So what you really need is a C interface to the Slang transpiler that is callable from Rust.
Rewriting the whole Slang pipeline in Rust is a fool's errand.
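For the shape of what such a binding might look like from the Rust side, here is a hypothetical sketch; the symbol names are placeholders, not Slang's real C API:

    // Hypothetical FFI sketch: calling a C-callable shader transpiler from
    // Rust. The function names below are placeholders, NOT Slang's actual
    // C API; they only show the shape such a binding would take.
    use std::ffi::{CStr, CString};
    use std::os::raw::c_char;

    extern "C" {
        // Placeholder: compile `source` for `target` (e.g. "spirv", "msl") and
        // return a heap-allocated, NUL-terminated string the caller must free.
        fn hypothetical_compile_shader(source: *const c_char, target: *const c_char) -> *mut c_char;
        fn hypothetical_free_string(s: *mut c_char);
    }

    fn compile(source: &str, target: &str) -> Option<String> {
        let src = CString::new(source).ok()?;
        let tgt = CString::new(target).ok()?;
        unsafe {
            let out = hypothetical_compile_shader(src.as_ptr(), tgt.as_ptr());
            if out.is_null() {
                return None;
            }
            let result = CStr::from_ptr(out).to_string_lossy().into_owned();
            hypothetical_free_string(out);
            Some(result)
        }
    }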
We tried something like this with piet-gpu-hal. One problem is that spirv-cross is lossy, though the gaps in target language support are closing. For example, a device-scoped barrier is just dropped on the floor before Metal 3.2. Atomics are also a source of friction.
But the main problem is not the shader language itself, but the binding model. It's a pretty big mess, and things change as you go in the direction of bindless (descriptor indexing). There are a few approaches to this, certainly reinventing WebGPU is one. Another intriguing approach is blade[1] by Dzmitry Malyshau.
I wish the authors well, but this is a hard road, especially if the goal is to enable more advanced features including compute.
I’d very much like to read about Blade, but it seems like they have literally no documentation in text format, not even a basic introduction. Every link on the GitHub page goes to YouTube.
Project authors, please don’t do this. It’s impossible to get a two-minute overview from a video. Browsing through tutorials and documentation is much more efficient.
If you really have never written anything about the project except conference slides, then at least put up that deck in addition to the YouTube link. Clicking through slides is not great, but it’s still a better browsing experience than seeking at random in a video.
I really do wish that Sony made even more info about GNM and GNMX public. I was only starting to learn it when I got laid off and lost my access. I may or may not still have some older docs that found their way into my box as I was leaving on the last day, but if any did, they're definitely incomplete. I spent most of my time working on non-graphics parts of the project, so the time I got to spend digging into the graphics system of the PS5 was pretty limited.
As someone who still has a Nintendo Developer Portal account, who holds SCEE content from back when the London Soho office (aka Team Soho) used to have a developer site, and who owns PS2Linux, I can say there is plenty of material that can be discussed publicly without breaking NDAs.
Console-specific information also is not all that interesting these days, since game consoles have switched to off-the-shelf GPU designs with only minor modifications.
Even the current generation of consoles still have some interesting stuff going on. The 'core' of the console is fairly off the shelf, but they do still have modifications specific to the console that you won't find elsewhere. As far as GPU stuff goes, they tend to provide somewhat lower-level access to the hardware that you would normally not get with consumer stuff.
So I would say skill at GPU assembly is in demand for the elite tier of GPU performance work. Not necessarily writing much of it (though see [1] for an example; this is the kernel of multisplit as used in Nvidia's Onesweep implementation), but definitely being able to read it so you can understand what the compiled code is actually doing. I'll also cite as evidence the incredible work of the engineers on Nanite. They describe writing the core of the microtriangle software renderer in HLSL but analyzing the assembler output to optimize down to the cycle level, as described in their "deep dive into Nanite virtualized geometry" talk (the timestamp points to the reference to instruction-level micro-optimization).
The question of which assembly is best to learn is of course incredibly subjective, but I think the author gives short shrift to ARM32. It is historically important (especially for the Acorn computers, most popular in the UK), sensibly designed, and still relevant today, just in the context of microcontrollers.
Some of the most fun I've had programming assembly has been writing HDMI video scanout kernels for an RP2040 chip[1]. It was a delightful puzzle to make every single cycle count. There is a great sense of satisfaction in using every one of the 8 "low" registers (the other 8 "high" registers generally take one more cycle to move into a low register, but there are exceptions such as add and compare where they can be free; thus you almost always use a high register for the loop termination comparison). Most satisfying, you can cycle-count and predict the performance very accurately, which is not at all true on modern 64-bit processors. These video kernels could not be written in Rust or C with anywhere near the same performance. Also, in general, Rust compiles to pretty verbose code, which matters a lot when you have limited memory.
Ironically, the reasons for this project being on hold also point to the downside of assembler: since then, the RP2350 chip has come out, and huge parts of the project would need to be rewritten (though it would be much, much more capable than the first version).
I read LLVM IR (or one of its many GPU-flavored variants) reasonably often, mostly to figure out where in the chain a shader miscompilation is happening. But I've never personally had to write it, and it's not easy for me to think of a use case where it would make a lot of sense. It's pretty unpleasant and fiddly, as you have to annotate all the types of the intermediate values and so on, and it doesn't have the main advantage of actual assembler: being able to reason about the performance of the code. That depends so much on the way it's compiled.
That said, I have several times wanted to reach for LLVM intrinsics. In Rust, these are mostly available through a nightly-only feature (in std::intrinsics). One thing that potentially unlocks is "unordered" memory semantics, which are intermediate between nonatomic and relaxed atomics, in that they allow much of the optimization of the former, while also not being UB if there's a data race. In a similar vein is the LLVM "freeze" operation, which turns a read from uninitialized memory into a well-defined bit pattern. There's some discussion ([1] [2], for example) of adding those to Rust proper, but it's tricky.
But as another data point, for something I really want to do that's not yet expressible in Rust (fp16 SIMD operations), I would rather write NEON assembly language than LLVM IR. And I am quite certain I don't want to write any of the GPU variants by hand either.