
Fast 2D Rendering on GPU - raphlinus
https://raphlinus.github.io/rust/graphics/gpu/2020/06/13/fast-2d-rendering.html
======
raphlinus
Fast 2D rendering on GPU has represented the last few months of concentrated
work for me, and I'm happy to present the results now. It's required going
pretty deep into various aspects of GPU compute, so feel free to ask me about
that, 2D vector graphics, or anything related.

~~~
RivieraKid
\- Shouldn't 2D rendering be a solved problem given that it's basically a
subset of 3D rendering?

\- Don't libraries like Skia, Qt, Cairo use GPU rendering? I've always assumed
so. I mean, this is 2020, GPUs have been around for decades.

~~~
raphlinus
Others have spoken to this, but as a general introduction I highly recommend
Jasper's post "Why are 2D vector graphics so much harder than 3D?" In short,
no, it's not a solved problem.

Also, I see a lot of variations of this question, but I should state this more
clearly. There's been accelerated graphics in one form or another for a long
time, but what I'm doing is a completely different type of thing. In my world,
on the CPU you just encode the scene into a binary representation that's
optimized for GPU but in many ways is like flatbuffers, and then the GPU runs
a highly parallel program to render the whole thing. In previous approaches,
the CPU is deeply involved in taking the scene apart and putting it back
together in a form that's well suited to relatively dumb pixel pipes. Now that
GPUs are _really_ fast, that approach runs into limitations.
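
To make that concrete, here is a minimal sketch (illustrative only; the record layout and names are made up, not piet-gpu's actual format) of the kind of flat, flatbuffer-like encoding the CPU produces:

    // Illustrative scene encoding: fixed-layout records appended to one
    // flat byte buffer, uploaded as-is; GPU kernels index it directly,
    // with no pointer chasing.
    struct SceneEncoder {
        buf: Vec<u8>,
    }

    impl SceneEncoder {
        const TAG_LINE: u32 = 1; // hypothetical element tag

        fn encode_line(&mut self, p0: [f32; 2], p1: [f32; 2]) {
            self.buf.extend_from_slice(&Self::TAG_LINE.to_le_bytes());
            for v in [p0[0], p0[1], p1[0], p1[1]] {
                self.buf.extend_from_slice(&v.to_le_bytes());
            }
        }
    }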

It also depends what you're trying to do. I'm focusing here on dynamic paths
(and thus font rendering), while most of the libraries optimized for UI put
text into texture atlases and then use the GPU to composite quads to the final
surface, something they can do well.

[https://blog.mecheye.net/2019/05/why-is-2d-graphics-is-harde...](https://blog.mecheye.net/2019/05/why-is-2d-graphics-is-harder-than-3d-graphics/)

~~~
raks435
Can you expound on the principle of tiling mentioned in your algorithm a bit
more? The conventional mechanism is to use de Casteljau to divide a Bézier
curve into triangles and then rasterize those triangles on the GPU. If the
curve needs to be scaled, the triangulation/tessellation is done again. How is
the algorithm presented in the link different? Somehow the concept of tiling
seems to imply that rasterization of the curve is done on the CPU itself. What
am I missing?

~~~
raphlinus
I recommend reading the blog post series; I'm not sure I can usefully
summarize the concepts in a comment reply. But very briefly: there's a
flattening step (evaluated on GPU, based on de Casteljau) that converts the
Bezier into a polyline ( _not_ triangles), then a tiling step that records for
each tile a "command list" containing the complete description of how to
render the pixels in that tile, and finally a "fine rasterization" step in
which each workgroup reads that command list and renders 256 pixels in
parallel from it. From your question, it sounds like your mental model is
pretty different from how this pipeline works.
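
To give a flavor of those stages, here's a much-simplified CPU-side sketch in Rust (illustrative; the names and details are assumptions, not the actual piet-gpu kernels). It shows recursive de Casteljau flattening to a polyline, plus the kind of commands a tile's list might hold:

    use std::ops::Range;

    // Illustrative per-tile commands; the real command list is a packed
    // binary encoding, and these names are made up.
    enum Cmd {
        Fill { segs: Range<u32>, backdrop: i32, rgba: u32 },
        Solid { rgba: u32 },
    }

    type Point = [f32; 2];

    fn lerp(a: Point, b: Point, t: f32) -> Point {
        [a[0] + (b[0] - a[0]) * t, a[1] + (b[1] - a[1]) * t]
    }

    // Flatten a cubic Bezier to a polyline by recursive subdivision
    // (de Casteljau), stopping once the control points are within `tol`
    // of the chord.
    fn flatten(p: [Point; 4], tol: f32, out: &mut Vec<Point>) {
        let [p0, p1, p2, p3] = p;
        let (dx, dy) = (p3[0] - p0[0], p3[1] - p0[1]);
        // Cross products: (distance from chord) * (chord length).
        let d1 = (p1[0] - p0[0]) * dy - (p1[1] - p0[1]) * dx;
        let d2 = (p2[0] - p0[0]) * dy - (p2[1] - p0[1]) * dx;
        if d1 * d1 + d2 * d2 <= tol * tol * (dx * dx + dy * dy) {
            out.push(p3); // flat enough: emit one line segment
        } else {
            // Split at t = 0.5 and recurse on both halves.
            let p01 = lerp(p0, p1, 0.5);
            let p12 = lerp(p1, p2, 0.5);
            let p23 = lerp(p2, p3, 0.5);
            let a = lerp(p01, p12, 0.5);
            let b = lerp(p12, p23, 0.5);
            let mid = lerp(a, b, 0.5);
            flatten([p0, p01, a, mid], tol, out);
            flatten([mid, b, p23, p3], tol, out);
        }
    }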

~~~
raks435
Yes, I am trying to align my mental model with yours. I read your blog posts
"A sort-middle architecture for 2D graphics" and "2D Graphics on Modern GPU",
but I'm still unable to grasp the fundamental guiding principles. It's not
clear what commands constitute each tile, and, whatever they are, what the
fundamental reason is for the performance being better than joining the
polylines of a curve into a triangle list and having those triangles
rasterized by the GPU. Is there any blog/article on the fundamental principles
that you would recommend?

------
chrismorgan
I would like to get an idea of the capabilities of these things.

Suppose I was to make an ebook reader designed to be modelled more closely
after paper, on a device like the Surface Book’s 13″ 3000×2000 display,
showing two pages side-by-side. Each page might contain something like
2,000–2,500 letters. I want to be able to flip through pages like I might with
a paper book, so that I might be roughly completely rendering several pages at
once, and perhaps parts of several more pages; ideally it _might_ render the
page like a real 3D page, but if that’s too troublesome I’d settle for an
affine transformation while flipping. Assume that the layout of the pages,
with all the shaping, is all done ahead of time and is in memory.

I’ve never seen anyone attempt anything like this before. In the old way of
doing things, I think anyone attempting this would render each page to a
bitmap and use that as a GPU texture, and I think that could provide
acceptable performance (unless you were flipping rapidly through hundreds of
pages, because that’d take a _lot_ of GPU memory to keep), but I imagine that
the quality of the rendering mid-turn would be fairly atrocious—it could be
the sort of thing where the appearance of the text subtly changes half a
second after you finish turning the page, as it switches from one renderer to
a slightly different one.

Would the performance of this new approach be sufficient to render what I
describe at 60fps, while rendering each frame perfectly?

I was talking about the Surface Book; its Intel Core i7-6600U has Intel HD
Graphics 520 as its GPU; probably not too far off your 630’s results. (And I’m
interested in what integrated graphics can do, more than a dedicated GPU.)

My _guess_ based upon your paper-1 results is that this is probably _just
barely_ possible with integrated graphics, so long as you employ a few tricks
to reduce the amount of work required.

~~~
raphlinus
Short answer: yes, I believe it could work, though not with much margin to
spare. It might require some dedicated optimization work to fit the frame
budget, especially when you know things about the scene (it's mostly small
glyphs, with basically no overdraw) as opposed to being completely general
purpose.

Doing a warp transformation in the element processing kernel would probably
work just fine, and give you realistic movement and razor-sharp rendering.
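
Purely for illustration, the warp could be as simple as pushing every point through a function; a hypothetical cylindrical "page curl" might look like this (made-up formula, applied per point so that, after flattening to short segments, the warped page stays sharp at any zoom):

    // Hypothetical page-curl warp: wrap x around a cylinder of radius r,
    // with `turn` advancing the flip. Orthographic projection; the y term
    // fakes the page lifting off the surface.
    fn page_curl(p: [f32; 2], r: f32, turn: f32) -> [f32; 2] {
        let theta = p[0] / r + turn;
        [r * theta.sin(), p[1] - 0.2 * r * (1.0 - theta.cos())]
    }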

I'd love to see such a thing and would very much like to encourage people to
build it based on my results :)

~~~
londons_explore
For the OP's effect, remember you're probably going to want motion blur in
each frame, which means you can render at a much lower resolution in the
direction of motion (as long as you can do multipass rendering).

~~~
ghusbands
I don't think you do want motion blur, as you don't know the direction in
which the user's eyes are tracking. When games and films do motion blur, they
know where they expect the viewer's attention to be, so blurring things they
won't be tracking improves the look (especially at the low frame-rate of
films). But if you happen to track the moving page, and you are likely to, and
it has blur added, then it will look worse.

Look at phone interfaces; even if you scroll fast, they don't add motion blur,
and on high framerate and/or low-persistence displays, you can read things
while it scrolls.

------
aasasd
Noob question: hasn't some GPU acceleration for 2D been used since forever? I
keep being perplexed as to why browsers still have problems with it (at least
on my older MacBook with a shitty embedded Intel GPU). I thought some stuff,
like scrolling, was offloaded at least in the early/mid-2000s, making for a
distinctly meh experience when proper drivers weren't installed.

~~~
thechao
GPUs derive most of their parallelism from emitting rows of related “quads”
(the smallest unit of the screen that supports a difference operator). 2D
graphics are “chatty”, with window sizes more like 5–10 units in diameter.
It’s _hell_ on HW perf. To make things worse, 2D applications usually want
submillisecond latency. A GPU driver/HW stack will struggle to get latency
below 1–2 ms. When there’s lots of multipass rendering (which is a thing 2D
also wants a lot of), latency can climb to 10+ ms.
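
To make the quad point concrete: screen-space derivatives are finite differences between neighboring pixels in a 2×2 block, which is why hardware shades pixels four at a time even along the edges of tiny shapes (sketch, simplified):

    // One 2x2 quad of shaded values, indexed [y][x]. ddx/ddy-style
    // derivatives are just differences within the quad, so all four
    // pixels must be shaded even if only one is actually covered.
    fn quad_derivatives(v: [[f32; 2]; 2]) -> (f32, f32) {
        let ddx = v[0][1] - v[0][0]; // horizontal difference
        let ddy = v[1][0] - v[0][0]; // vertical difference
        (ddx, ddy)
    }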

~~~
rasz
What does latency have to do with 2D? Are you suggesting UI toolkits write
GPU commands synchronously? You should fill a display list, fire it, and
forget about the GPU until the next update.

~~~
thechao
Pen drawing for UIs. The _ideal_ case is to update a small portion of the
screen at about 240Hz, to provide a good simulation of pen/pencil feedback.
Really, your latency envelope should be on the order of the propagation of
sound through the barrel of the marking device, but screens don’t update that
fast.

~~~
marcusjt
Surely >95% of screens out there right now are running at 60Hz, so 240Hz "pen
drawing" is pretty niche and not a priority?

~~~
chrismorgan
And >95% of screens don’t support pen drawing.

I expect a screen that supports or is designed for pen drawing to be somewhat
more likely to be above 60Hz. All of these things are niche things that not
many care about, but in any case, it’d be nice to be _able_ to do better. And
like with Formula 1 race cars, benefits from high-end techniques tend to
trickle down to other more mainstream targets in time.

------
eco
How does this (and Pathfinder for that matter) compare to NanoVG? I've
recently been experimenting with swapping out Cairo for NanoVG and it seems
much faster. The lack of dashed lines may kill my experiment though unless I
can think of a decent workaround.

~~~
pbsurf
I've just released a fork of nanovg that does GPU rendering a bit like
Pathfinder, so it can support arbitrary paths - nanovg's antialiasing has some
issues with thin filled paths. It also adds support for dashed lines.

[https://github.com/styluslabs/nanovgXC](https://github.com/styluslabs/nanovgXC)

If you give it a try, let me know how it works for you.

~~~
eco
That is very exciting to hear. I'll give it a shot. Thanks.

------
stephc_int13
Something I'm not sure I completely understand after reading the article is:
in what way is this better/more desirable than the classical approach to GPU
rendering (with vertices, triangles, and shaders)?

~~~
pcwalton
The basic problem is that, without compute, you have to encode the vector
scene into GPU primitives (triangles) as quickly as possible, and all the ways
I know of to do that involve either (a) an expensive CPU process or (b) a
cheap CPU process but way too much overdraw on the GPU side. Compute gives
you the best of both worlds, allowing you to upload outlines directly to the
GPU and do the processing necessary to lower them to primitives on the GPU
itself.
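
To illustrate (b): the cheap-CPU route amounts to drawing a bounding quad per path and deciding coverage per pixel, so a loop like the one below (a sketch, not any particular library's shader) runs for every pixel the quad touches, whether the path covers it or not -- that's the overdraw:

    // Per-pixel nonzero-winding test, as a fragment shader would evaluate
    // it for every pixel inside a path's bounding quad. `segs` holds the
    // path's line segments after flattening.
    fn winding(px: [f32; 2], segs: &[([f32; 2], [f32; 2])]) -> i32 {
        let mut w = 0;
        for &(a, b) in segs {
            // Does this edge cross the horizontal ray going right from px?
            if (a[1] <= px[1]) != (b[1] <= px[1]) {
                let t = (px[1] - a[1]) / (b[1] - a[1]);
                if a[0] + t * (b[0] - a[0]) > px[0] {
                    w += if b[1] > a[1] { 1 } else { -1 };
                }
            }
        }
        w // nonzero means the pixel is inside the path
    }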

Pathfinder in there is actually using regular old GPU rasterization with
triangles and so forth, and I'm fairly confident it's about as fast as you can
go at D3D10 level (i.e. no compute shaders) without sacrificing quality. Note
that the numbers can vary wildly depending on hardware. On my MacBook Pro,
with a powerful CPU and limiting myself to the Intel integrated GPU only,
Pathfinder is actually about equal to the GPU compute approach on a lot of
scenes like the tiger, though it uses a lot of CPU.

~~~
stephc_int13
Is it true for all 2D rendering?

From my intuition, this seems pretty specialized for vector-like rendering,
with a lot of small Bézier shapes.

~~~
pcwalton
Yeah, when I say 2D I mean vector art. There are a lot of things under the
heading of 2D rendering, such as blitting raster sprites, that are much closer
to being solved problems. (Though you might be surprised--power concerns,
coupled with greatly increased pixel density, have brought renewed attention
to performance of blitting lately...)

------
c-smile
Question to the author: is this only a GPU rasterizer? What about anything
like WARP
([https://en.wikipedia.org/wiki/Windows_Advanced_Rasterization...](https://en.wikipedia.org/wiki/Windows_Advanced_Rasterization_Platform))
as a fallback for rendering when a GPU is not available?

~~~
raphlinus
It's something I'd like to explore at some point, as is SwiftShader, which
might be easier to try, as it's already Vulkan. I expect performance to be
pretty good, but there are already really advanced CPU renderers such as
Blend2D. Doing serious performance evaluation is hard work, so I actually hope
it's something others take up, as I have pretty limited time for it myself.

~~~
c-smile
I understand.

The problem is that, unfortunately, any practical 2D rendering solution has to
support both GPU and CPU rendering.

It would be interesting to see a GPU equivalent of something like AGG by
Maxim Shemanarev, RIP.

~~~
throwaway9087
Practical 2D renderers implement an abstraction layer that lets them easily
redirect their output to a number of low-level libraries, which can be either
CPU- or GPU-based (or a mix). I worked on a few such 2D stacks, including
OpenGL, DirectX, WebGL, and AGG-based ones, and it took no more than a few
days to add a new 2D backend to an existing pipeline. Most 2D rendering is
based on concepts from PostScript, so it's usually easy to do such ports --
except for AGG; that one was a bit like a library from outer space. Maxim
himself worked on a hardware-accelerated 2D renderer for Scaleform, and it
looked nothing like AGG, mostly because AGG is practically impossible to move
to a GPU implementation.

~~~
c-smile
I know; AGG is just a set of primitives, not an abstraction like class
Graphics {...}. But such an abstraction can be assembled from those
primitives. Did it once for early versions of Sciter.

Ideally, GPU and CPU rendering backends should match pixel-perfectly, which
makes "adding a new 2D backend" tricky at best.

------
Abishek_Muthian
iTerm's use of Metal on macOS is a good example of the benefits of
implementing 2D rendering on the GPU[1].

[1][https://gitlab.com/gnachman/iterm2/-/wikis/Metal-Renderer](https://gitlab.com/gnachman/iterm2/-/wikis/Metal-Renderer)

~~~
bori5
That links to iTerm2, which is what I’m guessing you meant to say? Kitty is
also a GPU-accelerated terminal emulator, and one that I enjoy using:
[https://sw.kovidgoyal.net/kitty/](https://sw.kovidgoyal.net/kitty/). Not sure
if it uses Metal on Mac though.

------
marcosscriven
Naive question - but does this change to make use of DX11/shaders in newer
GPUs make it more likely that we'll get a good cross-platform UI for app
development?

I’ve looked before, and although OpenGL is good for windowing, widgets etc.,
it’s not great for subpixel-rendered/antialiased text.

~~~
raphlinus
Yes, the motivation and long term goal for this work is to provide a
performant foundation for cross-platform UI. There's a lot to be done though!

------
Const-me
I wonder how it compares performance-wise against my solution of the same
problem: [https://github.com/Const-me/Vrmac#vector-graphics-engine](https://github.com/Const-me/Vrmac#vector-graphics-engine)

I don’t use compute shaders; instead I tessellate input splines into
polylines and build triangular meshes from those.
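
For anyone curious, a minimal sketch of that style of approach (assumed, not Vrmac's actual code): once a spline is flattened to a closed polyline, a triangle fan plus stencil winding rules fills arbitrary, even self-intersecting, shapes:

    // Fan-triangulate a closed polyline: one triangle per edge, all
    // sharing vertex 0. Rasterized with stencil increment/decrement
    // (stencil-then-cover), overlapping windings cancel, so any polygon
    // fills correctly, convex or not.
    fn fan(poly: &[[f32; 2]]) -> Vec<[[f32; 2]; 3]> {
        poly.windows(2)
            .skip(1)
            .map(|e| [poly[0], e[0], e[1]])
            .collect()
    }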

------
chadcmulligan
This is very impressive. Have you benchmarked it against Core Graphics on a
Mac? I believe they've done a similar thing - performing the render on Metal.

~~~
c-smile
CoreGraphics is a pure CPU rasterizer, similar to GDI+.

Sciter ([https://sciter.com](https://sciter.com)) on MacOS uses Skia/OpenGL by
default with fallback to CoreGraphics.

It is possible to configure Sciter to use CoreGraphics on MacOS to compare
these two on the same UI by using:

    SciterSetOption(NULL, SCITER_SET_GFX_LAYER, GFX_LAYER_CG);

I think it is safe to say that Skia/OpenGL is 5-10 times more performant than
CG on typical UI tasks.

~~~
chadcmulligan
Interesting, thanks - I remember hearing at one of the WWDCs that Core
Graphics got a 10x speed improvement using Metal. I just read the fine print -
it seems only draw calls got the 10x speed improvement, I assume because they
render through a layer of some sort. The CA* libraries (animation and layers)
may use Metal, which I guess is where the 10x draw-call improvement comes in.

Some discussion here, though with a lot of guessing:
[https://arstechnica.com/civis/viewtopic.php?t=1285571](https://arstechnica.com/civis/viewtopic.php?t=1285571)

~~~
raphlinus
Metal is definitely used extensively in CoreAnimation, and Apple UI tends to
rely on that - relatively slow (and memory hungry) rendering of layer content,
which is then composited very smoothly and nicely in CA.

They might use it for other stuff like glyph compositing (I think this is one
reason they got rid of RGB subpixeling, to make it more amenable to GPU), but
last I profiled it, it was still doing a lot of the pixels on CPU, as others
have stated.

------
amelius
Is GPU here synonymous with Nvidia?

~~~
raphlinus
No, it's designed to be portable to all GPU hardware that can support compute,
which these days is a pretty good chunk of the fleet. I tested it on Linux on
Intel HD 4000, and the master branch seems to run just fine, though previous
versions didn't.

A lot of the academic literature (Massively Parallel Vector Graphics, the Li
et al scanline work) is dependent on CUDA, but that's just because tools for
doing compute on general purpose graphics APIs are so primitive. I have a talk
and a bunch of blog posts on exactly this topic, as I had to explore deeply to
figure it out. See
[https://news.ycombinator.com/item?id=22880502](https://news.ycombinator.com/item?id=22880502)
and [https://raphlinus.github.io/gpu/2020/04/30/prefix-sum.html](https://raphlinus.github.io/gpu/2020/04/30/prefix-sum.html)
for more breadcrumbs on that.
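
The prefix-sum post is worth singling out, because scans are the workhorse of this kind of pipeline: each parallel stage counts the outputs it will produce, then an exclusive prefix sum over the counts tells every element where to write. A sequential sketch of the primitive (the hard part, covered in the post, is doing this in parallel on the GPU):

    // Exclusive prefix sum: offsets[i] = counts[0] + ... + counts[i-1].
    fn exclusive_scan(counts: &[u32]) -> Vec<u32> {
        let mut total = 0u32;
        counts
            .iter()
            .map(|&c| {
                let offset = total;
                total += c;
                offset
            })
            .collect()
    }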

