
The compositor is evil - raphlinus
https://raphlinus.github.io/ui/graphics/2020/09/13/compositor-is-evil.html
======
PaulDavisThe1st
It is fascinating to me, as someone who has been doing real time audio work
for 20 years, to see how two fundamental processes have been at work during
that time:

    
    
  * audio developers recognizing the parallels between the way audio and video
    hardware works, and taking as many lessons from the video world as possible.
    The result is that audio data flow (ignoring the monstrosity of USB audio)
    is now generally much better than it used to be, and in many cases is
    conceptually very clean. In addition, less and less processing is done by
    audio hardware, and more and more by the CPU.
    
  * video developers not understanding much about audio, and failing to notice
    the parallels between the two data handling processes. Rather than take any
    lessons from the audio side of things, stuff has just become more and more
    complex and more and more distant from what is actually happening in
    hardware. In addition, more and more processing is done by video hardware,
    and less and less by the CPU.
    

In both cases, there is hardware which requires data periodically (on the
order of a few msec). There are similar requirements to allow multiple
applications to contribute what is visible/audible to the user. There are
similar needs for processing pipelines between hardware and user space code.

(one important difference: if you don't provide a new frame of video data,
most humans will not notice the result; if you don't provide a new frame of
audio data, every human will hear the result).

I feel as if both these worlds would really benefit from somehow having a full
exchange about the high and low level aspects of the problems they both face,
how they have been solved to date, how they might be solved in the future, and how
the two are both very much alike, and really quite different.

~~~
munificent
I like your overall point a lot. I'm an ex-game developer who has also dabbled
in audio programming and music, so I see a little of both sides.

I think the key difference, though, is that consumer video needs are probably
two orders of magnitude more computationally complex than consumer audio
needs. On consumer machines, almost no interesting real-time audio happens.
It's basically just playback with maybe a little mixing and EQ. Your average
music listener is not running a real-time software synthesizer on their
computer. Gamers are actually probably the consumers with the most complex
audio pipelines because you're mixing a lot of sound sources in real-time with
low latency, reverb, and other spatial effects.

The only people doing really heavyweight real-time audio are music producers,
and for them it's a viable strategy for audio programmers to expect them to
upgrade to beefier hardware.

With video, almost every computer user is doing really complex real-time
rendering and compositing. A graphics programmer can't easily ask their
userbase to get better hardware when the userbase is millions and millions of
people.

Also, of course, generating 1470 samples of audio per frame (44100 sample
rate, stereo, 60 FPS) is a hell of a lot easier than 2,073,600 pixels of video
per frame.

I agree that audio pipelines are a lot simpler, but I think that's largely a luxury
coming from having a much easier computational problem to solve and a target
userbase more willing to put money into hardware to solve it.

~~~
PaulDavisThe1st
You don't generate 1470 samples of audio per frame.

You generate N channels worth of 1470 samples per frame, and mix (add) them
together. Make N large enough, and make the computation processes associated
with generating those samples complex enough, and the difference between audio
and video is not so different.
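
For concreteness, a minimal sketch of what that mix step amounts to
(illustrative Rust, not any particular engine's code):

```rust
// Sum N per-track buffers into one output buffer. In a real engine this runs
// inside the audio callback once per hardware period (e.g. 1470 samples at
// 44.1 kHz stereo / 60 Hz); names and sizes here are illustrative.
fn mix(tracks: &[Vec<f32>], out: &mut [f32]) {
    for s in out.iter_mut() {
        *s = 0.0;
    }
    for track in tracks {
        for (o, s) in out.iter_mut().zip(track.iter()) {
            *o += *s; // the whole "compositor" of the audio world is this add
        }
    }
}
```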

Jacob Collier routinely uses 300-600 tracks in his mostly-vocal overdubs, and
so for sections where there's something going on in all tracks (rare), it's
more in the range of 400k-900k samples to be dealt with. This sort of track
count is also typical in movie post-production scenarios. If you were actually
synthesizing those samples rather than just reading them from disk, the
workload could exceed the video workload.

And then there's the result of missing the audio buffer deadline (CLICK! on
every speaker ever made) versus missing the video buffer deadline (some video
nerds claiming they can spot a missing frame :)

~~~
munificent
_> You generate N channels worth of 1470 samples per frame, and mix (add) them
together. Make N large enough, and make the computation processes associated
with generating those samples complex enough, and the difference between audio
and video is not so different._

Sure, but graphics pipelines don't only touch each pixel once either. :)

 _> Jacob Collier routinely uses 300-600 tracks in his mostly-vocal overdubs,
and so for sections where there's something going on in all tracks (rare),
it's more in the range of 400k-900k samples to be dealt with._

Sure, track counts in modern DAW productions are huge. But like you note in
practice most tracks are empty most of the time and it's pretty easy to
architect a mixer that can optimize for that. There's no reason to iterate
over a list of 600 tracks and add 0.0 to the accumulated sample several
hundred times.
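
A hedged sketch of what that optimization might look like (illustrative only,
not any real DAW's mixer; the "active" flag is an assumption about how the
playback engine marks tracks per block):

```rust
// Track and flag names are invented for illustration.
struct Track {
    active_this_block: bool, // assumed to be set by the playback engine
    samples: Vec<f32>,
}

fn mix_active(tracks: &[Track], out: &mut [f32]) {
    out.iter_mut().for_each(|s| *s = 0.0);
    // Skip silent tracks entirely instead of adding 0.0 hundreds of times.
    for track in tracks.iter().filter(|t| t.active_this_block) {
        for (o, s) in out.iter_mut().zip(track.samples.iter()) {
            *o += *s;
        }
    }
}
```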

 _> If you were actually synthesizing those samples rather than just reading
them from disk, the workload could exceed the video workload._

Yes, but my point is that you aren't. Consumers do almost no real-time
synthesis, just a little mixing. And producers are quite comfortable freezing
tracks when the CPU load gets too high.

I guess the interesting point to focus on is that with music production,
_most_ of it is not real-time _and interactive_. At any point in time, the
producer's usually only tweaking, recording, or playing a single track or two
and it's fairly natural to freeze the other things to lighten the CPU load.

This is somewhat analogous to how game engines bake lighting into static
background geometry. They partition the world into things that can change and
things that can't, and use a separate pipeline for each task.

 _> And then there's the result of missing the audio buffer deadline (CLICK!
on every speaker ever made)_

Agreed, the failure mode is catastrophic with audio. With video, renderers
will simply use as much CPU and GPU as they can and players will max
everything out. With audio, you set aside a certain amount of spare CPU as
headroom so you never get too close to the wall.

~~~
pvg
I think the most concrete argument in support of your position is that typical
high-performance audio processing doesn't take an external specialized
supercomputer-for-your-computer with 12.3 gazillion parallel cores, a Brazil-
like network of shiny ducts hotter than a boiler-room pipe and a memory bus
with more bits than an Atari Jaguar ad.

~~~
munificent
_> doesn't take an external specialized supercomputer-for-your-computer with
12.3 gazillion parallel cores,_

Not anymore at least. :)

In the early days of computer audio production, it was very common to rely on
external PCI cards to offload the DSP (digital signal processing) because CPUs
at the time couldn't handle it.

------
skunkworker
It seems like we are coming full circle in trying to reduce latency in all
parts of the chain. With the next-gen gaming consoles supporting VRR and 120
Hz, the recently announced Nvidia Reflex low-latency modes, 360 Hz monitors,
and things like the Apple Pencil 2 reducing the latency from pen input to
on-screen down to 9 ms, we are working our way back to something _close_ to
what we used to have.

But I feel like we have a number of years to go until we can really get back
to where we used to be with vintage gaming consoles and CRT displays.

~~~
perl4ever
I think that with the end of Moore's law, there is a gradual unwinding of
software bloat and inefficiency. Once the easy hardware gains die away, it is
worth studying how software can be better, but it's a cultural transition that
takes time.

~~~
jhayward
It is also the case that in some/many areas software has improved performance
by several multiples of the improvement in hardware performance over the last
20 years. So it makes sense that this would invigorate software performance
investment.

~~~
mpweiher
Can you give some examples of areas where software performance improvement has
been several multiples of hardware improvement?

Most of the examples I can think of are ones where the software _slowdown_ has
more than cancelled out the hardware improvements. Then there are some areas
where hardware performance improvement was sufficient to overcome software
slowdown. Software getting faster? Software getting faster than hardware??

~~~
MayeulC
Compilers are way smarter and can make old code faster than it used to run.
They also parse the code way faster, and would as such compile faster if you
restrict them to the level of optimizations they used to do.

"Interpreters" are faster as well: Lisp runs faster, JavaScript is orders of
magnitude more efficient as well.

Algorithms have been refined. Faster paths have been uncovered for matrix
multiplication [1] (unsure if the latest improvements are leveraged) and other
algorithms.

Use-cases that have been around for a while (say, H.264 encode/decode) are
more optimized.

We now tend to be a lot better at managing concurrency too (see: rust, openmp
and others), with the massively parallel architectures that come out nowadays.

[1]
[https://en.m.wikipedia.org/wiki/Matrix_multiplication_algori...](https://en.m.wikipedia.org/wiki/Matrix_multiplication_algorithm)

~~~
mpweiher
Thanks for the examples!

However, I can't really agree that they are examples of software improvements
being multiples of hardware improvements.

1\. Compiler optimizations

See Proebsting's Law [1], which states that whereas Moore's Law provided a
doubling of performance every 18-24 months, compiler optimizations provide a
doubling every 18 years at best. More recent measurements indicate that this
was optimistic.

2\. Compilers getting faster

Sorry, not seeing it. Swift, for example, can take a minute before giving up
on a one-line expression, and has been clocked at 16 lines/second for some
codebases.

All the while producing slow code.

See also part of the motivation for Jonathan Blow's Jai programming language.

3\. Matrix multiplication

No numbers given, so ¯\\_(ツ)_/¯

4\. h.264/h.265

The big improvements have come from moving them to hardware.

5\. Concurrency

That's hardware.

[1] [https://www.semanticscholar.org/paper/On-
Proebsting%27%27s-L...](https://www.semanticscholar.org/paper/On-
Proebsting%27%27s-Law-Scott/0a2b1aa8bb63fb545f7f41233e5d6c0206486ccc?p2df)

~~~
MayeulC
You are reading my examples in bad faith :) (though I originally missed your
point about "multiples of")

You want examples where software has sped up at a rate faster than hardware
(meaning that new software on old hardware runs faster than old software on
new hardware).

Javascript might not have been a good idea in the first place, but I bet that
if you were to run V8 (if you have enough RAM) on 2005-era commodity hardware,
it would be faster than running 2005-SpiderMonkey on today's hardware. JIT
compilers have improved (including lisps, php, python, etc).

Can you give me an example of 2005-era swift running faster on newer hardware
than today's compiler on yesterday's hardware? You can't, as this is a new
language, with new semantics and possibilities. Parsing isn't as simple as it
seems, you can't really compare two different languages.

These software improvements also tend to pile up along the stack. And
comparing HW to SW is tricky: you can always cram more HW to gain more
performance, while using more SW unfortunately tends to have the opposite
effect. So you have to restrict yourself HW-wise: same price? same power
requirements? I'd tend to go with the latter as HW has enjoyed economies of
scale SW can't.

Concurrency might be hardware, but in keeping with the above point, more
execution cores will be useless for a multithread-unaware program. Old
software might not run better on new HW, but old HW didn't have these
capabilities, so the opposite is probably true as well. Keep in mind that
these new HW developments were enabled by SW developments.

> No numbers given, so ¯\\_(ツ)_/¯

Big-O notation should speak for itself, I am not going to try and resurrect a
BLAS package from the 80s to benchmark against on a PIC just for this argument
;) Other noteworthy algorithms include the FFT [1]. (I had another one in mind
but lost it).

> The big improvements have come from moving them to hardware.

I'm talking specifically about SW implementations. Of course you can design an
ASIC for most stuff. And most performance-critical applications probably had
ASICs designed for them by now, helping prove your point. SW and HW are not
isolated either, and an algorithm optimized for old HW might be extremely
inefficient on new HW, and vice versa.

And in any case, HW developments were in large part enabled by SW developments
with logic synthesis, place and route, etc. HW development is SW development
to a large extent today, though that was not your original point.

What can't be argued against, however, is that both SW and HW improvements
have made it much easier to create both HW and SW. Whether SW or HW has been
most instrumental with this, I am not sure. They are tightly coupled: it's
much easier to write a complex program with a modern compiler, but would you
wait for it to compile on an old machine? Likewise for logic synthesis tools
and HW simulators. Low-effort development can get you further, and _that
shows_. I guess that's what you are complaining about.

[1]
[https://en.wikipedia.org/wiki/Fast_Fourier_transform#Algorit...](https://en.wikipedia.org/wiki/Fast_Fourier_transform#Algorithms)

~~~
mpweiher
> I originally missed your point about "multiples of"

That wasn't my point, but the claim of the poster I was replying to, and it
was _exactly_ this claim that I think is unsupportable.

Has some software gotten faster? Sure. But mostly software has gotten slower
and the rarer cases of software getting faster have been outpaced
significantly by HW.

> You want examples where software has sped up at a rate faster than hardware

"several multiples":

 _that in some /many areas software has improved performance by several
multiples of the improvement in hardware performance over the last 20 years_

> [JavaScript] JIT compilers have improved

The original JITs were done in the late 80s early 90s. And their practical
impact is far less than the claimed impact.

[http://blog.metaobject.com/2015/10/jitterdammerung.html](http://blog.metaobject.com/2015/10/jitterdammerung.html)

As an example the Cog VM is a JIT for Squeak. They claim a 5x speedup in
bytecodes/s. Nice. However the naive bytecode interpreter, in C, on commodity
hardware in 1999 (Pentium/400) was 45 times faster than the one microcoded on
a Xerox Dorado in 1984, which was a high-end, custom-built ECL machine costing
many hundred thousands of dollars. (19m bytecodes/s vs. 400k bytecodes/s).

So 5x for software, at least 45x for hardware. And the hardware kept improving
afterward, nowadays at least another 10x.

> [compilers] Parsing isn't as simple as it seems [..]

Parsing is not where the time goes.

> 2005-era swift running faster

Swift generally has not gotten faster at all. I refer you back to Proebsting's
Law and the evidence gathered in the paper: optimizer (=software) improvements
achieve in decades what hardware achieves/achieved in a year.

There are several researchers that say optimization has run out of steam.

[https://cr.yp.to/talks/2015.04.16/slides-
djb-20150416-a4.pdf](https://cr.yp.to/talks/2015.04.16/slides-
djb-20150416-a4.pdf)

[https://www.youtube.com/watch?v=r-TLSBdHe1A](https://www.youtube.com/watch?v=r-TLSBdHe1A)

(the difference between -O2 and -O3 is just noise)

> Big-O notation should speak for itself

It usually does not. Many if not most improvements in Big-O these days are
purely theoretical findings that have no practical impact on the software
people actually run. I remember when I was studying that "interior point
methods" were making a big splash, because they were the first linear
optimization algorithms that had polynomial complexity, whereas the Simplex
algorithm has exponential worst-case complexity. I don't know what the current
state is, but at the time the reaction was a big shrug. Why? Although Simplex
has an exponential worst case, it _typically_
runs in linear or close to linear time and is thus much, much faster than the
interior point methods.

Similar for recent findings of slightly improved multiplication algorithms.
The _n_ required for the asymptotic complexity to overcome the overheads is so
large that the results are theoretical.

> FFT

The Wikipedia link you provided goes to algorithms from the 1960s and 1940s,
so not sure how applicable that is to the question of "has software
performance improvement in the last 20 years outpaced hardware improvement by
multiples?".

Are you perchance answering a completely different question?

> [H264/H265] I'm talking specifically about SW implementations

Right, and the improvements in SW implementations don't begin to reach the
improvement that comes from moving significant parts to dedicated hardware.

And yes, you have to modify the software to actually talk to the hardware, but
you're not seriously trying to argue that this means this is a software
improvement??

~~~
MayeulC
Another example recently cropped on HN:
[https://news.ycombinator.com/item?id=24544232](https://news.ycombinator.com/item?id=24544232)

> Parsing is not where the time goes.

Not with the current algorithms.

But let's agree to put this argument to a rest. I generally agree with you
that

1\. Current software practices are wasteful, and it's getting worse

2\. According to 1. most performance improvements can be attributed to HW
gains.

I originally just wanted to point out that this was true in general, but that
there were exceptions, and that hot paths _are_ optimized. Other tendencies
are at play, though, such as the end of Dennard scaling. I tend to agree
with
[https://news.ycombinator.com/item?id=24515035](https://news.ycombinator.com/item?id=24515035)
and to achieve future gains, we might need tighter coupling between HW and SW
evolution, as general-purpose processors might not continue to improve as
much. Feel free to disagree, this is conjecture.

> And yes, you have to modify the software to actually talk to the hardware,
> but you're not seriously trying to argue that this means this is a software
> improvement??

My point was more or less the same as the one made in the previously linked
article: HW changes have made some SW faster, other comparatively slower.
These two do not exist in isolated bubbles. I'm talking of off-the-shelf HW,
obviously. HW gets to pick which algorithms are considered "efficient".

~~~
mpweiher
> Parsing [current algorithms]

Recursive descent has been around forever, the Wikipedia[1] page mentions a
reference from 1975[2]. What recent advances have there been in parsing
performance?

> 1\. Current software practices are wasteful, and it's getting worse

> 2\. According to 1. most performance improvements can be attributed to HW
> gains.

Agreed.

3\. Even when there were advances in software performance, they were outpaced
by HW improvements - typically, and indeed almost invariably.

[1]
[https://en.wikipedia.org/wiki/Recursive_descent_parser#Refer...](https://en.wikipedia.org/wiki/Recursive_descent_parser#References)

[2]
[https://archive.org/details/recursiveprogram0000burg](https://archive.org/details/recursiveprogram0000burg)

------
monocasa
I've always really liked the Nintendo DS's GPU, and thought something with a
similar architecture but scaled up would make a lot of sense in a few places in
the stack. Unlike most GPUs, you submitted a static scene, and then the
rendering happened pretty close to lockstep with the scanout to the LCD. There
was a buffer with a handful of lines stored, but it was pretty close; the
latency was pretty much unbeatable. In a lot of ways it was really a 3D
extension to their earlier 2D (NES, SNES, GB, GBA) GPU designs.

Something like that sitting where the scanout engines exist today (with the
multiple planes they composite today) would be absolutely killer if you could
do hardware/software co-design to take advantage of it.

I've also thought that something like that would be great in VR/AR.

~~~
aspaceman
You've just inspired a dive into documentation. I do GPU architecture research
and reading this was exciting. Thanks.

~~~
monocasa
Awesome! I'd really suggest Martin Korth's fantastic documentation in that
case. In a lot of ways it was better than even the official docs given to
developers.

[https://problemkaputt.de/gbatek.htm](https://problemkaputt.de/gbatek.htm)

------
Jasper_
So one thing to note is that hardware overlays are _really_ uncommon on
desktop devices. NVIDIA only has a single YUV overlay and a small (used to be
64x64, now it might be 256x256?) cursor overlay. And even then, the YUV
overlay might as well not exist -- it has a few restrictions that mean most
video players can't or won't use it. After all, you're already driving a giant
power-hungry beast like an NVIDIA card, there's no logical reason to save
power and move a PPU back into the CRTC. So hardware overlays won't help the
desktop case.

I still think we'd get better compositor performance by decoupling "start of
frame" / "end of frame" and the drawing API. The big thing we lack on the app
side is timing information -- an application doesn't know its budget for how
long it should take and when it should submit its frame, because the graphics
APIs only expose vsync boundaries. If the app could take ~15ms to build a
frame, submit it to the compositor, and the compositor takes the remaining
~1ms to do the composite (though likely much less, these are just easy
numbers), the frame could be made to display in the current vsync cycle.
We just don't have accurate timing feedback for this though.
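
As a thought experiment, here's roughly what an app could do if the compositor
did report a composite deadline. This API doesn't exist today; the function
and numbers are made up for illustration:

```rust
use std::time::{Duration, Instant};

// Hypothetical: start building the frame as late as we dare, so the submitted
// image reflects the freshest possible input while still meeting the deadline.
fn pace_frame(composite_deadline: Instant, estimated_render: Duration) {
    let start_at = composite_deadline - estimated_render;
    let now = Instant::now();
    if start_at > now {
        std::thread::sleep(start_at - now); // sample input right before rendering
    }
    // poll_input();
    // build_and_submit_frame(); // must land ~1 ms before the composite
}
```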

One of my favorite gamedev tricks was used on Donkey Kong Country Returns.
There, the developers polled input far above the refresh rate, and rendered
Donkey Kong's 3D model at the start of the frame into an offscreen buffer, and
then, as the frame was being rendered, processed input and did physics. Only
at the end of the frame did they composite Donkey Kong in, using the updated
physics. So they in fact cut the latency to sub-frame levels through clever
trickery, at the expense of small inaccuracies in animation. Imagine if
windows got "super late composite" privileges, where a window could submit its
image just in the nick of time.

(Also, I should probably mention that my name is "Jasper St. Pierre". There's
a few tiny inaccuracies in the history -- old-school Win95/X11 still provides
process separation as the display server bounds the window's drawing to the
window's clip list for the app, and Windows 2000 also had a limited
compositor, known as "layered windows", where certain windows could be
redirected offscreen [0], but these aren't central to your thesis)

[0] [https://docs.microsoft.com/en-
us/windows/win32/winmsg/window...](https://docs.microsoft.com/en-
us/windows/win32/winmsg/window-features#layered-windows)

~~~
modeless
Nvidia's chips support hardware overlays. Historically CAD software used them,
so Nvidia soft-locks the feature in the GeForce drivers to force CAD users to
buy Quadro cards for twice the price instead. Their price discrimination means
we can't have nice things.

~~~
Jasper_
Huh, I've never heard of this. Traditionally, hardware overlays are consumed
by the system compositor. Do they have an OpenGL extension to expose it to the
app?

~~~
modeless
I'm not sure how overlays are exposed in the Quadro drivers. Here's an old
Nvidia doc talking about it:
[https://web.archive.org/web/20180926212259/https://www.nvidi...](https://web.archive.org/web/20180926212259/https://www.nvidia.com/attach/1006974?type=support&primitive=0)

I guess it's possible that their overlay support is limited and not fully
equivalent to more modern overlays. It's tough to tell for sure from that
description. But even if so, the price discrimination aspect may still have
stopped them from wanting to implement a more capable feature and expose it in
the GeForce drivers.

Edit: This documentation suggests it's limited to 16 bit color and 1 bit
transparency.
[http://http.download.nvidia.com/XFree86/Linux-x86/100.14.19/...](http://http.download.nvidia.com/XFree86/Linux-x86/100.14.19/README/appendix-b.html)

~~~
Jasper_
Ah, this is a classic "RGB overlay", as pioneered by Matrox for the
workstation market, which doesn't really mean much, and I assume is fully
emulated on a modern chip. Nothing like a modern overlay pipe like you see in
the CRTCs on mobile chips.

------
modeless
Great article. The brute force solution to this problem is switching to high
refresh rate monitors, but if you want to beat Apple II latency by brute force
without fixing the underlying issues in modern software then 120 Hz probably
isn't enough. You need crazy high refresh rates like 240 or 360 Hz. I do enjoy
high refresh rates, but they are not great for power consumption and limit the
resolution you can support.

~~~
Daishiman
A high refresh monitor isn't going to do much if your synchronization
primitives and scheduling capabilities aren't up to the task. A millisecond of
latency in the compositor is still a millisecond no matter what output refresh
rate you have.

~~~
wffurr
Latency is almost always measured in frames: the renderer is one frame behind
the compositor, which is one frame behind scanout.

If each step takes 4 msec (240 Hz) instead of 16.7 msec (60 Hz), and you're
the same number of steps behind, the latency is reduced by (16.7 - 4) *
nsteps; with three steps, that's 50 msec - 12 msec = 38 msec.

Now that's assuming your system can actually keep up and produce frames and
composite that fast, which is where the "brute force" comes into play.

~~~
maximilianroos
Why is that delay coupled to the monitor that's plugged into the computer?
Could we ask the computer to produce graphics at 240Hz and then take every 4th
frame?

Then we have "average" delay in ms rather than frames.

(Asking as someone who doesn't know this area well)

~~~
zokier
You would effectively get the same result by delaying compositing by 3/4 of
a frame.

~~~
gpm
Only if you know how long it will take to render a frame.

If you render "as fast as possible (to a max of xyz hz), you get better worst
case performance than if you delay compositing. A render time of 2 physical
frames results in only 1 missed frame instead of a render time of 1.26
physical frames resulting in 2 missed frames.

Of course nothing is simple in the real world; the flip side is that rendering
the extra frames creates waste heat, which slows down modern processors with
insufficient cooling.

------
emmanueloga_
> When pressing a key in, say, a text editor, the app would prepare and render
> a minimal damage region

Is this similar to the "dirty rectangles" technique? [1]

It seems difficult to implement some sort of scene system (for either an app,
game or general GUI) that given a single key press can determine a minimum
bounding box of changes on the screen no matter what the scene is, and is able
to render just that bounding box.

If the single key press occurs on a text area, potentially the whole text area
could be "damaged", right? Edit: I was looking at this with the "Rendering" tab
of Chrome dev tools, enabling "Paint flashing" and "Layout Shift Regions". It
seems like the text area is its own layer and the space is partitioned pretty
cleverly into things like paragraphs and lines, but from time to time the whole
text area just flashes, which tells me the algorithm sometimes is not sure
what is dirty and just repaints the whole thing, but not always.

1:
[https://wiki.c2.com/?DirtyRectangles](https://wiki.c2.com/?DirtyRectangles)

~~~
munificent
_> Is this similar to the "dirty rectangles" technique?_

Yes.

 _> It seems difficult to implement some sort of scene system (for either an
app, game or general GUI) that given a single key press can determine a
minimum bounding box of changes on the screen no matter what the scene is, and
is able to render just that bounding box._

I don't think it's that bad. In the days before hardware graphics, basically
every 2D sprite engine used in every computer game you liked had to do this
logic.

Text editors are a harder case in some ways because of line wrapping, but the
editor _does_ need to figure out where everything goes spatially in order to
render, so extending that to tell which things have not moved is, I think, not
that difficult.
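
A minimal sketch of that bookkeeping, the kind of thing an old sprite engine
or a text view might keep (all names are illustrative):

```rust
#[derive(Clone, Copy)]
struct Rect { x: i32, y: i32, w: i32, h: i32 }

impl Rect {
    // Smallest rectangle covering both inputs.
    fn union(self, o: Rect) -> Rect {
        let x = self.x.min(o.x);
        let y = self.y.min(o.y);
        let r = (self.x + self.w).max(o.x + o.w);
        let b = (self.y + self.h).max(o.y + o.h);
        Rect { x, y, w: r - x, h: b - y }
    }
}

#[derive(Default)]
struct Damage { dirty: Option<Rect> }

impl Damage {
    // Anything that moves or changes reports its old and new bounds.
    fn mark(&mut self, r: Rect) {
        self.dirty = Some(match self.dirty { Some(d) => d.union(r), None => r });
    }
    // The renderer repaints only this region, then resets it.
    fn take(&mut self) -> Option<Rect> { self.dirty.take() }
}
```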

------
Androider
There are some hardware fixes that are becoming available.

Variable refresh rate (VRR / G-Sync / FreeSync / Adaptive Sync) and high-
refresh rate hardware is now starting to gain traction on the high-end,
usually with both technologies combined. Apple's ProMotion is often touted as
a "120Hz" feature, but it's also a variable refresh in that it can support
much lower framerates. Without a static v-sync interval, you can display the
frames as they become available, for both perfect tear-free visuals and lower
latency. It's quite likely we'll have that on Macs as well.

I recently got a 144Hz FreeSync (G-Sync compatible) "business" monitor, so
this tech is starting to filter down from the gamer side. It works great with
a compositor on Linux! Ultra smooth mouse and input response at 144Hz, and
completely tear-free. I would highly recommend it for developers as well.

~~~
badsectoracula
> It works great with a compositor on Linux! Ultra smooth mouse and input
> response at 144Hz, and completely tear-free. I would highly recommend it for
> developers as well.

I got a 165Hz monitor and while it is indeed smoother than a 60Hz monitor with
a vsync'd compositor, it still isn't as smooth as a 60Hz monitor without a
compositor.

------
tomxor
> [...] the worst that can happen is just jank, which people are used to,
> rather than visual artifacts.

This is an important point. There are many solutions out there (beyond
compositing) that add incredible complexity, along with other costs, in order
to solve a problem _perceived_ by their proponents... not uncommonly, those
same people pushing the solution are incapable of fairly judging the
cost-benefit - sometimes the solutions are just not worth the cost.

There is currently an issue in Chromium's compositing with checkerboarding
(blanking whole regions of the screen to white) when scrolling, due to an
optimization released to reduce jank:

[https://bugs.chromium.org/p/chromium/issues/detail?id=106266...](https://bugs.chromium.org/p/chromium/issues/detail?id=1062662)

You will notice this in recent releases of Chromium when you scroll any
non-trivial page fast enough... or a simple page very fast by yanking the
scroll bar... unless you have an extremely fast computer and graphics card.
When reading through that thread you will notice how defensive they are; it's
a complex and no doubt incredible bit of coding they have done - unfortunately
the side effect is worse than the original problem, which most users don't
even notice.

In the long run I think time will prove this problem and its solutions are
purely transient... In the same way font anti-aliasing is becoming obsolete
with DPIs higher than we can perceive, solving jank isn't necessary if you
cannot perceive it (i.e. using frame rates > 60 Hz)... and the original
problem really isn't as bad as all the people attempting to solve it think it
is anyway.

~~~
Firadeoclus
> In the same way font anti-aliasing is becoming obsolete with DPIs higher
> than we can perceive

Screens might exceed the acuity of most users but not hyperacuity. Even on a
400ppi phone screen at typical viewing distances it is possible to tell
whether a slanted line is anti-aliased or not. Font anti-aliasing is not
becoming obsolete any time soon.

~~~
pekim
I agree, anti-aliasing is certainly still required at 400dpi. Although
subpixel anti-aliasing is not.

------
boulos
I was surprised that the tile-based updates from VNC, Teradici, PCoIP, etc.
weren't in the related work. In particular, a challenge with "send smaller
updates" rather than whole frames is how you decide to deal with massive
updates where you'd end up causing more work / traffic than the whole-frame
approach.

This post is focused on local compositing, but I think the same arguments
apply as with the networked case: too many updates is actually worse than
vsync, but the usual case of "just ship deltas" is amazing (for remote display
you get the bandwidth and latency win, here you'd get latency/power/whatever).

I think a low-level "updates" API would make sense for some sophisticated
applications, but I'm not convinced that this quote holds:

> I think this design can work effectively without much changing the
> compositor API, but if it really works as tiles under the hood, that opens
> up an intriguing possibility: the interface to applications might be
> redefined at a lower level to work with tiles.

Seems like if you can show Chrome and VLC both working fine, that'd be a great
proof of concept!

~~~
taeric
I feel like this almost always comes back to a reinvention of I-P-B Frames.

I am curious if there is a solid data structure that could be used to collect
updates in such a way that they could be coalesced into a collection of other
updates efficiently.

------
nayuki
Variable refresh rate technologies like FreeSync and G-Sync are used in some
video games. Wouldn't this improve the latency of desktop compositing too?
After all, the idea is that the GPU sends out a frame to the monitor once it's
finished, not once a periodic interval is reached.

~~~
raphlinus
We had a discussion on this in #gpu on our Zulip. The short answer is that if
the screen is updating at 60fps, then FreeSync and its cousins are effectively
a no-op. If the display is mostly idle, then there is potentially latency
benefit in withholding scanout until triggered by a keypress. Older revisions
of FreeSync have a "minimum frame rate" which actually doesn't buy you much;
40Hz is common: [https://www.amd.com/en/products/freesync-
monitors](https://www.amd.com/en/products/freesync-monitors)

~~~
amluto
I would expect that VRR helps if you try to make your compositor only start
compositing as late as possible while still finishing on time. With VRR, one
could delay the next frame slightly if the deadline gets missed. This would
work even when rendering at 60Hz.

~~~
medlyyy
The problem is that this causes jitter, which is extremely noticeable even in
small amounts. You can't have smooth animations if the frame timing is
constantly varying.

------
amelius
I'm afraid that you'd need a realtime OS to implement a compositor with the
suggested properties.

~~~
zokier
I'm of the opinion that a real-time desktop is already overdue. I can
understand that in the olden days the perf benefit from dropping latency
guarantees was worth it (the classic latency vs. throughput tradeoff),
especially on servers, from which current operating systems draw their lineage
(NT and UNIX). But these days, with beefy, even overpowered CPUs, and 25 years
of OS/real-time research later, I would imagine that we would be in a place
where the tradeoff would look different.

Of course, just slapping an RT kernel into a modern system would not do much
good on its own. The whole system needs to be designed with RT in mind,
starting from the kernel and drivers, through toolkits/frameworks and
services, up to the applications themselves. But ultimately the APIs for
application developers should be nudging people to do the right things
(whatever those would be).

~~~
flohofwoe
You even need to go down to the hardware; a realtime kernel running on
mainstream CPUs and GPUs might not be enough (I wonder if a realtime system
can even be built with current mainstream CPUs and GPUs):

The video systems in 8-bit home computers only worked because the _entire
system_ provided "hard realtime" guarantees. CPU instructions always were the
same number of cycles and memory access happened at exactly defined clock
cycles within an instruction. Interrupt timing was completely predictable down
to the clock cycle, etc etc. Modern computers get most of their performace
because they dumped those assumptions and as a result made timings entirely
unpredictable (inside the CPU via caches, pipelining, branch prediction, etc
etc...), and between system components (e.g. CPU and GPU don't run cycle-
locked with each other).

------
LargoLasskhyfv
I don't get this obsession with compositors. I just switch them off. Has no
benefit for me. If I'm so inclined I can move windows with video running in
_Full-HD_ without any tearing across several screens like mad. When I'm in a
terminal window of some sort I don't care for the real transparency, because
it is unergonomic. When I'm in some other application the same applies.

Can someone enlighten me about the real benefits?

Seriously.

~~~
sjy
The app thumbnails in macOS’s Mission Control [1] and GNOME’s Activities
overview [2] provide a nice way of managing graphical windows and workspaces
which requires a compositor. Animated window management in general is a UI
affordance that I value. I’m not sure if it’s a “real benefit,” but
consistently getting this right is one thing that attracts users to macOS and
iOS.

[1] [https://support.apple.com/en-us/HT204100](https://support.apple.com/en-
us/HT204100)

[2] [https://help.gnome.org/users/gnome-help/stable/shell-
introdu...](https://help.gnome.org/users/gnome-help/stable/shell-
introduction.html.en#activities)

~~~
LargoLasskhyfv
I know them. Maybe I'm just an old fart and "set in my ways", but it's not
even "nice to have" for me when it compromises the general haptics of the
environment, as in dragging it down with sluggishness.

------
crazyloglad
A few pointers:

The first bit is that practically all that is argued for here is in Android -
and has been for a long time (~2013, last I worked on the topic in that
context) - and it is still not enough: they will go (and are going) to higher
refresh rates.

Even at the time there were crazy things like tuning the CPU governor to wake
up when the touch screen detects incoming 'presence' (before you can get an
accurate measurement of where and how hard the finger hit) - it's not like you
are going to change your muscle intention mid-flight. Input events were
specifically tagged so that the memory would be re-used rather than GCed and
reallocated.

A sunny frame could go from motion to photon in 30ms. Then something happens
and the next takes 110ms. That something is often garbage collectors and other
runtime systems that think it is safe to do things as there is no shared
systemic synchronization mechanism around for these things.

The second is that this is -ultimately- a systemic issue, and the judge, jury,
and executioner is the user's own perception. That's what you are optimising
for. Treating it as a graphics-only thing is not putting the finger on
something, it is picking your nose. The input needs to be in on it, audio
needs to be in on it, and the producer needs to be cooperative. Incidentally,
Guitar Hero-style games and anything VR are good qualitative evaluation
targets, but with brinelling-like incremental loads.

(Larger scale)

1\. Communicate deadlines to the client so the renderer can judge if it is
even worth doing anything.

2\. Communicate presentation time to the client so animations and video frames
arrive at the right time.

3\. Have a compositor scheduler that will optimize for throughput, latency,
minimizing jitter, or energy consumption.

4\. Inform the scheduler about the task that is in focus, and bias client
unlock and event routing to distinguish between input focus and the thundering
herd.

5\. Type-annotate client resources so the scheduler knows what they are for.

6\. Coalesce / resample input and resize events.

7\. Align client audio buffers to video buffers.

8\. Clients with state store/restore capabilities can rollback, inject and
fast-forward ([http://filthypants.blogspot.com/2016/03/using-rollback-to-hi...](http://filthypants.blogspot.com/2016/03/using-rollback-to-hide-latency-in.html))

There is a bunch of more eccentric big stuff after that as well as the well
known quality of life improvements (beyond the obvious gamedev lessons like
don't just malloc in your renderloop and minimize the amount of syscall jitter
in the entire pipeline), but baby steps.

Oh, and all of the above++ is already in Arcan and yet it is still nowhere
near good enough.

~~~
baybal2
Android scrolling still feels surprisingly laggy to me in comparison to even
something like GPE or OPIE, which used GTK+ (and Qt mobile respectively) and
had fully CPU-based animation.

It tells you just how much more garbage has been thrown into the Linux stack
since 2007 that even, effectively, kernel-based and hardware-accelerated
animation stutters today.

------
roca
The beam-racing compositor is a cool idea. It seems to me, however, that it
could create visible jitter for applications that use traditional vsync-driven
frame timing --- e.g. an animation of vertical motion would see a one-frame
latency change when the damage rectangle crosses some vertical threshold.

~~~
raphlinus
If you're animating (or doing smooth window resizing), then your "present"
request includes a target frame sequence number. The way I analyze the text
editor case is that it's something of a special case of triple-buffering, but
instead of spamming frames as fast as possible at the swapchain, you do
multiple revisions for a single target sequence number. The last revision to
make it ahead of the beam wins. If you have a smooth clock animation in
addition to some latency-critical input sensitive elements, then the clock
hands would be in the same position for each revision, only the other elements
would change.

If done carefully, this would give you smoother animations at considerably
lower power consumption than traditional triple buffering. It's not the way
present APIs are designed today, though, one of my many sadnesses.
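
A hedged sketch of what such a present API might look like (invented types;
this is not how any current swapchain works):

```rust
struct ImageHandle(u64);

// Each submission names the scanout frame it is intended for; a later
// revision for the same frame supersedes an earlier one.
struct Present {
    target_frame: u64,
    revision: u32,
    image: ImageHandle,
}

// At composite time, "the last revision to make it ahead of the beam wins".
fn pick(submitted: &[Present], frame: u64) -> Option<&Present> {
    submitted
        .iter()
        .filter(|p| p.target_frame == frame)
        .max_by_key(|p| p.revision)
}
```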

As I hinted, this is a good way to synchronize window resize as well. The
window manager tells the application, "as of frame number 98765, your new size
is 2345x1234." It also sends the window frame and shadows etc with the same
frame number. These are treated as a transaction; both have to arrive before
the deadline for the transaction to go through. This is not rocket science,
but requires careful attention to detail.

------
mncharity
A year or few back, I did camera-passthru XR in VR HMDs, on a DIY stack with
electron.js (which is chromium) as compositor. So a couple of thoughts...

The full-screen electron window apparently displaced the window system
compositor. So "compositing" specialized compositors can be useful.

Chromium special-cased video, making it utterly trivial to do _comfortable_
camera passthru, while a year+ later, mainstream stacks were still considering
that hard to impossible. So having the compositor architecture match the app
is valuable.

That special case would collapse if the CPU needed to touch the camera frames
for analysis. So carefully managing data movement between GPU and CPU is
important. That's one design objective of google's mediapipe for instance.
Perhaps future compositors should permit similar pipeline specifications?

My XR focus was software-dev "office" work, not games. So no game-dev "oh, the
horror! an immersion-shattering visual artifact occurred!" \- people don't
think "Excel, it's ghastly... it leaves me aware I'm sitting in an office!".
Similarly, with user balance based on nice video passthru, the rest of the
rendering could be slow and jittery. Game-dev common wisdom was "VR means 90
fps, no jitters, no artifacts, or user-sick fail - we're GPU-tech limited",
and I was "meh, 30/20/10 fps, whatever" on laptop integrated graphics. So...
Games are hard - don't assume their constraints are yours without analysis.
And different aspects of the rendered environment can have very different
constraints.

My fuzzy recollection was chromium could be persuaded to do 120 Hz, though I
didn't have a monitor to try that. Or higher? - I fuzzily recall some variable
to uncap the frame rate.

I've used linux evdev (input pipeline step between kernel and libinput)
directly from an electron renderer process. Latency wasn't the motivation, so
I didn't measure it. But that might save an extra ms or few. At 120 Hz, that
might mean whatever ms HID to OS, 0-8 ms wait for the next frame, 8 ms
processing and render, plus whatever ms to light. On electron.js.

New interface tech like speech and gesture recognition may start guessing at
what it's hearing/seeing _many_ ms before it provides its final best guess at
what happened. Here, low-latency responsiveness is perhaps more about app
system architecture than pipeline tuning, with app state and UI supporting
iterative speculative execution.

Eye tracking changes things. "Don't bother rendering anything for the next 50
ms, the user is saccading and thus blind." "After this saccade, the eye will
be pointing at xy, so only that region will need a full resolution render"
(foveated rendering). Patents... but eventually.

------
upofadown
As someone who uses a tiling window manager, it has been a long time since I
have actually used a compositor. I have to ask: are the cool window effects
actually worth the bother? I just can't see anyone actually caring about stuff
like that in 20 years.

~~~
yoz-y
Nobody might care about the 3D cube, but stuff like shadows under windows has
a clear usability benefit, i.e. knowing which window is on top of which. Of
course in a tiling WM you don't care about that, but the vast majority of
people don't want to use tiling WMs.

~~~
mrob
Why would I need to know the Z-order of overlapping windows? None of the
windows I'm actively working with overlap (it would be too annoying to have
them constantly blocking each other), and all the rest just have "background"
status. I don't care how far they are into the background, so drop shadow is
useless visual clutter.

~~~
yoz-y
You are you. Other people work differently. If you want to hear how a power
user uses overlapping windows I’d recommend listening to this ATP episode:
[https://atp.fm/96](https://atp.fm/96)

------
andrekandre
Without a compositor, how does one implement a feature like Exposé/Mission
Control efficiently?

------
jankotek
Most of those problems could be avoided by using a higher refresh rate. 120+
Hz is pretty common in gaming and phones.

In reality we render stuff with slow JavaScript. Latency after a key press may
not be 16 ms, but 16 seconds.

~~~
colordrops
Hate to go down this road, but JavaScript is anything but slow. What is slow
is the layer upon layer of complexity + poor coding of many web apps. That's
not a problem exclusive to JavaScript.

~~~
jcelerier
> Hate to go down this road, but JavaScript is anything but slow. What is slow
> is the layer upon layer of complexity + poor coding of many web apps. That's
> not an exclusive problem to JavaScript.

Where are those fast JavaScript apps? Where? Compare the super snappiness of
Telegram Desktop with the molasses of $every_other_chat_client_using_electron,
for instance. And if you are going to say that VSCode is fast, you have no
idea what truly fast software feels like - if you come across one, try a QNX 6
demo ISO and be amazed at how fast your computer answers to literally
everything.

~~~
colordrops
That doesn't contradict my statement at all.

The problems you speak of are due to the browser runtime, not JavaScript.

~~~
jcelerier
What makes you say that they aren't also due to the kind of programming style
that JS encourages?

Also, my anecdotal experience writing some audio algorithms in pure JS is that
it's... not an order of magnitude slower, but not far from it, compared to the
performance of the same code written in C++ with std::vector<double>s. Likely
there's some magic that could be done to improve it, but the C++ code is
written naïvely and ends up extremely well optimized.

------
guhcampos
"As the experience of realtime audio illuminates, it’s hard enough scheduling
CPU tasks to reliably complete before timing deadlines, and it seems like GPUs
are even harder to schedule with timing guarantees."

Well, I suppose that's way more feasible in RISC CPUs.

Damn I hate when Apple is right about something.

~~~
monocasa
RISC as a concept doesn't really help you with making timing guarantees.

~~~
est31
Yeah as of today, the instruction language used isn't really relevant anyway.
You still have to figure out the same problems as a chip designer.

