
Pointermove event latency in web browsers - according to
https://rsms.me/pointer-latency/
======
modeless
I've done a lot of work testing this type of latency in web browsers:
[https://google.github.io/latency-benchmark/](https://google.github.io/latency-benchmark/)

On Windows, DWM's display compositing adds one frame of latency to every
window on screen. It's not possible to render a dragged object in any window
that sticks to the mouse cursor without at least one frame of latency.

But when you drag whole windows around they _do_ stick to the mouse cursor
with apparently zero frames of latency; how does DWM do it? Easy, they cheat
by disabling the hardware mouse overlay during window dragging so that the
mouse cursor gets that extra frame of latency too. You can prove this by
enabling "Night light" in settings; watch the mouse cursor change colors as it
transitions from hardware overlay to software rendering when you start
dragging a window.

~~~
DaiPlusPlus
Could compositors be optimised to eliminate the extra frame of lag in the case
where every window on-screen is being displayed “directly”, by invisibly
switching to a mode that maps each scanline and pixel column to a window’s
framebuffer - and non-client areas to the window-manager’s UI buffer - which
is directly read by the monitor signal generator? While this would mean
transparency effects wouldn’t work, they could be supported with some special-
casing. Basically a framebuffer-less hardware compositor. I think rendering
windows to 3D deformable meshes [makes for cool
demos](https://youtu.be/USedxVrU2Ko) but in practice we just don’t use it for
anything besides window open/close animations.

I had to use a monitor running at 30Hz for a while (4K over HDMI 1.4) and
while that was bad enough, the compositor’s lag meant all window contents had
an extra (unnecessary, IMO) delay of 33ms. Add to that the monitor's normal
input lag.

We’ll probably all shift to 120Hz w/ variable-rate refreshing as a new
baseline standard over the next 10 years, as Apple seems to be heading in that
direction - at 120Hz the lag of the compositor would be acceptable - but I’m
worried that lazy graphics devs are going to use that as an excuse to add
another frame of latency...

~~~
modeless
> Could compositors be optimised to eliminate the extra frame of lag in the
> case where every window on-screen is being displayed “directly” by invisibly
> switching to a mode that maps each scanline and pixel column to a window’s
> framebuffer

Yes. This concept is called hardware overlays and there are varying levels of
support for it in different GPUs and compositors.

There are tradeoffs. Using multiple hardware overlays may cost extra power
and/or memory bandwidth, the number of supported overlays may be very limited,
alpha blending may not be supported, and the transforms that can be applied to
overlays may be very limited. The extremely hardware-specific nature of the
restrictions and the lack of good APIs exposing overlays mean they get much
less use than they should.
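
On Linux, for example, the KMS API exposes overlay planes directly. A minimal
sketch of putting a client buffer on one via the legacy libdrm call (fb_id,
plane_id, and crtc_id are assumed to come from compositor setup):

```c
#include <stdint.h>
#include <xf86drmMode.h>

/* Scan a client's framebuffer out on a hardware overlay plane so the
 * display controller reads it directly, skipping composition.
 * fb_id, plane_id, and crtc_id come from compositor setup. */
int scanout_on_overlay(int drm_fd, uint32_t plane_id, uint32_t crtc_id,
                       uint32_t fb_id, int x, int y, uint32_t w, uint32_t h)
{
    return drmModeSetPlane(drm_fd, plane_id, crtc_id, fb_id, 0 /* flags */,
                           x, y, w, h,        /* destination on the CRTC */
                           0, 0,              /* source origin */
                           w << 16, h << 16); /* source size, 16.16 fixed point */
}
```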

~~~
heavenlyblue
Technically, if your pixel operation is commutative and reversible (as it
could be with alpha blending), you could store a buffer of the pixels at the
current position, undo the previous window's contribution using that buffer,
re-apply the operation with the new pixel values, and then send the result
directly to the display?

Am I missing something, apart from the fact that alpha blending is not
actually commutative and/or reversible, and the fact that nobody has
implemented this yet?
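
(For what it's worth, the classic XOR cursor works exactly this way, since XOR
is commutative and self-inverse. A sketch over a hypothetical 32-bit
framebuffer:)

```c
#include <stdint.h>

/* XOR is its own inverse, so drawing the cursor twice at the same
 * position restores the original pixels exactly - no saved "under"
 * buffer needed. fb is a hypothetical 32-bit framebuffer. */
static void xor_cursor(uint32_t *fb, int stride, int x, int y,
                       int w, int h, uint32_t mask)
{
    for (int row = 0; row < h; row++)
        for (int col = 0; col < w; col++)
            fb[(y + row) * stride + (x + col)] ^= mask;
}

/* Moving the cursor: undo it at the old position, apply at the new. */
static void move_cursor(uint32_t *fb, int stride,
                        int old_x, int old_y, int new_x, int new_y)
{
    xor_cursor(fb, stride, old_x, old_y, 16, 16, 0x00FFFFFF); /* erase */
    xor_cursor(fb, stride, new_x, new_y, 16, 16, 0x00FFFFFF); /* draw  */
}
```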

------
emersion
>This happens in a buffer and is normally one display update behind in time.

This assumes compositors perform their work right after each display refresh.
Compositors can instead decide to perform their work later, some amount of
time before the next display refresh (e.g. a few milliseconds). This reduces
latency because new buffers submitted by clients (such as web browsers) can be
displayed with less than one refresh period worth of latency. For instance,
the browser can update its buffer at last display refresh + 8ms, then the
compositor can composite at last display refresh + 13ms, and the new frame can
be displayed at last display refresh + 16ms.

Here's for instance how Weston does it: [1]. Sway has a similar feature.
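
The core of the idea fits in a few lines. A sketch, assuming the compositor
learns the last vblank time from the display driver (e.g. a page-flip event)
and that composite_frame() is a stand-in for the real repaint:

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

void composite_frame(void); /* stand-in for the real repaint */

/* Sleep until margin_ns before the next predicted vblank, then
 * composite, so client buffers submitted late still make the frame. */
void repaint_near_deadline(struct timespec last_vblank,
                           long refresh_ns, long margin_ns)
{
    struct timespec deadline = last_vblank;
    deadline.tv_nsec += refresh_ns - margin_ns;
    while (deadline.tv_nsec >= NSEC_PER_SEC) {
        deadline.tv_nsec -= NSEC_PER_SEC;
        deadline.tv_sec += 1;
    }
    /* Absolute sleep: wake up margin_ns before the vblank. */
    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
    composite_frame(); /* must finish within margin_ns */
}
```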

>However since pointing with a cursor is such a core experience in these
OS'es, the "screen compositor" usually have special code to draw the cursor on
screen as late as possible—as close in time to an actual display refresh as
possible—to be able to use the most recent position data from the input device
driver.

That's not entirely true. Nowadays all GPUs have a feature called "cursor
plane". This allows the compositor to configure the cursor directly in the
hardware and to avoid drawing it. So when the user just moves the mouse around
the compositor doesn't need to redraw anything, all it needs to do is update
the cursor position in the hardware registers.
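
On Linux with KMS, for instance, that update is a single libdrm call (a
sketch; drm_fd and crtc_id are assumed to come from compositor setup):

```c
#include <stdint.h>
#include <xf86drmMode.h>

/* Reposition the hardware cursor plane; nothing is redrawn. */
void move_hw_cursor(int drm_fd, uint32_t crtc_id, int x, int y)
{
    drmModeMoveCursor(drm_fd, crtc_id, x, y);
}
```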

Compositors don't have code to draw the cursor as late as possible. Instead,
they program the cursor position when drawing a new frame. (On some hardware
this allows the compositor to "fixup" the cursor position in case some input
events happen after drawing and before the display refresh.)

But in the end, all of this doesn't really matter. What matters is that the
app draws before the compositor draws, thus the compositor will have a more
up-to-date cursor position.

[1]: [https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html](https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html)

------
skybrian
If you try to "predict the present" based on the past (and when you use
previous points to calculate velocity and acceleration, that's what you're
doing) it will overshoot when there's a change in direction, and how much
depends on how aggressively you try to extrapolate. For the one-dimensional
case in signal processing, doing this with a quickly-changing signal like a
square wave will result in ringing.

It can smooth things a bit but it's not that good a substitute for actually
improving latency.

(There are probably consequences for coronavirus charts as well, since they're
based on lagging data.)
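
Here's a toy demonstration of the overshoot (1-D, velocity-only
extrapolation, made-up positions):

```c
#include <stdio.h>

/* Extrapolate a 1-D pointer position one frame ahead from its last
 * velocity and print the error when the motion abruptly reverses. */
int main(void)
{
    /* Pointer moves right 10 px/frame, then reverses. */
    double pos[] = {0, 10, 20, 30, 20, 10, 0};
    int n = sizeof pos / sizeof pos[0];

    for (int i = 1; i + 1 < n; i++) {
        double velocity = pos[i] - pos[i - 1];
        double predicted = pos[i] + velocity; /* next-frame guess */
        printf("frame %d: predicted %.0f, actual %.0f (error %+.0f)\n",
               i, predicted, pos[i + 1], predicted - pos[i + 1]);
    }
    return 0; /* at the reversal the prediction lands 20 px off */
}
```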

~~~
modeless
Although I agree that there's no substitute for actually improving latency, I
think it's possible to do significantly better at prediction. Mouse movements
are not easily predictable but they are also not completely random; this is a
good type of problem to apply machine learning to.

Ultimately you want the lowest possible latency _and_ prediction, because you
can never get the latency to zero. Once the latency is small enough,
prediction becomes a net win. For example, all VR devices do prediction for
head and hand positions after lowering latency as much as possible elsewhere.

~~~
Rauchg
I totally agree, you want both. Negative latency!

[https://rauchg.com/2014/7-principles-of-rich-web-applications#predict-behavior](https://rauchg.com/2014/7-principles-of-rich-web-applications#predict-behavior)

~~~
fiddlerwoaroof
That demo looks like the sort of attempt to be helpful that really irritates
me on web sites.

------
snvzz
The neglect for latency in current popular systems such as Linux sickens me.

I suggest experimenting with cyclictest from rt-tests. On all hardware I've
tried, I get 30ms+ peaks after running it in the background for not even very
long. I can't comprehend how anybody could find this acceptable.
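
(In essence, cyclictest measures something like this C sketch: request a
precise sleep and record how late the wakeup actually is.)

```c
#include <stdio.h>
#include <time.h>

/* Ask to wake up in exactly 1ms, repeatedly, and track the worst
 * observed wakeup lateness - i.e. scheduling latency. */
int main(void)
{
    long worst_ns = 0;
    for (int i = 0; i < 100000; i++) {
        struct timespec target, now;
        clock_gettime(CLOCK_MONOTONIC, &target);
        target.tv_nsec += 1000000L; /* +1ms */
        if (target.tv_nsec >= 1000000000L) {
            target.tv_nsec -= 1000000000L;
            target.tv_sec += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &target, NULL);
        clock_gettime(CLOCK_MONOTONIC, &now);
        long late_ns = (now.tv_sec - target.tv_sec) * 1000000000L
                     + (now.tv_nsec - target.tv_nsec);
        if (late_ns > worst_ns) worst_ns = late_ns;
    }
    printf("worst wakeup latency: %ld us\n", worst_ns / 1000);
    return 0;
}
```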

I do run linux-rt for this reason. Then again, while linux-rt provides the
tools to make latency reasonable, the rest of the system hardly uses them.

As we move from the likes of Linux to better-architected systems, potentially
based on seL4, I do hope responsiveness will return to sanity. Until then,
I'll have to keep going back to my Amiga hardware as a coping mechanism.

~~~
joosters
Why would a real-time OS help at all with latency? All RT means is that the
latency can be reliably upper-bounded (but note that that upper bound might be
very high/slow), it _doesn't_ mean that the latency will be reduced. Real-
time OSs aren't _faster_.

~~~
codys
linux-rt is a patchset that changes the behavior of linux to increase the
number of places where preemption can occur (among other things).

Doing this decreases certain types of latency in certain situations. As an
example, it tries to have interrupts disabled less frequently and for shorter
intervals, and uses mutexes instead of spinlocks.

As a result, using linux-rt can provide a lower-latency experience compared
to plain linux.

~~~
joosters
Ah, that's fair enough, but it isn't 'real time', which is the thing I was
assuming from the 'RT' in the name. Perhaps linux-ll would be a better name,
for 'low latency'. RT just confuses what it is trying to do.

~~~
codys
It is trying to make the linux kernel more real-time capable. Having periods
of time where preemption isn't enabled (due to having interrupts disabled,
etc.) results in more variation in when tasks are scheduled, including real-
time tasks.

The reality is that "real time" as a definition covers many "features" and
design choices because many ducks need to be in a row for real time tasks to
run properly. Decreasing variation in the scheduling of (real time) tasks is
one of those items.

As a result, it's entirely reasonable to call "linux-rt" "linux-rt".

------
negativegate
I'm seeing <2ms in Edge Chromium and ~10ms in Firefox on a 144 Hz display. I'm
curious how that compares to what other people are seeing.

I've been doing some WebGL work recently and I've noticed that while it
reaches ~144 fps using requestAnimationFrame() in Firefox, there's a lot of
stuttering. It's very smooth at 144 fps in Edge Chromium, while Edge Legacy is
below 80 fps. As far as I can tell it's not CPU bound, and it's definitely not
GPU bound. It would be nice if I could get it running smoothly in Firefox but
I don't know what to investigate.

~~~
epidemian
> I'm curious how that compares to what other people are seeing.

~10ms on Firefox, Linux, 60hz display.

------
jcelerier
> If you move your pointer left and right (or up and down) in sweeping motions
> and follow it with your eyes, you'll notice that the rectangle is trailing
> behind the pointer by quite a long distance

that's definitely not what I am observing
([https://streamable.com/9u4cpx](https://streamable.com/9u4cpx)). Enabling the
predictive tracking, however, is quite nauseating, especially in circular
motions. Please don't play with your users' cursors!

~~~
codys
The article does mention that the predictive tracking feels worse:

> predictive tracking will feel much worse than direct (technically lagging)
> tracking when there is no system cursors to match.

Additionally, we can see the lag between the red box and your cursor in the
video of your screen that you've uploaded.

[https://i.imgur.com/ZEBcGch.png](https://i.imgur.com/ZEBcGch.png)

------
eyelidlessness
I didn't read the article, but I did try the checkboxes. What I saw surprised
me and I will go read the article to see if it addresses my experience, but in
case it doesn't:

1. The predictive checkbox improved tracking my cursor.

2. Disabling `requestAnimationFrame` improved it more.

This is not what I'd have expected, so I'll include details about my
environment:

- macOS 10.15.4

- Safari 13.1

- 2019 16" MBP with maxed RAM and ~25GB swap

I have no idea whether the browser or the memory pressure made same-thread
tracking more accurate, but something did.

------
Shtirlic
Also a great benchmark and perf tester for browsers:
[https://www.vsynctester.com/](https://www.vsynctester.com/)

------
tobr
I recently experimented with implementing certain pointer-controlled effects
on a <canvas>, and was discouraged by the jerky feeling caused by latency.

But I noticed that if I rendered the effect with motion blur, it suddenly
started to feel _much_ smoother, and the perception of jerkiness was mostly
gone. I felt that it completely restored my sense of control of the motion.

It’s surprising, considering that motion blur actually adds half a frame of
extra latency.

Since trying this, I’m bothered by how jerky fast mouse movement always feels
in macOS. 60 fps leaves these enormous, ugly gaps between the pointer's
positions at each frame, and makes it hard to perceive the motion correctly. I
can’t unsee these gaps now! I’m convinced that system-wide motion blur just
for the pointer would be a simple way to make the whole OS feel much smoother
and more responsive.
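
The rendering trick is simple. A sketch, where draw_rect_alpha() is a
hypothetical helper that draws one translucent copy of the effect:

```c
/* Naive motion blur: instead of drawing the object once at its latest
 * position, draw k translucent copies spread along the motion vector
 * from the previous frame. draw_rect_alpha() is hypothetical. */
void draw_rect_alpha(float x, float y, float alpha);

void draw_with_motion_blur(float prev_x, float prev_y,
                           float x, float y, int k)
{
    for (int i = 1; i <= k; i++) {
        float t = (float)i / (float)k;    /* interpolate prev -> now */
        draw_rect_alpha(prev_x + (x - prev_x) * t,
                        prev_y + (y - prev_y) * t,
                        1.0f / (float)k); /* equal weight per sample */
    }
}
```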

~~~
leddt
I had a similar experience when first using a 144Hz display. I was amazed at
how responsive the mouse was, how "in control" I felt.

Then, going back to a 60Hz display, I couldn't NOT see the gaps left by the
cursor's movement. I had never before seen this as a problem, but seeing
something better ruined 60Hz for me.

------
ufo
The dead-reckoning algorithm seems to do well when moving in a straight line,
but my impression is that it does worse if there are curves, because it veers
outside the path the mouse actually traced. For example, when moving the mouse
in a circle, the predicted square's trace appears to move in a circle with a
larger radius.

What kind of algorithm could be used to improve the accuracy for curves?

~~~
aidenn0
For just ellipses (which includes circles), a 2nd-derivative prediction will
work perfectly. Obviously there are paths that are not predictable, though.
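
A minimal sketch of second-order prediction: quadratic extrapolation from the
last three samples, applied independently to x and y (equal sample spacing
assumed):

```c
/* Predict the next sample from the last three equally spaced ones:
 * p_next = 3*p2 - 3*p1 + p0. This is exact for any quadratic path,
 * so it tracks curvature that velocity-only dead reckoning misses. */
static double predict_quadratic(double p0, double p1, double p2)
{
    return 3.0 * p2 - 3.0 * p1 + p0;
}
```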

------
vxxzy
I got a bit excited thinking this may go into latency of dereferencing
pointers in C.

~~~
forrestthewoods
Same! I wrote a blog post that _kind of_ talks about that.

[https://www.forrestthewoods.com/blog/memory-bandwidth-napkin-math/](https://www.forrestthewoods.com/blog/memory-bandwidth-napkin-math/)

~~~
deagle50
Interesting results. Any ideas why the L1 got slower?

~~~
glangdale
Ha; there's a little architecture grognard subthread on this unrelated topic.

The Pentium 4 L1 cache was a miracle for its time, and once the P4 was clocked
to peak NetBurst levels, the 2-cycle latency looked really good.

Tradeoffs on modern systems are different - a Skylake cache _may_ have 4-5
cycle latency on access, but is 4 times bigger (32KiB rather than 8KiB), can
execute twice as many loads per cycle, and is write-back rather than write-
through (more complex to design, but more scalable with lots of cores).

You can still get your ass kicked by an ancient system if you pick just the
right pointer-chasing microbenchmark. This had some real implications for
regex implementation, given that a straightforward DFA implementation (and
many string-matching algorithms like Aho-Corasick) is really just pointer-
chasing.
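
A minimal version of such a pointer-chasing microbenchmark (a sketch; error
handling omitted):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Measure average load-to-use latency by chasing pointers through a
 * randomly shuffled cycle, so every load depends on the previous one. */
int main(void)
{
    size_t n = 1 << 20; /* 8MB of pointers: spills far beyond L1 */
    void **cells = malloc(n * sizeof *cells);
    size_t *perm = malloc(n * sizeof *perm);

    /* Fisher-Yates shuffle of the indices... */
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    /* ...then link the cells in shuffled order, forming one big cycle. */
    for (size_t i = 0; i < n; i++)
        cells[perm[i]] = &cells[perm[(i + 1) % n]];

    /* Chase the pointers; the loads are fully serialized. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **p = &cells[0];
    for (size_t i = 0; i < n; i++) p = *p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Printing p keeps the loop from being optimized away. */
    printf("%.2f ns per load (%p)\n", ns / n, (void *)p);
    free(cells); free(perm);
    return 0;
}
```

Shrink n until the working set fits in L1 and the per-load time drops to the
cache's load-to-use latency.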

~~~
deagle50
Thanks for the info.

