
Help the Graphics team track down an interesting WebRender bug - Santosh83
https://mozillagfx.wordpress.com/2020/02/18/challenge-snitch-on-the-glitch-help-the-graphics-team-track-down-an-interesting-webrender-bug/
======
Const-me
I’m almost sure it’s incorrect usage of D3D. I’ve encountered and fixed
similar bugs in my 3D graphics code.

Based on the symptoms and the difficulty of reproducing them, I think what you
see in the screenshots are incomplete renders. The FF code submitted some draw
calls to the GPU and copied the render target texture without waiting for the
draw calls to complete. A good way to fix it is to use a D3D11_QUERY_EVENT
query to wait for completion of rendering.
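
To make the failure mode concrete, here is a minimal sketch in Python, not
actual D3D11 code. `FakeGpu`, `end_event`, `wait_for` and `copy_target` are
hypothetical names invented for the illustration. The toy queue only mimics
the idea that submitted draws finish later, and that an event placed after
them (the role a D3D11_QUERY_EVENT query would play) tells you when they are
done.

```python
from collections import deque


class FakeGpu:
    """Toy model of an asynchronous GPU command queue (not a real D3D11 API)."""

    def __init__(self, size):
        self.render_target = [0] * size  # 0 means "pixel not drawn yet"
        self.queue = deque()
        self.signaled = set()

    def draw(self, index, value):
        # Submitting a draw only records it; nothing is executed yet.
        self.queue.append(("draw", index, value))

    def end_event(self, event_id):
        # The event is signaled once every command queued before it has run.
        self.queue.append(("event", event_id, None))

    def step(self, count=1):
        # Simulate the GPU making some progress in the background.
        for _ in range(count):
            if not self.queue:
                return
            kind, a, b = self.queue.popleft()
            if kind == "draw":
                self.render_target[a] = b
            else:
                self.signaled.add(a)

    def wait_for(self, event_id):
        # Keep the GPU running until the event has been signaled.
        while event_id not in self.signaled:
            if not self.queue:
                raise RuntimeError("event was never submitted")
            self.step()

    def copy_target(self):
        # Reads whatever has been drawn so far, finished or not.
        return list(self.render_target)


def render_frame(gpu, wait):
    for i in range(len(gpu.render_target)):
        gpu.draw(i, 255)
    gpu.end_event("frame-done")
    gpu.step(2)  # assumption: the GPU has only finished two draws so far
    if wait:
        gpu.wait_for("frame-done")
    return gpu.copy_target()


print(render_frame(FakeGpu(4), wait=False))  # [255, 255, 0, 0]
print(render_frame(FakeGpu(4), wait=True))   # [255, 255, 255, 255]
```

Copying without the wait returns a half-drawn target, which would match the
symptom of partially rendered content; waiting on the event first returns the
complete one.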

If all GPU access happens through D3D11 this shouldn’t happen i.e. the API
guarantees to wait for the completion of previously submitted draw calls. It
may happen in practice when mixing multiple GPU APIs, e.g. when using DXGI
surface sharing to pass textures between D3D11 and DX9. Also, it may happen
when using D3D from multiple threads. An easy way to detect the latter: set
the D3D11_CREATE_DEVICE_DEBUG flag when creating the device, and read the
warnings in the debug output of the process.

Another possibility is just bugs in rendering code. To troubleshoot them,
[https://renderdoc.org/](https://renderdoc.org/) is awesome. Unfortunately, if
FF doesn’t use D3D to present and only renders to textures, some changes to
FF’s source code are required to capture frames with RD; see this page for
details:
[https://renderdoc.org/docs/in_application_api.html](https://renderdoc.org/docs/in_application_api.html)

~~~
pcwalton
The problem with using RenderDoc is that it doesn't capture ANGLE GL function
usage, only the underlying D3D calls. So you have to correlate the high-level
GL API calls with whatever ANGLE is lowering them to yourself.

It would be nice to have a native D3D11 backend (and D3D12, and Metal, and
Vulkan) someday, but that day isn't today. gfx-rs, or wgpu-rs, looks promising
as an abstraction layer over all of these APIs.

~~~
Const-me
> it doesn't capture ANGLE GL function usage

It's open source under the MIT license. I wouldn't expect too many issues
supporting such a use case, given RD's support for normal OS-supplied GLES. I
would only expect complications if you try capturing both layers at the same
time: GLES between your app and ANGLE, and D3D between ANGLE and d3d11.dll.

~~~
jonahrd
RenderDoc does have some issues tracking calls from ANGLE's backend, but
workarounds are documented here for some platforms:
[https://chromium.googlesource.com/angle/angle/+/master/doc/D...](https://chromium.googlesource.com/angle/angle/+/master/doc/DebuggingTips.md#running-angle-under-renderdoc)

------
godelski
I used to have this problem __ALL__ the time. So I can give some specs to help
anyone, because another comment suggests old hardware.

I __only__ experienced the text tearing. It would basically be a slash through
the screen, usually two or three where text would just be displaced. A small
scroll would fix it.

This has happened to me in both Manjaro and Ubuntu and on two different
computers. The Manjaro machine is an Inspiron with a 1060 Max-Q. The Ubuntu
machine has a GTX Titan Black (so old).

I have not experienced these glitches in fairly recent memory. So probably
last two versions of FF. These computers are my daily drivers and I keep them
fairly up to date.

Usually where these glitches would happen is in long form text articles. I
don't recall seeing them on Reddit, but more like NYT. Hope this helps.

~~~
rahuldottech
I've been experiencing this too! On only one of my machines. I did a complete
Windows reinstall but the problem persisted.

It doesn't happen too often, but iirc I saw a text glitch sometime in the past
few days. Moving my mouse over the glitched area or a small scroll solves it
for me too.

~~~
godelski
Yeah I could get rid of it by scrolling or highlighting the text. I remember
looking it up and only finding one small bug report from a year ago. But I
don't see it anymore. Are you on the newest FF? I don't use nightly and I
haven't seen this problem in a while.

~~~
rahuldottech
Just checked, I'm on latest regular FF, non-nightly. I don't recall seeing it
in the past couple days so maaaaybe it's resolved now? I'll keep an eye out
for it in case I see it again.

------
petargyurov
This bug happens to me somewhat often. In fact it happened to me yesterday. It
fixes itself after a couple of seconds. No idea how it happens - I haven't
noticed a pattern.

I have a GTX 1070.

~~~
jbonisteel
Could you file a bug for us?
[https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&comp...](https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Graphics%3A+WebRender)

~~~
capableweb
I'm in the same boat with Windows 10 + nvidia 2080ti. Not sure why you want a
bug report filed when neither of us has any way of reproducing the issue nor
more information about where/when it happens.

~~~
jdashg
Because that's what bug reports are helpful for: organizing and collecting
data, which helps with the whole "finding a repro case" and eventually fixing
this issue.

Did you think bugs got fixed by mentioning "me too" in a HN comment thread? :)

------
dx87
I was having a weird WebRender issue last year, except it was causing sudden
reboots until the computer stayed powered off long enough for the GPU memory
to be cleared. I had enabled WebRender in Linux (it's experimental, I know),
and the computer would run fine for weeks at a time, then reboot out of
nowhere. It would continue to reboot any time I opened firefox, sometimes a
few minutes after starting firefox. The reboots would persist even when I
would boot into Windows instead of Linux after one of the forced reboots. Once
I narrowed it down to firefox causing the reboots, I started it in safe mode
and disabled WebRender, and I haven't had any issues since. This was with a
1050TI graphics card.

~~~
jamienicol
That sounds more like a Linux or hardware issue than a webrender one!
(Likewise if a website causes Firefox to crash, that's a Firefox issue rather
than the website's)

------
rmccue
I am fairly certain I’ve seen this bug, but on macOS using a MBP’s integrated
graphics. I opted into WebRender a while ago via about:config, so I had
figured it was just a random beta issue. Will have to see if I can capture it
next time it happens!

~~~
dawnerd
Same here as well, but in fairness I've also seen it happen in Chrome too on
the same sites.

~~~
nicoburns
Came here to say this. I've seen this on my (2015 13 inch MBP), in both Chrome
and Firefox.

------
tomlong
Heh, i was talking about this today. I do three days a month for a client
where I use their computers and it's the only time I ever use Windows 10 and
the only time I see this. I was asking around about it but no one else was
using Firefox. Next time I'm in I'll pay more attention.

~~~
jbonisteel
If you can grab the info in about:support for that machine and file a bug:
[https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&comp...](https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Graphics%3A+WebRender)
that would be helpful!

------
nootropicat
I get it all the time but I have gfx.webrender.all set to false.
gfx.webrender.blob-images, webrender.dcomp-win.enabled, force-angle,
picture-caching, triple-buffering.enabled are set to true

However gfx.webrender.enabled is set to false so are they even active?

FF 72.0.2

~~~
jamienicol
If you go to about:support and check the value of "compositing", you will see
if webrender is enabled.

------
bfdm
This is so bizarre for me because it's like a flashback. I was seeing errors
_exactly like this_ in Firefox years ago, around 4-5 years or more I'd say.

This was on windows 7 64bit with a probably outdated install because my
favorite extension for vertical tab tree was always behind on version support.

This error would manifest when I had many tabs open for a long time. I'd often
keep sessions of hundreds of tabs open for weeks. The errors would start to
appear and gradually get worse until I had to force crash ffx and restart (and
recover my tabs!)

Not really actionable or a repro, but anecdotally maybe it helps to hear this
isn't strictly new.

------
egfx
Looks a lot like older graphics cards overheating. I worked in graphics QA.

~~~
trynumber9
I get this frequently. Usually when an update fires and I scroll. Have not yet
been able to reproduce reliably. I really doubt it's overheating, since it
takes a fraction of a second to happen and it couldn't heat up that quickly.

~~~
inetknght
> _it couldn't heat up that quickly_

It could, actually, especially if it's not well maintained.

~~~
trynumber9
It's a $1200 GPU, I do make time to clean. I'm not saying it's Firefox
specifically, perhaps a strange interaction between Firefox and the latest
Nvidia drivers. But you're welcome to come check my machine.

------
AlexMax
Son of a gun, I have been running into this issue sporadically ever since I
built my new PC with a Radeon 5700 XT, and I thought it might have been a bug
in the drivers or a bad card, but turns out it's this instead.

For what it's worth, I downgraded my drivers to the last set of drivers from
2019 (19.9.2) and the problem seems to have gone away, but I'm having other
issues instead (for instance, half the screen flashes dark when Night Light
turns on). Guess it's time to upgrade and file some bugs.

------
modeless
The state of debugging for graphics code is deplorable. When you run code on a
GPU you almost always have a gigantic closed source binary blob underneath
you, the graphics driver. Graphics drivers contain complete compilers that
munge your code with a focus on speed over correctness. The graphics driver is
often years out of date, and you can't even run it yourself to reproduce
issues unless you go out and buy the exact hardware that each user has.

The state of the art in debugging GPU issues is to have a closet full of
random hardware purchased used on eBay that hopefully contains something close
enough to what you need to reproduce the issue you're working on.

Most fancy debugging tools pretend that GPUs don't exist and GPU-focused
debuggers are incredibly buggy and feature-poor, not to mention hardware
specific in many cases.

~~~
_bxg1
I've never understood the logic behind closed-source drivers. Nobody pays for
a driver, people pay for hardware. If you want your hardware to get better
support, and therefore become more popular, aren't you incentivized to open-
source the drivers?

~~~
modeless
> Nobody pays for a driver

This is not true. Driver quality is a major differentiating factor between
GPUs and absolutely influences purchasing decisions.

GPU makers don't want their competitors to be able to take parts of their
drivers and benefit from them. They want to protect trade secrets. They want
to avoid revealing how many of their competitors' patents they are violating.
They don't own the rights to some third party code or trade secrets that are
incorporated in their drivers. They want to avoid revealing future product
plans.

Of course it's debatable whether closed source drivers actually help with all
of the above reasons, or whether they are good reasons to begin with. But
those are the reasons.

~~~
HeWhoLurksLate
AMD Radeon has made multiple of their performance-enhancing projects open
source, only to have nVidia copy and implement their work without
reciprocating the favor.

~~~
lmz
Is it mandatory to reciprocate? If not, then that's a problem of their own
making, isn't it?

~~~
HeWhoLurksLate
It's not mandatory, but it sure isn't polite, especially when projects like
HairWorks are killing AMD because it's a driver handling issue more than a
hardware one.

------
looperhacks
OT: This is the first time Firefox has warned me about blocking social media
trackers. Funny that it happened on one of Mozilla's own websites.

~~~
nical
Yeah that's a wordpress blog quickly put together outside of mozilla's
official blogging infrastructure. There was no really good reason to do that,
it was just laziness on our end.

------
miohtama
Bug hunting is one of the easy-to-apply kinds of crowd wisdom. Let's hope
Mozilla gets enough data to nail down the root cause.

------
zamalek
Black rectangle rings true here. I get it pretty reliably on my phone (preview
nightly, Android). If I click a GitHub link and FastHub automatically opens
(any GitHub link), when I go back on the phone I see a black rectangle for the
whole page.

Pixel 3. Hopefully it's a common codepath.

~~~
jamienicol
That sounds like a separate issue. Similar to one I had before on android, but
fixed. If you go to settings, about Firefox nightly, what version does it say?

Does the black go away by itself after a couple seconds, or when you scroll,
or does it stay forever?

------
reuben_scratton
I haven't seen this particular issue but I've seen very similar bugs in my own
work where GPU and CPU are not properly sync'd when using glyph or small image
cache textures.

E.g. gpu has a frame pending, CPU updates glyph cache, GPU renders pending
frame using incorrect cache data.
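
A minimal sketch of that sequence in Python, as a toy model rather than
anything taken from WebRender. `GlyphCache`, `defer_write`, `execute_frame`
and `run` are hypothetical names, and deferring the write until the pending
frame has executed is just one assumed way to keep the two sides in sync.

```python
class GlyphCache:
    """Toy glyph atlas: each slot holds one glyph and slots get reused."""

    def __init__(self, slots):
        self.slots = [None] * slots
        self.pending_updates = []

    def write(self, slot, glyph):
        # Immediate write: unsafe if a pending frame still reads this slot.
        self.slots[slot] = glyph

    def defer_write(self, slot, glyph):
        # Queue the update instead of touching data a pending frame may read.
        self.pending_updates.append((slot, glyph))

    def apply_deferred(self):
        for slot, glyph in self.pending_updates:
            self.slots[slot] = glyph
        self.pending_updates.clear()


def execute_frame(cache, frame):
    # The "GPU" samples the cache when the frame executes, not when it was
    # submitted, so the frame only stores slot indices.
    return "".join(cache.slots[slot] for slot in frame)


def run(synchronized):
    cache = GlyphCache(slots=2)
    cache.write(0, "a")
    cache.write(1, "b")
    pending_frame = [0, 1, 0]  # frame submitted for the text "aba"

    # The CPU prepares the next frame and wants slot 0 for the glyph "c".
    if synchronized:
        cache.defer_write(0, "c")  # held back until the pending frame has run
    else:
        cache.write(0, "c")        # overwrites data the pending frame needs

    rendered = execute_frame(cache, pending_frame)
    cache.apply_deferred()
    return rendered


print(run(synchronized=False))  # cbc
print(run(synchronized=True))   # aba
```

In the unsynchronized run the pending frame comes out as "cbc" instead of
"aba": the draw itself is fine, it just samples a cache that changed
underneath it.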

------
shitloadofbooks
I had a similar issue in Chrome and a regular crash-to-desktop in the main
game I was playing at the time (Divinity 2) on Windows 10 with a GTX 970. I
also had major problems with YouTube.

My fix was to turn off all the Dynamic Clock and GPU Boost stuff in the NVIDIA
control panel.

------
Izkata
Hmm... This looks identical to something that occurred on Ubuntu 14.04 when I
came out of suspend, except it happened to all gtk apps. Switching to another
virtual terminal then back to 7 seemed to reinitialize something and fix it.

Even nowadays, on rare occasion on Ubuntu 18.04, something like this occurs
during startup of Android Studio. I don't recall seeing it on Firefox since at
least 16.04 though.

------
nonbirithm
I wonder if Mozilla has enough resources to test on Windows machines, or if
it's just bad luck with hardware configuration, or something else entirely. I
reported an unrelated bug in Nightly where I couldn't use smartcard
authentication and since it only reproed on Windows we had a back and forth
exchanging screenshots and logs until the root cause was narrowed down.

------
Rapzid
I had issues like this in May of last year in Chrome on Windows 10. Oddly it
happened most frequently on TravisCI's website, and I thought at first that it
was a TravisCI issue!

It turned out a Windows update ran overnight and replaced my AMD driver with
some old compatibility driver. After updating back to the most recent driver
the issue went away.

------
tomaszs
A classic race condition double bug. What happens here is that there is a
special case of the data, so a different routine is used to process it.

That causes a race condition, which results in incomplete data being provided
to the next step of the process.

There is surely logic to prevent this from happening. But the data initially
went through the bad routine, so no one expects the race condition to occur.

And at the end we have a routine that processes the data badly because the
data is bad and incomplete. This routine does not anticipate how, or when, it
receives the data.

As a result rendering breaks, but to the logic of the code everything seems
fine, so the error block is displayed. For GPU rendering it is a black box.

What is interesting is that if you dig deep enough to find all three issues
that cause it, each of them is fine on its own. It is what they do together
that causes the issue.

Very nasty thing to track down, and also to fix, because at a high level there
is little to do about it. A workaround is mostly to detect the situation and
process the data in a slightly different way to avoid the issue, or just to do
the processing twice in a rather hacky way.

------
pja
I have seen very similar bugs (the same text glitches for sure) under Linux
(Radeon RX580, open source drivers) in the past. I had put it down to open
source driver stack bugs, but maybe it was actually a Firefox bug?

I’ll try and take a screenshot next time it happens & report it.

------
j0e1
I have seen this in the past (I use FF-nightly on MacOS) but subsequent
updates seem to have fixed it. Although maybe some other factor is affecting
the frequency. Will keep an eye out.

------
electrotype
Why the Nightly version only, I'm curious?

~~~
cpeterso
Presumably there are other graphics bugs that have been fixed in Firefox
Nightly but still exist in the Beta or Stable Release channels. Best to debug
and reproduce the bug in the code closest to version the Mozilla engineers are
using.

------
MayeulC
> Update: we have created the channel #gfx-wr-glitch on Matrix

Funnily, this alias is incomplete and therefore not enough to join. A room ID
might have been enough, but just an alias without a server part is worthless,
a bit like someone left an e-mail address as gfx-wr-glitch@, without specifying
the rest.

I've successfully joined through #gfx-wr-glitch:mozilla.org, (roomid is
!qaAtjMWizEgMcVSbNH:mozilla.org, no alias on matrix.org) and told them so.
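
For illustration, a minimal Python sketch of the check a client could make.
`split_room_alias` is a hypothetical helper, and it is a simplified test that
an alias has the `#localpart:server` shape, not the full Matrix identifier
grammar.

```python
def split_room_alias(alias):
    """Return (localpart, server) for a full alias, or None if incomplete."""
    if not alias.startswith("#"):
        return None
    # Split at the first colon; the server part may itself contain a port.
    localpart, sep, server = alias[1:].partition(":")
    if not sep or not localpart or not server:
        return None
    return localpart, server


print(split_room_alias("#gfx-wr-glitch"))              # None
print(split_room_alias("#gfx-wr-glitch:mozilla.org"))  # ('gfx-wr-glitch', 'mozilla.org')
```

Without the server part there is nothing to tell a client which homeserver to
ask about the room.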

Edit: they fixed it. I wonder how often that happens with those new to Matrix?

~~~
Dylan16807
I don't think that's a "new to Matrix" thing.

They run their own server, just like they ran their own IRC server before, and
at first didn't really consider the need to make that explicit. I doubt there
was any misunderstanding of the tech, or that the channel names are per-
server.

------
irrational
> Without having a way to reliably reproduce this bug, we are at a loss on how
> to solve it.

I wish non-technical people understood this.

------
j1elo
I'm curious, why does a team working for Mozilla need to host their blog as a
free account on wordpress.com? They really don't have a unified blog system to
provide a more official feeling to all their different departments?

Or is this just an unofficial blog from people that is unrelated to Mozilla?

~~~
staktrace
In this respect (and probably others) Mozilla is more like a ragtag army of
rebels than a uniform-wearing platoon of soldiers. The blog is hosted on
WordPress because the person who started it found it most convenient, and it
never got migrated anywhere else. We also have hundreds of different github
organizations and repos (despite also hosting our own "official" mercurial
server), use lots of different communication tools, spread documentation over
multiple wikis and bug trackers, etc.

But yes, this is an "official" blog from Mozilla employees on the graphics
team, of which I too am a member.

~~~
j1elo
That's both strange, very confusing, and amazing at the same time :-)

It surprised me that Mozilla is able to simultaneously have things like the
MDN, basically the best resource for HTML and Web APIs, but then other
references or publications scattered around everywhere.

------
rvz
This reminds me of the comments section, or sometimes a mailing list, for a
particular bug in a driver or a low-level component, where most of the
replies were from 'users' who weren't helping to solve the actual problem in
the discussion. Then one day an unknown contributor not only was able to
reproduce the bug but also sent an actual fix and ended up gaining commit
access to the project after staying there for a while.

Whenever I hear the 'mythical legend of the 10X rockstar developer', I think
of them. Mozilla needs more of those developers to reproduce AND send patches
to interesting bugs like this. Not the whining copypasta forum-posting 'users'
calling themselves 'developers' who take many open-source projects like this
for granted and don't appreciate the time spent by the maintainers of the
project.

Nowadays they're usually doing this on a developer social network through
every maintainer's nightmare, 'The Issues Tab'.

~~~
paxys
Hah! As an open source maintainer for a mildly popular repo a big chunk of my
time is spent weeding out people filing issues and opening PRs for trivial
changes (like adding a line to the README file) just so they can claim to be a
"contributor". Meanwhile there are a ton of issues that are ready to code with
a full explanation and are even marked as such ("good first issue" or whatever
else) that stay untouched.

Regardless of community engagement, perceived interest, Github stars or
whatever else — finding a developer willing to spend actual time and effort to
make a project better is very, very rare.

~~~
strken
I've made a bunch of one line PRs that I'm worried are perceived like this,
but they're always a fix for something that cost me a lot of time. Examples
include adding quotes around CSS base64 URLs (our base64 string included //)
and changing the case of some headers in documentation (headers were downcased
upstream and the example in the doc failed silently). Are these the kind of
PRs you don't like?

~~~
paxys
Not at all. It isn't the size of the PR that matters, but I generally want
contributors to:

- Discuss their changes beforehand (usually on the issue page)

- Use patterns and practices that the rest of the codebase follows

- Listen to comments and feedback

- Write tests

While the work is "free", it should still be treated professionally.

~~~
bcrosby95
> - Discuss their changes beforehand (usually on the issue page)

Interesting. When I last looked at contributing to open source, some general
advice was to offer up code even if it's likely the wrong approach. The idea
being that maintainers are busy, and there's lots of people that drag on their
time without ever contributing anything. Showing you're actually willing to
put in time and effort will make the maintainer more likely to guide you in
the correct direction because you've already shown you're willing to spend
time on the issue.

~~~
aesyondu
I feel the same way. Although I have only been starting to contribute
recently, I thought that by offering code first via pull request I wouldn't be
perceived as a whiner in the issues section.

~~~
atq2119
This is definitely true. I'm not a fan of Issues discussion with random people
who are not known to be reliable contributors. Talk is cheap, show me the
code.

------
lucideer
Given this is the mozillagfx blog, one would hope the "gfx" displayed on the
blog were a little better: I can't make out the graphical glitches described
in those tiny low-resolution screenshots.

~~~
VWWHFSfQ
You don't see the glitches in this image?
[https://mozillagfx.files.wordpress.com/2020/02/screen-shot-g...](https://mozillagfx.files.wordpress.com/2020/02/screen-shot-glitch.jpg)

~~~
ilikehurdles
Is that not how reddit looks?

~~~
k1t
Interesting to note that even the address bar appears to be affected. Several
characters are missing from the URL in that screenshot.

~~~
capableweb
Yeah, when I had it happen to me, it happens to the entire Firefox instance,
settings page, url bar, tab names, context menu (right click menu) and
everything in between.

~~~
cpeterso
Much of Firefox's UI is HTML and can be affected by bugs like this. But it is
very interesting to see that the bug appears to affect both the app UI and the
web page at the same time.

------
dmead
is this a real request for help, or a marketing-look-at-our-blog thing?

~~~
Lammy
If you can suggest a better way to get word about a tricky bug out to people
who don't usually look at Mozilla's bug tracker I'm sure they'd be happy to
hear it :)

------
tvelichkov
This sounds very much like a similar issue Linux users have been facing for
more than a year, and it's still not resolved?
[https://askubuntu.com/questions/1088324/firefox-becoming-unr...](https://askubuntu.com/questions/1088324/firefox-becoming-unresponsive-after-suspending-on-kubuntu-18-10)

~~~
twic
I don't think so - in the Linux bug, Firefox becomes unresponsive, whereas in
the bug in the article, Firefox runs as normal, and the blanks can be cleared
by scrolling or mousing over.

I see a somewhat similar bug on my Linux box, where windows flicker between
black and normal, and the flickering stops if i focus on the window. I don't
think it's specific to Firefox, though - i believe i've seen all sorts of apps
do it. I do have an nvidia graphics card, FWIW.

~~~
tvelichkov
As someone who experiences this very often, I can't say that Firefox becomes
unresponsive: you can still scroll up/down, change tabs, etc., but all the
pages are glitched similarly to (or even worse than) the screenshots.

~~~
jbonisteel
Can you file a bug?
[https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&comp...](https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Graphics%3A+WebRender)

------
thefurman
You should ask the people having the issue to switch graphics cards, PSU and
RAM since this >90% smells like a memory/power issue. I think you've just
gotten unlucky and collected a few similar reports of people having memory
corruptions. I am guessing you cache renderings and that is why the issue
sticks around until you re-render.

If you want to reproduce it yourselves then perhaps try pointing a hairdryer
from a distance at the various components until they start to create trouble,
or alternatively just overclock them towards the breaking points.

~~~
thefurman
Why the downvotes? Anything contradicting this which you might want to share
in text instead?

