And if anyone else was interested, the developer of Falcon Pro has already started implementing the fixes. To quote him from Twitter (@joenrv): "Reading @romainguy analysis of my app makes me feel a bit like a naked dude in a room full of people staring xD #letsgettowork"
"#FalconPro v1.0.2 uploaded to @GooglePlay. Get ready for some extra butter thanks to @romainguy ! Should hit your devices in an hour or so"
Overdraw primarily eats up memory bandwidth. Bandwidth isn't the only resource you have to worry about on a GPU (though on a mobile, it's certainly important). Equally important can be the time spent running pixel and/or vertex shaders when actually drawing onscreen elements - it's quite easy for a poorly written pixel shader to add multiple milliseconds to the time taken to render one fullscreen image on an embedded device.
Unfortunately none of the steps he takes in this article, until the OpenGL ES Trace near the end, appear to give you any of the information you'd need to figure out whether overdraw is actually your problem. Maybe it's a safe bet that for most Android apps, overdraw is the issue, because they're using Skia and thus don't have access to custom shaders?
On the other hand, that Hierarchy Viewer feature in the debug tools looks really great. I wish more SDKs offered features that nice.
Most UI elements are drawn with a quad and trivial shaders (a couple of multiplications in the vertex shader, a texture lookup and modulate in the fragment shader.)
In my years of working on Android I have learned that poor framerate is most of the time (not always) a combination of blocking the UI thread for too long and drawing too much.
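The first half of that advice can be sketched in plain Java (no Android classes; on Android the same role is played by things like AsyncTask or a HandlerThread, and the names below are illustrative): push the expensive work to a background executor so the "UI" thread stays free to draw.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class OffloadSketch {
    public static void main(String[] args) throws Exception {
        // One background worker for expensive jobs; the thread calling
        // main() stands in for the UI thread and never blocks on the work.
        ExecutorService background = Executors.newSingleThreadExecutor();
        AtomicLong result = new AtomicLong();

        background.submit(() -> {
            // Pretend this is an expensive operation (image decode, I/O...).
            long sum = 0;
            for (int i = 1; i <= 1_000_000; i++) sum += i;
            result.set(sum);
        });

        // The "UI" thread remains free to render frames while that runs;
        // here we just wait at the end so the demo can print the result.
        background.shutdown();
        background.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("result=" + result.get());
    }
}
```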
I should have made this clearer, but a lot of overdraw above 2x will likely be one of the causes of performance issues. Not all devices will behave the same of course, but it's a reasonable average (devices with more bandwidth tend to have higher screen resolutions.)
What's important to remember however is that overdraw often indicates other problems. Typically it means the application uses more views than it needs. This impacts performance in other ways: higher memory consumption, a larger tree that takes longer to manage and traverse, longer startup times, more work for the renderer (sorting display lists, managing OpenGL state, etc.)
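The "larger tree" point can be illustrated with a toy sketch (plain Java; the `Node` type is made up for the example and is not an Android class): per-frame traversal work grows with the number of views, so redundant wrapper layouts are pure overhead even before any drawing happens.

```java
import java.util.ArrayList;
import java.util.List;

public class ViewTreeSketch {
    // Stand-in for a View: each node just counts a visit when "measured".
    static class Node {
        final List<Node> children = new ArrayList<>();
        int measure() {                 // returns nodes visited in this subtree
            int visited = 1;
            for (Node c : children) visited += c.measure();
            return visited;
        }
    }

    // Build a chain of `depth` wrapper nodes around `leaves` leaf nodes,
    // mimicking nested layouts that each contribute only one child.
    static Node nested(int depth, int leaves) {
        Node root = new Node();
        Node cur = root;
        for (int i = 0; i < depth; i++) {
            Node wrapper = new Node();
            cur.children.add(wrapper);
            cur = wrapper;
        }
        for (int i = 0; i < leaves; i++) cur.children.add(new Node());
        return root;
    }

    public static void main(String[] args) {
        // Same 10 leaf views, but one hierarchy carries 50 redundant wrappers:
        int flat = nested(0, 10).measure();
        int deep = nested(50, 10).measure();
        System.out.println("flat=" + flat + " deep=" + deep);
    }
}
```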
In general, you're right: you can "shade too hard" (consuming too many GPU cycles) and/or "shade too much" (consuming too much memory bandwidth). In my experience on mobile, "shading too much" is more common and easier to do (especially in the era of 2560x1600 displays...). Maybe because the framebuffer and textures are in system memory and not GDDR5/whatever like on a desktop.
In the case of visible views that do use blending, overdraw multiplies the time spent running shader computation right alongside multiplying the full memory bandwidth consumption of the shader (much more than just checking the depth/stencil buffer). It's true that it's totally possible to write slow shaders that chug with only 1x overdraw. But at 3x overdraw it will be 3x as bad, because you are running the whole function 3x per pixel.
The stencil is not used at the moment (well... that's actually how overdraw debugging is implemented) because the hardware renderer only supports rectangular clipping regions and thus relies on the scissor instead. Given how the original 2D API was designed, using the stencil buffer for clipping could eat up quite a bit of bandwidth or require a rather complex implementation.
It is planned to start using the stencil buffer to support non-rectangular clipping regions but this will have a cost.
Remember that the GPU rendering pipeline was written for an API that was never designed to run on the GPU and some obvious optimizations applied to traditional rendering engines do not necessarily apply.
This means that, at least on traditional forward rendering GPUs (Nvidia, Adreno), overdraw is full cost even for pixels covered by opaque views. Do the PowerVR chips still get effectively-zero opaque overdraw from their tile-based-deferred-rendering approach?
Meanwhile, I'm not totally clear how hidden surface removal works on Mali chips. They use TBDR, but still recommend drawing front-to-back to avoid overdraw http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....
(The WinRT SDK goes a step further, there simply _aren't_ any blocking network/FS methods available. async/await makes that less painful than it would seem, though.)
* Optimized the app following @romainguy recommendations. Report back if you feel the butter :)
Have they advanced on the NDK/C++ front as well? I've never been interested in the Java aspect of Android and working with the NDK was brutal compared to iOS.
That being said, I do have one gripe. There are some chasms between the different tools. It is a bit painful to operate all the different performance tools and deal with switching contexts for the same problem. Systrace -> traceview and back and forth.
Also, it would be nice if traceview had a text based api/interface. I know that the graph visualization must be valuable for something, but I spend the majority of my time looking for particular methods and signs of excessive consumption/trouble. Now that I think of it, this sounds like a fun weekend project :)
Very interesting post though. :)
Getting 100 000 users for paid apps is very rare; Falcon Pro has just between 1000 and 5000 users so far, for example. It was just recently released, though. If it starts to become popular "too fast" the price can just be raised.
Anandtech's Nexus 7 review says it has 5.3 GB/s of memory bandwidth. At 60 fps that would leave 88 MB per frame. The 1280x800 screen has 3 megabytes worth of pixels at 24 bpp, which is 1/29 of the 88 MB per frame. So how come the overdraw-related slowdowns started appearing with just 4x overdraw?
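The arithmetic behind the question checks out (taking GB as 10^9 bytes, which matches the 88 MB figure):

```java
public class BandwidthBudget {
    public static void main(String[] args) {
        double bandwidthBytesPerSec = 5.3e9;          // 5.3 GB/s (Nexus 7, per Anandtech)
        double perFrame = bandwidthBytesPerSec / 60;  // bandwidth budget at 60 fps
        long frameBufferBytes = 1280L * 800 * 3;      // 1280x800 at 24 bpp

        System.out.printf("per-frame budget: %.1f MB%n", perFrame / 1e6);
        System.out.printf("framebuffer: %.2f MB%n", frameBufferBytes / 1e6);
        System.out.printf("budget / framebuffer: %.0f%n", perFrame / frameBufferBytes);
    }
}
```

One caveat, though: the 1/29 figure counts only a single opaque write per pixel. Real rendering also reads textures, and blended pixels read the framebuffer back before writing, so each "x" of overdraw can cost considerably more than one framebuffer's worth of bandwidth.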
Would be nice if that was mentioned in the article (I spent a while trying to find it...), it's a really nice new tool they've put in.
To Romain Guy, please keep it up. Thanks!