
The case for memory-mapped GPU assets - ingve
https://www.facebook.com/permalink.php?story_fbid=1799575323610310&id=100006735798590
======
exDM69
I did something along the lines of his suggestion using OpenGL sparse textures
(D3D would call them tiled resources) and persistently/coherently mapped
buffers, with disk I/O done through memory-mapped files. It's a rather crude
proof of concept for on-demand loading of large textures (I used a 16k x 8k
satellite image). I didn't properly detect "page faults", but I had some of
the mechanisms implemented (the shader outputs yellow pixels where a fault
occurs instead of handling it).

To make it work fully end-to-end, it would look something like this (a rough C
sketch of the CPU side follows the list):

    
    
        1. Shader samples from a sparse texture, and detects that the requested page is non-resident.
        1b. Fall back to lower mip map level.
        2. Shader uses atomic-or to write to a "page fault" bitmap (one bit per page)
        3. The bitmap is transferred to the CPU
        4. For each set bit, start an async copy from disk to a DMA buffer (i.e. a pixel buffer object in GL)
        5. When disk i/o is complete, start texture upload from buffer to a "page pool" texture
        6. When texture upload is complete, re-map the texture page from "page pool" to the actual texture
    

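A rough C sketch of the CPU side of this flow (steps 3 through 6), assuming GL
4.5 with the ARB_sparse_texture entry points already loaded. It collapses the
"page pool" of steps 5-6 into a direct commit-and-upload, does the disk read
synchronously for brevity, and the handles, constants and load_page_from_disk()
are placeholders rather than anything from the demo:

    /* Rough CPU-side sketch of steps 3-6, assuming GL 4.5 plus ARB_sparse_texture.
     * A real implementation queries the page size via GL_VIRTUAL_PAGE_SIZE_*_ARB
     * and does the disk read asynchronously. */
    #include <string.h>
    #include "gl_loader.h"  /* stand-in for whatever loader provides the GL prototypes */

    #define PAGE_W   128
    #define PAGE_H   128
    #define PAGES_X  (16384 / PAGE_W)          /* the 16k x 8k satellite image */
    #define PAGES_Y  (8192  / PAGE_H)
    #define BITMAP_WORDS (PAGES_X * PAGES_Y / 32)

    extern GLuint sparse_tex;   /* allocated with GL_TEXTURE_SPARSE_ARB = GL_TRUE */
    extern GLuint fault_ssbo;   /* bitmap the shader fills with atomicOr (step 2) */
    extern GLuint pbo;          /* GL_PIXEL_UNPACK_BUFFER, persistently/coherently mapped */
    extern void  *pbo_ptr;      /* pointer from glMapBufferRange(..., GL_MAP_COHERENT_BIT) */
    extern void   load_page_from_disk(int px, int py, void *dst);  /* memcpy from the mmap'd file */

    static void service_page_faults(void)
    {
        GLuint bits[BITMAP_WORDS];

        /* 3. bring the page-fault bitmap back to the CPU */
        glGetNamedBufferSubData(fault_ssbo, 0, sizeof bits, bits);

        glBindTexture(GL_TEXTURE_2D, sparse_tex);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);

        for (int py = 0; py < PAGES_Y; py++)
        for (int px = 0; px < PAGES_X; px++) {
            int i = py * PAGES_X + px;
            if (!(bits[i / 32] & (1u << (i % 32))))
                continue;

            /* 4. copy the page's texels into the DMA buffer (synchronous here) */
            load_page_from_disk(px, py, pbo_ptr);

            /* steps 5-6 collapsed: commit physical backing for the page,
             * then upload it from the pixel buffer object */
            glTexPageCommitmentARB(GL_TEXTURE_2D, 0, px * PAGE_W, py * PAGE_H, 0,
                                   PAGE_W, PAGE_H, 1, GL_TRUE);
            glTexSubImage2D(GL_TEXTURE_2D, 0, px * PAGE_W, py * PAGE_H,
                            PAGE_W, PAGE_H, GL_RGBA, GL_UNSIGNED_BYTE, (const void *)0);
        }

        /* clear the bitmap for the next frame */
        memset(bits, 0, sizeof bits);
        glNamedBufferSubData(fault_ssbo, 0, sizeof bits, bits);
    }
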
Now this approach works alright, but there are a number of issues that make it
impractical for the time being. Off the top of my head:

    
    
        1. Sparse textures are only supported on Nvidia and AMD hardware. Not Intel, ARM or IMG.
        2. Requires Vulkan or D3D12 for step #6 (the demo doesn't do this, so there may be pipeline stalls)
        3. One or two frames of latency, which could only be avoided if this were done in the kernel-mode driver.
        4. Poor fit for existing KMD architecture (which has its own concept of residency)
        5. Detecting page faults is easy. Detecting which pages can be dropped is hard.
    

Here's the source code of my demo. It's not pretty because it was a one-off
demo project for very specific hardware (Android + OpenGL 4.5, which means
Nvidia Shield hardware with Maxwell GPUs). The technique is portable, though.

[https://github.com/rikusalminen/sparsedemo/blob/master/jni/g...](https://github.com/rikusalminen/sparsedemo/blob/master/jni/gfx.c)

(In the code above, all the interesting bits are the functions named xfer_*)

Based on the experience from writing this demo, I have to agree with Carmack
here. File-backed textures would make a lot of sense for a lot of use cases.

~~~
yassim
Thanks for the example code.

------
Retr0spectrum

        Splash screens and loading bars vanish. Everything is just THERE.
    

I'm not sure I agree with this. It might be more convenient to have a
filesystem-like interface, but at the end of the day everything still has to
be loaded into the (rather limited) GPU memory at some point.

Most CPU applications can handle RAM swapping from disk, but I really doubt
that big games could maintain 60fps if even a few assets needed re-loading.

If a frame is 16ms, and the best consumer SSDs are around 2GB/s, you can only
load about 33MB of assets in a single frame in a best-case scenario.

~~~
taspeotis
> You can only load 33MB of assets in a one frame in a best-case scenario.

Back the GPU mapped resources with RAM. I have 16GB of it. That's plenty for
one level's worth of assets.

GPU <=> RAM <=> SSD.

~~~
johncolanduoni
I suspect we'll see more of this as 16+GB of RAM becomes more standard, at
least for gamers. In the past it just wasn't a case that was worth optimizing
for.

~~~
imtringued
Why does RAM even matter? Warframe merely needs 700MB and yet it still looks
great.

~~~
douche
I would really like it if some games would take advantage of the gigs of RAM I
have just sitting around unused. Why are you thrashing disk loading assets all
the time, when there's 12 GB of pristine, untouched RAM available?

Obviously, there's the x86/x64 divide there - hopefully nobody is buying new
32-bit systems anymore, and that limitation can go away - although I'd rather
get an x64 version of Visual Studio, but apparently that's not a good idea,
because _reasons_ [1]

[1]
[https://blogs.msdn.microsoft.com/ricom/2015/12/29/revisiting...](https://blogs.msdn.microsoft.com/ricom/2015/12/29/revisiting-64-bit-ness-in-visual-studio-and-elsewhere/)

~~~
dkersten
Similarly, I wish games would be smarter about level loading.

One game I recently played, a racing game, when you select "restart race" it
reloads the entire level (3d models, textures etc) when really all it had to
do was reset a handful of variables (car positions, time, a few other things
like that).

If it's not possible to reset these variables, then take a snapshot!

It's especially annoying when a game, e.g., autosaves before a boss battle and
then when you die, instead of just resetting some stats and inventory, you
have to reload everything... They already know there's a good chance you will
die (that's why they autosave!). If it's too much effort to reload just the
bits that need it, then take a snapshot of non-graphics memory at the autosave
point and just reload that.

There are too many games out now where you can die very quickly (within
seconds) if you're not that good yet, and then have to sit through a
multi-minute loading phase over and over... very frustrating.

(obviously this applies only to "reloading", not terminating the game and
loading)

~~~
kazagistar
Games are one-off code written in memory-unsafe languages. They are usually a
pile of hacks and bugs. Resetting from known state is a nice, foolproof way to
recover from problems and to reduce the likelihood of reaching edge cases.

I mean, yeah, when a game does allow bypassing load screens, it is a pretty
amazing bonus to gameplay: it's the key to success in games like Super Meat
Boy, where you constantly maintain flow through failure and difficulty. But
it's hard.

~~~
dkersten
Known state doesn't have to include all of the static assets (art, music,
level data).

------
Retr0spectrum
Mirror for those who can't/won't use Facebook:

I have been advocating this for many years, but the case gets stronger all the
time. Once more unto the breach.

GPUs should be able to have buffer and texture resources directly backed by
memory mapped files. Everyone has functional faulting in the GPUs now, right?
We just need extensions and OS work.

On startup, applications would read-only mmap their entire asset file and
issue a bunch of glBufferMappedDataEXT() / glTexMappedImage2DEXT() or Vulkan
equivalent extension calls. Ten seconds of resource loading and creation
becomes ten milliseconds.

Splash screens and loading bars vanish. Everything is just THERE.

You could switch through a dozen rich media applications with a gig of
resources each, and come back to the first one without finding that it had
been terminated to clear space for the others – read only memory mapped files
are easy for the OS to purge and reload without input from the applications.
This is Metaverse plumbing.

Not that many people give a damn, but asset loading code is a scary attack
surface from a security standpoint, and resource management has always been a
rich source of bugs.

It will save power. Hopefully these are the magic words. Lots of data gets
loaded and never used, and many applications get terminated unnecessarily to
clear up GPU memory, forcing them to be reloaded from scratch.

There are many schemes for avoiding the hard stop of a page fault by using a
lower detail version of a texture and so on, but it always gets complicated
and requires shader changes. I’m suggesting a complete hard stop and wait. GPU
designers usually throw up their hands at this point and stop considering it,
but this is a big system level win, even if it winds up making some frames run
slower on the GPU.

You can actually handle quite a few page faults to an SSD while still holding
60 fps, and you could still manually pre-touch media to guarantee residence,
but I suspect it largely won’t be necessary. There might also be little tweaks
to be done, like boosting the GPU clock frequency for the remainder of the
frame after a page fault, or maybe even the following frame for non-VR
applications that triple buffer.

I imagine an initial implementation of GPU faulting to SSD would be an ugly
multi-process communication mess with lots of inefficiency, but the lower
limits set by the hardware are pretty exciting, and some storage technologies
are evolving in directions that can have extremely low block read latencies.

Unity and Unreal could take advantage of this almost completely under the
hood, making it a broadly usable feature. Asset metadata would be out of line,
so the mapped data could be loaded conventionally if necessary on unsupported
hardware.

A common objection is that there are lots of different tiling / swizzling
layouts for uncompressed texture formats, but this could be restricted to just
ASTC textures if necessary. I’m a little hesitant to suggest it, but drivers
could also reformat texture data after a page fault to optimize a layout, as
long as it can be done at something close to the read speed. Specifying a
generously large texture tile size / page fault size would give a lot of
freedom. Mip map layout is certainly an issue, but we can work it out.

There may be scheduling challenges for high priority tasks like Async Time
Warp if a single unit of work can create dozens of page faults. It might be
necessary to abort and later re-run a tile / bin that has suffered many page
faults if a high priority job needs to run Right Now.

Come on, let's make this happen! Who is going to be the leader? I would love it
to happen in the Samsung/Qualcomm Android space so Gear VR could immediately
benefit, but it would probably be easiest for Apple to do it, and I would be
just fine with that if everyone else chased them in a panic.
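
To make the proposed startup path concrete, here is a rough C sketch built
around the hypothetical glTexMappedImage2DEXT() call named above; no such
extension exists today, and the asset_header/asset_entry file layout is
invented purely for illustration:

    /* Sketch of the startup path described above.  glTexMappedImage2DEXT is
     * the hypothetical extension named in the post; it does not exist in any
     * driver today.  The asset file layout is invented for illustration. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    /* plus a GL header/loader for glBindTexture and the GL enums */

    struct asset_entry {                /* invented on-disk record, one per texture */
        uint32_t gl_name, internal_format, format, type;
        uint32_t width, height;
        uint64_t offset;                /* byte offset of the texel data in the file */
    };
    struct asset_header { uint32_t count; struct asset_entry entries[]; };

    void map_assets(const char *path)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* read-only map of the whole asset file; the OS is free to drop and
         * re-fault these pages, which is the whole point */
        const uint8_t *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        const struct asset_header *hdr = (const struct asset_header *)base;

        for (uint32_t i = 0; i < hdr->count; i++) {
            const struct asset_entry *e = &hdr->entries[i];
            glBindTexture(GL_TEXTURE_2D, e->gl_name);
            /* hypothetical call: the driver records the mapping instead of
             * copying texel data into GPU memory */
            glTexMappedImage2DEXT(GL_TEXTURE_2D, 0, e->internal_format,
                                  e->width, e->height, 0, e->format, e->type,
                                  base + e->offset);
        }
        close(fd);
    }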

~~~
nowprovision
Thanks. I hope this is a one-off; if faceache becomes the article platform for
techie material I may just change industries.

~~~
unixhero
"This is a Facebook +Premium article. In order to access this content, you
must sign in with a Facebook +Premium account[?]. [?] Facebook +Premium
accounts are Facebook accounts where you have also confirmed your identify
with your national passport [level1] and have your yearly retina scan access
enabled and updated [level2]. As an option you may wish to access the Facebook
extra features and free access to all services by enabling [level3] on your
account by enabling location services and allowing us to store your GPS
position throughout your day. A level 3 account will have access to location
based and time based offers, content and Facebook friendship features that
will simplify and improve your life. Level 4 access is not yet ready for us to
offer you, but it will involve a small chip which you implant into your arm.
This will let you get the full and unfettered Facebook experience without the
need for any cellphone or other device!"

~~~
DavidSJ
The sad part is, it took me until your fourth sentence to realize it was a
parody, not an actual TOS quote.

~~~
philh
Often, "I couldn't tell that this was a parody" says more about you than about
the thing being parodied.

~~~
ovt
I haven't observed that myself, but I suppose we all have different
experiences in different environments.

------
angch
I'm not quite sure mmap is such a good idea if you're trying to have more low-
level control over performance. It's weird that Carmack is advocating this,
because you can't really guarantee the latency of grabbing any resource if you
incur a fault and need to grab it from disk.

See also the comments from
[https://news.ycombinator.com/item?id=8704911](https://news.ycombinator.com/item?id=8704911)

~~~
HelloNurse
He notes that reasonable hardware should have the performance margin to load a
reasonable number of pages from an SSD without dropping a frame, which seems
like a very good plan. Looking forward to actual tests, of course.

Considering that prefetching schemes allow the programmer to spread asset
loading evenly over many frames, and cheap rendering approximations can be
used in troublesome frames, there should also be enough low-level control.
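
A minimal sketch of that kind of per-frame budgeting; the 2 GB/s SSD figure,
the queue functions and the request struct are assumptions for illustration,
not anything from the article:

    /* Minimal per-frame streaming budget.  The 2 GB/s SSD figure, the queue
     * API and the request struct are assumptions for illustration only. */
    #include <stddef.h>

    struct stream_request { size_t bytes; /* ...source offset, destination, etc. */ };

    /* spend at most half of a 16 ms frame on streaming:
     * 2 GB/s * 8 ms = roughly 16 MB per frame */
    #define FRAME_STREAM_BUDGET (16u * 1024u * 1024u)

    extern int  stream_queue_pop(struct stream_request *out);   /* placeholder queue */
    extern void start_async_read(const struct stream_request *req);

    void stream_this_frame(void)
    {
        size_t spent = 0;
        struct stream_request req;

        while (spent < FRAME_STREAM_BUDGET && stream_queue_pop(&req)) {
            start_async_read(&req);      /* completion is handled elsewhere */
            spent += req.bytes;
        }
        /* anything still queued waits for the next frame; the renderer keeps
         * drawing a lower mip or a placeholder in the meantime */
    }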

~~~
lazyjones
> _reasonable hardware should have the performance margin to load a reasonable
> number of pages from an SSD without dropping a frame_

My disks are usually encrypted though and sometimes I can choose faster or
slower encryption methods (thus affecting throughput when loading). I don't
see how this can work reliably without forcing the user to reserve specific
disk areas just for GPU assets.

~~~
wtallis
Why aren't you using a self-encrypting SSD?

------
aphextron
Am I alone in finding this an odd thing to be reading on Facebook?

~~~
tomlong
Facebook is the new .plan!

Seriously though, the company he works for is owned by Facebook. This may be a
factor.

------
wiredfool
"""... it would probably be easiest for Apple to do it, and I would be just
fine with that if everyone else chased them in a panic. """

~~~
pawadu
It seems to me that when Apple does something it quickly becomes "accepted" by
consumers (even if it is a technical thing and not consumer-facing). This is
not always a bad thing for competitors.

~~~
wiredfool
I think it's a combination of Apple owning enough of the stack to make it
happen, and occasionally Apple's secrecy catching the rest of the industry
flat-footed (see the 64-bit ARM transition).

~~~
pawadu
Funny you mentioned ARM! I have some inside stories to tell about that!

At the time it happened, many had already looked into the architecture and
realized that to them there were no real benefits: 32-bit ARM could already
address 1TB of memory, you could get the accelerated crypto instructions with
an architecture extension and 64-bit ARM implementations were not very power
efficient (a problem just recently solved by the latest A7x devices).

But when Apple switched to 64-bit ARM everyone else just had to follow along.
This resulted in weaker Android phones for about a year. The funny thing is,
the reason Apple switched so early is that they needed a head start since they
use native apps. They didn't really need 64-bit ARM at that time yet.

~~~
gehsty
I think one of the main advantages was to get onto the latest (or newer..) ARM
instruction set, which I recall did increase performance vs. 32-bit
chips/instructions.

I'm definitely no expert though; any inside knowledge on this?

~~~
pawadu
Yes, but you also lost a number of useful instructions (such as LDM). Also,
many 3rd party implementations already incorporated at least a good subset of
those accelerated instructions. So you could (and people did) create synthetic
benchmarks where one architecture was 200% faster than the other.

Now, I am not saying that 64-bit was just fluff. The new design has a much
nicer pipeline (especially thanks to the instructions they removed) which is
MUCH better suited for things like speculative execution. But the
implementation to make use of this wasn't really there until very recently.
Here is a fun fact for you: the 64-bit Cortex-A53 and the 32-bit Cortex-A7 are
80% the same CPU. What does that tell you about the first generation of 64-bit
devices?

------
jmount
It does sound like memory mapped assets would be a great feature. One thing to
read (not really an objection, just a commentary that remains relevant) is "On
the design of display processors" Myer, Sutherland; Communications of the ACM,
1968 (also called the wheel of reincarnation
[http://cva.stanford.edu/classes/cs99s/papers/myer-sutherland...](http://cva.stanford.edu/classes/cs99s/papers/myer-sutherland-design-of-display-processors.pdf) ).

------
jokoon
I recently upgraded from an Athlon II with 1.5MB of L2 cache to a Core i5 with
6MB of L3, and surprisingly, game loading is still just as slow. I guess that
copying asset files into RAM doesn't result in a speed-up?

So if I understand the problem right, it's because copying data to the GPU goes
over the PCI Express bus, and is done "piece by piece" instead of in larger
batches? A little like grouping draw calls? It's funny how that problem can be
seen everywhere in hardware, where multiplying queries makes latencies
snowball.

~~~
monocasa
I think it has more to do with the fact that the GPU's memory accesses aren't
cache coherent with the CPU's, so a larger L2 doesn't really bring much to the
table.

~~~
amscanne
I think you want a different word here.

Generally DMA to/from the GPU is cache coherent (either via bus snooping for
cache invalidation, or via software managing regions for DMA, e.g. marking
relevant PTEs as nocache).

So accesses are _coherent_, but the cache is simply irrelevant (or even more
costly, if it's using snooping).

------
Aissen
Funny, this is only necessary because SSDs are becoming so common and are
crazy fast (at both latency and throughput). But it's true, we do need proper
mmap-to-GPU. This is going to be challenging (and fun).

------
Const-me
A comment by wewbull on reddit:

Been there, done that, doesn't work. You start a level and every corner you
take invokes the hard drive. Chug city. New enemy, new textures, chug.

Horrible.

Source: worked at a GPU company. Saw the experiments.

------
deepnet
The GPU => CPU memory bus is a major bottleneck for NVIDIA's growing adoption
driven by deep neural nets.

GPUs churn through data once it is across the bus.

A hierarchy of GPUs, outputs wired to inputs, mirroring the hierarchy of deep
nets, would be useful for real-time robots & cars, NVIDIA's other big market.

~~~
creshal
> hierarchy of GPUs, outputs wired to inputs

Nvidia introduced a new, faster SLI bridge for the new 1000 series; aren't they
used in GPGPU setups?

~~~
jensnockert
The SLI bridge is quite slow though; even the updated SLI bridge is just
2GB/s.

~~~
creshal
Ouch, I'd have expected more.

------
throw7
Totally aside and ranty: facebook has started (recently?) greeting/forcing
outsiders with a login dialog that covers the whole page. You can click "not
now", but that just permanently lowers the login dialog box to about the
bottom 1/3 of the page.

If you want to publicly make your content available in a free and seamless way,
stop using facebook. Please.

~~~
cageface
Your rant may be justified but I'm getting really tired of clicking on HN
threads I'm interested in and finding the top comment plus a screenful or two
of replies addressing some nit somebody wants to pick with the implementation
of the site in the article instead of addressing the content of the article.

Leave the asides aside please.

~~~
wmccullough
I'm at that point too. I know if I click a link, I'm not going to get to read
intelligent discussion until much deeper in the page. Take, for example, the
article about the LinkedIn acquisition: the top comment has people arguing over
the origin of the word "secular"...

~~~
kibwen
How about we solve this problem once and for all by turning this thread into a
two-page series of complaints about how HN doesn't allow you to collapse
comment threads. :P

------
evc
What is this all about?

~~~
k__
Having all resources that an application needs in the GPU memory, I think.

~~~
dspillett
Not quite.

Having everything the app needs _available_ to the GPU at all times, without
having to explicitly load it from disk to graphics memory beforehand in
userland code (and cycle it out as needed); having it happen automatically so
less code needs to be written & debugged in the app/framework, and having it
happen in the kernel so it is potentially more efficient (allowing for more
complex scenes and/or more detail and/or faster frames, on the same hardware).

------
Const-me
Don't like the idea. Games that stream assets don't drop frames. Before the IO
is complete, they display lower-quality placeholders. Even a universal
placeholder like an amorphous black shadow is better than a game that stops
rendering and waits for IO.

