Hacker News
Emulating double precision on the GPU to render large worlds (godotengine.org)
173 points by coppolaemilio on Oct 17, 2022 | 63 comments



The 2xFP32 solution is also dramatically faster than FP64 on nearly all GPUs.

While most GPUs support FP64, unless you pay for the really high-end scientific computing models, you're typically getting 1/32nd rate compared to FP32 performance. Even your shiny new RTX 4090 runs FP64 at 1/64th rate.

2xFP32 for most basic operations can be 1/4th the rate of FP32. It is quite often the superior solution compared to using the FP64 support provided in GPU languages.
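For the curious, here's a minimal sketch of how 2xFP32 ("double-single") arithmetic works, built on Knuth's TwoSum error-free transformation. The struct and function names are just illustrative, and it assumes the compiler really evaluates float expressions in FP32 (no excess precision):

```c
#include <assert.h>

/* A "double-single": hi + lo together represent one higher-precision
   value, with |lo| far below the last bit of hi. All math stays in FP32. */
typedef struct { float hi, lo; } ds;

/* Knuth's TwoSum: returns s = fl(a + b) plus the exact rounding error,
   so hi + lo == a + b exactly. */
static ds two_sum(float a, float b) {
    float s  = a + b;
    float bv = s - a;                      /* the part of b that made it into s */
    float e  = (a - (s - bv)) + (b - bv);  /* what rounding threw away */
    return (ds){ s, e };
}

/* Add two double-singles (a simplified Dekker/Knuth-style sum). */
static ds ds_add(ds a, ds b) {
    ds s     = two_sum(a.hi, b.hi);
    float lo = s.lo + a.lo + b.lo;
    float hi = s.hi + lo;                  /* renormalize (Fast2Sum) */
    return (ds){ hi, lo - (hi - s.hi) };
}
```

Adding 1e-9 to 1.0 in plain FP32 just returns 1.0, while the double-single keeps the small term in `lo`. Each `ds_add` costs a handful of FP32 instructions, which is where the roughly 1/4-rate figure comes from.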


>While most GPUs support FP64, unless you pay for the really high-end scientific computing models, you're typically getting 1/32nd rate compared to FP32 performance.

I wonder if there is a hardware reason for this, or if it's just market segmentation by Nvidia.


Mostly market segmentation. There is a software lock that caps FP64 throughput at a certain ratio of FP32 performance, and the ratio varies by card. Most consumer NVIDIA cards are locked to 1/24th of FP32 speed, to prevent use in professional settings that require FP64 performance. However, some cards, such as the Radeon VII, are only locked to 1/4 of FP32 speed (much faster).


My naive guess is that most floating point code uses FP32 and FP64 uses at least double the die size. So optimize for FP32 and have some FP64 for the rare equations that need it.


These compute units are usually sliced: they can perform either four FP32 multiplies or one FP64 multiply on the same part of the die. This trick goes back at least as far as PA-RISC; from what I remember, it was HP who introduced the sliced ALU, capable of doing one large or several smaller operations on the same hardware.

I could be wrong about who did it first, but most FPUs are built like that now.


On GPUs, they haven't been sliced like this for quite a long time, in order to save die area.


The slicing was introduced to save die area. Not slicing trades greater die area for slightly lower computation delay.


Since to-scale solar systems are mentioned in the article, it may be worth briefly mentioning Outer Wilds, a wonderful game built in Unity that comes with its own solar system. Things are quite a bit smaller than in the real world, but I suppose everything is still large enough for floating point precision to be a potential issue. The developers solved this by making the player the origin instead: everything else is constantly shifted around to accommodate the fact that the player is at the center. This works perfectly in normal gameplay, and is only noticeable when flying a great distance away from the game's solar system (nothing's stopping you), at which point you will see the planets and other celestial bodies jiggling around on the map.


Outer Wilds can also suffer from issues with simulation precision if you run the solar-system simulation long enough. The developers have talked about this a bit, and people have observed the effects with mods that encourage long-term exploration. This isn't actually an issue in practical gameplay, though.

rot13 to avoid spoilers for people who haven't played the game: Gur fha tbrf abin va gjragl-gjb zvahgrf, fb guvf vfa'g npghnyyl na vffhr va cenpgvpr.


I beat the game, but I believe you can experience precision issues if you just zoom the map out far enough and/or fly far enough away from the sun. That makes sense once you realize the map is really just an alternate camera showing the entire scene: it doesn't matter where the origin is if you can see all of it at once.

EVE Online had (still has?) a similar issue with its camera being able to zoom in on objects that are very far away. Normally, at those distances, you'd be using your overview or the HUD markers, but if you did zoom in on a far object, the origin would still be on your ship (or maybe the center of the area you were in), and the object would get distorted. Especially fun when it was a floating corpse.


>The developers have solved this by making the player the origin instead

Perhaps such a brilliant idea came to them in a dream. But maybe they forgot how they did it in another dream.


This is actually a pretty common practice and has been used for a long time. Here is a great explanation by developers of Kerbal Space Program, who built a solar system sized physics sandbox using the same method (and a few other tricks) in Unity: https://youtube.com/watch?v=mXTxQko-JH0


Good luck doing networked multiplayer for that setup though


Neither the submission article nor Outer Wilds is about multiplayer, and not every game needs multiplayer. So it seems fine for many cases, not just the one you're thinking of, I guess.


I don’t feel that needed to be stated personally. That’s true of pretty much anything.


It's actually not a big problem in many cases. If you need to physically interact with something that's many miles away from you, then yes, it could be an issue.


This is similar to the solution we used in Vega Strike https://www.vega-strike.org/ detailed here https://graphics.stanford.edu/~danielh//classes/VegaStrikeBl...


How old is Vega Strike! The Portability page is talking about TNT2 and 3DFX Voodoo!


I wonder if there isn't another solution here. It seems like the issue is due to large translations? Presumably your view frustum is small enough that single precision floats are sufficient for the entire range, so couldn't you just subtract some offset when calculating the translation matrix for both your view and the model translation? I suppose this may result in instances where you need to recalculate the translation matrix for some visible meshes, but that seems less complicated than trying to increase the on-GPU precision.


That's a common approach to deal with calculations in aerospace and automotive applications. You choose an arbitrary origin point nearby your operational area, and work in vectors relative to that. It's often referred to as "NED", or North-East-Down. Then, you just need to be able to convert between different frames of references as needed.


Your intuition is right, there's a pretty standard algorithm to solve this called "floating origin".

Essentially you translate the world back to origin when the player gets too far away.
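A minimal floating-origin sketch in C (names and the threshold are just illustrative, not from any particular engine): when the player drifts too far from the origin, shift the player and every object back by the player's offset, and track where the local frame sits in absolute coordinates using doubles on the CPU.

```c
#include <assert.h>
#include <math.h>

#define RECENTER_THRESHOLD 5000.0f  /* hypothetical; shift before FP32 error grows */

typedef struct { float x, y, z; } vec3;

/* Shift the whole world so the player returns to the origin.
   'world_offset' (kept in double) remembers where the local frame
   sits in absolute world coordinates. */
static void maybe_recenter(vec3 *player, vec3 *objects, int n,
                           double world_offset[3]) {
    if (fabsf(player->x) < RECENTER_THRESHOLD &&
        fabsf(player->y) < RECENTER_THRESHOLD &&
        fabsf(player->z) < RECENTER_THRESHOLD)
        return;
    vec3 shift = *player;
    world_offset[0] += shift.x;
    world_offset[1] += shift.y;
    world_offset[2] += shift.z;
    for (int i = 0; i < n; i++) {
        objects[i].x -= shift.x;
        objects[i].y -= shift.y;
        objects[i].z -= shift.z;
    }
    player->x = player->y = player->z = 0.0f;
}
```

Nearby objects keep small, precise coordinates, and the absolute position is still recoverable as local position plus `world_offset`.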


Double precision makes sense to me for world space interactions wherein participants are very far apart.

Something like Kerbal space program is an example where I'd probably break out the doubles.

For final projection, clipping, z buffering, etc., single precision is almost certainly enough.


For Kerbal, isn't it easier and more accurate to set the origin on the craft being simulated? Why use world space?


That is what is done in KSP ever since planets beyond the Mun were added - on every frame, the coordinate system is re-centered on the active vehicle. The problem that still remains is mainly in trajectory calculations - you might have an intercept with a planet way on the other side of the solar system, and you can generally see that the predicted trajectory does not pass through the encounter point in these cases due to floating point inaccuracies.


Other stuff breaks too, if you want to look around. I did a quick and dirty solar system a couple of decades ago, with texture maps where possible, and quickly discovered that Pluto was all lumpy... simply because tessellating a sphere that far out quickly runs into FP resolution issues if the Sun is the origin.


KSP only simulates at most a few hundred things in orbit, doesn't it?

I wonder whether they should just use rational numbers for those?


> For final projection, clipping, z buffering, etc., single precision is almost certainly enough.

If you do this in the GPU then you would need it to handle double precision data.


Yeah, I agree that double precision makes sense for modeling a very large system in data, but I've never bought the need for double precision when it comes to rendering a small subsection of it. Unless you have an incredibly high resolution display, single precision should be enough.


You are correct that you can render a small section of a very large data set by re-centering your origin of the data sent to GPU around a small subsection. So, now the question becomes "But, do you really want to?"

> Overall, we are quite happy with how this solution turned out. We think it is the closest to "just works" that we can get.

I think this is the crux of it. The performance penalty is very small and the convenience factor is very high.

But, now I want to know what they do with the positions of lights in the scene... Likely transformed to view space regardless for deferred rendering, I'd guess.


You're right; that's what I did when I was working on a planetary rendering system. Drawing a plane far from the Earth-centered origin showed up in the shader computations, especially in specular highlights. The new origin was at the Earth's surface, I think. However, it's not trivial: all shaders need to change, and having things in world space is useful in your shaders, so now you need to convert e.g. light positions into that local space. So it likely wouldn't make sense in the context of Godot.


Also see the twofloat crate in Rust, which uses a pair of f64's to give double the number of significant digits of a standard f64. The linked docs point to a number of academic papers on the subject.

[2f]: https://docs.rs/twofloat/latest/twofloat/


For general computational applications (i.e. not for special graphics cases), implementing double-precision operations using single-precision operations is considerably more complicated than implementing quadruple-precision operations using double-precision operations.

The reason is that it is not enough to extend the precision of the 32-bit FP numbers. The exponent range must also be extended. The standard double-precision numbers have an exponent range that is large enough to make underflow and overflow very unlikely in most algorithms. With the very small exponent range of FP32 numbers, underflow and overflow are very likely, and this must be corrected in any double precision implementation.

So it is not enough to use two FP32 numbers to represent one FP64 number. One must use either a third number for the exponent, or at least one of the two 32-bit numbers must be integer and partitioned into exponent and significand parts.

Both approaches lead to much more complex algorithms, so FP64 implemented with FP32 has a much worse speed ratio than FP128 implemented with FP64.


It's interesting that you consider an exponent range spanning roughly 10^-38 to 10^38 to be "very small."

In deep learning, this is huge! If you have numbers this big, then something is definitely already wrong. If you have numbers that small, then you definitely don't care.

I wonder if deep learning will save us from poorly conditioned linear algebra too.


In physics there are many universal constants or material constants with ranges between 10^10 and 10^40, and their reciprocals are between 10^-10 and 10^-40.

Some of these cannot be represented in single precision, while for the others one or two multiplications or divisions are enough to cause underflows and overflows. Such wide ranges are unavoidable in complex physical simulations, because their origin is in the ratios between quantities at human or astronomic sizes and quantities at atomic or molecular sizes.

Single precision values are perfectly adequate to represent the input data and the final results of any computation, because 24 bits is about the limit for any analog-to-digital or digital-to-analog conversion, and the exponent range is also sufficient for the physical quantities that can be measured directly. But when you simulate any semiconductor device, or even just an electrical circuit with discrete components, it is very common to have intermediate results far outside the range representable in single precision, even up to 10^60 or 10^-60. When computing a high-order polynomial to approximate the solution of some problem, some intermediate values may fall even outside that range.

In theory it is possible to avoid underflows and overflows by introducing a large number of scale factors in the equations, in appropriate places.

However, handling those scale factors in a program is extremely tedious and error-prone. The floating point format was invented precisely to avoid the need to deal with scale factors. If someone introduces scale factors in a program, they might as well use fixed-point numbers, because the main advantage of floating-point numbers is lost.


And for Julia, Quadmath.jl or DoubleFloats.jl


It feels kinda weird to use a data type that gets less precise the further you move from the center. Unless the world is infinite (which it sometimes is), isn't that a bit of a waste of precision? I kinda doubt you need nanometer precision only within 1 meter of the center. I get that GPUs have existing floating point hardware to accelerate stuff, but with open worlds becoming more common, wouldn't it make sense to include some new, big floating point data type in hardware, or emulate it in software?


Or like, I guess a 64 bit int would probably do the job


All those annoying trig functions get in the way since they're implemented for floats/doubles.


No reason you can't have intermediate ints. And if there's some nearby frame to do calculations against, then you don't get the loss of precision.


Integer definitely seems like the right fit here. The precision is translationally invariant, which is the property you want for a Cartesian coordinate system.


Depends on what you want. If you're rendering a 3D world (with perspective projection) and use a coordinate system where the camera is always at the origin or close to it (floating origin), then floats give you more precision where it matters (objects close to the camera) and less precision where it doesn't (faraway objects).


But if you're doing it that way, why would you need higher precision floats to render a larger world? Single precision should be good enough for any draw distance, because the object dimension on the screen falls as 1/r.

If you want to move in the environment, you want to be able to store the relative positions of a teacup and a table in virtual London with the same precision as in virtual New York. So the coordinates of objects should be stored as integers. Then to render the world, the camera coordinate (also an integer) is subtracted from all objects, with no loss of precision, and the result cast to float for 3D rendering.
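A tiny sketch of the scheme described above (names hypothetical): store absolute positions as 64-bit integers, subtract the camera position exactly in integer space, and only then cast the small relative offset to float for rendering.

```c
#include <stdint.h>

/* World positions as 64-bit integers (say, millimetres). Rendering
   subtracts the camera position exactly in integer arithmetic, then
   casts the small relative offset to float for the GPU. */
typedef struct { int64_t x, y, z; } ipos;
typedef struct { float x, y, z; } fpos;

static fpos to_camera_space(ipos obj, ipos cam) {
    return (fpos){ (float)(obj.x - cam.x),
                   (float)(obj.y - cam.y),
                   (float)(obj.z - cam.z) };
}
```

The point is that the subtraction loses nothing: two objects 1 mm apart, ten billion kilometres from the origin, would collapse to the same value if cast to float directly, but their camera-relative offsets stay exact.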


As seen in their video, the fact that the movement is choppy but the rendering is not indicates they are already doing that.


I’ve always wanted to see a game that used nanometer scale int64 positions. That’d give 11.5 million miles of nanometer scale precision. I imagine there are terrible problems with this. But I’ve never tried it so I don’t know what they are.

Back to Godot, I thought the answer would be to precompute the ModelView matrix on the CPU. Object -> World -> Camera is a “large” transformation. But the final Object -> Camera transform is “small”. I’m sure there’s a reason this doesn’t work, but I forget it.

Unreal 5 changes to doubles everywhere for large world coordinates. I wonder what fun issues they had to solve?


Article cites a 2007 article, but this technique is quite old. For example https://csclub.uwaterloo.ca/~pbarfuss/dekker1971.pdf


"Then, when doing the model to camera space transformation instead of calculating the MODELVIEW_MATRIX, we separate the transformation into individual components and do the rotation/scale separately from the translation."

That's the core idea here. A bit more detail would help. Is that done in the GPU? Is that extra work for every vertex? Does it slow down rendering because the GPU's 4x4 matrix multiplication hardware can't do it?

I actually have to implement this soon in something I'm doing. So I really want to know.



Ah. Thanks.

This is overkill for what I'm doing. They want to zoom way out and see planet-sized objects. I just have a big flat world a few hundred km across. So the usual "offset the render origin" approach will work. I don't have to update on every frame, only when the viewpoint moves a few hundred meters.


This is neat, but it's often a lot simpler to just render each object relative to the camera instead. Which can be as simple as subtracting the camera's position from the world position right before rendering.


Why not describe the world space in integers? Where 1 is the Planck length of the simulation?

Is there a “lossy compression” benefit to describing space with floats?


Integers will improve your range somewhat, but not that much (assuming 32-bit). If you set 2^16 units to a meter, you still can't go past about 65 km. And as a downside, now you have to be extra careful that your derived numbers don't go out of range.


If you set the unit to 1mm you can reach like 20% of the way to Proxima Centauri with Sol as origin.


That's 64 bit. If you were using 64 bit floats you didn't have problems in the first place.

Also a granularity of 1mm will make slow movement complicated to calculate correctly. Consider updating at 60Hz and having an object that moves 1 inch per second.
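To make that concrete: at 60 Hz, 1 inch/s is 25.4/60 ≈ 0.42 mm per tick, which truncates to zero in whole-millimetre coordinates, so the object never moves. A common workaround (a sketch, with hypothetical names) is to carry the fractional remainder between ticks:

```c
#include <stdint.h>

/* Integer-mm position plus a fractional-mm remainder carried between
   ticks, so sub-millimetre per-frame motion isn't silently dropped. */
typedef struct { int64_t mm; double frac; } axis;

static void step(axis *a, double velocity_mm_per_tick) {
    a->frac += velocity_mm_per_tick;
    int64_t whole = (int64_t)a->frac;  /* truncate toward zero */
    a->mm += whole;
    a->frac -= (double)whole;          /* keep only the sub-mm remainder */
}
```

After 60 ticks at 25.4/60 mm per tick, the position has advanced 25 whole millimetres with ~0.4 mm still pending, instead of staying stuck at zero.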


>The MODELVIEW_MATRIX is assembled in the vertex shader by combining the object’s MODEL_MATRIX and the camera’s VIEW_MATRIX

I was taught that MV/MVP should be calculated CPU-side per-model, and that doing it in the vertex shader is wasteful. Is that advice out of date?


Not necessarily, but it doesn't solve the problem: if you calculate MVP on the CPU in double precision, the final matrix you pass to the shader is cast down to single precision, causing the issue described.


The book "3D Engine Design for Virtual Globes" contains a bunch of these shader tricks with different levels of precision (e.g. 1cm across the whole solar system etc.)


A better way to solve this problem is to move the world around the origin instead. Just like you had to with OpenGL 1!

Really half-floats are more interesting, saving 50% memory on the GPU for mesh data. You could imagine using half-floats for animations too!

Then we could have the debate about fixed point vs. floating point. Why we choose a precision that deteriorates with distance is descriptive of our shortsightedness in other domains, like the economy, f.ex. (let's just print money now, close to the origin, and we'll deal with precision problems later, as time moves away from the origin).

What you want is fixed point, preferably with integer math so you get deterministic behaviour, even across hardware. Just like float/int arrays both give you CPU-cache and atomic parallelism at the same time, often simplicity is the solution!

In general 64-bit is not interesting at all, so the idea Acorn had with ARM, that jumping to 32-bit would be enough, is pretty much proven by now. Even if addressing only jumped from 26-bit to 32-bit with the ARM6.

Which leads me to the next interesting tidbit, when talking 8-bit the C64 had 16-bit addressing.


Minecraft works like this. The camera is the world origin when rendering.


It's easy when you have chunks.

But really all large worlds need chunking.

The real reason AAA never got into user-generated content is that they have staff to create linear worlds.

After this economic crisis, linear content will more or less disappear.

Why listen to a hardcoded story when you can make your own just like in real life?

Scarcity is the key, one UGC networked world will make time valuable.


Don't all games do this? The camera is by definition the world space origin, (0, 0, 0). Translating the camera right actually means translating the world left. What does Minecraft do different?


The camera is not by definition the world space origin.


This is similar to the Far Lands[0] in old versions of Minecraft. Woof![1]

[0] https://minecraft.fandom.com/wiki/Far_Lands

[1] https://farlandsorbust.com/


Just render relative to camera (subtract camera position from model position and set camera position to 0).



