Another NaN-based tagging strategy for dynamic programming languages (outerproduct.net)
63 points by todsacerdoti on Jan 11, 2023 | 19 comments



Interesting, but this only admits 32-bit floats.

The original idea behind NaN-tagging is to overlap pointer representation with 64-bit floats.


> The original idea behind NaN-tagging is to overlap pointer representation with 64-bit floats.

And that's why this is 'another NaN-based tagging strategy'.


Aren't NaN-based operations computationally risky? Per Agner Fog's instruction latency guides, several microarchitectures seem to process NaNs and denormals through a slow, microcoded path rather than the fast FPU/vector path. Saving a few instructions on boxing/unboxing might be a false economy if it comes at the cost of a hundred cycles of microcode latency.


This is a valid observation. Most NaN-tagging schemes only perform arithmetic once the values are determined not to be NaN, and so avoid any penalty. This scheme always performs arithmetic on NaNs, so if the uarch imposes a penalty it will be hit on every op.


Recent uarchs take no penalty for denormals and nans.


> Also, any floating-point operation performed on a NaN will give back the same NaN.

This is not entirely true.

The IEEE-754 recommendation is that floating-point operations do NaN propagation: the result of an operation with one or more NaN inputs should have the same payload as one of those NaN inputs. However, it's only a recommendation, and the only requirement is that the result be a qNaN. There do exist some notable architectures that do not do NaN propagation: RISC-V, for one (it returns the preferred NaN [1]).

On top of that, if you're writing this in something other than assembly, you also have to worry about whether or not the compiler will guarantee NaN payload preservation (assuming the hardware does). Spoiler alert: the answer is no.
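To make that concrete, here's a minimal Python sketch (so the arithmetic goes through whatever C compiler and hardware CPython was built on) that builds a qNaN with a recognizable payload and checks whether an operation keeps it. The payload tends to survive on typical x86 hardware, but nothing guarantees it:

    import struct

    def f64_bits(x):
        return struct.unpack('<Q', struct.pack('<d', x))[0]

    def qnan_with_payload(payload):
        # exponent all ones + quiet bit; the low bits carry the payload we want to track
        return struct.unpack('<d', struct.pack('<Q', 0x7FF8000000000000 | payload))[0]

    tagged = qnan_with_payload(0xBEEF)
    result = tagged + 1.0
    print(hex(f64_bits(tagged)))  # 0x7ff800000000beef
    print(hex(f64_bits(result)))  # same bits only if the payload was propagated; not guaranteed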

And we don't talk about sNaN. Just say no to sNaN.

[1] Usually, people call this "canonical NaN", but I find this can cause confusion with canonical/noncanonical floating-point numbers. For the basic floating-point types everyone is familiar with, all values are canonical--you have to use the x87 80-bit FP type, the double-double type, or decimal floating point types to find noncanonical representations.


I was working on a function to deduplicate (preserving order) collections in python. Very straightforward: take the next value, see if it's in the set of already-seen values, put it in the output list if not. Since it uses sets under the hood, the values have to be hashable. Testing was almost equally straightforward: I favor property testing[1] and just asked for examples from from_type(Sequence[Hashable]).

Except. It turns out that Decimal is a hashable type, but there is a specific value that isn't: Decimal('sNaN'). Which of course the tests found pretty much right away (yay prop tests). sNaN: not even once.
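The function and the failure mode were roughly this (a reconstruction, not the actual code; the hypothesis-driven test is omitted):

    from decimal import Decimal

    def dedupe(items):
        # order-preserving dedup; membership checks go through a set, so items must hash
        seen = set()
        out = []
        for x in items:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return out

    print(dedupe([3, 1, 3, 2, 1]))  # [3, 1, 2]
    try:
        dedupe([Decimal('sNaN')])
    except TypeError as e:
        print(e)  # hashing a signaling NaN is refused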

[1] specifically, using hypothesis


>> Also, any floating-point operation performed on a NaN will give back the same NaN.

> This is not entirely true.

As I say two sentences prior:

> a couple of qualities near-ubiquitous on modern processors and architectures

Obviously, this won't work if you don't do nan-propagation. It also won't work if you don't have vector ops at all, or if they're much slower than the scalar ops.

I will add: if you are guaranteed a particular nan, you can just use that as your tag. (Although it looks like riscv uses a nan whose payload bits are all zero, which is unfortunate, as it means using it as the low-bits tag slows down integer math.)

> you also have to worry about whether or not the compiler will guarantee NaN payload preservation

The target audience is authors of compilers for dynamically-typed programming languages.


Wouldn't this strategy still allow performance gains when running on a processor that does NaN propagation?


You'd have to be sure that the compiler will also preserve NaN propagation through optimization, which is itself not a safe assumption (internal models for compiler IR often model floating-point as producing a nondeterministic choice of NaN representation, which means that it's not illegal for an optimization to fail to propagate NaNs).

As for performance gains, well, the thing is you're not likely to see much of a gain. The post argues that "standard is to store the floats in the high 32 bits of the pointer, with the low 32 bits as a tag", except I've never seen a system like this (tbh, I haven't really done too many VM systems, so my experience isn't great). That tagging model sounds like it's assuming a pointer-tagging system [0], rather than true NaN-boxing. With NaN-boxing, you'd want to put the tag in the high bits of the payload, so that the low 32 bits can be used directly for whatever value is boxed. See https://searchfox.org/mozilla-central/source/js/public/Value... for an example of NaN-boxing in Firefox's JS engine [1]. What you save over a system that already keeps the tag in the high payload bits is... an or instruction at the end to set the tag bits.
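To sketch what I mean by a high-bits layout (the constants and tag values here are made up for illustration, not SpiderMonkey's actual encoding): doubles are stored as-is, everything else puts a tag in the top 16 bits of NaN-space and its payload in the low bits, so unboxing an int32 is just a mask and boxing is just an or:

    import struct

    TAG_SHIFT = 48
    TAG_INT32 = 0xFFF1  # hypothetical tag values sitting in NaN-space
    TAG_PTR   = 0xFFF2

    def box_double(d):
        return struct.unpack('<Q', struct.pack('<d', d))[0]  # stored unmodified

    def box_int32(i):
        return (TAG_INT32 << TAG_SHIFT) | (i & 0xFFFFFFFF)   # one or to set the tag

    def unbox_int32(word):
        return word & 0xFFFFFFFF                              # no shift needed

    assert unbox_int32(box_int32(1234)) == 1234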

Realistically, though, there's very little need for a VM to support 32-bit float types. They're not actually faster than 64-bit floats (except for division) in hardware, and the main benefit of smaller float types is more compact memory representation--except if you're boxing all the values anyways, the 32-bit float types also aren't any smaller.

[0] Classical pointer tagging uses low bits for the tag, because you rely on pointer alignment. However, with 64-bit systems generally having only 48-bit pointers (57-bit with x86's 5-level page table, but this requires processes to opt in to using 5-level page tables anyways), it makes a lot of sense on 64-bit systems to also use high bits for the tag these days instead.

[1] JS doesn't have 32-bit floats, but it does have 32-bit integers, so you can at least get an idea for how the payload would be set if the boxed value could represent 32-bit floats.


It really depends on how you are boxing. Assuming a four-byte object header, a boxed float32 consumes 8 bytes. A boxed float64 consumes 12 bytes but (!) is quite likely to require 4 bytes of padding to align it to an 8-byte boundary. It adds up quickly if you have many floats.


I think you'd also want to check that the NaN route is actually faster in practice on the CPUs you are targeting, but beyond that, as long as you're willing to maintain the dedicated codepath for it (or are fine with saying it only runs on specific targets), you should be good.


Cute idea.

I would be worried about using vector float ops when the program only needed a scalar op, since I would expect the vector ones to be more expensive somehow (either they are actually slower or they don’t get as much instruction level parallelism).

Also, the scalar ops to massage the tagged float (the shift ops in the example) are hella cheap. In JSC, we have sometimes implemented optimizations to remove those only to find that it didn’t buy us anything.

Finally, it’s somewhat unusual to want to tag 32-bit floats in a dynamic language setting. Usually dynamic languages have 64-bit floats.


On recent x86s, vector ops are generally the same speed as scalar ops, at least applied to xmm*. (Stuff like sqrt or division gets slower on bigger vectors, but that doesn't matter here.) Taking a quick look at the tables for firestorm/icestorm, it looks like the same thing applies there.

> the scalar ops to massage the tagged float (the shift ops in the example) are hella cheap. In JSC, we have sometimes implemented optimizations to remove those only to find that it didn’t buy us anything

You still get second-order effects from stuff like I$, or not fitting as much in your ROB. Of course, for a JIT, this maybe has to be weighed against the time it takes to perform the optimisations, so I'm not surprised if it wasn't an obvious win in your case.

> it’s somewhat unusual to want to tag 32-bit floats in a dynamic language setting. Usually dynamic languages have 64-bit floats.

My primary interest is in common lisp, which has both single and double precision floats. Single precision is more than enough for lots of dsp-type stuff, and is half the size; better for memory b/w, and usually at least twice as fast when vectorised, too.

And it has the advantage that you can pass it around without immediately making integers and pointers slow :)


I’m unsure why this is so determined to use NaN tagging with a 32-bit float. It’s not going to provide a performance win, it increases complexity, etc.

The reason we use tagged NaNs in high-perf cases is that we’ve got 64 bits burned by having to hold a float, but the total bit size for the alternative types is <= the significand bits.

Here we’re using nans to hold the tags, but all that’s doing is pushing the tag bits around: we haven’t reduced the value size, the access complexity, or provided any support for additional values.


I believe this is meant to be used in combination with a low-bits pointer tagging scheme. One of the low-bits tags represents a float value, and you insert this tag into the NaN-space of an IEEE-754 single-precision float that fills the lower 32 bits. The upper 32 bits contain the single-precision float you are interested in.

The optimization is only for single-precision floats. Other types just follow the low-bits tagging scheme.
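Something like this, if I'm reading the post right (the tag value and helper names here are made up for illustration):

    import struct

    FLOAT_TAG = 0b101                    # hypothetical low-bits pointer tag meaning "float"
    NAN32_BOX = 0x7FC00000 | FLOAT_TAG   # 32-bit quiet NaN whose low bits carry that tag

    def box_float32(f):
        (value_bits,) = struct.unpack('<I', struct.pack('<f', f))
        return (value_bits << 32) | NAN32_BOX  # single float up top, NaN/tag below

    def unbox_float32(word):
        return struct.unpack('<f', struct.pack('<I', word >> 32))[0]

    assert unbox_float32(box_float32(1.5)) == 1.5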


I would be curious to see if this was actually faster in practice. It is fewer instructions, but I would expect that the use of vector registers would negate that.


On x86-64, all floating-point arithmetic uses vector registers, and there is no difference in latency or throughput between scalar and 128-bit SIMD arithmetic; in fact, scalar instructions can easily introduce performance issues because they leave the high bits of the registers untouched (i.e., introduce useless partial dependencies).


> scalar instructions can easily introduce performance issues because they leave the high bits of the registers untouched (i.e., introduce useless partial dependencies)

How's that? SSE scalar ops are all two-address, so they depend on the destination regardless. And AVX scalar ops are three-address, but zero the high bits of the register. Where you run into trouble is mixing SSE with AVX, as you can end up in the dirty high-half state; but that can happen with vector or scalar ops.



