Supporting half-precision floats is really annoying (2021) (futhark-lang.org)
88 points by jb1991 on Jan 16, 2023 | 26 comments



> Google has also developed the bfloat16 for use in AI accelerators. ARM apparently also supports a nonstandard variation of binary16, but I’m hoping I will never have to learn the details.

The details of bfloat16 are very simple: build a standard IEEE 754 float32, then take only the upper 16 bits. It's a LOT easier than trying to support half-precision floats, a format that has so little demand that most platforms don't even support it.
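
A quick illustration of that truncation (a minimal Python sketch using only the standard struct module; real implementations usually round-to-nearest-even rather than plainly truncating):

    import struct

    def f32_to_bf16_bits(x: float) -> int:
        """bfloat16 bit pattern of x: build an IEEE 754 binary32, keep the upper 16 bits."""
        (bits,) = struct.unpack(">I", struct.pack(">f", x))  # float32 bit pattern
        return bits >> 16

    def bf16_bits_to_f32(bits: int) -> float:
        """Widen a bfloat16 bit pattern back to float32 by appending 16 zero bits."""
        return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

    print(hex(f32_to_bf16_bits(3.14159)))  # 0x4049
    print(bf16_bits_to_f32(0x4049))        # 3.140625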


Depends on your definition of “platform”… nearly all GPUs, including the one in your phone, support it.

It’s also a native type on the CPU side on all Apple Silicon and iOS devices.


There is a huge demand for float16 in machine learning, and as such all modern GPUs support it. I can imagine that reusing IEEE 754 float32 arithmetic implies a lot of performance degradation compared to native instructions?


This is ancient history so I might be misremembering, but I think float16 first appeared in NVidia GPUs in the 00s, when even GLSL was not yet mainstream and the things to use were the {ARB,NV}_fragment_program* pseudo-assembly languages.


Only 7 bits for the mantissa?


Yes. The idea is that in machine learning you care more about dynamic range, i.e. being able to represent both very large and very small numbers, than about representing each individual number precisely.
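
For a concrete sense of the trade-off, here is a small Python sketch comparing the largest finite values of the two formats (the bit patterns are the standard max-finite encodings):

    import struct

    # Largest finite binary16 (fp16) value: bit pattern 0x7BFF.
    fp16_max = struct.unpack("<e", (0x7BFF).to_bytes(2, "little"))[0]

    # Largest finite bfloat16 value: bit pattern 0x7F7F, widened to float32 0x7F7F0000.
    bf16_max = struct.unpack(">f", (0x7F7F0000).to_bytes(4, "big"))[0]

    print(fp16_max)  # 65504.0  (10 stored mantissa bits, 5 exponent bits)
    print(bf16_max)  # ~3.39e38 (7 stored mantissa bits, 8 exponent bits: same range as float32)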


> This is not how we want to do it in Futhark. The f16 type should be fully supported like any other primitive type, and on all systems. When necessary, the compiler must generate code to paper over lack of hardware support.

That seems like a them problem. Nobody needs an abstraction that isn't like /any/ of the concrete implementations below it. Especially when the ML version of half-floats is different from the GPU one.


> Nobody needs an abstraction that isn't like /any/ of the concrete implementations below it.

You mean like bignums?


Are those an abstraction over multiple implementations? They seem pretty concrete to me.


In many high-level dynamic languages, the integer type is an “arbitrary-precision integer” in the sense that it’s a machine integer when it can be, but gets implicitly auto-promoted into a bignum ADT when necessary to keep operations lossless, and said bignum ADTs auto-demoted back to machine integers when possible to make operations more efficient. This is mostly hidden from the program, all part of one continuous “integer” abstraction, unless you use reflection facilities to specifically ask what concrete type/kind/class of “integer” you’re currently holding.
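Python 3 illustrates the "hidden from the program" part: the program only ever sees a single int type, while the underlying storage grows with the magnitude of the value (a small sketch; the exact byte counts are CPython-specific):

    import sys

    small = 2 ** 10
    huge = 2 ** 100

    # Both values present as the same abstract integer type to the program...
    print(type(small) is type(huge))                  # True

    # ...even though the underlying representation grows with magnitude,
    # which reflection-style facilities like sys.getsizeof can reveal.
    print(sys.getsizeof(small), sys.getsizeof(huge))  # e.g. 28 40 on 64-bit CPython
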

I would say that that’s almost exactly analogous to fp16s swapping from one representation to another as they’re moved between the GPU and a CPU that either does or doesn’t have native hardware support for them.

Another good analogy might be to x86 real-mode segmented “far” pointers, vs. protected-mode “flat” pointers, in cases where a given memory address is expressible under both representations.


The issue here isn't that FP16 changes speeds during runtime (although it can) but that it changes on different compile targets, so you can write a program that effectively hasn't been tested on the other targets.

Bignum operations are straightforwardly O(N) in the size of the number - which is different from O(1) and can definitely lead to performance and security bugs - but the behavior is consistent.



Why stop at half-precision floats? Why not go all the way to 3-bit floats? And build a hardware accelerator out of them? Tom Murphy has you covered with the NaN Gate:

https://www.youtube.com/watch?v=5TFDG-y-EHs

http://tom7.org/nand/nand.pdf



I <3 binary3; it achieves the theoretical goal of knowing the value of everything (at least, pace Everett, as approximated by {0,1,∞}) while being blithely unconcerned with the cost of anything.


Sadly, per the following year's sigbovik, the validity of binary3 is disputed (http://sigbovik.org/2020/proceedings.pdf, https://twitter.com/stephentyrone/status/848172687268687873).


As much as I am saddened to learn binary3 is invalid, I am overjoyed to learn that sigbovik has its own (non-self-cite!) literature graph.


Interesting comment on machine learning.

My understanding is that they don't train with half precision, but quantisation allows even integer values to be used for inference?

Yet the reasoning still makes sense. Can training take place at half precision, or, even if you intend to run inference at half precision, do you want double precision for training?


My guess is that very small and very large values in the weights are already trained away due to the regularisation of the cost function, so insignificant changes in a network don't tend to produce significant changes in the output.

You gain more by being able to run gradient descent faster than by having higher-precision floats.


Absolutely. And beyond weight regularization, for any weighting followed by a sigmoid or other squashing function, large weights simply tend to saturate the squashing function and there is very little gradient (quickly effectively zero) to benefit from increasing the weight value past that point.
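
To put numbers on "quickly effectively zero", here is a small Python sketch using the logistic sigmoid, whose derivative is sigmoid(x) * (1 - sigmoid(x)):

    import math

    def sigmoid(x: float) -> float:
        return 1.0 / (1.0 + math.exp(-x))

    def sigmoid_grad(x: float) -> float:
        # Derivative of the logistic sigmoid: s * (1 - s).
        s = sigmoid(x)
        return s * (1.0 - s)

    for x in (0.0, 2.0, 5.0, 10.0):
        print(x, sigmoid_grad(x))
    # 0.0  -> 0.25
    # 2.0  -> ~0.105
    # 5.0  -> ~0.0066
    # 10.0 -> ~4.5e-05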


Training with half precision (or mixed precision) is common.


I believe the core issue here is that they are trying to merge multiple 16-bit float types into one f16 type. But we had multiple 16-bit types with different trade-offs for a reason ...


No, the problem is that the platforms involved do not support any 16-bit float type in an ergonomic way. (The most accessible type on all the platforms, and the one used, is ultimately IEEE 754 binary16.)


> Apparently when you run-time compile a CUDA kernel with NVRTC, the default search path does not include the CUDA include directory, so this header cannot be found.

I don't know much about Futhark or CUDA includes, but if Futhark supports CUDA, wasn't finding the headers for CUDA's other main functionality already an issue before adding f16 support? Or is f16 the first thing in CUDA that needs a header, while things like regular float computations work without any includes at all?


You always had to tell Futhark about the CUDA include directory at compile-time. The difference is that now you also have to tell it about it at run-time.

Further, at compile-time the usual CPATH/LIBRARY_PATH environment variables are respected, which I think is not the case for the run-time NVRTC compiler.
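
For reference, the run-time fix boils down to passing the include directory to NVRTC yourself. A rough sketch using NVIDIA's cuda-python bindings (the kernel and the include path are illustrative; the point is the -I option handed to the run-time compiler):

    from cuda import nvrtc  # pip install cuda-python

    kernel_src = b"""
    #include <cuda_fp16.h>   // the header NVRTC cannot find by default
    extern "C" __global__ void scale(__half *xs, __half a) {
        xs[threadIdx.x] = __hmul(xs[threadIdx.x], a);
    }
    """

    err, prog = nvrtc.nvrtcCreateProgram(kernel_src, b"scale.cu", 0, [], [])

    # The crucial part: tell the run-time compiler where the CUDA headers live.
    opts = [b"-I/usr/local/cuda/include"]  # example path
    err, = nvrtc.nvrtcCompileProgram(prog, len(opts), opts)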


And on the Nvidia H100, FP32 runs at 67 teraFLOPS... and BFLOAT16/FP16 runs at 1,979 teraFLOPS.

https://www.nvidia.com/en-us/data-center/h100/



