> The pattern I like most is a struct of const function pointers, instantiated as a const global, then passed by address to whatever wants to act on that table
> If you heed the "const" word above the compiler inlines through that just fine.
But only when the function itself is inlined, which you quite often don't want. If you sort integers in a bunch of places, you don't really want qsort to be inlined all over the place, but rather that the compiler creates a single specialized copy of qsort just for integers.
With something as simple as qsort, compilers sometimes do the function specialization, but it's very brittle and you can't rely on it. If it's neither specialized nor inlined, then performance is often horrible.
IMO an additional way to force the compiler to specialize a function based on constant arguments is needed. Something like specifying arguments as "inline".
IIRC, this library uses this style of generic programming with a very nice API: https://github.com/JacksonAllan/CC
It's just unfortunate that everything needs to get inlined for good performance.
If you inline stack_drop into that and user code has calls into malloc_stack_drop, you get the instantiation model back.
Absolutely agreed that this is working around a lack of a compiler hook. The interface I want for that is an attribute on a parameter which forces the compiler to specialise with respect to that parameter when it's a compile-time constant; apply that attribute to the vtable argument. The really gnarly problem in function specialisation is the cost metric; the actual implementation is fine. So have the programmer mark functions as a good idea while trying to work out the heuristic. Going to add that to my todo list, had forgotten about it.
There is a very simple and small addition to C that would solve the generics problem, albeit without much type safety, and you'd probably need a bunch of macros to create really nice APIs: marking function arguments as "inline", such that the argument must always be a constant expression and the compiler is supposed/forced to specialize the function.
You can already write generic code that way currently, see qsort, but the performance is often very bad, because compilers don't specialize the functions aggressively enough.
On the simple level, this would make things like qsort always specialize on the comparator and copy operation.
But you can use this concept to create quite high level APIs, by passing around constexpr type descriptor structs that contain functions and parameters operating on the types, somewhat similar to interfaces.
Early versions of Mathematica were only a few dozen MB and certainly had more functionality than most calculator apps you can find today that are much bigger.
C based, no support for modular programming, everything needs to be a giant include, and no one is adding features to it, as Khronos hasn't assigned any budget to it.
HLSL has evolved to be C++ like, including lightweight templates, mesh shaders and work graphs, has module support via libraries, and is continuously being improved with each DirectX release.
I'm not a fan of GLSL either, but adding C++-like baggage to shading languages, as HLSL and especially MSL (which is C++) do, is a massive mistake IMHO; I'd prefer WGSL over that sort of pointless language complexity any day.
Long term, shading languages will prove to be a transition phase, and most GPUs will turn into general purpose compute devices, where we can write code like in the old days of software rendering, except it will be hardware accelerated anyway.
We already see this with rendering engines that are using CUDA instead, or as shown at Vulkanised sessions.
I do agree that to the extent C++ has grown, and the security issues, something else would be preferable, maybe NVidia has some luck with their Slang adoption proposal.
From the pov of assembly, C and any other high level language are basically the same. That doesn't mean that climbing even higher up on the abstraction ladder is a good thing though (especially for performance).
The macros I see in the real world seem to usually work fine. I’m sure it’s not perfect and you can construct a macro that would confuse it, but it’s a lot better than not having a compilation db at all.
Are just 32-bit and naturally aligned 64-bit instructions a better path than fewer 32-bit encodings plus 16/48/64-bit instructions?
I think it's quite unclear which one is better. 48-bit instructions have a lot of potential imo: they have better code density than naturally aligned 64-bit instructions, and they can encode more than 32-bit instructions can (2/3 to 3/4 of 43 bits of encoding).
There are essentially two design philosophies:
1. 32-bit instructions, and 64 bit naturally aligned instructions
2. 16/32/48/64 bit instructions with 16 bit alignment
Implementation complexity is debatable, although it seems to somewhat favor option 1:
1: you need to crack instructions into uops, because your 32-bit instructions need to do more complex things
2: you need to find instruction starts, and handle decoding instructions that span across a cache line
How big the impact is relative to the entire design is quite unclear.
Finding instruction starts means you need to propagate a few bits over your entire decode width, but cracking also requires something similar. Consider that if you can handle 8 uops, then those can come from the first 4 instructions that are cracked into 2 uops each, or from 8 instructions that don't need to be cracked, and everything in between. With cracking, you have more freedom when you want to do it in the pipeline, but you still have to be able to handle it.
In the end, both need to decode across cache lines for performance, but one needs to deal with an instruction split across those cache lines. To me this sounds like it might impact verification complexity more than the actual implementation, but I'm not qualified enough to know.
If both options are suited for high performance implementations, then it's a question about tradeoffs and ISA evolution.
There is also a middle ground of requiring 16/48-bit sequences to be padded with a 16-bit NOP to align them to 32 bits. I agree that at this time it's not clear whether the C extension is a good idea or not (same with the V extension).
The C extension authors did consider requiring alignment/padding to prevent the misaligned 32-bit instruction issues, but they specifically mention rejecting it since it ate up all the code size savings.
This would require specifying a cache line size in the ABI, which is a somewhat odd uarch detail to bubble up. While 64-bytes is conventional for large application processors and has been for a long time, I wouldn't want to make it a requirement.
Not really. Most modern x86 compilers already align jump targets to cache line boundaries since this helps x86 a lot. So it is doable. If you compile each function into its own section (common), then the linker can be told to align them to 64 or 128 bytes easily. Code size would grow (but Tetris can be played to reduce this by packing functions).