Hacker News
VC++ /arch:AVX option – unsafe at any speed (randomascii.wordpress.com)
42 points by jsnell 163 days ago | 27 comments

Meanwhile this problem has been solved in GCC for years


Isn't dynamic dispatch a separate problem? It's great that gcc solves dynamic dispatch, but this article was about the problem of functions that are not supposed to be dynamically dispatched, but are called from dynamically dispatched functions - the noninlined inline function problem.

gcc "solves" this problem for its own inline functions by tagging them as static. But what about template functions? And what about functions defined by developers? I don't think gcc claims to solve those.

And it is being overhauled in GCC 6, I just read: https://lwn.net/Articles/691932/

The reasons given for not doing it for Chrome are sound, but I agree with the author that the solution for most software should be to ship separate AVX-enabled and non-AVX-enabled binaries (whether an individual DLL or the entire program).

There's an easy way to do this for .Net programs - some Windows .Net programs (Paint.NET, etc) contain an "optimization" step at the end of their installers that ngens their binaries. The ngen'd binary is free to use CPU-specific instructions.

> but I agree with the author that the solution for most software should be to ship separate AVX-enabled and non-AVX-enabled binaries

I don't think this scales. I work on text search, and there are distinct algorithms: some use at most SSSE3, some use at most SSE 4.2 and some use at most AVX2. Should I then ship 4 different versions for every platform I support? What are my instructions to users on the download page?

>Should I then ship 4 different versions for every platform I support?

It's worse than what you have now, of course, but the situation is not new with AVX.

>What are my instructions to users on the download page?

If the differentiation is at the DLL level (SSE3 implementations in their own DLL, SSE4 implementations in their own DLL, all providing a common interface), your application would load the corresponding one at runtime based on feature detection and your user never notices.
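A minimal sketch of the selection step described above, assuming the feature bits have already been read from CPUID. The module file names, `CpuFeatures`, `SimdLevel`, `best_level` and `module_for` are all hypothetical names made up for illustration; only the idea (one module per instruction-set tier, same interface, chosen at startup) comes from the comment.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Instruction-set tiers, ordered from least to most capable.
enum class SimdLevel { Baseline = 0, SSSE3 = 1, SSE42 = 2, AVX2 = 3 };

// Feature bits as the application would fill them in from CPUID at startup.
// They are plain bools here so the selection logic does not depend on any
// particular compiler's detection intrinsic.
struct CpuFeatures {
    bool ssse3;
    bool sse42;
    bool avx2;
};

// Picks the most capable tier the CPU supports.
inline SimdLevel best_level(const CpuFeatures& cpu) {
    if (cpu.avx2)  return SimdLevel::AVX2;
    if (cpu.sse42) return SimdLevel::SSE42;
    if (cpu.ssse3) return SimdLevel::SSSE3;
    return SimdLevel::Baseline;
}

// Maps a tier to the (hypothetical) module that implements it.
// Every module is assumed to export the same interface.
inline std::string module_for(SimdLevel level) {
    static const char* const kModules[] = {
        "search_baseline.dll",
        "search_ssse3.dll",
        "search_sse42.dll",
        "search_avx2.dll",
    };
    const std::size_t index = static_cast<std::size_t>(level);
    assert(index < sizeof(kModules) / sizeof(kModules[0]));
    return kModules[index];
}
```

The returned name would then be handed to the platform loader (`LoadLibrary` on Windows, `dlopen` elsewhere). Because each module is compiled as a whole with one set of flags, no AVX and non-AVX object files ever meet in the same link.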

If the entire binary / program package needs to be differentiated (say ripgrep where there's only one statically linked binary to speak of), your Windows installer picks the right binary to install, and you have multiple Linux packages and a metadata package that depends on them and picks the one with the matching architecture. The former is definitely done, and I believe I've seen instances of the latter in my package manager's listings.

But I don't think you actually need to do that. You can compile individual functions with different target specific optimizations using (e.g.) LLVM's target_feature attribute. Both gcc and clang expose this type of functionality.
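For readers who have not used per-function targeting, here is a rough sketch of what it can look like with gcc or clang on x86-64. It assumes the `target` function attribute and `__builtin_cpu_supports`, both of which are gcc/clang extensions and not available in MSVC; the function names are hypothetical. The helpers are `static`, so neither version can be confused with a copy from another translation unit.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// Portable fallback, compiled with the translation unit's default flags.
static std::size_t count_byte_scalar(const unsigned char* data, std::size_t n,
                                     unsigned char needle) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) count += (data[i] == needle);
    return count;
}

// Only this function is compiled with AVX2 enabled (gcc/clang extension).
__attribute__((target("avx2")))
static std::size_t count_byte_avx2(const unsigned char* data, std::size_t n,
                                   unsigned char needle) {
    const __m256i pattern = _mm256_set1_epi8(static_cast<char>(needle));
    std::size_t count = 0;
    std::size_t i = 0;
    // Compare 32 bytes at a time and count the matching lanes.
    for (; i + 32 <= n; i += 32) {
        const __m256i chunk =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        const unsigned mask = static_cast<unsigned>(
            _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, pattern)));
        count += static_cast<std::size_t>(__builtin_popcount(mask));
    }
    // Handle the tail that does not fill a whole vector.
    for (; i < n; ++i) count += (data[i] == needle);
    return count;
}

// Runtime dispatch: take the AVX2 path only if the CPU reports support.
std::size_t count_byte(const unsigned char* data, std::size_t n,
                       unsigned char needle) {
    assert(data != nullptr || n == 0);
    if (__builtin_cpu_supports("avx2")) return count_byte_avx2(data, n, needle);
    return count_byte_scalar(data, n, needle);
}
```

Callers only ever see `count_byte`; the rest of the file, including any inline functions it pulls in from headers, is still compiled for the baseline architecture.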

As DannyBee mentions in the sibling, this issue is compounded in AVX512. You really need to support runtime detection, and doing that means including multiple versions of the same function in the same binary. It looks like you are at the mercy of your compiler for how well this works, but it seems like gcc/clang do this just fine?

>Both gcc and clang expose this type of functionality.

Sure, I didn't say that it was impossible in general. The context of my comments was that MSVC doesn't appear to support automatic target-architecture-based dispatch and thus requires such workarounds.

Ah! Gotcha. Understood. Sorry for the mixup!

FWIW, frustratingly MSVC has no compiler options for generating SSE3/SSSE3 or SSE4.x code - you can get them from intrinsics, but the compiler flags are SSE/SSE2 (on ia32) and AVX/AVX2 only.


It's infinitely worse with AVX512, where there are CPUs that support bits and pieces.

CPU dispatch like this is pretty hard to get right. For now, I've only seen Intel's ICC/ICPC get it right. Even then, if you have an AMD CPU you're out of luck.

This... should not even be a hard thing.

The dynamic dispatch is not a hard thing.

For C, in fact, you get "whatever", and that's even legal. For C++, the problem is that templates, etc., are linkonce, and you can't choose which one it's going to pick (i.e. the problem described in the article).

You also can get screwed by inlining, etc.

Function multi-versioning alone does not solve this well, you really need a way to make the ABI part of the template arguments or something to do it well.
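One way to read "make the ABI part of the template arguments" is an architecture tag type threaded through every shared template. The sketch below is only an illustration of that reading: `ArchSSE2`, `ArchAVX` and `sum` are hypothetical names, and the tag by itself does not change code generation, which still comes from the flags each translation unit is compiled with.

```cpp
#include <cassert>
#include <cstddef>

// Empty tag types naming the instruction set a piece of code was built for.
struct ArchSSE2 {};
struct ArchAVX {};

// Hypothetical helper shared by both code paths. Because Arch is a template
// parameter, sum<ArchSSE2> and sum<ArchAVX> are distinct functions with
// distinct mangled names, so the linker cannot substitute one for the other.
template <typename Arch>
float sum(const float* values, std::size_t n) {
    assert(values != nullptr || n == 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += values[i];
    return total;
}

// In a real build these two entry points would live in separate translation
// units: one compiled with /arch:AVX (or -mavx) and one without.
float sum_sse2_path(const float* v, std::size_t n) { return sum<ArchSSE2>(v, n); }
float sum_avx_path(const float* v, std::size_t n) { return sum<ArchAVX>(v, n); }
```

The cost is that the tag has to be passed down through every template and helper on the path, which is presumably why the comment asks for language or toolchain support instead.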

There are other issues.

Dynamic dispatch is not a hard thing. And in fact the article treats dynamic dispatch as being too simple to be worthy of discussion. The article is about the subtle problems that can happen after you have dynamically dispatched.

How do you think that compilers and linkers should handle the problems of non-inlined inline functions in the face of /arch:AVX? Saying that it should not even be a hard thing doesn't help much.

The way the author is using it to generate both AVX and non-AVX code is such a hack that I find it hard to believe it even worked. It is literally a ticking time bomb that depends on the compiler's inliner heuristics to trigger. Compilers and linkers should be able to generate and optimize different code paths -- it shouldn't take hacking and tinkering with tools to do something. To make matters worse, that something isn't guaranteed to be safe even with a hacky build.

Literally the only two things it takes to do this are a conditional check based on the CPUID result and CodeGen support for generating multiple Arch paths. If you have those two, function multi-versioning is very likely to work safely.

Edit: oh, you are the author. I hope you weren't offended. I meant none.

Like others have said it's not a "hard thing" per se. It's that few implementations actually work correctly in most of the edge cases. Manual dispatch is, of course, doable. Agner Fog's optimization guide has a pretty decent walkthrough of writing your own dispatcher and even he has to go through a bunch of caveats.

Yeah, VC needs function multiversioning like everyone else. Sadly, it doesn't help if you want to do something like have templated C++ functions.

This was a solved problem going back at least 10 years. At $oldjob, we would simply compile an entire section of the project as a standalone DLL once with AVX (it was SSEx something back then) and once without, then dynamically load the entire module - not call a single function.

Is linker ordering acrually undefined as the author suggests? I thought that linkers had a well-defined process for deciding what symbols to take from which object.

I'm the author but I'll chime in and say that yes, linker ordering is undefined (or at least only partially defined) in general, and definitely in this case.

The C++ language says that the linker is free to choose any instantiation of an inline function because they are required to be identical. Failure to make them identical is an ODR violation, and the ODR implicitly says that the linker can grab whichever one it wants.

s/acrually/actually/ (phone keyboards...)

I'm surprised this doesn't cause a multiple definitions error. Is the msvc linker really that broken?

The linker could check all .obj files for definitions of the function and see if they match. However this would slow down linking, especially if the linker peeks into .obj files that it otherwise wouldn't examine.

It's also not clear what sort of differences to report. Let's say that you have two translation units, one compiled /O1 and the other compiled /O2. They would probably generate different versions of floor, but this is not an ODR violation. So, how is the linker supposed to detect when a difference in generated code is fatal and when it is benign?

The C++ language does not require reporting of ODR violations because such reporting is expensive and problematic. What sort of ODR reporting do the gcc and clang toolchains do?

Unix linkers give an error if any symbol is defined in multiple object files, different or not. After all the object files have been combined, remaining undefined symbols are resolved by searching the specified libraries in command line order. Each object from a library that provides a needed definition is selected and its undefined symbols are added to the global list. Some linkers will scan the libraries a second time in order to resolve interdependencies. If at the end of this, the combined set of object files (from command line and from libraries) contain more than one definition for any symbol, an error is raised. The MSVC linker is apparently much more lax in this regard.

With the inline semantics defined by C99 there are two solutions to this problem:

1. Use static inline. Any non-inlined calls will result in an instance of the function with internal linkage. Aside from wasting space with multiple copies of the same function it is harmless. As mentioned, an optimising linker might even merge identical functions across compilation units.

2. Use inline without explicit extern. Non-inlined calls result in an undefined external reference to be resolved by the linker with a definition provided elsewhere. The inline definition is only used when the function is actually inlined and never provides a non-inline definition of the function, internal or external.

Neither of these approaches results in duplicate definitions with external linkage, so the described problem cannot arise.
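To make option 1 concrete, here is a small sketch. It is written in C++ to match the rest of the thread, where `static` gives a function internal linkage just as it does in C99; `floor_to_step` and `snap_to_grid` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// 'static' gives the function internal linkage. If the compiler decides not
// to inline a call, the out-of-line copy is private to this translation unit,
// so the linker never has to choose between this copy and one that another
// translation unit compiled with different flags.
static inline float floor_to_step(float value, float step) {
    assert(step > 0.0f);
    return std::floor(value / step) * step;
}

// An ordinary externally visible function that uses the helper.
float snap_to_grid(float value) {
    return floor_to_step(value, 0.25f);
}
```

The price is the one mentioned above: every translation unit that fails to inline the helper carries its own copy.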

It's disappointing that Microsoft seem to have chosen the one approach to inline functions that results in this sort of breakage.

> The MSVC linker is apparently much more lax in this regard.

No, you are incorrect. The function in question was tagged as inline. Having multiple instances of it is legal. If a Unix linker gave an error on multiple copies of an inline function then that would be a serious bug that would prevent them from linking any non-trivial C++ program.

For inline functions it is only an ODR violation if the two instances are different in an ODR-relevant way. The language spec doesn't discuss compiler switches, but it is logically clear that /arch:AVX versus not is an ODR violation, whereas /O1 versus /O2 is not. Therefore, even having two different instances of a non-inlined inline function is not necessarily illegal, and must be accepted by a conforming linker.

> Use inline without explicit extern.

I'm not sure why you think that this is a solution. An explicit 'extern' is not needed when defining an inline function. If it is not inlined then the compiler generates a copy that can be referenced. I have worked with gcc, clang, and VC++ and I have never had to explicitly mark an inline function as extern.

You are correct that static inline solves the problem, although I think that this is an ugly solution. And, it is a standard library solution, rather than a toolchain solution, so it doesn't automatically help developers who write their own inline functions.

The best suggestion I heard (on twitter) was name mangling (for non-inlined inline functions) that added an architecture suffix, thus making accidentally calling the wrong architecture function impossible. Much cleaner and more efficient than static inline.
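The suggestion is for the toolchain to do this automatically. Library authors can approximate it by hand today with an inline namespace whose name depends on the target architecture, which puts the architecture into the mangled name. This is a sketch of that approximation, not the proposal itself; `mathlib`, `MATHLIB_ARCH_NS` and `lerp` are hypothetical names.

```cpp
#include <cassert>

// Pick a namespace name from the instruction set this translation unit is
// being compiled for. __AVX__ is predefined by gcc, clang and MSVC when AVX
// code generation is enabled.
#if defined(__AVX__)
#define MATHLIB_ARCH_NS arch_avx
#else
#define MATHLIB_ARCH_NS arch_baseline
#endif

namespace mathlib {
inline namespace MATHLIB_ARCH_NS {

// Callers still write mathlib::lerp, but the symbol the linker sees contains
// either arch_avx or arch_baseline, so a non-inlined AVX copy can never be
// picked to satisfy a call from baseline code, or the other way around.
inline float lerp(float a, float b, float t) {
    assert(t >= 0.0f && t <= 1.0f);
    return a + (b - a) * t;
}

}  // inline namespace MATHLIB_ARCH_NS
}  // namespace mathlib
```

Unlike static inline, identical copies built for the same architecture still merge into one definition at link time.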

What I said about inline is true of C99. C++ might be different though.

It is also true that Unix linkers give an error if the final set of object files to link contain more than one external definition of the same symbol. The linker can't tell if functions were marked inline in the source code or not. A definition is a definition. The Microsoft object file format might record this information. If it does, it would appear to be a misguided decision that opens the door for the kind of breakage you're complaining about.
