Why does a „3rd party encryption software” appear in the call stack of a user mode process in the first place? Is this another case where a “security” software injects broken DLL files into all processes in your system?
I'd love to see Chrome and Firefox and IE working together to detect this kind of thing and put up a malware warning explaining how to uninstall it, and see how quickly it can be eliminated.
Seriously, there is zero valid reason to ever inject code into another program, other than as a debugging tool on a system being debugged.
The modding communities for Bethesda games rely heavily on code injection, especially the script extender plugins that are central to the entire enterprise.
Yeah, but there's ZERO reasons why you should be allowed to modify software to suit you. ZERO. Security über alles, DRM control over all, how dare you modify the software God Developer has given you, it will make you Insecure.
On Windows, you can call "EnumProcessModules" to get all the DLLs loaded in the current process – and check those are what you expect.
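A minimal sketch of that check (C++; links against Psapi.lib, and the actual comparison against an allowlist is left out):

    #include <windows.h>
    #include <psapi.h>
    #include <cstdio>

    // List every module loaded in the current process. A real check would
    // compare each path against the set of DLLs you expect to be present.
    int main() {
        HMODULE modules[1024];
        DWORD bytesNeeded = 0;
        HANDLE process = GetCurrentProcess();
        if (!EnumProcessModules(process, modules, sizeof(modules), &bytesNeeded))
            return 1;
        DWORD count = bytesNeeded / sizeof(HMODULE);
        for (DWORD i = 0; i < count; ++i) {
            wchar_t path[MAX_PATH];
            // Resolve each module handle to its on-disk path.
            if (GetModuleFileNameExW(process, modules[i], path, MAX_PATH))
                wprintf(L"%ls\n", path);
        }
        return 0;
    }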
It is potentially a bit fragile though - DLLs can load other DLLs, and it has happened before that in a new Windows version, Microsoft suddenly adds new implementation DLLs which get pulled in by the main system DLLs.
One approach might be to look at code-signing on all the DLLs, and ignore any unknown DLLs signed by Microsoft.
Would probably make sense to open a scary red warning at startup, telling users that their browser contains untrusted third party code, and stability/security cannot be guaranteed. Probably need to make sure there is no easy way to disable it, or else some IT will just disable it as part of installing DLL-injecting "security" software.
There are ways of doing hidden code injection, without the DLL coming up in the loaded DLLs list – https://reverseengineering.stackexchange.com/questions/2262/... – one option is for the loaded DLL to duplicate itself in memory, jump to the duplicate, then unload the original. Or, you can use VirtualAlloc2/WriteProcessMemory to load code into another process, and CreateRemoteThread to launch it.
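The CreateRemoteThread flavour is simple enough to sketch (hypothetical target-process handle and payload, error handling omitted; VirtualAllocEx here is the older sibling of VirtualAlloc2):

    #include <windows.h>

    // Classic remote injection: nothing goes through the loader, so no DLL
    // ever appears in the target's module list.
    void inject(HANDLE proc, const unsigned char* payload, SIZE_T size) {
        // Allocate executable memory inside the target process.
        void* remote = VirtualAllocEx(proc, nullptr, size,
                                      MEM_COMMIT | MEM_RESERVE,
                                      PAGE_EXECUTE_READWRITE);
        // Copy the payload into it.
        WriteProcessMemory(proc, remote, payload, size, nullptr);
        // Start a thread at the payload's entry point.
        CreateRemoteThread(proc, nullptr, 0,
                           reinterpret_cast<LPTHREAD_START_ROUTINE>(remote),
                           nullptr, 0, nullptr);
    }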
I thought these "security" software vendors wouldn't be doing anything so fancy: but from the Chrome bug [0] it looks like some are:
> And I was able to confirm (in some of the dumps, we don't collect the right heap information in all dumps) that Trend Micro code (one region is a DLL that seems to be called ApiHookStub.x64.dll, another is not a direct DLL copy) which has been allocated on our process heap without going through the loader, presumably via something like ::VirtualProtectEx and ::WriteProcessMemory. This is a pattern I see used broadly in Edge crashes we root cause to third-party software.
However, that still can be detected – use VirtualQueryEx to iterate through process address space and find all executable memory regions – any not owned by a loaded DLL (or generated by JavaScript JIT/etc) are evidence of code injection, even if you don't know who the injector is.
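Roughly like this (a sketch for the current process using VirtualQuery; for another process you'd use VirtualQueryEx with a process handle, and JIT regions would need to be allowlisted separately):

    #include <windows.h>
    #include <cstdio>

    // Walk the address space and flag executable memory that the loader did
    // not map from a PE image.
    int main() {
        MEMORY_BASIC_INFORMATION mbi;
        unsigned char* addr = nullptr;
        while (VirtualQuery(addr, &mbi, sizeof(mbi)) == sizeof(mbi)) {
            const DWORD exec = PAGE_EXECUTE | PAGE_EXECUTE_READ |
                               PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY;
            // MEM_IMAGE regions were mapped by the loader; executable memory
            // of any other type was allocated at runtime by somebody.
            if (mbi.State == MEM_COMMIT && (mbi.Protect & exec) &&
                mbi.Type != MEM_IMAGE) {
                printf("suspicious executable region at %p (%zu bytes)\n",
                       mbi.BaseAddress, mbi.RegionSize);
            }
            addr = static_cast<unsigned char*>(mbi.BaseAddress) + mbi.RegionSize;
        }
        return 0;
    }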
IIRC we do block DLLs that aren't signed by either Google or Microsoft in some of our processes. In other processes we can't because third-party DLLs are needed for shell extensions (utility processes) or accessibility (browser process).
And, as you say, code injection is possible without loading a DLL, and does seem to happen.
In this case I don't understand how the state leaked from the file-system filter driver to our process, as it seems to have done. It's a mystery.
I have plans to do the address space iteration you speak of in our crash reporter.
> In this case I don't understand how the state leaked from the file-system filter driver to our process, as it seems to have done.
I assume you made some Windows API call somewhere, which ended up in the filesystem filter driver, which then clobbered the register. And I'm guessing the NT kernel code and Windows DLLs never save/restore the register, because it isn't supposed to be clobbered.
Couldn't one approach be to wrap all Windows API calls with some extra code which saves and restores all the callee-save registers, so even if a buggy kernel driver clobbers them, you don't get hurt by that? I don't know, maybe that's too expensive.
Instead of restoring, one could check for the clobbering, and crash the process immediately. Or maybe Microsoft should add such a wrapping to all calls to third party kernel drivers, and blue screen?
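For one register at one call site, a rough sketch of that check (GCC/Clang-style inline asm targeting Windows x64, with a hypothetical suspect_api standing in for the call under suspicion; the capture and the call share a single asm block so the compiler itself can't legally be using XMM7 in between):

    #include <cstdint>
    #include <cstdlib>
    #include <cstring>

    extern "C" void suspect_api(void);  // hypothetical call under suspicion

    void checked_call() {
        alignas(16) uint8_t before[16], after[16];
        asm volatile(
            "movaps %%xmm7, %0\n\t"   // capture XMM7 before the call
            "subq   $32, %%rsp\n\t"   // shadow space the Windows x64 ABI requires
            "call   suspect_api\n\t"
            "addq   $32, %%rsp\n\t"
            "movaps %%xmm7, %1"       // capture XMM7 after the call
            : "=m"(before), "=m"(after)
            :
            : "rax", "rcx", "rdx", "r8", "r9", "r10", "r11",
              "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5",
              "memory", "cc");        // everything the callee may legally trash
        if (memcmp(before, after, sizeof(before)) != 0)
            abort();                  // a callee-saved register was clobbered
    }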
Those tactics could work. They would be a bit expensive however, and it would be a shame to have all users paying the performance penalty because of a few bad pieces of software. And, this would not have caught the two errors in the assembly language code within Chrome - that would require testing at every function call, not just Windows API calls.
It would be more practical (I think) to do this checking on a special build of Chrome that is shipped to a small percentage of users, so that not everybody pays the price.
But, this is an ecosystem problem and I'm not sure Chrome wants to shoulder the entire burden of finding bad software :-)
It would be interesting to see measurements of how big the expense is.
Also, I don't think one necessarily has to do it for every Windows API call – some Windows API calls are more likely to invoke third-party code than others; some API calls are far more performance-critical than others. Maybe one could find a subset of calls to focus on which maximise the likelihood of invoking third-party code but also minimise the performance impact.
> And, this would not have caught the two errors in the assembly language code within Chrome - that would require testing at every function call, not just Windows API calls
For code you control, I think some kind of static analysis would be a better approach – parse inline assembly code and check that every register it touches is marked as clobbered to the compiler. I saw some other comments you were replying to already on that topic. I think this kind of "dynamic" approach should be reserved for third-party code with low trustworthiness.
> It would be more practical (I think) to do this checking on a special build of Chrome that is shipped to a small percentage of users, so that not everybody pays the price.
I was thinking, you could also do it using API hooking. Have some hidden setting to control it, by default off. If it is off, no impact, same as now. If the flag is on, hook (some subset of) Windows APIs with the "unexpected-register-clobber-detector". That way you don't have to produce two completely different builds.
And maybe even, automatically turn that flag on if an install starts to experience crashes–especially if the presence of certain kinds of third-party software is detected.
> But, this is an ecosystem problem and I'm not sure Chrome wants to shoulder the entire burden of finding bad software :-)
Agree. Ideally, Microsoft would take the lead there, since it is their platform. But a world in which the Chrome team does it would be better than a world in which nobody does.
> Seriously, there is zero valid reason to ever inject code into another program
Sometimes, I use software which does not work the way I would like it to work. When this software is closed source, and the problem sufficiently annoying, I inject code to make it do what I want.
(And no, once I do this I don't open bug reports, unless I can reproduce the problem without code injection.)
Because that's the great thing about owning a computer, and knowing how to really use it. It's a tool for you to command.
Now, will most people ever do this? No. I wish everyone could, but unfortunately, injecting code in a useful way requires a reasonable amount of coding knowledge.
But, this is Hacker News. I would hope most users here can come up with lots of valid reasons to inject code into other programs.
There's a huge difference between "I, the user of this system, am intentionally injecting code into a process to debug/extend it in a way I want, and if it breaks I'll have a pretty good idea that it might be my injected code's fault and know where to start looking" and "software that I may not even have intentionally installed on this system or understand the function of has broken other software on this system".
Injecting keyboard events does not require running inside the current process. Either emulate a keyboard device or inject events via the APIs for that.
DLL injection is also used for in-game overlays by Steam/Discord/etc, and I'm not aware of a better method they could use instead given that they're expected to work for games that were never made with those services in mind.
They use a Wayland compositor (gamescope) under Linux (even under X11) in their new SteamOS UI, and it already works a lot better. Plus they can do other awesome stuff with it, like taking complete control of where and how the game outputs its buffer. So it can fix games with bad multi-monitor behaviour, force games to render in their native aspect ratio / resolution, etc... it's just great, and makes the gaming experience feel more like a console where things (usually) just work.
Gamescope + MangoHud + the Steam overlay working together are a very big part of SteamOS and the Steam Deck's success, and that stack does zero runtime modification or hooking of the child process. I can imagine all the trouble you eventually run into with hooking, and all the hacks that must have piled up.
They use gamescope under SteamOS, not Linux in general where you already have another compositor. Running all games under a nested compositor (gamescope) would have some perf impact (even if a small one) as well as an unexpected cursor acceleration profile.
True. I personally have set up my distro to optionally launch directly into gamescope for a SteamOS-like experience. On the desktop client they still either hook or use a Vulkan layer (the latter often in Proton, thanks to DXVK). However, I would not be surprised if they started using a nested compositor even on the desktop.
GL/Vulkan extensions for drawing overlays over the current frame. Compositor/window-system extensions for drawing overlays, the way Android has. You don't need to be running inside the current process to draw on the screen.
I used to take point for Mozilla's efforts dealing with third-party interference in our binaries. Browsers are ripe targets for this kind of shit. The stories we could tell...
It seems that the Chrome developers should be able to perform the same binary analysis on the suspect McAfee software. I guess it's a bit harder without source code to reference side-by-side though.
That would be possible, but we'd have to install the software, then guess which binary was the culprit, and then have some way of finding the function boundaries. My crude analysis technique relied on having symbols for chrome.dll to indicate where functions started, so I'd have had to switch to some other tool that could find those.
If one could reliably detect this DLL injection in the process address space, then the correct "fix" is to crash immediately. Authors of such tools should seek another way of accomplishing their goal, preferably one that does not export their own bugs to innocent bystanders.
Naïve question about ABIs: shouldn't the caller be responsible for that? If I want a function to restore certain registers, wouldn't it be simpler if I were the one that saved them in my memory, called the function, and then overwrote the registers with whatever values the function set? Otherwise it seems we're just… asking for trouble, so to speak.
It would just be slow for the caller to have to push and pop (say 30) registers in general that the specific callee (and transitive callees) may not even use.
Most ABIs specify some registers as caller-saved, some as callee-saved (retained unchanged from the perspective of the caller) and some as scratch (not saved).
In the end it depends on the architecture and on the typical workload which split is fastest -- measurements can be made to find out which combination is fastest on average.
You usually wouldn't need to push&pop all registers, just ones that you want to preserve across the call. Regardless, yeah, non-volatile aka callee-saved registers are extremely important for good performance of code that calls functions (esp. loops - without callee-saved registers, you'd have to store the loop counter & length on the stack!)
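A concrete sketch of that loop case (x86-64 in mind; hypothetical g()):

    // The compiler can keep 'i' and 'n' in callee-saved registers (say RBX
    // and R14) across every call to g(), because the ABI obliges g() to
    // preserve them. If every register were caller-saved, both values would
    // have to be spilled to the stack and reloaded around each call.
    void g(int);

    void loop(int n) {
        for (int i = 0; i < n; ++i)
            g(i);
    }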
...which is exactly what the callee within the loop would have to do if it used those registers. But I guess your point is that in the callee-save case it only happens when it needs to happen, while in the caller-save case it happens every time whether it needs to or not.
yep, hence why calling conventions usually have both callee-saved and caller-saved registers, so that you only have "unnecessary" stack usage when you need to use more than roughly half of the registers.
Programming defensively in this way is possible but has a huge cost (extra saves and restores). The caller is supposed to be able to trust that certain registers persist across the call. If it can't, it has to save everything to memory before the call and restore it. I suppose a compiler could add a special annotation for "this is an assembler routine and I don't trust that they know the rules", which would generate extra saves and restores, but presumably the routine was coded in assembly for extra speed, so 6 to 8 extra saves and restores would cancel that out.
If someone is trying to use the same assembler routine on Windows as on Linux, it's likely to be wrong for one of them.
> If I want a function to restore certain registers, wouldn’t it be simpler if I was the one that save them on my memory, call the function, and then override the registers with whatever values the function set?
Such cases are rare bugs in a small amount of code — mostly just compilers and hand-written assembly.
Preemptively saving and restoring all registers in all callers would appreciably slow down every single function call on every device in the world using that ABI.
ABIs are a convention that everybody follows regarding how such preservation should or should not occur, for which registers.
It's common to say "caller cleans up registers x, y, z, callee cleans up a, b, c".
So yes, the caller could do it or not do it, the choices are all possible to do both ways, but that wouldn't be the agreed upon ABI, it'd be something different.
Sounds like there should be an option when you write assembly code to tell the compiler "please save/restore any register that I'm modifying in this asm code according to the target you are compiling for"
I'm not sure whether that's satire. In case it's not: most languages that allow inline assembly (like C) have an optional "clobber list" argument that tells the compiler's dataflow analysis that your assembly snippet overwrites certain registers [1]. Inline assembly doesn't have target-specific clobber lists because it's assumed that the code only works on one target, and the programmer has to take care of making it work.
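For example (GCC/Clang extended asm on x86-64; a toy sketch, not code from the article):

    #include <stdio.h>

    // The "xmm7" entry in the clobber list tells the compiler this snippet
    // overwrites XMM7, so it won't keep a live value there across the statement.
    static double zero_via_xmm7(void) {
        double result;
        asm("xorps %%xmm7, %%xmm7\n\t"   /* zero XMM7 */
            "movsd %%xmm7, %0"           /* copy it to the output register */
            : "=x"(result)               /* output: any SSE register */
            :                            /* no inputs */
            : "xmm7");                   /* clobbers: we trash XMM7 */
        return result;
    }

    int main(void) { printf("%f\n", zero_via_xmm7()); }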
When you use inline assembly, in fact the compiler does preserve the semantics of the surrounding program considering the target's ABI -- if you tell it correctly what impacts the inline asm has. Lots of rules must be followed in order to get this behavior just right. One of the most subtle ones is "early clobbers" [1].
In many cases, you can get all the benefits of inline assembly from compiler intrinsics while letting the compiler handle all the details of register allocation and scheduling.
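For example (SSE2 intrinsics from <emmintrin.h>; a sketch, not code from the codecs under discussion):

    #include <emmintrin.h>

    // Multiply-accumulate four doubles, two lanes at a time. There is no
    // hand-written save/restore of XMM registers anywhere: the compiler owns
    // register allocation and the ABI bookkeeping.
    double dot4(const double* a, const double* b) {
        __m128d acc = _mm_setzero_pd();               // acc = {0.0, 0.0}
        for (int i = 0; i < 4; i += 2) {
            __m128d va = _mm_loadu_pd(a + i);         // unaligned loads
            __m128d vb = _mm_loadu_pd(b + i);
            acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
        }
        double lanes[2];
        _mm_storeu_pd(lanes, acc);
        return lanes[0] + lanes[1];
    }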
Note that in OP's case IIUC this was assembly code and not inline assembly. If you write functions in assembly you are solely responsible for calling conventions and ABI conformance.
That is essentially how inline assembly in GCC works. You have to declare which registers you are changing, in which registers you expect input, and which contain results from your inline assembly block.
This for obvious reasons does not work if you have separate compilation unit written in assembly, then you have to follow the ABI.
In fact the fix to the WebRTC bug was to adjust the clobber list, thus telling the compiler to save the registers.
Why the compiler didn't notice that registers were being used that weren't on the clobber list is unclear to me. Your suggestion seems totally reasonable.
C does have official intrinsics for SIMD on Intel platforms, but the people who are good at writing video codecs don't like to use them because Wintel culture has such bad taste at naming functions (thanks to Hungarian notation) that using them is near-unreadable and it's easier to write everything in asm.
You're wrong. Intrinsics for SIMD are not named by Windows people. They were named by Intel engineers and supported now by all other compilers. Naming is actually in the C/C++ style - __simd_do_something_here() and such
The Intel platform intrinsics have names like `_mm512_4dpwssd_epi32()`. The standardized SIMD intrinsics with `simd` in the name are much newer than any of the code I'm talking about in ffmpeg/x264/dav1d. These are okay, but not being platform-specific of course means you don't get platform-specific features, which you might want when you're doing this level of optimization.
The other problem is compilers (esp. gcc) were traditionally very bad at code generation for them, although these days they're okay at it.
And, you'd need a compiler that knows custom SIMD optimizations for every algorithm anyone might ever need, and can recognize when you have coded one of them, so it can substitute its SIMD version.
You could totally design a better convention in your own high-level language which compiled down to assembly, as long as you stayed inside it. But once you need to interface with external code you need to use a shared convention.
Your scheme sounds like it would make your code well-behaved as the callee (with perhaps some performance penalty?). But as a caller, you couldn't trust the external code not to clobber registers.
That kind of feature is very... un-assembler. You certainly could, but the assembler doesn't keep track of any of the relevant information, so it would be a larger feature than you expect. It doesn't know the calling convention. It doesn't have a map of which instructions dirty which registers. It doesn't do reachability analysis, so it doesn't even necessarily know what's "in the function".
Yes, but it's not a property of the assembly (or assembler), it's a necessity for the compiler to correctly codegen around the inline assembly.
Historically, assemblers have been really dumb, so the ABI is not a thing they'd track, especially as... I don't think they know what functions are? So while they can notice call/ret, they have no knowledge of a label being a jump or call target per se, do they?
So you'd need an assembly-like language to encode this sort of information.
I think most assemblers know what functions are. There are usually directives to indicate this. They help with emitting symbols (you need the function name to be emitted or no-one can call it), stack unwind information, and other information (some of which may also be required by the ABI). Here's some relevant documentation for MASM
https://learn.microsoft.com/en-us/cpp/assembler/masm/proc?vi...
The short version of it is that assemblers these days are rarely - if ever - just a zero-context stream of machine instructions. There is far more, some of it actually required.
I mean, if you have a macro assembler you should be able to write a macro that generates "save registers" instructions before a section of code and "restore registers" instructions after it. The assembler doesn't need to know or care that this section of code is a function.
It's interesting how blurry the line is between a good, full-featured assembler and a crappy compiler!
> I mean if you have a macro assembler you should be able to write a macro that generates "save registers" instructions before a section of code and "restore registers" instructions after it.
Except you want a "macro" which:
- saves only the registers you touched
- which are callee-saved
- according to the ABI you're targeting
And you really only want that for functions, because... that's where ABIs come into play.
It's a long time since I tangled with x86 assembler; but as I recall, ENTER and LEAVE were specifically for functions, and I'm not aware of any other use for them.
ENTER and LEAVE were specific x86 instructions though, not some kind of assembler special sauce. The assembler just translated your ENTER instruction into the corresponding machine opcode; no ABI knowledge required whatsoever.
Couldn't the compiler just emit an "xorps xmm7, xmm7" before using XMM7 to zero anything? That way the compiler doesn't have to assume that XMM7 is zero, it can know.
Yes, it's an extra instruction, but XORing a register with itself is such a common idiom for zeroing that register that CPU designers try to make it fast.
Edit: Just noticed that Veliladon essentially made the same point elsewhere in the thread and explained the reason why it's not done this way.
It's not just that. If you can't trust the ABI then everything else is wrong too. Everything. Not just zeroing registers. It only just so happens that in this case XMM7 is used for zeroing, but it could be used to save a variable across a function call. Then there is no trick to get it set back to the right value.
The ABI is not optional, or best effort, or best practice, or any other BS that passes in the ordinary world. It is just as required as the correct operation of instructions (e.g., add should actually add things, mul should actually multiply them, and so on).
My reading of it was that the bug was caused by XMM7 not being restored to its previous value.
E.g. on Linux, functions should restore the values of ebx, esi, edi, ... once they're done with them. The article says (on Windows) that XMM7 needs to be restored too.
If you can't trust one register being preserved (per the ABI), then you really can't trust the values of any registers.
Fair point, which is why I used the phrase "this particular example." In fact the compiler's lazy assumption about XMM7 was the key to determining that the ABI was being violated. It would have taken longer to figure this out if the compiler had been doing things my way, because then a failure of the callee to preserve XMM7 wouldn't have mattered.
You could avoid the crash in this specific case. But this crash was triggered by an assertion failure. The same code could be corrupting XMM7 in other situations, which could result in much more subtle bugs.
I think the real fix is to set all registers to a canary value at the start of int main(), and then when exiting check the canary value is still there.
If anything has messed with any registers without permission, you crash and collect as much data as possible about any injected DLLs.
Then you correlate these to find the culprits, and for each you contact the authors of the DLL and figure out a way to block the injection of any unfixed versions that cause crashes.
I'm sure some C++ lawyer can correct me, but isn't branching off of `m_ptr` (e.g. `CHECK()`ing the value) after `std::move(m_ptr)` technically Unspecified Behavior since `std::move(m_ptr)` leaves `m_ptr` as an Unspecified Value? It would be up to the compiler to define the behavior if they so pleased, but the C++ spec would not require such behavior to be defined at all.
There is a big difference between "undefined" and "unspecified" behavior. In this case, the behavior of `unique_ptr(unique_ptr&&)` is in fact specified. [0]
However, the bigger issue with that code is that it can easily stop working with a simple refactor. Consider:
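    #include <cassert>
    #include <memory>

    // Hypothetical bar(), reconstructing the comparison: two signatures that
    // a "simple refactor" might swap between.
    void bar(std::unique_ptr<int> p) {}       // (a) by value: really moves, nulls the source
    // void bar(std::unique_ptr<int>&& p) {}  // (b) by rvalue ref: need not move at all

    int main() {
        auto p2 = std::make_unique<int>(2);
        bar(std::move(p2));        // the call site is identical under (a) and (b)
        assert(!p2);               // holds under (a)...
        // assert(p2);             // ...while this one would hold under (b)
    }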
Neither of the above asserts will fire, but from the calling site, they look exactly the same. In my opinion, the more explicit option would be to do something like `bar(std::exchange(p2, nullptr))`
Right, but changes in some other code potentially silently breaking assumptions at the calling site is exactly why you would want an assert like this in the first place.
Moves in C++ don't make the whole object invalid; they just leave it in a valid but unspecified state, so that at least the destructor can still run.
This means calling operator bool on your unique_ptr ought to be fine, because the unique_ptr still has a valid state (you don't know what that state is, it's unspecified, but it's guaranteed to not be radioactive on mere contact. It has to be a valid unspecified state.)
Moves in C++ are a function call that can do whatever any other function call can. The type itself can have additional guarantees and the standard library types generally do. In the case of std::unique_ptr the guarantee is explicit that the moved-from unique_ptr is nulled.
I was wondering if anybody was going to point that out. It did occur to me that the CHECK was not technically valid due to that exact concern, but given that we control the compiler and the C++ library implementation and given that it's just debugging code (albeit debugging code that we ship to users) I'm fine with it.
In other words, I guess you shouldn't oughta do that generally, but I was fine with it being used there, and it did its job.
Note that the linked discussion is about the standard library. For your own types you can make whatever guarantees you want - ultimately as far as the language is concerned, moves are a function call like any other.
It's an unspecified, but required-to-be-valid, value for the moved-from type. The author mentions it's a smart pointer type, which could easily be defined to act like this.
It appears that you need to be really smart in order to not blow your own foot off with this "smart pointer." I'll stick to the regular dumb kind, thanks.
To what degree is this possible to check statically?
It feels like at least simple breaks of the ABI rules like this can be detected somewhat statically. The author already started with a very simple and incomplete version.
In general, I wonder: are there any (many?) static analyzers for assembled binaries?
Compilers do inter-procedural register allocation and use custom calling conventions for local calls (where “local” can be quite large with LTO), while preserving ABI externally. This means that clobbering a callee-saved register without saving/restoring it in the same function is not necessarily a bug.
Curiously, I found a register clobber bug in the NaCl cryptography library today. Apparently, they used a custom assembler-preprocessor (qhasm) that avoids certain classes of bugs and aids with porting, but while the tool seems to actually model the register in some way, it does not treat it as callee-saved.
You could write an arbitrarily complex analyzer to try to find violations but I suspect that the halting problem means that you can never be sure you've found all errors.
I think that either crude heuristics (found two bugs!) or UBSan style instrumentation (finds all bugs in code executed under test) is the best set of solutions.
McAfee/Trellix says that they have fixed the bug in the latest (7.4.0) version of their disk encryption product. So apparently my guesses about the root cause were correct.
It's interesting that there is a zero stored in a register and used for hours - is that significantly faster than just using an actual zero each time? Perhaps CPUs need an "always zero" register or some similar mnemonic to help harden against this.
Intel has never really needed to have a zero register because xor register, register as a zeroing idiom is so fast and so recognized that Intel have optimized the hell out of it. In Sandy Bridge and onward it doesn't even go through an execution port, even for the vector registers.
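For illustration, what compilers typically emit for returning zero (x86-64; the asm shown in comments is the usual output, not guaranteed):

    // The xor forms are recognized as zeroing idioms at register rename and
    // break any false dependency on the old register value, so on recent
    // Intel cores they cost no execution port at all.
    double fzero() { return 0.0; }  // typically: xorps xmm0, xmm0 ; ret
    int    izero() { return 0;   }  // typically: xor   eax, eax   ; ret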
The problem is really whether to indulge bad programmers who don't respect the ABI at the cost of a minimal sliver of performance (even though it's not taking up an execution port the extra instruction still takes up cache space, bandwidth, and decode). Yeah they should probably zero the register before they zero the pointer but they shouldn't have to if other people respected the ABI.
I think it's not just the xor trick, but that Intel has lots of addressing modes, including ones with immediate operands that you can use in many situations.
In RISC-like machines, most of the operations are register-register, and you have load/store instructions for referencing memory.
To use an immediate operand (literal constant in the code itself), you may have to load it into a register, like
move r7, #42
add r1, r1, r7 ;; ok, now we have 42 in r7, we can increment r1 by 42.
Whereas in a CISC you would have
add r1, #42 ;; two operand form
or maybe
add r1, r1, #42 ;; three operand form
When you need a zero, you just use the immediate operand zero, and thus you don't need to pick some register to clear.
In summary, zero registers in RISC-like instruction set architectures effectively provide a literal zero that can be used wherever a register is required, which helps because only register operands can be used in many instructions.
That's a good point. But all of the x86 SIMD stuff is register/register and we don't have xmm0/ymm0/zmm0 being 0 like we'd expect on a load store style RISC architecture.
Yeah, something like "mov ax, 0h" - though that encodes the zero as an immediate in the instruction itself, so it's a longer instruction that costs more fetch and decode bandwidth than the xor trick.
It strikes me that the compiler is making assumptions that aren't being enforced by the ... OS? Language? I'm not sure what, but it's assuming that functions restore the registers they use, and that isn't enforced by anything. From my (long-ago) assembly days there were PUSHA and POPA, but I assume those take quite a bit of "oomph" and are avoided if possible.
> It strikes me as somehow the compiler is making assumptions that aren't being enforced by the ... OS? Language?
One case of this problem was in a handwritten assembly file. The other was a compiler bug.
This is a case where the ABI requires that if you use a certain register you must save its previous value and restore it afterwards; the two independent bugs were cases of forgetting to look after a certain register.
An ABI is simply an agreement as to how things should work: what registers you are free to clobber, which you must look after when you use, how certain data must be laid out in memory, etc. ABIs are typically language specific, though there may be a lot of commonality at the very high level (i.e. how you use sections in an ELF file) and low (anybody using unboxed integers probably will do the same thing).
You are welcome to violate the ABI as you see fit in your own code. The OS doesn't care; it has its own constraints (how to make a system call, how to pass arguments to each -- though cf. above when I talked about ints). So, say, a Lisp compiler can lay out stack frames differently from a C++ compiler because of the languages' different semantics, but if your Lisp program wants to call a library written in C++ it must make sure memory at the call site follows the C++ ABI, because that's what the C++ compiler will have assumed.
Both bugs were programming errors in assembly language files. One was inline assembly that was missing entries from a clobber list, the other was an assembly function that lacked invocations of the macros that were supposed to be used to preserve/restore the registers. There was no compiler bug.
It seems to me something that could be found by some kind of valgrind-like tool - it'd be much slower than normal code, but it could report "ABI violation detected" or something.
Interesting conclusion given that I found two functions and a (presumed) third-party driver/what-not that were violating the ABI. One of these was causing crashes, and the other one was going to. The crashes went on for over a year and a half, so, ...
The compiler is making assumptions (which it is supposed to make) but nobody is enforcing the assumptions. The only player who could reasonable enforce the assumptions would be the compiler, in a special checking mode. I am not aware of a compiler that does this. Pity.
Because you'd either need a special "always zero" register (some chips have this), or a mnemonic for some or all of the instructions that assume zero as an operand (some chips have this), or you wipe a register (this is the problem here - it uses up a register), or you use memory.
Adding an implicit zero may make sense for some instructions but probably not all.
Look at the format of the 68000's MOVEQ instruction. The zero is part of the instruction, and does not take an extra four bytes to hold it. There's no memory that's used (other than the instruction itself), no extra memory to hold the argument, and no "always zero" register.
MOVEQ can move more than a zero. It can move any small number (-128 to 127), so 0 is not "special" here.
Also check out the CLR instruction (though that may be what you meant by "a mnemonic for some or all of the instructions that assume zero").
Where it's not optimized away, getting "an actual zero" requires a memory operation of some kind. Register ops are faster in that they are right there, no fetch needed.
Depends on the instruction architecture. 68000 had some ways of burying a small literal operand in the instruction. If I recall correctly, MOVEQ.L would let you move zero to a register without touching memory (other than the instruction fetch), and it wasn't a long instruction.
However, moving a zero to a register does take time. Time that would otherwise be used operating with the zero value already present in the zero register.
The second best is what moto did.
As you point out, there is the instruction fetch, which could be the intended operation, rather than developing the zero itself.
On par with that is having enough registers to just hold a zero, and whether that made sense depended on the need and developer strategy.
I am a big fan of the moto CPU's, starting with the 6809. Just to be clear.
But moving it from the zero register to another register would also take time. If what you want is a zero in a register other than the zero register (say, one that is going to serve as the index of a loop, which the zero register cannot do), then MOVEQ should not take any longer than a MOVE from the zero register to another register.
Say we are zeroing memory. No advantage there. Coupla cycles right at the start, then a ton of writes.
Say we are forming a bitmask. Could be an advantage there in that having a zero handy in a register means no fetching one. When a lot of dynamically created masks are needed, this can be a nice gain.
I'm sure we can come up with more. It's not always important, and like you mention with the moto designs, may not matter too much due to many other optimizations possible given a good instruction set.
Some people would rather have the register free for general use! I'm one of those, but if there is a zero register, I use it to get the benefit of it when I can. On the devices I've seen, there are generally a lot of registers so the marginal impact of having a zero register isn't significant. There are plenty to work with.
Maybe I should be clear here too. I personally don't care whether there is one. If it's there, I do things in ways that leverage it, and was just pointing out why devices that have one, ahem... have one! Those that don't may or may not have options that make sense. The way moto did it is very good, and there are other pretty great optimizations possible with their ISA, abusing the stack to write memory, etc...
If not, then I do other things. It's assembly language! Work the chip, right?
We ban accounts that post like this, because the community considers it spamming.
I'm not going to ban you because you've also posted other things to HN and seem like a legit user. But you've been posting these links much too often, so please stop doing that.
He even mentions that it's Windows-specific, both in the commit message and in the article... and then seemingly fails to make it Windows-only? That's "not very nice" either.
Those PUSH_XMM/POP_XMM macros appear to be Windows-only; I think they expand to nothing on other platforms because they contain their own guard for Windows internally. If that's the case, the call sites don't need to guard for it. I'm guessing that obeying this calling convention is the purpose of those macros.
Exactly. My understanding of the conventions and macros in those source files is that you declare what registers you will be trashing, and then the registers are saved/restored as required by that platform. On Linux it would be a NOP, and on Windows it saves and restores XMM6 and XMM7 (XMM0-XMM5 are volatile).
The webrtc fix was thematically similar in that the programmer declared what registers were trashed and then the compiler knows which registers need to be saved. I'm not sure why the compiler doesn't notice when registers are used without being declared as being trashed - I'm really not an expert at _writing_ assembly language.
It's literally undecidable in principle whether some assembler correctly restores some register R. That's a non-trivial semantic property, Rice's theorem applies. So the compiler's only practical option if it worked this way would be a conservative option - any time it's unclear whether register R is clobbered, treat it as clobbered.
As a trivial example of why a register might not be clobbered even though my code touched it and it seems like I didn't restore it...
Suppose I branch on whether R is divisible by 12: if it isn't, I leave R alone; if it is, I XOR R with a value which is difficult to explain but lies between 1 and 3 inclusive, sometimes more than once, and at the end of that branch I also clear the bottom two bits of R.
R is actually not clobbered by this function! If the bottom two bits weren't zero before, R isn't divisible by 12, so we didn't change R, and if they were zero, we restore that, the other bits are never changed.
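In C-ish terms (hypothetical hard_to_analyze(), standing in for those XOR values):

    #include <cstdint>

    uint64_t hard_to_analyze() { return 3; }  // hypothetical stand-in

    // r is provably preserved on every path, but a naive "did you write to R
    // without saving it?" check would flag this function.
    uint64_t touches_but_preserves(uint64_t r) {
        if (r % 12 == 0) {                 // divisible by 12 => low two bits are 0
            r ^= (hard_to_analyze() & 3);  // only ever flips bits 0 and 1
            r &= ~uint64_t{3};             // clearing them restores r exactly
        }
        return r;
    }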
Having the human programmer promise that they wrote a correct clobber list means that if their assembler code does somehow restore/preserve register R, the human can just say so, and needn't prove to the compiler somehow that this works. This sort of code is mostly in performance-critical components, e.g. video decoding, where we are already trading reliance on fallible humans for better performance, so adding one extra promise feels OK.
It would be easy to have an "auto" clobber list option. The inline assembler would note which registers were touched (mostly trivial, a few instructions have implicit destinations) and would add them all to the clobber list. In about 99% of cases this would be sufficient. I am not aware of any code that conditionally uses a register _and_ conditionally preserves it.
So, 1% of assembly code would use the manual clobber list, but the other 99% would be guaranteed (barring bugs in the compiler) to not have this bug. It seems like the right tradeoff.
Or, instead of an "auto" clobber list the compiler could have a warning if a register is used without being in the clobber list. The programmer could silence that warning in the rare cases where they need to optimize register preservation, and the bugs would be greatly reduced.
If your numbers are roughly correct then I agree it's worth trying to do this. My guess was that the assembly we actually use tends to be aggressively hand-optimised to solve very niche problems in the most optimal way, and therefore would be more rather than less likely to trip up analysis, but I haven't experimented.
My experience is that most non-trivial assembly language is doing processing of large chunks of data (high-precision math, encrypting blocks of data, FFTs, etc.) and therefore the startup cost of the function is not significant, so a tiny bit of inefficiency in that area is not something people care about.
Or, put another way, writing tiny little assembly language functions is probably not worth it because the mere fact that you are using assembly language instead of (say) C/C++ means that you have missed many opportunities (code reordering, inlining, etc.) so assembly language functions _should_ be doing enough work to justify their calling cost.
But, I'm not working on an assembler or even using one so I don't think I'll even file a feature request.
> It's literally undecidable in principle whether some assembler correctly restores some register R.
No, it's literally undecidable in principle whether every bit of assembler correctly restores some register R. For any given bit of inline assembler, it's quite likely to be trivial.
In any case, we can have a useful safety feature without requiring the compiler to decide. The compiler can easily work out all the registers which get written to (right?), just not which get restored. So in addition to the clobber list, we could have a list of registers which the programmer asserts that the code restores. A register which is written to has to be on either the clobber list or the restore list (or be an output). This certainly isn't foolproof, but it would catch accidental clobbers.
> No, it's literally undecidable in principle whether every bit of assembler correctly restores some register R. For any given bit of inline assembler, it's quite likely to be trivial.
Exactly. And if it's non-trivial to decide something that basic, you're doing it wrong. Saving and restoring all the registers is always an option. Only saving and restoring some of them is an optimization which must be shown to be sound.
Traditionally, an assembler neither knows nor cares about such information, it just turns lines of assembly into bytes and does some address fixup for you.