Why do Windows functions all begin with a pointless MOV EDI, EDI instruction? (msdn.com)
320 points by cabacon on Sept 21, 2011 | 43 comments

I have basically no background writing applications on Windows outside of .NET, but I love reading posts by Raymond Chen. I always enjoy learning about things that seem crazy from the outside but have a real purpose that you're just missing the information to understand. That's pretty much what looking at someone else's code is often like, so it's helpful to remember that even seemingly crazy things have a purpose.

I feel like I've learned a lot from reading his blog over the years. I even bought his book years ago because I felt like I was getting a lot of value from the blog.

It's really too bad Microsoft doesn't seem to value backwards compatibility as much as they did during the times Chen often writes about. It seems like an interesting challenge that they've pretty much given up on. I can't even count how many conversations I've been in where people complained on one hand that Microsoft focused on that backwards compatibility too much and on the other that their driver from 2001 doesn't work right in Windows 7. Often these statements happen moments apart.

Microsoft puts a LOT of effort into application compatibility. They're very serious about it, and the fact that they fail at times is proof of how hard it is to maintain app compatibility when application developers are doing their best to sabotage you by using internal APIs and corrupting the heap.

Driver compatibility is less of an issue; older hardware becomes less common (and thus less of an issue) over time, and strictly adhering to kernel compatibility limits your ability to add new features to the kernel significantly.

That's true, but sadly they don't put as much effort into compatibility of development tools. Old code has to be ported to new versions of Visual Studio. Even just sticking with C++, I ran into issues with winioctl.h and unicode fstreams requiring different code for different versions of VC++. I imagine other people may have a longer list.

That's sort of the flip side of the same coin. Associating specific runtime versions with apps at build time reduces the bug surface. They don't have to worry about breaking last year's apps with this year's changes to the standard libraries.

And yes, I agree it sucks. But that's the world of proprietary software. I think MS has done this about as well as anyone could.

Unfortunately, it took them quite a while to have an even vaguely conforming C++ implementation, and there are still issues (there was a nice example on Hacker News a while back involving some STL method being dramatically slower than both its equivalent in other major implementations and a near-equivalent MS-specific method), which may explain some of that.

I have never written anything that runs on Windows, regardless of language or API, and I still subscribe to his blog. I feel badly that he gets snarked at so much that he has started injecting pre-emptive snark responses; I think he might write more, or more candidly, if it weren't for the highly vocal peanut gallery he has attracted.

This one in particular seemed worthy of sharing because of a couple of things. Firstly, if you just saw the artifact and not the reasoning, you would probably see a lot of WTF in having 7 bytes of NOP-type instructions, with five outside and two inside each function; it's a great reveal when you find out what it is for. Then my reaction was "Wow - I don't really design good instrumentation/logging/debugging points in my code, do I? I wonder if there's any part of this idea I can fruitfully rip off for my own code?" It was a nice one-two punch.

I can guarantee that the snark is intrinsic to Raymond. He is just that type of guy. He loves a good story, is very sarcastic and likes to troll co-workers.

(I could tell quite a few Raymond stories from when we worked together but it is better to leave it to him. It is a shame that he sticks to technical topics mostly on his blog.)

Actually, I really like the stories about his nieces too. He had one about watching his niece learn to cheat that was fantastic.

Re: snark, I wasn't talking about his own snark; that's great. I'm talking about the bits he adds labelled "Pre-emptive snark: [insert snarky thing a reader might say here, plus his response]". Maybe I'm misreading his attitude towards those, but he seems a bit weary of them.

I thought I was the only one doing this. Raymond Chen is worth subscribing to just for his knowledge alone, regardless of what platform you program for.

I haven't used Windows more than occasionally for a decade, and I still read Chen religiously; he's very entertaining.

> It's really too bad Microsoft doesn't seem to value backwards compatibility as much as they did during the times Chen often writes about.

This, I'd disagree on. At a certain point, one either has to move on, or to spend 90% of the time working on backward compatibility.

> on the other that their driver from 2001 doesn't work right in Windows 7

Drivers were never really part of the backward compatibility obsession. They changed utterly between 9x and NT, of course, and also fairly dramatically between DOS/Win3.x and 9x, between NT4 and 2000, and between XP and Vista.

They have been trying to cover the various markets with one product but instead of offering Windows Home, Home Pro, Pro, Super Pro or whatever they call them, they should split into Windows Classic, Windows Nouveau and Windows Server. Now they are trying to do it on a monolith with Win8.

If you've never had a chance to play with it, Detours, the more complex alternative to the hot-patch strategy Chen is talking about, is really slick.

What you do in Detours is, freeze the process, disassemble the first several instructions of the function you want to hook, copy out enough of them to make room for a full jump instruction, copy in your hook function somewhere in memory, followed by the instructions you stole to make room for the jump, followed by a jump back to the original function. Then you patch in a jump to that location and unfreeze the process.
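To make those steps concrete, here is a rough sketch in Python that works on a plain byte array instead of a live process. It is not the Detours API: `install_hook`, `bytes_to_steal`, the addresses, and the list of instruction lengths (which a real tool would get from its disassembler) are all made up for illustration. The only x86 fact it relies on is that a near relative jump is `E9` followed by a 32-bit displacement measured from the end of the jump.

```python
import struct

JMP_REL32_LEN = 5  # opcode E9 plus a 32-bit signed displacement


def jmp_rel32(src, dst):
    """Encode `jmp dst` for a jump instruction located at address `src`."""
    return b"\xe9" + struct.pack("<i", dst - (src + JMP_REL32_LEN))


def bytes_to_steal(insn_lengths):
    """Count whole instructions until at least 5 bytes are freed."""
    total = 0
    for length in insn_lengths:
        if total >= JMP_REL32_LEN:
            break
        total += length
    if total < JMP_REL32_LEN:
        raise ValueError("function too short to hook")
    return total


def install_hook(code, func_addr, insn_lengths, hook_addr, tramp_addr):
    """Patch `code` in place and return the trampoline bytes (hypothetical helper)."""
    n = bytes_to_steal(insn_lengths)
    stolen = bytes(code[:n])
    # Trampoline: the displaced instructions, then a jump back past them.
    trampoline = stolen + jmp_rel32(tramp_addr + n, func_addr + n)
    # Entry patch: jump to the hook, pad any leftover bytes with 1-byte NOPs.
    code[:n] = jmp_rel32(func_addr, hook_addr) + b"\x90" * (n - JMP_REL32_LEN)
    return trampoline


# push ebp / mov ebp,esp / sub esp,0x10 / leave / ret
code = bytearray.fromhex("558bec83ec10c9c3")
trampoline = install_hook(code, 0x1000, [1, 2, 3, 1, 1],
                          hook_addr=0x5000, tramp_addr=0x6000)
print(code.hex(), trampoline.hex())
```

The sketch skips the hard parts: freezing threads, and fixing up any stolen instruction that is itself position-relative (a call or a short branch), which has to be re-encoded once it moves into the trampoline.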

The example programs for Detours do this, for instance, on every libc function to implement library tracing.

That this "just works" with Microsoft's Detours package is kind of mindboggling.

This is a great project to tackle if you want to write programmable debuggers. We've done it for Win32 (you need a full build environment to use Detours; we have the whole thing in Ruby), OS X, and Linux. It's crazy useful.

Detours is really cool. Last I checked (last year?) it was free for 32-bit code, but you had to pay for or license the 64-bit version. There's an open-source (but not 100% functionally-equivalent) alternative called EasyHook: http://easyhook.codeplex.com/

An anecdote: I've got a Sony VAIO Z series laptop, one of the 2010 models with "Switchable Hybrid" graphics -- that is, there's a switch marked "Auto"/"Speed"/"Stamina" which you can use to switch between the embedded Intel GPU and the discrete nVidia GPU. The laptop itself is great -- probably the best developer's laptop I've ever seen/used -- but drivers have always been a real pain. Anyway, as it turns out, I was updating the drivers last week and just happened to notice the Detours DLL within the driver installer files; so it seems that the graphics driver actually just checks the position of the switch and uses Detours to direct any calls to the "real" driver for whatever hardware is selected.

Does that handle double detours, where two different programs each patch the same function?

That was the most entertaining/frustrating (depending on your world view; it made the journey way more interesting, but also a lot longer) part of hacking classic Mac OS.

You would have a zillion extensions (Apple and third party) each patch tens of OS calls, both at startup and, in cases where the Finder reset patches, after the Finder launched, or even after every application launched (to get your code running at such times, you would have to patch another OS call). On PowerPC machines you would have the added fun of patching PPC code with 68k code and vice versa.

That that _sometimes_ worked was really mind boggling. Relative to that, patching your own libc seems easy.

>Does that handle double detours, where two different programs each patch the same function?

I believe Detours patches the import table inside a single process; it does not patch the call on a system-wide basis, so there really is no 'two different programs each patching the same function'. In theory you could have two different pieces of code running in the same process doing that (i.e. each patching a given function), but Detours gives you a 'trampoline' function to invoke the original thing you are patching. So I believe the second to patch, when invoking its trampoline, would simply invoke the first patch, which in turn would invoke the original through its own trampoline, though I haven't tried that so it may not work that way :)
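Here is a toy model of that chaining guess in Python. Nothing in it is Detours: `HookTable`, `attach`, and `read_value` are hypothetical names, and the "trampoline" is just whatever callable was installed at the moment a hook attached. It only shows why the second hook would end up calling the first, and the first the original, if each patch captures the current entry point as its trampoline.

```python
class HookTable:
    """Toy per-process table mapping a function name to its current entry point."""

    def __init__(self):
        self.entry = {}

    def attach(self, name, make_hook):
        # The trampoline is whatever is installed right now: the original
        # function for the first hook, the first hook for the second one.
        trampoline = self.entry[name]
        self.entry[name] = make_hook(trampoline)

    def call(self, name, *args):
        return self.entry[name](*args)


trace = []


def read_value():
    trace.append("original")
    return 1


def make_first_hook(trampoline):
    def hook():
        trace.append("first hook")
        return trampoline() + 10
    return hook


def make_second_hook(trampoline):
    def hook():
        trace.append("second hook")
        return trampoline() + 100
    return hook


table = HookTable()
table.entry["read_value"] = read_value
table.attach("read_value", make_first_hook)
table.attach("read_value", make_second_hook)
result = table.call("read_value")
print(result, trace)
```

Whether real Detours behaves this way when two libraries hook the same function is exactly the untested part; the model only captures the reasoning.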

VS uses this mechanism to allow you to run as non-admin even though there is TONS of VS and third-party code that expects to do things like write to HKLM, which is a no-no if you are not an admin.

It doesn't actually patch the import table; it finds the function's address in memory. This has the benefit of working for statically linked libraries and run-time dynamic linking (loading). You're correct in all other respects. My casual impression is that the technique would allow more than one library to hook into the target function.

Ahh okay, that makes sense. Modifying the actual page the instructions live on in memory would force it to be marked as dirty, and thus if paged out it would actually need to hit the page file instead of just being discarded and fetched again from the original DLL when needed, no? I suppose that is a minor concern assuming you aren't patching tons and tons of things that reside on lots of different pages.

I'm a little fuzzy on whether they had to deal with paging or not. At the very least they'd have to mark the pages as dirty, but I think that happens when you mark it as write-able.

I used Detours years ago to hook into the Wave I/O API and DirectSound to capture the audio I/O. I was blown away by the power of API Hooking. Nothing concrete came out of it but it was a lot of fun.

If all you need is to trap a few functions, Intel published a paper on how to intercept API calls. It can be very useful if you need to fix some function inside a DLL you load, modify its behavior or log calls for debugging. There are a lot of potential uses and it's a fun technique that teaches you some things about cache control and assembly.


That's coincidental. I once made a Linux kernel function hijacking module and I had the exact same idea. Well, without the freezing of other (kernel) threads, I didn't realize the problems yet.

Now that I think about it I'm even more surprised it worked in the first place. I thought Linux had w^x protection?

It does, but you have to explicitly turn it on for the pages you care about.

For those that are interested, the Linux kernel does almost the same thing (if compiled that way):


The mcount feature piggybacks on the profiling instruction added into every function when you use the gcc -pg option.

Edit: better link is probably this one: http://www.mjmwired.net/kernel/Documentation/trace/ftrace.tx...

NOOP sequences in x86 are a fun subject. There's an interesting section in Intel's optimization guide somewhere (I'm too lazy to find it) that details "best practice" noop instructions of 1, 2, ... up to something like 9 bytes. These are used for alignment purposes too, where you need a few bytes of padding to make a loop-back target cache-line aligned or whatnot.

Check out the "smartalign" package in NASM: it contains 1-8 byte nop instructions for several different x86 variants: http://repo.or.cz/w/nasm.git/blob/a2c78555770990ed966c414da9...

You can use xxd and objdump to see what each of these translates into. For example, here's an 8-byte nop for x86-64:

  $ echo '0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00' | xxd -r > /tmp/bincode
  $ objdump -M intel -D -b binary -mi386 -Mx86-64 /tmp/bincode
  /tmp/bincode:     file format binary
  Disassembly of section .data:
  00000000 <.data>:
     0:   0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]
     7:   00
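For the alignment use case, a small Python sketch of how an assembler might pick padding. The byte sequences are the multi-byte NOP encodings listed in Intel's manuals for lengths 1 to 9 (the 8-byte one is the same as in the objdump output above); the `padding` helper and the greedy "largest NOP first" strategy are my own assumptions, not what NASM or any particular toolchain actually does.

```python
# Recommended multi-byte NOP encodings, keyed by length in bytes.
NOPS = {
    1: "90",
    2: "6690",
    3: "0f1f00",
    4: "0f1f4000",
    5: "0f1f440000",
    6: "660f1f440000",
    7: "0f1f8000000000",
    8: "0f1f840000000000",
    9: "660f1f840000000000",
}


def padding(offset, alignment=16):
    """Return NOP bytes that advance `offset` to the next multiple of `alignment`."""
    need = -offset % alignment
    out = bytearray()
    while need:
        n = min(need, 9)  # greedy: use the longest NOP that still fits
        out += bytes.fromhex(NOPS[n])
        need -= n
    return bytes(out)


print(padding(0x1008).hex())  # 8 bytes needed, so a single 8-byte NOP
print(len(padding(0x1001)))   # 15 bytes needed: one 9-byte plus one 6-byte NOP
```

Fewer, longer NOPs are preferred over a run of `90` bytes because each one is a separate instruction the CPU still has to decode.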

Okay I have two questions that might be very clueless but I don't know the answer to them so I will ask them anyway.

1) In the comments Raymond says, "Hot-patching is not an application feature. It's an OS internal feature for servicing." Then why does the compiler put hot-patch points in my code? Why not use a special compiler flag when building Windows DLLs?

2) Why do we need a special hot-patch point at all? What's wrong with just overwriting the first few bytes of the function you want to hot-patch?

Having a hot-patch point avoids race conditions when overwriting the code. The patch target needs to be 1 instruction, so that any OS threads executing that code either see the old instruction (and run the unpatched code), or the new instruction (and run patched code), and never some mishmash of pre-patch and post-patch instructions.

(It also needs to be possible to overwrite the patch target with 1 instruction, which isn't possible for a full JMP with a 32-bit displacement, as those are 5 bytes in length.)

1) I didn't see anything that suggested that all DLL functions have this hot-patch point. I think from his perspective "Windows DLL" means "a DLL that is part of the Windows operating system", not "a DLL used by an application executing on Windows".

2) I think he addressed this - someone might be executing the function while you are trying to patch it. Having a 2-byte, one clock cycle NOP at the front means that you can replace it "atomically" from the perspective that nobody can walk into the middle of you updating the memory.
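The two-step patch is easy to see on raw bytes. This is a Python simulation on a byte array, not how Windows servicing is actually implemented: `hot_patch`, the image layout, and the addresses are hypothetical. It assumes the layout from the article, five unused bytes immediately before the function and `mov edi, edi` (`8B FF`) as its first instruction.

```python
import struct

MOV_EDI_EDI = b"\x8b\xff"     # the two-byte no-op at the function entry
SHORT_JMP_BACK = b"\xeb\xf9"  # jmp short -7: from entry+2 back to entry-5


def hot_patch(image, func_off, image_base, new_func_addr):
    """Redirect the function at `func_off` to `new_func_addr` (hypothetical helper)."""
    if bytes(image[func_off:func_off + 2]) != MOV_EDI_EDI:
        raise ValueError("no hot-patch point at this function")
    pad_off = func_off - 5
    func_addr = image_base + func_off
    # Step 1: write the 5-byte jump into the padding. No thread executes
    # these bytes yet, so a multi-byte write is harmless. The displacement
    # is relative to the end of the jump, which is the function entry.
    image[pad_off:func_off] = b"\xe9" + struct.pack("<i", new_func_addr - func_addr)
    # Step 2: swap the two-byte no-op for a two-byte short jump. A thread
    # sees either the old instruction or the new one, never a mix.
    image[func_off:func_off + 2] = SHORT_JMP_BACK


# five bytes of padding, then: mov edi,edi / push ebp / mov ebp,esp / ret
image = bytearray(b"\x90" * 5 + bytes.fromhex("8bff558becc3"))
hot_patch(image, func_off=5, image_base=0x1000, new_func_addr=0x2000)
print(image.hex())
```

A real implementation would do step 2 as a single aligned 16-bit store; the slice assignment here only stands in for that.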

Thanks! Re 1), it does seem to be a compiler switch /hotpatch, not the default behavior.

For one thing, you can't just scoop out 2-5 bytes, replace them with a jump, and assume that things will work. Detours, the alternative to wired-in hot patch points, includes a small disassembler that ensures it's working on instruction boundaries. Detours is significantly more complex than the patching strategy Chen outlined.

About 2: You would have to copy the overwritten bytes to another place in memory to execute them later. As the length of x86 instructions is not fixed you would need a whole disassembler to find out what bytes belong to what instruction. Easier to have just two bytes you can overwrite at will. Saves the hassle of calculating how many bytes you need to copy.

I actually don't know how Windows hot-patching works, but I'd assume they'd just replace the whole function. You wouldn't need to execute the replaced bytes (like you do in Detours, which is usually hooking functions, not replacing them).

Wow, awesome explanation. About 6 years ago, while hacking on gtk+ and Mozilla, I used those instructions to hook into the main event loop to get gtk+ embedding Gecko 1.7, and I had no idea that what I perceived as hacks were actually a somewhat valid method for doing what I needed to do: modify how window events from Gecko were propagated to the gtk+ event loop and vice versa. I think my bug report is probably still open and might even be worth revisiting if anyone is still interested in gtk+ with Mozilla embedded, though it would likely need lots of changes... Latest Gecko is 1.9?? Anyway, awesome explanation.

Latest Gecko is 9.0 to coincide with Firefox 9. :)

Whatever happened to the old idea of separating program and data spaces and write-protecting the program space?

Instead we do the opposite today: we mark the data pages with the "No Execute" bit, which is a far better way to ensure malicious code doesn't execute.

I'm by no means an expert in this domain but my understanding is that that's normally not happening on x86/x64, not even on Linux or OS X. Otherwise, how could just-in-time compilers like Java, .NET or V8 work? Please correct me however if I'm wrong.

Modern x86 does have the NX bit (and other major archs like ARM have equivalents), which allows areas of memory to be marked executable or not, and most modern operating systems do use it. JIT VMs will explicitly set it on produced code.

This is part of the reason that JIT is not allowed, and will not even _work_, on iOS (with the exception of Mobile Safari's Javascript engine, which has special privileges). Applications aren't allowed to set the ARM NX equivalent.

JIT compilers allocate executable memory by using mmap() (not malloc(), since malloc()-allocated memory is not guaranteed (or even likely) to be executable). When you map memory with mmap(), you can decide the protection bits (PROT_READ, PROT_WRITE, PROT_EXEC, and they can be OR'd together in any combination).

For example, from the V8 sources: http://www.google.com/codesearch#W9JxUuHYyMg/trunk/src/platf...

malloc()ed memory is executable if you mprotect() it to be so. malloc()ing memory and blindly executing it is definitely a bug OTOH. malloc() is mmap()-based on a lot of systems nowadays anyway, so malloc() vs mmap() is just a choice of convenience of API.

> malloc()ed memory is executable if you mprotect() it to be so.

True, but since mprotect() can only operate on full pages, your mprotect() calls will most likely affect memory beyond what malloc gave you. For example, if you remove write or execute permissions your program will probably crash, and it's best to avoid having pages that are both writable and executable.
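The page-granularity point is just arithmetic, so here is a quick Python sketch of it. `pages_touched` is a made-up helper, and the addresses and 4096-byte page size in the examples are assumptions; it only computes the range you would have to pass to mprotect(), it doesn't call it.

```python
import mmap

PAGE = mmap.PAGESIZE  # page size of the machine running this


def pages_touched(addr, length, page=PAGE):
    """Return the page-aligned [start, end) range covering addr..addr+length."""
    start = addr & ~(page - 1)                       # round down to a page
    end = (addr + length + page - 1) & ~(page - 1)   # round up to a page
    return start, end


# A 64-byte allocation in the middle of a 4 KiB page: changing its
# protection changes all 4096 bytes of that page.
start, end = pages_touched(0x1234, 64, page=4096)
print(hex(start), hex(end), end - start)

# The same 64 bytes straddling a page boundary drags in two pages.
start2, end2 = pages_touched(0x1FF0, 64, page=4096)
print(hex(start2), hex(end2), end2 - start2)
```

Everything else malloc() placed between `start` and `end` gets the new protection too, which is why JITs take whole pages from mmap() instead.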

These are von Neumann machines, not Harvard, at all levels of abstraction save for some newer security measures.

Then why do I need to restart the computer after I install anything?

