

Why do Windows functions all begin with a pointless MOV EDI, EDI instruction? - cabacon
http://blogs.msdn.com/b/oldnewthing/archive/2011/09/21/10214405.aspx

======
jswinghammer
I have basically no background writing applications on Windows outside of .Net
but I love reading posts by Raymond Chen. I always enjoy learning about
things that seem crazy from the outside but have a real purpose that you're
just missing information to understand. That's pretty much what looking at
someone else's code is often like so it's helpful to remember that even
seemingly crazy things have a purpose.

I feel like I've learned a lot from reading his blog over the years. I even
bought his book years ago because I felt like I was getting a lot of value
from the blog.

It's really too bad Microsoft doesn't seem to value backwards compatibility as
much as they did during the times Chen often writes about. It seems like an
interesting challenge that they've pretty much given up on. I can't even count
how many conversations I've been in where people complained on one hand that
Microsoft focused on that backwards compatibility too much and on the other
that their driver from 2001 doesn't work right in Windows 7. Often these
statements happen moments apart.

~~~
cabacon
I have never written anything that runs on Windows, regardless of language or
API, and I still subscribe to his blog. I feel bad that he gets snarked at
so much that he has started injecting pre-emptive snark responses; I think he
might write more, or more candidly, if it weren't for the highly vocal peanut
gallery he has attracted.

This one in particular seemed worthy of sharing because of a couple of things.
Firstly, if you just saw the artifact and not the reasoning, you would
probably see a lot of WTF in having 7 bytes of NOP-type instructions, with
five outside and two inside each function; it's a great reveal when you find
out what it is for. Then my reaction was "Wow - I don't really design good
instrumentation/logging/debugging points in my code, do I? I wonder if there's
any part of this idea I can fruitfully rip off for my own code?" It was a nice
one-two punch.

~~~
jbeda
I can guarantee that the snark is intrinsic to Raymond. He is just that type
of guy. He loves a good story, is very sarcastic and likes to troll co-
workers.

(I could tell quite a few Raymond stories from when we worked together but it
is better to leave it to him. It is a shame that he sticks to technical topics
mostly on his blog.)

~~~
cabacon
Actually, I really like the stories about his nieces too. He had one about
watching his niece learn to cheat that was fantastic.

Re: snark, I wasn't talking about his own snark; that's great. I'm talking
about the bits he adds labelled "Pre-emptive snark: [insert snarky thing a
reader might say here, plus his response]". Maybe I'm mis-reading his attitude
towards those, but he seems a bit weary of them.

------
tptacek
If you've never had a chance to play with it, Detours, the more complex
alternative to the hot-patch strategy Chen is talking about, is really slick.

What you do in Detours is, freeze the process, disassemble the first several
instructions of the function you want to hook, copy out enough of them to make
room for a full jump instruction, copy in your hook function somewhere in
memory, followed by the instructions you stole to make room for the jump,
followed by a jump back to the original function. Then you patch in a jump to
that location and unfreeze the process.
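
For anyone who wants to see the byte shuffling, here is a minimal sketch of those steps in Python, run against a plain bytearray instead of a live process. It is not how Detours itself is written. The addresses, the `install_hook` helper and the hand-supplied instruction lengths are all made up for illustration (a real hooker gets the lengths from a disassembler), and it assumes the stolen instructions contain no relative operands that would need fixing up after the move. It also keeps the hook and the trampoline as separate regions for clarity.

```python
import struct

JMP_REL32 = 0xE9  # x86 opcode: jmp rel32 (5 bytes in total)


def jmp_rel32(src, dst):
    """Encode a 5-byte near jump located at src that lands on dst."""
    # The displacement is relative to the address of the *next* instruction.
    return bytes([JMP_REL32]) + struct.pack("<i", dst - (src + 5))


def install_hook(mem, func, hook, tramp, insn_lens):
    """Patch func to jump to hook and build a trampoline at tramp.

    insn_lens is a hypothetical stand-in for a disassembler: the lengths of
    the first few instructions of func. Returns the number of bytes stolen.
    """
    # Steal whole instructions until there is room for the 5-byte jump.
    stolen = 0
    for n in insn_lens:
        if stolen >= 5:
            break
        stolen += n
    if stolen < 5:
        raise ValueError("function too short to hook")
    # Trampoline = stolen instructions + a jump back to the rest of func.
    mem[tramp:tramp + stolen] = mem[func:func + stolen]
    mem[tramp + stolen:tramp + stolen + 5] = jmp_rel32(tramp + stolen, func + stolen)
    # Overwrite the entry with a jump to the hook; pad any leftover with NOPs.
    mem[func:func + 5] = jmp_rel32(func, hook)
    mem[func + 5:func + stolen] = b"\x90" * (stolen - 5)
    return stolen


# Fake 32-bit prologue: push ebp / mov ebp, esp / sub esp, 0x10 / ret
mem = bytearray(0x100)
func, hook, tramp = 0x10, 0x40, 0x80
original = bytes.fromhex("558bec83ec10c3")
mem[func:func + len(original)] = original
stolen = install_hook(mem, func, hook, tramp, [1, 2, 3, 1])
```

Here six bytes get stolen (three whole instructions), because the 5-byte jump would otherwise cut the `sub esp, 0x10` in half. The hook calls the trampoline whenever it wants the original behaviour.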

The example programs for Detours do this, for instance, on every libc function
to implement library tracing.

That this "just works" with Microsoft's Detours package is kind of
mindboggling.

This is a great project to tackle if you want to write programmable debuggers.
We've done it for Win32 (you need a full build environment to use Detours; we
have the whole thing in Ruby), OS X, and Linux. It's crazy useful.

~~~
Someone
Does that handle double detours, where two different programs each patch the
same function?

That was the most entertaining/frustrating (depending on your world view; it
made the journey way more interesting, but also a lot longer) part of hacking
classic Mac OS.

You would have a zillion extensions (Apple and third party) each patch tens of
OS calls, both at startup and, in cases where the Finder reset patches, after
the Finder launched, or even after every application launched (to get your
code running at such times, you would have to patch another OS call). On
PowerPC machines you would have the added fun of patching PPC code with 68k
code and vice versa.

That that _sometimes_ worked was really mind boggling. Relative to that,
patching your own libc seems easy.

~~~
ryanmolden
>Does that handle double detours, where two different programs each patch the
same function?

I believe Detours patches the import table inside a single process, it does
not patch the call on a system wide basis, so there really is no 'two
different programs each patching the same function'. In theory you could have
two different pieces of code running in the same process doing that (i.e. each
patching a given function), but Detours gives you a 'trampoline' function to
invoke the original thing you are patching so I believe the second to patch,
when invoking the trampoline function would simply invoke the first patch,
which when invoking its own trampoline would invoke the original, though I
haven't tried that so it may not work that way :)
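
A toy model of that chaining, in Python rather than against real Detours: the `entry` table and the `attach` helper below are hypothetical stand-ins for a patched function entry and for the attach step, so this only illustrates the ordering I'd expect, not what the library actually does.

```python
# Hypothetical stand-in for patched function entries: name -> current target.
entry = {}
calls = []


def attach(name, make_hook):
    """Install a hook; make_hook receives the trampoline (the previous target)."""
    trampoline = entry[name]  # whatever the entry pointed at before this patch
    entry[name] = make_hook(trampoline)
    return trampoline


def original(x):
    calls.append("original")
    return x + 1


def first_hook(trampoline):
    def hook(x):
        calls.append("first")
        return trampoline(x)  # falls through to the original
    return hook


def second_hook(trampoline):
    def hook(x):
        calls.append("second")
        return trampoline(x)  # falls through to the first hook
    return hook


entry["f"] = original
attach("f", first_hook)
attach("f", second_hook)
result = entry["f"](41)
```

In this model the last hook attached runs first, and each trampoline hands control to whatever was installed before it.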

VS uses this mechanism to allow you to run as non-admin even though there is TONS
of VS and third party code that expects to do things like write to HKLM which
is a no-no if you are not an admin.

~~~
srdev
It doesn't actually patch the import table; it finds the function's
address in memory. This has the benefit of working for statically linked
libraries and run-time dynamic linking (loading). You're correct in all other
respects. My casual impression is that the technique would allow more than one
library to hook into the target function.

~~~
ryanmolden
Ahh okay, that makes sense. Modifying the actual page the instructions live on
in memory would force it to be marked as dirty and thus if paged out it would
actually need to hit the page file instead of just being able to discard it
and when it is needed again fetch it from the original dll no? I suppose that
is a minor concern assuming you aren't patching tons and tons of things that
reside on lots of different pages.

~~~
srdev
I'm a little fuzzy on whether they had to deal with paging or not. At the very
least they'd have to mark the pages as dirty, but I think that happens when
you mark it as write-able.

------
rwmj
For those that are interested, the Linux kernel does almost the same thing (if
compiled that way):

<https://lwn.net/Articles/264029/>

The mcount feature piggybacks on the profiling instruction added into every
function when you use the gcc -pg option.

Edit: better link is probably this one:
[http://www.mjmwired.net/kernel/Documentation/trace/ftrace.tx...](http://www.mjmwired.net/kernel/Documentation/trace/ftrace.txt#1563)

------
ajross
NOOP sequences in x86 are a fun subject. There's an interesting section in
Intel's optimization guide somewhere (I'm too lazy to find it) that details
"best practice" noop instructions of 1, 2, ... up to something like 9 bytes.
These are used for alignment purposes too, where you need a few bytes of
padding to make a loop-back target cache-line aligned or whatnot.
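
A rough sketch of how an assembler might pick that padding, in Python. The byte sequences are the multi-byte NOP encodings as I remember them from Intel's manuals; the `padding` helper and the 16-byte default alignment are my own assumptions, not anything a particular assembler is guaranteed to do.

```python
# Multi-byte NOP encodings recommended by Intel, keyed by length in bytes.
NOPS = {
    1: bytes.fromhex("90"),
    2: bytes.fromhex("6690"),
    3: bytes.fromhex("0f1f00"),
    4: bytes.fromhex("0f1f4000"),
    5: bytes.fromhex("0f1f440000"),
    6: bytes.fromhex("660f1f440000"),
    7: bytes.fromhex("0f1f8000000000"),
    8: bytes.fromhex("0f1f840000000000"),
    9: bytes.fromhex("660f1f840000000000"),
}


def padding(offset, align=16):
    """Return NOP bytes that advance offset to the next multiple of align."""
    need = -offset % align
    out = b""
    while need:
        n = min(need, 9)  # use the longest NOP that still fits
        out += NOPS[n]
        need -= n
    return out


pad = padding(0x13)  # 13 bytes needed: one 9-byte NOP plus one 4-byte NOP
```

Using a few long NOPs instead of a run of single-byte `90`s means fewer instructions to decode when execution does fall through the padding.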

~~~
haberman
Check out the "smartalign" package in NASM: it contains 1-8 byte nop
instructions for several different x86 variants:
[http://repo.or.cz/w/nasm.git/blob/a2c78555770990ed966c414da9...](http://repo.or.cz/w/nasm.git/blob/a2c78555770990ed966c414da94ccd3ed91de120:/macros/smartalign.mac)

You can use xxd and objdump to see what each of these translates into. For
example, here's an 8-byte nop for x86-64:

    
    
      $ echo '0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00' | xxd -r > /tmp/bincode
      $ objdump -M intel -D -b binary -mi386 -Mx86-64 /tmp/bincode
      
      /tmp/bincode:     file format binary
      
      
      Disassembly of section .data:
      
      00000000 <.data>:
         0:   0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]
         7:   00

------
cousin_it
Okay I have two questions that might be very clueless but I don't know the
answer to them so I will ask them anyway.

1) In the comments Raymond says, _"Hot-patching is not an application feature.
It's an OS internal feature for servicing."_ Then why does the compiler put
hot-patch points in my code? Why not use a special compiler flag when building
Windows DLLs?

2) Why do we need a special hot-patch point at all? What's wrong with just
overwriting the first few bytes of the function you want to hot-patch?

~~~
cabacon
1) I didn't see anything that suggested that all DLL functions have this hot-
patch point. I think from his perspective "Windows DLL" means "a DLL that is
part of the Windows operating system", not "a DLL used by an application
executing on Windows".

2) I think he addressed this - someone might be executing the function while
you are trying to patch it. Having a 2-byte, one clock cycle NOP at the front
means that you can replace it "atomically" from the perspective that nobody
can walk into the middle of you updating the memory.
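
To make the ordering concrete, here is a small Python simulation on a bytearray. It is a sketch under my own assumptions, not the OS servicing code: `hotpatch` and `follow` are hypothetical helpers and the addresses are invented. The long jump goes into the five padding bytes first, where nothing executes, and only then is the two-byte MOV EDI, EDI swapped for a short jump back.

```python
import struct

MOV_EDI_EDI = b"\x8b\xff"  # the 2-byte no-op at the function entry
JMP_SHORT_BACK = b"\xeb\xf9"  # jmp short -7: lands 5 bytes before the entry


def hotpatch(mem, func, target):
    """Redirect a hot-patchable function at address func to target."""
    if mem[func:func + 2] != MOV_EDI_EDI:
        raise ValueError("no hot-patch point at this address")
    # Step 1: write jmp rel32 into the padding. The jump sits at func - 5,
    # so the next instruction is at func and the displacement is target - func.
    mem[func - 5:func] = b"\xe9" + struct.pack("<i", target - func)
    # Step 2: flip the entry with a single 2-byte write.
    mem[func:func + 2] = JMP_SHORT_BACK


def follow(mem, addr):
    """Follow EB/E9 jumps from addr and return where execution ends up."""
    while True:
        op = mem[addr]
        if op == 0xEB:
            addr += 2 + struct.unpack("<b", mem[addr + 1:addr + 2])[0]
        elif op == 0xE9:
            addr += 5 + struct.unpack("<i", mem[addr + 1:addr + 5])[0]
        else:
            return addr


mem = bytearray(0x100)
func, target = 0x20, 0x80
mem[func - 5:func] = b"\x90" * 5  # padding in front of the function
mem[func:func + 2] = MOV_EDI_EDI
before = follow(mem, func)
hotpatch(mem, func, target)
after = follow(mem, func)
```

A thread that arrives before step 2 still sees a harmless MOV, and one that arrives after it takes both jumps; at no point does anyone see a half-written instruction.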

~~~
cousin_it
Thanks! Re 1), it does seem to be a compiler switch /hotpatch, not the default
behavior.

------
alexwestholm
Wow, awesome explanation. About 6 years ago, while hacking gtk+ and Mozilla, I
used those instructions to hack into the main event loop to get gtk+ embedding
gecko 1.7, and had no idea that my perceived hacks were actually a somewhat
valid method for doing what I needed to do: modify how window events
from gecko were propagated to the gtk+ event loop and vice versa. I think that my
bug report is probably still open and might even be worth revisiting if anyone
is still interested in gtk+ with Mozilla embedded - would likely need to make
lots of changes... Latest gecko is 1.9?? Anyways, awesome explanation.

~~~
sid0
Latest Gecko is 9.0 to coincide with Firefox 9. :)

------
giardini
Whatever happened to the old idea of separating program and data spaces and
write-protecting the program space?

~~~
DrJokepu
I'm by no means an expert in this domain but my understanding is that that's
normally not happening on x86/x64, not even on Linux or OS X. Otherwise, how
could just-in-time compilers like Java, .NET or V8 work? Please correct me
however if I'm wrong.

~~~
haberman
JIT compilers allocate executable memory by using mmap() (not malloc(), since
malloc()-allocated memory is not guaranteed (or even likely) to be
executable). When you map memory with mmap(), you can decide the protection
bits (PROT_READ, PROT_WRITE, PROT_EXEC, and they can be OR'd together in any
combination).
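
A small illustration using Python's mmap module, which wraps the same call on Unix. The `alloc` helper is just a name I made up. To keep the sketch runnable on locked-down systems it only requests read and write; a JIT would also want PROT_EXEC, and the combined mask is shown without being used.

```python
import mmap


def alloc(length, prot):
    """Map anonymous private memory with the requested protection bits."""
    # fileno -1 asks for an anonymous mapping (this signature is Unix-only).
    return mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE, prot=prot)


# The bits OR together; this is the mask a writable, executable region needs.
RWX = mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC

buf = alloc(mmap.PAGESIZE, mmap.PROT_READ | mmap.PROT_WRITE)
buf[0:1] = b"\xc3"  # a single RET, standing in for generated code
first = buf[0:1]
```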

For example, from the V8 sources:
[http://www.google.com/codesearch#W9JxUuHYyMg/trunk/src/platf...](http://www.google.com/codesearch#W9JxUuHYyMg/trunk/src/platform-linux.cc&q=PROT_EXEC%20package:http://v8%5C.googlecode%5C.com&l=385)

~~~
brohee
malloc()ed memory is executable if you mprotect() it to be so. malloc()ing
memory and blindly executing it is definitely a bug OTOH. malloc() is mmap()
based on a lot of systems nowadays anyway, so malloc() vs mmap() is just a choice
of convenience of API.

~~~
haberman
> malloc()ed memory is executable if you mprotect() it to be so.

True, but since mprotect() can only operate on full pages your mprotect()
calls will most likely affect memory beyond what malloc gave you. For example
if you remove write or execute permissions your program will probably crash, and
it's best to avoid having pages that are both writable and executable.
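
To put numbers on the page-granularity point, here is a quick Python sketch. The `pages_touched` helper and the addresses are invented for illustration. mprotect() wants a page-aligned start address, so the caller has to round the pointer down, and the length effectively gets rounded up to whole pages.

```python
import mmap


def pages_touched(addr, length, page=mmap.PAGESIZE):
    """Return the [start, end) byte range an mprotect() call would really cover."""
    start = addr & ~(page - 1)  # round the address down to a page boundary
    end = (addr + length + page - 1) & ~(page - 1)  # round the end up
    return start, end


# A 64-byte allocation in the middle of a 4 KiB page drags the whole page along.
small = pages_touched(0x1234, 64, page=4096)
# A 32-byte allocation that straddles a boundary affects two pages.
straddle = pages_touched(0x1FF0, 0x20, page=4096)
```

Everything else malloc() placed in those pages gets the new permissions too, which is where the crashes come from.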

------
wwwww
Then why do I need to restart the computer after I install _anything_?

