
Hotpatching a C Function on x86 - ingve
http://nullprogram.com/blog/2016/03/31/
======
benou
GCC has some limited built-in support for that:

- The 'ifunc' attribute [1] lets you dynamically select the implementation at
  load time, which simplifies the use case where you want a single binary that
  uses an optimized function depending on the available hardware (e.g. use
  SSE4.2 if available, or fall back to MMX on old platforms).
- asm goto [2] plus a custom section to keep track of the addresses of the jmp
  instructions. This lets you change jmp targets at runtime. A typical use
  case is enabling or disabling a feature in the hot path at runtime, e.g.
  instrumentation.

[1] [https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attrib...](https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html)

[2] [https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#GotoLab...](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#GotoLabels)
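
As a rough illustration of the 'ifunc' approach described above, here is a minimal sketch. It assumes GCC (or Clang) on an x86 ELF/glibc system. The function names (`count_bits`, `count_bits_generic`, `count_bits_popcnt`, `resolve_count_bits`) are made up for the example, and POPCNT is used as the optional CPU feature instead of the SSE4.2/MMX pair mentioned in the comment.

```c
#include <stddef.h>

/* Portable fallback: clear the lowest set bit until none remain. */
static size_t count_bits_generic(unsigned long v)
{
    size_t n = 0;
    while (v != 0) {
        v &= v - 1;
        n++;
    }
    return n;
}

/* Variant compiled with POPCNT enabled for this one function only. */
__attribute__((target("popcnt")))
static size_t count_bits_popcnt(unsigned long v)
{
    return (size_t)__builtin_popcountl(v);
}

typedef size_t count_bits_fn(unsigned long);

/* Resolver: called once by the dynamic loader while it binds the symbol.
   It runs very early, so it should do nothing beyond inspecting the CPU. */
static count_bits_fn *resolve_count_bits(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("popcnt") ? count_bits_popcnt
                                            : count_bits_generic;
}

/* Callers see an ordinary function; the loader points it at whichever
   implementation the resolver returned. */
size_t count_bits(unsigned long v)
    __attribute__((ifunc("resolve_count_bits")));
```

Callers simply write `count_bits(x)`. The selection happens once at load time, so there is no per-call feature check.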

------
wslh
Shameless plug, but my company is almost wholly oriented toward binary
instrumentation!

In our Deviare Hooking Engine/Deviare In Process [github-1] and RemoteBridge
[github-2], we have a disassembler in place to hook Win32/COM/C++ vtables, so
the hooking process is smarter if there are changes in the prologue. You can
of course look at the source code and learn a lot from it, since it is state
of the art and competes directly with Microsoft Detours [3]. For an older
comparison with Microsoft Detours, see [4].

For anyone looking for an extremely easy-to-use, higher-level API, Deviare
Hooking Engine makes it extremely easy to hook functions and handle their
parameters and return values. It is as simple as this:

[snippet]

    
    
    notepadPID = LaunchNotepadAndGetPid();

    // First, hook DllGetClassObject of the target DLL/OCX.
    hookDllGetClassObj = spyMgr.CreateHook("shell32.dll!DllGetClassObject", (int)eNktHookFlags.flgOnlyPostCall);
    hookDllGetClassObj.Attach(notepadPID, true);
    hookDllGetClassObj.Hook(true);
    hookDllGetClassObj.OnFunctionCalled += OnDllGetClassObjectCalled;
    

[/snippet]

Docs are available here:
[http://www.nektra.com/products/deviare-api-hook-windows/doc-...](http://www.nektra.com/products/deviare-api-hook-windows/doc-v2/index.html)

[github-1]
[https://github.com/nektra/Deviare2](https://github.com/nektra/Deviare2)

[github-1 bis]
[https://github.com/nektra/Deviare-InProc](https://github.com/nektra/Deviare-InProc)

[github-2]
[https://github.com/nektra/RemoteBridge](https://github.com/nektra/RemoteBridge)

[3]
[http://research.microsoft.com/en-us/projects/detours/](http://research.microsoft.com/en-us/projects/detours/)

[4]
[https://www.reddit.com/r/programming/comments/22crn0/gpl_alt...](https://www.reddit.com/r/programming/comments/22crn0/gpl_alternative_library_to_microsoft_detours_for/)

~~~
dreamlayers
Yes, while it's nice to learn about this, it's a lot more practical to use a
library like that. Thanks for releasing your library as open source with a GPL
license.

If I want to use it under the GPL license, that only means my own code using
your library needs to have a GPL-compatible license, right? I can use your
library with my GPL-compatible code to modify other closed-source or
GPL-incompatible code, right? Just checking, because some may view it as using
a GPL plugin with a GPL-incompatible program, which violates the GPL.

~~~
wslh
If I understand correctly, you want to use Deviare in your GPL-compatible
project to hook into a closed-source application such as Microsoft Outlook,
Internet Explorer, Skype, etc.? The answer is yes, you can use Deviare in this
context under the GPL license. We don't consider a closed-source application
that you instrument via Deviare to be part of your software, even when your
software complements other closed-source applications.

------
RandomBK
Another interesting article by Raymond Chen on this topic:
[https://blogs.msdn.microsoft.com/oldnewthing/20110921-00/?p=...](https://blogs.msdn.microsoft.com/oldnewthing/20110921-00/?p=9583)

I believe it's where the "ms" in "ms_hook_prologue" came from.

~~~
andyjohnson0
There was some discussion of Raymond Chen's article a couple of months back
[1].

According to the present article, GCC emits an eight-byte prologue (LEA
RSP,[RSP+0x0]) at the start of the function, but Raymond Chen says that
Microsoft's compiler emits five NOPs before the function start address and an
overwritable two-byte prologue (MOV EDI, EDI) at the start of the function
itself. To me, Microsoft's approach seems more efficient, but I've never
written any serious x86 assembly. Anybody knowledgeable want to comment on
this?

[1]
[https://news.ycombinator.com/item?id=11063700](https://news.ycombinator.com/item?id=11063700)

~~~
cfallin
One thing that comes to mind is that GCC's version could have a bit more
overhead due to ESP-folding [0]. Basically, reading ESP/RSP directly can incur
some overhead, because the register renamer is playing tricks to avoid
actually adjusting the stack pointer on every push/pop until you actually read
the stack pointer's value. It's unclear here why GCC chose RSP over some other
register.

[0] the only reference I could find at the moment, but I've seen it documented
elsewhere too:
[http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/fun...](http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/function-perilogues.html)

------
WalterBright
Back in the olden DOS days, instead of a configuration file, I had a struct in
the code with the configuration data declaration. Then there was a static
instance of the struct to inform the program of the current state.

To change the configuration, the program would write new values to the struct,
and then patch the executable. Since I knew how to find the offset of the
struct instance in the executable based on the runtime address of the
instance, this was easy.

The big advantage to this was speed - being on floppy disk systems, having to
do a file lookup/read was very slow.

Sadly, what killed this technique was when antivirus software appeared and
decided that an executable patching itself was malware.

~~~
salgernon
Back in the olden Mac Plus days, I used a disassembler/monitor/mini-assembler
called TMON[1] that allocated a chunk of otherwise unused memory - 256 bytes,
I think, called "PlayMem".

Since linking on a Mac Plus took upwards of 11 minutes for our project, I
would typically replace the offending line with a JMP $PlayMem, patch up
whatever registers or memory weren't right there, and then jump back into the
middle of the routine.

Oh, and then there was Steve Jasik's "The Debugger"[2], which would let you
crash and then switch to a (limited-capability) MPW environment to edit and
re-link individual routines, which it would then hot-patch back in place. I
think for that one he actually shipped his own linker, which was a
hand-patched copy of Apple's.

Life in 24 bits of flat address space was more fun. Shorter, but more fun.

[1]
[http://www.mactech.com/articles/mactech/Vol.01/01.10/TMONDeb...](http://www.mactech.com/articles/mactech/Vol.01/01.10/TMONDebugging/index.html)

[2] [http://www.jasik.com](http://www.jasik.com)

------
hohod
DTrace has been doing this kind of thing, in a safe and usable-in-production
way, for years (and the code, while not trivial, is usually pretty cool to
read):
[https://github.com/joyent/illumos-joyent/blob/master/usr/src...](https://github.com/joyent/illumos-joyent/blob/master/usr/src/uts/intel/dtrace/fasttrap_isa.c)

~~~
chamibuddhika
For userspace probes DTrace uses INT3 AFAIK. Taking a signal each time a probe
gets hit can be high overhead for certain usages though (like low overhead
profiling as opposed to intrusive debugging).

------
wyldfire
What's the use case for this feature? Altering a thread's (or set of threads')
work dynamically because synchronizing with those threads is too expensive? Or
in an actual case would "goodbye()" contain code which was only generated as
late as runtime?

It's neat, but I'm wondering why simpler things like updating a function
pointer wouldn't be sufficient.

~~~
reubenmorais
Usually it's done when redesigning/recompiling the target function is not
possible. Game modding, restartless updates, instrumentation, etc.

~~~
csl
But that would only be possible from the same process space, right? So only
applicable for plugins etc.?

What I'd like to see is hotpatching the function for another process, but I
guess that's very hard to do with ASLR. Probably doable with some tricks,
though.

Edit: Come to think of it, gdb is able to attach to a running process w/o
debug symbols and find function addresses. So in other words, I just need to
dig into and grok the gdb source.

~~~
chatmasta
If you are able to recompile the program, you can disable ASLR. For example on
iOS it just requires a change to the Mach-O header of the binary. [0]

But I doubt you could disable ASLR of a running process, for somewhat obvious
reasons...

[0]
[https://github.com/peterfillmore/removePIE](https://github.com/peterfillmore/removePIE)

~~~
dreamlayers
You can similarly disable ASLR via the header in Windows, though there are
ways to override that.

ASLR shouldn't be a big problem though. Only the base address of code in each
file (executable or library) is changed, and you can easily find it. Functions
within one file are not shuffled around. ASLR only exists to stop you from
hard-coding function addresses, to make exploits harder.
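
The point that only the base address moves can be written down as simple arithmetic. The sketch below is only that arithmetic. The function and parameter names are illustrative, and how you obtain the runtime base is platform-specific and not shown (for example, the module handle on Windows or the mappings listed in /proc/<pid>/maps on Linux).

```c
#include <stdint.h>

/* Translate an address taken from the file on disk (relative to the base
   the linker assumed) into the address it has in the running process.
   All three parameter names are hypothetical. */
uintptr_t rebase_address(uintptr_t link_time_addr,
                         uintptr_t preferred_base,
                         uintptr_t runtime_base)
{
    /* The offset of a function inside its module does not change under
       ASLR; only the module's base does. */
    uintptr_t offset = link_time_addr - preferred_base;
    return runtime_base + offset;
}
```

For example, a function linked at 0x401230 in a module with preferred base 0x400000 ends up at 0x55551230 if the module is loaded at 0x55550000.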

------
dkopi
If you enjoyed this, here's a great follow up:
[http://jbremer.org/x86-api-hooking-demystified](http://jbremer.org/x86-api-hooking-demystified)

------
eb0la
Interesting. ms_hook_prologue inserts one 8-byte instruction (lea
rsp,[rsp+0x0]) instead of eight 1-byte NOPs. It executes much faster than
eight NOPs if the function is not patched (no need to decode eight
instructions).

~~~
kevincox
The article also mentions that this is important so that you can't have a
thread that has "partially" executed your no-ops when you replace them.

------
hathym
Intel provides a tool called 'pin' that makes it possible to alter a program
dynamically at runtime:
[https://software.intel.com/sites/landingpage/pintool/docs/65...](https://software.intel.com/sites/landingpage/pintool/docs/65163/Pin/html/)

------
ultramancool
If you want something simpler:

[http://www.frida.re/](http://www.frida.re/)

It's by far the simplest hooking framework on the planet.

------
openasocket
Question: what if you just had some global function pointer that you would
call instead, so hotpatching would just involve changing the pointer? That
seems a lot simpler, there's less that can go wrong, and is a lot more
portable. I get that this would probably be slower for the caller, but it's
not that much slower. If the function is in the hot path that function pointer
will be in cache anyway, so how slow is that read compared to the 8-byte NOP?
It sounds to me like the OP's code will only be a cycle or two faster for each
call during normal execution. Maybe I'm forgetting something.

~~~
munin
it winds up being about the same. you need a pointer-wide atomic compare-and-
swap (CAS) to swap out the function pointer safely.

on x86_64, a pointer-wide value is 8 bytes, which is enough space to put some
instructions that detour to another function. so it's a question of whether
you CAS the global function pointer, or the first 8 bytes of the function.

the "atomic CAS" part means you don't benefit from caching either, since you
need to make sure the write is globally visible or threads on different cores
will do different things. this kind of trickery combines two terrible things:
reasoning about virtual method invocation (essentially) and lock-free
programming techniques. have fun!
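
To make the "first 8 bytes of the function" option concrete, here is a hedged sketch of how the replacement 8-byte value could be built: a 5-byte `jmp rel32` followed by the three original bytes that the jump never reaches. The helper names are invented for this example. The actual write is not shown, and would additionally need the code page to be writable and the store to be a single aligned 8-byte write.

```c
#include <stdint.h>
#include <string.h>

/* Encode `jmp rel32` located at address `from` so that it lands on `to`.
   Returns 0 on success, -1 if the target is out of rel32 range. */
int encode_jmp_rel32(uint8_t out[5], uint64_t from, uint64_t to)
{
    /* The displacement is relative to the end of the 5-byte instruction. */
    int64_t rel = (int64_t)(to - (from + 5));
    if (rel < INT32_MIN || rel > INT32_MAX)
        return -1;

    uint32_t r = (uint32_t)(int32_t)rel;
    out[0] = 0xE9;                      /* opcode: jmp rel32 */
    out[1] = (uint8_t)(r & 0xFF);       /* little-endian displacement */
    out[2] = (uint8_t)((r >> 8) & 0xFF);
    out[3] = (uint8_t)((r >> 16) & 0xFF);
    out[4] = (uint8_t)((r >> 24) & 0xFF);
    return 0;
}

/* Build the 8-byte value to store over the function's first 8 bytes:
   the jump in bytes 0..4, the old contents kept in bytes 5..7. */
int make_patch_word(uint64_t *word, const uint8_t old_bytes[8],
                    uint64_t from, uint64_t to)
{
    uint8_t bytes[8];
    memcpy(bytes, old_bytes, sizeof bytes);
    if (encode_jmp_rel32(bytes, from, to) != 0)
        return -1;
    memcpy(word, bytes, sizeof bytes);
    return 0;
}
```

Because the result is one pointer-sized value, swapping it in has the same shape as swapping a function pointer, which is the comparison being made above.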

~~~
openasocket
Why do you need a CAS? Isn't an atomic write sufficient? Because we're just
changing the function to use our code we don't really care what it was
pointing to before, and we don't care if someone else changed the pointer
before we did: it's just last write wins. In fact, the function pointer method
will be able to have multiple threads try to hotpatch at once; idk about the
OP code.

My guess is the OP's method can probably be extended to allow for more
advanced hooking, like returning to the original function definition after
running the new code or something. And, it means the caller doesn't have to
know that the function is hot patchable, which I'm guessing is the most likely
reason.
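
For comparison, the function-pointer variant discussed in this subthread can be sketched with C11 atomics. A plain atomic store is used here (last write wins), as suggested above. The names `handler_v1`, `handler_v2`, `call_handler` and `hotpatch_handler` are invented for the example.

```c
#include <stdatomic.h>

typedef int (*handler_fn)(int);

/* Two interchangeable implementations (hypothetical). */
int handler_v1(int x) { return x + 1; }
int handler_v2(int x) { return x * 2; }

/* The global pointer every caller goes through. */
static _Atomic(handler_fn) current_handler = handler_v1;

/* Callers load the pointer once per call and then jump through it. */
int call_handler(int x)
{
    handler_fn fn = atomic_load_explicit(&current_handler,
                                         memory_order_acquire);
    return fn(x);
}

/* "Hotpatching" is a single atomic store; concurrent patchers simply
   overwrite each other and the last store wins. */
void hotpatch_handler(handler_fn replacement)
{
    atomic_store_explicit(&current_handler, replacement,
                          memory_order_release);
}
```

The cost is that every caller has to know to go through `call_handler`, which is the trade-off guessed at in the comment above.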

------
chamibuddhika
This works since x86 guarantees atomicity for aligned reads and writes w.r.t.
instruction fetch. The LOCK prefix can be used for unaligned reads/writes to
ensure atomicity of data reads and writes. But the Intel SDM says (in 8.1.2,
Volume 3A) "Locked instructions should not be used to ensure that data written
can be fetched as instructions", which suggests instruction fetch is not
atomic even with LOCK'd instructions, especially when the accesses are
unaligned. The Intel SDM (8.1.3, Volume 3A) suggests a cross-modification
protocol which requires global synchronization to ensure correct operation in
such scenarios. Anyhow, we recently hit this limitation unwittingly in our
work, and this led to some exploratory work on finding a way to relax the
global synchronization requirement. Our work can be found at
[http://conf.researchr.org/event/pldi-2016/pldi-2016-papers-l...](http://conf.researchr.org/event/pldi-2016/pldi-2016-papers-living-on-the-edge-rapid-toggling-probes-with-cross-modification-on-x86)
if anyone is interested.
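
Since the argument hinges on whether the patched bytes are aligned, a small check like the following could be run before attempting a patch. This is only a sketch of the alignment precondition, not the cross-modification protocol from the SDM, and the function name is made up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the byte range [addr, addr + len) lies entirely inside one
   naturally aligned block of `unit` bytes. `unit` must be a power of
   two (for example 8 for an aligned 8-byte word, 64 for a typical
   cache line). */
bool within_aligned_unit(uintptr_t addr, size_t len, size_t unit)
{
    if (len == 0 || len > unit || (unit & (unit - 1)) != 0)
        return false;
    return (addr / unit) == ((addr + len - 1) / unit);
}
```

An 8-byte patch at an 8-byte-aligned address passes with `unit` set to 8, while the same patch starting 4 bytes later straddles two words and fails.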

------
chatmasta
Anyone interested in this might also be interested in Cycript [0] by saurik,
which "allows developers to explore and modify running applications on either
iOS or Mac OS X using a hybrid of Objective-C++ and JavaScript syntax [...]"

[0] [http://www.cycript.org/](http://www.cycript.org/)

------
specialist
Huh.

Ages ago, I tried to figure out how to dynamically probe OpenGL extensions. My
intent was that the first time a method call was attempted, the method proxy
would

    
    
      look for a concrete implementation
    
      on success, swap its pointer to the concrete implementation
    
      on fail (not found), swap its pointer for a generic error method
    
      then call itself again
    

By doing it dynamically, an app would only be probing extensions it actually
used, versus every extension from every vendor. Back then, all the stubs were
code generated. To avoid probing everything, we'd manually modify the headers,
which I didn't like maintaining.
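
The self-replacing proxy described above can be sketched in plain C with function pointers. Everything here is hypothetical: `lookup_extension` stands in for whatever the platform uses to find an extension entry point, and "glDoubleEXT" and "glTripleEXT" are invented names, not real OpenGL extensions.

```c
#include <stddef.h>
#include <string.h>

typedef int (*ext_fn)(int);

/* Pretend driver: only one of the two extensions exists. */
static int real_double(int x) { return 2 * x; }

static ext_fn lookup_extension(const char *name)
{
    return strcmp(name, "glDoubleEXT") == 0 ? real_double : NULL;
}

/* Generic error method used when no implementation is found. */
static int missing_extension(int x) { (void)x; return -1; }

static int double_stub(int x);
static int triple_stub(int x);

/* The pointers the application calls through; they start at the stubs. */
ext_fn ext_double = double_stub;
ext_fn ext_triple = triple_stub;

/* First call: probe, swap the pointer, then call through it again. */
static int double_stub(int x)
{
    ext_fn impl = lookup_extension("glDoubleEXT");
    ext_double = impl != NULL ? impl : missing_extension;
    return ext_double(x);
}

static int triple_stub(int x)
{
    ext_fn impl = lookup_extension("glTripleEXT");
    ext_triple = impl != NULL ? impl : missing_extension;
    return ext_triple(x);
}
```

Only the extensions that are actually called get probed, and after the first call each pointer goes straight to its final target.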

I didn't get very far. Looking at your implementation reminded me of that
effort.

I haven't done OpenGL in probably 15 years. I don't even know if extension
probing is still a thing (useful).

Thanks for sharing.

------
e12e
Hm, didn't know putc was thread safe, but apparently it generally is
(implementation dependent) [1]. Does this mean that using unlocked_stdio [1]
is a good idea if you're writing single-threaded programs?

Another question: as I understand it, one must go to C11 before C gains any
"standard" thread awareness. Does that mean that the naked int used for _x_
in this article should probably have been an _Atomic(int) x;? [ed: if the
article conformed to C11 as opposed to C99, that is]

I suppose it depends where _x_ ends up being stored, if multiple threads
across multiple cores will always see/modify the same version of _x_?

[1] [http://linux.die.net/man/3/unlocked_stdio](http://linux.die.net/man/3/unlocked_stdio)

~~~
gjulianm
There are differences between atomicity and thread synchronization. An atomic
write means that the variable is always in a consistent state. For example, a
non-atomic write could write half of the integer in one instruction and the
other half in another instruction, so a thread could see that intermediate,
incoherent state (actually, an aligned write of an integer is always atomic on
32/64-bit x86, but this is just an example).

But the fact that a write is atomic does not mean that it is synchronized with
other cores. For example, thread A on core 1 could write the value of x, but
later thread B on core 2 could read an outdated value from its L1 cache. You
should use memory barriers that force cache refreshing, so all the threads
always see the same version of the variable and do not rely on possibly
outdated caches.
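
In C11 terms, the distinction between an atomic write and a synchronized one is usually expressed through memory orderings, not explicit cache operations. The following is a minimal sketch of the release/acquire pattern. The names `payload`, `ready`, `publish` and `try_consume` are invented for the example, and it assumes one writer thread calling `publish` and any number of readers calling `try_consume`.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;                /* plain data, written before the flag */
static atomic_bool ready = false;  /* flag that publishes the payload */

/* Writer: the release store orders the payload write before the flag. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: if the acquire load sees the flag, it is also guaranteed to
   see the payload written before the matching release store. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

Here the flag alone being atomic would not be enough. It is the release/acquire pairing that makes the earlier plain write to `payload` visible to the reader.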

About the unlocked stdio, I don't think there would be a clear advantage. For
best performance, I/O is done in blocks as big as possible, so locking
mechanisms should not matter that much (you spend much more time in the actual
I/O than in the locking). It should only have a significant effect when making
lots of I/O calls, but even in that case, removing the locks would not improve
performance as much as grouping and batching those calls.

------
etrevino
Now, if we change the program while it is running, is it going to change the
program on disk? I.E., once I've hotpatched this thing, can I rely on the
patch "sticking" when I shut down and reload my program? My understanding of
this is rudimentary, so I want to make sure that I understand (even though I
don't see a use case for me).

~~~
wintermute42
What, exactly, do you think is going on here? How on earth would the
executable on the disk get modified by pointer manipulation? (without the use
of mmap, of course)

~~~
etrevino
Well, that's why I was asking, I wanted to understand. And now I do.

------
StillBored
The Linux kernel also hotpatches itself (not just kpatch either) for
performance and errata fixing. For example:

[https://lwn.net/Articles/620640/](https://lwn.net/Articles/620640/)

Other unix kernels do the same for much the same reasons.

------
munin
if you want to do something like change a function definition or data type at
runtime "the right way" you could use the kitsune dynamic software updating
framework to do it:
[https://github.com/kitsune-dsu](https://github.com/kitsune-dsu)

------
j_s
Are there any open-source tools (or even commercial options!) for hotpatching
.NET assemblies at runtime?

------
_RPM
As a C hacker, I really enjoy this guy's blog. He posts about a lot of topics
of interest to me.

