Wow, this was actually just the code reference I was looking for to extend decomp2dbg into Ghidra. I had seen ida2dwarf before, but not this sleek Python approach. It does still seem, though, that you need to re-run this script every time you change something in Ghidra, which is less than ideal for me. With this, I think seamless Ghidra support will now be possible.
Ty ty, I'll be sure to give credit where credit is due. Ofc Robert Xiao worked on this, lol.
Original author of the BinDiff tool (ca. 2004) here.
BinDiff uses a good number of different algorithms to narrow the set of candidates for CFG-based matching -- so things like identical string references may be used to match a large function, and then basic blocks in that large function may be matched, and then BinDiff may identify that the functions called from these basic blocks will be identical, too.
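To give a feel for the string-reference part (just a toy sketch in Python, not the actual BinDiff code -- it assumes you've already extracted the set of referenced strings per function):

```python
# Toy sketch, not the actual BinDiff code: seed the diff with functions whose
# set of referenced strings is identical and unique within each binary.
from collections import defaultdict

def seed_by_string_refs(funcs_a, funcs_b):
    # funcs_a / funcs_b: dict of {function_address: frozenset(referenced strings)}
    def unique(funcs):
        buckets = defaultdict(list)
        for addr, strings in funcs.items():
            if strings:                      # skip functions with no string refs
                buckets[strings].append(addr)
        # keep only string sets that identify exactly one function
        return {s: addrs[0] for s, addrs in buckets.items() if len(addrs) == 1}

    ua, ub = unique(funcs_a), unique(funcs_b)
    # only trust the match if the string set is unique on *both* sides
    return {ua[s]: ub[s] for s in ua.keys() & ub.keys()}
```

Once such a pair is matched, the engine can descend into its basic blocks and the functions they call, which is where the propagation described above kicks in.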
The entire field has exploded, and there are dozens of tools for function matching nowadays, some with very sophisticated ML approaches etc. - it is somewhat surprising that BinDiff still does its job, considering that the core engine has not been modified or improved since 2010.
I'd also like to extend kudos to https://twitter.com/admvonschneider -- he is the sole Googler sacrificing his 20% time to keep the tool alive...
So that has been "in the works" for 7+ years, and I think it repeatedly stalls because someone needs to really drive it against the bureaucratic internal obstacles ...
There's no reason they aren't doing it, except "nobody has time to do it".
Sure, I have a couple. I will warn you that I don’t know much about how BinDiff works, other than knowing it does :)
The “Achilles heel” of binary diffing tools is recompiling with different flags, or making other minor changes that alter all basic blocks. Do you think the techniques BinDiff uses could be extended to work on “semantics” at a higher level rather than the actual code? Like doing the diff after an optimizing decompiler has run over it and lifted it.
If you were writing BinDiff today, what new techniques would you integrate? Or are you happy with the algorithms as they are already?
What kinds of things is BinDiff good at identifying? It’s often difficult to know beforehand “how well” a diff will work. How would you judge whether it’ll be worth it?
1) You could try to lift BinDiff to operate on the semantics more, but I am not entirely sure that it'd affect diffing quality. "Semantics" are a difficult thing to define at the assembly level: if your compiler optimizes less and uses a stack slot instead of a register, the assembly-level semantics are actually changed. If you recompile using frame pointer omission, the assembly-level semantics are also changed. If you compile code for ARM/Thumb vs. regular ARM, different assembly-level semantics are emitted. So if you are trying to do diffing on the semantic level, you have to lift to high-level-language semantics and account for stuff like /GS stack cookies etc. etc. -- down this path, madness lies. Even state-of-the-art decompilers tend not to generate the same C output if you decompile differently-compiled assembly. And all of this assumes that no inlining changes happen...
It is more promising to use things that are "affected by semantics, but not excessively so". CFGs fall into this category, and so do a variety of other things.
BinDiff can sometimes work around heavily changing CFGs by using string references, relative positioning in the call graph, and a variety of other tricks. In general, if a function X that calls Y is properly matched, the odds that Y will be properly matched increase greatly.
In a nutshell, BinDiff's algorithm is: "attempt to match two nodes in a graph (call graph or CFG) using a variety of criteria (ordered from precise to loose) -- and each time a match is generated, try to generate additional matches in the neighborhood of the matched node by taking the new match into account". So you try to get high-fidelity matches first; if you get any, you propagate that information to generate more high-fidelity matches, and as time goes by your criteria for allowing two functions to match get looser and looser ... but hopefully you've already done good high-fidelity matches, so this counterbalances the later recklessness of the algo.
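Very roughly, and with placeholder data structures and feature functions rather than what the real engine uses, that loop looks like this:

```python
# Toy sketch of the match-and-propagate loop; data structures and feature
# functions are placeholders, not what the real engine uses.
# `features` is ordered from precise to loose; each maps a node (function or
# basic block) to a hashable signature, and nodes may match if signatures agree.
# `neighbors_a(n)` / `neighbors_b(n)` return the set of neighbors of n in the graph.
def diff(nodes_a, nodes_b, neighbors_a, neighbors_b, features):
    matches = {}
    unmatched_a, unmatched_b = set(nodes_a), set(nodes_b)

    def try_match(cands_a, cands_b, feature):
        new = []
        by_sig_a, by_sig_b = {}, {}
        for n in cands_a:
            by_sig_a.setdefault(feature(n), []).append(n)
        for n in cands_b:
            by_sig_b.setdefault(feature(n), []).append(n)
        for sig, group_a in by_sig_a.items():
            group_b = by_sig_b.get(sig, [])
            if len(group_a) == 1 and len(group_b) == 1:  # only unambiguous matches
                a, b = group_a[0], group_b[0]
                matches[a] = b
                unmatched_a.discard(a)
                unmatched_b.discard(b)
                new.append((a, b))
        return new

    for feature in features:                  # high-fidelity criteria first
        worklist = try_match(unmatched_a, unmatched_b, feature)
        while worklist:                       # propagate around each fresh match
            a, b = worklist.pop()
            worklist += try_match(neighbors_a(a) & unmatched_a,
                                  neighbors_b(b) & unmatched_b, feature)
    return matches
```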
If I was writing BinDiff today ... wow, there's a million things I'd do differently. I last touched that engine in 2009 or so, and things that I've learnt since that'd be very helpful:
1. There's a fantastic use for locality-sensitive hashing here. There's a bit of work around this in the functionsimsearch GitHub repo, but BinDiff's performance could be boosted significantly using these techniques (toy sketch after this list).
2. There's a ton of good academic research on metric learning, and using ML to generate embeddings to then do similarity comparisons. If you combine this with (1), you will see a significant performance boost.
3. The algorithm is pretty sequential; I think it could be parallelized across cores reasonably easily.
4. I'd like to add a little bit more semantic diffing.
5. I think one could very profitably use the distribution of constant offsets in a function as an additional matching criterion ("two functions with approximately similar data structure accesses might be the same") -- also sketched after the list.
6. ... lots more ...
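Regarding (1): a toy SimHash over whatever per-function features you extract (mnemonic n-grams, callee names, ...) is enough to see the idea -- hash each function once, then only compare functions whose hashes are close. This is just an illustration of the LSH idea, not functionsimsearch's API:

```python
import hashlib

# Toy SimHash for bucketing functions before expensive CFG matching.
# Feature extraction is up to you (mnemonic n-grams, callee names, ...).
def simhash(features, bits=64):
    counts = [0] * bits
    for f in features:
        h = int.from_bytes(hashlib.blake2b(f.encode(), digest_size=8).digest(), "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

# Two functions with mostly overlapping features end up with nearby hashes:
f1 = simhash(["push ebp", "mov ebp,esp", "call strlen", "test eax,eax"])
f2 = simhash(["push ebp", "mov ebp,esp", "call strlen", "xor eax,eax"])
print(hamming(f1, f2))   # small distance => worth comparing in detail
```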
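And regarding (5), the constant-offset idea is cheap to prototype: collect the immediate displacements each function uses in memory operands, histogram them, and compare the histograms (cosine similarity here, purely as a sketch -- extracting the offsets from the disassembly is assumed to happen elsewhere):

```python
from collections import Counter
from math import sqrt

# Sketch of (5): compare two functions by the distribution of constant offsets
# (struct field displacements etc.) they use.
def offset_similarity(offsets_a, offsets_b):
    ha, hb = Counter(offsets_a), Counter(offsets_b)
    dot = sum(ha[k] * hb[k] for k in ha.keys() & hb.keys())
    na = sqrt(sum(v * v for v in ha.values()))
    nb = sqrt(sum(v * v for v in hb.values()))
    return dot / (na * nb) if na and nb else 0.0

# e.g. two compiles of a function poking the same struct fields:
print(offset_similarity([0x0, 0x8, 0x8, 0x10, 0x18],
                        [0x0, 0x8, 0x10, 0x10, 0x18]))   # close to 1.0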
Regarding "what is BinDiff good at" -- if you can get a bunch of initial high-fidelity matches, the overall success of the diffing process will be much higher than if the initial matches are poor. You can use BinDiff interactively and do incremental diffing -- look at the BinDiff config files, you can selectively disable individual matching criteria, and then start with an essentially empty diff, match a few functions manually, and then ask BinDiff to propagate them.
Come to think of it, BinDiff tries too hard to solve vastly different use cases:
A) Patch diffing, where you want to be very sensitive to even small changes
B) Symbol porting, where you want to be very tolerant of small changes
The tool should probably be split into two -- one matching phase, and one semantic analysis phase; and for the matching phase, making the interactive diffing experience better would be a great idea...
In general, it blows my mind that BinDiff is still in use. I had expected a superior successor to arrive within 2-3 years of Google acquiring zynamics, I would have never expected that 10+ years on, without any change to the core engine, BinDiff still produces reasonably-ish results...
I am only a little surprised that it works; after all, compiling a program for two platforms results in mostly equivalent control flow graphs, except for optimizations and different code-gen strategies (*); and of course platform-specific stuff (but that was probably already abstracted away for such a project back then).
BUT I was very (pleasantly) surprised someone did go through the effort of implementing this and made it available for everyone as FOSS <3
(*, rant) I think inlining and tail-call optimizations (especially calls into libc or the equivalent of the era) might cause the most trouble. Also, loop generation can have subtle differences reflected in the CFG (not only loop unrolling, but also transforming a while-loop into a do-while-loop causes differences in the CFG). Back then I'd also expect compilers to not always perform equally well at "simple" things like dead code elimination. I wish I had the time to check out BinDiff in detail, it's really interesting (personal bias: in the past I worked on static code analysis and had a supporting role in a compiler dev team).
You may also want to check out https://github.com/mahaloz/decomp2dbg. (Disclaimer: I will never not shill Zion's code.)