
De-anonymizing programmers from executable binaries - godelmachine
https://blog.acolyer.org/2018/03/16/when-coding-style-survives-compilation-de-anonymizing-programmers-from-executable-binaries/
======
sixhobbits
I did a lot of work on authorship attribution for natural language.

I found it was easy to get mind-blowing results on a single dataset - e.g. 99%
accuracy for working out who wrote which emails in the Enron [0] dataset.

The moment the test set was even slightly cross-genre, results became bad.

So this is great in theory, but I would love to see a test set from the same
authors outside the context of the Google Code Jam. I'd be surprised if the
results were anything like as good.

There are very few real-world cases where you have a large amount of same-
genre data for the questioned author. (No one writes a thousand suicide notes
/ ransom demands / scams)

[0] A large email database often used for authorship tasks
[https://www.cs.cmu.edu/~enron/](https://www.cs.cmu.edu/~enron/)
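
To make the cross-genre point concrete, here is a minimal sketch of the kind of split I mean: train on one context and test only on samples the same authors produced elsewhere. The sample tuples, the context labels ("codejam", "github") and the helper name `split_by_context` are all made up for illustration; nothing here comes from the paper.

```python
def split_by_context(samples, held_out_context):
    """Split (author, context, text) samples so that the test set comes
    only from held_out_context and the training set from everything else."""
    train = [s for s in samples if s[1] != held_out_context]
    test = [s for s in samples if s[1] == held_out_context]
    # Keep only test authors that appear in training; an author the model
    # has never seen cannot be attributed at all.
    known_authors = {author for author, _, _ in train}
    test = [s for s in test if s[0] in known_authors]
    return train, test


# Hypothetical data: two contexts per author, plus one author with no
# training-context samples.
samples = [
    ("alice", "codejam", "sample A1"),
    ("alice", "github", "sample A2"),
    ("bob", "codejam", "sample B1"),
    ("bob", "github", "sample B2"),
    ("carol", "github", "sample C1"),
]

train, test = split_by_context(samples, held_out_context="github")
```

Accuracy measured on `test` here is the number I'd want to see, as opposed to a random split inside a single context.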

~~~
IncRnd
Great point. A system needs to be forward tested not simply back tested on a
known set of data.

Incorrect back testing is absolutely standard with stock/bond/forex/etc
trading. There are a million hustlers who sell their back-tested systems. In
essence, without forward testing all you have is an untested hypothesis.

------
haberman
I would be very interested to know how the accuracy changes if you remove
symbols and strings from the analysis.

Symbols and strings are free-form text that would seem to leave a lot more of
a stylistic fingerprint. If you could get this level of accuracy from machine
code alone, I would be very surprised.

The article says that stripping symbols reduced accuracy by 24%. This is a far
greater drop than from optimizing the binary. It leads me to believe that a
large part of the success in classifying is coming from symbols and strings.

------
zawerf
This isn't a surprising result if you know what competitive programming code
looks like.

Everyone has their own distinctive personal library/template/boilerplate that
they paste into every problem, since includes, typedefs, defines, library
helpers, IO, etc. are always the same.

You can see some examples here to see what I mean:
[https://www.go-hero.net/jam/17/solutions](https://www.go-hero.net/jam/17/solutions)

------
meuk
Very interesting. The accuracy achieved is quite impressive -- I think it
would be lower if the technique were applied in practice, since the
assumptions made are quite strong. Under looser assumptions (no
disclosure of the compiler/optimization level, highest optimization level
enabled, larger pool size), the technique is less accurate.

In practice, it might be interesting to give a fuzzier output - the estimated
probability that an individual wrote the code for a certain binary. This would
help when two programmers (say, Alice and Bob) write code that yields binaries
with similar properties. In the current system, either Alice or Bob is picked
-- there is no way to express that it might be either of them.
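
A rough sketch of what such a fuzzier output could look like, assuming the classifier exposes some raw per-author score. The scores, the 0.3 threshold and both function names are hypothetical; a softmax is just one way to normalize, and the fraction of votes from an ensemble classifier would work as well.

```python
import math


def author_probabilities(scores):
    """Turn raw per-author scores into a probability distribution (softmax)."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {author: math.exp(s - top) for author, s in scores.items()}
    total = sum(exps.values())
    return {author: e / total for author, e in exps.items()}


def plausible_authors(probs, threshold=0.3):
    """Return every author whose probability clears the threshold,
    instead of forcing a single winner."""
    return sorted(a for a, p in probs.items() if p >= threshold)


# Hypothetical scores for one binary: Alice and Bob are nearly tied.
scores = {"alice": 2.0, "bob": 1.9, "carol": -1.0}
probs = author_probabilities(scores)
candidates = plausible_authors(probs)
```

With these made-up scores both Alice and Bob clear the threshold, so the output says "either of them" rather than arbitrarily picking one.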

------
akskos
How much code written by Hal Finney is available?

~~~
akskos
Although we have the actual source code for Satoshi's Bitcoin implementation,
so this isn't exactly relevant to the article.

~~~
jcora
Uhh Satoshi is Nick Szabo...

~~~
rotred
Wei Dai's AMA makes me doubt that.

[http://lesswrong.com/lw/jgz/aalwa_ask_any_lesswronger_anythi...](http://lesswrong.com/lw/jgz/aalwa_ask_any_lesswronger_anything/ap3c)

Specifically this comment
[http://lesswrong.com/lw/jgz/aalwa_ask_any_lesswronger_anythi...](http://lesswrong.com/lw/jgz/aalwa_ask_any_lesswronger_anything/ap3c#)

~~~
psyc
> No I don't think it's Szabo or anyone else whose name is known to me.

I have always thought the speculation about SN's identity suffered alarmingly
from availability bias. Everyone wants to believe he's someone they know
about. Someone who's extremely well-known for working on or writing about
things extremely similar to Bitcoin. There are far more smart nobodies
tinkering with things than there are top names working on those things.

Also suffers from a bias I'll call "Shakespeare didn't write Shakespeare
bias." The idea is that a notable thing must have been written by a high-
status and well-credentialed person, and could not have been done by a
talented no-name.

~~~
wolfgke
> The idea is that a notable thing must have been written by a high-status and
> well-credentialed person, and could not have been done by a talented no-
> name.

I think your reasoning is interesting. Nevertheless, consider that if we hunt
for SN, it is very likely at least a person who knows more than a little bit
about the various topics that were necessary to implement Bitcoin (e.g.
finance, cryptographic hashed data structures, existing protocols for crypto
cash, ...). I agree that it is quite possible that this could also be a
talented no-name, but I think this property provides a strong criterion to
check whenever someone claims that some specific person (a talented no-name)
they have in mind is SN.

Also, one would have to provide an explanation of why such a person is not more
well-known. It could be that he/she has a difficult personality. It could also
be that he/she lives in a country where such a person has far fewer
opportunities, etc. But these constraints again limit the circle of talented
no-names that could be SN.

TLDR: Your reasoning is interesting, but your hypothesis has strong
consequences.

~~~
psyc
I don't believe lack of notability needs any explanation at all. It's the
default. I have plenty of [anecdotal] evidence that exceptional skill is one
of many factors that might contribute to notability. Other factors include
networking or self-promotion, timing, and chance. Being a lurker type myself,
I've encountered many fellow lurkers in forums over the last 20 years, who
demonstrated astounding expertise, but are afaik almost completely unknown in
my industry. I can think of many more people who actually are at the top of
the industry, but are relatively unknown simply because they do not give
talks, blog, or write books. In case it wasn't obvious, my personal
_hypothesis_ is that SN was simply a very skilled lurker in the cypherpunk
scene. I doubt he was Finney, Dai, or Szabo, but I suspect he was someone who,
like myself, was well aware of them and hung out in the same forums and lists.

------
xxs
Back in the late 80s/early 90s it was quite possible to tell the author by
looking at the disassembled portion of a virus (when virtually all viruses
were written in assembly).

I don't think much has changed.

~~~
EvanAnderson
Came here to say the same thing. Back in the early 90's one of my friends
spent time obsessively studying the personal nuance of code attributed to The
Dark Avenger. Even in code that had obviously been optimized, bits of personal
nuance leak through. It seems very intuitive that compiled languages, being
more expressive, would leak more personal nuance.

~~~
xxs
Well that was easy, given the signature of triple 6 :D (although that part
dropped due to space considerations)

------
EvanAnderson
I immediately thought of the anecdote about World War II British SIGINT
operators being able to identify individual German telegraphers by way of
their distinctive style of Morse keying.

[http://listserv.linguistlist.org/pipermail/ads-l/2005-April/...](http://listserv.linguistlist.org/pipermail/ads-l/2005-April/048637.html)

[https://web.archive.org/web/20100820000534/https://www.wnyc....](https://web.archive.org/web/20100820000534/https://www.wnyc.org/books/42607)
(Search for 'fist')

------
golergka
Can it help forensics determine whether the same attackers are behind
different incidents?

~~~
amelius
Perhaps in the past. In the future, attackers will just modify their code
until this test returns a different predicted author.

~~~
slx26
Anyway, I don't think this would ever be accepted as evidence at trial. It
could be an indicator/lead for investigations, but I don't think it would be a
big concern for attackers?

------
jcoffland
The conclusion that more experienced developers develop their own style is an
important one. It shows that good programming is not about adopting
convention.

~~~
bitwize
There's a principle in art about "knowing the rules so you know when and how
to break them". You can't pass off bad art as good with the excuse of "it's
just my style, man". Well, maybe you can in the elite galleries, but not in
the workaday world of commercial art (e.g., for advertisements, video games,
comics, cartoons, etc.) You have to learn the fundamentals and work within
those bounds, then once you "git gud", only then develop a personal style. I
suspect the same is true of programming.

------
xstartup
After linters, it seems we now have to run our code through a code anonymizer
too.

