
When coding style survives compilation: De-anonymizing programmers from binaries - randomwalker
https://freedom-to-tinker.com/blog/aylin/when-coding-style-survives-compilation-de-anonymizing-programmers-from-executable-binaries/
======
bambax
This statement seems debatable: " _Since all the contestants implement the
same functionality, the main difference between their samples is their coding
style._ "

All the (winning) contestants implement the same functionality, yes, but with
possibly wildly different approaches and algorithms, so the main difference
between code samples is not just "style" but what could be called "general
thinking in and around the problem".

But this statement seems the most interesting: " _By comparing advanced and
less advanced programmers’, we found that more advanced programmers are easier
to de-anonymize and they have a more distinct coding style._ "

Beginners tend to think alike, while experts develop an original line of
thinking that is identifiable. The paper could be called "Fingerprints of
Thought"...

~~~
euske
As a professional, I tend to think a distinct coding style is bad. You should
try to write code that is plain, unsurprising, and reproducible. To me, this
study gives us another reason why the software industry needs a more
uniform, standardized methodology and good education.

~~~
dman
Is this view unique towards programmers or do you extend it to artists and
writers as well?

~~~
superuser2
Artists and writers' products are supposed to be aesthetic; code is supposed
to be readable and maintainable.

I apply it to journalism, though. Newspapers have long had "house" style
guides to keep things consistent across reporters.

------
Tinyyy
This is an interesting article, but there's one part that I don't really
understand: "After scaling up the approach by increasing the dataset size, we
de-anonymize 600 programmers with 52% accuracy."

Isn't 52% close to a random guess? I don't get how this ties in with the rest
of the paragraph either.

~~~
Lawtonfogle
If they were saying that in 52% of the cases they could pick the correct
choice out of 600, then random chance alone would have been under 0.2%.

If they were saying that in 52% of cases, they could guess if some code did or
did not belong to a given programmer, then it is basically a random guess.
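The two baselines described above can be checked with quick arithmetic (a sketch; only the 52% figure comes from the paper):

```python
# Baseline for picking the right author out of 600 purely by chance.
n_authors = 600
one_of_n_baseline = 1 / n_authors      # ~0.0017, i.e. under 0.2%
print(one_of_n_baseline < 0.002)       # True

# Baseline for a yes/no guess ("did this programmer write this code?"):
# a coin flip already gets ~50%, so 52% would barely beat it.
binary_baseline = 0.5
print(round(0.52 - binary_baseline, 2))  # 0.02
```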

~~~
nothrabannosir
_> If they were saying that in 52% of cases, they could guess if some code did
or did not belong to a given programmer, then it is basically a random guess._

That entirely depends on the distribution of the original sample.

~~~
dllthomas
Does it? If you're asked "does this code belong to this programmer?" and flip
a coin, you'll approximate 50% regardless of distribution of programmers.

Of course, "const No" might do quite a bit better than 50%, depending on the
distribution of the original sample (for that matter, so might "const Yes",
but those distributions seem less likely).

~~~
michaelt
Well, if there were 600 programmers, one of whom had written 649 programs and
599 of whom had written 1 program each, you could achieve 52% accuracy by
always guessing the same guy.

Of course, the paper explicitly says "For this experiment, we use 600
contestants from [Google Code Jam] with 9 files" so I think in this case the
distribution was probably fairly even?
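The skewed-distribution example above checks out numerically (hypothetical numbers from the comment, not the paper's actual dataset):

```python
# One prolific author with 649 programs, 599 authors with 1 program each.
programs_per_author = [649] + [1] * 599
total_programs = sum(programs_per_author)   # 1248 programs in total
always_guess_prolific = 649 / total_programs
print(round(always_guess_prolific, 2))      # 0.52
```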

~~~
dllthomas
I think you missed some context. Up-thread, someone raised the point that it's
unclear whether the question was "Of these 600 programmers, which wrote this
code?" or "Did this programmer write this code?", and this subthread is
discussing the latter case.

------
stickac
Don't miss the talk by Aylin starting in 50 minutes here:

[http://streaming.media.ccc.de/32c3/hallg/](http://streaming.media.ccc.de/32c3/hallg/)

(Recording available later ...)

~~~
sp332
Recordings will be at
[https://media.ccc.de/b/congress/2015](https://media.ccc.de/b/congress/2015)

~~~
hs86
And here it is: [https://media.ccc.de/v/32c3-7491-de-anonymizing_programmers](https://media.ccc.de/v/32c3-7491-de-anonymizing_programmers)

------
chatmasta
Often, just running `strings | grep '/home/'` on a binary will reveal the home
directory of at least one programmer involved in the compilation process.

~~~
revelation
Or rather, this is what you will find if the company in question has
amateurish procedures and doesn't use automated build tools and continuous
integration.

Then yeah, you might very well find some guy's /home/ path in firmware running
on production basebands.

(Hi, Dojip Kim! [https://www.linkedin.com/in/dojip-kim-7b0b1b6a](https://www.linkedin.com/in/dojip-kim-7b0b1b6a))

------
petercooper
This seems to indicate that modern optimizers still have a _long_ way to go to
be as effective as they could be, since broadly similar code even in different
styles would end up with the same, most efficient end result(?)

~~~
kevin_thibedeau
They did this with optimizations off:

      The above mentioned executable binaries are compiled
      without any compiler optimizations, which are options to
      make binaries smaller and faster while transforming the
      source code more than plain compilation. As a result,
      compiler optimizations further normalize authorial style.

Everyone's still safe with some -O* code massage.

~~~
placeybordeaux
Saying everyone is 'still safe with some -O* code massage' is a very strong
claim. In fact, the paper in question addresses the issue.

Taken from section VI.A, "Compiler Optimization: Programmers of optimized
executable binaries can be de-anonymized.":

"[...] programming style is preserved to a great extent even in the most
aggressive level-3 optimization. This shows that programmers of optimized
executable binaries can be de-anonymized and optimization is not a highly
effective code anonymization method."

Please try to do a basic level of investigation before making claims.

------
EGreg
Well, this puts a damper on my plan to create a secret identity on the
internet under which I release software the way Banksy releases art.

Unless I stop publicly writing software for a few years.

~~~
moyix
My suspicion is that most of these features wouldn't survive in an adversarial
setting -- either by consciously changing your coding style or (better) using
automated tools to rewrite your source code before compilation to alter the
control flow structure (e.g., control flow flattening [1]).

[http://reverseengineering.stackexchange.com/questions/2221/w...](http://reverseengineering.stackexchange.com/questions/2221/what-is-a-control-flow-flattening-obfuscation-technique)

~~~
petke
Another option would be safety in numbers. Take a large executable written by
someone else and wear its dead code like it's your own skin, hiding your true
self in there somewhere.

~~~
EGreg
That might be an interesting way to defeat it! Thanks.

------
motoboi
Maybe we can use this method to find Satoshi, if he uses github.

~~~
x1798DE
Leaving aside that finding Satoshi isn't a particularly laudable goal, isn't
the original Bitcoin client open source? It's almost certainly easier to
detect programming style from source code than from binaries.

~~~
SoftwareMaven
The bitcoin client would be the training set.

------
cjslep
Really neat, especially since the source could be from different languages and
compilers. I'd be interested in the de-anonymization accuracy within the Go
language, where the code formatting tool (gofmt) is widely adopted, and
whether that has any impact.

~~~
tshadwell
I write almost exclusively Go these days, and I think most of the people I
work with could tell you if I wrote a Go program

~~~
nightpool
From only the binary code? That's quite a feat for the unassisted human :P
(mind sharing what kind of people you work with?)

------
psk
This sounds interesting. Could this be applied to Stuxnet, Duqu and malware in
general, or would you require more information?

~~~
worldsayshi
I thought about looking at the bitcoin implementation for finding Satoshi.

~~~
placeybordeaux
We have the commit history for the source code of bitcoin to profile Satoshi.
No need to try to do it from the binary.

~~~
worldsayshi
Sure. I thought of the general method.

I saw that this has been tried to some extent:
[https://en.wikipedia.org/wiki/Satoshi_Nakamoto#Nick_Szabo](https://en.wikipedia.org/wiki/Satoshi_Nakamoto#Nick_Szabo)

------
bshanks
The features given to their random forest classifier appear to be:

* assembly language instruction features: "token unigrams and bigrams"

* decompiled lexical features (unparsed decompiled text): "word unigrams, which capture the integer types used in a program, names of library functions, and names of internal functions when symbol information is available"

* syntactic features (from the parsed AST of the decompiled text): "AST node unigrams, labeled AST edges, AST node term frequency inverse document frequency (TFIDF), and AST node average depth"

* basic block features: TF-IDF weighted "unigrams and bigrams, that is, single basic blocks and sequences of two basic blocks"

source: [http://www.princeton.edu/~aylinc/papers/caliskan-islam_when....](http://www.princeton.edu/~aylinc/papers/caliskan-islam_when.pdf)
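As a rough illustration of that setup (a sketch, not the authors' code: toy "disassembly" strings and scikit-learn defaults stand in for their real feature extraction):

```python
# Minimal sketch of the pipeline described above: token unigram/bigram
# features with TF-IDF weighting, fed to a random-forest classifier.
# The "disassembly" samples and author labels are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

samples = [
    "mov eax ebx call printf ret",     # author A
    "mov eax ecx call printf ret",     # author A
    "push rbp xor eax eax leave ret",  # author B
    "push rbp xor edx edx leave ret",  # author B
]
authors = ["A", "A", "B", "B"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # token unigrams and bigrams
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(samples, authors)

# A new sample sharing A's instruction patterns should likely get label "A".
print(clf.predict(["mov eax edx call printf ret"]))
```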

------
Fuxy
Interesting, but if your code is open source, a lot of people will be
contributing to it, bringing in their own styles.

Depending on how popular the code is, your fingerprint could be completely
hidden among hundreds of others.

~~~
leppr
If the code is open source, there is likely a history of contributions. But
this research is about "De-anonymizing programmers from executable binaries"
anyway, so it's really not an OSS scenario.

