
De-Anonymizing Programmers via Code Stylometry (2015) [pdf] - sillysaurus3
https://www.usenix.org/system/files/conference/usenixsecurity15/sec15-paper-caliskan-islam.pdf
======
randomwalker
I'm a coauthor of this paper. It was published a couple of years ago;
pleasantly surprised to see it here.

The paper shows that programmers have distinctive styles in their source code
that can be extracted to create a fingerprint of coding style. It's not just
the obvious stuff like spaces vs. tabs -- parsing the code and looking at the
Abstract Syntax Tree is what results in a powerful fingerprint.
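
As a rough illustration of what AST-based features can look like, here is a minimal sketch using Python's standard `ast` module. It is not the paper's feature set or toolchain; the function names `ast_node_frequencies` and `ast_bigram_counts` are hypothetical, and the choice of node-type frequencies and parent/child pairs is an assumption made for illustration.

```python
import ast
from collections import Counter


def ast_node_frequencies(source: str) -> dict:
    """Return the relative frequency of each AST node type in `source`."""
    tree = ast.parse(source)
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def ast_bigram_counts(source: str) -> Counter:
    """Count (parent type, child type) pairs, a crude structural feature."""
    tree = ast.parse(source)
    pairs = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            pairs[(type(parent).__name__, type(child).__name__)] += 1
    return pairs
```

Vectors like these could be fed to any off-the-shelf classifier. Note that neither feature depends on whitespace or identifier names.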

We have a follow-up to this paper showing that surprisingly, coding style
survives in compiled binaries (even with optimization turned on and debugging
symbols removed): [https://www.princeton.edu/~aylinc/papers/caliskan-islam_when...](https://www.princeton.edu/~aylinc/papers/caliskan-islam_when.pdf)

~~~
hdhzy
Looks impressive, now only one question remains: did you find out what is the
real identity of Satoshi Nakamoto?

~~~
randomwalker
I've been asked this a lot, especially since I also research and teach
cryptocurrency technology [1]. I haven't personally tried to do that. Satoshi
clearly wished to maintain their pseudonymity, and I'd rather respect that. I
also think Satoshi's pseudonymity is a powerful statement about decentralized
cryptocurrencies, namely that their viability rests on their technical merits,
with no need to know or trust their creators.

That said, I don't blame people for trying to uncover Satoshi's identity, and
it's possible that the techniques in our paper can help. The big caveat, of
course, is access to a corpus that includes Satoshi's code labeled with their
true identity.

[1] [http://randomwalker.info/bitcoin/](http://randomwalker.info/bitcoin/)

~~~
SomeStupidPoint
Tangent, but can your method be extended to indicate traits of a coder?

E.g., based on this sample, the coder is likely American, with experience at X
corps, went to school at Y, etc.

~~~
rawnlq
Even more of a tangent: this reminds me of a manga where there was a character who
could guess your sexual personality based on your code (read right to left):
[http://imgur.com/a/FJPMS](http://imgur.com/a/FJPMS)

~~~
SerLava
Porn intros keep getting weirder.

------
rawnlq
Stylometry is a really cool field.

For example it was used to figure out if a subreddit was being manipulated by
a small number of sock puppet accounts:
[https://www.reddit.com/r/Bitcoin/comments/3hf5z7/determining...](https://www.reddit.com/r/Bitcoin/comments/3hf5z7/determining_manipulation_via_sock_puppet_accounts/)
[https://www.reddit.com/r/Bitcoin/comments/3dszzu/55_btc_boun...](https://www.reddit.com/r/Bitcoin/comments/3dszzu/55_btc_bounty_for_proof_of_block_size_debate/)

(and of course the people studying it the most are the ones hunting for
satoshi =P)

------
rustyfe
I wonder how much this varies by programming language. In Go, gofmt ensures
that most stylistic choices are similar for everyone (Go code tends to look
alike) and patterns and rules are fairly well codified (Go variable and
function names tend to follow certain rules). The more codified the code is,
the fewer fingerprints each developer can leave.

It'd be interesting to see the most and least identifiable languages!

~~~
ycmbntrthrwaway
It is not about tabs, spaces and variable names. The main difference is how
long your functions are, how many functions they call, how deeply control
structures are nested within functions, etc.
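
A minimal sketch of how those three measurements could be taken for Python code, assuming the standard `ast` module. The names `function_metrics`, `max_control_depth` and `CONTROL_NODES` are hypothetical, and which node types count as "control structures" is an assumption.

```python
import ast

# Node types treated as control structures for the depth measure (an assumption).
CONTROL_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With)


def max_control_depth(node: ast.AST, depth: int = 0) -> int:
    """Deepest nesting of control structures beneath `node`."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, CONTROL_NODES) else depth
        deepest = max(deepest, max_control_depth(child, child_depth))
    return deepest


def function_metrics(source: str) -> dict:
    """Map each function name to its length, call count and nesting depth."""
    metrics = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            metrics[node.name] = {
                # Source lines spanned by the definition (needs Python 3.8+).
                "lines": node.end_lineno - node.lineno + 1,
                # Every call expression anywhere inside the function.
                "calls": sum(isinstance(n, ast.Call) for n in ast.walk(node)),
                "depth": max_control_depth(node),
            }
    return metrics
```

One caveat: an `elif` is parsed as an `If` nested inside the previous one, so this sketch counts it as one level deeper.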

~~~
lucb1e
Would be interesting to tie that to ratings. Have a diverse set of people read
lots of code (samples from projects large and small, in different languages,
etc.) and rate it on things like "easy to understand" and "easily extendible",
then correlate that with styles. I'd be very interested in how I score and
what I can improve.

~~~
hathawsh
It sounds like you're talking about cyclomatic complexity.

[https://en.wikipedia.org/wiki/Cyclomatic_complexity](https://en.wikipedia.org/wiki/Cyclomatic_complexity)

There are tools out there for most major programming languages for measuring
cyclomatic complexity. They can tell you whether typical coders will be able
to help you work on your program at its current complexity level.
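
For a sense of what such tools compute, here is a minimal sketch that approximates cyclomatic complexity for Python source as one plus the number of decision points. The name `cyclomatic_complexity` and the `DECISION_NODES` list are assumptions for illustration; real tools differ in which constructs they count.

```python
import ast

# Node types counted as decision points (an assumption; real tools differ).
DECISION_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` has three operands, i.e. two extra branches.
            complexity += len(node.values) - 1
    return complexity
```

Straight-line code scores 1, and each `if`, loop, exception handler or boolean operator adds to the total.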

~~~
lucb1e
I've heard of this but haven't seen it used anywhere outside of school. Do you
know from experience whether this correlates with how readable _people_ find
code to be?

~~~
hathawsh
No, sorry. Actually, I feel like long functions are often clearer than
refactored equivalents.

~~~
lucb1e
Okay, thanks for responding!

------
fenwick67
This could probably be used to do some sort of code analysis and give people
ideas for how to structure their code differently.

Similar to how there are some automated tools to help you vary your writing
more.

------
bitL
Hehe, seems like we are all going to commit to GitHub using an encoder to a
blended, typical style and pull via a decoder to our own unique styles in order
to obfuscate who we are. Like those BlendIt extensions in Firefox to pretend
we run on Win 7 instead of Linux, etc.

------
aaossa
Hey, I'm a TA for some CS courses. Can this be used to find similar codebases?
Maybe it's useful to find similar programmers, find possible code copies and/or
identify the author of a snippet. Any ideas?
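
One way to experiment with the idea is sketched below, assuming Python submissions and the standard `ast` module. It compares two programs by the cosine similarity of their AST node-type counts. This is a much cruder measure than the paper's classifier, and the function names are hypothetical.

```python
import ast
import math
from collections import Counter


def node_type_counts(source: str) -> Counter:
    """Count how often each AST node type occurs in `source`."""
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def structural_similarity(src_a: str, src_b: str) -> float:
    """Similarity of two programs based only on AST node-type counts."""
    return cosine_similarity(node_type_counts(src_a), node_type_counts(src_b))
```

Renaming variables does not change the score, because identifiers are not part of the features. A high score on a small assignment is weak evidence on its own, though, since short programs share most of their structure.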

~~~
dsfyu404ed
There's only so many ways to do the toy problems that people get assigned.

Students' code winds up looking like its sources: they copy the control
structures because there are only so many sane ways to do those problems
without introducing fluff.

At least this is what my friend's experience using plagiarism checking tools
was.

~~~
userbinator
_because there's only so many sane ways to do those problems_

Indeed, that's been my experience too; if you assign something "Hello world"
level, you're going to get a bunch of 100% correct solutions that look almost
the same, and then the other... odd ones which are almost certainly incorrect
in some way.

It's only when you assign something far more nontrivial, and with many correct
ways to do it, that you will get more diverse solutions; and even then, the
chances of the correct solutions being very similar are high.

------
pencilcode
Do you guys know any good resources on stylometry?

------
ycmbntrthrwaway
(2015)

~~~
sctb
Thanks, updated.

