
Anonymous programmers can be identified by analyzing coding style - randomwalker
https://freedom-to-tinker.com/blog/aylin/anonymous-programmers-can-be-identified-by-analyzing-coding-style/
======
GuiA
Interesting to see this formalized. When I was in grad school and graded
undergrad homework/exams, I could most definitely recognize the students by
their coding styles after just a few assignments. Every student ends up
developing their own habits, and they're quite easy to spot in something as
repetitive as code.

I remember teaching a Matlab class for engineers and scientists that was about
50/50 male/female, and the women tended to have much neater code. Code written
by males often had comments all jumbled up, inconsistent number of spaces
between braces/operators/etc, incoherent variable names, worse names for
functions, and so on.

~~~
okasaki
It's funny to think that if you reversed the genders the comment would be at the
bottom of the page rather than the top.

~~~
vertex-four
On HN? Definitely not.

------
joelgrus
My Code Jam solutions always _shared_ a lot of code -- all the boilerplate for
reading inputs, parsing integers, iterating over test cases, and writing out
results.

Because of that, it seems like Code Jam is an artificially easy test case for
this sort of identification -- I'm pretty sure a _human_ could look at my
solutions and conclude they were obviously all written by the same person.

------
lotophage
Not adhering to style guides is now a privacy issue.

~~~
busterarm
Is it odd that I've been thinking about this one for a while now? I don't
really talk to blackhat folks on the regular anymore, but given how important
OpSec is, it's really sad how easily identifiable most of their code ends up
being. Malware authors brag like crazy ("oh, let me just go ahead and encrypt
this payload with my handle as the key...") and I wouldn't be surprised if you
found some of the same from the state-sponsored folks.

I'd really love to see someone like grugq weigh in with some thoughts here.
The idea of having tools to parse and rewrite my code to be as generic as
possible came to me years ago.

~~~
ObviousScience
> The idea of having tools to parse and rewrite my code to be as generic as
> possible came to me years ago.

You basically want an obfuscator that replaces all the names of things with
generics, and then randomly permutes blocks of code without changing the code
paths possible in the final binary. (Perhaps some optimized-for-performance
version of this, but that might identify the tool you use.)
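
A minimal sketch of the renaming half of that idea, assuming Python 3.9+ (for `ast.unparse`) and only simple top-level functions. `GenericRenamer` and `anonymize` are made-up names for illustration; the sketch ignores imports, attributes, keyword arguments and classes, and it does not permute blocks.

```python
import ast
import builtins


class GenericRenamer(ast.NodeTransformer):
    """Rename function names, arguments and variables to v0, v1, ..."""

    def __init__(self):
        self.mapping = {}

    def _generic(self, name):
        # Leave builtins such as sum or len alone so the code still runs.
        if hasattr(builtins, name):
            return name
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._generic(node.name)
        self.generic_visit(node)  # descend into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._generic(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._generic(node.id)
        return node


def anonymize(source):
    """Parse, rename every identifier, and print the tree back as source."""
    tree = GenericRenamer().visit(ast.parse(source))
    return ast.unparse(tree)


sample = """
def total_price(items, tax_rate):
    subtotal = sum(items)
    return subtotal * (1 + tax_rate)
"""
print(anonymize(sample))
```

The names are gone after this, but the shape of the tree (one assignment, one return, that particular expression) is untouched, which is the harder part to hide.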

It sounds relatively easy to write if you stick to certain coding guidelines
(like using techniques amenable to static analysis).

However, this still won't work in some cases, because you'd need more advanced
tools to handle profiling of what sized functions and such you ended up
writing.

It would be interesting to try and write a tool which defeated any analysis of
author patterns in the code, but would require understanding the program
across the boundary of function calls, which is a difficult problem. (You
probably couldn't write Turing complete code, for example.)

------
jszymborski
Reminds me of how telegraph operators on the receiving end used to be able to
identify the sender by their "fist" (the cadence or rhythm with which they signaled).

Here's a Schneier post about a Concordia University study about identifying
e-mail authors.
[https://www.schneier.com/blog/archives/2011/08/identifying_p...](https://www.schneier.com/blog/archives/2011/08/identifying_peo_2.html)

------
joemaller1
But can they tell us if TJ Holowaychuk is really only one person?

~~~
TeMPOraL
Could someone provide a comprehensive description of what on earth is TJ
Holowaychuk? I did Google a bit, and I'm still confused. Is it the new __why?

~~~
juliangregorian
No, he's just ridiculously prolific.

------
patrickmay
It's interesting that they only analyze the abstract syntax tree and ignore
formatting. I would suspect that brace placement, tabs vs spaces, etc. would
provide a useful fingerprint as well.

~~~
breck
They used both:

> We used a combination of lexical features (e.g., variable name choices),
> layout features (e.g., spacing), and syntactic features (i.e., grammatical
> structure of source code)

The AST stuff is super interesting. The other signals are somewhat
superficial. But comparing ASTs? That is deep.
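
To make "comparing ASTs" a bit more concrete, here is a rough sketch that counts parent/child node-type pairs in a Python syntax tree. This is only an assumption about what a syntactic feature could look like, not the feature set from the paper; `ast_bigrams` is a hypothetical helper name.

```python
import ast
from collections import Counter


def ast_bigrams(source):
    """Count (parent node type, child node type) pairs in the syntax tree."""
    counts = Counter()
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            counts[(type(parent).__name__, type(child).__name__)] += 1
    return counts


# Two ways of writing the same computation.
loop_style = "total = 0\nfor x in data:\n    total += x\n"
builtin_style = "total = sum(data)\n"

print(ast_bigrams(loop_style).most_common(3))
print(ast_bigrams(builtin_style).most_common(3))
```

Both snippets compute the same thing, but only the first contains a `For` node, so their count vectors differ no matter how the code is formatted or what the variables are called.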

~~~
benten10
One would think that, but ASTs generated from 'normal' text (not code) are
actually quite noisy, and lexical and syntactic features have been
more useful[1]. In traditional authorship attribution, ASTs are about a
decade-old technology. But then, this is code, so for all I know it is very
different.

PS: [1] shows that using ASTs does not get you THAT much entropy gain
compared to other features.

1:
[http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6234420](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6234420)

------
fpp
They gave a presentation on this at 31C3 a few days ago, presenting
their findings in more detail, followed by Q&A
([http://media.ccc.de/browse/congress/2014/31c3_-_6173_-_en_-_...](http://media.ccc.de/browse/congress/2014/31c3_-_6173_-_en_-_saal_g_-_201412291715_-_source_code_and_cross-domain_authorship_attribution_-_aylin_-_greenie_-_rebekah_overdorf.html#video)).

What I understood from that: it worked quite well with code bases like those from
the Google Code Jam (large LoC, no style guides, etc.), but not that well with
smaller amounts of code. I'm looking forward to some additional results,
e.g. with a codebase from a corporate development environment.

------
cryptos
I think that kind of analysis wouldn't be possible with Go
([http://golang.org/](http://golang.org/)), since it is very strict/limited
and uniform.

~~~
jaimebuelta
I think that one of the most important parts of this analysis is the naming of
variables, so I guess it would still be possible (though probably more difficult).

------
acomjean
This doesn't surprise me. When I was in college we had a daily paper, and I worked
in the photo dept. We had a box of "feature photos". They were kind of filler
(campus life, people playing hacky sack, feeding ducks, setting up for events,
etc.). I figured out one day that I could look at the photos and tell who took them
by the style (the photographer's name was on the back).

At one job I had we called uncommented, poorly formatted code "curtis code"
for some reason.....

------
lisa_henderson
Of these 3, at least 2 would depend on the language in use:

"We used a combination of lexical features (e.g., variable name choices),
layout features (e.g., spacing), and syntactic features (i.e., grammatical
structure of source code)"

In particular, "layout features" is a huge issue in some languages, and not at
all in others. For instance, a language like Javascript or PHP gives great
flexibility about layout, so in those languages I can see each developer
having a unique style (and I have been involved in style debates regarding
those languages). However, a language like Python has a fairly fixed layout,
since the whitespace is significant. And also, in Clojure, I think most
programmers use Emacs and accept the Emacs clojure-mode indenting as the
default.

Variable name choices is another where some environments encourage similarity,
and others allow for unpredictability and unique styles. Within the Ruby On
Rails framework, for instance, there are norms about the creation of variable
names.

I would guess that syntactic features is perhaps the one characteristic that
shows a great deal of uniqueness in every language. I am often surprised at
the choices my fellow co-workers make, when it comes to how to solve a
problem.

~~~
TheLoneWolfling
Python does not have anywhere near a fixed layout.

Off the top of my head:

* When / if single-line if / while / etc. statements are used
* How many blank lines are used between functions
* How often blank lines are used in functions
* How much indentation is used for initializing lists / etc.
* If multiline strings are used.

Etc.

Some of these are covered by PEPs, yes, but enough people don't follow PEPs
religiously that even those offer some information.
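
A few of those signals can be measured in a handful of lines. The sketch below is only an illustration: `layout_features` and its dictionary keys are invented here, it looks at top-level `def`s only, and it does not account for decorators.

```python
import ast


def layout_features(source):
    """Return a few layout measurements for one Python source string."""
    lines = source.splitlines()
    tree = ast.parse(source)

    def blanks_above(lineno):
        # Count consecutive blank lines directly above a 1-based line number.
        count = 0
        i = lineno - 2
        while i >= 0 and not lines[i].strip():
            count += 1
            i -= 1
        return count

    funcs = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    compound = [n for n in ast.walk(tree)
                if isinstance(n, (ast.If, ast.While, ast.For))]
    strings = [n for n in ast.walk(tree)
               if isinstance(n, ast.Constant) and isinstance(n.value, str)]
    return {
        # Blank lines directly above each top-level function.
        "blank_lines_before_defs": [blanks_above(f.lineno) for f in funcs],
        # Compound statements whose body starts on the header line.
        "one_line_compound": sum(n.body[0].lineno == n.lineno for n in compound),
        # String literals that span more than one line.
        "multiline_strings": sum(s.end_lineno > s.lineno for s in strings),
        # Share of all lines that are blank.
        "blank_line_ratio": sum(not l.strip() for l in lines) / max(len(lines), 1),
    }


terse = "def f(x):\n    if x: return 1\n    return 0\ndef g():\n    return 2\n"
airy = ("def f(x):\n\n    if x:\n        return 1\n\n    return 0\n"
        "\n\ndef g():\n    return 2\n")

print(layout_features(terse))
print(layout_features(airy))
```

The two samples behave identically and are both valid Python, yet every measurement except the multiline-string count comes out different.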

------
dschiptsov
_Could_ be identified in some cases of amateur code, like PHP or Javascript or
Clojure.

Good practice is to follow a very explicit coding style which makes code
written by different developers indistinguishable - the more the better.

Go ahead, identify which developer wrote which part of Linux kernel or, god
forbid, jdk/src/*

------
amirmc
Something that would be interesting is to follow code styles across people
who've pair-programmed. Kinda like the apprenticeship model, I wonder if you
could detect specific styles that get adopted and evolve over time.

------
click170
This is fascinating, my mind is immediately drawn to simple obfuscation
programs that would turn tabs into spaces and change the formatting and so on,
while still leaving it syntactically correct. Not obfuscation such as to hide
the purpose of the code, just the identity of the author.

Does anyone know of any such projects, or what they might be described as?
None of the queries I've tried produce the intended results.

You could even take it one step further, if you can identify the author of
source code, can you not then forge that signature to make it look like they
wrote something they didn't?

~~~
userbinator
_Does anyone know of any such projects, or what they might be described as?
None of the queries I've tried produce the intended results._

I guess it's because you're looking for "obfuscators" while such programs are
usually known as "automatic code formatters"... and any decent IDE is going to
have the functionality to do this.

For something standalone, look at
[http://en.wikipedia.org/wiki/Indent_(Unix)](http://en.wikipedia.org/wiki/Indent_\(Unix\))

~~~
palunon
Or astyle for a more versatile standalone formatter (and it's frequently the one
used by IDEs)...

------
marak830
I have a question. I've never decompiled anything, but I was under the
impression it would come out in machine code, so you wouldn't get programmers'
notes, tabs etc. Can someone explain how they're doing this? I know I should know
this, haha, be gentle :-p

~~~
jetti
You wouldn't get any of that information unless you compiled with gcc -g. However,
the article uses the source code and doesn't decompile anything.

~~~
marak830
Ahh thanks guys I must have missed that part.

------
ryan-c
I'm pretty sure something similar could be done with shell history logs.

------
bfe
This stylometry analysis is 95% of a stylometry obfuscator/homogenizer.

------
avodonosov
The same way authors of text posts online can be identified.

------
geetarista
gofmt ftw ;)

~~~
Sanddancer
Formatting along those lines is just one vector that's used. This gives a lot
of weight to the underlying AST of the code, so in order to obfuscate, you'd have
to have a program that scrambles variable names, placement and content of
control blocks, etc. Basically, you'd need less an autoformatter and more an
obfuscator, probably coupled with a deobfuscator to make the code as generic
as possible.
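
A quick way to see the limit of an autoformatter, sketched in Python 3.9+ rather than Go: re-printing code from its syntax tree stands in here for gofmt-style normalization (an assumption for illustration, not how gofmt is implemented). `normalize_layout` and the three sample authors are made up.

```python
import ast


def normalize_layout(source):
    """Re-print code from its syntax tree, discarding spacing and comments."""
    return ast.unparse(ast.parse(source))


# Same program, different layout habits.
author_a = "def area(w,h):\n\treturn w*h   # tabs, tight operators\n"
author_b = "def area(w, h):\n\n    return w * h\n"
# Same behavior, different names and structure.
author_c = "def area(width, height):\n    result = width * height\n    return result\n"

print(normalize_layout(author_a) == normalize_layout(author_b))
print(normalize_layout(author_a) == normalize_layout(author_c))
```

The first comparison comes out equal and the second does not: layout differences vanish, while naming and structural choices pass straight through the formatter.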

------
avodonosov
Can it help to find the real author of bitcoin?

~~~
gwern
Hard to say. This is analogous to drug testing or terrorist hunting: even if
you have a highly accurate test, you're going to want to apply it to thousands
of programmers, and suddenly, when you do the Bayesian calculation, your high
accuracy turns out to still be a low probability of having correctly
identified the true author.
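
To put rough numbers on that Bayesian calculation (all figures below are invented for illustration, and the sketch treats attribution as a yes/no test per candidate with the true author assumed to be in the pool):

```python
def posterior_true_author(candidates, true_positive_rate, false_positive_rate):
    """P(flagged person is the author | the test flagged them), via Bayes' rule."""
    prior = 1 / candidates  # each candidate equally likely beforehand
    flagged_and_author = true_positive_rate * prior
    flagged_not_author = false_positive_rate * (1 - prior)
    return flagged_and_author / (flagged_and_author + flagged_not_author)


# Hypothetical numbers: 10,000 candidate programmers, a test that flags the
# real author 95% of the time and wrongly flags anyone else 1% of the time.
p = posterior_true_author(
    candidates=10_000, true_positive_rate=0.95, false_positive_rate=0.01
)
print(f"{p:.4f}")
```

With these made-up rates the posterior is under 1%: about a hundred innocent programmers get flagged for every real match.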

And then you have to justify your closed-world assumption: how do you know
Satoshi (under his real name) was _even in your dataset_? Maybe after Bitcoin
he went back to closed-source work or commercial projects, and none of his
source code other than Bitcoin appears in your dataset. Then the guy your
analysis picked out isn't 'Satoshi' so much as 'the guy who looks the most
like Satoshi (but actually isn't)'.

~~~
avodonosov
We can find a number of suspects, and then analyze other facts about them.

------
Iv
"Prose authorship attribution that utilizes parse trees have been able to
identify an anonymous text from 100,000 candidate authors 20% of the time."

Color me unimpressed

