All the (winning) contestants implement the same functionality, yes, but with possibly wildly different approaches/algorithms, so the main difference between code samples is not just "style" but what could be called "general thinking in and around the problem".
But this statement seems the most interesting: "By comparing advanced and less advanced programmers’, we found that more advanced programmers are easier to de-anonymize and they have a more distinct coding style."
Beginners tend to think alike, while experts develop an original line of thinking, that is identifiable. The paper could be called "Fingerprints of Thought"...
I'm not sure I agree with your conclusion. Beginners tend to look to others' code more frequently (often including copying and pasting), which means their "style" is really an amalgamation of multiple styles.
On the other hand, even when more experienced programmers look to others' code, they will still fold it into their own style, even when including large amounts of somebody else's code. Their own style shows through the entire project.
However, both of us are just hypothesizing. A great follow-up to this work would be to look at why beginners are harder to identify. I'm sure the other obvious follow-up, "how to anonymize yourself", is in the works somewhere.
The other thing I'd be interested in is how well this holds up for multi-person projects. Would it be possible to identify me if I submitted code to the Tor project? How much would I have to contribute before identification becomes likely?
> Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts. This requires not that the writer make all his sentences short, or that he avoid all detail and treat his subjects only in outline, but that he make every word tell.
I apply it to journalism, though. Newspapers have long had "house" style guides to keep things consistent across reporters.
I would consider this more important for writing than for art: reading a novel demands a much larger time commitment from your audience, whereas art gives the audience more flexibility in deciding how long to spend with it.
However, there's a difference between bad art or writing and bad programming. Bad art and writing merely go unappreciated; bad programming is full of errors, hard to maintain, and slow, giving an overall poor user experience.
A distinct coding style does not indicate that the code is surprising or difficult to follow.
A diversity of styles should help overall code quality in an organization because you get to pick and choose the 'best approach'* for a situation, and more importantly, learn from your peers.
* whatever that heuristic happens to be... probably unsurprising and reproducible for you. I'd say my heuristic is 'concise, readably correct'. I've always thought of programming as closer to writing mathematical/logical proofs than anything else. If I can eyeball a program and say 'yup, that does what it claims to do,' then it's good. I don't think that heuristic always holds up in practice, simply because successful code-bases grow in proportion to their success.
Of course that's a matter of interpretation, but I find the summary of the paper suffers from the ambiguity.
There is also the fact, which I alluded to, that winning entries in a programming contest should be more alike than non-winning entries, since they all actually solve the problem. If non-winning entries were included, that could explain some of the discrepancy in identification success.
Isn't 52% close to a random guess? I don't get how this ties in with the rest of the paragraph either.
If they were saying that in 52% of cases, they could guess if some code did or did not belong to a given programmer, then it is basically a random guess.
That entirely depends on the distribution of the original sample.
Of course, "const No" might do quite a bit better than 50%, depending on the distribution of the original sample (for that matter, so might "const Yes", but those distributions seem less likely).
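To make that concrete (this is my own toy sketch, not anything from the paper): a classifier that always answers "No" scores exactly the fraction of true "No" cases, so its accuracy is set entirely by the class distribution.

```python
def constant_no_accuracy(p_yes):
    """Accuracy of a classifier that always answers "No",
    given the fraction p_yes of true "Yes" cases."""
    return 1.0 - p_yes

# On a balanced sample, always guessing "No" is no better than a coin flip;
# if only 10% of cases are "Yes", the same do-nothing classifier scores 90%.
balanced = constant_no_accuracy(0.5)
skewed = constant_no_accuracy(0.1)
```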
Of course, the paper explicitly says "For this experiment, we use 600 contestants from [Google Code Jam] with 9 files" so I think in this case the distribution was probably fairly even?
If the question is "which of these 20 programmers wrote this code?", then a random guess has only a 5% chance (1/20) of being right! So 52% is more than 10x better than random.
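A quick simulation makes the baseline obvious (my own sketch, with a hypothetical author pool; nothing here is from the paper): uniformly guessing one of n authors lands near 1/n accuracy.

```python
import random

def random_guess_accuracy(n_authors, trials=100_000, seed=0):
    """Empirical accuracy of guessing the author uniformly at random.

    Each trial draws a true author and an independent uniform guess,
    and we count how often they coincide."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(n_authors) == rng.randrange(n_authors)
               for _ in range(trials))
    return hits / trials

# For a 20-author pool this hovers around 1/20 = 5%,
# which is what makes 52% accuracy a >10x improvement over chance.
baseline = random_guess_accuracy(20)
```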
(Recording available later ...)
Then yeah, you might very well find some guy's /home/ path in firmware running on production basebands.
(Hi, Dojip Kim! https://www.linkedin.com/in/dojip-kim-7b0b1b6a)
> The above mentioned executable binaries are compiled without any compiler optimizations, which are options to make binaries smaller and faster while transforming the source code more than plain compilation. As a result, compiler optimizations further normalize authorial style.
Taken from section VI.A, "Compiler Optimization: Programmers of optimized executable binaries can be de-anonymized.":
"[...] programming style is preserved to a great extent even in the most aggressive level-3 optimization. This shows that programmers of optimized executable binaries can be de-anonymized and optimization is not a highly effective code anonymization method."
Please try and do a basic level of investigation before making claims.
Unless I stop publicly writing software for a few years.
It's pretty easy to vandalize a wiki tho.
Their code should be online, but I'm not sure exactly what's there.
I saw that this has been tried to some extent: https://en.wikipedia.org/wiki/Satoshi_Nakamoto#Nick_Szabo
* assembly language instruction features: "token unigrams and bigrams"
* decompiled lexical features (unparsed decompiled text): "word unigrams, which capture the integer types used in a program, names of library functions, and names of internal functions when symbol information is available"
* syntactic features (from the parsed AST of the decompiled text): "AST node unigrams, labeled AST edges, AST node term frequency inverse document frequency (TFIDF), and AST node average depth"
* basic block features: TF-IDF weighted "unigrams and bigrams, that is, single basic blocks and sequences of two basic blocks"
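The unigram/bigram TF-IDF weighting used for several of these feature sets can be sketched in a few lines. This is a generic illustration of the technique, not the paper's actual pipeline; the token sequences and function name are mine.

```python
import math
from collections import Counter

def ngram_tfidf(docs):
    """TF-IDF vectors over token unigrams and bigrams.

    `docs` is a list of token sequences (e.g. disassembled instructions
    or basic-block identifiers, one sequence per binary)."""
    def feats(tokens):
        # unigrams plus adjacent pairs (bigrams)
        return list(tokens) + [tuple(tokens[i:i + 2])
                               for i in range(len(tokens) - 1)]

    tfs = [Counter(feats(d)) for d in docs]        # term frequencies
    df = Counter()                                 # document frequencies
    for tf in tfs:
        df.update(set(tf))
    n = len(docs)
    # tf * idf; features present in every document get weight 0
    return [{f: tf[f] * math.log(n / df[f]) for f in tf} for tf in tfs]

# Hypothetical instruction streams for two binaries:
docs = [["mov", "add", "mov"], ["mov", "jmp", "add"]]
vectors = ngram_tfidf(docs)
```

Features shared by every document (like `"mov"` above) get zero weight, while an n-gram unique to one author's code (like the bigram `("mov", "jmp")`) is weighted up, which is exactly what makes these features discriminative for authorship.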
Depending on how popular the code is, your fingerprint could be completely hidden among hundreds of others.