The paper shows that programmers have distinctive styles that can be extracted from their source code to create a fingerprint of coding style. It's not just the obvious stuff like spaces vs. tabs: parsing the code and analyzing the abstract syntax tree is what yields a really powerful fingerprint.
We have a follow-up to this paper showing that surprisingly, coding style survives in compiled binaries (even with optimization turned on and debugging symbols removed): https://www.princeton.edu/~aylinc/papers/caliskan-islam_when...
That said, I don't blame people for trying to uncover Satoshi's identity, and it's possible that the techniques in our paper can help. The big caveat, of course, is access to a corpus that includes Satoshi's code labeled with their true identity.
I have to say I'm completely unsurprised by your follow-up result; I've always held that view myself. Code style isn't just variable names and brace placement. It's also the abstract design, and how you choose to reduce the problem you want to solve to the method you solve it with, and the choices made in that process carry through all the way to the object code, data structures, file formats, and beyond. That stylometry should be possible, to some degree, from object code follows naturally from that viewpoint.
Romantic as that might sound, I've always felt that reverse engineering can sometimes seem like reading someone else's thoughts from a few steps removed, via the medium of what they've created.
In response to your last remark: the gist of the claim in the Medium article is that state agencies already know Satoshi's identity, and that the method used was stylometric analysis of his prose (not code), cross-referenced against personal communications the NSA is purported to have access to. That's the claim, at least.
By the way, the Hacker News transparency looks really cool!
Actually, that was just a joke. I don't need their identity, but I sometimes wish there were a way to discuss some of the design decisions that went into Bitcoin (I did a bit of technical analysis of Bitcoin and was involved in electronic crypto-based asset projects before Bitcoin was conceived). Unfortunately, that would make it harder for Satoshi to stay anonymous.
By making what you made, you've made it possible for law enforcement agencies to categorize and fingerprint malware, anonymously published open-source software, and so on.
This will go unnoticed for many years, until the first big player gets busted by it publicly (many more will have been busted in secret before that).
Once that happens, it will become standard practice to randomly perturb your source code to obscure its authorship.
In a sense, someone will have to build a tool that turns "regular" code into unreadable gibberish before compiling... because of what you wrote.
At least now the rest of the world knows this is possible and can mount a defense if they feel it is necessary.
E.g.: based on this sample, the coder is likely American, has experience at company X, went to school at Y, etc.
> We have a follow-up to this paper showing that surprisingly, coding style survives in compiled binaries
Now that is interesting. Have you noticed similarities in coding style across demographics? Is there a Russian style, a Californian style? Or does it lend itself better to a "cohort analysis"? Surely this was one of the first applications that popped into your mind.
The Russian style smells like vodka and owns a dash-cam.
Another question: how pertinent could this be, and could it serve as evidence? Is it as "receivable in court" as fingerprints? If it can be evidence, how do you deal with tampering, and with entities trying to frame each other by mimicking each other's fingerprints to cover their tracks? Could there be a "market" for fingerprints, where you just download a set of fingerprints, write [and compile] your code, and then "apply" the prints to it? Would there be tools to distinguish genuine, organic fingerprints from synthetic, after-the-fact ones? Can you detect how "fresh" the prints are? Can there be a "glove" equivalent so your fingerprints don't show? Can there be a tool to remove your fingerprints after the fact? Can you apply multiple prints successively? This is indeed fascinating.
If, for example, I apply different linters or stylers, don't write docs, don't use camelCase, and don't write tests, won't I appear to be a different programmer? Can we link IOCCC entries to their authors' regular day-job work?
The code stylometry in the linked paper is interesting because it also looks at features likely to be preserved by stylers and linters: they parse the code and examine things like the depth of the abstract syntax tree, or the frequencies of certain AST-node bigrams.
Obviously there's opportunity for a lot more research here. But compared to other variants of stylometry, this approach might be fairly robust.
(In a way, natural-language stylometry seems like it would be "easier", since natural language doesn't have to obey any special syntax.)
It would be somewhat less accessible to comp-sci hobbyists like myself, but is your system privately available?
For example, it was used to figure out whether a subreddit was being manipulated by a small number of sock-puppet accounts:
(And of course the people studying it the most are the ones hunting for Satoshi =P)
It'd be interesting to see the most and least identifiable languages!
There are tools for most major programming languages that measure cyclomatic complexity. They can tell you whether typical coders will be able to help you work on your program at its current complexity level.
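The core of such a measure is simple enough to sketch. Below is a toy approximation of McCabe's cyclomatic complexity (1 plus the number of decision points), again using Python's stdlib `ast` module; real tools such as radon handle many more constructs, so treat this as an illustration only.

```python
import ast

# Node types that introduce a branch (a simplified subset of what
# real cyclomatic-complexity tools count).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp,
                  ast.And, ast.Or, ast.ExceptHandler, ast.comprehension)

def cyclomatic_complexity(source: str) -> int:
    """Roughly McCabe's metric: 1 + number of decision points."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, DECISION_NODES)
                   for node in ast.walk(tree))

code = """
def classify(n):
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    for d in range(2, n):
        if n % d == 0:
            return "composite"
    return "prime-ish"
"""
print(cyclomatic_complexity(code))
```

A common rule of thumb is that functions scoring above roughly 10 become hard for other coders to reason about and are candidates for refactoring.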
Similar to how there are some automated tools to help you vary your writing more.
Students' code winds up looking like its sources because they copy the control structures; there are only so many sane ways to solve those problems without introducing fluff.
At least, that's what my friend's experience with plagiarism-checking tools suggested.
Indeed, that's been my experience too; if you assign something "Hello world" level, you're going to get a bunch of 100% correct solutions that look almost the same, and then the other... odd ones which are almost certainly incorrect in some way.
It's only when you assign something far more nontrivial, and with many correct ways to do it, that you will get more diverse solutions; and even then, the chances of the correct solutions being very similar are high.
I think it's promising, but evaluating it against existing solutions and making it production-ready is probably still a good chunk of work.
For simple code similarity there are Moss, JPlag, and Sherlock (linked in the paper).