Hacker News
De-Anonymizing Programmers via Code Stylometry (2015) [pdf] (usenix.org)
173 points by sillysaurus3 on Sept 1, 2017 | 52 comments



I'm a coauthor of this paper. It was published a couple of years ago; pleasantly surprised to see it here.

The paper shows that programmers have distinctive styles in their source code that can be extracted to create a fingerprint of coding style. It's not just the obvious stuff like spaces vs. tabs -- parsing the code and looking at the Abstract Syntax Tree is what results in a powerful fingerprint.

We have a follow-up to this paper showing that surprisingly, coding style survives in compiled binaries (even with optimization turned on and debugging symbols removed): https://www.princeton.edu/~aylinc/papers/caliskan-islam_when...


Looks impressive. Now only one question remains: did you find out the real identity of Satoshi Nakamoto?


I've been asked this a lot, especially since I also research and teach cryptocurrency technology [1]. I haven't personally tried to do that. Satoshi clearly wished to maintain their pseudonymity, and I'd rather respect that. I also think Satoshi's pseudonymity is a powerful statement about decentralized cryptocurrencies, namely that their viability rests on their technical merits, with no need to know or trust their creators.

That said, I don't blame people for trying to uncover Satoshi's identity, and it's possible that the techniques in our paper can help. The big caveat, of course, is access to a corpus that includes Satoshi's code labeled with their true identity.

[1] http://randomwalker.info/bitcoin/


Thank you for respecting that. I believe you understand.

I have to say I'm completely unsurprised at your follow-up result - I've always held that view myself. Code style isn't just about variable names and brace placement, it's also about the abstract design and how you choose to reduce the problem you want to solve to the method that you want to solve it with - and the choices made in that process carry through all the way to the object code, data structures, file formats and beyond. That stylometry may be possible to some degree from object code follows naturally from that viewpoint.

A little romantic as that might be, I've always felt that reverse-engineering can sometimes seem like, via the medium of what they've created, being a few steps removed from reading someone else's thoughts.


In response to your previous comment, my impression is that this may have come up due to a recent Medium article that was killed on HN.[1]

In response to your last remarks, the gist of the claim in the Medium article is that state agencies already know Satoshi's identity, and the method used was stylometric analysis of his prose (not code) cross-referenced with personal communications that NSA is purported to have access to. That's the claim, at least.

1. http://hn.0x2237.club/post/15116222


I've read the article and, although entertaining, it doesn't contain any substantial information or sources. So all in all it was an interesting anecdote.

By the way the hacker news transparency looks really cool!


> Satoshi clearly wished to maintain their pseudonymity, and I'd rather respect that.

Actually that was just a joke. I don't need their identity but sometimes I wish there was a way of discussing some design decisions that went into Bitcoin (I did a little bit of technical analysis of Bitcoin and was involved in electronic crypto-based asset projects before Bitcoin was conceived). Unfortunately that would make it harder for Satoshi to stay anonymous.


Here's a twist:

By making what you made, you've made it possible for LEA to categorize and fingerprint malware/open source anonymous software etc etc.

This will go unnoticed for many years, until the first big player gets busted by it (publicly, because many more were busted before in secret).

Once that happens, it will become the standard practice to fuzz your source code with random methods to obscure the author.

In a sense, someone will have to make something that turns "regular" code into unreadable gibberish before compiling... because of what you wrote.


You're far too accusatory toward the author here. Various LEA and other parties not associated with law enforcement have vested interests in pursuing this technology anyway. If it was possible, one of them was going to find it. It's very possible that similar discoveries were already made but kept secret.

At least now the rest of the world knows this is possible and can mount a defense if they feel it is necessary.


Tangent, but can your method be extended to indicate traits of a coder?

E.g., based on this sample, the coder is likely American, with experience at X corps, went to school at Y, etc.


Even more tangent, this reminds me of a manga where there was a character who could guess your sexual personality based on your code (read right to left): http://imgur.com/a/FJPMS


Porn intros keep getting weirder.


Not the author, but I don't see why not. I'm sure you could apply some machine learning techniques that combine the stylometric features extracted with demographic information.


Is it one person or many people who wrote that code? I think it is one, but stylometry probably knows better.


I've always suspected it was Hal Finney, but Hal passed away without clarifying whether he was or not. It would have been just like Hal to play such a prank :-)


The issue is that there isn't necessarily a representative code base available for analysis for some likely suspects, i.e. Nick Szabo.


He has been found out using stylometry on his crypto paper.


Interesting. I can guess with high accuracy who wrote the code at work, sometimes by which PEP8 recommendation they've broken; sometimes by seeing an object mutated in two consecutive lines, trailing whitespace left on a line, or a ^M character, etc.

But:

> We have a follow-up to this paper showing that surprisingly, coding style survives in compiled binaries

Now that is interesting.. Have you noticed similarities in coding styles for different demographics? Is there a Russian style, a Californian style? Or does it lend itself better to a "cohort analysis"? Surely this is one of the first applications that popped into your mind.


The Californian style goes to great lengths to be inclusive and excludes a lot of people in the process.

The Russian style smells like vodka and owns a dash-cam.


I wonder if compiled binary fingerprints could be used to narrow down malware authorship.


My "Surely this is one of the first applications that popped into your mind." and mentioning "demographics" was a priming attempt partly prompted by the authors' affiliations.

Another question is: how pertinent could this be and could it serve as evidence. Is it as "receivable in court" as fingerprints? If it can be "evidence", how to deal with tampering and entities trying to frame each other by mimicking each other's fingerprint to 'cover their tracks'? Could there be a "market" for fingerprints where you just download a set of fingerprints, write [and compile] your code, and then "apply" the prints to the code? Would there be tools to detect genuine, organic, fingerprints from synthetic or after the fact prints? Can you detect how "fresh" these prints are? Can there be a "glove" equivalent so your fingerprints don't show? Can there be a tool to remove your fingerprints after the fact? Can you apply multiple prints successively? This is indeed fascinating.


I'm just starting to dive into NLP and I'm curious if this type of technique could be viable for identifying the same users across multiple social media accounts.


If users have the slightest inclination to attempt to be distinct across accounts, wouldn't that defeat this sort of de-anonymization?

If for example I apply different linters or stylers, and don't write docs, and don't use camelCase, and don't write tests, won't I appear to be a different programmer? Can we link IOCCC entries with their author's regular day-job work?


Stylometry of regular text can be fairly reliably defeated even by people who don't know how it works. It remains interesting because most people don't do that.

The code stylometry in the linked paper is interesting because it also looks at things likely preserved by stylers and linters: they parse the code and look at things like the depth of the abstract syntax tree, or the frequency of certain AST-node bigrams.

Obviously there's opportunity for a lot more research here. But compared to other variants of stylometry, this approach might be fairly robust.
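To make the AST features mentioned above concrete, here is a minimal sketch. It assumes Python's standard `ast` module as a stand-in parser (the paper works on a different language and toolchain), and the helper names `ast_depth`, `node_bigrams` and `style_features` are made up for illustration. It computes the maximum tree depth and the counts of parent/child node-type pairs, and it is not the authors' feature extractor.

```python
import ast
from collections import Counter


def ast_depth(node: ast.AST) -> int:
    """Depth of the tree rooted at node (a node with no children has depth 1)."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return 1
    return 1 + max(ast_depth(child) for child in children)


def node_bigrams(tree: ast.AST) -> Counter:
    """Count (parent type, child type) pairs over the whole tree."""
    counts = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            counts[(type(parent).__name__, type(child).__name__)] += 1
    return counts


def style_features(source: str) -> dict:
    """Hypothetical feature bundle: layout-independent by construction."""
    tree = ast.parse(source)
    return {"max_depth": ast_depth(tree), "bigrams": node_bigrams(tree)}


# Same program, two very different layouts.
compact = "def f(x):\n    return x+1\n"
spaced = "def f( x ):\n\n\n        return x + 1\n"

print(style_features(compact) == style_features(spaced))
```

Because the features are read off the parse tree rather than the text, a formatter or linter that only changes whitespace leaves them untouched, which is the point made above.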


I seem to recall a paper a couple of years ago showing that this can already be done quite successfully? Don't remember the details, unfortunately.

(It kind of seems like it would be "easier" in a way since natural language doesn't have to obey any special syntax.)


Do you ever plan to release your system for general use?

While somewhat less accessible to comp-sci hobbyists like myself, is your system privately available?


Please tell us about your Satoshi hunt without revealing personal details ;)


Stylometry is a really cool field.

For example it was used to figure out if a subreddit was being manipulated by a small number of sock puppet accounts: https://www.reddit.com/r/Bitcoin/comments/3hf5z7/determining... https://www.reddit.com/r/Bitcoin/comments/3dszzu/55_btc_boun...

(and of course the people studying it the most are the ones hunting for satoshi =P)


I wonder how much this varies by programming language. In Go, gofmt ensures that most stylistic choices are similar for everyone (Go code tends to look alike) and patterns and rules are fairly well codified (Go variable and function names tend to follow certain rules). The more codified the code is, the fewer fingerprints each developer can leave.

It'd be interesting to see the most and least identifiable languages!
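A rough way to see what a gofmt-style formatter does and doesn't erase, sketched in Python rather than Go: round-tripping through `ast.parse` and `ast.unparse` (Python 3.9+) stands in for a canonical formatter here. That substitution is an assumption for illustration only, and `normalize` is a made-up helper name.

```python
import ast


def normalize(source: str) -> str:
    """Re-emit source in one canonical layout; comments and spacing are dropped."""
    return ast.unparse(ast.parse(source))


# Same logic, different layout and a comment.
loose = (
    "def squares( n ):\n"
    "  out=[]   # accumulator\n"
    "  for i in range( n ):\n"
    "    out.append( i*i )\n"
    "  return out\n"
)
tidy = (
    "def squares(n):\n"
    "    out = []\n"
    "    for i in range(n):\n"
    "        out.append(i * i)\n"
    "    return out\n"
)
# Same result, different structural choice.
comprehension = "def squares(n):\n    return [i * i for i in range(n)]\n"

print(normalize(loose) == normalize(tidy))
print(normalize(tidy) == normalize(comprehension))
```

The first comparison comes out equal and the second does not: a formatter flattens layout, but the choice between a loop and a comprehension is still there for a fingerprint to pick up.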


It is not about tabs, spaces and variable names. The main difference is how long your functions are, how many functions they call, how deep control structures within functions are etc.
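The three measurements named above (function length, number of calls, nesting depth) can be sketched with Python's `ast` module. This is an illustration under my own assumptions, not the paper's feature set: the names `max_nesting` and `function_metrics` are hypothetical, `end_lineno` needs Python 3.8+, and `elif` chains count as nested `if`s in this simple version.

```python
import ast

# Node types treated as "control structures" for nesting purposes (an assumption).
NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def max_nesting(node: ast.AST, depth: int = 0) -> int:
    """Deepest level of nested control structures found below node."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, NESTING_NODES) else depth
        deepest = max(deepest, max_nesting(child, child_depth))
    return deepest


def function_metrics(source: str) -> dict:
    """Map each function name to its length, call count and nesting depth."""
    metrics = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            calls = sum(isinstance(n, ast.Call) for n in ast.walk(node))
            metrics[node.name] = {
                "lines": node.end_lineno - node.lineno + 1,
                "calls": calls,
                "max_nesting": max_nesting(node),
            }
    return metrics


sample = """
def total(items):
    result = 0
    for item in items:
        if item > 0:
            result += abs(item)
    return result
"""

print(function_metrics(sample))
```

For the sample, `total` spans 6 lines, makes 1 call (`abs`) and nests control structures 2 deep (`for` containing `if`).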


Would be interesting to tie that to ratings. Have a diverse set of people read lots of code (samples from projects large and small, in different languages, etc.) and rate it on things like "easy to understand" and "easily extendible", then correlate that with styles. I'd be very interested in how I score and what I can improve.


It sounds like you're talking about cyclomatic complexity.

https://en.wikipedia.org/wiki/Cyclomatic_complexity

There are tools out there for most major programming languages for measuring cyclomatic complexity. They can tell you whether typical coders will be able to help you work on your program at its current complexity level.
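For a feel of what such tools compute, here is a rough approximation: cyclomatic complexity is one plus the number of decision points. Which constructs count as decision points varies between tools, so the `DECISION_NODES` list below is an assumption, and `cyclomatic_complexity` is an illustrative helper, not any particular tool's implementation.

```python
import ast

# Constructs counted as one decision point each (tools differ on this list).
DECISION_NODES = (
    ast.If,
    ast.For,
    ast.While,
    ast.IfExp,
    ast.ExceptHandler,
    ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' holds three values and adds two extra branches.
            complexity += len(node.values) - 1
    return complexity


sample = """
def classify(n):
    if n < 0 and n % 2:
        return "negative odd"
    for d in range(2, n):
        if n % d == 0:
            return "composite"
    return "other"
"""

print(cyclomatic_complexity(sample))
```

The sample scores 5: the base path, two `if`s, one `for`, and one `and`.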


I heard of this but haven't seen it used anywhere outside of school. Do you have any experience whether this correlates with how readable people find code to be?


No, sorry. Actually, I feel like long functions are often clearer than refactored equivalents.


Okay, thanks for responding!


My style changes over time.


As long as it changes gradually, or partially, that might not matter.


I'm also wondering if this transfers between programming languages, and to what degree.


Is it good or bad if a programming language leaves room for personal style?


This could probably be used to do some sort of code analysis and give people ideas for how to structure their code differently.

Similar to how there are some automated tools to help you vary your writing more.


Hehe, seems like we are all going to commit to GitHub through an encoder that produces a blended, typical style, and pull through a decoder that restores our own unique styles, in order to obfuscate who we are. Like those BlendIt extensions in Firefox that pretend we run Win 7 instead of Linux, etc.


Hey, I'm a TA for some CS courses. Can this be used to find similar codebases? Maybe it is useful to find similar programmers, find possible code copies and/or identify the author of a snippet. Any ideas?


People often talk about using MOSS for this: https://theory.stanford.edu/~aiken/moss/
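As a toy illustration of token-based similarity checking (this is not MOSS's actual algorithm, just a sketch under my own assumptions): replace every identifier with a placeholder so renaming doesn't help, cut the token stream into overlapping k-token windows, and compare the two sets of windows with Jaccard similarity. The function names and the default `k=4` are made up for the example.

```python
import io
import keyword
import tokenize

# Token types that carry layout or comments rather than program content.
SKIP = {
    tokenize.COMMENT,
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def normalized_tokens(source: str) -> list:
    """Tokenize source, mapping every non-keyword name to the placeholder 'ID'."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")
        else:
            tokens.append(tok.string)
    return tokens


def shingles(tokens: list, k: int = 4) -> set:
    """All overlapping windows of k consecutive tokens."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(normalized_tokens(a), k), shingles(normalized_tokens(b), k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


original = "def add(a, b):\n    return a + b\n"
renamed = "def plus(x, y):\n    # sum two numbers\n    return x + y\n"
unrelated = "for i in range(10):\n    print(i)\n"

print(similarity(original, renamed), similarity(original, unrelated))
```

Renaming variables and adding a comment leaves the score at 1.0, while the unrelated snippet shares no windows with the original. On short assignments many honest solutions will also score high, so a number like this is a prompt for a human look, not a verdict.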


There's only so many ways to do the toy problems that people get assigned.

Students' code winds up looking like their sources because they copy the control structures, and there are only so many sane ways to do those problems without introducing fluff.

At least this is what my friend's experience using plagiarism checking tools was.


> because there's only so many sane ways to do those problems

Indeed, that's been my experience too; if you assign something "Hello world" level, you're going to get a bunch of 100% correct solutions that look almost the same, and then the other... odd ones which are almost certainly incorrect in some way.

It's only when you assign something far more nontrivial, and with many correct ways to do it, that you will get more diverse solutions; and even then, the chances of the correct solutions being very similar are high.


Authorship attribution is the main application discussed in the paper, and there is some thought given to the possibility of detecting ghost writers (i.e. detecting if you let someone else write your code).

I think it's promising, but to evaluate it against existing solutions and make it production ready is probably still a good chunk of work.

For simple code similarity there are Moss, JPlag and Sherlock (linked in the paper).


Your students are probably hiring people off Wyzant to do the problems for them. It's impossible to identify the author when they're a willing participant.


Why are you interested in such applications?


"TA" means "teaching assistant." I think the OP is asking about this to detect cheating and/or plagiarism.


Do you guys know any good resources on stylometry?


(2015)


Thanks, updated.



