I did a lot of work on authorship attribution for natural language.
I found it was easy to get mind-blowing results on a single dataset - e.g. 99% accuracy for working out who wrote which emails in the Enron [0] dataset.
The moment the test set was even slightly cross-genre, accuracy dropped sharply.
So this is great in theory, but I would love to see a test set from the same authors outside the context of the Google Code Jam. I'd be surprised if the results were anything like as good.
There are very few real-world cases where you have a large amount of same-genre data for the questioned author. (No one writes a thousand suicide notes / ransom demands / scams)
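To make that concrete, here is a minimal sketch (not my actual pipeline, and assuming scikit-learn; the data loading is left out) of the kind of within-corpus evaluation that produces those inflated numbers:

    # A minimal sketch of within-corpus evaluation for authorship attribution.
    # Assumes scikit-learn; `texts` and `authors` are parallel lists you load yourself.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    def within_corpus_accuracy(texts, authors):
        """Character n-gram stylometry scored by random splits of ONE corpus."""
        clf = make_pipeline(
            TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
            LogisticRegression(max_iter=1000),
        )
        # Topic and genre cues from the single corpus leak into the model here,
        # which is why the number looks so good. A fair test trains on one corpus
        # (e.g. emails) and scores on a different genre from the same authors.
        return cross_val_score(clf, texts, authors, cv=5).mean()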
Great point. A system needs to be forward-tested, not simply back-tested on a known set of data.
Flawed back-testing is absolutely standard in stock/bond/forex/etc. trading. There are a million hustlers who sell their back-tested systems. In essence, without forward testing all you have is an untested hypothesis.
I would be very interested to know how the accuracy changes if you remove symbols and strings from the analysis.
Symbols and strings are free-form text that would seem to leave a lot more of a stylistic fingerprint. If you could get this level of accuracy from machine code alone, I would be very surprised.
The article says that stripping symbols reduced accuracy by 24%, a far greater drop than from optimizing the binary. That leads me to believe that a large part of the classification success comes from symbols and strings.
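For anyone who wants to try that ablation themselves, a rough sketch (assuming GNU binutils is installed; the classifier itself is left out):

    # Rough sketch of the symbol/string ablation, assuming GNU binutils
    # (strip, strings) is on PATH. The classifier itself is omitted.
    import shutil
    import subprocess

    def make_stripped_copy(binary_path, out_path):
        """Copy the binary and remove its symbol table and debug info.
        Note: string literals in .rodata survive stripping, so removing
        those from the feature set has to be handled separately."""
        shutil.copy(binary_path, out_path)
        subprocess.run(["strip", "--strip-all", out_path], check=True)

    def embedded_strings(binary_path):
        """Printable strings embedded in the binary -- the free-form text
        that plausibly carries much of the stylistic fingerprint."""
        result = subprocess.run(["strings", binary_path],
                                capture_output=True, text=True, check=True)
        return result.stdout.splitlines()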
This isn't a surprising result if you know what competition programming code looks like.
Everyone has their own distinctive personal library/template/boilerplate that they paste in for every problem, since the includes, typedefs, defines, library helpers, I/O, etc. are always the same.
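A quick way to see this for yourself (file paths here are hypothetical, not from the paper): measure how much of two submissions by the same author is literally identical text, e.g. with a sketch like:

    # Quick illustration of the template effect: how much of two Code Jam
    # submissions by the SAME author is literally identical text
    # (includes, typedefs, defines, I/O helpers, ...). Paths are hypothetical.
    import difflib

    def shared_line_ratio(path_a, path_b):
        """Similarity ratio (0..1) between two source files, line by line."""
        with open(path_a) as fa, open(path_b) as fb:
            a, b = fa.readlines(), fb.readlines()
        return difflib.SequenceMatcher(None, a, b).ratio()

    # For many competitive programmers this ratio stays high even across
    # unrelated problems, because the surrounding template barely changes.
    print(shared_line_ratio("author1_problemA.cpp", "author1_problemB.cpp"))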
Very interesting. The accuracy achieved is quite impressive -- I suspect it would be lower if the technique were applied in practice, since the assumptions made are quite strong. Under looser assumptions (no knowledge of the compiler/optimization level, the highest optimization level enabled, a larger pool of candidate authors), the technique is less accurate.
In practice, it might be interesting to give a fuzzier output - the estimated probability that an individual wrote the code for a certain binary. This would help when two programmers (say, Alice and Bob) write code that yields binaries with similar properties. In the current system, either Alice or Bob is picked -- there is no way to express that it might be either of them.
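Something along these lines, assuming a scikit-learn-style classifier (the feature vector and author names are placeholders):

    # Sketch of the "fuzzier output" idea: a probability per candidate author
    # instead of a single hard pick. Assumes a scikit-learn-style classifier;
    # the feature vector and author names are placeholders.
    from sklearn.linear_model import LogisticRegression

    def author_probabilities(clf, feature_vector):
        """Return {author: estimated probability} for one binary."""
        probs = clf.predict_proba([feature_vector])[0]
        return dict(zip(clf.classes_, probs))

    # If Alice and Bob have similar styles this might give something like
    # {"alice": 0.48, "bob": 0.46, "carol": 0.06} rather than just "alice".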
Well, it could in theory be that the author of Bitcoin had previously written and released software under their real name, but without ever providing the source code for it. In that case, comparing the first binary release of Bitcoin to those executables could give a hint as to who Satoshi is/was.
> Maybe it's a huge team, so there are multiple individual signatures and it's tough to track down heterogeneous signatures.
In this case I would expect to see a multitude of code styles in the source code of the very first publicly released version of the Bitcoin client, but as far as I know nobody has pointed that out. To me this implies that if multiple programmers were involved, at least one of them "ironed over" the complete source code to make the style much more homogeneous. Under that hypothesis, though, we would expect the code to have been audited at least somewhat thoroughly. In fact, the first versions of the Bitcoin client were rather hastily put together and contained lots of bugs that were fixed afterwards. Look for example at the following comment:
So the first versions of the Bitcoin client look to me like the work of a solo person who can program, but whose day job is not writing C++, hastily putting together an experiment (the Bitcoin client).
Now let us assume there was a huge team behind it, but only 1-2 (at most 3) programmers with very similar code signatures. Of course the other team members would also have had things to do (to be worth their salary), like working out the math, etc. But if it were a huge team containing people with such specialized skills, why the hell did they hire such a bad programmer to implement it?
In this sense I believe there is strong evidence to drop the "huge team hypothesis".
"In this case I would expect that we see a multitude of code styles in the source code of the very first publicly released version of the Bitcoin client. "
Would that be the case if they used software that enforced a common coding style on the team, plus the team putting in at least a little effort to stay consistent? There's quite a bit of such software already in use in big organizations, including the defense sector.
Has it ever been floated that it might have been the work of an intelligence agency, as a means to fund black bag operations? It seems unlikely, but possible.
Yes, it's also been floated that it's the work of a super-intelligent AI that needed a system of money that it could use without providing identification documents.
The internet is a vast network where the individual components don't understand the whole, just like a brain is. How do we know the internet itself hasn't turned into a super-intelligent AI already?
Not sarcastic! I think your suggestion was quite reasonable; I was just offering another suggestion that I found interesting, albeit a slightly more implausible one :)
Alright, glad to hear it, although I maintain that I’m still a noodlebrain. I’m guessing that you have, but you might enjoy Ghost In The Shell, which runs along a similar premise to what you’re talking about.
His death in 2013 fits well with the public disappearance of Satoshi Nakamoto. Also, Dave Kleiman worked in computer forensics and was involved in the Bitcoin community from very early on.
It is well known that Dave Kleiman and the "self-proclaimed Satoshi Nakamoto" Craig Steven Wright knew each other (Wright claimed that they collaborated when creating Bitcoin). Also, the fact that Wright could convince people deeply involved in the Bitcoin community, in one-on-one interviews, that he is Satoshi Nakamoto provides strong evidence that he knew lots of obscure things about Satoshi Nakamoto "that only Satoshi Nakamoto should be able to know". If he got this information from Dave Kleiman, that is plausibly explained.
> No I don't think it's Szabo or anyone else whose name is known to me.
I have always thought the speculation about SN's identity suffered alarmingly from availability bias. Everyone wants to believe he's someone they know about. Someone who's extremely well-known for working on or writing about things extremely similar to Bitcoin. There are far more smart nobodies tinkering with things than there are top names working on those things.
Also suffers from a bias I'll call "Shakespeare didn't write Shakespeare bias." The idea is that a notable thing must have been written by a high-status and well-credentialed person, and could not have been done by a talented no-name.
> The idea is that a notable thing must have been written by a high-status and well-credentialed person, and could not have been done by a talented no-name.
I think your reasoning is interesting. Nevertheless, consider that if we hunt for SN, it is very likely to be at least a person who knows more than a little about the various topics needed to implement Bitcoin (e.g. finance, cryptographic hashed data structures, existing protocols for crypto cash, ...). I agree that this could well be a talented no-name, but I think this property provides a strong criterion for checking any claim that some specific person (a talented no-name) is SN.
Also, one would have to explain why such a person is not better known. It could be that he/she has a difficult personality. It could also be that he/she lives in a country where such a person has far fewer opportunities, etc. But these constraints again limit the circle of talented no-names who could be SN.
TLDR: Your reasoning is interesting, but your hypothesis has strong consequences.
I don't believe lack of notability needs any explanation at all. It's the default. I have plenty of [anecdotal] evidence that exceptional skill is one of many factors that might contribute to notability. Other factors include networking or self-promotion, timing, and chance. Being a lurker type myself, I've encountered many fellow lurkers in forums over the last 20 years, who demonstrated astounding expertise, but are afaik almost completely unknown in my industry. I can think of many more people who actually are at the top of the industry, but are relatively unknown simply because they do not give talks, blog, or write books. In case it wasn't obvious, my personal hypothesis is that SN was simply a very skilled lurker in the cypherpunk scene. I doubt he was Finney, Dai, or Szabo, but I suspect he was someone who, like myself, was well aware of them and hung out in the same forums and lists.
Back in the late 80s/early 90s it was quite possible to identify the author by looking at the disassembled portion of a virus (back when virtually all viruses were written in assembly).
Came here to say the same thing. Back in the early 90s one of my friends spent time obsessively studying the personal nuance in code attributed to The Dark Avenger. Even in code that had obviously been optimized, bits of personal nuance leaked through. It seems very intuitive that compiled high-level languages, being more expressive, would leak even more personal nuance.
A lot has changed. Now teams of security experts hired by governments or large organizations write their viruses to look as if they originated from someone else when the victims disassemble them and look for clues.
I immediately thought of the anecdote about World War II British SIGINT operators being able to identify individual German telegraphers by way of their distinctive style of Morse keying.
Anyway, I don't think this would ever be accepted as evidence at trial. It could be an indicator/lead for investigations, but I don't think it would be a big concern for attackers.
The conclusion that more experienced developers develop their own style is an important one. It shows that good programming is not about adopting convention.
There's a principle in art about "knowing the rules so you know when and how to break them". You can't pass off bad art as good with the excuse of "it's just my style, man". Well, maybe you can in the elite galleries, but not in the workaday world of commercial art (e.g., for advertisements, video games, comics, cartoons, etc.) You have to learn the fundamentals and work within those bounds, then once you "git gud", only then develop a personal style. I suspect the same is true of programming.
Well, conventions are usually about syntax. Then in some languages we talk about "idiomatic" solutions to certain problems. But style can definitely go beyond those two and influence other things, like problem-solving approaches and code structure, which don't necessarily contradict conventions.
[0] A large email database often used for authorship tasks https://www.cs.cmu.edu/~enron/