He submitted the Human Flesh story from NYTimes that he mentioned in his comment.
And he's only submitted 2 stories since starting his account 341 days ago. So the story must have meant a lot to him, 1st to submit it, 2nd to mention it in that comment.
Also, (s)he's only made 18 comments in total, and his last comment was 104 days ago, yet the flesh story was submitted 13 days ago. So he's careful with his comments, and probably careful with his identity and privacy too, so much so that CitizenParker has no bio info. And look at the name CitizenParker (sort of like calling yourself John Smith on a sample credit card): generic naming could be important to CitizenParker, something he's conscious of and would write about, whilst also doing so anonymously.
But mainly, his style seems similar, which was what got me thinking.
Does it really matter?
Scott Parker: http://citizenparker.com/page/About-Scott-Parker.aspx
This earlier thread:
Contains this sentence:
"this thread highlights a fundamental property of a networked life: privacy is dead, there is only identity management."
but then this contradicts the original thesis:
"If harnessed properly, these things can be useful, but it requires a mindset and workflow not entirely dissimilar to those of spies or high-end criminals - controlling information by selective disclosure, identity segmentation, disinformation, anonymization, etc. - not for sinister purposes, mind you, but simply to guard what we traditionally call privacy."
I'd say the current thread offers proof that privacy can be defended. After all, here we have all these smart people, failing to identify the earlier user.
Elsewhere in this thread someone is jumping up and down telling us to stop trying to identify the poster. The funny thing is, I think he/she is in no danger at all of being identified, at least not without his/her cooperation.
The only person that could identify this user is PG, and maybe alaskamiller, and I'm pretty sure that our secrets are safe there.
edit: and if that person is the original poster then they're not helping themselves by increasing the sample size :)
Plenty of people have done so over the years; checking a box is not something that only the 'technically savvy' can do.
If you do that rarely then I think your anonymous words are reasonably safe. If you do it regularly then you are open to the kind of attack that I attempted, and then it will have a better chance of success.
Privacy is dead in a general sense: companies like Facebook, Google and Twitter facilitate identified communication. In that sense every letter you wrote using the old postal system was just as revealing; it just wasn't open to be read by the public.
People are slowly coming around about all this stuff being visible online. I can see that with the 'reocities' project, on average two people every day ask for their old account to be wiped because of privacy reasons. That's not much, but it still means that 1,000 non-technical users per year that I happen to have backed up a few pages for realize this. So if you extrapolate that to the internet at large I think that the number of users that are wising up to this is much larger than you'd expect at first glance.
Time will tell if there will be enough support for this, the 'think of the children' and 'war on terror' people seem to have the advantage for now, but laws that are enacted can in due course be repealed.
I've never bothered to hide my identity, there is nothing that I have to say that I wouldn't put my name to, even if not all of it is received equally well, that doesn't bother me (maybe it should).
There are people in positions that are sensitive that have stuff to tell us, in such cases (which are rare) anonymity really serves a purpose and I think this little experiment shows that without at least access to some log files these exercises get a lot harder.
edit: your thread definitely isn't 'dead'.
Imagine the hassle of creating brand new identities for every action taken online. If privacy isn't dead, then it sure as hell is tough to maintain (and to do so would not be very practical).
For slightly more details, here's a sketch of the algorithm:
Treat each comment as a "document" input to LDA. Use the theta matrix that represents the distribution of topics over each document. Then use the inverse dot product between two document theta vectors and perform k Nearest Neighbors to predict IDs. You should be able to tune the rank and k values from all the labelled data.
When it comes time to infer, I suggest running the whole set through LDA instead of reusing the discovered alpha and beta. For some reason (which I'm not entirely sure of), my results seem much better that way.
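A rough sketch of the nearest-neighbour step described above, assuming the per-comment topic vectors (rows of theta) have already been produced by some LDA implementation. The user IDs, the vectors and the `eps` guard are made up for illustration; this is not the commenter's actual code.

```python
from collections import Counter

def inverse_dot_distance(a, b, eps=1e-9):
    """Distance that shrinks as two topic mixtures overlap more."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 / (dot + eps)  # eps (an assumption) avoids division by zero

def knn_predict(query_theta, labelled, k=3):
    """labelled: list of (user_id, theta) pairs.
    Returns the most common user among the k nearest comments."""
    nearest = sorted(
        labelled,
        key=lambda item: inverse_dot_distance(query_theta, item[1]),
    )[:k]
    votes = Counter(user for user, _ in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical theta rows for a 3-topic model.
labelled = [
    ("user_a", [0.8, 0.1, 0.1]),
    ("user_a", [0.7, 0.2, 0.1]),
    ("user_b", [0.1, 0.8, 0.1]),
    ("user_b", [0.2, 0.7, 0.1]),
    ("user_c", [0.1, 0.1, 0.8]),
]
prediction = knn_predict([0.75, 0.15, 0.10], labelled, k=3)
print(prediction)
```

The number of topics and k would both be tuned on the labelled comments, as suggested above.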
Maybe I'm being too clever for my own good...
Assuming it's not you, we have another area of comparison that is being overlooked. Besides OTToken's text patterns we also have when the comments were left. So we can throw out certain people that never comment during the hours of the day that OTToken did.
Also OTToken is obviously familiar with HN and has an alternative account by his own implication. Likely he was reading/commenting on HN then decided to make the account. OTToken made multiple comments over two hours so we could also try and pick people whose comments adjoin that timeframe. Specifically people who made comments in advance of OTToken's comments, but not at the same time.
He can switch accounts to make alternative comments, but it's unlikely he was making comments from two separate accounts simultaneously.
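The two timing filters above could be sketched like this, assuming you already have comment timestamps per user. The usernames, the timestamps and the one-hour window are all invented for the example.

```python
from datetime import datetime, timedelta

def hours_overlap(target_times, candidate_times):
    """True if the candidate has ever commented in an hour of day the target used."""
    target_hours = {t.hour for t in target_times}
    return any(t.hour in target_hours for t in candidate_times)

def adjoins_without_overlap(target_times, candidate_times,
                            window=timedelta(hours=1)):
    """True if the candidate commented shortly before the target's first
    comment but not while the target account was active."""
    start, end = min(target_times), max(target_times)
    before = any(start - window <= t < start for t in candidate_times)
    during = any(start <= t <= end for t in candidate_times)
    return before and not during

# Hypothetical data.
target = [datetime(2010, 3, 17, 14, 5),
          datetime(2010, 3, 17, 15, 10),
          datetime(2010, 3, 17, 16, 0)]
candidates = {
    "user_a": [datetime(2010, 3, 17, 13, 40), datetime(2010, 3, 16, 14, 20)],
    "user_b": [datetime(2010, 3, 17, 3, 0), datetime(2010, 3, 17, 4, 15)],
    "user_c": [datetime(2010, 3, 17, 13, 50), datetime(2010, 3, 17, 15, 0)],
}

plausible = [u for u, ts in candidates.items() if hours_overlap(target, ts)]
shortlist = [u for u in plausible
             if adjoins_without_overlap(target, candidates[u])]
print(plausible, shortlist)
```

Neither filter proves anything on its own; they only shrink the pool before any text comparison.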
Noodle already clued in to that:
Also, if you could check my comment history (which you can't because it seems to time out on HN's server) you'd see that my comment speed is usually fairly quick in threads where I'm active.
This is merely a naive guess. He's the only other user on Hacker News (according to Google) to use the term "people search engines". He also seems to have been working in the data mining business.
He also seems to comment heavily on technical issues - programming languages, database technologies, so it might make sense for him to feel heavy non-tech opinions deserve a onetimetoken. Similarly reasoned, the sentiment of "privacy is dead, there is only identity management" seems to be a realization appropriate for someone who recently started working on YC-funded companies. Seems convincing to me, but since they're all reverse-justifications, probably best to take them with a grain of salt: you might be able to draw similar conclusions combing through many other comment histories.
Interesting that this "human [powered] search engine" style of identification might have been faster than devising a machine heuristic.
Edit: ok one other thing strikes me: use of the word "ton" doesn't quite fit with the rest of the language.
"...his brother recognized Ted's style of writing and beliefs from the manifesto, and tipped off the FBI." from http://en.wikipedia.org/wiki/Theodore_Kaczynski
I am concerned with online privacy, but not to the same extent as "onetimetoken" (my FB profile is globally viewable). Also my comments are usually short, and I avoid big generalizations.
Looking at the thread in question, though, I'd definitely guess jgrahamc.
Edit: The intention behind this was to keep it structured and organized, contest-like, and not for karmic purposes, which I take is the reason for the downvotes.
 See http://en.wikipedia.org/wiki/ICFP_Programming_Contest#Prizes
Hacking and puzzling are intricately interwoven anyway, especially debugging. It's no wonder that plenty of hackers have hobbies like lockpicking.
Enough parentheses. I'll just go ahead and pledge 10 pounds per week to it. Perhaps we should discuss more by email?
(More later, I'll have to go to bed now.)
Seems a bit roundabout without any real advantage.
Daniel's tag is all it really needs.
Also posting bets and searching for someone to take the other side (or be the arbiter--in case any is needed) could be interesting. Similar to http://www.longbets.org/, but embedded into HN and not focussed on long-term bets.
Edit: I have opened a new top-level post about this topic. See http://news.ycombinator.com/item?id=1200153
1. Does this user object to being identified?
2. How will you know you succeeded?
Also, where's randomwalker when you need him?
2) you can't be sure, unless the person will confirm using the original 'one time' account.
I base this on eru's phrasing and use of "dissimilar" in the post in question, which can also be noted here:
My strategy was to look for unusual words and phrases and do a google site search for those phrases.
Additionally, eru's posts in this thread indicate an interest in privacy, and eru's activity pattern is both frequent and recent, which I would expect to be true for the poster.
Here, eru even taunts us a bit:
They did not issue a challenge to be identified--in fact they agree with the notion that privacy is dead, which seems to be what you're trying to prove with this exercise. They may have serious reasons for using a one time account.
If your name is one of the (very random) guesses in this post, please neither confirm nor deny that the user is you, since this could identify that user by elimination.
This item should not have so many points. The post is rubbish. A 275 word sample is long, but likely insufficient given the pool of candidates. The post did not explain what methods were used, what work in authorship identification influenced his approach, nor did he provide his ranked findings. The tries are actually failed guesses, rather than, say, different algorithms attempted. This item has now devolved into a guessing game, rather than a coding exercise.
Again, stop trying to identify this user.
written with a one time account
The post didn't but the original thread did, I tried matching the vocabulary of the samples to the corpus of HN comments.
> what work in authorship identification influenced his approach
This is not a scientific paper.
> nor did he provide his ranked findings.
I'm not giving my ranked results because I think two attempts from me is enough.
> The tries are actually failed guesses, rather than, say, different algorithms attempted.
They were the #1 and #2 outputs of my code.
> This item has now devolved into a guessing game, rather than a coding exercise.
No-one said that you had to guess, but human guesses are also powered by computation at some level, even if it would be very hard to figure out exactly what went on.
> Again, stop trying to identify this user.
If that request were posted by 'onetimetoken', who posted three times, then it would have some credibility.
If you are not him/her, why does this upset you?
The 'one time account used as a rhetorical device' says fairly clearly that it is just a gimmick, not some kind of terrible secret.
And if you are 'onetimetoken' you are increasing the sample size ;)
My apologies for being ambiguous.
Says it all.
Evidence that the comment does not come from my keyboard:
I would never write a ponderous sentence like "I fully agree with the sentiment that inspires your statement" as the opening sentence of a post.
I wouldn't write "Without even noticing it we are whoring out our privacy and intimate patterns," because I consider "whoring out" a crude expression, too crude for the polite, learned conversation I expect on HN.
The phrase "it requires a mindset and workflow not entirely dissimilar to those of spies" reminds me of George Orwell's "One can cure oneself of the not un- formation by memorizing this sentence: A not unblack dog was chasing a not unsmall rabbit across a not ungreen field." I may occasionally write like that, if I am composing a sentence as I type, but I try not to.
He is not me. I think he is male.
As for the 'learned conversation', I think that sets the bar a bit high, as long as it is polite and somehow coherent I'm fine with it :)
"tokenadult" = the included minority adult
"onetimetoken" = account used once, like putting a disposable token into a machine
Though I'm not one to speak, I've used "smokinn" (or "Smokinn" which I generally prefer) for, I believe, 16 years now.
Reminds me of "Blink", Malcolm Gladwell's book.
But, don't know if Clayton Donley is on HN or not..
His Bio, at Oracle:
Clayton Donley, Sr. Director, Development
Currently run the dev organization for some of Oracle's security and identity management products. Landed here after selling OctetString in 2005. Before that held various roles at IBM, Motorola, and as an independent consultant. Also wrote LDAP Programming in 2001.
For example, the comment has "/" in google/facebook, while the blog has "Google-Facebook".
What made me curious was not just the topic, and the identity management, google-facebook, but the fact that he is in the field of security/identity management.
But as I said, I don't know if he is on HN or not :)
> 1 point by onetimetoken 23 hours ago | link
> I was just trying to empasize my point. Just out of curiosity, how would you go about identifying me?
Edit: ignore, just realised SearchYC doesn't respect "" and looks for near match words.
(i) The first search revealed "gstar", but although they both have a similar writing style, gstar doesn't participate actively in privacy discussions (based on the query only)
(ii) The second search revealed "astine": now this is interesting because this user has a very active participation on privacy discussions, especially I think he was inspired by _why.
EDIT: [ORG] Based on the original comment at http://news.ycombinator.com/item?id=1197027
He's my prime suspect: http://news.ycombinator.com/item?id=1200739
It's gotta be him.
Smart idea; but I just tried it and no results.
The way it works in real life, is that you find a person's email address or a long term account on a forum, and then use that info to build up a full profile about that person. The longer the person is on the web, the more personal information they've revealed in the past.
i.e. 6 months ago they might have mentioned their phone #...so you can use whitepages to see their address. Or maybe they posted a link to their site..where they didn't have privacy enabled, so you can get the full name and address using whois. Or maybe they are using the same username on all sites, so you can use google to see all the forums they've ever posted on. etc
I doubt PG would appreciate all of us hammering the HN server to collect the data.
They seemed interested in the thread as to how they might be found, but don't seem to have given any permission for a site-wide (wo-)manhunt. (This might have happened out of band, though.) I guess they do say that they were just using a one time account for rhetorical emphasis.
Further, if this were to really be a contest, it seems like there should be some sort of rules, such that the result isn't determined just by exhaustion of currently in use usernames by guessers.
Walked through the results for each word, pulling in all the usernames:
intersection of: pure-ad dissimilar
intersection of: CTRs disinformation
intersection of: CTRs dissimilar
['ivankirigin', 'patio11', 'strlen']
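A minimal sketch of that intersection step, assuming you have already collected, for each unusual word, the set of usernames whose comments contain it. The names and sets below are placeholders, not real search results.

```python
from itertools import combinations

def pairwise_intersections(users_by_word):
    """For every pair of rare words, list the users who have used both."""
    result = {}
    for w1, w2 in combinations(sorted(users_by_word), 2):
        common = users_by_word[w1] & users_by_word[w2]
        if common:
            result[(w1, w2)] = sorted(common)
    return result

# Hypothetical search results per word.
users_by_word = {
    "CTRs": {"user_a", "user_b"},
    "dissimilar": {"user_b", "user_c"},
    "disinformation": {"user_d"},
    "pure-ad": {"user_c"},
}

hits = pairwise_intersections(users_by_word)
for words, users in hits.items():
    print("intersection of:", *words, "->", users)
```

Pairs with an empty intersection are dropped, so only word combinations that actually narrow the pool show up.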
Is there an easy way to download the corpus of comments?
This person has said that we can find out his identity... if at all possible. Therefore, you are welcome to search based on his own permission.
Total words: 365
Female = 326
Male = 616
Difference = 290; 65.39%
Female = 348
Male = 450
Difference = 102; 56.39%
Verdict: Weak MALE
Weak emphasis could indicate European.
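For anyone checking the numbers: the percentages above are consistent with the male score taken as a share of the combined score. That reading is inferred from the figures themselves, not stated by the tool's output, and the function name is just for illustration.

```python
def male_share(female, male):
    """Male score as a percentage of the combined female + male score."""
    return 100.0 * male / (female + male)

first = male_share(326, 616)   # first pair of scores above
second = male_share(348, 450)  # second pair of scores above

print(616 - 326, f"{first:.2f}%")
print(450 - 348, f"{second:.2f}%")
```

The second share is much closer to 50%, which fits the "weak" verdict.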
Do you have a corpus?
Further, and this is an open question, is there an archive/downloadable corpus of HN in part or entirety anywhere? It would be fascinating and I'd love to keep a copy to look back at in years to come.
Yes, that's why I thought it would be an easy challenge.
My bad :)
> Further, and this is an open question, is there an archive/downloadable corpus of HN in part or entirety anywhere?
Yes, you can query the google cache. It's fairly easy to do.
The only thing you don't get that way is the stuff you can see as a logged in user.
You could, but I'm not trying that again. It's easy to get your IP busted depending on what random algorithm Google decides to run each week.. <g>
Seems like a good thing to verify first. Maybe you already got the guy and he just said "no it's not me."
Of course you can be paranoid, but I think the bigger chance is the author seeing a chance here at sowing some disinformation. Such as participating in this thread and giving false pointers and / or confusing the issue.
For the really paranoid, of course the last person to participate in this thread is 'the one'...
It wasn't me, that's for sure :)
The original poster could log in and disclose who found him first.
"The point being,"
", mind you,"
"I fully agree with"
"highlights a fundamental"
":" some text "," some text "."
" - e.g."
Whatever user has the most instances of these signature phrases is likely your man.
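Counting those signature phrases per user could look roughly like this, assuming a dict of comments per user is available. The sample comments are invented, and the punctuation pattern (":" some text "," some text ".") is left out to keep the sketch short.

```python
from collections import Counter

SIGNATURE_PHRASES = [
    "The point being,",
    ", mind you,",
    "I fully agree with",
    "highlights a fundamental",
    " - e.g.",
]

def rank_by_signature_phrases(comments_by_user, phrases=SIGNATURE_PHRASES):
    """Return (user, count) pairs, highest phrase count first."""
    scores = Counter()
    for user, comments in comments_by_user.items():
        for text in comments:
            lowered = text.lower()
            scores[user] += sum(lowered.count(p.lower()) for p in phrases)
    return scores.most_common()

# Hypothetical comments.
comments_by_user = {
    "user_a": ["I fully agree with the idea. It is not a crime, mind you, to post."],
    "user_b": ["Nothing special here."],
}
ranking = rank_by_signature_phrases(comments_by_user)
print(ranking)
```

Raw counts favour prolific commenters, so dividing by each user's total word count would probably be fairer.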
* collective pause to think
* pure-ad parked
These suggest to me a [highly proficient] non-native speaker too. "pause for thought" and "pure ad-parked" are correct versions.
* "intimate patterns" is an unusual turn of phrase in this context, would probably be "personal usage patterns"
* "high-end criminals" looks like an unusual hyphenation
This search gives a name - http://www.google.com/search?q=%22identity+management%22+roi....
The other things you point out encourage me to share your opinion, however.
I'm from the UK too.
"high end" has more uses than "high-end." For example, "I bought a car at the high end of my budget." In that case, "high-end" wouldn't make sense. In "datacenters have been targeted by high-end criminals," however, "high-end" is a compound adjective.
Alternatively, you could drop the hyphen and/or form an entirely new word: "highend." The word "highend" doesn't seem to have caught on yet, though. I suspect that's because "upmarket" covers the same meaning already and is less susceptible to these morphological mishaps.
(On seeing what OS X had to suggest as a correction for "highend," it suggested both "high end" and "high-end.")
On the other hand that isn't proof of anything, but I think the chances are higher that someone who didn't do it will own up, to throw sand in the eyes of the searchers, than the reverse.
But then again, maybe I'm a sucker and I believe that people in general are honest and trustworthy. So far that seems to me to be a better assumption than the reverse.
Relies on a person's vocabulary and writing style more than the actual words.
[edit: disregard - I hadn't read deep enough to see that this was essentially the approach you had taken]
Still, it's easy to get into a sort of confirmation bias looking at this stuff manually, and seeing things that fit while missing things that don't.
Disprove that pink elephants exist vs prove that gray ones do.
Which one is the easier challenge?
edit: oh right, i also looked at the fact that he was posting comments on HN around the time that the one in question was posted. a lot of my other candidates didn't meet that data point.
Ah, very clever, another angle of attack. Never thought of that one.
He thought it was long enough that he could pretty trivially have a program compare the writing style to other HN comments and determine who it was, but he failed. So it's a challenge for other hackers: can you write a program that can determine who said something simply based on the writing style and knowing that he's a member of a reasonably small sample (HN users)?
PG would have an easy time of it (log files) :)
This is effectively a puzzle, a reasonably hard one (my attempt failed, but not for lack of trying, I spent a fair number of hours on it before making my guesses, of course it could simply be that I'm stupid), and one that seems fun to solve.
It is exactly the kind of thing that I enjoy doing when it comes to programming in the first place, figure out how stuff works and/or solving reasonably hard problems. One step above my current competence is my favorite, that way I'm reasonably sure I can solve the problem, if the difference is too big then I tend to get stuck.
It's of course a bit like the question why people climb Mount Everest, the answers are: because they can and because it's there.
edit: Funny, I thought your downmod for asking a valid question was unfair, in return I get downmodded for answering :)
However, this thread highlights a fundamental property of a networked life: e.g., etc. http://news.ycombinator.com/threads?id=onetimetoken
A typical case of me and my big mouth.
I figured that a basic analysis should reveal who wrote this:
Because of the size of the sample. So I wrote a bit of code to compare against other HN comments, and figured that that would turn up the user quickly.
But I was wrong: after two tries (Daniel Markham and John Graham-Cumming) I have to admit that my simple analysis has failed.
So, who will take up the challenge? Can you identify this user somehow?
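For anyone who wants a starting point, here is one simple way to compare a sample's vocabulary against other users' comments: bag-of-words cosine similarity. This is a swapped-in illustration, not the code used for the two failed guesses, and the users and comments are made up.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lower-cased word frequencies."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_candidates(sample, comments_by_user):
    """Return (user, similarity) pairs, most similar vocabulary first."""
    target = word_counts(sample)
    scores = {
        user: cosine(target, word_counts(" ".join(comments)))
        for user, comments in comments_by_user.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical corpus.
sample = "privacy is dead, there is only identity management"
comments_by_user = {
    "user_a": ["identity management is the only privacy left"],
    "user_b": ["I like fast cars and loud music"],
}
ranking = rank_candidates(sample, comments_by_user)
print(ranking)
```

Plain word counts are dominated by topic rather than style, which may be one reason a simple analysis like this misses.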
This site normally gets 10 visitors / day or so...
I posted it on my blog because you can't post HN links on HN!