A challenge, identify this HN user, I tried twice and failed (jacquesmattheij.com)
129 points by jacquesm 2659 days ago | 144 comments



It could well be CitizenParker. http://news.ycombinator.com/user?id=citizenparker

He submitted the Human Flesh story from NYTimes that he mentioned in his comment. http://news.ycombinator.com/item?id=1167615

And he's only submitted 2 stories since starting his account 341 days ago. So the story must have meant a lot to him, 1st to submit it, 2nd to mention it in that comment.

Also, he's only made 18 comments in total, and his last comment was 104 days ago, yet the flesh story was submitted 13 days ago. So he's careful with his comments, and probably careful with his identity and privacy too, so much so that CitizenParker has no bio info. And look at the name CitizenParker (sort of like calling yourself John Smith on a sample credit card): generic naming could be important to CitizenParker, something he's conscious about and would write about, whilst also doing that anonymously.

But mainly, his style seems similar, which was what got me thinking.

http://news.ycombinator.com/threads?id=citizenparker

http://searchyc.com/user/citizenparker?only=comments

Does it really matter?

edit: http://citizenparker.com/ Scott Parker: http://citizenparker.com/page/About-Scott-Parker.aspx


Is the question serious, or is it meant as a joke? As a riposte to the earlier thread, it is fantastic. The user writes "privacy is dead" and here, on Hacker News, and also on jacquesmattheij.com, you have a thread with a lot of intelligent people trying to figure out the person's identity, and failing. Therefore, the user's original point is disproven simply by starting this new thread. If this was deliberate, then this was genius.

This earlier thread:

http://news.ycombinator.com/item?id=1197027

Contains this sentence:

"this thread highlights a fundamental property of a networked life: privacy is dead, there is only identity management."

but then this contradicts the original thesis:

"If harnessed properly, these things can be useful, but it requires a mindset and workflow not entirely dissimilar to those of spies or high-end criminals - controlling information by selective disclosure, identity segmentation, disinformation, anonymization, etc. - not for sinister purposes, mind you, but simply to guard what we traditionally call privacy."

I'd say the current thread offers proof that privacy can be defended. After all, here we have all these smart people, failing to identify the earlier user.


Exactly! That's the whole idea here, I was quite surprised that my first solution (with a very high correlation) failed, even more surprised when the second one failed as well (especially since that person had been commenting in the same thread and had a very high correlation as well).

Elsewhere in this thread someone is jumping up and down telling us to stop trying to identify the poster. The funny thing is, I think he/she is in no danger at all of being identified, at least not without his/her cooperation.

The only person that could identify this user is PG, and maybe alaskamiller, and I'm pretty sure that our secrets are safe there.

edit: and if that person is the original poster then they're not helping themselves by increasing the sample size :)


However, as I tried pointing out (in a seemingly dead thread: http://news.ycombinator.com/item?id=1200091), identity management is something that only the technically savvy can pull off. And even they are likely to stumble at some point, because being perfect in every way is inhumanly hard. And so, as it stands, privacy is dead in this age.


On /. posting anonymously is as simple as checking a box, even if you have an account.

Plenty of people have done so over the years, checking a box is nothing that only the 'technically savvy' can do.

If you do that rarely then I think your anonymous words are reasonably safe. If you do it regularly then you are open to the kind of attack that I attempted, and then it will have a better chance of success.

Privacy is dead in a general sense, companies like facebook, google and twitter facilitate identified communication and in that sense every letter you wrote using the old postal system was just as revealing, it just wasn't open to be read by the public.

People are slowly coming around about all this stuff being visible online. I can see that with the 'reocities' project, on average two people every day ask for their old account to be wiped because of privacy reasons. That's not much, but it still means that 1,000 non-technical users per year that I happen to have backed up a few pages for realize this. So if you extrapolate that to the internet at large I think that the number of users that are wising up to this is much larger than you'd expect at first glance.

Time will tell if there will be enough support for this, the 'think of the children' and 'war on terror' people seem to have the advantage for now, but laws that are enacted can in due course be repealed.

I've never bothered to hide my identity, there is nothing that I have to say that I wouldn't put my name to, even if not all of it is received equally well, that doesn't bother me (maybe it should).

There are people in positions that are sensitive that have stuff to tell us, in such cases (which are rare) anonymity really serves a purpose and I think this little experiment shows that without at least access to some log files these exercises get a lot harder.

edit: your thread definitely isn't 'dead'.


"I'd say the current thread offers proof that privacy can be defended. After all, here we have all these smart people, failing to identify the earlier user."

Imagine the hassle of creating brand new identities for every action taken online. If privacy isn't dead, then it sure as hell is tough to maintain (and to do so would not be very practical).


I'm not sure I'll have the time to do this, but I've had some good results running Latent Semantic Analysis and Latent Dirichlet Allocation on a similar problem. In my case, I have data from people playing a negotiation game and having a conversation with a human actor. I have scores from a human judge going from 1 to 5. Using LDA on the transcriptions of the dialog I can predict the results of the human judge to a correlation of .5. There was a previous study with essays graded by a teacher that got .8 with LSA. The LSA study used a much larger training corpus outside the individuals.

For slightly more details, here's a sketch of the algorithm: Treat each comment as a "document" input to LDA. Use the theta matrix that represents the distribution of topics over each document. Then use the inverse dot product between two document theta vectors and perform k Nearest Neighbors to predict IDs. You should be able to tune the rank and k values from all the labelled data.

When it comes time to infer, I suggest running the whole set through LDA instead of reusing the discovered alpha and beta. For some reason (which I'm not entirely sure of), my results seem much better that way.
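The nearest-neighbour step of that sketch might look roughly like the following. This is only an illustration under assumptions: the theta vectors are taken as already produced by some LDA run (not shown), the user names and numbers are made up, and "inverse dot product" is read as "a larger dot product means a closer neighbour".

```python
from collections import Counter


def dot(u, v):
    """Dot product of two equal-length topic vectors."""
    return sum(a * b for a, b in zip(u, v))


def knn_predict(query_theta, labelled, k=3):
    """Guess the author of one comment from its topic vector.

    labelled is a list of (user_id, theta_vector) pairs, one per known comment.
    Sorting by descending dot product is the same ordering as ascending
    inverse dot product, so no division is needed.
    """
    ranked = sorted(labelled, key=lambda item: dot(query_theta, item[1]), reverse=True)
    votes = Counter(user for user, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 3-topic theta vectors for comments by known users.
labelled = [
    ("alice", [0.8, 0.1, 0.1]),
    ("alice", [0.7, 0.2, 0.1]),
    ("bob", [0.1, 0.8, 0.1]),
    ("bob", [0.2, 0.7, 0.1]),
    ("carol", [0.1, 0.1, 0.8]),
]

guess = knn_predict([0.75, 0.15, 0.10], labelled, k=3)
print(guess)
```

The number of topics and k would be tuned on the labelled comments, as suggested above.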


Simple psychology would lead me to guess that it is you Jacques. If I wanted to tell how easy it was to identify an anonymous comment, then I'd make one. I'd then publicise it, and challenge other people to crack it.

Maybe I'm being too clever for my own good...


I solemnly swear it wasn't me, but I do agree with you that would be a good prank.


Looking at how quickly you commented after OTToken and when he commented how quickly you responded, I could see why someone would think it was you. Just like the yahoo answers "questions" that are obviously setups because they are answered 1 minute after being asked.

Assuming it's not you, we have another area of comparison that is being overlooked. Besides OTToken's text patterns we also have when the comments were left. So we can throw out certain people that never comment during the hours of the day that OTToken did.

Also OTToken is obviously familiar with HN and has an alternative account by his own implication. Likely he was reading/commenting on HN then decided to make the account. OTToken made multiple comments over two hours so we could also try and pick people whose comments adjoin that timeframe. Specifically people who made comments in advance of OTToken's comments, but not at the same time.

He can switch accounts to make alternative comments, but it's unlikely he was making comments from two separate accounts simultaneously.
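The hour-of-day filter described above could be sketched like this. The timestamps and user names are invented for illustration, and the sketch assumes all timestamps are in the same time zone; it only throws out users who have never commented in any hour in which the target account posted.

```python
from datetime import datetime


def active_hours(timestamps):
    """Set of hours of the day (0-23) in which any comment was posted."""
    return {ts.hour for ts in timestamps}


def plausible_candidates(target_times, comments_by_user):
    """Keep users who have commented in at least one of the target's hours."""
    target_hours = active_hours(target_times)
    return sorted(
        user
        for user, times in comments_by_user.items()
        if target_hours & active_hours(times)
    )


# Hypothetical data: the one-time account posted in the 14:00 and 15:00 hours.
target_times = [datetime(2010, 3, 17, 14, 5), datetime(2010, 3, 17, 15, 40)]
comments_by_user = {
    "early_bird": [datetime(2010, 3, 10, 6, 0), datetime(2010, 3, 11, 7, 30)],
    "afternoon_user": [datetime(2010, 3, 9, 14, 20), datetime(2010, 3, 12, 16, 0)],
}

candidates = plausible_candidates(target_times, comments_by_user)
print(candidates)
```

The "adjoining timeframe" idea would need a second pass over the actual comment times around the two-hour window, which this sketch does not attempt.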


> Besides OTToken's text patterns we also have when the comments were left. So we can throw out certain people that never comment during the hours of the day that OTToken did.

Noodle already clued in to that:

http://news.ycombinator.com/item?id=1199768

Also, if you could check my comment history (which you can't because it seems to time out on HN's server) you'd see that my comment speed is usually fairly quick in threads where I'm active.


Or it is you nagrom.


Sadly, in this case, I am not Spartacus. That would require a mind so cunning that it makes mine hurt just to think of it ;-)


Or you! Or me...


Is it marketer?

http://news.ycombinator.com/threads?id=marketer

This is merely a naive guess. He's the only other user on hacker news (according to Google) to use the term "people search engines". He also seems to have been working in the data mining business.


From a quick look at his comments, he seems to match the other heuristics seasoup mentions in that thread: meticulously correct spelling, grammar, and punctuation, use of semicolons, and use of dashes.

He also seems to comment heavily on technical issues - programming languages, database technologies, so it might make sense for him to feel heavy non-tech opinions deserve a onetimetoken. Similarly reasoned, the sentiment of "privacy is dead, there is only identity management" seems to be a realization appropriate for someone who recently started working on YC-funded companies. Seems convincing to me, but since they're all reverse-justifications, probably best to take them with a grain of salt: you might be able to draw similar conclusions combing through many other comment histories.

Interesting that this "human [powered] search engine" style of identification might have been faster than devising a machine heuristic.


He also uses lower case for facebook, twitter etc. which is what leaped out at me about the comment we're trying to identify.

Edit: ok one other thing strikes me: use of the word "ton" doesn't quite fit with the rest of the language.


Yeah it's almost crowd sourced. Reminds me of when the cops put a letter or riddle from a serial killer in the newspaper figuring someONE out there will recognize it, as opposed to a computer recognizing it. Or maybe that just happens in the movies.


The unabomber was caught by his brother recognizing his writing.

"...his brother recognized Ted's style of writing and beliefs from the manifesto, and tipped off the FBI." from http://en.wikipedia.org/wiki/Theodore_Kaczynski


To be incredibly picky: Meticulously correct grammar would join "privacy is dead" and "there is only identity management" with a semicolon, dash, or conjunction -- or just split it into two sentences. ;)


It's an interesting guess but it's not me :)

I am concerned with online privacy, but not to the same extent as "onetimetoken" (my FB profile is globally viewable). Also my comments are usually short, and I avoid big generalizations.

Looking at the thread in question, though, I'd definitely guess jgrahamc.


Unfortunately, it wasn't me. I did post on that thread but then I was gone from HN for quite a while because of work.


Darn. Thanks for letting me know!


It'd be really interesting if we had challenges, both social and technical, posted here on HN on a weekly basis. Some of the solutions and discussions would be pretty brilliant, I think.


I'd like to propose a tag:

Challenge HN:


Sounds good. Anyone with a challenge, email me at kyro@kyrobeshay.com with title/text of the submission. I'll post them on a weekly, or even bi-weekly, basis and credit the author.

Edit: The intention behind this was to keep it structured and organized, contest-like, and not for karmic purposes, which I take is the reason for the downvotes.


People could post their own challenges, that would be fine too.


Just make a new username for your contest-organizing activity.


How about donating small prizes for the winners? Similar to ICFP programming contest bragging rights [1].

[1] See http://en.wikipedia.org/wiki/ICFP_Programming_Contest#Prizes


That would certainly spice things up.

Hacking and puzzling are intricately interwoven anyway, especially debugging. It's no wonder that plenty of hackers have hobbies like lockpicking.


Also donating prizes would give a different metric than pure karma-per-submission to order the challenges. (Though it might be hard to order bragging rights. But we should be able to find a (corporate?) sponsor who hands out 50 dollars to the charity of the winner's choice every week. (Hey, I might even be able to get the money out of my employer, if I asked--or I could just do it myself.))

Enough parentheses. I'll just go ahead and pledge 10 Pounds per week to it. Perhaps we should discuss more by email?

(More later, I'll have to go to bed now.)


Ok, I'll match your 10 pounds, whatever that works out to in my currency (euros).


i could care less if you get karma. frankly, you'd deserve it for orchestrating this.


Karma doesn't enter into it. What's the difference between posting a challenge yourself vs mailing someone and having them post it for you and credit you?

Seems a bit roundabout without any real advantage.

Daniel's tag is all it really needs.


Oh, there's something to be said for having an "official" challenge of the week. It focusses attention. Though on the other hand, having the primaries out in the court of HN may be the best approach to picking the most interesting challenges.


Informal is cool with me. Someone could just state their challenge, and the prize, if any (which need not be money), and if others think the challenge is interesting enough they could paypal the author their contribution to the pot, or they could publicly state that they want to up the stakes (or both).


Yes. Probably a common protocol will emerge.

Also posting bets and searching for someone to take the other side (or be the arbiter--in case any is needed) could be interesting. Similar to http://www.longbets.org/, but embedded into HN and not focussed on long-term bets.

Edit: I have opened a new top-level post about this topic. See http://news.ycombinator.com/item?id=1200153


Payment could be in the form of services provided by startups at HN. Subscriptions to our SaaS systems, free products, discounts, etc.


Sure. Or just bragging rights on some website. (E.g. a small line mentioning the winner on the bottom of HN, for challenges that pg thinks worthy.)


Yes that would be nice, something like the search riddles http://www.searchlores.org/cgi-bin/search?query=riddle&s...


Could be fun, but are there answers to these two questions?

  1. Does this user object to being identified?
  2. How will you know you succeeded?

I just skimmed the thread (already 107 comments), so perhaps I missed it, but I didn't see anything definitive.

Also, where's randomwalker when you need him?


1) based on his writing ('a one time account as a rhetorical device') I don't think he'd mind, also there is nothing in the comment itself that you would have to be ashamed of

2) you can't be sure, unless the person will confirm using the original 'one time' account.


I think the post was by "eru"

I base this on eru's phrasing and use of "dissimilar" in the post in question, which can also be noted here:

http://news.ycombinator.com/item?id=1159200

My strategy was to look for unusual words and phrases and do a google site search for those phrases.

Additionally, eru's posts in this thread indicate an interest in privacy, and eru's activity pattern is both frequent and recent, which I would expect to be true for the poster.

edit: Here, eru even taunts us a bit: http://news.ycombinator.com/item?id=1200060



Stop trying to identify this user.

They did not issue a challenge to be identified--in fact they agree with the notion that privacy is dead, which seems to be what you're trying to prove with this exercise. They may have serious reasons for using a one time account.

If your name is one of the (very random) guesses in this post, please neither confirm nor deny that the user is you, since this could identify that user by elimination.

This item should not have so many points. The post is rubbish. A 275 word sample is long, but likely insufficient given the pool of candidates. The post did not explain what methods were used, what work in authorship identification influenced his approach, nor did he provide his ranked findings. The tries are actually failed guesses, rather than, say, different algorithms attempted. This item has now devolved into a guessing game, rather than a coding exercise.

Again, stop trying to identify this user.

written with a one time account


> The post did not explain what methods were used,

The post didn't but the original thread did, I tried matching the vocabulary of the samples to the corpus of HN comments.

> what work in authorship identification influenced his approach

This is not a scientific paper.

> nor did he provide his ranked findings.

I'm not giving my ranked results because I think two attempts from me is enough.

> The tries are actually failed guesses, rather than, say, different algorithms attempted.

They were the #1 and #2 outputs of my code.

> This item has now devolved into a guessing game, rather than a coding exercise.

No-one said that you had to guess, but human guesses are also powered by computation at some level, even if it would be very hard to figure out exactly what went on.

> Again, stop trying to identify this user.

If that request were posted by 'onetimetoken', who posted three times, then it would have some credibility.

If you are not him/her why does this upset you?

The 'one time account used as a rhetorical device' says fairly clearly that it is just a gimmick, not some kind of terrible secret.

And if you are 'onetimetoken' you are increasing the sample size ;)
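For anyone wondering what "matching the vocabulary of the samples to the corpus" could look like in practice, here is one possible sketch. It is not the code that was actually run; it assumes a plain cosine similarity over word counts, and the users and comment texts are invented.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-cased word frequencies for a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    shared = set(a) & set(b)
    numerator = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)


def rank_authors(sample, corpus_by_user):
    """Rank users by how closely their combined comments match the sample."""
    target = word_counts(sample)
    scores = {
        user: cosine(target, word_counts(" ".join(comments)))
        for user, comments in corpus_by_user.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical corpus; a real run would use every comment by every user.
corpus_by_user = {
    "user_a": ["identity management is the only privacy left"],
    "user_b": ["ruby and python are fun languages"],
}
ranking = rank_authors("privacy is dead there is only identity management", corpus_by_user)
print(ranking)
```

Raw counts favour common words, so a real attempt would probably weight rare words more heavily (for example with TF-IDF).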


I assume this upsets the user because using a one time account indicates a desire not to be identified or associated with the posted content, and the user wants this preference to be honored.


They find it amusing, and they do not mind. If you look at the comments of the user, they authorize their own identification if we can do so.


Ah, by "the user" I meant the user that was upset, not the target of the ID hunt.

My apologies for being ambiguous.



I bet it's tokenadult. I do not have any other proof than the fact that I immediately thought of that username when I saw onetimetoken. :-)


I never use any other username besides tokenadult on the forums where I use the username tokenadult. I like to have one consistent identity wherever I post (real name some places, screen name some other places) and I'm sparing in my use of screen names, and nonexistent in my use of sock-puppets. (I have been tempted a few times, but have thus far always resisted the temptation.) Now I will go look at the comment so I can identify what about it does NOT have my writing style, and then post that in an edit to this reply.

Evidence that the comment does not come from my keyboard:

I would never write a ponderous sentence like "I fully agree with the sentiment that inspires your statement" as the opening sentence of a post.

I wouldn't write "Without even noticing it we are whoring out our privacy and intimate patterns," because I consider "whoring out" a crude expression, too crude for the polite, learned conversation I expect on HN.

The phrase "it requires a mindset and workflow not entirely dissimilar to those of spies" reminds me of George Orwell's "One can cure oneself of the not un- formation by memorizing this sentence: A not unblack dog was chasing a not unsmall rabbit across a not ungreen field." I may occasionally write like that, if I am composing a sentence as I type, but I try not to.

He is not me. I think he is male.


Guessing someone is male on HN is a pretty good bet.

As for the 'learned conversation', I think that sets the bar a bit high, as long as it is polite and somehow coherent I'm fine with it :)


No offense intended, it was just off the top of my head. Even though I do not think using a throwaway account would merit any embarrassment :)


The word "token" is used in two completely different contexts, so I don't think that's right.

EG:

"tokenadult" = the included minority adult

"onetimetoken" = account used once, like putting a disposable token into a machine


You are correct. I first used the screen name on a forum, and then another forum, where the majority of users are teenagers. The screen name doesn't fit well here on HN (where almost everyone is an adult, even though I am older than most participants), but I like to minimize my use of distinct screen names. However, I am sure by screen name searches that other people now use this same screen name.


Maybe you could switch it up and use commonadult along with tokenadult depending on the perceived demographic?

Though I'm not one to speak, I've used "smokinn" (or "Smokinn" which I generally prefer) for, I believe, 16 years now.


Sure, and of course since I was wrong this is correct. But my brain thought that using the term token for a username was sufficiently distinct to maybe be a subtle hint as to the original author (especially given the original context of the comment in question).


Or it's a misspelled "takenadult" :)


I don't think tokenadult would mind posting that comment from his own account. The person that posted this sees his HN-identity as not stating strong opinions.


or, they were willing to make a point about anonymity by creating a new act.


Also correct. I am not afraid of expressing controversial opinions here.


"I immediately thought of that username"

Reminds me of "Blink", Malcolm Gladwell's book.


Ooh, I like your guess. The writing style seems to be the same.


I just searched for "google-facebook" and "identity management" and saw a blog by the title "Google-Facebook: Identity Management in a Brave New Internet"

Link: http://blogs.oracle.com/clayton/2008/05/googlefacebook_ident...

But, don't know if Clayton Donley is on HN or not..

His Bio, at Oracle:

Clayton Donley, Sr. Director, Development

Currently run the dev organization for some of Oracle's security and identity management products. Landed here after selling OctetString in 2005. Before that held various roles at IBM, Motorola, and as an independent consultant. Also wrote LDAP Programming in 2001.


I think everything matches here except that Mr. Donley capitalizes "F"acebook. The comment in question has these as lowercase, which has already been mentioned.


Well, you're forgetting that most people, well at least I am, are more careful when writing formal blog posts vs simple HN comments. So, the minor differences can be attributed to that.

For example, the comment has "/" in google/facebook, while the blog has "Google-Facebook".

What made me curious was not just the topic, and the identity management, google-facebook, but the fact that he is in the field of security/identity management.

But as I said, I don't know if he is on HN or not :)


How about "martythemaniak"?

Quote:

> 1 point by onetimetoken 23 hours ago | link

> I was just trying to empasize my point. Just out of curiosity, how would you go about identifying me?

http://searchyc.com/empasize


Only two posts - http://searchyc.com/%2522identity+management%2522+roi

Edit: ignore, just realised SearchYC doesn't respect "" and looks for near match words.


From that search, what about tptacek, based on the non-standard use of "ROI": http://news.ycombinator.com/item?id=1024825


The styles are similar.


I searched for the rarest words used (rare combinations of words used only once or a few times [ORG]):

(i) The first search[1] revealed "gstar", but although they both have similar writing style, gstar doesn't have an active participation on privacy discussions (based on the query only [2])

(ii) The second search[3] revealed "astine": now this is interesting because this user has a very active participation on privacy discussions[4], especially I think he was inspired by _why[5].

[1] http://searchyc.com/sentiment+that+inspires

[2] http://searchyc.com/user/gstar

[3] http://searchyc.com/Ethically%252C+is+it+fair

[4] http://searchyc.com/user/astine

[5] http://news.ycombinator.com/item?id=774337

EDIT: [ORG] Based on the original comment at http://news.ycombinator.com/item?id=1197027


strlen is a match for [1]: http://searchyc.com/sentiment+that+inspires

He's my prime suspect: http://news.ycombinator.com/item?id=1200739

It's gotta be him.


I noticed strlen, too. But I don't think it's strlen. Some say it could also be randomwalker.


Didn't someone come up with a "Which HN User Are You Most Like" app a couple months back? I'm trying to find it now but no luck so far.


http://swimwithoutgettingwet.com/hnusers/?user=onetimetoken&...

Smart idea; but I just tried it and no results.


it's pretty much impossible to identify a user just by 1 anonymous post on a website (without the logs). I mean sure you can compare a person's typing style...but unless they always add "jambalaya" to their posts, it'll be next to impossible to be 100% sure.

The way it works in real life, is that you find a person's email address or a long term account on a forum, and then use that info to build up a full profile about that person. The longer the person is on the web, the more personal information they've revealed in the past.

i.e. 6 months ago they might have mentioned their phone #...so you can use whitepages to see their address. Or maybe they posted a link to their site..where they didn't have privacy enabled, so you can get the full name and address using whois. Or maybe they are using the same username on all sites, so you can use google to see all the forums they've ever posted on. etc



not me


Do you keep a corpus of all HN comments?

I doubt PG would appreciate all of us hammering the HN server to collect the data.


It's slow enough as it is :/


searchyc has this.


As well as google.


I guess randomwalker, cuz this sort of thing seems to be his specialty. And the writing style seems to match up fairly well.

http://news.ycombinator.com/user?id=randomwalker


Obviously, everyone loves a good challenge, but is there any evidence that the find-ee wants to be found?

They seemed interested in the thread as to how they might be found, but don't seem to have given any permission for a site-wide (wo-)manhunt. (This might have happened out of band, though.) I guess they do say that they were just using a one time account for rhetorical emphasis.

Further, if this were to really be a contest, it seems like there should be some sort of rules, such that the result isn't determined just by exhaustion of currently in use usernames by guessers.


Please read the whole original thread, that's exactly how we got to that point, and the response I got to my 'I bet I can identify you' and his/her admission that they thought of obfuscating the text made it pretty clear they would not mind an attempt, but that does not guarantee that there will be a resolution.


I picked several suspicious words and ran them through searchyc. The four rarest words in the post are: pure-ad, CTRs, disinformation, and dissimilar.

Walked through the results for each word, pulling in all the usernames:

    intersection of:  pure-ad dissimilar
    ['noodle']
    intersection of:  CTRs disinformation
    ['jacquesm']
    intersection of:  CTRs dissimilar
    ['ivankirigin', 'patio11', 'strlen']

Of those, strlen's writing style seems to be the closest match. So I'm changing my guess from randomwalker to strlen :)
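The intersection step above is easy to reproduce once you have, for each rare word, the set of users who have used it. A minimal sketch, with placeholder usernames and made-up sets standing in for the real searchyc results:

```python
from itertools import combinations


def pairwise_intersections(users_by_word):
    """For every pair of words, list the users who have used both.

    users_by_word maps a word to the set of usernames found for it.
    Pairs with no users in common are left out.
    """
    results = {}
    for a, b in combinations(sorted(users_by_word), 2):
        common = users_by_word[a] & users_by_word[b]
        if common:
            results[(a, b)] = sorted(common)
    return results


# Hypothetical search results; the real per-word lists are not in this thread.
users_by_word = {
    "pure-ad": {"user_a"},
    "CTRs": {"user_b", "user_c", "user_d"},
    "disinformation": {"user_b"},
    "dissimilar": {"user_a", "user_c", "user_d"},
}

hits = pairwise_intersections(users_by_word)
for pair, users in hits.items():
    print("intersection of: ", *pair)
    print(users)
```

With more words it would make sense to rank users by how many rare words they share with the post rather than looking at pairs only.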


It wasn't me. Nice approach, though. Just intersecting word choices has very little to recommend it for industrial scale author identification but for a small-ish community like HN it might work, and of course it is trivial to implement if you already have the data source lying around.


Heh. I identified a Reddit IAMA once this way.


Okay, so stupid question time... but oh well.

Is there an easy way to download the corpus of comments?


"What a surprise to find a whole thread and blog post dedicated to the search for my identity. I consent to a benevolent search for my identity or identities. I was quite surprised to see the speed and scale of this development - another symptom of networked life."

This person has said that we can find out his identity... if at all possible. Therefore, you are welcome to search based on his own permission.


I tried gender guesser. The words in the post were not enough, so I used all the text he wrote.

http://www.hackerfactor.com/GenderGuesser.html

The Output

Total words: 365

Genre: Informal Female = 326 Male = 616 Difference = 290; 65.39% Verdict: MALE

Genre: Formal Female = 348 Male = 450 Difference = 102; 56.39% Verdict: Weak MALE

Weak emphasis could indicate European.
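The percentages in that output appear to be the male score as a share of the two scores combined. That is inferred from the numbers quoted above, not from Gender Guesser's own documentation, so treat the formula as an assumption:

```python
def male_share(female, male):
    """Male score as a percentage of the combined score, to two decimals."""
    return round(100 * male / (female + male), 2)


# Scores copied from the output above.
informal = male_share(326, 616)
formal = male_share(348, 450)

print(616 - 326, informal)  # difference and percentage, informal genre
print(450 - 348, formal)    # difference and percentage, formal genre
```

Both results line up with the 65.39% and 56.39% figures in the quoted output.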


> So I wrote a bit of code to compare against other HN comments

Do you have a corpus?

Further, and this is an open question, is there an archive/downloadable corpus of HN in part or entirety anywhere? It would be fascinating and I'd love to keep a copy to look back at in years to come.


> Do you have a corpus?

Yes, that's why I thought it would be an easy challenge.

My bad :)

> Further, and this is an open question, is there an archive/downloadable corpus of HN in part or entirety anywhere?

Yes, you can query the google cache. It's fairly easy to do.

The only thing you don't get that way is the stuff you can see as a logged in user.


> Yes, you can query the google cache. It's fairly easy to do.

You could, but I'm not trying that again. It's easy to get your IP busted depending on what random algorithm Google decides to run each week.. <g>


Do you have any proof that the guy would acknowledge you are correct even if you identified him?

Seems like a good thing to verify first. Maybe you already got the guy and he just said "no it's not me."


I'm a big proponent of fair play and I think the author would identify himself when asked, but at the same time only PG can be sure.

Of course you can be paranoid, but I think the bigger chance is the author seeing a chance here at sowing some disinformation. Such as participating in this thread and giving false pointers and / or confusing the issue.

For the really paranoid, of course the last person to participate in this thread is 'the one'...

It wasn't me, that's for sure :)


First I want to say that I don't agree with publicly disclosing the "identity" of people who don't want to be found. I also don't think doing so "originates" from any good personal quality. That being said, I do remember this [1] talk from last CCC to be interesting from a technical standpoint.

[1] http://events.ccc.de/congress/2009/Fahrplan/events/3468.en.h...


Obviously it would not have to be public, an email with a confirmation and a request to keep it quiet would be fine.

The original poster could log in and disclose who found him first.


Sample size way, way too small


The sample is actually pretty huge considering the number of words.


For online messages with such short length, when the full set of features are used, a sample size of about 30 messages per author is necessary to predict authorship with an accuracy of 80~90%

http://ai.eller.arizona.edu/COPLINK/publications/CACM_From%2...


One of the strong indicators is the use of italics in the post. Many users will ignore formatting within their posts. I am confident this user has used italic formatting before for emphasis and has done it often within their HN posts. It also indicates a comfort level within HN which means they have likely posted frequently (at least once a month).


I think it is a trick and it's jacquesm.


My googling turned up "neilc" as a user of dashes, at least one of the obvious digrams from that post.


I thought Google ignored dashes (as well as all other punctuation characters). How did you deal with that?


Run the following phrases on your thingy and filter by the users who use them:

"The point being,"

", mind you,"

"I fully agree with"

"highlights a fundamental"

":" some text "," some text "."

", etc."

" - e.g."

"entirely dissimilar"

Whatever user has the most instances of these signature phrases is likely your man.
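
A rough sketch of that count in Python, for anyone who wants to try it. It assumes you already have each user's comments in a dict; the `comments_by_user` shape, the `phrase_hits` helper and the sample data are all made up for illustration, and the colon/comma pattern from the list is left out because it needs a regex rather than a literal match.

```python
from collections import Counter

# Literal signature phrases from the list above, lowercased for matching.
SIGNATURE_PHRASES = [
    "the point being,",
    ", mind you,",
    "i fully agree with",
    "highlights a fundamental",
    ", etc.",
    " - e.g.",
    "entirely dissimilar",
]


def phrase_hits(comments_by_user):
    """Return a Counter mapping user -> total signature-phrase occurrences."""
    hits = Counter()
    for user, comments in comments_by_user.items():
        for text in comments:
            lowered = text.lower()
            for phrase in SIGNATURE_PHRASES:
                hits[user] += lowered.count(phrase)
    return hits


# Hypothetical sample data, not real HN comments.
sample = {
    "alice": ["I fully agree with that. The point being, it works, etc."],
    "bob": ["Nothing distinctive here."],
}
ranking = phrase_hits(sample).most_common()
```

Raw counts favour prolific commenters, so dividing by each user's total word count would probably be a fairer comparison.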


That's exactly what I did and failed...


There are some short "googlewhacks" (though they are multiple words) in there:

* collective pause to think

* pure-ad parked

These suggest to me a [highly proficient] non-native speaker too. "pause for thought" and "pure ad-parked" are correct versions.

* "intimate patterns" is an unusual turn of phrase in this context, would probably be "personal usage patterns"

* "high-end criminals" looks like an unusual hyphenation

This search gives a name - http://www.google.com/search?q=%22identity+management%22+roi....


"high-end criminals" isn't unusual. Certainly not to these British eyes. If you Google for "high end criminals" even without the hyphen, about half of the results use the hyphenated version.

The other things you point out encourage me to share your opinion, however.


Google's regular search doesn't handle hyphenation but Trends appears to: http://www.google.com/trends?q=%22high-end%22%2C%22high+end%.... Google searches give me results which make me suspect this Trends search is not sound, however.

I'm from the UK too.


The problem is context, a common issue with tracking things with Google Trends in particular. Tracking programming language usage with it, for example, has been a nightmare ("ruby" and "python" having far too many meanings, but few write "ruby programming").

"high end" has more uses than "high-end." For example, "I bought a car at the high end of my budget." In that case, "high-end" wouldn't make sense. In "datacenters have been targeted by high-end criminals," however, "high-end" is a compound adjective.

Alternatively, you could drop the hyphen and/or form an entirely new word: "highend." The word "highend" doesn't seem to have caught on yet, though. I suspect that's because "upmarket" covers the same meaning already and is less susceptible to these morphological mishaps.

(On seeing what OS X had to suggest as a correction for "highend," it suggested both "high end" and "high-end.")


It's a blunt tool, agreed. But it was supposed to be a simple indicator only, not a measure.


From reading the original post, I think that "intimate patterns" was a stylistic choice, meant to emphasize the invasiveness of this phenomenon.


How do you know you failed?


That's a good one, another poster above suggested something similar. I believe that HN'ers would play fair in something like this.

On the other hand, that isn't proof of anything, but I think the chances are higher that someone who didn't do it will own up, to throw sand in the eyes of the searchers, than the reverse.

But then again, maybe I'm a sucker and I believe that people in general are honest and trustworthy. So far that seems to me to be a better assumption than the reverse.


you heard of this technique?

http://news.bbc.co.uk/1/hi/8404025.stm

Relies on a person's vocabulary and writing style more than the actual words.

[edit: disregard - I hadn't read deep enough to see that this was essentially the approach you had taken]


chime fits a few of the patterns: use of etc. mid-sentence, occasional use of hyphens - in this very pattern - and moderate use of slashes when "or" would do. Also American spelling.

Still, it's easy to get into a sort of confirmation bias looking at this stuff manually, and seeing things that fit while missing things that don't.
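
One way to take some of the eyeballing out of it is to count those habits mechanically. A minimal sketch, assuming plain-text comments; the `habit_counts` name and the three regexes are my own guesses at what "fits the pattern" means, not anything established in this thread.

```python
import re


def habit_counts(text):
    """Count three punctuation habits in one comment's text."""
    return {
        # "etc." followed by a lowercase word, i.e. used mid-sentence.
        "etc_mid_sentence": len(re.findall(r"\betc\.,? +[a-z]", text)),
        # A hyphen with whitespace on both sides, used like a dash.
        "spaced_hyphens": len(re.findall(r"\s-\s", text)),
        # word/word constructions such as "and/or" (URLs will also match,
        # so strip links first if that matters).
        "word_slashes": len(re.findall(r"\b[A-Za-z]+/[A-Za-z]+\b", text)),
    }


# Hypothetical sample sentence.
example = "We tried grep, awk, etc. and failed - badly - with and/or without luck."
counts = habit_counts(example)
```

Running the same function over every candidate, including the ones you don't suspect, is what guards against only noticing the matches.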


A different (and probably much easier) challenge: Disprove that it was me [or insert any other user here].


No need to taunt us, onetimetoken.

http://news.ycombinator.com/item?id=1200422


I don't think that's an easier challenge at all.

Disprove that pink elephants exist vs prove that gray ones do.

Which one is the easier challenge ?


Yes, and still people seem to take the denials of the other users at face value.


I think that we're all treating this as a game, under the assumption that if we figure out who onetimetoken is he'll tell us, as it supports his point that privacy is dead. Also, it wouldn't be fun if we assumed that he will deny it if identified. Never underestimate the importance of having fun.


I'm going with SwellJoe because he seems to have some patterns in common.


Privacy is the new sharing.


i believe it is vaksel.


Definitely a possibility... but based on what ?


i tossed a few of the stylistic quirks into a search and took a look at some writing samples. i think his stuff looks the most similar. i'm going purely on my own arbitrary judgment. it just feels right. :) the method itself isn't much different than what has already been talked about.

edit: oh right, i also looked at the fact that he was posting comments on HN around the time that the one in question was posted. a lot of my other candidates didn't meet that data point.
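
For what it's worth, that timing check is easy to automate. A small sketch, assuming you have comment timestamps per user; the `active_near` helper, the two-hour window and the dates below are invented for the example.

```python
from datetime import datetime, timedelta


def active_near(post_times_by_user, target, window=timedelta(hours=2)):
    """Return users with at least one comment within `window` of `target`."""
    return sorted(
        user
        for user, times in post_times_by_user.items()
        if any(abs(t - target) <= window for t in times)
    )


# Hypothetical timestamps, not taken from HN.
target = datetime(2010, 3, 17, 14, 0)
sample_times = {
    "alice": [datetime(2010, 3, 17, 13, 30)],
    "bob": [datetime(2010, 3, 16, 9, 0)],
}
candidates = active_near(sample_times, target)
```

It only narrows the field: someone posting from a throwaway may well have stayed quiet on their main account at that moment.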


> i also looked at the fact that he was posting comments on HN around the time that the one in question was posted. a lot of my other candidates didn't meet that data point.

Ah, very clever, another angle of attack. Never thought of that one.


Out of curiosity, why does it matter who wrote it?


I don't think the reason he wants to find out is because he wants to know. I think he wants to prove that it's possible to find out.

He thought it was long enough that he could pretty trivially have a program compare the writing style to other HN comments and determine who it was, but he failed. So it's a challenge for other hackers--can you write a program that can determine who said something simply based on the writing style and knowing that he's a member of a reasonably small sample (HN users).
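
To make the challenge concrete, here is a minimal sketch of one common approach: compare function-word frequencies with cosine similarity. This is not the code he wrote (that wasn't published in this thread); the word list, function names and sample texts are all assumptions.

```python
import math
import re
from collections import Counter

# A tiny, arbitrary set of function words; real stylometry uses far more.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]


def profile(text):
    """Relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_authors(unknown_text, known_texts):
    """Sort candidate authors by similarity to the unknown text, best first."""
    target = profile(unknown_text)
    scores = {user: cosine(target, profile(text)) for user, text in known_texts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical texts, not real comments.
unknown = "the cat is in the hat and it is the best"
known = {
    "alice": "the cat is in the hat and it is the best",
    "bob": "but but but to to",
}
ranked = rank_authors(unknown, known)
```

With a single short comment as the unknown text the vectors are very noisy, which may be part of why a simple analysis like this fails.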


You got it. Sorry for not being more clear, I thought it was an interesting challenge, and since I've used up my 'two guesses' I think it is more appropriate to admit failure rather than to keep on hammering away at it until I hit the right user.

PG would have an easy time of it (log files) :)


Well, that's the difference between a black box and a white box.



Why is this a worthwhile endeavour?


It's not 'worthwhile' in a sense that you can't take it to the bank (though it may come to that, see elsewhere in this thread), but I think it is what hackers do, solve puzzles.

This is effectively a puzzle, a reasonably hard one (my attempt failed, but not for lack of trying, I spent a fair number of hours on it before making my guesses, of course it could simply be that I'm stupid), and one that seems fun to solve.

It is exactly the kind of thing that I enjoy doing when it comes to programming in the first place, figure out how stuff works and/or solving reasonably hard problems. One step above my current competence is my favorite, that way I'm reasonably sure I can solve the problem, if the difference is too big then I tend to get stuck.

It's of course a bit like the question why people climb Mount Everest, the answers are: because they can and because it's there.

edit: Funny, I thought your downmod for asking a valid question was unfair, in return I get downmodded for answering :)


website down :/


I fully agree with the sentiment that inspires your statement.

However, this thread highlights a fundamental property of a networked life: e.g., etc. http://news.ycombinator.com/threads?id=onetimetoken


Here's the relevant text:

  A typical case of me and my big mouth.

  I figured that a basic analysis should reveal who wrote this:
  
  http://news.ycombinator.com/item?id=1197027
  
  Because of the size of the sample. So I wrote a bit of code to compare against other HN comments, and figured that that would turn up the user quickly.
  
  But I was wrong, after two tries (Daniel Markham and John Graham-Cummings) I have to admit that my simple analysis has failed.
  
  So, who will take up the challenge, can you identify this user somehow ?


Sorry, increased the server process limit to something more reasonable, should be better now.

This site normally gets 10 visitors / day or so...

I posted it on my blog because you can't post HN links on HN!


vanelsas



I am Spartacus


No, I am Spartacus!



