

Responding to jacquesm's challenge - jgrahamc
http://www.jgc.org/blog/2010/03/responding-to-jacquesms-challenge.html

======
davidw
Did you look at posting times at all to exclude people who never post at
certain times? (I should add that that's not my idea but something someone
posted on the original thread).

~~~
jgrahamc
No, I had a limited amount of time to get this done so there are lots of other
things I could have tried.

------
swombat
Hahaha, you still haven't found me out.

Oh wait..

~~~
jgrahamc
I asked PG if swombat and onetimetoken had set off the HN sock puppet detector
and he said that onetimetoken had used an IP address never before seen by HN.

So, swombat, time to prove that you are onetimetoken.

~~~
swombat
If I were onetimetoken, and I went to all this hassle to create a properly
anonymous account and even laid down a challenge to you to find me out... why
would I help you by giving you proof?

~~~
jgrahamc
You are the one making the claim to be onetimetoken. I'm just asking that you
back that up.

~~~
swombat
Where did I make such a claim? I made a joke about such a claim...

or did I...

~~~
endtime
The comment similarity problem is a lot more interesting than the is-swombat-
just-being-coy problem...could you just tell us?

------
jxcole
This is certainly an interesting problem, but without access to the data I
don't think I could really approach it.

Perhaps I am alone in saying this, but I think data mining is interesting
while web crawling is boring. Could somebody make the data available so that
we don't have to write a crawler? Or is this part of the challenge?

I think this is a classic example of unsupervised learning, for which I would
generally use a system like Fuzzy ART. I think that might perform better than
a naive Bayesian text classifier, though I can't be sure until I try it out.

~~~
jdrock
If anyone wants to use 80legs for this challenge, just drop us a line at
<http://www.80legs.com/contact.html>. We might be able to set up some custom
free plans.

------
adrianwaj
Would the owner of the comment just fess up for pete's sake? We won't hurt
you.

edit: there's some new text to put through the filter:
<http://news.ycombinator.com/user?id=onetimetoken>

and please try running the previous suspects through the filter: marketer,
citizenparker, martythemaniak, eru, vaksel, neilc, vanelsas, swelljoe

Also, if none of those is you, please formally deny it.

~~~
JesseAldridge
There's a misspelled word in the profile.

<http://searchyc.com/identitiy>

------
dangoldin
Isn't the Naive Bayes classifier biased toward users with a large volume of
text? I.e., if there are two users and one writes 99% of the content, isn't it
very likely that that user will be picked as the author of almost anything? At
the same time, this may be desirable, since someone who contributes a lot on
HN may also have wanted to have some fun.

~~~
jgrahamc
The key calculation is: given a word w, what's the frequency with which this
user uses w? That's the number of times w occurs / the number of words that
user has used. So it doesn't matter, as long as a user has 'enough' text that
they've covered a good portion of the overall dictionary of words in use.

The prior probability is based on the number of comments a user makes. In this
case that prior is insignificant because the sample text is large.
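That calculation can be sketched in Python (illustrative only: made-up data, and add-one smoothing thrown in so unseen words don't zero out a score; this is not jgrahamc's actual code):

```python
import math
from collections import Counter

def train(comments_by_user):
    """comments_by_user: {user: [comment, ...]}.
    Returns per-user word counts and the comment-count-based priors."""
    counts = {u: Counter(w for c in cs for w in c.split())
              for u, cs in comments_by_user.items()}
    n_comments = {u: len(cs) for u, cs in comments_by_user.items()}
    total = sum(n_comments.values())
    return counts, {u: n / total for u, n in n_comments.items()}

def score(text, counts, priors, vocab_size):
    """Pick the user maximizing log P(user) + sum of log P(word | user),
    with add-one smoothing."""
    scores = {}
    for user, wc in counts.items():
        n = sum(wc.values())
        s = math.log(priors[user])
        for w in text.split():
            s += math.log((wc[w] + 1) / (n + vocab_size))
        scores[user] = s
    return max(scores, key=scores.get)
```

With a large sample text the summed word terms swamp the prior, which is the point jgrahamc makes above.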

~~~
eru
Training on word pairs or triples may also be worth a look, instead of going
for single words only.
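The tweak is small: emit adjacent pairs instead of single words and count them the same way (a sketch, not anyone's actual code):

```python
def bigrams(text):
    """Return adjacent word pairs; these can be counted and
    scored exactly like single-word tokens."""
    words = text.split()
    return list(zip(words, words[1:]))
```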

------
wynand
What did you use to compute P(D|Ci)?

This uses your notation from the Dr Dobbs article, so that Ci (a category) is
a user and D is a document (a comment?).

Did you use something like trigram signatures?

Also, is P(Ci) equal to #comments made by Ci/total number of users?

Interesting stuff jgrahamc!

~~~
jgrahamc
I just used whitespace separated words after stripping punctuation.
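That tokenization might look something like this (my guess at an equivalent, not the actual code):

```python
import string

def tokenize(text):
    """Strip punctuation, then split on whitespace."""
    return text.translate(str.maketrans("", "", string.punctuation)).split()
```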

~~~
pbhjpbhj
I'd have thought that capitalisation and punctuation were key elements in any
textual analysis. In the subject text there is a very unusual hyphenation
"pure-ad" for example.

------
sgoranson
Very cool, and appropriate that you're basically using PG's spam filtering to
identify users on his site :)

I think the next step is to write a more complex filter that does not assume
word probabilities are independent of each other, i.e. take unusual phrases
like "entirely dissimilar" into account.

------
hooande
That's a nice approach, but a naive Bayes classifier doesn't seem like it
would be the best method for this particular problem.

You probably want to do an N-gram analysis, like that performed by libtextcat
<http://software.wise-guys.nl/libtextcat/>. This will perform a comparison
based on common combinations of letters used (like "wo", "or", "rd"). Seems
like it would be more accurate with such a relatively small sample of
comments. If you had a list of 10-20 possible candidates, you could narrow it
down to just a few.
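The idea behind libtextcat-style comparison can be sketched roughly like this (illustrative only: rank each text's most frequent character n-grams, then sum the rank differences, so a smaller total means more similar styles):

```python
from collections import Counter

def ngram_profile(text, n=2, top=300):
    """Map each of the text's `top` most frequent character
    n-grams to its frequency rank."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top))}

def out_of_place(profile_a, profile_b, penalty=300):
    """Sum of rank differences between two profiles; n-grams
    missing from profile_b cost a fixed penalty."""
    return sum(abs(r - profile_b.get(g, penalty))
               for g, r in profile_a.items())
```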

------
gcheong
Didn't somebody code up a website a while ago that looked for other HN members
that were similar in commenting style to oneself?

------
petercooper
As I was in the list on there, I just want to confirm it wasn't me _but_ when
I read the original comments left by the anonymous commenter, I saw a lot of
my own syntax mannerisms - at least the algorithm isn't too bad, eh? ;-)

------
DanielBMarkham
So basically everybody struck out. Most likely due to sample size.

There's an interesting lesson here, something like: the coolness of the tool
used has no direct relation to the usefulness of the conclusions it provides.

~~~
stse
I think it would be more interesting if the "guesses" actually took into
consideration how successful or unsuccessful the method is with the data
available. For example, how likely is each of the names he mentioned, and how
likely is it that it's any one of them?

Edit: If someone here has a background in intelligence, I would love to hear
their take on the challenge.

------
ErrantX
Great to see an analytical approach to the challenge :D Although, reviewing
your first list, most of them actually seem unlikely (for a variety of
reasons).

~~~
jgrahamc
I'm dying to know if this turns out to be right, or not. I actually have a
totally different list which is based on a different handling of punctuation.
It suggests that the most likely person who also commented on that thread is
zaveri.

~~~
ErrantX
I'm still convinced the lower-case use of google, facebook, etc. (which
occurred more than once in the comment) is important - especially as there is
one Google at the start of a sentence - indicating it is intentional/common.

That's why I personally discounted many from your first list (plus the fact
that I know a few are native English speakers).

~~~
stcredzero
I am aghast that people would think I'm not a native English speaker. I hope
I'm numbered in the "few." Disclaimer: sometimes Windows handwriting
recognition makes a real hash of my post without my noticing.

------
fnid2
Don't rule out jacquesm himself! It'd be just the kind of thing he would do.

------
JesseAldridge
<http://dl.dropbox.com/u/135901/rare_words.py>

------
anigbrowl
This is my inscrutable face.

