

Geekiest Hacker News comments from the last month - riffer
http://www.swimwithoutgettingwet.com/blog/most_technical_hn/

======
riffer
TLDR version: We took a corpus of 25k comments from HN, analyzed them to infer
semantic similarity. Next we came up with a seed of 18 words out of the 40k
words in our corpus, scored the corpus of 40k words based on similarity with
those 18 words. We then analyzed the 25k comments to score them based on the
scores assigned to the 40k words. Major weaknesses: [1] words not in the
corpus are dropped, so technical text with (relatively) obscure words may not
score well, and [2] our seed of 18 words was highly subjective and put
together in <5 mins.
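
A minimal sketch of the pipeline described above, not the actual implementation. The thread does not say how semantic similarity was inferred, so this assumes per-comment co-occurrence vectors compared by cosine similarity. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(comments):
    """Map each word to a Counter of the words it shares a comment with.

    Assumption: naive lowercase whitespace tokenization.
    """
    vectors = defaultdict(Counter)
    for text in comments:
        words = set(text.lower().split())
        for word in words:
            for other in words:
                if other != word:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def score_words(vectors, seed):
    """Score every corpus word by its mean similarity to the seed words."""
    seed_vecs = [vectors[s] for s in seed if s in vectors]
    scores = {}
    for word, vec in vectors.items():
        if word in seed:
            scores[word] = 1.0
        elif seed_vecs:
            scores[word] = sum(cosine(vec, sv) for sv in seed_vecs) / len(seed_vecs)
        else:
            scores[word] = 0.0
    return scores


def score_comment(text, word_scores):
    """Average the scores of known words; out-of-corpus words are dropped."""
    known = [word_scores[w] for w in text.lower().split() if w in word_scores]
    return sum(known) / len(known) if known else 0.0
```

Dropping unknown words in `score_comment` mirrors weakness [1]: a comment made up of obscure technical terms has nothing left to score.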

Also, there are a couple of dynamic links at the bottom of the post, for those
who want to play with the mechanics, search for themselves and others, etc.

~~~
eitally
as an addendum, since none of the results seemed overly technical to me, here
are the "technical" words they used to gauge geekiness:

CS, Clojure, Debian, Haskell, JavaScript, Python, Rails, Scala, algorithm,
compiler, engineer, frameworks, jQuery, macros, open-source, process, servers,
stack

~~~
dantheman
Ahh, I see they left out Erlang.

~~~
riffer
Good point, in the case of Erlang it wasn't a picking favorites thing, so much
as for whatever reason Erlang was not particularly well represented in the
corpus of 25k comments that we grabbed from the site.

~~~
Vivtek
Honestly - if a particular technical topic is mentioned _less_, that arguably
makes it _more_ geeky.

------
Zak
I think it needs a bit of work. Here's pg's highest-scoring comment from the
past month: <http://news.ycombinator.com/item?id=1606788> (score: 81.91). It's
entirely non-technical, but outranks many highly technical comments, like this
one from jacquesm: <http://news.ycombinator.com/item?id=1574015>

~~~
dman
Totally agree with it needing more work. This is about a week old at this
point and we're working hard to make it better.

------
pyre
Apparently:

    
    
      > I think the 'pain' comes when there are issues with the c
      > library that lxml binds to.
    

Scores 80.99/100

I think that the algorithm needs more work....

Also:

    
    
      > I count 8:
      >     Avatar (2009) - Yep
      >     Titanic (1997) - Yep
      >     The Dark Knight (2008) - Yep
      >     Star Wars: Episode IV - A New Hope (1977) - Yep
      >     Shrek 2 (2004) - Nope
      >     E.T.: The Extra-Terrestrial (1982) - Nope
      >     Star Wars: Episode I - The Phantom Menace (1999) - Yep
      >     Pirates of the Caribbean: Dead Man's Chest (2006) - Yep
      >     Spider-Man (2002) - Yep
      >     Transformers: Revenge of the Fallen (2009) - Yep
    

Scored 74.45/100

------
docgnome
Hrm... It says my geekiest comment was "Any examples?" Doesn't seem very
geeky... Maybe I'm not a geek... *has an existential crisis

------
jfraser
If you just hit 'Calculate Score' on the default 'Enter text ...' phrase, you
score 65.08 / 100.

------
samg_
What algorithm is used to get the centroid clusters? Do you need to know the
number of clusters in advance? I am familiar with max-link/min-link/avg-link
hierarchical algorithms, but not centroid related ones.

~~~
riffer
We represent clusters as a type of node. So if your other nodes have
coordinates in a space, there will be distances between your clusters. Or they
can be part of a graph traversal, etc.
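
One way to read "clusters as a type of node" is to give each cluster the centroid of its members as its coordinates. The sketch below illustrates that reading only. It is an assumption, not the site's confirmed method, and the function names are hypothetical.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def cluster_distance(cluster_a, cluster_b):
    """Euclidean distance between the centroids of two clusters."""
    return dist(centroid(cluster_a), centroid(cluster_b))
```

With this representation a cluster can be compared to another cluster or to a single point using the same distance function.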

~~~
samg_
Can you be more specific about the cluster membership testing? Sure it is
based on some "distance" calculation, but how do you avoid long chain problems
inherent to clustering algorithms? And to reiterate, do you need to know the
number of clusters ahead of time?

~~~
riffer
send me an email, it's in my profile

------
thristian
So, this is some joke site where it sees you're logged into Hacker News, finds
your userID, grabs a random one of your comments and fingers you as being the
geekiest geek in geekdom, right?

...right?

~~~
dman
You just happened to be one of the lucky ones!

------
dstein
These aren't very geeky. I understood almost everything.

I would rather see the complete back-and-forth of the "most geekiest argument
on HN".

------
makmanalp
These look just as geeky as any other to me.

~~~
Natsu
Just out of curiosity, I compared myself to the most notable HNN names I
could think of.

My highest is 83.79. Compare that to patio11 at 77.89 or Chromatic at 80.90,
or pg at 81.91. RiderOfGiraffes manages to top us all, though, with a comment
saying "This is a thin veneer on the slide-show already discussed here.
[link]" which scores 91.44. The next one after that falls clear down to 76.97,
which is itself rather typical (it seems that many of us have a steep drop
after one "geekiest" comment).

I'm not convinced that this is a useful metric. It's an interesting
experiment, perhaps, but one problem is that "geekiness" isn't necessarily
what we want in comments (or perhaps it is necessary, but not sufficient).

I can't believe that short link was RoG's best comment this month, for
example. And the top few comments from pg are certainly not his best (the more
interesting ones seem to start a bit below the 50s). YMMV.

~~~
wwortiz
My comment: _They took it down because people were defacing it_ actually
scored 100 without even being more than slightly technical.

~~~
riffer
Yeah, this is an interesting case.

It's all coming from 'defacing', your comment is actually the only one in this
corpus of 25k comments that uses that term; probably not a coincidence. Taking
a deeper look now, thanks for your help.

------
phoenix24
I am just curious: are you planning to release implementation details, or are
they already available somewhere?

~~~
dman
What would you like to know? We could perhaps write a follow-on blog post
with additional details.

~~~
phoenix24
I really wouldn't know where to start if I ever needed to make a similar
application, so I could use it as a starting point.

Maybe if you suggest a general outline of the steps taken and the tools used,
that would be a head start.

Thanks a lot.

------
powrtoch
Hopefully no one will take this as a challenge...

~~~
sp332
Not too hard. "Scala Scala Scala" is enough to round up to 100.0%.

It did say this (legitimate) comment of mine was 99.57%.
<http://news.ycombinator.com/item?id=1599910>

~~~
jerf
I'm a little confused by the scoring; I have two comments at "100.00", and one
of them is "<http://www.retrologic.com/jargon/H/hacker-humor.html> Not that I
want HN to become a joke-a-day site but accidental-hacker-humor like this is an
unusual enough find that I can deal with the rare exception."

The other one is more legitimately geeky, but I'm lost on what 100.00 means
exactly.

(Edit: Must be "hacker", but I'm still curious what the 100.00 means.)

~~~
riffer
"hacker" and also "exception" score really well.

There aren't any words like "war" or "contrary" that are far away from the
seed and would indicate that some non-technical meaning should be inferred
for "exception".

On the "100" ... that is just a z-score scaled to a 0-100 range, with capping
at 100 rather than making it approach asymptotically.
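
A rough sketch of a z-score scaled to 0-100 with a hard cap. The exact mapping is not given in the thread. This assumes a linear map where a z-score of `z_max` (a hypothetical parameter) lands on 100, and anything beyond it is clamped.

```python
from statistics import mean, pstdev


def scaled_scores(raw, z_max=2.0):
    """Convert raw scores to z-scores, map linearly to 0-100, then clamp.

    Assumption: z = 0 maps to 50 and |z| >= z_max maps to 0 or 100.
    """
    mu = mean(raw)
    sigma = pstdev(raw)
    result = []
    for x in raw:
        z = (x - mu) / sigma if sigma else 0.0
        pct = 50.0 + 50.0 * z / z_max
        result.append(max(0.0, min(100.0, pct)))  # hard cap, not asymptotic
    return result
```

Under a hard cap, every raw score beyond the cutoff collapses to the same "100.00", which would explain how several quite different comments can tie at the top.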

