
Hacking Hacker News - hexis
http://joelgrus.com/2012/02/16/hacking-hacker-news/
======
jcr
With all due respect Joel, it seems you missed a good number of the inputs you
could have used to train your classifier.

If I (jcr) want to know what classifies as "interesting to Joel"
(joelthelion), I simply look at the "comments" and "submissions" links in your
HN profile. It will show me the stuff you took the time to comment on, or took
the time to submit to HN.

If I want to know what's "interesting" to me, the "saved stories" link is
visible in my own profile even though it is not visible to others. The
"saved stories" page is THE goldmine: every submission I've either submitted
or up-voted.

<https://news.ycombinator.com/saved?id=jcr>

Depending on your personal bookmarking habits, your bookmarks file/db can be
another useful input. I'm in the habit of bookmarking both the submitted
article, and the HN discussion page (if it's good). Assuming I didn't bookmark
the HN discussion, I could easily find the HN discussion/submission for all of
the sites that I've ever bookmarked with a search engine. Give Google a URL
and ask it for all of the sites linking to that URL, then parse for HN, and
you've got the target.

The serious problem I see with your approach was already mentioned by ck2;
you're creating a bubble and will miss out on all the fantastic stuff that is
interesting to you, but you don't yet know that it's interesting to you.

One of the primary benefits of HN and similar sites is learning about the
things that _others_ find interesting. Those things may not interest me, but
the fact that _others_ find them interesting is, well, interesting.

Why do they consider it interesting?

Why do I consider it uninteresting?

Even if my personal opinions remain unchanged, these are important questions
for me to keep asking myself, repeatedly.

~~~
peterb
This is also why cabals are so destructive. They cause homogeneity on what
would normally be a diverse set of topics. I wonder if you could train your
classifier to reject topics that are similar or are being promoted by the same
set of people?

------
thristian
There's a lot of stuff on HN I like, and a lot of stuff I find boring or
irrelevant. Unfortunately, the stuff that I really like tends to be novel and
unpredictable, so trying to teach some kind of Bayesian classifier to
recognise things I'll like is probably not going to work.

Personally, I think I'd be perfectly happy with an old-school killfile: do not
show me posts whose headlines contain the strings "X", "Y" or "Z", or that
link to sites "A", "B" or "C".
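
A killfile like the one described above could be sketched roughly as follows. This is only a minimal illustration: the blocked strings and the blocked domain are hypothetical placeholders, and stories are assumed to arrive as plain (title, url) pairs.

```python
from urllib.parse import urlparse

# Hypothetical killfile entries; replace with your own strings and sites.
BLOCKED_TITLE_STRINGS = ["ipad", "bitcoin"]
BLOCKED_DOMAINS = {"example.com"}


def is_killed(title, url):
    """Return True if the story matches any killfile rule."""
    lowered = title.lower()
    if any(s in lowered for s in BLOCKED_TITLE_STRINGS):
        return True
    host = (urlparse(url).hostname or "").lower()
    # Match the blocked domain itself and any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def filter_stories(stories):
    """Keep only the (title, url) pairs that no killfile rule matches."""
    return [(t, u) for t, u in stories if not is_killed(t, u)]
```

Matching on the parsed hostname rather than on the raw URL string avoids accidentally killing a story whose path merely mentions a blocked site.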

~~~
gnosis
Something else that would help enormously is simply allowing stories and
comments to be tagged.

~~~
kruhft
I have a hacked up version of HN that I've added tags and a decent search to
that I am using as a personal journal. I haven't removed account creation or
anything from the regular HN, so you can still post new stories, vote and
comment with an account: <http://kruhft.dyndns.org/discuss>

------
raganwald
Old timers (‘scuse me while I slam a geritol and bourbon) will remember that
when reddit first launched, there was a recommendation engine that purportedly
took your votes and turned them into a personal page of stories you would
enjoy.

It was scrapped and eventually subreddits were introduced. I think people like
the idea of communities.

~~~
raganwald
That being said... The fact that something failed one, two, or a hundred times
in the past doesn’t mean it won’t work today. Things may be different today,
or perhaps the approach may be slightly different. If Google can make
bazillions of dollars using machine learning to optimize the ads displayed on
a page, I’m pretty confident machine learning can be used to optimize the
likelihood that you’ll upvote articles suggested by a bot.

The question of whether this becomes an echo chamber and you stop finding
things that are interesting but outside of your current tastes is deep.
There’s an old saying, “The best present is something you didn’t know you
wanted until you unwrapped it."

~~~
EwanG
Is that really an old saying? I thought that was Steve Jobs' description of
the iPad?

~~~
raganwald
It goes way back before SJ.

------
ck2
By filtering out stuff, you'll never expose yourself to things outside your
"pattern".

HN's subtitle is " _Links for the intellectually curious_ "

I guess HN has a dual purpose to keep people up on "breaking" hacker news, but
I like to think of it as "hacker news outside your thinking pattern".

Also why do people immediately go to AWS for testing something? Doesn't a real
hacker have their own server handy for experimental projects, or is it only
me?

~~~
thristian
The thing is, HN doesn't _just_ contain links for the intellectually curious.
It often has stories about a new school of thought, or an approach to design,
or a new programming language or something. However, it also often has stories
about Internet Drama in the tech startup community, or American political
events, or practical advice for people entering the business world for the
first time. They're popular here because they're relevant to a large
proportion of the HN user-base, but it's certainly possible for someone to be
both intellectually curious and uninterested in American politics.

~~~
ericd
I would argue that American politics as they affect the internet and copyright
impacts pretty much everyone, unfortunately.

~~~
jarek
Which doesn't necessarily mean everyone would be interested in reading about
the newest developments weekly.

------
joelgrus
Hey everyone, thanks for the comments. There's too much to respond to
everyone, but a lot of people have brought up an "echo chamber" concern.

As other people have pointed out, the naive Bayes model works topically, so it
will learn that I like stories about "patents" but not (usually) whether the
stories are pro or anti-patent. It is totally true that I might miss an
interesting story about the new OSX or about Pinterest, but I'm willing to
live with that.

Two larger points are that

1\. HN is only a small fraction of the news I consume, so it wouldn't matter
that much to me even if it were a bubble chamber, and

2\. The main reason I did this is that I simply couldn't keep up with the
volume of stories otherwise.

Last night when I spoke about this, someone asked me whether I was concerned
about all the false negatives I was missing. But before I started this, my RSS
feed had like 800 (and growing) unread HN articles in it. Reading _some_ of
them, even a targeted some, is better than none.

Anyway, thanks for all the comments. I'm surprised (and glad) that people are
so interested in this!

------
m0th87
I applied naive bayes to generic news for a project a few years ago. Counter
to some of the comments here, I think it works surprisingly well in filtering
articles, and is a great way to start.

One of the nicest aspects of it is that it doesn't support a user's
confirmation bias: your perspective isn't taken into account in filtering
since it's just looking at keywords. That's probably not as important here,
but especially on political news it's highly relevant. If I'm a Democrat, I
don't want just left-leaning news to come through the grapevine, because that
prevents me from seeing the other perspective.

~~~
Tichy
I don't understand what you mean. Don't you train the classifier on some input
set? There's your user's confirmation bias.

~~~
tripzilch
I think what he means is that if you take the entire article as a bag-of-words
model, the general topic of the article will be a stronger signal than the
position the article takes with respect to the subject.
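
To make the bag-of-words point concrete, here is a small from-scratch naive Bayes sketch. It is not the code from the article; the class name, the tokenizer, and the smoothing choices are all assumptions made for illustration. Words the model never saw in training are simply ignored, which is why an opinion word like "good" or "bad" contributes nothing unless it showed up in the training set.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into a bag of words."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Illustrative two-class (liked / not liked) multinomial naive Bayes."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, liked):
        self.doc_counts[liked] += 1
        self.word_counts[liked].update(tokenize(text))

    def prob_liked(self, text):
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (True, False):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.smoothing * len(vocab)
            # Smoothed class prior, so an empty class does not hit log(0).
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in tokenize(text):
                if word in vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + self.smoothing) / denom)
            log_scores[label] = score
        # Turn the two log scores into P(liked | text).
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        return exp[True] / (exp[True] + exp[False])
```

Trained on a few liked "patent" headlines and a few disliked "ipad" headlines, such a model scores "why software patent reform is good" and "why software patent reform is bad" identically, because only the topical words carry any weight.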

The latter is a lot harder to extract from a bag-of-words. In fact the only
thing I can imagine is if an article uses a lot of euphemisms or negative
synonyms of a topic.

So you get the factual reporting articles regardless of left/right bias. With
the more polemic, hyperbole-laden articles, it will filter a bit more of the
angle you disagree with, which adds some bias, but it's also good for your
blood pressure, and a polemic article you disagree with is not going to make
your view more balanced either.

------
joelthelion
I've been working on this kind of stuff on and off for a while now. I still
think it's a great idea, but recommendations are a tricky business and naive
bayes is not good enough.

~~~
DennisP
What is good enough?

------
switz
Wow this is great! I was actually thinking about something like this the other
day. Any chance on releasing the source?

Edit: Found it - <https://github.com/joelgrus/hackernews>

~~~
someone13
It's kinda hard to see, but there's a link at the bottom of the article:

<https://github.com/joelgrus/hackernews>

------
Sukotto
Nice article and an interesting approach. thanks for writing it up.

    
    
      The model can only get better with more training data,
      which requires me to judge whether I like stories or not.
      I do this occasionally [using] the above command-line tool,
      but maybe I’ll come up with something better in the future.
    

Well, you could analyse your server logs to see which stories you really did
click on and which you skipped over (You'll probably want to only consider
pages that have at least one click for the edge case of "I didn't even look at
that page". Need a cookie or login to make sure it only counts _your_ clicks)

Also consider scraping your HN "saved stories" list as a positive source.

Don't recall if you mentioned it in your article, but you'll probably want to
randomly insert the occasional low-scoring article as a check for under-
weighting.
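
The "occasional low-scoring article" idea might look something like the sketch below. The function name, the threshold, and the exploration rate are hypothetical; the only point is that a small random share of stories the model would have hidden still get shown, so they can be judged and fed back as training data.

```python
import random


def pick_stories(scored, threshold=0.5, explore_rate=0.1, rng=random):
    """Select stories to show from a list of (title, score) pairs.

    High scorers are always shown. Each low scorer is shown with
    probability explore_rate and tagged so its judgment can be used
    to check whether the model is under-weighting something.
    """
    shown = []
    for title, score in scored:
        if score >= threshold:
            shown.append((title, score, "predicted"))
        elif rng.random() < explore_rate:
            shown.append((title, score, "exploration"))
    return shown
```

Tagging each shown story with the reason it was shown keeps the exploration picks distinguishable from the model's real recommendations.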

I really wish our profile pages supplied a (private) log of up/down votes for
comments and flags for stories in addition to the up votes for stories. It
would make for some interesting datamining.

------
sounds
Why is HN opposed to scraping? I think since the front page is dynamically
generated -- a bot would waste resources. The site gets loaded (slow to
respond) during peak times.

Beyond that, only pg could say...

~~~
bootload
_"... Why is HN opposed to scraping? ..."_

because code written by N hackers scraping HN has a greater effect on the
site, compared to using the reliable HNSearch API ~
<http://www.hnsearch.com/api>

    
    
      Ask HN: Does HN have an API and if not what's the etiquette for scraping?
      http://news.ycombinator.com/item?id=2138730
    
      Ask HN: Is there an API for HN?
      http://news.ycombinator.com/item?id=1107874

~~~
mmackh
Judging from my own experience this API is not reliable.

~~~
andres
If you have problems with the API please let me know (andres@octopart.com).

------
mmackh
Link to the relevant Blog: <http://joelgrus-hackernews.blogspot.com/>

\--

I had a go at scraping HN myself a couple of days ago (in PHP) and here's the
output:

<http://thequeue.org/api/frontpage.xml>

<http://thequeue.org/api/new.xml>

<http://thequeue.org/api/best.xml>

The trouble is that sponsored posts, i.e. We're Hiring (YC 11) break the code.
If anyone can help, there's a question on Stackoverflow about it:

[http://stackoverflow.com/questions/9301215/scraping-hn-
front...](http://stackoverflow.com/questions/9301215/scraping-hn-front-page-
handeling-simple-html-dom-error#comment11731379_9301215)

~~~
underwater
This article only scores 0.039 on his blog. Whoops.

------
siculars
Great move with the bayes classifier. I'm more interested in how stories are
received on other networks so I made <http://hnfluence.com> for my HN
consumption. Ultimately looking to see how one network affects another.

~~~
aggarwalachal
this is actually quite nice. those nifty little additions to original layout
look good.

------
xpose2000
It's 2AM so forgive my skimming of the article, but that is a heck of a lot of
work and looks to be quite good. Will take another looksee tomorrow when I can
think-straight.

Hopefully when I get back to hackernews sometime tomorrow this will be
frontpage where it belongs. :)

~~~
drivebyacct2
As a recommendation, clicking the up arrow on a story will put it on this
list: <http://news.ycombinator.com/saved?id=xpose2000>. That way you can view
it without bookmarking. (As for hitting the frontpage, it did...)

------
silentscope
Neat idea.

Nevertheless, I believe that a good news source, like a good community,
sometimes gives you things that you might not like--things that challenge the
filters we already have.

It's what flipping through a regular newspaper does, what hacker news does,
what listening to a good broadcast does. The key is if you know the content is
going to be so good that you're willing to take that risk despite your
hesitation.

Hacker News is a bit like a firehose, but that's why I read it. The people on
this site are informed, opinionated and pretty damn smart. It reminds me every
time I get on how little I actually know.

To be honest, a small part of me feels uncomfortable with that, but that's why
I read it--it's news I need to know.

------
barmstrong
Interesting post. This isn't quite the same but wanted to mention I created a
project, <http://ribbot.com>, to let people create their own Hacker News style
site on other topics.

~~~
wallawe
THIS is what I've been looking for. It deserves a post of its own. Is the only
way to make money by paying the monthly fee and using one's own ads? I am
about to begin playing with it..

~~~
barmstrong
Awesome, thanks for checking it out. Yes, the only way is ad supported right
now if you want to go that route. What else were you thinking - monthly
memberships for private forums? It seems odd but I haven't thought about it
much.

I'm considering making the project open source as well so technical folks can
run their own versions and modify it. Sort of a wordpress model where it's
open source but they still make money off the hosted version which is simpler
for non-technical people to use. Thoughts?

------
jakejake
This is cool but please HN do not ever start implementing this. Digg was once
exactly like hacker news until they screwed it up by making the homepage
different for every user, thus killing the feeling of a shared experience for
us users.

Even though it's not always fair, even though you have to wade through
stories, the fact that there is a common home page is what spurs the
discussion and keeps things interesting.

This community is exactly how I remember digg in its heyday. Mostly tech
stories and one common home page that has some bit of prestige when your story
reached it. That's the magic sauce.

~~~
dangrossman
Digg was supplanted by reddit, a site where the homepage is different for
every user.

~~~
jakejake
I didn't know that. I'm not a hardcore reddit user though. I still feel like HN
is more like the "good old days" of digg when it was mostly tech. Reddit is
cool but it has a more mass appeal with its topics.

------
jplehmann
Looks like you wouldn't find your blog post very interesting. Only scored
0.039 on 2/16 at 11PM. <http://joelgrus-hackernews.blogspot.com/>

------
lt
I think a good feature to measure would be the number of duplicates a story
has multiplied by the number of days between submissions (normalized somehow).
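
One possible reading of that feature, sketched under several assumptions: "duplicates" means submissions beyond the first, "days between submissions" means the span from the earliest to the latest submission, and the "normalized somehow" part is an arbitrary log squash. The function name is made up for the example.

```python
import math
from datetime import date


def resubmission_score(submission_dates):
    """Duplicates times the day span between first and last submission.

    The product is squashed with log1p, which is just one arbitrary
    choice of normalization.
    """
    if len(submission_dates) < 2:
        return 0.0
    duplicates = len(submission_dates) - 1
    span_days = (max(submission_dates) - min(submission_dates)).days
    return math.log1p(duplicates * span_days)


# An essay resubmitted once a year versus a gadget story posted three
# times on the same day.
essay = resubmission_score([date(2009, 1, 1), date(2010, 1, 1), date(2011, 1, 1)])
burst = resubmission_score([date(2011, 3, 2)] * 3)
```

With this definition a burst of same-day duplicates scores zero, while a story that keeps coming back over the years scores high.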

I have this theory that "atemporal" stories (technical analysis, insightful
essays, etc) that keep getting resubmitted every year are more interesting
than news about the latest gadget. I've written about it before:
<http://news.ycombinator.com/item?id=2505081>

------
dave_sullivan
That's really neat, had taken a couple stabs at this myself but not gotten all
that far.

I think there's probably some really interesting data in the comments (maybe
just take top ten comments, 3 layers deep?) And wrt following links, one idea
that occurred to me was taking a screenshot of the page linked to and basing
part of your model on that... I suspect there may be graphical, layout
similarities in some of the pages people like.

But kudos for doing it, very cool!

------
joverholt
Check out [Programming Collective
Intelligence](http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325).
I found it to be a good introduction to machine
learning because it uses practical (and neat) examples to teach the concepts.
One of my top 5 programming books.

------
akg
Just curious if you tried using LSM (Latent Semantic Matching) on the titles
and/or their contents to determine what you like and don't like. It might take
some time to train the data to "your liking" but might return some decent
results. I know Apple has an LSM framework in Lion that could be of some use.
Just wondering if you gave that approach a thought?

------
guynamedloren
Another huge advantage of this tool is that by only seeing what's interesting
to you, you won't find yourself wading through every single story on the front
page of HN during downtime.

Looks like a great start to an awesome project. I think the next logical step
would be to expand to give anybody a filtered HN experience, but you probably
didn't need me to tell you that :)

------
gnufs
To filter the overwhelming amount of readable content on HN, I personally use
the <http://news.ycombinator.com/over?points=120> function. In fact, since the
number of voters has increased a lot, I think I'm going to increase the point
cap to 200.

------
hello_moto
Got a question to you all Ruby developers:

I came from Java background where Maven reigns supreme when it comes to build
+ dependency + convention on file structure and I like this set-up.

What is the equivalent to that in Ruby, I know there's Bundler and Rake, but
Rake feels like Ant where you'd have to do a few things yourself.

------
mickey7
for a simple hack to keep exposure to novelty & variety i suggest just using
negative filters to downvote the disliked content

thus the more novel the content - the higher up it will remain

unless it's an article about a novel completely revolutionary arrangement of
old ideas which you never liked on their own :)

------
sravfeyn
Training the predictor on URLs the user has browsed (from browser history),
even ones that did not originate from HN, would improve the predictive model,
because a user's browser history reflects his interests.

~~~
yread
Yes, but not necessarily what he likes. I often click on a story in HN only to
find it completely useless and content-free

------
dfc
How easy would it be to invert this filter?

My biggest complaint with the number of stories on HN is not finding the great
articles but wading through a bunch of stories about the same three or four
topics I am not interested in...

------
donniezazen
I read all posts that reach 100 karma points.
[http://talkfast.org/2010/07/23/a-cure-for-hacker-news-
overlo...](http://talkfast.org/2010/07/23/a-cure-for-hacker-news-overload)

------
vseorlov
Nice thing. I think people need something like this, because the number of
news items is large, but the number of interesting and relevant ones is not.
Information filters are the future.

------
xedarius
Nice work, however this feature is a little like putting blinkers on a horse
in case it sees something frightening. HN's greatest asset is the variation of
the stories.

------
hamoid
Great work. I've been wishing posts had tags so I could subscribe to some
topics and avoid others.

Apart from the technical side, I find the light blue text hard to read.

------
chaz81
This guy just gave a talk about this at a Seattle Meetup a few hours ago,
pretty cool to see how fast it shot up to the top.

------
paolomaffei
Is there any technical difficulty in implementing an overbar on the content to
vote the content up or down?

Just like you can do on Reddit.

------
jgmmo
This is totally awesome. Very happy to come across this. A lot of technology
in there I can learn from.

------
sravfeyn
Wow. Something I am looking ahead to try to implement after my AI course.

------
iamgilesbowkett
This is a very rewarding thing to do. I recommend it to everyone. You don't
necessarily even need to use AI.

I filter HN with <http://hacker-newspaper.gilesb.com/>, which pulls RSS,
filters it, and reformats it on an hourly cron job. I mainly did it for the
typography -- I disagree with just about every visual design decision on
Hacker News -- but added very primitive filtering after the fact. I throw out
any story from TechCrunch, Zed Shaw, Steve Yegge, and Jeff Atwood, because I
just got tired of them, and any story with "YC" in it, too, because I got
tired of seeing job ads for Y Combinator startups. (In fact it was the job ad
for a Curebits marketing manager right after their scandal that did it.)

When Apple launched the iPad, I went in and added a simple regex to filter out
any story about it. Hacker News is a great source for skimming but
occasionally gets fixated on topics. I get like a hundred uniques a day so
it's not exactly a huge hit, but I've thought about making a commercial
version with customization. It got featured on Mashable and somebody created
an iPad app which looked very, VERY similar, which I'm going to take as
validating my design. But whether or not I ever startupify it, anyone who
wants a customized version can just fork the project, deploy their own version
in like ten or twenty minutes, and tweak regexes to their heart's content.
It's on GitHub (<https://github.com/gilesbowkett/hacker_newspaper>) and only
requires the most basic proficiency with cron, ruby, and python.

I also want to add comment-scanning. Right now I don't use comment links at
all. The code extracts them but then simply throws them away. I don't want to
add comment links back in unless I can also set it up to alert me if the
comments thread contains comments from raganwald, patio11, jashkenas, amyhoy,
etc -- basically automated comment elitism. I'm not trying to be a dick with
that, I'm just a busy dude.
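
The comment alert itself is the easy part once the author names of a thread are available. A rough Python sketch follows; it assumes the commenter names have already been extracted somehow, and the function name is made up for illustration.

```python
# The usernames are the ones named in the comment above.
FAVORITE_AUTHORS = frozenset({"raganwald", "patio11", "jashkenas", "amyhoy"})


def favorite_commenters(comment_authors, favorites=FAVORITE_AUTHORS):
    """Return the favorite authors who commented in a thread, sorted."""
    return sorted(favorites & set(comment_authors))
```

A comments link would then be kept only when this returns a non-empty list.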

Anyway, when I set out to do this, I planned on doing a bunch of Bayesian
whatnot, but I found that I got most of the way there just tweaking regexes
occasionally. Likewise there's a lot of rough edges I could clean up, e.g.,
text encoding is a bit of a mess, and summarization in the style of
<http://tldr.it/> would make it way more useful.

But I recommend it because making deliberate decisions about what info you
want to get from HN makes it a lot less like watching TV and a lot more like
doing actual research into topics which interest you. It's surprising how much
more enjoyable HN becomes when viewed through a customized filter.

~~~
mmackh
A word of advice about using the comment links in HN's RSS feed, from
personal experience: try not to scrape them in an interval. I mistakenly ran a
piece of code that went through 7~10 different posts in a short period of time
(30 secs) and my server was banned immediately.
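
One way to avoid bursts like that is to put a throttle in front of every request. The sketch below is a generic minimum-delay limiter, not anything HN documents; the class name is invented, and the interval you pass in is a guess you would have to pick conservatively yourself.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the class can be tested
        self.sleep = sleep
        self.last = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

Calling `throttle.wait()` right before each fetch spaces the requests out regardless of how fast the surrounding loop runs.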

~~~
iamgilesbowkett
heh, thnx. my comment-reading experience is also boosted with some
<http://defunkt.io/dotjs> hacks, btw:

[https://github.com/gilesbowkett/dotjsfiles/blob/master/news....](https://github.com/gilesbowkett/dotjsfiles/blob/master/news.ycombinator.com.js)

