
Help Reddit build a recommender - ahalan
http://www.reddit.com/r/redditdev/comments/lowwf/attempt_2_want_to_help_reddit_build_a_recommender/
======
zach
Some may remember that Reddit used to have an item recommender a long time
ago, back in its first year or so. It was a Bayesian classifier that, since it
needed a bunch of input, only worked for the most hardcore members — who had
already seen almost all of the recommendations!

This was originally the "hard problem" at the center of Reddit.

Let me explain what I mean by that. There used to be a quaint notion that to
be a respectable tech startup, you had to have a "hard problem"
(technologically speaking) at your core, which you had an innovative "secret
sauce" solution for, preferably one you were patenting. After all, if not,
then someone can just copy you and squash you like a bug, right?

Since then, YC's insistent focus on making something people want, Eric Ries'
lean startup gospel and many entrepreneurs' own experiences have thankfully
gone a long way to convince people (most importantly SV investors) that
focusing on a "hard problem" is not only unnecessary, but may end up being a
fatal distraction.

This is a pretty good example of how the "hard problem" can turn out to be
completely irrelevant. Once it was clear that the recommendation engine wasn't
a growth vector, the Reddit team seemed to drop it out of sheer pragmatism.
They just needed to keep the site running.

I can't recall many who cared or even noticed that the "recommended" tab was
gone. But from that point on, Reddit was more free to become not just a quirky
"personalized news" startup, but what it has aspired to since: the front page
of the internet. And only now, just now, do a good chunk of the millions of
users think a recommender might be nice.

It's the startup version of "you aren't gonna need it" — if it doesn't drive
growth, push it aside.

~~~
DanielRibeiro
Earlier today at Clojure West, the founder of GetPrismatic[1] gave a very
interesting presentation. He is trying to solve this problem, and real-time
machine learning is a hard problem.

Hopefully it will be something people want, because I want it as well. But if
it is not, I'd rather he pivot into something where he can succeed than run
his startup into the ground because of me. A startup that fails helps nobody.

[1] <http://getprismatic.com/>

~~~
achompas
Is this presentation online somewhere? Sounds v. interesting, especially if
it's Bradford Cross.

~~~
DanielRibeiro
Not yet: <https://github.com/strangeloop/clojurewest2012-slides>

InfoQ taped it, but they may not release it for up to 6 months.

And yes, it was Bradford Cross.

------
jstepien
During the previous semester I spent some time building a recommender using
this data as a project for a data mining class. It turned out to be far more
challenging than I had initially anticipated.

I used methods known as collaborative filtering, whose goal is to estimate
how a given user would rate a given item based on the known preferences of
other users with similar interests. The initial scope included a naïve
Bayesian classifier and a technique called Slope One [1]. The latter is
particularly interesting because, according to its authors, it produces very
good estimates in very little time using only a simple linear model. The
preprocessing is expensive in both time and space, though, as it requires
building a matrix of deviations between rated items.
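
For the curious, here's a minimal sketch of weighted Slope One; the users,
items and ratings below are made up purely for illustration:

```python
# Toy ratings (hypothetical): user -> {item: rating}
ratings = {
    "alice": {"a": 5, "b": 3, "c": 2},
    "bob":   {"a": 3, "b": 4},
    "carol": {"b": 2, "c": 5},
}

def slope_one_deviations(ratings):
    """Average deviation dev[(j, i)] of item j's rating over item i's,
    across all users who rated both. This is the expensive preprocessing
    step: the deviation matrix is quadratic in the number of items."""
    sums, counts = {}, {}
    for user_ratings in ratings.values():
        for j, rj in user_ratings.items():
            for i, ri in user_ratings.items():
                if i == j:
                    continue
                sums[(j, i)] = sums.get((j, i), 0.0) + (rj - ri)
                counts[(j, i)] = counts.get((j, i), 0) + 1
    return {p: sums[p] / counts[p] for p in sums}, counts

def predict(user, item, ratings, dev, counts):
    """Weighted Slope One: average (deviation + known rating) over the
    user's rated items, weighted by how many users co-rated each pair."""
    num = den = 0.0
    for i, ri in ratings[user].items():
        if (item, i) in dev:
            c = counts[(item, i)]
            num += (dev[(item, i)] + ri) * c
            den += c
    return num / den if den else None

dev, counts = slope_one_deviations(ratings)
prediction = predict("bob", "c", ratings, dev, counts)
print(round(prediction, 2))  # 3.33
```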

After reducing the data set to a single subreddit and filtering out users who
weren't avid voters, I ran the algorithms, and after some tuning I was very
pleased to see promising ROC curves and decent AUC values. Models built
around NBC and S1 achieved comparable results on metrics such as precision,
recall and F-measure.

When I went to discuss the results with the professor teaching the class, I
heard: "That's indeed promising, but how about comparing those results with a
_really_ naïve model which would just take the average of a given user's
existing votes?" Guess what: the model built using a single call to the
_avg_ function was nearly as good as the NBC and S1 models.

Now I understand why the guys from Reddit are looking for external help with
the recommender. It's a far less obvious task than it might seem.

[1] <http://lemire.me/fr/documents/publications/lemiremaclachlan_sdm05.pdf>

Edit: s/machine learning/data mining/

~~~
_dps
Out of curiosity, did you compare to any other baselines? I suspect you did a
lot better than you think you did, because that particular baseline is
actually very misleading for ranking/recommendation tasks (this is a common
source of confusion for newcomers). Here's why, in two parts:

1) Say you estimate (as you propose) that a user will always give their
average rating. This might get you good-ish error and ROC _as a prediction
task_, but will give zero recommendation value because the prediction for a
given user will be constant for all possible recommendations.

2) Say you estimate that a user will give the average score that the item has
received across all users. Again, possibly good-ish in terms of _prediction_
ROC and RMS error, but this offers no personalization (all users get the same
predictions, i.e. you're basically just showing the default Reddit ranking).

Both of these baselines are vastly inferior to even really stupid models like
"how many times have I upvoted stories from this submitter" in terms of
recommendation value, but the latter is (if I recall from my own experiments)
much worse when evaluated on the basis of overall ROC.

I would strongly suspect that a correctly implemented NB or S1 would vastly
outperform either of the two baselines in terms of actual recommendation
utility (even though when you look at the baseline's ability to predict actual
numbers, they might be comparably good in an RMS sense).

The moral of the story: one must be very careful when trying to quantify the
performance of learning systems; actual utility is often difficult to evaluate
merely by looking at standard statistical measures of accuracy.
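
To make point 1 concrete, here's a tiny illustration (the vote data is
invented) of why a per-user-average baseline cannot rank anything:

```python
# Hypothetical voting history: 1 = upvote, 0 = downvote
history = {"alice": [1, 0, 1, 1]}
candidates = ["post1", "post2", "post3"]

# A "predict the user's average rating" baseline assigns every candidate
# the same score, so the induced ranking of recommendations is arbitrary,
# no matter how good the RMS error or ROC looks as a prediction task.
user_mean = sum(history["alice"]) / len(history["alice"])
scores = {c: user_mean for c in candidates}
print(scores)  # every candidate gets 0.75
```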

~~~
jstepien
No, I didn't make any comparisons to other baselines. Thanks a lot for sharing
your thoughts; I'll have to reconsider the results I got in the light of your
comment.

------
espeed
The Neo4j User Group wants to help with this
(<https://groups.google.com/d/topic/neo4j/rkhjlQx-bfo/discussion>).

Gremlin (<https://github.com/tinkerpop/gremlin/wiki>) works great for real-
time recommendations.

See "A Graph-Based Movie Recommender Engine" by Gremlin's creator, Marko
Rodriguez
(<http://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/>)

------
mikeklaas
For anyone who's trying this, I recommend basing your effort on factor models
(i.e., the thing that won the Netflix Prize). They work very well for us at
Zite.

(Content models are the other, probably less interesting, 50% of the
solution.)
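
As a sketch of what a factor model looks like (not Zite's actual
implementation; the data and hyperparameters here are invented), here is
plain-SGD matrix factorization in the style popularized by the Netflix Prize:

```python
import random

random.seed(0)
# Hypothetical (user, item, rating) triples
data = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
n_users, n_items, k = 3, 3, 2

# Latent factor vectors for users (P) and items (Q), small random init
P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]

lr, reg = 0.05, 0.02
for _ in range(200):  # stochastic gradient descent on squared error
    for u, i, r in data:
        err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)

# Predicted rating for user 0 on item 0; should be close to the observed 5.0
prediction = sum(pu * qi for pu, qi in zip(P[0], Q[0]))
print(round(prediction, 2))
```

A real system would add per-user and per-item bias terms and train on held-out
data, but the core idea is just this: learn low-dimensional factors whose dot
product reconstructs the observed ratings.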

------
krelian
This is a bit old, no? Anyway, I don't need a recommender; I need a better
way to adjust the weight different subs have on the homepage, and a way to
group low-traffic subreddits together so that I won't miss their content
among the high-traffic ones.

Reddit's old interface doesn't work anymore now that there are so many subs.
The fact that there have been so few interface improvements in the last
couple of years is pretty sad. I can't imagine browsing the site without RES.

The way things work now only magnifies the trend toward lower quality,
because the homepage gives undue weight to content from popular subs.

~~~
ebf
I've noticed this problem lately. Last night, I saw that a large percentage
of my front page was from one subreddit, even though I'm subscribed to over
200. Some of those subreddits never hit my front page, and I tend to visit
only 10-20 of them, so the majority of my subscriptions end up being useless.

I think there could be some interesting UI solutions to this problem. If more
people treated the Reddit API like the Twitter API, there could be
applications that aren't necessarily supposed to replace the traditional
Reddit browsing experience, but to make whole new experiences (e.g.
Flipboard).

~~~
stevengg
You can only see 50 posts on the front page as a regular user, or 100 if you
pay for reddit gold, and they update every 30 minutes. This is the main
reason I removed subreddits like /r/thewire, /r/archlinux and /r/bookclub:
they get posts so infrequently that it's not worth having them clog up one of
my 50 spots.

------
wrath
This is a pretty open-ended problem!

How do you measure success? After I create my algorithm, how do I know that
I'm close to what reddit wants? Without answers to these questions, IMO, this
is an exercise in futility. I'm not close to the project, but having written
my fair share of classifiers and clustering engines, I know that any machine
learning problem needs a way to measure success. My idea of a great result
surely differs from reddit's.

------
thedark
This is exactly the sort of thing a properly implemented tagging system would
have solved. Along with their notorious search problems. Along with the
difficulty in finding subreddits. Along with discovering old content. Six
years later, I still maintain that this was a mistake.

~~~
naner
What would a "properly implemented tagging system" look like on a site like
reddit? I know they have been rejecting the idea for years and intentionally
went with subreddits to handle the growth, encourage small disparate
communities, etc.

~~~
citricsquid
A story about startups can belong in multiple subreddits, e.g. r/startups,
r/entrepreneur, r/business.

If a story had tags, and there was a system in which the frequency of tags
appearing in a subreddit mattered, it would allow me to look at r/startups
and then find the other subreddits relevant to my interests.

reddit made the mistake of treating every subreddit as its own isolated
community without considering crossovers in interests. If tagging existed,
this would not have been a problem. Today, six years on, it's still
impossible to find good subreddits relevant to specific interests; tags would
have been one solution for that.

~~~
raldi
The reddit founders talked and thought about this tremendously, and ultimately
decided that it was more important to have distinct communities, so that the
same story can be on /r/aww and /r/photography without one group overrunning
the other. Or /r/TwoXChromosomes and /r/MensRights. Or /r/politics and
/r/economics.

I think that this was one of the most important strategic decisions in
reddit's history, and that they got it right.

I'm not saying tags can _never_ work, just that any proposed tags system needs
to supplement, not destroy, the siloing of subreddit communities. And be
simple to use, even for the 99% of redditors who never even vote or subscribe
to anything.

~~~
tomjen3
They would be useful for something like what Stack Overflow does, allowing
people to block some tags or highlight others (e.g. block Ron Paul posts in
/r/politics).

That said don't listen to me. I have quit using reddit, except for
/r/gonewild.

~~~
raldi
Negative filtering would be a disaster. The power users do most of the voting
and almost all of the reporting. If they all could block Ron Paul, those
stories wouldn't get downvoted and, when offtopic, reported. This would cause
the Ron Paul stories to take over the site for the 99% of users who wouldn't
be using the filter.

------
PaulHoule
Collaborative filtering is a boring problem and doesn't get to the heart of
what's wrong with Reddit, Hacker News, and such.

For one thing, many good stories languish on the "new" page and never get
enough votes to get a fair shake. Collaborative filtering doesn't help with
this, if anything it makes it worse.

Last night I made a crude boomerang by gluing two rulers together; this
morning the glue had set, and my son pressured me to try throwing it before
I'd even finished my breakfast. Right when it started to curve, it hit a
telephone pole and broke at the glue joint.

When I see many of the things people want to do on reddit, my first
impression is that they will wind up like that boomerang. For instance, LSI
is one of those things that does not work so well in real life... They still
seem to teach it to kids, but not that you can get almost-as-good results
doing dimensionality reduction with a random basis set.
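
For illustration, random-basis dimensionality reduction really is only a few
lines; the sizes and data below are invented:

```python
import random

random.seed(0)
d, k = 1000, 50                              # original and reduced dimension
doc = [random.random() for _ in range(d)]    # a document's term vector

# A random Gaussian basis: no SVD and no training, just k random directions.
basis = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]

# Project the document onto the random basis. By the Johnson-Lindenstrauss
# lemma, pairwise distances between documents are approximately preserved.
low = [sum(b * x for b, x in zip(row, doc)) for row in basis]
print(len(low))  # 50
```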

If you've got some semantic analysis and predictive models, you can build an
automated system that picks quality, relevant content out of the "new" queue,
and because you can use smart feature selection you don't need to wrangle as
much data: training is orders of magnitude faster, and you don't need to futz
around with Hadoop.

------
lars
I don't think you need voting data. Rather, answer the question "which
subreddits are similar to this particular subreddit?" by using the degree of
overlap in subscribers as a distance measure between subreddits. Use a
tf-idf-like approach, so popular subreddits are weighted less.

Then the similarity of r/programming to r/coding would be based on two
numbers:

    b = number of people subscribed to both r/coding and r/programming
    n = number of people subscribed to r/coding
    similarity = b / n
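
As a runnable sketch of that formula (the subscriber sets are invented for
illustration):

```python
# Hypothetical subscriber sets per subreddit
subs = {
    "programming": {"u1", "u2", "u3", "u4"},
    "coding":      {"u2", "u3", "u5"},
    "pics":        {"u1", "u2", "u3", "u4", "u5"},
}

def similarity(a, b, subs):
    """Fraction of a's subscribers who also subscribe to b."""
    n = len(subs[a])
    return len(subs[a] & subs[b]) / n if n else 0.0

print(round(similarity("coding", "programming", subs), 2))  # 0.67
```

Note the measure is asymmetric (b is normalized by a's subscriber count), and
downweighting huge subreddits as suggested would mean dividing each shared
subscriber's contribution by how many subreddits they subscribe to.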

------
markkat
I'm not sure it's a recommendation engine they need, but they do need a
better way to find subreddits. IMO some sort of map might work better than a
rec engine, or even just a quick way to see what subreddits another user
subscribes to.

------
mumrah
I think that, given the volume of users on reddit and the volume of content
they interact with, any of the various collaborative filtering techniques
would work well at this point.

You could take it a step further and incorporate more than explicit up/down
vote features, such as "clicked", "commented", "saved", etc.

Then incorporate some business rules that filter recommendations by
subreddits, boost results by time, and now you have a decent recommender.

Easier said than done of course.
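
As a toy sketch of blending those implicit signals into a single preference
score (the weights and events below are entirely invented):

```python
# Hypothetical per-signal weights for implicit and explicit feedback
weights = {"upvote": 1.0, "saved": 0.8, "commented": 0.6, "clicked": 0.2}

# One user's interactions with a single post
events = ["clicked", "commented", "upvote"]

# Confidence-style score: stronger signals contribute more, and a post the
# user clicked, commented on and upvoted outranks one they merely clicked.
score = sum(weights[e] for e in events)
print(score)  # 1.8
```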

------
cop359
So this is a StumbleUpon for Reddit?

