

Building a recommendation engine, foursquare style - jcsalterego
http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/

======
physcab
_By setting some constraints on which scores were significant, it was possible
to build the resulting similarity matrix in less than an hour on a 40-machine
cluster_

40 machines!?! Wow. We compute our recommendations at Grooveshark in a couple
hours on a crappy 2-node Hadoop cluster. hah. We have about as many songs as
they have venues, so I have no idea why they need so many nodes. Or maybe
we're just extraordinarily harsh in our constraints. I loved this writeup
though. It's great to see how other companies tackle these difficult problems.
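
For anyone wondering what "constraints on which scores were significant" can
look like in practice, here is a minimal sketch. It is not Foursquare's or
Grooveshark's actual method: the cosine formula on binary data, the
`min_cooccurrence` cutoff, and the toy check-in data are all assumptions made
for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(user_items, min_cooccurrence=2):
    """Cosine similarity between items, from binary user -> items data.

    Pairs seen together by fewer than `min_cooccurrence` users are dropped.
    That cutoff is a hypothetical "significance" constraint; a stricter one
    yields a sparser (and cheaper) similarity matrix.
    """
    item_counts = defaultdict(int)   # how many users touched each item
    pair_counts = defaultdict(int)   # how many users touched both items
    for items in user_items.values():
        unique = sorted(set(items))
        for item in unique:
            item_counts[item] += 1
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1

    sims = {}
    for (a, b), n in pair_counts.items():
        if n < min_cooccurrence:
            continue  # not enough evidence, skip the pair entirely
        sims[(a, b)] = n / sqrt(item_counts[a] * item_counts[b])
    return sims


# Toy data, invented for this example.
checkins = {
    "u1": ["cafe", "bar", "park"],
    "u2": ["cafe", "bar"],
    "u3": ["cafe", "gym"],
    "u4": ["bar", "park"],
}
sims = item_similarities(checkins, min_cooccurrence=2)
for pair, score in sorted(sims.items()):
    print(pair, round(score, 3))
```

Raising `min_cooccurrence` is the kind of knob that trades coverage for
compute, which may be part of why two shops with similar catalog sizes end up
with very different cluster needs.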

We also have the "cold-start" problem, and another one which we call the
"coldplay" problem. For them it would be something like McDonald's, I guess.

~~~
izendejas
For the "coldplay" problem you may want to use an inverse band frequency
weighting (much like in IDF for text). Bands that are extremely popular won't
give you any useful signals. As far as the cold-start goes... yep, that's a
beast, and using social data may help. For example, are you using stuff from
Facebook/Twitter once you have users sign in? You could crawl band/artist
names out of their data.
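
A rough sketch of the inverse band frequency idea, assuming each user is just
a set of artists and using the plain `log(N / n)` form of IDF. The function
names and the toy libraries are made up for illustration.

```python
from collections import Counter
from math import log


def inverse_artist_frequency(user_artists):
    """Weight each artist by log(total users / users who listen to it)."""
    n_users = len(user_artists)
    listeners = Counter()
    for artists in user_artists.values():
        listeners.update(set(artists))
    return {artist: log(n_users / count) for artist, count in listeners.items()}


def weighted_overlap(a, b, weights):
    """Similarity of two users: sum of weights of the artists they share."""
    return sum(weights[artist] for artist in set(a) & set(b))


# Toy data: everyone listens to Coldplay, so it carries no signal.
libraries = {
    "u1": ["Coldplay", "Boards of Canada"],
    "u2": ["Coldplay", "Boards of Canada"],
    "u3": ["Coldplay", "Slint"],
    "u4": ["Coldplay"],
}
weights = inverse_artist_frequency(libraries)
print(weights["Coldplay"])  # 0.0, because log(4 / 4) is zero
print(weighted_overlap(libraries["u1"], libraries["u2"], weights))
print(weighted_overlap(libraries["u1"], libraries["u3"], weights))
```

With this weighting, u1 and u3 share only Coldplay and score zero, while u1
and u2 still match on the less common artist.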

This gives me an excuse to try grooveshark. I've been curious for a while. :)

update: I didn't see any recommendations based on what I've liked on Facebook
or Twitter. This would be a good "warm" start. Note: I'm working on stuff like
this, but don't mind sharing this simple idea. :) Quora does a great job of
this, pulling in data from your Twitter network.

------
baberuth
Very cool.

10M venues is a lot, and a really hard-to-believe number.

For example, Yelp lists ~13k restaurants in NYC (3.5k in Chicago, 4.5k in SF).

In the top 10 categories for venues, Yelp NYC has 40k VENUES (restaurants,
shopping, food, health, spas, nightlife, and some more).

Either 4square has 250 cities like NYC, a fundamentally different definition
of what a venue is, or Yelp is SERIOUSLY missing a lot of places (like on the
order of getting only 10% of the venues in each city).
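
The back-of-the-envelope math, spelled out. The 10M and 40k figures are the
ones quoted above; the 10% coverage is the hypothetical, not a measured
number.

```python
foursquare_venues = 10_000_000  # venue count quoted from the article
yelp_nyc_venues = 40_000        # Yelp NYC, top 10 categories (figure above)

# How many NYC-sized cities would it take if Yelp's NYC count were complete?
nyc_equivalents = foursquare_venues // yelp_nyc_venues
print(nyc_equivalents)  # 250

# Hypothetical: Yelp only lists 10% of the venues in a city like NYC.
assumed_yelp_coverage_pct = 10
implied_nyc_venues = yelp_nyc_venues * 100 // assumed_yelp_coverage_pct
nyc_equivalents_at_10pct = foursquare_venues // implied_nyc_venues
print(implied_nyc_venues)        # 400000
print(nyc_equivalents_at_10pct)  # 25
```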

People have written about the data sparsity problem in the Yelp dataset, as
compared to Netflix, when using CF techniques
(<http://www.stanford.edu/class/cs229/proj2009/Fennell.pdf>). I'm very
interested to hear what people think about the 4square implementation.

I'm skeptical, but I really hope it works, because I have very average food
far too often on recommendations from friends with dissimilar palates...

~~~
billpaetzke
Foursquare has way more places in their database, based on my usage of both
Yelp's and Foursquare's iPhone apps.

Yelp is based on reviews of businesses, whereas Foursquare is based on
checking into places. Therefore, Foursquare's domain is much larger.

I'm not sure if Foursquare or Yelp add places to their database or if it's
100% user-driven. In Yelp's case, it takes more effort to add a place because
you need to (or at least you feel like you should) also write a review. For
Foursquare, you can just add your home, or "RainApocolypse 2011," of which I
am two days away from becoming the mayor.

~~~
baberuth
Reduced friction for adding places is a plausible reason for why 4square might
have more venues, but what I was getting at is the order of magnitude of the
difference.

10 MILLION places is a ton. Like I said above, that's like 250 NYCs, even
given 4square's reduced friction to add venues, the fact that they've probably
launched in more cities, and the inclusion of arbitrary venues ("Hey I checked
in at the tree that's 10 paces west of my house!").

I don't think 4square is lying. Now I'm just wondering how useful those venues
are. (I am also pretending that Yelp has better than 10% coverage of venues
that users care about.) Incidentally, either way, you've got to think that
10M venues has to make the data sparsity problem worse as well.

For reference, the venue count of 3 large American cities:

- NYC 47228
- SF 38656
- Chi 19079

I'd be surprised if Yelp had even 1.5M venues. Would love it if someone could
corroborate/dispute this. While there is certainly a disparity between
4square and Yelp, I'd imagine that the "useful" venues (those that people want
recommendations for) aren't your home or "RainApocolypse 2011". Adding those
into the dataset actually makes 4square's job _harder_ with respect to teasing
meaningful data out of the dataset.

If my phone ever recommends your house though, I just may come a knockin'.

------
jkava
Regarding the issue of user-based feedback ranking (the Per Se vs. Shake Shack
problem), it may not be such a negative thing to have results skewed due to
"unequal" ratings. Culinary ratings often are a result of many factors that
civilian diners may not necessarily consider important or even relevant.
Looking to Foursquare to show search results based on user approval, which
often is submitted on a knee-jerk reaction after dining, may be the best thing
for a prospective diner. After all, Per Se and Shake Shack may have 5 stars
awarded to them by the same diner, but unless this diner is making hundreds
of thousands a year (or is Thomas Keller), they would likely recommend Shake
Shack as the spot to eat to their Foursquare friends. To me, hedging this data
will end up producing results along the lines of more traditional culinary
recommendation systems, and may devalue the Foursquare recommendation engine.

------
sadiq
Anyone else used Mahout? What are your experiences with it?

~~~
physcab
We tried using it a while ago when it was first getting started. It was a pain
to use back then, but afaik it has advanced quite a bit, and after seeing this
post I'll be taking another stab at it. If you want an out-of-the-box
recommender, Mahout is good, and it also provides many other machine learning
algorithms that can be run at scale.

You definitely need to have some technical knowledge though. One thing that I
don't like about things like this is that if there is a bug and you don't
understand the software (or the math), you won't be able to troubleshoot.

This is why I tend to write my own for the specific task at hand.

~~~
justinmoore
Agreed, it has a very steep learning curve. Once you understand it (and
understand how to think in map-reduce), it's pretty awesome -- and easy to
extend. Rolling your own definitely makes the learning curve easier, but you
can miss out on some of the efficient M-R algorithms that are built into it.
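
To give a flavor of "thinking in map-reduce", here is a toy co-occurrence
count written as a map step and a reduce step in plain Python. This is not
Mahout's API or a real Hadoop job: `map_user`, `reduce_pair`, and `run_job`
are hypothetical names, and the shuffle phase is faked with an in-memory dict.

```python
from collections import defaultdict
from itertools import combinations


def map_user(user, venues):
    """Map step: emit ((venue_a, venue_b), 1) for each pair one user visited."""
    for pair in combinations(sorted(set(venues)), 2):
        yield pair, 1


def reduce_pair(pair, counts):
    """Reduce step: sum the counts emitted for one venue pair."""
    return pair, sum(counts)


def run_job(records):
    """Local stand-in for the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for user, venues in records:
        for key, value in map_user(user, venues):
            grouped[key].append(value)
    return dict(reduce_pair(key, values) for key, values in grouped.items())


# Toy input, invented for this example.
records = [
    ("u1", ["cafe", "bar", "park"]),
    ("u2", ["cafe", "bar"]),
    ("u3", ["bar", "park"]),
]
cooccurrence = run_job(records)
for pair, count in sorted(cooccurrence.items()):
    print(pair, count)
```

The point is that each mapper only ever sees one user's record and each
reducer only ever sees one pair's counts, so both steps can be spread across
machines without sharing state.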

