
Can this library be used for users who viewed/bought this product also viewed/bought these other products? If not, do you know of a python library?


You could use scikit-learn to build your own recommender system if you really understand the math of the models you want to implement: scikit-learn only provides low-level building blocks such as a semi-scalable Singular Value Decomposition (http://scikit-learn.org/stable/modules/generated/sklearn.dec...), penalized linear regression models, or clustering algorithms such as k-means.
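To make the "building block" point concrete, here is a minimal sketch of using scikit-learn's TruncatedSVD to factor a toy user-item rating matrix and score items for a user; the ratings data is made up for illustration, and a real system would need far more care around missing values and scale.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy ratings: rows = users, columns = items, 0 = not rated (made-up data).
ratings = csr_matrix(np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float))

svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(ratings)  # shape (n_users, k)
item_factors = svd.components_             # shape (k, n_items)

# Predicted affinity of user 0 for every item: reconstruct that row.
scores = user_factors[0] @ item_factors
print(scores)
```

Note that this is exactly the kind of low-level piece the comment describes: you still have to decide how to handle unrated cells, rank items, and evaluate the results yourself.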

How to build and evaluate the performance of a usable and scalable recsys from such building blocks is far from trivial, though. It's probably even harder than implementing some of the building blocks provided by scikit-learn itself.
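One common evaluation scheme, sketched here under simplifying assumptions (all data is synthetic, and missing ratings are naively filled with 0, which real systems handle better): hide a fraction of the known ratings, reconstruct the matrix with a low-rank factorization, and measure RMSE on the held-out cells.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
true = rng.integers(1, 6, size=(20, 10)).astype(float)  # dense toy ratings

# Hold out roughly 10% of the cells for evaluation.
mask = rng.random(true.shape) < 0.1
train = true.copy()
train[mask] = 0.0  # naive "missing = 0" fill, for illustration only

svd = TruncatedSVD(n_components=5, random_state=0)
reconstructed = svd.fit_transform(train) @ svd.components_

# RMSE on the held-out cells only.
rmse = np.sqrt(np.mean((reconstructed[mask] - true[mask]) ** 2))
print(round(rmse, 3))
```

Even this toy setup shows why evaluation is non-trivial: the choice of holdout scheme, the missing-value treatment, and the metric all materially change the result.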

If I had to build a recsys myself, I would probably just use a fulltext engine such as ElasticSearch or Apache Solr + similarity queries (MoreLikeThis) + custom "features" + custom score functions, as explained in this presentation by Trey Grainger (http://www.slideshare.net/treygrainger/building-a-real-time-...), and maybe use scikit-learn models to extract relevant features describing either the users or the items, to improve the quality of the recommendations.
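For a sense of what the MoreLikeThis approach looks like, here is the shape of an Elasticsearch `more_like_this` query built as a plain Python dict; the index name, document id, and field names are all made up, and you would send this body to your cluster with an HTTP client or the official Elasticsearch client.

```python
import json

def more_like_this_query(item_id, fields=("title", "description", "tags")):
    """Build a more_like_this query body for a hypothetical product index."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                # "like" can reference an indexed document by index and id.
                "like": [{"_index": "products", "_id": item_id}],
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        }
    }

body = more_like_this_query("sku-123")
print(json.dumps(body, indent=2))
```

The appeal of this route is that the engine handles scoring and scale for you, and you can layer custom features and score functions on top rather than building the ranking machinery from scratch.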


Yes, it can. Scikit-learn provides algorithms that can be used to build your own recommender system. To scale those algorithms, you can always use Hadoop map/reduce.


Once you use scikit-learn to figure out what you want to do on a reasonable data set, you can use Mahout (http://mahout.apache.org/) to translate the algorithm to hadoop pretty easily.


You should use Mahout directly: its recsys part is quite complete, high level, and application oriented, unlike scikit-learn, which does not provide high-level recsys concepts.

The best documentation I found is the Mahout in Action book (http://manning.com/owen/), read while following the source code in parallel.

Also you probably don't need to run this on a Hadoop cluster unless your data is too big to fit on one single machine.


Yes, you can use nearest neighbors to implement collaborative filtering. In fact, you can implement collaborative-filtering recommendations with nothing more than numpy.linalg.norm. The trick is always figuring out what "distance" or "similarity" between two products really means.
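A minimal sketch of that idea, with made-up ratings: item-item collaborative filtering via cosine similarity between item rating vectors, computed with plain NumPy and numpy.linalg.norm.

```python
import numpy as np

# Toy ratings: rows = users, columns = items (made-up data).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Each item's rating vector is a column of the matrix.
items = ratings.T
sims = [cosine_sim(items[0], items[j]) for j in range(items.shape[0])]

# "Users who liked item 0 also liked": rank other items by similarity.
ranked = sorted(range(1, len(sims)), key=lambda j: -sims[j])
print(ranked)  # → [1, 3, 2]
```

Here cosine similarity is one possible answer to the "what does similarity mean" question; Jaccard overlap on purchase sets or mean-centered (adjusted) cosine are common alternatives with different trade-offs.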

