
Ask HN: Has anyone built a recommendation engine in-house? - theknight
For example, a recommendation engine to recommend similar products, blog posts, movies, music, etc.

Would love to understand the challenges you faced and any library or third-party products (e.g., Recombee) you used to power the recommendation engine.
======
sadikkapadia1
I wrote the recommendation system at Netflix (still in use after 5 years).
The primary problem was company politics. Many groups were not happy that one
person could write a system that was better in A/B tests, had more uptime, and
was cheaper to run. All of it (ML, production, monitoring) was custom code.

~~~
MaxLeiter
Almost all of your top-level comments mention you did this

~~~
adtac
Interestingly, his Medium post from two years ago [1] also says "5 years ago",
and happens to be the only activity there.

[1] [https://medium.com/@sadikkapadia/i-wrote-the-
recommendation-...](https://medium.com/@sadikkapadia/i-wrote-the-
recommendation-system-at-netflix-5-years-ago-705d02c6aa9f)

~~~
sadikkapadia1
I don't keep track of time. That system is old technology. A few months ago I
confirmed with a Netflix employee that they still use it. Deep learning, LDA
(even one of Xavier's pet projects, k-means) did not do better.

------
chudi
There are basically 3 types of recommender engines:

Content Based: If you can represent your products as vectors, you can have a
distance between each pair of products, and then you have an item-item
recommendation. You can use all kinds of embeddings to achieve this; some
techniques that we tried are word2vec embeddings of user navigation,
autoencoding of features using neural networks, dimensionality reduction with
PCA, ALS, etc. There are lots of libs for solving these problems since it is a
well-studied field: usually numpy, and for finding the neighbors we use
nearest-neighbor search from scikit-learn, because if you have millions of
items, you can't just compute the distance between all the pairs.
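
For a concrete picture, here is a minimal sketch of the item-item flavour,
assuming you already have an (n_items, n_dims) matrix of item embeddings
(word2vec, autoencoder, PCA output, whatever); scikit-learn's NearestNeighbors
stands in for the neighbor search, and for millions of items you would swap in
an approximate index:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# placeholder embeddings; in practice these come from word2vec / an autoencoder / PCA
item_vectors = np.random.rand(10_000, 64)

nn = NearestNeighbors(n_neighbors=11, metric="cosine")
nn.fit(item_vectors)

# neighbors of item 0; the first hit is the item itself, so skip it
distances, indices = nn.kneighbors(item_vectors[0:1])
similar_items = indices[0][1:]
print(similar_items)
```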

Collaborative Filtering: Here you use the users' behavior as <user, item,
rating> triples. There is the Surprise lib in Python that works well, and you
have MLlib from Spark too. These techniques are called matrix factorization
techniques, and they also give you an embedding of the item or the user, so you
can apply the content-based techniques to find user-user and item-item
recommendations alongside the user-item recommendations.
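
A rough sketch of that <user, item, rating> setup with Surprise (the toy data
and column names here are made up):

```python
import pandas as pd
from surprise import SVD, Dataset, Reader

# toy <user, item, rating> triples
ratings = pd.DataFrame({
    "user":   ["u1", "u1", "u2", "u3"],
    "item":   ["i1", "i2", "i2", "i1"],
    "rating": [5, 3, 4, 2],
})

data = Dataset.load_from_df(ratings[["user", "item", "rating"]],
                            Reader(rating_scale=(1, 5)))
algo = SVD(n_factors=50)
algo.fit(data.build_full_trainset())

# predicted rating for an unseen (user, item) pair
print(algo.predict("u3", "i2").est)
# algo.pu / algo.qi hold the learned user / item embeddings mentioned above
```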

Hybrid Models: These are the models that use both behavior and features of the
users and items. LightFM is a good lib that works well, but you can also model it with
other tools like neural networks (
[https://ai.google/research/pubs/pub45530](https://ai.google/research/pubs/pub45530)
).
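
A small LightFM sketch of the hybrid idea, interactions plus item features in
one model (the shapes and feature matrices here are just placeholders):

```python
import numpy as np
from scipy.sparse import coo_matrix, identity
from lightfm import LightFM

n_users, n_items = 1000, 500

# a few fake user-item interactions
interactions = coo_matrix((np.ones(3), ([0, 1, 2], [10, 10, 42])),
                          shape=(n_users, n_items))
# per-item indicator features; real side features (tags, categories)
# would be extra columns stacked onto this matrix
item_features = identity(n_items, format="csr")

model = LightFM(loss="warp", no_components=32)
model.fit(interactions, item_features=item_features, epochs=10, num_threads=2)

# score all items for user 0
scores = model.predict(0, np.arange(n_items, dtype=np.int32),
                       item_features=item_features)
```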

The challenges depend on the company; recommending a small number of items to
a large number of users is not the same as recommending a large number of items
to a small number of users.

There is a whole specialization on Coursera that is really good:
[https://www.coursera.org/specializations/recommender-
systems](https://www.coursera.org/specializations/recommender-systems)

~~~
eggie5
I don't understand your connection between LightFM and the YouTube paper...

~~~
chudi
They are hybrid in the sense that they gather signals from more than just
features or just user activity; the YT paper uses embeddings from search and
views, so it's more of a mixed model than purely content-based or purely
collaborative filtering.

~~~
eggie5
Ok, I see, you are making the connection on the basis of hybrid
characteristics.

Since you're familiar w/ the YouTube paper, I've been wondering: how do they
get vectors out of the softmax?

------
splonk
I've been part of two efforts, one at a very large company, one at a startup.
From where I stand, your biggest issue is going to be getting a sufficient
data set, and having sufficient traffic in whatever you're recommending to be
able to actually test your models.

Technical aspects in how you train your models and such are fun, but way, way
down the list of things that are likely to matter in the short to medium term.
Like, data scientists are nice to have, but you're not really going to be able
to fully utilize them until you have the capability to build, deploy, and test
a model at scale. If going third party helps you do this, you probably should.

------
ecesena
I built the one at Theneeds.com and, if you're interested, this is the one at
Pinterest [1].

At Theneeds we were recommending news, i.e. fresh content, based on the user's
interests and other features. Because the content is fresh, you can't easily
have enough data for a proper collaborative system.

Our algo was essentially the Reddit algo, where a piece of content gets a rank
based on time and the log of its score. The score in Reddit is upvotes minus
downvotes. At Theneeds we had a more complex score including social signals
(likes on FB / RTs on TW), so we could compute a meaningful score even without
a big community of users. The other difference wrt Reddit was having different
scores and different paces (multipliers) based on categories of content, so for
example news in tech and politics from newspapers updated faster than news on
travel from magazines. And by normalizing the ranks, you can merge multiple
categories into one -- a feature that I think Reddit also added.
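
Very roughly, the ranking looked something like this (a toy sketch, not the
actual Theneeds code; the epoch and the category multipliers are invented):

```python
import math
from datetime import datetime, timezone

EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)  # invented reference point

# invented per-category pace multipliers: higher = newer items overtake old ones faster
CATEGORY_PACE = {"tech": 1.5, "politics": 1.5, "travel": 0.5}

def hot_rank(score, created_at, category):
    """Reddit-style rank: log of the (possibly composite) score plus a time term."""
    order = math.log10(max(abs(score), 1))
    sign = 1 if score > 0 else -1 if score < 0 else 0
    age_seconds = (created_at - EPOCH).total_seconds()
    return sign * order + CATEGORY_PACE.get(category, 1.0) * age_seconds / 45000
```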

As for the code/stack, it was custom written in Python. We were using Redis to
cache user timelines using sorted sets (including for guest users, i.e. the
default top news for each category). In Redis you can merge sorted sets, and we
used that as an efficient way to create the timeline when a new user was
signing up.
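
The sorted-set merge looks roughly like this with redis-py (the keys and
scores are made up):

```python
import redis

r = redis.Redis()

# one sorted set of ranked story ids per category
r.zadd("top:tech", {"story:1": 0.91, "story:2": 0.87})
r.zadd("top:travel", {"story:3": 0.42})

# build a new user's timeline by merging the categories they picked at signup;
# ZUNIONSTORE keeps the max rank when a story appears in several categories
r.zunionstore("timeline:user:42", ["top:tech", "top:travel"], aggregate="MAX")

# read the timeline, highest rank first
print(r.zrevrange("timeline:user:42", 0, 9, withscores=True))
```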

[1] [https://medium.com/@Pinterest_Engineering/introducing-
pixie-...](https://medium.com/@Pinterest_Engineering/introducing-pixie-an-
advanced-graph-based-recommendation-system-e7b4229b664b)

Edit: added more details about tech.

------
prades
I've built a recommender system for my movie database app "Coollector Movie
Database". It's based on collaborative filtering and it took me 2 years to
implement. I built it from scratch and it's unique in several ways (for
example, you can view the reliability of each recommendation). The technical
difficulty is crunching a huge quantity of data fast enough. I had to apply
all the optimizations that I could think of.

[https://www.coollector.com/help.html#recommendations](https://www.coollector.com/help.html#recommendations)

~~~
eggie5
How would you generalize the method you are using?

~~~
prades
I don't like ML frameworks (TensorFlow, etc.); maybe it's because I haven't
tried them. My understanding is that they're like a magic black box: you input
some data, you adjust some settings, and you hope the results will be good.
Instead, I've taken a direct approach to the collaborative filtering problem,
the difficulty being to correlate a huge amount of data. Some said that only
quantum computers would one day be fast enough to solve the recommendation
problem, until recently a student demonstrated that it could be solved with
classical computers.

[https://www.quantamagazine.org/teenager-finds-classical-
alte...](https://www.quantamagazine.org/teenager-finds-classical-alternative-
to-quantum-recommendation-algorithm-20180731/)

This student's algorithm is quite different from mine, but I suppose that my
algorithm is yet another example of solving the recommendation problem with
classical computers.

------
seektable
Have you checked the Apache Mahout recommendation framework (
[https://mahout.apache.org/docs/latest/algorithms/recommender...](https://mahout.apache.org/docs/latest/algorithms/recommenders/)
)? For 'small data' it can be used as a Java library (single-machine
algorithms); if you prefer .NET, a C# port is also available:
[https://github.com/nreco/recommender](https://github.com/nreco/recommender) .
If you're new to collaborative filtering, the 'Apache Mahout in Action' book
will help a lot.

------
itronitron
Yes, I have implemented a few content-based recommendation engines (referring
to chudi's taxonomy). The biggest existential threat is dealing with
colleagues who want to question your work for not using the f^xyz method that
they have heard about. Having a straightforward evaluation framework in place
to evaluate your results will go a long way towards ensuring the adoption and
longevity of what you create.

I roll my own analysis code but use search APIs for storage and access (Lucene
or Algolia).

------
rwieruch
I was in the same situation and learned about recommender systems by
implementing a simple movie recommendation system in JavaScript. If you are
interested, you can find the source code over here:
[https://github.com/javascript-machine-learning/movielens-
rec...](https://github.com/javascript-machine-learning/movielens-recommender-
system-javascript)

------
Topgamer7
I built a document recommendation project as part of a course, written in
Python using the term frequency-inverse document frequency (TF-IDF) formula.
It's actually a pretty straightforward method for recommending similar
documents based on content.

[https://github.com/ElementalWarrior/LearningAnalytics](https://github.com/ElementalWarrior/LearningAnalytics)
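
Not the repo's exact code, but the core of the method fits in a few lines with
scikit-learn (the toy documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "intro to bayes nets and probabilistic inference",
    "collaborative filtering with matrix factorization",
    "bayesian networks for student modelling",
]

# TF-IDF vectors for every document, then cosine similarity against the first one
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
sims = cosine_similarity(tfidf[0], tfidf).ravel()

# indices of the most similar documents to docs[0], excluding itself
print(sims.argsort()[::-1][1:])
```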

~~~
theknight
Thanks for sharing the repo. Was the course on recommender systems, or was it
just a CS course and you chose to build a recommender system?

~~~
Topgamer7
The course had a section on it; it also covered things such as Bayes nets. It
was basically a precursor course to machine learning. I chose to build the
document recommendation project as my final project.

[https://people.ok.ubc.ca/bowenhui/analytics/](https://people.ok.ubc.ca/bowenhui/analytics/)

------
eggie5
LightFM is my go-to for prototyping matrix factorization models. It efficiently
handles large data w/ sparse data structures and is CPU accelerated, including
optimizations like Hogwild!. It also has the WARP loss (a BPR variant) which I
have not seen implemented anywhere else.

I can train on multi-GB datasets w/ only LightFM and multiple CPUs.

Another interesting package is called Implicit. This package, although not as
complete as LightFM when it comes to algorithms or APIs, really shines when it
comes to optimizations. Besides native CUDA kernels for BPR and ALS, it also
has an important speedup called the conjugate gradient method, which makes it
faster than Spark in some benchmarks.
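
For reference, a minimal Implicit ALS run looks roughly like this (the
interaction matrix is a random placeholder, and the fit/recommend signatures
have shifted a bit between Implicit versions):

```python
import scipy.sparse as sp
import implicit

# fake implicit-feedback matrix: users as rows, items as columns,
# random confidence values standing in for interaction counts
user_items = sp.random(1000, 500, density=0.01, format="csr")

# use_cg=True enables the conjugate gradient solver mentioned above
model = implicit.als.AlternatingLeastSquares(factors=64, regularization=0.01,
                                             use_cg=True, iterations=15)
model.fit(user_items)

# top-10 item ids and scores for user 0 (implicit >= 0.5 style API)
ids, scores = model.recommend(0, user_items[0], N=10)
```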

But nowadays my work usually requires more customized hybrid models, which I
usually start from a base BPR implementation I have in Keras.

------
codecrusade
A simple graph traversal would usually make good recos.
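
A toy version of that idea: a short random walk with restarts over a user-item
graph, counting which items get visited (roughly what Pinterest's Pixie,
linked above, does at scale; the graph here is made up):

```python
import random
from collections import Counter

# tiny bipartite graph: users point to items they interacted with, and vice versa
graph = {
    "u1": ["i1", "i2"], "u2": ["i2", "i3"],
    "i1": ["u1"], "i2": ["u1", "u2"], "i3": ["u2"],
}

def recommend(start_item, steps=1000, restart_p=0.15):
    visits, node = Counter(), start_item
    for _ in range(steps):
        node = random.choice(graph[node])
        if node.startswith("i") and node != start_item:
            visits[node] += 1
        if random.random() < restart_p:
            node = start_item  # restart keeps the walk close to the query item
    return [item for item, _ in visits.most_common(5)]

print(recommend("i1"))  # items frequently co-visited with i1
```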

~~~
sadikkapadia1
Surprisingly good.

