
A Simple Content-Based Recommendation Engine in Python - numlocked
http://blog.untrod.com/2016/06/simple-similar-products-recommendation-engine-in-python.html
======
ChicagoBoy11
The blog post was a bit confusing in the way it was structured, and I'm still
not quite sure I understood you correctly: You start off by saying that
content-based recommendations are what most people usually have in mind when
they think of recommender systems. Then you suggest that in many real-life
situations those systems are kinda terrible at meaningfully moving the ball
forward on recommendations that are non-obvious to users. But then you show a
nice implementation of just such a system, which you just said may not be
entirely useful for a lot of things and, when paired with the other algorithm,
often contributes very little... is that right?

As a reader, with all of that intro, I couldn't help but be really
disappointed that the implementation discussed wasn't the one relying on
buying behavior... seems like a good chunk of the article is devoted to
explaining why that one is a superior recommendation system, so naturally I
wanted to see how you'd go about putting THAT together...

~~~
numlocked
Great feedback, thanks. I just made some revisions to try and make the
narrative arc clearer.

To clarify here as well:

- CF techniques are definitely superior for answering the question "what else
might this user want to buy?" in a generic way, when you have
preference/purchase data for that user.

- Content-based techniques are great for making recommendations in a context
where what you really want is something more akin to automatic curation, and
you may not have the necessary preference information about a visitor to apply
CF (e.g. an anonymous user viewing a product details page, vs. a logged in
user who has previously bought products). A content-based approach can solve
the cold-start problem for CF.

You're right that I could have done a much better job in the original post --
I've since added information to the effect of the above. Thanks very much for
the comment!

------
numlocked
Author here! Love to hear any feedback and hope folks find it useful. As I
mention in the article, we use a very, very similar implementation in
production at grove.co. It's ridiculously simple, which also makes it very
robust and reliable in production.

Example (framed as "customers also bought", but in reality it's a set of similar
products -- we swapped in the content-based engine for a CF approach and
haven't yet updated the copy):
[https://www.grove.co/catalog/product/cellulose-sponge/?v=802](https://www.grove.co/catalog/product/cellulose-sponge/?v=802)

------
bartkappenburg
I liked the way the article compared the two approaches (CF and text analysis)
and the code was interesting.

At Conversify[0] we literally tested these methods last week(!) We help
e-commerce companies with machine-learning optimizations, of which a
recommendation system is one. The average shop uses the standard tools for
recommendations, which are terrible; with some intelligence, recommendations
could be a very important way to increase sales and conversion. [/end-plug]

This is what we learned. We've tested three approaches (all on sites with
enough traffic):

(1) CF based on buys in combination with/without profiling

(2) CF based on purely clicks in combination with/without profiling

(3) Text analysis (sklearn, TF-IDF)

We had the highest hopes for buys, then clicks, and text analysis last. It
turned out exactly the other way around ;-).

(1) Buys have the drawback that the relationship of profiles to products is,
relatively speaking, many-to-few most of the time.

(2) Clicks have the same drawback, but we can also see clicks within one
session. This makes clicks somewhat better because shoppers often compare
multiple products that are alike. This is valuable info.

(3) We didn't expect TF-IDF to have evolved so much over the last couple of
years that we can now do an analysis in a few lines of code and get these results.
They were stellar. We tested the outcomes with the business owners as well as
customers (we didn't say which method generated which results) and they all
went massively for text-analysis.
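
For readers who want to see what the "few lines" boil down to, here is a minimal sketch of TF-IDF plus cosine similarity over product text. It is not the code from this test: it uses only the Python standard library instead of sklearn, and the product ids and descriptions are invented for illustration. The weighting follows the smoothed IDF and L2 normalization that sklearn's `TfidfVectorizer` applies by default, but the tokenization here is cruder.

```python
import math
import re
from collections import Counter

# Hypothetical product descriptions, made up for this example.
PRODUCTS = {
    "straw-paper": "paper drinking straws striped party pack",
    "straw-steel": "stainless steel drinking straws reusable",
    "sponge": "cellulose kitchen sponge reusable",
    "charger": "usb phone charger cable",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return {doc_id: {term: weight}} with L2-normalized TF-IDF weights."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many descriptions does each term appear?
    doc_freq = Counter(t for toks in tokens.values() for t in set(toks))
    vectors = {}
    for doc_id, toks in tokens.items():
        counts = Counter(toks)
        # Smoothed IDF: terms found in every document keep a small weight.
        vec = {
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[doc_id] = {t: w / norm for t, w in vec.items()}
    return vectors


def cosine(a, b):
    """Dot product of two unit-length sparse vectors (= cosine similarity)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def similar_products(product_id, vectors, top_n=3):
    """Rank all other products by cosine similarity to product_id."""
    target = vectors[product_id]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != product_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


VECTORS = tfidf_vectors(PRODUCTS)
```

Calling `similar_products("straw-paper", VECTORS)` ranks the other straw product first, because it is the only one sharing terms ("drinking", "straws") with the paper straws.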

Here's our outcome for a random product from one of our customers (with a
combined buys/clicks recommendation):
[http://i.imgur.com/lEYYIh9.png](http://i.imgur.com/lEYYIh9.png)

That being said: we are working on a combination of all three methods. We
think that text-analysis is for the first filtering of related products and
that the CF methods are for ordering this set. Also, using weights for the
different methods is useful.
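
A rough sketch of that "filter with text, order with CF" idea, under assumptions: the function name, the `shortlist_size` parameter, and the hand-written scores are all hypothetical, and both score tables are taken as already computed elsewhere.

```python
def recommend(product_id, text_similarity, cf_score, shortlist_size=3):
    """Shortlist by text similarity, then order the shortlist by CF score."""
    by_text = sorted(
        text_similarity[product_id].items(),
        key=lambda item: item[1],
        reverse=True,
    )
    shortlist = [pid for pid, _ in by_text[:shortlist_size]]
    cf = cf_score.get(product_id, {})
    # Products without any CF signal sort last but stay in the result.
    # sorted() is stable, so ties keep their text-similarity order.
    return sorted(shortlist, key=lambda pid: cf.get(pid, 0.0), reverse=True)


# Hypothetical, hand-written scores for illustration only.
TEXT_SIM = {
    "straw-paper": {
        "straw-steel": 0.8,
        "straw-glass": 0.7,
        "straw-bamboo": 0.6,
        "napkins": 0.1,
    }
}
CF_SCORE = {
    "straw-paper": {"straw-glass": 0.9, "napkins": 0.95, "straw-steel": 0.2}
}
```

With this data, `recommend("straw-paper", TEXT_SIM, CF_SCORE)` returns `['straw-glass', 'straw-steel', 'straw-bamboo']`: the napkins have the highest CF score but never make the text shortlist, and the bamboo straws stay in despite having no CF data.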

Besides that we are developing a way to distinguish in an automated way
between substitute products (samsung vs iphone) and complementary products
(charger with a phone). This could be accomplished by using more CF than text
(and specifically bought products within sessions). Our first results are
promising.
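
One way such a substitute/complement split could look, purely as an illustration: the rule below (high text similarity means substitute, otherwise frequent co-purchase within a session means complementary) and both thresholds are assumptions, not the method described above.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(sessions):
    """Count how often each unordered product pair was bought in one session."""
    counts = Counter()
    for basket in sessions:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts


def classify_pair(a, b, text_similarity, co_counts,
                  sim_threshold=0.5, min_co_purchases=2):
    """Label a product pair; thresholds are hypothetical tuning knobs."""
    pair = tuple(sorted((a, b)))
    if text_similarity.get(pair, 0.0) >= sim_threshold:
        return "substitute"  # described alike: likely alternatives
    if co_counts.get(pair, 0) >= min_co_purchases:
        return "complementary"  # described differently, bought together
    return "unrelated"


# Hypothetical sessions and text similarities (keys are sorted id pairs).
SESSIONS = [
    ["phone-a", "charger"],
    ["phone-b", "charger"],
    ["phone-a", "charger", "case"],
]
TEXT_SIM = {("phone-a", "phone-b"): 0.8}
CO_COUNTS = co_purchase_counts(SESSIONS)
```

Here `classify_pair("phone-a", "phone-b", TEXT_SIM, CO_COUNTS)` gives `"substitute"`, while `classify_pair("phone-a", "charger", TEXT_SIM, CO_COUNTS)` gives `"complementary"`.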

[0] [https://www.conversify.com](https://www.conversify.com)

~~~
data_hans
Awesome work, I really enjoy reading your experience on this. For the TFIDF
approach, did you use clicks or purchases? I am also curious as to why you
think the TFIDF approach is better in your case. I saw the imgur image that
you uploaded, let's say I'm buying something for a party, isn't the
buys/clicks recommendation better in that case? It'll show me all the items I
may need for the party. Why would I want to see other similar items to what
I'm viewing in that case?

~~~
bartkappenburg
TF-IDF uses no click/buys data whatsoever. It's purely based on
characteristics of the product (description, name, brand etc etc).

Regarding the example picture I gave: that's the difference between
substitutes and complementary products. The example gives alternatives (other
straws...). We're building a way to have recommendations on these two types.

Why TF-IDF works better vs buys? A set of bought products holds little to no
similarity between those products, so it's better to use it for complementary
products.

Why TF-IDF works better vs clicks? Click paths on sites are more random than
you think. Combining click data with TF-IDF makes more sense.

As said: using TF-IDF as a first filter and CF for sorting could be the best
fit for substitute products. Complementary products should rely more on
CF.

A problem with clicks/buys only is that new products tend to get overlooked;
you can fix that by mixing in TF-IDF.
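
A small sketch of one possible way to do that mixing, assuming both scores are already scaled to the same range. The linear ramp and the `full_trust_at` cutoff are hypothetical choices, not something taken from the tests described here.

```python
def blended_score(cf_score, text_score, interactions, full_trust_at=50):
    """Mix CF and text scores; the CF weight grows with interaction count.

    A brand-new product (0 clicks/buys) is scored purely on text similarity.
    Once it has `full_trust_at` interactions, only the CF score counts.
    """
    cf_weight = min(interactions / full_trust_at, 1.0)
    return cf_weight * cf_score + (1.0 - cf_weight) * text_score
```

For example, `blended_score(0.0, 0.8, interactions=0)` returns the text score unchanged, so a new product with no click or purchase history can still be recommended.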

------
data_hans
Awesome article. Using cosine similarity to calculate product similarity is
something I have been wanting to try. Have you thought about running TFIDF on
customers' entire purchase histories and computing cosine similarity between
each customer's purchase-history vector and the product vectors? I think
an interesting study can be done on comparing the recommendation results
between CF, Cosine similarity on products, and cosine similarity between
customer history and products.
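
A minimal sketch of the customer-history variant, under assumptions: the customer profile is built by simply summing the term vectors of purchased products (one of several possible choices), and the product vectors below are hand-written stand-ins for real TF-IDF output.

```python
import math


def add_vectors(vectors):
    """Sum sparse term-weight vectors into one profile vector."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return total


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_for_customer(purchased_ids, product_vectors, top_n=2):
    """Rank not-yet-purchased products against the customer's profile."""
    profile = add_vectors(product_vectors[pid] for pid in purchased_ids)
    candidates = [
        (pid, cosine(profile, vec))
        for pid, vec in product_vectors.items()
        if pid not in purchased_ids
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical term weights standing in for real TF-IDF vectors.
PRODUCT_VECTORS = {
    "dish-soap": {"soap": 1.0, "kitchen": 0.5},
    "hand-soap": {"soap": 1.0, "coconut": 0.5},
    "sponge": {"kitchen": 1.0, "sponge": 1.0},
    "charger": {"usb": 1.0, "phone": 1.0},
}
```

For a customer who bought only `"dish-soap"`, `recommend_for_customer(["dish-soap"], PRODUCT_VECTORS)` ranks `"hand-soap"` ahead of `"sponge"` and leaves the charger out of the top two.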

------
ecesena
I've been working on news recommendation for the past few years and totally
agree with your point. Oftentimes you can't have enough history for CF to
work.

I'd encourage you to go further than tf-idf: you can improve keyword
extraction a lot and, based on that, improve your overall recommendations.

A simple, basic approach is to create a taxonomy of "entities" that you know
are relevant for you. Oftentimes these are so specific and particular that
even if they appear a single time in the text they have to be considered
keywords, whether tf-idf says so or not. Of course, a text that merely
references that entity would also get it as a keyword, which may be wrong, but
most of the time you'll be correct.
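
The taxonomy idea can be sketched roughly as follows. The function name, the sample taxonomy, and the naive substring matching are all assumptions made for illustration; a real system would likely match entities more carefully.

```python
def extract_keywords(text, tfidf_weights, taxonomy, top_n=3):
    """Top-N terms by tf-idf weight, plus every taxonomy entity in the text.

    Taxonomy entities are always kept, even if they occur only once and
    tf-idf alone would rank them low (or not see them as one term at all).
    """
    lowered = text.lower()
    # Naive substring match; entities may be multi-word phrases.
    forced = [entity for entity in taxonomy if entity in lowered]
    ranked = sorted(tfidf_weights, key=tfidf_weights.get, reverse=True)
    keywords = list(forced)
    for term in ranked[:top_n]:
        if term not in keywords:
            keywords.append(term)
    return keywords


# Hypothetical inputs for illustration.
TAXONOMY = ["coconut oil soap", "castile soap"]
TEXT = "Gentle bar made with a coconut oil soap base and lavender"
WEIGHTS = {"lavender": 0.6, "gentle": 0.4, "bar": 0.3, "base": 0.1}
```

With these inputs, `extract_keywords(TEXT, WEIGHTS, TAXONOMY, top_n=2)` returns `['coconut oil soap', 'lavender', 'gentle']`: the entity is surfaced even though no single-word tf-idf term would produce it.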

As an example, I don't know, "coconut oil soap". Will tf-idf ever surface it?
Hard to say. Is it relevant to your business and thus to your recommendations?
I think more so.

Happy to chat about this anytime, shoot me an email.

