
Factorization Machines - dialtone
http://tech.adroll.com/blog/data-science/2015/08/25/factorization-machines.html
======
xtacy
Isn't the key insight -- replacing `w_{i,j}` with an inner product of
per-feature vectors -- also known as the kernel trick in machine learning?
(here, they seem to use a polynomial kernel:
[https://en.wikipedia.org/wiki/Polynomial_kernel](https://en.wikipedia.org/wiki/Polynomial_kernel)).

EDIT: Kernel tricks in general are nice because they also generalise well to
an infinite-dimensional space (the RBF kernel) and compute the dot products in
that space without actually computing the embedding. RBFs:
[https://en.wikipedia.org/wiki/Radial_basis_function_kernel](https://en.wikipedia.org/wiki/Radial_basis_function_kernel)

~~~
99decisionstr
Not quite - it's different from the kernel trick, which is impossible at this
scale (there's no way you can train an RBF kernel in a decent amount of time
when your space has 10^8 features and your training set has 10^9
observations). The idea of factorization machines is to _learn_ the polynomial
kernel, and it provides a mechanism to do so which scales "linearly" (with a
constant factor that you're in control of).
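
Concretely, the degree-2 model from Rendle's paper (linked further down the
thread) replaces each pairwise weight w_{i,j} with a dot product of learned
k-dimensional latent vectors:

    \hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
                     + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j ,
    \qquad v_i \in \mathbb{R}^k

so the interaction part has k*n parameters instead of n(n-1)/2, and k is the
constant you get to choose.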

~~~
srean
Could you elaborate on its differences from the kernel trick? It seems to me
@xtacy is exactly right on that modeling point. Training the model using
scale-appropriate optimization is an orthogonal concern, if that is what you
were getting at, i.e. an off-the-shelf kernelization would not be sufficient
to run this at scale.

~~~
99decisionstr
In very large-scale sparse settings, the optimization strategy is tightly
coupled to the modeling, not orthogonal to it. The reason is that "data beats
algorithm": a "dumb" model that can be trained on 1TB of data using a "dumb"
optimizer will easily outperform a fancier model for which we only have at our
disposal more complex optimizers that can only eat 1GB of data in the same
amount of time. This paper is a classic when it comes to the kind of
trade-offs you have to deal with in large-scale web settings:
[http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf](http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf)

The kernel trick uses an _implicit_ mapping into a higher-dimensional feature
space. Factorization Machines, on the other hand, use an _explicit_ mapping
into the polynomial kernel space. However, they learn the "right" polynomial
jointly, by mapping the _base_ features into a low-dimensional dense space
where the higher-order terms of the polynomial are dot products (we're not
taking dot products in the original feature space!). FM thus _learns_ the
right polynomial (a non-convex task) jointly with learning the original
supervised learning task.
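
To spell out why those dot products keep things linear: the pairwise term can
be rearranged (an identity from the Rendle paper) so that it never enumerates
pairs,

    \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
      = \frac{1}{2} \sum_{f=1}^{k}
        \left[ \Big( \sum_{i=1}^{n} v_{i,f} \, x_i \Big)^{2}
             - \sum_{i=1}^{n} v_{i,f}^{2} \, x_i^{2} \right]

which costs O(kn) per example, and in a sparse setting only the nonzero x_i
contribute.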

~~~
srean
Thanks for replying in spite of the fact that the audience would have moved to
other threads by now. I am totally with you in the claim that to deploy a
solution one has to be fully cognizant of its optimization-theoretic
implications. What I meant by orthogonal is that the properties of the model
do not depend on which algorithm was used to optimize it, provided you reach a
unique optimum. Here it is tricky to make the same claim because, although it
is convex in each set of variables separately, it's not jointly convex.

Where I am not with you is in the hint that kernels have feature maps that are
necessarily opaque. In fact, inhomogeneous poly kernels are great examples
where the feature map is known in closed form. Although such a map is
available, it is not always efficient to use it. But what it lets you do is
optimize the weights of each dimension of the mapped feature space while
executing in the native space. In fact, and I am greatly tickled by the
coincidence, I had sent a mail to colleagues where I suggested a form in which
the standard Euclidean dot product in the poly kernel is replaced by a convex
combination of other favourite kernels of choice. The training algo remains
almost the same. I was not aware of this piece of work, but the mechanism is
pretty much the same (a little more general: it essentially replaces the
standard dot products that appear in the kernel expression with other kernel
expressions. If you are thinking recursion now, well, that's intended) and can
be pushed further.
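
As a toy illustration of that closed-form map (my own numpy sketch, not from
the article or the comments above): for the homogeneous degree-2 kernel
(x.z)^2, the explicit feature map is just all pairwise products of
coordinates, and the two computations agree:

    import numpy as np

    rng = np.random.default_rng(0)
    x, z = rng.normal(size=5), rng.normal(size=5)

    # Explicit feature map of the homogeneous degree-2 polynomial kernel:
    # phi(x) = all pairwise products x_i * x_j, i.e. the flattened outer product.
    def phi(v):
        return np.outer(v, v).ravel()

    implicit = np.dot(x, z) ** 2        # kernel evaluated in the native space
    explicit = np.dot(phi(x), phi(z))   # same value via the closed-form map

    print(np.allclose(implicit, explicit))  # True

The inhomogeneous case (x.z + c)^d works the same way, just with extra
lower-order coordinates in the map.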

------
fiatmoney
This is the main paper on factorization machines, which discusses some of the
advantages:

[http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf](http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf)

In particular they're nice for recommendation algorithms.

------
srean
If there are any Adroll folks on this thread, really curious to know about
Adroll's use of the D programming language, how it has worked out, anecdotes,
stories, etc. That you use D already makes it an attractive place :)

~~~
99decisionstr
Sure! Here are some insights:
[https://news.ycombinator.com/item?id=10136197](https://news.ycombinator.com/item?id=10136197)

D is awesome for what we do with it. It lets our small team of data scientists
have a huge impact on the company, as we don't have to rely on frameworks such
as your typical hadoop/spark cluster, which require engineers just to maintain
them. We can also use the same code path offline and online, where we have
very strong latency and QPS requirements.

D makes us ship more often, with fewer humans. The runtime is not perfect yet,
but the language and compiler are awesome. It keeps our codebase small, simple
and very efficient.

~~~
srean
Now that's just bragging with the express purpose of making me envious. So how
fast can you come to India :)

------
iaw
I don't quite understand the benefit of replacing the weighting with a vector.
Could anyone offer additional insight?

~~~
mattwilson
Without factorization machines, you would have O(n^2) quadratic weights
w_{i,j} to learn. This is bad for two reasons in a setting like online
advertising, where you might have millions of features (i.e. n is large).
First, it makes training your model very slow. Second, you won't have enough
observations to learn all the weights w_{i,j} well. So using all quadratic
weights in the online advertising setting is not feasible. But we still want
to include feature-interaction information. Enter FM.

To answer your question, FM is good for two main reasons. First, as a result
of some mathematical magic, we can train the model in a time that scales
linearly with the time needed to train a model without interaction terms.
Second, each of the n features gets embedded in an inner product space with
similar features ending up close to one another in some sense. This allows us
to make a decent estimate of the interaction between two features even if they
rarely appear together in the training data.
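
To make the "mathematical magic" concrete, here is a small numpy sketch (mine,
not from the article) of the standard FM identity that computes the pairwise
interaction term in O(kn) instead of O(n^2):

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 200, 8                        # n features, k latent factors
    x = rng.normal(size=n)               # one (dense, for simplicity) example
    V = rng.normal(size=(n, k))          # row i is the latent vector v_i

    # Naive O(n^2 * k): sum over all pairs i < j of <v_i, v_j> * x_i * x_j.
    naive = sum(V[i] @ V[j] * x[i] * x[j]
                for i in range(n) for j in range(i + 1, n))

    # Factorized O(n * k):
    # 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ]
    s = V.T @ x                          # shape (k,)
    fast = 0.5 * (np.sum(s ** 2) - np.sum((V ** 2).T @ (x ** 2)))

    print(np.allclose(naive, fast))      # True

With a sparse x only the nonzero entries contribute, which is what makes this
workable when n is in the hundreds of millions.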

