
Group Lasso Regularization - keyboardman
https://leimao.github.io/blog/Group-Lasso/
======
currymj
In Bayesian terms, ridge regression is equivalent to putting a Gaussian prior
on the weights, which gets wider as the penalty term gets weaker.

The lasso is the same, but with a Laplace distribution as the prior.

The "elastic net", which combines the l1 and l2 penalties, even has a Bayesian
interpretation, with a fairly weird prior [1, 2].

Anyone know if there's an equivalent Bayesian interpretation of group lasso?
Maybe it's just a Gaussian prior but with block-wise correlation between the
variables?
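
For what it's worth, at the MAP level the correspondence is mechanical: any
penalty is the negative log of a prior proportional to exp(-penalty). A
sketch in my own notation (not from the article):

    \hat{\beta}_{\text{MAP}} = \arg\max_\beta \; \log p(y \mid X, \beta) + \log p(\beta)
                             = \arg\min_\beta \; \tfrac{1}{2\sigma^2} \|y - X\beta\|_2^2 - \log p(\beta)

    ridge:       p(\beta) \propto \exp(-\lambda \|\beta\|_2^2)              (Gaussian)
    lasso:       p(\beta) \propto \exp(-\lambda \|\beta\|_1)                (Laplace)
    group lasso: p(\beta) \propto \exp(-\lambda \sum_g \|\beta_g\|_2)       (per-group)

So at least as a MAP story, the group lasso prior looks like a per-group,
radially symmetric analogue of the Laplace rather than a correlated Gaussian,
but whether there's a cleaner fully Bayesian treatment is exactly what I'm
asking.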

[1]: [https://stats.stackexchange.com/questions/283238/is-there-a-...](https://stats.stackexchange.com/questions/283238/is-there-a-bayesian-interpretation-of-linear-regression-with-simultaneous-l1-and)

[2]: [https://www.tandfonline.com/doi/abs/10.1198/jasa.2011.tm0924...](https://www.tandfonline.com/doi/abs/10.1198/jasa.2011.tm09241)

~~~
laGrenouille
From a Bayesian perspective, it’s not entirely accurate to say that the
ridge/Gaussian-prior and lasso/Laplace-prior relationships are the same.
Ridge regression yields a "proper" Bayesian estimator (i.e., the mean of the
posterior); the lasso is only the MAP estimator, which is not really
recommended from a purely Bayesian perspective. Ridge is also a MAP
estimator, but on top of that it satisfies the stronger condition of
minimizing the Bayes risk.
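
Concretely (a sketch under the usual Gaussian-likelihood assumptions): with a
Gaussian likelihood and a Gaussian prior N(0, tau^2 I) on the weights, the
posterior is itself Gaussian,

    p(\beta \mid y) = \mathcal{N}(\beta \mid \mu_n, \Sigma_n), \quad
    \Sigma_n = \left( \tfrac{1}{\sigma^2} X^\top X + \tfrac{1}{\tau^2} I \right)^{-1}, \quad
    \mu_n = \tfrac{1}{\sigma^2} \Sigma_n X^\top y

and mu_n is exactly the ridge estimate with lambda = sigma^2 / tau^2. A
Gaussian's mode equals its mean, so MAP and posterior mean coincide. Under a
Laplace prior the posterior is not Gaussian, and its mode (the lasso
solution) and its mean come apart.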

There is a way to compute a "true" Bayesian Lasso, but it doesn’t yield a
sparse model [0].

[0]
[http://archived.stat.ufl.edu/casella/Papers/Lasso.pdf](http://archived.stat.ufl.edu/casella/Papers/Lasso.pdf)
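
If I remember the linked paper right, the computational device is that the
Laplace prior is a scale mixture of Gaussians with an exponential mixing
density (the Andrews-Mallows identity), which is what makes the Gibbs sampler
tractable:

    \frac{\lambda}{2} e^{-\lambda |\beta|}
      = \int_0^\infty \mathcal{N}(\beta \mid 0, \tau^2) \,
        \frac{\lambda^2}{2} e^{-\lambda^2 \tau^2 / 2} \, d\tau^2

The posterior mean then averages over all of these conditionally Gaussian
models, so individual coefficients are essentially never exactly zero, which
is why the fully Bayesian version loses the sparsity.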

------
abhgh
Not very commonly discussed, but a great topic to be aware of. One
particularly powerful use of this is in learning models with interaction
terms: more powerful than a linear model, but still interpretable [1].

[1] R package
[https://cran.r-project.org/web/packages/glinternet/index.htm...](https://cran.r-project.org/web/packages/glinternet/index.html)
(paper linked therein)

------
nerdponx
Nice post.

There's also at least one CrossValidated question on this topic, with good
answers attached. For example
[https://stats.stackexchange.com/q/214325/36229](https://stats.stackexchange.com/q/214325/36229)

------
6gvONxR4sf7o
Group Lasso is one of those things I'm really surprised not to see used more
often. Great post.

~~~
0xddd
Why is that? The article doesn't say much about the benefits of using this
type of regularization.

~~~
6gvONxR4sf7o
Say you want to represent a categorical feature that has many levels. You're
likely to one-hot encode it. Under a plain lasso, this creates a lack of
parity between how sparsely your continuous and categorical features are
treated: a continuous feature can't be partially included, but a
one-hot-encoded categorical can, one dummy column at a time. Sometimes you
want to be sparse on your original variables rather than on the dimensions of
their encoding. Group lasso gives you that, and it's a good inductive bias:
if, for example, profession is relevant to your model, then all professions
are relevant (and you still get to ridge-penalize the individual professions
within the group). Nice for interpretability too.
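
Here's a minimal numpy sketch of that (proximal gradient with block
soft-thresholding; the data and names are made up for illustration, not taken
from the article). The one-hot block of an irrelevant categorical gets zeroed
out as a unit, while the relevant continuous feature survives:

    import numpy as np

    rng = np.random.default_rng(0)

    # One continuous feature plus a 4-level categorical, one-hot encoded.
    n = 200
    x_cont = rng.normal(size=(n, 1))
    levels = rng.integers(0, 4, size=n)
    x_cat = np.eye(4)[levels]                   # n x 4 one-hot block
    X = np.hstack([x_cont, x_cat])              # n x 5 design matrix
    groups = [np.array([0]), np.arange(1, 5)]   # column indices per group

    # The categorical is pure noise here; only x_cont matters.
    y = 2.0 * x_cont[:, 0] + 0.1 * rng.normal(size=n)

    def prox_group(beta, groups, t):
        """Block soft-thresholding: shrink each group's l2 norm by t."""
        out = beta.copy()
        for g in groups:
            norm = np.linalg.norm(beta[g])
            out[g] = 0.0 if norm <= t else (1.0 - t / norm) * beta[g]
        return out

    lam = 5.0                                   # group-lasso strength
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant of gradient
    beta = np.zeros(X.shape[1])
    for _ in range(2000):
        grad = X.T @ (X @ beta - y)             # gradient of 0.5 * ||y - X beta||^2
        beta = prox_group(beta - step * grad, groups, step * lam)

    print(beta)  # whole one-hot block lands at exactly 0; x_cont survives

The prox step is the whole point: each group's coefficient vector is shrunk
toward zero as a unit, so a group either dies entirely or stays in with all
its levels.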

