
Dirichlet Process Mixture Models in Pyro - apsec112
https://pyro.ai/examples/dirichlet_process_mixture.html
======
benrbray
Without getting too far into the Bayesian vs. frequentist debate, I find the
whole field of Bayesian nonparametrics to be very mathematically satisfying.
Hierarchical models effectively come with a hyperparameter search "built-in".

It's also quite appealing that many of these probabilistic models made by
statisticians are "fuzzy" generalizations of ad-hoc algorithms originally
developed for practical reasons. In the same way that Gaussian Mixture Models
are a "fuzzy" generalization of K-means, Dirichlet Process Mixture Models are
a "fuzzy" generalization of the adaptive K-means algorithm, which increments K
whenever outliers are detected. This connection is nicely summarized by Kulis
& Jordan 2012 [1].
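
To make that connection concrete, here's a minimal NumPy sketch of the
DP-means algorithm from [1] (the small-variance limit of a DPMM): run
K-means-style iterations, but let any point whose squared distance to
every centroid exceeds a penalty λ spawn a new cluster. The function name
and fixed iteration count are my own choices, not from the paper.

```python
import numpy as np

def dp_means(X, lam, n_iters=20, seed=0):
    """Sketch of DP-means (Kulis & Jordan 2012, Algorithm 1).

    X:   (N, D) data array
    lam: cluster penalty; a point with squared distance > lam
         from every centroid starts a new cluster, so K grows
         with the data instead of being fixed in advance.
    """
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))].copy()]   # start with one cluster
    assignments = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # assignment step: nearest centroid, or a brand-new cluster
        assignments = []
        for x in X:
            d2 = [np.sum((x - c) ** 2) for c in centroids]
            if min(d2) > lam:                      # "outlier" -> new cluster
                centroids.append(x.copy())
                assignments.append(len(centroids) - 1)
            else:
                assignments.append(int(np.argmin(d2)))
        assignments = np.array(assignments)
        # M-step: recompute centroids, dropping clusters that lost
        # all their points and relabeling assignments to match
        new_centroids, relabel = [], {}
        for k in range(len(centroids)):
            members = X[assignments == k]
            if len(members) > 0:
                relabel[k] = len(new_centroids)
                new_centroids.append(members.mean(axis=0))
        centroids = new_centroids
        assignments = np.array([relabel[k] for k in assignments])
    return np.array(centroids), assignments
```

Setting λ large recovers something close to ordinary K-means with small K;
shrinking it lets the model pay for more clusters, which is exactly the
role the concentration parameter plays in the underlying DPMM.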

If you're wondering where to get started learning a topic like this, it's good
to know about latent variable models and expectation-maximization first. See
for example my own notes [2] on the topic. Following that, you can start to
understand variational inference, as well as topics relevant to modern deep
learning like amortized inference, variational autoencoders, etc.

[1] Kulis & Jordan 2012, "Revisiting K-Means: New Algorithms via Bayesian
Nonparametrics"
(https://people.eecs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf)

[2] https://benrbray.com/static/notes/eecs445-f16-em-notes.pdf

~~~
cs702
Have you looked at capsule-routing algorithms that use expectation-
maximization to generate layer outputs in deep neural networks? For example,
in https://research.google/pubs/pub46653/ and
https://arxiv.org/abs/1911.00792, each output capsule in a layer is
generated by a Gaussian mixture model and also has an associated "activation
value" that increases to the extent the GMM can explain (i.e., generate) its
input data "better" than the GMMs of the other output capsules in the same
layer. I wonder: do you think these routing algorithms, which are
differentiable end-to-end, could be generalized to Dirichlet Process Mixture
Models?
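
For anyone unfamiliar with EM routing, here's a stripped-down NumPy sketch
of one routing layer in the spirit of the first paper: alternate between
fitting a diagonal Gaussian per output capsule (M-step) and re-routing each
vote toward the capsules that explain it best (E-step). In the paper
`beta_a`, `beta_u`, and `lam` are learned/annealed parameters; here they
are fixed constants, and the function name is mine.

```python
import numpy as np

def em_routing(votes, a_in, n_iters=3, beta_a=0.0, beta_u=0.0, lam=1.0):
    """One EM-routing layer, stripped to its GMM core.

    votes: (I, J, D) vote of input capsule i for output capsule j
    a_in:  (I,) activations of the input capsules
    Returns output activations (J,) and Gaussian means (J, D).
    """
    I, J, D = votes.shape
    r = np.full((I, J), 1.0 / J)                   # uniform initial routing
    for _ in range(n_iters):
        # M-step: fit a diagonal Gaussian per output capsule, weighting
        # each vote by its routing weight and its input activation.
        w = r * a_in[:, None]                      # (I, J)
        w_sum = w.sum(axis=0) + 1e-8               # (J,)
        mu = (w[:, :, None] * votes).sum(axis=0) / w_sum[:, None]
        var = (w[:, :, None] * (votes - mu) ** 2).sum(axis=0) / w_sum[:, None]
        var += 1e-8
        # Activation is high when the capsule's Gaussian explains its
        # assigned votes cheaply (low description cost).
        cost = ((beta_u + 0.5 * np.log(var)) * w_sum[:, None]).sum(axis=1)
        a_out = 1.0 / (1.0 + np.exp(-lam * (beta_a - cost)))
        # E-step: responsibilities proportional to activation times the
        # Gaussian density of each vote (computed in log space).
        log_p = -0.5 * (((votes - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2)
        log_rj = np.log(a_out + 1e-8) + log_p      # (I, J)
        r = np.exp(log_rj - log_rj.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
    return a_out, mu
```

Written this way, the DPMM generalization would amount to replacing the
fixed set of J output Gaussians with a stick-breaking prior over them,
though keeping that differentiable end-to-end is the hard part.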

