It's also quite appealing that many of these probabilistic models developed by statisticians are "fuzzy" generalizations of ad-hoc algorithms originally invented for practical reasons. In the same way that Gaussian Mixture Models are a "fuzzy" generalization of K-means, Dirichlet Process Mixture Models are a "fuzzy" generalization of an adaptive K-means algorithm that increments K whenever a point sits too far from every existing centroid. This connection is nicely summarized by Kulis & Jordan 2012.
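To make that concrete, here's a minimal sketch of the hard-clustering algorithm that falls out of the DPMM small-variance asymptotics in that paper (often called "DP-means"). The function and parameter names (`dp_means`, `lam`) are my own; `lam` is the penalty that controls when a new cluster gets opened, playing the role the DP concentration parameter plays in the probabilistic model:

```python
import numpy as np

def dp_means(X, lam, max_iters=100):
    """Sketch of DP-means: K-means that grows K on the fly.

    A point whose squared distance to every centroid exceeds `lam`
    spawns a new cluster instead of being forced into an old one.
    """
    centroids = [X.mean(axis=0)]            # start with one global cluster
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        old_labels = labels.copy()
        # Assignment step: nearest centroid, unless everything is too far.
        for i, x in enumerate(X):
            d2 = np.array([np.sum((x - c) ** 2) for c in centroids])
            if d2.min() > lam:              # "outlier": open a new cluster
                centroids.append(x.copy())
                labels[i] = len(centroids) - 1
            else:
                labels[i] = int(d2.argmin())
        # Update step: recompute centroids, exactly as in K-means
        # (an emptied cluster just keeps its old centroid here).
        for k in range(len(centroids)):
            mask = labels == k
            if mask.any():
                centroids[k] = X[mask].mean(axis=0)
        if np.array_equal(labels, old_labels):
            break
    return np.array(centroids), labels
```

With `lam` large you recover a single cluster; as you shrink it, clusters split off automatically, which is the hard-assignment shadow of the DPMM letting K grow with the data.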
If you're wondering where to get started learning a topic like this, it's good to know about latent variable models and expectation-maximization first. See for example my own notes on the topic. Following that, you can start to understand variational inference, as well as topics relevant to modern deep learning like amortized inference, variational autoencoders, etc.
Kulis & Jordan 2012, "Revisiting K-Means: New Algorithms via Bayesian Nonparametrics" (https://people.eecs.berkeley.edu/~jordan/papers/kulis-jordan...)