
I didn't read through the paper (just the abstract), but isn't the whole point of the KL divergence loss to get the best compression, which is equivalent to Bayesian learning? I don't see what's novel here; I'm fairly sure people were doing this with Markov chains back in the '90s.
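A minimal sketch of the compression half of that claim, using made-up toy distributions p (data) and q (model), not anything from the paper: the expected code length under q decomposes as H(p) + KL(p||q), so pushing the KL loss down is the same as pushing the achievable compressed size toward the entropy limit.

    import numpy as np

    # Hypothetical toy source: 4-symbol alphabet.
    p = np.array([0.5, 0.25, 0.125, 0.125])  # assumed "true" data distribution
    q = np.array([0.4, 0.3, 0.2, 0.1])       # assumed model distribution

    # Expected code length (bits/symbol) of an ideal coder built on q:
    #   L(q) = -sum_x p(x) log2 q(x) = H(p) + KL(p || q)
    cross_entropy = -(p * np.log2(q)).sum()
    entropy       = -(p * np.log2(p)).sum()
    kl            = (p * np.log2(p / q)).sum()

    print(cross_entropy)   # ~1.85 bits/symbol
    print(entropy + kl)    # same value: minimizing KL == minimizing code length

The Bayesian side is the usual prequential/MDL argument: coding a sequence with the posterior predictive distribution gives a total code length equal to the negative log marginal likelihood, which is what Bayesian model selection scores anyway.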


In fact, it is nothing new.



