Hi! Blog author here. This was an attempt, a couple of years ago, to understand and write about this paper in a detailed way. Here is a video walking through the topic as well: https://youtu.be/dKJEpOtVgXc?si=PDNO0B0qi6ARHaeb
Section 2 of the blog post is no longer very relevant: later advances (DSS, S4D) simplified that part of the process. Arguably this should all also be updated for Mamba (same authors).
Thanks for your spectacular resources! I see that you've started an Annotated Mamba repository -- any chance you could share when that blog page might go live?
A lot of intimidating math that will make all self-attention tutorials seem like a walk in the park by comparison. Luckily, subsequent state space models building on S4 (DSS, S4D, and newer ones like Mamba) simplified the primitives and the math involved.
The math is not designed to intimidate; rather, it approaches the question of how to build a sequence model in a principled way, starting from state space models, which draw on an arguably longer literature than neural networks do.
Some of the concepts are better explained here than anywhere else, and they make it straightforward to make sense of Mamba, which is increasingly popular. (A minimal sketch of the core primitive follows below.)
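To make that concrete, here is a minimal NumPy sketch of the primitive in question: a linear state space model x'(t) = A x(t) + B u(t), y(t) = C x(t), discretized with the bilinear transform and unrolled as a recurrence. This is a toy illustration with random placeholder matrices, not the post's actual code; S4's real contribution is a structured (HiPPO) A matrix and a fast convolutional way to compute the same recurrence.

```python
# Toy sketch of the SSM primitive, not the blog's actual implementation.
import numpy as np

def discretize(A, B, step):
    """Bilinear (Tustin) discretization of the continuous system,
    as used in the S4 paper."""
    I = np.eye(A.shape[0])
    BL = np.linalg.inv(I - (step / 2.0) * A)
    Ab = BL @ (I + (step / 2.0) * A)
    Bb = (BL * step) @ B
    return Ab, Bb

def run_ssm(A, B, C, u, step=1.0):
    """Unroll x_k = Ab x_{k-1} + Bb u_k, y_k = C x_k over a 1-D input u."""
    Ab, Bb = discretize(A, B, step)
    x = np.zeros((A.shape[0], 1))
    ys = []
    for u_k in u:
        x = Ab @ x + Bb * u_k
        ys.append((C @ x).item())
    return np.array(ys)

# Random placeholder matrices; S4 uses a structured (HiPPO) A instead.
rng = np.random.default_rng(0)
N = 4
A = rng.standard_normal((N, N)) * 0.1
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
u = rng.standard_normal(10)
print(run_ssm(A, B, C, u))
```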
I did not mean it in a negative way; this is a great resource. But the math will be intimidating regardless for most devs who don't have a solid math/signal-processing background. It's way beyond the simple linear algebra plus the chain rule from calculus that are required to understand basic neural network training.
Well, but this stuff is also much more principled and much better understood (by construction) than why/how a transformer works. The price of actual understanding, and of being able to make precise statements, is that the statements will be precise and detailed (i.e., likely involve math).