I actually built the same thing a while back, when whisper.cpp came out. A significant challenge is sane paragraph segmentation, since even humans often don't agree on the best place for a line break. Which approach did you use?
I've adopted a very simple approach: 80 words per "paragraph". I'm now experimenting with computing an embedding for each sentence and trying to detect topic segments from those. But the simple approach yields pleasant segments AFAIK.
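For anyone curious what those two ideas might look like, here is a rough Python sketch, not the actual implementation. The function names, the similarity threshold, and the bag-of-words `toy_embed` are all assumptions for illustration; a real setup would swap in a proper sentence-embedding model.

```python
import math
import re
from collections import Counter


def chunk_by_words(text, words_per_paragraph=80):
    """Fixed-size approach: cut the transcript every N words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_paragraph])
        for i in range(0, len(words), words_per_paragraph)
    ]


def toy_embed(sentence):
    """Hypothetical stand-in for a real embedding: word-count vector."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment_by_similarity(sentences, embed=toy_embed, threshold=0.1):
    """Start a new segment when adjacent sentences look dissimilar.

    The threshold is an arbitrary assumption and would need tuning
    for whatever embedding model is actually used.
    """
    if not sentences:
        return []
    segments = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        cur = embed(sentence)
        if cosine(prev, cur) < threshold:
            segments.append([sentence])   # likely topic shift
        else:
            segments[-1].append(sentence)  # same topic, keep going
        prev = cur
    return segments
```

The fixed-size version ignores sentence boundaries entirely, which is probably why it is so simple to get working. The similarity version only compares neighbouring sentences, so it can miss slow topic drift.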
This is absolutely amazing. Being able to click on any piece of text and jump straight to the part of the audio where it's being said is great.
I wish the video were shown as well, but other than that, excellent work!