
I actually built the same thing a while back, when whisper.cpp came out. A significant challenge is sane paragraph segmentation, since even humans often disagree on the best place for a line break. I wonder what approach you've used.



I've adopted a very simple approach: 80 words per "paragraph". I am now experimenting with computing an embedding for each sentence and trying to detect topic segments from them. But the simple approach already yields pleasant segments, as far as I can tell.
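
For anyone curious, here is a rough sketch of both ideas, not the actual code from the project. `chunk_by_words` is the fixed 80-word split. `topic_boundaries` is one possible way to use sentence embeddings: it assumes you already have one vector per sentence from whatever embedding model you like, and it starts a new segment wherever the cosine similarity between neighbouring sentences drops below a threshold. The function names and the 0.5 threshold are made up for illustration.

```python
import math


def chunk_by_words(text, words_per_paragraph=80):
    """Split text into 'paragraphs' of at most `words_per_paragraph` words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_paragraph])
        for i in range(0, len(words), words_per_paragraph)
    ]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def topic_boundaries(embeddings, threshold=0.5):
    """Return the sentence indices at which a new segment starts.

    `embeddings` holds one precomputed vector per sentence. The threshold
    is an assumed value and would need tuning per embedding model.
    """
    return [
        i
        for i in range(1, len(embeddings))
        if cosine(embeddings[i - 1], embeddings[i]) < threshold
    ]
```

The fixed-size split ignores sentence boundaries entirely, so a paragraph can end mid-sentence. Snapping each cut to the nearest sentence end would be an easy refinement.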



