However, I found that Whisper is thrown off by background music in a podcast, and it does not recover. (That was with the mlx-community/whisper-large-v3-mlx checkpoint; OP uses distil-whisper-large-v3.) I concluded for myself that Whisper is probably meant to be used inside larger processing pipelines that handle such cases. Can someone provide insights about that? The podcast I used it on was https://www.heise.de/news/KI-Update-Deep-Dive-Was-taugen-KI-....
I use a noise filter pass (really just https://github.com/richardpl/arnndn-models/blob/master/bd.rn... and some speech band filtering after) before doing any processing in Whisper. It's worked well for me on dirty audio (music in the background, environmental noise, etc). After filtering, any music is either almost inaudible or only the particularly clear parts featuring singing come through.
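For anyone who wants to try that kind of pre-pass, here is a rough sketch of how it could look with ffmpeg's `arnndn`, `highpass` and `lowpass` filters driven from Python. This is not the exact setup described above: the function names, the 200 Hz and 3400 Hz cutoffs, and the model path are all assumptions for illustration, and you would need to download an arnndn model file yourself and pass its local path.

```python
import subprocess


def build_denoise_command(src, dst, model_path, low_hz=200, high_hz=3400):
    """Build an ffmpeg command that denoises speech before transcription.

    model_path is a local arnndn model file (hypothetical path, supply your own).
    low_hz / high_hz are assumed speech-band cutoffs, not values from the thread.
    """
    filters = ",".join([
        f"arnndn=m={model_path}",   # RNN-based noise suppression
        f"highpass=f={low_hz}",     # cut rumble below the speech band
        f"lowpass=f={high_hz}",     # cut content above the speech band
    ])
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-af", filters,
        "-ar", "16000",             # Whisper works on 16 kHz audio
        "-ac", "1",                 # mono
        str(dst),
    ]


def denoise(src, dst, model_path, **kwargs):
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_denoise_command(src, dst, model_path, **kwargs), check=True)
```

Calling `denoise("episode.mp3", "episode_clean.wav", "model.rnnn")` (file names made up) would write a cleaned 16 kHz mono file that you can then hand to Whisper. Paths containing characters like `:` or `,` would need escaping inside the filter string, which this sketch does not handle.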
However, I found that Whisper is thrown off by background music in a podcast, and it does not recover. (That was with the mlx-community/whisper-large-v3-mlx checkpoint; OP uses distil-whisper-large-v3.) I concluded for myself that Whisper is probably meant to be used inside larger processing pipelines that handle such cases. Can someone provide insights about that? The podcast I used it on was https://www.heise.de/news/KI-Update-Deep-Dive-Was-taugen-KI-....
I ended up using Google Gemini, which handled it well. (Blog post: https://ndurner.github.io/mlx-whisper-gemini)