Really nice! I'm also currently playing around a lot with automatically generated videos and I can see this having a lot of potential! Some questions that come to my mind:
1. Do you plan to include AWS Polly in the speech generation? It has a free tier and the API is nice, so it might also be a good choice for people who already use the AWS SDK in their projects.
2. How do you approach aligning the words with animations? Is it possible to align an animation to a specific word? I was wondering how one might approach this. Have you already described it somewhere?
I have not done a detailed write-up yet, but you can search the repo for "word_boundaries" and "TimeInterpolator". Services like Azure return timestamps for the beginning of each word, and for those that don't, I integrated Whisper to generate them from the audio. Then it's a matter of mapping the string indices to audio time via some sort of interpolation (I used linear).
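To make the last step concrete, here is a rough sketch of linear interpolation from character index to audio time. This is not the code from the repo: the function name `make_interpolator` and the `(char_offset, start_time_seconds)` pair format are made up for illustration, and real services may report boundaries in a different shape or unit.

```python
from bisect import bisect_right


def make_interpolator(boundaries, text_length, audio_duration):
    """Return a function mapping a character index to a time in seconds.

    boundaries: list of (char_offset, start_time_seconds) pairs, one per word.
    This layout is an assumption for the sketch, not any service's real format.
    """
    points = sorted(boundaries)
    # Anchor both ends so every index in the text maps to some time.
    if not points or points[0][0] > 0:
        points.insert(0, (0, 0.0))
    if points[-1][0] < text_length:
        points.append((text_length, audio_duration))
    offsets = [offset for offset, _ in points]
    times = [time for _, time in points]

    def time_at(char_index):
        # Clamp indices that fall outside the known range.
        if char_index <= offsets[0]:
            return times[0]
        if char_index >= offsets[-1]:
            return times[-1]
        # Find the pair of boundaries surrounding char_index.
        i = bisect_right(offsets, char_index) - 1
        x0, x1 = offsets[i], offsets[i + 1]
        t0, t1 = times[i], times[i + 1]
        # Linear interpolation between the two surrounding boundaries.
        return t0 + (t1 - t0) * (char_index - x0) / (x1 - x0)

    return time_at


text = "Hello world again"
# Hypothetical word start times for "Hello", "world", "again".
boundaries = [(0, 0.0), (6, 0.5), (12, 1.0)]
time_at = make_interpolator(boundaries, len(text), audio_duration=1.6)

# Time at which the word "world" starts being spoken.
world_start = time_at(text.index("world"))
```

With something like this, an animation can be scheduled to start at `time_at(index_of_word)`, which is how aligning to a specific word would work under these assumptions.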
Hey HN! This is a side project that I started working on roughly a year ago.
Manim is the math animation library created by the awesome math YouTuber 3Blue1Brown. This post is about Manim Voiceover, a plugin for Manim that provides an API for adding voiceovers to videos. My goal is to create efficient text2video pipelines and make it possible to automatically generate beautiful explainer videos from any educational text on the web.