edit: Swapped the YouTube URL to PeerTube due to Content ID claim issues.
Error: Could not get code signature for running application
at m(/Applications/Youka.app/Contents/Resources/app/.webpack/main/index.js:1:12481)
at App.<anonymous> (/Applications/Youka.app/Contents/Resources/app/.webpack/main/index.js:1:14365)
at App.emit (events.js:215:7)
Search your query in YouTube using https://github.com/youkaclub/youka-youtube
Search lyrics using https://github.com/youkaclub/youka-lyrics
Split the vocals from instruments using https://github.com/deezer/spleeter (see the sketch after this list)
Align text to voice (the hardest part) using some private API
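If you want to reproduce the separation step yourself, spleeter's documented Python API is enough for a minimal sketch (the filenames here are placeholders, and "spleeter:2stems" is the model that gives a vocals/accompaniment split):

    # Minimal vocal/instrumental split via spleeter's Python API.
    # Assumes: pip install spleeter, and a local song.mp3 (placeholder name).
    from spleeter.separator import Separator

    separator = Separator("spleeter:2stems")
    separator.separate_to_file("song.mp3", "output/")
    # Results land in output/song/vocals.wav and output/song/accompaniment.wav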
The alignment is also the part that would be most interesting to have explained. Is it language-agnostic? After all, the title says "in any language", but I can't think of any text-audio alignment algorithms that don't require a language-specific model. (Unless you just count characters and assume they map linearly to time, which I'd expect to go very badly.)
But that seems a lot more complicated... so, unlikely.
A way to cheat that would probably work well enough most of the time would be to do spectrographic analysis on the audio stream to identify syllables, and then similarly just count syllables in the known text and line those up. That works better the more consistent your spelling system is, though, and still requires language-specific modelling. If you actually want to do a decent job cross-linguistically, you'd need, in the general case, a dictionary for every supported language listing syllable counts for each word (because not everybody's orthography is transparent enough to make simple models like counting character sequences work).
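Concretely, the cheat might look something like this sketch: librosa's onset detector as a crude syllable-nucleus finder and a vowel-group regex as the crude syllable counter. This is my own guess at an implementation (filenames and the example line are placeholders), and the vowel regex bakes in exactly the orthography problem just described:

    import re
    import librosa

    # Crude syllable counter: vowel groups in the written lyrics. This is
    # the orthography-dependent shortcut described above; it breaks down
    # for opaque spelling systems.
    def count_syllables(word):
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    # Crude syllable-nucleus finder: onset times in the isolated vocal
    # track ("vocals.wav" is a placeholder for the separated vocals).
    y, sr = librosa.load("vocals.wav", sr=None)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")

    # Naive linear pairing: each word starts at the onset of its first
    # syllable; no acoustic matching of any kind.
    i = 0
    for word in "twinkle twinkle little star".split():
        if i < len(onsets):
            print(f"{onsets[i]:6.2f}s  {word}")
        i += count_syllables(word)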
If you actually have a fully language-agnostic algorithm for aligning text to audio that's decently accurate, though, that's gotta be worth at least a Master's degree in computational linguistics, 'cause on the face of it, it doesn't seem to me (who has such a Master's degree) that it should even theoretically be possible.
Espeak-ng supporting 108 languages is maybe a bit misleading. They have pronunciation definitions for many languages, but the actual level of support varies widely.
For Mandarin, espeak-ng 1.49.2 has a bug where it reads the tone numbers out loud instead of modifying the pitch contour, so e.g. the number 四 (four) is pronounced si si instead of sì, because it has the fourth tone. That's the version packaged for Ubuntu, so you may be using it for your API.
For Japanese, kanji aren't supported at all, so 四 is pronounced as "Chinese letter" (in English). For proper Japanese support, you'd need to switch to a different TTS engine like Open JTalk or preprocess the text to transform it into kana.
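For the preprocessing route, a minimal sketch with pykakasi (my pick of converter; there are others, and the segmentation shown in the comments may differ slightly by version):

    # Turn kanji into kana before handing text to espeak-ng's Japanese voice.
    # Assumes: pip install pykakasi
    import pykakasi

    kks = pykakasi.kakasi()
    for token in kks.convert("四月の雨"):  # arbitrary example text
        print(token["orig"], "->", token["hira"])
    # e.g. 四月 -> しがつ, の -> の, 雨 -> あめ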
Also note that Aeneas is licensed under the AGPL, which requires you to offer the source code if you let others interact with the program over a network (which is what your API does). So your attempt to keep the secret sauce private and only reveal it once someone guessed the algorithm was likely illegal. You should add proper copyright notices to your program and audioai.online.
Even with only line-level accuracy, that would've been nice to have 7 years ago... but I see the first commit to the project was only in 2015. Might still be useful to some of my old colleagues, though; I'll have to see if they've heard of it.
My intuition, however, is that a meet-in-the-middle approach, using automatic speech recognition and then aligning the resulting text streams, would be optimal, and indeed every other major forced-alignment tool besides aeneas (https://github.com/pettarin/forced-alignment-tools) does seem to work that way. The catch, of course, is that you actually need decent ASR language models for every target language to make that work, and as you can see from that list, it is rare for any given engine to support more than a few languages; CMU Sphinx probably has the widest support, although it's not the highest-end toolkit for popular languages like English. So, if you really want to maintain the broadest possible language support, and you can afford the API fees, building a new alignment engine that piggy-backs on Microsoft's or IBM's speech recognition APIs is probably the best option. To keep it cheap, I'd use Sphinx's aligner as the preferred option for all the languages it has models for, and either fall back on aeneas for the remaining languages or (if you can afford occasional API calls to commercial services for the occasional less-popular language) upgrade to Microsoft/IBM services for those.
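As a toy illustration of the meet-in-the-middle idea: real engines align at the phoneme level with something Viterbi-ish, but a word-level version with difflib shows the shape of it. The ASR output below is made up; the commercial speech APIs return word-level timestamps in roughly this form.

    import difflib

    # Made-up ASR output with word-level timestamps, the kind of thing
    # commercial speech-to-text APIs return. Note the error ("littel").
    asr = [("twinkle", 0.5), ("twinkle", 1.1), ("littel", 1.8), ("star", 2.4)]
    lyrics = "Twinkle twinkle little star".split()

    # Align the recognized word stream against the known lyrics.
    matcher = difflib.SequenceMatcher(
        a=[w for w, _ in asr],
        b=[w.lower() for w in lyrics],
    )
    timed = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            timed[block.b + k] = asr[block.a + k][1]

    # Misrecognized words stay untimed; a real aligner would interpolate.
    for i, word in enumerate(lyrics):
        print(timed.get(i, "?"), word)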
Audio AI API
Split voice from audio
Sync voice to text
In the future, it would be great to have a "portable" version of this for Windows that doesn't install anything. It's annoying to open an app and have it install itself without any warning or user consent. You could just release a .zip file with the build as an option.
Ooops, some error occurred :(
Error: [Errno 2] No such file or directory: '/tmp/tmpphtr8ehu/accompaniment.aac'
When running on the official Windows 10 Sandbox (https://techcommunity.microsoft.com/t5/windows-kernel-intern...)
Edit: it somehow works for some songs. The concept is really nice. I love it.
Ooops, some error occurred :(
Error: name 'espeakng_supported_langs' is not defined
I'll look into aeneas to see if it can give me the API-level tools I need - thank you for explaining that part in the other comments!
If your lyrics are in Peh-oe-ji, you'll need to define how the romanization maps to phonemes. You may be able to get some inspiration for that from the definitions for Mandarin and Cantonese. Though I just looked at the "phonology" section on Wikipedia https://en.wikipedia.org/wiki/Taiwanese_Hokkien#Phonology and the tone sandhi rules look a lot more complex than in any other Sinitic language I know.
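As a starting point on the POJ side, you could normalize each syllable into base letters plus a tone number before touching espeak-ng's rule files. The tone conventions encoded below are my reading of the orthography (acute/grave/circumflex/macron/vertical line for tones 2/3/5/7/8; tones 1 and 4 unmarked, with 4 on checked syllables), so double-check them:

    import re
    import unicodedata

    # Combining marks used by Peh-oe-ji tone diacritics.
    TONE_MARKS = {
        "\u0301": 2,  # acute
        "\u0300": 3,  # grave
        "\u0302": 5,  # circumflex
        "\u0304": 7,  # macron
        "\u030d": 8,  # vertical line above
    }

    def poj_syllable(s):
        """Split a POJ syllable into (base letters, tone number)."""
        tone = None
        base = []
        for ch in unicodedata.normalize("NFD", s):
            if ch in TONE_MARKS:
                tone = TONE_MARKS[ch]
            elif not unicodedata.combining(ch):
                base.append(ch)
        base = "".join(base)
        if tone is None:
            # Unmarked: tone 4 on checked syllables (-p/-t/-k/-h), else 1.
            tone = 4 if re.search(r"[ptkh]$", base) else 1
        return base, tone

    print(poj_syllable("Tâi"))  # ('Tai', 5)
    print(poj_syllable("oân"))  # ('oan', 5)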
If the lyrics use Chinese characters, there's the added difficulty of collecting a pronunciation dictionary, which I'd probably do by scraping https://twblg.dict.edu.tw/holodict_new/index.html , http://xiaoxue.iis.sinica.edu.tw/ccr/ and Wiktionary. (If you know any other sources for pronunciation data, I'm interested.)
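For the Wiktionary part, the MediaWiki API gets you raw wikitext without HTML scraping. As far as I can tell, the Hokkien readings sit in the |mn= parameter of the {{zh-pron}} template, but treat that parameter name as an assumption and verify it on a few pages first:

    import requests

    API = "https://en.wiktionary.org/w/api.php"

    def wikitext(title):
        """Fetch raw wikitext for a page from English Wiktionary."""
        params = {
            "action": "parse",
            "page": title,
            "prop": "wikitext",
            "format": "json",
            "formatversion": 2,
        }
        resp = requests.get(API, params=params, timeout=10)
        return resp.json()["parse"]["wikitext"]

    # {{zh-pron}} appears to store Min Nan (Hokkien) readings under |mn=.
    for line in wikitext("四").splitlines():
        if "|mn=" in line:
            print(line.strip())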
Tones are difficult, so I encode those as colours. Adding code to espeak-ng sounds very difficult. Most of the songs are in Mandarin though, so I'll try those first.
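In case it's useful to anyone doing the same: the colour encoding can boil down to a lookup at render time. The palette below is an arbitrary placeholder, not necessarily a good scheme:

    # Map Mandarin tone numbers to display colours; palette is a placeholder.
    TONE_COLOURS = {1: "red", 2: "orange", 3: "green", 4: "blue", 5: "gray"}  # 5 = neutral

    def colour_syllable(syllable, tone):
        """Wrap one pinyin syllable in an HTML span coloured by its tone."""
        return f'<span style="color:{TONE_COLOURS[tone]}">{syllable}</span>'

    print(colour_syllable("sì", 4))  # <span style="color:blue">sì</span>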