
Show HN: Karaoke for any song in any language - youka
https://github.com/youkaclub/youka-desktop
======
rnotaro
Demo:
[https://peertube.co.uk/videos/watch/3c183b56-deb6-4e6b-a7a2-...](https://peertube.co.uk/videos/watch/3c183b56-deb6-4e6b-a7a2-87cd692df483)

edit: Swapped the YouTube URL to PeerTube due to Content ID claim issues.

~~~
youka
thanks but the video is unavailable

~~~
rnotaro
Ah yes. It got a copyright claim for "Love Me Do" by The Beatles, even though
it's in the public domain in Europe and Canada.

Classic Youtube.

------
laurieg
I installed this on Mac OS but the program always fails with:

    
    
      Uncaught Exception:
      Error: Could not get code signature for running application
          at m (/Applications/Youka.app/Contents/Resources/app/.webpack/main/index.js:1:12481)
          at App.<anonymous> (/Applications/Youka.app/Contents/Resources/app/.webpack/main/index.js:1:14365)
          at App.emit (events.js:215:7)

~~~
youka
Just reopen it and you'll be fine. (I don't have a spare $99/year for an Apple
code-signing certificate.)

------
ronyfadel
I wish the readme had a description of how Youka works. Looks promising, but
I’m not sure it does what I think it does.

~~~
youka
I'll add an explanation soon. Here's the main process:

Search for your query on YouTube using [https://github.com/youkaclub/youka-youtube](https://github.com/youkaclub/youka-youtube)

Search lyrics using [https://github.com/youkaclub/youka-
lyrics](https://github.com/youkaclub/youka-lyrics)

Split the vocals from instruments using
[https://github.com/deezer/spleeter](https://github.com/deezer/spleeter)

Align the text to the voice (the hardest part) using a private API
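
The steps above can be sketched end-to-end. Every helper here is a hypothetical
placeholder for one of the linked projects, not their real APIs:

```python
# Toy walk-through of the four steps above. Each helper is a stand-in
# for one of the linked projects (youka-youtube, youka-lyrics, spleeter,
# and the private alignment service); none of this is the real API.

def search_youtube(query):
    # youka-youtube would download the matching video's audio track
    return {"audio": query + ".m4a"}

def search_lyrics(query):
    # youka-lyrics would fetch the lyrics text for the song
    return "line one\nline two"

def split_stems(audio_path):
    # spleeter would separate the vocals from the instrumental track
    return audio_path + ".vocals.wav", audio_path + ".accompaniment.wav"

def align(lyric_lines, vocals_path):
    # the alignment step would return (start, end) seconds per lyric line
    return [(0.0, 2.5), (2.5, 5.0)]

def make_karaoke(query):
    video = search_youtube(query)
    lyrics = search_lyrics(query)
    vocals, accompaniment = split_stems(video["audio"])
    timings = align(lyrics.splitlines(), vocals)
    # a real renderer would burn the timed lyrics over the accompaniment
    return {"accompaniment": accompaniment, "timings": timings}
```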

~~~
yorwba
> Align text to voice (the hardest part) using some private api

That's also the part that would be most interesting to have explained. Is it
language-agnostic? After all, the title says "in any language", but I can't
think of any text-audio alignment algorithms that don't require a language-
specific model. (Unless you just count characters and assume they map linearly
to time, which I'd expect to go very badly.)
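
For concreteness, that naive character-counting baseline might look like this
(a sketch only; it is the approach being dismissed, not a recommendation):

```python
def linear_align(lines, duration):
    """Naively assign each lyric line a (start, end) span proportional to
    its character count -- the baseline expected to go very badly."""
    total_chars = sum(len(line) for line in lines)
    spans, t = [], 0.0
    for line in lines:
        dt = duration * len(line) / total_chars
        spans.append((round(t, 3), round(t + dt, 3)))
        t += dt
    return spans
```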

~~~
gliese1337
Having worked for many years in a linguistics research lab where we spent a
lot of money paying people to edit and align subtitles and audio transcripts,
and having largely written what was at the time the most sophisticated
subtitle-and-transcript editing tool available, I can confirm: counting
characters and mapping them linearly to timespan, even after isolating vocals,
does indeed go _very_ poorly. And much worse when there's singing involved.

~~~
youka
So let’s play a game: if you can guess the alignment method, I’ll open-source it :)

~~~
gliese1337
Alternately, since you say speech recognition isn't "even close", I might try
going the other way: running text-to-speech on the lyrics, attempting to
align the two speech tracks, and then back-porting the timecodes from the
audio alignment onto the text.

But that seems a _lot_ more complicated... so, unlikely.

A way to cheat that would probably work well enough most of the time would be
to run spectrographic analysis on the audio stream to identify syllables, and
then similarly just count syllables in the known text and line those up. That works
better the more consistent your spelling system is, though, and still requires
language-specific modelling. If you actually want to do a decent job cross-
linguistically, you'd need in the general case a dictionary for every
supported language listing syllable counts for each word (because not
everybody's orthography is transparent enough to make simple models like
counting character sequences work).
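
A toy version of that syllable-counting shortcut, assuming the hard part
(detecting syllable onsets in the audio) is already done; the function and
data shapes are illustrative, not taken from any real tool:

```python
def syllable_align(line_syllable_counts, onset_times):
    """Map lyric lines to time spans by consuming detected syllable onsets.

    line_syllable_counts: syllables per lyric line (would need a per-language
        dictionary for opaque orthographies, as noted above).
    onset_times: sorted times (seconds) of syllable onsets found in the audio.
    """
    spans, i = [], 0
    for count in line_syllable_counts:
        start = onset_times[i]
        i += count
        end = onset_times[i] if i < len(onset_times) else onset_times[-1]
        spans.append((start, end))
    return spans
```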

If you have a fully language-agnostic algorithm for aligning text to audio
that's actually decently accurate, though, that's gotta be worth at least a
Master's degree in computational linguistics, 'cause on the face of it, it
doesn't seem to me (who has such a Master's degree) that it should even
theoretically be possible.

~~~
youka
You are close enough, so I have to keep my word. I’m not a genius, just a
Lego builder; I’ve tried a lot of methods, from DL to classical ML, but the
aeneas project (with some optimizations) gave me the best results. Amazing
project, and an even better personality behind it. Take a look at
[https://github.com/readbeyond/aeneas](https://github.com/readbeyond/aeneas).
Together with espeak-ng, you can get good line-level alignment results for
108 languages.
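
For reference, aeneas's documented approach is: synthesize the text with a TTS
engine (espeak-ng), then use dynamic time warping (DTW) to map positions in
the synthetic audio onto the real track. Here is a self-contained toy of the
DTW step, using 1-D numbers as stand-ins for per-frame audio features:

```python
def dtw_path(a, b):
    """Classic dynamic-time-warping alignment path between two 1-D feature
    sequences (stand-ins for per-frame features such as MFCCs)."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the corner to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if step == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def map_time(path, synth_frame):
    """Map a frame index in the synthetic audio to a frame in the real audio."""
    return min(path, key=lambda p: abs(p[0] - synth_frame))[1]
```

With synthetic line boundaries known (you generated the TTS audio yourself),
`map_time` carries each boundary over to the real recording, which is all
line-level karaoke timing needs.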

~~~
gliese1337
Ah! It's not even _trying_ to do word or syllable-level alignment. Well, that
makes the problem considerably more error-tolerant. And they specifically call
out ASR-based aligners as more accurate, so that makes me feel good about
myself! Still, that's a cool project; thanks for pointing it out. I shall have
to dig into it and see what they are actually doing.

Even with only line-level accuracy, that would've been nice to have 7 years
ago... but I see the first commit to the project is only in 2015. Might still
be useful to some of my old colleagues, though; I'll have to see if they've
heard of it.

~~~
youka
Based on your experience, which alignment method/system is the state of the
art? (I’m looking for accurate word/syllable level alignment for Youka)

~~~
gliese1337
I actually have very little direct experience with automated forced alignment;
I have enough experience in the space to know that naive approaches suck, but
back when my boss was paying people to do manual alignment most of the effort
went into second-language subtitles for pedagogical studies... which means the
text doesn't actually represent the same words that are in the audio, because
they're words in a different language, and _nothing_ would do a good job of
accurately aligning that! So I got very little support for building in a more
sophisticated auto-alignment system.

My intuition, however, is that a meet-in-the-middle approach using automatic
speech recognition and then aligning the resulting text streams would be the
optimal approach, and indeed every other major forced-alignment tool besides
aeneas ([https://github.com/pettarin/forced-alignment-
tools](https://github.com/pettarin/forced-alignment-tools)) does seem to use
that approach. The catch, of course, is that you actually need decent ASR
language models for every target language to make that work, and as you can
see from that list, it is rare for any given engine to support more than a few
languages; CMU Sphinx probably has the widest support, although it's not the
highest-end toolkit for popular languages like English. So, if you really want
to maintain the broadest possible language support, and you can afford the API
fees, building a new alignment engine that piggy-backs on Microsoft's or IBM's
speech recognition APIs is probably the best option. Or, to keep it cheap, I'd
use Sphinx's aligner as the preferred option for all the languages it has
models for, and either fall back on aeneas for the remaining languages, or (if
you can afford occasional API calls to commercial services for the occasional
less-popular language) upgrade to Microsoft/IBM services for the remaining
languages.
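
That meet-in-the-middle idea can be sketched with difflib: match the ASR
transcript's words against the trusted lyrics, then back-port the per-word
timestamps. The `asr_words` shape here is a hypothetical stand-in for whatever
a real word-level ASR API returns:

```python
import difflib

def backport_timestamps(lyrics_words, asr_words):
    """Attach timestamps to known lyrics by matching them against an ASR
    transcript that carries per-word times.

    lyrics_words: list of words from the trusted lyrics text.
    asr_words: list of (word, start_seconds) pairs from a word-level ASR
        result (a hypothetical format, not any specific API's output).
    """
    recognized = [w for w, _ in asr_words]
    matcher = difflib.SequenceMatcher(a=lyrics_words, b=recognized)
    timed = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            timed[block.a + k] = asr_words[block.b + k][1]
    # Words the ASR missed or misheard get no timestamp (None); a real
    # system would interpolate between their timed neighbours.
    return [(w, timed.get(i)) for i, w in enumerate(lyrics_words)]
```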

~~~
youka
I’ve tested every single ASR alignment solution mentioned here:
[https://github.com/pettarin/forced-alignment-tools](https://github.com/pettarin/forced-alignment-tools).
They all performed poorly compared to aeneas, even with good language models
(English).

~~~
gliese1337
Interesting....

------
Reubend
Hey there! First of all, I want to tell you that the app is fantastic. I used
the earlier version of this, when it was a website, from your previous HN
post. And once again the alignment works quite well in my experience, as does
the isolation.

In the future, it would be great to have a "portable" version of this for
Windows that doesn't install anything. It's annoying to open up an app, and
have it install itself without any warning or user consent. You could just
release a .zip file with the build as an option.

~~~
youka
I’ve considered a few options for installing ffmpeg and chose this approach.
I’m open to other suggestions.

~~~
Reubend
You can distribute a .zip file which includes a statically linked build of
FFmpeg:
[https://ffmpeg.zeranoe.com/builds/](https://ffmpeg.zeranoe.com/builds/).
Then just call it locally; there's no need to install it system-wide.
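
Calling a bundled binary is just a matter of pointing at the local path rather
than relying on the system PATH. A sketch, with illustrative paths (this is
not Youka's actual code):

```python
import os

def ffmpeg_command(app_dir, video_path, out_path):
    """Build an ffmpeg invocation that uses a binary shipped inside the app
    directory instead of a system-wide install. Paths are illustrative."""
    ffmpeg = os.path.join(app_dir, "ffmpeg")  # the bundled static binary
    # -vn drops the video stream; "-acodec copy" keeps the audio as-is
    return [ffmpeg, "-y", "-i", video_path, "-vn", "-acodec", "copy", out_path]
```

The app would then pass this list to `subprocess.run(cmd, check=True)`;
nothing is written outside the app's own directory.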

~~~
youka
I don’t install it system wide, just download a single binary into youka
directory.

------
rnotaro
I get an error when trying to open any video:

Ooops, some error occurred :( Error: [Errno 2] No such file or directory:
'/tmp/tmpphtr8ehu/accompaniment.aac'

When running on the official Windows 10 Sandbox
([https://techcommunity.microsoft.com/t5/windows-kernel-
intern...](https://techcommunity.microsoft.com/t5/windows-kernel-
internals/windows-sandbox/ba-p/301849))

Edit: it somehow works for some songs. The concept is really nice. I love it.

~~~
youka
Looks like a server-side bug (the server can't really handle more than a
single split process concurrently). I'll add a queue in the next version.

------
yunusabd
Personally I love karaoke, but looking at the repo and the website gave me no
information whatsoever about this project. Maybe that's something you can work
on? In the meantime I found this article, which is quite positive:
[https://www.theverge.com/tldr/2020/2/19/21144452/youtube-
you...](https://www.theverge.com/tldr/2020/2/19/21144452/youtube-youka-club-
karaoke-lyrics)

~~~
youka
You're right! I'll add an illustration GIF soon.

~~~
yunusabd
Cool! So you were originally running it as a webapp, and then decided to open
source it? Presumably due to legal reasons?

~~~
youka
exactly

------
peterburkimsher
Is there a way to manually provide the lyrics? I have a substantial collection
of songs in Chinese and Taiwanese, and it would be _really_ helpful to use
this to help me make lyrics videos for Pingtype. When I tried, I got this
error:

Ooops, some error occurred :( Error: name 'espeakng_supported_langs' is not
defined

I'll look into aeneas to see if that can give the API-level technical tools
that I need - thank you for explaining that part in the other comments!

~~~
yorwba
Note that it won't work for Taiwanese (I assume Hokkien) unless you add the
necessary support to espeak-ng.

If your lyrics are in Peh-oe-ji, you'll need to define how the romanization
maps to phonemes. You may be able to get some inspiration for that from the
definitions for Mandarin and Cantonese. Though I just looked at the
"phonology" section on Wikipedia
[https://en.wikipedia.org/wiki/Taiwanese_Hokkien#Phonology](https://en.wikipedia.org/wiki/Taiwanese_Hokkien#Phonology)
and the tone sandhi rules look a lot more complex than any other Sinitic
language I know.

If the lyrics use Chinese characters, there's the added difficulty of
collecting a pronunciation dictionary, which I'd probably do by scraping
[https://twblg.dict.edu.tw/holodict_new/index.html](https://twblg.dict.edu.tw/holodict_new/index.html)
,
[http://xiaoxue.iis.sinica.edu.tw/ccr/](http://xiaoxue.iis.sinica.edu.tw/ccr/)
and Wiktionary. (If you know any other sources for pronunciation data, I'm
interested.)

~~~
peterburkimsher
Yes, I know about romanisation! I wrote Pingtype, and extracted romanisation
dictionaries for Taiwanese Hokkien and Hakka by parsing Bible data.

[https://pingtype.github.io](https://pingtype.github.io)

Tones are difficult, so I encode those as colours. Adding code to espeak-ng
sounds very difficult. Most of the songs are in Mandarin though, so I'll try
those first.

------
redraw
oh, I had the same idea and started working on it here
[https://github.com/redraw/karaoke-machine](https://github.com/redraw/karaoke-machine)
days after Deezer's spleeter was released, but stopped while searching for a
way to sync the lyrics. thx! I'll try it out

~~~
youka
good luck! here's the relevant code [https://github.com/youkaclub/youka-
api/blob/master/youka/ali...](https://github.com/youkaclub/youka-
api/blob/master/youka/align.py)

------
fareesh
From what I understand, it is software for you to align lyrics to music
contained in a video, with tools to enable you to do so.

~~~
youka
Youka aligns the lyrics automatically; there's nothing left for you to do

~~~
fareesh
Thanks - that sounds great

