Yeah, it is interesting, and it could also be a big boost to plain old speech-to-text in cases where you have video, if the errors were non-correlated (which I wasn't able to determine from skimming the README).
edit: now I see it is being used to match audio samples, not to generate text, so it wouldn't provide an independent signal beyond the audio in this arrangement, other than e.g. the speaker attribution they mentioned.