
Show HN: Aeneas – a Python audio/text aligner - alpe
https://github.com/readbeyond/aeneas
======
psobot
This is super cool. I'm trying to think of common practical applications for
this - would one use this to sync a script with a performance? Could this
remove a lot of the work required to manually subtitle movies, TV shows, and
YouTube videos?

~~~
alpe
Thank you.

Indeed, several users of aeneas have adopted it to produce SRT/TTML files, i.e.
captions, for videos, both online and offline --- and many of them start from
an existing transcript.

However, please note that there are limitations on the amount of "non speech"
that aeneas can tolerate: for example, long spurious portions of audio or sung
passages might affect the quality of the alignment.

For details on how aeneas works:
[https://github.com/readbeyond/aeneas/blob/master/wiki/HOWITW...](https://github.com/readbeyond/aeneas/blob/master/wiki/HOWITWORKS.md)

~~~
tetraodonpuffer
> there are limitations on the amount of "non speech" that aeneas can tolerate

couldn't you have as part of the input also a very simple map where users
could define times that should be ignored to help with that? Might also be
possible to look at the spectrum at any time to possibly identify areas of the
file to skip.

And speaking about spectrum, just wondering, are you doing any pre-processing
in terms of EQ (narrow band-pass on spoken frequencies), compression to not
deal with volume, etc. to help with this also?

~~~
alpe
> Might also be possible to look at the spectrum at any time to possibly
> identify areas of the file to skip.

I would say yes and no.

Currently you can add a switch that makes aeneas ignore the audio intervals
that are detected as "non speech" by the built-in Voice Activity Detector
(VAD), which is a very rough energy-based VAD. For sure this is a part that
can use some improvement.
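To illustrate the idea, a minimal energy-based VAD can be sketched in a few lines: mark a frame as speech when its short-term energy exceeds a fraction of the peak frame energy. (This is just an illustrative sketch of the general technique, not aeneas's actual implementation; the frame length and threshold are arbitrary choices.)

```python
import numpy as np

def energy_vad(samples, frame_len=400, threshold_ratio=0.1):
    """Label each frame as speech (True) or non-speech (False)
    based on short-term energy relative to the peak frame energy."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)
    threshold = threshold_ratio * energy.max()
    return energy > threshold

# Example: 1 s of silence followed by 1 s of a 440 Hz tone, at 16 kHz
rate = 16000
t = np.arange(rate) / rate
signal = np.concatenate([np.zeros(rate), 0.5 * np.sin(2 * np.pi * 440 * t)])
mask = energy_vad(signal)  # False for the silent frames, True for the tone
```

Real VADs add smoothing (hangover frames, minimum segment lengths) so that brief pauses inside a sentence are not cut out, which is exactly where a simple energy threshold tends to fail.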

However, AFAIK e.g. music/singing separation is a really difficult open
problem, with people in academia doing PhDs on it. So, I am not sure how far
one can push this line, while staying relatively fast on a regular machine.
(Which is one of the goals of aeneas.)

> And speaking about spectrum, just wondering, are you doing any pre-
> processing in terms of EQ (narrow band-pass on spoken frequencies),
> compression to not deal with volume, etc. to help with this also?

Besides converting the input audio file to mono 16 kHz 16 bit WAVE, I do not
perform any other operation on the audio data before passing it to the MFCC
extractor (which by default runs with "standard" settings, but the user can
change them).
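For reference, the target format (mono, 16 kHz, 16-bit PCM WAVE) can be produced with Python's standard `wave` module. This sketch writes a short test tone in that format and reads the parameters back; the filename and tone are arbitrary illustrations, not part of aeneas:

```python
import math
import struct
import wave

rate = 16000
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)     # mono
    w.setsampwidth(2)     # 16 bit = 2 bytes per sample
    w.setframerate(rate)  # 16 kHz
    for i in range(rate // 2):  # 0.5 s of a 440 Hz tone
        sample = int(20000 * math.sin(2 * math.pi * 440 * i / rate))
        w.writeframes(struct.pack("<h", sample))

with wave.open("tone.wav", "rb") as w:
    params = (w.getnchannels(), w.getsampwidth(), w.getframerate())
```

In practice the conversion from arbitrary input formats (MP3, stereo WAVE, etc.) is delegated to an external converter such as ffmpeg; the snippet above only demonstrates what the converted file looks like.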

Unfortunately, I have had no time to perform an exhaustive search of the
parameter space, nor to try other pre-processing techniques.

But for sure if you have means to "pre-clean" the audio file before feeding it
into aeneas, that is probably going to improve the quality of the output
alignment.

(I did play with amplitude normalization and it did not seem to improve the
results. The non-speech masking mentioned above seems beneficial if you do
word-level alignment.)

------
echelon
This is going to be beyond useful for me. I can extract far more labeled audio
samples for my Donald Trump text to speech engine [1]. Thanks for sharing
this!

[1] [http://jungle.horse](http://jungle.horse)

~~~
sargun
Have you looked at applying the techniques used in Google's Wavenet to your
corpus? In addition, any interest in releasing your corpus?

------
oulipo
This code is also useful
[https://github.com/lowerquality/gentle](https://github.com/lowerquality/gentle)

~~~
alpe
Yes, there are several other open source aligners out there, mostly from
academic research or derived from academic projects. On my personal GitHub
page I have a repo with an annotated list of forced aligners. (If I add a link
to it, the spam detector triggers?! Anyway, google "github forced-alignment-
tools" to find it.)

Gentle, which is based on Kaldi, performs well and has a handy setup script.

However, these aligners, which are based on automatic speech recognition
techniques, have pre-trained models only for English and maybe a handful of
other "popular" languages. Some allow you to train your own language model,
but very few users have the competence/resources to do that.

aeneas is built using an older approach, which has the advantage of requiring
weaker language models that are already available (in the form of TTS
voices): this is the reason why it "supports" so many languages. Of course,
the disadvantage is that aeneas works decently well at (sub)sentence
granularity, but worse than ASR-based aligners at word granularity or with
noisier audio files.

~~~
hftf
Do you know of any existing forced alignment tools that work well with live
audio (microphone) input? I would like to create a live stream in which the
words of a known text are displayed as they are being spoken into a
microphone.

~~~
alpe
For sure aeneas is not suitable, since it requires all the text and all the
audio in advance.

ASR-based tools, in theory, would allow such a mode of operation, but I have
not seen aligners that read from the mic buffer directly or offer a built-in
option/CLI for it.

Knowing the text in advance basically means that you can train your own
language (textual) model adapted to that exact text, and then use the
(standard) acoustic model for your language and the usual alignment procedure.
Hence, I am quite sure you can tweak e.g. CMU Sphinx or Kaldi to do it.
Perhaps gentle (which is based on Kaldi) is worth looking into.

~~~
hftf
I looked into gentle a few weeks ago and did notice that it seems to use an
online algorithm. It doesn’t have built-in support for live audio input
unfortunately, but it may be tweakable as you say (such as reimplementing it
to use audio streams that work with either static or real-time input). I guess
there’s no other way to find out than just try it myself.

------
TuringNYC
Thanks for creating this. I can imagine a not-so-distant future where
thousands of random video-watchers could annotate tiny parts of videos via
some free-form box, and aeneas could clean up and formalize this into an
official transcription. Seems like a minor feature, until one realizes how
much the public just lost due to missing transcriptions:
[https://www.washingtonpost.com/local/education/why-uc-berkel...](https://www.washingtonpost.com/local/education/why-uc-berkeley-is-restricting-access-to-thousands-of-online-lecture-videos/2017/03/15/074e382a-08c0-11e7-a15f-a58d4a988474_story.html)

~~~
alpe
Thank you.

Indeed, while aeneas was created for ebook-audiobook synchronization, several
of its current users are producing closed captions --- because, in most cases,
they already have a clean transcript (e.g., speakers provide transcripts to
the captioner) or they clean up an automated transcript, derived from an
automatic speech recognition system.

------
afarrell
This is really cool!

Would you like me to make a conda package for this? I can do so for Linux and
OSX so that someone who uses Python for data science can do `conda install
aeneas` and it will install this and its dependencies into a virtualenv.

I'd do it on windows too, but I don't know of an easy way to get my hands on a
windows box. If anyone knows of a service that can give me 30 minutes of CLI
access to a windows box, I'd be grateful.

~~~
detaro
AppVeyor has free Windows CI for open-source projects (which probably would be
a good idea to set up for later updates), and they explicitly mention that you
can install remote access tools in their build VMs to work on/debug the build:
[https://www.appveyor.com/docs/how-to/rdp-to-build-worker/](https://www.appveyor.com/docs/how-to/rdp-to-build-worker/)

------
rrherr
"Audio is assumed to be spoken: not suitable for song captioning"

Can anyone recommend alternative approaches for music lyrics alignment?

------
braindead_in
What's the accuracy level of alignment?

~~~
alpe
aeneas is not based on ASR (i.e., it does not try to "recognize" words and
align them with the input text), but on the "older" MFCC + DTW approach.
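For illustration, the DTW step can be sketched as the classic dynamic-programming alignment over a cost matrix, e.g. pairwise distances between the MFCC frames of the audio and those of a TTS rendering of the text. This is a textbook sketch under those assumptions, not aeneas's actual (optimized) implementation:

```python
import numpy as np

def dtw_path(cost):
    """Return the minimum-cost warping path through a cost matrix
    using the classic O(n*m) dynamic-programming recurrence."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # match
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
            )
    # Backtrack from the corner to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Tiny example: align two similar 1-D sequences
a = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([0.0, 1.0, 1.1, 2.0, 3.0])
cost = np.abs(a[:, None] - b[None, :])
path = dtw_path(cost)  # monotonic path from (0, 0) to (3, 4)
```

The alignment quality then depends on how well the two MFCC sequences resemble each other, which is why there is no per-word "confidence" in the ASR sense.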

Hence, it is difficult to give you a precise answer, e.g. in terms of word-
error-rate or similar metrics.

For the task aeneas has been designed for --- aligning an ebook and the
corresponding audiobook --- and for similar tasks (e.g., captioning videos of
lectures or spoken-only content), it generally produces an alignment that is
indistinguishable from a manually-produced one.

If you want to see some examples, read+listen to one of these audio-ebooks;
the alignment was produced by aeneas:
[https://www.readbeyond.it/ebooks.html](https://www.readbeyond.it/ebooks.html)

But of course, if you want to align at a finer (word) level or with noisier/
non-matching audio, the quality of the alignment can deteriorate.

~~~
braindead_in
Thanks for the explanation. Will it work if there are gaps in the transcript?
E.g., a clean verbatim transcript where the ah's and uhm's are left out.

~~~
alpe
Several users of aeneas interested in producing caption files for videos told
me that it does. And considering how DTW works, it is plausible.

Unfortunately, I have not had the time to set up a suitable corpus and
perform a rigorous evaluation, so I cannot comfortably give you a definitive
"yes".

Perhaps the best way to see if aeneas works for your use case is simply to
try it out.

If you do not want to install anything on your machine, you can use the aeneas
Web app: [https://aeneasweb.org](https://aeneasweb.org) --- basically you
submit an audio file (or a YouTube URL) and a text file, and get a
SRT/TTML/etc. file emailed back.

~~~
braindead_in
I definitely plan to try it soon.

------
aeneasr
Reading my name (spelled correctly, kudos for that) on Hacker News feels
really weird.

~~~
alpe
In Italian high schools ("Licei") we take five years of Latin (and also
ancient Greek if you choose the classical study path)... nice to meet you!

