
Show HN: Automatically synchronize subtitles with video - smacke
https://github.com/smacke/subsync
======
anyfoo
From the README:

"[...] the naive O(n^2) strategy for scoring all alignments is unacceptable.
Instead, we use the fact that "scoring all alignments" is a convolution
operation and can be implemented with the Fast Fourier Transform (FFT),
bringing the complexity down to O(n log n)."

I absolutely love it when something that, at first glance, has no business
being solved in the frequency domain, gets solved in the frequency domain.

~~~
hosay123
I thought I understood FFT, but totally not getting the relationship to this
problem. Someone ELI5 please? :)

~~~
aabeshou
ELI5: You can turn this problem into finding the best "convolution index", and
fourier transforms make computing convolutions cheaper.

ELIUndergrad: (note that \* means multiplication; there doesn't seem to be a
way to escape an asterisk)

Let's start by seeing how this is a convolution. We have the videoSpeech
sequence and the subtitle sequence - each is a vector, indexed by time, of
0's and 1's indicating whether there is speech at that time. We can imagine
padding the sequences out on either side with 0's, and consider the alignment
task as shifting the subtitle sequence left and right in time until we get the
best alignment with the 1's in the videoSpeech sequence. We can express the
goodness of an alignment as the number of matching 1's, aka the sum over all
times t of videoSpeech(t) \* subtitle(t). This is the definition of a
convolution: the convolution of two sequences gives a new sequence where the
value at index i is the sum above with one of the sequences shifted by i.
Mathematically, conv(videoSpeech, subtitle)(i) = sum( videoSpeech(t) \*
subtitle(t-i) ). So we can rephrase the problem as: find the index i which
maximizes the value of the convolution sequence.

The discrete fourier transform is a function that takes a sequence and gives
another sequence. It's relevant here because it "turns convolution into
multiplication": fourier(videoSpeech)(i) \* fourier(subtitle)(i) =
fourier(conv(videoSpeech, subtitle))(i).

So finally, to solve the problem, we take the pointwise product sequence S =
fourier(videoSpeech) \* fourier(subtitle), apply the inverse fourier transform
to get invFourier(S), and maximize invFourier(S)(i) over i.
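Concretely, the whole recipe fits in a few lines of numpy. This is just a
sketch of the idea above, not subsync's actual code, and the name
`best_offset` is mine:

```python
import numpy as np

def best_offset(video_speech, subtitle):
    """Score every shift of `subtitle` against `video_speech` at once.

    Both inputs are 0/1 arrays indexed by time. Computing the
    cross-correlation via FFT takes O(n log n) instead of the naive
    O(n^2) loop over all shifts.
    """
    # Zero-pad both sequences to a common length so shifts don't wrap.
    n = len(video_speech) + len(subtitle) - 1
    fv = np.fft.fft(video_speech, n)
    fs = np.fft.fft(subtitle, n)
    # Multiplying one spectrum by the conjugate of the other gives the
    # spectrum of the cross-correlation (i.e. the convolution with a
    # time-reversed copy, exactly as described above).
    scores = np.fft.ifft(fv * np.conj(fs)).real
    shift = int(np.argmax(scores))
    # Indices at the tail of the circular result encode negative shifts.
    if shift >= len(video_speech):
        shift -= n
    return shift
```

A positive result means the subtitle events happen that many time bins too
early and should be delayed; a negative one means they should be moved
earlier.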

~~~
yesenadam
Why not use good ol' 'x' for multiplication? :-)

~~~
computerfriend
'x' is a letter, not the times symbol.

~~~
janaagaard
But if you have to choose between '\\*' and 'x', then 'x' might well be the
best choice, no? :-)

~~~
rjeli
If they had used ‘x’ instead, I would be scouring the comment for where the
variable had been introduced.

~~~
yesenadam
Ok, use 'x' and write "(note that 'x' means multiplication)" instead of "(note
that \* means multiplication; there doesn't seem to be a way to escape an
asterisk)". More legible, shorter, and the symbol is already used for the
purpose..

------
satori99
This is cool.

I am constantly impressed by ffmpeg and its capabilities.

Last week I was able to make it act as a proxy for live streaming video that
automatically reduced the volume of ad breaks by 50%, by listening for SCTE-35
cue packets in the stream and adjusting a volume filter accordingly.

My housemate has a reality TV addiction, and the ads were getting on my
nerves. I just intended to see if it was possible, but I had a POC running in
about half an hour.

~~~
zeotroph
A neat subtitle feature I found out mpv has, but no other video player: Hide
those parts of the subtitle which are mainly for those with impaired hearing
like '[loud noise]' etc, but keep the rest:

`mpv --sub-filter-sdh ...`

That logic could probably be extracted into a separate subtitle filtering
tool.

~~~
JorgeGT
If I remember correctly SubtitleEdit
([https://github.com/SubtitleEdit/subtitleedit](https://github.com/SubtitleEdit/subtitleedit))
had such a tool for removing the hearing-impaired bits.

~~~
gsich
Still has. I use it for every subtitle; it gets rid of most common errors too.

------
kumarharsh
Completely tangential question to this awesome discussion: who even writes
subtitles? It feels like a thankless job to write subtitles for rips of movies
and TV shows old and new. I get that maybe they source originals from, say,
Netflix, but .srt files existed long before Netflix, and for lots of movies
not on Netflix. Writing all those tags for the hearing impaired seems like a
lot of work, which anyone other than the original movie production team would
be loath to indulge in. Yet I see a lot of subtitles which don't seem
'official'. Then there are subtitles which read like a loose translation of
the audio. Both are in English, so I don't see why they can't just transcribe
what's being said. Instead, for dialogue such as "The greatest trick the Devil
ever pulled was convincing the world he didn't exist.", the subtitle is
written as "The devil tricked the world into thinking he didn't exist."

~~~
oasisbob
I'm equally amazed by the effort that goes into foreign dubbing. While staying
in Chennai, India and channel surfing, I found Sky TV had dozens of US
channels (TLC, Discovery Channel, etc.) which not only had English/Hindi/Tamil
subtitles - but foreign-language dubs too!

Not just movies, but full overdubbing for 20-episode seasons of disposable
reality TV shows. Think "Storage Wars" etc. I spent over an hour channel
surfing and flipping languages.

I would /love/ to know how they take the original subtitle tracks and
translate/dub them efficiently.

~~~
MagnumOpus
Dubbing is a major industry in many non-English-speaking countries (Germany,
France, Spain, Italy etc), employing more than half of voice actors -- so it's
generally not some DIY cottage industry, but professional audio engineers and
voice actors. They don't just use subtitles, they use the original video/audio
to lip-sync as well as possible.

Wikipedia has some more detail:
[https://en.wikipedia.org/wiki/Dubbing_(filmmaking)#Methods](https://en.wikipedia.org/wiki/Dubbing_\(filmmaking\)#Methods)

------
dearrifling
Awesome! According to the description of the internals, this only works with
offsets and doesn't adjust the subtitle's playback speed. Playback-speed
variations aren't that broad, though: they're usually caused by playing a
subtitle written for one FPS against a video at a different FPS. It might be
worth trying the common FPS ratios all at once and keeping the best match.
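Something like this sketch, say (numpy; the names are mine, and `score_fn`
stands in for whatever alignment scorer you already have, e.g. the peak of
the FFT cross-correlation):

```python
import numpy as np

# Common frame rates; a subtitle authored against one rate and played
# against another is stretched by the ratio of the two.
FRAMERATES = [23.976, 24.0, 25.0, 29.97, 30.0]

def best_framerate_ratio(video_speech, subtitle, score_fn):
    """Rescale `subtitle` by every plausible fps ratio and keep the
    ratio whose best alignment score is highest.

    `score_fn(a, b)` should return the best alignment score between
    two 0/1 speech signals, e.g. the peak of their cross-correlation.
    """
    subtitle = np.asarray(subtitle, dtype=float)
    best_ratio, best_score = None, -np.inf
    for src in FRAMERATES:
        for dst in FRAMERATES:
            ratio = dst / src
            # Stretch the 0/1 speech signal by resampling its indices.
            new_len = int(round(len(subtitle) * ratio))
            idx = np.minimum((np.arange(new_len) / ratio).astype(int),
                             len(subtitle) - 1)
            stretched = subtitle[idx]
            score = score_fn(video_speech, stretched)
            if score > best_score:
                best_ratio, best_score = ratio, score
    return best_ratio
```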

~~~
aasasd
In the case where the resulting subtitles turn out to be shorter than the
input file, subtitle-editing software like Aegisub can automatically trim
ending times so they don't overlap with starting times.

This could probably be similarly done in the script itself in the future.

If the result gets longer and the times for each line are a bit too short―I
guess you'll have to read faster, har har.

---

Actually, iirc Aegisub can also do automatic stretching for different FPS,
since that's a rather easy case.

However, it's not enough when the subs are from a different media release or
an edit―which made me long for exactly the kind of solution in the post.

------
draz
This is an area I am actively looking into at the moment for my employer. It's
an interesting project, and I am excited to see work in this domain. Note that
this will not pass QC at any broadcaster/OTT. Our tolerance is typically
within 1-12 frames (and a bunch of other requirements around shot changes,
etc.)

~~~
barrystaes
> Our tolerance is typically within 1-12 frames (and a bunch of other
> requirements around shot changes, etc.)

Tell me more!

~~~
draz
Netflix did a good job outlining their requirements (which mirror industry
requirements): [https://partnerhelp.netflixstudios.com/hc/en-
us/articles/215...](https://partnerhelp.netflixstudios.com/hc/en-
us/articles/215758617-Timed-Text-Style-Guide-General-Requirements) Note that I
do not work for Netflix but rather for another multinational mass-media
conglomerate.

------
Theodores
Not many people do interesting things with subtitles, which is a pity, as they
are such a creative medium.

Recently I rolled my own code to play WebVTT against an audio track (think
video without the pictures). I had assumed there would be off-the-shelf
libraries that would do this for me. Oh how wrong I was!

I took it a bit further than white text in a black box at the foot of the
screen. Not having pictures kind of made it that way. So I decided I needed
cartoon-grade speech bubbles, with the speech bubble coming from the left for
one voice and from the other side of the screen for the other voice, again
finding myself in the realm of no ready-made examples to do this. The speech
bubbles had to scale to fit the content while being suitably rounded, a la
cartoon style. I found an SVG solution to my problem.

The WebVTT format and variants have all kinds of goodies in them to position
speech and do things in time, as per karaoke.

I think one reason I found myself in the world of rare code was that we assume
anything to do with accessibility is for disabled people and therefore has to
be boring and done with no creativity whatsoever.

------
symstym
Nice! Have you seen this similar project? I'm not sure if there are any ideas
to be borrowed, but it sounds like it achieved excellent accuracy:
[https://github.com/AlbertoSabater/subtitle-
synchronization](https://github.com/AlbertoSabater/subtitle-synchronization)

~~~
smacke
Interesting! I was not aware of that project before. It looks like it could be
worthwhile to incorporate their neural net-based VAD, if it's not too slow. At
first glance, it looks like the main difference is the postprocessing step
-- they use some heuristics to avoid trying all possible alignments, while
subsync uses FFT as its secret sauce to get away with trying out all of them.
:)

I should include in the readme that for >1 hour movies, it usually finishes
the synchronization in ~20 seconds, which compares favorably to the project
linked in your comment (13 minutes unoptimized, 2 minutes optimized).

------
loeg
> I have yet to find a case where the automatic synchronization has been off
> by more than ~1 second

That's actually quite a bad synchronization error for A-V; maybe it's less
noticeable for subtitles. I wonder if there is a good way to reduce that error
down to milliseconds.

~~~
gsich
100 ms A/V desync is already quite noticeable; with some training one might
spot 50 ms too.

~~~
loeg
I agree 100ms is bad :-). I think even 10s of ms sync error is readily jarring
for most people, at least for A-V. Subtitles may be more forgiving.

------
kozikow
An even harder and cooler problem to solve would be finding extra scenes.
Sometimes the subtitles or the movie have a slightly different "cut" (e.g. a
director's cut) with some extra scenes added. A similar algorithmic approach
could be used to solve this problem as well.

~~~
smacke
Totally agreed. I'm not sure how to do the alignment scoring step in this
case, perhaps some kind of DTW / FFT hybrid... definitely worth further
consideration.

~~~
rcthompson
This sounds very similar to the problem of whole-genome alignment in
bioinformatics. You have, say, the human and chimp genome sequences, and you
want to align them to find out which portions of each genome correspond to
which portions in the other, allowing for the possibility of insertions or
deletions of DNA sequences in each one.

Unfortunately, I'm not too familiar with the current state of the art for
whole-genome alignment, so I don't know for sure which algorithms are
considered best, but MUMmer4[1] seems like a good place to start. These
algorithms are designed to handle sequences up to billions of letters long
(e.g. the human genome, which is about 3 billion letters).

[1]:
[https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005944)

------
lern_too_spel
This problem is called forced alignment. YouTube's subtitler does it for you.
If you don't want to upload your video and subtitles, here are some other
tools that will work: [https://github.com/pettarin/forced-alignment-
tools/blob/mast...](https://github.com/pettarin/forced-alignment-
tools/blob/master/README.md)

~~~
smacke
Cool! It looks to me like most of those approaches are a lot more
sophisticated and work even when the text doesn't have any time annotations.
At first glance, it looks like the major downside is that these approaches
are geared toward particular target languages. Subsync's advantage is its
simplicity: by limiting the acoustic analysis part to simple voice detection,
it can perform alignment between subtitles and audio/video in different
languages, which is great for when you want to watch a TV episode in another
language, but your video includes the previous episode's recap and your
subtitles do not (or vice-versa). I'll definitely be taking a deeper look at
these to see if there are any ideas or functionality that can be further
incorporated.

------
psychometry
You should look into DNA and RNA sequence alignment algos for possible
improvements on your FFT. It's a huge and very well-explored space.

~~~
zimpenfish
Since I already had it open (for unrelated reasons), this might be relevant.

[https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorit...](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm)

> The Smith–Waterman algorithm performs local sequence alignment; that is, for
> determining similar regions between two strings of nucleic acid sequences or
> protein sequences.

~~~
amelius
But also:

> The Smith–Waterman algorithm is fairly demanding of time: To align two
> sequences of lengths m and n, O(mn) time is required.

~~~
zimpenfish
Thankfully I'm only working with short sequences but yeah, I guess that might
be a problem.

------
lainga
Really neat! Because it doesn't rely on the video image data this would work
for podcasts with subtitles (if they exist), right?

~~~
smacke
Yes! It should work for this too, as long as ffmpeg recognizes the audio
format :)

------
robert-boehnke
> Instead, we use the fact that "scoring all alignments" is a convolution
> operation and can be implemented with the Fast Fourier Transform (FFT),
> bringing the complexity down to O(n log n).

Not sure I understand this right - is this basically treating both binary
strings as square waves, converting them to the frequency domain and
determining the offset as a pitch shift between the two spectrograms?

~~~
smacke
Precisely. If I could more easily get at the raw classifier output of
webrtcvad, it should be possible to be even smarter (we could have square
waves with any amplitude between -1 and +1, not just -1 or +1, which would
take the classifier's uncertainty into account).

EDIT: err, I'm actually not sure about the pitch shift part, that's a bit of
vocabulary I'm not familiar with. If you've seen the fast polynomial
multiplication algo from CLRS, it's basically that. E.g. if we have strings
1101 and 0101, we can find the best alignment by looking at the exponent of
the largest coefficient after multiplying

polynomial(1101)*polynomial(reverse(0101))

where polynomial(1101) = x^3 + x^2 - x + 1

and polynomial(reverse(0101)) = polynomial(1010) = x^3 - x^2 + x - 1
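In numpy, that trick looks something like the following sketch (not subsync's
actual code; `best_alignment` is my name for it, and `np.polymul` is the
schoolbook O(n^2) product, but an FFT-based multiply yields the same
coefficients in O(n log n)):

```python
import numpy as np

def best_alignment(a, b):
    """Align bit-string `b` against bit-string `a`."""
    # Map each bit to +1/-1 as in the example above, so that
    # mismatches actively lower an alignment's score.
    pa = [1 if c == '1' else -1 for c in a]
    pb = [1 if c == '1' else -1 for c in reversed(b)]
    # Multiplying poly(a) by poly(reverse(b)) makes each coefficient
    # of the product the score of one alignment of b against a.
    coeffs = np.polymul(pa, pb)  # highest-degree coefficient first
    # Coefficient index k corresponds to shifting b right by
    # k - (len(b) - 1) positions.
    return int(np.argmax(coeffs)) - (len(b) - 1)
```

For the 1101/0101 example above, the best score turns out to be at shift 0.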

------
kumarharsh
This blows my mind (like everyone else's here)!

While playing digital copies of somewhat older movies (before BluRay rips came
into vogue), a problem which surfaces frequently is that the frame rates of
the video and the subtitle track are slightly mismatched - say video at 24fps,
subtitles at 25.6fps(?). This makes the subtitles drift away from the video
and require manual intervention every few minutes. If I can't hear the audio
properly for some reason, I just get fed up and don't watch the movie at all.
Add to that the existence of different 'cuts' of films, which add another
dimension to the subtitle problem.

How would one even go about solving this problem?

~~~
rocqua
As others have said, matching cuts is very similar to genome matching. In
genome matching, genes are often slightly offset, so there is a need to
correct for that.

Matching speed might be possible by first matching runs of one minute, and
then trying to scale with that minute as a fixed point to see which scale
works out. Some heuristics to take common frame-rate ratios into account might
help too.

An even crazier idea:

Use the fact that scaling becomes shifting after a logarithm, and try to match
after resampling onto a log-time axis, i.e. setting g(x) = f(exp(x)).
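A rough sketch of that log trick (numpy; untested against real subtitles, and
the function name is mine):

```python
import numpy as np

def to_log_time(signal, n_bins=None):
    """Resample `signal` (indexed by linear time) onto a log-spaced
    time axis. A playback-speed change t -> c*t then becomes a plain
    shift of log(c) in log-time, which the usual FFT cross-correlation
    can find as an offset."""
    n = len(signal)
    if n_bins is None:
        n_bins = n
    # Sample times spaced evenly in log-space (skipping t=0, where the
    # log blows up); interpolate the original signal at those times.
    t = np.exp(np.linspace(0.0, np.log(n - 1), n_bins))
    return np.interp(t, np.arange(n), signal)
```

E.g. a speech pulse spanning times 100-200 and the same pulse stretched to
200-400 come out as (nearly) identical bumps in log-time, offset by about
log(2) worth of bins.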

------
joshstrange
This looks pretty cool. Right now I drop down to the command line to shift
subs in either direction, but if they're off by too much or commercials/breaks
don't line up, it's game over. I'll have to drop this onto my server. I'm
tempted to run it across everything, but I'll probably settle for running it
manually when something doesn't line up in Plex.

Edit: Hmm, my first test moved the subtitles but they ended up 3-5 seconds
after the audio instead of the original 2-5 before... I'll have to keep
testing when I come across bad subs.

~~~
smacke
My guess is that your subtitles don't have a "fixed drift". Unfortunately,
this project only works for constant drift at the moment -- dynamic subtitle
drift is a harder problem but absolutely worth thinking more about for future
work.

------
acqq
"Although it can usually work if all you have is the video file, it will be
faster (and potentially more accurate) if you have a correctly synchronized
"reference" srt file, in which case you can do the following:

subsync reference.srt -i unsynchronized.srt -o synchronized.srt"

I really don't understand what the reference.srt is supposed to be here.

Edit: If I have a "reference" which is synchronized, why do I have to do
anything at all? Why would I need to synchronize anything then?

Edit2: If the French titles are a translation of the English ones, why
wouldn't copying the timestamps from one to the other be enough? Or are you
saying that the program can match subtitles which aren't based on the English
ones and have a different number of sentences, etc.? But how can it then know
which part of the translation belongs to what? What's short in one language
isn't in another, etc.

~~~
Outpox
I believe it's a typo and should read "Although it can usually work if all you
have is the subtitles file [...]". This means that you would use a reference
file that would allow you to synchronize your subtitles.

~~~
jmkb
I could imagine having a file with correctly-synced subs in the wrong
language. Maybe that would be a good reference file.

(I think you're correct subbing "subtitles file" for "video file".)

------
bbx
Thanks for this! I just happen to need it. I have the first season of my
favourite US series, but dubbed in Spanish, because I want to train my ear.
Unfortunately the subtitle files slowly drift apart from the audio by 10
seconds during each episode.

I used to have a script to “stretch” subtitles manually, but it’s quite
fiddly:
[https://gist.github.com/jgthms/7dfc20db3478a938069a0191c4e30...](https://gist.github.com/jgthms/7dfc20db3478a938069a0191c4e30843)

~~~
smacke
You are most welcome, although unfortunately it looks like this project
doesn't support your use case yet. Right now it can only correct "constant
drift", although I'll definitely be thinking more about other scenarios based
on the reception.

That said, it sounds like for your particular case, if you use VLC, you might
be able to adjust the subtitle FPS so that you don't have to continuously
manually resync. (I haven't run into this so just going off what others have
mentioned.)

------
hobbes78
I see this will only shift the timings by some offset. But many times that
isn't enough, e.g. if the video source is NTSC and the subtitles were meant
for PAL.

In this scenario I use Subtitle Workshop, which allows me to specify two (or
more) pairs of times, and all subtitles are synchronized accordingly.

Optimally the two pairs of times would be near the beginning and the end, but
to prevent big plot spoilers, I usually use the beginning and 2/3 of the
length.
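That two-point fix is just fitting a line through the two anchor pairs; a
minimal Python sketch (names mine):

```python
def two_point_sync(t1_sub, t1_video, t2_sub, t2_video):
    """Given two subtitle times and the video times they should map to
    (e.g. one pair near the start, one near the end), return a function
    that rescales any subtitle timestamp. This linear fix handles both
    a constant offset and an NTSC/PAL speed mismatch."""
    scale = (t2_video - t1_video) / (t2_sub - t1_sub)
    offset = t1_video - scale * t1_sub
    return lambda t: scale * t + offset
```

Every other timestamp in the file then goes through the returned function.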

------
kojackst
Sorry for the dumb question, but what exactly does this do?

`subsync reference.srt -i unsynchronized.srt -o synchronized.srt`

I mean, I already have a synchronized srt file, so what would I be syncing
here?

~~~
smacke
Great question. The main use case I can think of is when reference.srt and
unsynchronized.srt are in different languages, and you want to eventually
merge them into a single dual-language subtitle file.

EDIT: Oh, I should also mention that you don't need a reference.srt -- it can
look at the video directly and use that as a reference.

~~~
kojackst
Ohh, now I get you. I believe the README should be improved in this part then.

It reads:

====

Although it can usually work if all you have is the video file, it will be
faster (and potentially more accurate) if you have a correctly synchronized
"reference" srt file, in which case you can do the following:

subsync reference.srt -i unsynchronized.srt -o synchronized.srt

====

I believe you should explain that if you have a reference file in _another
language_ which is correctly synchronized with that video, you can use that
file instead of the video, as its timestamps will serve as references when
synchronizing the target .srt file.

Now this has raised a question, what if the reference file has a different
block count? For example, in some languages (like Chinese or Japanese) we can
say a lot with fewer characters than in English. So in Chinese a text will
stay on the screen for a long time, whereas in English the corresponding text
would be split into two or more blocks. Wouldn't that make synchronization
less accurate?

BTW that's a cool project. Thanks for sharing!

~~~
smacke
Really appreciate the feedback -- will definitely work on improving the
clarity in the README. Glad you like the project!

------
s0adex
It seems people are going back to P2P for movies. Thanks to the Netflix era,
I'd forgotten the ordeal of unsynchronized srt files.

~~~
Krasnol
Probably because not everything you want to see is on Netflix anymore and it's
getting worse.

------
hantusk
I've had great results with the Descript app which can both transcribe and can
export a synchronized subtitle file.
[https://www.descript.com/](https://www.descript.com/)

Perhaps you can import an srt file and have it synced? Or perhaps you can
export the srt text and let Descript sync it and re-create an .srt file.

------
boolemancer
I recently ran across a similar tool, called aligner, that aligns subtitles
against a reference subtitle file:

[https://github.com/kaegi/aligner](https://github.com/kaegi/aligner)

It seemed to work pretty well in my testing. Maybe it could be useful to look
at?

~~~
smacke
This project came up when I was researching this idea. It looks especially
useful because its alignment algorithm is less restrictive (it can introduce
splits/breaks), so I plan on taking a closer look for sure. That said, it does
not do any speech detection and requires a reference subtitle file, so I'm not
sure how well it would work for aligning against the speech-induced encoding,
which will have a lot more noise. In particular, the linked project works by
aligning time intervals [a,b], while subsync is "dumber" in that it
discretizes things into binary strings, and an interval constructed by merging
consecutive 1's and 0's in a speech detection-induced binary string will
probably be a lot harder to match to the subtitle intervals.

Either way, I also thought this was a cool project and will be thinking more
about whether any ideas can be borrowed.

------
chanux
I remember the annoyance. I created a script back then to eyeball, guess, and
fix. It was good enough.
[https://gist.github.com/chanux/2042676](https://gist.github.com/chanux/2042676)

------
aorth
Super cool. This and the Python-based subtitle downloader subliminal make
working with subtitles _really_ easy. I had previously relied on incremental
fudging of times using trial and error with www.sync-subtitles.com.

------
kowdermeister
> I have yet to find a case where the automatic synchronization has been off
> by more than ~1 second.

1 second is up to 1000 ms of error, which still leaves a lot of room for
manual tweaking in VLC. I can notice 50 ms differences in sync.

------
sidcool
This is quite amazing. The code quality overall looks pretty good too.
Pleasantly surprised to see it in Python. Was dreading complex C++ code.
Thanks for sharing. The code will aid my learning.

------
hendry
What about syncing up audio?

I've yet to find a CLI solution for syncing up & replacing external audio on
my camera's MP4s.

My current solution is to boot up FCPX and do a sync there, which rips away
minutes of my life.

~~~
gsich
I use Audacity for this.

I load both files and look for a peak or the end of silence in both. Then I
remux with the measured difference using MKVToolNix. I do this at 2-3 spots.

Loading the files is usually the longest wait.

------
apexalpha
Please tell me someone that works at Plex is reading this thread. I know it
might seem a bit distant for Americans, but this is HUGE!

~~~
Guillaume86
A Kodi plugin might be a more realistic target; subtitle handling in Kodi is
pretty great in my experience, especially compared to Plex (download while
playing, source addons, offset adjustment).

------
IshKebab
Surely you can do much better than this using a speech recogniser?

~~~
smacke
By speech recognizer, are you referring to a method that would perform speech-
to-text transcription? If so, you could probably do better when your video and
subtitles are in the same language, but you would lose the ability to be
language-agnostic (unless you manage to solve machine translation well enough
:)

Interestingly enough, I haven't actually found any cases where the
synchronization doesn't work (assuming the only problem with the subtitles
file is a time offset), so it actually looks like simple voice detection might
be good enough for the target use-case of longer TV episodes and movies,
although further evaluation is necessary.

~~~
Axsuul
Frequently there will be subtitles that describe sounds and noises for the
deaf (e.g. "door closes"). How is this handled?

~~~
smacke
The approach taken here is to not worry about those. :)

These will introduce some noise, but the synchronization algorithm seems to be
robust enough that noise doesn't matter (whether it's from the voice activity
detector or from non speech-related subtitles).

------
ummonk
I'm still waiting for the day there is an automatic synchronizer for the video
and audio tracks in VLC. It's so frustrating trying to find the right offset
when they get out of sync.

