Show HN: Automatically synchronize subtitles with video (github.com)
583 points by smacke 57 days ago | 124 comments

From the README:

"[...] the naive O(n^2) strategy for scoring all alignments is unacceptable. Instead, we use the fact that "scoring all alignments" is a convolution operation and can be implemented with the Fast Fourier Transform (FFT), bringing the complexity down to O(n log n)."

I absolutely love it when something that, at very first glance, has no business in being solved in the frequency domain, gets solved in the frequency domain.

I thought I understood FFT, but totally not getting the relationship to this problem. Someone ELI5 please? :)

ELI5: You can turn this problem into finding the best "convolution index", and fourier transforms make computing convolutions cheaper.

ELIUndergrad: (note that \* means multiplication, there doesn't seem to be a way to escape an asterisk)

Let's start by seeing how this is a convolution. We have the videoSpeech sequence, and the subtitle sequence - each is a vector, indexed by time, of 0's and 1's indicating whether there is speech in that time. We can imagine padding the sequences out on either side with 0's, and consider the alignment task as shifting the subtitle sequence left and right in time until we get the best alignment with the 1's in the videoSpeech sequence. We can express the goodness of alignment as the number of matching 1's, aka the sum over all times t of videoSpeech(t) \* subtitle(t). This is the definition of a convolution: the convolution of two sequences gives a new sequence where the value at index i is this sum above where one of the sequences is shifted by i. Mathematically, conv(videoSpeech, subtitle)(i) = sum( videoSpeech(t) \* subtitle(t-i)). (Strictly, this shifted form is a cross-correlation, i.e. a convolution with one of the sequences reversed in time, but the same FFT trick applies.) So we can rephrase this problem as: find the index i which maximizes the value of the convolution sequence.

The discrete fourier transform is a function that takes a sequence and gives another sequence. It's relevant here because it "turns convolution into multiplication": fourier(videoSpeech)(i) \* fourier(subtitle)(i) = fourier(conv(videoSpeech, subtitle))(i).

So finally to solve the problem, we get the pointwise product sequence S = fourier(videoSpeech) \* fourier(subtitle), do the inverse fourier transform on it invFourier(S), and maximize invFourier(S)(i) over i.
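A minimal NumPy sketch of the steps above, in case it helps. The function name `best_offset` and the toy sequences are made up for illustration, and this is not taken from the project's code. It multiplies by the complex conjugate of one transform so that the result matches the shifted sum subtitle(t-i) written above, and it zero-pads so the FFT's wrap-around does not mix up the two ends.

```python
import numpy as np

def best_offset(video_speech, subtitle):
    """Return the shift (in samples) of `subtitle` that maximizes overlap with `video_speech`."""
    a = np.asarray(video_speech, dtype=float)
    b = np.asarray(subtitle, dtype=float)
    # Zero-pad so the circular result from the FFT equals the linear one.
    n = len(a) + len(b) - 1
    fa = np.fft.rfft(a, n)
    fb = np.fft.rfft(b, n)
    # Conjugating one transform scores shifted overlaps (correlation);
    # a plain product would flip one sequence in time.
    scores = np.fft.irfft(fa * np.conj(fb), n)
    i = int(np.argmax(scores))
    # Indices at or past len(a) wrap around and represent negative shifts.
    return i if i < len(a) else i - n

video = [0, 0, 0, 1, 1, 0, 1, 0]
subs = [1, 1, 0, 1, 0, 0, 0, 0]
shift = best_offset(video, subs)  # subtitles need to move 3 samples later
```

A positive result means the subtitle sequence should be delayed by that many samples, a negative one means it should be moved earlier.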

> ELI5: You can turn this problem into finding the best "convolution index", and fourier transforms make computing convolutions cheaper.

I tried this sentence on my 5-year-old and got a blank stare. He then proceeded to tell me a story about how Darth Vader is so scary and cool and that he's actually Luke's father. YMMV.

I guess GP actually meant ELWOT - Explain Like We're On Twitter.

lol, that makes sense. here's another try:

ELI5 v2: you can look at the sequences a different way, like tilting your head to look at them from a different angle. and from that angle, we can more easily try every possible way of putting them together, so that we can choose the best way.

I've been working on a similar algorithm.

I'm writing a program which takes multiple channels of near-periodic audio wav files, and outputs a phase-stable oscilloscope video. This is done by FFT-correlating (a buffer of recent oscilloscope plots) with (the audio signal) (with quite a bit of added complexity for better results). Incidentally I'm also using Python/Numpy/ffmpeg.

It's most useful for "complex" chiptune like FM/SNES/tracker/MIDI music, which are easy to split into monophonic single-note channels. https://github.com/jimbo1qaz/corrscope Should I submit this separately (Show HN)?

Why not use good ol 'x' for multiplication? :-)

I keep AltGr+8 mapped to × for just such an emergency.

The Compose key works great for this (I use WinCompose on Windows). Most of the time I can guess the key combination. I think × is Compose, x, x.

'x' is a letter, not the times symbol.

But if you have to choose between '\*' and 'x', then 'x' might well be the best choice, no? :-)

If they had used ‘x’ instead, I would be scouring the comment for where the variable had been introduced.

Ok, use 'x' and write "(note that 'x' means multiplication)" instead of "(note that \* means multiplication, there doesn't seem to be a way to escape an asterisk)". More legible and shorter and 'symbol is already used for the purpose' and..

It is _a_ multiplication symbol. Why not use period ".".

Because there are a lot of those in the vicinity and it's very small?

The problem being solved essentially is: you have two binary strings, and you want to offset one of them so that they match up the best. For each offset, you're taking a dot product of one sequence with the offset version of the other. This is the same as computing the convolution of the two sequences together (https://en.wikipedia.org/wiki/Convolution). Computing this naively would be O(n^2) (doing linear work for each possible offset).
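For comparison, here is roughly what the naive approach looks like in plain Python: one dot product per offset, so the work grows with the product of the two lengths. The names and the toy data are only illustrative.

```python
def naive_scores(reference, candidate):
    """Score every shift of `candidate` against `reference` by counting matching 1's."""
    n, m = len(reference), len(candidate)
    scores = {}
    for shift in range(-(m - 1), n):           # every offset with any overlap
        total = 0
        for j, value in enumerate(candidate):  # linear work per offset
            t = j + shift
            if 0 <= t < n:
                total += reference[t] * value
        scores[shift] = total
    return scores

reference = [0, 0, 0, 1, 1, 0, 1, 0]
candidate = [1, 1, 0, 1]
scores = naive_scores(reference, candidate)
best_shift = max(scores, key=scores.get)
```

The nested loops are the O(n^2) part that the FFT version avoids.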

One property of the Fourier transform is that convolution in the time domain corresponds to element-wise multiplication in the frequency domain (https://en.wikipedia.org/wiki/Convolution_theorem), so you can compute the convolution efficiently by taking the FFT of both series, doing element-wise multiplication, and then taking the inverse FFT of the result.

It has nothing to do directly with FFT, which is why it is confusing:

What he is looking for is maximal correlation between two binary series (think "Pearson's r or r-square correlation coefficient"). Now, correlation is just like convolution except one series is flipped around on the time axis. Which means, if you have an efficient way to compute convolutions (and you do, through FFT), you have an efficient way to compute correlations.
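The "flipped on the time axis" relationship is easy to check numerically. A small sketch with made-up sequences, using NumPy's built-in `correlate` and `convolve`:

```python
import numpy as np

a = np.array([0, 1, 1, 0, 1, 0])
b = np.array([1, 1, 0, 1])

# Correlating a with b gives the same numbers as convolving a with b reversed.
correlation = np.correlate(a, b, mode="full")
convolution_with_flip = np.convolve(a, b[::-1], mode="full")
same = np.array_equal(correlation, convolution_with_flip)
```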

(I don't think 5 year olds heard of Pearson's correlation coefficient or convolutions, but ... that's the best I can do).

In many ways, correlation and covariance are more fundamental than convolution - they are closely and directly related to the inner product.

And the only reason to use the FFT here is convenience (i.e., it is fast and simple to compute), but the correlation/convolution property applies to many "transform domains" - Laplace, Z, Cosine, Sine, and a few others (all of which are closely related among themselves, but only a few easy to compute numerically).

From my memory, convolution in the time domain is equivalent to multiplication in the frequency domain

the fourier series was one of the things that blew my mind. any periodic function can be decomposed into a bunch of sine waves! even (the interesting portions of) many other functions can be approximated by a (potentially infinite) series of sine waves! it's simply madness i tell you.

Interesting bit of history, Kolmogorov became famous when he published his first scientific paper (at 19!) on the construction of a function whose Fourier series diverges (almost) everywhere.

This is cool.

I am constantly impressed by ffmpeg and its capabilities.

Last week I was able to make it act as a proxy for live streaming video that reduced the volume of ad-breaks by 50% automatically by listening for SCTE35 cue packets in the stream and adjusting a volume filter accordingly.

My housemate has a reality TV addiction, and the ads were getting on my nerves. I just intended to see if it was possible, but I had a POC running in about 1/2 an hour.

A neat subtitle feature I found out mpv has, but no other video player: Hide those parts of the subtitle which are mainly for those with impaired hearing like '[loud noise]' etc, but keep the rest:

`mpv --sub-filter-sdh ...`

That logic could probably be extracted into a separate subtitle filtering tool.
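A very rough idea of what such a standalone filter could look like. This is not mpv's actual logic, just a guess that assumes the SDH parts are always written inside square brackets or parentheses; `strip_sdh` is a made-up name.

```python
import re

# Assumption: SDH annotations are written in square brackets or parentheses.
SDH_PATTERN = re.compile(r"\[[^\]]*\]|\([^)]*\)")

def strip_sdh(line):
    """Remove bracketed sound descriptions and tidy leftover whitespace."""
    cleaned = SDH_PATTERN.sub("", line)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

Lines that come back empty could then be dropped from the subtitle file entirely. Real SDH subtitles also use other conventions (speaker labels, music notes), which this sketch ignores.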

When going through Miami Vice I discovered that the SDH parts sometimes also act as notes for the culture-impaired. (Song titles and lyrics were a particularly nice touch, though not strictly SDH.)

And for English-as-second-language folks, it's occasionally curious which expressions are used in the SDH descriptions.

If I remember correctly SubtitleEdit (https://github.com/SubtitleEdit/subtitleedit) had such a tool for removing the hearing-impaired bits.

still has. I use it for every subtitle, gets rid of most common errors too.

This sounds amazing! Did you use a USB TV tuner or an online source?

Is there chance you could open source the code?

I used an online stream source; A HLS stream that one of my local TV stations broadcasts online.

My method will only work with streams that have SCTE35 packets embedded. This is normally used for local ad insertion on cable networks, so it may not always be available on OTA streams.

There is not much code because it is mostly ffmpeg that does the work.

I used nodejs to manage an ffmpeg child process, read its output streams (one stream over a unix domain socket for the video, and the stdout stream for the SCTE35 packets) and pipe them to a HTTP client.

My code parses the SCTE packets and then sends commands into the running ffmpeg child process to change the audio levels.

I am happy to share it when I get home in a few hours.

Thanks so much!

please do


Keep in mind it's PoC code, I stopped when it worked the first time :P

Completely tangential question to this awesome discussion: who even writes subtitles? It feels like a thankless job to write subtitles for rips of movies and TV shows old and new. I get that maybe they source the originals from, say, Netflix, but .srt files existed long before Netflix, and for lots of movies not on Netflix. Writing all those tags for the hearing impaired seems like a lot of work, which anyone other than the first-person movie production team would be loath to indulge in. Yet I see a lot of subtitles which don't seem 'official'. Then there are subtitles which seem like they were written as a loose translation of the audio. Both are in English, so I don't see why they can't just transcribe what's being said. Instead, for dialogues such as "The greatest trick the Devil ever pulled was convincing the world he didn't exist.", the subtitle is written as "The devil tricked the world into thinking he didn't exist."

> Both are in English, so I don't see why they can't just transcribe what's being said.

Try writing subtitles yourself, and you'll probably see why. There's a limited space and time budget - people need to be able to read the thing without spending too much time focusing on it.

Sometimes, if it's a quick back and forth, you even have to leave out some of the dialogue. Now what do you cut?

And this is just for subtitles in the same language. Imagine also facing translation issues - where you not only are trying to roughly hit the same meaning and connotations, but also need to stay relatively close to the acting, e.g. you can't translate a strong exclamation word into something too soft. And there's also the issue of difference between spoken and written language, e.g. the viewers may accept someone saying "fuck that", but seeing it written down "- Fuck that" feels awkward.

It is truly a work of art.

And that's probably also why people are doing it in their spare time, perhaps after having seen a bad job from the official movie.

I imagine doing it will also get you really close to understanding how the movie is built up.

I don't know who does it. There are some professionals. Professionals do it when e.g. the BBC or SBS shows non-English movies/series, I guess. I wish there was a site where you could put money towards English subs for things!! I guess judging the quality wouldn't be so easy etc. Just in the last few days I've had these subtitle-related issues:

- No English subtitles available for S04 of Rita (Danish). So I made English subtitles for E01, by machine translating existing German and Portuguese subs and using the 'original' German for reference. Turned out pretty great, but took an hour or 2. No way I'd do the whole season like that though.

- Season 2 of Blue Moon (in French) has been out for years, no English subtitles anywhere, seems like there never will be. There's a season 3 too I think, not sure.

- Trying to find subtitles for S02 of Overspel (Dutch): everything is available (but only on one obscure subtitle site, not easy to find) except E07 and E08. I found someone in a forum asking about exactly those 2 subtitles in January 2014! No-one answered...

I'd be happy to pay for subtitles for things like that that I really want! The cost would be small, I guess, divided by a few or a few dozen or hundred people around the world. Then they would be available online forever, hopefully. Just need to coordinate the effort. That could be a great website.

Also, I helped a bit with some Spanish and English subtitles for Spanish-language TV shows on viki.com 6 or 7 years ago, which was a pretty great-seeming community-sourced translation effort, but it disappeared, along with all its subtitles. At least most of the content did, it was all Asian stuff last time I looked, I think it must have changed hands.

> who even writes subtitles?

I do, as a hobby. Typesetting in particular is one of the most rewarding things I've ever done. There are few things as satisfying to me as being able to set a sign so that it looks like it's part of the video.

If you're curious, there's a tool out there called Aegisub. Pretty much every fansubber uses it, and it can handle most parts of the process: timing, typesetting, TL/editing, and it shows you a preview of what it looks like muxed with the video (though you do need a separate tool to actually generate muxed MKV files).

> Both are in English, so I don't see why they can't just transcribe what's being said. Instead, for dialogues such as "The greatest trick the Devil ever pulled was convincing the world he didn't exist.", the subtitle is written as "The devil tricked the world into thinking he didn't exist."

You have entered the wonderful world of Hong Kong bootlegs, which are usually just called "HKs". The people who work on these just plain don't care. They translate the movie into Chinese, and then for whatever other language they release subtitles in, they just Google Translate their Chinese subs into that language, even if it's the movie's original language.

Fansubbers hate them, and if the show or movie in question was never officially released in English, there are some groups who will clean up these subtitles to make them more presentable and fix the godawful grammatical errors (this is called "scrubbing"). Of course, you're still left with a shitty base TL, but it's at least better than the HKs-as-is, so these scrubs are always just a stopgap until somebody can do a full TL. And there are some groups who will subtitle a series from scratch just to spite the HKs.

The only thing good about these subs is that they tend to turn into memes. I'd recommend looking up the HKs of Revenge of the Sith, which ended up becoming an impressive stockpile of memes, most famously being the origin of "Do not want!". For memes that came out of shows I actually watch, I'd recommend googling phrases like "Don't molest the lawyer" and "Gao Main Bastard".

Thankfully our jobs will be getting easier as speech recognition continues to improve. Some people start from scratch, some people find existing hard subs and translate them to soft subs, some people just translate subs, etc... some people are prolific but the reason it seems most media has a sub is that there are millions of interested and capable people out there, so if a movie is only meaningful enough to only one person out of say 40 million, that's still one person who will make subs for that movie. We have the countless masses to thank, even when sometimes the quality isn't perfect.

I'm not super familiar with the specifics and I'm sure someone who's worked on this particular problem will be able to clarify, but the way I understand it there is a limit on how many characters can be displayed on the screen at any given time, as well as a minimum amount of time a subtitle should be visible on the screen. Because of this, transcribers might be forced to write down a shorter version of the line in order to make it fit.

I'm equally amazed by the effort that goes into foreign dubbing. Staying in Chennai, India and channel surfing, Sky TV had dozens of US channels (TLC, Discovery channel, etc) which not only had English/Hindi/Tamil subtitles - but foreign language dubs too!

Not just movies, but full overdubbing for 20-episode season disposable reality TV shows. Think, "Storage Wars" etc. Spent over an hour channel surfing and flipping languages.

I would /love/ to know how they take the original subtitle tracks and translate/dub them efficiently.

Dubbing is a major industry in many non-English-speaking countries (Germany, France, Spain, Italy etc), employing more than half of voice actors -- so it's generally not some DIY cottage industry, but professional audio engineers and voice actors. They don't just use subtitles, they use the original video/audio to lip-sync as well as possible.

Wikipedia has some more detail: https://en.wikipedia.org/wiki/Dubbing_(filmmaking)#Methods

People actually _pay_ to watch TV shows, so imagine being paid to do it!

TED Talks for example has a community of volunteer subtitlers that get credit for their work. A lot of them move on from there to professional captioning.

For more obscure movies there are specialized forums where people do that for some bounty (mostly in upload credit) so that others who don’t speak the language can enjoy the movie too. It's not much different from asking why people spend days ripping music and movies just to share them with others.

Besides people who manually write subtitles, there's also a whole slew being ripped from DVDs and such.

A friend of mine does it as a hobby.

Awesome! According to the description of the internals this only works with offsets and doesn't adjust the subtitle's playback speed. I think playback speed variations are not that broad, it's usually caused by playing a subtitle written for a given FPS played back on a different FPS video. It might be worth it to try common FPS ratios at once and keep the best match.
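To make the "try common FPS ratios" idea concrete, here is a hedged sketch. It assumes you already have a 0/1 speech mask for the video and the subtitle cue times in seconds. The ratio list, the 0.1 s resolution and the names `to_mask` and `best_ratio` are all my own assumptions, and the direct `np.correlate` call stands in for the FFT scoring to keep the example short.

```python
import numpy as np

# Assumed, incomplete list of speed ratios between common frame rates.
FPS_RATIOS = [1.0, 24 / 25, 25 / 24, 23.976 / 24, 24 / 23.976]

def to_mask(intervals, length, step=0.1):
    """Turn (start, end) times in seconds into a 0/1 array sampled every `step` seconds."""
    mask = np.zeros(length)
    for start, end in intervals:
        lo = int(round(start / step))
        hi = int(round(end / step))
        mask[lo:hi] = 1.0
    return mask

def best_ratio(speech_mask, subtitle_intervals, ratios=FPS_RATIOS):
    """Rescale subtitle times by each ratio and keep the one with the best alignment score."""
    best = None
    for ratio in ratios:
        scaled = [(s * ratio, e * ratio) for s, e in subtitle_intervals]
        mask = to_mask(scaled, len(speech_mask))
        # Best score over all offsets for this particular ratio.
        score = np.correlate(speech_mask, mask, mode="full").max()
        if best is None or score > best[1]:
            best = (ratio, score)
    return best
```

`best_ratio` returns a `(ratio, score)` pair, so the winning ratio can be applied to the cue times before the usual offset search.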

In the case when resulting subtitles turn out to be shorter than the input file, subtitle editing software like Aegisub can automatically trim ending times so they don't overlap with starting times.
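The trimming itself is simple enough to sketch. This is not Aegisub's implementation, just an illustration that assumes cues are `(start, end, text)` tuples in seconds, sorted by start time:

```python
def trim_overlaps(cues):
    """Clamp each cue's end time to the next cue's start time.

    `cues` is a list of (start, end, text) tuples sorted by start time, in seconds.
    """
    trimmed = []
    for i, (start, end, text) in enumerate(cues):
        if i + 1 < len(cues):
            next_start = cues[i + 1][0]
            end = min(end, next_start)
        trimmed.append((start, end, text))
    return trimmed
```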

This could probably be similarly done in the script itself in the future.

If the result gets longer and the times for each line are a bit too short―I guess you'll have to read faster, har har.


Actually, iirc Aegisub can also do automatic stretching for different FPS, since that's a rather easy case.

However, it's not enough when the subs are from a different media release or an edit―which made me long for exactly the kind of solution like in the post.

Yes that would be great, happens a lot when moving between PAL and NTSC regions.

This is an area I am actively looking into at the moment for my employer. It's an interesting project, and I am excited to see work in this domain. Note that this will not pass QC at any broadcaster/OTT. Our tolerance is typically within 1-12 frames (and a bunch of other requirements around shot changes, etc.)

> Our tolerance is typically within 1-12 frames (and a bunch of other requirements around shot changes, etc.)

Tell me more!

Netflix did a good job outlining their requirements (which are mirroring industry requirements): https://partnerhelp.netflixstudios.com/hc/en-us/articles/215... Note that I do not work for Netflix rather another multinational mass media conglomerate.

Subtitles are in the realm of not many people doing interesting things with them, which is a pity as they are such a creative medium.

Recently I rolled my own code to play WebVTT to an audio (think video without the pictures) track. I had assumed there would be off the shelf libraries that would do this for me. Oh how wrong I was!

I took it a bit further than white text in a black box at the foot of the screen. Not having pictures kind of made it that way. So I decided I needed cartoon grade speech bubbles, with the speech bubble coming from the left for one voice and the other side of the screen for the other voice, again finding myself in the realms of no ready made examples to do this. The speech bubbles had to scale to fit the content with them being suitably rounded, a la cartoon style. I found an SVG solution to my problem.

The WebVTT format and variants have all kinds of goodies in them to position speech and do things in time, as per karaoke.

I think one reason I found myself in the world of rare code was that we assume anything to do with accessibility is for disabled people and therefore has to be boring and done with no creativity whatsoever.

Nice! Have you seen this similar project? I'm not sure if there are any ideas to be borrowed, but it sounds like it achieved excellent accuracy: https://github.com/AlbertoSabater/subtitle-synchronization

Interesting! I was not aware of that project before. It looks like it could be worthwhile to incorporate their neural net-based VAD, if it's not too slow. At a first glance, it looks like the main difference is the postprocessing step -- they use some heuristics to avoid trying all possible alignments, while subsync uses FFT as its secret sauce to get away with trying out all of them. :)

I should include in the readme that for >1 hour movies, it usually finishes the synchronization in ~20 seconds, which compares favorably to the project linked in your comment (13 minutes unoptimized, 2 minutes optimized).

> I have yet to find a case where the automatic synchronization has been off by more than ~1 second

That's actually quite bad synchronization error for A-V; maybe it's less noticeable for subtitles. I wonder if there is a good way to reduce that error down to milliseconds.

It's actually fairly annoying for subtitles, but it's useful for getting videos in unfamiliar languages to the right ballpark. Afterwards, my go-to so far has been to just perform a manual one-off adjustment in VLC after the video starts (you can offset subtitles in increments / decrements of 50ms with hotkeys h and j).

100 ms A/V is already quite noticeable, with some training one might spot 50 ms too.

I agree 100ms is bad :-). I think even 10s of ms sync error is readily jarring for most people, at least for A-V. Subtitles may be more forgiving.

An even harder and cooler problem to solve would be finding extra scenes. Sometimes the subtitles and the movie come from slightly different "cuts" (e.g. a director's cut) and there are some extra scenes added. A similar algorithmic approach could be used to solve this problem as well.

Totally agreed. I'm not sure how to do the alignment scoring step in this case, perhaps some kind of DTW / FFT hybrid... definitely worth further consideration.

This sounds very similar to the problem of whole-genome alignment in bioinformatics. You have, say, the human and chimp genome sequences, and you want to align them to find out which portions of each genome correspond to which portions in the other, allowing for the possibility of insertions or deletions of DNA sequences in each one.

Unfortunately, I'm not too familiar with the current state of the art for whole-genome alignment, so I don't know for sure which algorithms are considered, but MUMmer4[1] seems like a good place to start. These algorithms are designed to handle sequences up to billions of letters long (e.g. the human genome, which is about 3 billion letters).

[1]: https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...

This problem is called forced alignment. YouTube's subtitler does it for you. If you don't want to upload your video and subtitles, here are some other tools that will work: https://github.com/pettarin/forced-alignment-tools/blob/mast...

Cool! It looks to me like most of those approaches are a lot more sophisticated and work even when the text doesn't have any time annotations. At a first glance, it looks like the major downside is that these approaches are geared toward particular target languages. Subsync's advantage is its simplicity: by limiting the acoustic analysis part to simple voice detection, it can perform alignment between subtitles and audio/video in different languages, which is great for when you want to watch a TV episode in another language, but your video includes the previous episode's recap and your subtitles do not (or vice-versa). I'll definitely be taking a deeper look at these to see if there's any ideas / functionality that can be further incorporated.

You should look into DNA and RNA sequence alignment algos for possible improvements on your FFT. It's a huge and very well-explored space.

Since I already had it open (for unrelated reasons), this might be relevant.


> The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences.

But also:

> The Smith–Waterman algorithm is fairly demanding of time: To align two sequences of lengths m and n, O(mn) time is required.

Thankfully I'm only working with short sequences but yeah, I guess that might be a problem.
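For anyone who hasn't seen it, the O(mn) cost comes straight from filling a table with one cell per pair of positions. A compact sketch of the scoring part only (no traceback), with arbitrary match/mismatch/gap values that I picked for illustration:

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b, in O(len(a) * len(b)) time."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best
```

Keeping only two rows saves memory but not time, which is why this gets painful for long sequences.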

Agreed -- definitely worth thoroughly investigating this space to see whether there are any ideas that can be borrowed.

Really neat! Because it doesn't rely on the video image data this would work for podcasts with subtitles (if they exist), right?

Yes! it should work for this too, as long as ffmpeg recognizes the audio format :)

> Instead, we use the fact that "scoring all alignments" is a convolution operation and can be implemented with the Fast Fourier Transform (FFT), bringing the complexity down to O(n log n).

Not sure I understand this right - is this basically treating both binary strings as square waves, converting them to the frequency domain and determining the offset as a pitch shift between the two spectrograms?

Precisely. If I could more easily get at the raw classifier output of webrtcvad, it should be possible to be even smarter (we could have square waves with any amplitude between -1 and +1, not just either -1 and +1, which should take into account the classifier uncertainty).

EDIT: err, I'm actually not sure about the pitch shift part, that's a bit of vocabulary I'm not familiar with. If you've seen the fast polynomial multiplication algo from CLRS, it's basically that. E.g. if we have strings 1101 and 0101, we can find the best alignment by looking at the exponent of the largest coefficient after multiplying

polynomial(1101) \* polynomial(reverse(0101))

where polynomial(1101) = x^3 + x^2 - x + 1

and polynomial(reverse(0101)) = polynomial(1010) = x^3 - x^2 + x - 1
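Multiplying those two polynomials out with NumPy shows what the coefficients look like. The sketch uses the same +1/-1 encoding as the polynomials above; `np.convolve` on coefficient lists is ordinary polynomial multiplication, and the variable names are mine.

```python
import numpy as np

p = [1, 1, -1, 1]    # x^3 + x^2 - x + 1, i.e. 1101 with 0 mapped to -1
q = [1, -1, 1, -1]   # x^3 - x^2 + x - 1, i.e. reverse(0101) = 1010

product = np.convolve(p, q)                     # coefficients of x^6 down to x^0
degree = len(product) - 1
largest = product.max()
best_exponents = [degree - i for i, c in enumerate(product) if c == largest]
```

The x^3 coefficient is the score with no shift at all, and the other exponents correspond to shifting one string by |exponent - 3| positions one way or the other. For this particular pair two exponents tie for the largest coefficient, so a toy example this short does not have a unique best alignment.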

That is a very 'signals' based interpretation of the convolution theorem [1] applied to binary functions.

In essence, the fourier-transform is based on convolutions. F(eta) is essentially the convolution of f(x) with sin(eta x).* This is very loosely why the convolution theorem works.

[1] https://en.wikipedia.org/wiki/Convolution_theorem

* This excludes all cosine parts of the transform. It's neater to work in the complex domain and state that:

F(eta) = f(x) convolved with e^(eta i x)

This blows my mind (like everyone else's here)!

While playing digital copies of somewhat older movies (before BluRay rips came into vogue), a problem which surfaces frequently is that the frame rates of the video and the subtitle track are slightly mismatched, say video at 24fps, subtitle at 25.6fps(?). This makes the subtitles drift away from the video and require a manual intervention every few minutes. If I can't hear the audio properly for some reason, then I just become fed up and don't watch the movie at all. Add to that the existence of different 'cuts' for films, which add another dimension to the subtitle problem.

How would one even go about solving this problem?

As others have said, matching cuts is very similar to genome matching. In genome matching, often genes are slightly offset, so there is need to correct for that.

Matching speed might be possible based on first matching runs of one minute, and then trying to scale with that minute as a fixed point to see which scale works out. Perhaps some heuristics to take common frame rate-ratios into account might help.

An even crazier idea:

Use the fact that scaling becomes shifting after a logarithm, and try to match on a logarithmic time axis, i.e. after setting g(u) = f(exp(u)).
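The underlying identity is just log(r \* t) = log(r) + log(t): a stretch by a constant factor becomes a constant shift on a logarithmic time axis. A tiny numeric illustration on cue timestamps, with made-up values and a hypothetical 25/24 speed mismatch:

```python
import numpy as np

reference_times = np.array([12.0, 95.5, 310.0, 1800.0, 3600.0])  # seconds, made-up values
ratio = 25 / 24                                                  # hypothetical speed mismatch
stretched_times = reference_times * ratio

# On a log axis every timestamp moves by the same amount, log(ratio).
log_shift = np.log(stretched_times) - np.log(reference_times)
recovered_ratio = float(np.exp(log_shift.mean()))
```

Doing this on real speech masks would mean resampling them on a log time axis first, and a constant offset on top of the stretch would break the pure-shift property, so treat this as a thought experiment rather than a recipe.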

If you use vlc there is a subtitle fps that you can modify.

This looks pretty cool, right now I drop down to the command line to shift subs either direction but if they are off by too much or commercials/breaks don't line up then it's game over. I'll have to drop this out on my server, I'm tempted to run it across everything but I'll probably settle for running it manually when something doesn't line up in plex.
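For reference, a constant shift of an .srt file can be done with a few lines of Python. This is a sketch that assumes the usual `HH:MM:SS,mmm` timestamp format; the function names are made up.

```python
import re
from datetime import timedelta

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_timestamp(match, offset):
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    moment = timedelta(hours=hours, minutes=minutes, seconds=seconds, milliseconds=millis)
    moment = max(moment + offset, timedelta(0))      # clamp at zero rather than go negative
    total_ms = moment // timedelta(milliseconds=1)
    h, rest = divmod(total_ms, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def shift_srt(text, seconds):
    """Shift every SRT timestamp in `text` by `seconds` (negative moves cues earlier)."""
    offset = timedelta(seconds=seconds)
    return TIMESTAMP.sub(lambda match: shift_timestamp(match, offset), text)
```

It rewrites anything that looks like a timestamp, so dialogue that happens to contain that exact pattern would be changed too.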

Edit: Hmm, my first test moved the subtitles but they ended up 3-5 seconds after the audio instead of the original 2-5 before... I'll have to keep testing when I come across bad subs.

My guess is that your subtitles don't have a "fixed drift". Unfortunately, this project only works for constant drift at the moment -- dynamic subtitle drift is a harder problem but absolutely worth thinking more about for future work.

"Although it can usually work if all you have is the video file, it will be faster (and potentially more accurate) if you have a correctly synchronized "reference" srt file, in which case you can do the following:

subsync reference.srt -i unsynchronized.srt -o synchronized.srt"

I really don't understand what the reference.srt is supposed to be here.

Edit: If I have a "reference" which is synchronized why do I have to do anything at all? Why would I need to synchronize anything then?

Edit2: If the French titles are a translation of the English ones, how wouldn't copying the timestamps from one to another be enough? Or do you want to say that the program can match the subtitles which aren't based on the English ones and have different number of sentences etc? But how can it then know which part of translation belongs to what? What's short in one language isn't in another etc.

Thanks for flagging -- this seems to be a common point of confusion; I'll work on updating the readme.

Let's say reference.srt is a set of English subtitles, and you have an out-of-alignment French subtitles file unsynchronized.srt. You could use the video directly as a reference in order to synchronize the French subtitles, but it will take a bit longer since we need to extract audio and perform voice detection over the whole video file. In this case, it will be faster to use your already-synchronized English subtitles file reference.srt.

I believe it's a typo and should read "Although it can usually work if all you have is the subtitles file [...]". This means that you would use a reference file that would allow you to synchronize your subtitles.

I could imagine having a file with correctly-synced subs in the wrong language. Maybe that would be a good reference file.

(I think you're correct subbing "subtitles file" for "video file".)

Thanks for this! I just happen to need it. I have the first season of my favourite US series, but dubbed in Spanish, because I want to train my ear. Unfortunately the subtitle files slowly drift apart from the audio by 10 seconds during each episode.

I used to have a script to “stretch” subtitles manually, but it’s quite fiddly: https://gist.github.com/jgthms/7dfc20db3478a938069a0191c4e30...

You are most welcome, although unfortunately it looks like this project doesn't support your use case yet. Right now it can only correct "constant drift", although I'll definitely be thinking more about other scenarios based on the reception.

That said, it sounds like for your particular case, if you use VLC, you might be able to adjust the subtitle FPS so that you don't have to continuously manually resync. (I haven't run into this so just going off what others have mentioned.)

I see this will only shift the timings by some offset. But many times that isn't enough, e.g. if the video source is NTSC and the subtitles were meant for PAL.

In this scenario I use Subtitle Workshop, which allows me to specify two (or more) pairs of times, and all subtitles are synchronized accordingly.

Optimally the two pairs of times would be near the beginning and the end, but to prevent big plot spoilers, I usually use the beginning and 2/3 of the length.
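For anyone who wants to script the two-anchor approach described above, here is a minimal sketch. It assumes the drift is linear, so two (old time, correct time) pairs are enough to fix both the offset and the speed. The function name and the anchor values are made up for illustration; this is not how Subtitle Workshop or subsync do it internally.

```python
def make_linear_retimer(anchor_a, anchor_b):
    """Build f(t) = scale * t + shift that passes through two (old, new) anchor pairs.

    Each anchor is (time in the subtitle file, time where the line is actually spoken),
    both in seconds. Hypothetical helper, not part of any existing tool.
    """
    (old_a, new_a), (old_b, new_b) = anchor_a, anchor_b
    if old_a == old_b:
        raise ValueError("anchors need two distinct original times")
    scale = (new_b - new_a) / (old_b - old_a)  # speed correction
    shift = new_a - scale * old_a              # constant offset
    return lambda t: scale * t + shift


# Made-up anchors: one near the beginning, one roughly 2/3 of the way through.
retime = make_linear_retimer((100.0, 98.0), (4100.0, 3938.0))

# Every other timestamp is then mapped through the same line.
for old in (100.0, 2100.0, 4100.0):
    print(old, "->", round(retime(old), 3))
```

The further apart the two anchors are, the less a small error in picking them affects the computed speed factor, which is why the beginning and the end would be the ideal pair.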

Sorry for the dumb question, but what exactly does this do?

`subsync reference.srt -i unsynchronized.srt -o synchronized.srt`

I mean, I already have a synchronized srt file, so what would I be syncing here?

Great question. The main use case I can think of is when reference.srt and unsynchronized.srt are in different languages, and you want to eventually merge them into a single dual-language subtitle file.

EDIT: Oh, I should also mention that you don't need a reference.srt -- it can look at the video directly and use that as a reference.

Ohh now I got you. I believe README should be improved in this part then.

It reads:


Although it can usually work if all you have is the video file, it will be faster (and potentially more accurate) if you have a correctly synchronized "reference" srt file, in which case you can do the following:

subsync reference.srt -i unsynchronized.srt -o synchronized.srt


I believe you should explain that if you have a reference file in another language which is correctly synchronized with that video, you can use that file instead of the video, as its timestamps will serve as references when synchronizing the target .srt file.

Now this has raised a question, what if the reference file has a different block count? For example, in some languages (like Chinese or Japanese) we can say a lot with fewer characters than in English. So in Chinese a text will stay on the screen for a long time, whereas in English the corresponding text would be split into two or more blocks. Wouldn't that make synchronization less accurate?

BTW that's a cool project. Thanks for sharing!

Really appreciate the feedback -- will definitely work on improving the clarity in the README. Glad you like the project!

Sorry, I forgot to answer your last question. It turns out that, because of how the algorithm works, the number of blocks shouldn't matter. Since it is discretizing time windows in 10ms increments, the granularity of the "effective blocks" is small enough that putting two separate large blocks on the screen, each for a shorter period of time, is roughly equivalent to putting a single large block on the screen for twice as long (for synchronization purposes, that is).
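To make the "effective blocks" point concrete, here is a rough sketch of the discretization, assuming each subtitle block is just a (start, end) pair in milliseconds and each 10ms window gets a 1 if a subtitle is on screen. The helper name and the example timings are invented; the real project may differ in the details.

```python
WINDOW_MS = 10  # window size mentioned above


def discretize(blocks_ms, total_ms):
    """Turn (start_ms, end_ms) subtitle blocks into one 0/1 flag per 10 ms window."""
    flags = [0] * (total_ms // WINDOW_MS)
    for start, end in blocks_ms:
        for i in range(start // WINDOW_MS, min(end // WINDOW_MS, len(flags))):
            flags[i] = 1
    return flags


# One long block vs. the same span split into two back-to-back blocks.
one_long = discretize([(1000, 5000)], 6000)
two_short = discretize([(1000, 3000), (3000, 5000)], 6000)
print(one_long == two_short)

# With a 20 ms gap between the two blocks, only the windows in the gap differ.
with_gap = discretize([(1000, 2990), (3010, 5000)], 6000)
print(sum(a != b for a, b in zip(one_long, with_gap)), "of", len(one_long), "windows differ")
```

Once everything is a string of 0's and 1's, the block boundaries are gone, so how the text was split across blocks has little effect on the alignment score.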

> when reference.srt and unsynchronized.srt are in different languages

I just want to thank you for including this use-case, because it's exactly the thing I'm regularly running into. Subs in one language are bundled with the vid, all subs from OpenSubtitles are desynchronized.

You are most welcome! This is exactly the use case I was initially targeting; I got super lucky that the VAD-based synchronization happened to be low-hanging fruit that has a higher "Wow" factor.

Given it just processes audio, how long does processing one video file usually take?

Audio extraction is actually the most expensive part. It depends on the length of the video, but my experience is that it finishes in 20 to 30 seconds. It's possible one might be able to sample different parts of the video to bring the runtime down further; I plan to experiment with this when I get the chance.

It's quite common for different people to create their own subtitles for videos, and for all of these different subtitle versions to be available for download at various subtitle sites on the internet.

If you've already synchronized one of these from the video itself (by using the voice detection algorithm described in the Readme, or maybe even by hand), it looks like you can then synchronize the rest using the already synchronized subtitle file. It's probably faster.

I haven’t tried this command, but I believe the use case is as follows:

• Open audio/video file with —supposedly— synchronized file

• After a few seconds, I realize the subtitles appear before/after the dialogues

• I immediately close the multimedia player, and open the Terminal

• I execute the “subsync” command which does who knows what

• Open the SRT and discover that the subtitles are now in the correct timestamps

• ???

• Profit

Pretty much! Given the anecdotal success I've personally had with this approach, I'm hoping it could get picked up by VLC so that the algo can be run in there directly.

Previously: Fiddle with the subtitle/audio offset factor (and then they drift apart again slowly, driving you mad!)

Now: subsync

Soon: Players run subsync internally at the press of one button or with a command-line switch.

The voice audio detection and then mapping is such a neat solution. I would have embedded parts of the surrounding audio in some base64 format into the subtitle file and then used that as an alignment clue. But this won't work when the languages don't match.

Actually, it unfortunately doesn't work if the drift gets worse over time -- so far, it only works with constant drift. Maybe fixing constant 1st derivative drift is the next step!

The drift comes from the video playing at a different speed than the video the subtitles were made for.

E.g. if the subtitles were made for 24 frames per second (classic film speed) and you have a video presented at 25 frames per second (common in Europe). The original two-hour video at 24 fps is then about 5 minutes shorter in the European version. Or the opposite: the subtitles for 120 minutes would, by the end, appear about 5 minutes early!
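A quick back-of-the-envelope check of the 24 vs. 25 fps numbers, assuming the same frames are simply played back faster (the function name is made up for illustration):

```python
def rescale_for_fps(seconds, subtitle_fps, video_fps):
    """Map a timestamp authored for subtitle_fps onto the same frames played at video_fps."""
    return seconds * subtitle_fps / video_fps


two_hours = 120 * 60  # seconds

# Subtitles timed for 24 fps, video sped up to 25 fps: the film gets shorter.
sped_up = rescale_for_fps(two_hours, 24, 25)
print("runtime at 25 fps (minutes):", sped_up / 60)
print("drift at the end (minutes):", (two_hours - sped_up) / 60)
```

So the "5 minutes" is really a bit under that (about 4.8 minutes over two hours), and because the error grows linearly with time, a single constant offset cannot fix it.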

Apparently there are other speed changes too, though I don't know how they come about.

I did one such correction once, using a linear function to model the correction, based on the target times of the first and the last subtitle.

> only works with constant drift

I hoped that this solution would sync parts that have really different offsets between the audio and the subs, including changes from negative to positive offsets, because those are the cases where automatic fixes in e.g. Aegisub don't suffice.

This happens when the vid and the subs are from different releases which apparently were edited for some reason―regional releases or something. Like, after some point the subs are suddenly off by a minute.

Oh! Yeah, that makes absolutely perfect sense! I just asked the same question here regarding the difference in FPS of video and subtitle.

I'm guessing `synchronized.srt` is just the name of the output file.

I think they were referring to the fact that "reference.srt" would already be synchronized.

It seems people are going back to P2P for movies. I'd forgotten the ordeals of unsynchronized srt files. Thanks to the Netflix era.

Probably because not everything you want to see is on Netflix anymore and it's getting worse.

I've had great results with the Descript app which can both transcribe and can export a synchronized subtitle file. https://www.descript.com/

Perhaps you can import an srt file and have it synced? Or perhaps you can export the srt text, let Descript sync it, and re-create an .srt file.

I recently ran across a similar tool for aligning subtitles based on a reference subtitle file called aligner:


It seemed to work pretty well in my testing. Maybe it could be useful to look at?

This project came up when I was researching this idea. It looks especially useful because the alignment algorithm is less restrictive (can introduce splits/breaks), so I plan on taking a closer look for sure. Given that it does not do any speech detection and requires a reference subtitles file, I'm not sure how well it would work for aligning the encoding for the speech-induced part, which will have a lot more noise. Particularly, the linked project works by aligning time intervals [a,b], while subsync is "dumber" in that it discretizes things into binary strings, and an interval constructed by merging consecutive 1's and 0's in speech detection-induced binary strings will probably be a lot harder to match to the subtitle intervals.

Either way, I also thought this was a cool project and will be thinking more about whether any ideas can be borrowed.

I remember the annoyance. I created a script to eye ball, guess and fix back then. It was good enough. https://gist.github.com/chanux/2042676

Super cool. This and the Python-based subtitle downloader subliminal make working with subtitles really easy. I had previously relied on incremental fudging of times using trial and error with www.sync-subtitles.com.

> I have yet to find a case where the automatic synchronization has been off by more than ~1 second.

An error of up to 1 second means anywhere in the 0 - 1000ms range, which still leaves a lot of room for manual tweaking in VLC. I can notice 50ms differences in sync.

This is quite amazing. The code quality overall looks pretty good too. Pleasantly surprised to see it in Python. Was dreading complex C++ code. Thanks for sharing. The code will aid my learning.

What about syncing up audio?

Yet to find a CLI solution for syncing up & replacing external audio on my camera's MP4.

Current solution is to boot up FCPX and do a sync there, which rips away minutes of my life.

I use Audacity for this.

I load both files, look for a peak or end of silence in both. The difference I remux with Mkvtoolnix. I do this with 2-3 spots.

Loading the file is usually the longest waiting period.

Please tell me someone that works at Plex is reading this thread. I know it might seem a bit distant for Americans but this is HUGE!

A Kodi plugin might be a more realistic target; subtitle handling in Kodi is pretty great in my experience, especially compared to Plex (download while playing, source add-ons, offset adjustment).

Surely you can do much better than this using a speech recogniser?

By speech recognizer, are you referring to a method that would perform speech-to-text transcription? If so, you could probably do better when your video and subtitles are in the same language, but you would lose the ability to be language-agnostic (unless you manage to solve machine translation well enough :)

Interestingly enough, I haven't actually found any cases where the synchronization doesn't work (assuming the only problem with the subtitles file is a time offset), so it actually looks like simple voice detection might be good enough for the target use-case of longer TV episodes and movies, although further evaluation is necessary.

How about if you just do speech-to-timed-phonemes (which doesn't require any kind of language model, just a model trained with people speaking in accents, to produce "canonical" phonemes) and text-to-phonemes (which just needs a dictionary with IPA metadata), and then attempt to synchronize your text phonemes to the audio's phonemes? It'd be the same as the current solution, but with higher dimensionality on the symbols.

Frequently there will be subtitles that describe sounds and noises for the deaf (i.e. door closes). How is this handled?

The approach taken here is to not worry about those. :)

These will introduce some noise, but the synchronization algorithm seems to be robust enough that noise doesn't matter (whether it's from the voice activity detector or from non speech-related subtitles).
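A toy illustration of why a bit of noise tends not to matter, assuming both sides have already been turned into 0/1 flags per time window. This is the naive brute-force scoring (not the FFT version the README describes), and all the segment positions are made up: the subtitles are the speech pattern moved 7 windows earlier, plus one extra "[door closes]" style caption, while the reference has one spurious voice detection.

```python
def flags(segments, length):
    """Build a 0/1 list with 1's inside each (start, end) window range."""
    out = [0] * length
    for start, end in segments:
        for t in range(start, end):
            out[t] = 1
    return out


def best_offset(reference, subtitles, max_shift):
    """Brute force: try every shift and keep the one with the most overlapping 1's."""
    def score(shift):
        return sum(
            reference[t] * subtitles[t - shift]
            for t in range(len(reference))
            if 0 <= t - shift < len(subtitles)
        )
    return max(range(-max_shift, max_shift + 1), key=score)


# Detected speech, including a false positive at windows 14-15.
reference = flags([(5, 12), (14, 16), (20, 24), (31, 42), (50, 53)], 60)
# Same speech pattern 7 windows early, plus a non-speech caption at windows 38-40.
subtitles = flags([(0, 5), (13, 17), (24, 35), (38, 41), (43, 46)], 60)

print(best_offset(reference, subtitles, max_shift=20))
```

The stray segments add a few points to some wrong shifts, but the correct shift still collects the overlap from every real speech segment at once, so it wins comfortably in this example.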

Why? Detecting whether there is speech or not should be enough to align video and subs. It's not like the word order would be scrambled in the subtitle file and needs to be sorted according to the video. :)

I'm still waiting for the day there is an automatic synchronizer for the video and audio tracks in VLC. It's so frustrating trying to find the right offset when they get out of sync.
