"[...] the naive O(n^2) strategy for scoring all alignments is unacceptable. Instead, we use the fact that "scoring all alignments" is a convolution operation and can be implemented with the Fast Fourier Transform (FFT), bringing the complexity down to O(n log n)."
I absolutely love it when something that, at first glance, has no business being solved in the frequency domain, gets solved in the frequency domain.
Let's start by seeing how this is a convolution. We have the videoSpeech sequence and the subtitle sequence - each is a vector, indexed by time, of 0's and 1's indicating whether there is speech at that time. We can imagine padding the sequences out on either side with 0's, and consider the alignment task as shifting the subtitle sequence left and right in time until we get the best alignment with the 1's in the videoSpeech sequence. We can express the goodness of an alignment as the number of matching 1's, aka the sum over all times t of videoSpeech(t) * subtitle(t).

This is essentially the definition of a convolution: the convolution of two sequences gives a new sequence where the value at index i is the sum above with one of the sequences shifted by i. Mathematically, conv(videoSpeech, subtitle)(i) = sum over t of videoSpeech(t) * subtitle(t-i). (Strictly speaking this is the cross-correlation; a true convolution would use subtitle(i-t), i.e. the time-reversed sequence - flip one input and you get the other.) So we can rephrase the problem as: find the index i which maximizes the value of the convolution sequence.
The discrete Fourier transform is a function that takes a sequence and gives another sequence. It's relevant here because it "turns convolution into multiplication": fourier(videoSpeech)(i) * fourier(subtitle)(i) = fourier(conv(videoSpeech, subtitle))(i).
So finally, to solve the problem, we take the pointwise product sequence S = fourier(videoSpeech) * fourier(subtitle), apply the inverse Fourier transform to get invFourier(S), and maximize invFourier(S)(i) over i.
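Concretely, here's a minimal numpy sketch of the whole pipeline (my own illustration, not the project's actual code - the names are made up):

```python
import numpy as np

def best_offset(video_speech, subtitle):
    # video_speech and subtitle are 0/1 arrays indexed by time step.
    # Zero-pad to the full correlation length so the FFT's circular
    # wraparound can't mix the two ends of the sequences.
    n = len(video_speech) + len(subtitle) - 1
    f_video = np.fft.rfft(video_speech, n)
    f_sub = np.fft.rfft(subtitle, n)
    # Pointwise product in the frequency domain; conjugating one factor
    # time-reverses it, which turns the convolution into a correlation.
    scores = np.fft.irfft(f_video * np.conj(f_sub), n)
    shift = int(np.argmax(scores))
    # Indices past len(video_speech) - 1 are wrapped-around negative shifts.
    return shift if shift < len(video_speech) else shift - n
```

A positive result means the subtitle events should be delayed by that many time steps to line up with the detected speech; a negative one means they should be moved earlier.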
I tried this sentence on my 5-year-old and got a blank stare. He then proceeded to tell me a story about how Darth Vader is so scary and cool and that he's actually Luke's father. YMMV.
ELI5 v2: you can look at the sequences a different way, like tilting your head to look at them from a different angle. And from that angle, we can more easily try every possible way of putting them together, so that we can choose the best way.
I'm writing a program which takes multiple channels of near-periodic audio wav files, and outputs a phase-stable oscilloscope video. This is done by FFT-correlating (a buffer of recent oscilloscope plots) with (the audio signal) (with quite a bit of added complexity for better results). Incidentally I'm also using Python/Numpy/ffmpeg.
It's most useful for "complex" chiptune like FM/SNES/tracker/MIDI music, which are easy to split into monophonic single-note channels. https://github.com/jimbo1qaz/corrscope Should I submit this separately (Show HN)?
One property of the Fourier transform is that convolution in the time domain corresponds to element-wise multiplication in the frequency domain (https://en.wikipedia.org/wiki/Convolution_theorem), so you can compute the convolution efficiently by taking the FFT of both series, doing element-wise multiplication, and then taking the inverse FFT of the result.
What he is looking for is maximal correlation between two binary series (think "Pearson's r or r-square correlation coefficient"). Now, correlation is just like convolution except one series is flipped around on the time axis. Which means, if you have an efficient way to compute convolutions (and you do, through FFT), you have an efficient way to compute correlations.
(I don't think 5-year-olds have heard of Pearson's correlation coefficient or convolutions, but... that's the best I can do.)
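In numpy terms, the flip described above looks like this (a toy check, not code from the project):

```python
import numpy as np

a = np.array([1, 1, 0, 1], dtype=float)  # e.g. detected speech
b = np.array([0, 1, 0, 1], dtype=float)  # e.g. subtitle activity

# Correlation is convolution with the second series reversed in time,
# so any fast convolution (FFT-based or otherwise) gives fast correlation.
assert np.allclose(np.correlate(a, b, mode="full"),
                   np.convolve(a, b[::-1], mode="full"))
```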
In many ways, correlation and covariance are more fundamental than convolution - they are closely and directly related to the inner product.
And the only reason to use the FFT here is convenience (i.e., it is fast and simple to compute), but the correlation/convolution property applies to many "transform domains" - Laplace, Z, Cosine, Sine, and a few others (all of which are closely related among themselves, but only a few of which are easy to compute numerically).
I am constantly impressed by ffmpeg and its capabilities.
Last week I was able to make it act as a proxy for live streaming video that reduced the volume of ad-breaks by 50% automatically by listening for SCTE35 cue packets in the stream and adjusting a volume filter accordingly.
My housemate has a reality TV addiction, and the ads were getting on my nerves. I just intended to see if it was possible, but I had a POC running in about half an hour.
`mpv --sub-filter-sdh ...`
That logic could probably be extracted into a separate subtitle filtering tool.
And for English-as-a-second-language folks, it's occasionally curious to see which expressions are used in the SDH descriptions.
Is there a chance you could open-source the code?
My method will only work with streams that have SCTE35 packets embedded. This is normally used for local ad insertion on cable networks, so it may not always be available on OTA streams.
There is not much code because it is mostly ffmpeg that does the work.
I used nodejs to manage an ffmpeg child process, read its output streams (one stream over a unix domain socket for the video, and the stdout stream for the SCTE35 packets), and pipe them to an HTTP client.
My code parses the SCTE packets and then sends commands into the running ffmpeg child process to change the audio levels.
I am happy to share it when I get home in a few hours.
Keep in mind it's PoC code, I stopped when it worked the first time :P
Try writing subtitles yourself, and you'll probably see why. There's a limited space and time budget - people need to be able to read the thing without spending too much time focusing on it.
Sometimes, if it's a quick back and forth, you even have to leave out some of the dialogue. Now what do you cut?
And this is just for subtitles in the same language. Imagine also facing translation issues - where you not only are trying to roughly hit the same meaning and connotations, but also need to stay relatively close to the acting, e.g. you can't translate a strong exclamation word into something too soft. And there's also the issue of difference between spoken and written language, e.g. the viewers may accept someone saying "fuck that", but seeing it written down "- Fuck that" feels awkward.
It is truly a work of art.
And that's probably also why people are doing it in their spare time, perhaps after having seen a bad job in the official release.
I imagine doing it will also get you really close to understanding how the movie is built up.
- No English subtitles available for S04 of Rita (Danish)... So I made English subtitles for E01, by machine-translating the existing German and Portuguese subs and using the 'original' German for reference. Turned out pretty great, but took an hour or 2. No way I'd do the whole season like that though.
- Season 2 of Blue Moon (in French) has been out for years, no English subtitles anywhere, and it seems like there never will be. There's a season 3 too I think, not sure.
- Trying to find subtitles for S02 of Overspel (Dutch), there's everything available (though only on one obscure subtitle site, not easy to find) except E07 and E08. I found someone in a forum asking about exactly those 2 subtitles in January 2014! No-one answered...
I'd be happy to pay for subtitles for things like that that I really want! The cost would be small, I guess, divided by a few or a few dozen or hundred people around the world. Then they would be available online forever, hopefully. Just need to coordinate the effort. That could be a great website.
Also, I helped a bit with some Spanish and English subtitles for Spanish-language TV shows on viki.com 6 or 7 years ago, which was a pretty great-seeming community-sourced translation effort, but it disappeared, along with all its subtitles. At least most of the content did; it was all Asian stuff last time I looked, I think it must have changed hands.
I do, as a hobby. Typesetting in particular is one of the most rewarding things I've ever done. There are few things as satisfying to me as being able to set a sign so that it looks like it's part of the video.
If you're curious, there's a tool out there called Aegisub. Pretty much every fansubber uses it, and it can handle most parts of the process: timing, typesetting, TL/editing, and it shows you a preview of what it looks like muxed with the video (though you do need a separate tool to actually generate muxed MKV files).
> Both are in English, so I don't see why they can't just transcribe what's being said. Instead, for dialogues such as "The greatest trick the Devil ever pulled was convincing the world he didn't exist.", the subtitle is written as "The devil tricked the world into thinking he didn't exist."
You have entered the wonderful world of Hong Kong bootlegs, which are usually just called "HKs". The people who work on these just plain don't care. They translate the movie into Chinese, and then for whatever other language they release subtitles in, they just Google Translate their Chinese subs into that language, even if it's the movie's original language.
Fansubbers hate them, and if the show or movie in question was never officially released in English, there are some groups who will clean up these subtitles to make them more presentable, fixing the godawful grammatical errors (this is called "scrubbing"). Of course, you're still left with a shitty base TL, but it's at least better than the HKs as-is, so these scrubs are always just a stopgap until somebody can do a full TL. And there are some groups who will subtitle a series from scratch just to spite the HKs.
The only thing good about these subs is that they tend to turn into memes. I'd recommend looking up the HKs of Revenge of the Sith, which ended up becoming an impressive stockpile of memes, most famously being the origin of "Do not want!". For memes that came out of shows I actually watch, I'd recommend googling phrases like "Don't molest the lawyer" and "Gao Main Bastard".
Not just movies, but full overdubbing for 20-episode seasons of disposable reality TV shows. Think "Storage Wars", etc. Spent over an hour channel surfing and flipping languages.
I would /love/ to know how they take the original subtitle tracks and translate/dub them efficiently.
Wikipedia has some more detail: https://en.wikipedia.org/wiki/Dubbing_(filmmaking)#Methods
This could probably be done similarly in the script itself in the future.
If the result gets longer and the times for each line are a bit too short―I guess you'll have to read faster, har har.
Actually, iirc Aegisub can also do automatic stretching for different FPS, since that's a rather easy case.
However, it's not enough when the subs are from a different media release or an edit―which made me long for exactly the kind of solution described in the post.
Tell me more!
Recently I rolled my own code to play WebVTT against an audio track (think video without the pictures). I had assumed there would be off-the-shelf libraries that would do this for me. Oh how wrong I was!
I took it a bit further than white text in a black box at the foot of the screen. Not having pictures kind of made it that way. So I decided I needed cartoon-grade speech bubbles, with the speech bubble coming from the left for one voice and from the other side of the screen for the other voice, again finding myself in the realm of no ready-made examples to do this. The speech bubbles had to scale to fit the content while staying suitably rounded, a la cartoon style. I found an SVG solution to my problem.
The WebVTT format and variants have all kinds of goodies in them to position speech and do things in time, as per karaoke.
I think one reason I found myself in the world of rare code was that we assume anything to do with accessibility is for disabled people and therefore has to be boring and done with no creativity whatsoever.
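For anyone attempting something similar: pulling the cues out of WebVTT is the easy part. A minimal sketch of that step (my own illustration; it ignores cue settings, styling, and some timestamp variants a real player must handle):

```python
import re

CUE_TIME = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3}) --> (?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)

def to_seconds(h, m, s, ms):
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_cues(vtt_text):
    """Yield (start, end, text) for each cue in a WebVTT document."""
    for block in vtt_text.split("\n\n"):
        lines = [l for l in block.strip().splitlines() if l.strip()]
        for i, line in enumerate(lines):
            m = CUE_TIME.search(line)
            if m:
                g = m.groups()
                yield (to_seconds(*g[:4]), to_seconds(*g[4:]),
                       "\n".join(lines[i + 1:]))
                break
```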
I should include in the readme that for >1 hour movies, it usually finishes the synchronization in ~20 seconds, which compares favorably to the project linked in your comment (13 minutes unoptimized, 2 minutes optimized).
That's actually quite a bad synchronization error for A/V; maybe it's less noticeable for subtitles. I wonder if there is a good way to reduce that error down to milliseconds.
Unfortunately, I'm not too familiar with the current state of the art for whole-genome alignment, so I don't know for sure which algorithms are considered the best, but MUMmer4 seems like a good place to start. These algorithms are designed to handle sequences up to billions of letters long (e.g. the human genome, which is about 3 billion letters).
> The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences.
> The Smith–Waterman algorithm is fairly demanding of time: To align two sequences of lengths m and n, O(mn) time is required.
Not sure I understand this right - is this basically treating both binary strings as square waves, converting them to the frequency domain and determining the offset as a pitch shift between the two spectrograms?
EDIT: err, I'm actually not sure about the pitch shift part; that's a bit of vocabulary I'm not familiar with. If you've seen the fast polynomial multiplication algo from CLRS, it's basically that. E.g. if we have the strings 1101 and 0101, we can find the best alignment by looking at the exponent of the largest coefficient after multiplying (mapping each 1 to a +1 coefficient and each 0 to a -1, so mismatches are penalized),
where polynomial(1101) = x^3 + x^2 - x + 1
and polynomial(reverse(0101)) = polynomial(1010) = x^3 - x^2 + x - 1
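Spelled out with numpy (a toy illustration of the CLRS trick, not the project's code):

```python
import numpy as np

def to_coeffs(bits):
    # Map 1 -> +1 and 0 -> -1, so matching positions add to a coefficient
    # and mismatching positions subtract from it.
    return np.array([1.0 if b == "1" else -1.0 for b in bits])

a, b = "1101", "0101"
# Multiplying by the reversed polynomial makes each coefficient of the
# product an alignment score; np.polymul is itself just a convolution.
product = np.polymul(to_coeffs(a), to_coeffs(b)[::-1])
shift = int(np.argmax(product)) - (len(b) - 1)  # 0 here: best at no offset
```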
In essence, the Fourier transform is built out of convolutions. F(eta) is essentially the convolution of f(x) with sin(eta x), evaluated at a point.* This is very loosely why the convolution theorem works.

* This excludes all the cosine parts of the transform. It's neater to work in the complex domain and state that:

F(eta) = (f(x) convolved with e^(i eta x)) evaluated at x = 0
While playing digital copies of somewhat older movies (before BluRay rips came into vogue), a problem which surfaces frequently is that the frame rates of the video and the subtitle track are slightly mismatched, say video at 24fps, subtitle at 25.6fps(?). This makes the subtitles drift away from the video and require a manual intervention every few minutes. If I can't hear the audio properly for some reason, then I just become fed up and don't watch the movie at all. Add to that the existence of different 'cuts' for films, which add another dimension to the subtitle problem.
How would one even go about solving this problem?
Matching the speed might be possible by first matching a one-minute run, and then using that minute as a fixed point while trying different scales to see which one works out. Perhaps some heuristics that take common frame-rate ratios into account might help.
An even crazier idea:
Use the fact that scaling becomes shifting after a logarithm, and try to match after setting g(x) = f(log(x)).
Edit: Hmm, my first test moved the subtitles but they ended up 3-5 seconds after the audio instead of the original 2-5 before... I'll have to keep testing when I come across bad subs.
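To make the log-time idea above concrete, here's a toy sketch (entirely illustrative, not something subsync does): a constant speed factor r becomes a constant shift of log(r) in log-time, so the same correlation trick can recover it.

```python
import numpy as np

rng = np.random.default_rng(0)
events = np.sort(rng.uniform(10.0, 7200.0, 200))  # "true" speech times (s)
scaled = events * (25.0 / 24.0)                   # unknown speed factor r

# Histogram both event sets on a shared logarithmic time grid.
grid = np.linspace(np.log(10.0), np.log(8000.0), 4097)
h_true, _ = np.histogram(np.log(events), bins=grid)
h_scaled, _ = np.histogram(np.log(scaled), bins=grid)

# In log-time, the scaled histogram is the true one shifted by ~log(r).
corr = np.correlate(h_scaled.astype(float), h_true.astype(float), "full")
shift = np.argmax(corr) - (len(h_true) - 1)
print("recovered r:", np.exp(shift * (grid[1] - grid[0])))  # ~ 25/24
```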
`subsync reference.srt -i unsynchronized.srt -o synchronized.srt`
I really don't understand what the reference.srt is supposed to be here.
Edit: If I have a "reference" which is synchronized why do I have to do anything at all? Why would I need to synchronize anything then?
Edit2: If the French titles are a translation of the English ones, wouldn't copying the timestamps from one to the other be enough? Or are you saying that the program can match subtitles which aren't based on the English ones and have a different number of sentences, etc.? But how can it then know which part of the translation belongs to what? What's short in one language isn't in another, etc.
Let's say reference.srt is a set of English subtitles, and you have an out-of-alignment French subtitles file unsynchronized.srt. You could use the video directly as a reference in order to synchronize the French subtitles, but it will take a bit longer since we need to extract audio and perform voice detection over the whole video file. In this case, it will be faster to use your already-synchronized English subtitles file reference.srt.
(I think you're correct subbing "subtitles file" for "video file".)
I used to have a script to “stretch” subtitles manually, but it’s quite fiddly: https://gist.github.com/jgthms/7dfc20db3478a938069a0191c4e30...
That said, it sounds like for your particular case, if you use VLC, you might be able to adjust the subtitle FPS so that you don't have to continuously manually resync. (I haven't run into this so just going off what others have mentioned.)
In this scenario I use Subtitle Workshop, which allows me to specify two (or more) pairs of times, and all subtitles are synchronized accordingly.
Optimally the two pairs of times would be near the beginning and the end, but to prevent big plot spoilers, I usually use the beginning and 2/3 of the length.
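That two-point correction is just fitting a line through the two pairs and applying it to every timestamp; a sketch of the idea (not Subtitle Workshop's actual code, and the times are made up):

```python
def make_remap(pair1, pair2):
    # Each pair is (current_subtitle_time, correct_time), in seconds.
    (s1, c1), (s2, c2) = pair1, pair2
    slope = (c2 - c1) / (s2 - s1)   # fixes the speed (e.g. fps mismatch)
    return lambda t: c1 + slope * (t - s1)  # also fixes the offset

remap = make_remap((12.0, 14.5), (4800.0, 5002.5))
print(remap(2400.0))  # apply to every start/end time in the .srt
```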
`subsync reference.srt -i unsynchronized.srt -o synchronized.srt`
I mean, I already have a synchronized srt file, so what would I be syncing here?
EDIT: Oh, I should also mention that you don't need a reference.srt -- it can look at the video directly and use that as a reference.
Although it can usually work if all you have is the video file, it will be faster (and potentially more accurate) if you have a correctly synchronized "reference" srt file, in which case you can do the following:
`subsync reference.srt -i unsynchronized.srt -o synchronized.srt`
I believe you should explain that if you have a reference file in another language which is correctly synchronized with that video, you can use that file instead of the video, as its timestamps will serve as references when synchronizing the target .srt file.
Now this raises a question: what if the reference file has a different block count? For example, in some languages (like Chinese or Japanese) we can say a lot with fewer characters than in English. So in Chinese a text will stay on the screen for a long time, whereas in English the corresponding text would be split into two or more blocks. Wouldn't that make synchronization less accurate?
BTW that's a cool project. Thanks for sharing!
I just want to thank you for including this use-case, because it's exactly the thing I'm regularly running into. Subs in one language are bundled with the vid, all subs from OpenSubtitles are desynchronized.
If you've already synchronized one of these from the video itself (by using the voice detection algorithm described in the Readme, or maybe even by hand), it looks like you can then synchronize the rest using the already synchronized subtitle file. It's probably faster.
• Open an audio/video file with a supposedly synchronized subtitle file
• After a few seconds, I realize the subtitles appear before/after the dialogues
• I immediately close the multimedia player, and open the Terminal
• I execute the “subsync” command which does who knows what
• Open the SRT and discover that the subtitles now have the correct timestamps
Soon: players run subsync internally at the press of a button or via a command-line switch.
The voice detection and mapping approach is such a neat solution. I would have embedded parts of the surrounding audio, base64-encoded, into the subtitle file and then used that as an alignment clue. But that won't work when the languages don't match.
E.g. if one subtitle file is made for 24 frames per second (classic film speed) and you have a video presented at 25 frames per second (common in Europe). The original two-hour video at 24 fps is then about 5 minutes shorter in the Europe-origin version (120 min × 24/25 = 115.2 min). Or the opposite: the subtitles for 120 minutes would, by the end, appear about 5 minutes early!
Apparently there are some other speed changes too, though I don't know how they come about.
I have done one such correction once, using a linear function to model the correction based on the target times of the first and the last title.
I hoped that this solution would sync parts that have really different offsets between the audio and the subs, including changes from negative to positive offsets, because those are the cases where automatic fixes in e.g. Aegisub don't suffice.
This happens when the vid and the subs are from different releases which apparently were edited for some reason―regional releases or something. Like, after some point the subs are suddenly off by a minute.
Perhaps you can import an srt file and have it synced? Or perhaps you can export the srt text and let Descript sync it and re-create an .srt file.
It seemed to work pretty well in my testing. Maybe it could be useful to look at?
Either way, I also thought this was a cool project and will be thinking more about whether any ideas can be borrowed.
A 1-second error is somewhere in the 0-1000 ms range, which still leaves a lot of room for manual tweaking in VLC. I can notice 50 ms differences in sync.
Yet to find a CLI solution for syncing up & replacing external audio on my camera's MP4.
Current solution is to boot up FCPX and do a sync there, which rips away minutes of my life.
I load both files and look for a peak or the end of a silence in both. Then I remux with the measured difference in Mkvtoolnix. I do this at 2-3 spots.
Loading the file is usually the longest waiting period.
Interestingly enough, I haven't actually found any cases where the synchronization doesn't work (assuming the only problem with the subtitles file is a time offset), so it actually looks like simple voice detection might be good enough for the target use-case of longer TV episodes and movies, although further evaluation is necessary.
These will introduce some noise, but the synchronization algorithm seems to be robust enough that the noise doesn't matter (whether it's from the voice activity detector or from non-speech-related subtitles).