Extracting Chinese Hard Subs from a Video, Part 1 (kerrickstaley.com)
216 points by KerrickStaley on May 29, 2017 | 64 comments

I actually solved this problem back in 2013 with a slightly more advanced technique (taking into account other signals such as motion), see http://up.csail.mit.edu/other-pubs/chi2014-smartsubs.pdf and http://up.csail.mit.edu/other-pubs/gkovacs-meng-thesis.pdf for the algorithm and https://github.com/gkovacs/extract-subtitle for the implementation (in python)

In my experience Tesseract improves massively if you can identify the font the text is written in and prepare a custom trained dataset for it to use. See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTess...

All of the failures were directly related to improperly isolated input. In addition, a huge percentage of Chinese text is written in very few fonts.

Could isolate the text by compositing neighbor images and getting the pixels that don't differ. The background is moving, the text is not.

This would also allow you to extract out the subs without having to OCR them/get characters. Could just erase all static artifacts (including subs but also things like watermarks).
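A minimal sketch of the "pixels that don't differ" idea, assuming grayscale frames stored as plain nested lists and a hypothetical `tolerance` for compression noise (neither comes from the comment above). It marks a pixel as static when its value barely changes across neighboring frames:

```python
def static_mask(frames, tolerance=8):
    """Return a 2D list of booleans: True where a pixel stays nearly constant.

    frames: list of grayscale frames, each a list of rows of ints in 0..255.
    tolerance: assumed maximum spread (max - min) still counted as "static".
    """
    height, width = len(frames[0]), len(frames[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [frame[y][x] for frame in frames]
            row.append(max(values) - min(values) <= tolerance)
        mask.append(row)
    return mask


# Three tiny 2x3 frames: the left column is a bright overlay pixel that
# never changes, the other two columns are moving background.
frames = [
    [[255, 10, 200], [255, 90, 30]],
    [[255, 60, 120], [254, 20, 80]],
    [[253, 110, 40], [255, 70, 160]],
]
mask = static_mask(frames)
```

The mask catches anything static, so watermarks (and, with a fixed camera, parts of the scene) would be marked along with the subtitles.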

An approach I really want to try is taking a stream of the video without subs (can easily be found online) and subtracting the two. You'd have to deal with differences in resolution and compression between the two, and also handle cases where the background is either white or black, but in theory it should work very well. I haven't had time to dig into this.
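A rough sketch of the subtraction idea, under the strong assumption that the two streams have already been aligned frame-for-frame and scaled to the same resolution (the hard part mentioned above). The function name and `threshold` are illustrative only:

```python
def subtitle_mask(subbed, clean, threshold=40):
    """True where the hard-subbed frame differs from the clean frame.

    Both frames are grayscale, given as lists of rows of ints in 0..255.
    threshold is an assumed cutoff meant to absorb compression differences.
    """
    return [
        [abs(s - c) > threshold for s, c in zip(sub_row, clean_row)]
        for sub_row, clean_row in zip(subbed, clean)
    ]


clean = [[100, 102, 98, 101], [99, 100, 103, 100]]
subbed = [[104, 255, 250, 97], [101, 0, 252, 102]]
mask = subtitle_mask(subbed, clean)
```

Where the background is already white or black, the difference against same-colored subtitle pixels falls under the threshold, which is exactly the failure case the comment anticipates.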

Seems like you could get gstreamer and some subtractive elements working pretty quick...

Legitimate question: why would you want the first video (hardcoded subs) if you have a second stream without them, better resolution maybe?

In order to have access to vocabulary words. From the article: > I wanted to get a transcript of the episode’s dialog so I could study the unfamiliar vocabulary. Unfortunately, the video files I have only have hard subtitles

Makes sense, thanks. I went straight to the technical details and missed that part.

This has to be worth a try.

This only works if the camera isn't fixed, though. In the frame from the post it might erase the dashboard, the car roof, and so on.

Just define a small area of the screen to run on then. Subtitles are typically within a very small portion of the screen

Nope, every channel of CCTV seems to have its own subtitle convention.

Which is irrelevant to stripping subs from one movie at a time.

One possible extension of this is to erase the original subtitles using something like gimp-resynthesizer's Heal Selection, and then replace them with translated ones, all automatically. (I've redrawn the video frame from the post[1] so you can see what I mean.)

[1]: https://static.linestarve.com/ext/ycombinator-news/itm144408...

Using a 2d image resynthesis algorithm is sub-optimal for video. There is basically no chance of it picking the same result every single frame, you will see a flickering shadow of weirdness where the old characters were.

You have lots of information about what image should be there. You have motion vectors signaling when an object or background has moved into (or out of) a blocked area. Or maybe the motion vectors indicate the object/background is stationary, and you can use information from before/after the subtitles to fill it in.

I'm not aware of such a 3d video resynthesis algorithm existing, but it should be possible.
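For the stationary-background case only, here is a small sketch of filling covered pixels from the frames just before and after a subtitle. It assumes a known subtitle mask and clean frames on both sides, ignores motion vectors entirely, and uses hypothetical names throughout:

```python
def fill_from_neighbors(frames, mask, first, last):
    """Fill masked pixels in frames[first..last] by blending clean neighbors.

    frames: list of grayscale frames (lists of rows of ints).
    mask: 2D list of booleans, True where the subtitle covers the image.
    Assumes frames[first - 1] and frames[last + 1] exist and are subtitle-free.
    """
    before, after = frames[first - 1], frames[last + 1]
    span = last - first + 2  # steps from `before` to `after`
    repaired = [[row[:] for row in frame] for frame in frames]
    for i in range(first, last + 1):
        weight = (i - first + 1) / span  # 0 means `before`, 1 means `after`
        for y, mask_row in enumerate(mask):
            for x, covered in enumerate(mask_row):
                if covered:
                    blended = (1 - weight) * before[y][x] + weight * after[y][x]
                    repaired[i][y][x] = round(blended)
    return repaired


# Four 1x2 frames; the left pixel is covered by a subtitle in frames 1 and 2.
frames = [[[10, 50]], [[255, 50]], [[255, 50]], [[40, 50]]]
mask = [[True, False]]
repaired = fill_from_neighbors(frames, mask, first=1, last=2)
```

Because every frame in the span is derived from the same two clean frames, the result is temporally consistent, which avoids the per-frame flicker described above. It does nothing useful once the background moves.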

It's called video inpainting. I'm not familiar with it, but here's one paper with video examples I found: http://perso.telecom-paristech.fr/~gousseau/video_inpainting...

Google Translate is able to do this if you point it at live video.

Lately I am using the Copyfish Chrome extension for help with Chinese subtitles/images. The very nice thing about it is that it plays nice with the Zhongwen dictionary, which is another essential Chrome extension for Chinese learners.


Before that I was using the "Chinese Subtitle Translator" software: https://ocr.space/blog/p/chinese-subtitles-translator.html (Source code at https://github.com/A9T9/Chinese-Subtitles-Translator )

It uses Microsoft OCR and gets very good results.

http://projectnaptha.com/ is also a very similar project. I think it uses Ocrad.

I was hoping for a tool that erases Chinese hard subs from the video.

Can anyone who learned Chinese comment on the timeframe it takes to even get a basic understanding of what the subtitles say?

I used to learn English this way (watching US TV shows in English with English subs and very limited vocabulary). Eventually disabled the subs. Now watching everything in English. Love it.

edit: any tips on getting started with Chinese very welcome as well (apart from the standard stuff I find myself through Google or language courses)

After around 3000 hours of mostly self study I can understand 95+% of subtitles for TV shows and for a movie can follow enough dialog just through subtitles to understand what's going on at full speed. Study success scales pretty linearly with time spent as far as I can tell, assuming you have a fairly sensible study method. For 1000 hours study I'd imagine you'd get the basics of a lot of subtitles but probably not quickly enough to follow along at full speed.

Teachers can be helpful to point out some mistakes you'll miss through self study but unless you have a fair amount of cash to throw at the problem self study will likely be a better approach. For full fluency you probably want to target around 20k vocab, so it's a bit of a numbers game in terms of finding a quick way to improve your vocabulary. I use Skritter but I guess any SRS software should help a lot here.

Not entirely convinced by immersion as being necessary, I learned plenty of Chinese without it. Probably good for day to day vocab and motivation but unless your level is good you will still struggle to get into conversations where you use it, depending on how outgoing you are.

To be honest, learning Chinese has been a long and somewhat painful process. I'm about 6.5 years in at this point (2.5 years of school + 4 years of self study) and am starting to be able to pick up a newspaper or watch a show and get the gist of what it's about. I don't have a natural knack for languages and I was never devoting 100% of my time to it (I did software in school and now for work), so your experience may differ. Also I never had a chance for immersion—people I know that have spent 6 months in Taiwan or China are often above my level even though they've been studying for less total time, so immersion is really helpful.

I've found that SRS apps like Anki or Memrise and podcasts like Popup Chinese or ChinesePod have been really helpful. I'm also developing an approach where I extract vocabulary words from videos and newspaper articles and pre-study them before watching/reading—I'm hoping to blog about this at some point.

> Can anyone who learned Chinese comment on the timeframe it takes to even get a basic understanding of what the subtitles say?

3-5 years depending on how much you study.

> any tips on getting started with Chinese very welcome as well

The best tip you'll ever get is to study every day.

For more concrete examples of what/how to study, here is an excellent post that contains a lot of advice for independent Chinese learners:


I can highly recommend All Japanese All The Time http://ajatt.com/

The author Khatzumoto taught himself Japanese in 18 months "by having fun", and then got a job in Japan (at Sony iirc). He then went on to use the same method to learn Chinese.

If you can get through the website's navigation, and scroll through the occasional incoherent rambling, there is solid gold advice throughout :) Good luck!

ps. While the site focuses on Japanese, the advice translates to learning any language. Plus there is advice and resources specifically for Chinese :)

>it takes to even get a basic understanding of what the subtitles say?

The main challenge with Chinese is that - unlike any Western language - you can not guess the characters you do not know. You either know it or you don't. Only the context might offer some clues.

Getting started with Chinese: Depends on your talent. For me (untalented), taking a "real" class was essential for a good start. Self-learning for languages does not work for me.

> you can not guess the characters you do not know

That's not completely true.

Chinese characters are composed of parts, some of which are referred to as radicals. If you know the part/block, you can sometimes guess its meaning or sound. Also, new Chinese words are made of existing characters, so you can often guess multi-character words by knowing the individual characters. For example, the word for computer is "electric brain". You might be able to guess that if you knew that 電 means electric and 腦 means brain.

I recommend the site called http://hackingchinese.com

He has a lot of good tips for how to learn Chinese as a foreign adult

Yeah, I think this is one of those things where if you get yourself into an "Asian languages are hard" mindset, you unnecessarily complicate things. I don't speak Chinese, but I speak Japanese, which I learned primarily from free reading (mostly manga, which is great because it is almost completely conversational Japanese).

While it is true that you can't guess the meaning of characters that you don't know without any context, you definitely can learn the meaning of characters that you don't know with context. In the same way, while I might understand the Greek and Roman roots for words in English, if I see the word by itself without any context, I'm not going to be able to understand it. With context, I can usually puzzle it out.

However, it is a huge mistake to ignore the benefits of learning Chinese characters. It is the reverse that is most beneficial. In Chinese the 3600 most common characters cover some insane percentage of the common words in use (only 2200 for Japanese!). These characters form a powerful mnemonic for learning vocabulary. I can easily learn to read the characters for a word and the word faster than I can learn the word by itself. As you learn more characters, it only accelerates your progress.

2-3 thousand characters seems like a daunting task, but as I always point out in these threads, adult level proficiency in a language requires somewhere around 15,000 word families (1 family = a word, all its inflections/conjugations/etc, and all related compound words). People telling you that you can be "fluent" with 2000 words are selling you something (2000 word families is the level of a toddler -- and then people are disappointed that they can only speak like a toddler...).

The absolute best thing you can do for learning a language that has Chinese characters in its writing is to learn those characters. I recommend learning them at the same time as the vocabulary. And to get back to the original comment: even now if I hear an unfamiliar word, I will usually trace what I think the characters are on my hand. Often people will correct me and once I get the correct characters the meaning is almost always obvious.

> For example, the word for computer is "electric brain". You might be able to guess that if you knew that 電 means electric and 腦 means brain

Or you could be like me, idly sitting in a Malaysian restaurant, reading "馬來" and trying to figure out what "Horses Arriving" has to do with curry.

Haha, well, I feel that one's easy too if you know that sometimes characters are used as phonetics for words in other languages. Just like English! Although you might not expect that at first from a non-phonetically written language.

Every language has exceptions to "rules". In fact, there are few hard and fast rules for any language, since every language is a mix of another.

But, once you get comfortable with a language, you can appreciate the differences and make interesting guesses about the exceptions' origins. It can be fun.

I like http://www.hanzicraft.com/ because it breaks down the characters into parts you can click on to get their definitions or origins. Hacking chinese also has a cool resources section here [1], categorized, so you can browse through dictionaries, listening tools, practice tests, browser add-ons, etc. etc.

[1] http://challenges.hackingchinese.com/resources/Beginner

Hanzicraft looks cool, though I feel like I should learn to speak Mandarin a lot better before worrying about reading it. In fact I'd be fairly happy to be a fluent speaker and illiterate.

If you want to type characters from parts, decomposing and recomposing from radicals, I made


Okay, people replying to you so far seem to understand what's going on. For those that don't get it: Why are these characters used in your particular case?

I only know about Japanese but in the past characters could be chosen based on their pronunciation and not just their meaning e.g. 仏蘭西 (Buddha, Orchid, West, fu ran su, France). This is called ateji:


The ateji for Malaysia is 馬来, Horse, come, ma rai.

Nowadays loanwords are usually written alphabetically instead of ateji. フランス and マレーシア. The old way is still written for abbreviations; 仏 is the equivalent of writing "Fr" in English.

An interesting case is America (亜米利加), where the abbreviation is 米 (pronounced "bei") not 亜 (pronounced "a") because A is for Asia! (亜細亜).

As any one reading HN should know, naming things is hard...

Because 馬 sounds like ma and 來 sounds (a little) like lay (actually, lie).

So, Malay (ie, of Malaysia).

Sorry, should have added an explanation.

The first character of 馬來 is "horse" and the second is "come/arrive". So I was sitting there puzzling it out until I realised that if I pronounced it in Mandarin, it sounded a lot like "Malay" - which is in fact what it means.

or you could just memorize diannao as computer because that is easier.

trying to connect electric brain to computer is similar to trying to connect "breakfast" to the morning meal. it's interesting to know the compound words, but breakfast is always the meal in the morning and not always the meal after a fast.

It's just an example of how you can intuit words you don't recognize. Memorizing all the words you don't know is.. challenging, to say the least

I use an app called Memrise and a website called ChinesePod. There's a really good dictionary app called Pleco.

Going beyond the standard stuff I'd recommend a professional teacher since you really need to learn Chinese from multiple angles: learning to recognize and write the characters, reading pinyin, understanding spoken Mandarin and being able to speak.

Thanks for this! I'm looking forward to part 2 also.

A bit off-topic, but does anyone have Chinese TV show suggestions? I watched a few episodes of this show (他来了) already, but I didn't like it very much.

人民的名义: A new and very popular show about corruption.

爱情公寓: I think it's like Friends in Chinese. But I haven't watched Friends.

欢乐颂: Follows the lives of several young women sharing an apartment in Shanghai.

For the list of the most popular shows check here: https://movie.douban.com/tag/国产电视剧

It's a little dated now, but check out this list:


I'm watching 外科风云 on youtube. It's a hospital drama with some corruption thrown in. It has english subtitles as well.

There's In the Name of the People, which is similar to House of Cards

Hey OP, for a non-technical solution to your problem get in touch with these guys over WeChat:


They sell soft-copy transcriptions for any show you want.

They charge per episode so it can get pricey if you do an entire series but getting a couple of episodes is quite affordable.

I was actually thinking of doing something like this using Amazon Mechanical Turk, maybe not to subtitle the entire show but just to get a much bigger test set than I have the patience to label myself. I'll check them out, thanks!!

No worries. Prices I've seen so far range from RMB 10 to RMB 100 per episode - probably depending on how popular the show is and whether they already have a transcript available because someone else wanted it or whether they'll need to transcribe it for you.

He should have used the fact that there is a black border around the text.
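One way the black border could be exploited, sketched here as a guess and not as the article author's method: keep a bright pixel only if a dark pixel sits within a small radius of it. The thresholds and `radius` are assumptions that would need tuning to the stroke width:

```python
def bordered_text_mask(gray, bright=200, dark=50, radius=1):
    """Keep bright pixels that have a dark pixel within `radius` of them.

    gray: grayscale frame as a list of rows of ints in 0..255.
    bright, dark, radius: assumed cutoffs, not taken from the article.
    """
    height, width = len(gray), len(gray[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if gray[y][x] < bright:
                continue
            # Scan the square neighborhood, clipped to the frame edges.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    if gray[ny][nx] <= dark:
                        mask[y][x] = True
    return mask


# The white pixel at column 2 is ringed by black; the one at column 5 is
# just a bright spot in a gray background.
gray = [
    [120, 0, 0, 0, 120, 120, 120],
    [120, 0, 255, 0, 120, 255, 120],
    [120, 0, 0, 0, 120, 120, 120],
]
mask = bordered_text_mask(gray)
```

With thick strokes, interior pixels are farther from the border than `radius`, so a larger radius or a fill step would be needed.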

I did! That's going to be in Part 2.

16 year Chinese learner here (not that it's relevant). I would try to hack a solution via the following approach.

1. Determine the area (if any) near the bottom with black-or-whiteness zones of a constant height (these are likely to be subtitles) by randomly selecting 10 frames from the middle of the movie. Extra points if you have it detect the sub color.

2. For each frame with unique subs, isolate the zones vertically (handles multi-line subs).

3. Determine black-or-whiteness of each vertical column in the text area. Moving inward from the left or right edge, crop everything until the black-or-whiteness within the constant height drops below a certain threshold. In the example shown, this would deal with ′…′二′′′'′ and would look like 0,0,0.01,0,0,0.01,3

4. Crop viciously within the assumed vertical height. This should remove issues like 逯 which should be 这.

5. There is probably a clustering-based approach you can use to remove background noise, either spatially or temporally, eg. temporally via imagemagick[0]: compare frame1.jpg frame2.jpg -compose src -fuzz 10% -highlight-color white -lowlight-color black output.jpg ... alternatively, if there is a surrounding color such as in the example, you could remove any pixel-groups that don't have it.

6. In terms of detecting frames in which you have a new set of subs, just compare the last black-white-extraction of the central maybe-has-subs area (ie. most commonly used portion thereof) with a delta of the last one, remembering no subs is also an option. In many cases this may align with keyframes.
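Here is one reading of step 3 as a sketch: score each column of the subtitle band by its "black-or-whiteness", then trim low-scoring columns from both edges. The cutoffs `low`, `high`, and `threshold` are placeholders, and the band is a plain nested list of grayscale values:

```python
def column_scores(band, low=30, high=225):
    """Fraction of pixels in each column that are near-black or near-white."""
    height = len(band)
    return [
        sum(1 for row in band if row[x] <= low or row[x] >= high) / height
        for x in range(len(band[0]))
    ]


def crop_columns(scores, threshold=0.5):
    """Return (left, right) with right exclusive, trimming weak edge columns."""
    left, right = 0, len(scores)
    while left < right and scores[left] < threshold:
        left += 1
    while right > left and scores[right - 1] < threshold:
        right -= 1
    return left, right


# A 2-row band: gray noise at the edges, black/white text in the middle.
band = [
    [120, 130, 255, 0, 128, 110],
    [115, 240, 250, 255, 0, 125],
]
scores = column_scores(band)
bounds = crop_columns(scores, threshold=0.75)
```

Only the edges are trimmed, so gaps between characters inside the line are left alone.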

If you like to solve image processing problems we're hiring - http://8-food.com/ - email in profile.

[0] http://www.imagemagick.org/Usage/compare/#difference

I totally thought the state-of-the-art for OCR would be ConvNets, but apparently it isn't? Or are there just not any easily available/usable libraries that do OCR with ConvNets? Or is the benefit of ConvNets marginal enough to not be useful?

Found a paper! Yes, it is better. Maybe not for this specific application and given there's no pre-trained network available, I can totally understand the choice made. https://www.semanticscholar.org/paper/The-recognition-of-Chi...

Here's a free complete OCR solution using LSTM https://github.com/tmbdev/ocropy

Actually very useful even for other things, thanks for sharing! For example ripping DVD subtitles to SRT, or (I'm using my imagination) maybe in the future with content-aware fill removing hard coded subtitles and replacing them with filler space?

That should actually be possible with today's technology. Take an image and draw subtitles on it. This is the input to train the NN, while the original image is the training output. Even better, use the video stream directly... Not easy, but not impossible either.

DVD subtitles are already a separate layer to the movie stream, but it is a bitmap. Because it's a separate layer, OCR-ing should be easy.

And if you ask why it's a bitmap, that's because bitmaps support more than just plain text: color and typefaces to name 2 things. Imagine if DVD players have to implement text decoding ("Is this subtitle stream in UTF-8 or maybe some Cyrillic Code Page?") and rendering (color, placement, font files, etc...)

I know an ocr tool to do that on any mp4 files with vapoursynth: https://bitbucket.org/YuriZero/yolocr/src/989cf68d66cddfcf7b...

I've always thought a great feature of shows/movies with subtitles would be the ability to display multiple at the same time.

It would really be useful for learning assuming the subtitles are translated well. Right now I've been able to do something with my own content by merging two subtitle files.
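A sketch of what merging two subtitle tracks might look like, assuming both files have already been parsed into `(start_seconds, end_seconds, text)` tuples (the parsing itself is left out, and the names are made up for illustration):

```python
def merge_cues(primary, secondary):
    """Append to each primary cue the text of overlapping secondary cues.

    Cues are (start_seconds, end_seconds, text) tuples. The primary track's
    timing is kept; the secondary text goes on the following lines.
    """
    merged = []
    for start, end, text in primary:
        extra = [t for s, e, t in secondary if s < end and e > start]
        merged.append((start, end, "\n".join([text] + extra)))
    return merged


chinese = [(1.0, 3.0, "你好"), (4.0, 6.0, "谢谢")]
english = [(1.1, 2.9, "Hello"), (4.2, 6.1, "Thanks")]
merged = merge_cues(chinese, english)
```

Matching on time overlap instead of cue index tolerates tracks that are split differently, though it can attach one secondary line to two primary cues if the timings straddle a boundary.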

Maybe you could also add a transliteration of the characters if it's a language like Chinese.

I'm also learning Mandarin and was wondering if this was possible (for a different show) just the other week! Thanks for the article, will be looking forward to Part 2 and 3. Also, is there an easy way to extract all the frames with unique subtitles?

Once you have the text corresponding to each frame, you can de-dupe it with its neighbors based on Levenshtein distance (can't use exact-match because of recognition errors). I found that for this show subtitles generally hang on-screen for 1-3 seconds, so you wouldn't have to do many comparisons.
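A small sketch of that de-duplication, with a textbook Levenshtein distance and an assumed `max_distance` of 1 (the actual threshold would depend on line length and OCR error rate):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]


def dedupe(lines, max_distance=1):
    """Collapse runs of near-identical OCR lines, keeping the first of each."""
    kept = []
    for line in lines:
        if kept and levenshtein(kept[-1], line) <= max_distance:
            continue
        kept.append(line)
    return kept


# Per-frame OCR output; the third line has one misrecognized character.
ocr = ["这是什么", "这是什么", "逯是什么", "我不知道", "我不知道"]
unique = dedupe(ocr)
```

Keeping the first line of each run is arbitrary; picking the most frequent reading within the run would probably be more robust to a bad first frame.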

cjy - please can you help me to find more double-subtitles? (Chinese and English, synced)

I have a program to add spaces between Chinese words, colours for the tones, pinyin, and a literal translation.


I already made a feature to list all the unique words in a movie, sort them by their frequency, and make a study sheet. I also made a bash script generator to use ffmpeg to cut the movie to the subtitle time.
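For illustration, a sketch of both pieces under stated assumptions: the subtitle lines are already segmented into space-separated words, cue times are `HH:MM:SS.mmm` strings, and the `clip_NNNN.mp4` naming is made up. This is not the program described above, only the general shape of it. The ffmpeg options used are the standard `-i`, `-ss`, and `-to`:

```python
from collections import Counter
import shlex


def word_frequencies(segmented_lines):
    """Count words in lines that are already split into words by spaces."""
    counts = Counter()
    for line in segmented_lines:
        counts.update(line.split())
    return counts.most_common()  # most frequent first


def ffmpeg_cut_commands(video, cues):
    """Build one ffmpeg command line per (start, end) timestamp pair."""
    commands = []
    for index, (start, end) in enumerate(cues, start=1):
        output = f"clip_{index:04d}.mp4"  # hypothetical naming scheme
        args = ["ffmpeg", "-i", video, "-ss", start, "-to", end, output]
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


lines = ["我 不 知道", "我 知道", "我 来 了"]
frequencies = word_frequencies(lines)
commands = ffmpeg_cut_commands("movie.mp4", [("00:00:01.000", "00:00:03.000")])
```

Quoting each argument with `shlex.quote` keeps the generated script safe for file names with spaces.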

All I need to do now is recombine the subtitles based on the words, to make videos with lots of example sentences.

It's much easier to study with a real English translation though, instead of a literal word-for-word transcription. If you could help me get more input data (names of movies or songs, srt files), that would be wonderful!

It'd be very cool to use this to take a video with hard subs, inpaint the hard subs (even naïvely, or maybe with a motion compensator), and replace them with SRT/ASS subs.

Is the term "hard sub" that well known outside of the fansubbing community and other kinds of video aficionados?
