
Extracting Chinese Hard Subs from a Video, Part 1 - KerrickStaley
http://www.kerrickstaley.com/2017/05/29/extracting-chinese-subs-part-1
======
geza
I actually solved this problem back in 2013 with a slightly more advanced
technique (taking into account other signals such as motion), see
[http://up.csail.mit.edu/other-
pubs/chi2014-smartsubs.pdf](http://up.csail.mit.edu/other-
pubs/chi2014-smartsubs.pdf) and [http://up.csail.mit.edu/other-pubs/gkovacs-
meng-thesis.pdf](http://up.csail.mit.edu/other-pubs/gkovacs-meng-thesis.pdf)
for the algorithm and [https://github.com/gkovacs/extract-
subtitle](https://github.com/gkovacs/extract-subtitle) for the implementation
(in Python).

------
mintplant
In my experience Tesseract improves massively if you can identify the font the
text is written in and prepare a custom trained dataset for it to use. See
[https://github.com/tesseract-
ocr/tesseract/wiki/TrainingTess...](https://github.com/tesseract-
ocr/tesseract/wiki/TrainingTesseract)

~~~
contingencies
All of the failures were directly related to improperly isolated input. In
addition, a huge percentage of Chinese text is written in very few fonts.

------
mmanfrin
You could isolate the text by compositing neighboring frames and keeping the pixels
that don't differ. The background is moving; the text is not.

This would also allow you to extract out the subs without having to OCR
them/get characters. Could just erase all static artifacts (including subs but
also things like watermarks).
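A minimal NumPy sketch of that idea (synthetic frames stand in for real video; the tolerance value is a guess and would need tuning against real compression noise):

```python
import numpy as np

def static_mask(frames, tol=8):
    """Return a boolean mask of pixels that stay (nearly) constant across a
    stack of frames. On a moving background, these are likely subtitles,
    watermarks, and other static overlays."""
    stack = np.stack([f.astype(np.int16) for f in frames])
    # A pixel is "static" if its value never strays far from the first frame.
    spread = np.abs(stack - stack[0]).max(axis=0)
    return spread <= tol

# Toy demo: 3 grayscale "frames" of random noise with a fixed bright bar.
rng = np.random.default_rng(0)
frames = [rng.integers(0, 100, (40, 64), dtype=np.uint8) for _ in range(3)]
for f in frames:
    f[30:34, 10:54] = 255  # the "subtitle" row stays put in every frame
mask = static_mask(frames)
```

In practice you would AND the masks from many frame pairs, since parts of a real background can also coincidentally stay still for a few frames.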

~~~
KerrickStaley
An approach I really want to try is taking a stream of the video without subs
(can easily be found online) and subtracting the two. You'd have to deal with
differences in resolution and compression between the two, and also handle
cases where the background is either white or black, but in theory it should
work very well. I haven't had time to dig into this.
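As a toy sketch of that subtraction (assuming the two streams are already time-aligned, which is the hard part; the nearest-neighbour resize is a crude stand-in for proper rescaling, and the threshold is a guess):

```python
import numpy as np

def sub_pixels(subbed, clean, thresh=40):
    """Flag pixels that differ strongly between a hard-subbed frame and the
    matching frame from a clean stream. Resolution differences are handled
    with a crude nearest-neighbour resize of the clean frame."""
    if clean.shape != subbed.shape:
        ys = np.linspace(0, clean.shape[0] - 1, subbed.shape[0]).round().astype(int)
        xs = np.linspace(0, clean.shape[1] - 1, subbed.shape[1]).round().astype(int)
        clean = clean[np.ix_(ys, xs)]
    diff = np.abs(subbed.astype(np.int16) - clean.astype(np.int16))
    return diff > thresh

# Toy demo: the clean frame is half resolution; the subbed frame adds a bright bar.
clean = np.full((36, 64), 60, dtype=np.uint8)
subbed = np.full((72, 128), 60, dtype=np.uint8)
subbed[60:66, 20:100] = 250  # hard subtitle
mask = sub_pixels(subbed, clean)
```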

~~~
gbaygon
Legitimate question: why would you want the first video (hardcoded subs) if
you have a second stream without them, better resolution maybe?

~~~
kpozin
In order to have access to vocabulary words. From the article: > I wanted to
get a transcript of the episode’s dialog so I could study the unfamiliar
vocabulary. Unfortunately, the video files I have only have hard subtitles

~~~
gbaygon
Makes sense, thanks. I went straight to the technical details and missed that
part.

------
wolfgang42
One possible extension of this is to _erase_ the original subtitles using
something like gimp-resynthesizer's Heal Selection, and then replace them with
translated ones, all automatically. (I've redrawn the video frame from the
post[1] so you can see what I mean.)

[1]: [https://static.linestarve.com/ext/ycombinator-
news/itm144408...](https://static.linestarve.com/ext/ycombinator-
news/itm14440849/car-scene-en.png)

~~~
phire
Using a 2D image resynthesis algorithm is sub-optimal for video. There is
basically no chance of it picking the same result every single frame; you will
see a flickering shadow of weirdness where the old characters were.

You have lots of information about what image should be there: motion vectors
signal when an object or background has moved into (or out of) a blocked area.
Or maybe the motion vectors indicate the object/background is stationary, and
you can use information from before/after the subtitles to fill it in.

I'm not aware of such a 3D video resynthesis algorithm existing, but it should
be possible.
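A crude sketch of the stationary case (the median-fill strategy and names here are my own illustration, not an established algorithm): for each masked pixel, borrow the temporal median of that pixel from frames where it isn't covered.

```python
import numpy as np

def temporal_fill(frames, masks):
    """Naive temporal inpainting: for each masked pixel, take the median of
    that pixel's values in frames where it is NOT masked. Only sensible when
    the occluded background is roughly stationary; moving backgrounds would
    need motion compensation on top of this."""
    stack = np.stack(frames).astype(float)
    mstack = np.stack(masks)
    stack[mstack] = np.nan  # ignore subtitle pixels when taking the median
    fill = np.nanmedian(stack, axis=0)
    out = []
    for f, m in zip(frames, masks):
        g = f.astype(float)
        g[m] = fill[m]  # replace only the masked region
        out.append(g.astype(np.uint8))
    return out

# Toy demo: a flat background, with a subtitle burned into the middle frame only.
bg = np.full((20, 40), 100, dtype=np.uint8)
frames = [bg.copy() for _ in range(3)]
masks = [np.zeros((20, 40), bool) for _ in range(3)]
frames[1][15:18, 5:35] = 255
masks[1][15:18, 5:35] = True
restored = temporal_fill(frames, masks)
```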

~~~
sorenjan
It's called video inpainting. I'm not familiar with it, but here's one paper
with video examples I found: [http://perso.telecom-
paristech.fr/~gousseau/video_inpainting...](http://perso.telecom-
paristech.fr/~gousseau/video_inpainting/)

------
RandomBookmarks
Lately I am using the Copyfish Chrome extension for help with Chinese
subtitles/images. The _very_ nice thing about it is that it plays nice with
the Zhongwen dictionary, which is another essential Chrome extension for
Chinese learners.

[https://chrome.google.com/webstore/detail/copyfish-%F0%9F%90...](https://chrome.google.com/webstore/detail/copyfish-%F0%9F%90%9F-free-
ocr-soft/eenjdnjldapjajjofmldgmkjaienebbj?hl=en)

Before that I was using the "Chinese Subtitle Translator" software:
[https://ocr.space/blog/p/chinese-subtitles-
translator.html](https://ocr.space/blog/p/chinese-subtitles-translator.html)
(Source code at [https://github.com/A9T9/Chinese-Subtitles-
Translator](https://github.com/A9T9/Chinese-Subtitles-Translator) )

It uses Microsoft OCR and gets very good results.

~~~
arnioxux
[http://projectnaptha.com/](http://projectnaptha.com/) is also a very similar
project. I think it uses Ocrad.

------
amelius
I was hoping for a tool that _erases_ Chinese hard subs from the video.

------
philfrasty
Can anyone who learned Chinese comment on the timeframe it takes to even get a
basic understanding of what the subtitles say?

I used to learn English this way (watching US TV shows in English with English
subs and very limited vocabulary). Eventually disabled the subs. Now watching
everything in English. Love it.

edit: any tips on getting started with Chinese are very welcome as well (apart
from the standard stuff I can find myself through Google or language courses)

~~~
RandomBookmarks
>it takes to even get a basic understanding of what the subtitles say?

The main challenge with Chinese is that - unlike any Western language - you
can not guess the characters you do not know. You either know it or you don't.
Only the context might offer some clues.

Getting started with Chinese: Depends on your talent. For me (untalented)
taken a "real" class was essential for a good start. Self-learning for
languages does not work for me.

~~~
unityByFreedom
> you can not guess the characters you do not know

That's not completely true.

Chinese characters are composed of parts, some of which are referred to as
radicals. If you know a part, you can sometimes guess the character's meaning
or sound. Also, new Chinese words are made of existing characters, so you can
often guess multi-character words by knowing the individual characters. For
example, the word for computer is "electric brain". You might be able to guess
that if you knew that 電 means electric and 腦 means brain.

I recommend the site called
[http://hackingchinese.com](http://hackingchinese.com)

He has a lot of good tips for how to learn Chinese as a foreign adult.

~~~
lacampbell
_For example, the word for computer is "electric brain". You might be able to
guess that if you knew that 電 means electric and 腦 means brain_

Or you could be like me, idly sitting in a Malaysian restaurant, reading "馬來"
and trying to figure out what "Horses Arriving" has to do with curry.

~~~
darklajid
Okay, people replying to you so far seem to understand what's going on. For
those that don't get it: Why are these characters used in your particular
case?

~~~
rangibaby
I only know about Japanese, but in the past characters could be chosen based
on their pronunciation and not just their meaning, e.g. 仏蘭西 (Buddha, orchid,
west; fu-ran-su; France). This is called ateji:

[https://en.wikipedia.org/wiki/Ateji](https://en.wikipedia.org/wiki/Ateji)

The ateji for Malaysia is 馬来 (horse, come; ma-rai).

Nowadays loanwords are usually written phonetically in katakana instead of
ateji: フランス and マレーシア. The old way is still used for abbreviations; 仏 is the
equivalent of writing "Fr" in English.

An interesting case is America (亜米利加), where the abbreviation is 米 (pronounced
"bei"), not 亜 (pronounced "a"), because A is for Asia! (亜細亜)

As any one reading HN should know, naming things is hard...

------
pixelperfect
Thanks for this! I'm looking forward to part 2 also.

A bit off-topic, but does anyone have Chinese TV show suggestions? I watched a
few episodes of this show (他来了) already, but I didn't like it very much.

~~~
gpetukhov
人民的名义 ("In the Name of the People"): a new and very popular show about
corruption.

爱情公寓 ("iPartment"): I think it's like Friends in Chinese. But I haven't
watched Friends.

欢乐颂 ("Ode to Joy"): follows the lives of several young women sharing an
apartment in Shanghai.

For the list of the most popular shows check here:
[https://movie.douban.com/tag/国产电视剧](https://movie.douban.com/tag/国产电视剧)

------
imron
Hey OP, for a non-technical solution to your problem get in touch with these
guys over WeChat:

[http://www.bijianshang.com/page/contact/contact.php](http://www.bijianshang.com/page/contact/contact.php)

They sell soft-copy transcriptions for any show you want.

They charge per episode so it can get pricey if you do an entire series but
getting a couple of episodes is quite affordable.

~~~
KerrickStaley
I was actually thinking of doing something like this using Amazon Mechanical
Turk, maybe not to subtitle the entire show but just to get a much bigger test
set than I have the patience to label myself. I'll check them out, thanks!!

~~~
imron
No worries. Prices I've seen so far range from RMB 10 to RMB 100 per episode -
probably depending on how popular the show is and whether they already have a
transcript available because someone else wanted it or whether they'll need to
transcribe it for you.

------
callesgg
He should have used the fact that there is a black border around the text.

~~~
KerrickStaley
I did! That's going to be in Part 2.

------
contingencies
16 year Chinese learner here (not that it's relevant). I would try to hack a
solution via the following approach.

1. Determine the area (if any) near the bottom with black-or-whiteness zones
of a constant height (these are likely to be subtitles) by randomly selecting
10 frames from the middle of the movie. Extra points if you have it detect the
sub color.

2. For each frame with unique subs, isolate the zones vertically (this handles
multi-line subs).

3. Determine the black-or-whiteness of each vertical column in the text area.
Moving inward from the left or right edge, crop everything until the black-or-
whiteness within the constant height drops below a certain threshold. In the
example shown, this would deal with the stray marks ′…′二′′′'′ and would look
like 0,0,0.01,0,0,0.01,3.

4. Crop viciously within the assumed vertical height. This should remove
issues like 逯 which should be 这.

5. There is probably a clustering-based approach you can use to remove
background noise, either spatially or temporally, e.g. temporally via
ImageMagick[0]: compare frame1.jpg frame2.jpg -compose src -fuzz 10%
-highlight-color white -lowlight-color black output.jpg ... Alternatively, if
there is a surrounding color such as in the example, you could remove any
pixel groups that don't have it.

6. To detect frames with a new set of subs, just compare the last black-white
extraction of the central maybe-has-subs area (i.e. the most commonly used
portion thereof) with a delta of the last one, remembering that no subs is
also an option. In many cases this may align with keyframes.
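Step 3 might be sketched like this in Python (thresholds here are illustrative guesses, not values from the comment; white-on-dark subtitles are assumed):

```python
import numpy as np

def crop_sub_columns(region, white_thresh=200, col_frac=0.05):
    """'region' is the assumed subtitle strip (grayscale, white text).
    Compute per-column whiteness, then keep only the span of columns whose
    whiteness exceeds a threshold fraction of the strip height, discarding
    stray marks near the edges."""
    whiteness = (region >= white_thresh).mean(axis=0)  # fraction of white pixels per column
    keep = np.nonzero(whiteness >= col_frac)[0]
    if keep.size == 0:
        return region[:, :0]  # no subtitle-like columns at all
    return region[:, keep[0]:keep[-1] + 1]

# Toy demo: dark strip with "glyph" pixels in columns 12..47.
strip = np.full((10, 60), 20, dtype=np.uint8)
strip[2:8, 12:48] = 255
cropped = crop_sub_columns(strip)
```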

If you like to solve image processing problems we're hiring -
[http://8-food.com/](http://8-food.com/) \- email in profile.

[0]
[http://www.imagemagick.org/Usage/compare/#difference](http://www.imagemagick.org/Usage/compare/#difference)

------
milankragujevic
Actually very useful for other things too, thanks for sharing! For example,
ripping DVD subtitles to SRT, or (I'm using my imagination) maybe in the
future removing hard-coded subtitles with content-aware fill and replacing
them with filler space?

~~~
Drdrdrq
That should actually be possible with today's technology. Take an image and
draw subtitles on it: that's the training input for a NN, while the original
image is the training output. Even better, use the video stream directly...
Not easy, but not impossible either.

------
Seanny123
I totally thought the state of the art for OCR would be ConvNets, but
apparently it isn't? Or are there just no easily available/usable libraries
that do OCR with ConvNets? Or is the benefit of ConvNets marginal enough to
not be useful?

~~~
Seanny123
Found a paper! Yes, it is better. But maybe not for this specific application,
and given there's no pre-trained network available, I can totally understand the
choice made. [https://www.semanticscholar.org/paper/The-recognition-of-
Chi...](https://www.semanticscholar.org/paper/The-recognition-of-Chinese-
caption-text-in-news-vi-Zhong-Shi/bc3d70e410c36814b286f0db0cf429586bd91440)

------
manav
I've always thought a great feature for shows/movies with subtitles would be
the ability to display multiple at the same time.

It would really be useful for learning, assuming the subtitles are translated
well. Right now I've been able to do something similar with my own content by
merging two subtitle files.

Maybe you could also add a transliteration of the characters if it's a
language like Chinese.
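A minimal stdlib-only sketch of that merge (assuming SRT files whose timestamps line up exactly, which real tracks rarely do; a robust tool would match overlapping time ranges instead):

```python
import re

# One SRT block: index line, timestamp line, then text until a blank line.
SRT_BLOCK = re.compile(
    r"\d+\s*\n(\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d)\s*\n(.*?)(?:\n\n|\Z)",
    re.S,
)

def merge_srt(a_text, b_text):
    """Stack the second track's text under the first whenever the timestamp
    lines match exactly, so both languages display at once."""
    b_by_time = {t: txt for t, txt in SRT_BLOCK.findall(b_text)}
    out = []
    for i, (t, txt) in enumerate(SRT_BLOCK.findall(a_text), 1):
        extra = b_by_time.get(t)
        body = txt if extra is None else txt + "\n" + extra
        out.append(f"{i}\n{t}\n{body}")
    return "\n\n".join(out) + "\n"

a = "1\n00:00:01,000 --> 00:00:03,000\n你好\n\n"
b = "1\n00:00:01,000 --> 00:00:03,000\nHello\n\n"
merged = merge_srt(a, b)
```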

------
kcchouette
I know an OCR tool that can do that on any mp4 file with VapourSynth:
[https://bitbucket.org/YuriZero/yolocr/src/989cf68d66cddfcf7b...](https://bitbucket.org/YuriZero/yolocr/src/989cf68d66cddfcf7b8b6b211b3d100f45e38c94/README_EN.md?at=master&fileviewer=file-
view-default)

------
cjy
I'm also learning Mandarin and was wondering if this was possible (for a
different show) just the other week! Thanks for the article, will be looking
forward to Part 2 and 3. Also, is there an easy way to extract all the frames
with unique subtitles?

~~~
KerrickStaley
Once you have the text corresponding to each frame, you can de-dupe it with
its neighbors based on Levenshtein distance (can't use exact-match because of
recognition errors). I found that for this show subtitles generally hang on-
screen for 1-3 seconds, so you wouldn't have to do many comparisons.
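That de-dupe pass might be sketched as follows (the distance threshold is a guess, not a value from the comment):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def dedupe(lines, max_dist=2):
    """Collapse consecutive OCR results within max_dist edits of the last
    kept line; exact matching would miss near-duplicates caused by
    recognition errors."""
    out = []
    for line in lines:
        if not out or levenshtein(line, out[-1]) > max_dist:
            out.append(line)
    return out
```

For example, `dedupe(["你好吗", "你好鸟", "再见"])` keeps only the first of the two near-identical readings.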

------
microcolonel
It'd be very cool to use this to take a video with hard subs, inpaint the hard
subs (even naïvely, or maybe with a motion compensator), and replace them with
SRT/ASS subs.

------
stcredzero
Is the term "hard sub" well known outside of the fansubbing community and
other kinds of video aficionados?

