This would also allow you to extract out the subs without having to OCR them/get characters. Could just erase all static artifacts (including subs but also things like watermarks).
You have lots of information about what image should be there: you have motion vectors signaling when an object or background has moved into (or out of) a blocked area. Or maybe the motion vectors indicate the object/background is stationary, and you can use information from before/after the subtitles to fill it in.
I'm not aware of such a 3d video resynthesis algorithm existing, but it should be possible.
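For the stationary case only, the idea can be sketched in a few lines. This is a toy illustration, not an existing algorithm: it assumes grayscale frames stored as nested lists, a per-frame mask marking which pixels are covered by subtitles, and a background that does not move. The function name `fill_static_region` is made up for this example, and motion compensation is not handled at all.

```python
def fill_static_region(frames, masks):
    """Fill subtitle-covered pixels from the nearest frame in time.

    frames: list of 2D lists of grayscale values, one per frame.
    masks:  same shape, True where a pixel is covered by a subtitle.
    Assumes a stationary background; pixels that are covered in every
    frame keep their original value.
    """
    n = len(frames)
    out = [[row[:] for row in frame] for frame in frames]
    for t in range(n):
        for y, mask_row in enumerate(masks[t]):
            for x, covered in enumerate(mask_row):
                if not covered:
                    continue
                # Search outward in time (t-1, t+1, t-2, t+2, ...)
                # for a frame where this pixel is not covered.
                found = False
                for d in range(1, n):
                    for s in (t - d, t + d):
                        if 0 <= s < n and not masks[s][y][x]:
                            out[t][y][x] = frames[s][y][x]
                            found = True
                            break
                    if found:
                        break
    return out
```

A real version would have to follow the motion vectors instead of assuming the same pixel coordinates in every frame.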
Before that I was using the "Chinese Subtitle Translator" software: https://ocr.space/blog/p/chinese-subtitles-translator.html
(Source code at https://github.com/A9T9/Chinese-Subtitles-Translator )
It uses Microsoft OCR and gets very good results.
I used to learn English this way (watching US TV shows in English with English subs and very limited vocabulary). Eventually disabled the subs. Now watching everything in English. Love it.
edit: any tips on getting started with Chinese very welcome as well (apart from the standard stuff I find myself through Google or language courses)
Teachers can be helpful to point out some mistakes you'll miss through self study but unless you have a fair amount of cash to throw at the problem self study will likely be a better approach. For full fluency you probably want to target around 20k vocab, so it's a bit of a numbers game in terms of finding a quick way to improve your vocabulary. I use Skritter but I guess any SRS software should help a lot here.
Not entirely convinced by immersion as being necessary, I learned plenty of Chinese without it. Probably good for day to day vocab and motivation but unless your level is good you will still struggle to get into conversations where you use it, depending on how outgoing you are.
I've found that SRS apps like Anki or Memrise and podcasts like Popup Chinese or ChinesePod have been really helpful. I'm also developing an approach where I extract vocabulary words from videos and newspaper articles and pre-study them before watching/reading—I'm hoping to blog about this at some point.
3-5 years depending on how much you study.
> any tips on getting started with Chinese very welcome as well
The best tip you'll ever get is to study every day.
For more concrete examples of what/how to study, here is an excellent post that contains a lot of advice for independent Chinese learners:
The author Khatzumoto taught himself Japanese in 18 months "by having fun", and then got a job in Japan (at Sony iirc). He then went on to use the same method to learn Chinese.
If you can get through the website's navigation, and scroll through the occasional incoherent rambling, there is solid gold advice throughout :) Good luck!
ps. While the site focuses on Japanese, the advice translates to learning any language. Plus there is advice and resources specifically for Chinese :)
The main challenge with Chinese is that - unlike any Western language - you can not guess the characters you do not know. You either know it or you don't. Only the context might offer some clues.
Getting started with Chinese: Depends on your talent. For me (untalented), taking a "real" class was essential for a good start. Self-learning for languages does not work for me.
That's not completely true.
Chinese characters are composed of parts, some of which are referred to as radicals. If you know the part/block, you can sometimes guess its meaning or sound. Also, new Chinese words are made of existing characters, so you can often guess multi-character words by knowing the individual characters. For example, the word for computer is "electric brain". You might be able to guess that if you knew that 電 means electric and 腦 means brain.
I recommend the site called http://hackingchinese.com
He has a lot of good tips for how to learn Chinese as a foreign adult
While it is true that you can't guess the meaning of characters that you don't know without any context, you definitely can learn the meaning of characters that you don't know with context. In the same way, while I might understand the Greek and Roman roots for words in English, if I see the word by itself without any context, I'm not going to be able to understand it. With context, I can usually puzzle it out.
However, it is a huge mistake to ignore the benefits of learning Chinese characters. It is the reverse that is most beneficial. In Chinese the 3600 most common characters cover some insane percentage of the common words in use (only 2200 for Japanese!). These characters form a powerful mnemonic for learning vocabulary. I can easily learn to read the characters for a word and the word faster than I can learn the word by itself. As you learn more characters, it only accelerates your progress.
2-3 thousand characters seems like a daunting task, but as I always point out in these threads, adult level proficiency in a language requires somewhere around 15,000 word families (1 family = a word, all its inflections/conjugations/etc, and all related compound words). People telling you that you can be "fluent" with 2000 words are selling you something (2000 word families is the level of a toddler -- and then people are disappointed that they can only speak like a toddler...).
The absolute best thing you can do for learning a language that has Chinese characters in its writing is to learn those characters. I recommend learning them at the same time as the vocabulary. And to get back to the original comment: even now if I hear an unfamiliar word, I will usually trace what I think the characters are on my hand. Often people will correct me and once I get the correct characters the meaning is almost always obvious.
Or you could be like me, idly sitting in a Malaysian restaurant, reading "馬來" and trying to figure out what "Horses Arriving" has to do with curry.
Every language has exceptions to "rules". In fact, there are few hard and fast rules for any language, since every language is a mix of another.
But, once you get comfortable with a language, you can appreciate the differences and make interesting guesses about the exceptions' origins. It can be fun.
I like http://www.hanzicraft.com/ because it breaks down the characters into parts you can click on to get their definitions or origins. Hacking Chinese also has a cool resources section here, categorized, so you can browse through dictionaries, listening tools, practice tests, browser add-ons, etc.
The ateji for Malaysia is 馬来, Horse, come, ma rai.
Nowadays loanwords are usually written alphabetically instead of ateji. フランス and マレーシア. The old way is still written for abbreviations; 仏 is the equivalent of writing "Fr" in English.
An interesting case is America (亜米利加), where the abbreviation is 米 (pronounced "bei"), not 亜 (pronounced "a"), because A is for Asia! (亜細亜).
As anyone reading HN should know, naming things is hard...
So, Malay (ie, of Malaysia).
The first character of 馬來 is "horse" and the second is "come/arrive". So I was sitting there puzzling it out until I realised that if I pronounced it in Mandarin, it sounded a lot like "Malay" - which is in fact what it means.
Trying to connect "electric brain" to computer is similar to trying to connect "breakfast" to the morning meal. It's interesting to know the compound words, but breakfast is always the meal in the morning and not always the meal after a fast.
Going beyond the standard stuff I'd recommend a professional teacher since you really need to learn Chinese from multiple angles: learning to recognize and write the characters, reading pinyin, understanding spoken Mandarin and being able to speak.
A bit off-topic, but does anyone have Chinese TV show suggestions? I watched a few episodes of this show (他来了) already, but I didn't like it very much.
I think it's like Friends in Chinese. But I haven't watched Friends.
Follows the lives of several young women sharing an apartment in Shanghai.
For the list of the most popular shows check here: https://movie.douban.com/tag/国产电视剧
They sell soft-copy transcriptions for any show you want.
They charge per episode so it can get pricey if you do an entire series but getting a couple of episodes is quite affordable.
1. Determine the area (if any) near the bottom with black-or-whiteness zones of a constant height (these are likely to be subtitles) by randomly selecting 10 frames from the middle of the movie. Extra points if you have it detect the sub color.
2. For each frame with unique subs, isolate the zones vertically (handles multi-line subs).
3. Determine black-or-whiteness of each vertical column in the text area. Moving inward from the left or right edge, crop everything until the black-or-whiteness within the constant height drops below a certain threshold. In the example shown, this would deal with ′…′二′′′'′ and would look like 0,0,0.01,0,0,0.01,3
4. Crop viciously within the assumed vertical height. This should remove issues like 逯 which should be 这.
5. There is probably a clustering-based approach you can use to remove background noise, either spatially or temporally, eg. temporally via imagemagick: compare frame1.jpg frame2.jpg -compose src -fuzz 10% -highlight-color white -lowlight-color black output.jpg ... alternatively, if there is a surrounding color such as in the example, you could remove any pixel-groups that don't have it.
6. In terms of detecting frames in which you have a new set of subs, just compare the last black-white-extraction of the central maybe-has-subs area (ie. most commonly used portion thereof) with a delta of the last one, remembering no subs is also an option. In many cases this may align with keyframes.
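Steps 3 and 6 above can be sketched roughly as follows. This is one reading of the description, not the poster's actual code: it assumes the candidate subtitle strip has already been cut out and thresholded into a 2D list of 0/1 values (1 = subtitle-colored pixel), and the function names and threshold values are invented for illustration.

```python
def column_density(binary):
    """Fraction of subtitle-colored pixels in each vertical column."""
    height = len(binary)
    width = len(binary[0])
    return [sum(binary[y][x] for y in range(height)) / height
            for x in range(width)]


def trim_columns(binary, threshold=0.05):
    """Step 3: drop low-density columns from the left and right edges.

    Columns whose density stays below `threshold` are treated as noise
    (stray marks near the edges) and cropped away.
    """
    dens = column_density(binary)
    left = 0
    while left < len(dens) and dens[left] < threshold:
        left += 1
    right = len(dens)
    while right > left and dens[right - 1] < threshold:
        right -= 1
    return [row[left:right] for row in binary]


def changed_fraction(a, b):
    """Fraction of pixels that differ between two same-sized strips."""
    total = len(a) * len(a[0])
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def is_new_subtitle(prev, curr, delta=0.02):
    """Step 6: report a new subtitle when enough pixels changed.

    Compare the untrimmed strips so both have the same size; an
    all-zero strip simply means "no subtitle on screen".
    """
    return changed_fraction(prev, curr) > delta
```

The thresholds would need tuning per video, since compression noise alone can flip a few pixels between consecutive frames.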
If you like to solve image processing problems we're hiring - http://8-food.com/ - email in profile.
And if you ask why it's a bitmap, that's because bitmaps support more than just plain text: color and typefaces, to name two things. Imagine if DVD players had to implement text decoding ("Is this subtitle stream in UTF-8 or maybe some Cyrillic Code Page?") and rendering (color, placement, font files, etc...)
It would really be useful for learning assuming the subtitles are translated well. Right now I've been able to do something with my own content by merging two subtitle files.
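Merging two subtitle files could look something like the sketch below. It assumes both files are in SRT format with standard `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing lines, and it keeps the timing of the first file while appending the text of any cue from the second file that overlaps in time. The function names are made up for this example.

```python
import re

TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


def to_ms(stamp):
    """Convert an SRT timestamp such as 00:01:02,500 to milliseconds."""
    h, m, s, ms = (int(g) for g in TIME.match(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_srt(text):
    """Return a list of (start_ms, end_ms, timing_line, cue_text)."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip anything that is not a well-formed cue
        start, end = lines[1].split("-->")
        cues.append((to_ms(start), to_ms(end), lines[1].strip(),
                     "\n".join(lines[2:])))
    return cues


def merge_srt(primary, secondary):
    """Append overlapping secondary cue text under each primary cue."""
    other = parse_srt(secondary)
    blocks = []
    for i, (start, end, timing, text) in enumerate(parse_srt(primary), 1):
        extra = [t for s, e, _, t in other if s < end and e > start]
        blocks.append(f"{i}\n{timing}\n" + "\n".join([text] + extra))
    return "\n\n".join(blocks) + "\n"
```

If the two files were timed against different cuts of the video, the overlap test will pair the wrong lines, so the timings need to be aligned first.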
Maybe you could also add a transliteration of the characters if it's a language like Chinese.
I have a program to add spaces between Chinese words, colours for the tones, pinyin, and a literal translation.
I already made a feature to list all the unique words in a movie, sort them by their frequency, and make a study sheet. I also made a bash script generator to use ffmpeg to cut the movie to the subtitle time.
All I need to do now is recombine the subtitles based on the words, to make videos with lots of example sentences.
It's much easier to study with a real English translation though, instead of a literal word-for-word transcription. If you could help me get more input data (names of movies or songs, srt files), that would be wonderful!
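For anyone curious what the word list and the ffmpeg script generator might look like, here is a rough sketch. It is not the program described above: it assumes the subtitle lines have already been split into space-separated words, that cue times are SRT-style strings, and the function names and output file naming are invented for this example.

```python
import shlex
from collections import Counter


def word_frequencies(segmented_lines):
    """Count words in lines that are already space-segmented.

    Returns (word, count) pairs, most frequent first.
    """
    counts = Counter()
    for line in segmented_lines:
        counts.update(line.split())
    return counts.most_common()


def ffmpeg_cut_commands(cues, movie, prefix="clip"):
    """Build one ffmpeg command per (start, end) subtitle cue.

    Times are SRT-style strings (00:00:01,000); ffmpeg expects a dot
    before the milliseconds, so the comma is replaced.
    """
    commands = []
    for n, (start, end) in enumerate(cues, 1):
        out = f"{prefix}_{n:04d}.mp4"
        commands.append(
            f"ffmpeg -i {shlex.quote(movie)} "
            f"-ss {start.replace(',', '.')} -to {end.replace(',', '.')} "
            f"-c copy {shlex.quote(out)}"
        )
    return commands
```

Writing the returned commands to a file, one per line, gives a shell script that cuts the clips. Note that `-c copy` avoids re-encoding, so cuts land on keyframes and may not match the subtitle times exactly; dropping it makes the cuts precise at the cost of re-encoding.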