
How to understand half of Harry Potter book in any language (+ source code) - legierski
http://blog.self.li/post/20854405575/how-to-understand-harry-potter-any-language
======
pooriaazimi
> _... as I’ve already read all 7 books in 2 languages before - I could pick
> up a lot just from context..._

That's what I've always said to my friends, but they insist on learning
_grammar_, which of course they'll forget after a few hours/days. Three years
ago I could hardly read anything non-technical in English, now I can
understand _Lolita_ - which is a hard novel to read for a non-native. All
thanks to audiobooks.

I found that audiobooks are the most fantastic way of learning a new
language... I couldn't have possibly read _Lolita_ or _Silmarillion_ before -
They're just too hard for someone who is trying to learn a new language. Long
sentences full of new/invented words - It's easy to lose the thread. But
listening to a skillful reader reading them aloud, you can understand even the
most complicated words and sentences just from the context, and the reader's
tone and emphasis...

If you want to learn a new language, do yourself a favor and listen to some
audiobooks. Pick a book you've already read in your native language
(preferably more than once) and you'll be amazed how easy it is to understand
and learn new words (you must have a basic understanding of that language of
course).

~~~
krelian
Audiobooks are great for learning the spoken part of a language but I would
guess that most of us "technical people" need and want to learn things from
the bottom up. That's the only way we feel that we really _know_ something.

It's true that grammar can be very complicated and will be easily forgotten
without practice. A good practice is reading a book and trying to see not only
if you can understand what is happening but also analyze each sentence and see
if you can understand which grammar rules are used in its construction and
why.

~~~
pooriaazimi
That's right. But you don't necessarily need to _know_ what a word means (its
exact meaning) in order to understand a sentence. You only have to _feel_ what
it means. For example, for a long time I didn't bother to look up the exact
meaning of words like 'ruminate', 'agitated', 'incredulity', 'ilk',
'perpetual', 'stagnant' and dozens of other words. I could _feel_ what they
meant, and that was enough. Just like when a 4-year old knows what 'tomorrow'
means without understanding the complexities of life and time. He doesn't need
to know about Earth orbiting around the Sun; tomorrow means 'sleep once, and
when you wake up, that's tomorrow'. I think you can learn a new language just
like children learn their mother tongue.

------
hkolek
I like the approach but I think it's a big mistake to not strip stop words. He
should focus on nouns and verbs imo. The top words he lists are all
stopwords/grammatical particles "de", "que", "la", "y" etc. I don't think
knowing those words will help to understand anything. I think if you
understand only the grammatical particles in a sentence it won't help at all
to understand the meaning of the sentence. On the other hand if you know the
verbs and nouns but not the grammatical particles you can at least infer some
meaning or what it's about.
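
A minimal sketch of what stripping stop words before counting might look like. The stop-word list and the sample sentence are made up for illustration, and this is not the article's actual script:

```python
import re
from collections import Counter

# Hypothetical, hand-picked stop-word list; a real one would be much longer.
SPANISH_STOP_WORDS = {"de", "que", "la", "y", "el", "en", "a", "los", "se", "no", "un", "una"}

def top_words(text, n=5, stop_words=frozenset()):
    """Return the n most frequent words in text, skipping any stop words."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

# Made-up sample text, not taken from the book.
sample = "Harry y la varita de Harry. La lechuza que vio Harry no vio la varita."
print(top_words(sample, 3))                      # particles compete with content words
print(top_words(sample, 3, SPANISH_STOP_WORDS))  # only content words remain
```

With the filter on, the top of the list is made of nouns and verbs rather than "la" and friends.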

~~~
aohtsab
When I was starting out in German, I started listening to the audiobooks and
reading the German text. Knowing the small words (prepositions and other
stopwords) isn't of much use on its own, but having a general feel for the
plot and knowing the context of what 'should be happening' in the book helped
give me a much broader understanding (and richer vocabulary!) than merely
ploughing through the German 101 textbook.

tl;dr: Applaud the idea, but it's misguided. You need to focus on
understanding words in greater context to derive any meaning.

~~~
vidarh
My final year I forced my way through Faust by using an old Danish translation
side by side with the German version to bring my grade up.

I'm Norwegian, so the old Danish translation gave me a "halfway house" -
modern Danish is very understandable for a Norwegian; older Danish gets quite
a bit closer to German. I still missed a lot of what was going on, but it
brought me up a full grade in a semester.

Of course I'd probably have done just as well with less effort had I chosen
something easier - I was being pretentious about it.

------
acslater00
TL;DR Nearly half of the word occurrences in Harry Potter are prepositions, so
if you learn a small number of them you can claim that you "understand half of
Harry Potter". For example, you can absorb sparkling dialogue like the
following:

"Harry and to at to I to with Voldemort or what to and I do for, Hermione!!"

~~~
frooxie
David Moser, in his text "Why Chinese Is So Damn Hard", points out that

> Even though you may know 95% of the [words] in a given text, the remaining 5%
> are often the very [words] that are crucial for understanding the main point
> of the text. A non-native speaker of English reading an article with the
> headline "JACUZZIS FOUND EFFECTIVE IN TREATING PHLEBITIS" is not going to get
> very far if they don't know the words "jacuzzi" or "phlebitis".

~~~
wisty
If you understand 95% of words, you can start learning with authentic material
and a dictionary.

Of course, Chinese is a little hard (stroke order is harder than alphabetical,
and you can't guess the meaning as easily as you can with a related language
like German), but ... it's still a good way to go.

Of course, almost every language teacher on the planet will recommend you
start with common words and patterns (grammar, phrases, etc), then learn less
common stuff. It's not a revelation.

~~~
frooxie
My point was that understanding a certain percentage of the words is very
different from understanding the same percentage of the sentences in a book.
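
A rough sketch of the difference, assuming a sentence only counts as "understood" when every word in it is known (a deliberately strict, made-up criterion). The text and the vocabulary are toy data:

```python
import re

def coverage(text, known):
    """Return (share of word occurrences known, share of sentences fully known)."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    tokenized = [re.findall(r"[^\W\d_]+", s) for s in sentences]
    words = [w for sent in tokenized for w in sent]
    word_share = sum(w in known for w in words) / len(words)
    # Strict assumption: one unknown word makes the whole sentence unknown.
    sentence_share = sum(all(w in known for w in sent) for sent in tokenized) / len(tokenized)
    return word_share, sentence_share

# Toy data: every word is known except "phlebitis".
text = "The cat sat on the mat. The dog ate the phlebitis report. The cat saw the dog."
known = {"the", "cat", "sat", "on", "mat", "dog", "ate", "saw", "report"}
word_share, sentence_share = coverage(text, known)
print(f"words: {word_share:.0%}, sentences: {sentence_share:.0%}")
```

Here 16 of the 17 word occurrences are known, but only 2 of the 3 sentences are fully covered.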

------
mseebach
So, the tl;dr is that this guy discovered that using a dictionary is a good
way of learning words in a different language.

In more detail, there's an assertion and a proposed solution - and nothing to
even remotely back up the assertion? Show me a page of Harry Potter in Spanish
translated in this manner - I somewhat doubt it will make much sense.

~~~
ajuc
That method of learning a foreign language was advised by David Snopek, an
American who learned Polish that way in 1 year (and speaks it fluently - this
is very hard for non-native Polish speakers; most people have problems
producing correct sentences at all after one year).

His post about this:
<http://www.linguatrek.com/blog/2010/12/harry-potter-the-book-that-taught-me-polish>

You can guess a surprisingly big part of a book based just on the context and
a few words you already know. That's how kids learn their native language.
It's a good method, it worked for many people (including me:)) and I don't
understand why you are arguing it isn't.

~~~
mseebach
I'm not arguing it's a bad way to learn a language, although I can't see how
it will help you pronounce, but that's a different discussion. Quite the
opposite, I'm in deep agreement with the article's assertion that focussing on
grammar is counter-productive. 20 common nouns, 20 common verbs, "please" and
"thank you", and you're actually able to have a basic conversation.

I'm saying that this guy starts off OK, then goes on to devise some code to do
that, then declares success apparently without running the code and reasoning
about the output. He says that 50 words are 50% - but his script stops at 20?
And of those, he lists ten, which are, as others have pointed out, grammar
artefacts and stopwords - completely without consequence to getting to read
the book in question.
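
Checking a claim like "50 words are 50%" would only take a few lines. A sketch, assuming a plain regex tokenizer and a toy sample in place of the book's text (which is not reproduced here):

```python
import re
from collections import Counter

def top_n_coverage(text, n):
    """Share of all word occurrences accounted for by the n most frequent words."""
    counts = Counter(re.findall(r"[^\W\d_]+", text.lower()))
    total = sum(counts.values())
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total if total else 0.0

# Toy sample; for a real check, pass in the full text of the book instead.
sample = "to be or not to be that is the question"
for n in (1, 2, 8):
    print(n, top_n_coverage(sample, n))
```

Running this with n=20 and n=50 on the actual text would show whether the script's cutoff matches the 50% figure.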

------
pm215
There's some similar statistics for Japanese novels here:
<http://pomax.nihongoresources.com/index.php?entry=1223045359> which I think
show that the problem is not at the "most common" end of the distribution but
at the "least common" end. The jump between '80% understanding' and '90%
understanding' requires knowing an extra 5341 words, 90% to 95% needs another
7495, and so on. Basically the long tail is really nasty, and even 90%
understanding is still not knowing one word in ten...
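
For anyone who wants to reproduce that kind of table on their own corpus, here is a small sketch. It assumes the text is already tokenized (which is the hard part for Japanese), and the token list below is a toy:

```python
from collections import Counter

def words_needed(tokens, target):
    """Smallest number of most-frequent distinct words covering `target` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(counts)

# Toy corpus: one very common word, one common word, two rare ones.
tokens = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
for target in (0.5, 0.8, 0.9, 1.0):
    print(target, words_needed(tokens, target))
```

Even in this toy, going from 80% to 100% takes as many new words as getting to 80% did.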

~~~
cskau
Thank you for sharing this. I'm doing something related at the moment, so it's
nice to see previous work like this.

------
korussian
That's a fantastic idea. I'm struggling to learn Korean, and it's tough
because few of the words are recognizable to me. I have a base of
English/French/Russian, so that doesn't help (much).

I would love to try to put my grammar/flash cards aside and go the Harry
Potter route.

I can't code. What could you do to help me, an average user, do this with
Korean?

~~~
legierski
I don't know much about Korean, but if I were you I would just get a book and
start reading it, no matter what. And look up the most common words in the
language somewhere online.

~~~
klbarry
He would have to learn the Korean alphabet first (there are some highly
reviewed books on Amazon which I cannot attest to). I've heard it's not super
difficult to learn, though.

~~~
korussian
I learned the Korean alphabet (Hangul). It's by far the easiest/most logical
natural language alphabet to learn, and takes just a few hours to get
reasonably proficient. It rocks and I read it well... the trouble is I don't
know what I'm reading.

Korean is opaque, and though I have a book of the 6000 most used words in
Korean, I don't find it very helpful when picking up a children's book, for
example.

------
krelian
> did you know that out of 5 most popular languages in the world, 3 of them are
> relatively easy to acquire? They are: English, Spanish and Russian, and my
> plan is to be fluent in English and Spanish and be able to get by with Russian
> by the end of 2013! Who’s with me?!)

I'll grant that Spanish is relatively easy but Russian is considered one of
the most difficult languages to learn. It's hard for me to judge the
difficulty of English but I wouldn't say it is an easy language.

~~~
apendleton
Russian is probably somewhere in between. How hard a language is to acquire
depends a lot on what language or languages you already speak, but for native
English speakers, Russian is considered difficult, but probably not "one of
the most difficult." The Defense Language Institute, which provides
instruction for native-English-speaking US military translators, classifies
Russian as a category 3 language (of four categories), which makes it harder
than French or Spanish (category 1), or German (category 2), and on the same
level as Farsi or Hindi, but easier than a reasonably sized swath of
category 4 languages that includes Mandarin, Japanese, and Arabic.

------
goblin89
I like the HP series in this regard. Vocabulary complexity slightly increases with
each book, which helps to progressively learn a language.

------
nathell
Reinventing Zipf's law, huh?

~~~
legierski
Oh, never heard about it, thanks for the tip!

~~~
nathell
If you haven't, and are interested in this kind of thing, I strongly recommend
Manning and Schütze's "Foundations of Statistical Natural Language
Processing". This is _the_ book if you want a gentle yet thorough introduction
to statistical NLP.
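
For a feel of what Zipf's law implies here: in its simplest form a word's frequency is proportional to 1/rank. The sketch below assumes that idealized form (exponent exactly 1) and a made-up vocabulary size of 20,000; real texts only follow it approximately.

```python
def zipf_coverage(top_k, vocab_size):
    """Share of tokens covered by the top_k words if frequency is proportional to 1/rank."""
    def harmonic(n):
        # Sum of 1/r for r = 1..n, i.e. the total weight of the first n ranks.
        return sum(1.0 / r for r in range(1, n + 1))
    return harmonic(top_k) / harmonic(vocab_size)

# vocab_size=20000 is an arbitrary assumption, not a measurement of any book.
for k in (10, 50, 1000):
    print(k, round(zipf_coverage(k, 20000), 2))
```

Coverage grows roughly with the logarithm of the number of words known, which is why the first few dozen words look so impressive and the rest of the vocabulary so slow.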

------
rivalis
NLP folks call those "stopwords," because they don't contribute much to
statistical understanding of text. That is, in most NLP applications, those
words are removed to leave more meaningful text behind. How did this make
front page?

------
wrs
Linguality (<http://www.linguality.com/>) prints French and Italian novels
with the original text on the right pages and a page-specific mini-dictionary
on the left pages. No need to keep stopping to look up words in a dictionary.

Unfortunately there are only a few Linguality books. Can you do this with an
e-reader?

------
tolliator
As a native Russian speaker, I can say with utmost certainty that Russian is
_NOT_ the easiest language to learn. In fact, I would argue that it is
somewhere on the upper scale of difficulty.

I have been living in North America only 10 years - and I can't even teach
Russian to my own kids - we had to get a tutor.

------
raphman
Nice. Additional stemming would probably provide better data, however.
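
To illustrate what stemming would change, here is a toy suffix stripper. It is a crude stand-in written for this example, not a real stemming algorithm; the suffix list is an assumption and only makes sense for a few English forms.

```python
from collections import Counter

# Toy suffix list, an assumption for illustration only; real stemmers
# (e.g. the Porter or Snowball families) use far more careful rules.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = "wand wands looked looking looks look".split()
print(Counter(words).most_common(2))                         # every form counted separately
print(Counter(crude_stem(w) for w in words).most_common(2))  # forms merged per stem
```

Without stemming each inflected form is its own entry; with it, the counts collapse onto "look" and "wand".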

