
Show HN: 100K sentences mined from Wikipedia to help non-native English learners - abhas9
https://buildmyvocab.in/
======
cschmidt
I went to an interesting talk once at the Boston Python meetup, where a guy
figured out how to order sentences so you could learn them in an order where you
already knew the "other" words in the sentence. Basically, making a directed
graph of vocabulary.

He was doing it to learn Latin, but you could do it for any language.
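
A minimal sketch of that ordering idea, and not the speaker's actual method (which isn't described in detail here). It swaps the directed graph for a simple greedy heuristic: repeatedly pick the sentence with the fewest not-yet-known words. The `tokenize` and `order_sentences` helpers and the sample sentences are made up for illustration.

```python
def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in sentence.lower().split())
    return {w for w in words if w}


def order_sentences(sentences, known=()):
    """Greedily order sentences so each one adds as few unknown words as possible."""
    known = set(known)
    remaining = list(sentences)
    ordered = []
    while remaining:
        # min() returns the first minimal item, so ties keep their input order.
        best = min(remaining, key=lambda s: len(tokenize(s) - known))
        ordered.append(best)
        known |= tokenize(best)
        remaining.remove(best)
    return ordered


# Hypothetical sample data.
sentences = [
    "The dog sees the cat.",
    "The dog runs.",
    "The cat runs home.",
]
for sentence in order_sentences(sentences):
    print(sentence)
```

Passing a starting vocabulary via `known` would let a learner skip words they have already studied.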

~~~
nandemo
How about this?

1\. Get a frequency list. The most common word's rank is 1, the second is 2,
etc. [0]

2\. Then use your favorite Spaced Repetition Software (such as anki) to learn
the words in that order.

3\. Define a sentence's difficulty as the maximum rank over all its words. You
could refine it by adding tie-breakers but I think it doesn't matter. Then
sort the sentences in order of difficulty.

[0] See
[https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists](https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists)
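
Step 3 can be sketched roughly like this. The tiny frequency list and the sentences are invented for illustration, and treating a word missing from the list as one rank harder than the rarest listed word is an assumption of this sketch, not part of the method above.

```python
def rank_words(frequency_list):
    """Map each word to its rank; the most common word gets rank 1."""
    return {word: rank for rank, word in enumerate(frequency_list, start=1)}


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in sentence.lower().split())
    return [w for w in words if w]


def difficulty(sentence, ranks):
    """A sentence's difficulty is the maximum rank over all its words."""
    unknown_rank = len(ranks) + 1  # assumption: unlisted words count as hardest
    return max((ranks.get(w, unknown_rank) for w in tokenize(sentence)), default=0)


# Hypothetical frequency list, most common word first.
frequency_list = ["the", "of", "and", "dog", "runs", "river"]
ranks = rank_words(frequency_list)

sentences = ["The dog and the zebra.", "The river runs.", "The dog runs."]
for s in sorted(sentences, key=lambda s: difficulty(s, ranks)):
    print(difficulty(s, ranks), s)
```

A tie-breaker (for example sentence length) could be added by returning a tuple from the sort key.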

~~~
csa
Great idea but...

The word "run" is a relatively high frequency word. How many different
meanings for "run" does a learner need to know? Is that in isolation? With
collocations? As phrasal verbs?

In many languages, the most frequently appearing words also have the most
varied meanings. Interestingly, many highly vernacular languages also use
relatively few words, but those words have a lot of meanings that are clearly
known by the speech communities.

FWIW, some theoretical linguists consider this a non-issue. People in the
field (i.e., people might die if I get this wrong) know otherwise.

While your idea sounds nice in principle, I hope you will accept the idea that
reality may be slightly more complex.

* Collins Cobuild has 50+ meanings for "run" if phrasal verbs are included. Most non-native speakers are not even aware of the potential breadth of meanings it offers.

~~~
nandemo
It's not just an idea, I've used it to learn about 2000 Hebrew words, albeit
in a somewhat different form and along with other methods/materials.

I'm a somewhat experienced language learner. I'm fully aware that words have
multiple meanings but that's not as problematic as you think. A lot of
distinct meanings of "run" are related, so it's not like you need to memorize
every one individually. Besides, they aren't all equally important. In many
cases, those different meanings are paralleled in my native language (or another
language I know), e.g. you can translate "run" as "correr" in "he runs", "the
river runs..." and "they ran the risk...". More to the point, I always study
words in context. In this method the context is provided by the rest of the
sentence.

------
wodenokoto
How do you decide on which sentences to use?

I'm interested in generating example sentences myself, but in a way that
chooses sentences that are simple, easy to understand, and support the word
they are supposed to exemplify.

For example "She got a _car_ for her birthday, while she was traveling in
Italy eating pizza" does not tell the reader anything about what a car is, or
how the word should be used. However "He drives his car to work", is a much
better example of what a car is, what is a common associated verb and how it
fits in a sentence.

How do you optimise selection for sentences like the latter?

~~~
abhas9
That's a great question. Optimizing for sentence selection is important for
teaching. For now, I have a simple check that filters out sentences which are
longer than 160 characters.

Also, I believe that this is one thing which humans can do better. I,
therefore, plan to add upvote & downvote buttons to rate the quality of
sentences.

~~~
nmstoker
Up/down voting seems good.

I wonder if you might get a bit of a head start if you combine the shorter
sentence idea with selection based on higher n-gram counts. For instance, if
the keyword + words either side match a common n-gram, you could expect that
sentence was reasonably representative and boost it in the initial rankings as
compared to an n-gram that has a much lower count.
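
A rough sketch of how that could fit together with the 160-character filter. The trigram counts below are invented for illustration (real ones would come from an n-gram corpus), and `keyword_trigram` and `rank_examples` are hypothetical helper names.

```python
MAX_LENGTH = 160  # the character limit mentioned upthread


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in sentence.lower().split())
    return [w for w in words if w]


def keyword_trigram(sentence, keyword):
    """Return (previous word, keyword, next word), or None if there is no such window."""
    words = tokenize(sentence)
    if keyword not in words:
        return None
    i = words.index(keyword)
    if i == 0 or i == len(words) - 1:
        return None
    return tuple(words[i - 1:i + 2])


def rank_examples(sentences, keyword, trigram_counts):
    """Drop long sentences, then put those with a common keyword trigram first."""
    short = [s for s in sentences if len(s) <= MAX_LENGTH]

    def score(sentence):
        return trigram_counts.get(keyword_trigram(sentence, keyword), 0)

    # sorted() is stable, so equally scored sentences keep their input order.
    return sorted(short, key=score, reverse=True)


# Made-up counts, for illustration only.
trigram_counts = {("his", "car", "to"): 120, ("a", "car", "for"): 15}

candidates = [
    "She got a car for her birthday, while she was traveling in Italy eating pizza.",
    "He drives his car to work.",
    "The car " + "really " * 30 + "works.",  # longer than 160 characters, filtered out
]
for sentence in rank_examples(candidates, "car", trigram_counts):
    print(sentence)
```

The score only sets the initial ranking; up/down votes could then adjust it over time.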

------
sengork
I've always thought that the Simple English article versions of Wikipedia were
useful for non-native English speakers.
[https://simple.wikipedia.org/](https://simple.wikipedia.org/)

Most people seem to be unaware of this Wikipedia aspect.

~~~
sengork
Furthermore, Simple English tends to be the very high-level TL;DR version of
the associated article.

------
AdmiralAsshat
Very cool! Although I feel that sometimes you really need a human touch to
make it truly comprehended. For instance, I randomly clicked on "antediluvian":

[https://buildmyvocab.in/antediluvian/](https://buildmyvocab.in/antediluvian/)

Everything here will get you a "good enough" understanding of what the word
means, but this is the only one that really comes close to explaining the
word's literal meaning, and it's too vague to be of much use:

 _any of the early patriarchs who lived prior to the Noachian deluge_

A non-native speaker isn't going to have any idea what "Noachian" means (a
native speaker probably isn't either unless they can explicitly identify
"Noah" as the root), and "deluge" is part of the root of the word we're
defining, so simply using the word "deluge" without explaining what it means
doesn't really help.

In short, this is a good groundwork, but I think it needs a human editor to
push the individual definitions from "acceptable" to "correct".

~~~
abhas9
Thanks for your awesome feedback. And yes, this is just the initial groundwork
and part of a larger experiment. We are also trying to teach English using
Bollywood movies and GIFs[1]. I agree that human editing is very important, and
as such an upvote/downvote button feature is next in the pipeline.

[1] [https://buildmyvocab.com/ddlj.html](https://buildmyvocab.com/ddlj.html)

~~~
sridca
Hey, cool project. Can this be used to learn French as an English speaker?

------
dcsan
I find there's a lot of material for studying isolated words, but as an
engineer, analyzing the sentence patterns and grammar is more interesting.

I'm working on a project to do this for a database of Chinese grammar
patterns. When there's enough sentence examples for each pattern as structured
data, we can then make games and other learning tools. For example: yīnwèi /
因为 / because [http://cgram.rikai-bots.com/grammar/yinwei](http://cgram.rikai-bots.com/grammar/yinwei)

Now there's a magnets game to try to use that pattern: [http://cgram.rikai-bots.com/magnets/?cnames=yinwei](http://cgram.rikai-bots.com/magnets/?cnames=yinwei)

I would be happy to share the repo with anyone who's interested in it or in using
the data to make some other language learning games. PS I did a similar thing for
Japanese before: JGram.org, and it really helped me learn Japanese quickly.

------
ekingr
In the same vein, for French translation, Linguee[1] uses many sources from
websites of organisations that display official content in several languages
(e.g. the websites of the EU, of the Canadian Parliament...). The fact that
these are _official_ texts (e.g. laws) makes it quite reliable.

[1] [http://www.linguee.fr](http://www.linguee.fr)

------
lkbm
This is pretty cool.

The second word I clicked was "cant"....and about half I saw were typos of
"can't", so, there's some bad data in there if you're trying to learn standard
English, but it's good data if you want to understand things people actually
write.

Anyway, time to go through and add some apostrophes to a few articles. :-)

------
bauer
Nicely done. You could add in a mailing list to send users a digest of new or
top vocabulary words every week.

~~~
galen211
Second that - would also be interesting for English speakers learning a
language like Chinese, where the average literate speaker can recognize about
5000 of the most frequently used characters.

~~~
peterburkimsher
galen - I'm making [http://pingtype.github.io](http://pingtype.github.io) for
learning Chinese! I don't think that studying characters is the most important
thing though. Words are more important than individual characters! My program
can help you type parts of a character, and put spaces between words
automatically.

------
peterburkimsher
Is there a word list for TOEFL and/or IELTS?

I'm using a similar strategy (movies, music, Bible, articles) for studying
Chinese. I'm using the TOCFL and HSK word lists. My friend uses a book with a
list of 15000 vocabulary words by Morris Hill. I can't find a txt version
though.

------
saurabh1728
[https://play.google.com/store/apps/details?id=com.buildmyvoc...](https://play.google.com/store/apps/details?id=com.buildmyvocab.app)

is this your app abhas ? Quite interesting

~~~
abhas9
Yes, this app is also a part of an initiative to make learning simpler and
fun.

------
miles
This list is to help non-native English learners? Many native English speakers
might have trouble with a few of these: abeyance, abscission, accretion,
amalgamate, anodyne, antediluvian, apposite, arabesque, atavism, and
avuncular.

~~~
camillomiller
As an Italian native with a classical studies background, these kinds of words
are easy for me. They're almost all Latin-derived and they usually sound very
similar to the Italian equivalent. You wanna know what's hard for us? The
street talk. You know like when you shoot the breeze before you really spill
the beans about your shenanigans while riding shotgun on a friend's old
jalopy.

~~~
jaclaz
Yes, besides those you listed (which are not common in normal conversation),
there is a large number of Latin-derived words in English. As a rule of thumb,
there are almost always two words in English, a non-Latin-derived one and a
Latin-derived one, with more or less the same meaning (or near enough).

Usually native English speakers will use the non-Latin one, but of
course they generally also know (or at least have heard or read) the "other"
one, so they can understand you all right. But I am told that once you are
proficient enough in English, if you - mistakenly - use the Latin-based ones,
you sound like a snob, or "upper class", or very formal (and as if you want to
seem so).

A few examples (of common words):

apartment=flat < I never managed to get this right

obscurity=darkness

arms=weapons

annually=yearly

legal=lawful

constructor=builder

transmit/transmitted=send/sent

custodian=keeper

~~~
camillomiller
Exactly :)

My friends say that I usually sound "academic" more than snobbish.

Some time ago I witnessed this first-hand. I was at a speaking event, the
speaker was Canadian. He said: "If you ever had the fortune to encounter
[Mister XYZ]". I would have definitely expected something like: "If you were
lucky to meet him"/"had the luck to have met him" or similar. He was native,
but very academic indeed. :)

------
goshx
Very nice! Thanks for sharing!

Some words are not found:
[https://buildmyvocab.in/affinity](https://buildmyvocab.in/affinity)

(Just a little correction there: does not exist*)

~~~
abhas9
Thanks for pointing this out. Will fix this soon.

------
mbrookes
Completely OT, but is there a mathematical explanation for why, when scrolled,
the spaces between the words appear to form connected channels?

~~~
Jun8
I think you're referring to word rivers
([https://en.wikipedia.org/wiki/River_(typography)](https://en.wikipedia.org/wiki/River_\(typography\))).

------
WheelsAtLarge
This is good stuff. I like the sentences part but I would put the definition
before the sentences.

This would be a great foreign language tool too.

------
ilamparithi
I created something similar using the Wordnik api.
[https://www.greedge.com/grewordlist/](https://www.greedge.com/grewordlist/)

------
satysin
Very cool. As others have said I think you should add definitions for the
words (even a link off to an external one is fine) and pronunciation (with
audio, perhaps link to forvo.com?) would be superb.

------
opaqe
Mining canonical papers/text to generate standardized tests (SAT/GRE) might be
a further step. My guess is that both tests and commercial prep-material are
produced by committee.

------
dadvocate
Would be more intuitive if the meaning is presented first and then example
sentences

------
lacampbell
While I think it needs a little polishing (a lot of Wikipedia sentences are
fairly hairy), I really like the core idea here. Keep up the good work.

------
Baeocystin
Neat.

::clicks on a random word::

"We couldn't find any sentences for the word centripetal."

So... Why is it one of the chosen few?

------
racl101
What?! No "cromulent"?

------
masteryupa_
Could something similar be done with other languages as well, say Simplified
Chinese?

~~~
yorwba
Simplified Chinese is a bit tricky, because the Chinese Wikipedia is mostly in
Traditional Chinese, since it's blocked by the Great Firewall.

Otherwise, if you can find a large corpus, segment it into words and do some
basic statistics, you could build something like this for any language.

The most similar implementation I am aware of (using the word 中文 (Chinese) as
an example) is
[http://ce.linedict.com/#/cnen/example?query=%E4%B8%AD%E6%96%...](http://ce.linedict.com/#/cnen/example?query=%E4%B8%AD%E6%96%87)

------
malandrew
What about other languages?

~~~
peterburkimsher
I'm working on Chinese, and everything I do gets put into
[http://pingtype.github.io](http://pingtype.github.io)

This week I plan to finish making clips for words in movie subtitles.

------
baby
1\. I don't really understand what this is about. Having a description on the
landing page would help.

> Barron's 800 Words list with example sentences

who is this Barron?

2\. Please can you add pronunciation :D

3\. Words need a definition as well; I'm not sure what some of these mean even
with the examples.

~~~
dominotw
>who is this Barron?

I think if you are asking that question you are probably not the target
audience for this website.

~~~
sidegrid
I'd like to know who or what Barron is too.

------
jlebrech
can you make a container that lets you crawl any other language?

------
ge96
mined with a mithril axe

------
mrstatus
i am one of learner's of English, but i can't.. tips me to get my English
perfect.. [http://www.mrstatus.in/himbhoomi-jamabandi-copy/](http://www.mrstatus.in/himbhoomi-jamabandi-copy/)

