

The unreasonable effectiveness of data - helwr
http://www.mt-archive.info/IEEE-Intelligent-2009-Halevy.pdf

======
coderdude
IMO this is how we'll nail semantic interpretation and NLP at large. In my own
'limited' first-hand experience processing sentences from 10MM pages, I've
seen some of the bigger picture. I say limited because of the size of the test
data, but the results were astounding nonetheless.

I created about 20 chunked-sentence patterns using regex. The purpose of the
patterns was to extract the subject-verb-object (SVO) portions of simple
sentences, such as the following (please forgive any mis-tagged words; this is
off the top of my head):

(NX John/NNP NX) (VX is/VBZ going/VBG VX) to/IN (NX the/DT mall/NN NX)
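
To give a flavor of the patterns (a simplified sketch, not one of the actual
20, and the group names are made up for illustration):

    import re

    # One simplified SVO pattern over MontyLingua-style chunked output:
    # NX = noun chunk, VX = verb chunk, individual tokens carry /POS tags.
    SVO_PATTERN = re.compile(
        r"\(NX (?P<subject>.+?) NX\)\s+"    # subject noun chunk
        r"\(VX (?P<predicate>.+?) VX\)\s+"  # verb chunk
        r"(?:(?P<prep>\S+)/IN\s+)?"         # optional preposition, e.g. to/IN
        r"\(NX (?P<object>.+?) NX\)"        # object noun chunk
    )

    def words(chunk):
        """Drop the /POS tags: 'is/VBZ going/VBG' -> 'is going'."""
        return " ".join(tok.rsplit("/", 1)[0] for tok in chunk.split())

    sent = "(NX John/NNP NX) (VX is/VBZ going/VBG VX) to/IN (NX the/DT mall/NN NX)"
    m = SVO_PATTERN.search(sent)
    if m:
        subj = words(m.group("subject"))
        pred = words(m.group("predicate"))
        if m.group("prep"):
            pred += " " + m.group("prep")
        obj = words(m.group("object"))
        print(subj, "->", pred, "->", obj)  # John -> is going to -> the mall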

With a small number of patterns in my arsenal, I tokenized, tagged, and chunked
every complete sentence in the corpus (removing any parenthesized information
from each sentence). This of course also involved extracting the "good text"
from the HTML. (This area could have been improved a lot.)
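
Stripping the parenthesized information can be as simple as something like
this (a sketch; it only handles non-nested parentheses, and the HTML-to-text
step is a separate problem entirely):

    import re

    PAREN = re.compile(r"\s*\([^()]*\)")  # non-nested parenthesized spans

    def strip_parentheticals(sentence):
        """Drop parenthesized asides before tokenizing/tagging."""
        return PAREN.sub("", sentence).strip()

    print(strip_parentheticals("John (my neighbor) is going to the mall."))
    # John is going to the mall.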

From each of the template matches I stored the subject (John), predicate (is
going to), and the object (the mall). I wanted to try something new (to me) so
I lemmatized any non-proper nouns and removed all determiners. This left me
with 'John -> be go to -> mall.'
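
Something equivalent using NLTK's WordNet lemmatizer would look roughly like
this (a sketch for illustration, not my actual code):

    from nltk.stem import WordNetLemmatizer  # requires the 'wordnet' corpus

    lemmatizer = WordNetLemmatizer()

    def normalize(tagged):
        """Lemmatize non-proper nouns and verbs, drop determiners (DT)."""
        out = []
        for word, tag in tagged:
            if tag == "DT":                # the, a, an, ...
                continue
            if tag.startswith("NNP"):      # keep proper nouns untouched
                out.append(word)
            elif tag.startswith("VB"):
                out.append(lemmatizer.lemmatize(word.lower(), pos="v"))
            elif tag.startswith("NN"):
                out.append(lemmatizer.lemmatize(word.lower(), pos="n"))
            else:
                out.append(word.lower())
        return " ".join(out)

    subj = normalize([("John", "NNP")])
    pred = normalize([("is", "VBZ"), ("going", "VBG"), ("to", "TO")])
    obj  = normalize([("the", "DT"), ("mall", "NN")])
    print(subj, "->", pred, "->", obj)  # John -> be go to -> mall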

The real magic, for me at least, has been the predicates I've extracted. In my
mind the predicate was the most important piece because it models the
relationship between the subject and the object. Once lemmatized and with
determiners removed, many of the chunks I extracted mapped to a single value.
This let me turn that "single value", such as 'be go to', into an identifier
that maps back to surface forms such as 'are going to', 'was going to', 'is
going to', etc. (with better examples available, of course). The result is
assertions of the form X -> be go to -> Y.
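
Conceptually the mapping is just 'normalized predicate -> id', with the
surface forms hanging off of it (a toy sketch of the idea; the real results
lived in MySQL tables):

    from collections import defaultdict

    predicate_ids = {}                 # 'be go to' -> 0, 'be part of' -> 1, ...
    surface_forms = defaultdict(set)   # 'be go to' -> {'is going to', ...}

    def register(normalized, surface):
        """Assign a stable id to a normalized predicate; remember its surface form."""
        if normalized not in predicate_ids:
            predicate_ids[normalized] = len(predicate_ids)
        surface_forms[normalized].add(surface)
        return predicate_ids[normalized]

    for surface in ["is going to", "was going to", "are going to"]:
        register("be go to", surface)   # all collapse to the same id

    print(predicate_ids["be go to"], sorted(surface_forms["be go to"]))
    # 0 ['are going to', 'is going to', 'was going to']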

Another experiment was to see if I could create a graph of "higher-level" and
"lower-level" concepts, such as this (ordered from lowest-level concept to
highest-level concept):

be of -> be part of -> be small part of -> be very small part of

Each of those (minus 'be of') is a predicate in assertions that I extracted,
which in theory allows walking the graph to find assertions for 'be part of'
while also pulling in more specific predicates like 'be very small part of'.

Edit: The method for finding "lower-level concepts" was to remove some
combination of adjectives and adverbs from the predicates, but I did not go
deeper than that. This is also an area where I could have made much
improvement.
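
A simplified sketch of that idea, peeling modifiers off one at a time rather
than trying every combination:

    def generalize(tagged_predicate):
        """Walk from a specific predicate to coarser ones by removing
        adjectives (JJ*) and adverbs (RB*) one at a time, leftmost first."""
        tokens = list(tagged_predicate)
        chain = [" ".join(w for w, _ in tokens)]
        while True:
            idx = next((i for i, (_, t) in enumerate(tokens)
                        if t.startswith("JJ") or t.startswith("RB")), None)
            if idx is None:
                break
            tokens.pop(idx)
            chain.append(" ".join(w for w, _ in tokens))
        return chain

    print(generalize([("be", "VB"), ("very", "RB"), ("small", "JJ"),
                      ("part", "NN"), ("of", "IN")]))
    # ['be very small part of', 'be small part of', 'be part of']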

I used Hadoop to store the data, MapReduce to process it, and MySQL to store
the results. Although I have HBase set up on the cluster, for this small
experiment I declined to use it. The Python package I used to do the
tokenization, tagging, and chunking was MontyLingua (somewhat modified for
better tagging and tokenization).
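
The MapReduce side is nothing exotic; with Hadoop Streaming it comes down to
two small stdin/stdout scripts along these lines (file names and details are
just for illustration, not the exact job I ran):

    #!/usr/bin/env python
    # assertion_mapper.py (name is just for illustration) -- one extracted
    # assertion per input line, tab-separated: subject<TAB>predicate<TAB>object.
    import sys

    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            print("\t".join(parts) + "\t1")

    #!/usr/bin/env python
    # assertion_reducer.py -- sum the counts per assertion. Holding the counts
    # in a dict trades memory for simplicity; fine for a sketch.
    import sys

    counts = {}
    for line in sys.stdin:
        key, n = line.rstrip("\n").rsplit("\t", 1)
        counts[key] = counts.get(key, 0) + int(n)

    for key, n in counts.items():
        print(key + "\t" + str(n))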

Edit: The end result was approximately 2MM assertions (just from those simple
sentences alone) and 4.4MM concepts.

If you're really into this stuff, check out the following papers (ordered from
coolest to cool).

Web-Scale Extraction of Structured Data:
<http://portal.acm.org/citation.cfm?id=1519112>

WebTables: Exploring the Power of Tables on the Web:
<http://www.mit.edu/~y_z/papers/webtables-vldb08.pdf>

Open Information Extraction from the Web:
<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.5174&rep=rep1&type=pdf>

Processing Complex Sentences for Information Extraction:
<http://www.public.asu.edu/~cbaral/thesis/deepthi04.pdf>

A survey of approaches to automatic schema matching:
<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.16.700&rep=rep1&type=pdf>

Using Wikipedia to Bootstrap Open Information Extraction:
<http://portal.acm.org/citation.cfm?id=1519113>

~~~
ntoshev
Interesting. How many words is your corpus?

Right now I'm playing with a way to tokenize text in an unsupervised,
language-independent manner; my current line of thought is "try to predict the
next letter from character n-grams, and if the model has no idea, then it's a
word boundary".

~~~
coderdude
I never got around to writing a job for that. Now you've got me wondering. I
counted each distinct word as a concept, but only from the sentences that
matched the regex patterns. The 4.4MM figure, however, includes multi-word
concepts (as in 'be part of'). The number of distinct concepts in the corpus
is much higher.

Edit: I just went home and started a MapReduce job to find the total number of
distinct tokens. I should be able to post the results within a few hours.
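
The job itself is tiny; a sketch with made-up file names, assuming a single
reducer:

    #!/usr/bin/env python
    # token_mapper.py -- emit each token on its own line.
    import sys

    for line in sys.stdin:
        for token in line.split():
            print(token)

    #!/usr/bin/env python
    # token_reducer.py -- with one reducer the input arrives sorted, so a
    # token is new whenever it differs from the previous line.
    import sys

    distinct, prev = 0, None
    for token in sys.stdin:
        if token != prev:
            distinct += 1
            prev = token
    print(distinct)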

Unsupervised _and_ language-independent; now that's something worth working
towards. What data (or how much data) are you working with right now?

I plan on selling n-gram datasets from the Web, so I would like to have
another set of eyes look over them. If you'd like, shoot me an email at
james[at]webscaled[dot]com and I'll send you a copy to experiment with.

