
An Efficient Way to Extract the Main Topics from a Sentence - shlomib
http://thetokenizer.com/2013/05/09/efficient-way-to-extract-the-main-topics-of-a-sentence/
======
Arjuna
This is neat.

Shlomib, you might be interested in SHRDLU [1][2] if you are not aware of it.
It was developed by Terry Winograd [3] for his dissertation [4] at MIT. It is
a _natural language understanding_ [5] parser that allows you to interact with
a small world of 3D solids.

I think you will find the paper interesting, because it goes into detail on
sentence structure and associated parsing.

Here is a sample dialogue with the program, taken from Wikipedia:

    
    
      Person:   Pick up a big red block.
      Computer: OK.
      Person:   Grasp the pyramid.
      Computer: I don't understand which pyramid you mean.
      Person    (changing their mind): Find a block which is taller than the one you are holding and put it into the box.
      Computer: By "it", I assume you mean the block which is taller than the one I am holding.
      Computer: OK.
      Person:   What does the box contain?
      Computer: The blue pyramid and the blue block.
      Person:   What is the pyramid supported by?
      Computer: The box.
      Person:   How many blocks are not in the box?
      Computer: Four of them.
      Person:   Is at least one of them narrower than the one which I told you to pick up?
      Computer: Yes, the red cube.
    

[1] <https://en.wikipedia.org/wiki/SHRDLU>

[2] <http://hci.stanford.edu/winograd/shrdlu/>

[3] <https://en.wikipedia.org/wiki/Terry_Winograd>

[4] <http://hci.stanford.edu/winograd/shrdlu/AITR-235.pdf>

[5] <https://en.wikipedia.org/wiki/Natural_language_understanding>

~~~
tannerc
Why does this seem so scary/amazing?

It's basically taking what computers already do really well today and adding a
language "mask" onto it. Yet it still blows me away.

~~~
mindcrime
What's really amazing is how old SHRDLU is. It was developed back in the late
60's, early 70's.

Although... I guess you could look at it two ways: Be amazed at what SHRDLU
could do in 1970, or be disappointed that, given that we had that in 1970, we
don't have the "Star Trek Computer" yet in 2013.

~~~
shmageggy
SHRDLU is definitely amazing, especially given its age, but one's amazement is
tempered a little bit (or maybe enhanced, depending on perspective) when you
realize that it achieved what it did primarily through really great
engineering rather than some fundamental insight about language. Since
SHRDLU's world is so limited, Winograd was able to explicitly program every
facet of its language understanding. Unsurprisingly, this approach doesn't
scale, and that goes some way toward explaining why we don't have fully
human-like language programs.

~~~
thaumaturgy
I think people underestimate how explicitly programmed language is in humans.
I'm starting to think that this might be _the_ central problem in NLP
right now.

Humans have good natural pattern-matching engines in their heads, but the
entire body of syntax and vocabulary available to a person is the result of
the memorization of a _huge_ amount of text. I suspect the majority of people
rarely ever develop truly novel words or phrases on their own (with the
notable exception of Lewis Carroll). (Aside: in fact, this is exactly how
"memes" work in the modern online sense; one person invents a novel word or
phrase, and that is then parroted by a huge number of other people.)

I recently started work on an attempt to improve the classification of English
vocabulary by grade level. I built a database using publicly-available
sources, and the number of unique words that the average child has been
exposed to by the 8th grade is _mind boggling_. One source cited 15,000 unique
words and over a million words read annually.

Aside from the words themselves, children have also by that age memorized an
even larger number of phrases, pieces of sentence structure, and full
sentences.

I think that because we aren't able to enumerate everything we've memorized,
we don't fully appreciate just how much data is stored in our heads. As a
result, I think it's possible that computer science researchers have largely
been chasing a ghost in terms of some kind of magical "understanding" of
language; the answer to NLP might actually be to simply store and access a
terabytes-sized data structure of vocabulary and phrases.

~~~
sc00ter
_the answer to NLP might actually be to simply store and access a terabytes-
sized data structure of vocabulary and phrases._

Isn't that effectively what Google Translate is doing? And its results are...
varied.

~~~
scott_karana
I get the impression that Google Translate is strictly doing it in a Bayesian
sense. For example, the recent "he praised the iPad" debacle. [1][2]

[1] <http://code.google.com/p/android/issues/detail?id=38538>

[2] <http://techcrunch.com/2013/01/04/google-now-and-google-translate-praise-the-ipad/>

------
languagehacker
Cool idea, but basically reinvents chunking, which NLTK already has
(<http://nltk.org/api/nltk.chunk.html>). Keep up the writing and NLP research
though :)
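
For reference, here is a minimal sketch of NP chunking with NLTK's RegexpParser. The regex grammar is a common textbook pattern rather than the article's exact rules, and it assumes the tokenizer and tagger models are installed:

    import nltk

    sentence = ("Swayy is a beautiful new dashboard for discovering "
                "and curating online content.")
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # Chunk an optional determiner, any adjectives, and one or more nouns into an NP.
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
    tree = chunker.parse(tagged)

    # The NP chunks are roughly the "topics" the article pulls out.
    for subtree in tree.subtrees(lambda t: t.label() == "NP"):
        print(" ".join(word for word, tag in subtree.leaves()))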

------
jweese
Nice writeup. A few comments:

So you're just identifying NPs and VPs in a sentence? Let's say I run your
program, and I get NPs "Instagram" and "Facebook", and the VP "acquired." The
question is, who did what to whom? Did Facebook acquire Instagram, or did
Instagram acquire Facebook?
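
To illustrate the point: a flat list of NPs and VPs loses the relation, but given a full parse tree (hand-written below, not produced by the article's method) you can read the subject, verb, and object straight off:

    from nltk import Tree

    # A hand-written constituency parse of "Facebook acquired Instagram".
    tree = Tree.fromstring(
        "(S (NP (NNP Facebook)) (VP (VBD acquired) (NP (NNP Instagram))))"
    )
    subject = tree[0]                    # the NP directly under S
    verb, obj = tree[1][0], tree[1][1]   # the VP's head verb and its NP object
    print(" ".join(subject.leaves()), verb[0], " ".join(obj.leaves()))
    # -> Facebook acquired Instagram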

Second, I think you're way over-emphasizing the supposed slowness of CFG
parsing. Yes, the complexity is O(n^3) in the length of the sentence, but in
practice, n is usually small. Modern statistical PCFG parsers are _fast_.

~~~
alok-g
My understanding is different, so please correct me / supply missing
information.

From what I understand, the average length of a typical written sentence is
about n = 27. OK, this is small by itself, but the Stanford Parser (lexicalized
PCFG) I am using needs about 1 second to parse a sentence of this size. Imagine
how slow that is on a computer time-scale by comparing it to a string-length
operation on the same sentence.

I do encounter many sentences that are as much as 100 words long (and, reading
them myself, I find nothing wrong with them). At about four times the average
length, these take about a minute to parse!

I am trying to find information about speeds of other PCFG parsers, including
Collins, Charniak, Berkeley, etc. I understand dependency parsers are faster
but also generally lag in accuracy.

~~~
syllogism
The Stanford parser is particularly slow --- it's in Java, and it's written
for research more than anything. The C&C CCG parser runs at about 60-80
sentences a second, although it gives either CCG constituents or dependencies
-- so the output may take some interpretation.

Shift-reduce dependency parsers are linear time, and are giving state-of-the-
art results. My parser's currently a pain in the ass to install, as it hasn't
really been released yet, but it does hundreds of sentences a second. Accuracy
is state-of-the-art -- 92-93% depending on the beam width and the evaluation
set (Stanford or MALT dependencies).

<https://github.com/syllog1sm/redshift/> . You'll want the develop branch.
It's GPL licensed.

It's implemented in Cython (i.e., almost all the code is Cython --- I'm not
using it just for the speed-critical bits), which would make it easy to work
with if you're using Python. But, as I said...I don't claim it's currently fit
for human consumption.

A C++ implementation of the same algorithm is here:
<http://www.sutd.edu.sg/yuezhang.aspx> . Note his papers too -- he did some of
the important work on this line of research.

The last few years of work in shift-reduce dependency parsing have been a bit
of a breakthrough in parsing, imo.

~~~
bravura
_The Stanford parser is particularly slow --- it's in Java, and it's written
for research more than anything._

The choice of language is not what causes the slowness of the Stanford parser.
It's the choice of search strategy, which trades off speed for accuracy.

 _Shift-reduce dependency parsers are linear time, and are giving state-of-
the-art results._

No, this is incorrect.

The choice of parsing logic (shift-reduce, dependency) and the search strategy
(greedy, sometimes erroneously called "deterministic") are orthogonal. It's
the greedy search strategy that leads to linear time performance.
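
Here is a toy sketch of why greedy search gives linear time in a shift-reduce (arc-standard) dependency parser; the "score" stub stands in for a trained classifier, and everything here is illustrative rather than any particular parser's code:

    def greedy_parse(words, score):
        # Each word is shifted exactly once and each arc pops one stack item,
        # so an n-word sentence takes at most 2n - 1 transitions: linear time.
        stack, buffer, arcs = [], list(range(len(words))), []
        while buffer or len(stack) > 1:
            valid = (["SHIFT"] if buffer else []) + \
                    (["LEFT-ARC", "RIGHT-ARC"] if len(stack) >= 2 else [])
            # Greedy 1-best search: take the highest-scoring transition, never backtrack.
            action = max(valid, key=lambda a: score(stack, buffer, a))
            if action == "SHIFT":
                stack.append(buffer.pop(0))
            elif action == "LEFT-ARC":      # top of stack heads the word beneath it
                arcs.append((stack[-1], stack.pop(-2)))
            else:                           # RIGHT-ARC: word beneath heads the top
                arcs.append((stack[-2], stack.pop()))
        return arcs

Swap the single argmax for a beam of the k best partial transition sequences and you get the beam-search approximation discussed below.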

The choice of logic determines the lower-bound (best-case) on parsing
complexity. If you do exhaustive search for the exact solution of a shift-
reduce dependency parser, it is worst-case exponential. In practice, you don't
do exact search, and by using a beam search approximation you can get observed
linear-time performance.

[edit: You can read my thesis if you are not familiar with what a parsing
logic is.]

I am not aware of state-of-the-art results from greedy shift-reduce parsers.
Do you mind sharing?

~~~
syllogism
> The choice of language is not what causes the slowness of the Stanford
> parser. It's the choice of search strategy, which trades-off speed for
> accuracy.

Even amongst chart parsers it's not a very quick implementation.

> The choice of parsing logic (shift-reduce, dependency) and the search
> strategy (greedy, sometimes erroneously called "deterministic") are
> orthogonal. It's the greedy search strategy that leads to linear time
> performance.

They're really only conceptually orthogonal, because in practice once you
choose shift-reduce you're always going to choose
greedy/deterministic/whatever search. I'm not aware of any shift-
reduce/transition-based parsing results that don't use 1-best or beam search.

> I am not aware of state-of-the-art results from greedy shift-reduce parsers.
> Do you mind sharing?

Zhang and Nivre (2011)
<http://www.sutd.edu.sg/cmsresource/faculty/yuezhang/acl11j.pdf>
93.5 UAS on Stanford basic dependencies. This is the best published result on
the dataset, and the best published dependency parsing result for English.

Probably the parser is still worse than the C&J reranking parser, when 200
parses are supplied to the reranker. But I'd be very surprised if it wasn't
better than the Stanford parser, and probably also the Berkeley parser.

------
teeja
The sentence subject is one thing; the sentence _topic_ might be quite
another.

Consider sentences like: "He joined the not-yet-famous Liverpool band in early
1958."

To many human beings the topic is quickly obvious. Parsing won't do the trick.

~~~
run4yourlives
The only reason that sentence is "obvious" to many people is that we have a
reference to a famous band from Liverpool, one that got its start in the late
50's/early 60's, already embedded in our brain's library of facts.

Removed from that context, human beings see that sentence as just as
meaningless as a parser does, because it is. I'd imagine many young people (who
don't have the "correct" reference points) wouldn't have a clue that the
sentence is about George Harrison.

In order to properly handle this sentence one would need the same external
reference that your brain has. Without the reference the sentence can be
discarded as incomplete, since that's what a human would do too.

~~~
unhammer
You don't necessarily need knowledge to distinguish a topic from a subject
though. It's a grammatical distinction. See

<http://en.wikipedia.org/wiki/Topic%E2%80%93comment#Definitions>

e.g. in example (3) there, "As for the little girl, the dog bit her", "the dog"
is the subject NP and "the little girl" is the topic. That toy example is
perfectly parsable without semantics or even probabilities (though take any
real-world sentence and I'm betting you'll need more than just grammar).

------
DanBC
This is neat!

The article gives an example which I find a bit confusing.

>I ran it on this sentence -

> _“Swayy is a beautiful new dashboard for discovering and curating online
> content.”_

>And got this result -

> _This sentence is about: Swayy, beautiful new dashboard, online content_

That misses "discovering" and "curating", which I think are the most important
parts of that sentence.

~~~
arrrg
Nah, it filtered out the meaningless buzzwords.

It's a dashboard for online content. That pretty much implies the picking and
finding of said online content to be displayed on the dashboard.

~~~
danso
Huh? "Curating" and "discovering" may be overused tech verbs, but they are
vital in describing what the "dashboard" does. For example, you would never
describe the Google Analytics dashboard as something that curates or
discovers.

And far worse than buzzwords are adjectives. Does "beautiful" add anything to
that sentence?

~~~
arrrg
Google Analytics has nothing whatsoever to do with online content.

~~~
danso
You're missing the point. The OP is talking about a system for interpreting
sentences in bulk and extracting useful keywords. "beautiful new" are not
useful, and arguably, "dashboard" is not particularly useful. "Curating" and
"discovering", while grating to our ears, are definitely descriptive words of
purpose...because there are "dashboards" that have nothing to do with
"curating"...so ostensibly, "curating" has some use as a keyword

------
drakaal
Or you could just drop stop words and gerunds. This is another post from
"TheTokenizer" that oversimplifies a complex problem and creates
devastatingly bad results.

The method described doesn't tell you what the sentence is about; it tells you
which things aren't verbs and articles, and it does a poor job of it.

Granted, single-sentence keyword extraction is not easy, but this is an awful
approach. You'd be much better off using word-frequency analysis to determine
the rarest words in the sentence.
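
A rough sketch of that rarest-words idea, using NLTK's Brown corpus counts as a stand-in background corpus (the corpus choice and the helper name are just for illustration; the Brown and stopwords corpora need to be downloaded first):

    from collections import Counter

    import nltk
    from nltk.corpus import brown, stopwords

    freq = Counter(w.lower() for w in brown.words())   # background frequencies
    stop = set(stopwords.words("english"))

    def rarest_words(sentence, k=3):
        words = [w.lower() for w in nltk.word_tokenize(sentence) if w.isalpha()]
        content = [w for w in words if w not in stop]
        # Words least frequent in the background corpus come first.
        return sorted(content, key=lambda w: freq[w])[:k]

    print(rarest_words("Swayy is a beautiful new dashboard for "
                       "discovering and curating online content."))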

------
visarga
Just apply TF-IDF to the text and it extracts the most interesting words from
the phrase - it's dead simple. You just count words, do a little scoring, and
sort (a small sketch of the scoring follows the examples below). Here is an
example applied to tweets; check out how the least significant words come out
last. Some words have been dropped (those with a frequency of less than 5 in a
corpus of a few million phrases).

Tweet: "math final today 6-17-09 piece of cake hopefully i should do well since i
m a math nerd amp english amp social"

Keywords: math, nerd, studies, piece, cake, english, amp, final, hopefully, social,
since, should, well, today, do, of

Tweet: "anyone want an incredibly designed unique limited edition tee for the
summer check out www artcotic com"

Keywords: tee, designed, incredibly, unique, edition, limited, summer, anyone, check,
www, want, com, an, out, for, the
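
For anyone who wants to try the count-score-sort idea, here is a minimal TF-IDF sketch; the three-document "corpus" is made up purely for illustration, whereas the parent used a corpus of a few million phrases:

    import math
    from collections import Counter

    corpus = [
        "math final today piece of cake hopefully i should do well",
        "anyone want a limited edition tee for the summer",
        "check out the new dashboard for online content",
    ]
    docs = [doc.split() for doc in corpus]
    df = Counter(w for doc in docs for w in set(doc))   # document frequency

    def ranked_keywords(text):
        tf = Counter(text.split())
        # Words frequent in this text but rare across the corpus score highest;
        # common words drop toward the end of the list.
        score = {w: tf[w] * math.log(len(docs) / (1 + df[w])) for w in tf}
        return sorted(score, key=score.get, reverse=True)

    print(ranked_keywords("math nerd hopefully does well on the math final today"))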

------
kylebgorman
As a working computational linguist I shiver anytime human language technology
is discussed on HN. Smart developers (who don't work on human language) are
shockingly ignorant about natural language processing. For instance, this
article reinvents "chunking". People who are interested in these problems are
advised to read the entire NLTK book and Jurafsky & Martin textbook before
reinventing square wheels. </$.02>

------
MojoJolo
For some NLP, I really suggest using OpenNLP (<http://opennlp.apache.org/>)
from Apache. It has libraries that can be trained to do different NLP tasks
like sentence splitting, tokenization, POS tagging, and document classification.
I haven't managed to use all of them yet, but in my experience it's very easy
to use. The documentation is good too!

~~~
mindcrime
I'm a fan of OpenNLP as well, although I haven't done a lot of performance
evaluation around it yet. Apache Stanbol[1] is also a very interesting
project, which leverages OpenNLP (among other things) for doing semantic
entity extraction from text.

Also, FWIW, I wrote an article[2] a while back, focusing on Open Source NLP
tools. It was aimed slightly more at business users than developers, so it
doesn't dig very deep on the tech side, but there is a list of popular OSS NLP
tools that people interested in this topic might find useful.

And if I can throw in another shameless plug (only because I think it will
genuinely be of interest, of course), I'll point out this post[3] on Prolog
resources, since Prolog often finds application in the NLP world.

[1]: <http://stanbol.apache.org>

[2]: <http://osintegrators.com/opensoftwareintegrators|howyoucanbenefitfromopensourcenaturallanguageprocessing>

[3]: <http://fogbeam.blogspot.com/2013/05/prolog-im-going-to-learn-prolog.html>

~~~
danieldk
_And if I can throw in another shameless plug (only because I think it will
genuinely be of interest, of course), I'll point out this post[3] on Prolog
resources, since Prolog often finds application in the NLP world_

You missed the nicest and most satisfying book ;):

<http://www.mtome.com/Publications/PNLA/pnla-digital.html>

It is simultaneously an introduction to Prolog and natural language parsing
using Prolog.

~~~
mindcrime
Very cool. That post was originally written quite some time ago, and it was
never meant to be an exhaustive list. That said, I'll add this to the list as
well. Thanks for the pointer!

------
pilooch
Good job. But I should note that the title is confusing to NLP/ML
practitioners: 'topics' usually refers to the clusters captured by so-called
'topic models' [1], the output of an unsupervised learning method, usually a
variant of LDA.

[1] <http://en.wikipedia.org/wiki/Topic_model>

------
thejosh
I was interested in this to parse some user-entered data extracted from
Facebook, and found Text Razor[1] to be pretty good at this.

Natural Language is a beautiful thing.

[1] <http://www.textrazor.com/>

------
jbrooksuk
Without having Brown and NLTK in Node.js, I'm not sure how well I can add this
to my port of shlomib's original code. For those who haven't seen it yet, I
wrote a port of the first part of this here:
<https://github.com/jbrooksuk/node-summary>

Maybe later I'll give it a crack :)

~~~
edtechdev
<https://github.com/NaturalNode/natural>

~~~
jbrooksuk
Heh, I raised an issue on my GitHub page
<https://github.com/jbrooksuk/node-summary/issues/2> and found natural shortly
after.

------
taf2
<https://github.com/taf2/rb-brill-tagger>. For anyone using Ruby, this can do
something very similar. Much of the code is C++ with a smaller Ruby API... It's
pretty good; bug reports welcome...

------
swah
What happens when you apply it repeatedly? Does the text keep shrinking?

------
marknutter
I smell a $30 million acquisition in the near future..

~~~
danieldk
By those standards, many companies will be worth billions ;).

(Chunking is not exactly new and PCFG parsing is pretty fast these days.)

~~~
shmageggy
I think this was a sarcastic commentary on the Summly acquisition.

~~~
marknutter
bingo :)

