

Natural language parsing for the web - adammichaelc
http://nlp.naturalparsing.com/browserparser/parse

======
LeBlanc
So I was out to dinner tonight and got an email from my VPS saying my disk IO
was high. At first I thought someone might have hacked my server, but it turns
out that someone posted the link to my site on HN!

I've been working on this website for the past couple of weeks. Please email
me at andrew [AT] naturalparsing [DOT] com if you have any questions or
suggestions, or want to use the API.

FYI, right now the API is not using all the capabilities of the Stanford
Parser, just the word tagging part. More features will be implemented soon.
Let me know if you have any specific requests.

Andrew

~~~
raffi
Have you seen: <http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/postagger/>

It's a POS tagger whose output is extremely close (read: identical) to the
Stanford POS tagger.

It's also much faster. I was able to tag a corpus of 300K sentences with this
in 15 minutes. With the Stanford POS tagger it took the entire weekend.

Sadly, this tool's license does not allow commercial use, and it is not
released under the GPL.

------
WildUtah
Parser spins and spins with no output for me. Have we already overloaded it?

------
fauigerzigerk
I'm waiting for the day when fruit flies finally get recognition as a species
and fruit stops flying ;-)

~~~
gtani
I cannot recommend this comment highly enough! I urge you to waste no time in
reading this comment over and over again.

------
finin
It seems like it's just a part-of-speech tagger, rather than a parser.

~~~
epochwolf
Wouldn't that make it a lexer then?

~~~
finin
I suppose so, but for human languages it's much more difficult, since
assigning a type to a token is context-dependent.
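
To make that concrete, here's a toy sketch of my own (nothing to do with the
Stanford tagger's internals; the lexicon and the single context rule are made
up for illustration):

```python
# Toy illustration of context-dependent tagging: "flies" is a noun in
# "fruit flies" but a verb in "time flies". A one-token-at-a-time lexer
# cannot decide this from the token alone; a tagger looks at context.

LEXICON = {
    "time": {"NN"}, "fruit": {"NN"},
    "flies": {"NN", "VBZ"},          # ambiguous entry
    "like": {"VB", "IN"},
    "an": {"DT"}, "a": {"DT"},
    "arrow": {"NN"}, "banana": {"NN"},
}

def tag(tokens):
    """Greedy left-to-right tagging with one context rule:
    an ambiguous NN/VBZ word following a noun is read as a verb."""
    tags = []
    for i, tok in enumerate(tokens):
        options = LEXICON[tok]
        if len(options) == 1:
            tags.append(next(iter(options)))
        elif "VBZ" in options and i > 0 and tags[-1] == "NN":
            tags.append("VBZ")
        else:
            tags.append(sorted(options)[0])
    return tags

print(tag("time flies like an arrow".split()))
# → ['NN', 'VBZ', 'IN', 'DT', 'NN']
```

A regular lexer's per-token rules can't express the `tags[-1] == "NN"` check;
that dependence on the previous decision is exactly what makes tagging harder.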

~~~
danieldk
And there are unknown words (words that are not covered by the lexicon).

But the top-poster is right: the linked website does part of speech tagging,
not parsing.

Providing a wide-coverage parser for the web is still hard. The number of
possible parses for long sentences is enormous. Even if a sentence is not
ambiguous to us, grammars allow all kinds of ambiguities.
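
To put a number on "enormous": counting just the binary bracketings of a
sentence (a back-of-the-envelope bound that ignores the grammar's actual
categories), the count grows as the Catalan numbers:

```python
from math import comb

def catalan(n):
    """n-th Catalan number: the number of distinct binary trees
    over n+1 leaves."""
    return comb(2 * n, n) // (n + 1)

# A sentence with n words has catalan(n - 1) binary bracketings:
for n in (5, 10, 20, 40):
    print(n, catalan(n - 1))
# 5 words  -> 14
# 10 words -> 4862
# 20 words -> well over a billion
```

So even before lexical ambiguity multiplies things further, long sentences
blow up combinatorially.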

There is a web demo for a Dutch system that is developed in our research
group[1], but it uses heap-size and time limits to exit gracefully if parsing
takes too much time or memory (and sentences with more than 20 words are
rejected, since for those you'll really want to do offline parsing).

[1] <http://www.let.rug.nl/~vannoord/bin/alpino>

~~~
sqrt17
Statistical methods do quite a good job of pushing that ambiguity back under
the rug (see link to Berkeley parser further up, or the work on statistical
disambiguation in Alpino).

As to the time consumption and complexity, that's a known problem of
unification grammars (or just any grammar that does a little bit more) - but
see this paper by Matsuzaki et al for efficient techniques to speed this up:
<http://www-tsujii.is.s.u-tokyo.ac.jp/~matuzaki/paper/ijcai2007.pdf>

~~~
danieldk
True, but you still have to build up a forest from which every parse can be
extracted (as Alpino does). Of course, it does reduce the cost of ambiguity.
E.g., I implemented packing in the Alpino chart generator, and there are very
many edges with the same 'semantics' that can be packed, especially since
Alpino allows for a lot of optional punctuation, and there are often roots
plus subcategorization frames that give multiple inflections. With a beam
search, unpacking is a lot more efficient.
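
As a minimal sketch of what packing means (my own toy illustration, not
Alpino's implementation): edges over the same span with the same category are
stored once, with their alternative derivations merged, so downstream work
doesn't multiply:

```python
from collections import defaultdict

# Minimal sketch of edge packing in a chart: edges covering the same
# span with the same category are merged into one packed node that
# records all alternative derivations.

chart = defaultdict(list)  # (start, end, category) -> derivations

def add_edge(start, end, category, derivation):
    key = (start, end, category)
    chart[key].append(derivation)   # pack instead of duplicating

# Two different derivations of an NP over words 0..3:
add_edge(0, 3, "NP", ("NP", ("DT", "NN"), "PP"))
add_edge(0, 3, "NP", ("NP", "DT", ("NN", "PP")))

print(len(chart))                    # 1 packed node...
print(len(chart[(0, 3, "NP")]))      # ...holding 2 derivations
```

Any rule that later consumes this NP fires once against the packed node
instead of once per derivation; the ambiguity is only multiplied out if and
when you unpack.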

There are also many other possible optimizations. E.g. with a left-corner
parser you can exclude left-corner spines that will probably not lead to a
probable parse (Van Noord, Learning Efficient Parsing, 2009, and other work).
And, of course, reduction of lexical ambiguity by restricting frames using a
part of speech tagger.

Still, even in this best-1, optimized scenario, _real-time_ parsing of long
sentences is hard. So, when parsing large corpora we usually apply time and
space limits (which is easy to do in SICStus Prolog, with good recovery).

Thanks for the link to the paper!

------
AlexMuir
This is exactly the sort of thing that makes a trip to HN worthwhile! Very
cool. Thanks to high-school Latin I was able to make a bit of sense of the
table of parts of speech. Shame I haven't got a use for it ATM.

------
sqrt17
Darn, now I'm tempted to write a pure javascript POS tagger just to show that
you don't really need anything server-side (or maybe just little bits here and
there so the web page doesn't need to load a 20MB model right away - the
computational effort, in any case, is not so bad that you couldn't do it in
JS).

Hmm. Maybe sometime.

EDIT: The Stanford POS tagger is more complex and quite a bit slower than
anything you'd do on your own. To quantify this, there are methods that are
10x as fast while sacrificing 0.05%-0.2% accuracy. (Or the easy ones that are
100x as fast, but are 1% less accurate - these would be fun to do in JS.)
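
The "easy 100x" end of that spectrum is essentially a most-frequent-tag
baseline. A sketch (with a made-up toy corpus standing in for a real tagged
training set):

```python
from collections import Counter, defaultdict

# The crudest fast tagger: tag each word with the tag it carried most
# often in training; fall back to NN for unknown words. The training
# data below is a toy stand-in for a real corpus.
training = [
    ("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
    ("the", "DT"), ("runs", "NNS"), ("runs", "VBZ"),
]

counts = defaultdict(Counter)
for word, t in training:
    counts[word][t] += 1

model = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(tokens):
    return [model.get(tok, "NN") for tok in tokens]

print(tag("the dog runs fast".split()))
# → ['DT', 'NN', 'VBZ', 'NN']   ("fast" is unknown, so it defaults to NN)
```

Tagging is then a single dict lookup per token, which is exactly the kind of
thing you could ship as a small JSON model to a browser.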

~~~
jacquesm
Bold statements like that need to be backed up with code!

~~~
sqrt17
I don't have a JS version, but here's a functioning HMM tagger in under 300
lines of Python: <http://gist.github.com/503784>

(model loading doesn't work yet for some reason, but you can see what it's
doing in principle).

This uses a smoothed trigram HMM, so it should, in principle, be a bit better
than NLTK's HMM tagger but not as good as serious POS tagging packages (e.g.
hunpos, or the Stanford POS tagger).
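
"Smoothed trigram" here means something like linear interpolation of trigram,
bigram, and unigram tag probabilities. A sketch of just that piece (the counts
are toy numbers, and the lambda weights are illustrative - in practice you'd
estimate them, e.g. by deleted interpolation):

```python
# Linear interpolation smoothing for tag transitions:
#   P(t3 | t1, t2) = L3*ML(t3|t1,t2) + L2*ML(t3|t2) + L1*ML(t3)
# Toy counts stand in for corpus statistics.

trigram = {("DT", "NN", "VBZ"): 8}
bigram = {("NN", "VBZ"): 30, ("DT", "NN"): 50}
unigram = {"VBZ": 100, "NN": 300, "DT": 200}
total = sum(unigram.values())

L1, L2, L3 = 0.1, 0.3, 0.6  # illustrative weights, L1 + L2 + L3 == 1

def p_smoothed(t1, t2, t3):
    p_tri = trigram.get((t1, t2, t3), 0) / max(bigram.get((t1, t2), 0), 1)
    p_bi = bigram.get((t2, t3), 0) / max(unigram.get(t2, 0), 1)
    p_uni = unigram.get(t3, 0) / total
    return L3 * p_tri + L2 * p_bi + L1 * p_uni

print(p_smoothed("DT", "NN", "VBZ"))
```

The point of the interpolation is that unseen trigrams still get nonzero
probability from the bigram and unigram terms, instead of zeroing out whole
tag sequences.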

------
jcsalterego
Interesting, but I wonder how commercial use of the API squares with the
license:

<http://otlportal.stanford.edu/techfinder/technology/ID=24472>

~~~
gojomo
From <http://nlp.stanford.edu/software/tagger.shtml>

_The tagger is licensed under the GNU General Public License (v2 or later)._

The GPL, like other licenses meeting the 'open source definition', has no
restrictions on use -- only on proprietary distribution under nonfree
licenses.

------
gtani
Lots of chatter about MorphAdorner recently (relatively):

<http://dhigger.blogspot.com/2009/08/research-project.html>

<http://workproduct.wordpress.com/2009/01/27/evaluating-pos-taggers-conclusions/>

------
pierrefar
It doesn't seem to be handling multi-word proper nouns as I thought it would:

"Pride and Prejudice is a good book." becomes "Pride/NNP and/CC Prejudice/NNP
is/VBZ a/DT good/JJ book/NN ./." I would have thought "Pride and Prejudice"
would be lumped together.

~~~
raffi
It's a little confusing but this looks like a front-end to Stanford's Part-of-
Speech tagger. POS taggers do not group multi-word tokens. This would be the
role of a chunker or a parser.
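
Grouping the tagger's output is easy to bolt on afterwards. For example, a toy
chunker of my own (not part of the Stanford toolchain) that lumps runs of NNP
tokens - optionally joined by CC, to catch titles like "Pride and Prejudice" -
into one name:

```python
# Toy NP chunker over tagged (word, tag) pairs: merge runs of
# consecutive NNP tokens, allowing an intervening CC when another
# NNP follows, into a single multi-word proper-noun chunk.

def chunk_nnp(tagged):
    chunks, current = [], []
    for i, (word, tag) in enumerate(tagged):
        nxt = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if tag == "NNP" or (tag == "CC" and current and nxt == "NNP"):
            current.append(word)
        else:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_nnp([("Pride", "NNP"), ("and", "CC"), ("Prejudice", "NNP"),
                 ("is", "VBZ"), ("a", "DT"), ("good", "JJ"),
                 ("book", "NN"), (".", ".")]))
# → ['Pride and Prejudice', 'is', 'a', 'good', 'book', '.']
```

Real chunkers are of course statistical, but this is the basic idea: a second
pass over the tag sequence, not something the tagger itself does.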

------
d_c
Did you develop the underlying POS tagger or the web interface?

------
zepolen
Shouldn't 'fuck' be a verb in this context?

go/VB fuck/NN yourself/PRP

~~~
danieldk
Using my own tagger[1], trained using the Brown corpus:

go/VB fuck/VB yourself/PPL

It very much comes down to the amount and kind of training data and the
features used (assuming that the methodology is sound).

[1] <http://github.com/langkit/citar>

