
Try out Stanford's CoreNLP natural language software - apsec112
http://corenlp.run/
======
nl
CoreNLP is a _very_ good baseline for any NLP work. If you can build something
that beats it on a specific benchmark it's a pretty good bet you have
something that's pretty close to the state-of-the-art.

But... there are problems. As software engineers, the (many) authors make great
researchers.

CoreNLP is wonderful in the _many_ different ways it _almost_ lets you
integrate into it without hacking the code. Changing the config of the various
components is fantastic, because you get a very comprehensive view of many
different ways people can configure almost the same thing. Environment
variables? System properties? Properties files in a specific location? In the
classpath? JSON config? YAML? It almost supports them all - or rather
different parts use different ones, and the only way to work out exactly how
it works is to read the code.
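For what it's worth, the properties-file route is the most commonly documented one. A sketch of what that looks like (the `annotators` key is standard; treat any per-annotator option as something to verify against the docs for your version):

```properties
# corenlp.properties -- typically passed as: java ... -props corenlp.properties
annotators = tokenize, ssplit, pos, lemma, ner, parse

# per-annotator options are namespaced by the annotator's name, e.g.:
ner.useSUTime = true
```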

Also, the licensing is annoying. Everyone doing commercial stuff with it just
puts it behind a web service anyway, so they should just embrace a non-viral
license and get some input from the community.

Also SUTime. Yes, it works mostly, but wow :(

(Sorry for the rantish post. I use parts of CoreNLP a lot, and I'd love to see
it improve.)

~~~
fgimenez
Have you used spacy.io for comparison? The author seems to enjoy trolling
about the faults of CoreNLP, but the license is very agreeable and the results
seem on par.

~~~
nl
I've looked at it, and played with it a bit. I think when I last looked at it
the licensing was even less friendly than CoreNLP's. I believe this has been
fixed though, so it's probably worth looking at again.

From memory, when I was last looking around I cared mostly about named entity
recognition (NER), and spaCy themselves say CoreNLP is better at that. CoreNLP
has more features too.

Word vector integration in spaCy looks interesting though. That's probably
enough to make me have a play with it.

~~~
lqdc13
I suggest Gensim for Word2Vec.

spaCy is really fast, the author is extremely knowledgeable, and it works well
on the datasets it was trained on. The problem with spaCy for me was that it
was pre-trained on those texts and it wasn't possible to train it on new data.
Also, their parser wasn't very customizable.

This pre-trained model is borderline useless if you want to obtain good
results on your data, which is probably very different from the data they
trained on.

For NER in the Python world, the best option is pycrfsuite. It's really fast
and lets you easily define your own features. CRFSuite itself is a work of
art. I only wish Okazaki had incorporated 2nd-order transitions, because that
makes a huge difference on some datasets.

CoreNLP is a lot worse than pycrfsuite in terms of ease of integration and
performance in my experience, particularly if you want to define your own
features.

~~~
nl
Your Word2Vec comments are interesting. I've had pretty good success with the
pre-trained vectors the original Word2Vec implementation shipped with. That's
a big dataset of course, but one of the strengths of the model is that if your
training dataset is big enough you don't really need a lot of specialized
training.

 _CRFSuite_

I've never used this, but it's not really a ready-to-use NLP toolkit, is it?
Isn't it more a tool for building NLP tools with?

~~~
lqdc13
Gensim's implementation is better than the original and allows Python API
access to all the features. Highly recommended. Radim tends to write really
memory-efficient code, unlike some other Python libs, so you can deal with
large datasets.

If you want to do NER in a way that doesn't suck, there is no way around
making your own model on your own training data.

It honestly takes only a few days of labeling things yourself. I found that
outsourcing the work to Amazon Mechanical Turk is not a viable option because
the graders there are terrible, and they work about 30x slower than you do.
Even if you pay them $1/hr, that is like paying one person $30/hr. I'm not
kidding.

Sure you can do a quick and dirty "send data to these guys and they'll do all
the work", but I haven't come across a model that works well on all datasets.
We're talking going from 30ish percent accuracy for a model not trained on
your dataset to low 90s for something trained on your dataset. Of course,
these are approximate numbers and it is definitely possible that your dataset
is almost exactly like the ones they trained their model on.

It's incredibly simple to make your own model.

1. Label your data with brat:
[http://brat.nlplab.org/index.html](http://brat.nlplab.org/index.html) # 5
days for 2k one-page documents.

2. Tokenize the data with NLTK/spaCy. Come up with features and label using
pycrfsuite: [http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb](http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb) # 1 day

3. Do more labeling and retokenizing, add neural embeddings from word2vec's
similar words to the tokens you have, tune parameters, or come up with better
features such as your own dictionaries of entities, etc. Retrain the model.
# 2 weeks

4. Done. Now you have a memory-efficient, fast model tuned on your data. You
can label anything you want: not just Person/Company, but things like car vs.
bicycle brands, computer parts, obfuscated email addresses, etc.
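To make step 2 concrete, the per-token feature dicts that pycrfsuite consumes look something like this (a minimal sketch; the function name and feature set are illustrative, not a library API - a real setup would feed these, plus the gold labels, to `pycrfsuite.Trainer`):

```python
# Hypothetical per-token feature extractor in the style used with pycrfsuite.
def word2features(tokens, i):
    word = tokens[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],  # crude suffix feature
    }
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features

tokens = ["Stanford", "released", "CoreNLP"]
feats = [word2features(tokens, i) for i in range(len(tokens))]
```

This is where your own dictionaries of entities from step 3 would plug in: just add another key to the dict.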

~~~
syllogism
I'd endorse pretty much all of this.

I want to make "domain adaptation as a service" the key part of spaCy's
business model: you send us text, we send you a good model. Internally this
will probably involve annotating part of the text, but that's a tactical
decision we'll make.

I hope we can make some breakthroughs that help NER be much more general than
it currently is. But the current solution you describe works fine; it's just a
pain in the ass for each organization to take on. We want to have the required
infrastructure and expertise set up, and make the process seamless.

------
dacox
I took CoreNLP for a test drive last year during a content-driven
recommendation project. Unfortunately we didn't have an opportunity to use it
on the project, but I was very impressed with what I saw. I became aware of
its existence after Stanford's online Sentiment Analysis demo
([http://nlp.stanford.edu/sentiment/](http://nlp.stanford.edu/sentiment/))
was released.

------
harperlee
Mmhh, I tried it with Spanish and it didn't work.

I see that it at least identifies most words as FW, "foreign word", perhaps?

I'm not very familiar with CoreNLP (or NLP, more generally): does it only
support English out of the box? Or is it _nothing_ out of the box, and this
demo is configured for English?

Google Translate does a pretty good job of guessing the language of non-
ridiculously-short-and-ambiguous sentences. Also, from what I know about how
it works, it seems quite agnostic to specific languages. Does a less
stochastic approach (as what I assume NLP uses) provide such flexibility?
Something akin to the NLP library knowing all languages and deciding in which
of them a given sentence makes sense. I can't put an example on the table
right now, but surely there are sentences that can work in more than one
language, at least if you accept a couple of misspellings.

------
gbrits
Any link/bridge to call python NLP libraries mentioned in this thread from
Node.js?

------
uniformlyrandom
Ha! NLP still doesn't get the 'time flies like an arrow' phrase. It still
thinks 'time flies' is a compound noun, apparently some kind of
time-travelling flies.

I remember reading about computers being confused by this phrase in the '80s
and '90s. Apparently, not much progress here.

~~~
syllogism
[http://spacy.io/displacy/?full=time%20flies%20like%20an%20ar...](http://spacy.io/displacy/?full=time%20flies%20like%20an%20arrow)

You can download the spaCy library for yourself if you suspect I've cooked the
example :)

~~~
imron
Unfortunately it fails with the suggestion provided by wodenokoto below:

[http://spacy.io/displacy/?full=fruit%20flies%20like%20a%20ba...](http://spacy.io/displacy/?full=fruit%20flies%20like%20a%20banana)

------
canjobear
Pretty unfortunate that the default demo sentence gets parsed wrong.

~~~
nl
Looks good to me?

What is it getting wrong?

~~~
canjobear
The dependency parse in the demo for "my dog also likes eating sausage" has
"eating" as an adjective modifying "sausage". It's as if "eating sausage" were
a kind of sausage.

The correct parse has "eating" as an xcomp of "likes", and "sausage" as a dobj
of "eating". You can see the correct parse for this structure if you plug "I
like eating sausage" into the demo.

EDIT: Weirdly, the demo of the same software at
[http://nlp.stanford.edu:8080/parser/](http://nlp.stanford.edu:8080/parser/)
doesn't make this mistake!

~~~
nl
Yes, I missed that. Good pickup.

 _Weirdly, the demo of the same software
at[http://nlp.stanford.edu:8080/parser/](http://nlp.stanford.edu:8080/parser/)
doesn't make this mistake!_

See my rant about the config of CoreNLP:
[https://news.ycombinator.com/item?id=10350090](https://news.ycombinator.com/item?id=10350090)

~~~
gangeli
These are likely using different parsers. Namely, the (faster) neural net
dependency parser running at corenlp.run -- which gets thrown off by the POS
tag error -- versus the constituency parser you linked to.

You're right that the configuration is confusing for new users. But, you have
to remember that this is first and foremost research code intended to be
flexible enough for the Stanford NLP group's research needs. In terms of
particular configuration sources, nearly everything should be configurable
from properties passed in as a properties file. Are there exceptions to this
rule?

~~~
nl
_In terms of particular configuration sources, nearly everything should be
configurable from properties passed in as a properties file. Are there
exceptions to this rule?_

SUTime is documented to accept a property to a file that is read to obtain
other properties. Quote:

 _sutime.rules = [path to rules file]_ [1]

I'm not sure that works - looking at the code, it appears the rules files need
to be on the classpath under a _package_ that is defined by the _sutime.rules_
property.

I don't remember how other packages worked.

[1]
[http://nlp.stanford.edu/software/sutime.shtml](http://nlp.stanford.edu/software/sutime.shtml)

~~~
gangeli
This should be able to read rules from the filesystem as well (if not, it's a
bug and we should fix it!). The reason all of the examples are classpath
examples is just because we distribute our models as a jar by default. The
SUTime rules could be thought of more as a "model" for the rule-based system
rather than a configuration file.

------
linkydinkandyou
Hmmm. I tried "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo
buffalo."

Didn't get it right.

(See
[https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal...](https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo)
)

Did better on "Colorless green ideas sleep furiously."

~~~
kazinator
But that Buffalo sentence is something on which people also stumble. A machine
can be contrived to pass such a test case, while remaining a lousy parser
compared to people.

The real problem is that the parser comes so swiftly to a wrong conclusion
and cheerfully presents it as a valid result.

It would look a lot better if it simply reported "error: cannot parse that".
(Better yet, with reasons: "I cannot parse that because I get stuck on this
specific ambiguity and it's just too much for me.").

Also, what about the possibility of multiple results? Language is ambiguous.
If something has two parses, it's wrong to assert just one.

This thing has given no consideration whatsoever to the possibility that even
a single instance of "buffalo" in the sentence might conceivably be a verb,
which flies in the face of almost any noun in English being verbable.

~~~
lkjaero
But it won't ever have trouble, because it's not trying to understand the
sentence. It will tell you the most probable parsing of that sentence based on
its model, whether or not it makes sense to a human.

~~~
kazinator
People who manage to parse the sentence also aren't trying to understand it,
except as far as "Buffalo" is a proper noun denoting a city, which can be used
to form noun phrases like "Buffalo buffalo" == buffalo of/from/belonging
to/related to Buffalo. They try various combinations of interpreting "buffalo"
as a noun (in various roles as subject, direct object and so on) or verb, and
determine elided words such as "which" or "that" complementizers heading off
phrases and embedded clauses.

It's almost purely syntactic reasoning. Searching these spaces of
possibilities is something which, you would think, a "natural language parser"
ought to be doing to earn its name.

Nobody actually knows what it means "to buffalo" something; it is not
necessary to know. People solve the parse in spite of knowing that there is
nothing to understand in the sentence.

~~~
dasyatidprime
"buffalo" can mean something like "bother" as an English verb (at least in
American informal use), so the whole sentence as parsed in English does have a
concrete mental image associated with it, in case that makes any difference.

