Dependency parse tree visualization (spacy.io)
72 points by altro on Aug 20, 2015 | 25 comments

It parses "fruit flies like a banana" the same way as "Time flies like an arrow".


Similarly, it seems to fail on "The old man the boat", marking "man" as a noun. The meaning of the sentence in this case is fairly unambiguous, but parsing it can be tricky. See also: https://en.wikipedia.org/wiki/Garden_path_sentence

In some sense, it's a mark of success for an AI system to fail in the same way that humans do. "The old man the boat" is a terrible sentence, essentially ungrammatical.

Can a sentence be terrible? In what way is it ungrammatical?

That's a good example as to why this tool should probably have the option to output a sample of the top-N guesses.

Some sentences are just totally ambiguous without context. "Fruit flies like a banana." isn't even good English. Is the sentence trying to say "Some particular fruit flies like a particular banana"? Or "All fruit flies like any banana"?

By the way, Spacy creator - how's the NER coming along?

Spacy's implementation, assuming it's roughly equivalent with the one syllog1sm blogged about, just does a greedy incremental parse so it only produces one candidate parse.

It is possible to do incremental dependency parsing with a beam, but all the copying of beam "states" is expensive and there are no guarantees that the n complete parses in the beam are really the n best parses w.r.t. the model.
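For anyone curious what "greedy incremental" means in practice, here is a minimal sketch of an arc-standard transition parser. The scoring function is a hypothetical placeholder (it just prefers short arcs), standing in for a trained model; it is not spaCy's actual implementation:

```python
# Minimal sketch of a greedy arc-standard dependency parser.
SHIFT, LEFT_ARC, RIGHT_ARC = "shift", "left", "right"

def score(action, stack, buffer):
    # Placeholder for a learned scoring model (an assumption for
    # illustration, not spaCy's real scorer).
    if action == SHIFT:
        return 0.0
    return -abs(stack[-1] - stack[-2])  # prefer short-distance arcs

def greedy_parse(n_words):
    """Parse a sentence of n_words tokens, returning a dict of
    dependent index -> head index (0 = ROOT)."""
    stack, buffer = [0], list(range(1, n_words + 1))
    heads = {}
    while buffer or len(stack) > 1:
        valid = []
        if buffer:
            valid.append(SHIFT)
        if len(stack) >= 2:
            valid.append(RIGHT_ARC)
            if stack[-2] != 0:  # ROOT can never be a dependent
                valid.append(LEFT_ARC)
        # Greedy: take the single best-scoring valid action.
        # No beam, so exactly one candidate parse is produced.
        action = max(valid, key=lambda a: score(a, stack, buffer))
        if action == SHIFT:
            stack.append(buffer.pop(0))
        elif action == LEFT_ARC:
            dep = stack.pop(-2)
            heads[dep] = stack[-1]
        else:  # RIGHT_ARC
            dep = stack.pop()
            heads[dep] = stack[-1]
    return heads
```

A beam variant would keep the top-k (stack, buffer, heads) states at each step instead of a single one, which is where the k-times-as-many decision evaluations come from.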

Yes, I do greedy parsing. There are many advantages to this, which I'll write about before too long. Fundamentally, it's "the way forward". As models improve in accuracy, search matters less and less.

By the way, the beam isn't slow from the copying. That's really not so important. What matters is simply that you're evaluating, say, 8 times as many decisions. This makes the parser 6-7x slower (you can do a little bit of memoisation).

In that case, I wonder if it can output a probability score for each tag at each position, like pycrfsuite does? Then the output could be ensembled with other taggers, or otherwise pass that confidence information downstream.

Also, maybe a dumb question - is there any library or best-practice method for the ensembling of taggers / chunkers? Or must I create it myself from scratch?
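As far as I know there's no standard library for this, but if each tagger can emit per-token tag probabilities (as asked above), a simple hand-rolled ensemble is to sum those distributions per token and take the argmax. A minimal sketch, with made-up tagger outputs:

```python
from collections import defaultdict

def ensemble_tags(per_tagger_probs):
    """Combine per-token tag distributions from several taggers by
    summing probabilities and taking the argmax at each position.

    per_tagger_probs: list (one per tagger) of lists (one per token)
    of {tag: probability} dicts.
    """
    n_tokens = len(per_tagger_probs[0])
    result = []
    for i in range(n_tokens):
        totals = defaultdict(float)
        for tagger in per_tagger_probs:
            for tag, p in tagger[i].items():
                totals[tag] += p
        result.append(max(totals, key=totals.get))
    return result

# Two hypothetical taggers disagreeing on the second token:
a = [{"NOUN": 0.9, "VERB": 0.1}, {"VERB": 0.6, "NOUN": 0.4}]
b = [{"NOUN": 0.8, "VERB": 0.2}, {"NOUN": 0.7, "VERB": 0.3}]
print(ensemble_tags([a, b]))  # NOUN wins both positions
```

Weighting each tagger by its held-out accuracy instead of summing uniformly is the obvious next step.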

How do you expect fruit to fly?

Please report any performance problems. I have this running on a pretty modest server, but it should be no trouble to throw up an EC2 instance if the traffic gets too much.

I did some simple stress testing that said it should handle a couple of hundred concurrent users, but I didn't put the time in to get a very realistic simulation. I worry it was too optimistic.

Nice, and has a built-in parser and annotation system too. Very nice.

See also: http://brat.nlplab.org/examples.html

Any others like this?

Full disclosure: brat author here. We were inspired by tools like "What's Wrong With My NLP?" [1] and TikZ-dependency [2]. But, truth be told, there are not all that many great visualisation tools out there for NLP data. The same goes for good, freely available NLP toolkits; tools like spaCy are very much the exception rather than the rule.

[1]: https://code.google.com/p/whatswrong/

[2]: http://sourceforge.net/projects/tikz-dependency/

Brat is awesome! I just started using it last week for a project. For people who don't know what Brat is, there is a nice demo here using Brat to visualize the output of Stanford CoreNLP:


Super interesting!

I'm feeding it Shakespeare, just for fun, but I'm having trouble understanding the meaning of the two ccomp arcs attached to the verb "make" in this:

"Our doubts are traitors, and make us lose the good we oft might win, by fearing to attempt"[1]

[1] http://spacy.io/displacy/?full=Our%20doubts%20are%20traitors...

Well, the parse is wrong.

In linguistic terms this is a case of "over-generation": the parser has proposed an interpretation that's not "licensed" by the language in general. (In contrast, consider a sentence like "I shot an elephant in my trousers." A reading like "An elephant in my trousers was shot" is licensed but unlikely.)

To see the point of error, step through the parser until the focus is on "win", with 4 words on the stack. (Deep linking into particular states will come in a future version. For now, just press forward...).

At that state, the parser should attach "we might win" to "good", as a reduced relative clause. Instead it opts to pop "good" from the stack. It then ends up in a bad situation, and essentially attaches to "make" as a way to concede defeat on the arc, and get on with the rest of the sentence.

The parser actually could generate any projective tree over its input. It's only constrained by its statistics. On the benchmark evaluations this tends to perform better. I'm interested in trying out ways to incorporate syntactic restrictions, particularly for verb valencies, into the parser.
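For anyone unfamiliar with the term: a dependency tree is projective when no two arcs cross, i.e. every arc spans a contiguous interval no other arc cuts through. A quick check, assuming heads is a dict mapping dependent index to head index (0 = ROOT):

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.
    heads maps each dependent index to its head index (0 = ROOT)."""
    arcs = [(min(d, h), max(d, h)) for d, h in heads.items()]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            # Two arcs cross iff exactly one endpoint of one arc
            # lies strictly inside the other arc's span.
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

print(is_projective({1: 2, 2: 0, 3: 2}))        # True: nested arcs
print(is_projective({1: 3, 2: 4, 3: 0, 4: 3}))  # False: 1-3 crosses 2-4
```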

I put up a similar demo a few years ago if anyone is interested: http://nlpviz.bpodgursky.com

Interesting and quite neat! I tried it with the famous openings of two novels, "Pride and Prejudice" and "Ulysses"; it did well with the former but struggled a bit with the latter. I guess that's probably par for the course for most humans with those two texts, though.

I think it depends on the level of complexity and how common the sentence structures are. If a writer uses more unusual, elaborate language, it will fail.

I tried the famous opening from the "Commentarii de Bello Gallico".

"All Gaul is divided into three parts, one of which the Belgae inhabit, the Aquitani another, those who in their own language are called Celts, the third."

It failed to produce a correct parse tree. On the other hand, it did quite well with simple sentences (randomly picked from Wikipedia).

PS: I also had a hard time understanding Ulysses.

Interesting example. The Latin is much easier to parse than the English "translation" for this sentence. Actually I often tell people that a classics class is probably a better introduction for parsing than most linguistics 101 classes I've seen, which are usually a little bit airy.

Theory is definitely good, but it can't replace stepping through a lot of examples. Classics is probably the best place to get that.

Anyway: the Latin is probably easier to parse than the English "translation", since the translation here is hardly natural. I'd suggest that Classics translations have a particular tradition of being faithful to the syntax of the original.

Compare the parses for the original:


And what I would say is a more English-like version:


The English-like version is still wrong, and I'm interested to dig through what's gone wrong with it. But it's a much better parse than the tool was capable of producing for the Latinglish original.

The problem is that the translation of that sentence is very bad. It is totally unidiomatic in English. I would render it "...one inhabited by the Belgae, another inhabited by the Aquitani, and the third inhabited by those who in their own language are called Celts...".

Imagine if we could have a visualization of subject/verb/object and their relationships clearly outlined on hover on every forum on the internet!

It's not curing cancer, but it's about as close to solving war as it gets.

I tried "I am tired of being in front of my computer" and it treats "front" as a plain noun, so the tree is not quite right (the dependencies come out as: I -> am -> tired -> of -> being -> in -> front -> of -> my computer, whereas I'd expect something like: I -> am -> tired -> of -> (being -> in front of -> my computer)).

The UI is quite nice though.

Huh. I work for a book company and feel we could do some interesting things with a parser like this...

I'd be interested to know what you have in mind.

Well, get in touch! :)

