

Parserator – A family of probabalistic parsers - vierja
http://parserator.datamade.us/

======
teraflop
To be a little bit pedantic: in NLP terms, this is a sequence labeling
problem, not parsing. "Parsing" is normally only used for methods that can
handle CFGs or similar nested structures. This tool just seems to be breaking
up a string into a linear sequence of labeled segments.

I don't mean that as a criticism of the project itself, though. The demos
alone are pretty cool, and the framework looks incredibly useful.

~~~
bravura
you're incorrect. To be more pedantic, sequence labeling is a special case of
parsing. Parsing need not be hierarchical. Chunking, aka shallow parsing, is a
sequence labeling task.

Parsing is commonly understood to infer an hierarchical structure, but that's
not required.

Indeed, the Latin root leads us to understand that parsing is about breaking
an object into parts. That these parts nest according to a particular grammar
is something of an important implementation detail.

------
Animats
Didn't they announce this on Hacker News a few weeks ago?

This might work better if you had a database of almost every street name and
almost every place name. Then you could take in an address, and classify words
as one or more of [StreetName, PlaceName, StreetType, etc.]. Some words can
appear in more than one of those categories, which is when a deterministic
parser without a full database fails. Then let the learning algorithm deal
with ambiguities such as "1 Park Lane", "1 Lane Park", and such. You'd have a
better chance of dealing with the hard cases. Expecting this to recognize
street words on its own is a reach.

You can get about 95% successful parsing of US business addresses with a
relatively simple parser that lacks a name database. (I have one running right
now on 20 million addresses.) Then it gets hard. Are they doing better than
that?

The commercial parsers with full address databases do much better.

 _" Duzbuns Hopsit pfarmerrsc"_

------
groby_b
Suffers from all the usual problems. Try e.g.

123 1/2 Green Onion, Some City, CA

1/2 is treated as a number suffix (wrong, part of the number), Onion is
treated as StreetPost, e.g. a classifier similar to Blvd. or Street.

Miss a comma, and the parser will be completely confused.

So, yay, now my address is going to be probabilistically wrong ;)

~~~
jdpage
It seems to be working a little differently now. (I was going to say "better"
but that's debatable.) 1/2 is still in the number suffix, but now "Green
Onion, Some City" is the place name, and it doesn't try to parse it further.

------
matt4077
This seems to be insanely relevant: I'm currently working on transforming
government documents into structured xml and I've had to be stopped repeatedly
from implementing something like this. I have a treetop grammar now that
(mostly) works, but I'm tempted to try this.

~~~
plongeur
You are? That's interesting, b/c that also something I am currently working
on.

What do you need the grammatical parsing for? Identification of named
entities?

~~~
matt4077
We're building a tracker for EU legislative process. There's xml markup for
legislative documents (akoma ntoso) and we need to transform the pdfs that the
EU publishes into it to allow, for example, user annotation (and just good
html representation in general. We've built on this South African project:
[https://github.com/longhotsummer/slaw](https://github.com/longhotsummer/slaw)

~~~
plongeur
Are you working for an NGO, like Open Data Foundation or something like that?

I'd be interested in following your progress. You can send me a mail, so we
can connect.

~~~
matt4077
It's an NGO startup:
[https://twitter.com/aecu_eu](https://twitter.com/aecu_eu)

------
kwhitefoot
Frederick James Jonathan Bull-White

Is three given names followed by a hyphenated family name.

Not a corporation.

I just made it up, but it is representative of a substantial subset of the
English speaking population of the world.

Have looked at Spanish naming conventions?

Without a substantial amount of context I think that such parser cannot be
made to be useful.

------
alexchamberlain
The probablepeople parser failed to parse "Alex Chamberlain"...

~~~
grrowl
They might have been having problems earlier, it works as expected when I
tried:

> Alex, GivenName

> Chamberlain, Surname

------
seanalltogether
I can't get any kind of US address to work, just says "error parsing that
address"

~~~
bcbrown
The two I tried both worked. Try this one: 142 West St, Cantonville, OH 98104
(Not a real address)

