

Usaddress – Python library for parsing unstructured address strings with NLP - danso
http://usaddress.rtfd.org/

======
cfqycwz
It is incredibly disappointing to me that this seemingly brand-new library is
Python-2-only. Why are new libraries being written for a legacy version of the
language?

~~~
meowface
I'm a Python developer and I still only write in Python 2. I'll sometimes
release projects and make them compatible with both Python 2 and 3, but for
any personal or professional (non-open source) work I use Python 2.

Python 2 still has too much pull. Lots of smaller open source projects and
third party libraries use 2, as do little snippets and examples all around the
Internet. I've never found a case where a project I wanted to use, modify, or
borrow from supported 3 but not 2, but the inverse is unfortunately still
quite common.

I think Python 3 is a better language overall, but it's easier for me as a
developer to keep using what I know will be supported and compatible.

------
spaznode
This is fucking awesome, woooh! (Not sarcasm. I guess I'll find out next week
what getting crfsuite set up and adding training data is like. I hope it's not
traumatizing, but either way I have very good reasons to invest tons of time
into this project for my job, where we parse every crazily formatted address in
the U.S. postal region... like, all the time, re-parsing over and over and
over... all work and no play makes Jack a dull boy.)

------
sytelus
Not impressed at this stage. My simple rule-based parser actually works
better. I've filed some examples that I expected to work with such a
heavyweight approach as an issue in the repo:

[https://github.com/datamade/us-address-parser/issues/50](https://github.com/datamade/us-address-parser/issues/50)
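
For readers wondering what a "simple rule-based parser" might look like, here is a minimal sketch. It is not the parser described above: the pattern, the field names, and the `parse_address` helper are all hypothetical, and it only handles one rigid address layout.

```python
import re

# Hypothetical rule: number, street name, street type, city, state, ZIP.
# The field names and the short list of street types are illustrative only.
ADDRESS_RE = re.compile(
    r"""^\s*
    (?P<number>\d+)\s+
    (?P<street>.+?)\s+
    (?P<street_type>St|Ave|Rd|Dr|Ln|Blvd)\.?,?\s+
    (?P<city>[A-Za-z ]+?),?\s+
    (?P<state>[A-Z]{2}),?\s+
    (?P<zip>\d{5})
    \s*$""",
    re.VERBOSE,
)


def parse_address(text):
    """Return a dict of address fields, or None if the rule does not match."""
    match = ADDRESS_RE.match(text)
    return match.groupdict() if match else None


result = parse_address("123 Main St. Chicago, IL 60601")
```

A rule like this is easy to read and debug, but any input that deviates from the expected layout simply returns `None`, which is the usual trade-off against a statistical approach.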

------
rdegges
This looks great! One thing that I find a tad bit annoying, however, is that
the output is a list of tuples -- would be nice to get a dictionary returned
instead. I think that makes more intuitive sense, e.g.:

{"number": "121", "street": "Blah Street", ...}

~~~
cfqycwz
I imagine a list of tuples is used to preserve the order of the street
address. I agree it would probably make much more sense to use an OrderedDict
instead.
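
A rough sketch of that conversion, assuming the parser hands back `(token, label)` pairs in address order. The sample data, the label names, and the `to_ordered_dict` helper are made up for illustration and are not the library's actual output or API.

```python
from collections import OrderedDict

# Hypothetical parser output: (token, label) pairs in address order.
# The label names are illustrative, not taken from the library.
parsed = [
    ("121", "number"),
    ("Blah", "street"),
    ("Street", "street"),
    ("Chicago", "city"),
]


def to_ordered_dict(pairs):
    """Join tokens that share a label, keeping labels in first-seen order."""
    result = OrderedDict()
    for token, label in pairs:
        if label in result:
            # Same label seen before: append the token to the existing value.
            result[label] = result[label] + " " + token
        else:
            result[label] = token
    return result


address = to_ordered_dict(parsed)
```

One caveat: this merges every token with the same label, even when the repeats are not adjacent, so an address where a label legitimately appears twice would lose that distinction.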

------
Semiapies
It's a fantastic idea (and a _hard_ problem to boldly tackle), though still
very early in implementation. I filed an issue on mis-parsing of addresses
without street type identifiers (St/Dr/Av/Ln/etc.).

------
acbart
It'd be nice to see some more examples that show how sophisticated it is. What
are some weird inputs that it can figure out? Still, this is a pretty awesome
idea - definitely can see how it'll be useful for me.

~~~
Semiapies
So far? Omit a "St." or "Ave." in an address, in a form I see all the time
("123 Main Somewhere NY, 10101") and it gets thoroughly confused. The
author(s) are uninterested in fixing this, too - I don't think they care about
"weird inputs" at all.

------
curiously
I wonder if there's a library that will be able to categorize things like
company name, people's name, brand name, etc.

Or some way to train a process that will be able to predict what category an
entity falls into.

~~~
flarg
Named Entity Recognition in the NLTK library seems to work fine in natural
language text. See also
[http://freshlyminted.co.uk/blog/2011/02/28/getting-band-and-...](http://freshlyminted.co.uk/blog/2011/02/28/getting-band-and-artist-names-nltk/)

