
Libpostal: A C library for parsing/normalizing street addresses around the world - polm23
https://github.com/openvenues/libpostal
======
newprint
I work for UPS as a software developer and, surprisingly, I work for the
department that is responsible for parsing addresses and matching them with
actual physical addresses. We cover the US, CA & EU (incl. UK). In our
department, we have a guy whose entire career at UPS has been nothing but
maintaining the library that parses addresses. It is very hard to get things
right unless you maintain that library all the time. Genuinely wish you good
luck!

~~~
eruci
BTW, I'm a guy who has been building address parsing software since 2005
(geocoder.ca, geocode.xyz). It is HARD.

Currently I'm using machine learning, similarly to libpostal, to improve my
software.

It works better than libpostal in some cases. For example, in the USA:

    libpostal: 751 FAiR OKS AVENUE PASADNA CA
      road => fair oks avenue pasadna
      state => ca
      house_number => 751

    Geocoder.ca: 751 FAiR OKS AVENUE PASADNA CA
      stnumber: 751
      staddress: N Fair Oaks Ave
      city: Pasadena
      prov: CA
      postal: 91103-3069

[https://geocoder.ca/?locate=751+FAiR+OKS+AVENUE+++PASADNA+CA...](https://geocoder.ca/?locate=751+FAiR+OKS+AVENUE+++PASADNA+CA&geoit=x)

And in Australia:

    libpostal: Little Plate Shop 9 11 Deodar Drive Burliegh Heads QLD 4220
      house => LITTLE PLATE SHOP
      city => HEADS
      postcode => 4220
      road => DEODAR DRIVE BURLIEGH
      state => QLD
      house_number => 9 11

    Geocode.xyz: Little Plate Shop 9 11 Deodar Drive Burliegh Heads QLD 4220
      addresst: DEODAR DR
      region: QLD
      postal: 4220
      stnumber: 11
      prov: AU
      city: BURLEIGH HEADS
      countryname: Australia
      confidence: 0.7

And it works worse in some other cases...

Anyways, good luck to this developer. I don't think anyone will ever produce a
solution that works better than all others, but it is better if more of us
try.

~~~
microcolonel
I think that there is a good solution: supervised learning/segmentation with
direct user verification.

You have the user enter a free-form address, and then translate it into a
structured address. If they correct any fields, you look at those corrections,
try to figure out whether the final result is correct, and feed that back into
the training data.

Maybe this could be done as a service with iframes (like ReCaptcha); and since
the information in a full address is basically entirely public knowledge (at
least in the U.S.), you can keep all of it around in full detail.
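A toy sketch of that feedback loop, with made-up function names (libpostal itself has no such correction API, and the stub parser below stands in for a real one):

```python
# Sketch of the correction loop described above: parse a free-form address,
# show the user the structured fields, and log any corrections as a labeled
# training example for later retraining. All names here are invented.

def parse_address_stub(free_form):
    """Stand-in for a real parser; returns a field -> value dict."""
    # Deliberately naive: treat the last token as the state, the rest as road.
    tokens = free_form.lower().split()
    return {"road": " ".join(tokens[:-1]), "state": tokens[-1]}

def collect_training_example(free_form, user_corrections, dataset):
    """Merge the parser's guess with the user's fixes and log the pair."""
    guess = parse_address_stub(free_form)
    verified = {**guess, **user_corrections}   # user input wins on conflicts
    dataset.append({"input": free_form, "labels": verified})
    return verified

dataset = []
fixed = collect_training_example(
    "751 fair oaks ave ca",
    {"house_number": "751", "road": "fair oaks ave"},  # fields the user fixed
    dataset,
)
```

The merged record keeps both the raw input and the verified labels, which is exactly the pair a supervised segmentation model would train on.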

------
vidarh
Already the name vs. description reveals confusion: street addresses and
postal addresses do not have a 1:1 correspondence, even before taking postal
codes/zip codes into account...

EDIT: examples include differences between e.g. a visitor address and where
mail delivery should happen, or leaving out or adding details for one or the
other (e.g. for many rural places you don't need to include road details in
postal addresses).

Different people also address the same location differently. E.g. I regularly
have to tell delivery companies my address is in Surrey, even though my house
has been in London for more than 50 years.

------
flexer2
libpostal is a pretty incredible open source project, but addresses are so
complicated and nuanced that depending on what you’re doing, it might not be
able to keep up. I work for a real estate tech company where we do a lot of
address parsing and we had to move away from it because it’s just not quite
powerful enough to handle all of the edge cases you find in US addresses.
Right now we use SmartyStreets because their address parser is a bit better
for our use case. Libpostal is a great general purpose library but depending
on what level of accuracy you need, you might have to look for alternatives.

I spent time trying to use libpostal and build USPS address normalization
rules on top of it but there are so many edge cases it was more cost effective
to just purchase a solution from a vendor.

That is not to take away from this project — it’s quite good for a broad set
of addresses across the world — but for narrow use cases such as ours it just
couldn’t quite cut it.

------
eruci
Nice work, although it is a bit slow (avg. 5 seconds per address parsed):

    date && perl geo.pl && date
    Sat Dec 29 16:18:30 UTC 2018
    country => united kingdom
    suburb => shoreditch
    house => the book club
    city => london
    postcode => ec2a 4rh
    road => leonard st
    house_number => 100-106
    Sat Dec 29 16:18:35 UTC 2018

~~~
adrianmonk
It seems to have a whole lot of data that would probably need to be loaded.
Maybe a lot of that time is spent in initialization. Is it faster at parsing a
second, third, etc. address once loaded?

By the way, a more convenient way to benchmark Perl:

    perl -MBenchmark -e 'timethis(500, sub { ... your code here ... });'
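The cold-start vs. warm-parse split can also be sketched in Python, with a stub (and an artificial sleep) standing in for libpostal's one-time model load:

```python
import time

_model = None

def parse(address):
    """Stub parser: the first call pays a fake model-load cost."""
    global _model
    if _model is None:
        time.sleep(0.05)          # stand-in for loading the data files
        _model = object()
    return address.lower().split()

# Time the first (cold) call separately from the warm steady state.
t0 = time.perf_counter()
parse("100-106 Leonard St, London EC2A 4RH")
cold = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(100):
    parse("100-106 Leonard St, London EC2A 4RH")
warm = (time.perf_counter() - t0) / 100
```

If `cold` dwarfs `warm`, the cost is initialization, not parsing, and the fix is to keep the process (and the loaded model) alive between requests.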

~~~
eruci
You are correct. Only the first request takes about 4-5 seconds:

    Start libpostal: Chong Co Thai Restaurant and Bar Shop 0039A Grand Central Shopping Centre 1-7 Dent St Toowoomba QLD 4350
      house => chong co thai restaurant and bar shop 0039a grand central shopping centre
      city => toowoomba
      postcode => 4350
      road => dent st
      state => qld
      house_number => 1-7
    1: 4.10454607009888 seconds

    Start libpostal: Little Plate Shop 9 11 Deodar Drive Burliegh Heads QLD 4220
      house => little plate shop
      city => heads
      postcode => 4220
      road => deodar drive burliegh
      state => qld
      house_number => 9 11
    2: 0.000234127044677734 seconds

    Start libpostal: Sheoak Shack Gallery Cafe 64 Fingal Rd Fingal Head NSW 2487
      suburb => fingal head
      house => sheoak shack gallery cafe
      postcode => 2487
      road => fingal rd
      state => nsw
      house_number => 64
    3: 0.000188827514648438 seconds

    Start libpostal: Chong Co Thai Restaurant and Bar Shop 0039A Grand Central Shopping Centre 1-7 Dent St Toowoomba QLD 4350
      house => chong co thai restaurant and bar shop 0039a grand central shopping centre
      city => toowoomba
      postcode => 4350
      road => dent st
      state => qld
      house_number => 1-7
    4: 0.000257015228271484 seconds

------
chris_wot
Does this use Frank's Compulsive Guide to Postal Addresses [1] for its
inspiration?

1\.
[http://www.columbia.edu/~fdc/postal/](http://www.columbia.edu/~fdc/postal/)

------
hrktb
From the linked post

> Street addresses are among the more quirky artifacts of human language, yet
> they are crucial to the increasing number of applications involving maps and
> location.

The main goal seems to be positioning a point on a map.

As pointed out by the other comments, that's fairly different from dealing
with delivery addresses or legal addresses.

In particular, it means locations inside buildings (e.g. "3 appt of 2nd
floor", "Building 103 - code 17234, 34 foobar street"), with random info baked
in for humans, could easily trip it up and can't be expected to parse properly
either.

Still looks like a pretty ambitious and interesting effort.

~~~
maxerickson
It explicitly is not a geocoder (which is address -> location); it just parses
and normalizes addresses.

It's meant to deal with something like "3 appt of 2nd floor", parsing and
tagging "3 appt" as unit = apt. 3 and "of 2nd floor" as level = 2, even if
that string is mixed with further info like street, city, and so on.
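A toy illustration of that kind of tagging, using handwritten regex rules rather than libpostal's actual trained statistical model:

```python
import re

# Toy rule-based tagger for the "unit" and "level" labels described above.
# libpostal itself learns these from data; these two rules are invented.
PATTERNS = [
    ("unit", re.compile(r"\b(\d+)\s*appt?\b|\bappt?\.?\s*(\d+)\b", re.I)),
    ("level", re.compile(r"\b(\d+)(?:st|nd|rd|th)\s+floor\b", re.I)),
]

def tag(text):
    """Return a dict of labels found in a free-form address fragment."""
    labels = {}
    for label, pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            # Keep whichever capture group actually matched.
            labels[label] = next(g for g in match.groups() if g)
    return labels

print(tag("3 appt of 2nd floor"))  # -> {'unit': '3', 'level': '2'}
```

A real model handles the same labels even when they are embedded in a full street/city/postcode string, which is where rules like these quickly break down.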

------
dang
From 2016:
[https://news.ycombinator.com/item?id=11173920](https://news.ycombinator.com/item?id=11173920)

------
Animats
I'm amazed that someone wrote something that's mostly string processing in
pure C at this late date.

~~~
blt
Building strings in C can be painful, but parsing them is not too bad. I
suppose security is another big concern, especially with something built
explicitly to process user-supplied data.

------
dimfeld
The libpostal developers have released bindings for a number of different
languages, which can be found at the GitHub organization page:
[https://github.com/openvenues](https://github.com/openvenues)

------
Asooka
> Street addresses

Well, this already fails for places that don't address by street. You might
think it's only some pre-industrial villages in the jungle, but examples would
be some eastern European countries and Japan - some (but not all) buildings
simply don't have a street address. Instead they have a number within a
district. But sometimes it's a building number on a street, but it's distinct
from the street's numbering system, so you can have Building 5 on st. Foo as a
distinct address from 5th Foo st., where there is a completely different
building. And of course there's no number on Foo st. that corresponds to
Building 5. Another fun case is when there's a district Foo and a street Foo
and e.g. Google Maps resolves "district Foo, building 5" as "No. 5, Foo st.".
Or when the district has a number in it, so "district Foo 3, building 275"
resolves to "district Foo, building 3", because of course the first Foo
doesn't have a number in it - there's no Foo 1, only Foo, Foo 2, etc.

Generally all residential buildings built by the communist regime follow that
system, while older buildings follow street numbers. Open Street Map actually
deals amazingly well with our addresses, while Google Maps fails miserably
most of the time. This is starting to become a problem as online services here
are integrating Google's mapping technology; e.g. an app for hailing taxis
will ask you to type in your starting and destination addresses, and if Google
can't make sense of them, the taxi can go to some completely wrong place. I
can deal with it fine since I live here, but woe betide any foreigner who
relies on Google Maps.

The postal system works just fine, but sometimes I have to enter a district
name in the street field in online forms. As long as it arrives in the right
country, the postal workers here can make sense of the address just fine.

All of these shenanigans do work in a hierarchical way, so you can pretty much
expect to always have City, City sub-unit, Building designator as your address
schema, but the actual category of City sub-unit and Building designator is
sometimes "Street/number", sometimes "District/building number". You can of
course simply ignore that and not have your system work in weird places, but
if you're making a library for wide use and publishing it, I would appreciate
it if you take into account that not everybody addresses by street/number.
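One way to model that hierarchy, where the kind of sub-unit and building designator varies from place to place, is a schema with typed components (a sketch with invented names, not anything from libpostal):

```python
from dataclasses import dataclass

# Sketch of a schema that keeps the City -> sub-unit -> building hierarchy
# but lets the *kind* of each level vary, as described above.
@dataclass
class Component:
    kind: str     # e.g. "street" or "district"
    value: str

@dataclass
class Address:
    city: str
    sub_unit: Component    # a street OR a district
    building: Component    # a street number OR a building number

# "5th Foo st." and "Building 5 in district Foo" stay distinct addresses,
# even though the city and the number are identical. (City name invented.)
street_style = Address("Sofia", Component("street", "Foo St."),
                       Component("street_number", "5"))
district_style = Address("Sofia", Component("district", "Foo"),
                         Component("building_number", "5"))
```

A flat `street`/`house_number` schema collapses these two into one record; carrying the kind alongside the value is what keeps them apart.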

~~~
taneliv
It worked quite well for fairly approximate things like "Tetuán, Madrid,
España" (district, city, country). However, it seemed to have a tendency to
label suburbs and districts as houses, at least with the handful of Madrid
addresses I happened to have at hand.

If you have a use case for it and data to match, why not give it a try?

