Libpostal: A C library for parsing/normalizing street addresses around the world (github.com/openvenues)
117 points by polm23 on Dec 29, 2018 | 25 comments



I work for UPS as a software developer and, surprisingly, I work in the department responsible for parsing addresses and matching them to actual physical addresses. We cover the US, CA & EU (incl. the UK). In our department, we have a guy whose entire career at UPS has been nothing but maintaining the library that parses addresses. It is very hard to get things right unless you maintain that library all the time. Genuinely wish you good luck!


BTW, I'm a guy who has been building address parsing software since 2005 (geocoder.ca, geocode.xyz). It is HARD.

Currently I'm using machine learning similarly to libpostal to improve my software.

It works better than libpostal in some cases, e.g.:

USA: Start libpostal: 751 FAiR OKS AVENUE PASADNA CA

road => fair oks avenue pasadna

state => ca

house_number => 751

Geocoder.ca 751 FAiR OKS AVENUE PASADNA CA

https://geocoder.ca/?locate=751+FAiR+OKS+AVENUE+++PASADNA+CA...

stnumber: 751

staddress: N Fair Oaks Ave

city: Pasadena

prov: CA

postal: 91103-3069

libpostal: Little Plate Shop 9 11 Deodar Drive Burliegh Heads QLD 4220

house => LITTLE PLATE SHOP

city => HEADS

postcode => 4220

road => DEODAR DRIVE BURLIEGH

state => QLD

house_number => 9 11

Geocode.xyz Little Plate Shop 9 11 Deodar Drive Burliegh Heads QLD 4220

addresst: DEODAR DR

region: QLD

postal: 4220

stnumber: 11

prov: AU

city: BURLEIGH HEADS

countryname: Australia

confidence: 0.7

And it works worse in some other cases...
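Comparing parses like the ones above field by field means first mapping the two output schemas onto one another. A minimal sketch in Python; the field mapping is an assumption inferred from the sample outputs, not either service's documented schema:

```python
# Sketch: map geocode.xyz-style field names onto libpostal-style labels
# so the two parses can be compared field by field. The mapping below is
# an assumption inferred from the sample outputs, not a documented schema.
XYZ_TO_LIBPOSTAL = {
    "stnumber": "house_number",
    "staddress": "road",
    "addresst": "road",
    "prov": "state",
    "region": "state",
    "postal": "postcode",
}

def normalize(parse, mapping=None):
    """Rename keys via the mapping and lower-case values to get a shared schema."""
    mapping = mapping or {}
    return {mapping.get(k, k): str(v).lower() for k, v in parse.items()}

libpostal_out = {"house_number": "751", "road": "fair oks avenue pasadna", "state": "ca"}
xyz_out = {"stnumber": "751", "staddress": "N Fair Oaks Ave", "city": "Pasadena",
           "prov": "CA", "postal": "91103-3069"}

a = normalize(libpostal_out)
b = normalize(xyz_out, XYZ_TO_LIBPOSTAL)
shared = {k for k in a if k in b and a[k] == b[k]}
print(sorted(shared))  # fields the two parsers agree on: ['house_number', 'state']
```

Note that even after normalization, fuzzy comparison would still be needed for the road field, since one parser preserves the typo while the other corrects it.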

Anyway, good luck to this developer. I don't think anyone will ever produce a solution that beats all the others, but it's better if more of us try.


I think that there is a good solution: supervised learning/segmentation with direct user verification.

You have the user enter a free-form address, and then translate it into a structured address. If they correct any fields, you look at those corrections, try to figure out whether the final result is correct, and integrate that.

Maybe this could be done as a service with iframes (like ReCaptcha); and since the information in a full address is basically entirely public knowledge (at least in the U.S.), you can keep all of it around in full detail.
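That correction loop can be sketched as a function that diffs the suggested parse against the user-verified one. The record shape and field names here are illustrative assumptions, not any real API:

```python
# Sketch: turn a user's correction of a suggested parse into a labeled
# training example. The record shape and field names are illustrative
# assumptions, not any real API.
def correction_to_example(raw, suggested, corrected):
    """Pair the free-form input with the user-verified parse, noting
    which fields the user had to fix."""
    fixed = {k for k in corrected if suggested.get(k) != corrected[k]}
    return {"input": raw, "labels": corrected, "corrected_fields": sorted(fixed)}

ex = correction_to_example(
    "751 FAiR OKS AVENUE PASADNA CA",
    {"road": "fair oks avenue pasadna", "state": "ca", "house_number": "751"},
    {"road": "n fair oaks ave", "city": "pasadena", "state": "ca", "house_number": "751"},
)
print(ex["corrected_fields"])  # ['city', 'road']
```

Examples where the user changed nothing are useful too: they confirm the suggested segmentation was correct.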


It was funded by the now-defunct Mapzen. I have no idea, but I wouldn't be surprised if it isn't getting much work anymore.


Not your issue, but it makes me so angry that UPS charges me for address corrections ($12 a pop). The UPS supplied software often fails to identify missing suite number, can't figure out "State Route 123" vs "SR 123", etc. UPS is financially rewarded for bugs. Argh :)


Sorry to hear that. I would advise you to call UPS and dispute those charges. If the address has existed for a while and you can Google it, then it is definitely a bug.


Appreciate it. It's more the policy that's irritating. For example, "missing suite" usually just means the driver has to look at the business name on the shipping label and identify it in the strip mall. $12 charged to the shipper for a few seconds of thought where the UPS software didn't identify "missing suite".


Seems like this work should be open source. Is it already? If not, why not? Publicly funded work like this should be open source by default.


Already the name vs. the description reveals confusion: street addresses and postal addresses do not have a 1:1 correspondence, even before taking postal codes/zip codes into account...

EDIT: examples include differences between e.g. a visitor address vs. where mail delivery should happen, and leaving out or adding details for one or the other (e.g. in many rural places you don't need to include road details for postal addresses).

Different people also address the same location differently. E.g. I regularly have to tell delivery companies my address is in Surrey, even though my house has been in London for more than 50 years.


libpostal is a pretty incredible open source project, but addresses are so complicated and nuanced that depending on what you’re doing, it might not be able to keep up. I work for a real estate tech company where we do a lot of address parsing and we had to move away from it because it’s just not quite powerful enough to handle all of the edge cases you find in US addresses. Right now we use SmartyStreets because their address parser is a bit better for our use case. Libpostal is a great general purpose library but depending on what level of accuracy you need, you might have to look for alternatives.

I spent time trying to use libpostal and build USPS address normalization rules on top of it but there are so many edge cases it was more cost effective to just purchase a solution from a vendor.

That is not to take away from this project — it’s quite good for a broad set of addresses across the world — but for narrow use cases such as ours it just couldn’t quite cut it.


Nice work, although it is a bit slow (avg. 5 seconds per address parsed):

    date && perl geo.pl && date
    Sat Dec 29 16:18:30 UTC 2018
    country => united kingdom
    suburb => shoreditch
    house => the book club
    city => london
    postcode => ec2a 4rh
    road => leonard st
    house_number => 100-106
    Sat Dec 29 16:18:35 UTC 2018


It seems to have a whole lot of data that would probably need to be loaded. Maybe a lot of that time is spent in initialization. Is it faster at parsing a second, third, etc. address once loaded?

By the way, a more convenient way to benchmark Perl:

    perl -MBenchmark -e 'timethis(500, sub { ... your code here ... });'


You are correct. Only the first request takes about 4-5 seconds:

Start libpostal: Chong Co Thai Restaurant and Bar Shop 0039A Grand Central Shopping Centre 1-7 Dent St Toowoomba QLD 4350

house => chong co thai restaurant and bar shop 0039a grand central shopping centre

city => toowoomba

postcode => 4350

road => dent st

state => qld

house_number => 1-7

1: 4.10454607009888 seconds

Start libpostal: Little Plate Shop 9 11 Deodar Drive Burliegh Heads QLD 4220

house => little plate shop

city => heads

postcode => 4220

road => deodar drive burliegh

state => qld

house_number => 9 11

2: 0.000234127044677734 seconds

Start libpostal: Sheoak Shack Gallery Cafe 64 Fingal Rd Fingal Head NSW 2487

suburb => fingal head

house => sheoak shack gallery cafe

postcode => 2487

road => fingal rd

state => nsw

house_number => 64

3: 0.000188827514648438 seconds

Start libpostal: Chong Co Thai Restaurant and Bar Shop 0039A Grand Central Shopping Centre 1-7 Dent St Toowoomba QLD 4350

house => chong co thai restaurant and bar shop 0039a grand central shopping centre

city => toowoomba

postcode => 4350

road => dent st

state => qld

house_number => 1-7

4: 0.000257015228271484 seconds
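The timings above show a one-time setup cost followed by sub-millisecond parses. The pattern can be sketched with a memoized loader; the sleep here is a stand-in for loading libpostal's data files, not its real setup call:

```python
import functools
import time

# Sketch of the pattern the timings above suggest: pay the model-load
# cost once, then reuse the parser. The sleep is a stand-in for loading
# libpostal's data files, not its real setup call.
@functools.lru_cache(maxsize=None)
def get_parser():
    time.sleep(0.1)  # simulate one-time initialization
    return lambda addr: addr.lower().split()

t0 = time.perf_counter()
get_parser()("64 Fingal Rd")
first = time.perf_counter() - t0

t0 = time.perf_counter()
get_parser()("64 Fingal Rd")
second = time.perf_counter() - t0

print(first > second)  # True: only the first call pays the setup cost
```

In a long-running service the load cost amortizes away; in a one-shot CLI invocation you pay it every time, which is what the 4-5 second first request reflects.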


Does this use Frank's Compulsive Guide to Postal Addresses [1] as its inspiration?

1. http://www.columbia.edu/~fdc/postal/


From the linked post

> Street addresses are among the more quirky artifacts of human language, yet they are crucial to the increasing number of applications involving maps and location.

The main goal seems to be positioning a point on a map.

As pointed out by the other comments, it’s fairly different from dealing with delivery addresses or legal addresses.

In particular, it means that parsing locations inside buildings (i.e. "3 appt of 2nd floor", "Building 103 - code 17234, 34 foobar street"), with random info for humans baked in, could easily trip it up, and such addresses aren't expected to work properly either.

Still looks like a pretty ambitious and interesting effort.


It explicitly is not a geocoder (which is address->location), it just parses and normalizes addresses.

It's meant to deal with something like "3 appt of 2nd floor", parsing and tagging "3 appt" as unit=apt. 3 and "of 2nd floor" as level=2, even if that string is mixed in with further info like street, city, and so on.
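A toy illustration of that tagging (the unit/level label names follow libpostal's documentation, but the regex parser below is purely illustrative; libpostal's real parser is a trained statistical model, not a pair of regexes):

```python
import re

# Toy illustration only: tag unit/level tokens the way libpostal labels
# them. libpostal's real parser is a trained statistical model, not a
# pair of regexes.
def tag_unit_level(text):
    tags = {}
    m = re.search(r"(\d+)\s*appt", text, re.I)
    if m:
        tags["unit"] = m.group(1)
    m = re.search(r"(\d+)(?:st|nd|rd|th)\s+floor", text, re.I)
    if m:
        tags["level"] = m.group(1)
    return tags

print(tag_unit_level("3 appt of 2nd floor"))  # {'unit': '3', 'level': '2'}
```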


> The main goal seems to be positionning a point on a map.

It says "Actually geocoding addresses to a lat/lon" in the "Non-goals" section of README.md.



I'm amazed that someone wrote something that's mostly string processing in pure C at this late date.


Building strings in C can be painful, but parsing them is not too bad. I suppose security is another big concern, especially with something built explicitly to process user-supplied data.


It's fast and you can make bindings for it to any other language extremely easily.


The libpostal developers have released bindings for a number of different languages, which can be found at the Github organization page: https://github.com/openvenues


> Street addresses

Well, this already fails for places that don't address by street. You might think it's only some pre-industrial villages in the jungle, but examples would be some eastern European countries and Japan: some (but not all) buildings simply don't have a street address. Instead they have a number within a district. But sometimes it's a building number on a street, distinct from the street's own numbering system, so "Building 5 on st. Foo" can be a different address from "5th Foo st.", where there is a completely different building. And of course there's no number on Foo st. that corresponds to Building 5.

Another fun case is when there's a district Foo and a street Foo and e.g. Google Maps resolves "district Foo, building 5" as "No. 5, Foo st.". Or when the district has a number in it, so "district Foo 3, building 275" resolves to "district Foo, building 3", because of course the first Foo doesn't have a number in it: there's no Foo 1, only Foo, Foo 2, etc.

Generally, all residential buildings built by the communist regime follow that system, while older buildings follow street numbers. OpenStreetMap actually deals amazingly well with our addresses, while Google Maps fails miserably most of the time. This is starting to become a problem as online services here integrate Google's mapping technology; e.g. an app for hailing taxis will ask you to type in your starting and destination addresses, and if Google can't make sense of them, the taxi can go to some completely wrong place. I can deal with it fine since I live here, but woe to any foreigner who relies on Google Maps.

The postal system works just fine, but sometimes I have to enter a district name in the street field of online forms. As long as it arrives in the right country, the postal workers here can make sense of the address just fine.

All of these shenanigans do work in a hierarchical way, so you can pretty much expect to always have City, City sub-unit, Building designator as your address schema, but the actual category of City sub-unit and Building designator is sometimes "Street/number", sometimes "District/building number". You can of course simply ignore that and not have your system work in weird places, but if you're making a library for wide use and publishing it, I would appreciate it if you take into account that not everybody addresses by street/number.
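The hierarchical schema described above can be sketched as a record whose middle levels each carry a category tag, since they may be street/number or district/building-number (all names below are invented for illustration):

```python
from dataclasses import dataclass

# Illustrative schema for the hierarchy described above: the sub-unit
# and building levels each carry a category tag, since they may be
# street/number or district/building-number. Names are invented.
@dataclass
class Address:
    city: str
    subunit_kind: str    # "street" or "district"
    subunit: str
    building_kind: str   # "street_number" or "building_number"
    building: str

# Building 275 in district "Foo 3" vs. No. 5 on "Foo" street: same city,
# two distinct addressing systems that must not be conflated.
a = Address("Anytown", "district", "Foo 3", "building_number", "275")
b = Address("Anytown", "street", "Foo", "street_number", "5")
print(a.subunit_kind != b.subunit_kind)  # True
```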


It worked quite well for fairly approximate things like "Tetuán, Madrid, España" (district, city, country). However, it seemed to have a tendency to tag suburbs and districts as houses, at least with the handful of Madrid addresses I happened to have at hand.

If you have a use case for it and data to match, why not give it a try?


If you watch the GIF, the library parses Japanese addresses just fine.



