Libpostal: international street address parsing in C trained on OpenStreetMap (mapzen.com)
74 points by riordan on Feb 25, 2016 | 7 comments

An amazing effort, many congrats to all involved.

I work on the address-formatting project, one small piece of the many used here. We currently have formatting rules for 93% of the world's 249 territories (as defined by ISO 3166-1 alpha-2 codes), but we need help to finish things out, especially from people with local knowledge and native speakers. Even for the countries we've "finished", more tests are always useful.

Here's the repo if you'd like to get involved: https://github.com/OpenCageData/address-formatting

Here's a post I did a week ago on the regions we need help with, though since then we've started making good progress on Arabic-speaking countries. http://blog.opencagedata.com/post/138991962708/an-update-on-...

Feel free to ping me if you'd like to get involved. Thanks.

Totally; on a number of occasions, Al's talked about how address-formatting was a big underpinning of this effort. I'd be happy to help out with some of those other regions!

I'm pretty excited that this is the first time someone has been able to release into the open an address analysis engine like the one that underpins Google's geocoder. You'd be surprised just how hand-tuned most of the other address parsing engines are (regex and case statements all the way down). This feels like a huge leap forward.

As someone who works on a geocoder, no, I am not surprised how hand-tuned things are. I'm guilty of such sins myself. Clearly this is the way forward.

Feel free to help out on address-formatting. Where we really need help is East Asian countries with double-byte character scripts. Specifically CN, HK, JP, KP, KR, MO, and TW. Those countries obviously represent a significant chunk of the world's population, so it would be great if we could get help on those from folks with local knowledge.

I have a project with similar aims: https://github.com/commerceguys/addressing

It uses the Google dataset, which is in the public domain. Might make sense to compare formats?

Btw, I love your worldwide.yaml, both the deduplication and the redirection for subterritories (such as Vatican City).
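For readers who haven't opened the file, here is a rough Python sketch of what deduplication (several countries sharing one template) and redirection (a subterritory reusing another country's rules) could look like. The rule table, the key names `use_template`, `use_country` and `change_country`, and the template text are all assumptions made up for illustration; check worldwide.yaml itself for the real structure.

```python
# Hypothetical rule table for illustration only. Key names and templates
# are assumptions, not copied from the address-formatting project.
RULES = {
    # One shared template, referenced by several countries (deduplication).
    "generic1": {"template": "{road} {house_number}\n{postcode} {city}\n{country}"},
    "IT": {"use_template": "generic1"},
    "DE": {"use_template": "generic1"},
    # A subterritory that reuses another country's rules (redirection)
    # while overriding the country name that gets printed.
    "VA": {"use_country": "IT", "change_country": "Vatican City"},
}


def resolve(country_code, components):
    """Follow redirects, then return (template, components) to render."""
    rule = RULES[country_code]
    comps = dict(components)  # do not mutate the caller's dict
    seen = {country_code}
    while "use_country" in rule:
        if "change_country" in rule:
            comps["country"] = rule["change_country"]
        target = rule["use_country"]
        if target in seen:
            raise ValueError("redirect loop at " + target)
        seen.add(target)
        rule = RULES[target]
    if "use_template" in rule:
        rule = RULES[rule["use_template"]]
    return rule["template"], comps


def format_address(country_code, components):
    """Render an address; raises KeyError if a component is missing."""
    template, comps = resolve(country_code, components)
    return template.format(**comps)
```

Calling `format_address("VA", {...})` would render with the template IT points to, but with the country line replaced by "Vatican City". A real formatter would also need to cope with missing components, which this sketch does not attempt.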

Was planning to use the same dataset too in my Python module: https://github.com/scaleway/postal-address

And thanks @bojanz for asking Google about their data's license! :)

Haven't looked in detail at your addressing PHP module or even Libpostal, but I feel like there should be some ways to deduplicate efforts and converge all datasets. Both for testing and i18n/l10n.

In the meantime, OpenCageData's address-formatting language-neutral YAML structure seems quite nice.

Wow, great stuff, hadn't seen your project (or the Google dataset), thanks for making me aware. There are a few others as well that are programming language specific. Our goal from the beginning was to make templates that are language agnostic.

The big challenge I think is the conflicting use cases between "official" postal format of a country and trying to represent an address in a way that makes sense to users - especially when you only have limited data available (for example when using a datasource like OpenStreetMap where you are at the whim of what the mapper decided to add). Our project isn't about forming perfect postal addresses for things like printing labels and such, it's about taking the real world data in OSM and making it look reasonable. As an example one of the next things I want to add is basic rules about postal codes so we can catch garbage that comes in when mappers mistakenly put the town name in the post code tag and such.
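To make the postal code idea concrete, here is a minimal Python sketch of the kind of check described above: drop a postcode value that cannot be a postcode for the given country, such as a town name typed into the wrong tag. The pattern table and the function name `clean_postcode` are hypothetical, and the three patterns are simplified; this is not how the project implements (or will implement) it.

```python
import re

# Hypothetical, simplified per-country patterns for illustration only.
POSTCODE_PATTERNS = {
    "DE": re.compile(r"\d{5}"),            # five digits
    "US": re.compile(r"\d{5}(-\d{4})?"),   # ZIP, optionally ZIP+4
    "NL": re.compile(r"\d{4} ?[A-Z]{2}"),  # four digits plus two letters
}


def clean_postcode(components, country_code):
    """Return a copy of components with an implausible postcode removed."""
    cleaned = dict(components)
    postcode = cleaned.get("postcode")
    pattern = POSTCODE_PATTERNS.get(country_code.upper())
    if postcode is None or pattern is None:
        # Nothing to check, or no rule known for this country: leave as-is.
        return cleaned
    if not pattern.fullmatch(postcode.strip().upper()):
        # Value cannot be a postcode here (e.g. a town name), so drop it.
        del cleaned["postcode"]
    return cleaned
```

With these assumed rules, `clean_postcode({"city": "Berlin", "postcode": "Berlin"}, "DE")` drops the bogus postcode and keeps the city, while a value like "10117" passes through untouched. Countries without a rule are left alone rather than guessed at.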

Will definitely take you up on the offer of comparing formats. Any further feedback you have would be really useful; you've obviously spent a lot of time thinking about this space.

Another open training dataset: https://openaddresses.io/
